
  • Cross-Platform Heterogeneous Runtime Environment

    A Dissertation Presented

    by

    Enqiang Sun

    to

    The Department of Electrical and Computer Engineering

    in partial fulfillment of the requirements

    for the degree of

    Doctor of Philosophy

    in

    Computer Engineering

    Northeastern University

    Boston, Massachusetts

    April 2016

  • To my family.


  • Contents

    List of Figures iv

    List of Tables vi

    Acknowledgments vii

    Abstract of the Dissertation viii

    1 Introduction 1
      1.1 A Brief History of Heterogeneous Computing 4
      1.2 Heterogeneous Computing with OpenCL 5
      1.3 Task-level Parallelism across Platforms with Multiple Computing Devices 5
      1.4 Scope and Contribution of This Thesis 7
      1.5 Organization of This Thesis 8

    2 Background and Related Work 9
      2.1 From Serial to Parallel 9
      2.2 Many-Core Architecture 10
      2.3 Programming Paradigms for Many Core Architecture 14
        2.3.1 Pthreads 14
        2.3.2 OpenMP 15
        2.3.3 MPI 15
        2.3.4 Hadoop MapReduce 15
      2.4 Computing with Graphic Processing Units 16
        2.4.1 The Emergence of Programmable Graphics Hardware 16
        2.4.2 General Purpose GPUs 17
      2.5 OpenCL 19
        2.5.1 An OpenCL Platform 20
        2.5.2 OpenCL Execution Model 20
        2.5.3 OpenCL Memory Model 27
      2.6 Heterogeneous Computing 28
        2.6.1 Discussion 36
      2.7 SURF in OpenCL 36
      2.8 Monte Carlo Extreme in OpenCL 37

    3 Cross-platform Heterogeneous Runtime Environment 39
      3.1 Limitations of the OpenCL Command-Queue Approach 39
        3.1.1 Working with Multiple Devices 39
      3.2 The Task Queuing Execution System 40
        3.2.1 Work Units 42
        3.2.2 Work Pools 42
        3.2.3 Common Runtime Layer 44
        3.2.4 Resource Manager 44
        3.2.5 Scheduler 45
        3.2.6 Task-Queuing API 49

    4 Experimental Results 50
      4.1 Experimental Environment 50
      4.2 Static Workload Balancing 51
        4.2.1 Performance Opportunities on a Single GPU Device 51
        4.2.2 Heterogeneous Platform with Multiple GPU Devices 52
        4.2.3 Heterogeneous Platform with CPU and GPU (APU) Device 56
      4.3 Design Space Exploration for Flexible Workload Balancing 59
        4.3.1 Synthetic Workload Generator 59
        4.3.2 Dynamic Workload Balancing 59
        4.3.3 Workload Balancing with Irregular Work Units 63
      4.4 Cross-Platform Heterogeneous Execution of clSURF and MCXCL 64
      4.5 Productivity 67

    5 Summary and Conclusions 69
      5.1 Portable Execution across Platforms 69
      5.2 Dynamic Workload Balancing 69
      5.3 APIs to Expose Both Task-level and Data-level Parallelism 70
      5.4 Future Research Directions 70
        5.4.1 Including Flexible Workload Balancing Schemes 70
        5.4.2 Running specific kernels on the best computing devices 71
        5.4.3 Prediction of data locality 71

    Bibliography 73

  • List of Figures

    1.1 Multi-core CPU, GPU, and Heterogeneous System-on-Chip CPU and GPU. At present, designers are able to make decisions among diverse architecture choices: homogeneous multi-core with cores of various sizes and complexity, or heterogeneous system-on-chip architectures. 3
    2.1 Intel Processors Introduction Trends 10
    2.2 Multi-core Processors with Shared Memory 11
    2.3 Intel's TeraFlops Architecture 12
    2.4 Intel's Xeon Phi Architecture 13
    2.5 IBM's Cell 14
    2.6 High Level Block Diagram of a GPU 18
    2.7 OpenCL Architecture 19
    2.8 An OpenCL Platform 20
    2.9 The OpenCL Execution Model 21
    2.10 OpenCL work-items mapping to GPU devices. 23
    2.11 OpenCL work-items mapping to CPU devices. 24
    2.12 The OpenCL Memory Hierarchy. 28
    2.13 Qilin Software Architecture 29
    2.14 The OpenCL environment with the IBM OpenCL common runtime. 30
    2.15 Maestro's Optimization Flow 34
    2.16 Symphony Overview 35
    2.17 The Program Flow of clSURF. 37
    2.18 Block Diagram of the Parallel Monte Carlo simulation for photon migration. 38
    3.1 Distributing work units from work pools to multiple devices. 41
    3.2 CPU and GPU execution 43
    3.3 An example execution of vector addition on multiple devices with different processing capabilities. 46
    4.1 The performance of our work pool implementation on a single device: One Work Pool. 53
    4.2 The performance of our work pool implementation on a single device: Two Work Pools. 54
    4.3 Load balancing on dual devices: V9800P and HD6970. 55
    4.4 Load balancing on dual devices: V9800P and GTX 285. 56
    4.5 Performance assuming different device fission configurations and load balancing schemes between CPU and Fused HD6550D GPU. 57
    4.6 The Load Balancing on dual devices: HD6550D and CPU. 58
    4.7 Performance of different workload balancing schemes on all 3 CPU and GPU devices, an A8-8350 CPU, a V7800 GPU and a HD6550D GPU, as compared to a V7800 GPU device alone. 62
    4.8 Performance of different workload balancing schemes on 1 CPU and 2 GPU devices, an NVS 5400M GPU, a Core i5-3360M CPU and an Intel HD Graphics 4000 GPU, as compared to the NVS 5400M GPU device alone. 63
    4.9 Performance Comparison of clSURF implemented with various workload balancing schemes on the platform with V7800 and HD6550D GPUs. 65
    4.10 Performance Comparison of MCXCL implemented with various workload balancing schemes. 66
    4.11 Number of lines of the source code using our runtime API versus a baseline OpenCL implementation. 68

  • List of Tables

    3.1 Typical memory bandwidth between different processing units for reads. 47
    3.2 Typical memory bandwidth between different processing units for writes. 47
    3.3 The Classes and Methods 47
    4.1 Input sets emphasizing different phases of the SURF algorithm. 52

  • Acknowledgments

    First, I would like to thank my advisor, Prof. David Kaeli, for his insightful and inspiring guidance during the course of my graduate study. I have always enjoyed talking with him on various research topics, and his computer architecture class is one of the best classes I have ever taken.

    The enlightening suggestions from my committee of Prof. Norman Rubin and Prof. Ningfang Mi have been a great help to this thesis. Norm was my mentor when I was doing a 6-month internship at AMD, and that's where this thesis essentially started.

    I would also like to thank Dr. Xinping Zhu, who gave me valuable guidance for my early graduate study. My fellow NUCAR colleagues, Dana Schaa, Byunghyun Jang, Perhaad Mistry, etc., also helped me so much through technical discussions and feedback. If life is a train ride, I cherish every moment and every scene outside of the window we share together.

    My deepest appreciation goes to my family, as it is always where I can put myself together with their endless love. I would like to thank my mom and dad for their consistent support and motivation, and my brother for his advice and encouragement. And finally, but most importantly, I would like to thank my wife and mother of my two kids, Liwei, for her understanding, patience, and faith in me. I couldn't have finished this thesis without her love.

  • Abstract of the Dissertation

    Cross-Platform Heterogeneous Runtime Environment

    by

    Enqiang Sun

    Doctor of Philosophy in Computer Engineering

    Northeastern University, April 2016

    Dr. David Kaeli, Adviser

    Heterogeneous platforms are becoming widely adopted thanks to the support from new programming languages and models. Among these languages/models, OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units on a system. OpenCL is one of the first standards that focuses on portability, allowing programs to be written once and run unmodified on multiple, heterogeneous devices, regardless of vendor.

    While OpenCL has been widely adopted, there still remains a lack of support for automatic workload balancing and data consistency when multiple devices are present in the system. To address this need, we have designed a cross-platform heterogeneous runtime environment which provides a high-level, unified execution model that is coupled with an intelligent resource management facility. The main motivation for developing this runtime environment is to provide OpenCL programmers with a convenient programming paradigm to fully utilize all possible devices in a system and incorporate flexible workload balancing schemes without compromising the user's ability to assign tasks according to data affinity. Our work removes much of the cumbersome initialization of the platform, and now devices and related OpenCL objects are hidden under the hood.

    Equipped with this new runtime environment and associated programming interface, the programmer can focus on designing the application and worry less about customization to the target platform. Further, the programmer can now take advantage of multiple devices using a dynamic workload balancing algorithm to reap the benefits of task-level parallelism.

    To demonstrate the value of this cross-platform heterogeneous runtime environment, we have evaluated it running both micro benchmarks and popular OpenCL benchmark applications. With minimal overhead for managing data objects across devices, we show that we can achieve scalable performance and application speedup as we increase the number of computing devices, without any changes to the program source code.


  • Chapter 1

    Introduction

    Moore's law describes technology advances that double transistor density on integrated circuits every 12 to 18 months [1]. However, with the size of transistors approaching the size of individual atoms, and as power density outpaces current cooling techniques, the end of Moore's law has appeared on the horizon. This has encouraged the research community to look at new solutions in system architecture, including heterogeneous computing architectures.

    Since 2003, the semiconductor industry has settled on three main trends for microprocessor design. The first trend is to continue improving sequential execution speed while increasing the number of cores [2]. Microprocessors of this kind are called multicore processors. An example of a multicore CPU is Intel's widely used Core 2 Duo processor. It has dual processor cores, each of which is an out-of-order, multiple-instruction-issue processor implementing the full x86 instruction set, supporting hyperthreading with two hardware threads, and designed to maximize the execution speed of sequential programs. The second trend focuses more on the execution of parallel applications with as many threads as possible. Processors of this kind are called many-thread processors. Most of the current popular GPUs adopt a many-thread architecture. For example, at full occupancy NVIDIA's GTX 970 can host 26,624 threads, executing in a large number of simple, in-order pipelines. The third trend combines both the multicore and the many-thread architectures. Processors of this kind are represented by most current desktop processors with integrated graphics processing units. For example, Intel's 6th-generation Core i7-6567U processor has a dual-core CPU with an integrated Iris Graphics 550 GPU, which has 72 execution units [3]. AMD's A8-3850 Fusion processor has four x86-64 CPU cores integrated together with a Radeon HD6550D GPU, which has 5 SIMD engines (16-wide) and a total of 400 streaming processors [4].

    With its long history of evolution, the design philosophy of the CPU is to minimize the execution latency of a single thread. Large on-chip caches are integrated to store frequently accessed data, converting some long-latency memory accesses into short-latency cache accesses. There is also prediction logic, such as branch prediction and data prefetching, designed to minimize the effective latency of operations at the cost of increased chip area and power. With all these hardware logic components, the CPU greatly reduces the execution latency of each individual thread. However, the large cache memory, low-latency arithmetic units, and sophisticated prediction logic consume chip area and power that could otherwise be used to provide more arithmetic execution units and memory access channels. This design style emphasizes minimizing latency, and is referred to as latency-oriented design.

    GPUs, either standalone or integrated, on the other hand, are designed as parallel, throughput-oriented computing engines. The application software is expected to be organized with much more data parallelism. The hardware takes advantage of the large number of arithmetic execution units, and pipelines the execution when some of them are waiting for long-latency memory accesses or arithmetic operations. Only a limited amount of cache memory is supplied, to help meet the memory bandwidth requirements of these applications and to facilitate data synchronization between multiple threads that access the same memory data. This design style strives to maximize the total execution throughput of a large amount of data parallelism, while allowing individual threads to take a potentially much longer time to execute.

    GPUs have been leading the race in floating-point performance since 2013. With enough data parallelism and proper memory arrangement, the performance gap can be more than ten times. These are not necessarily the application speeds, but only the raw speed the execution resources can potentially support. For applications that have only one or a few threads, CPUs can achieve much higher performance than GPUs. Therefore, heterogeneous architectures combining CPUs and GPUs are a natural choice for applications, which can execute their sequential parts on the CPU and their numerically intensive parallel parts on the GPU.

    Figure 1.1 is a high-level illustration of a multi-core CPU, a many-thread accelerator GPU, and a heterogeneous system-on-chip architecture with a CPU and GPU on the same die. High-performance computing might emphasize single-threaded latency, whereas commercial transaction processing might emphasize aggregate throughput. Designers began to put these devices with very different characteristics together, and expected a performance gain, leveraging proper workload distribution and balancing.

    [Figure 1.1 panels: a homogeneous multi-core CPU, a homogeneous multi-core GPU, and a heterogeneous System-on-Chip with CPU and GPU.]

    Figure 1.1: Multi-core CPU, GPU, and Heterogeneous System-on-Chip CPU and GPU. At present, designers are able to make decisions among diverse architecture choices: homogeneous multi-core with cores of various sizes and complexity, or heterogeneous system-on-chip architectures.

    Graphics processing units used to be very difficult to program, since programmers had to use the corresponding graphics application programming interfaces. OpenGL and Direct3D are the most widely used graphics API specifications. More precisely, a computation had to be mapped to a graphical function that programs a pixel processing engine so that it could be executed on the early GPUs. These APIs require extensive knowledge of graphics processing and also limited the kinds of applications that one could actually write for early general purpose GPU programming. To quench the increasing demands, new GPU programming paradigms became more and more popular, such as CUDA [5], OpenCL [6], OpenACC [7], and C++ AMP [8]. Many runtime and execution systems have also been designed to help developers manage heterogeneous platforms with multiple computing devices with dramatically different characteristics.

    In this thesis, we present a cross-platform heterogeneous runtime environment, providing a convenient programming interface to fully utilize all possible devices in a heterogeneous system. Our framework incorporates flexible workload balancing schemes without compromising the user's ability to assign tasks according to data affinity. Our framework provides significant enhancements to the state-of-the-art in OpenCL programming practice in terms of workload balancing and distribution. Furthermore, the details of programming the specific platform are hidden from the programmer, enabling the programmer to focus more on the high-level design of the algorithms.

    In this chapter, we present the reader with an introduction to some basic concepts of heterogeneous computing. This includes a very brief history of heterogeneous computing with CPUs and GPUs, the potential benefits that heterogeneous computing provides, and the ability of our runtime framework to adapt applications to heterogeneous computing platforms. Finally, we highlight the contributions of this thesis and outline the organization of the remainder of this thesis.

    1.1 A Brief History of Heterogeneous Computing

    Over the last decade, developers have witnessed the field of computer architecture transitioning from single-core compute devices to a wide range of parallel architectures. The change in architecture has also produced new challenges with the underlying parallel programming paradigms. Existing algorithms designed to scale with single-core systems had to be redesigned to reap the performance benefits of new parallel architectures. Multi-core is the path chosen by the industry to quench the thirst for performance while, at the same time, respecting thermal and power design limits.

    While multi-core processors have ushered in a new era of concurrency, there has also been work on exploiting existing parallel platforms such as GPUs. Since the early 1990s, software architects have explored how best to run general-purpose applications on computer graphics hardware (i.e., GPUs). GPUs were originally designed to execute a set of predefined functions as a graphics rendering pipeline. Even today, GPUs are mainly designed to calculate the color of pixels on the screen to support complex graphics processing functions. GPUs provide deterministic performance when rendering frames. In the beginning of this revolution, GPU programming was done using a graphics Application Programming Interface (API) such as OpenGL [9] or DirectX [10]. This model required general purpose application developers to have intimate knowledge of graphics hardware and graphics APIs. These restrictions severely impacted the implementation of many algorithms on GPUs.

    General purpose GPU (GPGPU) programming was not widely accepted until new GPU architectures unified vertex and pixel processors (first available in the R600 family from AMD and the G80 family from NVIDIA). New general purpose programming languages such as CUDA [5] and Brook+ [11] were introduced in 2006. The introduction of fully programmable hardware and new programming languages lifted many of the restrictions and greatly increased the interest in using GPUs for general purpose computing. Heterogeneous platforms that include GPUs as a powerful data-parallel co-processor have been adopted in many scientific and engineering environments [12] [13] [14] [15]. On current systems, discrete GPUs are connected to the rest of the system through a PCI Express bus. All data transfer between the CPU and GPU is limited by the speed of the PCI Express protocol.

    Recently, industry leaders have recognized that scalar processing on the CPU, combined with parallel processing on the GPU, could be a powerful model for application throughput. More recently, the Heterogeneous System Architecture (HSA) Foundation [16] was founded in 2012 by many vendors. HSA has provided industry with standards to further support heterogeneity across systems and devices. We have also seen that solutions with a CPU and a GPU on the same die, such as AMD's APU [4] series, Intel's Ivy Bridge [17] series, and Qualcomm's Snapdragon [18], have demonstrated potential power/performance savings. Current state-of-the-art supercomputers utilize a heterogeneous solution.

    Heterogeneous systems can be found in every domain of computing, ranging from high-performance computing servers to low-power embedded processors in mobile phones and tablets. Industry and academia are investing huge amounts of effort and budget to improve every aspect of heterogeneous computing [19] [20] [21] [15] [22] [23].

    1.2 Heterogeneous Computing with OpenCL

    The emerging software framework for programming heterogeneous devices is the Open Computing Language (OpenCL) [6]. OpenCL is an open industry standard managed by the non-profit technology consortium Khronos Group. Support for OpenCL has been increasing from major companies such as Qualcomm, AMD, Intel and Imagination.

    The aim of OpenCL is to serve as a universal language for programming heterogeneous platforms such as GPUs, CPUs, DSPs, and FPGAs. In order to support such a wide variety of heterogeneous devices, some elements of the OpenCL API are necessarily low-level. As with the CUDA/C language [5], OpenCL does not provide support for automatic workload balancing, nor does it guarantee global data consistency; it is up to the programmer to explicitly define tasks and enqueue them on devices, and to move data between devices as required. Furthermore, when different implementations of OpenCL produced by different vendors are used, OpenCL objects from vendor A's implementation may not run on vendor B's hardware. Given these limitations, there still remain barriers to achieving straightforward heterogeneous computing.
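    A minimal sketch, assuming a single OpenCL platform and using only standard OpenCL 1.x host API calls, of the per-device setup a programmer must perform before any work can be distributed; partitioning the workload and moving data between the resulting command-queues is left entirely to the application:

    /* Sketch only: enumerate the devices of one platform and give each its own
     * command-queue. Splitting work across these queues, and copying data
     * between devices, must still be coded by hand. */
    #include <CL/cl.h>
    #include <stdio.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id devices[16];
        cl_uint num_devices = 0;
        cl_int err;

        err = clGetPlatformIDs(1, &platform, NULL);
        err |= clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);
        if (num_devices > 16) num_devices = 16;

        /* One context per platform; devices from another vendor's platform
         * would require a second, incompatible context. */
        cl_context ctx = clCreateContext(NULL, num_devices, devices, NULL, NULL, &err);

        /* One command-queue per device: every kernel launch and data transfer
         * must be targeted at a specific queue by the programmer. */
        cl_command_queue queues[16];
        for (cl_uint i = 0; i < num_devices; ++i)
            queues[i] = clCreateCommandQueue(ctx, devices[i], 0, &err);

        printf("created %u command-queues on one platform\n", num_devices);

        for (cl_uint i = 0; i < num_devices; ++i)
            clReleaseCommandQueue(queues[i]);
        clReleaseContext(ctx);
        return 0;
    }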

    1.3 Task-level Parallelism across Platforms with Multiple Computing Devices

    Platform agnosticism is a quality that is taken for granted in many existing programming languages such as C/C++ and Java. Programmers rely on compilers or run-time systems to automatically generate executables for different processing units. Until recently, there did not exist a set of API functions that would enable the programmer to automatically exploit all computing resources when the characteristics of the underlying platform change (e.g., the number of processing units and accelerators).

    To help illustrate some of the challenges with heterogeneous computing, we consider the OpenCL open-source implementation of OpenSURF (Open source Speeded Up Robust Feature) [24] to demonstrate a typical use of the OpenCL programming model. In OpenSURF, the degree of data-parallelism in a single kernel can vary when executing on different computing devices. Execution dynamics are also dependent on the characteristics of the input images or video frames, such as the size and image complexity. Furthermore, without proper runtime management, when mapping to another platform with a different number of devices we usually have to re-design the kernel binding and associated data transfers. Without runtime workload balancing, the additional processing units available on the targeted accelerator may remain idle unless the application is redesigned. Even with the range of parallelism present in OpenSURF, an application has no inherent ability to exploit the extra computing resources, and is not able to improve performance if we upgrade our hardware platform.

    In this thesis we present a cross-platform heterogeneous runtime environment that helps ameliorate many of the burdens faced when performing heterogeneous programming. New programming models such as OpenCL and CUDA provide the ability to dynamically initialize the platforms and objects, and to acquire the processing capability of each device, such as the number of compute units, core frequency, etc. The presented runtime environment augments this ability, and incorporates a central task queuing/scheduling system. This central task queuing system is based on the concepts of work pools and work units, and cooperates with workload balancing algorithms to execute applications on heterogeneous hardware platforms. Using the runtime API, programmers can easily develop and tune flexible workload balancing schemes across different platforms.

    In the proposed runtime environment, data-parallel kernels in an application are wrapped with metadata into work units. These work units are then enqueued into a work pool and assigned to computing devices according to a selectable workload balancing policy. A resource management system is seamlessly integrated in the central task-queuing system to provide for migration of kernels between devices and platforms. We demonstrate the utility of this class of task queuing runtime system by implementing selected benchmark applications from OpenCL benchmark suites. We also benchmark the performance trade-offs by implementing real-world applications such as clSURF [25], an OpenCL open-source implementation of the OpenSURF (Open source Speeded Up Robust Feature) framework, and Monte Carlo Extreme in OpenCL [26], a Monte Carlo simulation for time-resolved photon transport in 3D turbid media.
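    As a hypothetical illustration of this flow, the sketch below wraps kernels into work units, enqueues them into a work pool, and lets a toy scheduler assign them to devices. The names used here (work_unit, work_pool_enqueue, schedule) are illustrative only and are not the runtime's actual API, which is described in Chapter 3:

    /* Hypothetical sketch of the work-unit / work-pool idea. */
    #include <stdio.h>
    #include <stddef.h>

    typedef struct {               /* a data-parallel kernel plus its metadata */
        const char *kernel_name;
        size_t      global_size;   /* ND-range of the kernel */
        int         preferred_dev; /* data-affinity hint, -1 = any device */
    } work_unit;

    #define POOL_CAP 64
    typedef struct {               /* work pool: a queue the scheduler drains */
        work_unit units[POOL_CAP];
        int       count;
    } work_pool;

    static void work_pool_enqueue(work_pool *p, work_unit wu) {
        if (p->count < POOL_CAP) p->units[p->count++] = wu;
    }

    /* Toy scheduler: assign units round-robin over the available devices,
     * honouring the affinity hint when one is given. */
    static void schedule(const work_pool *p, int num_devices) {
        for (int i = 0; i < p->count; ++i) {
            int dev = (p->units[i].preferred_dev >= 0)
                          ? p->units[i].preferred_dev
                          : i % num_devices;
            printf("work unit '%s' (%zu work-items) -> device %d\n",
                   p->units[i].kernel_name, p->units[i].global_size, dev);
        }
    }

    int main(void) {
        work_pool pool = { .count = 0 };
        work_pool_enqueue(&pool, (work_unit){ "vector_add", 1 << 20, -1 });
        work_pool_enqueue(&pool, (work_unit){ "build_det",  1 << 18,  1 });
        schedule(&pool, 2);   /* e.g. one CPU device and one GPU device */
        return 0;
    }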

    1.4 Scope and Contribution of This Thesis

    Unlike the ubiquity of the x86 architecture and the long life cycle of CPU designs, GPUs often have much shorter release cycles and ever-changing ISAs and hardware features. Platforms incorporating GPUs as accelerators can have very different configurations in terms of processing capabilities and number of devices. As such, the need has arisen for a programming interface and runtime execution system that allows a single program to be portable across different platforms, and can automatically use all devices supported by an associated workload balancing scheme.

    The key contribution of this thesis is the development of a cross-platform heterogeneous runtime environment, which enables flexible task-level workload balancing on heterogeneous platforms with multiple computing devices. Together with the application programming interface, this extension layer is designed in the form of a library. Different levels of this runtime environment are considered. We study the following aspects of our runtime environment:

    We enable portable execution of applications across platforms. Our runtime environment provides a unified abstraction for all processing units, including CPUs, GPUs and many existing OpenCL devices. With this unified abstraction, tasks are able to be distributed to all devices. An application is portable across different platforms with a variable number of processing units.

    We provide APIs to expose both task-level and data-level parallelism. The program designer is in the best position to identify all levels of parallelism present in his/her application. We provide API functions and a dependency description mechanism so that the programmer can expose task-level parallelism. When combined with the data-level parallelism present in OpenCL kernels, the run-time and/or the compiler can effectively adapt to any type of parallel machine without modification of the source code.

    We balance task execution dynamically based on static and run-time profiling information. The optimal static mapping of task execution onto the underlying platform requires a significant amount of analysis of all of the devices on the platform, and it is impossible for programmers to perform such analysis and remap whenever new hardware is used. A dynamic workload balancing scheme makes it possible for the same source code to obtain portable performance.

    We support the management of data locality at runtime. Due to the data transfer overhead and its impact on the overall performance of OpenCL applications, data locality is an important issue for the portable execution of tasks. In our OpenCL support, data management is tightly integrated with the workload migration decisions. The runtime layer ensures data availability and data coherency throughout the whole system.

    We simplify the initialization of platforms. Scientific programmers usually are not familiar with the best way to initialize platforms across different types of OpenCL devices. With a new API designed for our runtime environment, we shift this burden to the underlying execution system, so that the programmer can focus on the development of his/her algorithms.

    1.5 Organization of This Thesis

    The rest of this thesis is organized as follows. Chapter 2 provides the necessary background on heterogeneous computing and presents a summary of related work on previously proposed runtime environments targeting heterogeneous platforms. In Chapter 3, we describe the structure and components of our cross-platform heterogeneous runtime environment and discuss how it can facilitate more effective use of the resources present on heterogeneous platforms. In Chapter 4, we explore the design space by using our heterogeneous runtime environment equipped with different scheduling schemes when running synthetic workloads. We then demonstrate the true value of our proposed runtime environment by evaluating the performance of benchmark applications run on multiple cross-vendor heterogeneous platforms. We present a detailed analysis of the performance components and demonstrate the programming efficiency. In Chapter 5, we conclude this thesis, summarizing the major contributions embodied in this work, and describe potential directions for future work.

  • Chapter 2

    Background and Related Work

    2.1 From Serial to Parallel

    In July 2004, Intel released a statement that their 4GHz chip, originally targeted for the fourth quarter, would be delayed until the first quarter of the next year. Company spokesman Howard High said the delay would help ensure that the company could deliver high chip quantities when the product was launched. Later, in mid-October, in a surprising announcement, Intel officially abandoned its plans to release the 4GHz version of the processor, and moved its engineers onto other projects. This marked an abrupt end to 34 years of CPU-frequency scaling, during which CPU frequency grew exponentially over time.

    Figure 2.1 illustrates a brief history of Intel processors, plotting the number of transistors per chip and the associated clock speed [27]. As the total number of transistors continued to climb, the clock speed did not keep up. The reason behind this major change in CPU development is power and cooling challenges, more specifically power density. The power density in processors has already exceeded that of a hot plate. Continuing to increase the frequency would require either new cooling technologies or new materials to relax the physical limits of what a processor can withstand. Processor design has hit the power wall. Our ability to improve performance automatically by increasing the frequency of the processor is gone. To further improve application throughput, major silicon vendors elected to provide multi-core designs, providing higher performance within the constraints of thermal limits and power density thresholds.

    [Figure 2.1 plots Intel CPU trends from 1970 to 2010: transistor count (thousands), clock speed (MHz), power (W), and performance per clock (ILP), with data points including the 386, Pentium, Pentium 4, Dual-Core Itanium 2, and Quad-Core Ivy Bridge. Sources: Intel, Wikipedia, K. Olukotun.]

    Figure 2.1: Intel Processors Introduction Trends

    2.2 Many-Core Architecture

    In recent years, multi-core processors have become the norm. Figure 2.2 shows an example of a multi-core processor. A multi-core processor has two or more processing cores on a single chip, each core with its own level-1 cache. The common global memory is shared among the different processing cores, while multiple tasks are executed on the multi-core processor.

    Intel's TeraFlops architecture [28] was designed to demonstrate a prototype of a many-core processor, as shown in Figure 2.3. Developed by Intel Corporation's Tera-Scale Computing Research Program, this research processor contains 80 tiled cores, and can yield 1.8 teraflops at 5.6GHz.

    [Figure 2.2 block diagram: four cores (Core 0 through Core 3), each a CPU with private L1 and L2 caches, sharing an L3 cache and system memory.]

    Figure 2.2: Multi-core Processors with Shared Memory

    While data transfers can occur between any pair of cores, no cache coherency is enforced across cores,

    and all memory transfers are explicit. Therefore, the biggest hurdle to fully take advantage of the

    power of these 80 cores is parallel programming. As shown in Figure 2.3, another interesting point is

    that some dedicated hardware engines could be integrated with some of the cores for multimedia,

    networking, security, and other tasks.

    The Intel Xeon Phi coprocessor [29] inherited many design elements from the Larrabee project [30], another high performance co-processor based on the TeraFlops architecture. The Intel Xeon Phi coprocessor is primarily composed of processing cores, caches, PCIe client logic, and a very high bandwidth, bidirectional ring interconnect, as illustrated in Figure 2.4. Intel is using Xeon Phi as the primary element of its family of Many Integrated Core architectures. Intel revealed its second generation Many Integrated Core architecture in November 2013, with the codename Knights Landing [31]. Knights Landing contains up to 72 cores in 36 tiles manufactured in 14nm technology, with each core running 4 threads. The Knights Landing chip also has a 2MB coherent shared cache between the 2 cores in a tile, which indicates the effort to make this architecture as programmable as possible. Knights Landing is ISA compatible with the Intel Xeon processors, with support for Intel's Advanced Vector Extensions 512, and supports most of today's parallel optimizations. One interesting feature of Knights Landing is that it can serve either as the main processor on a compute node or as a coprocessor in a PCIe slot. Intel is exploring different heterogeneous computing organizations.

    [Figure 2.3 depicts an 80-tile mesh of processing engines (PE), each with a local cache, with dedicated accelerators (HD video, crypto, DSP, GPU, physics) attached to some of the tiles.]

    Figure 2.3: Intel's TeraFlops Architecture

    Another example of a heterogeneous many-core architecture is IBM's Cell processor [32]. It includes a general purpose PowerPC core with 8 very simple SIMD coprocessors, which are specially designed for accelerating vector or multimedia operations. An operating system runs on the main core, which is called the Power Processing Unit (PPU). It functions as a master device controlling the 8 coprocessors, which are called Synergistic Processing Elements (SPEs). Each SPE is a dual-issue, in-order processor composed of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). The Element Interconnect Bus (EIB) is the internal communication bus connecting the various on-chip system elements.

    [Figure 2.4 block diagram: cores with private L2 caches and tag directories (TD) connected by a bidirectional ring interconnect, together with GDDR5 memory controllers and PCIe client logic.]

    Figure 2.4: Intel's Xeon Phi Architecture

    The Cell processor was used as the processor for Sony's PlayStation 3 game console and some high performance computing servers, such as the IBM Roadrunner supercomputer and Mercury System servers with Cell accelerator boards [33].

    By November 2009, IBM discontinued the development of the Cell processor. The Cell

    processor benefits from very high internal memory bandwidth, but all transfers must be explicitly

    programmed by using low-level asynchronous DMA transfers. It requires significant expertise to

    write efficient code for this architecture, especially with the limited size of the local storage on each

    SPU (256 KB). Load balancing is another challenging issue on the Cell. The application programmer

    is responsible for evenly mapping the different pieces of computation on the SPUs.

    Besides the novelty in the hardware design of each of these many-core processors, industry has realized that programmability cannot be overlooked anymore. When the hardware design of these processors reaches such unprecedented complexity, it is impossible for software designers to manage all the processing elements manually. Suitable programming models are desperately needed to exploit the computing power of these architectures.

    [Figure 2.5 block diagram: eight SPUs, each with a local store (LS), and a PPU with its cache, connected by the Element Interconnect Bus (EIB) to RAM.]

    Figure 2.5: IBM's Cell

    2.3 Programming Paradigms for Many Core Architecture

    Given the development of a number of many-core architectures, many parallel programming models have been created to facilitate the use of these architectures.

    2.3.1 Pthreads

    Pthreads, or Portable Operating System Interface (POSIX) Threads, is a set of C programming language types, functions and variables [34]. Pthreads is implemented as a header (pthread.h) and a library, which creates and manages multiple threads. When using Pthreads, the programmer has to explicitly create and destroy threads by making use of pthread API functions.

    The Pthreads library provides mechanisms to synchronize different threads, resolve race conditions, avoid deadlock conditions, and protect critical sections. However, the programmer has the responsibility to manage threads explicitly. Therefore, it is usually very challenging to design a scalable multithreaded application on modern many-core architectures, especially systems with hundreds of cores on a single machine.
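    A minimal Pthreads sketch of this explicit style of thread management, assuming a POSIX system: four worker threads each sum one slice of an array, and a mutex protects the shared total.

    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 4
    #define N_ITEMS   1024

    static int data[N_ITEMS];
    static long total = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long id = (long)arg;                    /* which slice this thread owns */
        long partial = 0;
        for (int i = id * (N_ITEMS / N_THREADS);
             i < (id + 1) * (N_ITEMS / N_THREADS); ++i)
            partial += data[i];
        pthread_mutex_lock(&lock);              /* critical section */
        total += partial;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t threads[N_THREADS];
        for (int i = 0; i < N_ITEMS; ++i) data[i] = 1;
        for (long t = 0; t < N_THREADS; ++t)    /* explicit thread creation */
            pthread_create(&threads[t], NULL, worker, (void *)t);
        for (int t = 0; t < N_THREADS; ++t)     /* explicit join */
            pthread_join(threads[t], NULL);
        printf("total = %ld\n", total);         /* prints 1024 */
        return 0;
    }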


    2.3.2 OpenMP

    OpenMP is an open specification for shared memory parallelism [35] [36]. It comprises compiler directives, callable runtime library routines and environment variables which extend FORTRAN, C and C++ programs. OpenMP is portable across shared memory architectures. The thread management is implicit, and the programmer has to use special directives to specify the sections of code that are to be run in parallel. The number of threads to be used is specified by environment variables. OpenMP has also been extended as a parallel programming model for clusters.

    OpenMP uses several constructs to support implicit synchronization, so that the programmer is relieved from worrying about the actual synchronization mechanism.

    As with Pthreads, scalability is still an issue for OpenMP, as it is a thread-based mechanism. Furthermore, since OpenMP uses implicit thread management, there is no fine-grained way to do thread-to-processor mapping.
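    A minimal OpenMP sketch of this implicit model: a single directive marks each loop as parallel and the runtime creates and manages the threads (compile with -fopenmp or the vendor's equivalent flag).

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        const int n = 1 << 20;
        static double a[1 << 20];
        double sum = 0.0;

        #pragma omp parallel for                  /* implicit thread management */
        for (int i = 0; i < n; ++i)
            a[i] = 0.5 * i;

        #pragma omp parallel for reduction(+:sum) /* implicit synchronization */
        for (int i = 0; i < n; ++i)
            sum += a[i];

        printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }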

    2.3.3 MPI

    The Message Passing Interface (MPI) [37] provides a virtual topology, synchronization,

    and communication functionality between nodes in clusters. It is a natural candidate for accelerating

    applications in distributed systems. MPI is currently the most widely used standard for developing

    High Performance Computing (HPC) applications for distributed memory architectures. It provides

    programming interfaces for C, C++, and FORTRAN. Some of the well-known MPI implementations

    include OpenMPI [38], MVAPICH [39], MPICH [40], GridMPI [41], and LAM/MPI [42].

    Similar to Pthreads, workload partitioning and task mapping have to be done by the programmer, but message passing is a convenient way to express data transfer between different processors. MPI barriers are used to specify that synchronization is needed. The barrier operation blocks each process from continuing its execution until all processes have entered the barrier. A typical usage of barriers is to ensure that the global data has been dispersed to the appropriate processes.
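    A minimal MPI sketch of this pattern: the root rank broadcasts global data and a barrier ensures every process has received it before continuing.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        int config[4] = {0, 0, 0, 0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) { config[0] = 42; }         /* root prepares global data */
        MPI_Bcast(config, 4, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);               /* no rank continues until all
                                                      have the broadcast data   */
        printf("rank %d of %d sees config[0] = %d\n", rank, size, config[0]);

        MPI_Finalize();
        return 0;
    }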

    2.3.4 Hadoop MapReduce

    Hadoop MapReduce is a software framework for easily developing parallel applications, and is especially well suited for processing vast amounts of data (e.g., multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware, in a reliable, fault-tolerant manner [43]. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

    Typically the compute nodes and the storage nodes are the same. For example, the

    MapReduce framework and the Hadoop Distributed File System are running on the same set of nodes.

    This configuration allows the framework to effectively schedule tasks on the nodes where data is

    already present, resulting in very high aggregate bandwidth across the cluster [44].

    2.4 Computing with Graphic Processing Units

    Although multicore architectures have made it possible for applications to overcome some of the physical limits encountered with purely sequential architectures, their degree of parallelization is not comparable to the parallelism available on graphics processing units (GPUs). Intrinsically, GPUs are designed for highly parallel problems. With more and more complex graphics problems, new architectures and APIs were created, and GPUs became more and more programmable. In this section, we briefly review the current state of GPU computing, and the GPU's transition from a hardware implementation of standard graphics APIs to a fully programmable general purpose processing unit.

    2.4.1 The Emergence of Programmable Graphics Hardware

    Interactive 3D graphics applications have very different characteristics compared to general-purpose applications. Specifically, interactive 3D applications require high throughput and exhibit substantial parallelism. Since the late 1990s, custom hardware has been built to take advantage of the native parallelism in these applications. Those early custom accelerators were designed in the form of fixed-function pipelines based on a hardware implementation of the OpenGL [9] standard and Microsoft's DirectX programming APIs. At each stage of the pipeline, a sequence of different operations was implemented in hardware units for specific tasks.

    Given that the GPU was originally designed to produce visual realism in rendered images, fixed-function pipeline graphics hardware has some limitations in performing efficiently. In the meantime, offline rendering systems such as Pixar's RenderMan [45] could be used to achieve impressive visual results by replacing the fixed-function pipeline with a more flexible programmable pipeline. In the programmable pipeline, fixed-function operations are replaced by user-provided pieces of code called shaders. Pixel shaders, vertex shaders and geometry shaders were introduced to enable flexible processing at each programmable pipeline stage.

    Initially, in early shader models, vertex and pixel shaders were implemented using very different instruction sets. But later, in 2006, OpenGL's Unified Shader Model and DirectX 10's Shader Model 4.0 provided consistent instruction sets across all shader types: geometry, vertex and pixel shaders. All three types of shaders have almost the same capabilities. For example, they can perform the same set of arithmetic instructions and read from texture or data buffers.

    Graphics hardware designers continued to explore the best ISA for the shader models. Before the unified shader model, ATI's Xenos graphics chip integrated in the Xbox 360 used a unified shader architecture. Most designs continued to build dedicated hardware units for each shader type, even though they supported a unified shader model. But eventually, all major GPU makers chose a Unified Shader Architecture, which allows a single type of processing unit to be used for all types of shaders. The Unified Shader Architecture decouples the type of shader from the processing unit, and allows a dynamic assignment of shaders to the different processing cores. This flexibility leads to better workload balance, allowing hardware resources to be allocated dynamically for different types of shaders, based on the needs of the workload.

    Figure 2.6 is a high-level block diagram of a modern GPU architecture.

    2.4.2 General Purpose GPUs

    With the emergence of programmable graphics hardware, new shader languages and programming APIs were created to facilitate the programming effort. Since DirectX 9, Microsoft has been using the High Level Shading Language (HLSL) [46], which supports shader construction with C-like syntax, types, expressions, statements and functions. Similarly, the OpenGL Shading Language (GLSL) [47] is the corresponding high level language targeting OpenGL shader programs. Nvidia's Cg [48] is a collaborative effort with Microsoft. The Cg compiler outputs both DirectX and OpenGL shader programs. Although these shader languages are very popular across the graphics community, mainstream programmers feel a lack of connection between the graphics primitives in these shader languages and the constructs in general purpose programming languages.

    With the introduction of unified shader architectures and unified shader models, a uniform ISA makes it easier to design high-level languages for this workload. Some examples of these higher level languages include Brook [49], Scott [50], Glift [51], Nvidia's CUDA [5] and the Khronos Group's OpenCL [6], which is an extension of Brook.

    [Figure 2.6: shader cores connected through an interconnection network to an L2 cache and global memory.]

    Figure 2.6: High Level Block Diagram of a GPU

    These high-level languages hide the graphics primitives behind programming constructs which are more familiar to general purpose programmers. The availability of CUDA and OpenCL, currently the two most popular languages, has dramatically increased the programmability of GPU hardware. As a result, GPUs have been widely adopted in many general purpose platforms for executing data-parallel, computationally-intensive workloads [52]. Many key applications possessing a high degree of data-level parallelism have been successfully accelerated using GPUs.

    GPUs have been included in the standard configuration for many desktop machines and servers. The availability of high-level languages has allowed industry to support both graphics and compute on the same GPU. According to the 42nd TOP500 list, GPUs are used in the No. 2 and No. 6 fastest supercomputers in the world [53]. Intel Xeon Phi processors are used in the No. 1 and No. 7 fastest supercomputers in the world. A total of fifty-three systems on the list use accelerator/co-processor technology. Thirty-eight of these systems use NVIDIA GPU chips, two use ATI Radeon, and there are now thirteen systems with Intel MIC technology (Xeon Phi).

    [Figure 2.7 layers, top to bottom: the application and its OpenCL kernels; the OpenCL framework (the OpenCL API and the OpenCL C language); the OpenCL runtime; the OpenCL driver; and the GPU hardware.]

    Figure 2.7: OpenCL Architecture

    2.5 OpenCL

    OpenCL (Open Computing Language) is an open standard for general purpose parallel programming on CPUs, GPUs and other processors, giving software developers portable and efficient access to the computing resources on these heterogeneous processing platforms [54]. OpenCL allows a heterogeneous platform to be viewed as a single platform with multiple computing devices. It is a mature framework that includes a language definition, a set of APIs, compiler libraries, and a runtime system to support software development. Figure 2.7 shows a high-level breakdown of the OpenCL architecture.

    [Figure 2.8: a host connected to compute devices, each composed of compute units, which in turn contain processing elements.]

    Figure 2.8: An OpenCL Platform

    2.5.1 An OpenCL Platform

    The OpenCL framework adopts the concept of a platform, which consists of a host interconnected with multiple OpenCL devices [55]. An OpenCL device can be a CPU, a GPU or any type of processing unit which supports the OpenCL standard. An OpenCL device can be divided into one or more compute units (CUs), and a CU can be further divided into one or more processing elements (PEs). Figure 2.8 shows how the OpenCL standard hierarchically describes a heterogeneous platform with multiple OpenCL devices, multiple CUs and multiple PEs.
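    A short sketch that walks this hierarchy using standard OpenCL queries; each platform exposes devices, and each device reports its number of compute units (the individual processing elements inside a CU are not enumerated through the API):

    #include <CL/cl.h>
    #include <stdio.h>

    int main(void) {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint p = 0; p < num_platforms; ++p) {
            cl_device_id devices[16];
            cl_uint num_devices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);

            for (cl_uint d = 0; d < num_devices && d < 16; ++d) {
                char name[256];
                cl_uint cus = 0;
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                                sizeof(cus), &cus, NULL);
                printf("platform %u, device %u: %s (%u compute units)\n",
                       p, d, name, cus);
            }
        }
        return 0;
    }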

    2.5.2 OpenCL Execution Model

    The execution model of OpenCL consists of two parts: a host program running on the host

    device, setting up data and scheduling execution on a compute device, and kernels executed on one

    or more OpenCL devices [56]. Figure 2.9 shows the OpenCL execution model.

    An OpenCL command queue is where the host interacts with an OpenCL device by queuing

    computation kernels. Each command-queue is associated with a single device. There are three types

    of commands in a command-queue:

Figure 2.9: The OpenCL Execution Model

Kernel-enqueue commands: Enqueue a kernel for execution on a device.

Memory commands: Transfer data between the host and device memory or between memory objects, or map and unmap memory objects from the host address space.

Synchronization commands: Explicit synchronization points that define ordering constraints between commands (a brief host-code sketch illustrating all three command types follows this list).
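The fragment below is such a sketch. It assumes that a command-queue, a device buffer, and a kernel object (here named queue, buf, and k purely for illustration) have already been created, and it enqueues one command of each type.

#include <CL/cl.h>

/* Illustrative fragment: one memory command, one kernel-enqueue command, and
   one synchronization command on a single command-queue. */
void enqueue_three_command_types(cl_command_queue queue, cl_mem buf, cl_kernel k,
                                 const float *host_data, size_t n) {
    /* Memory command: copy host data into the device buffer (non-blocking). */
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, n * sizeof(float),
                         host_data, 0, NULL, NULL);

    /* Kernel-enqueue command: launch kernel k over a one-dimensional NDRange. */
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Synchronization command: all commands enqueued before this barrier must
       complete before any command enqueued after it may launch. */
    clEnqueueBarrierWithWaitList(queue, 0, NULL, NULL);
}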

    Commands communicate their status through Event objects. Successful completion is

indicated by setting the event status to CL_COMPLETE. Unsuccessful completion results in abnormal termination of the command, which is indicated by setting the event status to a negative value. In this case, the command-queue associated with the abnormally terminated command and all other command-queues in the same context may no longer be available, and their behavior is implementation-defined.

    A command submitted to a device will not launch until prerequisites that constrain the

    order of commands have been resolved. These prerequisites have two sources. First, they may


    arise from commands submitted to a command-queue that constrain the order that commands are

    launched. For example, commands that follow a command queue barrier will not launch until all

    commands prior to the barrier are complete. The second source of prerequisites is dependencies

    between commands expressed through events. A command may include an optional list of events.

The command will wait and not launch until all the events in the list are in the CL_COMPLETE state.

    Using this mechanism, event objects define ordering constraints between commands and coordinate

    execution between the host and one or more devices [54]. In our cross-platform runtime system,

    we expand this mechanism to support dependencies between events across OpenCL devices from

    different vendors.
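The fragment below sketches this event mechanism between two command-queues in the same context: a kernel enqueued on one queue lists, in its event wait list, the event produced by a buffer write on another queue, and therefore does not launch until that write reaches CL_COMPLETE. The handles (queue_a, queue_b, buf, k) are placeholders assumed to have been created earlier. Within a single context this behavior is provided by the standard OpenCL API; extending it across devices managed by different vendor platforms is what our runtime adds.

#include <CL/cl.h>

void cross_queue_dependency(cl_command_queue queue_a, cl_command_queue queue_b,
                            cl_mem buf, cl_kernel k,
                            const float *host_data, size_t n) {
    cl_event write_done;

    /* Producer: non-blocking write on queue_a, signalling write_done on completion. */
    clEnqueueWriteBuffer(queue_a, buf, CL_FALSE, 0, n * sizeof(float),
                         host_data, 0, NULL, &write_done);

    /* Consumer: the kernel on queue_b lists write_done as a prerequisite, so it
       will not launch until the write has completed. */
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue_b, k, 1, NULL, &n, NULL, 1, &write_done, NULL);

    clReleaseEvent(write_done);
}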

    A command may be submitted to a device, and yet there may be no visible side effects

except to wait on and satisfy event dependencies. Examples include markers, kernels executed over ranges containing no work-items, and copy operations of zero size. Such commands may pass directly from the

    ready state to the ended state.

    Command execution can be blocking or non-blocking. Consider a sequence of OpenCL

    commands. For blocking commands, the OpenCL API functions that enqueue commands do not

    return until the command has completed. Alternatively, OpenCL functions that enqueue non-

blocking commands return immediately and require that the programmer define dependencies between

    enqueued commands to ensure that enqueued commands are not launched before needed resources

    are available. In both cases, the actual execution of the command may occur asynchronously with

    execution of the host program.
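The difference is visible in the blocking flag passed to the buffer-read call, as in the short sketch below (the queue and buffer handles are assumed to exist already).

#include <CL/cl.h>

void blocking_vs_nonblocking(cl_command_queue queue, cl_mem buf,
                             float *dst, size_t n) {
    /* Blocking read: the call does not return until dst holds the data. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                        dst, 0, NULL, NULL);

    /* Non-blocking read: the call returns immediately; the host must wait on the
       returned event before it may safely read dst. */
    cl_event read_done;
    clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, n * sizeof(float),
                        dst, 0, NULL, &read_done);
    clWaitForEvents(1, &read_done);
    clReleaseEvent(read_done);
}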

    Multiple command-queues can be present within a single context. Multiple command-

    queues execute commands independently. Event objects visible to the host program can be used to

    define synchronization points between commands in multiple command queues. If such synchroniza-

    tion points are established between commands in multiple command-queues, an implementation must

assure that the command-queues progress concurrently and correctly account for the dependencies

    established by the synchronization points.

    The core of the OpenCL execution model is defined by how the kernels execute. When a

    kernel-enqueue command submits a kernel for execution, an index space is defined. The kernel, the

    argument values associated with the arguments to the kernel, and the parameters that define the index

    space define a kernel-instance. When a kernel-instance executes on a device, the kernel function

    executes for each point in the defined index space. Each of these executing kernel functions is called

    a work-item. The work-items associated with a given kernel-instance are managed by the device in

groups called work-groups. These work-groups define a coarse-grained decomposition of the index


    space. Work-groups are further divided into sub-groups, which provide an additional level of control

    over execution.

Figure 2.10: OpenCL work-items mapping to GPU devices.

    2.5.2.1 Mapping OpenCL Work-items

Each work-item's global ID is an N-dimensional tuple. The components of the global ID are values in the range from the global offset in that dimension to the global offset plus the number of elements in that dimension minus one.

    If a kernel is compiled as an OpenCL 2.0 kernel [20], the size of work-groups in an

    NDRange (the local size) need not be the same for all work-groups. In this case, any single

    dimension for which the global size is not divisible by the local size will be partitioned into two

    regions. One region will have work-groups that have the same number of work items as was specified

    for that dimension by the programmer (the local size). The other region will have work-groups

    with less than the number of work items specified by the local size parameter in that dimension (the

    remainder work-groups). Work-group sizes can be non-uniform in multiple dimensions, potentially

    producing work-groups of up to 4 different sizes in a 2D range and 8 different sizes in a 3D range.

    Each work-item is assigned to a work-group and is given a local ID to represent its position

within the work-group. A work-item's local ID is an N-dimensional tuple with components in the

    range from zero to the size of the work-group in that dimension minus one.


Figure 2.11: OpenCL work-items mapping to CPU devices.

    Work-groups are assigned IDs similarly. The number of work-groups in each dimension

    is not directly defined but is inferred from the local and global NDRanges provided when a kernel

instance is enqueued. A work-group's ID is an N-dimensional tuple with components in the range from zero to the number of work-groups in that dimension minus one, where the number of work-groups is the ceiling of the global size in that dimension divided by the local size in the same dimension. As a result, the combination of a work-group ID and the local ID within a work-group uniquely defines a work-item. Each work-item is identifiable in two ways: in terms of a global index, and in terms of a work-group index plus a local index within a work-group.
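For uniform work-group sizes, the two index schemes are related by a simple identity: in each dimension, the global ID equals the work-group ID times the work-group size, plus the local ID, plus the global offset. The illustrative kernel below checks this identity for dimension 0 using the standard OpenCL C work-item functions (get_global_offset requires OpenCL 1.1 or later).

__kernel void check_index_identity(__global int *ok) {
    size_t gid   = get_global_id(0);
    size_t lid   = get_local_id(0);
    size_t group = get_group_id(0);
    size_t lsize = get_local_size(0);   /* uniform work-group size assumed */
    size_t off   = get_global_offset(0);

    /* Record whether gid == group * lsize + lid + off for this work-item. */
    ok[gid - off] = (gid == group * lsize + lid + off) ? 1 : 0;
}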

    On a CPU device, work-items are mapped by a different mechanism. An example mapping

    of OpenCL execution on a CPU is shown in Figure 2.11. In this example, one worker thread is

    created per physical CPU core when executing a kernel. Then this worker-thread, which is usually a

CPU thread, takes a work-group from the NDRange and begins to execute its associated work-items one by one in sequence. If an OpenCL barrier is reached, the work-item's state is stored and execution of the following work-item begins. When all work-items in this work-group have reached the barrier, execution goes back to the first work-item that stopped at the barrier. It will resume


    execution until the next synchronization point. In the absence of barriers, the first work-item will

    run to the end of the kernel before switching to the next. In both cases, the CPU will continuously

    process all the work-items until the entire work-group is executed. During the whole process, idle

CPU threads will look for any remaining work-groups in the NDRange and begin to process them.

    2.5.2.2 Kernel Execution

    A kernel object is defined to include a function within the program object and a collection

    of arguments connecting the kernel to a set of argument values [57]. The host program enqueues a

    kernel object to the command queue, along with the NDRange and the work-group decomposition.

    These define a kernel instance. In addition, an optional set of events may be defined when the kernel

    is enqueued. The events associated with a particular kernel instance are used to constrain when the

    kernel instance is launched with respect to other commands in the queue or with respect to commands

    in other queues within the same context.

    A kernel instance is submitted to a device. For an in-order command queue, the kernel

    instances appear to launch and then execute in that same order.

    Once these conditions are met, the kernel instance is launched and the work-groups

associated with the kernel instance are placed into a pool of ready-to-execute work-groups. The

    device schedules work-groups from the pool for execution on the compute units of the device. The

    kernel-enqueue command is complete when all work-groups associated with the kernel instance

    end their execution, updates to global memory associated with a command are visible globally, and

    the device signals successful completion by setting the event associated with the kernel-enqueue

command to CL_COMPLETE.
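A kernel-instance is therefore fully described by the kernel object, its argument values, and the NDRange parameters supplied at enqueue time. The sketch below (with placeholder handles) enqueues one such instance and then inspects the associated event, which ends in the CL_COMPLETE state once all of its work-groups have finished and their global-memory updates are visible.

#include <CL/cl.h>

void run_kernel_instance(cl_command_queue queue, cl_kernel k, cl_mem buf,
                         size_t global_size, size_t local_size) {
    cl_event done;
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &global_size, &local_size,
                           0, NULL, &done);

    /* Block until the kernel-enqueue command completes ... */
    clWaitForEvents(1, &done);

    /* ... and read back its execution status explicitly. */
    cl_int status;
    clGetEventInfo(done, CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(status), &status, NULL);
    /* status is CL_COMPLETE on success, or a negative value on abnormal termination. */
    clReleaseEvent(done);
}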

    While a command-queue is associated with only one device, a single device may be

    associated with multiple command-queues. A device may also be associated with command queues

    associated with different contexts within the same platform. The device will pull work-groups

    from the pool and execute them on one or several compute units in any order; possibly interleaving

    execution of work-groups from multiple commands. A conforming implementation may choose

    to serialize the work-groups so a correct algorithm cannot assume that work-groups will execute

    in parallel. There is no safe and portable way to synchronize across the independent execution of

    work-groups since they can execute in any order.

    The work-items within a single sub-group execute concurrently, but not necessarily in

    parallel (i.e., they are not guaranteed to make independent forward progress). Therefore, only


    high-level synchronization constructs (e.g. sub-group functions such as barriers) that apply to all the

    work-items in a sub-group are well defined and included in OpenCL.

    Sub-groups execute concurrently within a given work-group and with appropriate device

    support may make independent forward progress with respect to each other, with respect to host

    threads and with respect to any entities external to the OpenCL system but running on an OpenCL

    device, even in the absence of work-group barrier operations. In this situation, sub-groups are able

    to internally synchronize using barrier operations without synchronizing with each other and may

    perform operations that rely on runtime dependencies on operations other sub-groups perform.

    The work-items within a single work-group execute concurrently, but are only guaranteed

    to make independent progress in the presence of sub-groups and device support. In the absence

    of this capability, only high-level synchronization constructs (e.g., work-group functions such as

    barriers), that apply to all the work-items in a work-group, are well defined and included in OpenCL

    for synchronization within a work-group.

    2.5.2.3 Synchronization

    Synchronization across all work-items within a single work-group is carried out using a

    work-group function [58]. These functions carry out collective operations across all the work-items

    in a work-group. Available collective operations are: barrier, reduction, broadcast, prefix sum, and

    evaluation of a predicate. A work-group function must occur within a converged control flow; i.e.,

    all work-items in the work-group must encounter precisely the same work-group function. For

    example, if a work-group function occurs within a loop, the work-items must encounter the same

    work-group function in the same loop iterations. All the work-items of a work-group must execute

    the work-group function and complete reads and writes to memory before any are allowed to continue

    execution beyond the work-group function. Work-group functions that apply between work-groups

    are not provided in OpenCL since OpenCL does not define forward progress or ordering relations

    between work-groups, hence collective synchronization operations are not well defined.
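As a concrete example of a work-group function, the illustrative kernel below computes one partial sum per work-group in local memory. Every work-item executes the same barrier calls, so control flow is converged as required; the kernel assumes the work-group size is a power of two.

__kernel void workgroup_sum(__global const float *in,
                            __global float *partial_sums,
                            __local float *scratch) {
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);              /* all writes to scratch are now visible */

    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);          /* reached by every work-item, every iteration */
    }

    if (lid == 0)
        partial_sums[get_group_id(0)] = scratch[0];
}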

    Synchronization across all work-items within a single sub-group is carried out using a

    sub-group function. These functions carry out collective operations across all the work-items in

    a sub-group. Available collective operations are: barrier, reduction, broadcast, prefix sum, and

    evaluation of a predicate. A sub-group function must occur within a converged control flow; i.e., all

    work-items in the sub-group must encounter precisely the same sub-group function. For example,

if a sub-group function occurs within a loop, the work-items must encounter the same sub-group


    function in the same loop iterations. All the work-items of a sub-group must execute the sub-group

    function and complete reads and writes to memory before any are allowed to continue execution

    beyond the sub-group function. Synchronization between sub-groups must either be performed using

work-group functions or through memory operations. Memory operations should be used carefully for sub-group synchronization, since forward progress of sub-groups relative to each other is only optionally supported by OpenCL implementations.

A synchronization point between a pair of commands (A and B) assures that the results of command A happen-before command B is launched. This requires that any updates to memory

    from command A complete and are made available to other commands before the synchronization

    point completes. Likewise, this requires that command B waits until after the synchronization point

    before loading values from global memory. The concept of a synchronization point works in a similar

    fashion for commands such as a barrier that apply to two sets of commands. All the commands prior

    to the barrier must complete and make their results available to following commands. Furthermore,

    any commands following the barrier must wait for the commands prior to the barrier before loading

    values and continuing their execution.

    2.5.3 OpenCL Memory Model

    The OpenCL memory model describes the structure, contents, and behavior of the memory

    exposed by an OpenCL platform as an OpenCL program runs [59]. The model allows a programmer

    to reason about values in memory as the host program and multiple kernel-instances execute.

    An OpenCL program defines a context that includes a host, one or more devices, command-

    queues, and memory exposed within the context. Consider the units of execution involved with such

    a program. The host program runs as one or more host threads managed by the operating system

    running on the host (the details of which are defined outside of OpenCL). There may be multiple

    devices in a single context which all have access to memory objects defined by OpenCL. On a

    single device, multiple work-groups may execute in parallel with potentially overlapping updates to

    memory. Finally, within a single work-group, multiple work-items concurrently execute, once again

    with potentially overlapping updates to memory.

    The memory regions, and their relationship to the OpenCL Platform model, are summarized

    in Figure 2.12. Local and private memories are always associated with a particular device. The

    global and constant memories, however, are shared between all devices within a given context. An

    OpenCL device may include a cache to support efficient access to these shared memories.

Figure 2.12: The OpenCL Memory Hierarchy.
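In OpenCL C these regions are selected with address space qualifiers. The illustrative kernel below touches all four named address spaces: a __global data buffer, a __constant coefficient table, a __local scratch array shared by the work-group, and private (default) per-work-item variables.

__kernel void scale_and_stage(__global float *data,        /* global memory   */
                              __constant float *coeffs,    /* constant memory */
                              __local float *scratch) {    /* local memory    */
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    float x = data[gid];             /* private memory, one copy per work-item */
    scratch[lid] = x * coeffs[0];    /* staged in local memory, shared by the work-group */
    barrier(CLK_LOCAL_MEM_FENCE);

    data[gid] = scratch[lid];
}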

    To understand memory in OpenCL, it is important to appreciate the relationship between

    these named address spaces. The four named address spaces available to a device are disjoint, which

    means that they do not overlap. This is their logical relationship, however, and an implementation

    may choose to let these disjoint named address spaces share physical memory.

    Programmers often need functions callable from kernels, where the pointers manipulated

    by those functions can point to multiple named address spaces. This saves a programmer from

    the error-prone and wasteful practice of creating multiple copies of functions, one for each named

    address space. Therefore, the global, local and private address spaces belong to a single generic

    address space.
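A sketch of what this looks like in OpenCL C (compiled with -cl-std=CL2.0): the helper function's unqualified pointer parameter resides in the generic address space, so the same function can be called with global, local, or private pointers.

/* An unqualified pointer parameter lives in the generic address space (OpenCL 2.0). */
float sum3(float *p) {
    return p[0] + p[1] + p[2];
}

__kernel void use_generic(__global float *g, __local float *l) {
    size_t gid = get_global_id(0);
    float priv[3] = {1.0f, 2.0f, 3.0f};

    l[get_local_id(0)] = g[gid];     /* populate local memory */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* One helper, three address spaces: global, local, and private pointers. */
    g[gid] = sum3(g) + sum3(l) + sum3(priv);
}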

    2.6 Heterogeneous Computing

    To take full advantage of the resources on a heterogeneous platform, the programmer has

to manage the allocation of these resources. In this section, we introduce several projects that were designed or extended to support heterogeneous computing platforms. All of these runtimes or libraries provide higher-level software layers with convenient abstractions, which relieve the programmer of the burden of managing resources on the targeted heterogeneous platform.


Figure 2.13: Qilin Software Architecture

    2.6.0.1 Qilin

    Qilin [60] is a programming system recently developed for heterogeneous multiprocessors.

    Figure 2.13 shows the software architecture of Qilin. At the application level, Qilin provides an API

to programmers for describing parallelizable operations. By explicitly expressing these computations

    through the API, the compiler does not have to extract any implicit parallelism from the serial code,

    and instead can focus on performance tuning. Similar to OpenMP, the Qilin API is built on top of

    C/C++ so that it can be easily adopted. But unlike standard OpenMP, where parallelization only

    happens on the CPU, Qilin can exploit the hardware parallelism available on both the CPU and the

    GPU.

    Beneath the API layer is the Qilin system layer, which consists of a dynamic compiler

    and its code cache, a number of libraries, a set of development tools, and a scheduler. The compiler

dynamically translates the API calls into native machine code. It also produces a near-optimal mapping from computations to processing elements using an adaptive algorithm. To reduce compilation

    overhead, translated code is stored in the code cache so that it can be reused without recompilation,

    whenever possible. Once native machine code is available, it can be scheduled to run on the CPU

    and/or the GPU by the scheduler. Libraries include commonly used functions such as BLAS and FFT.

    Finally, debugging, visualization, and profiling tools can be provided to facilitate the development of

    Qilin programs.

    Qilin uses off-line profiling to obtain information about each task on each computing

    device. This information is then used to partition tasks and create an appropriate performance model

    for the targeted heterogeneous platform. However, the overhead to carry out the initial profiling

    phase can be prohibitively high and results may be inaccurate if computation behavior is heavily

input-dependent.

Figure 2.14: The OpenCL environment with the IBM OpenCL common runtime.


    2.6.0.2 IBM OpenCL common runtime

IBM's OpenCL common runtime [61] improves the OpenCL programming experience by removing from the programmer the burden of managing multiple OpenCL platforms and duplicated resources, such as contexts and memory objects. In the conventional OpenCL programming

    environment, programmers are responsible for managing the movement of memory between two

or more contexts when multiple OpenCL devices are present on the platform. In this

case, the application is forced to use host-side synchronization in order to move its memory

    objects between coordinating contexts. Equipped with the common runtime, this movement and

    synchronization is done automatically.

    In addition, the common runtime also improves the OpenCL programming experience by

relieving the programmer of the need to manage cross-queue scheduling and event dependencies. By

    convention, OpenCL requires that command queue event dependencies must originate from the same

    context as that of the command queue. In a multiple context environment, this restriction forces

    programmers to manage their own cross-queue scheduling and dependencies. Again, this requires

additional host-side synchronization in the application. With the common runtime, cross-queue event dependencies and scheduling are handled for the programmer.

Finally, the common runtime improves application portability and resource usage, which

    reduces application complexity. In the conventional OpenCL environment, coordination of OpenCL

    resources is more than just an inconvenience. Managing resources comes with challenges of

    application portability, which becomes an issue when code is tuned for a particular underlying

    platform. Applications are forced to choose whether to support only one platform, potentially leaving

compute resources unused, or to add complexity to manage resources across a range of platforms.

    Using the unifying platform provided by the IBM OpenCL common runtime, applications are more

    portable and resources can be more easily exploited.

IBM's OpenCL common runtime is designed to improve the OpenCL programming experience by managing multiple OpenCL platforms and duplicated resources. It minimizes application

    complexity by presenting the programming environment as a single OpenCL platform. Shared

    OpenCL resources, such as data buffers, events, and kernel programs are transparently managed

    across the installed vendor implementations. The result is simpler programming in heterogeneous

    environments. However, even equipped with this commercially-developed common runtime, many

of the multi-context features, such as scheduling decisions and data synchronization, must still be performed manually by the programmer.


    2.6.0.3 StarPU

    StarPU [62] automatically schedules tasks across the different processing units of an

    accelerator-based machine. Applications using StarPU do not have to deal with low-level concerns

such as data transfers or efficient load balancing, which are target-system dependent. StarPU

    is a C library that provides an API to describe application data, and can asynchronously submit

    tasks that are dispatched and executed transparently over the entire machine in an efficient way.

    Providing a separation of concerns between writing efficient algorithms and mapping them on

    complex accelerator-based machines therefore makes it possible to achieve portable performance,

    tapping into the potential of both accelerators and multi-core architectures.

    An application first has to register data with StarPU. Once a piece of data has been

    registered, its state is fully described using an opaque data structure, called a handle. Programmers

    must then divide their applications into sets of possibly inter-dependent tasks. In order to obtain

    portable performance, programmers do not explicitly choose which processing units will process the

    different tasks.

    Each task is described by a structure that contains the list of handles of the data that the task

    will manipulate, the corresponding access modes (i.e. read, write, etc.), and a multi-versioned kernel

    called a codelet, which gathers the various kernel implementations available on the different types of

    processing units. The different tasks are submitted asynchronously to StarPU, which automatically

    decides where to execute them. Thanks to the data description stored in the handle data structure,

StarPU also ensures that coherent replicas of the different pieces of data accessed by a task are

    automatically transferred to the appropriate processing unit. If StarPU selects a CUDA device to

    execute a task, the CUDA implementation of the corresponding codelet will be provided with pointers

    to locally replicated data allocated in the memory on the GPU.

Programmers need not worry about where the tasks are executed, nor how data replicas are managed for these tasks. They simply need to register data, submit tasks with their implementations for the various processing units, and wait for their termination, or simply rely on task

    dependencies.
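As a rough illustration of this programming model, the sketch below registers a vector with StarPU, submits a task whose codelet carries a CPU implementation (a CUDA or OpenCL implementation would be listed alongside it), and waits for completion. It follows the public StarPU C API as documented; field and constant names such as STARPU_MAIN_RAM can differ slightly between StarPU releases, so it should be read as a schematic rather than a drop-in program.

#include <stdint.h>
#include <starpu.h>

#define NX 1024

/* CPU implementation of the codelet; device versions would share this structure. */
static void scale_cpu(void *buffers[], void *cl_arg) {
    struct starpu_vector_interface *v = buffers[0];
    float *x = (float *)STARPU_VECTOR_GET_PTR(v);
    unsigned n = STARPU_VECTOR_GET_NX(v);
    for (unsigned i = 0; i < n; i++) x[i] *= 2.0f;
    (void)cl_arg;
}

static struct starpu_codelet scale_cl = {
    .cpu_funcs = { scale_cpu },   /* .cuda_funcs / .opencl_funcs would list device kernels */
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void) {
    float vec[NX];
    for (int i = 0; i < NX; i++) vec[i] = (float)i;

    starpu_init(NULL);

    /* Register the data; StarPU manages replication and transfers through the handle. */
    starpu_data_handle_t handle;
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)vec, NX, sizeof(vec[0]));

    /* Submit a task; StarPU decides which processing unit executes it. */
    starpu_task_insert(&scale_cl, STARPU_RW, handle, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}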

    StarPU is a simple tasking API that provides numerical kernel designers with a convenient

    way to execute parallel tasks on heterogeneous platforms, and incorporates a number of different

    scheduling policies. StarPU is based on the integration of a resource management facility with a

task execution engine. Several scientific kernels [63][64][65][66] have been deployed on StarPU to

    utilize the computing power of heterogeneous platforms. However, StarPU is implemented in C and


    the basic schedulable units (codelets) have to be implemented multiple times if they are targeting

    multiple devices. This limits the migration of the codelets across platforms, and increases the

programmer's burden. To overcome this limitation, StarPU has initiated a recent effort to incorporate

    OpenCL [67] as the front-end.

    2.6.0.4 Maestro

The Maestro model [68] unifies the disparate, device-specific queues into a single, high-level task queue. At runtime, Maestro queries OpenCL to obtain information about the available

    GPUs or other accelerators in a given system. Based on this information, Maestro can transfer data

    and divide work among the available devices automatically. This frees the programmer from having

    to synchronize multiple devices and keep track of device-specific information.

Since OpenCL can execute on devices that differ radically in architecture and computational capabilities, it is difficult to develop simple heuristics with strong performance guarantees.

Hence, Maestro's optimizations rely solely on empirical data, instead of any performance model or a priori knowledge. Maestro's general strategy for all optimizations can be summarized by the steps shown in Figure 2.15.

    This strategy is used to optimize a variety of parameters, including local work group

    size, data transfer size, and the division of work across multiple devices. However, these dynamic

    execution parameters are only one of the obstacles to true portability. Another obstacle is the choice

    of hardware-specific kernel optimizations. For instance, some kernel optimizations may result in

    excellent performance on a GPU, but reduce performance on a CPU. This remains an open problem.

    Since the solution will no doubt involve editing kernel source code, it is beyond the scope of Maestro.

    Maestro is an open source library for data orchestration on OpenCL devices. It provides

    automatic data transfer, task decomposition across multiple devices, and auto-tuning of dynamic

    execution parameters for selected problems. However, Maestro relies heavily on empirical data and

    benchmark profiling beforehand. This limits its ability to run on applications with data-dependent

    program flow and/or data dependencies.

    2.6.0.5 Symphony

Symphony [69], previously known as MARE (Multicore Asynchronous Runtime Environment) [70], seamlessly integrates heterogeneous execution into a concurrent task graph and removes

    the burden from the programmer of managing data transfers and explicit data copies between kernels


Figure 2.15: Maestro's Optimization Flow (estimate based on benchmarks; collect empirical data from execution; optimize based on the results; repeat while performance continues improving; then adopt the final performance strategy).

    executing on different devices. At a low level, Symphony provides state-of-the-art algorithms for

    work stealing and power optimizations that can hide hardware idiosyncrasies, allowing for portable

    application development. In addition, Symphony is designed to support dynamic mapping of kernels

    to heterogeneous execution units. Moreover, expert programmers can take charge of the execution

    through a carefully designed system of attributes and di