
A Study of Persistent Threads Style GPU Programming for GPGPU Workloads

Kshitij Gupta
Department of Electrical & Computer Engineering
University of California, Davis
[email protected]

Jeff A. Stuart
Department of Computer Science
University of California, Davis
[email protected]

John D. Owens
Department of Electrical & Computer Engineering
University of California, Davis
[email protected]

ABSTRACT

In this paper, we characterize and analyze an increasingly popular style of programming for the GPU called Persistent Threads (PT). We present a concise formal definition for this programming style, and discuss the difference between the traditional GPU programming style (nonPT) and PT, why PT is attractive for some high-performance usage scenarios, and when using PT may or may not be appropriate. We identify limitations of the nonPT style and four primary use cases that PT can help address: CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization. Through micro-kernel benchmarks we show the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases. We conclude by discussing the hardware and software fundamentals that will influence the development of Persistent Threads as a programming style in future systems.

1. INTRODUCTION

GPGPU programming has spawned a new era in high performance computing by enabling massively parallel commodity graphics processors to be utilized for non-graphics applications. This widespread adoption has been possible due to architectural innovations that transformed the GPU from fixed-function hardware blocks into a programmable unified shader model, and to programming languages like CUDA [11] and OpenCL [9] that present an easy-to-program coding style by heavily virtualizing processor hardware and shifting the onus of extracting parallelism from the programmer (explicit SIMD) to the compiler (implicit SIMD) for instruction generation.

There are numerous examples of the current GPGPU programming environment facilitating good speed-up over sequential implementations [12]. This trend has been especially true for core compute portions from a broad variety of application domains, which quite often tend to be brute force, or generally embarrassingly parallel, in nature. However, from an application standpoint, the overall benefit of employing a GPU is governed by Amdahl's Law, necessitating the port of larger portions of an application to the GPU to achieve better overall application acceleration. As a wider variety of workloads are implemented on the GPU, many of which are highly irregular, the limitations of GPU programming become apparent. We believe that the key source of these limitations is that the programming environment originally created in 2006 [10] lacks native support for handling the kinds of communication patterns shown in Figure 1(c).

Figure 1: An illustration of GPU architecture and workload evolution through an example of a three-shader application, A (blue), B (red), and C (green). (a) Pre-2006: dedicated processor cores for each shader in the graphics pipeline; (b) 2006: stream programming on a unified processor architecture supporting all three shaders (the CUDA architecture, along with 'C for CUDA'), well suited for data-parallel stream workloads (kernels); (c) Today: a sample of irregular workload patterns in modern applications: (i) multiple kernels within an application to be executed successively (without having to write intermediate values to global memory), (ii) processing of a single kernel with a synchronization requirement before the next iteration can be started, and (iii) strongly inter-dependent kernels B and C producing variable work for the next stage to process.

Many application-specific solutions have been proposed in the literature recently to address these limitations, ranging from fundamental parallel programming primitives like scan and sort [8] and core algorithms like FFT [18] to programmable graphics pipelines like ray tracing [1] and Reyes-style rendering [16], all aiming to maximize performance. These solutions, while seemingly disjoint in usage, are all centered around a common programming principle called Persistent Threads (PT). However, showing that PT is useful in a specific instance does not address the larger questions: broadly, when is the PT style applicable, and why is it the right choice? We go a step further in this paper by providing insights into this successfully used GPU-compute programming style.

Our first major contribution is to formally introduce and define the Persistent Threads model in a domain-independent way that is understandable to anyone with reasonable knowledge of GPU computing. We hope it contributes to a deeper understanding of this programming style. Our second major contribution is to concisely categorize the possible usage scenarios, based on our needs and those encountered in the literature, into four broad use cases, each of which is clearly described and benchmarked in the paper. These use cases encompass a wide body of issues that are encountered with the irregular workloads addressed by PT, and therefore this paper provides a good starting point for anyone looking for ways to improve their performance. As this paper is neither intended as a summary nor a validation of results seen in past literature, our third and most important contribution is a deep distillation and analysis of the use cases via unbiased benchmarks developed by us to provide insight into the key questions of using PT: when and why is PT appropriate. We believe this understanding is sorely lacking today. In this paper we take a systematic approach that should be useful to a broad audience from a variety of application domains.

In specific cases, what makes PT successful is inefficiencies in current hardware and programming systems. We hope that over time, many of the inefficiencies we identify in this paper will be resolved as hardware and software continue to move forward. However, we believe that the use cases we describe in this paper are relevant today and will continue to be relevant in the future. So another contribution of this paper is the creation of a set of benchmarks (comparison points) to help evaluate implementations using PT against not only the traditional programming style but also (and most importantly) against future software and hardware advances. As these use cases will continue to exist despite hardware and software evolution, the broader discussion we hope to initiate through this paper is the relative merits of hardware scheduling (as in the traditional GPU programming style) vs. software scheduling (as in PT).

We begin by describing the present programming style and identifying key performance bottlenecks in GPGPU programming today, followed by a description of the Persistent Threads style of programming, in Section 2. In Section 3 we briefly describe each of the four use cases where we find PT to be applicable, and then discuss them in detail. Through synthetic micro-kernel benchmarks, we present guidelines for when a PT formulation could be beneficial to programmers. In Section 4 we propose modifications to hardware and software programming for native PT-style support, based on the experience gained while running our benchmarks. We conclude in Section 5 by observing that the changes required for native support of the PT style do not require a significant re-working of the processor hardware, making it a potentially attractive option for inclusion in near-term hardware and software programming extensions of CUDA and OpenCL.

2. PROGRAMMING STYLE BACKGROUND

2.1 Traditional Programming Style

GPU hardware and the GPU programming style are designed such that they rely heavily on the Single Instruction Multiple Thread (SIMT) and Single Program Multiple Data (SPMD) programming paradigms. Both of these paradigms virtualize the underlying hardware at multiple levels, abstracting the view for the software programmer from actual hardware operation.

Figure 2: An illustration of the various levels of hierarchy in GPGPU programming. Solid rectangular boxes indicate a physical processor while dotted boxes indicate virtual entities. (a) Multiple blocks can simultaneously execute on a single SM; (b) further virtualization of blocks on the SM scheduled over time.

Every lane of the physical SIMD Streaming Multiprocessor (SM) is virtualized into larger batches of threads, which are some multiple of the SIMD-lane width, called warps or wavefronts (SIMT processing). Just as in SIMD processing, each warp operates in lock-step and executes the same instruction, time-multiplexed on the hardware processor. Multiple warps are combined to form a higher abstraction called thread blocks, with threads within each block allowed to communicate and share data at runtime via L1 cache/shared memory or registers. The process is further virtualized with multiple thread blocks being scheduled onto each SM simultaneously, and each block operating independently on different instances of the program on different data (SPMD processing). A hierarchy of these virtualizations is shown in Figure 2.

This programming style (which we shall refer to as 'nonPT') forces the developer to abstract units of work into virtual threads. As the number of blocks is dependent on the number of work units, in most scenarios there are hundreds or thousands more blocks to run on the hardware than can be initiated at kernel launch. In the traditional programming style, these extra blocks are scheduled at runtime. The switching of blocks is managed entirely by a hardware scheduler, with the programmer having no means of influencing how blocks are scheduled onto the SMs. So while these abstractions provide an easy-to-program model by presenting a low barrier to entry for developers from a wide variety of application domains, they get in the way of seasoned programmers working on highly irregular workloads that are already hard to parallelize. This exposes a significant limitation of the current SPMD programming style: it guarantees neither order, location, nor timing, and it does not explicitly allow developers to influence these three parameters without bypassing the hardware scheduler altogether.


CUDA                    OpenCL
thread                  work item
warp                    —
thread block            work group
grid                    index space
local memory            private memory
shared memory           local memory
global memory           global memory
scalar core             processing element
multi-processor (SM)    compute unit

Table 1: Equivalent terminologies between CUDA and OpenCL. Although we use NVIDIA hardware and 'C for CUDA' terminology in this paper, the same discussion can be extended to GPUs from other vendors using OpenCL.

To circumvent these limitations, developers have used the PT style and its lower level of abstraction to gain performance by directly controlling scheduling.

2.1.1 Core Limitations of GPGPU Programming

A brief summary of the other major characteristics of the current GPGPU programming style, many of which impact performance and are directly targeted by the Persistent Threads programming style, is given below:

1. Host-Device Interface:

Master-slave processing: Only the host (master) processor has the ability to issue commands for data movement, synchronization, and execution on the device outside of a kernel.

Kernel size: The dimensions of a block, and the number of blocks per kernel invocation, are passed as launch configuration parameters to the kernel invocation API.

2. Device-side Properties:

Lifetime of a block: Every block is assumed to perform its function independent of other blocks, and to retire upon completion of its task.

Hardware scheduler: The hardware manages a list of yet-to-be-executed blocks and automatically schedules them onto a multi-processor (SM) at runtime. As scheduling is a runtime decision, the programming style offers no guarantees of when or where a block will be scheduled.

Block state: When a new block is mapped onto a particular SM, the old state (registers and shared memory) on that SM is considered stale, disallowing any communication between blocks, even when they run on the same SM.

3. Memory Consistency:

Intra-block: Threads within a block communicate data via either local (per-block) or global (DRAM) memory. Memory consistency is guaranteed between two sections within a block if they are separated by an appropriate intrinsic function (typically a block-wide barrier).

Inter-block: The only mechanism for inter-block communication is global memory. Because blocks are independent and their execution order is undefined, the most common method for communicating between blocks is to force a global synchronization by ending a kernel and starting a new one. Inter-block communication through atomic memory operations is also an option, but may not be suitable or deliver sufficient performance for some application scenarios.

4. Kernel Invocations:

Producer-consumer: Due to the restrictions imposed on inter-block data sharing, kernels can only produce data as they run to completion. Consuming data produced by a kernel on the GPU requires another kernel (see the sketch after this list).

Spawning kernels: A kernel cannot invoke another copy of itself (recursion), spawn other kernels, or dynamically add more blocks. This is especially costly in cases where data reuse exists between invocations.
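As a rough illustration of the producer-consumer restriction above (this sketch is not from our benchmarks; the kernel names, the block size of 256, and the emission rule are placeholder assumptions), the nonPT pattern forces a blocking readback of the item count before the consumer can even be configured:

// Hypothetical nonPT producer/consumer round trip (CUDA C).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void producerKernel(float *items, unsigned int *itemCount, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Each input conditionally emits one work item (the emission rule is arbitrary here).
    if ((i & 3) == 0) {
        unsigned int slot = atomicAdd(itemCount, 1u);
        items[slot] = (float)i;
    }
}

__global__ void consumerKernel(const float *items, float *out, unsigned int count) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) out[i] = items[i] * 2.0f;   // stand-in for real work
}

int main() {
    const int n = 1 << 20;
    float *dItems, *dOut;
    unsigned int *dCount, hCount = 0;
    cudaMalloc(&dItems, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemset(dCount, 0, sizeof(unsigned int));

    producerKernel<<<(n + 255) / 256, 256>>>(dItems, dCount, n);

    // The round trip being discussed: a blocking readback just to size the next launch.
    cudaMemcpy(&hCount, dCount, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    if (hCount > 0)
        consumerKernel<<<(hCount + 255) / 256, 256>>>(dItems, dOut, hCount);
    cudaDeviceSynchronize();

    printf("consumer processed %u items\n", hCount);
    cudaFree(dItems); cudaFree(dOut); cudaFree(dCount);
    return 0;
}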

2.2 Persistent Threads Programming Style

The requirement imposed by the nonPT programming style of dividing a workload into several blocks, more than can physically reside simultaneously at kernel launch time, is a design choice of the programming style, and is not due to some constraint imposed by the SM. From a hardware perspective, threads are active throughout the execution of a kernel. This differs from a developer's perspective using nonPT-style coding guidelines, which implies that as blocks run to completion, the threads corresponding to those blocks are "retired", while a batch of threads is "launched" as new blocks are scheduled onto the SM.

The Persistent Threads style of programming alters the notion of the lifetime of virtual software threads, bringing them closer to the execution lifetime of physical hardware threads; i.e., the developer's view is that threads are active for the entire duration of a kernel. This is achieved by two simple modifications to kernel code. First, the virtualization of hardware SMs is limited to the level shown in Figure 2(a), also referred to as maximal launch. Second, the lack of additional blocks shown in Figure 2(b) is compensated for by employing work queues. Both are described in greater detail below. The PT style of coding is much closer to how one would program CPUs or Intel's Larrabee processor [14], which exposes the scheduler for the programmer to influence, if desired.

1. Maximal Launch: A kernel uses at most as many blocks as can be concurrently scheduled on the SMs:

Since each thread remains persistent throughout the execution of a kernel, and is active across traditional block boundaries until no work remains, the programmer schedules only as many threads as the GPU SMs can concurrently run. This represents the upper bound on the number of threads with which a kernel can launch. The lower bound can be as small as the number of threads required to launch a single block. In order to distinguish nonPT and PT thread blocks, we will refer to blocks in PT-style programming as thread groups for the remainder of this paper. A thread group has the same dimensions as a thread block, but is formed by combining persistent threads launched at kernel invocation, processing work equivalent to zero, one, or many thread blocks, and remaining active until no more work is left for the kernel to process.

Figure 3: An illustration of the nonPT and PT programming styles through an example image of 64x64 pixels undergoing vertical motion blur. In this example, 16x16 threads combine to form a block and process the image as a set of 4x4 blocks. The GPU has four SMs, labeled 0-3. We assume a load-balanced system where each SM runs one block at a time and four blocks in total. For nonPT, the kernel launches with 16 blocks; at run time the hardware non-deterministically schedules new blocks onto SMs as other blocks complete. A PT kernel launches at most the number of blocks corresponding to 'maximal launch', which in this example corresponds to four thread groups, along with a 16-entry work queue, each entry of which represents a block index. Through the work queue, developers can control the scheduling of blocks onto SMs. In this example, each SM processes all blocks within a vertical sub-section of the image, thus helping data reuse as pixels are shared across vertical block boundaries.

2. Software schedules work through work queues, not hardware:

The traditional programming environment does not expose the hardware scheduler to the programmer, thus limiting the ability to exploit workload communication patterns. In contrast, the PT style bypasses the hardware scheduler by relying on a work queue of all blocks that are to be processed before kernel execution is complete. When a block finishes, it checks the queue for more work and continues doing so until no work is left, at which point the block retires. Depending on the communication pattern exhibited by the algorithm, the queue can either be static (known at compile time) or dynamic (generated at runtime), and can be used to control the order, location, and timing of the execution of each block. There are several kinds of optimizations one could incorporate in how work is submitted to and fetched from these queues; one example would be to use distributed queues instead of a single global queue, which could lead to better load balancing via work stealing, work donation, or a hybrid of the two. In this paper we focus on the base case of a single queue used primarily as a FIFO.
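As a concrete, deliberately simplified sketch of these two ideas (the per-block work function, the queue layout, and the occupancy query below are illustrative assumptions, and the occupancy API shown postdates the hardware used in this paper), a PT kernel in CUDA might look roughly like this:

#include <cuda_runtime.h>

// Stand-in for per-block work: each thread touches one element of its virtual block.
__device__ void processBlock(unsigned int vblock, float *data) {
    unsigned int i = vblock * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;
}

// Persistent-threads kernel: each thread group loops, pulling "virtual block"
// indices from a global counter until the queue is drained.
__global__ void ptKernel(float *data, unsigned int *queueHead, unsigned int totalBlocks) {
    __shared__ unsigned int myBlock;
    while (true) {
        if (threadIdx.x == 0)
            myBlock = atomicAdd(queueHead, 1u);   // grab the next queue entry
        __syncthreads();
        if (myBlock >= totalBlocks) return;       // queue empty: the thread group retires
        processBlock(myBlock, data);
        __syncthreads();
    }
}

void launchPT(float *data, unsigned int *queueHead, unsigned int totalBlocks) {
    const int blockSize = 256;
    int blocksPerSM = 0, numSMs = 0;
    // "Maximal launch": one way (on recent CUDA versions) to compute how many
    // thread groups can be co-resident on the device.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, ptKernel, blockSize, 0);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    ptKernel<<<blocksPerSM * numSMs, blockSize>>>(data, queueHead, totalBlocks);
}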

3. USE CASES

Over the past two years, several researchers have used PT techniques in their applications to improve performance. We have categorized these into four use cases, each of which is described in detail in the following four subsections.

PT-style programming lacks native hardware and programming-API support. While past studies have shown PT to be advantageous, the benefits and limitations of hacking the nonPT model to fit a PT style in software are not well understood or documented, and therefore PT cannot be universally applied to all scenarios. Although a handful of benchmark suites are available for GPU computing [5, 6], these workloads primarily deal with performance at a higher level than we wanted for our analysis. Hence, apart from distinctly categorizing the use cases, another contribution of this paper is the creation of an unbiased micro-kernel benchmark suite that future studies can use and compare against, and that vendors can use to evaluate future designs.

For each use case, we give a brief introduction and cite relevant studies that have utilized the use case. We provide implementation details of our synthetic workloads followed by results and conclusions. We wrote our tests in both OpenCL and CUDA. Specifically, since OpenCL provides the ability for fine-grain profiling, we wrote the first use case in OpenCL and the remaining ones in CUDA. We ran all our use cases on an NVIDIA GeForce GTX295 (see footnote 1).

Footnote 1: We also tested on a GTX580 (Fermi) and found the general trends were similar, even though we made heavy use of atomics and the Fermi architecture supposedly has better atomic support (atomics on Fermi happen in the L2 cache/shared memory, not in DRAM).


CPU-GPU Synchronization
  Scenario: Kernel A produces a variable amount of data that must be consumed by Kernel B.
  Advantage of Persistent Threads: nonPT implementations require a round-trip communication to the host to launch Kernel B with the exact number of blocks corresponding to the work items produced by Kernel A.

Load Balancing
  Scenario: Traversing an irregularly-structured, hierarchical data structure.
  Advantage of Persistent Threads: PT implementations build an efficient queue to allow a single kernel to produce a variable amount of output per thread and load balance those outputs onto threads for further processing.

Maintaining Active State
  Scenario: A kernel accumulates a single value across a large number of threads, or Kernel A wants to pass data to Kernel B through shared memory or registers.
  Advantage of Persistent Threads: Because a PT kernel processes many more items per block than a nonPT kernel, it can effectively leverage shared memory across a larger block size for an application like a global reduction.

Global Synchronization
  Scenario: Global synchronization within a kernel across thread blocks.
  Advantage of Persistent Threads: In a nonPT kernel, synchronizing across blocks within a kernel is not possible because blocks run to completion and cannot wait for blocks that have not yet been scheduled. The PT model ensures that all blocks are resident and thus allows global synchronization.

Table 2: Summary of Persistent Threads use cases

Unless otherwise stated, every block consists of 256 threads in our experiments, and as a proxy for work per thread, we perform a varying number of fused multiply-add (FMA) operations on single-precision floating-point values. In the following subsections, "more FMAs per item" is equivalent to "more work per item". For each experiment, we implemented two versions of our synthetic benchmarks, nonPT and PT (both with equivalent functionality), to evaluate performance and identify tradeoffs. The four use cases are CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization, which are summarized in Table 2.
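For reference, the per-thread work proxy can be pictured as a small register-resident FMA chain along the following lines (the constants and unroll factor are placeholders, not our exact benchmark code):

// Hypothetical stand-in for "work per item": a tunable chain of dependent FMAs
// kept in registers so that memory traffic does not dominate.
__device__ float fmaWork(float x, int numFMAs) {
    float acc = x;
    #pragma unroll 8
    for (int i = 0; i < numFMAs; ++i)
        acc = fmaf(acc, 1.000001f, 0.5f);   // single-precision fused multiply-add
    return acc;
}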

3.1 CPU-GPU Synchronization

Since the CPU (host) and GPU (device) are coupled together as master and slave, the device lacks the ability to submit work to itself. Outside of the functionality built into a kernel, the device is dependent on the host for issuing all data movement, execution, and synchronization commands. In cases where a producer kernel generates a variable number of items for the consumer kernel to process at run time, the host must issue a readback, determine the number of blocks needed to consume the intermediate data, and then launch a new kernel. The readbacks have significant overhead, especially in systems where the host and device do not share the same memory space or are not on the same chip.

Boyer et al. present a systematic approach for accelerating the process of detecting and tracking leukocytes in in-vivo blood vessels [3]. By modifying memory access patterns, reformulating the computations within the kernel, and fusing kernels to minimize the CPU-GPU synchronization overhead, they report a 60x speedup over the baseline naive implementation. Further, upon restructuring the computations that originally required 50,000 kernel invocations into just one PT invocation, they achieved a 211x speedup over the baseline. Tzeng et al. utilized a work-queue-based task-donation strategy for efficiently handling the split and dice stages in the Reyes pipeline [16]. Their work splits micropolygons of a scene into a variable number of patches at each step. With the nonPT traditional programming style, this would require a host readback of the number of patches to be split in the next iteration. But by combining these two stages into a single uberkernel and wrapping it in a PT kernel, they eliminated the need for this readback and created a self-sustaining Reyes engine that efficiently handles their irregular workload.

3.1.1 Implementation

Overhead Characterization: In order to characterize the overhead of host-device synchronization, we performed a timing analysis of the three steps involved, shown in red as the critical path in Figure 4 (#1): a blocking readback by the host to determine the subsequent launch bounds, the host overhead of generating configuration parameters for a kernel with the appropriate number of items, and the kernel launch time from the moment of issue to the moment the device starts the kernel.

To obtain an accurate assessment of the synchronization overhead, we used the low-level profiling API, clGetEventProfilingInfo(), provided by OpenCL (CUDA does not provide the fine-grained control we need, which is why we did not implement this use case in CUDA). This call provides four timestamps pertaining to when a call in the command queue is issued by the host, dispatched to the device execution queue, begins execution on the device, and ends. For this experiment, we computed the time it takes from the moment the producer kernel (GPU-kP) ends to when the consumer kernel (GPU-kC) starts.

Persistent Threads Alternative: The PT alternative requires two modifications. First, GPU'-kC is reformulated to encapsulate the original kernel in a while loop, along with an atomic operation to either increment or decrement the number of items remaining. Second, the modified kernel is launched with a fixed number of blocks by the host, regardless of how many items are to be processed. When the work-queue counter exceeds the item count in the case of an incrementing queue, or goes below zero in the case of a decrementing queue, the corresponding block exits. At runtime, the last block of GPU-kP writes the number of items GPU'-kC must process to a predefined GPU-accessible location. GPU'-kC then uses this value as its terminal count. Thread groups retire once they reach this count.

Both modifications outlined here in the PT formulation are sources of overhead. In order to further understand this overhead with respect to nonPT kernels, we created two synthetic workloads, one that is compute-intensive (CI) and the other a combination of compute and memory (CMI) operations. For the CI kernel we read data from memory, perform a variable number of FMAs, and then write the result to global memory. For the CMI kernel we perform strided accesses to memory for every iteration of FMA computation, committing the result to global memory after the pre-determined number of operations has been performed. In both cases, we unroll the FMA loop completely in order to minimize loop overhead. For the nonPT case, the time of execution is the sum of performing a synchronous data transfer from GPU to CPU and the kernel execution.
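A minimal sketch of the PT consumer described above is shown below, assuming an incrementing queue; the symbol names and the per-block item width of 256 elements are illustrative, and the terminal count is assumed to have been written by the last block of the producer as described in the text:

__device__ unsigned int g_itemCount = 0;   // terminal count, published by the last block of GPU-kP
__device__ unsigned int g_queueHead = 0;   // incrementing work-queue counter

// Persistent consumer (GPU'-kC): launched with a fixed number of thread groups,
// independent of how many 256-element work blocks the producer actually emitted.
__global__ void ptConsumer(const float *items, float *out, int fmasPerItem) {
    __shared__ unsigned int myBlock;
    while (true) {
        if (threadIdx.x == 0)
            myBlock = atomicAdd(&g_queueHead, 1u);   // claim the next queue entry
        __syncthreads();
        if (myBlock >= g_itemCount) return;          // past the terminal count: retire
        unsigned int idx = myBlock * blockDim.x + threadIdx.x;
        float acc = items[idx];
        for (int k = 0; k < fmasPerItem; ++k)        // CI-style stand-in work
            acc = fmaf(acc, 1.000001f, 0.5f);
        out[idx] = acc;
        __syncthreads();
    }
}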

3.1.2 Results & Conclusions

Using empty kernels, we measured the round-trip cost of a synchronized copy to the CPU and the launch on the GPU to be around 400 us. In addition to running this use case on an NVIDIA GTX295, we also ran it on a GeForce 9400M. A more detailed analysis of PT vs. nonPT performance for a varying number of FMAs and blocks for our workloads on both processors is shown in Figure 5.

From Figure 5(a) we see that non-compute-intensive kernels benefit from a PT formulation. As the number of blocks to process increases, the atomic pressure increases considerably, resulting in a slowdown. As the arithmetic intensity of the work processed per block increases, the PT formulation generally results in a speedup, eventually dipping below the nonPT baseline. In Figure 5(b) we see a different trend than for the CI case. While having fewer blocks results in a significant speedup with a small amount of work per thread, as this work increases, we transition to a considerable slowdown. We attribute this, in part, to the drop in occupancy, the ratio of the number of thread groups started at launch time to the theoretical maximum supported by the hardware. As the number of iterations to be unrolled increases, the compiler requires more registers and is more restricted in making optimizations within the while loop for the PT case.

In stark contrast, we see a very distinct set of results for these workloads on the GTX295. On both the CI and CMI workloads, the PT version is almost always slower than its nonPT counterpart. The slowdown further increases as the number of blocks to process increases. The only reasonable explanation we can hypothesize for this is the cost of contention for atomics: the 9400M has only 2 SMs while the GTX295 has 30.

From these workloads we conclude that using PT to avoid CPU-GPU synchronization is beneficial for small workloads with few memory accesses and small arithmetic intensity. Further, PT through software is better for kernels launching fewer thread groups, and where the cost of synchronization is a significant fraction of the overall cost of the application.

3.2 Load Balancing / Irregular Parallelism

Efficiently processing poorly-structured algorithms or irregular data structures is a challenge on any compute device, but especially on the GPU. Extracting parallelism on the GPU is left to the programmer, often requiring low-level control routines to be written in software.

Figure 5: Results from CPU-GPU Synchronization on NVIDIA's GeForce 9400M and GTX295: (a) 9400M, compute only; (b) 9400M, compute+memory; (c) GTX295, compute only; (d) GTX295, compute+memory.

Figure 6: An illustration of our complete and tilted forests: (a) full forest; (b) tilted forest (number of levels vs. initial inputs).

These workloads present two important issues: vector processor efficiency, and the handling of variable numbers of work items being consumed or produced. Aila et al. present a thorough study of improving the efficiency of processing the irregular trace() phase of ray tracing on vector processors through PT [1]. Their solution for handling the varying depths of traversal for each ray was to use warp-wide blocks, avoiding a single ray holding several warps within a block hostage (see footnote 2), and to bypass the hardware scheduler through the use of PT. Tzeng et al. [16] addressed the issue of variable producer/consumer work items using distributed work queues coupled with a task-donation strategy to mitigate the load-balancing issues of the split and dice phases in the Reyes pipeline.

Footnote 2: We surmise that the large speedups seen by Aila et al. and other researchers in this area were in large part due to warp retirement serialization issues with pre-Fermi NVIDIA hardware, and that newer hardware does not suffer from the same issues. However, we note that this presumed hardware issue is only relevant to this particular use case; the other three use cases do not retire warps out of order.

Figure 4: nonPT and PT illustrations of the use cases (#1: CPU-GPU Synchronization, showing the read-back barrier, CPU overhead, and launch time on the nonPT critical path; #3: Producer-Consumer Locality, where kernels X1-X3 become a single PT kernel X'; #4: Global Synchronization, where cross-block barriers between kernels become PT barriers within one kernel).

3.2.1 Implementation

In order to explore Load Balancing and Irregular Parallelism (LBIP), we design a test (a forest expansion) that emits a variable amount of work per thread and requires multiple stages to complete. The algorithm starts with an initial number of input items. Each thread performs a pre-defined number of FMAs in register space, after which the owning block generates zero, one, or two new items (we tried using a larger maximum branching factor, but it had little effect on the resulting performance trends). The result is a forest, where items that generate other items are either roots or intermediate nodes, and items that do not generate work are leaves. Every block processes one item at a time using all available threads.

We experimented with two different types of forests. We call the first a complete forest, since every input item generates a complete binary tree of a shallow, pre-specified depth. The second we call a tilted forest: in it, the "first half" of the items on any given level generate no new items, and every item in the "second half" generates two new items, as shown in Figure 6. Work generation halts at a pre-specified depth. The tilted-tree model has the same number of nodes on every level (the first level might differ by one item). Instead of halting the forest growth at a shallow level, we let the tree grow to a pre-specified depth of at most one thousand.

nonPT Implementation: Each nonPT kernel invocation processes one or more items comprising a complete level of the forest. Each node generates zero or two new items. We can safely store only the current level and the next level of the forest, since we work on one level at a time. Each level of the tree has a counter dictating how many items are in that level. For a given block, we check the global index of the block and terminate immediately if there are fewer items than this index; otherwise the block processes the input and helps to populate the next level of the forest (until the forest reaches a certain depth). To achieve perfect load balancing (a 1-to-1 matching of the number of blocks and items at the current level), we could issue a readback before each kernel invocation to get a tight bound on the number of blocks we need, or, since we implicitly know the number of items in the next level (we know the initial number and that every item creates two new items at each level), we could just use that knowledge. But we felt doing either of these went against the spirit of the use case, since a real application probably would not have full expansion, and it might be slower to perform the readback than to just schedule the empty blocks.

Figure 7: Load Balancing/Irregular Parallelism Results (Complete Forest): We run with a large variety of MADs per thread per level (0-50,000), maximum depths (1-12), and numbers of PT blocks (1-240). The graph shows the general trends we notice from all runs. We see a general loss of speedup as the number of elements increases, as the nonPT kernel has more work to do. We see PT speedup greater than 1.0 when the number of MADs is high, starting somewhere between 1000 and 5000 per thread per level.

Figure 8: Load Balancing/Irregular Parallelism Results (Tilted Forest): We again run with a large variation in our configuration. However, we found that as we varied the maximum depth and the number of inputs, the trends stayed consistent, so we show our results for a maximum depth of 50 with 128 initial inputs. We found similar behavior as we varied the number of initial elements. This graph shows the general trends we notice from all runs. As we saturate the SMs with at least 30 blocks, we achieve better performance than the nonPT kernel when we have a high amount of work per thread per level.

When processing an item, we assume that a block knows how many new items its current item will generate before actually processing the item. This enables us to cheat slightly by maximizing the distance between the atomic increment and the use of the calculated value. The complete-forest application knows ahead of time how many levels of items to process, so it launches the correct number of kernels all at once in a single stream, thus lowering kernel-launch overhead. For the tilted forest, we do not allow the application to presume knowledge of the final level of work; thus, the application schedules X levels of work at a time and then performs a readback to determine if any work remains. We did not want to assume that either application knew how many items were on any level in either test, so we scheduled the maximum number of blocks necessary to process a full level. This caused a performance hit for our nonPT kernel because we allowed each level to grow quite large (on the order of millions of elements for the complete forest and tens of thousands of elements for the tilted forest).

PT Implementation: Our PT kernel is different from our nonPT kernel in that it lacks implicit knowledge of the level of an item. We use a single work queue to store work items instead of two queues and a ping-pong strategy (as with nonPT). Our queue grants access via an atomically guarded spin lock. Each block loops until it detects program termination via an atomically guarded variable. As a block enters an iteration of the loop, it locks the queue, reads an item, and releases the lock. If the block reads a valid item, it processes that item and determines how many new items to generate. The block then locks the queue to add the new items and unlocks the queue. Locking is important both for adding and removing elements, since it alone guarantees coherency of the queue pointers. When no more items are available and all blocks are idle, the kernel exits. In theory the implementation of our PT kernel is quite simple, but correctly manipulating the locks and counters for the queue is tricky.
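The lock-guarded queue manipulation can be sketched roughly as follows; the structure layout and capacity are assumptions, and a production version needs additional care (volatile accesses, further fences, and an idle-block count for termination), which is exactly the trickiness noted above:

// Hypothetical single global queue guarded by a spin lock (0 = free, 1 = held).
// tryPop/push are intended to be called by one representative thread per block,
// which avoids intra-warp lock divergence.
struct WorkQueue {
    int lock;            // acquired with atomicCAS, released with atomicExch
    int head;            // index of the next item to consume
    int tail;            // index one past the last valid item
    int items[1 << 20];  // backing storage (capacity is an assumption)
};

__device__ bool tryPop(WorkQueue *q, int *item) {
    bool got = false;
    while (atomicCAS(&q->lock, 0, 1) != 0) { }   // spin until the lock is free
    if (q->head < q->tail) { *item = q->items[q->head++]; got = true; }
    __threadfence();                             // make queue updates visible to other blocks
    atomicExch(&q->lock, 0);                     // release the lock
    return got;
}

__device__ void push(WorkQueue *q, int item) {
    while (atomicCAS(&q->lock, 0, 1) != 0) { }
    q->items[q->tail++] = item;
    __threadfence();
    atomicExch(&q->lock, 0);
}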

Our PT implementation uses a single shared, global queue. There are other queuing implementations, such as many global queues, or a global queue with local queues combined with task stealing and/or donation. We used a single queue based on prior tests, implementation time, performance, and the desire to keep the number of variables in our experiments low. We also noted that with the complete-tree implementation, using small local queues offered no significant performance improvement because the queues fill up rapidly (often even at the end of every level). For the tilted tree, one or more blocks must constantly steal work from other queues, again mitigating any performance improvements.

3.2.2 Results & Conclusions

LBIP has a large number of variables, both for the nonPT and the PT kernels. The common variables between them are the initial number of input items, the amount of work per item, and the depth of the forest. We also varied the total number of blocks to use per SM for PT, and the number of invoked kernels per readback (used to check for termination in the nonPT kernel).

We conclude that the more items per level, the poorer the performance of PT, because the hardware scheduler will always best an ad hoc software scheduler. The stable and downward trends in Figure 7 illustrate this point. Concurrent atomic pressure is a major factor in our PT kernel. We use only round-trip atomics, meaning we need to wait for the result of every atomic operation, and thus each block must wait for the atomic pipeline to flush. The atomic pipeline in NVIDIA GPUs (both Tesla and Fermi) does not scale linearly with the number of concurrent atomics (in fact the behavior seems non-deterministic and suffers from starvation in cases where atomics are used as locks). However, as each block spaces out queue-access requests with significant amounts of work, the number of concurrent accesses decreases, the overall time to access the queue decreases rapidly, and the ratio between the two (and thus queue access time in general) quickly becomes negligible. Figure 8 shows this trend very well: once the number of MADs per thread per level increases beyond a certain threshold (dictated by the number of active threads on the GPU), performance stabilizes in relation to nonPT.

Our experiments show two different trends. First, regardless of the number of FMAs per item, PT tends to outperform nonPT with a small number of initial inputs. However, as the number of inputs grows, PT only does well when the kernel executes a lot of work per item. This is due to concurrent atomic pressure. With many items (regardless of work per item), nonPT blocks spend most of their time doing actual work, instead of simply coming alive, seeing no work, and dying. With many items but little work per item, though, PT blocks spend a significant fraction of their time accessing the high-contention queue. As the amount of work per item increases, access to the queue spreads out, thus lowering contention and significantly speeding up queue-access time.

Based on our workloads on the GPU under study, we conclude that for the general case, PT is better only for specific combinations of variables. PT outperforms nonPT on small, irregular work and on regular, deeply-recursive work; in either of our forest implementations, PT tends to outperform nonPT when there are not many initial input elements and when the growth in elements is fairly constrained.

3.3 Producer-Consumer Locality

Many algorithms require data sharing between different threads within a kernel. Producer-consumer locality refers to the ability of one piece of work to pass its results to another piece of work with minimal cost. Current hardware allows data sharing by threads in the same block via on-chip memory, but threads in different blocks must communicate through DRAM and use expensive intrinsics to guarantee coherency. This limitation is an artifact of the lifetime of a block (threads retire after finishing a small amount of work) and of the inability to express explicit communication patterns that would aid in guaranteeing coherence without using DRAM. PT addresses this limitation.

In previous work in this area, Bell and Garland [2] use persistent warps for passing the "carry-out" of one iteration as the "carry-in" of the next in their COO format for SpMV computations. Breitbart [4] details a prefix-sum implementation that requires a single kernel launch by using static workgroups as opposed to many kernel launches. Recent sorting and prefix-sum scan work also uses PT to minimize global memory traffic by exploiting producer-consumer locality [8, 13].

3.3.1 Implementation

To test how PT deals with producer-consumer locality, we use two different workloads. The first is a large reduce function: given a set of N floats, compute their sum. The second is a dummy benchmark that executes a synthetic workload of FMAs in two stages. We begin with a set of N inputs. In the first step, we perform a certain amount of arithmetic on each item; in the second step, we perform additional arithmetic on only a certain percentage of those items.

nonPT Reduce: We use a multi-kernel approach for our nonPT reduce. Given N floats, we schedule ceil(N/256) blocks to reduce the initial input set. Each block reduces 256 floats down to 1 float and stores it back to DRAM. This reduces the input set size from N to ceil(N/256). We continue to schedule kernels in the same manner until we reduce our set to a single item, the final value. This implies that we must schedule ceil(log256 N) kernels to reduce any input set.

PT Reduce: We use a single kernel with at most 256 blocks to complete the two-stage process. The kernel statically divides the input. Each thread reads a value from the array, accumulates it, and, once it has read all its values, participates in a blockwise reduction. Once all blocks finish their reductions, they write their single values to DRAM and the kernel executes a global barrier. After the barrier, block 0 reads all values from DRAM and performs one final reduction.
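A condensed sketch of this two-stage PT reduce is shown below. The single-use counter barrier, the 256-group limit, and the grid-stride partitioning are simplifying assumptions; our actual kernel differs in detail and uses the global barrier discussed in Section 3.4:

// Single-use counter barrier; valid only because all thread groups are resident
// (PT maximal launch). barrierCount must be zero-initialized by the host.
__device__ void simpleGlobalBarrier(volatile unsigned int *count, unsigned int numGroups) {
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();                      // publish this group's prior global writes
        atomicAdd((unsigned int *)count, 1u);
        if (blockIdx.x == 0) {
            while (*count < numGroups) { }    // group 0 waits for all arrivals...
            *count = 0;                       // ...then releases everyone
        } else {
            while (*count != 0) { }           // wait for group 0's release
        }
    }
    __syncthreads();
}

__device__ float g_partial[256];              // one partial sum per thread group (<= 256 groups)

// Launched with <= 256 thread groups of 256 threads each.
__global__ void ptReduce(const float *in, float *out, int n, unsigned int *barrierCount) {
    __shared__ float s[256];
    float acc = 0.0f;
    // Stage 1: each thread accumulates its slice, then the block reduces in shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        acc += in[i];
    s[threadIdx.x] = acc;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) g_partial[blockIdx.x] = s[0];

    simpleGlobalBarrier(barrierCount, gridDim.x);   // all thread groups meet here

    // Stage 2: thread group 0 folds the per-group partials into the final value.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b) total += g_partial[b];
        *out = total;
    }
}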

nonPT Synthetic Workload: For our nonPT synthetic workload, we have two implementations: one for the case where most or all threads in the second stage work on items passed from the first stage, and one for the case where relatively few threads in the second stage work on items passed from the first stage.

The first implementation uses a single kernel. We parameterize the number of items to pass per block. Each of the 256 threads in a block reads a single float from DRAM, performs a certain number of FMAs, conditionally progresses to the second stage (or dies), and carries with it a single resultant float stored in a register. Once in the second stage, the thread performs the same amount of arithmetic as in the first stage and stores a single float out to DRAM. This implementation is most efficient when all input items produce an output item, meaning every thread in the first stage remains active during the second stage and carries with it a value to compute on in the second stage.

Our second implementation is similar to the above in terms of compute characteristics but differs by storing intermediate values to DRAM and using two kernels (one producer, one consumer). As each block finishes stage one, the threads conditionally store their intermediate data into a tightly packed array in DRAM (those that do not pass the condition simply die). Next, we launch the consumer with as many threads as input items; each thread reads an element from the array, performs the same amount of arithmetic as in the producer kernel, and writes the result to DRAM.

Our second approach is most efficient when relatively few input items produce an output item. In this case, since the consumer kernel can be configured for the appropriate number of items, it better utilizes hardware resources (in the first approach, many threads per block do not progress to the consumer stage, thus holding hostage many hardware threads per SM).

PT Synthetic Workload: We again use a single kernel written as an uberkernel, statically partition the entire input among blocks, and, as with the nonPT implementation, parameterize the number of items to keep per block. Each block has a shared-memory storage arena. Just as in the nonPT case, each thread reads in one float, does a pre-determined number of FMAs, and conditionally stores the result into the shared-memory queue. When the queue fills up, the entire block processes the queue until it is empty by moving an item from the queue into a register, performing the same amount of arithmetic per item as in the producer stage, and then storing the result to DRAM. The block then resumes processing more items from the producer stage.
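The shared-memory arena can be pictured roughly as in the sketch below; the arena capacity, pass condition, and drain policy are illustrative assumptions rather than our exact kernel:

#define QCAP 1024   // per-block shared-memory arena capacity (assumed)

__global__ void producerConsumerUber(const float *in, float *out, int itemsPerBlock, int fmas) {
    __shared__ float queue[QCAP];
    __shared__ unsigned int qCount, outBase;
    if (threadIdx.x == 0) { qCount = 0; outBase = 0; }
    __syncthreads();

    int base = blockIdx.x * itemsPerBlock;
    int numIters = (itemsPerBlock + blockDim.x - 1) / blockDim.x;
    for (int it = 0; it < numIters; ++it) {
        int i = it * blockDim.x + threadIdx.x;
        // Producer stage: compute, then conditionally keep the result on chip.
        if (i < itemsPerBlock) {
            float v = in[base + i];
            for (int k = 0; k < fmas; ++k) v = fmaf(v, 1.000001f, 0.5f);
            if (v > 0.0f)                                   // stand-in pass condition
                queue[atomicAdd(&qCount, 1u)] = v;
        }
        __syncthreads();

        // Consumer stage: drain before the arena can overflow, and on the last round.
        if (qCount + blockDim.x > QCAP || it == numIters - 1) {
            for (unsigned int j = threadIdx.x; j < qCount; j += blockDim.x) {
                float w = queue[j];
                for (int k = 0; k < fmas; ++k) w = fmaf(w, 1.000001f, 0.5f);
                out[base + outBase + j] = w;                // DRAM is touched only here
            }
            __syncthreads();
            if (threadIdx.x == 0) { outBase += qCount; qCount = 0; }
        }
        __syncthreads();
    }
}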

3.3.2 Results & Conclusions

Reduce: Our PT implementation of reduce outperforms an optimized nonPT reduce at every input size we tested, with the speedup eventually leveling off around 1.1x as the amount of work to process increases. At small input sizes, the largest bottleneck in nonPT reduce is the global barrier accomplished via implicit kernel boundaries. By using a cheaper global barrier in our PT implementation we achieve a significant reduction in execution time, between 1.50x and 1.85x. As input size increases, the time taken to execute a barrier becomes small compared to DRAM access time. Thus, the performance improvements we see come not so much from using an explicit PT barrier, but from passing items via shared memory.

Figure 9: Producer-Consumer Results: We run our synthetic workload passing items from the producer stage to the consumer stage. We vary the number of items to pass, as well as the full number of initial items to process. As we vary the number of items to pass from 0 to 32 (per 256), we maintain performance between 0.95-1.05x that of a nonPT kernel. However, when we pass a large percentage (at least 25%) of the items to the consumer stage, we get up to 1.40x speedup, in large part because we have producer-consumer data locality and can pass items via registers and shared memory.

Synthetic Workload: We draw the following conclusions from the three major data points in Figure 9. When we pass no items from producer to consumer, we see a slowdown with our PT implementation because the hardware scheduler is much better at load-balancing the SMs than our PT kernel. When the number of passed input items ranges between one and thirty-two (per block), we achieve only moderate speedup. As the number of input items to pass grows beyond thirty-two, we begin to achieve noticeable speedup, finishing off at approximately 1.45x for large input sizes where we pass every produced item to the consumer stage.

3.4 Global Synchronization

The GPU provides hardware support for synchronizing threads in a block, but not for synchronizing between active blocks on the same SM, or across the entire GPU. Current hardware guidelines dictate that the only way to globally synchronize is to launch separate kernels at would-be synchronization points. Kernel-launch overhead costs between 3-7 us [17], which can lead to significant overheads depending on the number of kernel invocations [3]. Stuart and Owens used PT-based global synchronization in their implementation of a message-passing API for GPUs [15] to efficiently implement collective communications such as barriers. Luo et al. [7] present a hierarchical kernel arrangement for building a BFS through the use of a global barrier to synchronize between blocks, mitigating the cost of kernel launch overhead.

We, and others, specifically propose the use of Global Synchronization (GS) through PT to minimize kernel-launch overhead. The GS construct is a barrier implemented in software for inter-block communication. The GS barrier acts on every block concurrently resident on the GPU. It can be used for separating different instances of the same kernel being processed recursively, or multiple sections of code within the same kernel in the case of uberkernel formulations. The GS construct has been shown to consume between 1.3-2.0 us [18]. These results, however, only cover one data point, so in order to better understand the performance tradeoffs, we varied a wide range of parameters in our micro-benchmarks.

3.4.1 Implementation

nonPT: The nonPT implementation of our synthetic GS use case works as a two-step process, one step on the GPU and one on the CPU. The GPU stage simply performs a pre-determined number of FMAs in register space, then stores a value in DRAM. The CPU stage then implicitly synchronizes by launching another kernel. We do not explicitly synchronize, nor do we need to, since the hardware guarantees that if kernels are launched asynchronously in a single stream, all blocks of the preceding kernel will run to completion before the next kernel begins. The point of our synthetic workload is to determine roughly how much work, on average, a block must perform to mitigate the cost of GS.

PT: The PT implementation of the GS use case works as a single-step process consisting of two stages, with both stages executing on the GPU. Like the nonPT version, each block in the first PT stage performs its pre-defined set of operations. However, because blocks are persistent, each must perform the work of a variable number of blocks from the nonPT implementation. As each block completes an iteration, it begins processing the second stage. This stage is where blocks synchronize globally. Our implementation of GS is based on the lock-free version described by Xiao and Feng [18].
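The following is a minimal sketch of such a software barrier, written in the spirit of the lock-free scheme of Xiao and Feng [18] rather than as their exact code. The flag arrays, the use of block 0 as coordinator, and the assumption that blockDim.x is at least gridDim.x are illustrative simplifications; the barrier is only safe under a maximal launch, where every block is co-resident on the GPU.

// Hedged sketch of a lock-free software global barrier (simplified).
// Call with a strictly increasing generation value (e.g. the loop counter)
// so the arrive/depart flags never need to be reset between uses.
__device__ void global_barrier(volatile int *arrive, volatile int *depart,
                               int generation)
{
    __syncthreads();                         // whole block reaches the barrier
    __threadfence();                         // make prior writes visible GPU-wide

    if (threadIdx.x == 0)
        arrive[blockIdx.x] = generation;     // announce this block's arrival

    if (blockIdx.x == 0) {
        // Block 0 waits for every block to arrive, then releases them together.
        if (threadIdx.x < gridDim.x)
            while (arrive[threadIdx.x] != generation) { /* spin */ }
        __syncthreads();
        if (threadIdx.x < gridDim.x)
            depart[threadIdx.x] = generation;
    }

    if (threadIdx.x == 0)
        while (depart[blockIdx.x] != generation) { /* spin */ }

    __syncthreads();                         // no thread leaves before the release
}

If more blocks are launched than can be co-resident, some blocks never get scheduled and the barrier deadlocks, which is why the maximal-launch constraint discussed later in Section 4.1 matters.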

3.4.2 Results & Conclusions

We varied the number of blocks, the number of FMAs to be performed per input item, and the total number of blocks to process. The overall trend remains the same until 2000 FMAs per thread, but falls off dramatically as the compute intensity is increased; after inspecting the PTX, this appears to be because the compiler no longer unrolls for loops. Figure 10 shows the results obtained by running GS on a workload performing 1000 FMAs, with a varying number of items to process per block and a varying number of syncs. We see a consistent speedup of 2–2.5×, which is in line with previously-reported results.³ From our tests we conclude two things.

³We note one exception: we achieve less speedup when running the workload of 50 nonPT blocks, and have yet to find a reasonable answer as to why.


Figure 10: Global Synchronization Results: The runs for this graph used 1000 FMAs per input item per sync.

First, the amount of arithmetic to be performed has little bearing on the performance of global synchronization. Second, the benefit of syncing on the GPU increases only asymptotically with the number of syncs: when each PT block executes a significant amount of work, the time taken to launch a new kernel is small in relation to the total execution time, and thus the application provides little room for a PT-based speedup.


4. DISCUSSION

Based on previous work, and the micro-benchmarks discussed in Section 3, we have shown that the PT programming style can be used to address a range of computational patterns. PT addresses many of the restrictions mentioned in Section 2.1.1 pertaining to the current hardware and programming style. Taking a long-term view, we discuss the potential impact the use cases outlined in this paper are likely to have given imminent changes in future hardware, as well as possible changes hardware vendors and Khronos partners might consider useful for future iterations of the programming environment.

1. CPU-GPU Synchronization: The cost of synchronization across the CPU and GPU is largely dependent on the round-trip time for data to be copied from GPU to CPU memory space, and on kernel launch overhead. For integrated processors with the CPU and GPU on the same die, like AMD's Llano and Intel's Sandy Bridge processors, which will share dedicated on-chip memory for inter-processor communication in future revisions of the architecture, we see comparatively less benefit in using PT than on discrete systems where the CPU and GPU are connected through a system interconnect bus like PCI-Express.

2. Load Balancing: Effectively utilizing all cores on any parallel processor is a difficult problem. Quite often the best way to load balance is by running task-parallel workloads, an area that is emerging as an important primitive. Hence, for this use case to be handled efficiently, it is imperative that there be dedicated hardware support for queues, with API extensions that provide tracking of head and tail pointers, queue wrap-around logic, overflow/underflow guard flags, nearly-full/empty flags, distributed queues, etc. Currently the developer is required to implement such a queue in software, taking care to guarantee both data consistency and deadlock avoidance, which is non-trivial (a sketch of such a hand-rolled queue appears after this list). In addition, the ability for kernels to generate and launch their own work would greatly ease the implementation of load-balanced and task-parallel systems.

3. Producer-Consumer: Extracting producer-consumer locality is the most challenging of all the use cases discussed in this paper. The nonPT style provides no way of expressing data-locality patterns, which is especially important on massively data-parallel machines. Cache coherence could be one way of addressing this, but it would be expensive to build and does not scale well, and hence is not popular in most GPUs today. The PT style helps address these issues by giving more control over scheduling and by exploiting the locality exhibited by the underlying algorithm. We see this as the biggest benefit of PT-style programming, one with uses even in future processor generations.

4. Global Synchronization: As the number of processors on a GPU grows, so will the number of blocks that can be resident on it. This implies that beyond a point the cost of synchronizing on the GPU would be greater than the kernel launch overhead, and therefore GS might not be very useful at a global level on future systems. However, several patterns require only a subset of blocks to communicate and synchronize, and hence it could still be quite useful to synchronize on the GPU. Ultimately, research into fast synchronization primitives at a higher level than a barrier is required.
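As referenced in the Load Balancing item above, the following is a minimal sketch of the kind of software work queue a developer must currently build by hand. The structure and names (Task, WorkQueue, try_pop, push) are illustrative, and the consistency and overflow corner cases that the text calls non-trivial are deliberately flagged in comments rather than solved.

// Task payload is illustrative; a real system would carry whatever the
// consuming code needs.
struct Task { int id; };

struct WorkQueue {
    Task *items;                // backing store in global memory
    unsigned int head;          // next item to consume
    unsigned int tail;          // next free slot
    unsigned int capacity;      // size of the backing store
};

__device__ bool try_pop(WorkQueue *q, Task *out)
{
    // Reserve an index with an atomic; if we overshot the producer, back off.
    unsigned int idx = atomicAdd(&q->head, 1u);
    if (idx >= q->tail) {       // racy check: a correct queue needs more care here
        atomicSub(&q->head, 1u);
        return false;
    }
    *out = q->items[idx % q->capacity];
    return true;
}

__device__ void push(WorkQueue *q, Task t)
{
    unsigned int idx = atomicAdd(&q->tail, 1u);
    // A correct implementation must also guard against wrap-around overflow
    // (idx minus head exceeding capacity) before publishing the item.
    q->items[idx % q->capacity] = t;
}

Even this deliberately simplified version shows why hardware-managed queues would help: every correctness detail (ordering of the head/tail updates, overflow, starvation of consumers) is left to the application programmer.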

4.1 Implementation Aspects

As discussed in Section 2.2, the primary purpose of PT-style programming is to bypass the default (traditional) programming style offered by CUDA and OpenCL by altering the lifetime of thread blocks and how they are scheduled. In providing a greater degree of flexibility to deal optimally with algorithms that have complex compute, data-movement, and synchronization patterns, a much greater burden is put upon the developer, as aspects that would otherwise have been handled transparently by the driver/hardware must now be done manually.

Even minor modifications in code can lead to enough change in register and shared memory usage to ultimately affect multi-processor occupancy (the number of thread groups that can be simultaneously scheduled on an SM). In the absence of formal support for PT kernels, the programmer is required to manually account for changes in occupancy, as occupancy directly influences maximal launch parameters (especially relevant to the global synchronization use case) and also impacts the implementation of the underlying algorithm using complex queue-control structures. A list of portability aspects with reference to each use case is shown in Table 3. One possible means of extending support for PT is through a dedicated API; a sample of how simple this modification could be is shown in API 1.
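To make the maximal-launch dependence concrete, the following sketch shows how such a block count could be derived programmatically. Note that the occupancy query used here (cudaOccupancyMaxActiveBlocksPerMultiprocessor) appeared in CUDA runtimes released after this work; at the time, the count had to be derived by hand from the occupancy calculator, which is exactly the burden described above.

#include <cuda_runtime.h>

// Hedged sketch: derive a "maximal launch" block count for a PT kernel.
int max_resident_blocks(const void *kernel, int block_size, size_t dyn_smem)
{
    int device = 0, blocks_per_sm = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, kernel,
                                                  block_size, dyn_smem);
    // Launching exactly this many blocks guarantees they are all co-resident,
    // which the PT global-synchronization use case depends on to avoid deadlock.
    return blocks_per_sm * prop.multiProcessorCount;
}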

4.2 Return-on-Investment

Our results clearly show that there is no guarantee of success for kernels designed via PT today. With limited native support, either in the form of dedicated APIs or lower-level GPU primitives, the developer time and effort involved in PT-style programming is, in a wide majority of cases, significant. Further, due to the complexities involved in altering kernels to fit into a PT framework, debugging can also prove to be an extremely challenging task.

Based on our experiences and results, we suggest that any decision to use this programming style be made after careful consideration, as it involves a significant investment in all aspects of design, implementation, and debugging, and requires considerably more effort to program successfully than a nonPT implementation.

4.3 Power

Although we have not analyzed the impact of PT on power consumption, we believe this to be an interesting aspect for future work, particularly relevant to low-power GPUs in consumer electronic devices. On the one hand, atomic operations are inherently serial in nature and are likely to exhibit a poor power profile; the power consumption of PT with a high degree of contention that gets resolved in the L2 cache or global memory is likely to be quite high. On the other hand, PT can bypass the hardware scheduler (one of the most complex parts of the modern GPU), thereby providing dynamic power-saving opportunities by putting these transistors into a power-save mode for the duration of the PT kernel.

5. CONCLUSION

GPU computing is still a nascent field, with its programming models and styles, and the hardware that supports them, still in a rapid state of change. While GPU architectures have made steady strides since the adoption of unified shaders for supporting GPU computing workloads, there is a long list of potential advancements from the CPU world, from the supercomputing world, and others yet to be developed that may be required to unlock its full potential. Support for recursion, memory management, and preemption, to name just a few, may eventually appear in the GPU, but only after potentially significant redesigns of the architecture.

Rather than only inheriting techniques from the CPU world, one of the most exciting aspects of GPU computing is how promising software techniques developed first on the GPU quickly become system and hardware features. Our discussion in Section 4 shows that extending native support for PT would not be a particularly expensive exercise, and we certainly expect that some of them are already on the radar of future system architects. The most important use case in the broad sense is that of producer-consumer locality: locality is poorly supported by the traditional programming style, but PT shows a definite win over the traditional style in a realm that is absolutely critical for future system efficiency. Further investigation into how to expose locality in future programming models is an essential research direction for the community at large.

More broadly, the GPU is moving from an engine that supports only regular workloads to one that increasingly targets irregular ones with complex parallelism and dependencies. These modifications, while certainly helping PT implementations, also promise to allow the GPU to support a broader variety of irregular workloads. Our tests clearly show that PT implementations are not necessarily an outright winner, and today do not in fact lead to significant gains for some use cases on our micro-benchmark workloads. However, these results could be markedly different in the future as GPUs continue to evolve and some of the suggestions discussed in this paper are natively incorporated by hardware vendors for PT-style programming. Our ultimate goal in this paper is to formally introduce this important programming style to the broader GPU computing community, and through an initial set of micro-kernel benchmarks to initiate a broader discussion on this topic. We hope that our work toward characterizing and understanding persistent threads will play an important role in simplifying and enhancing general-purpose applications on the GPU.

Acknowledgments

Thanks to our reviewers for their constructive and valuable feedback, and to Mark Harris, David Luebke, Erik Lindholm, Aaron Lefohn, and Mike Houston for thoughtful discussions and suggestions on this work. We appreciate the support of an Intel Microprocessor Technology Lab grant, the SciDAC Institute for Ultrascale Visualization, and the National Science Foundation (grants OCI-1032859 and CCF-1017399).

References

[1] Timo Aila and Samuli Laine. Understanding the efficiency of ray traversal on GPUs. In Proceedings of High Performance Graphics 2009, pages 145–149, August 2009.

[2] Nathan Bell and Michael Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC '09: Proceedings of the 2009 ACM/IEEE Conference on Supercomputing, pages 18:1–18:11, November 2009.

[3] Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron. Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, May 2009.

[4] Jens Breitbart. Static GPU threads and an improved scan algorithm. In Euro-Par 2010 Workshops: Proceedings of the Third Workshop on UnConventional High Performance Computing (UCHPC 2010), volume 6586 of Lecture Notes in Computer Science, pages 373–380. Springer, August 2010.


Use Case          Occ.  Sch.  Comments

1. CPU-GPU Sync    —     —    indirectly impacts portability in the form of CPU-GPU workload partitioning based on application profiling analyses

2. LBIP            —     ✓    for the simple case (of a single queue) we see minimal design and implementation impact, but when more complex queuing structures (local + global) and work stealing/donation optimizations are used, portability becomes non-trivial

3. Prod-Cons       ✓     ✓    requires different kernel organization and partitioning strategies for parallel-friendly algorithm modification on processors with varying compute capabilities

4. Global Sync     ✓     —    changes in occupancy make PT-based kernels extremely difficult to debug, as the number of thread groups that must synchronize must be entered manually by the programmer

Table 3: Fundamental aspects that affect PT portability and usability based on occupancy and scheduling issues for our four use cases.

API 1: Example modifications to CUDA/OpenCL APIs for Persistent Threads.

CUDA API:
  current:  <<< grid, block, shmem >>>
  proposed: <<< grid, block, shmem, pt >>>

OpenCL API:
  current:  clEnqueueNDRangeKernel(cmdQ, kern, wkDim, gOff, *WrkGrp, *WrkItm, numEve, *EveList, *Eve)
  proposed: clEnqueueNDRangeKernel(cmdQ, kern, pt, wkDim, gOff, *WrkGrp, *WrkItm, numEve, *EveList, *Eve)

Description: pt can take the following parameters: (a) NONPT: nonPT kernel launch; (b) MAX_TGROUPS: the PT kernel is invoked with a maximal launch; (c) USER_TGROUPS: the PT kernel is invoked with a user-supplied grid or wkDim. To avoid global synchronization deadlock issues, if the user-supplied thread-group size is greater than that possible for a maximal launch, the kernel exits with a pre-set error code.


[5] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In IEEE International Symposium on Workload Characterization, pages 44–54, October 2009.

[6] Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. The scalable heterogeneous computing SHOC benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 63–74, March 2010.

[7] Lijuan Luo, Martin Wong, and Wen-mei Hwu. An effective GPU implementation of breadth-first search. In Proceedings of the 47th Design Automation Conference, pages 52–55, June 2010.

[8] Duane Merrill and Andrew Grimshaw. Parallel scan for stream architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia, December 2009.

[9] Aaftab Munshi. The OpenCL Specification (Version 1.1, Document Revision 48), June 2010.

[10] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. ACM Queue, pages 40–53, March/April 2008.

[11] NVIDIA Corporation. NVIDIA CUDA compute unified device architecture, programming guide, 2011. http://developer.nvidia.com/.

[12] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, May 2008.

[13] Nadathur Satish, Mark Harris, and Michael Garland. Designing efficient sorting algorithms for manycore GPUs. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009.

[14] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics, 27(3):18:1–18:15, August 2008.

[15] Jeff A. Stuart and John D. Owens. Message passing on data-parallel architectures. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009.

[16] Stanley Tzeng, Anjul Patney, and John D. Owens. Task management for irregular-parallel workloads on the GPU. In Proceedings of High Performance Graphics 2010, pages 29–37, June 2010.


[17] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 31:1–31:11, November 2008.

[18] Shucai Xiao and Wu-chun Feng. Inter-block GPU communication via fast barrier synchronization. In Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing, April 2010.