To GPU Synchronize or Not GPU Synchronize? - nvidia.com · applications: FFT, dynamic programming, and bitonic sort •From a performance and correctness perspective … –How to

synergy.cs.vt.edu

To GPU Synchronize or Not GPU Synchronize?

Prof. Wu FENGDepartments of Computer Science and Electrical & Computer Engineering

synergy.cs.vt.edu

What are the Takeaways?

• At the systems software level … – How to support efficient communication between SMs

via barrier synchronization on the GPU GPU Synchronization

• At the application level …– How to integrate the GPU synchronization capability into real

applications: FFT, dynamic programming, and bitonic sort

• From a performance and correctness perspective …– How to “improve” the performance of existing GPU-optimized

applications

– How to guarantee correctness and its associated cost

synergy.cs.vt.edu

Motivation

• Only task- or data-parallel algorithms map well to the GPU.– No explicit support for communication between SMs,

i.e., inter-block data communication

• Consecutive kernel launches from CPU serve as an implicit barrier synchronization for inter-block communication.– How expensive is CPU implicit barrier synchronization?

– Would barrier synchronization on the GPU be better in support of “more general-purpose computation”?

• To GPU synchronize or not GPU synchronize?

synergy.cs.vt.edu

Outline

• Motivation

• Background

– GTX 280 and CUDA Programming Model

• GPU Synchronization

– GPU Lock-Based Synchronization

– GPU Lock-Free Synchronization

• Experimental Results

• Conclusion & Future Work

synergy.cs.vt.edu

GTX 280 Architecture

On-chip memory • Small sizes

• Fast access

Off-chip memory • Large size

• High access latency

Local and Global Memory

synergy.cs.vt.edu

CUDA Programming Model

• CUDA: An extension of the C programming language

Host

Kernel 1

Kernel: A global function called from host and executed on device

•Consists of multiple blocks with each block consisting of multiple threads

•Intra-block sync is implemented with __syncthreads()

•Inter-block sync is implemented via kernel launches

Device

Kernel 2

Grid 1

Block

#1

Block

#2

Block

#N

Block

#3

Grid 2

Block

#1

Block

#2

Block

#N

Block

#3

Thread

#1

Thread

# (M-1)

Thread

#2

Thread

#N

__syncthreads()

Barrier between two kernel launches

synergy.cs.vt.edu

Outline

• Motivation

• Background







synergy.cs.vt.edu

Types of Barrier Synchronization

CPU Synchronization vs. GPU Synchronization

Host Device

Computation

Kernel Launch

Return

Kernel Launch

Return

Host Device

Kernel Launch

Return

Barrier

Barrier

__gpu_sync()

__gpu_sync()

Computation

Kernel Launch

Return

Barrier

Barrier

for() {

__kernel_func<<<grid, block>>>();

}

__global__ void __kernel_func()

{

for () {

__device_func();

__gpu_sync();

}

}

Computation

Computation

ComputationComputation

synergy.cs.vt.edu

GPU Lock-Based Synchronization

• Time Profile

Block

#1

Block

#2

Block

#3

Block

#N

g_mutex

atomicAdd(1) atomicAdd(1) atomicAdd(1) atomicAdd(1)

Block

#1

Block

#2

Block

#3

Block

#N

g_mutex == Ng_mutex == Ng_mutex == N

g_mutex == N

Block #1

Execute sequentially by different blocksatN

Block #2

Block #N

1. atomicAdd()

2. g_mutex checking

Execute in parallelBlock #1 Block #2 Block #Nct

Total synchronization time: sca tttN

?

? ? ? Intra-block sync with __syncthreads()

3. __syncthreads()st

synergy.cs.vt.edu

GPU Lock-Free Synchronization

• Time Profile

Block

#1

Block

#2

Block

#3

Block

#N

Ain[1]=1

Block

#1

Block

#2

Block

#3

Block

#N

Block #1

1. Set Ain2. Check Ain3. Intra-block sync4. Set Aout5. Check Aout6. Intra-block sync

Note: Each step can be executed in parallel bydifferent blocks

Block #2 Block #3 Block #N

Block #1 Block #2 Block #N

Ain[2]=1 Ain[3]=1 Ain[N]=1

Block #1

T #1 T #2 T #3 T #N

Ain[1]==1 Ain[2]==1 Ain[3]==1 Ain[N]==1

Intra-block sync with __syncthreads()

Aout[1]=1 Aout[2]=1 Aout[3]=1 Aout[N]=1

? ? ? ?

Aout[1]==1?

Aout[2]==1?

Aout[3]==1?

Aout[N]==1?

Block #3

T #1 T #2 T #3 T #N

__syncthreads()

T #1 T #2 T #3 T #N

Block #1

SIt

CIt

StSOt

COt

Total synchronization time: COSOSCISI ttttt 2

Intra-block sync with __syncthreads()

synergy.cs.vt.edu

Guaranteeing Correctness

Block

#1

Block

#2

Block

#3

Block

#N

Ain[1]=1

Block

#1

Block

#2

Block

#3

Block

#N

Ain[2]=1 Ain[3]=1 Ain[N]=1

Block #1

T #1 T #2 T #3 T #N

Ain[1]==1 Ain[2]==1 Ain[3]==1 Ain[N]==1

Aout[1]=1 Aout[2]=1 Aout[3]=1 Aout[N]=1

? ? ? ?

Aout[1]==1?

Aout[2]==1?

Aout[3]==1?

Aout[N]==1?

Block

#1

Block

#2

Block

#3

Block

#N

g_mutex

atomicAdd(1) atomicAdd(1) atomicAdd(1) atomicAdd(1)

Block

#1

Block

#2

Block

#3

Block

#N

g_mutex == Ng_mutex == Ng_mutex == N

g_mutex == N?

?? ?

GPU lock-based synchronization

GPU lock-free synchronization

__threadfence()__threadfence()




synergy.cs.vt.edu

Outline

• Motivation

• Background







synergy.cs.vt.edu

Experimental Set-Up

• Hardware– Host

• 2.2-GHz Intel Core 2 Duo CPU

• 2 X 2GB of DDR2 SDRAM

– Device

• GTX 280 video card

• 1024 MB device memory

• Software– 64-bit Ubuntu GNU/Linux 8.04

– NVIDIA CUDA 2.3 SDK toolkit

synergy.cs.vt.edu

Measurements

• Execution time without __threadfence()

• Execution time with __threadfence()

• Synchronization time percentage (without __threadfence())

Fast Fourier

Transformation (FFT)

Smith-Waterman

Bitonic sort

CPU synchronization

GPU lock-free synchronization

Applications Synchronization Approach

GPU lock-based synchronization

synergy.cs.vt.edu

Execution Time without __threadfence()

Kernel Execution Time vs. Number of Blocks in the Kernel

Smith-Waterman

FFT Bitonic sort

• Less time is needed with more

blocks in the kernel

• Matrix filling time difference in FFT

is smaller than the other two with

different sync approaches used

• GPU sync has a better performance

than CPU implicit sync, BUT …

Baseline: CPU Sync

Baseline: CPU sync

Baseline: CPU sync

Relative Time Decreases

Baseline is CPU synchronization

synergy.cs.vt.edu

Performance & Correctness:Execution Time w/o __threadfence()

• Performance Improvement (relative to existing GPU)– FFT: 10% | Dynamic Programming: 26% | Bitonic Sort: 40%

• Overall Performance Improvement (relative to CPU serial)– FFT: 70x | Dynamic Programming: 13x | Bitonic Sort: 24x

• But …– Our GPU barrier synchronizations run the risk that writes performed

before our gpu sync() barrier are not completed by the time the GPU is released from the barrier.

– syncthreads() can only guarantee writes to shared memory and global memory visible to threads of the same block, it cannot do so for threads across different blocks.

– In practice, highly unlikely the above will ever happen given the amount of time spent spinning at the barrier, but still possible. So, …

To GPU Synchronize or Not GPU Synchronize?NVIDIA Booth, SC|09, November 2009

synergy.cs.vt.edu

Execution Time with __threadfence()Kernel Execution Time vs. Number of Blocks in the Kernel

Smith-Waterman

FFT Bitonic sort

• Matrix filling time difference in FFT

is smaller than the other two with

different sync approaches used

• With __threadfence() called,

performance of GPU sync is worse

than CPU implicit sync

Baseline: CPU Sync

Baseline: CPU Sync

Baseline: CPU Sync

Relative Time Increases

Baseline is CPU synchronization

synergy.cs.vt.edu

Percentage of Time Spent Synchronizing

Synchronization Time Percentages (without __threadfence())

• % time to sync in FFT is lower than the other two algorithms

• Sync time percentages of SWat and bitonic sort are more than 50%

with CPU sync

• % time to GPU sync is lower than that of CPU implicit sync

synergy.cs.vt.edu

Profiling the Synchronization Time• Micro-benchmark – Compute average of two floats 10,000

times

Elements computed

by Block #1

Array

Array

Array

Original value

Round #2

Round #3

CPU sync –

Kernel launchRound #1

GPU sync –

__gpu_sync()CPU sync –

Kernel launch

GPU sync –

__gpu_sync()

Elements computed

by Block #2

Elements computed

by Block #3

synergy.cs.vt.edu

Decomposing Synchronization Time

• Ways to compute each time component– Only kernel execution time can be recorded

– Indirect method

– Use GPU lock-based synchronization as the example

– Synchronization time can be represented as

– Times that can be recorded directlyfsca ttttN

atN : Kernel consisting of only atomicAdd

comt : Kernel consisting of only computation

scacom tttNt : Kernel with GPU lock-based synchronization

scomtt : Kernel with computation and __syncthreads()

fscacom ttttNt : Kernel with GPU lock-based

synchronization (__threadfence())

synergy.cs.vt.edu

Profile of Synchronization Time

• Results

atN

comt black line

scomtt red line

scacom tttNt

fscacom ttttNt

comt + cpu snyc

683.5comt nta 300.2

564.5ct

540.0st

267.7333.0ntffor 10,000 times execution

404.54_synccput

synergy.cs.vt.edu

Outline

• Motivation

• Background







synergy.cs.vt.edu

Conclusion

• At the systems software level … – How to support efficient communication between SMs

via barrier synchronization on the GPU GPU Synchronization

• At the application level …– How to integrate the GPU synchronization capability into real

applications: FFT, dynamic programming, and bitonic sort

• From a performance and correctness perspective …– How to “improve” the performance of existing GPU-optimized

applications

– How to guarantee correctness and its associated cost

• __threadfence guarantees correctness but needs to be optimized to support GPU synch. Is Fermi the answer?!


synergy.cs.vt.edu

Conclusion

• To GPU synchronize or not GPU synchronize?– GTX 280 / Tesla C1060 / Tesla S1080: Do NOT GPU synchronize.

– Fermi? Likely GPU synchronize?

• Next steps?– Efficient inter-block synchronization via NVIDIA Fermi

– Efficient inter-block synchronization in OpenCL

– Automated tool to transform CPU sync to GPU sync

• For more information– “On the Robust Mapping of Dynamic Programming onto a Graphics

Processing Unit,” 15th Int’l Conf. on Parallel & Distributed Systems, 12/2009.

– “Inter-Block GPU Communication via Fast Barrier Synchronization,” Technical Report TR-09-19, Computer Science, Virginia Tech, 10/2009.

synergy.cs.vt.edu

Yawn …

• Where are the massive speed-ups and cool pictures?


synergy.cs.vt.edu

Electrostatic Potential for Molecular Dynamics

Processor + Optimization

Execution Time(seconds)

Speed-Up

CPU Serial 36,360 -

GPU + Kernel Split + Multi-Level HCP

0.37 98,270x

Viral Capsid

• Visit the Supermicro booth, i.e., behind you, or go to http://www.youtube.com/watch?v=zPBFenYg2Zk

Contact: Prof. Alexey Onufriev for info on the science!

http://www.youtube.com/watch?v=zPBFenYg2Zk

Wu FENG, [email protected]

http://sss.cs.vt.edu/

http://synergy.cs.vt.edu/

Laboratory

http://www.green500.org/

http://www.mpiblast.org/

Documents

To GPU Synchronize or Not GPU Synchronize? - nvidia.com · applications: FFT, dynamic programming, and bitonic sort •From a performance and correctness perspective … –How to