Upload
donga
View
277
Download
0
Embed Size (px)
Citation preview
synergy.cs.vt.edu
To GPU Synchronize or Not GPU Synchronize?
Prof. Wu FENGDepartments of Computer Science and Electrical & Computer Engineering
synergy.cs.vt.edu
What are the Takeaways?
• At the systems software level … – How to support efficient communication between SMs
via barrier synchronization on the GPU GPU Synchronization
• At the application level …– How to integrate the GPU synchronization capability into real
applications: FFT, dynamic programming, and bitonic sort
• From a performance and correctness perspective …– How to “improve” the performance of existing GPU-optimized
applications
– How to guarantee correctness and its associated cost
synergy.cs.vt.edu
Motivation
• Only task- or data-parallel algorithms map well to the GPU.– No explicit support for communication between SMs,
i.e., inter-block data communication
• Consecutive kernel launches from CPU serve as an implicit barrier synchronization for inter-block communication.– How expensive is CPU implicit barrier synchronization?
– Would barrier synchronization on the GPU be better in support of “more general-purpose computation”?
• To GPU synchronize or not GPU synchronize?
synergy.cs.vt.edu
Outline
• Motivation
• Background
– GTX 280 and CUDA Programming Model
• GPU Synchronization
– GPU Lock-Based Synchronization
– GPU Lock-Free Synchronization
• Experimental Results
• Conclusion & Future Work
synergy.cs.vt.edu
GTX 280 Architecture
On-chip memory • Small sizes
• Fast access
Off-chip memory • Large size
• High access latency
Local and Global Memory
synergy.cs.vt.edu
CUDA Programming Model
• CUDA: An extension of the C programming language
Host
Kernel 1
Kernel: A global function called from host and executed on device
•Consists of multiple blocks with each block consisting of multiple threads
•Intra-block sync is implemented with __syncthreads()
•Inter-block sync is implemented via kernel launches
Device
Kernel 2
Grid 1
Block
#1
Block
#2
Block
#N
Block
#3
Grid 2
Block
#1
Block
#2
Block
#N
Block
#3
Thread
#1
Thread
# (M-1)
Thread
#2
Thread
#N
__syncthreads()
Barrier between two kernel launches
synergy.cs.vt.edu
Outline
• Motivation
• Background
– GTX 280 and CUDA Programming Model
• GPU Synchronization
– GPU Lock-Based Synchronization
– GPU Lock-Free Synchronization
• Experimental Results
• Conclusion & Future Work
synergy.cs.vt.edu
Types of Barrier Synchronization
CPU Synchronization vs. GPU Synchronization
Host Device
Computation
Kernel Launch
Return
Kernel Launch
Return
Host Device
Kernel Launch
Return
Barrier
Barrier
__gpu_sync()
__gpu_sync()
Computation
Kernel Launch
Return
Barrier
Barrier
for() {
__kernel_func<<<grid, block>>>();
}
__global__ void __kernel_func()
{
for () {
__device_func();
__gpu_sync();
}
}
Computation
Computation
ComputationComputation
synergy.cs.vt.edu
GPU Lock-Based Synchronization
• Time Profile
Block
#1
Block
#2
Block
#3
Block
#N
g_mutex
atomicAdd(1) atomicAdd(1) atomicAdd(1) atomicAdd(1)
Block
#1
Block
#2
Block
#3
Block
#N
g_mutex == Ng_mutex == Ng_mutex == N
g_mutex == N
Block #1
Execute sequentially by different blocksatN
Block #2
Block #N
1. atomicAdd()
2. g_mutex checking
Execute in parallelBlock #1 Block #2 Block #Nct
Total synchronization time: sca tttN
?
? ? ? Intra-block sync with __syncthreads()
3. __syncthreads()st
synergy.cs.vt.edu
GPU Lock-Free Synchronization
• Time Profile
Block
#1
Block
#2
Block
#3
Block
#N
Ain[1]=1
Block
#1
Block
#2
Block
#3
Block
#N
Block #1
1. Set Ain2. Check Ain3. Intra-block sync4. Set Aout5. Check Aout6. Intra-block sync
Note: Each step can be executed in parallel bydifferent blocks
Block #2 Block #3 Block #N
Block #1 Block #2 Block #N
Ain[2]=1 Ain[3]=1 Ain[N]=1
Block #1
T #1 T #2 T #3 T #N
Ain[1]==1 Ain[2]==1 Ain[3]==1 Ain[N]==1
Intra-block sync with __syncthreads()
Aout[1]=1 Aout[2]=1 Aout[3]=1 Aout[N]=1
? ? ? ?
Aout[1]==1?
Aout[2]==1?
Aout[3]==1?
Aout[N]==1?
Block #3
T #1 T #2 T #3 T #N
__syncthreads()
T #1 T #2 T #3 T #N
Block #1
SIt
CIt
StSOt
COt
Total synchronization time: COSOSCISI ttttt 2
Intra-block sync with __syncthreads()
synergy.cs.vt.edu
Guaranteeing Correctness
Block
#1
Block
#2
Block
#3
Block
#N
Ain[1]=1
Block
#1
Block
#2
Block
#3
Block
#N
Ain[2]=1 Ain[3]=1 Ain[N]=1
Block #1
T #1 T #2 T #3 T #N
Ain[1]==1 Ain[2]==1 Ain[3]==1 Ain[N]==1
Aout[1]=1 Aout[2]=1 Aout[3]=1 Aout[N]=1
? ? ? ?
Aout[1]==1?
Aout[2]==1?
Aout[3]==1?
Aout[N]==1?
Block
#1
Block
#2
Block
#3
Block
#N
g_mutex
atomicAdd(1) atomicAdd(1) atomicAdd(1) atomicAdd(1)
Block
#1
Block
#2
Block
#3
Block
#N
g_mutex == Ng_mutex == Ng_mutex == N
g_mutex == N?
?? ?
GPU lock-based synchronization
GPU lock-free synchronization
__threadfence()__threadfence()
__threadfence()__threadfence()
__threadfence()__threadfence()
__threadfence()__threadfence()
synergy.cs.vt.edu
Outline
• Motivation
• Background
– GTX 280 and CUDA Programming Model
• GPU Synchronization
– GPU Lock-Based Synchronization
– GPU Lock-Free Synchronization
• Experimental Results
• Conclusion & Future Work
synergy.cs.vt.edu
Experimental Set-Up
• Hardware– Host
• 2.2-GHz Intel Core 2 Duo CPU
• 2 X 2GB of DDR2 SDRAM
– Device
• GTX 280 video card
• 1024 MB device memory
• Software– 64-bit Ubuntu GNU/Linux 8.04
– NVIDIA CUDA 2.3 SDK toolkit
synergy.cs.vt.edu
Measurements
• Execution time without __threadfence()
• Execution time with __threadfence()
• Synchronization time percentage (without __threadfence())
Fast Fourier
Transformation (FFT)
Smith-Waterman
Bitonic sort
CPU synchronization
GPU lock-free synchronization
Applications Synchronization Approach
GPU lock-based synchronization
synergy.cs.vt.edu
Execution Time without __threadfence()
Kernel Execution Time vs. Number of Blocks in the Kernel
Smith-Waterman
FFT Bitonic sort
• Less time is needed with more
blocks in the kernel
• Matrix filling time difference in FFT
is smaller than the other two with
different sync approaches used
• GPU sync has a better performance
than CPU implicit sync, BUT …
Baseline: CPU Sync
Baseline: CPU sync
Baseline: CPU sync
Relative Time Decreases
Baseline is CPU synchronization
synergy.cs.vt.edu
Performance & Correctness:Execution Time w/o __threadfence()
• Performance Improvement (relative to existing GPU)– FFT: 10% | Dynamic Programming: 26% | Bitonic Sort: 40%
• Overall Performance Improvement (relative to CPU serial)– FFT: 70x | Dynamic Programming: 13x | Bitonic Sort: 24x
• But …– Our GPU barrier synchronizations run the risk that writes performed
before our gpu sync() barrier are not completed by the time the GPU is released from the barrier.
– syncthreads() can only guarantee writes to shared memory and global memory visible to threads of the same block, it cannot do so for threads across different blocks.
– In practice, highly unlikely the above will ever happen given the amount of time spent spinning at the barrier, but still possible. So, …
To GPU Synchronize or Not GPU Synchronize?NVIDIA Booth, SC|09, November 2009
synergy.cs.vt.edu
Execution Time with __threadfence()Kernel Execution Time vs. Number of Blocks in the Kernel
Smith-Waterman
FFT Bitonic sort
• Matrix filling time difference in FFT
is smaller than the other two with
different sync approaches used
• With __threadfence() called,
performance of GPU sync is worse
than CPU implicit sync
Baseline: CPU Sync
Baseline: CPU Sync
Baseline: CPU Sync
Relative Time Increases
Baseline is CPU synchronization
synergy.cs.vt.edu
Percentage of Time Spent Synchronizing
Synchronization Time Percentages (without __threadfence())
• % time to sync in FFT is lower than the other two algorithms
• Sync time percentages of SWat and bitonic sort are more than 50%
with CPU sync
• % time to GPU sync is lower than that of CPU implicit sync
synergy.cs.vt.edu
Profiling the Synchronization Time• Micro-benchmark – Compute average of two floats 10,000
times
Elements computed
by Block #1
Array
Array
Array
Original value
Round #2
Round #3
CPU sync –
Kernel launchRound #1
GPU sync –
__gpu_sync()CPU sync –
Kernel launch
GPU sync –
__gpu_sync()
Elements computed
by Block #2
Elements computed
by Block #3
synergy.cs.vt.edu
Decomposing Synchronization Time
• Ways to compute each time component– Only kernel execution time can be recorded
– Indirect method
– Use GPU lock-based synchronization as the example
– Synchronization time can be represented as
– Times that can be recorded directlyfsca ttttN
atN : Kernel consisting of only atomicAdd
comt : Kernel consisting of only computation
scacom tttNt : Kernel with GPU lock-based synchronization
scomtt : Kernel with computation and __syncthreads()
fscacom ttttNt : Kernel with GPU lock-based
synchronization (__threadfence())
synergy.cs.vt.edu
Profile of Synchronization Time
• Results
atN
comt black line
scomtt red line
scacom tttNt
fscacom ttttNt
comt + cpu snyc
683.5comt nta 300.2
564.5ct
540.0st
267.7333.0ntffor 10,000 times execution
404.54_synccput
synergy.cs.vt.edu
Outline
• Motivation
• Background
– GTX 280 and CUDA Programming Model
• GPU Synchronization
– GPU Lock-Based Synchronization
– GPU Lock-Free Synchronization
• Experimental Results
• Conclusion & Future Work
synergy.cs.vt.edu
Conclusion
• At the systems software level … – How to support efficient communication between SMs
via barrier synchronization on the GPU GPU Synchronization
• At the application level …– How to integrate the GPU synchronization capability into real
applications: FFT, dynamic programming, and bitonic sort
• From a performance and correctness perspective …– How to “improve” the performance of existing GPU-optimized
applications
– How to guarantee correctness and its associated cost
• __threadfence guarantees correctness but needs to be optimized to support GPU synch. Is Fermi the answer?!
To GPU Synchronize or Not GPU Synchronize?NVIDIA Booth, SC|09, November 2009
synergy.cs.vt.edu
Conclusion
• To GPU synchronize or not GPU synchronize?– GTX 280 / Tesla C1060 / Tesla S1080: Do NOT GPU synchronize.
– Fermi? Likely GPU synchronize?
• Next steps?– Efficient inter-block synchronization via NVIDIA Fermi
– Efficient inter-block synchronization in OpenCL
– Automated tool to transform CPU sync to GPU sync
• For more information– “On the Robust Mapping of Dynamic Programming onto a Graphics
Processing Unit,” 15th Int’l Conf. on Parallel & Distributed Systems, 12/2009.
– “Inter-Block GPU Communication via Fast Barrier Synchronization,” Technical Report TR-09-19, Computer Science, Virginia Tech, 10/2009.
synergy.cs.vt.edu
Yawn …
• Where are the massive speed-ups and cool pictures?
To GPU Synchronize or Not GPU Synchronize?NVIDIA Booth, SC|09, November 2009
synergy.cs.vt.edu
Electrostatic Potential for Molecular Dynamics
Processor + Optimization
Execution Time(seconds)
Speed-Up
CPU Serial 36,360 -
GPU + Kernel Split + Multi-Level HCP
0.37 98,270x
Viral Capsid
• Visit the Supermicro booth, i.e., behind you, or go to http://www.youtube.com/watch?v=zPBFenYg2Zk
Contact: Prof. Alexey Onufriev for info on the science!
Wu FENG, [email protected]
http://sss.cs.vt.edu/
http://synergy.cs.vt.edu/
Laboratory
http://www.green500.org/
http://www.mpiblast.org/