Parallel Computing With CUDA
October 12, 2011
Ryan Albright, Oregon State University
Outline
1 Introduction to CUDA
2 Hardware
3 Software
4 Research
5 Synctium
About CUDA
Compute Unified Device Architecture
CUDA is both an architecture and a programming model
Scalable performance based on the number of "cores"
SIMD
Single Instruction, Multiple Data
Optimizes for dense repetitive tasks
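To make the SIMD idea concrete, here is a minimal CUDA kernel sketch (illustrative, not from the original deck): every thread executes the same instruction stream, each on its own data element.

/* SIMD-style execution: one instruction stream, many data elements.
   Illustrative sketch only. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; /* one element per thread */
    if (i < n)
        c[i] = a[i] + b[i]; /* same instruction, different data */
}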
GF100 Architecture
Graphics Processing Clusters
Broken into Streaming Multiprocessors (SMs)
What is inside a CUDA Core?
Each core also has 1 Shader Module
Some Example Cards...
We are using the GTS 450
Compare with ATI
How is this different from ATI's cards?
ATI typically has more "processors" (really just ALUs)
NVIDIA has fewer but more independent "cores" (with shaders)
Oftentimes performance is very similar
ATI vs NVIDIA
Benchmark
Software
Languages Supported
C, C++, Java, Perl, Python, Ruby, Matlab, Mathematica, Fortran, Haskell, Lua, IDL, .NET
Limited to NVIDIA architectures
Really, really, really easy to set up...
Many usable standard libraries
Setting Up CUDA
Go to: http://developer.nvidia.com/cuda-toolkit-sdk
Download the drivers, CUDA Toolkit, SDK & examples, and the "Getting Started" manual
Install the drivers and unpack the rest (this automatically puts everything in the right place)
Run "make" in the C directory to set up the libraries
Run "make" in any of the example project directories to build that project
If this works, CUDA is set up properly
Basic Steps To Use CUDA
Allocate Local Memory
Allocate GPU Memory
Store Values in Local Memory
Pass Values to GPU Memory
Give the GPU an instruction
Pass Results Back
Clean Up GPU Memory
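A minimal sketch of these steps using the CUDA runtime API (the kernel name vecAdd and the values used are illustrative; the deck's own CUBLAS version follows on the next slides):

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n); /* e.g., the SIMD sketch above */

int run(int n)
{
    size_t bytes = n * sizeof(float);

    /* 1: Allocate local (host) memory */
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);

    /* 2: Allocate GPU (device) memory */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    /* 3: Store values in local memory */
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* 4: Pass values to GPU memory */
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* 5: Give the GPU an instruction (launch the kernel) */
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    /* 6: Pass results back */
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    /* 7: Clean up GPU memory */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}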
sgemm Example
Stands for Single-Precision General Matrix Multiply
Also called dense matrix multiply
Requires multiplying two N × N matrices
Many simple multiplies work well on highly parallel processors
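For reference, the "standard C version" benchmarked later is presumably something like the classic triple loop below (a sketch, not the deck's actual code; full BLAS sgemm also applies alpha and beta scaling factors):

/* Naive dense matrix multiply C = A * B for N x N row-major matrices. */
void sgemm_naive(int N, const float *A, const float *B, float *C)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j]; /* one simple multiply-add per step */
            C[i * N + j] = sum;
        }
}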
Memory Allocation: Host
Consider the A × B = C matrix multiply

/* Allocate host memory for the matrices */
h_A = (float *)malloc(n2 * sizeof(h_A[0]));
if (h_A == 0) {
    fprintf(stderr, "!!!! host memory allocation error (A)\n");
    return EXIT_FAILURE;
}
h_B = (float *)malloc(n2 * sizeof(h_B[0]));
if (h_B == 0) {
    fprintf(stderr, "!!!! host memory allocation error (B)\n");
    return EXIT_FAILURE;
}
h_C = (float *)malloc(n2 * sizeof(h_C[0]));
if (h_C == 0) {
    fprintf(stderr, "!!!! host memory allocation error (C)\n");
    return EXIT_FAILURE;
}

/* Fill the matrices with test data */
for (i = 0; i < n2; i++) {
    h_A[i] = rand() / (float)RAND_MAX;
    h_B[i] = rand() / (float)RAND_MAX;
    h_C[i] = rand() / (float)RAND_MAX;
}
Memory Allocation: Device
/* Allocate device memory for the matrices */
status = cublasAlloc(n2, sizeof(d_A[0]), (void **)&d_A);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device memory allocation error (A)\n");
    return EXIT_FAILURE;
}
status = cublasAlloc(n2, sizeof(d_B[0]), (void **)&d_B);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device memory allocation error (B)\n");
    return EXIT_FAILURE;
}
status = cublasAlloc(n2, sizeof(d_C[0]), (void **)&d_C);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device memory allocation error (C)\n");
    return EXIT_FAILURE;
}

/* Initialize the device matrices with the host matrices */
status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write A)\n");
    return EXIT_FAILURE;
}
status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write B)\n");
    return EXIT_FAILURE;
}
status = cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write C)\n");
    return EXIT_FAILURE;
}
Multiply and Return Values
cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
status = cublasGetError();
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! kernel execution error.\n");
    return EXIT_FAILURE;
}

/* Allocate host memory for reading back the result from device memory */
h_C = (float *)malloc(n2 * sizeof(h_C[0]));
if (h_C == 0) {
    fprintf(stderr, "!!!! host memory allocation error (C)\n");
    return EXIT_FAILURE;
}

/* Read the result back */
status = cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    return EXIT_FAILURE;
}
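The slides stop here, but the last step from the earlier list, cleaning up GPU memory, would look roughly like this with the same legacy CUBLAS API (a sketch, not shown in the deck):

/* Clean up: free device memory, free host memory, shut down CUBLAS */
cublasFree(d_A);
cublasFree(d_B);
cublasFree(d_C);
free(h_A);
free(h_B);
free(h_C);
status = cublasShutdown();
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! shutdown error\n");
    return EXIT_FAILURE;
}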
Speedup: Timing Logs
sgemm test running..
Standard C version took:   35785 us
CUBLAS SGEMM version took:   628 us
PASSED
Press ENTER to exit...
Using a GTS 450 (192 cores)
Matrix sizes of 256 were used
Why isn't this 192× faster?
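(35785/628 is roughly a 57× speedup; host-device transfer overhead, memory bandwidth, and clock-speed differences keep it well short of scaling linearly with core count.) The deck does not show its timing code; one common way to take such measurements is with CUDA events (a sketch reusing the cublasSgemm call from earlier, not the deck's code):

/* Time a GPU operation with CUDA events. */
cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop); /* wait for the GPU to finish */
cudaEventElapsedTime(&ms, start, stop); /* elapsed time in milliseconds */
cudaEventDestroy(start);
cudaEventDestroy(stop);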
Energy Efficient Computing
What does this mean?
Typically measured in Performance/Watt
How do systems do this today?
Embedded systems ("Go go go... sleep")
Typical PC ("Uh... we don't need that core right now")
Supercomputer ("Umm... I don't care! More data, please!")
How can we improve this?
Power Saving Techniques
Power Gating
Turn off sections that aren't being used
Lower Supply Voltage
Usually requires slowing down the processor
Power is related to voltage squared (a worked example follows this list):
P = C · V² · f
Clock Gating
Works in synchronous systems
Disables portions so less of the circuit switches states
Can lose speed and performance, and induce errors (less circuitry to check)
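As a worked example of the voltage-squared relationship (numbers are illustrative, not from the deck): with dynamic power P = C · V² · f, dropping the supply from 1.0 V to 0.8 V at fixed C and f scales power by (0.8/1.0)² = 0.64, a 36% saving, and the frequency reduction the lower voltage usually forces saves more still.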
How can we save power without losing performance?
We believe this should be done by optimizing the systemfor the job it is performing at any given time.
Optimize Hardware settings for Algorithms
We are looking into optimizing hardware settings based on thealgorithms being run. These include:
Memory Bandwidth
Main Clock Frequency
Supply Voltage
Number of Cores
Whatever else we can get our hands on...
A good example is computation-limited vs. memory-bandwidth-limited algorithms
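As an illustrative C sketch of the two regimes (not from the deck): the first loop does one add per three memory accesses and is limited by memory bandwidth; the second reuses each loaded value many times and is limited by compute.

/* Memory-bandwidth-limited: ~one arithmetic op per three memory accesses. */
void stream_add(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Compute-limited: 64 multiply-adds per element loaded. */
void poly_eval(int n, const float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        float v = x[i], acc = 0.0f;
        for (int k = 0; k < 64; k++)
            acc = acc * v + 1.0f;
        y[i] = acc;
    }
}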
Why a GPU?
Allows for easy simulation of SIMD instructions
Mass parallel execution (the NVIDIA GTS 450 has over 190 cores)
Overclocking and voltage scalable
Supply easily interrupted for power measurements
CUDA environment and libraries
Good relationship with NVIDIA
Algorithms We are Looking At
Dense matrix multiply
Brute-force matrix multiply
Memory intensive
Computationally simple for a highly parallel architecture
Sparse matrix multiply
Matrices typically with lots of zeros
Typically can be compressed (see the CSR sketch below)
Less memory intensive
Computationally more complicated
These should allow for fairly straightforward comparisons between two algorithms that are known to have these limitations.
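For concreteness, a common compressed form is compressed sparse row (CSR); a minimal C sketch of a sparse matrix-vector product shows how it skips the zeros (illustrative, not the deck's code):

/* y = A * x with A stored in CSR form: row_ptr has nrows+1 entries,
   col_idx and vals hold only the nonzero elements. */
void csr_spmv(int nrows, const int *row_ptr, const int *col_idx,
              const float *vals, const float *x, float *y)
{
    for (int r = 0; r < nrows; r++) {
        float sum = 0.0f;
        for (int j = row_ptr[r]; j < row_ptr[r + 1]; j++) /* nonzeros in row r */
            sum += vals[j] * x[col_idx[j]];
        y[r] = sum;
    }
}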
What is Synctium?
A SIMD processor meant for testing sub-threshold operation
Very basic ALU with just add and multiply
It shows performance/watt gains in the sub-threshold region
Conclusion/Recap
1 Introduction to CUDA
2 Hardware
3 Software
4 Research
5 Synctium
Questions
Questions?