© NVIDIA Corporation 2008
Outline
Why Open64
How we use Open64
What we did to Open64
Future work in Open64
Compiling CUDA for GPUs
C/C++ CUDA Application -> NVCC -> CPU Code + GPU Code -> executable
Why Open64
We had a low-level code generator for graphics codes, but for CUDA needed high-level optimization for C/C++ codes.
Options considered:
own: take too long
gcc: good long-term support
open64: best performance (kudos to PathScale)
NVCC processing of GPU code
cudafe -> C code for GPU -> nvopencc (Open64) -> ptx -> OCG -> object code
Changes: Rehosting Open64
Our compiler has to run on 32- and 64-bit Linux, 32- and 64-bit Windows, and Mac OS.
Main Open64 source tree is only for Linux. This is an area where sharing our changes can help grow the user base by making it easier to port Open64.
For Windows we build using Cygwin's MinGW.
Changes: Memory and registers
We don’t have a stack or fast memory
Therefore want to keep data in registers
Inline everything and optimize as much as possible
Try to keep small structs in registers by expanding struct copies into field copies (versus taking address and generating loop to do byte copy)
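The difference between the two copy strategies can be sketched in plain C++. This is only an illustration of the idea, not nvopencc's actual transformation: the struct `Vec3` and both function names are hypothetical. The byte-copy version takes the address of both structs, which normally forces them into addressable memory; the field-copy version uses only scalar moves, so each field can stay in a register.

```cpp
#include <cstddef>

// Hypothetical small struct, used only for illustration.
struct Vec3 {
    float x, y, z;
};

// Address-taken form: the copy is a loop over bytes, so both structs
// must have addresses and would normally live in memory.
inline float sum_after_byte_copy(float x, float y, float z) {
    Vec3 a = {x, y, z};
    Vec3 b;
    unsigned char* d = reinterpret_cast<unsigned char*>(&b);
    const unsigned char* s = reinterpret_cast<const unsigned char*>(&a);
    for (std::size_t i = 0; i < sizeof(Vec3); ++i) {
        d[i] = s[i];
    }
    return b.x + b.y + b.z;
}

// Expanded form: one scalar move per field. No address is taken, so a
// compiler is free to keep every field in its own register.
inline float sum_after_field_copy(float x, float y, float z) {
    Vec3 a = {x, y, z};
    Vec3 b;
    b.x = a.x;
    b.y = a.y;
    b.z = a.z;
    return b.x + b.y + b.z;
}
```

Both functions compute the same result; only the form of the copy differs, and the second form is the one that suits a target without a stack or fast memory.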
Changes: Vector loads and stores
Coalesce adjacent loads and stores for performance
Do this in CG:
Iterate through ops, trying to add to vectors
Check for intervening kills
Change alignment and use dummy regs for padding if that helps to create a wider vector (e.g. may use a 4-word vector for a 3-word struct).
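The steps above can be sketched on a toy instruction list. Everything here is an assumption made for illustration: the `Op` and `VecLoad` types, the function `coalesce_loads`, the 4-word maximum width, and the rule that any store counts as a kill are not taken from Open64's CG. The sketch handles loads only and assumes non-negative word offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Kind { Load, Store };

// Hypothetical, simplified memory op: one 32-bit word at base + offset.
struct Op {
    Kind kind;
    int base;    // id of the base address register
    int offset;  // offset from base, in words (assumed >= 0)
};

// A run of adjacent word loads that could be issued as one vector load.
struct VecLoad {
    int base;
    int offset;   // offset of the first word
    int width;    // number of words, including padding
    int padding;  // trailing dummy words added to widen the vector
};

const int kMaxWidth = 4;  // assumed widest vector, in words

std::vector<VecLoad> coalesce_loads(const std::vector<Op>& ops) {
    std::vector<VecLoad> out;
    std::vector<bool> used(ops.size(), false);
    for (std::size_t i = 0; i < ops.size(); ++i) {
        if (used[i] || ops[i].kind != Kind::Load) continue;
        assert(ops[i].offset >= 0);
        VecLoad v{ops[i].base, ops[i].offset, 1, 0};
        used[i] = true;
        // Scan forward, trying to add the next adjacent word to the vector.
        for (std::size_t j = i + 1; j < ops.size() && v.width < kMaxWidth; ++j) {
            // Conservative kill check: any intervening store ends the search.
            if (ops[j].kind == Kind::Store) break;
            if (!used[j] && ops[j].base == v.base &&
                ops[j].offset == v.offset + v.width) {
                used[j] = true;
                ++v.width;
            }
        }
        // Pad a 3-word group to 4 words when the start is 4-word aligned.
        if (v.width == 3 && v.offset % kMaxWidth == 0) {
            v.width = 4;
            v.padding = 1;
        }
        out.push_back(v);
    }
    return out;
}
```

A real pass would also have to treat redefinition of the base register as a kill and use alias information instead of stopping at every store; the sketch leaves both out to keep the control flow visible.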
Changes: 16bit optimization
Cheaper to use 16bit registers and operations
But C converts shorts to int.
So add pass in CG that converts back to 16bit:
Mark 16bit loads, stores, and converts
Propagate 16bit-ness forwards and backwards
Unmark 16bit-ness if cannot be 16bit
Change remaining registers and instructions to be 16bit.
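A minimal sketch of the mark, propagate, and unmark steps, under stated assumptions: the instruction set (`Load16`, `Store16`, `Add`, `Use32`), the `Inst` type, and the function `find_16bit_regs` are hypothetical and much simpler than a real CG pass. Here `Use32` stands for any use that needs the full 32-bit value, and unmarking is deliberately conservative: an `Add` with one 32-bit register makes all of its registers 32-bit.

```cpp
#include <cassert>
#include <vector>

enum class OpKind { Load16, Store16, Add, Use32 };

// Hypothetical instruction: unused register fields are set to -1.
struct Inst {
    OpKind kind;
    int dst;   // register defined (Load16, Add)
    int src1;  // register used (Store16, Add, Use32)
    int src2;  // second register used (Add)
};

// Spread `flag` across the operands of every Add until nothing changes.
static void spread_through_adds(const std::vector<Inst>& insts,
                                std::vector<bool>& flag) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (const Inst& in : insts) {
            if (in.kind != OpKind::Add) continue;
            bool any = flag[in.dst] || flag[in.src1] || flag[in.src2];
            bool all = flag[in.dst] && flag[in.src1] && flag[in.src2];
            if (any && !all) {
                flag[in.dst] = flag[in.src1] = flag[in.src2] = true;
                changed = true;
            }
        }
    }
}

// Returns, per register, whether it may be changed to 16 bits.
std::vector<bool> find_16bit_regs(const std::vector<Inst>& insts, int num_regs) {
    assert(num_regs >= 0);
    std::vector<bool> is16(num_regs, false);
    // 1. Mark registers touched by 16-bit loads and stores.
    for (const Inst& in : insts) {
        if (in.kind == OpKind::Load16) is16[in.dst] = true;
        if (in.kind == OpKind::Store16) is16[in.src1] = true;
    }
    // 2. Propagate 16-bit-ness forwards and backwards through Adds.
    spread_through_adds(insts, is16);
    // 3. Unmark registers that cannot be 16-bit.
    std::vector<bool> need32(num_regs, false);
    for (const Inst& in : insts) {
        if (in.kind == OpKind::Use32) need32[in.src1] = true;
    }
    spread_through_adds(insts, need32);
    for (int r = 0; r < num_regs; ++r) {
        if (need32[r]) is16[r] = false;
    }
    // 4. Whatever is still marked can be rewritten as 16-bit.
    return is16;
}
```

For example, with `r2 = r0 + r1` fed by two 16-bit loads and consumed by a 16-bit store, all three registers stay marked, while a chain that ends in a `Use32` is unmarked as a whole.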
Future work
1 person -> 4 people working with Open64
New application TBA
Merging changes into trunk
Thanks to Sun Chan and Shin!
Investigating register pressure in WOPT
Want better control of register pressure during optimization
Investigating using other features (LNO, IPA, etc)