Image Processing using CUDA

Anders Eklund, PhD

Virginia Tech Carilion Research Institute

andek@vtc.vt.edu

Outline

• Storing an image in memory

• 2D Convolution

• Interpolation

• Calculating a similarity measure between two images

• Image registration

Storing an image

• How is an image stored in memory?

• There are at least two possibilities

• Row major order, column major order

Storing an image

• Row major order (C programming)

• A = [1 2 3; 4 5 6], i.e. the rows [1 2 3] and [4 5 6]

• Values are stored in memory as 1, 2, 3, 4, 5, 6

• Pixel at location (x,y) is accessed as x + y * WIDTH where WIDTH is the number of columns

Storing an image

• Column major order (Matlab)

• A = [1 2 3; 4 5 6], i.e. the rows [1 2 3] and [4 5 6]

• Values are stored in memory as 1, 4, 2, 5, 3, 6

• Pixel at location (x,y) is accessed as y + x * HEIGHT where HEIGHT is the number of rows
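The two index formulas can be checked on the small 2 x 3 example with a plain C++ sketch. The helper names rowMajorIndex, colMajorIndex and flatten are made up for this illustration and are not part of the lab code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear index of pixel (x, y) in a row-major image with `width` columns.
inline std::size_t rowMajorIndex(std::size_t x, std::size_t y, std::size_t width) {
    return x + y * width;
}

// Linear index of pixel (x, y) in a column-major image with `height` rows.
inline std::size_t colMajorIndex(std::size_t x, std::size_t y, std::size_t height) {
    return y + x * height;
}

// Flatten a small 2D array (a[y][x]) into a 1D buffer in the requested order.
std::vector<int> flatten(const std::vector<std::vector<int>>& a, bool rowMajor) {
    const std::size_t height = a.size();
    assert(height > 0);
    const std::size_t width = a[0].size();
    std::vector<int> out(width * height);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t idx = rowMajor ? rowMajorIndex(x, y, width)
                                             : colMajorIndex(x, y, height);
            out[idx] = a[y][x];
        }
    }
    return out;
}
```

For A = [1 2 3; 4 5 6], flatten(A, true) gives the row-major layout and flatten(A, false) gives the column-major layout listed on the slides.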

Storing an image

• Why is this important?

• When reading/writing data from/to global memory it is important to use coalesced reads and writes, for optimal performance

• Coalesced operation = consecutive threads (within a warp) read/write consecutive memory locations

• Use the Nvidia profiler to check for uncoalesced memory operations

Storing an image

• Assume the image is stored in row major order

• We use 2D thread blocks, 64 along x, 2 along y

• int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;

• int idx = y + x * HEIGHT (wrong)

• Image[idx] = 3.0f; Uncoalesced writes

• int idx = x + y * WIDTH (correct)

• Image[idx] = 3.0f; Coalesced writes

Storing an image

• int idx = y + x * HEIGHT (wrong)

• Image[idx] = 3.0f; Uncoalesced writes

• Indices are not consecutive

• threadIdx.y = 0

• idx = 0, HEIGHT, 2*HEIGHT, 3*HEIGHT, 4*HEIGHT, …

• threadIdx.y = 1

• idx = 1, 1+HEIGHT, 1+2*HEIGHT, 1+3*HEIGHT, 1+4*HEIGHT, …

Storing an image

• int idx = x + y * WIDTH (correct)

• Image[idx] = 3.0f; Coalesced writes

• Indices are consecutive

• threadIdx.y = 0

• idx = 0, 1, 2, 3, 4, …

• threadIdx.y = 1

• idx = WIDTH, 1+WIDTH, 2+WIDTH, 3+WIDTH, …
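The two index sequences can be reproduced on the CPU. The sketch below is plain C++, not CUDA: it lists the linear indices that neighbouring threads along x in block (0,0) would touch. The function names are hypothetical, and any image size passed in is chosen by the caller for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear indices touched by threads threadIdx.x = 0..count-1 with a fixed
// threadIdx.y = ty, for block (0,0), under either index formula.
std::vector<int> indicesForRow(int ty, int count, int width, int height, bool rowMajor) {
    assert(count >= 0);
    std::vector<int> idx;
    for (int tx = 0; tx < count; ++tx) {
        const int x = tx;
        const int y = ty;
        idx.push_back(rowMajor ? x + y * width    // idx = x + y * WIDTH
                               : y + x * height); // idx = y + x * HEIGHT
    }
    return idx;
}

// True when each index is exactly one larger than the previous one,
// which is the access pattern that can be coalesced.
bool isConsecutive(const std::vector<int>& idx) {
    for (std::size_t i = 1; i < idx.size(); ++i) {
        if (idx[i] != idx[i - 1] + 1) return false;
    }
    return true;
}
```

With an assumed 512 x 256 image, the row-major formula yields 0, 1, 2, 3, ... for threadIdx.y = 0, while the column-major formula yields a stride of HEIGHT between neighbouring threads.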

Multiplying two images

• __global__ void Multiply(float* Result, const float* Image1, const float* Image2, int DATA_W, int DATA_H)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if ((x >= DATA_W) || (y >= DATA_H))
          return;
      int idx = x + y * DATA_W;
      Result[idx] = Image1[idx] * Image2[idx];
  }

Multiplying two images

• The kernel is completely bound by the memory bandwidth: each thread performs two reads and one write for a single multiplication

• Uncoalesced memory operations make a big difference

• (In this specific kernel we could have used 1D thread blocks)
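The bounds check in the kernel is needed because the grid is normally rounded up to a whole number of blocks. A minimal host-side sketch in plain C++ shows that calculation. Dim2, divUp, gridFor and idleThreads are names invented here; in real CUDA code the dim3 type would be used for the block and grid sizes.

```cpp
#include <cassert>

// Stand-in for the x/y part of CUDA's dim3 (hypothetical helper type).
struct Dim2 { int x; int y; };

// Smallest number of blocks of size `block` that covers `n` elements.
inline int divUp(int n, int block) {
    assert(block > 0);
    return (n + block - 1) / block;
}

// Grid size needed to cover a dataW x dataH image with the given block size.
Dim2 gridFor(int dataW, int dataH, Dim2 block) {
    return { divUp(dataW, block.x), divUp(dataH, block.y) };
}

// Number of launched threads that fall outside the image and must return early.
long long idleThreads(int dataW, int dataH, Dim2 block) {
    const Dim2 grid = gridFor(dataW, dataH, block);
    const long long launched =
        static_cast<long long>(grid.x) * block.x * grid.y * block.y;
    return launched - static_cast<long long>(dataW) * dataH;
}
```

For an assumed 500 x 333 image and the 64 x 2 blocks used earlier, the grid is 8 x 167 blocks, so some threads land outside the image and hit the early return.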

Image processing with C/C++

• We will use the CImg library to read and write images using C++ objects

• The CImg library is open source and consists of a single header file (CImg.h)

• Works on Windows, Linux, Mac

• cimg.sourceforge.net

Convolution

• Convolution = scalar product between filter values and pixel values in each neighbourhood

• Slide the filter over all pixels, save each result in the center pixel

• Note the minus signs (means that the filter is rotated 180 degrees)!
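The equation this slide refers to did not survive extraction. A standard definition of 2D discrete convolution that matches the bullets above is given here as a reconstruction; the symbols s (image) and f (filter) are assumed names, and the sums run over all filter coordinates, taken relative to the filter center.

```latex
(s * f)(x, y) = \sum_{f_y} \sum_{f_x} s(x - f_x,\; y - f_y)\, f(f_x, f_y)
```

The minus signs in the image coordinates are what flips the filter in both directions, i.e. rotates it 180 degrees, compared to plain correlation.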

Task 1

• Open imageprocessing_convolution.cu and imageprocessing_kernel.cu

• Complete the code for the kernel Convolution_2D_Texture

• The code reads an image from file, sends it to the GPU, copies back the filter response, writes the filter response to a new image

• The code then compares your result to convolution with CImg

Constant memory

• Constant memory is normally 64 KB

• Each multiprocessor on the GPU has a constant memory cache (8 KB)

• Put the filter kernel in constant memory

• __device__ __constant__ float c_Filter_2D[11][11];

• The filter will be in the cache during the whole execution, saves reads from global memory

Texture memory

• Texture memory is cached for spatially local reads

• Hardware support for reading outside the image

• Read the value at position (x,y), value = tex2D(tex_Image, x + 0.5f, y + 0.5f);

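• Note the addition of 0.5f! With unnormalized coordinates the pixel centers lie at half-integer positions, so (x + 0.5f, y + 0.5f) addresses the center of pixel (x,y)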

Compiling the code

• See the top of each file for how to compile the code

Checking results

• Compare the images filteredImageCUDA.bmp and filteredImageCImg.bmp, difference is given in difference.bmp

• Copy images from your account to your own computer, to be able to see the images

• scp aeklund@10.73.25.207:/home/aeklund/GPULab/*.bmp .

• display filteredImageCUDA.bmp &

Checking results

• For the convolution, the maximum error compared to CImg should be approximately 0.000015

• The total error should be approximately 0.09

Convolution – First part

• __global__ void Convolution_2D_Texture(float* Result, int DATA_W, int DATA_H, int FILTER_W, int FILTER_H)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if ((x >= DATA_W) || (y >= DATA_H))
          return;
      float sum = 0.0f;

Convolution – Second part

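•     // Offsets start at minus the filter radius; the extra 0.5f addresses pixel centers
      float yoffset = -((float)FILTER_H - 1.0f) / 2.0f + 0.5f;
      for (int fy = FILTER_H - 1; fy >= 0; fy--)
      {
          // xoffset must be a float, an int would truncate the half-pixel offset
          float xoffset = -((float)FILTER_W - 1.0f) / 2.0f + 0.5f;
          for (int fx = FILTER_W - 1; fx >= 0; fx--)
          {
              sum += tex2D(tex_Image, x + xoffset, y + yoffset) * c_Filter_2D[fy][fx];
              xoffset += 1.0f;
          }
          yoffset += 1.0f;
      }

      int idx = x + y * DATA_W;
      Result[idx] = sum;
  }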
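For checking the kernel without CImg, a CPU reference can follow the same loop structure. The sketch below is plain C++ and makes two assumptions not stated on the slides: the filter has odd width and height, and reads outside the image return zero (the texture in the lab may instead be set up to clamp to the border). The names convolve2D and pixelOrZero are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Read pixel (x, y) from a row-major image; zero outside the image (assumption).
static float pixelOrZero(const std::vector<float>& img, int w, int h, int x, int y) {
    if (x < 0 || x >= w || y < 0 || y >= h) return 0.0f;
    return img[static_cast<std::size_t>(x + y * w)];
}

// Reference 2D convolution. Image and filter are row-major; filter size is odd.
std::vector<float> convolve2D(const std::vector<float>& image, int w, int h,
                              const std::vector<float>& filter, int fw, int fh) {
    assert(fw % 2 == 1 && fh % 2 == 1);
    assert(static_cast<int>(image.size()) == w * h);
    assert(static_cast<int>(filter.size()) == fw * fh);
    std::vector<float> result(image.size(), 0.0f);
    const int rx = (fw - 1) / 2;  // filter radius along x
    const int ry = (fh - 1) / 2;  // filter radius along y
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            // Filter indices run backwards while the pixel offsets run forwards:
            // this is the 180 degree rotation of the filter.
            int yoff = -ry;
            for (int fy = fh - 1; fy >= 0; --fy, ++yoff) {
                int xoff = -rx;
                for (int fx = fw - 1; fx >= 0; --fx, ++xoff) {
                    sum += pixelOrZero(image, w, h, x + xoff, y + yoff) *
                           filter[static_cast<std::size_t>(fx + fy * fw)];
                }
            }
            result[static_cast<std::size_t>(x + y * w)] = sum;
        }
    }
    return result;
}
```

A quick sanity check is to convolve an image containing a single 1 at its center: the result should be a copy of the filter, not a flipped copy.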

Texture memory

• The texture memory has hardware support for linear interpolation

• So far we have only used the texture memory for fast reading from global memory (using the texture cache)

• Let's use the texture memory for interpolation
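To see what linear interpolation does with the 0.5f offset, the 1D case can be emulated on the CPU. This is a sketch in plain C++ based on the filtering rule described in the CUDA programming guide (subtract 0.5, split into integer and fractional part, blend the two neighbours). It assumes unnormalized coordinates and clamping at the borders, and it uses full float precision for the weight, whereas the hardware uses a lower-precision fixed-point weight. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Fetch element i, clamping i to the valid range (assumed address mode).
static float fetchClamped(const std::vector<float>& t, int i) {
    if (i < 0) i = 0;
    if (i >= static_cast<int>(t.size())) i = static_cast<int>(t.size()) - 1;
    return t[static_cast<std::size_t>(i)];
}

// Emulates 1D linear texture filtering at unnormalized coordinate x.
float sampleLinear1D(const std::vector<float>& t, float x) {
    assert(!t.empty());
    const float xb = x - 0.5f;          // shift so that pixel centers become integers
    const float base = std::floor(xb);  // index of the left neighbour
    const float alpha = xb - base;      // weight of the right neighbour, in [0, 1)
    const int i = static_cast<int>(base);
    return (1.0f - alpha) * fetchClamped(t, i) + alpha * fetchClamped(t, i + 1);
}
```

Sampling at 2 + 0.5f returns element 2 exactly, while sampling at 2.0f returns the average of elements 1 and 2, which is why the kernels add 0.5f to integer pixel coordinates.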

Rotating an image

• Use the rotation matrix

\[
R = \begin{pmatrix} \cos(\text{angle}) & -\sin(\text{angle}) \\ \sin(\text{angle}) & \cos(\text{angle}) \end{pmatrix}
\]

to transform each (x,y) coordinate, read from the new coordinate using texture memory

Rotating an image

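\[
\begin{pmatrix} x_{new} \\ y_{new} \end{pmatrix} =
\begin{pmatrix} \cos(\text{angle}) & -\sin(\text{angle}) \\ \sin(\text{angle}) & \cos(\text{angle}) \end{pmatrix}
\begin{pmatrix} x_{old} \\ y_{old} \end{pmatrix}
\]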

Rotating an image

• cos and sin in CUDA use double precision

• cosf and sinf use single precision

• All functions use radians and not degrees

Task 2

• Open imageprocessing_transformation.cu and imageprocessing_kernel.cu

• Complete the code for the kernel RotateImage, which rotates an image using texture memory for interpolation

• Extra task: rotate the image around the center of the image, instead of around the corner

Checking results

• Compare the images transformedImageCUDA.bmp and transformedImageCImg.bmp, difference is given in difference.bmp

Checking results

• For the rotation, the maximum error compared to CImg should be approximately 5.23

• The total error should be approximately 12331.2

• Interpolation in CImg is probably performed slightly differently

Rotating an image

• First part of the transformation kernel is the same as for the convolution kernel

• float xf = (float)x; float yf = (float)y;

• float angler = angled/180.0f*pi;

• float xnew = cosf(angler) * xf - sinf(angler) * yf;

• float ynew = sinf(angler) * xf + cosf(angler) * yf;

• value = tex2D(tex_Image, xnew + 0.5f, ynew + 0.5f);

• TransformedImage[idx] = value;

Rotating an image around its center

• float xf = (float)x - (float)IMAGE_WIDTH/2.0f;
  float yf = (float)y - (float)IMAGE_HEIGHT/2.0f;

• float angler = angled/180.0f*pi;

• float xnew = cosf(angler) * xf - sinf(angler) * yf;

• float ynew = sinf(angler) * xf + cosf(angler) * yf;

• xnew += (float)IMAGE_WIDTH/2.0f;

• ynew += (float)IMAGE_HEIGHT/2.0f;

• value = tex2D(tex_Image, xnew + 0.5f, ynew + 0.5f);

• TransformedImage[idx] = value;
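The coordinate part of the kernel (shift to the center, rotate, shift back) can be tested on the CPU before texture reads are involved. The sketch below is plain C++; Point2f and rotateAbout are hypothetical names, and the center is passed in explicitly rather than taken from IMAGE_WIDTH and IMAGE_HEIGHT.

```cpp
#include <cassert>
#include <cmath>

// Simple pair of float coordinates (hypothetical helper type).
struct Point2f { float x; float y; };

// Rotate (x, y) by angleDeg degrees about the point (cx, cy).
Point2f rotateAbout(float x, float y, float angleDeg, float cx, float cy) {
    const float pi = 3.14159265358979f;
    const float a = angleDeg / 180.0f * pi;  // degrees to radians
    const float xf = x - cx;                 // move the rotation center to the origin
    const float yf = y - cy;
    return { std::cos(a) * xf - std::sin(a) * yf + cx,   // rotate, then move back
             std::sin(a) * xf + std::cos(a) * yf + cy };
}
```

Passing cx = cy = 0 reproduces the rotation around the corner from the previous slide. A rotation by 0 degrees should leave every point unchanged, which makes a convenient first check.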

Image registration

• Image registration relies on the concept of optimizing a similarity measure

• Find the translations and rotations that maximize the similarity between two images

• Normalized cross correlation (NCC) is one of the most common similarity measures

Normalized cross correlation (NCC)

• Correlation between variables x and y

Normalized cross correlation (NCC)

• NCC can be calculated using vectors x and y as

• Only three scalar products are needed, between x and y, between x and x and between y and y (remove the mean of x and y first)
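The formulas on the two NCC slides were lost in extraction. A reconstruction consistent with the three scalar products mentioned above and with the correlation code on the later CUBLAS slides is sketched here. The first expression is the usual correlation between two variables, the second is the vector form for mean-removed vectors x and y.

```latex
\rho(x, y) = \frac{\operatorname{Cov}(x, y)}{\sqrt{\operatorname{Var}(x)\,\operatorname{Var}(y)}}
\qquad
\mathrm{NCC}(\mathbf{x}, \mathbf{y}) =
\frac{\mathbf{x}^{T}\mathbf{y}}{\sqrt{(\mathbf{x}^{T}\mathbf{x})\,(\mathbf{y}^{T}\mathbf{y})}}
```

Once the means are removed, the normalization factors of the covariance and the variances cancel, which is why only the three scalar products remain.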

Representing an image as a long vector

CUBLAS

• CUBLAS has many functions for matrix algebra

• Matrix-matrix multiplications, matrix-vector multiplications, vector-vector multiplications

• The function cublasSdot can be used to calculate the scalar product between two vectors with float values

• Look in the CUDA documentation to see how it works

Task 3

• Open imageprocessing_similarity.cu

• Complete the code for how to calculate the correlation between two images, using the CUBLAS library

• Each image is treated as a vector of length IMAGE_WIDTH * IMAGE_HEIGHT

• The mean values have already been removed

• You have to allocate memory on the GPU and copy data to the GPU

• Your correlation value is compared to a correlation calculated using regular C code

Calculating correlation using CUBLAS

• #include <cublas_v2.h>

• First create a handle to CUBLAS, already provided in the code

• cublasStatus_t status;

• cublasHandle_t handle;

• status = cublasCreate(&handle);

• cublasStatus_t cublasSdot (cublasHandle_t handle, int n, const float *x, int incx, const float *y, int incy, float *result)

• n is the length of the vectors, in our case IMAGE_WIDTH * IMAGE_HEIGHT

• x is the pointer to the first image, d_Image1

• y is the pointer to the second image, d_Image2

• incx and incy are simply the distances between consecutive elements in the vectors, in our case 1

• result is the pointer to the calculated scalar product

Calculating correlation using CUBLAS

• float productAB, productAA, productBB;

• status = cublasSdot(handle, IMAGE_WIDTH * IMAGE_HEIGHT, d_Image1, 1, d_Image2, 1, &productAB);

• status = cublasSdot(handle, IMAGE_WIDTH * IMAGE_HEIGHT, d_Image1, 1, d_Image1, 1, &productAA);

• status = cublasSdot(handle, IMAGE_WIDTH * IMAGE_HEIGHT, d_Image2, 1, d_Image2, 1, &productBB);

• float correlationGPU = productAB / (sqrt(productAA * productBB));
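A CPU reference for the same calculation is useful for validating the CUBLAS result. The following is a sketch in plain C++ and not the lab's own reference code: dot, removeMean and correlationCPU are hypothetical names, and the sums are accumulated in double precision, so small differences from the GPU value are to be expected.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar product of two equally long vectors, accumulated in double precision.
static double dot(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += static_cast<double>(a[i]) * static_cast<double>(b[i]);
    }
    return sum;
}

// Subtract the mean value from every element, in place.
void removeMean(std::vector<float>& v) {
    if (v.empty()) return;
    double mean = 0.0;
    for (float value : v) mean += value;
    mean /= static_cast<double>(v.size());
    for (float& value : v) value = static_cast<float>(value - mean);
}

// Normalized cross correlation of two mean-removed images stored as vectors.
// The result is undefined (division by zero) if either image is constant.
float correlationCPU(const std::vector<float>& a, const std::vector<float>& b) {
    const double ab = dot(a, b);
    const double aa = dot(a, a);
    const double bb = dot(b, b);
    return static_cast<float>(ab / std::sqrt(aa * bb));
}
```

Two images that differ only by a positive scale factor should give a correlation of 1, and an image compared with its negated copy should give -1.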

Image registration

• We now have the two most important building blocks for image registration

• Calculation of a similarity measure

• Applying a transformation to an image

Image registration

• Let's combine these two functions to perform a very simple form of image registration

• Register two images by finding the optimal rotation to apply to one of the images

Task 4

• Open imageprocessing_registration.cu and complete the code to perform a registration between the two images

• Make a for loop that in each iteration applies a rotation and calculates the correlation between the fixed image and the rotated image

• Print the rotation that gives the highest correlation

• Do not copy data to/from the GPU in each iteration!
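The search loop itself is independent of the GPU code and can be sketched in plain C++. In this illustration the callback correlationAt is a hypothetical stand-in for "rotate the moving image by this angle on the GPU and compute the correlation with the fixed image"; in the lab code that step would work on data already resident on the GPU. The search range and step size are assumptions chosen by the caller.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Result of the search: the best angle and the correlation it produced.
struct BestRotation { float angleDeg; float correlation; };

// Exhaustive search over angles minDeg, minDeg + stepDeg, ..., maxDeg.
BestRotation findBestRotation(float minDeg, float maxDeg, float stepDeg,
                              const std::function<float(float)>& correlationAt) {
    assert(stepDeg > 0.0f && minDeg <= maxDeg);
    BestRotation best{minDeg, correlationAt(minDeg)};
    // Integer loop counter avoids accumulating floating point error in the angle.
    const int steps = static_cast<int>((maxDeg - minDeg) / stepDeg + 0.5f);
    for (int i = 1; i <= steps; ++i) {
        const float angle = minDeg + static_cast<float>(i) * stepDeg;
        const float c = correlationAt(angle);
        if (c > best.correlation) {
            best = BestRotation{angle, c};
        }
    }
    return best;
}
```

Only the scalar correlation crosses from the GPU to the CPU in each iteration, which is in line with the advice not to copy image data inside the loop.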

Checking results

• The best rotation should be -30 degrees, giving a correlation of 0.861115
