40
Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design, University of Erlangen-Nuremberg July 21, 2016

Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Fast Bilateral Filter GPU implementation

Multi-Core Architectures and Programming

Gerhard Mlady, Rafael Bernardelli

Hardware/Software Co-Design, University of Erlangen-Nuremberg

July 21, 2016

Page 2: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Overview

• Fast Bilateral Filter

• Implementation

• Benchmark

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 2

Page 3: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Introduction to Bilateral filter

First lets take a look on 2d convolution

𝑓 𝑥 =

𝑥′∈𝒩(𝑛)

𝑔 𝑥′ 𝑘(𝑥 − 𝑥′)

𝑓 𝑥 = 𝑔 𝑥 ∗ 𝑘(𝑥)

𝑓 𝑥 = ℱ−1{ℱ 𝑔 𝑥 ℱ 𝑘 𝑥 }

Gaussian kernel

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 3

Page 4: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

(a) Original image

(b) Gaussian filtered image

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 4

Page 5: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

(a) Original image

(b) Gaussian filtered image

(c) Bilateral filtered image*

(d) Bilateral filtered image**

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 5

Page 6: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

(a) Original image

(b) Bilateral kernel

(c) Result

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 6

Page 7: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Bilateral filter equations

𝑓 𝒙 = 𝑘−1(𝒙)

𝑥′∈𝒩

𝑔 𝒙′ 𝑐 𝒙, 𝒙′ 𝑠(𝑔 𝒙 , 𝑔(𝒙′))

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 7

𝑓 𝒙 ∶ Output image

𝑔 𝒙 ∶ Input image

Page 8: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Bilateral filter equations

𝑐 𝒙, 𝒙′ = 𝑒−𝒙−𝒙′ 2

2

2𝜎𝑑2

𝑠 𝑔(𝒙), 𝑔(𝒙′) = 𝑒−𝑔 𝒙 −𝑔 𝒙′

2

2𝜎𝑟2

𝑘 𝒙 =

𝒙′∈𝒩

𝑐 𝒙, 𝒙′ 𝑠(𝑔 𝒙 , 𝑔(𝒙′))

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 8

Spacial similarity function

Range similarity function

Normalizing factor

𝑓 𝒙 = 𝑘−1(𝒙)

𝑥′∈𝒩

𝑔 𝒙′ 𝑐 𝒙, 𝒙′ 𝑠(𝑔 𝒙 , 𝑔(𝒙′))

Page 9: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 9

Page 10: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Fast Bilateral Filter

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 10

Page 11: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Fast Bilateral Filter

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 11

Page 12: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Fast Bilateral Filter

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 12

Page 13: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Overview

• Fast Bilateral Filter

• Implementation

• Benchmark

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 13

Page 14: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Implementation

Load imageinto GPU

Fill cubesPerform

separableconvolution

Slicing & nonlinearity

Copy imageback

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 14

Page 15: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Cube filling

𝑊𝐼 𝑊

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 15

Page 16: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Perform separable convolution

𝐼 ∗1 2 12 4 21 2 1

= 𝐼 ∗121∗ 1 2 1

𝑂(𝑁𝑀𝑘2) 𝑂(2 ∗ 𝑁𝑀𝑘)

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 16

𝑁,𝑀 ∶ Image dimensionsk : convolution kernel length

Page 17: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Perform Slicing & Nonlinearity

𝑊𝐼 𝑊

output

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 17

Page 18: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Texture fetching

21/07/2016 18Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation

Page 19: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Overview

• Fast Bilateral Filter

• Implementation

• Benchmark

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 19

Page 20: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark

for intensity kernel length, intensity and spatial scalling 11

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 20

Page 21: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark

for intensity kernel length, intensity and spatial scalling 11

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 21

Page 22: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark

for intensity kernel length, intensity and spatial scalling 11

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 22

Page 23: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark

for intensity kernel length, intensity and spatial scalling 11

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 23

Page 24: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark

for intensity kernel length, intensity and spatial scalling 11

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 24

Page 25: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark

for intensity kernel length, intensity and spatial scalling 11

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 25

Page 26: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark

for intensity kernel length, intensity and spatial scalling 11

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 26

Page 27: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark

for intensity kernel length, intensity and spatial scalling 11

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 27

Page 28: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark – Best method for filling

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 28

𝑊𝐼 𝑊

Page 29: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

__global__ void cubefilling_loop(const float* image, float *dev_cube_wi, float *dev_cube_w, const dim3 image_size, int scale_xy, int scale_eps,dim3 dimensions_down)

{

unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;

if (i < dimensions_down.x && j < dimensions_down.y) {

size_t cube_idx_1 = i + dimensions_down.x*j;

#pragma unroll

for (int ii = 0; ii < scale_xy; ii++)

{

#pragma unroll

for (int jj = 0; jj < scale_xy; jj++)

{

size_t i_idx = scale_xy*i + ii;

size_t j_idx = scale_xy*j + jj;

if (i_idx < image_size.x && j_idx < image_size.y)

{

float k = image[i_idx + image_size.x*j_idx];

size_t cube_idx_2 = cube_idx_1 +dimensions_down.x*dimensions_down.y*floorf(k / scale_eps);

dev_cube_wi[cube_idx_2] += k;

dev_cube_w[cube_idx_2] += 1.0f;

}

}

}

}

}

21/07/2016 29Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation

Page 30: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

__global__ void cubefilling_atomic(const float* image, float *dev_cube_wi, float *dev_cube_w, const dim3 image_size, int scale_xy, int

scale_eps, dim3 dimensions_down)

{

const size_t i = blockIdx.x * blockDim.x + threadIdx.x;

const size_t j = blockIdx.y * blockDim.y + threadIdx.y;

if (i < image_size.x && j < image_size.y) {

const float k = image[i + image_size.x*j];

const size_t cube_idx = (i / scale_xy) + dimensions_down.x*(j / scale_xy) +

dimensions_down.x*dimensions_down.y*((int)k / scale_eps);

atomicAdd(&dev_cube_wi[cube_idx], k);

atomicAdd(&dev_cube_w[cube_idx], 1.0f);

}

}

21/07/2016 30Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation

Page 31: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark – Best method for filling

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 31

Page 32: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark – Best method for filling

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 32

Page 33: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark – Best method for filling

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 33

Page 34: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark – Best method for filling

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 34

Page 35: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark

21/07/2016 35Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

51 102 153 204 256 307 358 409 460 512

Image size

Relative runtime on Telsa K20c

convolution cubefilling slicing

Page 36: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Benchmark - CPU

for intensity kernel length 21 and spatial k.l. 25

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 36

Page 37: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

Issues on the Implementation

• On windows, a register key must be set to allow the GPU

to run a kernel more than 2 seconds(HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers > TdrDelay=10)

• On (almost) every CUDA API call, retrieve cudaStatus

and check for errors. It doesn’t raise any exceptions.

• On linux, add some lines to .bashrcexport PKG_CONFIG_PATH=/scratch-local/usr/lib64/pkgconfig/

export PATH=$PATH:/opt/cuda/bin

export LD_LIBRARY_PATH=/scratch-local/usr/lib/:$(LD_LIBRARY_PATH)

21/07/2016 37Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation

Page 38: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

If you are interested on the code and more

plots…

• https://github.com/bernardelli/Bilateral-Filter-GPU

21/07/2016 38Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation

Page 39: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

21/07/2016 39Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation

Thank You For Your Attention!

Page 40: Fast Bilateral Filter GPU implementation · Fast Bilateral Filter GPU implementation Multi-Core Architectures and Programming Gerhard Mlady, Rafael Bernardelli Hardware/Software Co-Design,

References

• https://developer.apple.com/library/prerelease/content/documentation/

Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionO

perations.html

• http://homepages.inf.ed.ac.uk/rbf/HIPR2/gsmooth.htm

• C. Tomasi and R. Manduchi Bilateral Filtering for gray and color

images. In Proceedings of IEEE International Conference on

Computer Vision, pages 839-846

• http://people.csail.mit.edu/sparis/publi/2006/eccv/Paris_06_Fast_Appro

ximation.pdf

• http://docs.nvidia.com/cuda/cuda-c-programming-

guide/index.html#linear-filtering__linear-filtering-of-1-d-texture-of-4-

texels

21/07/2016Mlady, Bernardelli | Multi-Core Architectures and Programming | Fast Bilateral Filter GPU implementation 40