2014/07/17 Parallelize computer vision by GPGPU computing

Wang,Yuan-Kai(王元凱) Computer Vision Parallelization by GPGPU p.

Wang, Yuan-Kai (王元凱)
Electrical Engineering Department, Fu Jen Catholic University (輔仁大學電機工程系)
ykwang@mail.fju.edu.tw
http://www.ykwang.tw
2014/07/17

Parallelize Computer Vision by GPGPU Computing

About this Course
❖ Multicore Era for Computer Vision
❖ GPGPU
❖ Parallel Programming (CUDA, OpenCL, Renderscript)
❖ OpenCV Acceleration with GPGPU
❖ Computer Vision Acceleration

1. Multicore Era for Computer Vision

Paradigm shift from Clock Speed Race to Multicore Race

Multicore Computing
❖ What is multicore?
  • Combining multiple processors (CPU, DSP, GPGPU, FPGA) into a single chip
❖ Multicore computing is inevitable

Moore's Law
❖ In 1965, Gordon Moore (Intel co-founder) predicted that the number of transistors on an IC would double every year; in 1975 he revised the period to two years
❖ The well-known popular form
  • The performance of a computer doubles every 18 months
  • More transistors → more performance
❖ Intel's CPUs kept pace with the prediction for 40 years

Review of Moore's Law
❖ The number of transistors on a chip did increase

Software enjoys the fruits of hardware's labour.

Problems
❖ More transistors need higher frequency
  • We entered the Clock Speed Race
❖ But high frequency needs high power consumption
  • High power consumption → heat problem
  • About 4 GHz has been the practical limit of the clock speed race

Paradigm Shift from 2000 AD
❖ General-purpose multicore comes of age
❖ Chip companies race to create multicore processors
  • CPU: Intel Core Duo, Quad-core, ARM v7, ...
  • DSP: TI OMAP, ARM NEON, …
  • GPU/GPGPU:
    • nVidia: GeForce/Tesla, Tegra
    • ARM: Mali-T6x
    • …

The Multicore Evolution
[Figure: from the Pentium processor (optimized for a single thread), through Core Duo, to 10~100 energy-efficient cores optimized for parallel execution in 5~10 years]

From large mono-core to multiple lightweight cores

Moore's Law Needs Multicore
❖ A single core cannot keep up with Moore's law
❖ Multicore can keep up with Moore's law if a parallel programming model exists

[Figure: performance versus time for single core and multi-core]

Two Architectures for Multicore
❖ Symmetric multiprocessing (SMP)
  • Multicore CPU, GPGPU, DSP multicore
  • Homogeneous computing
❖ Asymmetric multiprocessing (AMP)
  • CPU+GPGPU, CPU+FPGA, CPU+DSP
  • Heterogeneous computing

Multicore CPU (1/2)
❖ Two or more CPUs in a chip
❖ Ex.: Intel Core i7

[Figure: multiple execution cores]

Multicore CPU (2/2)
❖ Windows Task Manager

[Figure: Task Manager views for two cores and eight cores]

GPGPU (1/2)
❖ GPU (Graphics Processing Unit)
  • The processor in a graphics card that speeds up 3D graphics
  • Game playing is a major application
❖ GPGPU: General-Purpose GPU
  • General-purpose computation using the GPU in applications other than 3D graphics

GPGPU (2/2)
❖ GPGPU has more cores than CPU
  • 120 ~ 3072 cores vs. 2 ~ 8 cores (many-core vs. multi-core)
❖ For data-parallel workloads, GPGPU offers more throughput than multicore CPU
❖ Vendors:
  • nVidia
  • Qualcomm (AMD, ATI)
  • ARM
  • Intel

It is the Software, Stupid
❖ Gary Smith and Daya Nadamuni, Gartner Dataquest, Design Automation Conf., 2006
❖ The biggest problem with SoC design is embedded software development.
❖ "The next big hurdle is programmability. It's the ability to program these multicore platforms."
❖ You can have elegant algorithms, first-pass silicon, and fancy intellectual property. But without software, the product goes nowhere.

Multicore Demands Threading

What Is Computer Vision

A Complete Vision System - Video Surveillance as an Example
[Figure: stages Video Capture, Image Enhance, Object/Event Detection, Object Tracking, Object/Event Recognition, Behavior Analysis, Retrieval; example functions: imaging, image/video enhancement, tripwire, event detection, abnormal detection, face recognition, retrieval]

Computer Vision Needs High Performance Computing
❖ A CV example: video processing
  • Intelligent video surveillance
❖ Its complexity is high
  • Video (1080p RGB): about 2 megapixels × 3 channels ≈ 6 million samples per frame, 30 fps
  • 100 – 1K floating-point operations per sample
  • ⇒ 18 – 180 GFLOPS
❖ Massive data processing
  • Intensive computation
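
The estimate above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes a 1920 × 1080 frame with 3 colour samples per pixel, which is how the "6 million per frame" figure appears to be derived.

```cpp
#include <cassert>
#include <cstdint>

// Samples per second of an uncompressed video stream.
// width, height: frame size in pixels; channels: samples per pixel;
// fps: frames per second.
std::uint64_t samples_per_second(std::uint64_t width, std::uint64_t height,
                                 std::uint64_t channels, std::uint64_t fps) {
    return width * height * channels * fps;
}

// Required throughput in GFLOPS for a given cost per sample.
double required_gflops(std::uint64_t samples_per_sec, double flops_per_sample) {
    assert(flops_per_sample >= 0.0);
    return static_cast<double>(samples_per_sec) * flops_per_sample / 1e9;
}
```

For 1080p RGB at 30 fps this gives 186,624,000 samples per second, so 100 to 1,000 operations per sample need roughly 18.7 to 187 GFLOPS, in line with the 18 – 180 range on the slide.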

HPC Approaches
❖ Cluster/distributed computing
  • Hadoop/MapReduce (Google, cloud computing)
  • MPI
❖ Multi-processing computing
  • Multicore (GPGPU, CPU, FPGA/DSP)
  • Programming: multi-thread
    • Windows thread, Pthread, OpenMP
    • CUDA, Renderscript, C++ AMP, …

[Figure: supercomputer]

However
❖ Can CV algorithms speed up every 18 months with multicore?
❖ Multicore is not a simple solution for upgrading CV algorithm performance
  • The transition from single core to multicore will be blocked by software
  • We are not ready to face the software programming challenges
  • It is the software, stupid.

Software, Threading, and Parallel Computing
❖ Identify parallelism: analyze the algorithm
❖ Express parallelism: write parallel code
❖ Validate parallelism: debug & verify parallel code
❖ Optimize parallelism: enhance parallel performance

Multi-threading Demands New Programming Skills
❖ Previous multi-threading techniques
  • Windows thread, pthread, OpenMP, MPI, …
❖ New techniques
  • CUDA, C++ AMP, OpenCL, Renderscript, OpenACC, MapReduce, …
❖ Concepts
  • Race condition, deadlock
  • Domain partition, function partition, …

Multicore Programming Practice (MPP)
❖ Goal: write portable C/C++ programs that are "multicore ready" and platform compatible
  • Proposed by the MPP working group in the Multicore Association
    http://www.multicore-association.org/workgroup/mpp.php

OpenACC
❖ An organization that develops an API which
  • Describes a collection of compiler directives
  • Specifies loops and regions of code in standard C, C++ and Fortran
  • Offloads them from a host CPU to an attached accelerator, including
    • APUs, GPUs, and many-core coprocessors

HSA Foundation
❖ Heterogeneous System Architecture
  • Key members: AMD, Qualcomm, ARM, Samsung, TI
❖ System architecture easing efficient use of accelerators, SoCs
  • Intended to support high-level parallel programming frameworks
    • OpenCL, C++, C#, OpenMP, Java
  • Accelerator requirements
    • Full-system SVM, memory coherency, preemption, user-mode dispatch

The ParLab in Berkeley
❖ The Parallel Computing Lab at UC Berkeley
  http://parlab.eecs.berkeley.edu
  • The ParLab offers programmers a practical introduction to parallel programming techniques and tools on current parallel computers, emphasizing multicore and manycore computers.

HPEC
❖ High Performance Embedded Computing
  • MIT Lincoln Lab, 1997 ~

OpenCL
❖ Royalty-free, cross-platform, cross-vendor standard
  • Targeting: supercomputers → embedded systems → mobile devices
❖ Enables programming of diverse compute resources
  • CPU, GPU, DSP, FPGA …

OpenCL Working Group Members

❖Diverse industry participation – many industry experts

❖NVIDIA is chair, Apple is specification editor

Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
  • Vendor, hardware
❖ How to do parallel programming (Sec. 3)
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)

2. GPGPU

PC platform
Mobile platform

Why GPGPU
❖ GPGPU is many-core (vs. multi-core)
  • Suitable for massively parallel computing

GPGPU as a Coprocessor

Heterogeneous Computing

PC Platform
• Discrete GPUs
• GPGPU card as a coprocessor

From PC to PSC (Personal Super-Computer)

[Figure: CPU and GPGPU card connected by PCIe]

Mobile Platform
• Integrated GPUs
• GPGPU sub-chip as a coprocessor

From mobile phone to mobile personal computer

[Figure: CPU and GPGPU on one chip, no PCIe]

GPGPU Solutions - nVidia
• Compute architecture: Tesla, Fermi, Kepler, …
• PC
  • GeForce, Quadro
  • Tesla
    • 870, 1060, 2070, K40
• Mobile
  • Tegra: …, 4, K1 (192 cores)

"It's Tegra K1 Everywhere at Google I/O", Embedded Vision Alliance, 2014/7/7.

GPGPU Solutions - Qualcomm/AMD
❖ Qualcomm, AMD, ATI
❖ APU: integrated CPU+GPU
❖ Low energy consumption
❖ PC (AMD): FirePro
❖ Mobile (Snapdragon):
❖ Adreno: 330 (32 cores)

GPGPU Solutions - ARM
❖ Mali
❖ Samsung Exynos, MediaTek
❖ Compute engine after T-600
❖ Exynos 5
❖ At most 8 cores (Mali-T678)

Intel - Multicore CPU
• PC (Xeon Phi)
  • Iris Pro GPU
  • Knights Landing: 60 cores
  • Knights Corner: 48 CPU cores, PCIe
• Mobile
  • Haswell
  • Atom

Applications of GPGPU

http://developer.nvidia.com/category/zone/cuda-zone

Heterogeneous Architecture
❖ Host: CPU
❖ Device: GPGPU
❖ Notice: memory hierarchy in device

GPGPU Architecture - nVidia
❖ GT200
  • GTX 260/280, Quadro 5800, Tesla 1060
❖ Fermi
  • Tesla 2060

[Figure: CPU (host, multicore) with control logic, cache, a few ALUs and DRAM, next to GPU (device, many-core) with many ALUs and DRAM]

nVidia GPGPU Architecture
❖ SM/SP (streaming multiprocessor/streaming processor) + shared memory + DRAM

Memory Hierarchy
❖ On-chip memory
  • Registers
  • Shared memory
  • Constant memory
  • Texture memory
❖ Off-chip memory
  • Local memory
  • Global memory

GPGPU vs. FPGA

❖GPU: nVidia GeForce GTX 280, GTX580

❖FPGA: Xilinx Virtex4, Virtex5

A Comparison of FPGA and GPU for real-Time Phase-Based Optical Flow, Stereo, and Local Image Features, IEEE Transactions on Computers, 2012.

GPGPU vs. FPGA

❖GPU: nVidia GeForce 7900 GTX❖FPGA: Xilinx Virtex-4

Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study, IEEE Transactions on Computers, 2010.

GPGPU vs. FPGA vs. Multicore
❖ Application: 2-D image convolution

GPU: nVidia GeForce GTX 295
FPGA: Altera Stratix III E260

A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications, ACM/SIGDA international symposium on FPGA, 2012.

However, GPGPU May Not Always Improve Speed & Energy

Hardware vs. Software
[Figure: GPGPU hardware vendors (nVidia, Qualcomm, ARM, Intel) set against parallel programming techniques (CUDA, OpenCL, RenderScript, C++ AMP)]

Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to do parallel programming (Sec. 3)
  • CUDA, Renderscript, OpenCL, …
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)

3. Parallel Programming

Multi-threading
Programming Languages for Parallelism

Parallel Computing
❖ Serial computing
❖ Parallel computing

[Figure: a CPU/GPU with four cores]

Parallel Programming
❖ Much code is written in C/C++/Java
  • Especially algorithmic programs
❖ Can we write GPGPU parallel programs in C/C++/Java?
❖ However, C/C++ is sequential
  • Three control structures of C/C++/Java: sequence, selection, repetition

Multi-threading
❖ Multi-threading is the fundamental concept for parallel programming
  • Some techniques are ready
    • Pthread, Win32 thread, OpenMP, MPI, Intel TBB (Threading Building Blocks), ...
  • New techniques
    • CUDA, OpenCL, Renderscript, OpenACC, C++ AMP, ...

Parallel Programming Models

Parallel Programming in Sequential Language
❖ Do we need to learn new languages for multi-threading?
  • No
❖ Write multi-threaded code in C/C++
  • Add functions/directives to C/C++ for multi-threading
  • That is what current solutions do
    • pthread, Win32 thread, OpenMP, MPI, CUDA, OpenCL, ...
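
As a minimal sketch of the "add directives" approach, the function below takes an ordinary sequential C++ loop and marks it with one OpenMP directive. The function name and the operation are hypothetical examples, not taken from the slides.

```cpp
#include <cassert>
#include <vector>

// y[i] = a * x[i] + y[i] for every element.
// Built with OpenMP enabled (for example -fopenmp on GCC or Clang), the
// iterations are shared among threads. Without it the pragma is ignored
// and the loop runs serially with the same result.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];   // iterations are independent: no race
    }
}
```

The source stays valid sequential C++; only the directive expresses the parallelism, which is why this style needs no new language.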

Decompose the Problem
❖ Two basic approaches to partition computational work
  • Domain decomposition
    • Partition the data used in solving the problem
  • Function decomposition
    • Partition the jobs (functions) from the overall work (problem)

[Figure: CPU and GPGPU cooperate]

Multi-Threading
❖ A program running in serial and in parallel

[Figure: threads of a program running in serial and in parallel; source: http://en.wikipedia.org/wiki/Thread_(computer_science)]

Domain Decomposition (1/3)
❖ An image example
  • It is 2D data
  • Three popular ways to partition it

Domain Decomposition (2/3)
❖ Domain data are usually processed by loops

    for (i = 0; i < height; i++)
        for (j = 0; j < width; j++)
            img2[i][j] = RemoveNoise(img1[i][j]);

[Figure: original image (img1) and enhanced image (img2), the X-ray image of a circuit board, with row index i and column index j]

SIMD, SPMD, SIMT

Domain Decomposition (3/3)
❖ A three-block partition example

    // Thread 1
    for (i = 0; i < height/3; i++)
        for (j = 0; j < width; j++)
            img2[i][j] = RemoveNoise(img1[i][j]);
    // Thread 2
    for (i = height/3; i < height*2/3; i++)
        for (j = 0; j < width; j++)
            img2[i][j] = RemoveNoise(img1[i][j]);
    // Thread 3
    for (i = height*2/3; i < height; i++)
        for (j = 0; j < width; j++)
            img2[i][j] = RemoveNoise(img1[i][j]);

OpenMP, CUDA (SPMD)

[Figure: threads fork, process rows i=0..3 (subdomain 1), i=4..7 (subdomain 2) and i=8..11 (subdomain 3), then join at a barrier]
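
The three hand-written loops generalize to any number of row bands. The sketch below is an illustration under stated assumptions: `band_begin`, `process_band` and the clamping `remove_noise` are hypothetical names, and `remove_noise` is only a stand-in for the slide's RemoveNoise, whose real definition is not given.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<int>>;

// First row of band k when `height` rows are split into `parts` bands.
// Band k covers rows [band_begin(k), band_begin(k + 1)).
std::size_t band_begin(std::size_t height, std::size_t parts, std::size_t k) {
    assert(parts > 0 && k <= parts);
    return height * k / parts;
}

// Hypothetical stand-in for RemoveNoise: clamps a pixel to [0, 255].
int remove_noise(int pixel) {
    return pixel < 0 ? 0 : (pixel > 255 ? 255 : pixel);
}

// Work of one thread: process only the rows that belong to band k.
void process_band(const Image& src, Image& dst, std::size_t parts, std::size_t k) {
    const std::size_t first = band_begin(src.size(), parts, k);
    const std::size_t last = band_begin(src.size(), parts, k + 1);
    for (std::size_t i = first; i < last; ++i)
        for (std::size_t j = 0; j < src[i].size(); ++j)
            dst[i][j] = remove_noise(src[i][j]);
}
```

With 12 rows and 3 parts the bands start at rows 0, 4 and 8. Each call to `process_band` writes a disjoint set of rows, so the calls could be handed to separate threads without locking.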

GPGPU Programming: SIMT model

❖ CPU (“host”) program often written in C or C++

❖ GPU code is written as a sequential kernel in (usually) a C or C++ dialect
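
To make the "sequential kernel" idea concrete, here is a host-side C++ emulation, not real GPU code. The kernel is written for one element; a launcher then runs it over the whole index space. The parameter names mirror CUDA's built-in block and thread indices, and the pixel inversion is a hypothetical example operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// "Kernel": sequential code for ONE element, identified by its block
// index, thread index and block size.
void invert_kernel(std::size_t block_idx, std::size_t thread_idx,
                   std::size_t block_dim,
                   const std::vector<int>& in, std::vector<int>& out) {
    const std::size_t i = block_idx * block_dim + thread_idx;  // global index
    if (i < in.size()) {   // guard: the last block may be only partly used
        out[i] = 255 - in[i];
    }
}

// Emulated launch. A GPU would run these iterations concurrently; plain
// loops are used here only to show the index space.
void launch_invert(const std::vector<int>& in, std::vector<int>& out,
                   std::size_t block_dim) {
    assert(block_dim > 0 && out.size() == in.size());
    const std::size_t blocks = (in.size() + block_dim - 1) / block_dim;  // round up
    for (std::size_t b = 0; b < blocks; ++b)
        for (std::size_t t = 0; t < block_dim; ++t)
            invert_kernel(b, t, block_dim, in, out);
}
```

The kernel body contains no loop over the data: the parallelism lives entirely in how many (block, thread) pairs are launched.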

GPGPU Programming Techniques
CUDA
OpenCL
C++ AMP
Renderscript

CUDA

CUDA
❖ CUDA: Compute Unified Device Architecture
❖ Parallel programming for nVidia's GPGPU
❖ Uses the C/C++ language
  • Java, Fortran, Matlab are OK
❖ When executing CUDA programs, the GPU operates as a coprocessor to the main CPU

CUDA Hardware Environment: CPU+GPU
❖ CPU
  • Organizes, interprets, and communicates information
❖ GPU
  • Handles the core processing on large quantities of parallel information
  • Compute-intensive portions of applications that are executed many times, but on different data, are extracted from the main application and compiled to execute in parallel on the GPU

[Figure: CPU and GPU connected by PCI-E]

CUDA Software Stack

Processing Flow on CUDA
1. Allocate device memory
2. Copy the processing data from main memory to GPU memory
3. The CPU instructs the processing
4. Execute in parallel in each core
5. Copy the result from GPU memory to main memory
6. Release device memory

Programming with Memory Hierarchy
❖ Locality principle
  • Temporal locality
  • Spatial locality
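
A small sketch of what the two kinds of locality look like in code. The function names are hypothetical and the image is assumed to be stored row-major in one contiguous array. Both functions return the same sum; only the access order differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-by-row order touches neighbouring addresses one after another
// (spatial locality), so each fetched cache line is fully used.
long sum_row_order(const std::vector<int>& img, std::size_t width, std::size_t height) {
    assert(img.size() == width * height);
    long total = 0;   // reused in every iteration: temporal locality
    for (std::size_t i = 0; i < height; ++i)
        for (std::size_t j = 0; j < width; ++j)
            total += img[i * width + j];
    return total;
}

// Column-by-column order: consecutive accesses are `width` elements
// apart, so on large images most of each cache line goes unused.
long sum_column_order(const std::vector<int>& img, std::size_t width, std::size_t height) {
    assert(img.size() == width * height);
    long total = 0;
    for (std::size_t j = 0; j < width; ++j)
        for (std::size_t i = 0; i < height; ++i)
            total += img[i * width + j];
    return total;
}
```

On a GPU the same reasoning motivates staging reused data in registers or shared memory and letting neighbouring threads read neighbouring addresses.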

Example - Hello World (1/3)

    int main()
    {
        char src[12] = "Hello World";   // host input: 11 characters + '\0'
        char h_hello[12];               // host output buffer

        char* d_hello1;                 // device input buffer
        char* d_hello2;                 // device output buffer

        // Allocate device memory
        cudaMalloc((void**) &d_hello1, sizeof(char) * 12);
        cudaMalloc((void**) &d_hello2, sizeof(char) * 12);

        // Copy the input from host to device
        cudaMemcpy(d_hello1, src, sizeof(char) * 12, cudaMemcpyHostToDevice);

        // Call the kernel function with one block of one thread
        hello<<<1,1>>>(d_hello1, d_hello2);

[Figure: host buffers src and h_hello, device buffers d_hello1 and d_hello2]

Example - Hello World (2/3)
❖ Kernel function

    __global__ void hello(char* hello1, char* hello2)
    {
        int k;

        // Copy characters up to, but not including, the terminator
        for (k = 0; hello1[k] != '\0'; k++) {
            hello2[k] = hello1[k];
        }
        hello2[k] = '\0';   // terminate the copy so printf("%s") is safe
    }

No parallel processing in this example

[Figure: host buffers src and h_hello, device buffers d_hello1 and d_hello2]

Example - Hello World (3/3)

        // Copy the result from device to host
        cudaMemcpy(h_hello, d_hello2, sizeof(char) * 12, cudaMemcpyDeviceToHost);

        printf("%s\n", h_hello);

        // Release device memory
        cudaFree(d_hello1);
        cudaFree(d_hello2);

        system("pause");
        return 0;
    }

Result: [Figure: console output]

[Figure: host buffers src and h_hello, device buffers d_hello1 and d_hello2]

OpenCL

Standard

The Inspiration for OpenCL

What's OpenCL
❖ One code tree can be executed on CPUs, GPUs, DSPs and hardware
  • Dynamically interrogate system load and balance across available processors
❖ Powerful, low-level flexibility
  • Foundational access to compute resources for higher-level engines, frameworks and languages

Broad OpenCL Implementer Adoption

❖Multiple conformant implementations shipping on desktop and mobile

❖Android ICD extension released in latest extension specification

❖Multiple implementations shipping in Android NDK

OpenCL Enables Portability
❖ C-to-gates programs are proprietary

Altera OpenCL SDK for FPGAs

NVIDIA OpenCL SDK for GPU

AMD OpenCL Optimization Case Study
❖ Platform
  • AMD Phenom II X4 965 CPU (quad core)
  • ATI Radeon HD 5870 GPU
❖ Unoptimized CPU performance: 1 GFLOP/s
❖ Optimized CPU performance reaches: 4 GFLOP/s
❖ Optimized GPU performance reaches: 50 GFLOP/s

Example - Hello World (1/3)
[Figure: code listing with callouts "Including" and "Declaring"]

Example - Hello World (2/3)
[Figure: code listing with callout "Creating"]

Example - Hello World (2/3)
[Figure: code listing with callouts "Creating", "Do" and "Copy to host & display"]

Example - Hello World (3/3)
[Figure: code listing with callout "Kernel Function"]

C++ AMP

Microsoft

What's C++ AMP (1/2)
❖ Microsoft's C++ AMP (Accelerated Massive Parallelism)
  • Part of Visual C++, integrated with Visual Studio, built on Direct3D
  • "Performance for the mainstream"
❖ STL-like library for multidimensional array data
  • Special convenience support for 1-, 2-, and 3-dimensional arrays on CPU or GPU
  • C++ AMP runtime handles CPU<->GPU data copying
  • Tiles enable efficient processing of sub-arrays

What's C++ AMP (2/2)
❖ parallel_for_each
  • Executes a kernel (C++ lambda) at each point in the extent
  • restrict() clause specifies where to run the kernel: cpu (default) or amp (GPU; named direct3d in the developer preview)

Example - Hello World (1/2)
[Figure: code listing with callout "Declaring & copying to device"]

Example - Hello World (2/2)
[Figure: code listing with callouts "Do" and "Display"]

RenderScript

Google Android

What's Renderscript (1/2)
❖ Higher-level than CUDA or OpenCL: simpler & less performance control
  • Emphasis on mobile devices & cross-SoC performance portability
❖ Programming model
  • C99-based kernel language, JIT-compiled, single input-single output
  • Automatic Java class reflection
  • Intrinsics: built-in, highly-tuned operations, e.g. ScriptIntrinsicConvolve3x3
  • Script groups combine kernels to amortize launch cost & enable kernel fusion

What's Renderscript (2/2)
❖ Data type
  • 1D/2D collections of elements, C types like int and short2, types include size
  • Runtime type checking
❖ Parallelism
  • Implicit: one thread per data element, atomics for thread-safe access
  • Thread scheduling not exposed, VM-decided

RenderScript Architecture

Low Level Virtual Machine
❖ Low Level Virtual Machine (LLVM) is a compiler infrastructure

Offline Compiler Flow

Renderscript Compiler: libbcc

Renderscript Project Framework

Example - Hello World (1/8)

Example - Hello World (2/8): HelloWorld.java

Example - Hello World (3/8): HelloWorld.java

Example - Hello World (4/8): HelloWorldView.java

Example - Hello World (5/8): HelloWorldView.java

Example - Hello World (6/8): HelloWorldRS.java

Example - Hello World (7/8): HelloWorldRS.java

Example - Hello World (7/8): ScriptC_helloworld.java

Example - Hello World (7/8): ScriptC_helloworld.java

Example - Hello World (8/8): HelloWorld.rs

Comparison (1/2)
❖ Renderscript vs. Native (NDK) vs. Java (SDK)
  • OS: Honeycomb v3.2 (CPU only)

Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in Google Android." In Proc. First Asia-Pacific Programming Languages and Compilers Workshop (APPLC), 2012.

Comparison (2/2)
❖ OpenCL & CUDA
  • Sobel filter with (CMw) and without (CMw/o) constant memory

OpenCL's portability does not fundamentally affect its performance

Fang, Jianbin, Ana Lucia Varbanescu, and Henk Sips. "A comprehensive performance comparison of CUDA and OpenCL." In Proc. International Conference on Parallel Processing (ICPP), 2011.

GPGPU Programming

Performance: more control, better performance

Productivity: ease of use, quick programming, portability

Parallelization
❖ Multicore/multi-threading
❖ Data parallelization
  • Data distribution
  • Parallel convolution
  • Reduction algorithm
  • Amdahl's law
❖ Memory hierarchy management
  • Locality principle
    • A program accesses a relatively small portion of the address space at any instant of time
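
Amdahl's law, named in the list above, bounds the speedup from adding cores. The sketch below states it as a function, assuming a fraction p of the run time parallelizes perfectly and the rest stays serial. The function name is a hypothetical choice.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction p of the run time
// (0 <= p <= 1) is spread perfectly over n cores and the remaining
// (1 - p) stays serial.
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}
```

With 90% of the work parallel, 10 cores give a speedup of about 5.3, and no number of cores can exceed 10. This is why the serial parts (including host-device copies) matter so much for GPGPU code.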

Multi-thread Programming with the Discipline of Parallelization
❖ Identify parallelism: analyze the algorithm
❖ Express parallelism: write parallel code
❖ Validate parallelism: debug & verify parallel code
❖ Optimize parallelism: enhance parallel performance

Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to do parallel programming (Sec. 3)
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)

4. OpenCV Acceleration

What Is OpenCV
❖ A very popular computer vision library
  • 6M downloads
  • BSD license
  • 2000+ CV functions
  • Modularized and efficient
  • Optimization
    • Intel SSE, IPP, TBB
    • ARM NEON & GLSL (Tegra)
    • CUDA, OpenCL

OpenCV Modules
❖ Image/video I/O, processing, display (core, imgproc, highgui)
❖ Object/feature detection (objdetect, features2d, nonfree)
❖ Geometry-based monocular or stereo computer vision (calib3d, stitching, videostab)
❖ Computational photography (photo, video, superres)
❖ Machine learning & clustering (ml, flann)
❖ CUDA and OpenCL GPU acceleration (gpu, ocl)

Normal CV modules: 14
Acceleration modules: 2

OpenCV GPU Module
❖ Implemented using NVIDIA CUDA Runtime API
❖ Latest version: 2.4.9
  • Utilizing multiple GPUs
❖ Implemented modules: 11
❖ Implemented functions: 270

Focus on PC platform
Not fully compatible with mobile GPGPU on Android

CUDA Matrix Operations
❖ Point-wise matrix math
  • gpu::add(), ::sum(), ::div(), ::sqrt(), ::sqrSum(), ::meanStdDev, ::min(), ::max(), ::minMaxLoc(), ::magnitude(), ::norm(), ::countNonZero(), ::cartToPolar(), etc.
❖ Matrix multiplication
  • gpu::gemm()
❖ Channel manipulation
  • gpu::merge(), ::split()

CUDA Geometric Operations
❖ Image resize with sub-pixel interpolation
  • gpu::resize()
❖ Image rotate with sub-pixel interpolation
  • gpu::rotate()
❖ Image warp (e.g., panoramic stitching)
  • gpu::warpPerspective(), ::warpAffine()

CUDA Other Math and Geometric Operations
❖ Integral images
  • gpu::integral(), ::sqrIntegral()
❖ Custom geometric transformation (e.g., lens distortion correction)
  • gpu::remap(), ::buildWarpCylindricalMaps(), ::buildWarpSphericalMaps()

CUDA Image Processing (1/2)
❖ Smoothing
  • gpu::blur(), ::boxFilter(), ::GaussianBlur()
❖ Morphological
  • gpu::dilate(), ::erode(), ::morphologyEx()
❖ Edge detection
  • gpu::Sobel(), ::Scharr(), ::Laplacian(), gpu::Canny()
❖ Custom 2D filters
  • gpu::filter2D(), ::createFilter2D_GPU(), ::createSeparableFilter_GPU()
❖ Color space conversion
  • gpu::cvtColor()

CUDA Image Processing (2/2)
❖ Image blending
  • gpu::blendLinear()
❖ Template matching (automated inspection)
  • gpu::matchTemplate()
❖ Gaussian pyramid (scale-invariant feature/object detection)
  • gpu::pyrUp(), ::pyrDown()
❖ Image histogram
  • gpu::calcHist(), gpu::histEven, gpu::histRange()
❖ Contrast enhancement
  • gpu::equalizeHist()

CUDA De-noising
❖ Gaussian noise removal
  • gpu::FastNonLocalMeansDenoising()
❖ Edge-preserving smoothing
  • gpu::bilateralFilter()

CUDA Fourier and MeanShift
❖ Fourier analysis
  • gpu::dft(), ::convolve(), ::mulAndScaleSpectrums(), etc.
❖ MeanShift
  • gpu::meanShiftFiltering(), ::meanShiftSegmentation()

CUDA Shape Detection
❖ Line detection (e.g., lane detection, building detection, perspective correction)
  • gpu::HoughLines(), ::HoughLinesDownload()
❖ Circle detection (e.g., cells, coins, balls)
  • gpu::HoughCircles(), ::HoughCirclesDownload()

CUDA Object Detection
❖ HAAR and LBP cascaded adaptive boosting (e.g., face, nose, eyes, mouth)
  • gpu::CascadeClassifier_GPU::detectMultiScale()
❖ HOG detector (e.g., person, car, fruit, hand)
  • gpu::HOGDescriptor::detectMultiScale()

CUDA Object Recognition
❖ Interest point detectors
  • gpu::cornerHarris(), ::cornerMinEigenVal(), ::SURF_GPU, ::FAST_GPU, ::ORB_GPU(), ::GoodFeaturesToTrackDetector_GPU()
❖ Feature matching
  • gpu::BruteForceMatcher_GPU(), ::BFMatcher_GPU()

CUDA Stereo and 3D
❖ RANSAC
  • gpu::solvePnPRansac()
❖ Stereo correspondence (disparity map)
  • gpu::StereoBM_GPU(), ::StereoBeliefPropagation(), ::StereoConstantSpaceBP(), ::DisparityBilateralFilter()
❖ Represent stereo disparity as 3D or 2D
  • gpu::reprojectImageTo3D(), ::drawColorDisp()

CUDA Optical Flow
❖ Dense/sparse optical flow
  • gpu::FastOpticalFlowBM(), ::PyrLKOpticalFlow, ::BroxOpticalFlow(), ::FarnebackOpticalFlow(), ::OpticalFlowDual_TVL1_GPU(), ::interpolateFrames()

CUDA Background Segmentation
❖ Foreground/background segmentation (e.g., object detection/removal, motion tracking, background removal)
  • gpu::FGDStatModel, ::GMG_GPU, ::MOG_GPU, ::MOG2_GPU

Performance of OpenCV GPU Accelerators on PC

Today We Talk About

❖ Why GPGPU's multicore is better (Sec. 2)

❖ How to do parallel programming (Sec. 3)

❖ OpenCV Acceleration (Sec. 4)

❖ Computer vision Acceleration-PC (Sec. 5)

❖ Computer vision Acceleration-Android (Sec. 6)

5. Computer Vision Acceleration on PC

Image enhancement (HDR)
Feature extraction
Video surveillance cloud

HDR and Image Enhancement

HDR Image Enhancement

❖ Restore and enhance an image
❖ Its complexity is high for large images

[Figure: original image and restored image]

Complexity: O(N²) ~ O(N³)

Algorithms for Image Restoration

❖ Wiener Filter
❖ Histogram Based Approach
• Histogram Equalization, Histogram Modification, …
❖ Retinex
• Path-based Retinex
• Recursive Retinex
• Center/surround Retinex
  • Has no iterative process, so it is suitable for parallelization
  • Multi-Scale Retinex with Color Restoration (MSRCR) [Rahman et al. 1997]

MSRCR Algorithm

The MSRCR equation is built from the following terms:

• the MSRCR output
• the original image distribution in the ith spectral band
• the kth Gaussian surround function
• the convolution operation
• the weight
• the color restoration factor in the ith spectral band
• N: the number of spectral bands
• the gain constant
• the constant that controls the strength of the nonlinearity
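
For reference, MSRCR is commonly written in the Retinex literature as shown below. This is a sketch of the standard formulation, assuming the slide follows it. The symbol names are assumptions chosen to match the definitions listed above: R for the MSRCR output, I_i for the image in the ith spectral band, F_k for the kth Gaussian surround function, * for convolution, W_k for the weight, C_i for the color restoration factor, β for the gain constant, and α for the constant that controls the strength of the nonlinearity. K, the number of surround scales, is introduced here for illustration.

```latex
\begin{align*}
R_{\mathrm{MSRCR}_i}(x,y) &= C_i(x,y) \sum_{k=1}^{K} W_k
  \left\{ \log I_i(x,y) - \log\left[ F_k(x,y) * I_i(x,y) \right] \right\} \\
C_i(x,y) &= \beta \left\{ \log\left[ \alpha\, I_i(x,y) \right]
  - \log\left[ \sum_{j=1}^{N} I_j(x,y) \right] \right\}
\end{align*}
```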

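The Method

[Flowchart: Copy Data from CPU to GPGPU → Gaussian Blur → Log-domain Processing → Normalization → Histogram Stretching → Copy Data from GPGPU to CPU. The processing steps run on the GPGPU; the copy steps move data between CPU and GPGPU.]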

• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm." Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on. IEEE, 2011.

• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19.
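
The four processing stages named on this slide can be sketched serially in C++ as follows. This is a minimal single-scale, single-channel CPU sketch under stated assumptions, not the authors' CUDA implementation: the function names `gaussianBlur` and `retinexSketch` are hypothetical, borders are handled by clamping, and normalization and histogram stretching are collapsed into one min-max stretch to [0, 255] for brevity. Every per-pixel loop body is independent of the others, which is what makes each stage a candidate for one-thread-per-pixel execution on a GPGPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: separable Gaussian blur of a row-major, single-channel
// image. Borders are handled by clamping coordinates to the image.
std::vector<float> gaussianBlur(const std::vector<float>& img,
                                int w, int h, float sigma) {
    assert(sigma > 0.0f && w > 0 && h > 0);
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    const int r = static_cast<int>(std::ceil(3.0f * sigma));
    std::vector<float> k(2 * r + 1);
    float sum = 0.0f;
    for (int i = -r; i <= r; ++i) {
        k[i + r] = std::exp(-static_cast<float>(i * i) / (2.0f * sigma * sigma));
        sum += k[i + r];
    }
    for (float& v : k) v /= sum;  // weights now sum to 1

    std::vector<float> tmp(img.size()), out(img.size());
    for (int y = 0; y < h; ++y) {  // horizontal pass
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i) {
                const int xx = std::min(std::max(x + i, 0), w - 1);
                acc += k[i + r] * img[y * w + xx];
            }
            tmp[y * w + x] = acc;
        }
    }
    for (int y = 0; y < h; ++y) {  // vertical pass
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i) {
                const int yy = std::min(std::max(y + i, 0), h - 1);
                acc += k[i + r] * tmp[yy * w + x];
            }
            out[y * w + x] = acc;
        }
    }
    return out;
}

// Hypothetical helper: log-domain difference between the image and its blur,
// followed by a linear min-max stretch of the result to [0, 255].
std::vector<float> retinexSketch(const std::vector<float>& img,
                                 int w, int h, float sigma) {
    const std::vector<float> blur = gaussianBlur(img, w, h, sigma);
    std::vector<float> r(img.size());
    for (std::size_t i = 0; i < img.size(); ++i) {
        // The +1 offset avoids log(0) for black pixels.
        r[i] = std::log(img[i] + 1.0f) - std::log(blur[i] + 1.0f);
    }
    const auto mm = std::minmax_element(r.begin(), r.end());
    const float lo = *mm.first;
    const float range = *mm.second - lo;
    for (float& v : r) {
        // A flat response (range == 0) maps to 0 to avoid dividing by zero.
        v = (range > 0.0f) ? 255.0f * ((v - lo) / range) : 0.0f;
    }
    return r;
}
```

Calling `retinexSketch(pixels, width, height, 1.0f)` on a row-major grayscale buffer returns a buffer of the same size with values in [0, 255]. A multi-scale version would repeat the blur and log difference for several values of `sigma` and sum the weighted results before stretching.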

Parallelization by GPGPU

❖ Multicore/Multi-threading
• Tesla C1060: 240 SPs (Stream Processors)
• CUDA: Thread, Block, Grid
❖ Data Parallelization
• Parallel convolution

[Figure: parallel summation over processing elements (PEs) and time steps t0 to t5. A(0) through A(7) are summed pairwise: A(0)+A(1), A(2)+A(3), A(4)+A(5), A(6)+A(7); then A(0)+A(1)+A(2)+A(3) and A(4)+A(5)+A(6)+A(7); then the final sum.]

[Figure: image pixels assigned to processing elements PEi]
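
The pairwise summation of A(0) through A(7) shown on this slide can be sketched as follows. This is a serial C++ emulation, not CUDA code: each pass of the outer loop stands for one time step, and the iterations of the inner loop are independent of each other, so each could be handled by a separate processing element. The function name `treeReduceSum` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Serial emulation of a parallel tree reduction (hypothetical helper).
// Outer loop: one pass per time step, doubling the stride each time.
// Inner loop: independent pairwise additions, one per processing element.
double treeReduceSum(std::vector<double> a) {
    if (a.empty()) return 0.0;
    for (std::size_t stride = 1; stride < a.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < a.size(); i += 2 * stride) {
            a[i] += a[i + stride];  // first pass: A(0)+A(1), A(2)+A(3), ...
        }
    }
    return a[0];  // the total accumulates in the first element
}
```

With 8 inputs the loop needs 3 passes instead of the 7 sequential additions of a plain running sum, which is the point of the diagram: the number of time steps grows with log2 of the input size when enough processing elements are available.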

Our Memory Hierarchy

Parallel Gaussian Blur

Parallel Log-domain Processing

Parallel Normalization

Texture Memory

Parallel Histogram Stretching

Constant Memory

Global Memory

Shared Memory

Experimental Results (1/2)

[Figure: original images, CPU results, GPGPU results]

Experimental Results (2/2)

[Figure: original images, CPU results, GPGPU results]

GPGPU Speedup over CPU

[Chart labels: 74x, 2x]

• Ideal speedup: 240 × (1.296 GHz / 3 GHz) ≈ 103
• NPP: nVidia Performance Primitives
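
The ideal-speedup figure on this slide can be worked through as follows. It is a back-of-the-envelope bound that implicitly assumes each of the 240 stream processors does the same work per clock cycle as the CPU core used as the baseline. The symbols S_ideal, N_SP, f_GPU, f_CPU, and S_measured are introduced here for illustration. Under that assumption, the 74x shown on the chart corresponds to roughly 71% of the bound.

```latex
\begin{align*}
S_{\text{ideal}} &= N_{\text{SP}} \times \frac{f_{\text{GPU}}}{f_{\text{CPU}}}
  = 240 \times \frac{1.296\ \text{GHz}}{3\ \text{GHz}} = 103.68 \approx 103 \\
\frac{S_{\text{measured}}}{S_{\text{ideal}}} &= \frac{74}{103.68} \approx 0.71
\end{align*}
```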

Feature Extraction (SIFT)

What Is SIFT

❖ SIFT
• Scale Invariant Feature Transform
❖ Invariance of feature points
• Translation
• Rotation
• Scale

Applications of SIFT

❖ Object recognition/tracking
❖ Image retrieval
❖ Autostitch

Parallelize SIFT by GPGPU

• CPU: Intel Q9400, quad-core (2.66 GHz)
• GPU: GeForce GTS 250, 128 SPs (1.836 GHz)

Experimental Results

[Figure: CPU results and GPU results]

Execution Time

[Chart: execution time in ms]

• CPU: 10 seconds on average
• GPGPU: 0.8 seconds on average

Speedup

13x speedup on average
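
As a quick consistency check against the average execution times reported for the SIFT experiment (10 seconds on the CPU and 0.8 seconds on the GPGPU), the ratio of the two averages works out as below. The symbols t_CPU and t_GPGPU are introduced here for illustration. The result is close to the stated 13x; a mean of per-image speedups need not equal the ratio of the mean times, so a small difference of this kind would be expected if the 13x was averaged per image, which is an assumption.

```latex
\[
\frac{t_{\text{CPU}}}{t_{\text{GPGPU}}} = \frac{10\ \text{s}}{0.8\ \text{s}} = 12.5
\]
```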

Video Surveillance Cloud

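GPGPU Cloud Video Surveillance System

[Architecture diagram labels]
• Restricted-area intrusion detection
• PTZ camera tracking
• Camera anomaly detection
• High-efficiency video event browsing system
• Central video and message management system
• Multi-resolution wide-area surveillance system
• Outdoor parking lot vacant-space detection
• Illegal parking detection
• Face detection in dynamic scenes
• Storage Area Network
• PC, Mobile device
• Multi-core, Hypervisor, GPGPU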

Private cloud server room

6. Computer Vision Acceleration on Android

OpenCV
RenderScript

OpenCV on Android

OpenCV4Android SDK

❖ Enables development of Android applications that use the OpenCV library
❖ Uses the Java Native Interface (JNI) to access C code directly
❖ Supports NVIDIA's Tegra Android Development Pack (TADP)

Not fully compatible with the GPU module

System Framework

Two Methods to Call OpenCV

❖ Using Java API

❖ Using native C++

OpenCV for Android SDK by GPU(1/5)

OpenCV for Android SDK by GPU(2/5)

OpenCV for Android SDK by GPU(3/5)

OpenCV for Android SDK by GPU(4/5)

OpenCV for Android SDK by GPU(5/5)

RenderScript on Android with GPU Acceleration

RenderScript on android with GPU(1/5)

RenderScript on android with GPU(2/5)

RenderScript on android with GPU(3/5)

RenderScript on android with GPU(4/5)

RenderScript on android with GPU(5/5)

RenderScript Image Processing Intrinsics

Name | Operation
ScriptIntrinsicConvolve3x3, ScriptIntrinsicConvolve5x5 | Performs a 3x3 or 5x5 convolution.
ScriptIntrinsicBlur | Performs a Gaussian blur. Supports grayscale and RGBA buffers and is used by the system framework for drop shadows.
ScriptIntrinsicYuvToRGB | Converts a YUV buffer to RGB. Often used to process camera data.
ScriptIntrinsicColorMatrix | Applies a 4x4 color matrix to a buffer.
ScriptIntrinsicBlend | Blends two allocations in a variety of ways.
ScriptIntrinsicLUT | Applies a per-channel lookup table to a buffer.
ScriptIntrinsic3DLUT | Applies a color cube with interpolation to a buffer.
ScriptIntrinsicHistogram | Intrinsic histogram filter.

Gaussian Blur Example by RenderScript Intrinsic

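// Create a RenderScript context and a Gaussian blur intrinsic for 4-channel 8-bit data.
RenderScript rs = RenderScript.create(theActivity);
ScriptIntrinsicBlur theIntrinsic = ScriptIntrinsicBlur.create(rs, Element.U8_4(rs));
// Wrap the input and output bitmaps in RenderScript allocations.
Allocation tmpIn = Allocation.createFromBitmap(rs, inputBitmap);
Allocation tmpOut = Allocation.createFromBitmap(rs, outputBitmap);
// Set the blur radius in pixels (25 is the largest value the intrinsic accepts).
theIntrinsic.setRadius(25.f);
theIntrinsic.setInput(tmpIn);
theIntrinsic.forEach(tmpOut);
// Copy the blurred result back into the output bitmap.
tmpOut.copyTo(outputBitmap);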

RenderScript Intrinsic Example(1/2)

RenderScript Intrinsic Example(2/2)

Blur Intrinsic Performance Analysis

Performance of RenderScript Intrinsics

❖ On new Nexus 7
❖ Relative to equivalent multithreaded C implementations.

RenderScript Image Processing Benchmarks (1/2)

❖ CPU only on a Galaxy Nexus device.

RenderScript Image Processing Benchmarks(2/2)

Acceleration of Retinex Using RenderScript

❖ This paper presents rsRetinex, a Retinex implementation optimized with RenderScript.

❖ The experimental results show that rsRetinex can achieve up to a fivefold speedup across different image resolutions.

Le, Duc Phuoc Phat Dat, et al. "Acceleration of Retinex Algorithm for Image Processing on Android Device Using Renderscript." in Proc. The 8th International Conference on Robotic, Vision, Signal Processing & Power Applications, 2014.

Mobile GPGPU List

GPU | Adoption | OpenCL/CUDA | OpenCV | Renderscript
Qualcomm Adreno | Google Nexus 10, Google new Nexus 7, SONY Xperia Tablet Z2 | OpenCL 1.2 (302~420) | OCL module | Android 4.0 later
ARM Mali | Nexus 10, Samsung Note 3, Samsung Note PRO 12.2, Meizu MX3 | OpenCL 1.1 (T604~T678) | OCL module | Android 4.0 later
nVIDIA Tegra | Google Project Tango, HTC Nexus 9, Microsoft Surface 2, Nvidia Shield Note 7 | CUDA, OpenCL 1.2 (K1 only) | GPU module | Android 4.0 later (K1 only)
PowerVR | iPad Air, iPad mini | OpenCL 1.2 | OCL module | none
Intel HD Graphics | Microsoft Surface Pro 3, Sony VAIO Tap 11 | OpenCL 1.1 | OCL module | none

Sources: AnandTech; "Nvidia CEO sees future in cars and gaming," 2014/5/19, CNet.

7. Summary

GPGPU

❖ Single-core → Multi-core → Many-core
❖ PC
• nVidia Tesla + CUDA/OpenCV
❖ Android
• Qualcomm Adreno + OpenCV ocl
• nVidia Tegra + OpenCV gpu

Parallel Programming

❖ C/C++/OpenCV
• OpenMP, OpenACC, CUDA, C++ AMP
• OpenCL
❖ Java
• OpenCL, RenderScript
❖ Note that OpenCL and RenderScript are
• Not efficient in parallelization
• Efficient in CV algorithmic design

OpenCV Acceleration (1/2)

❖ Ver. 2.4.x
• gpu module: CUDA, PC
• ocl module: OpenCL, mobile
❖ Ver. 3.0 (2014/6)
• Transparent API for GPGPU acceleration

OpenCV Acceleration (2/2)

OpenCL 2.0

❖ Released in 2013
❖ SVM: Shared Virtual Memory
• OpenCL 1.2: Explicit memory management
❖ Dynamic (Nested) Parallelism
• Allows a device to enqueue kernels onto itself, with no round trip to the host required
❖ Disadvantages
• Requires strong hardware support
• Not well supported in current GPGPUs

CUDA still Dominant in the Near Future

❖ However, we have to manually parallelize the algorithm: more design overhead
❖ We need expertise in
• Algorithms of image and signal processing
  • Filtering, frequency analysis, compression, feature extraction, recognition, ...
• Theory, tools and methodology of parallel computing
  • Communication, synchronization, resource management, load balancing, debugging, ...

GPUs for Multimedia

Applications shown:
• Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA
• CUDA JPEG Decoder
• DivideFrame GPU Decoder
• Hyperspectral Image Compression on NVIDIA GPUs
• GPU Decoder (Vegas/Premiere) - Using the Power of NVIDIA Graphic Card to Decode H.264 Video Files
• PowerDirector7 Ultra

Reported speedups: 10x, 10x, 10x, 26x, 3.5x

GPUs for Computer Vision (1/2)

• 87x: CUDA SURF – A Real-time Implementation for SURF (TU Darmstadt)
• 26x: Leukocyte Tracking: ImageJ Plugin (University of Virginia)
• 200x: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
• 100x: Image Denoising with Bilateral Filter (Wroclaw University of Technology)
• 85x: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
• 100x: Fast Optical Flow on GPU At Video Rate for Full HD Resolution (Onera)
• 8x: A Framework for Efficient and Scalable Execution of Domain-specific Templates On GPU (NEC Labs, Berkeley, Purdue)
• 13x: Accelerating Advanced MRI Reconstructions (University of Illinois)

GPUs for Computer Vision (2/2)

• 20x: GPU for Surveillance
• 13x: Fast Human Detection with Cascaded Ensembles
• 109x: Fast Sliding-Window Object Detection
• 263x: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
• 10x: Real-time Visual Tracker by Stream Processing
• 45x: A GPU Accelerated Evolutionary Computer Vision System
• 3x: Canny Edge Detection
• 300x: Audience Measurement – Real-time Video Analysis for Counting People, Face Detection and Tracking

The Embedded Vision Alliance

Readings (1/2)

• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm." IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). 2011.

• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19.

• Pauwels, Karl, et al. "A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features." Computers, IEEE Transactions on 61.7 (2012): 999-1012.

• Pratx, Guillem, and Lei Xing. "GPU computing in medical physics: A review." Medical physics 38.5 (2011): 2685-2697.

• Cope, Ben, et al. "Performance comparison of graphics processors to reconfigurable logic: a case study." Computers, IEEE Transactions on 59.4 (2010): 433-448.

Readings (2/2)

❖ “Designing Visionary Mobile Apps Using the Tegra Android Development Pack,” http://bit.ly/1jvwbgV

❖ “Getting Started With GPU-Accelerated Computer Vision Using OpenCV and CUDA,” http://bit.ly/1oMwJEG

❖ “The open standard for parallel programming of heterogeneous systems,” https://www.khronos.org/opencl/

OpenCV Acceleration

❖ GPU Module Introduction, OpenCV 2.4.9.0 documentation

❖ OpenCL Module Introduction, OpenCV documentation

❖ OpenCV-CL: Computer vision with OpenCL acceleration, AMD Developer Central, 2013.

❖ Pulli, Kari, et al. "Real-time computer vision with OpenCV." Communications of the ACM 55.6 (2012): 61-69.

❖ Allusse, Yannick, et al. "GpuCV: A GPU-accelerated framework for image processing and computer vision." Advances in Visual Computing. Springer Berlin Heidelberg, 2008. 430-439.

CUDA

❖ CUDA Programming guide. nVidia.
❖ CUDA Best Practices Guide. nVidia.
❖ CUDA Reference Manual. nVidia.
❖ CUDA Zone - NVIDIA Developer, https://developer.nvidia.com/cuda-zone
❖ Parallel Programming and Computing Platform | CUDA Home, www.nvidia.com/object/cuda_home_new.html
❖ Applications of CUDA for Imaging and Computer Vision, http://www.nvidia.com/object/imaging_comp_vision.html
❖ nVidia Performance Primitives (NPP), http://developer.nvidia.com/object/npp_home.html

OpenCL

❖ Khronos OpenCL specification, reference card, tutorials, etc.: http://www.khronos.org/opencl
❖ AMD OpenCL Resources: http://developer.amd.com/opencl
❖ NVIDIA OpenCL Resources: http://developer.nvidia.com/opencl
❖ Books
• Using OpenCL: Programming Massively Parallel Computers. IOS Press, 2012.
• OpenCL programming guide. Pearson Education, 2011.
• Heterogeneous Computing with OpenCL: Revised OpenCL 1. Newnes, 2012.
• OpenCL in Action: how to accelerate graphics and computation. Manning, 2012.

RenderScript

❖ RenderScript for Android Developer, official web site: http://developer.android.com/guide/topics/renderscript/compute.html

❖ Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in google android." First Asia-Pacific Programming Languages and Compilers Workshop. 2012.

❖ "High Performance Apps Development with RenderScript," 12th Kandroid Conference, 2013.

Web Sites and Resources

❖ Embedded Vision Alliance, http://www.embedded-vision.com
❖ GPUComputing.Net, http://www.gpucomputing.net
❖ HSA Foundation, www.hsafoundation.com

Parallel Computing with GPGPU

❖ Programming Massively Parallel Processors – A Hands-on Approach
• D. B. Kirk, W. M. Hwu
• Morgan Kaufmann, 2010
• http://www.nvidia.com/object/promotion_david_kirk_book.html
