ieee-international-conference-on-intelligent-information-hiding-and-multimedia-signal-processing
Wang,Yuan-Kai(王元凱) Computer Vision Parallelization by GPGPU p.
Wang, Yuan-Kai (王元凱)
Electrical Engineering Department, Fu Jen Catholic University (輔仁大學電機工程系)
http://www.ykwang.tw
2014/07/17
Parallelize Computer Vision by GPGPU Computing

About this Course
❖ Multicore Era for Computer Vision
❖ GPGPU
❖ Parallel Programming (CUDA, OpenCL, Renderscript)
❖ OpenCV Acceleration with GPGPU
❖ Computer Vision Acceleration

1. Multicore Era for Computer Vision
Paradigm shift from Clock Speed Race to Multicore Race

Multicore Computing
❖ What is multicore?
  • Combining multiple processors (CPU, DSP, GPGPU, FPGA) into a single chip
❖ Multicore computing is inevitable

Moore's Law
❖ In 1965, Gordon Moore (Intel co-founder) predicted
  • The number of transistors on an IC would keep doubling at a regular pace, popularly quoted as every 18 months
❖ The well-known form of the law
  • Computer performance doubles every 18 months
  • More transistors → more performance
❖ Intel's CPUs tracked the prediction for about 40 years

Review of Moore's Law
❖ The number of transistors in a chip did increase
Software enjoys the fruits of hardware's labour.

Problems
❖ Turning more transistors into more performance relied on higher clock frequency
  • This led to the Clock Speed Race
❖ But higher frequency means higher power consumption
  • High power consumption → heat problem
  • About 4 GHz has been the practical limit of the clock speed race

Paradigm Shift from 2000 AD
❖ General-purpose multicore comes of age
❖ Chip companies race to create multicore processors
  • CPU: Intel Core Duo, Quad-core, ARM v7, ...
  • DSP: TI OMAP, ARM NEON, …
  • GPU/GPGPU:
    • nVidia: GeForce/Tesla, Tegra
    • ARM: Mali-T6x
    • …

The Multicore Evolution
  • Pentium processor: optimized for a single thread
  • Core Duo
  • 5~10 years: 10~100 energy-efficient cores optimized for parallel execution
From large mono-core to multiple lightweight cores

Moore's Law Needs Multicore
❖ A single core can no longer keep up with Moore's law
❖ Multicore can keep up with Moore's law if a parallel programming model exists
[Figure: performance vs. time for single core and multi-core]

Two Architectures for Multicore
❖ Symmetric multiprocessing (SMP)
  • Multicore CPU, GPGPU, DSP multicore
  • Homogeneous computing
❖ Asymmetric multiprocessing (AMP)
  • CPU+GPGPU, CPU+FPGA, CPU+DSP
  • Heterogeneous computing

Multicore CPU (1/2)
❖ Two or more CPU cores in a chip
❖ Ex.: Intel Core i7
[Figure: multiple execution cores]

Multicore CPU (2/2)
❖ Windows Task Manager (工作管理員)
[Figure: Task Manager on two cores and on eight cores]

GPGPU (1/2)
❖ GPU (Graphics Processing Unit)
  • The processor in a graphics card that speeds up 3D graphics
  • Game playing is a major application
❖ GPGPU: General-Purpose GPU
  • General-purpose computation on the GPU in applications other than 3D graphics

GPGPU (2/2)
❖ GPGPU has more cores than CPU
  • 120 ~ 3072 cores vs. 2 ~ 8 cores (many-core vs. multi-core)
❖ GPGPU offers higher throughput than a multicore CPU on massively parallel workloads
❖ Vendors:
  • nVidia
  • Qualcomm (AMD, ATI)
  • ARM
  • Intel

It is the Software, Stupid
❖ Gary Smith and Daya Nadamuni, Gartner Dataquest, Design Automation Conf., 2006
❖ The biggest problem with SoC design is embedded software development.
❖ "The next big hurdle is programmability. It's the ability to program these multicore platforms."
❖ You can have elegant algorithms, first-pass silicon, and fancy intellectual property. But without software, the product goes nowhere.

Multicore Demands Threading

What Is Computer Vision

A Complete Vision System – Video Surveillance as an Example
[Figure: system stages: Video Capture, Image Enhance, Object/Event Detection, Object Tracking, Object/Event Recognition, Behavior Analysis, Retrieval; illustrated with Imaging, Image/Video Enhancement, Tripwire, Event Detection, Abnormal Detection, Face Recognition, Retrieval]

Computer Vision Needs High Performance Computing
❖ A CV example: video processing
  • Intelligent video surveillance
❖ Its complexity is high
  • Video (1080p RGB): about 2 megapixels × 3 channels ≈ 6 M samples per frame, 30 fps
  • 100 – 1K floating-point operations per sample
  • ⇒ about 18 – 180 GFLOPS
❖ Massive data processing
  • Intensive computation
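
The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function names `samplesPerSecond` and `requiredGflops` are hypothetical helpers introduced here, and the frame size 1920×1080 is the usual meaning of 1080p.

```cpp
#include <cassert>
#include <cstdint>

// Number of samples (pixel channel values) a video stream delivers per second.
std::int64_t samplesPerSecond(int width, int height, int channels, int fps) {
    assert(width > 0 && height > 0 && channels > 0 && fps > 0);
    return static_cast<std::int64_t>(width) * height * channels * fps;
}

// Compute rate, in GFLOPS, needed to spend flopsPerSample operations on each sample.
double requiredGflops(std::int64_t samplesPerSec, double flopsPerSample) {
    assert(samplesPerSec >= 0 && flopsPerSample >= 0.0);
    return static_cast<double>(samplesPerSec) * flopsPerSample / 1e9;
}
```

For 1080p RGB at 30 fps, `samplesPerSecond(1920, 1080, 3, 30)` gives 186,624,000 samples per second, so 100 to 1000 operations per sample need roughly 18.7 to 187 GFLOPS.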

HPC Approaches
❖ Cluster/distributed computing
  • Hadoop/MapReduce (Google, cloud computing)
  • MPI
❖ Multi-processing computing
  • Multicore (GPGPU, CPU, FPGA/DSP)
  • Programming: multi-thread
    • Windows thread, Pthread, OpenMP
    • CUDA, Renderscript, C++ AMP, …
[Figure: supercomputer]

However
❖ Can CV algorithms speed up every 18 months with multicore?
❖ Multicore is not a simple solution for upgrading CV algorithm performance
  • The transition from single core to multicore will be blocked by software
  • We are not ready to face the software programming challenges
  • It is the software, stupid.

Software, Threading, and Parallel Computing
❖ Identify parallelism: analyze the algorithm
❖ Express parallelism: write parallel code
❖ Validate parallelism: debug & verify parallel code
❖ Optimize parallelism: enhance parallel performance

Multi-threading Demands New Programming Skills
❖ Previous multi-threading techniques
  • Windows thread, pthread, OpenMP, MPI, …
❖ New techniques
  • CUDA, C++ AMP, OpenCL, Renderscript, OpenACC, MapReduce, …
❖ Concepts
  • Race condition, deadlock
  • Domain partition, function partition, …
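
To make the race-condition concept concrete, here is a minimal sketch using C++11 `std::thread` (one possible threading API, not one named on the slide). The function names `countWithMutex` and `countWithAtomic` are hypothetical. Several threads increment one shared counter; without the mutex or the atomic type, the unsynchronized `++counter` would be a data race and updates could be lost.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Each of numThreads threads adds 1 to a shared counter perThread times.
// The mutex makes every increment a critical section, so no update is lost.
long countWithMutex(int numThreads, int perThread) {
    assert(numThreads > 0 && perThread >= 0);
    long counter = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&counter, &m, perThread] {
            for (int i = 0; i < perThread; ++i) {
                std::lock_guard<std::mutex> lock(m);  // released at end of scope
                ++counter;
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}

// Same work with an atomic counter: each increment is indivisible.
long countWithAtomic(int numThreads, int perThread) {
    assert(numThreads > 0 && perThread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&counter, perThread] {
            for (int i = 0; i < perThread; ++i) counter.fetch_add(1);
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

Both variants always return `numThreads * perThread`; the atomic version avoids lock overhead when the shared state is a single integer.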

Multicore Programming Practice (MPP)
❖ Goal: write portable C/C++ programs that are "multicore ready" and platform compatible
  • Proposed by the MPP working group in the Multicore Association
    http://www.multicore-association.org/workgroup/mpp.php

OpenACC
❖ An organization that develops an API which
  • Describes a collection of compiler directives
  • Specifies loops and regions of code in standard C, C++ and Fortran
  • To be offloaded from a host CPU to an attached accelerator, including
    • APUs, GPUs, and many-core coprocessors

HSA Foundation
❖ Heterogeneous System Architecture
  • Key members: AMD, Qualcomm, ARM, Samsung, TI
❖ System architecture easing efficient use of accelerators, SoCs
  • Intended to support high-level parallel programming frameworks
    • OpenCL, C++, C#, OpenMP, Java
  • Accelerator requirements
    • Full-system SVM, memory coherency, preemption, user-mode dispatch

The ParLab in Berkeley
❖ The Parallel Computing Lab. in UC Berkeley
  http://parlab.eecs.berkeley.edu
  • The ParLab offers programmers a practical introduction to parallel programming techniques and tools on current parallel computers, emphasizing multicore and manycore computers.

HPEC
❖ High Performance Embedded Computing
  • MIT Lincoln Lab, 1997 ~

OpenCL
❖ Royalty-free, cross-platform, cross-vendor standard
  • Targeting: supercomputers → embedded systems → mobile devices
❖ Enables programming of diverse compute resources
  • CPU, GPU, DSP, FPGA …

OpenCL Working Group Members
❖ Diverse industry participation – many industry experts
❖ NVIDIA is chair, Apple is specification editor

Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
  • Vendor, hardware
❖ How to do parallel programming (Sec. 3)
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)

2. GPGPU
PC platform
Mobile platform

Why GPGPU
❖ GPGPU is many-core (vs. multi-core)
  • Suitable for massively parallel computing

GPGPU as a Coprocessor
Heterogeneous Computing

PC Platform
• Discrete GPUs
• GPGPU card as a coprocessor, attached through PCIe
From PC to PSC (Personal Super-Computer)

Mobile Platform
• Integrated GPUs
• GPGPU sub-chip as a coprocessor next to the CPU (no PCIe)
From mobile phone to mobile personal computer

GPGPU Solutions - nVidia
• Compute architecture: Tesla, Fermi, Kepler, …
• PC
  • GeForce, Quadro
  • Tesla
    • 870, 1060, 2070, K40
• Mobile
  • Tegra: …, 4, K1 (192 cores)
It's Tegra K1 Everywhere at Google I/O, Embedded Vision Alliance, 2014/7/7.

GPGPU Solutions – Qualcomm/AMD
❖ Qualcomm, AMD, ATI
❖ APU: integrated CPU+GPU
❖ Low energy consumption
❖ PC (AMD): FirePro
❖ Mobile (Snapdragon):
  • Adreno: 330 (32 cores)

GPGPU Solutions - ARM
❖ Mali
❖ Samsung Exynos, MediaTek
❖ Compute engine after T-600
❖ Exynos 5
❖ At most 8 cores (Mali-T678)

Intel – Multicore CPU
• PC (Xeon Phi)
  • Iris Pro GPU
  • Knights Landing: 60 cores
  • Knights Corner: 48 CPU cores, PCIe
• Mobile
  • Haswell
  • Atom

Applications of GPGPU
http://developer.nvidia.com/category/zone/cuda-zone

Heterogeneous Architecture
❖ Host: CPU
❖ Device: GPGPU
❖ Notice: memory hierarchy in device

GPGPUs Architecture - nVidia
❖ GT200
  • GTX 260/280, Quadro 5800, Tesla 1060
❖ Fermi
  • Tesla 2060
[Figure: CPU (host), multicore: control unit, a few ALUs, cache, DRAM; GPU (device), many-core: many ALUs, DRAM]

nVidia GPGPU Architecture
❖ SM/SP (Streaming Multiprocessor/Streaming Processor) + shared memory + DRAM

Memory Hierarchy
❖ On-Chip Memory
  • Registers
  • Shared Memory
  • Constant Memory (on-chip cache; the data itself lives in off-chip device memory)
  • Texture Memory (on-chip cache; the data itself lives in off-chip device memory)
❖ Off-Chip Memory
  • Local Memory
  • Global Memory

GPGPU vs. FPGA
❖ GPU: nVidia GeForce GTX 280, GTX 580
❖ FPGA: Xilinx Virtex-4, Virtex-5
A Comparison of FPGA and GPU for Real-Time Phase-Based Optical Flow, Stereo, and Local Image Features, IEEE Transactions on Computers, 2012.

GPGPU vs. FPGA
❖ GPU: nVidia GeForce 7900 GTX
❖ FPGA: Xilinx Virtex-4
Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study, IEEE Transactions on Computers, 2010.

GPGPU vs. FPGA vs. Multicore
❖ Application: 2-D image convolution
  • GPU: nVidia GeForce GTX 295
  • FPGA: Altera Stratix III E260
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications, ACM/SIGDA International Symposium on FPGA, 2012.

However, GPGPU May Not Always Improve Speed & Energy

Hardware vs. Software
❖ GPGPU hardware: nVidia, Qualcomm, ARM, Intel
❖ Parallel programming: CUDA, OpenCL, RenderScript, C++ AMP

Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to do parallel programming (Sec. 3)
  • CUDA, Renderscript, OpenCL, …
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)

3. Parallel Programming
Multi-threading
Programming languages for parallel computing

Parallel Computing
❖ Serial computing
❖ Parallel computing
[Figure: a CPU/GPU with four cores]

Parallel Programming
❖ Much code is written in C/C++/Java
  • Especially algorithmic programs
❖ Can we write GPGPU parallel programs in C/C++/Java?
❖ However, C/C++ is sequential
  • Three control structures of C/C++/Java: sequence, selection, repetition

Multi-threading
❖ Multi-threading is the fundamental concept for parallel programming
  • Some techniques are ready
    • Pthread, Win32 thread, OpenMP, MPI, Intel TBB (Threading Building Blocks), ...
  • New techniques
    • CUDA, OpenCL, Renderscript, OpenACC, C++ AMP, ...

Parallel Programming Models

Parallel Programming in Sequential Language
❖ Do we need to learn new languages for multi-threading?
  • No
❖ Write multi-threaded code in C/C++
  • Add functions/directives to C/C++ for multi-threading
  • That is how current solutions work
    • pthread, Win32 thread, OpenMP, MPI, CUDA, OpenCL, ...

Decompose the Problem
❖ Two basic approaches to partition computational work
  • Domain decomposition
    • Partition the data used in solving the problem
  • Function decomposition
    • Partition the jobs (functions) from the overall work (problem)
[Figure: GPGPU and CPU cooperate]
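
Function decomposition can be sketched as two different jobs running at the same time on the same read-only data. This is an illustrative example using C++11 `std::thread`; the struct `Stats` and the function `sumAndMax` are hypothetical names, not part of the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

struct Stats {
    long sum;
    int maxValue;
};

// Function decomposition: one thread computes the sum while another finds
// the maximum. Each job writes only its own result variable, so the two
// threads never touch the same writable memory.
Stats sumAndMax(const std::vector<int>& data) {
    assert(!data.empty());
    long sum = 0;
    int maxValue = data[0];
    std::thread sumJob([&data, &sum] {
        sum = std::accumulate(data.begin(), data.end(), 0L);
    });
    std::thread maxJob([&data, &maxValue] {
        maxValue = *std::max_element(data.begin(), data.end());
    });
    sumJob.join();
    maxJob.join();
    return Stats{sum, maxValue};
}
```

Domain decomposition, by contrast, would run the same job on different parts of `data`.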

Multi-Threading
❖ A program running
  • In serial
  • In parallel
http://en.wikipedia.org/wiki/Thread_(computer_science)

Domain Decomposition (1/3)
❖ An image example
  • It is 2D data
  • Three popular partition ways

Domain Decomposition (2/3)
❖ Domain data are usually processed by loops
    for (i = 0; i < height; i++)
        for (j = 0; j < width; j++)
            img2[i][j] = RemoveNoise(img1[i][j]);
[Figure: original image (img1) and enhanced image (img2), the X-ray image of a circuit board, indexed by row i and column j]
Related models: SIMD, SPMD, SIMT

Domain Decomposition (3/3)
❖ A three-block partition example (each thread uses its own loop variables)
    // Thread 1: rows [0, height/3)
    for (int i = 0; i < height/3; i++)
        for (int j = 0; j < width; j++)
            img2[i][j] = RemoveNoise(img1[i][j]);
    // Thread 2: rows [height/3, height*2/3)
    for (int i = height/3; i < height*2/3; i++)
        for (int j = 0; j < width; j++)
            img2[i][j] = RemoveNoise(img1[i][j]);
    // Thread 3: rows [height*2/3, height)
    for (int i = height*2/3; i < height; i++)
        for (int j = 0; j < width; j++)
            img2[i][j] = RemoveNoise(img1[i][j]);
[Figure: fork (threads) into subdomain1 (i = 0..3), subdomain2 (i = 4..7), subdomain3 (i = 8..11), then join (barrier); OpenMP, CUDA (SPMD)]
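
The fork/join pattern of the three-block example can be sketched with C++11 `std::thread`. This is only an illustration under stated assumptions: `RemoveNoise` here is a hypothetical stand-in that clamps a pixel to [0, 255] (the slides do not define it), and `Image`, `processRows` and `denoiseParallel` are names introduced for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Image = std::vector<std::vector<int>>;

// Hypothetical stand-in for the slide's RemoveNoise(): clamp to [0, 255].
int RemoveNoise(int pixel) {
    return pixel < 0 ? 0 : (pixel > 255 ? 255 : pixel);
}

// Work of one thread: process rows [rowBegin, rowEnd) of img1 into img2.
void processRows(const Image& img1, Image& img2, int rowBegin, int rowEnd) {
    for (int i = rowBegin; i < rowEnd; ++i)
        for (std::size_t j = 0; j < img1[i].size(); ++j)
            img2[i][j] = RemoveNoise(img1[i][j]);
}

// Fork numThreads workers over contiguous row blocks, then join (barrier).
Image denoiseParallel(const Image& img1, int numThreads) {
    assert(numThreads > 0);
    const int height = static_cast<int>(img1.size());
    Image img2 = img1;  // output with the same shape as the input
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        const int rowBegin = height * t / numThreads;
        const int rowEnd = height * (t + 1) / numThreads;
        workers.emplace_back(processRows, std::cref(img1), std::ref(img2),
                             rowBegin, rowEnd);
    }
    for (auto& w : workers) w.join();
    return img2;
}
```

With `numThreads = 3` the block boundaries are `height/3` and `height*2/3`, as in the slide. The threads write disjoint rows, so no locking is needed.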

GPGPU Programming: SIMT model
❖ CPU ("host") program often written in C or C++
❖ GPU code is written as a sequential kernel in (usually) a C or C++ dialect
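
The idea of a "sequential kernel" can be emulated on the CPU in plain C++. This is not GPU code: the names `scaleKernel` and `launchScale` are hypothetical, and the host loop below runs the logical threads one after another, whereas a GPU would run them in parallel.

```cpp
#include <cassert>
#include <vector>

// "Kernel": sequential code for ONE logical thread, identified by tid.
void scaleKernel(int tid, const std::vector<float>& in,
                 std::vector<float>& out, float factor) {
    if (tid < static_cast<int>(in.size()))  // guard threads beyond the data
        out[tid] = factor * in[tid];
}

// Host-side "launch": invoke the kernel once per logical thread.
std::vector<float> launchScale(const std::vector<float>& in, float factor,
                               int numThreads) {
    assert(numThreads >= 0);
    std::vector<float> out(in.size(), 0.0f);
    for (int tid = 0; tid < numThreads; ++tid)
        scaleKernel(tid, in, out, factor);
    return out;
}
```

The kernel contains no loop over the data; the parallelism comes entirely from launching many instances with different indices.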

GPGPU Programming Techniques
• CUDA
• OpenCL
• C++ AMP
• Renderscript

CUDA

CUDA
❖ CUDA: Compute Unified Device Architecture
❖ Parallel programming for nVidia's GPGPU
❖ Use C/C++ language
  • Java, Fortran, and Matlab can also be used
❖ When executing CUDA programs, the GPU operates as a coprocessor to the main CPU

CUDA Hardware Environment: CPU+GPU
❖ CPU
  • Organizes, interprets, and communicates information
❖ GPU
  • Handles the core processing on large quantities of parallel information
  • Compute-intensive portions of applications that are executed many times, but on different data, are extracted from the main application and compiled to execute in parallel on the GPU
[Figure: CPU and GPU connected by PCI-E]

CUDA Software Stack

Processing Flow on CUDA
1. Allocate device memory
2. Copy the processing data (main memory → memory for GPU)
3. Instruct the processing (CPU → GPU)
4. Execute in parallel in each core
5. Copy the result (memory for GPU → main memory)
6. Release device memory

Programming with Memory Hierarchy
❖ Locality principle
  • Temporal locality
  • Spatial locality
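
A small CPU-side sketch shows what the two kinds of locality mean for a loop nest. The functions `sumRowMajor` and `sumColumnMajor` are hypothetical names; both compute the same sum over a matrix stored row by row, but they visit memory in different orders.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-by-row traversal of a row-major matrix: consecutive iterations touch
// adjacent addresses (spatial locality), and the accumulator is reused in
// every iteration (temporal locality).
long sumRowMajor(const std::vector<int>& m, int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    assert(static_cast<std::size_t>(rows) * cols == m.size());
    long sum = 0;
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            sum += m[static_cast<std::size_t>(i) * cols + j];
    return sum;
}

// Column-by-column traversal of the same data: each step jumps cols
// elements ahead, so spatial locality is poor for large matrices.
long sumColumnMajor(const std::vector<int>& m, int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    assert(static_cast<std::size_t>(rows) * cols == m.size());
    long sum = 0;
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            sum += m[static_cast<std::size_t>(i) * cols + j];
    return sum;
}
```

The results are identical; on large matrices the row-major version usually runs faster because it uses each fetched cache line fully.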

Example - Hello World (1/3)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void hello(char* hello1, char* hello2);  // kernel, defined in part 2/3

int main() {
    char src[12] = "Hello World";
    char h_hello[12];
    char* d_hello1;
    char* d_hello2;

    // Allocate device memory
    cudaMalloc((void**)&d_hello1, sizeof(char) * 12);
    cudaMalloc((void**)&d_hello2, sizeof(char) * 12);
    // Copy the input string to the device
    cudaMemcpy(d_hello1, src, sizeof(char) * 12, cudaMemcpyHostToDevice);
    // Call the kernel function with one block of one thread
    hello<<<1, 1>>>(d_hello1, d_hello2);
[Figure: host buffers src and h_hello; device buffers d_hello1 and d_hello2]

Example - Hello World (2/3)
❖ Kernel function
__global__ void hello(char* hello1, char* hello2) {
    int k;
    // Copy the characters one by one on a single GPU thread
    for (k = 0; hello1[k] != '\0'; k++) {
        hello2[k] = hello1[k];
    }
    hello2[k] = '\0';  // copy the terminator so the host can print the string
}
No parallel processing in this example
[Figure: host buffers src and h_hello; device buffers d_hello1 and d_hello2]

Example - Hello World (3/3)
    // Copy the result back to the host
    cudaMemcpy(h_hello, d_hello2, sizeof(char) * 12, cudaMemcpyDeviceToHost);
    printf("%s\n", h_hello);

    // Release device memory
    cudaFree(d_hello1);
    cudaFree(d_hello2);

    system("pause");  // Windows only: keeps the console window open
    return 0;
}
Result: [Figure: console output]

OpenCL
Standard

The Inspiration for OpenCL

What's OpenCL
❖ One code tree can be executed on CPUs, GPUs, DSPs and hardware
  • Dynamically interrogate system load and balance across available processors
❖ Powerful, low-level flexibility
  • Foundational access to compute resources for higher-level engines, frameworks and languages

Broad OpenCL Implementer Adoption
❖ Multiple conformant implementations shipping on desktop and mobile
❖ Android ICD extension released in latest extension specification
❖ Multiple implementations shipping in Android NDK

OpenCL Enables Portability
❖ C-to-gates programs are proprietary

Altera OpenCL SDK for FPGAs

NVIDIA OpenCL SDK for GPU

AMD OpenCL Optimization Case Study
❖ Platform
  • AMD Phenom II X4 965 CPU (quad core)
  • ATI Radeon HD 5870 GPU
❖ Unoptimized CPU performance: 1 GFLOP/s
❖ Optimized CPU performance reaches: 4 GFLOP/s
❖ Optimized GPU performance reaches: 50 GFLOP/s

Example - Hello World (1/3)
[Code figure: including; declaring]

Example - Hello World (2/3)
[Code figure: creating]

Example - Hello World (2/3, continued)
[Code figure: creating; do; copy to host & display]

Example - Hello World (3/3)
[Code figure: kernel function]

C++ AMP
Microsoft

What's C++ AMP (1/2)
❖ Microsoft's C++ AMP (Accelerated Massive Parallelism)
  • Part of Visual C++, integrated with Visual Studio, built on Direct3D
  • "Performance for the mainstream"
❖ STL-like library for multidimensional array data
  • Special convenience support for 1, 2, and 3 dimensional arrays on CPU or GPU
  • C++ AMP runtime handles CPU<->GPU data copying
  • Tiles enable efficient processing of sub-arrays

What's C++ AMP (2/2)
❖ parallel_for_each
  • Executes a kernel (C++ lambda) at each point in the extent
  • restrict() clause specifies where to run the kernel: cpu (default) or amp (GPU; named direct3d in pre-release versions)

Example - Hello World (1/2)
[Code figure: declaring & copying to device]

Example - Hello World (2/2)
[Code figure: do; display]

RenderScript
Google Android

What's Renderscript (1/2)
❖ Higher-level than CUDA or OpenCL: simpler & less performance control
  • Emphasis on mobile devices & cross-SoC performance portability
❖ Programming model
  • C99-based kernel language, JIT-compiled, single input-single output
  • Automatic Java class reflection
  • Intrinsics: built-in, highly-tuned operations, e.g. ScriptIntrinsicConvolve3x3
  • Script groups combine kernels to amortize launch cost & enable kernel fusion

What's Renderscript (2/2)
❖ Data type:
  • 1D/2D collections of elements, C types like int and short2, types include size
  • Runtime type checking
❖ Parallelism
  • Implicit: one thread per data element, atomics for thread-safe access
  • Thread scheduling not exposed, VM-decided

RenderScript Architecture

Low Level Virtual Machine
❖ Low Level Virtual Machine (LLVM) is a compiler infrastructure

Offline Compiler Flow

Renderscript Compiler: libbcc

Renderscript Project Framework

Example - Hello World (1/8 to 8/8)
[Code figures, grouped by source file]
  • HelloWorld.java (2/8, 3/8)
  • HelloWorldView.java (4/8, 5/8)
  • HelloWorldRS.java (6/8, 7/8)
  • ScriptC_helloworld.java (7/8)
  • HelloWorld.rs (8/8)

Comparison (1/2)
❖ Renderscript vs. Native (NDK) vs. Java (SDK)
  • OS: Honeycomb v3.2 (CPU only)
Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in Google Android." in Proc. First Asia-Pacific Programming Languages and Compilers Workshop (APPLC), 2012.

Comparison (2/2)
❖ OpenCL & CUDA
  • Sobel filter with (CMw) and without (CMw/o) constant memory
OpenCL's portability does not fundamentally affect its performance
Fang, Jianbin, Ana Lucia Varbanescu, and Henk Sips. "A comprehensive performance comparison of CUDA and OpenCL." in Proc. International Conference on Parallel Processing (ICPP), 2011.

GPGPU Programming
Performance: more control, better performance
Productivity: ease of use, quick programming, portability

Parallelization
❖ Multicore/Multi-threading
❖ Data Parallelization
  • Data distribution
  • Parallel convolution
  • Reduction algorithm
  • Amdahl's law
❖ Memory Hierarchy Management
  • Locality principle
    • Program accesses a relatively small portion of the address space at any instant of time
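
Amdahl's law, listed above, bounds the speedup of a program in which only a fraction p of the run time can be parallelized over n processors: speedup = 1 / ((1 - p) + p / n). A minimal sketch follows; the function name `amdahlSpeedup` is introduced here for illustration.

```cpp
#include <cassert>

// Amdahl's law: p is the parallelizable fraction of the run time (0..1),
// n is the number of processors working on that fraction.
double amdahlSpeedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, with 90% of the work parallelized, even 240 processors give a speedup of only about 9.6, because the serial 10% dominates.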

Multi-thread Programming with the Discipline of Parallelization
❖ Identify parallelism: analyze the algorithm
❖ Express parallelism: write parallel code
❖ Validate parallelism: debug & verify parallel code
❖ Optimize parallelism: enhance parallel performance

4. OpenCV Acceleration

What Is OpenCV
❖ A very popular computer vision library
  • 6M downloads
  • BSD license
  • More than 2000 CV functions
  • Modularized and efficient
  • Optimization
    • Intel SSE, IPP, TBB
    • ARM NEON & GLSL (Tegra)
    • CUDA, OpenCL

OpenCV Modules
❖ Image/video I/O, processing, display (core, imgproc, highgui)
❖ Object/feature detection (objdetect, features2d, nonfree)
❖ Geometry-based monocular or stereo computer vision (calib3d, stitching, videostab)
❖ Computational photography (photo, video, superres)
❖ Machine learning & clustering (ml, flann)
❖ CUDA and OpenCL GPU acceleration (gpu, ocl)
Normal CV modules: 14
Acceleration modules: 2

OpenCV GPU Module
❖ Implemented using NVIDIA CUDA Runtime API
❖ Latest version: 2.4.9
  • Utilizing multiple GPUs
❖ Implemented modules: 11
❖ Implemented functions: 270
Focus on PC platform
Not fully compatible with mobile GPGPU on Android

CUDA Matrix Operations
❖ Point-wise matrix math
  • gpu::add(), ::sum(), ::divide(), ::sqrt(), ::sqrSum(), ::meanStdDev(), ::min(), ::max(), ::minMaxLoc(), ::magnitude(), ::norm(), ::countNonZero(), ::cartToPolar(), etc.
❖ Matrix multiplication
  • gpu::gemm()
❖ Channel manipulation
  • gpu::merge(), ::split()

CUDA Geometric Operations
❖ Image resize with sub-pixel interpolation
  • gpu::resize()
❖ Image rotate with sub-pixel interpolation
  • gpu::rotate()
❖ Image warp (e.g., panoramic stitching)
  • gpu::warpPerspective(), ::warpAffine()

CUDA Other Math and Geometric Operations
❖ Integral images
  • gpu::integral(), ::sqrIntegral()
❖ Custom geometric transformation (e.g., lens distortion correction)
  • gpu::remap(), ::buildWarpCylindricalMaps(), ::buildWarpSphericalMaps()

CUDA Image Processing (1/2)
❖ Smoothing
  • gpu::blur(), ::boxFilter(), ::GaussianBlur()
❖ Morphological
  • gpu::dilate(), ::erode(), ::morphologyEx()
❖ Edge detection
  • gpu::Sobel(), ::Scharr(), ::Laplacian(), gpu::Canny()
❖ Custom 2D filters
  • gpu::filter2D(), ::createFilter2D_GPU(), ::createSeparableFilter_GPU()
❖ Color space conversion
  • gpu::cvtColor()

CUDA Image Processing (2/2)
❖ Image blending
  • gpu::blendLinear()
❖ Template matching (automated inspection)
  • gpu::matchTemplate()
❖ Gaussian pyramid (scale invariant feature/object detection)
  • gpu::pyrUp(), ::pyrDown()
❖ Image histogram
  • gpu::calcHist(), gpu::histEven(), gpu::histRange()
❖ Contrast enhancement
  • gpu::equalizeHist()

CUDA De-noising
❖ Gaussian noise removal
  • gpu::FastNonLocalMeansDenoising()
❖ Edge preserving smoothing
  • gpu::bilateralFilter()

CUDA Fourier and MeanShift
❖ Fourier analysis
  • gpu::dft(), ::convolve(), ::mulAndScaleSpectrums(), etc.
❖ MeanShift
  • gpu::meanShiftFiltering(), ::meanShiftSegmentation()

CUDA Shape Detection
❖ Line detection (e.g., lane detection, building detection, perspective correction)
  • gpu::HoughLines(), ::HoughLinesDownload()
❖ Circle detection (e.g., cells, coins, balls)
  • gpu::HoughCircles(), ::HoughCirclesDownload()

CUDA Object Detection
❖ HAAR and LBP cascaded adaptive boosting (e.g., face, nose, eyes, mouth)
  • gpu::CascadeClassifier_GPU::detectMultiScale()
❖ HOG detector (e.g., person, car, fruit, hand)
  • gpu::HOGDescriptor::detectMultiScale()

CUDA Object Recognition
❖ Interest point detectors
  • gpu::cornerHarris(), ::cornerMinEigenVal(), ::SURF_GPU, ::FAST_GPU, ::ORB_GPU(), ::GoodFeaturesToTrackDetector_GPU()
❖ Feature matching
  • gpu::BruteForceMatcher_GPU(), ::BFMatcher_GPU()

CUDA Stereo and 3D
❖ RANSAC
  • gpu::solvePnPRansac()
❖ Stereo correspondence (disparity map)
  • gpu::StereoBM_GPU(), ::StereoBeliefPropagation(), ::StereoConstantSpaceBP(), ::DisparityBilateralFilter()
❖ Represent stereo disparity as 3D or 2D
  • gpu::reprojectImageTo3D(), ::drawColorDisp()

CUDA Optical Flow
❖ Dense/sparse optical flow
  • gpu::FastOpticalFlowBM(), ::PyrLKOpticalFlow, ::BroxOpticalFlow(), ::FarnebackOpticalFlow(), ::OpticalFlowDual_TVL1_GPU(), ::interpolateFrames()

CUDA Background Segmentation
❖ Foreground/background segmentation (e.g., object detection/removal, motion tracking, background removal)
  • gpu::FGDStatModel, ::GMG_GPU, ::MOG_GPU, ::MOG2_GPU

Performance of OpenCV GPU Accelerators on PC

5. Computer Vision Acceleration on PC
Image enhancement (HDR)
Feature extraction
Video surveillance cloud

HDR and Image Enhancement
Wang,Yuan-Kai(王元凱) Computer Vision Parallelization by GPGPU p.
❖ Restore and enhance an image❖ Its complexity is high for large images
HDR Image Enhancement
Original RestoredComplexity:O(N2) ~ O(N3)
Algorithms for Image Restoration
❖ Wiener Filter
❖ Histogram Based Approach
• Histogram Equalization, Histogram Modification, …
❖ Retinex
• Path-based Retinex
• Recursive Retinex
• Center/surround Retinex
  • Has no iterative process, so it is suitable for parallelization
  • Multi-Scale Retinex with Color Restoration (MSRCR) [Rahman et al. 1997]
MSRCR Algorithm
• The MSRCR output
• The original image distribution in the ith spectral band
• The kth Gaussian surround function
• The convolution operation
• The weight
• The color restoration factor in the ith spectral band
• N: the number of spectral bands
• The gain constant
• The constant that controls the strength of the nonlinearity
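For reference, the MSRCR formula as it is usually written in the literature following Rahman et al. uses exactly the terms listed on this slide. The symbol names below (R, I, F, W, C, β, α) and the number of scales K are the conventional ones and are an assumption about the slide's own notation:

```latex
R_{\mathrm{MSRCR}_i}(x,y) = C_i(x,y) \sum_{k=1}^{K} W_k
  \left\{ \log I_i(x,y) - \log\left[ F_k(x,y) * I_i(x,y) \right] \right\}

C_i(x,y) = \beta \left\{ \log\left[ \alpha\, I_i(x,y) \right]
  - \log\left[ \sum_{j=1}^{N} I_j(x,y) \right] \right\}
```

Here R is the MSRCR output, I_i the image in the ith spectral band, F_k the kth Gaussian surround function, * the convolution, W_k the weight, C_i the color restoration factor, N the number of spectral bands, β the gain constant, and α the constant that controls the strength of the nonlinearity.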
The Method
• CPU: copy data from CPU to GPGPU
• GPGPU: Gaussian blur → log-domain processing → normalization → histogram stretching
• CPU: copy data from GPGPU to CPU
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm." Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on. IEEE, 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19.
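The stages named on this slide can be sketched on the CPU as a reference before each one is ported to a GPGPU kernel. The following is a simplified single-scale, 1D illustration in C++, not the authors' implementation: the +1 offset inside the logarithm, the clamped borders, the min-max normalization to [0, 255], and all function names are assumptions, and histogram stretching is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stage 1: Gaussian blur (1D stand-in for the 2D surround), borders clamped.
std::vector<float> gaussian_blur_1d(const std::vector<float>& in, float sigma) {
    assert(sigma > 0.0f);
    const int radius = static_cast<int>(std::ceil(3.0f * sigma));
    std::vector<float> kernel(2 * radius + 1);
    float norm = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        kernel[k + radius] =
            std::exp(-static_cast<float>(k * k) / (2.0f * sigma * sigma));
        norm += kernel[k + radius];
    }
    const int n = static_cast<int>(in.size());
    std::vector<float> out(in.size(), 0.0f);
    for (int x = 0; x < n; ++x) {  // each output sample is independent
        float acc = 0.0f;
        for (int k = -radius; k <= radius; ++k) {
            const int xx = std::min(std::max(x + k, 0), n - 1);
            acc += kernel[k + radius] * in[xx];
        }
        out[x] = acc / norm;
    }
    return out;
}

// Stage 2: log-domain processing, log(I) - log(F * I).
// The +1 offset (an assumption) avoids log(0).
std::vector<float> log_domain(const std::vector<float>& in,
                              const std::vector<float>& surround) {
    assert(in.size() == surround.size());
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = std::log(in[i] + 1.0f) - std::log(surround[i] + 1.0f);
    return out;
}

// Stage 3: min-max normalization to [0, 255] (an assumed normalization).
std::vector<float> normalize_0_255(const std::vector<float>& in) {
    std::vector<float> out(in.size(), 0.0f);
    if (in.empty()) return out;
    const auto mm = std::minmax_element(in.begin(), in.end());
    const float lo = *mm.first;
    const float range = *mm.second - lo;
    if (range <= 0.0f) return out;  // constant input maps to all zeros
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = 255.0f * ((in[i] - lo) / range);
    return out;
}

// The three stages chained together.
std::vector<float> retinex_1d(const std::vector<float>& in, float sigma) {
    return normalize_0_255(log_domain(in, gaussian_blur_1d(in, sigma)));
}
```

Every loop over output samples is free of cross-iteration dependencies, which is what makes each stage a candidate for one-thread-per-pixel execution on the GPGPU.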
Parallelization by GPGPU
❖ Multicore/Multi-threading
• Tesla C1060: 240 SPs (Stream Processors)
• CUDA: Thread, Block, Grid
❖ Data Parallelization
• Parallel convolution
[Figure: parallel reduction of A(0) to A(7) on processing elements PE 0 to 7 over time steps t0 to t5: pairwise sums A(0)+A(1), A(2)+A(3), A(4)+A(5), A(6)+A(7), then A(0)+A(1)+A(2)+A(3) and A(4)+A(5)+A(6)+A(7), then the total sum]
[Figure: image pixels partitioned among processing elements PEi]
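The reduction pictured on this slide sums eight values in three parallel steps instead of seven sequential additions. A minimal sequential C++ sketch of the same pairwise pattern follows; `tree_reduce` and `ReductionResult` are hypothetical names, and each inner-loop iteration stands in for the work one processing element would do in parallel on a GPGPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ReductionResult {
    double sum;  // total of all elements
    int steps;   // number of parallel steps (tree depth)
};

// Pairwise (tree) reduction: at each step, element i absorbs element
// i + stride. The additions within one step are mutually independent,
// so they could all run at the same time on separate processing elements.
ReductionResult tree_reduce(std::vector<double> a) {
    assert(!a.empty());
    int steps = 0;
    for (std::size_t stride = 1; stride < a.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < a.size(); i += 2 * stride) {
            a[i] += a[i + stride];
        }
        ++steps;
    }
    return {a[0], steps};
}
```

For the eight values 0 through 7, `tree_reduce` returns a sum of 28 after 3 steps, matching the three levels of the diagram.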
Our Memory Hierarchy
• Stages: parallel Gaussian blur, parallel log-domain processing, parallel normalization, parallel histogram stretching
• Memories: texture memory, constant memory, global memory, shared memory
Experimental Results (1/2) and (2/2)
[Figures: original images, CPU results, and GPGPU results]
GPGPU Speedup over CPU
[Figure: speedup chart with labelled values 74x and 2x]
• Ideal speedup: 240 × (1.296 GHz / 3 GHz) ≈ 103.7
• NPP: NVIDIA Performance Primitives
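The ideal-speedup estimate on this slide scales the GPU core count by the ratio of GPU to CPU clock rates. A minimal C++ sketch of that arithmetic is given below, assuming (as the slide's formula does) a single CPU core and equal work per cycle per core. The function names are hypothetical, and reading the 74x label as the measured speedup is also an assumption.

```cpp
#include <cassert>

// Ideal speedup = number of GPU cores x (GPU clock / CPU clock),
// assuming one CPU core and equal work per cycle on every core.
double ideal_speedup(int gpu_cores, double gpu_clock_ghz, double cpu_clock_ghz) {
    assert(gpu_cores > 0 && gpu_clock_ghz > 0.0 && cpu_clock_ghz > 0.0);
    return gpu_cores * (gpu_clock_ghz / cpu_clock_ghz);
}

// Fraction of the ideal speedup that a measured speedup achieves.
double parallel_efficiency(double measured_speedup, double ideal) {
    assert(ideal > 0.0);
    return measured_speedup / ideal;
}
```

With the slide's values, `ideal_speedup(240, 1.296, 3.0)` is about 103.7, so a measured 74x would correspond to roughly 71% of the ideal.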
Feature Extraction (SIFT)
What Is SIFT
❖ SIFT
• Scale Invariant Feature Transform
❖ Invariance of feature points
• Translation
• Rotation
• Scale
Applications of SIFT
❖ Object recognition/tracking
❖ Image retrieval
❖ Autostitch
Parallelize SIFT by GPGPU
• CPU: Intel Q9400, quad cores (2.66 GHz)
• GPU: GeForce GTS 250, 128 SPs (1.836 GHz)
Experimental Results
[Figure: CPU and GPU results]
Execution Time
[Figure: execution time in ms, CPU vs. GPGPU]
• CPU: 10 seconds on average
• GPGPU: 0.8 seconds on average
Speedup
13x speedup on average
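As a quick check against the average execution times reported for the CPU (10 seconds) and the GPGPU (0.8 seconds), the ratio of the two averages is slightly below the stated 13x. The gap is consistent with rounding of the averages, or with averaging per-image speedups rather than dividing average times; which of these applies is an assumption, not something the slides state.

```latex
\frac{\bar{t}_{\mathrm{CPU}}}{\bar{t}_{\mathrm{GPGPU}}}
  = \frac{10\ \mathrm{s}}{0.8\ \mathrm{s}} = 12.5 \approx 13
```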
Video Surveillance Cloud
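GPGPU Cloud Video Surveillance System
• Restricted-area intrusion detection
• PTZ camera tracking
• Camera anomaly detection
• High-efficiency video event browsing system
• Central video and message management system
• Multi-resolution wide-area surveillance system
• Outdoor parking lot
• Vacant-space detection
• Illegal-parking detection
• Face detection in dynamic scenes
• Storage Area Network
• PC, mobile device
• Multi-core
• Hypervisor
• GPGPU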
Private cloud server room
6. Computer Vision Acceleration on Android
OpenCV
RenderScript
OpenCV on Android
OpenCV4Android SDK
❖ Enables development of Android applications that use the OpenCV library
❖ Uses the Java Native Interface (JNI) to access C code directly
❖ Supports NVIDIA's Tegra Android Development Pack (TADP)
• Not fully compatible with the GPU module
System Framework
Two Methods to Call OpenCV
❖ Using Java API
❖ Using native C++
OpenCV for Android SDK by GPU (1/5 to 5/5)
RenderScript on Android with GPU Acceleration
RenderScript on Android with GPU (1/5 to 5/5)
RenderScript Image Processing Intrinsics
Name | Operation
ScriptIntrinsicConvolve3x3, ScriptIntrinsicConvolve5x5 | Performs a 3x3 or 5x5 convolution.
ScriptIntrinsicBlur | Performs a Gaussian blur. Supports grayscale and RGBA buffers and is used by the system framework for drop shadows.
ScriptIntrinsicYuvToRGB | Converts a YUV buffer to RGB. Often used to process camera data.
ScriptIntrinsicColorMatrix | Applies a 4x4 color matrix to a buffer.
ScriptIntrinsicBlend | Blends two allocations in a variety of ways.
ScriptIntrinsicLUT | Applies a per-channel lookup table to a buffer.
ScriptIntrinsic3DLUT | Applies a color cube with interpolation to a buffer.
ScriptIntrinsicHistogram | Intrinsic histogram filter.
Gaussian Blur Example by RenderScript Intrinsic
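// Create a RenderScript context and a Gaussian-blur intrinsic for RGBA_8888 data.
RenderScript rs = RenderScript.create(theActivity);
ScriptIntrinsicBlur theIntrinsic = ScriptIntrinsicBlur.create(rs, Element.U8_4(rs));
// Wrap the input and output bitmaps in Allocations.
Allocation tmpIn = Allocation.createFromBitmap(rs, inputBitmap);
Allocation tmpOut = Allocation.createFromBitmap(rs, outputBitmap);
// Blur with radius 25 (the maximum the intrinsic supports), then copy the result back.
theIntrinsic.setRadius(25.f);
theIntrinsic.setInput(tmpIn);
theIntrinsic.forEach(tmpOut);
tmpOut.copyTo(outputBitmap);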
RenderScript Intrinsic Example (1/2 and 2/2)
Blur Intrinsic Performance Analysis
Performance of RenderScript Intrinsics
❖ On the new Nexus 7
❖ Relative to equivalent multithreaded C implementations
RenderScript Image Processing Benchmarks (1/2)
❖ CPU only on a Galaxy Nexus device
RenderScript Image Processing Benchmarks (2/2)
Acceleration of Retinex Using RenderScript
❖ Le et al. present rsRetinex, an optimized implementation of the Retinex algorithm using RenderScript.
❖ Their experimental results show that rsRetinex gains up to a fivefold speedup across different image resolutions.
• Le, Duc Phuoc Phat Dat, et al. "Acceleration of Retinex Algorithm for Image Processing on Android Device Using Renderscript." In Proc. The 8th International Conference on Robotic, Vision, Signal Processing & Power Applications, 2014.
Mobile GPGPU List
GPU | Adoption | OpenCL/CUDA | OpenCV | RenderScript
Qualcomm Adreno | Google Nexus 10, Google new Nexus 7, SONY Xperia Tablet Z2 | OpenCL 1.2 (302~420) | OCL module | Android 4.0 and later
ARM Mali | Nexus 10, Samsung Note 3, Samsung Note PRO 12.2, Meizu MX3 | OpenCL 1.1 (T604~T678) | OCL module | Android 4.0 and later
nVIDIA Tegra | Google Project Tango, HTC Nexus 9, Microsoft Surface 2, Nvidia Shield Note 7 | CUDA, OpenCL 1.2 (K1 only) | GPU module | Android 4.0 and later (K1 only)
PowerVR | iPad Air, iPad mini | OpenCL 1.2 | OCL module | none
Intel HD Graphics | Microsoft Surface Pro 3, Sony VAIO Tap 11 | OpenCL 1.1 | OCL module | none
AnandTech
Nvidia CEO sees future in cars and gaming, 2014/5/19, CNet.
7. Summary
GPGPU
❖ Single-core → Multi-core → Many-core
❖ PC
• nVidia Tesla + CUDA/OpenCV
❖ Android
• Qualcomm Adreno + OpenCV ocl
• nVidia Tegra + OpenCV gpu
Parallel Programming
❖ C/C++/OpenCV
• OpenMP, OpenACC, CUDA, C++ AMP
• OpenCL
❖ Java
• OpenCL, RenderScript
❖ Notice that OpenCL and RenderScript are
• Not efficient in parallelization
• Efficient in CV algorithmic design
OpenCV Acceleration (1/2)
❖ Ver. 2.4.x
• gpu module: CUDA, PC
• ocl module: OpenCL, mobile
❖ Ver. 3.0 (2014/6)
• Transparent API for GPGPU acceleration
OpenCV Acceleration (2/2)
OpenCL 2.0
❖ Released in 2013
❖ SVM: Shared Virtual Memory
• OpenCL 1.2: Explicit memory management
❖ Dynamic (Nested) Parallelism
• Allows a device to enqueue kernels onto itself; no round trip to the host is required
❖ Disadvantage
• Requires strong hardware support
• Not well supported in current GPGPUs
CUDA still Dominant in the Near Future
❖ However, we have to manually parallelize the algorithm: more design overhead
❖ We need expertise in
• Algorithms of image and signal processing
  • Filtering, frequency analysis, compression, feature extraction, recognition, ...
• Theory, tools and methodology of parallel computing
  • Communication, synchronization, resource management, load balancing, debugging, ...
GPUs for Multimedia
• Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA
• CUDA JPEG Decoder
• DivideFrame GPU Decoder (Vegas/Premiere): Using the Power of NVIDIA Graphic Card to Decode H.264 Video Files
• Hyperspectral Image Compression on NVIDIA GPUs
• PowerDirector7 Ultra
[Speedup labels on the slide: 10X, 10X, 10X, 26X, 3.5X]
GPUs for Computer Vision(1/2)
• 87X: CUDA SURF, A Real-time Implementation for SURF (TU Darmstadt)
• 26X: Leukocyte Tracking: ImageJ Plugin (University of Virginia)
• 200X: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
• 100X: Image Denoising with Bilateral Filter (Wroclaw University of Technology)
• 85X: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
• 100X: Fast Optical Flow on GPU at Video Rate for Full HD Resolution (Onera)
• 8X: A Framework for Efficient and Scalable Execution of Domain-specific Templates on GPU (NEC Labs, Berkeley, Purdue)
• 13X: Accelerating Advanced MRI Reconstructions (University of Illinois)
GPUs for Computer Vision(2/2)
• 20X: GPU for Surveillance
• 13X: Fast Human Detection with Cascaded Ensembles
• 109X: Fast Sliding-Window Object Detection
• 263X: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
• 10X: Real-time Visual Tracker by Stream Processing
• 45X: A GPU Accelerated Evolutionary Computer Vision System
• 3X: Canny Edge Detection
• 300X: Audience Measurement: Real-time Video Analysis for Counting People, Face Detection and Tracking
The Embedded Vision Alliance
Readings (1/2)
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm." Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on. IEEE, 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19.
• Pauwels, Karl, et al. "A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features." Computers, IEEE Transactions on 61.7 (2012): 999-1012.
• Pratx, Guillem, and Lei Xing. "GPU computing in medical physics: A review." Medical physics 38.5 (2011): 2685-2697.
• Cope, Ben, et al. "Performance comparison of graphics processors to reconfigurable logic: a case study." Computers, IEEE Transactions on 59.4 (2010): 433-448.
Readings (2/2)
❖ "Designing Visionary Mobile Apps Using the Tegra Android Development Pack," http://bit.ly/1jvwbgV
❖ "Getting Started With GPU-Accelerated Computer Vision Using OpenCV and CUDA," http://bit.ly/1oMwJEG
❖ "The open standard for parallel programming of heterogeneous systems," https://www.khronos.org/opencl/
OpenCV Acceleration
❖ GPU Module Introduction, OpenCV 2.4.9.0 documentation
❖ OpenCL Module Introduction, OpenCV documentation
❖ OpenCV-CL: Computer vision with OpenCL acceleration, AMD Developer Central, 2013.
❖ Pulli, Kari, et al. "Real-time computer vision with OpenCV." Communications of the ACM 55.6 (2012): 61-69.
❖ Allusse, Yannick, et al. "GpuCV: A GPU-accelerated framework for image processing and computer vision." Advances in Visual Computing. Springer Berlin Heidelberg, 2008. 430-439.
CUDA
❖ CUDA Programming guide. nVidia.
❖ CUDA Best Practices Guide. nVidia.
❖ CUDA Reference Manual. nVidia.
❖ CUDA Zone - NVIDIA Developer, https://developer.nvidia.com/cuda-zone
❖ Parallel Programming and Computing Platform | CUDA Home, www.nvidia.com/object/cuda_home_new.html
❖ Applications of CUDA for Imaging and Computer Vision, http://www.nvidia.com/object/imaging_comp_vision.html
❖ nVidia Performance Primitives (NPP), http://developer.nvidia.com/object/npp_home.html
OpenCL
❖ Khronos OpenCL specification, reference card, tutorials, etc: http://www.khronos.org/opencl
❖ AMD OpenCL Resources: http://developer.amd.com/opencl
❖ NVIDIA OpenCL Resources: http://developer.nvidia.com/opencl
❖ Books
• Using OpenCL: Programming Massively Parallel Computers. IOS Press, 2012.
• OpenCL programming guide. Pearson Education, 2011.
• Heterogeneous Computing with OpenCL: Revised OpenCL 1. Newnes, 2012.
• OpenCL in Action: how to accelerate graphics and computation. Manning, 2012.
RenderScript
❖ RenderScript for Android Developer, official web site: http://developer.android.com/guide/topics/renderscript/compute.html
❖ Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in google android." First Asia-Pacific Programming Languages and Compilers Workshop. 2012.
❖ "High Performance Apps Development with RenderScript," 12th Kandroid Conference, 2013.
Web Sites and Resources
❖ Embedded Vision Alliance, http://www.embedded-vision.com
❖ GPUComputing.Net, http://www.gpucomputing.net
❖ HSA Foundation, www.hsafoundation.com
Parallel Computing with GPGPU
❖ Programming Massively Parallel Processors: A Hands-on Approach
• D. B. Kirk, W. M. Hwu
• Morgan Kaufmann, 2010
• http://www.nvidia.com/object/promotion_david_kirk_book.html