[IEEE 2010 42nd Southeastern Symposium on System Theory (SSST 2010) - Tyler, TX, USA (2010.03.7-2010.03.9)] 2010 42nd Southeastern Symposium on System Theory (SSST 2010) - Optimization

Optimization of Computer Vision Algorithms for Real Time Platforms

Pramod Poudel and Mukul Shirvaikar Electrical Engineering Department The University of Texas at Tyler

Tyler, TX 75799 USA E-mail: [email protected],

[email protected]

Abstract - Real-time computer vision applications like video streaming on cell phones, remote surveillance and virtual reality have stringent performance requirements but can be severely constrained by limited resources. The use of optimized algorithms is vital to meet real-time requirements, especially on popular mobile platforms. This paper presents work on performance optimization of common computer vision algorithms, such as correlation, on such embedded systems. The correlation algorithm, which is popular for face recognition, can be implemented using convolution or the Discrete Fourier Transform (DFT). The algorithms are benchmarked on the Intel Pentium processor and the BeagleBoard, a new low-cost, low-power platform based on the Texas Instruments (TI) OMAP 3530 processor. The OMAP processor has an asymmetric dual-core architecture, comprising an ARM core and a DSP supported by shared memory. OpenCV, a computer vision library developed by Intel Corporation, was utilized for some of the algorithms. Comparative results for the various approaches are presented and discussed with an emphasis on real-time implementation.

Keywords: Real-time Image Processing, Computer Vision, Correlation.

I. INTRODUCTION

Computer vision has found widespread acceptance in mobile applications like video streaming, smart cameras [1], vehicle navigation [2], smart traffic light systems [3], and virtual reality [4]. The increasing demand for video applications like context-aware computing on mobile embedded systems such as cell phones requires the use of computer vision algorithms. These algorithms are computationally intensive and resource-hungry. The system engineer therefore has a mandate to optimize them so as to meet real-time deadlines. Fortunately, engineers can take advantage of sophisticated features available on mobile processors, such as dual-core architectures. This can be achieved by selecting the proper mobile processor and using its different on-chip resources to improve the performance of the algorithms.

For this work, the base non-mobile platform is an Intel Pentium 4 processor with a 2.1 GHz clock and 3 GB of memory. The mobile platform selected is the BeagleBoard, powered by an OMAP 3530 mobile processor from TI, which runs at 500 MHz and has 256 MB of NAND flash and 128 MB of MDDR SDRAM POP memory [5]. It has an asymmetric dual-core architecture with an ARM Cortex-A8 and a TMS320C64x Digital Signal Processor (DSP).

A two-dimensional correlation algorithm is used to benchmark performance, since it is one of the basic algorithms for image processing and computer vision. It finds many applications, such as pattern matching, where an image containing a particular object or region can be searched for the same object or region using a template image [6], and signal detection [7]. The correlation algorithm, which is popular for face recognition, can be implemented using convolution or the Discrete Fourier Transform (DFT). It is a computationally intensive algorithm with numerous multiply-and-accumulate (MAC) operations. Special processors which can perform multiple MAC operations in a single cycle [9] are the first choice for such performance-hungry algorithms.

In the past, computer vision algorithms like Gaussian pyramids, Bayer filter demosaicing, Sobel edge detection and sum of absolute differences have been benchmarked and optimized [8, 9]. These efforts have focused on single-core architectures. When dealing with specialized mobile processors like the OMAP3530, shown in Figure 3, OpenCV algorithms can be modified to take advantage of the available resources to speed up computation. OpenCV is an open source computer vision library developed by Intel Corporation. It is highly optimized to take advantage of Intel’s architecture enhancements like Advanced Vector Extensions, Streaming SIMD Extensions and MMX technology [11, 12].

978-1-4244-5692-5/10/$26.00 © IEEE 2010

42nd South Eastern Symposium on System Theory, University of Texas at Tyler, Tyler, TX, USA, March 7-9, 2010

M1B.6

The cross-correlation algorithms utilized for this study are initially benchmarked on the Pentium 4. In the next step, they are ported to the BeagleBoard platform. In the final step, optimization techniques are utilized to speed up the algorithm further.

II. BACKGROUND

Cross-correlation algorithms find intensive application in pattern matching and represent one of the most common classes of algorithms used in computer vision. A template image is used to find a similar pattern or a region of interest in another image. An example application is face detection for a security system. Other example applications include satellite image analysis, product inspection and currency handling machines.

Cross-Correlation

An image consists of two-dimensional discrete values $f(x, y)$ of size $M \times N$. Thus, when dealing with image correlation we are always concerned with 2-dimensional correlation of the image function with a template $h(x, y)$ of size $J \times K$. The correlation function $f(x, y) \circ h(x, y)$ as defined in [8] is expressed as

$$f(x, y) \circ h(x, y) = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m, n)\, h(x + m, y + n) \qquad (1)$$

To find a matching sub-image $h(x, y)$ of size $J \times K$ within an image $f(x, y)$ of size $M \times N$, where $J \le M$ and $K \le N$, we can write (1) in a simpler form as

$$f(x, y) \circ h(x, y) = \sum_{m=0}^{J-1} \sum_{n=0}^{K-1} h(m, n)\, f(x + m, y + n)$$

Or

$$c(x, y) = \sum_{m=0}^{J-1} \sum_{n=0}^{K-1} h(m, n)\, f(x + m, y + n) \qquad (2)$$
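As an illustration only, the double sum of equation (2) can be written directly as nested loops. The sketch below is not the authors' implementation: the function names `cross_correlate` and `mac_count` are hypothetical, images are plain lists of rows, and the offsets $(x, y)$ are restricted to positions where the template lies fully inside the image, which is an assumption of this sketch.

```python
def cross_correlate(f, h):
    """Evaluate c(x, y) of equation (2) at every offset where h fits inside f.

    f is an M x N image and h a J x K template, both given as lists of rows.
    Limiting (x, y) to offsets where the template lies fully inside the
    image is an assumption of this sketch.
    """
    M, N = len(f), len(f[0])
    J, K = len(h), len(h[0])
    c = []
    for x in range(M - J + 1):
        row = []
        for y in range(N - K + 1):
            acc = 0
            for m in range(J):
                for n in range(K):
                    acc += h[m][n] * f[x + m][y + n]  # one multiply-accumulate
            row.append(acc)
        c.append(row)
    return c


def mac_count(M, N, J, K):
    """Number of multiply-accumulate operations performed by cross_correlate."""
    return (M - J + 1) * (N - K + 1) * J * K


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
template = [[1, 0],
            [0, 1]]
result = cross_correlate(image, template)
print(result)                  # [[6, 8], [12, 14]]
print(mac_count(3, 3, 2, 2))   # 16
```

The four nested loops make the cost visible: every output value needs $J \cdot K$ multiply-accumulate operations, which is why MAC throughput dominates the running time of this approach.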

Equation (2) does not take into account the constant term $\frac{1}{MN}$. This is because a normalized function is used to avoid sensitivity to the amplitudes of $f$ and $h$ when correlating two images. The normalized function is expressed as

$$c(x, y) = \frac{\sum_{m=0}^{J-1} \sum_{n=0}^{K-1} \left[ h(m, n) - \bar{h} \right] \left[ f(x + m, y + n) - \bar{f} \right]}{\left\{ \sum_{m=0}^{J-1} \sum_{n=0}^{K-1} \left[ h(m, n) - \bar{h} \right]^{2} \sum_{m=0}^{J-1} \sum_{n=0}^{K-1} \left[ f(x + m, y + n) - \bar{f} \right]^{2} \right\}^{1/2}} \qquad (3)$$

where $\bar{h}$ is the average value of the sub-image or template and $\bar{f}$ is the average value of $f$ in the region that overlaps with $h$. The highest correlation value marks the position of the best match of the sub-image. The computational cost of the spatial-domain approach grows with the product of the image and template sizes (on the order of $MNJK$ multiply-and-accumulate operations), so it rises rapidly as either one gets larger. The correlation algorithm can be implemented using the spatial-domain convolution method described in the above equations or the frequency-domain method using the Discrete Fourier Transform (DFT), as governed by the correlation theorem pair for the 2-D discrete case:

$$f(x, y) \circ h(x, y) \Leftrightarrow F^{*}(u, v)\, H(u, v) \qquad (4)$$

where $F^{*}(u, v)$ represents the complex conjugate of the 2-D DFT of the image $f(x, y)$, and the image and template functions have been extended by zero padding [6]. The inverse DFT can be utilized to obtain the resulting image as shown in equation (5) below:

$$f(x, y) = \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v)\, e^{j 2\pi (ux/M + vy/N)} \qquad (5)$$

The OpenCV algorithm is based on the frequency domain approach, which is computationally efficient for larger image and template sizes.
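The correlation theorem can be checked numerically on a small example. The following sketch is illustrative only and is not how OpenCV computes the transform: it uses a naive $O(M^2N^2)$ DFT rather than an FFT, assumes both arrays are real-valued and already zero-padded to a common size, treats indices as periodic, and places the $1/MN$ factor in the forward transform so that the inverse has no scale factor, as in equation (5). All function names are hypothetical.

```python
import cmath


def dft2(a):
    """Forward 2-D DFT carrying the 1/(MN) factor (the inverse then has none)."""
    M, N = len(a), len(a[0])
    return [[sum(a[x][y] * cmath.exp(-2j * cmath.pi * (u * x / M + v * y / N))
                 for x in range(M) for y in range(N)) / (M * N)
             for v in range(N)]
            for u in range(M)]


def idft2(A):
    """Inverse 2-D DFT in the form of equation (5)."""
    M, N = len(A), len(A[0])
    return [[sum(A[u][v] * cmath.exp(2j * cmath.pi * (u * x / M + v * y / N))
                 for u in range(M) for v in range(N))
             for y in range(N)]
            for x in range(M)]


def correlate_direct(f, h):
    """Equation (1) with periodic (wrap-around) indexing, for real inputs."""
    M, N = len(f), len(f[0])
    return [[sum(f[m][n] * h[(x + m) % M][(y + n) % N]
                 for m in range(M) for n in range(N)) / (M * N)
             for y in range(N)]
            for x in range(M)]


def correlate_dft(f, h):
    """Equation (4): multiply conj(F) by H element-wise, then invert."""
    M, N = len(f), len(f[0])
    F, H = dft2(f), dft2(h)
    product = [[F[u][v].conjugate() * H[u][v] for v in range(N)]
               for u in range(M)]
    # The inputs are real, so the imaginary parts are only rounding noise.
    return [[value.real for value in row] for row in idft2(product)]


f = [[1, 2, 0, 1],
     [0, 3, 1, 2],
     [4, 1, 0, 0]]
h = [[2, 0, 1, 0],
     [1, 1, 0, 3],
     [0, 2, 2, 1]]
direct = correlate_direct(f, h)
via_dft = correlate_dft(f, h)
max_error = max(abs(direct[x][y] - via_dft[x][y])
                for x in range(3) for y in range(4))
print(max_error)  # tiny: the two methods agree up to rounding
```

With a fast Fourier transform in place of the naive sums, the frequency-domain route needs far fewer operations than the direct sum once the image and template are large, which is the efficiency advantage referred to above.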

III. IMPLEMENTATION

The spatial-domain convolution and the frequency-domain DFT-based correlation algorithms were both implemented on two platforms. The first, a non-mobile Pentium 4 PC, serves as the baseline against which mobile platform performance is compared. The second is the mobile BeagleBoard, on which the algorithm is then optimized to improve speed.
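For reference, the normalized measure of equation (3) and the search for its peak can be sketched in a few lines. This is a minimal illustration, not the code benchmarked in this paper: the names `normalized_correlation` and `best_match` are hypothetical, images are lists of rows, and returning 0 for a zero-variance window is a choice made for this sketch.

```python
import math


def normalized_correlation(f, h):
    """Evaluate equation (3) at every offset where template h fits inside image f."""
    M, N = len(f), len(f[0])
    J, K = len(h), len(h[0])
    h_mean = sum(map(sum, h)) / (J * K)
    h_dev = [[h[m][n] - h_mean for n in range(K)] for m in range(J)]
    h_energy = sum(d * d for row in h_dev for d in row)
    c = []
    for x in range(M - J + 1):
        row = []
        for y in range(N - K + 1):
            # Region of f that overlaps the template at offset (x, y).
            window = [f[x + m][y:y + K] for m in range(J)]
            f_mean = sum(map(sum, window)) / (J * K)
            num = 0.0
            f_energy = 0.0
            for m in range(J):
                for n in range(K):
                    f_dev = window[m][n] - f_mean
                    num += h_dev[m][n] * f_dev
                    f_energy += f_dev * f_dev
            denom = math.sqrt(h_energy * f_energy)
            # A flat window or template has zero variance; report 0 there
            # (a choice made for this sketch).
            row.append(num / denom if denom > 0 else 0.0)
        c.append(row)
    return c


def best_match(c):
    """Return the (x, y) offset with the highest correlation value."""
    return max(((x, y) for x in range(len(c)) for y in range(len(c[0]))),
               key=lambda p: c[p[0]][p[1]])


image = [[0, 0, 0, 0, 0],
         [0, 0, 9, 1, 0],
         [0, 0, 2, 8, 0],
         [0, 0, 0, 0, 0]]
template = [[9, 1],
            [2, 8]]
scores = normalized_correlation(image, template)
print(best_match(scores))  # (1, 2): the offset where the template was embedded
```

Because the mean is subtracted and the result is divided by the two energies, the score lies between -1 and 1 and does not change if the image brightness is scaled, which is the amplitude insensitivity that motivates equation (3).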


Two gray-scale test images of size 1194 x 898 pixels and 1024 x 695 pixels are searched for templates of size 87 x 29 and 27 x 34 pixels, respectively. Figures 1 and 2 show the test images, template images and the result images. Image 1 is a picture of a printed circuit board (PCB) in which a subimage of a resistor is searched for. Image 2 is an aerial view of a small town, in which a house of particular interest is being located.

Pentium

The correlation algorithms are implemented on a Pentium 4 processor with a 2.1 GHz clock and 3 GB of memory. An Ubuntu 8.04 virtual machine with the GNU gcc compiler is used to run the algorithms. The frequency-domain algorithm is based on the OpenCV function cvMatchTemplate. The method of correlation selected is CV_TM_CCOEFF_NORMED, which is defined by equation (3). The function clock_gettime(), declared in the time.h header file, is used to calculate the execution time; it reports time in nanoseconds. The algorithm is executed 10 times in succession and the total elapsed time is measured. The time for a single iteration is calculated by dividing the elapsed time by the number of iterations. Ten such samples were collected following the same procedure, and the average execution time is tabulated for the different images. The same process was followed for the spatial-domain convolution algorithm.

OMAP3530 (ARM Side)

The algorithms were ported to our target mobile platform, the BeagleBoard, with the OMAP3530 processor. It is one of the high-end multimedia processors for mobile applications designed by TI and has an ARM Cortex-A8 as one of its cores, running at a speed of 500 MHz. The same procedure as described in the section above is followed to benchmark the algorithm on the OMAP3530. The OpenCV libraries are currently ported only to the ARM side and do not take advantage of the dual-core architecture.
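The timing procedure used on both platforms (10 back-to-back runs per sample, with ten samples averaged) can be sketched as follows. This is an illustration under stated assumptions rather than the benchmark code itself: Python's `time.perf_counter_ns()` stands in for the C function clock_gettime(), and `workload` is a hypothetical placeholder for one run of the correlation algorithm.

```python
import time


def time_per_iteration(func, iterations=10):
    """Run func `iterations` times and return the mean time per run in seconds."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func()
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / iterations / 1e9


def average_execution_time(func, samples=10, iterations=10):
    """Average `samples` independent measurements of the per-iteration time."""
    measurements = [time_per_iteration(func, iterations) for _ in range(samples)]
    return sum(measurements) / samples


def workload():
    # Hypothetical stand-in for one run of the correlation algorithm.
    return sum(i * i for i in range(10000))


print(f"{average_execution_time(workload):.6f} s per iteration")
```

Timing a batch of runs and dividing by the iteration count reduces the influence of timer overhead on each measurement, and averaging several samples smooths out variation caused by other activity on the system.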
OMAP 3530 (ARM and DSP)

The OMAP3530 is an asymmetric dual-core chip, which also has an on-chip TMS320C64x DSP. The DSP core runs at a speed of 450 MHz. DSPLINK, an application programming interface (API) from TI, is used to link the ARM side and the DSP side. The ARM side runs the Angstrom OS while the DSP side runs DSP/BIOS II. With the help of DSPLINK, computationally intensive tasks can be assigned to the DSP processor. The DSP processor is architecturally designed to handle such tasks efficiently, owing to its Very Long Instruction Word (VLIW) execution, deep hardware pipeline and software pipelining. It can also handle many MAC operations at the same time. The correlation algorithms, by nature, involve a large number of MAC operations. The use of the on-chip DSP processor is therefore expected to lower the execution time for the correlation algorithms. The same methodology is used to implement the algorithms and calculate the execution time.

Figure 1 Component Detection. (a) PCB image of size 1194x898 pixels. (b) Template image of size 87x29 pixels of a resistor. (c) Component match found on the image, shown by the black rectangular box indicated by the white arrow, on the lower right-hand side of the image.

Figure 2 Locality Match. (a) Image of size 1024x695 pixels, showing an aerial view of a town (source Google.com). (b) Template image of size 27x34 pixels showing a small house. (c) Black rectangular box to the south-east of the center of the image, indicated by the white arrow, showing the locality match.

IV. RESULTS

The execution time for the algorithms on different platforms is shown in Table 1.

Table 1 Measure of performance in terms of execution time.

| Platform | PCB Image, OpenCV (DFT) (sec) | PCB Image, Custom (Convolution) (sec) | Town Image, OpenCV (DFT) (sec) | Town Image, Custom (Convolution) (sec) |
|---|---|---|---|---|
| Pentium 4 (PC) | 0.2178 | 53.39 | 0.1469 | 14.93 |
| OMAP 3530 (ARM only) | 5.2019 | 621.39 | 3.6995 | 160.55 |
| OMAP 3530 (ARM + DSP) | __ | 438.9 | __ | 102.6 |

The Pentium execution times are much lower in each case above. This can be explained by the OpenCV libraries, which are highly optimized for the Intel internal chip architecture and can also take advantage of Intel’s performance primitives. The difference in execution time is also explained by the fact that the Pentium platform is not mobile and operates under far fewer resource constraints than the OMAP3530, which is designed for mobile use. The execution time is significantly improved when both cores on the OMAP are used, due to the higher MAC computational capability of the DSP processor core.

V. CONCLUSIONS

This paper briefly investigates how to achieve performance optimization for mobile applications like cell phones. The results from the benchmarking tests suggest that exploiting all the available on-chip hardware resources and assigning computation-intensive tasks to dedicated hardware is one of the main techniques to achieve real-time performance. Benchmarking of an OpenCV cross-correlation algorithm was accomplished on a mobile platform. The performance of the custom convolution algorithm was further improved by using the on-chip DSP core.
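As a rough illustration of the size of that improvement, the speedups implied by the Table 1 execution times for the custom convolution algorithm can be computed directly. The short sketch below only performs this arithmetic; the variable and function names are hypothetical.

```python
# Execution times in seconds for the custom convolution algorithm, from Table 1.
arm_only = {"PCB": 621.39, "Town": 160.55}
arm_dsp = {"PCB": 438.9, "Town": 102.6}


def speedup(before, after):
    """Ratio of execution times; values above 1 mean `after` is faster."""
    return before / after


for image in arm_only:
    s = speedup(arm_only[image], arm_dsp[image])
    saved = 100 * (1 - arm_dsp[image] / arm_only[image])
    print(f"{image}: {s:.2f}x faster, {saved:.1f}% less time")
```

The ratios come out at roughly 1.4 for the PCB image and 1.6 for the town image, so moving the computation to the DSP removes about a third of the execution time, while the result remains well above the Pentium figures.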

VI. REFERENCES

[1] B. Rinner and W. Wolf, “An Introduction to Distributed Smart Cameras,” pp. 1565-1575, vol. 96, no. 10, Proceedings of the IEEE, 2008.

[2] Y. Wang, L. Bai, and M. Fairhurst, “Robust Road Modeling and Tracking Using Condensation,” pp. 570-579, IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 4, December 2008.

[3] A. Serrono, C. Conde, L. J. Rodriguez-Aragon, R. Montes, and E. Cabello, “Computer Vision Application: Real Time Smart Traffic Light,” pp. 525-530, International Conference on Computer Aided Systems Theory, 2005.

[4] W. Liu, C. Zhang, B. Yuan, “AVR theory, techniques and application,” Proceedings of International Conference on Signal Processing, pp. 1163-1166, vol. 2, 2000.

[5] BeagleBoard System Reference Manual, http://beagleboard.org, Oct. 2009.

[6] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Pearson Education, 3rd Edition, 2008.

[7] L. Jian-fei et al, “Application of Cross-correlation Algorithm in Radio Weak Signal Detection,” Seventh Annual Communication Networks and Services Research Conference, pp. 440-442, 2009.

[8] TMS320C64x Technical Overview, No. SPRU395, Texas Instruments Inc, Feb., 2000.

[9] D. Baumgartner, P. Rossler and W. Kubinger, “Performance Benchmark of DSP and FPGA Implementations of Low-Level Vision Algorithms,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2007.

[10] H.-J. You, S.-T. Chung, and S. Jung, “Optimization of SAD Algorithm on VLIW DSP,” World Academy of Science, Engineering and Technology, vol. 37, pp. 96-102, 2008.

[11] G. Bradski and A. Kaehler, “Learning OpenCV,” O’Reilly Publications, 2008.

[12] Intel Integrated Performance Primitives (IPP Library) v6.1 update 2 for Linux OS Release Notes, No: 321360-003US, Intel Corporation, Oct., 2009.

[13] OMAP3530/25 Application Processor, SPRS507F-FEBRUARY 2008-REVISED OCTOBER 2009, Texas Instruments, Oct., 2009.

Figure 3 Architectural block diagram of OMAP3530 [13]
