
Real-Time Stereo Vision Techniques

Christos Georgoulas and Ioannis Andreadis

Laboratory of Electronics, Department of Electrical and Computer Engineering

Democritus University of Thrace

Xanthi 67100, Greece

{cgeorg,iandread}@ee.duth.gr

Abstract— This PhD aims at the design and implementation of real-time hardware-based stereo vision systems. Stereo vision deals with images acquired by a stereo camera setup, where the disparity between the stereo images allows depth estimation within a scene. In the first part of my research, a new hardware-efficient real-time disparity map computation module was developed, involving a fully parallel-pipelined design for the overall module, realized on a single FPGA device. The design technique was then extended to include the hardware implementation of a fuzzy inference system (FIS), to deal more efficiently with the detection of conjugate pairs in stereo images, a challenging research problem known as the correspondence problem, i.e. finding, for each point in the left image, the corresponding point in the right one. The performance of the hardware-implemented design methodologies was evaluated against state-of-the-art methods to prove their efficiency.

I. INTRODUCTION

Distance calculation of scene points within an acquired image, relative to the position of a camera, is one of the important tasks of a computer vision system, as it allows depth estimation and environment reconstruction [1]. The most common method for depth extraction is the use of a stereo camera setup consisting of two cameras displaced by a known distance with co-planar optical axes. Point-to-point matching of the intensity images acquired by the stereo setup derives the depth images, or so-called disparity maps [2]. If the stereo image pairs are accurately rectified, the matching procedure can be performed in one dimension, since horizontal scan lines reside on the same epipolar line. This can be seen in Fig. 1. A point P1 in one image plane may have arisen from any of the points on the line C1P1, and may appear in the other image plane at any point on the so-called epipolar line E2 [1]. Thus the search for corresponding image points between the stereo image pair is reduced to the same scan line. The horizontal distance in pixel coordinates between corresponding points is the disparity, and a disparity map consists of all the disparity values across the stereo image pair. The extraction of the disparity map efficiently deals with problems such as 3D reconstruction, mobile robot navigation and obstacle avoidance, positioning, etc. [3,4].

A challenging research problem known as the correspondence problem comprises the detection of conjugate pairs in stereo image pairs, i.e. finding, for each point in the left image, its corresponding point in the right image [5]. To perform point matching, the points should be distinctly different from their surrounding pixels; thus suitable feature extraction should precede stereo matching. The two major categories of matching algorithms are area-based [6,7] and feature-based [8,9]. Area-based algorithms produce dense disparity information using local pixel intensity measures. Feature-based algorithms use specific points, depending on the extracted features. Although they produce more accurate results, the resulting disparity information is sparse and they are significantly slower than area-based algorithms. Thus, in applications where speed is the dominant demand, area-based methods have been preferred.

Only in recent years have real-time methods been reported, owing to the increase of CPU and custom hardware speeds [10]. Dense stereo algorithms provide dense disparity information, but tend to ignore possible uncertainties during the matching stage. To overcome this issue, various methods have been employed to minimize the uncertainty and efficiently eliminate possible false matches during the matching process [2,6]. Most of these methods rely on sum of absolute differences (SAD) or correlation algorithms, which are of limited capability in minimizing false matches.

In the first part of this PhD, a novel three-stage hardware-implemented module was proposed, which addresses the stereo vision correspondence problem, producing disparity maps at real-time speeds of up to 275 frames per second for a 640 x 480 pixel resolution stereo pair with 80 levels of disparity. Its novelty compared to previous algorithms is its high processing speed. Remaining false correspondences are eliminated using a cellular automata (CA) filter, producing semi-dense disparity maps with minimal noise [11].

The current research of this PhD includes the hardware implementation of a two-stage module, which produces dense disparity maps at real-time speeds of up to 439 frames per second for a 640 x 480 pixel resolution stereo pair with 80 levels of disparity. The novelty compared to previously reported algorithms is the combination of highly accurate dense disparity information extraction with real-time processing speed. A fuzzy inference system (FIS) is implemented to eliminate false matches, further improving the disparity map accuracy and coverage. The result is an efficient module suitable for real-time stereo vision applications [12].

In Section II the proposed algorithms are described and the proposed hardware structure details are given in Section III. In Section IV experimental results are shown and finally, conclusions are drawn in Section V.


Figure 1. Geometry of Epipolar Lines

Figure 2. Block Diagram of the proposed 3-stage system

II. PROPOSED ALGORITHMS

A. First Method

An overview of the algorithm is presented in Fig. 2. The local variation of the stereo pair images is initially computed for 2 x 2 and 5 x 5 pixel window sizes. The disparity map is then computed using an SAD window based technique, where the window size is determined by the local variation results. Finally, a CA method is used to filter the noise due to false matched points. All three steps are fully implemented in hardware on an FPGA device.

1) Local Variation estimation: The measure of local variation, in terms of pixel grayscale values, over variable-sized windows of the image, can provide a more efficient disparity map evaluation. The measure of local variation is given by (1).

LV(p) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| I(i,j) - \mu \right|    (1)

The local variation for a given window with central pixel p is calculated according to the neighboring pixel grayscale values, where N is the selected square window size and µ is the average grayscale value of the given window. For the case of the 2 x 2 pixel window the local variation is calculated for the upper left pixel.
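For illustration, a minimal NumPy sketch of (1) follows. It is only an assumed software reference (not the FPGA datapath), it takes the deviations from the window mean in absolute value, and for simplicity it anchors every window at its upper-left pixel, as described above for the 2 x 2 case.

```python
import numpy as np

def local_variation(img, n):
    """Local variation LV(p) of (1): sum of absolute deviations from the
    window mean mu over an n x n window (anchored at the upper-left pixel
    here, for simplicity)."""
    img = img.astype(np.float32)
    h, w = img.shape
    lv = np.zeros((h, w), dtype=np.float32)
    for i in range(h - n + 1):
        for j in range(w - n + 1):
            win = img[i:i + n, j:j + n]
            lv[i, j] = np.abs(win - win.mean()).sum()
    return lv
```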

2) Adaptive window search: To perform the disparity map generation, an SAD window based algorithm was used to find the corresponding points in the stereo image pair, which is given below.

SAD(i,j,d) = \sum_{\mu=-w}^{w} \sum_{\nu=-w}^{w} \left| I_{l}(i+\mu, j+\nu) - I_{r}(i+\mu, j+\nu-d) \right|,    (2)

where I_l and I_r denote the left and right image pixel grayscale values, d is the candidate disparity, w defines the window size and i, j are the coordinates (rows, columns) of the center pixel of the window for which the SAD is computed.

To extract the appropriate disparity value for each pixel, a search over the SAD values for all disparities (d_min up to d_max) is performed. The disparity at which the computed SAD value for a given pixel is minimum is assigned as the disparity map value for that pixel.

D(i,j) = \arg\min_{d \in [d_{min}, d_{max}]} SAD(i,j,d)    (3)
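A compact NumPy sketch of the SAD aggregation (2) and the winner-take-all assignment (3) is given below. It is only an assumed software reference with a fixed (2w+1) x (2w+1) window, so the adaptive 2 x 2 / 5 x 5 window switching driven by the local variation, as well as the hardware pipelining, are omitted.

```python
import numpy as np

def box_sum(a, w):
    """Sum of a over a (2w+1) x (2w+1) window centred at each pixel
    (zero-padded at the borders), via an integral image."""
    p = np.pad(a, w)
    c = np.cumsum(np.cumsum(p, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))
    n = 2 * w + 1
    return c[n:, n:] - c[:-n, n:] - c[n:, :-n] + c[:-n, :-n]

def sad_disparity(left, right, d_max, w=2):
    """Winner-take-all disparity per (2)-(3): for every pixel, the windowed
    SAD is evaluated for each candidate disparity d and the minimising d
    is kept."""
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    h, width = left.shape
    best = np.full((h, width), np.inf, dtype=np.float32)
    disp = np.zeros((h, width), dtype=np.int32)
    for d in range(d_max + 1):
        # left pixel (i, j) is compared against right pixel (i, j - d)
        diff = np.abs(left[:, d:] - right[:, :width - d])
        sad = np.full((h, width), np.inf, dtype=np.float32)
        sad[:, d:] = box_sum(diff, w)          # window aggregation, eq. (2)
        better = sad < best
        best[better] = sad[better]
        disp[better] = d                       # arg-min over d, eq. (3)
    return disp
```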

3) Cellular automata filtering: The resulting disparity map is usually corrupted due to the disparity map assignment stage described in (3). In order to enhance the disparity map without loss of 3D information, a simple CA approach is employed. Filtering with a CA approach was introduced, as it presents more efficient noise removal compared to standard filtering techniques [13], while preserving image details. The following 2D transition rules are taken into account for the disparity map filtering; the first rule states that

a) if \sum_{i,j=-1}^{1} C_{i,j} = 0, \forall (i \neq 0 and j \neq 0), then C_{i,j} = 0,

b) if \sum_{i,j=-1}^{1} C_{i,j} \geq 7, \forall (i \neq 0 and j \neq 0), then C_{i,j} = 1.

The second rule of the CA states that,

c) if \sum_{i,j=-1}^{1} C_{i,j} \geq 1, \forall (i \neq 0 and j \neq 0), and C_{i,j} = 1, then

d) if \sum_{i,j=-3}^{3} C_{i,j} = 0, \forall (i \notin \{-1,1\} and j \notin \{-1,1\}), then C_{i,j} = 0.

For a given disparity map with, e.g., 16 disparity levels, 16 binary images were created by decomposition, where the C_1 image has logic ones at every pixel that has value 1 in the disparity map and logic zeros elsewhere, and so on. The CA rules were applied separately to each C_d binary image and the resulting disparity map was then recomposed by the following formula:

D(i,j) = \sum_{d \in [d_{min}, d_{max}]} C_{d}(i,j) \cdot d    (4)
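The following NumPy sketch illustrates the decomposition into binary layers C_d, the CA transition rules, and the recomposition (4). The neighbourhood conventions of rules (c)-(d) are not fully specified above, so the sketch adopts one plausible reading: a set pixel whose support lies only in its immediate 3 x 3 neighbourhood (the surrounding 7 x 7 ring being empty) is treated as an isolated cluster and cleared.

```python
import numpy as np

def window_sum(c, r):
    """Sum of c over the (2r+1) x (2r+1) window centred at each pixel."""
    p = np.pad(c.astype(np.int32), r)
    s = np.cumsum(np.cumsum(p, axis=0), axis=1)
    s = np.pad(s, ((1, 0), (1, 0)))
    n = 2 * r + 1
    return s[n:, n:] - s[:-n, n:] - s[n:, :-n] + s[:-n, :-n]

def ca_filter(disp, d_min, d_max, iterations=1):
    """CA post-filter sketch: decompose into binary layers C_d, evolve each
    layer with the transition rules, and recompose with (4)."""
    out = np.zeros_like(disp)
    for d in range(d_min, d_max + 1):
        c = (disp == d).astype(np.uint8)
        for _ in range(iterations):
            n3 = window_sum(c, 1) - c                    # 3x3 neighbours
            n7 = window_sum(c, 3) - window_sum(c, 1)     # outer 7x7 ring
            new = c.copy()
            new[n3 == 0] = 0                             # rule (a)
            new[n3 >= 7] = 1                             # rule (b)
            new[(c == 1) & (n3 >= 1) & (n7 == 0)] = 0    # rules (c)-(d), assumed reading
            c = new
        out = out + d * c.astype(out.dtype)              # recomposition, eq. (4)
    return out
```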

B. Second Method

An overview of this algorithm is presented in Fig. 3. The disparity map is calculated using an SAD window based technique on color intensity images; the intensity values of the three color components (R, G and B) are taken into account in the calculation of the SAD value. The square window selected for the implementation has a fixed size of 7 x 7 pixels. The resulting disparity map is then filtered using a 3-input 1-output Mamdani-type fuzzy inference system (FIS), to minimize incorrect matches made during the previous step.


Figure 3. Block Diagram of the proposed 2-stage system

Both steps are fully implemented in hardware realized on an FPGA device.

1) Color SAD window based technique: Color intensity images were used, where the R, G and B color components of the stereo pair images were fed to the module. The sum of absolute differences (SAD) block-matching method used to extract the corresponding points in the color stereo image pair is presented below.

SAD(i,j,d) = \sum_{\mu=-w}^{w} \sum_{\nu=-w}^{w} \sum_{k=1}^{3} \left| I_{l}(i+\mu, j+\nu, k) - I_{r}(i+\mu, j+\nu-d, k) \right|,    (5)

where I_l and I_r denote the left and right image pixel values, d is the candidate disparity, w defines the window size, k is the corresponding RGB color component (1 for red, 2 for green and 3 for blue), and i, j are the coordinates (rows, columns) of the center pixel of the window for which the SAD is computed. Having constructed the SAD, the disparity value for each pixel is calculated according to (3).
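As a short illustrative sketch (reusing the box_sum helper from the first-method example above), the cost of (5) for a single candidate disparity d can be written as follows; the only change with respect to (2) is that the absolute channel differences are accumulated over k before the 7 x 7 window aggregation.

```python
import numpy as np

def colour_sad_cost(left_rgb, right_rgb, d, w=3):
    """Per-pixel SAD cost of (5) for one disparity d (w=3 gives the 7x7
    window); the R, G and B absolute differences are summed before the
    window aggregation."""
    h, width, _ = left_rgb.shape
    diff = np.abs(left_rgb[:, d:, :].astype(np.float32)
                  - right_rgb[:, :width - d, :].astype(np.float32)).sum(axis=2)
    cost = np.full((h, width), np.inf, dtype=np.float32)
    cost[:, d:] = box_sum(diff, w)   # box_sum as defined in the earlier sketch
    return cost
```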

2) Fuzzy inference system filtering: As mentioned previously, numerous false matches are introduced during the disparity value assignment stage (3). This type of random noise is not tolerable if dense disparity map extraction is required. Additionally, if high processing speed is demanded, an algorithm fulfilling both of these requirements in an efficient way needs to be realized. A 3-input 1-output fuzzy inference system (FIS) holding 27 rules was implemented. The calculated disparity values from the previous step were fed into the FIS module. A set of three disparity values, [(i,j), (i,j-3), (i,j+3)], where i, j are the coordinates (rows, columns) of the 1D mask center pixel, were used as the three FIS inputs. According to the selected rules, the corresponding mask center pixel was reassigned a new disparity value provided by the FIS output. Tests proved that the resulting disparity map exhibits an increase of more than 10% in total accuracy. The following formula describes the FIS operation.

D(i,j) = FIS\left[ D(i,j), D(i,j-3), D(i,j+3) \right]    (6)
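The membership functions and the 27-rule base are not listed here, so the following NumPy sketch only illustrates the general Mamdani mechanism behind (6) under assumed triangular LOW/MED/HIGH memberships over the disparity range and an assumed consequent (the median label of the three antecedents); min is used for the rule AND, max for aggregation, and the centroid for defuzzification.

```python
import itertools
import numpy as np

D_MAX = 80                                    # assumed disparity range
UNIVERSE = np.linspace(0.0, D_MAX, 161)       # discretised output universe

def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b."""
    x = np.asarray(x, dtype=np.float32)
    rising = (x - a) / (b - a)
    falling = (c - x) / (c - b)
    return np.clip(np.minimum(rising, falling), 0.0, 1.0)

# Assumed membership functions (not given in the text): LOW / MED / HIGH.
MFS = {"LOW": (-1.0, 0.0, D_MAX / 2),
       "MED": (0.0, D_MAX / 2, D_MAX),
       "HIGH": (D_MAX / 2, D_MAX, D_MAX + 1.0)}
LABELS = list(MFS)

def fis_filter(d_c, d_l, d_r):
    """Mamdani inference over [D(i,j), D(i,j-3), D(i,j+3)], cf. (6)."""
    aggregated = np.zeros_like(UNIVERSE)
    for labels in itertools.product(LABELS, repeat=3):       # 3^3 = 27 rules
        strengths = [float(tri(x, *MFS[l]))
                     for x, l in zip((d_c, d_l, d_r), labels)]
        firing = min(strengths)                               # AND of antecedents
        consequent = sorted(labels, key=LABELS.index)[1]      # assumed: median label
        clipped = np.minimum(firing, tri(UNIVERSE, *MFS[consequent]))
        aggregated = np.maximum(aggregated, clipped)          # max aggregation
    if aggregated.sum() == 0.0:
        return float(d_c)                                     # no rule fired
    return float((UNIVERSE * aggregated).sum() / aggregated.sum())  # centroid

# Example: an outlier centre disparity is pulled towards its neighbours.
# print(fis_filter(70, 12, 14))
```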

The number of bits used for the digitization of the membership functions determines the accuracy of the representation and affects the results of the fuzzy rules. One basic type of error due to the finiteness of the word length in digital implementations of fuzzy inference systems is the membership function error. These errors result from the digitization of the membership function values [14]. As the resolution increases, accuracy improves, but at the same time hardware complexity increases, complicating the hardware implementation of the system. The maximum errors (ME) evaluated for the membership functions used can be seen in Table I. N represents the number of bits per membership degree, and the cases for 4 up to 8 bits were considered. The values of the membership functions used in the present design are represented by 6-bit binary words.

TABLE I. MAXIMUM ERRORS (ME) FOR THE MEMBERSHIP FUNCTIONS USED

Number of bits, N    ME
4                    0.061
5                    0.029
6                    0.015
7                    0.007
8                    0.003
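As a minimal sketch of the membership-function digitization discussed above (the exact rounding/truncation scheme of the design is not specified here, so the helper simply rounds to the nearest representable level):

```python
import numpy as np

def quantise_membership(mu, n_bits):
    """Represent membership degrees in [0, 1] with n_bits-bit words by
    rounding to the nearest of the representable levels."""
    levels = 2 ** n_bits - 1
    return np.round(np.asarray(mu, dtype=np.float64) * levels) / levels

# Example: a membership degree stored with the 6-bit resolution used here.
# quantise_membership(0.3333, 6) -> 0.3333... (approximately 21/63)
```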

III. HARDWARE STRUCTURES DESCRIPTION

A. First Method

The module was implemented in a parallel-pipelined hardware architecture, realized on a single FPGA device of the Altera Stratix II family. The typical operating clock frequency was found to be 256 MHz. The hardware design architecture is shown in Fig. 4. Details of the hardware description can be found in [11].

1) Speed Issues: The hardware module presented in Fig. 4 was designed to calculate disparity maps for stereo pairs with a disparity range of up to 80 pixels. The relationship between the number of frames processed per second and the processed image width, assuming square images, is approximated by (7).

frames/sec = 8 \times 10^{7} \times \left[ (\text{image width in pixels})^{2} \right]^{-0.996}    (7)

2) FPGA device specifications: The architecture presented in Fig. 4 was implemented using the Altera Quartus II schematic editor. It was simulated to prove functionality and, once tested, finally mapped onto an FPGA device. The analytical specifications of the target device are given in Table II.

Figure 4. FPGA Design of the proposed 3-stage module


B. Second Method

The proposed module was implemented in hardware on a single FPGA device of the Altera Stratix III family. A parallel-pipelined architecture was followed, considering speed issues. The typical operating clock frequency was found to be 138 MHz. The overall hardware design architecture is presented in Fig. 5. For more hardware architecture details see [12].

1) Speed issues: The hardware presented in Fig. 5 was designed to provide real-time disparity map extraction for stereo images with up to 80 levels of disparity. The achieved speed performance, in frames per second relative to image resolution, assuming square images, is approximated by (8).

frames/sec = 10^{8} \times \left[ (\text{image width in pixels})^{2} \right]^{-0.976}    (8)

2) FPGA device specifications: The overall architecture was realized on a single FPGA device using the Altera Quartus II schematic editor. The design was finally mapped onto the device after successful testing and simulation. The specifications of the target device are given in Table III.

IV. EXPERIMENTAL RESULTS

A. First Method

The proposed module uses an adaptive technique where the support window size is selected automatically depending on the local variation of the support neighborhood. This minimizes the percentage of false correspondences during the matching stage. The CA post-processing filter enables satisfactory elimination of remaining false correspondences, preserving details in the resulting disparity map, in contrast to existing filtering methods that noticeably alter disparity values in order to remove unwanted noise. The generated disparity maps are shown in Fig. 6.

Figure 5. FPGA Design of the proposed 2-stage module

Figure 6. Resulting disparity maps for (a) corridor and (b) cones stereo pairs, respectively, along with the original image pairs

The hardware-implemented module presents higher processing rates than existing methods, making it appropriate for demanding real-time applications. Table IV presents the processing speed of the realized module, for an operating frequency of 256 MHz and for image pairs with a disparity range of 80 levels. Quantitative results of the proposed module under various configurations can be seen in Table V, and are compared to previous algorithms in Table VI. In Table VII the proposed module is compared to previous algorithms in terms of speed.

The achieved accuracy, along with the speed performance of the proposed module, make it an efficient module that can perform adequately in real stereo vision applications, whose disparity range is usually between 60 and 120 levels, at the cost of a nearly 10% decrease in accuracy but with a more than 2,000% increase in speed compared to previous algorithms [15-18] (Tables VI and VII).

TABLE II. SPECIFICATIONS OF TARGET DEVICE (FIRST METHOD)

Device: Altera EP2S180F1020C3
Total Comb. functions: 84,251
Total Registers: 5,208
Total ALUTs (%): 59 (84,307/143,520)
Total LABs (%): 83 (7,484/8,970)
Total pins (%): 3 (25/743)

TABLE III. SPECIFICATIONS OF TARGET DEVICE (SECOND METHOD)

Device: Altera EP3SL340H1152C3
Total block memory bits (%): <1 (6,912/16,662,528)
Total Registers: 15,624
Total ALUTs (%): 77 (208,940/270,400)
Total LABs (%): 96 (12,914/13,520)
Total pins (%): 8 (56/744)

TABLE IV. FRAME RATE OUTPUT OF THE REALIZED MODULE (FIRST METHOD)

Disparity levels: 80

Image size   320x240   640x480   800x600   1024x768   1280x1024
Frames/s     1090      275       176       108        65



TABLE V. QUANTITATIVE RESULTS OF THE PROPOSED MODULE UNDER VARIOUS CONFIGURATIONS (FIRST METHOD)

     Tsukuba          Corridor         Cones            Teddy
     Acc (%) Cov (%)  Acc (%) Cov (%)  Acc (%) Cov (%)  Acc (%) Cov (%)
a    55      88       36      98       48      65       90      49
b    88      51       81      75       72      56       93      47

a. Without CA filtering, b. With CA filtering.

TABLE VI. QUANTITATIVE RESULTS OF THE PROPOSED METHOD COMPARED TO PREVIOUS ALGORITHMS (FIRST METHOD)

                  Tsukuba (disp. levels=16)   Venus (disp. levels=20)
                  Acc (%)                     Acc (%)
Proposed method   88                          92
[15]              99                          99.08
[16]              99.3                        99.7
[17]              99.64                       99.84
[18]              99.62                       99.17

TABLE VII. SPEED COMPARISON TO PREVIOUS ALGORITHMS (FIRST METHOD)

                  Tsukuba (384 x 288)   Venus (434 x 383)   Time increase (%)
                  (disp. levels=16)     (disp. levels=20)
                  Time (ms)             Time (ms)
Proposed method   1.3                   1.97                51.51
[16]              6.2                   156                 2416.12
[15]              4.7                   109                 2219.14
[18]              1000                  5000                400
[17]              6000                  13,000              116.66

B. Second Method

The proposed module uses an SAD window based technique to find the corresponding points in the stereo pair. It uses RGB intensity images to minimize false correspondences, since the RGB component values provide better intensity variation information and thus more efficient discrimination between the candidate pixels. Moreover, the FIS post-processing technique further reduces false disparity estimations, improving the resulting disparity map accuracy and coverage while maintaining image details. The resulting disparity maps are depicted in Fig. 7.

Table VIII presents the processing speed of the realized module, for an operating frequency of 138 MHz and for image pairs with a disparity range of 80 levels. Quantitative results of the proposed module under various configurations can be seen in Table IX, and are compared to previous algorithms in Table X. In Table XI the proposed module is compared to previous algorithms in terms of speed.

Figure 7. Resulting disparity maps for (a) tsukuba and (b) teddy stereo pairs, respectively, along with the original image pairs

V. CONCLUSIONS

A three-stage [11] and a two-stage [12] module addressing the stereo vision matching problem have been proposed. The modules were implemented in hardware, aiming at real-time applications. Reference [11] employs an adaptive-window search as well as a CA approach for false reconstruction removal, while [12] employs a color SAD fixed-window technique along with an FIS module for the elimination of remaining false matches. In both qualitative and quantitative terms, considering the quality of the produced disparity maps relative to the frame rate outputs, highly efficient methods dealing with the stereo correspondence problem have been proposed.

Both modules perform efficiently under various levels of disparity, maintaining an almost constant frame rate output while the disparity range increases from small to large values. Changes of less than 1 frame per second have been measured for the output rates of [11] and [12] as the disparity range varies from 16 up to 80 levels for a 640 x 480 pixel resolution image pair. Real-time speeds of up to 275 and 439 frames per second, for a 640 x 480 resolution image pair with 80 levels of disparity, are achieved by [11] and [12] respectively. This makes the proposed modules suitable for real stereo vision applications. Additionally, each of the proposed modules fits on a single Altera FPGA device. As a result, they could be applied to enable efficient systems including high-speed tracking and mobile robots, object recognition and navigation, biometrics, vision-guided robotics in the automotive industry, three-dimensional modeling and many more.

TABLE VIII. FRAME RATE OUTPUT OF THE REALIZED MODULE (SECOND METHOD)

Disparity levels: 80

Image size   320x240   640x480   800x600   1024x768   1280x1024
Frames/s     1713      439       282       173        104

TABLE IX. QUANTITATIVE RESULTS OF THE PROPOSED MODULE UNDER VARIOUS CONFIGURATIONS (SECOND METHOD)

     Tsukuba          Venus            Cones            Teddy
     Acc (%) Cov (%)  Acc (%) Cov (%)  Acc (%) Cov (%)  Acc (%) Cov (%)
a    94      77       93      75       99      80       98      77
b    89      92       92      90       99      93       98      88

a. Without FIS filtering, b. With FIS filtering.

TABLE X. QUANTITATIVE RESULTS OF THE PROPOSED METHOD COMPARED TO PREVIOUS ALGORITHMS (SECOND METHOD)

                  Tsukuba (disp. levels=16)   Venus (disp. levels=20)
                  Acc (%)                     Acc (%)
Proposed method   89                          92
[15]              99                          99.08
[16]              99.3                        99.7
[17]              99.64                       99.84
[18]              99.62                       99.17



TABLE XI. SPEED COMPARISON TO PREVIOUS ALGORITHMS (SECOND METHOD)

                  Tsukuba (384 x 288)   Venus (434 x 383)   Time increase (%)
                  (disp. levels=16)     (disp. levels=20)
                  Time (ms)             Time (ms)
Proposed method   0.8                   1.2                 50
[16]              6.2                   156                 2416.12
[15]              4.7                   109                 2219.14
[18]              1000                  5000                400
[17]              6000                  13,000              116.66

Results confirm that semi-dense disparity maps, for [11], and dense disparity maps, for [12], can be efficiently calculated without sacrificing speed.

The accuracy of the calculated disparity maps is reduced by nearly 10% compared to previous methods. The main novelty of both implementations compared to previous algorithms is their high processing speed. The trade-off between disparity map quality and processing speed will always be present in this type of implementation. Despite the reduced accuracy compared to [15-18], the proposed modules can effectively be applied to real-time stereo vision applications due to their speed performance.

REFERENCES

[1] R. Jain, R. Kasturi, B.G. Schunck, Machine Vision, McGraw-Hill, New York, 1995.

[2] O. Faugeras, Three-dimensional Computer Vision: A geometric viewpoint, MIT Press, Cambridge, MA, 1993.

[3] D. Murray, C. Jennings, “Stereo vision based mapping and navigation for a mobile robot”, in Proc. of the IEEE International Conference on Robotics and Automation (ICRA 1997), 1997, pp. 1694-1699.

[4] D. Murray and J. Little, “Using real-time stereo vision for mobile robot navigation,” Auton. Robots, vol. 8, no. 2, pp. 161–171, 2000.

[5] S.T. Barnard, W.B. Thompson, “Disparity analysis of images”, IEEE Transactions on Pattern Analysis Machine Intelligence, vol. 2 , pp. 333–340, 1980.

[6] L. Di Stefano, M. Marchionni, S. Mattoccia, “A fast area-based stereo matching algorithm”, Image and Vision Computing, vol. 22, no. 12, pp. 983-1005, 2004.

[7] K. Muhlmann, D. Maier, J. Hesser, R. Manner, “Calculating dense disparity maps from color stereo images, an efficient implementation”, International Journal of Computer Vision, vol 47, no. 1–3, pp. 79-88, 2002.

[8] J.R. Jordan, A.C. Bovik, “Using chromatic information in edge-based stereo correspondence”, Computer Vision Graphics, Image Processing: Image Understand, vol. 54, no. 1, pp. 98-118, 1991.

[9] A. Baumberg, “Reliable feature matching across widely separated views”, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000), 2000, pp. 774-781.

[10] J.L. Crowley, J. Piater, “Introduction to the special issue: international conference on vision systems”, Machine Vision Applications, vol. 16 , no. 1, pp. 4-5, 2004.

[11] C. Georgoulas, L. Kotoulas, G. Ch. Sirakoulis, I. Andreadis, A. Gasteratos, “Real-time disparity map computation module”, Microprocessors & Microsystems, vol. 32, no. 3, pp.159-170, 2008.

[12] C. Georgoulas, I. Andreadis, “A real-time fuzzy hardware structure for disparity map computation”, In preparation.

[13] V. Murino, U. Castellani, A. Fusiello, “Disparity Map Restoration by Integration of Confidence in Markov Random Fields Models”, in Proc. of the International Conference on Image Processing (ICIP 2001), 2001, no. 2, pp. 29-32.

[14] I. Del Campo and J. M. Tarela, “Consequences of the digitization on the performance of a fuzzy logic controller,” IEEE Transactions on Fuzzy Systems, vol. 7, pp. 85–92, 1999.

[15] M. Gong, Y.H. Yang, “Fast Unambiguous Stereo Matching Using Reliability-Based Dynamic Programming”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 998-1003, 2005.

[16] M. Gong, Y.H. Yang, “Near real-time reliable stereo matching using programmable graphics hardware”, in: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), 2005, pp. 924-931.

[17] O. Veksler, “Dense Features for Semi-Dense Stereo Correspondence”, International Journal of Computer Vision, vol. 47, no. 1-3, pp. 247-260, 2002.

[18] O. Veksler, “Extracting Dense Features for Visual Correspondence with Graph Cuts”, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2003), 2003, pp. 689-694.