

ICSES 2008 INTERNATIONAL CONFERENCE ON SIGNALS AND ELECTRONIC SYSTEMS

KRAKÓW, SEPTEMBER 14-17, 2008_____________________________________________________________________________________________________________________________________________________________________________________________________

Copyright © 2008 by Department of Electronics, AGH University of Science and Technology

VHDL described Finger Tracking System for Real-Time Human-Machine Interaction

Xabier Iturbe, Andoni Altuna, Alberto Ruiz de Olano and Imanol Martinez
IKERLAN-IK4 Research Alliance, Embedded System-on-Chip Group,

J. M. Arizmendiarrieta, 2, 20500, Mondragon, Spain
{xiturbe, aaltuna, arolano, imartinez}@ikerlan.es

http://www.ikerlan.es

Abstract—This work presents an image recognition technique that enables finger position to be automatically detected in natural interaction systems, within the context of Ambient Intelligence (AmI). The aim is a robust system in which users do not need to wear gloves containing specific colored markings to establish location. The system has been validated, having been used for the design of a virtual keyboard. The algorithm is based on the Hough Transform to detect the presence of straight lines in a scene, as the finger's profile is sufficient for this method to be employed. To distinguish between a finger and other straight forms that may accidentally appear in a scene, the algorithm fine-tunes location by incorporating movement information. The system is implemented using a field programmable gate array (FPGA), providing real-time capability and improving the sensation of interaction for the user. In this article, we present a detailed description of all the stages involved.

Index Terms—FPGA, Real-Time, Tracking, Video Processing, Hough Transform, Human-Machine Interaction, Ambient Intelligence.

I. INTRODUCTION

Because of the growing miniaturization of electronic circuits and the corresponding increase in the computing power of embedded systems, it is now possible to integrate electronics into people's daily lives and day-to-day objects. Current types of human-machine interface based on buttons, keyboards and menus are set to disappear, and be replaced by intelligent systems that humans relate to using their voices, movements, gestures, etc. Ambient Intelligence (AmI) is the term used to describe this concept [1]. The appearance of this concept means that systems will need to be designed to enable user interaction to occur in a natural, transparent and non-intrusive manner. The real-time recognition of the position pointed at by a finger represents an advance in this line of research, involving a general objective with multiple immediate applications in the area of interest, e.g. [2] and [3].

Tracking algorithms require several operations to be performed on each pixel in the image, resulting in a large number of operations per second. Field programmable gate arrays (FPGAs) provide a fast parallel alternative to using conventional slow serialized processors [4].

Fig. 1: Tracking system functional block diagram

The virtual keyboard application used to validate the system eliminates touch screens and conventional push buttons, and replaces these with a single camera. Furthermore, it combines many other functions, and likewise replaces the sensors that they would require.

The remainder of this paper is organized as follows. Section II explains in detail the stages included in the tracking system. In Section III the virtual keyboard application, which has been used to validate the algorithm, is described. Section IV shows the results obtained after testing the tracking system. Finally, in Section V, the conclusions are presented.



II. FINGER TRACKING SYSTEM DESCRIPTION

The background in finger tracking techniques shows that most of them are not suitable for FPGA implementation. For instance, subtracting consecutive captured frames in order to estimate finger movement requires a large amount of expensive storage resources to keep the previously captured frame [5].

For this reason, a specific algorithm has been designed to track the finger using FPGAs. This algorithm combines a set of well-known techniques in order to achieve real-time tracking. This section describes each of the stages, described in the VHDL language, involved in the proposed recognition system. Figure 1 contains a system flow chart. The image delivered by the camera, at VGA resolution, is binarized in order to detect the pixels that belong to the user's hand. To fine-tune this classification further, the resulting image is filtered using a low-pass filter that eliminates any small residues from the image background that may have mistakenly been identified as pixels belonging to the hand. Sobel XY edge enhancement is then performed to detect the orientation of the edges in the scene. The straight lines of the finger's profile are detected using the Hough Transform. With the aim of eliminating the edges of the hand that do not belong to the finger, a logical AND operation is performed on the edge image and the straight lines detected using Hough. Four candidates are selected for potential finger location: the extreme pixels detected at the top, bottom, right and left positions. The optimum candidate amongst those selected is obtained by processing information about the hand's position and orientation in the frame.

A. Identification of hand pixels: binarization

This stage classifies the pixels delivered by the camera as belonging to skin (hand), in which case the output will be '1', or background, output '0' [6]. At the same time, the amount of information to be processed in the later stages of the algorithm is reduced: binarization is performed. Skin is modeled with a set of conditions applied to the RGB input channels [7]:

R > 50; G > 20; B > 10
R > G + 24
R > B + 10     (1)
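As a behavioral reference (not part of the VHDL design), the per-pixel rule of equation (1) can be modeled in a few lines of Python; function names here are illustrative:

```python
def is_skin(r, g, b):
    """Classify an RGB pixel as skin ('1') or background ('0')
    using the thresholds of equation (1)."""
    return (r > 50 and g > 20 and b > 10
            and r > g + 24 and r > b + 10)

def binarize(frame):
    """Binarize a frame given as a 2-D list of (R, G, B) tuples."""
    return [[1 if is_skin(*px) else 0 for px in row] for row in frame]
```

In hardware the same comparisons map to a handful of comparators and adders per pixel, which is why this stage is so cheap in Table I.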

B. Low-pass filtering

The binarized image may present pixels that have erroneously been identified as skin, which can be considered as noise. Noise effects on the Hough Transform results are not desirable [8]. A linear low-pass filter (easing the hardware implementation) is applied to reduce the effects of such noise on the determination of edge orientation. The convolution matrix w[m, n], of dimension 5x5, used is:

wLP[m, n] = (1/100) ·
    | 1  1  1  1  1 |
    | 1  5  5  5  1 |
    | 1  5 44  5  1 |
    | 1  5  5  5  1 |
    | 1  1  1  1  1 |     (2)
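A plain software model of this 5x5 smoothing stage is sketched below (the hardware serializes the 25 multiply-accumulates, as described in section II-B1; this Python version computes them directly, and the border handling is an assumption):

```python
# 5x5 low-pass kernel of equation (2); the coefficients sum to 100,
# so dividing by 100 normalizes the filter gain to 1.
W_LP = [
    [1, 1,  1, 1, 1],
    [1, 5,  5, 5, 1],
    [1, 5, 44, 5, 1],
    [1, 5,  5, 5, 1],
    [1, 1,  1, 1, 1],
]

def lowpass(img):
    """Apply the 5x5 low-pass filter to a 2-D list of pixel values.
    Border pixels (no full 5x5 window) are left unfiltered."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for m in range(2, h - 2):
        for n in range(2, w - 2):
            acc = 0
            for j in range(-2, 3):
                for k in range(-2, 3):
                    acc += img[m - j][n - k] * W_LP[j + 2][k + 2]
            out[m][n] = acc // 100
    return out
```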

Fig. 2: 5x5 Image filter architecture

Based on this information, the hand's position in a frame is determined. To do so, the frame is divided into four exclusive quadrants (two by two); that is, the hand is placed in the top or bottom and right or left quadrant. Then, depending on the orientation of the hand, vertical or horizontal, the final decision is taken.

The decision criterion consists of counting the pixels classified as skin in each quadrant and, after analyzing a frame, selecting those quadrants that present a number of detected pixels greater than that of their exclusive quadrant.

1) Filter hardware implementation [9]: The filtering process involves carrying out the convolution operation, equation 3.

i′[m, n] = i[m, n] ⊗ w[m, n] = Σ(j=−2..2) Σ(k=−2..2) i[m − j, n − k] · w[j, k]     (3)

Where i is the convolution window for the image to be filtered, w the matrix of filter coefficients (dimension 5x5) and i′[m, n] the pixel processed from the resulting image. Line (Z-Lin) and column (Z-p) delayers are employed to simultaneously provide the pixels required on each clock edge. These delayers are implemented using dual-port BRAMs with the aim of significantly reducing the number of logic slices employed. Multiplication by the coefficients is carried out using wired multipliers embedded in the FPGA. The processing of each pixel involves 25 multiplications and 25 additions. The adopted area/speed trade-off establishes performing five multiplications in parallel, serializing the use of this architecture by means of an accumulator. To maintain the input/output rate, a clock five times faster than the base clock is employed (figure 2, in dark grey). In so doing, the aforementioned BRAMs adapt both speeds by making use of their independent ports. The 25 filter coefficients are stored in distributed logic, and common coefficient multiplication is carried out using an additional wired multiplier.

IMAGE PROCESSING AND RECOGNITION



C. Edge detection

Edge orientations are necessary to calculate the Hough Transform. To obtain such orientations, a convolution mask filter is employed based on the Sobel XY operator (matrix 4). The filter hardware implementation is the same as that described in section II-B1.

wSobelXY[m, n] =
    | 0  0  0  0  0 |
    | 0  0 −1 −1  0 |
    | 0  1  0 −1  0 |
    | 0  1  1  0  0 |
    | 0  0  0  0  0 |     (4)
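For reference, the Sobel XY stage can be modeled with the same convolution structure as the low-pass filter, now using the mask of matrix (4). This Python sketch is a software model only; the decision to treat any non-zero response as an edge is an assumption for illustration:

```python
# Sobel XY mask of matrix (4): a diagonal difference operator whose
# coefficients sum to zero, so uniform regions give no response.
W_SOBEL_XY = [
    [0, 0,  0,  0, 0],
    [0, 0, -1, -1, 0],
    [0, 1,  0, -1, 0],
    [0, 1,  1,  0, 0],
    [0, 0,  0,  0, 0],
]

def sobel_xy(img):
    """Convolve a binary image (2-D list of 0/1) with the Sobel XY
    mask; non-zero responses mark edge pixels."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for m in range(2, h - 2):
        for n in range(2, w - 2):
            acc = 0
            for j in range(-2, 3):
                for k in range(-2, 3):
                    acc += img[m - j][n - k] * W_SOBEL_XY[j + 2][k + 2]
            out[m][n] = acc
    return out
```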

D. Hough Transform

The Hough Transform is recognized as a very effective technique for detecting straight lines and other geometric forms, even when occlusions exist.

Our recognition system detects the finger using the Hough Transform, as the finger presents a very structured form, mainly straight lines. The Hough Transform not only collaborates in the detection of the finger in the frame, it also provides information about its orientation, and subsequently about the hand's orientation: vertical or horizontal.

A broad range of applications that use the Hough Transform can be found, for example [10] and [11]. To extract the most prominent straight lines in a scene, the normal parametrization of the Hough Transform is used:

ρ = m · cos θ + n · sin θ (5)

Where θ and ρ are, respectively, the orientation and length of the normal from the detected straight line to the origin of the image coordinates. Each straight line is defined in a unique manner by θ and ρ, and for each point (m, n) in the original image, a correspondence can be created between the image and the transformed space.

By discretizing the transformed space into a number of accumulating cells, a register of the number of 'votes' in each of the corresponding cells (θ, ρ) can be obtained for each pixel of coordinates (m, n) in the image. Having completed this process for all the pixels in the image, the maxima of the transformed space indicate the presence of straight lines in the image space. The equation of the aforementioned straight lines is defined by parameters θ and ρ.

The transformed space has been defined by four Block-RAMs, giving rise to 200 × 20 accumulators, stored in memory positions. All pixels (m, n) whose transform (θ, ρ) exceeds the determined threshold value Th are considered to form part of a straight line in the scene.
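The voting scheme can be sketched in software as follows. This is a behavioral model, not the BRAM-based pipeline: the 20 × 200 discretization mirrors the text, but the θ range and the mapping of ρ to accumulator bins are assumptions made for illustration:

```python
import math

N_THETA = 20   # 20 quantized angles, i.e. 9-degree resolution
N_RHO = 200    # rho resolution of the accumulator memory

def hough_votes(edge_pixels, width, height):
    """Accumulate Hough votes for a list of (m, n) edge pixels using
    the normal parametrization rho = m*cos(theta) + n*sin(theta)."""
    rho_max = math.hypot(width, height)
    acc = [[0] * N_RHO for _ in range(N_THETA)]
    for m, n in edge_pixels:
        for t in range(N_THETA):
            theta = t * math.pi / N_THETA
            rho = m * math.cos(theta) + n * math.sin(theta)
            # map rho in [-rho_max, rho_max] onto an accumulator bin
            r = int((rho + rho_max) * (N_RHO - 1) / (2 * rho_max))
            acc[t][r] += 1
    return acc

def detect_lines(acc, th):
    """Return the (theta_index, rho_index) cells whose votes exceed Th."""
    return [(t, r) for t, row in enumerate(acc)
            for r, v in enumerate(row) if v > th]
```

Collinear edge pixels concentrate their votes in a single (θ, ρ) cell, which is exactly the maximum the hardware searches for after each frame.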

1) Hardware architecture for calculating the Hough Transform based on CORDIC [12]: The CORDIC (COordinate Rotation DIgital Computer) algorithm was proposed in 1959 by Volder [13] and enables elemental trigonometric functions to be calculated, such as sine, cosine, tangent and arctangent, in addition to logarithms and exponentials.

Fig. 3: Vector (R, β) rotated by an angle θ

Consider the rotation of a vector R with coordinates (x, y) by an angle θ ∈ [0, π/2] (Figure 3), calculable by applying CORDIC:

x′ = x · cos θ − y · sin θ = (x − y · tan θ) · cos θ
y′ = y · cos θ + x · sin θ = (y + x · tan θ) · cos θ     (6)

The CORDIC algorithm is an iterative procedure in which each iteration or micro-iteration, i, rotates the vector in one direction or the other by an angle αi = arctan 2^(−i).

xi+1 = xi − δi · yi · tan αi = xi − δi · yi · 2^(−i)
yi+1 = yi + δi · xi · tan αi = yi + δi · xi · 2^(−i)
θi+1 = θi + δi · αi = θi + δi · arctan 2^(−i)     (7)

Where x0 and y0 are the initial components, θ0 = 0 and δi ∈ {−1, 1}. The direction δi (+/−) for each αi is chosen as + if θ − θi−1 > 0, and vice versa. The algorithm is applied until θi is very close to θ. At this point, the result obtained is K times greater than correct, where

K = ∏(i→∞) 1/cos αi = ∏(i→∞) √(1 + 2^(−2i)) ≈ 1.6467 . . .     (8)
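Equations (7) and (8) can be checked with a short software model. This Python sketch follows the text's ten micro-rotations and undoes the gain K at the end; it is a floating-point reference, whereas the hardware uses fixed-point shifts and a LUT for the 1/K correction:

```python
import math

N_ITER = 10  # the hardware serializes ten micro-rotations per sample

def cordic_rotate(x, y, theta):
    """Rotate (x, y) by theta in [0, pi/2] using the CORDIC
    micro-rotations of equation (7), then undo the gain K of (8)."""
    acc = 0.0
    for i in range(N_ITER):
        alpha = math.atan(2.0 ** -i)
        # direction: + while the accumulated angle lags theta
        delta = 1.0 if theta - acc > 0 else -1.0
        x, y = x - delta * y * 2.0 ** -i, y + delta * x * 2.0 ** -i
        acc += delta * alpha
    k = 1.0
    for i in range(N_ITER):
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))  # gain K, equation (8)
    return x / k, y / k
```

Feeding the pixel coordinates (n, m) into (x0, y0), the two outputs directly give ρ = m·cos θ + n·sin θ values for complementary angles, which is the property the architecture exploits.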

The inputs x and y are exchanged in the CORDIC algorithm; that is, coordinate n is introduced at input x0 and m at input y0. The two outputs from the algorithm are used to calculate the Hough Transform, reducing the computational load. Output x is valid when the rotated angle belongs to the second quadrant, and output y when it belongs to the first quadrant.

With the aim of reducing the hardware resources required, a basic CORDIC processing unit has been designed that serializes the calculation of micro-rotations. Ten micro-rotations are performed at a clock frequency 10 times greater than the base frequency, thereby keeping the input/output rate constant. The correction factor (division by K) is applied via 2 LUTs, implemented through BRAMs.

When performing the Hough Transform, the angles to be calculated do not vary, so the δi sequence for each base cell can be stored in a register.

The resources required for each CORDIC base cell are 37 slices, 43 4-input LUTs and 2 BRAMs for the initial correction. Ten basic CORDIC cells are instantiated, giving rise to 20 values of θ when performing the Hough Transform and attaining a resolution of 9 degrees.

The greatest challenge for a hardware implementation of the Hough Transform is to maintain the characteristic pipeline while minimizing the required silicon area. The problem resides in the need to calculate the transform first, in order to subsequently identify its maxima. This problem has been resolved by duplicating the basic processing structure, so that the image frames are processed alternately (ping-pong strategy) in each of the structures. A serialized base architecture has been designed, working at a clock speed ten times greater than the base frequency and capable of storing the 'votes' issued in 4 BRAMs. The BRAMs have two independent ports, with one used to read the value stored in the accumulator cell and the second port used to write the new, up-to-date value. To do so, it is necessary to delay the address bus for the second port by the number of latent clock cycles in the processing of the value maintained by the accumulator. In addition, a mechanism has been implemented which protects the accumulator cells from overflowing.

Fig. 4: Accumulator cell maintenance

In parallel, the two outputs delivered by the CORDIC-based transform calculation are processed (angles of the first and second quadrants), dedicating 2 BRAMs to the maintenance of each one. Each BRAM has 1024 positions available, and therefore the precision of ρ achieved is approximately 200. A specific piece of electronics is designed to generate the offsets to be added to the calculated module, so that the value is written in the range of addresses assigned to the corresponding value of θ.

Counter maintenance is implemented in a way that allows the pipeline to be maintained, and it is the corresponding image pixel which is responsible for selecting, via a MUX, the final value to be written into the accumulator cell: increased or not increased (Figure 4).

Having completed frame processing, each memory position is accessed and the stored value is compared to a threshold Th. In the event that the stored value is greater, it is determined that the pixel in question belongs to a straight line. In this case, depending on the memory position where it resides (referring to θ), a vertical (if −45 < θ < 45) or horizontal counter is increased. In this way, at the end of the process both counters indicate the majority orientation of the straight lines detected.

E. Finger Position Prediction Algorithm

Based on the position of the finger in the current frame and its position in the previous frame, its anticipated position in the next frame is predicted. In the first instance, the speed of the finger (vx,k, vy,k) is calculated from the two positions obtained for the finger up to the processing of the k-th frame:

vx,k = xk − xk−1
vy,k = yk − yk−1     (9)

Then the next position (xk+1, yk+1) is estimated:

xk+1 = xk ± vx,k · Tfr
yk+1 = yk ± vy,k · Tfr     (10)

Where Tfr is the period between two consecutively processed frames. Currently the system processes a frame every 33 ms, depending exclusively on the camera's rate of information delivery. This finger position prediction algorithm, implemented in software, is used to select the best-suited candidate pixel amongst those preselected, increasing the system's robustness.
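Since the frame period is fixed, the prediction of equations (9) and (10) reduces to a constant-velocity extrapolation. A minimal software sketch (with Tfr normalized to one frame period, an assumption for illustration):

```python
T_FR = 1  # prediction horizon in frame periods (frame-to-frame)

def predict_next(pos_k, pos_k_minus_1):
    """Predict the finger position in frame k+1 from its positions in
    frames k and k-1, per equations (9) and (10)."""
    (xk, yk), (xp, yp) = pos_k, pos_k_minus_1
    vx, vy = xk - xp, yk - yp              # per-frame velocity, eq. (9)
    return xk + vx * T_FR, yk + vy * T_FR  # extrapolation, eq. (10)
```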

F. Candidate selection

Once the image resulting from the logical AND operation between the straight lines detected by Hough and the edge image has been analyzed, this sub-block decides in which candidate pixel the finger lies. Position prediction provides additional information for making this choice, reducing the number of candidates to be taken into account.

The criterion adopted is based on the following idea: the fingertip is always located at the end of the hand; that is, it occupies one of the border pixels detected. The final choice is made based on the orientation of the hand (vertical/horizontal) and its position (top/bottom and right/left). For example, having determined that a hand is vertically orientated, it is only necessary to consider whether it occupies the top or bottom position, and then select the opposite candidate pixel. For instance, if the hand occupies the top position, the bottom candidate is chosen.
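The opposite-quadrant rule above can be sketched as a small decision function. This Python model is illustrative only; the candidate/position encodings are assumptions, not the VHDL interface:

```python
def select_candidate(candidates, orientation, position):
    """Pick the fingertip among the four extreme candidate pixels.
    candidates: dict 'top'/'bottom'/'left'/'right' -> (m, n) pixel.
    orientation: 'vertical' or 'horizontal' (from the Hough counters).
    position: the hand's quadrant, e.g. 'top-left'."""
    if orientation == 'vertical':
        # hand at the top -> fingertip is the bottom-most pixel, and vice versa
        return candidates['bottom' if 'top' in position else 'top']
    # horizontal hand -> fingertip is on the side opposite the hand
    return candidates['right' if 'left' in position else 'left']
```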

III. VIRTUAL KEYBOARD APPLICATION

As mentioned before, the design was applied with the aim of implementing a virtual keyboard, which included a set of 15 alphanumeric keys, numbered from '1' to '9' and 'A' to 'F'. The keys are circular, instead of rectangular or square, so as to avoid the detection of non-finger straight lines by the Hough Transform based algorithm.

A Xilinx MicroBlaze processor (MB) is used as the system CPU; therefore, the whole system is implemented in a single chip: System-on-Chip, SoC. MicroBlaze [14] is a soft-core 32-bit Harvard RISC architecture optimized for Xilinx FPGA families, and its architecture consists of 32 general-purpose registers, an Arithmetic Logic Unit (ALU), a shift unit, and two levels of interrupt. The MB is responsible for processing the detected finger position coordinates. On the one hand, it implements the movement-based position prediction algorithm described in section II-E; on the other hand, the MB decodes the coordinates and acts accordingly, depending on the functionality assigned to them. This software layer guarantees the flexibility of the system to quickly adapt to different applications. In this application, the pointed key is written on a screen. The communication between the finger tracking system and the processor is implemented using interrupts and registers: every processed frame generates an interrupt to the MB, which accesses the register in order to read the detected position. This gives a rate of 30 interrupts per second, allowing the MB to perform the required computation between them. The whole system is shown in figure 6: it includes a serial UART, which allows communication with a monitoring PC, and the drivers for the image sensor and display, both described in the VHDL language. A timer is added to measure the time the finger spends in each position. If the finger remains in the same position for 2 or more seconds, it is considered to be selecting that key. An on-chip PLB (Processor Local Bus) bus is used to interconnect the implemented modules. This is a natural process for designing a system on a chip [15].

Fig. 5: The sequence of stages during the detection of the position pointed at: (a) Binarization of the input image: background in white, skin in black. (b) Low-pass filter to eliminate false detections that may interfere in the subsequent process; it is determined that the hand occupies the upper left part of the image. (c) Edge detection using a Sobel XY operator. (d) Hough Transform, where the detected maxima can be appreciated (θ = 53.44 [H], 64.69 [H], 67.50 [H], 59.06 [H] and 50.62 [H]): it is determined that the hand is in a horizontal position. (e) Superposition of the input image and the straight lines detected via Hough. (f) Logical AND between the hand edges and the lines detected by applying Hough, identifying the silhouette of the finger. The candidate pixels are selected: 1) right; 2) bottom; 3) left; 4) top. (g) Final detection of the finger position, first pixel detected from the right (candidate 1).

Fig. 6: Virtual keyboard application block diagram
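The dwell-time key selection described above (a key is taken as selected after the finger rests on it for 2 seconds, i.e. 60 frames at 30 frames per second) can be sketched as a small per-frame state machine. This is a software illustration of the timer logic, not the MB firmware:

```python
DWELL_FRAMES = 60  # 2 seconds at 30 processed frames per second

def make_dwell_selector():
    """Return a per-frame callback that reports a key exactly once the
    finger has stayed on it for DWELL_FRAMES consecutive frames."""
    state = {'key': None, 'count': 0}
    def on_frame(key):
        if key == state['key']:
            state['count'] += 1
        else:
            state['key'], state['count'] = key, 1
        return key if state['count'] == DWELL_FRAMES else None
    return on_frame
```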

In the near future we will try to combine the developed system with a projection method, in order to provide the ambient intelligent environment described in section I. In this way, it would be possible to change the projected image depending on the user's previous selection. The functionality assigned to the pointed position could be quite different, e.g. making a phone call when the photograph of a relative or friend is pointed at. Furthermore, the 'key' concept should be dismissed in order to achieve a more intelligent and abstract model of human-machine interaction, based on concepts instead of concrete directives. A second camera will also be added in order to estimate the third dimension: depth. Adding this measure (stereo-vision [16]) will permit a refinement of the algorithm procedure: only when the finger is placed in the surroundings of the projection surface will the detection be activated.

IV. RESULTS

The recognition system described has been implemented on a Xilinx XC2V1000 FPGA. Table I shows the synthesis report for each tracking stage. The base system clock is the pixel clock provided by the VGA Omnivision OV7670 image sensor (640 x 480 pixels). Thanks to the pipeline architecture proposed, the processing of a single frame is done in real time, despite having a latency of two frames. This feature is suitable for improving the user's sensation of comfort. The maximum operational frequency of the tracking system is a base clock of 25.21 MHz.

TABLE I: Logic resources needed, by type

                   Binarization  Low-pass filter  Edge detection  Hough Transform  Selection   Total
4-input LUTs                 36              440             440              978        223   2,117
Flip-Flops                   28              423             423              887        234   1,995
18 Kb Block-RAMs              0                4               4               10          0      18
MULT18x18                     0                6               6                0          0      12

Where multiple datasets from the same individual have been used, they have been sourced at different times and accordingly contain some variability in pose and camera orientation, as well as in the illuminance conditions. A 90% success rate was obtained, considering a success to be when an identified pixel is located within a radius of less than 10 pixels from the end of the finger. The tracking algorithm is designed for the camera to be placed at a distance of 30 to 100 cm from the tracked finger. If this condition is not verified, the performance is degraded.

Figure 5 shows the procedure for the detection of the finger position in each processed image frame.

V. CONCLUSIONS

This article presents a detailed description of the stages and architecture used in a finger tracking SoC system, capable of operating in real time. The use of this technique enables a more natural, intuitive yet flexible interaction than classical or tangible interfaces, without color restrictions or the need to wear special gloves [17]. This interaction is an application that has frequently been worked on in the context of Ambient Intelligence (AmI). The developed tracking system has been applied to implement a virtual keyboard application, which prints the selected alphanumeric key on a display.

Up to 30 VGA frames per second are processed, a throughput of 9,216,000 pixels per second, enabling it to be affirmed that the tracking system presents a real-time response. The frame response time matches the camera sampling rate: the system would be able to operate at a higher frequency.

The implementation of this architecture is suitable for Xilinx FPGAs that offer enough on-chip memory for the storage of intermediate results (the Hough Transform accumulators). In this case, there is no need for external RAM modules, and the complete tracking system is implemented on a single chip. Furthermore, the flexibility achieved by using reconfigurable logic elements makes the system very suitable for integration as a coprocessor for an FPGA-based processor, such as a PowerPC hard-core or MicroBlaze soft-core. The processor can simultaneously execute the required functionality depending on the finger position coordinates detected by the tracking algorithm, allowing an improvement in the Ambient Intelligence approach.

ACKNOWLEDGMENTS

The authors would like to thank the Fundacion Centros Tecnologicos Inaki Goenaga for the partial funding of this work.

REFERENCES

[1] Hagras, H., Callaghan, V., Colley, M., Clarke, G., Pounds-Cornish, A., Duman, H., "Creating an Ambient-Intelligence Environment Using Embedded Agents", IEEE Intelligent Systems, 19(6), pp. 12-20, 2004.

[2] Kubota, N. and Abe, M., "Human Hand Detection for Gestures Recognition of a Partner Robot", Proceedings of World Automation Congress (WAC) 2006, Budapest, Hungary, 2006.

[3] Kubota, N., "Human Detection and Gesture Recognition Based on Ambient Intelligence", ISBN 978-3-902613-03-5, pp. 558, I-Tech, Vienna, Austria, 2007.

[4] Tessier, R., Burleson, W., "Reconfigurable Computing for Digital Signal Processing: A Survey", Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 28(1-2), pp. 7-27, 2001.

[5] Lee, J., Lee, Y., Lee, E., Hong, S., "Hand Region Extraction and Gesture Recognition from Video Stream with Complex Background through Entropy Analysis", Annual International Conference of the IEEE Engineering in Medicine and Biology - Proceedings, 26 II, pp. 1513-1516, 2004.

[6] Zarit, B. D., Super, B. J. and Quek, F. K. H., "Comparison of Five Color Models in Skin Pixel Classification", ICCV'99 Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-Time Systems, pp. 58-63.

[7] Vezhnevets, V., Sazonov, V. and Andreeva, A., "A Survey on Pixel-Based Skin Color Detection Techniques", Proc. Graphicon 2003, pp. 85-92, Moscow, Russia, 2003.

[8] Hunt, D. J., Nolte, L. W., Reibman, A. R. and Ruedger, W. H., "Hough Transform and Signal Detection Theory Performance for Images with Additive Noise", Computer Vision, Graphics and Image Processing, 52(3), pp. 386-401, 1990.

[9] Lukin, A., "Tips and Tricks: Fast Image Filtering Algorithms", Proc. of Graphicon 2007, pp. 186-189, Moscow, Russia, 2007.

[10] Ballard, D. H., "Generalizing the Hough Transform to Detect Arbitrary Shapes", Pattern Recognition, 13(2), pp. 111-112, 1981.

[11] Leavers, V. F., "Shape Detection in Computer Vision Using the Hough Transform", Springer-Verlag, New York/Berlin, 1992.

[12] Zhou, F. and Kornerup, P., "A High Speed Hough Transform Using CORDIC", University of Southern Denmark, 1995.

[13] Volder, J., "The CORDIC Trigonometric Computing Technique", IRE Trans. Electronic Computing, Vol. EC-8, pp. 330-334, 1959.

[14] Xilinx Inc., "MicroBlaze Processor Reference Guide", 2007.

[15] Rowen, C., "Reducing SoC Simulation and Development Time", Computer, vol. 25, no. 12, pp. 29-34, Dec. 2002.

[16] Geiger, D., Ladendorf, B., Yuille, A., "Occlusions and Binocular Stereo", International Journal of Computer Vision, 14(3), pp. 211-226, 1995.

[17] Iwai, Y., Watanabe, K., Yagi, Y., Yachida, M., "Gesture Recognition by Using Colored Gloves", IEEE International Conference on Systems, Man and Cybernetics (SMC'96), Vol. 1, pp. 76-81, Beijing, China, Aug. 1996.

