
Real-time Rotation-invariant Static Hand Gesture Recognition Using an Orientation-based Hash Code

Saleh Ud-din Ahmad

Department of Computer Science American International University Bangladesh (AIUB)

Dhaka, Bangladesh. [email protected]

Shamim Akhter

Department of Computer Science American International University Bangladesh (AIUB)

Dhaka, Bangladesh. [email protected]

Abstract— Human gesture recognition allows for a more natural human-machine interface, eliminating expensive training for humans to get accustomed to machines and avoiding the costly mistakes that follow until one becomes an experienced user. With advances in technology, embedded devices with additional processing power and memory are becoming available. This is making our machines more capable but also more complex to operate, so the cost of human error is even higher. Hand gesture recognition offers a solution, but it remains a very time- and space-complex problem when most non-statistical methods are employed. Thus most embedded systems with limited space and processing power are unable to support hand gesture recognition. This paper introduces a statistical method which converts an image contour to orientation-based hash codes in order to project it to a 3D address space bounded by Hamming distance. The main objectives are to reduce time and space complexity along with complete rotation invariance and online scalability. The implemented method proved to be 82.1% accurate against 1000 images comprising 10 distinct static hand gesture sets.

Keywords- Hand Gesture, Hash Code, Rotation invariant, Orientation Histogram.

I. INTRODUCTION

Since the 1980's, human-computer interaction has been a 2D desktop on a monitor, manipulated by devices like the mouse and keyboard. Advances in technology are starting to bring new means of interaction, such as speech recognition and gesture recognition. Gesture recognition is one of the major areas to explore for engineers, scientists and bioinformaticians [1, 2]. It plays a fundamental role in human communication, often powerful enough to be a complete message by itself. This offers a new window for Human Computer Interaction (HCI), one which comes naturally to humans. In other words, a system that is capable of understanding human gestures would allow its user to communicate with the system as if it were a human. This eliminates any need for training the user in order to operate the system. Applicability extends to a wide range of application domains, namely virtual reality, robotics and tele-presence, desktop and tablet PC applications, games and sign language [1, 2].

Gesture recognition can be achieved by wearable sensors (mechanical or optical) attached to a glove, or by computer vision. A number of approaches have been employed for gesture recognition, all of which try to address its specific challenges. Some of the basic challenges are rotation, varying hand size and real-time recognition. Algorithms like Neural Networks, Hidden Markov Models and Shape Context have been employed with considerable success [1, 3, 4]. This paper introduces another method that is scalable to online learning of new gesture sets, real-time, completely rotation invariant and aimed towards systems with limited processing power and memory, i.e., embedded systems.

II. BACKGROUND AND PRESENT STATE OF THE PROBLEM

Human gesture offers a more natural way of interaction between human and machine. To this date people still learn to interact with machines using interfaces like the keyboard and mouse. Hand gesture recognition is an automatic, real-time capability of a system to identify and interpret user gestures using a camera. It has gained popularity in recent years, and could become the future tool for humans to interact effectively with computers or virtual environments [9]. Extensive research into the various hurdles of gesture recognition uses Neural Networks, Support Vector Machines (SVM), graph matching, inductive learning systems, voting theory, Hidden Markov Models, Chamfer distance or Dynamic Bayesian Networks with considerable success [9]. But most of these approaches do not address online scalability and rotation invariance at low time and space complexity.

Figure 1: The proposed system flow diagram. For each frame: capture image frame from camera → background subtraction and color segmentation → select region of hand gesture (ROI, region of interest) → find the contour of the gesture ROI → generate orientation bins of the gesture contour → apply threshold to orientations to generate hash codes → search for the best-result Hamming distance by rotating the hash codes → output the smallest Hamming distance → next frame.

The method proposed here is aimed at the challenges of real-time, rotation-invariant gesture recognition with online scalability. The concept of the orientation histogram, as applied to human detection and hand gesture recognition in [5, 6], is taken as the core feature for recognition. Inspired by the success and efficiency of semantic hashing in [7] for text data mining and spectral hashing in [10] for sorting images, hashing along with Hamming distance was used as the classifier. Hashing reduces the time complexity of recognition compared to other classifiers [7, 10].

III. RESEARCH METHODOLOGY/EXPERIMENTAL DESIGN

This research introduces a statistical method for generating orientation-based hash codes of image contours to classify them in a 3D address space bounded by Hamming distance. Complete rotation invariance is achieved by circularly shifting the generated hash codes. The proposed method takes the robustness of the orientation histogram and the efficiency of semantic/spectral hashing [7, 10] as its core. Figure 1 shows the proposed system flow diagram.

A. Image acquisition

Nowadays there are many types of camera available, varying in resolution, frame rate and even pixel depth. Image acquisition these days is not even limited to cameras; satellites, planes and even sophisticated radars can be a source. Some cameras, in combination with high-end sensors, provide additional information about the image, and multiple cameras can be coupled to generate 3D images. For testing our method, a webcam of the kind available with most laptops and other devices is used, providing RGB 640x320-pixel frames at 15 frames per second with 8-bit depth.

B. Background subtraction by color segmentation

Background subtraction is required to get the hand gesture or Region of Interest (ROI), but in order to do this it must cope with complex, dynamic backgrounds and varying lighting conditions. There are two basic scenarios for background subtraction, depending on whether the associated system uses a static or a mobile camera. The method proposed in this paper is aimed towards autonomous robots, where the camera will not be static. In such cases color segmentation is done to get the region of interest. Color segmentation is one of the most popular methods employed for hand gesture recognition [2, 11, 12]. Human skin color occupies a small region in color spaces like RGB, HSI or L*a*b*; it forms clusters at specific pixels, varying more in intensity than in color [11].

C. Pre-processing on the region of interest

Although color segmentation is computationally cheap, some extra processing is needed to reduce noise in the region of interest. It is expected that the user's hand gesture will be the only thing that remains after color segmentation is complete, though skin-colored pixels around the hand can produce variations in the generated gesture contour. By applying the morphological opening operation (erosion followed by dilation), those contour noises are eliminated [4]. Other problems include arms and a face that may be present in the scene. They can be discarded by the classifier, or removed by running a region identification algorithm using connected component labeling and a wrist detection algorithm for clipping [11]. The gestures can also vary in size and shape, due to distance from the camera or variation of hands from person to person, so the selected region of interest is resized to 112x112-pixel grayscale images. The resulting images are shown in Figure 2.

IV. IMPLEMENTATION

Capturing the image from the camera, background subtraction, finding the ROI, resizing the ROI to 112x112 pixels and extracting the contour have been done using the standard OpenCV library [12].

A. Orientation histogram and Shape context

Counting occurrences of gradient orientation on image contours is the basic similarity between the orientation histogram and shape context, though they differ in how they are calculated. Shape context calculates a set of vectors originating from a pixel to all pixels on the contour, thus describing the contour with respect to the reference pixel; this is done for uniformly spaced pixels on the contour. The orientation histogram differs in that it is calculated over a dense network of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy. Figure 3 shows some problematic images that are not identified by the orientation histogram [6]: the two gesture images (d) and (e) are very similar, whereas (a) and (b) are output in (c) as quite different. This illustrates the drawback of the orientation histogram approach and the reasons why it was tweaked to fit the goals proposed here.
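For contrast, a bare-bones orientation histogram in the spirit of [5, 6] (the bin count and the use of gradient magnitude as weights are assumptions here; the real descriptors add cell grids and local contrast normalization):

    import numpy as np

    def orientation_histogram(gray, bins=36):
        # Quantize gradient directions over the image into a single global
        # orientation histogram; this is the weak feature that the proposed
        # method augments with distance information.
        gy, gx = np.gradient(gray.astype(float))
        ang = (np.degrees(np.arctan2(gy, gx)) + 360.0) % 360.0
        mag = np.hypot(gx, gy)
        hist, _ = np.histogram(ang, bins=bins, range=(0, 360), weights=mag)
        return hist / (hist.sum() + 1e-9)        # scale-normalized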

B. Proposed Method

Once the contour is found, an approach similar to the orientation histogram and shape context is applied. Both the orientation histogram and shape context are calculated with respect to a pixel, with all of the pixels uniformly distributed on the image, and the procedure is repeated for all pixels. This provides a rich feature descriptor defining the shape of the contour, as described in [6, 13]. To reduce the time complexity of the orientation histogram, instead of computing it with respect to all pixels on the image, the orientation histogram was computed with respect to the centre of the contour and applied to all pixels on the contour. As the method computes with respect to the centre only, the orientation histogram in this context becomes a weak feature, so an approach inspired by the shape context described in [13] was used in combination with the orientation histogram. Figure 4 shows the shape context applied to the letter 'A'. In short, the method combines orientation histograms and the shape context to generate circular histograms similar to those described in [8], producing the circular bins h1-hn shown in Figure 5. Although the proposed system in [8] uses the summation of chord lengths shown in Figure 6 as the feature, the proposed method uses the distances of all pixels on the contour from the centre, as shown in Figure 7.

Figure 2: Gesture images after pre-processing

Figure 3: Orientation histograms for hand gestures [6]

Figure 4: Shape context computation and matching. (a) and (b) sampled edge pixels of two shapes. (c) Log-polar histogram bins. (d), (e) and (f) are the histograms for the pixels marked on (a) and (b) respectively [13]

Figure 5: Image divided in regions and feature vectors based on histograms [8]

Figure 6: Chord size representation [8]

Figure 7: A single bin showing the contour pixels and the length calculated from the centre

Figure 8: The 15 circular bins and the bounding rectangle's centre, showing how the centre shifts with the contours

Table 1: Summary of symbols used
Centre of contour: (x_c, y_c)
Pixel on the contour: (x, y)
Angle of contour pixel with respect to centre: Ø
Bin index: i
Total number of pixels on contour: n
Distance of contour pixel from centre: r
Variance of each bin: H
Variance of each bin for the reference circle: H(circle)
Variance of each bin for the gesture: H(gesture)
Hash code digits of the test and train images

The centre that is considered as our reference is actually the centre of the rectangle bounding the gesture contour; Figure 8 shows how the centre changes position with the contour. Another key difference from [8] is that the proposed method does not consider a starting pixel; in other words, it considers all the bins, as will be discussed later.

15 circular bins are used to divide the gesture contour. Initially the method was tested with 4, 8, 12, 15, 30 bins and so on (the bin numbers are chosen such that the angle for each bin does not come out as a fraction, e.g. 360/12 or 360/15), ending with the conclusion that 15 bins yielded the best result in terms of overall accuracy and processing required; i.e., increasing the number of bins increases both the processing required and the overall accuracy (but this holds only with at least 30 contour pixels per bin, or else the results become unpredictable; Table 2 summarizes the results).

Table 2: Summarizing the results to find the optimal bin number
No. of Bins | Max No. of Gesture Sets | Min. Rotation Angle | % False Recognition
4           | -                       | 90                  | -
8           | 2                       | 45                  | 10
12          | 5                       | 30                  | 50
15          | 7                       | 24                  | 75
30          | 8                       | 12                  | 80
45          | 6                       | 8                   | 66

To divide the image contour into circular bins, the equation given below is used to get an angle for each contour pixel.

Ø = tan⁻¹((y − y_c) ⁄ (x − x_c))    (1)

Now, to convert Ø from its −180 to +180 range into the positive 360-degree range, the following is done.

Ø = Ø + 180    (2)

By simply dividing the angle by 24 we are able to identify which bin this contour pixel belongs to.

i = ⌊Ø ⁄ 24⌋    (3)

For all pixels falling in a particular bin index (0-14), the distance of the pixel from the centre is calculated as follows.

r = √((x − x_c)² + (y − y_c)²)    (4)

The variance of these values is used in order to describe the variance of the lengths caused by the shape of the gesture. As all lengths measured to the pixels on the gesture are from the same section of contour (which is assumed to be continuous), the variation in the contour is directly caused by the shape of the contour, shown in Figure 9. Figure 10 shows the circular histogram for an oval, where the value of the log variance is similar for all the bins, and for a hand gesture, where it varies throughout the circular histogram. The log of the variance is taken to reduce the variance of consecutive bins and to normalize it for changes of hand shape across different users. This increases the sensitivity of the method towards nearby pixels rather than being influenced by pixels further away (as the contour is considered a continuous distribution within a single bin) [13].

H_i = log(Var(r_i))    (5)

Using Eq.(5), the variance values for all 15 circular bins identified by Eq.(3) are calculated. Figure 10 shows the result of applying the above method on two different images. All calculation results were truncated to integers in order to maintain a low runtime memory. This affects the accuracy of the classifier, but not so much the overall efficiency. As the system is targeted at autonomous robots running other functionalities like navigation and the monitoring of sensory inputs and other outputs, we accept the loss in accuracy; such systems may include online learning algorithms and other memory-hungry applications as well.

Now, to convert the 15 H values to binary hash codes, we first subtract the variance values of a circle image from those of the gesture image, Eq.(6) (shown in Figure 10), and apply thresholding with respect to zero (shown in Figure 11).

∆H_i = H_i(gesture) − H_i(circle)    (6)

The ∆H values are converted to binary hash values by simply thresholding the subtracted codes at zero (Figure 11); the subtraction may yield negative values, which are also set to zero.
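Putting Eqs.(1)-(6) together, a minimal NumPy sketch of the binary hash-code generation (the +1 inside the log, which guards against zero variance, and the exact form of the circle-reference code are our assumptions):

    import numpy as np

    def bin_log_variance(contour_pts, centre, n_bins=15):
        # Eqs.(1)-(5): bin each contour pixel by its angle about the
        # bounding-rectangle centre, then take the log of the variance of
        # the centre-to-pixel distances within each bin.
        xc, yc = centre
        dx = contour_pts[:, 0] - xc
        dy = contour_pts[:, 1] - yc
        ang = np.degrees(np.arctan2(dy, dx)) + 180.0       # Eqs.(1)-(2)
        width = 360.0 / n_bins                             # 24 deg for 15 bins
        idx = np.minimum((ang // width).astype(int), n_bins - 1)  # Eq.(3)
        dist = np.hypot(dx, dy)                            # Eq.(4)
        H = np.zeros(n_bins)
        for i in range(n_bins):
            d = dist[idx == i]
            if d.size:
                H[i] = np.log(np.var(d) + 1.0)             # Eq.(5)
        return np.trunc(H)              # truncated to integers, as above

    def binary_hash(H_gesture, H_circle):
        # Eq.(6): subtract the circle reference and threshold at zero,
        # mapping negative differences to 0 and positive ones to 1.
        return (H_gesture - H_circle > 0).astype(int)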

C. Rotation Invariance

Up to this point the method is able to classify the gestures without being rotation invariant. To make the method rotation invariant, the pattern generated by the method for the same gesture rotated clockwise 3 times by 90 degrees (shown in Figure 12) had to be considered. The hash codes shown in Figure 12 right-shift circularly 4 times for each 90-degree clockwise rotation. To explain this pattern mathematically, each right shift has to account for 90/4 = 22.5 degrees, whereas the 15 bins divide the image in a circular manner so that each bin covers an angle of 360/15 = 24 degrees.

All the calculations leading up to now did not consider the fractional part, so it is safe to assume that each bin, or hash digit, accounts for 24 degrees (percentage error = 6.25%). Considering this, if the hash code for a gesture image is right-shifted circularly by 1, the resulting hash code will be the one generated by the gesture image rotated 24 degrees clockwise. Rotation angles smaller than 24 degrees cannot be compensated by the binary shifting; however, any change in the hash code not compensated by the shifting is compensated by the Hamming distance, as a Hamming distance of 1 can cover the 6.25% error.

Figure 9: The contour pixel lengths varying from the average length of the contour pixels for a particular bin

Figure 10: Left to right: gesture image, its circular histogram, oval image and its circular histogram

Figure 11: Converting gesture image variance values to binary hash codes

Figure 12: Rotation of gesture images and the respective hash codes

To sum everything up: first, the hash code for a gesture image is calculated; then we try to match the hash code against the regions of each gesture class; if there is no match, we rotate the image by shifting the hash code. This is done till a match is found or we have rotated by 360 degrees. For classifying the gesture image using the generated hash code, the proposed method projects the hash code to a 2D address space with a Hamming radius of 1 (Figure 13 illustrates this scenario, which is explained in Semantic Hashing [7]).
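A sketch of that search loop in plain Python (the dictionary of stored class codes and the radius parameter are illustrative):

    def hamming(a, b):
        # Binary Hamming distance: count of mismatching digits.
        return sum(x != y for x, y in zip(a, b))

    def classify(code, stored, radius=1):
        # Try all 15 circular right shifts of the test code (each shift
        # corresponds to roughly 24 degrees of rotation) and return the
        # stored gesture whose code falls within the Hamming radius.
        best_label, best_d = None, len(code) + 1
        for shift in range(len(code)):
            rolled = code[-shift:] + code[:-shift] if shift else code
            for label, ref in stored.items():
                d = hamming(rolled, ref)
                if d <= radius and d < best_d:
                    best_label, best_d = label, d
        return best_label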

D. Advancement to 3D space

The proposed method gains advantages over the techniques proposed in [6, 8] for autonomous robots: it is completely rotation invariant, less complex (no neural networks for recognition), requires no training and does not need to store large training data. But it was only able to work with a limited number of gestures (4-5), and on addition of a few new gestures the accuracy of the method decreases dramatically. So, to increase the scalability of the method, we changed from a 2D to a 3D address space to accommodate more gestures using the same 15 circular bins (as discussed earlier, 15 bins were found to be optimal through experimentation). 15-bit binary hash codes provide 2^15 = 32768 possible unique numbers in the address pool. Considering a Hamming distance of 1, the address pool comes down to 2^15/15 = 2184; for 2 it is 2^15/15^2 = 145, and for 3 it comes down to 2^15/15^3 = 9. This is why adding more gestures produces overlapping regions, which ultimately reduce the overall accuracy (Figure 14 shows this). To fix this main limitation of our method, we stop using binary and instead use 1000 as the base for each of the 15 digits. Doing so increases our address pool from 2^15 to 1000^15. The number is considerably big, and this is the reason we imagine it as a 3D sphere. To get the advantage of base-1000 hash codes, we do not take the log of the variance in Eq.(5), nor apply the thresholding of Eq.(6); instead we normalize all the bins using the following Eq.(7).

H_i = (1000 × H_i) ⁄ Σ_j H_j    (7)
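A minimal reading of Eq.(7), consistent with the Figure 16 digits summing to roughly 1000 after integer truncation:

    import numpy as np

    def base1000_code(H):
        # Scale the 15 raw (un-logged, un-thresholded) bin values so the
        # digits sum to ~1000, then truncate to integers as before.
        H = np.asarray(H, dtype=float)
        return np.trunc(1000.0 * H / H.sum()).astype(int)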

Although the modified approach seemed promising at first, for complex gestures like the one shown in Figure 15 it lacked accuracy. The highlighted section of the contour image in Figure 15 shows a bin which is not consistent with the earlier assumption that the contour within a bin is continuous (shown in Figure 7). Thus the distance calculation was updated to ignore small differences of ±2 (we landed on this value by trial and error); this made the average calculation more prone to be biased by large changes in contour pixels. This alone did not solve the accuracy problem completely, so the average of the bin values was tried instead of their co-variance. It achieved better results overall, but made the method more vulnerable to noise for gesture images that produce continuous contour pixels for the majority of the bins. In those cases co-variance outperforms the average, which in turn performs better for other complex contours. To improve the proposed method's performance, both approaches are combined to boost the overall recognition accuracy. Hamming distance is considered to combine the two approaches.

But as base 1000 is being used instead of binary, some modification of the Hamming distance calculation is required. In the binary case, the Hamming distance is simply a count of the digits that do not match. Keeping the same principle, the absolute difference between two base-1000 digits is added to the Hamming distance. This resolves the distance calculation, but causes a bigger problem when we try to use variance and average together, as the bin values are completely different for variance and average, as shown in Figure 16. Thus a priority-based system driven by absolute percentage error is used in cases where variance and average do not yield the same gesture.
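A sketch of the modified distance and of one possible reading of the priority rule (the percentage-error form anticipates Eq.(8) below; the factor-2 biasing and flag semantics follow the description in the text, but the exact combination logic is our assumption):

    def hamming_1000(a, b):
        # Base-1000 variant: accumulate absolute digit differences instead
        # of a 0/1 mismatch count.
        return sum(abs(x - y) for x, y in zip(a, b))

    def percent_error(test, train):
        # Eq.(8)-style absolute percentage error between two codes.
        return sum(abs(t - r) * 100.0 / r for t, r in zip(test, train) if r)

    def resolve(code_var, code_avg, ref_var, ref_avg, bias_flag):
        # When variance and average disagree, compare their percentage
        # errors, favoring by a factor of 2 the process the stored flag
        # selects: 0 biases average, 1 biases variance.
        e_var = percent_error(code_var, ref_var)
        e_avg = percent_error(code_avg, ref_avg)
        if bias_flag:
            e_var /= 2.0
        else:
            e_avg /= 2.0
        return "variance" if e_var <= e_avg else "average"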

Figure 13: The 2D-address space [7]

Figure 14: The limitation of using only binary hash codes, producing overlapping regions on the address space

Figure 15: Hand gesture image and its contour with the problematic bin section highlighted

Figure 16: Hash codes for base 2 and base 1000 for the same gesture
Binary hash code:                   1 1 1 0 0 1 1 0 0 0 0 1 0 1 1
Co-variance hash code (base 1000):  52 139 7 71 92 65 34 29 34 27 21 53 111 36 219
Average hash code (base 1000):      54 51 36 56 64 57 84 83 88 78 70 64 76 75 57

The priority system is implemented by biasing the result of Eq.(8) with a factor of 2. This has to be done, with a bit of trial and error in choosing variance or average, whenever a new gesture image is added to the recognition list, and it amounts to a single-bit flag where 0 means biased for average and 1 means biased for variance. By comparing the percentage errors of the two processes, we are able to solve the problem of merging the two approaches. Finally, coming to rotation invariance, the approach discussed earlier worked fine with the modified Hamming distance calculation.

percentage error = |test digit − train digit| × 100 ⁄ train digit    (8)

V. RESULT ANALYSIS

The implemented method proved to be 82.1% accurate against 1000 images comprising the 10 distinct static hand gesture sets shown in Figure 18. The accuracy can be improved by using a larger ROI so that a greater bin number can be used, but this will directly impact the running time. Table 3 shows a comparison between our method and others. Though the proposed method is a bit lower in accuracy than the others, in terms of memory only a total of 482 bytes is required to train 10 hand gestures, and the time complexity in comparison to the most popular classifiers is very low: 7.36 ms on average for recognition. An optimized multi-processor implementation in [14] takes 34 ms on average, approximately 4 times longer than our method. This means less CPU usage per frame and thus faster processing and longer battery life, essential for handhelds and autonomous robots. The camera used produces a frame every 66.6 ms, and thus, ignoring the time to get the ROI, the proposed method would be able to support 9 gestures per frame (66.6 ms ⁄ 7.36 ms); or, we can say, it will be able to support 9 concurrent users at a time.

The method can be adapted to a new set of gestures just by adding new regions of gesture to the 3D address space, thus being online scalable in comparison to most classifiers, which require rigorous training. Figure 17 shows a new gesture region being added and the type of gestures that usually cause confusion. Even with the new gesture set, the method works with the same accuracy and an ignorable increase in CPU time.

Figure 17: A new region being added to the 3D address space, and the type of gestures that usually cause confusion

Table 3: Comparison of results between our method and three others from [9]: K-Nearest Neighbors (k-NN), Normalized Cross Correlation (NCC) and Hamming Distance (HD)
Study                     | i     | ii    | iii    | Proposed
Number of gesture sets    | 10    | 12    | 12     | 10
Number of testing images  | 1000  | 20    | 1000   | 1000
Classifier                | k-NN  | NCC   | k-NN   | HD
Overall % recognition     | 87.9% | 88.8% | 93.88% | 82.1%

VI. CONCLUSION

This method can be used for online/real-time shape learning by autonomous robots, or can be applied to a number of applications like optical character recognition, car number plate recognition, etc. Moreover, as the time complexity of the method is low, further improvement of accuracy is possible if the recognition is done on multiple frames for the same gesture.

Figure 18: Hand gesture sets that were used

REFERENCES

[1] Ms. Sweta A. Raut and Prof. Nitin J. Janwe, "A Review of Gesture Recognition Using Image Object Comparison and Neural Network", in NCETSIT-2011.
[2] Pushkar Dhawale, Masood Masoodian and Bill Rogers, "Bare-Hand Gesture Input to Interactive Systems", in ACM SIGCHI New Zealand, 2006.
[3] Tin Hninn Hninn Maung, "Real-Time Hand Tracking and Gesture Recognition System Using Neural Networks", in World Academy of Science, Engineering and Technology 50, 2009.
[4] Lawrence Y. Deng, Jason C. Hung, Huan-Chao Keh, Kun-Yi Lin, Yi-Jen Liu and Nan-Ching Huang, "Real-time Hand Gesture Recognition by Shape Context Based Matching and Cost Matrix", in Journal of Networks, vol. 6, no. 5, 2011.
[5] Navneet Dalal and Bill Triggs, "Histograms of Oriented Gradients for Human Detection", in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[6] William T. Freeman and Michal Roth, "Orientation histograms for hand gesture recognition", in IEEE Intl. Wkshp. on Automatic Face and Gesture Recognition, Zurich, June 1995.
[7] Ruslan Salakhutdinov and Geoffrey Hinton, "Semantic Hashing", in International Journal of Approximate Reasoning, 2008.
[8] Simei G. Wysoski, Marcus V. Lamar, Susumu Kuroyanagi and Akira Iwata, "A Rotation Invariant Approach on Static Gesture Recognition Using Boundary Histograms and Neural Networks", in Proceedings of the 9th International Conference on Neural Information Processing, Singapore, November 2002.
[9] Herve Lahamy and Derek D. Lichti, "Towards Real-time and Rotation-invariant American Sign Language Alphabet Recognition Using a Range Camera", in Sensors 2012, 12, 14416-14441; doi:10.3390/s121114416, October 2012.
[10] Yair Weiss, Antonio Torralba and Rob Fergus, "Spectral Hashing", in NIPS, 2008.
[11] Xiaoming Yin, Dong Gou and Ming Xie, "Hand image segmentation using color and RCE neural networks", in Robotics and Autonomous Systems, October 2000.
[12] http://opencv.willowgarage.com/wiki/, accessed on 12-12-2012.
[13] Serge Belongie, Jitendra Malik and Jan Puzicha, "Shape Matching and Object Recognition Using Shape Contexts", in IEEE Transactions on Pattern Analysis and Machine Intelligence, April 2002.
[14] Tsukasa Ike, Nobuhisa Kishikawa and Bjorn Stenger, "A Real-Time Hand Gesture Interface Implemented on a Multi-Core Processor", in MVA2007 IAPR Conference on Machine Vision Applications, May 2007.