SPECIAL ISSUE
Novel Haar features for real-time hand gesture recognitionusing SVM
Chen-Chiung Hsieh • Dung-Hua Liou
Received: 24 May 2012 / Accepted: 24 October 2012
© Springer-Verlag Berlin Heidelberg 2012
Abstract Due to the effect of lighting and complex
background, most visual hand gesture recognition systems
work only under restricted environments. Here, we propose
a robust system which consists of three modules: digital
zoom, adaptive skin detection, and hand gesture recogni-
tion. The first module detects the user's face and zooms in so
that the face and upper torso take the central part of the
image. The second module utilizes the detected user facial
color information to detect the other skin color regions like
hands. The last module is the most important part for doing
both static and dynamic hand gesture recognition. The
region of interest next to the detected user face is for fist/
waving hand gesture recognition. To classify the dynamic
hand gestures under complex background, motion history
image and four groups of novel Haar-like features are
investigated to classify the dynamic up, down, left, and
right hand gestures. A simple efficient algorithm using
Support Vector Machine is developed. The defined hand
gestures are intuitive and easy for users to control most
home appliances. Five users performing 50 dynamic hand ges-
tures at near, medium, and far distances, respectively, were
tested under complex environments. Experimental results
showed that the accuracy was 95.37 % on average and the
processing speed was 3.93 ms per frame. An application
integrated with the developed hand gesture recognition was
also given to demonstrate the feasibility of proposed
system.
Keywords Face detection · Gesture recognition · Man–machine interface · Pattern analysis · Support Vector Machine
1 Introduction
Emerging computer vision-based applications include video
surveillance, human–computer interaction (HCI) [1], vehi-
cle driver assistance, and machine inspection. Among these
applications, hand gesture recognition is being developed
vigorously as an interface for HCI. Hand gesture recognition
as a controller [2] to manipulate devices is the most intuitive
form of man–machine interface. The advantage of these
recognition systems is that users can control devices
without touching a panel, keyboard, or mouse. Users
just need to face the camera and raise their hands for
operation control. Hand gesture recognition systems give
people a high degree of freedom and an intuitive feel. Robot
control [3, 4], television remote control [5], presentation
slide control [6], and visual mouse [7, 8] are common HCI
applications. However, computer vision-based gesture rec-
ognition systems are not very popular in our daily life due to
two major problems. First, it is very difficult to detect skin
color under a variety of lightings and environments. Sec-
ondly, people may need heavy training before using com-
plex hand gestures.
We defined two static and four dynamic hand gestures
which are natural and easy to use. The proposed real-time
hand gesture recognition system consists of three major
parts: digital zoom, adaptive skin detection, and hand
gesture recognition. The first module detects user face and
then applies trivial trimming and bilinear interpolation for
zooming in so the face and upper torso take the central part
of the image.

C.-C. Hsieh (✉) · D.-H. Liou
Department of Computer Science and Engineering,
Tatung University, No. 40, Sec. 3, Jhongshan N. Rd,
Taipei 104, Taiwan, R.O.C.
e-mail: [email protected]

J Real-Time Image Proc
DOI 10.1007/s11554-012-0295-0

The second module is based on an adaptive
skin color model for hand region extraction. By adaptive
skin color model, the effects from lighting, environment,
and camera can be greatly reduced, and the robustness of
static hand gesture recognition could be improved. The last
part is the most important part which does both static and
dynamic hand gestures recognition.
To speed up the recognition of static hand gestures, a
region of interest (ROI) beside the detected user face is
defined. The fist hand is verified by a three-gray-level Haar-like
feature and waving hand is detected by checking the
amount of motion within the specified ROI. As for dynamic
hand gestures, we observe the motion history image (MHI)
for each dynamic directional hand gesture and design four
groups of Haar-like patterns. These Haar-like patterns are
introduced for the first time to count the number of black-
white patterns as statistical features for classification. A
real-time efficient algorithm using Support Vector Machine
(SVM) based on the statistical features is then developed to
distinguish these dynamic hand gestures.
An initial version of this paper appeared in [9]; in this
paper, the heuristic dynamic hand gesture recognition
method is redesigned and several thresholds are removed.
For comparison, a back-propagation neural network
[10] for dynamic hand gesture recognition was also
implemented using the same MHI. In addition, more
experiments were conducted for system performance
analysis. In the following section, we briefly review
related research in hand gesture recognition. In Sect. 3,
we present the details of the ROI- and adaptive-skin-color-based
static hand gesture recognition and the MHI-based
dynamic hand gesture recognition using SVM.
In Sect. 4, experimental results are given to demonstrate
the system performance. In the last section, we conclude
and give some directions for future research.
2 Related works
Various computer vision-based HCI systems have been
developed using cameras with a single lens [11], multiple lenses
[12], depth perception [13], infra-red sensing [14], or
combined lenses [15]. Different lenses give different informa-
tion. The more the information utilized, the higher the
recognition accuracy would be. However, more cameras
may require special installations and cost much for the
extra information processing. For example, the
Xbox 360 is equipped with a Kinect sensor
(http://msdn.microsoft.com/zh-tw/hh367958.aspx) consisting of
an infra-red emitter and an infra-red and RGB camera. Though
Kinect sensor can recover the 3D depth map, it costs more
and needs more computations. According to the survey
given in [16], there are a variety of methodologies used for
human gesture recognition, ranging from principal compo-
nent analysis [17], hidden Markov models [18], particle
filtering [19], and finite state machines [20], to neural net-
works [21]. In the following, we will brief some researches
in hand gesture recognition for device control and discuss
their feasibilities.
Licsar and Sziranyi [22] proposed a vision-based hand
gesture recognition system with interactive training, aimed
to achieve a user-independent application by online
supervised training. Their main goal is that any non-trainer
user should be able to use the system instantly. If the
recognition accuracy decreases, only the faulty detected
gestures are retrained to realize fast adaptation. They
implemented a hand gesture recognition system for a
camera–projector environment in which users can directly
interact with the projected image by hand gestures, real-
izing an augmented reality tool in a multi-user environ-
ment. However, the color of hand gestures would be
affected by the color of projected content.
Wu [23] developed a hand gesture recognition system to
control the media player with a common web camera. The
shapes of arms represent different commands for the con-
trol of media player. The system firstly separated the left-
arm by background subtraction and detected the straight
line by both the Hough transform and the Radon transform. The
disadvantage of this method was that the defined hand
gestures were not intuitive. For example, a straight arm was to play
the music and a bent arm was to stop the music. Furthermore,
it was inconvenient for user to adjust the operating distance
before usage.
Lai [24] designed and implemented an interactive biped
robot which could be controlled by hand gestures. The
number of fingers and the angles between fingers were used to
classify nine types of static hand gestures. Two dynamic
hand gestures were defined by the direction of forefinger
and the size of palm. To overcome the effect of lighting,
they utilized scroll bars to manually set the scope of skin
color in YCbCr space. Fixed background model and image
subtraction were used to segment the moving arm for hand
gesture recognition.
Tu [25] presented a face-based hand gesture recognition
system by a single camera for HCI application. Face was
firstly detected by defining specific scopes for skin color in
normalized RGB. Hand region was assumed to appear by
the side of face. Eleven static hand gestures as shown in
Fig. 1 were defined to control the computer. Back-propa-
gation neural network was utilized for hand gesture
recognition. However, these hand gestures are easily confused
due to similar shape. Still, lighting and environment may also
cause problems for skin color detection.
In summary, most previous methods adopting non-adaptive
skin color models may cause problems if the environment
is complex. As for the vocabulary set of hand
gestures, most works were based on the fist hand varied
with different numbers of fingers, as shown in Fig. 1. Users
may need some time to memorize these gestures and
could confuse the similar ones. In addition,
some systems require users to sit in front of the
camera within a specified distance. Here, we try to relax
these limitations. Firstly, we propose a face-based adaptive
skin color model for static hand region segmentation. Sec-
ondly, the adopted static and dynamic hand gestures are
simple and intuitive. Our defined hand gestures are quite
different from one another and easy to memorize. The
intuitiveness comes from the idea that we bind the up/down/
left/right dynamic hand gesture with the up/down/left/right
directional key of a remote control. Last but not least are
the innovative recognition kernels for both static and
dynamic hand gestures.
To speed up the recognition of static hand gestures, an
ROI beside the detected user face is defined for checking. Fist
hand is verified by a three gray-level Haar-like feature and
waving hand is detected by checking the amount of motion
within that specified ROI. As for dynamic hand gestures,
four groups of Haar-like patterns are designed and a simple
but efficient algorithm is developed to classify these
directional gestures in MHI representations. In contrast to
the Haar features [26, 27] designed for face detection using
Adaboost cascade classifier, our designed Haar features are
different and the corresponding algorithm using SVM
executes even faster.
3 System architecture
There are dynamic and static hand gestures, as shown in
Fig. 2. The direction of the moving hand is used to classify
the four dynamic hand gestures in Fig. 2a–d, while the
presence or absence of motion beside the face is used to classify the two
static hand gestures in Fig. 2e, f. Figure 3 shows the flow
chart of the proposed system which is divided into three
major parts: digital zoom, adaptive skin color detection,
and hand gesture recognition. Each part is described in
the following subsections. Note that face detection, pro-
posed by Viola and Jones [26] and extended by Lienhart
and Maydt [27], is adopted as one of the key components.
The characteristic of their method is the use of black-white
Haar-like patterns to find the eyes on a face, which is
independent of people's skin color. However, false
alarms can occur at eye-like patterns. In this paper, a
false alarm is filtered out if the number of skin
color pixels within the detected face region is less than a
given threshold.
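The false-alarm filter just described can be sketched as follows; the candidate boxes would come from OpenCV's cascade detector (cv2.CascadeClassifier.detectMultiScale), and the minimum skin ratio of 0.4 is an assumed value, since the paper does not state its threshold:

```python
import numpy as np

def filter_face_candidates(candidates, skin_mask, min_skin_ratio=0.4):
    """Keep only face boxes whose interior is mostly skin-colored.

    candidates: iterable of (x, y, w, h) boxes, e.g. from
    cv2.CascadeClassifier.detectMultiScale on the gray frame.
    skin_mask: binary uint8 mask of the frame (nonzero = skin pixel).
    min_skin_ratio is an assumed threshold, not the paper's value.
    """
    faces = []
    for (x, y, w, h) in candidates:
        region = skin_mask[y:y + h, x:x + w]
        # Reject eye-like false alarms: too few skin pixels inside the box.
        if region.size and np.count_nonzero(region) / region.size >= min_skin_ratio:
            faces.append((x, y, w, h))
    return faces
```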
3.1 Digital zoom
It is necessary to magnify the image area around the user for
hand gesture recognition if the user is distant from the camera;
thus, users do not need to adjust their positions.
This step also normalizes the image size to 320 × 240 pixels
because the initially set image resolution may be different. If
the detected face is smaller than the standard face size,
users could either adjust the face size manually using the
camera's optical zoom capability, or let the system enlarge
the user area automatically by bilinear zooming as in
(1), based on the detected face in this stage.
ROI1(x, y, w, h) = (facemax.x − 5·facemax.r, facemax.y − 5·facemax.r, 10·facemax.r, 10·facemax.r),   (1)

where ROI1 represents the zoomed image, (x, y) is the
coordinate of its top-left corner, and (w, h) is
its width and height. facemax is the largest detected face;
being the nearest to the camera, it owns the control of the
system. (facemax.x, facemax.y) is the center of the circle enclosing
the detected maximum face with radius facemax.r.
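Eq. (1) can be sketched directly in code; the clipping to the image border is our addition, since the paper does not say how out-of-frame windows are handled. The cropped window would then be resized to 320 × 240 by bilinear interpolation (e.g. cv2.resize):

```python
def roi1_from_face(face_x, face_y, face_r, img_w, img_h):
    """Compute the zoom window ROI1 of Eq. (1): a 10r x 10r square
    centered on the largest detected face, clipped to the image.

    (face_x, face_y) is the face circle center, face_r its radius.
    Returns (x, y, w, h) of the window to crop and bilinearly resize.
    """
    x = max(0, face_x - 5 * face_r)
    y = max(0, face_y - 5 * face_r)
    w = min(10 * face_r, img_w - x)   # clip so the window stays in frame
    h = min(10 * face_r, img_h - y)
    return x, y, w, h
```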
Fig. 1 Face-based hand gesture recognition for HCI applications [25]. a Face detection and the operation ROI shown beside the face. b Defined hand gestures
The ideal operating distance is about 60 cm in our hand
gesture recognition system as shown in Fig. 4a. If user is
far away, the detected face would appear smaller as shown
in Fig. 4b. By digital zoom, the user area could be enlarged
as ROI1 as in Fig. 4c and the hand gesture recognition
results would not be corrupted. This is because the black-
white properties of the Haar-like features would not change
even if the images are magnified.
Fig. 2 Defined dynamic hand gestures (a–d) and static hand gestures (e, f)

Fig. 3 The system flowchart consists of a digital zoom, b adaptive skin detection, and c hand gesture recognition

Assuming the user does not move dramatically, the
location of the face in the next frame is confined to
another region of interest, ROI2, centered at the current
location of the detected maximum face; the size of ROI2 is set to
1.5 × 1.5 times that of the detected face. Thus, we could reduce
the time needed by Adaboost cascade classifier for face
detection. Table 1 summarizes the processing times with or
without the ROI2 setting. If no face is detected, then it is
necessary to search for the face in the whole image. The operating
distance is subject to the resolution limits of the webcam. If the
user operates at 3 m away or more, the captured face
would be too small to be used for face detection.
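The ROI2 search window can be sketched as below; the fallback to the whole frame follows the text, while the border clipping is an assumption:

```python
def roi2_from_face(face_x, face_y, face_w, face_h, img_w, img_h, scale=1.5):
    """Search window for the next frame's face: a box `scale` times the
    current face box, centered on it, clipped to the image.

    Falls back to the whole frame when no face was detected
    (signaled here by a non-positive face size)."""
    if face_w <= 0 or face_h <= 0:          # no previous face: full search
        return 0, 0, img_w, img_h
    w, h = int(face_w * scale), int(face_h * scale)
    x = max(0, face_x + face_w // 2 - w // 2)
    y = max(0, face_y + face_h // 2 - h // 2)
    return x, y, min(w, img_w - x), min(h, img_h - y)
```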
3.2 Adaptive skin color detection
Because the scope of general skin color [28] covers many
skin-like colors, false positive or false negative rates are
sometimes unacceptable. Hence, if we could construct an
adaptive skin color model, the misclassification rate would
be greatly reduced. By exploiting skin color information
from individual’s face, we could create the skin color
model for each person and then improve system robustness
because of the reduced amount of color variations between
a person’s face and hands [29].
The face-based adaptive skin color model proposed by
Liou [30] is adopted here. Skin region of detected face
could be obtained by eliminating eyes, nostrils, and mouth
regions through gray-level histogram analysis. The color distri-
butions in normalized red, normalized green, and original red
are assumed to be Gaussian, so their means
and SDs are calculated to build the adaptive skin color
model. Afterward, we can use that skin color model to
detect the other skin color regions of that person. Experimental
results show that our system could detect skin
pixels correctly even under extremely bad lighting conditions
in which the face colors are distorted to abnormal skin chromaticity.
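A minimal sketch of the adopted face-based model follows, assuming a simple per-feature k-sigma test as the classification rule (the paper does not state how the Gaussians are thresholded); the tolerance k = 2.5 is a placeholder:

```python
import numpy as np

def fit_skin_model(face_pixels_rgb):
    """Fit per-channel Gaussians to the face's skin pixels.

    Features follow the paper: normalized r, normalized g, and original R.
    face_pixels_rgb: (N, 3) array of RGB skin pixels from the face
    (eyes/nostrils/mouth already removed). Returns (means, SDs), each (3,)."""
    rgb = np.asarray(face_pixels_rgb, dtype=np.float64)
    s = rgb.sum(axis=1) + 1e-9
    feats = np.column_stack([rgb[:, 0] / s,    # normalized r
                             rgb[:, 1] / s,    # normalized g
                             rgb[:, 0]])       # original R
    return feats.mean(axis=0), feats.std(axis=0)

def is_skin(pixel_rgb, mean, std, k=2.5):
    """Classify a pixel as skin if every feature lies within k SDs.
    k is an assumed tolerance, not a value given in the paper."""
    r, g, b = (float(c) for c in pixel_rgb)
    s = r + g + b + 1e-9
    feats = np.array([r / s, g / s, r])
    return bool(np.all(np.abs(feats - mean) <= k * std + 1e-9))
```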
3.3 Static hand gesture recognition
Hand regions are detected based on the previous adaptive
skin color model. It is assumed that the user performs the fist
and waving hand gestures at the same depth as the
face, within the area indicated by the red rect-
angle ROI3 in Fig. 5a. ROI3 for static hand gesture
detection is just to the right of the detected user
face. ROI3 is set based on the habit of right-handed
users and could be moved to the left side of the face, next
to the left ear, for left-handed users. ROI3 could be adjusted by
displaying it on a monitor. The position and size of
detected maximum face is used to automatically specify
ROI3 as in (2).
ROI3(x, y, width, height) = (facemax.x − 4.2·facemax.r, facemax.y − 1.5·facemax.r, 3·facemax.r, 3·facemax.r)   (2)
The size of the specified ROI3 is defined as 1.5 times the size
of the detected face because a palm usually covers half of the
user's own face.
3.3.1 Fist hand
The ROI for fist hand detection is further divided into Rin
and Rout, as shown in Fig. 5b. User hand gestures are per-
formed at the same depth as the user's face. The detected skin
region would be as shown in Fig. 5c if the user makes a fist.
Hence, the fist hand gesture can be recognized by
checking these two areas as in (3), where Threshold1,
bounding the skin portion in Rout, and Threshold2,
bounding the skin portion in Rin, are usually set to 0.3
and 0.8 to prevent misclassifying a very big (close)
or a very small (far) fist hand, respectively.
(SkinPix ∈ Rout)/Rout ≤ Threshold1 and (SkinPix ∈ Rin)/Rin ≥ Threshold2.   (3)
Figure 5c gives a perfect example. However, the user's
palm may not occupy the center part all the time; the
system can tolerate this situation in the real world by (2).
Still, some false palms may be detected, and in
order to filter out wrongly detected fist hand gestures,
Fig. 4 a Operating at 0.6 m from the camera. b Operating at 3 m from the camera. c Zoomed image of b

Table 1 Processing time of face detection with and without the ROI2 setting

Frame size   ROI2 setting   Operating distance (m)   Processing time (ms)
320 × 240    Yes            <1.5                     5–8
             No             <1.5                     45–55
640 × 480    Yes            <3                       5–8
             No             <3                       200–230
a simple Haar-like feature, as shown in Fig. 6a, is used
for verification, as shown in Fig. 6b. The color space of
ROI3 is first transformed from RGB to gray level and the
histogram is equalized, as shown in Fig. 6c, for verification.
The OpenCV HaarTraining utility
(http://note.sonots.com/SciSoftware/haartraining.html) is used to
train the classifier for fist hand detection. Since ROI3 is
small, the verification process takes little time.
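The Rin/Rout test of Eq. (3) can be sketched as follows; the inner-region fraction is an assumption about the Rin/Rout geometry, which the paper only shows in Fig. 5b:

```python
import numpy as np

def is_fist(skin_mask_roi3, inner_frac=0.5, t1=0.3, t2=0.8):
    """Eq. (3): a fist is present when the inner region Rin is mostly skin
    while the surrounding ring Rout is mostly background.

    skin_mask_roi3: binary mask of ROI3 (nonzero = skin).
    inner_frac sets Rin's side length relative to ROI3 and is an assumption;
    t1/t2 are the paper's Threshold1 = 0.3 and Threshold2 = 0.8."""
    h, w = skin_mask_roi3.shape
    ih, iw = int(h * inner_frac), int(w * inner_frac)
    y0, x0 = (h - ih) // 2, (w - iw) // 2
    inner = skin_mask_roi3[y0:y0 + ih, x0:x0 + iw]
    n_in = np.count_nonzero(inner)
    n_out = np.count_nonzero(skin_mask_roi3) - n_in
    out_area = h * w - ih * iw
    # Rout must be mostly empty, Rin mostly skin.
    return (n_out / out_area <= t1) and (n_in / (ih * iw) >= t2)
```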
3.3.2 Waving hand
The ROI3 for fist hand detection is also used for waving
hand gesture recognition but based on motion detection
and time sequence as shown in Fig. 7. By observing the
waving hand gesture in Fig. 7a, two apparent phenomena
can be observed. First, the motion obtained by subtract-
ing two consecutive frames is very obvious, as
shown in Fig. 7b. Second, the motion lasts for a
period of time, as shown in Fig. 7c. Therefore, these two
conditions are used to verify waving hand gesture. If the
portion of motion region in ROI3 is greater than a given
threshold Threshold3 and lasts for a period of time, the
waving hand gesture could be confirmed. Threshold3 is
set to 70 % to prevent misclassifying small vibra-
tions of the user's hand, and the time period is set to 3 s in
this paper.
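The two waving conditions above (motion portion ≥ 70 % of ROI3, sustained for 3 s) can be sketched like this; the frame rate and the binarization threshold of the frame difference are assumptions:

```python
import numpy as np
from collections import deque

class WavingDetector:
    """Waving = motion fills most of ROI3 (Threshold3 = 70 %) and persists
    for a period (3 s in the paper). fps is an assumed frame rate."""

    def __init__(self, fps=30, t3=0.7, duration_s=3.0):
        self.need = int(fps * duration_s)     # frames that must show motion
        self.t3 = t3
        self.history = deque(maxlen=self.need)

    def update(self, prev_roi3_gray, cur_roi3_gray, diff_thresh=25):
        """Feed consecutive grayscale ROI3 crops; True once waving is confirmed."""
        diff = np.abs(cur_roi3_gray.astype(np.int16) -
                      prev_roi3_gray.astype(np.int16)) > diff_thresh
        self.history.append(diff.mean() >= self.t3)
        return len(self.history) == self.need and all(self.history)
```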
3.4 Dynamic hand gesture recognition
As shown in Fig. 3c, the dynamic and static hand gesture
recognition processes are executed at the same time.
However, ROI1, instead of the small ROI3, is used for
dynamic hand gesture recognition because the overall motion
covers the upper part of the operating user. In addition,
dynamic hand gesture recognition is conducted on motion
history image in which motion information among frames
could be accumulated in a single image. Four groups of
Haar-like features are designed for measuring the quan-
tities of directions. An innovative direction detection of
moving hand using motion history image-based SVM is
then proposed.
3.4.1 Motion history image
The benefit of motion history image is that it could pre-
serve object trajectories in a frame. Continuous motion
information is used to update the motion history image
MHI as in (4), where DF is the binary (0/255) difference frame and a is set
to 15. The values in MHI are clipped to the range 0–255.

MHI(x, y)_t = MHI(x, y)_{t−1} + DF(x, y)_{t−1} − a   (4)
Figure 8 gives an example of an MHI. Figure 8a, b show
two consecutive frames and Fig. 8c is the difference frame,
in which the resulting regions are the motion regions. From
the MHI shown in Fig. 8d, the moving direction of the hand can
be recovered by checking the orientation of the variation from
black to white.
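One update step of Eq. (4) can be sketched as follows, reading DF as a binary 0/255 difference frame so that moving pixels saturate toward white while still pixels decay by a = 15 per frame; the binarization threshold is an assumption:

```python
import numpy as np

def update_mhi(mhi, prev_gray, cur_gray, a=15, diff_thresh=25):
    """One step of Eq. (4): MHI_t = clip(MHI_{t-1} + DF_{t-1} - a, 0, 255).

    Moving pixels (DF = 255) are pushed toward white, while untouched
    pixels decay by `a` per frame, so older motion appears darker."""
    df = (np.abs(cur_gray.astype(np.int16) -
                 prev_gray.astype(np.int16)) > diff_thresh)
    out = mhi.astype(np.int16) + df.astype(np.int16) * 255 - a
    return np.clip(out, 0, 255).astype(np.uint8)
```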
Fig. 5 a Automatically specified ROI3 for static hand gesture recognition; the user could adjust ROI3 to be larger than the fist. b ROI3 divided into Rin and Rout for fist hand detection. c An example of a detected fist hand

Fig. 6 a Defined Haar-like feature for fist hand verification. b ROI3 in RGB. c ROI3 in gray scale after histogram equalization
For real-time performance, images are reduced to 24 rows × 32
columns for the detection of the moving direction. Figure 9
shows examples of the reduced ROI1, difference image, and
motion history image. Although this step blurs the
details, it does not affect the following direction detection
method. There are four categories of MHI, as shown in
Fig. 10. Note that Fig. 10c, d are not misplaced; the captured
images are mirrored.
A neural network was first tried for recognizing the dynamic hand
gestures. It adjusts the weight of each neuron through a
training process so as to identify different patterns. A back-
propagation neural network was developed in our system
for the recognition of the four MHIs. In the tests, 40 sam-
ples were used to train the neural network model and
another 40 patterns were used for testing. We found that
the accuracy rate stuck at about 85 % because the vari-
ations of some dynamic hand gestures were similar. For
example, a hand moving to the upper right and a hand moving
to the upper left would be ambiguous and misclassified. Therefore, we
investigated a novel statistical method to surpass the neural
network approach.
3.4.2 Direction detection by Support Vector Machine
To recognize the four kinds of MHI, four groups of Haar-
like directional patterns for moving up, down, left, or right
as shown in Fig. 11 are designed, respectively. As stated in
the previous section, the MHI is of 24 9 32. To avoid the
boundary condition, there are 22 patterns for upward and
downward, and 30 patterns for rightward and leftward. If
the two gray levels located at the corresponding locations in
the MHI match with a pattern, the counter for that pattern
would be increased by one. Take an example as shown in
Fig. 12 to describe this methodology. The counter Count(right)
of right direction would be increased by one if any of the
patterns in Fig. 11c matches with the motion history image
as shown in Fig. 12b. That is, if the left element is brighter
than the right element, Count(right) would be increased by
one. The white-and-black pattern at the top of
Fig. 11c is the nearest one; it represents the part of the MHI
updated by the last few frames. On the contrary, the
pattern at the bottom represents the part of the
MHI updated by the first few frames.
Fig. 7 a Waving hand gesture. b Result of motion detection. c Threshold of waving hand elapsed time

Fig. 8 a Previous frame. b Current frame. c Difference frame. d The resulting MHI

Fig. 9 Reduced images of a ROI1, b the difference image, and c the resulting MHI
Similarly, the corresponding direction counter is
increased if the black-white condition is met for any pattern
of the other three groups. Note that the patterns are designed for
right-handed users; the mirrored patterns of moving rightward,
as in Fig. 12c, could be used for left-handed users. The patterns in
Fig. 11 do not mean that the user's hand needs to move across the
whole image. The detailed process is described in
the following algorithm, whose central design concept is
to recover the motion track by counting the
number of matched black-white patterns. In addition,
to distinguish waving hand gestures from moving gestures,
two conditions must be met: the moving hand
gesture should not last more than 3 s, and the
directional counter must exceed a given threshold. The
sum of all direction counters, as in Step 4, is used to normalize
the four direction counters. In the original design, the
Fig. 10 Four categories of MHI that correspond to the four dynamic hand gestures: a upward, b downward, c rightward, d leftward

Fig. 11 Patterns for detecting the direction of the moving hand: a up, b down, c right, d left

Fig. 12 a A left-handed user doing a hand gesture of moving right. b The corresponding MHI. c The Haar patterns for detecting rightward movement for a left-handed user
dynamic hand gestures would be classified by the direction
counter with the maximum value in Step 5. However, a hand
moving upward may sometimes be misclassified as a hand
moving rightward, and similar confusions arise from
this simple maximum principle. To gain better rec-
ognition results, the Support Vector Machine, developed
by Smola and Schölkopf [31] based on statistical learning theory,
can be used to solve nonlinear and high-dimensional problems.
Supervised learning is used to inform the machine of the
correct answer to facilitate correction. The purpose of SVM
is to achieve a maximal-margin hyperplane using the least
amount of training data.
In a linearly separable environment, SVM can use a hyperplane
directly for classification. However, most problems arise
in nonlinearly separable environments. Classification
problems like ours should therefore first be handled by
transforming the data with a kernel function. Boser et al. [32]
proposed mapping the original low-dimensional data
into a higher-dimensional feature space through a kernel
function, and then finding a linear hyperplane in that space
to solve the classification problem. Data that cannot be
separated by linear functions can thus be classified by
a hyperplane in a high-dimensional feature space. Equation
(5) gives the classification function on the transformed data:
f(x) = sgn( Σ_{i=1}^{n} a_i y_i K(x_i, x) + b ),   (5)

where a_i is a Lagrange multiplier and K(x_i, x) is the
kernel function of the mapping to the high dimension. The
Radial Basis Function (RBF) kernel, defined as follows, is
selected as the kernel function. The derived feature vector is
then classified using the nonlinear SVM.
K(x_i, x_j) = exp(−γ ||x_i − x_j||²),   γ > 0   (6)
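Equations (5) and (6) can be written out directly in code; the support vectors, multipliers, and bias below are placeholders standing in for the values a real training procedure (e.g. SMO) would produce:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    """Eq. (6): K(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    d = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-gamma * np.dot(d, d)))

def svm_decision(x, support_vectors, alphas, labels, b, gamma=1.0):
    """Eq. (5): f(x) = sgn(sum_i alpha_i * y_i * K(x_i, x) + b).

    alphas, labels, and b come from training; here they are
    placeholders used to show how a trained model classifies
    a 4-dimensional direction-count vector."""
    s = sum(a * y * rbf_kernel(sv, x, gamma)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b >= 0 else -1
```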
Algorithm: Hand gesture recognition using Support Vector Machine
Input: Motion history image MHI(i, j).
Output: The direction of the moving hand.

1. Initialization:
   Count(i) = 0, i = up, down, left, right;
2. For each row i of MHI(i, j)
     For each column j of MHI(i, j) {
       For each left/right directional Haar-like pattern k in Fig. 11c, d
         If MHI(i, j) > MHI(i, j + k)        /* the left is lighter than the right */
           Count(left)++;
         else if MHI(i, j) < MHI(i, j + k)   /* the right is lighter than the left */
           Count(right)++;
     }
3. For each column j of MHI(i, j)
     For each row i of MHI(i, j) {
       For each up/down directional Haar-like pattern k in Fig. 11a, b
         If MHI(i, j) > MHI(i + k, j)        /* the up is lighter than the down */
           Count(up)++;
         else if MHI(i, j) < MHI(i + k, j)   /* the down is lighter than the up */
           Count(down)++;
     }
4. Normalization:
   Sum = Count(up) + Count(down) + Count(left) + Count(right);
   For each counter i: Count(i) = Count(i) / Sum;
5. /* Original design: Direction = argmax_i Count(i); */
   Feed the four normalized counters Count(i) as a 4-dimensional vector into the trained SVM;
   The trained SVM outputs the recognized dynamic hand gesture type;
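The counting loop of Steps 2-3 can be sketched in Python. The exact pattern set of Fig. 11 is not fully recoverable from the text, so the sketch assumes the pattern index k is the gap between the two compared cells, ranging over all in-row/in-column offsets:

```python
import numpy as np

def direction_counts(mhi):
    """Normalized direction counters from a reduced (e.g. 24 x 32) MHI.

    For every pixel pair in the same row (resp. column) separated by a
    gap k, a brighter-left pair votes 'left' and a brighter-right pair
    votes 'right' (likewise up/down, brighter = more recent motion).
    The four counters are normalized by their sum as in Step 4."""
    mhi = np.asarray(mhi, dtype=np.int16)
    h, w = mhi.shape
    counts = {"up": 0, "down": 0, "left": 0, "right": 0}
    for i in range(h):
        for j in range(w):
            for k in range(1, w - j):          # horizontal pairs
                if mhi[i, j] > mhi[i, j + k]:
                    counts["left"] += 1        # left element is lighter
                elif mhi[i, j] < mhi[i, j + k]:
                    counts["right"] += 1       # right element is lighter
            for k in range(1, h - i):          # vertical pairs
                if mhi[i, j] > mhi[i + k, j]:
                    counts["up"] += 1          # upper element is lighter
                elif mhi[i, j] < mhi[i + k, j]:
                    counts["down"] += 1        # lower element is lighter
    total = sum(counts.values()) or 1
    return {d: c / total for d, c in counts.items()}
```

The resulting 4-dimensional vector is what Step 5 feeds into the trained SVM.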
Comparing our algorithm with the traditional Haar-like-
pattern-based Adaboost cascade classifier, there are three
advantages. First, the classifiers for different Haar
patterns are not executed sequentially but calculated
together. Second, the integral image is not necessary;
instead, we adopt the reduced image for feature calcu-
lation to avoid the multi-scale problem. Finally, SVM,
which can classify high-dimensional samples using a
nonlinear separating function, is adopted to recognize the
dynamic hand gestures, although training is required
beforehand. These advantages remove the nondeterministic
factors from the developed system, and the time complexity of the
proposed algorithm is simply O(w × h), where w × h is
the MHI size.
4 Experimental results
The proposed hand gesture recognition system was tested
to demonstrate its feasibility. The platform is Microsoft
Windows XP on a PC with an AMD 5200+ processor and
2 GB of main memory. For portability, a Logitech portable
webcam C905 is deployed to grab images. The software
development environment is Visual C++ 6.0 with the image
processing library OpenCV 1.1 installed. Our goal is to
design a real-time robust man–machine interface using
hand gesture recognition.
Figure 13 shows the system–user interface where user
could start camera and set the digital zooming scale
according to the distance between user and the camera. The
purpose of digital zoom is to adjust the size of user ROI1
for processing. Alternatively, the user could press the "Auto"
button to automatically zoom into the user's ROI1 using the
position and size of the detected user face as in (1). ROI
settings for face, static hand gesture, and dynamic hand
gesture detection also play an important role for real-time
processing.
After setting the digital zoom, adaptive skin color
detection and hand gesture recognition could be activated
by the check boxes. The execution speed is also displayed
in the right text box. The result of hand gesture recognition
is indicated by the corresponding graphic icon. The lower
right text box is also used to display the text description of
hand gesture recognition for verification and logging.

Fig. 13 User interface of the developed hand gesture recognition system

Fig. 14 The counter values for the four continuous directional dynamic hand gestures. Each gesture was operated a once at 0.6 m and b three times at 2 m
4.1 Hand gesture recognition
Figure 14a shows the recognition results of a video con-
sisting of 241 frames containing four different continuous
dynamic hand gestures, from up, down, left, to
right. These dynamic hand gestures could be classified by
the index of the counter with the maximum normalized
value. Figure 14b shows the recognition results of another
video, consisting of 541 frames, in which the continuous up/
down/left/right hand gestures were repeated three times
at 2 m. This experiment also proved that the maximum
direction counter still works even when operating at a
long distance.
Five users were invited to perform the dynamic hand gestures
50 times per type at three different distances (<1 m,
1–1.5 m, and 1.5–2 m) and the static hand gestures 25 times
per type. These tests were recorded and labeled with the
type of hand gesture, user name, and distance. The
length of the test videos ranges from 75 to 125 s, as the oper-
ating speed depended on the user. Each individual
hand gesture was separated by a short period
without motion. Users could practice for 1 min before testing
to prevent wrong operation. The results are
shown in Table 2, in which the recognition rates were
93.13 % for dynamic hand gestures and 95.07 % for static
hand gestures on average.
However, the recognition accuracy for the moving-up hand
gesture was not good. Therefore, half of the dynamic hand
gesture videos were used to train the SVM and the other
half were used for testing. The recognition accuracy
for moving upward improved to 93.8 %, while the
overall accuracy reached 95.66 %. This demonstrates that
the recognition problem is more appropriately solved
by the SVM than by the simple maximum principle.
4.2 Processing time analysis
The objective of this paper is to develop a real-time, convenient hand gesture recognition system as a man–machine interface. Owing to the advantages stated in the previous section, the system works quite efficiently. Table 3 gives the processing time for each system component; the overall processing time is 3.93 ms per frame. That is, our system can process more than 250 frames per second. This is especially useful for embedded systems, which should spend few computations on the user interface and reserve most for the main user tasks such as media playing or Internet surfing.
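The frame-rate claim follows directly from the per-frame time reported in Table 3:

```python
frame_time_ms = 3.93              # total per-frame processing time (Table 3)
fps = 1000.0 / frame_time_ms      # frames processable per second
# 1000 / 3.93 is roughly 254, i.e. more than 250 frames per second
```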
The developed man–machine interface was integrated with the well-known album browser Cooliris3 as shown in Fig. 15a. Users can browse with their own hands as shown in Fig. 15b. The six hand gestures were bound to six commands, matching the commonly used buttons of a remote controller in Fig. 15c. The dynamic hand gestures map to moving up, down, left, and right, while the waving hand represents ''wake-up'' (system) and the fist hand represents ''enter'' (menu).
Table 4 gives a functional comparison with several surveyed works. Most methods, except [15], impose restrictions on the operating environment because of the complex backgrounds found in the real world. The exception is possible because [15] uses a time-of-flight (ToF) camera and other IR-based cameras to register depth and filter out background objects; however, the higher hardware cost and the additional computations for calibration and data registration are inevitable. Here, with the designed novel Haar-like features and the recognition algorithm operating on the motion history image, we successfully remove background objects and achieve quite high execution speed. This is very useful for embedded systems, which have limited computation capability.

Table 2 Average accuracy for the defined hand gestures

  Hand gesture              Times        Accuracy (%)   SVM times     SVM accuracy (%)
  Dynamic  Moving up        669/750      89.20          352/375       93.8
  Dynamic  Moving down      689/750      91.87          356/375       94.9
  Dynamic  Moving left      720/750      96.00          365/375       97.3
  Dynamic  Moving right     716/750      95.47          362/375       96.5
  Dynamic  Average          2,794/3,000  93.13          1,435/1,500   95.66
  Static   Fist hand        352/375      93.87
  Static   Waving hand      361/375      96.26
  Static   Average          713/750      95.07

Table 3 Average processing time for each system component

  System component                           Time (ms)
  Face-based adaptive skin color detection   3.45
  Motion detection                           0.08
  Direction counter                          0.28
  Support vector machine                     0.12
  Total                                      3.93

3 http://www.cooliris.com/.
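The average rates in Table 2 follow directly from the per-gesture counts; a quick arithmetic check:

```python
# Recompute the Table 2 averages from the raw correct/total counts.
dynamic = {"up": (669, 750), "down": (689, 750),
           "left": (720, 750), "right": (716, 750)}
dynamic_correct = sum(c for c, _ in dynamic.values())   # 2794
dynamic_total = sum(n for _, n in dynamic.values())     # 3000
dynamic_accuracy = 100.0 * dynamic_correct / dynamic_total

static = {"fist": (352, 375), "waving": (361, 375)}
static_correct = sum(c for c, _ in static.values())     # 713
static_total = sum(n for _, n in static.values())       # 750
static_accuracy = 100.0 * static_correct / static_total
```

Both results round to the reported 93.13 % (dynamic) and 95.07 % (static).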
5 Conclusions and future works
In this paper, a real-time hand gesture recognition system was developed that tolerates complex backgrounds by using the MHI and ROI settings. Real-time computation is achieved by the designed novel Haar-like features and the corresponding simple and efficient SVM classification method. Two static hand gestures, the fist hand and the waving hand, and four dynamic hand gestures, representing the hand moving up, down, left, and right, are defined. These hand gestures are natural and simple to use. For static hand gesture recognition, the hand regions are first extracted by the face-based adaptive skin color model in the ROI defined for the static hand, under a variety of environmental changes; the respective conditions for the static hand gestures are then checked for classification. Five persons were invited to test the developed system. Experimental results show that the accuracy is 95.37 % on average, which demonstrates the feasibility of the proposed system.
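The MHI underlying the system can be summarized as a per-pixel timestamp update: pixels with fresh motion are set to a maximum duration, and all others fade. The minimal pure-Python sketch below shows one common decay variant; the parameterization is our assumption, not necessarily the paper's exact scheme:

```python
def update_mhi(mhi, motion_mask, tau=15):
    """One motion-history-image step: pixels where motion was detected
    are set to the maximum duration tau; all other pixels decay by 1
    toward 0, so recent motion stays bright and old motion fades."""
    return [[tau if moved else max(h - 1, 0)
             for h, moved in zip(h_row, m_row)]
            for h_row, m_row in zip(mhi, motion_mask)]
```

Starting from an all-zero history and feeding binary motion masks frame by frame yields a grayscale "trail" whose intensity gradient encodes the direction of motion, which is what the directional Haar-like features respond to.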
Computer vision concerns the theories for developing intelligent systems that extract information from images. The images can take many forms, such as video sequences, views from multiple cameras, or multi-dimensional slices from specific scanners. Though computer vision has been developed for a long time, commercial applications are still few. At the present stage, a considerable number of research institutes and companies are active in this field. One typical example is Microsoft's Xbox 360 equipped with the Kinect sensor, which allows users to operate the system with bare hands. Though the Kinect provides depth information, user gestures are still easily occluded by the user himself or by other persons. Therefore, how to integrate all the data captured from multiple cameras to recover more motion information accurately remains a necessary step toward intelligent man–machine interfaces in the future.

Fig. 15 a Screenshot of Cooliris. b Cooliris operated by natural hand gesture recognition. c A commonly designed remote control with a central ''enter'' button surrounded by four directional buttons

Table 4 Comparisons with several surveyed hand gesture recognition systems

  Reference | Methodology | Vocabulary set | Device | Operating environment | Accuracy (%) | Speed (ms)
  Licsar and Sziranyi [22] | User-adaptive recognition with interactive training | Palm with varying fingers | Projector and camera | Only hand allowed to appear | Over 98 | NA
  Kim et al. [14] | Active shape model for gait recognition | Human gaits | Infra-red camera | Works across illumination changes | Over 90 | NA
  Kao et al. [17] | Face and hand gesture recognition by PCA | Palm with varying fingers | Color camera | Elevator | Over 94 | NA
  Xie [1] | Fuzzy neural network for mode classification | Four hand postures | Stereo vision | Hand profile in meeting room | NA | Real time
  Stergiopoulou and Papamarkos [21] | SGO neural gas network | Number of raised fingers | RGB camera | Hands in simple background | 90.45 | NA
  Van den Bergh and Van Gool [15] | Classifier using traditional 2D Haarlets | Six hand postures with varying fingers | ToF (depth) plus RGB | Other persons allowed in background | 99.54 | 33.4
  Qing et al. [33] | Traditional 2D Haarlets classifier with SCFG | Four hand postures with varying fingers | RGB camera | Hands in simple background | 95.65 | 3.04
  Ciprian et al. [34] | Dynamic hand gestures using tensor voting filter | Three/four static/dynamic hand gestures | RGB camera | Hands in simple background | Simulation only | Real time
  Ours | Novel Haar features and SVM classifier | Six easily operated gestures | RGB camera | Hands allowed to overlap skin-like objects | 95.37 | 3.93
References
1. Xie, W., Teoh, E.K., Venkateswarlu, R., Chen, X.: Hand as natural man–machine interface in smart environments. In: Proceedings of the 24th IASTED International Conference on Signal Processing, Pattern Recognition, and Applications, pp. 117–122 (2006)
2. Lee, M., Woo, W.: ARKB: 3D vision-based augmented reality keyboard. In: Proceedings of the International Conference on Artificial Reality and Telexistence, pp. 54–57 (2003)
3. Chen, J.Y., Haas, E., Barnes, M.: Human performance issues and user interface design for teleoperated robots. IEEE Trans. Syst. Man Cybern. Part C 37, 1231–1245 (2007)
4. Hu, C., Meng, M.Q., Liu, P.X., Wang, X.: Visual gesture recognition for human–machine interface of robot teleoperation. In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Las Vegas, Nevada, pp. 1560–1565 (2003)
5. Bretzner, L., Laptev, I., Lindeberg, T., Lenman, S., Sundblad, Y.: A prototype system for computer vision based human computer interaction. Technical report CVAP251, ISRN KTH NA/P–01/09—SE, Department of Numerical Analysis and Computer Science, KTH, Royal Institute of Technology, Stockholm, Sweden (2001). ftp://ftp.nada.kth.se/CVAP/reports/cvap251.pdf
6. Chen, C.Y., Fan, Y.P., Chou, H.L.: Hand gesture commands for slide view control in a PC based presentation. In: Proceedings of the International Computer Symposium, Workshop on Computer Graphics and Virtual Reality, Tainan, Taiwan (1998)
7. Argyros, A.A., Lourakis, M.I.A.: Vision-based interpretation of hand gestures for remote control of a computer mouse. In: Proceedings of the HCI Workshop, LNCS 3979, Graz, Austria, pp. 40–51 (2006)
8. Zhou, H., Xie, L., Fang, X.: Visual mouse: SIFT detection and PCA recognition. In: Proceedings of the International Conference on Computational Intelligence and Security Workshops, pp. 263–266 (2007)
9. Hsieh, C.C., Liou, D.H.: A robust hand gesture recognition system using Haar-like features. In: Proceedings of the 2nd International Conference on Signal Processing and Systems, Dalian, China, pp. V2-394–V2-398 (2010)
10. Rojas, R.: The backpropagation algorithm. In: Neural Networks—A Systematic Introduction, chap. 7. Springer, Berlin (1996)
11. Du, W., Li, H.: Vision based gesture recognition system with single camera. In: Proceedings of ICSP, pp. 1351–1357 (2000)
12. Segen, J., Kumar, S.: Human–computer interaction using gesture recognition and 3D hand tracking. In: Proceedings of ICIP, Chicago, pp. 188–192 (1998)
13. Liu, Y., Jiam, Y.: A robust hand tracking and gesture recognition method for wearable visual interfaces and its applications. In: Proceedings of the Third International Conference on Image and Graphics, pp. 472–475 (2004)
14. Kim, D., Lee, S., Paik, J.: Active shape model-based gait recognition using infrared images. Int. J. Signal Process. Image Process. Pattern Recogn. 2, 1–13 (2009)
15. Bergh, M.V.D., Gool, L.V.: Combining RGB and ToF cameras for real-time 3D hand gesture interaction. In: Proceedings of the IEEE Workshop on Applications of Computer Vision, pp. 66–72 (2011)
16. Mitra, S., Acharya, T.: Gesture recognition: a survey. IEEE Trans. Syst. Man Cybern. Part C 37, 311–324 (2007)
17. Kao, Y.W., Gu, H.Z., Yuan, S.M.: Integration of face and hand gesture recognition. In: Proceedings of the Third International Conference on Convergence and Hybrid Information Technology, vol. 1, pp. 330–335 (2008)
18. Ramamoorthy, A., Vaswani, N., Chaudhury, S., Banerjee, S.: Recognition of dynamic hand gestures. Pattern Recogn. 36, 2069–2081 (2003)
19. Kwok, C., Fox, D., Meila, M.: Adaptive real-time particle filter for robot localization. In: Proceedings of Robotics and Automation, vol. 2, pp. 2836–2841 (2003)
20. Yeasin, M., Chaudhuri, S.: Visual understanding of dynamic hand gestures. Pattern Recogn. 33, 1805–1817 (2000)
21. Stergiopoulou, E., Papamarkos, N.: Hand gesture recognition using a neural network shape fitting technique. Eng. Appl. Artif. Intell. 22, 1141–1158 (2009)
22. Licsar, A., Sziranyi, T.: User-adaptive hand gesture recognition system with interactive training. Image Vis. Comput. 23, 1102–1114 (2005)
23. Wu, Y.M.: The implementation of gesture recognition for media player system. Master thesis, Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan (2009)
24. Lai, I.H.: The following robot with searching and obstacle-avoiding. Master thesis, Department of Electrical Engineering, National Central University, Chung-Li, Taiwan (2009)
25. Tu, Y.J.: Human computer interaction using face and gesture recognition. Master thesis, Department of Electrical Engineering, National Chung Cheng University, Taiwan (2007)
26. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. I-511–I-518 (2001)
27. Lienhart, R., Maydt, J.: An extended set of Haar-like features for rapid object detection. In: Proceedings of the International Conference on Image Processing, vol. 1, pp. I-900–I-903 (2002)
28. Vezhnevets, V., Sazonov, V., Andreeva, A.: A survey on pixel-based skin color detection techniques. Graphics and Media Laboratory, Faculty of Computational Mathematics and Cybernetics, Moscow State University, Russia (2003)
29. Wimmer, M., Radig, B.: Adaptive skin color classificator. In: Proceedings of the First ICGST International Conference on Graphics, Vision and Image Processing (GVIP), vol. I, Cairo, Egypt, pp. 324–327 (2005)
30. Liou, D.H.: A real-time hand gesture recognition system by adaptive skin-color detection and motion history image. Master thesis, Department of Computer Science and Engineering, Tatung University, Taipei, Taiwan (2009)
31. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Stat. Comput. 14, 199–222 (2004)
32. Boser, B.E., Guyon, I.M., Vapnik, V.N.: Support vector machines are universally consistent. J. Complex. pp. 768–791 (2002)
33. Qing, C., Georganas, N.D., Petriu, E.M.: Hand gesture recognition using Haar-like features and a stochastic context-free grammar. IEEE Trans. Instrum. Meas. 57, 1562–1571 (2008)
34. Ciprian, D., Vasile, G., Pekka, N., Veijo, K.: Dynamic hand gesture recognition for human–computer interactions. In: Proceedings of the 6th IEEE International Symposium on Applied Computational Intelligence and Informatics, May 19–21, Romania, pp. 165–170 (2011)
Author Biographies
Chen-Chiung Hsieh received his B.S., M.S., and Ph.D. degrees from the Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, Taiwan, in 1986, 1988, and 1992, respectively. From Dec. 1992 to Jan. 2004, he was with the Institute for Information Industry (III) as a vice director. From Dec. 2004 to Jan. 2006, he was with Acer Inc. as a senior director. He is presently an associate professor in the Department of Computer Science and Engineering at Tatung University, Taipei, Taiwan. His research mainly focuses on image and multimedia processing.
Dung-Hua Liou received his B.S. and M.S. degrees from the Department of Computer Science and Engineering, Tatung University, in 2007 and 2009, respectively. His research interests include image processing and video surveillance.