SPECIAL ISSUE
Novel Haar features for real-time hand gesture recognitionusing SVM
Chen-Chiung Hsieh • Dung-Hua Liou
Received: 24 May 2012 / Accepted: 24 October 2012
© Springer-Verlag Berlin Heidelberg 2012
Abstract Due to the effect of lighting and complex
background, most visual hand gesture recognition systems
work only under restricted environments. Here, we propose
a robust system which consists of three modules: digital
zoom, adaptive skin detection, and hand gesture recogni-
tion. The first module detects the user's face and zooms in so
that the face and upper torso take the central part of the
image. The second module utilizes the detected user facial
color information to detect the other skin color regions like
hands. The last module is the most important part for doing
both static and dynamic hand gesture recognition. The
region of interest next to the detected user face is for fist/
waving hand gesture recognition. To classify the dynamic
hand gestures under complex background, motion history
image and four groups of novel Haar-like features are
investigated to classify the dynamic up, down, left, and
right hand gestures. A simple efficient algorithm using
Support Vector Machine is developed. The defined hand
gestures are intuitive and easy for users to control most
home appliances. Five users performing 50 dynamic hand ges-
tures at near, medium, and far distances, respectively, were
tested under complex environments. Experimental results
showed that the accuracy was 95.37 % on average and the
processing speed was 3.93 ms per frame. An application
integrated with the developed hand gesture recognition was
also given to demonstrate the feasibility of proposed
system.
Keywords Face detection · Gesture recognition · Man–machine interface · Pattern analysis · Support Vector Machine
1 Introduction
Emerging computer vision-based applications include video
surveillance, human–computer interaction (HCI) [1], vehi-
cle driver assistance, and machine inspection. Among these
applications, hand gesture recognition is being developed
vigorously as an interface for HCI. Hand gesture recognition
as a controller [2] to manipulate devices is the most intuitive
form of man–machine interface. The advantage of these
recognition systems is that users can control devices
without touching a panel, keyboard, or mouse. Users
just need to face the camera and raise their hands for
operation control. Hand gesture recognition systems give
people a high degree of freedom and an intuitive feel. Robot
control [3, 4], television remote control [5], presentation
slide control [6], and visual mouse [7, 8] are common HCI
applications. However, computer vision-based gesture rec-
ognition systems are not very popular in our daily life due to
two major problems. First, it is very difficult to detect skin
color under a variety of lightings and environments. Sec-
ondly, people may need heavy training before using com-
plex hand gestures.
We defined two static and four dynamic hand gestures
which are natural and easy to use. The proposed real-time
hand gesture recognition system consists of three major
parts: digital zoom, adaptive skin detection, and hand
gesture recognition. The first module detects user face and
then applies trivial trimming and bilinear interpolation for
zooming in so the face and upper torso take the central part
of the image.

C.-C. Hsieh (✉) · D.-H. Liou
Department of Computer Science and Engineering,
Tatung University, No. 40, Sec. 3, Jhongshan N. Rd,
Taipei 104, Taiwan, R.O.C.
e-mail: [email protected]

J Real-Time Image Proc
DOI 10.1007/s11554-012-0295-0

The second module is based on an adaptive
skin color model for hand region extraction. By adaptive
skin color model, the effects from lighting, environment,
and camera can be greatly reduced, and the robustness of
static hand gesture recognition could be improved. The last
part is the most important part which does both static and
dynamic hand gestures recognition.
To speed up the recognition of static hand gestures, a
region of interest (ROI) beside the detected user face is
defined. The fist hand is verified by a three-gray-level Haar-like
feature and waving hand is detected by checking the
amount of motion within the specified ROI. As for dynamic
hand gestures, we observe the motion history image (MHI)
for each dynamic directional hand gesture and design four
groups of Haar-like patterns. These Haar-like patterns are
introduced for the first time to count the number of black-
white patterns as statistical features for classification. A
real-time efficient algorithm using Support Vector Machine
(SVM) based on the statistical features is then developed to
distinguish these dynamic hand gestures.
An initial version of this paper appeared in [9]; in this
paper, the heuristic dynamic hand gesture recognition
method is redesigned and several thresholds are removed.
For comparison, a back-propagation neural network
[10] for dynamic hand gesture recognition was also
implemented using the same MHI. In addition, more
experiments were conducted for system performance
analysis. In the following section, we briefly review
related research in hand gesture recognition. In Sect. 3,
we present the details of the ROI- and adaptive-skin-color-based
static hand gesture recognition and the MHI-based
dynamic hand gesture recognition using SVM.
In Sect. 4, experimental results are given to demonstrate
the system performance. In the last section, we conclude
and give some directions for future research.
2 Related works
Various computer vision-based HCI systems have been
developed using cameras with a single lens [11], multiple lenses
[12], depth perception [13], infra-red sensing [14], or
combined lenses [15]. Different lenses give different informa-
tion. The more the information utilized, the higher the
recognition accuracy would be. However, more cameras
may require special installations and cost much for the
extra information processing. For example, the
Xbox 360 is equipped with a Kinect sensor
(http://msdn.microsoft.com/zh-tw/hh367958.aspx) consisting of
an infra-red emitter and an infra-red and RGB camera. Though
Kinect sensor can recover the 3D depth map, it costs more
and needs more computations. According to the survey
given in [16], there are a variety of methodologies used for
human gesture recognition, ranging from principal compo-
nent analysis [17], hidden Markov models [18], particle
filtering [19], and finite state machines [20], to neural net-
works [21]. In the following, we will brief some researches
in hand gesture recognition for device control and discuss
their feasibilities.
Licsar and Sziranyi [22] proposed a vision-based hand
gesture recognition system with interactive training, aimed
to achieve a user-independent application by online
supervised training. Their main goal is that any non-trainer
user should be able to use the system instantly. If the
recognition accuracy decreases, only the faulty detected
gestures are retrained to realize fast adaptation. They
implemented a hand gesture recognition system for a
camera–projector environment in which users can directly
interact with the projected image by hand gestures, real-
izing an augmented reality tool in a multi-user environ-
ment. However, the color of hand gestures would be
affected by the color of projected content.
Wu [23] developed a hand gesture recognition system to
control the media player with a common web camera. The
shapes of arms represent different commands for the con-
trol of media player. The system firstly separated the left-
arm by background subtraction and detected the straight
line by both the Hough transform and the Radon transform. The
disadvantage of this method was that the defined hand
gestures were not intuitive. For example, a straight arm was to play
the music and a bent arm was to stop the music. Furthermore,
it was inconvenient for user to adjust the operating distance
before usage.
Lai [24] designed and implemented an interactive biped
robot which could be controlled by hand gestures. The
number of fingers and the angles between fingers were used to
classify nine types of static hand gestures. Two dynamic
hand gestures were defined by the direction of forefinger
and the size of palm. To overcome the effect of lighting,
they utilized scroll bars to manually set the scope of skin
color in YCbCr space. Fixed background model and image
subtraction were used to segment the moving arm for hand
gesture recognition.
Tu [25] presented a face-based hand gesture recognition
system by a single camera for HCI application. Face was
firstly detected by defining specific scopes for skin color in
normalized RGB. Hand region was assumed to appear by
the side of face. Eleven static hand gestures as shown in
Fig. 1 were defined to control the computer. Back-propa-
gation neural network was utilized for hand gesture
recognition. However, these hand gestures are easily confused
due to similar shape. Still, lighting and environment may also
cause problems for skin color detection.
In summary, most previous methods adopting non-adaptive
skin color models may cause problems if the environment
is complex. As for the vocabulary set of hand
gestures, most works were based on the fist hand varied
with different numbers of fingers, as shown in Fig. 1. Users
may need some time to memorize these gestures and
could confuse the similar ones. In addition,
some systems require users to sit in front of the
camera within a specified distance. Here, we try to relax
these limitations. Firstly, we propose a face-based adaptive
skin color model for static hand region segmentation. Sec-
ondly, the adopted static and dynamic hand gestures are
simple and intuitive. Our defined hand gestures are quite
different from one another and easy to memorize. The
intuitiveness comes from the idea that we bind the up/down/
left/right dynamic hand gesture with the up/down/left/right
directional key of a remote control. Last but not least are
the innovative recognition kernels for both static and
dynamic hand gestures.
To speed up the recognition of static hand gestures, an
ROI beside the detected user face is defined for checking. Fist
hand is verified by a three gray-level Haar-like feature and
waving hand is detected by checking the amount of motion
within that specified ROI. As for dynamic hand gestures,
four groups of Haar-like patterns are designed and a simple
but efficient algorithm is developed to classify these
directional gestures in MHI representations. In contrast to
the Haar features [26, 27] designed for face detection using
Adaboost cascade classifier, our designed Haar features are
different and the corresponding algorithm using SVM
executes even faster.
3 System architecture
There are dynamic and static hand gestures, as shown in
Fig. 2. The direction of the moving hand is used to classify
the four dynamic hand gestures in Fig. 2a–d, while the
presence or absence of motion beside the face is used to classify the two
static hand gestures in Fig. 2e, f. Figure 3 shows the flow
chart of the proposed system which is divided into three
major parts: digital zoom, adaptive skin color detection,
and hand gesture recognition. Each part is described in
the following subsections. Note that face detection, pro-
posed by Viola and Jones [26] and extended by Lienhart
and Maydt [27], is adopted as one of the key components.
The characteristic of their method is the use of black-white
Haar-like patterns to find the eyes on a face, which is
independent of people's skin color. However, false
alarms can occur at eye-like patterns. In this paper, a
false alarm is filtered out if the number of skin
color pixels within the detected face region is less than a
given threshold.
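The false-alarm filter just described can be sketched as follows; the candidate boxes would come from OpenCV's cascade detector (cv2.CascadeClassifier.detectMultiScale), and the minimum skin ratio of 0.4 is an assumed value, since the paper does not state its threshold:

```python
import numpy as np

def filter_face_candidates(candidates, skin_mask, min_skin_ratio=0.4):
    """Keep only face boxes whose interior is mostly skin-colored.

    candidates: iterable of (x, y, w, h) boxes, e.g. from
    cv2.CascadeClassifier.detectMultiScale on the gray frame.
    skin_mask: binary uint8 mask of the frame (nonzero = skin pixel).
    min_skin_ratio is an assumed threshold, not the paper's value.
    """
    faces = []
    for (x, y, w, h) in candidates:
        region = skin_mask[y:y + h, x:x + w]
        # Reject eye-like false alarms: too few skin pixels inside the box.
        if region.size and np.count_nonzero(region) / region.size >= min_skin_ratio:
            faces.append((x, y, w, h))
    return faces
```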
3.1 Digital zoom
It is necessary to magnify the image area around the user for
hand gesture recognition if the user is distant from the camera;
thus, users do not need to adjust their positions.
This step also normalizes the image size to 320 × 240 pixels
because the initially set image resolution may be different. If
the detected face is smaller than the standard face size,
users could either adjust the face size manually using the
camera's optical zoom capability, or let the system enlarge
the user area automatically by bilinear zooming as in
(1), based on the detected face in this stage.
ROI1(x, y, w, h) = (facemax.x − 5·facemax.r, facemax.y − 5·facemax.r, 10·facemax.r, 10·facemax.r),   (1)

where ROI1 represents the zoomed image, (x, y) is the
coordinate of its top-left corner, and (w, h) is
its width and height. facemax is the largest detected face;
being the nearest to the camera, it owns the control of the
system. (facemax.x, facemax.y) is the center of the circle enclosing
the detected maximum face with radius facemax.r.
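Eq. (1) can be sketched directly in code; the clipping to the image border is our addition, since the paper does not say how out-of-frame windows are handled. The cropped window would then be resized to 320 × 240 by bilinear interpolation (e.g. cv2.resize):

```python
def roi1_from_face(face_x, face_y, face_r, img_w, img_h):
    """Compute the zoom window ROI1 of Eq. (1): a 10r x 10r square
    centered on the largest detected face, clipped to the image.

    (face_x, face_y) is the face circle center, face_r its radius.
    Returns (x, y, w, h) of the window to crop and bilinearly resize.
    """
    x = max(0, face_x - 5 * face_r)
    y = max(0, face_y - 5 * face_r)
    w = min(10 * face_r, img_w - x)   # clip so the window stays in frame
    h = min(10 * face_r, img_h - y)
    return x, y, w, h
```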
Fig. 1 Face-based hand gesture recognition for HCI applications [25]. a Face detection and the operation ROI shown beside the face. b Defined hand gestures
The ideal operating distance is about 60 cm in our hand
gesture recognition system as shown in Fig. 4a. If user is
far away, the detected face would appear smaller as shown
in Fig. 4b. By digital zoom, the user area could be enlarged
as ROI1 as in Fig. 4c and the hand gesture recognition
results would not be corrupted. This is because the black-
white properties of the Haar-like features would not change
even if the images are magnified.
Fig. 2 Defined dynamic hand gestures (a–d) and static hand gestures (e, f)

Fig. 3 The system flowchart consists of a digital zoom, b adaptive skin detection, and c hand gesture recognition

Assuming the user does not move dramatically, the
location of the face in the next frame is confined to
another region of interest, ROI2, centered at the current
location of the detected maximum face; the size of ROI2 is set to
1.5 × 1.5 times that of the detected face. Thus, we could reduce
the time needed by Adaboost cascade classifier for face
detection. Table 1 summarizes the processing times with or
without the ROI2 setting. If no face is detected, then it is
necessary to search for the face in the whole image. The operating
distance is subject to the resolution limits of the webcam. If the
user operates at 3 m away or more, the captured face
would be too small to be used for face detection.
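The ROI2 search window can be sketched as below; the fallback to the whole frame follows the text, while the border clipping is an assumption:

```python
def roi2_from_face(face_x, face_y, face_w, face_h, img_w, img_h, scale=1.5):
    """Search window for the next frame's face: a box `scale` times the
    current face box, centered on it, clipped to the image.

    Falls back to the whole frame when no face was detected
    (signaled here by a non-positive face size)."""
    if face_w <= 0 or face_h <= 0:          # no previous face: full search
        return 0, 0, img_w, img_h
    w, h = int(face_w * scale), int(face_h * scale)
    x = max(0, face_x + face_w // 2 - w // 2)
    y = max(0, face_y + face_h // 2 - h // 2)
    return x, y, min(w, img_w - x), min(h, img_h - y)
```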
3.2 Adaptive skin color detection
Because the scope of general skin color [28] covers many
skin-like colors, false positive or false negative rates are
sometimes unacceptable. Hence, if we could construct an
adaptive skin color model, the misclassification rate would
be greatly reduced. By exploiting skin color information
from individual’s face, we could create the skin color
model for each person and then improve system robustness
because of the reduced amount of color variations between
a person’s face and hands [29].
The face-based adaptive skin color model proposed by
Liou [30] is adopted here. Skin region of detected face
could be obtained by eliminating eyes, nostrils, and mouth
regions through gray-level histogram analysis. The color distri-
butions in normalized red, normalized green, and original red
are assumed to be Gaussian, so their means
and SDs are calculated to build the adaptive skin color
model. Afterward, we can use that skin color model to
detect the other skin color regions of that person. Experimental
results show that our system could detect skin
pixels correctly even under extremely bad lighting conditions
in which the face colors are distorted to abnormal skin chromaticity.
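A minimal sketch of the adopted face-based model follows, assuming a simple per-feature k-sigma test as the classification rule (the paper does not state how the Gaussians are thresholded); the tolerance k = 2.5 is a placeholder:

```python
import numpy as np

def fit_skin_model(face_pixels_rgb):
    """Fit per-channel Gaussians to the face's skin pixels.

    Features follow the paper: normalized r, normalized g, and original R.
    face_pixels_rgb: (N, 3) array of RGB skin pixels from the face
    (eyes/nostrils/mouth already removed). Returns (means, SDs), each (3,)."""
    rgb = np.asarray(face_pixels_rgb, dtype=np.float64)
    s = rgb.sum(axis=1) + 1e-9
    feats = np.column_stack([rgb[:, 0] / s,    # normalized r
                             rgb[:, 1] / s,    # normalized g
                             rgb[:, 0]])       # original R
    return feats.mean(axis=0), feats.std(axis=0)

def is_skin(pixel_rgb, mean, std, k=2.5):
    """Classify a pixel as skin if every feature lies within k SDs.
    k is an assumed tolerance, not a value given in the paper."""
    r, g, b = (float(c) for c in pixel_rgb)
    s = r + g + b + 1e-9
    feats = np.array([r / s, g / s, r])
    return bool(np.all(np.abs(feats - mean) <= k * std + 1e-9))
```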
3.3 Static hand gesture recognition
Hand regions are detected based on the previous adaptive
skin color model. It is assumed that the user performs the fist
and waving hand gestures at the same depth as the
face, within the area indicated by the red rect-
angle ROI3 in Fig. 5a. ROI3 for static hand gesture
detection is just to the right of the detected user
face. ROI3 is set based on the habit of right-handed
users and could be moved to the left side of the face, next
to the left ear, for left-handed users. ROI3 could be adjusted by
displaying it on a monitor. The position and size of
detected maximum face is used to automatically specify
ROI3 as in (2).
ROI3(x, y, width, height) = (facemax.x − 4.2·facemax.r, facemax.y − 1.5·facemax.r, 3·facemax.r, 3·facemax.r)   (2)
The size of the specified ROI3 is defined as 1.5 times the size
of the detected face because a palm usually covers half of the
user's own face.
3.3.1 Fist hand
The ROI for fist hand detection is further divided into Rin
and Rout, as shown in Fig. 5b. User hand gestures are per-
formed at the same depth as the user's face. The detected skin
region would be as shown in Fig. 5c if the user makes a fist.
Hence, the fist hand gesture can be recognized by
checking these two areas as in (3), where Threshold1,
bounding the skin portion in Rout, and Threshold2,
bounding the skin portion in Rin, are usually set to 0.3
and 0.8 to prevent misclassifying a very big (close)
or a very small (far) fist hand, respectively.
(SkinPix ∈ Rout)/Rout ≤ Threshold1 and (SkinPix ∈ Rin)/Rin ≥ Threshold2.   (3)
Figure 5c gives a perfect example. However, the user's
palm may not occupy the center part all the time; the
system can tolerate this situation in the real world by (2).
Still, some false palms may be detected, and in
order to filter out wrongly detected fist hand gestures,
Fig. 4 a Operating at 0.6 m from the camera. b Operating at 3 m from the camera. c Zoomed image of b

Table 1 Processing time of face detection with and without the ROI2 setting

Frame size   ROI2 setting   Operating distance (m)   Processing time (ms)
320 × 240    Yes            <1.5                     5–8
             No             <1.5                     45–55
640 × 480    Yes            <3                       5–8
             No             <3                       200–230
a simple Haar-like feature, as shown in Fig. 6a, is used
for verification, as shown in Fig. 6b. The color space of
ROI3 is first transformed from RGB to gray level and the
histogram is equalized, as shown in Fig. 6c, for verification.
The OpenCV HaarTraining utility
(http://note.sonots.com/SciSoftware/haartraining.html) is used to
train the classifier for fist hand detection. Since ROI3 is
small, the verification process takes little time.
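The Rin/Rout test of Eq. (3) can be sketched as follows; the inner-region fraction is an assumption about the Rin/Rout geometry, which the paper only shows in Fig. 5b:

```python
import numpy as np

def is_fist(skin_mask_roi3, inner_frac=0.5, t1=0.3, t2=0.8):
    """Eq. (3): a fist is present when the inner region Rin is mostly skin
    while the surrounding ring Rout is mostly background.

    skin_mask_roi3: binary mask of ROI3 (nonzero = skin).
    inner_frac sets Rin's side length relative to ROI3 and is an assumption;
    t1/t2 are the paper's Threshold1 = 0.3 and Threshold2 = 0.8."""
    h, w = skin_mask_roi3.shape
    ih, iw = int(h * inner_frac), int(w * inner_frac)
    y0, x0 = (h - ih) // 2, (w - iw) // 2
    inner = skin_mask_roi3[y0:y0 + ih, x0:x0 + iw]
    n_in = np.count_nonzero(inner)
    n_out = np.count_nonzero(skin_mask_roi3) - n_in
    out_area = h * w - ih * iw
    # Rout must be mostly empty, Rin mostly skin.
    return (n_out / out_area <= t1) and (n_in / (ih * iw) >= t2)
```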
3.3.2 Waving hand
The ROI3 for fist hand detection is also used for waving
hand gesture recognition but based on motion detection
and time sequence as shown in Fig. 7. By observing the
waving hand gesture in Fig. 7a, two apparent phenomena
can be observed. First, the motion obtained by subtract-
ing two consecutive frames is very obvious, as
shown in Fig. 7b. Second, the motion lasts for a
period of time, as shown in Fig. 7c. Therefore, these two
conditions are used to verify waving hand gesture. If the
portion of motion region in ROI3 is greater than a given
threshold Threshold3 and lasts for a period of time, the
waving hand gesture could be confirmed. Threshold3 is
set to 70 % to prevent misclassifying small vibra-
tions of the user's hand, and the time period is set to 3 s in
this paper.
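The two waving conditions above (motion portion ≥ 70 % of ROI3, sustained for 3 s) can be sketched like this; the frame rate and the binarization threshold of the frame difference are assumptions:

```python
import numpy as np
from collections import deque

class WavingDetector:
    """Waving = motion fills most of ROI3 (Threshold3 = 70 %) and persists
    for a period (3 s in the paper). fps is an assumed frame rate."""

    def __init__(self, fps=30, t3=0.7, duration_s=3.0):
        self.need = int(fps * duration_s)     # frames that must show motion
        self.t3 = t3
        self.history = deque(maxlen=self.need)

    def update(self, prev_roi3_gray, cur_roi3_gray, diff_thresh=25):
        """Feed consecutive grayscale ROI3 crops; True once waving is confirmed."""
        diff = np.abs(cur_roi3_gray.astype(np.int16) -
                      prev_roi3_gray.astype(np.int16)) > diff_thresh
        self.history.append(diff.mean() >= self.t3)
        return len(self.history) == self.need and all(self.history)
```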
3.4 Dynamic hand gesture recognition
As shown in Fig. 3c, the dynamic and static hand gesture
recognition processes are executed at the same time.
However, ROI1, instead of the small ROI3, is used for
dynamic hand gesture recognition because the overall motion
covers the upper part of the operating user. In addition,
dynamic hand gesture recognition is conducted on motion
history image in which motion information among frames
could be accumulated in a single image. Four groups of
Haar-like features are designed for measuring the quan-
tities of directions. An innovative direction detection of
moving hand using motion history image-based SVM is
then proposed.
3.4.1 Motion history image
The benefit of motion history image is that it could pre-
serve object trajectories in a frame. Continuous motion
information is used to update the motion history image
MHI as in (4), where DF is the binary (0/255) difference frame and a is set
to 15. The values in MHI are clipped to the range 0–255.

MHI(x, y)_t = MHI(x, y)_{t−1} + DF(x, y)_{t−1} − a   (4)
Figure 8 gives an example of an MHI. Figure 8a, b show
two consecutive frames and Fig. 8c is the difference frame,
in which the resulting regions are the motion regions. From
the MHI shown in Fig. 8d, the moving direction of the hand can
be recovered by checking the orientation of the variation from
black to white.
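One update step of Eq. (4) can be sketched as follows, reading DF as a binary 0/255 difference frame so that moving pixels saturate toward white while still pixels decay by a = 15 per frame; the binarization threshold is an assumption:

```python
import numpy as np

def update_mhi(mhi, prev_gray, cur_gray, a=15, diff_thresh=25):
    """One step of Eq. (4): MHI_t = clip(MHI_{t-1} + DF_{t-1} - a, 0, 255).

    Moving pixels (DF = 255) are pushed toward white, while untouched
    pixels decay by `a` per frame, so older motion appears darker."""
    df = (np.abs(cur_gray.astype(np.int16) -
                 prev_gray.astype(np.int16)) > diff_thresh)
    out = mhi.astype(np.int16) + df.astype(np.int16) * 255 - a
    return np.clip(out, 0, 255).astype(np.uint8)
```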
Fig. 5 a Automatically specified ROI3 for static hand gesture recognition; the user could adjust ROI3 to be larger than the fist. b ROI3 divided into Rin and Rout for fist hand detection. c An example of a detected fist hand

Fig. 6 a Defined Haar-like feature for fist hand verification. b ROI3 in RGB. c ROI3 in gray scale after histogram equalization
For real-time performance, images are reduced to 24 rows × 32
columns for the detection of the moving direction. Figure 9
shows examples of the reduced ROI1, difference image, and
motion history image. Although this step blurs the
details, it does not affect the following direction detection
method. There are four categories of MHI, as shown in
Fig. 10. Note that Fig. 10c, d are not misplaced; the captured
images are mirrored.
A neural network was first tried for recognizing the dynamic hand
gestures. It adjusts the weight of each neuron through a
training process so as to identify different patterns. A back-
propagation neural network was developed in our system
for the recognition of the four MHIs. In the tests, 40 sam-
ples were used to train the neural network model and
another 40 patterns were used for testing. We found that
the accuracy rate stuck at about 85 % because the vari-
ations of some dynamic hand gestures were similar. For
example, a hand moving to the upper right and a hand moving
to the upper left would be ambiguous and misclassified. Therefore, we
investigated a novel statistical method to surpass the neural
network approach.
3.4.2 Direction detection by Support Vector Machine
To recognize the four kinds of MHI, four groups of Haar-
like directional patterns for moving up, down, left, or right
as shown in Fig. 11 are designed, respectively. As stated in
the previous section, the MHI is of 24 9 32. To avoid the
boundary condition, there are 22 patterns for upward and
downward, and 30 patterns for rightward and leftward. If
the two gray levels located at the corresponding locations in
the MHI match with a pattern, the counter for that pattern
would be increased by one. Take an example as shown in
Fig. 12 to describe this methodology. The counter Count(right)
of right direction would be increased by one if any of the
patterns in Fig. 11c matches with the motion history image
as shown in Fig. 12b. That is, if the left element is brighter
than the right element, Count(right) would be increased by
one. The white-and-black pattern at the top of
Fig. 11c is the nearest one; it represents the part of the MHI
updated by the last few frames. On the contrary, the
pattern at the bottom represents the part of the
MHI updated by the first few frames.
Fig. 7 a Waving hand gesture. b Result of motion detection. c Threshold of waving hand elapsed time

Fig. 8 a Previous frame. b Current frame. c Difference frame. d The resulting MHI

Fig. 9 Reduced images of a ROI1, b the difference image, and c the resulting MHI
Similarly, the corresponding direction counter is
increased if the black-white condition is met for any pattern
of the other three groups. Note that the patterns are designed for
right-handed users; the mirrored patterns of moving rightward,
as in Fig. 12c, could be used for left-handed users. The patterns in
Fig. 11 do not mean that the user's hand needs to move across the
whole image. The detailed process is described in
the following algorithm, whose central design concept is
to recover the motion track by counting the
number of matched black-white patterns. In addition,
to distinguish waving hand gestures from moving gestures,
two conditions must be met: the moving hand
gesture should not last more than 3 s, and the
directional counter must exceed a given threshold. The
sum of all direction counters, as in Step 4, is used to normalize
the four direction counters. In the original design, the
Fig. 10 Four categories of MHI that correspond to the four dynamic hand gestures: a upward, b downward, c rightward, d leftward

Fig. 11 Patterns for detecting the direction of the moving hand: a up, b down, c right, d left

Fig. 12 a A left-handed user doing a hand gesture of moving right. b The corresponding MHI. c The Haar patterns for detecting rightward movement for a left-handed user
dynamic hand gestures would be classified by the direction
counter with the maximum value in Step 5. However, a hand
moving upward may sometimes be misclassified as a hand
moving rightward, and similar confusions arise from
this simple maximum principle. To gain better rec-
ognition results, the Support Vector Machine, developed
by Smola and Schölkopf [31] based on statistical learning theory,
can be used to solve nonlinear and high-dimensional problems.
Supervised learning is used to inform the machine of the
correct answer to facilitate correction. The purpose of SVM
is to achieve a maximal-margin hyperplane using the least
amount of training data.
In a linearly separable environment, SVM can use a hyperplane
directly for classification. However, most problems arise
in nonlinearly separable environments. Classification
problems like ours should therefore first be handled by
transforming the data with a kernel function. Boser et al. [32]
proposed mapping the original low-dimensional data
into a higher-dimensional feature space through a kernel
function, and then finding a linear hyperplane in that space
to solve the classification problem. Data that cannot be
separated by linear functions can thus be classified by
a hyperplane in a high-dimensional feature space. Equation
(5) gives the classification function on the transformed data:
f(x) = sgn( Σ_{i=1}^{n} a_i y_i K(x_i, x) + b ),   (5)

where a_i is a Lagrange multiplier and K(x_i, x) is the
kernel function of the mapping to the high dimension. The
Radial Basis Function (RBF) kernel, defined as follows, is
selected as the kernel function. The derived feature vector is
then classified using the nonlinear SVM.
K(x_i, x_j) = exp(−γ ||x_i − x_j||²),   γ > 0   (6)
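Equations (5) and (6) can be written out directly in code; the support vectors, multipliers, and bias below are placeholders standing in for the values a real training procedure (e.g. SMO) would produce:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    """Eq. (6): K(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    d = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-gamma * np.dot(d, d)))

def svm_decision(x, support_vectors, alphas, labels, b, gamma=1.0):
    """Eq. (5): f(x) = sgn(sum_i alpha_i * y_i * K(x_i, x) + b).

    alphas, labels, and b come from training; here they are
    placeholders used to show how a trained model classifies
    a 4-dimensional direction-count vector."""
    s = sum(a * y * rbf_kernel(sv, x, gamma)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b >= 0 else -1
```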
Algorithm: Hand gesture recognition using Support Vector Machine
Input: Motion history image MHI(i, j).
Output: The direction of the moving hand.

1. Initialization:
   Count(i) = 0, i = up, down, left, right;
2. For each row i of MHI(i, j)
     For each column j of MHI(i, j) {
       For each left/right directional Haar-like pattern k in Fig. 11c, d
         If MHI(i, j) > MHI(i, j + k)        /* the left is lighter than the right */
           Count(left)++;
         else if MHI(i, j) < MHI(i, j + k)   /* the right is lighter than the left */
           Count(right)++;
     }
3. For each column j of MHI(i, j)
     For each row i of MHI(i, j) {
       For each up/down directional Haar-like pattern k in Fig. 11a, b
         If MHI(i, j) > MHI(i + k, j)        /* the up is lighter than the down */
           Count(up)++;
         else if MHI(i, j) < MHI(i + k, j)   /* the down is lighter than the up */
           Count(down)++;
     }
4. Normalization:
   Sum = Count(up) + Count(down) + Count(left) + Count(right);
   For each counter i: Count(i) = Count(i) / Sum;
5. /* Original design: Direction = argmax_i Count(i); */
   Feed the four normalized counters Count(i) as a 4-dimensional vector into the trained SVM;
   The trained SVM outputs the recognized dynamic hand gesture type;
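The counting loop of Steps 2-3 can be sketched in Python. The exact pattern set of Fig. 11 is not fully recoverable from the text, so the sketch assumes the pattern index k is the gap between the two compared cells, ranging over all in-row/in-column offsets:

```python
import numpy as np

def direction_counts(mhi):
    """Normalized direction counters from a reduced (e.g. 24 x 32) MHI.

    For every pixel pair in the same row (resp. column) separated by a
    gap k, a brighter-left pair votes 'left' and a brighter-right pair
    votes 'right' (likewise up/down, brighter = more recent motion).
    The four counters are normalized by their sum as in Step 4."""
    mhi = np.asarray(mhi, dtype=np.int16)
    h, w = mhi.shape
    counts = {"up": 0, "down": 0, "left": 0, "right": 0}
    for i in range(h):
        for j in range(w):
            for k in range(1, w - j):          # horizontal pairs
                if mhi[i, j] > mhi[i, j + k]:
                    counts["left"] += 1        # left element is lighter
                elif mhi[i, j] < mhi[i, j + k]:
                    counts["right"] += 1       # right element is lighter
            for k in range(1, h - i):          # vertical pairs
                if mhi[i, j] > mhi[i + k, j]:
                    counts["up"] += 1          # upper element is lighter
                elif mhi[i, j] < mhi[i + k, j]:
                    counts["down"] += 1        # lower element is lighter
    total = sum(counts.values()) or 1
    return {d: c / total for d, c in counts.items()}
```

The resulting 4-dimensional vector is what Step 5 feeds into the trained SVM.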
Comparing our algorithm with the traditional Haar-like-
pattern-based Adaboost cascade classifier, there are three
advantages. First, the classifiers for different Haar
patterns are not executed sequentially but calculated
together. Second, the integral image is not necessary;
instead, we adopt the reduced image for feature calcu-
lation to avoid the multi-scale problem. Finally, SVM,
which can classify high-dimensional samples using a
nonlinear separating function, is adopted to recognize the
dynamic hand gestures, although training is required
beforehand. These advantages remove the nondeterministic
factors from the developed system, and the time complexity of the
proposed algorithm is simply O(w × h), where w × h is
the MHI size.
4 Experimental results
The proposed hand gesture recognition system was tested
to demonstrate its feasibility. The platform is Microsoft
Windows XP on a PC with an AMD 5200+ processor and
2 GB of main memory. For portability, a Logitech portable
webcam C905 is deployed to grab images. The software
development environment is Visual C++ 6.0 with the image
processing library OpenCV 1.1 installed. Our goal is to
design a real-time robust man–machine interface using
hand gesture recognition.
Figure 13 shows the system–user interface where user
could start camera and set the digital zooming scale
according to the distance between user and the camera. The
purpose of digital zoom is to adjust the size of user ROI1
for processing. Alternatively, the user could press the "Auto"
button to automatically zoom into the user's ROI1 using the
position and size of the detected user face as in (1). ROI
settings for face, static hand gesture, and dynamic hand
gesture detection also play an important role for real-time
processing.
After setting the digital zoom, adaptive skin color
detection and hand gesture recognition could be activated
by the check boxes. The execution speed is also displayed
in the right text box. The result of hand gesture recognition
is indicated by the corresponding graphic icon. The lower
right text box is also used to display the text description of
hand gesture recognition for verification and logging.

Fig. 13 User interface of the developed hand gesture recognition system

Fig. 14 The counter values for the four continuous directional dynamic hand gestures. Each gesture was operated a once at 0.6 m and b three times at 2 m
4.1 Hand gesture recognition
Figure 14a shows the recognition results of a video con-
sisting of 241 frames containing four different continuous
dynamic hand gestures, from up, down, left, to
right. These dynamic hand gestures could be classified by
the index of the counter with the maximum normalized
value. Figure 14b shows the recognition results of another
video, consisting of 541 frames, in which the continuous up/
down/left/right hand gestures were repeated three times
at 2 m. This experiment also proved that the maximum
direction counter still works even when operating at a
long distance.
Five users were invited to perform the dynamic hand gestures
50 times per type at three different distances (<1 m,
1–1.5 m, and 1.5–2 m) and the static hand gestures 25 times
per type. These tests were recorded and labeled with the
type of hand gesture, user name, and distance. The
length of the test videos ranges from 75 to 125 s, as the oper-
ating speed depended on the user. Each individual
hand gesture was separated by a short period
without motion. Users could practice for 1 min before testing
to prevent wrong operation. The results are
shown in Table 2, in which the recognition rates were
93.13 % for dynamic hand gestures and 95.07 % for static
hand gestures on average.
However, the recognition accuracy for the moving-up hand
gesture was not good. Therefore, half of the dynamic hand
gesture videos were used to train the SVM and the other
half were used for testing. The recognition accuracy
for moving upward improved to 93.8 %, while the
overall accuracy reached 95.66 %. This demonstrates that
the recognition problem is more appropriately solved
by the SVM than by the simple maximum principle.
4.2 Processing time analysis
The objective of this paper is to develop a real-time, convenient hand gesture recognition system as a man–machine interface. Owing to the advantages stated in the previous section, the system works quite efficiently. Table 3 gives the processing time for each system component; the overall processing time is 3.93 ms per frame. That is, our system can process more than 250 frames per second. This is especially useful for embedded systems, which should spend few computations on the user interface and reserve most for the main user tasks such as media playing or Internet surfing.
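The frame-rate claim follows directly from the per-frame time reported in Table 3:

```python
frame_time_ms = 3.93              # total per-frame processing time (Table 3)
fps = 1000.0 / frame_time_ms      # frames processable per second
# 1000 / 3.93 is roughly 254, i.e. more than 250 frames per second
```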
The developed man–machine interface was integrated with the well-known album browser Cooliris3 as shown in Fig. 15a. Users can browse with their own hands as shown in Fig. 15b. The six hand gestures were bound to six commands, matching the commonly used buttons of a remote controller in Fig. 15c. The dynamic hand gestures map to moving up, down, left, and right, while the waving hand represents ''wake-up'' (system) and the fist hand represents ''enter'' (menu).
Table 4 gives a functional comparison with several surveyed works. Most methods, except [15], impose restrictions on the operating environment because of the complex backgrounds found in the real world. The exception is possible because [15] uses a time-of-flight (ToF) camera and other IR-based cameras to register depth and filter out background objects; however, the higher hardware cost and the additional computations for calibration and data registration are inevitable. Here, with the designed novel Haar-like features and the recognition algorithm operating on the motion history image, we successfully remove background objects and achieve quite high execution speed. This is very useful for embedded systems, which have limited computation capability.

Table 2 Average accuracy for the defined hand gestures

  Hand gesture              Times        Accuracy (%)   SVM times     SVM accuracy (%)
  Dynamic  Moving up        669/750      89.20          352/375       93.8
  Dynamic  Moving down      689/750      91.87          356/375       94.9
  Dynamic  Moving left      720/750      96.00          365/375       97.3
  Dynamic  Moving right     716/750      95.47          362/375       96.5
  Dynamic  Average          2,794/3,000  93.13          1,435/1,500   95.66
  Static   Fist hand        352/375      93.87
  Static   Waving hand      361/375      96.26
  Static   Average          713/750      95.07

Table 3 Average processing time for each system component

  System component                           Time (ms)
  Face-based adaptive skin color detection   3.45
  Motion detection                           0.08
  Direction counter                          0.28
  Support vector machine                     0.12
  Total                                      3.93

3 http://www.cooliris.com/.
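The average rates in Table 2 follow directly from the per-gesture counts; a quick arithmetic check:

```python
# Recompute the Table 2 averages from the raw correct/total counts.
dynamic = {"up": (669, 750), "down": (689, 750),
           "left": (720, 750), "right": (716, 750)}
dynamic_correct = sum(c for c, _ in dynamic.values())   # 2794
dynamic_total = sum(n for _, n in dynamic.values())     # 3000
dynamic_accuracy = 100.0 * dynamic_correct / dynamic_total

static = {"fist": (352, 375), "waving": (361, 375)}
static_correct = sum(c for c, _ in static.values())     # 713
static_total = sum(n for _, n in static.values())       # 750
static_accuracy = 100.0 * static_correct / static_total
```

Both results round to the reported 93.13 % (dynamic) and 95.07 % (static).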
5 Conclusions and future works
In this paper, a real-time hand gesture recognition system was developed that tolerates complex backgrounds by using the MHI and ROI settings. Real-time computation is achieved by the designed novel Haar-like features and the corresponding simple and efficient SVM classification method. Two static hand gestures, the fist hand and the waving hand, and four dynamic hand gestures, representing the hand moving up, down, left, and right, are defined. These hand gestures are natural and simple to use. For static hand gesture recognition, the hand regions are first extracted by the face-based adaptive skin color model in the ROI defined for the static hand, under a variety of environmental changes; the respective conditions for the static hand gestures are then checked for classification. Five persons were invited to test the developed system. Experimental results show that the accuracy is 95.37 % on average, which demonstrates the feasibility of the proposed system.
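The MHI underlying the system can be summarized as a per-pixel timestamp update: pixels with fresh motion are set to a maximum duration, and all others fade. The minimal pure-Python sketch below shows one common decay variant; the parameterization is our assumption, not necessarily the paper's exact scheme:

```python
def update_mhi(mhi, motion_mask, tau=15):
    """One motion-history-image step: pixels where motion was detected
    are set to the maximum duration tau; all other pixels decay by 1
    toward 0, so recent motion stays bright and old motion fades."""
    return [[tau if moved else max(h - 1, 0)
             for h, moved in zip(h_row, m_row)]
            for h_row, m_row in zip(mhi, motion_mask)]
```

Starting from an all-zero history and feeding binary motion masks frame by frame yields a grayscale "trail" whose intensity gradient encodes the direction of motion, which is what the directional Haar-like features respond to.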
Computer vision concerns the theories for developing intelligent systems that extract information from images. The images can take many forms, such as video sequences, views from multiple cameras, or multi-dimensional slices from specific scanners. Though computer vision has been developed for a long time, commercial applications are still few. At the present stage, a considerable number of research institutes and companies are active in this field. One typical example is Microsoft's Xbox 360 equipped with the Kinect sensor, which allows users to operate the system with bare hands. Though the Kinect provides depth information, user gestures are still easily occluded by the user himself or by other persons. Therefore, how to integrate all the data captured from multiple cameras to recover more motion information accurately remains a necessary step toward intelligent man–machine interfaces in the future.

Fig. 15 a Screenshot of Cooliris. b Cooliris operated by natural hand gesture recognition. c A commonly designed remote control with a central ''enter'' button surrounded by four directional buttons

Table 4 Comparisons with several surveyed hand gesture recognition systems

  Reference | Methodology | Vocabulary set | Device | Operating environment | Accuracy (%) | Speed (ms)
  Licsar and Sziranyi [22] | User-adaptive recognition with interactive training | Palm with varying fingers | Projector and camera | Only hand allowed to appear | Over 98 | NA
  Kim et al. [14] | Active shape model for gait recognition | Human gaits | Infra-red camera | Works across illumination changes | Over 90 | NA
  Kao et al. [17] | Face and hand gesture recognition by PCA | Palm with varying fingers | Color camera | Elevator | Over 94 | NA
  Xie [1] | Fuzzy neural network for mode classification | Four hand postures | Stereo vision | Hand profile in meeting room | NA | Real time
  Stergiopoulou and Papamarkos [21] | SGO neural gas network | Number of raised fingers | RGB camera | Hands in simple background | 90.45 | NA
  Van den Bergh and Van Gool [15] | Classifier using traditional 2D Haarlets | Six hand postures with varying fingers | ToF (depth) plus RGB | Other persons allowed in background | 99.54 | 33.4
  Qing et al. [33] | Traditional 2D Haarlets classifier with SCFG | Four hand postures with varying fingers | RGB camera | Hands in simple background | 95.65 | 3.04
  Ciprian et al. [34] | Dynamic hand gestures using tensor voting filter | Three/four static/dynamic hand gestures | RGB camera | Hands in simple background | Simulation only | Real time
  Ours | Novel Haar features and SVM classifier | Six easily operated gestures | RGB camera | Hands allowed to overlap skin-like objects | 95.37 | 3.93
References
1. Xie, W., Teoh, E.K., Venkateswarlu, R., Chen, X.: Hand as natural man–machine interface in smart environments. In: Proceedings of the 24th IASTED International Conference on Signal Processing, Pattern Recognition, and Applications, pp. 117–122 (2006)
2. Lee, M., Woo, W.: ARKB: 3D vision-based augmented reality keyboard. In: Proceedings of the International Conference on Artificial Reality and Telexistence, pp. 54–57 (2003)
3. Chen, J.Y., Haas, E., Barnes, M.: Human performance issues and user interface design for teleoperated robots. IEEE Trans. Syst. Man Cybern. Part C 37, 1231–1245 (2007)
4. Hu, C., Meng, M.Q., Liu, P.X., Wang, X.: Visual gesture recognition for human–machine interface of robot teleoperation. In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Las Vegas, Nevada, pp. 1560–1565 (2003)
5. Bretzner, L., Laptev, I., Lindeberg, T., Lenman, S., Sundblad, Y.: A prototype system for computer vision based human computer interaction. Technical report CVAP251, ISRN KTH NA/P–01/09—SE, Department of Numerical Analysis and Computer Science, KTH, Royal Institute of Technology, Stockholm, Sweden (2001). ftp://ftp.nada.kth.se/CVAP/reports/cvap251.pdf
6. Chen, C.Y., Fan, Y.P., Chou, H.L.: Hand gesture commands for slide view control in a PC based presentation. In: Proceedings of the International Computer Symposium, Workshop on Computer Graphics and Virtual Reality, Tainan, Taiwan (1998)
7. Argyros, A.A., Lourakis, M.I.A.: Vision-based interpretation of hand gestures for remote control of a computer mouse. In: Proceedings of the HCI Workshop, LNCS 3979, Graz, Austria, pp. 40–51 (2006)
8. Zhou, H., Xie, L., Fang, X.: Visual mouse: SIFT detection and PCA recognition. In: Proceedings of the International Conference on Computational Intelligence and Security Workshops, pp. 263–266 (2007)
9. Hsieh, C.C., Liou, D.H.: A robust hand gesture recognition system using Haar-like features. In: Proceedings of the 2nd International Conference on Signal Processing and Systems, Dalian, China, pp. V2-394–V2-398 (2010)
10. Rojas, R.: The backpropagation algorithm. In: Neural Networks—A Systematic Introduction, chap. 7. Springer, Berlin (1996)
11. Du, W., Li, H.: Vision based gesture recognition system with single camera. In: Proceedings of ICSP, pp. 1351–1357 (2000)
12. Segen, J., Kumar, S.: Human–computer interaction using gesture recognition and 3D hand tracking. In: Proceedings of ICIP, Chicago, pp. 188–192 (1998)
13. Liu, Y., Jiam, Y.: A robust hand tracking and gesture recognition method for wearable visual interfaces and its applications. In: Proceedings of the Third International Conference on Image and Graphics, pp. 472–475 (2004)
14. Kim, D., Lee, S., Paik, J.: Active shape model-based gait recognition using infrared images. Int. J. Signal Process. Image Process. Pattern Recogn. 2, 1–13 (2009)
15. Bergh, M.V.D., Gool, L.V.: Combining RGB and ToF cameras for real-time 3D hand gesture interaction. In: Proceedings of the IEEE Workshop on Applications of Computer Vision, pp. 66–72 (2011)
16. Mitra, S., Acharya, T.: Gesture recognition: a survey. IEEE Trans. Syst. Man Cybern. Part C 37, 311–324 (2007)
17. Kao, Y.W., Gu, H.Z., Yuan, S.M.: Integration of face and hand gesture recognition. In: Proceedings of the Third International Conference on Convergence and Hybrid Information Technology, vol. 1, pp. 330–335 (2008)
18. Ramamoorthy, A., Vaswani, N., Chaudhury, S., Banerjee, S.: Recognition of dynamic hand gestures. Pattern Recogn. 36, 2069–2081 (2003)
19. Kwok, C., Fox, D., Meila, M.: Adaptive real-time particle filter for robot localization. In: Proceedings of Robotics and Automation, vol. 2, pp. 2836–2841 (2003)
20. Yeasin, M., Chaudhuri, S.: Visual understanding of dynamic hand gestures. Pattern Recogn. 33, 1805–1817 (2000)
21. Stergiopoulou, E., Papamarkos, N.: Hand gesture recognition using a neural network shape fitting technique. Eng. Appl. Artif. Intell. 22, 1141–1158 (2009)
22. Licsar, A., Sziranyi, T.: User-adaptive hand gesture recognition system with interactive training. Image Vis. Comput. 23, 1102–1114 (2005)
23. Wu, Y.M.: The implementation of gesture recognition for media player system. Master thesis, Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan (2009)
24. Lai, I.H.: The following robot with searching and obstacle-avoiding. Master thesis, Department of Electrical Engineering, National Central University, Chung-Li, Taiwan (2009)
25. Tu, Y.J.: Human computer interaction using face and gesture recognition. Master thesis, Department of Electrical Engineering, National Chung Cheng University, Taiwan (2007)
26. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. I-511–I-518 (2001)
27. Lienhart, R., Maydt, J.: An extended set of Haar-like features for rapid object detection. In: Proceedings of the International Conference on Image Processing, vol. 1, pp. I-900–I-903 (2002)
28. Vezhnevets, V., Sazonov, V., Andreeva, A.: A survey on pixel-based skin color detection techniques. Graphics and Media Laboratory, Faculty of Computational Mathematics and Cybernetics, Moscow State University, Russia (2003)
29. Wimmer, M., Radig, B.: Adaptive skin color classificator. In: Proceedings of the First ICGST International Conference on Graphics, Vision and Image Processing (GVIP), vol. I, Cairo, Egypt, pp. 324–327 (2005)
30. Liou, D.H.: A real-time hand gesture recognition system by adaptive skin-color detection and motion history image. Master thesis, Department of Computer Science and Engineering, Tatung University, Taipei, Taiwan (2009)
31. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Stat. Comput. 14, 199–222 (2004)
32. Boser, B.E., Guyon, I.M., Vapnik, V.N.: Support vector machines are universally consistent. J. Complex. pp. 768–791 (2002)
33. Qing, C., Georganas, N.D., Petriu, E.M.: Hand gesture recognition using Haar-like features and a stochastic context-free grammar. IEEE Trans. Instrum. Meas. 57, 1562–1571 (2008)
34. Ciprian, D., Vasile, G., Pekka, N., Veijo, K.: Dynamic hand gesture recognition for human–computer interactions. In: Proceedings of the 6th IEEE International Symposium on Applied Computational Intelligence and Informatics, May 19–21, Romania, pp. 165–170 (2011)
Author Biographies
Chen-Chiung Hsieh received his B.S., M.S., and Ph.D. degrees from the Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, Taiwan, in 1986, 1988, and 1992, respectively. From Dec. 1992 to Jan. 2004, he was with the Institute for Information Industry (III) as a vice director. From Dec. 2004 to Jan. 2006, he was with Acer Inc. as a senior director. He is presently an associate professor in the Department of Computer Science and Engineering at Tatung University, Taipei, Taiwan. His research mainly focuses on image and multimedia processing.
Dung-Hua Liou received his B.S. and M.S. degrees from the Department of Computer Science and Engineering, Tatung University, in 2007 and 2009, respectively. His research interests include image processing and video surveillance.