[IEEE 2010 5th International Conference on Information and Automation for Sustainability (ICIAfS) - Colombo (2010.12.17-2010.12.19)] 2010 Fifth International Conference on Information


Skeletonization in a real-time Gesture Recognition System

K. Srijeyanthan, A. Thusyanthan, C. N. Joseph, S. Kokulakumaran, C. Gunasekara, C. Gamage

Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka

{srijeyanthan, athusy, nihanthjoseph, kokulakumaran}@gmail.com {chulakag, chandag}@uom.lk

Abstract: As advances in technology continue to make both hardware and software affordable and accessible, we have seen rapid growth in computer vision and image processing applications. One area of interest in vision and image processing is the automated identification of objects in real-time or recorded video streams and the analysis of these identified objects. An important topic of research in this context is the identification of humans and the interpretation of their actions. For example, a camera mounted on the front of a vehicle can capture images of pedestrians and analyse their actions to determine whether they are about to cross the path of the oncoming vehicle. This could be an important accident prevention technique. This paper presents part of our work in a project that deals with object detection and gesture recognition on video streams in real time and that could support such applications. In order to recognize the gestures and movements of humans, we must first identify the moving objects in a video stream. The identified objects then need to be tracked over the frames of the video and classified as human or non-human. Thereafter, the human objects must be skeletonized in order to encode their movements before interpretation can be done. This paper surveys and analyses various skeletonization methods and justifies our selection of a particular skeletonization method through the implementation of algorithms and the analysis of experimental data.

Keywords - Skeletonization, Gesture Recognition, Star Skeletons, Video Processing, Computer Vision

I. INTRODUCTION

There are many applications in computer vision and image processing where it is useful to identify humans in a video stream and to understand their actions through automated processing. Examples include security of public places, where video surveillance cameras can determine whether a person has left a package and moved away; monitoring of an elderly-care facility to see whether someone has fallen over; and monitoring of a day-care centre to determine whether a child needs attention. At present, we are researching and developing a modular system called moveIt (Movements Interpreted) [1] that can provide such functionality. It is an automated system to identify, track, recognize, interpret, and analyse whole-body gestures in order to determine the behaviour of objects of interest, in this case humans, from video streams. The performance and accuracy of an automated whole-body gesture recognition system depend initially on its ability to detect moving objects of interest accurately in the observed environment. All subsequent actions, such as tracking, analysing the motion, or identifying humans, require this accurate detection of moving objects. The tracking of identified objects can be done using techniques based on contours and blobs.

The movements of the identified objects need to be abstracted into a model in order to capture and recognize the gestures created through the sequence of movements. This model needs to be general so that it can be used to abstract movements for all identified objects of a particular type. As the aim of our project is whole-body gesture recognition, we have selected the human skeleton as an appropriate model. The process of obtaining a skeleton model from a video image object of a human is termed Skeletonization throughout this paper. Skeletonization is therefore a crucial step in our project. Identifying the skeleton of a human in the tracked blob and fitting it to a varying skeleton according to the movements of the blob is a computationally intensive task. If we are to maintain a higher level of accuracy, the computational load increases accordingly, as a greater amount of video data must be processed. However, as the moveIt project mainly focuses on real-time gesture recognition, it is necessary to select a fast Skeletonization method that does not significantly degrade accuracy. By real-time we mean that when a video is fed in, or a camera is calibrated for the moveIt system, processing is done in the background while the output is simultaneously played on screen without any delay. For example, if a video has a frame rate of 25 frames per second, the interval between successive frames displayed on screen is 40 ms; that is, 40 ms are available for processing each frame.
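The per-frame time budget described above is simple arithmetic, and the following minimal Python sketch makes it explicit. The function names `frame_budget_ms` and `is_real_time` are hypothetical helpers introduced here for illustration; they are not part of moveIt.

```python
def frame_budget_ms(frames_per_second: float) -> float:
    """Return the time available to process one frame, in milliseconds."""
    if frames_per_second <= 0:
        raise ValueError("frame rate must be positive")
    return 1000.0 / frames_per_second


def is_real_time(processing_ms: float, frames_per_second: float) -> bool:
    """True if one frame can be processed within its display interval."""
    return processing_ms <= frame_budget_ms(frames_per_second)


print(frame_budget_ms(25))     # 40.0
print(is_real_time(35.0, 25))  # True
print(is_real_time(55.0, 25))  # False
```

Under this reading, every stage that runs on a frame (background subtraction, tracking, and skeletonization) has to share the same 40 ms window.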

Skeletons and Skeletonization are becoming increasingly useful in CAD and computer graphics modelling programs, and are not limited to computer vision and image processing applications. Section II discusses several Skeletonization techniques, and Section III presents an algorithm that has been optimized for use in real-time applications. Finally, Section IV provides concluding remarks.

II. SKELETONIZATION ALGORITHMS

In this section, we discuss the main methodologies available in the literature for creating skeletons from an original source (video or image). Each of the discussed algorithms is also evaluated for its suitability for real-time use.

978-1-4244-8551-2/10/$26.00 ©2010 IEEE ICIAfS10


Star Skeletonization

Star Skeletonization is one of the basic methods used to create skeletons. The Star Skeletonization algorithm is based on finding extreme points and connecting them to the centre point of the skeleton. The method is called a star skeleton because it deals with five extreme points connected to a central point. A star skeleton can be implemented in various ways; one such way is presented in [2]. When implementing Star Skeletonization, the border of the human object is first extracted. Then the centre point of the object (the centroid) is calculated as the average of the border points, and the distance from the centroid to each border point is plotted as a function of position along the boundary. Next, the resulting graph is smoothed by applying a Discrete Fourier Transform, a low-pass filter, and an inverse Fourier transform. The smoothed graph is then used to identify the extreme points of the skeleton. Finally, those extreme points are connected to the centroid to form the star skeleton.
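The frequency-domain smoothing step can be illustrated with a minimal Python sketch. It assumes an ideal (brick-wall) low-pass filter and uses a naive O(n^2) transform for readability; a practical implementation would normally use an FFT library. The names `dft`, `inverse_dft`, `low_pass` and the `cutoff` parameter are hypothetical and are not taken from [2].

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a sequence of numbers."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def low_pass(signal, cutoff):
    """Zero every frequency bin above `cutoff`, then transform back."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        frequency = min(k, n - k)  # bins k and n - k carry the same frequency
        if frequency > cutoff:
            spectrum[k] = 0
    return [value.real for value in inverse_dft(spectrum)]


# Demo: a slow variation (frequency 2) plus a fast ripple (frequency 11).
n = 32
clean = [10 + 3 * math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
noisy = [clean[t] + 0.5 * math.cos(2 * math.pi * 11 * t / n) for t in range(n)]
smoothed = low_pass(noisy, cutoff=4)
print(max(abs(s - c) for s, c in zip(smoothed, clean)) < 1e-9)  # True
```

Raising `cutoff` lets more of the fast ripple through, which in a distance graph would show up as additional small peaks.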

Curve Skeletons

The curve-skeleton is widely used in three-dimensional modelling rather than in computer vision projects. Curve-skeletons are thinned, one-dimensional representations of 3D objects that are useful for numerous visualization tasks, including virtual navigation, reduced-model formulation, visualization improvement, and animation [3]. According to [4], the methods used to compute curve-skeletons of 3D models can be categorized into three groups: voxel topology, computational geometry, and continuous implicit.

Since curve-skeletons incorporate the notion of parts or components, they can accommodate part matching, where the object to be matched is part of a larger object, or vice versa. This feature can give users more control over the matching algorithm, allowing them to specify which part of the object they would like to match or whether the matching algorithm should weight one part of the object more than another [5].

3D models are common in many disciplines, including computer aided design, medical imaging, computer graphics, scientific visualization, computational fluid dynamics, and remote sensing. While the 3D representation is invaluable, many applications require alternative, compact representations of these models. One such representation is a line-like or stick-like 1D representation, which is sometimes referred to as a skeletal representation or curve-skeleton [6]. Thus, 3D-model-based skeletonization is not a separate methodology; rather, it is based on thinning and curve-skeletons.

Thinning

Thinning is a morphological operation that is used to eliminate selected foreground pixels from binary images, using operations somewhat similar to erosion and dilation [7]. Thinning has several benefits: it maintains the topology and shape of objects when making skeletons, forces the skeleton to lie in the middle of the object, and produces skeletons that are one pixel wide. Thinning is therefore a good choice in noise-free and shadow-free environments. However, thinning algorithms do not give proper skeletons if an object is disconnected, completely deleted, or merged with another object. As most background subtraction algorithms do not produce fully noise-free and shadow-free objects, thinning cannot be used for skeletonization in those scenarios.

Template Matching

Template matching is the most computationally intensive of these schemes. However, if the technique only needs to be performed in a comparatively small search region, the computation time can be significantly reduced. Also, template matching is biased towards areas where motion is generally detected, as shown in the marked areas of figure 1, and is therefore more likely to prevent the template from drifting onto the background [8].

Fig. 1. Bias areas for template matching

In the template matching technique, all search regions are matched against the current template in order to find the best matching result. As highlighted in [8], this gives rise to several important issues:

• If a template matching scheme performs periodic background updating, a human object may be missed by motion detection when that object stays motionless for a while. An extra search region should therefore be used to solve this problem; it is composed of the pixels in the current frame obtained from the correlation of pixels between the previous frame and the actual image.

• As the location and size of faces are unlikely to change significantly from frame to frame, tracking speed could be further improved by focusing on the face region. This could easily be achieved by considering the displacement data for the centre of each search region.


• It is common for a video to have transiently overlapping objects, such as two walking figures that cross each other. If template matching focused on face region search is used, the search regions can be accurately determined provided that the heads do not visibly merge. However, when the heads partially merge, motion detection will provide a composite search region.
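To make the cost argument concrete, the following Python sketch matches a template only inside a restricted search region. It is a generic illustration, not the scheme of [8]: the similarity measure (sum of absolute differences) and the names `sad` and `match_in_region` are assumptions chosen for this example.

```python
def sad(frame, template, top, left):
    """Sum of absolute differences between the template and one frame patch."""
    return sum(
        abs(frame[top + r][left + c] - template[r][c])
        for r in range(len(template))
        for c in range(len(template[0]))
    )


def match_in_region(frame, template, region):
    """Return (best_score, (top, left)), searching only inside `region`.

    `region` is (top, left, bottom, right) in frame coordinates, with the
    bottom and right limits exclusive. Returns None if the template does
    not fit inside the region.
    """
    t_h, t_w = len(template), len(template[0])
    top, left, bottom, right = region
    best = None
    for r in range(top, bottom - t_h + 1):
        for c in range(left, right - t_w + 1):
            score = sad(frame, template, r, c)
            if best is None or score < best[0]:
                best = (score, (r, c))
    return best


# Demo: a 2x2 pattern placed at row 3, column 2 of an empty 6x6 frame.
frame = [[0] * 6 for _ in range(6)]
template = [[9, 8], [7, 6]]
for r in range(2):
    for c in range(2):
        frame[3 + r][2 + c] = template[r][c]
print(match_in_region(frame, template, (2, 1, 6, 5)))  # (0, (3, 2))
```

The number of patch comparisons grows with the area of `region`, which is why confining the search to areas where motion was detected reduces the computation time.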

III. A PIXEL DENSITY BASED STAR SKELETONIZATION ALGORITHM

As the aim of the moveIt project is to interpret whole-body gestures in real-time video streams, the possible choices of skeletonization algorithm are greatly narrowed by the performance requirements. Therefore, star skeletonization, the fastest of the available schemes, was selected as the basis for an improved technique. Skeletonization tracks human movements and, upon completion, outputs the background-subtracted video in which skeletons move according to the movements made by the humans in the original video.

The proposed skeletonization algorithm operates in three distinct phases: finding the points of the head, the legs, and the hands. Locating the extreme points of an object in a video frame is an error-prone task, as there may be noise within the bounding box that does not belong to the object. This is why our approach identifies the extreme points of a human in separate parts.

Identification of the Head

Our algorithm assumes that, in general, the top section of the bounding box contains the head. This assumption may be wrong, for example when a person bends down, in which case the algorithm treats it as a special case and uses star skeletonization, as explained later in this section. The top section is divided into equal intervals and the pixel density of each interval is computed, as shown in figure 2.

In order to identify the exact position of the head, a graph of pixel density against interval points is drawn, as shown in figure 3. The graph has a bell shape, which is the desired characteristic shape. For each frame, the pixels are analysed and the graph is plotted, and the maximum of the graph gives the corresponding x-coordinate of the head. Even if there is noise, the shape of the graph is not significantly affected except for small variations (upward bumps, as seen at interval points 2 to 4 of the graph). In all our test cases, this method was more robust and accurate in the presence of noise than the other methods discussed in Section II. Therefore, this method is suitable for real-time use where a high level of accuracy is required.
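The head-location step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the moveIt implementation: the fraction of the bounding box treated as the top section (`top_fraction`), the number of intervals, and the function names `strip_densities` and `head_x` are all hypothetical, since the paper does not give these values.

```python
def strip_densities(mask, box, top_fraction=0.25, intervals=5):
    """Count foreground pixels per vertical strip in the top part of a box.

    mask: 2D list of 0/1 values (one background-subtracted frame).
    box: (top, left, bottom, right), with bottom and right exclusive.
    """
    top, left, bottom, right = box
    head_bottom = top + max(1, int((bottom - top) * top_fraction))
    width = right - left
    densities = []
    for k in range(intervals):
        x0 = left + k * width // intervals
        x1 = left + (k + 1) * width // intervals
        densities.append(
            sum(mask[y][x] for y in range(top, head_bottom) for x in range(x0, x1))
        )
    return densities


def head_x(mask, box, top_fraction=0.25, intervals=5):
    """Return the x-coordinate at the centre of the densest strip."""
    top, left, bottom, right = box
    densities = strip_densities(mask, box, top_fraction, intervals)
    k = densities.index(max(densities))
    width = right - left
    x0 = left + k * width // intervals
    x1 = left + (k + 1) * width // intervals
    return (x0 + x1) / 2


# Demo silhouette: a head on top of a wider body, plus one noise pixel.
mask = [[0] * 20 for _ in range(12)]
for y in range(0, 3):
    for x in range(7, 13):
        mask[y][x] = 1  # head
for y in range(3, 12):
    for x in range(5, 15):
        mask[y][x] = 1  # torso and legs
mask[1][1] = 1          # isolated noise pixel

box = (0, 0, 12, 20)
print(strip_densities(mask, box))  # [1, 3, 12, 3, 0]
print(head_x(mask, box))           # 10.0
```

In this toy frame the noise pixel adds a small bump to the first strip but does not move the peak, which mirrors the robustness argument made above.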

Identification of Hands

This phase uses the same technique as head identification and is shown in figure 4. However, it is necessary to use a data analysis method in order to predict the actual hand coordinates. In background-subtracted images, there are two situations in which we cannot, or need not, identify the hand points: (1) when the hands overlap with the body and (2) when the hands are obstructed by the body (for example, behind the torso).

Fig. 3. Graph for head identification

In such situations, it is extremely difficult to identify the hands of a human object, as the input is a binary image (that is, the background-subtracted video) and the only data available is the existence of a pixel at a given point. The algorithm therefore rejects both scenarios. It can be observed from the graph in figure 5 that the pixel density increases towards the hand, and therefore it can be predicted with a higher level of confidence that the hands are going to move in that particular direction. However, sudden hand movements cannot be predicted using this algorithm. In our approach, if we do not obtain a characteristic graph such as the one shown in figure 5, we omit the direction of the hands and do not analyse the extreme hand points, as such computation would be inefficient and ineffective under real-time processing. In our implementation of the algorithm, reasonable results were obtained for different test cases.

Fig. 5. Graph for hand identification

Identification of Legs

Finding the actual leg points is another difficult task in real-time video streams, as images are normally mixed with unwanted noise and shadows, which cannot be eliminated in practice. For this reason, we have used the pixel-based analysis method with a restricted analysis area. The height of this area is less than H/2, where H is the total height of the human object. However, in certain instances, such as when a person bends, a hand may become intermingled with the legs. This is a challenging scenario for the identification of legs. As a solution, our algorithm checks the frames immediately before and after the current frame, which has helped to reduce false positive identifications.

Fig. 2. Identifying the head in an image

Fig. 4. Identifying the hands in an image

Star Skeletons

Density-based skeletonization has limitations, as it is less effective and less accurate for non-standing postures such as crawling. We have therefore implemented star skeletonization in order to convert human images from video feeds into a star shape. A star skeleton consists of several vectors, each running from the centroid of the human contour to one of its extremities. Since the star skeleton does not need many pixel computations, it is a computationally simple, real-time, and robust technique. The basis of the star skeleton is to connect the extremities of the human contour to its centroid. To find the extremities, the distance from each boundary point to the centroid is calculated by tracing the boundary in clockwise or counter-clockwise order. The extremities are located at the local maxima of this distance function. Since the distance function of a human contour is noisy, noise reduction should be applied to it using a smoothing filter or a low-pass filter. The final extremities are then detected by finding the local maxima of the smoothed distance function.

A star skeleton can be implemented in various ways; one such way is presented in [2].

The procedure for building star skeletons is as follows.

1) The centroid $(X_c, Y_c)$ of the target image boundary is determined:

$$X_c = \frac{1}{N_b} \sum_{i=1}^{N_b} x_i \tag{1}$$

$$Y_c = \frac{1}{N_b} \sum_{i=1}^{N_b} y_i \tag{2}$$

where $(X_c, Y_c)$ is the average boundary pixel position, $N_b$ is the number of boundary pixels, and $(x_i, y_i)$ is a pixel on the boundary of the target.

2) The distances $d_i$ from the centroid $(X_c, Y_c)$ to each border point $(x_i, y_i)$ are calculated. These are expressed as a one-dimensional discrete function $d(i) = d_i$. Note that this function is periodic with period $N_b$.

$$d_i = \sqrt{(x_i - X_c)^2 + (y_i - Y_c)^2} \tag{3}$$


3) The signal $d(i)$ is then smoothed for noise reduction and becomes $\hat{d}(i)$. This can be done using a linear smoothing filter or low-pass filtering in the frequency domain.

4) Local maxima of $\hat{d}(i)$ are taken as extreme points, and the "star" skeleton is constructed by connecting them to the target centroid $(X_c, Y_c)$. Local maxima are detected by finding zero-crossings of the difference function

$$\delta(i) = \hat{d}(i) - \hat{d}(i-1) \tag{4}$$
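The four steps can be put together in a short Python sketch. It assumes the boundary is already available as a list of (x, y) points ordered along the contour, and it uses a wrap-around moving average as the linear smoothing filter mentioned in step 3. The function names and the `half_window` parameter are hypothetical choices for this illustration, not taken from [2].

```python
import math


def centroid(boundary):
    """Equations (1) and (2): average position of the boundary pixels."""
    n = len(boundary)
    return (sum(x for x, _ in boundary) / n, sum(y for _, y in boundary) / n)


def distance_function(boundary, centre):
    """Equation (3): distance from the centroid to every boundary point."""
    xc, yc = centre
    return [math.hypot(x - xc, y - yc) for x, y in boundary]


def smooth_circular(d, half_window=1):
    """Moving average that wraps around, because d(i) is periodic."""
    n = len(d)
    size = 2 * half_window + 1
    return [
        sum(d[(i + j) % n] for j in range(-half_window, half_window + 1)) / size
        for i in range(n)
    ]


def local_maxima(d_hat):
    """Indices where delta(i) of equation (4) goes from positive to non-positive."""
    n = len(d_hat)
    delta = [d_hat[i] - d_hat[i - 1] for i in range(n)]  # index -1 wraps around
    return [i for i in range(n) if delta[i] > 0 and delta[(i + 1) % n] <= 0]


def star_skeleton(boundary, half_window=1):
    """Return the centroid and the extreme points to connect to it."""
    centre = centroid(boundary)
    d_hat = smooth_circular(distance_function(boundary, centre), half_window)
    return centre, [boundary[i] for i in local_maxima(d_hat)]


# Demo contour: 16 ordered points on a four-pointed star around the origin.
radii = [6 + 4 * math.cos(4 * 2 * math.pi * k / 16) for k in range(16)]
contour = [
    (r * math.cos(2 * math.pi * k / 16), r * math.sin(2 * math.pi * k / 16))
    for k, r in enumerate(radii)
]
centre, tips = star_skeleton(contour)
print(len(tips))  # 4
```

A larger `half_window` plays the same role as a lower cut-off frequency: it suppresses small bumps in the distance function, so fewer extreme points are reported.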

An important step of star skeletonization is target preprocessing, which requires applying morphological operations to the extracted human contour in order to obtain a smooth boundary of the human. Figure 6 shows the output of each preprocessing step.

Fig. 6. The boundary is "unwrapped" as a distance function from the centroid. This function is then smoothed and extreme points are extracted.

Figure 7 shows the process of converting an actual image to a star skeleton. The extreme points depend mainly on the cut-off value chosen for the low-pass filter: as the cut-off frequency increases, the number of extreme points increases.

Fig. 7. Target preprocessing. A moving target region is morphologically dilated then eroded. Then its border is extracted.
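The dilate-then-erode preprocessing and the border extraction can be sketched in Python as below. The 3x3 structuring element, the 4-connected border test, and the names `dilate`, `erode` and `border` are assumptions made for this illustration; the paper does not specify them.

```python
def _window(mask, y, x):
    """Values in the 3x3 window centred on (y, x); outside the image counts as 0."""
    h, w = len(mask), len(mask[0])
    return [
        mask[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
    ]


def dilate(mask):
    """A pixel is set if any pixel in its 3x3 window is set."""
    return [[int(any(_window(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def erode(mask):
    """A pixel stays set only if its whole 3x3 window is set."""
    return [[int(all(_window(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def border(mask):
    """Foreground pixels (x, y) with at least one 4-connected background neighbour."""
    h, w = len(mask), len(mask[0])

    def is_background(y, x):
        return not (0 <= y < h and 0 <= x < w) or mask[y][x] == 0

    return [
        (x, y)
        for y in range(h)
        for x in range(w)
        if mask[y][x]
        and any(is_background(y + dy, x + dx)
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)))
    ]


# Demo: a 5x5 blob with a one-pixel hole, inside a 9x9 frame.
blob = [[0] * 9 for _ in range(9)]
for y in range(2, 7):
    for x in range(2, 7):
        blob[y][x] = 1
blob[4][4] = 0  # hole left by imperfect background subtraction

closed = erode(dilate(blob))
print(closed[4][4])                  # 1
print(sum(map(sum, closed)))         # 25
print(len(border(closed)))           # 16
```

Note that `border` returns pixels in raster order. A star skeleton needs them ordered along the contour, so a boundary-tracing step would still be required before computing the distance function.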

Our star skeletonization algorithm was tested on different video sequences, and the results obtained showed significant accuracy. Sample inputs and the respective star skeletons are shown in figure 8.

Fig. 8. Original captured frames and respective skeletons obtained using star skeletonization

Significant features of Star Skeletonization: Star skeletonization has the advantage that it is not an iterative algorithm and is therefore computationally cheap. It also explicitly provides a mechanism for controlling scale sensitivity: the scale of features that can be detected is directly configurable by changing the cut-off frequency of the low-pass filter. Finally, it does not need an a priori human model. The method that we have proposed has features similar to those of the star skeleton, but it has some drawbacks: it is an iterative process, it depends on the position of the human, and it requires some assumptions.

IV. CONCLUSION

Although some parts of our pixel density based skeletonization algorithm are quite complex due to the need to handle exceptional cases, the overall algorithm is quite simple and therefore computationally inexpensive to operate. Thus, it is well suited to the real-time applications targeted by the moveIt project (further information about the project can be found in [1]). Furthermore, the proposed algorithm does not require a human model prior to processing and requires no training, which is common for template matching schemes.

When there is more than one human object in a scene, blobs are identified separately and the corresponding bounding boxes are given separate identification numbers. In tests, the proposed skeletonization algorithm worked at the video frame rate and produced output without any delay, similar to playback of the original video. A sample of skeletons and identified bounding boxes is shown in figure 9.
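The paper does not state how blobs are separated. One common way to obtain separately numbered bounding boxes from a binary frame is connected-component labelling, sketched below in Python under that assumption; the function name `label_blobs` and the use of 4-connectivity are illustrative choices, not a description of moveIt.

```python
from collections import deque


def label_blobs(mask):
    """Return {blob_id: (top, left, bottom, right)} for 4-connected regions.

    mask: 2D list of 0/1 values. The bottom and right limits are exclusive.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = {}
    next_id = 1
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill of one blob, tracking its extent.
            top, left, bottom, right = y, x, y, x
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes[next_id] = (top, left, bottom + 1, right + 1)
            next_id += 1
    return boxes


scene = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
print(label_blobs(scene))  # {1: (0, 0, 2, 2), 2: (1, 4, 4, 5)}
```

Each bounding box returned this way could then be skeletonized independently, one per identification number.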

As mentioned in [9], the accuracy of background-subtracted videos has a significant impact on object detection and movement tracking. These background-subtracted videos, and the blobs in them that are identified as objects of interest, are used as input for the skeletonization component. Therefore, noise and other limitations present in background subtraction limit the accuracy of the skeleton and of its movement encoding as well. Also, because the moveIt project requires real-time processing, it limits the level of accuracy that can be achieved and prevents the incorporation of off-line training methods in this skeletonization step.

Fig. 9. Skeletonized objects and their bounding boxes

As discussed in this paper, experiments conducted using the proposed method produced highly effective skeletons with an accuracy that was adequate for the sample applications. In summary, the most important feature of this pixel density based skeletonization method is its ability to process videos and produce skeletons in real time without any additional delay.

REFERENCES

[1] C. N. Joseph, S. Kokulakumaran, K. Srijeyanthan, A. Thusyanthan, C. Gunasekara and Dr. C. Gamage, A Framework for Whole-Body Gesture Recognition from Video Feeds, Fifth International Conference on Industrial and Information Systems (ICIIS), Mangalore, India, 2010.

[2] H. Fujiyoshi, Alan J. Lipton and Takeo Kanade, Real-Time Human Motion Analysis by Image Skeletonization, IEICE Trans. Inf. & Syst., Vol. E87-D, 2004.

[3] N. Cornea, D. Silver, and P. Min, Curve-skeleton properties, applications and algorithms, IEEE Transactions on Visualization and Computer Graphics, vol. 13, no. 3, pp. 530-548, 2007.

[4] N. Cornea, D. Silver, X. Yuan and R. Balasubramanian, Computing hierarchical curve-skeletons of 3D objects, The Visual Computer, 21(11):945-955, 2005.

[5] N. D. Cornea, M. F. Demirci, D. Silver, A. Shokoufandeh, S. J. Dickinson, P. B. Kantor, 3D Object Retrieval using Many-to-many Matching of Curve-skeletons, Proc. Shape Modeling International, 2005.

[6] S. Svensson, I. Nystrom and G. Sanniti di Baja, Curve-skeletonization of Surface-like Objects in 3D Images Guided by Voxel Classification, Pattern Recognition Letters, pp. 1419-1426, 2002.

[7] J. Hasthorpe and N. Mount, The generation of river channel skeletons from binary images using raster thinning algorithms, in the website of Geographical Information Science Research Conference, 2007.

[8] Liang Wang, Weiming Hu, Tieniu Tan, Face Tracking Using Motion-Guided Dynamic Template Matching, Proceedings of the Fifth Asian Conference on Computer Vision (ACCV), Vol. II, pp. 448-453, Melbourne, Australia, Jan 22-25, 2002.

[9] C. N. Joseph, S. Kokulakumaran, K. Srijeyanthan, A. Thusyanthan, C. Gunasekara and Dr. C. Gamage, Comparison of background subtraction algorithms on video streams, University of Moratuwa, Sri Lanka, 2010.
