Multiview Layer Fusion Model for Action Recognition Using...

Research ArticleMultiview Layer Fusion Model for Action RecognitionUsing RGBD Images

Pongsagorn Chalearnnetkul and Nikom Suvonvorn

Department of Computer Engineering Faculty of Engineering Prince of Songkla University Hat YaiSongkhla 90110 Thailand

Correspondence should be addressed to Pongsagorn Chalearnnetkul pongsagornchgmailcom

Received 3 January 2018 Revised 27 April 2018 Accepted 20 May 2018 Published 20 June 2018

Academic Editor Pedro Antonio Gutierrez

Copyright copy 2018 Pongsagorn Chalearnnetkul and Nikom SuvonvornThis is an open access article distributed under the CreativeCommons Attribution License which permits unrestricted use distribution and reproduction in any medium provided theoriginal work is properly cited

Vision-based action recognition encounters different challenges in practice including recognition of the subject from anyviewpoint processing of data in real time and offering privacy in a real-world setting Even recognizing profile-based humanactions a subset of vision-based action recognition is a considerable challenge in computer vision which forms the basis for anunderstanding of complex actions activities and behaviors especially in healthcare applications and video surveillance systemsAccordingly we introduce a novel method to construct a layer feature model for a profile-based solution that allows the fusion offeatures for multiview depth images This model enables recognition from several viewpoints with low complexity at a real-timerunning speed of 63 fps for four profile-based actions standingwalking sitting stooping and lying The experiment using theNorthwestern-UCLA 3D dataset resulted in an average precision of 8640 With the i3DPost dataset the experiment achievedan average precision of 9300 With the PSU multiview profile-based action dataset a new dataset for multiple viewpoints whichprovides profile-based action RGBD images built by our group we achieved an average precision of 9931

1 Introduction

Since 2010 action recognition methods have been increas-ingly developed and have been gradually introduced inhealthcare applications especially for monitoring the elderlyAction analysis plays an important role in the investigation ofnormal or abnormal events in daily-life activities In suchapplications privacy and convenience of usage of chosentechnologies are two key factors that must be thoroughlyconsideredThe pattern of recognized actions is an importantfunction of a system for monitoring complex activities andbehaviors which consist of several brief actions constituting alonger-term activity outcome For example a sleeping processinvolves standingwalking sitting and lying actions and afalling process includes all actions mentioned above exceptsitting

Recently two main approaches have been studied andproposed for determining these actions a wearable sensor-based technique and a vision-based technique

Wearable inertial sensor-based devices have been usedextensively in action recognition due to their small sizelow power consumption low cost and the ease with whichthey can be embedded into other portable devices such asmobile phones and smart watches An inertial sensor usedfor performing navigation commonly comprises motion androtation sensors (eg accelerometers and gyroscopes) Itprovides the path of movement viewpoint velocity andacceleration of the tracked subject Some research studieshave used wearable sensors [1ndash3] mobile phones [4ndash7] andsmart watches [8] for recognizing different actions In someresearch the focus was on detection of abnormal actionssuch as falling [9ndash11] or on reporting status for both normaland abnormal situations [12] To recognize complex actionsmoreover several sensors must be embedded at differentpositions on the body The only limitation of inertial sensorsis the inconvenience presented because sensors must even-tually be attached to the body which is uncomfortable andcumbersome

HindawiComputational Intelligence and NeuroscienceVolume 2018 Article ID 9032945 22 pageshttpsdoiorg10115520189032945

2 Computational Intelligence and Neuroscience

For vision-based techniques many studies emphasizeusing either a single-view or multiview approach for recog-nizing human actions

In a single-view approach four types of feature repre-sentation have been used (1) joint-basedskeleton-based (2)motionflow-based (3) space-time volume-based and (4)grid-based

(1) Joint-basedskeleton-based representation definesthe characteristics of human physical structure anddistinguishes its actions for example multilevel ofjoints and parts from posing features [13] the Fishervector using skeletal quads [14] spatial-temporalfeature of joints-mHOG [15] Lie vector space froma 3D skeleton [16] invariant trajectory trackingusing fifteen joints [17] histogram bag-of-skeleton-codewords [18] masked joint trajectories using 3Dskeletons [19] posture features from 3D skeletonjoints with SVM [20] and star skeletons usingHMMsfor missing observations [21] These representationsresult in clear human modeling although the com-plexity of jointskeleton estimation requires goodaccuracy from tracking and prediction

(2) Motionflow-based representation is a global feature-based method using the motion or flow of an objectsuch as invariant motion history volume [22] localdescriptors from optical-flow trajectories [23] KLTmotion-based snippet trajectories [24] Divergence-Curl-Shear descriptors [25] hybrid features usingcontours and optical flow [26] motion history andoptical-flow images [27] multilevel motion sets [28]projection of accumulated motion energy [29] pyra-mid of spatial-temporal motion descriptors [30] andmotion and optical flow with Markov random fieldsfor occlusion estimation [31] These methods do notrequire accurate background subtractions but makeuse of acquired inconstant features that need strategyand descriptors to manage

(3) Volume-based representations are modeled by stacksof silhouettes shapes or surfaces that use severalframes to build a model such as space-time silhou-ettes from shape history volume [32] geometric prop-erties from continuous volume [33] spatial-temporalshapes from 3D point clouds [34] spatial-temporalfeatures of shapelets from 3D binary cube space-time [35] affine invariants with SVM [36] spatial-temporal micro volume using binary silhouettes [37]integral volume of visual-hull and motion historyvolume [38] and saliency volume from luminancecolor and orientation components [39] These meth-ods acquire a detailed model but must deal with highdimensions of features which require accurate humansegmentation without the background

(4) Grid-based representations divide the observationregion of interest into cells a grid or overlappedblocks to encode local features for example a grid orhistogram of oriented rectangles [40] flow descrip-tors from spatial-temporal small cells [41] histogram

of local binary patterns from a spatial grid [42] andrectangular optical-flow grid [43] codeword featuresfor histograms of oriented gradients and histogramsof optical flow [44] 3D interest points within mul-tisize windows [45] histogram of motion gradients[46] and combination of motion history local binarypattern and histogram of oriented gradients [47]This method is simple for feature modeling in thespatial domain but it must deal with some duplicateand insignificant features

Although the four types of representation described in thesingle-view approach are generally good in monitoring alarge area one single camera will lose its ability to determinecontinuous human daily-life actions due to view varianceocclusion obstruction and lost information among othersThus a multiview approach is introduced to lessen thelimitations of a single-view approach

In the multiview approach methods can be categorizedinto 2D and 3D methods

Examples of the 2D methods are layer-based circularrepresentation of humanmodel structure [48] bag-of-visual-words using spatial-temporal interest points for humanmod-eling and classification [49] view-invariant action masksandmovement representation [50] R-transform features [51]silhouette feature space with PCA [52] low-level character-istics of human features [53] combination of optical-flowhistograms and bag-of-interest-point-words using transitionHMMs [54] contour-based and uniform local binary patternwith SVM [55] multifeatures with key poses learning [56]dimension-reduced silhouette contours [57] action mapusing linear discriminant analysis onmultiview action images[58] posture prototype map using self-organizing map withvoting function and Bayesian framework [59] multiviewaction learning using convolutional neural networks withlong short termmemory [60] and multiview action recogni-tion with an autoencoder neural network for learning view-invariant features [61]

Examples of the 3D method where the human modelis reconstructed or modeled from features between viewsare pyramid bag-of-spatial-temporal-descriptors and part-based features with induced multitask learning [62] spatial-temporal logical graphs with descriptor parts [63] temporalshape similarity in 3D video [64] circular FFT featuresfrom convex shapes [65] bag-of-multiple-temporal-self-similar-features [66] circular shift invariance of DFT frommovement [67] and 3D full bodypose dictionary featureswith convolutional neural networks [68] All of these 3Dapproaches attempt to construct a temporal-spatial datamodel that is able to increase the model precision and con-sequently raise the accuracy of the recognition rate

The multiview approach however has some drawbacksThe methods need more cameras and hence are more costlyIt is a more complex approach in terms of installationcamera calibration between viewpoints and model buildingand hence is more time-consuming In actual applicationhowever installation and setup should be simple flexibleand as easy as possible Systems that are calibration-free orautomatically self-calibrating between viewpoints are sought

Computational Intelligence and Neuroscience 3

Multi-view Fusion

Layer Feature FusionAction

Single-view 1

Layer Human Feature ExtractionPreprocessing

Single-view N

Figure 1 Overview of layer fusion model

(a) (b) (c) (d) (e)

Extracted Human Object

Depth profile of Human

Depth Image Human Blob Rejected Arm Parts ofHuman Blob

Figure 2 Preprocessing and human depth profile extraction (a) 8-bit depth image acquired from depth camera (b) motion detected outputfrommixture-based Gaussian model for background subtraction (c) blob position consists of top-left position (Xb Yb) and width (Wb) andheight (Hb) (d) rejected arm parts of human blob (e) depth profile and position of human blob Xh Yh Wh and Hh

One problem facing a person within camera view beit one single camera or a multitude of cameras is that ofprivacy and lighting conditions Vision-based and profile-based techniques involve the use of either RGB or non-RGBThe former poses a serious problem to privacy Monitoringactions in private areas using RGB cameras make thoseunder surveillance feel uncomfortable because the imagesexpose more clearly their physical outlines As for lightingconditions RGB is also susceptible to intensity images oftendeteriorate in dim environments The depth approach helpssolve both problems a coarse depth profile of the subject isadequate for determining actions and depth information canprevent illumination change issues which are a serious prob-lem in real-life applications of round-the-clock surveillanceThe depth approach that is adopted in our research togetherwith a multiview arrangement is considered worthy of morecostly installation than the single-view approach

A gap that needs attention for most multiview non-RGBresults is that of perspective robustness or viewing-orienta-tion stability and model complexity Under a calibration-freesetup our research aims to contribute to the development ofa fusion technique that is robust and simple in evaluatingthe depth profile of human action recognition We havedeveloped a layer fusion model in order to fuse depth profilefeatures frommultiviews and to test our technique on a tripledataset of validation and efficiency The three datasets testedare the Northwestern-UCLA dataset the i3DPost datasetand the PSU dataset for multiview action from variousviewpoints

The following sections detail our model its results andcomparisons

2 Layer Fusion Model

Our layer fusion model is described in three parts (1)preprocessing for image quality improvement (2) humanmodeling and feature extraction using a single-view layerfeature extraction module and (3) fusion of features from anyview into one single model using layer feature fusion moduleand classifying to actions The system overview is shown inFigure 1

21 Preprocessing The objective of preprocessing is to segre-gate human structure from the background and to eliminatearm parts before extracting the features as depicted inFigure 2 In our experiment the structure of the humanin the foreground is extracted from the background byapplying motion detection using mixture of Gaussian modelsegmentation algorithms [69] The extracted motion image(Im) (Figure 2(b)) from the depth image (Id) (Figure 2(a)) isassumed to be the human object HoweverIm still containsnoise from motion detection which has to be reduced bymorphological noise removal operator The depth of theextracted human object defined by its depth values inside theobject is then improved (Imo) as determined by intersectingId and Im using the AND operation Imo = ImampId

The human blob with bounding rectangle coordinatesXhand Yh widthWh and height Hh as shown in Figure 2(c) islocated using a contour approximation technique Howeverour action technique emphasizes only the profile structureof the figure while hands and arms are excluded as seen inFigure 2(d) and the obtained structure is defined as the figuredepth profile (Imo) (Figure 2(e)) in further recognition steps

View2 Layer Human Feature Extraction (Side-view)

View1 Layer Human Feature Extraction (Front-view)

Multi-view Layer-based

Feature Fusion

LＰ1[2]

LＰ1[1]

LＰ1[0]

LＰ1[-1]

LＰ1[-2]

LＰ2[2]

LＰ2[1]

LＰ2[0]

LＰ2[-1]

LＰ2[-2]

View2 Layer Human Feature Extraction (Slant-view)

Feature Fusion

LＰ1[2]

LＰ1[1]

LＰ1[0]

LＰ1[-1]

LＰ1[-2]

LＰ2[2]

LＰ2[1]

LＰ2[0]

LＰ2[-1]

LＰ2[-2]

Figure 3 Layered human model for multiview fusion (a) sampled layer model for standing (b) sampled layer model for sitting

22 Layer Human Feature Extraction We model the depthprofile of a human in a layered manner that allows extractionof specific features depending on the height level of thehuman structure which possesses different physical char-acteristics The depth profile is divided vertically into odd-numbered layers (eg 5 layers as shown in Figure 3) witha specific size regardless of the distance perspective andviews which would allow features of the same layer from allviews to be fused into reliable features

The human object in the bounding rectangle is dividedinto equal layers to represent features at different levels ofthe structure L[k] mdash k isin -N -N+1 -N+20 1 2 N-2N-1 119873 The total number of layers is 2N+1 where N is themaximum number of upper or lower layers For example inFigure 3 the human structure is divided into five layers (Nequals 2) twoupper two lower and one center thus the layersconsist of 119871[-2] L[-1] L[0] L[+1] and L[+2]The horizontalboundaries of all layers the red vertical lines are defined bythe left and right boundaries of the human objectThe verticalboundaries of each layer shown as yellow horizontal lines aredefined by the top 119910119879[119896] and the bottom 119910119861[119896] values that canbe computed as follows

119910119879 [119896] = 119867ℎ (119896 + 119873)2119873 + 1 + 1 (1)

119910119861 [119896] = 119867ℎ (119896 + 119873 + 1)2119873 + 1 (2)

The region of interest in layer 119896 is defined as 119909 = 0 to Wh and119910 = 119910119879[119896] to 119910119861[119896]According to the model features from the depth profile

human object can be computed along with layers as concate-nated features of every segment using basic and statisticalproperties (eg axis density depth width and area) Depthcan also be distinguished for specific characteristic of actions

In our model we define two main features for each layerincluding the density (120588[119896]) andweighted depth density (Z[119896])

of layers In addition the proportion value (119875v) is defined as aglobal feature to handle the horizontal action

221 Density of Layer The density of layer (120588[119896]) indicatesthe amount of object at a specific layer which varies dis-tinctively according to actions and can be computed as thenumber of white pixels in the layer as shown in the followingequation

1205881015840 [119896] = 119910119861[119896]sum119910=119910119879[119896]

119882ℎsum119909=0

119868119898 (119909 119910) (3)

In multiple views different distances between the object andthe cameras would affect 1205881015840[119896] Objects close to a cameracertainly appear larger than when further away and thus1205881015840[119896] must be normalized in order for these to be fused Weuse themaximumvalue of the perceived object andnormalizeit employing the following equation

120588 [119896] = 1205881015840 [119896]argmax (1205881015840 [119896]) (4)

222 Weighted Depth Density of Layer An inverse depthdensity is additionally introduced to improve the pattern ofdensity feature The procedure is comprised of two partsinverse depth extraction and weighting for density of thelayer

At the outset depth extraction is applied to the layerprofileThe depth profile reveals the surface of the object thatindicates rough structure ranging from 0 to 255 or from nearto far distances from the camera According to perspectiveprojection varying in a polynomial form a depth value at anear distance for example from 4 to 5 has a much smallerreal distance than a depth value at a far distance for examplefrom 250 to 251 The real-range depth of the layer (D1015840[119896])translates the property of 2D depth values to real 3D depthvalues in centimeters The real-range depth better distin-guishes the depth between layers of the objectmdashdifferent

parts of the human bodymdashand increases the ability to classifyactions A polynomial regression [70] has been used toconvert the depth value to real depth as described in thefollowing equation

1198911015840119888 (119909) = 12512119890minus101199096 minus 10370119890minus71199095 + 35014119890minus51199094minus 000611199093 + 057751199092 minus 276342119909+ 567591198902

In addition the (D1015840[119896]) value of each layer is represented bythe converted value of every depth value averaged in thatlayer as defined in the following equation

D1015840 [119896] = 1198911015840119888 (sum119910119861[119896]

119910=119910119879[119896]sum119882ℎ119909=0 119868119898119900 (119909 119910)120588 [119896] ) (6)

Numerically to be able to compare the depth profile of humanstructure from any point of view the D1015840[119896] value needs tobe normalized using its maximum value over all layers byemploying the following equation

D [119896] = D1015840 [119896]argmax (D1015840 [119896]) (7)

In the next step we apply the inverse real-range depth (D119894[119896])hereafter referred to as the inverse depth for weighting thedensity of the layer in order to enhance the feature 120588[119896] thatincreases the probability of classification for certain actionssuch as sitting and stooping We establish the inverse depthas described in (8) to measure the hidden volume of bodystructure which distinguishes particular actions from othersfor example in Table 1 in viewing stooping from the frontthe upper body is hidden but the value of D119894[119896] can reveal thevolume of this zone and in viewing sitting from the front thedepth of the thigh will reveal the hidden volume compared toother parts of the body

D119894 [119896] = ( D [119896] minus argmax (D [119896])argmin (D [119896]) minus argmax (D [119896])) (8)

The inverse depth density of layers (Q[119896]) in (9) is defined asthe product of the inverse depth (D119894[119896]) and the density of thelayer (120588[119896])

Q [119896] = (D119894 [119896]) (120588 [119896]) (9)

We use learning rate 120572 as an adjustable parameter that allowsbalance between Q[119896] and 120588[119896] In this configuration whenthe pattern of normalized inverse depth Di[k] is close to zeroQ[119896] is close to 120588[119896] Equation (10) is the weighted depthdensity of layers (Z[119896]) adjusted by the learning rate on Q[119896]and 120588[119896]

Ζ [119896] = (1 minus 120572) (Q [119896]) + (120572) (120588 [119896]) (10)

As can be deduced from Table 2 the weighted depth densityof layers (Z[119896]) improves the feature pattern for better

differentiation for 13 out of 20 features is the same for 5features (I through V on standing-walking) and is worsefor 2 features (VIII on side-view sitting and X on back-viewsitting)Thus Z[119896] is generally very useful though the similarandworse outcomeswould require a furthermultiview fusionprocess to distinguish the pattern

223 Proportion Value Proportion value (119875v) is a penaltyparameter of the model to indicate roughly the proportion ofobject vertical actions distinguished from horizontal actionsIt is the ratio of the width119882ℎ and the height119867ℎ of the objectin each view (see the following equation)

119875v = 119882ℎ119867ℎ (11)

Table 1 mentioned earlier shows the depth profile of actionand its features in each view In general the feature patternsof each action are mostly similar though they do exhibitsome differences depending on the viewpoints It should benoted here that the camera(s) are positioned at 2 m above thefloor pointing down at an angle of 30∘ from the horizontalline

Four actions in Table 1 standingwalking sitting stoop-ing and lying shall be elaborated here For standingwalkingthe features are invariant to viewpoints for both density (120588[119896])and inverse depth (D119894[119896]) The D[119896] values slope equallyfor every viewpoint due to the position of the camera(s)For sitting the features vary for 120588[119896] and D[119896] according totheir viewpoints However the patterns of sitting for frontand slant views are rather similar to standingwalking D[119896]from some viewpoints indicates the hidden volume of thethigh For stooping the 120588[119896] patterns are quite the samefrom most viewpoints except for the front view and backview due to occlusion of the upper body However D[119896]reveals clearly the volume in stooping For lying the 120588[119896]patterns vary depending on the viewpoints and cannot bedistinguished using layer-based features In this particularcase the proportion value (119875v) is introduced to help identifythe action

23 Layer-Based Feature Fusion In this section we empha-size the fusion of features from various views The pattern offeatures can vary or self-overlapwith respect to the viewpointIn a single view this problem leads to similar and unclearfeatures between actions Accordingly we have introduced amethod for the fusion of features frommultiviews to improveaction recognition From each viewpoint three features ofaction are extracted (1) density of layer (120588[119896]) (2) weighteddepth density of layers (Z[119896]) and (3) proportion value (119875v)We have established two fused features as the combinationof width area and volume of the body structure from everyview with respect to layers These two fused features arethe mass of dimension (120596[119896]) and the weighted mass ofdimension (120596[119896])

(a) The mass of dimension feature (120596[119896]) is computedfrom the product of density of layers (120588[119896]) in every

Table1Sampled

ledepthin

usview

120588[119896]

D[119896]

D119894[119896]

120588[119896]

D[119896]

D119894[119896]

Standing

-walking

Standing

-walking

Standing

-walking

Standing

-walking

Standing

-walking

Sitting

tLyingfro

Sitting

Lyingsla

Table1Con

tinued

120588[119896]

D[119896]

D119894[119896]

120588[119896]

D[119896]

D119894[119896]

Sitting

Lyingsid

Sitting

Lyingback

Sitting

Lyingback

Note(I)120588[119896]isdensity

oflayer(II)D[119896]isreal-range

depthof

layer(III)D119894[119896]isinversed

Table 2 Sampled features in single view and fused features in multiview

Actionview Camera 1 Camera 2 120596[119896]Imo Ζ [119896] | 120572 = 09 Imo Ζ [119896] | 120572 = 09

(I) Standing-walkingfront

(II) Standing-walkingslant

(III) Standing-walkingside

(IV) Standing-walkingback slant

(V) Standing-walkingback

(VI) Sittingfront

(VII) Sittingslant

(VIII) Sittingside

(IX) Sittingback slant

(X) Sittingback

(XI) Stoopingfront

(XII) Stoopingslant

(XIII) Stoopingside

(XIV) Stoopingbackslant

(XV) Stoopingback

Table 2 Continued

(XVI) Lyingfront

(XVII) Lyingslant

(XVIII) Lyingside

(XIX) Lyingbackslant

(XX) Lyingback

Note (I) Z[119896] is weighted depth density of layers which uses 120572 for adjustment between Q[119896] and 120588[119896] (II) 120596[119896] fused features from multiview called theweighted mass of dimension feature

view from ] = 1 (the first view) to ] = 119889 (the last view)as shown in the following equation

120596 [119896] = 119889prodV=1(120588(V=119894) [119896]) (12)

(b) Theweightedmass of dimension feature (120596[119896]) in (13)is defined as the product of theweighted depth densityof layers (Z[119896]) from every view

120596 [119896] = 119889prodV=1(Ζ(V=119894) [119896]) (13)

In addition the maximum proportion value is selected fromall views (see (14)) which is used to make feature vector

119875m = argmax (119875v(v = 1) 119875v(v = 2)119875v(v = 3) 119875v(v = d)) (14)

Two feature vectors a nondepth feature vector and a depthfeature vector are now defined for use in the classificationprocess

The nondepth feature vector is formed by concatenatingthe mass of dimension features (120596[119896]) and the maximumproportion values as follows

120596 [minus119873] 120596 [minus119873 + 1] 120596 [minus119873 + 2] 120596 [0] 120596 [1] 120596 [2] 120596 [119873] 119875119898 (15)

The depth feature vector is formed by the weighted mass ofdimension features (120596[119896]) concatenated with the maximumproportion values (119875m) as follows120596 [minus119873] 120596 [minus119873 + 1] 120596 [minus119873 + 2] 120596 [0] 120596 [1] 120596 [2] 120596 [119873] 119875119898 (16)

Table 2 shows the weighted mass of dimension feature (120596[119896])fused from the weighted depth density of layers (Z[119896]) fromtwo camerasThe fused patterns for standing-walking in eachview are very similar For sitting generally the patterns aremore or less similar except in the back view due to the lack ofleg images The classifier however can differentiate postureusing the thigh part though there are some variations in theupper body For stooping the fused patterns are consistentin all views heap in the upper part but slightly differentin varying degrees of curvature For lying all fused featurepatterns are differentThe depth profiles particularly in frontviews affect the feature adjustment However the classifiercan still distinguish appearances because generally patternsof features in the upper layers are shorter than those in thelower layers

3 Experimental Results

Experiments to test the performance of our method wereperformed on three datasets the PSU (Prince of SongklaUni-versity) dataset the NW-UCLA (Northwestern-University ofCalifornia at Los Angeles) dataset and the i3DPost datasetWe use the PSU dataset to estimate the optimal parame-ters in our model such as the number of layers and theadjustable parameter 120572 The tests are performed on singleand multiviews angles between cameras and classificationmethods Subsequently our method is tested using the NW-UCLA dataset and the i3DPost dataset which is set up fromdifferent viewpoints and angles between cameras to evaluatethe robustness of our model

31 Experiments on the PSU Dataset The PSU dataset [76]contains 328 video clips of human profiles with four basicactions recorded in two views using RGBD cameras (Kinect

315255

Back 180 0 Front

(a) Viewpoints in the work room scenario

315255

Back 180 Front

(b) Viewpoints in the living room scenario

Figure 4 Example of two multiview scenarios of profile-based action for the PSU dataset

~ 4-5m

Figure 5 General camera installation and setup

ver1)The videos were simultaneously captured and synchro-nized between views The profile-based actions consisted ofstandingwalking sitting stooping and lying Two scenariosone in a work room for training and another in a living roomfor testing were performed Figure 4 shows an example ofeach scenario together with the viewpoints covered

Two Kinect cameras were set overhead at 60∘ to thevertical line each at the end of a 2 m pole RGB and depthinformation from multiviews was taken from the stationarycameras with varying viewpoints to observe the areas ofinterest The operational range was about 3-55 m from thecameras to accommodate a full body image as illustratedin Figure 5 The RGB resolution for the video dataset was640times480 while depths at 8 and 24 bits were also of thesame resolution Each sequence was performed by 3-5 actorshaving no less than 40 frames of background at the beginningto allow motion detection using any chosen backgroundsubtraction technique The frame rate was about 8-12 fps

(i) Scenario in the Work Room (Training Set) As illustratedin Figure 6(a) the two camerasrsquo views are perpendicular toeach other There are five angles of object orientation front(0∘) slant (45∘) side (90∘) rear-slant (135∘) and rear (180∘) asshown in Figure 6(b) A total of 8700 frames were obtainedin this scenario for training

(ii) Scenario in the Living Room (Testing Set) This scenarioillustrated in Figure 7 involves one moving Kinect cameraat four angles 30∘ 45∘ 60∘ and 90∘ while another Kinectcamera remains stationary Actions are performed freely invarious directions and positions within the area of interest Atotal of 10720 frames of actions were tested

311 Evaluation of the Number of Layers We determine theappropriate number of layers by testing our model withdifferent numbers of layers (L) using the PSU dataset Thenumbers of layers for testing are 3 5 7 9 11 13 15 17 and19 The alpha value (120572) is instinctively fixed at 07 to findthe optimal value of the number of layers Two classificationmethods are used for training and testing an artificial neuralnetwork (ANN) with a back-propagation algorithm over 20nodes of hidden layers and a support vector machine (SVM)using a radial basis function kernel along with C-SVC Theresults using ANN and SVM are shown in Figures 8 and 9respectively

Figure 8 shows that the 3-layer size achieves the highestaverage precision of 9488 using the ANN and achieves9211 using the SVM in Figure 9 Because the ANN per-forms better in the tests that follow the evaluation of our

Overlapping View (Area of Interest)

Kinect 2

Kinect 1

~ 4-5m

Rear (180∘)

Rear-slant (135∘)

Side (90∘)

Slant (45∘)

Front (0∘)

Figure 6 Scenario in the work room (a) Installation and scenario setup from top view (b) Orientation angle to object

Kinect 2

Kinect 1 (Fixed)

~ 4-5m

Kinect 2

Figure 7 Installation and scenario of the living room setup (fourpositions of Kinect camera 2 at 30∘ 45∘ 60∘ and 90∘ and stationaryKinect camera 1)

model will be based on this layer size together with the useof the ANN classifier

312 Evaluation of Adjustable Parameter 120572 For the weightedmass of dimension feature (120596[119896]) the adjustment parameteralpha (120572)mdashthe value for the weight between the inversedepth density of layers (Q[119896]) and the density of layer(120588[119896])mdashis employed The optimal 120572 value is determined toshow the improved feature that uses inverse depth to revealhidden volume in some parts and the normal volume Theexperiment is carried out by varying alpha from 0 to 1 at 01intervals

Figure 10 shows the precision of action recognition usinga 3-layer size and the ANN classifier versus the alphavalues In general except for the sitting action one may

3 5 7 9 11 13 15 17 19Layer Size

) L = 3

AverageMaximum of AverageStandingwalking

SittingStoopingLying

Figure 8 Precision by layer size for the four postures using anartificial neural network (ANN) on the PSU dataset

note that as the portion of inverse depth density of layers(Q[119896]) is augmented precision increasesThe highest averageprecision is 9532 at 120572 = 09 meaning that 90 of theinverse depth density of layers (Q[119896]) and 10 of the densityof layers (120588[119896]) are an optimal proportion When 120572 isabove 09 all precision values drop The trend for sittingaction is remarkably different from others in that precisionalways hovers near the maximum and gradually but slightlydecreases when 120572 increases

Figure 11 illustrates the multiview confusion matrix ofaction precisions when L = 3 and 120572 = 09 using the PSUdataset We found that standingwalking action had thehighest precision (9931) while lying only reached 9065The classification error of the lying action depends mostlyon its characteristic that the principal axis of the body is

3 5 7 9 11 13 15 17 19Layer Size

Figure 9 Precision by layer size for the four postures using supportvector machine (SVM) on the PSU dataset

0 02 04 06 08 165

Alpha ()

Figure 10 Precision versus 120572 (the adjustment parameter forweighting betweenQ[119896] and 120588[119896]when L= 3 from the PSUdataset)

aligned horizontally which works against the feature modelIn general the missed classifications of standingwalkingstooping and lying actions were mostly confused with thesitting action accounting for 069 482 and 883 ofclassifications respectively Nevertheless the precision ofsitting is relatively high at 9859

313 Comparison of Single ViewMultiview For the sake ofcomparison we also evaluate tests in single-view recognitionfor the PSU dataset for L = 3 and 120572 = 09 in the livingroom scene similar to that used to train the classificationmodel for the work room Figure 12 shows the results from

StandingWalking Sitting Stooping Lying

StandingWalking

Sitting

Stooping

Figure 11 Confusion matrix for multiview recognition in the PSUdataset when L = 3 and 120572 = 09

StandingWalking

Sitting

Stooping

Figure 12 Confusion matrix of single-view Kinect 1 recognition(stationary camera) for the PSU dataset when L = 3 and 120572 = 09

the single-view Kinect 1 (stationary camera) while Figure 13shows those from the single-view Kinect 2 (moving camera)

Results show that the single-view Kinect 1 which is sta-tionary performs slightly better than the single-view Kinect2 which is moving (average precision of 9250 comparedto 9063) The stationary camera gives the best results forsitting action while for the moving camera the result is bestfor the standingwalking action It is worth noting that thestationary camera yields a remarkably better result for lyingaction than the moving one

Figure 14 shows the precision of each of the four posturestogether with that of the average for the multiview and thetwo single views On average the result is best accomplishedwith the use of the multiview and is better for all posturesother than the lying action In this regard single-view 1 yieldsa slightly better result most probably due to its stationaryviewpoint toward the sofa which is perpendicular to its lineof sight

314 Comparison of Angle between Cameras As depictedearlier in Figure 7 the single-view Kinect 1 camera is sta-tionary while the single-view Kinect 2 camera is movableadjusted to capture viewpoints at 30∘ 45∘ 60∘ and 90∘ to thestationary cameraWe test ourmodel on these angles to assess

StandingWalking

Sitting

Stooping

Figure 13 Confusion matrix of single-view Kinect 2 recognition(moving camera) for the PSU dataset when L = 3 and 120572 = 09

Multi-ViewSingle-view 1Single-view 2

Figure 14 Comparison of precision from multiview single-view 1and single-view 2 for the PSU dataset when L = 3 and 120572 = 09

the robustness of the model on the four postures Results areshown in Figure 15

In Figure 15 the lowest average precision result occursat 30∘mdashthe smallest angle configuration between the twocameras This is most probably because the angle is narrowand thus not much additional information is gathered Forall other angles the results are closely clustered In generalstandingwalking and sitting results are quite consistent atall angles while lying and stooping are more affected by thechange

315 EvaluationwithNW-UCLATrainedModel In additionwe tested the PSU dataset in living room scenes using themodel trained on the NW-UCLA dataset [63] Figure 16illustrates that the 9-layer size achieves the highest averageprecision at 9344 Sitting gives the best results and the

90∘45

∘30∘

Figure 15 Precision comparison graph on different angles in eachaction

3 5 7 9 11 13 15 17 19Layer Sizes

SittingStooping

Figure 16 Precision by layer size for the PSU dataset when usingthe NW-UCLA-trained model

highest precision up to 9874 when L = 17 However sittingat low layers also gives good results for example L = 3at 9718 while the highest precision for standingwalkingis 9540 and the lowest precision is 8516 The lowestprecision for stooping is 9208 when L = 5

Figure 17 shows the results when L = 9 which illustratesthat standingwalking gives the highest precision at 9632while bending gives the lowest precision at 8863

In addition we also compare precision of different anglesbetween cameras as shown in Figure 18The result shows thatthe highest precision on average is 9474 at 45∘ and thelowest is 8969 at 90∘

Table 3 Time consumption testing on different numbers of layers and different classifiers

ClassifierTime consumption (ms)frame Frame rate (fps)

L=3 L=11 L=19 L=3 L=11 L=19Min Max Avg Min Max Avg Min Max Avg Avg Avg Avg

ANN 1395 1777 1508 1412 2058 1518 1443 2002 1585 6631 6588 6309SVM 1401 1740 1520 1432 1897 1546 1434 1964 1571 6579 6468 6365Note L stands for the number of layers ANN artificial neural network SVM support vector machine

StandingWalking Sitting Stooping

StandingWalking

Sitting

Stooping

Figure 17 Confusion matrix of 2 views for the PSU dataset (usingthe NW-UCLA-trained model) when L = 9

In general the precision of all actions is highest at 45∘ anddecreases as the angle becomes larger However the resultsobtained using the PSU-trainedmodel show a different trendwhere a larger angle provides better results

316 Evaluation of Time Consumption Time consumptionevaluation excluding interface and video showing time isconducted using theOpenMPwall clockOur system is testedon a normal PC (Intel Core i5 4590 at 330 GHz with 8GB DDR3) We use the OpenCV library for computer visionthe OpenMP library for parallel processing and CLNUI tocapture images from the RGBD cameras The number oflayers and the classifier are tested using 10720 action framesfrom the living room scene

On average the time consumption is found to be approx-imately 15 ms per frame or a frame rate of around 63 fpsAs detailed in Table 3 the number of layers and the typeof classifier affect the performance only slightly In additionwe compare serial processing with parallel processing whichdivides a single process into threads The latter is found to be15507 times faster than the former It was noted that threadinitialization and synchronization consume a portion of thecomputation time

32 Experiment on the NW-UCLA Dataset The NW-UCLAdataset [63] is used to benchmark our methodThis dataset issimilar to our work for the multiview action 3D PSU datasettaken at different viewpoints to capture the RGB and depthimages The NW-UCLA dataset covers nearly ten actionsincluding stand upwalk around sit down and bend to pick upitem but it lacks a lie down action Actions in this dataset are

90∘45

∘30∘

Figure 18 Precision comparison graph of different angles in eachaction when using the NW-UCLA-trained model

marked in colors To test our method the motion detectionstep for segmenting movement is replaced by a specific colorsegmentation to obtain the human structure The rest of theprocedure stays the same

Only actions of interest are selected for the testmdashits tran-sition actions are excluded For example standingwalkingframes are extracted from stand up and walk around sittingframes are selected from sit down and stand up and stoopingis extracted from pick up with one hand and pick up with bothhands

Our method employs the model learned using the PSUdataset in the work room scenario to test the NW-UCLAdataset All parameters in the test are the same except thatalpha is set to zero due to variations in depth values Theexperiment is performed for various numbers of layers fromL= 3 to L = 19 Test results on the NW-UCLA dataset are shownin Figure 19

From Figure 19 the maximum average precision(8640) of the NW-UCLA dataset is obtained at layer L=11 in contrast to the PSU dataset at L = 3 Performance forstooping is generally better than other actions and peaksat 9560 As detailed in Figure 20 standingwalking givesthe lowest precision at 768 The principal cause of low

Table 4 Comparison between NW-UCLA and our recognition systems on the NW-UCLA dataset

NW-UCLA recognition action [63] Precision () Our recognition action Precision ()Walk around 7760 Walkingstanding 7680Sit down 6810 Sitting 8680Bend to pick item (1 hand) 7450 Stooping 9560Average 7340 Average 8640

3 5 7 9 11 13 15 17 19Layer Size

L = 11

SittingStooping

Figure 19 Precision of action by layer size on NW-UCLA dataset

StandingWalking

Sitting

Stooping

Figure 20 Confusion matrix of test results on the NW-UCLA 3Ddataset when L = 11

precision is that the angle of the camera and its capturedrange are very different from the PSU dataset Comparedwith the method proposed by NW-UCLA our methodperforms better by up to 13 percentage points from anaverage of 7340 to 8640 as shown in Table 4 However

3 5 7 9 11 13 15 17 19Layer Size

SittingStooping

Figure 21 Precision of i3DPost dataset by layer size when using thePSU-trained model

to be fair many more activities are considered by the NW-UCLA than ours which focuses only on four basic actionsand hence its disadvantage by comparison

33 Experiment on the i3DPost Dataset The i3DPost dataset[77] is an RGB multiview dataset that contains 13 activitiesThe dataset was captured by 8 cameras from different view-points with 45∘ between cameras performed by eight personsfor two sets at different positions The background imagesallow the same segmentation procedure to build the nondepthfeature vector for recognizing profile-based action

The testing sets extract only target actions from temporalactivities including standingwalking sitting and stoopingfrom sit-standup walk walk-sit and bend

331 Evaluation of i3DPost Dataset with PSU-Trained ModelFirstly i3DPost is tested by the PSU-trained model from 2views at 90∘ between cameras on different layer sizes

Figure 21 shows the testing results of the i3DPost datasetusing the PSU-trainedmodelThe failed prediction for sittingshows a precision of only 2808 at L = 9 By observation themistake is generally caused by the action of sitting that lookslike a squat in the air which is predicted as standing In thePSU dataset sitting is done on a benchchair On the otherhand standingwalking and stooping are performed withgood results of about 9640 and 100 at L = 11 respectively

StandingWalking

Sitting

Stooping

Figure 22 Confusion matrix of 2 views for i3DPost dataset usingPSU-trained model when L = 9

3 5 7 9 11 13 15 17 19Layer Size

L = 17

SittingStooping

Figure 23 Precision of i3DPost dataset with new trained model bylayer size

Figure 22 shows the multiview confusion matrix whenL = 9 Sitting is most often confused with standingwalking(6918 of cases) and standingwalking is confused withstooping (1200 of cases)

332 Training and Evaluation Using i3DPost From the lastsection i3DPost evaluation using the PSU-trained modelresulted inmissed classification for the sitting action Accord-ingly we experimented with our model by training andevaluating using only the i3DPost the first dataset is usedfor testing and the second for training The initial testing isperformed in 2 views

Figures 23 and 24 show the results of each layer size from2viewsThe 17 layers achieve the highest precision at 9300onaverage (9828 8103 and 9968 for standingwalkingsitting and stooping resp) In general standingwalking andstooping achieve good precision of above 90 except at L =3 However the best precision for sitting is only 8103 wheremost wrong classifications are defined as standingwalking

StandingWalking

Sitting

Stooping

Figure 24 Confusion matrix of test result on the 2-view i3DPostdataset when L = 17

135∘

Figure 25 Precision comparison graph for different angles in thei3DPost

(1890 of cases) and the lowest precision is at 4159 whenL = 5 We noticed that the squat still affects performance

In 2-view testing we couple the views for different anglesbetween cameras such as 45∘ 90∘ and 135∘ Figure 25 showsthat at 135∘ the performance on average is highest and thelowest performance is at 45∘ In general a smaller angle giveslower precision however for sitting a narrow angle mayreduce precision dramatically

In addition we perform the multiview experiments forvarious numbers of views from one to six in order to evaluateour multiview model as shown in Figure 26

Figure 26 shows the precision according to the number ofviews The graph reports the maximum precision versus thedifferent number of views from one to six views which are8903 9300 9133 9230 9256 and 9103 at L = 7

Table 5 Comparison between our recognition systems and [59] on the i3DPost dataset

Recognition of similar approach [59] Precision () Our recognition action Precision ()Walk 9500 Walkingstanding 9828Sit 8700 Sitting 8103Bend 1000 Stooping 9968Average 9400 Average 9300

Table 6 Precision results of our study and others emphasizing profile-based action recognition using different methods and datasets

Method Precision rate of action ()Standing Walking Sitting Stooping Lying Average

P Chawalitsittikul et al [70] 9800 9800 9300 9410 9800 9622N Noorit et al [71] 9941 8065 8926 9435 1000 9273M Ahmad et al [72] - 8900 8500 1000 9100 9125C H Chuang et al [73] - 9240 9760 9540 - 9580G I Parisi et al [74] 9667 9000 8333 - 8667 8917N Sawant et al [75] 9185 9614 8503 - - 9101Our method on PSU dataset 9931 9931 9859 9272 9065 9532

1 2 3 4 5 6Number of views

50556065707580859095

AverageStandingwalking

SittingStooping

Figure 26 Precision in i3DPost dataset with new trained model bythe number of views

L = 17 L = 17 L = 7 and L= 13 respectivelyWenoticed that thehighest precision is for 2 views In general the performanceincreases when the number of views increases except forsitting Moreover the number of layers that give maximumprecision is reduced as the number of views increases Inconclusion only two or three views from different angles arenecessary for obtaining the best performance

We compared our method with a similar approach [59]based on a posture prototypemap and results voting functionwith a Bayesian framework for multiview fusion Table 5shows the comparison results The highest precisions of ourmethod and the comparison approach are 9968 and 100for stooping and bend respectively Likewise the lowest pre-cisions are 8103 and 8700 for the same actions Howeverfor walkingstanding our approach obtains better results On

average the comparison approach performs slightly betterthan our method

4 Comparison with Other Studies

Our study has now been presented with other visual profile-based action recognition studies and also compared withother general action recognition studies

41 Precision Results of Different Studies Table 6 shows pre-cision results of various methods emphasizing profile-basedaction recognition Note that themethods employed differentalgorithms and datasets thus results are presented only forthe sake of studying their trends with respect to actions Ourmethod is tested on the PSU dataset Our precision is higheston walking and sitting actions quite good on the standingaction quite acceptable on the lying action but poor on thestooping action It is not possible to compare on precision dueto lack of information on the performance of other methodsHowever we can note that eachmethod performs better thanothers on different actions for example [74] is good forstanding and [73] is a better fit for sitting

42 Comparison with Other Action Recognition StudiesTable 7 compares the advantages and disadvantages of someprevious studies concerning action recognition acquiredaccording to the criteria of their specific applications In theinertial sensor-based approach sensorsdevices are attachedto the body and hencemonitoring is available everywhere forthe inconvenience of carrying them around This approachgives high privacy but is highly complex In an RGB vision-based approach sensors are not attached to the body andhence it is less cumbersome Though not so complex themain drawback of this method is the lack of privacy Inthis regard depth-based views may provide more privacyAlthough rather similar in comparison in the table themultiview approach can cope with some limitations such

Table7Com

ctionrecogn

approaches

basedon

ofuseperspectiver

obustnessandreal-w

applicationcond

Approach

Flexible

scalability

Perspective

stness

Real-w

ApplicationCon

plexity

Cumbersom

sensordevice

attachment

Privacy

Inertia

lsensor-based

[1ndash12]

Highly

complex

Cumbersom

n-based

[131721ndash

4771ndash7375]

Simple-

cumbersom

eLacking

Depth-based

le-view

[14ndash1618ndash

207074]

Depends

cumbersom

erate-

RGBdepth-

multiv

[48ndash68]

Depends

cumbersom

eDepends

e-high

as vision coverage and continuity or obstruction which arecommon in the single-view approach as described in somedetail in Introduction

(i) As for our proposed work it can be seen from thetable that the approach is in general quite good incomparison in addition to being simple it offers ahigh degree of privacy and other characteristics suchas flexibility scalability and robustness are at similarlevels if not better and no calibration is neededwhich is similar to most of the other approachesHowever the two cameras for our multiview ap-proach still have to be installed following certainspecifications such as the fact that the angle betweencameras should be more than 30∘

5 ConclusionSummary and Further Work

In this paper we explore both the inertial and visual ap-proaches to camera surveillance in a confined space suchas a healthcare home The less cumbersome vision-basedapproach is studied in further detail for single-view andmultiview RGB depth-based and for non-RGB depth-basedapproaches We decided on a multiview non-RGB depth-based approach for privacy and have proposed a layer fusionmodel which is representation-based model that allowsfusion of features and information by segmenting parts of theobject into vertical layers

We trained and tested our model on the PSU datasetwith four postures (standingwalking sitting stooping andlying) and have evaluated the outcomes using the NW-UCLAand i3DPost datasets on available postures that could beextracted Results show that our model achieved an averageprecision of 9532 on the PSU datasetmdashon a par with manyother achievements if not better 9300 on the i3DPostdataset and 8640 on the NW-UCLA dataset In additionto flexibility of installation scalability and noncalibration offeatures one advantage over most other approaches is thatour approach is simple while it contributes good recognitionon various viewpoints with a high speed of 63 frames persecond suitable for application in a real-world setting

In further research a spatial-temporal feature is inter-esting and should be investigated for more complex actionrecognition such as waving and kicking Moreover recon-struction of 3D structural and bag-of-visual-word models isalso of interest

Data Availability

The PSU multiview profile-based action dataset is availableat [74] which is authorized only for noncommercial oreducational purposes The additional datasets to support thisstudy are cited at relevant places within the text as references[63] and [75]

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

This work was partially supported by the 2015 GraduateSchool Thesis Grant Prince of Songkla University (PSU)andThailand Center of Excellence for Life Sciences (TCELS)and the authors wish to express their sincere appreciationThe first author is also grateful for the scholarship fundinggranted for him from Rajamangala University of TechnologySrivijaya to undertake his postgraduate study and he wouldlike to express gratitude to Assistant Professor Dr NikomSuvonvorn for his guidance and advicesThanks are extendedto all members of the Machine Vision Laboratory PSUDepartment of Computer Engineering for the sharing oftheir time in the making of the PSU profile-based actiondataset Last but not least specials thanks are due to MrWiwat Sutiwipakorn a former lecturer at the PSU Fac-ulty of Engineering regarding the use of language in themanuscript

References

[1] Z He and X Bai ldquoA wearable wireless body area networkfor human activity recognitionrdquo in Proceedings of the 6thInternational Conference on Ubiquitous and Future NetworksICUFN 2014 pp 115ndash119 China July 2014

[2] C Zhu and W Sheng ldquoWearable sensor-based hand gestureand daily activity recognition for robot-assisted livingrdquo IEEETransactions on Systems Man and Cybernetics Systems vol 41no 3 pp 569ndash573 2011

[3] J-S Sheu G-S Huang W-C Jheng and C-H Hsiao ldquoDesignand implementation of a three-dimensional pedometer accu-mulating walking or jogging motionsrdquo in Proceedings of the 2ndInternational Symposium on Computer Consumer and ControlIS3C 2014 pp 828ndash831 Taiwan June 2014

[4] R Samiei-Zonouz H Memarzadeh-Tehran and R RahmanildquoSmartphone-centric human posture monitoring systemrdquo inProceedings of the 2014 IEEE Canada International Humanitar-ian Technology Conference IHTC 2014 pp 1ndash4 Canada June2014

[5] X YinW Shen J Samarabandu andXWang ldquoHuman activitydetection based on multiple smart phone sensors and machinelearning algorithmsrdquo in Proceedings of the 19th IEEE Interna-tional Conference on Computer Supported Cooperative Work inDesign CSCWD 2015 pp 582ndash587 Italy May 2015

[6] C A Siebra B A Sa T B Gouveia F Q Silva and A L SantosldquoA neural network based application for remote monitoringof human behaviourrdquo in Proceedings of the 2015 InternationalConference on Computer Vision and Image Analysis Applications(ICCVIA) pp 1ndash6 Sousse Tunisia January 2015

[7] C Pham ldquoMobiRAR real-time human activity recognitionusing mobile devicesrdquo in Proceedings of the 7th IEEE Interna-tional Conference on Knowledge and Systems Engineering KSE2015 pp 144ndash149 Vietnam October 2015

[8] G M Weiss J L Timko C M Gallagher K Yoneda and A JSchreiber ldquoSmartwatch-based activity recognition A machinelearning approachrdquo in Proceedings of the 3rd IEEE EMBSInternational Conference on Biomedical and Health InformaticsBHI 2016 pp 426ndash429 USA February 2016

[9] Y Wang and X-Y Bai ldquoResearch of fall detection and alarmapplications for the elderlyrdquo in Proceedings of the 2013 Interna-tional Conference on Mechatronic Sciences Electric Engineering

and Computer MEC 2013 pp 615ndash619 China December 2013[10] Y Ge and B Xu ldquoDetecting falls using accelerometers by

adaptive thresholds in mobile devicesrdquo Journal of Computers inAcademy Publisher vol 9 no 7 pp 1553ndash1559 2014

[11] W Liu Y Luo J Yan C Tao and L Ma ldquoFalling monitoringsystem based onmulti axial accelerometerrdquo in Proceedings of the2014 11th World Congress on Intelligent Control and Automation(WCICA) pp 7ndash12 Shenyang China June 2014

[12] J Yin Q Yang and J J Pan ldquoSensor-based abnormal human-activity detectionrdquo IEEE Transactions on Knowledge and DataEngineering vol 20 no 8 pp 1082ndash1090 2008

[13] B X Nie C Xiong and S-C Zhu ldquoJoint action recognitionand pose estimation from videordquo in Proceedings of the IEEEConference on Computer Vision and Pattern Recognition CVPR2015 pp 1293ndash1301 Boston Mass USA June 2015

[14] G Evangelidis G Singh and R Horaud ldquoSkeletal quadsHuman action recognition using joint quadruplesrdquo in Proceed-ings of the 22nd International Conference on Pattern RecognitionICPR 2014 pp 4513ndash4518 Sweden August 2014

[15] E Ohn-Bar and M M Trivedi ldquoJoint angles similarities andHOG2 for action recognitionrdquo in Proceedings of the 2013IEEE Conference on Computer Vision and Pattern RecognitionWorkshops CVPRW 2013 pp 465ndash470 USA June 2013

[16] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[17] V Parameswaran and R Chellappa ldquoView invariance forhuman action recognitionrdquo International Journal of ComputerVision vol 66 no 1 pp 83ndash101 2006

[18] G Lu Y Zhou X Li andM Kudo ldquoEfficient action recognitionvia local position offset of 3D skeletal body jointsrdquoMultimediaTools and Applications vol 75 no 6 pp 3479ndash3494 2016

[19] A Tejero-de-Pablos Y Nakashima N Yokoya F-J Dıaz-Pernas and M Martınez-Zarzuela ldquoFlexible human actionrecognition in depth video sequences using masked jointtrajectoriesrdquo Eurasip Journal on Image andVideo Processing vol2016 no 1 pp 1ndash12 2016

[20] E Cippitelli S Gasparrini E Gambi and S Spinsante ldquoAHuman activity recognition system using skeleton data fromRGBD sensorsrdquo Computational Intelligence and Neurosciencevol 2016 Article ID 4351435 2016

[21] P Peursum H H Bui S Venkatesh and G West ldquoRobustrecognition and segmentation of human actions using HMMswith missing observationsrdquo EURASIP Journal on Applied SignalProcessing vol 2005 no 13 pp 2110ndash2126 2005

[22] D Weinland R Ronfard and E Boyer ldquoFree viewpoint actionrecognition usingmotionhistory volumesrdquo Journal of ComputerVision and Image Understanding vol 104 no 2-3 pp 249ndash2572006

[23] H Wang A Klaser C Schmid and C L Liu ldquoDense trajecto-ries and motion boundary descriptors for action recognitionrdquoInternational Journal of Computer Vision vol 103 no 1 pp 60ndash79 2013

[24] P Matikainen M Hebert and R Sukthankar ldquoTrajectonsAction recognition through the motion analysis of trackedfeaturesrdquo in Proceedings of the 2009 IEEE 12th InternationalConference on Computer Vision Workshops ICCV Workshops2009 pp 514ndash521 Japan October 2009

[25] M Jain H Jegou and P Bouthemy ldquoBetter exploiting motionfor better action recognitionrdquo in Proceedings of the 26th IEEEConference on Computer Vision and Pattern Recognition CVPR2013 pp 2555ndash2562 Portland OR USA June 2013

[26] S Zhu and L Xia ldquoHuman action recognition based onfusion features extraction of adaptive background subtractionand optical flow modelrdquo Journal of Mathematical Problems inEngineering vol 2015 pp 1ndash11 2015

[27] D-M Tsai W-Y Chiu and M-H Lee ldquoOptical flow-motionhistory image (OF-MHI) for action recognitionrdquo Journal ofSignal Image and Video Processing vol 9 no 8 pp 1897ndash19062015

[28] LWang Y Qiao and X Tang ldquoMoFAP a multi-level represen-tation for action recognitionrdquo International Journal of ComputerVision vol 119 no 3 pp 254ndash271 2016

[29] W Kim J Lee M Kim D Oh and C Kim ldquoHuman actionrecognition using ordinal measure of accumulated motionrdquoEURASIP Journal on Advances in Signal Processing vol 2010Article ID 219190 2010

[30] W Zhang Y Zhang C Gao and J Zhou ldquoAction recognitionby joint spatial-temporal motion featurerdquo Journal of AppliedMathematics vol 2013 pp 1ndash9 2013

[31] H-B Tu L-M Xia and Z-W Wang ldquoThe complex actionrecognition via the correlated topic modelrdquoThe ScientificWorldJournal vol 2014 Article ID 810185 10 pages 2014

[32] L Gorelick M Blank E Shechtman M Irani and R BasrildquoActions as space-time shapesrdquo IEEE Transactions on PatternAnalysis andMachine Intelligence vol 29 no 12 pp 2247ndash22532007

[33] A Yilmaz and M Shah ldquoA differential geometric approach torepresenting the human actionsrdquo Computer Vision and ImageUnderstanding vol 109 no 3 pp 335ndash351 2008

[34] M Grundmann F Meier and I Essa ldquo3D shape context anddistance transform for action recognitionrdquo in Proceedings of the2008 19th International Conference on Pattern Recognition ICPR2008 pp 1ndash4 USA December 2008

[35] D Batra T Chen and R Sukthankar ldquoSpace-time shapelets foraction recognitionrdquo in Proceedings of the 2008 IEEE WorkshoponMotion andVideo ComputingWMVC pp 1ndash6 USA January2008

[36] S Sadek A Al-Hamadi G Krell and B Michaelis ldquoAffine-invariant feature extraction for activity recognitionrdquo Interna-tional Scholarly Research Notices on Machine Vision vol 2013pp 1ndash7 2013

[37] C Achard X Qu A Mokhber and M Milgram ldquoA novelapproach for recognition of human actions with semi-globalfeaturesrdquoMachine Vision and Applications vol 19 no 1 pp 27ndash34 2008

[38] L Dıaz-Mas R Munoz-Salinas F J Madrid-Cuevas andR Medina-Carnicer ldquoThree-dimensional action recognitionusing volume integralsrdquo Pattern Analysis and Applications vol15 no 3 pp 289ndash298 2012

[39] K Rapantzikos Y Avrithis and S Kollias ldquoSpatiotemporalfeatures for action recognition and salient event detectionrdquoCognitive Computation vol 3 no 1 pp 167ndash184 2011

[40] N Ikizler and P Duygulu ldquoHistogram of oriented rectangles Anew pose descriptor for human action recognitionrdquo Image andVision Computing vol 27 no 10 pp 1515ndash1526 2009

[41] A Fathi and G Mori ldquoAction recognition by learning mid-level motion featuresrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition (CVPR rsquo08) pp 1ndash8Anchorage Alaska USA June 2008

[42] V Kellokumpu G Zhao and M Pietikainen ldquoHuman activityrecognition using a dynamic texture basedmethodrdquo in Proceed-ings of the BritishMachine Vision Conference 2008 pp 885ndash894Leeds England UK 2008

[43] D Tran A Sorokin and D A Forsyth ldquoHuman activity recog-nition with metric learningrdquo in Proceedings of the EuropeanConference on Computer Vision (ECCV rsquo08) Lecture Notes inComputer Science pp 548ndash561 Springer 2008

[44] B Wang Y Liu W Wang W Xu and M Zhang ldquoMulti-scalelocality-constrained spatiotemporal coding for local featurebased human action recognitionrdquo The Scientific World Journalvol 2013 Article ID 405645 11 pages 2013

[45] A Gilbert J Illingworth and R Bowden ldquoAction recognitionusing mined hierarchical compound featuresrdquo IEEE Transac-tions on Pattern Analysis and Machine Intelligence vol 33 no5 pp 883ndash897 2011

[46] I C Duta J R R Uijlings B Ionescu K Aizawa A G Haupt-mann and N Sebe ldquoEfficient human action recognition usinghistograms of motion gradients and VLAD with descriptorshape informationrdquoMultimedia Tools and Applications vol 76no 21 pp 22445ndash22472 2017

[47] M Ahad M Islam and I Jahan ldquoAction recognition basedon binary patterns of action-history and histogram of orientedgradientrdquo Journal on Multimodal User Interfaces vol 10 no 4pp 335ndash344 2016

[48] S Pehlivan and P Duygulu ldquoA new pose-based representationfor recognizing actions from multiple camerasrdquo ComputerVision and ImageUnderstanding vol 115 no 2 pp 140ndash151 2011

[49] J Liu M Shah B Kuipers and S Savarese ldquoCross-view actionrecognition via view knowledge transferrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo11) pp 3209ndash3216 June 2011

[50] N Gkalelis N Nikolaidis and I Pitas ldquoView indepedenthuman movement recognition from multi-view video exploit-ing a circular invariant posture representationrdquo in Proceedingsof the 2009 IEEE International Conference on Multimedia andExpo ICME 2009 pp 394ndash397 USA July 2009

[51] R Souvenir and J Babbs ldquoLearning the viewpoint manifold foraction recognitionrdquo in Proceedings of the 26th IEEE Conferenceon Computer Vision and Pattern Recognition CVPR USA June2008

[52] M Ahmad and S W Lee ldquoHMM-based human action recog-nition using multiview image sequencesrdquo in Proceedings of the18th International Conference on Pattern Recognition (ICPRrsquo06)pp 263ndash266 Hong Kong China 2006

[53] A Yao J Gall and L Van Gool ldquoCoupled action recognitionand pose estimation frommultiple viewsrdquo International Journalof Computer Vision vol 100 no 1 pp 16ndash37 2012

[54] X Ji Z Ju C Wang and C Wang ldquoMulti-view transi-tion HMMs based view-invariant human action recognitionmethodrdquoMultimedia Tools and Applications vol 75 no 19 pp11847ndash11864 2016

[55] A K S Kushwaha S Srivastava and R Srivastava ldquoMulti-viewhuman activity recognition based on silhouette and uniformrotation invariant local binary patternsrdquo Journal of MultimediaSystems vol 23 no 4 pp 451ndash467 2017

[56] S Spurlock andR Souvenir ldquoDynamic view selection formulti-camera action recognitionrdquo Machine Vision and Applicationsvol 27 no 1 pp 53ndash63 2016

[57] A A Chaaraoui and F Florez-Revuelta ldquoA low-dimensionalradial silhouette-based feature for fast human action recog-nition fusing multiple viewsrdquo International Scholarly ResearchNotices vol 2014 pp 1ndash11 2014

[58] A Iosifidis A Tefas and I Pitas ldquoView-independent humanaction recognition based on multi-view action images anddiscriminant learningrdquo in Proceeding of the IEEE Image Videoand Multidimensional Signal Processing Workshop 2013 pp 1ndash42013

[59] A Iosifidis A Tefas and I Pitas ldquoView-invariant action recog-nition based on artificial neural networksrdquo IEEE TransactionsonNeural Networks and Learning Systems vol 23 no 3 pp 412ndash424 2012

[60] R Kavi V Kulathumani F Rohit and V Kecojevic ldquoMultiviewfusion for activity recognition using deep neural networksrdquoJournal of Electronic Imaging vol 25 no 4 Article ID 0430102016

[61] YKong ZDing J Li andY Fu ldquoDeeply learned view-invariantfeatures for cross-view action recognitionrdquo IEEE Transactionson Image Processing vol 26 no 6 pp 3028ndash3037 2017

[62] A-A Liu N Xu Y-T Su H Lin T Hao and Z-X YangldquoSinglemulti-view human action recognition via regularizedmulti-task learningrdquo Journal of Neurocomputing vol 151 no 2pp 544ndash553 2015

[63] JWang X Nie Y Xia YWu and S-C Zhu ldquoCross-view actionmodeling learning and recognitionrdquo in Proceedings of the 27thIEEE Conference on Computer Vision and Pattern RecognitionCVPR 2014 pp 2649ndash2656 USA June 2014

[64] PHuang AHilton and J Starck ldquoShape similarity for 3D videosequences of peoplerdquo International Journal of Computer Visionvol 89 no 2-3 pp 362ndash381 2010

[65] A Veeraraghavan A Srivastava A K Roy-Chowdhury andR Chellappa ldquoRate-invariant recognition of humans and theiractivitiesrdquo IEEE Transactions on Image Processing vol 18 no 6pp 1326ndash1339 2009

[66] I N Junejo E Dexter I Laptev and P Perez ldquoView-independ-ent action recognition from temporal self-similaritiesrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol33 no 1 pp 172ndash185 2011

[67] A Iosifidis A Tefas N Nikolaidis and I Pitas ldquoMulti-viewhumanmovement recognition based on fuzzy distances and lin-ear discriminant analysisrdquo Computer Vision and Image Under-standing vol 116 no 3 pp 347ndash360 2012

[68] H Rahmani and A Mian ldquo3D action recognition from novelviewpointsrdquo in Proceedings of the 2016 IEEE Conference onComputer Vision and Pattern Recognition CVPR 2016 pp 1506ndash1515 USA July 2016

[69] P KaewTraKuPong and R Bowden ldquoAn improved adaptivebackground mixture model for real-time tracking with shadowdetectionrdquo in Proceeding of Advanced Video-Based SurveillanceSystems Springer pp 135ndash144 2001

[70] P Chawalitsittikul and N Suvonvorn ldquoProfile-based humanaction recognition using depth informationrdquo in Proceedings ofthe IASTED International Conference on Advances in ComputerScience and Engineering ACSE 2012 pp 376ndash380 ThailandApril 2012

[71] N Noorit N Suvonvorn and M Karnchanadecha ldquoModel-based human action recognitionrdquo in Proceedings of the 2ndInternational Conference onDigital Image Processing SingaporeFebruary 2010

[72] M Ahmad and S-W Lee ldquoHuman action recognitionusing shape and CLG-motion flow from multi-view image

sequencesrdquo Journal of Pattern Recognition vol 41 no 7 pp2237ndash2252 2008

[73] C-H Chuang J-W Hsieh L-W Tsai and K-C Fan ldquoHumanaction recognition using star templates and delaunay triangula-tionrdquo in Proceedings of the 2008 4th International Conference onIntelligent Information Hiding and Multiedia Signal ProcessingIIH-MSP 2008 pp 179ndash182 China August 2008

[74] G I Parisi C Weber and S Wermter ldquoHuman actionrecognition with hierarchical growing neural gas learningrdquo inProceeding of Artificial Neural Networks and Machine Learningon Springer ndash ICANN 2014 vol 8681 of Lecture Notes in Com-puter Science pp 89ndash96 Springer International PublishingSwitzerland 2014

[75] N Sawant and K K Biswas ldquoHuman action recognition basedon spatio-temporal featuresrdquo in Proceedings of the PatternRecognition and Machine Intelligence Springer Lecture Notes inComputer Science pp 357ndash362 Springer Heidelberg Berlin2009

[76] N Suvonvorn Prince of Songkla University (PSU) Multi-viewprofile-based action RGBD dataset 2017 httpfivedotscoepsuacthsimkomp=1483 (accessed 20 December 2017)

[77] N Gkalelis H Kim A Hilton N Nikolaidis and I PitasldquoThe i3DPost multi-view and 3D human actioninteractiondatabaserdquo in Proceeding of the 6th European Conference forVisualMedia Production (CVMP rsquo09) pp 159ndash168 LondonUKNovember 2009

Computer Games Technology

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

Advances in

FuzzySystems

Volume 2018

ReconfigurableComputing

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

thinspArtificial Intelligence

Hindawiwwwhindawicom Volumethinsp2018

Civil EngineeringAdvances in

Electrical and Computer Engineering

Journal of

Computer Networks and Communications

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Biomedical Imaging

Engineering Mathematics

RoboticsJournal of

Computational Intelligence and Neuroscience

Mathematical Problems in Engineering

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Human-ComputerInteraction

Advances in

Scientic Programming

Submit your manuscripts atwwwhindawicom