
MULTIRESOLUTION MATCH KERNELS FOR GESTURE VIDEO CLASSIFICATION

Hemanth Venkateswara, Vineeth N. Balasubramanian, Prasanth Lade, Sethuraman Panchanathan
Center for Cognitive Ubiquitous Computing (CUbiC), Arizona State University
{hemanthv, vineeth.nb, prasanthl, panch}@asu.edu

ABSTRACT

The emergence of depth imaging technologies like the Microsoft Kinect has renewed interest in computational methods for gesture classification based on videos. For several years now, researchers have used the Bag-of-Features (BoF) model as a primary method for generating feature vectors from video data for gesture recognition. However, the BoF method is a coarse representation of the information in a video, which often leads to poor similarity measures between videos. Besides, when features extracted from different spatio-temporal locations in the video are pooled to create histogram vectors in the BoF method, there is an intrinsic loss of their original locations in space and time. In this paper, we propose a new Multiresolution Match Kernel (MMK) for video classification, which can be considered a generalization of the BoF method. We apply this procedure to hand gesture classification based on RGB-D videos of American Sign Language (ASL) hand gestures, and our results show the promise and usefulness of this new method.

Index Terms— Bag of Features, Spatio-temporal Pyramid, Multiple Kernels, Gesture Recognition

1. INTRODUCTION

With the emergence of depth sensors like the Microsoft Kinect, there has been renewed interest in the analysis of gestures using both color (or grayscale) and depth information. A standard procedure for representing such videos is to extract points of interest in the video (such as SIFT keypoints [1]) and subsequently obtain feature descriptors at these interest points [2], which are then used to train a classifier. One of the most common methods used by researchers today in this regard is the Bag-of-Features (BoF) model [3].

The BoF model is applied to videos that are represented by a set of interest-point-based descriptors. A dictionary of codewords that represents the set of descriptors from all the videos is estimated; the dictionary is usually the set of centers of descriptor clusters. Each descriptor in a video is then mapped to the closest codeword in the dictionary, resulting in a histogram over the codewords. Two descriptors are similar if they map to the same codeword, and the similarity between two videos is approximated by the similarity between their descriptor histograms. Histogram similarity is a coarse measure due to the binarization of the belongingness of a descriptor to a cluster. Recently, Bo et al. [4] applied kernel methods to measure descriptor similarity and devised an efficient way to use these kernels to evaluate image similarity for object recognition. In this paper, we show how the concept of efficient match kernels can be extended to estimate a kernel-based similarity between videos.

Further, the BoF procedure has an additional drawback: it inherently leads to a global comparison of the similarity between two videos. Pooling descriptors from different (x, y, t) locations in a video into a single histogram discards spatio-temporal information. Two distinct gestures that contain similar movements but differ in the order in which they are performed will generate similar descriptors, though at different (x, y, t) locations; their BoF histograms will nevertheless be similar, and the gestures will receive the same class label under any classification based on these histograms. To overcome this drawback, in this paper we further extend the efficient kernels for videos to a spatio-temporal pyramid-based multiresolution model.

In the proposed method, which we call Multiresolution Match Kernels (MMK), we bring together the concepts of efficient match kernels [4] and spatio-temporal pyramids [5] for gesture video classification. We introduce a multiresolution spatio-temporal feature representation of a video, which entails a histogram-based similarity measure at multiple resolutions. By replacing the binary histogram comparison with a kernel function, we obtain a fine-grained, rich similarity measure. However, the computational cost of estimating kernels at multiple resolutions is high, so we derive a cost-effective formulation for MMK, motivated by [4], at multiple resolutions. We apply MMK to classify American Sign Language (ASL) hand gestures captured with a Microsoft Kinect. The MSRGesture3D dataset [6] has depth videos of hands performing 12 gestures across 10 users. We demonstrate that MMK's recognition accuracy improves with increased pyramid resolution and that it performs better than existing techniques.

The remainder of this paper is organized as follows. Section 2 discusses earlier related work, followed by a description of the proposed MMK method in Section 3. Section 4 presents our experimental results, followed by our conclusions in Section 5.

2. RELATED WORK

Gesture recognition has been studied extensively for several decades now and has been applied in fields ranging from social robotics to sign language recognition. Until recently, gesture recognition was mostly studied using color (RGB) or grayscale videos; the introduction of the Kinect has renewed interest in gesture recognition using color-and-depth (RGB-D) videos. Weinland et al. [2] provide a survey of gesture recognition techniques with a unique classification of the different procedures used. One of the foremost techniques for video feature representation, the BoF, was popularized by Laptev et al. [3], where Histogram of Oriented Gradients (HOG) and Histogram of Optical Flow (HOF) descriptors were estimated from local neighborhoods of spatio-temporal interest points in a video. A BoF procedure was then applied to estimate histogram vectors that represented the videos.

Grauman and Darrell [7] extended the standard BoF by introducing the pyramid match kernel for object recognition. The inclusion of spatial information into a pyramid-based BoF was formalized by Lazebnik et al. [5]. Spatio-temporal pyramids for videos have been implemented in [8, 9], which use temporal pyramid alignment and histogram intersection, respectively, to estimate video similarities. In this paper, we improve upon the pyramid similarity measure using Multiresolution Match Kernels (MMK). We use the idea of efficient match kernels [4] to overcome the coarse-binning shortcoming of the BoF by efficiently estimating kernel-based similarity between descriptors across videos. We now describe our method.

3. MULTIRESOLUTION MATCH KERNELS

In the BoF procedure, every gesture video is represented as a histogram: the data being binned is the set of feature descriptors from the video, and the bin centers are a set of dictionary vectors. Let $X = \{x_1, x_2, \ldots, x_m\}$ represent the feature descriptors in a video, and let $V = \{v_1, v_2, \ldots, v_D\}$ be a set of $D$ dictionary vectors. A feature descriptor is quantized into a binary vector $\mu(x) = [\mu_1(x), \mu_2(x), \ldots, \mu_D(x)]^\top$, where $\mu_i(x) = 1$ if $x \in R(v_i)$ and $0$ otherwise, with $R(v_i) = \{x : \|x - v_i\| \le \|x - v\|, \forall v \in V\}$ (as in [4]). The histogram feature vector for $X$ is then given by $\mu(X) = \frac{1}{|X|} \sum_{x \in X} \mu(x)$. A linear kernel measuring the similarity between two videos $X$ and $Y$ is a dot product over the histogram vectors:

$$K(X, Y) = \frac{1}{|X||Y|} \sum_{x \in X} \sum_{y \in Y} \mu(x)^\top \mu(y).$$

This is the standard BoF similarity measure for two videos. The product $\mu(x)^\top \mu(y)$ can be expressed as a positive definite function $\delta(x, y)$, where $\delta(x, y) = 1$ if $\{x, y\} \subset R(v_i)$ for some $i \in \{1, \ldots, D\}$, and $0$ otherwise.
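As an illustration, the following is a minimal sketch of this hard-assignment BoF pipeline in Python/NumPy; the function names and array layout are our own, not from the paper.

```python
import numpy as np

def bof_histogram(X, V):
    """Hard-assignment BoF histogram mu(X) over dictionary V.

    X : (m, d) array of descriptors from one video.
    V : (D, d) array of dictionary vectors (e.g., K-means centers).
    Returns a (D,) normalized histogram.
    """
    # Squared Euclidean distance from every descriptor to every codeword.
    dists = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)          # index of the closest codeword
    hist = np.bincount(nearest, minlength=len(V)).astype(float)
    return hist / len(X)                    # mu(X) = (1/|X|) sum_x mu(x)

def bof_similarity(X, Y, V):
    """Standard BoF linear kernel K(X, Y) = mu(X)^T mu(Y)."""
    return bof_histogram(X, V) @ bof_histogram(Y, V)
```

The dot product of the two normalized histograms equals the double sum over descriptor pairs above, since each pair contributes 1 exactly when both descriptors fall in the same cluster.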

As mentioned earlier, the BoF representation of $X$ does not retain information about the spatio-temporal locations of the original feature descriptors. To overcome this drawback, we propose a multiresolution spatio-temporal pyramid representation for videos. A video is represented as a single cuboid of dimensions [Height × Width × Time]. We split the video cuboid into smaller cuboids called voxels. At level $L = 1$ the video consists of only 1 voxel; at level $L = 2$ the video is split into $2^3 = 8$ voxels; in general, at level $L$ the video has $L^3$ voxels. A spatio-temporal pyramid at level $L$ consists of the sequence of all voxels generated at levels $1, \ldots, L$; the number of voxels for a spatio-temporal pyramid with $L = 3$ is therefore $1^3 + 2^3 + 3^3 = 36$.

Fig. 1: Splitting a video into voxels at multiple levels: (left) Level 1, (center) Level 2, (right) Level 3. The dots refer to the interest points at (x, y, t) that fall within different voxels at different levels.

We represent the descriptors for a given level as an ordered partition of the descriptor set $X = \{x_1, x_2, \ldots, x_m\}$ of the video. The ordered partition for level $i$ is $X_i = \{X_{i,1}, X_{i,2}, \ldots, X_{i,T_i}\}$, where $T_i = i^3$ is the number of partitions (voxels) at level $i$. For $x \in X_{i,j}$, where $X_{i,j} \subseteq X$ is the set of descriptors in voxel $j$ at level $i$, the binary vector representation is again $\mu(x) = [\mu_1(x), \mu_2(x), \ldots, \mu_D(x)]^\top$; for simplicity, we use the same dictionary $V$ for quantization at all levels. The histogram of descriptors for voxel $(i, j)$ is $\mu(X_{i,j}) = \frac{1}{|X_{i,j}|} \sum_{x \in X_{i,j}} \mu(x)$, and the similarity between two corresponding voxels (both labeled $(i, j)$) from two videos $X$ and $Y$ is

$$\mu(X_{i,j})^\top \mu(Y_{i,j}) = \frac{1}{|X_{i,j}||Y_{i,j}|} \sum_{x \in X_{i,j}} \sum_{y \in Y_{i,j}} \mu(x)^\top \mu(y).$$

For a given level, the multiresolution BoF similarity between two videos $X$ and $Y$ is the sum of the similarities of histogram vectors from corresponding voxels. A linear kernel measuring the multiresolution BoF similarity between two videos is then a sum of dot products spanning the corresponding voxels at all levels:

$$K(X, Y) = \sum_{i=1}^{L} \sum_{j=1}^{T_i} \frac{1}{|X_{i,j}||Y_{i,j}|} \sum_{x \in X_{i,j}} \sum_{y \in Y_{i,j}} \mu(x)^\top \mu(y). \tag{1}$$

Unlike the standard BoF, the kernel in Equation (1) measures the similarity between two videos at multiple resolutions, maintaining locality in the spatial and temporal dimensions: the greater the number of levels, the finer the comparison between the videos. When $L = 1$, Equation (1) reduces to the standard BoF similarity measure described at the beginning of this section.
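The voxel partitioning and the level-wise similarity of Equation (1) can be sketched as follows, reusing `bof_histogram` from the previous sketch. The uniform split of each axis into i equal parts at level i, and all names, are our assumptions about the construction.

```python
import numpy as np

def voxel_histograms(desc, coords, extents, V, L):
    """Per-voxel BoF histograms mu(X_{i,j}) for pyramid levels 1..L.

    desc    : (m, d) descriptors; coords : (m, 3) their (x, y, t) locations.
    extents : (3,) video size (width, height, number of frames).
    Returns a dict mapping (i, j) -> (D,) histogram; level i has i^3 voxels.
    """
    hists = {}
    for i in range(1, L + 1):
        # Split each axis into i equal parts; flat voxel index j in [0, i^3).
        cell = np.minimum((coords / extents * i).astype(int), i - 1)
        j = cell[:, 0] * i * i + cell[:, 1] * i + cell[:, 2]
        for v in range(i ** 3):
            sel = desc[j == v]
            if len(sel):
                hists[(i, v)] = bof_histogram(sel, V)
    return hists

def multires_similarity(hX, hY):
    """Equation (1): sum of voxel-wise histogram dot products."""
    keys = set(hX) & set(hY)   # empty voxels contribute zero
    return sum(hX[k] @ hY[k] for k in keys)
```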

We now introduce the Multiresolution Match Kernels. The dot product $\mu(x)^\top \mu(y)$ in Equation (1) can be replaced by the positive definite function $\delta(x, y)$, as discussed previously. Since $\delta(x, y)$ is a coarsely quantized estimate of the similarity between $x$ and $y$, we replace it with a finite-dimensional kernel function $k(x, y)$ to estimate the similarity between $x$ and $y$ [4]. This modifies Equation (1) to:

$$K(X, Y) = \sum_{i=1}^{L} \sum_{j=1}^{T_i} \frac{1}{|X_{i,j}||Y_{i,j}|} \sum_{x \in X_{i,j}} \sum_{y \in Y_{i,j}} k(x, y). \tag{2}$$

Equation (2) can be viewed as a sum of kernels over the same descriptor pairs $(x, y)$ at multiple levels.
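As a consistency check (our own remark, though implicit in the construction), substituting the indicator function back into Equation (2) recovers Equation (1):

```latex
% With k(x,y) = \delta(x,y) = \mu(x)^\top \mu(y), Equation (2) becomes
K(X,Y) = \sum_{i=1}^{L} \sum_{j=1}^{T_i} \frac{1}{|X_{i,j}||Y_{i,j}|}
         \sum_{x \in X_{i,j}} \sum_{y \in Y_{i,j}} \mu(x)^\top \mu(y),
% which is exactly the multiresolution BoF kernel of Equation (1).
% A smoother k(x,y) therefore strictly refines this coarse 0/1 match.
```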

Let us define
$$k_{i,j}(X, Y) := \frac{1}{|X_{i,j}||Y_{i,j}|} \sum_{x \in X} \sum_{y \in Y} k(x, y), \quad \text{where } k(x, y) = 0 \text{ for } x \notin X_{i,j} \text{ or } y \notin Y_{i,j}.$$
Equation (2) can then be written as $K(X, Y) = \sum_{i=1}^{L} \sum_{j=1}^{T_i} \alpha_i k_{i,j}(X, Y)$, where $\alpha_i$ is a weight for the kernels at level $i$; we give more weight to kernel similarities at finer resolutions (higher levels). The kernel $K(X, Y)$ can thus be viewed as a linear combination of kernel functions across levels and voxels, which we call a Multiresolution Match Kernel (MMK). The computational cost of evaluating MMKs directly through Equation (2) is prohibitively high: $O(L^3 m^2 d)$ to estimate the kernel similarity between two videos and $O(n^2 L^3 m^2 d)$ to estimate the kernel Gram matrix for $n$ videos, where $m$ is the average number of descriptors in a video and $d$ is the dimensionality of the descriptors. To overcome this computational overhead, we estimate the feature map induced by the finite-dimensional kernel and work in the mapped feature space, as in [4]. We now proceed to estimate the feature representation $\phi(X)$ induced by the kernel, so that $K(X, Y) = \phi(X)^\top \phi(Y)$.

Gonen and Alpaydin state in [10] that a linear combination of kernels corresponds directly to a concatenation in the feature space. Therefore, the MMK induces a feature space that is the concatenation of the feature spaces induced by the individual kernels $k_{i,j}(X, Y)$. That is, if $k_{i,j}(X, Y)$ induces a feature map $\phi_{i,j}(X)$ with $k_{i,j}(X, Y) = \phi_{i,j}(X)^\top \phi_{i,j}(Y)$, the feature map induced by the MMK $K(X, Y)$ is $\phi(X) = [\phi_{1,1}(X)^\top, \ldots, \phi_{i,j}(X)^\top, \ldots, \phi_{L,T_L}(X)^\top]^\top$, with $T_L = L^3$. We now need to estimate the feature map $\phi_{i,j}(X)$ induced by the kernel $k_{i,j}(X, Y)$, which in turn involves estimating the feature map $\phi(x)$ induced by $k(x, y) = \phi(x)^\top \phi(y)$. We estimate $\phi(x)$ as in [4]; given $\phi(x)$, $\phi_{i,j}(X)$ is the mean of all the $\phi(x)$ that belong to voxel $(i, j)$.

In estimating $\phi(x)$, we determine a set of $D$ basis vectors (a dictionary) that spans the space of all $x \in X$, for all $X$. We denote this set of $D$ vectors by $H = \{z_1, z_2, \ldots, z_D\}$ and estimate it using K-means clustering with $D$ centers. The kernel is then $k(x, y) = k_Z(x)^\top K_{ZZ}^{-1} k_Z(y)$, where $k_Z(x)$ is a $D \times 1$ vector with $\{k_Z(x)\}_i = k'(x, z_i)$, and $K_{ZZ}$ is a $D \times D$ matrix with $\{K_{ZZ}\}_{i,j} = k'(z_i, z_j)$ for some positive definite kernel $k'(\cdot, \cdot)$ (we use an RBF kernel in our experiments). With $G^\top G = K_{ZZ}^{-1}$, the feature map is $\phi(x) = G k_Z(x)$, and the feature map for voxel $(i, j)$ is $\phi_{i,j}(X) = \frac{1}{|X_{i,j}|} G \left[\sum_{x \in X_{i,j}} k_Z(x)\right]$. Please refer to [4] for more details.
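A sketch of this finite-dimensional feature map in NumPy/SciPy follows; the Cholesky-based construction of G and the small jitter term are our implementation choices, not prescribed by the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def fit_feature_map(Z, gamma, jitter=1e-8):
    """Precompute G with G^T G = K_ZZ^{-1} for an RBF base kernel k'.

    Z : (D, d) dictionary of basis vectors from K-means.
    """
    K_ZZ = np.exp(-gamma * cdist(Z, Z, 'sqeuclidean'))
    # Cholesky K_ZZ = L L^T gives K_ZZ^{-1} = L^{-T} L^{-1}, so G = L^{-1}.
    L = np.linalg.cholesky(K_ZZ + jitter * np.eye(len(Z)))
    return np.linalg.inv(L)

def voxel_feature(X_ij, Z, G, gamma):
    """phi_{i,j}(X) = (1/|X_ij|) G sum_x k_Z(x) for descriptors in one voxel."""
    kZ = np.exp(-gamma * cdist(X_ij, Z, 'sqeuclidean'))  # (m_ij, D)
    return G @ kZ.mean(axis=0)
```

With this choice of G, phi(x)^T phi(y) = k_Z(x)^T K_ZZ^{-1} k_Z(y) = k(x, y) as required.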

Fig. 2: Different gestures from the MSRGesture3D dataset [6].

The computational complexity of estimating $\phi_{i,j}(X)$ is now $O(mDd + D^2)$, and $\phi(X)$ is obtained by concatenating the $\phi_{i,j}(X)$. $\phi(X)$ is the new multiresolution spatio-temporal representation of a video $X$. If the dictionary length is $D$ and the pyramid has $L$ levels, $\phi(X)$ has $D\left[\frac{L(L+1)}{2}\right]^2$ dimensions, since $\sum_{i=1}^{L} i^3 = \left[\frac{L(L+1)}{2}\right]^2$ (e.g., for a dictionary of length $D = 10$ and spatio-temporal pyramid level $L = 3$, the total length of the MMK feature map is 360; a level $L = 3$ pyramid includes levels $L = 2$ and $L = 1$ as well). This is a significant improvement that makes MMKs practical and efficient.
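A quick check of this dimension count (the helper name is ours):

```python
def mmk_feature_dim(D, L):
    """Dimension of phi(X): D times the total number of voxels in the pyramid."""
    assert sum(i ** 3 for i in range(1, L + 1)) == (L * (L + 1) // 2) ** 2
    return D * (L * (L + 1) // 2) ** 2

print(mmk_feature_dim(10, 3))    # 360, as in the example above
print(mmk_feature_dim(1000, 3))  # 36000, the setting used in the experiments
```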

4. EXPERIMENTS AND RESULTS

The MSRGesture3D dataset [6] consists of depth videos of 12 gestures depicting letters and words from American Sign Language, viz. bathroom, blue, finish, green, hungry, milk, past, pig, store, where, j, z. It is performed by 10 users and has 336 examples. Figure 2 depicts a snapshot from an example of each gesture across different users. In this work, we extract interest points using SIFT [1] from every frame of the depth video; every frame is a grayscale image whose pixel values denote depth. The (x, y, t) coordinates of these points are used to group them into voxels (x, y are pixel coordinates, t is the frame number). For this study we experiment with only 3 levels of the spatio-temporal pyramid. We apply MMK to three kinds of feature descriptors (SIFT itself, Local Binary Patterns (LBP), and HOG) extracted at the aforementioned interest points to validate the performance of the proposed method.
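A minimal sketch of this interest-point extraction step using OpenCV, assuming a recent `opencv-python` build with SIFT available and a depth video loaded as a list of 2-D arrays; the names and the 8-bit rescaling are ours:

```python
import cv2
import numpy as np

def extract_interest_points(frames):
    """Detect SIFT keypoints per frame; return (x, y, t) coords and descriptors."""
    sift = cv2.SIFT_create()
    coords, descs = [], []
    for t, frame in enumerate(frames):
        # Rescale raw depth values to 8-bit grayscale for the detector.
        img = cv2.normalize(frame, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        kps, des = sift.detectAndCompute(img, None)
        if des is None:
            continue
        coords.extend((kp.pt[0], kp.pt[1], t) for kp in kps)
        descs.append(des)
    return np.array(coords), np.vstack(descs)
```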

SIFT descriptors were estimated using a 16 × 16 window of pixels around each interest point; each interest point yielded a 128-dimensional descriptor (as in [1]). LBP features [11] were estimated from the same 16 × 16 pixel window around each interest point: for each pixel in the window, an 8-bit LBP binary pattern was computed, and the decimal values of these patterns were binned into a histogram with 59 bins, giving a 59-dimensional descriptor per interest point. For HOG, the gradient orientations of the pixels in the 16 × 16 window around the interest point were estimated and binned into a histogram with 36 bins; we apply the soft-binning strategy from [12] to distribute each orientation across a set of bins, so the feature descriptor for each interest point is a vector of 36 components.
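The soft binning can be sketched as follows: each gradient orientation votes for its two nearest bins with linear weights. This is one common reading of the strategy in [12]; the exact scheme there may differ, and the magnitude weighting is our assumption.

```python
import numpy as np

def soft_orientation_histogram(patch, n_bins=36):
    """36-bin gradient-orientation histogram with linear soft binning.

    patch : (16, 16) grayscale window around an interest point.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)   # orientation in [0, 2*pi)
    pos = ang / (2 * np.pi) * n_bins              # fractional bin position
    lo = np.floor(pos).astype(int) % n_bins
    hi = (lo + 1) % n_bins
    w_hi = pos - np.floor(pos)                    # linear interpolation weight
    hist = np.zeros(n_bins)
    np.add.at(hist, lo.ravel(), (mag * (1 - w_hi)).ravel())
    np.add.at(hist, hi.ravel(), (mag * w_hi).ravel())
    return hist / (hist.sum() + 1e-12)
```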

For each kind of descriptor, we estimated the dictionary of D codewords using K-means clustering. We empirically studied MMK with different values of D and found the best performance at D = 1000 (see Fig. 3).

Fig. 3: Accuracies for multiple D values (recognition accuracy vs. number of K-means centers D, for SIFT, LBP and HOG; D ranges from 100 to 1100).

Fig. 4: Confusion matrix for the 12 gestures (SIFT features with MMK-3). Classes: Bathroom, Blue, Finish, Green, Hungry, Milk, Past, Pig, Store, Where, Letter-J, Letter-Z.

Feature maps for level i were weighted with α_i = 1/2^{−i}, giving higher weight to matches at finer resolutions. For the L = 3 spatio-temporal pyramid (which includes levels L = 1 and L = 2), the dimension of the feature map φ(X) was 36,000. We use the LIBLINEAR classifier (as used by Bo et al. [4]) on the features φ(X) for final gesture classification. Table 1 shows the results of our experiments; here, BoF is the standard Bag-of-Features method, while MMK-1, MMK-2 and MMK-3 are MMK with L = 1, L = 2 and L = 3, respectively. These results were obtained using cross-validation, where 5 subjects were used for training and 5 subjects for testing, and were averaged over 5 independent runs to address randomness bias.

It is evident that the proposed MMK method outperforms BoF in these experiments. The results also show that accuracy increases at higher resolutions of matching, corroborating the need for multiresolution match kernels. The confusion matrix for SIFT with MMK-3, shown in Fig. 4, indicates that the gesture for Milk was often misclassified. We believe this is because the gesture involves a clenched fist, whose descriptors are similar to descriptors from other gestures (like Letter-J); we will investigate ways to overcome this challenge in future work. Existing state-of-the-art methods on the MSRGesture3D dataset report accuracies of 88.5% [6] and 87.7% [13] using a leave-one-subject-out approach. We hence performed a study based on the leave-one-subject-out approach using the proposed MMK-3 method and obtained accuracies of 94.6% (SIFT), 94.1% (LBP) and 91.5% (HOG), supporting our inference that the proposed method holds great promise for video-based classification tasks.
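The leave-one-subject-out protocol can be sketched with scikit-learn, whose LinearSVC wraps LIBLINEAR; the variable names and data layout are ours.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import LinearSVC  # scikit-learn's wrapper around LIBLINEAR

def leave_one_subject_out(features, labels, subjects):
    """features: (n, dim) MMK feature maps phi(X); subjects: (n,) user ids."""
    accs = []
    for train, test in LeaveOneGroupOut().split(features, labels, subjects):
        clf = LinearSVC().fit(features[train], labels[train])
        accs.append(clf.score(features[test], labels[test]))
    return np.mean(accs)
```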

        BoF      MMK-1    MMK-2    MMK-3
SIFT    71.28%   72.30%   87.56%   91.66%
LBP     66.15%   68.07%   85.64%   90.51%
HOG     74.10%   58.33%   74.87%   88.85%

Table 1: Cross-validation accuracies (train on 5 subjects, test on 5 subjects).

5. CONCLUSIONS

We have proposed the Multiresolution Match Kernel framework for efficient and accurate estimation of video gesture similarity. Our method addresses two major drawbacks of the widely used Bag-of-Features method, viz., coarse similarity matching and the loss of spatio-temporal locality. Our experiments on the MSRGesture3D dataset showed that MMK performs significantly better than BoF across three feature descriptors: SIFT, LBP and HOG. MMK provides a discriminative multiresolution spatio-temporal feature representation for videos that contain similar motion but differ in the sequence of that motion, where a standard BoF procedure would perform poorly. In future work, we plan to extend this method to other video-based classification tasks to study the generalizability of the approach.

6. ACKNOWLEDGMENTS

This material is based upon work supported by the National Science Foundation under Grant No. 1116360. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

7. REFERENCES

[1] D. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, 2004.

[2] D. Weinland, R. Ronfard, and E. Boyer, “A survey of vision-based methods for action representation, segmentation and recognition,” Computer Vision and Image Understanding, vol. 115, no. 2, pp. 224–241, 2011.

[3] I. Laptev, “On space-time interest points,” IJCV, vol. 64, no. 2, pp. 107–123, 2005.

[4] L. Bo and C. Sminchisescu, “Efficient match kernel between sets of features for visual recognition,” in NIPS, 2009.

[5] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in IEEE CVPR, 2006, vol. 2, pp. 2169–2178.

[6] A. Kurakin, Z. Zhang, and Z. Liu, “A real time system for dynamic hand gesture recognition with a depth sensor,” in EUSIPCO, 2012.

[7] K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative classification with sets of image features,” in IEEE ICCV, 2005, vol. 2, pp. 1458–1465.

[8] D. Xu and S. Chang, “Visual event recognition in news video using kernel methods with multi-level temporal alignment,” in IEEE CVPR, 2007, pp. 1–8.

[9] J. Choi, W. Jeon, and S. Lee, “Spatio-temporal pyramid matching for sports videos,” in Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 2008, pp. 291–297.

[10] M. Gonen and E. Alpaydın, “Multiple kernel learning algorithms,” JMLR, vol. 12, pp. 2211–2268, 2011.

[11] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on PAMI, vol. 24, no. 7, pp. 971–987, 2002.

[12] L. Bo, X. Ren, and D. Fox, “Kernel descriptors for visual recognition,” in NIPS, 2010.

[13] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, “Robust 3D action recognition with random occupancy patterns,” in ECCV, 2012, pp. 872–885.