
2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD)

Visual Attributes Based Sparse Multitask Action Recognition

Qicong Wang1, Jinhao Zhao1, Yehu Shen2, Maozhen Li3,4, Yuxiang Wu1, Yunqi Lei1
1Department of Computer Science, Xiamen University, Xiamen, China, 361005
2Suzhou Institute of Nano-tech and Nano-bionics, Chinese Academy of Sciences, China
3Department of Electronic and Computer Engineering, Brunel University, Uxbridge, UK, UB8 3PH
4School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China, 212013

Abstract—For action recognition, traditional multitask learning can share low-level features among actions effectively, but it neglects high-level semantic relationships between latent visual attributes and actions. Some action classes might be related, with latent visual attributes shared across categories. In this paper, we improve the multitask learning model using the attribute-action relationship for action datasets with sparse and incomplete labels. Moreover, the amounts of semantic information in visual attributes and in action class labels differ, so we carry out attribute task learning and action task learning separately to improve generalization performance. Specifically, for two latent variables, i.e., visual attributes and model parameters, we formulate a joint optimization objective function regularized by low rank and sparsity. To deal with this non-smooth optimization, we transform the objective function into a separable formulation via auxiliary variables. Experimental results on two datasets show that the proposed approach can learn latent knowledge effectively to enhance discrimination power and is competitive with other baseline methods.

Index Terms—action recognition, multitask learning, semantic relationship, visual attribute.

I. INTRODUCTION

Human action recognition from images and video plays a critical role in diverse applications such as surveillance, human-computer interaction, robotics, and multimedia content retrieval [1-3]. In both the machine learning and computer vision communities, action recognition is one of the most active research topics. Early action recognition methods mainly focused on moving-object extraction and analysis in tracking [1]. For more discriminative action representation, many effective methods [4-8], such as space-time interest points, were proposed. Action recognition methods based on space-time interest points, which are local features, use a bag-of-words model to describe human actions such as jumping, running, handwaving, etc. They need no preprocessing, such as body-part tracking or background modeling, to capture characteristic shape and motion in the scene, so they are relatively robust to spatio-temporal shifts and scales as well as background clutter in video. However, those methods cannot effectively describe the intrinsic spatio-temporal relationships among space-time interest points. As a result, spatio-temporal context is treated as an additional kind of information that captures these relationships to enhance interest points [9].

The above action recognition methods map low-level image features directly to action class labels. However, rich visual features can hardly be described by a single action class label, and as a result the recognition accuracy of these methods is unsatisfactory. Intermediate semantic features have been proposed to express action classes [10] [11], but these features carry no definite semantic information; additional definitions are needed to represent the related properties of action classes.

Recently, attribute learning has been developed to overcome the above shortcomings [12] [13]. Visual attributes are defined as observable properties in images (for example, we believe that an arm-swing label is useful for action recognition). In some applications, they can be seen as valuable high-level semantic information and introduced into the prediction model [14] [15]. In existing attribute-based recognition methods, the main role of attributes is to supply learned semantic information that can be integrated into the action recognition model. Briefly speaking, the original mapping from low-level image features to object labels is divided into two steps: mapping low-level image features to attribute labels, and mapping attribute labels to object labels. The attribute predictions can thus be seen as mid-level features that bridge low-level image features and high-level object classes. From this analysis, we find that attribute learning and target classification are separate: although attributes affect the object prediction, the training data of attribute labels do not introduce any new information directly when the target classifier is learned.

In this paper, we explore how to carry out visual attribute learning and action class learning at the same time to improve generalization ability. Because visual attributes are shared by action classes, the differences among many action classes can be distinguished by these attributes; attribute learning and action class learning therefore have an intrinsic relationship. Multitask learning [16] is an effective way to exploit this relationship. We take attribute learning as an auxiliary task of action class learning, and transform the complex single-task action recognition problem into a multitask one.

The purpose of multitask learning is to improve the generalization performance of classifiers by learning multiple related tasks, which is achieved by learning the tasks and exploiting their intrinsic correlations simultaneously [17]. This is especially effective when each task has only limited training data.


However, since the amount of labeled attributes and action classes is very limited in practical applications, the potential structure of the prediction function should be low rank and sparse. The popular convex relaxation of the rank function is the sum of the singular values of a matrix [18-20]. In multitask learning, a low-rank matrix can mine the subspace structure to capture the relevance among tasks [21][22]. Its problem is that the learned model parameters are usually dense. When the training data of each task are limited and the sample features are high dimensional, some features may have no discriminative ability, yet dense model parameters can still assign them high weights, which biases the generalization ability of multitask learning. To select the features with high discriminative ability, we introduce a sparsity constraint, the l1 norm, into multitask learning to handle task learning with high-dimensional features. The sparse coefficients corresponding to highly discriminative features are larger, while those corresponding to weakly discriminative features are smaller, even close to 0.

Because the rank function and the sparse l1 norm are non-smooth, we formulate multitask action recognition as a non-smooth convex optimization problem. Solving it as a semi-definite program is very time-consuming and unsuitable for big data. Inspired by [23] [24], we use an effective optimization technique to solve this problem. We first introduce auxiliary variables to make the objective function separable, and then transform the optimization problem into an augmented Lagrange function. The computational cost of the matrix products can be reduced by SVD techniques. The objective function is minimized repeatedly to compute the optimal parameters until convergence.

This paper is organized as follows. Section 2 introduces the framework of multitask action recognition. Section 3 gives a detailed description of the proposed multitask learning model. Section 4 reports experimental results on two human action datasets. Section 5 concludes the paper.

II. MULTITASK ACTION RECOGNITION FRAMEWORK

In visual action recognition, rich visual features can hardly be described by an action label through a single mapping. Therefore, we use visual attributes to enrich the learning knowledge (for example, we believe that an arm-swing label is useful for action recognition). We treat the multitask learning process as two related parts, i.e., an action learning task and an attribute learning task. Our goal is to improve the accuracy of action classification. Therefore, we take action learning as the major task and attribute learning as the auxiliary task.

In order to learn the major task, we define n as the number of target classes and x_i ∈ R^d as the i-th low-level feature vector in the training dataset. We define {y_ij | j = 1, 2, ..., n} as binary labels indicating whether the low-level feature vector x_i belongs to the j-th class. We assume that the main and auxiliary tasks learn from the same low-level feature vectors. To learn the m auxiliary tasks, {y_i(n+k) | k = 1, 2, ..., m} are defined as binary labels indicating whether x_i belongs to the k-th attribute class. The total number of tasks is therefore T = n + m. Each learning task corresponds to a predictive function f_l and a training set {(x_1, y_1l), ..., (x_p, y_pl)} ⊂ R^d × {0, 1} (l = 1, ..., T). We focus on the linear prediction f_l(x) = w_l^T x, where w_l is the weight vector of the l-th task. Since the amount of labeled data is very limited, we introduce sparse and low-rank structure constraints to formulate the optimization framework of multitask action recognition. This method can not only capture the relevance among learning tasks but also select the feature subspace with stronger discriminant power:

\min_W L(W) + \lambda_1 \|W\|_* + \lambda_2 \|W\|_1 \quad (1)

where L(W) = \sum_{l=1}^{T} \sum_{i=1}^{p} L(w_l^T x_i, y_{il}) is the least squares loss function; \|W\|_* is the trace norm of the matrix W, computed as the sum of its singular values; \|W\|_1 = \sum_i \sum_j |w_{ij}| is the l1 norm; and λ1 and λ2 are non-negative balance parameters controlling the low-rankness and sparsity of W, respectively. When λ1 = 0, the problem degenerates to a least absolute shrinkage and selection operator (Lasso) problem; when λ2 = 0, it degenerates to trace-norm-regularized least squares multitask learning. The proposed method therefore combines the advantages of both, but Eq. (1) is a non-smooth convex optimization problem. To solve it, we introduce auxiliary variables that make the objective function separable and convert the problem into a Lagrange function.
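To make Eq. (1) concrete, the following is a minimal NumPy sketch (not the authors' code) of evaluating the objective, assuming the least squares loss and stacking the feature vectors into a p×d matrix X and the action and attribute labels into a p×T matrix Y; all names are illustrative.

```python
# Evaluate the objective of Eq. (1): least squares loss + trace norm + l1 norm.
import numpy as np

def objective(W, X, Y, lam1, lam2):
    loss = np.sum((X @ W - Y) ** 2)                        # L(W)
    trace_norm = np.linalg.svd(W, compute_uv=False).sum()  # ||W||_*
    l1_norm = np.abs(W).sum()                              # ||W||_1
    return loss + lam1 * trace_norm + lam2 * l1_norm
```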

III. THE IMPLEMENTATION OF SOLVING THE OPTIMIZATION PROBLEM

A. Relaxation Method for the Optimization Problem

The alternating direction method (ADM) is well suited to non-smooth convex optimization problems [25]. By adding two auxiliary variables Ψ1 and Ψ2, we can transform the non-smooth convex optimization problem of Eq. (1) into:

\min_{W,\Psi_1,\Psi_2} L(W) + \lambda_1 \|\Psi_1\|_* + \lambda_2 \|\Psi_2\|_1, \quad \text{s.t.} \; W = \Psi_1, \; W = \Psi_2 \quad (2)

The augmented Lagrange function of Eq. (2) can be expressed as:

\varphi_\rho(W,\Psi_1,\Psi_2,\Gamma_1,\Gamma_2) = L(W) + \lambda_1 \|\Psi_1\|_* + \lambda_2 \|\Psi_2\|_1 + \langle W - \Psi_1, \Gamma_1 \rangle + \langle W - \Psi_2, \Gamma_2 \rangle + \frac{\rho}{2} \|W - \Psi_1\|_F^2 + \frac{\rho}{2} \|W - \Psi_2\|_F^2 \quad (3)

where Γ1 and Γ2 are Lagrange multipliers, ⟨·, ·⟩ denotes the inner product, and ρ is the penalty parameter. The usual augmented Lagrange multiplier method minimizes φ_ρ with respect to W, Ψ1 and Ψ2 simultaneously. Instead, ADM decomposes the minimization of φ_ρ into sub-problems over W, Ψ1 and Ψ2, respectively.
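As a sanity check on Eq. (3), here is a minimal sketch, under the same least squares assumption as above, of evaluating the augmented Lagrange function φ_ρ; it can be used to monitor the ADM iterations given below. All names are illustrative.

```python
# Evaluate phi_rho of Eq. (3) for given primal and dual variables.
import numpy as np

def aug_lagrangian(W, Psi1, Psi2, Gam1, Gam2, X, Y, lam1, lam2, rho):
    loss = np.sum((X @ W - Y) ** 2)                          # L(W)
    return (loss
            + lam1 * np.linalg.svd(Psi1, compute_uv=False).sum()
            + lam2 * np.abs(Psi2).sum()
            + np.sum((W - Psi1) * Gam1) + np.sum((W - Psi2) * Gam2)
            + 0.5 * rho * (np.sum((W - Psi1) ** 2) + np.sum((W - Psi2) ** 2)))
```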


To solve the optimization problem (2), ADM performs the following iterations (Eqs. (4)-(7)):

W_{k+1} = \arg\min_W \varphi_\rho(W, \Psi_1^k, \Psi_2^k, \Gamma_1^k, \Gamma_2^k) \quad (4)

(\Psi_1^{k+1}, \Psi_2^{k+1}) = \arg\min_{\Psi_1,\Psi_2} \varphi_\rho(W_{k+1}, \Psi_1, \Psi_2, \Gamma_1^k, \Gamma_2^k) \quad (5)

\Gamma_1^{k+1} = \Gamma_1^k + \rho (W_{k+1} - \Psi_1^{k+1}) \quad (6)

\Gamma_2^{k+1} = \Gamma_2^k + \rho (W_{k+1} - \Psi_2^{k+1}) \quad (7)

where W_k, Ψ_1^k, Ψ_2^k, Γ_1^k and Γ_2^k denote the intermediate ADM solution at the k-th iteration, and ρ is a constant given in advance.
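Putting Eqs. (4)-(7) together, a minimal sketch of the ADM loop might look as follows; update_W, prox_trace_norm and prox_l1 are the sub-problem solvers made explicit later in this section, and all names, parameter defaults and the stopping rule are assumptions rather than the authors' implementation.

```python
# ADM iterations (4)-(7) for the objective in Eq. (2).
import numpy as np

def adm(X, Y, lam1, lam2, rho=1.0, n_iter=100, tol=1e-6):
    d, T = X.shape[1], Y.shape[1]
    W = np.zeros((d, T))
    Psi1, Psi2 = W.copy(), W.copy()
    Gam1, Gam2 = np.zeros_like(W), np.zeros_like(W)
    for _ in range(n_iter):
        W = update_W(X, Y, Psi1, Psi2, Gam1, Gam2, rho)     # Eq. (4)/(8)
        Psi1 = prox_trace_norm(W + Gam1 / rho, lam1 / rho)  # Eq. (9)/(11)
        Psi2 = prox_l1(W + Gam2 / rho, lam2 / rho)          # Eq. (10)/(12)
        Gam1 = Gam1 + rho * (W - Psi1)                      # Eq. (6)
        Gam2 = Gam2 + rho * (W - Psi2)                      # Eq. (7)
        # Stop when the auxiliary variables agree with W.
        if max(np.abs(W - Psi1).max(), np.abs(W - Psi2).max()) < tol:
            break
    return W
```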

B. Speedup for the Optimization Problem

To further speed up convergence, we apply linearization to Eqs. (4) and (5). Updating W, the optimal solution W_{k+1} of Eq. (4) is obtained from:

W_{k+1} = \arg\min_W \Big( L(W) + \frac{\rho}{2} \|W - \Psi_1^k + \Gamma_1^k/\rho\|_F^2 + \frac{\rho}{2} \|W - \Psi_2^k + \Gamma_2^k/\rho\|_F^2 \Big) \quad (8)

Notice that W_{k+1} can be obtained by solving a system of linear equations.
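Assuming the least squares loss L(W) = ||XW - Y||_F^2 from Eq. (1), setting the gradient of Eq. (8) to zero gives the linear system (2X^T X + 2ρI)W = 2X^T Y + ρΨ_1^k - Γ_1^k + ρΨ_2^k - Γ_2^k, which the following sketch solves directly; all names are illustrative.

```python
# W-update of Eq. (8): solve the normal equations in closed form.
import numpy as np

def update_W(X, Y, Psi1, Psi2, Gam1, Gam2, rho):
    d = X.shape[1]
    A = 2.0 * X.T @ X + 2.0 * rho * np.eye(d)               # left-hand side
    B = 2.0 * X.T @ Y + rho * Psi1 - Gam1 + rho * Psi2 - Gam2
    return np.linalg.solve(A, B)
```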

Updating Ψ, the optimal solutions Ψ_1^{k+1} and Ψ_2^{k+1} of Eq. (5) are obtained from the following equations:

\Psi_1^{k+1} = \arg\min_{\Psi_1} \Big( \lambda_1 \|\Psi_1\|_* + \frac{\rho}{2} \|W_{k+1} - \Psi_1 + \Gamma_1^k/\rho\|_F^2 \Big) \quad (9)

\Psi_2^{k+1} = \arg\min_{\Psi_2} \Big( \lambda_2 \|\Psi_2\|_1 + \frac{\rho}{2} \|W_{k+1} - \Psi_2 + \Gamma_2^k/\rho\|_F^2 \Big) \quad (10)

The optimization problem (9) has an analytical solution. Suppose rank(W_{k+1} + Γ_1^k/ρ_k) = r, and let the singular value decomposition (SVD) of W_{k+1} + Γ_1^k/ρ_k be U_r Σ_r V_r^T, where U_r and V_r each contain r orthonormal columns and Σ_r = diag(σ_1, σ_2, ..., σ_r). Then the optimal solution Ψ_1^{k+1} is given by Eq. (11):

\Psi_1^{k+1} = U_r \hat{\Sigma}_r V_r^T, \quad \hat{\Sigma}_r = \mathrm{diag}\big\{(\sigma_i - \lambda_1/\rho_k)_+\big\} \quad (11)

where (x)_+ = max(x, 0) denotes the positive part.

The optimization problem (10) also has an analytical solution. Let δ, w and θ be the elements of Ψ_2^{k+1}, W_{k+1} and Γ_2^k at the same coordinate. The optimal δ is computed as follows:

\delta = \begin{cases} w + \frac{1}{\rho_k}(\theta - \lambda_2), & w + \frac{1}{\rho_k}\theta > \frac{1}{\rho_k}\lambda_2 \\ 0, & -\frac{1}{\rho_k}\lambda_2 \le w + \frac{1}{\rho_k}\theta \le \frac{1}{\rho_k}\lambda_2 \\ w + \frac{1}{\rho_k}(\theta + \lambda_2), & w + \frac{1}{\rho_k}\theta < -\frac{1}{\rho_k}\lambda_2 \end{cases} \quad (12)
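Both closed-form updates are standard proximal operators; a minimal sketch of Eq. (11) (singular value thresholding) and Eq. (12) (elementwise soft-thresholding) follows, with illustrative names matching the ADM loop above.

```python
# Proximal operators for the trace norm (Eq. (11)) and the l1 norm (Eq. (12)).
import numpy as np

def prox_trace_norm(M, tau):
    # Eq. (11): shrink the singular values of M by tau, then rebuild M.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)        # (sigma_i - tau)_+
    return (U * s_shrunk) @ Vt

def prox_l1(M, tau):
    # Eq. (12): elementwise soft-thresholding of M = W + Gamma/rho.
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)
```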

Fig. 1. Samples of the KTH dataset; different rows represent different actions.

Fig. 2. Samples of the AR dataset; different rows represent different actions. The left part shows the outdoor environment and the right part the indoor environment.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. The Experimental Data and Benchmark Methods

In the experiments, we use two action recognition datasets: the KTH human action dataset with attribute annotations and our own action recognition dataset (AR). KTH is a standard benchmark dataset for human action recognition. It contains five action classes (boxing, clapping, waving, running and walking), each performed by 25 people in 4 scenarios, for a total of 499 video clips. We convert the video clips into the corresponding action images and take a subset of the images as the experimental dataset. Fig. 1 shows some samples. The AR dataset was collected by ourselves in outdoor and indoor environments. It contains 3113 images and the same 5 action classes (boxing, clapping, waving, running and walking). Fig. 2 shows some samples. We select 7 representative action attributes as the properties of the KTH and AR datasets, including chest-level arm movement, alternating arm forward motion, and so on. The action-attribute relationships are shown in Table I. Each image has its own class label and attribute labels. For the AR dataset, we normalize all images to 64×128 by down-sampling; the images in the KTH dataset are all resized to 160×120. Since HOG features model the outlines of different body actions very well, we use them to extract low-level features from the action images. The KTH and AR image features are 1200- and 1536-dimensional, respectively.
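As an illustration of this feature pipeline, the sketch below extracts HOG features with scikit-image; the cell and block parameters are assumptions (the paper does not state them), and the resulting feature dimensionality depends on those choices.

```python
# HOG feature extraction for a single action image (illustrative parameters).
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize

def extract_hog(image, size=(128, 64)):
    gray = rgb2gray(image) if image.ndim == 3 else image
    gray = resize(gray, size, anti_aliasing=True)  # normalize resolution
    return hog(gray, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2), block_norm='L2-Hys')
```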

To demonstrate the performance of our method, the proposed approach, Trace&Sparse, is compared with several related algorithms: l1-norm multitask learning (Lasso), trace-norm multitask learning (Trace), and multitask feature learning (Mtl).


TABLE I
THE DEFINITION OF VISUAL ACTION ATTRIBUTES IN THE KTH AND AR DATASETS.

                                                Boxing  Clapping  Waving  Running  Walking
Arm movement at chest level                        1        1       0        1        0
Alternating arm forward                            0        0       0        1        1
The arms swing back and forth moving forward       0        0       0        1        1
Folding arm shape                                  1        1       1        1        0
Straight arm shape                                 0        1       1        0        1
Alternating legs forward movement                  0        0       0        1        1
Translational movement                             0        0       0        1        1
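For concreteness, Table I can be encoded as a binary attribute-class matrix from which each image's attribute labels follow from its action label; the random variant used in Section IV-C below is also shown. This is an illustrative sketch, not the authors' code.

```python
# Attribute-class matrix of Table I: rows are the 7 attributes,
# columns are the 5 action classes.
import numpy as np

ACTIONS = ['boxing', 'clapping', 'waving', 'running', 'walking']
A = np.array([[1, 1, 0, 1, 0],   # arm movement at chest level
              [0, 0, 0, 1, 1],   # alternating arm forward
              [0, 0, 0, 1, 1],   # arms swing back and forth
              [1, 1, 1, 1, 0],   # folding arm shape
              [0, 1, 1, 0, 1],   # straight arm shape
              [0, 0, 0, 1, 1],   # alternating legs forward
              [0, 0, 0, 1, 1]])  # translational movement

def attribute_labels(action_idx):
    # Binary attribute labels y_{i(n+k)} for an image of class action_idx.
    return A[:, action_idx]

A_random = np.random.randint(0, 2, size=A.shape)  # random-matrix variant
```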

Fig. 3. Action recognition accuracy of the different methods on the KTH dataset under different training set percentages.

Fig. 4. Action recognition accuracy of the different methods on the AR dataset under different training set percentages.

B. Performance Comparison

We first evaluate the action recognition accuracy of our method and the benchmark methods. The attribute tasks and the target tasks in all methods use the same low-level features. We use two-fold cross validation: the dataset is divided into a training set and a testing set, the model is trained on the training set, and the testing set is used to measure action recognition accuracy. We run the cross validation 10 times and take the average of the 10 results as the final result. The percentage of the training set varies from 10% to 50% in increments of 10%. The experimental results on KTH and AR are shown in Fig. 3 and Fig. 4, respectively.
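A minimal sketch of this evaluation protocol follows, with train_and_predict standing in for any of the compared methods; the stratified random splits are an assumption about how the folds were drawn.

```python
# Repeated random splits at training percentages of 10%-50%, averaged over 10 runs.
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(X, y, train_and_predict, n_runs=10):
    results = {}
    for pct in (0.1, 0.2, 0.3, 0.4, 0.5):
        accs = []
        for seed in range(n_runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=pct, random_state=seed, stratify=y)
            accs.append(np.mean(train_and_predict(X_tr, y_tr, X_te) == y_te))
        results[pct] = np.mean(accs)  # average accuracy at this percentage
    return results
```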

Fig. 5. The confusion matrix of our proposed method on the KTH dataset when the training set is 30%.

From Fig. 3 and Fig. 4, we can see that the action recognition accuracy of the Trace&Sparse method is the highest; the proposed method is thus superior to the other three methods and has strong discrimination power for action recognition problems. The Lasso method does not take full advantage of the correlation between attribute tasks and target tasks, which leads to worse performance. The Trace method, the Mtl method and the proposed method all perform well by preserving intra-class similarities while considering inter-class differences. However, our proposed method can simultaneously choose the features with stronger discriminant power from high-dimensional features, so it outperforms the Trace method. The Trace method and the Mtl method are equivalent under special conditions and show similar results. Therefore, our proposed method is more competitive than the other benchmark methods.

Fig. 5 and Fig. 6 show the confusion matrices of our proposed method on the KTH and AR datasets when the training set is 30%. We find that running and walking are easily confused: due to their similar attributes and image appearances, running samples are wrongly recognized as walking with probabilities of 4.5% and 11.5%, respectively. For the AR dataset, part of the reason may be that the subjects had not yet reached the designated position when the running and walking data were collected.

C. Impact of the Attribute-Class Matrix on the Recognition Results

In this section, we study the impact of the attribute-class matrix on the action recognition accuracy.


Fig. 6. The confusion matrix of our proposed method on the AR dataset when the training set is 30%.

Fig. 7. Performance comparison on the KTH dataset between the user-defined attribute-class matrix and the random attribute-class matrix under different training set percentages.

To determine whether a random attribute-class matrix has an effect similar to that of the user-defined attribute-class matrix, we assign a set of random binary values (0 or 1) to the attribute-class matrix. We compare three methods: the first is the Trace&Sparse method with the user-defined attribute-class matrix; the second is the Trace&Sparse with A method, which uses a random attribute-class matrix; the last is the Trace&Sparse&NA method, which uses no visual attributes. The percentage of the training set varies from 10% to 50% in increments of 10%. The experimental results are shown in Fig. 7 and Fig. 8.

From the experimental results, we can see that the action recognition accuracy of the Trace&Sparse with A method (random attribute-class matrix) is lower than that of the Trace&Sparse method (user-defined attribute-class matrix), and even lower than that of the Trace&Sparse&NA method. Thus a random attribute-class matrix does not improve the action recognition accuracy; on the contrary, it causes the accuracy to drop. This shows that the user-defined attribute-class matrix is very important to our method.

Fig. 8. Performance comparison on the AR dataset between the user-defined attribute-class matrix and the random attribute-class matrix under different training set percentages.

V. CONCLUSION

In this work, we developed a new multitask learning model for human action recognition. We first investigated the intrinsic relationships among low-level features, visual attributes and action classes, and then used predefined and latent visual attributes to model attribute-action relevance. Based on this, we regarded the learning process as two correlated parts, i.e., action class learning and visual attribute learning, to further improve generalization performance. In this way, multitask learning can not only share the low-level features but also make use of high-level semantic information. We formulated the joint optimization objective function, regularized by low rank and sparsity, over the visual attributes and model parameters, and solved it. Experimental results on the KTH and AR datasets demonstrate that our multitask learning framework achieves better classification accuracy than the baseline methods in human action recognition.

ACKNOWLEDGMENT

This work was partially supported by NSFC under contract No. 61501451, by a scholarship from the China Scholarship Council (CSC) under Grant No. 201606315022, and by the XMU-NU Joint Strategic Partnership Fund.

REFERENCES

[1] J. K. Aggarwal and Q. Cai, "Human motion analysis: A review," Comput. Vis. Image Und., vol. 73, pp. 428-440, 1999.
[2] L. Liu, L. Shao, and P. Rockett, "Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition," Pattern Recogn., pp. 1810-1818, 2013.
[3] L. Shao, L. Ji, Y. Liu, and J. Zhang, "Human action segmentation and recognition via motion and shape analysis," Pattern Recogn. Lett., pp. 438-445, 2012.
[4] A. Yilmaz and M. Shah, "Actions sketch: A novel action representation," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog., 2005, pp. 984-989.
[5] Z. Lin, Z. Jiang, and L. S. Davis, "Recognizing actions by shape-motion prototype trees," in Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 444-451.
[6] A. Efros, A. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," in Proc. IEEE Int. Conf. Comput. Vis., 2003, pp. 726-733.
[7] M. Raptis and S. Soatto, "Tracklet descriptors for action modeling and video analysis," in Proc. Eur. Conf. Comput. Vis., 2010, pp. 577-590.
[8] J. Liu, Y. Yang, and M. Shah, "Learning semantic visual vocabularies using diffusion distance," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog., 2009, pp. 461-468.
[9] M. S. Ryoo and J. K. Aggarwal, "Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities," in Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 1593-1600.
[10] J. Fowler, "Compressive-projection principal component analysis," IEEE Trans. Image Proc., pp. 2230-2242, 2009.
[11] A. Fathi and G. Mori, "Action recognition by learning mid-level motion features," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog., 2008, pp. 1-8.
[12] S. J. Hwang, F. Sha, and K. Grauman, "Sharing features between objects and their attributes," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog., 2011, pp. 1761-1768.
[13] D. Parikh and K. Grauman, "Relative attributes," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 503-510.
[14] D. A. Vaquero, R. S. Feris, D. Tran, L. Brown, A. Hampapur, and M. Turk, "Attribute-based people search in surveillance environments," in Proc. IEEE Appl. Comput. Vis., 2009, pp. 1-8.
[15] Y. Wang and G. Mori, "A discriminative latent model of object classes and attributes," in Proc. Eur. Conf. Comput. Vis., 2010, pp. 155-168.
[16] A. Argyriou, T. Evgeniou, and M. Pontil, "Convex multi-task feature learning," Mach. Learn., pp. 243-272, 2008.
[17] B. Heisele, T. Serre, M. Pontil, T. Vetter, and T. Poggio, "Categorization by learning and combining object parts," in Proc. Neural Inform. Proc. Syst., 2001, pp. 1239-1245.
[18] M. Fazel, H. Hindi, and S. P. Boyd, "A rank minimization heuristic with application to minimum order system approximation," in Proc. Am. Control Conf., 2001, pp. 4734-4739.
[19] B. Recht, W. Xu, and B. Hassibi, "Necessary and sufficient conditions for success of the nuclear norm heuristic for rank minimization," in Proc. IEEE Conf. Decis. Control, 2008, pp. 3065-3070.
[20] M. Weimer, A. Karatzoglou, and A. Smola, "Improving maximum margin matrix factorization," Mach. Learn., pp. 263-276, 2008.
[21] R. Tomioka and K. Aihara, "Classifying matrices with a spectral regularization," in Proc. Intern. Conf. Mach. Learn., 2007, pp. 895-902.
[22] G. Obozinski, B. Taskar, and M. I. Jordan, "Joint covariate selection and joint subspace selection for multiple classification problems," Stat. Comput., pp. 231-252, 2010.
[23] X. Ren and Z. Lin, "Linearized alternating direction method with adaptive penalty and warm starts for fast solving transform invariant low-rank textures," Int. J. Comput. Vis., pp. 1-14, 2013.
[24] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola, "Maximum-margin matrix factorization," in Proc. Neural Inform. Proc. Syst., 2004, pp. 1329-1336.
[25] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., pp. 1-122, 2011.