
Human Action Recognition Using Time Delay Input Radial Basis Function Networks

Davood Kalhor, Ishak Aris, Trifa Moaini and Izhal Abdul Halin
Dept. of Electrical and Electronic Engineering, Faculty of Engineering, Universiti Putra Malaysia
Serdang, Selangor, Malaysia

[email protected], [email protected], [email protected], [email protected]

Abstract — This paper presents a fast, vision-based method for the two problems of human action representation and recognition. The first problem is addressed by constructing an action descriptor from the spatiotemporal data of action silhouettes, based on appearance and motion features. For action classification, a new Radial Basis Function network (RBF), called the Time Delay Input Radial Basis Function network (TDIRBF), is proposed by introducing time delay units to the RBF in a novel way. The TDIRBF offers several desirable features, such as an easier learning process and more flexibility. The representational power and speed of the proposed method were explored on a publicly available dataset. In experiments implemented in MATLAB on standard PCs, the average time for constructing a feature vector from a high-resolution video was about 20 ms per frame (i.e., 50 fps), and the classifier ran above 15 fps. Furthermore, the proposed approach demonstrated good results in terms of both execution time and overall performance (a new measure that combines accuracy and speed into a single metric).

Keywords - action recognition; action representation; motion descriptor; neural network; radial basis function network

and 2) a significant improvement in overall performance, approximately an 1800-fold increase (843,762,427/456,432), is observed. Similarly, the method presented in this paper performs much better than Space-Time Shapes [12] in terms of speed and overall performance. For instance, the proposed action descriptor is about 3000 times faster ((0.6/0.02) × (1024×678)/(110×70)), and overall performance is improved by more than 600 times (600 < 843,762,427/1,255,485). In addition, the accuracy of our method is almost comparable to that of these state-of-the-art methods, particularly when the network is trained with objective 2.
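For readers who want to check the round figures above, the quoted ratios work out as follows, assuming (as the comparison implies) that 0.6 s and 0.02 s are the per-frame processing times and 110×70 and 1024×678 pixels are the corresponding frame sizes:

\[
\frac{0.6}{0.02} \times \frac{1024 \times 678}{110 \times 70}
= 30 \times \frac{694{,}272}{7{,}700} \approx 2705, \qquad
\frac{843{,}762{,}427}{456{,}432} \approx 1849, \qquad
\frac{843{,}762{,}427}{1{,}255{,}485} \approx 672 > 600.
\]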

VI. CONCLUSION AND FUTURE WORK

This paper presented a fast method for understanding human actions from video sequences. To achieve this goal, a new action descriptor, based on appearance and motion, and an action classifier (TDIRBF) were proposed. The TDIRBF has a few advantages over TDRBFs, including simplicity (in structure and training) and flexibility (the ability to train for different objectives). Furthermore, the application of the proposed method is not limited to a specific type of action, such as cyclic or acyclic; it can be used for both groups. Based on the empirical evidence summarized in Table 1 and discussed in Section 5, the suggested descriptor is appropriate for real-time applications with frame rates of up to 50 fps. Comparison with two state-of-the-art methods in the literature demonstrated that the proposed method improves very significantly on those works in terms of speed and overall performance while preserving an average accuracy above 90%. The best accuracy obtained on the UIUC dataset was 94.5%. In this setting, the overall system was more than one hundred times faster than both Metric Learning and Space-Time Shapes.
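To make the structure just described concrete, the following is a minimal sketch in Python/NumPy (not the authors' MATLAB implementation) of a time-delay-input RBF classifier of the kind summarized above: a tapped delay line stacks each frame descriptor with its previous descriptors, the stacked vector is passed through Gaussian radial basis units, and a linear output layer produces class scores. The delay depth, centre placement, Gaussian width, and output weights below are illustrative assumptions, not the parameters or training procedure reported in the paper.

import numpy as np


class TDIRBFSketch:
    """Illustrative time-delay-input RBF classifier (not the authors' exact model)."""

    def __init__(self, centers, width, n_classes, delays=2, seed=None):
        self.centers = np.asarray(centers, dtype=float)  # (n_centers, (delays+1)*feat_dim)
        self.width = float(width)                        # shared Gaussian width (assumed)
        self.delays = int(delays)
        rng = np.random.default_rng(seed)
        # Linear output weights; in practice these would be fitted to labelled data
        # (e.g. by least squares), which is where different training objectives enter.
        self.W = rng.normal(scale=0.1, size=(self.centers.shape[0], n_classes))

    def _delay_stack(self, frames):
        # Tapped delay line: concatenate frame t with frames t-1 .. t-delays
        # (the first frame is repeated for the earliest time steps).
        frames = np.asarray(frames, dtype=float)         # (T, feat_dim)
        return np.vstack([
            np.concatenate([frames[max(t - k, 0)] for k in range(self.delays + 1)])
            for t in range(frames.shape[0])
        ])                                               # (T, (delays+1)*feat_dim)

    def _rbf_layer(self, X):
        # Gaussian radial basis activations for each stacked input vector.
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * self.width ** 2))     # (T, n_centers)

    def predict(self, frames):
        # Per-frame class scores; a video-level label could be obtained by voting.
        scores = self._rbf_layer(self._delay_stack(frames)) @ self.W
        return scores.argmax(axis=1)


# Toy usage: 10 frames of a 6-dimensional descriptor, 4 RBF centres, 3 classes.
rng = np.random.default_rng(0)
feat_dim, delays = 6, 2
frames = rng.normal(size=(10, feat_dim))
centers = rng.normal(size=(4, (delays + 1) * feat_dim))
net = TDIRBFSketch(centers, width=1.5, n_classes=3, delays=delays, seed=0)
print(net.predict(frames))   # one predicted class index per frame

The point the sketch illustrates is the one made above: the temporal context enters only through the input stacking, so the remainder of the network is a standard RBF classifier and can be trained with conventional RBF procedures for whichever objective is chosen, which is one way to read the simplicity and flexibility claims.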

As discussed in Section 5, a high degree of self-occlusion, which is a major drawback of single-view silhouettes, causes a significant increase in the similarity between some actions. This was the most serious obstacle to the proposed approach reaching perfect accuracy (i.e., 100%). One possible avenue for future work is to employ an algorithm that selects a proper view of an action (from multiple cameras) before it is passed to the proposed method.

REFERENCES

[1] S.-U. Jung and M. S. Nixon, "Heel strike detection based on human walking movement for surveillance analysis," Pattern Recognition Letters, vol. 34, pp. 895-902, 2013.

[2] W. Hu, T. Tan, L. Wang, and S. Maybank, "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 34, pp. 334-352, 2004.

[3] M. Pantic, A. Pentland, A. Nijholt, and T. S. Huang, "Human Computing and Machine Understanding of Human Behavior: A Survey," in Artificial Intelligence for Human Computing, vol. 4451, T. Huang, A. Nijholt, M. Pantic, and A. Pentland, Eds., Springer Berlin Heidelberg, 2007, pp. 47-71.

[4] P.-C. Chung and C.-D. Liu, "A daily behavior enabled hidden Markov model for human behavior understanding," Pattern Recognition, vol. 41, pp. 1572-1580, 2008.

[5] T. D'Orazio and M. Leo, "A review of vision-based systems for soccer video analysis," Pattern Recognition, vol. 43, pp. 2911-2926, 2010.

[6] F. Rukun, X. Songhua, and G. Weidong, "Example-Based Automatic Music-Driven Conventional Dance Motion Synthesis," IEEE Transactions on Visualization and Computer Graphics, vol. 18, pp. 501-515, 2012.

[7] A. F. Bobick, "Movement, activity and action: the role of knowledge in the perception of motion," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 352, p. 1257, 1997.

[8] T. B. Moeslund, A. Hilton, and V. Krüger, "A survey of advances in vision-based human motion capture and analysis," Computer Vision and Image Understanding, vol. 104, pp. 90-126, 2006.

[9] A. Iosifidis, A. Tefas, and I. Pitas, "View-invariant action recognition based on artificial neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, pp. 412-424, 2012.

[10] M.-C. Roh, H.-K. Shin, and S.-W. Lee, "View-independent human action recognition with Volume Motion Template on single stereo camera," Pattern Recognition Letters, vol. 31, pp. 639-647, 2010.

[11] Y. Shen and H. Foroosh, "View-Invariant Action Recognition from Point Triplets," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 1898-1905, 2009.

[12] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as Space-Time Shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, pp. 2247-2253, 2007.

[13] C. Rao, M. Shah, and T. Syeda-Mahmood, "Invariance in motion analysis of videos," in Proceedings of the ACM International Multimedia Conference and Exhibition, Berkeley, CA., 2003, pp. 518-527.

[14] H. J. Seo and P. Milanfar, "Action Recognition from One Example," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 867-882, 2011.

[15] D. Tran and A. Sorokin, "Human Activity Recognition with Metric Learning," in Computer Vision - ECCV 2008, vol. 5302, D. Forsyth, P. Torr, and A. Zisserman, Eds., Springer Berlin Heidelberg, 2008, pp. 548-561.

[16] J. Yamato, J. Ohya, and K. Ishii, "Recognizing Human Action in Time-Sequential Images using Hidden Markov Model," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1992, pp. 379-385.

[17] X. Zhen, L. Shao, D. Tao, and X. Li, "Embedding Motion and Structure Features for Action Recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, pp. 1182-1190, 2013.

[18] H. Jiang, M. S. Drew, and Z.-N. Li, "Action Detection in Cluttered Video With Successive Convex Matching," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, pp. 50-64, 2010.

[19] S. Ali and M. Shah, "Human Action Recognition in Videos Using Kinematic Features and Multiple Instance Learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 288-303, 2010.

[20] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pp. 257-267, 2001.

[21] D. Wu and L. Shao, "Silhouette analysis-based action recognition via exploiting human poses," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, pp. 236-243, 2013.

[22] M. R. Berthold, "A time delay radial basis function network for phoneme recognition," in Proc. IEEE International Conference on Neural Networks, 1994, pp. 4470-4472a, vol. 7.

[23] A. J. Howell and H. Buxton, "Learning identity with radial basis function networks," Neurocomputing, vol. 20, pp. 15-34, 1998.

[24] J. K. Aggarwal and Q. Cai, "Human Motion Analysis: A Review," Computer Vision and Image Understanding, vol. 73, pp. 428-440, 1999.

[25] R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, pp. 976-990, 2010.

[26] D. Weinland, R. Ronfard, and E. Boyer, "A survey of vision-based methods for action representation, segmentation and recognition," Computer Vision and Image Understanding, vol. 115, pp. 224-241, 2011.

[27] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: A review," ACM Comput. Surv., vol. 43, pp. 1-43, 2011.

[28] Y. Cui, D. L. Swets, and J. J. Weng, "Learning-Based hand sign recognition using SHOSLIF-M," in Proceedings of the 5th International Conference on Computer Vision (ICCV'95), 1995, pp. 631-636.

[29] T. Darrell and A. Pentland, "Space-time gestures," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'93), 1993, pp. 335-340.

[30] T. E. Starner and A. Pentland, "Visual Recognition of American Sign Language Using Hidden Markov Models," in Proc. Int'l Workshop on Automatic Face and Gesture Recognition, 1995.

[31] A. Rosenfeld and J. L. Pfaltz, "Sequential Operations in Digital Picture Processing," Journal of the Association for Computing Machinery, vol. 13, pp. 471-494, 1966.

[32] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, "Performance of optical flow techniques," International Journal of Computer Vision, vol. 12, pp. 43-77, 1994.

[33] H. Liu, T.-H. Hong, M. Herman, T. Camus, and R. Chellappa, "Accuracy vs Efficiency Trade-offs in Optical Flow Algorithms," Computer Vision and Image Understanding, vol. 72, pp. 271-286, 1998.

[34] A. F. Bobick and J. W. Davis, "An Appearance-Based Representation of Action," in Proceedings of the 13th International Conference on Pattern Recognition (ICPR'96), 1996, pp. 307-312, vol. 1.

[35] R. Venkatesh Babu and K. R. Ramakrishnan, "Recognition of human actions using motion history information extracted from the compressed video," Image and Vision Computing, vol. 22, pp. 597-607, 2004.

[36] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Computer Vision and Image Understanding, vol. 104, pp. 249-257, 2006.

[37] Y. Yacoob and M. J. Black, "Parameterized Modeling and Recognition of Activities," Computer Vision and Image Understanding, vol. 73, pp. 232-247, 1999.

[38] J. Hoey and J. J. Little, "Representation and Recognition of Complex Human Motion," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'00), 2000, pp. 752-759, vol. 1.

[39] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing Action at a Distance," in Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV'03), 2003, pp. 726-733, vol. 2.

[40] Y. Wang and G. Mori, "Human Action Recognition by Semilatent Topic Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 1762-1774, 2009.

[41] A. Veeraraghavan, A. K. Roy-Chowdhury, and R. Chellappa, "Matching Shape Sequences in Video with Applications in Human Movement Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 1896-1909, 2005.

[42] S. Nayak, S. Sarkar, and B. Loeding, "Distribution-based dimensionality reduction applied to articulated motion recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 795-810, 2009.

[43] N. M. Oliver, B. Rosario, and A. P. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 831-843, 2000.

[44] J. Gu, X. Ding, S. Wang, and Y. Wu, "Action and Gait Recognition From Recovered 3-D Human Joints," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 40, pp. 1021-1033, 2010.

[45] H. Park, J.-I. Park, U.-M. Kim, and W. Woo, "Emotion Recognition from Dance Image Sequences Using Contour Approximation," in Structural, Syntactic, and Statistical Pattern Recognition, vol. 3138, A. Fred, T. Caelli, R. W. Duin, A. Campilho, and D. de Ridder, Eds., Springer Berlin Heidelberg, 2004, pp. 547-555.

[46] J. Owens and A. Hunter, "Application of the self-organising map to trajectory classification," in Proc. 3rd IEEE International Workshop on Visual Surveillance, 2000, pp. 77-83.

[47] M.-H. Yang, N. Ahuja, and M. Tabb, "Extraction of 2D motion trajectories and its application to hand gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 1061-1074, 2002.

[48] E. Petlenkov, S. Nõmm, J. Vain, and F. Miyawaki, "Application of self organizing Kohonen map to detection of surgeon motions during endoscopic surgery," in Proc. International Joint Conference on Neural Networks, Hong Kong, 2008, pp. 2806-2811.

[49] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, pp. 328-339, 1989.

[50] M. T. Musavi, W. Ahmed, K. H. Chan, K. B. Faris, and D. M. Hummels, "On the training of radial basis function classifiers," Neural Networks, vol. 5, pp. 595-603, 1992.

[51] W. Huang and Q. M. J. Wu, "Human Action Recognition Based on Self Organizing Map," in Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP'10), 2010, pp. 2130-2133.

[52] Z. Lin, Z. Jiang, and L. S. Davis, "Recognizing actions by shape-motion prototype trees," in Proceedings of the 12th IEEE International Conference on Computer Vision (ICCV'09), 2009, pp. 444-451.

[53] A. Ruina and R. Pratap, Introduction to Statics and Dynamics. Oxford University Press, 2010.

[54] C.-F. Juang and L.-T. Chen, "Moving object recognition by a shape-based neural fuzzy network," Neurocomputing, vol. 71, pp. 2937-2949, 2008.

[55] T. Guha and R. K. Ward, "Learning sparse representations for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 1576-1588, 2012.

[56] R. Minhas, A. A. Mohammed, and Q. J. Wu, "Incremental learning in human action recognition based on snippets," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, pp. 1529-1541, 2012.