Modern Convolutional Neural Network Architectures in Human Activity Recognition
Convolutional neural network (CNN) architectures have evolved from early designs such as AlexNet and VGG16 to modern ones such as ResNet, Inception V3, Inception-ResNet, MobileNet V2, NASNet, and PNASNet. The main characteristic of a CNN is its ability to extract features automatically from input images, which facilitates activity recognition and classification; indeed, every additional layer derives more relevant and complex features. CNNs have also achieved perfect classification of highly similar activities that were previously extremely difficult to classify. In this paper, we evaluate modern CNNs in terms of their human activity recognition accuracy and compare the results with state-of-the-art methods. We used two public data sets: HMDB (shooting a gun, kicking, falling to the floor, punching) and the Weizmann data set (walking, running, jumping, bending, one-hand waving, two-hand waving, jumping in place, jumping jack, skipping). Our experimental results indicate that the CNN with the NASNet architecture achieves the best performance of the six CNN architectures on both human activity data sets (HMDB and Weizmann).
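Both HMDB and Weizmann are video data sets, while the CNNs above classify individual input images; the abstract does not specify how per-frame predictions are combined into a per-video activity label, but a common and simple scheme is majority voting over frames. The sketch below is purely illustrative (the function name and activity labels are assumptions, not the authors' pipeline):

```python
from collections import Counter

def video_label(frame_predictions):
    """Aggregate per-frame CNN class predictions into a single
    video-level activity label by majority vote over all frames."""
    if not frame_predictions:
        raise ValueError("need at least one frame prediction")
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(frame_predictions).most_common(1)[0][0]

# Hypothetical per-frame predictions for one Weizmann clip:
frames = ["walking", "walking", "running", "walking", "jumping"]
print(video_label(frames))  # -> walking
```

More elaborate alternatives (e.g. averaging per-frame softmax scores before taking the argmax) weight confident frames more heavily, but majority voting is a reasonable baseline for frame-based evaluation.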
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 2818-2826.
K. He, X. Zhang, S. Ren, J. Sun, “Identity Mappings in Deep Residual Networks,” in European Conference on Computer Vision (ECCV), Amsterdam, 2016, pp. 630-645.
C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning,” in the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, 2017, pp. 4278-4284.
B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, “Learning Transferable Architectures for Scalable Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018, pp. 8697-8710.
C. Liu, L.-J. Li, B. Zoph, M. Neumann, et al., “Progressive Neural Architecture Search,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 19-34.
M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018, pp. 4510-4520.
T. Plötz, N. Y. Hammerla, P. Olivier, “Feature learning for activity recognition in ubiquitous computing,” in International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, 2011, pp. 1729-1734.
M. Zeng, L. T. Nguyen, B. Yu, O. Mengshoel, et al., “Convolutional neural networks for human activity recognition using mobile sensors,” in 6th International Conference on Mobile Computing, Applications and Services (MobiCASE), Austin, TX, USA, 2014, pp. 197-205.
J. Yosinski, J. Clune, Y. Bengio, H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems 27 (NIPS ’14), NIPS Foundation, Montreal, Canada, 2014, pp. 3320-3328.
J. Yang, M. Nguyen, “Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition,” in 24th International Joint Conference on Artificial Intelligence (IJCAI), Buenos Aires, Argentina, 2015, pp. 3995-4001.
R. Girshick, J. Donahue, T. Darrell, J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, 2014, pp. 1-21.
J. Long, E. Shelhamer, T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 3431-3440.
A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, 2014, pp. 1725-1732.
A. Toshev, C. Szegedy, “DeepPose: Human pose estimation via deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, 2014, pp. 1653-1660.
N. Wang, D. Yeung, “Learning a deep compact image representation for visual tracking,” in Advances in Neural Information Processing Systems Conference, USA, 2013, pp. 809-817.
C. Dong, C. Loy, K. He, X. Tang, “Learning a deep convolutional network for image super-resolution,” in Computer Vision–ECCV Conference, Zurich, Switzerland, 2014, pp. 184-199.
K. Simonyan, A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), San Diego, May 2015, pp. 1-14.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, et al., “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, June 2015, pp. 1-9.
HMDB data set, http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database.
L. Gorelick, M. Blank, E. Shechtman, M. Irani, R. Basri, “Actions as space-time shapes,” IEEE Trans. Pattern Anal. Machine Intell., vol. 29, issue 12, pp. 2247-2253, Dec. 2007.
M. Elshourbagy, E. Hemayed, M. Fayek, “Enhanced bag of words using k-means for human activity recognition,” Egyptian Informatics Journal, vol. 17, issue 2, pp. 227-237, July 2016.
M. Bregonzio, T. Xiang, S. Gong, “Fusing appearance and distribution information of interest points for action recognition,” Pattern Recognition Journal (Elsevier), vol. 45, issue 3, March 2012, pp. 1220-1234.
M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, “Actions as space-time shapes,” in Tenth IEEE International Conference on Computer Vision (ICCV’05), Beijing, China, 2005, pp. 1395–1402.
N. Ikizler, P. Duygulu, “Human action recognition using distribution of oriented rectangular patches,” in Human Motion–Understanding, Modelling, Capture and Animation Conference, Rio de Janeiro, Brazil, 2007, pp. 271-284.
P. Scovanner, A. Ali, M. Shah, “A 3-dimensional sift descriptor and its application to action recognition,” in Proceedings of the 15th ACM International Conference on Multimedia, Augsburg, Germany, 2007, pp. 357-360.
J. C. Niebles, H. Wang, L. Fei-Fei, “Unsupervised learning of human action categories using spatial-temporal words,” Int. J. Comput. Vision, vol. 79, pp. 299-318, March 2008.
A. Klaser, M. Marszalek, C. Schmid, “A spatio-temporal descriptor based on 3D-gradients,” in Proceedings of the British Machine Vision Conference, Leeds, United Kingdom, 2008, pp. 271-275.
A. Fathi, G. Mori, “Action recognition by learning mid-level motion features,” in IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, 2008, pp. 1-8.
A. Khelalef, F. Ababsa, N. Benoudjit, “An Efficient Human Activity Recognition Technique,” HAL open science, vol. 29, issue 4, pp. 702-715, June 2020.
S. H. Shabbeer Basha, “An information-rich sampling technique over spatio-temporal CNN for classification of human actions in video,” arXiv preprint arXiv:2002.02100v2, Feb. 2020.
X. Wang, L. Wang, Y. Qiao, “A comparative study of encoding, pooling and normalization methods for action recognition,” in Asian Conference on Computer Vision (ACCV), 2012, pp. 572-585.
H. Wang, C. Schmid, “Action recognition with improved trajectories,” in IEEE International Conference on Computer Vision, Sydney, Australia, 2013, pp. 3551-3558.
X. Peng, L. Wang, X. Wang, Y. Qiao, “Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice,” Computer Vision and Image Understanding (Elsevier), vol. 150, pp. 109-125, September 2016.
X. Peng, C. Zou, Y. Qiao, Q. Peng, “Action recognition with stacked fisher vectors,” in European Conference on Computer Vision (ECCV), Zurich, 2014, pp. 581-595.
K. Simonyan, A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems Conference, Montreal, Quebec, Canada, 2014, pp. 568-576.
Z. Lan, M. Lin, X. Li, A. G. Hauptmann, B. Raj, “Beyond Gaussian pyramid: multi-skip feature stacking for action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015, pp. 204-212.
L. Wang, Y. Qiao, X. Tang, “Action recognition with trajectory-pooled deep-convolutional descriptors,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015, pp. 4305-4314.
X. Zhen, L. Shao, “Action recognition via spatio-temporal local features: a comprehensive study,” Image and Vision Computing, vol. 50, pp. 1-13, June 2016.
Y. Shi, Y. Tian, Y. Wang, T. Huang, “Sequential deep trajectory descriptor for action recognition with three-stream CNN,” IEEE Transactions on Multimedia, vol. 19, issue 7, pp. 1510-1520, July 2017.
S. Nazir, M. Haroon, S. Velastin, “Human Action Recognition using Multi-Kernel Learning for Temporal Residual Network,” in 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Prague, Czech Republic, 2019, pp. 420-426.
Copyright (c) 2022 H. Mahmoud
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Advances in Computing and Engineering
Academy Publishing Center (APC)
Arab Academy for Science, Technology and Maritime Transport (AASTMT)