Abstract
Human behaviour and action recognition are vital components of effective surveillance video analysis and play a key role in maintaining public safety. Current approaches, such as 3D Convolutional Neural Networks (3D CNN) and two-stream neural networks (2SNN), are often computationally inefficient because of their large parameter counts. To address these challenges, we introduce HARNet, a lightweight residual 3D CNN architecture based on directed acyclic graphs and designed specifically for efficient human action recognition. HARNet employs a novel pipeline that generates spatial motion data from raw video inputs, enabling robust learning of latent representations of human motion. Unlike traditional methods, our approach processes spatial and motion information within a single stream, effectively exploiting both types of cues. To further improve the discriminative power of the extracted features, we apply a Support Vector Machine (SVM) classifier to the latent representations obtained from HARNet’s fully connected layer. Comprehensive evaluations on the UCF101, HMDB51, and KTH datasets show performance gains of 2.75%, 10.94%, and 0.18%, respectively. These results highlight the strength of HARNet’s streamlined design and the effectiveness of combining an SVM classifier with deep feature learning for accurate and efficient human action recognition in surveillance videos. This work advances reliable video analysis for real-world applications.
Keywords: 3D Convolutional Neural Networks (3D CNN), Directed Acyclic Graphs, Human Action Recognition Network (HARNet), Spatial Motion, Support Vector Machine (SVM).
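
A minimal sketch of the hybrid idea summarized above, extracting latent features from the fully connected layer of a single-stream 3D CNN and classifying them with an SVM, is given below. The tiny network, layer sizes, class count, and dummy inputs are hypothetical stand-ins for illustration only; they are not HARNet's actual architecture or data.

# Illustrative sketch (hypothetical, not the authors' code): a tiny single-stream
# 3D CNN produces fully connected (FC) features, and an SVM is fit on those
# features, mirroring the deep-features-plus-SVM pipeline described in the abstract.
import torch
import torch.nn as nn
from sklearn.svm import SVC

class TinyActionNet(nn.Module):
    """Minimal 3D-CNN feature extractor standing in for a HARNet-style backbone."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),  # spatio-temporal convolution
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                     # global pooling over T, H, W
        )
        self.fc = nn.Linear(16, feat_dim)                # FC layer whose output feeds the SVM

    def forward(self, clips):                            # clips: (N, 3, T, H, W)
        x = self.features(clips).flatten(1)
        return self.fc(x)                                # latent representation

# Dummy tensors standing in for preprocessed spatial-motion clips and action labels.
clips = torch.randn(32, 3, 8, 32, 32)
labels = torch.randint(0, 5, (32,))

model = TinyActionNet()
with torch.no_grad():
    feats = model(clips).numpy()                         # extract FC-layer features

svm = SVC(kernel="rbf")                                  # SVM classifier on the deep features
svm.fit(feats, labels.numpy())
preds = svm.predict(feats)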