Abstract:
A skeleton-based action recognition model, a spatiotemporal sampling graph convolutional network based on a self-calibration mechanism, is proposed to address two shortcomings of existing action recognition algorithms: they disregard the contextual dependence of spatiotemporal information and lack multilevel receptive fields for feature extraction. First, this paper introduces the working principles of ST-GCN, 3D-GCN, the Transformer, and the self-attention mechanism and analyzes why 3D-GCN and the Transformer cannot effectively model the global and local spatiotemporal contexts, respectively. Second, a spatiotemporal sampling graph convolutional network is proposed to model the spatiotemporal context effectively. This network divides the global action into multiple subactions by sampling a series of consecutive multiframe segments, establishes local cross-temporal dependencies by using a non-local network to compute the correlation between a single node and all nodes within the sampled frames, and establishes global cross-temporal dependencies by combining the non-local network with temporal convolution to compute the correlation between a single sampled subaction and all subactions. Third, to enlarge the multilevel receptive field and capture more discriminative temporal features, a temporal self-calibrating convolutional network is proposed that convolves at two different spatiotemporal scales and fuses the resulting features: one branch operates in the original spatiotemporal space, while the other operates in a smaller latent space obtained by downsampling; the latter adaptively establishes long-range spatiotemporal and channel dependencies and models inter-channel dependence by discriminating the features of each channel. The spatiotemporal sampling graph convolutional network and the temporal self-calibrating convolutional network are then combined into the proposed self-calibrated spatiotemporal sampling graph convolutional network, which is trained end to end as a multistream network. Finally, experiments on the NTU-RGB+D and NTU-RGB+D 120 skeleton action datasets confirm the effectiveness and feature-extraction ability of the model: recognition accuracy on NTU-RGB+D reaches 95.2% under the X-View protocol and 88.8% under the X-Sub protocol, and results on NTU-RGB+D 120 confirm the model's generalization ability. These results demonstrate that the model attains excellent recognition accuracy and generalization and extracts spatiotemporal features effectively.
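To make the spatiotemporal sampling idea concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the local cross-temporal step: an embedded-Gaussian non-local block that relates each joint to every joint within a sampled window of consecutive frames. The module name WindowNonLocal, the window length tau, and the channel reduction are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WindowNonLocal(nn.Module):
    """Non-local attention restricted to sampled windows of `tau` frames.
    Illustrative sketch, not the paper's code."""
    def __init__(self, channels, tau=4):
        super().__init__()
        self.tau = tau
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)  # query embedding
        self.phi = nn.Conv2d(channels, inter, 1)    # key embedding
        self.g = nn.Conv2d(channels, inter, 1)      # value embedding
        self.out = nn.Conv2d(inter, channels, 1)    # restore channels

    def forward(self, x):
        # x: (N, C, T, V) = batch, channels, frames, joints.
        n, c, t, v = x.shape
        assert t % self.tau == 0, "T must be divisible by the window length"
        # Fold consecutive tau-frame windows (subactions) into the batch,
        # so correlations are computed only inside each sampled window.
        w = x.view(n, c, t // self.tau, self.tau, v)
        w = w.permute(0, 2, 1, 3, 4).reshape(-1, c, self.tau, v)
        q = self.theta(w).flatten(2)                 # (B, C', tau*V)
        k = self.phi(w).flatten(2)
        val = self.g(w).flatten(2)
        # Correlation of every node with all nodes in the window.
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (B, tau*V, tau*V)
        y = (val @ attn.transpose(1, 2)).view(-1, q.size(1), self.tau, v)
        y = w + self.out(y)                          # residual connection
        y = y.view(n, t // self.tau, c, self.tau, v)
        return y.permute(0, 2, 1, 3, 4).reshape(n, c, t, v)
```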
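The temporal self-calibrating convolution can be sketched in the same spirit: one branch convolves at the original temporal scale, while a second branch convolves in a temporally downsampled latent space whose upsampled response gates the first branch channel-wise. The module name, kernel size, and downsampling ratio r below are again assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalSelfCalibConv(nn.Module):
    """Two-scale temporal convolution with self-calibration gating.
    Illustrative sketch, not the paper's code."""
    def __init__(self, channels, kernel_size=9, r=4):
        super().__init__()
        pad = (kernel_size - 1) // 2
        # Branch at the original temporal scale.
        self.conv_orig = nn.Conv2d(channels, channels, (kernel_size, 1),
                                   padding=(pad, 0))
        # Branch in a smaller latent space obtained by temporal downsampling.
        self.down = nn.AvgPool2d((r, 1), stride=(r, 1))
        self.conv_latent = nn.Conv2d(channels, channels, (kernel_size, 1),
                                     padding=(pad, 0))
        # Final temporal convolution on the calibrated features.
        self.conv_cal = nn.Conv2d(channels, channels, (kernel_size, 1),
                                  padding=(pad, 0))

    def forward(self, x):
        # x: (N, C, T, V) = batch, channels, frames, joints.
        n, c, t, v = x.shape
        # Latent-scale response, upsampled back to T frames; its enlarged
        # receptive field carries long-range temporal context.
        latent = F.interpolate(self.conv_latent(self.down(x)),
                               size=(t, v), mode='nearest')
        # Sigmoid gate reweights each channel and time step, modelling
        # inter-channel and long-range spatiotemporal dependence.
        gate = torch.sigmoid(x + latent)
        return self.conv_cal(self.conv_orig(x) * gate)
```

This follows the general self-calibrated convolution pattern (two branches, one in a downsampled latent space whose sigmoid response calibrates the other); the exact layer layout in the paper may differ.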