Action recognition model based on a spatiotemporal sampling graph convolutional network and a self-calibration mechanism

  • Abstract: Existing action recognition algorithms disregard the contextual dependencies in spatiotemporal information and lack multilevel receptive fields for feature extraction. To address these problems, this paper proposes a skeleton-based action recognition model built on a spatiotemporal sampling graph convolutional network with a self-calibration mechanism. First, the working principles of ST-GCN, 3D-GCN, Transformer, and the self-attention mechanism are introduced, and the inability of 3D-GCN and Transformer to effectively model the global and local spatiotemporal contexts, respectively, is analyzed. Second, a spatiotemporal sampling graph convolutional network is proposed for spatiotemporal context modeling. It treats runs of temporally consecutive frames as spatiotemporal samples, which divides the global action into multiple subactions. A nonlocal network computes the correlation between a single node and all nodes in the frames of one sample to establish local cross-spatiotemporal dependencies, and the nonlocal network combined with temporal convolution computes the correlation between a single sampled subaction and all subactions to establish global cross-spatiotemporal dependencies. Third, to enlarge the multilevel receptive field and capture more discriminative temporal features, a temporal self-calibrated convolutional network is proposed that convolves in two space-times of different scales and fuses the resulting features: one is the space-time at the original scale, and the other is a latent space-time at a smaller scale obtained by downsampling. The latter adaptively establishes dependencies between distant space-time positions and channels and models interchannel dependence by differentiating the characteristics of each channel. Fourth, the spatiotemporal sampling graph convolutional network and the temporal self-calibration network are combined into the proposed model, which is trained end to end in a multistream network. Finally, experiments on skeleton-based action recognition are conducted on the NTU-RGB+D and NTU-RGB+D120 skeleton datasets. Recognition accuracy on NTU-RGB+D reaches up to 95.2% under X-View and 88.8% under X-Sub, and the results on NTU-RGB+D120 confirm the generalization ability of the model. These findings indicate that the model extracts spatiotemporal features effectively and achieves high recognition accuracy.
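
The local cross-spatiotemporal step described in the abstract (each node is correlated with all nodes in the frames of one sample) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `windowed_nonlocal`, the embedded-Gaussian (softmax) form of the nonlocal block, the residual connection, the random projection matrices, and all sizes are assumptions chosen for illustration; the global step that relates subactions to each other is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_nonlocal(x, window, w_theta, w_phi, w_g):
    """Nonlocal attention restricted to windows of consecutive frames.

    x       : (T, V, C) skeleton features (T frames, V joints, C channels)
    window  : number of consecutive frames forming one sampled subaction
    w_theta : (C, D) query projection   (hypothetical parameter)
    w_phi   : (C, D) key projection     (hypothetical parameter)
    w_g     : (C, C) value projection   (hypothetical parameter)
    """
    T, V, C = x.shape
    assert T % window == 0, "T must be divisible by the window length"
    S = T // window                                # number of subactions
    tokens = x.reshape(S, window * V, C)           # (S, N, C), N = window * V
    theta = tokens @ w_theta                       # (S, N, D)
    phi = tokens @ w_phi                           # (S, N, D)
    g = tokens @ w_g                               # (S, N, C)
    # Correlation of every node with every node in the same window.
    attn = softmax(theta @ phi.transpose(0, 2, 1), axis=-1)   # (S, N, N)
    out = tokens + attn @ g                        # residual connection
    return out.reshape(T, V, C), attn

# Toy input: 8 frames, 25 joints, 16 channels, windows of 4 frames.
rng = np.random.default_rng(0)
T, V, C, D = 8, 25, 16, 8
x = rng.standard_normal((T, V, C))
w_theta = 0.1 * rng.standard_normal((C, D))
w_phi = 0.1 * rng.standard_normal((C, D))
w_g = 0.1 * rng.standard_normal((C, C))

y, attn = windowed_nonlocal(x, 4, w_theta, w_phi, w_g)
print(y.shape, attn.shape)
```

Because attention is computed per window, changing the frames of one subaction leaves the output of the other subactions unchanged. This is the sense in which the dependency is local in this sketch.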

     
