Abstract:
Aiming at the insufficient feature expression ability and weak generalization ability of the traditional Res2Net model in voiceprint recognition, this paper proposes a feature extraction module, SE-DR-Res2Block, which combines dense connections and residual connections. Combining low-level features, which carry spatial information and emphasize detail, with high-level semantic features, which capture global information and abstract representations, compensates for the detail lost through abstraction. First, in the dense connection structure, the feature of each layer is derived from the outputs of all preceding layers, realizing feature reuse. Second, the structure and working principle of the ECAPA-TDNN network based on the traditional Res2Block are introduced; to achieve more efficient feature extraction, dense connections are used to mine features more fully. Building on the SE-Block, a more efficient feature extraction module, SE-DR-Res2Block, is then constructed by combining residual connections and dense connections. Compared with the traditional SE-Block, convolutional layers are used here instead of fully connected layers, which not only reduces the number of trainable parameters but also allows weight sharing, thereby reducing overfitting. Effective extraction of feature information from different layers is therefore essential for obtaining a multiscale representation and maximizing feature reuse. However, when collecting feature information at more scales, a large number of dense structures leads to a dramatic increase in parameters and computational complexity; replacing part of the dense structures with residual structures prevents this increase while largely maintaining performance. Finally, to verify the effectiveness of the module, SE-Res2Block, Full-SE-Res2Block, SE-DR-Res2Block, and Full-SE-DR-Res2Block are evaluated on different network models, using the Voxceleb1 and SITW (Speakers in the Wild) datasets. A comparison of Res2Net-50 models with different modules on the Voxceleb1 dataset shows that SE-DR-Res2Net-50 achieves the best equal error rate of 3.51%, validating the adaptability of the module across networks. Different modules on different networks were also compared through experiments and analyses on different datasets. The results show that the ECAPA-TDNN network using SE-DR-Res2Block achieves optimal equal error rates of 2.24% and 3.65% on Voxceleb1 and SITW, respectively. This verifies the feature expression ability of the module, and the corresponding results on different test datasets also confirm its excellent generalization ability.
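To make the SE-Block substitution described above concrete, the following minimal PyTorch-style sketch shows a squeeze-and-excitation gate built from 1x1 convolutions rather than fully connected layers. The class name `ConvSEBlock`, the reduction ratio, and the tensor layout are illustrative assumptions and not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn


class ConvSEBlock(nn.Module):
    """Squeeze-and-excitation gate using 1x1 Conv1d layers in place of the
    usual fully connected (Linear) layers, so channel weights come from
    shared convolution kernels rather than dense matrices."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        bottleneck = max(channels // reduction, 1)
        self.gate = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),  # squeeze
            nn.ReLU(inplace=True),
            nn.Conv1d(bottleneck, channels, kernel_size=1),  # excite
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pool over time, then rescale channels
        s = x.mean(dim=2, keepdim=True)   # global average pooling
        return x * self.gate(s)           # channel-wise re-weighting


if __name__ == "__main__":
    feats = torch.randn(4, 512, 200)          # dummy (batch, channels, frames)
    print(ConvSEBlock(512)(feats).shape)      # torch.Size([4, 512, 200])
```

Because the 1x1 convolutions act only along the channel dimension, this sketch is functionally close to an FC-based SE gate while sharing weights across the pooled representation, which is the parameter-saving effect the abstract refers to.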