Current mainstream networks, such as the squeeze-and-excitation residual network (SE-ResNet) and the emphasized channel attention, propagation and aggregation time delay neural network (ECAPA-TDNN), incorporate squeeze-and-excitation (SE) attention within their convolutional blocks to extract more discriminative speaker embeddings. However, SE attention encodes only inter-channel information and overlooks spatial positional and time-frequency information, both of which are crucial to model performance. In this paper, we first experimentally compare the effectiveness of several mainstream attention mechanisms from the computer vision domain within the ECAPA-TDNN model. We then focus on coordinate attention (CA), which brings substantial improvements by embedding time-frequency positional information into the channel representation. Even without AS-Norm, the proposed model achieves relative reductions of about 5.3% in equal error rate (EER) and 5.5% in minimum detection cost function (minDCF) on both the VoxCeleb-O and VoxCeleb-H test sets compared with the ECAPA-TDNN baseline. In addition, the EER on the CN-Celeb1 test set is reduced by a relative 9.46%. These results demonstrate that the CA module effectively improves the generalization ability of the ECAPA-TDNN model.
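For concreteness, below is a minimal PyTorch sketch of a coordinate attention block operating on a time-frequency feature map of shape (batch, channels, freq, time). It follows the mechanism proposed by Hou et al. (CVPR 2021): the map is average-pooled along each spatial axis separately, the pooled descriptors pass through a shared 1x1 convolution, and two per-axis sigmoid gates rescale the input. The class name, the reduction ratio, and the way such a block would be wired into ECAPA-TDNN's convolutional blocks are illustrative assumptions, not the paper's verified implementation.

```python
# Minimal sketch of a coordinate attention (CA) block for a
# time-frequency feature map of shape (batch, channels, freq, time),
# following Hou et al. (CVPR 2021). Integration details with
# ECAPA-TDNN are assumptions for illustration.
import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        # Shared 1x1 transform applied to the concatenated pooled maps.
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        # Separate 1x1 convs produce the per-axis attention weights.
        self.conv_f = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_t = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, f, t = x.size()
        # Pool along time -> (n, c, f, 1); along frequency -> (n, c, 1, t).
        x_f = x.mean(dim=3, keepdim=True)
        x_t = x.mean(dim=2, keepdim=True)
        # Concatenate along one spatial axis so a single conv sees both.
        y = torch.cat([x_f, x_t.permute(0, 1, 3, 2)], dim=2)  # (n, c, f+t, 1)
        y = self.act(self.bn1(self.conv1(y)))
        y_f, y_t = torch.split(y, [f, t], dim=2)
        a_f = torch.sigmoid(self.conv_f(y_f))                      # (n, c, f, 1)
        a_t = torch.sigmoid(self.conv_t(y_t.permute(0, 1, 3, 2)))  # (n, c, 1, t)
        # Unlike SE, the reweighting preserves positional information
        # along both the frequency and time axes.
        return x * a_f * a_t
```

In contrast to SE attention, which collapses the whole time-frequency plane into a single scalar per channel, the two factorized gates a_f and a_t retain where along the frequency and time axes the salient information lies, which is the property the abstract attributes to CA.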