J Shanghai Jiaotong Univ Sci ›› 2026, Vol. 31 ›› Issue (2): 241-247. doi: 10.1007/s12204-024-2726-z

Special Issue: Human-Machine Speech Communication

• Automation & Computer Technologies •

Improving ECAPA-TDNN Performance with Coordinate Attention


LIU Shuanghong, SONG Zhida, HE Liang

  1. School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China
  2. Xinjiang Key Laboratory of Signal Detection and Processing, Urumqi 830017, China
  3. Department of Electronic Engineering, and Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2023-12-19  Accepted: 2024-01-05  Online: 2026-04-01  Published: 2024-04-22

Abstract: Current mainstream networks, such as the squeeze-and-excitation residual neural network (SE-ResNet) and the emphasized channel attention, propagation and aggregation based time delay neural network (ECAPA-TDNN), strengthen speaker embedding extractors by incorporating squeeze-and-excitation (SE) attention into their convolutional blocks, enabling them to extract more discriminative speaker embeddings. However, SE attention encodes only inter-channel information and overlooks spatial positional and time-frequency information, both of which are crucial to model performance. In this paper, we first experimentally compare the effectiveness of several mainstream attention mechanisms from the computer vision domain within the ECAPA-TDNN model. We then focus on the substantial improvement that coordinate attention (CA) brings to ECAPA-TDNN: introducing CA helps the model embed time-frequency information into the channel representation. Even without AS-Norm, our proposed model achieves relative reductions of about 5.3% in equal error rate (EER) and 5.5% in minimum detection cost function (minDCF) on both the Voxceleb-O and Voxceleb-H test sets compared with the ECAPA-TDNN baseline. In addition, the EER is relatively reduced by 9.46% on the CN-Celeb1 test set. These results demonstrate that the CA module can effectively improve the generalization ability of the ECAPA-TDNN model.
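The abstract's core idea, embedding time-frequency information into the channel representation via directional pooling, can be illustrated with a minimal NumPy sketch. This is a simplified illustration only, not the authors' implementation: it omits the concatenation and shared bottleneck convolution of the original CA module, and the per-direction weight matrices `w_t` and `w_f` are hypothetical 1×1 transforms introduced for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x, w_t, w_f):
    """Simplified coordinate-attention sketch.

    x   : (C, T, F) feature map (channels, time, frequency)
    w_t : (C, C) transform for the time-direction descriptor
    w_f : (C, C) transform for the frequency-direction descriptor
    """
    # Directional average pooling: collapse one axis at a time, so each
    # descriptor keeps positional information along the remaining axis.
    z_t = x.mean(axis=2)  # (C, T): per-channel time descriptor
    z_f = x.mean(axis=1)  # (C, F): per-channel frequency descriptor

    # Per-direction gates: 1x1 transform followed by a sigmoid,
    # yielding attention weights in (0, 1).
    a_t = sigmoid(np.einsum('dc,ct->dt', w_t, z_t))  # (C, T)
    a_f = sigmoid(np.einsum('dc,cf->df', w_f, z_f))  # (C, F)

    # Rescale the input with both direction-aware attention maps.
    return x * a_t[:, :, None] * a_f[:, None, :]
```

Unlike SE attention, which squeezes the whole time-frequency plane into a single scalar per channel, the two pooled descriptors here retain position along the time and frequency axes, which is the property the abstract credits for the improvement.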

Key words: speaker verification, convolution attention mechanism, coordinate attention (CA), domain generalization, ECAPA-TDNN model

Abstract (translated from Chinese): Although deep learning has brought great progress to speaker verification, many challenges remain. A key one is extracting highly discriminative speaker representations. To address it, researchers have designed more sophisticated network structures such as SE-ResNet and ECAPA-TDNN, which add squeeze-and-excitation (SE) attention to their convolutional blocks to strengthen the speaker embedding extractor and thereby extract more discriminative speaker representations. However, SE attention models only channel information and ignores spatial positional and time-frequency information, which may help improve model performance and generalization; time-frequency information in particular plays a key role in speech tasks. In this paper, we first experimentally compare the effectiveness of several mainstream attention mechanisms from the computer vision domain for the ECAPA-TDNN model, analyzing how strongly channel, time-frequency, and spatial positional information affect its performance. We then focus on the substantial improvement that coordinate attention (CA) brings to ECAPA-TDNN: introducing CA embeds time-frequency information into the channel representation, helping the model extract richer time-frequency representations. Compared with the baseline, the proposed model improves performance on the Voxceleb-O, Voxceleb-E, and Voxceleb-H datasets, and results on the CN-Celeb1 test set further demonstrate that the CA module can effectively improve the generalization ability of the ECAPA-TDNN model.

Key words (translated from Chinese): speaker representation, convolutional attention mechanism, coordinate attention (CA), domain generalization, ECAPA-TDNN model
