J Shanghai Jiaotong Univ Sci ›› 2026, Vol. 31 ›› Issue (2): 241-247. doi: 10.1007/s12204-024-2726-z

Special Issue: Human-Machine Speech Communication

• Automation & Computer Technologies •

Improving ECAPA-TDNN Performance with Coordinate Attention


LIU Shuanghong, SONG Zhida, HE Liang

  1. School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China
  2. Xinjiang Key Laboratory of Signal Detection and Processing, Urumqi 830017, China
  3. Department of Electronic Engineering, and Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2023-12-19  Accepted: 2024-01-05  Online: 2026-04-01  Published: 2024-04-22

Abstract: Current mainstream networks, such as the squeeze-and-excitation residual neural network (SE-ResNet) and the emphasized channel attention, propagation and aggregation based time delay neural network (ECAPA-TDNN), strengthen speaker embedding extractors by incorporating squeeze-and-excitation (SE) attention into their convolutional blocks, enabling them to extract more discriminative speaker embeddings. However, SE attention encodes only inter-channel information and overlooks spatial positional and time-frequency information, both of which are crucial to model performance. In this paper, we first experimentally compare the effectiveness of several mainstream attention mechanisms from the computer vision domain within the ECAPA-TDNN model. We then focus on the substantial improvement that coordinate attention (CA) brings to ECAPA-TDNN: introducing CA helps the model embed time-frequency information into the channel representation. Even without AS-Norm, our proposed model achieves relative reductions of about 5.3% in equal error rate (EER) and 5.5% in minimum detection cost function (minDCF) on both the Voxceleb-O and Voxceleb-H test sets compared with the ECAPA-TDNN baseline. In addition, the EER is relatively reduced by 9.46% on the CN-Celeb1 test set. These results demonstrate that the CA module can effectively improve the generalization ability of the ECAPA-TDNN model.
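The abstract's core idea, embedding time-frequency information into the channel representation via directional pooling, can be illustrated with a minimal NumPy sketch. This is a simplified illustration only, not the authors' implementation: it omits the concatenation and shared bottleneck convolution of the original CA module, and the per-direction weight matrices `w_t` and `w_f` are hypothetical 1×1 transforms introduced for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x, w_t, w_f):
    """Simplified coordinate-attention sketch.

    x   : (C, T, F) feature map (channels, time, frequency)
    w_t : (C, C) transform for the time-direction descriptor
    w_f : (C, C) transform for the frequency-direction descriptor
    """
    # Directional average pooling: collapse one axis at a time, so each
    # descriptor keeps positional information along the remaining axis.
    z_t = x.mean(axis=2)  # (C, T): per-channel time descriptor
    z_f = x.mean(axis=1)  # (C, F): per-channel frequency descriptor

    # Per-direction gates: 1x1 transform followed by a sigmoid,
    # yielding attention weights in (0, 1).
    a_t = sigmoid(np.einsum('dc,ct->dt', w_t, z_t))  # (C, T)
    a_f = sigmoid(np.einsum('dc,cf->df', w_f, z_f))  # (C, F)

    # Rescale the input with both direction-aware attention maps.
    return x * a_t[:, :, None] * a_f[:, None, :]
```

Unlike SE attention, which squeezes the whole time-frequency plane into a single scalar per channel, the two pooled descriptors here retain position along the time and frequency axes, which is the property the abstract credits for the improvement.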

Key words: speaker verification, convolution attention mechanism, coordinate attention (CA), domain generalization, ECAPA-TDNN model

Abstract (translated from Chinese): Although deep learning has brought great progress to speaker verification, many challenges remain. A key one is extracting highly discriminative speaker representations. To address it, researchers have designed more sophisticated network structures such as SE-ResNet and ECAPA-TDNN, which add squeeze-and-excitation (SE) attention to their convolutional blocks to strengthen the speaker embedding extractor and thereby extract more discriminative speaker representations. However, SE attention models only channel information and ignores spatial positional and time-frequency information, which may help improve model performance and generalization; time-frequency information in particular plays a key role in speech tasks. In this paper, we first experimentally compare the effectiveness of several mainstream attention mechanisms from the computer vision domain for the ECAPA-TDNN model, analyzing how strongly channel, time-frequency, and spatial positional information affect its performance. We then focus on the substantial improvement that coordinate attention (CA) brings to ECAPA-TDNN: introducing CA embeds time-frequency information into the channel representation, helping the model extract richer time-frequency representations. Compared with the baseline, the proposed model improves performance on the Voxceleb-O, Voxceleb-E, and Voxceleb-H datasets, and results on the CN-Celeb1 test set further demonstrate that the CA module can effectively improve the generalization ability of the ECAPA-TDNN model.

Key words (translated from Chinese): speaker representation, convolutional attention mechanism, coordinate attention (CA), domain generalization, ECAPA-TDNN model
