J Shanghai Jiaotong Univ Sci ›› 2026, Vol. 31 ›› Issue (2): 289-297.doi: 10.1007/s12204-024-2738-8

Special Issue: Human-Machine Speech Communication

• Automation & Computer Technologies •

Wav2vec-AD: Acoustic Unit Discovery Module-Integrated, Self-Supervised Contrastive Pre-training Approach for Speech Recognition


Nurmemet Yolwas1,2, Sun Lixu1,2, Li Xin1, Liu Qichao1,2, Wang Zhixiang1,2

  1. College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China; 2. Xinjiang Multilingual Information Technology Laboratory, Xinjiang University, Urumqi 830017, China
  • Received: 2023-12-19  Accepted: 2024-01-05  Online: 2026-04-01  Published: 2024-05-06

Abstract: An effective speech recognition model requires a large amount of labeled data for supervised training, which makes building a high-accuracy speech recognition system a major challenge for low-resource languages. In this paper, we propose a novel contrastive-learning pre-training strategy that fuses an acoustic unit discovery module with Wav2vec 2.0, herein referred to as Wav2vec-AD. For the first time in speech contrastive learning, this strategy enables controlled negative sample selection via the acoustic unit discovery module, thereby strengthening the model's representation learning capability. Furthermore, we conduct a thorough analysis of negative sample selection in different situations to improve the speech representations learned by the model and its effectiveness on downstream tasks. In the low-resource setting, compared with the Wav2vec 2.0 baseline, Wav2vec-AD achieves absolute word error rate (WER) improvements of 1.55% and 1.46% on the dev-clean and test-clean subsets of LibriSpeech, respectively. Moreover, absolute WER improvements of 0.63% and 4.21% are obtained on Arabic and Turkish datasets, respectively.
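The core idea of controlled negative selection can be illustrated with a minimal sketch: in an InfoNCE-style contrastive loss, negatives for an anchor frame are drawn only from frames assigned to a *different* acoustic unit, so the model is never asked to push apart frames the unit discovery module considers acoustically identical. This is a toy NumPy illustration under stated assumptions, not the paper's implementation; the function, dimensions, and unit labels are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def contrastive_loss(anchor, positive, candidates, cand_units, anchor_unit,
                     n_neg=4, temperature=0.1):
    """InfoNCE-style loss for one anchor frame.

    Negatives are drawn only from candidate frames whose acoustic-unit
    label differs from the anchor's, mimicking controlled negative
    selection guided by an acoustic unit discovery module.
    """
    # Keep only frames from other acoustic units as potential negatives.
    pool = candidates[cand_units != anchor_unit]
    idx = rng.choice(len(pool), size=min(n_neg, len(pool)), replace=False)
    negatives = pool[idx]

    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Similarity logits: positive pair first, then the sampled negatives.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    # Softmax cross-entropy with the positive at index 0.
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

# Toy example: 8 candidate frames in 16-d, each tagged with a discovered unit.
frames = rng.normal(size=(8, 16))
units = np.array([0, 1, 1, 2, 0, 3, 2, 1])
anchor = frames[0] + 0.01 * rng.normal(size=16)   # noisy context representation
loss = contrastive_loss(anchor, frames[0], frames, units, anchor_unit=0)
print(float(loss) > 0)
```

In Wav2vec 2.0, by contrast, negatives are sampled uniformly from other masked positions in the same utterance, so frames from the same phone-like unit can be pulled apart as false negatives; filtering the pool by discovered unit labels is the change this sketch highlights.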

Key words: self-supervised learning, automatic speech recognition, contrastive learning, low-resource


