J Shanghai Jiaotong Univ Sci, 2026, Vol. 31, Issue 2: 289-297. doi: 10.1007/s12204-024-2738-8
Special Issue: Human-Machine Speech Communication
• Automation & Computer Technologies •
Nurmemet Yolwas1,2, Sun Lixu1,2, Li Xin1, Liu Qichao1,2, Wang Zhixiang1,2
Received: 2023-12-19
Accepted: 2024-01-05
Online: 2026-04-01
Published: 2024-05-06
Nurmemet Yolwas, Sun Lixu, Li Xin, Liu Qichao, Wang Zhixiang. Wav2vec-AD: Acoustic Unit Discovery Module-Integrated, Self-Supervised Contrastive Pre-training Approach for Speech Recognition[J]. J Shanghai Jiaotong Univ Sci, 2026, 31(2): 289-297.