Automation & Computer Technologies

Wav2vec-AD: Acoustic Unit Discovery Module-Integrated, Self-Supervised Contrastive Pre-training Approach for Speech Recognition

1. College of Computer Science and Technology, Xinjiang University, Urumqi 830017, China; 2. Xinjiang Multilingual Information Technology Laboratory, Xinjiang University, Urumqi 830017, China

Received date: 2023-12-19

Accepted date: 2024-01-05

Online published: 2024-05-06

Abstract

An effective speech recognition model requires an ample supply of labeled data for supervised training, a requirement that poses a major challenge for low-resource languages when building a high-accuracy speech recognition system. In this paper, we propose a novel contrastive pre-training strategy that fuses an acoustic unit discovery module with Wav2vec 2.0, herein referred to as Wav2vec-AD. For the first time in speech contrastive learning, this strategy enables controlled negative sample selection via the acoustic unit discovery module, thereby strengthening the model's representation learning capability. Furthermore, we thoroughly analyze how negative samples should be selected in different situations so as to improve the speech representations learned by the model and its performance on downstream tasks. In the low-resource setting, compared with the Wav2vec 2.0 baseline, Wav2vec-AD achieves absolute word error rate (WER) reductions of 1.55% and 1.46% on the dev-clean and test-clean subsets of LibriSpeech, respectively. Moreover, absolute WER reductions of 0.63% and 4.21% are achieved on Arabic and Turkish datasets, respectively.
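The abstract does not show the training objective itself, but the "controlled negative sample selection" it describes can be illustrated with a minimal PyTorch sketch of an InfoNCE-style contrastive loss in which distractors are drawn only from frames whose discovered acoustic unit differs from the positive's. Everything below (the function name, the unit_ids input, num_negatives, temperature) is an illustrative assumption for exposition, not the paper's actual implementation.

```python
# Minimal sketch of unit-controlled negative sampling for an InfoNCE-style
# contrastive loss, in the spirit of the approach described in the abstract.
# Names and hyperparameters here are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, unit_ids, num_negatives=10, temperature=0.1):
    """context, targets: (T, D) frame-level vectors from the model;
    unit_ids: (T,) discovered acoustic-unit label per frame, assumed to come
    from the acoustic unit discovery module."""
    T, _ = targets.shape
    losses = []
    for t in range(T):
        # Candidate distractors: frames whose discovered unit differs from the
        # positive's, so frames sharing its unit are never used as negatives.
        candidates = torch.nonzero(unit_ids != unit_ids[t], as_tuple=False).squeeze(1)
        if candidates.numel() < num_negatives:
            continue  # too few cross-unit distractors for this frame
        idx = candidates[torch.randperm(candidates.numel())[:num_negatives]]
        # Positive first, then the sampled negatives.
        cands = torch.cat([targets[t:t + 1], targets[idx]], dim=0)   # (1+K, D)
        logits = F.cosine_similarity(context[t:t + 1], cands) / temperature
        # Cross-entropy with class 0 = the positive, as in InfoNCE.
        losses.append(F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
    if not losses:
        raise ValueError("no frame had enough cross-unit distractors")
    return torch.stack(losses).mean()
```

For comparison, Wav2vec 2.0 samples its distractors uniformly from other masked time steps of the same utterance; filtering candidates by discovered unit, as in the sketch above, is one plausible way such a module could suppress false negatives that share the positive's phonetic content.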

Cite this article

Nurmemet Yolwas, Sun Lixu, Li Xin, Liu Qichao, Wang Zhixiang. Wav2vec-AD: Acoustic Unit Discovery Module-Integrated, Self-Supervised Contrastive Pre-training Approach for Speech Recognition [J]. Journal of Shanghai Jiaotong University (Science), 2026, 31(2): 289-297. DOI: 10.1007/s12204-024-2738-8

