Wav2vec-AD: Acoustic Unit Discovery Module-Integrated, Self-Supervised Contrastive Pre-training Approach for Speech Recognition
Received date: 2023-12-19
Accepted date: 2024-01-05
Online published: 2024-05-06
NURMEMET YOLWAS, SUN L X, LI X, et al. Wav2vec-AD: Acoustic unit discovery module-integrated, self-supervised contrastive pre-training approach for speech recognition [J]. Journal of Shanghai Jiaotong University (Science), 2026, 31(2): 289-297. DOI: 10.1007/s12204-024-2738-8