J Shanghai Jiaotong Univ Sci, 2026, Vol. 31, Issue 2: 289-297. doi: 10.1007/s12204-024-2738-8
Special Issue: Human-Machine Speech Communication
• Automation & Computer Technologies •
Nurmemet Yolwas1,2, Sun Lixu1,2, Li Xin1, Liu Qichao1,2, Wang Zhixiang1,2
Received: 2023-12-19
Accepted: 2024-01-05
Online: 2026-04-01
Published: 2024-05-06
Nurmemet Yolwas, Sun Lixu, Li Xin, Liu Qichao, Wang Zhixiang. Wav2vec-AD: Acoustic Unit Discovery Module-Integrated, Self-Supervised Contrastive Pre-training Approach for Speech Recognition[J]. J Shanghai Jiaotong Univ Sci, 2026, 31(2): 289-297.