J Shanghai Jiaotong Univ Sci ›› 2026, Vol. 31 ›› Issue (2): 282-288.doi: 10.1007/s12204-024-2725-0

Special Issue: Human-Machine Speech Communication

• Automation & Computer Technologies •

EC-BERT: A BERT Language Model with Error Correction for Mandarin Chinese Speech Recognition


XIAO Sujie1,2, HAO Ruipeng1, CHENG Gaofeng1, XU Xiaoyan1, LI Ta1,2

  1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China;
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2023-12-19  Accepted: 2024-01-05  Online: 2026-04-01  Published: 2024-04-22

Abstract: The attention-based encoder-decoder end-to-end model has achieved promising performance in automatic speech recognition (ASR). However, in practical applications, substitution errors commonly occur in ASR systems, particularly for characters with the same or similar pronunciation. According to statistics, homophones cause at least 50% of character errors. Therefore, our study focuses on addressing the issue of substitution errors involving the same or similar pronunciation. In this study, we propose a BERT language model with error correction (EC-BERT) for the ASR system. We design a two-stage training schedule involving pre-training with a large amount of pseudo-paired data followed by fine-tuning with a small amount of real paired data, which mitigates the inconsistency between the original pre-trained BERT model and our task. Unlike other error correction models, we need neither an error detection network nor a mask mechanism, but directly use the BERT model to learn and correct the error locations. The experimental results show that our proposed method is effective, achieving a relative character error rate reduction of 19.2% compared with the connectionist temporal classification (CTC) greedy search result and of 12.8% compared with the CTC-WFST result on the AISHELL-1 test set. We also show that our proposed EC-BERT model achieves results comparable to other error correction models with a shorter runtime and can easily be integrated into a practical ASR system.
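EC-BERT itself fine-tunes a pre-trained BERT to map noisy ASR hypotheses to corrected text, and is not reproduced here. As a minimal, self-contained illustration of the underlying idea (choosing among same-pronunciation candidates by context score), the sketch below corrects homophone substitutions with a toy confusion set and toy bigram counts; all data and names are invented for illustration.

```python
# Toy homophone confusion sets: characters sharing the pronunciation "ta1".
# (Invented example data, not from the paper.)
CONFUSION = {
    "他": {"他", "她", "它"},
    "她": {"他", "她", "它"},
    "它": {"他", "她", "它"},
}

# Toy "language model": counts of two-character sequences from a tiny corpus.
BIGRAMS = {
    ("他", "说"): 7, ("她", "说"): 5, ("它", "说"): 0,
    ("了", "他"): 3, ("了", "她"): 2,
}


def bigram_score(chars):
    """Score a character sequence by summing toy bigram counts."""
    return sum(BIGRAMS.get((a, b), 0) for a, b in zip(chars, chars[1:]))


def correct(hypothesis):
    """Greedily replace each character with its best same-pronunciation
    candidate, keeping the character itself when it has no confusion set."""
    chars = list(hypothesis)
    for i, ch in enumerate(chars):
        chars[i] = max(
            CONFUSION.get(ch, {ch}),
            key=lambda cand: bigram_score(chars[:i] + [cand] + chars[i + 1:]),
        )
    return "".join(chars)
```

For example, `correct("它说")` returns `"他说"`, since the toy bigram ("他", "说") outscores the homophone alternatives. A real system replaces the hand-built confusion set and bigram counts with the contextual predictions of a fine-tuned BERT, which is what lets EC-BERT skip a separate error detection network.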

Key words: automatic speech recognition (ASR), end-to-end, BERT, error correction

