Automation & Computer Technologies

EC-BERT: A BERT Language Model with Error Correction for Mandarin Chinese Speech Recognition

  • 1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China

Received date: 2023-12-19

Accepted date: 2024-01-05

Online published: 2024-04-22

Abstract

Attention-based encoder-decoder end-to-end models have achieved promising performance in automatic speech recognition (ASR). In practical applications, however, substitution errors occur frequently in ASR systems, particularly for characters with the same or similar pronunciation; according to statistics, homophones account for at least 50% of character errors. Our study therefore focuses on substitution errors involving identical or similar pronunciations. We propose a BERT language model with error correction (EC-BERT) for the ASR system. To mitigate the mismatch between the original pre-trained BERT model and our task, we design a two-stage training schedule: pre-training on a large amount of pseudo-paired data, followed by fine-tuning on a small amount of real paired data. Unlike other error correction models, ours requires neither an error detection network nor a mask mechanism; the BERT model directly learns to locate and correct errors. Experimental results show that the proposed method is effective, achieving a relative character error rate reduction of 19.2% over the connectionist temporal classification (CTC) greedy search result and 12.8% over the CTC-WFST result on the AISHELL-1 test set. We also show that the proposed EC-BERT model achieves results comparable to other error correction models with a shorter runtime and can easily be integrated into a practical ASR system.
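The pre-training stage described above relies on pseudo-paired data, i.e., clean sentences corrupted with pronunciation-confusable substitutions. Below is a minimal, illustrative sketch of how such homophone-substitution corruption might work; the tiny homophone table, the `corrupt` helper, and the substitution probability are assumptions for illustration, not the paper's actual data pipeline.

```python
import random

# Illustrative homophone groups keyed by (toneless-simplified) pinyin.
# A real system would build this table from a pronunciation lexicon.
HOMOPHONES = {
    "shi": ["是", "事", "市", "式"],
    "ta": ["他", "她", "它"],
}

# Reverse lookup: character -> its pinyin key.
CHAR_TO_PINYIN = {c: p for p, group in HOMOPHONES.items() for c in group}


def corrupt(sentence, sub_prob=0.15, rng=None):
    """Create a pseudo ASR hypothesis by randomly swapping characters
    for same-pronunciation alternatives; characters without a known
    homophone are left unchanged."""
    rng = rng or random.Random(0)
    out = []
    for ch in sentence:
        pinyin = CHAR_TO_PINYIN.get(ch)
        if pinyin and rng.random() < sub_prob:
            candidates = [c for c in HOMOPHONES[pinyin] if c != ch]
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


clean = "他是市长"
noisy = corrupt(clean, sub_prob=1.0)
print(clean, "->", noisy)  # (clean, noisy) forms one pseudo training pair
```

Each (noisy, clean) pair can then serve as an (input, target) example for pre-training the correction model before fine-tuning on real ASR hypotheses paired with reference transcripts.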

Cite this article

Xiao Sujie, Hao Ruipeng, Cheng Gaofeng, Xu Xiaoyan, Li Ta. EC-BERT: A BERT Language Model with Error Correction for Mandarin Chinese Speech Recognition [J]. Journal of Shanghai Jiaotong University (Science), 2026, 31(2): 282-288. DOI: 10.1007/s12204-024-2725-0

References

1. GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks [C]// 23rd International Conference on Machine Learning. Pittsburgh: IMLS, 2006: 369-376.
2. GRAVES A. Sequence transduction with recurrent neural networks [DB/OL]. (2012-11-14). https://arxiv.org/abs/1211.3711
3. CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition [C]//2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai: IEEE, 2016: 4960-4964.
4. WATANABE S, HORI T, KIM S, et al. Hybrid CTC/attention architecture for end-to-end speech recognition [J]. IEEE Journal of Selected Topics in Signal Processing, 2017, 11(8): 1240-1253.
5. LIU S L, YANG T, YUE T C, et al. PLOME: Pre-training with misspelled knowledge for Chinese spelling correction [C]// 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Online: ACL, 2021: 2991-3000.
6. LIU S L, SONG S K, YUE T C, et al. CRASpell: A contextual typo robust approach to improve Chinese spelling correction [C]//Findings of the Association for Computational Linguistics: ACL 2022. Dublin: ACL, 2022: 3008-3018.
7. ZHANG S H, HUANG H R, LIU J C, et al. Spelling error correction with soft-masked BERT [C]// 58th Annual Meeting of the Association for Computational Linguistics. Online: ACL, 2020: 882-890.
8. ZHANG R Q, PANG C, ZHANG C Q, et al. Correcting Chinese spelling errors with phonetic pre-training [C]//Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Online: ACL, 2021: 2250-2261.
9. JI T, YAN H, QIU X P. SpellBERT: A lightweight pretrained model for Chinese spelling check [C]// 2021 Conference on Empirical Methods in Natural Language Processing. Online: ACL, 2021: 3544-3551.
10. CHENG X Y, XU W D, CHEN K L, et al. SpellGCN: Incorporating phonological and visual similarities into language models for Chinese spelling check [C]// 58th Annual Meeting of the Association for Computational Linguistics. Online: ACL, 2020: 871-881.
11. LIAO J W, ESKIMEZ S, LU L Y, et al. Improving readability for automatic speech recognition transcription [J]. ACM Transactions on Asian and Low-Resource Language Information Processing, 2023, 22(5): 142.
12. MANI A, PALASKAR S, MERIPO N V, et al. ASR error correction and domain adaptation using machine translation [C]// 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona: IEEE, 2020: 6344-6348.
13. LENG Y C, TAN X, ZHU L C, et al. FastCorrect: Fast error correction with edit alignment for automatic speech recognition [DB/OL]. (2021-05-09). http://arxiv.org/abs/2105.03842
14. KIM S, HORI T, WATANABE S. Joint CTC-attention based end-to-end speech recognition using multi-task learning [C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans: IEEE, 2017: 4835-4839.
15. DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [DB/OL]. (2018-10-11). http://arxiv.org/abs/1810.04805
16. BU H, DU J Y, NA X Y, et al. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline [C]//2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment. Seoul: IEEE, 2017: 1-5.
17. YAO Z Y, WU D, WANG X, et al. WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit [C]//Interspeech 2021. Brno: ISCA, 2021: 4054-4058.
18. PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: A simple data augmentation method for automatic speech recognition [C]//Interspeech 2019. Graz: ISCA, 2019: 2613-2617.
19. GULATI A, QIN J, CHIU C C, et al. Conformer: Convolution-augmented transformer for speech recognition [C]//Interspeech 2020. Shanghai: ISCA, 2020: 5036-5040.
20. VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// 31st International Conference on Neural Information Processing Systems. Long Beach: NIPS, 2017: 6000-6010.