J Shanghai Jiaotong Univ Sci ›› 2026, Vol. 31 ›› Issue (2): 282-288.doi: 10.1007/s12204-024-2725-0

Special Issue: Human-Machine Speech Communication

• Automation & Computer Technologies •

EC-BERT: A BERT Language Model with Error Correction for Mandarin Chinese Speech Recognition


XIAO Sujie1,2, HAO Ruipeng1, CHENG Gaofeng1, XU Xiaoyan1, LI Ta1,2

  1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China;
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2023-12-19  Accepted: 2024-01-05  Online: 2026-04-01  Published: 2024-04-22

Abstract: The attention-based encoder-decoder end-to-end model has achieved promising performance in automatic speech recognition (ASR). However, in practical applications, substitution errors commonly occur in ASR systems, particularly for characters with the same or similar pronunciation. According to statistics, homophones cause at least 50% of character errors. Therefore, our study focuses on addressing the issue of substitution errors involving the same or similar pronunciation. In this study, we propose a BERT language model with error correction (EC-BERT) for the ASR system. We design a two-stage training schedule involving pre-training with a large amount of pseudo-paired data followed by fine-tuning with a small amount of real paired data, which mitigates the inconsistency between the original pre-trained BERT model and our task. Unlike other error correction models, we need neither an error detection network nor a mask mechanism, but directly use the BERT model to learn and correct the error locations. The experimental results show that our proposed method is effective, achieving a relative character error rate reduction of 19.2% compared with the connectionist temporal classification (CTC) greedy search result and of 12.8% compared with the CTC-WFST result on the AISHELL-1 test set. We also show that our proposed EC-BERT model achieves results comparable to other error correction models with a shorter runtime and can easily be integrated into a practical ASR system.
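EC-BERT itself fine-tunes a pre-trained BERT to map noisy ASR hypotheses to corrected text, and is not reproduced here. As a minimal, self-contained illustration of the underlying idea (choosing among same-pronunciation candidates by context score), the sketch below corrects homophone substitutions with a toy confusion set and toy bigram counts; all data and names are invented for illustration.

```python
# Toy homophone confusion sets: characters sharing the pronunciation "ta1".
# (Invented example data, not from the paper.)
CONFUSION = {
    "他": {"他", "她", "它"},
    "她": {"他", "她", "它"},
    "它": {"他", "她", "它"},
}

# Toy "language model": counts of two-character sequences from a tiny corpus.
BIGRAMS = {
    ("他", "说"): 7, ("她", "说"): 5, ("它", "说"): 0,
    ("了", "他"): 3, ("了", "她"): 2,
}


def bigram_score(chars):
    """Score a character sequence by summing toy bigram counts."""
    return sum(BIGRAMS.get((a, b), 0) for a, b in zip(chars, chars[1:]))


def correct(hypothesis):
    """Greedily replace each character with its best same-pronunciation
    candidate, keeping the character itself when it has no confusion set."""
    chars = list(hypothesis)
    for i, ch in enumerate(chars):
        chars[i] = max(
            CONFUSION.get(ch, {ch}),
            key=lambda cand: bigram_score(chars[:i] + [cand] + chars[i + 1:]),
        )
    return "".join(chars)
```

For example, `correct("它说")` returns `"他说"`, since the toy bigram ("他", "说") outscores the homophone alternatives. A real system replaces the hand-built confusion set and bigram counts with the contextual predictions of a fine-tuned BERT, which is what lets EC-BERT skip a separate error detection network.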

Key words: automatic speech recognition (ASR), end-to-end, BERT, error correction

