Journal of Shanghai Jiao Tong University ›› 2022, Vol. 56 ›› Issue (11): 1554-1560. doi: 10.16183/j.cnki.jsjtu.2021.079

• Electronic Information and Electrical Engineering •

A Grammatical Error Correction Method Based on a Pre-Trained Language Model

HAN Mingyue, WANG Yinglin()   

  1. School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China
  • Received: 2021-03-16 Online: 2022-11-28 Published: 2022-12-02
  • Corresponding author: WANG Yinglin E-mail: wang.yinglin@shufe.edu.cn
  • About the author: HAN Mingyue (1995-), female, from Zhoukou, Henan Province, is a Ph.D. candidate whose research covers natural language generation and textual causal inference.

Grammatical Error Correction by Transfer Learning Based on a Pre-Trained Language Model

HAN Mingyue, WANG Yinglin()   

  1. School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China
  • Received:2021-03-16 Online:2022-11-28 Published:2022-12-02
  • Contact: WANG Yinglin E-mail:wang.yinglin@shufe.edu.cn

Abstract:

Grammatical error correction (GEC) in natural language processing is a low-resource task: learning a GEC model incurs substantial annotation and training costs. To address this, transfer learning from MASS (masked sequence to sequence pre-training for language generation) is adopted to make full use of the language features already extracted by the pre-trained model. The model is fine-tuned on annotated GEC data, and specific preprocessing and postprocessing methods are combined to improve its performance, yielding a new GEC system, MASS-GEC. The system is evaluated on two public GEC tasks and, under limited resources, achieves better results than current GEC systems. Specifically, on the CoNLL14 dataset it scores 57.9 on F0.5, a metric that emphasizes precision; on the JFLEG dataset it scores 59.1 on GLEU, a metric based on the n-gram overlap between the system's corrections and the reference corrections. This method provides a new perspective on the low-resource problem in GEC: text features learned by a self-supervised pre-trained language model can be exploited to help solve the GEC task.
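As a side illustration (not taken from the paper), the F0.5 score mentioned above is the weighted F-score with β = 0.5, which weights precision more heavily than recall; a minimal sketch:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted F-score; beta < 1 puts more weight on precision,
    which is why GEC evaluation commonly reports F0.5."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With beta = 0.5, a precision-heavy system outscores a recall-heavy one:
# f_beta(0.8, 0.4) ≈ 0.667  >  f_beta(0.4, 0.8) ≈ 0.444
```

This asymmetry reflects the GEC setting: proposing a wrong "correction" is considered worse than missing an error.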

Key words: grammatical error correction, natural language generation, sequence to sequence

Abstract:

Grammatical error correction (GEC) is a low-resource task: annotation is costly and training is time consuming. In this paper, MASS-GEC is proposed to address this problem by transfer learning from a pre-trained language generation model, MASS (masked sequence to sequence pre-training for language generation). In addition, specific preprocessing and postprocessing strategies are applied to improve the performance of the GEC system. The system is evaluated on two public datasets and achieves competitive performance compared with the state of the art under limited resources. Specifically, it achieves 57.9 in terms of the F0.5 score, which places more emphasis on precision, on the CoNLL2014 task. On the JFLEG task, MASS-GEC achieves 59.1 in terms of the GLEU score, which measures the n-gram overlap between the model's output and the manually annotated reference corrections. This paper provides a new perspective: the low-resource problem in GEC can be alleviated by transferring general language knowledge from a self-supervised pre-trained language model.
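The n-gram overlap underlying GLEU can be sketched as below. This is a simplified precision-style overlap for illustration only; the actual GLEU metric also takes the erroneous source sentence into account, penalizing n-grams the system left uncorrected:

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of all n-grams of length n in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(hyp: list[str], ref: list[str], max_n: int = 4) -> float:
    """Fraction of hypothesis n-grams (n = 1..max_n) that also appear
    in the reference, with clipped counts as in BLEU-style metrics."""
    matches, total = 0, 0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matches += sum(min(count, r[gram]) for gram, count in h.items())
        total += sum(h.values())
    return matches / total if total else 0.0


perfect = ngram_overlap("he goes to school every day".split(),
                        "he goes to school every day".split())
partial = ngram_overlap("he go to school every day".split(),
                        "he goes to school every day".split())
# A perfect correction scores 1.0; a partial one scores strictly less.
```

The clipped-count pattern (`min(count, r[gram])`) prevents a hypothesis from being rewarded for repeating an n-gram more times than the reference contains it.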

Key words: grammatical error correction (GEC), natural language generation, sequence to sequence

CLC number: