Journal of Shanghai Jiao Tong University ›› 2022, Vol. 56 ›› Issue (11): 1554-1560. doi: 10.16183/j.cnki.jsjtu.2021.079

• Electronic Information and Electrical Engineering •

A Grammatical Error Correction Method Based on a Pre-Trained Language Model

HAN Mingyue, WANG Yinglin()   

  1. School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China
  • Received: 2021-03-16 Online: 2022-11-28 Published: 2022-12-02
  • Corresponding author: WANG Yinglin E-mail: wang.yinglin@shufe.edu.cn
  • About the author: HAN Mingyue (1995-), female, from Zhoukou, Henan Province, is a Ph.D. candidate whose research covers natural language generation and textual causal inference.

Grammatical Error Correction by Transfer Learning Based on a Pre-Trained Language Model

HAN Mingyue, WANG Yinglin()   

  1. School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China
  • Received:2021-03-16 Online:2022-11-28 Published:2022-12-02
  • Contact: WANG Yinglin E-mail:wang.yinglin@shufe.edu.cn

Abstract:

Grammatical error correction (GEC) in natural language processing is a low-resource task: learning a GEC model incurs substantial annotation and training costs. To address this, transfer learning from MASS (masked sequence to sequence pre-training for language generation) is adopted to make full use of the language features already extracted by the pre-trained model. The model is fine-tuned on annotated GEC data, and specific preprocessing and postprocessing methods are combined to improve its performance, yielding a new GEC system, MASS-GEC. The system is evaluated on two public GEC tasks and, under limited resources, achieves better results than current GEC systems. Specifically, on the CoNLL14 dataset it scores 57.9 on F0.5, a metric that emphasizes precision; on the JFLEG dataset it scores 59.1 on GLEU, a metric based on the n-gram overlap between the system's corrections and the reference corrections. This method provides a new perspective on the low-resource problem in GEC: text features learned by a self-supervised pre-trained language model can be exploited to help solve the GEC task.
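As a side illustration (not taken from the paper), the F0.5 score mentioned above is the weighted F-score with β = 0.5, which weights precision more heavily than recall; a minimal sketch:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted F-score; beta < 1 puts more weight on precision,
    which is why GEC evaluation commonly reports F0.5."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With beta = 0.5, a precision-heavy system outscores a recall-heavy one:
# f_beta(0.8, 0.4) ≈ 0.667  >  f_beta(0.4, 0.8) ≈ 0.444
```

This asymmetry reflects the GEC setting: proposing a wrong "correction" is considered worse than missing an error.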

Key words: grammatical error correction, natural language generation, sequence to sequence

Abstract:

Grammatical error correction (GEC) is a low-resource task: annotation is costly and training is time consuming. In this paper, MASS-GEC is proposed to address this problem by transfer learning from a pre-trained language generation model, MASS (masked sequence to sequence pre-training for language generation). In addition, specific preprocessing and postprocessing strategies are applied to improve the performance of the GEC system. The system is evaluated on two public datasets and achieves competitive performance compared with the state of the art under limited resources. Specifically, it achieves 57.9 in terms of the F0.5 score, which places more emphasis on precision, on the CoNLL2014 task. On the JFLEG task, MASS-GEC achieves 59.1 in terms of the GLEU score, which measures the n-gram overlap between the model's output and the manually annotated reference corrections. This paper provides a new perspective: the low-resource problem in GEC can be alleviated by transferring general language knowledge from a self-supervised pre-trained language model.
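The n-gram overlap underlying GLEU can be sketched as below. This is a simplified precision-style overlap for illustration only; the actual GLEU metric also takes the erroneous source sentence into account, penalizing n-grams the system left uncorrected:

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of all n-grams of length n in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(hyp: list[str], ref: list[str], max_n: int = 4) -> float:
    """Fraction of hypothesis n-grams (n = 1..max_n) that also appear
    in the reference, with clipped counts as in BLEU-style metrics."""
    matches, total = 0, 0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matches += sum(min(count, r[gram]) for gram, count in h.items())
        total += sum(h.values())
    return matches / total if total else 0.0


perfect = ngram_overlap("he goes to school every day".split(),
                        "he goes to school every day".split())
partial = ngram_overlap("he go to school every day".split(),
                        "he goes to school every day".split())
# A perfect correction scores 1.0; a partial one scores strictly less.
```

The clipped-count pattern (`min(count, r[gram])`) prevents a hypothesis from being rewarded for repeating an n-gram more times than the reference contains it.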

Key words: grammatical error correction (GEC), natural language generation, sequence to sequence

CLC number: