[1] MARON M E. Automatic indexing: An experimental inquiry [J]. Journal of the ACM, 1961, 8(3): 404-417.
[2] COVER T, HART P. Nearest neighbor pattern classification [J]. IEEE Transactions on Information Theory, 1967, 13(1): 21-27.
[3] JOACHIMS T. Text categorization with support vector machines: Learning with many relevant features [M]//Machine learning: ECML-98. Berlin, Heidelberg: Springer, 1998: 137-142.
[4] SCHNEIDER K M. A new feature selection score for multinomial naive Bayes text classification based on KL-divergence [C]// ACL Interactive Poster and Demonstration Sessions. Barcelona: ACL, 2004: 186-189.
[5] DAI W, XUE G R, YANG Q, et al. Transferring naive Bayes classifiers for text classification [C]// 22nd National Conference on Artificial Intelligence. Vancouver: AAAI, 2007: 540-545.
[6] CORTES C, VAPNIK V. Support-vector networks [J]. Machine Learning, 1995, 20(3): 273-297.
[7] JOACHIMS T. Transductive inference for text classification using support vector machines [C]// 16th International Conference on Machine Learning. Bled: IMLS, 1999: 200-209.
[8] LAI S W, XU L H, LIU K, et al. Recurrent convolutional neural networks for text classification [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2015, 29(1): 2267-2273.
[9] SUTSKEVER I, MARTENS J, HINTON G E. Generating text with recurrent neural networks [C]// 28th International Conference on Machine Learning. Bellevue: IMLS, 2011: 1017-1024.
[10] MANDIC D P, CHAMBERS J. Recurrent neural networks for prediction: Learning algorithms, architectures and stability [M]. Chichester: John Wiley & Sons, Inc., 2001.
[11] JIANG M Y, LIANG Y C, FENG X Y, et al. Text classification based on deep belief network and softmax regression [J]. Neural Computing and Applications, 2018, 29(1): 61-70.
[12] LEWIS M, LIU Y H, GOYAL N, et al. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension [C]// 58th Annual Meeting of the Association for Computational Linguistics. Online: ACL, 2020: 7871-7880.
[13] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training [EB/OL]. [2024-12-01]. https://www.mikecaptain.com/resources/pdf/GPT-1.pdf
[14] ZHANG Q, CHEN X. Applying BERT on the classification of Chinese legal documents [M]//Advances in Internet, data & web technologies. Cham: Springer, 2023: 215-222.
[15] WANG J, ZHANG J, HU B F. Optimal class-dependent discretization-based fine-grain hypernetworks for classification of microarray data [J]. Journal of Shanghai Jiao Tong University, 2013, 47(12): 1856-1862 (in Chinese).
[16] KOWSARI K, HEIDARYSAFA M, BROWN D E, et al. RMDL: Random multimodel deep learning for classification [C]// 2nd International Conference on Information System and Data Mining. Lakeland: ACM, 2018: 19-28.
[17] WU Y, JIANG M, XU J, et al. Clinical named entity recognition using deep learning models [C]// AMIA Annual Symposium Proceedings. Washington: AMIA, 2017: 1812-1819.
[18] MAGGE A, SCOTCH M, GONZALEZ-HERNANDEZ G. Clinical NER and relation extraction using Bi-Char-LSTMs and random forest classifiers [C]// 1st International Workshop on Medication and Adverse Drug Event Detection. Worcester: PMLR, 2018: 25-30.
[19] BAXTER J. A model of inductive bias learning [J]. Journal of Artificial Intelligence Research, 2000, 12: 149-198.
[20] THRUN S. Is learning the n-th thing any easier than learning the first? [C]// 9th International Conference on Neural Information Processing Systems. Denver: NIPS, 1995: 640-646.
[21] CARUANA R. Multitask learning [M]//Learning to learn. Boston: Springer, 1998: 95-133.
[22] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[23] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space [DB/OL]. (2013-01-16). https://arxiv.org/abs/1301.3781
[24] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// 31st Conference on Neural Information Processing Systems. Long Beach: NIPS, 2017: 1-11.
[25] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [C]// 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis: ACL, 2019: 4171-4186.
[26] DAI Z H, YANG Z L, YANG Y M, et al. Transformer-XL: Attentive language models beyond a fixed-length context [DB/OL]. (2019-01-09). https://arxiv.org/abs/1901.02860
[27] SUN Y, WANG S H, FENG S K, et al. ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation [DB/OL]. (2021-07-05). https://arxiv.org/abs/2107.02137
[28] DAUPHIN Y N, FAN A, AULI M, et al. Language modeling with gated convolutional networks [C]// 34th International Conference on Machine Learning. Sydney: PMLR, 2017: 933-941.
[29] LAFFERTY J, MCCALLUM A, PEREIRA F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data [C]// 18th International Conference on Machine Learning. Williamstown: IMLS, 2001: 282-289.
[30] CHUNG J, GULCEHRE C, CHO K H, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling [DB/OL]. (2014-12-11). https://arxiv.org/abs/1412.3555