J Shanghai Jiaotong Univ Sci, 2021, Vol. 26, Issue (4): 494-502. doi: 10.1007/s12204-021-2285-5


Word Embedding Bootstrapped Deep Active Learning Method to Information Extraction on Chinese Electronic Medical Record

MA Qunsheng (马群圣), CEN Xingxing (岑星星), YUAN Junyi (袁骏毅), HOU Xumin * (侯旭敏)   

  1. (Shanghai Chest Hospital, Shanghai Jiao Tong University, Shanghai 200030, China)
  • Publication date: 2021-08-28 Release date: 2021-06-06
  • Corresponding author: HOU Xumin * (侯旭敏) E-mail: hxmchest@163.com


Abstract:  Electronic medical records (EMRs) contain rich biomedical information and have great potential in disease diagnosis and biomedical research. However, EMR information is usually stored as unstructured text, which raises the cost of using it and hinders its application. This work presents a named entity recognition (NER) method for information extraction from Chinese EMRs, based on word embedding bootstrapped deep active learning, to ease the acquisition of medical information from Chinese EMRs and unlock its value. Deep active learning with a bi-directional long short-term memory network followed by a conditional random field (Bi-LSTM+CRF) captures the characteristics of the different kinds of information from a labeled corpus, while the continuous bag of words (CBOW) and skip-gram word embedding models are combined with this model to capture the text features of Chinese EMRs from an unlabeled corpus. The method is evaluated on NER tasks over the “medical history” content of Chinese EMRs. Experimental results show that the word embedding bootstrapped deep active learning method, using an unlabeled medical corpus, achieves better performance than the other models compared.
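
As a rough illustration of the ingredients named in the abstract, the sketch below shows how CBOW and skip-gram frame their training examples from unlabeled text, and how one active learning round might pick sentences for annotation. The function names, the character-level tokenization, the window size, and the least-confidence query rule are all assumptions made for illustration; the abstract does not state the paper's tokenization or query strategy.

```python
from typing import List, Sequence, Tuple


def skipgram_pairs(tokens: Sequence[str], window: int = 1) -> List[Tuple[str, str]]:
    """Skip-gram framing: predict each context token from the center token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: Sequence[str], window: int = 1) -> List[Tuple[List[str], str]]:
    """CBOW framing: predict the center token from its surrounding context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples


def least_confidence_query(confidences: Sequence[float], k: int) -> List[int]:
    """Assumed query rule: return indices of the k unlabeled sentences
    whose best predicted label sequence has the lowest model confidence."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return order[:k]


# Hypothetical example: a four-character fragment, tokenized per character.
tokens = list("患者咳嗽")
sg = skipgram_pairs(tokens)
cb = cbow_examples(tokens)

# Hypothetical confidences a tagger might assign to four unlabeled sentences.
to_annotate = least_confidence_query([0.95, 0.40, 0.70, 0.55], k=2)
```

In a full pipeline, the embeddings learned from the unlabeled corpus would initialize the input layer of the tagger, and the sentences selected by the query step would be sent for manual labeling before the tagger is retrained.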


Key words:  deep active learning, named entity recognition (NER), information extraction, word embedding, Chinese electronic medical record (EMR)

