J Shanghai Jiaotong Univ Sci ›› 2021, Vol. 26 ›› Issue (4): 494-502. doi: 10.1007/s12204-021-2285-5

• Computer & Communication Engineering •

Word Embedding Bootstrapped Deep Active Learning Method to Information Extraction on Chinese Electronic Medical Record

MA Qunsheng (马群圣), CEN Xingxing (岑星星), YUAN Junyi (袁骏毅), HOU Xumin * (侯旭敏)   

  1. (Shanghai Chest Hospital, Shanghai Jiao Tong University, Shanghai 200030, China)
  2. (Shanghai Chest Hospital, Shanghai Jiao Tong University, Shanghai 200030, China)
  • Online: 2021-08-28 Published: 2021-06-06
  • Contact: HOU Xumin * (侯旭敏) E-mail: hxmchest@163.com

Abstract:  Electronic medical records (EMRs) contain rich biomedical information and have great potential in disease diagnosis and biomedical research. However, EMR information usually takes the form of unstructured text, which raises the cost of using it and hinders its application. This work presents an effective named entity recognition (NER) method for information extraction from Chinese EMRs, based on word embedding bootstrapped deep active learning, to ease the acquisition of medical information from Chinese EMRs and unlock its value. Deep active learning with a bi-directional long short-term memory network followed by a conditional random field (Bi-LSTM+CRF) captures the characteristics of different kinds of information from a labeled corpus, while the continuous bag-of-words and skip-gram word embedding models are each combined with this model to capture the text features of Chinese EMRs from an unlabeled corpus. The method was evaluated on NER tasks over the “medical history” content of Chinese EMRs. Experimental results show that the word embedding bootstrapped deep active learning method, using an unlabeled medical corpus, achieves better performance than the other models compared.
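
As a rough illustration of how the two word embedding models named in the abstract differ, the sketch below builds the training examples each model would see from one segmented sentence. It is a minimal sketch of the general continuous bag-of-words and skip-gram formulations, not the authors' implementation; the function names, the window size, and the sample tokens are hypothetical.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: each center token predicts every neighbor inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: the neighbors inside the window jointly predict the center token."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:  # skip tokens that have no neighbors at all
            examples.append((context, center))
    return examples


# Hypothetical segmented sentence standing in for unlabeled EMR text.
sentence = ["patient", "reports", "chest", "pain"]
for center, neighbor in skipgram_pairs(sentence, window=1):
    print("skip-gram:", center, "->", neighbor)
for context, center in cbow_examples(sentence, window=1):
    print("CBOW:", context, "->", center)
```

Both functions need only raw token sequences and no entity labels, which is why embeddings of this kind can be trained on an unlabeled corpus before the labeled NER model is fitted.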


Key words:  deep active learning | named entity recognition (NER) | information extraction | word embedding | Chinese electronic medical record (EMR)
