基于领域词典与CRF双层标注的中文电子病历实体识别

Clinical named entity recognition from Chinese electronic medical records using a double-layer annotation model combining a domain dictionary with CRF

  • 摘要: 医疗实体识别是电子病历文本信息抽取的基本任务。针对中文电子病历文本复合实体较多、实体长度较长、句子成分缺失严重、实体边界不清的语言特点以及标注语料难以获取的现状,提出了一种基于领域词典和条件随机场(CRF)的双层标注模型。该模型通过对外部资源的统计分析构建医疗领域词典,再结合条件随机场,进行了两次不同粒度的标注,将领域词典识别的准确性和机器学习的自动性融为一体,从中文电子病历文本中识别出疾病、症状、药品、操作四类医疗实体。该模型在测试数据中的宏精确率为96.7%、宏召回率为97.7%、宏F1值为97.2%。同时对比分析了采用注意力机制的深度神经网络的识别效果,因受到领域数据集大小的限制,在该测试数据集中后者表现不佳。实验结果表明了该双层标注模型对中文医疗实体识别的高效性。

     

    Abstract: Electronic medical records, written by medical professionals, are a large and important clinical resource, and making use of the information they contain has become a major research direction. Chinese electronic medical records are knowledge-intensive, and their data has considerable research value. However, they are difficult to process: composite entities are common and long, sentence components are frequently omitted, and the boundaries of clinical entities are often unclear. Annotating a corpus also demands a great deal of manual effort because of the specialized language involved, so labeled data is hard to obtain. Recognizing clinical named entities in Chinese text is therefore a hard problem. In view of these characteristics, this paper proposes a double-layer annotation model that combines a domain dictionary with a conditional random field (CRF). A medical domain dictionary is constructed through statistical analysis of external resources and then combined with the CRF to perform two labeling passes at different granularities. The manually constructed dictionary recognizes registered (in-dictionary) words with very high accuracy, while the machine learning model can automatically recognize unregistered words; the proposed model integrates both advantages. With this method, four types of clinical entities (diseases, symptoms, drugs, and operations) are recognized from Chinese electronic medical records. On the test dataset, the model achieves a macro-precision of 96.7%, a macro-recall of 97.7%, and a macro-F1 of 97.2%, a substantial improvement over a single-layer model. A deep neural network with an attention mechanism was also evaluated for comparison; it performed poorly on this test dataset because of the limited size of the domain dataset. The experimental results demonstrate the effectiveness of the double-layer annotation model for named entity recognition in Chinese electronic medical records.
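The two-pass idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the dictionary is matched against the text or which features the CRF uses, so everything below is an assumption made for illustration: the toy dictionary entries, the forward maximum matching strategy, the BIO tag scheme, and the function names `dictionary_tags` and `crf_features` are all hypothetical. The first function produces coarse dictionary tags; the second turns them into per-character feature dictionaries of the kind a CRF toolkit could consume in a second, finer-grained pass.

```python
from typing import Dict, List

# Hypothetical toy dictionary (surface form -> entity type); the paper's
# dictionary is built from external resources and is not reproduced here.
DOMAIN_DICT: Dict[str, str] = {
    "糖尿病": "DISEASE",
    "2型糖尿病": "DISEASE",
    "头痛": "SYMPTOM",
    "阿司匹林": "DRUG",
    "胸部CT": "OPERATION",
}
MAX_LEN = max(len(word) for word in DOMAIN_DICT)


def dictionary_tags(text: str) -> List[str]:
    """First pass (assumed): forward maximum matching against the dictionary.

    Returns one BIO tag per character; unmatched characters get "O".
    """
    tags = ["O"] * len(text)
    i = 0
    while i < len(text):
        matched = False
        # Try the longest candidate first so "2型糖尿病" beats "糖尿病".
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            label = DOMAIN_DICT.get(text[i:i + n])
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


def crf_features(text: str, tags: List[str]) -> List[Dict[str, str]]:
    """Second pass input (assumed): per-character features for a CRF.

    The dictionary tag is exposed as a feature so the CRF can confirm,
    extend, or override the coarse dictionary decision.
    """
    features = []
    for i, ch in enumerate(text):
        features.append({
            "char": ch,
            "dict_tag": tags[i],
            "prev_char": text[i - 1] if i > 0 else "<BOS>",
            "next_char": text[i + 1] if i + 1 < len(text) else "<EOS>",
        })
    return features


sentence = "患者2型糖尿病,服用阿司匹林"
coarse = dictionary_tags(sentence)
features = crf_features(sentence, coarse)
```

In this sketch the dictionary handles registered words, and words missing from the dictionary are left as "O" for the learned model to resolve from context, which mirrors the division of labor the abstract describes.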
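For readers unfamiliar with the macro-averaged metrics reported above, the following sketch shows one common way to compute them from per-class counts. The counts used here are hypothetical and are not the paper's data. Note also that macro-F1 has two definitions in common use: the mean of per-class F1 scores (used below) and the harmonic mean of macro-precision and macro-recall. The abstract does not say which one was used; the harmonic mean of the reported 96.7% and 97.7% is about 97.2%, which agrees with the reported figure, but that alone does not settle the question.

```python
from typing import Dict, Tuple


def macro_prf(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall, and F1.

    `counts` maps each entity type to (true positives, false positives,
    false negatives). Each class contributes equally to the average,
    regardless of how many entities it has.
    """
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical counts for the four entity types named in the abstract.
example_counts = {
    "disease": (95, 3, 2),
    "symptom": (180, 8, 5),
    "drug": (60, 1, 2),
    "operation": (40, 2, 1),
}
macro_p, macro_r, macro_f1 = macro_prf(example_counts)
```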

     
