Abstract
Recognizing geographical named entities is one of the key tasks in constructing a geographical knowledge graph. Chinese text has flexible word structures and unclear word boundaries, and annotated Chinese datasets for the geographical domain are scarce, so geographical named entity recognition in Chinese text remains a persistent research challenge. To address geographical named entity recognition in massive web texts containing geographical information, we built a surface water system dataset based on Wikipedia data, together with a domain dictionary, and proposed a vocabulary enhancement method based on expanded word embedding that is applied to the BERT pre-trained model. Combining this with BiGRU and CRF networks for contextual feature recognition and learning, we constructed the EXPBERT-BiGRU-CRF named entity recognition model. Experimental results show that the model achieves an F1 score of 95.94% on the surface water system dataset, 4.94% higher than the BERT model without vocabulary enhancement, and is also substantially more accurate than the other models compared, so it identifies geographical named entities more accurately.
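The abstract does not spell out how the domain dictionary feeds into BERT. As a rough illustration only, the sketch below shows one common first step in lexicon-based enhancement for Chinese NER: finding, for each character, the dictionary words that cover it. The function name `match_lexicon`, the `max_len` parameter, and the toy dictionary are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def match_lexicon(chars: str, lexicon: Set[str], max_len: int = 6) -> Dict[int, List[str]]:
    """Map each character index to the dictionary words that cover it.

    Assumption: this is a generic lexicon-matching step, not the paper's
    expanded word embedding method, which the abstract does not detail.
    """
    covered: Dict[int, List[str]] = {i: [] for i in range(len(chars))}
    for start in range(len(chars)):
        # Try every span of at least two characters beginning at `start`.
        for end in range(start + 2, min(len(chars), start + max_len) + 1):
            word = chars[start:end]
            if word in lexicon:
                # Record the matched word on every character it spans.
                for i in range(start, end):
                    covered[i].append(word)
    return covered


# Hypothetical toy dictionary of water-system names (not the paper's dictionary).
toy_lexicon = {"长江", "汉江", "洞庭湖"}
sentence = "汉江汇入长江"
matches = match_lexicon(sentence, toy_lexicon)
```

In a lexicon-enhanced model, the words matched for each character would typically be embedded and fused with that character's BERT representation before the sequence is passed to the BiGRU and CRF layers. How EXPBERT performs that fusion is not stated in the abstract.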
Authors
ZHENG Xuye (郑旭野)
CHEN Tao (陈涛)
ZHOU Jingjuan (周婧娟)
School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China; Province Surveying Mapping Production Archives of Hubei, Wuhan 430014, China
Source
Geospatial Information (《地理空间信息》)
2025, Issue 2, pp. 1-6 (6 pages)
Funding
Supported by the Natural Science Foundation of Hubei Province (2022CFB194).
Keywords
geographical knowledge graph
BERT
named entity recognition
vocabulary enhancement