Abstract
Most traditional text similarity methods do not take semantic and structural information into account and tend to overlook fine-grained details of text features. To address these problems, a text similarity algorithm based on weighted fusion of multiple models is proposed. First, three features (word frequency, part of speech, and the position of words within the sentence) are used jointly to compute sentence similarity. Second, to capture the structural information of the text, a hierarchical pooling model, IIG-SIF, is proposed to compute text similarity. Finally, the two similarity models are combined in a linear weighted model so that the fused result is more accurate. Experimental results show that the proposed algorithm improves accuracy and recall and obtains better results on datasets of different languages and granularities.
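The linear weighted fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fuse_similarity`, the weight parameter `alpha`, and the assumption that both input scores already lie in [0, 1] are hypothetical choices made here for illustration. The abstract does not say how the weight is chosen.

```python
def fuse_similarity(feature_sim: float, pooled_sim: float, alpha: float = 0.5) -> float:
    """Linearly combine two similarity scores (hypothetical helper).

    feature_sim: score from the word-frequency / part-of-speech / position model.
    pooled_sim:  score from the hierarchical-pooling sentence-vector model.
    alpha:       assumed weight on feature_sim; (1 - alpha) goes to pooled_sim.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * feature_sim + (1.0 - alpha) * pooled_sim


if __name__ == "__main__":
    # Example: weight the feature-based score at 0.25 and the pooled score at 0.75.
    print(fuse_similarity(0.8, 0.6, alpha=0.25))
```

In practice a weight such as `alpha` would presumably be tuned on held-out data. The value used above is only an example.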
Authors
田红鹏
马博
冯健
TIAN Hong-peng; MA Bo; FENG Jian (College of Computer Science and Technology, Xi’an University of Science and Technology, Xi’an 710600, China)
Source
《计算机工程与设计》
Peking University Core Journal
2021, No. 11, pp. 3239-3245 (7 pages)
Computer Engineering and Design
Funding
Natural Science Basic Research Program of Shaanxi Province (2020JM-533).
Keywords
text similarity
feature fusion
word mover's distance
hierarchical pooling
sentence vector