Abstract
For information retrieval over Chinese text, this paper proposes a text similarity calculation method that proceeds in stages, from sentences to paragraphs to the whole document. The method takes the topic and application scope of a document into account, assigns weights to feature words using a semantically enhanced weighting scheme, and incorporates semantic factors suited to the characteristics of each calculation stage, with the aim of making similarity results for Chinese text more accurate. Finally, a text similarity calculation system is built, and an experimental comparison with traditional algorithms shows that the improved algorithm achieves better results.
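The staged idea (sentence, then paragraph, then whole document) can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: sentences are taken to be pre-segmented token lists, the hypothetical `weights` dictionary stands in for the semantically enhanced term weights, sentence similarity is a weighted cosine, and each higher stage simply averages best matches from the stage below. It is not the authors' algorithm.

```python
import math
from collections import Counter


def weighted_vector(tokens, weights):
    """Term-frequency vector scaled by per-term weights (default weight 1.0).

    `weights` is a hypothetical stand-in for semantically enhanced weights.
    """
    counts = Counter(tokens)
    return {term: count * weights.get(term, 1.0) for term, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(s1, s2, weights):
    """Stage 1: weighted cosine between two token lists."""
    return cosine(weighted_vector(s1, weights), weighted_vector(s2, weights))


def best_match_average(items_a, items_b, sim):
    """Average each item's best match in the other collection, in both directions."""
    if not items_a or not items_b:
        return 0.0
    a_to_b = sum(max(sim(a, b) for b in items_b) for a in items_a) / len(items_a)
    b_to_a = sum(max(sim(a, b) for a in items_a) for b in items_b) / len(items_b)
    return (a_to_b + b_to_a) / 2


def paragraph_similarity(p1, p2, weights):
    """Stage 2: a paragraph is a list of sentences."""
    return best_match_average(
        p1, p2, lambda a, b: sentence_similarity(a, b, weights)
    )


def document_similarity(d1, d2, weights):
    """Stage 3: a document is a list of paragraphs."""
    return best_match_average(
        d1, d2, lambda a, b: paragraph_similarity(a, b, weights)
    )


if __name__ == "__main__":
    # Toy, pre-segmented documents; the weight values are made up.
    weights = {"相似度": 2.0, "检索": 1.5}
    doc_a = [[["文本", "相似度", "计算"], ["信息", "检索"]]]
    doc_b = [[["文本", "相似度", "方法"]], [["信息", "检索", "系统"]]]
    print(document_similarity(doc_a, doc_b, weights))
```

Averaging best matches in both directions keeps the score symmetric and within [0, 1] for non-negative weights; other aggregation rules, such as weighting paragraphs by length or position, would fit the same three-stage structure.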
Source
《现代图书情报技术》
CSSCI
Peking University Core Journal
2013, No. 10, pp. 20-26 (7 pages)
New Technology of Library and Information Service
Funding
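One of the research outcomes of the Scientific Research Program project of the Shaanxi Provincial Department of Education, "Research on a Bidirectional Sequence Encryption Method Based on Real-Time Embedded Security" (Project No. 2013JK1146)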
Keywords
Text similarity
Information retrieval
Semantic similarity
Term weight