Abstract
An improved text similarity calculation method is proposed. When computing the similarity of texts, sentences in different text blocks are given different weights. In addition, short sentences are removed outright and highly similar sentences are merged, which reduces the number of sentences in the bag of sentences (BoS) and speeds up computation. The improved method works in three steps: it first computes sentence similarity using a sentence similarity measure, then computes the similarity of text blocks, and finally computes the similarity of the whole text according to the weights of the text blocks. Experiments show that the improved algorithm achieves clearly higher recall, precision, and F1 score.
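The three-step procedure in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the sentence similarity measure, so Jaccard overlap of whitespace tokens is used as a stand-in (Chinese text would need a word segmenter instead), and the function names, the `min_tokens` and `merge_at` thresholds, and the block weights are all hypothetical.

```python
from typing import Dict, List


def sentence_sim(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens (a stand-in, not the paper's measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def prune(sentences: List[str], min_tokens: int = 3, merge_at: float = 0.8) -> List[str]:
    """Drop short sentences, then keep one representative per group of
    highly similar sentences (thresholds are assumed values)."""
    kept: List[str] = []
    for s in sentences:
        if len(s.split()) < min_tokens:
            continue  # short sentence: removed directly
        if any(sentence_sim(s, k) >= merge_at for k in kept):
            continue  # near-duplicate: merged into an already kept sentence
        kept.append(s)
    return kept


def block_sim(a: List[str], b: List[str]) -> float:
    """Average best-match sentence similarity, taken in both directions."""
    a, b = prune(a), prune(b)
    if not a or not b:
        return 0.0
    ab = sum(max(sentence_sim(s, t) for t in b) for s in a) / len(a)
    ba = sum(max(sentence_sim(t, s) for s in a) for t in b) / len(b)
    return (ab + ba) / 2


def text_sim(doc_a: Dict[str, List[str]],
             doc_b: Dict[str, List[str]],
             weights: Dict[str, float]) -> float:
    """Weighted combination of per-block similarities, normalized by total weight."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = sum(w * block_sim(doc_a.get(name, []), doc_b.get(name, []))
                for name, w in weights.items())
    return score / total
```

A call such as `text_sim(doc_a, doc_b, {"title": 0.6, "body": 0.4})` would weight a title block above the body. Pruning before the pairwise comparison is what reduces the work, since the cost of `block_sim` grows with the product of the two sentence counts.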
Source
Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) (《南京邮电大学学报(自然科学版)》)
Peking University Core Journal (北大核心)
2013, No. 1, pp. 79-83 (5 pages)
Funding
Supported by the Henan Province Science and Technology Key Project (102102210489)