Research on Web Text Classification Based on Improved BoS
Abstract: An improved text similarity calculation method is proposed. When computing the similarity of two texts, sentences are given different weights according to the text block they belong to. Short sentences are removed outright and highly similar sentences are merged, which reduces the number of sentences in the BoS (Bag of Sentences) and speeds up computation. The improved method proceeds in three steps: it first computes sentence similarity using the sentence similarity measure, then computes the similarity of each text block, and finally combines the block similarities according to the block weights to obtain the similarity of the whole text. Experiments show that the improved algorithm achieves clearly higher recall, precision, and F1 values.
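The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the sentence similarity measure, so a Jaccard token overlap stands in for it, and the names `MIN_TOKENS`, `MERGE_THRESHOLD`, `prune`, and the block names and weights in the usage note are all hypothetical.

```python
from typing import Dict, List

MIN_TOKENS = 3          # assumed cutoff below which a sentence counts as "short"
MERGE_THRESHOLD = 0.8   # assumed cutoff above which two sentences are merged


def sentence_similarity(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens (a stand-in, not the paper's measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def prune(sentences: List[str]) -> List[str]:
    """Drop short sentences, then skip any sentence that nearly duplicates a kept one."""
    kept: List[str] = []
    for s in sentences:
        if len(s.split()) < MIN_TOKENS:
            continue
        if any(sentence_similarity(s, k) >= MERGE_THRESHOLD for k in kept):
            continue
        kept.append(s)
    return kept


def _directed(a: List[str], b: List[str]) -> float:
    """Average, over sentences in a, of the best-matching sentence in b."""
    return sum(max(sentence_similarity(s, t) for t in b) for s in a) / len(a)


def block_similarity(block_a: List[str], block_b: List[str]) -> float:
    """Symmetric similarity of two text blocks after pruning their sentence bags."""
    a, b = prune(block_a), prune(block_b)
    if not a or not b:
        return 0.0
    return (_directed(a, b) + _directed(b, a)) / 2


def text_similarity(doc_a: Dict[str, List[str]],
                    doc_b: Dict[str, List[str]],
                    weights: Dict[str, float]) -> float:
    """Weighted average of block similarities over the blocks named in weights."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for name, w in weights.items():
        score += w * block_similarity(doc_a.get(name, []), doc_b.get(name, []))
    return score / total
```

A document here is a mapping from block name to its list of sentences, for example `{"title": [...], "body": [...]}`, and a call such as `text_similarity(doc_a, doc_b, {"title": 2, "body": 1})` weights the title block twice as heavily as the body. The whitespace tokenizer is only suitable for space-delimited text; Chinese text would need a word segmenter in its place.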
Source: Journal of Nanjing University of Posts and Telecommunications: Natural Science Edition (PKU Core journal), 2013, No. 1, pp. 79-83 (5 pages)
Funding: Supported by the Henan Province Science and Technology Research Project (102102210489)
Keywords: web text classification; bag of sentences; vector space model; text mining
