Research and Evaluation of Near-Replica Web Page Detection Algorithms (cited by: 21)
Abstract Many near-replica web pages exist on the World Wide Web. Finding pages with similar content quickly and accurately has become one of the key techniques for improving the quality of search engine service. This paper proposes five near-replica detection algorithms for search engines that rely on keyword matching and evaluates them in practice on the WebGather ("天网") search engine system. It also compares them with existing methods, in particular one of the most popular copy detection mechanisms. The algorithms described here have been used successfully to remove duplicate web pages in WebGather, and they can also be applied to building digital libraries.
Source Acta Electronica Sinica (《电子学报》; indexed in EI, CAS, CSCD, and the Peking University Core Journals list), 2000, Issue z1, pp. 130-132, 129 (3 pages)
Funding National 973 Major Basic Research Development Program (No. G1999032706)
Keywords World Wide Web; search engine; near-replicas; vector space model; MD5

References (3)

  • [1] Narayanan Shivakumar, et al. Finding near-replicas of documents on the web [DB/OL]. http://dbpubs.stanford.edu/pub/1998-31.
  • [2] J. Liu, M. Lei, J. Wang, and B. Chen. Digging for gold on the web: Experience with the WebGather [A]. Proc. of the 4th Inter. Conf. on High Performance Computing in the Asia-Pacific Region [C], Beijing, P.R. China, May 2000: 751-755.
  • [3] U. Manber. Finding similar files in a large file system [R]. Technical Report TR 93-33, University of Arizona, Tucson, Arizona, October 1993.
