Abstract
The World Wide Web contains a large number of near-replica (near-mirror) web pages. Finding these pages with similar content quickly and accurately has become one of the key techniques for improving the quality of search engine services. In this paper, we propose five near-replica detection algorithms for search engines that rely on keyword matching, and we evaluate them on the WebGather ("天网") search engine system. We also compare them with an existing approach, one of the most popular copy detection mechanisms. The algorithms described here have been successfully used to remove duplicate web pages in WebGather, and they can also be applied widely in building digital libraries.
Source
《电子学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2000, Issue z1, pp. 130-132, 129 (3 pages in total)
Acta Electronica Sinica
Funding
National 973 Major Basic Research Development Program of China (No. G1999032706)