Abstract
Random sampling is widely used in data mining algorithms. It works well on uniformly distributed data sets, but on skewed data sets it tends to miss small clusters, which lowers clustering accuracy. To address this, a grid-based density-biased sampling algorithm (G_DBS) is proposed that obtains an approximate density-biased sample in a single scan of the data set. Experimental results show that G_DBS improves clustering accuracy, is robust to noise, and runs efficiently, making it an effective approach to mining massive data sets.
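The abstract does not spell out the steps of G_DBS, so the following is only a minimal sketch of the general idea of grid-based density-biased sampling, not the authors' algorithm. It assumes a common formulation in which a point lying in a grid cell that holds n points is kept with probability proportional to n^(e-1). The names `cell_of`, `density_biased_sample`, `cell_size`, `target_size`, and the exponent `e` are all hypothetical. For clarity the sketch makes two passes over the data (one to count, one to sample); the single-scan approximation described in the abstract is not reproduced here.

```python
import random
from collections import Counter


def cell_of(point, cell_size):
    """Map a point to the index tuple of the grid cell that contains it."""
    return tuple(int(x // cell_size) for x in point)


def density_biased_sample(points, cell_size, target_size, e=0.5, seed=0):
    """Illustrative density-biased sampling over a uniform grid.

    A point in a cell holding n points is kept with probability
    proportional to n ** (e - 1), scaled so that the expected sample
    size is about target_size (before clipping probabilities at 1).
    e = 1 reduces to uniform sampling; smaller e favours sparse cells.
    """
    rng = random.Random(seed)

    # Pass 1: count how many points fall into each occupied cell.
    counts = Counter(cell_of(p, cell_size) for p in points)

    # Normalising constant: sum over occupied cells of n ** e.
    norm = sum(n ** e for n in counts.values())

    # Pass 2: keep each point with its cell-dependent probability.
    sample = []
    for p in points:
        n = counts[cell_of(p, cell_size)]
        keep_prob = min(1.0, target_size * n ** (e - 1) / norm)
        if rng.random() < keep_prob:
            sample.append(p)
    return sample
```

With `e = 0`, each occupied cell contributes roughly the same expected number of sampled points, so a small cluster next to a much larger one is far less likely to vanish from the sample than under uniform sampling.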
Source
《计算机应用》
CSCD
PKU Core (Peking University Core Journals)
2007, No. 7, pp. 1695-1698 (4 pages)
Journal of Computer Applications
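Funding
Natural Science Foundation of Chongqing (2005BB2063)
Key Project of the Natural Science Foundation of Chongqing (2005BA2003)
Science and Technology Research Project of the Chongqing Municipal Education Commission (050509)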
Keywords
data mining
biased sampling
clustering
data reduction
mass data