Abstract
To address the memory and CPU processing-speed limitations of the traditional AGNES algorithm on large volumes of data, this paper proposes a parallel AGNES algorithm based on the MapReduce framework and presents its main design. The Map phase initializes the clusters; the Reduce phase computes the distances between clusters and merges the closest pair. To account for correlations between attributes, the Mahalanobis distance replaces the Euclidean distance when computing inter-cluster distances. Finally, data sets of different sizes are used to evaluate the speedup and scalability of the improved algorithm. The experimental results show that the MapReduce-based parallel AGNES algorithm is suitable for analyzing and mining large volumes of data.
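The two phases described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names (`map_phase`, `reduce_phase`, `agnes`), the use of centroid linkage, and the choice to estimate one covariance matrix from the whole data set are all assumptions made here for illustration, and nothing is actually distributed across a cluster.

```python
from itertools import combinations


def covariance(points):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
             for j in range(d)] for i in range(d)]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (assumes a non-singular matrix)."""
    d = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(d)]
           for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(d):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


def mahalanobis(x, y, inv_cov):
    """sqrt((x - y)^T S^-1 (x - y)) for an inverse covariance matrix S^-1."""
    diff = [a - b for a, b in zip(x, y)]
    d = len(diff)
    return sum(diff[i] * inv_cov[i][j] * diff[j]
               for i in range(d) for j in range(d)) ** 0.5


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def map_phase(points):
    """Map step (hypothetical): every point starts as its own cluster."""
    return [[tuple(p)] for p in points]


def reduce_phase(clusters, inv_cov, k):
    """Reduce step (hypothetical): merge the closest pair until k clusters remain."""
    clusters = [list(c) for c in clusters]

    def gap(pair):
        a, b = pair
        return mahalanobis(centroid(clusters[a]), centroid(clusters[b]), inv_cov)

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2), key=gap)
        merged = clusters.pop(j)  # j > i, so index i stays valid after the pop
        clusters[i].extend(merged)
    return clusters


def agnes(points, k):
    """Bottom-up clustering of points into k clusters (assumed centroid linkage)."""
    inv_cov = invert(covariance(points))
    return reduce_phase(map_phase(points), inv_cov, k)
```

Calling `agnes(points, k)` on a list of numeric tuples returns `k` lists of points. Because the Mahalanobis distance rescales each direction by the spread of the data, the merge order can differ from the one Euclidean distance would give on strongly correlated attributes.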
Authors
张国光
巩秀钢
于旭东
冯韶文
ZHANG Guo-guang; GONG Xiu-gang; YU Xu-dong; FENG Shao-wen (School of Computer Science and Technology, Shandong University of Technology, Zibo, Shandong 255049, China)
Source
《科技视界》
2018, No. 10, pp. 113-115 (3 pages)
Science & Technology Vision