Abstract
Density-based spatial clustering of applications with noise (DBSCAN) is one of the classic methods in data mining. It can discover complex relationships hidden in data while filtering out noise, yielding high-quality clusters. However, existing density-based clustering algorithms support only unimodal (single-type) data and struggle in applications where multiple modalities coexist. With the rapid development of information technology, real-world data is increasingly multi-modal: rather than a single data type, it combines several modalities, such as text, images, geographical coordinates, and data features. As a result, existing clustering methods cannot model complex multi-modal data effectively, let alone cluster it efficiently. To address these issues, this study proposes a density-based clustering algorithm in multi-metric spaces. First, to characterize the complex relationships within multi-modal data, it uses a multi-metric space to represent the similarity between objects and employs an aggregated multi-metric graph (AMG) index to model multi-modal data. Next, it uses differential distances to optimize the graph structure of the AMG and combines a best-first search strategy with pruning to achieve efficient multi-modal data clustering. Finally, experiments on real and synthetic datasets under a variety of parameter settings show that the proposed method improves running efficiency by at least one order of magnitude while achieving high clustering accuracy and good scalability.
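To make the idea of density-based clustering in a multi-metric space concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that each object is a tuple with one component per modality, that the per-modality distances are combined by a weighted sum (an assumption made here for illustration), and that neighbors are found by brute force. The AMG index, the differential-distance optimization, and the best-first pruning described in the abstract are not reproduced. All function names and parameter values are hypothetical.

```python
from collections import deque
from math import dist  # Euclidean distance, Python 3.8+


def edit_distance(a, b):
    """Levenshtein distance between two strings (a metric on text)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def aggregated_distance(x, y, metrics, weights):
    """Weighted sum of per-modality distances (assumed aggregation)."""
    return sum(w * m(a, b) for w, m, a, b in zip(weights, metrics, x, y))


def multi_metric_dbscan(objects, metrics, weights, eps, min_pts):
    """Brute-force DBSCAN over the aggregated distance.

    Returns one label per object: a cluster id >= 0, or -1 for noise.
    """
    n = len(objects)

    def neighbors(i):
        # Linear scan; an index such as AMG would replace this step.
        return [j for j in range(n)
                if aggregated_distance(objects[i], objects[j], metrics, weights) <= eps]

    labels = [None] * n  # None = unvisited, -1 = noise
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            expansion = neighbors(j)
            if len(expansion) >= min_pts:  # j is a core point, keep expanding
                queue.extend(expansion)
        cluster += 1
    return labels


# Hypothetical two-modality objects: (text, geographical coordinates).
objects = [
    ("cafe", (0.0, 0.0)), ("cafe", (0.1, 0.0)), ("cafes", (0.0, 0.1)),
    ("museum", (5.0, 5.0)), ("museum", (5.1, 5.0)), ("museums", (5.0, 5.1)),
    ("zzzzzzzz", (20.0, 20.0)),
]
labels = multi_metric_dbscan(objects, [edit_distance, dist], [0.5, 0.5],
                             eps=1.0, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

A weighted sum of metrics is itself a metric, which is why it is a convenient stand-in here. The linear scan in `neighbors` makes the sketch quadratic in the number of objects, which is the cost an index with pruning would aim to avoid.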
Authors
朱轶凡
罗程阳
马瑞遥
陈璐
毛玉仁
高云君
ZHU Yi-Fan; LUO Cheng-Yang; MA Rui-Yao; CHEN Lu; MAO Yu-Ren; GAO Yun-Jun (College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China; School of Software Technology, Zhejiang University, Ningbo 315048, China)
Source
《软件学报》
PKU Core Journal (北大核心)
2025, Issue 2, pp. 851-873 (23 pages)
Journal of Software
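Funding
National Key Research and Development Program of China (2021YFC3300303)
National Natural Science Foundation of China (62025206, 61972338, 62102351)
Hangzhou Major Science and Technology Innovation Project on Artificial Intelligence (2022AIZD0116).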