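Abstract
To address the characteristics of internet public opinion mining, this paper proposes STCC (Similarity Threshold Control Clustering Based on VSM), a text clustering algorithm built on the vector space model (VSM). The algorithm follows the bottom-up agglomerative strategy of hierarchical clustering to obtain initial cluster information and then, in the spirit of K-means, merges clusters using a preset clustering similarity threshold as the measure. It combines the strengths of hierarchical clustering and K-means while overcoming their weaknesses. Compared with hierarchical clustering, it does not need to compare the similarity between all clusters at each clustering step, which lowers the time complexity and improves clustering efficiency; compared with K-means, it does not require the value of K to be determined, which makes it more flexible. Experiments show that the algorithm clusters well, is practical, and is suitable for large-scale text clustering.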
By analyzing existing clustering algorithms, this paper proposes a new text clustering algorithm, STCC (Similarity Threshold Control Clustering Based on VSM). The algorithm first applies the bottom-up strategy of hierarchical clustering to obtain the initial clusters, and then merges clusters against a clustering similarity threshold, following the idea of K-means. It thereby avoids two shortcomings: computing the similarity between all clusters at every clustering step, and having to fix the value of K in advance. The experimental results show that the algorithm reduces time complexity, improves clustering efficiency, and is more flexible and more widely applicable.
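The abstract does not give the algorithm's details, so the following is only a minimal Python sketch of the general idea it describes: documents are represented as term-frequency vectors (a simple VSM), initial clusters are formed bottom-up, and clusters are then merged whenever their centroid similarity reaches a threshold, so that no K has to be chosen. The function names (`cosine`, `centroid`, `threshold_cluster`), the whitespace tokenization, the raw term-frequency weighting, and the two-phase structure are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Mean vector of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # adds the weights term by term
    n = len(vectors)
    return {t: w / n for t, w in total.items()}


def threshold_cluster(docs, threshold=0.5):
    """Hypothetical threshold-controlled clustering; returns lists of doc indices."""
    # Assumption: whitespace tokenization and raw term frequencies.
    # Chinese text would need a word segmenter and probably TF-IDF weights.
    vectors = [Counter(doc.lower().split()) for doc in docs]
    clusters = []  # each cluster: {"members": [indices], "centroid": dict}

    # Phase 1: build initial clusters bottom-up. A document joins the most
    # similar cluster if the similarity reaches the threshold, else it
    # starts a new cluster.
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            s = cosine(vec, c["centroid"])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append({"members": [i], "centroid": dict(vec)})
        else:
            best["members"].append(i)
            best["centroid"] = centroid([vectors[j] for j in best["members"]])

    # Phase 2: merge clusters whose centroids are similar enough, updating
    # the centroid after each merge (as K-means does), until nothing changes.
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(clusters[a]["centroid"], clusters[b]["centroid"])
                if sim >= threshold:
                    clusters[a]["members"] += clusters[b]["members"]
                    clusters[a]["centroid"] = centroid(
                        [vectors[j] for j in clusters[a]["members"]]
                    )
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break
    return [sorted(c["members"]) for c in clusters]


docs = [
    "stock market falls sharply",
    "stock market falls again",
    "football match ends in draw",
    "football match ends late",
]
print(threshold_cluster(docs, threshold=0.5))  # [[0, 1], [2, 3]]
```

In this sketch the number of clusters follows from the threshold rather than from a preset K: raising the threshold yields more, tighter clusters, and lowering it yields fewer, broader ones.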
Source
Journal of the China Society for Scientific and Technical Information (《情报学报》)
CSSCI
Peking University Core Journal (北大核心)
2014, Issue 5, pp. 530-537 (8 pages)
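Funding
National Natural Science Foundation of China (61373161)
Beijing Municipal Universities Talent Development Deepening Program, "Young and Middle-aged Backbone Talents" project (PHR201008083)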
Keywords
internet public opinion, data mining, keyword extraction, text clustering