Abstract
To improve the analysis of public opinion information, we design and implement a mean shift algorithm on the Spark framework. Features are extracted from public opinion text using Ansj word segmentation and the Word2vec algorithm, and the resulting features are then clustered with the mean shift algorithm on Spark's parallel computing framework. Experimental results show that the mean shift algorithm achieves an accuracy above 90% on both the Iris and Wine data sets, that its clustering results are clearly better than those of the K-means algorithm, and that it adapts well to different data. Performance experiments show that increasing the degree of parallelism of the running program improves the algorithm's running efficiency and gives it good data scalability. The Spark-based mean shift algorithm can therefore effectively improve the analysis of public opinion information and help establish a healthy network environment.
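The mean shift principle named in the abstract can be illustrated with a minimal single-machine sketch. This is not the paper's Spark implementation: the Gaussian kernel, the `bandwidth` value, the function name `mean_shift`, and the toy 2-D points (standing in for Word2vec feature vectors) are all assumptions made for illustration. Each point is moved repeatedly to the kernel-weighted mean of the data until it stops moving, and points that settle at the same mode are assigned to the same cluster.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Cluster points by shifting each one to a local density mode.

    Assumes a Gaussian kernel; returns (centers, labels).
    """
    modes = []
    for p in points:
        x = list(p)
        for _ in range(max_iter):
            # Gaussian weight of every data point relative to the current position.
            weights = [
                math.exp(-math.dist(x, q) ** 2 / (2 * bandwidth ** 2))
                for q in points
            ]
            total = sum(weights)
            # Move to the weighted mean of all data points.
            new_x = [
                sum(w * q[d] for w, q in zip(weights, points)) / total
                for d in range(len(x))
            ]
            shift = math.dist(x, new_x)
            x = new_x
            if shift < tol:
                break
        modes.append(x)

    # Merge modes that lie within one bandwidth of an existing center.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) < bandwidth:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


# Toy data: two well-separated groups of 2-D points.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, labels = mean_shift(data, bandwidth=1.0)
print(len(centers), labels)
```

The shift of each starting point depends only on the fixed data set, not on the other points' trajectories, so the outer loop is the natural candidate for distribution across workers in a framework such as Spark.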
Authors
张京坤
王怡怡
ZHANG Jing-kun; WANG Yi-yi (Taiji Computer Corporation, China Electronics Technology Group Corporation, Beijing 100020, China; School of Mathematics and Information Science, Shaanxi Normal University, Xi’an 710100, China)
Source
《软件导刊》
2022, No. 6, pp. 141-146 (6 pages)
Software Guide
Keywords
public opinion
Spark
mean shift
clustering
parallelization