Abstract
The Latent Dirichlet Allocation (LDA) topic model is commonly used to identify latent semantic information in large document collections. Because micro-blog short texts are semantically sparse, however, LDA performs poorly when applied to micro-blog short-text clustering. Modeling micro-blogs with the traditional LDA topic model also yields a distribution over micro-blog users that is not intuitive. This paper uses an improved LDA model that represents each user as a topic probability vector, which both mines the hidden semantic information in the text and presents each user's topic distribution directly. A K-means algorithm based on density region division is then proposed to cluster micro-blog users. Experiments on a real micro-blog data set show that the proposed method clusters micro-blog users noticeably better than the traditional K-means method.
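The abstract does not spell out how the density region division works, so the following is only a minimal sketch of one plausible reading: each user is already represented as a topic probability vector, the density of a user is the number of users within a fixed radius, and the K-means seeds are chosen greedily from the densest users while keeping the seeds more than one radius apart. The function names `density_seeds` and `kmeans`, the `radius` parameter, and the sample vectors are all hypothetical and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_seeds(points, k, radius):
    """Pick up to k initial centers from dense, well-separated regions.

    Assumption: density = number of points within `radius` of a point.
    Points are visited from densest to sparsest; a point becomes a seed
    only if it lies more than `radius` away from every earlier seed.
    """
    density = [sum(1 for q in points if euclidean(p, q) <= radius) for p in points]
    order = sorted(range(len(points)), key=lambda i: -density[i])
    seeds = []
    for i in order:
        if all(euclidean(points[i], s) > radius for s in seeds):
            seeds.append(list(points[i]))
        if len(seeds) == k:
            break
    return seeds


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iterations starting from the given centers."""
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Made-up topic probability vectors for nine users over three topics.
users = [
    [0.90, 0.05, 0.05], [0.85, 0.10, 0.05], [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05], [0.10, 0.85, 0.05], [0.10, 0.80, 0.10],
    [0.05, 0.05, 0.90], [0.05, 0.10, 0.85], [0.10, 0.10, 0.80],
]

seeds = density_seeds(users, k=3, radius=0.2)
labels, centers = kmeans(users, seeds)
print(labels)
```

Seeding from dense regions replaces the random initialization of plain K-means, which is one common way to make the result less sensitive to the starting centers; whether the paper uses exactly this criterion is not stated in the abstract.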
Source
《情报理论与实践》
CSSCI
PKU Core Journal (北大核心)
2016, No. 3, pp. 135-139 (5 pages)
Information Studies: Theory & Application
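Funding
National Natural Science Foundation of China project "Research on the Evaluation of Web Page Content Authenticity" (Grant No. 61171159)
Outcome of the Beijing Municipal Development and Reform Commission project "Innovation Capacity Building Project of the Beijing Engineering Laboratory for Heterogeneous Big Data Analysis, Mining and Integration Technology"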
Keywords
micro-blog
topic model
text clustering
K-means algorithm