Research on Micro-blog User Clustering Based on an Improved LDA Topic Model

Cited by: 13
Abstract Latent Dirichlet Allocation (LDA) topic models can generally identify the semantic information latent in large-scale document collections, but because micro-blog short texts are semantically sparse, LDA performs poorly when applied to micro-blog short-text clustering. Modeling micro-blogs with the traditional LDA topic model yields a user distribution that is not intuitive. The improved LDA model represents each user as a topic probability vector, which both mines the hidden semantic information of the text more fully and presents each user's topic distribution directly. The paper proposes a K-means algorithm based on density region division to cluster micro-blog users. Experiments on a real micro-blog data set show that, compared with the traditional K-means clustering method, the proposed method clusters micro-blog users noticeably better.
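Source Information Studies: Theory & Application (《情报理论与实践》), CSSCI, PKU Core Journal, 2016, No. 3, pp. 135-139 (5 pages)
Funding National Natural Science Foundation of China project "Research on Evaluating the Authenticity of Web Page Content" (Grant No. 61171159); Beijing Municipal Commission of Development and Reform project "Innovation Capacity Building Project of the Beijing Engineering Laboratory for Heterogeneous Big Data Analysis, Mining and Integration Technology"
Keywords micro-blog; topic model; text clustering; K-means algorithm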
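
The clustering step summarized in the abstract can be illustrated with a rough sketch. This record does not give the details of the paper's density-region-division algorithm, so the code below is not the authors' method: it assumes each user is already represented as a topic probability vector (for example, from an LDA model) and substitutes a simple density-based seeding heuristic for choosing the initial K-means centers. The function names, the `radius` parameter, and the toy user vectors are all hypothetical and chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_seeds(points, k, radius):
    """Pick up to k initial centers from dense, well-separated points.

    Assumed heuristic (not taken from the paper): a point's density is the
    number of points within `radius` of it; points are visited from most to
    least dense, and one is kept only if it lies farther than `radius` from
    every seed chosen so far.
    """
    density = [sum(1 for q in points if euclidean(p, q) <= radius) for p in points]
    order = sorted(range(len(points)), key=lambda i: -density[i])
    seeds = []
    for i in order:
        if all(euclidean(points[i], s) > radius for s in seeds):
            seeds.append(list(points[i]))
        if len(seeds) == k:
            break
    return seeds


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
            for p in points]


def kmeans(points, centers, max_iter=100):
    """Standard K-means (Lloyd) iterations starting from the given centers."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        updated = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centers[c])  # keep an empty cluster's center
        if updated == centers:  # centers stopped moving: converged
            break
        centers = updated
    return assign(points, centers), centers


# Toy user-topic vectors over three topics (illustrative values only).
users = [
    [0.80, 0.10, 0.10],
    [0.75, 0.15, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.15, 0.75, 0.10],
    [0.10, 0.10, 0.80],
    [0.10, 0.15, 0.75],
]

seeds = density_seeds(users, k=3, radius=0.2)
labels, centers = kmeans(users, seeds)
print(labels)
```

Seeding from dense regions makes the result deterministic and avoids the sensitivity of plain K-means to randomly chosen initial centers, which is the general motivation for density-based variants. The choice of `radius` here is arbitrary and would need tuning on real data.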