Research on Text Categorization Based on the LDA Model (cited by: 57)
Abstract: Traditional dimensionality-reduction algorithms show their limitations on high-dimensional, large-scale text classification tasks. To address this, a Chinese text classification algorithm based on the Latent Dirichlet Allocation (LDA) model is proposed. Within the discriminative framework of the Support Vector Machine (SVM), LDA is applied as a generative probabilistic model to perform topic modeling on the document collection, and an SVM is trained on the collection's latent topic-document matrix to build the text classifier. Parameters are inferred by Gibbs sampling, so that each document is represented as a probability distribution over a fixed set of latent topics. The optimal number of topics T is determined with a standard method from Bayesian statistics. Classification experiments on a corpus show that the proposed method classifies better than SVM combined with a Vector Space Model (VSM) text representation and than SVM combined with Latent Semantic Indexing (LSI).
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2011, No. 13, pp. 150-153 (4 pages)
Keywords: text categorization; Latent Dirichlet Allocation (LDA) model; Gibbs sampling; Bayesian statistics theory
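
The representation step described in the abstract (each document reduced to a probability distribution over a fixed set of latent topics, with parameters inferred by Gibbs sampling) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, the iteration count, and the toy corpus are all assumptions chosen for illustration. The rows of the returned matrix are what an SVM (for example via LIBSVM) would then be trained on; that step is omitted here.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the document-topic matrix.

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    alpha, beta: symmetric Dirichlet hyperparameters (illustrative values).
    """
    rng = random.Random(seed)
    n_dt = [[0] * num_topics for _ in docs]               # document-topic counts
    n_tw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_t = [0] * num_topics                                # tokens assigned to each topic
    z = []                                                # topic assignment of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_d.append(t)
            n_dt[d][t] += 1
            n_tw[t][w] += 1
            n_t[t] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                t = z[d][i]
                n_dt[d][t] -= 1
                n_tw[t][w] -= 1
                n_t[t] -= 1
                # p(z = k | everything else) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (n_dt[d][k] + alpha) * (n_tw[k][w] + beta)
                    / (n_t[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                z[d][i] = t
                n_dt[d][t] += 1
                n_tw[t][w] += 1
                n_t[t] += 1

    # theta[d][k]: smoothed share of topic k in document d (each row sums to 1).
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(n_dt[d][k] + alpha) / denom for k in range(num_topics)])
    return theta


# Toy corpus (hypothetical): word ids 0-2 and 3-5 form two loose themes.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
theta = lda_gibbs(docs, num_topics=2, vocab_size=6)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` has a fixed length equal to the number of topics regardless of vocabulary size, which is the dimensionality reduction the abstract relies on: the classifier sees T features per document instead of one feature per vocabulary term.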
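References (6)

  • 1 Su Jinshu, Zhang Bofeng, Xu Xin. Research progress in machine-learning-based text classification techniques[J]. Journal of Software (软件学报), 2006, 17(9): 1848-1859. (cited by: 391)
  • 2 Wu Jianjun, Kang Yaohong. Research on feature dimensionality-reduction methods in text classification[J]. Journal of Hainan University (Natural Science Edition) (海南大学学报(自然科学版)), 2007, 25(1): 62-66. (cited by: 4)
  • 3 Deerwester S, Dumais S T, et al. Indexing by latent semantic analysis[J]. Journal of the American Society for Information Science, 1990, 41(6).
  • 4 Blei D, Ng A, Jordan M. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3(4/5).
  • 5 Griffiths T L, Steyvers M. Finding scientific topics[J]. PNAS, 2004, 101(1).
  • 6 Chang Chih-Chung, Lin Chih-Jen. LIBSVM: A library for support vector machines[EB/OL]. (2001). http://www.csie.ntu.edu.tw/~cjlin/libsvm.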
