Abstract
Text clustering plays an important role in data mining and machine learning, and years of development have produced a substantial body of theoretical results. Text modeling based on the traditional vector space model (VSM) suffers from high dimensionality, sparse data, and a lack of semantic information. Modeling approaches that simply introduce a dictionary address the semantic problem only in part, and are themselves limited by the small vocabulary of hand-built dictionaries and the heavy manual effort they require. Borrowing the idea of topic models, this paper proposes a text modeling method, word2fea. It builds on word vectors obtained with the word2vec algorithm and treats the clusters produced by clustering those words as topics. It then combines features such as each topic's frequency, distribution range, and position factor within a text to obtain the text's feature vector in the category space. word2fea is compared with two other text modeling methods, VSM and word2vec_base, and the experimental results show that it clearly improves text classification accuracy.
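The abstract names the ingredients of word2fea but not its formulas, so the following is only a minimal sketch of the general idea under stated assumptions. The toy 2-D vectors stand in for real word2vec output, plain k-means stands in for whatever word clustering the authors used, and the product `freq * spread * position` is a hypothetical way of combining the three factors, not the paper's weighting. All function and variable names are illustrative.

```python
# Toy 2-D "word vectors" (assumption); in practice they would come from a
# trained word2vec model.
WORD_VECS = {
    "goal": [0.9, 0.1], "match": [0.8, 0.2], "team": [0.85, 0.15],
    "stock": [0.1, 0.9], "market": [0.2, 0.8], "bank": [0.15, 0.85],
}


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_words(vecs, k, iters=10):
    """Plain k-means over word vectors; returns a word -> topic id mapping.

    Seeds are evenly spaced words in sorted order so the result is
    deterministic (an illustrative choice, not the paper's procedure).
    """
    words = sorted(vecs)
    step = len(words) // k
    centers = [list(vecs[words[i * step]]) for i in range(k)]
    assign = {}
    for _ in range(iters):
        # Assignment step: nearest center for every word.
        for w in words:
            assign[w] = min(range(k), key=lambda c: sq_dist(vecs[w], centers[c]))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [vecs[w] for w in words if assign[w] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def doc_features(sentences, word_topic, k):
    """Map a document (list of tokenized sentences) to a k-dim topic vector.

    Hypothetical per-topic score = frequency * spread * position, where
      frequency = share of the document's tokens that belong to the topic,
      spread    = share of sentences that mention the topic,
      position  = 1 / (1 + index of the first sentence mentioning it).
    """
    total = sum(len(s) for s in sentences)
    feats = [0.0] * k
    if total == 0:
        return feats
    for t in range(k):
        hits = [sum(1 for w in s if word_topic.get(w) == t) for s in sentences]
        count = sum(hits)
        if count == 0:
            continue
        freq = count / total
        spread = sum(1 for h in hits if h) / len(sentences)
        first = next(i for i, h in enumerate(hits) if h)
        feats[t] = freq * spread * (1.0 / (1 + first))
    return feats


topics = cluster_words(WORD_VECS, k=2)
doc = [["team", "goal"], ["match", "bank"]]
features = doc_features(doc, topics, k=2)
```

The resulting vector has one dimension per word cluster rather than one per vocabulary word, which is how a representation of this kind avoids the high dimensionality and sparsity attributed to VSM in the abstract.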
Source
Computer Technology and Development (《计算机技术与发展》)
2016, No. 2, pp. 165-167, 173 (4 pages)
Funding
Fundamental Research Funds for the Central Universities (2014B33014)