Abstract
This paper studies and improves the classical term-weighting method of the Vector Space Model (VSM) and, on that basis, proposes a multi-level text classification method based on the VSM. In this method, the classes are organized into a tree according to given hierarchical relations, and all training documents in a class are merged into a single class document. When the class models are extracted, class documents are compared only with the other class documents under the same node at the same level. To classify a document, matching starts at the root node to find the corresponding top-level class and then proceeds recursively downward until the corresponding leaf subclass is found. Experiments and real-world systems indicate that the method achieves high classification precision and recall.
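The procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's improved term-weighting formula, so the code below substitutes a plain smoothed TF-IDF weight and cosine similarity as stand-ins. The names `Node`, `train`, `cosine`, and `classify`, and the toy class tree, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


class Node:
    """A class in the hierarchy; `docs` holds tokenized training documents."""

    def __init__(self, name, docs=None, children=None):
        self.name = name
        self.docs = docs or []
        self.children = children or []
        self.models = {}  # child name -> term-weight vector, filled by train()

    def class_document(self):
        """Merge this class's documents and its descendants' into one bag of terms."""
        bag = Counter()
        for doc in self.docs:
            bag.update(doc)
        for child in self.children:
            bag.update(child.class_document())
        return bag


def train(node):
    """Build one model per child, comparing sibling class documents only."""
    if not node.children:
        return
    bags = {c.name: c.class_document() for c in node.children}
    n = len(bags)
    # Document frequency counted over the sibling class documents only.
    df = Counter(term for bag in bags.values() for term in bag)
    for name, bag in bags.items():
        # Assumption: smoothed TF-IDF, not the paper's improved weighting.
        node.models[name] = {
            t: tf * math.log((n + 1) / (df[t] + 0.5)) for t, tf in bag.items()
        }
    for child in node.children:
        train(child)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def classify(root, tokens):
    """Descend from the root, picking the best-matching child at each level."""
    query = Counter(tokens)  # raw term frequencies for the query document
    node = root
    while node.children:
        node = max(node.children, key=lambda c: cosine(query, node.models[c.name]))
    return node.name


# Toy hierarchy (hypothetical data, for illustration only).
root = Node("root", children=[
    Node("sports", children=[
        Node("football", docs=[["goal", "league", "striker"], ["goal", "penalty", "league"]]),
        Node("basketball", docs=[["dunk", "league", "rebound"], ["dunk", "playoff"]]),
    ]),
    Node("finance", children=[
        Node("stocks", docs=[["share", "market", "dividend"]]),
        Node("banking", docs=[["loan", "interest", "market"]]),
    ]),
])
train(root)
print(classify(root, ["goal", "striker", "league"]))
```

Because each model is built only from sibling class documents, a term such as "league" that is common to every class under one node receives a low weight there, while it can still help separate that node from its own siblings one level up. The greedy descent commits to one branch per level, so an error at a higher level cannot be corrected further down; this is a property of the sketch and is not a claim about the paper's system.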
Source
Journal of Chinese Information Processing (《中文信息学报》)
Indexed in: CSCD, PKU Core Journals (北大核心)
2002, No. 3, pp. 8-14, 26 (8 pages)
Funding
Supported by the National Natural Science Foundation of China (60173017)
and the Beijing Natural Science Foundation (4011003)