Abstract
Automatic text classification is the applied technique of assigning a text to one or more categories according to a given strategy. This paper first introduces three statistics-based automatic classification techniques (the k-nearest-neighbor classifier, the support vector machine classifier, and the naive Bayes classifier) and analyzes the strengths and weaknesses of statistics-based classification. Its main weakness is that classification precision tends to decline as the overlap between the classification features of different categories grows, and this limitation is especially pronounced in multi-level classification. To address this limitation and improve classification precision, we introduce rule-based automatic classification to refine and extend the statistical approach. Finally, we combine the strengths of the two techniques to design and implement a hybrid classifier system, which performs well in classification.
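The abstract does not say how the rule-based and statistical classifiers are combined. As a minimal sketch only, the Python below assumes one plausible arrangement: hand-written rules are checked first, and a statistical model (here a small naive Bayes classifier, one of the three techniques named above) decides when no rule matches. The names `NaiveBayes`, `hybrid_classify`, the toy training data, and the example rule are all hypothetical, and tokens are assumed to be whitespace-separated, so Chinese text would need word segmentation beforehand.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def hybrid_classify(doc, rules, model):
    """Assumed combination: rules take priority, the model is the fallback."""
    tokens = set(doc.split())
    for required_words, label in rules:
        if required_words <= tokens:  # every word of the rule is present
            return label
    return model.predict(doc)


# Toy training data (hypothetical, for illustration only)
train_docs = [
    "stock market shares fall",
    "bank raises interest rates",
    "team wins football match",
    "player scores in final match",
]
train_labels = ["finance", "finance", "sports", "sports"]
model = NaiveBayes().fit(train_docs, train_labels)

# Hypothetical rule: "transfer" together with "fee" marks a sports text
rules = [({"transfer", "fee"}, "sports")]

print(model.predict("bank pays transfer fee"))
print(hybrid_classify("bank pays transfer fee", rules, model))
```

In this toy setup the statistical model alone is pulled toward the finance class by the shared word "bank", while the rule resolves the overlap. That is the kind of feature overlap between categories the abstract identifies as the weakness of purely statistical classification.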
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journal (北大核心)
2004, Issue 4, pp. 9-14 (6 pages)
Keywords
computer application
Chinese information processing
text mining
text categorization
rule-based classification