Abstract
Traditional feature extraction methods for text classification focus mainly on the effect of the category on feature terms and do not adequately express the influence of individual samples on the category. This paper therefore studies the contribution of samples to categories. Because the Sprinkling feature extraction method does not account for how much each sample contributes to its category, a feature extraction method based on K-Sprinkling is proposed. It combines sample tightness and sample membership information and, exploiting the characteristics of Sprinkling, maps the sample weights into the semantic space through Latent Semantic Indexing (LSI) to classify text. Experimental results show that, compared with the traditional Sprinkling method, K-Sprinkling improves the F1 value by 1.89% on balanced samples and by 3.30% on imbalanced samples.
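The idea of mapping sample weights into the LSI space can be illustrated with a minimal sketch. It assumes the usual Sprinkling construction, in which artificial class-label terms are appended to the term-document matrix before a truncated SVD. The function name `sprinkled_lsi`, the toy matrix, and the per-sample `weights` are hypothetical stand-ins: the abstract does not give the formulas for sample tightness or sample membership, so this is not the paper's K-Sprinkling implementation.

```python
import numpy as np

def sprinkled_lsi(X, labels, n_classes, weights=None, rank=2, n_sprinkle=1):
    """Rank-`rank` LSI reconstruction of a terms x documents matrix X
    after appending ("sprinkling") class-indicator rows.

    `weights` is a hypothetical per-document weight; with all ones this
    reduces to plain, unweighted sprinkling.
    """
    n_terms, n_docs = X.shape
    if weights is None:
        weights = np.ones(n_docs)
    # One artificial term per class: the entry is the document's weight
    # if the document belongs to that class, and 0 otherwise.
    S = np.zeros((n_classes, n_docs))
    S[labels, np.arange(n_docs)] = weights
    # Optionally repeat each class term to strengthen the class signal.
    S = np.repeat(S, n_sprinkle, axis=0)
    A = np.vstack([X, S])
    # Truncated SVD of the augmented matrix (the LSI step).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # Drop the sprinkled rows; only the real terms are kept.
    return A_k[:n_terms]

# Toy data: 5 terms x 4 documents, two classes (all values are made up).
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1],
              [1, 1, 1, 1]], dtype=float)
labels = np.array([0, 0, 1, 1])
weights = np.array([1.0, 0.8, 1.0, 0.6])  # hypothetical sample weights

X_k = sprinkled_lsi(X, labels, n_classes=2, weights=weights, rank=2)
```

In this sketch, documents with a larger weight pull the low-rank space more strongly toward their class term. A classifier such as k-nearest neighbours could then be applied to the columns of `X_k`.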
Source
《计算机工程》
CAS
CSCD
PKU Core Journal (北大核心)
2017, No. 12, pp. 141-146 (6 pages)
Computer Engineering
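Funding
Natural Science Foundation of Heilongjiang Province (F201201)
Special Fund for Forestry Scientific Research in the Public Welfare (201504307)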
Keywords
feature extraction
sample membership
sample tightness
Latent Semantic Indexing (LSI)
contribution degree