Abstract
Cancer is usually caused by gene mutations, so effectively identifying a small number of cancer-causing genes among a large number of genes is of great importance. To address the high-dimensional, small-sample nature of gene expression profile data, a novel gene selection method, K-class SVM-RFE (K-SVM-RFE), is proposed by combining support vector machine recursive feature elimination (SVM-RFE) with a feature clustering algorithm. The algorithm first removes a large number of irrelevant genes with a feature ranking algorithm, then groups similar genes into clusters with K-means, and selects the key genes through two passes of SVM-RFE. K-SVM-RFE is then applied to several gene expression profile data sets, and the settings of its key parameters are discussed. Experimental results show that the genes selected by K-SVM-RFE yield significantly higher classification accuracy than those selected by existing methods, with the improvement most pronounced when only a few genes are selected.
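The abstract names the stages of the pipeline (filter by feature ranking, K-means clustering of genes, two SVM-RFE passes) but not their details, so the following is only a minimal sketch of one plausible arrangement using scikit-learn. Everything specific here is an assumption and not taken from the paper: the ANOVA F-score as the ranking filter, running the first SVM-RFE pass inside each cluster and the second over the pooled survivors, and the hypothetical names k_svm_rfe_sketch, n_keep, n_clusters, per_cluster and n_final.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.svm import SVC


def k_svm_rfe_sketch(X, y, n_keep=200, n_clusters=10, per_cluster=2,
                     n_final=5, seed=0):
    """Return column indices of selected genes (illustrative sketch only).

    X is a (samples x genes) expression matrix, y holds the class labels.
    """
    # Stage 1 (assumed filter): keep the n_keep genes with the highest
    # ANOVA F-score; the paper's actual ranking criterion is not stated.
    n_keep = min(n_keep, X.shape[1])
    kept = SelectKBest(f_classif, k=n_keep).fit(X, y).get_support(indices=True)

    # Stage 2: cluster genes, not samples. Each gene is a point whose
    # coordinates are its standardized expression values across samples.
    profiles = X[:, kept].T
    profiles = profiles - profiles.mean(axis=1, keepdims=True)
    profiles = profiles / (profiles.std(axis=1, keepdims=True) + 1e-12)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(profiles)

    # Stage 3 (assumed first SVM-RFE pass): within each cluster, keep at
    # most per_cluster genes so that similar genes do not crowd the result.
    candidates = []
    for c in range(n_clusters):
        members = kept[labels == c]
        if members.size == 0:
            continue
        if members.size <= per_cluster:
            candidates.extend(members.tolist())
            continue
        rfe = RFE(SVC(kernel="linear"), n_features_to_select=per_cluster)
        rfe.fit(X[:, members], y)
        candidates.extend(members[rfe.support_].tolist())
    candidates = np.array(sorted(candidates))

    # Stage 4 (assumed second SVM-RFE pass): rank the pooled cluster
    # representatives against each other and keep n_final of them.
    if candidates.size <= n_final:
        return candidates
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=n_final)
    rfe.fit(X[:, candidates], y)
    return candidates[rfe.support_]
```

A call such as k_svm_rfe_sketch(X, y, n_keep=30, n_clusters=5, per_cluster=2, n_final=4) returns at most four gene indices. The cluster count and the number of genes retained per cluster are the kind of key parameters the abstract says the paper discusses; the values used here are placeholders.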
Authors
YE Xiaoquan (叶小泉)
WU Yunfeng (吴云峰)
Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, Xiamen 361005, China
Source
Journal of Xiamen University: Natural Science (《厦门大学学报(自然科学版)》)
CAS
CSCD
PKU Core Journals (北大核心)
2018, No. 5, pp. 702-707 (6 pages)
Funding
National Natural Science Foundation of China (61771331)
Keywords
gene expression profile
feature selection
K-means clustering
support vector machine