Abstract
Objective To investigate the performance of the random forest algorithm as a SNP screening procedure in high-dimensional case-control data on lung cancer. Methods The study included 500 hospital-based lung cancer patients as cases and 517 community-based individuals as controls. A 5 ml anticoagulated venous blood sample was collected from each participant. Genotyping was performed on a customized GoldenGate chip platform, and 399 SNPs were retained after filtering. The random forest algorithm was first applied to reduce the dimension, and traditional logistic regression was then used to analyze the retained variables. The area under the receiver operating characteristic (ROC) curve (AUC) was used to assess the association between multiple SNPs and genetic susceptibility to lung cancer. Results The random forest algorithm selected the 50 variables with the highest mean importance scores and the lowest error rate. The environmental variables (smoking, age group and gender) all ranked in the top 20, with importance scores of 4.05, 3.12 and 1.16, respectively. After adjustment for the 3 environmental variables and correction for multiple testing by the false discovery rate (FDR) method, 6 SNPs remained significantly associated with lung cancer (FDR-P < 0.05). When traditional logistic regression was applied directly, no significant SNP was found. The AUCs of the 2 ROC curves (one model including only the environmental variables, the other including the environmental variables and the SNPs) were 0.6491 ± 0.0172 and 0.6811 ± 0.0166, respectively, and the likelihood ratio test showed that the association between the 6 SNPs and lung cancer was statistically significant (χ^2 = 43.82, P = 3.6 × 10^-11). Conclusion Using the random forest algorithm to first remove noise SNPs from high-dimensional data and then applying logistic regression can improve the power of the test, and is superior to applying logistic regression directly.
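The two statistical steps reported in the Results can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: the FDR correction is taken to be the Benjamini-Hochberg procedure, the chi-square statistic is referred to a distribution with 1 degree of freedom, and the P values in `toy_p` are invented for illustration rather than taken from the study. The function names `bh_adjust` and `chi2_sf_1df` are hypothetical.

```python
import math


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P values (assumed FDR procedure)."""
    m = len(pvals)
    # Indices of the P values in ascending order.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


# Toy P values, NOT from the study: shows how FDR-adjusted values
# are compared with the 0.05 threshold.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.60]
toy_q = bh_adjust(toy_p)
significant = [q < 0.05 for q in toy_q]

# Reported statistic from the abstract, evaluated under the 1-df assumption.
p_reported = chi2_sf_1df(43.82)
print(toy_q, significant, p_reported)
```

Under the 1-df assumption, the statistic 43.82 gives a P value of about 3.6 × 10^-11, which agrees with the reported value. A 6-df reference distribution would give a much larger P value, so the reported test appears to compare the two models with a single degree of freedom.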
Source
《中华预防医学杂志》
CAS
CSCD
PKU Core (Peking University Core Journals)
2012, Issue 9, pp. 845-849 (5 pages)
Chinese Journal of Preventive Medicine
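Funding
Funded projects: National Natural Science Foundation of China (30901232, 81072389)
Major Project of the Natural Science Foundation of Jiangsu Higher Education Institutions (10KJA330034)
Priority Academic Program Development of Jiangsu Higher Education Institutions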
Keywords
Lung neoplasms
Polymorphism, single nucleotide
Artificial intelligence
Random forest