
Random forest dimension-reduction analysis of high-dimensional lung cancer case-control data (cited by: 7)

Random forest analysis of high dimensional case and control study of lung cancer
Abstract Objective To evaluate the random forest algorithm as a SNP screening procedure for high-dimensional case-control data on lung cancer. Methods The study included 500 hospital-based lung cancer patients as cases and 517 community-based individuals as controls. A 5 ml anticoagulated venous blood sample was routinely collected from each participant. Genotyping was performed on a custom GoldenGate chip platform, and 399 SNPs were retained after screening. The random forest algorithm was first applied to reduce the dimension, and conventional logistic regression was then used to analyze the retained variables. The area under the receiver operating characteristic (ROC) curve (AUC) was used to assess the association between multiple SNPs and genetic susceptibility to lung cancer. Results The random forest algorithm selected 50 variables with the highest average importance scores and the lowest error rate. The importance scores of the environmental variables (smoking, age group and gender) all ranked in the top 20, at 4.05, 3.12 and 1.16, respectively. After adjustment for the 3 environmental variables and correction for multiple testing by the false discovery rate (FDR) method, 6 SNPs remained significantly associated with lung cancer (FDR-P < 0.05), whereas applying conventional logistic regression directly identified no significant SNPs. The AUCs of the 2 ROC curves (one for the model with environmental variables only, the other for the model with environmental variables and SNPs) were 0.6491 ± 0.0172 and 0.6811 ± 0.0166, respectively, and the likelihood ratio test showed that the association between the 6 SNPs and lung cancer was statistically significant (χ^2 = 43.82, P = 3.6 × 10^-11). Conclusion Using the random forest algorithm to first remove noise SNPs from high-dimensional data and then applying logistic regression can improve statistical power and outperforms applying logistic regression directly.
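The two numerical steps named in the abstract, FDR correction and the chi-square test, can be sketched in a few lines. This is only an illustration, not the authors' code. The Benjamini-Hochberg step-up procedure is assumed as the FDR method (the abstract says only "FDR"), the function names and the example p-values are hypothetical, and the chi-square tail is computed for 1 degree of freedom, which the abstract does not state. Under that assumption, χ^2 = 43.82 gives a P value close to the reported 3.6 × 10^-11.

```python
import math


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR method)."""
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square statistic with 1 df (assumed df)."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical raw p-values for five SNPs, for illustration only.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
print(bh_adjust(raw))

# Statistic reported in the abstract; the 1 df is an assumption.
print(chi2_sf_1df(43.82))
```

A SNP would be kept when its adjusted value falls below 0.05, which mirrors the FDR-P < 0.05 threshold in the abstract.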
Source: Chinese Journal of Preventive Medicine (《中华预防医学杂志》), 2012, Issue 9, pp. 845-849 (5 pages). Indexed in CAS, CSCD and the Peking University Core Journals list.
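Funding: National Natural Science Foundation of China (30901232, 81072389); Major Project of the Natural Science Foundation of the Jiangsu Higher Education Institutions (10KJA330034); Priority Academic Program Development of Jiangsu Higher Education Institutions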
Keywords: Lung neoplasms; Polymorphism, single nucleotide; Artificial intelligence; Random forest


