
Research on the Characteristic of Partial Dependency for Spam Classification (垃圾邮件分类的偏依赖特性研究)

Cited by: 1
Abstract: Because a false positive (legitimate mail flagged as spam) degrades email filtering far more than a false negative (spam that slips through), a filter should be made more sensitive to the cost of false positives. This paper introduces a weight coefficient function with a partial-dependency characteristic and, on that basis, proposes an improved-fitting Logistic Regression model for email classification that supports asymmetric training. Tests on real email sample sets show that, with no loss of classification precision, the new model exhibits a clear partial dependency between the false positive ratio and the false negative ratio, and that it is also robust to perturbed feature data.
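The asymmetric-training idea in the abstract can be illustrated with a small sketch. The record does not give the paper's weight coefficient function, so the code below substitutes a constant per-class weight, which is an assumption and not the authors' method. The names `train`, `error_rates`, and `ham_weight` are hypothetical, and the data are synthetic one-dimensional scores rather than the paper's email corpus.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(xs, ys, ham_weight=1.0, epochs=300, lr=0.3):
    """Fit 1-D logistic regression by batch gradient descent.

    ys: 1 = spam, 0 = ham. ham_weight (hypothetical parameter) scales the
    loss on ham samples, so values > 1 make false positives costlier.
    """
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            cost = ham_weight if y == 0 else 1.0
            err = cost * (sigmoid(w * x + b) - y)
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def error_rates(xs, ys, w, b):
    """Return (false positive ratio, false negative ratio) at threshold 0.5."""
    fp = sum(1 for x, y in zip(xs, ys) if y == 0 and sigmoid(w * x + b) >= 0.5)
    fn = sum(1 for x, y in zip(xs, ys) if y == 1 and sigmoid(w * x + b) < 0.5)
    return fp / ys.count(0), fn / ys.count(1)


# Synthetic, overlapping 1-D "spam score" data (an assumption, not the paper's corpus).
rng = random.Random(0)
xs = [rng.gauss(-1.0, 1.0) for _ in range(200)] + [rng.gauss(1.0, 1.0) for _ in range(200)]
ys = [0] * 200 + [1] * 200

fp_sym, fn_sym = error_rates(xs, ys, *train(xs, ys, ham_weight=1.0))
fp_asym, fn_asym = error_rates(xs, ys, *train(xs, ys, ham_weight=5.0))
print(f"symmetric:  FP={fp_sym:.3f} FN={fn_sym:.3f}")
print(f"asymmetric: FP={fp_asym:.3f} FN={fn_asym:.3f}")
```

Weighting ham errors more heavily pushes the decision boundary toward the spam class, so in this toy setting the false positive ratio should fall while the false negative ratio rises. That is the kind of trade-off between the two ratios the abstract describes, although the paper's own weighting scheme may behave differently.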
Source: Acta Electronica Sinica (《电子学报》), 2007, No. 10, pp. 1870-1874 (5 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National High Technology Research and Development Program of China (863 Program), No. 863-104-03-01
Keywords: spam; characteristic of partial dependency; false positive ratio; false negative ratio


