
A Classification Method for Web Information Extraction (cited by: 2)

Abstract: Web information extraction is viewed as a classification process, and a competing classification method is presented that extracts Web information directly through classification. Web fragments are represented with three general features, and the similarities between fragments are then defined on the basis of these features. Through competition among fragments for the different slots in information templates, the method classifies fragments into slot classes and filters out noise. Far fewer annotated samples are needed than with rule-based methods, which makes the method highly portable. Experiments show that the method performs well and is superior to a DOM-based method in information extraction.
Key words: information extraction; competing classification; feature extraction; wrapper induction
CLC number: TP 311
Foundation item: Supported by the National Natural Science Foundation of China (60303024)
Biography: LI Xiang-yang (1974-), male, Ph.D. candidate, research direction: information extraction, natural language processing.
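The competition idea in the abstract can be illustrated with a minimal sketch. The abstract does not name the three features or the competition rule, so everything below is an assumption made for illustration: the features (tag path, token overlap, text length), the averaged similarity, the greedy one-fragment-per-slot competition, the threshold, and all names such as `Fragment` and `classify` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    text: str      # visible text of the Web fragment
    tag_path: str  # hypothetical feature: position in the markup, e.g. "html/body/h1"


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def similarity(x, y):
    """Average of three assumed feature similarities (not the paper's definition)."""
    path_sim = jaccard(x.tag_path.split("/"), y.tag_path.split("/"))
    token_sim = jaccard(x.text.lower().split(), y.text.lower().split())
    len_sim = min(len(x.text), len(y.text)) / max(len(x.text), len(y.text), 1)
    return (path_sim + token_sim + len_sim) / 3


def classify(fragments, samples, threshold=0.5):
    """samples maps slot name -> list of annotated Fragments.

    Returns (assignment, noise): the winning fragment per slot, and the
    fragments that won no slot.
    """
    # A fragment's score for a slot is its best similarity to that slot's samples.
    scores = {
        (i, slot): max(similarity(frag, s) for s in examples)
        for i, frag in enumerate(fragments)
        for slot, examples in samples.items()
    }
    assignment, taken = {}, set()
    # Greedy competition: the strongest (fragment, slot) pairs are settled first.
    for (i, slot), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score < threshold:
            break  # every remaining pair is weaker still
        if slot in assignment or i in taken:
            continue
        assignment[slot] = fragments[i]
        taken.add(i)
    noise = [f for i, f in enumerate(fragments) if i not in taken]
    return assignment, noise


# One annotated sample per slot, then three fragments from a new page.
samples = {
    "title": [Fragment("Learning Python Basics", "html/body/h1")],
    "price": [Fragment("$ 19.99", "html/body/div/span")],
}
page = [
    Fragment("Advanced Python Basics", "html/body/h1"),
    Fragment("$ 24.99", "html/body/div/span"),
    Fragment("Copyright 2004 all rights reserved", "html/body/footer/p"),
]
assignment, noise = classify(page, samples)
```

With a single annotated sample per slot, the heading and the price fragment each win their slot, and the footer fragment falls below the threshold for both slots and is treated as noise. This mirrors, in toy form, the claim that few annotated samples are needed; it says nothing about the accuracy reported in the paper.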
Source: Wuhan University Journal of Natural Sciences (武汉大学学报(自然科学英文版)), CAS, 2004, No. 5, pp. 823-827, 5 pages