Chinese Word Segmentation: A Decade Review (中文分词十年回顾)

Cited by: 251
Abstract: Over the past decade, and especially since the First International Chinese Word Segmentation Bakeoff was held in July 2003, automatic Chinese word segmentation has made encouraging progress. The main advances are: (1) the combination of "segmentation guidelines + lexicon + segmented corpus" has given Chinese words in real text a computable definition, which is the basis for automatic segmentation and for comparable evaluation; (2) in practice, segmentation systems based on handcrafted rules have been outperformed in evaluations by systems based on statistical learning; (3) evaluation on Bakeoff data shows that the loss in segmentation accuracy caused by out-of-vocabulary (OOV) words is at least five times greater than that caused by segmentation ambiguities; (4) experiments show that the character-based tagging approach, a statistical learning method that substantially improves OOV word recognition, outperforms earlier word-based (or dictionary-based) methods and has raised the accuracy of automatic segmentation systems to a new high.
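As a rough illustration of what "character-based tagging" in point (4) means, the sketch below recasts segmentation as assigning one position tag to each character. The four-tag B/M/E/S set (begin, middle, end, single) and the helper names `words_to_tags` and `tags_to_words` are assumptions made for this example; the abstract does not specify a tag set or a learning model. A statistical learner would be trained to predict the tags, and that part is not shown.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into one position tag per character.

    Assumed tag set: B = word-initial, M = word-internal,
    E = word-final, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.append("B")
            tags.extend("M" * (len(word) - 2))
            tags.append("E")
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the word sequence from characters and their tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # flush a trailing word that was never closed
        words.append(current)
    return words


if __name__ == "__main__":
    segmented = ["未登录词", "的", "识别"]
    tags = words_to_tags(segmented)
    print(tags)
    print(tags_to_words("".join(segmented), tags))
```

Because each tag depends only on a character and its context rather than on a dictionary entry, a tagger of this kind can in principle mark word boundaries for words absent from any lexicon, which is how the approach bears on OOV recognition.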
Authors: Huang Changning (黄昌宁), Zhao Hai (赵海)
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2007, No. 3, pp. 8-19 (12 pages)
Keywords: computer application; Chinese information processing; Chinese word segmentation (CWS); definition of words; out-of-vocabulary (OOV) word recognition; character-based tagging approach to CWS