Abstract
Over the past decade, and especially since the First International Chinese Word Segmentation Bakeoff was held in July 2003, automatic Chinese word segmentation has made notable progress. The main advances can be summarized as follows: (1) the combination of "segmentation guidelines + lexicon + segmented corpus" has given Chinese words in real text a computable definition, which is the basis for automatic segmentation and for comparable evaluation; (2) practice has shown that segmentation systems based on handcrafted rules are outperformed in evaluations by systems based on statistical learning; (3) evaluation on Bakeoff data shows that the loss in segmentation accuracy caused by out-of-vocabulary (OOV) words is at least five times greater than that caused by segmentation ambiguities; (4) experiments show that better OOV word recognition yields higher overall segmentation accuracy: the character-based tagging approach, a statistical learning method that substantially improves OOV recognition, outperforms earlier word-based (or dictionary-based) methods and has raised the accuracy of automatic segmentation systems to a new high.
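As a rough illustration of the character-based tagging idea in point (4), segmentation can be recast as assigning each character a position-in-word tag, so that a word never seen in the lexicon can still be recovered from its tags. The sketch below assumes a four-tag scheme (B = begin, M = middle, E = end, S = single-character word); the abstract does not specify a tag set, and the function names are hypothetical. It shows only the conversion between words and tags, not the statistical model that predicts the tags.

```python
def words_to_tags(words):
    """Convert a segmented sentence (list of words) into per-character tags.

    Assumed tag set: B (begin), M (middle), E (end), S (single-character word).
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word boundary follows this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


tags = words_to_tags(["未登录词", "识别"])
print(tags)
print(tags_to_words("未登录词识别", tags))
```

In a full system, a sequence model trained on a segmented corpus would predict the tags for unsegmented text, and `tags_to_words` would then produce the segmentation.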
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2007, No. 3, pp. 8-19 (12 pages)
Keywords
computer application
Chinese information processing
Chinese word segmentation (CWS)
definition of words
out-of-vocabulary (OOV) word recognition
character-based tagging approach to CWS