Abstract
A new joint decoding strategy is proposed that combines character-based and word-based conditional random field (CRF) models. In this segmentation framework, fragments are used to generate candidate out-of-vocabulary (OOV) words. After the initial segmentation, the segmentation fragments are divided into two classes: "combination" (combining several fragments into one unknown word) and "segregation" (separating them into individual words). In this way, more OOV words can be recalled. Moreover, to address the characteristics of cross-domain segmentation, context information is used to guide Chinese Word Segmentation (CWS). Experiments on the test data from the SIGHAN Bakeoff 2007 and Bakeoff 2010 show that the method is effective: it improves OOV recall and achieves good overall segmentation performance.
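The "combination" case described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not define how fragments are identified or classified, so the sketch assumes that a fragment is any token of the initial segmentation that is missing from a lexicon, and that candidates are formed by concatenating adjacent fragments. The function name `combination_candidates`, the `lexicon` set, and the `max_span` limit are all hypothetical. The CRF decoding, the "segregation" case, and the use of context information are not shown.

```python
from typing import List, Set, Tuple


def combination_candidates(
    tokens: List[str], lexicon: Set[str], max_span: int = 3
) -> List[Tuple[int, int, str]]:
    """Enumerate candidate OOV words by merging adjacent fragments.

    Assumption (not from the source): a "fragment" is a token of the
    initial segmentation that does not appear in the lexicon.
    Returns (start, end, word) triples, with `end` exclusive.
    """
    candidates = []
    n = len(tokens)
    for start in range(n):
        # Try spans of 2..max_span adjacent tokens starting at `start`.
        for end in range(start + 2, min(start + max_span, n) + 1):
            span = tokens[start:end]
            # Stop extending once the span reaches a known word.
            if any(tok in lexicon for tok in span):
                break
            candidates.append((start, end, "".join(span)))
    return candidates


# Hypothetical initial segmentation in which a name was split into characters.
initial = ["奥", "巴", "马", "访问", "中国"]
known_words = {"访问", "中国"}
for start, end, word in combination_candidates(initial, known_words):
    print(start, end, word)
```

In a full system, each candidate would still need to be scored, for example by the joint decoder, before it is accepted as a word; the sketch only enumerates the search space.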
Funding
Supported by the National Natural Science Foundation of China under Grants No. 61173100 and No. 61173101, and by the Fundamental Research Funds for the Central Universities under Grant No. DUT10RW202.