A Divide-Conquer Strategy for Both English and Chinese Text Chunking

Author

Liang, Ying-Hong ; Wang, Ni-Hong ; Qiu, Zhao-wen ; Chen, Yin- ; Zhao, Tie-jun

fYear

2007

fDate

22-24 Aug. 2007

Firstpage

81

Lastpage

86

Abstract

The traditional approach to English text chunking identifies all phrases with a single model and the same types of features. This has two known limitations: the same feature types are not suitable for every phrase type, and data sparseness may result. In this paper, a divide-conquer strategy is proposed and applied to the identification of English phrases, and is then rapidly transplanted to Chinese text chunking. The strategy divides the chunking task into several sub-tasks according to the sensitive features of each phrase type and identifies the different phrase types in parallel. A two-stage conflict-decreasing strategy is then used to synthesize the answers of the sub-tasks. The main features are, first, that each phrase type uses its own sensitive features and, second, that data sparseness is avoided. In tests on a public corpus (English) and the Chinese Penn Treebank (Chinese), the F score reaches 95.14% for English chunking and 95.23% for Chinese chunking. These results are state of the art, on par with the best results that have been reported.

Keywords

Data mining; Electronic mail; Forestry; Information technology; Laboratories; Learning systems; Natural language processing; Natural languages; Speech processing; Testing; text chunking; divide-conquer strategy; data sparseness

fLanguage

English

Publisher

IEEE

Conference_Title

Advanced Language Processing and Web Information Technology, 2007. ALPIT 2007. Sixth International Conference on

Conference_Location

Luoyang, Henan, China

Print_ISBN

978-0-7695-2930-1

Type

conf

DOI

10.1109/ALPIT.2007.36

Filename

4460619