DocumentCode :
2141455
Title :
Word Segmentation for the Sequences Emitted from a Word-Valued Source
Author :
Ishida, Takashi ; Matsushima, Toshiyasu ; Hirasawa, Shigeichi
Author_Institution :
Waseda Univ., Tokyo
fYear :
2007
fDate :
16-19 Oct. 2007
Firstpage :
662
Lastpage :
661
Abstract :
Word segmentation is the most fundamental and important process in Japanese and Chinese language processing. Because these languages do not separate words, a character sequence must first be segmented into words. For this problem, approaches based on probabilistic language models are known to be highly efficient, and this has been demonstrated in practice. Meanwhile, the word-valued source (WVS) has recently been proposed as a new class of source model for the source coding problem. This model can be expected to reflect more of the probabilistic structure of natural languages. A Japanese or Chinese sentence may be regarded as a sequence emitted from a non-prefix-free WVS. In this paper, as a first step toward applying the WVS to natural language processing, we formulate a word segmentation problem for sequences emitted from a non-prefix-free WVS. We then examine the performance of word segmentation for these models through numerical computation.
Keywords :
natural language processing; word processing; Chinese language processing; Japanese language processing; natural languages structure; probabilistic language model; source coding problem; word segmentation; word-valued source; Binary trees; Character recognition; Computational modeling; Information analysis; Information technology; Natural language processing; Natural languages; Numerical models; Random variables; Source coding;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer and Information Technology, 2007. CIT 2007. 7th IEEE International Conference on
Conference_Location :
Aizu-Wakamatsu, Fukushima
Print_ISBN :
978-0-7695-2983-7
Type :
conf
DOI :
10.1109/CIT.2007.170
Filename :
4385160