Regional Information Center for Science and Technology - Word Segmentation for the Sequences Emitted from a Word-Valued Source

DocumentCode :

2141455

Title :

Word Segmentation for the Sequences Emitted from a Word-Valued Source

Author :

Ishida, Takashi ; Matsushima, Toshiyasu ; Hirasawa, Shigeichi

Author_Institution :

Waseda Univ., Tokyo

fYear :

2007

fDate :

16-19 Oct. 2007

Firstpage :

662

Lastpage :

661

Abstract :

Word segmentation is the most fundamental and important process in Japanese and Chinese language processing. Because these languages do not mark boundaries between words, a character sequence must first be separated into words. Approaches based on probabilistic language models are known to be highly efficient for this problem, and this has been shown in practice. Separately, the word-valued source (WVS) has recently been proposed as a new class of source model for the source coding problem. This model can be expected to reflect more of the probabilistic structure of natural languages. A Japanese or Chinese sentence may be regarded as a sequence emitted from a non-prefix-free WVS. In this paper, as a first step toward applying WVS to natural language processing, we formulate a word segmentation problem for sequences emitted from a non-prefix-free WVS. We then examine the performance of word segmentation for these models by numerical computation.
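
To illustrate why a non-prefix-free word set makes segmentation ambiguous, the following is a minimal sketch of maximum-probability segmentation by dynamic programming. It is not the method of the paper: it assumes, purely for illustration, that words are emitted independently, and the names `segment` and `word_probs` and the toy probabilities are hypothetical.

```python
import math


def segment(sequence, word_probs):
    """Return the most probable split of `sequence` into known words.

    Assumption (not from the paper): words are emitted independently
    with the probabilities given in `word_probs`.
    """
    n = len(sequence)
    max_len = max(len(w) for w in word_probs)
    # best[i] = (log-probability, start index of last word) for sequence[:i]
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sequence[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no split into known words exists
    # Follow the stored start indices backwards to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(sequence[start:end])
        end = start
    return words[::-1]


# Toy non-prefix-free word set: "a" is a prefix of "ab",
# so "aba" can be read as "a" + "ba" or as "ab" + "a".
word_probs = {"a": 0.5, "ab": 0.3, "ba": 0.2}
print(segment("aba", word_probs))
```

With these toy values, "ab" + "a" has probability 0.15 and "a" + "ba" has probability 0.10, so the sketch selects the former. With a prefix-free word set no such ambiguity could arise, because each word boundary would be determined as soon as a word is completed.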

Keywords :

natural language processing; word processing; Chinese language processing; Japanese language processing; natural languages structure; probabilistic language model; source coding problem; word segmentation; word-valued source; Binary trees; Character recognition; Computational modeling; Information analysis; Information technology; Natural language processing; Natural languages; Numerical models; Random variables; Source coding;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Computer and Information Technology, 2007. CIT 2007. 7th IEEE International Conference on

Conference_Location :

Aizu-Wakamatsu, Fukushima

Print_ISBN :

978-0-7695-2983-7

Type :

conf

DOI :

10.1109/CIT.2007.170

Filename :

4385160

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2141455