DocumentCode :
2449469
Title :
Joint n-gram Chinese language modeling with an application to Chinese word segmentation
Author :
He, Xin ; Ou, Zhijian ; Sun, Jiasong
Author_Institution :
Dept. of Electron. Eng., Tsinghua Univ., Beijing, China
fYear :
2012
fDate :
16-18 July 2012
Firstpage :
319
Lastpage :
323
Abstract :
State-of-the-art language models (LMs) are n-gram models, which, for Chinese, are word-based n-grams. To construct Chinese word-based n-gram LMs, we need a lexicon and a Chinese word segmentation (CWS) step. However, there is no standard definition of a word in Chinese, and it is always possible to construct new words by combining multiple characters, which causes out-of-vocabulary (OOV) problems. These issues make lexicon definition and CWS difficult and ill-defined, which degrades the quality of Chinese LMs. Recently, conditional random fields (CRFs) have been shown to perform robust and accurate CWS, especially in recalling OOV words. However, they are in essence not Chinese language models, but conditional models of the position-of-character (POC) tag sequence given the character sequence. In this paper, we propose a new Chinese language model, the joint n-gram, which incorporates the POC tags so that we avoid using a lexicon. It is a truly generative model of Chinese sentences. The effectiveness of the new LM is shown in terms of perplexity and CWS performance.
Keywords :
natural language processing; word processing; CRF; CWS; Chinese sentences; Chinese word segmentation; Chinese word-based n-gram LMs; OOV problems; OOV words; POC tag-sequence; conditional random fields; joint n-gram Chinese language modeling; lexicon definition; out-of-vocabulary problems; position-of-character tag-sequence; state-of-the-art language models; Computational modeling; Hidden Markov models; Joints; Robustness; Speech recognition; Standards; Tagging;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Audio, Language and Image Processing (ICALIP), 2012 International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0173-2
Type :
conf
DOI :
10.1109/ICALIP.2012.6376633
Filename :
6376633