Title :
Joint n-gram Chinese language modeling with an application to Chinese word segmentation
Author :
He, Xin ; Ou, Zhijian ; Sun, Jiasong
Author_Institution :
Dept. of Electron. Eng., Tsinghua Univ., Beijing, China
Abstract :
State-of-the-art language models (LMs) are n-gram models, which, for Chinese, are word-based n-grams. Constructing Chinese word-based n-gram LMs requires a lexicon and a Chinese word segmentation (CWS) step. However, there is no standard definition of a word in Chinese, and new words can always be constructed by combining multiple characters, which causes out-of-vocabulary (OOV) problems. These issues make lexicon definition and CWS difficult and ill-defined, which degrades the quality of Chinese LMs. Recently, conditional random fields (CRFs) have been shown to perform robust and accurate CWS, especially in recalling OOV words. However, they are in essence not Chinese language models but conditional models of the position-of-character (POC) tag-sequence given the character-sequence. In this paper, we propose a new Chinese language model, the joint n-gram, which incorporates the POC tags so that no lexicon is needed. It is a truly generative model of Chinese sentences. The effectiveness of the new LM is shown in terms of perplexity and CWS performance.
Keywords :
natural language processing; word processing; CRF; CWS; Chinese sentences; Chinese word segmentation; Chinese word-based n-gram LMs; OOV problems; OOV words; POC tag-sequence; conditional random fields; joint n-gram Chinese language modeling; lexicon definition; out-of-vocabulary problems; position-of-character tag-sequence; state-of-the-art language models; Computational modeling; Hidden Markov models; Joints; Robustness; Speech recognition; Standards; Tagging
Conference_Title :
Audio, Language and Image Processing (ICALIP), 2012 International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0173-2
DOI :
10.1109/ICALIP.2012.6376633