DocumentCode
302107
Title
Ergodic multigram HMM integrating word segmentation and class tagging for Chinese language modeling
Author
Law, Hubert Hin-Cheung ; Chan, Chorkin
Author_Institution
Dept. of Comput. Sci., Hong Kong Univ., Hong Kong
Volume
1
fYear
1996
fDate
7-10 May 1996
Firstpage
196
Abstract
A novel ergodic multigram hidden Markov model (HMM) is introduced that models sentence production as a doubly stochastic process: word classes are first produced according to a first-order Markov model, and single- or multi-character words are then generated independently, conditioned on the word classes, with no word boundaries marked in the sentence. The model can be applied to languages without word boundary markers, such as Chinese. Given a lexicon containing the syntactic classes of each word, its applications include language modeling for recognizers and integrated word segmentation and class tagging. A pre-segmented and tagged corpus is not needed for training, and segmentation and tagging are both trained in a single model. In this paper, relevant algorithms for the model are presented, and experimental results on a Chinese news corpus are reported.
Keywords
hidden Markov models; natural languages; speech recognition; stochastic processes; Chinese language modeling; Chinese news corpus; class tagging; doubly stochastic process; ergodic multigram HMM; hidden Markov model; lexicon; multi-character words; sentence production; single character words; syntactic classes; word segmentation; Computer science; Hidden Markov models; Lattices; Maximum likelihood decoding; Natural languages; Production; Stochastic processes; Tagging; Terminology; Viterbi algorithm;
fLanguage
English
Publisher
IEEE
Conference_Titel
Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on
Conference_Location
Atlanta, GA
ISSN
1520-6149
Print_ISBN
0-7803-3192-3
Type
conf
DOI
10.1109/ICASSP.1996.540324
Filename
540324