DocumentCode
390491
Title
Maximum likelihood algorithm on Chinese word segmentation
Author
Lo, Wing-Sze ; Wong, Hi-Fung ; Siu, Man-Hung
Author_Institution
Dept. of Electr. & Electron. Eng., Hong Kong Univ. of Sci. & Technol., Kowloon, China
Volume
1
fYear
2002
fDate
26-30 Aug. 2002
Firstpage
468
Abstract
A Chinese sentence is typically written as a sequence of characters. However, a word is a logical semantic and syntactic unit. Thus, a segmentation algorithm is necessary to map the sequence of characters into a sequence of words. Forward maximum matching, which tries to find the longest words that match the characters in the sentence, is one of the most popular methods because of its simplicity and efficiency. However, because it makes each decision by finding the longest next word without regard to the whole sentence, it is not optimal. In this paper, we propose two new segmentation algorithms: the dynamic matching algorithm and the maximum likelihood segmentation algorithm. In the dynamic matching algorithm, dynamic programming is used to look for the best segmentation (longest average word length) for the whole sentence. In the maximum likelihood algorithm, we aim at obtaining the most likely word segmentation given a particular language model. Because it uses the maximum likelihood criterion, this algorithm is also guaranteed to give the best perplexity across different segmentations. While both algorithms yield limited gains in terms of perplexity reduction, both give a significant reduction in recognition error on the 863 corpus.
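The three strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `max_len` limit, the single-character fallback, the unigram language model and the `unk_logprob` penalty are all assumptions made for illustration, and the toy lexicon uses Latin letters in place of Chinese characters.

```python
import math

def forward_maximum_matching(sentence, lexicon, max_len=4):
    """Greedy baseline: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + n]
            if n == 1 or cand in lexicon:  # assumed fallback: single character
                words.append(cand)
                i += n
                break
    return words

def _backtrace(sentence, back):
    """Recover the word sequence from the back-pointer table."""
    words, j = [], len(sentence)
    while j > 0:
        words.append(sentence[back[j]:j])
        j = back[j]
    return words[::-1]

def dynamic_matching(sentence, lexicon, max_len=4):
    """DP over the whole sentence: the fewest words, which is the same as
    the longest average word length for a fixed sentence length."""
    n = len(sentence)
    best = [0] + [math.inf] * n  # best[j] = minimum word count for sentence[:j]
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            cand = sentence[i:j]
            if (j - i == 1 or cand in lexicon) and best[i] + 1 < best[j]:
                best[j], back[j] = best[i] + 1, i
    return _backtrace(sentence, back)

def ml_segmentation(sentence, unigram_logprob, max_len=4, unk_logprob=-20.0):
    """DP that maximizes the summed log-probability of the words.
    Assumes a unigram model; the paper's language model may differ."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n  # best[j] = best log-likelihood of sentence[:j]
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            cand = sentence[i:j]
            if cand in unigram_logprob:
                score = unigram_logprob[cand]
            elif j - i == 1:
                score = unk_logprob  # assumed penalty for an unseen character
            else:
                continue
            if best[i] + score > best[j]:
                best[j], back[j] = best[i] + score, i
    return _backtrace(sentence, back)

# Toy data (hypothetical): greedy matching commits to "abc" and strands "d", "e".
lexicon = {"ab", "abc", "cde"}
logp = {w: math.log(p) for w, p in
        {"ab": 0.3, "abc": 0.3, "cde": 0.3, "d": 0.05, "e": 0.05}.items()}
greedy = forward_maximum_matching("abcde", lexicon)
dynamic = dynamic_matching("abcde", lexicon)
likely = ml_segmentation("abcde", logp)
```

On this toy input the greedy pass commits to the longest first word and is left with two single characters, while both whole-sentence searches find the two-word segmentation.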
Keywords
dynamic programming; maximum likelihood estimation; speech recognition; Chinese word segmentation; dynamic matching algorithm; language model; likely word segmentation; longest average word length; maximum likelihood algorithm; perplexity reduction; recognition error reduction; word sequence; Heuristic algorithms; Humans; Natural languages; Vocabulary
fLanguage
English
Publisher
IEEE
Conference_Titel
Signal Processing, 2002 6th International Conference on
Print_ISBN
0-7803-7488-6
Type
conf
DOI
10.1109/ICOSP.2002.1181093
Filename
1181093