Title :
Maximum likelihood algorithm on Chinese word segmentation
Author :
Lo, Wing-Sze ; Wong, Hi-Fung ; Siu, Man-Hung
Author_Institution :
Dept. of Electr. & Electron. Eng., Hong Kong Univ. of Sci. & Technol., Kowloon, China
Abstract :
A Chinese sentence is typically written as a sequence of characters. However, a word is a logical semantic and syntactic unit. Thus, a segmentation algorithm is necessary to map the sequence of characters into a sequence of words. Forward maximum matching, which tries to find the longest words to match the characters in the sentence, is one of the most popular methods because of its simplicity and efficiency. However, because it makes each decision by finding the longest next word without regard to the whole sentence, it is not optimal. In this paper, we propose two new segmentation algorithms: the dynamic matching algorithm and the maximum likelihood segmentation algorithm. In the dynamic matching algorithm, dynamic programming is used to look for the best segmentation (longest average word length) for the whole sentence. In the maximum likelihood algorithm, we aim at obtaining the most likely word segmentation given a particular language model. Because it maximizes likelihood, this algorithm is also guaranteed to give the best perplexity across different segmentations. While both algorithms yield limited gains in terms of perplexity reduction, both give a significant reduction in recognition error on the 863 corpus.
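The three segmentation strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon, the unigram probabilities, the 1e-6 floor for unknown single characters, the 4-character word limit, and all function names are hypothetical choices made for this example. In particular, the abstract says only "a particular language model", so the unigram model here is an assumption. The sketch relies on the fact that, for a fixed sentence, the longest average word length is the same as the fewest words, so both new algorithms reduce to one dynamic program with different word costs.

```python
import math

def forward_maximum_matching(sentence, lexicon, max_len=4):
    """Greedy baseline: at each position take the longest lexicon word,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

def best_segmentation(sentence, word_cost, max_len=4):
    """Dynamic programming over sentence prefixes: choose the segmentation
    whose summed word cost is minimal for the whole sentence."""
    n = len(sentence)
    best = [math.inf] * (n + 1)   # best[k] = minimal cost of sentence[:k]
    back = [0] * (n + 1)          # back[k] = start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cost = best[start] + word_cost(sentence[start:end])
            if cost < best[end]:
                best[end], back[end] = cost, start
    words, end = [], n
    while end > 0:
        words.append(sentence[back[end]:end])
        end = back[end]
    return words[::-1]

# Hypothetical toy lexicon and unigram probabilities (assumptions, not corpus data).
unigram = {"研究": 0.3, "研究生": 0.1, "生命科学": 0.2}
lexicon = set(unigram)

def count_cost(word):
    """Dynamic matching: every allowed word costs 1, so the minimum-cost
    path has the fewest words, i.e. the longest average word length."""
    return 1.0 if (word in lexicon or len(word) == 1) else math.inf

def ml_cost(word):
    """Maximum likelihood: negative log unigram probability; unknown single
    characters get an assumed floor probability of 1e-6."""
    p = unigram.get(word, 1e-6 if len(word) == 1 else 0.0)
    return -math.log(p) if p > 0 else math.inf

sentence = "研究生命科学"
print(forward_maximum_matching(sentence, lexicon))  # ['研究生', '命', '科', '学']
print(best_segmentation(sentence, count_cost))      # ['研究', '生命科学']
print(best_segmentation(sentence, ml_cost))         # ['研究', '生命科学']
```

On this toy sentence the greedy matcher commits to the three-character word first and is left with three single characters, while both whole-sentence searches recover the two-word segmentation. With a different language model the maximum likelihood result could differ from the dynamic matching result, since it depends on word probabilities rather than word count.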
Keywords :
dynamic programming; maximum likelihood estimation; speech recognition; Chinese word segmentation; dynamic matching algorithm; language model; likely word segmentation; longest average word length; maximum likelihood algorithm; perplexity reduction; recognition error reduction; word sequence; Heuristic algorithms; Humans; Natural languages; Vocabulary
Conference_Title :
Signal Processing, 2002 6th International Conference on
Print_ISBN :
0-7803-7488-6
DOI :
10.1109/ICOSP.2002.1181093