Regional Information Center for Science and Technology - Maximum likelihood algorithm on Chinese word segmentation

DocumentCode :

390491

Title :

Maximum likelihood algorithm on Chinese word segmentation

Author :

Lo, Wing-Sze ; Wong, Hi-Fung ; Siu, Man-Hung

Author_Institution :

Dept. of Electr. & Electron. Eng., Hong Kong Univ. of Sci. & Technol., Kowloon, China

Volume :

fYear :

2002

fDate :

26-30 Aug. 2002

Firstpage :

468

Abstract :

A Chinese sentence is typically written as a sequence of characters. However, a word is a logical semantic and syntactic unit. Thus, a segmentation algorithm is necessary to map the sequence of characters into a sequence of words. Forward maximum matching, which tries to find the longest words to match the characters in the sentence, is one of the most popular methods because of its simplicity and efficiency. However, because it makes decisions by finding the longest next word without regard to the whole sentence, it is not optimal. In this paper, we propose two new segmentation algorithms: the dynamic matching algorithm and the maximum likelihood segmentation algorithm. In the dynamic matching algorithm, dynamic programming is used to look for the best segmentation (longest average word length) for the whole sentence. In the maximum likelihood algorithm, we aim at obtaining the most likely word segmentation given a particular language model. Because it uses the maximum likelihood criterion, this algorithm is also guaranteed to give the best perplexity across different segmentations. While both algorithms yield limited gains in terms of perplexity reduction, both give a significant reduction in recognition error on the 863 corpus.
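
The three strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_len` limit, the single-character fallback, the unigram language model with a probability floor, and the Latin-letter stand-ins for Chinese characters are all assumptions made for illustration. For a sentence of fixed length, the longest average word length corresponds to the fewest words, so the dynamic matching criterion is modelled here as a cost of 1 per word.

```python
import math

def forward_maximum_matching(sentence, lexicon, max_len=4):
    """Greedy baseline: at each position take the longest lexicon word,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

def best_segmentation(sentence, word_cost, max_len=4):
    """Dynamic programming over end positions: returns the segmentation
    of the whole sentence with the smallest summed word_cost."""
    n = len(sentence)
    best = [0.0] + [math.inf] * n   # best[k]: cheapest cost of sentence[:k]
    back = [0] * (n + 1)            # back[k]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cost = best[start] + word_cost(sentence[start:end])
            if cost < best[end]:
                best[end], back[end] = cost, start
    words, end = [], n
    while end > 0:
        words.append(sentence[back[end]:end])
        end = back[end]
    return words[::-1]

def count_cost(lexicon):
    """Assumed stand-in for dynamic matching: every word costs 1, so the
    cheapest path has the fewest words (longest average word length)."""
    return lambda w: 1.0 if (w in lexicon or len(w) == 1) else math.inf

def unigram_cost(probs, floor=1e-6):
    """Assumed stand-in for maximum likelihood: negative log probability
    under a unigram model; unknown single characters get a small floor."""
    def cost(w):
        if w in probs:
            return -math.log(probs[w])
        return -math.log(floor) if len(w) == 1 else math.inf
    return cost

# Letters stand in for Chinese characters; lexicon and probabilities are made up.
lexicon = {"ab", "abc", "cde"}
print(forward_maximum_matching("abcde", lexicon))        # ['abc', 'd', 'e']
print(best_segmentation("abcde", count_cost(lexicon)))   # ['ab', 'cde']
probs = {"abcd": 0.001, "ab": 0.3, "cd": 0.3}
print(best_segmentation("abcd", unigram_cost(probs)))    # ['ab', 'cd']
```

In the first example the greedy matcher commits to "abc" and is then forced into two single characters, while the whole-sentence search finds the two-word segmentation. In the last example the likelihood criterion prefers two frequent words over one rare long word, which a pure word-count criterion would not.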

Keywords :

dynamic programming; maximum likelihood estimation; speech recognition; Chinese word segmentation; dynamic matching algorithm; language model; likely word segmentation; longest average word length; maximum likelihood algorithm; perplexity reduction; recognition error reduction; word sequence; heuristic algorithms; humans; natural languages; vocabulary

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Signal Processing, 2002 6th International Conference on

Print_ISBN :

0-7803-7488-6

Type :

conf

DOI :

10.1109/ICOSP.2002.1181093

Filename :

1181093

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=390491