  • DocumentCode
    390491
  • Title
    Maximum likelihood algorithm on Chinese word segmentation
  • Author
    Lo, Wing-Sze ; Wong, Hi-Fung ; Siu, Man-Hung
  • Author_Institution
    Dept. of Electr. & Electron. Eng., Hong Kong Univ. of Sci. & Technol., Kowloon, China
  • Volume
    1
  • fYear
    2002
  • fDate
    26-30 Aug. 2002
  • Firstpage
    468
  • Abstract
    A Chinese sentence is typically written as a sequence of characters. However, a word is a logical semantic and syntactic unit. Thus, a segmentation algorithm is necessary to map the sequence of characters into a sequence of words. Forward maximum matching, which tries to find the longest words to match the characters in the sentence, is one of the most popular methods because of its simplicity and efficiency. However, because it makes each decision by finding the longest next word without regard to the whole sentence, it is not optimal. In this paper, we propose two new segmentation algorithms: the dynamic matching algorithm and the maximum likelihood segmentation algorithm. In the dynamic matching algorithm, dynamic programming is used to look for the best segmentation (longest average word length) for the whole sentence. In the maximum likelihood algorithm, we aim at obtaining the most likely word segmentation given a particular language model. Because it uses the maximum likelihood criterion, this algorithm is also guaranteed to give the best perplexity across different segmentations. While both algorithms yield limited gains in terms of perplexity reduction, both give a significant reduction in recognition error on the 863 corpus.
  • Keywords
    dynamic programming; maximum likelihood estimation; speech recognition; Chinese word segmentation; dynamic matching algorithm; language model; likely word segmentation; longest average word length; maximum likelihood algorithm; perplexity reduction; recognition error reduction; word sequence; Heuristic algorithms; Humans; Natural languages; Vocabulary
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Signal Processing, 2002 6th International Conference on
  • Print_ISBN
    0-7803-7488-6
  • Type
    conf
  • DOI
    10.1109/ICOSP.2002.1181093
  • Filename
    1181093
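
The three segmentation strategies named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `max_len` limit, the fallback that treats any single character as a word, the `unk` penalty, and the use of a unigram language model are all assumptions made for the example (the abstract does not say which language model is used). The dynamic matching sketch minimizes the number of words, which is equivalent to maximizing average word length because the number of characters in the sentence is fixed.

```python
import math


def forward_maximum_matching(sentence, lexicon, max_len=4):
    """Greedy left-to-right: always take the longest lexicon word available."""
    words, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + n]
            # Assumption: a single character is always an acceptable word.
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words


def _trace(sentence, back):
    """Recover the word sequence from back-pointers."""
    words, j = [], len(sentence)
    while j > 0:
        words.append(sentence[back[j]:j])
        j = back[j]
    return words[::-1]


def dynamic_matching(sentence, lexicon, max_len=4):
    """Dynamic programming over the whole sentence: fewest words,
    i.e. longest average word length."""
    n = len(sentence)
    best = [0] + [math.inf] * n   # best[j]: min word count for sentence[:j]
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            cand = sentence[i:j]
            if (j - i == 1 or cand in lexicon) and best[i] + 1 < best[j]:
                best[j], back[j] = best[i] + 1, i
    return _trace(sentence, back)


def ml_segmentation(sentence, logprob, max_len=4, unk=-20.0):
    """Most likely segmentation under an assumed unigram model.
    logprob maps word -> log probability; unk is a hypothetical
    penalty for single characters missing from the model."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n  # best[j]: max log-likelihood of sentence[:j]
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            cand = sentence[i:j]
            if cand in logprob:
                lp = logprob[cand]
            elif j - i == 1:
                lp = unk
            else:
                continue
            if best[i] + lp > best[j]:
                best[j], back[j] = best[i] + lp, i
    return _trace(sentence, back)
```

With a toy lexicon {"ab", "abc", "cde"} and the input "abcde" (Latin letters standing in for Chinese characters), the greedy matcher commits to "abc" and is then forced to emit "d" and "e" separately, whereas the dynamic programming search finds "ab" + "cde". The maximum likelihood variant can additionally prefer two frequent short words over one rare long word, which the word-count criterion cannot do.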