• DocumentCode
    1798839
  • Title
    Adaptive compression-based models of Chinese text
  • Author
    Teahan, William J. ; Peiliang Wu ; Wei Liu
  • Author_Institution
    Sch. of Comput. Sci., Bangor Univ., Bangor, UK
  • fYear
    2014
  • fDate
    7-9 July 2014
  • Firstpage
    874
  • Lastpage
    881
  • Abstract
    Large-alphabet languages such as Chinese present different problems for language modelling than small-alphabet languages such as English. In this paper, we describe adaptive models of Chinese text based on the Prediction by Partial Matching (PPM) text compression scheme, which learns the language as the text is processed sequentially. We describe several character-based, word-based and part-of-speech (POS) based variants of PPM that achieve significant improvements in compression rate over existing models. Interestingly, the results for Chinese text contrast with those achieved for English text: character-based models outperform the word-based and POS-based models, rather than the reverse. We then explore how well these models perform at the task of Chinese word segmentation.
  • Keywords
    data compression; natural language processing; text analysis; Chinese text; Chinese word segmentation; English text; adaptive compression-based model; character-based variants; part-of-speech based variants; partial predictive match text compression scheme; word-based variants; Adaptation models; Context; Context modeling; Encoding; Hidden Markov models; Natural language processing; Predictive models
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Audio, Language and Image Processing (ICALIP), 2014 International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-4799-3902-2
  • Type
    conf
  • DOI
    10.1109/ICALIP.2014.7009920
  • Filename
    7009920