• DocumentCode
    3384031
  • Title

    Estimating and comparing entropies across written natural languages using PPM compression

  • Author

    Behr, F. ; Fossum, V. ; Mitzenmacher, M. ; Xiao, D.

  • fYear
    2003
  • fDate
    25-27 March 2003
  • Firstpage
    416
  • Abstract
    Summary form only given. The measurement of the entropy of written English is extended to the following written natural languages: Arabic, Chinese, French, Japanese, Korean, Russian, and Spanish. Translations of the same document were observed to have approximately the same size when compressed, even though their uncompressed sizes vary widely. The experiments used efficient compression algorithms (PPMD+, PPMZ, and BZIP2) to compress the given texts and compare the resulting sizes. Similar experiments were performed with machine translations. The findings suggest that compression can be used as a tool to find poor translations. The results of these experiments, while preliminary, support the hypothesis that translation preserves information content. This analysis opens directions for future research on the relationship between compression and translation.
  • Keywords
    data compression; entropy; language translation; linguistics; Arabic; BZIP2; Chinese; French; Japanese; Korean; PPM compression; PPMD+; PPMZ; Russian; Spanish; compression algorithm; entropy comparison; entropy estimation; information preservation; machine translation; text compression; translation errors; uncompressed sizes; written English entropy; written natural languages; natural languages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 2003. Proceedings. DCC 2003
  • Conference_Location
    Snowbird, UT, USA
  • ISSN
    1068-0314
  • Print_ISBN
    0-7695-1896-6
  • Type

    conf

  • DOI
    10.1109/DCC.2003.1194035
  • Filename
    1194035
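
The size comparison described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: it uses only BZIP2 (through Python's standard `bz2` module) because PPMD+ and PPMZ are not in the standard library, it assumes UTF-8 encoding for every language, and the helper names `compressed_size` and `compare_sizes` are hypothetical. The choice of character encoding affects the uncompressed sizes, so it should be treated as an experimental parameter.

```python
import bz2
from typing import Dict, Tuple


def compressed_size(text: str, encoding: str = "utf-8") -> int:
    """Return the size in bytes of the BZIP2-compressed text."""
    return len(bz2.compress(text.encode(encoding)))


def compare_sizes(
    translations: Dict[str, str], encoding: str = "utf-8"
) -> Dict[str, Tuple[int, int]]:
    """Map each language to (uncompressed bytes, compressed bytes).

    `translations` maps a language name to the full text of the same
    document in that language (an assumed input format).
    """
    return {
        language: (len(text.encode(encoding)), compressed_size(text, encoding))
        for language, text in translations.items()
    }
```

To use it, load each translation of the same document into a dictionary keyed by language and pass it to `compare_sizes`. Very short inputs are dominated by the compressor's fixed overhead, so the comparison is only meaningful for texts of substantial length.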