Compression of multilingual aligned texts

Author

Conley, Ehud S. ; Klein, Shmuel T.

Author_Institution

Dept. of Comput. Sci., Bar-Ilan Univ., Ramat-Gan

fYear

2006

fDate

28-30 March 2006

Lastpage

442

Abstract

Summary form only given. Multilingual text compression depends primarily on the ability to match corresponding parts of related texts by identifying semantic correspondences across the sub-texts, a task generally referred to as text alignment. Storage space can be saved by replacing words and phrases with pointers to their translations, as determined by any alignment algorithm. The suggested method was tested on an English-French corpus of the European Union, with the French part compressed using pointers to the English part. The obtained compression rate (22.0%) is similar to that of Bzip and HuffWord and better than that of Gzip. However, the performance of Bzip and Gzip degrades when small sub-sections are processed separately, which makes them inappropriate for systems which often decode only small pieces.
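
The pointer-replacement idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the token-level alignment map, the bilingual lexicon, the function names encode and decode, and the example sentence are all assumptions made for illustration, and no entropy coding of the pointers is attempted, so the sketch shows only the substitution step and not any actual space saving.

```python
# Toy sketch (assumed names and data): replace French tokens with pointers
# to aligned English tokens whenever a lexicon can regenerate the French word.

def encode(target_tokens, source_tokens, alignment, lexicon):
    """Return a list of ("ptr", source_index) or ("lit", token) items."""
    encoded = []
    for j, token in enumerate(target_tokens):
        i = alignment.get(j)  # index of the aligned source token, if any
        if i is not None and lexicon.get(source_tokens[i]) == token:
            encoded.append(("ptr", i))      # translation is recoverable
        else:
            encoded.append(("lit", token))  # fall back to the literal word
    return encoded


def decode(encoded, source_tokens, lexicon):
    """Rebuild the target tokens from pointers and literals."""
    decoded = []
    for kind, value in encoded:
        if kind == "ptr":
            decoded.append(lexicon[source_tokens[value]])
        else:
            decoded.append(value)
    return decoded


# Hypothetical example data, not taken from the corpus used in the paper.
english = ["the", "european", "parliament"]
french = ["le", "parlement", "européen", "siège"]
alignment = {0: 0, 1: 2, 2: 1}  # French position -> English position
lexicon = {"the": "le", "european": "européen", "parliament": "parlement"}

encoded = encode(french, english, alignment, lexicon)
restored = decode(encoded, english, lexicon)
```

In this sketch the unaligned word "siège" is stored as a literal, while the other three French tokens become pointers into the English text. Because each target segment depends only on its aligned source segment, small pieces can be decoded independently, which is the property the abstract contrasts with Bzip and Gzip.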

Keywords

data compression; encoding; linguistics; English-French corpus; European Union; alignment algorithm; multilingual aligned text compression; storage space; Computer science; Concurrent computing; Data compression; Decoding; Degradation; Dictionaries; Encoding; Natural languages; Terminology; Testing

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Compression Conference, 2006. DCC 2006. Proceedings

Conference_Location

Snowbird, UT

ISSN

1068-0314

Print_ISBN

0-7695-2545-8

Type

conf

DOI

10.1109/DCC.2006.15

Filename

1607285