Title :
Compression of multilingual aligned texts
Author :
Conley, Ehud S. ; Klein, Shmuel T.
Author_Institution :
Dept. of Comput. Sci., Bar-Ilan Univ., Ramat-Gan
Abstract :
Summary form only given. Multilingual text compression depends primarily on the ability to match corresponding parts of related texts by identifying semantic correspondences across the various sub-texts, a task generally referred to as text alignment. Storage space can be saved by replacing words and phrases with pointers to their translations, as determined by any alignment algorithm. The suggested method was tested on an English-French corpus of the European Union, in which the French part was compressed using pointers to the English part. The obtained compression rate (22.0%) is comparable to that of Bzip and HuffWord and better than that of Gzip. However, the performance of Bzip and Gzip degrades when small sub-sections are processed separately, which makes them unsuitable for systems that often decode only small pieces.
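The pointer-replacement idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy lexicon, the word-level alignment map, and the function names `encode` and `decode` are all hypothetical, and a real system would also need a compact binary encoding of the pointers and literals.

```python
# Illustrative sketch only; every name and data item here is an assumption,
# not taken from the paper.

# Hypothetical toy bilingual lexicon (English -> French).
LEXICON = {
    "the": "le",
    "european": "européen",
    "parliament": "parlement",
    "votes": "vote",
}


def encode(source, target, alignment, lexicon):
    """Replace each target word with a pointer to its aligned source word
    when the lexicon can regenerate it; otherwise keep it as a literal."""
    encoded = []
    for j, word in enumerate(target):
        i = alignment.get(j)  # index of the aligned source word, if any
        if i is not None and lexicon.get(source[i]) == word:
            encoded.append(("ptr", i))
        else:
            encoded.append(("lit", word))
    return encoded


def decode(source, encoded, lexicon):
    """Rebuild the target text from the source text and the encoded stream."""
    return [lexicon[source[v]] if kind == "ptr" else v for kind, v in encoded]


english = ["the", "european", "parliament", "votes"]
french = ["le", "parlement", "européen", "vote", "aujourd'hui"]
# Assumed output of some alignment algorithm: target index -> source index.
alignment = {0: 0, 1: 2, 2: 1, 3: 3}

encoded = encode(english, french, alignment, LEXICON)
decoded = decode(english, encoded, LEXICON)
```

In this sketch, decoding a French sentence needs only the aligned English sentence and the lexicon, not any previously decoded French text, which is consistent with the abstract's point about systems that decode only small pieces.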
Keywords :
data compression; encoding; linguistics; English-French corpus; European Union; alignment algorithm; multilingual aligned text compression; storage space; Computer science; Concurrent computing; Data compression; Decoding; Degradation; Dictionaries; Encoding; Natural languages; Terminology; Testing
Conference_Title :
Data Compression Conference, 2006. DCC 2006. Proceedings
Conference_Location :
Snowbird, UT
Print_ISBN :
0-7695-2545-8
DOI :
10.1109/DCC.2006.15