DocumentCode :
1596424
Title :
On the Use of Word Alignments to Enhance Bitext Compression
Author :
Martinez-Prieto, M.A. ; Adiego, Joaquin ; Sanchez-Martinez, Felipe ; Fuente, P. ; Carrasco, Rafael C.
Author_Institution :
Depto. de Inf., Univ. de Valladolid, Valladolid
fYear :
2009
Firstpage :
459
Lastpage :
459
Abstract :
The amount of information that is stored in digital form in more than one language is growing very fast as a consequence of globalization. Furthermore, there are countries and supra-national entities whose legislation enforces the translation (and storage) of all official texts into all their official languages. Two texts that are mutual translations are usually referred to as a bilingual parallel corpus or, in short, as a bitext. Compressing the two texts of a bitext independently is far from efficient, since the information conveyed by both texts, the meaning, is similar. We take advantage of this fact to devise a bitext compression algorithm that compresses and stores the two texts that form a bitext simultaneously. In our approach, a single model is used to represent both bitext components. For this purpose, we define a biword as a pair made of two words, each one from a different text, that are mutual translations in the bitext. This new concept allows one to represent two words with high mutual information using a single symbol.
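The biword idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names `to_biwords` and `order0_bits`, the toy sentences, and the one-to-one alignment format are all assumptions made for illustration, and an order-0 entropy estimate stands in for a real compressor such as PPM. The sketch also does not record where unaligned right-hand words occur, so it is not a reversible encoding.

```python
from collections import Counter
from math import log2


def to_biwords(left, right, alignment):
    """Pair each left word with its aligned right word (hypothetical helper).

    `alignment` is a list of (left_index, right_index) pairs, assumed to be
    one-to-one. Words without a partner are paired with the empty string.
    """
    left_to_right = dict(alignment)
    used_right = set(left_to_right.values())
    biwords = [
        (word, right[left_to_right[i]] if i in left_to_right else "")
        for i, word in enumerate(left)
    ]
    # Right-hand words that no left word points to (their position is lost here).
    biwords += [("", right[j]) for j in range(len(right)) if j not in used_right]
    return biwords


def order0_bits(symbols):
    """Total bits needed under an order-0 (frequency-only) model of the symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c * log2(c / total) for c in counts.values())


# Toy bitext: each English word has exactly one Spanish counterpart.
left = "the house the house".split()
right = "la casa la casa".split()
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]

biwords = to_biwords(left, right, alignment)
separate = order0_bits(left) + order0_bits(right)
joint = order0_bits(biwords)
print(biwords[:2], separate, joint)
```

In this toy case the joint biword stream needs half the bits of the two word streams modelled separately, because each right-hand word is fully predictable from its left-hand partner. Real bitexts are far less regular, so the gain depends on alignment quality.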
Keywords :
natural language processing; text analysis; bilingual parallel corpus; bitext component; bitext compression algorithm; biword; digital form; globalization; legislation enforces; mutual information; mutual translation; official language; supra-national entities; word alignments; Compression algorithms; Compressors; Data compression; Data preprocessing; Dictionaries; Globalization; Legislation; Mutual information; Pipelines; Bitext compression; Biword; PPM; Word alignment;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 2009. DCC '09.
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
978-1-4244-3753-5
Type :
conf
DOI :
10.1109/DCC.2009.22
Filename :
4976513