Title :
Modelling Parallel Texts for Boosting Compression
Author :
Adiego, Joaquín ; Martinez-Prieto, M.A. ; Hoyos-Torio, Javier E ; Sanchez-Martinez, Felipe
Author_Institution :
Dept. de Inf., Univ. de Valladolid, Valladolid, Spain
Abstract :
Bilingual parallel corpora, also known as bitexts, convey the same information in two different languages. This implies that to model a bitext we can take advantage of the translation relationship that exists between the two texts; the text alignment task makes it possible to establish such a translation relationship. A biword is defined as a pair of words, each from a different text, that are mutual translations in the bitext; the use of biwords allows both texts in the bitext to be represented in a single model. Several biword-based schemes have been proposed, leading to good compression ratios. Bearing in mind Melamed's affirmation that "the translation of a text into another language can be viewed as a detailed annotation of what that text means", we propose a new model for bitexts, dubbed MAR, that agrees with this affirmation. The idea is to represent the words in the right text with respect to the preceding word in the left text; thus, a first-order model based on alignment relationships is proposed.
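The idea of coding right-text words relative to left-text words can be illustrated with a toy calculation. The sketch below is not the authors' MAR implementation: it assumes a hand-made list of biwords (hypothetical data), assumes each right word is conditioned on the left word it is aligned with, and compares the empirical cost in bits of coding the right words independently against coding them given their left word. The function names `order0_bits` and `conditioned_bits` are illustrative only, and the cost of transmitting the model itself is ignored.

```python
import math
from collections import Counter, defaultdict


def order0_bits(symbols):
    """Empirical cost in bits of coding `symbols` with an order-0 model."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c * math.log2(c / total) for c in counts.values())


def conditioned_bits(pairs):
    """Empirical cost in bits of coding each right word given its left word.

    Assumption: every (left, right) pair is a biword produced by some
    word aligner; one order-0 model is kept per distinct left word.
    """
    by_left = defaultdict(list)
    for left, right in pairs:
        by_left[left].append(right)
    return sum(order0_bits(rights) for rights in by_left.values())


# Hypothetical toy alignment: (left word, right word) biwords.
biwords = [
    ("la", "the"), ("casa", "house"), ("verde", "green"),
    ("la", "the"), ("casa", "home"), ("verde", "green"),
    ("la", "the"), ("casa", "house"),
]

right_words = [right for _, right in biwords]
plain = order0_bits(right_words)
conditioned = conditioned_bits(biwords)

print(f"order-0 cost of right text: {plain:.2f} bits")
print(f"cost given aligned left word: {conditioned:.2f} bits")
```

On this toy input the conditioned cost is lower because most left words have a single translation; only the left word with two translations contributes any bits. Real bitexts also contain unaligned words and one-to-many alignments, which this sketch does not handle.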
Keywords :
data compression; text analysis; bilingual parallel corpora; bitext; biword-based scheme; boosting compression; parallel text; text alignment task; Boosting; Data compression; Dictionaries; Information retrieval; Bitext Compression; Compression Boosting; PPM;
Conference_Title :
Data Compression Conference (DCC), 2010
Conference_Location :
Snowbird, UT
Print_ISBN :
978-1-4244-6425-8
Electronic_ISBN :
1068-0314
DOI :
10.1109/DCC.2010.86