Title of article :
Natural Language Compression on Edge-Guided text preprocessing
Author/Authors :
Miguel A. Mart?nez-Prieto، نويسنده , , Joaqu?n Adiego، نويسنده , , Pablo de la Fuente، نويسنده ,
Issue Information :
روزنامه با شماره پیاپی سال 2011
Pages :
25
From page :
5387
To page :
5411
Abstract :
This paper presents Edge-Guided (E-G), an optimized text preprocessing technique for compression purposes. It transforms the original text into a word net, which stores all relationships between adjoining words. A specific directed graph is proposed to model this transformation: words are stored in vertices, whereas edges represent word transitions. Thus, the word net has a text representation which reflects the natural word order in the text, so it can be used directly for encoding purposes. A specific coding scheme is described at the top of the word net. It regards a text as a sequence of word transitions, in such a way that each word is encoded by traversing a specific edge from the vertex which stores the preceding word. This accomplishes a 1-order text preprocessing whose output is an intermediate byte representation that can be effectively encoded with universal techniques. This technique is called E-G1 and performs on some variants. This experience is used to revisit the concept of word net. It is used to identify significative 2-word symbols by performing a specific transformation on frequent edges. The resultant transformed word net appends these 2-word symbols to the original word vocabulary, and allows a specific hierarchical relationship between them and their component words. The transformed approach also enhances the original coding scheme to handle these new features. A new technique, called E-G2, approximates a 2-order model on words that also support specific variants. Both techniques are studied from empirical and experimental perspectives. Some compressors are also used to analyze the preprocessing ability of E-G with respect to different compression approaches. Competitive space/time trade-offs are achieved when our approaches are used to compress medium-large size texts. The best results are achieved when E-G preprocessing is coupled with high-order compressors such as Prediction by Partial Matching (PPM).
Keywords :
Natural Language Compression , Text preprocessing , k-Order modeling , Edge-Guided
Journal title :
Information Sciences
Serial Year :
2011
Journal title :
Information Sciences
Record number :
1214778
Link To Document :
بازگشت