Regional Information Center for Science and Technology - Compact in-memory models for compression of large text databases

DocumentCode :

3205199

Title :

Compact in-memory models for compression of large text databases

Author :

Zobel, Justin ; Williams, Hugh E.

Author_Institution :

Department of Computer Science, Royal Melbourne Institute of Technology, Victoria, Australia

fYear :

1999

fDate :

1999

Firstpage :

224

Lastpage :

231

Abstract :

For compression of text databases, semi-static word-based models are a pragmatic choice. Previous experiments have shown that, where there is not sufficient memory to store a full word-based model, encoding rare words as sequences of characters can still allow good compression, while a pure character-based model is poor. We propose a further kind of model that reduces main memory costs: approximate models, in which rare words are represented by similarly spelt common words and a sequence of edits. We investigate the compression available with different models, including characters, words, word pairs, and edits, and with combinations of these approaches. We show experimentally that carefully chosen combinations of models can improve the compression available in limited memory and greatly reduce overall memory requirements.
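The approximate-model idea in the abstract (a rare word stored as a similarly spelt common word plus a sequence of edits) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cost measure, and the use of Python's `difflib.SequenceMatcher` to derive the edits are all assumptions made for illustration, and no actual entropy coding of the output is shown.

```python
from difflib import SequenceMatcher


def edit_script(base, target):
    """Return the edits that turn `base` into `target`.

    Each edit is (start, end, replacement), with positions in `base`.
    Unchanged runs are omitted, so similar words give short scripts.
    """
    ops = SequenceMatcher(None, base, target, autojunk=False).get_opcodes()
    return [(i1, i2, target[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_edits(base, edits):
    """Rebuild the rare word from its base word and edit script."""
    out, pos = [], 0
    for start, end, replacement in edits:
        out.append(base[pos:start])   # copy the unchanged run
        out.append(replacement)       # substitute the edited span
        pos = end
    out.append(base[pos:])            # copy the unchanged tail
    return "".join(out)


def encode_rare(word, common_words):
    """Pick the common word whose edit script touches the fewest characters.

    The cost measure (characters deleted plus characters inserted) is a
    hypothetical stand-in for a real coding cost.
    """
    def cost(base):
        return sum((end - start) + len(repl)
                   for start, end, repl in edit_script(base, word))

    base = min(common_words, key=cost)
    return base, edit_script(base, word)


# Hypothetical in-memory vocabulary of common words.
common = ["compress", "model", "memory"]
base, edits = encode_rare("compressor", common)
restored = apply_edits(base, edits)
```

In this sketch only the common words need to be held in memory; a rare word such as "compressor" is emitted as a reference to "compress" plus one insertion, and the decoder reverses the step with `apply_edits`.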

Keywords :

data compression; database management systems; information retrieval; word processing; approximate models; character based model; compact in-memory models; full word based model; large text database compression; main memory costs; overall memory requirements; rare words; semi-static word based models; similarly spelt common words; word pairs; Compression algorithms; Computer science; Costs; Databases; Encoding; Information retrieval; Query processing; Radio spectrum management

fLanguage :

English

Publisher :

IEEE

Conference_Title :

String Processing and Information Retrieval Symposium, 1999 and International Workshop on Groupware

Conference_Location :

Cancun

Print_ISBN :

0-7695-0268-7

Type :

conf

DOI :

10.1109/SPIRE.1999.796599

Filename :

796599

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3205199