Regional Information Center for Science and Technology - Sequence of Hashes Compression in Data De-duplication

Abstract:

Data de-duplication is a simple compression method, popular in storage archival and backup, that partitions large data objects (files) into smaller parts (called chunks) and, for the purpose of communication or storage, replaces each chunk by its ID, generally a cryptographic hash such as SHA-1 of the chunk data [A. Muthitacharoen et al., 2001], [D.R. Bobbarjung et al., 2006]. The compression ratio achieved by de-duplication can be improved by (1) increasing the likelihood of matching new chunks against the dictionary (archived) chunks and/or (2) compressing the list of hashes (indexes, of 20 bytes each). Using smaller chunk sizes increases the chance of matching, but many more hashes are generated. The chunk repository is a hash table in which each entry stores the SHA-1 value of the chunk and the chunk data. In addition, with each newly created entry we store a chronological pointer linking it to the next new entry. When the hashes produced by the chunker follow the chronological pointers, we encode them as a sequence of hashes by specifying the first hash in the sequence and the length of the sequence; when the same hash is generated repeatedly, we encode it as a run of hashes by specifying its value and the number of repeated occurrences. The usefulness of the chronological pointers derives from the insight that, when successive versions of a file or set of files are archived, large contiguous areas remain unchanged between versions, and the chronological pointers are predictors of this contiguity. If the contiguity is broken, there is a small loss in the hash sequence compression.
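
The scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `ChunkRepository`, `Entry`, `encode_hashes`, and `decode_tokens`, the `("seq", hash, length)` / `("run", hash, count)` token layout, and the rule of trying a run before a sequence are all assumptions made for illustration. The abstract does not specify the chunker or the on-disk format, so chunks are passed in directly as byte strings.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, bytes, int]  # hypothetical layout: (kind, first hash, count)


@dataclass
class Entry:
    data: bytes
    # Chronological pointer: digest of the next entry created after this one.
    next_hash: Optional[bytes] = None


class ChunkRepository:
    """Hash table keyed by SHA-1 digest, with chronological pointers."""

    def __init__(self) -> None:
        self.table: Dict[bytes, Entry] = {}
        self.last_new: Optional[bytes] = None  # most recently created entry

    def add(self, chunk: bytes) -> bytes:
        digest = hashlib.sha1(chunk).digest()  # 20-byte chunk ID
        if digest not in self.table:
            self.table[digest] = Entry(chunk)
            if self.last_new is not None:
                # Link the previous new entry to this one.
                self.table[self.last_new].next_hash = digest
            self.last_new = digest
        return digest


def encode_hashes(repo: ChunkRepository, hashes: List[bytes]) -> List[Token]:
    """Collapse a hash list into runs and pointer-following sequences."""
    tokens: List[Token] = []
    i, n = 0, len(hashes)
    while i < n:
        first = hashes[i]
        # Case 1: the same hash repeated -> run of hashes.
        j = i + 1
        while j < n and hashes[j] == first:
            j += 1
        if j - i > 1:
            tokens.append(("run", first, j - i))
            i = j
            continue
        # Case 2: hashes that follow the chronological pointers -> sequence.
        cur, j = first, i + 1
        while j < n and repo.table[cur].next_hash == hashes[j]:
            cur = hashes[j]
            j += 1
        tokens.append(("seq", first, j - i))  # length 1 if contiguity breaks
        i = j
    return tokens


def decode_tokens(repo: ChunkRepository, tokens: List[Token]) -> List[bytes]:
    """Expand tokens back into the original hash list."""
    hashes: List[bytes] = []
    for kind, first, count in tokens:
        if kind == "run":
            hashes.extend([first] * count)
        else:
            cur: Optional[bytes] = first
            for _ in range(count):
                hashes.append(cur)
                cur = repo.table[cur].next_hash
    return hashes
```

In this sketch, archiving a first version with chunks `a, b, c, d` and then a second version `a, b, c, x, x, x, d` yields three tokens for the second version: a sequence of length 3 starting at `a`, a run of 3 for `x`, and a sequence of length 1 for `d`. The last token shows the small loss that occurs when contiguity is broken: a lone hash still costs a full token.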

Keywords:

cryptography; data compression; chronological pointer linking; chunk data; chunk repository; cryptographic hash; data de-duplication; hash sequence compression ratio; hash table; storage archival; dictionaries; joining processes; operating systems; cryptographic hashes compression