Author_Institution :
California Univ., Berkeley, CA, USA
Abstract :
Summary form only given. In the problem we consider, one party (the “receiver”) seeks to obtain a copy of a large file or unstructured database from another party (the “source”) over a communication link which is slow or expensive relative to the computing power available at each end. The communication link is bi-directional, and error-free in both directions, although it may have delay. The receiver has an outdated version of the source file. Although there may have been many changes between these two large files, they are still substantially the same, in the sense that many large portions of the file were unaffected by the changes. For ease of exposition, we treat the file as a single long sequence of n symbols. The goal is an algorithm which ends when the receiver has constructed an identical copy of the source file, and which reaches this state after transmitting an amount of data proportional only to the sum of the sizes of the changes, and some terms proportional to log n. In other words, when n is very large, the amount of data communicated should be negligible compared to n, assuming reasonably large agreement between the two versions of the file. We achieve this goal. Cyclic redundancy checks (CRCs) play an important role in this algorithm. The least significant byte of each CRC is called a token. The novel notion in this paper is a very simple new data structure, which we call a data-punctuated token tree. For each of the two files, this tree is constructed independently, starting from the leaves and ending with the root. When completed, each node in this tree contains a token on an appropriate subsequence of the data and a punctuation bit. The punctuation bits, which are derived from the data, define the structure of the tree
Keywords :
data communication; information theory; redundancy; sequences; tree data structures; algorithm; communication link; cyclic redundancy checks; data structure; data-punctuated token trees; receiver; source file; symbol sequence; Bidirectional control; Cyclic redundancy check; Data structures; Databases; Delay; Random variables; Statistics; Tail; Transmitters; Tree data structures;