Author : 
Tuncel, Ertem ; Koulgi, Prashant ; Rose, Kenneth
         
        
            Author_Institution : 
Dept. of Electr. Eng., Univ. of California, Riverside, CA, USA
         
        
            Abstract : 
This paper investigates the relationship between rate-distortion theory and efficient content-based data retrieval from high-dimensional databases. We consider database design as the encoding of a data object sequence, and retrieval from the database as the decoding of the sequence using side information (i.e., the query) available only at the decoder. We show that, in this setting, the optimal asymptotic tradeoff between the search time Rs (bits per data object read from the storage device) and the expected search accuracy Ds (relevance of the retrieved data set) is given by the Wyner-Ziv solution with a side-information-dependent distortion measure. Moreover, the data indexing and retrieval problem is, in general, inseparable from the data compression problem. Data items selected by the search procedure, which can be stored in the disk with a limited total rate of Rr ≥ Rs, need to be presented at a prescribed expected reconstruction quality Dr. This is, hence, a problem of scalable source coding or successive refinement, albeit with differing layer distortion measures to quantify search and reconstruction quality, respectively. We derive a single-letter characterization of all achievable quadruples {Rs,Rr,Ds,Dr}, and prove conditions for "successive refinability" without rate loss. Finally, we show that the special case Ds=Dr=0 is nontrivial and of practical interest in this context, as it can impose "acceptable" search and reconstruction qualities for each individual data item and for the entire query space with high probability, in contradistinction with standard average distortion requirements. The region of achievable {Rs,Rr} is obtained by adapting Rimoldi\´s characterization to a new regular scalable coding problem.
         
        
            Keywords : 
content-based retrieval; data compression; database indexing; distortion measurement; rate distortion theory; source coding; Rimoldis characterization; Wyner-Ziv solution; approximate similarity searching; content-based data retrieval; data indexing; data object sequence encoding; decoder; high-dimensional databases; query space probability; rate-distortion theory; scalable source coding; side-information-dependent distortion measure; single-letter characterization; successive refinement; Content based retrieval; Data compression; Databases; Decoding; Distortion measurement; Encoding; Indexing; Information retrieval; Rate-distortion; Time measurement; Approximate similarity searching; Wyner–Ziv problem; content-based retrieval; databases; scalable coding; successive refinability without rate loss; zero–one distortion measures;