Title :
Efficient similarity queries via lossy compression
Author :
Ochoa, Idoia ; Ingber, Amir ; Weissman, Tsachy
Author_Institution :
Dept. of Electr. Eng., Stanford Univ., Stanford, CA, USA
Abstract :
The generation of new databases and the amount of data on existing ones is growing exponentially. As a result, executing queries on large databases is becoming a timely and challenging task. With this in mind, we study the problem of compressing sequences in a database so that similarity queries can be performed efficiently on the compressed database. The fundamental limits of this problem characterize the tradeoff between compression rate and the reliability of the queries performed on the compressed data. While those asymptotic limits have been studied and characterized in past work, how to approach these limits in practice has remained largely unexplored. In this paper, we propose an approach to this task, based in part on existing lossy compression algorithms. Specifically, we consider queries of the form: “which sequences in the database are similar to a given sequence y?”. For the case where similarity between sequences is measured via Hamming distortion, we construct schemes whose performance is close to the fundamental limits. Furthermore, we test our scheme on a sample database of real DNA sequences, and show significant compression while still allowing highly reliable query answers.
Keywords :
DNA; bioinformatics; data compression; database management systems; query processing; DNA sequences; Hamming distortion; compressed database; compression rate; efficient similarity queries; lossy compression; reliable query answers; sequence compression; Compression algorithms; Computer architecture; DNA; Databases; Joints; Reliability; Simulation;
Conference_Titel :
Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference on
Conference_Location :
Monticello, IL
Print_ISBN :
978-1-4799-3409-6
DOI :
10.1109/Allerton.2013.6736618