مرکز منطقه ای اطلاع رساني علوم و فناوري - Horizontal Reduction: Instance-Level Dimensionality Reduction for Similarity Search in Large Document Databases

DocumentCode :

2457977

Title :

Horizontal Reduction: Instance-Level Dimensionality Reduction for Similarity Search in Large Document Databases

Author :

Kim, Min Soo ; Whang, Kyu-Young ; Moon, Yang-Sae

Author_Institution :

Dept. of Comput. Sci., Korea Adv. Inst. of Sci. & Technol. (KAIST), Daejeon, South Korea

fYear :

2012

fDate :

1-5 April 2012

Firstpage :

1061

Lastpage :

1072

Abstract :

Dimensionality reduction is essential in text mining since the dimensionality of text documents could easily reach several tens of thousands. Most recent efforts on dimensionality reduction, however, are not adequate to large document databases due to lack of scalability. We hence propose a new type of simple but effective dimensionality reduction, called horizontal (dimensionality) reduction, for large document databases. Horizontal reduction converts each text document to a few bitmap vectors and provides tight lower bounds of inter-document distances using those bitmap vectors. Bitmap representation is very simple and extremely fast, and its instance-based nature makes it suitable for large and dynamic document databases. Using the proposed horizontal reduction, we develop an efficient k-nearest neighbor (k-NN) search algorithm for text mining such as classification and clustering, and we formally prove its correctness. The proposed algorithm decreases I/O and CPU overheads simultaneously since horizontal reduction (1) reduces the number of accesses to documents significantly by exploiting the bitmap-based lower bounds in filtering dissimilar documents at an early stage, and accordingly, (2) decreases the number of CPU-intensive computations for obtaining a real distance between high-dimensional document vectors. Extensive experimental results show that horizontal reduction improves the performance of the reduction (preprocessing) process by one to two orders of magnitude compared with existing reduction techniques, and our k-NN search algorithm significantly outperforms the existing ones by one to three orders of magnitude.

Keywords :

data mining; database management systems; pattern clustering; text analysis; Bitmap representation; bitmap vectors; horizontal reduction; instance-level dimensionality reduction; inter-document distances; k-NN search algorithm; k-nearest neighbor search algorithm; large document databases; similarity search; text documents; text mining; Complexity theory; Discrete Fourier transforms; Euclidean distance; Gold; Large scale integration; Text mining; Vectors; fuzzy system; query interface; user-centric application;

fLanguage :

English

Publisher :

ieee

Conference_Titel :

Data Engineering (ICDE), 2012 IEEE 28th International Conference on

Conference_Location :

Washington, DC

ISSN :

1063-6382

Print_ISBN :

978-1-4673-0042-1

Type :

conf

DOI :

10.1109/ICDE.2012.115

Filename :

6228156

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2457977