• DocumentCode
    974725
  • Title

    Dimensionality reduction and similarity computation by inner-product approximations

  • Author

    Egecioglu, Omer ; Ferhatosmanoglu, Hakan ; Ogras, Umit

  • Author_Institution
    Dept. of Comput. Sci., California Univ., Santa Barbara, CA, USA
  • Volume
    16
  • Issue
    6
  • fYear
    2004
  • fDate
    6/1/2004 12:00:00 AM
  • Firstpage
    714
  • Lastpage
    726
  • Abstract
    As databases increasingly integrate different types of information such as multimedia, spatial, time-series, and scientific data, it becomes necessary to support efficient retrieval of multidimensional data. Both the dimensionality and the amount of data that needs to be processed are increasing rapidly. Reducing the dimension of the feature vectors to enhance the performance of the underlying technique is a popular solution to the infamous curse of dimensionality. We expect the techniques to have good quality of distance measures when the similarity distance between two feature vectors is approximated by some notion of distance between two lower-dimensional transformed vectors. Thus, it is desirable to develop techniques resulting in accurate approximations to the original similarity distance. We investigate dimensionality reduction techniques that directly target minimizing the errors made in the approximations. In particular, we develop dynamic techniques for efficient and accurate approximation of similarity evaluations between high-dimensional vectors based on inner-product approximations. Inner-product, by itself, is used as a distance measure in a wide area of applications such as document databases. A first order approximation to the inner-product is obtained from the Cauchy-Schwarz inequality. We extend this idea to higher order power symmetric functions of the multidimensional points. We show how to compute fixed coefficients that work as universal weights based on the moments of the probability density function of the data set. We also develop a dynamic model to compute the universal coefficients for data sets whose distribution is not known. Our experiments on synthetic and real data sets show that the similarity between two objects in high-dimensional space can be accurately approximated by a significantly lower-dimensional representation.
  • Keywords
    distributed databases; statistical distributions; Cauchy-Schwarz inequality; dimensionality reduction; distributed databases; high-dimensional data; inner-product approximation; multidimensional data; similarity search; statistical distributions; Area measurement; Discrete Fourier transforms; Discrete cosine transforms; Discrete wavelet transforms; Fourier transforms; Information retrieval; Karhunen-Loeve transforms; Multidimensional systems; Multimedia databases; Spatial databases; 65; Inner-product approximation; dimensionality reduction; high-dimensional data.; p{hbox{-}}rm NORMS; similarity search;
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2004.9
  • Filename
    1294892