• DocumentCode
    2334028
  • Title
    On effective conceptual indexing and similarity search in text data
  • Author
    Aggarwal, Charu C. ; Yu, Philip S.
  • Author_Institution
    IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    3
  • Lastpage
    10
  • Abstract
    Similarity search in text has proven to be an interesting problem from the qualitative perspective because of inherent redundancies and ambiguities in textual descriptions. The methods used in search engines to retrieve the documents most similar to user-defined sets of keywords are not applicable to targets that are medium to large documents, because of even greater noise effects stemming from the presence of a large number of words unrelated to the overall topic of the document. Inverted representation is the dominant method for indexing text, but it is not as suitable for document-to-document similarity search as it is for short user queries. One way of improving the quality of similarity search is Latent Semantic Indexing (LSI), which maps the documents from the original set of words to a concept space. Unfortunately, LSI maps the data into a domain in which it is not possible to provide effective indexing techniques. The authors investigate new ways of providing conceptual search among documents by creating a representation in terms of conceptual word-chains. This technique also allows effective indexing techniques, so that similarity queries can be performed on large collections of documents by accessing a small amount of data. The authors demonstrate that the scheme outperforms standard textual similarity search on the inverted representation in terms of both quality and search efficiency.
  • Keywords
    indexing; query processing; search problems; text analysis; LSI; Latent Semantic Indexing; concept space; conceptual document search; conceptual indexing; conceptual word-chains; document retrieval; document-to-document similarity search; indexing techniques; inverted representation; keywords; noise effects; qualitative perspective; search engines; short user queries; similarity queries; similarity search; standard textual similarity search; text data; textual descriptions; user-defined sets; Indexing; Information retrieval; Intrusion detection; Large scale integration; Libraries; Recommender systems; Search engines; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
  • Conference_Location
    San Jose, CA
  • Print_ISBN
    0-7695-1119-8
  • Type
    conf
  • DOI
    10.1109/ICDM.2001.989494
  • Filename
    989494