• DocumentCode
    2392451
  • Title
    A novel method for detecting similar documents
  • Author
    Cooper, James W. ; Coden, Anni R. ; Brown, Eric W.
  • Author_Institution
    IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
  • fYear
    2002
  • fDate
    7-10 Jan. 2002
  • Firstpage
    1153
  • Lastpage
    1159
  • Abstract
    We describe a system for rapidly determining document similarity among a set of documents obtained from an information retrieval (IR) system. We obtain a ranked list of the most important terms in each document using a rapid phrase recognizer system. We store these terms in a database and compute document similarity using a simple database query. If the number of terms not shared by both documents is below a predetermined threshold relative to the total number of terms in the document, the documents are judged to be very similar.
  • Keywords
    information resources; query processing; relevance feedback; WWW; database query; document similarity; important terms; information retrieval system; ranked list; rapid phrase recognizer system; Character recognition; Databases; File servers; Information retrieval; Particle measurements; Plagiarism; Statistics; Text analysis; Text mining; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    System Sciences, 2002. HICSS. Proceedings of the 35th Annual Hawaii International Conference on
  • Print_ISBN
    0-7695-1435-9
  • Type
    conf
  • DOI
    10.1109/HICSS.2002.994037
  • Filename
    994037
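
The similarity test summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the table layout, the function names (build_db, unshared_fraction, very_similar), the default threshold of 0.1, and the choice to normalize by the larger of the two term counts are all assumptions made for illustration, since the abstract does not specify them. The phrase recognizer is also out of scope here, so each document's important terms are supplied directly.

```python
import sqlite3


def build_db(doc_terms):
    """Store each document's important terms in an in-memory SQLite table.

    doc_terms maps a document id to an iterable of terms (hypothetical input;
    the paper obtains these terms from a phrase recognizer).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_terms ("
        "doc_id TEXT, term TEXT, PRIMARY KEY (doc_id, term))"
    )
    rows = [
        (doc_id, term)
        for doc_id, terms in doc_terms.items()
        for term in set(terms)  # de-duplicate terms within a document
    ]
    conn.executemany("INSERT INTO doc_terms VALUES (?, ?)", rows)
    return conn


def unshared_fraction(conn, doc_a, doc_b):
    """Return (terms not in both documents) / (size of the larger term set).

    Normalizing by the larger term set is an assumption of this sketch.
    """
    count_sql = "SELECT COUNT(*) FROM doc_terms WHERE doc_id = ?"
    n_a = conn.execute(count_sql, (doc_a,)).fetchone()[0]
    n_b = conn.execute(count_sql, (doc_b,)).fetchone()[0]

    # Self-join on the term column counts the terms both documents contain.
    shared = conn.execute(
        "SELECT COUNT(*) FROM doc_terms a "
        "JOIN doc_terms b ON a.term = b.term "
        "WHERE a.doc_id = ? AND b.doc_id = ?",
        (doc_a, doc_b),
    ).fetchone()[0]

    unshared = (n_a - shared) + (n_b - shared)
    total = max(n_a, n_b)
    # With no terms at all there is nothing to compare; report "all unshared".
    return unshared / total if total else 1.0


def very_similar(conn, doc_a, doc_b, threshold=0.1):
    """Flag two documents as very similar when few terms are unshared."""
    return unshared_fraction(conn, doc_a, doc_b) < threshold
```

A caller would build the table once with build_db and then call very_similar for each candidate pair from the result list. A stricter or looser threshold changes how many differing terms are tolerated before two documents stop counting as near-duplicates.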