• DocumentCode
    3537648
  • Title

    A Global Dictionary Based Approach to Fast Similar Text Search in Document Repository

  • Author

    Park, Sun-Young ; Kim, SeonYeong ; Kim, Sung-Hwan ; Cho, Hwan-Gue

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Pusan Nat. Univ., Busan, South Korea
  • fYear
    2011
  • fDate
    Aug. 31 2011-Sept. 2 2011
  • Firstpage
    526
  • Lastpage
    532
  • Abstract
    Text plagiarism is growing rapidly with the development of the Internet, and many plagiarism detection algorithms have been proposed. However, most algorithms focus on optimized one-to-one comparison rather than massive document comparison. Such algorithms are limited in time performance when users conduct an exhaustive search on a huge set of documents. In this paper, we propose an optimized preprocessing model to detect similar text in massive document repositories. This model uses an efficient data structure called GDIC (Global Dictionary) for preprocessing. After filtering stop words, we choose the pairs of documents to be inspected using two methods at the same time. Both methods rely on the concept of a common non-stop word, but each uses it in a slightly different way: the first method chooses pairs of documents with a high frequency of common non-stop words, while the second method chooses pairs with a high proportion of common non-stop words. We evaluate the performance of the model experimentally. In our experiments, the proposed preprocessing model drastically reduced searching time to 64~87%, while the sensitivity stood at 77~96%. With this model, GDIC generation time accounts for a large proportion of the total detection time. In future work, we will optimize GDIC creation time to improve the performance of the entire system.
  • Keywords
    Internet; data structures; dictionaries; query formulation; text analysis; GDIC; data structure; document repository; global dictionary; similar text detection; text plagiarism; text search; Inspection; Plagiarism; Sensitivity; dictionary; information retrieval; text similarity
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer and Information Technology (CIT), 2011 IEEE 11th International Conference on
  • Conference_Location
    Pafos
  • Print_ISBN
    978-1-4577-0383-6
  • Electronic_ISBN
    978-0-7695-4388-8
  • Type

    conf

  • DOI
    10.1109/CIT.2011.76
  • Filename
    6036820
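
The pair-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the function names `build_gdic` and `candidate_pairs`, the thresholds `min_common` and `min_ratio`, the use of distinct shared words as the "frequency", the ratio against the smaller vocabulary as the "proportion", and the choice to take the union of the two methods are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Assumed, deliberately tiny stop-word list; the record does not specify one.
STOP_WORDS = frozenset({"the", "a", "an", "of", "in", "and", "to", "is"})


def build_gdic(docs):
    """Build a global dictionary: non-stop word -> set of ids of documents containing it."""
    gdic = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                gdic[word].add(doc_id)
    return gdic


def candidate_pairs(docs, min_common=3, min_ratio=0.5):
    """Select document pairs worth a detailed comparison.

    Method 1 (assumed reading): keep pairs sharing at least `min_common`
    distinct non-stop words.
    Method 2 (assumed reading): keep pairs whose shared words make up at
    least `min_ratio` of the smaller document's non-stop vocabulary.
    The union of both selections is returned (an assumption).
    """
    gdic = build_gdic(docs)
    vocab_size = defaultdict(int)  # distinct non-stop words per document
    common = defaultdict(int)      # shared distinct non-stop words per pair
    for doc_ids in gdic.values():
        for doc_id in doc_ids:
            vocab_size[doc_id] += 1
        for pair in combinations(sorted(doc_ids), 2):
            common[pair] += 1

    by_count = {pair for pair, c in common.items() if c >= min_common}
    by_ratio = {
        pair
        for pair, c in common.items()
        if c / min(vocab_size[pair[0]], vocab_size[pair[1]]) >= min_ratio
    }
    return by_count | by_ratio
```

Only pairs that share at least one dictionary entry are ever counted, so documents with no common non-stop word are never compared; this is the kind of saving a preprocessing filter is meant to provide before a detailed one-to-one comparison.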