DocumentCode
3537648
Title
A Global Dictionary Based Approach to Fast Similar Text Search in Document Repository
Author
Park, Sun-Young ; Kim, SeonYeong ; Kim, Sung-Hwan ; Cho, Hwan-Gue
Author_Institution
Dept. of Comput. Sci. & Eng., Pusan Nat. Univ., Busan, South Korea
fYear
2011
fDate
Aug. 31 2011-Sept. 2 2011
Firstpage
526
Lastpage
532
Abstract
Text plagiarism is growing rapidly with the development of the Internet, so many plagiarism detection algorithms have been proposed. However, most algorithms focus on optimized one-to-one comparison rather than massive document comparison. Such algorithms are limited in time performance when users conduct an exhaustive search over a huge set of documents. In this paper, we propose an optimized preprocessing model to detect similar text in massive document repositories. This model uses an efficient data structure called GDIC (Global Dictionary) for preprocessing. After filtering stop words, we choose the pairs of documents to be inspected using two methods simultaneously. Both rely on the concept of common non-stop words, but each uses it in a slightly different way: the first method chooses pairs of documents that share a high frequency of common non-stop words, while the second chooses pairs with a high proportion of common non-stop words. We experimentally demonstrate the performance of the model. In our experiments, the proposed preprocessing model drastically reduces searching time to 64~87%, while the sensitivity stands at 77~96%. With this model, GDIC generation time accounts for a large proportion of the total detection time. In future work, we will optimize GDIC creation time to improve the performance of the entire system.
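The preprocessing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the GDIC layout, the stop-word list, or the selection thresholds, so the inverted-index structure, the toy `STOP_WORDS` set, the names `build_global_dictionary` and `candidate_pairs`, and the parameters `min_common` and `min_ratio` are all assumptions made for illustration. "Frequency" is interpreted here as the number of distinct shared non-stop words, and "proportion" as that number divided by the vocabulary size of the smaller document.

```python
import re
from collections import defaultdict
from itertools import combinations

# Toy stop-word list (assumption; the paper's list is not given in the abstract).
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "is", "to", "and", "on"})


def tokenize(text):
    """Return the set of distinct non-stop words in a text."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def build_global_dictionary(docs):
    """Map every non-stop word to the ids of the documents containing it.

    This inverted index stands in for the GDIC; the real structure may differ.
    """
    gdic = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            gdic[word].add(doc_id)
    return gdic


def candidate_pairs(docs, min_common=3, min_ratio=0.5):
    """Select document pairs worth a detailed one-to-one comparison.

    A pair is kept if it passes either hypothetical filter:
      1. count filter: at least `min_common` shared non-stop words;
      2. ratio filter: shared words make up at least `min_ratio` of the
         smaller document's non-stop vocabulary.
    """
    gdic = build_global_dictionary(docs)
    vocab_size = {doc_id: len(tokenize(text)) for doc_id, text in docs.items()}

    # Count shared words per pair by walking the dictionary once,
    # so pairs with no common word are never enumerated.
    common = defaultdict(int)
    for doc_ids in gdic.values():
        for pair in combinations(sorted(doc_ids), 2):
            common[pair] += 1

    selected = set()
    for (a, b), count in common.items():
        smaller = min(vocab_size[a], vocab_size[b])
        if count >= min_common or count / smaller >= min_ratio:
            selected.add((a, b))
    return selected
```

Only the pairs returned by `candidate_pairs` would then be passed to a detailed one-to-one comparison, which is where the reduction in search time would come from under these assumptions.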
Keywords
Internet; data structures; dictionaries; query formulation; text analysis; GDIC; data structure; document repository; global dictionary; similar text detection; text plagiarism; text search; Inspection; Plagiarism; Sensitivity; dictionary; information retrieval; text similarity
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer and Information Technology (CIT), 2011 IEEE 11th International Conference on
Conference_Location
Pafos
Print_ISBN
978-1-4577-0383-6
Electronic_ISBN
978-0-7695-4388-8
Type
conf
DOI
10.1109/CIT.2011.76
Filename
6036820