DocumentCode :
3537648
Title :
A Global Dictionary Based Approach to Fast Similar Text Search in Document Repository
Author :
Park, Sun-Young ; Kim, SeonYeong ; Kim, Sung-Hwan ; Cho, Hwan-Gue
Author_Institution :
Dept. of Comput. Sci. & Eng., Pusan Nat. Univ., Busan, South Korea
fYear :
2011
fDate :
Aug. 31 2011-Sept. 2 2011
Firstpage :
526
Lastpage :
532
Abstract :
Text plagiarism is growing rapidly with the development of the Internet, and many plagiarism detection algorithms have been proposed in response. However, most algorithms focus on optimized one-to-one comparison rather than massive document comparison. Such algorithms are limited in time performance when users conduct an exhaustive search over a huge set of documents. In this paper, we propose an optimized preprocessing model to detect similar text in massive document repositories. The model uses an efficient data structure called GDIC (Global Dictionary) for preprocessing. After filtering stop words, we select the pairs of documents to be inspected by applying two methods at the same time; both rely on the non-stop words that two documents have in common, but each uses them in a slightly different way. The first method selects pairs of documents with a high frequency of common non-stop words, while the second selects pairs with a high proportion of common non-stop words. We demonstrate the performance of the model experimentally. In our experiments, the proposed preprocessing model drastically reduced search time to 64~87%, while sensitivity stood at 77~96%. With this model, GDIC generation accounts for a large proportion of the total detection time. In future work, we will optimize GDIC creation time to improve the performance of the entire system.
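The preprocessing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the function names, the thresholds `min_common` and `min_ratio`, the choice to count distinct shared words, and the choice to accept a pair when either criterion holds are all assumptions made for illustration. A global dictionary maps each non-stop word to the documents containing it, so candidate pairs are found without comparing every pair of documents directly.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical stop-word list; the list used in the paper is not given here.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "on"}


def tokenize(text):
    """Return the set of lowercase non-stop words in a text."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def build_global_dictionary(docs):
    """Map each non-stop word to the set of document ids that contain it."""
    gdic = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            gdic[word].add(doc_id)
    return gdic


def candidate_pairs(docs, min_common=3, min_ratio=0.5):
    """Select document pairs worth a detailed comparison.

    A pair is kept if it shares at least `min_common` non-stop words
    (frequency criterion) or if the shared words make up at least
    `min_ratio` of the smaller document's vocabulary (proportion
    criterion). Both thresholds are illustrative assumptions.
    """
    gdic = build_global_dictionary(docs)
    vocab_size = {doc_id: len(tokenize(text)) for doc_id, text in docs.items()}

    # Count shared words per pair by walking the dictionary entries.
    common = defaultdict(int)
    for doc_ids in gdic.values():
        for pair in combinations(sorted(doc_ids), 2):
            common[pair] += 1

    selected = set()
    for (a, b), count in common.items():
        smaller = min(vocab_size[a], vocab_size[b])
        if count >= min_common or (smaller and count / smaller >= min_ratio):
            selected.add((a, b))
    return selected
```

Only the pairs returned by `candidate_pairs` would then be passed to a detailed one-to-one comparison; pairs with no shared non-stop word never appear in the dictionary walk at all.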
Keywords :
Internet; data structures; dictionaries; query formulation; text analysis; GDIC; Internet; data structure; document repository; global dictionary; similar text detection; text plagiarism; text search; Data structures; Dictionaries; Inspection; Internet; Plagiarism; Sensitivity; dictionary; information retrieval; plagiarism; text similarity;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer and Information Technology (CIT), 2011 IEEE 11th International Conference on
Conference_Location :
Pafos
Print_ISBN :
978-1-4577-0383-6
Electronic_ISBN :
978-0-7695-4388-8
Type :
conf
DOI :
10.1109/CIT.2011.76
Filename :
6036820