Title :
Plagiarism Detection in arXiv
Author :
Sorokina, Daria ; Gehrke, Johannes ; Warner, Simeon ; Ginsparg, Paul
Author_Institution :
Dept. of Comput. Sci., Cornell Univ., Ithaca, NY
Abstract :
We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger.
Keywords :
research and development; text analysis; arXiv; plagiarism detection; problematic author behaviors; research document collections; Application software; Computer science; Displays; History; Information science; Large-scale systems; Physics computing; Plagiarism; Sequences; Testing
Conference_Title :
Data Mining, 2006. ICDM '06. Sixth International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
0-7695-2701-7
DOI :
10.1109/ICDM.2006.126