DocumentCode :
3090215
Title :
Plagiarism detection in text using Vector Space Model
Author :
Ekbal, Asif ; Saha, Simanto ; Choudhary, Garvit
Author_Institution :
Dept. of Comput. Sci. & Eng., Indian Inst. of Technol. Patna, Patna, India
fYear :
2012
fDate :
4-7 Dec. 2012
Firstpage :
366
Lastpage :
371
Abstract :
Plagiarism is the act of copying someone else's ideas or work and claiming them as one's own. Plagiarism detection is the task of identifying the passages of a given document that are plagiarized, i.e. copied from other documents. The main challenges arise because plagiarists often obfuscate the copied text: they may shuffle, remove, insert, or replace words or short phrases, restructure sentences by replacing words with synonyms, and change the order in which words appear in a sentence. In this paper we propose a technique based on textual similarity for external plagiarism detection. For a given suspicious document, we have to identify the set of source documents from which the suspicious document is copied. The method we propose comprises four phases. In the first phase, we process all the documents to generate tokens, lemmas, Part-of-Speech (PoS) classes, character offsets, sentence numbers and named-entity (NE) classes. In the second phase, we select a subset of documents that may be the sources of plagiarism. We use an approach based on the traditional Vector Space Model (VSM) for this candidate selection. In the third phase, we use a graph-based approach to find the similar passages in the suspicious document and the selected source documents. Finally, we filter out the false detections.
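The candidate-selection phase described in the abstract can be illustrated with a minimal Vector Space Model sketch. The abstract does not state the term weighting, similarity measure, or cut-off used, so the smoothed TF-IDF weights, cosine similarity, and the `threshold=0.2` default below are assumptions, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `select_candidates`) are hypothetical. It is not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; a simple stand-in for the token/lemma step."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Assumed weighting: raw term count times a smoothed IDF.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_candidates(suspicious, sources, threshold=0.2):
    """Return (source_index, score) pairs scoring >= threshold, best first.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    vectors = tfidf_vectors([suspicious] + list(sources))
    susp_vec, source_vecs = vectors[0], vectors[1:]
    scored = [(i, cosine(susp_vec, vec)) for i, vec in enumerate(source_vecs)]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sources = [
        "the quick brown fox jumps over the lazy dog",
        "stock markets fell sharply on monday",
    ]
    suspicious = "a quick brown fox jumped over a lazy dog"
    for index, score in select_candidates(suspicious, sources):
        print(index, round(score, 3))
```

Only the sources that pass the threshold would be handed to the later passage-matching phase, which keeps the more expensive comparison restricted to a small candidate set.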
Keywords :
graph theory; law; text analysis; NE class; POS class; VSM; candidate selection; character-offsets; graph-based approach; lemma generation; named-entity class; part-of-speech class; plagiarism detection; sentence numbers; source documents; suspicious document; textual similarity; token generation; vector space model; Computational modeling; Hybrid intelligent systems; Information retrieval; Measurement; Plagiarism; Training; Vectors; N-gram language model; Plagiarism detection; Vector Space Model;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Hybrid Intelligent Systems (HIS), 2012 12th International Conference on
Conference_Location :
Pune
Print_ISBN :
978-1-4673-5114-0
Type :
conf
DOI :
10.1109/HIS.2012.6421362
Filename :
6421362