DocumentCode :
3326463
Title :
Copy detection in Urdu language documents using n-grams model
Author :
Khan, Muhammad Asad ; Aleem, A. ; Wahab, Abdul ; Khan, M.N.
Author_Institution :
Dept. of Comput. Sci., Univ. of Peshawar, Peshawar, Pakistan
fYear :
2011
fDate :
11-13 July 2011
Firstpage :
263
Lastpage :
266
Abstract :
In this paper we present our work on copy detection in short Urdu text passages. Given two passages, one as the source text and the other as the suspected copy, the task is to determine whether the second passage is a plagiarized version of the source text. We have developed an algorithm for plagiarism detection. We used the n-gram model for word retrieval and found tri-grams to be the best model for comparing Urdu text passages. Based on probability and resemblance measures calculated from the bi-gram comparison, we categorize the passages against a threshold. In the algorithm, connecting words are taken into account when computing and matching tri-grams. We have developed a software system in C# for both the algorithms. This system can be used to detect copying in students' assignments in the Urdu language.
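The abstract does not give the exact resemblance measure or threshold, so the following is only a minimal sketch of n-gram based copy detection under stated assumptions: whitespace tokenization, word tri-grams, Jaccard resemblance between the two tri-gram sets, and an arbitrary threshold of 0.5. The function names (`word_ngrams`, `resemblance`, `is_copy`) are hypothetical and are not taken from the authors' C# system.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in text (whitespace tokenization is an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def resemblance(source, suspect, n=3):
    """Jaccard resemblance of the two n-gram sets (an assumed measure, not the paper's)."""
    a = word_ngrams(source, n)
    b = word_ngrams(suspect, n)
    if not a or not b:
        # A passage shorter than n words yields no n-grams to compare.
        return 0.0
    return len(a & b) / len(a | b)


def is_copy(source, suspect, n=3, threshold=0.5):
    """Flag the suspect passage when resemblance reaches the (assumed) threshold."""
    return resemblance(source, suspect, n) >= threshold
```

Because `str.split()` splits on Unicode whitespace, the same sketch can be applied to Urdu passages directly. Connecting words are kept in the token stream here, in line with the abstract's statement that they are considered when matching tri-grams.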
Keywords :
copy protection; natural language processing; text analysis; C#; Urdu language documents; Urdu text passages; copy detection; n-grams model; plagiarism detection; student assignments; tri-grams; word retrieval; Indexes; Tiles; Bi-gram; Copy detection; N-gram Model; Natural Language Processing; Urdu Language;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Networks and Information Technology (ICCNIT), 2011 International Conference on
Conference_Location :
Abbottabad
ISSN :
2223-6317
Print_ISBN :
978-1-61284-940-9
Type :
conf
DOI :
10.1109/ICCNIT.2011.6020940
Filename :
6020940