Title :
Copy detection in Urdu language documents using n-grams model
Author :
Khan, Muhammad Asad ; Aleem, A. ; Wahab, Abdul ; Khan, M.N.
Author_Institution :
Dept. of Comput. Sci., Univ. of Peshawar, Peshawar, Pakistan
Abstract :
In this paper we present our work on copy detection in short Urdu text passages. Given two passages, one as the source text and the other as the copied text, we determine whether the second passage is a plagiarized version of the source text. We have developed an algorithm for plagiarism detection. We used the n-gram model for word retrieval and found tri-grams to be the best model for comparing Urdu text passages. Based on the probability and resemblance measures calculated from the bi-gram comparison, we categorize the passages against a threshold. In the algorithm, connecting words are taken into account when computing and matching tri-grams. We have developed a software system in C# for both algorithms. This system can be used to detect copying in students' assignments in the Urdu language.
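The comparison described in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity, not in the authors' C#, and it is not the paper's algorithm: the function names, the whitespace tokenization, the Jaccard-style resemblance, the containment measure, and the 0.5 threshold are all assumptions made for illustration. Connecting words are kept in the n-grams, as the abstract indicates, and `n` is a parameter so that both bi-grams and tri-grams can be tried.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in a passage.

    Assumption: words are separated by whitespace. Connecting
    (stop) words are deliberately kept.
    """
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(source, suspect, n=3):
    """Jaccard-style resemblance: shared n-grams over all distinct n-grams."""
    a, b = word_ngrams(source, n), word_ngrams(suspect, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def containment(source, suspect, n=3):
    """Fraction of the suspect passage's n-grams that also occur in the source."""
    a, b = word_ngrams(source, n), word_ngrams(suspect, n)
    if not b:
        return 0.0
    return len(a & b) / len(b)


def is_copy(source, suspect, n=3, threshold=0.5):
    """Flag the suspect passage when containment reaches the threshold.

    The threshold value is hypothetical, not taken from the paper.
    """
    return containment(source, suspect, n) >= threshold


# Placeholder tokens stand in for Urdu words.
source = "w1 w2 w3 w4 w5"
suspect = "w1 w2 w3 w4 w9"
print(resemblance(source, suspect))   # 2 shared tri-grams out of 4 distinct
print(is_copy(source, suspect))
```

With the placeholder passages above, two of the three tri-grams in the suspect passage also occur in the source, so the passage would be flagged at the assumed threshold. Real Urdu text may need a proper word segmenter, since whitespace does not always mark word boundaries reliably.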
Keywords :
copy protection; natural language processing; text analysis; C#; Urdu language documents; Urdu text passages; copy detection; n-grams model; plagiarism detection; student assignments; tri-grams; word retrieval; Indexes; Tiles; Bi-gram; N-gram Model; Urdu Language
Conference_Title :
Computer Networks and Information Technology (ICCNIT), 2011 International Conference on
Conference_Location :
Abbottabad
Print_ISBN :
978-1-61284-940-9
DOI :
10.1109/ICCNIT.2011.6020940