DocumentCode
3326463
Title
Copy detection in Urdu language documents using n-grams model
Author
Khan, Muhammad Asad ; Aleem, A. ; Wahab, Abdul ; Khan, M.N.
Author_Institution
Dept. of Comput. Sci., Univ. of Peshawar, Peshawar, Pakistan
fYear
2011
fDate
11-13 July 2011
Firstpage
263
Lastpage
266
Abstract
In this paper we present our work on copy detection in short Urdu text passages. Given two passages, one as the source text and the other as the copied text, the task is to determine whether the second passage is a plagiarized version of the source text. We have developed an algorithm for plagiarism detection. We used the n-gram model for word retrieval and found tri-grams to be the best model for comparing Urdu text passages. Based on probability and the resemblance measures calculated from the bi-gram comparison, we categorize the passages against a threshold. In the algorithm, connecting words are considered when computing and matching tri-grams. We have developed a software system in C# for both algorithms. This system can be used to detect copying in students' assignments in the Urdu language.
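The comparison step described in the abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' C# system: the abstract does not give the exact resemblance formula, the tokenization, or the threshold, so the set-overlap (Jaccard) measure, the whitespace tokenization, the function names, and the 0.5 threshold below are all assumptions made for the example. Connecting words are kept in the n-grams, as the abstract states.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams of a passage.

    Assumption: words are separated by whitespace; connecting words
    are kept, since the abstract says they take part in matching.
    """
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(source, suspect, n=3):
    """Share of n-grams the two passages have in common.

    Assumption: Jaccard overlap (shared n-grams divided by all
    distinct n-grams); the paper's own measure may differ.
    """
    a = word_ngrams(source, n)
    b = word_ngrams(suspect, n)
    if not a or not b:
        # A passage shorter than n words has no n-grams to compare.
        return 0.0
    return len(a & b) / len(a | b)


def is_copy(source, suspect, n=3, threshold=0.5):
    """Flag the suspect passage when resemblance reaches the threshold.

    The threshold value is hypothetical, chosen only for illustration.
    """
    return resemblance(source, suspect, n) >= threshold
```

Calling `resemblance` with `n=2` gives the corresponding bi-gram comparison, so the same sketch covers both n-gram sizes mentioned in the abstract. In practice the threshold would have to be tuned on labelled Urdu passages.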
Keywords
copy protection; natural language processing; text analysis; C#; Urdu language documents; Urdu text passages; copy detection; n-grams model; plagiarism detection; student assignments; tri-grams; word retrieval; Indexes; Tiles; Bi-gram; Copy detection; N-gram Model; Natural Language Processing; Urdu Language;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Networks and Information Technology (ICCNIT), 2011 International Conference on
Conference_Location
Abbottabad
ISSN
2223-6317
Print_ISBN
978-1-61284-940-9
Type
conf
DOI
10.1109/ICCNIT.2011.6020940
Filename
6020940