Copy detection in Urdu language documents using n-grams model

Author

Khan, Muhammad Asad ; Aleem, A. ; Wahab, Abdul ; Khan, M.N.

Author_Institution

Dept. of Comput. Sci., Univ. of Peshawar, Peshawar, Pakistan

fYear

2011

fDate

11-13 July 2011

Firstpage

263

Lastpage

266

Abstract

In this paper we present our work on copy detection in short Urdu text passages. Given two passages, one as the source text and the other as the suspected copy, we determine whether the second passage is a plagiarized version of the source text. We have developed an algorithm for plagiarism detection. We used the n-gram model for word retrieval and found tri-grams to be the best model for comparing Urdu text passages. Based on the probability and resemblance measures calculated from the bi-gram comparison, we categorize the passages against a threshold. In the algorithm, connecting words are included when computing and matching tri-grams. We have developed a software system in C# for both algorithms. This system can be used to detect copying in students' assignments in the Urdu language.
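The abstract does not give the exact resemblance formula or the threshold value, so the following is only a minimal sketch of word-level tri-gram comparison, not the authors' C# system. It assumes whitespace tokenization, a Jaccard-style resemblance (shared tri-grams divided by all distinct tri-grams), and a placeholder threshold of 0.5. The names `word_ngrams`, `resemblance`, and `is_copy` are hypothetical. Connecting words are kept, in line with the abstract.

```python
def word_ngrams(text, n=3):
    """Return the set of word-level n-grams in text.

    Assumption: words are separated by whitespace; no stop words
    (connecting words) are removed.
    """
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(source, suspect, n=3):
    """Jaccard-style resemblance between two passages.

    Assumption: this formula stands in for the paper's resemblance
    measure, which the abstract does not define.
    """
    a = word_ngrams(source, n)
    b = word_ngrams(suspect, n)
    if not a or not b:
        # A passage shorter than n words has no n-grams to compare.
        return 0.0
    return len(a & b) / len(a | b)


def is_copy(source, suspect, threshold=0.5, n=3):
    """Flag the suspect passage when resemblance reaches the threshold.

    Assumption: 0.5 is a placeholder, not a value from the paper.
    """
    return resemblance(source, suspect, n) >= threshold


if __name__ == "__main__":
    source = "یہ ایک سادہ مثال ہے"
    suspect = "یہ ایک سادہ مثال تھی"
    print(resemblance(source, suspect))
    print(is_copy(source, suspect))
```

Passing `n=2` gives the bi-gram variant of the same comparison. In practice the threshold would need tuning on labelled Urdu passages.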

Keywords

copy protection; natural language processing; text analysis; C#; Urdu language documents; Urdu text passages; copy detection; n-grams model; plagiarism detection; student assignments; tri-grams; word retrieval; Indexes; Tiles; Bi-gram; Copy detection; N-gram Model; Natural Language Processing; Urdu Language;

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Networks and Information Technology (ICCNIT), 2011 International Conference on

Conference_Location

Abbottabad

ISSN

2223-6317

Print_ISBN

978-1-61284-940-9

Type

conf

DOI

10.1109/ICCNIT.2011.6020940

Filename

6020940