• DocumentCode
    3326463
  • Title

    Copy detection in Urdu language documents using n-grams model

  • Author

    Khan, Muhammad Asad ; Aleem, A. ; Wahab, Abdul ; Khan, M.N.

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Peshawar, Peshawar, Pakistan
  • fYear
    2011
  • fDate
    11-13 July 2011
  • Firstpage
    263
  • Lastpage
    266
  • Abstract
    In this paper we present our work on copy detection in short Urdu text passages. Given two passages, one as the source text and the other as the suspected copy, the task is to determine whether the second passage is a plagiarized version of the source text. We have developed an algorithm for plagiarism detection. We used the n-gram model for word retrieval and found tri-grams to be the best model for comparing Urdu text passages. Based on probability and resemblance measures calculated from the bi-gram comparison, we categorize the passages against a threshold. In the algorithm, connecting words are taken into account when computing and matching tri-grams. We have developed a software system in C# for both the algorithms. This system can be used to detect copying in students' assignments in the Urdu language.
  • Keywords
    copy protection; natural language processing; text analysis; C#; Urdu language documents; Urdu text passages; copy detection; n-grams model; plagiarism detection; student assignments; tri-grams; word retrieval; Indexes; Tiles; Bi-gram; Copy detection; N-gram Model; Natural Language Processing; Urdu Language;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Networks and Information Technology (ICCNIT), 2011 International Conference on
  • Conference_Location
    Abbottabad
  • ISSN
    2223-6317
  • Print_ISBN
    978-1-61284-940-9
  • Type

    conf

  • DOI
    10.1109/ICCNIT.2011.6020940
  • Filename
    6020940
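
The abstract describes comparing two passages by their word n-grams and classifying the pair against a threshold. The following is a minimal sketch of that general idea, not the paper's algorithm: it assumes whitespace tokenization, uses Jaccard overlap of word tri-gram sets as the resemblance measure, and picks an arbitrary threshold of 0.5. The function names `word_ngrams`, `resemblance`, and `is_copy` are hypothetical, and the sketch is in Python rather than the C# of the original system.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in text (assumption: whitespace tokenization)."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(source, suspect, n=3):
    """Jaccard overlap of the two n-gram sets (an assumed resemblance measure)."""
    a = word_ngrams(source, n)
    b = word_ngrams(suspect, n)
    if not a or not b:
        # A passage shorter than n words yields no n-grams to compare.
        return 0.0
    return len(a & b) / len(a | b)


def is_copy(source, suspect, n=3, threshold=0.5):
    """Flag the suspect passage when resemblance reaches the (hypothetical) threshold."""
    return resemblance(source, suspect, n) >= threshold
```

Because the tokenizer only splits on whitespace, the same functions apply to Urdu passages unchanged; connecting words are kept as ordinary tokens here, and both `n` and `threshold` would need tuning on real data.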