Title of article :
A Corpus for Evaluation of Cross Language Text Re-use Detection Systems
Author/Authors :
Mohtaj, Salar Faculty IV - Technische Universität Berlin , Asghari, Habibollah ICT Research Institute - Academic Center for Education, Culture and Research (ACECR)
From page :
169
To page :
179
Abstract :
In recent years, the availability of documents on the Internet, together with automatic translation systems, has increased plagiarism, especially across languages. Cross-lingual plagiarism occurs when the source text is in one language and the plagiarized or re-used text is in another. Various methods for automatic cross-language text re-use detection have been developed to assist human experts in analyzing documents for plagiarism cases. Evaluating the performance of these systems and algorithms requires standard evaluation resources. In constructing cross-lingual plagiarism detection corpora, most earlier studies have concentrated on English and other European language pairs and have paid less attention to low-resource languages. In this paper, we investigate a method for constructing an English-Persian cross-language plagiarism detection corpus in which passages with various degrees of paraphrasing are artificially generated from parallel bilingual sentences. The plagiarized passages are inserted into topically related English and Persian Wikipedia articles to produce more realistic text documents. The proposed approach can be applied to other less-resourced languages. The compiled corpus was evaluated with both intrinsic and extrinsic methods, and the results indicate that it can suitably be included in an evaluation framework for assessing cross-language plagiarism detection systems. The corpus is free and publicly available for research purposes.
Keywords :
Cross Language Plagiarism Detection , Corpus , Text Re-Use Detection , Obfuscation
Journal title :
Journal of Information Systems and Telecommunication
Record number :
2727327