DocumentCode :
2868678
Title :
Web Based Cross Language Semantic Plagiarism Detection
Author :
Kent, Chow Kok ; Salim, Naomie
Author_Institution :
Fac. of CS & Inf. Sys., Univ. Teknol. Malaysia, Skudai, Malaysia
fYear :
2011
fDate :
12-14 Dec. 2011
Firstpage :
1096
Lastpage :
1102
Abstract :
As the Internet helps us cross language and cultural borders, and with many kinds of translation tools available, cross-language plagiarism is bound to rise. Semantic plagiarism, where a student restructures a sentence or replaces some terms with their synonyms, also raises concern in the academic field. Both kinds of plagiarism are hard to detect because the fingerprints of the plagiarised text differ from those of the source. Available plagiarism detection tools are not capable of detecting such cases. In this research, we propose a new approach to detecting both cross-language and semantic plagiarism. We consider Bahasa Melayu as the input language of the submitted document and English as the target language of similar, possibly plagiarised documents. In this system we shorten the query document using a fuzzy swarm-based summarisation approach. Our view is that the summary gives us the most important keywords in the document. Input summary documents are translated into English using the Google Translate Application Programming Interface (API) before the words are stemmed and the stop words are removed. Tokenized documents are sent to the Google AJAX Search API to find similar documents throughout the World Wide Web. We integrate the Stanford Parser and WordNet to determine the level of semantic similarity between the suspected documents and candidate source documents. The Stanford Parser assigns each term in a sentence to its corresponding role, such as noun, verb or adjective. Based on these roles, we represent each sentence in predicate form, and similarity is measured on those predicates using information content values from the WordNet taxonomy. Our testing dataset is built from two sets of Malay documents produced using different plagiarism techniques.
The results of our proposed semantic-based similarity measurement show that it can achieve higher precision, recall and F-measure than the conventional Longest Common Subsequence (LCS) approach.
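For readers unfamiliar with the baseline and the metrics named above, the following is a minimal Python sketch of a token-level Longest Common Subsequence similarity and of precision, recall and F-measure. It is not the authors' implementation: the function names, the whitespace tokenisation and the normalisation by the longer sentence are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(suspect, source):
    """LCS length divided by the longer token count (assumed normalisation)."""
    a, b = suspect.lower().split(), source.lower().split()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def precision_recall_f1(predicted, actual):
    """Precision, recall and F-measure over sets of flagged document ids."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Each substituted word removes one token from the common subsequence.
one_swap = lcs_similarity("the student copies the text",
                          "the pupil copies the text")
two_swaps = lcs_similarity("the student copies the text",
                           "the pupil duplicates the text")
```

Because LCS matches surface tokens only, each synonym substitution lowers the score even though the meaning is unchanged, which is the weakness a semantic measure is meant to address.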
Keywords :
application program interfaces; cultural aspects; document handling; fuzzy set theory; grammars; language translation; natural language processing; query processing; semantic Web; API; Bahasa Melayu; English; F-Measure; Google AJAX Search API; Google translate application programming interface; Internet; Malay documents; Stanford Parser; Web based cross language semantic plagiarism detection; Word Net; Word Net taxonomy; World Wide Web; cultural border; fuzzy swarm-based summarisation approach; information content value; input summary documents; longest common subsequence; plagiarised documents; query document; semantic based similarity measurement; semantic similarity level; tokenized documents; translation tools; Fingerprint recognition; Google; Plagiarism; Semantics; Taxonomy; Testing; Weapons; cross language; fuzzy swarm based summarization; plagiarism detection; semantic;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Dependable, Autonomic and Secure Computing (DASC), 2011 IEEE Ninth International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-1-4673-0006-3
Type :
conf
DOI :
10.1109/DASC.2011.180
Filename :
6119063