Regional Information Center for Science and Technology - A Fast Searching for Similar Text Using Genomic Read Mapping Method

DocumentCode :

3459627

Title :

A Fast Searching for Similar Text Using Genomic Read Mapping Method

Author :

Chang Seok Ock ; Sung-Hwan Kim ; Haesung Tak ; Hwan Gue Cho

Author_Institution :

Dept. of Comput. Eng., Pusan Nat. Univ., Busan, South Korea

fYear :

2013

fDate :

3-5 Dec. 2013

Firstpage :

219

Lastpage :

226

Abstract :

The most important consideration when detecting plagiarism is precision. Thus, the precise determination of the similarity of two documents is critical for the authors of documents. However, the problem complexity is increased by considering precision alone. Typically, the semantic detection of plagiarism has very high complexity, so a syntactic method for detecting plagiarism is used widely. The two main syntactic methods are sequence alignment and fingerprinting. Sequence alignment has powerful characteristics such as very high precision, because it is based on character-by-character comparisons. However, naive sequence alignment has a high space complexity (O(n²)). Fingerprinting is another syntactic method that uses the similarity of vectors extracted from documents. This method has a lower space complexity (O(n)) compared with sequence alignment. However, it also has lower precision because this method does not consider the structural similarity of documents. The method we propose for detecting plagiarized texts can detect plagiarism precisely, even with a low spatiotemporal complexity, by applying the short-read mapping method used for next-generation sequencing (NGS). In addition, we propose a distance measure for documents, which is based on the detection method used to construct a phylogenetic tree by calculating the similarities of documents. The proposed method has a maximum precision of 0.95 and a maximum recall of 0.94. The construction of phylogenetic trees for linearly plagiarized documents using the distance measure had an average precision of 0.99. In the future, we will study the phylogeny of naturally plagiarized documents.
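
The abstract names short-read mapping as the core idea, and the keywords list the Burrows-Wheeler Transform and the FM-index. The record does not give the authors' algorithm, so the following is only a minimal illustrative sketch of how exact-match read mapping against a text can work with a toy FM-index. All function names (`build_fm_index`, `find_read`, `map_reads`), the `read_len` parameter, and the NUL sentinel are assumptions made for this example, and the naive index construction here is not space- or time-efficient; real read mappers also tolerate mismatches, which this sketch does not.

```python
def build_fm_index(text):
    """Build a toy FM-index for `text` (illustrative, not optimized)."""
    # Assumption: "\0" does not occur in text and sorts before all other characters.
    text += "\0"
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to the sentinel).
    bwt = "".join(text[i - 1] for i in sa)

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # c_table[ch] = number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return sa, c_table, occ


def find_read(index, read):
    """Return sorted start positions of exact occurrences of `read`."""
    sa, c_table, occ = index
    if not read:
        return []
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    # Backward search: extend the match one character at a time, right to left.
    for ch in reversed(read):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


def map_reads(index, query, read_len=8):
    """Cut `query` into non-overlapping reads and map each onto the indexed text.

    Returns {offset of read in query: [positions in indexed text]} for reads that hit.
    """
    hits = {}
    for start in range(0, len(query) - read_len + 1, read_len):
        positions = find_read(index, query[start:start + read_len])
        if positions:
            hits[start] = positions
    return hits
```

In this sketch a suspicious document plays the role of the sequenced sample: it is cut into short "reads", and each read is located in the indexed source document. The fraction of reads that map, and where they land, could then feed a similarity or distance measure, though the specific measure proposed in the paper is not described in this record.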

Keywords :

computational complexity; text analysis; NGS; character-by-character comparisons; distance measure; document similarity; fast searching; fingerprinting; genomic read mapping method; linearly plagiarized documents; naturally plagiarized documents phylogeny; next-generation sequencing; phylogenetic tree; plagiarism semantic detection; precision; problem complexity; sequence alignment; short-read mapping method; similar text; space complexity; spatiotemporal complexity; syntactic method; vector similarity; Complexity theory; Indexes; Phylogeny; Plagiarism; Skin; Syntactics; Vectors; Burrows-Wheeler Transform; FM-index; Similar Document Detection;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on

Conference_Location :

Sydney, NSW

Type :

conf

DOI :

10.1109/CSE.2013.43

Filename :

6755221

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3459627