DocumentCode :
3459627
Title :
A Fast Searching for Similar Text Using Genomic Read Mapping Method
Author :
Chang Seok Ock ; Sung-Hwan Kim ; Haesung Tak ; Hwan Gue Cho
Author_Institution :
Dept. of Comput. Eng., Pusan Nat. Univ., Busan, South Korea
fYear :
2013
fDate :
3-5 Dec. 2013
Firstpage :
219
Lastpage :
226
Abstract :
The most important consideration when detecting plagiarism is precision. Thus, the precise determination of the similarity of two documents is critical for the authors of documents. However, the problem complexity is increased by considering precision alone. Typically, the semantic detection of plagiarism has very high complexity, so syntactic methods for detecting plagiarism are widely used. The two main syntactic methods are sequence alignment and fingerprinting. Sequence alignment has powerful characteristics such as very high precision, because it is based on character-by-character comparisons. However, naive sequence alignment has a high space complexity (O(n^2)). Fingerprinting is another syntactic method that uses the similarity of vectors extracted from documents. This method has a lower space complexity (O(n)) compared with sequence alignment. However, it also has lower precision because it does not consider the structural similarity of documents. The method we propose for detecting plagiarized texts can detect plagiarism precisely, even with a low spatiotemporal complexity, by applying the short-read mapping method used for next-generation sequencing (NGS). In addition, we propose a distance measure for documents, which is based on the detection method and is used to construct a phylogenetic tree by calculating the similarities of documents. The proposed method has a maximum precision of 0.95 and a maximum recall of 0.94. The construction of phylogenetic trees for linearly plagiarized documents using the distance measure had an average precision of 0.99. In the future, we will study the phylogeny of naturally plagiarized documents.
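Short-read mappers used in NGS commonly rely on the Burrows-Wheeler Transform and the FM-index, both of which appear in the keywords of this record. The following is a minimal Python sketch of exact substring search over a text with such an index. It is a generic illustration and not the authors' implementation: the function names `build_fm_index` and `find_all` are hypothetical, the suffix array is built naively for readability, and the text is assumed not to contain the `"\0"` sentinel character.

```python
def build_fm_index(text):
    """Build a small FM-index (suffix array, first-column offsets, occurrence table)."""
    text = text + "\0"  # sentinel; assumed to be absent from the input text
    # Naive suffix array: fine for illustration, too slow for large inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to the sentinel).
    bwt = "".join(text[i - 1] for i in sa)

    alphabet = sorted(set(text))
    counts = {c: 0 for c in alphabet}
    for ch in text:
        counts[ch] += 1

    # first[c]: number of characters in the text that sort strictly before c.
    first = {}
    total = 0
    for c in alphabet:
        first[c] = total
        total += counts[c]

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)

    return sa, first, occ


def find_all(pattern, index):
    """Return the sorted start positions of pattern, found by backward search."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("abracadabra")
print(find_all("abra", index))  # [0, 7]
```

Once the index is built, each query takes a number of steps proportional to the pattern length rather than the text length. This sketch only handles exact matches, whereas practical read mappers also tolerate mismatches and gaps.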
Keywords :
computational complexity; text analysis; NGS; character-by-character comparisons; distance measure; document similarity; fast searching; fingerprinting; genomic read mapping method; linearly plagiarized documents; naturally plagiarized documents phylogeny; next-generation sequencing; phylogenetic tree; plagiarism semantic detection; precision; problem complexity; sequence alignment; short-read mapping method; similar text; space complexity; spatiotemporal complexity; syntactic method; vector similarity; Complexity theory; Indexes; Phylogeny; Plagiarism; Skin; Syntactics; Vectors; Burrows-Wheeler Transform; FM-index; Similar Document Detection;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on
Conference_Location :
Sydney, NSW
Type :
conf
DOI :
10.1109/CSE.2013.43
Filename :
6755221