DocumentCode :
583084
Title :
Unified Approach for Computing Document Similarity with Fingerprinting and Alignments
Author :
Seo, Jongkyu ; Ock, Chang-Seok ; Cho, Hwan-Gue
Author_Institution :
Dept. of Comput. Eng., Pusan Nat. Univ., Busan, South Korea
fYear :
2012
fDate :
27-29 Oct. 2012
Firstpage :
448
Lastpage :
455
Abstract :
Fingerprinting algorithms and sequence alignment are widely used for measuring the similarity of documents. The former is a very fast procedure that extracts and compares documents' features. However, a fingerprinting algorithm cannot determine the partial similarity of a document. The latter is a procedure that arranges sequences of strings to find similar regions. Sequence alignment is very effective for comparing short strings but takes a relatively long computing time. In this paper, we propose the MLA (Multi-Level Alignment) system, which combines a fingerprinting algorithm and sequence alignment. The MLA system is designed to obtain the advantages of both methods. This system uses a segmented block of uniform length as its basic operating unit. A similarity table of two input documents is generated by comparing each document's blocks using a fingerprinting algorithm. Then, sequence alignment is applied to the similarity table in order to identify similar regions. The proportion of the fingerprinting algorithm and the sequence alignment in the MLA system is determined by the basic operation block's size k. If k is 1, then the MLA system operates the same as sequence alignment. However, if k is larger than the document's size, then it operates the same as a fingerprinting algorithm. Using this system, we prove that computing document similarity with the hybrid approach is faster than sequence alignment and also more accurate than the fingerprinting algorithm.
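The pipeline described in the abstract (uniform blocks of size k, a block-by-block similarity table built from fingerprints, then alignment over that table) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the fingerprint function, the similarity measure, or the alignment scoring, so the CRC32-hashed character n-grams, the Jaccard measure, the Smith-Waterman-style local alignment, and the `threshold` and `gap` parameters are all assumptions, and every function name is hypothetical.

```python
import zlib
from typing import List, Set


def blocks(text: str, k: int) -> List[str]:
    """Split text into consecutive blocks of length k (the last may be shorter)."""
    if k < 1:
        raise ValueError("k must be >= 1")
    return [text[i:i + k] for i in range(0, len(text), k)]


def fingerprint(block: str, n: int = 3) -> Set[int]:
    """Assumed fingerprint: the set of CRC32 hashes of character n-grams."""
    if len(block) < n:
        # Blocks shorter than n are hashed as a single unit.
        return {zlib.crc32(block.encode("utf-8"))}
    return {zlib.crc32(block[i:i + n].encode("utf-8"))
            for i in range(len(block) - n + 1)}


def jaccard(x: Set[int], y: Set[int]) -> float:
    """Assumed similarity measure between two fingerprints."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def similarity_table(a: str, b: str, k: int, n: int = 3) -> List[List[float]]:
    """Compare every block of a with every block of b via fingerprints."""
    fa = [fingerprint(x, n) for x in blocks(a, k)]
    fb = [fingerprint(y, n) for y in blocks(b, k)]
    return [[jaccard(x, y) for y in fb] for x in fa]


def local_alignment_score(table: List[List[float]],
                          threshold: float = 0.5,
                          gap: float = 0.5) -> float:
    """Smith-Waterman-style local alignment over the block similarity table.

    A block pair contributes (similarity - threshold), so similar blocks
    raise the score and dissimilar ones lower it. Both parameters are
    illustrative assumptions.
    """
    cols = len(table[0]) if table else 0
    best = 0.0
    prev = [0.0] * (cols + 1)
    for row in table:
        cur = [0.0] * (cols + 1)
        for j in range(1, cols + 1):
            match = prev[j - 1] + (row[j - 1] - threshold)
            cur[j] = max(0.0, match, prev[j] - gap, cur[j - 1] - gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Under these assumptions the two limiting cases in the abstract fall out directly: with k = 1 each block is a single character, the table holds only 0 or 1, and the procedure reduces to character-level local alignment; with k at least as large as both documents the table has a single cell, which is a plain fingerprint comparison.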
Keywords :
document image processing; fingerprint identification; MLA system; computing document similarity; document features; fingerprinting algorithm; multilevel alignment system; partial similarity; segmented block; sequence alignment; similarity table; Computers; Equations; Feature extraction; Fingerprint recognition; Mathematical model; Plagiarism; Vectors;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer and Information Technology (CIT), 2012 IEEE 12th International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4673-4873-7
Type :
conf
DOI :
10.1109/CIT.2012.234
Filename :
6391941