Title :
On the Optimization of a Duplicate Document Detection Algorithm Based on SIMD and Document Statistics
Author :
Yuan, X.P. ; Long, J. ; Zhang, H. ; Zhang, Z.P. ; Gui, W.H.
Author_Institution :
Sch. of Inf. Sci. & Eng., Central South Univ., Changsha, China
Abstract :
Although considerable effort has been devoted to duplicate document detection (DDD) and its applications, very little work has addressed the optimization of its time-consuming functions. An experimental analysis conducted on one million grant proposal documents from nsfc.gov.cn shows that DDD remains quite slow even when clustering and sampling methods are used. Profiling our system with Intel VTune Performance Analyzer, we find that shingle comparison is the most time-consuming part, accounting for 58% of CPU usage. Based on an analysis of the whole algorithm and of the data statistics, we propose and implement an optimized shingle comparison algorithm using Intel SIMD technology. Experiments demonstrate that the proposed optimization yields a performance gain of 11.6%-38.5%, depending on the instruction set and parameter settings. Further performance gains could be achieved based on the tradeoff between accuracy and speed.
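To make the hotspot concrete, the following is a minimal sketch of a SIMD-assisted shingle comparison. It is not the authors' algorithm, which the abstract does not detail. It assumes that each document is represented as an array of distinct 32-bit shingle hashes and that similarity is the set resemblance |A ∩ B| / |A ∪ B|. The function names `contains`, `count_common`, and `resemblance` are hypothetical. On x86 targets with SSE2, one shingle is compared against four candidates per instruction; elsewhere the code falls back to a scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Returns 1 if key occurs in b[0..nb), otherwise 0.
   Hypothetical helper, not taken from the paper. */
static int contains(uint32_t key, const uint32_t *b, size_t nb)
{
    size_t j = 0;
#if defined(__SSE2__)
    /* Broadcast the key into all four 32-bit lanes. */
    const __m128i k = _mm_set1_epi32((int)key);
    for (; j + 4 <= nb; j += 4) {
        /* Unaligned load of four candidate shingle hashes. */
        __m128i v = _mm_loadu_si128((const __m128i *)(b + j));
        /* A nonzero mask means at least one lane matched. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(k, v)) != 0)
            return 1;
    }
#endif
    /* Scalar tail, or the whole array when SSE2 is unavailable. */
    for (; j < nb; ++j)
        if (b[j] == key)
            return 1;
    return 0;
}

/* Counts shingles of a that also occur in b.
   Assumes the hashes within each array are distinct. */
size_t count_common(const uint32_t *a, size_t na,
                    const uint32_t *b, size_t nb)
{
    assert(a != NULL || na == 0);
    assert(b != NULL || nb == 0);
    size_t common = 0;
    for (size_t i = 0; i < na; ++i)
        common += (size_t)contains(a[i], b, nb);
    return common;
}

/* Set resemblance |A and B| / |A or B|; 0.0 when both sets are empty. */
double resemblance(const uint32_t *a, size_t na,
                   const uint32_t *b, size_t nb)
{
    size_t common = count_common(a, na, b, nb);
    size_t uni = na + nb - common;
    return uni == 0 ? 0.0 : (double)common / (double)uni;
}
```

The brute-force pairing keeps the sketch short; a production system would more likely sort or hash the shingles first and vectorize that inner loop instead.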
Keywords :
document handling; parallel processing; statistics; Intel SIMD technology; Intel VTune performance analyzer; document statistics; duplicate document detection algorithm; grant proposal documents; Accuracy; Algorithm design and analysis; Clustering algorithms; Optimization; Performance gain; Plagiarism; Registers;
Conference_Title :
Computational Intelligence and Software Engineering (CiSE), 2010 International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-5391-7
Electronic_ISBN :
978-1-4244-5392-4
DOI :
10.1109/CISE.2010.5676949