Title :
Models and algorithms for duplicate document detection
Author :
Lopresti, Daniel P.
Author_Institution :
Lucent Technologies Inc., AT&T Bell Laboratories, Murray Hill, NJ, USA
Abstract :
This paper introduces a framework for clarifying and formalizing the duplicate document detection problem. Four distinct models are presented, each with a corresponding algorithm for its solution drawn from approximate string matching. The robustness of these techniques is demonstrated through a set of experiments using data that reflects real-world degradation effects.
Keywords :
string matching; visual databases; approximate string matching; document image databases; duplicate document detection; real-world degradation effects; Data mining; Electrical capacitance tomography; Feature extraction; Image databases; Information management; Microwave integrated circuits; Optical character recognition software; Packaging; Spatial databases; Turning
Conference_Title :
Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR '99), 1999
Conference_Location :
Bangalore
Print_ISBN :
0-7695-0318-7
DOI :
10.1109/ICDAR.1999.791783