Detecting near-duplicate document images using interest point matching

Author

Vitaladevuni, Shiv ; Choi, F. ; Prasad, Ranga ; Natarajan, Prem

Author_Institution

Raytheon BBN Technol., Cambridge, MA, USA

fYear

2012

fDate

11-15 Nov. 2012

Firstpage

347

Lastpage

350

Abstract

We present an approach to detecting near-duplicate document images using SIFT interest point matching. Given a set of document images, SIFT features are extracted from each image and stored in a kd-tree, which serves as the feature database. The near-duplicates of a query image are estimated by directly matching its SIFT descriptors against the feature database. We demonstrate the approach on a challenging set of more than 16,000 unconstrained hand- and machine-written Arabic document images obtained from the field. Our experiments indicate that the approach detects near-duplicates with a low false alarm rate and outperforms a bag-of-words-based approach.
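
The pipeline described in the abstract (descriptors indexed in a kd-tree, query descriptors matched directly against it) can be illustrated with a minimal sketch. This is not the authors' implementation: the 8-dimensional random vectors stand in for real 128-dimensional SIFT descriptors, and the function names, the per-image voting scheme, and the `max_sq_dist` threshold are all assumptions made for illustration.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(points, depth=0):
    """Build a kd-tree from a list of (descriptor, image_id) pairs.

    Each node is a tuple (point, split_axis, left_subtree, right_subtree).
    """
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Exact nearest-neighbour search; returns (squared distance, image_id)."""
    if node is None:
        return best
    (desc, image_id), axis, left, right = node
    d = sq_dist(desc, query)
    if best is None or d < best[0]:
        best = (d, image_id)
    diff = query[axis] - desc[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best

def rank_near_duplicates(tree, query_descriptors, max_sq_dist):
    """Assumed scoring rule: each query descriptor votes for the image that
    owns its nearest database descriptor, if that match is close enough."""
    votes = Counter()
    for q in query_descriptors:
        hit = nearest(tree, q)
        if hit is not None and hit[0] <= max_sq_dist:
            votes[hit[1]] += 1
    return votes.most_common()

# Toy data: three "documents" with 30 synthetic descriptors each (hypothetical).
rng = random.Random(0)
DIM = 8
database = {
    name: [[rng.random() for _ in range(DIM)] for _ in range(30)]
    for name in ("doc_a", "doc_b", "doc_c")
}
tree = build_kdtree([(d, name) for name, descs in database.items() for d in descs])

# The query is a slightly perturbed copy of doc_b, imitating a near-duplicate scan.
query = [[x + rng.gauss(0, 0.01) for x in d] for d in database["doc_b"]]
ranking = rank_near_duplicates(tree, query, max_sq_dist=0.05)
print(ranking)
```

In a realistic setting the descriptors would come from a SIFT implementation in an image-processing library, and an approximate nearest-neighbour search would likely replace the exact search shown here, since exact kd-tree search loses much of its advantage in 128 dimensions.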

Keywords

document image processing; feature extraction; image matching; natural language processing; tree data structures; SIFT descriptors; SIFT feature extraction; SIFT interest point matching; bag-of-words-based approach; false alarm rate; feature database; kd-tree storage; machine written images; near-duplicate document image detection; query image estimation; unconstrained Arabic hand; Feature extraction; Image databases; Image segmentation; Imaging; Optical character recognition software; Shape;

fLanguage

English

Publisher

IEEE

Conference_Titel

Pattern Recognition (ICPR), 2012 21st International Conference on

Conference_Location

Tsukuba

ISSN

1051-4651

Print_ISBN

978-1-4673-2216-4

Type

conf

Filename

6460143