Title :
Unsupervised Classification of Structurally Similar Document Images
Author :
Kumar, Jayant ; Doermann, David
Author_Institution :
Inst. of Adv. Comput. Studies, Univ. of Maryland, College Park, MD, USA
Abstract :
In this paper, we present a learning-based approach for computing structural similarities among document images for unsupervised exploration in large document collections. The approach is based on multiple levels of content and structure. At a local level, a bag of visual words built on SURF features provides an effective way of computing content similarity. The document is then recursively partitioned, and a histogram of codewords is computed for each partition. Structural similarity is computed using a random forest classifier trained with these histogram features. We experiment with three diverse datasets of document images that vary in size, degree of structural similarity, and type of document image. Our results demonstrate that the proposed approach provides an effective general framework for grouping structurally similar document images.
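The recursive partitioning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions: the function name `partition_histograms` is hypothetical, each level is assumed to split the page into an equal 2^l x 2^l grid (the abstract does not state the partitioning scheme), and each local feature is assumed to be already quantized to a codeword index. SURF extraction, codebook learning, and random forest training are left out.

```python
from typing import List, Sequence, Tuple

# (x, y, codeword) for one local feature. The codeword index is assumed to come
# from a separate quantization step (e.g. nearest cluster centre of a descriptor).
Feature = Tuple[float, float, int]


def partition_histograms(
    features: Sequence[Feature],
    width: float,
    height: float,
    vocab_size: int,
    max_level: int,
) -> List[float]:
    """Concatenate per-cell codeword histograms over levels 0..max_level.

    Level l splits the page into 2**l x 2**l equal cells (an assumed scheme).
    Each cell histogram is L1-normalised; empty cells stay all-zero.
    """
    vector: List[float] = []
    for level in range(max_level + 1):
        cells = 2 ** level
        counts = [[0] * vocab_size for _ in range(cells * cells)]
        for x, y, word in features:
            # Clamp so points on the right/bottom edge fall in the last cell.
            col = min(int(x / width * cells), cells - 1)
            row = min(int(y / height * cells), cells - 1)
            counts[row * cells + col][word] += 1
        for hist in counts:
            total = sum(hist)
            vector.extend(c / total if total else 0.0 for c in hist)
    return vector


# Toy page: four features, one per quadrant, with a 3-word vocabulary.
features = [(10, 10, 0), (90, 10, 1), (10, 90, 1), (90, 90, 2)]
vec = partition_histograms(features, width=100, height=100, vocab_size=3, max_level=1)
print(len(vec))
```

The resulting vector holds one histogram for the whole page followed by one per cell, so it encodes both what codewords occur and roughly where. A vector of this kind is what a classifier such as a random forest could take as input.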
Keywords :
decision trees; document image processing; image classification; unsupervised learning; SURF features; bag-of-visual words; codeword histogram; content similarity computation; histogram features; large document collections; learning-based approach; random forest classifier; recursive partitioning; structural similarity computation; structurally similar document images; unsupervised classification; unsupervised exploration; Accuracy; Feature extraction; NIST; Optical character recognition software; Radio frequency; Training; Vegetation; Clustering; Random forest; Structural similarity; Unsupervised classification
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.248