Title :
Document Content Extraction Using Automatically Discovered Features
Author :
Wang, Sui-Yu ; Baird, Henry S. ; An, Chang
Author_Institution :
Comput. Sci. & Eng. Dept., Lehigh Univ., Bethlehem, PA, USA
Abstract :
We report an automatic feature discovery method that achieves results comparable to a manually chosen, larger feature set on a document image content extraction problem: the location and segmentation of regions containing handwriting and machine-printed text in document images. The approach is a greedy forward selection algorithm that iteratively constructs one linear feature at a time. The algorithm finds error clusters in the current feature space, then projects one tight cluster into the null space of the feature mapping, where a new feature that helps to classify these errors can be discovered. We conducted experiments on 87 diverse test images. Four manually chosen linear features with an error rate of 16.2% were given to the algorithm; the algorithm then found an additional ten features; the composite 14 features achieve an error rate of 13.8%. This outperforms a feature set of size 14 chosen by principal component analysis (PCA), which has an error rate of 15.4%. It also nearly matches the error rate of 13.6% achieved by twice as many manually chosen features. Thus our algorithm appears to compete with both the widely used PCA method and tedious and expensive trial-and-error manual exploration.
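The null-space step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the new feature is chosen inside the null space, so the difference of class means used here is an assumption made for illustration, and the names `null_space_basis`, `discover_feature`, `W`, `X_cluster`, and `y_cluster` are hypothetical. The toy data is synthetic.

```python
import numpy as np

def null_space_basis(W, tol=1e-10):
    """Orthonormal basis (as columns) for the null space of W, shape (k, d)."""
    _, s, vt = np.linalg.svd(W, full_matrices=True)
    rank = int(np.sum(s > tol * s.max())) if s.size and s.max() > 0 else 0
    return vt[rank:].T  # shape (d, d - rank)

def discover_feature(W, X_cluster, y_cluster):
    """Propose one new linear feature from a cluster of misclassified samples.

    W         : (k, d) current linear feature mapping (rows are features)
    X_cluster : (n, d) samples belonging to one tight error cluster
    y_cluster : (n,)   true binary labels (0 or 1) of those samples
    Assumption: the new direction is the difference of class means,
    computed after projecting the cluster into the null space of W.
    """
    N = null_space_basis(W)
    Z = X_cluster @ N                      # coordinates inside the null space
    v = Z[y_cluster == 1].mean(axis=0) - Z[y_cluster == 0].mean(axis=0)
    norm = np.linalg.norm(v)
    if norm == 0:
        return None                        # the cluster offers no separating direction
    return N @ (v / norm)                  # unit vector in the original d-dim space

# Toy data: the current features see only coordinates 0 and 1,
# while the two classes actually differ along coordinate 4.
rng = np.random.default_rng(0)
d = 6
W = np.eye(d)[:2]
X_cluster = 0.01 * rng.standard_normal((40, d))
y_cluster = np.repeat([0, 1], 20)
X_cluster[y_cluster == 1, 4] += 1.0

w_new = discover_feature(W, X_cluster, y_cluster)
W_next = np.vstack([W, w_new])             # greedy step: append the new feature
```

Because `w_new` lies in the null space of `W`, it is orthogonal to every existing feature, so it can only add information that the current mapping discards. A full greedy loop would retrain the classifier on `W_next`, find the next error cluster, and repeat.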
Keywords :
document image processing; feature extraction; greedy algorithms; handwritten character recognition; image classification; image segmentation; iterative methods; principal component analysis; text analysis; PCA; automatic feature discovery; document image content extraction; feature mapping; greedy forward selection algorithm; handwriting-machine-printed text; iterative algorithm; region location; region segmentation; Clustering algorithms; Error analysis; Filters; Handwriting recognition; Image analysis; Iterative algorithms; Null space; Testing
Conference_Title :
Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on
Conference_Location :
Barcelona
Print_ISBN :
978-1-4244-4500-4
Electronic_ISBN :
1520-5363
DOI :
10.1109/ICDAR.2009.198