Title :
Robust text extraction in mixed-type binary documents
Author :
Nikolaidis, Athanasios ; Strouthopoulos, Charalambos
Author_Institution :
Dept. of Inf. & Commun., Technol. Educ. Inst. of Serres, Terma Magnisias
Abstract :
Text extraction from documents is an essential preprocessing stage for applications such as OCR (optical character recognition) and document image compression, storage and retrieval. Although many different techniques have been proposed to date, they usually assume that text orientation and size are fixed throughout the document image. Our work addresses the problem of varying orientation and size, which often arises in practice, either because of the nature of the original document or because of imposed distortions. Our algorithm first identifies marks using a suitable contour-following technique. PCA (principal component analysis) is then employed to determine the principal axes of each mark, and a nearest-neighbor technique is used to find the shortest distances between marks. A feature vector is formed from the mark dimensions and the distances between marks, and is then fed into a SOFM (self-organizing feature map) in order to divide the marks into homogeneous clusters. A set of fuzzy rules is formed using all cluster weights and variances. Finally, a fuzzy classification scheme identifies each mark as a character or a non-character. The technique was tested on a variety of mixed-type documents and proved to be fast and accurate.
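The PCA and nearest-neighbor steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`principal_axes`, `nearest_neighbor_distances`, `mark_features`), the use of mark centroids for the distance computation, and the three-element feature layout are all assumptions made for illustration, and the contour following, SOFM clustering and fuzzy classification stages are not shown.

```python
import numpy as np

def principal_axes(pixels):
    """Return centroid, principal axes and extents of one mark.

    pixels: (N, 2) array-like of (x, y) coordinates belonging to the mark.
    axes:   rows are unit vectors, major axis first.
    extents: size of the mark measured along each principal axis.
    """
    pts = np.asarray(pixels, dtype=float)
    centroid = pts.mean(axis=0)
    centered = pts - centroid
    cov = centered.T @ centered / max(len(pts) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so the major axis comes first
    axes = eigvecs[:, order].T
    proj = centered @ axes.T                 # coordinates in the principal frame
    extents = proj.max(axis=0) - proj.min(axis=0)
    return centroid, axes, extents

def nearest_neighbor_distances(centroids):
    """Distance from each centroid to its closest other centroid (brute force)."""
    c = np.asarray(centroids, dtype=float)
    diff = c[:, None, :] - c[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)           # ignore the distance of a mark to itself
    return dist.min(axis=1)

def mark_features(marks):
    """Hypothetical feature layout: [major extent, minor extent, nearest distance]."""
    stats = [principal_axes(m) for m in marks]
    nn = nearest_neighbor_distances([s[0] for s in stats])
    return np.array([[s[2][0], s[2][1], d] for s, d in zip(stats, nn)])
```

Because the extents are measured along each mark's own principal axes, they do not change when the mark is rotated, which is the property the abstract relies on for handling varying text orientation. Feature rows produced this way could then be passed to any clustering stage. The brute-force distance matrix needs at least two marks and is only practical for a moderate number of them.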
Keywords :
fuzzy set theory; principal component analysis; self-organising feature maps; text analysis; document image compression; fuzzy classification; fuzzy rules; homogeneous clusters; mixed-type binary documents; optical character recognition; principal component analysis; robust text extraction; self organizing feature map; text orientation; Character recognition; Fuzzy sets; Image coding; Image retrieval; Image storage; Optical character recognition software; Optical distortion; Principal component analysis; Robustness; Testing;
Conference_Title :
Multimedia Signal Processing, 2008 IEEE 10th Workshop on
Conference_Location :
Cairns, Qld
Print_ISBN :
978-1-4244-2294-4
Electronic_ISBN :
978-1-4244-2295-1
DOI :
10.1109/MMSP.2008.4665110