• DocumentCode
    3267976
  • Title

    Robust text extraction in mixed-type binary documents

  • Author

    Nikolaidis, Athanasios ; Strouthopoulos, Charalambos

  • Author_Institution
    Dept. of Informatics & Communications, Technological Educational Institute of Serres, Terma Magnisias
  • fYear
    2008
  • fDate
    8-10 Oct. 2008
  • Firstpage
    393
  • Lastpage
    398
  • Abstract
    Text extraction from documents is an essential preprocessing stage for applications such as OCR (optical character recognition), document image compression, storage and retrieval. Although many techniques have been proposed to date, they usually assume that text orientation and size are fixed throughout the document image. Our work addresses the problem of varying orientation and size, which often arises in practice, either because of the nature of the original document or because of imposed distortions. Our algorithm first identifies marks using a suitable contour-following technique. PCA (principal component analysis) is then employed to determine the principal axes of each mark, and a nearest-neighbor technique is used to find the shortest distances between marks. A feature vector is formed from the mark dimensions and the distances between marks, and is then fed into a SOFM (self-organizing feature map) that divides the marks into homogeneous clusters. A set of fuzzy rules is formed using all cluster weights and variances. Finally, a fuzzy classification scheme identifies each mark as a character or a non-character. The technique was tested on a variety of mixed-type documents and proved to be quite fast and accurate.
  • Keywords
    fuzzy set theory; principal component analysis; self-organising feature maps; text analysis; document image compression; fuzzy classification; fuzzy rules; homogeneous clusters; mixed-type binary documents; optical character recognition; robust text extraction; self organizing feature map; text orientation; Character recognition; Fuzzy sets; Image coding; Image retrieval; Image storage; Optical character recognition software; Optical distortion; Principal component analysis; Robustness; Testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Multimedia Signal Processing, 2008 IEEE 10th Workshop on
  • Conference_Location
    Cairns, Qld
  • Print_ISBN
    978-1-4244-2294-4
  • Electronic_ISBN
    978-1-4244-2295-1
  • Type

    conf

  • DOI
    10.1109/MMSP.2008.4665110
  • Filename
    4665110
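
The abstract's PCA and nearest-neighbor steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `principal_axes` and `nearest_distances`, the representation of a mark as a list of (x, y) pixel coordinates, and the use of centroid-to-centroid distances are all assumptions made for illustration. The sketch computes the orientation and the extents of one mark along its principal axes from the 2x2 covariance matrix, and the shortest distance from each mark centroid to any other centroid. The contour-following, SOFM clustering, and fuzzy classification stages are not covered.

```python
import math


def principal_axes(points):
    """Return (angle, length, width) for one mark.

    points: list of (x, y) pixel coordinates (assumed representation).
    angle:  orientation of the major principal axis, in radians.
    length: extent of the mark along the major axis.
    width:  extent of the mark along the minor axis.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n

    # Entries of the 2x2 covariance matrix of the pixel coordinates.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n

    # Closed-form direction of the leading eigenvector of a 2x2
    # symmetric matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(angle), math.sin(angle)

    # Project the centred pixels onto the major and minor axes.
    major = [(x - mx) * c + (y - my) * s for x, y in points]
    minor = [-(x - mx) * s + (y - my) * c for x, y in points]
    return angle, max(major) - min(major), max(minor) - min(minor)


def nearest_distances(centroids):
    """For each centroid, return the distance to its nearest other centroid.

    Brute-force O(n^2) search; needs at least two centroids.
    """
    return [
        min(math.dist(p, q) for j, q in enumerate(centroids) if j != i)
        for i, p in enumerate(centroids)
    ]


# A thin diagonal stroke: the major axis lies at 45 degrees.
stroke = [(0, 0), (1, 1), (2, 2), (3, 3)]
angle, length, width = principal_axes(stroke)

# Three hypothetical mark centroids.
dists = nearest_distances([(0, 0), (3, 4), (10, 0)])
```

Because the extents are measured along each mark's own axes rather than along the image axes, the resulting dimensions do not depend on how the text is rotated, which is the property the abstract relies on for documents with varying orientation.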