• DocumentCode
    2021459
  • Title

    An Efficient Word Segmentation Technique for Historical and Degraded Machine-Printed Documents

  • Author

    Makridis, M. ; Nikolaou, N. ; Gatos, B.

  • Author_Institution
    Democritus Univ. of Thrace, Xanthi
  • Volume
    1
  • fYear
    2007
  • fDate
    23-26 Sept. 2007
  • Firstpage
    178
  • Lastpage
    182
  • Abstract
    Word segmentation is a crucial step for segmentation-free document analysis systems, where it is used to create an index based on word matching. In this paper, we propose a novel methodology for word segmentation in historical and degraded machine-printed documents. The proposed technique addresses problems such as text of varying size, text and non-text areas lying very close together, and non-straight or warped text lines. It is based on: (i) a dynamic run length smoothing algorithm that helps group homogeneous text regions together, (ii) removal of noise and punctuation marks, together with obstacle detection, to facilitate the segmentation process, and (iii) a draft text line estimation procedure that guides the final word segmentation result. In tests on numerous historical and degraded machine-printed documents, our methodology performed better than current state-of-the-art word segmentation techniques for such documents.
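For readers unfamiliar with run length smoothing, the following is a minimal sketch of the classical fixed-threshold horizontal variant, not the dynamic version proposed in the paper, whose details the abstract does not give. It assumes a binary image stored as lists of 0 (background) and 1 (foreground), and it fills only gaps bounded by foreground on both sides. The names `rlsa_row`, `rlsa_horizontal`, and `threshold` are illustrative.

```python
def rlsa_row(row, threshold):
    """Fill background gaps of length <= threshold that lie between
    two foreground pixels in a single binary row (1 = foreground)."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel seen
    for i, px in enumerate(row):
        if px == 1:
            if last_fg is not None:
                gap = i - last_fg - 1
                if 0 < gap <= threshold:
                    # Short gap: merge the two foreground runs.
                    for j in range(last_fg + 1, i):
                        out[j] = 1
            last_fg = i
    return out


def rlsa_horizontal(image, threshold):
    """Apply the row-wise smoothing to every row of a binary image."""
    return [rlsa_row(row, threshold) for row in image]


if __name__ == "__main__":
    page = [
        [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
        [0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
    ]
    for smoothed in rlsa_horizontal(page, threshold=2):
        print(smoothed)
```

A small threshold merges characters into words, while a larger one merges words into lines. A single fixed value cannot suit text of different sizes on the same page, which is presumably the limitation a dynamic variant is meant to address.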
  • Keywords
    document image processing; image matching; image segmentation; indexing; text analysis; machine-printed documents; obstacle detection; segmentation-free document analysis systems; text line estimation; text region; word matching; word segmentation technique; Algorithm design and analysis; Computational intelligence; Degradation; Image segmentation; Informatics; Laboratories; Pixel; Radiofrequency interference; Smoothing methods; Text analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
  • Conference_Location
    Parana
  • ISSN
    1520-5363
  • Print_ISBN
    978-0-7695-2822-9
  • Type

    conf

  • DOI
    10.1109/ICDAR.2007.4378699
  • Filename
    4378699