• DocumentCode
    430987
  • Title
    Document processing methods for Telugu and other South East Asian scripts
  • Author
    Negi, Atul ; Sowri, V.S.R. ; Rao, K. Mohan
  • Author_Institution
    Dept. of CIS, Hyderabad Univ., India
  • Volume
    B
  • fYear
    2004
  • fDate
    21-24 Nov. 2004
  • Firstpage
    132
  • Abstract
    In several South East Asian scripts, a single character consists of two or more connected components. In these scripts, the complex arrangement of connected components leads to problems such as touching characters and difficulty in identifying word and text line boundaries. In the present work we propose a method to extract text lines by clustering connected components based on their spatial properties. Components that have abnormal properties and are not recognized by an OCR are sent for character segmentation. For character segmentation we describe the "Drop Fall" and "Whitestream" methods. The methods presented here are applicable to any language (script) that requires connected component based processing.
  • Keywords
    document image processing; image segmentation; natural languages; South East Asian script; Telugu; Whitestream method; character segmentation; connected component based processing; document processing method; drop fall; spatial property; text line boundary; Optical character recognition software
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    TENCON 2004. 2004 IEEE Region 10 Conference
  • Print_ISBN
    0-7803-8560-8
  • Type
    conf
  • DOI
    10.1109/TENCON.2004.1414549
  • Filename
    1414549
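
The abstract describes extracting text lines by clustering connected components on their spatial properties. The record does not give the authors' actual procedure, so the following is only a minimal sketch of one plausible reading: components are represented by bounding boxes and grouped into a line when they overlap vertically. The function name `group_into_lines`, the `(x, y, width, height)` box format, and the `min_overlap` threshold are all assumptions made for illustration, not part of the paper.

```python
from typing import List, Tuple

# Hypothetical box format: (x, y, width, height), with y growing downward.
Box = Tuple[int, int, int, int]


def group_into_lines(boxes: List[Box], min_overlap: float = 0.5) -> List[List[Box]]:
    """Greedily cluster component bounding boxes into text lines.

    A box joins the existing line whose vertical band it overlaps most,
    provided the overlap covers at least `min_overlap` of the smaller of
    the two heights (box height vs. band height). Otherwise it starts a
    new line. This is an illustrative heuristic, not the paper's method.
    """
    lines = []  # each entry is [band_top, band_bottom, member_boxes]
    for box in sorted(boxes, key=lambda b: b[1]):  # process top to bottom
        _, y, _, h = box
        top, bottom = y, y + h
        best, best_frac = None, 0.0
        for line in lines:
            denom = min(h, line[1] - line[0])
            if denom <= 0:
                continue  # skip degenerate (zero-height) cases
            overlap = min(bottom, line[1]) - max(top, line[0])
            frac = overlap / denom
            if frac >= min_overlap and frac > best_frac:
                best, best_frac = line, frac
        if best is None:
            lines.append([top, bottom, [box]])
        else:
            # Grow the line's band to cover the newly added component.
            best[0] = min(best[0], top)
            best[1] = max(best[1], bottom)
            best[2].append(box)
    # Return lines top to bottom, with each line's boxes ordered left to right.
    ordered = sorted(lines, key=lambda line: line[0])
    return [sorted(members, key=lambda b: b[0]) for _, _, members in ordered]


if __name__ == "__main__":
    sample = [
        (0, 10, 10, 20), (12, 12, 10, 18), (5, 8, 4, 4),  # first line, with a small mark
        (0, 50, 10, 20), (15, 52, 8, 16),                 # second line
    ]
    for i, line in enumerate(group_into_lines(sample), start=1):
        print(i, line)
```

Normalizing the overlap by the smaller height lets a small component, such as a vowel sign that only partly overlaps its base glyph, still attach to the base's line. A mark that is fully detached vertically would start its own line in this sketch; handling such components (for example by proximity to the nearest line) is the kind of case the abstract flags as needing further processing.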