• DocumentCode
    2172756
  • Title
    Automatic separation of words in multi-lingual multi-script Indian documents
  • Author
    Pal, U.; Chaudhuri, B.B.

  • Author_Institution
    Comput. Vision & Pattern Recognition Unit, Indian Stat. Inst., Calcutta, India
  • Volume
    2
  • fYear
    1997
  • fDate
    18-20 Aug 1997
  • Firstpage
    576
  • Abstract
    In a multi-lingual country like India, a document may contain more than one script. For such a document, the different scripts must be separated before each is fed to the OCR system for that script. This paper describes an automatic word segmentation approach that can separate Roman, Bangla and Devnagari scripts present in a single document. The approach has a tree structure: first, Roman script words are separated using the 'headline' feature, since the headline is common in Bangla and Devnagari but absent in Roman. Next, Bangla and Devnagari words are separated using finer characteristics of the character sets, while recognition of individual characters is avoided. At present, the system has an overall accuracy of 96.09%.
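    The first level of the tree described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the image representation, the function names `longest_run`, `has_headline` and `classify_word`, the 0.7 width threshold and the restriction to the upper half of the word are all assumptions made for illustration. The second level (Bangla versus Devnagari) is left open because the abstract does not specify the features used.

```python
from typing import List

Image = List[List[int]]  # hypothetical format: 1 = ink pixel, 0 = background


def longest_run(row: List[int]) -> int:
    """Length of the longest unbroken run of ink pixels in one row."""
    best = current = 0
    for pixel in row:
        current = current + 1 if pixel else 0
        best = max(best, current)
    return best


def has_headline(word: Image, min_fraction: float = 0.7) -> bool:
    """Guess whether a word image carries a headline.

    Assumption (not from the paper): a headline appears as a row in the
    upper half of the word whose longest ink run spans at least
    `min_fraction` of the word width.
    """
    if not word or not word[0]:
        return False
    width = len(word[0])
    upper_half = word[: max(1, len(word) // 2)]
    return any(longest_run(row) >= min_fraction * width for row in upper_half)


def classify_word(word: Image) -> str:
    """First level of the tree: Roman versus headline-bearing scripts."""
    if not has_headline(word):
        return "Roman"
    # Second level (Bangla versus Devnagari) depends on finer character
    # features that the abstract does not spell out, so it is left open.
    return "Bangla or Devnagari"
```

    Working on the longest ink run rather than the raw pixel count per row keeps a row of many disconnected strokes from being mistaken for a continuous headline.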
  • Keywords
    image recognition; optical character recognition; OCRs; automatic separation of words; automatic word segmentation; multilingual multiscript Indian documents; tree structure; Character generation; Character recognition; Cleaning; Computer vision; Continents; Natural languages; Optical character recognition software; Pattern recognition; Shape; Tree data structures
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997
  • Conference_Location
    Ulm
  • Print_ISBN
    0-8186-7898-4
  • Type
    conf
  • DOI
    10.1109/ICDAR.1997.620567
  • Filename
    620567