• DocumentCode
    2434411
  • Title

    Text line script identification for a tri-lingual document

  • Author

    Aithal, Prakash K. ; Rajesh, G. ; Acharya, Dinesh U. ; Subbareddy, N. V. ; Krishnamoorthi, M.

  • Author_Institution
    Manipal Inst. of Technol., Manipal, India
  • fYear
    2010
  • fDate
    29-31 July 2010
  • Firstpage
    1
  • Lastpage
    3
  • Abstract
    India is a multilingual, multi-script country whose states follow a three-language formula, so a document may be printed in English, Hindi, and the official language of the state. For example, in Karnataka, a state in India, a document may contain text lines in English, Hindi, and Kannada scripts. For Optical Character Recognition (OCR) of such a multilingual document, the script of each text line must be identified before the line is fed to the OCR engine for that script. This paper presents a simple and efficient technique for identifying the script of Kannada, Hindi, and English text lines in a printed document. The proposed system uses the horizontal projection profile of each text line to distinguish the three scripts, and features are extracted from that profile. The knowledge base of the system is built from 15 document images containing about 450 text lines. For a new text line, the necessary features are extracted from its horizontal projection profile and compared with the stored knowledge base to classify the script. The proposed system is tested on 20 document images containing about 200 text lines of each script, and it achieves an overall classification rate of 99.83%.
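The abstract does not list the specific features or the matching rule, so the following is only a minimal sketch of the general idea: compute the horizontal projection profile of a binarized text line, derive features from it, and pick the closest entry in a stored knowledge base. The two features (peak ink coverage and relative peak position), the nearest-neighbour comparison, the function names, and the knowledge-base labels and values are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def horizontal_projection_profile(line_img):
    """Count foreground (ink) pixels in each row of a binary text-line image."""
    img = np.asarray(line_img)
    return (img > 0).sum(axis=1)

def profile_features(line_img):
    """Hypothetical features: ink coverage of the peak row and its relative position."""
    img = np.asarray(line_img)
    profile = horizontal_projection_profile(img)
    peak_row = int(np.argmax(profile))
    peak_ratio = profile[peak_row] / img.shape[1]   # fraction of the row that is ink
    peak_pos = peak_row / max(len(profile) - 1, 1)  # 0.0 = top row, 1.0 = bottom row
    return np.array([peak_ratio, peak_pos])

def classify(line_img, knowledge_base):
    """Return the label whose stored feature vector is nearest (Euclidean distance)."""
    feats = profile_features(line_img)
    return min(knowledge_base,
               key=lambda label: np.linalg.norm(feats - knowledge_base[label]))

# Synthetic 10x20 "text line": one full-width stroke near the top, one short stroke below.
line = np.zeros((10, 20), dtype=int)
line[2, :] = 1
line[5, :5] = 1

# Illustrative knowledge base with made-up feature values and labels.
kb = {
    "headline_script": np.array([1.0, 0.2]),
    "no_headline_script": np.array([0.4, 0.5]),
}
label = classify(line, kb)
```

A full-width stroke near the top of a line produces a sharp peak in the profile, which is the kind of cue a profile-based method can exploit. A real system would build the knowledge base from labelled training lines rather than hand-written values.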
  • Keywords
    document image processing; feature extraction; knowledge based systems; natural language processing; optical character recognition; text analysis; English language; Hindi language; Karnataka language; document images; feature extraction; knowledge base system; multilingual document; multilingual multiscript country; optical character recognition; text line script identification; tri-lingual document; Feature extraction; Histograms; Image segmentation; Knowledge based systems; Optical character recognition software; Shape; Text analysis;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computing Communication and Networking Technologies (ICCCNT), 2010 International Conference on
  • Conference_Location
    Karur
  • Print_ISBN
    978-1-4244-6591-0
  • Type

    conf

  • DOI
    10.1109/ICCCNT.2010.5592562
  • Filename
    5592562