Title :
Towards a Robust OCR System for Indic Scripts
Author :
Krishnan, Prasad ; Sankaran, Naveen ; Singh, A.K. ; Jawahar, C.V.
Author_Institution :
Center for Visual Information Technology, IIIT Hyderabad, Hyderabad, India
Abstract :
Current Optical Character Recognition (OCR) systems for Indic scripts are not robust enough to recognize arbitrary collections of printed documents. Reasons for this limitation include the lack of resources (e.g., too few examples with natural variations and little documentation of the possible font/style variations) and an architecture that necessitates hard segmentation of word images followed by isolated symbol recognition. Variations among scripts, latent symbol-to-UNICODE conversion rules, non-standard fonts/styles and large degradations are some of the major reasons for the unavailability of robust solutions. In this paper, we propose a web-based OCR system which (i) follows a unified architecture for seven Indian languages, (ii) is robust against popular degradations, (iii) follows a segmentation-free approach, (iv) addresses the UNICODE re-ordering issues, and (v) can enable continuous learning with user inputs and feedback. Our system is designed to aid continuous learning while remaining usable, i.e., we capture user inputs (say, example images) to further improve the OCRs. We use the popular BLSTM-based transcription scheme to achieve our target. This also enables incremental training and refinement in a seamless manner. We report superior accuracy rates in comparison with the available OCRs for the seven Indian languages.
Keywords :
document image processing; natural languages; optical character recognition; BLSTM based transcription scheme; Indian languages; Indic scripts; UNICODE conversion rules; UNICODE re-ordering; Web based OCR system; bidirectional long-short term memory network; continuous learning; incremental refinement; incremental training; latent symbol; nonstandard fonts; nonstandard styles; Character recognition; Degradation; Feature extraction; Optical character recognition software; Robustness; Text recognition; Training; Indic Scripts; Neural Networks; Optical Character Recognition
Conference_Title :
2014 11th IAPR International Workshop on Document Analysis Systems (DAS)
Conference_Location :
Tours
Print_ISBN :
978-1-4799-3243-6
DOI :
10.1109/DAS.2014.74