DocumentCode :
153348
Title :
Towards a Robust OCR System for Indic Scripts
Author :
Krishnan, Prasad ; Sankaran, Naveen ; Singh, A.K. ; Jawahar, C.V.
Author_Institution :
Center for Visual Information Technology, IIIT Hyderabad, Hyderabad, India
fYear :
2014
fDate :
7-10 April 2014
Firstpage :
141
Lastpage :
145
Abstract :
Current Optical Character Recognition (OCR) systems for Indic scripts are not robust enough to recognize arbitrary collections of printed documents. Reasons for this limitation include the lack of resources (e.g., too few examples with natural variations, little documentation of the possible font/style variations) and an architecture that requires hard segmentation of word images followed by isolated symbol recognition. Variations among scripts, latent symbol-to-UNICODE conversion rules, non-standard fonts/styles, and large degradations are some of the major reasons for the unavailability of robust solutions. In this paper, we propose a web-based OCR system which (i) follows a unified architecture for seven Indian languages, (ii) is robust against common degradations, (iii) follows a segmentation-free approach, (iv) addresses the UNICODE re-ordering issues, and (v) can enable continuous learning from user input and feedback. Our system is designed to aid continuous learning while remaining usable, i.e., we capture user inputs (say, example images) to further improve the OCRs. We use the popular BLSTM-based transcription scheme to achieve our target. This also enables incremental training and refinement in a seamless manner. We report superior accuracy rates in comparison with the available OCRs for the seven Indian languages.
Keywords :
document image processing; natural languages; optical character recognition; BLSTM based transcription scheme; Indian languages; Indic scripts; UNICODE conversion rules; UNICODE re-ordering; Web based OCR system; bidirectional long-short term memory network; continuous learning; incremental refinement; incremental training; latent symbol; nonstandard fonts; nonstandard styles; optical character recognition; Character recognition; Degradation; Feature extraction; Optical character recognition software; Robustness; Text recognition; Training; Indic Scripts; Neural Networks; Optical Character Recognition;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on
Conference_Location :
Tours
Print_ISBN :
978-1-4799-3243-6
Type :
conf
DOI :
10.1109/DAS.2014.74
Filename :
6830986