Title :
Indic script identification from handwritten document images — An unconstrained block-level approach
Author :
Md Obaidullah, Sk ; Das, Nibaran ; Halder, Chayan ; Roy, Kaushik
Author_Institution :
Dept. of Comput. Sc. & Eng., Aliah Univ., Kolkata, India
Abstract :
In a multi-script country like India, identifying the script of a document image is an essential step before choosing an appropriate script-specific OCR. The problem becomes more complex and challenging in the case of handwritten script identification (HSI). This paper proposes an automatic HSI technique for document images of six popular Indic scripts, namely Bangla, Devanagari, Malayalam, Oriya, Roman and Urdu. A block-level approach is followed: initially a 34-dimensional feature vector is constructed by applying transform-based (BRT, BDCT, BFFT and BDT), textural and statistical techniques. A GAS (Greedy Attribute Selection) method then selects 20 attributes for the learning process. A total of 600 unconstrained document image blocks, each of size 512×512, are prepared with an equal distribution of each script type. The whole dataset is divided in a 2:1 ratio for training and testing. Extensive experimentation is carried out for six-script, tetra-script, tri-script and bi-script combinations. The experimental results show promising and comparable performance.
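The data split and attribute reduction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how GAS scores attributes or how the 2:1 split was drawn, so the shuffle-based split, the generic greedy forward selection, and the names split_two_to_one, greedy_select and toy_score are all assumptions made for illustration.

```python
import random


def split_two_to_one(n_items, seed=0):
    """Shuffle indices 0..n_items-1 and split them 2:1 into (train, test)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = (2 * n_items) // 3
    return indices[:n_train], indices[n_train:]


def greedy_select(n_attributes, n_keep, score):
    """Generic greedy forward selection (assumed stand-in for GAS).

    Starting from an empty set, repeatedly add the attribute whose
    inclusion gives the highest score(subset) until n_keep are chosen.
    """
    selected = []
    remaining = list(range(n_attributes))
    while len(selected) < n_keep and remaining:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_score(subset):
    """Hypothetical merit function: lower-numbered attributes are worth more.

    A real evaluator would measure classification accuracy or a
    correlation-based merit on the training blocks.
    """
    return sum(1.0 / (1 + a) for a in subset)


# 600 blocks (100 per script) split 2:1 gives 400 training and 200 test blocks.
train_idx, test_idx = split_two_to_one(600)

# Reduce the 34-dimensional feature vector to 20 attributes.
kept = greedy_select(34, 20, toy_score)
```

With the figures from the abstract, the split yields 400 training and 200 test blocks, and the selection keeps 20 of the 34 attribute indices. Replacing toy_score with an evaluator computed on the training blocks only would keep the test blocks out of the selection step.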
Keywords :
document image processing; handwritten character recognition; natural language processing; optical character recognition; transforms; BDCT; BFFT; BRT; Bangla; Devanagari; GAS; India; Indic script identification; Malayalam; Oriya; Roman; Urdu; automatic HSI technique; bi-scripts combinations; feature vector; greedy attribute selection method; handwritten document images; handwritten script identification; learning process; multiscript country; script specific OCR; six-scripts combinations; statistical techniques; tetra-scripts combinations; textural techniques; transform; tri-scripts combinations; unconstrained block-level approach; unconstrained document image blocks; Discrete cosine transforms; Entropy; Image segmentation; Optical character recognition software; Radon; Standards; Block-level Transform; Classification; Handwritten Script Identification; Statistical Feature
Conference_Title :
Recent Trends in Information Systems (ReTIS), 2015 IEEE 2nd International Conference on
Conference_Location :
Kolkata
DOI :
10.1109/ReTIS.2015.7232880