Author_Institution :
Dept. of Comput. Sci., Institute of Inf. Technol., Abbottabad, Pakistan
Abstract :
Urdu is gaining popularity as a language because many people around the world, e.g., in India, Pakistan, and Bangladesh, speak and understand it. Like other languages, e.g., Latin, Chinese, Japanese, Persian, and Arabic, Urdu is under consideration by the research community for developing Optical Character Recognition (OCR) systems. Like Arabic, Urdu script comes in a number of fonts, e.g., Nasakh, Nastalique, and Noori. The presented work uses an analytical approach to recognize machine-written Urdu Nastalique script. The methodology includes three major modules: (1) Preprocessing, which applies binarization and filtering to the input image; (2) Main Process, which comprises the sub-phases Line Segmentation, Baseline Detection, Thinning, Segmentation, Smoothing, and Dot Recognition on the preprocessed image; and (3) Recognition, which normalizes the processed image to a standard size of 50×32 and converts it into a row vector of 1600 elements in row-major order. Finally, a Feed Forward Neural Network classifies the processed input image as one of 271 ligature classes. The neural network has 1600 neurons in the input layer, 60 hidden neurons, and 271 output neurons. The methodology is evaluated on 10 images, 69 lines, and 1292 ligatures. The overall recognition rate is 87%.
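The Recognition module described above (normalize to 50×32, flatten in row-major order to 1600 values, then pass through a 1600-60-271 feed-forward network) can be illustrated with a minimal sketch. Only the sizes come from the abstract. The nearest-neighbour resize, the tanh hidden activation, the random weight initialization, and all function names are assumptions made for illustration; training is not shown, so the untrained network below produces arbitrary class indices.

```python
import math
import random

ROWS, COLS = 50, 32        # normalized ligature size stated in the abstract
N_INPUT = ROWS * COLS      # 1600 input neurons
N_HIDDEN = 60              # hidden neurons
N_OUTPUT = 271             # one output neuron per ligature class


def resize_nearest(image, rows=ROWS, cols=COLS):
    """Resize a 2-D list of pixels by nearest-neighbour sampling (assumed method)."""
    src_rows, src_cols = len(image), len(image[0])
    return [
        [image[r * src_rows // rows][c * src_cols // cols] for c in range(cols)]
        for r in range(rows)
    ]


def flatten_row_major(image):
    """Concatenate the rows of the image, top to bottom, into one vector."""
    return [pixel for row in image for pixel in row]


def init_layer(n_in, n_out, rng):
    """Create small random weights and zero biases (assumed initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer):
    """Compute the weighted sum plus bias for every neuron in the layer."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def forward(vector, hidden, output):
    """Run one forward pass; tanh in the hidden layer is an assumption."""
    activations = [math.tanh(z) for z in dense(vector, hidden)]
    return dense(activations, output)


def classify(image, hidden, output):
    """Return the index of the highest-scoring ligature class for a binary image."""
    vector = flatten_row_major(resize_nearest(image))
    scores = forward(vector, hidden, output)
    return max(range(len(scores)), key=scores.__getitem__)
```

To try it, build the two layers with `init_layer(N_INPUT, N_HIDDEN, rng)` and `init_layer(N_HIDDEN, N_OUTPUT, rng)` using a seeded `random.Random`, then call `classify` on any 2-D list of 0/1 pixels. In row-major order, element `32 * r + c` of the vector corresponds to pixel `(r, c)` of the normalized image.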
Keywords :
"Optical character recognition software","Character recognition","Image segmentation","Feature extraction","Shape","Optical imaging","Image recognition"