Title :
Suppression of non-text components in handwritten document images
Author :
Sarkar, Ram ; Moulik, Sanjay ; Das, Nibaran ; Basu, Subhadip ; Nasipuri, Mita ; Kundu, Mahantapas
Author_Institution :
Dept. of Comput. Sci. & Eng., Jadavpur Univ., Kolkata, India
Abstract :
Document layout analysis is a pre-processing step in converting handwritten or printed documents into electronic form through an Optical Character Recognition (OCR) system. Handwritten documents are usually unstructured, i.e., they do not have a specific layout, and most of them may contain non-text regions such as graphs, tables, and diagrams. Such documents therefore cannot be given directly as input to an OCR system without first suppressing the non-text regions. The traditional Run Length Smoothing Algorithm (RLSA) does not produce good results on handwritten document pages, since their text components have a lower pixel density than those in printed text. In the present work, a modified RLSA, called the Spiral Run Length Smearing Algorithm (SRLSA), is applied to suppress the non-text components and retain the text ones in handwritten document images. The components in the document pages are then classified into text and non-text groups using a Support Vector Machine (SVM) classifier. The method achieves a success rate of 83.3% on a dataset of 3000 components.
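For orientation, the traditional RLSA that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the classical run-length smoothing idea only, not the authors' SRLSA, whose spiral scanning scheme is not described in this record. The function names, the 1 = ink / 0 = background convention, the choice to fill only gaps bounded by ink on both sides, and the thresholds are all assumptions made for the example.

```python
def rlsa_row(row, threshold):
    """Fill background runs of length <= threshold that lie between two ink pixels.

    Assumption: 1 marks ink (foreground) and 0 marks background.
    """
    out = list(row)
    last_fg = None  # index of the most recent ink pixel seen so far
    for i, px in enumerate(row):
        if px == 1:
            if last_fg is not None:
                gap = i - last_fg - 1
                if 0 < gap <= threshold:
                    for j in range(last_fg + 1, i):
                        out[j] = 1
            last_fg = i
    return out


def rlsa_horizontal(image, threshold):
    """Apply the row smearing to every row of a binary image (list of lists)."""
    return [rlsa_row(r, threshold) for r in image]


def rlsa_vertical(image, threshold):
    """Apply the same smearing along columns by transposing, smearing, transposing back."""
    columns = [list(c) for c in zip(*image)]
    smeared = [rlsa_row(c, threshold) for c in columns]
    return [list(r) for r in zip(*smeared)]


def rlsa(image, h_threshold, v_threshold):
    """Combine horizontal and vertical smearing with a pixel-wise AND."""
    h = rlsa_horizontal(image, h_threshold)
    v = rlsa_vertical(image, v_threshold)
    return [[a & b for a, b in zip(rh, rv)] for rh, rv in zip(h, v)]
```

With a threshold of 2, `rlsa_row([1, 0, 0, 1, 0, 0, 0, 0, 1], 2)` closes the two-pixel gap but leaves the four-pixel gap open. Sparse handwriting tends to have wider and more irregular gaps than print, which is consistent with the abstract's remark that plain RLSA performs poorly on handwritten pages.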
Keywords :
document image processing; electronic publishing; image classification; optical character recognition; support vector machines; text analysis; OCR; RLSA; SRLSA; SVM classifier; document layout analysis; document page classification; electronic form; handwritten document images; nontext component suppression; nontext region suppression; pixel density; printed documents; run length smoothing algorithm; spiral run length smearing algorithm; support vector machine; text components; Feature extraction; Graphics; Image segmentation; Information processing; Optical character recognition software; Training; Handwritten OCR; Handwritten document image; Non-text suppression; Spiral RLSA
Conference_Title :
Image Information Processing (ICIIP), 2011 International Conference on
Conference_Location :
Himachal Pradesh
Print_ISBN :
978-1-61284-859-4
DOI :
10.1109/ICIIP.2011.6108921