Title :
An approach for printed document labeling
Author :
Adak, Chandranath
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Kalyani, Kalyani, India
Abstract :
A document image contains text and non-text regions; it may be printed, handwritten, or a hybrid of both. In this paper we deal with printed documents, in which the textual regions consist of printed characters and the non-text regions are mainly photo images. We propose a model that labels the different components of a printed document image, i.e., it identifies headings, subheadings, captions, articles, and photos. Our method begins with a preprocessing stage in which fuzzy c-means clustering segments the document image into printed (object) region and background. The Hough transform is then used to find white-line dividers in the object region, and grid structure examination is used to extract the non-text portion. After that, we use a horizontal histogram to find text lines and then label the different components. Our method gives promising results on printed documents in different scripts.
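The horizontal-histogram step named in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption: the page is taken to be already binarized (1 = object pixel, 0 = background, as the clustering stage would produce) with non-text regions removed, and the function names `horizontal_histogram` and `find_text_lines` and the `min_ink` threshold are hypothetical, not the author's.

```python
from typing import List, Tuple


def horizontal_histogram(binary: List[List[int]]) -> List[int]:
    """Count object (ink) pixels in each row of a binary image."""
    return [sum(row) for row in binary]


def find_text_lines(binary: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row indices, inclusive, for each run of
    consecutive rows whose ink count is at least min_ink.

    min_ink is an illustrative threshold, not a value from the paper.
    """
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the text line currently being scanned
    for y, count in enumerate(horizontal_histogram(binary)):
        if count >= min_ink and start is None:
            start = y                      # a text line begins
        elif count < min_ink and start is not None:
            lines.append((start, y - 1))   # a blank row ends the line
            start = None
    if start is not None:                  # the line runs to the last row
        lines.append((start, len(binary) - 1))
    return lines


# Toy 6x6 "page": two text lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(find_text_lines(page))  # [(1, 2), (4, 4)]
```

Line heights obtained this way (bottom minus top plus one) are one plausible cue for separating headings from body text, though the abstract does not state which features the labeling stage actually uses.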
Keywords :
Hough transforms; document image processing; fuzzy set theory; pattern clustering; text analysis; Hough transformation; document image; fuzzy c-means clustering; grid structure examination; horizontal histogram; nontext portion; object region; preprocessing stage; printed characters; printed document image; printed document labeling; textual region; white-line dividers; Histograms; Image analysis; Image segmentation; Labeling; Optical character recognition software; Text analysis; Transforms; Document Image Analysis; Document Labeling; Fuzzy C-Means Clustering; Hough Transform; Optical Character Recognition;
Conference_Title :
Automation, Control, Energy and Systems (ACES), 2014 First International Conference on
Conference_Location :
Hooghly
Print_ISBN :
978-1-4799-3893-3
DOI :
10.1109/ACES.2014.6808032