DocumentCode
3487967
Title
A Document Image Segmentation System Using Analysis of Connected Components
Author
Zirari, F. ; Ennaji, Abdellatif ; Nicolas, S. ; Mammass, D.
Author_Institution
LITIS Lab., Univ. of Rouen, Rouen, France
fYear
2013
fDate
25-28 Aug. 2013
Firstpage
753
Lastpage
757
Abstract
Page segmentation into text and non-text elements is an essential preprocessing step before optical character recognition (OCR). When segmentation is poor, an OCR classification engine produces garbage characters because non-text elements are present. This paper presents a method for separating the textual and non-textual components in document images using graph-based modeling and structural analysis. The method is fast and efficient, and it adequately separates the graphical and textual parts of a document. We have evaluated our method on two well-known subsets: the UW-III dataset and the ICDAR 2009 page segmentation competition dataset. Comparisons with two state-of-the-art methods show that our method performs better on this task.
Keywords
document image processing; graph theory; image segmentation; optical character recognition; ICDAR 2009 page segmentation competition dataset; OCR classification engine; UW-III dataset; connected components; document image segmentation system; graph-based modeling; non textual components; optical character recognition operation; structural analysis; textual components; Accuracy; Educational institutions; Histograms; Image edge detection; Image segmentation; Text categorization; connected components; document image; graph; structural analysis; text/non-text separating;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location
Washington, DC
ISSN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2013.154
Filename
6628719