DocumentCode :
1634001
Title :
Text Lines and Snippets Extraction for 19th Century Handwriting Documents Layout Analysis
Author :
Malleron, Vincent ; Eglin, Véronique ; Emptoz, Hubert ; Dord-Crousle, S. ; Regnier, Paul
Author_Institution :
INSA-Lyon, LIRIS, Univ. de Lyon, Lyon, France
fYear :
2009
Firstpage :
1001
Lastpage :
1005
Abstract :
In this paper we propose a new approach to improve electronic editions of human sciences corpora by providing an efficient estimation of manuscript page structure. Text line segmentation is an important stage in any handwritten document analysis process. Variable inter-line spacing, inconsistent baseline skew, overlapping, and occlusions in unconstrained 19th-century handwritten documents complicate the text line segmentation task. The only prior knowledge of the script that we use is that text line skews can be random and irregular. In that context, we model text line detection as an image segmentation problem, enhancing text line structure with the Hough transform and a clustering of connected components so that text line boundaries appear. The proposed snippet decomposition approach to page layout analysis relies on a first step that classifies page content into five visual and genetic taxonomies, and a second step of text line extraction and snippet decomposition. Experiments show that the proposed method achieves high accuracy in detecting text lines in regular and semi-regular handwritten pages from the corpus of digitized Flaubert manuscripts ("Dossiers documentaires de Bouvard et Pécuchet", 1872-1880).
Keywords :
Hough transforms; computer graphics; document image processing; handwriting recognition; history; image classification; image enhancement; image segmentation; pattern clustering; text analysis; Hough transform; content page classification; digitized Flaubert manuscript; electronic edition; genetic taxonomy; handwriting document layout analysis; human science corpus; image enhancement; inconstant baseline skew; manuscript page structure; occlusion; pattern clustering; snippet decomposition approach; snippet extraction; text line detection; text line segmentation task; variable interline space; Text analysis; Hough transform; connected components classification; snippets decomposition;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on
Conference_Location :
Barcelona
ISSN :
1520-5363
Print_ISBN :
978-1-4244-4500-4
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2009.199
Filename :
5277538