DocumentCode
3131561
Title
Layout analysis of book pages
Author
Green, Ron ; Oliver, Chad
Author_Institution
Dept. of Comput. Sci. & Software Eng., Univ. of Canterbury, Christchurch, New Zealand
fYear
2013
fDate
27-29 Nov. 2013
Firstpage
118
Lastpage
123
Abstract
A method is proposed for analysing the geometric and logical structure of pages in a typical single-column book. A Gaussian blur combined with thresholding is used to form connected components which nominally represent words. A bottom-up nearest-neighbour approach is used to find textual lines, and a manually-defined line length parameter is used to remove marginal noise and find the page frame. A state machine is used to group lines and label them according to function. The proposed method is able to correctly segment and label 99.82% of all targeted features in a set of 196 sample pages. Of the sixteen errors encountered in the sample pages, eleven are instances where adjacent lines have been merged together, four are instances where paragraphs have been split in half, and the remaining error was caused by a header element being detected as part of the body text.
Keywords
Gaussian processes; digital preservation; finite state machines; geometry; image denoising; optical character recognition; text analysis; Gaussian blur; body text; book page layout analysis; bottom-up nearest-neighbour approach; header element; logical page structure; manually-defined line length parameter; marginal noise removal; page geometric structure; single-column book; state machine; textual line finding; Algorithm design and analysis; Clustering algorithms; Image segmentation; Kernel; Layout; Noise; Sections; Geometric Layout; Logical Layout; OCR Preprocessing; Skew Detection; Structure Detection;
fLanguage
English
Publisher
IEEE
Conference_Titel
Image and Vision Computing New Zealand (IVCNZ), 2013 28th International Conference of
Conference_Location
Wellington
ISSN
2151-2191
Print_ISBN
978-1-4799-0882-0
Type
conf
DOI
10.1109/IVCNZ.2013.6727002
Filename
6727002