DocumentCode :
1638621
Title :
Hybrid Page Layout Analysis via Tab-Stop Detection
Author :
Smith, Ray
Author_Institution :
Google Inc., Mountain View, CA, USA
fYear :
2009
Firstpage :
241
Lastpage :
245
Abstract :
A new hybrid page layout analysis algorithm is proposed, which uses bottom-up methods to form an initial data-type hypothesis and locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine at http://code.google.com/p/tesseract-ocr.
Keywords :
document image processing; image classification; optical character recognition; C++ source code implementation; OCR; bottom-up classification method; column layout; hybrid page layout analysis; initial data-type hypothesis; tab-stop detection; top-down manner; Algorithm design and analysis; Background noise; Image analysis; Image edge detection; Image segmentation; Optical character recognition software; Pixel; Publishing; Search engines; Text analysis; Page Layout Analysis; Tab detection; Tesseract;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on
Conference_Location :
Barcelona
ISSN :
1520-5363
Print_ISBN :
978-1-4244-4500-4
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2009.257
Filename :
5277715