Regional Information Center for Science and Technology - Hybrid Page Layout Analysis via Tab-Stop Detection

DocumentCode :

1638621

Title :

Hybrid Page Layout Analysis via Tab-Stop Detection

Author :

Smith, Ray

Author_Institution :

Google Inc., Mountain View, CA, USA

fYear :

2009

Firstpage :

241

Lastpage :

245

Abstract :

A new hybrid page layout analysis algorithm is proposed, which uses bottom-up methods to form an initial data-type hypothesis and locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine at http://code.google.com/p/tesseract-ocr.

Keywords :

document image processing; image classification; optical character recognition; C++ source code implementation; OCR; bottom-up classification method; column layout; hybrid page layout analysis; initial data-type hypothesis; tab-stop detection; top-down manner; Algorithm design and analysis; Background noise; Image analysis; Image edge detection; Image segmentation; Optical character recognition software; Pixel; Publishing; Search engines; Text analysis; Page Layout Analysis; Tab detection; Tesseract;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on

Conference_Location :

Barcelona

ISSN :

1520-5363

Print_ISBN :

978-1-4244-4500-4

Electronic_ISBN :

1520-5363

Type :

conf

DOI :

10.1109/ICDAR.2009.257

Filename :

5277715

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1638621