DocumentCode
2148596
Title
Extending Page Segmentation Algorithms for Mixed-Layout Document Processing
Author
Winder, Amy ; Andersen, Tim ; Smith, Elisa H Barney
Author_Institution
Comput. Sci. Dept., Boise State Univ., Boise, ID, USA
fYear
2011
fDate
18-21 Sept. 2011
Firstpage
1245
Lastpage
1249
Abstract
The goal of this work is to add the capability to segment documents containing text, graphics, and pictures in the open source OCR engine OCRopus. To achieve this goal, OCRopus' RAST algorithm was improved to recognize non-text regions so that mixed content documents could be analyzed in addition to text-only documents. Also, a method for classifying text and non-text regions was developed and implemented for the Voronoi algorithm, enabling users to perform OCR on documents processed by this method. Finally, both algorithms were modified to perform at a range of resolutions. Our testing showed an improvement of 15-40% for the RAST algorithm, giving it an average segmentation accuracy of about 80%. The Voronoi algorithm averaged around 70% accuracy on our test data. Depending on the particular layout and idiosyncrasies of the documents to be digitized, however, either algorithm could be sufficiently accurate to be utilized.
Keywords
computational geometry; document image processing; image classification; image segmentation; optical character recognition; public domain software; text analysis; OCR; OCRopus RAST algorithm; Voronoi algorithm; mixed content document; mixed layout document processing; nontext region recognition; open source OCR engine; page segmentation algorithm; text classification; text only document; Algorithm design and analysis; Classification algorithms; Image resolution; Image segmentation; Layout; Merging; RAST; Voronoi; open source OCR; page segmentation;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location
Beijing
ISSN
1520-5363
Print_ISBN
978-1-4577-1350-7
Electronic_ISBN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2011.251
Filename
6065509