DocumentCode :
2148596
Title :
Extending Page Segmentation Algorithms for Mixed-Layout Document Processing
Author :
Winder, Amy ; Andersen, Tim ; Barney Smith, Elisa H.
Author_Institution :
Comput. Sci. Dept., Boise State Univ., Boise, ID, USA
fYear :
2011
fDate :
18-21 Sept. 2011
Firstpage :
1245
Lastpage :
1249
Abstract :
The goal of this work is to add the capability to segment documents containing text, graphics, and pictures to the open source OCR engine OCRopus. To achieve this goal, OCRopus's RAST algorithm was improved to recognize non-text regions so that mixed-content documents could be analyzed in addition to text-only documents. Also, a method for classifying text and non-text regions was developed and implemented for the Voronoi algorithm, enabling users to perform OCR on documents processed by this method. Finally, both algorithms were modified to perform at a range of resolutions. Our testing showed an improvement of 15-40% for the RAST algorithm, giving it an average segmentation accuracy of about 80%. The Voronoi algorithm averaged around 70% accuracy on our test data. Depending on the particular layout and idiosyncrasies of the documents to be digitized, however, either algorithm could be sufficiently accurate to be utilized.
Keywords :
computational geometry; document image processing; image classification; image segmentation; optical character recognition; public domain software; text analysis; OCR; OCRopus RAST algorithm; Voronoi algorithm; mixed content document; mixed layout document processing; nontext region recognition; open source OCR engine; page segmentation algorithm; text classification; text only document; Algorithm design and analysis; Classification algorithms; Image resolution; Image segmentation; Layout; Merging; RAST; Voronoi; open source OCR; page segmentation;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
ISSN :
1520-5363
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2011.251
Filename :
6065509