• DocumentCode
    2148596
  • Title
    Extending Page Segmentation Algorithms for Mixed-Layout Document Processing
  • Author
    Winder, Amy ; Andersen, Tim ; Smith, Elisa H Barney
  • Author_Institution
    Comput. Sci. Dept., Boise State Univ., Boise, ID, USA
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    1245
  • Lastpage
    1249
  • Abstract
    The goal of this work is to add the capability to segment documents containing text, graphics, and pictures to the open source OCR engine OCRopus. To achieve this goal, OCRopus' RAST algorithm was improved to recognize non-text regions so that mixed-content documents could be analyzed in addition to text-only documents. Also, a method for classifying text and non-text regions was developed and implemented for the Voronoi algorithm, enabling users to perform OCR on documents processed by this method. Finally, both algorithms were modified to perform at a range of resolutions. Our testing showed an improvement of 15-40% for the RAST algorithm, giving it an average segmentation accuracy of about 80%. The Voronoi algorithm averaged around 70% accuracy on our test data. Depending on the particular layout and idiosyncrasies of the documents to be digitized, however, either algorithm could be sufficiently accurate to be utilized.
  • Keywords
    computational geometry; document image processing; image classification; image segmentation; optical character recognition; public domain software; text analysis; OCR; OCRopus RAST algorithm; Voronoi algorithm; mixed content document; mixed layout document processing; nontext region recognition; open source OCR engine; page segmentation algorithm; text classification; text only document; Algorithm design and analysis; Classification algorithms; Image resolution; Image segmentation; Layout; Merging; RAST; Voronoi; open source OCR; page segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.251
  • Filename
    6065509