  • DocumentCode
    2146052
  • Title
    Progressive Alignment and Discriminative Error Correction for Multiple OCR Engines
  • Author
    Lund, William B.; Walker, Daniel D.; Ringger, Eric K.
  • Author_Institution
    Comput. Sci. Dept., Brigham Young Univ., Provo, UT, USA
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    764
  • Lastpage
    768
  • Abstract
    This paper presents a novel method for improving optical character recognition (OCR). The method employs the progressive alignment of hypotheses from multiple OCR engines, followed by final hypothesis selection using maximum entropy classification methods. The maximum entropy models are trained on a synthetic calibration data set. Although progressive alignment is not guaranteed to be optimal, the results are nonetheless strong. The synthetic data set used to train or calibrate the selection models is chosen without regard to the test data set; hence we refer to it as "out of domain." It is synthetic in the sense that document images have been generated from the original digital text and degraded using realistic error models. Along with the true transcripts and OCR hypotheses, the calibration data contains sufficient information to produce good models of how to select the best OCR hypothesis and thus correct mistaken OCR hypotheses. Maximum entropy methods leverage that information, using carefully chosen feature functions to select the best possible correction. Our method shows a 24.6% relative improvement over the word error rate (WER) of the best-performing of the five OCR engines employed in this work. Relative to the average WER of all five OCR engines, our method yields a 69.1% relative reduction in the error rate. Furthermore, 52.2% of the documents achieve a new low WER.
  • Keywords
    document image processing; image classification; optical character recognition; OCR engines; WER; digital text; discriminative error correction; document images; entropy classification methods; progressive alignment; realistic error models; synthetic data set; word error rate; Calibration; Engines; Entropy; Error analysis; Lattices; Optical character recognition software; Training; Error correction; Machine learning; Multiple sequence alignment; Optical character recognition; Progressive text alignment; Synthetic training data set
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.303
  • Filename
    6065414
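
The abstract reports its results as relative changes in word error rate (WER). As a minimal sketch of that arithmetic only, the Python below computes WER as word-level edit distance divided by the number of reference words, then the relative reduction between two WER values. The function names and the toy strings are hypothetical illustrations, not taken from the paper, and the paper's own WER tooling may differ in tokenization and normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: words are whitespace-separated and compared exactly
    (no case folding or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Toy example (invented strings, not data from the paper).
truth = "the quick brown fox jumps"
engine_output = "tbe quick brown f0x jumps"     # two wrong words
corrected_output = "the quick brovvn fox jumps"  # one wrong word

baseline = word_error_rate(truth, engine_output)
improved = word_error_rate(truth, corrected_output)
print(baseline, improved, relative_reduction(baseline, improved))
```

Under this definition, a relative figure such as the 24.6% quoted in the abstract describes the drop in WER as a fraction of the baseline engine's WER, not a difference in percentage points.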