• DocumentCode
    2146007
  • Title
    A Fast Alignment Scheme for Automatic OCR Evaluation of Books
  • Author
    Yalniz, Ismet Zeki ; Manmatha, R.
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Massachusetts, Amherst, MA, USA
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    754
  • Lastpage
    758
  • Abstract
    This paper aims to evaluate the accuracy of optical character recognition (OCR) systems on real scanned books. The ground truth e-texts are obtained from the Project Gutenberg website and aligned with their corresponding OCR output using a fast recursive text alignment scheme (RETAS). First, unique words in the vocabulary of the book are aligned with unique words in the OCR output. This process is applied recursively to each text segment between matching unique words until the text segments become very small. In the final stage, an edit-distance-based alignment algorithm aligns these short chunks of text to generate the final alignment. The proposed approach effectively segments the alignment problem into small subproblems, which in turn yields dramatic time savings even when there are large pieces of inserted or deleted text and the OCR accuracy is poor. This approach is used to evaluate the OCR accuracy of real scanned books in English, French, German and Spanish.
  • Keywords
    electronic publishing; natural language processing; optical character recognition; recursive estimation; text analysis; vocabulary; English; French; German; OCR accuracy; OCR output; OCR systems; Project Gutenberg website; RETAS; Spanish; automatic OCR book evaluation; edit distance based alignment algorithm; final alignment; ground truth e-texts; optical character recognition systems; real scanned books; recursive text alignment scheme; short chunks; text segment; vocabulary; Accuracy; Complexity theory; Error analysis; Hidden Markov models; Noise; Optical character recognition software; Vocabulary; OCR evaluation; digital libraries; sequence alignment;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.157
  • Filename
    6065412
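
The recursive scheme summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' RETAS implementation: the function names (`unique_words`, `anchor_pairs`, `edit_align`, `recursive_align`), the `max_chunk` threshold, and the use of `difflib.SequenceMatcher` to pick an order-preserving set of anchor words are all assumptions made for illustration. Words that occur exactly once in both texts serve as anchors, the segments between anchors are aligned recursively, and short segments fall through to a word-level edit-distance alignment.

```python
from collections import Counter
from difflib import SequenceMatcher


def unique_words(tokens):
    """Return the set of words that occur exactly once in tokens."""
    return {w for w, c in Counter(tokens).items() if c == 1}


def anchor_pairs(truth, ocr):
    """Pair up words unique to both texts, keeping an order-preserving subset.

    Assumption: SequenceMatcher stands in for whatever anchor-matching
    step the paper actually uses.
    """
    common = unique_words(truth) & unique_words(ocr)
    t_seq = [(w, i) for i, w in enumerate(truth) if w in common]
    o_seq = [(w, j) for j, w in enumerate(ocr) if w in common]
    sm = SequenceMatcher(None, [w for w, _ in t_seq],
                         [w for w, _ in o_seq], autojunk=False)
    pairs = []
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            pairs.append((t_seq[block.a + k][1], o_seq[block.b + k][1]))
    return pairs


def edit_align(truth, ocr):
    """Word-level Levenshtein alignment; None marks an inserted or deleted word."""
    n, m = len(truth), len(ocr)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Trace back from the bottom-right cell to recover the aligned pairs.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])):
            out.append((truth[i - 1], ocr[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            out.append((truth[i - 1], None))
            i -= 1
        else:
            out.append((None, ocr[j - 1]))
            j -= 1
    out.reverse()
    return out


def recursive_align(truth, ocr, max_chunk=4):
    """Split on anchor words, recurse on the gaps, edit-align small chunks."""
    if len(truth) <= max_chunk or len(ocr) <= max_chunk:
        return edit_align(truth, ocr)
    pairs = anchor_pairs(truth, ocr)
    if not pairs:
        return edit_align(truth, ocr)
    out, ti, oj = [], 0, 0
    for t, o in pairs:
        out.extend(recursive_align(truth[ti:t], ocr[oj:o], max_chunk))
        out.append((truth[t], ocr[o]))
        ti, oj = t + 1, o + 1
    out.extend(recursive_align(truth[ti:], ocr[oj:], max_chunk))
    return out


truth = "the quick brown fox jumps over the lazy dog".split()
ocr = "the qu1ck brown fox jumps ovr the lazy dog".split()
alignment = recursive_align(truth, ocr, max_chunk=2)
```

Because each anchor splits the problem into independent segments, the quadratic edit-distance step only ever runs on short chunks in the common case, which is the source of the time savings the abstract describes. A word-level accuracy figure could then be derived from `alignment` by counting the pairs whose two words are identical, though the exact metric used in the paper is not stated in this record.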