• DocumentCode
    2011079
  • Title

    Improving Book OCR by Adaptive Language and Image Models

  • Author

    Lee, Dar-Shyang ; Smith, Ray

  • Author_Institution
    Google Inc., Mountain View, CA, USA
  • fYear
    2012
  • fDate
    27-29 March 2012
  • Firstpage
    115
  • Lastpage
    119
  • Abstract
    In order to cope with the vast diversity of book content and typefaces, it is important for OCR systems to leverage the strong consistency within a book but adapt to variations across books. We describe a system that combines two parallel correction paths using document-specific image and language models. Each model adapts to shapes and vocabularies within a book to identify inconsistencies as correction hypotheses, but relies on the other for effective cross-validation. Using the open source Tesseract engine as baseline, results on a large data set of scanned books demonstrate that word error rates can be reduced by 25 percent using this approach.
  • Keywords
    document image processing; optical character recognition; adaptive language model; book OCR improvement; book content; correction hypothesis; document-specific image model; open source Tesseract engine; parallel correction paths; typefaces; Conferences; Text analysis; adaptive OCR; document-specific OCR; error correction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
  • Conference_Location
    Gold Coast, QLD
  • Print_ISBN
    978-1-4673-0868-7
  • Type

    conf

  • DOI
    10.1109/DAS.2012.45
  • Filename
    6195346