• DocumentCode
    2530150
  • Title
    Document style census for OCR
  • Author
    Nagy, George; Sarkar, Prateek
  • Author_Institution
    DocLab., Rensselaer Polytech. Inst., Troy, NY, USA
  • fYear
    2004
  • fDate
    2004
  • Firstpage
    134
  • Lastpage
    147
  • Abstract
    Four methods of converting paper documents to computer-readable form are compared with regard to hypothetical labor cost: keyboarding, omnifont OCR, style-specific OCR, and style-constrained or style-adaptive OCR. The best choice is determined primarily by (1) the reject rates of the various OCR systems at a given error rate, (2) the fraction of the material that must be labeled for training the system, and (3) the cost of partitioning the material according to style. For large corpora, sampling strategies are proposed both for estimating conversion costs and for taking advantage of style homogeneity.
  • Keywords
    document image processing; optical character recognition; computer-readable form; document style census; keyboarding; omnifont OCR; paper documents; style homogeneity; style-adaptive OCR; style-specific OCR; Costs; Demography; Digital cameras; Error analysis; Facsimile; Image converters; Image sampling; Optical character recognition software; Sampling methods; Software libraries
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Image Analysis for Libraries, 2004. Proceedings. First International Workshop on
  • Print_ISBN
    0-7695-2088-X
  • Type
    conf
  • DOI
    10.1109/DIAL.2004.1263245
  • Filename
    1263245