  • DocumentCode
    3437066
  • Title
    Document image ground truth generation from electronic text
  • Author
    Zi, Gang; Doermann, David
  • Author_Institution
    Lab. for Language & Media Process., Maryland Univ., College Park, MD, USA
  • Volume
    2
  • fYear
    2004
  • fDate
    23-26 Aug. 2004
  • Firstpage
    663
  • Abstract
    The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed. With the increased interest in processing multilingual sources, however, there is a tremendous need to rapidly generate data in new languages and scripts without developing specialized systems. We have developed an approach that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The metafile information is parsed to generate zone, line, word, and character ground truth, including location, font information, and content, in any language supported by Windows. The resulting images can be physically or synthetically degraded and used for training and evaluating OCR systems. We briefly survey related work and describe our system.
  • Keywords
    natural languages; text analysis; MS Windows operating system; Windows enhanced metafile directives; document analysis systems; document image ground truth generation; electronic text; multilingual sources processing; Costs; Degradation; Educational institutions; Image generation; Laboratories; Noise generators; Operating systems; Optical character recognition software; Page description languages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on
  • ISSN
    1051-4651
  • Print_ISBN
    0-7695-2128-2
  • Type
    conf
  • DOI
    10.1109/ICPR.2004.1334346
  • Filename
    1334346