• DocumentCode
    2869052
  • Title

    OCR with no shape training

  • Author

    Ho, Tin Kam ; Nagy, George

  • Author_Institution
    Lucent Technol. Bell Labs., Murray Hill, NJ, USA
  • Volume
    4
  • fYear
    2000
  • fDate
    2000
  • Firstpage
    27
  • Abstract
    We present a document-specific OCR system and apply it to a corpus of faxed business letters. Unsupervised classification of the segmented character bitmaps on each page, using a “clump” metric, typically yields several hundred clusters with highly skewed populations. Letter identities are assigned to each cluster by maximizing matches with a lexicon of English words. We found that for 2/3 of the pages, we can identify almost 80% of the words included in the lexicon, without any shape training. Residual errors are caused by mis-segmentation including missed lines and punctuation. This research differs from earlier attempts to apply cipher decoding to OCR in: (1) using real data; (2) a more appropriate clustering algorithm; and (3) decoding a many-to-many instead of a one-to-one mapping between clusters and letters.
  • Keywords
    document image processing; image classification; optical character recognition; business letters; document-specific OCR system; highly skewed populations; letter identities; many-to-many mapping; segmented character bitmaps; unsupervised classification; Business; Clustering algorithms; Decoding; Optical character recognition software; Prototypes; Robustness; Scattering; Shape; Tin; USA Councils
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition, 2000. Proceedings. 15th International Conference on
  • Conference_Location
    Barcelona
  • ISSN
    1051-4651
  • Print_ISBN
    0-7695-0750-6
  • Type

    conf

  • DOI
    10.1109/ICPR.2000.902858
  • Filename
    902858