• DocumentCode
    2011125
  • Title
    Robust Recognition of Degraded Documents Using Character N-Grams
  • Author
    Dutta, Shrey ; Sankaran, Naveen ; Sankar, K. Pramod ; Jawahar, C.V.
  • Author_Institution
    Center for Visual Inf. Technol., IIIT Hyderabad, Hyderabad, India
  • fYear
    2012
  • fDate
    27-29 March 2012
  • Firstpage
    130
  • Lastpage
    134
  • Abstract
    In this paper we present a novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images. OCRs perform well on good-quality documents, but fail easily in the presence of degradations. Also, classical OCR approaches perform poorly on complex scripts such as those of Indian languages. We address these issues by proposing to recognize character n-gram images, which are groupings of consecutive character/component segments. Our approach is unique in that we use the character n-grams as a primitive for recognition rather than for post-processing. By exploiting the additional context present in the character n-gram images, we enable better disambiguation between confusing characters in the recognition phase. The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them. Our method is inherently robust to degradations such as cuts and merges, which are common in digital libraries of scanned documents. We also present a reliable and scalable scheme for recognizing character n-gram images. Tests on English and Malayalam document images show considerable improvement in recognition in the case of heavily degraded documents.
  • Keywords
    document image processing; natural language processing; optical character recognition; Indian language document images; Malayalam document images; OCR; character n-gram images; degraded documents; digital libraries; robust recognition; scanned documents; Character recognition; Degradation; Error analysis; Feature extraction; Image recognition; Optical character recognition software; Training; Character N-Grams; Degraded Documents
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
  • Conference_Location
    Gold Coast, QLD
  • Print_ISBN
    978-1-4673-0868-7
  • Type
    conf
  • DOI
    10.1109/DAS.2012.76
  • Filename
    6195349