Robust Recognition of Degraded Documents Using Character N-Grams

Author

Dutta, Shrey ; Sankaran, Naveen ; Sankar, K. Pramod ; Jawahar, C.V.

Author_Institution

Center for Visual Inf. Technol., IIIT Hyderabad, Hyderabad, India

fYear

2012

fDate

27-29 March 2012

Firstpage

130

Lastpage

134

Abstract

In this paper, we present a novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images. OCR systems perform well on good-quality documents but fail easily in the presence of degradations. Classical OCR approaches also perform poorly on complex scripts such as those of Indian languages. We address these issues by proposing to recognize character n-gram images, which are groupings of consecutive character/component segments. Our approach is unique in that we use character n-grams as a primitive for recognition rather than for post-processing. By exploiting the additional context present in the character n-gram images, we enable better disambiguation between confusing characters in the recognition phase. The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them. Our method is inherently robust to degradations such as cuts and merges, which are common in digital libraries of scanned documents. We also present a reliable and scalable scheme for recognizing character n-gram images. Tests on English and Malayalam document images show considerable improvement in recognition for heavily degraded documents.
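
The abstract describes two steps: grouping consecutive segments into character n-grams, then fusing the n-gram labels into a word label. The Python sketch below illustrates that idea only. The abstract does not specify the fusion rule, so the per-position majority vote used here is an assumption, and the names `segment_ngrams`, `fuse_by_vote`, and `toy_recognize` are hypothetical stand-ins, not the authors' implementation. Segments are represented as strings in place of image patches.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

NGram = Tuple[int, Tuple[str, ...]]  # (start index, consecutive segments)


def segment_ngrams(segments: Sequence[str], max_n: int) -> List[NGram]:
    """Enumerate every run of 1..max_n consecutive segments with its start index."""
    ngrams: List[NGram] = []
    for n in range(1, max_n + 1):
        for start in range(len(segments) - n + 1):
            ngrams.append((start, tuple(segments[start:start + n])))
    return ngrams


def fuse_by_vote(num_segments: int, predictions: List[NGram]) -> str:
    """Fuse n-gram labels into one word label by per-position majority vote.

    Assumption: majority voting stands in for the fusion step, which the
    abstract does not describe in detail.
    """
    votes = [Counter() for _ in range(num_segments)]
    for start, labels in predictions:
        for offset, label in enumerate(labels):
            votes[start + offset][label] += 1
    # Positions that received no vote are marked with "?".
    return "".join(v.most_common(1)[0][0] if v else "?" for v in votes)


def toy_recognize(ngram: Tuple[str, ...]) -> Tuple[str, ...]:
    """Hypothetical recognizer: misreads an isolated 'c' as 'e', but is
    correct whenever neighbouring segments supply context."""
    return ("e",) if ngram == ("c",) else ngram


def recognize_word(segments: Sequence[str], max_n: int,
                   recognize: Callable[[Tuple[str, ...]], Tuple[str, ...]]) -> str:
    """Recognize each n-gram independently, then fuse the labels."""
    predictions = [(start, recognize(ngram))
                   for start, ngram in segment_ngrams(segments, max_n)]
    return fuse_by_vote(len(segments), predictions)


word = recognize_word(["c", "a", "t"], max_n=3, recognize=toy_recognize)
print(word)
```

In this toy run, the isolated first segment is misread, but the bigram and trigram that cover it outvote the error, which mirrors the abstract's claim that the added context helps disambiguate confusing characters.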

Keywords

document image processing; natural language processing; optical character recognition; Indian language document images; Malayalam document images; OCR; character n-gram images; degraded documents; digital libraries; optical character recognition; robust recognition; scanned documents; Character recognition; Degradation; Error analysis; Feature extraction; Image recognition; Optical character recognition software; Training; Character N-Grams; Degraded Documents; OCR;

fLanguage

English

Publisher

IEEE

Conference_Titel

Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on

Conference_Location

Gold Coast, QLD

Print_ISBN

978-1-4673-0868-7

Type

conf

DOI

10.1109/DAS.2012.76

Filename

6195349