DocumentCode
2011125
Title
Robust Recognition of Degraded Documents Using Character N-Grams
Author
Dutta, Shrey ; Sankaran, Naveen ; Sankar, K. Pramod ; Jawahar, C.V.
Author_Institution
Center for Visual Inf. Technol., IIIT Hyderabad, Hyderabad, India
fYear
2012
fDate
27-29 March 2012
Firstpage
130
Lastpage
134
Abstract
In this paper, we present a novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images. OCRs perform well on good-quality documents but fail easily in the presence of degradations. Also, classical OCR approaches perform poorly on complex scripts such as those of Indian languages. We address these issues by proposing to recognize character n-gram images, which are groupings of consecutive character/component segments. Our approach is unique in that we use character n-grams as a primitive for recognition rather than for post-processing. By exploiting the additional context present in the character n-gram images, we enable better disambiguation between confusing characters in the recognition phase. The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them. Our method is inherently robust to degradations such as cuts and merges, which are common in digital libraries of scanned documents. We also present a reliable and scalable scheme for recognizing character n-gram images. Tests on English and Malayalam document images show considerable improvement in recognition in the case of heavily degraded documents.
Keywords
document image processing; natural language processing; optical character recognition; Indian language document images; Malayalam document images; OCR; character n-gram images; degraded documents; digital libraries; optical character recognition; robust recognition; scanned documents; Character recognition; Degradation; Error analysis; Feature extraction; Image recognition; Optical character recognition software; Training; Character N-Grams; Degraded Documents; OCR;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
Conference_Location
Gold Coast, QLD
Print_ISBN
978-1-4673-0868-7
Type
conf
DOI
10.1109/DAS.2012.76
Filename
6195349