DocumentCode :
2011125
Title :
Robust Recognition of Degraded Documents Using Character N-Grams
Author :
Dutta, Shrey ; Sankaran, Naveen ; Sankar, K. Pramod ; Jawahar, C.V.
Author_Institution :
Center for Visual Inf. Technol., IIIT Hyderabad, Hyderabad, India
fYear :
2012
fDate :
27-29 March 2012
Firstpage :
130
Lastpage :
134
Abstract :
In this paper we present a novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images. OCRs perform reasonably well on good-quality documents, but fail easily in the presence of degradations. Also, classical OCR approaches perform poorly on complex scripts such as those of Indian languages. We address these issues by proposing to recognize character n-gram images, which are groupings of consecutive character/component segments. Our approach is unique, since we use the character n-grams as a primitive for recognition rather than for post-processing. By exploiting the additional context present in the character n-gram images, we enable better disambiguation between confusing characters in the recognition phase. The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them. Our method is inherently robust to degradations such as cuts and merges, which are common in digital libraries of scanned documents. We also present a reliable and scalable scheme for recognizing character n-gram images. Tests on English and Malayalam document images show considerable improvement in recognition in the case of heavily degraded documents.
Keywords :
document image processing; natural language processing; optical character recognition; Indian language document images; Malayalam document images; OCR; character n-gram images; degraded documents; digital libraries; optical character recognition; robust recognition; scanned documents; Character recognition; Degradation; Error analysis; Feature extraction; Image recognition; Optical character recognition software; Training; Character N-Grams; Degraded Documents; OCR;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
Conference_Location :
Gold Coast, QLD
Print_ISBN :
978-1-4673-0868-7
Type :
conf
DOI :
10.1109/DAS.2012.76
Filename :
6195349