Title :
A Complete Optical Character Recognition Methodology for Historical Documents
Author :
Vamvakas, G. ; Gatos, B. ; Stamatopoulos, N. ; Perantonis, S.J.
Author_Institution :
Inst. of Inf. & Telecommun., Nat. Center for Sci. Res. Demokritos, Athens
Abstract :
This paper presents a complete OCR methodology for recognizing historical documents, either printed or handwritten, without any knowledge of the font. The methodology consists of three steps: the first two create a training database from a set of documents, while the third recognizes new document images. In the first step, pre-processing that includes image binarization and enhancement takes place. In the second step, a top-down segmentation approach is used to detect text lines, words and characters. A clustering scheme is then adopted to group characters of similar shape. This is a semi-automatic procedure, since the user can intervene at any time to correct possible clustering errors and to assign an ASCII label. After this step, a database is created for use in recognition. Finally, in the third step, the same segmentation approach is applied to every new document image, and recognition is based on the character database produced in the previous step.
Keywords :
document image processing; humanities; image enhancement; image segmentation; learning (artificial intelligence); optical character recognition; pattern clustering; text analysis; visual databases; ASCII label; historical document; image binarization; image enhancement; image recognition; optical character recognition methodology; semiautomatic procedure; text line detection; top-down segmentation approach; Character recognition; Cultural differences; Handwriting recognition; Image converters; Image databases; Image recognition; Image segmentation; Optical character recognition software; Pattern recognition; Text analysis; Historical Documents; Optical Character Recognition;
Conference_Title :
Document Analysis Systems, 2008. DAS '08. The Eighth IAPR International Workshop on
Conference_Location :
Nara
Print_ISBN :
978-0-7695-3337-7
DOI :
10.1109/DAS.2008.73
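
Illustrative_Sketch :
The three steps named in the abstract (binarization, top-down segmentation, recognition against a labelled character database) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the binarization method, the segmentation algorithm, or the features used, so a global threshold, projection profiles, and a coarse quadrant-density descriptor are swapped in here. The word level and the clustering scheme are omitted, and hand-supplied labels stand in for the semi-automatic labelling step. All function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open (start, end) index range


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Global threshold (an assumption): 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def runs(profile: List[int]) -> List[Span]:
    """Return the maximal index ranges where the profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(binary: List[List[int]]) -> List[Span]:
    """Top-down step 1: split the page into text lines by row ink counts."""
    return runs([sum(row) for row in binary])


def segment_characters(binary: List[List[int]], line: Span) -> List[Span]:
    """Top-down step 2: split one line into characters by column ink counts."""
    top, bottom = line
    width = len(binary[0])
    return runs([sum(binary[y][x] for y in range(top, bottom)) for x in range(width)])


def crop(binary: List[List[int]], line: Span, char: Span) -> List[List[int]]:
    """Cut one glyph out of the page (full line height, character width)."""
    return [row[char[0]:char[1]] for row in binary[line[0]:line[1]]]


def features(glyph: List[List[int]]) -> List[float]:
    """Hypothetical descriptor: ink density per quadrant plus aspect ratio."""
    h, w = len(glyph), len(glyph[0])
    feats = []
    for r0, r1 in ((0, (h + 1) // 2), (h // 2, h)):
        for c0, c1 in ((0, (w + 1) // 2), (w // 2, w)):
            cells = [glyph[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feats.append(sum(cells) / len(cells))
    feats.append(w / h)
    return feats


def recognize(glyph: List[List[int]], database: List[Tuple[List[float], str]]) -> str:
    """Return the label of the nearest database entry (squared Euclidean distance)."""
    f = features(glyph)
    return min(database, key=lambda e: sum((a - b) ** 2 for a, b in zip(f, e[0])))[1]


# Toy page: the first text line is the training sample, the second is "new".
PAGE = [
    "........",
    ".#.#....",
    ".#.#....",
    ".#.##...",
    "........",
    ".#..#...",
    ".#..#...",
    ".##.#...",
    "........",
]
gray = [[0 if c == "#" else 255 for c in row] for row in PAGE]
binary = binarize(gray)
lines = segment_lines(binary)
train_line, new_line = lines

# Labels "I" and "L" are supplied by hand, standing in for the user's labelling.
database = [
    (features(crop(binary, train_line, span)), label)
    for span, label in zip(segment_characters(binary, train_line), "IL")
]
text = "".join(
    recognize(crop(binary, new_line, span), database)
    for span in segment_characters(binary, new_line)
)
print(text)
```

Projection profiles only separate glyphs that do not touch or overlap, which is rarely guaranteed in degraded historical material, so this sketch should be read as an outline of the data flow rather than a workable segmenter.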