Automatic Training Set Generation for Better Historic Document Transcription and Compression

Author

de Franca Pereira e Silva, Gabriel ; Dueire Lins, Rafael ; Gomes, Chandima

Author_Institution

Univ. Fed. de Pernambuco, Recife, Brazil

fYear

2014

fDate

7-10 April 2014

Firstpage

277

Lastpage

281

Abstract

The more complete the training set of an optical character recognition (OCR) platform, the better the chances of obtaining a precise transcription. Developing a database for this purpose demands great effort, as it is performed manually and must be as extensive as possible in order to potentially cover all words in a language. Dealing with historic documents, whether handwritten, typed, or printed, is an even harder task, as documents are often degraded by time and storage conditions. The recent work of Silva-Lins showed how to automatically generate training sets of isolated characters for the cursive writing of one specific person. This is particularly important in the transcription of historic files of important people. The present work improves that strategy by analyzing letter ligature patterns. The improvement in OCR transcription accuracy for printed, typed, and handwritten documents is borne out by experimental evidence.

Keywords

document image processing; learning (artificial intelligence); optical character recognition; OCR transcription accuracy; automatic training set generation; database; handwritten historic documents; historic document compression; historic document transcription; optical character recognition platform; printed historic documents; typed historic documents; Accuracy; Dictionaries; Noise; Optical character recognition software; Pattern recognition; Training; OCR; documents; font sets; training sets

fLanguage

English

Publisher

IEEE

Conference_Titel

Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on

Conference_Location

Tours

Print_ISBN

978-1-4799-3243-6

Type

conf

DOI

10.1109/DAS.2014.30

Filename

6831013