DocumentCode
153403
Title
Automatic Training Set Generation for Better Historic Document Transcription and Compression
Author
de Franca Pereira e Silva, Gabriel ; Dueire Lins, Rafael ; Gomes, Chandima
Author_Institution
Univ. Fed. de Pernambuco, Recife, Brazil
fYear
2014
fDate
7-10 April 2014
Firstpage
277
Lastpage
281
Abstract
The more complete the training set of an optical character recognition platform, the greater the chances of obtaining better transcription precision. Developing a database for this purpose demands great effort, as it is done manually and must be as extensive as possible in order to potentially cover all words in a language. Dealing with historic documents, whether handwritten, typed, or printed, is an even harder task, as documents are often degraded by time and storage conditions. The recent work of Silva-Lins showed how to automatically generate training sets of isolated characters for the cursive writing of one specific person. This is particularly important in the transcription of historic files of important people. The present work improves that strategy by analyzing letter ligature patterns. The improvement in OCR transcription accuracy for printed, typed, and handwritten documents is borne out by experimental evidence.
Keywords
document image processing; learning (artificial intelligence); optical character recognition; OCR transcription accuracy; automatic training set generation; database; handwritten historic documents; historic document compression; historic document transcription; optical character recognition platform; printed historic documents; typed historic documents; Accuracy; Dictionaries; Noise; Optical character recognition software; Pattern recognition; Training; OCR; documents; font sets; training sets;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on
Conference_Location
Tours
Print_ISBN
978-1-4799-3243-6
Type
conf
DOI
10.1109/DAS.2014.30
Filename
6831013