DocumentCode :
153403
Title :
Automatic Training Set Generation for Better Historic Document Transcription and Compression
Author :
de Franca Pereira e Silva, Gabriel ; Dueire Lins, Rafael ; Gomes, Chandima
Author_Institution :
Univ. Fed. de Pernambuco, Recife, Brazil
fYear :
2014
fDate :
7-10 April 2014
Firstpage :
277
Lastpage :
281
Abstract :
The more complete the training set of an optical character recognition (OCR) platform, the greater the chance of obtaining better transcription precision. Developing a database for this purpose demands enormous effort, as it is done manually and must be as extensive as possible to potentially cover all words in a language. Dealing with historic documents, whether handwritten, typed, or printed, is even harder, as such documents are often degraded by time and storage conditions. The recent work of Silva-Lins showed how to automatically generate training sets of isolated characters for the cursive writing of one specific person. This is particularly important in the transcription of historic files of important people. The present work improves that strategy by analyzing letter ligature patterns. Experimental evidence bears out the improvement in OCR transcription accuracy for printed, typed, and handwritten documents.
Keywords :
document image processing; learning (artificial intelligence); optical character recognition; OCR transcription accuracy; automatic training set generation; database; handwritten historic documents; historic document compression; historic document transcription; optical character recognition platform; printed historic documents; typed historic documents; Accuracy; Dictionaries; Noise; Optical character recognition software; Pattern recognition; Training; OCR; documents; font sets; training sets;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on
Conference_Location :
Tours
Print_ISBN :
978-1-4799-3243-6
Type :
conf
DOI :
10.1109/DAS.2014.30
Filename :
6831013