DocumentCode :
567407
Title :
A domain knowledge-based approach for automatic correction of printed invoices
Author :
Sorio, Enrico ; Bartoli, Alberto ; Davanzo, Giorgio ; Medvet, Eric
Author_Institution :
DI3 - Ind. & Inf. Eng. Dept., Univ. of Trieste, Trieste, Italy
fYear :
2012
fDate :
25-28 June 2012
Firstpage :
151
Lastpage :
155
Abstract :
Although OCR technology is now commonplace, character recognition errors remain a problem, particularly in automated systems for information extraction from printed documents. This paper proposes a method for the automatic detection and correction of OCR errors in an information extraction system. Our algorithm uses domain knowledge about possible character misrecognitions to propose corrections; it then exploits knowledge about the type of the extracted information to perform syntactic and semantic checks that validate the proposed corrections. We assess our proposal on a real-world, highly challenging dataset composed of nearly 800 values extracted from approximately 100 commercial invoices, and we obtain very good results.
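The two-stage idea in the abstract (propose corrections from known character confusions, then validate them with type-specific checks) can be illustrated with a minimal sketch. This is not the authors' implementation: the confusion table `CONFUSIONS`, the helper names `candidates`, `is_valid_date`, and `correct`, and the choice of a `dd/mm/yyyy` date field are all assumptions made for illustration only.

```python
import re
from datetime import datetime
from itertools import product

# Hypothetical confusion table (an assumption, not taken from the paper):
# OCR output character -> digits it may have been misread from.
CONFUSIONS = {
    "O": ["0"],
    "o": ["0"],
    "l": ["1"],
    "I": ["1"],
    "S": ["5"],
    "B": ["8"],
}

# Syntactic shape assumed for a date field: dd/mm/yyyy.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")


def candidates(value):
    """Yield every string obtained by keeping or replacing each confusable character."""
    options = [[ch] + CONFUSIONS.get(ch, []) for ch in value]
    for combo in product(*options):
        yield "".join(combo)


def is_valid_date(text):
    """Syntactic check (pattern match) followed by a semantic check (real calendar date)."""
    if not DATE_PATTERN.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%d/%m/%Y")
    except ValueError:
        return False
    return True


def correct(value, validator):
    """Return the sorted list of distinct candidates accepted by the validator."""
    return sorted({c for c in candidates(value) if validator(c)})


if __name__ == "__main__":
    print(correct("l5/O6/2OI2", is_valid_date))  # ['15/06/2012']
    print(correct("3l/O2/2012", is_valid_date))  # [] (31 February fails the semantic check)
```

In this sketch a value is accepted only when exactly the validator's checks pass, so an empty result signals an error that was detected but could not be corrected automatically. The number of candidates grows exponentially with the number of confusable characters, which is acceptable for short field values such as dates or amounts.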
Keywords :
document handling; error correction; error detection; information retrieval; invoicing; knowledge based systems; optical character recognition; OCR technology; automatic OCR error correction; automatic OCR error detection; automatic information extraction system; automatic printed invoice correction; character misrecognition; character recognition errors; commercial invoices; domain knowledge-based approach; printed documents; semantic checks; syntactic checks; Information retrieval; Joints; Optical character recognition software; Semantics; Syntactics; Text analysis; document understanding
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Society (i-Society), 2012 International Conference on
Conference_Location :
London
Print_ISBN :
978-1-4673-0838-0
Type :
conf
Filename :
6285067
Link To Document :