Title :
Field Extraction from Administrative Documents by Incremental Structural Templates
Author :
Rusinol, Marcal ; Benkhelfallah, Tayeb ; D'Andecy, Vincent Poulain
Author_Institution :
ITESOFT, Aimargues, France
Abstract :
In this paper, we present an incremental framework for extracting field information from administrative document images in a Digital Mail-room scenario. Given a single training sample in which the user has marked the fields to be extracted from a particular document class, a document model representing structural relationships among words is built. This model is incrementally refined as the system processes more documents from the same class. A reformulation of the tf-idf statistic makes it possible to adjust the importance weights of the structural relationships among words. In the experimental section, we report results obtained on a large dataset of real invoices.
Keywords :
document image processing; feature extraction; information retrieval; statistics; administrative document images; digital mail-room scenario; document model; field information extraction; importance weights; incremental structural templates; tf-idf statistic scheme; words structural relationships; Accuracy; Context; Information retrieval; Layout; Optical character recognition software; Text analysis; Field extraction
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.223