Title :
A segmentation method for bibliographic references by contextual tagging of fields
Author :
Besagni, Dominique ; Belaïd, Abdel ; Benet, Nelly
Author_Institution :
URI, INIST-CNRS, Vandoeuvre-les-Nancy, France
Abstract :
In this paper, a method based on part-of-speech (PoS) tagging is used to structure bibliographic references. The method operates on a roughly structured ASCII file produced by OCR. Because reference structures are heterogeneous, the method works bottom-up, without an a priori model, assembling structural elements from basic tags into sub-fields and then fields. Significant tags are first grouped into homogeneous classes according to their grammatical categories and then reduced to canonical forms corresponding to record fields: "authors", "title", "conference name", "date", etc. Unlabelled tokens are integrated into one field or another either by applying PoS correction rules or by using a structure model generated from well-detected records. The prototype performs satisfactorily across different record layouts and character recognition qualities. Without manual intervention, 96.6% of words are correctly attributed and about 75.9% of references are completely segmented, on a set of 2500 references.
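The bottom-up flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the tag names (`INITIAL`, `YEAR`), the `TAG_TO_FIELD` mapping, and the functions `basic_tag` and `segment` are hypothetical, and a crude "inherit the nearest labelled neighbour" rule stands in for the PoS correction rules and the learned structure model, neither of which is specified in the abstract.

```python
import re
from itertools import groupby

# Hypothetical mapping from basic tags to record fields (assumption, not from the paper).
TAG_TO_FIELD = {"INITIAL": "authors", "YEAR": "date"}


def basic_tag(token):
    """Assign a basic tag to one token, or None when no pattern matches."""
    if re.fullmatch(r"[A-Z]\.", token):
        return "INITIAL"  # e.g. "D."
    if re.fullmatch(r"(19|20)\d{2}", token):
        return "YEAR"  # e.g. "2003"
    return None


def segment(tokens):
    """Group tokens into (field, text) pairs, bottom-up."""
    # Step 1: tag each token and map the tag to a field (None = unlabelled).
    fields = [TAG_TO_FIELD.get(basic_tag(t)) for t in tokens]

    # Step 2: unlabelled tokens inherit the nearest labelled field on their left.
    last = None
    for i, field in enumerate(fields):
        if field is None:
            fields[i] = last
        else:
            last = field

    # Step 3: tokens still unlabelled (leading ones) inherit from the right.
    nxt = None
    for i in range(len(fields) - 1, -1, -1):
        if fields[i] is None:
            fields[i] = nxt
        else:
            nxt = fields[i]

    # Step 4: merge runs of consecutive tokens that share a field.
    return [
        (field, " ".join(tok for _, tok in run))
        for field, run in groupby(zip(fields, tokens), key=lambda pair: pair[0])
    ]


result = segment(["Besagni", "D.", "Belaid", "A.", "2003"])
print(result)
```

With this toy rule, a surname next to an initial is absorbed into "authors", but any untagged title words would be absorbed by a neighbouring field as well. That is the kind of case the correction rules and structure model mentioned in the abstract are meant to handle.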
Keywords :
bibliographic systems; grammars; image segmentation; text analysis; ASCII file; OCR; PoS; bibliographic references; canonical forms; character recognition; conference name; correction rules; fields contextual tagging; grammar categories; homogeneous classes; part-of-speech tagging; prototype design; record fields; scientific publications; segmentation method; structural elements; text coding; Character recognition; Indexing; Information analysis; Information retrieval; Instruments; Intersymbol interference; Optical character recognition software; Prototypes; Tagging; Turning;
Conference_Title :
Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on
Print_ISBN :
0-7695-1960-1
DOI :
10.1109/ICDAR.2003.1227694