• DocumentCode
    1994063
  • Title

    A segmentation method for bibliographic references by contextual tagging of fields

  • Author

    Besagni, Dominique ; Belaïd, Abdel ; Benet, Nelly

  • Author_Institution
    URI, INIST-CNRS, Vandoeuvre-les-Nancy, France
  • fYear
    2003
  • fDate
    3-6 Aug. 2003
  • Firstpage
    384
  • Abstract
    In this paper, a method based on part-of-speech (PoS) tagging is used to structure bibliographic references. The method operates on a roughly structured ASCII file produced by OCR. Because reference structures are heterogeneous, the method works bottom-up, without an a priori model, assembling structural elements from basic tags into sub-fields and fields. Significant tags are first grouped into homogeneous classes according to their grammatical categories and then reduced to canonical forms corresponding to record fields: "authors", "title", "conference name", "date", etc. Unlabelled tokens are integrated into one field or another either by applying PoS correction rules or by using a structure model generated from well-detected records. The prototype performs satisfactorily across different record layouts and character recognition qualities. Without manual intervention, 96.6% of words are correctly attributed, and about 75.9% of the 2500 references are completely segmented.
  • Keywords
    bibliographic systems; grammars; image segmentation; text analysis; ASCII file; OCR; PoS; bibliographic references; canonical forms; character recognition; conference name; correction rules; fields contextual tagging; grammar categories; homogeneous classes; part-of-speech tagging; prototype design; record fields; scientific publications; segmentation method; structural elements; text coding; Character recognition; Indexing; Information analysis; Information retrieval; Instruments; Intersymbol interference; Optical character recognition software; Prototypes; Tagging; Turning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on
  • Print_ISBN
    0-7695-1960-1
  • Type

    conf

  • DOI
    10.1109/ICDAR.2003.1227694
  • Filename
    1227694