• DocumentCode
    1742150
  • Title

    Part-of-speech tagging for table of contents recognition

  • Author

    Belaïd, A. ; Pienon, L. ; Valverde, N.

  • Author_Institution
    LORIA, CNRS, Vandoeuvre-lès-Nancy, France
  • Volume
    4
  • fYear
    2000
  • fDate
    2000
  • Firstpage
    451
  • Abstract
    A labeling approach to the automatic recognition of tables of contents (TOCs) is described. A prototype is used for the electronic consultation of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR. Labeling is based on part-of-speech tagging. Tagging is initiated by a primary labeling of text components using specific dictionaries. Significant tags are then grouped into title and author strings and reduced to canonical forms according to contextual rules. Non-labeled tokens are integrated into one field or another, either by applying contextual correction rules or by using a structure model generated from well-detected articles. The prototype performs satisfactorily across different TOC layouts and character recognition qualities. Without manual intervention, a correct segmentation rate of 95.41% and a correct field extraction rate of 81.74% were obtained on 38 journals comprising 2703 articles.
  • Keywords
    digital libraries; document image processing; feature extraction; optical character recognition; ASCII file; OCR; character recognition; contextual correction rules; digital library system; document analysis; part-of-speech tagging; scientific papers; table of contents recognition; text labeling;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition, 2000. Proceedings. 15th International Conference on
  • Conference_Location
    Barcelona
  • ISSN
    1051-4651
  • Print_ISBN
    0-7695-0750-6
  • Type

    conf

  • DOI
    10.1109/ICPR.2000.902955
  • Filename
    902955