Title :
Part-of-speech tagging for table of contents recognition
Author :
Belaïd, A. ; Pierron, L. ; Valverde, N.
Author_Institution :
LORIA, CNRS, Vandoeuvre-les Nancy, France
Abstract :
A labeling approach to the automatic recognition of tables of contents (TOCs) is described. A prototype is used for the electronic consultation of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR. Labeling is based on part-of-speech tagging. Tagging is initiated by a primary labeling of text components using specific dictionaries. Significant tags are then grouped into title and author strings and reduced to canonical forms according to contextual rules. Non-labeled tokens are integrated into one field or the other, either by applying contextual correction rules or by using a structure model generated from well-detected articles. The prototype performs satisfactorily across different TOC layouts and character recognition qualities. Without manual intervention, a correct segmentation rate of 95.41% and a correct field extraction rate of 81.74% were obtained on 38 journals comprising 2703 articles.
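The labeling pipeline summarized in the abstract (dictionary-based primary labeling, contextual correction of unlabeled tokens, grouping of tags into fields) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the dictionaries, the tag names (AUTH, TITLE, PAGE, UNK), the single contextual rule, and all function names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy dictionaries; the dictionaries used in the paper are not given.
FIRST_NAMES = {"john", "marie", "laurent"}
TITLE_WORDS = {"a", "of", "for", "the", "and", "recognition", "analysis", "document"}


def primary_label(token):
    """Assign a first tag to a token by dictionary lookup (assumed tag set)."""
    low = token.lower().strip(".,")
    if re.fullmatch(r"\d+", low):
        return "PAGE"   # a bare number is assumed to be a page number
    if low in FIRST_NAMES or re.fullmatch(r"[A-Z]\.", token):
        return "AUTH"   # known first name or an initial such as "J."
    if low in TITLE_WORDS:
        return "TITLE"  # function word or common title noun
    return "UNK"        # left unlabeled for the contextual step


def resolve_unknowns(tags):
    """Illustrative contextual rule: an unlabeled token joins the field of its
    left neighbour, or failing that the nearest labeled token on its right."""
    resolved = list(tags)
    for i, tag in enumerate(resolved):
        if tag != "UNK":
            continue
        left = resolved[i - 1] if i > 0 else None
        right = next((t for t in tags[i + 1:] if t != "UNK"), None)
        if left in ("AUTH", "TITLE"):
            resolved[i] = left
        elif right in ("AUTH", "TITLE"):
            resolved[i] = right
    return resolved


def group_fields(tokens, tags):
    """Merge runs of identical tags into (tag, string) fields."""
    fields = []
    for tok, tag in zip(tokens, tags):
        if fields and fields[-1][0] == tag:
            fields[-1][1].append(tok)
        else:
            fields.append((tag, [tok]))
    return [(tag, " ".join(toks)) for tag, toks in fields]


def label_toc_line(line):
    """Label one TOC entry given as a whitespace-separated string."""
    tokens = line.split()
    tags = resolve_unknowns([primary_label(t) for t in tokens])
    return group_fields(tokens, tags)
```

Calling `label_toc_line("Document analysis of tables J. Smith 12")` on this toy input groups the tokens into a title field, an author field, and a page field. A real system would need far larger dictionaries and a richer rule set, as well as the structure model mentioned in the abstract, which this sketch omits.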
Keywords :
digital libraries; document image processing; feature extraction; optical character recognition; ASCII file; OCR; character recognition; contextual correction rules; digital library system; document analysis; part-of-speech tagging; scientific papers; table of contents recognition; text labeling;
Conference_Title :
Pattern Recognition, 2000. Proceedings. 15th International Conference on
Conference_Location :
Barcelona
Print_ISBN :
0-7695-0750-6
DOI :
10.1109/ICPR.2000.902955