Title :
Bibliographic element extraction from scanned documents using conditional random fields
Author :
Ohta, Manabu ; Yakushi, Takayuki ; Takasu, Atsuhiro
Author_Institution :
Okayama Univ., Okayama
Abstract :
Bibliographic databases are indispensable to digital libraries for academic articles. However, extracting bibliographic elements from printed documents requires a lot of human intervention; it is not cost-effective, even when using various document image-processing techniques such as optical character recognition (OCR). In this paper, we propose an automatic bibliographic element extraction method for academic articles scanned with OCR markup. The proposed method first labels text blocks as predetermined bibliographic elements and then further labels the characters in each labeled text block if necessary. The second labeling enables us to extract each author's name from the authors' text block. The method uses conditional random fields (CRF) for labeling both text blocks and the characters in them. We applied the method to Japanese academic articles. The experiments showed that the proposed text block labeling correctly extracted all the predefined bibliographic elements from more than 97% of the articles; the proposed character labeling also correctly extracted all the author name strings from more than 99% of the authors' text blocks in Japanese.
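The character-labeling stage described in the abstract can be illustrated with a minimal sketch of linear-chain decoding. This is not the authors' implementation: the two-label scheme ("N" for a name character, "S" for a separator), the feature scores in emit_score and trans_score, and the function names are all assumptions made for illustration, and the weights are set by hand, whereas a real CRF would learn them from labeled authors' blocks. The same Viterbi routine could in principle be applied to sequences of text blocks instead of characters.

```python
# Hypothetical label set: "N" = character inside an author name, "S" = separator.
LABELS = ("N", "S")


def emit_score(label, ch):
    """Hand-set emission score (assumption; a trained CRF would learn this)."""
    if ch in ",;":
        return 3.0 if label == "S" else -3.0
    if ch.isspace():
        return 0.0  # ambiguous on its own; neighbouring labels decide
    return 3.0 if label == "N" else -3.0


def trans_score(prev, cur):
    """Hand-set transition score (assumption): staying in a label is preferred."""
    table = {("N", "N"): 0.5, ("S", "S"): 1.0, ("N", "S"): -0.5, ("S", "N"): -0.5}
    return table[(prev, cur)]


def viterbi(observations, labels, emit, trans):
    """Return the highest-scoring label sequence under a linear-chain model."""
    if not observations:
        return []
    # best[t][y] = best score of any label path that ends in y at position t
    best = [{y: emit(y, observations[0]) for y in labels}]
    back = []
    for x in observations[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans(p, y))
            scores[y] = best[-1][prev] + trans(prev, y) + emit(y, x)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def extract_names(block_text):
    """Label each character of an authors' block and collect runs labeled 'N'."""
    chars = list(block_text)
    tags = viterbi(chars, LABELS, emit_score, trans_score)
    names, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "N":
            current.append(ch)
        elif current:
            names.append("".join(current).strip())
            current = []
    if current:
        names.append("".join(current).strip())
    return [name for name in names if name]


if __name__ == "__main__":
    print(extract_names("Manabu Ohta, Takayuki Yakushi, Atsuhiro Takasu"))
```

In this sketch a space carries no emission evidence, so the transition scores decide its label: a space between two name characters stays inside the name, while a space directly after a comma is treated as part of the separator.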
Keywords :
bibliographic systems; database management systems; digital libraries; document image processing; feature extraction; information retrieval; optical character recognition; random processes; Japanese academic articles; OCR markup; bibliographic databases; bibliographic element extraction; conditional random fields; document image processing techniques; printed documents; text blocks; Abstracts; Character recognition; Data mining; Informatics; Information analysis; Labeling; Optical character recognition software; Software libraries; Text analysis; XML
Conference_Title :
Digital Information Management, 2008. ICDIM 2008. Third International Conference on
Conference_Location :
London
Print_ISBN :
978-1-4244-2916-5
Electronic_ISBN :
978-1-4244-2917-2
DOI :
10.1109/ICDIM.2008.4746745