• DocumentCode
    2337427
  • Title
    Bibliographic element extraction from scanned documents using conditional random fields
  • Author
    Ohta, Manabu ; Yakushi, Takayuki ; Takasu, Atsuhiro
  • Author_Institution
    Okayama Univ., Okayama
  • fYear
    2008
  • fDate
    13-16 Nov. 2008
  • Firstpage
    99
  • Lastpage
    104
  • Abstract
    Bibliographic databases are indispensable to digital libraries for academic articles. However, extracting bibliographic elements from printed documents requires a lot of human intervention; it is not cost-effective, even when using various document image-processing techniques such as optical character recognition (OCR). In this paper, we propose an automatic bibliographic element extraction method for academic articles scanned with OCR markup. The proposed method first labels text blocks as predetermined bibliographic elements and then further labels the characters in each labeled text block if necessary. The second labeling enables us to extract each author's name from the authors' text block. The method uses conditional random fields (CRF) for labeling both text blocks and the characters in them. We applied the method to Japanese academic articles. The experiments showed that the proposed text block labeling correctly extracted all the predefined bibliographic elements from more than 97% of the articles; the proposed character labeling also correctly extracted all the author name strings from more than 99% of the authors' text blocks in Japanese.
  • Keywords
    bibliographic systems; database management systems; digital libraries; document image processing; feature extraction; information retrieval; optical character recognition; random processes; Japanese academic articles; OCR markup; bibliographic databases; bibliographic element extraction; conditional random fields; digital libraries; document image processing techniques; optical character recognition; printed documents; text blocks; Abstracts; Character recognition; Data mining; Informatics; Information analysis; Labeling; Optical character recognition software; Software libraries; Text analysis; XML;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Digital Information Management, 2008. ICDIM 2008. Third International Conference on
  • Conference_Location
    London
  • Print_ISBN
    978-1-4244-2916-5
  • Electronic_ISBN
    978-1-4244-2917-2
  • Type
    conf
  • DOI
    10.1109/ICDIM.2008.4746745
  • Filename
    4746745