• DocumentCode
    470034
  • Title
    Authors’ names extraction from scanned documents
  • Author
    Ohta, Manabu ; Yamasaki, Shun ; Yakushi, Takayuki ; Takasu, Atsuhiro
  • Author_Institution
    Okayama Univ., Okayama
  • Volume
    1
  • fYear
    2007
  • fDate
    28-31 Oct. 2007
  • Firstpage
    67
  • Lastpage
    72
  • Abstract
    Authors’ names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. However, extracting such bibliographic data from printed documents requires human intervention; it is therefore not cost-effective, even using various document image-processing techniques such as optical character recognition (OCR). In this paper, we describe an automatic authors’ names extraction method for academic articles scanned with OCR mark-up. The proposed method first extracts authors’ blocks, which include assumed author/delimiter characters based on layout analysis, and then uses a specifically designed hidden Markov model (HMM) for labeling the unsegmented character strings in the block as those of either an author or a delimiter. We applied the proposed method to Japanese academic articles. Results of these experiments showed that the proposed method correctly extracted more than 99% of authors’ blocks with manual tuning; the proposed HMM correctly labeled more than 95% of the author name strings.
  • Keywords
    bibliographic systems; digital libraries; document image processing; hidden Markov models; optical character recognition; Japanese academic articles; academic articles searching; authors names extraction; bibliographic element; hidden Markov model; optical character recognition mark-up; scanned documents; unsegmented character string labeling; Character recognition; Data mining; Hidden Markov models; Humans; Image analysis; Informatics; Labeling; Optical character recognition software; Software libraries; Text analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Digital Information Management, 2007. ICDIM '07. 2nd International Conference on
  • Conference_Location
    Lyon
  • Print_ISBN
    978-1-4244-1475-8
  • Electronic_ISBN
    978-1-4244-1476-5
  • Type
    conf
  • DOI
    10.1109/ICDIM.2007.4444202
  • Filename
    4444202