• DocumentCode
    3467088
  • Title
    Intelligent Parsing of Scanned Volumes for Web Based Archives
  • Author
    Lu, Xiaonan ; Wang, James Z. ; Giles, C. Lee
  • Author_Institution
    Pennsylvania State Univ., State College
  • fYear
    2007
  • fDate
    17-19 Sept. 2007
  • Firstpage
    559
  • Lastpage
    568
  • Abstract
    The proliferation of digital libraries and the large number of existing documents raise important issues in the efficient handling of documents. Printed texts in documents need to be converted into digital format, and semantic information needs to be parsed and managed for effective retrieval. In this work, we attempt to solve the problems faced by current web-based archives, where large-scale repositories of electronic resources have been built from scanned volumes. Specifically, we focus on the scientific domain and target scanned volumes of scientific publications. Our goal is to automate the semantic processing of scanned volumes, an important and challenging step towards efficient retrieval of content within scanned volumes. We tackle the problem by designing a machine learning-based method to extract multi-level metadata about the content of scanned volumes. We combine image and text information within scanned volumes for intelligent parsing. We developed a system and tested it with real-world data from the Internet Archive, and the experimental evaluation demonstrated good results.
  • Keywords
    digital libraries; document handling; learning (artificial intelligence); meta data; semantic Web; Web based archives; digital format; digital libraries proliferation; documents handling; electronic resources; intelligent parsing; machine learning-based method; multi-level metadata; scanned volumes; semantic information; semantic processing; target scanned volumes; text information; Character recognition; Competitive intelligence; Data mining; Educational institutions; Internet; Learning systems; Optical character recognition software; Software libraries; Text recognition; XML;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Semantic Computing, 2007. ICSC 2007. International Conference on
  • Conference_Location
    Irvine, CA
  • Print_ISBN
    978-0-7695-2997-4
  • Type
    conf
  • DOI
    10.1109/ICSC.2007.47
  • Filename
    4338394