• DocumentCode
    2327576
  • Title
    A new method of information extraction from PDF files
  • Author
    Yuan, Fang; Lu, Bo
  • Author_Institution
    Coll. of Math. & Comput. Sci., Hebei Univ., Baoding, China
  • Volume
    3
  • fYear
    2005
  • fDate
    18-21 Aug. 2005
  • Firstpage
    1738
  • Abstract
    With the rapid increase in the number of PDF files on the Internet, managing and searching them efficiently has become an urgent problem. The most important step in solving this problem is extracting information from the PDF files. This paper presents a new method for doing so. It first parses the PDF files to obtain text and format information, then injects tags into the text to transform it into semi-structured text, and finally applies a pattern-matching algorithm based on a tree model to extract the target information. An experiment showed that the method is effective.
  • Keywords
    document image processing; feature extraction; pattern matching; text analysis; tree data structures; PDF file; information extraction; tree model; Computer science; Data mining; Educational institutions; Electronic mail; Engineering management; Information science; Information technology; Internet; Mathematics; PDF; Pattern match algorithm based on tree model; Semi-structured data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on
  • Conference_Location
    Guangzhou, China
  • Print_ISBN
    0-7803-9091-1
  • Type
    conf
  • DOI
    10.1109/ICMLC.2005.1527225
  • Filename
    1527225
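
The pipeline outlined in the abstract (parse, inject tags, match against a tree) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the tag set, the tagging rule, or the matching algorithm. The sketch assumes the parsing step has already produced text spans with font sizes, and the names `Span`, `Node`, `inject_tags`, `build_tree`, `matches`, and `find_matches`, together with the font-size tagging rule, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Span:
    """A piece of text plus format information (assumed output of PDF parsing)."""
    text: str
    font_size: float


@dataclass
class Node:
    """A node of the semi-structured tree, also used to express patterns."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def inject_tags(spans: List[Span]) -> List[Tuple[str, str]]:
    """Assumed rule: largest font is the title, next largest a heading, the rest body."""
    sizes = sorted({s.font_size for s in spans}, reverse=True)
    names = ["title", "heading"]
    tagged = []
    for s in spans:
        rank = sizes.index(s.font_size)
        tagged.append((names[rank] if rank < len(names) else "body", s.text))
    return tagged


def build_tree(tagged: List[Tuple[str, str]]) -> Node:
    """Nest body text under the most recent title or heading."""
    root = Node("doc")
    current = root
    for tag, text in tagged:
        node = Node(tag, text)
        if tag == "body":
            current.children.append(node)
        else:
            root.children.append(node)
            current = node
    return root


def matches(pattern: Node, node: Node) -> bool:
    """A pattern matches when tags agree, any pattern text agrees,
    and every pattern child matches at least one child of the node."""
    if pattern.tag != node.tag:
        return False
    if pattern.text and pattern.text != node.text:
        return False
    return all(any(matches(pc, nc) for nc in node.children)
               for pc in pattern.children)


def find_matches(pattern: Node, root: Node) -> List[Node]:
    """Collect every subtree of root that matches the pattern, in document order."""
    found = [root] if matches(pattern, root) else []
    for child in root.children:
        found.extend(find_matches(pattern, child))
    return found


spans = [
    Span("A new method of information extraction from PDF files", 18),
    Span("Abstract", 12),
    Span("With the rapid increase in the number of PDF files ...", 10),
    Span("1 Introduction", 12),
    Span("PDF is a widely used document format.", 10),
]
tree = build_tree(inject_tags(spans))

# Pattern: a heading labelled "Abstract" that has at least one body child.
hits = find_matches(Node("heading", "Abstract", [Node("body")]), tree)
abstract_text = hits[0].children[0].text
```

Expressing the pattern as a small tree keeps the query declarative: a different field can be extracted by changing the pattern rather than the traversal code.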