A new method of information extraction from PDF files

Author

Yuan, Fang ; Lu, Bo

Author_Institution

Coll. of Math. & Comput. Sci., Hebei Univ., Baoding, China

Volume

3

fYear

2005

fDate

18-21 Aug. 2005

Firstpage

1738

Abstract

With the rapid increase in the number of PDF files on the Internet, managing and searching them efficiently has become an urgent problem. The most important step in solving this problem is extracting information from the PDF files. This paper presents a new method for doing so. The method first parses a PDF file to obtain its text and format information, then injects tags into the text to transform it into semi-structured text, and finally applies a pattern matching algorithm based on a tree model to extract the required information. An experiment confirmed that the method is effective.
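
The three stages named in the abstract (parse, tag, match against a tree) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract gives no details, so the `Span` input format, the tagging rules (largest font is a title, bold text is a heading), the tag names, and the subsequence-style matcher are all assumptions chosen for illustration. The parsing stage is stubbed out with hand-written spans rather than a real PDF parser.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A run of text plus format information (hypothetical parser output)."""
    text: str
    font_size: float
    bold: bool = False


@dataclass
class Node:
    """A node of the semi-structured document tree."""
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def tag_spans(spans):
    """Stage 2: derive a tag for each span from its format (assumed rules)."""
    largest = max(s.font_size for s in spans)
    tagged = []
    for s in spans:
        if s.font_size == largest:
            tag = "title"
        elif s.bold:
            tag = "heading"
        else:
            tag = "para"
        tagged.append((tag, s.text))
    return tagged


def build_tree(tagged):
    """Nest each paragraph under the most recent heading, else under the root."""
    root = Node("doc")
    current = root
    for tag, text in tagged:
        node = Node(tag, text)
        if tag == "para":
            current.children.append(node)
        else:
            root.children.append(node)
            current = node if tag == "heading" else root
    return root


def matches(node, pattern):
    """Stage 3: a pattern is (tag, [child patterns]); the child patterns must
    match some children of the node in the same order (gaps are allowed)."""
    tag, child_patterns = pattern
    if node.tag != tag:
        return False
    remaining = iter(node.children)  # consumed left to right across patterns
    return all(any(matches(c, p) for c in remaining) for p in child_patterns)


def find_all(node, pattern):
    """Collect every node in the tree that matches the pattern."""
    found = [node] if matches(node, pattern) else []
    for child in node.children:
        found.extend(find_all(child, pattern))
    return found


# Stage 1 is stubbed: these spans stand in for real parser output.
spans = [
    Span("A new method of information extraction from PDF files", 18),
    Span("Abstract", 10, bold=True),
    Span("This paper presents a new method.", 10),
    Span("Keywords", 10, bold=True),
    Span("PDF; tree model", 10),
]
tree = build_tree(tag_spans(spans))

# Pattern: a heading that contains at least one paragraph.
hits = find_all(tree, ("heading", [("para", [])]))
for h in hits:
    print(h.text, "->", h.children[0].text)
```

Keeping the tagging rules separate from the matcher means the format heuristics can be changed without touching the pattern language, which is one plausible reason to convert to semi-structured text before matching.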

Keywords

document image processing; feature extraction; pattern matching; text analysis; tree data structures; PDF file; information extraction; tree model; Computer science; Data mining; Educational institutions; Electronic mail; Engineering management; Information science; Information technology; Internet; Mathematics; PDF; Pattern match algorithm based on tree model; Semi-structured data

fLanguage

English

Publisher

IEEE

Conference_Titel

Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on

Conference_Location

Guangzhou, China

Print_ISBN

0-7803-9091-1

Type

conf

DOI

10.1109/ICMLC.2005.1527225

Filename

1527225