DocumentCode
2327576
Title
A new method of information extraction from PDF files
Author
Yuan, Fang ; Bo Lu
Author_Institution
Coll. of Math. & Comput. Sci., Hebei Univ., Baoding, China
Volume
3
fYear
2005
fDate
18-21 Aug. 2005
Firstpage
1738
Abstract
With the rapid increase in the number of PDF files on the Internet, managing and searching them efficiently has become an urgent problem. The most important step in solving this problem is extracting information from the PDF files. This paper presents a new method for doing so. The method first parses a PDF file to obtain its text and format information, then injects tags into the text to transform it into semi-structured text, and finally applies a pattern-matching algorithm based on a tree model to extract the target information. An experiment showed that the method was effective.
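The abstract names three stages (parse text with format information, inject tags, match a pattern against a tree) but gives no implementation details. The following is a minimal illustrative sketch of that pipeline, not the authors' algorithm. Everything in it is an assumption made for illustration: the `Span` and `Node` types, the use of font size as the only format cue, the tag names `title`, `heading` and `para`, and the top-down matching rule. The parsing stage is assumed to have already produced the list of spans.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Span:
    """A run of text plus one format attribute (hypothetical parser output)."""
    text: str
    font_size: float


@dataclass
class Node:
    """A node of the semi-structured document tree or of a pattern tree."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def inject_tags(spans: List[Span]) -> List[Tuple[str, str]]:
    """Assumed rule: largest font is the title, anything above body size is a heading."""
    if not spans:
        return []
    largest = max(s.font_size for s in spans)
    body = min(s.font_size for s in spans)
    tagged = []
    for s in spans:
        if s.font_size == largest and largest > body:
            tag = "title"
        elif s.font_size > body:
            tag = "heading"
        else:
            tag = "para"
        tagged.append((tag, s.text))
    return tagged


def build_tree(tagged: List[Tuple[str, str]]) -> Node:
    """Attach each paragraph to the most recent heading; headings hang off the root."""
    root = Node("doc")
    current = root
    for tag, text in tagged:
        node = Node(tag, text)
        if tag == "para":
            current.children.append(node)
        else:
            root.children.append(node)
            current = node if tag == "heading" else root
    return root


def matches(pattern: Node, node: Node) -> bool:
    """Top-down match: same tag, optional text substring, every pattern child found."""
    if pattern.tag != node.tag:
        return False
    if pattern.text and pattern.text.lower() not in node.text.lower():
        return False
    return all(
        any(matches(pc, nc) for nc in node.children) for pc in pattern.children
    )


def find(pattern: Node, root: Node) -> List[Node]:
    """Return every node in the tree at which the pattern matches."""
    hits = [root] if matches(pattern, root) else []
    for child in root.children:
        hits.extend(find(pattern, child))
    return hits


if __name__ == "__main__":
    spans = [
        Span("A new method of information extraction from PDF files", 18.0),
        Span("Abstract", 12.0),
        Span("This paper presents a new method.", 10.0),
        Span("Introduction", 12.0),
        Span("PDF files are widely used.", 10.0),
    ]
    tree = build_tree(inject_tags(spans))
    # Pattern: a heading containing "abstract" that has at least one paragraph.
    pattern = Node("heading", "abstract", [Node("para")])
    for hit in find(pattern, tree):
        print(hit.text, "->", [c.text for c in hit.children])
```

In this sketch the pattern is itself a small tree, so an extraction rule such as "the paragraph under the Abstract heading" is expressed structurally rather than as a regular expression over flat text. A real system would need richer format cues (font name, position, spacing) than the single font-size attribute assumed here.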
Keywords
document image processing; feature extraction; pattern matching; text analysis; tree data structures; PDF file; information extraction; tree model; Computer science; Data mining; Educational institutions; Electronic mail; Engineering management; Information science; Information technology; Internet; Mathematics; PDF; Pattern match algorithm based on tree model; Semi-structured data
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on
Conference_Location
Guangzhou, China
Print_ISBN
0-7803-9091-1
Type
conf
DOI
10.1109/ICMLC.2005.1527225
Filename
1527225