Title :
A document classification and extraction system with learning ability
Author :
Li, Xuhong ; Ng, Peter A.
Author_Institution :
Dept. of Comput. & Inf. Sci., New Jersey Inst. of Technol., Newark, NJ, USA
Abstract :
Document image processing begins at the OCR phase with the difficulty of automatic document analysis and understanding. Most existing systems only do well in their specific application domains. In this paper, we describe a domain-independent automatic document image understanding system with learning ability. A segmentation method based on “logical closeness” is proposed. A novel and natural representation of document layout structure, a directed weight graph (DWG), is described. To classify a given document, a string representation matching algorithm is applied first, instead of comparing all the sample graphs. A frame template and a document type hierarchy (DTH) are used to represent the document's logical structure and the hierarchical relationships among these frame templates, respectively. In this paper, two learning methodologies are applied: learning from experience and an enhanced perceptron learning algorithm.
Keywords :
directed graphs; document image processing; feature extraction; image classification; image segmentation; optical character recognition; perceptrons; string matching; unsupervised learning; OCR; automatic document analysis; directed weight graph; document classification system; document extraction system; document image processing; document layout structure; document logical structure; document type hierarchy; domain-independent automatic document image understanding system; enhanced perceptron learning algorithm; frame template; hierarchical relationships; image segmentation method; learning from experience; logical closeness; string representation matching algorithm; Application software; Computer science; Data mining; Document image processing; Educational institutions; Image segmentation; Information science; Optical character recognition software; Read only memory; Text processing;
Conference_Title :
Document Analysis and Recognition, 1999. ICDAR '99. Proceedings of the Fifth International Conference on
Conference_Location :
Bangalore
Print_ISBN :
0-7695-0318-7
DOI :
10.1109/ICDAR.1999.791758