DocumentCode :
3140557
Title :
A document classification and extraction system with learning ability
Author :
Li, Xuhong ; Ng, Peter A.
Author_Institution :
Dept. of Comput. & Inf. Sci., New Jersey Inst. of Technol., Newark, NJ, USA
fYear :
1999
fDate :
20-22 Sep 1999
Firstpage :
197
Lastpage :
200
Abstract :
Document image processing begins at the OCR phase with the difficulty of automatic document analysis and understanding. Most existing systems only do well in their specific application domains. In this paper, we describe a domain-independent automatic document image understanding system with learning ability. A segmentation method based on “logical closeness” is proposed. A novel and natural representation of document layout structure, a directed weight graph (DWG), is described. To classify a given document, a string representation matching algorithm is applied first, instead of comparing all the sample graphs. A frame template and a document type hierarchy (DTH) are used to represent the document's logical structure and the hierarchical relationships among these frame templates, respectively. In this paper, two learning methodologies are applied: learning from experience and an enhanced perceptron learning algorithm.
Keywords :
directed graphs; document image processing; feature extraction; image classification; image segmentation; optical character recognition; perceptrons; string matching; unsupervised learning; OCR; automatic document analysis; directed weight graph; document classification system; document extraction system; document image processing; document layout structure; document logical structure; document type hierarchy; domain-independent automatic document image understanding system; enhanced perceptron learning algorithm; frame template; hierarchical relationships; image segmentation method; learning from experience; logical closeness; string representation matching algorithm; Application software; Computer science; Data mining; Document image processing; Educational institutions; Image segmentation; Information science; Optical character recognition software; Read only memory; Text processing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition, 1999. ICDAR '99. Proceedings of the Fifth International Conference on
Conference_Location :
Bangalore
Print_ISBN :
0-7695-0318-7
Type :
conf
DOI :
10.1109/ICDAR.1999.791758
Filename :
791758