DocumentCode :
3497832
Title :
Improving OCR text categorization accuracy with electronic abstracts
Author :
Li, Linlin ; Tan, Chew Lim
Author_Institution :
Dept. of Comput. Sci., Nat. Univ. of Singapore
fYear :
2006
fDate :
27-28 April 2006
Lastpage :
87
Abstract :
Categorization of imaged documents is a useful technique for building document-image-based digital libraries. This paper investigates techniques to improve categorization accuracy on OCR text, particularly that of biomedical imaged documents. Experiments with different feature selection methods were run to explore their effect on categorization performance. The results show that document frequency is a good feature selection method in terms of eliminating OCR errors. Furthermore, our categorization scheme IMP, which combines OCR text and electronic abstracts, shows consistent improvement in accuracy compared to categorizing on either abstracts or OCR text alone.
Keywords :
digital libraries; document image processing; medical image processing; optical character recognition; text analysis; OCR text categorization accuracy improvement; biomedical imaged documents; document image based digital libraries; electronic abstracts; feature selection; imaged document categorization; Abstracts; Biomedical optical imaging; Image analysis; Image databases; Optical character recognition software; Optical noise; Software libraries; Terminology; Text analysis; Text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Image Analysis for Libraries, 2006. DIAL '06. Second International Conference on
Conference_Location :
Lyon
Print_ISBN :
0-7695-2531-8
Type :
conf
DOI :
10.1109/DIAL.2006.22
Filename :
1612949