Title :
Using character shape coding for information retrieval
Author :
Smeaton, A.F. ; Spitz, A.L.
Author_Institution :
Sch. of Comput. Applications, Dublin City Univ., Ireland
Abstract :
In conventional information retrieval, the task of finding users' search terms in a document is simple. When the document is not available in machine-readable format, optical character recognition (OCR) can usually be performed. We have developed a technique for performing information retrieval on document images with accuracy sufficient to be of great utility. The method makes generalisations about the images of characters, classifies them, and agglomerates the resulting character shape codes into word tokens. These tokens are sufficiently specific in their representation of the underlying words to allow reasonable retrieval performance. Using a collection of over 250 Mbytes of document texts and queries with known relevance assessments, we present a series of experiments to determine how various parameters in the retrieval strategy affect retrieval performance, and we obtain a surprisingly good result.
Keywords :
document image processing; image classification; image coding; information retrieval; optical character recognition; software performance evaluation; OCR; character shape coding; classification; document images; document texts; machine readable format; performance; queries; relevance assessments; search terms; word tokens; Character recognition; Computer applications; Computer interfaces; Humans; Image retrieval; Knowledge representation; Natural languages; Optical character recognition software; Shape
Conference_Title :
Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997
Conference_Location :
Ulm
Print_ISBN :
0-8186-7898-4
DOI :
10.1109/ICDAR.1997.620655