Title :
Text extraction from color documents - clustering approaches in three and four dimensions
fDate :
2001
Abstract :
Colored paper documents often contain important text information. For automating the retrieval process, identification of text elements is essential. In order to reduce the number of colors in a scanned document, color clustering is usually done first. In this article, two histogram-based color clustering algorithms are investigated. The first is based on the RGB color space exclusively, while the second takes spatial information into account in addition to the colors. Experimental results have shown that the use of spatial information in the clustering algorithm has a positive impact. Thus the automatic retrieval of text information can be improved. The proposed methods for clustering are not restricted to document images. They can also be used for processing Web or video images, for example.
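The abstract does not spell out either algorithm, so the following is only a minimal sketch of what histogram-based color clustering in RGB space can look like, not the authors' method. It assumes a coarse 3D histogram (a few bits per channel), takes the most populated bins as cluster centers, and assigns each pixel to the nearest center. The function names, the `bits` and `k` parameters, and the sample pixels are all hypothetical. The spatial variant is not sketched, since the abstract gives no detail on how spatial information enters.

```python
from collections import Counter

def histogram_color_clusters(pixels, bits=2, k=2):
    """Build a coarse RGB histogram and return the centers of the k fullest bins.

    Assumption: 'bits' high-order bits per channel define the histogram bins.
    """
    shift = 8 - bits  # drop low-order bits so each channel has 2**bits bins
    bins = Counter((r >> shift, g >> shift, b >> shift) for r, g, b in pixels)
    half = (1 << shift) // 2  # offset to the midpoint of a bin
    # Each center is the midpoint color of one of the k most populated bins.
    return [tuple((c << shift) + half for c in bin_key)
            for bin_key, _count in bins.most_common(k)]

def assign(pixels, centers):
    """Label each pixel with the index of its nearest center (squared RGB distance)."""
    def dist2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            for p in pixels]

# Hypothetical sample: three near-white background pixels, two near-black text pixels.
pixels = [(250, 250, 245), (255, 255, 255), (248, 252, 250), (10, 12, 8), (5, 0, 3)]
centers = histogram_color_clusters(pixels, bits=2, k=2)
labels = assign(pixels, centers)
```

With this sample, the light pixels share one bin and the dark pixels another, so the image is reduced to two colors and the smaller cluster can be treated as a text candidate.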
Keywords :
document image processing; image colour analysis; information retrieval; optical character recognition; OCR; RGB color space; Web images; color documents; document scanning; experimental results; histogram-based color clustering; spatial information; text extraction; text retrieval; video images; Books; Clustering algorithms; Color; Computer science; Data mining; Histograms; Machine assisted indexing; Marine vehicles; Mathematics
Conference_Titel :
Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on
Conference_Location :
Seattle, WA
Print_ISBN :
0-7695-1263-1
DOI :
10.1109/ICDAR.2001.953923