Title :
Fast identification of stop words for font learning and keyword spotting
Author_Institution :
Lucent Technol., AT&T Bell Labs., Murray Hill, NJ, USA
Abstract :
A recently proposed adaptive strategy for text recognition exploits the linguistic fact that over half of the words on a typical English page are among 150 common stop words. The small lexicon permits word-shape-based recognition that yields word identities, from which character prototypes can be extracted. This paper describes a fast procedure for locating the best candidates for those stop words. The procedure uses width statistics of individual words and their immediate neighbors. In an experiment using 400 page images, the method removed 63% of the words from consideration. The stop/nonstop word discrimination also assists keyword spotting for information retrieval.
Keywords :
character sets; document image processing; optical character recognition; English; adaptive strategy; experiment; font learning; information retrieval; keyword spotting; lexicon; linguistics; stop word identification; text recognition; word discrimination; word width statistics; word-shape based recognition; Character recognition; Data mining; Image recognition; Information retrieval; Prototypes; Shape; Statistics; Testing; Text recognition; Tin
Conference_Title :
Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR '99), 1999
Conference_Location :
Bangalore
Print_ISBN :
0-7695-0318-7
DOI :
10.1109/ICDAR.1999.791792