Regional Information Center for Science and Technology (RICeST) - Fast identification of stop words for font learning and keyword spotting

DocumentCode :

3141207

Title :

Fast identification of stop words for font learning and keyword spotting

Author :

Ho, Tin Kam

Author_Institution :

Lucent Technol., AT&T Bell Labs., Murray Hill, NJ, USA

fYear :

1999

fDate :

20-22 Sep 1999

Firstpage :

333

Lastpage :

336

Abstract :

A recently proposed adaptive strategy for text recognition relies on the linguistic fact that over half of the words on a typical English page are among 150 common stop words. The small lexicon permits word-shape based recognition that yields word identities from which character prototypes can be extracted. This paper describes a fast procedure for locating the best candidates for those stop words. The procedure uses width statistics of individual words and their immediate neighbors. In an experiment using 400 page images, the method removed 63% of the words from consideration. The stop/nonstop word discrimination also assists keyword spotting for information retrieval.

Keywords :

character sets; document image processing; optical character recognition; English; adaptive strategy; experiment; font learning; information retrieval; keyword spotting; lexicon; linguistics; stop word identification; text recognition; word discrimination; word width statistics; word-shape based recognition; Character recognition; Data mining; Image recognition; Information retrieval; Prototypes; Shape; Statistics; Testing; Text recognition; Tin;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition, 1999. ICDAR '99. Proceedings of the Fifth International Conference on

Conference_Location :

Bangalore

Print_ISBN :

0-7695-0318-7

Type :

conf

DOI :

10.1109/ICDAR.1999.791792

Filename :

791792

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3141207