DocumentCode
3190140
Title
A statistical refinement method for word shape token querying of document images
Author
O'Connor, Jerh ; Smeaton, Alan F.
Author_Institution
Sch. of Comput. Applications, Dublin City Univ., Ireland
fYear
1999
fDate
1999
Firstpage
572
Lastpage
576
Abstract
Word Shape Tokens (WSTs) are tokens used to represent words based on the overall shape or contour of a word as it appears in printed text. A character shape code (CSC) mapping function is used to aggregate similarly shaped letters such as “g” and “y” into one single code to represent those letters. The rationale behind this is that it is far easier and more accurate to map a scanned image of a word or letter into its WST representation than it is to map it into its full ASCII representation. In previous work we showed that user-mediated selection of WSTs for querying document images improved system performance. In the work reported here we use a statistically derived dataset to help determine whether or not a particular WST from a scanned document image actually matches a query term WST. We do this by comparing the preceding and following WSTs of each WST in a document against previously collected frequency data for a large set of WST occurrences.
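The two ideas in the abstract, the CSC mapping and the neighbour-based check, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the letter classes in char_shape_code, the function names, the "<s>" and "</s>" boundary markers, and the choice to score a candidate by the relative frequency of its (preceding WST, following WST) pair.

```python
from collections import Counter, defaultdict

# Hypothetical letter classes; the paper's actual CSC mapping is not given in this record.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def char_shape_code(ch: str) -> str:
    """Map one character to an assumed shape code."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"  # tall characters
    if ch in DESCENDERS:
        return "g"  # characters that drop below the baseline, e.g. "g" and "y"
    if ch in ("i", "j"):
        return ch   # dotted characters keep their own code in this sketch
    if ch.isalpha():
        return "x"  # remaining x-height letters
    return ch       # punctuation is passed through unchanged


def word_shape_token(word: str) -> str:
    """Build the WST of a word by concatenating its character shape codes."""
    return "".join(char_shape_code(c) for c in word)


def build_context_stats(words):
    """For each word, count the (preceding WST, following WST) pairs seen in a corpus."""
    stats = defaultdict(Counter)
    padded = ["<s>"] + [word_shape_token(w) for w in words] + ["</s>"]
    for i, word in enumerate(words):
        # padded[i + 1] is the WST of words[i]; its neighbours sit on either side.
        stats[word][(padded[i], padded[i + 2])] += 1
    return stats


def score_candidate(stats, query_word, prev_wst, next_wst):
    """Relative frequency of this neighbour pair among all contexts of query_word."""
    contexts = stats.get(query_word)
    if not contexts:
        return 0.0
    return contexts[(prev_wst, next_wst)] / sum(contexts.values())
```

Under these assumed classes, "dog" and "boy" collapse to the same WST, so a query for "dog" also matches the image of "boy". The neighbouring WSTs can then be used to separate the two: a candidate whose neighbours were frequently seen around "dog" in the reference corpus scores higher than one whose neighbours were not.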
Keywords
document image processing; optical character recognition; visual databases; ASCII representation; character shape code mapping function; contour; document images; similarly shaped letters; statistical refinement method; statistically derived dataset; system performance; user-mediated selection; word shape token querying; Aggregates; Frequency; Optical character recognition software; Search engines; Shape;
fLanguage
English
Publisher
IEEE
Conference_Titel
Database and Expert Systems Applications, 1999. Proceedings. Tenth International Workshop on
Conference_Location
Florence
Print_ISBN
0-7695-0281-4
Type
conf
DOI
10.1109/DEXA.1999.795248
Filename
795248