• DocumentCode
    3190140
  • Title

    A statistical refinement method for word shape token querying of document images

  • Author

    O'Connor, Jerh ; Smeaton, Alan F.

  • Author_Institution
    Sch. of Comput. Applications, Dublin City Univ., Ireland
  • fYear
    1999
  • fDate
    1999
  • Firstpage
    572
  • Lastpage
    576
  • Abstract
    Word Shape Tokens (WSTs) are tokens used to represent words based on the overall shape or contour of a word as it appears in printed text. A character shape code (CSC) mapping function is used to aggregate similarly shaped letters such as “g” and “y” into one single code to represent those letters. The rationale behind this is that it is far easier and more accurate to map a scanned image of a word or letter into its WST representation than it is to map it into its full ASCII representation. In previous work we showed that user-mediated selection of WSTs for querying document images improved system performance. In the work reported here we use a statistically derived dataset to help determine whether or not a particular WST from a scanned document image actually matches a query term WST. We do this by comparing the preceding and following WSTs of each WST in a document against previously collected frequency data for a large set of WST occurrences.
  • Keywords
    document image processing; optical character recognition; visual databases; ASCII representation; character shape code mapping function; contour; document images; similarly shaped letters; statistical refinement method; statistically derived dataset; system performance; user-mediated selection; word shape token querying; Aggregates; Frequency; Optical character recognition software; Search engines; Shape;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database and Expert Systems Applications, 1999. Proceedings. Tenth International Workshop on
  • Conference_Location
    Florence
  • Print_ISBN
    0-7695-0281-4
  • Type

    conf

  • DOI
    10.1109/DEXA.1999.795248
  • Filename
    795248