• DocumentCode
    3023204
  • Title

    An approach for stemming in symbolically compressed Indian language imaged documents

  • Author

    Garain, Utpal ; Datta, Alok Kumar

  • Author_Institution
    Comput. Vision & Pattern Recognition Unit, Indian Stat. Inst., Kolkata, India
  • fYear
    2005
  • fDate
    29 Aug.-1 Sept. 2005
  • Firstpage
    1080
  • Abstract
    Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots, thereby improving overall retrieval efficiency. This paper presents an algorithm for stemming in the context of a document image retrieval system. The algorithm assumes that the documents are symbolically compressed, and stemming is attempted in the compressed domain itself. Experiments have been conducted on Indian language imaged documents, for which efficient OCR still remains a challenging task. Results obtained from a set of 150 document images (in Bangla script, the second most popular script in the Indian sub-continent) consisting of about 12K words show promising performance for the proposed approach.
  • Keywords
    document handling; image retrieval; natural languages; optical character recognition; Bangla script; Indian language; compressed documents; document image retrieval system; information retrieval system; stemming algorithm; Character recognition; Computer vision; Image coding; Image retrieval; Image storage; Information retrieval; Internet; Optical character recognition software; Pattern recognition; Search engines
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
  • ISSN
    1520-5263
  • Print_ISBN
    0-7695-2420-6
  • Type

    conf

  • DOI
    10.1109/ICDAR.2005.45
  • Filename
    1575710