• DocumentCode
    3497832
  • Title
    Improving OCR text categorization accuracy with electronic abstracts
  • Author
    Li, Linlin ; Tan, Chew Lim
  • Author_Institution
    Dept. of Comput. Sci., Nat. Univ. of Singapore
  • fYear
    2006
  • fDate
    27-28 April 2006
  • Lastpage
    87
  • Abstract
    Categorization of imaged documents is a useful technique for building document-image-based digital libraries. This paper investigates techniques to improve categorization accuracy on OCR text, particularly that of biomedical imaged documents. Experiments with different feature selection methods were run to explore their effect on categorization performance. The results show that document frequency is a good feature selection method in terms of eliminating OCR errors. Furthermore, our categorization scheme IMP, which combines OCR text and electronic abstracts, shows consistent improvement in accuracy compared to categorizing on either abstracts or OCR text alone.
  • Keywords
    digital libraries; document image processing; medical image processing; optical character recognition; text analysis; OCR text categorization accuracy improvement; biomedical imaged documents; document image based digital libraries; electronic abstracts; feature selection; imaged document categorization; Abstracts; Biomedical optical imaging; Image analysis; Image databases; Optical character recognition software; Optical noise; Software libraries; Terminology; Text analysis; Text categorization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Image Analysis for Libraries, 2006. DIAL '06. Second International Conference on
  • Conference_Location
    Lyon
  • Print_ISBN
    0-7695-2531-8
  • Type
    conf
  • DOI
    10.1109/DIAL.2006.22
  • Filename
    1612949