• DocumentCode
    2148446
  • Title

    Searching OCR'ed Text: An LDA Based Approach

  • Author

    Hassan, Ehtesham ; Garg, Vikram ; Haque, S. K Mirajul ; Chaudhury, Santanu ; Gopal, M.

  • Author_Institution
    Dept. of Electr. Eng., Indian Inst. of Technol. Delhi, New Delhi, India
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    1210
  • Lastpage
    1214
  • Abstract
    Indexing and retrieval performance over a digitized document collection depends significantly on the performance of the available Optical Character Recognition (OCR). The paper presents a novel document indexing framework that accounts for document digitization errors in the indexing process to improve overall retrieval accuracy. The proposed indexing framework is based on topic modeling using Latent Dirichlet Allocation (LDA). The OCR's confidence in correctly recognizing a symbol is propagated into the topic learning process so that the semantic grouping of word examples carefully distinguishes between commonly confused words. We present a novel application of Lucene with topic modeling for document indexing. The experimental evaluation of the proposed framework is presented on a document collection in Devanagari script.
  • Keywords
    document image processing; information retrieval; learning (artificial intelligence); optical character recognition; Devanagari script; LDA based approach; Lucene; OCR text searching; digitized document collection; document indexing framework; latent dirichlet allocation; retrieval performance; semantic word grouping; topic learning process; Character recognition; Indexing; Optical character recognition software; Resource management; Semantics; Vectors; Vocabulary; Document Retrieval; Latent Dirichlet Allocation; Optical Character Recognition;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type

    conf

  • DOI
    10.1109/ICDAR.2011.244
  • Filename
    6065502