DocumentCode
2148446
Title
Searching OCR'ed Text: An LDA Based Approach
Author
Hassan, Ehtesham ; Garg, Vikram ; Haque, S. K Mirajul ; Chaudhury, Santanu ; Gopal, M.
Author_Institution
Dept. of Electr. Eng., Indian Inst. of Technol. Delhi, New Delhi, India
fYear
2011
fDate
18-21 Sept. 2011
Firstpage
1210
Lastpage
1214
Abstract
Indexing and retrieval performance over a digitized document collection depends significantly on the performance of the available Optical Character Recognition (OCR). This paper presents a novel document indexing framework that accounts for document digitization errors during indexing to improve overall retrieval accuracy. The proposed framework is based on topic modeling using Latent Dirichlet Allocation (LDA). The OCR's confidence in correctly recognizing a symbol is propagated into the topic learning process so that the semantic grouping of word examples carefully distinguishes between commonly confused words. We present a novel application of Lucene with topic modeling for document indexing. The experimental evaluation of the proposed framework is presented on a collection of documents in Devanagari script.
Keywords
document image processing; information retrieval; learning (artificial intelligence); optical character recognition; Devanagari script; LDA based approach; Lucene; OCR text searching; digitized document collection; document indexing framework; latent dirichlet allocation; retrieval performance; semantic word grouping; topic learning process; Character recognition; Indexing; Optical character recognition software; Resource management; Semantics; Vectors; Vocabulary; Document Retrieval; Latent Dirichlet Allocation; Optical Character Recognition;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location
Beijing
ISSN
1520-5363
Print_ISBN
978-1-4577-1350-7
Electronic_ISBN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2011.244
Filename
6065502