DocumentCode
1993652
Title
Impact of imperfect OCR on part-of-speech tagging
Author
Lin, Xiaofan
Author_Institution
Hewlett-Packard Labs., Palo Alto, CA, USA
fYear
2003
fDate
3-6 Aug. 2003
Firstpage
284
Abstract
Part-of-speech (POS) tagging is the foundation of natural language processing (NLP) systems, and thus has been an active area of research for many years. However, one question remains unanswered: How will a POS tagger behave when the input text is not error-free? This issue can be of great importance when the text comes from imperfect sources like optical character recognition (OCR). This paper analyzes the performance of both individual POS taggers and combination systems on imperfect text. Experimental results show that a POS tagger's accuracy decreases linearly with the character error rate and the slope indicates a tagger's sensitivity to input text errors.
Keywords
natural languages; optical character recognition; text analysis; NLP system; OCR; POS tagging; character error rate; imperfect text; natural language processing; part-of-speech tagging; text error; Application software; Character recognition; Computer errors; Data mining; Error analysis; Hidden Markov models; Natural language processing; Optical character recognition software; Optical sensors; Tagging
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on
Print_ISBN
0-7695-1960-1
Type
conf
DOI
10.1109/ICDAR.2003.1227674
Filename
1227674