Title :
Extraction of Spelling Variations from Language Structure for Noisy Text Correction
Author :
Gerdjikov, Stefan ; Mihov, Stoyan ; Nenchev, Vladislav
Author_Institution :
Fac. of Math. & Inf., Sofia Univ., Sofia, Bulgaria
Abstract :
We describe a novel approach for extracting spelling variations from a list of instances. It relates distinctive infixes of the instances to distinctive infixes of the referenced words. The distinctive infixes are extracted automatically from a (multi)set of instances and a referenced dictionary, without any additional expert knowledge. Based on the spelling variations retrieved during a learning (training) phase, we develop a correction algorithm that suggests and ranks candidates for a given noisy word. The main advantage of our approach is that it provides good corrections for unobserved noisy words while remaining almost perfect on words observed during learning. Our experimental results on the normalisation of a typical reference corpus of Early Modern English letters [1] significantly improve over previous results of VARD2 [2]. We also achieve better results than those reported in [3] and [4] on the OCR-correction of the TREC-5 Confusion Track corpus [5].
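The learn-then-correct idea summarised in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: instead of their distinctive-infix extraction, it assumes that a spelling variation can be approximated by stripping the longest common prefix and suffix from a (noisy, reference) pair, and it ranks candidates by raw observation counts. The function names (`differing_infixes`, `learn_variations`, `suggest`) and the toy word pairs are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def differing_infixes(noisy, reference):
    """Strip the longest common prefix and suffix; return the two differing middles.

    Assumption: this stands in for the paper's distinctive-infix extraction.
    """
    limit = min(len(noisy), len(reference))
    p = 0
    while p < limit and noisy[p] == reference[p]:
        p += 1
    s = 0
    while s < limit - p and noisy[len(noisy) - 1 - s] == reference[len(reference) - 1 - s]:
        s += 1
    return noisy[p:len(noisy) - s], reference[p:len(reference) - s]


def learn_variations(pairs):
    """Count how often each noisy infix was observed to map to a reference infix."""
    counts = defaultdict(Counter)
    for noisy, reference in pairs:
        if noisy != reference:
            noisy_infix, ref_infix = differing_infixes(noisy, reference)
            counts[noisy_infix][ref_infix] += 1
    return counts


def suggest(word, variations, dictionary):
    """Apply each learned variation at every position; keep dictionary words.

    Candidates are ranked by how often the variation that produced them was seen.
    """
    scores = Counter()
    for noisy_infix, targets in variations.items():
        if not noisy_infix:
            continue  # pure insertions are skipped to keep the sketch small
        start = word.find(noisy_infix)
        while start != -1:
            for ref_infix, seen in targets.items():
                candidate = word[:start] + ref_infix + word[start + len(noisy_infix):]
                if candidate in dictionary:
                    scores[candidate] += seen
            start = word.find(noisy_infix, start + 1)
    return [candidate for candidate, _ in scores.most_common()]


# Toy training pairs (illustrative only, not taken from the paper's corpora).
training = [("loue", "love"), ("haue", "have"), ("giue", "give"), ("doe", "do")]
variations = learn_variations(training)
suggestions = suggest("moue", variations, {"move", "mou", "love"})
```

Here "moue" never occurs in the training pairs, yet the learned u-to-v variation still proposes "move", which mirrors the abstract's point about correcting unobserved noisy words. A real system would need a smarter alignment and ranking model than this count-based one.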
Keywords :
document image processing; image denoising; natural language processing; spelling aids; text analysis; Early Modern English letters; OCR-correction; TREC-5 confusion track corpus; VARD2; automatic distinctive infix extraction; correction algorithm; expert knowledge; language structure; learning phase; noisy text correction; reference corpus; referenced dictionary; spelling variation extraction; Approximation methods; Dictionaries; Educational institutions; Hidden Markov models; Noise measurement; Training; Upper bound; finite state automata; noisy texts correction; spelling variations;
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.72