Title :
Towards Indian language spell-checker design
Author :
Chaudhuri, Bidyut Baran
Author_Institution :
Comput. Vision & Pattern Recognition Unit, Indian Stat. Inst., Calcutta, India
Abstract :
This paper deals with the development of a spell-checker for Indian languages, using as an example Bangla, the second most popular language of the Indian subcontinent. A brief review of the problems and the current state of Indian language spell-checkers is given. The approach for the Bangla spell-checker is then elaborated. The technique works in two stages. The first stage takes care of phonetic similarity errors. For that, phonetically similar characters are mapped to a single character code, and a new dictionary Dc is constructed over this reduced alphabet. A phonetically similar but wrongly spelt word can be easily corrected using this dictionary. The second stage takes care of errors other than phonetic similarity. A wrongly spelt word S of n characters is searched in the dictionary Dc. If S is a nonword, its first k1 ≤ n characters will match the beginning of a valid word in Dc (if k1 = n, then the word in Dc must be longer than n). A reversed-word dictionary Dr is also generated, in which the characters of each word are stored in reversed order. If the last k2 characters of S match the beginning of a word in Dr then, for a single error, the error is located within the intersection of the first k1+1 and the last k2+1 characters of S. We observed that this region is very small compared to the word length in most cases, and that the number of suggested correct words can be drastically reduced using this information. We have used our approach in correcting Bangla text, where the problem of inflection is tackled by a simplified version of a morphological analyser. Another problem encountered in Indian languages is the existence of a large number of compound words formed by euphony and assimilation. The problem of compound words is also carefully tackled.
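The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the Latin-letter word list, and the phonetic classes in PHONETIC_CLASS are hypothetical placeholders (the paper's actual Bangla character classes are not given here), and the dictionaries Dc and Dr are approximated by plain sets of prefixes rather than whatever structure the paper uses.

```python
# Hypothetical phonetic classes (assumption): each key is folded onto its value.
# The paper's real Bangla classes are not reproduced here.
PHONETIC_CLASS = {"k": "c", "z": "s"}


def fold(word):
    """Map phonetically similar characters to one code (stage 1)."""
    return "".join(PHONETIC_CLASS.get(ch, ch) for ch in word)


def phonetic_candidates(s, words):
    """Dictionary words whose folded form equals the folded form of s."""
    target = fold(s)
    return [w for w in words if fold(w) == target]


def prefix_set(words):
    """All non-empty prefixes of the given words (stand-in for Dc or Dr)."""
    return {w[:i] for w in words for i in range(1, len(w) + 1)}


def longest_match(s, prefixes):
    """Largest k such that the first k characters of s start some word."""
    k = 0
    while k < len(s) and s[: k + 1] in prefixes:
        k += 1
    return k


def error_region(s, words):
    """Inclusive 0-based index range that must contain a single error (stage 2).

    k1 comes from the forward dictionary, k2 from the reversed one; the
    region is the overlap of the first k1+1 and last k2+1 characters of s.
    """
    folded = [fold(w) for w in words]
    s = fold(s)
    n = len(s)
    k1 = longest_match(s, prefix_set(folded))
    k2 = longest_match(s[::-1], prefix_set(w[::-1] for w in folded))
    lo = max(0, n - k2 - 1)
    hi = min(n - 1, k1)
    return lo, hi


if __name__ == "__main__":
    words = ["pattern", "paper", "spell", "letter"]
    print(phonetic_candidates("kat", ["cat", "cot"]))
    print(error_region("pantern", words))
```

In this toy run, "pantern" shares the prefix "pa" with dictionary words (k1 = 2) and the suffix "tern" (k2 = 4), so the overlap of the first three and last five characters is the single position holding the wrong letter. Candidate corrections then only need to differ from S inside that region, which is how the abstract's reduction in suggestions would arise.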
Keywords :
computational linguistics; dictionaries; mathematical morphology; natural languages; spelling aids; Bangla; Indian language spell-checker design; alphabet; assimilation; character code; compound words; dictionary; euphony; inflection; intersection region; morphological analyser; phonetic similarity error; phonetically similar characters; reversed word dictionary; Computer errors; Computer interfaces; Computer vision; Dictionaries; Error correction; Information retrieval; Optical character recognition software; Optical computing; Pattern recognition; Speech recognition;
Conference_Title :
Language Engineering Conference, 2002. Proceedings
Print_ISBN :
0-7695-1885-0
DOI :
10.1109/LEC.2002.1182301