DocumentCode :
2844354
Title :
OOV words in an English-Arabic CLIR system
Author :
Bellaachia, Abdelghani ; Amor-Tijani, G.
Author_Institution :
Dept. of Comput. Sci., George Washington Univ., Washington, DC
fYear :
2008
fDate :
6-9 July 2008
Firstpage :
874
Lastpage :
882
Abstract :
Proper nouns are usually primary keys in a query, so translating them correctly may be necessary to maintain good retrieval performance in a cross-language information retrieval (CLIR) system. However, dictionaries include only the most commonly used proper nouns, such as major countries and capitals. Because proper nouns are spelling variants of each other in most languages, the common approach to finding the target-language correspondents of the original query key is approximate string matching against the target database index. The n-gram technique has proved to be the most effective of the approximate string matching techniques. Because we are dealing with an English-Arabic CLIR system, which involves two languages with different alphabets, we decided to combine transliteration with the n-gram technique to generate the different spelling variants of out-of-vocabulary (OOV) words. We call this technique Transliteration Ngram (TNG). One issue that arises with Arabic is that words spelled similarly can have different meanings depending on the context of the sentence. This is particularly true of proper names, which usually have a meaning when used as a verb or adjective. To further enhance our transliteration approach, we chose to use part-of-speech (POS) disambiguation to reduce the number of unrelated words in the set of transliterations obtained using TNG.
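Illustrative note :
The n-gram matching step mentioned in the abstract can be sketched roughly as follows. This is a minimal sketch and not the authors' TNG implementation: it assumes character bigrams scored with a Dice coefficient, and the names char_ngrams, dice_similarity and best_matches, the 0.5 threshold, and the Latin-script sample terms are all hypothetical. The transliteration and POS disambiguation steps are not shown.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with one space on each side."""
    padded = f" {word.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient over the character n-gram sets of two strings (assumed scoring)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_matches(candidate, index_terms, threshold=0.5, n=2):
    """Return (term, score) pairs at or above the threshold, best score first.

    `candidate` stands in for one transliterated spelling variant and
    `index_terms` for the terms of the target index; both are hypothetical inputs.
    """
    scored = [(term, dice_similarity(candidate, term, n)) for term in index_terms]
    kept = [(term, score) for term, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical index terms; a real system would match against Arabic index entries.
    print(best_matches("marrakech", ["marrakesh", "madrid"]))
```

In this sketch, a spelling variant that shares most of its bigrams with an index term is kept, while unrelated terms fall below the threshold; the threshold value would need tuning on real data.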
Keywords :
database indexing; information retrieval systems; language translation; natural language processing; query processing; string matching; vocabulary; English-Arabic CLIR system; N-gram technique; OOV words; POS; TNG; approximate string matching technique; cross language information retrieval system; original query key; out of vocabulary; part of speech disambiguation; target database index; transliteration Ngram; transliteration approach; Computer science; Databases; Degradation; Dictionaries; Indexes; Information retrieval; Speech enhancement; Vocabulary;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computers and Communications, 2008. ISCC 2008. IEEE Symposium on
Conference_Location :
Marrakech
ISSN :
1530-1346
Print_ISBN :
978-1-4244-2702-4
Electronic_ISBN :
1530-1346
Type :
conf
DOI :
10.1109/ISCC.2008.4625724
Filename :
4625724