Title :
A Dictionary Based Urdu Word Segmentation Using Maximum Matching Algorithm for Space Omission Problem
Author :
Rashid, Rasber ; Latif, Saeed
Author_Institution :
Coll. of Telecommun. Eng., Nat. Univ. of Sci. & Technol. (NUST), Islamabad, Pakistan
Abstract :
Word segmentation is the first step in any Natural Language Processing system. It means dividing a sentence into the words it consists of. Urdu was selected for this research because very little work has been done on it. In Urdu, the space character cannot be relied on to mark word boundaries because it is not used consistently. Urdu word segmentation differs from that of other Asian languages in that it involves both the Space Omission and the Space Insertion problem. This paper discusses these problems and suggests a technique that addresses both of them. It applies simple, established techniques in a different way to develop an efficient segmentation algorithm. Morphological analysis of Urdu text is also taken into account. A dictionary is used to verify and identify Urdu words. The work has been tested on words collected from the Geo, Jang, and BBC news sites and from other documents available online. The proposed algorithm was tested on 11,995 words, of which 97.2% were segmented correctly.
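As an illustration of the dictionary-based maximum matching idea named in the title, the following is a minimal sketch of generic greedy forward maximum matching over text with omitted spaces. It is not the authors' implementation: the function name `max_match`, the toy dictionary, and the single-character fallback for unknown text are assumptions made for this example, and the morphological analysis and space insertion handling mentioned in the abstract are not modeled.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching over a string with no spaces.

    `dictionary` is any container supporting `in` (a set is assumed here).
    This is an illustrative sketch, not the algorithm from the paper.
    """
    if max_len is None:
        # Longest dictionary entry bounds the candidate window.
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            # Assumed fallback: no dictionary word starts here,
            # so emit a single character and move on.
            j = i + 1
        words.append(text[i:j])
        i = j
    return words


# Hypothetical toy dictionary; a real system would load a full Urdu lexicon.
toy_words = ["پاکستان", "ایک", "ملک", "ہے"]
toy_dictionary = set(toy_words)

# Simulate space omission by concatenating the words without spaces.
unspaced = "".join(toy_words)
segmented = max_match(unspaced, toy_dictionary)
print(" ".join(segmented))
```

Because the match is greedy, the longest dictionary entry always wins at each position, which can produce a wrong split when a shorter word was intended; this is one reason a practical segmenter needs additional checks beyond plain maximum matching.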
Keywords :
Internet; dictionaries; electronic publishing; natural language processing; pattern matching; text analysis; word processing; Urdu word identification; Urdu word verification; dictionary based Urdu word segmentation; maximum matching algorithm; morphological Urdu text analysis; natural language processing system; news sites; online documents; space insertion problem; space omission problem; word boundary marking; Urdu Word Segmentation
Conference_Title :
Asian Language Processing (IALP), 2012 International Conference on
Conference_Location :
Hanoi
Print_ISBN :
978-1-4673-6113-2
Electronic_ISBN :
978-0-7695-4886-9
DOI :
10.1109/IALP.2012.11