Title :
The design and implementation of diacritic extraction technique for Gujarati written script using Unicode Transformation Format
Author :
Rakholia, Rajnish M. ; Saini, Jatinderkumar R.
Author_Institution :
Sch. of Comput. Sci., R.K. Univ., Rajkot, India
Abstract :
Gujarati is an Indo-Aryan language; the Indo-Aryan languages are in turn a branch of the Indo-European family. Almost no automation tools are available for processing the written script of Gujarati, owing to the complexity of Gujarati grammar and the complex structure of the Gujarati written script. This paper presents the design and implementation of diacritic extraction for the Gujarati script using the Unicode Transformation Format (UTF). The work falls within Natural Language Processing (NLP): we have designed and implemented a training-dataset-independent tokenization algorithm for diacritic extraction using the 8-bit UTF (UTF-8) in Java, a free and open-source programming language. The algorithm has been designed to be independent of font size, font type and font style, as well as of the type of literary work, such as prose, poetry or ghazal. Executed on more than 60,000 tokens extracted from 138 Gujarati documents, each for Portable Document Format (PDF) and non-PDF format, the algorithm yields an accuracy of 99.58%. The accuracy for text files has been found to be 0.77% higher than that for PDF files. The results are encouraging enough to make the proposed implementation viable for NLP tasks in the Gujarati language. On the sidelines of the paper, we also present a future research direction targeted towards improving the efficiency and accuracy of stemming, part-of-speech tagging (POS-tagging) and text mining in the Gujarati language.
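The abstract does not give the algorithm itself, so the following Java sketch is only an illustration of what training-free diacritic extraction over UTF-8 text can look like. It is not the authors' method. The class name GujaratiDiacritics and the methods isGujaratiDiacritic, tokenize and extractDiacritics are hypothetical, and the sketch assumes that a "diacritic" is any combining mark (Unicode general category Mn or Mc) in the Gujarati block U+0A80 to U+0AFF, and that tokens are separated by whitespace.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class GujaratiDiacritics {

    /** Assumption: a diacritic is a combining mark inside the Gujarati Unicode block. */
    static boolean isGujaratiDiacritic(int codePoint) {
        if (Character.UnicodeBlock.of(codePoint) != Character.UnicodeBlock.GUJARATI) {
            return false;
        }
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    /** Assumption: tokens are whitespace-separated; no training data is involved. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String candidate : text.trim().split("\\s+")) {
            if (!candidate.isEmpty()) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    /** Returns the code points of all Gujarati combining marks in a token, in order. */
    static List<Integer> extractDiacritics(String token) {
        return token.codePoints()
                .filter(GujaratiDiacritics::isGujaratiDiacritic)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two Gujarati words written with Unicode escapes, round-tripped through
        // UTF-8 bytes to mimic reading an encoded text file.
        String sample = "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0 \u0AAD\u0ABE\u0AB7\u0ABE";
        byte[] utf8 = sample.getBytes(StandardCharsets.UTF_8);
        String text = new String(utf8, StandardCharsets.UTF_8);

        for (String token : tokenize(text)) {
            StringBuilder marks = new StringBuilder();
            for (int codePoint : extractDiacritics(token)) {
                marks.append(String.format("U+%04X ", codePoint));
            }
            System.out.println(token + " -> " + marks.toString().trim());
        }
    }
}
```

Because the check relies only on code-point properties, a sketch of this kind is unaffected by font size, type or style, which matches the independence the abstract claims. Text extracted from PDF files may need clean-up before this step, which the sketch does not attempt.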
Keywords :
data mining; grammars; natural language processing; text analysis; Gujarati documents; Gujarati grammar; Gujarati language; Gujarati written script framework; Indo-Aryan origin; Indo-European languages; Java; NLP; POS-tagging; UTF; diacritic extraction technique; free programming language; nonPDF format; open source language; part-of-speech tagging; portable document format; stemming; text files; text mining; training-dataset independent tokenization algorithm; unicode transformation format; Standards; Diacritic; Gujarati; Natural Language Processing (NLP); POS-Tagging; Stemming; Unicode Transformation Format (UTF)
Conference_Title :
Electrical, Computer and Communication Technologies (ICECCT), 2015 IEEE International Conference on
Conference_Location :
Coimbatore
Print_ISBN :
978-1-4799-6084-2
DOI :
10.1109/ICECCT.2015.7226037