DocumentCode :
3581196
Title :
Tokenizer for the Malay language using pattern matching
Author :
Abu Bakar, Juhaida ; Omar, Khairuddin ; Nasrudin, Mohammad Faidzul ; Murah, Mohd Zamri
Author_Institution :
Sch. of Comput., Univ. Utara Malaysia, Sintok, Malaysia
fYear :
2014
Firstpage :
140
Lastpage :
144
Abstract :
Tokenization is a fundamental task in text processing. Among other things, segmentation is used to identify units of information such as sentences and words. In this paper, we discuss the Natural Language ToolKit (NLTK) tokenizer as a means of manipulating patterns within text. The purpose of this work is to build a Natural Language Processing (NLP) base for a Jawi corpus. A series of experiments was performed to validate the corpus and to meet the requirements of a Jawi script tokenizer, with promising results. Based on these results, the tokens will be used in the tagging process.
Keywords :
natural language processing; string matching; text analysis; Jawi corpus; Jawi script tokenizer; Malay language; NLP; NLTK tokenizer; Natural Language ToolKit; pattern matching; text processing; Electronic mail; Regular expression; Tokenization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Intelligent Systems Design and Applications (ISDA), 2014 14th International Conference on
Print_ISBN :
978-1-4799-7937-0
Type :
conf
DOI :
10.1109/ISDA.2014.7066258
Filename :
7066258