Regional Information Center for Science and Technology - Tokenizer for the Malay language using pattern matching

DocumentCode :

3581196

Title :

Tokenizer for the Malay language using pattern matching

Author :

Abu Bakar, Juhaida ; Omar, Khairuddin ; Nasrudin, Mohammad Faidzul ; Murah, Mohd Zamri

Author_Institution :

Sch. of Comput., Univ. Utara Malaysia, Sintok, Malaysia

Year :

2014

Firstpage :

140

Lastpage :

144

Abstract :

Tokenization is a fundamental task in text processing. Among other tasks, the segmentation process is used to identify information units such as sentences and words. In this paper, we discuss the Natural Language ToolKit (NLTK) tokenizer as a step toward manipulating patterns within text. The purpose of this work is to build a Natural Language Processing (NLP) base for a Jawi corpus. A series of experiments was performed to validate the corpus and fulfill the requirements of the Jawi script tokenizer, with promising results. Based on these results, the tokens will be used in the tagging process.
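
The abstract does not give the patterns the authors used, so the following is only a minimal sketch of tokenization by pattern matching, not the paper's method. It uses Python's standard `re` module as a stand-in for the NLTK tokenizer named above. The pattern and the `tokenize` helper are assumptions made for illustration, and the pattern treats combining diacritics as separate tokens, so it suits undiacritized Jawi text only.

```python
import re

# Assumed illustrative pattern (not taken from the paper):
# either a run of Unicode word characters, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return the word and punctuation tokens found in text, in order."""
    return TOKEN_PATTERN.findall(text)


# Rumi (Latin-script) Malay sample.
print(tokenize("Bahasa Melayu, 2014."))

# Jawi (Arabic-script) sample; Python 3 treats Arabic letters as word characters.
print(tokenize("بهاس ملايو."))
```

Because `\w` is Unicode-aware in Python 3, the same pattern splits both Rumi and Jawi text into words and punctuation without a script-specific character range.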

Keywords :

natural language processing; string matching; text analysis; Jawi corpus; Jawi script tokenizer; Malay language; NLP; NLTK tokenizer; Natural Language ToolKit; pattern matching; text processing; Electronic mail; Regular expression; Tokenization

Language :

English

Publisher :

IEEE

Conference_Title :

Intelligent Systems Design and Applications (ISDA), 2014 14th International Conference on

Print_ISBN :

978-1-4799-7937-0

Type :

conf

DOI :

10.1109/ISDA.2014.7066258

Filename :

7066258

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3581196