DocumentCode :
599277
Title :
Characteristics of a Malay journalistic corpus
Author :
Zamin, Norshuhani ; Oxley, Alan ; Bakar, Z.A. ; Farhan, S.A.
Author_Institution :
Fac. of Sci. & Inf. Technol., Univ. Teknol. PETRONAS, Tronoh, Malaysia
fYear :
2012
fDate :
23-26 Sept. 2012
Firstpage :
214
Lastpage :
218
Abstract :
This paper presents a detailed linguistic study of a Malay journalistic corpus describing Indonesian terrorism. The initial raw text was manually annotated for parts of speech. It is the first corpus of its kind established in Malaysia. The objective of this research is to conduct an empirical analysis of the actual patterns of use in journalistic texts. This paper presents the characteristics of the Malay terrorism corpus, which include its properties, word classes, named entities and word occurrences. The results of this work are given purely in terms of the characteristics of a Malay terrorism corpus. The results are highly useful for solving larger tasks in Natural Language Processing, such as Information Retrieval and Information Extraction, in the area of terrorism.
Keywords :
electronic publishing; natural language processing; social sciences computing; terrorism; Indonesian terrorism; Malay journalistic corpus; Malay terrorism corpus; Malaysia; information extraction; information retrieval; linguistics study; natural language processing area; parts-of-speech; Computers; Electronic publishing; Information services; Internet; Speech; Speech processing; Terrorism; corpus; linguistic; linguistic patterns; terrorism articles; word distribution;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Control, Systems & Industrial Informatics (ICCSII), 2012 IEEE Conference on
Conference_Location :
Bandung
Print_ISBN :
978-1-4673-1022-2
Type :
conf
DOI :
10.1109/CCSII.2012.6470503
Filename :
6470503