Title :
A statistical dictionary-based word alignment algorithm: An unsupervised approach
Author :
Zamin, Norshuhani ; Oxley, Alan ; Abu Bakar, Zainab ; Farhan, Syed Ahmad
Author_Institution :
Fac. of Sci. & Inf. Technol., Univ. Teknol. PETRONAS, Tronoh, Malaysia
Abstract :
Malay is categorized as a resource-poor language, so research on corpus linguistics for Malay is limited. This paper discusses an automated process for applying part-of-speech (POS) tags to Malay words. Conventional tagging works well on static grammatical classes with little ambiguity, as in most research on resource-rich languages. However, the grammatical classes of Malay are dynamic: adjectives can act as verbs or adverbs, and vice versa. This makes automatic POS tagging of Malay a chaotic and challenging process. No labelled data is publicly available for Malay, and hand-crafted corpora are labour-intensive and time-consuming to build. Hence, this paper introduces an unsupervised technique for tagging Malay texts, using terrorism texts as a case study. The technique partially overcomes the shortage of annotated resources for Malay and the labour intensity of building a hand-tagged corpus. It does not require any labelled training data; instead, it involves translating the texts into a resource-rich language, i.e. English, and a dictionary look-up. When its results are compared with those of human annotators, the unsupervised technique reaches 76% precision and 67% recall.
Keywords :
dictionaries; grammars; natural language processing; statistical analysis; Malay terrorism texts; Malay words; automated process; automatic POS tagging; corpus linguistics; hand-crafted corpora; hand-tagged corpus; part-of-speech tags; resource-poor language; static grammatical classes; statistical dictionary-based word alignment algorithm; unsupervised approach; unsupervised technique; Grammar; bigram; bitext mapping; Dice coefficient; Malay language; part-of-speech tagging
Conference_Title :
Computer & Information Science (ICCIS), 2012 International Conference on
Conference_Location :
Kuala Lumpur
Print_ISBN :
978-1-4673-1937-9
DOI :
10.1109/ICCISci.2012.6297278