Automatic Arabic term extraction from special domain corpora

Author

Al-Thubaity, Abdul Mohsen ; Khan, Mahrukh ; Alotaibi, Saad ; Alonazi, Badriyya

Author_Institution

King Abdulaziz City for Sci. & Technol., Riyadh, Saudi Arabia

fYear

2014

fDate

20-22 Oct. 2014

Firstpage

1

Lastpage

5

Abstract

The availability of machine-readable Arabic special domain text in digital libraries, websites of Arabic university publications, and refereed journals fosters numerous interesting studies and applications. Among these applications is automatic term extraction from special domain corpora. The extracted terms can serve as a foundation for other applications and research, such as special domain dictionary building, terminology resource creation, and special domain ontology construction. Our literature survey shows a lack of such studies for Arabic special domain text; moreover, the few studies that have been identified use complex and computationally expensive methods. In this study, we use two basic methods to automatically extract terms from Arabic special domain corpora. Our methods are based on two simple heuristics: the most frequent words and n-grams in special domain corpora are typically terms, and these terms are typically bounded by functional words. We applied our methods to a corpus of applied Arabic linguistics. We obtained results comparable to those of other Arabic term extraction studies: 87% accuracy when only terms strictly pertaining to the field of applied Arabic linguistics were considered, and 93.7% when related terms were included.
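
The two heuristics in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `candidate_terms`, the tiny `FUNCTION_WORDS` list, and the `max_len` cutoff are all assumptions made for illustration, and the input is assumed to be already tokenized on whitespace. The sketch splits the token stream at functional words, treats each remaining run of tokens as a candidate, and ranks candidates by frequency.

```python
from collections import Counter

# Hypothetical, deliberately tiny list of Arabic functional words.
# A real list would be far larger; this one is only for illustration.
FUNCTION_WORDS = {"في", "من", "على", "إلى", "عن", "و", "أن", "هذا"}


def candidate_terms(tokens, function_words, max_len=3):
    """Count token runs that are bounded by functional words.

    tokens: list of word strings (assumed already tokenized).
    max_len: assumed cutoff; runs longer than this are discarded.
    Returns a Counter mapping each candidate string to its frequency.
    """
    counts = Counter()
    chunk = []
    # The trailing None is a sentinel that flushes the final run.
    for tok in list(tokens) + [None]:
        if tok is None or tok in function_words:
            if 0 < len(chunk) <= max_len:
                counts[" ".join(chunk)] += 1
            chunk = []
        else:
            chunk.append(tok)
    return counts


if __name__ == "__main__":
    text = "علم اللغة التطبيقي في تعليم اللغة من علم اللغة التطبيقي"
    ranked = candidate_terms(text.split(), FUNCTION_WORDS).most_common()
    for term, freq in ranked:
        print(freq, term)
```

The most frequent candidates in the ranked list would then be reviewed as likely terms. Sentence boundaries, punctuation, and Arabic clitics attached to words are ignored here, so real text would need preprocessing that this sketch does not attempt.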

Keywords

dictionaries; linguistics; natural language processing; ontologies (artificial intelligence); Arabic special domain corpora; Arabic university publications; Web sites; applied Arabic linguistics; automatic Arabic term extraction; automatic term extraction; digital libraries; frequent words; functional words; heuristics; machine-readable Arabic special domain text; n-grams; special domain dictionary building; special domain ontology construction; terminology resource creation; Accuracy; Buildings; Data mining; Dictionaries; Pragmatics; Semantics; Arabic term extraction; special domain corpora; term frequency-inverse document frequency; terminology resources

fLanguage

English

Publisher

IEEE

Conference_Titel

Asian Language Processing (IALP), 2014 International Conference on

Conference_Location

Kuching

Type

conf

DOI

10.1109/IALP.2014.6973468

Filename

6973468