Title :
Challenges in Developing Persian Corpora from Online Resources
Author :
Ghayoomi, Masood ; Momtazi, Saeedeh
Author_Institution :
Dept. of Comput. Linguistics, Saarland Univ., Saarbrücken, Germany
Abstract :
Persian is an Indo-European language that has borrowed its script from Arabic, a member of the Semitic language family. Because the Persian and Arabic scripts are so similar, problems arise when processing electronic text. This paper discusses some of the common problems encountered in practice when developing a Persian corpus from online materials. The sources of these problems are the Persian script itself; mixture with the Arabic script; Persian orthography; typists' typing styles; and the mixing of Persian code pages with Arabic code pages in operating systems.
Keywords :
natural language processing; text analysis; Persian corpora; Semitic language family; electronic text processing; online resources; Books; Computational linguistics; Mood; Natural languages; Operating systems; Spatial databases; Speech; Telephony; Web pages; Writing;
Conference_Title :
Asian Language Processing, 2009. IALP '09. International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-0-7695-3904-1
DOI :
10.1109/IALP.2009.31