Title :
Identification of phoneme and its distribution of Malay language derived from Friday sermon transcripts
Author :
Asyafie, Muhammad Aasim ; Harun, Mokhtar ; Shapiai, Mohd Ibrahim ; Khalid, Puspa Inayat
Author_Institution :
Fac. of Electr. Eng., Univ. Teknol. Malaysia, Johor Bahru, Malaysia
Abstract :
Lack of text data is one of the main issues encountered by Malay speech researchers. Currently, there are few established Malay text corpora to aid their research. Text corpora are essential because they provide empirical data for researchers in linguistics and are useful for constructing word lists for speech intelligibility tests, speech analysis across genders, and automatic speech recognition. A text corpus also needs to mimic the natural phoneme distribution of the language it represents. To accomplish this, we need to know the phonetic distribution of the language. The purpose of this research is to derive a phoneme distribution for the Malay language based on transcripts of fifty-two Friday sermons. The transcripts were obtained from the official government website and then standardized by removing images and foreign letters, expanding acronyms and short forms, and converting numbers and symbols to the appropriate Malay words. The transcripts were then phonetically transcribed by first identifying the language rules and then writing a program based on those rules. The program was written in PHP, and the data were stored in a MySQL database. The data were then retrieved and compared with the Malay words used in news broadcasts. In conclusion, the Malay used in Friday sermons and in news broadcasts differs in the usage of the phonemes /a/, /e/, /o/, /d/, /p/, /tʃ/, /n/, /l/, /h/ and /r/.
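Illustrative sketch (not the authors' program, which the abstract says was written in PHP with MySQL storage): the rule-based transcription and phoneme counting described above could look roughly like the following in Python. The rule tables `DIGRAPHS` and `SINGLES` and the function names are hypothetical and heavily simplified; for instance, they do not distinguish the two pronunciations of the letter "e", and every unlisted letter is assumed to map to itself.

```python
from collections import Counter

# Assumed, simplified grapheme-to-phoneme rules; not the paper's rule set.
DIGRAPHS = {"ng": "ŋ", "ny": "ɲ", "sy": "ʃ", "kh": "x", "gh": "ɣ"}
SINGLES = {"c": "tʃ", "j": "dʒ", "y": "j"}


def transcribe(word):
    """Return a list of phoneme symbols for one already-normalized word."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            # Two-letter graphemes take priority over single letters.
            phonemes.append(DIGRAPHS[pair])
            i += 2
        elif word[i].isalpha():
            # Letters without a special rule are assumed to map to themselves.
            phonemes.append(SINGLES.get(word[i], word[i]))
            i += 1
        else:
            # Skip punctuation, digits, and other non-letters.
            i += 1
    return phonemes


def phoneme_distribution(text):
    """Return the relative frequency of each phoneme in whitespace-separated text."""
    counts = Counter()
    for word in text.split():
        counts.update(transcribe(word))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {phoneme: n / total for phoneme, n in counts.items()}
```

Calling `phoneme_distribution` on the sermon text and on the news text separately would give two frequency tables that can be compared phoneme by phoneme, which is the kind of comparison the abstract reports.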
Keywords :
SQL; Web sites; computational linguistics; speech processing; speech recognition; text analysis; Friday sermon transcripts; Malay text corpora; Malay words; MySQL; PHP; acronyms; automatic speech recognition; data retrieval; data storage; empirical data; foreign letter removal; genders; image letter removal; language rules; linguistics field; natural phoneme; news broadcast; official government Web site; personal home page; phoneme distribution; phoneme identification; phoneme usage; phonetic distribution; phonetic transcription; sequential query language; short forms; speech analysis; speech intelligibility test; text data; word lists; Companies; Context; Correlation; Databases; Speech; Spreadsheet programs; Terminology; Bahasa Melayu; Speech; speech clarity; speech intelligibility;
Conference_Title :
Research and Development (SCOReD), 2014 IEEE Student Conference on
Print_ISBN :
978-1-4799-6427-7
DOI :
10.1109/SCORED.2014.7072964