Development of a standard text and speech corpus for the Punjabi language

Author

Dhanjal, Surinder ; Bhatia, Satvinder Singh

Author_Institution

Dept. of Comput. Sci., Thompson Rivers Univ., Kamloops, BC, Canada

fYear

2013

fDate

25-27 Nov. 2013

Firstpage

1

Lastpage

6

Abstract

This paper presents a new text and speech corpus for the Punjabi language, a modern Indo-Aryan language. Punjabi is ranked among the most widely spoken languages in the world; over the years, its rank has varied between 10 and 18. Research on Punjabi therefore has international significance. Punjabi is the native language of the Punjab region, which spans two countries: East Punjab in India and West Punjab in Pakistan. The language has many dialects, and two different scripts are used across the two countries. Designing a text or speech corpus that fully covers all dialects in both scripts would be an enormous task. This work therefore concentrates on a single dialect, Malwai. The paper describes at least 20 unique features of the newly designed corpus.

Keywords

natural languages; speech processing; text analysis; East Punjab; India; Indo-Aryan language; Malwai dialect; Punjabi language; West Punjab; speech corpus; standard text corpus development; Agriculture; Animals; Cities and towns; Databases; Speech; Vegetation; Corpora development; Gurmukhi Script; IPA; Malwa; Text corpus

fLanguage

English

Publisher

IEEE

Conference_Titel

2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE)

Conference_Location

Gurgaon

Type

conf

DOI

10.1109/ICSDA.2013.6709891

Filename

6709891