Classification Based on Speech Rhythm via a Temporal Alignment of Spoken Sentences

Author

Heo, Inseok ; Sethares, William A.

Author_Institution

Department of Electrical and Computer Engineering, University of Wisconsin–Madison, Madison, United States

Volume

23

Issue

12

fYear

2015

Firstpage

2209

Lastpage

2216

Abstract

How much information is contained in the rhythm of speech? Is it possible to tell, just from the rhythm of the speech, whether the speaker is male or female? Is it possible to tell whether the speaker is native or nonnative? This paper provides a new way to address such questions. Traditional investigations into speech rhythm approach the problem by manually annotating the speech and investigating a preselected collection of features, such as the durations of vowels or inter-phoneme timings. This paper presents a method that automatically aligns the audio of multiple people speaking the same sentence. The output of the alignment procedure is a mapping (from the micro-timing of one speaker to that of another) that can be used as a surrogate for speech rhythm. The method is applied to a large online corpus of speakers, the Speech Accent Archive, and shows that it is possible to classify the speakers based on these mappings alone. Several technical aspects are discussed. First, the spectrograms switch between analysis windows of different lengths (based on whether the speech is voiced or unvoiced) to ameliorate the time-frequency trade-off. These variable-window spectrograms are fed into a dynamic time warping algorithm to produce a timing map that represents the speech rhythm. The accuracy of the alignment is evaluated by a technique of transitive validation, and the timing maps are used to form a feature vector for the classification. In the gender discrimination experiments, the proposed method was only about 5% worse than a state-of-the-art classifier based on spectral feature vectors. In the native speaker discrimination task, classification based on speech rhythm was about 15% better than classification based on spectral information.
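The alignment step described in the abstract can be illustrated with a minimal sketch of textbook dynamic time warping. This is not the authors' implementation: the function name `dtw_timing_map` is hypothetical, the inputs are generic frame-by-feature matrices standing in for the variable-window spectrograms, and the Euclidean frame distance and the three-step recursion are assumptions chosen for simplicity.

```python
import numpy as np


def dtw_timing_map(X, Y):
    """Align two feature sequences with standard dynamic time warping.

    X has shape (n, d) and Y has shape (m, d); each row is one analysis frame.
    Returns (path, total_cost), where path is a list of (i, j) pairs mapping
    frame i of X to frame j of Y. The path plays the role of a timing map.
    Assumption: Euclidean frame distance and unit step sizes.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, m = len(X), len(Y)

    # Pairwise distance between every frame of X and every frame of Y.
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

    # Accumulated cost with a padded border of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Backtrack from the end of both sequences to the start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    path.reverse()
    return path, float(acc[n, m])
```

When one sequence is a time-stretched copy of the other, the returned path traces the stretch, and its deviation from a straight diagonal is the kind of quantity that could be summarized into a rhythm feature vector.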

Keywords

Hidden Markov models; IEEE transactions; Rhythm; Spectrogram; Speech; Speech processing; Timing; Automated alignment; speech accent; speech prosody; speech rhythm; transitive validation; variable length windows

fLanguage

English

Journal_Title

Audio, Speech, and Language Processing, IEEE/ACM Transactions on

Publisher

IEEE

ISSN

2329-9290

Type

jour

DOI

10.1109/TASLP.2015.2475155

Filename

7230254