DocumentCode
741234
Title
Classification Based on Speech Rhythm via a Temporal Alignment of Spoken Sentences
Author
Heo, Inseok ; Sethares, William A.
Author_Institution
Department of Electrical and Computer Engineering, University of Wisconsin–Madison, Madison, United States
Volume
23
Issue
12
fYear
2015
Firstpage
2209
Lastpage
2216
Abstract
How much information is contained in the rhythm of speech? Is it possible to tell, just from the rhythm of the speech, whether the speaker is male or female? Is it possible to tell whether the speaker is native or nonnative? This paper provides a new way to address such questions. Traditional investigations into speech rhythm approach the problem by manually annotating the speech and investigating a preselected collection of features, such as the durations of vowels or inter-phoneme timings. This paper presents a method that automatically aligns the audio of multiple people speaking the same sentence. The output of the alignment procedure is a mapping (from the micro-timing of one speaker to that of another) that can be used as a surrogate for speech rhythm. The method is applied to a large online corpus of speakers, the Speech Accent Archive, and shows that it is possible to classify the speakers based on these mappings alone. Several technical aspects are discussed. First, the spectrograms switch between analysis windows of different lengths (based on whether the speech is voiced or unvoiced) to ameliorate the time-frequency trade-off. These variable-window spectrograms are fed into a dynamic time warping algorithm to produce a timing map that represents the speech rhythm. The accuracy of the alignment is evaluated by a technique of transitive validation, and the timing maps are used to form a feature vector for the classification. In the gender discrimination experiments, the proposed method was only about 5% worse than a state-of-the-art classifier based on spectral feature vectors. In the native speaker discrimination task, the speech rhythm features performed about 15% better than spectral information.
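The alignment step described in the abstract can be illustrated with a minimal sketch of textbook dynamic time warping. This is not the authors' implementation: it assumes a Euclidean distance between frames and the standard three-step recursion, and it takes any pair of feature sequences rather than the variable-window spectrograms the paper uses. The function name `dtw_timing_map` is hypothetical.

```python
import numpy as np


def dtw_timing_map(X, Y):
    """Align two feature sequences with plain dynamic time warping.

    X has n frames and Y has m frames (1-D, or 2-D with one row per frame).
    Returns (path, total_cost), where path is a list of (i, j) frame-index
    pairs mapping the timing of X onto the timing of Y.
    Assumption: Euclidean frame distance, unconstrained step pattern.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    if Y.ndim == 1:
        Y = Y[:, None]
    n, m = len(X), len(Y)

    # Pairwise distance between every frame of X and every frame of Y.
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

    # Accumulated cost with a padded border of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance both sequences
                acc[i - 1, j],      # advance X only
                acc[i, j - 1],      # advance Y only
            )

    # Backtrack from the end of both sequences to the start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
        path.append((i - 1, j - 1))
    path.reverse()
    return path, float(acc[n, m])


# Toy example: the second "speaker" lingers on the first frame.
path, total = dtw_timing_map([0, 1, 2, 3], [0, 0, 1, 2, 3])
```

In this toy case the path pairs the first frame of the first sequence with the first two frames of the second, which is the kind of local stretching that a timing map records.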
Keywords
Hidden Markov models; IEEE transactions; Rhythm; Spectrogram; Speech; Speech processing; Timing; Automated alignment; speech accent; speech prosody; speech rhythm; transitive validation; variable length windows
fLanguage
English
Journal_Title
Audio, Speech, and Language Processing, IEEE/ACM Transactions on
Publisher
IEEE
ISSN
2329-9290
Type
jour
DOI
10.1109/TASLP.2015.2475155
Filename
7230254