DocumentCode :
10091
Title :
Generative Modeling of Voice Fundamental Frequency Contours
Author :
Kameoka, Hirokazu ; Yoshizato, Kota ; Ishihara, Tatsuma ; Kadowaki, Kento ; Ohishi, Yasunori ; Kashino, Kunio
Author_Institution :
Grad. Sch. of Inf. Sci. & Technol., Univ. of Tokyo, Tokyo, Japan
Volume :
23
Issue :
6
fYear :
2015
fDate :
Jun-15
Firstpage :
1042
Lastpage :
1053
Abstract :
This paper introduces a generative model of voice fundamental frequency (F0) contours that allows us to extract prosodic features from raw speech data. The present F0 contour model is formulated by translating the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration, into a probabilistic model described as a discrete-time stochastic process. There are two motivations behind this formulation. One is to derive a general parameter estimation framework for the Fujisaki model that allows the introduction of powerful statistical methods. The other is to construct an automatically trainable version of the Fujisaki model that we can incorporate into statistical-model-based text-to-speech synthesizers in such a way that the Fujisaki-model parameters can be learned from a speech corpus in a unified manner. It could also be useful for other speech applications such as emotion recognition, speaker identification, speech conversion and dialogue systems, in which prosodic information plays a significant role. We quantitatively evaluated the performance of the proposed Fujisaki model parameter extractor using real speech data. Experimental results revealed that our method was superior to a state-of-the-art Fujisaki model parameter extractor.
Keywords :
feature extraction; mathematical analysis; probability; speech processing; speech synthesis; statistical analysis; stochastic processes; Fujisaki model; dialogue system; discrete-time stochastic process; emotion recognition; generative modeling; mathematical model; parameter estimation framework; parameter extractor; probabilistic model; prosodic feature extraction; raw speech data; speaker identification; speech conversion; statistical-model-based text-to-speech synthesizer; vocal fold vibration; voice fundamental frequency contour; Computational modeling; Data models; Hidden Markov models; IEEE transactions; Mathematical model; Speech; Speech processing; Expectation-maximization algorithm; Fujisaki model; prosody; voice fundamental frequency contour;
fLanguage :
English
Journal_Title :
Audio, Speech, and Language Processing, IEEE/ACM Transactions on
Publisher :
IEEE
ISSN :
2329-9290
Type :
jour
DOI :
10.1109/TASLP.2015.2418576
Filename :
7076606