Title :
Generative Modeling of Voice Fundamental Frequency Contours
Author :
Kameoka, Hirokazu ; Yoshizato, Kota ; Ishihara, Tatsuma ; Kadowaki, Kento ; Ohishi, Yasunori ; Kashino, Kunio
Author_Institution :
Grad. Sch. of Inf. Sci. & Technol., Univ. of Tokyo, Tokyo, Japan
Abstract :
This paper introduces a generative model of voice fundamental frequency (F0) contours that allows us to extract prosodic features from raw speech data. The present F0 contour model is formulated by translating the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration, into a probabilistic model described as a discrete-time stochastic process. There are two motivations behind this formulation. One is to derive a general parameter estimation framework for the Fujisaki model that allows the introduction of powerful statistical methods. The other is to construct an automatically trainable version of the Fujisaki model that we can incorporate into statistical-model-based text-to-speech synthesizers in such a way that the Fujisaki-model parameters can be learned from a speech corpus in a unified manner. It could also be useful for other speech applications such as emotion recognition, speaker identification, speech conversion and dialogue systems, in which prosodic information plays a significant role. We quantitatively evaluated the performance of the proposed Fujisaki model parameter extractor using real speech data. Experimental results revealed that our method was superior to a state-of-the-art Fujisaki model parameter extractor.
Keywords :
feature extraction; mathematical analysis; probability; speech processing; speech synthesis; statistical analysis; stochastic processes; Fujisaki model; dialogue system; discrete-time stochastic process; emotion recognition; generative modeling; mathematical model; parameter estimation framework; parameter extractor; probabilistic model; prosodic feature extraction; raw speech data; speaker identification; speech conversion; statistical-model-based text-to-speech synthesizer; vocal fold vibration; voice fundamental frequency contour; Computational modeling; Data models; Hidden Markov models; IEEE transactions; Mathematical model; Speech; Speech processing; Expectation-maximization algorithm; prosody
Journal_Title :
Audio, Speech, and Language Processing, IEEE/ACM Transactions on
DOI :
10.1109/TASLP.2015.2418576