Generative Modeling of Voice Fundamental Frequency Contours

Author

Kameoka, Hirokazu ; Yoshizato, Kota ; Ishihara, Tatsuma ; Kadowaki, Kento ; Ohishi, Yasunori ; Kashino, Kunio

Author_Institution

Grad. Sch. of Inf. Sci. & Technol., Univ. of Tokyo, Tokyo, Japan

Volume

23

Issue

6

fYear

2015

fDate

June 2015

Firstpage

1042

Lastpage

1053

Abstract

This paper introduces a generative model of voice fundamental frequency (F₀) contours that allows us to extract prosodic features from raw speech data. The present F₀ contour model is formulated by translating the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration, into a probabilistic model described as a discrete-time stochastic process. There are two motivations behind this formulation. One is to derive a general parameter estimation framework for the Fujisaki model that allows the introduction of powerful statistical methods. The other is to construct an automatically trainable version of the Fujisaki model that we can incorporate into statistical-model-based text-to-speech synthesizers in such a way that the Fujisaki-model parameters can be learned from a speech corpus in a unified manner. It could also be useful for other speech applications such as emotion recognition, speaker identification, speech conversion and dialogue systems, in which prosodic information plays a significant role. We quantitatively evaluated the performance of the proposed Fujisaki model parameter extractor using real speech data. Experimental results revealed that our method was superior to a state-of-the-art Fujisaki model parameter extractor.

Keywords

feature extraction; mathematical analysis; probability; speech processing; speech synthesis; statistical analysis; stochastic processes; Fujisaki model; dialogue system; discrete-time stochastic process; emotion recognition; generative modeling; mathematical model; parameter estimation framework; parameter extractor; probabilistic model; prosodic feature extraction; raw speech data; speaker identification; speech conversion; statistical-model-based text-to-speech synthesizer; vocal fold vibration; voice fundamental frequency contour; Computational modeling; Data models; Hidden Markov models; IEEE transactions; Mathematical model; Speech; Speech processing; Expectation-maximization algorithm; prosody

fLanguage

English

Journal_Title

Audio, Speech, and Language Processing, IEEE/ACM Transactions on

Publisher

IEEE

ISSN

2329-9290

Type

jour

DOI

10.1109/TASLP.2015.2418576

Filename

7076606