Abstract:
Summary form only given. This paper presents an overview of the needs of future speech synthesis, based on an analysis of trends up to the present day. Whereas "contexts" can be considered the keyword of unit-selection synthesis, this paper argues that "relationships" must also be considered if we are to progress to the next level of speech synthesis quality. Lexical, syntactic and discoursal contexts have been shown to affect the acoustic characteristics of the speech waveform, and the consequent use of the prosodic environment as a selection criterion has produced significant improvements in the quality of corpus-based synthesised speech. A preliminary analysis of the acoustic characteristics in a large corpus of spontaneous conversational Japanese shows that speaker-listener and speaker-commitment relationships have similar effects on the speech. This paper summarises the types of meaningful variation that arise when the talker addresses different interlocutors and when the talker expresses different degrees of commitment towards the content of an utterance. The necessity for synthesised speech to mimic these characteristics of a human speaker is discussed. Samples of naturally-occurring human speech are presented to illustrate the multi-dimensionality of the information carried by the human voice in conversational interactions, and suggestions are offered for the categorisation and parameterisation of these variables.
Keywords:
speech synthesis; Japanese; acoustic characteristics; discoursal contexts; human voice; interlocutor; lexical contexts; multi-dimensionality; naturally-occurring human speech; prosody; relationships; speaker-commitment relationships; speaker-listener relationships; speech waveform; syntactic contexts; synthesised speech; talker; unit-selection synthesis; voice-quality;