• DocumentCode
    880291
  • Title

    Humanoid Audio–Visual Avatar With Emotive Text-to-Speech Synthesis

  • Author

    Tang, Hao ; Fu, Yun ; Tu, Jilin ; Hasegawa-Johnson, Mark ; Huang, Thomas S.

  • Author_Institution
    Beckman Inst. for Adv. Sci. & Technol., Univ. of Illinois at Urbana-Champaign, Urbana, IL
  • Volume
    10
  • Issue
    6
  • fYear
    2008
  • Firstpage
    969
  • Lastpage
    981
  • Abstract
    Emotive audio-visual avatars are virtual computer agents that have the potential to significantly improve the quality of human-machine interaction and human-human communication. However, the understanding of human communication has not yet advanced to the point where it is possible to make realistic avatars that demonstrate interactions with natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose the technical approaches of a novel multimodal framework leading to a text-driven emotive audio-visual avatar. Our primary work is focused on emotive speech synthesis, realistic emotional facial expression animation, and the co-articulation between speech gestures (i.e., lip movements) and facial expressions. A general framework of emotive text-to-speech (TTS) synthesis using a diphone synthesizer is designed and integrated into a generic 3-D avatar face model. Under the guidance of this framework, we developed a realistic 3-D avatar prototype. A rule-based emotive TTS synthesis module based on the Festival-MBROLA architecture has been designed to demonstrate the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.
  • Keywords
    avatars; computer animation; human computer interaction; speech synthesis; virtual reality; emotive text-to-speech synthesis; generic 3-D avatar face model; human-human communication; human-machine interaction; humanoid audio-visual avatar; natural-sounding emotive speech; virtual computer agents; 3-D face modeling and animation; audio-visual avatar; TTS; emotive speech synthesis; human-computer interaction; multimodal system
  • fLanguage
    English
  • Journal_Title
    Multimedia, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1520-9210
  • Type

    jour

  • DOI
    10.1109/TMM.2008.2001355
  • Filename
    4637888