• DocumentCode
    357018
  • Title
    Audio-visual unit selection for the synthesis of photo-realistic talking-heads
  • Author
    Cosatto, Eric ; Potamianos, Gerasimos ; Graf, Hans Peter
  • Author_Institution
    AT&T Labs.-Res., Red Bank, NJ, USA
  • Volume
    2
  • fYear
    2000
  • fDate
    2000
  • Firstpage
    619
  • Abstract
    This paper investigates audio-visual unit selection for the synthesis of photo-realistic, speech-synchronized talking-head animations. These animations are synthesized from recorded video samples of a subject speaking in front of a camera, resulting in a photo-realistic appearance. The lip-synchronization is obtained by optimally selecting and concatenating variable-length video units of the mouth area. Synthesizing a new speech animation from these recorded units starts with audio speech and its phonetic annotation from a text-to-speech synthesizer. Then, optimal image units are selected from the recorded set using a Viterbi search through a graph of candidate image units. Costs are attached to the nodes and arcs of the graph that are computed from similarities in both the acoustic and visual domain. While acoustic similarities are computed by simple phonetic matching, visual similarities are estimated using a hierarchical metric that uses high-level features (position and sizes of facial parts) and low-level features (projection of the image pixels on principal components of the database). This method preserves coarticulation and temporal coherence, producing smooth, lip-synched animations. Once the database has been prepared, this system can produce animations from ASCII text fully automatically.
  • Keywords
    computer animation; multimedia computing; realistic images; speech synthesis; video signal processing; Viterbi search; acoustic similarities; audio-visual unit selection; candidate image units; coarticulation; computer vision; hierarchical metric; high-level features; lip-synchronization; low-level features; mouth area; phonetic matching; photo-realistic talking-heads; recorded video samples; sample based image synthesis; speech-synchronized talking-head animations; temporal coherence; text-to-speech synthesizer; variable-length video units; Animation; Cameras; Costs; Image databases; Mouth; Spatial databases; Speech synthesis; Synthesizers; Visual databases; Viterbi algorithm;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Multimedia and Expo, 2000. ICME 2000. 2000 IEEE International Conference on
  • Conference_Location
    New York, NY
  • Print_ISBN
    0-7803-6536-4
  • Type
    conf
  • DOI
    10.1109/ICME.2000.871439
  • Filename
    871439
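
The Viterbi search named in the abstract can be illustrated with a minimal sketch. It assumes one list of candidate units per target position, a node cost (`target_cost`) and an arc cost (`concat_cost`). All names, the tuple layout `(phoneme, frame_index, mouth_width)`, and both toy cost functions are hypothetical; they stand in for, and do not reproduce, the paper's phonetic matching and hierarchical visual metric.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

U = TypeVar("U")


def select_units(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    concat_cost: Callable[[U, U], float],
) -> Tuple[float, List[U]]:
    """Return the minimum-cost path through a lattice of candidate units."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target position needs at least one candidate")

    # best[t][j]: cheapest cost of any path ending at candidate j of step t.
    best = [[target_cost(0, u) for u in candidates[0]]]
    # back[t][j]: index of the predecessor on that cheapest path.
    back = [[-1] * len(candidates[0])]

    for t in range(1, len(candidates)):
        row, ptr = [], []
        for u in candidates[t]:
            costs = [
                best[t - 1][i] + concat_cost(p, u)
                for i, p in enumerate(candidates[t - 1])
            ]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[i_min] + target_cost(t, u))
            ptr.append(i_min)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path: List[U] = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = back[t][j]
    path.reverse()
    return total, path


# Toy data (hypothetical): (phoneme, frame_index, mouth_width).
targets = ["a", "b", "c"]
lattice = [
    [("a", 10, 1.0), ("a", 50, 3.0)],
    [("b", 51, 3.2), ("b", 20, 1.1)],
    [("c", 52, 3.1), ("c", 90, 1.0)],
]


def toy_target_cost(t, unit):
    # Node cost: simple phonetic match against the target phoneme.
    return 0.0 if unit[0] == targets[t] else 1.0


def toy_concat_cost(prev, unit):
    # Arc cost: free for frames that were adjacent in the recording,
    # otherwise the difference of a single visual feature.
    if unit[1] == prev[1] + 1:
        return 0.0
    return abs(unit[2] - prev[2])


total, path = select_units(lattice, toy_target_cost, toy_concat_cost)
```

With this toy lattice the search keeps the three consecutively recorded frames (50, 51, 52) together, since adjacent frames incur no arc cost. That mirrors, in a simplified way, how favouring longer recorded runs can preserve coarticulation.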