Face Analysis for the Synthesis of Photo-Realistic Talking Heads
Grenoble, France, March 26-March 30
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/AFGR.2000.840633
Fourth IEEE International Conference ...
Hans Peter Graf, AT&T Labs-Research
Eric Cosatto, AT&T Labs-Research
Tony Ezzat, Massachusetts Institute of Technology
This paper describes techniques for extracting bitmaps of facial parts from videos of a talking person. The goal is to synthesize photo-realistic talking heads of high quality that show picture-perfect appearance and realistic head movements with good lip-sound synchronization. For the synthesis of a talking head, bitmaps of facial parts are combined to form whole heads, and sequences of such images are then integrated with audio from a text-to-speech synthesizer. For a seamless integration of facial parts, their shape and visual appearance must be known with high accuracy. When a person is recorded for such a task, the head moves and the facial expressions change, influencing the appearance of the face. The recognition system therefore has to find not only the location of facial features, but must also determine the head's orientation and estimate the facial expressions.

Our face recognition proceeds in multiple steps, each with increased precision. Using motion, color and shape information, the head's position and the location of the main facial features are determined first. Then smaller areas are searched with matched filters in order to identify specific facial features with high precision. From this information the head's 3D orientation is calculated. Facial parts are cut from the image and, using the head's orientation, are warped into bitmaps with 'normalized' orientation and scale.

In order to synthesize natural-looking heads, not only the static appearance of a face, but also the whole dynamics of the facial deformations have to be captured and rendered with high precision. By translating all facial parts into a normalized view, we can describe their dynamics with a few parameters. For example, we record the normalized parameters of the lip shape for diphones and the most common triphones. Such sample-based co-articulation produces more natural-looking synthesized speech than model-based co-articulation.
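The abstract does not say which matched filters are used, so the following is only a minimal sketch of the general idea of searching a small image area for a facial feature. It assumes normalized cross-correlation as the matching score, and the function name `match_template` and the toy data are hypothetical, not taken from the paper.

```python
from math import sqrt


def _centered(values):
    """Subtract the mean so the score ignores overall brightness."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def match_template(region, template):
    """Slide `template` over `region` (both 2D lists of gray values) and
    return (row, col, score) for the best normalized cross-correlation.

    Assumption: this stands in for the paper's matched filters, whose
    exact form is not given in the abstract.
    """
    th, tw = len(template), len(template[0])
    t = _centered([v for row in template for v in row])
    t_norm = sqrt(sum(v * v for v in t))
    best = (0, 0, -1.0)
    for r in range(len(region) - th + 1):
        for c in range(len(region[0]) - tw + 1):
            patch = [region[r + i][c + j] for i in range(th) for j in range(tw)]
            p = _centered(patch)
            denom = t_norm * sqrt(sum(v * v for v in p))
            if denom == 0:
                continue  # flat patch or flat template: score is undefined
            score = sum(a * b for a, b in zip(p, t)) / denom
            if score > best[2]:
                best = (r, c, score)
    return best


# Toy example: a small cross-shaped "feature" pasted into an empty search area.
template = [[0, 9, 0],
            [9, 9, 9],
            [0, 9, 0]]
region = [[0] * 8 for _ in range(6)]
for i in range(3):
    for j in range(3):
        region[2 + i][4 + j] = template[i][j]

row, col, score = match_template(region, template)
print(row, col, round(score, 3))  # best match is at row 2, column 4
```

Restricting the search to a small region, as the abstract describes, keeps this brute-force scan cheap; a practical system would likely use an optimized library routine instead of nested Python loops.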
Index Terms:
Sample-based talking heads, photo-realistic talking heads, sample-based co-articulation, face recognition, facial feature analysis, visual text-to-speech
Citation:
Hans Peter Graf, Eric Cosatto, Tony Ezzat, "Face Analysis for the Synthesis of Photo-Realistic Talking Heads," fg, pp.189, Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), 2000