Face-to-face Communicative Avatar Driven by Voice
Shigeo MORISHIMA and Tatsuo YOTSUKURA
Faculty of Engineering, Seikei University
{shigeo, yotsu}@ee.seikei.ac.jp

ABSTRACT

Recently, interactive virtual reality techniques have made it possible to walk through cyberspace on a computer, and an avatar in cyberspace can provide a virtual face-to-face communication environment. In this paper, we realize an avatar with a realistic face in cyberspace and use it to construct a multi-user communication system based on voice transmission through a network. Voice captured by a microphone is transmitted and analyzed, and the mouth shape and facial expression of the avatar are estimated and synthesized synchronously in real time. We also introduce an entertainment application of the real-time voice-driven synthetic face: a project named "Fifteen Seconds of Fame", which is an example of an interactive movie.

1. Introduction

Recently, research into creating friendly human interfaces has flourished remarkably. Such interfaces smooth communication between a computer and a human. One style is to have a virtual human [1][2] appear on the computer terminal who can understand and express not only linguistic information but also non-verbal information. This is similar to human-to-human communication in a face-to-face style and is sometimes called a Life-like Communication Agent [3][4]. In human-to-human communication, facial expression is the essential means of transmitting non-verbal information and of promoting friendliness between the participants. Our final goal is to generate a virtual space close to a real communication environment between network users.

In this paper, a multi-user virtual face-to-face communication environment in cyberspace is presented. Each user is represented in cyberspace by an avatar with a real texture-mapped face, whose facial expressions and actions are controlled by the user. A user can also obtain a view of cyberspace through the avatar's eyes, so he can communicate with other people by crossing gazes. In addition, the user's transmitted voice can control the lip shape and facial expression of the avatar through a media conversion algorithm [5][6].

This media conversion technique can also be applied to entertainment. One example is an interactive movie in which the audience can take part in movie scenes as the hero or heroine. The face of a movie star in a famous film is replaced with the user's own face, and this face is controlled interactively by the user's voice and a few control functions. In such a system, fitting the face model into the movie scene (the match-move process) is necessary, and it takes a long time of delicate manual work to make the scene impressive.

2. Face Modeling

To generate a realistic avatar face, a generic face model is manually adjusted to both a frontal image and a side-view image of the user's face to produce a personal 3D face model, and all of the control rules for facial expressions are defined as 3D movements of grid points on the generic face model. Fig. 1 shows a personal model after the fitting process using our original GUI-based face fitting tool. The front-view and side-view images are loaded into the system, and the corresponding control points are then moved to reasonable positions manually by mouse operation.

The synthesized face is produced by mapping a blended texture, generated from the user's frontal and side-view images, onto the modified personal face model. The body of the avatar is still under construction, so in the current prototype system each avatar has only a simple body below the neck, as shown in Fig. 2. A user's emotional condition can be transmitted to the other clients through features and motion of the avatar as well as through its facial expression; this also realizes non-verbal communication between users.

Figure 1: Personal face model fitted onto the input images
Figure 2: Personal 3D avatar
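The paper does not specify how the two photographs are blended. Purely as an illustration, the following sketch (in Python with NumPy; the names blend_textures, front_rgb, side_rgb and vertex_normals are hypothetical, not part of the described system) assumes that each vertex colour is a weighted average of the two views, weighted by how directly the surface faces each camera:

    import numpy as np

    def blend_textures(front_rgb, side_rgb, vertex_normals):
        """Blend per-vertex colours sampled from the frontal and the side-view
        photograph.  front_rgb, side_rgb: (N, 3) colours sampled at the image
        projection of each model vertex; vertex_normals: (N, 3) unit normals of
        the fitted personal face model (all assumed inputs)."""
        front_dir = np.array([0.0, 0.0, 1.0])   # viewing axis of the frontal shot
        side_dir = np.array([1.0, 0.0, 0.0])    # viewing axis of the side shot
        # Weight each view by how directly the surface faces that camera.
        w_front = np.clip(vertex_normals @ front_dir, 0.0, None)
        w_side = np.clip(vertex_normals @ side_dir, 0.0, None)
        w_sum = np.maximum(w_front + w_side, 1e-6)
        blended = w_front[:, None] * front_rgb + w_side[:, None] * side_rgb
        return blended / w_sum[:, None]

Any comparable weighting (for example, fixed region masks around the profile line) would fit the paper's description equally well.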
3. Voice Driven Talking Head

The multi-user communication system in cyberspace is composed of the sub-processes described below.

3.1 Voice Capturing

At the client system, the voice of each user is captured on-line, A/D converted at 16 kHz and 16 bits, and transmitted frame by frame to the server system through the network. In an intranet environment no speech compression is needed, but for an Internet application a high-compression technique is indispensable to keep the voice quality high and to cancel the transmission delay. The current prototype system is implemented in an intranet environment.

3.2 Lip Synchronization

At the server system, the voice from each client is phonetically analyzed and converted into mouth shape and expression parameters. LPC cepstrum parameters are converted into mouth shape parameters frame by frame by a neural network trained on vowel features. Fig. 3 shows the neural network used for parameter conversion; it is a direct and continuous mapping from LPC parameters to mouth shape parameters. The mouth shape is expressed by only 13 parameters describing its features. As training data, mouth shape parameters and LPC parameters for the 5 vowels are prepared. Fig. 4 shows an example mouth shape for the vowel /a/. The user can easily modify the mouth shape for each vowel with a mouth shape editing tool.

Figure 3: Neural network for parameter conversion
Figure 4: Mouth shape for the vowel /a/

3.3 Emotion Condition

The emotional condition is also classified from the LPC cepstrum, power, pitch frequency and utterance speed, using multiple linear discriminant analysis, into one of the emotion categories Anger, Happiness, Sadness and Neutral. The utterance speed is recognized automatically: based on the cepstrum distance between the current and preceding frames and the WLR distance from a 5-vowel database, phoneme segment boundaries are detected automatically. Fig. 5 shows a voice waveform and its segmentation result. This process runs frame by frame along with the utterance, so the current speed is calculated as the number of detected boundary frames divided by the total spoken length.

Fig. 6 shows the location of each basic emotion condition in the space defined by utterance speed and pitch. Each condition is clearly separated by these parameters.

Figure 5: Result of segmentation
Figure 6: Location of the emotion conditions (axes: utterance speed vs. normalized power)
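Beyond the facts that the network of Fig. 3 maps LPC cepstrum coefficients to the 13 mouth shape parameters frame by frame and is trained on the 5 vowels, Section 3.2 gives no architectural details. A minimal sketch of such a mapping, assuming a single hidden layer and assumed dimensions (cepstrum order 12, 20 hidden units); all names and sizes are illustrative:

    import numpy as np

    CEPSTRUM_ORDER = 12   # assumed LPC cepstrum order (not stated in the paper)
    N_MOUTH_PARAMS = 13   # the paper uses 13 mouth shape parameters
    N_HIDDEN = 20         # assumed hidden layer size

    def init_weights(rng):
        """Random weights standing in for the trained network of Fig. 3."""
        return {
            "W1": rng.normal(0.0, 0.1, (N_HIDDEN, CEPSTRUM_ORDER)),
            "b1": np.zeros(N_HIDDEN),
            "W2": rng.normal(0.0, 0.1, (N_MOUTH_PARAMS, N_HIDDEN)),
            "b2": np.zeros(N_MOUTH_PARAMS),
        }

    def cepstrum_to_mouth(cepstrum, w):
        """Map one frame of LPC cepstrum coefficients to mouth shape parameters."""
        h = np.tanh(w["W1"] @ cepstrum + w["b1"])
        # Sigmoid output: assumed normalization of the 13 parameters to (0, 1).
        return 1.0 / (1.0 + np.exp(-(w["W2"] @ h + w["b2"])))

At run time, one such call per audio frame yields the parameter vector that drives the avatar's mouth.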
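Similarly, Section 3.3 does not give the exact boundary criterion or discriminant functions. The sketch below uses a simple threshold on the inter-frame cepstral distance as a stand-in for the combined cepstrum/WLR test, and an equal-prior linear discriminant score for the four categories; the threshold, frame rate, class_means and shared_cov_inv are assumptions, not values taken from the paper:

    import numpy as np

    def utterance_speed(cepstra, threshold=1.0, frame_rate=100.0):
        """Estimate utterance speed as detected segment boundaries per second.
        cepstra: (T, D) LPC cepstrum frames; the threshold and the 100 frames/s
        rate are assumed."""
        d = np.linalg.norm(np.diff(cepstra, axis=0), axis=1)
        boundaries = np.count_nonzero(d > threshold)
        return boundaries / (len(cepstra) / frame_rate)

    EMOTIONS = ["Neutral", "Anger", "Happiness", "Sadness"]

    def classify_emotion(features, class_means, shared_cov_inv):
        """Pick the category with the largest linear discriminant score.
        features: vector of (cepstrum statistics, power, pitch, speed);
        class_means must be ordered as in EMOTIONS."""
        scores = [m @ shared_cov_inv @ features - 0.5 * m @ shared_cov_inv @ m
                  for m in class_means]
        return EMOTIONS[int(np.argmax(scores))]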
Each basic emotion has a specific set of facial expression parameters described by FACS (the Facial Action Coding System) [7]. Examples of the basic expressions are shown in Fig. 7(a)-(c). These faces can also be customized by the user with a face editing tool.

Figure 7(a): Basic emotion Anger
Figure 7(b): Basic emotion Happiness
Figure 7(c): Basic emotion Sadness

3.4 Location Control

Each user can walk and fly through cyberspace under mouse control, and the current locations of all users are always observed by the server system. The avatar images are generated in each client's space from the location information provided by the server system.

3.5 Emotion Key-in

The emotional condition can always be decided from the voice, but sometimes the user wants to give his avatar a specific emotional condition explicitly; he can do so by pushing a function key, and this process works with first priority. For example, pushing "anger" makes a red and bigger face appear; for "happiness" a bouncing face appears, and so on.

3.6 Information Management at Server

The location information, mouth shape parameters and emotion parameters of each avatar are transmitted to the client systems every 1/30 second. The distance between every pair of users is calculated from the avatar location information, and the voices of all users except the listener himself are mixed and amplified with a gain that depends on distance, so the voice from the nearest avatar is very loud and a voice from far away is silent.

3.7 Agent and Space Generation at Client

Based on the facial expression parameters and mouth shape parameters, the avatar's face is synthesized frame by frame, and the avatar's body is placed in cyberspace according to the location information. There are two display modes: the view from the avatar's own eyes, used for eye contact and shown in Fig. 8, and a view from an arbitrary angle, shown in Fig. 9, used to search for other clients in cyberspace or to watch as an audience member. These views can be selected from a menu in the window.

Figure 8: Communication with eye contact
Figure 9: View from the audience

3.8 Voice Output

The playback volume of an avatar's voice depends on the distance to that avatar. Adding a multiple-speaker system would make 3D audio output possible. To keep the lips synchronized, a 64 ms delay is applied to the voice playback.
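Sections 3.6 and 3.8 describe the mixing of the other users' 16 kHz / 16-bit voices with a gain that falls off with distance. The gain law is not stated, so the sketch below assumes a linear fall-off that reaches zero at an arbitrary maximum distance max_dist; the caller passes only the frames of the other users, and the fixed 64 ms playback delay of Section 3.8 is left to the audio output stage:

    import numpy as np

    def mix_voices(frames, positions, listener_pos, max_dist=20.0):
        """Mix one int16 audio frame per remote user, attenuated by distance.
        frames: dict user -> int16 samples (the listener's own voice excluded);
        positions: dict user -> (x, y, z) avatar location from the server."""
        mix = np.zeros_like(next(iter(frames.values())), dtype=np.float32)
        for user, samples in frames.items():
            dist = np.linalg.norm(np.asarray(positions[user], dtype=np.float32)
                                  - np.asarray(listener_pos, dtype=np.float32))
            gain = max(0.0, 1.0 - dist / max_dist)  # nearest avatar loud, far away silent
            mix += gain * samples.astype(np.float32)
        return np.clip(mix, -32768.0, 32767.0).astype(np.int16)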
4. User Adaptation

When a new user joins, his face model and voice model have to be registered before operation. For the voice, new training of the neural network would ideally be performed; however, it takes a very long time for backpropagation to converge. To simplify face model construction and voice learning, GUI tools for speaker adaptation are provided.

4.1 Face Model Fitting

To register the face of a new user, the generic 3D face model is modified to fit the input face image. Only a 2D frontal image is needed; to improve the depth features, a side-view image can also be used. Figure 10 shows the initial and final views of the fitting tool window for a frontal face image. Some of the control points on the face model are shifted manually. Thanks to the easy mouse operation of the GUI tool, it takes only a few minutes to complete a user's face model. The expression control rules are defined on the generic model, so every user's face can be modified in the same way to generate the basic expressions using the FACS-based expression control mechanism.

Figure 10: Fitting tool window (initial and final views)

4.2 Voice Adaptation

Voice data including the 5 vowels were pre-captured from 75 people, and a database of neural network weights and voice parameters was constructed, so speaker adaptation is performed by choosing the optimum weights from this database; the training of the neural network for each of the 75 speakers has already been finished beforehand. When a new, unregistered speaker arrives, he speaks the 5 vowels into the microphone before operation. The LPC cepstrum is calculated for each of the 5 vowels and fed into the neural network; the mouth shape is then calculated with the selected weights, and the error between the true mouth shape and the generated mouth shape is evaluated. This process is applied to every entry in the database one by one, and the weights giving the minimum error are selected as the optimum.
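This adaptation amounts to an exhaustive search over the 75 stored weight sets for the one that minimizes the mouth shape error on the new speaker's 5 vowels. A sketch under that reading follows; predict stands for the cepstrum-to-mouth network of Section 3.2, and all names are illustrative:

    import numpy as np

    def select_speaker_weights(vowel_cepstra, true_mouth_params, weight_db, predict):
        """Choose the pre-trained weight set that best reproduces the reference
        mouth shapes for the new speaker's 5 vowels.
        vowel_cepstra: 5 LPC cepstrum vectors; true_mouth_params: 5 reference
        mouth shape vectors; weight_db: the 75 stored weight sets;
        predict(cepstrum, weights): the frame-wise conversion network."""
        best_w, best_err = None, np.inf
        for w in weight_db:
            err = sum(float(np.sum((predict(c, w) - m) ** 2))
                      for c, m in zip(vowel_cepstra, true_mouth_params))
            if err < best_err:
                best_w, best_err = w, err
        return best_w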
5. Interactive Movie

When people watch a movie, they sometimes imagine their own figure overlapping the actor's image. The interactive movie system we constructed is an image creation system in which a user can control the facial expression and lip motion of his own face image inserted into a movie scene. The user provides voice through a microphone and pushes keys that determine the expression and special effects, and his own video program is generated in real time. This project is named "The Fifteen Seconds of Fame".

First, a frontal face image of the visitor is captured by a camera, and the 3D generic wireframe model is fitted onto the visitor's face image to generate a personal 3D surface model. Facial expressions are synthesized by controlling the grid points of the face model and texture mapping. For speaker adaptation, the visitor speaks the 5 vowels so that the optimum weights can be chosen from the database.

In the interactive process, a famous movie scene plays while the face of the actor or actress is replaced with the visitor's face, and the facial expression and lip shape are controlled synchronously by the captured voice. An active camera also tracks the visitor's face, and the facial expression is additionally controlled by computer-vision-based face image analysis. When there is more than one actor in a scene, all of their faces can be replaced with several visitors' faces, and all of the faces can be controlled by the visitors' voices at the same time. Fig. 11 shows the result of fitting the face model into a movie scene, and Fig. 12 shows a user's face inserted into the actor's face.

Figure 11: Face model fitted into a movie scene
Figure 12: User's face inserted into the actor's face
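The paper does not describe how the synthesized face is composited into the movie frame after the match-move fitting. Purely as an illustration, assuming the rendered face and an alpha mask of the face region have already been aligned with the frame, a per-pixel replacement could look like this; all names are hypothetical:

    import numpy as np

    def composite_face(movie_frame, rendered_face, face_mask):
        """Paste the synthesized face over the actor's face region.
        movie_frame, rendered_face: (H, W, 3) uint8 images aligned by the
        match-move fitting; face_mask: (H, W) alpha values in [0, 1]."""
        alpha = face_mask[..., None].astype(np.float32)
        out = (alpha * rendered_face.astype(np.float32)
               + (1.0 - alpha) * movie_frame.astype(np.float32))
        return out.astype(np.uint8)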
6. Conclusion

A natural communication environment between multiple users in cyberspace, based on the transmission of natural voice and the real-time synthesis of the avatars' facial expressions, has been presented. The synthesis speed for the cyberspace and the avatars is about 10.5 frames per second on an SGI Onyx2 (R10000, 180 MHz). The current system works with 3 users in an intranet environment. To increase the number of users, it is necessary to reduce the network traffic by compressing the voice signal and to reduce the cost of server processing. Our final goal is to realize the system in an Internet environment. An application of this system to entertainment has also been introduced. In the near future, facial image analysis [8][9] and an emotion model [10] will be introduced to improve the communication environment.

7. References

[1] Norman I. Badler, Cary B. Phillips, and Bonnie L. Webber, "Simulating Humans: Computer Graphics Animation and Control", Oxford University Press (1993).
[2] Nadia M. Thalmann and Prem Kalra, "The Simulation of a Virtual TV Presenter", Computer Graphics and Applications, pp. 9-21, World Scientific (1995).
[3] Shigeo Morishima, et al., "Life-Like, Believable Communication Agents", Course Notes #25, ACM SIGGRAPH (1996).
[4] Justine Cassell, et al., "Animated Conversation: Rule-based Generation of Facial Expression, Gesture and Spoken Intonation for Multiple Conversational Agents", Proceedings of SIGGRAPH '94, pp. 413-420 (1994).
[5] Shigeo Morishima and Hiroshi Harashima, "A Media Conversion from Speech to Facial Image for Intelligent Man-Machine Interface", IEEE JSAC, Vol. 9, No. 4, pp. 594-600 (1991).
[6] Shigeo Morishima, "Virtual Face-to-Face Communication Driven by Voice Through Network", Workshop on Perceptual User Interfaces, pp. 85-86 (1997).
[7] Paul Ekman and Wallace V. Friesen, "Facial Action Coding System", Consulting Psychologists Press Inc. (1978).
[8] Irfan Essa, T. Darrell and A. Pentland, "Tracking Facial Motion", Proceedings of the Workshop on Motion of Non-rigid and Articulated Objects, pp. 36-42 (1994).
[9] Kenji Mase, "Recognition of Facial Expression from Optical Flow", IEICE Transactions, Vol. E74, No. 10, October (1991).
[10] Shigeo Morishima, "Modeling of Facial Expression and Emotion for Human Communication System", Displays 17, pp. 15-25, Elsevier (1996).