Face-to-face Communicative Avatar Driven by Voice
Shigeo MORISHIMA and Tatsuo YOTSUKURA
Faculty of Engineering, Seikei University
{shigeo, yotsu}@ee.seikei.ac.jp

ABSTRACT

Recently, interactive virtual reality techniques have made it possible to walk through cyberspace on a computer, and an avatar in cyberspace can provide a virtual face-to-face communication environment. In this paper, we realize an avatar with a realistic face in cyberspace and use it to construct a multi-user communication system based on voice transmission through a network. Voice captured by a microphone is transmitted and analyzed, and the mouth shape and facial expression of the avatar are estimated and synthesized synchronously in real time. We also introduce an entertainment application of the real-time voice-driven synthetic face: a project named "Fifteen Seconds of Fame", which is an example of an interactive movie.

1. Introduction

Recently, research into creating friendly human interfaces has flourished remarkably. Such interfaces smooth communication between a computer and a human. One style is to have a virtual human [1][2] appear on the computer terminal who can understand and express not only linguistic information but also non-verbal information. This is similar to human-to-human communication in a face-to-face style and is sometimes called a Life-like Communication Agent [3][4]. In human-to-human communication, facial expression is the essential means of transmitting non-verbal information and of promoting friendliness between the participants. Our final goal is to generate a virtual space close to a real communication environment between network users.

In this paper, a multi-user virtual face-to-face communication environment in cyberspace is presented. Each user is represented in cyberspace by an avatar with a real texture-mapped face, whose facial expressions and actions are controlled by the user. A user can also obtain a view of cyberspace through the avatar's eyes, so he can communicate with other people by crossing gazes. In addition, the user's transmitted voice can control the lip shape and facial expression of the avatar through a media conversion algorithm [5][6].

This media conversion technique can also be applied to entertainment. One example is an interactive movie in which the audience can take part in movie scenes as the hero or heroine. The face of a movie star in a famous film is replaced with the user's own face, and this face is controlled interactively by the user's voice and a few control functions. In such a system, fitting the face model into the movie scene (the match-move process) is necessary, and it takes a long time of delicate manual work to make the scene impressive.

2. Face Modeling

To generate a realistic avatar face, a generic face model is manually adjusted to both a frontal image and a side-view image of the user's face to produce a personal 3D face model, and all of the control rules for facial expressions are defined as 3D movements of grid points on the generic face model. Fig. 1 shows a personal model after the fitting process using our original GUI-based face fitting tool. The front-view and side-view images are loaded into the system, and the corresponding control points are then moved to reasonable positions manually by mouse operation.

The synthesized face is produced by mapping a blended texture, generated from the user's frontal and side-view images, onto the modified personal face model. The body of the avatar is still under construction, so in the current prototype system each avatar has only a simple body below the neck, as shown in Fig. 2. A user's emotional condition can be transmitted to the other clients through features and motion of the avatar as well as through its facial expression; this also realizes non-verbal communication between users.

Figure 1: Personal face model fitted onto the input images
Figure 2: Personal 3D avatar
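The paper does not specify how the two photographs are blended. Purely as an illustration, the following sketch (in Python with NumPy; the names blend_textures, front_rgb, side_rgb and vertex_normals are hypothetical, not part of the described system) assumes that each vertex colour is a weighted average of the two views, weighted by how directly the surface faces each camera:

    import numpy as np

    def blend_textures(front_rgb, side_rgb, vertex_normals):
        """Blend per-vertex colours sampled from the frontal and the side-view
        photograph.  front_rgb, side_rgb: (N, 3) colours sampled at the image
        projection of each model vertex; vertex_normals: (N, 3) unit normals of
        the fitted personal face model (all assumed inputs)."""
        front_dir = np.array([0.0, 0.0, 1.0])   # viewing axis of the frontal shot
        side_dir = np.array([1.0, 0.0, 0.0])    # viewing axis of the side shot
        # Weight each view by how directly the surface faces that camera.
        w_front = np.clip(vertex_normals @ front_dir, 0.0, None)
        w_side = np.clip(vertex_normals @ side_dir, 0.0, None)
        w_sum = np.maximum(w_front + w_side, 1e-6)
        blended = w_front[:, None] * front_rgb + w_side[:, None] * side_rgb
        return blended / w_sum[:, None]

Any comparable weighting (for example, fixed region masks around the profile line) would fit the paper's description equally well.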
3. Voice Driven Talking Head

The multi-user communication system in cyberspace is composed of the sub-processes described below.

3.1 Voice Capturing

At the client system, the voice of each user is captured on-line, A/D converted at 16 kHz and 16 bits, and transmitted frame by frame to the server system through the network. In an intranet environment no speech compression is needed, but for an Internet application a high-compression technique is indispensable to keep the voice quality high and to cancel the transmission delay. The current prototype system is implemented in an intranet environment.

3.2 Lip Synchronization

At the server system, the voice from each client is phonetically analyzed and converted into mouth shape and expression parameters. LPC cepstrum parameters are converted into mouth shape parameters frame by frame by a neural network trained on vowel features. Fig. 3 shows the neural network used for parameter conversion; it is a direct and continuous mapping from LPC parameters to mouth shape parameters. The mouth shape is expressed by only 13 parameters describing its features. As training data, mouth shape parameters and LPC parameters for the 5 vowels are prepared. Fig. 4 shows an example mouth shape for the vowel /a/. The user can easily modify the mouth shape for each vowel with a mouth shape editing tool.

Figure 3: Neural network for parameter conversion
Figure 4: Mouth shape for the vowel /a/

3.3 Emotion Condition

The emotional condition is also classified from the LPC cepstrum, power, pitch frequency and utterance speed, using multiple linear discriminant analysis, into one of the emotion categories Anger, Happiness, Sadness and Neutral. The utterance speed is recognized automatically: based on the cepstrum distance between the current and preceding frames and the WLR distance from a 5-vowel database, phoneme segment boundaries are detected automatically. Fig. 5 shows a voice waveform and its segmentation result. This process runs frame by frame along with the utterance, so the current speed is calculated as the number of detected boundary frames divided by the total spoken length.

Fig. 6 shows the location of each basic emotion condition in the space defined by utterance speed and pitch. Each condition is clearly separated by these parameters.

Figure 5: Result of segmentation
Figure 6: Location of the emotion conditions (axes: utterance speed vs. normalized power)
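Beyond the facts that the network of Fig. 3 maps LPC cepstrum coefficients to the 13 mouth shape parameters frame by frame and is trained on the 5 vowels, Section 3.2 gives no architectural details. A minimal sketch of such a mapping, assuming a single hidden layer and assumed dimensions (cepstrum order 12, 20 hidden units); all names and sizes are illustrative:

    import numpy as np

    CEPSTRUM_ORDER = 12   # assumed LPC cepstrum order (not stated in the paper)
    N_MOUTH_PARAMS = 13   # the paper uses 13 mouth shape parameters
    N_HIDDEN = 20         # assumed hidden layer size

    def init_weights(rng):
        """Random weights standing in for the trained network of Fig. 3."""
        return {
            "W1": rng.normal(0.0, 0.1, (N_HIDDEN, CEPSTRUM_ORDER)),
            "b1": np.zeros(N_HIDDEN),
            "W2": rng.normal(0.0, 0.1, (N_MOUTH_PARAMS, N_HIDDEN)),
            "b2": np.zeros(N_MOUTH_PARAMS),
        }

    def cepstrum_to_mouth(cepstrum, w):
        """Map one frame of LPC cepstrum coefficients to mouth shape parameters."""
        h = np.tanh(w["W1"] @ cepstrum + w["b1"])
        # Sigmoid output: assumed normalization of the 13 parameters to (0, 1).
        return 1.0 / (1.0 + np.exp(-(w["W2"] @ h + w["b2"])))

At run time, one such call per audio frame yields the parameter vector that drives the avatar's mouth.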
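Similarly, Section 3.3 does not give the exact boundary criterion or discriminant functions. The sketch below uses a simple threshold on the inter-frame cepstral distance as a stand-in for the combined cepstrum/WLR test, and an equal-prior linear discriminant score for the four categories; the threshold, frame rate, class_means and shared_cov_inv are assumptions, not values taken from the paper:

    import numpy as np

    def utterance_speed(cepstra, threshold=1.0, frame_rate=100.0):
        """Estimate utterance speed as detected segment boundaries per second.
        cepstra: (T, D) LPC cepstrum frames; the threshold and the 100 frames/s
        rate are assumed."""
        d = np.linalg.norm(np.diff(cepstra, axis=0), axis=1)
        boundaries = np.count_nonzero(d > threshold)
        return boundaries / (len(cepstra) / frame_rate)

    EMOTIONS = ["Neutral", "Anger", "Happiness", "Sadness"]

    def classify_emotion(features, class_means, shared_cov_inv):
        """Pick the category with the largest linear discriminant score.
        features: vector of (cepstrum statistics, power, pitch, speed);
        class_means must be ordered as in EMOTIONS."""
        scores = [m @ shared_cov_inv @ features - 0.5 * m @ shared_cov_inv @ m
                  for m in class_means]
        return EMOTIONS[int(np.argmax(scores))]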
Each basic emotion has a specific set of facial expression parameters described by FACS (the Facial Action Coding System) [7]. Examples of the basic expressions are shown in Fig. 7(a)-(c). These faces can also be customized by the user with a face editing tool.

Figure 7(a): Basic emotion Anger
Figure 7(b): Basic emotion Happiness
Figure 7(c): Basic emotion Sadness

3.4 Location Control

Each user can walk and fly through cyberspace under mouse control, and the current locations of all users are always observed by the server system. The avatar images are generated in each client's space from the location information provided by the server system.

3.5 Emotion Key-in

The emotional condition can always be decided from the voice, but sometimes the user wants to give his avatar a specific emotional condition explicitly; he can do so by pushing a function key, and this process works with first priority. For example, pushing "anger" makes a red and bigger face appear; for "happiness" a bouncing face appears, and so on.

3.6 Information Management at Server

The location information, mouth shape parameters and emotion parameters of each avatar are transmitted to the client systems every 1/30 second. The distance between every pair of users is calculated from the avatar location information, and the voices of all users except the listener himself are mixed and amplified with a gain that depends on distance, so the voice from the nearest avatar is very loud and a voice from far away is silent.

3.7 Agent and Space Generation at Client

Based on the facial expression parameters and mouth shape parameters, the avatar's face is synthesized frame by frame, and the avatar's body is placed in cyberspace according to the location information. There are two display modes: the view from the avatar's own eyes, used for eye contact and shown in Fig. 8, and a view from an arbitrary angle, shown in Fig. 9, used to search for other clients in cyberspace or to watch as an audience member. These views can be selected from a menu in the window.

Figure 8: Communication with eye contact
Figure 9: View from the audience

3.8 Voice Output

The playback volume of an avatar's voice depends on the distance to that avatar. Adding a multiple-speaker system would make 3D audio output possible. To keep the lips synchronized, a 64 ms delay is applied to the voice playback.
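Sections 3.6 and 3.8 describe the mixing of the other users' 16 kHz / 16-bit voices with a gain that falls off with distance. The gain law is not stated, so the sketch below assumes a linear fall-off that reaches zero at an arbitrary maximum distance max_dist; the caller passes only the frames of the other users, and the fixed 64 ms playback delay of Section 3.8 is left to the audio output stage:

    import numpy as np

    def mix_voices(frames, positions, listener_pos, max_dist=20.0):
        """Mix one int16 audio frame per remote user, attenuated by distance.
        frames: dict user -> int16 samples (the listener's own voice excluded);
        positions: dict user -> (x, y, z) avatar location from the server."""
        mix = np.zeros_like(next(iter(frames.values())), dtype=np.float32)
        for user, samples in frames.items():
            dist = np.linalg.norm(np.asarray(positions[user], dtype=np.float32)
                                  - np.asarray(listener_pos, dtype=np.float32))
            gain = max(0.0, 1.0 - dist / max_dist)  # nearest avatar loud, far away silent
            mix += gain * samples.astype(np.float32)
        return np.clip(mix, -32768.0, 32767.0).astype(np.int16)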
4. User Adaptation

When a new user joins, his face model and voice model have to be registered before operation. For the voice, new training of the neural network would ideally be performed; however, it takes a very long time for backpropagation to converge. To simplify face model construction and voice learning, GUI tools for speaker adaptation are provided.

4.1 Face Model Fitting

To register the face of a new user, the generic 3D face model is modified to fit the input face image. Only a 2D frontal image is needed; to improve the depth features, a side-view image can also be used. Figure 10 shows the initial and final views of the fitting tool window for a frontal face image. Some of the control points on the face model are shifted manually. Thanks to the easy mouse operation of the GUI tool, it takes only a few minutes to complete a user's face model. The expression control rules are defined on the generic model, so every user's face can be modified in the same way to generate the basic expressions using the FACS-based expression control mechanism.

Figure 10: Fitting tool window (initial and final views)

4.2 Voice Adaptation

Voice data including the 5 vowels were pre-captured from 75 people, and a database of neural network weights and voice parameters was constructed, so speaker adaptation is performed by choosing the optimum weights from this database; the training of the neural network for each of the 75 speakers has already been finished beforehand. When a new, unregistered speaker arrives, he speaks the 5 vowels into the microphone before operation. The LPC cepstrum is calculated for each of the 5 vowels and fed into the neural network; the mouth shape is then calculated with the selected weights, and the error between the true mouth shape and the generated mouth shape is evaluated. This process is applied to every entry in the database one by one, and the weights giving the minimum error are selected as the optimum.
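This adaptation amounts to an exhaustive search over the 75 stored weight sets for the one that minimizes the mouth shape error on the new speaker's 5 vowels. A sketch under that reading follows; predict stands for the cepstrum-to-mouth network of Section 3.2, and all names are illustrative:

    import numpy as np

    def select_speaker_weights(vowel_cepstra, true_mouth_params, weight_db, predict):
        """Choose the pre-trained weight set that best reproduces the reference
        mouth shapes for the new speaker's 5 vowels.
        vowel_cepstra: 5 LPC cepstrum vectors; true_mouth_params: 5 reference
        mouth shape vectors; weight_db: the 75 stored weight sets;
        predict(cepstrum, weights): the frame-wise conversion network."""
        best_w, best_err = None, np.inf
        for w in weight_db:
            err = sum(float(np.sum((predict(c, w) - m) ** 2))
                      for c, m in zip(vowel_cepstra, true_mouth_params))
            if err < best_err:
                best_w, best_err = w, err
        return best_w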
5. Interactive Movie

When people watch a movie, they sometimes imagine their own figure overlapping the actor's image. The interactive movie system we constructed is an image creation system in which a user can control the facial expression and lip motion of his own face image inserted into a movie scene. The user provides voice through a microphone and pushes keys that determine the expression and special effects, and his own video program is generated in real time. This project is named "The Fifteen Seconds of Fame".

First, a frontal face image of the visitor is captured by a camera, and the 3D generic wireframe model is fitted onto the visitor's face image to generate a personal 3D surface model. Facial expressions are synthesized by controlling the grid points of the face model and texture mapping. For speaker adaptation, the visitor speaks the 5 vowels so that the optimum weights can be chosen from the database.

In the interactive process, a famous movie scene plays while the face of the actor or actress is replaced with the visitor's face, and the facial expression and lip shape are controlled synchronously by the captured voice. An active camera also tracks the visitor's face, and the facial expression is additionally controlled by computer-vision-based face image analysis. When there is more than one actor in a scene, all of their faces can be replaced with several visitors' faces, and all of the faces can be controlled by the visitors' voices at the same time. Fig. 11 shows the result of fitting the face model into a movie scene, and Fig. 12 shows a user's face inserted into the actor's face.

Figure 11: Face model fitted into a movie scene
Figure 12: User's face inserted into the actor's face
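The paper does not describe how the synthesized face is composited into the movie frame after the match-move fitting. Purely as an illustration, assuming the rendered face and an alpha mask of the face region have already been aligned with the frame, a per-pixel replacement could look like this; all names are hypothetical:

    import numpy as np

    def composite_face(movie_frame, rendered_face, face_mask):
        """Paste the synthesized face over the actor's face region.
        movie_frame, rendered_face: (H, W, 3) uint8 images aligned by the
        match-move fitting; face_mask: (H, W) alpha values in [0, 1]."""
        alpha = face_mask[..., None].astype(np.float32)
        out = (alpha * rendered_face.astype(np.float32)
               + (1.0 - alpha) * movie_frame.astype(np.float32))
        return out.astype(np.uint8)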
6. Conclusion

A natural communication environment between multiple users in cyberspace, based on the transmission of natural voice and the real-time synthesis of the avatars' facial expressions, has been presented. The synthesis speed for the cyberspace and the avatars is about 10.5 frames per second on an SGI Onyx2 (R10000, 180 MHz). The current system works with 3 users in an intranet environment. To increase the number of users, it is necessary to reduce the network traffic by compressing the voice signal and to reduce the cost of server processing. Our final goal is to realize the system in an Internet environment. An application of this system to entertainment has also been introduced. In the near future, facial image analysis [8][9] and an emotion model [10] will be introduced to improve the communication environment.

7. References

[1] Norman I. Badler, Cary B. Phillips, and Bonnie L. Webber, "Simulating Humans: Computer Graphics Animation and Control", Oxford University Press (1993).
[2] Nadia M. Thalmann and Prem Kalra, "The Simulation of a Virtual TV Presenter", Computer Graphics and Applications, pp. 9-21, World Scientific (1995).
[3] Shigeo Morishima, et al., "Life-Like, Believable Communication Agents", Course Notes #25, ACM SIGGRAPH (1996).
[4] Justine Cassell, et al., "Animated Conversation: Rule-based Generation of Facial Expression, Gesture and Spoken Intonation for Multiple Conversational Agents", Proceedings of SIGGRAPH '94, pp. 413-420 (1994).
[5] Shigeo Morishima and Hiroshi Harashima, "A Media Conversion from Speech to Facial Image for Intelligent Man-Machine Interface", IEEE JSAC, Vol. 9, No. 4, pp. 594-600 (1991).
[6] Shigeo Morishima, "Virtual Face-to-Face Communication Driven by Voice Through Network", Workshop on Perceptual User Interfaces, pp. 85-86 (1997).
[7] Paul Ekman and Wallace V. Friesen, "Facial Action Coding System", Consulting Psychologists Press Inc. (1978).
[8] Irfan Essa, T. Darrell and A. Pentland, "Tracking Facial Motion", Proceedings of the Workshop on Motion of Non-rigid and Articulated Objects, pp. 36-42 (1994).
[9] Kenji Mase, "Recognition of Facial Expression from Optical Flow", IEICE Transactions, Vol. E74, No. 10, October (1991).
[10] Shigeo Morishima, "Modeling of Facial Expression and Emotion for Human Communication System", Displays 17, pp. 15-25, Elsevier (1996).