Face-to-face Communicative Avatar Driven by Voice

Shigeo MORISHIMA and Tatsuo YOTSUKURA
Faculty of Engineering, Seikei University
{shigeo, yotsu}@ee.seikei.ac.jp

ABSTRACT

Recently, interactive virtual reality techniques have made it possible to walk through cyberspace on a computer. An avatar in cyberspace can provide a virtual face-to-face communication environment. In this paper, we realize an avatar with a realistic face in cyberspace to construct a multi-user communication system based on voice transmission through a network. Voice captured from a microphone is transmitted and analyzed, and the avatar's mouth shape and facial expression are estimated and synthesized synchronously in real time. We also introduce an entertainment application of the real-time voice-driven synthetic face: a project named "Fifteen Seconds of Fame", which is an example of an interactive movie.

1. Introduction

Recently, research into creating friendly human interfaces has flourished remarkably. Such interfaces smooth communication between a computer and a human. One style is to have a virtual human[1][2] appear on the computer terminal who is able to understand and express not only linguistic information but also non-verbal information. This is similar to human-to-human communication in a face-to-face style and is sometimes called a Life-like Communication Agent[3][4]. In human-to-human communication, facial expression is the essential means of transmitting non-verbal information and promoting friendliness between the participants.

Our final goal is to generate a virtual space close to a real communication environment between network users. In this paper, a multi-user virtual face-to-face communication environment in cyberspace is presented. Each user is projected into cyberspace as an avatar with a real texture-mapped face whose facial expression and action are controlled by the user. The user can also obtain a view of cyberspace through the avatar's eyes, and so can communicate with other people by exchanging gazes. In addition, the transmitted voice can control the lip shape and facial expression of the avatar through a media conversion algorithm[5][6].

This media conversion technique can also be applied to entertainment. One example is an interactive movie in which the audience can take part in movie scenes as the hero or heroine. The face of a star in a famous movie is replaced with the user's own face, and this face is controlled interactively by the user's voice and a few control functions. In such a system, fitting the face model into the movie scene (the match-move process) is necessary, and making the scene impressive requires a long time of delicate manual work.

2. Face Modeling

To generate a realistic avatar face, a generic face model is manually adjusted to both the user's frontal face image and side-view face image to produce a personal 3D face model. All control rules for facial expressions are defined as 3D movements of grid points in the generic face model. Fig.1 shows a personal model after the fitting process using our original GUI-based face fitting tool. The front-view and side-view images are given to the system, and the corresponding control points are manually moved to reasonable positions by mouse operation.

Figure 1: Personal Face Model Fitted onto Images

The synthesized face is obtained by mapping a blended texture, generated from the user's frontal and side-view images, onto the modified personal face model. The body of the avatar is still under construction, so in the current prototype system each avatar has only a simple body below the neck, as shown in Fig.2.

Figure 2: Personal 3D Avatar

The user's emotional condition can be transmitted to other clients as a feature and motion of the avatar as well as a facial expression, which also realizes non-verbal communication between users.
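The text does not specify how the frontal and side-view photographs are combined into the blended texture. As a rough sketch only, assuming the blend weight of each surface point depends on how much it faces the frontal or the side camera, the blending could look like the following (function and variable names are illustrative, not part of the actual system):

import numpy as np

def blend_face_textures(frontal_rgb, side_rgb, vertex_normals):
    # frontal_rgb, side_rgb: per-vertex colors sampled from the frontal and
    # side-view photographs, each of shape (N, 3).
    # vertex_normals: unit normals of the fitted personal model, shape (N, 3),
    # with +z toward the frontal camera and +/-x toward the side camera.
    frontness = np.clip(vertex_normals[:, 2], 0.0, 1.0)
    sideness = np.abs(vertex_normals[:, 0])
    w = frontness / (frontness + sideness + 1e-6)      # blend weight per vertex
    return w[:, None] * frontal_rgb + (1.0 - w)[:, None] * side_rgb

A view-dependent weight of this kind simply avoids stretching the frontal photograph around the ears and the side photograph across the nose; any comparable weighting scheme would serve the same purpose.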

3. Voice Driven Talking Head

The multi-user communication system in cyberspace is composed of the sub-processes described below.

3.1 Voice Capturing

At the client system, the on-line captured voice of each user is A/D converted at 16 kHz and 16 bits and transmitted to the server system frame by frame through the network. In an intranet environment no speech compression is needed, but for an Internet application a high-compression technique is indispensable to keep voice quality high and to cancel transmission delay. The current prototype system is implemented in an intranet environment.

3.2 Lip Synchronization

At the server system, the voice from each client is phonetically analyzed and converted into mouth shape and expression parameters. LPC cepstrum parameters are converted into mouth shape parameters frame by frame by a neural network trained on vowel features. Fig.3 shows the neural network for parameter conversion. This is a direct and continuous mapping from LPC parameters to mouth shape parameters. The mouth shape is expressed by only 13 parameters describing its features. As training data, mouth shape parameters and LPC parameters for the 5 vowels are prepared. Fig.4 shows an example mouth shape for the vowel “a”. The user can easily modify the mouth shape for each vowel with a mouth shape editing tool.

Figure 3: Neural network for parameter conversion

Figure 4: Mouth shape for vowel “a”

3.3 Emotion Condition

The emotion condition is classified into one of the categories Anger, Happiness, Sadness and Neutral by multiple linear discriminant analysis of the LPC cepstrum, power, pitch frequency and utterance speed. The utterance speed is recognized automatically: based on the cepstrum distance between the current and preceding frames and the WLR distance from a 5-vowel database, phoneme segment boundaries are detected automatically. Fig.5 shows a voice waveform and the segmentation result. This process runs frame by frame along with the utterance, so the current speed is calculated as the number of detected boundary frames divided by the total spoken length.

Figure 5: Result of segmentation

Fig.6 shows the location of each basic emotion condition in the space defined by the utterance speed and pitch coordinates. Each condition is clearly distinguished by these parameters.

Figure 6: Location of Emotion (speed versus normalized power)
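The frame-by-frame conversion of Section 3.2 can be sketched roughly as follows. The cepstrum dimension, hidden-layer size and training details are assumptions made for illustration; only the 13 mouth shape parameters and the 5-vowel training set are taken from the text.

import numpy as np

# Assumed sizes: CEP_DIM LPC cepstrum coefficients in, 13 mouth shape
# parameters out (Section 3.2), one small hidden layer, one frame per vowel.
CEP_DIM, HIDDEN, MOUTH_DIM, N_VOWELS = 16, 32, 13, 5

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, (CEP_DIM, HIDDEN))
W2 = rng.normal(0.0, 0.1, (HIDDEN, MOUTH_DIM))

def forward(cepstrum):
    # One frame of LPC cepstrum -> 13 mouth shape parameters.
    hidden = np.tanh(cepstrum @ W1)
    return hidden @ W2, hidden

def train(vowel_cepstra, vowel_mouth_shapes, lr=0.05, epochs=2000):
    # Plain backpropagation on the 5 vowel examples.
    global W1, W2
    for _ in range(epochs):
        out, hidden = forward(vowel_cepstra)
        err = out - vowel_mouth_shapes                       # (5, 13)
        d_hidden = (err @ W2.T) * (1.0 - hidden * hidden)    # tanh derivative
        W2 -= lr * hidden.T @ err / N_VOWELS
        W1 -= lr * vowel_cepstra.T @ d_hidden / N_VOWELS

# Placeholder data standing in for the measured vowel cepstra and the mouth
# shapes defined with the mouth shape editing tool.
vowel_cepstra = rng.normal(size=(N_VOWELS, CEP_DIM))
vowel_mouth_shapes = rng.uniform(size=(N_VOWELS, MOUTH_DIM))
train(vowel_cepstra, vowel_mouth_shapes)

mouth_params, _ = forward(rng.normal(size=CEP_DIM))          # one incoming frame

Because the mapping is direct and continuous, intermediate mouth shapes between the trained vowels are produced automatically for any incoming frame.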
Each basic emotion has specific facial expression parameters described by FACS (the Facial Action Coding System)[7]. Examples of the basic expressions are shown in Fig.7(a)-(c). These faces can also be customized by the user with the face editing tool.

Figure 7(a): Basic emotion “Anger”

Figure 7(b): Basic emotion “Happiness”

Figure 7(c): Basic emotion “Sadness”
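As an illustration of how a recognized emotion category can be turned into FACS-based expression parameters, the sketch below maps each category to a set of Action Unit intensities and scales them by the detected strength. The particular Action Units and weights shown here are common textbook associations used only for illustration; they are not the control rules actually defined in the face editing tool.

# Illustrative emotion-to-expression mapping (AU choices and weights are
# assumptions, not the system's actual control rules).
BASIC_EXPRESSIONS = {
    "Anger":     {"AU4": 1.0, "AU5": 0.6, "AU7": 0.6, "AU23": 0.8},
    "Happiness": {"AU6": 0.8, "AU12": 1.0},
    "Sadness":   {"AU1": 0.8, "AU4": 0.5, "AU15": 1.0},
    "Neutral":   {},
}

def expression_parameters(emotion, intensity=1.0):
    # Return Action Unit intensities for one recognized emotion; the face
    # model applies these as grid-point displacements defined on the
    # generic model (Section 2).
    aus = BASIC_EXPRESSIONS.get(emotion, {})
    return {au: min(1.0, w * intensity) for au, w in aus.items()}

# Example: a strongly detected "Happiness" frame.
print(expression_parameters("Happiness", intensity=0.9))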

3.4 Location Control

Each user can walk or fly through cyberspace by mouse control, and the current locations of all users are always observed by the server system. The avatar image is generated in the client space from the location information provided by the server system.

3.5 Emotion Key-in

The emotion condition can always be decided from the voice, but the user can also optionally give his avatar a specific emotion condition by pushing a function key. This process works with first priority. For example, when "anger" is keyed in, a red and enlarged face appears; for "happiness", a bouncing face appears, and so on.

3.6 Information Management at Server

The location information, mouth shape parameters and emotion parameters of each avatar are transmitted to the client systems every 1/30 second. The distance between every pair of users is calculated from the avatar location information, and the voices of all users except the listener himself are mixed and amplified with a gain according to distance. As a result, the voice from the nearest avatar is very loud and a voice from far away is silent.

3.7 Agent and Space Generation at Client

Based on the facial expression parameters and mouth shape parameters, the avatar's face is synthesized frame by frame, and the avatar's body is placed in cyberspace according to the location information. There are two display modes: a view from the avatar's own eyes for eye contact, shown in Fig.8, and a view from any angle, shown in Fig.9, used to search for other clients in cyberspace or to become an audience member. These views can be selected from a menu in the window.

Figure 8: A communication with eye contact

Figure 9: View from audience

3.8 Voice Output

The playback volume of an avatar's voice depends on the distance to that avatar. Adding a multiple-speaker system makes 3D audio output possible. To realize lip synchronization, a 64 ms delay is applied to voice playback.

4. User Adaptation

When a new user comes in, his face model and voice model have to be registered before operation. For the voice, new training of the neural network would ideally be performed; however, it takes a very long time for backpropagation to converge. To simplify face model construction and voice learning, GUI tools for speaker adaptation are prepared.

4.1 Face Model Fitting

To register the face of a new user, the generic 3D face model is modified to fit the input face image. Only a 2D frontal image is needed; to improve the depth features, a side-view image can also be utilized. Figure 10 shows the initial and final views of the fitting tool window for a frontal face image. Some of the control points on the face model are shifted manually. Because of the easy mouse operation of the GUI tool, it takes only a few minutes to complete the user's face model. The expression control rules are defined on the generic model, so every user's face can be modified in the same way to generate the basic expressions using the FACS-based expression control mechanism.

Figure 10: Fitting tool window (initial and final views)

4.2 Voice Adaptation

Voice data including the 5 vowels were pre-captured from 75 persons, and a database of neural network weights and voice parameters was constructed; the networks for all 75 persons' data were trained in advance. Speaker adaptation is therefore performed by choosing the optimum weights from the database. When a new, unregistered speaker comes in, he speaks the 5 vowels into the microphone before operation. The LPC cepstrum is calculated for each of the 5 vowels and fed into the neural network; the mouth shape is calculated with the selected weights, and the error between the true mouth shape and the generated mouth shape is evaluated. This process is applied to every entry in the database one by one, and the optimum weights are those giving the minimum error.
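A minimal sketch of this selection step follows, assuming each database entry stores the trained weight matrices of the network in Section 3.2 together with reference mouth shapes; the data layout and names are assumptions made for illustration.

import numpy as np

def select_speaker_weights(new_vowel_cepstra, true_mouth_shapes, database):
    # new_vowel_cepstra : (5, CEP_DIM) LPC cepstra of the new speaker's vowels.
    # true_mouth_shapes : (5, 13) reference mouth shape parameters for the vowels.
    # database          : list of dicts with pre-trained weights "W1", "W2",
    #                     one entry per registered speaker (layout assumed).
    best_entry, best_error = None, np.inf
    for entry in database:
        hidden = np.tanh(new_vowel_cepstra @ entry["W1"])
        predicted = hidden @ entry["W2"]               # (5, 13) mouth shapes
        error = np.mean((predicted - true_mouth_shapes) ** 2)
        if error < best_error:
            best_entry, best_error = entry, error
    return best_entry

Selecting pre-trained weights in this way replaces a lengthy backpropagation run with a single pass over the 75 stored networks, which is what makes on-site registration fast.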
5. Interactive Movie

When people watch a movie, they sometimes superimpose their own figure on the actor's image. The interactive movie system we constructed is an image creation system in which the user can control the facial expression and lip motion of his own face image inserted into a movie scene. The user provides voice through a microphone and pushes keys that determine the expression and special effects, and his own video program is generated in real time. This project is named "The Fifteen Seconds of Fame".

First, a frontal face image of the visitor is captured by a camera. The 3D generic wireframe model is fitted onto the user's face image to generate a personal 3D surface model. Facial expression is synthesized by controlling the grid points of the face model and by texture mapping. For speaker adaptation, the visitor speaks the 5 vowels to choose the optimum weights from the database.

In the interactive process, a famous movie scene plays while the face of the actor or actress is replaced with the visitor's face, and the facial expression and lip shape are controlled synchronously by the captured voice. In addition, an active camera tracks the visitor's face, and the facial expression is controlled by computer-vision-based face image analysis. When there is more than one actor in a scene, all of their faces can be replaced with several visitors' faces, and all of the faces can be controlled by the visitors' voices at the same time. Fig.11 shows the result of fitting the face model into a movie scene. Fig.12 shows a user's face inserted into an actor's face.

Figure 11: Fitted face model into movie scene

Figure 12: User's face inserted into actor's face

6. Conclusion

A natural communication environment between multiple users in cyberspace, based on transmission of natural voice and real-time synthesis of the avatars' facial expressions, has been presented. The synthesis speed for cyberspace and the avatars is about 10.5 frames per second on an SGI Onyx2 (R10k, 180 MHz). The current system works with 3 users in an intranet environment. To increase the number of users, it is necessary to reduce network traffic by compressing the voice signal and to reduce the cost of server processing. Our final goal is to realize the system in an Internet environment. An application of this system to entertainment has also been introduced. In the near future, facial image analysis[8][9] and an emotion model[10] will be introduced to improve the communication environment.

7. References

[1] Norman I. Badler, Cary B. Phillips, and Bonnie L. Webber, "Simulating Humans: Computer Graphics Animation and Control", Oxford University Press (1993).
[2] Nadia M. Thalmann and Prem Kalra, "The Simulation of a Virtual TV Presenter", Computer Graphics and Applications, pp.9-21, World Scientific (1995).
[3] Shigeo Morishima, et al., "Life-Like, Believable Communication Agents", Course Notes #25, ACM SIGGRAPH (1996).
[4] Justine Cassell, et al., "Animated Conversation: Rule-based Generation of Facial Expression, Gesture and Spoken Intonation for Multiple Conversational Agents", Proceedings of SIGGRAPH '94, pp.413-420 (1994).
[5] Shigeo Morishima and Hiroshi Harashima, "A Media Conversion from Speech to Facial Image for Intelligent Man-Machine Interface", IEEE JSAC, Vol.9, No.4, pp.594-600 (1991).
[6] Shigeo Morishima, "Virtual Face-to-Face Communication Driven by Voice Through Network", Workshop on Perceptual User Interfaces, pp.85-86 (1997).
[7] Paul Ekman and Wallace V. Friesen, "Facial Action Coding System", Consulting Psychologists Press Inc. (1978).
[8] Irfan Essa, T. Darrell and A. Pentland, "Tracking Facial Motion", Proceedings of Workshop on Motion of Non-rigid and Articulated Objects, pp.36-42 (1994).
[9] Kenji Mase, "Recognition of Facial Expression from Optical Flow", IEICE Transactions, Vol.E74, No.10, October (1991).
[10] Shigeo Morishima, "Modeling of Facial Expression and Emotion for Human Communication System", Displays 17, pp.15-25, Elsevier (1996).