DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Automated Facial Action Unit Recognition in Horses

ZHENGHONG LI

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Machine Learning
Date: July 5, 2020
Supervisor: Sofia Broomé
Examiner: Hedvig Kjellström
School of Electrical Engineering and Computer Science
Swedish title: Automatisk igenkänning av ansiktsaktionsenheter hos hästar
Abstract

In recent years, with the development of deep learning and the application of deep learning models, computer vision tasks such as human facial action unit recognition have made significant progress. Inspired by these works, we have investigated the possibility of training a model to recognize horse facial action units automatically. With the help of the Equine Facial Action Coding System (EquiFACS), recently created by veterinarians, our aim has been to detect EquiFACS units from images and videos. In this project, we proposed a cascade framework for horse facial action unit recognition from images. We first trained several object detectors to detect predefined regions of interest, and then applied binary classifiers for each action unit in the related regions. We experimented with different types of neural network classifiers and found AlexNet to work best in our framework. Additionally, we transferred a model for human facial action unit recognition to horses and explored strategies to learn the correlations among different action units.
Sammanfattning

Under de senaste åren, med utvecklingen av djupinlärning och dess tillämpningar, har datorseendeuppgifter så som igenkänning av mänskliga ansiktsaktionsenheter gjort stora framsteg. Inspirerad av dessa arbeten har vi undersökt möjligheten att hitta en modell för att automatiskt känna igen hästars ansiktsuttryck. Med hjälp av Equine Facial Action Coding System som nyligen skapats av veterinärer kan vi upptäcka ansiktsaktionsenheter hos hästar som definieras i detta system från bilder och videor. I detta projekt föreslog vi ett kaskadramverk för igenkänning av hästens ansiktsaktionsenheter från bilder. Först tränade vi flera objektdetektorer för att upptäcka de fördefinierade regionerna av intresse. Sedan använde vi binära klassificeringar för varje aktionsenhet i relaterade regioner. Vi testade olika modeller av klassificerare och fann att AlexNet fungerade bäst i våra experiment. Dessutom överförde vi också en modell för mänsklig ansiktsaktionsenhetsigenkänning till hästar och utforskade strategier för att lära sig korrelationerna mellan olika aktionsenheter.
Contents

1 Introduction
  1.1 Research Questions
  1.2 Contributions and Limitations
  1.3 Societal Impact and Sustainability
  1.4 Ethical Considerations
2 Background
  2.1 Facial Action Coding System
    2.1.1 Human Facial Action Coding System
    2.1.2 Equine Facial Action Coding System
  2.2 Image Classification
    2.2.1 Generic Image Classification
    2.2.2 Fine-grained Image Classification
  2.3 Object Detection
  2.4 Facial Feature Point Detection and Head Pose Estimation
    2.4.1 Facial Feature Point Detection
    2.4.2 Head Pose Estimation
  2.5 Facial Action Unit Recognition
    2.5.1 Still Image-Based Models
    2.5.2 Sequence-Based Models
  2.6 Animal Pain Recognition
    2.6.1 Still Image-Based Model
    2.6.2 Sequence-Based Model
3 Methods
  3.1 Dataset
  3.2 Algorithm
    3.2.1 Cascade Framework
    3.2.2 End-to-end Model: DRML
  3.3 Evaluation Methods
4 Results
  4.1 Finding the Best Model Based on the Four Relatively Easy Classes
    4.1.1 Binary Classification for AU101 (Inner Brow Raiser)
    4.1.2 Binary Classification for AD1, AU25, and AD19
  4.2 Binary Classification for Five Other AUs
  4.3 Learning Correlations among AUs
5 Discussion
6 Conclusions
Bibliography
Chapter 1

Introduction

The horse is a highly social species, like humans. In human medicine, we usually analyze people's facial expressions to assess their emotions. Similarly, analyzing facial expressions is a good way to study horse behavior. Earlier studies show that facial expressions can be described as combinations of a number of facial action units, defined according to the observable changes in the skin and the related facial muscle movements. In recent years, a coding system for horse facial action units, called the Equine Facial Action Coding System (EquiFACS) [1], has been created. Based on EquiFACS, we can recognize horse facial actions by detecting their facial action units.

In the past few years, great progress has been made in the field of computer vision. With the application of deep learning models such as convolutional neural networks (CNNs), the accuracy of computer models in some tasks, such as image classification, is even competitive with human capabilities. Related work on human facial action unit detection has also progressed in recent years. Therefore, we set out to find a model that automatically recognizes horse facial actions. This project mainly focuses on how to detect the facial action units of horses from still images. We tried to transfer models for human action unit detection to horses, and we also applied classical CNN models to our task. Finally, we found a cascade framework that works relatively well and yields reasonable results.

1.1 Research Questions

The goal of this project is mainly to address the following research questions:
• Is it possible to find a computer vision method to detect horse facial action units automatically and accurately from images?

• Are the outputs of our models adequate for horse facial action unit recognition?

For the first question, based on our experimental results shown in Chapter 4, we can say that we have found a model that can realize such goals. The second question is discussed further in Chapter 5.

1.2 Contributions and Limitations

There are three main contributions of our project. First, we proposed a cascade framework for the recognition of horse facial action units. Second, we explored different classifiers and found the best model for our framework. Third, we transferred models for human facial action unit recognition to horses and compared them to our framework.

The main limitation of this project was posed by the size of the dataset. As there is no published dataset for horse facial action unit recognition, we used an unpublished dataset provided by veterinarians from the Swedish University of Agricultural Sciences. This dataset is small and unbalanced. Therefore, the amount of data may not support the training of a complex model, and it also limits the generalization capability of our models. Another limitation is caused by noise in the dataset. Our dataset is originally a video dataset, and we sample one frame per video for our experiments. Because of this, some frames contain motion blur, which can cause our models to overfit. In addition, there is no previous work on horse facial action unit recognition, so we have no direct reference for our task and no baseline for the evaluation.

1.3 Societal Impact and Sustainability

To the best of our knowledge, we are the first group to research how to automatically recognize horse facial action units. Our proposed framework is intended to help people analyze horse behaviors, for example for pain recognition. In this way, it will be more convenient for veterinarians to accurately detect whether a horse has a disease and to help keep horses healthy. We also believe that this work can serve as a reference for similar tasks on other animals and can potentially help people and animals live in harmony.
As for sustainability, according to the Sustainable Development Goals adopted by the member countries of the United Nations, our project mainly contributes to Goal 15: Life on Land. One of the main aspects of this goal is to halt biodiversity loss on land. Although the horse is not an endangered species, this project could help if the method proves transferable to other, endangered species. If we know more about animals' behaviors, we can help protect those species by supporting their well-being and saving their lives from diseases.

1.4 Ethical Considerations

One consideration is whether the data used in this project was collected ethically. In this project, the original dataset for the experiments consists of horse films. These films are provided by veterinarians, who have ethical permission to collect them. Moreover, although our project has many positive contributions to society, there are still some potential unethical impacts, such as the invasion of privacy. As our work could also be transferred to human facial action unit recognition and human emotion estimation, this technique may potentially violate people's privacy.
Chapter 2

Background

For many species, facial expressions can be described as combinations of a number of facial action units defined in the related facial action coding system. Such coding systems are the basis of facial action unit recognition tasks for a given species. Before the studies on animals, a large number of related studies on human facial action unit recognition were carried out. Generic work in computer vision and specific studies about faces are also closely related to horse facial action unit recognition. In this chapter, we briefly introduce the related work for our task.

2.1 Facial Action Coding System

2.1.1 Human Facial Action Coding System

Facial expression has been a focus of emotion research for over a hundred years [2]. In 1978, Ekman and Friesen proposed the Facial Action Coding System (FACS) [3]. Each action unit is associated with one or more facial muscles, established by electrically stimulating individual muscles and learning to control them voluntarily [4]. In 2002, Ekman et al. proposed a new version of the human FACS [5], which has been widely used for human emotion recognition. FACS 2002 specifies 9 action units in the upper face and 18 in the lower face. In addition, there are 14 head positions and movements, 9 eye positions and movements, 5 miscellaneous action units, 9 action descriptors, 9 gross behaviors, and 5 visibility codes. FACS is an observer-based measurement of facial expression, with which we can recognize human facial expressions and emotions more accurately.
2.1.2 Equine Facial Action Coding System

Inspired by the progress of the human FACS, a variety of animal facial action coding systems have been created. Since the horse is a highly social species, Wathan et al. [1] created the Equine Facial Action Coding System (EquiFACS) to enhance the understanding of communication and cognition in horses and to provide insights into the effects of domestication. EquiFACS consists of action units (AUs) and action descriptors (ADs), where AUs represent the contraction of a particular facial muscle (or set of muscles) and the resulting facial movements, and ADs represent more general facial movements without specific related facial muscles. For example, AU47 half blink is caused specifically by the muscles around the eyes, whereas AD1 eye white increase can be caused in various ways, such as movement of the eyeballs, which is not related to facial muscles, or wide opening of the eyes, which is related to the muscles around the eyes. In addition to the ADs similar to human facial ADs, the movement of the ears is very important for horses' facial expressions; such action descriptors are specifically named ear action descriptors (EADs). The aim of this project is to recognize equine facial actions by recognizing the AUs and ADs defined in EquiFACS.

2.2 Image Classification

Recognizing horse facial action units from still images can be considered a fine-grained image classification task. Image classification tasks can usually be categorized into two classes: generic or fine-grained classification. The main difference is that fine-grained classification deals with categories that are very similar, such as bird species. Usually, models for fine-grained classification are based on models for generic image classification.

2.2.1 Generic Image Classification

Before deep learning was widely applied in computer vision, many non-deep learning methods were proposed. These methods usually extracted hand-crafted features such as histograms of oriented gradients (HOG) and the scale-invariant feature transform (SIFT) [6], which were then used with non-deep methods such as random sample consensus (RANSAC) [7] and support vector machines (SVM) [8]. After Krizhevsky et al. proposed AlexNet [9], deep convolutional neural networks (CNNs) replaced the traditional methods in image classification with their outstanding performance. Therefore, this project is mainly based on deep learning methods.
CNN LeNet-5 [10] was the first convolutional neural network (CNN). CNNs typically consist of stacked convolutional layers followed by fully-connected layers and are trained by error back-propagation [11]. Besides the AlexNet mentioned above, other classical CNNs such as VGG [12], GoogLeNet [13], and ResNet [14] have been applied as feature extractors in various fields.

Capsule To enhance the expressive power of neural networks, Sabour et al. [15] proposed capsules. A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity, such as an object or an object part. Unlike the scalar output of a standard CNN classifier, the output of a capsule is a vector. The length of the activity vector represents the probability that the entity exists, and the orientation is intended to capture other relevant attributes of the object, such as its appearance. Capsules have been applied to facial action unit detection [16].

2.2.2 Fine-grained Image Classification

Fine-grained image classification methods can usually be categorized into two classes: strongly or weakly supervised methods. Weakly supervised methods only employ class labels for training, while strongly supervised methods additionally employ labeled object bounding boxes for regions of interest [17] or part annotations [18]. For weakly supervised methods, attention mechanisms have been employed to make the model focus on the regions of interest [19, 20]. Besides these methods, a simple alternative for fine-grained image classification is to directly apply a generic image classification model to detected regions of interest.

2.3 Object Detection

As mentioned above, fine-grained image classification models should focus on regions of interest. Therefore, object detection can be employed in our task to help localize regions of interest such as faces, eyes, and nostrils. The subsequent classifiers can then be applied only to these regions to improve the accuracy of facial action unit recognition. Object detection models can usually be classified into two categories: anchor-based models and anchor-free models.
Anchor-based Anchor-based models perform detection based on anchor box proposals. Anchor-based methods can usually be divided into two categories: one-stage and two-stage methods. One-stage methods such as YOLOv2 [21]/v3 [22] and SSD [23] generate the anchor proposals and the detections in one stage and can be trained in an end-to-end manner. Two-stage methods such as Faster R-CNN [24] first generate anchor box proposals via a pre-trained region proposal network (RPN) and then apply ROI pooling and detection networks to the region proposals for the final detection.

Anchor-free Anchor-free methods directly predict bounding boxes without anchor box proposals and are faster than anchor-based models in most cases. Early anchor-free models such as YOLOv1 [25] were not able to reach the accuracy of anchor-based models. Recently, many advanced anchor-free models have been proposed, such as CornerNet [26] and SAPD [27], whose accuracies are highly competitive with anchor-based methods at clearly faster speeds. Specifically, SAPD uses soft-weighted anchor points and soft-selected pyramid levels in the feature pyramid to balance speed and accuracy for anchor-free detectors and reached state-of-the-art performance for object detection.
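Detectors of both families are typically trained and evaluated by matching predicted boxes to ground-truth boxes via intersection over union (IoU). The following minimal sketch is our own illustration of this computation (not code from any of the cited works), assuming boxes in (x1, y1, x2, y2) format.

```python
import numpy as np

def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted face box versus a ground-truth box.
pred = np.array([50, 40, 210, 200], dtype=float)
gt = np.array([60, 50, 220, 210], dtype=float)
print(f"IoU = {iou(pred, gt):.3f}")  # a detection usually counts as correct above a threshold, e.g. 0.5
```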
2.4 Facial Feature Point Detection and Head Pose Estimation

As mentioned above, part annotations are often employed in fine-grained image classification. Facial feature points are the part annotations usually applied in facial action unit detection frameworks. After the facial feature points have been detected, regions that are relevant to facial action units, such as the eyes, nostrils, and mouth, can be cut out for classification. Head pose estimation can be used for image rectification, and it is usually combined with facial feature point detection.

2.4.1 Facial Feature Point Detection

Facial feature point detection is widely used in many applications such as face recognition, facial expression analysis, and 3D face modeling. Existing methods fall into two primary categories according to whether a parametric shape model is needed: parametric shape model-based methods and nonparametric shape model-based methods [28].

Parametric models rely on particular distributions and models such as multivariate Gaussian distributions or Gaussian mixtures. The number of parameters in a parametric model is fixed. One of the most common point distribution models was proposed by Cootes and Taylor [29]. Parametric models can be further divided into two categories: local part model-based methods and holistic model-based methods. Local part model-based methods, e.g. Active Shape Models [30], usually detect each facial feature point locally around some region, and the detected points are then constrained by a global shape model. Holistic model-based methods, e.g. Active Appearance Models [31], usually estimate the locations of facial feature points from a holistic texture representation combined with a global shape model.

Nonparametric models are not based on specific shape distributions. Such methods are usually divided into four categories according to the connection between shape and appearance: exemplar-based methods, graphical model-based methods, cascaded regression-based methods, and deep learning-based methods. Exemplar-based methods [32] generally constrain the configuration of facial feature points by exemplar shapes in the training set. Graphical model-based methods [33] generally constrain the configuration of facial feature points via graphical models such as tree structures or Markov random fields. Cascaded regression-based methods [34] directly estimate a regression function from image appearance in a coarse-to-fine manner without explicitly learning any shape or appearance model. Deep learning-based methods either learn the nonlinear shape and appearance variation or learn the nonlinear mapping from the face appearance to the face shape. Some specific deep learning-based methods have been effectively applied to animal facial feature point detection:

Interspecies [35] is a transfer learning method that transfers knowledge gained from human faces to animal faces. Instead of directly fine-tuning a network trained to detect keypoints on human faces, Rashid et al. proposed a more effective method that first warps the animal images to human-like images and then fine-tunes the pretrained human facial feature point detector for animals. This approach has three main steps: (1) finding nearest-neighbor human faces that have a similar pose to each animal face; (2) using the nearest neighbors to train an animal-to-human warping network; and (3) using the warped (human-like) animal images to fine-tune a pretrained human keypoint detector for animal facial keypoint detection.
DeepLabCut [36] is a toolbox for extracting the geometric configuration of multiple animal body parts, which can also be used for animal facial feature point detection. The idea is to find an effective model that detects keypoints of animal body parts with a small amount of training data. The model is mainly based on a subset of DeeperCut [37]. Both DeepLabCut and DeeperCut are based on variants of an ImageNet-pretrained [38] ResNet with readout layers that predict the location of each body part. The DeepLabCut network is a pretrained ResNet backbone followed by a series of deconvolutional layers. This model can be trained for animal facial feature point detection with a small number of training images (≈ 200).

2.4.2 Head Pose Estimation

Head pose estimation infers the orientation of a person's (or an animal's) head relative to the view of a camera. Since the heads in images are usually not upright, head pose estimation can help us rectify the image for further feature extraction. Head pose estimation is closely related to facial feature point detection: some head pose estimation methods can be transferred to facial feature point detection, while others are based on the detected facial feature points. Head pose estimation methods generally fall into eight categories [39]: appearance template methods, detector array methods, nonlinear regression methods, manifold embedding methods, flexible models, geometric methods, tracking methods, and hybrid methods. Among these, nonlinear regression methods, flexible models, and geometric methods are closely related to facial feature point detection:

Nonlinear regression methods Nonlinear regression methods estimate pose by learning a nonlinear functional mapping from the image space to one or more pose directions. They are closely related to cascaded regression-based methods for facial feature point detection. For example, the facial feature point detector in [34] is based on the Cascade Pose Regression (CPR) [40] method for pose estimation. Specifically, CPR starts with a loosely specified initial guess and progressively refines it toward the target in a coarse-to-fine manner. Each refinement is carried out by a different regressor, and each regressor performs simple image measurements that depend on the output of the previous regressors. The method in [34] mainly changes the parameter error in CPR to an alignment error and employs new shape-indexed features for facial feature point detection.
Flexible models Flexible models are fit to the facial structure of the individual in the image plane, and head pose is estimated from feature-level comparisons or from the instantiation of the model parameters. A common flexible model for head pose estimation is the AAM [31], which is also used for facial feature point detection. Once the model has converged to the feature locations, an estimate of the head pose can be obtained by mapping the appearance parameters to a pose estimate [41].

Geometric methods Geometric methods use the head shape and the precise configuration of local features to estimate pose. A straightforward way is to use five facial feature points (the outside corners of each eye, the outside corners of the mouth, and the tip of the nose); the facial symmetry axis is then found by connecting the midpoint of the eyes with the midpoint of the mouth [42]. Some simple shapes can also be used for head pose estimation. For example, for near-frontal faces, the yaw of a face can be reliably estimated by creating a triangle between the eyes and the mouth and finding its deviation from a pure isosceles triangle [43].

2.5 Facial Action Unit Recognition

We now go through methods in the literature for detecting facial action units. Since most animal FACS were not created until recent years, there are still very few studies on animal facial action unit recognition. In contrast, many studies on human facial action unit recognition have been carried out that are worth referencing for the animal case. Recently, deep learning has been widely applied to facial action unit recognition. These models can be classified into two categories with respect to their inputs: still image-based models and sequence-based models.

2.5.1 Still Image-Based Models

Still image-based models usually attempt to focus on important regions of the image, such as the eyes and mouth, to detect the related AUs. Regional learning methods and attention mechanisms are usually employed in these models.

DRML [44] is the first algorithm that jointly uses a CNN with regional learning and a multi-label sigmoid cross-entropy loss for facial action unit detection.
As AUs are active in sparse facial regions, the authors perform regional learning by inserting a region layer into a classical image classification CNN. The input feature map from the lower convolutional layer is uniformly divided into 8 × 8 patches, and each patch is passed through a convolutional layer to obtain a re-weighted patch. The output is the concatenation of all re-weighted patches. Finally, a multi-label sigmoid cross-entropy loss is employed for training.

ARL [45] is an end-to-end deep learning-based attention and relation learning framework for facial action unit recognition. The framework contains three parts. The first part performs hierarchical and multi-scale region learning: three intermediate layers for 8 × 8, 4 × 4, and 2 × 2 patches are cascaded, and the output of each intermediate layer is concatenated and then summed element-wise with the output of the input layer. The second part performs channel-wise attention learning followed by spatial attention learning. The third part performs pixel-wise relation learning via a fully-connected CRF model.

Capsule Network [16] for facial action unit recognition is a method that exploits the higher expressive power of capsules compared to standard neurons to detect AUs. Similar to CapsNet [15], the proposed network consists of three parts. The first part is a set of convolutional layers employed as a mid-level feature extractor. The second part is two capsule layers: the primary capsules further develop the image features and transform the scalar inputs from the convolutional layers into vector representations, and the class capsules collate the vector outputs of the primary capsules to form the final class predictions. The third part is a reconstruction model for visualizing the properties learned by the class capsules and for regularizing the overall network.

2.5.2 Sequence-Based Models

Compared to still image-based models, sequence-based models can utilize both spatial and temporal information for facial action unit recognition. Theoretically, some AUs are very hard to discriminate in still frames. For example, AU43 eye closed and AU45 blink in FACS 2002 (similarly AU143 eye closure and AU145 blink in EquiFACS) are quite similar actions, and the main differences are the action speed and the duration for which the eye is closed. Therefore, temporal dependence is significant for identifying some AUs.
Long Short-Term Memory (LSTM) [46] methods are commonly employed for temporal feature modeling.

CNN+LSTM [47] is a hybrid network that models spatial representation, temporal dependence, and AU correlation for facial action unit detection. Specifically, spatial representations of each frame are extracted by a CNN. LSTMs are then stacked on top of the CNNs to model temporal dependence. Finally, two fully connected layers with shared parameters are placed on top of both the CNNs and the LSTMs as a fusion network to aggregate spatial and temporal correlations and produce per-frame predictions.

ROI [48] is a deep learning framework for AU detection with region of interest (ROI) adaptation, integrated multi-label learning, and optimal LSTM-based temporal fusion. The authors use 20 ROI Nets for 20 selected face regions. The ROIs are localized via an ensemble of regression trees [49] for facial landmark detection. The output of a VGG feature extractor is then cropped for each ROI Net for more specific feature extraction. The outputs of all ROI Nets are concatenated as the input to the LSTM. Finally, the LSTM learns the temporal dependencies and makes the prediction via multi-label learning.

TCAE [50] is a self-supervised model that learns representations for AU detection from videos without manual annotations. Since the transformation between two face images is caused by both facial actions and head motions, the authors propose a Twin Cycle Autoencoder (TCAE) to disentangle the facial action-related movements from the head motion-related ones. Specifically, TCAE is trained to change the facial actions and head pose of the source face to those of the target face, respectively. After training, the obtained encoder for AU-related movements can be employed for AU detection by stacking a linear classifier on top.

2.6 Animal Pain Recognition

Animal pain recognition is a further task that is closely related to animal facial action unit detection. Since pain is usually a sign of disease, animal pain recognition can help improve animal welfare. For horses, Gleerup et al. [51] found that horses display specific facial expressions when in pain. As very few studies have been carried out in this field, only two specific methods are discussed in the following subsections. Similar to the facial action unit detection models, one is a still image-based model and the other is a sequence-based model.
2.6.1 Still Image-Based Model

Lu et al. [52] proposed one of the earliest works on animal pain recognition. They first detect the facial action units of sheep and then use the AU features for pain estimation. The method consists of five main steps: face detection, facial landmark detection, feature-wise normalization, feature description, and pain level estimation. First, the frontal face is detected with the Viola-Jones object detection framework [53]. Then, CPR is employed to detect sheep facial landmarks. Next, feature-wise normalization is carried out by rectifying the image based on the facial landmarks and cropping the regions of interest. HOG features are then used as descriptors. Finally, an SVM is applied to estimate the pain scores of the sheep.

2.6.2 Sequence-Based Model

Broomé et al. [54] proposed the first work on equine pain recognition from videos, using a deep recurrent two-stream architecture. The proposed model is mainly based on a Convolutional LSTM (C-LSTM) model [55], where the fully-connected matrix multiplications involving the weight matrices in the LSTM equations are replaced with convolutions. The authors further expand it to a two-stream [56] C-LSTM network (a spatial stream on RGB and a temporal stream on optical flow), referred to as C-LSTM-2. The outputs of the two streams are fused by element-wise multiplication or addition for the final classification.
Chapter 3

Methods

3.1 Dataset

The dataset we work with is an unpublished dataset for automated equine facial action unit detection. It contains 21066 labeled video clips for 31 action units (AUs) or action descriptors (ADs), and we randomly sampled one frame from each clip for our experiments. However, the distribution of labeled examples is quite uneven. For instance, there are as many as 5280 labeled clips for EAD104 ear rotator, but only one labeled clip for AU160 lower lip relax. Only 11 categories (Table 3.1) have more than 200 labeled clips, which we consider the minimum number of examples required for training, validation, and testing. Note that because of the lack of annotations of the horse individuals in the videos, we cannot guarantee that the horses in the test set do not appear in the training set. To make the results reasonable for most classes, we randomly separate 70 percent of the labeled clips for training, 15 percent for validation, and 15 percent for testing. In this way, except for AU5, each class has at least 50 positive examples for validation and test. We also moved clips that overlap with each other into the same set, to avoid closely correlated frames appearing in different sets.

Table 3.1: Selected Action Units

Code     Action Unit          Labels
AD1      Eye white increase   395
AD19     Tongue show          451
AD38     Nostril dilator      729
AU101    Inner brow raiser    1933
AU145    Blink                3888
AU25     Lips part            484
AU47     Half blink           1826
AU5      Upper lid raiser     209
AUH13    Nostril lift         354
EAD101   Ears forward         4813
EAD104   Ear rotator          5280
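The split can be implemented as a group-aware shuffle. The sketch below is only an illustration in Python: the clip and group fields are hypothetical, since the dataset and its annotation format are unpublished.

```python
import random
from collections import defaultdict

def split_clips(clips, seed=0):
    """Group-aware 70/15/15 split: clips sharing a 'group' id (e.g. overlapping
    clips from the same video) always end up in the same subset.
    `clips` is a list of dicts such as {"clip_id": ..., "group": ..., "labels": [...]}
    (hypothetical fields; the real annotation format is not published)."""
    groups = defaultdict(list)
    for clip in clips:
        groups[clip["group"]].append(clip)

    group_ids = list(groups)
    random.Random(seed).shuffle(group_ids)

    n = len(group_ids)
    cut_train, cut_val = int(0.70 * n), int(0.85 * n)
    subsets = {"train": group_ids[:cut_train],
               "val": group_ids[cut_train:cut_val],
               "test": group_ids[cut_val:]}
    return {name: [c for g in ids for c in groups[g]] for name, ids in subsets.items()}

# Example with toy data: two overlapping clips share group "v1" and stay together.
toy = [{"clip_id": 0, "group": "v1", "labels": ["AU101"]},
       {"clip_id": 1, "group": "v1", "labels": ["AD1"]},
       {"clip_id": 2, "group": "v2", "labels": ["AU25"]}]
print({k: [c["clip_id"] for c in v] for k, v in split_clips(toy).items()})
```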
3.2 Algorithm

3.2.1 Cascade Framework

Considering the quite uneven distribution of labeled examples in the dataset, directly applying a multi-label learning method may not work well, as the training can easily get stuck in a local minimum where the model always predicts the dominant AUs to be true and the others to be false. Therefore, we chose to employ multiple binary classifiers for these action units. In addition, binary classification of each action unit is a fine-grained image classification task, so directly applying networks for generic image classification fails to reach acceptable results. Noticing that the horse face is usually very small in a raw frame (Figure 3.1), and inspired by the framework for sheep pain estimation [17], we proposed our cascade framework (Figure 3.2) for horse facial action unit recognition. For each input image, we first detect the horse face and cut it out. We then extract the eye regions and the lower face regions from the detected face region. (Because the eye regions and lower face regions are too small in raw frames, the detectors are not able to detect these regions directly.) Finally, classical CNNs for image classification are employed as binary classifiers for the related AUs in these regions. Note that each part is trained separately.

Figure 3.1: A raw example of AD1 eye white increase in our dataset

Figure 3.2: Our cascade framework for horse facial action unit recognition (each part is trained separately)
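To make the data flow of Figure 3.2 concrete, the sketch below shows how the cascade could be wired at inference time. The detector and classifier objects are hypothetical placeholders (the thesis does not publish its implementation); only the overall structure — face detection, region detection inside the face crop, per-AU binary classification on 64 × 64 crops — follows the description above.

```python
import torch
from PIL import Image
from torchvision import transforms

# Hypothetical placeholders: in the real framework these would be trained
# YOLOv3-tiny detectors and AlexNet binary classifiers (see Section 3.2.1).
def detect_one(image, detector):
    """Run a detector and return the highest-scoring box as (left, top, right, bottom), or None."""
    boxes = detector(image)          # assumed to return a list of (box, score) pairs
    if not boxes:
        return None
    return max(boxes, key=lambda b: b[1])[0]

to_tensor = transforms.Compose([transforms.Resize((64, 64)), transforms.ToTensor()])

def cascade_predict(frame: Image.Image, face_det, eye_det, lower_det,
                    eye_classifiers, lower_classifiers):
    """Face -> eye / lower-face crops -> per-AU binary classifiers.
    Returns a dict of AU -> probability; AUs whose region is not found are left at 0.0."""
    preds = {au: 0.0 for au in list(eye_classifiers) + list(lower_classifiers)}
    face_box = detect_one(frame, face_det)
    if face_box is None:
        return preds
    face = frame.crop(face_box)
    for det, classifiers in ((eye_det, eye_classifiers), (lower_det, lower_classifiers)):
        box = detect_one(face, det)
        if box is None:
            continue
        crop = to_tensor(face.crop(box)).unsqueeze(0)  # 1 x 3 x 64 x 64
        with torch.no_grad():
            for au, clf in classifiers.items():
                preds[au] = torch.sigmoid(clf(crop)).item()
    return preds
```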
Object Detector: YOLOv3-tiny YOLOv3 [22] is a widely used object detector. In our project, we employed its lightweight variant YOLOv3-tiny (Figure 3.3), which is easier to transfer to a small dataset, to detect regions of interest such as faces and eyes. YOLOv3 employs Darknet-53 as the feature extractor and uses a feature pyramid for detection, i.e., small feature maps for large objects and large feature maps for small objects. YOLOv3 generates feature maps at three scales, and at each scale it predicts three anchors per location, each comprising the size and position of a predicted bounding box, an objectness score, and class predictions. YOLOv3 also employs nine bounding box priors and combines them with the anchors for the final prediction. Compared to YOLOv3, YOLOv3-tiny uses a smaller feature extractor and only generates feature maps at two scales, which makes it easier to train.

Figure 3.3: Architecture of YOLOv3-tiny (figure from [57])
Binary Classifier: AlexNet After the eye detector or lower face detector, the detected region is resized to 64 × 64. We can then directly apply generic image classification models for action unit binary classification. We tried AlexNet, VGG, and ResNet, and found that AlexNet reached the best performance among the three. The specification of the modified AlexNet architecture is given in Table 3.2. Note that for the experiments on face regions, the architectures are the same as the ImageNet models, and the input size is 224 × 224.

Table 3.2: Modified Architecture of AlexNet for Binary Classification

Stage         Filter                              Output Shape
input image   -                                   64 × 64 × 3
conv1         5 × 5 × 64, stride=2, padding=2     32 × 32 × 64
maxpool       2 × 2                               16 × 16 × 64
conv2         3 × 3 × 192, stride=1, padding=1    16 × 16 × 192
maxpool       2 × 2                               8 × 8 × 192
conv3         3 × 3 × 384, stride=1, padding=1    8 × 8 × 384
conv4         3 × 3 × 256, stride=1, padding=1    8 × 8 × 256
conv5         3 × 3 × 256, stride=1, padding=1    8 × 8 × 256
maxpool       2 × 2                               4 × 4 × 256
avgpool       1/2 input_W × 1/2 input_H           2 × 2 × 256
fc_1          4096                                4096
fc_2          2048                                2048
output        1                                   1
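As a concrete reference, a PyTorch sketch of Table 3.2 might look as follows. The ReLU activations (and the absence of dropout) are assumptions, since the table only lists layer shapes.

```python
import torch
import torch.nn as nn

class AlexNetAU(nn.Module):
    """A sketch of the modified AlexNet in Table 3.2 for 64 x 64 region crops.
    It outputs a single logit per image; activation functions are assumptions."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(inplace=True),  # 32 x 32 x 64
            nn.MaxPool2d(2),                                                              # 16 x 16 x 64
            nn.Conv2d(64, 192, kernel_size=3, padding=1), nn.ReLU(inplace=True),          # 16 x 16 x 192
            nn.MaxPool2d(2),                                                              # 8 x 8 x 192
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),         # 8 x 8 x 384
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),         # 8 x 8 x 256
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),         # 8 x 8 x 256
            nn.MaxPool2d(2),                                                              # 4 x 4 x 256
            nn.AvgPool2d(2),                                                              # 2 x 2 x 256
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 2 * 256, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, 1),  # single logit; apply a sigmoid for a probability
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Sanity check on a random 64 x 64 crop.
logit = AlexNetAU()(torch.randn(1, 3, 64, 64))
print(logit.shape)  # torch.Size([1, 1])
```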
3.2.2 End-to-end Model: DRML

Besides our framework, we also conducted experiments with the Deep Region and Multi-label Learning (DRML) model for facial action unit recognition. DRML is a classical deep model for human facial action unit detection, and we tried to transfer it to our task. (Note that because the pre-trained DRML was not publicly available, we trained the DRML model from a random initialization in the following experiments.) As facial action units (AUs) are active in sparse facial regions, the authors use a regional learning method for facial action unit detection. Moreover, to learn potential correlations between AUs, multi-label learning can be applied in facial action unit recognition. The authors realize these functions by inserting a region layer into a common convolutional neural network and using a multi-label sigmoid cross-entropy loss for training. The network architecture is shown in Figure 3.4.

Figure 3.4: The architecture of the DRML model (figure from [44])

The region layer is shown in Figure 3.5. An input feature map from conv1 is uniformly divided into 8 × 8 patches, each patch is forwarded through a convolutional layer to obtain a re-weighted patch, and the output is the concatenation of all re-weighted patches.

Figure 3.5: Region layer for deep region learning (figure from [44])

Let the number of AUs be $C$, the number of samples be $N$, the ground truth be $Y \in \{-1, 0, 1\}^{N \times C}$ with $Y_{nc}$ denoting the $(n, c)$-th element of $Y$, and the predictions be $\hat{Y} \in \mathbb{R}^{N \times C}$. The loss function is:

$$L(Y, \hat{Y}) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} \left\{ [Y_{nc} > 0] \log \hat{Y}_{nc} + [Y_{nc} < 0] \log\left(1 - \hat{Y}_{nc}\right) \right\} \quad (3.1)$$

where $[x]$ is an indicator function returning 1 if the statement $x$ is true and 0 otherwise. In our experiments, we did not include zero examples, i.e., $Y \in \{-1, 1\}^{N \times C}$.
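A minimal PyTorch sketch of the two DRML ingredients described above is given below. The residual form of the per-patch convolution and the numerical clamping are our assumptions rather than details taken from [44].

```python
import torch
import torch.nn as nn

class RegionLayer(nn.Module):
    """Sketch of DRML's region layer: the feature map is split into an 8 x 8 grid of
    patches, each patch gets its own small convolution (here a residual 3 x 3 conv,
    an assumption), and the re-weighted patches are stitched back together."""

    def __init__(self, channels: int, grid: int = 8):
        super().__init__()
        self.grid = grid
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(grid * grid)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        ph, pw = h // self.grid, w // self.grid
        rows = []
        for i in range(self.grid):
            cols = []
            for j in range(self.grid):
                patch = x[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
                conv = self.convs[i * self.grid + j]
                cols.append(patch + conv(patch))  # re-weighted patch
            rows.append(torch.cat(cols, dim=3))
        return torch.cat(rows, dim=2)

def multilabel_sigmoid_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Equation (3.1) with Y in {-1, 0, 1}: positive and negative labels contribute
    log-likelihood terms, zero (unknown) labels are ignored."""
    probs = torch.sigmoid(logits)
    pos = (targets > 0).float() * torch.log(probs.clamp_min(1e-8))
    neg = (targets < 0).float() * torch.log((1 - probs).clamp_min(1e-8))
    return -(pos + neg).sum(dim=1).mean()

# Toy usage: 2 samples, 4 AUs, with one unknown (0) label.
logits = torch.randn(2, 4)
targets = torch.tensor([[1., -1., 0., 1.], [-1., -1., 1., -1.]])
print(multilabel_sigmoid_loss(logits, targets))
```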
3.3 Evaluation Methods

It is common to employ a confusion matrix (Table 3.3) to analyze the performance of a binary classifier or a detector.

Table 3.3: Confusion Matrix

                 Predicted: Yes          Predicted: No
Actual: Yes      True Positive (TP)      False Negative (FN)
Actual: No       False Positive (FP)     True Negative (TN)

Based on the confusion matrix, we usually calculate four values for the evaluation of the results: accuracy, precision, recall, and F1 score. These values are calculated as:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \quad (3.2)$$

$$\text{Precision} = \frac{TP}{TP + FP} \quad (3.3)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (3.4)$$

$$\text{F1 score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3.5)$$

Accuracy is the most intuitive performance measure; it is the ratio of correctly predicted observations to the total number of observations. Precision is the ratio of correctly predicted positive observations to the total number of predicted positive observations. Recall is the ratio of correctly predicted positive observations to all positive observations in the actual class. The F1 score is the harmonic mean of precision and recall and is commonly used to evaluate detectors. In our task, we use the F1 score to evaluate multi-label classifiers and use both the F1 score and accuracy to evaluate the binary classifiers.
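As a small self-contained illustration of Equations 3.2–3.5 (plain Python, independent of the project's code), the example below also reproduces the always-predict-true case discussed in Section 4.1.1.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 (Equations 3.2-3.5) from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / max(tp + fp + tn + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# A classifier that always predicts "true" on a balanced set gets
# 50% precision, 100% recall, and thus an F1 score of 66.7% (cf. Section 4.1.1).
y_true = [1, 1, 0, 0]
y_pred = [1, 1, 1, 1]
print(binary_metrics(y_true, y_pred))
```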
Chapter 4

Results

According to veterinarians, four of the eleven selected AUs are theoretically easier to recognize from still images: AU101 (inner brow raiser), AD1 (eye white increase), AU25 (lips part), and AD19 (tongue show). The reason is that, in these regions, no other AU is mutually exclusive with them, and their changes are relatively easy to recognize. Therefore, we started with these four AUs to find a model that works for our task and then applied that model to the others. Finally, we tried to let our models learn the correlations among the four relatively easy AUs.

4.1 Finding the Best Model Based on the Four Relatively Easy Classes

We started with the simplest setup, using a number of binary classifiers for AU detection. As we have the most labeled examples for AU101 (inner brow raiser) among these four classes, we first ran experiments on AU101 to find the most reasonable model. We then applied it to the other classes to test whether the chosen model also works for other AUs.

4.1.1 Binary Classification for AU101 (Inner Brow Raiser)

According to EquiFACS [1], the main appearance change of AU101 is that the skin above the inner corner of the eye is pulled dorsally and obliquely towards the medial frontal region (Figure 4.1). Therefore, we believe that the best feature to learn is the angular shape of the inner brow. To help train the binary classification models, we balanced the positive and negative examples in training.
For consistency of our results, we also balanced the positive and negative examples in the validation and test sets. In this way, both the accuracy and the F1 score of a random classifier are at 50 percent. The same was done for the binary classification experiments on the other AUs. Although we are more interested in the F1 score than in accuracy for the detection task, it is not as suitable as accuracy for our binary classification experiments: some models failed in the following experiments, and a model that almost always predicts true can then obtain the highest F1 score. (In this case, the precision is 50 percent while the recall is 100 percent, and thus the F1 score is 66.7 percent.) Based on our observations, accuracy was more stable in our experiments. Therefore, we chose accuracy as the criterion for model selection in our binary classification experiments.

Figure 4.1: An example of the appearance changes of AU101 (figure from [1])

Face Region Crops

We first employed the DRML model for binary classification only on cropped face regions and compared the results to classification on the original images. We then compared the results of the DRML model to AlexNet, VGG, and ResNet on the face region. The classification accuracies are shown in Table 4.1. We also employed Grad-CAM [58] to analyze which regions of interest the network learns. Grad-CAM is a tool that uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting the regions in the image that are important for predicting the concept. The results are shown in Figure 4.2, where the red regions yield the higher classification scores.
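Grad-CAM can be implemented in a few lines with forward and backward hooks. The sketch below is our own minimal version, demonstrated on an untrained torchvision AlexNet as a stand-in for the binary classifiers; it is not the visualization code used in the thesis, and the torchvision API (weights=None, full backward hooks) assumes a recent PyTorch/torchvision release.

```python
import torch
import torch.nn.functional as F
from torchvision.models import alexnet

def grad_cam(model, target_layer, image, class_fn=lambda out: out.sum()):
    """Minimal Grad-CAM: weight the target layer's activations by the spatial mean
    of their gradients w.r.t. the score, then ReLU and upsample to image size."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        model.zero_grad()
        score = class_fn(model(image))
        score.backward()
    finally:
        h1.remove()
        h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)            # per-channel importance
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))   # weighted activation map
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)

# Toy usage on an untrained torchvision AlexNet (stand-in for our binary classifiers).
model = alexnet(weights=None).eval()
image = torch.randn(1, 3, 224, 224)
heatmap = grad_cam(model, model.features[10], image)  # features[10] is the last conv layer
print(heatmap.shape)  # torch.Size([1, 1, 224, 224])
```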
Table 4.1: Accuracies and F1 Scores of Binary Classification of AU101

                     Validation              Test
Model                Accuracy   F1 Score     Accuracy   F1 Score
DRML                 68.7       68.5         69.4       68.3
DRML (face)          63.7       62.0         60.6       57.7
AlexNet (face)       64.2       65.3         64.3       64.6
VGG19 (face)         60.9       60.3         60.4       59.8
ResNet34 (face)      54.4       56.3         54.4       55.6

Although directly applying the DRML model to original frames obtained the best accuracy, the network only seldom focuses on face regions. The reason is that the examples of AU101 are not evenly distributed among different scenes and different individuals in our dataset. As mentioned above, we randomly separated the training, validation, and test sets because of the lack of annotations of individuals, so the distribution of the examples is similar in the three sets. Therefore, the correlations among AU101, individuals, and backgrounds are relatively strong and easy to learn, and the DRML model tends to learn these correlations, which is not desirable behavior. To let the model focus on the faces, we cropped out the face region with the help of the face detector. However, even when we trained the DRML model only on face regions, it seldom focused on the eye regions, which are considered the expected regions of interest. Instead, it learned correlations among other face regions and sometimes also focused on the remaining background patches. Similarly, AlexNet and VGG19 only rarely focused on the eye region, and ResNet34 failed to learn valid features in this task.

Figure 4.2: Heat maps of the models for AU101 on original frames and face regions. Images in the same columns are from the same frames.

Eye Region Crops

Since models applied directly to face regions do not focus on the region we expect them to learn, we trained an eye detector to force the models to learn the features of the eye regions. We trained the eye detector on the detected face region because the eye regions in the original frames are too small to fit the bounding box priors. In this case, since we do not need the deep region layer in the DRML model, we replaced it with a simple 3 × 3 convolutional layer. We applied this modified model to the eye regions, and we also applied AlexNet, VGG, and ResNet in this experiment. The results are shown in Table 4.2 and Figure 4.3.
Table 4.2: Accuracies and F1 Scores of Binary Classification of AU101 in Eye Regions

                       Validation              Test
Model                  Accuracy   F1 Score     Accuracy   F1 Score
AlexNet (eye)          64.6       65.0         66.7       67.6
DRML_replace (eye)     58.4       54.6         56.6       54.9
ResNet34 (eye)         56.3       54.4         55.6       54.4
VGG19 (eye)            54.9       53.0         54.2       55.4

Figure 4.3: Heat maps of the models for AU101 on eye regions. The order of the rows is the same as in Table 4.2. Images in the same columns are from the same frames.

As mentioned above, the expected region of interest should be the inner brow. However, none of these models consistently focuses on this region. AlexNet reaches the best accuracy among these models because it consistently focuses on the region above the eye, which is considered to be strongly correlated with AU101. The result of AlexNet on the eye region is competitive with the best result obtained above, but it is much more reasonable. Compared to AlexNet, the other three models only focus on regions with weak correlations to AU101. VGG19 seems to find good regions, but it mainly focuses on the eyeballs, which are not related to AU101. A possible reason why ResNet and VGG perform worse than AlexNet is that the dimension of their feature maps is too high; in the original DRML model, the highest dimension is only 32, and perhaps AU features actually lie in a lower dimension. Also, the amount of labeled data may not be enough to train deep networks such as ResNet and VGG.

4.1.2 Binary Classification for AD1, AU25, and AD19

Based on the results for AU101, we decided to use AlexNet as the binary classifier in our cascade framework. We then applied our framework to the other three easy classes. The results are shown in Table 4.3 and Figure 4.4.
Table 4.3: Accuracies and F1 Scores of AlexNet for Binary Classification of AD1 (Eye White Increase), AU25 (Lips Part), and AD19 (Tongue Show)

                          Validation              Test
Action Unit               Accuracy   F1 Score     Accuracy   F1 Score
AD1 (eye)                 70.2       70.4         72.0       73.4
AU25 (lower face)         67.0       63.1         70.9       69.1
AD19 (lower face)         67.9       68.1         65.9       65.9

Figure 4.4: Heat maps of AlexNet for AD1 (Eye White Increase), AU25 (Lips Part), and AD19 (Tongue Show)

AD1 Eye White Increase The main feature of AD1 should be the eye white on the eyeball. In our experiments, our framework can usually find the increased eye white, but sometimes it instead focuses on the corners of the eyes. Since AD1 is sometimes caused by more widely opened eyes, the angle of the eye corners is also correlated with AD1. Therefore, the results for AD1 are acceptable.

AU25 Lips Part AU25 and the following AD19 are in the mouth region. Since the mouth is very close to the nostrils on horse faces, we detect them together as the lower face region. We can see that in most cases the model focuses on the lips, but sometimes it pays attention to the nostrils, which are not strongly correlated with the lips.
AD19 Tongue Show There are roughly two cases of AD19. In the first, the mouth opens visibly and the tongue sticks out of the mouth; in this case, our framework focuses on the tongue. In the second, the tongue is only slightly exposed; in this case, our model fails to focus on the tongue and tends to pay attention to the corners of the mouth. Similar to AD1, the shape of the corners of the mouth is also correlated with AD19, so these results are acceptable.

4.2 Binary Classification for Five Other AUs

The above experiments indicate that our framework is effective for the four relatively easy classes. In this section, we extended the model to five other classes: AD38 (nostril dilator), AUH13 (nostril lift), AU145 (blink), AU47 (half blink), and AU5 (upper lid raiser). The results are shown in Table 4.4 and Figure 4.5.
Table 4.4: Accuracies and F1 Scores of AlexNet for Binary Classification of AD38 (Nostril Dilator), AUH13 (Nostril Lift), AU145 (Blink), AU47 (Half Blink), and AU5 (Upper Lid Raiser)

                          Validation              Test
Action Unit               Accuracy   F1 Score     Accuracy   F1 Score
AD38 (lower face)         70.0       67.4         72.0       68.4
AUH13 (lower face)        68.1       71.9         66.7       70.9
AU145 (eye)               67.9       70.5         66.6       69.7
AU47 (eye)                56.6       55.4         54.6       56.5
AU5 (eye)                 72.7       73.9         72.8       71.2

Figure 4.5: Heat maps of AlexNet for AD38 (Nostril Dilator), AUH13 (Nostril Lift), AU145 (Blink), AU47 (Half Blink), and AU5 (Upper Lid Raiser)

Lower Face Region AD38 (nostril dilator) and AUH13 (nostril lift) are both found in the lower face region. We employed the same detector as for AU25 and AD19. We can see that our model pays attention to the nostrils in most cases, and the accuracies and F1 scores are acceptable.

Eye Region For AU145 (blink), we theoretically cannot detect it from still images, because the main difference between AU143 (eye closure) and AU145 is the duration for which the eye is closed. However, we obtain a good result for AU145 here because the original dataset only has 61 labeled clips for AU143, which is not enough for training, and we did not include them in our experiment dataset. This bias in our dataset causes the "good" result. For AU47 (half blink), our framework currently does not work for this class. The reason may be that the differences between AU47 and no action are too small, and AU47 is sometimes confused with AU145. For AU5 (upper lid raiser), our model can correctly focus on the eyelid regions.

4.3 Learning Correlations among AUs

In the above experiments, we trained the models separately for binary classification to learn the features of each AU. However, there exist some potential correlations among different AUs that could be leveraged. For example, some AUs usually appear together, such as AU25 (lips part) and AD19 (tongue show), especially when the tongue is extended out of the mouth. Also, in related work on human facial action unit recognition, it has been shown that learning the correlations among AUs can contribute to the performance of the models.
For example, in the DRML paper, the authors mention that they employed multi-label learning to learn the correlations among AUs. Therefore, we tried to combine our binary classifiers to learn such correlations. We concatenated the outputs of the avgpool layers of each binary classifier and trained a multi-layer perceptron (MLP) via multi-label learning (Figure 4.6). We compared the test result of the synthesized classifier with the result of testing each binary classifier separately.

Figure 4.6: Synthesized framework for learning correlations among AUs (only the MLP was trained due to the limited computational resources)

We also carried out experiments with the DRML model and trained it in two ways: one is to train it directly on all classes via multi-label learning and test it on the whole test set, and the other is to train and test it separately on each class. We carried out these experiments for the four relatively easy classes. Since the distribution of examples over the four classes is quite uneven, with roughly five times more examples of AU101 than of the other three, the training easily gets stuck in a local minimum where the model always predicts AU101 with probability one and keeps the probabilities of the others at zero. To balance the training set, we randomly copied examples of the other three classes to make the number of examples in each class the same as for AU101. Considering that there can be states where none of the four AUs appears, we also added fully negative examples so that the positive examples of each class make up one-fifth of the training set.
Moreover, we balanced the validation and test sets by randomly deleting positive examples and adding fully negative examples so that the positive examples of each class make up one-fifth of each set. The F1 score is employed here for the evaluation of the detection task. In this experiment, the F1 score of a random classifier is 28.6 percent (for each AU, the precision of the random case is 20 percent and the recall is 50 percent). Note that we used the same test set to evaluate each model, and failed cases of the region detectors are also included in the results: if a region detector fails, all related AUs are predicted as false. For the synthesized model, we only forwarded the images where both regions could be detected; if only one region is detected, this region is only forwarded to the related binary classifiers and tested separately. The results of the final classifications are shown in Table 4.5.

Table 4.5: F1 Scores of Synthesized Classifiers and Separate Classifiers for the Four Relatively Easy Classes

Architecture                      Mean    AD1     AD19    AU101   AU25
DRML (multi-label)                39.6    38.8    34.4    34.0    51.4
DRML (separate)                   36.2    45.8    27.7    33.2    38.3
AlexNet (regions, synthesized)    44.6    45.7    40.0    49.4    43.1
AlexNet (regions, separate)       45.0    45.6    41.1    48.2    45.2

We can see that the DRML model trained via multi-label learning performs better than when trained separately, which shows that multi-label learning can help the DRML model learn the correlations among different AUs.
However, for our framework, the synthesized classifier does not outperform the separate classifiers. A possible reason is that training only the final MLP is not effective enough to learn the correlations, whereas in the DRML model both the convolutional layers and the fully connected layers are trained via multi-label learning. Finally, our framework outperforms the DRML model for the detection of these four relatively easy classes. Note that although the DRML model trained on raw frames performed better than our framework in the previous experiments for AU101, our framework outperforms the DRML model for AU101 in this experiment. As discussed above, the DRML model tends to learn background information. Compared to the experiments in Section 4.1, where the positive and negative examples are balanced, the proportion of negative samples for each AU in this experiment is higher. In this way, the correlations between the backgrounds and each AU become weaker, so our framework works better than the DRML model in this experiment.
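For reference, a sketch of the synthesized classifier described in Section 4.3 might look as follows in PyTorch. The hidden layer size is an assumption, and the per-branch feature dimension of 1024 follows the 2 × 2 × 256 avgpool shape in Table 3.2; the frozen binary classifiers themselves are not shown.

```python
import torch
import torch.nn as nn

class SynthesizedAUClassifier(nn.Module):
    """Sketch of the synthesized classifier: frozen per-AU binary classifiers provide
    avgpool features (2 x 2 x 256 = 1024 each, per Table 3.2), the features are
    concatenated, and only a small MLP with one output per AU is trained via
    multi-label learning. The hidden size is an assumption."""

    def __init__(self, num_branches=4, feat_dim=1024, num_aus=4, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_branches * feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_aus),  # one logit per AU (AD1, AD19, AU101, AU25)
        )

    def forward(self, branch_features):
        # branch_features: list of tensors, one per frozen binary classifier
        return self.mlp(torch.cat(branch_features, dim=1))

# Toy usage with random stand-ins for the frozen branches' avgpool features.
model = SynthesizedAUClassifier()
features = [torch.randn(8, 1024) for _ in range(4)]   # batch of 8, four branches
logits = model(features)
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (8, 4)).float())
print(logits.shape, loss.item())
```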
Chapter 5

Discussion

In this chapter, we briefly summarize and discuss some key points of our project and future work that could improve it.

Making the models focus on the correct regions is very important for horse facial action unit detection. Because our dataset is small and the AUs are not evenly distributed in it, there are artifacts such as correlations between the background and the AUs. Therefore, at the beginning of the experiments for AU101, the DRML models trained on raw frames reached the highest accuracy. However, when the proportion of negative examples for each AU became higher in the experiments in Section 4.3, compared to the experiments in Section 4.1, such unreasonable correlations became weaker. In the end, our framework, which forces the classifiers to focus on the correct regions, outperforms the DRML models, which shows the importance of directing the models' attention to the correct regions.

The accuracy of our framework still needs to be improved for practical applications. To evaluate whether our model is adequate for horse facial action unit detection, we should compare the experimental results to a baseline. Since there is no previous work on this task, a human baseline indicating how difficult it is for humans to recognize horse facial action units would be useful. As we do not have such a baseline yet, we chose an alternative way to evaluate our model by comparing our results to those for human facial action unit recognition. For example, the mean F1 score of the DRML model on the BP4D dataset [59] (a human facial action unit detection dataset) over 12 AUs is 48.3 percent, which is much higher than ours for horses. Therefore, the accuracy of our model still needs to be improved for practical applications.