Improving Multimodal fusion via Mutual Dependency Maximisation
Pierre Colombo(1,2), Emile Chapuis(1), Matthieu Labeau(1), Chloe Clavel(1)
(1) LTCI, Telecom Paris, Institut Polytechnique de Paris, (2) IBM GBS France
(1) firstname.lastname@telecom-paris.fr, (2) pierre.colombo@ibm.com

Abstract

Multimodal sentiment analysis is a trending area of research, and multimodal fusion is one of its most active topics. Acknowledging that humans communicate through a variety of channels (i.e. visual, acoustic, linguistic), multimodal systems aim at integrating different unimodal representations into a synthetic one. So far, a consequent effort has been made on developing complex architectures allowing the fusion of these modalities. However, such systems are mainly trained by minimising simple losses such as L1 or cross-entropy. In this work, we investigate unexplored penalties and propose a set of new objectives that measure the dependency between modalities. We demonstrate that our new penalties lead to a consistent improvement (up to 4.3 on accuracy) across a large variety of state-of-the-art models on two well-known sentiment analysis datasets: CMU-MOSI and CMU-MOSEI. Our method not only achieves a new SOTA on both datasets but also produces representations that are more robust to modality drops. Finally, a by-product of our method is a statistical network which can be used to interpret the high-dimensional representations learnt by the model.

1 Introduction

Humans employ three different modalities to communicate in a coordinated manner: the language modality with the use of words and sentences, the vision modality with gestures, poses and facial expressions, and the acoustic modality through changes in vocal tones. Multimodal representation learning has shown great progress in a large variety of tasks including emotion recognition, sentiment analysis (Soleymani et al., 2017), speaker trait analysis (Park et al., 2014) and fine-grained opinion mining (Garcia et al., 2019a). Learning from different modalities is an efficient way to improve performance on the target tasks (Xu et al., 2013). Nevertheless, heterogeneities across modalities increase the difficulty of learning multimodal representations and raise specific challenges. Baltrušaitis et al. (2018) identify fusion as one of the five core challenges in multimodal representation learning, the four others being representation, modality alignment, translation and co-learning. Fusion aims at integrating the different unimodal representations into one common synthetic representation. Effective fusion is still an open problem: the best multimodal models in sentiment analysis (Rahman et al., 2020) improve over their unimodal counterparts, which rely on the text modality only, by less than 1.5% in accuracy. Additionally, fusion should not only improve accuracy but also make representations more robust to missing modalities.

Multimodal fusion can be divided into early and late fusion techniques: early fusion takes place at the feature level (Ye et al., 2017), while late fusion takes place at the decision or scoring level (Khan et al., 2012). Current research in multimodal sentiment analysis mainly focuses on developing new fusion mechanisms relying on deep architectures (e.g. TFN (Zadeh et al., 2017), LFN (Liu et al., 2018), MARN (Zadeh et al., 2018b), MISA (Hazarika et al., 2020), MCTN (Pham et al., 2019), HFNN (Mai et al., 2019), ICCN (Sun et al., 2020)). These models are evaluated on several multimodal sentiment analysis benchmarks such as IEMOCAP (Busso et al., 2008), MOSI (Wöllmer et al., 2013), MOSEI (Zadeh et al., 2018c) and POM (Garcia et al., 2019b; Park et al., 2014). The current state of the art on these datasets uses architectures based on pre-trained transformers (Tsai et al., 2019; Siriwardhana et al., 2020), such as MultiModal BERT (MAG-BERT) and MultiModal XLNET (MAG-XLNET) (Rahman et al., 2020).

The aforementioned architectures are trained by minimising either an L1 loss or a cross-entropy loss between the predictions and the ground-truth labels. To the best of our knowledge, few efforts have been dedicated to exploring alternative losses.
In this work, we propose a set of new objectives to perform and improve over existing fusion mechanisms. These improvements are inspired by the InfoMax principle (Linsker, 1988), i.e. choosing the representation that maximises the mutual information (MI) between two possibly overlapping views of the input. The MI quantifies the dependence of two random variables; contrary to correlation, MI also captures non-linear dependencies between the considered variables. Differently from previous work, which mainly focuses on comparing two modalities, our learning problem involves multiple modalities (e.g. text, audio, video). Our proposed method, which induces no architectural changes, relies on jointly optimising the target loss with an additional penalty term measuring the mutual dependency between the different modalities.

1.1 Our Contributions

We study new objectives to build more performant and robust multimodal representations through an enhanced fusion mechanism, and evaluate them on multimodal sentiment analysis. Our method also allows us to explain the learnt high-dimensional multimodal embeddings. The paper contributions can be summarised as follows:

A set of novel objectives using multivariate dependency measures. We introduce three new trainable surrogates to maximise the mutual dependencies between the three modalities (i.e. audio, language and video). We provide a general algorithm inspired by MINE (Belghazi et al., 2018), which was developed in a bi-variate setting for estimating the MI. Our new method enriches MINE by extending the procedure to a multivariate setting that allows us to maximise different Mutual Dependency Measures: the Total Correlation (Watanabe, 1960), the f-Total Correlation and the Multivariate Wasserstein Dependency Measure (Ozair et al., 2019).

Applications and numerical results. We apply our new set of objectives to five different architectures relying on LSTM cells (Huang et al., 2015) (e.g. EF-LSTM, LFN, MFN) or transformer layers (e.g. MAG-BERT, MAG-XLNET). Our proposed method (1) brings a substantial improvement on two different multimodal sentiment analysis datasets (i.e. MOSI and MOSEI, sec. 5.1), (2) makes the encoder more robust to missing modalities (i.e. when predicting without language, audio or video, the observed performance drop is smaller, sec. 5.3), and (3) provides an explanation of the decisions taken by the neural architecture (sec. 5.4).

2 Problem formulation & related work

In this section, we formulate the problem of learning multimodal representations (sec. 2.1) and review both existing measures of mutual dependency (sec. 2.2) and estimation methods (sec. 2.3). In the rest of the paper, we focus on learning from three modalities (i.e. language, audio and video); however, our approach can be generalised to any arbitrary number of modalities.

2.1 Learning multimodal representations

A plethora of neural architectures have been proposed to learn multimodal representations for sentiment classification. Models often rely on a fusion mechanism (e.g. a multi-layer perceptron (Khan et al., 2012), tensor factorisation (Liu et al., 2018; Zadeh et al., 2019) or complex attention mechanisms (Zadeh et al., 2018a)) that is fed with modality-specific representations. The fusion problem boils down to learning a model $M_f : \mathcal{X}_a \times \mathcal{X}_v \times \mathcal{X}_l \rightarrow \mathbb{R}^d$. $M_f$ is fed with unimodal representations of the inputs $X_{a,v,l} = (X_a, X_v, X_l)$ obtained through three embedding networks $f_a$, $f_v$ and $f_l$. $M_f$ has to retain both modality-specific interactions (i.e. interactions that involve only one modality) and cross-view interactions (i.e. more complex interactions that span across views). Overall, the learning of $M_f$ involves both the minimisation of the downstream task loss and the maximisation of the mutual dependency between the different modalities.
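To make the formulation concrete, the sketch below shows the generic shape of such a system: three unimodal embedding networks $f_a$, $f_v$, $f_l$ followed by a fusion module playing the role of $M_f$. All module names and layer choices here are hypothetical and purely illustrative; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionModel(nn.Module):
    """Illustrative fusion model M_f : X_a x X_v x X_l -> R^d (hypothetical layer choices)."""

    def __init__(self, dim_a, dim_v, dim_l, d_fused):
        super().__init__()
        # f_a, f_v, f_l: unimodal embedding networks (here simple GRUs over feature sequences).
        self.f_a = nn.GRU(dim_a, 32, batch_first=True)
        self.f_v = nn.GRU(dim_v, 32, batch_first=True)
        self.f_l = nn.GRU(dim_l, 64, batch_first=True)
        # Fusion of the three unimodal summaries into one joint representation in R^d.
        self.fusion = nn.Sequential(nn.Linear(32 + 32 + 64, d_fused), nn.ReLU())

    def forward(self, x_a, x_v, x_l):
        # Use the last hidden state of each unimodal encoder as its summary.
        _, h_a = self.f_a(x_a)
        _, h_v = self.f_v(x_v)
        _, h_l = self.f_l(x_l)
        return self.fusion(torch.cat([h_a[-1], h_v[-1], h_l[-1]], dim=-1))
```

The methods discussed below leave such an architecture untouched and only change the training objective.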
2.2 Mutual dependency maximisation

Mutual information as mutual dependency measure: the core ideas we rely on to better learn cross-view interactions are not new. They consist of mutual information maximisation (Linsker, 1988) and deep representation learning. Thus, one of the most natural choices is to use the MI, which measures the dependence between two random variables, including high-order statistical dependencies (Kinney and Atwal, 2014). Given two random variables X and Y, the MI is defined as

$$I(X;Y) \triangleq \mathbb{E}_{XY}\left[\log \frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)}\right], \qquad (1)$$

where $p_{XY}$ is the joint probability density function (pdf) of $(X, Y)$, and $p_X$, $p_Y$ denote the marginal pdfs. MI can also be written in terms of the KL divergence:

$$I(X;Y) \triangleq KL\left[\,p_{XY}(x,y)\,\|\,p_X(x)\,p_Y(y)\,\right]. \qquad (2)$$

Extension of mutual dependency to different metrics: the KL divergence seems to be limited when used for estimating MI (McAllester and Stratos, 2020). A natural step is to replace the KL divergence in Eq. 2 with different divergences, such as the f-divergences, or with distances, such as the Wasserstein distance. Hence, we introduce new mutual dependency measures (MDM): the f-Mutual Information (Belghazi et al., 2018), denoted $I_f$, and the Wasserstein Measure (Ozair et al., 2019), denoted $I_W$. As previously, $p_{XY}$ denotes the joint pdf, and $p_X$, $p_Y$ denote the marginal pdfs. The new measures are defined as follows:

$$I_f \triangleq D_f\left(p_{XY}(x,y);\; p_X(x)\,p_Y(y)\right), \qquad (3)$$

where $D_f$ denotes any f-divergence, and

$$I_W \triangleq \mathcal{W}\left(p_{XY}(x,y);\; p_X(x)\,p_Y(y)\right), \qquad (4)$$

where $\mathcal{W}$ denotes the Wasserstein distance (Peyré et al., 2019).

2.3 Estimating mutual dependency measures

The computation of MI and other mutual dependency measures can be difficult without knowing the marginal and joint probability distributions; thus it is popular to maximise lower bounds to obtain better representations of different modalities, including image (Tian et al., 2019; Hjelm et al., 2018), audio (Dilpazir et al., 2016) and text (Kong et al., 2019) data. Several estimators have been proposed: MINE (Belghazi et al., 2018) uses the Donsker-Varadhan representation (Donsker and Varadhan, 1985) to derive a parametric lower bound; Nguyen et al. (2017, 2010) use a variational characterisation of the f-divergence; and a multi-sample version of the density ratio is also known as noise contrastive estimation (Oord et al., 2018; Ozair et al., 2019). These methods have mostly been developed and studied in a bi-variate setting.

Illustration of neural dependency measures in a bivariate case. In Fig. 1 we report the aforementioned dependency measures (see Eq. 2, Eq. 3, Eq. 4) when estimated with MINE (Belghazi et al., 2018) for multivariate Gaussian random variables X_a and X_b. The component-wise correlation for the considered multivariate Gaussian is defined as corr(X_i, X_k) = δ_{i,k} ρ, where ρ ∈ (−1, 1) and δ_{i,k} is Kronecker's delta. We observe that the dependency measure based on the Wasserstein distance differs from the one based on the divergences and thus leads to different gradients. Although theoretical studies have been conducted on the use of different metrics for dependency estimation, which one is best suited remains an open question. In this work, we provide an experimental answer in a specific case.

[Figure 1: Estimation of different dependency measures for multivariate Gaussian random variables for different degrees of correlation.]

3 Model and training objective

In this section, we introduce our new set of losses to improve fusion. In sec. 3.1, we first extend widely used bi-variate dependency measures to multivariate dependency measures (MDM) (James and Crutchfield, 2017). We then introduce variational bounds on the MDM and, in sec. 3.2, describe our method to minimise the proposed variational bounds.

Notations. We consider X_a, X_v, X_l as the multimodal data from the audio, video and language modality respectively, with joint probability distribution $p_{X_a X_v X_l}$. We denote by $p_{X_j}$ the marginal distribution of $X_j$, with $j \in \{a, v, l\}$ corresponding to the j-th modality.

General loss. As previously mentioned, we rely on the InfoMax principle (Linsker, 1988) and aim at jointly maximising the MDM between the different modalities and minimising the task loss; hence, we are in a multi-task setting (Argyriou et al., 2007; Ruder, 2017) and the objective of interest can be defined as

$$\mathcal{L} \triangleq \underbrace{\mathcal{L}_{down.}}_{\text{main task}} \; - \; \lambda \cdot \underbrace{\mathcal{L}_{MDM}}_{\text{mutual dependency term}}. \qquad (5)$$
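As a minimal sketch of Eq. 5, assuming a scalar `mdm_estimate` returned by one of the estimators introduced in sec. 3.2 (all names here are hypothetical):

```python
import torch.nn.functional as F

def joint_objective(predictions, targets, mdm_estimate, lam=0.1):
    """Eq. 5: L = L_down - lambda * L_MDM (sketch; lam plays the role of the meta-parameter lambda)."""
    task_loss = F.l1_loss(predictions, targets)  # L1 regression loss, as used for MOSI/MOSEI
    return task_loss - lam * mdm_estimate        # subtracting the estimate maximises the dependency term
```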
$\mathcal{L}_{down.}$ represents a downstream task-specific loss, i.e. a binary cross-entropy or an L1 loss, λ is a meta-parameter, and $\mathcal{L}_{MDM}$ is the multivariate dependency measure (see sec. 3.2). Minimisation of our newly defined objectives requires deriving lower bounds on the $\mathcal{L}_{MDM}$ terms, and then obtaining trainable surrogates.

3.1 From bivariate to multivariate dependencies

In our setting, we aim at maximising cross-view interactions involving three modalities; thus we need to generalise bivariate dependency measures to multivariate dependency measures.

Definition 3.1 (Multivariate Dependency Measures). Let X_a, X_v, X_l be a set of random variables with joint pdf $p_{X_a X_v X_l}$ and respective marginal pdfs $p_{X_j}$ with $j \in \{a, v, l\}$. We define the multivariate mutual information $I_{kl}$, also referred to as total correlation (Watanabe, 1960) or multi-information (Studenỳ and Vejnarová, 1998):

$$I_{kl} \triangleq KL\Big(p_{X_a X_v X_l}(x_a, x_v, x_l)\,\Big\|\, \prod_{j \in \{a,v,l\}} p_{X_j}(x_j)\Big).$$

Similarly, for any f-divergence we define the multivariate f-mutual information $I_f$ as

$$I_f \triangleq D_f\Big(p_{X_a X_v X_l}(x_a, x_v, x_l);\; \prod_{j \in \{a,v,l\}} p_{X_j}(x_j)\Big).$$

Finally, we also extend Eq. 4 to obtain the multivariate Wasserstein dependency measure $I_W$:

$$I_W \triangleq \mathcal{W}\Big(p_{X_a X_v X_l}(x_a, x_v, x_l);\; \prod_{j \in \{a,v,l\}} p_{X_j}(x_j)\Big),$$

where $\mathcal{W}$ denotes the Wasserstein distance.

Remark. In this work we choose to generalise MINE to compute multivariate dependencies; comparing our proposed algorithm to the other alternatives mentioned in sec. 2 is left for future work. This choice is driven by two main reasons: (1) our framework allows the use of various types of contrast measures (e.g. the Wasserstein distance, f-divergences); (2) the critic network $T_\theta$ can be used for interpretability purposes, as shown in sec. 5.4.

3.2 From theoretical bounds to trainable surrogates

To train our neural architecture we need to estimate the previously defined multivariate dependency measures. We rely on the neural estimators given in Th. 1.

Theorem 1 (Multivariate Neural Dependency Measures). Let $T_\theta : \mathcal{X}_a \times \mathcal{X}_v \times \mathcal{X}_l \rightarrow \mathbb{R}$ be a family of functions parametrised by a deep neural network with learnable parameters $\theta \in \Theta$. The neural multivariate mutual information measure $I_{kl}$ is defined as

$$I_{kl} \triangleq \sup_{\theta} \; \mathbb{E}_{p_{X_a X_v X_l}}[T_\theta] - \log \mathbb{E}_{\prod_{j \in \{a,v,l\}} p_{X_j}}\big[e^{T_\theta}\big]. \qquad (6)$$

The neural multivariate f-mutual information measure $I_f$ is defined as

$$I_f \triangleq \sup_{\theta} \; \mathbb{E}_{p_{X_a X_v X_l}}[T_\theta] - \mathbb{E}_{\prod_{j \in \{a,v,l\}} p_{X_j}}\big[e^{T_\theta - 1}\big]. \qquad (7)$$

The neural multivariate Wasserstein dependency measure $I_W$ is defined as

$$I_W \triangleq \sup_{\theta :\, T_\theta \in \mathcal{L}} \; \mathbb{E}_{p_{X_a X_v X_l}}[T_\theta] - \mathbb{E}_{\prod_{j \in \{a,v,l\}} p_{X_j}}[T_\theta], \qquad (8)$$

where $\mathcal{L}$ is the set of all 1-Lipschitz functions from $\mathbb{R}^d$ to $\mathbb{R}$.

Sketch of proofs. Eq. 6 is a direct application of the Donsker-Varadhan representation of the KL divergence (we assume that the integrability constraints are satisfied). Eq. 7 comes from the work of Nguyen et al. (2017). Eq. 8 comes from the Kantorovich-Rubinstein duality: we refer the reader to (Villani, 2008; Peyré et al., 2019) for a rigorous and exhaustive treatment.

Practical estimate of the variational bounds. The empirical estimators derived from Th. 1 can be used in a practical way: the expectations in Eq. 6, Eq. 7 and Eq. 8 are estimated using empirical samples from the joint distribution $p_{X_a X_v X_l}$. The empirical samples from $\prod_{j \in \{a,v,l\}} p_{X_j}$ are obtained by shuffling the samples from the joint distribution within a batch. We integrate this into the minimisation of the multi-task objective (5) by using minus the estimator. We refer to the losses obtained with the penalty based on the estimators described in Eq. 6, Eq. 7 and Eq. 8 as $\mathcal{L}_{kl}$, $\mathcal{L}_f$ and $\mathcal{L}_W$ respectively. Details on the practical minimisation of our variational bounds are provided in Algorithm 1.
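The following sketch shows how the three bounds of Th. 1 can be estimated on a mini-batch: `T_theta` is a statistic network scoring a triplet of modality representations, and product-of-marginal samples are obtained by permuting each modality independently within the batch, as in Algorithm 1 below. This is illustrative code with hypothetical names, not the authors' implementation; in particular, the Wasserstein variant additionally requires constraining `T_theta` to be (approximately) 1-Lipschitz, which is not shown here.

```python
import math
import torch

def shuffle_modalities(x_a, x_v, x_l):
    """Approximate samples from the product of marginals by permuting each modality independently."""
    n = x_a.size(0)
    return x_a[torch.randperm(n)], x_v[torch.randperm(n)], x_l[torch.randperm(n)]

def mdm_estimates(T_theta, x_a, x_v, x_l):
    """Empirical versions of Eq. 6 (Donsker-Varadhan), Eq. 7 (f-divergence) and Eq. 8 (Wasserstein)."""
    t_joint = T_theta(x_a, x_v, x_l)                       # scores of samples from the joint distribution
    t_marg = T_theta(*shuffle_modalities(x_a, x_v, x_l))   # scores of shuffled (product-of-marginals) samples
    log_mean_exp = torch.logsumexp(t_marg, dim=0) - math.log(t_marg.numel())
    i_kl = t_joint.mean() - log_mean_exp                   # Eq. 6
    i_f = t_joint.mean() - torch.exp(t_marg - 1.0).mean()  # Eq. 7
    i_w = t_joint.mean() - t_marg.mean()                   # Eq. 8 (T_theta assumed 1-Lipschitz)
    return i_kl, i_f, i_w
```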
Algorithm 1: Two-stage procedure to minimise multivariate dependency measures.

INPUT: D_n = {(x_a^j, x_v^j, x_l^j), for all j in [1, n]}, the multimodal training dataset; m, the batch size; σ_a, σ_v, σ_l : [1, m] -> [1, m], three permutations; θ_c, the weights of the deep classifier; θ, the weights of the statistic network T_θ.
Initialisation: parameters θ and θ_c.
Build negative dataset: D̄_n = {(x_a^{σ_a(j)}, x_v^{σ_v(j)}, x_l^{σ_l(j)}), for all j in [1, n]}.
Optimisation:
while (θ, θ_c) not converged do
    for i in [1, Unroll] do
        Sample B from D_n (i.e. from p_{X_a X_v X_l})
        Sample B̄ from D̄_n (i.e. from the product of the marginals p_{X_j})
        Update θ based on the empirical version of Eq. 6, Eq. 7 or Eq. 8
    end for
    Sample a batch B from D_n
    Update θ_c with B using Eq. 5
end while
OUTPUT: classifier weights θ_c.

4 Experimental setting

In this section, we present our experimental settings, including the neural architectures we compare, the datasets, the metrics and our methodology, which includes the hyper-parameter selection.

4.1 Datasets

We empirically evaluate our methods on two English datasets: CMU-MOSI and CMU-MOSEI. Both datasets have been frequently used to assess model performance in human multimodal sentiment and emotion recognition.
CMU-MOSI: the Multimodal Opinion Sentiment Intensity dataset (Wöllmer et al., 2013) is a sentiment-annotated dataset gathering 2,199 short monologue video clips.
CMU-MOSEI: the CMU Multimodal Opinion Sentiment and Emotion Intensity dataset (Zadeh et al., 2018c) is an emotion- and sentiment-annotated corpus consisting of 23,454 movie review videos taken from YouTube.
Both CMU-MOSI and CMU-MOSEI are labelled by humans with a sentiment score in [-3, 3]. For each dataset, three modalities are available; we follow prior work (Zadeh et al., 2018b, 2017; Rahman et al., 2020) and use features obtained as follows.[1]
Language: video transcripts are converted to word embeddings using either GloVe (Pennington et al., 2014), BERT or XLNET contextualised embeddings. For GloVe, the embeddings are of dimension 300, whereas for BERT and XLNET this dimension is 768.
Vision: vision features are extracted using Facet, which results in facial action units corresponding to facial muscle movements. For CMU-MOSEI, the video vectors are composed of 47 units, and for CMU-MOSI they are composed of 35.
Audio: audio features are extracted using COVAREP (Degottex et al., 2014). This results in a vector of dimension 74 which includes 12 Mel-frequency cepstral coefficients (MFCCs), as well as pitch tracking and voiced/unvoiced segmenting features, peak slope parameters, maxima dispersion quotients and glottal source parameters.
Video and audio are aligned on the text following the convention introduced in (Chen et al., 2017) and the forced alignment described in (Yuan and Liberman, 2008).

[1] Data for CMU-MOSI and CMU-MOSEI can be obtained from https://github.com/WasifurRahman/BERT_multimodal_transformer

4.2 Evaluation metrics

Multimodal Opinion Sentiment Intensity prediction is treated as a regression problem. Thus, we report both the Mean Absolute Error (MAE) and the correlation of model predictions with the true labels. In the literature, the regression task is also turned into a binary classification task for polarity prediction. We follow standard practices (Rahman et al., 2020) and report the accuracy[2] of our best performing models (Acc7 denotes accuracy on 7 classes and Acc2 the binary accuracy).

[2] The regression outputs are turned into categorical values to obtain either 2 or 7 categories (see (Rahman et al., 2020; Zadeh et al., 2018a; Liu et al., 2018)).

4.3 Neural architectures

In our experiments, we modify the loss function of different models that have been introduced for multimodal sentiment analysis on both CMU-MOSI and CMU-MOSEI: the Memory Fusion Network (MFN (Zadeh et al., 2018a)), the Low-rank Multimodal Fusion network (LFN (Liu et al., 2018)) and two state-of-the-art transformer-based models (Rahman et al., 2020) whose fusion relies on BERT (Devlin et al., 2018) (MAG-BERT) and XLNET (Yang et al., 2019) (MAG-XLNT). To assess the validity of the proposed losses, we also apply our method to a simple early fusion LSTM (EF-LSTM) as a baseline model.
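Any of these models can be plugged into the two-stage procedure of Algorithm 1. A condensed sketch of one outer iteration is given below, reusing `shuffle_modalities` from the earlier sketch; `encoder`, `head` and the optimisers are hypothetical, only a Wasserstein-style contrast is shown for the statistic-network updates, and, following the appendix illustration, the statistic network here scores the fused representation rather than the raw triplet.

```python
import torch
import torch.nn.functional as F

def training_step(encoder, head, T_theta, opt_main, opt_T, batch, lam=0.1, unroll=10):
    """One outer iteration of Algorithm 1 (sketch): update T_theta, then the classifier."""
    x_a, x_v, x_l, y = batch
    # Stage 1: unrolled updates of the statistic network on joint vs. shuffled batches.
    for _ in range(unroll):
        with torch.no_grad():                      # the encoder is not updated in this stage
            z = encoder(x_a, x_v, x_l)
            z_neg = encoder(*shuffle_modalities(x_a, x_v, x_l))
        mdm = T_theta(z).mean() - T_theta(z_neg).mean()
        opt_T.zero_grad()
        (-mdm).backward()                          # gradient ascent on the dependency bound
        opt_T.step()
    # Stage 2: update encoder and prediction head with the multi-task objective of Eq. 5.
    z = encoder(x_a, x_v, x_l)
    z_neg = encoder(*shuffle_modalities(x_a, x_v, x_l))
    mdm = T_theta(z).mean() - T_theta(z_neg).mean()
    loss = F.l1_loss(head(z), y) - lam * mdm
    opt_main.zero_grad()
    loss.backward()
    opt_main.step()
    return loss.item()
```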
Model overview: the aforementioned models can be seen as a multimodal encoder $f_{\theta_e}$ providing a representation $Z_{avl}$ that contains information and dependencies between the modalities $X_l$, $X_a$, $X_v$, namely

$$f_{\theta_e}(X_a, X_v, X_l) = Z_{avl}.$$

As a final step, a linear transformation $A_{\theta_p}$ is applied to $Z_{avl}$ to perform the regression.
EF-LSTM is the most basic architecture used in current multimodal analysis, where each sequence view is encoded separately with LSTM channels. A fusion function is then applied to all representations.
TFN computes a representation of each view and then applies a fusion operator. Acoustic and visual views are first mean-pooled and then encoded through a 2-layer perceptron; linguistic features are computed with an LSTM channel. Here, the fusion function is a cross-modal product capturing unimodal, bimodal and trimodal interactions across modalities.
MFN enriches the previous EF-LSTM architecture with an attention module that computes a cross-view representation at each time step. These are then gathered, and a final representation is computed by a gated multi-view memory (Zadeh et al., 2018a).
MAG-BERT and MAG-XLNT are based on pre-trained transformer architectures (Devlin et al., 2018; Yang et al., 2019) allowing the inputs of each transformer unit to be multimodal, thanks to a special gate inspired by Wang et al. (2018). Here $Z_{avl}$ is the [CLS] representation provided by the last transformer head.
For each architecture, we use the optimal architecture hyperparameters provided by the associated papers (see sec. 8).

5 Numerical results

We present and discuss here the results obtained using the experimental setting described in sec. 4. To better understand the impact of our new methods, we investigate the following points:
Efficiency of the L_MDM: to gain understanding of the usefulness of our new objectives, we study the impact of adding the mutual dependency term to the basic multimodal neural model EF-LSTM.
Improving model performance and comparing multivariate dependency measures: the choice of the most suitable dependency measure for a given task is still an open problem (see sec. 3). Thus, we compare the performance of the different dependency measures on both multimodal sentiment and emotion prediction tasks. The compared measures are combined with different models using various fusion mechanisms.
Improving the robustness to modality drop: a desirable quality of multimodal representations is robustness to a missing modality. We study how the maximisation of mutual dependency measures during training affects the robustness of the representation when a modality becomes missing.
Towards explainable representations: the statistical network $T_\theta$ allows us to compute a dependency measure between the three considered modalities. We carry out a qualitative analysis in order to investigate whether a high dependency can be explained by complementariness across modalities.

5.1 Efficiency of the MDM penalty

For a simple EF-LSTM, we study the improvement induced by the addition of our MDM penalty. The results are presented in Tab. 1, where an EF-LSTM trained with no mutual dependency term is denoted L∅. On both studied datasets, we observe that the addition of an MDM penalty leads to stronger performance on all metrics, and that the best performing models are obtained by training with an additional mutual dependency measure term. Keeping in mind the example shown in Fig. 1, we can draw a first comparison between the different dependency measures. Although in a simple case L_f and L_kl estimate a similar quantity (see Fig. 1), in more complex practical applications they do not achieve the same performance. Even though the Donsker-Varadhan bound used for L_kl is stronger[3] than the one used to estimate L_f, for a simple model the stronger bound does not lead to better results. It is possible that most of the observed differences in performance come from the optimisation process during training.[4]
Takeaways: in the simple case of the EF-LSTM, adding the MDM penalty improves performance on the downstream tasks.

[3] For a fixed T_θ, the right-hand term in Eq. 6 is greater than that of Eq. 7.
[4] Similar conclusions have been drawn for metric learning when comparing different estimates of the mutual information (Boudiaf et al., 2020).

5.2 Improving models and comparing multivariate dependency measures

In this experiment, we apply the different penalties to more advanced architectures, using various fusion mechanisms.
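The tables below report Acc7, Acc2, MAE and the Pearson correlation. A sketch of how these metrics can be derived from the regression outputs is given here; the exact conventions (in particular, whether neutral clips are excluded from the binary accuracy) vary across papers, so this is an assumption rather than the authors' evaluation script.

```python
import numpy as np
from scipy.stats import pearsonr

def sentiment_metrics(preds, labels):
    """Acc7, Acc2, MAE and Pearson correlation from regression outputs in [-3, 3] (sketch)."""
    preds, labels = np.asarray(preds, dtype=float), np.asarray(labels, dtype=float)
    acc7 = np.mean(np.round(np.clip(preds, -3, 3)) == np.round(np.clip(labels, -3, 3)))
    nonzero = labels != 0                  # assumption: binary accuracy computed on non-neutral clips
    acc2 = np.mean((preds[nonzero] > 0) == (labels[nonzero] > 0))
    mae = np.mean(np.abs(preds - labels))
    corr, _ = pearsonr(preds, labels)
    return {"Acc7": acc7, "Acc2": acc2, "MAE": mae, "Corr": corr}
```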
Table 1: Results on sentiment analysis on both CMU-MOSI and CMU-MOSEI for an EF-LSTM. Acc7 denotes accuracy on 7 classes and Acc2 the binary accuracy. MAE denotes the Mean Absolute Error and Corr the Pearson correlation. (h) means higher is better and (l) means lower is better. The choice of evaluation metrics follows standard practices (Rahman et al., 2020). Underlined results demonstrate significant improvement (p-value below 0.05) against the baseline when performing the Wilcoxon Mann-Whitney test (Wilcoxon, 1992) on 10 runs using different seeds.

Loss | Acc7 (h) | Acc2 (h) | MAE (l) | Corr (h)
CMU-MOSI
L∅   | 31.1 | 76.1 | 1.00 | 0.65
Lkl  | 31.7 | 76.4 | 1.00 | 0.66
Lf   | 33.7 | 76.2 | 1.02 | 0.66
LW   | 33.5 | 76.4 | 0.98 | 0.66
CMU-MOSEI
L∅   | 44.2 | 75.0 | 0.72 | 0.52
Lkl  | 44.5 | 75.6 | 0.70 | 0.53
Lf   | 45.5 | 75.2 | 0.70 | 0.52
LW   | 45.3 | 75.9 | 0.68 | 0.54

Table 2: Results on sentiment and emotion prediction on both CMU-MOSI and CMU-MOSEI for the different neural architectures presented in sec. 4, relying on various fusion mechanisms.

Model / Loss | CMU-MOSI Acc7 (h) | Acc2 (h) | MAE (l) | Corr (h) | CMU-MOSEI Acc7 (h) | Acc2 (h) | MAE (l) | Corr (h)
MFN  L∅  | 31.3 | 76.6 | 1.01 | 0.62 | 44.4 | 74.7 | 0.72 | 0.53
MFN  Lkl | 32.5 | 76.7 | 0.96 | 0.65 | 44.2 | 74.7 | 0.72 | 0.57
MFN  Lf  | 35.7 | 77.4 | 0.96 | 0.65 | 46.1 | 75.4 | 0.69 | 0.56
MFN  LW  | 35.9 | 77.6 | 0.96 | 0.65 | 46.2 | 75.1 | 0.69 | 0.56
LFN  L∅  | 31.9 | 76.9 | 1.00 | 0.63 | 45.2 | 74.2 | 0.70 | 0.54
LFN  Lkl | 32.6 | 77.7 | 0.97 | 0.63 | 46.1 | 75.3 | 0.68 | 0.57
LFN  Lf  | 35.6 | 77.1 | 0.97 | 0.63 | 45.8 | 75.4 | 0.69 | 0.57
LFN  LW  | 35.6 | 77.7 | 0.96 | 0.67 | 46.2 | 75.4 | 0.67 | 0.57
MAGBERT  L∅  | 40.2 | 84.7 | 0.79 | 0.80 | 46.8 | 84.9 | 0.59 | 0.77
MAGBERT  Lkl | 42.0 | 85.6 | 0.76 | 0.82 | 47.1 | 85.4 | 0.59 | 0.79
MAGBERT  Lf  | 41.7 | 85.6 | 0.78 | 0.82 | 46.9 | 85.6 | 0.59 | 0.79
MAGBERT  LW  | 41.8 | 85.3 | 0.76 | 0.82 | 47.8 | 85.5 | 0.59 | 0.79
MAGXLNET L∅  | 43.0 | 86.2 | 0.76 | 0.82 | 46.7 | 84.4 | 0.59 | 0.79
MAGXLNET Lkl | 44.5 | 86.1 | 0.74 | 0.82 | 47.5 | 85.4 | 0.59 | 0.81
MAGXLNET Lf  | 43.9 | 86.6 | 0.74 | 0.82 | 47.4 | 85.0 | 0.59 | 0.81
MAGXLNET LW  | 44.4 | 86.9 | 0.74 | 0.82 | 47.9 | 85.8 | 0.59 | 0.82

General analysis. Tab. 2 shows the performance of various neural architectures trained with and without the MDM penalty. Results are coherent with the previous experiment: we observe that jointly maximising a mutual dependency measure leads to better results on the downstream task. For example, an MFN trained on CMU-MOSI with LW outperforms the model trained without the mutual dependency term by 4.6 points on Acc7. On CMU-MOSEI we also obtain consistent improvements when training with the MDM. On CMU-MOSI the TFN also strongly benefits from the mutual dependency term, with an absolute improvement of 3.7% (on Acc7) with LW compared to L∅. Tab. 2 also shows that our methods not only perform well on recurrent architectures but also on pretrained transformer-based models, which achieve higher results due to a superior capacity to model contextual dependencies (see (Rahman et al., 2020)).

Improving state-of-the-art models. MAG-BERT and MAG-XLNET are state-of-the-art models on both CMU-MOSI and CMU-MOSEI. From Tab. 2, we observe that our methods can improve the performance of both models. It is worth noting that, in both cases, LW combined with pre-trained transformers achieves good results. This performance gain suggests that our method is able to capture dependencies that are not learnt during either the pre-training of the language model (i.e. BERT or XLNET) or by the Multimodal Adaptation Gate used to perform the fusion.

Comparing dependency measures. Tab. 2 shows that no dependency measure achieves the best results in all cases. This tends to confirm that the optimisation process during training plays an important role (see the hypothesis in sec. 5.1). However, we observe that optimising the multivariate Wasserstein dependency measure is usually a good choice, since it achieves state-of-the-art results in many configurations. It is worth noting that several pieces of research point out the limitations of mutual information estimators (McAllester and Stratos, 2020; Song and Ermon, 2019).

Takeaways: the addition of the MDM penalty not only benefits simple models (e.g. EF-LSTM) but also improves performance when combined with both complex fusion mechanisms and pretrained models. For practical applications, the Wasserstein distance is a good choice of contrast function.
5.3 Improved robustness to modality drop

Although fusion with the visual and acoustic modalities provides a performance improvement (Wang et al., 2018), the performance of multimodal systems on sentiment prediction tasks is mainly carried by the linguistic modality (Zadeh et al., 2018a, 2017). It is therefore interesting to study how a multimodal system behaves when the text modality is missing, because it gives insights on the robustness of the representation.
Experiment description. In this experiment, we focus on MAG-BERT and MAG-XLNET since they are the best performing models.[5] As before, the considered models are trained using the losses described in sec. 3, and all modalities are kept during training. During inference, we either keep only one modality (audio or video) or both; the text modality is always dropped.
Results. Results of the experiments conducted on CMU-MOSI are shown in Fig. 2, which reports the ratio Acc2_corrupt/Acc2, where Acc2_corrupt is the binary accuracy in the corrupted configuration and Acc2 the accuracy obtained when all modalities are available. We observe that models trained with an MDM penalty (either Lkl, Lf or LW) resist better to missing modalities than those trained with L∅. For example, when trained with Lkl or Lf, the drop in performance is limited to approximately 25% in any setting. Interestingly, for MAG-BERT, LW and Lkl achieve comparable results; Lkl is more resistant to dropping the language modality and could thus be preferred in practical applications.
Takeaway: maximising the MDM allows an information transfer between modalities.

[Figure 2: Study of the robustness of the representations against the drop of the linguistic modality. The studied model is MAGBERT on CMU-MOSI. The ratio between the accuracy achieved with a corrupted linguistic modality (Acc2_corrupt) and the accuracy Acc2 without any corruption is reported on the y-axis. The modalities preserved during inference are reported on the x-axis. A and V respectively stand for the acoustic and visual modality.]

[5] Because of space constraints, results corresponding to MAGXLNET are reported in sec. 8.

5.4 Towards explainable representations

In this section, we propose a qualitative experiment allowing us to interpret the predictions made by the deep neural classifier. During training, T_θ estimates the mutual dependency measure using the surrogates introduced in Th. 1; however, the inference process only involves the classifier, and T_θ is unused. Eq. 6, Eq. 7 and Eq. 8 show that T_θ is trained to discriminate between valid representations (coming from the joint distribution) and corrupted representations (coming from the product of the marginals). Thus, T_θ can be used at inference time to measure the mutual dependency of the representations used by the neural model. In Tab. 3 we report examples of low and high dependency measures for MAGBERT on CMU-MOSI. We observe that high values correspond to video clips where audio, text and video are complementary (e.g. use of a head nod (McClave, 2000)), and low values correspond to cases where there exist contradictions across several modalities. Results for MAGXLNET can be found in sec. 8.3.
Takeaways: T_θ, used to estimate the MDM, provides a means to interpret the representations learnt by the encoder.

Table 3: Examples from the CMU-MOSI dataset using MAGBERT. The last column is computed using the statistical network T_θ. L stands for low values and H stands for high values. Green, grey and red highlight positive, neutral and negative expressions/behaviours respectively.

Spoken transcript | Acoustic and visual behaviour | T_θ
um the story was all right | low energy monotonous voice + headshake | L
i mean its a Nicholas Sparks book it must be good | disappointed tone + neutral facial expression | L
the action is fucking awesome | head nod + excited voice | H
it was cute you know the actors did a great job bringing the smurfs to life such as joe george lopez neil patrick harris katy perry and a fourth | multiple smiles | H

6 Conclusions

In this paper, we introduced three new losses based on MDM. Through an extensive set of experiments on CMU-MOSI and CMU-MOSEI, we have shown that SOTA architectures can benefit from these innovations with little modification. A by-product of our method is a statistical network that is a useful tool to explain the learnt high-dimensional multimodal representations. This work paves the way
for using and developing new alternative methods to improve the learning, e.g. new estimators of mutual information (Colombo et al., 2021a), Wasserstein barycenters (Colombo et al., 2021b), data depths (Staerman et al., 2021) or extreme value theory (Jalalzai et al., 2020). A future line of research involves using these methods for emotion (Colombo et al., 2019; Witon et al., 2018) and dialog act (Chapuis et al., 2021, 2020a,b) classification with pre-trained models tailored for spoken language (Dinkar et al., 2020).

7 Acknowledgments

The research carried out in this paper has received funding from IBM, the French National Research Agency's grant ANR-17-MAOI and the DSAIDIS chair at Telecom-Paris. This work was also granted access to the HPC resources of IDRIS under the allocation 2021-AP010611665 as well as under the project 2021-101838 made by GENCI.

References

Abien Fred Agarap. 2018. Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375.
Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. 2007. Multi-task feature learning. In Advances in Neural Information Processing Systems, pages 41–48.
Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443.
Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. 2018. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062.
Malik Boudiaf, Jérôme Rony, Imtiaz Masud Ziko, Eric Granger, Marco Pedersoli, Pablo Piantanida, and Ismail Ben Ayed. 2020. A unifying mutual information view of metric learning: Cross-entropy vs. pairwise losses. In European Conference on Computer Vision, pages 548–564. Springer.
Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335.
Emile Chapuis, Pierre Colombo, Matthieu Labeau, and Chloé Clavel. 2021. Code-switched inspired losses for generic spoken dialog representations. CoRR, abs/2108.12465.
Emile Chapuis, Pierre Colombo, Matteo Manica, Matthieu Labeau, and Chloé Clavel. 2020a. Hierarchical pre-training for sequence labelling in spoken dialog. CoRR, abs/2009.11152.
Emile Chapuis, Pierre Colombo, Matteo Manica, Matthieu Labeau, and Chloé Clavel. 2020b. Hierarchical pre-training for sequence labelling in spoken dialog. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2636–2648.
Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 163–171.
Pierre Colombo, Chloé Clavel, and Pablo Piantanida. 2021a. A novel estimator of mutual information for learning to disentangle textual representations. CoRR, abs/2105.02685.
Pierre Colombo, Guillaume Staerman, Chloé Clavel, and Pablo Piantanida. 2021b. Automatic text evaluation through the lens of Wasserstein barycenters. CoRR, abs/2108.12463.
Pierre Colombo, Wojciech Witon, Ashutosh Modi, James Kennedy, and Mubbasir Kapadia. 2019. Affect-driven dialog generation. In Proceedings of NAACL-HLT 2019 (Volume 1: Long and Short Papers), pages 3734–3743.
Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP: A collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 960–964. IEEE.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Hammad Dilpazir, Zia Muhammad, Qurratulain Minhas, Faheem Ahmed, Hafiz Malik, and Hasan Mahmood. 2016. Multivariate mutual information for audio video fusion. Signal, Image and Video Processing, 10(7):1265–1272.
Tanvi Dinkar, Pierre Colombo, Matthieu Labeau, and Chloé Clavel. 2020. The importance of fillers for text representations of speech transcripts. In Proceedings of EMNLP 2020, pages 7985–7993.
MD Donsker and SRS Varadhan. 1985. Large deviations for stationary Gaussian processes. Communications in Mathematical Physics, 97(1-2):187–210.
Alexandre Garcia, Pierre Colombo, Slim Essid, Florence d'Alché-Buc, and Chloé Clavel. 2019a. From the token to the review: A hierarchical multimodal approach to opinion mining. arXiv preprint arXiv:1908.11216.
Alexandre Garcia, Slim Essid, Florence d'Alché-Buc, and Chloé Clavel. 2019b. A multimodal movie review corpus for fine-grained opinion mining. arXiv preprint arXiv:1902.10102.
Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. 2020. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. arXiv preprint arXiv:2005.03545.
R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
Hamid Jalalzai, Pierre Colombo, Chloé Clavel, Éric Gaussier, Giovanna Varni, Emmanuel Vignon, and Anne Sabourin. 2020. Heavy-tailed representations, text polarity classification & data augmentation. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
Ryan G James and James P Crutchfield. 2017. Multivariate dependence beyond Shannon information. Entropy, 19(10):531.
Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost Van De Weijer, Andrew D Bagdanov, Maria Vanrell, and Antonio M Lopez. 2012. Color attributes for object detection. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3306–3313. IEEE.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Justin B Kinney and Gurinder S Atwal. 2014. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 111(9):3354–3359.
Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, and Dani Yogatama. 2019. A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350.
Ralph Linsker. 1988. Self-organization in a perceptual network. Computer, 21(3):105–117.
Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064.
Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Sijie Mai, Haifeng Hu, and Songlong Xing. 2019. Divide, conquer and combine: Hierarchical feature fusion network with local and global perspectives for multimodal affective computing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 481–492.
David McAllester and Karl Stratos. 2020. Formal limitations on the measurement of mutual information. In International Conference on Artificial Intelligence and Statistics, pages 875–884.
Evelyn Z McClave. 2000. Linguistic functions of head movements in the context of speech. Journal of Pragmatics, 32(7):855–878.
Tu Nguyen, Trung Le, Hung Vu, and Dinh Phung. 2017. Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2670–2680.
XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. 2010. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron van den Oord, Sergey Levine, and Pierre Sermanet. 2019. Wasserstein dependency measure for representation learning. In Advances in Neural Information Processing Systems, pages 15604–15614.
Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 50–57.
Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.
Gabriel Peyré, Marco Cuturi, et al. 2019. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607.
Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. 2019. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6892–6899.
Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2359–2369.
Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, and Suranga Nanayakkara. 2020. Jointly fine-tuning "BERT-like" self supervised models to improve multimodal speech emotion recognition. arXiv preprint arXiv:2008.06682.
Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, and Maja Pantic. 2017. A survey of multimodal sentiment analysis. Image and Vision Computing, 65:3–14.
Jiaming Song and Stefano Ermon. 2019. Understanding the limitations of variational mutual information estimators. arXiv preprint arXiv:1910.06222.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
Guillaume Staerman, Pavlo Mozharovskyi, and Stéphan Clémençon. 2021. Affine-invariant integrated rank-weighted depth: Definition, properties and finite sample analysis. arXiv preprint arXiv:2106.11068.
Milan Studenỳ and Jirina Vejnarová. 1998. The multiinformation function as a tool for measuring stochastic dependence. In Learning in Graphical Models, pages 261–297. Springer.
Zhongkai Sun, Prathusha Sarma, William Sethares, and Yingyu Liang. 2020. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8992–8999.
Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive multiview coding. arXiv preprint arXiv:1906.05849.
Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, page 6558.
Cédric Villani. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Words can shift: Dynamically adjusting word representations using nonverbal behaviors.
Satosi Watanabe. 1960. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82.
Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics, pages 196–202. Springer.
Wojciech Witon, Pierre Colombo, Ashutosh Modi, and Mubbasir Kapadia. 2018. Disney at IEST 2018: Predicting emotions using an ensemble. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA@EMNLP 2018), pages 248–253.
Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. 2013. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53.
Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015. Empirical evaluation of rectified activations in convolutional network.
Chang Xu, Dacheng Tao, and Chao Xu. 2013. A survey on multi-view learning. arXiv preprint arXiv:1304.5634.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.
Jun Ye, Hao Hu, Guo-Jun Qi, and Kien A Hua. 2017. A temporal order modeling approach to human action recognition from multimodal sensor data. ACM Transactions on Multimedia Computing, Communications, and Applications, 13(2):1–22.
Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123(5):3878.
Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.
Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. arXiv preprint arXiv:1802.00927.
Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018b. Multi-attention recurrent network for human communication comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, page 5642.
Amir Zadeh, Chengfeng Mao, Kelly Shi, Yiwei Zhang, Paul Pu Liang, Soujanya Poria, and Louis-Philippe Morency. 2019. Factorized multimodal transformer for multimodal sequential learning. arXiv preprint arXiv:1911.09826.
AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018c. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246.
8 Supplementary

8.1 Training details

In this section, we present a comprehensive illustration of Algorithm 1 and state the details of the experimental hyper-parameter selection as well as the architecture used for the statistic network T_θ.

8.1.1 Illustration of Algorithm 1

Fig. 3 illustrates Algorithm 1. As can be seen in the figure, to compute the mutual dependency measure the statistic network T_θ takes the embeddings of the two different batches B and B̄.

[Figure 3: Illustration of the method described in Algorithm 1 for the different estimators derived from Th. 1. B and B̄ stand for the batches of data sampled from the joint probability distribution and from the product of the marginal distributions, respectively. Z_avl denotes the fused representation of the linguistic, acoustic and visual (resp. l, a and v) modalities provided by a multimodal architecture f_θe for the batch B; Z̄_lav denotes the same quantity for the batch B̄. A_θp denotes the linear projection before classification or regression.]

8.1.2 Hyperparameter selection

We use dropout (Srivastava et al., 2014) and optimise the global loss of Eq. 5 by gradient descent using the AdamW optimiser (Loshchilov and Hutter, 2017; Kingma and Ba, 2014). The best learning rate is found in the grid {0.002, 0.001, 0.0005, 0.0001}. The best model is selected using the lowest MAE on the validation set. We set Unroll to 10.

8.1.3 Architecture of T_θ

Across the different experiments we use a statistic network with the architecture described in Tab. 4. We follow (Belghazi et al., 2018) and use LeakyReLU (Agarap, 2018; Xu et al., 2015) as the activation function.

Table 4: Statistic network description. d_in denotes the dimension of Z_avl.

Layer | Number of outputs | Activation function
Input [Z_avl, Z̄_lav] | d_in, d_in | -
Dense layer | d_in/2 | LeakyReLU
Dropout 0.4 | - | -
Dense layer | d_in | LeakyReLU
Dropout 0.4 | - | -
Dense layer | d_in | LeakyReLU
Dropout 0.4 | - | -
Dense layer | d_in/4 | LeakyReLU
Dropout 0.4 | - | -
Dense layer | d_in/4 | LeakyReLU
Dropout 0.4 | - | -
Dense layer | 1 | Sigmoid

8.2 Additional experiments for robustness to modality drop

Fig. 4 shows the results of the robustness test on MAGXLNET. Similarly to Fig. 2, we observe more robust representations to modality drop when jointly maximising LW and Lkl with the target loss. Fig. 4 shows no improvement when training with Lf; this can be linked to Tab. 2, which similarly shows no improvement in this specific configuration.

[Figure 4: Study of the robustness of the representations against a drop of the linguistic modality. The studied model is MAGXLNET on CMU-MOSI. The ratio between the accuracy achieved with a corrupted linguistic modality (Acc2_corrupt) and the accuracy Acc2 without any corruption is reported on the y-axis. The modalities preserved during inference are reported on the x-axis. A and V respectively stand for the acoustic and visual modality.]

8.3 Additional qualitative examples

Tab. 5 illustrates the use of T_θ to explain the representations learnt by the model. Similarly to Tab. 3, we observe that high values correspond to complementarity across modalities and low values are related to contradictions across modalities.

Table 5: Examples from the CMU-MOSI dataset using MAGXLNET trained with LW. The last column is computed using the statistic network T_θ. L stands for low values and H stands for high values. Green, grey and red highlight positive, neutral and negative expressions/behaviours respectively.

Spoken transcript | Acoustic and visual behaviour | T_θ
but the m the script is corny | high energy voice + headshake + (many) smiles | L
as for gi joe was it was just like laughing | high energy voice + laughs + smiles | L
its the the plot the the acting is terrible but i think this one did beat scream 2 now | headshake + long sigh | L
the xxx sequence is really well done | static head + low energy monotonous voice | L
you know of course i was waiting for the princess and the frog | smiles + high energy voice + high pitch | H
dennis quaid i think had a lot of fun | smiles + high energy voice | H
it was very very very boring | low energy voice + frowned eyebrows | H
i do not wanna see any more of this | angry voice + angry facial expression | H
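For completeness, a sketch of a statistic network following the layer description in Tab. 4 (LeakyReLU activations, dropout 0.4 and a final scalar output with a sigmoid). The builder name is hypothetical, and the final sigmoid may be dropped when using the unbounded estimators of Th. 1.

```python
import torch.nn as nn

def build_statistic_network(d_in):
    """Statistic network T_theta following Tab. 4 (sketch); d_in is the dimension of Z_avl."""
    def block(n_in, n_out):
        return [nn.Linear(n_in, n_out), nn.LeakyReLU(), nn.Dropout(0.4)]
    layers = (block(d_in, d_in // 2) + block(d_in // 2, d_in) + block(d_in, d_in)
              + block(d_in, d_in // 4) + block(d_in // 4, d_in // 4)
              + [nn.Linear(d_in // 4, 1), nn.Sigmoid()])
    return nn.Sequential(*layers)
```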