Improving Multimodal fusion via Mutual Dependency Maximisation
Pierre Colombo 1,2, Emile Chapuis 1, Matthieu Labeau 1, Chloe Clavel 1
1 LTCI, Telecom Paris, Institut Polytechnique de Paris
2 IBM GBS France
1 firstname.lastname@telecom-paris.fr, 2 pierre.colombo@ibm.com
arXiv:2109.00922v2 [cs.LG] 9 Sep 2021

Abstract

Multimodal sentiment analysis is a trending area of research, and multimodal fusion is one of its most active topics. Acknowledging that humans communicate through a variety of channels (i.e. visual, acoustic, linguistic), multimodal systems aim at integrating different unimodal representations into a synthetic one. So far, a considerable effort has been made on developing complex architectures allowing the fusion of these modalities. However, such systems are mainly trained by minimising simple losses such as L1 or cross-entropy. In this work, we investigate unexplored penalties and propose a set of new objectives that measure the dependency between modalities. We demonstrate that our new penalties lead to a consistent improvement (up to 4.3 on accuracy) across a large variety of state-of-the-art models on two well-known sentiment analysis datasets: CMU-MOSI and CMU-MOSEI. Our method not only achieves a new SOTA on both datasets but also produces representations that are more robust to modality drops. Finally, a by-product of our method is a statistical network which can be used to interpret the high-dimensional representations learnt by the model.

1 Introduction

Humans employ three different modalities to communicate in a coordinated manner: the language modality with the use of words and sentences, the vision modality with gestures, poses and facial expressions, and the acoustic modality through changes in vocal tone. Multimodal representation learning has shown great progress in a large variety of tasks including emotion recognition, sentiment analysis (Soleymani et al., 2017), speaker trait analysis (Park et al., 2014) and fine-grained opinion mining (Garcia et al., 2019a). Learning from different modalities is an efficient way to improve performance on the target tasks (Xu et al., 2013). Nevertheless, heterogeneities across modalities increase the difficulty of learning multimodal representations and raise specific challenges. Baltrušaitis et al. (2018) identify fusion as one of the five core challenges in multimodal representation learning, the four others being representation, modality alignment, translation and co-learning. Fusion aims at integrating the different unimodal representations into one common synthetic representation. Effective fusion is still an open problem: the best multimodal models in sentiment analysis (Rahman et al., 2020) improve over their unimodal counterparts, relying on the text modality only, by less than 1.5% on accuracy. Additionally, fusion should not only improve accuracy but also make representations more robust to missing modalities.

Multimodal fusion can be divided into early and late fusion techniques: early fusion takes place at the feature level (Ye et al., 2017), while late fusion takes place at the decision or scoring level (Khan et al., 2012). Current research in multimodal sentiment analysis mainly focuses on developing new fusion mechanisms relying on deep architectures (e.g. TFN (Zadeh et al., 2017), LFN (Liu et al., 2018), MARN (Zadeh et al., 2018b), MISA (Hazarika et al., 2020), MCTN (Pham et al., 2019), HFNN (Mai et al., 2019), ICCN (Sun et al., 2020)). These models are evaluated on several multimodal sentiment analysis benchmarks such as IEMOCAP (Busso et al., 2008), MOSI (Wöllmer et al., 2013), MOSEI (Zadeh et al., 2018c) and POM (Garcia et al., 2019b; Park et al., 2014). The current state of the art on these datasets uses architectures based on pre-trained transformers (Tsai et al., 2019; Siriwardhana et al., 2020) such as MultiModal BERT (MAGBERT) or MultiModal XLNET (MAGXLNET) (Rahman et al., 2020).

The aforementioned architectures are trained by minimising either an L1 loss or a cross-entropy loss between the predictions and the ground-truth labels. To the best of our knowledge, few efforts have been dedicated to exploring alternative losses. In this work, we propose a set of new objectives to perform and improve over existing fusion mechanisms. These improvements are inspired by the InfoMax principle (Linsker, 1988), i.e. choosing the representation maximising the mutual information (MI) between two possibly overlapping views of the input. The MI quantifies the dependence of two random variables; contrarily to correlation, MI also captures non-linear dependencies between the considered variables. Differently from previous work, which mainly focuses on comparing two modalities, our learning problem involves multiple modalities (e.g. text, audio, video). Our proposed method, which induces no architectural changes, relies on jointly optimising the target loss with an additional penalty term measuring the mutual dependency between the different modalities.

1.1 Our Contributions

We study new objectives to build more performant and robust multimodal representations through an enhanced fusion mechanism, and evaluate them on multimodal sentiment analysis. Our method also allows us to explain the learnt high-dimensional multimodal embeddings. The paper contributions can be summarised as follows:

A set of novel objectives using multivariate dependency measures. We introduce three new trainable surrogates to maximise the mutual dependencies between the three modalities (i.e. audio, language and video). We provide a general algorithm inspired by MINE (Belghazi et al., 2018), which was developed in a bi-variate setting for estimating the MI. Our new method enriches MINE by extending the procedure to a multivariate setting that allows us to maximise different Mutual Dependency Measures: the Total Correlation (Watanabe, 1960), the f-Total Correlation and the Multivariate Wasserstein Dependency Measure (Ozair et al., 2019).

Applications and numerical results. We apply our new set of objectives to five different architectures relying on LSTM cells (Huang et al., 2015) (e.g. EF-LSTM, LFN, MFN) or transformer layers (e.g. MAGBERT, MAGXLNET). Our proposed method (1) brings a substantial improvement on two different multimodal sentiment analysis datasets (i.e. MOSI and MOSEI, sec. 5.1), (2) makes the encoder more robust to missing modalities (i.e. when predicting without language, audio or video the observed performance drop is smaller, sec. 5.3), and (3) provides an explanation of the decision taken by the neural architecture (sec. 5.4).

2 Problem formulation & related work

In this section, we formulate the problem of learning multimodal representations (sec. 2.1) and we review both existing measures of mutual dependency (sec. 2.2) and estimation methods (sec. 2.3). In the rest of the paper, we focus on learning from three modalities (i.e. language, audio and video); however, our approach can be generalised to any arbitrary number of modalities.

2.1 Learning multimodal representations

A plethora of neural architectures have been proposed to learn multimodal representations for sentiment classification. Models often rely on a fusion mechanism (e.g. a multi-layer perceptron (Khan et al., 2012), tensor factorisation (Liu et al., 2018; Zadeh et al., 2019) or complex attention mechanisms (Zadeh et al., 2018a)) that is fed with modality-specific representations. The fusion problem boils down to learning a model $M_f : \mathcal{X}_a \times \mathcal{X}_v \times \mathcal{X}_l \rightarrow \mathbb{R}^d$. $M_f$ is fed with unimodal representations of the inputs $X_{a,v,l} = (X_a, X_v, X_l)$ obtained through three embedding networks $f_a$, $f_v$ and $f_l$. $M_f$ has to retain both modality-specific interactions (i.e. interactions that involve only one modality) and cross-view interactions (i.e. more complex interactions that span across views). Overall, the learning of $M_f$ involves both the minimisation of the downstream task loss and the maximisation of the mutual dependency between the different modalities.

2.2 Mutual dependency maximisation

Mutual information as mutual dependency measure: the core ideas we rely on to better learn cross-view interactions are not new. They consist of mutual information maximisation (Linsker, 1988) and deep representation learning. Thus, one of the most natural choices is to use the MI, which measures the dependence between two random variables, including high-order statistical dependencies (Kinney and Atwal, 2014). Given two random variables $X$ and $Y$, the MI is defined by

$$I(X;Y) \triangleq \mathbb{E}_{XY}\left[\log \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\right], \quad (1)$$

where $p_{XY}$ is the joint probability density function (pdf) of the random variables $(X, Y)$, and $p_X$, $p_Y$ represent the marginal pdfs.
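As a concrete illustration of Eq. 1 (a minimal numpy sketch, not part of the original experiments): for a bivariate Gaussian with correlation rho, the MI has the closed form -0.5 log(1 - rho^2), and a Monte-Carlo average of the log density ratio over joint samples recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 200_000

# Sample from a standard bivariate Gaussian with correlation rho.
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Monte-Carlo estimate of Eq. 1: average of log p_XY / (p_X p_Y) over joint samples.
log_joint = (-np.log(2 * np.pi * np.sqrt(1 - rho**2))
             - (x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2)))
log_marg = -np.log(2 * np.pi) - (x**2 + y**2) / 2
mi_mc = np.mean(log_joint - log_marg)

mi_closed = -0.5 * np.log(1 - rho**2)  # closed form for Gaussians
print(mi_mc, mi_closed)
```

Note that the estimate depends on rho through a non-linear transform: a correlation of 0.8 already corresponds to roughly half a nat of mutual information.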
MI can also be defined with the KL divergence:

$$I(X;Y) \triangleq \mathrm{KL}\left[p_{XY}(x,y) \,\|\, p_X(x)p_Y(y)\right]. \quad (2)$$

Extension of mutual dependency to different metrics: the KL divergence seems to be limited when used for estimating MI (McAllester and Stratos, 2020). A natural step is to replace the KL divergence in Eq. 2 with different divergences, such as the f-divergences, or with distances, such as the Wasserstein distance. Hence, we introduce new mutual dependency measures (MDM): the f-Mutual Information (Belghazi et al., 2018), denoted $I_f$, and the Wasserstein Measure (Ozair et al., 2019), denoted $I_W$. As previously, $p_{XY}$ denotes the joint pdf, and $p_X$, $p_Y$ denote the marginal pdfs. The new measures are defined as follows:

$$I_f \triangleq D_f(p_{XY}(x,y); p_X(x)p_Y(y)), \quad (3)$$

where $D_f$ denotes any f-divergence, and

$$I_W \triangleq \mathcal{W}(p_{XY}(x,y); p_X(x)p_Y(y)), \quad (4)$$

where $\mathcal{W}$ denotes the Wasserstein distance (Peyré et al., 2019).

[Figure 1: Estimation of different dependency measures (KL, f, W) for multivariate Gaussian random variables for different degrees of correlation.]

2.3 Estimating mutual dependency measures

The computation of MI and other mutual dependency measures can be difficult without knowing the marginal and joint probability distributions; thus, it is popular to maximise lower bounds to obtain better representations of different modalities, including image (Tian et al., 2019; Hjelm et al., 2018), audio (Dilpazir et al., 2016) and text (Kong et al., 2019) data. Several estimators have been proposed: MINE (Belghazi et al., 2018) uses the Donsker-Varadhan representation (Donsker and Varadhan, 1985) to derive a parametric lower bound; Nguyen et al. (2017, 2010) use a variational characterisation of the f-divergence and a multi-sample version of the density ratio (also known as noise contrastive estimation (Oord et al., 2018; Ozair et al., 2019)). These methods have mostly been developed and studied in a bi-variate setting.

Illustration of neural dependency measures in a bivariate case. In Fig. 1 we show the aforementioned dependency measures (see Eq. 2, Eq. 3, Eq. 4) estimated with MINE (Belghazi et al., 2018) for multivariate Gaussian random variables $X_a$ and $X_b$. The component-wise correlation for the considered multivariate Gaussian is defined as follows: $\mathrm{corr}(X_i, X_k) = \delta_{i,k}\,\rho$, where $\rho \in (-1, 1)$ and $\delta_{i,k}$ is Kronecker's delta. We observe that the dependency measure based on the Wasserstein distance differs from those based on the divergences and will thus lead to different gradients. Although theoretical studies have been done on the use of different metrics for dependency estimation, it remains an open question which one is best suited. In this work, we provide an experimental response in a specific case.

3 Model and training objective

In this section, we introduce our new set of losses to improve fusion. In sec. 3.1, we first extend widely used bivariate dependency measures to multivariate dependency measures (MDM) (James and Crutchfield, 2017). We then introduce variational bounds on the MDM, and in sec. 3.2, we describe our method to minimise the proposed variational bounds.

Notations. We consider $X_a, X_v, X_l$ as the multimodal data from the audio, video and language modality respectively, with joint probability distribution $p_{X_a X_v X_l}$. We denote by $p_{X_j}$ the marginal distribution of $X_j$ with $j \in \{a,v,l\}$ corresponding to the j-th modality.

General loss. As previously mentioned, we rely on the InfoMax principle (Linsker, 1988) and aim at jointly maximising the MDM between the different modalities and minimising the task loss; hence, we are in a multi-task setting (Argyriou et al., 2007; Ruder, 2017) and the objective of interest can be defined as:

$$\mathcal{L} \triangleq \underbrace{\mathcal{L}_{down.}}_{\text{main task}} - \lambda \cdot \underbrace{\mathcal{L}_{MDM}}_{\text{mutual dependency term}}. \quad (5)$$
$\mathcal{L}_{down.}$ represents a downstream-specific (target task) loss, i.e. a binary cross-entropy or an L1 loss, $\lambda$ is a meta-parameter and $\mathcal{L}_{MDM}$ is the multivariate dependency measure (see sec. 3.2). Minimisation of our newly defined objectives requires deriving lower bounds on the $\mathcal{L}_{MDM}$ terms, and then obtaining trainable surrogates.

3.1 From bivariate to multivariate dependencies

In our setting, we aim at maximising cross-view interactions involving three modalities; thus we need to generalise bivariate dependency measures to multivariate dependency measures.

Definition 3.1 (Multivariate Dependency Measures). Let $X_a, X_v, X_l$ be a set of random variables with joint pdf $p_{X_a X_v X_l}$ and respective marginal pdfs $p_{X_j}$ with $j \in \{a,v,l\}$. Then we define the multivariate mutual information $I_{kl}$, also referred to as total correlation (Watanabe, 1960) or multi-information (Studenỳ and Vejnarová, 1998):

$$I_{kl} \triangleq \mathrm{KL}\Big(p_{X_a X_v X_l}(x_a,x_v,x_l) \,\Big\|\, \prod_{j \in \{a,v,l\}} p_{X_j}(x_j)\Big).$$

Similarly, for any f-divergence we define the multivariate f-mutual information $I_f$ as:

$$I_f \triangleq D_f\Big(p_{X_a X_v X_l}(x_a,x_v,x_l); \prod_{j \in \{a,v,l\}} p_{X_j}(x_j)\Big).$$

Finally, we also extend Eq. 4 to obtain the multivariate Wasserstein dependency measure $I_W$:

$$I_W \triangleq \mathcal{W}\Big(p_{X_a X_v X_l}(x_a,x_v,x_l); \prod_{j \in \{a,v,l\}} p_{X_j}(x_j)\Big),$$

where $\mathcal{W}$ denotes the Wasserstein distance.

3.2 From theoretical bounds to trainable surrogates

To train our neural architecture we need to estimate the previously defined multivariate dependency measures. We rely on the neural estimators given in Th. 1.

Theorem 1 (Multivariate Neural Dependency Measures). Let $\{T_\theta\}_{\theta \in \Theta}$ be a family of functions $T_\theta : \mathcal{X}_a \times \mathcal{X}_v \times \mathcal{X}_l \rightarrow \mathbb{R}$ parametrised by a deep neural network with learnable parameters $\theta \in \Theta$. The multivariate mutual information measure $I_{kl}$ is defined as:

$$I_{kl} \triangleq \sup_\theta \; \mathbb{E}_{p_{X_a X_v X_l}}[T_\theta] - \log \mathbb{E}_{\prod_{j \in \{a,v,l\}} p_{X_j}}\big[e^{T_\theta}\big]. \quad (6)$$

The neural multivariate f-mutual information measure $I_f$ is defined as:

$$I_f \triangleq \sup_\theta \; \mathbb{E}_{p_{X_a X_v X_l}}[T_\theta] - \mathbb{E}_{\prod_{j \in \{a,v,l\}} p_{X_j}}\big[e^{T_\theta - 1}\big]. \quad (7)$$

The neural multivariate Wasserstein dependency measure $I_W$ is defined as:

$$I_W \triangleq \sup_{\theta: T_\theta \in \mathcal{L}} \; \mathbb{E}_{p_{X_a X_v X_l}}[T_\theta] - \log \mathbb{E}_{\prod_{j \in \{a,v,l\}} p_{X_j}}[T_\theta], \quad (8)$$

where $\mathcal{L}$ is the set of all 1-Lipschitz functions from $\mathbb{R}^d$ to $\mathbb{R}$.

Sketch of proofs: Eq. 6 is a direct application of the Donsker-Varadhan representation of the KL divergence (we assume that the integrability constraints are satisfied). Eq. 7 comes from the work of Nguyen et al. (2017). Eq. 8 comes from the Kantorovich-Rubinstein duality; we refer the reader to (Villani, 2008; Peyré et al., 2019) for a rigorous and exhaustive treatment.

Practical estimate of the variational bounds. The empirical estimators that we derive from Th. 1 can be used in a practical way: the expectations in Eq. 6, Eq. 7 and Eq. 8 are estimated using empirical samples from the joint distribution $p_{X_a X_v X_l}$. The empirical samples from $\prod_{j \in \{a,v,l\}} p_{X_j}$ are obtained by shuffling the samples from the joint distribution within a batch. We integrate this into minimising the multi-task objective (Eq. 5) by using minus the estimator. We refer to the losses obtained with the penalty based on the estimators described in Eq. 6, Eq. 7 and Eq. 8 as $\mathcal{L}_{kl}$, $\mathcal{L}_f$ and $\mathcal{L}_W$ respectively. Details on the practical minimisation of our variational bounds are provided in Algorithm 1.

Remark. In this work we choose to generalise MINE to compute multivariate dependencies. Comparing our proposed algorithm to other alternatives mentioned in sec. 2 is left for future work. This choice is driven by two main reasons: (1) our framework allows the use of various types of contrast measures (e.g. the Wasserstein distance, f-divergences); (2) the critic network $T_\theta$ can be used for interpretability purposes, as shown in sec. 5.4.

4 Experimental setting

In this section, we present our experimental settings, including the neural architectures we compare, the datasets, the metrics and our methodology, which includes the hyper-parameter selection.
Algorithm 1: Two-stage procedure to minimise multivariate dependency measures.

INPUT: $D_n = \{(x_a^j, x_v^j, x_l^j), \forall j \in [1,n]\}$ multimodal training dataset; $m$ batch size; $\sigma_a, \sigma_v, \sigma_l : [1,m] \rightarrow [1,m]$ three permutations; $\theta_c$ weights of the deep classifier; $\theta$ weights of the statistical network $T_\theta$.
Initialisation: parameters $\theta$ and $\theta_c$.
Build negative dataset:
  $\bar{D}_n = \{(x_a^{\sigma_a(j)}, x_v^{\sigma_v(j)}, x_l^{\sigma_l(j)}), \forall j \in [1,n]\}$
Optimisation:
  while $(\theta, \theta_c)$ not converged do
    for $i \in [1, Unroll]$ do
      Sample from $D_n$: $B \sim p_{X_a X_v X_l}$
      Sample from $\bar{D}_n$: $\bar{B} \sim \prod_{j \in \{a,v,l\}} p_{X_j}$
      Update $\theta$ based on the empirical version of Eq. 6, Eq. 7 or Eq. 8.
    end for
    Sample a batch $B$ from $D_n$
    Update $\theta_c$ with $B$ using Eq. 5.
  end while
OUTPUT: classifier weights $\theta_c$.

4.1 Datasets

We empirically evaluate our methods on two English datasets: CMU-MOSI and CMU-MOSEI. Both datasets have been frequently used to assess model performance in human multimodal sentiment and emotion recognition.
CMU-MOSI: Multimodal Opinion Sentiment Intensity (Wöllmer et al., 2013) is a sentiment-annotated dataset gathering 2,199 short monologue video clips.
CMU-MOSEI: CMU Multimodal Opinion Sentiment and Emotion Intensity (Zadeh et al., 2018c) is an emotion- and sentiment-annotated corpus consisting of 23,454 movie review videos taken from YouTube.
Both CMU-MOSI and CMU-MOSEI are labelled by humans with a sentiment score in [-3, 3]. For each dataset, three modalities are available; we follow prior work (Zadeh et al., 2018b, 2017; Rahman et al., 2020) and use features obtained as follows [1]:
Language: video transcripts are converted to word embeddings using either GloVe (Pennington et al., 2014), BERT or XLNET contextualised embeddings. For GloVe, the embeddings are of dimension 300, whereas for BERT and XLNET this dimension is 768.
Vision: vision features are extracted using Facet, which results in facial action units corresponding to facial muscle movements. For CMU-MOSEI, the video vectors are composed of 47 units, and for CMU-MOSI they are composed of 35.
Audio: audio features are extracted using COVAREP (Degottex et al., 2014). This results in a vector of dimension 74 which includes 12 Mel-frequency cepstral coefficients (MFCCs), as well as pitch tracking and voiced/unvoiced segmenting features, peak slope parameters, maxima dispersion quotients and glottal source parameters.
Video and audio are aligned on text following the convention introduced in (Chen et al., 2017) and the forced alignment described in (Yuan and Liberman, 2008).

4.2 Evaluation metrics

Multimodal Opinion Sentiment Intensity prediction is treated as a regression problem. Thus, we report both the Mean Absolute Error (MAE) and the correlation of model predictions with true labels. In the literature, the regression task is also turned into a binary classification task for polarity prediction. We follow standard practices (Rahman et al., 2020) and report the accuracy [2] (Acc7 denotes accuracy on 7 classes and Acc2 the binary accuracy) of our best performing models.

4.3 Neural architectures

In our experiments, we choose to modify the loss function of different models that have been introduced for multimodal sentiment analysis on both CMU-MOSI and CMU-MOSEI: the Memory Fusion Network (MFN (Zadeh et al., 2018a)), Low-rank Multimodal Fusion (LFN (Liu et al., 2018)) and two state-of-the-art transformer-based models (Rahman et al., 2020) whose fusion relies on BERT (Devlin et al., 2018) (MAGBERT) and XLNET (Yang et al., 2019) (MAGXLNET). To assess the validity of the proposed losses, we also apply our method to a simple early fusion LSTM (EF-LSTM) as a baseline model.

[1] Data from CMU-MOSI and CMU-MOSEI can be obtained from https://github.com/WasifurRahman/BERT_multimodal_transformer
[2] The regression outputs are turned into categorical values to obtain either 2 or 7 categories (see (Rahman et al., 2020; Zadeh et al., 2018a; Liu et al., 2018)).
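Algorithm 1 above can be sketched as the two-stage loop below: Unroll gradient-ascent steps on the critic's Donsker-Varadhan bound (Eq. 6), then one step on the classifier. The linear critic over pairwise products, the least-squares task loss and the toy data are all assumptions of this sketch, not the paper's architectures.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, unroll = 512, 3, 3
lr_t, lr_c = 0.02, 0.05

# Toy multimodal data: three noisy views of a shared latent z, target y = z.
z = rng.normal(size=(n, 1))
xa = z + 0.5 * rng.normal(size=(n, d))
xv = z + 0.5 * rng.normal(size=(n, d))
xl = z + 0.5 * rng.normal(size=(n, d))
y = z.ravel()

def phi(a, v, l):
    # Features of a linear critic T_theta(x) = phi(x) @ w (pairwise interactions).
    return np.concatenate([a * v, v * l, a * l], axis=1)

def dv_bound(w, p_joint, p_neg):
    # Empirical Donsker-Varadhan bound of Eq. 6 (log-mean-exp stabilised).
    tj, tn = p_joint @ w, p_neg @ w
    m = tn.max()
    return tj.mean() - (m + np.log(np.mean(np.exp(tn - m))))

# Negative dataset built once: independent permutations per modality.
p_joint = phi(xa, xv, xl)
p_neg = phi(xa[rng.permutation(n)], xv[rng.permutation(n)], xl[rng.permutation(n)])

X = np.concatenate([xa, xv, xl], axis=1)   # early-fusion classifier input
w = np.zeros(3 * d)                        # critic weights (theta)
theta_c = np.zeros(3 * d)                  # classifier weights (theta_c)
bound_start = dv_bound(w, p_joint, p_neg)
mse_start = np.mean((X @ theta_c - y) ** 2)

for epoch in range(60):
    for _ in range(unroll):                # stage 1: tighten the DV bound
        tn = p_neg @ w
        sm = np.exp(tn - tn.max())
        sm /= sm.sum()
        grad_w = p_joint.mean(axis=0) - sm @ p_neg  # gradient of the bound
        w += lr_t * grad_w                 # ascent on the critic
    # Stage 2: one classifier step on the task loss. In the real method the
    # MDM penalty of Eq. 5 also reaches theta_c through the shared encoder.
    theta_c -= lr_c * (2.0 / n) * (X.T @ (X @ theta_c - y))

bound_end = dv_bound(w, p_joint, p_neg)
mse_end = np.mean((X @ theta_c - y) ** 2)
print(bound_start, bound_end, mse_start, mse_end)
```

In this toy run the dependency estimate rises from zero while the task loss falls, mirroring the alternation of critic and classifier updates in Algorithm 1.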
Model overview: the aforementioned models can be seen as a multimodal encoder $f_{\theta_e}$ providing a representation $Z_{avl}$ containing information and dependencies between the modalities $X_l, X_a, X_v$, namely:

$$f_{\theta_e}(X_a, X_v, X_l) = Z_{avl}.$$

As a final step, a linear transformation $A_{\theta_p}$ is applied to $Z_{avl}$ to perform the regression.
EF-LSTM is the most basic architecture used in current multimodal analysis, where each sequence view is encoded separately with LSTM channels. Then, a fusion function is applied to all representations.
TFN computes a representation of each view, and then applies a fusion operator. Acoustic and visual views are first mean-pooled, then encoded through a 2-layer perceptron. Linguistic features are computed with an LSTM channel. Here, the fusion function is a cross-modal product capturing unimodal, bimodal and trimodal interactions across modalities.
MFN enriches the previous EF-LSTM architecture with an attention module that computes a cross-view representation at each time step. These representations are then gathered and a final representation is computed by a gated multi-view memory (Zadeh et al., 2018a).
MAGBERT and MAGXLNET are based on pre-trained transformer architectures (Devlin et al., 2018; Yang et al., 2019) allowing the inputs to each transformer unit to be multimodal, thanks to a special gate inspired by Wang et al. (2018). $Z_{avl}$ is the [CLS] representation provided by the last transformer head.
For each architecture, we use the optimal architecture hyperparameters provided by the associated papers (see sec. 8).

5 Numerical results

We present and discuss here the results obtained using the experimental setting described in sec. 4. To better understand the impact of our new methods, we propose to investigate the following points:
Efficiency of the $\mathcal{L}_{MDM}$: to gain understanding of the usefulness of our new objectives, we study the impact of adding the mutual dependency term to the basic multimodal neural model EF-LSTM.
Improving model performance and comparing multivariate dependency measures: the choice of the most suitable dependency measure for a given task is still an open problem (see sec. 3). Thus, we compare the performance, on both multimodal sentiment and emotion prediction tasks, of the different dependency measures. The compared measures are combined with different models using various fusion mechanisms.
Improving the robustness to modality drop: a desirable quality of multimodal representations is robustness to a missing modality. We study how the maximisation of mutual dependency measures during training affects the robustness of the representation when a modality becomes missing.
Towards explainable representations: the statistical network $T_\theta$ allows us to compute a dependency measure between the three considered modalities. We carry out a qualitative analysis in order to investigate whether a high dependency can be explained by complementariness across modalities.

5.1 Efficiency of the MDM penalty

For a simple EF-LSTM, we study the improvement induced by the addition of our MDM penalty. The results are presented in Tab. 1, where an EF-LSTM trained with no mutual dependency term is denoted by $\mathcal{L}_\emptyset$. On both studied datasets, we observe that the addition of an MDM penalty leads to stronger performance on all metrics. For both datasets, the best performing models are obtained by training with an additional mutual dependency measure term. Keeping in mind the example shown in Fig. 1, we can draw a first comparison between the different dependency measures. Although in a simple case $\mathcal{L}_f$ and $\mathcal{L}_{kl}$ estimate a similar quantity (see Fig. 1), in more complex practical applications they do not achieve the same performance. Even though the Donsker-Varadhan bound used for $\mathcal{L}_{kl}$ is stronger [3] than the one used to estimate $\mathcal{L}_f$, for a simple model the stronger bound does not lead to better results. It is possible that most of the differences in performance observed come from the optimisation process during training [4].
Takeaways: in the simple case of EF-LSTM, adding the MDM penalty improves performance on the downstream tasks.

5.2 Improving models and comparing multivariate dependency measures

In this experiment, we apply the different penalties to more advanced architectures, using various fusion mechanisms.

[3] For a fixed $T_\theta$, the right-hand term in Eq. 6 is greater than that in Eq. 7.
[4] Similar conclusions have been drawn in the field of metric learning when comparing different estimates of the mutual information (Boudiaf et al., 2020).
Table 1: Results on sentiment analysis on both CMU-MOSI and CMU-MOSEI for an EF-LSTM. Acc7 denotes accuracy on 7 classes and Acc2 the binary accuracy. MAE denotes the Mean Absolute Error and Corr is the Pearson correlation. (h) means higher is better and (l) means lower is better. The choice of the evaluation metrics follows standard practices (Rahman et al., 2020). Underlined results demonstrate significant improvement (p-value below 0.05) against the baseline when performing the Wilcoxon Mann Whitney test (Wilcoxon, 1992) on 10 runs using different seeds.

              Acc7 (h)  Acc2 (h)  MAE (l)  Corr (h)
  CMU-MOSI
    L∅        31.1      76.1      1.00     0.65
    Lkl       31.7      76.4      1.00     0.66
    Lf        33.7      76.2      1.02     0.66
    LW        33.5      76.4      0.98     0.66
  CMU-MOSEI
    L∅        44.2      75.0      0.72     0.52
    Lkl       44.5      75.6      0.70     0.53
    Lf        45.5      75.2      0.70     0.52
    LW        45.3      75.9      0.68     0.54

Table 2: Results on sentiment and emotion prediction on both the CMU-MOSI and CMU-MOSEI datasets for the different neural architectures presented in sec. 4, relying on various fusion mechanisms.

                     CMU-MOSI                         CMU-MOSEI
            Acc7   Acc2   MAE    Corr       Acc7   Acc2   MAE    Corr
  MFN
    L∅      31.3   76.6   1.01   0.62       44.4   74.7   0.72   0.53
    Lkl     32.5   76.7   0.96   0.65       44.2   74.7   0.72   0.57
    Lf      35.7   77.4   0.96   0.65       46.1   75.4   0.69   0.56
    LW      35.9   77.6   0.96   0.65       46.2   75.1   0.69   0.56
  LFN
    L∅      31.9   76.9   1.00   0.63       45.2   74.2   0.70   0.54
    Lkl     32.6   77.7   0.97   0.63       46.1   75.3   0.68   0.57
    Lf      35.6   77.1   0.97   0.63       45.8   75.4   0.69   0.57
    LW      35.6   77.7   0.96   0.67       46.2   75.4   0.67   0.57
  MAGBERT
    L∅      40.2   84.7   0.79   0.80       46.8   84.9   0.59   0.77
    Lkl     42.0   85.6   0.76   0.82       47.1   85.4   0.59   0.79
    Lf      41.7   85.6   0.78   0.82       46.9   85.6   0.59   0.79
    LW      41.8   85.3   0.76   0.82       47.8   85.5   0.59   0.79
  MAGXLNET
    L∅      43.0   86.2   0.76   0.82       46.7   84.4   0.59   0.79
    Lkl     44.5   86.1   0.74   0.82       47.5   85.4   0.59   0.81
    Lf      43.9   86.6   0.74   0.82       47.4   85.0   0.59   0.81
    LW      44.4   86.9   0.74   0.82       47.9   85.8   0.59   0.82

General analysis. Tab. 2 shows the performance of various neural architectures trained with and without the MDM penalty. Results are coherent with the previous experiment: we observe that jointly maximising a mutual dependency measure leads to better results on the downstream task. For example, an MFN on CMU-MOSI trained with $\mathcal{L}_W$ outperforms the model trained without the mutual dependency term by 4.6 points on Acc7. On CMU-MOSEI we also obtain subsequent improvements while training with the MDM penalty. On CMU-MOSI, the TFN also strongly benefits from the mutual dependency term, with an absolute improvement of 3.7% (on Acc7) with $\mathcal{L}_W$ compared to $\mathcal{L}_\emptyset$. Tab. 2 shows that our methods perform well not only on recurrent architectures but also on pretrained transformer-based models, which achieve higher results due to a superior capacity to model contextual dependencies (see (Rahman et al., 2020)).
Improving state-of-the-art models. MAGBERT and MAGXLNET are state-of-the-art models on both CMU-MOSI and CMU-MOSEI. From Tab. 2, we observe that our methods can improve the performance of both models. It is worth noting that, in both cases, $\mathcal{L}_W$ combined with pre-trained transformers achieves good results. This performance gain suggests that our method is able to capture dependencies that are not learnt during either the pre-training of the language model (i.e. BERT or XLNET) or by the Multimodal Adaptation Gate used to perform the fusion.
Comparing dependency measures. Tab. 2 shows that no dependency measure achieves the best results in all cases. This result tends to confirm that the optimisation process during training plays an important role (see the hypothesis in sec. 5.1). However, we can observe that optimising the multivariate Wasserstein dependency measure is usually a good choice, since it achieves state-of-the-art results in many configurations. It is worth noting that several pieces of research point out the limitations of mutual information estimators (McAllester and Stratos, 2020; Song and Ermon, 2019).
Takeaways: the addition of the MDM penalty not only benefits simple models (e.g. EF-LSTM) but also improves performance when combined with both complex fusion mechanisms and pretrained models. For practical applications, the Wasserstein distance is a good choice of contrast function.

5.3 Improved robustness to modality drop

Although fusion with visual and acoustic modalities provides a performance improvement (Wang et al., 2018), the performance of multimodal systems on sentiment prediction tasks is mainly carried by the linguistic modality (Zadeh et al., 2018a, 2017). Thus, it is interesting to study how a multimodal system behaves when the text modality is missing, because it gives insights on the robustness of the representation.
um the story was all right low energy monotonous voice + headshake L
i mean its a Nicholas Sparks book it must be good disappointed tone + neutral facial expression L
the action is fucking awesome head nod + excited voice H
it was cute you know the actors did a great job bringing the smurfs to
multiple smiles H
life such as joe george lopez neil patrick harris katy perry and a fourth
Table 3: Examples from the CMU-MOSI dataset using MAGBERT. The last column is computed using the statistical
network Tθ . L stands for low values and H stands for high values. Green, grey, red highlight positive, neutral and
negative expression/behaviours respectively
focus on MAGBERT and MAGXLNET since they are the best performing models.5 As before, the considered models are trained using the losses described in Sec. 3, and all modalities are kept at training time. During inference, we either keep only one modality (audio or video) or both; the text modality is always dropped.
Results. Results of the experiments conducted on CMU-MOSI are shown in Fig. 2, giving values for the ratio Acc2corrupt/Acc2, where Acc2corrupt is the binary accuracy in the corrupted configuration and Acc2 is the accuracy obtained when all modalities are available. We observe that models trained with an MDM penalty (either Lkl, Lf or LW) resist missing modalities better than those trained with L∅. For example, when trained with Lkl or Lf, the drop in performance is limited to ≈ 25% in any setting. Interestingly, for MAGBERT, LW and Lkl achieve comparable results; Lkl is more resistant to dropping the language modality and thus could be preferred in practical applications.
Takeaway: Maximising the MDM allows an information transfer between modalities.

[Figure 2: bar chart of the ratio Acc2corrupt/Acc2 (y-axis, from 0.0 to 0.6) for each preserved modality A, V and A+V (x-axis), with one bar per loss (including Lf and Lkl).]

Figure 2: Study of the robustness of the representations against a drop of the linguistic modality. The studied model is MAGBERT on CMU-MOSI. The ratio between the accuracy Acc2corrupt achieved with a corrupted linguistic modality and the accuracy Acc2 without any corruption is reported on the y-axis. The modalities preserved during inference are reported on the x-axis; A and V respectively stand for the acoustic and visual modality.

5 Because of space constraints, results corresponding to MAGXLNET are reported in Sec. 8.

5.4 Towards explainable representations
In this section, we propose a qualitative experiment that allows us to interpret the predictions made by the deep neural classifier. During training, Tθ estimates the mutual dependency measure using the surrogates introduced in Th. 1. The inference process, however, only involves the classifier, and Tθ is unused. Eq. 6, Eq. 7 and Eq. 8 show that Tθ is trained to discriminate between valid representations (coming from the joint distribution) and corrupted representations (coming from the product of the marginals). Thus, Tθ can be used at inference time to measure the mutual dependency of the representations used by the neural model. In Tab. 3 we report examples of low and high discrepancy measures for MAGBERT on CMU-MOSI. We observe that high values correspond to video clips where audio, text and video are complementary (e.g. the use of a head nod (McClave, 2000)), and low values correspond to cases where there are contradictions across several modalities. Results on MAGXLNET can be found in Sec. 8.3.
Takeaway: Tθ, used to estimate the MDM, provides a means to interpret the representations learnt by the encoder.

6 Conclusions
In this paper, we introduced three new losses based on MDM. Through an extensive set of experiments on CMU-MOSI and CMU-MOSEI, we have shown that SOTA architectures can benefit from these innovations with little modification. A by-product of our method is a statistic network that is a useful tool to explain the learnt high-dimensional multimodal representations. This work paves the way for using and developing new alternative methods to improve the learning (e.g. new estimators of mutual information (Colombo et al., 2021a), Wasserstein barycenters (Colombo et al., 2021b), data depths (Staerman et al., 2021), extreme value theory (Jalalzai et al., 2020)). A future line of research involves using these methods for emotion (Colombo et al., 2019; Witon et al., 2018) and dialog act (Chapuis et al., 2021, 2020a,b) classification with pre-trained models tailored for spoken language (Dinkar et al., 2020).

7 Acknowledgments
The research carried out in this paper has received funding from IBM, the French National Research Agency's grant ANR-17-MAOI and the DSAIDIS chair at Telecom-Paris. This work was also granted access to the HPC resources of IDRIS under the allocation 2021-AP010611665, as well as under the project 2021-101838 made by GENCI.
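The corruption protocol underlying Fig. 2 can be sketched as follows. This is a minimal sketch under assumptions: modality drop is implemented here as zeroing the corresponding embeddings (the exact corruption mechanism is an assumption, not specified in the text), and `predict` is a hypothetical stand-in for the trained multimodal classifier.

```python
import numpy as np

def binary_accuracy(preds, labels):
    """Fraction of correct binary predictions (Acc2)."""
    return float((preds == labels).mean())

def robustness_ratio(predict, z_l, z_a, z_v, labels, keep=("a", "v")):
    """Compute Acc2corrupt / Acc2: accuracy with the linguistic
    modality dropped (zeroed, an assumed corruption) over the accuracy
    with all modalities present. `predict` maps (z_l, z_a, z_v) to
    binary predictions and stands in for the trained classifier."""
    acc_full = binary_accuracy(predict(z_l, z_a, z_v), labels)
    zero = np.zeros_like
    corrupted_preds = predict(
        zero(z_l),                          # text is always dropped
        z_a if "a" in keep else zero(z_a),  # keep or drop acoustic
        z_v if "v" in keep else zero(z_v),  # keep or drop visual
    )
    return binary_accuracy(corrupted_preds, labels) / acc_full
```

A ratio close to 1 means the fused representation barely relies on the dropped text modality, which is the behaviour observed for models trained with the MDM penalties.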
References

Abien Fred Agarap. 2018. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375.

Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. 2007. Multi-task feature learning. In Advances in Neural Information Processing Systems, pages 41–48.

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. 2018. Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062.

Malik Boudiaf, Jérôme Rony, Imtiaz Masud Ziko, Eric Granger, Marco Pedersoli, Pablo Piantanida, and Ismail Ben Ayed. 2020. A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses. In European Conference on Computer Vision, pages 548–564. Springer.

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. Iemocap: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335.

Emile Chapuis, Pierre Colombo, Matthieu Labeau, and Chloé Clavel. 2021. Code-switched inspired losses for generic spoken dialog representations. CoRR, abs/2108.12465.

Emile Chapuis, Pierre Colombo, Matteo Manica, Matthieu Labeau, and Chloé Clavel. 2020a. Hierarchical pre-training for sequence labelling in spoken dialog. CoRR, abs/2009.11152.

Emile Chapuis, Pierre Colombo, Matteo Manica, Matthieu Labeau, and Chloé Clavel. 2020b. Hierarchical pre-training for sequence labelling in spoken dialog. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 2636–2648. Association for Computational Linguistics.

Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 163–171.

Pierre Colombo, Chloé Clavel, and Pablo Piantanida. 2021a. A novel estimator of mutual information for learning to disentangle textual representations. CoRR, abs/2105.02685.

Pierre Colombo, Guillaume Staerman, Chloé Clavel, and Pablo Piantanida. 2021b. Automatic text evaluation through the lens of wasserstein barycenters. CoRR, abs/2108.12463.

Pierre Colombo, Wojciech Witon, Ashutosh Modi, James Kennedy, and Mubbasir Kapadia. 2019. Affect-driven dialog generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 3734–3743. Association for Computational Linguistics.

Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. Covarep—a collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 960–964. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Hammad Dilpazir, Zia Muhammad, Qurratulain Minhas, Faheem Ahmed, Hafiz Malik, and Hasan Mahmood. 2016. Multivariate mutual information for audio video fusion. Signal, Image and Video Processing, 10(7):1265–1272.

Tanvi Dinkar, Pierre Colombo, Matthieu Labeau, and Chloé Clavel. 2020. The importance of fillers for text representations of speech transcripts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 7985–7993. Association for Computational Linguistics.

MD Donsker and SRS Varadhan. 1985. Large deviations for stationary gaussian processes. Communications in Mathematical Physics, 97(1-2):187–210.

Alexandre Garcia, Pierre Colombo, Slim Essid, Florence d'Alché Buc, and Chloé Clavel. 2019a. From the token to the review: A hierarchical multimodal approach to opinion mining. arXiv preprint arXiv:1908.11216.

Alexandre Garcia, Slim Essid, Florence d'Alché Buc, and Chloé Clavel. 2019b. A multimodal movie review corpus for fine-grained opinion mining. arXiv preprint arXiv:1902.10102.

Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. 2020. Misa: Modality-invariant and -specific representations for multimodal sentiment analysis. arXiv preprint arXiv:2005.03545.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.

Hamid Jalalzai, Pierre Colombo, Chloé Clavel, Éric Gaussier, Giovanna Varni, Emmanuel Vignon, and Anne Sabourin. 2020. Heavy-tailed representations, text polarity classification & data augmentation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Ryan G James and James P Crutchfield. 2017. Multivariate dependence beyond shannon information. Entropy, 19(10):531.

Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost Van De Weijer, Andrew D Bagdanov, Maria Vanrell, and Antonio M Lopez. 2012. Color attributes for object detection. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3306–3313. IEEE.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Justin B Kinney and Gurinder S Atwal. 2014. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 111(9):3354–3359.

Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, and Dani Yogatama. 2019. A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350.

Ralph Linsker. 1988. Self-organization in a perceptual network. Computer, 21(3):105–117.

Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064.

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Sijie Mai, Haifeng Hu, and Songlong Xing. 2019. Divide, conquer and combine: Hierarchical feature fusion network with local and global perspectives for multimodal affective computing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 481–492.

David McAllester and Karl Stratos. 2020. Formal limitations on the measurement of mutual information. In International Conference on Artificial Intelligence and Statistics, pages 875–884.

Evelyn Z McClave. 2000. Linguistic functions of head movements in the context of speech. Journal of Pragmatics, 32(7):855–878.

Tu Nguyen, Trung Le, Hung Vu, and Dinh Phung. 2017. Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2670–2680.

XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. 2010. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron Van den Oord, Sergey Levine, and Pierre Sermanet. 2019. Wasserstein dependency measure for representation learning. In Advances in Neural Information Processing Systems, pages 15604–15614.

Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 50–57.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Gabriel Peyré, Marco Cuturi, et al. 2019. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607.

Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. 2019. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6892–6899.

Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2359–2369.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, and Suranga Nanayakkara. 2020. Jointly fine-tuning "bert-like" self supervised models to improve multimodal speech emotion recognition. arXiv preprint arXiv:2008.06682.

Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, and Maja Pantic. 2017. A survey of multimodal sentiment analysis. Image and Vision Computing, 65:3–14.

Jiaming Song and Stefano Ermon. 2019. Understanding the limitations of variational mutual information estimators. arXiv preprint arXiv:1910.06222.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Guillaume Staerman, Pavlo Mozharovskyi, and Stéphan Clémençon. 2021. Affine-invariant integrated rank-weighted depth: Definition, properties and finite sample analysis. arXiv preprint arXiv:2106.11068.

Milan Studenỳ and Jirina Vejnarová. 1998. The multiinformation function as a tool for measuring stochastic dependence. In Learning in Graphical Models, pages 261–297. Springer.

Zhongkai Sun, Prathusha Sarma, William Sethares, and Yingyu Liang. 2020. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8992–8999.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive multiview coding. arXiv preprint arXiv:1906.05849.

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2019, page 6558. NIH Public Access.

Cédric Villani. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.

Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Words can shift: Dynamically adjusting word representations using nonverbal behaviors.

Satosi Watanabe. 1960. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82.

Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics, pages 196–202. Springer.

Wojciech Witon, Pierre Colombo, Ashutosh Modi, and Mubbasir Kapadia. 2018. Disney at IEST 2018: Predicting emotions using an ensemble. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA@EMNLP 2018, Brussels, Belgium, October 31, 2018, pages 248–253. Association for Computational Linguistics.

Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. 2013. Youtube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53.

Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015. Empirical evaluation of rectified activations in convolutional network.

Chang Xu, Dacheng Tao, and Chao Xu. 2013. A survey on multi-view learning. arXiv preprint arXiv:1304.5634.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.

Jun Ye, Hao Hu, Guo-Jun Qi, and Kien A Hua. 2017. A temporal order modeling approach to human action recognition from multimodal sensor data. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 13(2):1–22.

Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the scotus corpus. Journal of the Acoustical Society of America, 123(5):3878.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.

Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. arXiv preprint arXiv:1802.00927.

Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018b. Multi-attention recurrent network for human communication comprehension. In Proceedings of the ... AAAI Conference on Artificial Intelligence, volume 2018, page 5642. NIH Public Access.

Amir Zadeh, Chengfeng Mao, Kelly Shi, Yiwei Zhang, Paul Pu Liang, Soujanya Poria, and Louis-Philippe Morency. 2019. Factorized multimodal transformer for multimodal sequential learning. arXiv preprint arXiv:1911.09826.

AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018c. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246.
8 Supplementary

8.1 Training details
In this section, we present a comprehensive illustration of Algorithm 1 and state the details of the experimental hyperparameter selection, as well as the architecture used for the statistic network Tθ.

8.1.1 Illustration of Algorithm 1
Fig. 3 describes Algorithm 1. As can be seen in the figure, to compute the mutual dependency measure, the statistic network Tθ takes the embeddings of the two different batches B and B̄.

[Figure 3: diagram of the estimation pipeline, from the two batches through the multimodal encoder fθe to the statistic network Tθ and the projection Aθp.]

Figure 3: Illustration of the method described in Algorithm 1 for the different estimators derived from Th. 1. B and B̄ stand for the batches of data sampled from the joint probability distribution and the product of the marginal distributions, respectively. Zavl denotes the fused representation of the linguistic, acoustic and visual (resp. l, a and v) modalities provided by a multimodal architecture fθe for the batch B; Z̄lav denotes the same quantity for the batch B̄. Aθp denotes the linear projection before classification or regression.

8.1.2 Hyperparameter selection
We use dropout (Srivastava et al., 2014) and optimise the global loss Eq. 5 by gradient descent using the AdamW optimiser (Loshchilov and Hutter, 2017; Kingma and Ba, 2014). The best learning rate is found in the grid {0.002, 0.001, 0.0005, 0.0001}. The best model is selected using the lowest MAE on the validation set. We set Unroll to 10.

8.1.3 Architecture of Tθ
Across the different experiments, we use a statistic network with the architecture described in Tab. 4. We follow Belghazi et al. (2018) and use LeakyReLU (Agarap, 2018; Xu et al., 2015) as the activation function.

Layer | Number of outputs | Activation function
[Zlav, Z̄lav] | din, din | -
Dense layer | din/2 | LeakyReLU
Dropout | 0.4 | -
Dense layer | din | LeakyReLU
Dropout | 0.4 | -
Dense layer | din | LeakyReLU
Dropout | 0.4 | -
Dense layer | din/4 | LeakyReLU
Dropout | 0.4 | -
Dense layer | din/4 | LeakyReLU
Dropout | 0.4 | -
Dense layer | 1 | Sigmoid

Table 4: Statistic network description. din denotes the dimension of Zavl.

8.2 Additional experiments for robustness to modality drop
Fig. 4 shows the results of the robustness test on MAGXLNET. Similarly to Fig. 2, we observe that the representations are more robust to modality drop when LW or Lkl is jointly maximised with the target loss. Fig. 4 shows no improvement when training with Lf; this can be linked to Tab. 2, which similarly shows no improvement in this specific configuration.

[Figure 4: bar chart of the ratio Acc2corrupt/Acc2 (y-axis, from 0.0 to 0.6) for each preserved modality A, V and A+V (x-axis), with one bar per loss (including Lf and Lkl).]

Figure 4: Study of the robustness of the representations against a drop of the linguistic modality. The studied model is MAGXLNET on CMU-MOSI. The ratio between the accuracy Acc2corrupt achieved with a corrupted linguistic modality and the accuracy Acc2 without any corruption is reported on the y-axis. The modalities preserved during inference are reported on the x-axis; A and V respectively stand for the acoustic and visual modality.

8.3 Additional qualitative examples
Tab. 5 illustrates the use of Tθ to explain the representations learnt by the model. Similarly to Tab. 3, we observe that high values correspond to complementarity across modalities and low values are related to contradictions across modalities.
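As a concrete illustration, a forward pass through the network of Tab. 4 can be sketched in plain NumPy. This is an illustrative sketch, not the paper's code: the weight initialisation is arbitrary, and dropout (a training-time regulariser) is omitted from this inference-only pass.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # LeakyReLU activation used between the dense layers
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_statistic_network(d_in, seed=0):
    """Random weights for the MLP of Tab. 4: hidden sizes d_in/2,
    d_in, d_in, d_in/4, d_in/4, then a single sigmoid output."""
    sizes = [d_in, d_in // 2, d_in, d_in, d_in // 4, d_in // 4, 1]
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def statistic_network(z, params):
    """Forward pass: LeakyReLU on hidden layers, sigmoid output in (0, 1)."""
    h = z
    for w, b in params[:-1]:
        h = leaky_relu(h @ w + b)
    w, b = params[-1]
    return sigmoid(h @ w + b)
```

The sigmoid output plays the role of the discriminator score separating samples of the joint distribution from samples of the product of the marginals.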
Spoken transcripts | Acoustic and visual behaviour | Tθ
but the m the script is corny | high energy voice + headshake + (many) smiles | L
as for gi joe was it was just like laughing its the the plot the the acting is terrible | high energy voice + laughs + smiles | L
but i think this one did beat scream 2 now | headshake + long sigh | L
the xxx sequence is really well done | static head + low energy monotonous voice | L
you know of course i was waiting for the princess and the frog | smiles + high energy voice + high pitch | H
dennis quaid i think had a lot of fun | smiles + high energy voice | H
it was very very very boring | low energy voice + frowning eyebrows | H
i do not wanna see any more of this | angry voice + angry facial expression | H

Table 5: Examples from the CMU-MOSI dataset using MAGXLNET trained with LW. The last column is computed using the statistic network Tθ. L stands for low values and H stands for high values. Green, grey and red highlight positive, neutral and negative expressions/behaviours, respectively.