Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks
Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks

1 Yi Tay, 2 Aston Zhang, 3 Luu Anh Tuan, 4 Jinfeng Rao*, 5 Shuai Zhang,
6 Shuohang Wang, 7 Jie Fu, 8 Siu Cheung Hui
1,8 Nanyang Technological University, 2 Amazon AI, 3 MIT CSAIL,
4 Facebook AI, 5 UNSW, 6 Singapore Management University, 7 Mila and Polytechnique Montréal
ytay017@e.ntu.edu.sg
* Work done while at University of Maryland.

Abstract

Many state-of-the-art neural models for NLP are heavily parameterized and thus memory inefficient. This paper proposes a series of lightweight and memory efficient neural architectures for a potpourri of natural language processing (NLP) tasks. To this end, our models exploit computation using Quaternion algebra and hypercomplex spaces, enabling not only expressive inter-component interactions but also significantly (75%) reduced parameter size due to lesser degrees of freedom in the Hamilton product. We propose Quaternion variants of models, giving rise to new architectures such as the Quaternion attention model and the Quaternion Transformer. Extensive experiments on a battery of NLP tasks demonstrate the utility of the proposed Quaternion-inspired models, enabling up to 75% reduction in parameter size without significant loss in performance.

1 Introduction

Neural network architectures such as Transformers (Vaswani et al., 2017; Dehghani et al., 2018) and attention networks (Parikh et al., 2016; Seo et al., 2016; Bahdanau et al., 2014) are dominant solutions in natural language processing (NLP) research today. Many of these architectures are primarily concerned with learning useful feature representations from data, for which providing a strong architectural inductive bias is known to be extremely helpful for obtaining stellar results. Unfortunately, many of these models are known to be heavily parameterized, with state-of-the-art models easily containing millions or billions of parameters (Vaswani et al., 2017; Radford et al., 2018; Devlin et al., 2018; Radford et al., 2019). This renders practical deployment challenging. As such, enabling efficient and lightweight adaptations of these models, without significantly degrading performance, would certainly have a positive impact on many real world applications.

To this end, this paper explores a new way to improve/maintain the performance of these neural architectures while substantially reducing the parameter cost (compression of up to 75%). In order to achieve this, we move beyond real space, exploring computation in Quaternion space (i.e., hypercomplex numbers) as an inductive bias. Hypercomplex numbers comprise a real component and three imaginary components (e.g., i, j, k) in which inter-dependencies between these components are encoded naturally during training via the Hamilton product ⊗. Hamilton products have fewer degrees of freedom, enabling up to four times compression of model size. Technical details are deferred to subsequent sections.

While Quaternion connectionist architectures have been considered in various deep learning application areas such as speech recognition (Parcollet et al., 2018b), kinematics/human motion (Pavllo et al., 2018) and computer vision (Gaudet and Maida, 2017), our work is the first hypercomplex inductive bias designed for a wide spread of NLP tasks. Other fields have motivated the usage of Quaternions primarily due to their naturally 3- or 4-dimensional input features (e.g., RGB scenes or 3D human poses) (Parcollet et al., 2018b; Pavllo et al., 2018). In a similar vein, we motivate this by considering the multi-sense nature of natural language (Li and Jurafsky, 2015; Neelakantan et al., 2015; Huang et al., 2012). In this case, having multiple embeddings or components per token is well-aligned with this motivation.
Latent interactions between components may also enjoy additional benefits, especially pertaining to applications which require learning pairwise affinity scores (Parikh et al., 2016; Seo et al., 2016).
Intuitively, instead of regular (real) dot products, Hamilton products ⊗ extensively learn representations by matching across multiple (inter-latent) components in hypercomplex space. Alternatively, the effectiveness of multi-view and multi-headed (Vaswani et al., 2017) approaches may also explain the suitability of Quaternion spaces in NLP models. The added advantage over multi-headed approaches is that Quaternion space explicitly encodes latent interactions between these components or heads via the Hamilton product, which intuitively increases the expressiveness of the model. Conversely, multi-headed embeddings are generally independently produced.

To this end, we propose two Quaternion-inspired neural architectures, namely, the Quaternion attention model and the Quaternion Transformer. In this paper, we devise and formulate a new attention (and self-attention) mechanism in Quaternion space using Hamilton products. Transformation layers are aptly replaced with Quaternion feed-forward networks, yielding substantial improvements in parameter size (of up to 75% compression) while achieving comparable (and occasionally better) performance.

Contributions All in all, we make the following major contributions:

• We propose Quaternion neural models for NLP. More concretely, we propose a novel Quaternion attention model and Quaternion Transformer for a wide range of NLP tasks. To the best of our knowledge, this is the first formulation of hypercomplex attention and Quaternion models for NLP.

• We evaluate our Quaternion NLP models on a wide range of diverse NLP tasks such as pairwise text classification (natural language inference, question answering, paraphrase identification, dialogue prediction), neural machine translation (NMT), sentiment analysis, mathematical language understanding (MLU), and subject-verb agreement (SVA).

• Our experimental results show that Quaternion models achieve comparable or better performance than their real-valued counterparts with up to a 75% reduction in parameter costs. The key advantage is that these models are expressive (due to Hamilton products) and also parameter efficient. Moreover, our Quaternion components are self-contained and play well with real-valued counterparts.

2 Background on Quaternion Algebra

This section introduces the necessary background for this paper. We introduce Quaternion algebra along with Hamilton products, which form the crux of our proposed approaches.

Quaternion A Quaternion Q ∈ H is a hypercomplex number with three imaginary components as follows:

    Q = r + xi + yj + zk,    (1)

where ijk = i² = j² = k² = −1 and noncommutative multiplication rules apply: ij = k, jk = i, ki = j, ji = −k, kj = −i, ik = −j. In (1), r is the real value and similarly, x, y, z are real numbers that represent the imaginary components of the Quaternion vector Q. Operations on Quaternions are defined in the following.

Addition The addition of two Quaternions is defined as:

    Q + P = Q_r + P_r + (Q_x + P_x)i + (Q_y + P_y)j + (Q_z + P_z)k,

where Q and P with subscripts denote the real value and imaginary components of Quaternions Q and P. Subtraction follows the same principle analogously, with + flipped to −.

Scalar Multiplication A scalar α multiplies across all components, i.e.,

    αQ = αr + αxi + αyj + αzk.

Conjugate The conjugate of Q is defined as:

    Q* = r − xi − yj − zk.

Norm The normalized (unit) Quaternion Q̂ is defined as:

    Q̂ = Q / √(r² + x² + y² + z²).
Hamilton Product The Hamilton product, which represents the multiplication of two Quaternions Q and P, is defined as:

    Q ⊗ P = (Q_r P_r − Q_x P_x − Q_y P_y − Q_z P_z)
          + (Q_x P_r + Q_r P_x − Q_z P_y + Q_y P_z) i
          + (Q_y P_r + Q_z P_x + Q_r P_y − Q_x P_z) j
          + (Q_z P_r − Q_y P_x + Q_x P_y + Q_r P_z) k,    (2)

which intuitively encourages inter-latent interaction between all the four components of Q and P. In this work, we use Hamilton products extensively for vector and matrix transformations that live at the heart of attention models for NLP.
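To make the algebra above concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper) of a quaternion with the operations defined in this section, including the Hamilton product in (2). The `Quat` class and method names are ours.

```python
import numpy as np

class Quat:
    """A quaternion Q = r + x i + y j + z k; components may be scalars or arrays."""

    def __init__(self, r, x, y, z):
        self.r, self.x, self.y, self.z = (np.asarray(v, dtype=float) for v in (r, x, y, z))

    def __add__(self, other):
        # Addition is component-wise.
        return Quat(self.r + other.r, self.x + other.x,
                    self.y + other.y, self.z + other.z)

    def scale(self, alpha):
        # Scalar multiplication distributes over all four components.
        return Quat(alpha * self.r, alpha * self.x, alpha * self.y, alpha * self.z)

    def conjugate(self):
        # Q* = r - x i - y j - z k
        return Quat(self.r, -self.x, -self.y, -self.z)

    def normalized(self):
        # Unit quaternion: Q / sqrt(r^2 + x^2 + y^2 + z^2).
        n = np.sqrt(self.r**2 + self.x**2 + self.y**2 + self.z**2)
        return Quat(self.r / n, self.x / n, self.y / n, self.z / n)

    def hamilton(self, p):
        # Hamilton product Q ⊗ P, following Eq. (2): each output component
        # mixes all four components of both operands.
        q = self
        return Quat(
            q.r * p.r - q.x * p.x - q.y * p.y - q.z * p.z,
            q.x * p.r + q.r * p.x - q.z * p.y + q.y * p.z,
            q.y * p.r + q.z * p.x + q.r * p.y - q.x * p.z,
            q.z * p.r - q.y * p.x + q.x * p.y + q.r * p.z,
        )

# Example: (1 + 2i + 3j + 4k) ⊗ (5 + 6i + 7j + 8k) = -60 + 12i + 30j + 24k
q, p = Quat(1, 2, 3, 4), Quat(5, 6, 7, 8)
out = q.hamilton(p)
print(out.r, out.x, out.y, out.z)  # -60.0 12.0 30.0 24.0
```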
3 Quaternion Models of Language

In this section, we propose Quaternion neural models for language processing tasks. We begin by introducing the building blocks, such as Quaternion feed-forward, Quaternion attention, and Quaternion Transformers.

3.1 Quaternion Feed-Forward

A Quaternion feed-forward layer is similar to a feed-forward layer in real space, except that the former operates in hypercomplex space where the Hamilton product is used. Denote by W ∈ H the weight parameter of a Quaternion feed-forward layer and let Q ∈ H be the layer input. The linear output of the layer is the Hamilton product of the two Quaternions: W ⊗ Q.

Saving Parameters? How and Why Since it might not be completely obvious at first glance why Quaternion models result in models with smaller parameterization, we dedicate the following to address this.

For the sake of parameterization comparison, let us express the Hamilton product W ⊗ Q in a Quaternion feed-forward layer in the form of matrix multiplication, which is used in real-space feed-forward. Recall the definition of the Hamilton product in (2). Putting aside the Quaternion unit basis [1, i, j, k]^T, W ⊗ Q can be expressed as:

    [ W_r  −W_x  −W_y  −W_z ] [ r ]
    [ W_x   W_r  −W_z   W_y ] [ x ]    (3)
    [ W_y   W_z   W_r  −W_x ] [ y ]
    [ W_z  −W_y   W_x   W_r ] [ z ]

where W = W_r + W_x i + W_y j + W_z k and Q is defined in (1).

[Figure 1: 4 weight parameter variables (W_r, W_x, W_y, W_z) are used in 16 pairwise connections between the components (r, x, y, z) of the input Quaternion Q and the components (r', x', y', z') of the output Quaternion Q'.]

We highlight that there are only 4 distinct parameter variable elements (4 degrees of freedom), namely W_r, W_x, W_y, W_z, in the weight matrix (left) of (3), as illustrated by Figure 1; in real-space feed-forward, by contrast, all the elements of the weight matrix are different parameter variables (4 × 4 = 16 degrees of freedom). In other words, the degrees of freedom in Quaternion feed-forward are only a quarter of those in its real-space counterpart, resulting in a 75% reduction in parameterization. Such a parameterization reduction can also be explained by weight sharing (Parcollet et al., 2018b,a).

Nonlinearity Nonlinearity can be added to a Quaternion feed-forward layer, and component-wise activation is adopted (Parcollet et al., 2018a):

    α(Q) = α(r) + α(x)i + α(y)j + α(z)k,

where Q is defined in (1) and α(·) is a nonlinear function such as tanh or ReLU.
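As an illustration of the parameter saving described above, the sketch below (ours, not the authors' code; the function name is hypothetical) builds the real block matrix of (3) from the four free weight blocks W_r, W_x, W_y, W_z and applies it to the component-concatenated input. For a real feature dimension d, each block is (d/4) × (d/4), so a Quaternion layer stores d²/4 weights instead of the d² of a real layer.

```python
import numpy as np

def quaternion_linear(inputs, w_r, w_x, w_y, w_z):
    """Quaternion feed-forward (linear) layer written as the real block matrix of Eq. (3).

    inputs: array of shape (..., d) holding [r; x; y; z] concatenated, with d divisible by 4.
    w_r, w_x, w_y, w_z: the four shared weight blocks, each of shape (d // 4, d // 4).
    Returns an array of shape (..., d) holding the components of W ⊗ Q.
    """
    # Arrange the 4 shared blocks in the Hamilton-product pattern of Eq. (3).
    big_w = np.block([
        [w_r, -w_x, -w_y, -w_z],
        [w_x,  w_r, -w_z,  w_y],
        [w_y,  w_z,  w_r, -w_x],
        [w_z, -w_y,  w_x,  w_r],
    ])  # shape (d, d), but only d*d/4 distinct parameters
    return inputs @ big_w.T

d = 8  # real dimension; each quaternion component has size d // 4 = 2
rng = np.random.default_rng(0)
w_r, w_x, w_y, w_z = (rng.normal(size=(d // 4, d // 4)) for _ in range(4))
q = rng.normal(size=(3, d))           # a batch of 3 quaternion-valued inputs
out = quaternion_linear(q, w_r, w_x, w_y, w_z)
print(out.shape)                      # (3, 8)
print(4 * (d // 4) ** 2, "params vs", d * d, "for a real layer")  # 16 vs 64
```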
3.2 Quaternion Attention

Next, we propose a Quaternion attention model to compute attention and alignment between two sequences. Let A ∈ H^{ℓ_a × d} and B ∈ H^{ℓ_b × d} be input word sequences, where ℓ_a, ℓ_b are the numbers of tokens in each sequence and d is the dimension of each input vector. We first compute:

    E = A ⊗ B^T,

where E ∈ H^{ℓ_a × ℓ_b}. We apply Softmax(·) to E component-wise:

    G = ComponentSoftmax(E)
    B' = G_R B_R + G_X B_X i + G_Y B_Y j + G_Z B_Z k,

where G and B with subscripts represent the real and imaginary components of G and B. Similarly, we perform the same on A, which is described as follows:

    F = ComponentSoftmax(E^T)
    A' = F_R A_R + F_X A_X i + F_Y A_Y j + F_Z A_Z k,

where A' is the aligned representation of B and B' is the aligned representation of A. Next, given A' ∈ R^{ℓ_b × d}, B' ∈ R^{ℓ_a × d}, we then compute and compare the learned alignments:

    C1 = Σ_i QFFN([A'_i; B_i; A'_i ⊗ B_i; A'_i − B_i])
    C2 = Σ_i QFFN([B'_i; A_i; B'_i ⊗ A_i; B'_i − A_i]),

where QFFN(·) is a Quaternion feed-forward layer with nonlinearity, [;] is the component-wise concatenation operator, i refers to word positional indices, and Σ_i sums over the words in the sequence. Both outputs C1, C2 are then passed into

    Y = QFFN([C1; C2; C1 ⊗ C2; C1 − C2]),

where Y ∈ H is a Quaternion-valued output. In order to train our model end-to-end with real-valued losses, we concatenate each component and pass the result into a final linear layer for classification.
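A rough sketch of the alignment step above (our own illustration, continuing the NumPy conventions of the earlier snippets; `component_attention` is a hypothetical name): attention scores are computed with a Hamilton product between quaternion-valued sequences, softmax is applied separately to each of the four score components, and each component of B is aligned with its own weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def component_attention(A, B):
    """Quaternion attention sketch.

    A, B: arrays of shape (4, len_a, d) and (4, len_b, d); axis 0 indexes the
    quaternion components (r, x, y, z) of each token vector.
    Returns B_aligned of shape (4, len_a, d): for every token of A, a weighted
    sum over B, computed per component as in the ComponentSoftmax step.
    """
    Ar, Ax, Ay, Az = A
    Br, Bx, By, Bz = B
    # E = A ⊗ B^T: four (len_a, len_b) score matrices via the Hamilton product.
    Er = Ar @ Br.T - Ax @ Bx.T - Ay @ By.T - Az @ Bz.T
    Ex = Ax @ Br.T + Ar @ Bx.T - Az @ By.T + Ay @ Bz.T
    Ey = Ay @ Br.T + Az @ Bx.T + Ar @ By.T - Ax @ Bz.T
    Ez = Az @ Br.T - Ay @ Bx.T + Ax @ By.T + Ar @ Bz.T
    # G = ComponentSoftmax(E): softmax over the B dimension, per component.
    Gr, Gx, Gy, Gz = (softmax(E, axis=-1) for E in (Er, Ex, Ey, Ez))
    # B' = G_R B_R + G_X B_X i + G_Y B_Y j + G_Z B_Z k (component-wise alignment).
    return np.stack([Gr @ Br, Gx @ Bx, Gy @ By, Gz @ Bz])

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 5, 8))   # 5 tokens, component size 8
B = rng.normal(size=(4, 7, 8))   # 7 tokens
print(component_attention(A, B).shape)  # (4, 5, 8)
```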
3.3 Quaternion Transformer

This section describes our Quaternion adaptation of Transformer networks. The Transformer (Vaswani et al., 2017) can be considered state-of-the-art across many NLP tasks. Transformer networks are characterized by stacked layers of linear transforms along with its signature self-attention mechanism. For the sake of brevity, we outline the specific changes we make to the Transformer model.

Quaternion Self-Attention The standard self-attention mechanism considers the following:

    A = softmax(QK^T / √d_k) V,

where Q, K, V are traditionally learned via linear transforms from the input X. The key idea here is that we replace this linear transform with a Quaternion transform:

    Q = W_q ⊗ X;  K = W_k ⊗ X;  V = W_v ⊗ X,

where ⊗ is the Hamilton product and X is the input Quaternion representation of the layer. In this case, since computation is performed in Quaternion space, the parameters of W are effectively reduced by 75%. Similarly, the computation of self-attention also relies on Hamilton products. The revised Quaternion self-attention is defined as follows:

    A = ComponentSoftmax(Q ⊗ K / √d_k) V.    (4)

Note that in (4), Q ⊗ K returns four ℓ × ℓ matrices (attention weights), one for each component (r, i, j, k). Softmax is applied component-wise, along with multiplication with V, which is multiplied in similar fashion to the Quaternion attention model. Note that the Hamilton product in the self-attention itself does not change the parameter size of the network.

Quaternion Transformer Block Aside from the linear transformations for forming queries, keys, and values, Transformers also contain position-wise feed-forward networks with ReLU activations. Similarly, we replace these feed-forward connections (FFNs) with Quaternion FFNs. We denote this as Quaternion Transformer (full), while denoting the model that only uses Quaternion FFNs in the self-attention as (partial). Finally, the remainder of the Transformer network remains identical to the original design (Vaswani et al., 2017) in the sense that component-wise functions are applied unless specified above.

3.4 Embedding Layers

In the case where the word embedding layer is trained from scratch (i.e., using byte-pair encoding in machine translation), we treat each embedding as the concatenation of its four components. In the case where pre-trained embeddings such as GloVe (Pennington et al., 2014) are used, a nonlinear transform is used to project the embeddings into Quaternion space.

3.5 Connection to Real Components

A vast majority of neural components in the deep learning arsenal operate in real space. As such, it would be beneficial for our Quaternion-inspired components to interface seamlessly with these components. If the input to a Quaternion module (such as a Quaternion FFN or attention module) is real-valued, we simply treat it as a concatenation of components r, x, y, z. Similarly, the output of a Quaternion module, if passed to a real-valued layer, is treated as [r; x; y; z], where [;] is the concatenation operator.

Output Layer and Loss Functions To train our model, we simply concatenate all r, i, j, k components into a single vector at the final output layer. For example, for classification, the final Softmax output is defined as follows:

    Y = Softmax(W([r; x; y; z]) + b),

where Y ∈ R^{|C|}, |C| is the number of classes, and x, y, z are the imaginary components. Similarly, for sequence loss (for sequence transduction problems), the same can also be done.
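The interface with real-valued layers described in Section 3.5 and the output layer both amount to splitting and re-concatenating along the feature dimension. Below is a small sketch of this convention (ours; the helper names are hypothetical).

```python
import numpy as np

def real_to_quaternion(x):
    """Treat a real-valued input of shape (..., d) as the concatenation [r; x; y; z]
    and return the four components, each of shape (..., d // 4)."""
    return np.split(x, 4, axis=-1)

def quaternion_to_real(r, x, y, z):
    """Concatenate the components back into a real-valued vector [r; x; y; z]."""
    return np.concatenate([r, x, y, z], axis=-1)

def classify(r, x, y, z, W, b):
    """Output layer: concatenate all components, then a real affine map + softmax."""
    logits = quaternion_to_real(r, x, y, z) @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
h = rng.normal(size=(3, 16))                  # real-valued hidden states
r, xc, yc, zc = real_to_quaternion(h)         # feed into a quaternion module
W, b = rng.normal(size=(16, 5)), np.zeros(5)  # |C| = 5 classes
print(classify(r, xc, yc, zc, W, b).shape)    # (3, 5)
```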
Parameter Initialization It is intuitive that specialized initialization schemes ought to be devised for Quaternion representations and their modules (Parcollet et al., 2018b,a):

    w = |w| (cos(θ) + q̂_imag sin(θ)),

where q̂_imag is the normalized imaginary Quaternion constructed by sampling uniformly at random from [0, 1], and θ is sampled randomly and uniformly from [−π, π]. However, our early experiments show that, at least within the context of NLP applications, this initialization performed comparably to or worse than the standard Glorot initialization. Hence, we opt to initialize all components independently with Glorot initialization.
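For completeness, here is a sketch of our reading of the Quaternion-specific scheme above (which the authors ultimately did not adopt). The text does not specify how the magnitude |w| is drawn, so sampling it Glorot-style below is an assumption, as is the function name.

```python
import numpy as np

def quaternion_init(shape, rng, fan_in, fan_out):
    """Quaternion-style initialization sketch: w = |w| (cos θ + q̂_imag sin θ).

    The magnitude |w| is drawn Glorot-style here (an assumption; the paper does not
    spell out this detail). Returns the four components (w_r, w_x, w_y, w_z)."""
    # Normalized purely-imaginary quaternion q̂_imag, components sampled from [0, 1].
    qx, qy, qz = rng.uniform(0.0, 1.0, size=(3,) + shape)
    norm = np.sqrt(qx**2 + qy**2 + qz**2) + 1e-8
    qx, qy, qz = qx / norm, qy / norm, qz / norm
    # Angle θ sampled uniformly from [-π, π]; magnitude |w| Glorot-style (assumed).
    theta = rng.uniform(-np.pi, np.pi, size=shape)
    magnitude = rng.uniform(-1.0, 1.0, size=shape) * np.sqrt(6.0 / (fan_in + fan_out))
    w_r = magnitude * np.cos(theta)
    w_x = magnitude * qx * np.sin(theta)
    w_y = magnitude * qy * np.sin(theta)
    w_z = magnitude * qz * np.sin(theta)
    return w_r, w_x, w_y, w_z

rng = np.random.default_rng(3)
w_r, w_x, w_y, w_z = quaternion_init((64, 64), rng, fan_in=64, fan_out=64)
print(w_r.shape, w_x.shape)  # (64, 64) (64, 64)
```

In practice, the paper falls back to the standard Glorot initialization applied independently to each of the four components.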
4 Experiments

This section describes our experimental setup across multiple diverse NLP tasks. All experiments were run on NVIDIA Titan X hardware.

Our Models On pairwise text classification, we benchmark the Quaternion attention model (Q-Att), testing the ability of Quaternion models on pairwise representation learning. On all the other tasks, such as machine translation and subject-verb agreement, we evaluate Quaternion Transformers. We evaluate two variations of Transformers, full and partial. The full setting converts all linear transformations into Quaternion space and is approximately 25% of the actual Transformer size. The second setting (partial) only reduces the linear transforms at the self-attention mechanism. Tensor2Tensor¹ is used for Transformer benchmarks, with its default hyperparameters and encoding for all experiments.

4.1 Pairwise Text Classification

We evaluate our proposed Quaternion attention (Q-Att) model on pairwise text classification tasks. This task involves predicting a label or ranking score for sentence pairs. We use a total of seven data sets from problem domains such as:

• Natural language inference (NLI) - This task is concerned with determining if two sentences entail or contradict each other. We use SNLI (Bowman et al., 2015), SciTail (Khot et al., 2018), and MNLI (Williams et al., 2017) as benchmark data sets.

• Question answering (QA) - This task involves learning to rank question-answer pairs. We use WikiQA (Yang et al., 2015), which comprises QA pairs from Bing Search.

• Paraphrase detection - This task involves detecting if two sentences are paraphrases of each other. We use the Tweets (Lan et al., 2017) data set and the Quora paraphrase data set (Wang et al., 2017).

• Dialogue response selection - This is a response selection (RS) task that tries to select the best response given a message. We use the Ubuntu dialogue corpus, UDC (Lowe et al., 2015).

Implementation Details We implement Q-Att in TensorFlow (Abadi et al., 2016), along with the Decomposable Attention baseline (Parikh et al., 2016). Both models optimize the cross entropy loss (e.g., binary cross entropy for ranking tasks such as WikiQA and Ubuntu). Models are optimized with Adam, with the learning rate tuned amongst {0.001, 0.0003} and the batch size tuned amongst {32, 64}. Embeddings are initialized with GloVe (Pennington et al., 2014). For Q-Att, we use an additional transform layer to project the pre-trained embeddings into Quaternion space. The measures used are generally the accuracy measure (for NLI and paraphrase tasks) and ranking measures (MAP/MRR/Top-1) for ranking tasks (WikiQA and Ubuntu).

Baselines and Comparison We use the Decomposable Attention model as a baseline, adding [a_i; b_i; a_i ⊙ b_i; a_i − b_i] before the compare² layers, since we found this simple modification to increase performance. This also enables fair comparison with our variation of Quaternion attention, which uses the Hamilton product instead of element-wise multiplication. We denote this baseline as DeAtt. We evaluate at a fixed representation size of d = 200 (equivalent to d = 50 in Quaternion space). We also include comparisons at equal parameterization (d = 50 and approximately 200K parameters) to observe the effect of Quaternion representations. Our selection of DeAtt is owing to its simplicity and ease of comparison. We defer the prospect of Quaternion variations of more advanced models (Chen et al., 2016; Tay et al., 2017b) to future work.

¹ https://github.com/tensorflow/tensor2tensor
² This follows the matching function of (Chen et al., 2016).
Task (Measure)   | NLI (Accuracy)             | QA (MAP/MRR) | Paraphrase (Accuracy) | RS (Top-1) |
Model            | SNLI   SciTail   MNLI      | WikiQA       | Tweet    Quora        | UDC        | # Params
DeAtt (d = 50)   | 83.4   73.8      69.9/70.9 | 66.0/67.1    | 77.8     82.2         | 48.7       | 200K
DeAtt (d = 200)  | 86.2   79.0      73.6/73.9 | 67.2/68.3    | 80.0     85.4         | 51.8       | 700K
Q-Att (d = 50)   | 85.4   79.6      72.3/72.9 | 66.2/68.1    | 80.1     84.1         | 51.5       | 200K (−71%)

Table 1: Experimental results on pairwise text classification and ranking tasks. Q-Att achieves comparable or competitive results compared with DeAtt with approximately one third of the parameter cost.

Model                            | IMDb         | SST          | # Params
Transformer                      | 82.6         | 78.9         | 400K
Quaternion Transformer (full)    | 83.9 (+1.3%) | 80.5 (+1.6%) | 100K (−75.0%)
Quaternion Transformer (partial) | 83.6 (+1.0%) | 81.4 (+2.5%) | 300K (−25.0%)

Table 2: Experimental results on sentiment analysis on the IMDb and Stanford Sentiment Treebank (SST) data sets. The evaluation measure is accuracy.

Results Table 1 reports results on seven different and diverse data sets. We observe that a tiny Q-Att model (d = 50) achieves comparable (or occasionally marginally better or worse) performance compared to DeAtt (d = 200), gaining a 68% parameter saving. The results actually improve on certain data sets (2/7) and are comparable (often less than a percentage point difference) on the rest compared with the d = 200 DeAtt model. Moreover, we scaled the parameter size of the DeAtt model down to be similar to the Q-Att model and found that the performance degrades quite significantly (about 2%-3% lower on all data sets). This demonstrates the quality and benefit of learning with Quaternion space.

4.2 Sentiment Analysis

We evaluate on the task of document-level sentiment analysis, which is a binary classification problem.

Implementation Details We compare our proposed Quaternion Transformer against the vanilla Transformer. In this experiment, we use the tiny Transformer setting in Tensor2Tensor with a vocab size of 8K. We use two data sets, namely IMDb (Maas et al., 2011) and the Stanford Sentiment Treebank (SST) (Socher et al., 2013).
Results Table 2 reports results on the sentiment classification task on IMDb and SST. We observe that both the full and partial variations of the Quaternion Transformer outperform the base Transformer. Quaternion Transformer (partial) obtains a +1.0% lead over the vanilla Transformer on IMDb and +2.5% on SST, while having a 24.5% saving in parameter cost. Finally, the full Quaternion version leads by +1.3%/+1.6% on IMDb and SST respectively while maintaining a 75% reduction in parameter cost. This supports our core hypothesis of improving accuracy while saving parameter costs.

4.3 Neural Machine Translation

We evaluate our proposed Quaternion Transformer against the vanilla Transformer on three data sets for the neural machine translation (NMT) task. More concretely, we evaluate on IWSLT 2015 English-Vietnamese (En-Vi), WMT 2016 English-Romanian (En-Ro) and WMT 2018 English-Estonian (En-Et). We also include results on the standard WMT English-German (En-De) benchmark.

Implementation Details We implement the models in Tensor2Tensor and train both for 50K steps. We use the default base single-GPU hyperparameter setting for both models and apply checkpoint averaging. Note that our goal is not to obtain state-of-the-art models but to fairly and systematically evaluate both vanilla and Quaternion Transformers.
Model                            | IWSLT'15 En-Vi (BLEU) | WMT'16 En-Ro (BLEU) | WMT'18 En-Et (BLEU) | # Params
Transformer Base                 | 28.4                  | 22.8                | 14.1                | 44M
Quaternion Transformer (full)    | 28.0                  | 18.5                | 13.1                | 11M (−75%)
Quaternion Transformer (partial) | 30.9                  | 22.7                | 14.2                | 29M (−32%)

Table 3: Experimental results on neural machine translation (NMT): results of Transformer Base and Quaternion Transformers on En-Vi (IWSLT 2015), En-Ro (WMT 2016) and En-Et (WMT 2018). Parameter size excludes word embeddings. Our proposed Quaternion Transformer achieves comparable or higher performance with only 67.9% of the parameter cost of the base Transformer model.

Results Table 3 reports the results on neural machine translation. On the IWSLT'15 En-Vi data set, the partial adaptation of the Quaternion Transformer outperforms (+2.5%) the base Transformer with a 32% reduction in parameter cost. The full adaptation, on the other hand, comes close (−0.4%) with a 75% reduction in parameter cost. On the WMT'16 En-Ro data set, Quaternion Transformers do not outperform the base Transformer. We observe a −0.1% degradation in performance for the partial adaptation and a −4.3% degradation for the full adaptation of the Quaternion Transformer. However, we note that the drop in performance relative to the parameter savings is still quite decent, e.g., saving 32% of parameters for a drop of only 0.1 BLEU points; the full adaptation loses out comparatively. On the WMT'18 En-Et data set, the partial adaptation achieves the best result with 32% fewer parameters. The full adaptation, comparatively, only loses 1.0 BLEU point to the original Transformer while saving 75% of parameters.

WMT English-German Notably, the Quaternion Transformer achieves a BLEU score of 26.42/25.14 for the partial/full settings respectively on the standard WMT 2014 En-De benchmark. This is using a single GPU trained for 1M steps with a batch size of 8192. We note that these results do not differ much from other single-GPU runs (i.e., 26.07 BLEU) on this data set (Nguyen and Joty, 2019).

4.4 Mathematical Language Understanding

We include evaluations on the newly released mathematical language understanding (MLU) data set (Wangperawong, 2018). This data set is a character-level transduction task that aims to test a model's compositional reasoning capabilities. For example, given an input x = 85, y = −523, x ∗ y, the model strives to decode an output of −44455. Several variations of these problems exist, mainly switching and introducing new mathematical operators.

Implementation Details We train the Quaternion Transformer for 100K steps using the default Tensor2Tensor setting following the original work (Wangperawong, 2018). We use the tiny hyperparameter setting. Similar to NMT, we report both full and partial adaptations of Quaternion Transformers. Baselines are reported from the original work as well, which includes comparisons with Universal Transformers (Dehghani et al., 2018) and Adaptive Computation Time (ACT) Universal Transformers. The evaluation measure is accuracy per sequence, which counts a generated sequence as correct if and only if the entire sequence is an exact match.

Results Table 4 reports our experimental results on the MLU data set. We observe a modest +7.8% accuracy gain when using the Quaternion Transformer (partial) while saving 24.5% in parameter costs. The Quaternion Transformer outperforms the Universal Transformer and is marginally outperformed by the Adaptive Computation Universal Transformer (ACT U-Transformer) by 0.5%. On the other hand, a full Quaternion Transformer still outperforms the base Transformer (+2.8%) with a 75% parameter saving.
4.5 Subject-Verb Agreement

Additionally, we compare our Quaternion Transformer on the subject-verb agreement task (Linzen et al., 2016). The task is a binary classification problem: determining whether a sentence, e.g., 'The keys to the cabinet ____.', is followed by a plural or singular verb.

Implementation We use the Tensor2Tensor framework, training the Transformer and Quaternion Transformer with the tiny hyperparameter setting for 10K steps.
Model                            | Acc / Seq    | # Params
Universal Transformer            | 78.8         | -
ACT U-Transformer                | 84.9         | -
Transformer                      | 76.1         | 400K
Quaternion Transformer (full)    | 78.9 (+2.8%) | 100K (−75%)
Quaternion Transformer (partial) | 84.4 (+8.3%) | 300K (−25%)

Table 4: Experimental results on mathematical language understanding (MLU). Both Quaternion models outperform the base Transformer model with up to 75% parameter savings.

Model                 | Acc  | # Params
Transformer           | 94.8 | 400K
Quaternion (full)     | 94.7 | 100K
Quaternion (partial)  | 95.5 | 300K

Table 5: Experimental results on the subject-verb agreement (SVA) number prediction task.

Results Table 5 reports the results on the SVA task. Results show that Quaternion Transformers perform equally to (or better than) vanilla Transformers. On this task, the partial adaptation performs better, improving on the Transformer by +0.7% accuracy while saving 25% of parameters.

5 Related Work

The goal of learning effective representations lives at the heart of deep learning research. While most neural architectures for NLP have mainly explored the usage of real-valued representations (Vaswani et al., 2017; Bahdanau et al., 2014; Parikh et al., 2016), there has also been emerging interest in complex (Danihelka et al., 2016; Arjovsky et al., 2016; Gaudet and Maida, 2017) and hypercomplex representations (Parcollet et al., 2018b,a; Gaudet and Maida, 2017).

Notably, progress on Quaternion and hypercomplex representations for deep learning is still in its infancy and, consequently, most works on this topic are very recent. Gaudet and Maida (2017) proposed deep Quaternion networks for image classification, introducing basic tools such as Quaternion batch normalization and Quaternion initialization. In a similar vein, Quaternion RNNs and CNNs were proposed for speech recognition (Parcollet et al., 2018a,b). In parallel, Zhu et al. (2018) proposed Quaternion CNNs and applied them to image classification and denoising tasks. Comminiello et al. (2018) proposed Quaternion CNNs for sound detection. Zhang et al. (2019a) proposed Quaternion embeddings of knowledge graphs, and Zhang et al. (2019b) proposed Quaternion representations for collaborative filtering. A common theme is that Quaternion representations are helpful and provide utility over real-valued representations.
The interest in non-real spaces can be attributed to several factors. Firstly, complex weight matrices used to parameterize RNNs help to combat vanishing gradients (Arjovsky et al., 2016). On the other hand, complex spaces are also intuitively linked to associative composition, along with holographic reduced representations (Plate, 1991; Nickel et al., 2016; Tay et al., 2017a). Asymmetry has also demonstrated utility in domains such as relational learning (Trouillon et al., 2016; Nickel et al., 2016) and question answering (Tay et al., 2018). Complex networks (Trabelsi et al., 2017), in general, have also demonstrated promise over real networks.

In a similar vein, the hypercomplex Hamilton product provides a greater extent of expressiveness, similar to the complex Hermitian product, albeit with a 4-fold increase in interactions between real and imaginary components. In the case of Quaternion representations, due to the parameter saving in the Hamilton product, models also enjoy a 75% reduction in parameter size.

Our work draws important links to multi-head (Vaswani et al., 2017) or multi-sense (Li and Jurafsky, 2015; Neelakantan et al., 2015) representations that are highly popular in NLP research. Intuitively, the four-component structure of Quaternion representations can also be interpreted as a kind of multi-headed architecture. The key difference is that the basic operators (e.g., the Hamilton product) provide an inductive bias that encourages interactions between these components. Notably, the idea of splitting vectors has also been explored (Daniluk et al., 2017), which is in similar spirit to breaking a vector into four components.
6 Conclusion

This paper advocates for lightweight and efficient neural NLP via Quaternion representations. More concretely, we proposed two models: the Quaternion attention model and the Quaternion Transformer. We evaluated these models on eight different NLP tasks and a total of thirteen data sets. Across all data sets, the Quaternion models achieve comparable performance while reducing parameter size. All in all, we demonstrated the utility and benefits of incorporating Quaternion algebra in state-of-the-art neural models. We believe that this direction paves the way for more efficient and effective representation learning in NLP. Our Tensor2Tensor implementation of Quaternion Transformers will be released at https://github.com/vanzytay/QuaternionTransformers.

7 Acknowledgements

The authors thank the anonymous reviewers of ACL 2019 for their time, feedback and comments.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Martin Arjovsky, Amar Shah, and Yoshua Bengio. 2016. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2016. Enhanced LSTM for natural language inference. arXiv preprint arXiv:1609.06038.

Danilo Comminiello, Marco Lella, Simone Scardapane, and Aurelio Uncini. 2018. Quaternion convolutional neural networks for detection and localization of 3D sound events. arXiv preprint arXiv:1812.06811.
Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. 2016. Associative long short-term memory. arXiv preprint arXiv:1602.03032.

Michał Daniluk, Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. 2017. Frustratingly short attention spans in neural language modeling. arXiv preprint arXiv:1702.04521.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2018. Universal transformers. arXiv preprint arXiv:1807.03819.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Chase Gaudet and Anthony Maida. 2017. Deep quaternion networks. arXiv preprint arXiv:1712.04604.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In Thirty-Second AAAI Conference on Artificial Intelligence.

Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. arXiv preprint arXiv:1708.00391.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4:521–535.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1015.
Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2015. Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654.

Phi Xuan Nguyen and Shafiq Joty. 2019. Phrase-based attentions. https://openreview.net/forum?id=r1xN5oA5tm.

Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. 2016. Holographic embeddings of knowledge graphs. In Thirtieth AAAI Conference on Artificial Intelligence.

Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Chiheb Trabelsi, Renato De Mori, and Yoshua Bengio. 2018a. Quaternion recurrent neural networks. arXiv preprint arXiv:1806.04418.

Titouan Parcollet, Ying Zhang, Mohamed Morchid, Chiheb Trabelsi, Georges Linarès, Renato De Mori, and Yoshua Bengio. 2018b. Quaternion convolutional neural networks for end-to-end automatic speech recognition. arXiv preprint arXiv:1806.07789.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

Dario Pavllo, David Grangier, and Michael Auli. 2018. QuaterNet: A quaternion-based recurrent model for human motion. arXiv preprint arXiv:1805.06485.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Tony Plate. 1991. Holographic reduced representations: Convolution algebra for compositional distributed representations.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018. Hermitian co-attention networks for text matching in asymmetrical domains.

Yi Tay, Minh C. Phan, Luu Anh Tuan, and Siu Cheung Hui. 2017a. Learning to rank question answer pairs with holographic dual LSTM architecture. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 695–704. ACM.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2017b. Compare, compress and propagate: Enhancing neural architectures with alignment factorization for natural language inference. arXiv preprint arXiv:1801.00102.

Chiheb Trabelsi, Olexa Bilaniuk, Ying Zhang, Dmitriy Serdyuk, Sandeep Subramanian, João Felipe Santos, Soroush Mehri, Negar Rostamzadeh, Yoshua Bengio, and Christopher J. Pal. 2017. Deep complex networks. arXiv preprint arXiv:1705.09792.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814.

Artit Wangperawong. 2018. Attending to mathematical language with transformers. CoRR abs/1812.02825. http://arxiv.org/abs/1812.02825.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018.

Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. 2019a. Quaternion knowledge graph embeddings. arXiv preprint arXiv:1904.10281.

Shuai Zhang, Lina Yao, Lucas Vinh Tran, Aston Zhang, and Yi Tay. 2019b. Quaternion collaborative filtering for recommendation. arXiv preprint arXiv:1906.02594.
Xuanyu Zhu, Yi Xu, Hongteng Xu, and Changjian Chen. 2018. Quaternion convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–647.