Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks
Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks

1 Yi Tay, 2 Aston Zhang, 3 Luu Anh Tuan, 4 Jinfeng Rao*, 5 Shuai Zhang,
6 Shuohang Wang, 7 Jie Fu, 8 Siu Cheung Hui
1,8 Nanyang Technological University, 2 Amazon AI, 3 MIT CSAIL,
4 Facebook AI, 5 UNSW, 6 Singapore Management University, 7 Mila and Polytechnique Montréal
ytay017@e.ntu.edu.sg
* Work done while at University of Maryland.

Abstract

Many state-of-the-art neural models for NLP are heavily parameterized and thus memory inefficient. This paper proposes a series of lightweight and memory efficient neural architectures for a potpourri of natural language processing (NLP) tasks. To this end, our models exploit computation using Quaternion algebra and hypercomplex spaces, enabling not only expressive inter-component interactions but also significantly (75%) reduced parameter size due to lesser degrees of freedom in the Hamilton product. We propose Quaternion variants of models, giving rise to new architectures such as the Quaternion attention model and the Quaternion Transformer. Extensive experiments on a battery of NLP tasks demonstrate the utility of the proposed Quaternion-inspired models, enabling up to 75% reduction in parameter size without significant loss in performance.

1 Introduction

Neural network architectures such as Transformers (Vaswani et al., 2017; Dehghani et al., 2018) and attention networks (Parikh et al., 2016; Seo et al., 2016; Bahdanau et al., 2014) are dominant solutions in natural language processing (NLP) research today. Many of these architectures are primarily concerned with learning useful feature representations from data, for which providing a strong architectural inductive bias is known to be extremely helpful for obtaining stellar results. Unfortunately, many of these models are known to be heavily parameterized, with state-of-the-art models easily containing millions or billions of parameters (Vaswani et al., 2017; Radford et al., 2018; Devlin et al., 2018; Radford et al., 2019). This renders practical deployment challenging. As such, enabling efficient and lightweight adaptations of these models, without significantly degrading performance, would certainly have a positive impact on many real world applications.

To this end, this paper explores a new way to improve/maintain the performance of these neural architectures while substantially reducing the parameter cost (compression of up to 75%). In order to achieve this, we move beyond real space, exploring computation in Quaternion space (i.e., hypercomplex numbers) as an inductive bias. Hypercomplex numbers comprise a real component and three imaginary components (e.g., i, j, k) in which inter-dependencies between these components are encoded naturally during training via the Hamilton product ⊗. Hamilton products have fewer degrees of freedom, enabling up to four times compression of model size. Technical details are deferred to subsequent sections.

While Quaternion connectionist architectures have been considered in various deep learning application areas such as speech recognition (Parcollet et al., 2018b), kinematics/human motion (Pavllo et al., 2018) and computer vision (Gaudet and Maida, 2017), our work is the first hypercomplex inductive bias designed for a wide spread of NLP tasks. Other fields have motivated the usage of Quaternions primarily due to their naturally 3- or 4-dimensional input features (e.g., RGB scenes or 3D human poses) (Parcollet et al., 2018b; Pavllo et al., 2018). In a similar vein, we motivate this by considering the multi-sense nature of natural language (Li and Jurafsky, 2015; Neelakantan et al., 2015; Huang et al., 2012). In this case, having multiple embeddings or components per token is well-aligned with this motivation.
Latent interactions between components may also enjoy additional benefits, especially pertaining to applications which require learning pairwise affinity scores (Parikh et al., 2016; Seo et al., 2016).
Intuitively, instead of regular (real) dot products, Hamilton products ⊗ extensively learn representations by matching across multiple (inter-latent) components in hypercomplex space. Alternatively, the effectiveness of multi-view and multi-headed (Vaswani et al., 2017) approaches may also explain the suitability of Quaternion spaces in NLP models. The added advantage over multi-headed approaches is that Quaternion space explicitly encodes latent interactions between these components or heads via the Hamilton product, which intuitively increases the expressiveness of the model. Conversely, multi-headed embeddings are generally independently produced.

To this end, we propose two Quaternion-inspired neural architectures, namely, the Quaternion attention model and the Quaternion Transformer. In this paper, we devise and formulate a new attention (and self-attention) mechanism in Quaternion space using Hamilton products. Transformation layers are aptly replaced with Quaternion feed-forward networks, yielding substantial improvements in parameter size (of up to 75% compression) while achieving comparable (and occasionally better) performance.

Contributions All in all, we make the following major contributions:

• We propose Quaternion neural models for NLP. More concretely, we propose a novel Quaternion attention model and Quaternion Transformer for a wide range of NLP tasks. To the best of our knowledge, this is the first formulation of hypercomplex attention and Quaternion models for NLP.

• We evaluate our Quaternion NLP models on a wide range of diverse NLP tasks such as pairwise text classification (natural language inference, question answering, paraphrase identification, dialogue prediction), neural machine translation (NMT), sentiment analysis, mathematical language understanding (MLU), and subject-verb agreement (SVA).

• Our experimental results show that Quaternion models achieve comparable or better performance than their real-valued counterparts with up to a 75% reduction in parameter costs. The key advantage is that these models are expressive (due to Hamilton products) and also parameter efficient. Moreover, our Quaternion components are self-contained and play well with real-valued counterparts.

2 Background on Quaternion Algebra

This section introduces the necessary background for this paper. We introduce Quaternion algebra along with Hamilton products, which form the crux of our proposed approaches.

Quaternion A Quaternion Q ∈ H is a hypercomplex number with three imaginary components as follows:

    Q = r + xi + yj + zk,    (1)

where ijk = i² = j² = k² = −1 and noncommutative multiplication rules apply: ij = k, jk = i, ki = j, ji = −k, kj = −i, ik = −j. In (1), r is the real value and similarly, x, y, z are real numbers that represent the imaginary components of the Quaternion vector Q. Operations on Quaternions are defined in the following.

Addition The addition of two Quaternions is defined as:

    Q + P = Q_r + P_r + (Q_x + P_x)i + (Q_y + P_y)j + (Q_z + P_z)k,

where Q and P with subscripts denote the real value and imaginary components of Quaternions Q and P. Subtraction follows the same principle analogously, with + flipped to −.

Scalar Multiplication A scalar α multiplies across all components, i.e.,

    αQ = αr + αxi + αyj + αzk.

Conjugate The conjugate of Q is defined as:

    Q* = r − xi − yj − zk.

Norm The normalized (unit) Quaternion Q̂ is defined as:

    Q̂ = Q / √(r² + x² + y² + z²).
Hamilton Product The Hamilton product, which represents the multiplication of two Quaternions Q and P, is defined as:

    Q ⊗ P = (Q_r P_r − Q_x P_x − Q_y P_y − Q_z P_z)
          + (Q_x P_r + Q_r P_x − Q_z P_y + Q_y P_z) i
          + (Q_y P_r + Q_z P_x + Q_r P_y − Q_x P_z) j
          + (Q_z P_r − Q_y P_x + Q_x P_y + Q_r P_z) k,    (2)

which intuitively encourages inter-latent interaction between all the four components of Q and P. In this work, we use Hamilton products extensively for vector and matrix transformations that live at the heart of attention models for NLP.
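To make the algebra above concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper) of a quaternion with the operations defined in this section, including the Hamilton product in (2). The `Quat` class and method names are ours.

```python
import numpy as np

class Quat:
    """A quaternion Q = r + x i + y j + z k; components may be scalars or arrays."""

    def __init__(self, r, x, y, z):
        self.r, self.x, self.y, self.z = (np.asarray(v, dtype=float) for v in (r, x, y, z))

    def __add__(self, other):
        # Addition is component-wise.
        return Quat(self.r + other.r, self.x + other.x,
                    self.y + other.y, self.z + other.z)

    def scale(self, alpha):
        # Scalar multiplication distributes over all four components.
        return Quat(alpha * self.r, alpha * self.x, alpha * self.y, alpha * self.z)

    def conjugate(self):
        # Q* = r - x i - y j - z k
        return Quat(self.r, -self.x, -self.y, -self.z)

    def normalized(self):
        # Unit quaternion: Q / sqrt(r^2 + x^2 + y^2 + z^2).
        n = np.sqrt(self.r**2 + self.x**2 + self.y**2 + self.z**2)
        return Quat(self.r / n, self.x / n, self.y / n, self.z / n)

    def hamilton(self, p):
        # Hamilton product Q ⊗ P, following Eq. (2): each output component
        # mixes all four components of both operands.
        q = self
        return Quat(
            q.r * p.r - q.x * p.x - q.y * p.y - q.z * p.z,
            q.x * p.r + q.r * p.x - q.z * p.y + q.y * p.z,
            q.y * p.r + q.z * p.x + q.r * p.y - q.x * p.z,
            q.z * p.r - q.y * p.x + q.x * p.y + q.r * p.z,
        )

# Example: (1 + 2i + 3j + 4k) ⊗ (5 + 6i + 7j + 8k) = -60 + 12i + 30j + 24k
q, p = Quat(1, 2, 3, 4), Quat(5, 6, 7, 8)
out = q.hamilton(p)
print(out.r, out.x, out.y, out.z)  # -60.0 12.0 30.0 24.0
```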
3 Quaternion Models of Language

In this section, we propose Quaternion neural models for language processing tasks. We begin by introducing the building blocks, such as Quaternion feed-forward, Quaternion attention, and Quaternion Transformers.

3.1 Quaternion Feed-Forward

A Quaternion feed-forward layer is similar to a feed-forward layer in real space, except that the former operates in hypercomplex space where the Hamilton product is used. Denote by W ∈ H the weight parameter of a Quaternion feed-forward layer and let Q ∈ H be the layer input. The linear output of the layer is the Hamilton product of the two Quaternions: W ⊗ Q.

Saving Parameters? How and Why Since it might not be completely obvious at first glance why Quaternion models result in models with smaller parameterization, we dedicate the following to address this.

For the sake of parameterization comparison, let us express the Hamilton product W ⊗ Q in a Quaternion feed-forward layer in the form of matrix multiplication, which is used in real-space feed-forward. Recall the definition of the Hamilton product in (2). Putting aside the Quaternion unit basis [1, i, j, k]^T, W ⊗ Q can be expressed as:

    [ W_r  −W_x  −W_y  −W_z ] [ r ]
    [ W_x   W_r  −W_z   W_y ] [ x ]    (3)
    [ W_y   W_z   W_r  −W_x ] [ y ]
    [ W_z  −W_y   W_x   W_r ] [ z ]

where W = W_r + W_x i + W_y j + W_z k and Q is defined in (1).

[Figure 1: 4 weight parameter variables (W_r, W_x, W_y, W_z) are used in 16 pairwise connections between the components (r, x, y, z) of the input Quaternion Q and the components (r', x', y', z') of the output Quaternion Q'.]

We highlight that there are only 4 distinct parameter variable elements (4 degrees of freedom), namely W_r, W_x, W_y, W_z, in the weight matrix (left) of (3), as illustrated by Figure 1; in real-space feed-forward, by contrast, all the elements of the weight matrix are different parameter variables (4 × 4 = 16 degrees of freedom). In other words, the degrees of freedom in Quaternion feed-forward are only a quarter of those in its real-space counterpart, resulting in a 75% reduction in parameterization. Such a parameterization reduction can also be explained by weight sharing (Parcollet et al., 2018b,a).

Nonlinearity Nonlinearity can be added to a Quaternion feed-forward layer, and component-wise activation is adopted (Parcollet et al., 2018a):

    α(Q) = α(r) + α(x)i + α(y)j + α(z)k,

where Q is defined in (1) and α(·) is a nonlinear function such as tanh or ReLU.
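As an illustration of the parameter saving described above, the sketch below (ours, not the authors' code; the function name is hypothetical) builds the real block matrix of (3) from the four free weight blocks W_r, W_x, W_y, W_z and applies it to the component-concatenated input. For a real feature dimension d, each block is (d/4) × (d/4), so a Quaternion layer stores d²/4 weights instead of the d² of a real layer.

```python
import numpy as np

def quaternion_linear(inputs, w_r, w_x, w_y, w_z):
    """Quaternion feed-forward (linear) layer written as the real block matrix of Eq. (3).

    inputs: array of shape (..., d) holding [r; x; y; z] concatenated, with d divisible by 4.
    w_r, w_x, w_y, w_z: the four shared weight blocks, each of shape (d // 4, d // 4).
    Returns an array of shape (..., d) holding the components of W ⊗ Q.
    """
    # Arrange the 4 shared blocks in the Hamilton-product pattern of Eq. (3).
    big_w = np.block([
        [w_r, -w_x, -w_y, -w_z],
        [w_x,  w_r, -w_z,  w_y],
        [w_y,  w_z,  w_r, -w_x],
        [w_z, -w_y,  w_x,  w_r],
    ])  # shape (d, d), but only d*d/4 distinct parameters
    return inputs @ big_w.T

d = 8  # real dimension; each quaternion component has size d // 4 = 2
rng = np.random.default_rng(0)
w_r, w_x, w_y, w_z = (rng.normal(size=(d // 4, d // 4)) for _ in range(4))
q = rng.normal(size=(3, d))           # a batch of 3 quaternion-valued inputs
out = quaternion_linear(q, w_r, w_x, w_y, w_z)
print(out.shape)                      # (3, 8)
print(4 * (d // 4) ** 2, "params vs", d * d, "for a real layer")  # 16 vs 64
```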
3.2 Quaternion Attention

Next, we propose a Quaternion attention model to compute attention and alignment between two sequences. Let A ∈ H^{ℓ_a × d} and B ∈ H^{ℓ_b × d} be input word sequences, where ℓ_a, ℓ_b are the numbers of tokens in each sequence and d is the dimension of each input vector. We first compute:

    E = A ⊗ B^T,

where E ∈ H^{ℓ_a × ℓ_b}. We apply Softmax(·) to E component-wise:

    G = ComponentSoftmax(E)
    B' = G_R B_R + G_X B_X i + G_Y B_Y j + G_Z B_Z k,

where G and B with subscripts represent the real and imaginary components of G and B. Similarly, we perform the same on A, which is described as follows:

    F = ComponentSoftmax(E^T)
    A' = F_R A_R + F_X A_X i + F_Y A_Y j + F_Z A_Z k,

where A' is the aligned representation of B and B' is the aligned representation of A. Next, given A' ∈ R^{ℓ_b × d}, B' ∈ R^{ℓ_a × d}, we then compute and compare the learned alignments:

    C1 = Σ_i QFFN([A'_i; B_i; A'_i ⊗ B_i; A'_i − B_i])
    C2 = Σ_i QFFN([B'_i; A_i; B'_i ⊗ A_i; B'_i − A_i]),

where QFFN(·) is a Quaternion feed-forward layer with nonlinearity, [;] is the component-wise concatenation operator, i refers to word positional indices, and Σ_i sums over the words in the sequence. Both outputs C1, C2 are then passed into

    Y = QFFN([C1; C2; C1 ⊗ C2; C1 − C2]),

where Y ∈ H is a Quaternion-valued output. In order to train our model end-to-end with real-valued losses, we concatenate each component and pass the result into a final linear layer for classification.
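A rough sketch of the alignment step above (our own illustration, continuing the NumPy conventions of the earlier snippets; `component_attention` is a hypothetical name): attention scores are computed with a Hamilton product between quaternion-valued sequences, softmax is applied separately to each of the four score components, and each component of B is aligned with its own weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def component_attention(A, B):
    """Quaternion attention sketch.

    A, B: arrays of shape (4, len_a, d) and (4, len_b, d); axis 0 indexes the
    quaternion components (r, x, y, z) of each token vector.
    Returns B_aligned of shape (4, len_a, d): for every token of A, a weighted
    sum over B, computed per component as in the ComponentSoftmax step.
    """
    Ar, Ax, Ay, Az = A
    Br, Bx, By, Bz = B
    # E = A ⊗ B^T: four (len_a, len_b) score matrices via the Hamilton product.
    Er = Ar @ Br.T - Ax @ Bx.T - Ay @ By.T - Az @ Bz.T
    Ex = Ax @ Br.T + Ar @ Bx.T - Az @ By.T + Ay @ Bz.T
    Ey = Ay @ Br.T + Az @ Bx.T + Ar @ By.T - Ax @ Bz.T
    Ez = Az @ Br.T - Ay @ Bx.T + Ax @ By.T + Ar @ Bz.T
    # G = ComponentSoftmax(E): softmax over the B dimension, per component.
    Gr, Gx, Gy, Gz = (softmax(E, axis=-1) for E in (Er, Ex, Ey, Ez))
    # B' = G_R B_R + G_X B_X i + G_Y B_Y j + G_Z B_Z k (component-wise alignment).
    return np.stack([Gr @ Br, Gx @ Bx, Gy @ By, Gz @ Bz])

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 5, 8))   # 5 tokens, component size 8
B = rng.normal(size=(4, 7, 8))   # 7 tokens
print(component_attention(A, B).shape)  # (4, 5, 8)
```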
3.3 Quaternion Transformer

This section describes our Quaternion adaptation of Transformer networks. The Transformer (Vaswani et al., 2017) can be considered state-of-the-art across many NLP tasks. Transformer networks are characterized by stacked layers of linear transforms along with its signature self-attention mechanism. For the sake of brevity, we outline the specific changes we make to the Transformer model.

Quaternion Self-Attention The standard self-attention mechanism considers the following:

    A = softmax(QK^T / √d_k) V,

where Q, K, V are traditionally learned via linear transforms from the input X. The key idea here is that we replace this linear transform with a Quaternion transform:

    Q = W_q ⊗ X;  K = W_k ⊗ X;  V = W_v ⊗ X,

where ⊗ is the Hamilton product and X is the input Quaternion representation of the layer. In this case, since computation is performed in Quaternion space, the parameters of W are effectively reduced by 75%. Similarly, the computation of self-attention also relies on Hamilton products. The revised Quaternion self-attention is defined as follows:

    A = ComponentSoftmax(Q ⊗ K / √d_k) V.    (4)

Note that in (4), Q ⊗ K returns four ℓ × ℓ matrices (attention weights), one for each component (r, i, j, k). Softmax is applied component-wise, along with multiplication with V, which is multiplied in similar fashion to the Quaternion attention model. Note that the Hamilton product in the self-attention itself does not change the parameter size of the network.

Quaternion Transformer Block Aside from the linear transformations for forming queries, keys, and values, Transformers also contain position-wise feed-forward networks with ReLU activations. Similarly, we replace these feed-forward connections (FFNs) with Quaternion FFNs. We denote this as Quaternion Transformer (full), while denoting the model that only uses Quaternion FFNs in the self-attention as (partial). Finally, the remainder of the Transformer network remains identical to the original design (Vaswani et al., 2017) in the sense that component-wise functions are applied unless specified above.

3.4 Embedding Layers

In the case where the word embedding layer is trained from scratch (i.e., using byte-pair encoding in machine translation), we treat each embedding as the concatenation of its four components. In the case where pre-trained embeddings such as GloVe (Pennington et al., 2014) are used, a nonlinear transform is used to project the embeddings into Quaternion space.

3.5 Connection to Real Components

A vast majority of neural components in the deep learning arsenal operate in real space. As such, it would be beneficial for our Quaternion-inspired components to interface seamlessly with these components. If the input to a Quaternion module (such as a Quaternion FFN or attention module) is real-valued, we simply treat it as a concatenation of components r, x, y, z. Similarly, the output of a Quaternion module, if passed to a real-valued layer, is treated as [r; x; y; z], where [;] is the concatenation operator.

Output Layer and Loss Functions To train our model, we simply concatenate all r, i, j, k components into a single vector at the final output layer. For example, for classification, the final Softmax output is defined as follows:

    Y = Softmax(W([r; x; y; z]) + b),

where Y ∈ R^{|C|}, |C| is the number of classes, and x, y, z are the imaginary components. Similarly, for sequence loss (for sequence transduction problems), the same can also be done.
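The interface with real-valued layers described in Section 3.5 and the output layer both amount to splitting and re-concatenating along the feature dimension. Below is a small sketch of this convention (ours; the helper names are hypothetical).

```python
import numpy as np

def real_to_quaternion(x):
    """Treat a real-valued input of shape (..., d) as the concatenation [r; x; y; z]
    and return the four components, each of shape (..., d // 4)."""
    return np.split(x, 4, axis=-1)

def quaternion_to_real(r, x, y, z):
    """Concatenate the components back into a real-valued vector [r; x; y; z]."""
    return np.concatenate([r, x, y, z], axis=-1)

def classify(r, x, y, z, W, b):
    """Output layer: concatenate all components, then a real affine map + softmax."""
    logits = quaternion_to_real(r, x, y, z) @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
h = rng.normal(size=(3, 16))                  # real-valued hidden states
r, xc, yc, zc = real_to_quaternion(h)         # feed into a quaternion module
W, b = rng.normal(size=(16, 5)), np.zeros(5)  # |C| = 5 classes
print(classify(r, xc, yc, zc, W, b).shape)    # (3, 5)
```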
Parameter Initialization It is intuitive that specialized initialization schemes ought to be devised for Quaternion representations and their modules (Parcollet et al., 2018b,a):

    w = |w| (cos(θ) + q̂_imag sin(θ)),

where q̂_imag is the normalized imaginary Quaternion constructed by sampling uniformly at random from [0, 1], and θ is sampled randomly and uniformly from [−π, π]. However, our early experiments show that, at least within the context of NLP applications, this initialization performed comparably to or worse than the standard Glorot initialization. Hence, we opt to initialize all components independently with Glorot initialization.
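For completeness, here is a sketch of our reading of the Quaternion-specific scheme above (which the authors ultimately did not adopt). The text does not specify how the magnitude |w| is drawn, so sampling it Glorot-style below is an assumption, as is the function name.

```python
import numpy as np

def quaternion_init(shape, rng, fan_in, fan_out):
    """Quaternion-style initialization sketch: w = |w| (cos θ + q̂_imag sin θ).

    The magnitude |w| is drawn Glorot-style here (an assumption; the paper does not
    spell out this detail). Returns the four components (w_r, w_x, w_y, w_z)."""
    # Normalized purely-imaginary quaternion q̂_imag, components sampled from [0, 1].
    qx, qy, qz = rng.uniform(0.0, 1.0, size=(3,) + shape)
    norm = np.sqrt(qx**2 + qy**2 + qz**2) + 1e-8
    qx, qy, qz = qx / norm, qy / norm, qz / norm
    # Angle θ sampled uniformly from [-π, π]; magnitude |w| Glorot-style (assumed).
    theta = rng.uniform(-np.pi, np.pi, size=shape)
    magnitude = rng.uniform(-1.0, 1.0, size=shape) * np.sqrt(6.0 / (fan_in + fan_out))
    w_r = magnitude * np.cos(theta)
    w_x = magnitude * qx * np.sin(theta)
    w_y = magnitude * qy * np.sin(theta)
    w_z = magnitude * qz * np.sin(theta)
    return w_r, w_x, w_y, w_z

rng = np.random.default_rng(3)
w_r, w_x, w_y, w_z = quaternion_init((64, 64), rng, fan_in=64, fan_out=64)
print(w_r.shape, w_x.shape)  # (64, 64) (64, 64)
```

In practice, the paper falls back to the standard Glorot initialization applied independently to each of the four components.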
4 Experiments

This section describes our experimental setup across multiple diverse NLP tasks. All experiments were run on NVIDIA Titan X hardware.

Our Models On pairwise text classification, we benchmark the Quaternion attention model (Q-Att), testing the ability of Quaternion models on pairwise representation learning. On all the other tasks, such as machine translation and subject-verb agreement, we evaluate Quaternion Transformers. We evaluate two variations of Transformers, full and partial. The full setting converts all linear transformations into Quaternion space and is approximately 25% of the actual Transformer size. The second setting (partial) only reduces the linear transforms at the self-attention mechanism. Tensor2Tensor¹ is used for Transformer benchmarks, with its default hyperparameters and encoding for all experiments.

4.1 Pairwise Text Classification

We evaluate our proposed Quaternion attention (Q-Att) model on pairwise text classification tasks. This task involves predicting a label or ranking score for sentence pairs. We use a total of seven data sets from problem domains such as:

• Natural language inference (NLI) - This task is concerned with determining if two sentences entail or contradict each other. We use SNLI (Bowman et al., 2015), SciTail (Khot et al., 2018), and MNLI (Williams et al., 2017) as benchmark data sets.

• Question answering (QA) - This task involves learning to rank question-answer pairs. We use WikiQA (Yang et al., 2015), which comprises QA pairs from Bing Search.

• Paraphrase detection - This task involves detecting if two sentences are paraphrases of each other. We use the Tweets (Lan et al., 2017) data set and the Quora paraphrase data set (Wang et al., 2017).

• Dialogue response selection - This is a response selection (RS) task that tries to select the best response given a message. We use the Ubuntu dialogue corpus, UDC (Lowe et al., 2015).

Implementation Details We implement Q-Att in TensorFlow (Abadi et al., 2016), along with the Decomposable Attention baseline (Parikh et al., 2016). Both models optimize the cross entropy loss (e.g., binary cross entropy for ranking tasks such as WikiQA and Ubuntu). Models are optimized with Adam, with the learning rate tuned amongst {0.001, 0.0003} and the batch size tuned amongst {32, 64}. Embeddings are initialized with GloVe (Pennington et al., 2014). For Q-Att, we use an additional transform layer to project the pre-trained embeddings into Quaternion space. The measures used are generally the accuracy measure (for NLI and paraphrase tasks) and ranking measures (MAP/MRR/Top-1) for ranking tasks (WikiQA and Ubuntu).

Baselines and Comparison We use the Decomposable Attention model as a baseline, adding [a_i; b_i; a_i ⊙ b_i; a_i − b_i] before the compare² layers, since we found this simple modification to increase performance. This also enables fair comparison with our variation of Quaternion attention, which uses the Hamilton product instead of element-wise multiplication. We denote this baseline as DeAtt. We evaluate at a fixed representation size of d = 200 (equivalent to d = 50 in Quaternion space). We also include comparisons at equal parameterization (d = 50 and approximately 200K parameters) to observe the effect of Quaternion representations. Our selection of DeAtt is owing to its simplicity and ease of comparison. We defer the prospect of Quaternion variations of more advanced models (Chen et al., 2016; Tay et al., 2017b) to future work.

¹ https://github.com/tensorflow/tensor2tensor
² This follows the matching function of (Chen et al., 2016).
Task (Measure)   | NLI (Accuracy)             | QA (MAP/MRR) | Paraphrase (Accuracy) | RS (Top-1) |
Model            | SNLI   SciTail   MNLI      | WikiQA       | Tweet    Quora        | UDC        | # Params
DeAtt (d = 50)   | 83.4   73.8      69.9/70.9 | 66.0/67.1    | 77.8     82.2         | 48.7       | 200K
DeAtt (d = 200)  | 86.2   79.0      73.6/73.9 | 67.2/68.3    | 80.0     85.4         | 51.8       | 700K
Q-Att (d = 50)   | 85.4   79.6      72.3/72.9 | 66.2/68.1    | 80.1     84.1         | 51.5       | 200K (−71%)

Table 1: Experimental results on pairwise text classification and ranking tasks. Q-Att achieves comparable or competitive results compared with DeAtt with approximately one third of the parameter cost.

Model                            | IMDb         | SST          | # Params
Transformer                      | 82.6         | 78.9         | 400K
Quaternion Transformer (full)    | 83.9 (+1.3%) | 80.5 (+1.6%) | 100K (−75.0%)
Quaternion Transformer (partial) | 83.6 (+1.0%) | 81.4 (+2.5%) | 300K (−25.0%)

Table 2: Experimental results on sentiment analysis on the IMDb and Stanford Sentiment Treebank (SST) data sets. The evaluation measure is accuracy.

Results Table 1 reports results on seven different and diverse data sets. We observe that a tiny Q-Att model (d = 50) achieves comparable (or occasionally marginally better or worse) performance compared to DeAtt (d = 200), gaining a 68% parameter saving. The results actually improve on certain data sets (2/7) and are comparable (often less than a percentage point difference) on the rest compared with the d = 200 DeAtt model. Moreover, we scaled the parameter size of the DeAtt model down to be similar to the Q-Att model and found that the performance degrades quite significantly (about 2%-3% lower on all data sets). This demonstrates the quality and benefit of learning with Quaternion space.

4.2 Sentiment Analysis

We evaluate on the task of document-level sentiment analysis, which is a binary classification problem.

Implementation Details We compare our proposed Quaternion Transformer against the vanilla Transformer. In this experiment, we use the tiny Transformer setting in Tensor2Tensor with a vocab size of 8K. We use two data sets, namely IMDb (Maas et al., 2011) and the Stanford Sentiment Treebank (SST) (Socher et al., 2013).
Results Table 2 reports results on the sentiment classification task on IMDb and SST. We observe that both the full and partial variations of the Quaternion Transformer outperform the base Transformer. Quaternion Transformer (partial) obtains a +1.0% lead over the vanilla Transformer on IMDb and +2.5% on SST, while having a 24.5% saving in parameter cost. Finally, the full Quaternion version leads by +1.3%/+1.6% on IMDb and SST respectively while maintaining a 75% reduction in parameter cost. This supports our core hypothesis of improving accuracy while saving parameter costs.

4.3 Neural Machine Translation

We evaluate our proposed Quaternion Transformer against the vanilla Transformer on three data sets for the neural machine translation (NMT) task. More concretely, we evaluate on IWSLT 2015 English-Vietnamese (En-Vi), WMT 2016 English-Romanian (En-Ro) and WMT 2018 English-Estonian (En-Et). We also include results on the standard WMT English-German (En-De) benchmark.

Implementation Details We implement the models in Tensor2Tensor and train both for 50K steps. We use the default base single-GPU hyperparameter setting for both models and apply checkpoint averaging. Note that our goal is not to obtain state-of-the-art models but to fairly and systematically evaluate both vanilla and Quaternion Transformers.
Model                            | IWSLT'15 En-Vi (BLEU) | WMT'16 En-Ro (BLEU) | WMT'18 En-Et (BLEU) | # Params
Transformer Base                 | 28.4                  | 22.8                | 14.1                | 44M
Quaternion Transformer (full)    | 28.0                  | 18.5                | 13.1                | 11M (−75%)
Quaternion Transformer (partial) | 30.9                  | 22.7                | 14.2                | 29M (−32%)

Table 3: Experimental results on neural machine translation (NMT): results of Transformer Base and Quaternion Transformers on En-Vi (IWSLT 2015), En-Ro (WMT 2016) and En-Et (WMT 2018). Parameter size excludes word embeddings. Our proposed Quaternion Transformer achieves comparable or higher performance with only 67.9% of the parameter cost of the base Transformer model.

Results Table 3 reports the results on neural machine translation. On the IWSLT'15 En-Vi data set, the partial adaptation of the Quaternion Transformer outperforms (+2.5%) the base Transformer with a 32% reduction in parameter cost. The full adaptation, on the other hand, comes close (−0.4%) with a 75% reduction in parameter cost. On the WMT'16 En-Ro data set, Quaternion Transformers do not outperform the base Transformer. We observe a −0.1% degradation in performance for the partial adaptation and a −4.3% degradation for the full adaptation of the Quaternion Transformer. However, we note that the drop in performance relative to the parameter savings is still quite decent, e.g., saving 32% of parameters for a drop of only 0.1 BLEU points; the full adaptation loses out comparatively. On the WMT'18 En-Et data set, the partial adaptation achieves the best result with 32% fewer parameters. The full adaptation, comparatively, only loses 1.0 BLEU point to the original Transformer while saving 75% of parameters.

WMT English-German Notably, the Quaternion Transformer achieves a BLEU score of 26.42/25.14 for the partial/full settings respectively on the standard WMT 2014 En-De benchmark. This is using a single GPU trained for 1M steps with a batch size of 8192. We note that these results do not differ much from other single-GPU runs (i.e., 26.07 BLEU) on this data set (Nguyen and Joty, 2019).

4.4 Mathematical Language Understanding

We include evaluations on the newly released mathematical language understanding (MLU) data set (Wangperawong, 2018). This data set is a character-level transduction task that aims to test a model's compositional reasoning capabilities. For example, given an input x = 85, y = −523, x ∗ y, the model strives to decode an output of −44455. Several variations of these problems exist, mainly switching and introducing new mathematical operators.

Implementation Details We train the Quaternion Transformer for 100K steps using the default Tensor2Tensor setting following the original work (Wangperawong, 2018). We use the tiny hyperparameter setting. Similar to NMT, we report both full and partial adaptations of Quaternion Transformers. Baselines are reported from the original work as well, which includes comparisons with Universal Transformers (Dehghani et al., 2018) and Adaptive Computation Time (ACT) Universal Transformers. The evaluation measure is accuracy per sequence, which counts a generated sequence as correct if and only if the entire sequence is an exact match.

Results Table 4 reports our experimental results on the MLU data set. We observe a modest +7.8% accuracy gain when using the Quaternion Transformer (partial) while saving 24.5% in parameter costs. The Quaternion Transformer outperforms the Universal Transformer and is marginally outperformed by the Adaptive Computation Universal Transformer (ACT U-Transformer) by 0.5%. On the other hand, a full Quaternion Transformer still outperforms the base Transformer (+2.8%) with a 75% parameter saving.
4.5 Subject-Verb Agreement

Additionally, we compare our Quaternion Transformer on the subject-verb agreement task (Linzen et al., 2016). The task is a binary classification problem: determining whether a sentence, e.g., 'The keys to the cabinet ____.', is followed by a plural or singular verb.

Implementation We use the Tensor2Tensor framework, training the Transformer and Quaternion Transformer with the tiny hyperparameter setting for 10K steps.
Model                            | Acc / Seq    | # Params
Universal Transformer            | 78.8         | -
ACT U-Transformer                | 84.9         | -
Transformer                      | 76.1         | 400K
Quaternion Transformer (full)    | 78.9 (+2.8%) | 100K (−75%)
Quaternion Transformer (partial) | 84.4 (+8.3%) | 300K (−25%)

Table 4: Experimental results on mathematical language understanding (MLU). Both Quaternion models outperform the base Transformer model with up to 75% parameter savings.

Model                 | Acc  | # Params
Transformer           | 94.8 | 400K
Quaternion (full)     | 94.7 | 100K
Quaternion (partial)  | 95.5 | 300K

Table 5: Experimental results on the subject-verb agreement (SVA) number prediction task.

Results Table 5 reports the results on the SVA task. Results show that Quaternion Transformers perform equally to (or better than) vanilla Transformers. On this task, the partial adaptation performs better, improving on the Transformer by +0.7% accuracy while saving 25% of parameters.

5 Related Work

The goal of learning effective representations lives at the heart of deep learning research. While most neural architectures for NLP have mainly explored the usage of real-valued representations (Vaswani et al., 2017; Bahdanau et al., 2014; Parikh et al., 2016), there has also been emerging interest in complex (Danihelka et al., 2016; Arjovsky et al., 2016; Gaudet and Maida, 2017) and hypercomplex representations (Parcollet et al., 2018b,a; Gaudet and Maida, 2017).

Notably, progress on Quaternion and hypercomplex representations for deep learning is still in its infancy and, consequently, most works on this topic are very recent. Gaudet and Maida (2017) proposed deep Quaternion networks for image classification, introducing basic tools such as Quaternion batch normalization and Quaternion initialization. In a similar vein, Quaternion RNNs and CNNs were proposed for speech recognition (Parcollet et al., 2018a,b). In parallel, Zhu et al. (2018) proposed Quaternion CNNs and applied them to image classification and denoising tasks. Comminiello et al. (2018) proposed Quaternion CNNs for sound detection. Zhang et al. (2019a) proposed Quaternion embeddings of knowledge graphs, and Zhang et al. (2019b) proposed Quaternion representations for collaborative filtering. A common theme is that Quaternion representations are helpful and provide utility over real-valued representations.
The interest in non-real spaces can be attributed to several factors. Firstly, complex weight matrices used to parameterize RNNs help to combat vanishing gradients (Arjovsky et al., 2016). On the other hand, complex spaces are also intuitively linked to associative composition, along with holographic reduced representations (Plate, 1991; Nickel et al., 2016; Tay et al., 2017a). Asymmetry has also demonstrated utility in domains such as relational learning (Trouillon et al., 2016; Nickel et al., 2016) and question answering (Tay et al., 2018). Complex networks (Trabelsi et al., 2017), in general, have also demonstrated promise over real networks.

In a similar vein, the hypercomplex Hamilton product provides a greater extent of expressiveness, similar to the complex Hermitian product, albeit with a 4-fold increase in interactions between real and imaginary components. In the case of Quaternion representations, due to the parameter saving in the Hamilton product, models also enjoy a 75% reduction in parameter size.

Our work draws important links to multi-head (Vaswani et al., 2017) or multi-sense (Li and Jurafsky, 2015; Neelakantan et al., 2015) representations that are highly popular in NLP research. Intuitively, the four-component structure of Quaternion representations can also be interpreted as a kind of multi-headed architecture. The key difference is that the basic operators (e.g., the Hamilton product) provide an inductive bias that encourages interactions between these components. Notably, the idea of splitting vectors has also been explored (Daniluk et al., 2017), which is in similar spirit to breaking a vector into four components.
6 Conclusion

This paper advocates for lightweight and efficient neural NLP via Quaternion representations. More concretely, we proposed two models: the Quaternion attention model and the Quaternion Transformer. We evaluated these models on eight different NLP tasks and a total of thirteen data sets. Across all data sets, the Quaternion models achieve comparable performance while reducing parameter size. All in all, we demonstrated the utility and benefits of incorporating Quaternion algebra in state-of-the-art neural models. We believe that this direction paves the way for more efficient and effective representation learning in NLP. Our Tensor2Tensor implementation of Quaternion Transformers will be released at https://github.com/vanzytay/QuaternionTransformers.

7 Acknowledgements

The authors thank the anonymous reviewers of ACL 2019 for their time, feedback and comments.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Martin Arjovsky, Amar Shah, and Yoshua Bengio. 2016. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2016. Enhanced LSTM for natural language inference. arXiv preprint arXiv:1609.06038.

Danilo Comminiello, Marco Lella, Simone Scardapane, and Aurelio Uncini. 2018. Quaternion convolutional neural networks for detection and localization of 3D sound events. arXiv preprint arXiv:1812.06811.
Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. 2016. Associative long short-term memory. arXiv preprint arXiv:1602.03032.

Michał Daniluk, Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. 2017. Frustratingly short attention spans in neural language modeling. arXiv preprint arXiv:1702.04521.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2018. Universal transformers. arXiv preprint arXiv:1807.03819.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Chase Gaudet and Anthony Maida. 2017. Deep quaternion networks. arXiv preprint arXiv:1712.04604.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In Thirty-Second AAAI Conference on Artificial Intelligence.

Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. arXiv preprint arXiv:1708.00391.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4:521–535.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1015.
Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2015. Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654.

Phi Xuan Nguyen and Shafiq Joty. 2019. Phrase-based attentions. https://openreview.net/forum?id=r1xN5oA5tm.

Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. 2016. Holographic embeddings of knowledge graphs. In Thirtieth AAAI Conference on Artificial Intelligence.

Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Chiheb Trabelsi, Renato De Mori, and Yoshua Bengio. 2018a. Quaternion recurrent neural networks. arXiv preprint arXiv:1806.04418.

Titouan Parcollet, Ying Zhang, Mohamed Morchid, Chiheb Trabelsi, Georges Linarès, Renato De Mori, and Yoshua Bengio. 2018b. Quaternion convolutional neural networks for end-to-end automatic speech recognition. arXiv preprint arXiv:1806.07789.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

Dario Pavllo, David Grangier, and Michael Auli. 2018. QuaterNet: A quaternion-based recurrent model for human motion. arXiv preprint arXiv:1805.06485.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Tony Plate. 1991. Holographic reduced representations: Convolution algebra for compositional distributed representations.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018. Hermitian co-attention networks for text matching in asymmetrical domains.

Yi Tay, Minh C. Phan, Luu Anh Tuan, and Siu Cheung Hui. 2017a. Learning to rank question answer pairs with holographic dual LSTM architecture. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 695–704. ACM.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2017b. Compare, compress and propagate: Enhancing neural architectures with alignment factorization for natural language inference. arXiv preprint arXiv:1801.00102.

Chiheb Trabelsi, Olexa Bilaniuk, Ying Zhang, Dmitriy Serdyuk, Sandeep Subramanian, João Felipe Santos, Soroush Mehri, Negar Rostamzadeh, Yoshua Bengio, and Christopher J. Pal. 2017. Deep complex networks. arXiv preprint arXiv:1705.09792.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814.

Artit Wangperawong. 2018. Attending to mathematical language with transformers. CoRR abs/1812.02825. http://arxiv.org/abs/1812.02825.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018.

Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. 2019a. Quaternion knowledge graph embeddings. arXiv preprint arXiv:1904.10281.

Shuai Zhang, Lina Yao, Lucas Vinh Tran, Aston Zhang, and Yi Tay. 2019b. Quaternion collaborative filtering for recommendation. arXiv preprint arXiv:1906.02594.
Xuanyu Zhu, Yi Xu, Hongteng Xu, and Changjian Chen. 2018. Quaternion convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–647.