Graph Convolutional Encoders for Syntax-aware Neural Machine Translation
Jasmijn Bastings¹  Ivan Titov¹,²  Wilker Aziz¹  Diego Marcheggiani¹  Khalil Sima'an¹
¹ ILLC, University of Amsterdam   ² ILCC, University of Edinburgh
{bastings,titov,w.aziz,marcheggiani,k.simaan}@uva.nl

arXiv:1704.04675v4 [cs.CL] 18 Jun 2020

Abstract

We present a simple and effective approach to incorporating syntactic structure into neural attention-based encoder-decoder models for machine translation. We rely on graph-convolutional networks (GCNs), a recent class of neural networks developed for modeling graph-structured data. Our GCNs use predicted syntactic dependency trees of source sentences to produce representations of words (i.e. hidden states of the encoder) that are sensitive to their syntactic neighborhoods. GCNs take word representations as input and produce word representations as output, so they can easily be incorporated as layers into standard encoders (e.g., on top of bidirectional RNNs or convolutional neural networks). We evaluate their effectiveness with English-German and English-Czech translation experiments for different types of encoders and observe substantial improvements over their syntax-agnostic versions in all the considered setups.

1 Introduction

Neural machine translation (NMT) is one of the success stories of deep learning in natural language processing, with recent NMT systems outperforming traditional phrase-based approaches on many language pairs (Sennrich et al., 2016a). State-of-the-art NMT systems rely on sequential encoder-decoders (Sutskever et al., 2014; Bahdanau et al., 2015) and lack any explicit modeling of syntax or any hierarchical structure of language. One potential reason why we have not seen much benefit from using syntactic information in NMT is the lack of simple and effective methods for incorporating structured information in neural encoders, including RNNs. Despite some successes, techniques explored so far either incorporate syntactic information in NMT models in a relatively indirect way (e.g., multi-task learning (Luong et al., 2015a; Nadejde et al., 2017; Eriguchi et al., 2017; Hashimoto and Tsuruoka, 2017)) or may be too restrictive in modeling the interface between syntax and the translation task (e.g., learning representations of linguistic phrases (Eriguchi et al., 2016)). Our goal is to provide the encoder with access to rich syntactic information but let it decide which aspects of syntax are beneficial for MT, without placing rigid constraints on the interaction between syntax and the translation task. This goal is in line with claims that rigid syntactic constraints typically hurt MT (Zollmann and Venugopal, 2006; Smith and Eisner, 2006; Chiang, 2010), and, though these claims have been made in the context of traditional MT systems, we believe they are no less valid for NMT.

Attention-based NMT systems (Bahdanau et al., 2015; Luong et al., 2015b) represent source sentence words as latent-feature vectors in the encoder and use these vectors when generating a translation. Our goal is to automatically incorporate information about the syntactic neighborhoods of source words into these feature vectors and thus potentially improve the quality of the translation output. Since vectors correspond to words, it is natural for us to use dependency syntax. Dependency trees (see Figure 1) represent syntactic relations between words: for example, monkey is the subject of the predicate eats, and banana is its object.
Figure 1: A dependency tree for the example sentence: "The monkey eats a banana." (arcs labeled det, nsubj, det, and dobj)

In order to produce syntax-aware feature representations of words, we exploit graph convolutional networks (GCNs) (Duvenaud et al., 2015; Defferrard et al., 2016; Kearnes et al., 2016; Kipf and Welling, 2016). GCNs can be regarded as computing a latent-feature representation of a node (i.e. a real-valued vector) based on its k-th order neighborhood (i.e. nodes at most k hops away from the node) (Gilmer et al., 2017). They are generally simple and computationally inexpensive. We use syntactic GCNs, a version of GCN operating on top of syntactic dependency trees, recently shown effective in the context of semantic role labeling (Marcheggiani and Titov, 2017).

Since syntactic GCNs produce representations at the word level, it is straightforward to use them as encoders within the attention-based encoder-decoder framework. As NMT systems are trained end-to-end, GCNs end up capturing syntactic properties specifically relevant to the translation task. Though GCNs can take word embeddings as input, we will see that they are more effective when used as layers on top of recurrent neural network (RNN) or convolutional neural network (CNN) encoders (Gehring et al., 2016), enriching their states with syntactic information.

A comparison to RNNs is the most challenging test for GCNs, as it has been shown that RNNs (e.g., LSTMs) are able to capture certain syntactic phenomena (e.g., subject-verb agreement) reasonably well on their own, without explicit treebank supervision (Linzen et al., 2016; Shi et al., 2016). Nevertheless, GCNs appear beneficial even in this challenging setup: we obtain +1.2 and +0.7 BLEU point improvements from using syntactic GCNs on top of bidirectional RNNs for English-German and English-Czech, respectively.

In principle, GCNs are flexible enough to incorporate any linguistic structure that can be represented as a graph (e.g., dependency-based semantic role labeling representations (Surdeanu et al., 2008), AMR semantic graphs (Banarescu et al., 2012), and co-reference chains). For example, unlike recursive neural networks (Socher et al., 2013), GCNs do not require the graphs to be trees. However, in this work we solely focus on dependency syntax and leave more general investigation for future work.

Our main contributions can be summarized as follows:

• we introduce a method for incorporating structure into NMT using syntactic GCNs;
• we show that GCNs can be used along with RNN and CNN encoders;
• we show that incorporating structure is beneficial for machine translation on English-Czech and English-German.

2 Background

Notation. We use x for vectors, x_{1:t} for a sequence of t vectors, and X for matrices. The i-th value of vector x is denoted by x_i. We use ◦ for vector concatenation.

2.1 Neural Machine Translation

In NMT (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014b), given example translation pairs from a parallel corpus, a neural network is trained to directly estimate the conditional distribution p(y_{1:Ty} | x_{1:Tx}) of translating a source sentence x_{1:Tx} (a sequence of Tx words) into a target sentence y_{1:Ty}. NMT models typically consist of an encoder, a decoder, and some method for conditioning the decoder on the encoder, for example, an attention mechanism. We will now briefly describe the components that we use in this paper.

2.1.1 Encoders

An encoder is a function that takes as input the source sentence and produces a representation encoding its semantic content. We describe recurrent, convolutional and bag-of-words encoders.

Recurrent. Recurrent neural networks (RNNs) (Elman, 1990) model sequential data. They receive one input vector at each time step and update their hidden state to summarize all inputs up to that point. Given an input sequence x_{1:Tx} = x_1, x_2, ..., x_{Tx} of word embeddings, an RNN is defined recursively as follows:

    RNN(x_{1:t}) = f(x_t, RNN(x_{1:t-1}))

where f is a nonlinear function such as an LSTM (Hochreiter and Schmidhuber, 1997) or a GRU (Cho et al., 2014b). We will use the function RNN as an abstract mapping from an input sequence x_{1:t} to its final hidden state RNN(x_{1:t}), regardless of the nonlinearity used. To not only summarize the past of a word, but also its future, a bidirectional RNN (Schuster and Paliwal, 1997; Irsoy and Cardie, 2014) is often used.
A bidirectional RNN reads the input sentence in two directions and then concatenates the states for each time step:

    BiRNN(x_{1:Tx}, t) = RNN_F(x_{1:t}) ◦ RNN_B(x_{Tx:t})

where RNN_F and RNN_B are the forward and backward RNNs, respectively. For further details we refer to the encoder of Bahdanau et al. (2015).

Convolutional. Convolutional neural networks (CNNs) apply a fixed-size window over the input sequence to capture the local context of each word (Gehring et al., 2016). One advantage of this approach over RNNs is that it allows for fast parallel computation, while sacrificing non-local context. To remedy the loss of context, multiple CNN layers can be stacked. Formally, given an input sequence x_{1:Tx}, we define a CNN as follows:

    CNN(x_{1:Tx}, t) = f(x_{t-⌊w/2⌋}, ..., x_t, ..., x_{t+⌊w/2⌋})

where f is a nonlinear function, typically a linear transformation followed by a ReLU, and w is the size of the window.

Bag-of-Words. In a bag-of-words (BoW) encoder every word is simply represented by its word embedding. To give the decoder some sense of word position, position embeddings (PE) may be added. There are different strategies for defining position embeddings, and in this paper we choose to learn a vector for each absolute word position up to a certain maximum length. We then represent the t-th word in a sequence as follows:

    BoW(x_{1:Tx}, t) = x_t + p_t

where x_t is the word embedding and p_t is the t-th position embedding.
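To make the three encoder variants concrete, here is a small NumPy sketch (purely illustrative: the toy dimensions, the tanh recurrence standing in for a GRU/LSTM, and all variable names are our own, not the paper's Neural Monkey implementation) that computes BoW, CNN and BiRNN states as defined above.

    import numpy as np

    rng = np.random.default_rng(0)
    T, d, w = 6, 8, 5                      # sentence length, embedding size, CNN window

    X = rng.normal(size=(T, d))            # word embeddings x_1..x_T
    P = rng.normal(size=(T, d))            # learned absolute position embeddings p_1..p_T

    # Bag-of-words encoder: BoW(x_{1:T}, t) = x_t + p_t
    bow_states = X + P

    # Convolutional encoder: a window of w words around position t,
    # then a linear transformation followed by a ReLU.
    W_cnn = rng.normal(size=(w * d, d))
    pad = np.zeros((w // 2, d))
    Xp = np.vstack([pad, X, pad])          # *PAD* symbols at both ends
    cnn_states = np.maximum(0, np.stack([Xp[t:t + w].reshape(-1) @ W_cnn
                                         for t in range(T)]))

    # Bidirectional RNN encoder, with a plain tanh recurrence standing in for a GRU.
    W_in, W_hh = rng.normal(size=(d, d)), rng.normal(size=(d, d))

    def rnn(inputs):
        h, states = np.zeros(d), []
        for x in inputs:                   # RNN(x_{1:t}) = f(x_t, RNN(x_{1:t-1}))
            h = np.tanh(x @ W_in + h @ W_hh)
            states.append(h)
        return states

    fwd = rnn(X)                           # reads left to right
    bwd = rnn(X[::-1])[::-1]               # reads right to left, re-aligned
    birnn_states = np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

    print(bow_states.shape, cnn_states.shape, birnn_states.shape)  # (6, 8) (6, 8) (6, 16)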
2.1.2 Decoder

A decoder produces the target sentence conditioned on the representation of the source sentence induced by the encoder. In Bahdanau et al. (2015) the decoder is implemented as an RNN conditioned on an additional input c_i, the context vector, which is dynamically computed at each time step using an attention mechanism. The probability of a target word y_i is then a function of the decoder RNN state, the previous target word embedding, and the context vector. The model is trained end-to-end for maximum log-likelihood of the next target word given its context.

2.2 Graph Convolutional Networks

We will now describe the graph convolutional networks (GCNs) of Kipf and Welling (2016). For a comprehensive overview of alternative GCN architectures, see Gilmer et al. (2017).

A GCN is a multilayer neural network that operates directly on a graph, encoding information about the neighborhood of a node as a real-valued vector. In each GCN layer, information flows along the edges of the graph; in other words, each node receives messages from all its immediate neighbors. When multiple GCN layers are stacked, information about larger neighborhoods gets integrated. For example, in the second layer, a node will receive information from its immediate neighbors, but this information already includes information from their respective neighbors. By choosing the number of GCN layers, we regulate the distance the information travels: with k layers a node receives information from neighbors at most k hops away.

Formally, consider an undirected graph G = (V, E), where V is a set of n nodes and E is a set of edges. Every node is assumed to be connected to itself, i.e. ∀v ∈ V : (v, v) ∈ E. Now, let X ∈ R^{d×n} be a matrix containing all n nodes with their features, where d is the dimensionality of the feature vectors. In our case, X will contain word embeddings, but in general it can contain any kind of features. For a 1-layer GCN, the new node representations are computed as follows:

    h_v = ρ( Σ_{u ∈ N(v)} W x_u + b )

where W ∈ R^{d×d} is a weight matrix and b ∈ R^d a bias vector.[1] ρ is an activation function, e.g. a ReLU. N(v) is the set of neighbors of v, which we assume here to always include v itself. As stated before, to allow information to flow over multiple hops, we need to stack GCN layers. The recursive computation is as follows:

    h_v^{(j+1)} = ρ( Σ_{u ∈ N(v)} W^{(j)} h_u^{(j)} + b^{(j)} )

where j indexes the layer, and h_v^{(0)} = x_v.

[1] We dropped the normalization factor used by Kipf and Welling (2016), as it is not used in the syntactic GCNs of Marcheggiani and Titov (2017).
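As an illustration of this propagation rule, the following NumPy sketch (toy graph and made-up sizes, not the authors' code) applies two GCN layers to a small undirected graph with self-loops, without the Kipf and Welling normalization, as noted in the footnote.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, layers = 5, 4, 2                           # nodes, feature size, GCN depth

    # Undirected edges of a toy graph; every node is also connected to itself.
    edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
    A = np.eye(n)                                    # self-loops: (v, v) in E
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0

    X = rng.normal(size=(n, d))                      # node features (word embeddings)

    H = X
    for j in range(layers):
        W = rng.normal(size=(d, d)) * 0.1            # W^(j)
        b = np.zeros(d)                              # b^(j)
        # h_v^(j+1) = relu( sum_{u in N(v)} W h_u^(j) + b )
        H = np.maximum(0, A @ (H @ W) + b)

    print(H.shape)                                   # (5, 4): one vector per node

With two layers, each node's output already depends on its neighbors' neighbors, matching the "k hops with k layers" behavior described above.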
Figure 2: A 2-layer syntactic GCN on top of a convolutional encoder. Loop connections are depicted with dashed edges, syntactic ones with solid (dependents to heads) and dotted (heads to dependents) edges. Gates and some labels are omitted for clarity.

2.3 Syntactic GCNs

Marcheggiani and Titov (2017) generalize GCNs to operate on directed and labeled graphs.[2] This makes it possible to use linguistic structures such as dependency trees, where directionality and edge labels play an important role. They also integrate edge-wise gates which let the model regulate the contributions of individual dependency edges. We will briefly describe these modifications.

Directionality. In order to deal with the directionality of edges, separate weight matrices are used for incoming and outgoing edges. We follow the convention that in dependency trees heads point to their dependents, and thus outgoing edges are used for head-to-dependent connections, and incoming edges are used for dependent-to-head connections. Modifying the recursive computation for directionality, we arrive at:

    h_v^{(j+1)} = ρ( Σ_{u ∈ N(v)} W_{dir(u,v)}^{(j)} h_u^{(j)} + b_{dir(u,v)}^{(j)} )

where dir(u, v) selects the weight matrix associated with the directionality of the edge connecting u and v (i.e. W_IN^{(j)} for u-to-v, W_OUT^{(j)} for v-to-u, and W_LOOP^{(j)} for v-to-v). Note that self-loops are modeled separately, so there are now three times as many parameters as in a non-directional GCN.

Labels. Making the GCN sensitive to labels is straightforward given the above modifications for directionality. Instead of using separate matrices for each direction, separate matrices are now defined for each direction and label combination:

    h_v^{(j+1)} = ρ( Σ_{u ∈ N(v)} W_{lab(u,v)}^{(j)} h_u^{(j)} + b_{lab(u,v)}^{(j)} )

where we incorporate the directionality of an edge directly in its label. Importantly, to prevent over-parametrization, only the bias terms are made label-specific, in other words: W_{lab(u,v)} = W_{dir(u,v)}. The resulting syntactic GCN is illustrated in Figure 2 (shown on top of a CNN, as we will explain in the subsequent section).

Edge-wise gating. Syntactic GCNs also include gates, which can down-weight the contribution of individual edges. They also allow the model to deal with noisy predicted structure, i.e. to ignore potentially erroneous syntactic edges. For each edge, a scalar gate is calculated as follows:

    g_{u,v}^{(j)} = σ( h_u^{(j)} · ŵ_{dir(u,v)}^{(j)} + b̂_{lab(u,v)}^{(j)} )

where σ is the logistic sigmoid function, and ŵ_{dir(u,v)}^{(j)} ∈ R^d and b̂_{lab(u,v)}^{(j)} ∈ R are learned parameters for the gate. The computation becomes:

    h_v^{(j+1)} = ρ( Σ_{u ∈ N(v)} g_{u,v}^{(j)} ( W_{dir(u,v)}^{(j)} h_u^{(j)} + b_{lab(u,v)}^{(j)} ) )

[2] For an alternative approach to integrating labels and directions, see applications of GCNs to statistical relation learning (Schlichtkrull et al., 2017).
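A minimal sketch of one syntactic GCN layer over the example sentence of Figure 1 could look as follows (again NumPy and purely illustrative: the parameter dictionaries, names and dimensions are ours, not the authors' implementation). It uses direction-specific weight matrices and gate weight vectors, (direction, label)-specific biases and gate biases, and edge-wise gates, mirroring the equations above.

    import numpy as np

    rng = np.random.default_rng(2)
    d = 8
    words = ["The", "monkey", "eats", "a", "banana"]
    # (head, dependent, label) arcs for "The monkey eats a banana" (Figure 1).
    arcs = [(1, 0, "det"), (2, 1, "nsubj"), (4, 3, "det"), (2, 4, "dobj")]

    # Messages: self-loops plus both directions of every dependency arc.
    messages = [(v, v, "loop", "loop") for v in range(len(words))]
    for head, dep, lab in arcs:
        messages.append((head, dep, lab, "out"))          # head -> dependent
        messages.append((dep, head, lab, "in"))           # dependent -> head

    H = rng.normal(size=(len(words), d))                  # input states h^(0)

    # Direction-specific weight matrices and gate weight vectors;
    # (direction, label)-specific biases and gate biases.
    W      = {dr: rng.normal(size=(d, d)) * 0.1 for dr in ("in", "out", "loop")}
    w_gate = {dr: rng.normal(size=d) * 0.1 for dr in ("in", "out", "loop")}
    b      = {(dr, lab): np.zeros(d) for _, _, lab, dr in messages}
    b_gate = {(dr, lab): 0.0 for _, _, lab, dr in messages}

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    H_new = np.zeros_like(H)
    for u, v, lab, dr in messages:                        # message from u into v
        gate = sigmoid(H[u] @ w_gate[dr] + b_gate[(dr, lab)])
        H_new[v] += gate * (W[dr] @ H[u] + b[(dr, lab)])
    H_new = np.maximum(0, H_new)                          # rho = ReLU

    print(H_new.shape)                                    # (5, 8): syntax-aware word states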
3 Graph Convolutional Encoders

In this work we focus on exploiting structural information on the source side, i.e. in the encoder. We hypothesize that using an encoder that incorporates syntax will lead to more informative representations of words, and that these representations, when used as context vectors by the decoder, will lead to an improvement in translation quality. Consequently, in all our models, we use the decoder of Bahdanau et al. (2015) and keep this part of the model constant. As is now common practice, we do not use a maxout layer in the decoder, but apart from this we do not deviate from the original definition. In all models we make use of GRUs (Cho et al., 2014b) as our RNN units.

Our models vary in the encoder part, where we exploit the power of GCNs to induce syntactically-aware representations. We now define a series of encoders of increasing complexity.

BoW + GCN. In our first and simplest model, we propose a bag-of-words encoder (with position embeddings, see §2.1.1), with a GCN on top. In other words, inputs h^{(0)} are a sum of the embeddings of a word and of its position in the sentence. Since the original BoW encoder captures the linear ordering information only in a very crude way (through the position embeddings), the structural information provided by the GCN should be highly beneficial.

Convolutional + GCN. In our second model, we use convolutional neural networks to learn word representations. CNNs are fast, but by definition only use a limited window of context. Instead of the approach used by Gehring et al. (2016) (i.e. stacking multiple CNN layers on top of each other), we use a GCN to enrich the one-layer CNN representations. Figure 2 shows this model. Note that, while the figure shows a CNN with a window size of 3, we will use a larger window size of 5 in our experiments. We expect this model to perform better than BoW + GCN, because of the additional local context captured by the CNN.

BiRNN + GCN. In our third and most powerful model, we employ bidirectional recurrent neural networks. In this model, we start by encoding the source sentence using a BiRNN (i.e. a BiGRU), and use the resulting hidden states as input to a GCN. Instead of relying on linear order only, the GCN will allow the encoder to 'teleport' over parts of the input sentence, along dependency edges, connecting words that otherwise might be far apart. The model might not only benefit from this teleporting capability, however; the nature of the relations between words (i.e. dependency relation types) may also be useful, and the GCN exploits this information (see §2.3 for details).

This is the most challenging setup for GCNs, as RNNs have been shown capable of capturing at least some degree of syntactic information without explicit supervision (Linzen et al., 2016), and hence they should be hard to improve on by incorporating treebank syntax.

Marcheggiani and Titov (2017) did not observe improvements from using multiple GCN layers in semantic role labeling. However, we do expect that propagating information from further in the tree should be beneficial in principle. We hypothesize that the first layer is the most influential one, capturing most of the syntactic context, and that additional layers only modestly modify the representations. To ease optimization, we add a residual connection (He et al., 2016) between the GCN layers when using more than one layer.
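Schematically, the BiRNN + GCN encoder composes these pieces as sketched below (a toy illustration: birnn and syntactic_gcn_layer are simplified stand-ins for the components of §2.1.1 and §2.3, not the actual model), including the residual connection between GCN layers.

    import numpy as np

    rng = np.random.default_rng(3)
    T, d = 6, 8                                     # toy sentence length and state size

    def birnn(X):
        # Stand-in for the BiGRU of Section 2.1.1: a fixed linear map plus tanh,
        # just so the example runs; the real encoder is recurrent.
        return np.tanh(X @ (rng.normal(size=(X.shape[1], d)) * 0.1))

    def syntactic_gcn_layer(H, A):
        # Stand-in for the gated, labeled GCN layer of Section 2.3: a plain
        # unlabeled propagation step keeps the sketch short.
        W = rng.normal(size=(d, d)) * 0.1
        return np.maximum(0, A @ (H @ W))

    X = rng.normal(size=(T, d))                     # source word embeddings
    A = np.eye(T)                                   # dependency graph incl. self-loops
    for head, dep in [(1, 0), (2, 1), (4, 3), (2, 4)]:
        A[head, dep] = A[dep, head] = 1.0

    H = birnn(X)                                    # BiRNN states feed the GCN
    H1 = syntactic_gcn_layer(H, A)                  # first GCN layer
    H2 = H1 + syntactic_gcn_layer(H1, A)            # second layer + residual connection

    print(H2.shape)                                 # (6, 8): encoder states for attention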
4 Experiments

Experiments are performed using the Neural Monkey toolkit[3] (Helcl and Libovický, 2017), which implements the model of Bahdanau et al. (2015) in TensorFlow. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 (0.0002 for CNN models).[4] The batch size is set to 80. Between layers we apply dropout with a probability of 0.2, and in experiments with GCNs[5] we use the same value for edge dropout. We train for 45 epochs, evaluating the BLEU performance of the model every epoch on the validation set. For testing, we select the model with the highest validation BLEU. L2 regularization is used with a value of 10^-8. All model selection (including hyperparameter selection) was performed on the validation set. In all experiments we obtain translations using a greedy decoder, i.e. we select the output token with the highest probability at each time step.

We describe an artificial experiment in §4.1 and MT experiments in §4.2.

[3] https://github.com/ufal/neuralmonkey
[4] Like Gehring et al. (2016), we note that Adam is too aggressive for CNN models, hence we use a lower learning rate.
[5] GCN code at https://github.com/bastings/neuralmonkey
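As a side illustration of the greedy decoding used at test time (a sketch with an invented decoder_step standing in for the attention-based GRU decoder; only the arg-max selection loop reflects the procedure described above):

    import numpy as np

    rng = np.random.default_rng(4)
    vocab_size, hidden, eos_id, max_len = 20, 16, 0, 30

    def decoder_step(state, prev_token, encoder_states):
        # Hypothetical stand-in for one step of the attentional GRU decoder:
        # returns a next-word distribution and the updated decoder state.
        state = np.tanh(state + encoder_states.mean(axis=0) + prev_token)
        logits = state @ rng.normal(size=(hidden, vocab_size))
        probs = np.exp(logits - logits.max())
        return probs / probs.sum(), state

    encoder_states = rng.normal(size=(6, hidden))        # from any of the encoders above
    state, token, output = np.zeros(hidden), 1, []       # start from a BOS token (id 1)

    for _ in range(max_len):
        probs, state = decoder_step(state, token, encoder_states)
        token = int(np.argmax(probs))                    # greedy: highest-probability token
        if token == eos_id:
            break
        output.append(token)

    print(output)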
4.1 Reordering artificial sequences

Our goal here is to provide an intuition for the capabilities of GCNs. We define a reordering task where randomly permuted sequences need to be put back into the original order. We encode the original order using edges, and test if GCNs can successfully exploit them. Note that this task is not meant to provide a fair comparison to RNNs: the input (besides the edges) simply does not carry any information about the original ordering, so RNNs cannot possibly solve this task.

Data. From a vocabulary of 26 types, we generate random sequences of 3-10 tokens. We then randomly permute them, pointing every token to its original predecessor with a label sampled from a set of 5 labels. Additionally, we point every token to an arbitrary position in the sequence with a label from a distinct set of 5 'fake' labels. We sample 25000 training and 1000 validation sequences.

Model. We use the BiRNN + GCN model, i.e. a bidirectional GRU with a 1-layer GCN on top. We use 32, 64 and 128 units for embeddings, GRU units and GCN layers, respectively.

Results. After 6 epochs of training, the model learns to put permuted sequences back into order, reaching a validation BLEU of 99.2. Figure 3 shows that the mean values of the bias terms of the gates (i.e. b̂) for real and fake edges are far apart, suggesting that the GCN learns to distinguish them. Interestingly, this illustrates why edge-wise gating is beneficial. A gate-less model would not understand which of the two outgoing arcs is fake and which is genuine, because only the biases b would then be label-dependent. Consequently, it would only do a mediocre job at reordering. Although using label-specific matrices W would also help, this would not scale to the real scenario (see §2.3).

Figure 3: Mean values of gate bias terms for real (useful) labels and for fake (non-useful) labels suggest the GCN learns to distinguish them. (Plot of mean gate bias over training steps ×1000, with curves for real and fake edges.)
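A sketch of how such training examples could be generated (an illustrative reconstruction of the description above, not the authors' data script; the label names are invented):

    import random
    import string

    random.seed(0)
    REAL_LABELS = [f"real{i}" for i in range(5)]
    FAKE_LABELS = [f"fake{i}" for i in range(5)]

    def make_example():
        length = random.randint(3, 10)
        original = random.choices(string.ascii_lowercase, k=length)   # 26 types

        order = list(range(length))
        random.shuffle(order)                       # positions after permutation
        permuted = [original[i] for i in order]

        edges = []
        for pos, orig_idx in enumerate(order):
            if orig_idx > 0:                        # point to the original predecessor
                pred_pos = order.index(orig_idx - 1)
                edges.append((pos, pred_pos, random.choice(REAL_LABELS)))
            # additionally, point to an arbitrary position with a 'fake' label
            edges.append((pos, random.randrange(length), random.choice(FAKE_LABELS)))
        return permuted, edges, original            # original is the reordering target

    permuted, edges, target = make_example()
    print(permuted, target)
    print(edges[:4])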
4.2 Machine Translation

Data. For our experiments we use the En-De and En-Cs News Commentary v11 data from the WMT16 translation task.[6] For En-De we also train on the full WMT16 data set. As our validation set and test set we use newstest2015 and newstest2016, respectively.

Pre-processing. The English sides of the corpora are tokenized and parsed into dependency trees by SyntaxNet,[7] using the pre-trained Parsey McParseface model.[8] The Czech and German sides are tokenized using the Moses tokenizer.[9] Sentence pairs where either side is longer than 50 words are filtered out after tokenization.

Vocabularies. For the English sides, we construct vocabularies from all words except those with a training set frequency smaller than three. For Czech and German, to deal with rare words and phenomena such as inflection and compounding, we learn byte-pair encodings (BPE) as described by Sennrich et al. (2016b). Given the size of our data set, and following Wu et al. (2016), we use 8000 BPE merges to obtain robust frequencies for our subword units (16000 merges for the full data experiment). Data set statistics are summarized in Table 1 and vocabulary sizes in Table 2.

                             Train      Val.   Test
    English-German          226822     2169   2999
    English-German (full)   4500966    2169   2999
    English-Czech           181112     2656   2999

Table 1: The number of sentences in our data sets.

                             Source    Target
    English-German           37824     8099 (BPE)
    English-German (full)    50000     16000 (BPE)
    English-Czech             33786     8116 (BPE)

Table 2: Vocabulary sizes.

[6] http://www.statmt.org/wmt16/translation-task.html
[7] https://github.com/tensorflow/models/tree/master/syntaxnet
[8] The dependency parses used can be reproduced with the syntaxnet/demo.sh shell script.
[9] https://github.com/moses-smt/mosesdecoder
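For the English side, the frequency-thresholded vocabulary construction might be sketched as follows (illustrative only: the special tokens and helper name are assumptions, and the real pipeline additionally applies BPE on the target side):

    from collections import Counter

    MIN_FREQ = 3
    SPECIALS = ["<pad>", "<unk>", "<s>", "</s>"]

    def build_vocab(tokenized_sentences, min_freq=MIN_FREQ):
        counts = Counter(tok for sent in tokenized_sentences for tok in sent)
        # keep only words seen at least min_freq times in the training data
        words = [w for w, c in counts.most_common() if c >= min_freq]
        return {w: i for i, w in enumerate(SPECIALS + words)}

    train = [["the", "monkey", "eats", "a", "banana"],
             ["the", "monkey", "eats", "the", "banana"],
             ["a", "monkey", "sleeps", "."]]
    vocab = build_vocab(train, min_freq=2)
    print(vocab)   # rare tokens ("sleeps", ".") would map to <unk> at lookup time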
Hyperparameters. We use 256 units for word embeddings, 512 units for GRUs (800 for the En-De full data set experiment), and 512 units for convolutional layers (or equivalently, 512 'channels'). The dimensionality of the GCN layers is equivalent to the dimensionality of their input. We report results for 2-layer GCNs, as we find them most effective (see the ablation studies below).

Baselines. We provide three baselines, each with a different encoder: a bag-of-words encoder, a convolutional encoder with window size w = 5, and a BiRNN. See §2.1.1 for details.

Evaluation. We report (cased) BLEU results (Papineni et al., 2002) using multi-bleu, as well as Kendall τ reordering scores.[10]

[10] See Stanojević and Sima'an (2015). TER (Snover et al., 2006) and BEER (Stanojević and Sima'an, 2014) scores, omitted due to space considerations, are consistent with the reported results.

4.2.1 Results

English-German. Table 3 shows test results on English-German. Unsurprisingly, the bag-of-words baseline performs the worst. We expected the BoW+GCN model to make easy gains over this baseline, which is indeed what happens. The CNN baseline reaches a higher BLEU4 score than the BoW models, but interestingly its BLEU1 score is lower than that of the BoW+GCN model. The CNN+GCN model improves over the CNN baseline by +1.9 and +1.1 for BLEU1 and BLEU4, respectively. The BiRNN, the strongest baseline, reaches a BLEU4 of 14.9. Interestingly, GCNs still manage to improve the result by +2.3 BLEU1 and +1.2 BLEU4 points. Finally, we observe a big jump in BLEU4 by using the full data set and beam search (beam 12). The BiRNN now reaches 23.3, while adding a GCN achieves a score of 23.9.

                    Kendall   BLEU1   BLEU4
    BoW             0.3352    40.6     9.5
    + GCN           0.3520    44.9    12.2
    CNN             0.3601    42.8    12.6
    + GCN           0.3777    44.7    13.7
    BiRNN           0.3984    45.2    14.9
    + GCN           0.4089    47.5    16.1
    BiRNN (full)    0.5440    53.0    23.3
    + GCN           0.5555    54.6    23.9

Table 3: Test results for English-German.

English-Czech. Table 4 shows test results on English-Czech. While it is difficult to obtain high absolute BLEU scores on this dataset, we can still see similar relative improvements. Again the BoW baseline scores worst, with the BoW+GCN easily beating that result. The CNN baseline scores a BLEU4 of 8.1, but the CNN+GCN improves on that, this time by +1.0 and +0.6 for BLEU1 and BLEU4, respectively. Interestingly, BLEU1 scores for the BoW+GCN and CNN+GCN models are higher than both baselines so far. Finally, the BiRNN baseline scores a BLEU4 of 8.9, but it is again beaten by the BiRNN+GCN model, with +1.9 BLEU1 and +0.7 BLEU4.

                    Kendall   BLEU1   BLEU4
    BoW             0.2498    32.9     6.0
    + GCN           0.2561    35.4     7.5
    CNN             0.2756    35.1     8.1
    + GCN           0.2850    36.1     8.7
    BiRNN           0.2961    36.9     8.9
    + GCN           0.3046    38.8     9.6

Table 4: Test results for English-Czech.

Effect of GCN layers. How many GCN layers do we need? Every layer gives us an extra hop in the graph and expands the syntactic neighborhood of a word. Table 5 shows validation BLEU performance as a function of the number of GCN layers. For English-German, using a 1-layer GCN improves BLEU1, but surprisingly has little effect on BLEU4. Adding an additional layer gives improvements on both BLEU1 and BLEU4 of +1.3 and +0.7, respectively. For English-Czech, performance increases with each added GCN layer.

                      En-De            En-Cs
                  BLEU1   BLEU4    BLEU1   BLEU4
    BiRNN          44.2    14.1     37.8    8.9
    + GCN (1L)     45.0    14.1     38.3    9.6
    + GCN (2L)     46.3    14.8     39.6    9.9

Table 5: Validation BLEU for English-German and English-Czech for 1- and 2-layer GCNs.

Effect of sentence length. We hypothesize that GCNs should be more beneficial for longer sentences: these are likely to contain long-distance syntactic dependencies which may not be adequately captured by RNNs but are directly encoded in GCNs. To test this, we partition the validation data into five buckets and calculate BLEU for each of them. Figure 4 shows that GCN-based models outperform their respective baselines rather uniformly across all buckets. This is a surprising result. One explanation may be that syntactic parses are noisier for longer sentences, and this prevents us from obtaining extra improvements with GCNs.

Figure 4: BLEU per sentence-length bucket for the CNN, CNN + GCN, BiRNN and BiRNN + GCN models.
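The per-length analysis could be approximated along these lines (a sketch that assumes the sacrebleu package rather than the multi-bleu script used in the paper; bucketing by source length into equal-size buckets is our reading of the setup):

    import sacrebleu

    def bleu_by_length(sources, hypotheses, references, n_buckets=5):
        # Sort sentence triples by source length and split into equal-size buckets.
        triples = sorted(zip(sources, hypotheses, references),
                         key=lambda t: len(t[0].split()))
        size = len(triples) // n_buckets
        scores = []
        for b in range(n_buckets):
            chunk = triples[b * size:(b + 1) * size] if b < n_buckets - 1 else triples[b * size:]
            _, hyps, refs = zip(*chunk)
            scores.append(sacrebleu.corpus_bleu(list(hyps), [list(refs)]).score)
        return scores

    # Toy usage with three "sentences"; real inputs would be the validation set.
    src = ["a b c d", "a b c d e", "a b c d e f g"]
    hyp = ["w x y z", "w x y z v", "w x y z v u t"]
    ref = ["w x y z", "w x y z v", "w x y z v u t"]
    print(bleu_by_length(src, hyp, ref, n_buckets=3))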
Hiero trees are not syntactically aware, but instead constrained by symmetrized word alignments. Aharoni and Goldberg (2017) propose neural string-to-tree by predicting linearized parse trees.

Multi-task Learning. Sharing NMT parameters with a syntactic parser is a popular approach to obtaining syntactically-aware representations. Luong et al. (2015a) predict linearized constituency parses as an additional task. Eriguchi et al. (2017) multi-task with a target-side RNNG parser (Dyer et al., 2016) and improve on various language pairs with English on the target side. Nadejde et al. (2017) multi-task with CCG tagging, and also integrate syntax on the target side by predicting a sequence of words interleaved with CCG supertags.

Latent structure. Hashimoto and Tsuruoka (2017) add a syntax-inspired encoder on top of a BiLSTM layer. They encode source words as a learned average of potential parents, emulating a relaxed dependency tree. While their model is trained purely on translation data, they also experiment with pre-training the encoder using tree-
… beyond syntax, by using semantic annotations such as SRL and AMR, and co-reference chains.

Acknowledgments

We would like to thank Michael Schlichtkrull and Thomas Kipf for their suggestions and comments. This work was supported by the European Research Council (ERC StG BroadSem 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518, NWO VICI 277-89-002).

References

Roee Aharoni and Yoav Goldberg. 2017. Towards String-to-Tree Neural Machine Translation. ArXiv e-prints.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2012. Abstract Meaning Representation (AMR) 1.0 specification. In Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

David Chiang. 2010. Learning to translate with source and target syntax. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1443–1452, Uppsala, Sweden. Association for Computational Linguistics.

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29, pages 3837–3845.

David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California. Association for Computational Linguistics.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany. Association for Computational Linguistics.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to Parse and Translate Improves Neural Machine Translation. ArXiv e-prints.

Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. 2016. A convolutional encoder model for neural machine translation. CoRR, abs/1611.02344.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural Message Passing for Quantum Chemistry. ArXiv e-prints.

Kazuma Hashimoto and Yoshimasa Tsuruoka. 2017. Neural machine translation with source-side latent graph parsing. CoRR, abs/1702.02265.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Jindřich Helcl and Jindřich Libovický. 2017. Neural Monkey: An open-source tool for sequence learning. The Prague Bulletin of Mathematical Linguistics, (107):5–17.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Ozan Irsoy and Claire Cardie. 2014. Opinion Mining with Deep Recurrent Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 720–728, Doha, Qatar. Association for Computational Linguistics.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR, abs/1610.10099.

Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. 2016. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015a. Multi-task Sequence to Sequence Learning. CoRR, abs/1511.06114.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Diego Marcheggiani and Ivan Titov. 2017. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark. Association for Computational Linguistics.

Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, and Alexandra Birch. 2017. Syntax-aware Neural Machine Translation Using CCG. ArXiv e-prints.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. pages 311–318.

Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2017. Modeling Relational Data with Graph Convolutional Networks. ArXiv e-prints.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Rico Sennrich and Barry Haddow. 2016. Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation (WMT16), volume abs/1606.02892.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, pages 371–376, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, Texas. Association for Computational Linguistics.

David Smith and Jason Eisner. 2006. Quasi-synchronous grammars: Alignment by soft projection of syntactic dependencies. In Proceedings of the Workshop on Statistical Machine Translation, pages 23–30, New York City. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Felix Stahlberg, Eva Hasler, Aurelien Waite, and Bill Byrne. 2016. Syntactically guided neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 299–305, Berlin, Germany. Association for Computational Linguistics.

Miloš Stanojević and Khalil Sima'an. 2015. Evaluating MT systems with BEER. The Prague Bulletin of Mathematical Linguistics, 104(1):17–26.

Miloš Stanojević and Khalil Sima'an. 2014. Fitting sentence level translation evaluation with many dense features. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 202–206, Doha, Qatar. Association for Computational Linguistics.

Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL 2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of CoNLL.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Neural Information Processing Systems (NIPS), pages 3104–3112.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2016. Learning to compose words into sentences with reinforcement learning. CoRR, abs/1611.09100.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of the Workshop on Statistical Machine Translation, StatMT '06, pages 138–141, Stroudsburg, PA, USA. Association for Computational Linguistics.