Graph Convolutional Encoders for Syntax-aware Neural Machine Translation
Jasmijn Bastings¹  Ivan Titov¹,²  Wilker Aziz¹  Diego Marcheggiani¹  Khalil Sima'an¹
¹ ILLC, University of Amsterdam   ² ILCC, University of Edinburgh
{bastings,titov,w.aziz,marcheggiani,k.simaan}@uva.nl

arXiv:1704.04675v4 [cs.CL] 18 Jun 2020

Abstract

We present a simple and effective approach to incorporating syntactic structure into neural attention-based encoder-decoder models for machine translation. We rely on graph-convolutional networks (GCNs), a recent class of neural networks developed for modeling graph-structured data. Our GCNs use predicted syntactic dependency trees of source sentences to produce representations of words (i.e. hidden states of the encoder) that are sensitive to their syntactic neighborhoods. GCNs take word representations as input and produce word representations as output, so they can easily be incorporated as layers into standard encoders (e.g., on top of bidirectional RNNs or convolutional neural networks). We evaluate their effectiveness with English-German and English-Czech translation experiments for different types of encoders and observe substantial improvements over their syntax-agnostic versions in all the considered setups.

1 Introduction

Neural machine translation (NMT) is one of the success stories of deep learning in natural language processing, with recent NMT systems outperforming traditional phrase-based approaches on many language pairs (Sennrich et al., 2016a). State-of-the-art NMT systems rely on sequential encoder-decoders (Sutskever et al., 2014; Bahdanau et al., 2015) and lack any explicit modeling of syntax or any hierarchical structure of language. One potential reason why we have not seen much benefit from using syntactic information in NMT is the lack of simple and effective methods for incorporating structured information in neural encoders, including RNNs. Despite some successes, techniques explored so far either incorporate syntactic information in NMT models in a relatively indirect way (e.g., multi-task learning (Luong et al., 2015a; Nadejde et al., 2017; Eriguchi et al., 2017; Hashimoto and Tsuruoka, 2017)) or may be too restrictive in modeling the interface between syntax and the translation task (e.g., learning representations of linguistic phrases (Eriguchi et al., 2016)). Our goal is to provide the encoder with access to rich syntactic information but let it decide which aspects of syntax are beneficial for MT, without placing rigid constraints on the interaction between syntax and the translation task. This goal is in line with claims that rigid syntactic constraints typically hurt MT (Zollmann and Venugopal, 2006; Smith and Eisner, 2006; Chiang, 2010), and, though these claims have been made in the context of traditional MT systems, we believe they are no less valid for NMT.

Attention-based NMT systems (Bahdanau et al., 2015; Luong et al., 2015b) represent source sentence words as latent-feature vectors in the encoder and use these vectors when generating a translation. Our goal is to automatically incorporate information about the syntactic neighborhoods of source words into these feature vectors and thus potentially improve the quality of the translation output. Since vectors correspond to words, it is natural for us to use dependency syntax. Dependency trees (see Figure 1) represent syntactic relations between words: for example, monkey is the subject of the predicate eats, and banana is its object.
Figure 1: A dependency tree for the example sentence: "The monkey eats a banana." (arcs labeled det, nsubj, det, and dobj)

In order to produce syntax-aware feature representations of words, we exploit graph convolutional networks (GCNs) (Duvenaud et al., 2015; Defferrard et al., 2016; Kearnes et al., 2016; Kipf and Welling, 2016). GCNs can be regarded as computing a latent-feature representation of a node (i.e. a real-valued vector) based on its k-th order neighborhood (i.e. nodes at most k hops away from the node) (Gilmer et al., 2017). They are generally simple and computationally inexpensive. We use syntactic GCNs, a version of GCN operating on top of syntactic dependency trees, recently shown effective in the context of semantic role labeling (Marcheggiani and Titov, 2017).

Since syntactic GCNs produce representations at the word level, it is straightforward to use them as encoders within the attention-based encoder-decoder framework. As NMT systems are trained end-to-end, GCNs end up capturing syntactic properties specifically relevant to the translation task. Though GCNs can take word embeddings as input, we will see that they are more effective when used as layers on top of recurrent neural network (RNN) or convolutional neural network (CNN) encoders (Gehring et al., 2016), enriching their states with syntactic information.

A comparison to RNNs is the most challenging test for GCNs, as it has been shown that RNNs (e.g., LSTMs) are able to capture certain syntactic phenomena (e.g., subject-verb agreement) reasonably well on their own, without explicit treebank supervision (Linzen et al., 2016; Shi et al., 2016). Nevertheless, GCNs appear beneficial even in this challenging setup: we obtain +1.2 and +0.7 BLEU point improvements from using syntactic GCNs on top of bidirectional RNNs for English-German and English-Czech, respectively.

In principle, GCNs are flexible enough to incorporate any linguistic structure that can be represented as a graph (e.g., dependency-based semantic role labeling representations (Surdeanu et al., 2008), AMR semantic graphs (Banarescu et al., 2012), and co-reference chains). For example, unlike recursive neural networks (Socher et al., 2013), GCNs do not require the graphs to be trees. However, in this work we solely focus on dependency syntax and leave more general investigation for future work.

Our main contributions can be summarized as follows:

• we introduce a method for incorporating structure into NMT using syntactic GCNs;
• we show that GCNs can be used along with RNN and CNN encoders;
• we show that incorporating structure is beneficial for machine translation on English-Czech and English-German.

2 Background

Notation. We use x for vectors, x_{1:t} for a sequence of t vectors, and X for matrices. The i-th value of vector x is denoted by x_i. We use ◦ for vector concatenation.

2.1 Neural Machine Translation

In NMT (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014b), given example translation pairs from a parallel corpus, a neural network is trained to directly estimate the conditional distribution p(y_{1:Ty} | x_{1:Tx}) of translating a source sentence x_{1:Tx} (a sequence of Tx words) into a target sentence y_{1:Ty}. NMT models typically consist of an encoder, a decoder, and some method for conditioning the decoder on the encoder, for example, an attention mechanism. We will now briefly describe the components that we use in this paper.

2.1.1 Encoders

An encoder is a function that takes as input the source sentence and produces a representation encoding its semantic content. We describe recurrent, convolutional and bag-of-words encoders.

Recurrent. Recurrent neural networks (RNNs) (Elman, 1990) model sequential data. They receive one input vector at each time step and update their hidden state to summarize all inputs up to that point. Given an input sequence x_{1:Tx} = x_1, x_2, ..., x_{Tx} of word embeddings, an RNN is defined recursively as follows:

    RNN(x_{1:t}) = f(x_t, RNN(x_{1:t-1}))

where f is a nonlinear function such as an LSTM (Hochreiter and Schmidhuber, 1997) or a GRU (Cho et al., 2014b). We will use the function RNN as an abstract mapping from an input sequence x_{1:t} to its final hidden state RNN(x_{1:t}), regardless of the nonlinearity used. To not only summarize the past of a word, but also its future, a bidirectional RNN (Schuster and Paliwal, 1997; Irsoy and Cardie, 2014) is often used.
A bidirectional RNN reads the input sentence in two directions and then concatenates the states for each time step:

    BiRNN(x_{1:Tx}, t) = RNN_F(x_{1:t}) ◦ RNN_B(x_{Tx:t})

where RNN_F and RNN_B are the forward and backward RNNs, respectively. For further details we refer to the encoder of Bahdanau et al. (2015).

Convolutional. Convolutional neural networks (CNNs) apply a fixed-size window over the input sequence to capture the local context of each word (Gehring et al., 2016). One advantage of this approach over RNNs is that it allows for fast parallel computation, while sacrificing non-local context. To remedy the loss of context, multiple CNN layers can be stacked. Formally, given an input sequence x_{1:Tx}, we define a CNN as follows:

    CNN(x_{1:Tx}, t) = f(x_{t-⌊w/2⌋}, ..., x_t, ..., x_{t+⌊w/2⌋})

where f is a nonlinear function, typically a linear transformation followed by a ReLU, and w is the size of the window.

Bag-of-Words. In a bag-of-words (BoW) encoder every word is simply represented by its word embedding. To give the decoder some sense of word position, position embeddings (PE) may be added. There are different strategies for defining position embeddings, and in this paper we choose to learn a vector for each absolute word position up to a certain maximum length. We then represent the t-th word in a sequence as follows:

    BoW(x_{1:Tx}, t) = x_t + p_t

where x_t is the word embedding and p_t is the t-th position embedding.
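To make the three encoder variants concrete, here is a small NumPy sketch (purely illustrative: the toy dimensions, the tanh recurrence standing in for a GRU/LSTM, and all variable names are our own, not the paper's Neural Monkey implementation) that computes BoW, CNN and BiRNN states as defined above.

    import numpy as np

    rng = np.random.default_rng(0)
    T, d, w = 6, 8, 5                      # sentence length, embedding size, CNN window

    X = rng.normal(size=(T, d))            # word embeddings x_1..x_T
    P = rng.normal(size=(T, d))            # learned absolute position embeddings p_1..p_T

    # Bag-of-words encoder: BoW(x_{1:T}, t) = x_t + p_t
    bow_states = X + P

    # Convolutional encoder: a window of w words around position t,
    # then a linear transformation followed by a ReLU.
    W_cnn = rng.normal(size=(w * d, d))
    pad = np.zeros((w // 2, d))
    Xp = np.vstack([pad, X, pad])          # *PAD* symbols at both ends
    cnn_states = np.maximum(0, np.stack([Xp[t:t + w].reshape(-1) @ W_cnn
                                         for t in range(T)]))

    # Bidirectional RNN encoder, with a plain tanh recurrence standing in for a GRU.
    W_in, W_hh = rng.normal(size=(d, d)), rng.normal(size=(d, d))

    def rnn(inputs):
        h, states = np.zeros(d), []
        for x in inputs:                   # RNN(x_{1:t}) = f(x_t, RNN(x_{1:t-1}))
            h = np.tanh(x @ W_in + h @ W_hh)
            states.append(h)
        return states

    fwd = rnn(X)                           # reads left to right
    bwd = rnn(X[::-1])[::-1]               # reads right to left, re-aligned
    birnn_states = np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

    print(bow_states.shape, cnn_states.shape, birnn_states.shape)  # (6, 8) (6, 8) (6, 16)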
2.1.2 Decoder

A decoder produces the target sentence conditioned on the representation of the source sentence induced by the encoder. In Bahdanau et al. (2015) the decoder is implemented as an RNN conditioned on an additional input c_i, the context vector, which is dynamically computed at each time step using an attention mechanism. The probability of a target word y_i is then a function of the decoder RNN state, the previous target word embedding, and the context vector. The model is trained end-to-end for maximum log-likelihood of the next target word given its context.

2.2 Graph Convolutional Networks

We will now describe the graph convolutional networks (GCNs) of Kipf and Welling (2016). For a comprehensive overview of alternative GCN architectures, see Gilmer et al. (2017).

A GCN is a multilayer neural network that operates directly on a graph, encoding information about the neighborhood of a node as a real-valued vector. In each GCN layer, information flows along the edges of the graph; in other words, each node receives messages from all its immediate neighbors. When multiple GCN layers are stacked, information about larger neighborhoods gets integrated. For example, in the second layer, a node will receive information from its immediate neighbors, but this information already includes information from their respective neighbors. By choosing the number of GCN layers, we regulate the distance the information travels: with k layers a node receives information from neighbors at most k hops away.

Formally, consider an undirected graph G = (V, E), where V is a set of n nodes and E is a set of edges. Every node is assumed to be connected to itself, i.e. ∀v ∈ V : (v, v) ∈ E. Now, let X ∈ R^{d×n} be a matrix containing all n nodes with their features, where d is the dimensionality of the feature vectors. In our case, X will contain word embeddings, but in general it can contain any kind of features. For a 1-layer GCN, the new node representations are computed as follows:

    h_v = ρ( Σ_{u ∈ N(v)} W x_u + b )

where W ∈ R^{d×d} is a weight matrix and b ∈ R^d a bias vector.[1] ρ is an activation function, e.g. a ReLU. N(v) is the set of neighbors of v, which we assume here to always include v itself. As stated before, to allow information to flow over multiple hops, we need to stack GCN layers. The recursive computation is as follows:

    h_v^{(j+1)} = ρ( Σ_{u ∈ N(v)} W^{(j)} h_u^{(j)} + b^{(j)} )

where j indexes the layer, and h_v^{(0)} = x_v.

[1] We dropped the normalization factor used by Kipf and Welling (2016), as it is not used in the syntactic GCNs of Marcheggiani and Titov (2017).
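As an illustration of this propagation rule, the following NumPy sketch (toy graph and made-up sizes, not the authors' code) applies two GCN layers to a small undirected graph with self-loops, without the Kipf and Welling normalization, as noted in the footnote.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, layers = 5, 4, 2                           # nodes, feature size, GCN depth

    # Undirected edges of a toy graph; every node is also connected to itself.
    edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
    A = np.eye(n)                                    # self-loops: (v, v) in E
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0

    X = rng.normal(size=(n, d))                      # node features (word embeddings)

    H = X
    for j in range(layers):
        W = rng.normal(size=(d, d)) * 0.1            # W^(j)
        b = np.zeros(d)                              # b^(j)
        # h_v^(j+1) = relu( sum_{u in N(v)} W h_u^(j) + b )
        H = np.maximum(0, A @ (H @ W) + b)

    print(H.shape)                                   # (5, 4): one vector per node

With two layers, each node's output already depends on its neighbors' neighbors, matching the "k hops with k layers" behavior described above.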
Figure 2: A 2-layer syntactic GCN on top of a convolutional encoder. Loop connections are depicted with dashed edges, syntactic ones with solid (dependents to heads) and dotted (heads to dependents) edges. Gates and some labels are omitted for clarity.

2.3 Syntactic GCNs

Marcheggiani and Titov (2017) generalize GCNs to operate on directed and labeled graphs.[2] This makes it possible to use linguistic structures such as dependency trees, where directionality and edge labels play an important role. They also integrate edge-wise gates which let the model regulate the contributions of individual dependency edges. We will briefly describe these modifications.

Directionality. In order to deal with the directionality of edges, separate weight matrices are used for incoming and outgoing edges. We follow the convention that in dependency trees heads point to their dependents, and thus outgoing edges are used for head-to-dependent connections, and incoming edges are used for dependent-to-head connections. Modifying the recursive computation for directionality, we arrive at:

    h_v^{(j+1)} = ρ( Σ_{u ∈ N(v)} W_{dir(u,v)}^{(j)} h_u^{(j)} + b_{dir(u,v)}^{(j)} )

where dir(u, v) selects the weight matrix associated with the directionality of the edge connecting u and v (i.e. W_IN^{(j)} for u-to-v, W_OUT^{(j)} for v-to-u, and W_LOOP^{(j)} for v-to-v). Note that self-loops are modeled separately, so there are now three times as many parameters as in a non-directional GCN.

Labels. Making the GCN sensitive to labels is straightforward given the above modifications for directionality. Instead of using separate matrices for each direction, separate matrices are now defined for each direction and label combination:

    h_v^{(j+1)} = ρ( Σ_{u ∈ N(v)} W_{lab(u,v)}^{(j)} h_u^{(j)} + b_{lab(u,v)}^{(j)} )

where we incorporate the directionality of an edge directly in its label. Importantly, to prevent over-parametrization, only the bias terms are made label-specific, in other words: W_{lab(u,v)} = W_{dir(u,v)}. The resulting syntactic GCN is illustrated in Figure 2 (shown on top of a CNN, as we will explain in the subsequent section).

Edge-wise gating. Syntactic GCNs also include gates, which can down-weight the contribution of individual edges. They also allow the model to deal with noisy predicted structure, i.e. to ignore potentially erroneous syntactic edges. For each edge, a scalar gate is calculated as follows:

    g_{u,v}^{(j)} = σ( h_u^{(j)} · ŵ_{dir(u,v)}^{(j)} + b̂_{lab(u,v)}^{(j)} )

where σ is the logistic sigmoid function, and ŵ_{dir(u,v)}^{(j)} ∈ R^d and b̂_{lab(u,v)}^{(j)} ∈ R are learned parameters for the gate. The computation becomes:

    h_v^{(j+1)} = ρ( Σ_{u ∈ N(v)} g_{u,v}^{(j)} ( W_{dir(u,v)}^{(j)} h_u^{(j)} + b_{lab(u,v)}^{(j)} ) )

[2] For an alternative approach to integrating labels and directions, see applications of GCNs to statistical relation learning (Schlichtkrull et al., 2017).
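A minimal sketch of one syntactic GCN layer over the example sentence of Figure 1 could look as follows (again NumPy and purely illustrative: the parameter dictionaries, names and dimensions are ours, not the authors' implementation). It uses direction-specific weight matrices and gate weight vectors, (direction, label)-specific biases and gate biases, and edge-wise gates, mirroring the equations above.

    import numpy as np

    rng = np.random.default_rng(2)
    d = 8
    words = ["The", "monkey", "eats", "a", "banana"]
    # (head, dependent, label) arcs for "The monkey eats a banana" (Figure 1).
    arcs = [(1, 0, "det"), (2, 1, "nsubj"), (4, 3, "det"), (2, 4, "dobj")]

    # Messages: self-loops plus both directions of every dependency arc.
    messages = [(v, v, "loop", "loop") for v in range(len(words))]
    for head, dep, lab in arcs:
        messages.append((head, dep, lab, "out"))          # head -> dependent
        messages.append((dep, head, lab, "in"))           # dependent -> head

    H = rng.normal(size=(len(words), d))                  # input states h^(0)

    # Direction-specific weight matrices and gate weight vectors;
    # (direction, label)-specific biases and gate biases.
    W      = {dr: rng.normal(size=(d, d)) * 0.1 for dr in ("in", "out", "loop")}
    w_gate = {dr: rng.normal(size=d) * 0.1 for dr in ("in", "out", "loop")}
    b      = {(dr, lab): np.zeros(d) for _, _, lab, dr in messages}
    b_gate = {(dr, lab): 0.0 for _, _, lab, dr in messages}

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    H_new = np.zeros_like(H)
    for u, v, lab, dr in messages:                        # message from u into v
        gate = sigmoid(H[u] @ w_gate[dr] + b_gate[(dr, lab)])
        H_new[v] += gate * (W[dr] @ H[u] + b[(dr, lab)])
    H_new = np.maximum(0, H_new)                          # rho = ReLU

    print(H_new.shape)                                    # (5, 8): syntax-aware word states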
3 Graph Convolutional Encoders

In this work we focus on exploiting structural information on the source side, i.e. in the encoder. We hypothesize that using an encoder that incorporates syntax will lead to more informative representations of words, and that these representations, when used as context vectors by the decoder, will lead to an improvement in translation quality. Consequently, in all our models, we use the decoder of Bahdanau et al. (2015) and keep this part of the model constant. As is now common practice, we do not use a maxout layer in the decoder, but apart from this we do not deviate from the original definition. In all models we make use of GRUs (Cho et al., 2014b) as our RNN units.

Our models vary in the encoder part, where we exploit the power of GCNs to induce syntactically-aware representations. We now define a series of encoders of increasing complexity.

BoW + GCN. In our first and simplest model, we propose a bag-of-words encoder (with position embeddings, see §2.1.1), with a GCN on top. In other words, inputs h^{(0)} are a sum of the embeddings of a word and of its position in the sentence. Since the original BoW encoder captures the linear ordering information only in a very crude way (through the position embeddings), the structural information provided by the GCN should be highly beneficial.

Convolutional + GCN. In our second model, we use convolutional neural networks to learn word representations. CNNs are fast, but by definition only use a limited window of context. Instead of the approach used by Gehring et al. (2016) (i.e. stacking multiple CNN layers on top of each other), we use a GCN to enrich the one-layer CNN representations. Figure 2 shows this model. Note that, while the figure shows a CNN with a window size of 3, we will use a larger window size of 5 in our experiments. We expect this model to perform better than BoW + GCN, because of the additional local context captured by the CNN.

BiRNN + GCN. In our third and most powerful model, we employ bidirectional recurrent neural networks. In this model, we start by encoding the source sentence using a BiRNN (i.e. a BiGRU), and use the resulting hidden states as input to a GCN. Instead of relying on linear order only, the GCN will allow the encoder to 'teleport' over parts of the input sentence, along dependency edges, connecting words that otherwise might be far apart. The model might not only benefit from this teleporting capability, however; the nature of the relations between words (i.e. dependency relation types) may also be useful, and the GCN exploits this information (see §2.3 for details).

This is the most challenging setup for GCNs, as RNNs have been shown capable of capturing at least some degree of syntactic information without explicit supervision (Linzen et al., 2016), and hence they should be hard to improve on by incorporating treebank syntax.

Marcheggiani and Titov (2017) did not observe improvements from using multiple GCN layers in semantic role labeling. However, we do expect that propagating information from further in the tree should be beneficial in principle. We hypothesize that the first layer is the most influential one, capturing most of the syntactic context, and that additional layers only modestly modify the representations. To ease optimization, we add a residual connection (He et al., 2016) between the GCN layers when using more than one layer.
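Schematically, the BiRNN + GCN encoder composes these pieces as sketched below (a toy illustration: birnn and syntactic_gcn_layer are simplified stand-ins for the components of §2.1.1 and §2.3, not the actual model), including the residual connection between GCN layers.

    import numpy as np

    rng = np.random.default_rng(3)
    T, d = 6, 8                                     # toy sentence length and state size

    def birnn(X):
        # Stand-in for the BiGRU of Section 2.1.1: a fixed linear map plus tanh,
        # just so the example runs; the real encoder is recurrent.
        return np.tanh(X @ (rng.normal(size=(X.shape[1], d)) * 0.1))

    def syntactic_gcn_layer(H, A):
        # Stand-in for the gated, labeled GCN layer of Section 2.3: a plain
        # unlabeled propagation step keeps the sketch short.
        W = rng.normal(size=(d, d)) * 0.1
        return np.maximum(0, A @ (H @ W))

    X = rng.normal(size=(T, d))                     # source word embeddings
    A = np.eye(T)                                   # dependency graph incl. self-loops
    for head, dep in [(1, 0), (2, 1), (4, 3), (2, 4)]:
        A[head, dep] = A[dep, head] = 1.0

    H = birnn(X)                                    # BiRNN states feed the GCN
    H1 = syntactic_gcn_layer(H, A)                  # first GCN layer
    H2 = H1 + syntactic_gcn_layer(H1, A)            # second layer + residual connection

    print(H2.shape)                                 # (6, 8): encoder states for attention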
4 Experiments

Experiments are performed using the Neural Monkey toolkit[3] (Helcl and Libovický, 2017), which implements the model of Bahdanau et al. (2015) in TensorFlow. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 (0.0002 for CNN models).[4] The batch size is set to 80. Between layers we apply dropout with a probability of 0.2, and in experiments with GCNs[5] we use the same value for edge dropout. We train for 45 epochs, evaluating the BLEU performance of the model every epoch on the validation set. For testing, we select the model with the highest validation BLEU. L2 regularization is used with a value of 10^-8. All model selection (including hyperparameter selection) was performed on the validation set. In all experiments we obtain translations using a greedy decoder, i.e. we select the output token with the highest probability at each time step.

We describe an artificial experiment in §4.1 and MT experiments in §4.2.

[3] https://github.com/ufal/neuralmonkey
[4] Like Gehring et al. (2016), we note that Adam is too aggressive for CNN models, hence we use a lower learning rate.
[5] GCN code at https://github.com/bastings/neuralmonkey
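As a side illustration of the greedy decoding used at test time (a sketch with an invented decoder_step standing in for the attention-based GRU decoder; only the arg-max selection loop reflects the procedure described above):

    import numpy as np

    rng = np.random.default_rng(4)
    vocab_size, hidden, eos_id, max_len = 20, 16, 0, 30

    def decoder_step(state, prev_token, encoder_states):
        # Hypothetical stand-in for one step of the attentional GRU decoder:
        # returns a next-word distribution and the updated decoder state.
        state = np.tanh(state + encoder_states.mean(axis=0) + prev_token)
        logits = state @ rng.normal(size=(hidden, vocab_size))
        probs = np.exp(logits - logits.max())
        return probs / probs.sum(), state

    encoder_states = rng.normal(size=(6, hidden))        # from any of the encoders above
    state, token, output = np.zeros(hidden), 1, []       # start from a BOS token (id 1)

    for _ in range(max_len):
        probs, state = decoder_step(state, token, encoder_states)
        token = int(np.argmax(probs))                    # greedy: highest-probability token
        if token == eos_id:
            break
        output.append(token)

    print(output)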
4.1 Reordering artificial sequences

Our goal here is to provide an intuition for the capabilities of GCNs. We define a reordering task where randomly permuted sequences need to be put back into the original order. We encode the original order using edges, and test if GCNs can successfully exploit them. Note that this task is not meant to provide a fair comparison to RNNs: the input (besides the edges) simply does not carry any information about the original ordering, so RNNs cannot possibly solve this task.

Data. From a vocabulary of 26 types, we generate random sequences of 3-10 tokens. We then randomly permute them, pointing every token to its original predecessor with a label sampled from a set of 5 labels. Additionally, we point every token to an arbitrary position in the sequence with a label from a distinct set of 5 'fake' labels. We sample 25000 training and 1000 validation sequences.

Model. We use the BiRNN + GCN model, i.e. a bidirectional GRU with a 1-layer GCN on top. We use 32, 64 and 128 units for embeddings, GRU units and GCN layers, respectively.

Results. After 6 epochs of training, the model learns to put permuted sequences back into order, reaching a validation BLEU of 99.2. Figure 3 shows that the mean values of the bias terms of the gates (i.e. b̂) for real and fake edges are far apart, suggesting that the GCN learns to distinguish them. Interestingly, this illustrates why edge-wise gating is beneficial. A gate-less model would not understand which of the two outgoing arcs is fake and which is genuine, because only the biases b would then be label-dependent. Consequently, it would only do a mediocre job at reordering. Although using label-specific matrices W would also help, this would not scale to the real scenario (see §2.3).

Figure 3: Mean values of gate bias terms for real (useful) labels and for fake (non-useful) labels suggest the GCN learns to distinguish them. (Plot of mean gate bias over training steps ×1000, with curves for real and fake edges.)
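A sketch of how such training examples could be generated (an illustrative reconstruction of the description above, not the authors' data script; the label names are invented):

    import random
    import string

    random.seed(0)
    REAL_LABELS = [f"real{i}" for i in range(5)]
    FAKE_LABELS = [f"fake{i}" for i in range(5)]

    def make_example():
        length = random.randint(3, 10)
        original = random.choices(string.ascii_lowercase, k=length)   # 26 types

        order = list(range(length))
        random.shuffle(order)                       # positions after permutation
        permuted = [original[i] for i in order]

        edges = []
        for pos, orig_idx in enumerate(order):
            if orig_idx > 0:                        # point to the original predecessor
                pred_pos = order.index(orig_idx - 1)
                edges.append((pos, pred_pos, random.choice(REAL_LABELS)))
            # additionally, point to an arbitrary position with a 'fake' label
            edges.append((pos, random.randrange(length), random.choice(FAKE_LABELS)))
        return permuted, edges, original            # original is the reordering target

    permuted, edges, target = make_example()
    print(permuted, target)
    print(edges[:4])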
4.2 Machine Translation

Data. For our experiments we use the En-De and En-Cs News Commentary v11 data from the WMT16 translation task.[6] For En-De we also train on the full WMT16 data set. As our validation set and test set we use newstest2015 and newstest2016, respectively.

Pre-processing. The English sides of the corpora are tokenized and parsed into dependency trees by SyntaxNet,[7] using the pre-trained Parsey McParseface model.[8] The Czech and German sides are tokenized using the Moses tokenizer.[9] Sentence pairs where either side is longer than 50 words are filtered out after tokenization.

Vocabularies. For the English sides, we construct vocabularies from all words except those with a training set frequency smaller than three. For Czech and German, to deal with rare words and phenomena such as inflection and compounding, we learn byte-pair encodings (BPE) as described by Sennrich et al. (2016b). Given the size of our data set, and following Wu et al. (2016), we use 8000 BPE merges to obtain robust frequencies for our subword units (16000 merges for the full data experiment). Data set statistics are summarized in Table 1 and vocabulary sizes in Table 2.

                             Train      Val.   Test
    English-German          226822     2169   2999
    English-German (full)   4500966    2169   2999
    English-Czech           181112     2656   2999

Table 1: The number of sentences in our data sets.

                             Source    Target
    English-German           37824     8099 (BPE)
    English-German (full)    50000     16000 (BPE)
    English-Czech             33786     8116 (BPE)

Table 2: Vocabulary sizes.

[6] http://www.statmt.org/wmt16/translation-task.html
[7] https://github.com/tensorflow/models/tree/master/syntaxnet
[8] The dependency parses used can be reproduced with the syntaxnet/demo.sh shell script.
[9] https://github.com/moses-smt/mosesdecoder
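For the English side, the frequency-thresholded vocabulary construction might be sketched as follows (illustrative only: the special tokens and helper name are assumptions, and the real pipeline additionally applies BPE on the target side):

    from collections import Counter

    MIN_FREQ = 3
    SPECIALS = ["<pad>", "<unk>", "<s>", "</s>"]

    def build_vocab(tokenized_sentences, min_freq=MIN_FREQ):
        counts = Counter(tok for sent in tokenized_sentences for tok in sent)
        # keep only words seen at least min_freq times in the training data
        words = [w for w, c in counts.most_common() if c >= min_freq]
        return {w: i for i, w in enumerate(SPECIALS + words)}

    train = [["the", "monkey", "eats", "a", "banana"],
             ["the", "monkey", "eats", "the", "banana"],
             ["a", "monkey", "sleeps", "."]]
    vocab = build_vocab(train, min_freq=2)
    print(vocab)   # rare tokens ("sleeps", ".") would map to <unk> at lookup time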
Hyperparameters. We use 256 units for word embeddings, 512 units for GRUs (800 for the En-De full data set experiment), and 512 units for convolutional layers (or equivalently, 512 'channels'). The dimensionality of the GCN layers is equivalent to the dimensionality of their input. We report results for 2-layer GCNs, as we find them most effective (see the ablation studies below).

Baselines. We provide three baselines, each with a different encoder: a bag-of-words encoder, a convolutional encoder with window size w = 5, and a BiRNN. See §2.1.1 for details.

Evaluation. We report (cased) BLEU results (Papineni et al., 2002) using multi-bleu, as well as Kendall τ reordering scores.[10]

[10] See Stanojević and Sima'an (2015). TER (Snover et al., 2006) and BEER (Stanojević and Sima'an, 2014) scores, omitted due to space considerations, are consistent with the reported results.

4.2.1 Results

English-German. Table 3 shows test results on English-German. Unsurprisingly, the bag-of-words baseline performs the worst. We expected the BoW+GCN model to make easy gains over this baseline, which is indeed what happens. The CNN baseline reaches a higher BLEU4 score than the BoW models, but interestingly its BLEU1 score is lower than that of the BoW+GCN model. The CNN+GCN model improves over the CNN baseline by +1.9 and +1.1 for BLEU1 and BLEU4, respectively. The BiRNN, the strongest baseline, reaches a BLEU4 of 14.9. Interestingly, GCNs still manage to improve the result by +2.3 BLEU1 and +1.2 BLEU4 points. Finally, we observe a big jump in BLEU4 by using the full data set and beam search (beam 12). The BiRNN now reaches 23.3, while adding a GCN achieves a score of 23.9.

                    Kendall   BLEU1   BLEU4
    BoW             0.3352    40.6     9.5
    + GCN           0.3520    44.9    12.2
    CNN             0.3601    42.8    12.6
    + GCN           0.3777    44.7    13.7
    BiRNN           0.3984    45.2    14.9
    + GCN           0.4089    47.5    16.1
    BiRNN (full)    0.5440    53.0    23.3
    + GCN           0.5555    54.6    23.9

Table 3: Test results for English-German.

English-Czech. Table 4 shows test results on English-Czech. While it is difficult to obtain high absolute BLEU scores on this dataset, we can still see similar relative improvements. Again the BoW baseline scores worst, with the BoW+GCN easily beating that result. The CNN baseline scores a BLEU4 of 8.1, but the CNN+GCN improves on that, this time by +1.0 and +0.6 for BLEU1 and BLEU4, respectively. Interestingly, BLEU1 scores for the BoW+GCN and CNN+GCN models are higher than both baselines so far. Finally, the BiRNN baseline scores a BLEU4 of 8.9, but it is again beaten by the BiRNN+GCN model, with +1.9 BLEU1 and +0.7 BLEU4.

                    Kendall   BLEU1   BLEU4
    BoW             0.2498    32.9     6.0
    + GCN           0.2561    35.4     7.5
    CNN             0.2756    35.1     8.1
    + GCN           0.2850    36.1     8.7
    BiRNN           0.2961    36.9     8.9
    + GCN           0.3046    38.8     9.6

Table 4: Test results for English-Czech.

Effect of GCN layers. How many GCN layers do we need? Every layer gives us an extra hop in the graph and expands the syntactic neighborhood of a word. Table 5 shows validation BLEU performance as a function of the number of GCN layers. For English-German, using a 1-layer GCN improves BLEU1, but surprisingly has little effect on BLEU4. Adding an additional layer gives improvements on both BLEU1 and BLEU4 of +1.3 and +0.7, respectively. For English-Czech, performance increases with each added GCN layer.

                      En-De            En-Cs
                  BLEU1   BLEU4    BLEU1   BLEU4
    BiRNN          44.2    14.1     37.8    8.9
    + GCN (1L)     45.0    14.1     38.3    9.6
    + GCN (2L)     46.3    14.8     39.6    9.9

Table 5: Validation BLEU for English-German and English-Czech for 1- and 2-layer GCNs.

Effect of sentence length. We hypothesize that GCNs should be more beneficial for longer sentences: these are likely to contain long-distance syntactic dependencies which may not be adequately captured by RNNs but are directly encoded in GCNs. To test this, we partition the validation data into five buckets and calculate BLEU for each of them. Figure 4 shows that GCN-based models outperform their respective baselines rather uniformly across all buckets. This is a surprising result. One explanation may be that syntactic parses are noisier for longer sentences, and this prevents us from obtaining extra improvements with GCNs.

Figure 4: BLEU per sentence-length bucket for the CNN, CNN + GCN, BiRNN and BiRNN + GCN models.
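The per-length analysis could be approximated along these lines (a sketch that assumes the sacrebleu package rather than the multi-bleu script used in the paper; bucketing by source length into equal-size buckets is our reading of the setup):

    import sacrebleu

    def bleu_by_length(sources, hypotheses, references, n_buckets=5):
        # Sort sentence triples by source length and split into equal-size buckets.
        triples = sorted(zip(sources, hypotheses, references),
                         key=lambda t: len(t[0].split()))
        size = len(triples) // n_buckets
        scores = []
        for b in range(n_buckets):
            chunk = triples[b * size:(b + 1) * size] if b < n_buckets - 1 else triples[b * size:]
            _, hyps, refs = zip(*chunk)
            scores.append(sacrebleu.corpus_bleu(list(hyps), [list(refs)]).score)
        return scores

    # Toy usage with three "sentences"; real inputs would be the validation set.
    src = ["a b c d", "a b c d e", "a b c d e f g"]
    hyp = ["w x y z", "w x y z v", "w x y z v u t"]
    ref = ["w x y z", "w x y z v", "w x y z v u t"]
    print(bleu_by_length(src, hyp, ref, n_buckets=3))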
Hiero trees are not syntactically aware, but instead constrained by symmetrized word alignments. Aharoni and Goldberg (2017) propose neural string-to-tree by predicting linearized parse trees.

Multi-task Learning. Sharing NMT parameters with a syntactic parser is a popular approach to obtaining syntactically-aware representations. Luong et al. (2015a) predict linearized constituency parses as an additional task. Eriguchi et al. (2017) multi-task with a target-side RNNG parser (Dyer et al., 2016) and improve on various language pairs with English on the target side. Nadejde et al. (2017) multi-task with CCG tagging, and also integrate syntax on the target side by predicting a sequence of words interleaved with CCG supertags.

Latent structure. Hashimoto and Tsuruoka (2017) add a syntax-inspired encoder on top of a BiLSTM layer. They encode source words as a learned average of potential parents, emulating a relaxed dependency tree. While their model is trained purely on translation data, they also experiment with pre-training the encoder using tree-
… beyond syntax, by using semantic annotations such as SRL and AMR, and co-reference chains.

Acknowledgments

We would like to thank Michael Schlichtkrull and Thomas Kipf for their suggestions and comments. This work was supported by the European Research Council (ERC StG BroadSem 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518, NWO VICI 277-89-002).

References

Roee Aharoni and Yoav Goldberg. 2017. Towards String-to-Tree Neural Machine Translation. ArXiv e-prints.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2012. Abstract Meaning Representation (AMR) 1.0 specification. In Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

David Chiang. 2010. Learning to translate with source and target syntax. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1443–1452, Uppsala, Sweden. Association for Computational Linguistics.

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29, pages 3837–3845.

David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California. Association for Computational Linguistics.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany. Association for Computational Linguistics.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to Parse and Translate Improves Neural Machine Translation. ArXiv e-prints.

Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. 2016. A convolutional encoder model for neural machine translation. CoRR, abs/1611.02344.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural Message Passing for Quantum Chemistry. ArXiv e-prints.

Kazuma Hashimoto and Yoshimasa Tsuruoka. 2017. Neural machine translation with source-side latent graph parsing. CoRR, abs/1702.02265.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Jindřich Helcl and Jindřich Libovický. 2017. Neural Monkey: An open-source tool for sequence learning. The Prague Bulletin of Mathematical Linguistics, (107):5–17.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Ozan Irsoy and Claire Cardie. 2014. Opinion Mining with Deep Recurrent Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 720–728, Doha, Qatar. Association for Computational Linguistics.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR, abs/1610.10099.

Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. 2016. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015a. Multi-task Sequence to Sequence Learning. CoRR, abs/1511.06114.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Diego Marcheggiani and Ivan Titov. 2017. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark. Association for Computational Linguistics.

Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, and Alexandra Birch. 2017. Syntax-aware Neural Machine Translation Using CCG. ArXiv e-prints.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. pages 311–318.

Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2017. Modeling Relational Data with Graph Convolutional Networks. ArXiv e-prints.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Rico Sennrich and Barry Haddow. 2016. Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation (WMT16), volume abs/1606.02892.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, pages 371–376, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, Texas. Association for Computational Linguistics.

David Smith and Jason Eisner. 2006. Quasi-synchronous grammars: Alignment by soft projection of syntactic dependencies. In Proceedings of the Workshop on Statistical Machine Translation, pages 23–30, New York City. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Felix Stahlberg, Eva Hasler, Aurelien Waite, and Bill Byrne. 2016. Syntactically guided neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 299–305, Berlin, Germany. Association for Computational Linguistics.

Miloš Stanojević and Khalil Sima'an. 2015. Evaluating MT systems with BEER. The Prague Bulletin of Mathematical Linguistics, 104(1):17–26.

Miloš Stanojević and Khalil Sima'an. 2014. Fitting sentence level translation evaluation with many dense features. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 202–206, Doha, Qatar. Association for Computational Linguistics.

Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL 2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of CoNLL.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Neural Information Processing Systems (NIPS), pages 3104–3112.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2016. Learning to compose words into sentences with reinforcement learning. CoRR, abs/1611.09100.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of the Workshop on Statistical Machine Translation, StatMT '06, pages 138–141, Stroudsburg, PA, USA. Association for Computational Linguistics.