Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks
Siddharth Dalmia  Brian Yan  Vikas Raunak  Florian Metze  Shinji Watanabe
Language Technologies Institute, Carnegie Mellon University, USA
{sdalmia,byan}@cs.cmu.edu

Abstract

End-to-end approaches for sequence tasks are becoming increasingly popular. Yet for complex sequence tasks, like speech translation, systems that cascade several models trained on sub-tasks have been shown to be superior, suggesting that the compositionality of cascaded systems simplifies learning and enables sophisticated search capabilities. In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks. These hidden intermediates can be improved using beam search to enhance the overall performance, and they can also incorporate external models at intermediate stages of the network to re-score or adapt towards out-of-domain data. One instance of the proposed framework is a Multi-Decoder model for speech translation that extracts the searchable hidden intermediates from a speech recognition sub-task. The model demonstrates the aforementioned benefits and outperforms the previous state-of-the-art by around +6 and +3 BLEU on the two test sets of Fisher-CallHome, and by around +3 and +4 BLEU on the English-German and English-French test sets of MuST-C.¹

¹All code and models are released as part of the ESPnet toolkit: https://github.com/espnet/espnet.

1 Introduction

The principle of compositionality loosely states that a complex whole is composed of its parts and the rules by which those parts are combined (Lake and Baroni, 2018). This principle is present in engineering, where task decomposition of a complex system is required to assess and optimize task allocations (Levis et al., 1994), and in natural language, where paragraph coherence and discourse analysis rely on decomposition into sentences (Johnson, 1992; Kuo, 1995) and sentence-level semantics relies on decomposition into lexical units (Liu et al., 2020b).

Similarly, many sequence-to-sequence tasks that convert one sequence into another (Sutskever et al., 2014) can be decomposed into simpler sequence sub-tasks in order to reduce the overall complexity. For example, speech translation systems, which seek to process speech in one language and output text in another language, can be naturally decomposed into the transcription of source-language audio through automatic speech recognition (ASR) and translation into the target language through machine translation (MT). Such cascaded approaches have been widely used to build practical systems for a variety of sequence tasks like hybrid ASR (Hinton et al., 2012), phrase-based MT (Koehn et al., 2007), and cascaded ASR-MT systems for speech translation (ST) (Pham et al., 2019).

End-to-end sequence models like encoder-decoder models (Bahdanau et al., 2015; Vaswani et al., 2017) are attractive in part due to their simple design and the reduced need for hand-crafted features. However, studies have shown mixed results compared to cascaded models, particularly for complex sequence tasks like speech translation (Inaguma et al., 2020) and spoken language understanding (Coucke et al., 2018). Although direct target sequence prediction avoids the issue of error propagation from one system to another in cascaded approaches (Tzoukermann and Miller, 2018), there are many attractive properties of cascaded systems, missing in end-to-end approaches, that are useful in complex sequence tasks.

In particular, we are interested in (1) the strong search capabilities of cascaded systems that compose the final task output from individual system predictions (Mohri et al., 2002; Kumar et al., 2006; Beck et al., 2019), (2) the ability to incorporate external models to re-score each individual system (Och and Ney, 2002; Huang and Chiang, 2007), (3) the ability to easily adapt individual components towards out-of-domain data (Koehn and Schroeder, 2007; Peddinti et al., 2015), and finally
(4) the ability to monitor the performance of the individual systems on the decomposed sub-tasks (Tillmann and Ney, 2003; Meyer et al., 2016).

In this paper, we seek to incorporate these properties of cascaded systems into end-to-end sequence models. We first propose a generic framework to learn searchable hidden intermediates using an auto-regressive encoder-decoder model for any decomposable sequence task (§3). We then apply this approach to speech translation, where the intermediate stage is the output of ASR, by passing continuous hidden representations of discrete transcript sequences from the ASR sub-net decoder to the MT sub-net encoder. By doing so, we gain the ability to use beam search with optional external model re-scoring on the hidden intermediates, while maintaining end-to-end differentiability. Next, we suggest mitigation strategies for the error propagation issues inherited from decomposition.

We show the efficacy of searchable intermediate representations in our proposed model, called the Multi-Decoder, on speech translation with 5.4 and 2.8 BLEU score improvements over the previous state-of-the-art for the Fisher and CallHome test sets respectively (§6). We extend these improvements by an average of 0.5 BLEU score through the aforementioned benefit of re-scoring the intermediate search with external models trained on the same dataset. We also show a method for monitoring sub-net performance using oracle intermediates that are void of search errors (§6.1). Finally, we show how these models can adapt to out-of-domain speech translation datasets, how our approach can be generalized to other sequence tasks like speech recognition, and how the benefits of decomposition persist even for larger corpora like MuST-C (§6.2).

2 Background and Motivation

2.1 Compositionality in Sequence Models

The probabilistic space of a sequence is combinatorial in nature, such that a sentence of L words from a fixed vocabulary V would have an output space S of size |V|^L. In order to deal with this combinatorial output space, an output sentence is decomposed into labeled target tokens, y = (y_1, y_2, ..., y_L), where y_l ∈ V:

    P(y | x) = ∏_{l=1}^{L} P(y_l | x, y_{1:l-1})

An auto-regressive encoder-decoder model uses the above probabilistic decomposition in sequence-to-sequence tasks to learn next word prediction, which outputs a distribution over the next target token y_l given the previous tokens y_{1:l-1} and the input sequence x = (x_1, x_2, ..., x_T), where T is the input sequence length. In the next sub-section we detail the training and inference of these models.

2.2 Auto-regressive Encoder-Decoder Models

Training: In an auto-regressive encoder-decoder model, the ENCODER maps the input sequence x to a sequence of continuous hidden representations h^E = (h^E_1, h^E_2, ..., h^E_T), where h^E_t ∈ R^d. The DECODER then auto-regressively maps h^E and the preceding ground-truth output tokens, ŷ_{1:l-1}, to ĥ^D_l, where ĥ^D_l ∈ R^d. The sequence of decoder hidden representations forms ĥ^D = (ĥ^D_1, ĥ^D_2, ..., ĥ^D_L), and the likelihood of each output token y_l is given by SOFTMAXOUT, which denotes an affine projection of ĥ^D_l to V followed by a softmax function:

    h^E = ENCODER(x)
    ĥ^D_l = DECODER(h^E, ŷ_{1:l-1})                          (1)
    P(y_l | ŷ_{1:l-1}, h^E) = SOFTMAXOUT(ĥ^D_l)              (2)

During training, the DECODER performs token classification for next word prediction by considering only the ground-truth sequences for previous tokens, ŷ. We refer to this ĥ^D as the oracle decoder representations, which will be discussed later.

Inference: During inference, we can maximize the likelihood of the entire sequence from the output space S by composing the conditional probabilities of each step for the L tokens in the sequence:

    h^D_l = DECODER(h^E, y_{1:l-1})                          (3)
    P(y_l | x, y_{1:l-1}) = SOFTMAXOUT(h^D_l)
    ỹ = argmax_{y ∈ S} ∏_{i=1}^{L} P(y_i | x, y_{1:i-1})     (4)

This is an intractable search problem, and it can be approximated by either greedily choosing the argmax at each step or using a search algorithm like beam search to approximate ỹ. Beam search (Reddy, 1988) generates candidates at each step and prunes the search space to a tractable beam size of B most likely sequences. As B → ∞, the beam search result becomes equivalent to equation 4.

    GREEDYSEARCH := argmax_{y_l} P(y_l | x, y_{1:l-1})
    BEAMSEARCH := BEAM(P(y_l | x, y_{1:l-1}))
(a) Multi-Decoder ST Model   (b) Multi-Sequence Attention

Figure 1: The left side presents the schematics and the information flow of our proposed framework applied to ST, in a model we call the Multi-Decoder. Our model decomposes ST into ASR and MT sub-nets, each of which consists of an encoder and a decoder. The right side displays a Multi-Sequence Attention variant of the DECODER_ST that is conditioned on both speech information via the ENCODER_ASR and transcription information via the ENCODER_ST.

In approximate search for auto-regressive models, like beam search, the DECODER receives alternate candidates of previous tokens to find candidates with a higher likelihood as an overall sequence. This also allows for the use of external models like language models (LM) or connectionist temporal classification (CTC) models for re-scoring candidates (Hori et al., 2017).

3 Proposed Framework

In this section, we present a general framework to exploit natural decompositions in sequence tasks which seek to predict some output C from an input sequence A. If there is an intermediate sequence B for which A → B sequence transduction followed by B → C prediction achieves the original task, then the original A → C task is decomposable. In other words, if we can learn P(B | A), then we can learn the overall task of P(C | A) through max_B (P(C | A, B) P(B | A)), approximated using Viterbi search. We define a first encoder-decoder SUB_{A→B}NET to map an input sequence A to a sequence of decoder hidden states, h^{D_B}. Then we define a subsequent SUB_{B→C}NET to map h^{D_B} to the final probabilistic output space of C. Therefore, we call h^{D_B} hidden intermediates. The following equations show the two sub-networks of our framework, SUB_{A→B}NET and SUB_{B→C}NET, which can be trained end-to-end while also exploiting compositionality in sequence tasks.²

SUB_{A→B}NET:
    h^E = ENCODER_A(A)
    ĥ^{D_B}_l = DECODER_B(h^E, ŷ^B_{1:l-1})
    P(y^B_l | ŷ^B_{1:l-1}, h^E) = SOFTMAXOUT(ĥ^{D_B}_l)    (5)

SUB_{B→C}NET:
    P(C | ĥ^{D_B}) = SUB_{B→C}NET(ĥ^{D_B})                 (6)

Note that the final prediction, given by equation 6, does not need to be a sequence and can be a categorical class, as in spoken language understanding tasks. Next we show how the hidden intermediates become searchable during inference.

²Note that this framework does not use locally-normalized softmax distributions but rather the hidden representations, thereby avoiding label bias issues when combining multiple sub-systems (Bottou et al., 1997; Wiseman and Rush, 2016).

3.1 Searchable Hidden Intermediates

As stated in §2.2, approximate search algorithms maximize the likelihood, P(y | x), of the entire sequence by considering different candidates y_l at each step. Candidate-based search, particularly in auto-regressive encoder-decoder models, also affects the decoder hidden representations, h^D, as these are directly dependent on the previous candidates (refer to equations 1 and 3). This implies that by searching for better approximations of the previously predicted tokens, y_{l-1} = (y_BEAM)_{l-1}, we also improve the decoder hidden representation for the next token, h^D_l = (h^D_BEAM)_l. As y_BEAM → ŷ, the decoder hidden representations tend to the oracle decoder representations that have only errors from next word prediction, h^D_BEAM → ĥ^D. A perfect search is analogous to choosing the ground truth ŷ at each step, which would yield ĥ^D.

We apply this beam search to the hidden intermediates, thereby approximating ĥ^{D_B} with h^{D_B}_BEAM. This process is illustrated in Algorithm 1, which shows beam search for h^{D_B}_BEAM that is subsequently passed to the SUB_{B→C}NET.³
In line 7, we show how an external model like an LM or a CTC model can be used to generate an alternate sequence likelihood, P_EXT(y^B_l), which can be combined with the SUB_{A→B}NET likelihood, P_{A→B}(y^B_l | x), with a tunable parameter λ.

Algorithm 1 — Beam Search for Hidden Intermediates: We perform beam search to approximate the most likely sequence for the sub-task A → B, y^B_BEAM, while collecting the corresponding DECODER_B hidden representations, h^{D_B}_BEAM. The output h^{D_B}_BEAM is passed to the final sub-network to predict the final output C, and y^B_BEAM is used for monitoring performance on predicting B.

 1: Initialize: BEAM ← {sos}; k ← beam size
 2: h^{E_A} ← ENCODER_A(x)
 3: for l = 1 to maxSTEPS do
 4:   for y^B_{l-1} ∈ BEAM do
 5:     h^{D_B}_l ← DECODER_B(h^{E_A}, y^B_{l-1})
 6:     for y^B_l ∈ y^B_{l-1} + {V} do
 7:       s_l ← P_{A→B}(y^B_l | x)^{1-λ} · P_EXT(y^B_l)^λ
 8:       H ← (s_l, y^B_l, h^{D_B}_l)
 9:     end for
10:   end for
11:   BEAM ← argk max(H)
12: end for
13: (s^B, y^B_BEAM, h^{D_B}_BEAM) ← argmax(BEAM)
14: Return y^B_BEAM → SUB_{A→B}NET monitoring
15: Return h^{D_B}_BEAM → final SUB_{B→C}NET

³The algorithm shown only considers the single top approximation of the search; however, with added time-complexity, the final task prediction improves with the n-best h^{D_B}_BEAM for selecting the best resultant C.

We can monitor the performance of the SUB_{A→B}NET by comparing the decoded intermediate sequence y^B_BEAM to the ground truth ŷ^B. We can also monitor the SUB_{B→C}NET performance by using the aforementioned oracle representations of the intermediates, ĥ^{D_B}, which can be obtained by feeding the ground truth ŷ^B to DECODER_B. By passing ĥ^{D_B} to the SUB_{B→C}NET, we can observe its performance in a vacuum, i.e., void of search errors in the hidden intermediates.

3.2 Multi-Decoder Model

In order to show the applicability of our end-to-end framework, we propose our Multi-Decoder model for speech translation. This model predicts a sequence of text translations y^ST from an input sequence of speech x and uses a sequence of text transcriptions y^ASR as an intermediate. In this case, the SUB_{A→B}NET in equation 5 is specified as the ASR sub-net and the SUB_{B→C}NET in equation 6 is specified as the MT sub-net. Since the MT sub-net is also a sequence prediction task, both sub-nets are encoder-decoder models in our architecture (Bahdanau et al., 2015; Vaswani et al., 2017). In Figure 1 we illustrate the schematics of our transformer-based Multi-Decoder ST model, which can be summarized as follows:

    h^{E_ASR} = ENCODER_ASR(x)                                   (7)
    ĥ^{D_ASR}_l = DECODER_ASR(h^{E_ASR}, ŷ^{ASR}_{1:l-1})        (8)
    h^{E_ST} = ENCODER_ST(ĥ^{D_ASR})                             (9)
    ĥ^{D_ST}_l = DECODER_ST(h^{E_ST}, ŷ^{ST}_{1:l-1})            (10)

As we can see from equations 9 and 10, the MT sub-network attends only to the decoder representations, ĥ^{D_ASR}, of the ASR sub-network, which could lead to error propagation from the ASR sub-network to the MT sub-network, similar to cascaded systems, as mentioned in §1. To alleviate this problem, we modify equation 10 such that DECODER_ST attends to both h^{E_ST} and h^{E_ASR}:

    ĥ^{D_ST-SA}_l = DECODER_ST-SA(h^{E_ST}, h^{E_ASR}, ŷ^{ST}_{1:l-1})    (11)

We use the multi-sequence cross-attention discussed by Helcl et al. (2018), shown on the right side of Figure 1, to condition the final outputs generated by ĥ^{D_ST} on both speech and transcript information, in an attempt to allow our network to recover from intermediate mistakes during inference. We call this model the Multi-Decoder w/ Speech-Attention.
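As a reading aid, here is a condensed PyTorch-style sketch of the forward pass in equations 7-11; the layer sizes follow Section 5, but causal masks, positional encodings, the convolutional speech frontend, and other ESPnet implementation details are omitted, and the Speech-Attention variant is only indicated in a comment.

```python
import torch
import torch.nn as nn

def enc_layer(d):  # 2048 FF dim and 4 heads follow Section 5
    return nn.TransformerEncoderLayer(d, nhead=4, dim_feedforward=2048,
                                      batch_first=True)

def dec_layer(d):
    return nn.TransformerDecoderLayer(d, nhead=4, dim_feedforward=2048,
                                      batch_first=True)

class MultiDecoderST(nn.Module):
    """Sketch of equations 7-11; a simplification, not the released code."""

    def __init__(self, d=256, asr_vocab=1000, st_vocab=1000):
        super().__init__()
        self.encoder_asr = nn.TransformerEncoder(enc_layer(d), num_layers=12)
        self.decoder_asr = nn.TransformerDecoder(dec_layer(d), num_layers=6)
        self.encoder_st = nn.TransformerEncoder(enc_layer(d), num_layers=2)
        self.decoder_st = nn.TransformerDecoder(dec_layer(d), num_layers=6)
        self.asr_embed = nn.Embedding(asr_vocab, d)
        self.st_embed = nn.Embedding(st_vocab, d)
        self.asr_out = nn.Linear(d, asr_vocab)  # SOFTMAXOUT for ASR
        self.st_out = nn.Linear(d, st_vocab)    # SOFTMAXOUT for ST

    def forward(self, speech, y_asr, y_st):
        # speech: (batch, frames, d) features; y_*: (batch, len) token ids
        h_enc_asr = self.encoder_asr(speech)                            # eq. 7
        h_dec_asr = self.decoder_asr(self.asr_embed(y_asr), h_enc_asr)  # eq. 8
        # Hidden intermediates: continuous ASR decoder states are passed
        # on to the MT sub-net, not discrete transcripts.
        h_enc_st = self.encoder_st(h_dec_asr)                           # eq. 9
        h_dec_st = self.decoder_st(self.st_embed(y_st), h_enc_st)       # eq. 10
        # The Speech-Attention variant (eq. 11) would additionally
        # cross-attend to h_enc_asr inside decoder_st; nn.TransformerDecoder
        # takes a single memory, so that variant needs a custom layer with
        # two cross-attention blocks (multi-sequence attention).
        return self.asr_out(h_dec_asr), self.st_out(h_dec_st)
```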
4 Baseline Encoder-Decoder Model

For our baseline model, we use an end-to-end encoder-decoder (Enc-Dec) ST model with ASR joint training (Inaguma et al., 2020) as an auxiliary loss on the speech encoder. In other words, the model consumes speech input using the ENCODER_ASR to produce h^{E_ASR}, which is used for cross-attention by both the DECODER_ASR and the DECODER_ST. Using the decomposed ASR task as an auxiliary loss also helps the baseline Enc-Dec model and provides strong baseline performance, as we will see in Section 6.
5 Data and Experimental Setup

Data: We demonstrate the efficacy of our proposed approach on ST in the Fisher-CallHome corpus (Post et al., 2013), which contains 170 hours of Spanish conversational telephone speech, transcriptions, and English translations. All punctuation except apostrophes was removed, and results are reported in terms of detokenized case-insensitive BLEU (Papineni et al., 2002; Post, 2018). We compute BLEU using the 4 references in Fisher (dev, dev2, and test) and the single reference in CallHome (dev and test) (Post et al., 2013; Kumar et al., 2014; Weiss et al., 2017). We use a joint source and target vocabulary of 1K byte pair encoding (BPE) units (Kudo and Richardson, 2018).

We prepare the corpus using the ESPnet library and follow the standard data preparation, where inputs are globally mean-variance normalized log-mel filterbank and pitch features from up-sampled 16kHz audio (Watanabe et al., 2018). We also apply speed perturbations of 0.9 and 1.1 and the SS SpecAugment policy (Park et al., 2019).

Baseline Configuration: All of our models are implemented using the ESPnet library and trained on 3 NVIDIA Titan 2080Ti GPUs for ≈12 hours. For the Baseline Enc-Dec, discussed in §4, we use an ENCODER_ASR consisting of convolutional sub-sampling by a factor of 4 (Watanabe et al., 2018) and 12 transformer encoder blocks with 2048 feed-forward dimension, 256 attention dimension, and 4 attention heads. The DECODER_ASR and DECODER_ST both consist of 6 transformer decoder blocks with the same configuration as the ENCODER_ASR. There are 37.9M trainable parameters. We apply dropout of 0.1 to all components, detailed in the Appendix (A.1).

We train our models using an effective batch size of 384 utterances and use the Adam optimizer (Kingma and Ba, 2015) with an inverse square root decay learning rate schedule. We set the learning rate to 12.5, warmup steps to 25K, and epochs to 50. We use joint training with hybrid CTC/attention ASR (Watanabe et al., 2017) by setting mtl-alpha to 0.3 and asr-weight to 0.5 as defined by Watanabe et al. (2018). During inference, we perform beam search (Seki et al., 2019) on the ST sequences, using a beam size of 10, length penalty of 0.2, and max length ratio of 0.3 (Watanabe et al., 2018).
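For readers unfamiliar with these knobs, the sketch below shows one plausible reading of how mtl-alpha and asr-weight combine the three training losses; the helper is hypothetical and the exact weighting in the released recipe may differ.

```python
def joint_training_loss(loss_st_att, loss_asr_att, loss_asr_ctc,
                        mtl_alpha: float = 0.3, asr_weight: float = 0.5):
    """Hypothetical helper (not the toolkit's API): mtl-alpha interpolates
    the CTC vs. attention losses within ASR, and asr-weight interpolates
    the ASR and ST objectives."""
    loss_asr = mtl_alpha * loss_asr_ctc + (1.0 - mtl_alpha) * loss_asr_att
    return asr_weight * loss_asr + (1.0 - asr_weight) * loss_st_att
```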
Multi-Decoder Configuration: For the Multi-Decoder ST model, discussed in §3, we use the same transformer configuration as the baseline for the ENCODER_ASR, DECODER_ASR, and DECODER_ST. Additionally, the Multi-Decoder has an ENCODER_ST consisting of 2 transformer encoder blocks with the same configuration as the ENCODER_ASR, giving a total of 40.5M trainable parameters. The training configuration is also the same as for the baseline. For the Multi-Decoder w/ Speech-Attention model (42.1M trainable parameters), we increase the attention dropout of the ST decoder to 0.4 and the dropout on all other components of the ST decoder to 0.2, while keeping the dropout on the remaining components at 0.1. We verified that increasing the dropout does not help the vanilla Multi-Decoder ST model.

During inference, we perform beam search on both the ASR and ST output sequences, as discussed in §3. The ST beam search is identical to that of the baseline. For the intermediate ASR beam search, we use a beam size of 16, length penalty of 0.2, and max length ratio of 0.3. In some of our experiments, we also include fusion of a source-language LM with a 0.2 weight and CTC with a 0.3 weight to re-score the intermediate ASR beam search (Watanabe et al., 2017). For the Speech-Attention variant, we increase the LM weight to 0.4.

Note that the ST beam search configuration remains constant across our baseline and Multi-Decoder experiments, as our focus is on improving overall performance through searchable intermediate representations. Thus, the various re-scoring techniques applied to the ASR beam search are options newly enabled by our proposed architecture and are not used in the ST beam search.
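The re-scoring applied inside the intermediate ASR beam search can be pictured as a per-token score combination; the sketch below assumes the usual hybrid CTC/attention shallow-fusion form of Watanabe et al. (2017) with the weights from this section, and is not the exact released decoding code.

```python
def fused_asr_step_score(log_p_att: float, log_p_ctc: float, log_p_lm: float,
                         ctc_weight: float = 0.3, lm_weight: float = 0.2) -> float:
    """Combine the ASR decoder's log-probability for a candidate token with
    a CTC prefix score and a shallow-fused LM score (assumed interpolation
    form; weights follow Section 5)."""
    return ((1.0 - ctc_weight) * log_p_att
            + ctc_weight * log_p_ctc
            + lm_weight * log_p_lm)
```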
6 Results

Table 1 presents the overall ST performance (BLEU) of our proposed Multi-Decoder model. Our model improves by +2.9/+0.3 BLEU (Fisher/CallHome) over the best cascaded baseline and by +5.6/+1.5 BLEU over the best published end-to-end baselines. With Speech-Attention, our model improves by +3.4/+1.6 BLEU over the cascaded baselines and by +7.1/+2.8 BLEU over the encoder-decoder baselines. Both the Multi-Decoder and the Multi-Decoder w/ Speech-Attention are on average further improved by +0.9/+0.4 BLEU through ASR re-scoring.⁴

Table 1 also includes our implementation of the Baseline Enc-Dec model discussed in §4. In this way, we are able to make a fair comparison with our framework, as we control the model and inference configurations to be analogous. For instance, we keep the same search parameters for the final output in the baseline and the Multi-Decoder to demonstrate the impact of the intermediate beam search.

⁴We also evaluate our models using other MT metrics to supplement these results, as shown in the Appendix (A.2).

Model Type     Model Name              Uses Speech   Fisher                      CallHome
                                       Transcripts   dev(↑)   dev2(↑)   test(↑)  dev(↑)   test(↑)
Cascade        Inaguma et al. (2020)   ✓             41.5     43.5      42.2     19.6     19.8
Cascade        ESPnet ASR+MT (2018)    ✓             50.4     51.2      50.7     19.6     19.2
Enc-Dec        Weiss et al. (2017)♦    ✗             46.5     47.3      47.3     16.4     16.6
Enc-Dec        Weiss et al. (2017)♦    ✓             48.3     49.1      48.7     16.8     17.4
Enc-Dec        Inaguma et al. (2020)   ✓             46.6     47.6      46.5     16.8     16.8
Enc-Dec        Guo et al. (2021)       ✓             48.7     49.6      47.0     18.5     18.6
Enc-Dec        Our Implementation      ✓             49.6     50.9      49.5     19.1     18.2
Multi-Decoder  Our Proposed Model      ✓             52.7     53.3      52.6     20.5     20.1
Multi-Decoder  +ASR Re-scoring         ✓             53.3     54.2      53.7     21.1     20.8
Multi-Decoder  +Speech-Attention       ✓             54.6     54.6      54.1     21.7     21.4
Multi-Decoder  +ASR Re-scoring         ✓             55.2     55.2      55.0     21.7     21.5

Table 1: Results presenting the overall performance (BLEU) of our proposed Multi-Decoder model. Cascade and Enc-Dec results from previous papers and our own implementation of the Enc-Dec are shown for comparison. The best performing models are highlighted. ♦ Implemented with LSTM, while all others are Transformer-based.

Model               Overall ST(↑)   Sub-Net ASR(↓)   Sub-Net MT(↑)
Multi-Decoder       52.7            22.6             64.9
+Speech-Attention   54.6            22.4             66.6

Table 2: Results presenting the overall ST performance (BLEU) of our Multi-Decoder models, along with their sub-net ASR (% WER) and MT (BLEU) performances. All results are from the Fisher dev set.

Figure 2: Results studying the effect of different ASR beam sizes in the intermediate representation search on the overall ST performance (BLEU) and the ASR sub-net performance (% WER) for our Multi-Decoder model. A beam of 1 is the same as greedy search.

6.1 Benefits

6.1.1 Sub-network performance monitoring
An added benefit of our proposed approach over the Baseline Enc-Dec is the ability to monitor the individual performances of the ASR (% WER) and MT (BLEU) sub-nets, as shown in Table 2. The Multi-Decoder w/ Speech-Attention shows a greater MT sub-net performance than the Multi-Decoder as well as a slight improvement of the ASR sub-net, suggesting that ST can potentially help ASR.

6.1.2 Beam search for better intermediates
The overall ST performance improves when a higher beam size is used in the intermediate ASR search, and this increase can be attributed to the improved ASR sub-net performance. Figure 2 shows this trend across ASR beam sizes of 1, 4, 8, 10, and 16, while fixing the ST decoding beam size to 10. An ASR beam size of 1, which is a greedy search, results in lower ASR sub-net and overall ST performances. As beam sizes become larger, gains taper off, as can be seen between beam sizes of 10 and 16.

6.1.3 External models for better search
External models like CTC acoustic models and language models are commonly used for re-scoring encoder-decoder models (Hori et al., 2017), due to the difference in their modeling capabilities: CTC directly models transcripts while being conditionally independent of the other outputs given the input, and LMs predict the next token in a sequence. Both variants of the Multi-Decoder improve due to improved ASR sub-net performance using external CTC and LM models for re-scoring, as shown in Table 3. We use a recurrent neural network LM trained on the Fisher-CallHome Spanish transcripts with a dev perplexity of 18.8 and the CTC model from the joint loss applied during training. Neither external model incorporates additional data. Although the impact of the LM-only re-scoring is not shown in the ASR % WER, it reduces substitution and deletion rates in the ASR, and this is observed to help the overall ST performance.

Model                             Overall ST(↑)   Sub-Net ASR(↓)
Multi-Decoder                     52.7            22.6
 +ASR Re-scoring w/ LM            53.2            22.6
 +ASR Re-scoring w/ CTC           52.8            22.1
 +ASR Re-scoring w/ CTC + LM      53.3            21.7
Multi-Decoder w/ Speech-Attn.     54.6            22.4
 +ASR Re-scoring w/ LM            55.1            22.4
 +ASR Re-scoring w/ CTC           54.7            22.0
 +ASR Re-scoring w/ CTC + LM      55.2            21.9

Table 3: Results presenting the overall ST performance (BLEU) and the sub-net ASR (% WER) of our Multi-Decoder models with external CTC and LM re-scoring in the intermediate ASR representation search. All results are from the Fisher dev set.

6.1.4 Error propagation avoidance
As discussed in §3, our Multi-Decoder model inherits the error propagation issue, as can be seen in Figure 3. For the easiest bucket of utterances, with < 40% WER in the Multi-Decoder's ASR sub-net, our model's ST performance, as measured by the corpus BLEU of the bucket, exceeds that of the Baseline Enc-Dec. The inverse is true for the more difficult bucket of [40, 80)%, showing that error propagation is limiting the performance of our model; however, we show that multi-sequence attention can alleviate this issue. For extremely difficult utterances in the ≥ 80% bucket, ST performance for all three approaches is suppressed. We also provide qualitative examples of error propagation avoidance in the Appendix (A.3).

Figure 3: Results comparing the ST performances (BLEU) of our Baseline Enc-Dec, Multi-Decoder, and Multi-Decoder w/ Speech-Attention across different ASR difficulties measured using % WER on the Fisher dev set (1-ref). The buckets on the x-axis are determined using the utterance-level % WER of the Multi-Decoder ASR sub-net.
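A bucketed analysis like Figure 3 can be reproduced from utterance-level ASR and ST outputs; the sketch below assumes the third-party jiwer and sacrebleu packages and uses the bucket edges from the figure, so treat it as one way to run the analysis rather than the paper's script.

```python
import jiwer       # third-party: pip install jiwer
import sacrebleu   # third-party: pip install sacrebleu

def bucketed_st_bleu(asr_refs, asr_hyps, st_refs, st_hyps):
    """Bucket utterances by the ASR sub-net's utterance-level % WER and
    report corpus BLEU of the ST output per bucket (edges as in Figure 3)."""
    buckets = {"<40%": [], "[40,80)%": [], ">=80%": []}
    for a_ref, a_hyp, s_ref, s_hyp in zip(asr_refs, asr_hyps, st_refs, st_hyps):
        wer = 100.0 * jiwer.wer(a_ref, a_hyp)
        key = "<40%" if wer < 40 else "[40,80)%" if wer < 80 else ">=80%"
        buckets[key].append((s_hyp, s_ref))
    return {k: sacrebleu.corpus_bleu([h for h, _ in pairs],
                                     [[r for _, r in pairs]]).score
            for k, pairs in buckets.items() if pairs}
```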
6.2 Generalizability
In this section, we discuss the generalizability of our framework towards out-of-domain data. We also extend our Multi-Decoder model to other sequence tasks like speech recognition. Finally, we apply our ST models to a larger corpus with more language pairs and a different domain of speech.

6.2.1 Robustness through Decomposition
Like cascaded systems, searchable intermediates give our model the ability to adapt individual sub-systems towards out-of-domain data using external in-domain language models, thereby giving access to more in-domain data. Specifically for speech translation systems, this means we can use in-domain language models in both the source and target languages. We test the robustness of our Multi-Decoder model, trained on the Fisher-CallHome conversational speech dataset, on the read-speech CoVoST-2 dataset (Wang et al., 2020b). In Table 4 we show that re-scoring the ASR sub-net with an in-domain LM improves ASR with around 10.0% lower WER, improving the overall ST performance by around +2.5 BLEU. Compared to an in-domain ST baseline (Wang et al., 2020a), our out-of-domain Multi-Decoder with in-domain ASR re-scoring demonstrates the robustness of our approach.

Model                                   Overall ST(↑)   Sub-Net ASR(↓)
In-domain ST model
  Baseline (Wang et al., 2020b)         12.0            -
  +ASR Pretrain (Wang et al., 2020b)♦   23.0            16.0
Out-of-domain ST model
  Multi-Decoder                         11.8            46.8
  +ASR Re-scoring w/ in-domain LM       14.4            36.7
  Multi-Decoder w/ Speech-Attention     12.6            46.5
  +ASR Re-scoring w/ in-domain LM       15.0            36.7

Table 4: Results presenting the overall ST performance (BLEU) and the sub-net ASR (% WER) of our Multi-Decoder models when tested on out-of-domain data. All models were trained on the Fisher-CallHome Es→En corpus and tested on the CoVoST-2 Es→En corpus. ♦ Pretrained with 364 hours of in-domain ASR data.
6.2.2 Decomposing Speech Transcripts
We apply our generic framework to another decomposable sequence task, speech recognition, and show the results of various levels of decomposition in Table 5. With phoneme, character, or byte-pair encoding (BPE) sequences as intermediates, the Multi-Decoder presents strong results on both the Fisher and CallHome test sets. We also observe that the BPE intermediates perform better than the phoneme/character variants, which could be attributed to the reduced search capabilities of encoder-decoder models using beam search on longer sequences (Sountsov and Sarawagi, 2016), like phoneme/character sequences.

Model           Intermediate   Fisher ASR(↓)   CallHome ASR(↓)
Enc-Dec♦        -              23.2            45.3
Multi-Decoder   Phoneme        20.7            40.0
Multi-Decoder   Character      20.4            39.9
Multi-Decoder   BPE100         19.7            38.9

Table 5: Results presenting the ASR performance (% WER) when using the Multi-Decoder model on the decomposed ASR task with phoneme, character, and BPE100 intermediates. All results are from the Fisher-CallHome Spanish corpus. ♦ (Weiss et al., 2017)
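To give a feel for how such an intermediate vocabulary is built, here is a short sketch that trains a 100-unit BPE model with SentencePiece (Kudo and Richardson, 2018); the file paths are hypothetical placeholders and this is not the paper's released recipe.

```python
import sentencepiece as spm  # third-party: pip install sentencepiece

# Train a 100-unit BPE model over the source-language transcripts to act as
# the intermediate ASR vocabulary; the input path is a hypothetical placeholder.
spm.SentencePieceTrainer.train(
    input="fisher_callhome_es_transcripts.txt",
    model_prefix="bpe100_es",
    vocab_size=100,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe100_es.model")
print(sp.encode("buenos días", out_type=str))  # intermediate token sequence
```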
6.2.3 Extending to MuST-C Language Pairs
In addition to our results using the 170 hours of the Spanish-English Fisher-CallHome corpus, in Table 6 we show that our decompositional framework is also effective on larger ST corpora. In particular, we use 400 hours of English-German and 500 hours of English-French ST data from the MuST-C corpus (Di Gangi et al., 2019). Our Multi-Decoder model improves by +2.7 and +1.5 BLEU, in German and French respectively, over end-to-end baselines from prior works that do not use additional training data. We show that ASR re-scoring gives an additional +0.1 and +0.4 BLEU improvement.⁵

Model                              En→De ST(↑)   En→Fr ST(↑)
NeurST (Zhao et al., 2020)         22.9          33.3
Fairseq S2T (Wang et al., 2020a)   22.7          32.9
ESPnet-ST (Inaguma et al., 2020)   22.9          32.7
Dual-Decoder (Le et al., 2020)     23.6          33.5
Multi-Decoder w/ Speech-Attn.      26.3          37.0
 +ASR Re-scoring                   26.4          37.4

Table 6: Results presenting the overall ST performance (BLEU) of our Multi-Decoder w/ Speech-Attention models with ASR re-scoring across two language pairs, English-German (En→De) and English-French (En→Fr). All results are from the MuST-C tst-COMMON sets. All models use speech transcripts.

By extending our Multi-Decoder models to this MuST-C study, we show the generalizability of our approach across several dimensions of ST tasks. First, our approach consistently improves over baselines across multiple language pairs. Second, our approach is robust to the distinct domains of telephone conversations from Fisher-CallHome and the TED talks from MuST-C. Finally, by scaling from 170 hours of Fisher-CallHome data to 500 hours of MuST-C data, we show that the benefits of decomposing sequence tasks with searchable hidden intermediates persist even with more data.

Furthermore, the performance of our Multi-Decoder models trained with only the English-German or English-French ST data from MuST-C is comparable to other methods which incorporate larger external ASR and MT data in various ways. For instance, Zheng et al. (2021) use 4700 hours of ASR data and 2M sentences of MT data for pretraining and multi-task learning. Similarly, Bahar et al. (2021) use 2300 hours of ASR data and 27M sentences of MT data for pretraining. Our competitive performance without the use of any additional data highlights the data-efficient nature of our proposed end-to-end framework as opposed to the baseline encoder-decoder model, as pointed out by Sperber and Paulik (2020).

⁵Details of the MuST-C data preparation and model parameters are given in the Appendix (A.4).

7 Discussion and Relation to Prior Work

Compositionality: A number of recent works have constructed composable neural network modules for tasks such as visual question answering (Andreas et al., 2016), neural MT (Raunak et al., 2019), and synthetic sequence-to-sequence tasks (Lake, 2019). Modules that are first trained separately can subsequently be tightly integrated into a single end-to-end trainable model by passing differentiable soft decisions instead of discrete decisions in the intermediate stage (Bahar et al., 2021).
Further, even a single encoder-decoder model can be decomposed into modular components where the encoder and decoder modules have explicit functions (Dalmia et al., 2019).

Joint Training with Sub-Tasks: End-to-end sequence models have been shown to benefit from joint training with sub-tasks as auxiliary loss functions for a variety of tasks like ASR (Kim et al., 2017), ST (Salesky et al., 2019; Liu et al., 2020a; Dong et al., 2020; Le et al., 2020), and SLU (Haghani et al., 2018). Such training has been shown to induce structure (Belinkov et al., 2020) and improve model performance (Toshniwal et al., 2017), but it may reduce data efficiency if some sub-nets are not included in the final end-to-end model (Sperber et al., 2019; Wang et al., 2020c). Our framework avoids this sub-net waste at the cost of computational load during inference.

Speech Translation Decoders: Prior works have used ASR/MT decoding to improve the overall ST decoding through synchronous decoding (Liu et al., 2020a), dual decoding (Le et al., 2020), and successive decoding (Dong et al., 2020). These works partially or fully decode ASR transcripts and use discrete intermediates to assist MT decoding. Tu et al. (2017) and Anastasopoulos and Chiang (2018) are closest to our Multi-Decoder ST model; however, the benefits of our proposed framework are not entirely explored in those works.

Two-Pass Decoding: Two-pass decoding involves first predicting with one decoder and then re-evaluating with another decoder (Geng et al., 2018; Sainath et al., 2019; Hu et al., 2020; Rijhwani et al., 2020). The two decoders iterate on the same sequence, so there is no decomposition into sub-tasks in this method. Our approach, on the other hand, provides the subsequent decoder with a more structured representation than the input by decomposing the complexity of the overall task. Like two-pass decoding, our approach provides a sense of the future to the second decoder, allowing it to correct mistakes made by the first decoder.

Auto-Regressive Decoding: As auto-regressive decoders inherently learn a language model along with the task at hand, they tend to be domain-specific (Samarakoon et al., 2018; Müller et al., 2020). This can cause generalizability issues during inference (Murray and Chiang, 2018; Yang et al., 2018), impacting the performance of both the task at hand and any downstream tasks. Our approach alleviates these problems through intermediate search, external models for intermediate re-scoring, and multi-sequence attention.

8 Conclusion and Future Work

We present searchable hidden intermediates for end-to-end models of decomposable sequence tasks. We show the efficacy of our Multi-Decoder model on the Fisher-CallHome Es→En and the MuST-C En→De and En→Fr speech translation corpora, achieving state-of-the-art results. We present various benefits of our framework, including sub-net performance monitoring, beam search for better hidden intermediates, external models for better search, and error propagation avoidance. Further, we demonstrate the flexibility of our framework towards out-of-domain tasks with the ability to adapt our sequence model at intermediate stages of decomposition. Finally, we show generalizability by training Multi-Decoder models for the speech recognition task at various levels of decomposition.

We hope the insights derived from our study stimulate research on tighter integrations between the benefits of cascaded and end-to-end sequence models. Exploiting searchable intermediates through beam search is just the tip of the iceberg for search algorithms, as numerous approximate search techniques like diverse beam search (Vijayakumar et al., 2018) and best-first beam search (Meister et al., 2020) have recently been proposed to improve the diversity and approximation of the most-likely sequence. Incorporating differentiable lattice-based search (Hannun et al., 2020) could also allow the subsequent sub-net to digest n-best representations.

9 Acknowledgements

This work started while Vikas Raunak was a student at CMU; he is now working as a Research Scientist at Microsoft. We thank Pengcheng Guo, Hirofumi Inaguma, Elizabeth Salesky, Maria Ryskina, Marta Méndez Simón, and Vijay Viswanathan for their helpful discussions during the course of this project. We also thank the anonymous reviewers for their valuable feedback. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (Towns et al., 2014), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system (Nystrom et al., 2015), which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). The work was supported in part by an AWS Machine Learning Research Award. This research was also supported in part by the DARPA KAIROS program from the Air Force Research Laboratory under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.
References

Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In Proceedings of NAACL-HLT 2018, Volume 1 (Long Papers), pages 82–91, New Orleans, Louisiana. Association for Computational Linguistics.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In CVPR 2016, pages 39–48. IEEE Computer Society.

Parnia Bahar, Tobias Bieschke, Ralf Schlüter, and Hermann Ney. 2021. Tight integrated end-to-end training for cascaded speech translation. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 950–957. IEEE.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Daniel Beck, Trevor Cohn, and Gholamreza Haffari. 2019. Neural speech translation using lattice transformations and graph networks. In Proceedings of TextGraphs-13, pages 26–31, Hong Kong. Association for Computational Linguistics.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2020. On the linguistic representational power of neural machine translation models. Computational Linguistics, 46(1):1–52.

Léon Bottou, Yoshua Bengio, and Yann Le Cun. 1997. Global training of document processing systems using graph transformer networks. In Proceedings of CVPR, pages 489–494. IEEE.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces. In Privacy in Machine Learning and Artificial Intelligence Workshop, ICML.

Siddharth Dalmia, Abdelrahman Mohamed, Mike Lewis, Florian Metze, and Luke Zettlemoyer. 2019. Enforcing encoder-decoder modularity in sequence-to-sequence models. arXiv preprint arXiv:1911.03782.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: A multilingual speech translation corpus. In Proceedings of NAACL-HLT 2019, pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Qianqian Dong, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, and Lei Li. 2020. SDST: Successive decoding for speech-to-text translation. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence.

Xinwei Geng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2018. Adaptive multi-pass decoder for neural machine translation. In Proceedings of EMNLP 2018, pages 523–532, Brussels, Belgium. Association for Computational Linguistics.

Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, et al. 2021. Recent developments on ESPnet toolkit boosted by Conformer. In ICASSP 2021. IEEE.

Parisa Haghani, Arun Narayanan, Michiel Bacchiani, Galen Chuang, Neeraj Gaur, Pedro Moreno, Rohit Prabhavalkar, Zhongdi Qu, and Austin Waters. 2018. From audio to semantics: Approaches to end-to-end spoken language understanding. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 720–726. IEEE.

Awni Hannun, Vineel Pratap, Jacob Kahn, and Wei-Ning Hsu. 2020. Differentiable weighted finite-state transducers. arXiv preprint arXiv:2010.01003.

Jindřich Helcl, Jindřich Libovický, and Dušan Variš. 2018. CUNI system for the WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 616–623, Brussels, Belgium. Association for Computational Linguistics.

Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97.

Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan. 2017. Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. In Proc. Interspeech 2017, pages 949–953.

Ke Hu, Tara N. Sainath, Ruoming Pang, and Rohit Prabhavalkar. 2020. Deliberation model based two-pass end-to-end speech recognition. In ICASSP 2020, pages 7799–7803. IEEE.

Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of ACL 2007, pages 144–151, Prague, Czech Republic. Association for Computational Linguistics.

Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Yalta, Tomoki Hayashi, and Shinji Watanabe. 2020. ESPnet-ST: All-in-one speech translation toolkit. In Proceedings of ACL 2020: System Demonstrations, pages 302–311. Association for Computational Linguistics.

Patricia Johnson. 1992. Cohesion and coherence in compositions in Malay and English. RELC Journal, 23(2):1–17.

Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In ICASSP 2017, pages 4835–4839. IEEE.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR 2015.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL 2007: Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 224–227.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of EMNLP 2018: System Demonstrations, pages 66–71.

Gaurav Kumar, Matt Post, Daniel Povey, and Sanjeev Khudanpur. 2014. Some insights from translating conversational telephone speech. In ICASSP 2014, pages 3231–3235. IEEE.

Shankar Kumar, Yonggang Deng, and William Byrne. 2006. A weighted finite state transducer translation template model for statistical machine translation. Natural Language Engineering, 12(1):35–76.

Chih-Hua Kuo. 1995. Cohesion and coherence in academic writing: From lexical choice to organization. RELC Journal, 26(1):47–62.

Brenden M. Lake. 2019. Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems, pages 9791–9801.

Brenden M. Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of ICML 2018, pages 2879–2888. PMLR.

Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, and Laurent Besacier. 2020. Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation. In Proceedings of the 28th International Conference on Computational Linguistics.

Alexander H. Levis, Neville Moray, and Baosheng Hu. 1994. Task decomposition and allocation problems and discrete event systems. Automatica, 30(2):203–216.

Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2020a. Synchronous speech recognition and speech-to-text translation with interactive decoding. In AAAI 2020, pages 8417–8424.

Zhiyuan Liu, Yankai Lin, and Maosong Sun. 2020b. Compositional Semantics, pages 43–57. Springer Singapore, Singapore.

Clara Meister, Tim Vieira, and Ryan Cotterell. 2020. Best-first beam search. Transactions of the Association for Computational Linguistics, 8:795–809.

Bernd T. Meyer, Sri Harish Mallidi, Angel Mario Castro Martinez, Guillermo Payá-Vayá, Hendrik Kayser, and Hynek Hermansky. 2016. Performance monitoring for automatic speech recognition in noisy multi-channel environments. In 2016 IEEE Spoken Language Technology Workshop (SLT), pages 50–56.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88.

Mathias Müller, Annette Rios, and Rico Sennrich. 2020. Domain robustness in neural machine translation. In Proceedings of AMTA 2020, pages 151–164, Virtual. Association for Machine Translation in the Americas.

Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In Proceedings of WMT 2018: Research Papers, pages 212–223, Brussels, Belgium.

Nicholas A. Nystrom, Michael J. Levine, Ralph Z. Roskies, and J. Ray Scott. 2015. Bridges: A uniquely flexible HPC resource for new communities and data analytics. In Proceedings of the 2015 XSEDE Conference. Association for Computing Machinery.

Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of ACL 2002, pages 295–302.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL 2002, pages 311–318, Philadelphia, Pennsylvania, USA.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. Interspeech 2019, pages 2613–2617.

Vijayaditya Peddinti, Guoguo Chen, Vimal Manohar, Tom Ko, Daniel Povey, and Sanjeev Khudanpur. 2015. JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 539–546.

N. Pham, Thai-Son Nguyen, Thanh-Le Ha, J. Hussain, Felix Schneider, J. Niehues, Sebastian Stüker, and A. Waibel. 2019. The IWSLT 2019 KIT speech translation system. In International Workshop on Spoken Language Translation (IWSLT).

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of WMT 2018: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur. 2013. Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus. In International Workshop on Spoken Language Translation (IWSLT 2013).

Vikas Raunak, Vaibhav Kumar, and Florian Metze. 2019. On compositionality in neural machine translation. In NeurIPS Workshop on Context and Compositionality in Biological and Artificial Neural Systems.

Raj Reddy. 1988. Foundations and grand challenges of artificial intelligence: AAAI presidential address. AI Magazine, 9(4):9–21.

Shruti Rijhwani, Antonios Anastasopoulos, and Graham Neubig. 2020. OCR post correction for endangered language texts. In Proceedings of EMNLP 2020, pages 5931–5942, Online. Association for Computational Linguistics.

Tara N. Sainath, Ruoming Pang, David Rybach, Yanzhang He, Rohit Prabhavalkar, Wei Li, Mirkó Visontai, Qiao Liang, Trevor Strohman, Yonghui Wu, et al. 2019. Two-pass end-to-end speech recognition. In Proc. Interspeech 2019, pages 2773–2777.

Elizabeth Salesky, Matthias Sperber, and Alan W Black. 2019. Exploring phoneme-level speech representations for end-to-end speech translation. In Proceedings of ACL 2019, pages 1835–1841, Florence, Italy. Association for Computational Linguistics.

Lahiru Samarakoon, Brian Mak, and Albert Y.S. Lam. 2018. Domain adaptation of end-to-end speech recognition in low-resource settings. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 382–388. IEEE.

Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Niko Moritz, and Jonathan Le Roux. 2019. Vectorized beam search for CTC-attention-based speech recognition. In Proc. Interspeech 2019, pages 3825–3829.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas.

Pavel Sountsov and Sunita Sarawagi. 2016. Length bias in encoder decoder models and a case for global conditioning. In Proceedings of EMNLP 2016, pages 1516–1525, Austin, Texas. Association for Computational Linguistics.

Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of the Association for Computational Linguistics, 7:313–325.

Matthias Sperber and Matthias Paulik. 2020. Speech translation and the end-to-end promise: Taking stock of where we are. In Proceedings of ACL 2020.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christoph Tillmann and Hermann Ney. 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, 29(1):97–133.

Shubham Toshniwal, Hao Tang, Liang Lu, and Karen Livescu. 2017. Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. In Proc. Interspeech 2017, pages 3532–3536.

John Towns, Timothy Cockerill, Maytal Dahan, Ian Foster, Kelly Gaither, Andrew Grimshaw, Victor Hazlewood, Scott Lathrop, Dave Lifka, Gregory D. Peterson, et al. 2014. XSEDE: Accelerating scientific discovery. Computing in Science & Engineering, 16(5):62–74.

Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural machine translation with reconstruction. In Proceedings of AAAI 2017, pages 3097–3103. AAAI Press.

Evelyne Tzoukermann and Corey Miller. 2018. Evaluating automatic speech recognition in translation. In Proceedings of AMTA 2018 (Volume 2: User Track), pages 294–302, Boston, MA. Association for Machine Translation in the Americas.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Proceedings of AAAI 2018, pages 7371–7379. AAAI Press.

Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020a. Fairseq S2T: Fast speech-to-text modeling with fairseq. In Proceedings of AACL 2020: System Demonstrations, pages 33–39. Association for Computational Linguistics.

Changhan Wang, Anne Wu, and Juan Pino. 2020b. CoVoST 2: A massively multilingual speech-to-text translation corpus. arXiv preprint arXiv:2007.10310.

Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, and Ming Zhou. 2020c. Bridging the gap between pre-training and fine-tuning for end-to-end speech translation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):9161–9168.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-end speech processing toolkit. In Proc. Interspeech 2018, pages 2207–2211.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. In Proc. Interspeech 2017, pages 2625–2629.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of EMNLP 2016, pages 1296–1306, Austin, Texas. Association for Computational Linguistics.

Yilin Yang, Liang Huang, and Mingbo Ma. 2018. Breaking the beam search curse: A study of (re-)scoring methods and stopping criteria for neural machine translation. In Proceedings of EMNLP 2018, pages 3054–3059, Brussels, Belgium. Association for Computational Linguistics.

Chengqi Zhao, Mingxuan Wang, and Lei Li. 2020. NeurST: Neural speech translation toolkit. arXiv preprint arXiv:2012.10018.

Renjie Zheng, Junkun Chen, Mingbo Ma, and Liang Huang. 2021. Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation. arXiv preprint arXiv:2102.05766.