The USTC-NELSLIP Systems for Simultaneous Speech Translation Task at IWSLT 2021

Dan Liu¹·², Mengge Du², Xiaoxi Li², Yuchen Hu¹, and Lirong Dai¹

¹University of Science and Technology of China, Hefei, China
²iFlytek Research, Hefei, China

{danliu, huyuchen}@mail.ustc.edu.cn, lrdai@ustc.edu.cn, {danliu, xxli16, mgdou}@iflytek.com

Abstract

This paper describes USTC-NELSLIP's submissions to the IWSLT 2021 Simultaneous Speech Translation task. We propose a novel simultaneous translation model, Cross Attention Augmented Transducer (CAAT), which extends conventional RNN-T to sequence-to-sequence tasks without monotonic constraints, e.g., simultaneous translation. Experiments on speech-to-text (S2T) and text-to-text (T2T) simultaneous translation tasks show that CAAT achieves better quality-latency trade-offs than wait-k, one of the previous state-of-the-art approaches. Based on the CAAT architecture and data augmentation, we build S2T and T2T simultaneous translation systems for this evaluation campaign. Compared to last year's optimal systems, our S2T simultaneous translation system improves by an average of 11.3 BLEU across all latency regimes, and our T2T simultaneous translation system improves by an average of 4.6 BLEU.

1 Introduction

This paper describes the submission to the IWSLT 2021 Simultaneous Speech Translation task by the National Engineering Laboratory for Speech and Language Information Processing (NELSLIP), University of Science and Technology of China.

Recent work in text-to-text simultaneous translation tends to fall into two categories, fixed policy and flexible policy, represented by wait-k (Ma et al., 2019) and monotonic attention (Arivazhagan et al., 2019; Ma et al., 2020b) respectively. The drawback of a fixed policy is that it may impose too much latency on some sentences and too little on others (a sketch of the wait-k schedule appears at the end of this section). Meanwhile, a flexible policy often leads to difficulties in model optimization.

Inspired by RNN-T (Graves, 2012), we aim to optimize the marginal distribution over all expanded paths in simultaneous translation. However, we found it impossible to calculate this marginal probability with conventional attention encoder-decoder architectures (Sennrich et al., 2016), the Transformer (Vaswani et al., 2017) included, due to the deep coupling between source contexts and target history contexts. To solve this problem, we propose a novel architecture, Cross Attention Augmented Transducer (CAAT), together with a latency loss function that ensures the CAAT model works at an appropriate latency. With CAAT, the simultaneous translation policy is integrated into the translation model and learned jointly with it.

In this work, we build simultaneous translation systems for both the text-to-text (T2T) and speech-to-text (S2T) tracks. CAAT significantly outperforms the wait-k baseline (Ma et al., 2019) on both tracks. Besides, we adopt a variety of data augmentation methods: back-translation (Edunov et al., 2018), self-training (Kim and Rush, 2016) and speech synthesis with Tacotron2 (Shen et al., 2018). Combining all of these with model ensembling, we achieve gains of about 11.3 BLEU (S2T) and 4.6 BLEU (T2T) over the best performance of last year.
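To make the fixed-policy baseline concrete, here is a minimal Python sketch of the wait-k READ/WRITE schedule, assuming token-level steps; the function name and interface are illustrative, not taken from our systems or from fairseq.

    def wait_k_actions(src_len: int, tgt_len: int, k: int) -> str:
        """Return the READ (R) / WRITE (W) action sequence of a wait-k policy."""
        actions, read, written = [], 0, 0
        while written < tgt_len:
            # READ until k + written source tokens are available (or source ends).
            if read < min(src_len, k + written):
                actions.append("R")
                read += 1
            else:
                actions.append("W")
                written += 1
        return "".join(actions)

    print(wait_k_actions(src_len=6, tgt_len=5, k=2))
    # RRWRWRWRWRW: two reads up front, then one write per read

For this example the action sequence has length |x| + |y|, the policy length used in Sec. 3.1.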
2 Data

2.1 Statistics and Preprocessing

EN→DE Speech Corpora  The speech datasets used in our experiments are shown in Table 1, where MuST-C, Europarl and CoVoST2 are speech-translation specific (speech, transcription and translation included), and LibriSpeech and TED-LIUM3 are speech-recognition specific (only speech and transcription). After augmentation with speed and echo perturbation, we use Kaldi (Povey et al., 2011) to extract 80-dimensional log-mel filter bank features, computed with a 25ms window size and a 10ms window shift; SpecAugment (Park et al., 2019) is applied during training.

    Corpus       Segments  Duration (h)
    MuST-C       250.9k    448
    Europarl     69.5k     155
    CoVoST2      854.4k    1090
    LibriSpeech  281.2k    960
    TED-LIUM3    268.2k    452

Table 1: Statistics of speech corpora.

Text Translation Corpora  The bilingual parallel datasets for English to German (EN→DE) and English to Japanese (EN→JA) are shown in Table 2, and the monolingual datasets in English, German and Japanese are shown in Table 3. Since the Paracrawl dataset in the EN→DE task is too large for our model training, we randomly select a subset of 14M sentences from it.

           Corpus       Sentences
    EN→DE  MuST-C(v2)   229.7k
           Europarl     1828.5k
           Rapid-2019   1531.3k
           WIT3-TED     209.5k
           Commoncrawl  2399.1k
           WikiMatrix   6227.2k
           Wikititles   1382.6k
           Paracrawl    82638.2k
    EN→JA  WIT3-TED     225.0k
           JESC         2797.4k
           kftt         440.3k
           WikiMatrix   3896.0k
           Wikititles   706.0k
           Paracrawl    10120.0k

Table 2: Statistics of text parallel datasets.

    Language  Corpus           Sentences
    EN        Europarl-v10     2295.0k
              News-crawl-2019  33600.8k
    DE        Europarl-v10     2108.0k
              News-crawl-2020  53674.4k
    JA        News-crawl-2019  3446.4k
              News-crawl-2020  10943.3k

Table 3: Statistics of monolingual datasets.

For the EN→DE task, we directly use SentencePiece (Kudo and Richardson, 2018) to generate a unigram vocabulary of size 32,000 for source and target language jointly. For the EN→JA task, Japanese sentences are first segmented into words by MeCab (Kudo, 2006), and a joint unigram vocabulary of size 32,000 is then generated in the same way as for EN→DE.

During data preprocessing, the bilingual datasets are first filtered by length (less than 1024) and by target-to-source length ratio (0.25 < r < 4). In the second step, using a baseline Transformer model trained on bilingual data only, we filter out mismatched parallel pairs by their log-likelihood under the baseline model; the threshold is set to −4.0 for the EN→DE task and −5.0 for the EN→JA task. In the end we keep 27.3 million sentence pairs for EN→DE and 17.0 million sentence pairs for EN→JA.
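The two-stage filtering above is easy to express as a predicate over sentence pairs. The sketch below assumes token-based lengths and a score_log_likelihood callback standing in for scoring with the baseline Transformer; both are assumptions, not details given above.

    def keep_pair(src: str, tgt: str, score_log_likelihood, threshold: float = -4.0) -> bool:
        """Stage-1 length/ratio filter plus stage-2 baseline-model score filter."""
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if src_len == 0 or tgt_len == 0 or src_len >= 1024 or tgt_len >= 1024:
            return False                        # length must be less than 1024
        if not (0.25 < tgt_len / src_len < 4.0):
            return False                        # target/source ratio 0.25 < r < 4
        # Log-likelihood under the baseline Transformer trained on bilingual
        # data only; the paper uses -4.0 for EN->DE and -5.0 for EN->JA.
        return score_log_likelihood(src, tgt) > threshold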
2.2 Data Augmentation

For text-to-text machine translation, augmented data are generated from monolingual corpora in the source and target languages by self-training (He et al., 2019) and back-translation (Edunov et al., 2018) respectively. Statistics of the augmented training data are shown in Table 4.

    Data               EN→DE  EN→JA
    Bilingual data     27.3M  17.0M
    +back-translation  34.3M  22.0M
    +self-training     41.3M  27.0M

Table 4: Augmented training data for text-to-text translation.

We further extend these two data augmentation methods to speech-to-text translation (a sketch of the self-training step follows this list):

1. Self-training: similar in spirit to sequence-level knowledge distillation (Kim and Rush, 2016; Ren et al., 2020; Liu et al., 2019). Transcriptions of all speech datasets (both speech-recognition and speech-translation specific) are sent to a text translation model to generate text y′ in the target language; each generated y′ with its corresponding speech is added directly to the speech translation dataset.

2. Speech synthesis: we employ Tacotron2 (Shen et al., 2018), slightly modified by introducing speaker representations into both encoder and decoder, as our text-to-speech (TTS) model, trained on the MuST-C(v2) speech corpora to generate filter-bank speech representations. We randomly select 4M sentence pairs from the EN→DE text translation corpora and generate audio features by speech synthesis. The generated filter-bank features and their corresponding target-language text are used to expand our speech translation dataset.
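As referenced in item 1 above, the speech-side self-training step needs only a trained text translation model; this is a minimal sketch with a placeholder translate function, not an interface from our actual pipeline.

    def self_train_augment(asr_corpus, translate):
        """asr_corpus yields (speech_features, transcription) pairs."""
        for speech, transcription in asr_corpus:
            y_prime = translate(transcription)  # pseudo-target in the target language
            yield speech, y_prime               # appended to the speech translation set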
The expanded training data are shown in Table 5. Besides, during training of all speech translation tasks, we sample the speech data from the whole corpora with fixed ratios; the ratio used for each dataset is shown in Table 6.

    Dataset            Segments  Duration (h)
    Raw S2T dataset    1.17M     1693
    +self-training     2.90M     4799
    +speech synthesis  7.22M     10424

Table 5: Expanded speech translation dataset with self-training and speech synthesis.

    Dataset           Sample Ratio
    MuST-C            2
    Europarl          1
    CoVoST2           1
    LibriSpeech       1
    TED-LIUM3         2
    Speech synthesis  5

Table 6: Sample ratios for the different datasets during training.

3 Methods and Models

3.1 Cross Attention Augmented Transducer

Let x and y denote the source and target sequence, respectively. The policy of simultaneous translation can be denoted as an action sequence p ∈ {R, W}^(|x|+|y|), where R denotes the READ action and W the WRITE action. An equivalent representation of the policy extends the target sequence y to length |x| + |y| with the blank symbol φ, giving ŷ ∈ (V ∪ {φ})^(|x|+|y|), where V is the vocabulary of the target language. The mapping from y to the set of all possible expansions ŷ is denoted H(x, y).

Inspired by RNN-T (Graves, 2012), the loss function for simultaneous translation can be defined as the marginal conditional probability plus the expectation of a latency metric over all possible expanded paths:

    L(x, y) = L_nll(x, y) + L_latency(x, y)
            = −log Σ_{ŷ∈H(x,y)} p(ŷ|x) + E_ŷ[l(ŷ)]
            = −log Σ_{ŷ∈H(x,y)} p(ŷ|x) + Σ_{ŷ∈H(x,y)} Pr(ŷ|y, x) l(ŷ)        (1)

where Pr(ŷ|y, x) = p(ŷ|x) / Σ_{ŷ′∈H(x,y)} p(ŷ′|x), ŷ ∈ H(x, y) is an expansion of the target sequence y, and l(ŷ) is the latency of the expanded path ŷ.

Figure 1: Expanded paths in simultaneous translation.

However, RNN-T is trained and inferenced under a source-target monotonic constraint, which makes it unsuitable for translation. Moreover, the calculation of the marginal probability Σ_{ŷ∈H(x,y)} p(ŷ|x) is impossible in the attention encoder-decoder framework, due to the deep coupling of source and previous-target representations: as shown in Figure 1, the decoder hidden states of the red path ŷ¹ and the blue path ŷ² are not equal at the intersection, s¹₂ ≠ s²₂. To solve this, we separate the source attention mechanism from the target history representation, similar to the joiner and predictor in RNN-T. The resulting architecture can be viewed as an extension of RNN-T with an attention-augmented joiner, and is named Cross Attention Augmented Transducer (CAAT). Figure 2 shows an implementation of CAAT based on the Transformer.

The computation cost of the joiner in CAAT is significantly more expensive than in RNN-T. The complexity of the joiner during training is O(|x| · |y|), i.e., O(|x|) times that of a conventional Transformer. We solve this problem by making decisions with a decision step size d > 1, which reduces the joiner complexity from O(|x| · |y|) to O(|x| · |y| / d). Besides, to further reduce GPU memory consumption, we split hidden states into small pieces before they are sent into the joiner and recombine them at the output.
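The marginalization over H(x, y) in Eq. (1) is computed with an RNN-T-style forward recursion over the lattice of Figure 1. The NumPy sketch below assumes the per-node log-probabilities from the joiner have already been computed; the names and shapes are illustrative, and a real implementation would be batched on GPU.

    import numpy as np

    def log_marginal(blank_lp: np.ndarray, label_lp: np.ndarray) -> float:
        """Forward algorithm over the READ/WRITE lattice.

        blank_lp[t, u]: log-prob of a READ (blank) at node (t, u); shape (T, U+1).
        label_lp[t, u]: log-prob of WRITING token y[u+1] at node (t, u); shape (T+1, U).
        Returns log of the sum of p(y_hat | x) over all expansions in H(x, y).
        """
        T, U = blank_lp.shape[0], label_lp.shape[1]
        alpha = np.full((T + 1, U + 1), -np.inf)
        alpha[0, 0] = 0.0
        for t in range(T + 1):
            for u in range(U + 1):
                if t > 0:  # reached by a READ from (t-1, u)
                    alpha[t, u] = np.logaddexp(alpha[t, u],
                                               alpha[t - 1, u] + blank_lp[t - 1, u])
                if u > 0:  # reached by a WRITE from (t, u-1)
                    alpha[t, u] = np.logaddexp(alpha[t, u],
                                               alpha[t, u - 1] + label_lp[t, u - 1])
        return alpha[T, U]

Negating this value gives the L_nll term of Eq. (1); its gradients follow from automatic differentiation or the standard backward recursion.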
Figure 2: Architecture of CAAT based on Transformer.

The latency loss and its gradients can be calculated by the forward-backward algorithm, similar to sequence criterion training in ASR (Povey, 2005). Finally, we add the cross-entropy loss of the offline translation model as an auxiliary loss for CAAT training, for two reasons: first, we want the CAAT model to fall back to offline translation in the worst case; second, CAAT should behave in accordance with offline translation once the source sentence has ended. The final loss function for CAAT training is defined as follows:

    L(x, y) = L_CAAT(x, y) + λ_latency L_latency(x, y) + λ_CE L_CE(x, y)
            = −log Σ_{ŷ∈H(x,y)} p(ŷ|x)
              + λ_latency Σ_{ŷ∈H(x,y)} Pr(ŷ|y, x) l(ŷ)
              − λ_CE Σ_j log p(y_j | x, y_<j)        (2)
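The latency expectation in Eq. (2) can be accumulated from WRITE-transition posteriors produced by the forward-backward recursion, under the assumption that the latency metric decomposes over written tokens; the decomposition, the write_cost table and all names below are a sketch rather than our exact criterion.

    import numpy as np

    def expected_latency(blank_lp, label_lp, write_cost):
        """E[l(y_hat)] for l(y_hat) = sum_u write_cost[t_u, u], where t_u is the
        number of source steps READ before target token u+1 is written."""
        T, U = blank_lp.shape[0], label_lp.shape[1]
        alpha = np.full((T + 1, U + 1), -np.inf); alpha[0, 0] = 0.0
        beta = np.full((T + 1, U + 1), -np.inf);  beta[T, U] = 0.0
        for t in range(T + 1):            # forward pass
            for u in range(U + 1):
                if t > 0:
                    alpha[t, u] = np.logaddexp(alpha[t, u], alpha[t-1, u] + blank_lp[t-1, u])
                if u > 0:
                    alpha[t, u] = np.logaddexp(alpha[t, u], alpha[t, u-1] + label_lp[t, u-1])
        for t in range(T, -1, -1):        # backward pass
            for u in range(U, -1, -1):
                if t < T:
                    beta[t, u] = np.logaddexp(beta[t, u], blank_lp[t, u] + beta[t+1, u])
                if u < U:
                    beta[t, u] = np.logaddexp(beta[t, u], label_lp[t, u] + beta[t, u+1])
        log_z = alpha[T, U]               # log p(y | x)
        expectation = 0.0
        for t in range(T + 1):            # posterior of the WRITE arc at (t, u)
            for u in range(U):
                post = np.exp(alpha[t, u] + label_lp[t, u] + beta[t, u + 1] - log_z)
                expectation += post * write_cost[t, u]
        return expectation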
3.2 Streaming Encoder

The encoder processes its input in blocks, each with a main context of size m and a right (look-ahead) context of size r (cf. Table 7); related chunk-based streaming encoders (Dong et al., 2019; Wu et al., 2020) can be viewed as a special case of our method with {m = 1, r = 0}. Note that the look-ahead window size in our method is fixed, which ensures that increasing the number of Transformer layers won't affect latency.

3.3 Text-to-Text Simultaneous Translation

We implement both CAAT (Sec. 3.1) and wait-k (Ma et al., 2019) systems for text-to-text simultaneous translation, both based on fairseq (Ott et al., 2019).

All wait-k experiments use the parameter settings of the big Transformer (Vaswani et al., 2017) with unidirectional encoders, which corresponds to a 12-layer encoder, 6-layer decoder Transformer with an embedding size of 1024, a feed-forward network size of 4096, and 16 attention heads.

Hyper-parameters of our CAAT model architectures are shown in Table 7. CAAT training requires significantly more GPU memory than the conventional Transformer (Vaswani et al., 2017) because of the O(|x| · |y| / d) complexity of the joiner module. We mitigate this problem by reducing the joiner hidden dimension for lower decision step sizes d.

3.4 Speech-to-Text Simultaneous Translation

3.4.1 End-to-End Systems

The main end-to-end speech-to-text simultaneous translation system is based on the aforementioned CAAT structure. For the speech encoder, two 2D-convolution blocks are introduced before the stack of 24 Transformer encoder layers. Each convolution block consists of a 3-by-3 convolution layer with 64 channels and stride 2, followed by a ReLU activation. Input speech features are thus downsampled 4 times by the convolution blocks and flattened to a 1D sequence as input to the Transformer layers. Other hyper-parameters are shown in Table 7.

The latency-quality trade-off may be adjusted by varying the decision step size d and the latency scaling factor λ_latency. We submitted the systems with the best performance in each latency region.

3.4.2 Cascaded Systems

The cascaded system consists of two modules, simultaneous automatic speech recognition (ASR) and simultaneous text-to-text machine translation (MT). Both are built with the CAAT architecture proposed in Sec. 3.1. We found that the cascaded systems outperform the end-to-end system in the medium and high latency regions.

3.5 Unsegmented Data Processing

To deal with unsegmented data, we segment the input text at sentence-ending punctuation for the T2T track. For the S2T track, input speech is simply segmented into utterances of 20 seconds, and each segment is sent directly to our simultaneous translation systems to obtain the streaming results. We observed an abnormally large average lagging (AL) on the IWSLT tst2018 test set with the existing SimulEval toolkit (Ma et al., 2020a) and this segmentation strategy, so the corresponding results are not presented here. A more reasonable latency criterion may be needed for unsegmented data in the future.
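For reference, the AL metric used throughout Sec. 4 (text mode, as defined by Ma et al., 2019) measures how far the policy lags behind an ideal diagonal. Below is a minimal sketch, where delays[t-1] = g(t) is the number of source tokens read before target token t is emitted; names are illustrative.

    def average_lagging(delays, src_len, tgt_len):
        gamma = tgt_len / src_len            # target-to-source length ratio
        total, tau = 0.0, 0
        for t, g in enumerate(delays, start=1):
            total += g - (t - 1) / gamma     # lag behind the ideal diagonal
            tau = t
            if g >= src_len:                 # stop at the first token emitted
                break                        # after the full source is read
        return total / tau

    # wait-2 on a 5/5-token pair: delays = [2, 3, 4, 5, 5] -> AL = 2.0

SimulEval computes the speech-mode variant in milliseconds, which is the form quoted for the S2T track.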
4 Experiments

4.1 Effectiveness of CAAT

To demonstrate the effectiveness of the CAAT architecture, we compare it to wait-k with speculative beam search (SBS) (Ma et al., 2019; Zheng et al., 2019b), one of the previous state-of-the-art approaches. The latency-quality trade-off curves on the S2T and T2T tasks are shown in Figure 3 and Figure 4(a). CAAT significantly outperforms wait-k with SBS, especially in the low-latency region (AL < 1000ms for the S2T track and AL < 3 for the T2T track).

Figure 3: Comparison of CAAT and wait-k with SBS systems on EN→DE speech-to-text simultaneous translation.

4.2 Effectiveness of data augmentation

To test the effectiveness of data augmentation, we compare the results of different data augmentation methods on the offline and simultaneous speech translation tasks. As demonstrated in Table 8, adding the generated target sentences to the training corpora by self-training gives a boost of nearly 7 BLEU points, and speech synthesis provides another 1.5 BLEU points on MuST-C(v2) tst-COMMON. As illustrated in Figure 3, with all data augmentation methods employed, the simultaneous task gains nearly 3 BLEU points on average across latency regimes. Note that our data augmentation methods alleviate the scarcity of parallel data in the end-to-end speech translation task and bring a significant improvement.

    Dataset                  BLEU
    Original speech corpora  21.24
    +self-training           28.21
    +speech synthesis        29.72

Table 8: Performance of offline speech translation on MuST-C(v2) tst-COMMON with different datasets.
    Parameters                  S2T config  T2T config-A  T2T config-B
    Encoder    layers           24          12            12
               attention heads  8           16            16
               FFN dimension    2048        4096          4096
               embedding size   512         1024          1024
    Predictor  attention heads  8           16            16
               FFN dimension    2048        4096          4096
               embedding size   512         1024          1024
               output dimension 512         512           1024
    Joiner     attention heads  8           8             16
               FFN dimension    1024        2048          4096
               embedding size   512         512           1024
    decision step size          {16,64}     {4,10,16,32}  {10,32}
    latency scaling factor      {1.0,0.2}   {1.0,0.2}     0.2

Table 7: Parameters of CAAT in T2T and end-to-end S2T simultaneous translation. Note that both predictor and joiner have 6 layers for the T2T and S2T tasks, and the two additional parameters for end-to-end S2T simultaneous translation, the main context and right context described in Sec. 3.2, are set to m = 32 and r = 16.

4.3 Text-to-Text Simultaneous Translation

EN→DE Task  The performance on the text-to-text EN→DE task is shown in Figure 4(a). The proposed CAAT is consistently better than wait-k with SBS and the best 2020 results from ON-TRAC (Elbayad et al., 2020), especially in the low-latency regime, and the performance of CAAT with model ensembling is nearly equivalent to the offline result. Moreover, Figure 4(a) shows that model ensembling improves the BLEU score to varying degrees across latency regimes, and the increase is quite obvious in the low-latency regime. Compared with the best result in 2020, we finally obtain improvements of 6.8 and 3.4 BLEU in the low- and high-latency regimes respectively.

EN→JA Task  Results on the text-to-text simultaneous translation (EN→JA) track are plotted in Figure 4(b), where the curve named CAAT_bst is the best performance in this track with or without model ensembling. The curves lead to the same conclusion as the previous paragraph: the proposed CAAT significantly outperforms wait-k with SBS. We can also see that the gap between CAAT and the offline model is more noticeable here (nearly 0.4 BLEU); this is mainly because the joiner block for the EN→JA track in the high-latency regime is much smaller than for the EN→DE track, due to unstable EN→JA training.

Figure 4: Latency-quality trade-offs of text-to-text simultaneous translation: (a) tst-COMMON(v2) on EN→DE; (b) IWSLT dev2021 on EN→JA.
4.4 Speech-to-Text Simultaneous Translation

End-to-End System  In this section, we discuss the final results of the end-to-end system based on CAAT. We tune the decision step size d and the latency scaling factor λ_latency to meet the different latency regime requirements: for low, medium and high latency, (d, λ_latency) are set to (16, 1.0), (64, 1.0) and (64, 0.2) respectively. Our final latency-quality trade-offs are shown in Figure 5. Combined with our data augmentation methods and the new CAAT model structure, our single-model system already outperforms the best results of last year in all latency regimes, with an increase of 9.8 BLEU on average. Ensembling different models further boosts the BLEU score by roughly 0.5-1.5 points at different latency regimes.

Figure 5: Latency-quality trade-offs of speech-to-text simultaneous translation on MuST-C(v2) tst-COMMON.

Cascaded System  Under the cascaded setting, we pair two well-trained ASR and MT systems: the ASR system reaches a WER of 6.30 at an AL of 1720.20ms, and the MT system follows config-A in Table 7, reaching 34.79 BLEU at an AL of 5.93. We found the best medium- and high-latency systems at decision step size pairs (d_asr, d_mt) of (6, 10) and (12, 10) respectively. Performance of the cascaded systems is shown in Figure 5. Note that under the current configuration of ASR and MT systems, we cannot provide valid results that satisfy the AL requirement of the low-latency regime, since a cascaded system usually has larger latency than an end-to-end system. During online decoding of the cascaded system, the translation model can only translate tokens after they have been recognized by the ASR system. The ASR output is already delayed relative to the actual contents of the audio, and the two-step decoding accumulates further delay, which contributes to the higher latency compared to the end-to-end system. Nevertheless, the cascaded system retains significant advantages over the end-to-end system in the medium- and high-latency regimes, so end-to-end systems still have a long way to go in the simultaneous speech translation task.
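The delay accumulation described above follows directly from the shape of the cascaded decoding loop; the sketch below is purely illustrative, with hypothetical asr/mt step interfaces rather than our actual implementation.

    def cascaded_decode(speech_stream, asr, mt):
        """Chain a streaming ASR model into a streaming MT model."""
        translation = []
        for chunk in speech_stream:                 # READ audio
            for token in asr.step(chunk):           # tokens ASR has committed
                # MT only sees a source token after ASR commits it, so the MT
                # delay stacks on top of the ASR delay (two-step accumulation).
                translation.extend(mt.step(token))  # WRITE target tokens, if any
        for token in asr.finish():                  # flush remaining ASR tokens
            translation.extend(mt.step(token))
        translation.extend(mt.finish())             # flush the MT model
        return translation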
5 Related Work

Simultaneous Translation  Recent work on simultaneous translation falls into two categories. The first category uses a fixed policy for the READ/WRITE actions and can thus be easily integrated into the training stage, as typified by wait-k approaches (Ma et al., 2019). The second category includes models with a flexible policy learned and/or adaptive to the current context, e.g., by reinforcement learning (Gu et al., 2017) or supervised learning (Zheng et al., 2019a). A special sub-category of flexible policy jointly optimizes policy and translation by monotonic attention customized to the translation model, e.g., Monotonic Infinite Lookback (MILk) attention (Arivazhagan et al., 2019) and Monotonic Multihead Attention (MMA) (Ma et al., 2020b). We propose a novel method to optimize the policy and translation model jointly, motivated by RNN-T (Graves, 2012) in online ASR. Unlike RNN-T, the CAAT model removes the monotonic constraint, which is critical for handling reordering in machine translation tasks. The optimization of our latency loss is motivated by sequence discriminative training in ASR (Povey, 2005).

Data Augmentation  As described in Sec. 2, the training data for speech translation is significantly smaller than that of text-to-text machine translation, which is the main bottleneck for improving the performance of speech translation. Self-training, or sequence-level knowledge distillation by a text-to-text machine translation model, is the most effective way to exploit the huge ASR training data (Liu et al., 2019; Pino et al., 2020). On the other hand, synthesizing data by text-to-speech (TTS) has been demonstrated to be effective for low-resource speech recognition (Gokay and Yalcin, 2019; Ren et al., 2019). To the best of our knowledge, this is the first work to augment data by TTS for simultaneous speech-to-text translation tasks.

6 Conclusion

In this paper, we propose a novel simultaneous translation architecture, Cross Attention Augmented Transducer (CAAT), which significantly outperforms wait-k in both S2T and T2T simultaneous translation tasks. Based on the CAAT architecture and data augmentation, we build simultaneous translation systems for the text-to-text and speech-to-text tracks. We also build a cascaded speech-to-text simultaneous translation system for comparison. Both our T2T and S2T systems achieve significant improvements over last year's best-performing systems.

References

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1313–1323, Florence, Italy. Association for Computational Linguistics.

Linhao Dong, Feng Wang, and Bo Xu. 2019. Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019, pages 5656–5660. IEEE.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.

Maha Elbayad, Ha Nguyen, Fethi Bougares, Natalia Tomashenko, Antoine Caubrière, Benjamin Lecouteux, Yannick Estève, and Laurent Besacier. 2020. ON-TRAC consortium for end-to-end and simultaneous speech translation challenge tasks at IWSLT 2020. arXiv preprint arXiv:2005.11861.

Ramazan Gokay and Hulya Yalcin. 2019. Improving low resource Turkish speech recognition with data augmentation and TTS. In 2019 16th International Multi-Conference on Systems, Signals & Devices (SSD), pages 357–360. IEEE.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1053–1062, Valencia, Spain. Association for Computational Linguistics.

Junxian He, Jiatao Gu, Jiajun Shen, and Marc'Aurelio Ranzato. 2019. Revisiting self-training for neural sequence generation. arXiv preprint arXiv:1909.13788.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.

Taku Kudo. 2006. MeCab: Yet another part-of-speech and morphological analyzer. http://mecab.sourceforge.jp.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-end speech translation with knowledge distillation.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020a. SIMULEVAL: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144–150, Online. Association for Computational Linguistics.

Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, and Jiatao Gu. 2020b. Monotonic multihead attention. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, and Yun Tang. 2020. Self-training for end-to-end speech translation. arXiv preprint arXiv:2006.02490.

Daniel Povey. 2005. Discriminative training for large vocabulary speech recognition. Ph.D. thesis, University of Cambridge.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.

Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2020. SimulSpeech: End-to-end simultaneous speech to text translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3787–3796, Online. Association for Computational Linguistics.

Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Almost unsupervised text to speech and automatic speech recognition. In International Conference on Machine Learning, pages 5410–5419. PMLR.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, and Frank Zhang. 2020. Streaming transformer-based acoustic models using self-attention with augmented memory. arXiv preprint arXiv:2005.08042.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019a. Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5816–5822, Florence, Italy. Association for Computational Linguistics.

Renjie Zheng, Mingbo Ma, Baigong Zheng, and Liang Huang. 2019b. Speculative beam search for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1395–1402, Hong Kong, China. Association for Computational Linguistics.