ATTENTION BASED ON-DEVICE STREAMING SPEECH RECOGNITION WITH LARGE SPEECH CORPUS
Kwangyoun Kim*, Kyungmin Lee*, Dhananjaya Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim

Samsung Electronics Co., Ltd., Korea

arXiv:2001.00577v1 [eess.AS] 2 Jan 2020

ABSTRACT

In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained around a 90% word recognition rate for the general domain mainly by using joint training with connectionist temporal classification (CTC) and cross-entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pre-training, and data augmentation. In addition, we compressed our models to less than 1/3.4 of their original size using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization, bringing the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models, and we achieved a relative 36% improvement in word error rate (WER) on average for target domains including the general domain.

Index Terms— Online speech recognition, end-to-end, attention, compression

1. INTRODUCTION

Recently, end-to-end (E2E) neural network architectures based on sequence-to-sequence (seq2seq) learning for automatic speech recognition (ASR) have been gaining a lot of attention [1, 2], mainly because they learn the acoustic information, the linguistic information, and the alignments between them simultaneously, unlike conventional ASR systems based on hybrid hidden Markov model (HMM) and deep neural network (DNN) models. Moreover, E2E models are better suited to compression, since they do not need separate phonetic dictionaries and language models, making them one of the best candidates for on-device ASR systems.

Among the various E2E ASR architectures, such as attention-based encoder-decoder models [3] and recurrent neural network transducer (RNN-T) models [4, 5], we chose the attention-based approach since its accuracy has surpassed that of conventional HMM-DNN based state-of-the-art ASR systems [6]. Despite their high accuracy, attention models that require a full alignment between the input and output sequences cannot provide streaming ASR services. Several studies have addressed this lack of streaming capability [7, 8, 9]. In [7], an online neural transducer was proposed that applies full attention over chunks of the input and is trained with an additional end-of-chunk symbol. In [8], a hard monotonic attention based model was proposed for streaming decoding with acceptable accuracy degradation. Furthermore, in [9], the monotonic chunk-wise attention (MoChA) method was proposed, which achieved promising accuracy by loosening the hard monotonic alignment constraint and applying soft attention over a few speech chunks.

In this paper, we explain how we improved our MoChA-based ASR system into a solution ready for on-device commercialization. First, we trained the MoChA models jointly with connectionist temporal classification (CTC) and cross-entropy (CE) losses to learn alignment information precisely. A minimum word error rate (MWER) method, a type of sequence-discriminative training, was adopted to optimize the models [10]. Also, for better stability and convergence during training, we applied a layer-wise pre-training mechanism [11]. Furthermore, in order to compress the models, we present a hyper low-rank matrix approximation (hyper-LRA) method employing DeepTwist [12] with minimal accuracy degradation. Another important requirement for commercializing ASR solutions is to boost the recognition accuracy for user context-specific keywords. In order to bias the ASR system at inference time, we fused the MoChA models with statistical n-gram based personalized language models (LMs).

The main contribution of this paper is, to the best of our knowledge, the first successful attention-based streaming ASR system capable of running on devices. We succeeded not only in training MoChA models with large corpora for Korean and English, but also in satisfying the needs of commercial on-device applications.

* Equal contribution. The authors would like to thank Jiyeon Kim, Mehul Kumar and Abhinav Garg for their constructive comments.
The rest of this paper is organized as follows: Section 2 explains the attention-based speech recognition models. Section 3 describes how the optimization methods improved recognition accuracy, and Section 4 presents the compression algorithm for the MoChA models. Section 5 describes the n-gram LM fusion for on-demand adaptation, and Sections 6 and 7 present the related experimental results and discussion.

2. MODEL ARCHITECTURE

Attention-based encoder-decoder models are composed of an encoder, a decoder, and an attention block between the two [13]. The encoder converts an input sequence into a sequence of hidden vector representations referred to as encoder embeddings. The decoder is a generative model which predicts a sequence of target labels. The attention is used to learn the alignment between the two sequences, the encoder embeddings and the target labels.

2.1. Attention-based speech recognition

The attention-based models can be applied to ASR systems [14, 15] using the following equations (1)-(4):

h_t = Encoder(h_{t-1}, x_t)    (1)

where x = {x_1, x_2, ..., x_T} is the speech feature vector sequence and h = {h_1, h_2, ..., h_T} is the sequence of encoder embeddings. The Encoder can be constructed of bi- or uni-directional long short-term memory (LSTM) layers [16]. Due to the difference in length between the input and output sequences, the model often has difficulty converging. To compensate for this, pooling along the time axis is applied to the outputs of intermediate encoder layers, effectively reducing the length of h to T' < T.

e_{t,l} = Energy(h_t, s_l) = v^T tanh(W_h h_t + W_s s_l + b)    (2)
a_{t,l} = Softmax(e_{t,l})

An attention weight a_{t,l}, often referred to as the alignment, represents how the encoder embedding of each frame h_t and the decoder state s_l are correlated [13]. We employed an additive attention method to compute the correlations. A softmax function converts the attention energies into a probability distribution which is used as the attention weights. A weighted sum of the encoder embeddings is computed using the attention weights as

c_l = Σ_{t=1}^{T'} a_{t,l} h_t    (3)

where c_l denotes the context vector. Since the encoder embeddings of the entire input sequence are used to compute the context, we refer to this attention method as full attention. The Decoder, which consists of uni-directional LSTM layers, computes the current decoder state s_l from the previously predicted label y_{l-1} and the previous context c_{l-1}, and the output label y_l is calculated by a Prediction block using the current decoder state, the context, and the previous output label:

s_l = Decoder(s_{l-1}, y_{l-1}, c_{l-1})    (4)
y_l = Prediction(s_l, y_{l-1}, c_l)

Typically, the Prediction block consists of one or two fully connected layers and a softmax layer that generates a probability score for the target labels. We applied a max-pooling layer between the two fully connected layers. The probability of the predicted output sequence y = {y_1, y_2, ..., y_L} for a given x is calculated as in equation (5),

P(y|x) = Π_{l=1}^{L} P(y_l | x, y_{1:l-1})    (5)

where P(y_l | x, y_{1:l-1}) is the probability of each output label. Even though attention-based models have shown state-of-the-art performance, they are not a suitable choice for streaming ASR, particularly because they must compute the alignment between the current decoder state and the encoder embeddings of the entire input sequence.
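As a concrete illustration of equations (2)-(3), the following is a minimal NumPy sketch of additive (full) attention for a single decoder step. The array shapes and variable names are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def full_attention_step(h, s, W_h, W_s, b, v):
    """Additive (full) attention for one decoder step, as in Eqs. (2)-(3).

    h:   (T', d_h) encoder embeddings for the whole utterance
    s:   (d_s,)    current decoder state
    W_h: (d_a, d_h), W_s: (d_a, d_s), b: (d_a,), v: (d_a,) attention parameters
    Returns the attention weights a (T',) and the context vector c (d_h,).
    """
    # Eq. (2): energy e_t = v^T tanh(W_h h_t + W_s s + b) for every frame t
    e = np.tanh(h @ W_h.T + s @ W_s.T + b) @ v   # (T',)
    a = softmax(e)                               # attention weights over all frames
    # Eq. (3): context is the attention-weighted sum of encoder embeddings
    c = a @ h                                    # (d_h,)
    return a, c
```

Because every frame of h is needed before the first weight can be computed, this full-attention step is what prevents streaming decoding.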
2.2. Monotonic Chunk-wise Attention

A monotonic chunk-wise attention (MoChA) model was introduced to resolve the streaming incapability of attention-based models, under the assumption that the alignment between the speech input and the output text sequence should be monotonic [8, 9]. A MoChA model computes the context using two kinds of attention: a hard monotonic attention and a soft chunkwise attention. The hard monotonic attention is computed as

e^{monotonic}_{t,l} = MonotonicEnergy(h_t, s_l)
a^{monotonic}_{t,l} = σ(e^{monotonic}_{t,l})
z_{t,l} ∼ Bernoulli(a^{monotonic}_{t,l}) = 1 if a^{monotonic}_{t,l} ≥ 0.5, and 0 otherwise    (6)

where z_{t,l} is the hard monotonic attention used to determine whether to attend to the encoder embedding h_t. The Decoder attends to the u-th encoder embedding to predict the next label only if z_{u,l} = 1. Equation (6) is evaluated for t ≥ u_{l-1}, where u_{l-1} denotes the index of the encoder embedding attended to for the previous output label prediction.
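A minimal sketch of the hard monotonic selection rule in equation (6) at inference time is shown below; the sigmoid threshold of 0.5 follows the equation above, while the MonotonicEnergy network itself is abstracted into a callable, and all names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_attention_point(h, s_l, monotonic_energy, u_prev):
    """Scan encoder embeddings from the previously attended index u_prev
    and stop at the first frame whose monotonic attention fires (Eq. (6))."""
    T = h.shape[0]
    for t in range(u_prev, T):
        a = sigmoid(monotonic_energy(h[t], s_l))  # a^{monotonic}_{t,l}
        if a >= 0.5:                              # hard decision z_{t,l} = 1
            return t                              # new attention point u_l
    return None  # nothing selected yet; wait for more frames when streaming
```

Because the scan starts at u_prev and never moves backwards, the decoder only needs the encoder embeddings seen so far, which is what makes streaming decoding possible.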
The soft chunkwise attention is computed as

e^{chunk}_{t,l} = ChunkEnergy(h_t, s_l)
a^{chunk}_{t,l} = Softmax(e^{chunk}_{t,l})    (7)
c_l = Σ_{t=u-w+1}^{u} a^{chunk}_{t,l} h_t

where u is the attending point chosen by the monotonic attention, w is the pre-determined chunk size, a^{chunk}_{t,l} is the chunkwise soft attention weight, and c_l is the chunkwise context used to predict the output label.

We used a modified additive attention for computing the attention energies in order to ensure model stability [9]:

Energy'(h_t, s_l) = g (v^T / ||v||) tanh(W_h h_t + W_s s_{l-1} + b) + r    (8)

where g and r are additional trainable scalars, and the other variables are the same as in Energy() in equation (2). MonotonicEnergy() and ChunkEnergy() are both computed using equation (8), each with its own trainable variables.

Fig. 1. Model architecture.

3. TRAINING AND OPTIMIZATION

The main objective of the attention-based encoder-decoder model is to find the parameters which minimize the cross-entropy (CE) between the predicted sequences and the ground truth sequences,

L_CE = − Σ_l log P(y*_l | x, y*_{1:l-1})    (9)

where y* is the ground truth label sequence. We trained the MoChA models with the CTC and CE losses jointly to learn the alignment better, and MWER-based sequence-discriminative training was employed to further improve the accuracy. Moreover, in order to ensure stability while training the MoChA models, we adopted a pre-training scheme.

3.1. Joint CTC-CE training

Despite the difference in length between the speech feature sequences and the corresponding text sequences, the CTC loss drives the model to maximize the total probability of all possible alignments between the input and the output sequence [17]. The CTC loss is defined as

L_CTC = − log Σ_{π ∈ Π(y*)} Π_{t=1}^{T} P(y^π_t | x_t)    (10)

where Π(y*) is the set of all alignments of the same length as the input speech frames, generated from y* by inserting the {Blank} symbol and repeating output units, π is one such alignment, and P(y^π_t | x_t) is the probability that the t-th predicted label is the t-th label of alignment π.

The CTC loss can readily be applied to training the MoChA model, especially the encoder, because it also drives the alignment between input and output to be monotonic. Moreover, the CTC loss has the advantage of learning alignments in noisy environments and can help the attention-based model learn its alignment quickly through joint training [18]. The joint training loss is defined as

L_Total = λ L_CE + (1 − λ) L_CTC,    λ ∈ [0, 1]    (11)

where L_Total is the joint loss of the two losses.

3.2. MWER training

In this paper, byte-pair encoding (BPE) based sub-word units were used as the output units of the decoder [19]. The model is therefore optimized to generate individual BPE output sequences well. However, the eventual goal of speech recognition is to reduce the word error rate (WER). Also, since the decoder is used with a beam search during inference, it is effective to improve accuracy by directly defining a loss that lowers the expected WER of the candidate beam search results. The MWER loss is defined as

L_MWER = Σ_{b ∈ B} P(y_b | x) (W_b − W̄)    (12)

where B is the set of candidate beam-search results and W_b is the number of word errors of each beam result sequence y_b. Subtracting the average word error over all beams, W̄, helps the model converge by reducing the variance.

L_Total = λ L_CE + (1 − λ) L_MWER,    λ ∈ [0, 1]    (13)

The MWER loss L_MWER can also be easily combined with the CE loss by linear interpolation, as shown in equation (13).
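To make equations (12)-(13) concrete, here is a small NumPy sketch that evaluates the MWER objective for one utterance from an N-best list; the renormalization of the beam scores and all variable names are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def mwer_loss(beam_log_probs, beam_word_errors, lambda_ce=0.6, ce_loss=None):
    """Expected word-error objective over an N-best list, as in Eqs. (12)-(13).

    beam_log_probs:   (B,) model log-probabilities log P(y_b | x) of each beam
    beam_word_errors: (B,) number of word errors W_b of each beam vs. the reference
    """
    # Renormalize the beam scores so they form a distribution over the N-best list
    # (a common choice; the paper writes P(y_b | x) directly).
    p = np.exp(beam_log_probs - np.max(beam_log_probs))
    p = p / p.sum()

    w = np.asarray(beam_word_errors, dtype=float)
    w_bar = w.mean()                   # average word error over the beams
    l_mwer = np.sum(p * (w - w_bar))   # Eq. (12): expected relative word error

    if ce_loss is None:
        return l_mwer
    # Eq. (13): linear interpolation with the CE loss (lambda = 0.6 in Sec. 6.2)
    return lambda_ce * ce_loss + (1.0 - lambda_ce) * l_mwer
```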
3.3. Layer-wise pre-training

Layer-wise pre-training of the encoder was proposed in [11] to ensure that the model converges well and achieves better performance. The initial encoder consists of 2 LSTM layers with a max-pool layer with a pooling factor of 32 in between. After sub-epochs of training, a new LSTM layer and a new max-pool layer are added, and the total pooling factor of 32 is split into 2 for the lower layer and 16 for the newly added higher layer. This strategy is repeated until the entire network is completely built, with 6 encoder LSTM layers and 5 max-pool layers each having a pooling factor of 2. Finally, the total time-reduction factor is changed to 8, with only the lower 3 max-pool layers keeping a pooling factor of 2.

During pre-training of our MoChA models, the training and validation errors shot up whenever a new LSTM layer and max-pool layer were piled on at a new stage. In order to address this, we employed a learning-rate warm-up for every new pre-training stage.

3.4. Spec augmentation

Because an end-to-end ASR model learns from a transcribed corpus, a large dataset is one of the most important factors for achieving better accuracy. Data augmentation methods have been introduced to generate additional data from the original data, and recently SpecAugment showed state-of-the-art results on a public dataset [20]. SpecAugment masks parts of the spectrogram along the time and frequency axes, so the model learns to recognize masked speech with incomplete information.
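A minimal sketch of the time/frequency masking described above is given below; the mask widths follow the values quoted later in Section 6.2 (at most 13 mel channels and 50 frames, one mask of each type), but the function itself is our own illustrative assumption rather than the paper's implementation.

```python
import numpy as np

def spec_augment(features, max_freq_mask=13, max_time_mask=50, rng=None):
    """Mask one frequency band and one time span of a (T, F) feature matrix."""
    rng = rng or np.random.default_rng()
    x = features.copy()
    T, F = x.shape

    # Frequency mask: zero out f consecutive mel channels
    f = rng.integers(0, max_freq_mask + 1)
    f0 = rng.integers(0, max(F - f, 0) + 1)
    x[:, f0:f0 + f] = 0.0

    # Time mask: zero out t consecutive frames
    t = rng.integers(0, max_time_mask + 1)
    t0 = rng.integers(0, max(T - t, 0) + 1)
    x[t0:t0 + t, :] = 0.0
    return x
```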
4. LOW-RANK MATRIX APPROXIMATION

We adopted a low-rank matrix approximation (LRA) algorithm based on singular value decomposition (SVD) to compress our MoChA model [21]. Given a weight matrix W ∈ R^{m×n}, its SVD is U Σ V^T, where Σ ∈ R^{m×n} is a diagonal matrix of singular values, and U ∈ R^{m×m} and V ∈ R^{n×n} are unitary matrices. If we specify a rank r < mn/(m+n), the truncated SVD is W̃ = Ũ Σ̃ Ṽ^T ∈ R^{m×n}, where Ũ ∈ R^{m×r}, Σ̃ ∈ R^{r×r}, and Ṽ ∈ R^{n×r} are the top-left submatrices of U, Σ, and V, respectively. For an LRA, W is replaced by the product of Ũ' = Ũ Σ̃ and Ṽ^T, and the number of weight parameters is reduced since r(m+n) < mn. Hence we obtain a compression ratio ρ = mn / (r(m+n)) in memory footprint and in the computational complexity of the matrix-vector multiplication. The LRA introduces a weight distortion

ΔW = Ũ' Ṽ^T − W.    (14)

For each layer, given an input x, the resulting output error is

e = σ((W + ΔW) x + b) − σ(W x + b)    (15)

where b and σ(·) denote a bias vector and a non-linear activation function, respectively. This error propagates through the layers and increases the training loss. In the LRA, Ũ' and Ṽ^T are updated in the backward pass, constraining the weight space to r(m+n) dimensions. However, for large ρ, it is difficult to find optimal weights because of the reduced dimension of the weight space. Instead, we find an optimal LRA by employing the DeepTwist method [12], which we call hyper-LRA. The hyper-LRA algorithm modifies the retraining process by adding the LRA distortion to the weights, and the corresponding errors to the outputs of the layers, every D iterations, where D is the distortion period. After retraining, the product of Ũ' and Ṽ^T is used in the inference model instead of W.

Algorithm 1: The hyper-LRA algorithm

Procedure TrainModelWeights()
  Result: weight matrices {W_i}
  for each iteration N do
    for each layer i do
      x_{i+1} ← σ(W_i x_i + b_i)          // x_i: layer input
      if N mod D = 0 then
        x_{i+1} ← x_{i+1} + e_i           // e_i as in (15)
        W_i ← W_i + ΔW_i                  // ΔW_i as in (14)
      end
    end
    compute the loss L
    for each layer i do
      W_i ← W_i − η ∂L/∂W_i               // η: learning rate
    end
  end
  for each layer i do
    W_i ← W_i + ΔW_i
  end

Procedure InferenceModelWeights()
  Result: weight matrices {Ũ'_i}, {Ṽ_i^T}
  for each layer i do
    Ũ'_i, Ṽ_i^T ← truncatedSVD(W_i)
  end

Note that the hyper-LRA algorithm optimizes W rather than Ũ' and Ṽ^T. In other words, the hyper-LRA method performs weight optimization in the hyperspace of the truncated weight space, which has the same dimension as the original weight space. Therefore, the hyper-LRA approach can provide a much higher compression ratio than conventional LRA, although it requires more computation for the retraining process.
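The following NumPy sketch shows the core of the scheme just described: the rank-r truncated SVD behind equation (14) and the DeepTwist-style step that snaps the weights onto their low-rank approximation every D iterations. It is a simplified illustration under our own naming assumptions, not the authors' training code.

```python
import numpy as np

def truncated_svd(W, r):
    """Return U' = U_r S_r and V_r^T so that U' @ V_r^T is the best rank-r
    approximation of W in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_prime = U[:, :r] * s[:r]        # fold the singular values into U
    return U_prime, Vt[:r, :]

def lra_distortion(W, r):
    """Delta W = U' V^T - W, as in Eq. (14)."""
    U_prime, Vt = truncated_svd(W, r)
    return U_prime @ Vt - W

def hyper_lra_step(weights, ranks, iteration, D):
    """Every D iterations, replace each weight matrix by its rank-r projection
    (the weight update of Algorithm 1); gradient updates keep acting on the full W."""
    if iteration % D == 0:
        for name, W in weights.items():
            weights[name] = W + lra_distortion(W, ranks[name])
    return weights
```

At inference time only Ũ' and Ṽ^T are stored, which is what yields the compression ratio ρ = mn / (r(m+n)) discussed above.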
5. ON-DEMAND ADAPTATION

On-demand adaptation is an important requirement not only for personal devices such as mobile phones but also for home appliances such as televisions. We adopted a shallow fusion [22] method that incorporates n-gram LMs at inference time. By interpolating both a general LM and domain-specific LMs, we were able to boost the accuracy on the target domains while minimizing the degradation on the general domain. The probabilities computed by the LMs and the E2E model are interpolated at each beam step as follows:

P'(y_l | x, y_{1:l-1}) = log P(y_l | x, y_{1:l-1}) + Σ_{n=1}^{N} α_n log P_{LM_n}(y_l | y_{1:l-1})    (16)

where N is the number of n-gram LMs and P_{LM_n} is the posterior distribution computed by the n-th n-gram LM. The LM distribution is calculated by looking up a probability for each BPE unit given the context.
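As an illustration of the shallow-fusion score in equation (16), the sketch below combines the model's log-probabilities with several n-gram LM log-probabilities at one beam step; the LM lookup interface and the interpolation weights are illustrative assumptions.

```python
import numpy as np

def fused_scores(model_log_probs, lm_lookups, history, lm_weights):
    """Shallow-fusion scores for one beam step, as in Eq. (16).

    model_log_probs: (V,) log P(y_l | x, y_{1:l-1}) from the MoChA decoder
    lm_lookups:      list of callables, each returning (V,) log P_LM(y_l | history)
    history:         previously decoded BPE units y_{1:l-1}
    lm_weights:      list of interpolation weights alpha_n, one per LM
    """
    score = np.array(model_log_probs, dtype=float)
    for alpha, lm in zip(lm_weights, lm_lookups):
        score += alpha * lm(history)   # add each n-gram LM's log-probabilities
    return score                        # used to rank hypotheses in the beam search
```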
6. EXPERIMENT

6.1. Experimental setup

We evaluated first on the Librispeech corpus, which consists of 960 hours of data, and also on internal usage data. The usage corpus consists of around 10K hours of transcribed speech each for Korean and English, recorded on mobile phones and televisions. We used one randomly sampled hour of usage data per language as our validation set. We doubled the speech corpus by adding random noise, both for training and for validation. The decoding speed was evaluated on a Samsung Galaxy S10+ equipped with a Samsung Exynos 9820 chipset, a Mali-G76 MP12 GPU, and 12 GB of DRAM.

The speech data were sampled at 16 kHz with 16 bits per sample. They were coded into 40-dimensional power mel-frequency filterbank features, computed by applying a power operation with coefficient 1/15 to the mel-spectrogram [23]. The frames were computed every 10 ms and windowed with a 25 ms Hanning window. We split words into 10K word pieces using the byte-pair encoding (BPE) method for both the Korean and the English normalized text corpora. Especially for Korean, we reduced the total number of Korean character units from 11,172 to 68 by decomposing each Korean character into two or three graphemes depending on the presence of a final consonant.

We constructed our ASR system based on ReturNN [24]. To speed up training, we used multi-GPU training based on the Horovod [25, 26] all-reduce method, and for better convergence of the model, a ramp-up strategy for both the learning rate and the number of workers was used [6]. We used uniform label smoothing on the output label distribution of the decoder for regularization, and scheduled sampling was applied at a later stage of training to reduce the mismatch between training and decoding. The initial total pooling factor in the Encoder is 32 for Librispeech, but 16 is used for the internal usage data due to training sensitivity; both are reduced to 8 after the pre-training stage. The n-gram LMs were stored in a const arpa structure [27].
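A minimal sketch of this feature frontend (40-dimensional power-mel filterbank features with a 1/15 power coefficient, 10 ms shift, 25 ms window) is given below, using librosa purely as an illustrative choice of library; the FFT size and the exact implementation used in the paper are not specified and are our assumptions.

```python
import numpy as np
import librosa

def power_mel_features(wav_path, sr=16000, n_mels=40, power_coeff=1.0 / 15.0):
    """40-dim power mel-filterbank features: 25 ms window, 10 ms shift,
    followed by a 1/15 power compression of the mel-spectrogram."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=512,                    # assumption: FFT size covering the 25 ms window
        win_length=int(0.025 * sr),   # 25 ms analysis window
        hop_length=int(0.010 * sr),   # 10 ms frame shift
        n_mels=n_mels)
    return np.power(mel, power_coeff).T   # (frames, 40)
```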
6.2. Performance

We performed several experiments to build the baseline E2E model on each dataset; the evaluated accuracies are shown in Table 1. In the table, Bi- and Uni- mean bi-directional and uni-directional, respectively, and Cell size denotes the size of the encoder LSTM cells. The attention dimension is the same as the encoder cell size, and 1000 was used as the size of the decoder. The chunk size of MoChA is 2 for all the experiments, since we did not see any significant improvement in accuracy from extending the chunk size beyond two.

Fig. 2. Comparison of the alignments produced by each attention method: (a) Bi-LSTM full attention, (b) Uni-LSTM full attention, (c) Uni-LSTM MoChA.

As shown in Fig. 2, compared with the bi-directional LSTM case, the uni-directional model's alignment shows some time delay because a uni-directional LSTM cannot use backward information from the input sequence [28]. Alignment calculation with soft full attention may have the advantage of seeing more information and utilizing better context, but for speech recognition, since the alignment of speech utterances is monotonic, that advantage may not be so great.

The accuracy of each trained model with the various optimization methods is shown in Table 2. The joint weight λ for joint CTC training was 0.8, and it was gradually increased during training. We used 13 and 50 as the maximum sizes of the frequency-axis mask and the time (frame)-axis mask, respectively, and one mask of each type was applied. For joint MWER training, we used λ = 0.6 and a beam size of 4. Spec augmentation brought a large improvement, especially on test-other, and after joint MWER training the WERs on test-clean and test-other were finally improved by a relative 16.41% and 17.71%, respectively, compared to the baseline.

Table 1. Performance of the attention-based models depending on the directionality and the cell size of the encoder LSTM layers. Joint CTC and label smoothing are applied for all the results, and data augmentation is only used on the usage data. The beam size for beam-search decoding is 12.

Encoder    Attention  Cell size  Librispeech test-clean  test-other  Usage KOR WER  Usage ENG WER
Bi-LSTM    Full       1024       4.38%                   14.34%      8.58%          8.25%
Uni-LSTM   Full       1536       6.27%                   18.42%      -              -
Uni-LSTM   MoChA      1024       6.88%                   19.11%      11.34%         10.77%
Uni-LSTM   MoChA      1536       6.30%                   18.41%      9.33%          8.82%

Table 2. Accuracy improvement from the optimizations (Librispeech).

                                 Test-clean  Test-other
MoChA (baseline)                 6.70%       18.86%
+ Joint CTC & label smoothing    6.30%       18.41%
+ Spec augmentation              5.93%       15.98%
+ Joint MWER                     5.60%       15.52%

6.3. Compression

We applied hyper-LRA to the weight matrices of each layer, with the rank r chosen empirically per layer. For the encoder LSTM layers, the ranks of the first and last layers were set larger than those of the internal layers due to the severity of their accuracy degradation. The distortion period D was set to the total number of iterations in one sub-epoch divided by 16. The compressed model was retrained with the whole training data. In addition, we adopted 8-bit quantization using TensorFlow Lite [29, 30], both to compress the model and to speed it up. As shown in Table 3, the model sizes were reduced by at least 3.4 times by applying hyper-LRA, and by more than 13.68 times in total after 8-bit quantization, with minimal performance degradation. Furthermore, we were able to compensate for the lost accuracy with joint MWER training. At the same time, the decoding speeds of the Korean and English models became 13.97 and 9.81 times faster than the baseline models, respectively. The average latencies of the final models were 140 ms and 156 ms, and the memory usage during decoding (code + data) was 230 MB and 235 MB for Korean and English, respectively.

Table 3. Performance of hyper-LRA. Model sizes are given in megabytes (MB), and the beam size is 4. xRT denotes the real-time factor of decoding speed.

                       Korean                     English
Bits   Hyper-LRA       WER     xRT    Size        WER     xRT    Size
32     no              9.37    4.89   530.56      9.03    4.32   530.50
32     yes             9.85    0.99   140.18      8.91    1.15   153.98
32     +MWER           9.60    1.26   140.18      8.64    1.48   153.98
8      no              9.64    1.18   132.88      9.07    0.94   132.87
8      yes             10.21   0.33   35.34       9.24    0.38   38.77
8      +MWER           9.80    0.35   35.34       8.88    0.44   38.77

6.4. Personalization

We evaluated our on-demand adaptation method on three domains in Korean. Names of contacts, IoT devices, and applications were used to synthesize utterances from pattern sentences in which the names were replaced with a class name, e.g., "call @name". Individual n-gram LMs were built for each domain using the synthesized corpus; the LMs for the specific domains were built within 5 seconds, as shown in Table 4.

Table 4. Building times for the n-gram LMs (in seconds).

Domain    Entities  Patterns  Utterances  Time
Contact   2307      23        53061       4.37
App       699       25        17475       1.78
IoT       441       4         1764        0.74

As shown in Table 5, the WER for the App domain dropped dramatically from 12.76% to 6.78% without any accuracy degradation on the general domain. The additional xRT for the LM fusion was less than 0.15 on average, even though the number of LM look-ups reached into the millions. The LM sizes were around 43 MB for the general LM and 2 MB for all three domain LMs. All test sets were recorded on mobile devices.

Table 5. Performance improvement from on-demand adaptation. xRTs were evaluated on-device, while WERs were evaluated on servers with the uncompressed MoChA model from Table 1.

          Length     MoChA            Adapted
Domain    (hours)    WER      xRT     WER      xRT
General   1.0        9.33     0.35    9.30     0.61
Contact   3.1        15.59    0.34    11.08    0.42
App       1.2        12.76    0.34    6.78     0.48
IoT       1.5        38.83    0.43    21.92    0.52
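The adaptation corpus described in Section 6.4 can be synthesized with something as simple as the sketch below, which expands pattern sentences containing a class tag with the user's entity names; the pattern format and the function are illustrative assumptions, and the resulting text would then be fed to a standard n-gram LM toolkit.

```python
def synthesize_adaptation_corpus(patterns, entities, tag="@name"):
    """Expand pattern sentences such as "call @name" with user-specific entities,
    producing the text corpus from which a domain n-gram LM is built."""
    corpus = []
    for pattern in patterns:
        for entity in entities:
            corpus.append(pattern.replace(tag, entity))
    return corpus

# Example: 23 contact patterns x 2307 contact names ~ 53K utterances (cf. Table 4)
utterances = synthesize_adaptation_corpus(
    patterns=["call @name", "send a message to @name"],
    entities=["Alice", "Bob"])
```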
7. DISCUSSION

We have built the first on-device streaming ASR system based on MoChA models trained with a large corpus. To overcome the difficulties in training the MoChA models, we adopted various training strategies such as joint loss training with CTC and MWER, layer-wise pre-training, and data augmentation. Moreover, by introducing hyper-LRA, we could reduce the size of our MoChA models to fit on devices without sacrificing recognition accuracy. For personalization, we used a shallow fusion method with n-gram LMs, which showed improved results on the target domains without sacrificing accuracy on the general domain.
8. REFERENCES

[1] Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, David Seetapun, Anuroop Sriram, and Zhenyao Zhu, "Exploring neural transducers for end-to-end speech recognition," 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 206-213, 2017.

[2] Rohit Prabhavalkar, Kanishka Rao, Tara N. Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly, "A comparison of sequence-to-sequence models for speech recognition," in Proc. Interspeech 2017, 2017, pp. 939-943.

[3] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Listen, attend and spell," ArXiv, vol. abs/1508.01211, 2015.

[4] Alex Graves, "Sequence transduction with recurrent neural networks," ArXiv, vol. abs/1211.3711, 2012.

[5] Kanishka Rao, Hasim Sak, and Rohit Prabhavalkar, "Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer," 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 193-199, 2017.

[6] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani, "State-of-the-art speech recognition with sequence-to-sequence models," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774-4778, 2018.

[7] Navdeep Jaitly, David Sussillo, Quoc V. Le, Oriol Vinyals, Ilya Sutskever, and Samy Bengio, "A neural transducer," ArXiv, vol. abs/1511.04868, 2016.

[8] Colin Raffel, Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck, "Online and linear-time attention by enforcing monotonic alignments," in ICML, 2017.

[9] Chung-Cheng Chiu and Colin Raffel, "Monotonic chunkwise attention," in International Conference on Learning Representations, 2018.

[10] Rohit Prabhavalkar, Tara N. Sainath, Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-Cheng Chiu, and Anjuli Kannan, "Minimum word error rate training for attention-based sequence-to-sequence models," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4839-4843, 2018.

[11] Albert Zeyer, André Merboldt, Ralf Schlüter, and Hermann Ney, "A comprehensive analysis on attention models," in Interpretability and Robustness in Audio, Speech, and Language (IRASL) Workshop, Conference on Neural Information Processing Systems (NeurIPS), Montreal, Canada, Dec. 2018.

[12] D. Lee, P. Kapoor, and B. Kim, "Deeptwist: Learning model compression via occasional weight distortion," ArXiv, vol. abs/1810.12823, 2018.

[13] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[14] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 577-585.

[15] Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney, "Improved training of end-to-end attention models for speech recognition," in Proc. Interspeech 2018, 2018, pp. 7-11.

[16] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, Nov. 1997.

[17] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006.

[18] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint ctc-attention based end-to-end speech recognition using multi-task learning," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835-4839, 2017.

[19] Rico Sennrich, Barry Haddow, and Alexandra Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.

[20] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Interspeech 2019, 2019, pp. 2613-2617.

[21] J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, "Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network," in ICASSP, 2014.

[22] Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N. Sainath, Zhifeng Chen, and Rohit Prabhavalkar, "An analysis of incorporating an external language model into a sequence-to-sequence model," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

[23] Chanwoo Kim, Minkyu Shin, Abhinav Garg, and Dhananjaya Gowda, "Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system," in Proc. Interspeech 2019, 2019, pp. 739-743.

[24] Patrick Doetsch, Albert Zeyer, Paul Voigtlaender, Ilya Kulikov, Ralf Schlüter, and Hermann Ney, "Returnn: The rwth extensible training framework for universal recurrent neural networks," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

[25] Alexander Sergeev and Mike Del Balso, "Horovod: fast and easy distributed deep learning in TensorFlow," ArXiv, vol. abs/1802.05799, 2018.

[26] C. Kim, S. Kim, K. Kim, M. Kumar, J. Kim, K. Lee, C. Han, A. Garg, E. Kim, M. Shin, S. Singh, L. Heck, and D. Gowda, "End-to-end training of a large vocabulary end-to-end speech recognition system," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, accepted.

[27] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Dec. 2011, IEEE Signal Processing Society, IEEE Catalog No.: CFP11SRW-USB.

[28] Hasim Sak, Andrew W. Senior, Kanishka Rao, and Françoise Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," ArXiv, vol. abs/1507.06947, 2015.

[29] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org.

[30] Google Inc., "Tensorflow Lite," online documentation, https://www.tensorflow.org/lite, 2018, [Online; accessed 2018-02-07].