DOMAIN ROBUST, FAST, AND COMPACT NEURAL LANGUAGE MODELS
Alexander Gerstenberger¹, Kazuki Irie¹·*, Pavel Golik², Eugen Beck¹·², Hermann Ney¹·²
¹Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, 52074 Aachen, Germany
²AppTek GmbH, 52062 Aachen, Germany
alexander.gerstenberger@rwth-aachen.de, {irie, beck, ney}@cs.rwth-aachen.de, pgolik@apptek.com

*Work conducted while the author was at RWTH Aachen. Now with the Swiss AI Lab, IDSIA, USI & SUPSI, 6928 Manno-Lugano, Switzerland.

ABSTRACT

Despite advances in neural language modeling, obtaining a good model on a large-scale multi-domain dataset still remains a difficult task. We propose training methods for building neural language models for such a task which are not only domain robust, but also reasonable in model size and fast to evaluate. We combine knowledge distillation from pre-trained domain expert language models with the noise contrastive estimation (NCE) loss. Knowledge distillation allows us to train a single student model which is both compact and domain robust, while the use of the NCE loss makes the model self-normalized, which enables fast evaluation. We conduct experiments on a large English multi-domain speech recognition dataset provided by AppTek. The resulting student model is of the size of one domain expert, while it gives perplexities similar to those of the various teacher models on their expert domains; the model is self-normalized, allowing for 30% faster first-pass decoding than naive models which require the full softmax computation; and finally it gives improvements of more than 8% relative in terms of word error rate over a large multi-domain 4-gram count model trained on more than 10 B words.

Index Terms— language modeling, domain robustness, teacher student learning, ASR

1. INTRODUCTION

Neural network language models [1], such as long short-term memory (LSTM) [2] recurrent neural networks (RNNs) [3, 4] or Transformers [5–7], have been shown to consistently outperform n-gram language models and to give large improvements for automatic speech recognition. However, such improvements are not obtained for free; they are the result of careful tuning of the model hyper-parameters. In practice, for large-scale tasks (with more than a few billion words of training text) containing sub-corpora with multiple domains, it is not straightforward to obtain a good neural language model [8]. First, the model size must be increased for a large amount of data. This slows down the training and tuning process, which is crucial for obtaining a good neural language model. Second, the diversity in the data requires some extra modeling effort [9, 10] to build a robust model, in contrast to n-gram language models, for which such diversity can simply be leveraged by static or Bayesian interpolation [11, 12].

In this work, we aim at building neural language models (LMs) on a large-scale multi-domain dataset which are not only domain robust (no need for a domain label at test time), but also reasonable in terms of model size, and fast to evaluate. To achieve this goal, we combine knowledge distillation (KD) [13–15] using pre-trained domain expert models with the noise contrastive estimation (NCE) [16–19] loss. We conduct our experiments on a multi-domain speech recognition dataset provided by AppTek, which offers about 10 B words (from which we selected 1.2 B for neural model training) for language model training. We demonstrate that our method effectively achieves a language model which is of the size of one expert, while it gives perplexities similar to those of the teacher models on their expert domains, and it is self-normalized. We implement our models using the TensorFlow [20] based open-source toolkit RETURNN [21]¹.

¹Example config files are available at https://github.com/rwth-i6/returnn-experiments/tree/master/2020-lm-domain-robust.
2. RELATED WORK

In [9], a domain robust neural language model is constructed as a large mixture of domain experts. An obvious downside of such an approach is the large size of the final model. In this work, instead of copying all domain experts' parameters, we make use of knowledge distillation [13–15] to obtain a single student model which is both compact and domain robust. Distillation from domain experts for robust acoustic modeling has been investigated in [22]. In the case of language modeling, distillation must be combined with an efficient softmax computation method: in previous work, [23] uses the word-class factorized output, while [24] uses the NCE. In this work, our primary goal is to use the NCE, since it makes the model self-normalized and therefore fast to evaluate, while we also compare it with the sampled softmax [25] variant. Our experiments also include knowledge transfer from powerful Transformer teacher models to a single domain robust LSTM student.
3. TRAINING METHOD

3.1. Knowledge distillation for large vocabulary LM

For knowledge distillation from a teacher p_T(w|h) to a student language model p_θ(w|h) with parameters θ and a vocabulary V, we optimize θ to minimize the distillation loss, which is computed for each history h in the data:

    L_{KD}(h; θ) = − \sum_{w ∈ V} p_T(w|h) \log p_θ(w|h)    (1)

In practice, this term is interpolated with the standard cross-entropy loss using an interpolation weight.

When large vocabulary word-level language models are trained using some method for avoiding the full softmax, the distillation loss must also be adapted accordingly. We consider both NCE and sampled softmax methods.

Knowledge distillation using sampled softmax: In the sampled softmax loss [25], the normalization term in the softmax is computed based on a subset of words sampled for each batch from a noise distribution. Thus, we can directly obtain the distillation loss by replacing p_T(w|h) and p_θ(w|h) in Eq. (1) with the corresponding sampled softmax probabilities, making sure to use the same samples for the teacher and the student.
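The authors implement their models in RETURNN/TensorFlow; purely as an illustration, the following PyTorch-style sketch shows how Eq. (1), restricted to a shared candidate set as in the sampled softmax variant and interpolated with the standard cross-entropy loss, could look. The function name, the candidate-set layout (true word plus shared noise samples), and the default weight of 0.5 (the value later reported in Sec. 5.1) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_kd_loss(student_logits, teacher_logits, target_idx, alpha=0.5):
    """Distillation loss of Eq. (1) interpolated with cross-entropy.

    student_logits, teacher_logits: [batch, num_candidates] scores over the
    SAME candidate set (true word plus shared noise samples), as required for
    the sampled softmax variant.
    target_idx: [batch] position of the true word within the candidate set.
    alpha: interpolation weight between distillation and cross-entropy loss.
    """
    # Sampled-softmax probabilities: normalize only over the shared candidates.
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach(), dim=-1)   # teacher is frozen
    # Eq. (1): cross-entropy between teacher and student distributions.
    kd = -(p_teacher * log_p_student).sum(dim=-1)
    # Standard cross-entropy against the true next word.
    ce = -log_p_student.gather(-1, target_idx.unsqueeze(-1)).squeeze(-1)
    return (alpha * kd + (1.0 - alpha) * ce).mean()
```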
Knowledge distillation using NCE: While sampled softmax makes training faster than the full softmax, the NCE [16–18] loss allows us both to train faster and to obtain self-normalized models (so that at test time we only have to compute the exponential of the logits). The NCE loss trains the model to discriminate noise samples drawn from a noise distribution q from true data by means of logistic regression. For knowledge distillation, we use the loss which is computed for each data point (h, w):

    L_{KD-NCE}(h, w; θ) = − \sum_{w̃ ∈ D_q ∪ {w}} [ g_T(w̃, h) \log g_θ(w̃, h) + (1 − g_T(w̃, h)) \log (1 − g_θ(w̃, h)) ]    (2)

where g_θ(w, h) := σ(s_θ(w, h) − log q(w|h)), with s_θ(w, h) the logits of the student (and similarly g_T(w, h) for the teacher), and D_q is the set of words sampled from the noise distribution q. In order to obtain a self-normalized student model, the teacher models are also pre-trained with the NCE loss.
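A minimal PyTorch-style sketch of the NCE-based distillation loss of Eq. (2), under the assumption that the student and the frozen teacher are scored on the same set consisting of the true word and k shared noise samples; the tensor layout and names are illustrative, not the authors' implementation.

```python
import torch

def nce_kd_loss(student_scores, teacher_scores, log_q):
    """NCE-based distillation loss of Eq. (2).

    student_scores, teacher_scores: [batch, 1 + k] logits s(w, h) for the true
    word (column 0) and k shared noise samples drawn from q.
    log_q: [batch, 1 + k] log-probabilities log q(w|h) of the same words under
    the noise distribution.
    """
    # g(w, h) = sigmoid(s(w, h) - log q(w|h)) for the student and the teacher.
    g_student = torch.sigmoid(student_scores - log_q)
    g_teacher = torch.sigmoid(teacher_scores.detach() - log_q)   # teacher fixed
    # Binary cross-entropy between teacher and student "data vs. noise" posteriors;
    # a small epsilon keeps the logarithms finite.
    eps = 1e-8
    loss = -(g_teacher * torch.log(g_student + eps)
             + (1.0 - g_teacher) * torch.log(1.0 - g_student + eps))
    return loss.sum(dim=-1).mean()
```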
3.2. Knowledge distillation for domain robust modeling

We make use of the distillation methods above to build a single, compact, domain robust model. The teacher model in our experiments is an interpolation of multiple neural language models trained on different sub-corpora of the dataset.

First of all, the target domain labels are specified by the development set. We can assign each training subset to a domain by training a 4-gram count model on the subset, interpolating the models optimized for each domain, and then choosing the domain label with the highest weight (as shown in Sec. 4.1).

We consider two methods for building the interpolated teacher model. First, we simply estimate a single set of interpolation weights for the teacher models based on their perplexity on the entire development set and use a static interpolated teacher model (static teacher approach). Alternatively, we can estimate target-domain-specific interpolation weights based on each development subset and use these domain conditional weights to dynamically define the teacher depending on the domain of the training sequence (domain conditional approach), as in [22]. This results in a better teacher ensemble.
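The two teacher variants can be pictured with the sketch below; the expert weights and domain labels are made up for illustration, and only the mechanism follows the text: one global weight vector for the static teacher versus weights selected by the domain label of the training sequence for the domain conditional teacher.

```python
import torch

# Illustrative weights: one global vector (static teacher) or one vector per
# target domain (domain conditional teacher), e.g. estimated on the dev data.
STATIC_WEIGHTS = torch.tensor([0.5, 0.5])           # [num_experts]
DOMAIN_WEIGHTS = {                                   # domain -> [num_experts]
    "news":  torch.tensor([0.2, 0.8]),
    "movie": torch.tensor([0.8, 0.2]),
}

def teacher_probs(expert_probs, domain=None):
    """Interpolate expert distributions into a single teacher distribution.

    expert_probs: [num_experts, batch, vocab] output distributions of the
    pre-trained domain experts (News, Movie, ...).
    domain: if None, use the static teacher; otherwise pick the weights of the
    training sequence's domain (domain conditional teacher).
    """
    w = STATIC_WEIGHTS if domain is None else DOMAIN_WEIGHTS[domain]
    # Weighted sum over the expert axis.
    return torch.einsum("e,ebv->bv", w, expert_probs)
```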
4. EXPERIMENTAL SET-UPS

4.1. AppTek English multi-domain dataset

We conduct experiments on a large English multi-domain dataset provided by AppTek. The LM training data consists of 33 subsets with domains including news, movie subtitles (entertainment), user generated content, and sport, and comprises 10.2 B words in total with a vocabulary size of 250 K words. Our domain labels are movies, news, social media, user generated content (UGC), and voice messages (MSG). These target domains are defined by the dev and eval datasets.

We first check which subsets of the training dataset are relevant to our target domains. We train 4-gram Kneser-Ney LMs [26] on each subset of the training data. Then we linearly interpolate the models using the interpolation weights optimized on every domain-specific subset of the dev data. The resulting interpolation weights for each domain indicate the domain relevance of each training subset. Table 1 shows the eight most relevant subsets.

Table 1. Interpolation weights on the dev sets (scaled by a factor of 100) for 4-gram LMs. We removed values smaller than 10^-2. We show the 8 most relevant subsets out of 33. # Running words are given in millions.

Train subset | # Run. words | All  | Movies | News | Social | UGC  | MSG
news-01      | 93           | 2.0  | -      | 10.8 | 0.2    | 0.1  | -
ent-01       | 94           | 5.2  | 3.7    | 3.1  | 13.7   | 13.5 | 6.9
ent-02       | 174          | 7.3  | 12.7   | 1.2  | 2.1    | 3.7  | 11.4
news-02      | 18           | 2.7  | -      | 6.2  | 1.9    | 2.0  | 0.4
news-03      | 2,960        | 3.7  | 1.0    | 6.3  | 0.9    | 4.6  | 3.0
ent-03       | 651          | 15.9 | 23.3   | 3.1  | 21.5   | 20.0 | 14.8
ent-04       | 469          | 22.8 | 48.1   | 1.1  | 28.2   | 27.0 | 20.7
news-04      | 730          | 27.6 | 4.2    | 56.7 | 4.4    | 12.4 | 9.9

Based on this analysis, we assign news-04 as the news expert News, because it has the highest weight on the news domain, and ent-04 as the movies expert Movie, in total 1.2 B words². We pre-train separate models on each of the two datasets³ as experts for the corresponding domain. We use an interpolation of these models as the teacher for distillation.

²We could also merge news-01 and -04, or ent-02, -03, and -04, to train the corresponding experts, if we had more computational resources.
³In our preliminary experiments with LSTMs, we found that the interpolation of models trained on subsets outperforms a single model trained on the whole dataset. We could potentially obtain better expert models by first training a single background model on the whole data and then fine-tuning that model on the domain subsets separately, as has been done in [9]. However, in practice, pre-training a single model on the whole data would require the model to be very large; distributed training of "reasonable size" models separately on different subsets is more convenient and potentially scales better.

4.2. Model architectures

We pre-train both LSTM and Transformer based teacher models. Deep Transformer models have recently shown good performance on a variety of LM datasets, outperforming LSTM models. However, Transformer LMs require more memory for evaluation due to the self-attention. For distillation, a student LSTM model can benefit from Transformer teachers, which would potentially allow us to make use of the good performance of Transformer models while obtaining more memory efficient LSTM models for evaluation.

Our LSTM language models use two LSTM layers with 2048 hidden units each and a 128-dimensional input embedding, which amounts to 600 M parameters given the vocabulary of size 250 K. For the Transformer models, we use a 128-dimensional input embedding, 32 layers, 2048 feed-forward dimension, 768 residual dimension (tied with the key/query and value dimensions), and 16 attention heads, which amounts to 400 M parameters. Following [7], no positional encoding is used.

We use the frequency sorted log-uniform distribution to sample 8192 and 1024 negative samples for sampled softmax [25] and NCE [16], respectively. We share the noise samples within the same batch. For NCE, we set the constant normalization term to one and initialize the bias of the softmax layer to −log(V), following [27]⁴, which makes the model initially self-normalized. We found this to be crucial for training models with the NCE loss that match the performance of models trained with the full softmax. All models are trained on a single NVIDIA Tesla V100 GPU (16 GB) at the RWTH IT Center.

⁴We scaled this value by 1.5. We found this to improve model performance and convergence speed in our experiments on this dataset.
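The self-normalizing initialization described above could be written as follows; this is a sketch assuming a plain linear softmax layer, with the −log(V) bias scaled by 1.5 as described in the footnote above. Variable names are illustrative.

```python
import math
import torch.nn as nn

VOCAB_SIZE = 250_000   # 250 K word vocabulary
BIAS_SCALE = 1.5       # scaling of -log(V) that the authors found helpful

output_layer = nn.Linear(2048, VOCAB_SIZE)   # softmax layer of the LM
nn.init.constant_(output_layer.bias, -BIAS_SCALE * math.log(VOCAB_SIZE))
# With this bias, the model starts out approximately self-normalized, and NCE
# training with a unit normalization constant keeps exp(logit) close to a
# proper probability, so the full softmax denominator can be skipped at test time.
```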
5. TEXT BASED EXPERIMENTS

5.1. Results for sampled softmax based distillation

The results for the sampled softmax case are shown in Table 2. All models are trained for 6 epochs until convergence. Our distillation loss scale is set to 0.5, for which we achieved the best results. The top part of the table compares the expert LSTM models with the 4-gram models trained only on the corresponding subset.

The middle part of Table 2 shows the results for distillation with the static teacher. The teacher is obtained by interpolation between the News and Movie LSTM models, using interpolation weights estimated on the whole dev set. We note that this teacher does not outperform the individual expert models on their expert domain. By distillation, we obtain a single student model with better perplexities than the teacher ensemble: on news, a 7.4% relative improvement is obtained.

The last part of Table 2 shows the results for distillation experiments using the domain conditional interpolation of expert models as the teacher. The resulting student model gives comparable performance to the previous case with the static teacher. The improvement from domain conditional weights does not seem to carry over to the student performance⁵.

⁵We note that the word frequency used in the sampler is computed on the whole training set. Using domain-specific sampling distributions might lead to different conclusions.

Table 2. Development perplexities for the sampled softmax case. "Dom." indicates domain conditional weights for interpolating expert models to obtain the teacher.

Dom. | Model Role | Model Type | All   | Movie | News  | Social | UGC   | MSG
     | News       | 4-gram     | 155.5 | 186.7 | 103.0 | 158.9  | 174.4 | 187.2
     | News       | LSTM       | 100.0 | 123.0 | 65.7  | 103.1  | 96.5  | 131.5
     | Movie      | 4-gram     | 150.4 | 99.1  | 246.2 | 110.5  | 134.6 | 154.5
     | Movie      | LSTM       | 104.4 | 79.2  | 134.9 | 149.7  | 83.8  | 118.4
     | Teacher    | LSTM       | 78.7  | 79.4  | 69.0  | 75.3   | 74.7  | 95.4
     | Student    | LSTM       | 75.0  | 76.5  | 63.9  | 72.6   | 71.3  | 92.5
  ×  | Teacher    | LSTM       | 78.7  | 75.7  | 63.0  | 74.4   | 74.2  | 94.9
  ×  | Student    | LSTM       | 75.2  | 77.5  | 62.3  | 73.9   | 71.7  | 94.3

5.2. Results for NCE based distillation

The (normalized) perplexities for the NCE experiments are shown in Table 3. While we obtain slightly better perplexities compared with the sampled softmax variants (Sec. 5.1), the overall observations are similar: the student model outperforms the expert models, but the domain conditional distillation approach does not give an extra gain in performance.

Table 3. Development perplexities for LSTM models trained with NCE. Again, "Dom." indicates domain conditional weights for interpolating expert models to obtain the teacher.

Dom. | Model Role | All   | Movie | News  | Social | UGC  | MSG
     | News       | 100.7 | 126.5 | 65.3  | 103.9  | 96.9 | 131.6
     | Movie      | 103.7 | 77.6  | 149.0 | 82.5   | 80.4 | 117.5
     | Teacher    | 77.1  | 77.5  | 68.0  | 73.8   | 72.1 | 94.0
     | Student    | 75.0  | 76.2  | 65.0  | 72.5   | 70.4 | 91.6
  ×  | Teacher    | 77.1  | 74.0  | 62.1  | 72.6   | 71.6 | 93.7
  ×  | Student    | 75.1  | 76.6  | 63.3  | 72.7   | 71.4 | 93.7

We will therefore use the student trained with the static teacher (from Table 3) for the ASR experiments later. The variance of the log normalization term for this model is 0.023 and the mean value is -0.034, which is acceptably self-normalized⁶.

⁶Following [28], we can also correct the logits by subtracting the mean. For unnormalized LM scores we then get 75.0 on dev and 92.1 on eval with correction, compared with 77.6 and 95.3, respectively, without correction.

Can we reduce the student model size? In addition, we investigate the possibility of reducing the student model size. Table 4 shows the results for student models with LSTM size 1024 and 512 instead of 2048, as well as for a model with a bottleneck layer before the softmax [29]. The bottleneck approach works best in our experiments, achieving a compression rate of 5.7 while showing only up to 3.5% degradation compared to the full-size LSTM student.

Table 4. Development perplexities for small students in the NCE case. "hidden dim" denotes the LSTM dimension and "bn-dim" the dimension of an additional linear bottleneck layer before the softmax.

hidden dim | bn-dim | #Param. [M] | All  | Movie | News | Social | UGC  | MSG
2048       | -      | 600         | 75.0 | 76.2  | 65.0 | 72.5   | 70.4 | 91.6
2048       | 512    | 212         | 76.7 | 78.1  | 67.4 | 73.4   | 72.6 | 91.5
1024       | -      | 300         | 81.2 | 81.3  | 72.0 | 76.0   | 77.6 | 97.2
512        | -      | 163         | 90.5 | 88.0  | 86.9 | 83.1   | 86.6 | 102.6
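The bottleneck variant of Table 4 (hidden size 2048, bn-dim 512) replaces the direct 2048-to-250 K softmax projection by a 2048-to-512 linear layer followed by a 512-to-250 K softmax layer, which accounts for most of the reduction from 600 M to 212 M parameters. A sketch of this output block; the class and layer names are illustrative, not the authors' code.

```python
import torch.nn as nn

class BottleneckSoftmax(nn.Module):
    """Output block with a linear bottleneck before the softmax layer [29]."""

    def __init__(self, hidden_dim=2048, bn_dim=512, vocab_size=250_000):
        super().__init__()
        self.bottleneck = nn.Linear(hidden_dim, bn_dim, bias=False)
        self.output = nn.Linear(bn_dim, vocab_size)

    def forward(self, lstm_out):
        # Logits over the vocabulary; with NCE training these can be used
        # directly (unnormalized) at test time.
        return self.output(self.bottleneck(lstm_out))
```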
5.3. Transformer teachers for an LSTM student

Table 5 shows the results for distillation with Transformer teachers. The Transformer teacher ensemble gives a perplexity of 69.4, which is a 12% relative improvement over the LSTM teacher (Table 2). While the resulting LSTM student outperforms the student trained with LSTM teachers (Table 2), the improvement is only marginal.

Finally, we also use a Transformer as the student with sampled softmax distillation. Our observation is similar to the LSTM teacher case (Sec. 5.1): the student outperforms both the teacher and each domain expert. We obtain up to 7% relative improvement over the teacher model.

Table 5. Development perplexities for sampled softmax distillation using Transformer teachers. Again, "Dom." indicates domain conditional weights for interpolating expert models to obtain the teacher.

Dom. | Model Role | Model Type | All  | Movie | News  | Social | UGC  | MSG
     | News       | Trafo      | 91.1 | 118.8 | 55.3  | 92.1   | 86.0 | 124.7
     | Movie      | Trafo      | 95.2 | 74.2  | 131.6 | 74.5   | 73.2 | 106.0
     | Teacher    | Trafo      | 69.4 | 73.1  | 56.8  | 65.8   | 63.8 | 88.2
     | Student    | LSTM       | 73.6 | 75.5  | 62.9  | 69.0   | 67.0 | 90.7
     | Student    | Trafo      | 65.5 | 68.5  | 54.1  | 62.8   | 59.7 | 83.6
  ×  | Teacher    | Trafo      | 69.4 | 70.1  | 52.0  | 65.1   | 63.5 | 87.8
  ×  | Student    | LSTM       | 73.7 | 76.3  | 60.2  | 70.4   | 70.1 | 93.6

6. ASR EXPERIMENTS

We carry out ASR first-pass decoding [30] experiments using the obtained LSTM and Transformer student models. Our system is based on the hybrid approach [31]. The HMM state tying scheme was estimated following a phonetic decision tree approach [32], resulting in 5 K tied states. The acoustic model is a compact bi-directional neural network with four layers of 512 LSTM cells [2] per layer and per direction. We trained it on 80-dimensional MFCC features extracted from a very large collection of various recordings from the broadcast news and media as well as entertainment domains. We also use RETURNN for the acoustic model training.

The decoding is carried out with the RASR toolkit [33, 34]. The recognition lexicon contains 250 K words. In contrast to popular benchmark sets, such as Hub5 2000 or LibriSpeech, no manual segmentation is available, and we apply an automatic speech activity detection to break the long audio recordings into utterances. Each utterance is then processed in isolation.

Table 6 summarizes the results. We evaluate decoding with both normalized and unnormalized language model scores for the LSTM model trained with NCE. In both cases, we obtain improvements of up to 7-8% relative in WER over our 4-gram baseline LM trained on all text data. This confirms that the full softmax computation is not needed for the NCE-trained models. The system which uses the unnormalized scores runs 30% faster. Looking only at the time for the language model score computation shows a speedup of 40%.

In addition, we also evaluate the Transformer student trained using sampled softmax in Sec. 5.3. We obtain a 9.5% relative improvement in WER over the 4-gram baseline.

Table 6. WERs (%) for first-pass recognition experiments. "Normalized" refers to the use of the full softmax for evaluation.

LM          | Train data | Normalized | Dev PPL | Dev WER | Eval PPL | Eval WER
4-gram      | 10.2B      | Yes        | 108.7   | 19.0    | 119.7    | 21.8
NCE-LSTM    | 1.2B       | Yes        | 75.0    | 17.5    | 91.8     | 20.5
NCE-LSTM    | 1.2B       | No         | 77.6    | 17.6    | 95.3     | 20.5
Transformer | 1.2B       | Yes        | 65.5    | 17.2    | 81.9     | 20.2
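The only difference between the two NCE-LSTM rows of Table 6 is whether the LM score is the full log-softmax or simply the raw logit of the self-normalized model. A sketch of that distinction; the function name and interface are illustrative, not the RASR/RETURNN implementation.

```python
import torch.nn.functional as F

def lm_score(logits, word_idx, normalized=True):
    """Language model score of one word given the logits over the vocabulary.

    normalized=True : full softmax, requires summing over all 250 K logits.
    normalized=False: self-normalized (NCE-trained) model, the raw logit is
                      used directly as log p(w|h), which gave roughly 30%
                      faster first-pass decoding in the experiments above.
    """
    if normalized:
        return F.log_softmax(logits, dim=-1)[..., word_idx]
    return logits[..., word_idx]
```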
7. CONCLUSION

We presented a training method for LSTM language models on a difficult large-scale multi-domain task, which successfully combines knowledge distillation from pre-trained domain LMs with the NCE loss. We compressed the large ensemble of domain experts into a single, compact model, while maintaining perplexities similar to those of the teacher and remaining self-normalized. We achieved up to 8-9% improvement in WER over a strong 4-gram model trained on much more data. In future work, we will explore how to extend this method to new target domains or additional training data, aiming for an incremental lifelong learning algorithm.

8. ACKNOWLEDGEMENT

We thank Tobias Menne for help with the baseline ASR system, and Ralf Schlüter and Volker Steinbiss for helpful comments on the paper. This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694537, project SEQCLAS). The work reflects only the authors' views, and none of the funding parties is responsible for any use that may be made of the information it contains. Experiments were partially performed with computing resources granted by RWTH Aachen under project nova0003.

9. REFERENCES

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin, "A neural probabilistic language model," The Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
[2] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[3] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in Proc. Interspeech, Makuhari, Japan, Sept. 2010, pp. 1045–1048.
[4] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, "LSTM neural networks for language modeling," in Proc. Interspeech, Portland, OR, USA, Sept. 2012, pp. 194–197.
[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Proc. Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, Dec. 2017, pp. 5998–6008.
[6] Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde, "Jasper: An end-to-end convolutional neural acoustic model," in Proc. Interspeech, 2019, pp. 71–75.
[7] Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney, "Language modeling with deep Transformers," in Proc. Interspeech, Graz, Austria, Sept. 2019, pp. 3905–3909.
[8] Anirudh Raju, Denis Filimonov, Gautam Tiwari, Guitang Lan, and Ariya Rastrow, "Scalable multi corpora neural language models for ASR," in Proc. Interspeech, 2019, pp. 3910–3914.
[9] Kazuki Irie, Shankar Kumar, Michael Nirschl, and Hank Liao, "RADMM: Recurrent adaptive mixture model with applications to domain robust language modeling," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, Apr. 2018, pp. 6079–6083.
[10] Michael Hentschel, Marc Delcroix, Atsunori Ogawa, Tomoharu Iwata, and Tomohiro Nakatani, "A unified framework for feature-based domain adaptation of neural network language models," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 7250–7254.
[11] Cyril Allauzen and Michael Riley, "Bayesian language model interpolation for mobile speech input," in Proc. Interspeech, Florence, Italy, Aug. 2011, pp. 1429–1432.
[12] Ernest Pusateri, Christophe Van Gysel, Rami Botros, Sameer Badaskar, Mirko Hannemann, Youssef Oualil, and Ilya Oparin, "Connecting and comparing language model interpolation techniques," in Proc. Interspeech, 2019, pp. 3500–3504.
[13] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," in NIPS Deep Learning and Representation Learning Workshop, Montreal, Canada, Dec. 2014.
[14] Jimmy Ba and Rich Caruana, "Do deep nets really need to be deep?," in Proc. Advances in Neural Information Processing Systems (NIPS), Quebec, Canada, Dec. 2014, vol. 27, pp. 2654–2662.
[15] Cristian Buciluă, Rich Caruana, and Alexandru Niculescu-Mizil, "Model compression," in Proc. ACM SIGKDD Int. Conf. on Knowledge Disc. and Data Mining, Philadelphia, PA, USA, Aug. 2006, pp. 535–541.
[16] Michael Gutmann and Aapo Hyvärinen, "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models," in Proc. Int. Conf. on AI and Statistics, 2010, pp. 297–304.
[17] Andriy Mnih and Yee Whye Teh, "A fast and simple algorithm for training neural probabilistic language models," in Proc. Int. Conf. on Machine Learning (ICML), Edinburgh, Scotland, 2012, pp. 419–426.
[18] Zhuang Ma and Michael Collins, "Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency," in Proc. Conf. on Empirical Methods in Nat. Lang. Processing (EMNLP), Brussels, Belgium, Oct.–Nov. 2018, pp. 3698–3707.
[19] Xie Chen, Xunying Liu, Mark J. F. Gales, and Philip C. Woodland, "Recurrent neural network language model training with noise contrastive estimation for speech recognition," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015, pp. 5411–5415.
[20] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, et al., "TensorFlow: A system for large-scale machine learning," in Proc. USENIX Symp. on Operating Systems Design and Impl. (OSDI 16), Savannah, GA, USA, Nov. 2016, pp. 265–283.
[21] Albert Zeyer, Tamer Alkhouli, and Hermann Ney, "RETURNN as a generic flexible neural toolkit with application to translation and speech recognition," in Proc. Assoc. for Computational Linguistics (ACL), Melbourne, Australia, July 2018.
[22] Zhao You, Dan Su, and Dong Yu, "Teach an all-rounder with experts in different domains," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 6425–6429.
[23] Kazuki Irie, Zhihong Lei, Ralf Schlüter, and Hermann Ney, "Prediction of LSTM-RNN full context states as a subtask for N-gram feedforward language models," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, Apr. 2018, pp. 6104–6108.
[24] Jesús Andrés-Ferrer, Nathan Bodenstab, and Paul Vozila, "Efficient language model adaptation with noise contrastive estimation and Kullback-Leibler regularization," in Proc. Interspeech, Hyderabad, India, Sept. 2018, pp. 3368–3372.
[25] Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio, "On using very large target vocabulary for neural machine translation," in Proc. ACL, Beijing, China, July 2015, pp. 1–10.
[26] Reinhard Kneser and Hermann Ney, "Improved backing-off for m-gram language modeling," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Detroit, MI, USA, May 1995, pp. 181–184.
[27] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul, "Fast and robust neural network joint models for statistical machine translation," in Proc. Assoc. for Computational Linguistics (ACL), Baltimore, MD, USA, June 2014, pp. 1370–1380.
[28] Jacob Goldberger and Oren Melamud, "Self-normalization properties of language modeling," in Proc. Assoc. for Computational Linguistics (ACL), Santa Fe, USA, Aug. 2018, pp. 764–773.
[29] Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May 2013, pp. 6655–6659.
[30] Eugen Beck, Wei Zhou, Ralf Schlüter, and Hermann Ney, "LSTM language models for LVCSR in first-pass decoding and lattice-rescoring," arXiv preprint arXiv:1907.01030, July 2019.
[31] Hervé Bourlard and Christian J. Wellekens, "Links between Markov models and multilayer perceptrons," in Advances in Neural Information Processing Systems I, D. S. Touretzky, Ed., pp. 502–510. Morgan Kaufmann, San Mateo, CA, USA, 1989.
[32] Steve Young, Julian Odell, and Philip C. Woodland, "Tree-based state tying for high accuracy acoustic modelling," in Proc. Workshop on Human Language Technology, Plainsboro, NJ, USA, Mar. 1994, pp. 307–312.
[33] David Rybach, Stefan Hahn, Patrick Lehnen, David Nolden, Martin Sundermeyer, Zoltán Tüske, Simon Wiesler, Ralf Schlüter, and Hermann Ney, "RASR - the RWTH Aachen University open source speech recognition toolkit," in Proc. IEEE Automatic Speech Recog. and Understanding Workshop (ASRU), Honolulu, HI, USA, Dec. 2011.
[34] Simon Wiesler, Alexander Richard, Pavel Golik, Ralf Schlüter, and Hermann Ney, "RASR/NN: The RWTH neural network toolkit for speech recognition," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 2014, pp. 3313–3317.