Automated Source Code Generation and Auto-completion Using Deep Learning: Comparing and Discussing Current Language-Model-Related Approaches
Juan Cruz-Benito1*, Sanjay Vishwakarma2†, Francisco Martin-Fernandez1, Ismael Faro1
1 IBM Quantum. IBM T.J. Watson Research Center. Yorktown Heights, NY 10598, USA
2 Electrical and Computer Engineering. Carnegie Mellon University. Mountain View, USA
juan.cruz@ibm.com, svishwak@andrew.cmu.edu, paco@ibm.com, ismael.faro1@ibm.com
* Contact author. † Intern at IBM Quantum at the time of writing this paper.

arXiv:2009.07740v4 [cs.CL] 12 Jan 2021

Abstract

In recent years, the use of deep learning in language models gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the Machine Learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the Deep-Learning-enabled language models approach, we detected a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like AWD-LSTMs, AWD-QRNNs, and Transformer, while using transfer learning and different tokenizations, to see how they behave in building language models using a Python dataset for code generation and filling mask tasks. Considering the results, we discuss each approach's different strengths and weaknesses and what gaps we find to evaluate the language models or apply them in a real programming context.

Introduction

We are digitally surrounded by computational Language Models (LMs) that guide us while writing to reduce the user effort, suggest different options for words/sentences to enhance our style, or fix our grammatical/correctness errors accurately (Kannan et al. 2016; Bryant and Briscoe 2018; Ghosh and Kristensson 2017). Many of the keys we press while writing on a keyboard act as part of the inputs to compose new datasets for those models that shape how we communicate with others. Nevertheless, does it happen in the same way when we write code? Succinctly, yes. According to some recent surveys found in the literature (Allamanis et al. 2018; Chen, Le, and Babar 2020), the Natural Language Processing (NLP) subfield related to programming languages includes examples of LMs used in several tasks and contexts. For example, the authors of (Nguyen and Nguyen 2015; Bielik, Raychev, and Vechev 2016; Cruz-Benito et al. 2018) used different techniques such as graph-based statistical LMs, probabilistic LMs, or Deep Learning (DL) LMs to suggest code to programmers, similarly to auto-completer features in IDEs. LMs were used to generate automated source code based on sample code inputs or pseudo-code and to evaluate how this generated code performs (Oda et al. 2015; Tiwang, Oladunni, and Xu 2019; Fedus, Goodfellow, and Dai 2018). Another exciting application of NLP to source code languages is the automatic translation between different languages. The work reported in (Nguyen, Nguyen, and Nguyen 2013; Roziere et al. 2020) explores different supervised and unsupervised approaches to migrate code between different programming languages to improve interoperability or to port codebases written in obsolete or deprecated languages (such as COBOL or Python 2). Another example found is the use of Bayesian networks, attention mechanisms, and pointer networks (Proksch, Lerch, and Mezini 2015; Li et al. 2017; Donahue, Lee, and Liang 2020) to fill a given code portion with missing code.

There is a more general understanding of the natural languages' different characteristics in the broad NLP field. Since there exist many research fields related to human languages, there is a richer background on existing language characteristics. For example, there is much knowledge on aspects like the minimal representation units of a word in a specific language, the most used words of a language, or whether a word is a neologism or not. Programming languages share some syntax similarities with spoken languages. However, they do not have the same restrictions in the sense of common words or neologisms (Allamanis et al. 2015; Karampatsis and Sutton 2019), or other syntax restrictions and features such as punctuation, format, or style. Every programming language indeed has reserved words and symbols to denote different actions, resources, or syntax. However, there is an essential part of the source code that is only limited by the programmer's imagination, the existing conventions, or the guides for good practices. As (Karampatsis and Sutton 2019) claims,

[...] traditional language models limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance [...]
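As a concrete illustration of the sub-word alternative discussed throughout this paper, the snippet below trains a small tokenizer with Google SentencePiece (one of the libraries used later in this work) and inspects how it splits a code identifier that a fixed word-level vocabulary could not cover. This is a minimal sketch, not the paper's pipeline: the corpus file name, vocabulary size, and the example output are illustrative assumptions.

```python
# Minimal sketch: sub-word tokenization of source code with SentencePiece.
import sentencepiece as spm

# Train a small unigram model on a plain-text file with one code sentence per line
# (the file name and vocabulary size are hypothetical).
spm.SentencePieceTrainer.Train(
    "--input=python_sentences.txt --model_prefix=code_sp "
    "--vocab_size=2000 --model_type=unigram"
)

sp = spm.SentencePieceProcessor()
sp.Load("code_sp.model")

# An identifier that would be out-of-vocabulary for a word-level LM
# is still representable as a sequence of sub-word pieces.
print(sp.EncodeAsPieces("from pandas import DataFrameEncoder"))
# e.g. ['▁from', '▁pandas', '▁import', '▁Data', 'Frame', 'En', 'coder']
```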
In that paper, Karampatsis and Sutton (2019) present how segmenting words into subword units can improve source code modeling. Similarly, other researchers (Ganin et al. 2016; Kim et al. 2016; Karpathy 2016) dug into representing source code vocabulary with a similar emphasis on modeling words using sub-word units and envisioning their importance when using neural networks (NNs). Nevertheless, how does that word segmentation affect the accuracy or the appropriateness of the code generated or auto-completed by some modern LM using deep learning approaches? That kind of question raises the main goal for this paper: to discover what kinds of associations between different modern neural network architectures and tokenization models produce the best results when creating LMs to generate and auto-complete source code.

To pursue that goal, this research aims to conduct experiments combining different deep neural network (DNN) architectures with different tokenization and pre-trained models over an existing Python dataset. Using that experimentation, we want to investigate the combinations that improve code generation and auto-completion tasks (for example, filling the blanks) while checking the outcomes from those tasks using metrics like accuracy and human assessment.

The rest of the paper is organized as follows: Section 2 presents the different approaches followed during the research, the DNNs used, and the software methods and data employed. Section 3 describes the results achieved during the research according to different metrics and tests, while Section 4 discusses these findings and the implications of the results as appropriate. Finally, Section 5 presents some conclusions.

Materials and methods

We have trained a set of deep neural networks using different architectures, tokenization techniques, and software libraries to develop this research work. In the following, we introduce the different materials and methods employed for that purpose.

Deep Neural Networks and tokenization models used

Regarding the DNN architectures employed, we chose to use the following ones: ASGD Weight-Dropped LSTM (AWD-LSTM) (Merity, Keskar, and Socher 2018b), Quasi Recurrent Neural Networks (QRNNs) (Bradbury et al. 2017), and Transformer (Vaswani et al. 2017). These DNN architectures have reportedly achieved state-of-the-art results in the NLP field recently (Merity, Keskar, and Socher 2018b; Wang, Gong, and Liu 2019; Gong et al. 2018; Takase, Suzuki, and Nagata 2018; Yang et al. 2018; Krause et al. 2018; Rae et al. 2019; Dai et al. 2019; Baevski and Auli 2018; Merity, Keskar, and Socher 2018a; Howard and Ruder 2018; Eisenschlos et al. 2019; Brown et al. 2020; Devlin et al. 2019), both in several groundbreaking digital products (https://openai.com/blog/openai-api/, https://blog.google/products/search/search-language-understanding-bert/) and on some of the best-known datasets like the Penn Tree Bank (Mikolov et al. 2011), WikiText-2 and WikiText-103 (Merity et al. 2017), the One-Billion Word benchmark (Chelba et al. 2014), or The Hutter Prize Wikipedia dataset (http://prize.hutter1.net/).

The AWD-LSTM is a variation of the famous Long Short-Term Memory (LSTM) architecture (Hochreiter and Schmidhuber 1997). The LSTM is a type of Recurrent Neural Network (RNN) especially capable of processing and predicting sequences. That ability with sequences is the reason why LSTMs have been employed largely in LMs (Kim et al. 2016). The AWD-LSTM includes several optimizations compared to the regular LSTM. Two of the most important ones are the use of Average Stochastic Gradient Descent (ASGD) and WeightDropout. The ASGD is used as the NN's optimizer to consider the previous weights (not only the current ones) during each training iteration. The WeightDropout introduces the dropout technique (Srivastava et al. 2014) to avoid overfitting, but with the characteristic of returning zero not for a subset of activations between layers, as in traditional dropout, but for a random subset of weights.

The QRNN is another type of RNN that includes alternate convolutional and pooling layers in the architecture. This design makes the QRNN able to capture long-term sequences better and train much faster, since the convolutional layers allow the computation of intermediate representations from the input in parallel. They can be up to 16 times faster at train and test time than LSTMs while having better predictive accuracy than stacked LSTMs of the same hidden size. We use a modified QRNN (AWD-QRNN) that includes the same ASGD and WeightDropout modifications to improve its stability and optimize its capabilities, as in the AWD-LSTM.
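As a toy illustration of the WeightDropout idea just described, the following sketch applies standard dropout to the hidden-to-hidden weight matrix of a simple recurrent cell, so a random subset of the recurrent weights (rather than the activations) returns zero on each forward pass. It is a simplified example for illustration only, not the fastai/AWD-LSTM implementation; the cell design, sizes, and dropout probability are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropRNNCell(nn.Module):
    """Toy recurrent cell: dropout is applied to the recurrent weight
    matrix itself, zeroing a random subset of weights per forward pass."""
    def __init__(self, input_size, hidden_size, weight_p=0.5):
        super().__init__()
        self.w_ih = nn.Parameter(torch.randn(hidden_size, input_size) * 0.1)
        self.w_hh = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)
        self.weight_p = weight_p

    def forward(self, x, h):
        # Drop (and rescale) a random subset of hidden-to-hidden weights,
        # instead of dropping activations between layers.
        w_hh = F.dropout(self.w_hh, p=self.weight_p, training=self.training)
        return torch.tanh(x @ self.w_ih.T + h @ w_hh.T)

# Usage: one step over a batch of 4 examples with 8 input features.
cell = WeightDropRNNCell(input_size=8, hidden_size=16)
h = torch.zeros(4, 16)
h = cell(torch.randn(4, 8), h)
```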
We utilize the AWD-LSTM and AWD-QRNN to produce LMs capable of solving the task of generating source code based on input, as in the literature (Merity, Keskar, and Socher 2018a,b; Wang, Gong, and Liu 2019; Gong et al. 2018; Takase, Suzuki, and Nagata 2018; Yang et al. 2018; Krause et al. 2018; Howard and Ruder 2018; Eisenschlos et al. 2019).

The Transformer is probably the most popular current DNN architecture in NLP due to its performance and recent state-of-the-art results in many tasks. It is an encoder-decoder architecture in which each layer uses attention mechanisms. This use of (self-)attention mechanisms makes the Transformer able to model the relationships between all words in a sentence regardless of their respective positions. That implies a significant improvement over RNNs, enabling much more parallelization of data processing and unblocking the training over more massive datasets. The excellent results of the Transformer architecture empowered the NLP community to create new state-of-the-art transformer-based models (Young et al. 2018) like those used in the current research: GPT-2 (Radford et al. 2019), BERT (Devlin et al. 2019), and RoBERTa (Liu et al. 2019).

We choose to use GPT-2 since it is a causal transformer (unidirectional) that can predict the next token in a sequence. So, it can generate source code based on input, allowing us to compare the results with the AWD-LSTM and AWD-QRNN experiments.
Regarding BERT and RoBERTa, we used them to study how a masked modeling approach can auto-complete source code. In that case, we did not use them for text generation, as in the other experiments, since BERT and RoBERTa are not designed for text generation. However, they can generate text (more diverse, but slightly worse in quality) (Wang and Cho 2019).

Considering the tokenization techniques, for every AWD-LSTM and AWD-QRNN we chose the following types of tokens: word, unigram, char, and byte-pair encoding (BPE) (Sennrich, Haddow, and Birch 2016), albeit some studies show that BPE is suboptimal for pre-training (Bostrom and Durrett 2020). For the Transformer models, we used the defaults from the pre-defined models: the WordPiece method (Schuster and Nakajima 2012) for BERT, and BPE over raw bytes instead of Unicode characters for GPT-2 and RoBERTa. The different techniques were selected because they produce different token granularities that can enrich our experimentation: full words, sub-words of specific sizes, character-sized tokens, or byte pairs. Also, they enable us to compare the tokenization between the different types of models and tasks to solve.

Experimentation details

The dataset used for the experimentation is the Python dataset included in the "GitHub CodeSearchNet Challenge dataset" (Husain et al. 2019). It includes 2 million (comment, Python code) pairs from open-source libraries. As observed during the dataset preparation for our experiments, there are about 11 million Python code sentences. The reason to choose this dataset is that it has already been used in previous research related to NLP and source code. The full dataset includes several languages: Go, Java, JavaScript, PHP, Python, and Ruby. We chose to use only the Python part of the dataset because it enables comparison with the existing literature, which uses the Python language more than other programming languages.

The software libraries and packages used primarily during the research were the following: FastAI (Howard and Gugger 2020), Google SentencePiece (Kudo and Richardson 2018), and HuggingFace's Transformers (Wolf et al. 2020). The preprocessing applied to the dataset included removing most of the code comments and autoformatting the code according to the PEP-8 Python style guide using the autopep8 package (https://pypi.org/project/autopep8/). Regarding the AWD-LSTM networks, we have been using the FastAI-provided base models pre-trained using the Wikitext-103 dataset (Merity et al. 2017). There are no default pre-trained models for FastAI's AWD-QRNN version of those networks, so we trained them from scratch. As we introduced, regarding the Transformer architectures, we have been using three standard pre-trained models as a basis: GPT-2, BERT, and RoBERTa. In each case, the exact pre-trained models used were gpt2, bert-base-cased, and roberta-base. These pre-trained models are available from HuggingFace's models hub (https://huggingface.co/models).

As the reader could infer from the previous explanations about using pre-trained versions, we followed a transfer learning approach similar to other researchers in the existing literature (Howard and Ruder 2018; Ruder et al. 2019; Chronopoulou, Baziotis, and Potamianos 2019; Eisenschlos et al. 2019). We employed the models pre-trained on English texts to later fine-tune them for the selected tasks using the GitHub CodeSearchNet dataset. The deep-neural-networks-related source code was written using the FastAI library (versions 1.0.61 and 2 dev 0.0.21). To apply the different tokenization techniques to the AWD-LSTMs and AWD-QRNNs, we replaced the default Spacy tokenizer (Honnibal and Montani 2017) with Google SentencePiece (Kudo and Richardson 2018), following a similar approach to (Czapla, Howard, and Kardas 2018). In the development of the Transformer architectures, to see how they perform at filling the blanks and generating texts, we used HuggingFace's Transformers library combined with FastAI v2 (following FastAI's example at https://docs.fast.ai/tutorial.transformers), as included in the code repository that supports this paper.

To train the neural networks, we used some techniques worth mentioning (all the details are in the code repository). To find the most appropriate learning rate automatically, we used the function lr_find provided by FastAI, following the proposal of (Smith 2017). This function trains the DNN over the dataset for a few iterations while varying the learning rate from very low to very high values at the beginning of each mini-batch of data, to find which one is optimal regarding the error (loss) metrics, until the DNN diverges. To pursue a faster convergence, we schedule the learning rate as described in (Smith and Topin 2019) using the one-cycle policy (fit_one_cycle) in FastAI. Considering the transfer learning technique used, we trained the first "one cycle" on top of the existing pre-trained model, to later unfreeze all the model layers and do a more extended training (10-30 epochs) to improve the results. Regarding other training details, in general, we used the default parameters from FastAI, except for a fixed multiplier to control all the dropouts (drop_mult) in AWD-LSTMs and AWD-QRNNs, set to 0.3 because of some heuristics discovered during the testing for this research. Also, we decided to train similar architectures using a fixed number of epochs to make the models comparable. For the AWD-LSTM and AWD-QRNN, we used 30 epochs for fine-tuning because we found during the experimentation that the most remarkable improvement for every model produced occurs during that range of iterations. Similarly, we fine-tuned the transformers for ten epochs since we did not find a significant improvement after that. For more information about the training setup and software details, please refer to the repository that supports this paper and the FastAI documentation.

Finally, the hardware used to run the different software and neural network training was a computer running Ubuntu Linux 18.04 LTS Bionic Beaver (64 bits). It has two Nvidia Tesla V100 GPUs with 16 gigabytes of RAM each (Nvidia CUDA version 10.1), a CPU with 16 cores (Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz), 120 gigabytes of RAM, and 120 gigabytes for the primary disk (HDD).
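The following sketch outlines the training recipe just described (lr_find, one fit_one_cycle epoch on the frozen pre-trained model, then unfreeze and continue) using the fastai v2 API. It is a minimal, illustrative reconstruction rather than the paper's exact code: the DataFrame, column name, and learning-rate values are assumptions, and the AWD-QRNN and Transformer variants would be configured analogously.

```python
from fastai.text.all import *
import pandas as pd

# Hypothetical DataFrame with one Python code sentence per row.
df = pd.read_csv("python_sentences.csv")

# Language-model DataLoaders; fastai tokenizes and numericalizes the text.
dls = TextDataLoaders.from_df(df, text_col="code", is_lm=True, valid_pct=0.1)

# AWD-LSTM language model, pre-trained on WikiText-103, with the paper's
# dropout multiplier of 0.3.
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3,
                               metrics=[accuracy, Perplexity()])

learn.lr_find()                 # suggest a learning rate (Smith 2017)
learn.fit_one_cycle(1, 3e-3)    # one cycle on the frozen, pre-trained model
learn.unfreeze()
learn.fit_one_cycle(30, 3e-4)   # extended fine-tuning (30 epochs for AWD-LSTM/QRNN)
```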
All the supporting materials and software details related to this paper are publicly available in a GitHub repository (Cruz-Benito and Vishwakarma 2020a). The NN models produced are under the Zenodo record (Cruz-Benito and Vishwakarma 2020b).

Results

This section presents the results achieved after the full training of the selected DNN architectures with the different tokenization models. As outlined in the previous section, we trained the AWD-LSTM and AWD-QRNN DNN architectures using different tokenization models (word, unigram, BPE, and char), and the Transformer using three different base models (GPT-2, BERT, and RoBERTa). We trained every AWD-LSTM and AWD-QRNN using one epoch to fit the model's head and fine-tuned for 30 epochs. Meanwhile, the Transformer networks were trained equally for one epoch to fit the head and fine-tuned for ten epochs.

We followed a two-way strategy to evaluate the NNs trained: using the NN training metrics and human evaluation of the models' output. The metrics used are some of the most common in the literature: accuracy for the validation set and loss for the training and validation sets. They help the researchers understand how the NN acts over time, how the model is fitted to the dataset, and the performance and error scores while using the training and validation datasets. In this case, the accuracy is the score concerning the LM's ability to accurately predict the next word or fill in the missing tokens given a sequence of words from the validation set. The loss metrics report the error after applying the DNN to the training or validation set, respectively. Every implementation detail related to the DNNs and the metrics is available in the GitHub repository (Cruz-Benito and Vishwakarma 2020a). Apart from those metrics, we assessed the models' quality by applying them in the proposed tasks (generate text and auto-complete) and observing how they perform.

Training results

Table 1 displays the final metrics for the different NNs at the end of the training. Similarly, Figures 1 and 2 show the evolution of each model's accuracy during the training. Figures 3, 4, 5, and 6 show the training loss and validation loss evolution along the training epochs.

Figure 1: Evolution of the accuracy of neural networks devoted to source code generation during the training epochs.
Figure 2: Evolution of the accuracy of neural networks devoted to filling the blanks during the training epochs.

Table 1: Results after full training of each NN architecture.

DNN architecture  | Epochs | Accuracy | Train loss | Validation loss | Pre-trained?
AWD-LSTM word     | 31     | 0.494893 | 2.308937   | 2.341698        | Yes
AWD-LSTM unigram  | 31     | 0.557226 | 1.639875   | 1.826841        | Yes
AWD-LSTM BPE      | 31     | 0.580373 | 1.561393   | 1.703536        | Yes
AWD-LSTM char     | 31     | 0.779633 | 0.757956   | 0.742808        | Yes
AWD-QRNN word     | 31     | 0.515747 | 1.972508   | 2.144126        | No
AWD-QRNN unigram  | 31     | 0.539951 | 1.790150   | 1.894901        | No
AWD-QRNN BPE      | 31     | 0.538290 | 1.824709   | 1.896698        | No
AWD-QRNN char     | 31     | 0.736358 | 0.944526   | 0.897850        | No
GPT-2             | 11     | 0.743738 | 1.407818   | 1.268246        | Yes
BERT              | 11     | 0.999238 | 0.014755   | 0.004155        | Yes
RoBERTa           | 11     | 0.999468 | 0.010569   | 0.002920        | Yes

On the one hand, according to the results displayed in Table 1 and Figure 1, for the neural networks intended for automated source code generation (AWD-LSTM, AWD-QRNN, and Transformer GPT-2), the overall NN-tokenization combination that performed best in terms of accuracy was the AWD-LSTM with char tokenization (accuracy 0.779633). The second one was the GPT-2 transformer model (BPE over raw bytes tokenization, 0.743738), and the third one the AWD-QRNN with char tokenization (0.736358). Related to the combination of the AWD-LSTM and AWD-QRNN architectures with the other tokenization techniques, we obtained poor accuracy results: between 0.494893 and 0.580373. On the other hand, according to the results shown in Table 1 and Figure 2, both transformer models intended for auto-completion (BERT and RoBERTa) had excellent accuracy results: 0.999238 and 0.999468, respectively.

About how the pre-training and transfer learning affect the results, the two top results regarding accuracy come from models pre-trained on the English language (0.779633 and 0.743738), yet the third-best result was from a non-pre-trained network (0.736358). Comparing the similar networks, the average (mean) accuracy of the AWD-LSTM pre-trained versions is 0.603031 (standard deviation (std) of 0.123144), while the average accuracy of the AWD-QRNN non-pre-trained versions was 0.582587 (std of 0.103107). The only NN-tokenization combination that worked worse when it was pre-trained was the one with word tokenization.
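These aggregate figures follow directly from the accuracy column of Table 1; as a quick sanity check, the sketch below reproduces them, assuming NumPy and the sample standard deviation (ddof=1), which matches the reported values.

```python
import numpy as np

# Accuracy values taken from Table 1.
awd_lstm = np.array([0.494893, 0.557226, 0.580373, 0.779633])  # pre-trained
awd_qrnn = np.array([0.515747, 0.539951, 0.538290, 0.736358])  # non-pre-trained

print(awd_lstm.mean(), awd_lstm.std(ddof=1))  # ~0.603031, ~0.123144
print(awd_qrnn.mean(), awd_qrnn.std(ddof=1))  # ~0.582587, ~0.103107
```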
Figure 3: Evolution of the training loss of DNNs devoted to generating source code during the training epochs.
Figure 4: Evolution of the training loss of neural networks devoted to filling the blanks during the training epochs.
Figure 5: Evolution of the validation loss of DNNs devoted to generating source code during the training epochs.
Figure 6: Evolution of the validation loss of neural networks devoted to filling the blanks during the training epochs.

Regarding the observed losses, it is worth commenting that the AWD-LSTM char, AWD-QRNN char, and the three transformer models (GPT-2, BERT, RoBERTa) could be trained for more epochs or with a higher learning rate. The models may be underfitting since the training loss is higher than the validation loss (Table 1, Figures 3, 4, 5, and 6).

To put in context the accuracy achieved during the experimentation, we compare the results with the existing literature. The papers (Tiwang, Oladunni, and Xu 2019; Li et al. 2017; Raychev, Bielik, and Vechev 2016) present models trained to generate Python code considered state of the art when published. Our three best models trained to generate source code outperform the approaches based on the following architectures: Vanilla LSTM (accuracy: 67.3%) (Li et al. 2017), Attention LSTM (69.8%) (Li et al. 2017), Pointer Mixture Network (70.1%) (Li et al. 2017), or Probabilistic Model (69.2%) (Raychev, Bielik, and Vechev 2016). Our models perform worse than other approaches based on the abstract syntax tree (AST) instead of the textual code: AST-LSTM (accuracy 90.3%) (Tiwang, Oladunni, and Xu 2019) or AST-MLP (90.1%) (Tiwang, Oladunni, and Xu 2019). So, considering this, our results are good in terms of accuracy compared to the existing peer-reviewed literature. As a side note, we did not find models available publicly to compare with, neither in these nor in other peer-reviewed papers. For the models devoted to auto-completing code (BERT and RoBERTa), as of now, we did not find peer-reviewed papers trying to solve this task using Transformer architectures.

Results of using the trained DNN models

In the following, we introduce some examples of how the different neural networks generated source code or auto-completed it. These examples are used to assess the models' output quality, avoiding relying only on the NN metrics. This validation using more methods than the simple metrics is common in the AI and NLP fields. Many papers employ human evaluation or specialized metrics to better grasp the model's performance and capture any other particular detail or quality indicator for the related task proposed. In this case, we show only the models with the best accuracy results.
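Outputs like the ones listed next can be produced with calls along the following lines. This is a hedged sketch rather than the paper's exact code: learn stands for a fine-tuned fastai language model such as the one trained above, and the fill-mask pipeline is shown with the base roberta-base checkpoint because the fine-tuned checkpoints are published separately (Cruz-Benito and Vishwakarma 2020b).

```python
# Sampling from a fine-tuned fastai language model (assumed to exist as `learn`).
print(learn.predict("from pandas import", n_words=8))

# Auto-completion with a masked model via HuggingFace Transformers.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")
for candidate in fill("from pandas import <mask>"):
    print(candidate["sequence"], candidate["score"])
```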
To generate source code, we show output from AWD-LSTM char, AWD-QRNN char, and GPT-2. For auto-completion, we show results from BERT and RoBERTa. For the source code generation, every model faced the same challenges (Cruz-Benito and Vishwakarma 2020a). These are the results when each of them tries to generate five different sentences from the input "from pandas import":

AWD-LSTM char
"from pandas import defaults as _rp pandas ="
"from pandas import DataFrameEncoder if self"
"from pandas import OrderedDict else: passed"
"from pandas import sequence if type(seq1) i"
"from pandas import conditional_insertion as"

AWD-QRNN char
"from pandas import webbrowser self.login fr"
"from pandas import __http_args or [ 'Pipeli"
"from pandas import iterations is not None p"
"from pandas import service params to servic"
"from pandas import difference if not isinst"

GPT-2
"from pandas import time, np\n "
"from pandas import pandas from time.time.datetime import Date\n with n"
"from pandas import gtk, os\n from pandas_utils import pandas_utils\n import pylint"
"from pandas import wcpy\n import cpy_context as cpy_context\n "
"from pandas import gkpy\n """\n ... pass\n kwargs = cg"

To assess the generation, we do not focus on the semantics of the imports used or whether they are part of the Pandas library or not, but on the language's correctness. In general, from a Python language perspective, the outputs from GPT-2 are better. The outputs include line breaks, indentation, and fair use of multiple imports in one sentence (except in one of the outputs). The AWD-LSTM and AWD-QRNN failed to auto-generate an import sentence appropriately, or at least, they fail regarding the usual manner used by regular users. As a final comment on this, the other models trained failed on similar issues, plus they do not get enough semantic context related to the Pandas library.

Similarly, concerning the source code auto-completion, both BERT and RoBERTa tried to autofill the mask token in the sentence "from pandas import [MASK]". These are the results:

BERT
[{'sequence': '[CLS] from pandas import [SEP] [SEP]', 'score': 0.9969683289527893, 'token': 102, 'token_str': '[SEP]'},
 {'sequence': '[CLS] from pandas import [CLS] [SEP]', 'score': 0.0010887219104915857, 'token': 101, 'token_str': '[CLS]'},
 {'sequence': '[CLS] from pandas import. [SEP]', 'score': 0.0004200416151434183, 'token': 119, 'token_str': '.'},
 {'sequence': '[CLS] from pandas import ; [SEP]', 'score': 0.00027348980074748397, 'token': 132, 'token_str': ';'},
 {'sequence': '[CLS] from pandas import def [SEP]', 'score': 8.858884393703192e-05, 'token': 19353, 'token_str': 'def'}]

RoBERTa
[{'sequence': 'from pandas import\n', 'score': 0.6224209666252136, 'token': 50118, 'token_str': 'Ċ'},
 {'sequence': 'from pandas import.', 'score': 0.22222988307476044, 'token': 4, 'token_str': '.'},
 {'sequence': 'from pandas import ', 'score': 0.038354743272066116, 'token': 1437, 'token_str': 'Ġ'},
 {'sequence': 'from pandas import\n\n', 'score': 0.028566861525177956, 'token': 50140, 'token_str': 'ĊĊ'},
 {'sequence': 'from pandas import.', 'score': 0.021909384056925774, 'token': 479, 'token_str': 'Ġ.'}]

For auto-completion, we observe that almost no auto-completion is right. The more accurate sentences from the Python language point of view are the ones in which the mask token has been replaced by white space or by a dot. Nevertheless, they are not correct, just closer to being right compared to the other ones. One interesting thing is that BERT assigns a very high score to a predicted mask which is not correct, and very low scores to the other possible solutions (also incorrect). In the case of RoBERTa, it gives lower scores to all the solutions, yet it also fails on correctness: the second sentence predicted (score 0.222) can be closer to being right compared to the first one (score 0.622).

Discussion

Considering the results obtained, one could convincingly assert that the tokenization model used profoundly affects the results when generating automated source code. Although that may be accurate, we must discuss it carefully.

Discussing the outcomes from resulting models

First, our overall results are consistent with the existing literature (Karampatsis and Sutton 2019; Ganin et al. 2016; Kim et al. 2016; Karpathy 2016). Sub-word tokenization works better in the case of modeling source code, as Karampatsis and Sutton (2019) stated. Every result obtained is consistent in that sense. Even more, as the works from 2016 cited above envision, char tokenization should probably be the best option to try by default when dealing with LMs and source code. Furthermore, according to the results achieved, models such as GPT-2 (using a tokenization model based on BPE over raw bytes) can outperform LSTM/QRNN models like those we tested in grasping the internals of a programming language. As showcased during the results, even if GPT-2 was not the best model in terms of accuracy, it gave better code outputs than the other ones selected for the comparison.

As future work, it would be great to check whether the better textual output in the case of GPT-2 is because a) it is a much bigger and better pre-trained model (163,037,184 parameters against 36,491 for the AWD-LSTM and AWD-QRNN models), b) it is related to the dataset's size or quality, c) it is related to both causes, or d) it is related to other issues.
Continuing with the comments about the accuracy, one may note that the textual outputs generated by AWD-LSTM char, AWD-QRNN char, and GPT-2 could be polished to be more accurate. The final training loss is higher than the validation loss for the three selected DNN architectures, which can be a sign of underfitting. We find the same issue (training loss higher than validation loss) for the BERT and RoBERTa-based models. While the purpose of this paper is not to produce state-of-the-art results per se, we continued the training for over five more epochs to verify it. The improvement obtained from extending the training for the best approaches was residual in general, so we decided to report the results for 1+30 epochs in the AWD-LSTMs and AWD-QRNNs, and 1+10 epochs for the Transformers.

Examining the effect of pre-training and transfer learning

Regarding the pre-training and transfer learning, every pre-trained model (on English-based datasets) got better accuracy than its non-pre-trained counterpart, except for the word tokenization models. It seems to be strongly related to the statements we introduced at the beginning of this paper, citing (Allamanis et al. 2015; Karampatsis and Sutton 2019), about how source code does not have the same restrictions in the sense of common words or neologisms. In this sense, the conclusion comes to mind rapidly: if we consider source code words in "word" units, they probably will not fit in the fixed set of words used in a human language like English. So, the LM's knowledge acquired during the pre-training is not entirely valid when we get out of that fixed set of words that compose a language. Most words in the programming language are neologisms for the LMs pre-trained in English, and thus the LM needs to incorporate them and their relationships into the learned knowledge. For the sub-word units, the LM can be less sensitive to the neologisms. Potentially, the more divided a word is, the more robust the LM could be, since the set of bytes or chars is more straightforward than the chunks present in richer constructions or information units.

Going deeper into this research, concerning the pre-training effect on LMs modeling source code, it could be worth researching the relationship between the pre-training in different human-spoken languages and the LM's ability to work with specific existing programming languages.

Reviewing the textual assessment of resulting LMs

About the tests made generating source code or filling in the blanks using the trained LMs, we think that, in general, the textual results obtained are not so good, yet they are informative of how the LMs are working and how they can be improved. One of the things that can explain these results is the dataset used. In this case, we used a public dataset that other researchers can use to make results and experiments comparable and replicable. In the literature, we do not find a standard dataset for these tasks against which we can compare easily. Other papers (Tiwang, Oladunni, and Xu 2019; Li et al. 2017; Raychev, Bielik, and Vechev 2016) use custom datasets, but we find a lack of well-recognized code datasets in the literature. Comparing with other recent papers in the NLP field used as the basis for this research (Radford et al. 2019; Devlin et al. 2019; Liu et al. 2019; Brown et al. 2020), the dataset may be relatively small to train a big LM to appropriately accomplish challenging tasks like generating source code or auto-completing it. Future work may be testing these or new approaches on bigger datasets to train big LMs focused on modeling the Python language and checking whether the results are better. Recent examples of LMs, such as GPT-3 (Brown et al. 2020), claim to produce accurate textual outputs even in contexts in which they were not trained. Part of the explanation given for that ability is the use of gargantuan datasets combined with Transformer and other attention-based architectures. So, those approaches can also be relevant to contexts like ours. Another line for future research can be using datasets focused on specific libraries or Python aspects and verifying whether these approaches specialize the DNN models used in this paper positively for those contexts.

Related to evaluating the code generated or filled, we observed different approaches in the literature (Celikyilmaz, Clark, and Gao 2020; Roziere et al. 2020). In the context of LMs modeling source code, many papers and software libraries devoted to translating between programming languages typically evaluate text generation using methods and metrics like BLEU (Papineni et al. 2002) or variants like SacreBLEU (Post 2018). Other papers like (Tiwang, Oladunni, and Xu 2019) rely on the accuracy to assess an LM's performance based on deep learning. Some models can even solve different tasks that are part of existing benchmarks (Weston et al. 2015) or are evaluated by checking their perplexity (similarly to those that evaluate the model using the accuracy). The current tendency in large models is to evaluate them using human intervention to assess the output's quality (Radford et al. 2019; Brown et al. 2020). We assessed the models using accuracy during our experiments and evaluated the models' textual outputs based on our prior human knowledge. It would be interesting for the future to plan new evaluation processes involving larger cohorts of source code experts to evaluate the models, as (Radford et al. 2019; Brown et al. 2020) do. One of the potential new assessments can be usability tests conducted with programmers. They can compare the code they would write against the code proposed by any of the DNNs presented here and the results from other common code auto-completion tools included in integrated development environments. As we outlined in the results section, relying only on metrics like accuracy should not be enough. As in our case, accuracy and the other metrics can be a good indicator of the model's performance, yet we need to verify the LMs' behavior and quality using complementary methods like specialized metrics or human evaluation. For tasks like auto-completion or source code generation, there are no existing specialized metrics (like BLEU in translation, for example), so one of the future research lines is improving the evaluation of LMs for source code. Based on some existing ideas in broad NLP, there are many opportunities to explore in that sense: from new test suites for language models used in source code contexts (Roziere et al. 2020) to behavioral testing (Ribeiro et al. 2020) or human-centric evaluation of the models (Celikyilmaz, Clark, and Gao 2020), with particular emphasis on reproducible and unbiased assessments, or combinations of automatic testing and human-centric assessments.
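As an illustration of the kind of automatic scoring mentioned above, the snippet below computes a BLEU score for a generated snippet against a reference with SacreBLEU (Post 2018). It is a hedged sketch: the hypothesis and reference strings are invented, and, as discussed, BLEU is a translation metric rather than a specialized metric for source code.

```python
import sacrebleu

# Hypothetical generated snippet and a single human-written reference.
hypotheses = ["from pandas import DataFrame, Series"]
references = [["from pandas import DataFrame"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU on a 0-100 scale
```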
Conclusions

This paper compares how different approaches to tokenization models, deep neural network architectures, pre-trained models, and transfer learning affect the results of language models used to generate source code or auto-complete software pieces. We studied different DNN architectures like AWD-LSTM, AWD-QRNN, and Transformer to seek which of them works better with different tokenization models (word, unigram, BPE, and char). Also, we compared the effect of pre-training on the results given by the LMs after training and fine-tuning them via transfer learning to work with other languages (English language to Python programming language). As a result of this work, we find that in small LMs (like our AWD-LSTM and AWD-QRNN models), the tokenization using char-sized chunks works better than any other tokenization model. In larger models like the Transformer GPT-2, the accuracy was slightly worse than in the other architectures. However, GPT-2 raised better results in the source code generation tests (even using another tokenization approach like BPE over raw bytes). For source code auto-completion, we tested some transformer models like BERT and RoBERTa. While their accuracy was above that of any other model, they did not perform very well when performing the tasks proposed in our tests. In general, we find that pre-trained models work better, even if they were not initially trained for a programming language like Python (our models were pre-trained using the English language). Finally, related to evaluating tasks like automating source code generation and source code auto-completion, we raise concerns about the literature gaps and propose some research lines to work on in the future.

Acknowledgments

We thank the IBM Quantum team and the IBM Research ETX team for the insightful discussions about this research and the support received during the development of this research.

References

Allamanis, M.; Barr, E. T.; Bird, C.; and Sutton, C. 2015. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 38–49.
Allamanis, M.; Barr, E. T.; Devanbu, P.; and Sutton, C. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51(4): 1–37.
Baevski, A.; and Auli, M. 2018. Adaptive Input Representations for Neural Language Modeling. In International Conference on Learning Representations.
Bielik, P.; Raychev, V.; and Vechev, M. 2016. PHOG: probabilistic model for code. In International Conference on Machine Learning, 2933–2942.
Bostrom, K.; and Durrett, G. 2020. Byte Pair Encoding is Suboptimal for Language Model Pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, 4617–4624. Online: Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.findings-emnlp.414.
Bradbury, J.; Merity, S.; Xiong, C.; and Socher, R. 2017. Quasi-recurrent neural networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017). URL https://arxiv.org/pdf/1611.01576.pdf.
Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33. URL https://arxiv.org/pdf/2005.14165.pdf.
Bryant, C.; and Briscoe, T. 2018. Language model based grammatical error correction without annotated training data. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 247–253.
Celikyilmaz, A.; Clark, E.; and Gao, J. 2020. Evaluation of Text Generation: A Survey. arXiv preprint arXiv:2006.14799.
Chelba, C.; Mikolov, T.; Schuster, M.; Ge, Q.; Brants, T.; Koehn, P.; and Robinson, T. 2014. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
Chen, H.; Le, T. H. M.; and Babar, M. A. 2020. Deep Learning for Source Code Modeling and Generation: Models, Applications and Challenges. ACM Computing Surveys (CSUR).
Chronopoulou, A.; Baziotis, C.; and Potamianos, A. 2019. An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2089–2095.
Cruz-Benito, J.; Faro, I.; Martín-Fernández, F.; Therón, R.; and García-Peñalvo, F. J. 2018. A Deep-Learning-based proposal to aid users in Quantum Computing programming. In International Conference on Learning and Collaboration Technologies, 421–430. Springer.
Cruz-Benito, J.; and Vishwakarma, S. 2020a. cbjuan/tokenizers-neural-nets-2020-paper: v1.0. doi:10.5281/zenodo.4011767. URL https://github.com/cbjuan/tokenizers-neural-nets-2020-paper/.
Cruz-Benito, J.; and Vishwakarma, S. 2020b. NN models produced by cbjuan/tokenizers-neural-nets-2020-paper: v1.0. doi:10.5281/zenodo.4293857.
Czapla, P.; Howard, J.; and Kardas, M. 2018. Universal language model fine-tuning with subword tokenization for Polish. arXiv preprint arXiv:1810.10222.
Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J. G.; Le, Q.; and Salakhutdinov, R. 2019. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2978–2988.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1).
Donahue, C.; Lee, M.; and Liang, P. 2020. Enabling Language Models to Fill in the Blanks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2492–2501. Online: Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.225. URL https://www.aclweb.org/anthology/2020.acl-main.225.
Eisenschlos, J.; Ruder, S.; Czapla, P.; Kadras, M.; Gugger, S.; and Howard, J. 2019. MultiFiT: Efficient Multi-lingual Language Model Fine-tuning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5702–5707. Hong Kong, China: Association for Computational Linguistics. doi:10.18653/v1/D19-1572. URL https://www.aclweb.org/anthology/D19-1572.
Fedus, W.; Goodfellow, I.; and Dai, A. M. 2018. MaskGAN: Better Text Generation via Filling in the ______. In International Conference on Learning Representations. Online. URL https://arxiv.org/pdf/1801.07736.pdf.
Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1): 2096–2030.
Ghosh, S.; and Kristensson, P. O. 2017. Neural networks for text correction and completion in keyboard decoding. arXiv preprint arXiv:1709.06429.
Gong, C.; He, D.; Tan, X.; Qin, T.; Wang, L.; and Liu, T.-Y. 2018. Frage: Frequency-agnostic word representation. Advances in Neural Information Processing Systems 31: 1334–1345.
Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8): 1735–1780.
Honnibal, M.; and Montani, I. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. URL https://spacy.io/.
Howard, J.; and Gugger, S. 2020. Fastai: A layered API for deep learning. Information 11(2): 108.
Howard, J.; and Ruder, S. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 328–339.
Husain, H.; Wu, H.-H.; Gazit, T.; Allamanis, M.; and Brockschmidt, M. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436.
Kannan, A.; Kurach, K.; Ravi, S.; Kaufmann, T.; Tomkins, A.; Miklos, B.; Corrado, G.; Lukacs, L.; Ganea, M.; Young, P.; et al. 2016. Smart reply: Automated response suggestion for email. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 955–964.
Karampatsis, R.-M.; and Sutton, C. 2019. Maybe deep neural networks are the best choice for modeling source code. arXiv preprint arXiv:1903.05734.
Karpathy, A. 2016. The unreasonable effectiveness of recurrent neural networks. URL http://karpathy.github.io/2015/05/21/rnn-effectiveness.
Kim, Y.; Jernite, Y.; Sontag, D.; and Rush, A. M. 2016. Character-Aware Neural Language Models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, 2741–2749. AAAI Press.
Krause, B.; Kahembwe, E.; Murray, I.; and Renals, S. 2018. Dynamic Evaluation of Neural Sequence Models. In Dy, J.; and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 2766–2775. Stockholmsmässan, Stockholm Sweden: PMLR. URL http://proceedings.mlr.press/v80/krause18a.html.
Kudo, T.; and Richardson, J. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 66–71.
Li, J.; Wang, Y.; Lyu, M. R.; and King, I. 2017. Code completion with neural attention and pointer networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), 4159–4165.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Merity, S.; Keskar, N. S.; and Socher, R. 2018a. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.
Merity, S.; Keskar, N. S.; and Socher, R. 2018b. Regularizing and Optimizing LSTM Language Models. In International Conference on Learning Representations.
Merity, S.; Xiong, C.; Bradbury, J.; and Socher, R. 2017. Pointer sentinel mixture models. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017). URL https://arxiv.org/pdf/1609.07843.pdf.
Mikolov, T.; Deoras, A.; Kombrink, S.; Burget, L.; and Černocký, J. 2011. Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association.
Nguyen, A. T.; and Nguyen, T. N. 2015. Graph-based statistical language model for code. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, 858–868. IEEE.
Nguyen, A. T.; Nguyen, T. T.; and Nguyen, T. N. 2013. Lexical statistical machine translation for language migration. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, 651–654.
Oda, Y.; Fudaba, H.; Neubig, G.; Hata, H.; Sakti, S.; Toda, T.; and Nakamura, S. 2015. Learning to generate pseudo-code from source code using statistical machine translation (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), 574–584. IEEE.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.
Post, M. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, 186–191. Brussels, Belgium: Association for Computational Linguistics. doi:10.18653/v1/W18-6319. URL https://www.aclweb.org/anthology/W18-6319.
Proksch, S.; Lerch, J.; and Mezini, M. 2015. Intelligent code completion with Bayesian networks. ACM Transactions on Software Engineering and Methodology (TOSEM) 25(1): 1–31.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners.
Rae, J. W.; Potapenko, A.; Jayakumar, S. M.; Hillier, C.; and Lillicrap, T. P. 2019. Compressive Transformers for Long-Range Sequence Modelling. In International Conference on Learning Representations.
Raychev, V.; Bielik, P.; and Vechev, M. 2016. Probabilistic model for code with decision trees. ACM SIGPLAN Notices 51(10): 731–747.
Ribeiro, M. T.; Wu, T.; Guestrin, C.; and Singh, S. 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4902–4912. Online: Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.442. URL https://www.aclweb.org/anthology/2020.acl-main.442.
Roziere, B.; Lachaux, M.-A.; Chanussot, L.; and Lample, G. 2020. Unsupervised Translation of Programming Languages. Advances in Neural Information Processing Systems 33.
Ruder, S.; Peters, M. E.; Swayamdipta, S.; and Wolf, T. 2019. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, 15–18.
Schuster, M.; and Nakajima, K. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5149–5152. IEEE.
Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–1725.
Smith, L. N. 2017. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 464–472. IEEE.
Smith, L. N.; and Topin, N. 2019. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, 1100612. International Society for Optics and Photonics.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1): 1929–1958.
Takase, S.; Suzuki, J.; and Nagata, M. 2018. Direct Output Connection for a High-Rank Language Model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4599–4609.
Tiwang, R.; Oladunni, T.; and Xu, W. 2019. A Deep Learning Model for Source Code Generation. In 2019 SoutheastCon, 1–7. IEEE.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
Wang, A.; and Cho, K. 2019. BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation (NAACL HLT 2019), 30–36.
Wang, D.; Gong, C.; and Liu, Q. 2019. Improving Neural Language Modeling via Adversarial Training. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 6555–6565. Long Beach, California, USA: PMLR. URL http://proceedings.mlr.press/v97/wang19f.html.
Weston, J.; Bordes, A.; Chopra, S.; Rush, A. M.; van Merriënboer, B.; Joulin, A.; and Mikolov, T. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Le Scao, T.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. 2020. HuggingFace's Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. Online: Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
Yang, Z.; Dai, Z.; Salakhutdinov, R.; and Cohen, W. W. 2018. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. In International Conference on Learning Representations.
Young, T.; Hazarika, D.; Poria, S.; and Cambria, E. 2018. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine 13(3): 55–75.