Styler: Learning Formatting Conventions to Repair Checkstyle Errors - arXiv
Benjamin Loriot, Fernanda Madeiral, Martin Monperrus

arXiv:1904.01754v3 [cs.SE] 10 Aug 2020

Abstract: Ensuring code formatting conventions is an essential aspect of modern software quality assurance, because it helps code readability. In this paper, we present Styler, a tool dedicated to fixing formatting errors raised by Checkstyle, a highly configurable format checker for Java. To fix formatting errors in a given project, Styler 1) learns fixes for self-generated errors according to the project-specific Checkstyle ruleset, based on token sequences fed into an LSTM neural network, and then 2) predicts fixes. In an empirical evaluation, we find that Styler repairs 38% of 11,220 real Checkstyle errors mined from 70 GitHub projects. Moreover, we compare Styler with the IntelliJ plugin CheckStyle-IDEA and the machine learning-based code formatters Naturalize and CodeBuff. We find that Styler fixes errors from a more diverse set of Checkstyle rules (24 rules, compared to CheckStyle-IDEA: 19; Naturalize: 20; CodeBuff: 17), and it uniquely repairs errors for two rules. Finally, Styler generates small repairs, and once trained, it predicts repairs in seconds. The promising results suggest that Styler can be used in IDEs and in Continuous Integration environments to repair Checkstyle errors.

I. INTRODUCTION

Code readability is the first requirement for program comprehension: one cannot comprehend what one cannot easily read. To improve code readability, most developers agree on using coding conventions, so the code is clear and uniformly consistent across a given code base or organization [23], [16].

A major challenge of using coding conventions is to keep all source code files consistent with the agreed conventions. The first step towards that is the detection of coding convention violations (or errors). This can be automatically performed using linters, which are static analysis tools that warn software developers about possible violations of coding conventions [36]. The usage of linters also brings challenges because the developers need to create a configuration according to their adopted conventions so that the linter detects the right violations (not more and not less), and then eventually repair the violations. In this paper, we focus on the latter task, automatically repairing linter violations, which is a little-researched problem, and we focus on formatting errors¹.

¹ In this paper, we refer to linters specialized in formatting as format checkers.

To repair a formatting error detected by a format checker, developers can either perform the fix manually or use a code formatter. Neither alternative is satisfactory. Manually fixing formatting errors is a waste of valuable developer time. With code formatters, the key problem is that they do not take into account the project-specific convention rules, those that are configured by the developers for the used format checker.

Inspired by the problem statement of program repair [24], we state in this paper the problem of automatically repairing formatting errors: given a program, its format checker rules, and one rule violation, the goal is to modify the source code formatting so that no violation is raised by the format checker.

In this paper, we explore this problem in the context of Checkstyle [8], a popular format checker for the Java language. We present Styler, a repair tool dedicated to fixing Checkstyle formatting errors in Java source code. The uniqueness of Styler is that it is applicable to any formatting coding convention, because its approach is not based on rules to repair specific Checkstyle errors. The key idea of Styler is the usage of machine learning to learn the coding conventions that are used in a software project. Once trained, Styler predicts changes on formatting characters (e.g. whitespaces, new lines, indentation) to fix a formatting convention violation happening in the wild. Technically, Styler uses a sequence-to-sequence machine learning model based on a long short-term memory neural network (LSTM).

We conduct a large-scale experiment to evaluate Styler using a curated dataset of 11,220 real Checkstyle errors mined from 70 GitHub projects. Based on our research questions, we find that Styler repairs many errors (38%), and repairs errors from more different Checkstyle formatting rules compared to the state-of-the-art of machine learning formatters [3], [26] and the tailored, human-engineered IntelliJ plugin CheckStyle-IDEA [9]. Moreover, Styler produces small repairs and its performance is fast enough for developers.

To sum up, our contributions are:
• A novel approach to fix violations of code formatting conventions, based on machine learning. The approach is able to learn project-specific formatting rules without manual setup;
• A tool, called Styler, which implements our approach in the context of Java and Checkstyle, to repair Checkstyle formatting violations. The tool is made publicly available [21];
• A curated dataset of real-world formatting Checkstyle errors, which contains 11,220 errors mined from 70 GitHub repositories. To our knowledge, this is the largest dataset of this kind, made publicly available for future research;
• A comparative experiment of the performance of Styler against the state-of-the-art of automatic code formatting [9], [3], [26], showing that Styler outperforms it.

The remainder of this paper is organized as follows. Section II and Section III present the background of this work. Section IV presents our tool, Styler. Section V presents the
design of our experiment for evaluating Styler and comparing it with three code formatters; the experimental results are presented in Section VI. Section VII presents discussions, and Section VIII presents the related works. Finally, Section IX presents the final remarks.

II. BACKGROUND

A. Coding Conventions

Coding conventions (also known as coding style or coding standards) are rules that developers agree on for writing code. The usage of coding conventions improves code readability but it does not change the program behavior.

There are several coding convention classes: e.g. naming, control flow style, and formatting. In this paper, we focus on the latter: formatting coding conventions. Formatting here refers to the appearance or the presentation of the source code. One can change the formatting by using non-printable characters such as spaces, tabulations, and line breaks. In free-format languages such as Java and C++, the formatting does not change the abstract syntax tree. In non-free-format languages such as Haskell or Python, formatting is even related to behavior: correcting formatting issues can fix a bug [7].

For instance, a well-known formatting coding convention is about the placement of braces in code blocks. Figure 1 shows two ways that developers may follow when writing conditional blocks: one developer might place the left brace on a new line, while another one might place it at the end of the conditional line. Agreeing on coding conventions avoids edit wars and endless debates: all developers in a team decide on how to format code once and for all.

(a) Left curly on new line.

    if (condition)
    {
        // do something
    }

(b) Left curly on end of line.

    if (condition) {
        // do something
    }

Fig. 1: Two conventions for placing a left curly brace.

B. Coding Convention Checkers

A challenge faced by developers is to keep their code compliant with the agreed coding conventions. Basically, every new change, every new commit must satisfy the convention rules. Manually checking that code changes do not violate the coding conventions is not an option because it would be too time-consuming and error-prone.

To overcome this problem, a mechanism to automatically check if the code follows the coding convention rules is required. Such a tool is known as a linter, or coding convention enforcer [2]. A linter is a static analysis tool that warns software developers about possible code errors or violations of coding conventions [36]. Note that linters may go beyond coding conventions and also perform some basic static analysis on the program behavior.

Linters can usually be integrated in IDEs and build tools. When integrated in IDEs, the developer manually runs the linter before she commits her changes. If she does not do it, she might face a lot of errors raised by the linter after the end of the building step for a release or for shipping the program. On the other hand, when a linter is integrated in build tools, it is automatically executed in Continuous Integration (CI) environments. The important coding conventions might be configured to make CI builds break when they are violated. This way, developers are forced to repair coding convention violations early in the software development process.

Several linters have been developed depending on the programming language: e.g. ESLint [13] for JavaScript, Pylint [29] for Python, StyleCop [34] for C#, and RuboCop [31] for Ruby. For Java, which is our target language in this paper, the most commonly used linter is Checkstyle [8]. Checkstyle supports predefined well-known coding conventions, such as the Google Java Style Guide [16] and the Sun Code Conventions [35]. It also allows developers to configure a specific ruleset to match their own preferences. Checkstyle is a flexible linter that can be integrated both in an IDE (e.g. IntelliJ, Eclipse, and NetBeans) and in a build tool (e.g. Maven and Gradle). In the Java ecosystem, Checkstyle is often executed in Continuous Integration environments such as Travis and Circle CI.

III. STUDY OF CHECKSTYLE USAGE IN THE WILD

Static analysis tools have been the subject of investigation in recent research [39], [38], [22]. However, there is little empirical knowledge of the extent to which Checkstyle, a popular static analyzer, is used in the wild. To ground our work with a solid empirical basis, we investigate the usage of Checkstyle and its rules in open source projects.

Checkstyle can be executed on a project in different ways. The straightforward ways are 1) by directly invoking Checkstyle on the command line, 2) by a build tool, or 3) by a continuous integration service. Independently of the way Checkstyle is executed, there must exist a configuration file with the Checkstyle rules defined by the developers: we refer to this file as the Checkstyle ruleset. In this section, we report on our large-scale study on the usage of Checkstyle on GitHub.

A. Checkstyle Usage in Practice

Method. To measure the usage of Checkstyle on GitHub, we queried GitHub² to retrieve only Java projects with at least five stars, because stars have been shown to be meaningful for sampling projects from GitHub [6]: we found 148,127 Java projects. Then, we searched each of them for a Checkstyle ruleset file. A Checkstyle ruleset file can have any name, but we followed a conservative approach towards identifying true positives: we used a set of commonly used names³. For simplicity, in the rest of this paper we refer to a Checkstyle ruleset file as checkstyle.xml.

² On June 9, 2020.
³ Checkstyle ruleset file commonly used names: [‘checkstyle.xml’, ‘.checkstyle.xml’, ‘checkstyle_rules.xml’, ‘checkstyle_config.xml’, ‘checkstyle_configuration.xml’, ‘checkstyle_checker.xml’, ‘checkstyle_checks.xml’, ‘google_checks.xml’, ‘sun_checks.xml’]. Variants obtained by replacing ‘_’ with ‘-’ are also used.

Results. We found 3,793 Java projects containing a checkstyle.xml file, which is 2.56% of all Java projects
with at least five stars on GitHub. Table I shows the proportion of those projects with their build tools and CI services, if any. We note that build tools are widely used among projects using Checkstyle: 98% of the projects use at least one build tool. Moreover, 55% of the projects use a continuous integration service, which shows the software engineering maturity of the sampled projects.

TABLE I: Usage of build tools and CI services by 3,793 projects that use Checkstyle.

    Build tool usage    Maven       54%
                        Gradle      47%
                        Ant         10%
    CI usage            TravisCI    51%
                        CircleCI     4%

B. Popularity of Checkstyle Rules

Method. To check the usage of Checkstyle rules⁴, we analyzed the previously found checkstyle.xml files from the 3,793 projects using Checkstyle. Our goal is to investigate the most used rules and check if formatting-related rules, which are the target of this work, are widely used.

⁴ The set of Checkstyle rules we considered in our study is from Checkstyle version 8.33 (released on May 31, 2020).

Results. We found at least one usage for the 174 Checkstyle rules. Figure 2 shows the top-10 most used rules. The bars in dark red represent formatting-related rules, and the bars in gray represent the other rules. In the top-10 most used rules, there are four rules related to formatting. Notably, the top-3 most used rules are formatting-related ones. Therefore, we conclude that formatting-related rules are very important for developers, which validates the relevance of our work.

    Rule                # Projects on GitHub
    RightCurly          3,719 (98.05%)
    RegexpSingleline    3,162 (83.37%)
    LeftCurly           3,083 (81.28%)
    PackageName         3,047 (80.33%)
    UpperEll            3,033 (79.96%)
    TypeName            3,018 (79.57%)
    ParameterName       2,996 (78.99%)
    MemberName          2,966 (78.2%)
    FileTabCharacter    2,955 (77.91%)
    MethodName          2,947 (77.7%)

Fig. 2: The top-10 most popular Checkstyle rules (bar chart; bar colors distinguish formatting-related rules from non-formatting-related rules).

IV. STYLER

Styler is a tool to fix Checkstyle formatting errors in Java source code, in order to help developers in different software development workflows. For instance, Styler could be used locally as a pre-commit hook when developers are about to release projects. Also, it could be configured to run in Continuous Integration, where pull requests are automatically opened with suggested formatting fixes. In this section, we present the workflow and the technical principles of Styler.

A. Targeted Error Types

Styler is about learning how to repair errors related to formatting coding conventions (see Section II-A). For instance, consider a developer who specified that the left curly token “{” in a conditional block must always be placed on a new line (as shown in Figure 1a). If this rule is not satisfied (e.g. as in Figure 1b), Checkstyle triggers a formatting-related error (see Figure 4a). In order to fix this violation, a new line break should be inserted in the program before the token “{”.

In Checkstyle, there are different classes of checks: e.g. formatting, naming, and lightweight linting checks. In Styler, we exclusively focus on formatting checks, such as indentation and whitespace before and after punctuation. We ignore Checkstyle checks that are not related to formatting, e.g. unused imports and method name.

B. Styler Workflow

Figure 3 shows the Styler workflow. It is composed of two main components: ‘Styler training’ for learning how to fix formatting errors and ‘Styler prediction’ for actually repairing a concrete Checkstyle error. Styler receives as input a software project, including its source code and its Checkstyle ruleset.

Fig. 3: Styler workflow.
    Styler training (learning): project with source code and Checkstyle ruleset → A. Training data generation → B. Error-encoding (tokenization) → C. Training LSTM models.
    Styler prediction (repairing): D. Checkstyle-error localization (Checkstyle error, Figure 4a; Java code, Figure 4b) → E. Error-encoding (tokenization) (tokenized, Figure 4c) → F. Predicting repair (LSTM models) (repaired Java code tokenized, Figure 4e) → G. Repair-decoding (de-tokenization) (repaired Java code de-tokenized, Figure 4f) → H. Repair verification → I. Repair selection → repaired Java code.

The component ‘Styler training’ is responsible for learning how to repair Checkstyle errors on the given project according to its project-specific Checkstyle ruleset. It creates the training data by injecting Checkstyle formatting errors into source code files in the project (step A). Then, it translates the training data into abstract token sequences (step B) in order to train LSTM neural networks (step C). The learned LSTM models are eventually used to predict repairs.

The component ‘Styler prediction’ is responsible for predicting fixes for real Checkstyle errors. It first localizes Checkstyle errors by running Checkstyle on the project (step D). Then, Styler encodes the error line into an abstract token sequence (step E), which is given as input to the previously learned LSTM models (step F). The models predict fixes for the
given Checkstyle error: these fixes are in the format of abstract token sequences, so they must be translated back to Java code (step G). Styler then runs Checkstyle on the new Java code containing the predicted fixes (step H). Finally, among the predicted fixes where no Checkstyle error is raised, Styler selects one formatting repair to give as output (step I). As Styler only impacts the formatting of the code, its repairs do not change the behavior of the program under consideration.

C. Styler in Action

Consider the Checkstyle error presented in Figure 4a. This error is raised by a violation of the Checkstyle LeftCurly rule: the left curly should be on a new line. Checkstyle provides, for a given error, the location (line and column) where the Checkstyle rule is violated. The Java source code that caused such an error is presented in Figure 4b.

Styler encodes the incorrectly formatted lines (Figure 4b) into the abstract token sequence shown in Figure 4c. Then, this abstract token sequence is given as input to LSTM models, which predict the formatting token sequence shown in Figure 4d. This predicted formatting token sequence is then used to modify the formatting tokens from the buggy abstract token sequence. It results in a predicted abstract token sequence, as shown in Figure 4e, that may fix the current Checkstyle error. The diff between Figure 4c and Figure 4e shows that the predicted repair is the replacement of the formatting token 4_SP by 1_NL. This predicted repair means that the four whitespaces before the token “{” should be replaced by a new line.

Then, the predicted abstract token sequence (Figure 4e) is translated back to Java code (Figure 4f). Finally, when running Checkstyle on the new Java code, no Checkstyle error is raised, meaning that Styler successfully repaired the error.

(a) Checkstyle LeftCurly rule violation.

    [ERROR] .../NodeRelationshipCache.java:812:82: '{' at column 82 should be on a new line. [LeftCurly]

(b) Source code snippet of the error.

    812     public void visitChangedNodes( NodeChangeVisitor visitor, int nodeTypes )    {
    813         long denseMask = changeMask( true );

(c) Buggy abstract token sequence.

    before-context Identifier 0_SP , 1_SP int 1_SP Identifier 1_SP ) 4_SP { 1_NL_4_ID long 1_SP Identifier 1_SP = 1_SP Identifier 0_SP ( 1_SP after-context

(d) Formatting token sequence generated by an LSTM model.

    0_SP 1_SP 1_SP 1_SP 1_NL 1_NL_4_ID 1_SP 1_SP 1_SP 0_SP 1_SP

(e) Predicted abstract token sequence.

    before-context Identifier 0_SP , 1_SP int 1_SP Identifier 1_SP ) 1_NL { 1_NL_4_ID long 1_SP Identifier 1_SP = 1_SP Identifier 0_SP ( 1_SP after-context

(f) Source code snippet with repaired formatting.

    812     public void visitChangedNodes( NodeChangeVisitor visitor, int nodeTypes )
    813     {
    814         long denseMask = changeMask( true );

Fig. 4: Styler: from the Checkstyle-formatting error to a fix.

D. Java Source Code Encoding

Styler encodes the Java source code into an abstract token sequence that is required to predict formatting changes. First, Styler translates each Java token to an abstract token by keeping the value of the Java keywords, separators, and operators (e.g. + → +), and by replacing the other token kinds such as literals, comments, and identifiers by their types (e.g. x → Identifier). Second, for each pair of subsequent Java tokens, Styler creates an abstract formatting token that depends on the presence of a new line. If there is no new line, Styler counts the number of whitespaces, and then represents it as n_SP, where n is the number of whitespaces (e.g. a single space → 1_SP). If there is no whitespace between two Java tokens (e.g. x=), Styler adds 0_SP between the tokens. The same process is applied for tabulations.

If there are new lines between two Java tokens, Styler first counts the number of new lines, and represents it as n_NL, where n is the number of new lines. Then, Styler calculates the indentation delta (∆) between the line containing the previous token and the line containing the next token: the delta is the difference of the indentation between the two lines (the indentation is composed of whitespace or tabulation characters, exclusively, depending on the project). Positive indentation deltas are represented by ∆_ID (indent), negative ones are represented by ∆_DD (dedent), and deltas equal to zero (there is no indentation change between two lines) are ignored: they are not represented by an abstract token. The complete representation after the calculation of the number of new lines and the indentation delta is n_NL_∆_(ID|DD): for instance, in Figure 4b, the new line between lines 812 and 813 is represented by 1_NL_4_ID, i.e. one new line and indentation delta +4.

E. Training Data Generation

Styler does not use predefined templates for repairing formatting errors. Styler uses machine learning for inferring a model to repair formatting errors and, consequently, it needs training data. One option is to mine past commits from the project under consideration to collect training data. However, there might not exist enough data in the history of the project to cover all Checkstyle formatting rules.

So in order to have enough data for training, our key insight is to generate the training data. The idea is to modify error-free Java source code files in the project in order to trigger Checkstyle formatting rule violations. Then, one obtains a pair of files (α_orig, α_err): α_orig is the file without the formatting
error, and α_err is the file with the formatting error. α_orig is a repaired version of α_err, and we can use supervised machine learning to predict α_orig given α_err. We experiment with that idea in two different ways (called protocols in this paper) to generate training data: we name them Styler_random and Styler_3grams, and present them as follows.

The Styler_random protocol for injecting Checkstyle errors in a project consists of automated insertion or deletion of a single formatting character (space, tabulation, or new line) in Java source files. These modifications require a careful procedure so that 1) the project still compiles and 2) its behavior is not changed. For this, we specify the locations in the source code files that are suitable to perform the modifications. For insertions, the suitable locations are before or after any token. For deletions, the suitable locations are 1) before or after any punctuation (“.”, “,”, “(”, “)”, “[”, “]”, “{”, “}”, and “;”), 2) before or after any operator (e.g. “+”, “-”, “*”, “=”, “+=”), and 3) in any token sequence longer than one indentation character.

The Styler_3grams protocol is meant to produce likely errors. It performs modifications at the abstract token level instead of directly changing the Java source code as Styler_random does. The idea is to replace formatting tokens by the ones used by developers in a similar context (i.e. the same surrounding Java tokens). For that, we use 3-grams, where 3gram = {Java_token, formatting_token, Java_token}. So given an error-free Java file, the task of Styler_3grams is the following. First, the Java file is tokenized (see Section IV-D), and a random formatting token is picked and used to form a 3-gram, which is 3gram_orig. Then, given a corpus of 3-grams previously mined from a project, Styler_3grams finds a 3gram_i-corpus that matches the surrounding Java tokens of 3gram_orig. Several matches can be found, but the selection of a 3gram_i-corpus is random according to its frequency in the corpus. Then, 3gram_orig is replaced by 3gram_i-corpus: since the Java tokens match, only the formatting token is actually replaced. Finally, Styler_3grams performs a de-tokenization so that an errored version of the original error-free Java file is created.

Algorithm 1 presents the algorithm that Styler uses to generate one training dataset per protocol (Styler_random and Styler_3grams). The input of the algorithm is the Checkstyle ruleset of the project, a corpus of error-free Java files taken from the project, the number of errored files to be generated, and the injection protocol to be used. Then, in each batch iteration, a random file is selected from the corpus of error-free Java files, and the specified injection protocol is applied to it. Once a batch is completed, Checkstyle is executed so that the algorithm selects the modified files that contain a single error. The algorithm ends when the desired number of errored files is reached.

Algorithm 1 Batch injection of Checkstyle errors in Java files.
    Input: ruleset – Checkstyle configuration of the project under consideration
    Input: files – corpus of error-free Java files taken from the project
    Input: numberOfErrors – number of errored files to be generated
    Input: protocol in [Styler_random, Styler_3grams]
    Output: dataset with Checkstyle errors

    const BATCH_SIZE ← 500
    var dataset ← {}
    while dataset.length < numberOfErrors do
        var modifiedFiles ← {}
        for i ← 0; i < BATCH_SIZE; i++ do
            file ← selectRandom(files)
            file′ ← changeFormatting(file, protocol)
            modifiedFiles.append(file′)
        end for
        checkstyleResult ← runCheckstyle(modifiedFiles, ruleset)
        erroredFiles ← selectErroredFiles(checkstyleResult)
        dataset.append(erroredFiles)
    end while
    return dataset

F. Error Encoding

In order to repair formatting errors, the Java source code encoding using an abstract token sequence (see Section IV-D) must capture both the error in the code and the context surrounding the error. Therefore, Styler considers a token window of k lines before and after the error.

Once the context surrounding a formatting error is tokenized, Styler places two tags around the error, so that its location and its violation type can be further identified. The tags consist of the name of the Checkstyle rule that was violated and raised the error. For instance, the error presented in Figure 4a is about the Checkstyle LeftCurly rule, so the tags around the error are <LeftCurly> and </LeftCurly>, as shown in Figure 4c.

To insert the tags concerning the error type in the abstract token sequence, Styler needs to find a place so that the tags surround the tokens related to the origin of the error, and at the same time to minimize the number of tokens between the two tags to have precise information about the location. Styler places the tags according to the location information given by Checkstyle (line and column). When Checkstyle provides the line and the column, Styler places the opening tag n tokens before the error and the closing tag n tokens after. When Checkstyle provides the line but not the column (e.g. when the error is about the LineLength rule), Styler places the opening tag i tokens before the line and the closing tag j tokens after the end of the line. The values of k, n, i, and j are explained in Section IV-I.

G. Machine Learning Model

Learning (Figure 3–step C). Styler aims to translate a buggy token sequence (input sequence) to a new token sequence with no Checkstyle errors (output sequence). Styler uses a sequence-to-sequence translation based on a recurrent neural network LSTM (Long Short-Term Memory), similar to what is used for natural language translation. Thanks to the token abstraction employed by Styler to encode Java source code
6 I = ( 0_SP Identifier 0_SP , 1_SP Identifier 1_SP H. Repair Verification and Selection Fi = 0_SP 1_SP 1_NL 1_SP S TYLER performs x predictions per training data generation Oi = ( 0_SP Identifier 1_SP , 1_NL Identifier 1_SP protocol (i.e. Stylerrandom and Styler3grams ), so in the end (a) length(Fi ) = length(I)/2. S TYLER generates x × 2 predictions to repair a single error. I = ( 0_SP Identifier 0_SP , 1_SP Identifier 1_SP After the translation of those predictions back to Java source Fi = 0_SP 1_SP 1_SP 2_SP 1_NL_4_DD code (Figure 3–step G), S TYLER performs a verification (Fig- Oi = ( 0_SP Identifier 1_SP , 1_SP Identifier 2_SP ure 3–step H), where Checkstyle is executed on the resulting Java source code files. From the correctly repaired files (i.e. the (b) length(Fi ) > length(I)/2. ones that do not result in Checkstyle errors), S TYLER selects I = ( 0_SP Identifier 0_SP , 1_SP Identifier 1_SP the best one to give as output, where the best prediction is the Fi = 0_SP 1_SP one that has the smallest source code diff (Figure 3–step I). Oi = ( 0_SP Identifier 1_SP , 1_SP Identifier 1_SP (c) length(Fi ) < length(I)/2. I. Implementation Fig. 5: Generation of the sequence Oi based on the predicted S TYLER is implemented in Python. We use javalang [18] formatting tokens Fi and the input I. for parsing and OpenNMT-py [25] for the machine learning part. The code is publicly available [21]. For optimally training the LSTM models, we performed an exploratory study by training models with different configura- tions. The configurations combine values for key parameters, (see Section IV-D and Section IV-F), the input and output which are the model attention type (general or mlp), the vocabularies are small (respectively ∼150 and ∼50), hence number of layers (1, 2, or 3) and the number of units (256 are well handled by LSTM models. 
We use LSTM with or 512) for the model encoder/decoder, and the model word bidirectional encoding, which means that the embedding is embedding size (256 or 512). For each configuration, the able to catch information around the formatting error in training was performed for a maximum of 20k iterations, with the two directions: for instance, an error triggered by the a batch size of 32, and a model was saved in the iterations 10k Checkstyle WhitespaceAround rule, which checks that a token and 20k. This means that, in the end, we obtained 48 models (2 is surrounded by whitespaces, requires the contexts before and model attention types × 3 numbers of layers × 2 numbers of after the token. units × 2 embedding sizes × 2 number of training iterations) per training data generation protocol (i.e. Stylerrandom and Predicting/Repairing (Figure 3–step F). Once the LSTM mod- Styler3grams ). els are trained (one per training protocol, see Section IV-E), Those models were created for one open-source project5 , S TYLER can be used for predicting fixes for an erroneous randomly selected from the top-5 projects with most diversity sequence I as in Figure 4c. For an input sequence I, a LSTM in terms of number of formatting rules (see Section V-B). model predicts x alternative formatting token sequences using The project was given as input to S TYLER, which produced a technique called beam search, that we use off-the-shelf. training data by injecting Checkstyle errors in error-free files These alternatives are all potential repairs for the formatting in the project (see Section IV-E). For each protocol, 10k errors error (e.g. Figure 4d). were injected. This data was used to train the LSTM models, where 9k errors were used for training and 1k for validation. Note that the LSTM models predict formatting token se- When the 48 models per protocol were created, we ran each quences (e.g. 
Figure 4d), but the goal is to have token se- of them on real errors from the project so that we could test quences containing Java and formatting tokens (e.g. Figure 4e), the models and choose the configuration of the best ones. so they can further be translated back to Java code. Then, We picked the configuration of the models, one per protocol, S TYLER generates a new abstract token sequence (Oi ) for each that repaired more real errors. The best Stylerrandom -based formatting token sequence (Fi ), based on the original input I, model was with general model attention type, 2 layers, 512 such as in Figure 5a. Recall that I is composed of pairs of Java units, embedding size of 512, and 20k training iterations, and tokens and formatting tokens (see Section IV-D), therefore its the best Styler3grams -based model was with general model number of formatting tokens is LI = length(I)/2. However, attention type, 1 layer, 512 units, embedding size of 256, a LSTM model does not enforce the output size, thus we and 20k training iterations. Those are the configurations we cannot guarantee that the length of a predicted formatting used for training the models for our experiments described in token sequence (LFi = length(Fi )) is equal to LI . If Section VI. LF > LI , S TYLER uses the first LI formatting tokens from For prediction, the beam search creates x = 5 potential Fi and ignores the remaining ones to generate Oi , such as in repairs per model. Finally, about the error encoding, we set Figure 5b. If LF < LI , S TYLER uses all formatting tokens k = 5, n = 10, i = 2, and j = 13. Recall that those parameters from Fi , and copies the LFi + 1, LFi + 2, . . . , LI original are about the token window before and after the error (i.e. the formatting tokens from I, such as in Figure 5c. Finally, after context surrounding the error) and the placement of tags for creating x abstract token sequences Oi , S TYLER continues its workflow (Figure 3–step G). 
5 https://github.com/inovexcorp/mobi
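The length reconciliation described for the Predicting/Repairing step (Figure 3–step F) can be illustrated with a minimal sketch. It is not STYLER's actual code: the function names and the token labels (SP, NL, NONE) are hypothetical placeholders, and the sketch only assumes that Java tokens and formatting tokens alternate as pairs, as stated in the text.

```python
from typing import List


def reconcile_formatting(original: List[str], predicted: List[str]) -> List[str]:
    """Return exactly len(original) formatting tokens.

    If the prediction is too long, keep its first L_I tokens; if it is too
    short, complete it with the original tokens at the missing positions.
    """
    target = len(original)  # L_I in the text
    if len(predicted) >= target:
        return predicted[:target]
    return predicted + original[len(predicted):]


def rebuild_sequence(java_tokens: List[str],
                     original_formatting: List[str],
                     predicted_formatting: List[str]) -> List[str]:
    """Interleave Java tokens with the reconciled formatting tokens (O_i)."""
    formatting = reconcile_formatting(original_formatting, predicted_formatting)
    output: List[str] = []
    for java_token, formatting_token in zip(java_tokens, formatting):
        output.extend([java_token, formatting_token])
    return output
```

For example, with the hypothetical Java tokens `["int", "x", ";"]`, original formatting `["SP", "NONE", "NL"]`, and a too-short prediction `["SP"]`, the rebuilt sequence keeps the predicted first token and falls back to the original formatting for the remaining two positions.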
the location and violation type identification once the error is encoded. These parameters are made big enough to contain important information and, at the same time, small enough to still allow for learning and prediction, and were set based on meta-optimization.

V. EVALUATION DESIGN

We conduct an evaluation of STYLER on real Checkstyle errors mined from GitHub repositories, and compare STYLER against three state-of-the-art code formatting systems. In this section, we present the design of our evaluation.

A. Research Questions

We aim to answer the following five research questions.

RQ #1 [Accuracy]: To what extent does STYLER repair real-world Checkstyle errors, compared to other systems?
Overall accuracy is an important metric to measure the value of tools. We investigate the accuracy of STYLER on real Checkstyle errors, which allows us to understand to what extent STYLER repairs formatting errors that have occurred in practice. Moreover, we compare the accuracy of STYLER to the accuracy of three code formatters, by using the same dataset of errors, to investigate if, and to what extent, STYLER outperforms the competing systems.

RQ #2 [Error type]: To what extent does STYLER repair different error types, compared to other systems?
Checkstyle has different formatting rules, so it raises different error types. In this research question, we investigate if, and to what extent, STYLER repairs different error types compared to the other systems. This analysis is also important to find if the systems are complementary to each other.

RQ #3 [Quality]: What is the size of the repairs generated by STYLER, compared to other systems?
There may be several alternative repairs that fix a given Checkstyle error, including ones that change other lines in the program and not only the ill-formatted line. In this research question, we compare the size of the repairs produced by STYLER against the repairs from the other systems.

RQ #4 [Performance]: How fast is STYLER for learning and for predicting formatting repairs?
To investigate if STYLER is applicable in practice, we measure its performance for fixing Checkstyle errors. This is valuable information for those interested in using STYLER as a pre-commit hook in IDEs or in continuous integration.

RQ #5 [Technical analysis]: How do the two training data generation techniques of STYLER contribute to its accuracy?
Finally, we perform a technical analysis on the two protocols for training data generation contained in STYLER (see Section IV-E), to investigate if one of them contributes more to the accuracy of STYLER. This is an important investigation from the research viewpoint so that other researchers can further choose a random or a 3-gram approach in related research.

B. Data Collection

To answer our research questions, we create a dataset of real Checkstyle formatting errors by mining open source projects. For that, we first build a list of projects to collect errors from by filtering projects from our study presented in Section III. We select the projects that 1) use Checkstyle, 2) have only one Checkstyle ruleset file, 3) contain at least one Checkstyle formatting rule in the Checkstyle ruleset, and 4) use Maven. This results in 1,791 projects.

For each project, we try to reproduce Checkstyle errors with the following procedure. We first clone the remote repository from GitHub^6. Then, we search in the history of the project for the last commit (c_n) that contains modifications in the checkstyle.xml file: this commit is used as a starting point for the reproduction of real errors.

We then perform a sanity check in the checkstyle.xml file from the commit c_n: if it contains unresolved variables, we discard the project. Otherwise, we submit all files of c_n to a process of finding a version of Checkstyle that is compatible with the checkstyle.xml of the project. This is necessary because new versions of Checkstyle sometimes introduce backward-incompatible changes^7, and they might fail to parse a checkstyle.xml used with previous versions of Checkstyle. The process consists of executing multiple Checkstyle versions on the project, from a newer version to an older one, until finding one version that does not fail or until the available options end^8.

If a compatible Checkstyle version is found, we gather all commits since c_n, inclusive: this process ensures that all commits are based on the same version of the Checkstyle ruleset. For each selected commit, we check it out, and we check if the pom.xml file overrides any Checkstyle configuration option: if it does, we discard that commit because we cannot untangle the Maven+Checkstyle configuration with high accuracy. Otherwise, we run Checkstyle on the commit source tree. If at least one Checkstyle error is raised, we save the errored Java files and also the metadata information about the errors (the Checkstyle error types and their location).

We remove duplicate Java files according to the file content among all commits if any. Then, we select the files containing a single Checkstyle error related to formatting. We perform this selection to accurately evaluate repairs predicted by STYLER. Finally, we keep projects where all criteria yield at least 20 Checkstyle formatting errors. By applying this systematic reproduction and selection process, we obtained a dataset containing 11,220 Checkstyle errors spread over 70 projects. Additionally, Table II shows the stats per Checkstyle formatting rule.

C. Systems Under Comparison

We selected three systems to be compared with STYLER: one is an IDE-based code formatter plugin for Checkstyle,

6 All repositories were cloned on June 24, 2020.
7 Checkstyle release notes: https://checkstyle.sourceforge.io/releasenotes.html
8 Our current implementation supports 35 Checkstyle versions, from 8.0 to 8.33.
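The final selection steps of the data collection in Section V-B (duplicate removal, single-error files, and the 20-error threshold per project) can be sketched as follows. This is an illustrative sketch, not the authors' tooling: the `ErroredFile` record and `select_dataset` function are hypothetical names, duplicates are assumed to be detected per project by hashing the file content, and a file is assumed to qualify when it contains exactly one formatting-related Checkstyle error.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, NamedTuple


class ErroredFile(NamedTuple):
    """Hypothetical record for one errored Java file saved from a commit."""
    project: str
    content: str
    formatting_errors: int  # number of formatting-related Checkstyle errors


MIN_ERRORS_PER_PROJECT = 20  # threshold stated in the text


def select_dataset(files: List[ErroredFile]) -> Dict[str, List[ErroredFile]]:
    """Deduplicate by content, keep single-error files, drop small projects."""
    seen = set()
    per_project: Dict[str, List[ErroredFile]] = defaultdict(list)
    for errored_file in files:
        digest = hashlib.sha256(errored_file.content.encode("utf-8")).hexdigest()
        key = (errored_file.project, digest)
        if key in seen:  # same file content already seen in another commit
            continue
        seen.add(key)
        if errored_file.formatting_errors == 1:
            per_project[errored_file.project].append(errored_file)
    return {
        project: kept
        for project, kept in per_project.items()
        if len(kept) >= MIN_ERRORS_PER_PROJECT
    }
```

Since each kept file carries exactly one error, the number of kept files in a project equals its number of collected errors, which is why the threshold is applied to the list length.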
and the other two are the state-of-the-art of machine learning formatters that aim to assist developers to fix code formatting-related issues without any prior or ad-hoc formatting rules.

1) CHECKSTYLE-IDEA: CHECKSTYLE-IDEA [9], also referred to as CS-IDEA in this paper, is a plugin for the IntelliJ IDE. It provides IDE-integrated feedback against a given Checkstyle ruleset and suggests fixes for Checkstyle errors.

2) NATURALIZE: NATURALIZE [3] is a tool dedicated to assisting developers in fixing coding conventions related to naming and formatting in Java programs. It learns coding conventions from a codebase and suggests fixes to developers such as formatting modifications, based on the n-gram model.

3) CODEBUFF: CODEBUFF [26] is a code formatter applicable to any programming language with an ANTLR grammar. Instead of formatting the code according to ad-hoc rules for a language, CODEBUFF aims to infer the formatting rules given a grammar for the language and a set of files following the same formatting rules. For each token, a KNN model makes the decision to indent it or to align it with another token based on the AST of the source file.

D. Set-up

1) CHECKSTYLE-IDEA: To use CS-IDEA, for each project in our dataset, we first create a project in IntelliJ containing the checkstyle.xml file and the errored files. Then, we import the Checkstyle ruleset (Settings > Editor > Code Style > Import schema > Checkstyle configuration). To run the CHECKSTYLE-IDEA plugin we simply call the function "Refactor code" from the IDE.

2) NATURALIZE and CODEBUFF adaptation: To use NATURALIZE, we have to slightly modify it: i) NATURALIZE recommends multiple fixes, so we take the first one for a given error as being the repair; and ii) we changed NATURALIZE to only work for indentation, excluding fixes regarding variable naming conventions (which are out of the scope of this paper). To run CODEBUFF, we give it the required configuration, including the number of spaces for indentation. This number is based on the most common indentation used in the considered projects (usually two or four spaces).

3) Training tools: We trained STYLER for each project in our real error dataset. The training process includes a step for creating the training data (see Figure 3–step A), where we create 10,000 errors per project. To conduct a fair evaluation, we ensure that STYLER learns repairs based on the same Checkstyle ruleset that is used for the real errors in the evaluation. Therefore, for each project from the real error dataset, we select as training seeds all error-free Java files from the last commit that modified the checkstyle.xml file used to collect the real errors. We take special care of consistency in the observed results: all three machine learning-based systems, STYLER, NATURALIZE, and CODEBUFF, are trained using the same Java files.

4) Testing tools: Finally, we run all the four tools to repair the 11,220 errors from the real error dataset.

VI. EVALUATION RESULTS AND DISCUSSION

We present and discuss the results for our five research questions in this section.

A. Accuracy of STYLER (RQ #1)

To measure the accuracy of STYLER and the accuracy of the other three systems on the 11,220 real errors, we categorize the repair attempts per status. Table III shows the results per tool and per status of the repair attempts: repaired/no error refers to errors that were successfully repaired, i.e. no error is raised after the repair attempt; repaired/new errors refers to errors that were fixed, but new errors were introduced in the source code; not repaired/same error refers to errors that were not repaired, i.e. the same error is still in the source code; not repaired/same+new refers to errors that were not repaired and new errors were introduced in the source code; and broken refers to cases containing files that cannot be parsed by javalang after the repair attempts.

STYLER repairs 38% of the errors while CS-IDEA repairs 63%, which is the greatest overall accuracy among the four considered tools. NATURALIZE and CODEBUFF repair fewer errors (13% and 15%, respectively). To check if there is a significant difference between STYLER and the other tools, we used McNemar's test and we considered α = 0.05: we found p-value=0.000 for all three tests. This means that STYLER and each of the other tools repair a different proportion of errors.

We note that STYLER and CS-IDEA are the most reliable tools in the sense of delivering to an end-user either a repaired source code or, in the worst case scenario, the code with the same error. This is not the case for NATURALIZE and CODEBUFF, which have higher rates of delivering source code that contains new errors or is broken. They were, however, designed for a different goal, and do not take into account the Checkstyle ruleset of the project like STYLER and CS-IDEA do. Yet, they are relevant for our experiment since they are the state-of-the-art of machine learning-based code formatters. Our results show the need for specialized, focused tools to repair Checkstyle errors.

RQ #1: To what extent does STYLER repair real-world Checkstyle errors, compared to other systems?
STYLER repairs 38% (4,231/11,220) of the Checkstyle errors we found in the wild, and it outperforms the two state-of-the-art machine learning systems, NATURALIZE and CODEBUFF. CS-IDEA is able to repair 63% of the errors; however, we note that CS-IDEA is heavily engineered, whereas STYLER's approach to repair formatting errors is fully automated and hence more appropriate for easily handling new and configurable rules.

B. Error Type Analysis (RQ #2)

To answer RQ #2, we investigate if STYLER is effective in fixing different Checkstyle error types (one error type is related to one Checkstyle rule). Figure 6 shows the repaired Checkstyle errors per error type and per tool in a heatmap. The colour scale is from dark to light colours, where the darkest

9 STYLER also targets the following rules that are not contained in our dataset: AnnotationLocation, AnnotationOnSameLine, EmptyForInitializerPad, SingleSpaceSeparator, and TypecastParenPad.
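The five repair-attempt statuses defined in Section VI-A can be expressed as a small decision function. The sketch below is only an illustration under stated assumptions: `classify_repair` is a hypothetical name, errors are identified by their Checkstyle rule name alone (the paper does not specify here how "same error" is matched), and an unparsable file is signalled by passing `None`.

```python
from typing import List, Optional


def classify_repair(original_rule: str, rules_after: Optional[List[str]]) -> str:
    """Classify one repair attempt.

    original_rule: rule name of the single error in the original file.
    rules_after: rule names of the errors raised on the repaired file,
                 or None if the repaired file cannot be parsed.
    """
    if rules_after is None:
        return "broken"
    # Assumption: an error of the same rule counts as the same error.
    same_error_remains = original_rule in rules_after
    new_errors = any(rule != original_rule for rule in rules_after)
    if not same_error_remains:
        return "repaired/new errors" if new_errors else "repaired/no error"
    return "not repaired/same+new" if new_errors else "not repaired/same error"
```

Matching by rule name is a simplification: a repair that removes the original error but introduces another error of the same rule elsewhere would be counted as not repaired in this sketch.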
TABLE II: Real error dataset stats per formatting rule^9.

Checkstyle rule (25)    Projects (70)    Errors (11,220)
CommentsIndentation     10 (14%)         32 (
Checkstyle rule (errors)       Styler    Checkstyle-IDEA  Naturalize  CodeBuff  All
CommentsIndentation (32)       9.4 %     40.6 %           25.0 %      0.0 %     40.6 %
EmptyForIteratorPad (10)       100.0 %   0.0 %            40.0 %      40.0 %    100.0 %
EmptyLineSeparator (2729)      2.3 %     91.9 %           20.4 %      1.0 %     94.5 %
FileTabCharacter (595)         9.4 %     93.1 %           6.4 %       31.8 %    98.8 %
GenericWhitespace (6)          100.0 %   100.0 %          16.7 %      33.3 %    100.0 %
Indentation (755)              84.2 %    92.2 %           3.8 %       74.8 %    94.4 %
LeftCurly (197)                95.9 %    92.9 %           35.5 %      34.5 %    95.9 %
LineLength (2774)              31.1 %    48.7 %           0.0 %       1.3 %     51.5 %
MethodParamPad (62)            51.6 %    80.6 %           11.3 %      12.9 %    87.1 %
NewlineAtEndOfFile (321)       61.4 %    0.0 %            0.0 %       0.0 %     61.4 %
NoLineWrap (11)                100.0 %   0.0 %            0.0 %       0.0 %     100.0 %
NoWhitespaceAfter (44)         18.2 %    22.7 %           2.3 %       15.9 %    22.7 %
NoWhitespaceBefore (141)       78.0 %    71.6 %           34.8 %      46.1 %    94.3 %
OneStatementPerLine (4)        25.0 %    25.0 %           0.0 %       0.0 %     25.0 %
OperatorWrap (231)             55.8 %    0.0 %            15.2 %      4.3 %     57.6 %
ParenPad (120)                 100.0 %   36.7 %           35.0 %      26.7 %    100.0 %
Regexp (374)                   2.9 %     2.9 %            8.6 %       11.0 %    14.2 %
RegexpMultiline (8)            0.0 %     0.0 %            0.0 %       0.0 %     0.0 %
RegexpSingleline (474)         10.8 %    37.8 %           2.1 %       5.3 %     38.0 %
RegexpSinglelineJava (203)     2.5 %     0.5 %            9.4 %       0.0 %     11.3 %
RightCurly (372)               54.3 %    76.3 %           3.5 %       26.6 %    78.2 %
SeparatorWrap (16)             37.5 %    6.2 %            25.0 %      0.0 %     37.5 %
TrailingComment (370)          88.1 %    0.0 %            31.4 %      0.0 %     88.9 %
WhitespaceAfter (563)          78.5 %    86.7 %           17.1 %      56.7 %    90.6 %
WhitespaceAround (807)         93.4 %    79.2 %           40.1 %      29.4 %    96.5 %

Fig. 6: Types of Checkstyle error repaired per tool (RQ #2).

error and the repaired source code. Among all repairs that pass all Checkstyle rules, the diff should be as small as possible so that it is the least disruptive for the developers. In the context of a pull request on GitHub, a smaller diff is usually considered easier to review and merge [12].

We calculate the size in lines of the diff from the errors that STYLER, CS-IDEA, NATURALIZE, and CODEBUFF repaired. Figure 7 shows the results: the x axis presents the size distribution of the diffs, and each boxplot represents one tool.

STYLER (in green) and NATURALIZE (in yellow) have the smallest medians of diff size, which are both equal to five changed lines. They also suffer from fewer bad cases (the right-hand part of the distribution). CS-IDEA (in pink) and CODEBUFF (in blue) produce larger diff sizes, and have medians equal to 7 and 55, respectively. In the worst cases, they produce the largest diffs: the 95th percentile passes 200 changed lines, compared to 7 lines for STYLER.

We performed the Wilcoxon rank sum test to verify if the distributions of the diff sizes obtained by STYLER and the other tools are systematically different from one another. We found p-value=0.000 when testing STYLER with CS-IDEA and CODEBUFF, and p-value=0.0000000039 when testing STYLER with NATURALIZE. Considering α = 0.05, we reject the null hypothesis, which means that the distribution of STYLER's diff sizes is significantly different from the other ones.

RQ #3: What is the size of the repairs generated by STYLER, compared to other systems?
STYLER has a median repair size of five changed lines, the same as NATURALIZE. Yet, NATURALIZE produces small formatting repairs with a less reliable predictability compared to STYLER. CS-IDEA and CODEBUFF clearly produce bigger formatting repairs. The ability to produce small diffs is an important property for code reviews and pull-request-based development, hence our results show that STYLER can be realistically used in a modern software development context.

D. Performance (RQ #4)

To investigate if STYLER can be used in practice, we measure the execution time spent when running STYLER on the real error dataset. Table IV shows the minimum, median, average, and maximum time spent on projects, split over the different steps from the STYLER workflow. For training data generation, STYLER took at least 16 minutes and up to six and a half hours. To tokenize the training data, it took up to 13 minutes, and for training the models, it took mostly about one hour. Therefore, the training of STYLER (data generation + tokenization + model training) took around two hours and a half on average. This can be considered just fine, since the
training is meant to happen only when the coding conventions change (i.e. the Checkstyle ruleset file), which means rarely (a given version of coding conventions usually lasts for months). After STYLER is trained for a given project, it takes on average two seconds to predict a repair, which is fast enough to be used in IDEs or in continuous integration environments.

Fig. 7: Size of the repairs per tool (boxplots for Styler, Naturalize, Checkstyle-IDEA, and CodeBuff; x axis: diff size, from 0 to 200). The two boxplot whiskers represent the 5th and the 95th percentiles (RQ #3).

TABLE IV: Statistics on the performance of STYLER (RQ #4).

          Training                                     Prediction
          Data generation   Tokenization   Models      Average time
Step^a    A                 B              C           E→I
Min       00:16:18          00:00:51       00:31:54    1.608 s/err
Med       00:45:10          00:09:09       00:59:14    2.215 s/err
Avg       01:15:38          00:08:18       00:58:38    2.277 s/err
Max       06:30:44          00:13:51       01:22:27    3.407 s/err

a The steps were executed in a computer containing a processor Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz and 125GiB system memory. For training the models, we used GPUs GeForce RTX 2080 Ti.

RQ #4: How fast is STYLER for learning and for predicting formatting repairs?
On average, STYLER needs about two hours and a half for training, and two seconds for predicting a repair. The training time is not an issue since it only happens when the Checkstyle ruleset file changes. The prediction time relates to usability: our results show that STYLER can be used in the IDE or in CI, in a practical setting.

E. Technical Analysis on STYLER (RQ #5)

At prediction time, STYLER uses two trained LSTM models, each one based on a different training data generation protocol: Styler_random and Styler_3grams. We investigate how the two protocols contribute to the final output of STYLER. We found that STYLER fixed 852 Checkstyle errors with the Styler_random-based model exclusively, while it fixed 1,008 errors with the Styler_3grams-based model exclusively; 2,374 errors were fixed with both models. This shows that the model based on Styler_3grams is more effective. Moreover, when selecting one repair to give as output (Figure 3–step I), STYLER selected the repair from the Styler_3grams-based model in most cases because it generates smaller diffs.

RQ #5: How do the two training data generation techniques of STYLER contribute to its accuracy?
For most errors, STYLER selects a repair predicted by the LSTM model based on the Styler_3grams protocol because it produces a smaller diff, which is desirable for developers. Yet, the model based on Styler_random exclusively contributes to the overall accuracy of STYLER with 20% of the fixes.

VII. DISCUSSION

A. Machine learning versus rule-based approaches

STYLER employs a machine-learning-based approach for repairing formatting convention violations. An alternative approach would be a rule-based one. For instance, there would be one transformation to be applied in the code per Checkstyle rule. As said, this approach requires the engineering of a transformation for every single linter rule, which is time-consuming. Beyond being costly, this might even be impractical for highly configurable linters such as Checkstyle: the rule-based repair system would need to have different transformations for the same linter rule due to the configurable properties. On the contrary, a machine learning approach does not require costly human engineering. It is able to infer transformations for a diverse set of linter rules. Our experiments have validated this property in the context of formatting errors raised by Checkstyle.

B. Threats to Validity

STYLER generates training data for repairing errors based on the Checkstyle configuration file contained in a given project. This means that STYLER assumes that all formatting rules contained in the Checkstyle configuration file are valid. In practice, however, developers might ignore the violations of certain rules. Our experiment does not take this scenario into account, thus we do not claim that 100% of the fixes produced by STYLER are necessarily relevant for developers.

The real error dataset contains Checkstyle errors mined from GitHub repositories. It is to be noted that it does not cover all existing Checkstyle formatting rules. It is worth mentioning that we are still collecting real errors, and those can potentially cover new rules. Moreover, the dataset might not be representative of the real distribution of the 25 rules in the real world. Consequently, future research is needed to strengthen the validity of our study.

When selecting real errors, we chose only files containing a single real Checkstyle error (see Section V-B). We performed this selection so that we could accurately check if the error was correctly repaired by the tools. For files containing more than one error, it is hard to check the correctness of repairs: once an error is repaired, the location of the other ones in the file would change. Therefore, our results are based on single-error files, and future investigations on multiple-error files are needed.

Finally, to compare the quality of the repairs produced by STYLER with the repairs produced by the other three tools, we measured the size in lines of the diff between the buggy and repaired program versions. However, the diff size is only one dimension for comparing the tools, which only
approximates the developer's perception on formatting repairs. User studies, such as proposing formatting repairs to developers, are interesting future experiments to further investigate the practical value of this research.

VIII. RELATED WORK

A. The use of static analysis tools

Static analysis tools have been the subject of investigation in recent research. [39] investigated their usage in 20 popular Java open source projects hosted on GitHub and using Travis CI to support CI activities. They first found out that the projects use seven static analysis tools ([8], [14], [28], [20], [4], [11], and [19]), Checkstyle being the most used one. About the integration of static analysis tools in CI pipelines, they found out that build breakages due to those tools are mainly related to adherence to coding standards, while breakages related to likely bugs or vulnerabilities occur less frequently. [39] discuss that some tools are sometimes configured to not break the build but just to produce warnings, possibly because of the high number of false positives.

[38] investigated the usage of static analysis tools from the perspective of the development context in which these tools are used. For that, they surveyed 42 developers and interviewed 11 industrial experts that integrate static analysis tools in their workflow. They found out that static analysis tools are used in three main development contexts, which are local environment, code review, and continuous integration. Moreover, they also found out that developers consider warning types differently depending on the context, e.g., when performing code reviews they mainly look at style conventions and code redundancies.

[22] focused on one specific static analysis tool: [33]. Through an online survey with 18 developers from different organizations, they found out that most respondents agree that the issues reported by static analysis tools are relevant for improving the design and implementation of software.

B. Learning for repairing compiler errors and behavioral bugs

Learning for repairing compiler errors. There are related works in the area of automatic repair of compiler errors. In this case, the compiler syntax rules are the equivalent of the formatting rules. There, recurrent neural networks and token abstraction have been used to fix syntactic errors [7]. In DeepFix, [17] use a language model for repairing syntactic compilation errors in C programs. Out of 6,971 erroneous C programs, DeepFix was able to completely repair 27% and partially repair 19% of the programs. Later, [1] proposed TRACER, which outperformed DeepFix, repairing 44% of the programs. [32] confirmed the efficiency of LSTM over n-grams and of token abstraction for single token compiling errors. These approaches do not target formatting errors, which is the target of STYLER.

Learning for repairing behavioral bugs. As for repairing compiler errors, there are also learning systems for repairing behavioral bugs, those that, for instance, break test cases. [37] investigated the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. They mined millions of buggy and patched program versions from the history of GitHub repositories, and abstracted them to train an Encoder-Decoder model. The model was able to fix hundreds of unique buggy methods in the wild. [10] proposed SequenceR, an end-to-end program repair approach focused on one-line fixes. In an experiment with Defects4J, SequenceR was shown to be able to learn to repair behavioral bugs by generating patches that pass all tests.

C. Linter-error repair and formatting

Linter-error repair. There are some tools to fix errors raised by specific linters. For instance, ESLint [13] is a linter for JavaScript, but it also includes automated solutions to repair errors raised by it. For Python, there exists the autopep8 tool [5], which formats Python code to conform to the PEP 8 Style Guide for Python Code [27]. For Java, there exists the CHECKSTYLE-IDEA [9] plugin for IntelliJ, which we compared to STYLER. CHECKSTYLE-IDEA is able to highlight the error and also to suggest fixes in some cases. However, it is very limited in repairing errors from several different rules as we have shown in RQ #2.

Code formatters. A way to enforce formatting conventions lies in code formatters. In Section V-C, we described NATURALIZE [3] and CODEBUFF [26]: NATURALIZE recommends fixes for coding conventions related to naming and formatting in Java programs, and CODEBUFF infers formatting rules for any language given a grammar. Similar to the idea behind CODEBUFF, [30] had previously experimented with different learning algorithms and feature set variations to learn the style of a given corpus so that it could be applied to arbitrary code.

Beyond those academic systems, there are code formatters such as google-java-format [15], which reformats source code according to the Google Java Style Guide [16], and as such fixes violations of the Google Style. However, these formatters are usually not configurable or require manual tweaking, which is a tedious process for developers. This is a problem because not all developers are ready to follow a unique convention style. STYLER, on the other hand, is generic and automatically captures the conventions used in a project to fix formatting violations.

IX. CONCLUSION

In this paper, we presented STYLER, which implements a novel approach to repair formatting errors raised by Checkstyle, the popular linter for Java programs. STYLER creates a corpus of Checkstyle errors, learns from it, and predicts fixes for new errors, using machine learning. Our experimental results on 11,220 real Checkstyle errors showed that STYLER repairs real errors from a more diverse set of Checkstyle rules than the systems CHECKSTYLE-IDEA, NATURALIZE, and CODEBUFF. Moreover, STYLER produces smaller repairs than the compared systems, and its prediction time is low so it can be used in IDEs or in Continuous Integration environments.

There are interesting areas for future work. First, improvements on the error injection protocols for creating training data can be done so as to improve the representativeness of seeded formatting errors. This might increase the performance of