National University of Ireland, Maynooth
MAYNOOTH, CO. KILDARE, IRELAND.

DEPARTMENT OF COMPUTER SCIENCE, TECHNICAL REPORT SERIES

Generation Strategies for Test-Suites of Grammar-Based Software

Mark Hennessy and James F. Power

NUIM-CS-TR-2005-02

http://www.cs.nuim.ie    Tel: +353 1 7083847    Fax: +353 1 7083848
Generation Strategies for Test-Suites of Grammar-Based Software

Mark Hennessy                        James F. Power*
Computer Science Dept.               Computer Science Dept.
National University of Ireland      National University of Ireland
Maynooth, Co. Kildare, Ireland       Maynooth, Co. Kildare, Ireland
markh@cs.nuim.ie                     jpower@cs.nuim.ie

* On sabbatical at Clemson University, South Carolina, USA.

ABSTRACT
The use of statement coverage has proved to be a useful metric when testing code with a test-suite. Similarly, the coverage of a grammar's rules is an effective metric when testing a parser. However, when testing a whole parser front-end it is not immediately obvious whether there is a correlation between rule coverage and underlying code coverage. We use a number of generation strategies to generate a series of test-suites. We apply these test-suites to keystone, a parser front-end for ISO C++, and offer empirical evidence to suggest which generation strategy offers the best coverage whilst using the fewest test-cases.

Keywords
Software Testing, Parser Testing, Rule Coverage, Metrics, Purdom's Algorithm.

1. INTRODUCTION
The testing of a program or software system is an essential and integral part of the software process. Testing assures us that a program meets its specification or that a system behaves in the intended way. The popularity of grammar-based tools [6] has ensured that testing these systems for correct functioning and robustness is crucial.

There are a number of methods available when testing a grammar-based system [10]. Specification-based testing involves deriving inputs and expected outcomes for each test-case directly from the specification of the system. A drawback of this method is that some parts of the code may remain unexercised, thus lowering confidence in the robustness of the software. With implementation-based testing, the input data for a test-case is generated from the implementation, but the expected outcomes cannot be determined from the implementation: implementation-based test-suites assume the presence of an oracle against which to check the result. A testing strategy that exploits both methods is preferable to ensure a reasonable confidence in the correct functioning of the system, but sections of code may still go untested.

In testing a parser, we would like to ensure that all valid sentences are accepted while incorrect sentences are rejected. This is to ensure that the structure of the underlying grammar is adequately tested. As there is no regard for the "meaning" of the sentences, this is known as syntactic coverage. However, when testing a parser front-end we must ensure that the sentences passed as input are semantically correct, so that the underlying code is exercised. We refer to this as semantic coverage. Furthermore, we would like a test-suite to utilise as many of the grammar rules as possible, because a grammar rule represents (through its associated semantic action) the gateway to the underlying code of the parser front-end. In this paper we test the coverage of a parser front-end, keystone [3], in both the syntactic and semantic dimensions, using not only specification-based and implementation-based test-suites but also a test-suite derived automatically using Purdom's algorithm [9]. keystone aids in the static analysis of C++ programs and consists of a program processor and a symbol table. The program processor is responsible for the scanning and parsing, and is also responsible for initiating and directing symbol table construction and name lookup. The symbol table allows name-lookup in accordance with Clause 3 of the ISO C++ standard [1].

In Section 2, we outline the test-suite generation strategies and their operation. The methodologies used to determine the coverage achieved are outlined in Section 3. Section 4 presents the coverages achieved by each of the test-suites for keystone in both the syntactic and semantic domains. Furthermore, we show how test-suite generation compares to reduced test-suites with regard to coverage. In Section 5, we conclude the paper.

2. GENERATION STRATEGIES
To conduct our study, a number of different types of test-suite were employed. Two existing test-suites that were used during the development of the current version of keystone were chosen. These test-suites were augmented with two other types of test-suite to ensure that the potential maximum code coverage was achieved. The first of these types was based on the idea of test-suite reduction [5] and involves taking a large, existing test-suite and reducing it down to a minimum that provides the same rule coverage. The second type involves generating test-cases directly from the grammar specification; to this end, we chose Purdom's seminal algorithm for the generation of sentences from a context-free grammar [9]. A summary of the six test-suites used can be seen in Table 1.
  Test-Suite     Summary
  g++            C++ test-suite from the g++.dg directory of the gcc distribution.
  ISO            C++ test-cases derived directly from the ISO standard.
  Min. g++       Minimum number of test-cases from the g++ test-suite.
  Min. ISO       Minimum number of test-cases from the ISO test-suite.
  Purdom         C++ test-suite generated using Purdom's algorithm.
  CDRC Purdom    C++ test-suite generated using Context-Dependent Rule Coverage.

Table 1: Summary of the six test-suites used.

2.1 Existing Test-suites
During the testing of keystone, two large existing test-suites for C++ were used. The first of these was the g++.dg test-suite used to test the C++ compiler that forms part of the GNU Compiler Collection, gcc. The second was a specification-based suite derived from the ISO C++ standard [1], which has been used to measure conformance with the ISO standard [4].

2.2 Test-suite Reduction
The notion behind test-suite reduction [5] is a relatively simple one. Given an existing test-suite, we wish to reduce it to the smallest core of test-cases that still provides the same amount of rule coverage.

  for each test-case tc in test-suite ts do
    add tc's coverage vector to array a[][]
  end for
  minsuite = {}
  addColumns(a)
  for each column that sums to 1 do
    minsuite = minsuite ∪ {essential tc}
  end for
  while not all rules covered do
    addRows(a)
    addColumns(a)
    minsuite = minsuite ∪ {largest-covering tc}
  end while

Figure 1: Test-suite reduction algorithm.

The algorithm shown in Figure 1 operates as follows:

1. For each test-case in the test-suite, a vector containing an entry for each rule is output. Within the vector, each rule is marked as covered or not with a one or a zero.

2. The vectors are placed together in a 2D array. The rows are indexed by test-case and the columns are indexed by rule number. The columns are then summed.

3. For any column that sums to one, i.e. only one test-case covers the rule, that test-case is deemed essential and added to the minimal test-suite. When a test-case is added to the minimum suite, all of the rules covered by that test-case are set to zero. This process is repeated until all the essential test-cases are added to the minimum suite.

4. The rows are then summed. The test-case that contributes the most coverage is now identified, i.e. the row with the largest sum. This is added to the minimum set and its coverages are set to zero. This step is repeated until all columns sum to zero, i.e. all coverages have been accounted for in the minimum suite.

It is worth noting that once all the essential test-cases have been removed, the problem of choosing the minimum test-set that covers the remaining rules is equivalent to the minimum cardinality hitting set, which is an intractable problem [2]. Hence the process will always be heuristic; in our case we choose to always use the test-case that contributes the most coverage, even though it can be shown that this will not guarantee the smallest test-suite.
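As an illustration of Figure 1's strategy, the following is a minimal Python sketch of the reduction; the names (such as reduce_suite) are our own, it assumes each test-case is represented simply by the set of rule numbers it covers, and the implementation actually used in this study was written in Java (see Section 3):

    def reduce_suite(coverage):
        """Greedy test-suite reduction in the style of Figure 1.

        coverage: dict mapping a test-case name to the set of grammar
        rule numbers that the test-case exercises.
        Returns a subset of test-case names with the same rule coverage.
        """
        all_rules = set().union(*coverage.values())
        min_suite, covered = set(), set()

        # A rule exercised by exactly one test-case (a column summing
        # to 1 in Figure 1) makes that test-case essential.
        for rule in all_rules:
            holders = [tc for tc, rules in coverage.items() if rule in rules]
            if len(holders) == 1:
                min_suite.add(holders[0])
        for tc in min_suite:
            covered |= coverage[tc]

        # Greedily add the largest-covering remaining test-case until
        # every rule covered by the original suite is accounted for.
        while covered != all_rules:
            best = max(coverage, key=lambda tc: len(coverage[tc] - covered))
            min_suite.add(best)
            covered |= coverage[best]
        return min_suite

Like Figure 1 itself, this heuristic does not guarantee a minimum-cardinality suite, for the hitting-set reason noted above.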
2.3 Purdom's Algorithm
Purdom's algorithm [9] and its later interpretation [8] address the issue of automatically generating test-cases from a context-free grammar. The goal of the algorithm is to generate a series of short sentences such that every grammar rule is used at least once. The algorithm proceeds in two distinct phases.

The first phase calculates two tables for each non-terminal. The first of these, the SHORT table, records the rule to use to derive the shortest sentence starting from the respective non-terminal. The second table, called PREV, contains the rule to use to introduce a non-terminal n into the shortest derivation. The second phase of the algorithm utilises these tables to generate the sentences. A table known as ONCE keeps track of the rules covered. The algorithm terminates when all the grammar rules have been exercised.
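To make the first phase concrete, the sketch below computes, for each non-terminal, the rule yielding the shortest derivable sentence (the information held in the SHORT table) by iterating to a fixed point over the grammar of Table 2 below. The encoding and function names are our own illustration; the PREV table and the sentence-generation phase are omitted:

    INF = float("inf")

    # The grammar of Table 2: non-terminal -> list of right-hand sides.
    # Lower-case symbols are terminals; an empty RHS encodes epsilon.
    GRAMMAR = {
        "S": [["A", "B", "C"]],
        "A": [["a", "B"]],
        "B": [["B", "b"], ["C"]],
        "C": [["c", "C"], []],
    }

    def short_table(grammar):
        """For each non-terminal, the rule giving the shortest sentence."""
        short = {nt: (None, INF) for nt in grammar}
        changed = True
        while changed:
            changed = False
            for nt, rhss in grammar.items():
                for rhs in rhss:
                    # A terminal costs 1; a non-terminal costs its
                    # current shortest-sentence estimate.
                    cost = sum(short[s][1] if s in grammar else 1
                               for s in rhs)
                    if cost < short[nt][1]:
                        short[nt] = (rhs, cost)
                        changed = True
        return short

    print(short_table(GRAMMAR))
    # C -> epsilon and B -> C give length 0; A -> aB and S -> ABC give 1.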
However, rule coverage exercises a grammar's structure only in a weak sense. For large and complex grammars it is desirable that valid combinations of productions are utilised, so as to generate test-cases that reflect more accurately the rich syntactic structure of the grammar. A generalisation of rule coverage has been proposed [7], such that the context in which a rule is covered is taken into account. This is known as Context-Dependent Rule Coverage (CDRC); in essence, it ensures that every possible valid combination of rule pairs is exercised.

  1  S → A B C
  2  A → a B
  3  B → B b
  4  B → C
  5  C → c C
  6  C → ε

Table 2: Simple grammar.

Using the grammar shown in Table 2 as an example, CDRC Purdom works as follows. Every non-terminal that appears on the right-hand side of a grammar rule is noted; e.g. non-terminal B occurs on the right-hand side of rule 2, A → a B. This is known as a direct occurrence of B in A. So, for a test-suite to exhibit CDRC for the simple grammar above, all rules with B on the left-hand side must be exercised for every direct occurrence of B in the grammar. A sample test-suite achieving CDRC for the above grammar would be {abbcb, acc, ac}, shown as derivation trees in Figure 2.

Figure 2 (derivation trees, not reproduced): Sample test-suite achieving CDRC for the grammar in Table 2. Every rule for each direct occurrence of a non-terminal on the right-hand side of a grammar rule is accounted for.

In our extension of Purdom's original algorithm, we added another table called OCCS, which keeps track of all the direct occurrences within a grammar. The table is indexed by non-terminal, with all the direct occurrences for a given non-terminal making up the entries. This table, along with the existing ONCE table, is consulted when choosing the next rule to be used. When all the entries in the OCCS table have been covered, the generation of test-cases ceases. This modification to Purdom's original algorithm, along with the original Purdom algorithm, added two more test-suites, bringing the total number of test-suites to six.
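The bookkeeping behind the OCCS table can be pictured with the following sketch (again with our own names, reusing the GRAMMAR encoding from the previous sketch), which enumerates every (direct occurrence, rule) combination that a CDRC test-suite must exercise:

    def cdrc_obligations(grammar):
        """All (direct occurrence, expansion) pairs CDRC must cover.

        A direct occurrence is identified by (lhs, rhs index, position);
        it must be combined with every rule for the occurring non-terminal.
        """
        pairs = []
        for lhs, rhss in grammar.items():
            for i, rhs in enumerate(rhss):
                for pos, sym in enumerate(rhs):
                    if sym in grammar:  # sym is a non-terminal
                        for expansion in grammar[sym]:
                            pairs.append(((lhs, i, pos, sym),
                                          tuple(expansion)))
        return pairs

    # For the Table 2 grammar this yields 13 obligations: e.g. the
    # occurrence of B in rule 2 (A -> aB) paired with both B -> Bb
    # and B -> C. Generation ceases once every pair has been used.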
3. METHODOLOGY
The case study was carried out using six test-suites for the ISO C++ language standard. A number of tools and programs were used to generate the test-cases and to reduce the test-suites. All tests were performed on keystone version 0.30. Two large, existing test-suites were first picked. The first of these was the g++ test-suite from gcc version 3.4. This implementation-based suite consists of 1183 individual test-cases, partitioned into sections that test the parser front-end and the code generation of the back-end. The second suite, ISO, was used in [4] to measure conformance with the ISO C++ standard and consists of 440 test-cases, sectioned according to the clauses found in the ISO C++ standard [1].

To achieve the test-suite reduction, a number of steps were taken. The first was to modify the parser for keystone to output a single file for each test-case containing a rule number, one per line, for each rule used during the parse of that test-case. The test-suite reduction algorithm was implemented in the Java programming language using 217 lines of code. The algorithm was applied to both existing test-suites to produce two new test-suites, which we call Min. g++ and Min. ISO.

The syntactic coverage that each test-suite provided was determined by the following method: each file output by the parser was concatenated into a single monolithic file containing all the rule coverages for every test-case in the suite. This was then sorted using the UNIX tool sort. Finally, the UNIX tool uniq was used to pare down the sorted file, such that only one instance of every covered rule remained in the file. The number of lines in the file output by uniq is the number of rules covered by a test-suite.
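The same count can be reproduced in a few lines of Python; the directory layout below (coverage/*.rules, one rule number per line) is a hypothetical stand-in for the per-test-case files just described:

    import glob

    def rules_covered(pattern="coverage/*.rules"):
        """Count the distinct rules exercised by a suite; equivalent to
        cat coverage/*.rules | sort | uniq | wc -l."""
        covered = set()
        for path in glob.glob(pattern):
            with open(path) as f:
                covered.update(line.strip() for line in f if line.strip())
        return len(covered)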
Purdom's algorithm was implemented in 673 lines of code in the Python scripting language. The extension to context-dependent rule coverage added an extra 246 lines of code. The number of test-cases output by Purdom's original algorithm was 53, while CDRC Purdom output 71 test-cases.

Finally, keystone itself was profiled with the tool gcov, a profiling tool that is a member of gcc. This tool measures the statement coverage for a given file when a test-case is executed. This is illustrated in Figure 3.

Figure 3 (diagram not reproduced): The steps involved in measuring the code coverage with gcov. keystone is compiled by gcc with extra flags, producing a profiled keystone executable. When one of our test-suites is run by this executable, a statistics file corresponding to each source file is output; gcov can then determine how many lines of code are executed in the corresponding source file.

4. RESULTS
In this section we present the results of our case study to determine which generation strategy is the most effective at achieving maximum coverage in the syntactic and semantic dimensions. The results shown are partitioned according to their domain. It is important to note the distinction between what we define as the syntactic domain and the semantic domain. Coverage of the grammar rules alone is classed as coverage of the syntactic domain. If all of the grammar rules are exercised by a test-suite, then that test-suite is said to achieve full coverage in the syntactic domain. The coverage of the semantic domain is determined by how much of the underlying parser front-end code is executed when a test-suite is run. We expect to see a close correlation between coverage in the syntactic domain and coverage in the semantic domain, due to the fact that the only entry point to the underlying code is through the associated semantic action for a grammar rule. The results are presented in Table 3 and are discussed in the rest of this section.

4.1 Keystone Structure
It is worth pointing out briefly the structure of keystone and how the results for the semantic domain are interpreted. The underlying code is separated into four distinct sections, as shown in Figure 4. Lexer is the code coverage associated with the file output by the tool Flex. Parser is a directory containing files generated by the tool BTYacc and associated files to deal with semantic actions. Within the symbol table of keystone are modules to determine scope within a program and to aid in type checking and allocation.

Figure 4 (diagram not reproduced): keystone structure: a Lexer, a Parser and a Symbol Table, the latter comprising the Scopes and Types modules.

Both of these modules are heavily dependent on the semantics of the test-case in question, and coverage of both can only be achieved by test-cases that are semantically correct. It is also worth noting that the coverage results for these modules are based upon code that is used only for the normal operation of keystone. Thus user aids for debugging, such as pretty-print methods, are excluded from the measurement figures. The statement coverage figures for the Parser files and the modules Scope and Type are presented in Figures 5, 6 and 7 respectively. From these results we can see that the coverage offered by the reduced test-suites is nearly identical to that of their larger counterparts. The poor results offered by the Purdom approaches are due to the fact that the test-cases generated lack semantic correctness, and thus they never get to execute the underlying code for the symbol table.

4.2 Existing Test-Suites
The existing test-suites, g++ and ISO, consisted of 1183 and 440 test-cases respectively and are summarised in Table 4. These were the benchmarks against which the other test-suites were measured. The suite g++ is an implementation-based suite and achieved coverage in the syntactic domain of 491 rules out of 536 total rules. This test-suite is designed to fully test the C++ compiler from gcc. The presence of GNU C++ extensions in some of the test-cases means that full coverage is not achieved, due to the fact that keystone was developed with only the C++ standard in mind. This suite exhibits the best coverage across the semantic dimension on average, due to larger coverages of the Lexer and Parser code.

  Test-Suite   No. Test-Cases   Rules Covered
  g++          1183             491 / 536
  ISO          440              430 / 536

Table 4: Rule coverage for existing test-suites.

Suite ISO consists of 440 test-cases derived directly from the clauses of the ISO C++ standard [1]. This is a specification-based suite and covers 430 of 536 rules. It is interesting to note that, despite the concerted effort to ensure every rule has a test-case, it falls well short of the implementation-based suite in the syntactic dimension; nevertheless, the test-cases are well constructed, such that the coverage for module Scope is higher than g++ and identical for Type.

The fact that the coverages for the lexer are slightly lower than the rule coverages is due to the fact that all the test-cases are lexically positive. Thus there are no deliberately mis-spelled tokens, and hence error-checking code within the lexer is never covered.

4.3 Reduced Test-Suites
The algorithm outlined in Section 2.2 above was applied to both of the existing test-suites to provide two new suites, called Min. g++ and Min. ISO. The coverages achieved by these new suites are identical in the syntactic domain to those of their larger versions, and similarly they give exactly the same coverage in the semantic domain for Lexer, Scope and Type, with Parser coverage being lower.
Figure 5 (charts not reproduced): Code coverage across the six test-suites for the Parser module; panels (a)-(f): g++, Min. g++, ISO, Min. ISO, Purdom, CDRC Purdom.

Figure 6 (charts not reproduced): Code coverage across the six test-suites for the Scope module; same six panels.

Figure 7 (charts not reproduced): Code coverage across the six test-suites for the Type module; same six panels.
                 No.          Syntactic Coverages     Semantic Coverages
  Test Suite     Test Cases   Rules Covered (%)       Lexer (%)   Parser (%)   Scope (Avg. %)   Type (Avg. %)
  g++            1183         91.6                    77.6        86.1         82.4             84.5
  Min. g++       48           91.6                    77.6        82.9         82.4             84.5
  ISO            440          80.2                    68.4        73.9         84.0             84.5
  Min. ISO       49           80.2                    68.4        72.5         84.0             84.5
  Purdom         53           100                     72.5        23.2         34.8             37.9
  CDRC Purdom    71           100                     77.6        25.0         26.2             31.0

Table 3: Summarised coverages in both dimensions for the six test-suites.

However, an interesting finding of this study is the size of the reduced test-suites. As stated, they give almost the same coverages across the board as their larger counterparts, yet the size of the new test-suites is dramatically smaller. We find that Min. g++ has 1135 fewer test-cases, a reduction of 96%; Min. ISO has 391 fewer test-cases, a saving of 89%. The sizes of the reductions are summarised in Table 5.

  Suite   Min. No. Test-Cases   No. Original Test-Cases   % Reduction
  g++     48                    1183                      96.0
  ISO     49                    440                       88.9

Table 5: Percentage reduction achieved by the test-suite reduction algorithm.

4.4 Generated Test-Suites
By applying the C++ grammar used by keystone to the variants of Purdom's algorithm discussed in Section 2.3, two new test-suites are created. These are referred to as Purdom and CDRC Purdom. As Purdom's original algorithm only gives consideration to producing sentences that are grammatically correct, i.e. derivable through repeated application of the grammar rules, the resulting test-cases must be touched up by hand to ensure they can be parsed by keystone. An example of this is the generated test-case:

  USING NAMESPACE IDENTIFIER SEMI

which would be translated and modified as follows:

  namespace X{};
  using namespace X;

The declaration of namespace X is essential to the successful parse of this test-case.
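Much of this translation step can be mechanised with a token-to-lexeme map. The sketch below is our own illustration (the mapping shown is hypothetical), and, as the example demonstrates, supporting declarations such as that of namespace X must still be supplied by hand:

    # Hypothetical map from grammar token names to concrete C++ lexemes.
    LEXEMES = {
        "USING": "using",
        "NAMESPACE": "namespace",
        "IDENTIFIER": "X",
        "SEMI": ";",
    }

    def to_source(token_sentence):
        """Render a Purdom-generated token sentence as C++ text."""
        return " ".join(LEXEMES.get(tok, tok)
                        for tok in token_sentence.split())

    print(to_source("USING NAMESPACE IDENTIFIER SEMI"))
    # -> using namespace X ;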
The number of test-cases output by the test-suite Purdom is 53, and its coverage of the syntactic dimension is 100% (due to the nature of the algorithm). CDRC Purdom outputs 71 test-cases that also fully cover the syntactic dimension.

When the results of the semantic coverage are analysed, the results are poor in comparison with the previous approaches. The reasons are two-fold:

1. The ISO C++ standard [1] defines a grammar that actually accepts a super-set of C++. Hence any approach that automatically generates test-cases from the grammar alone will struggle to produce semantically correct test-cases.

2. The test-cases produced by Purdom's algorithm are for the most part short sentences; however, for the C++ grammar, in tandem with the small test-cases, a single large sentence is produced that attempts to cover as many rules as possible. Due to the complexity of this file, it is impossible to touch it up by hand, and hence keystone cannot parse this test-case.

The fact that this large test-case cannot be parsed accounts for the low coverage figures that can be observed in the modules Parser, Scope and Type. Module Lexer has the same coverage because keystone maintains a token buffer during a parse, so the lexical code maintains a coverage similar to the other test-suites.

We can see from the dramatic difference in underlying code coverage of keystone that test-cases generated from syntactic considerations alone are not sufficient to fully test a parser front-end.

5. CONCLUSIONS
In this paper we have shown the coverages achieved in both the syntactic and semantic dimensions by a number of test-suites for ISO C++. The main findings of our work are:

1. The test-cases produced by Purdom's algorithm give full rule coverage in the syntactic domain. Furthermore, the coverage is achieved by using a series of small test-cases. However, the test-cases produced are not semantically correct and thus fail to achieve noteworthy coverage in the semantic domain. It is also worth noting that the test-cases produced by CDRC Purdom offer no extra advantage in terms of the size of the test-suite or coverage of the semantic dimension.

2. Test-suite reduction provides an excellent alternative to the generation of test-cases. Reducing a large, existing test-suite is a simple process. Furthermore, the number of test-cases remaining is comparable to the number of test-cases generated by the Purdom approach, but with the added advantage of being semantically correct.
3. We have established a correlation between rule coverage in the syntactic domain and code coverage in the semantic domain. Generated test-cases that lacked semantic correctness gave poor coverage of the underlying code in comparison to the other test-suites. As well as this, reduced test-suites, which have identical coverage in the syntactic domain to their larger counterparts, exhibit exactly the same code coverage in the semantic dimension as the larger suites.

Our ongoing work includes extending the implementation-based and specification-based suites to achieve full coverage in the syntactic domain. We then hope to see an exact relationship between full coverage in the syntactic dimension and full coverage in the semantic dimension.

6. REFERENCES
[1] ISO/IEC JTC 1. International Standard: Programming Languages - C++. Number 14882:1998(E) in ASC X3. American National Standards Institute, first edition, September 1998.
[2] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, 1979.
[3] T.H. Gibbs, B.A. Malloy, and J.F. Power. Decorating tokens to facilitate recognition of ambiguous language constructs. Software - Practice and Experience, 33(1):19-39, January 2003.
[4] T.H. Gibbs, B.A. Malloy, and J.F. Power. Progression toward conformance of C++ language compilers. Dr. Dobbs Journal, 28(11):54-60, September 2003.
[5] J.A. Jones and M.J. Harrold. Test-suite reduction and prioritization for modified condition/decision coverage. IEEE Transactions on Software Engineering, 29(3):195-210, 2003.
[6] P. Klint, R. Lämmel, and C. Verhoef. Towards an engineering discipline for grammarware. Draft, submitted for journal publication; online since July 2003, 47 pages, February 2005.
[7] R. Lämmel. Grammar testing. In Proc. of Fundamental Approaches to Software Engineering (FASE) 2001, volume 2029 of LNCS, pages 201-216. Springer-Verlag, 2001.
[8] B.A. Malloy and J.F. Power. An interpretation of Purdom's algorithm for automatic generation of test cases. In 1st Annual International Conference on Computer and Information Science, Orlando, FL, 2001.
[9] P. Purdom. A sentence generator for testing parsers. BIT, 12(3):366-375, 1972.
[10] M. Roper. Software Testing. McGraw-Hill, 1994.