SENSA: Sensitivity Analysis for Quantitative Change-impact Prediction

Haipeng Cai, Siyuan Jiang, Ying-jie Zhang†, Yiji Zhang, and Raul Santelices
University of Notre Dame, Indiana, USA
†Tsinghua University, Beijing, China
email: {hcai|sjiang1|yzhang20|rsanteli}@nd.edu, †zhangyj1991@gmail.com

ABSTRACT
Sensitivity analysis is used in many fields to determine how different parts of a system respond to variations in stimuli. Software-development tasks, such as change-impact analysis, can also benefit from sensitivity analysis. In this paper, we present SENSA, a novel dynamic-analysis technique and tool that uses sensitivity analysis and execution differencing for estimating the influence of program statements on the rest of the program. SENSA not only identifies the statements that can be affected by another statement, but also quantifies those effects. Quantifying effects in programs can greatly increase the effectiveness of tasks such as change-impact analysis by helping developers prioritize and focus their inspection of those effects. Our studies on four Java subjects indicate that SENSA can predict the impact of changes much more accurately than the state of the art: forward slicing. The SENSA prototype tool is freely available to the community for download.

1. INTRODUCTION
Modern software is increasingly complex and changes constantly. Therefore, it is crucial to provide automated and effective support to analyze the interactions among its components and, in particular, the effects of changes. Developers must understand the consequences and risks of modifying each part of a software system even before they can properly design and test their changes. Unfortunately, existing techniques for analyzing the effects of code, such as those for change-impact analysis [7], are quite imprecise, which hinders their applicability and limits their adoption in practice.

Many change-impact analyses operate at coarse levels such as methods and classes (e.g., [20, 26]). Although these techniques provide a first approximation of the effects of changes, they do not distinguish which particular statements play a role in those effects and, thus, can classify as impacted many methods and classes that are not really impacted (false positives). Moreover, these techniques can miss code-level dependencies not captured by higher-level structural relationships [36] (false negatives).

At the code level (statements), the forward version of program slicing [16, 38] reports all statements that might be impacted by a change in a statement for any execution (via static forward slicing) or for a set of executions (via dynamic forward slicing). Slicing, however, is also imprecise. Static slicing usually reports many potentially-affected statements that are not really affected [6, 23], whereas dynamic slicing [2, 5, 18] reduces the size of the results but still produces false positives [21, 30] in addition to false negatives.

To increase the precision of slicing by reducing the size of the resulting slices (the sets of affected statements), researchers have tried combining static slices with execution data [13, 15, 19, 23] and pruning slices based on some criteria [1, 8, 33, 39]. However, those methods still suffer from false positives and false negatives. Moreover, the number of statements reported as "affected" can still be too large. Optimizing the underlying algorithms can reduce this imprecision [19, 22, 24] but at great costs for small payoffs.

In this paper, we take a different approach to measuring the influence of statements in programs. Instead of trying to further reduce the number of affected statements, our approach distinguishes statements by likelihood of impact so that users can focus on the most likely impacts first. Our new technique and tool, SENSA, uses sensitivity analysis [28] and execution differencing [30, 34, 40] to estimate those likelihoods. Sensitivity analysis is used in many fields to measure relationships among different parts of a system. In software engineering, sensitivity analysis has been used to analyze requirements and components [14, 27] and, in a restricted way, for mutation analysis [9, 17, 37]. Meanwhile, execution differencing has been used for program integration and debugging. However, neither approach, alone or combined, has been used for forward slicing and change-impact analysis, as our new technique does.

SENSA quantifies the effect that the behavior of a statement, or potential changes in it, may have on the rest of the program. SENSA inputs a program P, a test suite T, and a statement c. For each test case t in T, SENSA repeatedly executes t, each time changing the value computed by c to a different value and finding the differences in those executions. For each execution, the differences show which statements change their behavior (i.e., state or occurrences) when c is modified. Using this information for all test cases in T, SENSA computes the sensitivity of each statement s to changes in c as the frequency with which s behaves differently. SENSA reports this frequency as the estimated likelihood that c affects (or impacts) s in future runs of the program. The greater the frequency for s is, the more likely it is that s will be impacted by the behavior of c.
To evaluate the effectiveness of SENSA for estimating the influence of statements, we empirically compared the ability of SENSA and of static and dynamic forward slicing for predicting the impacts of changes in those statements. For each of a number of potential change locations in four Java subjects, we ranked all statements in the subject by impact likelihood according to SENSA and by dependence distance as originally proposed by Weiser for slicing [38]. Then, we computed the average effort a developer would spend inspecting each ranking to find all statements impacted when applying a real change to each such location and executing the corresponding test suite. Our results show that, for these subjects, SENSA outperforms both forms of slicing at predicting change impacts, reducing slice-inspection efforts to small fractions.

We also performed three case studies manually on these subjects to investigate how well SENSA and program slicing highlight the cause-effect chains that explain how bugs actually propagate to failing points and, thus, how bug fixes can be designed. To achieve this goal, first, we manually identified the statements that are impacted by each bug and also propagate the erroneous state to a failing point. Then, we computed how well SENSA and slicing isolate those event chains. Although we cannot make general conclusions, we found again that SENSA was better than slicing for isolating the specific ways in which buggy statements make a program fail.

The main benefit of this work is the greater usefulness of quantified influences, as computed by SENSA, over simple static and dynamic program slices. With this kind of information, developers can use slices more effectively, especially large slices, by focusing on the statements most likely to be affected in practice. Moreover, quantified impacts can be used not only for change-related tasks but also, potentially, to improve other applications, such as testability analysis [37], execution hijacking [35], and mutation testing [9].

The main contributions of this paper are:
• The concept of impact quantification, which, unlike slicing, distinguishes statements by estimated likelihood of impact.
• A new technique and tool, SENSA, that estimates these likelihoods and is available to the public.
• Studies that measure and illustrate the effectiveness of SENSA for predicting the effects and impacts of changes.

2. BACKGROUND
This section presents core concepts necessary for understanding the rest of the paper and illustrates these concepts using the example program of Figure 1. Program prog in this figure takes an integer n and a floating-point number s as inputs, creates a local variable g, initializes g to the value of n, manipulates the value of s based on the value of g, and returns the value of s.

    float prog(int n, float s) {
        int g;
    1:  g = n;
    2:  if (g >= 1 && g <= 6) {
    3:      s = s + (7-g)*5;
    4:      if (g == 6)
    5:          s = s * 1.1;
        }
        else {
    6:      s = 0;
    7:      print g, " is invalid";
        }
    8:  return s;
    }

    Figure 1: Example program used throughout the paper.

2.1 Program Dependencies
Control and data dependences are the building blocks of program slicing [16, 38]. A statement s1 is control dependent [11] on a statement s2 if a branching decision taken at s2 determines whether s1 is necessarily executed. In Figure 1, an example is statement 3, which is control dependent on statement 2 because the decision taken at 2 determines whether statement 3 executes or not. This dependence is intra-procedural because both statements are in the same procedure. Control dependences can also be inter-procedural (across procedures) [32].

A statement s1 is data dependent [3] on a statement s2 if a variable v defined (written to) at s2 is used (read) at s1 and there is a definition-clear path in the program for v (i.e., a path that does not re-define v) from s2 to s1. For example, in Figure 1, statement 8 is data dependent on statement 3 because 3 defines s, 8 uses s, and there is a path (3,4,8) that does not re-define s after 3. Data dependencies can also be classified as intra-procedural (all definition-use paths are in the same procedure) or inter-procedural (some definition-use path crosses procedure boundaries). The formal parameters of prog, however, are inputs—they are not data dependent on any other statement.

2.2 Program Slicing
Program slicing [16, 38], also called static slicing, determines which statements of the program may affect or be affected by another statement. A static forward slice (or, simply, forward slice) from statement s is the set containing s and all statements transitively affected by s along control and data dependences. (To avoid invalid inter-procedural paths, the solution by Horwitz, Reps, and Binkley [16] can be used.) For example, the forward slice from statement 3 in Figure 1 is the set {3,5,8}. We include statement 3 in the slice as it affects itself. Statements 5 and 8, which use s, are in the slice because they are data dependent on the definition of s at 3. Another example is the forward slice from statement 1 in prog, which is {1,2,3,4,5,6,7,8}. Statement 2 uses g, so it is data dependent on statement 1. Statements 3, 4, 5, 6, and 7 are control dependent on statement 2, so they are also in the forward slice. Finally, statement 8 depends on g at 1 transitively because it depends on s at 3, 5 and 6, all of which have definition-clear paths to 8.

The size of a static slice can vary according to the precision of the analysis. For example, the points-to analysis can affect the size of a static slice. If we use a coarse points-to analysis in which a pointer can be any memory address, a forward slice from a statement s that defines a pointer p would include any statement that uses or dereferences any pointer (which may or may not point to the same address as p) if the statement is reachable from s.
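To make the notions of a forward slice and of the dependence-distance ordering used later in the paper more concrete, the following minimal Java sketch computes a forward slice as graph reachability and records breadth-first (BFS) distances over a dependence graph. The class, its names, and the graph representation are ours, introduced only for illustration; in practice, the dependence edges would come from a control- and data-dependence analysis, not be hand-coded.

    import java.util.*;

    /** Illustrative only: forward slicing as graph reachability, with BFS
     *  distances that induce a dependence-distance ranking. */
    public class ForwardSlicer {
        // deps.get(s) = statements directly control- or data-dependent on s
        // (these edges would be produced by a dependence analysis).
        private final Map<Integer, Set<Integer>> deps;

        public ForwardSlicer(Map<Integer, Set<Integer>> deps) { this.deps = deps; }

        /** Returns each statement reachable from 'start' (the forward slice),
         *  mapped to its BFS dependence distance; 'start' itself has distance 0. */
        public Map<Integer, Integer> sliceWithDistances(int start) {
            Map<Integer, Integer> dist = new LinkedHashMap<>();
            Deque<Integer> queue = new ArrayDeque<>();
            dist.put(start, 0);
            queue.add(start);
            while (!queue.isEmpty()) {
                int s = queue.remove();
                for (int t : deps.getOrDefault(s, Set.of())) {
                    if (!dist.containsKey(t)) {      // first visit gives the shortest distance
                        dist.put(t, dist.get(s) + 1);
                        queue.add(t);
                    }
                }
            }
            return dist;  // statements with equal distance form one tier of the ranking
        }
    }

Statements that share a BFS distance end up tied in the same tier, which is exactly the inspection order attributed to Weiser and used as the slicing baseline in Sections 4 and 5.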
3. EXECUTION DIFFERENCING
Differential execution analysis (DEA) is designed specifically to identify the runtime semantic dependencies [25] of statements on changes. Semantic dependencies tell which statements are truly affected by other statements or changes. Data and control dependencies and, therefore, slicing, only provide necessary but not sufficient conditions for semantic dependence. This is the main reason for the imprecision of slicing.

Although finding semantic dependencies is an undecidable problem, DEA detects such dependencies on changes when they occur at runtime to under-approximate the set of semantic dependencies in the program. Therefore, DEA cannot guarantee 100% recall of semantic dependencies, but it achieves 100% precision. This is much better than what dynamic slicing usually achieves [21, 30].

DEA executes a program before and after a change to collect the execution history (including states) [30] of each execution and then compares both executions. The execution history of a program is the sequence of statements executed and the values computed at each statement. The differences between two execution histories reveal which statements had their behavior (i.e., occurrences and values) altered by a change—the conditions for semantic dependence [25].
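The sketch below illustrates this differencing step under a simplifying assumption of ours: an execution history is modeled as just the ordered list of (statement, value) events, whereas the real DEA instrumentation records richer state [30]. All names are hypothetical; this is not the actual DEA implementation.

    import java.util.*;

    /** Minimal sketch of execution-history differencing (DEA-style). */
    public class ExecutionDiff {

        public record Event(int statementId, Object value) {}

        /** Statements whose occurrence counts or computed values differ
         *  between the baseline history and the modified history. */
        public static Set<Integer> affectedStatements(List<Event> baseline,
                                                      List<Event> modified) {
            Set<Integer> affected = new HashSet<>();
            Map<Integer, List<Object>> base = valuesPerStatement(baseline);
            Map<Integer, List<Object>> mod  = valuesPerStatement(modified);
            Set<Integer> all = new HashSet<>(base.keySet());
            all.addAll(mod.keySet());
            for (int s : all) {
                List<Object> b = base.getOrDefault(s, List.of());
                List<Object> m = mod.getOrDefault(s, List.of());
                if (!b.equals(m)) affected.add(s);   // different occurrences or different values
            }
            return affected;
        }

        private static Map<Integer, List<Object>> valuesPerStatement(List<Event> history) {
            Map<Integer, List<Object>> map = new HashMap<>();
            for (Event e : history) {
                map.computeIfAbsent(e.statementId(), k -> new ArrayList<>()).add(e.value());
            }
            return map;
        }
    }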
4. TECHNIQUE
The goal of SENSA is, for a program P and its test suite T, to quantify the potential impacts of changing a statement s in P by applying sensitivity analysis [28] and execution differencing [30]. In Section 4.1, we first give an overview of this new technique using an example program to illustrate it. Then, in Section 4.2, we give a precise description of our technique including its process and its algorithm. Finally, in Section 4.3, we describe the three state-modification strategies that SENSA offers.

4.1 Overview
When a change is made to a program statement, DEA tells developers which other statements in the program are impacted. However, developers first need to know the possible impacts of changing a statement before deciding to change it and before designing and applying a change in that location. Also, in general, developers often need to understand the influences that specific statements exert on the rest of the program and, thus, determine their role in it.

To identify and quantify those influences, SENSA uses sensitivity analysis on a program statement s that is a candidate for changes or whose role in the program must be measured. Based on the test suite for the program, SENSA repeatedly runs the program while modifying the state of statement s and identifies in detail, using DEA, the impacted statements for each modification. Then, SENSA computes the frequency with which each statement is impacted—the sensitivity of those statements to s. These sensitivities are estimates, based on the test suite and the modifications made, of the strength or likelihood of each influence of s.

We use the example program prog in Figure 1 again to illustrate how SENSA works, using two test inputs: (2, 82.5) and (3, 94.7). Suppose that a developer asks for the effect of line 1 on the rest of prog. SENSA instruments line 1 to invoke a state modifier and also instruments the rest of the program to collect the execution histories that DEA needs. The developer also configures SENSA to modify g with values in its "valid" range of [1..6]. For each test case, SENSA first executes prog without changes to provide the baseline execution history for DEA. Then SENSA re-executes the test case five times, once for each other value of g in the range [1..6]. Finally, SENSA applies DEA to the execution histories of the baseline and modified runs of each test case and computes the frequency (i.e., the fraction of all executions) with which every other line was impacted (i.e., changed its state or occurrences).

In the example, the result is the ranking ({1,2,3,4,8}, {5}, {6,7}), where lines 1, 2, 3, 4, and 8 are tied at the top because their states (the values of g and/or s) change in all modified runs and, thus, their sensitivity is 1.0. Line 5 comes next with sensitivity 0.2 as it executes for one modification of each test case (when g changes to 6) whereas, in the baseline executions, it is not reached. Lines 6 and 7 rank at the bottom because they never execute.

In contrast, static forward slicing ranks the program statements by dependence distance from line 1. These distances are obtained through a breadth-first search (BFS) of the dependence graph—the inspection order suggested originally by Weiser [38]. Thus, the result for static slicing is the ranking ({1}, {2,3,4,7}, {5,6}, {8}). For dynamic slicing, a BFS of the dynamic dependence graph, which is the same for both test cases, yields the ranking ({1}, {2,3,4}, {8}).

To measure the predictive power of these rankings, suppose that the developer decides to change line 1 to g = n + 2. The actual set of impacted statements for this change and test suite is {1,2,3,4,8}; all of them are impacted for both test cases. This is exactly the set of statements placed at the top of the ranking by SENSA. In contrast, static slicing predicts that statement 1 will be the most impacted, followed by 2, 3, and 4, and then statement 8 as the least impacted. This prediction is less accurate than that of SENSA, especially because static slicing also places the unaffected statement 7 at the same level as 2, 3, and 4, and predicts 5 and 6 above 8.

Dynamic slicing, against intuition, performs even worse than static slicing with respect to SENSA in this example. This approach ranks the actually-impacted statements 4 and 8 at or near the bottom and misses statement 5 altogether. Thus, dynamic slicing can be imprecise and produce false negatives for impact prediction.

Naturally, the predictions of SENSA depend on the test suite and the modifications chosen. The quality of the predictions also depends on which change we are making predictions for. If, for example, n in the first test case is 5 instead of 2 and the range [0..10] is specified instead of [1..6] to modify line 1, SENSA would produce the ranking ({1,2,8}, {3,4,6,7}, {5}) by using 20 modified runs. The static- and dynamic-slicing rankings remain the same.

If, in this scenario, the developer changes line 1 to g = n × 2, the actual impact set is the entire program because, after the change, lines 5, 6, and 7 will execute for this test suite. In this case, SENSA still does a better job than static slicing, albeit to a lesser extent, by including 2 and 8 at the top of the ranking and including 3, 4, 6, and 7 in second place, whereas static slicing places 6 and 8 at or near the end. SENSA only does worse than static slicing for line 5. Dynamic slicing still does a poor job by missing lines 5, 6, and 7.

Note that prog is a very simple program that contains only eight statements. This program does not require much effort to identify and rank potential impacts, regardless of the approach used. In a more realistic case, however, the differences in prediction accuracy between SENSA and both forms of slicing can be substantial, as the studies we present in Sections 5 and 6 strongly suggest.
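The grouping of statements into the tied tiers shown above follows directly from sorting by decreasing sensitivity. The small Java snippet below is our own hypothetical illustration of that step; the frequencies are the ones from the running example (1.0 for lines 1 through 4 and 8, 0.2 for line 5, 0.0 for lines 6 and 7).

    import java.util.*;

    /** Groups statements with equal impact frequency into ranking tiers,
     *  ordered by decreasing frequency (illustrative sketch). */
    public class FrequencyRanking {
        public static List<Set<Integer>> tiers(Map<Integer, Double> frequency) {
            // bucket statements by frequency, then order buckets from high to low
            Map<Double, Set<Integer>> buckets = new TreeMap<>(Comparator.reverseOrder());
            frequency.forEach((stmt, f) ->
                    buckets.computeIfAbsent(f, k -> new TreeSet<>()).add(stmt));
            return new ArrayList<>(buckets.values());
        }

        public static void main(String[] args) {
            Map<Integer, Double> freq = Map.of(
                    1, 1.0, 2, 1.0, 3, 1.0, 4, 1.0, 8, 1.0,  // state changed in every modified run
                    5, 0.2,                                   // reached in one of five modified runs per test
                    6, 0.0, 7, 0.0);                          // never executed
            System.out.println(tiers(freq));  // prints [[1, 2, 3, 4, 8], [5], [6, 7]]
        }
    }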
4.2 Description
SENSA is a dynamic analysis that, for a statement C (e.g., a candidate change location) in a program P with test suite T, assigns to each statement s in P a value between 0 and 1. This value is an estimate of the "size" or frequency of the influence of C on each s in P. Next, we present SENSA's process and algorithm.

4.2.1 Process
Figure 2 shows the diagram of the process that SENSA follows to quantify influences in programs. The process logically flows from the top to the bottom of the diagram and is divided into three stages, each one highlighted with a different background color: (1) Pre-process, (2) Runtime, and (3) Post-process.
Figure 2: Process used by SENSA for influence quantification. (Diagram omitted; it shows the Instrumenter producing the Instrumented program in the Pre-process stage, the Normal run and the runs that repeatedly modify state producing the Normal and Modified execution histories in the Runtime stage, and the Influence quantification step producing the Quantified influences in the Post-process stage.)

For the first stage, on the top left, the diagram shows that SENSA inputs a Program and a Statement, which we denote as P and C, respectively. In this stage, an Instrumenter inserts at C in program P a call to a Runtime module (which executes in the second stage) and also uses DEA to instrument P to collect the execution history of the program, including state updates [30]. The result, shown in the diagram, is the Instrumented program.

In the second stage, SENSA inputs a Test suite, T, and runs the program repeatedly with modifications (Run repeatedly modifying state) of the value produced by C, for each test case t in T. SENSA also runs the instrumented P without modifications (Normal run) as a baseline for comparison. For the modified executions, the runtime module uses a user-specified strategy (see Section 4.3) and other parameters, such as a value range, to decide which value to use as a replacement at C each time C is reached. Also, for all executions, the DEA instrumentation collects the Modified execution histories and the Normal execution history per test case.

In the third and last stage, SENSA uses DEA to identify the differences between the Modified and Normal execution histories and, thus, the statements affected for each test case by each modification made during the Runtime stage. In this Post-process stage, Influence quantification processes the execution-history differences and calculates the frequencies as fractions in the range [0..1]. These are the frequencies with which the statements in P were "impacted" by the modifications made at runtime. The result is the set Quantified influences of statements influenced by C, including the frequencies that quantify those influences. As part of this output, SENSA also ranks the statements by decreasing influence.

4.2.2 Algorithm
Algorithm 1 formally describes how SENSA quantifies the influences of a statement C in a program P. The statements ranked by SENSA are those in the forward static slice of C in P because those are the ones that can be influenced by C. Therefore, to begin with the first stage, the algorithm starts by computing this static slice (line 1). At lines 2–5, SENSA initializes the map influence, which associates a frequency and a rank with every statement in the slice. Initially, these two values are 0 for all statements. Then, at line 6, SENSA instruments P to let SENSA modify the values at C at runtime and to collect the execution histories for DEA.

For the second stage, the loop at lines 7–16 determines, for all test cases t, how many times each statement is impacted by the modifications made by SENSA. At line 8, SENSA executes t without modifications to obtain the baseline execution history. Then, line 10 executes this same test for the number of times indicated by the parameter given by SENSA-MODS(). Each time, SENSA modifies the value or branching decision at C with a different value.

For the third stage, line 11 asks DEA for the differences between each modified run and the baseline run. In lines 12–14, the differences found are used to increment the frequency counter for each statement affected by the modification. Then, the loop at lines 17–19 divides the influence counter for each statement by the total number of modified runs performed, which normalizes this counter to obtain its influence frequency in the range [0..1].

    Algorithm 1: SENSA(program P, statement C, test suite T)
        // Stage 1: Pre-process
     1: slice = STATICSLICE(P, C)
     2: influence = ∅              // map statement → (frequency, rank)
     3: for each statement s in slice do
     4:     influence[s] = (0, 0)
     5: end for
     6: P′ = SENSA-INSTRUMENT(P, C)
        // Stage 2: Runtime
     7: for each test case t in T do
     8:     exHistBaseline = SENSA-RUNNORMAL(P′, t)
     9:     for i = 1 to SENSA-MODS() do
    10:         exHistModified = SENSA-RUNMODIFIED(P′, t)
                // Stage 3: Post-process
    11:         affected = DEA-DIFF(exHistBaseline, exHistModified)
    12:         for each statement s in affected do
    13:             influence[s].frequency++
    14:         end for
    15:     end for
    16: end for
        // Stage 3: Post-process (continued)
    17: for each statement s in influence do
    18:     influence[s].frequency /= SENSA-MODS() × |T|
    19: end for
    20: RANKBYFREQUENCY(influence)
    21: return influence           // frequency and rank per statement
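The sketch below restates Algorithm 1 in Java against collaborator interfaces of our own invention (StaticSlicer, Instrumenter, TestRunner, Differ); these names and signatures are hypothetical stand-ins, not the actual SENSA or DUA-Forensics APIs, and execution histories are treated as opaque objects.

    import java.util.*;

    /** Java restatement of Algorithm 1 (a sketch; interfaces are hypothetical). */
    public class SensaSketch {
        interface StaticSlicer { Set<Integer> forwardSlice(String program, int stmt); }
        interface Instrumenter { String instrument(String program, int stmt); }
        interface TestRunner   {
            Object runNormal(String program, String test);                 // baseline execution history
            Object runModified(String program, String test, int modIndex); // history with the i-th modification at stmt
        }
        interface Differ       { Set<Integer> affected(Object baseline, Object modified); } // DEA-Diff

        /** Influence frequency of statement c on every statement of its forward slice. */
        public static Map<Integer, Double> quantify(String program, int c, List<String> tests,
                                                    int modsPerTest, StaticSlicer slicer,
                                                    Instrumenter instr, TestRunner runner, Differ dea) {
            // Stage 1: pre-process -- compute the slice and instrument the program (lines 1-6)
            Map<Integer, Double> influence = new HashMap<>();
            for (int s : slicer.forwardSlice(program, c)) influence.put(s, 0.0);
            String instrumented = instr.instrument(program, c);

            // Stage 2: runtime -- one baseline run plus modsPerTest modified runs per test (lines 7-16)
            for (String t : tests) {
                Object baseline = runner.runNormal(instrumented, t);
                for (int i = 0; i < modsPerTest; i++) {
                    Object modified = runner.runModified(instrumented, t, i);
                    for (int s : dea.affected(baseline, modified))       // line 11: DEA-Diff
                        influence.computeIfPresent(s, (k, v) -> v + 1);  // lines 12-14
                }
            }
            // Stage 3: post-process -- normalize counts to frequencies in [0..1] (lines 17-19)
            double totalModifiedRuns = (double) modsPerTest * tests.size();
            influence.replaceAll((s, count) -> count / totalModifiedRuns);
            return influence; // rank by decreasing frequency, as in RANKBYFREQUENCY (line 20)
        }
    }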
4.3 Modification Strategies
SENSA is a generic modifier of program states at given program locations. The technique ensures that each new value picked as a replacement at a location is unique, to maximize diversity while minimizing bias. Whenever SENSA runs out of possible values for a test case, it stops and moves on to the next test case.

Users can specify parameters such as the modification strategy used to pick each new value for a statement. The choice of values affects the quality of SENSA's results, so we designed three different strategies while making it straightforward to add other strategies in the future. The built-in modification strategies are:

1. Random: Picks a random value from a specified range. The default range covers all elements of the value's type except for char, which only includes readable characters. For some reference types, such as String, objects with random states are picked. For all other reference types, the strategy currently picks null (instantiating objects with random states is part of our future plans).

2. Incremental: Picks a value that diverges from the original value by increments of i (the default is 1.0). For example, for value v, the strategy picks v+i and then picks v–i, v+2i, v–2i, etc. For common non-numeric types, the same idea is used. For example, for the string foo, the strategy picks fooo, fo, foof, oo, etc.

3. Observed: First, the strategy collects all values that C computes for the entire test suite. Then, the strategy iteratively picks each new value to replace at C from this pool. The goal is to ensure that the chosen values are meaningful to the program.
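For numeric values, the three policies could look roughly like the sketch below. The interface and class names are ours and are only meant to illustrate the value-picking behavior described above (including the requirement that a picker does not repeat values); they do not reflect SENSA's internal design.

    import java.util.*;

    /** Illustrative value pickers for numeric statements; names and signatures are hypothetical. */
    public class ModificationStrategies {
        interface ValuePicker { OptionalDouble next(double originalValue); } // empty when values run out

        /** Random: uniformly random values from a user-specified range, never repeated. */
        static ValuePicker random(double lo, double hi, Random rnd, int maxTries) {
            Set<Double> used = new HashSet<>();
            return original -> {
                for (int i = 0; i < maxTries; i++) {
                    double v = lo + rnd.nextDouble() * (hi - lo);
                    if (used.add(v) && v != original) return OptionalDouble.of(v);
                }
                return OptionalDouble.empty();
            };
        }

        /** Incremental: v+i, v-i, v+2i, v-2i, ... around the original value. */
        static ValuePicker incremental(double step) {
            int[] calls = {0};
            return original -> {
                calls[0]++;
                double offset = ((calls[0] + 1) / 2) * step;                 // 1,1,2,2,3,3,...
                double v = (calls[0] % 2 == 1) ? original + offset : original - offset;
                return OptionalDouble.of(v);
            };
        }

        /** Observed: iterate over the distinct values the statement computed for the whole test suite. */
        static ValuePicker observed(Collection<Double> observedValues) {
            Iterator<Double> it = new LinkedHashSet<>(observedValues).iterator();
            return original -> it.hasNext() ? OptionalDouble.of(it.next()) : OptionalDouble.empty();
        }
    }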
5. EVALUATION
To evaluate SENSA, we studied its ability to predict the impacts of changes in typical operational conditions. We compared these predictions with those of the state of the art: static and dynamic slicing. Our rationale is that the more closely a technique approximates the actual impacts that changes will have, the more effectively developers will maintain and evolve their software. To this end, we formulated three research questions:

RQ1: How good is SENSA overall at predicting the statements impacted by changes?

RQ2: How good are subsets of the SENSA rankings, if time constraints exist, at predicting impacted statements?

RQ3: How expensive is it to use SENSA?

The first and second questions address the benefits of SENSA overall and per ranking-inspection effort—when developers can only inspect a portion of the predicted ranking in practice. The third question targets the practicality of the technique.
5.1 Experimental Setup
We implemented SENSA in Java as an extension of our dependence-analysis and instrumentation toolkit DUA-Forensics [29]. As such, SENSA works on Java bytecode programs. The SENSA tool is available to the public for download at http://nd.edu/~hcai/sensa/html. For our experiments, we ran SENSA on a dedicated Linux desktop with a quad-core 3.10GHz Intel i5-2400 CPU and 8GB of memory.

We studied four Java programs of various types and sizes obtained from the SIR repository [10], including test suites and changes. Table 1 lists these subjects. Column LOC shows the size of each subject in lines of code. Column Tests shows the number of tests for each subject. Column Changes shows the number of changes used for each subject. We limited this number to seven because, for some of the subjects, this is the maximum available number. For each subject, we picked the first seven changes that come with it. All changes are fixes for bugs in these subjects.

    Table 1: Experimental subjects and their statistics
    Subject        Short description     LOC      Tests   Changes
    Schedule1      priority scheduler    301      2650    7
    NanoXML        XML parser            3521     214     7
    XML-security   encryption library    25220*   92      7
    JMeter         performance tester    39153*   79      7
    * These subjects contain non-Java code; the analyzed Java subsets
      are 22361 LOC for XML-security and 35547 LOC for JMeter.

The first subject, Schedule1, is part of the Siemens suite, which we translated from C to Java. This program can be considered representative of small software modules. NanoXML is a lean XML parser designed for a small memory footprint. We used version v1 of this program as published on the SIR. XML-security is the XML signature and encryption component of the Apache project. JMeter is also an Apache application, used for load-testing functional behavior and measuring the performance of software.

5.2 Methodology
Figure 3 shows our experimental process. On the top left, the inputs for SENSA include a program, a statement (e.g., a candidate change location), and a test suite. SENSA quantifies the influence of the input statement on each statement of the program for the test suite of the program. The technique outputs those statements ranked by decreasing influence of the input statement. For tied statements in the ranking, the rank assigned to all of them is the average position of these statements in the ranking. To enable a comparison of this ranking with the rankings for slicing, the SENSA ranking includes at the bottom, tied with influence zero, those statements in the static forward slice not found to be affected during the analysis.
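The following sketch, written with our own names rather than SENSA's implementation, shows one way such a comparison ranking can be materialized: unaffected slice statements are appended with influence zero, and every group of tied statements receives the average of the positions it occupies.

    import java.util.*;

    /** Builds the comparison ranking described above (illustrative sketch). */
    public class RankingBuilder {

        /** Returns, for every statement in the static slice, its rank value, where
         *  tied statements all receive the average of the positions they occupy
         *  (the top position is 1) and unaffected slice statements tie at influence 0. */
        public static Map<Integer, Double> rank(Map<Integer, Double> influence, Set<Integer> staticSlice) {
            Map<Integer, Double> scores = new HashMap<>();
            for (int s : staticSlice) scores.put(s, influence.getOrDefault(s, 0.0));

            // group statements by score, from highest to lowest
            Map<Double, List<Integer>> groups = new TreeMap<>(Comparator.reverseOrder());
            scores.forEach((s, f) -> groups.computeIfAbsent(f, k -> new ArrayList<>()).add(s));

            Map<Integer, Double> rank = new HashMap<>();
            int nextPosition = 1;
            for (List<Integer> tied : groups.values()) {
                // average of the positions nextPosition .. nextPosition + tied.size() - 1
                double avg = nextPosition + (tied.size() - 1) / 2.0;
                for (int s : tied) rank.put(s, avg);
                nextPosition += tied.size();
            }
            return rank;
        }
    }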
Figure 3: Experimental process. (Diagram omitted; it shows the program, statement, test suite, and change as inputs to SENSA and to the Actual impact computation; a Ranking comparison of the quantified influences against the actually-impacted statements yields the effectiveness of SENSA.)

On the right of the diagram, an Actual impact computation procedure takes the same inputs and also a change for the input statement. This procedure uses the execution-differencing technique DEA [30] to determine the exact set of statements whose behavior actually changes when running the test suite on the program before and after this change.

The procedure Ranking comparison at the bottom of the diagram measures the effectiveness (i.e., prediction accuracy) of the SENSA ranking by comparing this ranking with the set of actually-impacted statements. The process makes a similar comparison, not shown in the diagram, for the rankings obtained from breadth-first searches of the dependence graphs for static and dynamic slicing (the traversal order suggested originally by Weiser [38]).

Ranking comparison computes the effectiveness of a ranking with respect to a set of impacted statements by determining how high in the ranking those statements are located (the top statement in a ranking has a rank value of 1). For each impacted statement, its rank value in the ranking represents the effort that a developer would spend to find that statement when traversing the ranking from the top. The more impacted statements are located near the top of the ranking, the more effective the ranking is at isolating and predicting the actual impacts that are likely to occur in practice (after the change is made). The process computes the average rank of the impacted statements to represent their inspection cost. The effectiveness of the ranking is the inverse of that cost.
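Read as code, the metric is simply an average of rank values; the sketch below uses our own names and assumes every actually-impacted statement appears in the ranking. In the paper's tables this cost is further normalized as a percentage of the static-slice size.

    import java.util.*;

    /** Inspection cost and effectiveness of a ranking w.r.t. the actual impact set (sketch). */
    public class RankingComparison {

        /** Average rank value of the impacted statements (the top statement has rank 1). */
        public static double inspectionCost(Map<Integer, Double> rank, Set<Integer> actualImpacts) {
            return actualImpacts.stream()
                    .mapToDouble(rank::get)   // assumes every actual impact appears in the ranking
                    .average()
                    .orElse(0.0);
        }

        /** Effectiveness is defined as the inverse of the inspection cost. */
        public static double effectiveness(Map<Integer, Double> rank, Set<Integer> actualImpacts) {
            double cost = inspectionCost(rank, actualImpacts);
            return cost == 0.0 ? 0.0 : 1.0 / cost;
        }
    }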
To provide a basis of comparison of the quality of the predictions of the techniques we studied, we also computed the inspection cost for the ideal scenario for each change. This ideal case corresponds to a ranking in which the top |S| statements are exactly those in the set S of impacted statements for that change. This is the best possible ranking for predicting all impacts of the change.

For RQ1, we computed the average inspection costs for the entire rankings for SENSA, static slicing, and dynamic slicing. For RQ2, we computed, for each ranking, the percentage of impacted statements found in each fraction from the top of the ranking—for fractions 1/N to N/N, where N is the size of the ranking. For RQ3, we measured the time it takes to run the test suite on the original (non-instrumented) program and on the instrumented versions of the program for SENSA and dynamic slicing. We also collected the times SENSA and DUA-Forensics take for static slicing and for pre- and post-processing in SENSA.

It is important to note that the test suites used to compute the rankings are the same that we used to find the actual impacts when applying the changes. Therefore, the SENSA predictions might appear biased on the surface. However, there is a strong reason to use the same test suite for both the technique and the "ground truth" (the actual impacts): developers will use the same, existing test suite to run SENSA and then to observe the actual impacts of their changes. For that reason, we decided to use the same test suite for both parts of the experimental process.

5.3 Results and Analysis

5.3.1 RQ1: Overall Effectiveness
Table 2 presents, for all four subjects, the average inspection costs per subject for the seven changes in that subject, for the Ideal scenario (best possible ranking) and for a number of techniques. The units are percentages of the static slice sizes, which range between 55–79% of the entire program. The techniques are Static slicing, Dynamic slicing, and three versions of SENSA: SENSA-Rand, SENSA-Inc, and SENSA-Obs for strategies Random, Incremental, and Observed, respectively.

    Table 2: Average inspection costs for all actual impacts
    (average cost for all impacts and changes, as % of the static slice)
    Subject        Ideal case  Static slicing  Dynamic slicing  SENSA-Rand  SENSA-Inc  SENSA-Obs
    Schedule1      39.4        49.3            47.8             40.2        40.2       40.2
    NanoXML        7.3         27.3            22.2             9.2         9.9        -
    XML-security   5.6         36.4            39.9             11.6        12.5       -
    JMeter         0.2         12.5            37.7             7.7         3.0        -

For some changes in NanoXML, XML-security, and JMeter, SENSA-Obs observed only one value at the input statement for the entire test suite. As a consequence, this strategy, by definition, could not make any modification of that value and, thus, could not be applied to those changes. Therefore, we omitted the average cost result for this version of SENSA for those subjects in Table 2.

We first observe, from the Ideal case results, that the number of statements impacted in practice by the changes in our study, as a percentage of the total slice size, decreased with the size of the subject—from 39.4% in Schedule1 down to 0.2% in JMeter. This phenomenon can be explained by two factors. First, the larger subjects in this study consist of a variety of loosely coupled modules and the changes, which are thus more scattered, can only affect smaller fractions of the program. The second factor is that, nevertheless, slicing will find connections among those modules that are rarely, if ever, exercised at runtime. Also, aliasing in those subjects decreases the precision of slicing. For Schedule1, however, there is little use of pointers and most of the program executes and is impacted for every test case and by every change. In fact, an average of 78.5% of the statements in the slices for Schedule1 were impacted. These factors explain the much greater inspection costs for the ideal case and all techniques in this subject.

For Schedule1, the inspection cost for all three versions of SENSA was 40.2% on average for all changes in that subject. This cost was only 0.8 percentage points greater than the ideal case, a remarkable result, especially because static and dynamic slicing did worse than the ideal case by 8.4 or more percentage points. Our data for individual changes (not shown in the table) for this subject show a consistent trend in which SENSA cost no more than 1.4 percentage points more than the ideal case. Also for individual changes, the differences among the costs of the three versions of SENSA were very small—less than 1% in all cases. In consequence, for this subject, all three strategies were equally good alternatives, and all of them were better than both types of slicing.
The average results for the changes in NanoXML show an even greater gap in effectiveness than in Schedule1 between the two applicable versions of SENSA and static and dynamic slicing. The worst-performing version of our technique, SENSA-Inc, cost only 2.2 percentage points more than the ideal case on average (9.9% vs. 7.3% of all statements that can possibly be affected), whereas the breadth-first traversals for static and dynamic slicing cost at least 14.9% more than the ideal ranking.

For XML-security, the results follow the same trend as in the first two subjects, in which SENSA provided better predictions than both static and dynamic slicing at increasing rates. The two applicable versions of SENSA incurred an average cost of no more than 6.9 percentage points over the ideal cost, whereas slicing cost at least 30.8 points more than the ideal.

Remarkably, for XML-security, unlike Schedule1 and NanoXML, dynamic slicing produced worse predictions than static slicing. This apparently counter-intuitive result is explained by the divergence in the paths taken by the executions before and after the actual changes. Because of these divergences, dynamic slicing actually missed many impacted statements that were not dynamically dependent on the input statement but that are statically dependent on that statement. SENSA did not suffer from this problem because its modifications were able to alter those paths to approximate the effects that the actual changes would have later.

For the largest subject, JMeter, the results again indicate a superiority of SENSA over slicing, at least for these changes and this test suite. On average, SENSA cost 7.5 and 2.8 percentage points more than the ideal cost for the Random and Incremental strategies, respectively. With respect to slicing, SENSA cost at least 4.8 and 9.4 points less, respectively. No technique, however, was able to get close to the considerably small ideal ranking (0.2%). There are two remarkable observations for this subject. First, dynamic slicing was much worse than static slicing, by a much greater margin than for XML-security, which is, again, explained by its inability to consider alternative execution paths. Second, for this subject, we observed the greatest difference between the two applicable SENSA versions, where, for the first and only time, SENSA-Inc was superior. This superiority suggests that the actual changes we studied in JMeter had effects observable mostly in the vicinity of the values computed in the unchanged program, as SENSA-Inc simulates, rather than at random different locations in the value range.

In sum, for RQ1 and for these subjects and changes, the results indicate that SENSA is considerably better than slicing techniques at predicting which statements will be impacted when these changes are actually made. In particular, these results highlight the imprecision of both static and dynamic slicing, at least with respect to changes. More importantly, our observations for SENSA are reminiscent of similar conclusions observed for mutation analysis [4], in which small modifications appear to be representative of faults in general—or, as in this experiment, of fault fixes. These results suggest that developers who use SENSA can save a substantial amount of effort for understanding the potential consequences of changes.

5.3.2 RQ2: Effectiveness per Inspection Effort
Because program slices are often very large, developers cannot examine in practice all statements in each ranking produced by SENSA or slicing. Instead, they will likely inspect a fraction of all potential impacts and will focus on the most likely impacts first. With this prioritization, developers can maximize the cost-effectiveness of their inspection by analyzing as many highly-ranked impacts as possible within their budget. To understand the effects of such prioritization, we studied the effectiveness of each portion of each ranking produced by the studied techniques.

Figures 4–7 show the cost-effectiveness curves for the best possible ranking-examination order and for the studied techniques: SENSA (two or three applicable versions) and static and dynamic slicing. For each graph, each point on the horizontal axis represents the fraction of the ranking examined from the top of that ranking, whereas the vertical axis corresponds to the percentage of actually-impacted statements (on average for all changes in the respective subject) found within that fraction of the ranking. Note that the result presented in Table 2 for each ranking is the average of the Y values for that entire ranking in the corresponding graph.
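Each curve point can be computed as in the small sketch below (our own code, not from the paper): for a given fraction of the ranking, count the percentage of the actually-impacted statements that appear within that prefix. The ranking and impact set in the usage example are hypothetical.

    import java.util.*;

    /** Computes cost-effectiveness curve points: % of actual impacts found within
     *  the top fraction of a ranking (illustrative sketch). */
    public class CostEffectivenessCurve {

        /** 'ranking' lists statements from most to least influential;
         *  returns the percentage of 'actualImpacts' found in its first 'fraction'. */
        public static double impactsFound(List<Integer> ranking, Set<Integer> actualImpacts, double fraction) {
            int prefix = (int) Math.ceil(fraction * ranking.size());
            long found = ranking.subList(0, prefix).stream().filter(actualImpacts::contains).count();
            return actualImpacts.isEmpty() ? 0.0 : 100.0 * found / actualImpacts.size();
        }

        public static void main(String[] args) {
            List<Integer> ranking = List.of(1, 2, 3, 4, 8, 5, 6, 7);  // hypothetical ranking
            Set<Integer> impacts  = Set.of(1, 2, 3, 4, 8);            // hypothetical actual impacts
            for (double f : new double[]{0.25, 0.5, 0.75, 1.0})
                System.out.printf("top %.0f%% of ranking -> %.0f%% of impacts%n",
                                  100 * f, impactsFound(ranking, impacts, f));
        }
    }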
The Ideal curves in these graphs provide detailed insights into how cost-effective impact prediction techniques can aspire to be. For all subjects but Schedule1, this curve rises sharply within the first 10% of the ranking. These curves are not straight lines because they correspond to the average curves for all changes per subject and, in general, the actual impacts for these changes (which define Ideal) vary in size. Only for Schedule1 is the Ideal curve mostly straight, because all slices have almost the same size and the impacted fractions of the slices have similar sizes.

For Schedule1, because of the high baseline (ideal) costs, all curves are relatively close to each other. However, there are two distinctive groups: the SENSA versions near the ideal curve and the slicing curves below them. Interestingly, the SENSA curves overlap with the ideal curve until about 70% of the ranking. Therefore, a user of SENSA can find as many impacts as possible if the inspection budget is 70% or less. At 70%, 90% of the impacts are predicted by SENSA, in contrast with 70–75% for slicing. Also, SENSA-Obs, which only applies to this subject, has virtually the same cost-effectiveness as the two other versions of SENSA.

The results for the three other subjects, which represent more modern software than Schedule1, indicate even stronger cost-effectiveness benefits for the two applicable SENSA versions. The curves in Figures 5, 6, and 7 show that SENSA was not only better than slicing overall at predicting impacts, but also that this superiority is even greater for fractions of the resulting rankings. The curves for either SENSA-Rand or SENSA-Inc, or both, grow much faster at the beginning than those for slicing. In other words, these results suggest that users can benefit even more from choosing SENSA over slicing when they are on a budget.
Figure 4: Impacted statements found versus cost for Schedule1. Figure 5: Impacted statements found versus cost for NanoXML. Figure 6: Impacted statements found versus cost for XML-security. Figure 7: Impacted statements found versus cost for JMeter. (Plots omitted; each shows the percentage of impacts found versus the percentage of the slice inspected for the Ideal case, static slicing, dynamic slicing, and the applicable SENSA versions.)

The only case in which one of the slicing approaches seems to be competitive with SENSA is JMeter. For this subject, on average for all seven changes, static slicing overcomes SENSA-Rand at about 25% of the inspection effort. However, after that point, static slicing maintains only a small advantage over this version of SENSA, and this difference disappears somewhere between 80–90% of the inspection. More importantly, SENSA-Rand is considerably better than static slicing before the 25% point, which explains the overall advantage of 4.8 percentage points for the former technique as reported in Table 2. The main highlight for this subject, however, is SENSA-Inc, which performs better than both techniques at all points.

In sum, for RQ2, the results indicate that, for these subjects, test suites, and changes, SENSA not only makes better predictions than slicing for understanding the potential impacts of changes, but is also better by even greater amounts at discovering those impacts early in the resulting rankings. This makes SENSA an even more attractive choice when users are on a budget, as is likely to happen in practice. Also, among all versions of SENSA, and despite performing slightly worse overall than SENSA-Rand for NanoXML and XML-security, SENSA-Inc seems the best choice based on its cost-effectiveness for JMeter.

5.3.3 RQ3: Computational Costs
For RQ3, we analyze the practicality of SENSA. Table 3 shows the time in seconds it takes to run the test suite for each subject without instrumentation (column Normal run), using the machine and environment described in Section 5.1. These times help put in perspective the analysis times taken by SENSA and slicing. Perhaps against intuition, the longest running time is for Schedule1 because, despite being the smallest subject, its test suite is at least an order of magnitude larger than for the other subjects (see Table 1).

    Table 3: Average computational costs of techniques, in seconds
    Subject        Normal run  Static analysis  Instrumented run  Influence ranking
    Schedule1      169.0       5.4              290.6             19.8
    NanoXML        13.4        13.7             35.8              1.8
    XML-security   58.2        153.5            67.0              63.2
    JMeter         37.5        594.6            328.3             238.9

The next three columns report the time in seconds taken by each of the three stages of SENSA for each subject. Each time reported here is the average for all changes for that subject. First, the pre-processing stage (column Static analysis) performs static slicing, which is needed by our experiment and is also necessary to instrument the program for dynamic slicing, DEA, and SENSA. As expected, this time grows with the size of the subjects. Interestingly, despite plenty of room for optimization in the implementation of DUA-Forensics and SENSA, the static analysis for SENSA does not consume more than an average of ten minutes.

For the second stage, as expected due to the varying ratios of subject size to test-suite size, the greatest amount of time is spent by JMeter and the second greatest by Schedule1. In both cases, the cost in time is approximately five minutes. For the third and last stage, which processes the runtime data to compute the SENSA rankings, the time costs were less than the runtime costs. This means that, overall, no individual phase takes an unreasonable amount of time to produce impact predictions. In the worst case, for JMeter, the total cost is below 20 minutes.

In all, these computational-cost results are encouraging for the practicality of SENSA for three reasons. First, we believe that, in many cases, developers can make use of SENSA's predictions if they are provided within 20 minutes for a program such as JMeter and less than 10 minutes for smaller programs. Second, the machine and environment used in our experiment correspond to a mid-level setup that developers typically use nowadays. Third, many of the steps performed by our research toolset not only can be optimized for individual operations, but can also undergo algorithmic improvements. Moreover, the iterative nature of SENSA provides opportunities for significant speed-ups via parallelization.
5.4 Threats to Validity
The main internal threat to the validity of our studies is the potential presence of implementation errors in SENSA. SENSA is a research prototype developed for this work. However, SENSA is built on top of DUA-Forensics, whose development started in 2006 and has matured considerably over time. Also, the SENSA layer built on top of DUA-Forensics has been tested, manually examined, and improved over more than half a year.

Another internal threat to our experiment is the possibility of procedural errors in our use of SENSA, DUA-Forensics, and related scripts in our experimental process. To reduce this risk, we tested, inspected, debugged, and manually verified the results of each phase of this process.

The main external threat to the validity of our study and conclusions about SENSA is that we studied only a limited number and variety of subjects and changes (bug fixes) using test suites that may not represent all possible behaviors of those subjects. Nevertheless, we chose these four subjects to represent a variety of sizes, coding styles, and functionality to achieve as much representativeness as possible of software in general. In addition, these subjects have been used extensively in experiments conducted in the past by the authors and/or other researchers around the world. Moreover, the three largest subjects are real-world open-source programs.

6. CASE STUDIES
To understand in detail the comparative benefits of SENSA and slicing for a more specific task, we investigated how these techniques could help a developer isolate the concrete effects of faulty code to help understand and fix that fault. To that end, we conducted three case studies of fault-effects analysis. When a (candidate) faulty statement is identified, a developer must decide how to fix it. This decision requires understanding the effects that this statement has on the failing point—a failed assertion, a bad output, or the location of a crash. However, not all effects of a faulty statement are necessarily erroneous. The interesting behavior of a fault is the chain of events from that fault to the failing point.

For diversity, we performed the case studies on different subjects. For representativeness, we picked the three largest subjects that we had already set up for our previous study: NanoXML, XML-security, and JMeter. For each case study, we chose the first change location (a bug to fix) provided with the subject and we identified the first failing point (typically, the only one) where the bug is manifested. Given the faulty statement, which a developer could identify with a fault-localization technique, we manually identified the sequence of all statements that propagate the fault to the failing point. We discarded affected statements that did not participate in this propagation to the failing point. All statements are Java bytecode instructions in a readable representation.

Given a chain of events—the set of propagating statements—and the bug fix that is provided as a change with the subject, we computed how high the chain is in the rankings computed by SENSA and by static and dynamic slicing from the fault location. Specifically, we calculated the average rank of the statements in this chain in each ranking to determine how well those rankings highlight the effects of the fault that actually cause the failure. For SENSA, we used the Incremental strategy, which performed best overall in the study of Section 5.

Naturally, three case studies are insufficient to draw definitive conclusions, but, nevertheless, these cases shed light on the workings of the three techniques that we are studying for a particular application that requires extensive manual effort to investigate. Next, we present our results and analysis for these case studies. (For full details, see http://nd.edu/~hcai/sensa/casestudies.)

6.1 NanoXML
In this case for NanoXML, the fault is located in a condition for a while loop that processes the characters of the DTD of the input XML document. The execution of this fault by some test cases triggers a failure: the program fails to completely read the input, which then causes an unhandled exception to be thrown when parsing the next section of the document. The bug and its propagation mechanism are not easy to understand because the exception is thrown from a statement located far away from the fault. After an exhaustive inspection, we manually identified the 30 statements that constitute the entire cause-effect sequence that causes the failure.

All 30 statements that cause the failure—and, thus, will change with the bug fix—are placed by SENSA-Inc in the top 11.5% of the ranking, whereas static slicing places them in the top 28.7% and dynamic slicing puts them in the top 73.5%. Also, the average inspection cost, computed with the method of Section 5.2, is 6.8% for SENSA-Inc, 9.3% for static slicing, and 24.4% for dynamic slicing. Similar to our findings for many changes in Section 5, dynamic slicing was much less effective in this case as well because of the changes in the execution path that the bug fix causes after it is applied (which singles out the statements that propagated the fault).

We also wanted to understand how well the techniques detect the statements whose behavior changes only because of their state and not their execution occurrences. Such statements seem more likely to be highlighted by slicing. Thus, we ran the predictions for the 23 statements in this propagation chain that execute in both the buggy and fixed versions of the program. SENSA-Inc places these statements at an average rank of 5.33%, whereas static slicing puts them at 8.05% on average and dynamic slicing predicts them at 9.30%, also on average. All techniques showed improvements for this subset of statements, especially dynamic slicing. However, dynamic slicing still performed worse than static slicing, possibly because of long dynamic dependence distances to the impacted statements.
6.2 XML-security
The bug provided with XML-security is only revealed by one of the 92 unit tests for this subject. This unit test fails in the buggy version because of an assertion failure in that test due to an unexpected result. Manually tracing the execution backwards from that assertion to the fault location reveals that the fault caused an incorrect signature on the input file via a complex combination of control and data flow. The complete sequence of events for the failure trace contains more than 200 bytecode-like statements. Many of those statements, however, belong to helper functions that, for all practical purposes, work as atomic operations. Therefore, we skipped those functions to obtain a more manageable and focused cause-effect chain that can be more easily identified and understood.

For the resulting chain of 55 statements, SENSA-Inc places 36 of them in the top 1% of its ranking. A total of 86.3% of those top-1% statements are, in fact, in the chain. In sharp contrast, static slicing locates only 9 of those statements in the top 4% of its ranking. The cost of inspecting the entire sequence using SENSA is 6.6% of the slice, whereas static slicing requires inspecting 33.15% of the slice and dynamic slicing needs 17.9%.

The entire forward static slice consists of 18,926 statements. Thus, users would cover the full sequence of events when reaching 1,255 statements in the ranking of SENSA. With static and dynamic slicing, instead, users would have to visit 6,274 and 3,388 statements, respectively. Thus, for this fault, a developer can find the chain of events that makes the assertion fail much faster than using slicing.

6.3 JMeter
For this case study, we chose again the first buggy version provided with JMeter and we picked, from among all 79 unit tests, the test that makes the program fail. The failing point is an assertion check at the end of that unit test. Despite the much larger size of both the subject and the forward slice from this fault, the fault-propagation sequence consists of only 4 statements.

Static slicing ranks two of those statements—half of the sequence—in the top 1% and the other two statements farther away. SENSA, in contrast, places the entire sequence in its top 1%, making it much easier to distinguish the effect from all other possible impacts of the fault. The inspection of the entire failure sequence using static slicing would require a developer to go through 2.6% of the forward slice, or 848 statements. For SENSA, this cost would be only 0.1%, or 32 statements.

As we have frequently observed for dynamic slicing, for this fault, and considering the effects of fixing it, dynamic slicing would cost much more than SENSA and static slicing to identify the fault-propagation sequence. To find all four statements in the sequence, users would have to traverse 12.1% of the slice, which corresponds to 4,094 statements. Once again, this case gives SENSA an advantage over static slicing, and especially over dynamic slicing, in assisting with the location of the failure sequence.

7. RELATED WORK
In preliminary work [31], we described at a high level an early version of SENSA and we showed initial, promising results for it when compared with the predictions from breadth-first traversals of static slices [33, 38]. In this paper, we expand our presentation of SENSA, its process, algorithm, and modification strategies. Moreover, we expand our comprehensive study to four Java subjects, we add dynamic slicing to our comparisons, and we present three case studies of cause-effects identification using SENSA on program failures.

A few other techniques discriminate among statements within slices. Two of them [12, 39] work on dynamic backward slices to estimate influences on outputs, but do not consider impact influences on the entire program. These techniques could be compared with SENSA if a backward variant of SENSA is developed in the future. Also for backward analysis, thin slicing [33] distinguishes statements in slices by pruning control dependencies and pointer-based data dependencies incrementally as requested by the user. Our technique, instead, keeps all statements from the static slice (which is a safe approach) and automatically estimates their influence to help users prioritize their inspections.

Program slicing was introduced as a backward analysis for program comprehension and debugging [38]. Static forward slicing [16] was then proposed for identifying the statements affected by other statements, which can be used for change-impact analysis [7]. Unfortunately, static slices are often too big to be useful. Our work alleviates this problem by recognizing that not all statements are equally relevant in a slice and that a static analysis can estimate their relevance to improve the effectiveness of the forward slice. Other forms of slicing have been proposed, such as dynamic slicing [18] and thin slicing [33], that produce smaller backward slices but can miss important statements for many applications. Our technique, in contrast, is designed for forward slicing and does not drop statements but scores them instead.

Dynamic impact analysis techniques [20, 26], which collect execution information to assess the impact of changes, have also been investigated. These techniques, however, work at a coarse granularity level (e.g., methods) and their results are subject to the executions observed. Our technique, in contrast, works at the statement level and analyzes the program statically to predict the impacts of changes for any execution, whether those impacts have been observed yet or not. In other words, our technique is predictive, whereas dynamic techniques are descriptive. Yet, in general, static and dynamic techniques complement each other; we intend to investigate that synergy in the future.

8. CONCLUSION AND FUTURE WORK
Program slicing is a popular but imprecise analysis technique with a variety of applications. To address this imprecision, we presented a new technique and tool called SENSA for quantifying statements in slices. This quantification improves plain forward static slices by increasing their effectiveness for change-impact and cause-effects prediction. Rather than pruning statements from slices, SENSA grades statements according to their relevance in a slice.

We plan to extend our studies to more subjects and changes. We are also developing a visualization for quantified slices to improve our understanding of the approach, to enable user studies, and to help other researchers. Using this tool, we will study how developers take advantage of quantified slices in practice.

Slightly farther in the future, we foresee adapting SENSA to quantify slices for other important tasks, such as debugging, comprehension, mutation analysis, interaction testing, and information-flow measurement. More generally, we see SENSA's scores as abstractions of program states as well as interactions among such states. These scores can be expanded to multi-dimensional values and data structures to further annotate slices. Such values can also be simplified to discrete sets as needed to improve performance.