Oracle-Guided Component-Based Program Synthesis
Susmit Jha, UC Berkeley, jha@eecs.berkeley.edu
Sumit Gulwani, Microsoft Research, sumitg@microsoft.com
Sanjit A. Seshia, UC Berkeley, sseshia@eecs.berkeley.edu
Ashish Tiwari, SRI International, tiwari@csl.sri.com

ICSE '10, May 2-8 2010, Cape Town, South Africa. Copyright 2010 ACM 978-1-60558-719-6/10/05 ...$10.00.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ABSTRACT
We present a novel approach to automatic synthesis of loop-free programs. The approach is based on a combination of oracle-guided learning from examples, and constraint-based synthesis from components using satisfiability modulo theories (SMT) solvers. Our approach is suitable for many applications, including as an aid to program understanding tasks such as deobfuscating malware. We demonstrate the efficiency and effectiveness of our approach by synthesizing bit-manipulating programs and by deobfuscating programs.

Categories and Subject Descriptors
D.1.2 [Programming Techniques]: Automatic Programming; I.2.2 [Artificial Intelligence]: Program Synthesis; K.3.2 [Learning]: Concept Learning

Keywords
Program synthesis, Oracle-based learning, SMT, SAT

1. INTRODUCTION
Automatic synthesis of programs has long been one of the holy grails of software engineering. It has found many practical applications: generating optimal code sequences [20, 11], optimizing performance-critical inner loops, generating general-purpose peephole optimizers [2, 3], automating repetitive programming tasks [15], and filling in low-level details after the higher-level intent has been expressed [24]. Two applications of synthesis are of particular interest in this paper. The first is that of automating the discovery of non-intuitive algorithms (e.g., [8]). The second application, as we show in this paper, is program understanding, and more specifically, program deobfuscation. The need for deobfuscation techniques has arisen in recent years, especially due to an increase in the amount of malicious, and mostly obfuscated, code (malware) [28]. Currently, human experts use decompilers and manually deobfuscate the resulting code (see, e.g., [22]). Clearly, this is a tedious task that could benefit from automated tool support.

A traditional view of program synthesis is that of synthesis from complete specifications. One approach is to give a specification as a formula in a suitable logic [19, 26, 10, 8]. Another is to write the specification as a simpler, but possibly far less efficient program [20, 11, 24]. While these approaches have the advantage of completeness of specification, such specifications are often unavailable, difficult to write, or expensive to check against using automated verification techniques. In this paper, we propose a novel oracle-guided approach to program synthesis, where an I/O oracle that maps a given program input to the desired output is used as an alternative to having a complete specification. The key idea of our algorithm is to query the I/O oracle on an input that can distinguish between non-equivalent programs that are consistent with the past interaction with the I/O oracle. The process is repeated until a semantically unique program is obtained. Our experimental results show that only a few rounds of interaction are needed.

We apply the oracle-guided approach to automated synthesis of loop-free programs, those that compute functions of their input and terminate. Such programs arise in a variety of application contexts, such as low-level bit-manipulating code, scientific computing kernels, parts of control software in graphical languages such as LabVIEW, and even applications in high-level scripting languages such as JavaScript and Ruby that are formed by chaining multiple high-level operators. A key characteristic of our method is that it is component-based, meaning that we synthesize a program by performing a circuit-style, loop-free composition of components drawn from a given component library. We can also address the challenge of identifying whether the given set of components is insufficient to synthesize the desired program. For this purpose, we additionally require making only one query to a more expensive validation oracle that checks whether the program is correct or not.

Our synthesis algorithm is based on a novel constraint-based approach that reduces the synthesis problem to that of solving two kinds of constraints: the I/O-behavioral constraint, whose solution yields a candidate program consistent with the interaction with the I/O oracle, and the distinguishing constraint, whose solution provides an input that distinguishes between non-equivalent candidate programs. These constraints can be solved using off-the-shelf SMT (Satisfiability Modulo Theories) solvers. Traditional synthesis algorithms perform an expensive combinatorial search over the space of all possible programs. In contrast, our technique leaves the inherent exponential nature of the problem to the underlying SMT solver. Engineering advances over the years allow such solvers to deal effectively with the problem instances that arise in practice, which are usually not hard and hence end up not requiring exponential reasoning.

Contributions and Organization.
• We propose a novel oracle-guided approach to synthesis, where an I/O oracle obviates the need for complete specifications. Our approach has interesting connections to results from computational learning theory (Section 5).
• We present an instantiation of the oracle-guided approach to synthesis of loop-free programs over a given set of components (see problem definition in Section 3). This is enabled by a novel constraint-based technique that involves
an interaction between SMT solvers and the I/O oracle (Section 4). We also present an interesting optimization that leverages biased sampling (Section 4.6).
• We demonstrate the utility of our synthesis technique for the discovery of bit-manipulating programs [29], which are often needed for optimizing performance (Section 6.1). These programs are quite unintuitive and can be difficult for even expert programmers to discover. The upcoming 4th volume of the classic series The Art of Computer Programming by Knuth has a chapter on bitwise tricks [13].
• We propose a novel application of program synthesis to program understanding. We demonstrate this in the context of malware deobfuscation by deobfuscating examples drawn from and inspired by the Conficker and MyDoom viruses using our synthesis technique (Section 6.2).

2. MOTIVATING EXAMPLES
We present two examples in this section to introduce the synthesis problem and motivate our approach.

2.1 Bit Manipulation
Consider the following programming problem: Given a bit-vector integer x, of finite but arbitrary length, construct a new bit-vector y that corresponds to x with the rightmost string of contiguous 1s turned off, i.e., reset to 0s. Such programming problems often arise while developing low-level embedded code, network applications, or in other domains where bit-level manipulation is needed.

Let us contemplate writing a formal specification for this problem. The most natural and easiest specification involves the use of alternating quantifiers, where n is the length of x:

$$\begin{aligned}
\exists i, j.\ \{\ & 0 \le i, j < n \\
& \wedge\ (\forall k.\ j \le k \le i \implies x[k] = 1) \\
& \wedge\ (\forall k.\ 0 \le k < j \implies x[k] = 0) \\
& \wedge\ (x[i+1] = 0 \ \vee\ i = n-1) \\
& \wedge\ (\forall k.\ i < k < n \implies x[k] = y[k]) \\
& \wedge\ (\forall k.\ 0 \le k \le i \implies y[k] = 0)\ \}
\end{aligned}$$

The above specification is not easy to write. Moreover, verifying any candidate implementation against the above specification is challenging due to the presence of quantifiers.

Let us consider writing some sample input-output pairs, or examples, for the problem. For any input x, it is easy to provide the corresponding output y. Some example (x, y) pairs are (0110, 0000), (0101, 0100), (110110, 110000).

Finally, let us contemplate writing a program for the above problem. A straightforward, but inefficient, implementation is a loop that iterates through the bits of x and zeroes out the rightmost contiguous string of 1s. Can we synthesize a shorter and more efficient implementation? It is difficult to answer this, but it is easy to speculate that the elementary operations that may be used inside such an efficient implementation will be the standard bit-vector operators: bit-wise logical operations (|, &, ⊕, ∼), and basic arithmetic operations (+, −, ∗, /, %).

Given a set of possible elementary operations, and an ability to generate outputs for given inputs, our oracle-guided synthesis tool Brahma will synthesize the following nontrivial and tricky procedure for solving the above problem.

    turnOffRightMostOneBitString(x)
    { t1 = x - 1; t2 = (x | t1); t3 = t2 + 1;
      t4 = (t3 & x); return t4; }

A programmer will require considerable familiarity with bit-level manipulations to come up with such an implementation. Hence, automated synthesis of such difficult-to-write programs is of great practical significance.

2.2 Deobfuscation
A major challenge of dealing with malware is simply to understand what the malicious code is doing. We introduce here an example inspired by the obfuscations introduced in variants of MyDoom, an e-mail virus affecting Microsoft Windows that became the fastest-spreading virus when it was first released in 2004 [21].

The example involves the construction of the SMTP header for the e-mail sent by the virus. SMTP requires a prescribed sequence of messages of different types, initially starting with a "hello" message, followed by the "from", "reply to", and other similar control fields, followed finally by the data segment of the e-mail. The fragment we have constructed is a program genStringObs, given in Figure 1, which constructs a string corresponding to the message type.

    genStringObs(int input)
    {
      a1=1, a2=0; b1=1, b2=0; c1=0; c2=0;
      if (input == 0)
        { a1 = 0; a2 = 0; b1 = 0; b2 = 0; }
      else if (input == 1)
        { c1 = 0; c2 = 1; }
      else if (input == 2)
        { a1 = 1; a2 = 0; c1 = 1; c2 = 1; }
      else if (input == 3)
        { b1 = 0; b2 = 0; c1 = 1; c2 = 1; }
      else return NULL;
      c = 2*c1 + c2;
      if (c == 1) { return rot13("EPCG GB", 7); }
      else
      if (c == 2) {
        if (input * (input-1) % 2 == 0)
          return rot13("EPCG GB", 7);
        else
          return rot13("RUYB", 4);
      }
      else {
        if (b1 ⊕ b2) return rot13("ZNVY SEBZ", 9);
        else if ((a1 ⊕ a2) == (b1 ⊕ b2))
          return rot13("RUYB", 4);
        else return rot13("QNGN", 4);
      }
    }
    rot13(char *buf, int sz)
    {
      char *buf1 = malloc((sz+1) * sizeof(char));
      char a;
      while (a = ~*buf)
      {
        *buf1 = (~a-1/(~((a | 32))/13*2-11)*13);
        buf++; buf1++;
      }
      return buf1;
    }

Figure 1: Obfuscated program inspired by MyDoom

We used two types of obfuscations in this example. The first is a control-flow obfuscation drawn from several obfuscations given by Collberg et al. [5]. The second obfuscation is one that is directly used in MyDoom, involving the modification of each alphanumeric character in the message type string by the rot13 substitution cipher [14]. The reader can appreciate the difficulty of decoding exactly what this program is doing.

Is there a simpler (deobfuscated) program that performs
the same function as genStringObs? It is difficult to answer this question. However, looking at the obfuscated code in Figure 1, it is easier to guess that a deobfuscated program will probably use the following types of components:
• Conditional if-then-else (ternary) operators,
• Boolean expressions that occur in genStringObs, and
• Operators that return strings of size bounded by the size of the largest string in genStringObs.

Given these components and the obfuscated program genStringObs, our oracle-guided synthesis tool Brahma automatically computes the (deobfuscated) program given in Figure 2. The reader can appreciate how much easier this program is to understand.

    genString(int input)
    { if (input == 0) return "EHLO";
      else if (input == 1) return "RCPT TO";
      else if (input == 2) return "MAIL FROM";
      else if (input == 3) return "DATA";
      else return NULL;
    }

Figure 2: Deobfuscated version of genStringObs

3. PROBLEM DEFINITION
The goal is to synthesize a loop-free program using a given set of base components and using input-output examples. We assume the presence of an I/O oracle that can be queried on any input. The I/O oracle, when given an input, returns the output of the desired program (that we wish to synthesize) on that input. We also assume the presence of a validation oracle that validates the correctness of a candidate program. Finally, we assume that we are given a set of (base) components that should be used as building blocks in the synthesized program. Each component is given in the form of its input-output specification, which is written as a logical formula relating the inputs and the outputs of that component. For ease of presentation, we assume that all components have exactly one output. We also assume that all inputs and outputs have the same type. These restrictions are easily removed.

Formally, the synthesis problem in our proposed programming methodology requires the following:
• A validation oracle $\mathcal{V}$ that, given any candidate program (constructed from base components), returns a Boolean answer indicating whether the candidate program is the desired one or not. We discuss how a validation oracle can be implemented for the program classes considered in this paper in Sections 6.1 and 6.2.
• An I/O oracle $\mathcal{I}$ that, given any program input, returns the output of the desired program on that input.
• A set of specifications $\{\langle \vec{I}_i, O_i, \phi_i(\vec{I}_i, O_i) \rangle \mid i = 1, \ldots, N\}$, called a library, where $\langle \vec{I}_i, O_i, \phi_i(\vec{I}_i, O_i) \rangle$ is the specification for the base component $f_i$, which includes
  – a tuple of input variables $\vec{I}_i$ and an output variable $O_i$
  – an expression $\phi_i(\vec{I}_i, O_i)$ over variables $\vec{I}_i$ and $O_i$ that specifies the input-output relationship of the i-th component.
The symbols $\vec{I}_i$, $O_i$ are assumed to be distinct.

The goal of the synthesis problem is to synthesize a program P that can be validated by the validation oracle $\mathcal{V}$, i.e., $\mathcal{V}(P)$ = true. Furthermore, program P should be constructed using only the set of base components in the library, i.e., program P should take $\vec{I}$ as its inputs and use the set $\{O_1, \ldots, O_N\}$ as temporary variables in the following form:

    P(\vec{I}):
      O_{π_1} := f_{π_1}(\vec{V}_{π_1}); ... ; O_{π_N} := f_{π_N}(\vec{V}_{π_N});
      return O_{π_N};

where
C1. each variable in $\vec{V}_{\pi_i}$ is either an input variable from $\vec{I}$, or a temporary variable $O_{\pi_j}$ such that j < i, and
C2. $\pi_1, \ldots, \pi_N$ is a permutation of 1, ..., N.

Program P above appears to be a straight-line program, but, in fact, it can be more complex because the base components $f_i$ can be complex. In particular, base components can be "if-then-else" functions, and using these components, program P can describe arbitrary loop-free programs.

We note that the program P above uses all components from the library. We can assume this without any loss of generality. Even when there is a correct program P using fewer components, that program can always be extended to a program that uses all components by adding dead code. Dead code can be easily identified statically and removed in a post-processing step.

We also note that program P above uses each base component only once. We can assume this without any loss of generality. If there is a program P using multiple copies of the same base component, we assume that the user provides multiple copies explicitly in the library. Such a restriction of using each base component only once is interesting in two regards. First, it can be used to enforce efficient or minimal programs. Second, it prunes the search space of possible programs, making the synthesis problem finite and tractable.

Informally, the synthesis problem is to come up with a program, using only the base components in the given library, that is accepted by the validation oracle.

4. ORACLE-BASED SYNTHESIS
In this section, we provide our solution for the program synthesis problem formally described above. Our solution is based on encoding the space of all possible programs by a formula (Section 4.1). Given a set of input-output pairs, we then constrain this formula further so that it encodes only those programs that work correctly on the given input-output pairs (Section 4.2). By solving this constraint, we generate a candidate solution. If the candidate solution is not the desired program, we provide a way to generate a new input-output pair (Section 4.3). The overall procedure that combines these parts to solve the program synthesis problem is presented in Section 4.4. We present enhancements to our basic procedure in Section 4.6.

4.1 Background: Encoding Programs
We present an encoding of the space of well-formed candidate programs, that is, of programs P satisfying constraints C1 and C2, as formulas. This encoding is drawn from recent work [8]. Note, however, that the material in subsequent sections depends only on the existence of such an encoding. Our proposed approach can be used with alternative encodings as well.

Intuitively, the encoding we use involves viewing the space of candidate programs as all ways of connecting components from the library that satisfy syntactic and semantic well-formedness constraints. Each connection is encoded using an integer-valued location variable. Put another way, the value of a location variable determines which component goes on which location (line number), and from which location (line number or circuit input) it gets its input arguments.
The main property of the encoding that our approach relies upon is distilled into the following theorem. This theorem states the existence of two formulas (encodings): the first formula $\psi_{wfp}$ represents the set of all syntactically well-formed programs, whereas the second formula $\phi_{func}$ represents the set of all semantic input-output behaviors of a well-formed program.

Theorem 1. There exists a set of integer-valued location variables L, a well-formedness constraint $\psi_{wfp}(L)$ over L, a mapping Lval2Prog, and a functional constraint $\phi_{func}(L, \vec{I}, O)$ over $L \cup \{\vec{I}, O\}$ such that the following properties hold:
• Lval2Prog is a bijective mapping from the set of values L that satisfy the constraint $\psi_{wfp}(L)$ to the set of programs that satisfy constraints C1 and C2.
• Let $L_0$ be a satisfying assignment to the formula $\psi_{wfp}$. If α and β are any candidate input and output values, then the formula $\phi_{func}(L_0, \alpha, \beta)$ is true iff the program Lval2Prog($L_0$) returns the value β on the input α.

The proof of Theorem 1 follows from the results stated in [8]. We now describe the encoding more formally. Let $\mathbf{P}$ and $\mathbf{R}$ denote the union of all formal inputs (parameters) and formal outputs (return variables) of the components respectively, that is,

$$\mathbf{P} := \bigcup_{i=1}^{N} \vec{I}_i \qquad \mathbf{R} := \bigcup_{i=1}^{N} \{O_i\} = \{O_1, \ldots, O_N\}$$

Any straight-line program constructed using N components can be described by a set of location variables

$$L := \{\, l_x \mid x \in \mathbf{P} \cup \mathbf{R} \,\}$$

that contains one new variable $l_x$ for each variable x in $\mathbf{P} \cup \mathbf{R}$, with the following interpretation associated with each of these variables.
• If x is the output variable $O_i$ of the component $f_i$, then $l_x$ is the line number in the program where the component $f_i$ is used.
• If x is the j-th input parameter of the component $f_i$, then $l_x$ is the line number "from where component $f_i$ gets its j-th input".

In the above description, line number refers to either a line of the program, or to some input. For uniformity, each input in $\vec{I}$ is assigned a line number from $0, \ldots, |\vec{I}| - 1$ and the program line numbers then take values from $|\vec{I}|, \ldots, |\vec{I}| + N - 1$. Let $M = |\vec{I}| + N$. The variables L take values in the range $0, \ldots, M - 1$ and these new line numbers have the following interpretation.
• For $0 \le j < |\vec{I}|$, line number j is blank; it takes the value of the j-th input of the program.
• For $|\vec{I}| \le j < M$, line number j contains the $(j - |\vec{I}| + 1)$-th assignment statement of the original program P.

The well-formedness constraint $\psi_{wfp}(L)$, defined below, encodes the interpretation of the location variables $l_x$ along with syntactic well-formedness constraints, such as consistency and acyclicity constraints.

$$\psi_{wfp}(L) \stackrel{\mathrm{def}}{=} \bigwedge_{x \in \mathbf{P}} (0 \le l_x < M) \ \wedge \bigwedge_{x \in \mathbf{R}} (|\vec{I}| \le l_x < M) \ \wedge\ \psi_{cons}(L) \ \wedge\ \psi_{acyc}(L)$$

$$\psi_{cons}(L) \stackrel{\mathrm{def}}{=} \bigwedge_{x, y \in \mathbf{R},\ x \not\equiv y} (l_x \ne l_y)$$

$$\psi_{acyc}(L) \stackrel{\mathrm{def}}{=} \bigwedge_{i=1}^{N}\ \bigwedge_{x \in \vec{I}_i,\ y \equiv O_i} (l_x < l_y)$$

The consistency constraint $\psi_{cons}$ encodes that every line in the program should have at most one component, while the acyclicity constraint $\psi_{acyc}$ encodes that every variable should be initialized before it is used.

The function Lval2Prog returns the program corresponding to a given valuation L as follows: in the i-th line of Lval2Prog(L), we have the assignment $O_j := f_j(O_{\sigma(1)}, \ldots, O_{\sigma(t)})$ if $l_{O_j} = i$ and $l_{I_j^k} = l_{O_{\sigma(k)}}$ for $k = 1, \ldots, t$, where t is the arity of component $f_j$, and $(I_j^1, \ldots, I_j^t)$ is the tuple of input variables $\vec{I}_j$ of $f_j$. The well-formedness constraint describes syntactically correct programs, but it does not describe the semantics of these programs.

The functional constraint $\phi_{func}(L, \vec{I}, O)$ is obtained by taking $\psi_{wfp}(L)$ and adding to it constraints capturing the dataflow semantics and the semantics of components.

$$\phi_{func}(L, \vec{I}, O) \stackrel{\mathrm{def}}{=} \exists \mathbf{P}, \mathbf{R}.\ \ \psi_{wfp}(L) \ \wedge\ \phi_{lib}(\mathbf{P}, \mathbf{R}) \ \wedge\ \psi_{conn}(L, \vec{I}, O, \mathbf{P}, \mathbf{R})$$

$$\phi_{lib}(\mathbf{P}, \mathbf{R}) \stackrel{\mathrm{def}}{=} \bigwedge_{i=1}^{N} \phi_i(\vec{I}_i, O_i)$$

$$\psi_{conn}(L, \vec{I}, O, \mathbf{P}, \mathbf{R}) \stackrel{\mathrm{def}}{=} \bigwedge_{x, y \in \mathbf{P} \cup \mathbf{R} \cup \vec{I} \cup \{O\}} (l_x = l_y \Rightarrow x = y)$$

where $\phi_{lib}$ represents the semantics of the base components (it relates the inputs and outputs of each component), and $\psi_{conn}$ represents the dataflow semantics (it matches the inputs and outputs of the different components and the inputs and output of the overall program with each other, in accordance with the values of the location variables).

The formula $\phi_{func}(L, \vec{I}, O)$ represents the class of all syntactically well-formed programs P, constructed using only the N base components, that on input $\vec{I}$ return output O. Hence, we can solve the program synthesis problem by finding appropriate values for the L variables. We need to find values for L such that the input-output behavior of the resulting program matches the input-output behavior specified by the I/O oracle.

A key step in our solution of the program synthesis problem is to synthesize programs that work for finitely many input-output pairs. We discuss this next.

4.2 I/O-behavioral Constraint
In this section, we show how to generate a constraint whose solution provides a candidate program whose input-output behavior matches a given finite set of input-output examples.

Given a set E of input-output examples $\{(\alpha_j, \beta_j)\}_j$, we use the notation $\mathrm{Behave}_E$ to denote the following constraint, which we refer to as the I/O-behavioral constraint.

$$\mathrm{Behave}_E(L) \stackrel{\mathrm{def}}{=} \bigwedge_{(\alpha_j, \beta_j) \in E} \phi_{func}(L, \alpha_j, \beta_j)$$

Let $L_0$ be a set of values such that $\mathrm{Behave}_E(L_0)$ is true. It follows from the definition of the I/O-behavioral constraint that the program encoded by $L_0$ will give output $\beta_j$ whenever it is given an input $\alpha_j$, for all pairs $(\alpha_j, \beta_j)$ in E. This property of the I/O-behavioral constraint is stated below.

Theorem 2 (I/O-behavioral Constraint). For any satisfying solution $L_0$ to the I/O-behavioral constraint, the input-output behavior of the program Lval2Prog($L_0$) matches all the input-output examples in the set E.
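To make the I/O-behavioral constraint tangible, the sketch below replaces the SMT solver with exhaustive enumeration: it walks over every program that satisfies C1 and C2 (the solutions of the well-formedness constraint) and returns the first one that agrees with all examples in E. This is an illustrative stand-in only, not the paper's encoding; the 4-bit width, the four-component library, and the names `well_formed_programs`, `run_program`, and `solve_behave` are assumptions made for the example.

```python
from itertools import permutations, product

MASK = 0xF  # assumed 4-bit vectors, to keep enumeration small

# Hypothetical component library: name -> (arity, function).
LIBRARY = {
    "dec": (1, lambda a: (a - 1) & MASK),
    "inc": (1, lambda a: (a + 1) & MASK),
    "or":  (2, lambda a, b: a | b),
    "and": (2, lambda a, b: a & b),
}

def well_formed_programs(n_inputs):
    """Yield every program that uses each component once (C2) and reads
    arguments only from inputs or earlier lines (C1)."""
    for order in permutations(LIBRARY):
        choices = []
        for line, name in enumerate(order):
            arity = LIBRARY[name][0]
            earlier = range(n_inputs + line)  # inputs plus previous outputs
            choices.append(list(product(earlier, repeat=arity)))
        for args in product(*choices):
            yield list(zip(order, args))

def run_program(lines, inputs):
    """Evaluate a program; the result is the output of its last line."""
    values = list(inputs)
    for name, args in lines:
        values.append(LIBRARY[name][1](*(values[a] for a in args)))
    return values[-1]

def solve_behave(examples, n_inputs=1):
    """Enumeration stand-in for T-SAT(Behave_E(L)): return some program that
    matches every (input_tuple, output) pair in `examples`, or None."""
    for prog in well_formed_programs(n_inputs):
        if all(run_program(prog, a) == b for a, b in examples):
            return prog
    return None
```

With examples such as `[((0b0110,), 0b0000), ((0b0101,), 0b0100)]` the search returns some consistent candidate, which need not be the desired program. With examples taken from a right shift, `[((2,), 1), ((4,), 2)]`, it returns `None`, because no composition of these components lets a low output bit depend on a higher input bit; this corresponds to the "components insufficient" outcome.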
The proof of the above theorem is immediate from the definition of an I/O-behavioral constraint and Theorem 1.

We next check if the program, which is synthesized by considering finitely many input-output pairs, is the desired program. We want to avoid the use of the validation oracle, since it is expensive. Here we use what is perhaps the central idea of our approach: generate a "distinguishing" input that differentiates this program from another candidate program.

4.3 Distinguishing Constraint
In this section, we show how to generate a constraint whose solution provides an input that distinguishes a given candidate program from another non-equivalent candidate program, both of which have a given set of input-output pairs in their respective input-output behavior.

Let E be a set of input-output pairs. Let P be a candidate program, defined by values L, whose input-output behavior matches the set E. Suppose P is not the desired program. Then, there should be some input $\vec{I}$ such that P gives incorrect output on $\vec{I}$. But, how do we find such an $\vec{I}$?

If P is not the desired program, then let us assume that there is a correct program P′. Clearly, for all input-output pairs $(\alpha_j, \beta_j)$ in E, the program P′ should return $\beta_j$ when it is given input $\alpha_j$. But since P is not the desired program, whereas P′ is the desired program, P and P′ should give different outputs on some new input.

We say $\vec{I}$ is a distinguishing input if there is another program P′ whose input-output behavior also matches E, but P and P′ give different outputs on the input $\vec{I}$. The constraint $\mathrm{Distinct}_{E,L}(\vec{I})$, defined below, represents the set of all distinguishing inputs $\vec{I}$ and we refer to it as the distinguishing constraint.

$$\mathrm{Distinct}_{E,L}(\vec{I}) \stackrel{\mathrm{def}}{=} \exists L', O, O'.\ \ \mathrm{Behave}_E(L') \ \wedge\ \phi_{func}(L, \vec{I}, O) \ \wedge\ \phi_{func}(L', \vec{I}, O') \ \wedge\ O \ne O'$$

Theorem 3 (Distinguishing Constraint). If α is a satisfying solution to the distinguishing constraint $\mathrm{Distinct}_{E,L}(\vec{I})$, where L defines the program P, then there exists a program P′ such that P and P′ have different behaviors on input α, but have the same behavior on all the inputs in the set E.

The proof of Theorem 3 follows from the definition of the distinguishing constraint, Theorem 2 and Theorem 1. We now have all the ingredients for describing our overall procedure for solving the synthesis problem.

4.4 Oracle-Guided Synthesis
In this section, we describe our oracle-guided iterative synthesis procedure. The description uses the I/O-behavioral constraint and the distinguishing constraint described above.

The procedure works by iteratively synthesizing new programs that work correctly on more and more inputs. It starts with a set containing just one arbitrarily chosen input. In each iteration, the procedure synthesizes a program that works correctly on the current finite set of inputs. If such a program is found, then the procedure attempts to find a distinguishing input. If a distinguishing input is found, then it is added into the set of inputs for subsequent iterations. In all other cases, the procedure terminates. It either returns the correct program, or it notes that the components provided are insufficient for synthesizing the correct program.

For solving the I/O-behavioral constraint and the distinguishing constraint, the procedure makes use of a function T-SAT. Given a formula φ(A), the function T-SAT(φ(A)) searches for values for A that will make the formula φ true. If successful, then T-SAT(φ(A)) returns one such specific value for A. Otherwise, it returns ⊥. The function T-SAT is implemented as a call to a Satisfiability Modulo Theories (SMT) solver. SMT solvers check for satisfiability of a first-order formula with respect to underlying background theories [4].

The pseudo-code for the procedure is given in Figure 3.

    IterativeSynthesis():
    1   // Input: Set of base components used in
    2   //        construction of Behave_E and Distinct_{E,L}
    3   // Output: Candidate Program
    4   E := {(α_0, I(α_0))}    // Pick any value α_0 for \vec{I}; I(.) is the I/O oracle
    5   while (1) {
    6     L := T-SAT(Behave_E(L));
    7     if (L == ⊥) return "Components insufficient";
    8     α := T-SAT(Distinct_{E,L}(\vec{I}));
    9     if (α == ⊥) {
    10      P := Lval2Prog(L);
    11      if (V(P)) return P;
    12      else return "Components insufficient"; }
    13    E := E ∪ {(α, I(α))}; }

Figure 3: Oracle-guided Synthesis Procedure

The procedure maintains a set E of input-output examples constructed by querying the I/O oracle $\mathcal{I}$ on a new input before the while loop (Line 4) and in each iteration of the while loop (Line 13). In each iteration of the while loop, the procedure attempts to synthesize a candidate program P (represented by L) that satisfies the set E of input-output examples (Line 6). If it fails, then it returns failure (Line 7). Otherwise, it checks (in Line 9) whether the candidate program P is the semantically unique program that satisfies the given set of input-output examples. A program is semantically unique if any other program that satisfies the given set of input-output examples produces the same output as the program for any other input. If P is the semantically unique program, then the procedure either returns P (Line 11) or failure (Line 12) depending on whether the validation oracle $\mathcal{V}$ validates P or not. If the candidate program is not semantically unique, then an input α is obtained that is added to E to help narrow down the choice of candidate programs (Line 13).

The following theorem states the correctness of Procedure IterativeSynthesis. Note that if the inputs $\vec{I}$ take values from a finite domain, then the number of iterations of the loop in the procedure is bounded by the total number of different inputs; hence, in such cases the procedure is guaranteed to terminate.

Theorem 4. If Procedure IterativeSynthesis, given in Figure 3, returns a program P, then $\mathcal{V}(P)$ is true. If Procedure IterativeSynthesis returns "Components insufficient", then there does not exist any program P constructed from the set of base components such that $\mathcal{V}(P)$ is true. Furthermore, Procedure IterativeSynthesis is guaranteed to terminate when the inputs $\vec{I}$ take values from a finite domain.

The proof of the correctness theorem follows immediately from the description of the procedure in Figure 3, combined with the properties stated in Theorem 1, Theorem 2 and Theorem 3. We also illustrate in Figure 4 all three cases in which Procedure IterativeSynthesis terminates. The first case corresponds to step 7 and the second and third cases correspond to step 11 and step 12 respectively.

4.5 Illustration on Running Example
We illustrate the oracle-guided synthesis approach on the running example presented in Section 2.1. The problem was,
given a bit-vector integer x, of finite but arbitrary length, to construct a new bit-vector y that corresponds to x with the rightmost string of contiguous 1s turned off.

Our technique starts with a random input 01011 and the I/O oracle I (the user) is used to obtain the corresponding expected output 01000. This step corresponds to Line 4 of the algorithm presented in Figure 3.

Given the input/output pair (01011, 01000), our technique generates the following candidate program in Line 6 (we give only the expression returned):

    (x + 1) & (x − 1)

Then, in Line 8, it checks whether a semantically different program can be generated. In this case, our technique generates the following alternative program and the distinguishing input 00000:

    (x + 1) & x

The I/O oracle is used to obtain the output 00000 for this input (Line 13). This is added to the set of input/output pairs E. Note that the newly added pair rules out one of the candidate programs, namely, (x + 1) & (x − 1).

In the next iteration, with the updated set E, the technique finds the program

    −(¬x) & x

and the check in Line 8 generates the alternate program

    (((x & −x) | −(x − 1)) & x) ⊕ x

and the input 00101. Hence, we add (00101, 00100) to E. This rules out (((x & −x) | −(x − 1)) & x) ⊕ x.

Note that at this stage, the program (x + 1) & x remains a candidate, since it was not ruled out in the earlier iterations. In the next four iterations, Brahma generates (01111, 00000), (00110, 00000), (01100, 00000) and (01010, 01000) as input-output examples and adds them to E. The semantically unique program generated from the resulting set E is the desired program:

    (((x − 1) | x) + 1) & x.

Figure 4: Termination cases of Synthesis Procedure. The validation oracle is needed only to ensure correctness in case 1b.
    1. No program exists with given components
       a. Step 7: Set E of I/O examples shows that the components are insufficient for synthesizing a valid program.
       b. Step 12: Discovered semantically unique program P is found incorrect by the validator; no synthesis feasible.
    2. Valid program exists with given components
       Step 11: Valid program P is returned.

4.6 Optimization
The basic procedure described above can be improved by using alternate ways to generate the inputs that are used by the procedure for synthesis. IterativeSynthesis uses an SMT solver in two ways:
(a) First, an SMT solver is used to generate a candidate program that works for the current set of inputs.
(b) Second, an SMT solver is used to generate a new distinguishing input on which the currently synthesized program and the desired program potentially differ.

Although SMT solvers are fast and capable of handling very large formulas, using them in every iteration compromises efficiency. It is tempting to speculate that the use of SMT solvers for generating a distinguishing input (case (b) above) can be avoided; for example, by replacing it by a function that finds new inputs by sampling the input space in some way. We explore two alternative ways for sampling the input space.

Sampling Uniformly at Random
Let Inputs be the set of all possible valuations for the input variables. Let sample(Inputs) be a function that returns a particular input from the input space Inputs by sampling the set Inputs uniformly at random. The function sample(Inputs) can be used to find a new input, in place of the call to the SMT solver, in Line 8 of Procedure IterativeSynthesis. We will call this new variant Random.

Sampling With Bias
The second approach we consider is based on biasing the search for inputs towards a certain part of the input space. Not all inputs in the input space are equally important. For example, a program may take an integer input i, but have the same behavior for all i > 5 and have interesting behaviors only on values 0 ≤ i ≤ 5. For many applications, the user knows a priori which inputs are more crucial in defining the overall program. The idea behind the sampling with bias strategy is to search for distinguishing inputs by biasing the search to this part of the input space.

In the bitvector benchmarks used in this paper, the input space consists of all (tuples of) bitvectors of a certain bit width. It is well-known that, for a very large class of commonly-used bitvector functions, the rightmost bits influence the output more than the leftmost bits.

Property 1 (See [29], Chapter 2). A function mapping bitvectors to bitvectors can be implemented with add, subtract, bitwise and, bitwise or, and bitwise not instructions if and only if each bit of the output depends only on bits at and to the right of that bit in each input operand.

This suggests that we should bias the sampling so that we get more variety on the rightmost bits. The code ConstrainedRandomInput in Figure 5 uses a constrained random strategy for generating a new input. It starts with an input α that is sampled uniformly at random, but then it sets its rightmost K bits to the (rightmost K bits in the) number cnt. Since cnt is incremented each time, we get a new combination each time. Specifically, if K = 2, then in four calls to the function ConstrainedRandomInput, we will get all four combinations (00, 01, 10 and 11) in the rightmost 2 digits of I. The code ConstrainedRandomInput finds the first 2^K inputs this way. If more are needed, then it goes back to using the SMT solver. The new variant of IterativeSynthesis, obtained by replacing the call to the SMT solver in Line 8 by the code ConstrainedRandomInput, will be called Constrained Random.

    ConstrainedRandomInput:
    // cnt is a global variable initialized to 0
    // K is a parameter (#rightmost bits to set)
    if (cnt < 2^K) {
        α := sample(Inputs);
        α := Set rightmost K bits of α to cnt;
        cnt := cnt + 1; }
    else α := T-SAT(Distinct_{E,L}(I));

Figure 5: Strategy for generating a new input based on sampling from the input space with an application-dependent bias.

We will compare the performance of IterativeSynthesis, Random, and Constrained Random in Section 6.

5. DISCUSSION
5.1 Choosing Base Components
It is reasonable to ask how base components are chosen in our approach and what happens when the given set of base components is either insufficient or very large.

The choice of base components is made by the user and is guided by the application domain. This allows the user to use his/her knowledge to guide the synthesis and influence success. It is not unreasonable to expect users to provide this information. In several application domains, there is a natural choice for the set of base components. For example, a natural set of base components for synthesizing bitvector algorithms will contain components that perform bitwise and, or, not, xor, negation, increment and decrement operations. In our experiments on synthesizing bitvector programs (Section 6), we started with such a set of base components, referred to as the standard library. If the synthesis procedure found that this set of components was insufficient, the standard library was augmented with a set of new components suggested by the user and the synthesis procedure was re-run with this extended library.

Nevertheless, choosing a reasonable set of base components is crucial for the feasibility of our synthesis approach. The search space of candidate programs grows exponentially with the number of base components. The strategy of starting with a small set of base components, and then incrementally adding components, can partly avoid the need to deal with a very large set of base components. However, it can be successful only if the synthesis engine not only synthesizes correct programs quickly, but also reports infeasibility of the synthesis problem quickly. In our experiments, we show that our technique can detect infeasibility efficiently.

Choosing Base Components for Deobfuscation. When using our program synthesis approach for performing program deobfuscation, the base components are picked from the assignment and conditional statements in the obfuscated code. For example, consider the obfuscated program in Figure 1. Although it is difficult to understand the obfuscated program, it is easier to guess that the set of important base components will include if-then-else components and equality comparators. Similarly, for the other deobfuscation examples (reported in Section 6), the base components used for synthesis contain only operators (such as left-shift and bitwise-xor) that appear explicitly in the obfuscated code (see Figure 7).

5.2 Connections to Learning
Our oracle-guided synthesis framework has close connections to certain fundamental results in computational learning theory. We explore these connections in this section.

Our oracle-based model is similar to the query-based learning model proposed by Angluin [1], but with some important distinctions. In Angluin’s model, a learner interacts with an oracle through the use of membership and equivalence queries in order to learn a target concept. In our setting, the target concept is the program we seek to synthesize. A membership query is similar to the query we make to an I/O oracle, except that the former returns a binary answer whereas the I/O oracle returns an output value. An equivalence query is similar to a query to the validation oracle, except that, in Angluin’s model, if the candidate concept is not equivalent to the target concept, the oracle returns a counterexample as evidence for this non-equivalence. In our context, since the validation oracle is called only at the end, when we are left with a semantically unique program consistent with the set of examples, such a counterexample is not needed. Moreover, Angluin’s model treats both kinds of queries as equally expensive. We make a distinction between the cheaper queries to the I/O oracle and the more expensive queries to the validation oracle, which allows us to optimize our implementation. Finally, our algorithm iterates by finding distinguishing inputs, which is not an operation supported by Angluin’s model.

Two other results from learning theory also shed light on why our oracle-based approach is effective in practice.

First, note that our focus on loop-free programs that compute functions of finite-precision bit-vector inputs indicates a connection to the work on learning Boolean circuits. In particular, the classic result on learning constant-depth Boolean (AC^0) circuits from a few test inputs [17] provides a partial explanation for the effectiveness of this strategy. The result relies on a theorem stating that AC^0 circuits can be approximated well by low-degree polynomials, which in turn are known to be identifiable by their behavior on few inputs.

The second relevant result relates to the notion of teaching dimension introduced by Goldman and Kearns [7]. Informally, the teaching dimension of a concept class is the minimum number of examples a teacher (oracle) must reveal to uniquely identify any target concept from that class. As our experiments show, we need very few examples to synthesize our target programs in practice, indicating that these programs form a concept class with a low teaching dimension. Moreover, our algorithm fits closely with a result from the Goldman-Kearns paper [7], showing that the generation of an optimal teaching sequence of examples is equivalent to a minimum set cover problem. In the set cover problem for a given target concept, the universe of elements is the set of all incorrect concepts (programs) and each set S_i, corresponding to example x_i, contains concepts that are differentiated from the target concept by this example x_i. We can see that our Procedure IterativeSynthesis computes such a distinguishing example in each iteration, and terminates when it has computed a “set cover” that distinguishes the target concept from all other candidate concepts (the “universe”). Given this close connection, it does seem that the classes of functions corresponding to the bit-manipulating and deobfuscation examples we consider have small teaching dimension, and also that Procedure IterativeSynthesis is effective at generating a sequence of examples close to the optimal teaching sequence.

These connections to learning theory are very intriguing. We leave a formal exploration of these links to future work.

6. EXPERIMENTAL RESULTS
We present an experimental evaluation of our technique and compare the different approaches, namely IterativeSynthesis, Random, and Constrained Random, discussed in Section 4.

Setup and Benchmarks.
We have implemented IterativeSynthesis in a tool called Brahma. It uses Yices 1.0.21 [25] as the underlying SMT solver. We ran our experiments on 8x Intel(R) Xeon(R) CPU 1.86GHz with 4GB of RAM. Brahma was able to synthesize the desired circuit for each of the benchmark examples. Semi-biased Brahma implements Constrained Random with the parameter K = 2. Thus, it differs from Brahma only in the first 4 steps. As mentioned in Section 4, this is specifically targeted towards synthesis of bitvector programs.

We selected a set of 25 benchmark examples to evaluate our technique. 22 benchmarks (P1-P22) are bit-manipulation programs from the book Hacker’s Delight, commonly referred to as the Bible of bit twiddling hacks [29]. 3 benchmarks were used as examples to illustrate the use of our technique for deobfuscation. These benchmarks reflect obfuscation strategies from literature on obfuscation techniques [5] (P23) and Internet worms: Conficker [22] (P24) and MyDoom [21] (P25). Some of these examples are presented in Figure 6 and Figure 7. All the benchmarks used in the experiments are listed in a more detailed version of the paper made available as a technical report [9].

6.1 Bit-Manipulating Programs
Recall from Section 5.1 that the bitvector benchmarks were run using a standard library of base components, and if necessary, an extended library. In Table 1, we report the runtime when using the standard library (col. 4) and when using the user-augmented extended library (col. 5), in case the standard library was not sufficient. Note that our tool quickly terminates when the given library is insufficient.

For bitvector benchmarks, the user plays the role of the I/O oracle as well as the validation oracle.

Figure 6: Selected Bit-vector Benchmarks

    P1(x): Turn-off rightmost 1 bit.
        o1 = (x − 1)
        res = (x & o1)

    P19(x): Turn-off the rightmost contiguous string of 1 bits.
        o1 = (x − 1)
        o2 = (x | o1)
        o3 = (o2 + 1)
        res = (o3 & x)

    P21(x): Next higher unsigned number with same number of 1 bits.
        o1 = (−x)
        o2 = (x & o1)
        o3 = (x + o2)
        o4 = (x ⊕ o3)
        o5 = (o4 >> 2)
        o6 = (o5 / o2)
        res = (o6 | o3)

    P22(x): Round up to the next highest power of 2.
        o1 = (x − 1)
        o2 = (o1 >> 1)
        o3 = (o1 | o2)
        o4 = (o3 >> 2)
        o5 = (o3 | o4)
        o6 = (o5 >> 4)
        o7 = (o5 | o6)
        o8 = (o7 >> 8)
        o9 = (o7 | o8)
        o10 = (o9 >> 16)
        o11 = (o9 | o10)
        res = (o11 + 1)

Figure 8: Ratio of Runtime for Random Input Generation to SemiBiased Brahma. (Plot over the bitvector benchmarks; series: Runtime and Iter Count; axes: Ratio of Runtime and Ratio of Iteration Counts.)

Figure 9: Ratio of Runtime for Brahma to SemiBiased Brahma. (Plot over the bitvector benchmarks; series: Runtime and Iter Count; axes: Ratio of Runtime and Ratio of Iteration Counts.)
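To make the Figure 6 benchmarks concrete, the following is a minimal Python sketch of the four straight-line programs. It assumes 32-bit unsigned bitvectors (the `MASK` constant), follows the Hacker's Delight formulations of each routine, and uses function names chosen here for illustration; it is not the tool's output format. The P19 routine is the program reached at the end of the Section 4.5 walkthrough, so the input/output pairs listed there can serve as a quick check.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit unsigned bitvectors


def p1(x):
    """P1: turn off the rightmost 1 bit."""
    return x & ((x - 1) & MASK)


def p19(x):
    """P19: turn off the rightmost contiguous string of 1 bits."""
    o2 = x | ((x - 1) & MASK)
    return ((o2 + 1) & MASK) & x


def p21(x):
    """P21: next higher unsigned number with the same number of 1 bits (x != 0)."""
    o2 = x & (-x & MASK)   # isolate the rightmost 1 bit
    o3 = (x + o2) & MASK   # ripple the carry through the rightmost run of 1s
    o4 = x ^ o3            # bits that changed
    o6 = (o4 >> 2) // o2   # right-adjust the leftover 1s
    return o6 | o3


def p22(x):
    """P22: round up to the next highest power of 2."""
    o = (x - 1) & MASK
    for shift in (1, 2, 4, 8, 16):
        o |= o >> shift    # smear the highest set bit to the right
    return (o + 1) & MASK


# I/O pairs quoted in the Section 4.5 walkthrough (5-bit values).
EXAMPLES = [
    (0b01011, 0b01000), (0b00000, 0b00000), (0b00101, 0b00100),
    (0b01111, 0b00000), (0b00110, 0b00000), (0b01100, 0b00000),
    (0b01010, 0b01000),
]


def consistent(program, examples):
    """True if the program reproduces every I/O pair."""
    return all(program(i) == o for i, o in examples)
```

Under these assumptions, `consistent(p19, EXAMPLES)` holds, while the intermediate candidate `lambda x: ((x + 1) & MASK) & x` fails on the pair (00110, 00000), which matches the way later examples prune earlier candidates in the walkthrough.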
If the user guarantees that the provided set of base components is sufficient to encode the desired solution, then we do not require the validation oracle. Otherwise, it is theoretically impossible to know whether or not the generated solution is the correct one without a validation oracle. However, in practice, our algorithm detects insufficiency of the base components by discovering inconsistency, and not by a query to the validation oracle. This suggests that in the absence of any validation oracle, we can consider the semantically unique candidate program returned by our algorithm to be the correct program for all practical purposes.

We now compare the three approaches on bit-vector benchmarks using two metrics: the total runtime and the number of iterations. We present the ratio of runtimes of random input generation (col 2 of Table 1) and semi-biased Brahma (col 5 of Table 1) in Figure 8. Semi-biased Brahma is 1.5 times to 12 times faster than the random technique. For P22, the random technique did not finish in 1 hour while semi-biased Brahma was able to synthesize it in 186 seconds. The number of iterations required to synthesize a program is also reduced significantly, as shown in Table 1. Brahma and semi-biased Brahma are compared in Figure 9. While the number of iterations is higher for semi-biased Brahma, it is faster than Brahma on larger benchmarks. It reduces the runtime for P18 from 140.65 seconds to 25.55 seconds, for P21 from 527.91 seconds to 272.28 seconds, and for P22 from 1108.15 seconds to 187.17 seconds. This validates the optimization proposed in Section 4.

6.2 Deobfuscation
The I/O oracle involves simply evaluating the obfuscated program on the given input. The validation oracle can be a program equivalence checking tool or the user.

An additional challenge that this domain offers is the presence of arbitrary string constants. Our synthesis framework can be easily extended to discovering such constants. For this purpose, we introduce a generic base component f_c that simply outputs some arbitrary constant c. The component f_c takes no input and returns one output O, and its functional specification is written as O = c. The only change to the framework is that, since c is allowed to be arbitrary, we existentially quantify over c in the functional constraint φ_func described in Section 4.1.

For the three examples that we used in experiments, observe that Brahma gives the best performance. The key observation from the experiments is that random input generation does not work well for examples such as P25, where randomly generating integers has a rare chance of 1 in 2^32 to pick an input which produces any of the first 4 possible outputs. Brahma takes exactly 5 iterations to query the I/O oracle with inputs that generate all the 5 possible outputs.

The experimental results indicate that adding a distinguishing input is better than adding a random input to E because it guarantees that at least one candidate program is definitely removed from the search space. Thus, it guaran-
P24: Multiply by 45

    int multiply45Obs(int y)
    { a=1; b=0; z=1; c=0;
      while(1) { if (a == 0) {
          if (b == 0) { y=z+y; a=¬a;
            b=¬b; c=¬c; if (¬c) break; }
          else {
            z=z+y; a=¬a; b=¬b; c=¬c;
            if (¬c) break; } }
        else if(b == 0) {z=y

P23: Interchange the source and destination addresses.

    interchangeObs(IPaddress* src, IPaddress* dest)
    { *src = *src ⊕ *dest;
      if (*src == *src ⊕ *dest) { *src = *src ⊕ *dest;
        if (*src == *src ⊕ *dest) { *dest = *src ⊕ *dest;
          if (*dest == *src ⊕ *dest) { *src = *dest ⊕ *src;
            return; }
          else { *src = *src ⊕ *dest; *dest = *src ⊕ *dest;
            return;} }
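P24 is described only as "Multiply by 45", and Section 5.1 notes that the deobfuscation components are operators, such as left-shift, that appear in the obfuscated code. As a rough illustration of what a loop-free equivalent over shift and add components could look like, the Python sketch below uses the factorization 45 = 5 × 9. The function name, the 32-bit width, and this particular decomposition are assumptions made for the example; it is not claimed to be the program reported by the tool.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit machine integers


def multiply45(y):
    """Loop-free multiply by 45 using only left-shift and add (hypothetical sketch)."""
    z = (y << 2) & MASK       # z  = 4y
    y5 = (z + y) & MASK       # y5 = 5y
    z = (y5 << 3) & MASK      # z  = 40y
    return (z + y5) & MASK    # 45y modulo 2^32


def agrees_with_oracle(candidate, oracle, inputs):
    """Compare a candidate against an I/O oracle on the given inputs."""
    return all(candidate(i) == oracle(i) for i in inputs)
```

Here an I/O oracle for the check could be any trusted evaluator of the original behavior; in this sketch a plain `lambda y: (45 * y) & MASK` plays that role.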
Shapiro’s Algorithmic Debugging System [23] performs synthesis by repeatedly using the oracle to identify errors in the current (incorrect) program and then fixing those errors. We do not fix incorrect programs. We use the incorrect program to identify a distinguishing input and then re-synthesize a new program that works on the new input-output pair as well.

In programming by demonstration [15, 16, 6], the user demonstrates how to perform a task and the system learns an appropriate representation of the task procedure. Unlike our method, these approaches do not make active oracle queries, but rely on the demonstrations the user chooses. Making active queries is important for efficiency and for terminating quickly (so that the user is not overwhelmed with queries).

In programming by sketching [24], implementations are synthesized from sketches, that is, partially-specified programs with holes. The SKETCH system uses SAT solving within a counterexample-guided loop that constantly interacts with a verifier to check candidate implementations against a complete specification, where the verifier provides counterexamples until a correct solution has been found. In contrast, we do not use counterexamples for synthesis. Further, we require a validation oracle only when the specification cannot be realized using the provided components. This verifier is not required to return a counter-example. In practice, we never require a query to the verifier and our technique correctly identifies infeasible scenarios without calling the validation oracle.

8. CONCLUSION
We have presented a novel approach to program synthesis based on oracle-guided learning from examples and SMT solvers. Applications to synthesis of bit-vector programs and deobfuscation have been demonstrated. Experiments indicate that our approach can be efficient and effective for discovering unintuitive code and for program understanding.

Acknowledgments
We are grateful to Rastislav Bodik, George Necula, John Rushby, Natarajan Shankar, Hassen Saidi, Dawn Song, Richard Waldinger and the anonymous reviewers for their insightful comments. The UC Berkeley authors were supported in part by NSF grants CNS-0644436 and CNS-0627734, and by an Alfred P. Sloan Research Fellowship. The fourth author was supported in part by NSF grants CNS-0720721 and CSR-0917398.

9. REFERENCES
[1] D. Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1987.
[2] S. Bansal and A. Aiken. Automatic generation of peephole superoptimizers. In ASPLOS, 2006.
[3] S. Bansal and A. Aiken. Binary translation using peephole superoptimizers. In OSDI, 2008.
[4] C. Barrett, R. Sebastiani, S. A. Seshia, and C. Tinelli. Satisfiability modulo theories. In Handbook of Satisfiability, volume 4, chapter 8. IOS Press, 2009.
[5] C. Collberg, C. Thomborson, and D. Low. A taxonomy of obfuscating transformations. Technical Report 148, Dept. Comp. Sci., The Univ. of Auckland, July 1997.
[6] A. Cypher, editor. Watch what I do: Programming by demonstration. MIT Press, 1993.
[7] S. A. Goldman and M. J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50:20–31, 1995.
[8] S. Gulwani, S. Jha, A. Tiwari, and R. Venkatesan. Component based synthesis applied to bitvector circuits. Technical Report MSR-TR-2010-12, Microsoft Research, Feb 2010.
[9] S. Jha, S. Gulwani, S. A. Seshia, and A. Tiwari. Oracle-guided component-based program synthesis. Technical Report UCB/EECS-2010-15, EECS Department, University of California, Berkeley, Feb 2010.
[10] T. A. Johnson and R. Eigenmann. Context-sensitive domain-independent algorithm composition and selection. In PLDI, 2006.
[11] R. Joshi, G. Nelson, and K. H. Randall. Denali: A goal-directed superoptimizer. In PLDI, 2002.
[12] E. Kitzelmann and U. Schmid. Inductive synthesis of functional programs: An explanation based generalization approach. J. Machine Learning Res., 7:429–454, 2006.
[13] D. E. Knuth. The art of computer programming. http://www-cs-faculty.stanford.edu/~knuth/taocp.html.
[14] J. Kominek. rot13 implementation. http://www.miranda.org/~jkominek/rot13/, Accessed Sep. 2009.
[15] T. Lau, P. Domingos, and D. S. Weld. Version space algebra and its application to programming by demonstration. In ICML, pages 527–534, 2000.
[16] H. Lieberman, editor. Your wish is my command: Giving users the power to instruct their software. Morgan Kaufmann, 2001.
[17] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, fourier transform, and learnability. In FOCS, pages 574–579, 1989.
[18] D. Mandelin, L. Xu, R. Bodík, and D. Kimelman. Jungloid mining: Helping to navigate the API jungle. In PLDI, pages 48–61, 2005.
[19] Z. Manna and R. Waldinger. A deductive approach to program synthesis. ACM TOPLAS, 2(1):90–121, 1980.
[20] H. Massalin. Superoptimizer - a look at the smallest program. In ASPLOS, pages 122–126, 1987.
[21] MyDoom Wikipedia Article. http://en.wikipedia.org/wiki/Mydoom, URL accessed Sep. 2009.
[22] P. Porras, H. Saidi, and V. Yegneswaran. An analysis of conficker’s logic and rendezvous points. Technical report, SRI International, March 2009.
[23] E. Y. Shapiro. Algorithmic Program DeBugging. MIT Press, Cambridge, MA, USA, 1983.
[24] A. Solar-Lezama, L. Tancau, R. Bodík, S. Seshia, and V. Saraswat. Combinatorial sketching for finite programs. In ASPLOS, 2006.
[25] SRI Intl. Yices: An SMT solver. http://yices.csl.sri.com/.
[26] M. Stickel, R. Waldinger, M. Lowry, T. Pressburger, and I. Underwood. Deductive composition of astronomical software from subroutine libraries. In CADE, 1994.
[27] P. D. Summers. A methodology for lisp program construction from examples. J. ACM, 24(1), 1977.
[28] Symantec Corporation. Internet security threat report volume XIV. http://www.symantec.com/business/theme.jsp?themeid=threatreport, April 2009.
[29] H. S. Warren. Hacker’s Delight. Addison-Wesley Longman, Boston, MA, USA, 2002.