Simple High-Level Code For Cryptographic Arithmetic – With Proofs, Without Compromises

Andres Erbsen, Jade Philipoom, Jason Gross, Robert Sloan, Adam Chlipala
MIT CSAIL, Cambridge, MA, USA
{andreser, jadep, jgross}@mit.edu, rob.sloan@alum.mit.edu, adamc@csail.mit.edu

Abstract—We introduce a new approach for implementing cryptographic arithmetic in short high-level code with machine-checked proofs of functional correctness. We further demonstrate that simple partial evaluation is sufficient to transform such code into the fastest-known C code, breaking the decades-old pattern that the only fast implementations are those whose instruction-level steps were written out by hand. These techniques were used to build an elliptic-curve library that achieves competitive performance for 80 prime fields and multiple CPU architectures, showing that implementation and proof effort scales with the number and complexity of conceptually different algorithms, not their use cases. As one outcome, we present the first verified high-performance implementation of P-256, the most widely used elliptic curve. Implementations from our library were included in BoringSSL to replace existing specialized code, for inclusion in several large deployments for Chrome, Android, and CloudFlare.

I. MOTIVATION

The predominant practice today is to implement cryptographic primitives at the lowest possible level of abstraction: usually flat C code, ideally assembly language if engineering time is plentiful. Low-level implementation has been the only way to achieve good performance, ensure that execution time does not leak secret information, and enable the code to be called from any programming language. Yet placing the burden of every detail on the programmer also leaves a lot to be desired. The demand for implementation experts' time is high, resulting in an ever-growing backlog of algorithms to be reimplemented in specialized code "as soon as possible." Worse, bugs are also common enough that it is often not economical to go beyond fixing the behavior for problematic inputs and understand the root cause and impact of the defect.

There is hope that these issues can eventually be overcome by redoubling community efforts to produce high-quality crypto code. However, even the most renowned implementors are not infallible, and if a typo among thousands of lines of monotonous code can only be recognized as incorrect by top experts,¹ it is reasonable to expect diminishing returns for achieving correctness through manual review. A number of tools for computer-aided verification have been proposed [1]–[6], enabling developers to eliminate incorrect-output bugs conclusively. Three solid implementations of the X25519 Diffie-Hellman function were verified. In the beautiful world where X25519 was the only arithmetic-based crypto primitive we need, now would be the time to declare victory and go home. Yet most of the Internet still uses P-256, and the current proposals for post-quantum cryptosystems are far from Curve25519's combination of performance and simplicity.

¹We refer to the ed25519-amd64-64-24k bug that is discussed along with other instructive real-world implementation defects in Appendix A.

A closer look into the development and verification process (with an eye towards replicating it for another function) reveals a discouraging amount of duplicated effort. First, expert time is spent on mechanically writing out the instruction-level steps required to perform the desired arithmetic, drawing from experience and folklore. Then, expert time is spent to optimize the code by applying a carefully chosen combination of simple transformations. Finally, expert time is spent on annotating the now-munged code with assertions relating values of run-time variables to specification variables, until the SMT solver (used as a black box) is convinced that each assertion is implied by the previous ones. As the number of steps required to multiply two numbers is quadratic in the size of those numbers, effort grows accordingly: for example, "Verifying Curve25519 Software" [1] reports that multiplication modulo 2^255 − 19 required 27 times the number of assertions used for 2^127 − 1. It is important to note that both examples use the same simple modular multiplication algorithm, and the same correctness argument should apply, just with different parameters. This is just one example of how manual repetition of algorithmic steps inside primitive cryptographic arithmetic implementations makes them notoriously difficult to write, review, or prove correct.

Hence we suggest a rather different way of implementing cryptographic primitives: to implement general algorithms in the style that best illustrates their correctness arguments and then derive parameter-specific low-level code through partial evaluation. For example, the algorithm used for multiplication modulo 2^127 − 1 and 2^255 − 19 is first encoded as a purely functional program operating on lists of digit-weight pairs: it generates a list of partial products, shifts down the terms with weight greater than the modulus, and scales them accordingly. It is then instantiated with lists of concrete weights and length, for example ten limbs with weights 2^⌈25.5i⌉, and partially evaluated so that no run-time manipulation of lists and weights remains, resulting in straight-line code very similar to what an expert would have written. As the optimization pipeline is proven correct once and for all, correctness of all code generated from a high-level algorithm can be proven directly
on its high-level implementation. A nice side benefit of this approach is that the same high-level algorithm and proof can be used for multiple target configurations with very little effort, enabling the use of fast implementations in cases that were previously not worth the effort.

In practice, we use the fast arithmetic implementations reported on here in conjunction with a library of verified cryptographic algorithms, the rest of which deviates much less from common implementation practice. All formal reasoning is done in the Coq proof assistant, and the overall trusted computing base also includes a simple pretty-printer and the C language toolchain. The development, called Fiat Cryptography, is available under the MIT license at:

https://github.com/mit-plv/fiat-crypto

The approach we advocate for is already used in widely deployed code. Most applications relying on the implementations described in this paper are using them through BoringSSL, Google's OpenSSL-derived crypto library that includes our Curve25519 and P-256 implementations. A high-profile user is the Chrome Web browser, so today about half of HTTPS connections opened by Web browsers worldwide use our fast verified code (Chrome versions since 65 have about 60% market share [7], and 90% of connections use X25519 or P-256 [8]). BoringSSL is also used in Android and on a majority of Google's servers, so a Google search from Chrome would likely be relying on our code on both sides of the connection. Maintainers appreciate the reduced need to worry about bugs.

II. OVERVIEW

We will explain how we write algorithm implementations in a high-level way that keeps the structural principles as clear and readable as possible, proving functional correctness. Our explanation builds up to showing how other security constraints can be satisfied without sacrificing the appeal of our new strategy. Our output code follows industry-standard "constant-time" coding practices and is only bested in throughput by platform-specific assembly-language implementations (and only by a modest margin). This section will give a tour of the essential ingredients of our methodology, and the next section will explain in detail how to implement field arithmetic for implementation-friendly primes like 2^255 − 19. Building on this intuition (and code!), Section III-F culminates in a specializable implementation of the word-by-word Montgomery reduction algorithm for multiplication modulo any prime. Section IV will explain the details of the compiler that turns the described implementations into the kind of code that implementation experts write by hand for each arithmetic primitive. We report performance measurements in Section V and discuss related work and future directions in Sections VI and VII.

Fig. 1 gives a concrete example of what our framework provides. The algorithm in the top half of the figure is a transcription of the word-by-word Montgomery reduction algorithm, following Gueron's Algorithm 4 [9]. As is required by the current implementation of our optimization pipeline, this parameter-agnostic code is written in continuation-passing style where functions do not return normally but instead take as arguments callbacks to invoke when answers become available. The notation λ '(x, y), ... encodes an anonymous function with arguments x and y.

The bottom half of the figure shows the result of specializing the general algorithm to P-256, the elliptic curve most commonly used in TLS and unfortunately not supported by any of the verified libraries discussed earlier. Note that some operations influence the generated code by creating explicit instruction sequences (highlighted in matching colors), while other operations instead control which input expressions should be fed into others. The number of low-level lines per high-level line varies between curves and target hardware architectures. The generated code is accompanied by a Coq proof of correctness stated in terms of the straightline-code language semantics and the Coq standard-library definitions of integer arithmetic.

Input:

    Definition wwmm_step A B k S ret :=
      divmod_r_cps A (λ '(A, T1),
      @scmul_cps r _ T1 B _ (λ aB,
      @add_cps r _ S aB _ (λ S,
      divmod_r_cps S (λ '(_, s),
      mul_split_cps' r s k (λ '(q, _),
      @scmul_cps r _ q M _ (λ qM,
      @add_longer_cps r _ S qM _ (λ S,
      divmod_r_cps S (λ '(S, _),
      @drop_high_cps (S R_numlimbs) S _ (λ S,
      ret (A, S) ))))))))).

    Fixpoint wwmm_loop A B k len_A ret :=
      match len_A with
      | O => ret
      | S len_A' => λ '(A, S),
        wwmm_step A B k S (wwmm_loop A B k len_A' ret)
      end.

Output:

    void wwmm_p256(u64 out[4], u64 x[4], u64 y[4]) {
      u64 x17, x18 = mulx_u64(x[0], y[0]);
      u64 x20, x21 = mulx_u64(x[0], y[1]);
      u64 x23, x24 = mulx_u64(x[0], y[2]);
      u64 x26, x27 = mulx_u64(x[0], y[3]);
      u64 x29, u8 x30 = addcarryx_u64(0x0, x18, x20);
      u64 x32, u8 x33 = addcarryx_u64(x30, x21, x23);
      u64 x35, u8 x36 = addcarryx_u64(x33, x24, x26);
      u64 x38, u8 _ = addcarryx_u64(0x0, x36, x27);
      u64 x41, x42 = mulx_u64(x17, 0xffffffffffffffff);
      u64 x44, x45 = mulx_u64(x17, 0xffffffff);
      u64 x47, x48 = mulx_u64(x17, 0xffffffff00000001);
      u64 x50, u8 x51 = addcarryx_u64(0x0, x42, x44);
      // 100 more lines...

Fig. 1: Word-by-word Montgomery Multiplication, output specialized to P-256 (2^256 − 2^224 + 2^192 + 2^96 − 1)

A. High-Level Arithmetic Library Designed For Specialization

At the most basic level, all cryptographic arithmetic algorithms come down to implementing large mathematical objects (integers or polynomials of several hundred bits) in terms of arithmetic provided by the hardware platform (for example integers modulo 2^64). To start out with, let us consider the
example of addition, with the simplifying precondition that no digits exceed half of the hardware-supported maximum.

    type num = list N

    add : num → num → num
    add (a :: as) (b :: bs) = let x = a + b in x :: add as bs
    add as [] = as
    add [] bs = bs

To give the specification of this function acting on a base-2^64 little-endian representation, we define an abstraction function evaluating each digit list ℓ into a single number.

    ⌊·⌋ : num → N
    ⌊a :: as⌋ = a + 2^64 ⌊as⌋
    ⌊[]⌋ = 0

To prove correctness, we show (by naive induction on a) that evaluating the result of add behaves as expected:

    ∀a, b. ⌊add a b⌋ = ⌊a⌋ + ⌊b⌋

Using the standard list type in Coq, the theorem becomes

    Lemma eval_add : ∀a b, ⌊add a b⌋ = ⌊a⌋ + ⌊b⌋.
    Proof. induction a,b; cbn; rewrite ?IHa; nia. Qed.

The above proof is only possible because we are intentionally avoiding details of the underlying machine arithmetic – if the list of digits were defined to contain 64-bit words instead of N, repeated application of add would eventually result in integer overflow. This simplification is an instance of the pattern of anticipating low-level optimizations in writing high-level code: we do expect to avoid overflow, and our choice of a digit representation is motivated precisely by that aim. It is just that the proofs of overflow-freedom will be injected in a later stage of our pipeline, as long as earlier stages like our current one are implemented correctly. There is good reason for postponing this reasoning: generally we care about the context of higher-level code calling our arithmetic primitives. In an arbitrary context, an add implemented using 64-bit words would need to propagate carries from each word to the next, causing an unnecessary slowdown when called with inputs known to be small.

B. Partial Evaluation For Specific Parameters

It is impossible to achieve competitive performance with arithmetic code that manipulates dynamically allocated lists at runtime. The fastest code will implement, for example, a single numeric addition with straightline code that keeps as much state as possible in registers. Expert implementers today write that straightline code manually, applying various rules of thumb. Our alternative is to use partial evaluation in Coq to generate all such specialized routines, beginning with a single library of high-level functional implementations that generalize the patterns lurking behind hand-written implementations favored today.

Consider the case where we know statically that each number we add will have 3 digits. A particular addition in our top-level algorithm may have the form add [a1, a2, a3] [b1, b2, b3], where the a_i s and b_i s are unknown program inputs. While we cannot make compile-time simplifications based on the values of the digits, we can reduce away all the overhead of dynamic allocation of lists. We use Coq's term-reduction machinery, which allows us to choose λ-calculus-style reduction rules to apply until reaching a normal form. Here is what happens with our example, when we ask Coq to leave let and + unreduced but apply other rules.

    add [a1, a2, a3] [b1, b2, b3] ⇓
      let n1 = a1 + b1 in n1 ::
      let n2 = a2 + b2 in n2 ::
      let n3 = a3 + b3 in n3 :: []

We have made progress: no run-time case analysis on lists remains. Unfortunately, let expressions are intermixed with list constructions, leading to code that looks rather different than assembly. To persuade Coq's built-in term reduction to do what we want, we first translate arithmetic operations to continuation-passing style. Concretely, we can rewrite add.

    add′ : ∀α. num → num → (num → α) → α
    add′ (a :: as) (b :: bs) k =
      let n = a + b in
      add′ as bs (λℓ. k (n :: ℓ))
    add′ as [] k = k as
    add′ [] bs k = k bs

Reduction turns this function into assembly-like code.

    add′ [a1, a2, a3] [b1, b2, b3] (λℓ. ℓ) ⇓
      let n1 = a1 + b1 in
      let n2 = a2 + b2 in
      let n3 = a3 + b3 in
      [n1, n2, n3]

When add′ is applied to a particular continuation, we can also reduce away the result list. Chaining together sequences of function calls leads to idiomatic and efficient assembly-style code, based just on Coq's normal term reduction, preserving sharing of let-bound variables. This level of inlining is common for the inner loops of crypto primitives, and it will also simplify the static analysis described in the next subsection.

C. Word-Size Inference

Up to this point, we have derived code that looks almost exactly like the assembly code we want to produce. The code is structured to avoid overflows when run with fixed-precision integers, but so far it is only proven correct for natural numbers. The final major step is to infer a range of possible values for each variable, allowing us to assign each one a register or stack-allocated variable of the appropriate bit width.

The bounds-inference pass works by standard abstract interpretation with intervals. As inputs, we require lower and upper bounds for the integer values of all arguments of a function. These bounds are then pushed through all operations to infer bounds for temporary variables. Each temporary is assigned the smallest bit width that can accommodate its full interval.
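The interval analysis just described is easy to model concretely. The following Python sketch is our own illustration (not the paper's certified Coq pass); the function names and the 32/64-bit size menu are assumptions for the example. It pushes input intervals through the three-limb addition and picks the smallest machine width for each temporary:

```python
# Bounds inference by interval abstract interpretation: a minimal
# sketch (illustrative only; the real pass is a certified Coq compiler).

def add_interval(x, y):
    """Abstract addition on nonnegative intervals (lo, hi)."""
    return (x[0] + y[0], x[1] + y[1])

def width_bits(interval, sizes=(32, 64)):
    """Smallest available register width whose range covers the interval."""
    lo, hi = interval
    return next(w for w in sizes if hi <= 2**w - 1)

# The artificial example from the text:
a1 = a2 = a3 = b1 = (0, 2**31)
b2 = b3 = (0, 2**30)

n1 = add_interval(a1, b1)  # (0, 2^32): just barely too big for 32 bits
n2 = add_interval(a2, b2)  # (0, 2^31 + 2^30): fits in 32 bits
n3 = add_interval(a3, b3)

assert [width_bits(n) for n in (n1, n2, n3)] == [64, 32, 32]
```

As in the text, the first temporary is assigned a 64-bit variable while the second two fit in 32 bits, and the same addition inlined under different input bounds would get different widths.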
    prime                                      architecture  # limbs  base     representation (distributing large x into x0 ... xn)
    2^256 − 2^224 + 2^192 + 2^96 − 1 (P-256)   64-bit        4        2^64     x = x0 + 2^64 x1 + 2^128 x2 + 2^192 x3
    2^255 − 19 (Curve25519)                    64-bit        5        2^51     x = x0 + 2^51 x1 + 2^102 x2 + 2^153 x3 + 2^204 x4
    2^255 − 19 (Curve25519)                    32-bit        10       2^25.5   x = x0 + 2^26 x1 + 2^51 x2 + 2^77 x3 + ... + 2^204 x8 + 2^230 x9
    2^448 − 2^224 − 1 (p448)                   64-bit        8        2^56     x = x0 + 2^56 x1 + 2^112 x2 + ... + 2^392 x7
    2^127 − 1                                  64-bit        3        2^42.5   x = x0 + 2^43 x1 + 2^85 x2

Fig. 2: Examples of big-integer representations for different primes and integer widths

As an artificial example, assume the input bounds a1, a2, a3, b1 ∈ [0, 2^31]; b2, b3 ∈ [0, 2^30]. The analysis concludes n1 ∈ [0, 2^32]; n2, n3 ∈ [0, 2^30 + 2^31]. The first temporary is just barely too big to fit in a 32-bit register, while the second two will fit just fine. Therefore, assuming the available temporary sizes are 32-bit and 64-bit, we can transform the code with precise size annotations.

    let n1 : N_{2^64} = a1 + b1 in
    let n2 : N_{2^32} = a2 + b2 in
    let n3 : N_{2^32} = a3 + b3 in
    [n1, n2, n3]

Note how we may infer different temporary widths based on different bounds for arguments. As a result, the same primitive inlined within different larger procedures may get different bounds inferred. World-champion code for real algorithms takes advantage of this opportunity.

This phase of our pipeline is systematic enough that we chose to implement it as a certified compiler. That is, we define a type of abstract syntax trees (ASTs) for the sorts of programs that earlier phases produce, we reify those programs into our AST type, and we run compiler passes written in Coq's Gallina functional programming language. Each pass is proved correct once and for all, as Section IV explains in more detail.

D. Compilation To Constant-Time Machine Code

What results is straightline code very similar to that written by hand by experts, represented as ASTs in a simple language with arithmetic and bitwise operators. Our correctness proofs connect this AST to specifications in terms of integer arithmetic, such as the one for add above. All operations provided in our lowest-level AST are implemented with input-independent execution time in many commodity compilers and processors, and if so, our generated code is trivially free of timing leaks. Each function is pretty-printed as C code and compiled with a normal C compiler, ready to be benchmarked or included in a library. We are well aware that top implementation experts can translate C to assembly better than the compilers, and we do not try to compete with them: while better instruction scheduling and register allocation for arithmetic-heavy code would definitely be valuable, it is outside the scope of this project. Nevertheless, we are excited to report that our library generates the fastest-known C code for all operations we have benchmarked (Section V).

III. ARITHMETIC TEMPLATE LIBRARY

Recall Section II-A's toy add function for little-endian big integers. We will now describe our full-scale library, starting with the core unsaturated arithmetic subset, the foundation for all big-integer arithmetic in our development. For those who prefer to read code, we suggest src/Demo.v in the framework's source code, which contains a succinct standalone development of the unsaturated-arithmetic library up to and including prime-shape-aware modular reduction. The concrete examples derived in this section are within established implementation practice, and an expert would be able to reproduce them given an informal description of the strategy. Our contribution is to encode this general wisdom in concrete algorithms and provide these with correctness certificates without sacrificing the programmer's sanity.

A. Multi-Limbed Arithmetic

Cryptographic modular arithmetic implementations distribute very large numbers across smaller "limbs" of 32- or 64-bit integers. Fig. 2 shows a small sample of fast representations for different primes. Notice that many of these implementations use bases other than 2^32 or 2^64, leaving bits unused in each limb: these are called unsaturated representations. Conversely, the ones using all available bits are called saturated. Another interesting feature shown in the examples is that the exponents of some bases, such as the original 32-bit Curve25519 representation [10], are not whole numbers. In the actual representation, this choice corresponds to an alternating pattern, so "base 2^25.5" uses 26 bits in the first limb, 25 in the second, 26 in the third, and so on. Because of the alternation, these are called mixed-radix bases, as opposed to uniform-radix ones. These unorthodox big integers are fast primarily because of a specific modular-reduction trick, which is most efficient when the number of bits in the prime corresponds to a limb boundary. For instance, reduction modulo 2^255 − 19 is fastest when the bitwidths of a prefix of limbs sum to 255. Every unsaturated representation in our examples is designed to fulfill this criterion.

B. Partial Modular Reduction

Suppose we are working modulo a k-bit prime m. Multiplying two k-bit numbers can produce up to 2k bits, potentially much larger than m. However, if we only care about what the result is mod m, we can perform a partial modular
    s × t = 1 × s0 t0 + 2^43 × s0 t1 + 2^85 × s0 t2
          + 2^43 × s1 t0 + 2^86 × s1 t1 + 2^128 × s1 t2
          + 2^85 × s2 t0 + 2^128 × s2 t1 + 2^170 × s2 t2
        = s0 t0 + 2^43 (s0 t1 + s1 t0) + 2^85 (s0 t2 + 2 s1 t1 + s2 t0)
          + 2^127 (2 s1 t2 + 2 s2 t1) + 2^170 × s2 t2

Fig. 3: Distributing terms for multiplication mod 2^127 − 1

reduction to reduce the upper bound while preserving modular equivalence. (The reduction is "partial" because the result is not guaranteed to be the minimal residue.)

The most popular choices of primes in elliptic-curve cryptography are of the form m = 2^k − c_l 2^{t_l} − ... − c_0 2^{t_0}, encompassing what have been called "generalized Mersenne primes," "Solinas primes," "Crandall primes," "pseudo-Mersenne primes," and "Mersenne primes." Although any number could be expressed this way, and the algorithms we describe would still apply, choices of m with relatively few terms (l ≪ k) and small c_i facilitate fast arithmetic.

Set s = 2^k and c = c_l 2^{t_l} + ... + c_0 2^{t_0}, so m = s − c. To reduce x mod m, first split x by finding a and b such that x = a + sb. Then, a simple derivation yields a division-free procedure for partial modular reduction.

    x mod m = (a + sb) mod (s − c)
            = (a + (s − c)b + cb) mod (s − c)
            = (a + cb) mod m

The choice of a and b does not further affect the correctness of this formula, but it does influence how much the input is reduced: picking a = x and b = 0 would make this formula a no-op. One might pick a = x mod s, although the formula does not require it. Here is where careful choices of big-integer representation help: having s be at a limb boundary allows for a good split without any computation!

Our Coq proof of this trick is reproduced here:

    Lemma reduction_rule a b s c (_: s - c <> 0)
      : (a + s * b) mod (s - c) =
        (a + c * b) mod (s - c).
    Proof.
      replace (a+s*b) with ((a+c*b)+b*(s-c)).
      rewrite add_mod, mod_mult, add_0_r, mod_mod.
      all: auto; nsatz.
    Qed.

C. Example: Multiplication Modulo 2^127 − 1

Before describing the general modular-reduction algorithm implemented in our library, we will walk through multiplication specialized to just one modulus and representation. To simplify matters a bit, we use the (relatively) small modulus 2^127 − 1. Say we want to multiply 2 numbers s and t in its field, with those inputs broken up as s = s0 + 2^43 s1 + 2^85 s2 and t = t0 + 2^43 t1 + 2^85 t2. Distributing multiplication repeatedly over addition gives us the answer form shown in Fig. 3.

We format the first intermediate term suggestively: down each column, the powers of two are very close together, differing by at most one. Therefore, it is easy to add down the columns to form our final answer, split conveniently into digits with integral bit widths.

At this point we have a double-wide answer for multiplication, and we need to do modular reduction to shrink it down to single-wide. For our example, the last two digits can be rearranged so that the modular-reduction rule applies:

    2^127 (2 s1 t2 + 2 s2 t1) + 2^170 s2 t2           (mod 2^127 − 1)
    = 2^127 ((2 s1 t2 + 2 s2 t1) + 2^43 s2 t2)        (mod 2^127 − 1)
    = 1 · ((2 s1 t2 + 2 s2 t1) + 2^43 s2 t2)          (mod 2^127 − 1)

As a result, we can merge the second-last digit into the first and merge the last digit into the second, leading to this final formula specifying the limbs of a single-width answer:

    (s0 t0 + 2 s1 t2 + 2 s2 t1) + 2^43 (s0 t1 + s1 t0 + s2 t2) + 2^85 (s0 t2 + 2 s1 t1 + s2 t0)

D. Associational Representation

As is evident by now, the most efficient code makes use of sophisticated and specific big-number representations, but many concrete implementations operate on the same set of underlying principles. Our strategy of writing high-level templates allows us to capture the underlying principles in a fully general way, without committing to a run-time representation. Abstracting away the implementation-specific details, like the exact number of limbs or whether the base system is mixed- or uniform-radix, enables simple high-level proofs of all necessary properties.

A key insight that allows us to simplify the arithmetic proofs is to use multiple different representations for big integers, as long as changes between these representations can be simplified away once the prime shape is known. We have two main representations of big integers in the core library: associational and positional. A big integer in associational form is represented as a list of pairs; the first element in each pair is a weight, known at compile time, and the second is a runtime value. The decimal number 95 might be encoded as [(16, 5); (1, 15)], or equivalently as [(1, 5); (10, 9)] or [(1, 1); (1, 6); (44, 2)]. As long as the sum of products is correct, we do not care about the order, whether the weights are powers of each other, or whether there are multiple pairs of equal weight.
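The associational representation and its basic operations can be sketched in a few lines of Python (an illustrative model under our own naming; the library's definitions are the Coq functions over lists of pairs):

```python
# Associational representation: a list of (compile-time weight,
# runtime value) pairs denoting the sum of the products.
# Illustrative model; names are ours, not the library's.

def eval_assoc(p):
    return sum(w * x for (w, x) in p)

def add_assoc(p, q):
    """Addition is just concatenation of the lists."""
    return p + q

def mul_assoc(p, q):
    """Schoolbook multiplication: weights multiply at compile time,
    runtime values multiply at run time."""
    return [(w1 * w2, x1 * x2) for (w1, x1) in p for (w2, x2) in q]

# The three equivalent encodings of 95 from the text:
for enc in ([(16, 5), (1, 15)],
            [(1, 5), (10, 9)],
            [(1, 1), (1, 6), (44, 2)]):
    assert eval_assoc(enc) == 95

# Multiplying two weight-2^26 pairs just keeps the out-of-base
# weight 2^52 instead of forcing it back to a base weight:
assert mul_assoc([(2**26, 3)], [(2**26, 5)]) == [(2**52, 15)]
```

Because correctness is phrased only in terms of the weighted sum, properties like eval_assoc(mul_assoc(p, q)) == eval_assoc(p) * eval_assoc(q) hold with no side conditions on the weights.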
    Definition mul (p q:list(Z*Z)) : list(Z*Z) :=
      flat_map (λt,
        map (λt', (fst t*fst t', snd t*snd t')) q) p.

    Lemma eval_map_mul (a x:Z) (p:list (Z*Z))
      : eval (map (λt, (a*fst t, x*snd t)) p)
        = a*x * eval p.
    Proof. induction p; push; nsatz. Qed.

    Lemma eval_mul p q
      : eval (mul p q) = eval p * eval q.
    Proof. induction p; cbv [mul]; push; nsatz. Qed.

Fig. 4: Definition and correctness proof of multiplication

In associational format, arithmetic operations are extremely easy to reason about. Addition is just concatenation of the lists. Schoolbook multiplication (see Fig. 4) is also simple: (a1 · x1 + ...)(b1 · y1 + ...) = (a1 b1 · x1 y1 + ...), where a1 b1 is the new compile-time-known weight. Because of the flexibility allowed by associational representation, we do not have to worry about whether the new compile-time weight produced is actually in the base. If we are working with the base 2^25.5 in associational representation, and we multiply two pairs that both have weight 2^26, we can just keep the 2^52 rather than the base's 2^51 weight.

Positional representation, on the other hand, uses lists of runtime values with weight functions. There is only one runtime value for each weight, and order matters. This format is closer to how a number would actually be represented at run-time. Separating associational from positional representations lets us separate reasoning about arithmetic from reasoning about making weights line up. Especially in a system that must account for mixed-radix bases, the latter step is nontrivial and is better not mixed with other reasoning.

Fig. 5 shows the code and proofs for conversion from associational to positional format, including the fixing of misaligned weights. The helper function place takes a pair t = (w, x) (compile-time weight and runtime value) and recursively searches for the highest-index output of the weight function that w is a multiple of. After finding it, place returns an index i and the runtime value to be added at that index: (w/weight_i · x). Then, from_associational simply adds each limb at the correct index.

    Fixpoint place (t:Z*Z) (i:nat) : nat*Z :=
      if dec (fst t mod weight i = 0) then
        let c := fst t / weight i in (i, c*snd t)
      else match i with
           | S i' => place t i'
           | O => (O, fst t*snd t) end.

    Definition from_associational n p :=
      List.fold_right (λt,
        let wx := place t (pred n) in
        add_to_nth (fst wx) (snd wx) )
        (zeros n) p.

    Lemma place_in_range (t:Z*Z) (n:nat)
      : (fst (place t n) < S n)%nat.
    Proof. cbv [place]; induction n; break_match;
      autorewrite with cancel_pair; omega. Qed.

    Lemma weight_place t i
      : weight (fst(place t i)) * snd(place t i)
        = fst t * snd t.
    Proof. (* ... 4 lines elided ... *) Qed.

    Lemma eval_from_associational {n} p
      : eval (from_associational (S n) p)
        = Associational.eval p.
    Proof.
      cbv [from_associational]; induction p;
      push;
      try pose proof place_in_range a n;
      try omega; try nsatz. Qed.

Fig. 5: Coq definition and selected correctness proofs of conversion from associational to positional representation

The modular-reduction trick we explained in Section III-B can also be succinctly represented in associational format (Fig. 6). First we define split, which partitions the list, separating out pairs with compile-time weights that are multiples of the number at which we are splitting (s, in the code). Then it divides all the weights that are multiples of s by s and returns both lists. The reduce function simply multiplies the "high" list by c and adds the result to the "low" list. Using the mul we defined earlier recreates the folklore modular-reduction pattern for prime shapes in which c is itself multi-limb.

E. Carrying

In unsaturated representations, it is not necessary to carry immediately after every addition. For example, with 51-bit limbs on a 64-bit architecture, it would take 13 additions to risk overflow. Choosing which limbs to carry and when is part of the design and is critical for keeping the limb values bounded. Generic operations are easily parameterized on carry strategies, for example "after each multiplication carry from limb 4 to limb 5, then from limb 0 to limb 1, then from limb 5 to limb 6," etc. The library includes a conservative default.

F. Saturated Arithmetic

The code described so far was designed with unsaturated representations in mind. However, unsaturated-arithmetic performance degrades rapidly when used with moduli that it is not a good match for, so older curves such as P-256 need to be implemented using saturated arithmetic. Basic arithmetic on saturated digits is complicated by the fact that addition and multiplication outputs may not fit in single words, necessitating the use of two-output addition and multiplication
… ones are lists of integers (list Z), columns representations are lists of lists of integers (list (list Z)). Columns representations are very similar to positional ones, except that the values at each position have not yet been added together. Starting from the columns representation, we can add terms together in a specific order so that carry bits are used right after they are produced, enabling them to be stored in flag registers in assembly code. The code to convert from associational to columns is very similar to associational-to-positional conversion (Fig. 7). With these changes in place, much of the code shown before for associational works on positional representations as well. For instance, the partial modular-reduction

    Definition split (s:Z) (p:list (Z*Z))
      : list(Z*Z)*list(Z*Z) :=
      let hl :=
        partition (λt, fst t mod s =? 0) p in
      (snd hl, map (λt, (fst t / s, snd t)) (fst hl)).

    Definition reduce (s:Z) (c:list _) (p:list _)
      : list(Z*Z) :=
      let lh := split s p in
      fst lh ++ mul c (snd lh).

    Lemma eval_split s p (s_nz: s <> 0) :
      eval (fst (split s p))
      + s * eval (snd (split s p))
      = eval p.
    Proof. (* ... (7 lines elided) *) Qed.

    Lemma eval_reduce s c p
      (s_nz: s <> 0) (m_nz: s - eval c <> 0) :
      eval (reduce s c p) mod (s - eval c)
      = eval p mod (s - eval c).
    Proof. cbv [reduce]; push. rewrite

    Lemma S2_mod
      : eval (sat_add S1 (scmul q m)) mod m
      = eval S1 mod m.
    Proof. push; zsimplify; auto. Qed.

Fig. 8: Proof that adjusting S to be 0 mod m preserves the correct value mod m
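The split and reduce definitions above can be modeled directly in Python (a sketch under our own naming; the library versions are the Coq definitions shown here), which also lets us check the reduction rule of Section III-B on a concrete modulus:

```python
# Sketch of split and reduce over (weight, value) pairs
# (illustrative; mirrors the Coq definitions, names are ours).

def eval_assoc(p):
    return sum(w * x for (w, x) in p)

def mul_assoc(p, q):
    return [(w1 * w2, x1 * x2) for (w1, x1) in p for (w2, x2) in q]

def split(s, p):
    """Separate out terms whose compile-time weight is a multiple
    of s, dividing those weights by s."""
    lo = [(w, x) for (w, x) in p if w % s != 0]
    hi = [(w // s, x) for (w, x) in p if w % s == 0]
    return lo, hi

def reduce(s, c, p):
    """Partial modular reduction: a + s*b ≡ a + c*b (mod s - eval c).
    c is itself an associational value, so multi-limb c works too."""
    lo, hi = split(s, p)
    return lo + mul_assoc(c, hi)

# Check modulo 2^127 - 1, i.e. s = 2^127 and c = [(1, 1)]:
s, m = 2**127, 2**127 - 1
p = [(1, 11), (2**85, 3), (2**170, 7)]  # a double-wide value
assert eval_assoc(reduce(s, [(1, 1)], p)) % m == eval_assoc(p) % m
```

The term of weight 2^170 is replaced by one of weight 2^43, shrinking the value while preserving its residue, exactly as in the eval_reduce lemma.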
more significant limbs of a commutes with adding multiples of m. Even better, adding a multiple of m and a multiple of b can only increase the length of S by one limb, balancing out the loss of one limb when dividing S by r and allowing for an implementation that loops through limbs of a while operating in-place on S. The proof of the last fact is delicate – for example, it is not true that one of the additions always produces a newly nonzero value in the most significant limb and the other never does. In our library, this subtlety is handled by a 15-line/100-word proof script; the high-level steps, shown in Fig. 9, admittedly fail to capture the amount of thought that was required to find the invariant that makes the bounds line up exactly right.

  Given: eval S < eval m + eval B
  To show:
    eval (div_r (add (add S (scmul a B)) (scmul q m))) < eval m + eval B
    (eval S + a ∗ eval B + q ∗ eval m)/r < eval m + eval B
    (eval m ∗ r + r ∗ eval B − 1)/r < eval m + eval B
    eval m + eval B − 1 < eval m + eval B

Fig. 9: Intermediate goals of the proof that the word-by-word Montgomery reduction state value remains bounded by a constant, generated by single-stepping our proof script

IV. CERTIFIED BOUNDS INFERENCE

Recall from Section II-B how we use partial evaluation to specialize the functions from the last section to particular parameters. The resulting code is elementary enough that it becomes more practical to apply relatively well-understood ideas from certified compilers. That is, as sketched in Section II-C, we can define an explicit type of program abstract syntax trees (ASTs), write compiler passes over it as Coq functional programs, and prove those passes correct.

A. Abstract Syntax Trees

The results of partial evaluation fit, with minor massaging, into this intermediate language:

  Base types b
  Types τ ::= b | unit | τ × τ
  Variables x
  Operators o
  Expressions e ::= x | o(e) | () | (e, e) | let (x1, ..., xn) = e in e

Types are trees of pair-type operators × where the leaves are one-element unit types and base types b, the latter of which come from a domain that is a parameter to our compiler. It will be instantiated differently for different target hardware architectures, which may have different primitive integer types. When we reach the certified compiler's part of the pipeline, we have converted earlier uses of lists into tuples, so we can optimize away any overhead of such value packaging.

Another language parameter is the set of available primitive operators o, each of which takes a single argument, which is often a tuple of base-type values. Our let construct bakes in destructuring of tuples, in fact using typing to ensure that all tuple structure is deconstructed fully, with variables bound only to the base values at a tuple's leaves. Our deep embedding of this language in Coq uses dependent types to enforce that constraint, along with usual properties like lack of dangling variable references and type agreement between operators and their arguments.

Several of the key compiler passes are polymorphic in the choices of base types and operators, but bounds inference is specialized to a set of operators. We assume that each of the following is available for each type of machine integers (e.g., 32-bit vs. 64-bit).

  Integer literals: n
  Unary arithmetic operators: −e
  Binary arithmetic operators: e1 + e2, e1 − e2, e1 × e2
  Bitwise operators: e1 ≪ e2, e1 ≫ e2, e1 & e2, e1 | e2
  Conditionals: if e1 ≠ 0 then e2 else e3
  Carrying: addWithCarry(e1, e2, c), carryOfAdd(e1, e2, c)
  Borrowing: subWithBorrow(c, e1, e2), borrowOfSub(c, e1, e2)
  Two-output multiplication: mul2(e1, e2)

We explain only the last three categories, since the earlier ones are familiar from C programming. To chain together multiword additions, as discussed in the prior section, we need to save overflow bits (i.e., carry flags) from earlier additions, to use as inputs into later additions. The addWithCarry operation implements this three-input form, while carryOfAdd extracts the new carry flag resulting from such an addition. Analogous operators support subtraction with borrowing, again in the grade-school-arithmetic sense. Finally, we have mul2 to multiply two numbers to produce a two-number result, since multiplication at the largest available word size may produce outputs too large to fit in that word size.

All operators correspond directly to common assembly instructions. Thus the final outputs of compilation look very much like assembly programs, just with unlimited supplies of temporary variables, rather than registers.

  Operands O ::= x | n
  Expressions e ::= let (x1, ..., xn) = o(O, ..., O) in e | (O, ..., O)

We no longer work with first-class tuples. Instead, programs are sequences of primitive operations, applied to constants and variables, binding their (perhaps multiple) results to new variables. A function body, represented in this type, ends in the function's (perhaps multiple) return values. Such functions are easily pretty-printed as C code, which is how we compile them for our experiments. Note also that the language enforces the constant-time security property by construction: the running time of an expression leaks no information about the values of the free variables. We do hope, in follow-on work, to prove that a particular compiler preserves the constant-time property in generated assembly. For now, we presume that popular compilers like GCC and Clang do preserve constant time, modulo the usual cycle
of occasional bugs and bug fixes. (One additional source-level restriction is important, forcing conditional expressions to be those supported by native processor instructions like conditional move.)

B. Phases of Certified Compilation

To begin the certified-compilation phase of our pipeline, we need to reify native Coq programs as terms of this AST type. To illustrate the transformations we perform on ASTs, we walk through what the compiler does to an example program:

  let (x1, x2, x3) = x in
  let (y1, y2) = ((let z = x2 × 1 × x3 in z + 0), x2) in
  y1 × y2 × x1

The first phase is linearize, which cancels out all intermediate uses of tuples and immediate let-bound variables and moves all lets to the top level.

  let (x1, x2, x3) = x in
  let z = x2 × 1 × x3 in
  let y1 = z + 0 in
  y1 × x2 × x1

Next is constant folding, which applies simple arithmetic identities and inlines constants and variable aliases.

  let (x1, x2, x3) = x in
  let z = x2 × x3 in
  z × x2 × x1

At this point we run the core phase, bounds inference, the one least like the phases of standard C compilers. The phase is parameterized over a list of available fixed-precision base types with their ranges; for our example, assume the hardware supports bit sizes 8, 16, 32, and 64. Intervals for program inputs, like x in our running example, are given as additional inputs to the algorithm. Let us take them to be as follows: x1 ∈ [0, 2^8], x2 ∈ [0, 2^13], x3 ∈ [0, 2^18]. The output of the algorithm has annotated each variable definition and arithmetic operator with a finite type.

  let (x1 : N2^16, x2 : N2^16, x3 : N2^32) = x in
  let z : N2^32 = x2 ×N2^32 x3 in
  z ×N2^64 x2 ×N2^64 x1

Our biggest proof challenge here was in the interval rules for bitwise operators applied to negative numbers, a subject mostly missing from Coq's standard library.

C. Important Design Choices

Most phases of the compiler use a term encoding called parametric higher-order abstract syntax (PHOAS) [11]. Briefly, this encoding uses variables of the metalanguage (Coq's Gallina) to encode variables of the object language, to avoid most kinds of bookkeeping about variable environments; and for the most part we found that it lived up to that promise. However, we needed to convert to a first-order representation (de Bruijn indices) and back for the bounds-inference phase, essentially because it calls for a forward analysis followed by a backward transformation: calculate intervals for variables, then rewrite the program bottom-up with precise base types for all variables. We could not find a way with PHOAS to write a recursive function that returns both bounds information and a new term, taking better than quadratic time, while it was trivial to do with first-order terms. We also found that the established style of term well-formedness judgment for PHOAS was not well-designed for large, automatically generated terms like ours: proving well-formedness would frequently take unreasonably long, as the proof terms are quadratic in the size of the syntax tree. The fix was to switch well-formedness from an inductive definition into an executable recursive function that returns a linear number of simple constraints in propositional logic.

D. Extensibility with Nonobvious Algebraic Identities

Classic abstract interpretation with intervals works surprisingly well for most of the cryptographic code we have generated. However, a few spans of code call for some algebra to establish the tightest bounds. We considered extending our abstract interpreter with general algebraic simplification, perhaps based on E-graphs as in SMT solvers [12], but in the end we settled on an approach that is both simpler and more accommodating of unusual reasoning patterns. The best-known among our motivating examples is Karatsuba's multiplication, which expresses a multi-digit multiplication in a way that uses fewer single-word multiplications (relatively slow) but more additions (relatively fast), where bounds checking should recognize algebraic equivalence to a simple multiplication. Naive analysis would treat (a + b)(c + d) − ac equivalently to (a + b)(c + d) − xy, where x and y are known to be within the same ranges as a and c but are not necessarily equal to them. The latter expression can underflow. We add a new primitive operator equivalently such that, within the high-level functional program, we can write the above expression as equivalently((a + b)(c + d) − ac, ad + bc + bd). The latter argument is used for bounds analysis. The soundness theorem of bounds analysis requires, as a side condition, equality for all pairs of arguments to equivalently. These side conditions are solved automatically during compilation using Coq tactics, in this case ring algebra. Compilation then replaces any equivalently(e1, e2) with just e1.

V. EXPERIMENTAL RESULTS

The first purpose of this section is to confirm that implementing optimized algorithms in high-level code and then separately specializing to concrete parameters actually achieves the expected performance. Given the previous sections, this conclusion should not be surprising: as code generation is extremely predictable, it is fair to think of the high-level implementations as simple templates for low-level code. Thus, it would be possible to write new high-level code to mimic every low-level implementation. Such a strategy would, of course, be incredibly unsatisfying, although perhaps still worthwhile as a means to reduce effort required for correctness proofs. Thus, we wish additionally to demonstrate that the two simple
generic-arithmetic implementations discussed in Sections III-B and III-F are sufficient to achieve good performance in all cases. Furthermore, we do our best to call out differences that enable more specialized implementations to be faster, speculating on how we might modify our framework in the future based on those differences.

A. X25519 Scalar Multiplication

We measure the number of CPU cycles different implementations take to multiply a secret scalar and a public Curve25519 point (represented by the x coordinate in Montgomery coordinates). Despite the end-to-end task posed for this benchmark, we believe the differences between implementations we compare against lie in the field-arithmetic implementation – all use the same scalar-multiplication algorithm and two almost-identical variants of the Montgomery-curve x-coordinate differential-addition formulas.

The benchmarks were run on an Intel Broadwell i7-5600U processor in a kernel module with interrupts, power management, Hyper-Threading, and Turbo Boost features disabled. Each measurement represents the average of a batch of 15,000 consecutive trials, with time measured using the RDTSC instruction and converted to CPU cycles by multiplying by the ratio of CPU and timestamp-counter clock speeds. C code was compiled using gcc 7.3 with -O3 -march=native -mtune=native -fwrapv (we also tried clang 5.0, but it produced ∼10% slower code for the implementations here).

  Implementation        CPU cycles
  amd64-64, asm             151586
  this work B, 64-bit       152195
  sandy2x, asm              154313
  hacl-star, 64-bit         154982
  donna64, 64-bit C         168502
  this work A, 64-bit       174637
  this work, 32-bit         310585
  donna32, 32-bit C         529812

In order, we compare against amd64-64 and sandy2x, the fastest assembly implementations from SUPERCOP [13] that use scalar and vector instructions respectively; the verified X25519 implementation from the HACL∗ project [4]; and the best-known high-performance C implementation curve25519-donna, in both 64-bit and 32-bit variants. The field arithmetic in both amd64-64 and hacl-star has been verified using SMT solvers [1], [3]. We do not claim that the ranking in this table represents inherent differences between the 5 fastest implementations – we would be entirely unsurprised if a seemingly irrelevant change in the CPU, compiler, or source code rearranged them.

We report on our code generated using the standard representations for both 32-bit and 64-bit, though we are primarily interested in the latter, since we benchmark on a 64-bit processor. We actually report on two of our 64-bit variants, both of which have correctness proofs of the same strength. Variant A is the output of the toolchain described in Section III. Variant B contains an additional implementation grouping multiplications by the modular-reduction coefficient c = 19 with one of the inputs of the multiplication instead of the output (19 × (a × b) −→ (19 × a) × b). This algebra pays off because 19 × a can be computed using a 64-bit multiplication, whereas 19 × (ab) would require a significantly more expensive 128-bit multiplication. We currently implement this trick for just 2^255 − 19 as an ad-hoc code replacement after partial evaluation but before word-size inference, using the Coq ring algebra tactic to prove that the code change does not change the output. We are planning on incorporating a more general version of this optimization into the main pipeline, after a planned refactoring to add automatic translation to continuation-passing style (see Section VII).

The results of a similar benchmark on an earlier version of our pipeline were good enough to convince the maintainers of the BoringSSL library to adopt our methodology, resulting in this Curve25519 code being shipped in Chrome 64 and used by default for TLS connection establishment in other Google products and services. Previously, BoringSSL included the amd64-64 assembly code and a 32-bit C implementation as a fallback, which was the first to be replaced with our generated code. Then, the idea was raised of taking advantage of lookup tables to optimize certain ECC point multiplications. While the BoringSSL developers had not previously found it worthwhile to modify 64-bit assembly code and review the changes, they made use of our code-generation pipeline (without even consulting us, the tool authors) and installed a new 64-bit C version. The new code (our generated code linked with manually written code using lookup tables) was more than twice as fast as the old version and was easily chosen for adoption, enabling the retirement of amd64-64 from BoringSSL.

B. P-256 Mixed Addition

Next, we benchmark our Montgomery modular arithmetic as used for in-place point addition on the P-256 elliptic curve with one precomputed input (Jacobian += affine). A scalar-multiplication algorithm using precomputed tables would use some number of these additions, depending on the table size. The assembly-language implementation nistz256 was reported on by Gueron and Krasnov [14] and included in OpenSSL; we also measure its newer counterpart that makes use of the ADX instruction-set extension. The 64-bit C code we benchmark is also from OpenSSL and uses unsaturated-style modular reduction, carefully adding a couple of multiples of the prime each time before performing a reduction step with a negative coefficient to avoid underflow. These P-256 implementations are unverified. The measurement methodology is the same as for our X25519 benchmarks, except that we did not manage to get nistz256 running in a kernel module and report userspace measurements instead.
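The word-by-word Montgomery reduction underlying this P-256 arithmetic can be modeled in a few lines of Python. This is an unverified sketch with our own naming, shown only to fix intuitions; the verified Coq version works limb-by-limb with the carry operators of Section IV and the bound invariant of Fig. 9.

```python
# Unverified Python model of word-by-word Montgomery reduction.
# Computes t * r^(-n) mod m for 0 <= t < m * r^n, where r = 2^64
# and m is odd. Each step adds the multiple of m that zeroes the
# lowest limb of t, then divides exactly by r.

R = 1 << 64  # one 64-bit machine word (the "r" of Fig. 9)

def montgomery_reduce(t, m, n):
    m_inv = (-pow(m, -1, R)) % R       # -m^(-1) mod r, precomputable
    for _ in range(n):
        q = ((t % R) * m_inv) % R      # choose q so t + q*m ≡ 0 (mod r)
        t = (t + q * m) // R           # exact division by r
    return t - m if t >= m else t      # one conditional subtraction
```

Note that `pow(m, -1, R)` (modular inverse via three-argument pow with a negative exponent) requires Python 3.8 or newer.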
  Implementation      fastest   clang    gcc    icc
  nistz256 +ADX          ~550
  nistz256 AMD64         ~650
  this work A            1143     1811   1828   1143
  OpenSSL, 64-bit C      1151     1151   2079   1404
  this work B            1343     1343   2784   1521

Saturated arithmetic is a known weak point of current compilers, resulting in implementors either opting for alternative arithmetic strategies or switching to assembly language. Our programs are not immune to these issues: when we first ran our P-256 code, it produced incorrect output because gcc 7.1.1 had generated incorrect code²; clang 4.0 exited with a mere segmentation fault.³ Even in later compiler versions where these issues have stopped appearing, the code generated for saturated arithmetic varies a lot between compilers and is obviously suboptimal: for example, there are ample redundant moves of carry flags, perhaps because the compilers do not consider flag registers during register allocation. Furthermore, expressing the same computation using intrinsics such as _mulx_u64 (variant A in the table) or using uint128 and a bit shift (variant B) can produce a large performance difference, in different directions on different compilers.

The BoringSSL team had a positive enough experience with adopting our framework for Curve25519 that they decided to use our generated code to replace their P-256 implementation as well. First, they replaced their handwritten 64-bit C implementation. Second, while they had never bothered to write a specialized 32-bit P-256 implementation before, they also generated one with our framework. nistz256 remains as an option for use in server contexts where performance is critical and where patches can be applied quickly when new bugs are found. The latter is not a purely theoretical concern – Appendix A contains an incomplete list of issues discovered in previous nistz256 versions. The two curves thus generated with our framework for BoringSSL together account for over 99% of ECDH connections. Chrome version 65 is the first release to use our P-256 code.

C. Benchmarking All the Primes

Even though the primes supported by the modular-reduction algorithms we implemented vastly outnumber those actually used for cryptography, we could not resist the temptation to do a breadth-oriented benchmark for plausible cryptographic parameters. The only other implementation of constant-time modular arithmetic for arbitrary modulus sizes we found is the mpn_sec section of the GNU Multiple Precision Arithmetic Library.⁴ We also include the non-constant-time mpn interface for comparison. The list of primes to test was generated by scraping the archives of the ECC discussion list curves@moderncrypto.org, finding all supported textual representations of prime numbers.

² https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81300, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81294
³ https://bugs.llvm.org/show_bug.cgi?id=24943
⁴ https://gmplib.org/

One measurement in Fig. 10 corresponds to 1000 sequential computations of a 256-bit Montgomery ladder, aiming to simulate cryptographic use accurately while still constraining the source of performance variation to field arithmetic. A single source file was used for all mpn_sec-based implementations; the modulus was specified as a preprocessor macro to allow compiler optimizations to try their best to specialize GMP code. Our arithmetic functions were instantiated using standard heuristics for picking the compile-time parameters based on the shorthand textual representation of the prime, picking a champion configuration out of all cases where correctness proof succeeds, with two unsaturated and one Montgomery-based candidates. The 64-bit trials were run on an x86 Intel Haswell processor and 32-bit trials were run on an ARMv7-A Qualcomm Krait (LG Nexus 4).

We see a significant performance advantage for our code, even compared to the GMP version that "cheats" by leaking secrets through timing. Speedups range between 1.25X and 10X. Appendix B includes the full details, with Tables I and II recording all experimental data points, including for an additional comparison implementation in C++. In our current experiments, compilation of saturated arithmetic in Coq times out for a handful of larger primes. There is no inherent reason for this slowness, but our work-in-progress reimplementation of this pipeline (Section VII) is designed to be executable using the Haskell extraction feature of Coq, aiming to sidestep such performance unpredictability conclusively.

VI. RELATED WORK

A. Verified Elliptic-Curve Cryptography

Several past projects have produced verified ECC implementations. We first summarize them and then give a unified comparison against our new work.

Chen et al. [1] verified an assembly implementation of Curve25519, using a mix of automatic SAT solving and manual Coq proof for remaining goals. A later project by Bernstein and Schwabe [2], described as an "alpha test," explores an alternative workflow using the Sage computer-algebra system. This line of work benefits from the chance for direct analysis of common C and assembly programs for unsaturated arithmetic, so that obviously established standards of performance can be maintained.

Vale [5] supports compile-time metaprogramming of assembly code, with a cleaner syntax to accomplish the same tasks done via Perl scripts in OpenSSL. There is a superficial similarity to the flexible code generation used in our own work. However, Vale and OpenSSL use comparatively shallow metaprogramming, essentially just doing macro substitution, simple compile-time offset arithmetic, and loop unrolling. Vale has not been used to write code parameterized on a prime modulus (and OpenSSL includes no such code). A verified static analysis checks that assembly code does not leak secrets, including through timing channels.

HACL∗ [4] is a cryptographic library implemented and verified in the F∗ programming language, providing all the functionality needed to run TLS 1.3 with the latest primitives.
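The constant-time discipline discussed throughout this section (and enforced by construction in our straightline intermediate language) forbids branching on secret data. A standard example of the pattern is the branchless conditional swap used inside Montgomery-ladder scalar multiplication, sketched here in Python purely for illustration; this is not code from the paper's library, and Python itself gives no timing guarantees — the point is the absence of a secret-dependent branch in the pattern that the compiled code follows.

```python
# Branchless conditional swap on 64-bit values (illustrative sketch).
# In compiled code this becomes a few AND/XOR instructions whose
# execution time does not depend on the secret bit.

MASK64 = (1 << 64) - 1

def cswap(bit, x, y):
    """Swap x and y iff bit == 1, without branching on bit."""
    mask = (-bit) & MASK64   # all-zeros (bit=0) or all-ones (bit=1)
    d = (x ^ y) & mask
    return x ^ d, y ^ d
```

A Montgomery ladder calls this once per scalar bit, so a data-dependent `if` here would leak the secret scalar through timing or branch predictors.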
[Figure] Fig. 10: Performance comparison of our generated C code vs. handwritten using libgmp. (Two plots — 64-bit and 32-bit field-arithmetic benchmarks — show time in seconds against log2(prime) for the GMP mpn_sec API, the GMP mpn API, and this work.)

Primitives are implemented in the Low∗ imperative subset of F∗ [15], which supports automatic semantics-preserving translation to C. As a result, while taking advantage of F∗'s high-level features for specification, HACL∗ performs comparably to leading C libraries. Additionally, abstract types are used for secret data to avoid side-channel leaks.

Jasmin [6] is a low-level language that wraps assembly-style straightline code with C-style control flow. It has a Coq-verified compiler to 64-bit x86 assembly (with other targets planned), along with support for verification of memory safety and absence of information leaks, via reductions to Dafny. A Dafny reduction for functional-correctness proof exists but has not yet been used in a significant case study.

We compare against these past projects in a few main claims.

Advantage of our work: first verified high-performance implementation of P-256. Though this curve is the most widely used in TLS today, no prior projects had demonstrated performance-competitive versions with correctness proofs. A predecessor system to HACL∗ [3] verified more curves, including P-256, but in high-level F∗ code, incurring performance overhead above 100X.

Advantage of our work: code and proof reuse across algorithms and parameters. A more fundamental appeal of our approach is that it finally enables the high-performance-cryptography domain to benefit from pleasant software-engineering practices of abstraction and code reuse. The code becomes easier to understand and maintain (and therefore to prove), and we also gain the ability to compile fast versions of new algorithm variants automatically, with correctness proofs. Considering the cited past work, it is perhaps worth emphasizing how big of a shift it is to do implementation and proving at a parameterized level: as far as we are aware, no past project on high-performance verified ECC has ever verified implementations of multiple elliptic curves or verified multiple implementations of one curve, targeted at different hardware characteristics. With our framework, all of those variants can be generated automatically, at low developer cost; and indeed the BoringSSL team took advantage of that flexibility.

Advantage of our work: small trusted code base. Every past project we mentioned includes either an SMT solver or a computer-algebra system in the trusted code base. Furthermore, each past project also trusts some program-verification tool that processes program syntax and outputs symbolic proof obligations. We trust only the standard Coq theorem prover, which has a proof-checking kernel significantly smaller than the dependencies just mentioned, and the (rather short, whiteboard-level) statements of the formal claims that we make (plus, for now, the C compiler; see below).

Disadvantage of our work: stopping at C rather than assembly. Several past projects apply to assembly code rather than C code. As a result, they can achieve higher performance and remove the C compiler from the trusted base. Our generated code is already rather close to assembly, differing only in the need for some register allocation and instruction scheduling. In fact, it seems worth pointing out that every phase of our code-generation pipeline is necessary to get code low-level enough to be accepted as input by Jasmin or Vale. We have to admit to nontrivial frustration with popular C compilers in the crypto domain, keeping us from using them to bridge this gap with low performance cost. We wish compiler authors would follow the maxim that, if a programmer takes handwritten assembly and translates it line-by-line to the compiler's input language, the compiler should not generate code slower than the original assembly. Unfortunately, no popular compiler seems to obey that maxim today, so we are looking into writing our own (with correctness proof) or integrating with one of the projects mentioned above.

Disadvantage of our work: constant-time guarantee mechanized only for straightline code. Our development also includes higher-level cryptographic code that calls the primitives described here, with little innovation beyond comprehensive functional-correctness proofs. That code is written following