A Multiplatform Parallel Approach for Lattice Sieving Algorithms

Michal Andrzejczak (Military University of Technology, Warsaw, Poland, michal.andrzejczak@wat.edu.pl)
Kris Gaj (George Mason University, Virginia, USA, kgaj@gmu.edu)

Abstract. Lattice sieving is currently the leading class of algorithms for solving the shortest vector problem over lattices. The computational difficulty of this problem is the basis for constructing secure post-quantum public-key cryptosystems based on lattices. In this paper, we present a novel massively parallel approach for solving the shortest vector problem using lattice sieving and hardware acceleration. We combine previously reported algorithms with a proper caching strategy and develop a hardware architecture. The main advantage of the proposed approach is eliminating the overhead of the data transfer between a CPU and a hardware accelerator. The authors believe that this is the first such architecture reported in the literature to date and predict that it will achieve up to 8 times higher throughput when compared to a multi-core high-performance CPU. The presented methods can be adapted to other sieving algorithms that are hard to implement in FPGAs due to the communication and memory bottleneck.

(This is an extended version of the paper published as [5].)

1 Introduction

Over the last decade, post-quantum cryptography (PQC) has emerged as one of the most important topics in the area of theoretical and applied cryptography. This new branch of cryptology is considered an answer to the threat of quantum computers. A full-scale quantum computer will be able to break popular public-key cryptosystems, such as RSA and ECDSA, using Shor's algorithm. In 2016, the United States National Institute of Standards and Technology (NIST) announced the Post-Quantum Cryptography Standardization Process (NIST PQC), aimed at developing new cryptographic standards resistant to attacks involving quantum computers. In January 2019, 26 of the submitted candidates (including the results of a few mergers) advanced to Round 2, and in July 2020, 15 of them were qualified for Round 3. The biggest group of submissions in all rounds were lattice-based algorithms. The difficulty of breaking these cryptosystems relies on the complexity of some well-known and computationally hard problems regarding mathematical objects called lattices. One of these problems is the Shortest Vector Problem (SVP).

Lattice sieving, which is the subject of this paper, is a family of algorithms that can be used to solve SVP (at least for relatively small to medium lattice dimensions). Recently, significant progress in lattice sieving has been made, especially due to Albrecht et al. [4]. Although multiple types of lattice sieving algorithms have emerged in recent years, all of them share a single fundamental operation called vector reduction. As a result, an efficient acceleration of vector reduction is likely to work with most of the sieves and give them a significant boost in performance. Modern CPUs have vector instructions, pipelining, and multi-threading capabilities, all of which have been used in the past to improve the performance of lattice sieving. Specialized hardware seems to be the next frontier necessary to achieve a substantial further speed-up.

Sieving is a popular technique in cryptography. It was used previously, for example, for factoring large integers. However, it is a memory-intensive method, and the resulting data transfer bottleneck is the biggest obstacle to any potential hardware acceleration.
1.1 Contribution

To take full advantage of the parallel capabilities of modern hardware, we propose a modified approach to lattice sieving algorithms and present a massively parallel FPGA sieving accelerator. In the modified sieving algorithm, due to proper caching techniques on the hardware side, there is no data transfer bottleneck. Thus, the accelerator works at full performance, and a significant speed-up is achieved. At the end, a cost comparison for solving SVP instances on Amazon AWS is presented.

2 Mathematical background

A lattice L is a discrete additive group generated by n linearly independent vectors b_1, b_2, ..., b_n ∈ R^m:

\[ L(b_1, b_2, \dots, b_n) = \left\{ \sum_{i=1}^{n} x_i b_i \;\middle|\; x_i \in \mathbb{Z} \right\} \tag{1} \]

The vectors b_1, ..., b_n are called a basis of the lattice, and we define B as an m × n matrix consisting of the basis vectors as columns. In this paper, we consider only the case of m = n. The lattice, understood as a set of vectors, can be written as

\[ L(B) = \{ xB \mid x \in \mathbb{Z}^n \} \tag{2} \]

We define the length of a vector as the Euclidean norm ‖x‖ = √(Σ_i x_i²). The Shortest Vector Problem (SVP) aims at finding a linear combination of basis vectors with the shortest length possible. For a given basis B ∈ Z^{n×m}, the shortest vector v ∈ L(B) is a vector for which

\[ \forall x \in \mathbb{Z}^n \setminus \{0\}: \quad \|v\| \le \|xB\| \tag{3} \]

The shortest vector in a lattice is also called the first successive minimum and is denoted λ1(L). There are known estimates of bounds on the length of the shortest vector in a given lattice. For the well-known SVP Challenge [1], a found vector v should be shorter than

\[ \|v\| \le 1.05 \cdot \frac{(\Gamma(n/2 + 1))^{1/n}}{\sqrt{\pi}} \cdot (\det L)^{1/n} \tag{4} \]

where Γ is Euler's gamma function, and det L is the determinant of the basis B generating the lattice L.

If two different vectors v, u ∈ L(B) satisfy ‖v ± u‖ ≥ max(‖v‖, ‖u‖), then v, u are called Gauss-reduced. If every pair of vectors (v, u) from a set A ⊂ L(B) is Gauss-reduced, then the set A is called pairwise-reduced.

In this paper, we denote vectors by bold lowercase letters and matrices by bold uppercase letters. The terms lattice point and vector are used interchangeably.
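As a small worked example of Eq. 4 and of the Gauss-reduced condition, the following sketch computes the SVP Challenge target length from n and det(L), and tests whether a pair of integer vectors is Gauss-reduced. It is illustrative only; the example determinant in main() is an arbitrary assumption.

```cpp
// Eq. 4: SVP Challenge target norm, and the Gauss-reduced test ||v +- u|| >= max(||v||, ||u||).
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

const double PI = 3.14159265358979323846;

// 1.05 * (Gamma(n/2 + 1))^(1/n) / sqrt(pi) * det(L)^(1/n), computed via lgamma for stability.
double svp_challenge_bound(int n, double log_det) {
    double log_gamma = std::lgamma(n / 2.0 + 1.0);
    return 1.05 * std::exp(log_gamma / n) / std::sqrt(PI) * std::exp(log_det / n);
}

double norm(const std::vector<int64_t>& a) {
    double s = 0;
    for (int64_t x : a) s += (double)x * x;
    return std::sqrt(s);
}

// Gauss-reduced: neither v + u nor v - u is shorter than the longer of the two vectors.
bool gauss_reduced(const std::vector<int64_t>& v, const std::vector<int64_t>& u) {
    std::vector<int64_t> sum(v.size()), diff(v.size());
    for (size_t i = 0; i < v.size(); ++i) { sum[i] = v[i] + u[i]; diff[i] = v[i] - u[i]; }
    double m = std::max(norm(v), norm(u));
    return norm(sum) >= m && norm(diff) >= m;
}

int main() {
    // Dimension 80 with log(det L) = 80 * log(2^10): an arbitrary example lattice volume.
    std::printf("target length: %.2f\n", svp_challenge_bound(80, 80 * std::log(1024.0)));
    std::printf("gauss reduced: %d\n", (int)gauss_reduced({5, 1, -3}, {2, -4, 1}));
}
```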
3 Lattice sieving

SVP is one of the best-known problems involving lattices. Due to its computational complexity, it can be used as a basis for the security of lattice-based cryptosystems. Lattice sieving is one of several approaches to solving SVP. It is not a single algorithm, but rather a class of algorithms. These algorithms are similar to one another and rely on a similar basic operation, but differ in terms of their time and space complexity.

The term "lattice sieving" was proposed in the pioneering work of Ajtai, Kumar, and Sivakumar [2], [3]. In 2001, these authors introduced a new randomized way of finding the shortest vector in an n-dimensional lattice by sieving sampled vectors. The main idea was to sample a long list L = {w_1, ..., w_N} of random lattice points and compute all possible differences between points from this list L. As the algorithm progresses, during reduction, shorter and shorter vectors are discovered. By repeating this step many times, the shortest vector in the lattice is eventually found as the result of subtracting two vectors v_i − v_j.

The method proposed by Ajtai et al. is the main building block of lattice sieving algorithms. Later algorithms differ mostly in the way of handling lattice vectors, grouping them, or using prediction techniques. However, the main idea is still to sample new random vectors and reduce them using those already accumulated.

3.1 The GaussSieve

In 2010, Micciancio and Voulgaris [12] presented two new algorithms: ListSieve, with time complexity 2^{3.199n} and space complexity 2^{1.325n}, and GaussSieve, able to find a solution in running time 2^{0.52n}, using memory space in the range of 2^{0.2n}. The GaussSieve is shown below as Algorithm 1. The key idea is taken from Ajtai's work and is based mostly on pairwise vector reduction.

The GaussSieve starts with an empty list of lattice points L and an empty stack S. The stack is the first source of points to be processed in the next iteration of reduction. In the case of an empty stack, a new point is sampled using Klein's method for sampling random lattice points [10], with modifications and extensions from [8].

Algorithm 1: GaussSieve(B), an algorithm that can compute the shortest vector. The .pop() operation returns the first vector from a given queue, KleinSampler() is a method for random sampling of new vectors, and GaussReduce() reduces a vector by the other vectors from the set L.
Data: B - lattice basis, c - maximum number of collisions, λ1(B) - targeted norm
Result: t : t ∈ L(B) ∧ ‖t‖ ≤ λ1(B)

    begin
        L ← ∅, S ← ∅, i ← 0, t ← KleinSampler(B)
        while i < c and ‖t‖ > λ1(B) do
            if S ≠ ∅ then
                v_new ← S.pop()
            else
                v_new ← KleinSampler(B)
            end if
            v_new ← GaussReduce(v_new, L, S)
            if ‖v_new‖ = 0 then
                i ← i + 1
            else
                L ← L ∪ {v_new}
                if ‖v_new‖ < ‖t‖ then
                    t ← v_new
                end if
            end if
        end while
        return t
    end

Next, a sampled lattice point v is pairwise reduced by every vector from the list L. The reduction method, called GaussReduce, returns vectors u, v satisfying max(‖u‖, ‖v‖) ≤ ‖u ± v‖. This method is shown below as Algorithm 2. The list L is thus always Gauss-reduced, and if a vector already on the list gets reduced, it is moved to the stack. If the vector v is non-zero after being reduced by the whole list, it is added to L. Otherwise, the number of collisions i is incremented. A collision occurs when a point is reduced to zero, which means that the same point has been sampled before. The algorithm stops when the number of collisions exceeds the given boundary c, or when the shortest vector found so far is at least as short as the targeted estimate.
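For readers who prefer code to pseudocode, the following is a minimal, single-threaded C++ sketch of Algorithm 1 together with the list pass of Algorithm 2 (shown next). It is an assumption-laden illustration, not the authors' implementation: the Klein sampler is replaced by a placeholder that returns a random small integer combination of the basis rows, and the stack receives the reduced list vector itself.

```cpp
// Sketch of the GaussSieve loop (Algorithm 1) with a simple pairwise reduction.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <deque>
#include <iostream>
#include <list>
#include <random>
#include <vector>

using Vec = std::vector<int64_t>;

static int64_t dot(const Vec& a, const Vec& b) {
    int64_t s = 0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
static int64_t norm2(const Vec& a) { return dot(a, a); }

// Pairwise reduction: subtract the rounded multiple of u from v if that shortens v.
static bool reduce(Vec& v, const Vec& u) {
    int64_t d = dot(v, u), nu = norm2(u);
    if (nu == 0 || 2 * std::abs(d) <= nu) return false;
    int64_t q = (int64_t)std::llround((double)d / (double)nu);
    for (size_t i = 0; i < v.size(); ++i) v[i] -= q * u[i];
    return true;
}

// Algorithm 2 structure: reduce p by the list, then move every list vector shortened by p to the stack.
static Vec gauss_reduce(Vec p, std::list<Vec>& L, std::deque<Vec>& S) {
    for (const Vec& u : L) reduce(p, u);
    for (auto it = L.begin(); it != L.end();) {
        if (reduce(*it, p)) { S.push_back(*it); it = L.erase(it); }
        else ++it;
    }
    return p;
}

// Placeholder sampler (NOT Klein's algorithm): a small random, non-zero combination of basis rows.
static Vec sample(const std::vector<Vec>& B, std::mt19937_64& rng) {
    std::uniform_int_distribution<int> coef(-2, 2);
    Vec v(B[0].size(), 0);
    do {
        std::fill(v.begin(), v.end(), 0);
        for (const Vec& b : B) {
            int c = coef(rng);
            for (size_t i = 0; i < v.size(); ++i) v[i] += c * b[i];
        }
    } while (norm2(v) == 0);
    return v;
}

// Algorithm 1: sieve until max_collisions zero vectors occur or the target norm is reached.
static Vec gauss_sieve(const std::vector<Vec>& B, int max_collisions, int64_t target_norm2) {
    std::mt19937_64 rng(42);
    std::list<Vec> L;
    std::deque<Vec> S;
    Vec t = sample(B, rng);
    int collisions = 0;
    while (collisions < max_collisions && norm2(t) > target_norm2) {
        Vec v;
        if (!S.empty()) { v = S.front(); S.pop_front(); }
        else            { v = sample(B, rng); }
        v = gauss_reduce(std::move(v), L, S);
        if (norm2(v) == 0) { ++collisions; }
        else {
            if (norm2(v) < norm2(t)) t = v;
            L.push_back(std::move(v));
        }
    }
    return t;
}

int main() {
    std::vector<Vec> B = {{7, 2, 0}, {1, 8, 3}, {2, 1, 9}};   // toy 3-dimensional basis (rows)
    Vec t = gauss_sieve(B, 100, 1);
    std::cout << "shortest vector found has squared norm " << norm2(t) << "\n";
}
```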
Algorithm 2: GaussReduce(p, L, S)
Data: p - lattice vector, L - list of already reduced vectors, S - list of vectors to reduce
Result: p - reduced vector

    begin
        for v_i ∈ L do
            Reduce(p, v_i)
        end for
        for v_i ∈ L do
            if Reduce(v_i, p) then
                L ← L \ {v_i}
                S.push(v_i − p)
            end if
        end for
        return p
    end

By analyzing Algorithm 1 and Algorithm 2, the number of Reduce() calls can be approximated as k² for the case of k vectors. There have been many papers improving the complexity of the presented algorithm and proposing modifications that speed up the computations by several orders of magnitude by applying additional techniques. However, the GaussSieve is still a part of newer methods, and the vector reduction step is crucial for every lattice sieving algorithm.

3.2 Parallel sieves

There have been several papers devoted to developing a parallel version of a lattice sieve. In addition to the gsieve and fplll libraries, Milde and Schneider [13] proposed a parallel version of the GaussSieve algorithm. Their main idea was to use several instances of the algorithm connected in a circular fashion. Each instance has its own queue Q, list L, and stack S. When a vector has been reduced by one instance, it is moved to another one. The stacks contain vectors that were reduced in a given instance and need to pass through the instance's list once more. During the algorithm, vectors in the instances' lists are not always Gauss-reduced.

Ishiguro et al. [9] modified the idea of the parallel execution of the GaussSieve algorithm. There is only one stack, global for all instances (threads). The execution of the algorithm is divided into three parts. In the first part, the sampled vectors in the set V (new or taken from the stack) are reduced by the vectors in the local instance lists. After this reduction, the reduced vectors are compared with their original values; any vector that differs from its value before the reduction step is moved to the global stack. In the next step, the sampled vectors are reduced against each other. In the last step, the vectors from the local lists are reduced by the sampled vectors. The procedure ends by moving the vectors from the set V to the local lists of the instances. A new batch of vectors is then sampled, and the procedure starts from the beginning. The advantage of this approach to parallel execution is that the vectors in the local lists are always pairwise-reduced.

In [6], Bos et al. combined ideas from [13] and [9]. As in Milde and Schneider, each node maintains its own local list, but the rounds are synchronized between nodes. During synchronization, the vectors are ensured to be pairwise-reduced, as in the Ishiguro et al. approach.

Yang et al. [14] proposed a parallel architecture for the GaussSieve on GPUs. A single GPU executes the parallel approach proposed by Ishiguro et al. Communication and data flow between multiple GPUs are handled by adopting ideas from Bos et al.
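The three-phase batch structure described above (and reused in Section 6.3) can be summarized by the following single-threaded C++ sketch. The thread-level parallelism, the global stack handling, and the sampler are deliberately simplified and should be treated as assumptions; the pairwise reduce() is repeated here only to keep the sketch self-contained.

```cpp
// Structural sketch of one round of the three-phase batch reduction (Ishiguro et al. style):
// (1) reduce the batch V by the list L, (2) reduce V against itself, (3) reduce L by V,
// then append V to L. In the real scheme, vectors changed in phase 1 also go to the global stack.
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <vector>

using Vec = std::vector<int64_t>;

static bool reduce(Vec& v, const Vec& u) {            // same pairwise reduction as before
    int64_t d = 0, nu = 0;
    for (size_t i = 0; i < v.size(); ++i) { d += v[i] * u[i]; nu += u[i] * u[i]; }
    if (nu == 0 || 2 * std::abs(d) <= nu) return false;
    int64_t q = (int64_t)std::llround((double)d / (double)nu);
    for (size_t i = 0; i < v.size(); ++i) v[i] -= q * u[i];
    return true;
}

void batch_round(std::vector<Vec>& L, std::vector<Vec>& V, std::vector<Vec>& stack) {
    // Phase 1: reduce every sampled vector by the (fixed) list L.
    // (Changed vectors are kept in the batch here for simplicity; Ishiguro et al. push them
    //  to the global stack instead.)
    for (Vec& v : V)
        for (const Vec& u : L) reduce(v, u);
    // Phase 2: reduce the sampled vectors against each other.
    for (size_t i = 0; i < V.size(); ++i)
        for (size_t j = 0; j < V.size(); ++j)
            if (i != j) reduce(V[i], V[j]);
    // Phase 3: reduce list vectors by the sampled vectors; changed ones go back to the stack.
    for (size_t i = 0; i < L.size();) {
        bool changed = false;
        for (const Vec& v : V) changed |= reduce(L[i], v);
        if (changed) { stack.push_back(L[i]); L.erase(L.begin() + i); }
        else ++i;
    }
    // Finally, move the batch into the list; the list stays pairwise-reduced.
    for (Vec& v : V) L.push_back(std::move(v));
    V.clear();
}
```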
Every paper listed above targeted a multi-thread or multi-device implementation, but due to the structure of FPGAs, some of these ideas can also be adapted to hardware. As for FPGAs, there is no publicly available paper about a hardware implementation of lattice sieving. FPGAs have been used for solving SVP, but with the class of enumeration algorithms. In 2010, Detrey et al. proposed an FPGA accelerator for the Kannan-Fincke-Pohst enumeration algorithm (KFP) [7]. For a 64-dimensional lattice, they obtained an implementation faster by a factor of 2.12 compared to a multi-core CPU, using an FPGA device with a comparable cost (Intel Core 2 Quad Q9550 vs. Xilinx Virtex-5 SXT 35). For the software implementation, the fplll library was used.

4 Hardware acceleration of vector reduction

Lattices used in cryptography are usually high-dimensional. The hardest problem solved in the SVP Challenge, as of April 2021, is for a 180-dimensional lattice [1]. This dimension is still significantly smaller than the dimensions of lattices used in the post-quantum public-key encryption schemes submitted to the NIST Standardization Process. However, it is still a challenge, similar to the RSA Challenge, to solve as big a problem as possible. Thus, hardware acceleration might help to find solutions for higher dimensions in a shorter time.

Algorithm 3 describes a common way of implementing the Reduce function in software. This method is dependent on the lattice dimension, which affects the dot-product computation and the update of the vector's value. The number of multiplications increases proportionally to the dimension. A standard modern CPU requires more time to perform the computations as the lattice dimension increases. However, both affected operations are highly parallelizable. Almost all multiplications can be performed concurrently by utilizing low-level parallelism. Thus, specialized hardware can be competitive even with modern CPUs capable of executing vectorized instructions and can offer a higher level of parallelism.

Algorithm 3: Reduce(v, u), vector reduction. The return value is true or false, depending on whether a reduction occurs or not.
Data: v, u - lattice vectors
Result: true or false

    begin
        dot = Σ_i v_i · u_i
        if 2 · |dot| ≤ ‖u‖² then
            return false
        else
            q = ⌊dot / ‖u‖²⌉   (rounded to the nearest integer)
            for i = 0; i < n; i++ do
                v_i −= q · u_i
            end for
            ‖v‖² += q² · ‖u‖² − 2 · q · dot
            return true
        end if
    end

In this section, we present a new hardware architecture for lattice vector reduction. A novel approach to using FPGAs for low-level parallelism is suggested. The most frequently used operation in lattice sieves, the Reduce function, is analyzed and accelerated. The analysis is performed step by step, line by line, and the corresponding hardware is proposed.
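Since Sections 4.1-4.3 map the individual lines of Algorithm 3 onto hardware blocks, it is convenient to restate it in C++ with the squared norm carried alongside the vector, exactly as the accelerator does. This is a minimal software sketch, not the FPGA code; the 16-bit element width mentioned in Section 5.1 is ignored here and 64-bit integers are used instead.

```cpp
// Algorithm 3 in C++: reduce v by u while keeping ||v||^2 up to date, so that the norm
// never has to be recomputed from scratch (the hardware maintains it the same way).
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct LatticePoint {
    std::vector<int64_t> coords;
    int64_t norm2;                    // squared Euclidean norm, kept consistent with coords
};

bool reduce(LatticePoint& v, const LatticePoint& u) {
    // "dot = sum v_i * u_i": fully parallel multipliers plus an adder tree in hardware (Sec. 4.1).
    int64_t dot = 0;
    for (size_t i = 0; i < v.coords.size(); ++i) dot += v.coords[i] * u.coords[i];

    // "if 2*|dot| <= ||u||^2": q would round to 0, so no reduction is possible.
    if (2 * std::abs(dot) <= u.norm2) return false;

    // "q = round(dot / ||u||^2)": done by range comparisons in hardware (Sec. 4.2, Table 1).
    int64_t q = (int64_t)std::llround((double)dot / (double)u.norm2);

    // "v_i -= q * u_i": element update, one multiply and one subtract per coordinate (Sec. 4.3).
    for (size_t i = 0; i < v.coords.size(); ++i) v.coords[i] -= q * u.coords[i];

    // "||v||^2 += q^2*||u||^2 - 2*q*dot": norm update without touching the coordinates.
    v.norm2 += q * q * u.norm2 - 2 * q * dot;
    return true;
}
```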
4.1 Computing a vector product

Computing the product of two vectors and updating the vector's value are the two most time-consuming operations during reduction. However, FPGAs have a chance to accelerate these computations with massive parallelism. The proposed hardware circuit for computing a vector product is shown in Fig. 1. The first step is a multiplication of the corresponding coefficients. This multiplication is performed in one clock cycle, even for a very large lattice. After the multiplication step, the results are moved to an addition tree consisting of ⌈log₂(n)⌉ addition layers.

For the majority of FPGAs, the critical path for addition is shorter than for multiplication. Thus, it is possible to perform more than one addition in a single clock cycle without negatively affecting the maximum clock frequency. Let β denote the number of addition layers that can be performed in one clock cycle with a latency path shorter than that of a multiplication. This parameter depends on the FPGA vendor and device family. For our target device, β = 4 addition layers are executed in one clock cycle, and the latency of the addition tree for an n-dimensional vector is ⌈log₂(n)/β⌉ clock cycles.

The proposed design also offers an option for pipelined execution. It is possible to feed new vectors into the registers v and u in each clock cycle, reaching the highest possible performance for a given set of vectors. The total latency required for computing the vector product is 1 + ⌈log₂(n)/β⌉ cycles. Using this approach, the maximum level of parallelism is achieved.

Fig. 1. Hardware module for the pipelined vector product computation. v and u are input vectors stored in registers.

4.2 Division with rounding to the nearest integer

The next operation performed in the proposed accelerator for the Reduce function is a division with rounding to the nearest integer. The division involves the computed vector product, dot, and the squared norm of the second vector, ‖u‖². Instead of performing a regular division, we take advantage of the fact that the result is rounded to the nearest integer within a limited range, so several conditions can be checked instead of performing a real division. The comparisons being made are listed in Table 1 and are easily executed in hardware using simple shifts, additions, and subtractions.

The full range of possible results is not covered. The selection of the result range is based on statistical data and a chosen assumption: if the sampled vectors provided to the sieving algorithm are no longer than x times the approximate shortest vector, the rounded division will never generate a result bigger than x. The approach is justified by the following lemma.
Table 1. Conditions checked for the rounded division with the restricted result range.

    Division result    Value    Condition
    [0, 0.5)           0        2·|dot| < ‖u‖²
    [0.5, 1.5)         1        ‖u‖² ≤ 2·|dot| < 3·‖u‖²
    [1.5, 2.5)         2        3·‖u‖² ≤ 2·|dot| < 5·‖u‖²
    [2.5, 3.5)         3        5·‖u‖² ≤ 2·|dot| < 7·‖u‖²
    [3.5, 4.5)         4        7·‖u‖² ≤ 2·|dot| < 9·‖u‖²

Lemma 1. For any two vectors v, u ∈ R^n in a lattice, with Euclidean norms no larger than x times the norm of the shortest vector, the value |⟨v, u⟩| / ‖u‖² is at most x.

Proof. Let us assume that u is the shortest vector. For the Euclidean inner product ⟨·, ·⟩ and any v, u ∈ R^n we have

\[ |\langle v, u \rangle| \le \|v\| \cdot \|u\| \tag{5} \]

Thus, if

\[ \|v\| = x \cdot \|u\| \tag{6} \]

then

\[ \frac{|\langle v, u \rangle|}{\|u\|^2} \le x \tag{7} \]

Based on experiments, x = 4 is sufficient to accept all sampled vectors. Restricting the division to the selected range is necessary to make the comparison with CPU implementations more accurate and to avoid the rare events in which a vector must be reduced again, which could lead to data desynchronization in the accelerator. One may ask whether two vectors might ever produce a different result than expected. This issue is handled by an overflow signal that is asserted when the result is out of range. If that happens, the vectors are reduced once again.

The hardware module performing the rounded division is shown in Fig. 2. The dot product is converted to its absolute value, and the sign is saved to be applied at the end of the division. The necessary comparisons are performed in parallel. A look-up table decides on the absolute value of the result based on the outcome of the comparisons. In the last step, the stored sign is applied to the result. All operations are performed in one clock cycle.

Fig. 2. Hardware module for the division with rounding with the restricted range of possible results. Results from the arithmetic operations are pushed to the logic responsible for the comparisons. Next, based on the comparison results, a decision about the value of |q| is made, and using the stored sign value, the result is converted to a final signed value.
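The comparison-based rounding of Table 1 can be mimicked in software as follows. This is an illustrative sketch only: it covers exactly the magnitude range 0-4 of Table 1, and the overflow flag stands in for the hardware overflow signal.

```cpp
// Rounded division q = round(dot / ||u||^2) restricted to |q| <= 4, implemented with the
// range comparisons of Table 1 instead of an actual divider.
#include <cstdint>
#include <cstdlib>

struct RoundedDiv {
    int64_t q;         // signed result, valid only if !overflow
    bool overflow;     // set when |dot| / ||u||^2 falls outside [0, 4.5)
};

RoundedDiv rounded_div(int64_t dot, int64_t norm2_u) {
    int64_t a = 2 * std::abs(dot);     // compare 2*|dot| against odd multiples of ||u||^2
    int64_t mag;
    bool ovf = false;
    if      (a <     norm2_u) mag = 0;
    else if (a < 3 * norm2_u) mag = 1; // 3*, 5*, 7*, 9* are shifts and additions in hardware
    else if (a < 5 * norm2_u) mag = 2;
    else if (a < 7 * norm2_u) mag = 3;
    else if (a < 9 * norm2_u) mag = 4;
    else { mag = 0; ovf = true; }      // out of the supported range
    return { dot < 0 ? -mag : mag, ovf };   // re-apply the saved sign
}
```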
4.3 Update of vector values

Having all the necessary values, it is possible to update the reduced vector elements and its norm. These two operations can be performed in parallel. The element update function simply subtracts the product q · u from v. This operation can also be executed in parallel. Fig. 3 presents the hardware realization of this part. In the first step, the products q · u_i are calculated. They are then subtracted from v_i in the second step. Each step is executed in a separate clock cycle to decrease the critical path's length and obtain a higher maximum clock frequency.

In hardware, the norm update function is executed in three steps, taking one clock cycle each. First, 2 · dot and q · ‖u‖² are computed. Second, the multiplication by q is applied to both partial results. Finally, the subtraction and addition operations are performed.

Fig. 3. Hardware module for the vector element value update. v and u are input vectors, m_i denotes the multiplication and s_i the subtraction logic. r_i denotes a register used to store an intermediate value after multiplication.

4.4 Reduce module

The parts described above were used to develop the entire Reduce algorithm. The hardware block diagram combining the previously discussed steps is shown in Fig. 4. Taking into account the capabilities of modern FPGAs, it is possible to execute two reductions at once: the reduction of vector u by vector v (u ± v) and the opposite reduction of vector v by vector u (v ± u). We call them branched operations because they utilize the same dot product, computed in the first step. The logic required for the branched computation is shaded in Fig. 4 and can be omitted in the standard implementation. Moreover, not every algorithm can take advantage of the branched execution of a vector reduction. Some algorithms have a strict schedule for the vector reduction and are not able to process data from the branched execution.

With additional shift registers used to store data for further steps of the algorithm, it is possible to start computations for a new vector pair in each clock cycle, utilizing the pipeline properties of the building blocks and increasing the total performance.

The latency for one pair of vectors depends only on the dimension of the lattice. For an n-dimensional lattice, the latency f_cl(n) equals exactly

\[ f_{cl}(n) = \left\lceil \frac{\log_2(n)}{\beta} \right\rceil + 5 \tag{8} \]

clock cycles.
Fig. 4. The architecture of the reduce accelerator with pipelining and branching. The shaded part denotes logic implemented in the branched version and omitted in the standard implementation.

This is also the number of pairs of vectors processed in the module concurrently. Therefore, for a 200 MHz clock frequency, the pipelined version can perform up to 200,000,000 vector reductions per second, and the branched version can perform twice as many reductions. This calculation does not include the communication overhead, so the practical performance of a standard approach will be lower.

5 Theoretical performance analysis

5.1 Data transfer costs

The biggest issue with current algorithms is the data transfer cost. Even the largest FPGAs are not able to store all the data required to run sieving standalone for the currently attacked dimensions. Thus, only a hybrid solution is considered. However, with only a part of the algorithm executed on the FPGA side, some data has to be exchanged between both sides.

In lattice sieving, the transferred data consists mostly of lattice points. For a simple vector representing a lattice point, the size depends on the dimension of the lattice. In the presented accelerator, each vector element is stored in 16 bits. This can be extended to 32 bits if needed, but based on our experiments on reduced lattices, 16 bits is sufficient. Additionally, the squared length of the vector is stored in another 32 bits. Thus, the number of bits f_nb(n) required for a simple n-dimensional vector is

\[ f_{nb}(n) = n \cdot 16 + 32 = (n + 2) \cdot 16 \tag{9} \]

This is also the number of bits required for one vector transfer in either direction between the CPU and the FPGA. The communication time depends on the size of the data and on the width of the data bus. The commonly used data buses are w = {32, 64, 128, 256} bits wide and are able to deliver new data in every FPGA clock cycle. The data transfer latency f_tl for one vector, expressed as a number of clock cycles, is

\[ f_{tl}(n, w) = \frac{f_{nb}(n)}{w} = \frac{(n + 2) \cdot 16}{w} \tag{10} \]
5.2 Scenario I: basic vector reduction acceleration

Knowing the accelerator performance and the communication costs, two use-case scenarios can be analyzed. The first use case involves performing a simple vector reduction in the FPGA every time this operation is required during an algorithm execution on the CPU. The required data is sent to the FPGA, and the result is sent back to the CPU. The FPGA does not store any additional vectors; it only computes a single result. The clock latency includes the time required for two vector transfers from the CPU to the FPGA, the time of the operation itself, and one transfer of the result back from the FPGA to the CPU. The total latency f_el(n, w) can be computed as

\[ f_{el}(n, w) = 3 \cdot f_{tl}(n, w) + f_{cl}(n) = 3 \cdot \frac{(n + 2) \cdot 16}{w} + \left\lceil \frac{\log_2(n)}{\beta} \right\rceil + 5 \tag{11} \]

This formula describes the lower bound for the time of the data transfer and computations. In some cases, there is no need to transfer the result back, e.g., because no reduction occurred between the two input points.

The performance of the accelerator can be expressed as a number of reductions per second. With a known clock frequency H, the performance P_FPGA for an n-dimensional lattice and the selected bus width w is

\[ P_{FPGA}(n, w) = \frac{H}{f_{el}(n, w)} \tag{12} \]

In Fig. 5, a comparison of the performance between solutions with different bus widths and CPU scenarios is presented. The clock frequency is set to 200 MHz, slightly below the maximum clock frequency obtained from the compilation tools for the presented design. The performance of the CPU is marked with dots. The red dots represent the performance of the CPU with vectors that fit in the processor's cache (Experiment #1); the orange dots represent the performance for a larger set of vectors that have to be stored outside of the processor's cache (Experiment #2).

Fig. 5. Performance comparison between designs with different bus widths and two different experiments, involving a different number of vectors, on the same CPU. The performance of the CPU is the average result of 5 trials in the same dimension, and the FPGA performance is obtained from Eq. 12 with a 200 MHz clock.
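The latency and throughput model of Eqs. 8-12 is easy to tabulate. The sketch below does so for a few dimensions under the same assumptions as in the text (β = 4, H = 200 MHz), with the bus width as a parameter; the printed numbers are model values, not measurements.

```cpp
// Scenario I model: reductions per second when every single reduction is shipped to the
// FPGA (Eqs. 9-12), assuming beta = 4 and a 200 MHz accelerator clock.
#include <cmath>
#include <cstdio>

const double BETA = 4.0, H = 200e6;

double f_cl(int n)          { return std::ceil(std::log2((double)n) / BETA) + 5; }   // Eq. 8
double f_nb(int n)          { return (n + 2) * 16.0; }                               // Eq. 9
double f_tl(int n, int w)   { return f_nb(n) / w; }                                  // Eq. 10
double f_el(int n, int w)   { return 3 * f_tl(n, w) + f_cl(n); }                     // Eq. 11
double p_fpga(int n, int w) { return H / f_el(n, w); }                               // Eq. 12

int main() {
    for (int n : {64, 96, 128, 160})
        for (int w : {32, 64, 128, 256})
            std::printf("n=%3d w=%3d  latency=%6.1f cycles  %.2e reductions/s\n",
                        n, w, f_el(n, w), p_fpga(n, w));
}
```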
5.3 Scenario II: sending a set of vectors for reduction

In the second scenario, a j-element set of vectors is first sent to the FPGA. In the next step, the reduction is performed in the same way as in the GaussSieve algorithm, so that the result is a Gauss-reduced set of vectors. In the last step, the entire set is transferred back to the CPU, so the total communication overhead for an n-dimensional lattice and a w-bit data bus is 2 · f_tl(n, w) · j clock cycles. The total number of Reduce function calls is at least j² (see Section 3.1). With these numbers and a known CPU performance P_CPU(n), it is possible to derive a formula for the minimum size of a set that can be Gauss-reduced in a shorter time with the help of an FPGA accelerator than by using only CPUs. Let us assume that for a set of size j, the time required to execute the algorithm is longer when using only CPUs, namely

\[ t_{CPU} \ge t_{FPGA} \tag{13} \]

Expanding both sides and rearranging:

\[ \frac{j^2}{P_{CPU}(n)} \ge \frac{2 \cdot j \cdot f_{tl}(n, w) + j^2 \cdot f_{cl}(n)}{H} \tag{14} \]

\[ j^2 \cdot H \ge 2 \cdot j \cdot f_{tl}(n, w) \cdot P_{CPU}(n) + j^2 \cdot f_{cl}(n) \cdot P_{CPU}(n) \tag{15} \]

\[ j \cdot H \ge 2 \cdot f_{tl}(n, w) \cdot P_{CPU}(n) + j \cdot f_{cl}(n) \cdot P_{CPU}(n) \tag{16} \]

\[ j \cdot H - j \cdot f_{cl}(n) \cdot P_{CPU}(n) \ge 2 \cdot f_{tl}(n, w) \cdot P_{CPU}(n) \tag{17} \]

\[ j \cdot (H - f_{cl}(n) \cdot P_{CPU}(n)) \ge 2 \cdot f_{tl}(n, w) \cdot P_{CPU}(n) \tag{18} \]

and, as a result, the size of the set must be at least

\[ j \ge \frac{2 \cdot f_{tl}(n, w) \cdot P_{CPU}(n)}{H - f_{cl}(n) \cdot P_{CPU}(n)} \tag{19} \]

with j ∈ Z. For n ≥ 50, this inequality is already satisfied for j ≥ 6, so the set easily fits in the FPGA's memory. The required set is rather small. Percentage-wise, the communication overhead plays a smaller part in the total latency than in the first use-case scenario, where the hardware-accelerated performance was not much different from the CPU's.

In this scenario, vectors are stored in FPGA memory cells with negligible access latency, to minimize data access costs. The internal memory has space limitations. Thus, in the described scenario, the reduce module has an upper acceleration boundary, directly affected by the size of the available memory M. One lattice vector requires f_nb(n) bits of memory. Thus, assuming that only vectors are stored in FPGA memory, the maximum possible number of lattice vectors N_lv(n, M) that can be stored in the FPGA is

\[ N_{lv}(n, M) = \frac{M}{f_{nb}(n)} \tag{20} \]

In Fig. 6, the maximum number of vectors that can be stored is presented. The considered dimensions range from 60 to 90. The entries represent the FPGA device with the largest amount of memory in each of Intel's two high-performance families. The third line represents the theoretical memory requirements of the GaussSieve algorithm. For lattices with more than 86 dimensions, no FPGA device is able to fit all the data necessary to perform the GaussSieve algorithm. Thus, this use case cannot be applied directly in practice (as of 2021). The data complexity of lattice sieves is so high that even doubling the FPGA's memory size will not make it possible to solve problems for significantly larger lattices.
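A short sketch of Eqs. 19 and 20 under assumed numbers follows. The CPU reduction rate p_cpu and the on-chip memory size M below are placeholders chosen for illustration, not measured values.

```cpp
// Scenario II model: minimum batch size j for which the FPGA pays off (Eq. 19)
// and the number of vectors that fit in M bits of on-chip memory (Eq. 20).
#include <cmath>
#include <cstdio>

const double BETA = 4.0, H = 200e6;

double f_cl(int n)        { return std::ceil(std::log2((double)n) / BETA) + 5; }
double f_nb(int n)        { return (n + 2) * 16.0; }
double f_tl(int n, int w) { return f_nb(n) / w; }

int main() {
    double p_cpu = 5e6;     // assumed CPU reductions per second (placeholder)
    double M     = 100e6;   // assumed available FPGA memory in bits (placeholder)
    for (int n : {64, 96, 128, 160}) {
        double j   = 2 * f_tl(n, 64) * p_cpu / (H - f_cl(n) * p_cpu);   // Eq. 19, w = 64
        double cap = M / f_nb(n);                                       // Eq. 20
        std::printf("n=%3d  min set size j >= %.0f, capacity ~ %.0f vectors\n",
                    n, std::ceil(j), cap);
    }
}
```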
Fig. 6. The maximum size of a vector set able to fit in FPGA memory for selected devices from two major FPGA families. The green line represents the theoretical memory requirements.

6 Caching approach to lattice sieving for a multi-platform environment

In the previous section, simple use cases of hardware acceleration of the vector reduction were introduced. The next step is to develop a more efficient way of using this accelerator to increase the performance of any sieve. Due to the communication overhead, simply calling the accelerator for every occurrence of the reduce operation does not give any speedup: the data transfer takes almost 90% of the total execution time, and the performance is lower than on a standard CPU. Moreover, it is not possible to run the entire algorithm on an FPGA, because no FPGA has enough memory to perform standalone sieving. Thus, in this section, a caching approach to lattice sieving algorithms in a multi-platform environment is presented. Our modification eliminates the communication delays, works around the memory limitations, and fully utilizes the proposed parallel architecture for lattice sieving by combining previously reported methods with caching techniques. The proposed techniques will also work for other kinds of sieves.

A software/hardware approach is considered, where only a part of the computations is performed in FPGAs, the rest of the algorithm is executed on the CPU, and the majority of the necessary data is stored on the CPU side. Currently, there are several approaches to combining CPUs with FPGAs. Thus, the calculations are not focused on any particular solution, but rather on a universal approach, applicable to each practical realization of a system combining both device types.

6.1 Reducing newly sampled vectors by a set

For large lattice dimensions, the total required memory is significantly larger than the memory available in any FPGA device. For dimensions larger than 85, the accelerator must cooperate with the CPU during the reduction of the newly sampled vectors due to memory limitations. The vectors have to be processed in smaller sets, and an efficient way to manage the data transfer is required.
In this approach, every new vector is used in a reduction at least 2 · |L| times. Assuming that the set L is divided into smaller sets L_i capable of fitting in FPGA memory, the data transfer costs may shrink or even cancel the acquired acceleration. In the most basic approach, a newly sampled vector v_n is reduced by a set L_i that fits in the FPGA's memory. In the first step, v_n is reduced sequentially by the elements of L_i, while the reduction of the elements of L_i by v_n in the second step is executed in parallel. Elements used in the first step of the reduction are replaced by other elements from L. This approach is not efficient due to the data transfer requirements, and several changes must be made to achieve the best performance.

6.2 On-the-fly reduction

It is not necessary to wait until all the data is available on the FPGA side. The designed algorithm should take advantage of the fact that reductions may start right after the first two vectors have been sent. Every new vector is reduced by those transferred so far, and the communication happens in the background. This approach reduces the combined time of computations and data transfers. The gains from the on-the-fly reduction depend on the sieving approach: applying ideas from Bos et al. [6] or from Milde and Schneider [13] will require a different data transfer schedule and will be affected differently by the continuous memory transfer. The ideal algorithm should allow avoiding any data transfer costs.

6.3 Maximizing performance with a proper schedule of operations

To efficiently accelerate any sieving with FPGAs (or any other devices), the aforementioned elements must be included in the algorithm's design. Taking ideas from the literature on parallel sieves, let us divide the GaussSieve execution into three parts, as proposed by Ishiguro et al. [9], and extend it to meet our requirements. The algorithm operates on a set S of newly sampled vectors, instead of only one vector.

The first part is the reduction of the set S by the already reduced vectors in the list L. The data transfer latency for one lattice vector depends on the lattice dimension and the data bus width w, as shown in Eq. 10. Thus, to avoid the data transfer overhead, one reduced lattice vector should be processed during exactly the time required to transfer a new one. This can be done by extending the size of the set S from one to k = f_tl(n, w). Then, taking into account the pipelining capabilities of the reduce accelerator, during the first reduction, after k clock cycles, the accelerator is able to start processing a new vector. The algorithm can take advantage of the pipelined execution because there are no data dependencies between the vectors from the set S and an already used vector from the list L. The state of the registers during the first step of sieving is visualized in Figs. 7, 8, and 9. The number of reductions in the first step is equal to k · |L|, and this is also the number of clock cycles spent on computations. The communication cost covers only the sending of the first k + 1 vectors; the remaining vectors are transferred during the computations. The FPGA latency of the first step is then

\[ f_{el}^{p1}(n, w) = k^2 + k \cdot |L| + f_{cl}(n) \tag{21} \]

In the second step, the elements of the set S are reduced against each other. The set is already in FPGA memory, so there is no transfer overhead. The accelerator simply executes the normal GaussSieve algorithm.
Without the transfer overhead, the latency of the computations in the second step is

\[ f_{el}^{p2} = \frac{k^2}{2} \cdot f_{cl}(n) + \frac{k^2}{2} + f_{cl}(n) \tag{22} \]

During the second-stage computations, a new batch S′ of sampled vectors can be transferred to the FPGA. The number of clock cycles required to transfer the new data, consisting of k vectors, is
expressed as k · f_tl(n, w) = f_tl(n, w)² = k², and is smaller than the computational latency of the second step.

In the last step, all vectors from the list L are reduced by the vectors from the set S, which is already in the FPGA memory. Each vector v ∈ L is reduced by k vectors, and the reductions may be performed in parallel. Again, there is no communication overhead. If any reduction occurs, the lattice vector is transferred back to the CPU in the background. Otherwise, if no reduction happens, there is no need to move the given vector back to the CPU. The latency of the third step is then

\[ f_{el}^{p3} = k \cdot |L| + f_{cl}(n) \tag{23} \]

Fig. 7. The state of the accelerator and registers after the first clock cycle. The first part of the next vector u′ is loaded, while the remaining parts of the SIPO unit contain parts of the previously loaded u. The first vector v_0 from the internal set is delivered to the reduce module to be reduced by the vector u. The FIFO contains k − 1 elements and is smaller by one element than the SIPO unit.

Fig. 8. The state of the accelerator and registers after z = f_cl(n) clock cycles. The reduction of the first vector is finished, and the vector v_0 is going to be put into the FIFO queue. In every part of the reduce module, the same vector u is used for reduction. Only z out of the k parts of the new vector u′ have been transferred so far. The FIFO queue contains k − z elements.

By adding the three steps together, it is possible to compute the latency of adding k new vectors to the list L of already Gauss-reduced vectors. The latency of the first execution is

\[ f_{el}(n, w) = k^2 + 2 \cdot k \cdot |L| + \frac{k^2}{2} \cdot f_{cl}(n) + k^2 + 3 \cdot f_{cl}(n) \tag{24} \]

As the new batch S′ of sampled vectors is transferred during the second step, for every subsequent execution the cost of the data transfer can be omitted, and the final latency becomes

\[ f_{el}(n, w) = k^2 + 2 \cdot k \cdot |L| + \frac{k^2}{2} \cdot f_{cl}(n) + 3 \cdot f_{cl}(n) \tag{25} \]
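Eqs. 24 and 25 can be turned into a small per-batch latency model. As before, β = 4 and the 200 MHz-class design are the assumptions used throughout this section, and the list size |L| is a free parameter of the sketch.

```cpp
// Caching schedule model: latency (in clock cycles) of adding a batch of k = f_tl(n, w)
// vectors to a list of size L_size, following Eqs. 24 and 25.
#include <cmath>
#include <cstdio>

const double BETA = 4.0;

double f_cl(int n)        { return std::ceil(std::log2((double)n) / BETA) + 5; }
double f_tl(int n, int w) { return (n + 2) * 16.0 / w; }

double batch_latency(int n, int w, double L_size, bool first_run) {
    double k = f_tl(n, w);
    double lat = k * k + 2 * k * L_size + k * k / 2 * f_cl(n) + 3 * f_cl(n);  // Eq. 25
    if (first_run) lat += k * k;   // the first run also pays for the initial batch transfer (Eq. 24)
    return lat;
}

int main() {
    for (int n : {96, 128, 160})
        std::printf("n=%3d  steady-state latency per batch: %.0f cycles\n",
                    n, batch_latency(n, 64, 10000.0, false));
}
```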
Fig. 9. The state of the accelerator and registers after k clock cycles. The entire new vector u′ is on the FPGA side and will be used in the next run. Only the vectors v_i still remaining in the reduce module have not yet been reduced by u and will be reduced in the next z clock cycles.

Compared to the CPU, the expected acceleration can be computed as

\[ A = \frac{2 \cdot |L| + k}{P_{CPU}(n)} \cdot \frac{H}{k + 2 \cdot |L| + \frac{k}{2} \cdot f_{cl}(n) + \frac{3 \cdot f_{cl}(n)}{k}} \tag{26} \]

where P_CPU(n) is the performance of the CPU for an n-dimensional lattice, expressed as the maximum number of reduce operations per second, and H is the maximum clock frequency of the hardware accelerator.

To determine the acceleration for the targeted platforms, we first measured the performance of a software implementation. We took advantage of the fplll library, which was used as the basis of the g6k code that produced the best results in the TU Darmstadt SVP Challenge (as of May 2021, dimensions from 158 to 180). Thus, the sieving operations implemented in fplll were used to construct experiments aimed at measuring the performance of software sieving. In the designed experiment, a large set of vectors is sampled and pairwise reduced. Only the reduction time is measured. The size of the set is large enough to exceed the processor's cache memory, which allows us to measure the performance in a realistic scenario. For the experiments, an Amazon Web Services c5n.18xlarge instance, equipped with a 72-core, 3.0 GHz Intel Xeon Platinum processor, was used.

In Fig. 10, the expected acceleration for the targeted lattice dimensions is presented as a combined visualization of Eq. 26 and the obtained experimental data. For the dimensions currently considered in the SVP Challenge (between 158 and 180), the expected acceleration from the proposed FPGA accelerator, compared to one CPU core, is around 45x. On the FPGA side, the clock frequency was set to 200 MHz, and in our algorithm there is no visible difference between the considered data bus widths. The number of elements in L was set to 10,000. The accelerator almost always performs the pipelined vector reduction; only in a small part of the second stage are the reductions not pipelined. The communication bottleneck has been completely eliminated. Almost the maximum theoretical performance of the proposed accelerator (i.e., the performance without taking the communication overhead into account) has been achieved with this approach. Moreover, the accelerator can be adapted to other parallel sieves and can work with other devices. It is possible to use the proposed accelerator in a parallel implementation of g6k as one of the devices performing the basic step of sieving, the vector reduction.

Some algorithms (e.g., [6]) allow immediately reducing both processed vectors (v − u and u − v) in consecutive steps. In that case, the branched version of the accelerator can be used. The schedule then shrinks to only the first two steps, since in the branched version the first and the last steps are computed at the same time. Thus, the total acceleration can increase by around 1.5 times.
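Eq. 26 can be evaluated directly. In the sketch below, the CPU rate p_cpu is a placeholder value, so the printed factors only illustrate the shape of the curve and are not the measured results behind the ~45x figure.

```cpp
// Expected acceleration over a single CPU core (Eq. 26), with an assumed CPU reduction
// rate; |L| = 10,000 and H = 200 MHz as in the text, beta = 4.
#include <cmath>
#include <cstdio>

const double BETA = 4.0, H = 200e6, L_SIZE = 10000.0;

double f_cl(int n)        { return std::ceil(std::log2((double)n) / BETA) + 5; }
double f_tl(int n, int w) { return (n + 2) * 16.0 / w; }

double acceleration(int n, int w, double p_cpu) {
    double k = f_tl(n, w);
    double denom = k + 2 * L_SIZE + k / 2 * f_cl(n) + 3 * f_cl(n) / k;
    return (2 * L_SIZE + k) / p_cpu * H / denom;   // Eq. 26
}

int main() {
    double p_cpu = 5e6;   // assumed reductions per second on one core (placeholder)
    for (int n : {96, 128, 160, 180})
        std::printf("n=%3d  expected acceleration: %.1fx\n", n, acceleration(n, 64, p_cpu));
}
```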
Fig. 10. Expected acceleration offered by the FPGA in the third use-case scenario, for a single FPGA core clocked at 200 MHz, compared to a pure software implementation of the fplll code run on a single thread. Each dot represents the possible theoretical acceleration for a given trial in the experiment. The red line represents the linear approximation of the possible acceleration.

7 Multiple parallel instances of the accelerator in one FPGA

During the second stage, a new batch of sampled vectors is transferred to the FPGA. However, the communication requires less time than the computations of the second stage. After all k vectors have been transmitted, the data bus sits unused until the third stage of the algorithm. The clock latency of the data transfer, C_T(n), is k² cycles, whereas the computational latency, C_C(n), is k² · f_cl(n) + k² + f_cl(n) cycles. The ratio C_C(n)/C_T(n) indicates how many times longer the computations are. The difference between the two times can be used to send several other sets S to other accelerators implemented in the same FPGA. The maximum number of accelerators working in parallel is then

\[ \frac{C_C(n)}{C_T(n)} = \frac{k^2 \cdot f_{cl}(n) + k^2 + f_{cl}(n)}{k^2} = f_{cl}(n) + 1 + \frac{f_{cl}(n)}{k^2} \approx f_{cl}(n) + 1 \tag{27} \]

The term f_cl(n)/k² is always lower than 1 for large dimensions and can be omitted. Then, for the targeted dimensions, n > 64, the computations are f_cl(n) + 1 times longer than the communication. This number is also the maximum number of accelerators working in parallel, each at full performance. It is possible to connect more accelerator instances, but some of them will have to wait until all the new sets S′ are transferred. Another way to maximize the performance and avoid data transfer bottlenecks is to extend the computation latency f_cl.

In Fig. 11, a schedule representing the execution of the algorithm using several accelerators working in parallel is presented. The number of accelerators is denoted as σ, and each accelerator is denoted as A_i. The execution of the first and the second stage starts at the same time in every accelerator. The vectors used for reduction are the same everywhere. The last stage differs: each subsequent accelerator starts sieving k + f_cl(n) clock cycles after the previous one and takes as input an output vector from the previous unit. The k + f_cl(n) clock cycles are required for processing the first vector by an accelerator and pushing it further. The last accelerator starts working (σ − 1) · (k + f_cl(n)) clock cycles after the first one. This approach is different from the first stage, because the reduction v_i − u_j is performed instead of u_j − v_i, where the vectors v_i are loaded from the CPU.
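A quick sketch of Eq. 27 under the usual β = 4 assumption follows; it simply confirms that the second-stage transfer window can keep roughly f_cl(n) + 1 accelerator instances busy.

```cpp
// Maximum number of accelerator instances that the second-stage transfer window can
// keep busy (Eq. 27), for a few dimensions and a 64-bit bus.
#include <cmath>
#include <cstdio>

const double BETA = 4.0;

double f_cl(int n)        { return std::ceil(std::log2((double)n) / BETA) + 5; }
double f_tl(int n, int w) { return (n + 2) * 16.0 / w; }

int main() {
    for (int n : {96, 128, 160}) {
        double k = f_tl(n, 64);
        double ratio = (k * k * f_cl(n) + k * k + f_cl(n)) / (k * k);   // Eq. 27
        std::printf("n=%3d  C_C/C_T = %.2f  =>  about %d parallel instances\n",
                    n, ratio, (int)ratio);
    }
}
```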
Fig. 11. The activity diagram for multiple accelerators in one FPGA. A_i denotes the i-th instance of the accelerator in the FPGA, and σ is the number of accelerators. k is the size of the sampled vector sets S and S′; S is currently in use, while S′ will be used in the next run. f_cl(n) is a function representing the reduction latency for n-dimensional vectors.

The inbound transmission is divided into two parts. In the first and the last stage, the previously reduced vectors, marked by small rectangles, are transferred to the FPGA. In the second stage, a new batch of sampled vectors, divided into σ sets and marked as a bigger rectangle, is transferred. The outbound transmission starts in the third stage and can proceed until the end of the second stage. At first, the Gauss-reduced sets S_i of vectors from the second stage are sent back to the CPU. As for the results from the third stage, only the shortened vectors are pushed back to the CPU. It is hard to estimate the number of shortened vectors, so to avoid data losses, the output FIFO queue should have enough memory available. Fortunately, there is only one FIFO queue, receiving data from the last accelerator.

The performance of multiple accelerators implemented in one FPGA scales with their number. For σ accelerators, the performance is approximately σ times higher compared to a single one. The acceleration is not exactly σ times better, due to the (σ − 1) · (k + f_cl(n)) clock cycles of delay for the last module and proportionally less for the other modules in the third step.

A single instance of the accelerator is presented in Fig. 12. This instance is able to perform the parallel version of the GaussSieve in the way described in this paper. The serial-in, parallel-out (SIPO) unit is used to concatenate arriving data into one vector. After the entire vector has been transferred, it may be saved to one of the internal FIFO queues currently not used by the reduce module, or provided directly as one of the input vectors for reduction. Vectors from the CPU are saved in the queues only in the second stage of the algorithm. After a reduction is applied, there are two options. In the first and the second stage, vectors are always written back to the currently used internal queue. In the second and the third stage, vectors that were shortened during the reduction are placed in the output FIFO queue. In the second step, all vectors are transferred back to the CPU, but in the third stage, only the reduced vectors are moved back. If the vectors stay the same (i.e., there was no reduction), they are overwritten in the FPGA.

A multi-core version consists of several instances connected into a chain of accelerators. The output of one element is connected to the input of the next element. The last instance is responsible for the data transfer back to the CPU.
Fig. 12. The reduce accelerator with supporting logic and an input/output interface. The shaded part is one instance of the accelerator, while the remaining parts are required for data concatenation and transfer. A pipelined version without branching is used.

7.1 Final results

To measure the highest possible acceleration, we tried to fit as many instances as possible in one FPGA. The final number for 160-dimensional lattices is 20 accelerators working in parallel. This number is bigger than the boundary in Eq. 27, so we extended the f_cl latency to 12 to achieve a higher clock frequency (150 MHz) and in this way mitigated the second-stage data transfer bottleneck. The design was described in the VHDL language and verified in simulation. Our code passes all stages of the FPGA design process. However, an actual run for a 160-dimensional lattice would take too long to be attempted with the current equipment and algorithm. Therefore, we used a method proposed in the literature for comparing cross-platform implementations [11], and we compared the cost of our estimated results using two Amazon AWS instances: f1.2xlarge, equipped with Xilinx FPGAs, and the aforementioned c5.18xlarge with the Intel Xeon. The results are presented in Table 2. The FPGA-based AWS instance can solve an equivalent problem for only 6% of the price of the CPU-based instance.

Table 2. The normalized cost comparison for the GaussSieve executed on CPUs and FPGAs. The performance of one core is used as the reference value to compute the acceleration for multiple cores. The total acceleration refers to the acceleration obtained by fully utilizing a device; it denotes the number of cores multiplied by their acceleration, which is equivalent to the number of CPU cores that matches the same performance. The normalized acceleration compares the FPGA design to a multi-core CPU. The price per acceleration is given in row F, and this price is compared to the price for the CPU in row G.

    No.  Device                          CPU    FPGA
    A    # of cores                      72     20
    B    acceleration per core           1      30
    C    total acceleration (A · B)      72     600
    D    normalized acceleration         1      8.32
    E    AWS price ($/h)                 3.05   1.65
    F    price per acceleration (E/D)    3.05   0.20
    G    compared to CPU (F/F.CPU)       1      0.06
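The derived rows of Table 2 follow from rows A, B, and E by simple arithmetic; the sketch below re-computes them, with small rounding differences against the table expected.

```cpp
// Re-deriving rows C, D, F, and G of Table 2 from the raw inputs (rows A, B, E).
#include <cstdio>

int main() {
    double cpu_cores = 72,  cpu_acc_per_core  = 1,  cpu_price  = 3.05;
    double fpga_cores = 20, fpga_acc_per_core = 30, fpga_price = 1.65;

    double cpu_total  = cpu_cores * cpu_acc_per_core;     // row C (CPU)  = 72
    double fpga_total = fpga_cores * fpga_acc_per_core;   // row C (FPGA) = 600
    double norm       = fpga_total / cpu_total;           // row D ~ 8.33 (8.32 in the table)
    double cpu_ppa    = cpu_price / 1.0;                  // row F (CPU)
    double fpga_ppa   = fpga_price / norm;                // row F (FPGA) ~ 0.20
    double rel        = fpga_ppa / cpu_ppa;               // row G ~ 0.06

    std::printf("normalized acceleration: %.2f\n", norm);
    std::printf("price per acceleration:  CPU %.2f $/h, FPGA %.2f $/h (%.0f%% of CPU)\n",
                cpu_ppa, fpga_ppa, rel * 100);
}
```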
7.2 Comparison to other results

It is hard to compare cross-platform implementations. Looking only at the performance, the presented implementation achieves more than an 8x speed-up compared to a 72-core CPU for a 160-dimensional lattice, so it has the performance of around 576 CPU cores. The design from [14] achieved a 21.5x acceleration for a 96-dimensional lattice when compared to [9] (2 CPUs with 8 cores each), so it has the performance of around 344 cores (in a lower dimension). Additionally, the cost of power consumption will very likely be lower for the FPGA than for the GPU.

8 Conclusions

This paper introduces a new approach to lattice sieving that uses a massively parallel FPGA design to accelerate the most common operation in every lattice sieving algorithm, the vector reduction. As an example, the GaussSieve algorithm was accelerated. The acceleration is possible only with the proposed modification to the parallel versions of sieving algorithms. The modification is devoted to eliminating the communication overhead between the specialized circuit, implemented in the FPGA, and the CPU, running the rest of the algorithm, by using a caching strategy.

The acceleration depends on the lattice dimension and increases linearly as a function of that dimension. For the targeted 160-dimensional lattice, the proposed solution is estimated to achieve 8.32 times better performance compared to the CPU. The results were obtained from FPGA simulation and CPU experiments. Comparing the cost of solving the SVP problem on AWS, the presented architecture requires only 6% of the CPU-based cost. Our project is also the first attempt reported to date to accelerate lattice sieving with specialized hardware.

The proposed hardware accelerator can be used directly for almost any lattice sieve that performs a vector reduction operation. In this paper, the GaussSieve algorithm was investigated as an example. The parallel hardware architecture with the proposed caching strategy can be adapted to other GaussSieve modifications reported in the literature [13], [6], [9], as well as to other lattice sieving algorithms with better complexity. As part of future work, the adaptation of the presented solution to algorithms other than GaussSieve will be explored. Additionally, the application of the proposed solution to other algorithms that are hard to implement in FPGAs due to the communication and memory bottleneck will be investigated.

References

1. SVP Challenge, https://www.latticechallenge.org/svp-challenge/
2. Ajtai, M., Kumar, R., Sivakumar, D.: An Overview of the Sieve Algorithm for the Shortest Lattice Vector Problem. In: Silverman, J.H. (ed.) Cryptography and Lattices, pp. 1-3. Lecture Notes in Computer Science, Springer Berlin Heidelberg
3. Ajtai, M., Kumar, R., Sivakumar, D.: A sieve algorithm for the shortest lattice vector problem. In: Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing - STOC '01, pp. 601-610. ACM Press. https://doi.org/10.1145/380752.380857
4. Albrecht, M.R., Ducas, L., Herold, G., Kirshanova, E., Postlethwaite, E.W., Stevens, M.: The General Sieve Kernel and New Records in Lattice Reduction. In: Ishai, Y., Rijmen, V. (eds.) Advances in Cryptology - EUROCRYPT 2019, pp. 717-746. Springer International Publishing, Cham (2019)
5. Andrzejczak, M., Gaj, K.: A multiplatform parallel approach for lattice sieving algorithms. In: Qiu, M. (ed.) Algorithms and Architectures for Parallel Processing, pp. 661-680. Springer International Publishing, Cham (2020)
6. Bos, J.W., Naehrig, M., van de Pol, J.: Sieving for Shortest Vectors in Ideal Lattices: A Practical Perspective. Int. J. Appl. Cryptol. 3(4), 313-329 (2017). https://doi.org/10.1504/IJACT.2017.089353
7. Detrey, J., Hanrot, G., Pujol, X., Stehlé, D.: Accelerating Lattice Reduction with FPGAs. In: Abdalla, M., Barreto, P.S.L.M. (eds.) Progress in Cryptology - LATINCRYPT 2010, vol. 6212, pp. 124-143. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-14712-8_8
8. Gentry, C., Peikert, C., Vaikuntanathan, V.: Trapdoors for hard lattices and new cryptographic constructions. In: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing - STOC '08, p. 197. ACM Press. https://doi.org/10.1145/1374376.1374407
9. Ishiguro, T., Kiyomoto, S., Miyake, Y., Takagi, T.: Parallel Gauss Sieve Algorithm: Solving the SVP Challenge over a 128-Dimensional Ideal Lattice. In: Krawczyk, H. (ed.) Public-Key Cryptography - PKC 2014, vol. 8383, pp. 411-428. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-54631-0_24
10. Klein, P.: Finding the closest lattice vector when it's unusually close. In: Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 937-941. SODA '00, Society for Industrial and Applied Mathematics, USA (2000)
11. Kuo, P.C., Schneider, M., Dagdelen, O., Reichelt, J., Buchmann, J., Cheng, C.M., Yang, B.Y.: Extreme Enumeration on GPU and in Clouds. In: Cryptographic Hardware and Embedded Systems - CHES 2011, vol. 6917, pp. 176-191. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-23951-9_12
12. Micciancio, D., Voulgaris, P.: Faster exponential time algorithms for the shortest vector problem. In: Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1468-1480. Society for Industrial and Applied Mathematics. https://epubs.siam.org/doi/10.1137/1.9781611973075.119
13. Milde, B., Schneider, M.: A Parallel Implementation of GaussSieve for the Shortest Vector Problem in Lattices. In: Malyshkin, V. (ed.) Parallel Computing Technologies, pp. 452-458. Lecture Notes in Computer Science, Springer Berlin Heidelberg
14. Yang, S.Y., Kuo, P.C., Yang, B.Y., Cheng, C.M.: Gauss Sieve Algorithm on GPUs. In: Handschuh, H. (ed.) Topics in Cryptology - CT-RSA 2017, vol. 10159, pp. 39-57. Springer International Publishing. https://doi.org/10.1007/978-3-319-52153-4_3