Reticle: A Virtual Machine for Programming Modern FPGAs
Luis Vega, University of Washington, USA, vegaluis@cs.washington.edu
Joseph McMahan, University of Washington, USA, jmcmahan@cs.washington.edu
Adrian Sampson, Cornell University, USA, asampson@cs.cornell.edu
Dan Grossman, University of Washington, USA, djg@cs.washington.edu
Luis Ceze, University of Washington, USA, luisceze@cs.washington.edu

Abstract

Modern field-programmable gate arrays (FPGAs) have recently powered high-profile efficiency gains in systems from datacenters to embedded devices by offering ensembles of heterogeneous, reconfigurable hardware units. Programming stacks for FPGAs, however, are stuck in the past: they are based on traditional hardware languages, which were appropriate when FPGAs were simple, homogeneous fabrics of basic programmable primitives. We describe Reticle, a new low-level abstraction for FPGA programming that, unlike existing languages, explicitly represents the special-purpose units available on a particular FPGA device. Reticle has two levels: a portable intermediate language and a target-specific assembly language. We show how to use a standard instruction selection approach to lower intermediate programs to assembly programs, which can be both faster and more effective than the complex metaheuristics that existing FPGA toolchains use. We use Reticle to implement linear algebra operators and coroutines and find that Reticle compilation runs up to 100 times faster than current approaches while producing comparable or better run-time and utilization.

CCS Concepts: Software and its engineering → Software notations and tools; Compilers.

Keywords: compilers, FPGAs

ACM Reference Format: Luis Vega, Joseph McMahan, Adrian Sampson, Dan Grossman, and Luis Ceze. 2021. Reticle: A Virtual Machine for Programming Modern FPGAs. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI ’21), June 20–25, 2021, Virtual, Canada. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3453483.3454075

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
PLDI ’21, June 20–25, 2021, Virtual, Canada
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8391-2/21/06...$15.00
https://doi.org/10.1145/3453483.3454075

1 Introduction

Field-programmable gate arrays (FPGAs) have emerged as a relief to stagnating performance on CPUs and GPUs [16, 17, 38]. Their key advantage is their ASIC-like ability to customize data paths, control logic, and memory hierarchies for specific applications. Unlike an ASIC, however, deploying an FPGA-based accelerator merely requires buying off-the-shelf parts, not the astronomical investment that manufacturing custom silicon entails.

Early FPGAs were simple, homogeneous fabrics mostly based on lookup tables (LUTs), and their toolchains could treat them as fluidly reconfigurable circuits. Modern FPGAs, however, no longer resemble those simple, homogeneous architectures. Because real, specialized hardware remains far more efficient than reconfigurable emulated circuits, modern FPGAs incorporate an array of heterogeneous, special-purpose "hardened" units that implement commonplace functionality: memories, arithmetic units, and complex interconnects [21, 48]. To make these modern FPGAs perform well, it is critical to exploit this fixed-function logic as much as possible; programs that underutilize it can consume significantly more area and power [40].

The mainstream approach to programming FPGAs today is to use behavioral hardware description languages (HDLs), either written by hand or emitted by higher-level languages [4, 12, 22, 34, 35, 49], as shown in Figure 1. These languages, however, rely on behavioral HDLs as an ad hoc IR, because programs can be ported to multiple targets without the burden of directly programming low-level and target-specific primitives. Instead, the complex task of compiling traditional hardware languages to these primitives is normally performed by proprietary vendor toolchains.
Moreover, behavioral HDLs like Verilog and VHDL do not have a way to represent the primitives of modern FPGAs. Instead, vendor toolchains use heuristics that attempt to guess when a program’s logic can efficiently map to a device’s available hardened units. For example, a Verilog expression a + b would need to compile to an adder circuit when generating a custom ASIC; in an FPGA toolchain, it might instead map onto a specific configuration of an FPGA’s digital signal processing slice (DSP) that includes a range of built-in integer arithmetic units.

[Figure 1 diagram. High-level languages → high-level compilers, followed by either this work (IR → instruction selection → ASM → place) or the traditional approach (behavioral HDLs → synthesis → place → route → bitstream).]

Figure 1. Overview of the traditional compilation pipeline compared to Reticle’s. Independent of the source language, FPGA programs today are funneled down to the same abstraction (hardware description languages). This abstraction is not expressive enough to capture high-performance operations available in DSPs, such as SIMD.

Relying on heuristics for performing this mapping results in unpredictable and poor performance. As authors from Xilinx, a major FPGA vendor, observed [31]:

    The necessity of breadth coverage by commercial tools often leads to implementations that do not take full advantage of the underlying hardware. For example, UltraScale+ devices employ DSP blocks that are rated at 891MHz for the fastest speed grade. Nonetheless, large designs implemented on FPGAs typically achieve system frequencies lower than 400MHz.

We corroborate this finding in Section 7, which demonstrates that HDL-based FPGA programming leads to extremely slow compilation and unpredictable, suboptimal performance results. In practice, programmers must resort to using vendor-specific Verilog annotations to extract the best performance from modern FPGAs, or even building ad hoc Verilog post-processing scripts to insert directives for efficiency [3]. Although it is fragile and not portable, this approach has no viable alternative: vendors keep lower-level representations secret and the accompanying toolchains are closed source.

This paper’s thesis is that we need a new low-level programming abstraction for modern FPGAs. Innovation in high-level FPGA programming models is accelerating [15, 18, 28, 34], and these new compilers need a better target than Verilog. Where traditional HDLs hide the complexity of modern FPGA resources, a better abstraction would explicitly represent the model-specific capabilities that make modern FPGAs fast. The representation should offer both a high-level abstraction for portability and a low-level abstraction to directly address model-specific hardware resources, and it should come with an infrastructure that provides analyses and optimizations to target specific devices.

We describe Reticle, a low-level language for efficiently programming modern FPGAs. The goal is to provide an intermediate language and a compilation target that can replace the use of traditional behavioral HDLs for targeting FPGAs, as shown in Figure 1. Reticle is a compilation target for higher-level languages that are substantially different from traditional HDLs.

Reticle has two layers: a portable intermediate language that abstracts over specific hardware units while providing high-level control over resource binding and placement, and a parameterized assembly language that explicitly addresses a device’s hardware resources.

The contributions in this paper are:

• We identify and measure the challenges of using behavioral HDLs for programming modern FPGAs in terms of result quality and compiler speed.
• We design Reticle, an intermediate language that can describe the binding of high-performance hardware primitives in modern FPGAs (i.e., DSPs) as well as classic reconfigurable logic (i.e., LUTs).
• We implement a compiler for Reticle that optimizes and lowers hardware programs and then emits structural, device-specific Verilog to bypass the majority of traditional vendor FPGA toolchains.
• We demonstrate that the Reticle compiler runs up to 100× faster than an existing vendor FPGA toolchain while producing comparable or better run-time and utilization results on linear algebra benchmarks.

This paper describes a language design aiming to efficiently program modern FPGAs, but it leaves important avenues for future work to build on. Currently, the intermediate language focuses on programming DSP and LUT slices; it does not support memory primitives, such as BRAMs. Additionally, the paper focuses on designing and implementing an assembly language capable of capturing layout semantics that enable layout optimizations such as instruction cascading. The compiler, however, uses only a simple solver-based approach for placement, which limits the search for optimal layouts for any given program. The layout space needs much more exploration (for example, incorporating timing information), which is beyond the scope of this work. Lastly, the Reticle compiler, as of today, relies on routing and bitstream generation from traditional toolchains, as shown in Figure 1.

2 Background

The first major program transformation that FPGA compilers perform today is hardware synthesis. This transformation
    module bit_and (input a, input b, output y);
      assign y = a & b;
    endmodule

    (* use_dsp = "yes" *)
    module dsp_add (...);
      genvar i;
      for (i = 0; i
[Figure 4 plots. (a) DSP utilization: DSPs used versus loop bound (N), for N from 8 to 1024. (b) LUT utilization: LUTs used versus loop bound (N), for N from 8 to 1024. Series in both panels: behavioral, scalar; structural, vectorized (hand-optimized).]

Figure 4. Resource utilization for multiple loop bounds (N) of the behavioral program described in Figure 3 versus a hand-optimized and structural version of the same program. Even though a compiler hint in the behavioral program requests the use of DSPs, a more optimal DSP configuration exists, leading to under-utilization of the resource potential (the behavioral program runs out of DSP resources and must resort to LUTs).

A second challenge is that, in HDLs, hints are merely suggestions, not constraints. Behavioral-to-structural synthesis heuristics make it difficult to deterministically respect constraints, so toolchains silently ignore hints that they are unable to fulfill. The consequence is that programming with hints is unpredictable in both area and performance.

The third challenge is that directly programming at the structural level, while necessary for peak efficiency, is impractical. It requires understanding the complex, device-specific semantics of DSP configuration parameters. Structurally programming the DSP in Xilinx UltraScale+ FPGAs, for example, can entail setting up to 96 parameters. This representation is verbose, brittle, and vendor-specific: no structural representation is portable across FPGA families.

3 Overview

Reticle is an intermediate representation (IR) and compiler for FPGAs that addresses these challenges. Reticle aims to directly represent and optimize for the heterogeneous programmable resources available in modern FPGAs: specifically, LUTs and DSPs. Its goal is to target the efficiency of structural FPGA implementations while adding abstraction and portability. Reticle is an instruction-based IR that decouples the low-level details of the underlying hardware from a higher-level instruction set that can generate code for different hardware targets. More importantly, Reticle presents an alternative approach for programming FPGAs and is not a drop-in replacement for any stage in the traditional compilation flow (i.e., the goal is not to support traditional HDLs like Verilog). Instead, higher-level languages can use Reticle as a compiler target.

Reticle addresses the challenges from Section 2 by using a more expressive type system that supports vector types, which enable programs to promote particular hardware resources over others when they are available. Additionally, the intermediate language makes primitive constraints part of the language semantics, so the Reticle compiler can reject programs with unsatisfiable constraints instead of silently ignoring them as in HDL hints. Therefore, programs are more predictable in terms of resource usage and performance. We show how to use instruction selection to map a portable representation onto a device-specific representation, while achieving the same optimization results as target-specific structural implementations.

The rest of the paper describes the Reticle language design and compiler implementation. Section 4 describes the two forms of Reticle: a portable, high-level intermediate language and a low-level, device-specific assembly language that can be parameterized for a specific FPGA device. Section 5 describes how the Reticle compiler lowers from the intermediate language to an assembly language. We show how to use standard instruction selection to efficiently and deterministically lower intermediate programs to assembly programs, a sharp departure from traditional FPGA toolchains, which must resort to expensive, often randomized metaheuristics to perform similar lowering [33]. Sections 6 and 7 show how our compiler implementation emits structural hardware descriptions for a specific FPGA target and achieves results comparable to or better than a traditional HDL toolchain while running many times faster. Finally, Section 8 discusses the responsibilities and optimization opportunities for front-end languages when compiling to Reticle.

4 The Language

This section describes the Reticle language. Reticle has two variants: the high-level intermediate language, where operations are abstract and portable across FPGA devices, and a
low-level assembly language, where operations correspond to physical primitives available on a specific device.

Figure 5 lists the syntax for the two languages, which share a common structure and differ in the kinds of operations that are available. We first describe the intermediate language and then show how the assembly language differs.

    (a) The Intermediate Language
      program     ::= name (var : type)* → (var : type)+ { instruction+ }
      instruction ::= wire | compute
      wire        ::= var : type = ⊗[attr*](var*)
      compute     ::= var : type = ⊞[attr*](var+) @ resource

    (b) The Assembly Language
      program     ::= name (var : type)* → (var : type)+ { instruction+ }
      instruction ::= wire | assembly
      wire        ::= var : type = ⊗[attr*](var*)
      assembly    ::= var : type = ⊠[attr*](var+) @ location

    Shared definitions
      resource    ::= ?? | primitive
      location    ::= primitive (coordinate, coordinate)
      coordinate  ::= ?? | expression
      primitive   ::= lut | dsp
      expression  is built from variables, integers, and +
      ⊗ ranges over wire operations, ⊞ over compute operations, and ⊠ over assembly operations; attributes are integers.

Figure 5. The Intermediate and Assembly Languages.

4.1 The Intermediate Language

Figure 5a lists the Reticle intermediate language. A program is a function with a name, a number of typed inputs and outputs, and a sequence of instructions. Function bodies are in A-normal form (ANF) [41]: they consist of a flat list of instructions whose arguments are always variables.

Wire & compute instructions. There are two types of instructions in the language: wire and compute instructions. While both share a common format, compute instructions are the ones that consume device resources and therefore consume area; wire instructions are area-free and only involve wiring. Both kinds of instructions support static integer attributes and argument variables, and always produce a single typed output value.

Wire instructions consist of operations ⊗, while compute instructions are based on an operation ⊞ that is performed by a primitive. Therefore, compute instructions are candidates for optimization. Figure 6 shows an example of wire and compute instructions.

    t0: i8 = const[5];
    t1: i8 = sll[1](t0);
    t2: i8 = add(t0, t1) @ ??;

Figure 6. Reticle instructions to compute the expression 5 × 2 + 5. The constant 5 and shift-left-logical operation consume no compute resources (wire operations), while the add instruction does (compute operation).

Compute instructions also have an annotation, introduced by @, that can optionally control which kind of resource to use on the target device: either LUTs or DSPs. The annotation may be the wildcard ??, in which case the compiler has the freedom to choose which resource to use for the instruction.

Interestingly, operations other than simple bit extraction and slicing can also be implemented as wire instructions, for example static shift instructions. Consider the logical left shift instruction sll in Figure 6, which takes the lower 7 bits of t0 and appends a 1-bit wire tied to the value zero in order to produce t1. Single-bit constant values such as zero and one can be created from the electrical ground and voltage available throughout the device without consuming any LUT or DSP. We therefore leverage this knowledge to define these and other operations, such as constants, as wire instructions.

Instruction set. Table 1 lists the full set of compute and wire instructions in the intermediate language. Most instructions are pure, i.e., they have no side effects. The only exception is the register instruction, reg. In the absence of register instructions, programs can leverage referential transparency.

Table 1. The intermediate instruction set.

    Instruction  Type        Operation
    Compute      Arithmetic  add, sub, mul
                 Bitwise     not, and, or, xor
                 Comparison  eq, neq, lt, gt, le, ge
                 Control     mux
                 Memory      reg
    Wire         Shift       sll, srl, sra
                 Misc        slice, cat, id, const

An add instruction, for example, takes two arguments and writes to an output of a given type, such as:

    c: i8 = add(a, b) @ ??;

A reg instruction looks similar to add, but it is stateful in its operation. Furthermore, the add instruction will write a new value to c each cycle (based on inputs a and b), whereas the reg instruction will hold its value until overwritten. For
example, the following register instruction will produce a 0 as long as b is False. Similarly, once b is True, the value of a will be bound to c every cycle.

    c: i8 = reg[0](a, b) @ ??;

The stateful reg instruction is essential for allowing cycles in programs. Registers "break up" combinational cycles by stopping them from looping back within the same cycle. As we discuss in Section 6.1, a program with a cycle and without a register is considered ill-formed and will be rejected.

Semantics. The primary goal behind the intermediate language is to capture the semantics of operations available in modern FPGAs, while removing details of the primitives used to implement such instructions. This is accomplished by using dataflow and synchronous semantics [7]. The dataflow semantics are used to describe the behavior of pure combinational instructions [45], whereas the synchronous model abstracts away the details about how register instructions are updated. For example, a synchronous design is defined as a hardware program in which all stateful elements (registers) can only be updated on a single event trigger, i.e., a positive clock edge. Therefore, the syntax for describing such timing details is not required for programming FPGAs, resulting in a more compact representation.

4.2 The Assembly Language

The Reticle assembly language resembles the intermediate language, but it replaces high-level, abstract operators like add with target-specific primitives available on a particular FPGA device. Figure 5b lists the syntax for the assembly language, in which compute instructions are replaced with assembly instructions. (Reticle assembly retains the same wire instructions as the intermediate language.) As we lower to hardware targets, special-purpose hardware becomes available that can handle specialized operations, such as multiply-add, with known implementation costs (area and latency). Although assembly instructions are considered specialized instructions, they are still portable within an FPGA family. Devices within a family share the same primitives, varying only the total number of primitives available in them.

To capture the semantics of these varied operations, each assembly operation is defined in terms of a sequence of intermediate language operations, which are then automatically composed in the compilation process. (This means that assembly operations ⊠ can be composed of one or more intermediate operations in a single instruction.) Therefore, the number of assembly instructions is far greater than the number of compute instructions, allowing different FPGA architectures to be targeted with a simpler intermediate language. For example, two intermediate operations consisting of a multiplication followed by an addition can be fused into a muladd (if it is supported by the hardware target), which can be expressed in assembly as:

    y: i8 = muladd(a, b, c) @ dsp(??, ??);

[Figure 7 diagram. IR (a) → instruction selection, using the target description (b) → assembly (c) → layout optimizations → instruction placement, using the device layout (d) → assembly (e) → code generation → structural Verilog → routing and bitstream generation (bitgen). The stages are labeled target-independent, family-specific, and device-specific; everything up to code generation is Reticle, and routing and bitstream generation are traditional tools.]

Figure 7. Reticle compilation pipeline. (a) The intermediate program (Figure 5a). (b) The target description specification (Figure 9). (c) The compiled assembly program (Figure 5b). (d) The device layout specification. (e) The placed assembly program with known locations (Figure 5b).

Assembly instructions also differ from compute instructions because they support location semantics. A location includes not only a primitive kind (LUTs or DSPs), but also a Cartesian (x, y) coordinate describing the physical placement of the operation. Each coordinate can be either a concrete expression or a wildcard ??, indicating that the compiler is responsible for determining the placement. While the wildcard gives the compiler the greatest flexibility, placing explicit constraints on coordinates with expressions gives front-end tools greater control over programs (and their ultimate performance). An expression can refer to variables defined in other coordinate expressions to constrain the relative placement of two instructions. For example, an instruction location could be specified using unconstrained variables like (x0_loc, y0_loc), while another instruction is constrained to always be adjacent (right after) within the same column: (x0_loc, y0_loc+1). Because we
used the same variables, the two instructions have a placement relationship: they have the same x_loc (column), and the second instruction is right after the first. Anecdotally, we were inspired by the Lava hardware language [5] and its demonstration of the benefits of incorporating layout semantics to perform layout optimizations (discussed further in related work, Section 9).

5 Compilation

Our hardware compiler performs a series of transformations to convert and optimize a source intermediate program into a target structural representation. The transformations in the compiler are described in Figure 7, including instruction selection, layout optimizations, instruction placement, and code generation. Each of these program transformations progressively increases the level of detail of the compiler target, moving from target-independent to family-specific to device-specific representations.

5.1 Instruction Lowering

The Reticle compiler is responsible for lowering the abstract intermediate language to the concrete assembly language. The core problem is instruction selection, i.e., choosing a high-quality sequence of assembly instructions that have the same semantics as the original intermediate instructions. A key consequence of Reticle’s design is that the problem is similar to instruction selection in a traditional software compiler [2] but applied to the hardware domain. This is not the first time instruction selection has been proposed for hardware compilation [9, 26, 27]. Whereas today’s RTL toolchains rely on slow, unpredictable metaheuristics to do a similar logical-to-physical mapping [33], the Reticle compiler can leverage the large body of work on efficient, deterministic instruction selection algorithms to achieve the same effect.

Figure 8 shows an example of instruction selection in Reticle. The intermediate-language program in Figure 8a is semantically equivalent to both assembly programs in Figures 8b and 8c, assuming a target architecture that supports mul, add, and muladd assembly instructions. The choice of the best implementation depends on the target-specific costs of these instructions.

    (a) Intermediate program
      t0: i8 = mul(a, b) @ ??;
      t1: i8 = add(t0, c) @ ??;

    (b) Assembly program, cost=2
      t0: i8 = mul(a, b) @ dsp(??, ??);
      t1: i8 = add(t0, c) @ dsp(??, ??);

    (c) Assembly program, cost=1
      t0: i8 = muladd(a, b, c) @ dsp(??, ??);

Figure 8. Example of an intermediate program (a) and two equivalent assembly programs (b, c) with different costs.

Reticle’s instruction selector uses a target definition that describes the instructions available for a specific FPGA family. The target description gives the area and latency costs for each assembly instruction along with its semantics in terms of intermediate language instructions.

Target Description Language. Because the availability of different low-level hardware operations (and their costs) can vary across FPGA families, the Reticle compiler needs a mechanism for describing a target platform. We designed a target description language that allows succinct specification of assembly instructions supported by a given FPGA target; it is specified in Figure 9.

    target      ::= definition+
    definition  ::= name [primitive, area, latency] (var : type)* → (var : type) { instruction+ }
    instruction ::= var : type = (⊞ | ⊗)[attr*](var+)
    primitive   ::= lut | dsp
    area, latency, and attributes are integers

Figure 9. The Target Description Language.

    reg[lut, 1, 2](a: i8, en: bool) -> (y: i8) {
      y: i8 = reg[0](a, en);
    }

    add[lut, 1, 2](a: i8, b: i8) -> (y: i8) {
      y: i8 = add(a, b);
    }

    add_reg[lut, 1, 2](a: i8, b: i8, en: bool) -> (y: i8) {
      t0: i8 = add(a, b);
      y: i8 = reg[0](t0, en);
    }

Figure 10. Example of an FPGA target described using the target description language (Figure 9). This hypothetical target supports three assembly instructions (reg, add, and add_reg), which are implemented using LUTs and have area and latency costs of 1 and 2, respectively.

In FPGA terms, a target is defined as a set of devices that support the same kinds of primitives, and it is often referred to as an FPGA family or series. Devices within a family can be programmed with the same set of assembly instructions and differ only in the number of instructions they can accommodate spatially. That is, devices differ only in the number of DSPs and LUTs supported (columns and rows) and in how their columns are arranged, e.g., six columns of LUTs followed by one column of DSPs.
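To make the role of per-instruction costs concrete, the following is a minimal Python sketch of how a cost table could drive the choice between the programs in Figures 8b and 8c. It is an illustration under stated assumptions, not the Reticle compiler's implementation: the `Node`, `Pattern`, `match`, and `select` names are hypothetical, each instruction carries a single scalar cost (the figure does not say whether its cost is area or latency), and the recursion is a plain unmemoized search rather than the dynamic-programming tree-covering algorithm cited as [2].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Intermediate-language expression tree; a node without args is an input."""
    op: str
    args: tuple = ()


@dataclass(frozen=True)
class Pattern:
    """Hypothetical assembly definition: name, scalar cost, and covered shape."""
    name: str
    cost: int
    shape: Node  # leaves with op "_" are holes that match any subtree


HOLE = Node("_")


def match(shape, node, holes):
    """Return True if `shape` covers the top of `node`; collect hole subtrees."""
    if shape.op == "_":
        holes.append(node)
        return True
    if shape.op != node.op or len(shape.args) != len(node.args):
        return False
    return all(match(s, n, holes) for s, n in zip(shape.args, node.args))


def select(node, patterns):
    """Return (total cost, instruction names) of the cheapest cover of `node`."""
    if not node.args:  # program input: nothing to compute
        return 0, []
    best = None
    for p in patterns:
        holes = []
        if not match(p.shape, node, holes):
            continue
        cost, names = p.cost, []
        for sub in holes:  # cover whatever the pattern left uncovered
            sub_cost, sub_names = select(sub, patterns)
            cost += sub_cost
            names += sub_names
        names.append(p.name)
        if best is None or cost < best[0]:
            best = (cost, names)
    if best is None:
        raise ValueError(f"no assembly instruction covers {node.op}")
    return best


# Assumed target with the three instructions named in Figure 8, each of cost 1.
TARGET = [
    Pattern("mul", 1, Node("mul", (HOLE, HOLE))),
    Pattern("add", 1, Node("add", (HOLE, HOLE))),
    Pattern("muladd", 1, Node("add", (Node("mul", (HOLE, HOLE)), HOLE))),
]

# Figure 8a: t0 = mul(a, b); t1 = add(t0, c)
program = Node("add", (Node("mul", (Node("a"), Node("b"))), Node("c")))

print(select(program, TARGET))      # (1, ['muladd'])
print(select(program, TARGET[:2]))  # (2, ['mul', 'add'])
```

With the fused instruction available, the sketch picks the cost-1 cover of Figure 8c; on a target without muladd it falls back to the cost-2 cover of Figure 8b, which is the sense in which the best implementation depends on the target description.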
Concretely, a target description is defined as a list of assembly definitions that represents all the assembly instructions supported by a specific family. Each definition has an operation name, a hardware resource that the operation occupies, area and latency costs as integers, and the (typed) inputs and outputs to the operation. The definition also has a body that defines its semantics in terms of intermediate instructions. The body consists of a sequence of instructions that resemble an intermediate language program, without cycles (it is a DAG) or resource annotations. The instruction selection algorithm uses the body and costs to determine when a fragment of an intermediate language program can be replaced with an equivalent target-specific assembly instruction.

Figure 10 lists an example target in this specification language. This hypothetical target supports three assembly instructions: reg, add, and add_reg.

Instruction Selection. Instruction selection consists of three steps: data-flow graph (DFG) generation, tree partitioning, and selection. Initially, the intermediate program is converted to a DFG, where nodes represent instructions and inputs, and edges correspond to how data flows through the program.

Once the DFG is created, the graph is partitioned into trees of intermediate instructions. The reason for this partitioning is that the DFG might contain cycles, which are not supported by tree-covering algorithms. Because the Reticle definition of well-formed programs excludes combinational cycles (see Section 6.1), we know that simply cutting on register operations is sufficient to make valid trees. The procedure for partitioning the DFG into trees consists of finding the nodes in the graph that are root candidates to make a cut. There are two conditions required to be a root node: (1) the node must be a compute instruction, and (2) it must have either no outgoing edges or more than one. Compute nodes without outgoing edges represent outputs, while compute nodes with more than one outgoing edge can contain cycles and are therefore considered root nodes.

After tree partitioning, the next step is selection, whose goal is to transform and optimize these trees of compute instructions into assembly instructions using the target description specification. Instruction selection is performed using a linear-time tree-covering algorithm originally developed for code generation in compilers [2]. The procedure is based on dynamic programming, using previous solutions to create better solutions at every node while traversing the tree in a postorder fashion. Then, the solutions (assembly instructions) from every tree are composed to produce a final assembly program. The assembly instructions in this program have unknown locations (coordinate holes) that are further optimized spatially (if necessary), and later resolved by the instruction placement stage in the compiler for a specific device.

5.2 Layout Optimizations

After instruction selection, the Reticle compiler can further optimize assembly programs by placing them into high-performance spatial layouts. Layout optimizations can be expressed as constraints in the assembly language, using coordinate expressions. The relative placement of target-specific operations can have a large impact on the efficiency of a program. For example, by placing DSP-mapped operations within the same column, programs can take advantage of DSP cascading: leveraging high-speed routing resources available within DSP columns [42]. Hardware support for DSP cascading is widely available in most architectures today, including FPGAs designed by Intel [23], Xilinx [47], Lattice [30], and Achronix [1].

    (a) Without cascading, regular routing
      t0: i8 = muladd(a, b, in) @ dsp(??, ??);
      t1: i8 = muladd(c, d, t0) @ dsp(??, ??);

    (b) With cascading, high-speed routing
      t0: i8 = muladd_co(a, b, in) @ dsp(x, y);
      t1: i8 = muladd_ci(c, d, t0) @ dsp(x, y+1);

[Figure 11 diagrams. Each panel shows two DSP blocks with input ports a, b, c and ci and output ports y and co. In (a), t0 leaves the first DSP through y and enters the second through c. In (b), t0 leaves through co and enters through ci.]

Figure 11. Example of optimizing the layout of an assembly program (a) using instruction cascading. In (b), the unknown location specifiers ("??") have become parametric layout expressions over x and y coordinates. They imply the adjacency constraint in y, while still being placeable almost anywhere. These constraints can be solved later, during the instruction placement step, for a given device.

Figure 11a shows an example containing a pair of muladd instructions without any layout constraints. There are multiple valid layout candidates for this program; however, the version in Figure 11b, which places the operations vertically adjacent in the same DSP column, is far faster than one that scatters the operations across different columns or places them farther apart within the same column.

A Reticle assembly program can express this layout optimization as a layout constraint using expressions for the
placement coordinates on each instruction. By using x as the column for both operations and y and y+1 as the row expressions, the assembly program describes a placement of neighboring DSPs. Moreover, the semantics of the muladd_co assembly instruction mean that the DSP is not only configured to perform the muladd operation but must also use the cascade port (co) instead of the default port (y) for the result. Similarly, the muladd_ci instruction uses the cascade input port (ci) instead of the default port (c) for the partial sum.

Notably, this and other parameterizable layout optimizations can be ported within an FPGA family using our assembly language, and solved later in the compilation pipeline, i.e., during instruction placement, for maximum portability.

5.3 Instruction Placement

After instruction selection and layout optimizations have taken place, all assembly instructions must be placed in a valid position on the target device. The placement procedure therefore consists of converting a family-specific program (unresolved locations) into an equivalent device-specific program (resolved locations) as described in Figure 7: finding a unique value for each coordinate variable used, as well as filling in all wildcards (??).

Deciding a physical layout consists of mapping all assembly instructions to specific FPGA resources, for a specific target FPGA. Each instruction will already have undergone selection, so the task is reduced to finding a mapping for each LUT instruction to an available LUT slice and each DSP instruction to an available DSP slice. All modern FPGAs are constructed as columns of resources; the layout engine takes as input the layout of the target FPGA, specifically, which columns are DSPs and LUTs, and how many entries or slices those columns have.

Notably, LUT column slices are different from DSP slices, because LUT slices host more than one programmable resource. We formulate the placement problem in terms of these slices.

To solve and optimize layout, we use the Z3 SMT solver [13]. The layout problem is expressed as a series of constraints for each instruction, which are fed to the solver. Z3 quickly finds a valid coordinate assignment for each instruction, subject to the following constraints:
• The x-coordinate must match a column of the appropriate resource (DSP vs LUT);
• The y-coordinate must be between 0 and the maximum number of resources for that type of column;
• If there is a relative constraint placed (as described in Section 5.2) such that this instruction must follow another at y1, then the y-coordinate must be at y1 + 1;
• All instruction resources are unique (this instruction’s coordinates cannot match any other instruction’s coordinates).

If Z3 cannot find a valid placement for every instruction, placement fails.

Once a valid placement is found for each instruction, the layout engine optionally performs a series of shrinking passes as an optimization. It computes the highest x- and y-coordinate for each resource type, takes this as a maximum area, then uses a binary search to successively re-run placement with an artificially reduced area. If it succeeds, the next iteration shrinks again; if it fails, binary search is repeated in the new interval. The end result is a more compact physical layout on the FPGA.

5.4 Code Generation

The goal of code generation is to expand assembly and wire instructions into structural Verilog with layout annotations (Figure 2c). Because of the work of our prior compiler passes, this step is purely one of generation: we simply need to create valid Verilog that reflects our accumulated decisions. While this transformation is more complex than a conventional translation from assembly to a binary format, it conceptually serves the same role: converting to a format that can be used to program a hardware target.

Instructions have been selected, optimized, and placed; now, based on the resources previously chosen for each instruction, they are expanded to a set of primitive LUTs or DSPs. DSP-based instructions are converted into a DSP primitive with a proper configuration in terms of ports and attributes to execute the desired instruction. On the other hand, LUT-based instructions require configuring a LUT for every bit of computation, because these primitives produce a single-bit output and not a full word. (For example, one 8-bit integer operation requires 8 LUTs.) Additionally, some instructions, e.g., addition or subtraction, require other primitives also present within a LUT slice, such as carry chains. In any case, each primitive is annotated with the coordinate result obtained in the instruction placement step.

Not every instruction will result in instantiating LUTs or DSPs. As we expect, wire operations consume no area to execute (they simply require different wiring). These instructions are generated as direct structural Verilog expressions without location information.

6 Implementation

We implemented a Reticle compiler in 8662 LoC in Rust, together with a Verilog AST library (2486 LoC). We used this AST library for code generation. Additionally, a target library describing the assembly instructions supported by the Xilinx UltraScale FPGA is written in 444 lines, using our target description language (TDL). The instruction placement is implemented in 117 lines of Python, using the Z3 bindings.

The following two subsections explain details about the well-formedness criterion and the interpretation of programs.
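As a rough illustration of the placement constraints and the shrinking pass from Section 5.3, the following Python sketch places instructions on a hypothetical column-based device. It is not the paper's implementation: it swaps the Z3 solver for a small backtracking search so that it runs with the standard library alone, and the names `place`, `shrink_rows`, the `(name, resource, follows)` instruction tuples, and the `columns` device description are all assumptions made for this example. The relative constraint is modeled as "same column, one row after the instruction it follows", and an instruction is assumed to appear after the one it follows in the input list.

```python
def place(instrs, columns):
    """Return {name: (x, y)} satisfying the placement constraints, or None.

    instrs  : list of (name, resource, follows); `follows` names an earlier
              instruction that must sit at (x, y1) so this one goes to (x, y1 + 1)
    columns : {x: (resource, rows)}, a hypothetical device description
    """
    placed = {}

    def candidates(resource, follows):
        if follows is not None:
            fx, fy = placed[follows]
            yield fx, fy + 1          # relative constraint: y = y1 + 1
            return
        for x, (kind, rows) in sorted(columns.items()):
            if kind == resource:      # x must be a column of the right type
                for y in range(rows):
                    yield x, y

    def solve(i):
        if i == len(instrs):
            return True
        name, resource, follows = instrs[i]
        for x, y in candidates(resource, follows):
            kind, rows = columns.get(x, (None, 0))
            if kind != resource or not 0 <= y < rows:
                continue              # wrong column type or out of range
            if (x, y) in placed.values():
                continue              # every slot is used at most once
            placed[name] = (x, y)
            if solve(i + 1):
                return True
            del placed[name]          # backtrack
        return False

    return dict(placed) if solve(0) else None


def shrink_rows(instrs, columns):
    """Binary-search the smallest row limit that still admits a placement."""
    best = place(instrs, columns)
    if best is None:
        return None
    lo, hi = 1, max(y for _, y in best.values()) + 1
    while lo < hi:
        mid = (lo + hi) // 2
        limited = {x: (k, min(r, mid)) for x, (k, r) in columns.items()}
        trial = place(instrs, limited)
        if trial is not None:
            best, hi = trial, mid     # succeeded: try an even smaller area
        else:
            lo = mid + 1              # failed: search the upper interval
    return best


if __name__ == "__main__":
    device = {0: ("lut", 4), 1: ("dsp", 4), 2: ("dsp", 4)}
    program = [("m0", "dsp", None), ("m1", "dsp", "m0"), ("a0", "lut", None)]
    print(place(program, device))
```

The sketch shrinks only the row dimension; a fuller version would presumably shrink columns as well and would hand the same constraints to a solver rather than enumerate candidates.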
    t0: i8 = const[4];
    t1: i8 = add(t1, t0) @??;

    (a) Ill-formed

    t0: bool = const[1];
    t1: i8 = const[4];
    t2: i8 = add(t3, t1) @??;
    t3: i8 = reg[0](t2, t0) @??;

    (b) Well-formed

Figure 12. Example of an ill-formed and a well-formed program. A well-formed program only allows cycles when stateful instructions such as reg are present in the path of the cycle.

6.1 Well-Formedness

In hardware design, programs typically need to avoid combinational loops: register-free cycles in the wiring graph that would produce undefined behavior [39]. In Reticle, this constraint manifests as a well-formedness criterion. The dependency graph for a well-formed program, in both the intermediate and assembly language variants, must be acyclic (a DAG) when register instructions (reg) are removed. This section describes how we define and check this criterion.

Figure 12 shows examples of ill-formed and well-formed Reticle intermediate language programs. Both programs attempt to increment a stored value by a constant value 4. And both programs contain dependence cycles: in general, cycles are required for instructions to reuse their own outputs as arguments later in time. However, Figure 12b’s cycle includes a reg instruction while Figure 12a’s has a combinational (register-free) loop.

The Reticle implementation checks well-formedness by forming a dependence graph for a given function, where the vertices are instructions and the edges are definition-use relationships. It then sorts nodes in topological order, excluding reg instructions. If the sort procedure succeeds, the program is well-formed.

Reticle differs from many traditional hardware tools in rejecting programs with combinational loops. Many interpreters (a.k.a. simulators) for hardware description languages (HDLs) such as Verilog and VHDL silently produce undefined or x-values instead of producing errors [46]. Hardware engineers must therefore carefully avoid creating these cycles or risk obscuring serious bugs. We instead opt to reject these programs ahead of time to avoid the need to handle this undesired behavior during compilation and interpretation.

6.2 Interpreter

To clearly define the meaning of a Reticle program, we define an interpreter in Algorithm 1 that evaluates a function program by stepping through a sequence of values defined in an input trace¹ and producing an output trace T. This gives users a fast, convenient way to debug their programs without having to actually program an FPGA.

¹ A “trace” is a general term for a map of values present in a hardware circuit. Each variable (circuit element) in the trace has an associated value for every clock cycle in the domain. An input trace gives a complete specification for a circuit’s inputs, for every cycle, while an output trace does so for the outputs.

Algorithm 1: Reticle interpreter.
     1 function Interpreter(trace, program)
     2   (pure, regs, env) ← WellFormedCheck(program);
     3   (inputs, outputs) ← GetPortNames(program);
     4   T ← Trace();
     5   foreach step_in ∈ trace do
     6     env ← Update(env, step_in, inputs);
     7     env ← Eval(env, pure);
     8     step_out ← Step(env, outputs);
     9     T ← Push(T, step_out);
    10     env ← Eval(env, regs);
    11   end foreach
    12   return T;
    13 end function

The first step in the interpreter consists of checking that a program is well-formed (line 2). This procedure topologically sorts the instructions in the function (see Section 6.1) and returns two sorted instruction queues and an environment containing the initial value of every register instruction. The returned queues include one queue of pure instructions and another queue with register instructions.

For every step in the input trace, the interpreter updates all input variables in the environment with the new values step_in (line 6). Then, pure instructions are evaluated under the current environment (line 7). Next, a new step value is created for all output variables and pushed into the output trace T (lines 8-9). Finally, register instructions are evaluated, updating the environment for the next step in the trace.

7 Evaluation

We evaluated Reticle by generating programs for linear algebra operators and control coroutines (Section 7.1), and then compiled them to structural Verilog with layout annotations (Figure 2c) using the compilation pipeline described in Section 5. We also compiled these benchmarks to two behavioral Verilog baselines for a standard vendor toolchain, and compared their compilation time and the quality of the resulting hardware.

The two behavioral Verilog baselines are: (1) one using standard, portable Verilog, and (2) an advanced version using vendor-specific synthesis hints. The latter represents the use of ad hoc and vendor-specific Verilog language extensions that can tune the toolchain to do a better job of mapping the program to the FPGA’s fixed-function resources, and it represents significant implementation effort beyond standard RTL design. We generate these baselines by transforming Reticle programs using translation backends that emit code resembling standard behavioral Verilog.
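The well-formedness check of Section 6.1 and the interpreter loop of Algorithm 1 can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not the paper's Rust implementation: instructions are encoded as hypothetical `(dst, op, args, attr)` tuples, only `const`, `add`, and `reg` are modeled, bit widths are ignored, and `reg[init](data, en)` is assumed to latch `data` when `en` is true and to hold its value otherwise.

```python
from collections import deque

# Hypothetical encoding: (dst, op, args, attr). `attr` is the value of a
# "const" and the initial value of a "reg"; it is unused for "add".

def well_formed_check(program):
    """Topologically sort the non-reg instructions; reject combinational loops."""
    regs = [ins for ins in program if ins[1] == "reg"]
    pure = {ins[0]: ins for ins in program if ins[1] != "reg"}
    users = {dst: [] for dst in pure}
    pending = {}
    for dst, (_, _, args, _) in pure.items():
        deps = [a for a in args if a in pure]   # edges through regs are cut
        pending[dst] = len(deps)
        for a in deps:
            users[a].append(dst)
    ready = deque(dst for dst, n in pending.items() if n == 0)
    order = []
    while ready:
        dst = ready.popleft()
        order.append(pure[dst])
        for user in users[dst]:
            pending[user] -= 1
            if pending[user] == 0:
                ready.append(user)
    if len(order) != len(pure):
        raise ValueError("combinational loop: program is ill-formed")
    env = {dst: init for dst, _, _, init in regs}
    return order, regs, env


def eval_pure(op, vals, attr):
    if op == "const":
        return attr
    if op == "add":
        return vals[0] + vals[1]                # bit width ignored in this sketch
    raise NotImplementedError(op)


def interpret(trace, program, inputs=(), outputs=()):
    pure, regs, env = well_formed_check(program)
    out_trace = []
    for step_in in trace:
        env.update({name: step_in[name] for name in inputs})
        for dst, op, args, attr in pure:
            env[dst] = eval_pure(op, [env[a] for a in args], attr)
        out_trace.append({name: env[name] for name in outputs})
        # Registers latch together: compute all next values, then commit.
        nxt = {dst: env[args[0]] if env[args[1]] else env[dst]
               for dst, _, args, _ in regs}
        env.update(nxt)
    return out_trace


ILL = [("t0", "const", (), 4), ("t1", "add", ("t1", "t0"), None)]
WELL = [("t0", "const", (), 1), ("t1", "const", (), 4),
        ("t2", "add", ("t3", "t1"), None), ("t3", "reg", ("t2", "t0"), 0)]

if __name__ == "__main__":
    print(interpret([{}, {}, {}], WELL, outputs=("t3",)))
```

The two sample programs mirror Figure 12: `ILL` is rejected because `t1` depends on itself without a register in the cycle, while `WELL` is accepted and its register advances by 4 on every step.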
[Figure 13 plots: for each benchmark, four panels plotted against benchmark size: compiler speedup (log scale), run-time speedup, DSPs used, and LUTs used, for base, hint, and reticle. Rows: (a) tensoradd, (b) tensordot, (c) fsm.]

Figure 13. Compiler, run-time, and utilization results of three benchmarks, tensoradd (a), tensordot (b), and fsm (c), when using behavioral Verilog (base), behavioral Verilog with DSP hints (hint), and Reticle (reticle).

We use a Xilinx xczu3eg-sbva484-1 FPGA, with 360 DSPs and 71K LUTs, as a target device. For the baseline RTL toolchain, we use Xilinx’s Vivado 2020.1.

7.1 Benchmark Description

We use three benchmarks, intended to represent three distinct facets of Reticle: a tensor addition kernel tensoradd demonstrates vectorization, a dot product implementation tensordot demonstrates fused operations and cascading, and a finite state machine fsm demonstrates support for control-oriented programs. Each benchmark is parameterized with four sizes.

The tensoradd benchmark consists of an element-wise summation over four different one-dimensional tensor sizes (128, 256, 512, 1024). We pipelined the addition operation with register instructions to get the best possible performance available in DSP primitives.

Next, tensordot consists of five systolic arrays [29] performing the dot operation over five pairs of one-dimensional tensors of four different sizes (3, 9, 18, 36).

Finally, fsm is based on a coroutine, implemented as a hardware finite state machine (FSM), that ranges over some number of states (3, 5, 7, 9) based on input values. The motivation is to show that Reticle programs can describe control-oriented programs normally found in hardware processor schedulers and protocols. More importantly, these programs can only be implemented on LUTs, not DSPs: conditional branching requires multiplexing (the mux instruction in Reticle), which is implemented using only LUT-based logic.

7.2 Results Comparison

Compiler speed. The leftmost plots in Figure 13 compare the compilation time for Vivado (labeled base for standard Verilog and hint for directive-laden Verilog) and our compiler (reticle), when compiling and placing (layout) programs for every benchmark described in Section 7.1. The Reticle compiler is between 10 and 100 times faster than Vivado. By starting with programs at a lower level of abstraction, the Reticle compiler is solving a simpler problem than a traditional HDL toolchain like Vivado. The Reticle compiler focuses exclusively on selecting and configuring the FPGA’s coarse-grained heterogeneous resources; an RTL toolchain also attempts to perform bit-level logic synthesis [8] to transform behavioral descriptions into structural
realizations, which is important for traditional circuit generation but does not directly affect the mapping to modern units like DSPs.

The compilation performance gains in the linear algebra benchmarks (tensoradd, tensordot) monotonically decrease as the sizes of the tensors grow, which translates into more DSPs to be placed by Reticle’s SMT-based layout mechanism. On the other hand, the speedup obtained when compiling the fsm benchmark is moderate because the number of LUTs used is relatively low.

Run-time performance. The second plots in Figure 13 show run-time speedup for Reticle over Vivado, which is the ratio between the running times for the generated FPGA-based programs from the different compilers. Here, a running time is the critical path of the hardware circuit, which determines the maximum clock frequency at which hardware operates. For tensoradd, Reticle-generated programs are faster than the standard Vivado baseline for all tensor sizes because of the performance advantages of using the hardened units in DSPs compared to LUT-based logic. Vivado’s heuristics fail to exploit DSPs at all using a pure behavioral description (base); Reticle, in contrast, maps the program to DSP hardware deterministically.

Surprisingly, even though there is hardware support for vectorization in every DSP of Xilinx FPGAs, Vivado fails to use this feature when using a behavioral representation, even in the presence of compiler hints. Vivado fails to exploit vectorization even for this simple, dependency-free parallel workload. Reticle successfully selects vectorized DSP configurations in every case. While vectorized configurations make more area-efficient use of DSP resources, they are slightly slower than scalar operations on DSPs. This phenomenon explains why the hint-laden Verilog versions can be slightly faster than Reticle for some sizes: when sufficient DSP resources exist on a target, Vivado can heuristically select scalar operations (at tensor sizes 64, 128, and 256). However, Vivado’s heuristic approach fails when the program grows larger, i.e., at a tensor size of 512: a scalar configuration exhausts all the DSPs on the target, and the toolchain silently falls back to using slower LUT-based implementations instead. At this largest size, the Reticle-generated vectorized program is nearly 3× faster than the Verilog program, both with and without hints. A differently annotated Reticle program could express the scalar configuration as well; we focus specifically on the vectorized version here to show the differences with a traditional HDL toolchain.

Next, the tensordot benchmark shows the benefits of cascading DSPs (Section 5.2). The latest version of Vivado (2020.1) is capable of applying this type of cascade optimization when using hints, similar to our compiler, at the expense of compilation time (up to 100 times slower in the worst case). The performance is the same for Reticle and Verilog with hints, and both outperform plain Verilog.

Lastly, the fsm benchmark shows the performance of control-oriented programs when mapped to LUTs. This kind of control logic is a pathological case for Reticle: there is no way to use hardened logic resources like DSPs, which are Reticle’s main target, and traditional HDL toolchains use complex logic synthesis optimizations to minimize the number of LUTs they require. Our aim with this benchmark is to show that Reticle can nonetheless support this kind of synthesis and that the performance is not much worse than that of a heavily engineered behavioral HDL toolchain. In this case, Reticle produces fsm programs that are slower than those generated from Verilog. While the Reticle compiler focuses on extracting peak performance from hardened logic units like DSPs, it nonetheless supports LUT-based compilation with much faster compilation and some performance penalty.

Utilization. The final two plots in the rows of Figure 13 compare the FPGA resources used by the generated programs. The aim here is to show the difference in the resource binding policies for Reticle versus Verilog. With Verilog, Vivado’s job is to search for any implementation that matches the behavioral description: any resource-binding hints are “soft” and the compiler can ignore them. In contrast, Reticle placement and resource annotations are “hard”: the compiler predictably allocates exactly the kind of resource that the programmer requested.

The benchmarks’ resource utilization reveals this unpredictability in Vivado as the sizes vary. In Reticle, both linear algebra benchmarks use vector instructions and chained instructions (mul followed by an add) that the compiler deterministically maps to DSPs. Vivado performs a heuristic mapping based on the availability of resources, resulting in unpredictable behavior that, for example, silently replaces DSPs with LUTs in the largest size of tensoradd.

8 Discussion

This section describes the requirements and optimization opportunities for front-end tools when targeting Reticle.

8.1 Requirements

When compiling from a higher-level language or tool, the following compilation steps may be necessary depending on the features and abstractions of the source tool. Reticle is built around instructions and does not have higher-level features for control flow, dynamic scheduling of operations, or ambiguous resource sharing; these features, if present, must be compiled down to Reticle instructions. Notably, the absence of these features in Reticle does not exclude any categories of hardware design, but rather requires that higher-level features are translated to explicit, “structural” implementations.
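To make the idea of compiling a higher-level feature down to explicit instructions more concrete, here is a minimal Python sketch of a front end that flattens a conditional expression into Reticle-style text, using mux for the conditional. The tuple-based expression format and the functions `lower` and `compile_expr` are hypothetical and exist only for this example, and every result is typed i8 for simplicity; a real front end would track types and emit a complete function.

```python
from itertools import count


def lower(expr, out, fresh):
    """Lower an expression tree to instruction strings; return the result name.

    expr is either a variable name (str) or a tuple such as
    ("add", e1, e2) or ("if", cond, then_expr, else_expr).
    """
    if isinstance(expr, str):
        return expr
    op, *kids = expr
    # Both branches of an "if" are lowered unconditionally: in hardware,
    # both values exist and the mux merely selects one of them.
    args = [lower(kid, out, fresh) for kid in kids]
    dst = f"t{next(fresh)}"
    name = "mux" if op == "if" else op
    out.append(f"{dst}:i8 = {name}({', '.join(args)});")
    return dst


def compile_expr(expr):
    out = []
    lower(expr, out, count())
    return out


if __name__ == "__main__":
    # if cond then a + b else c
    for line in compile_expr(("if", "cond", ("add", "a", "b"), "c")):
        print(line)
```

Running the example emits an add instruction followed by a mux that selects between the sum and `c`, with no control flow left in the output.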
Control flattening. This step involves flattening any control structures available in the high-level languages to Reticle instructions. For example, if-then-else constructs or phi-nodes can be lowered to mux instructions as:

    t0: i8 = mux(cond, a, b);

Scheduling. This task consists of choosing when abstract operations run by mapping them onto clock cycles and inserting registers. Scheduling decisions impact the performance, including throughput and run-time, of programs. For example, Figure 14 describes two schedules for a program that computes the abstract expression a × b + c. The Reticle program in Figure 14a produces a result every clock cycle, while the program in Figure 14b does so every three clock cycles. Depending on the target primitives, certain schedules can be more profitable than others, because the register distribution within DSP and LUT slices varies considerably.

    t0: i8 = mul(a, b);
    t1: i8 = add(t0, c);
    t2: i8 = reg[0](t1, en); // cycle 0

    (a) Scheduled in one cycle

    t0: i8 = reg[0](a, en); // cycle 0
    t1: i8 = reg[0](b, en); // cycle 0
    t2: i8 = mul(t0, t1);
    t3: i8 = reg[0](t2, en); // cycle 1
    t4: i8 = add(t3, c);
    t5: i8 = reg[0](t4, en); // cycle 2

    (b) Scheduled in three cycles

Figure 14. Example of two different scheduling solutions for a program that computes the expression a ∗ b + c.

Resource sharing. This step includes assigning abstract operations to Reticle instructions, which are eventually lowered to physical resources by the Reticle compiler. Resource sharing strategies often involve space-time trade-offs and, similar to scheduling, affect program performance. For example, Figure 15 shows two different strategies for a program that computes four additions. The program described in Figure 15a uses one add instruction sequentially to compute the four additions, taking four units of time, while the program in Figure 15b uses four instructions for computing all additions in one unit of time (in parallel). If the input tool does not distinguish between these cases, or uses the ambiguity for optimizations, the final space/time decision must be made before outputting Reticle code.

    t0: i8 = add(a, b);

    (a) Sequential

    t0: i8 = add(a, b);
    t1: i8 = add(c, d);
    t2: i8 = add(e, f);
    t3: i8 = add(g, h);

    (b) Parallel

Figure 15. Example of two different resource sharing strategies for a program that computes four additions. (a) The sequential implementation, using one instruction in four units of time, and (b) the parallel implementation, using four instructions in one unit of time.

8.2 Optimizations

The following compilation steps are not required to generate a Reticle program; however, they provide important opportunities for higher-level tools to take advantage of information present in the source program and use it for optimization.

Vectorization. The goal of this optimization is to combine independent scalar instructions that are scheduled at the same clock cycle into vector instructions (Section 4). Front-end tools can promote the use of vector instructions in Reticle by using vector types; alternatively, more complex optimizations can attempt to automatically combine scalar operations into vector expressions. The benefits of vectorization are twofold: faster programs, due to high-performance primitives (i.e., DSPs), and better resource utilization, because more operations are mapped to the same primitive. For example, the two programs shown in Figure 16 describe the result of this optimization. Generating individual instructions for each arithmetic operation is valid, but vectorization can provide large gains in performance and efficiency.

    t0: i8 = add(a, b);
    t1: i8 = add(c, d);
    t2: i8 = add(e, f);
    t3: i8 = add(g, h);

    (a) Scalar program

    t0: i8 = add(a, b);

    (b) Vector program

Figure 16. Example of two equivalent programs computing four additions in parallel: a scalar program (unoptimized) and a vector program (optimized).

Resource binding. The purpose of this optimization is to control how instructions are bound to primitives. The Reticle IR supports such control via the resource annotations dsp or lut, as described in Figures 17a and 17b respectively. These annotations are not required; if omitted, the Reticle compiler will assign resources according to internal metrics as described in Section 5. Interestingly, these constraints can be exploited by higher-level tools to optimize programs for metrics the Reticle compiler does not, by default, accommodate. For example, rather than prioritizing performance,