Polymorphisation: Improving Rust compilation times through intelligent monomorphisation
David Wood (2198230W)
April 15, 2020
MSci Software Engineering with Work Placement (40 Credits)

ABSTRACT

Rust is a new systems programming language, designed to provide memory safety while maintaining high performance. One of the major priorities for the Rust compiler team is to improve compilation times, which is regularly requested by users of the language. Rust's high compilation times are a result of many expensive compile-time analyses, a compilation model with large compilation units (thus requiring incremental compilation techniques), use of LLVM to generate machine code, and the language's monomorphisation strategy. This paper presents a code size optimisation, implemented in rustc, to reduce compilation times by intelligently reducing monomorphisation through "polymorphisation" - identifying where functions could remain polymorphic and producing fewer monomorphisations. By reducing the quantity of LLVM IR generated, and thus reducing code generation time spent in LLVM by the compiler, this project achieves 3-5% compilation time improvements in benchmarks with heavy use of closures in highly-generic, frequently-instantiated functions.

1. INTRODUCTION

Rust is a multi-paradigm systems programming language, designed to be syntactically similar to C++ but with an explicit goal of providing memory safety and safe concurrency while maintaining high performance. The first stable release of the language was in 2015 and each year since 2016, Rust has been the "most loved programming language" in Stack Overflow's Developer Survey [11, 12, 13, 14]. In the years since, Rust has been used by many well-known companies, such as Atlassian, Braintree, Canonical, Chef, Cloudflare, Coursera, Deliveroo, Dropbox, Google and npm.

Each year, the Rust project runs a "State of Rust" survey, asking users of the language for their opinions to help establish development priorities for the project. Since 2017 [22, 19, 20], faster compilation times have consistently been an area where users of the language say Rust needs to improve.

There are a handful of reasons why rustc, the Rust compiler, is slow. Compared to other systems programming languages, Rust has a moderately-complex type system and has to enforce the constraints that make Rust safe at compilation time - analyses which take more time than those required for a language with a simpler type system. Lots of effort is being invested in optimising rustc's analyses to improve performance.

In addition, Rust's compilation model is different from that of other programming languages. In C++, the compilation unit is a single file, whereas Rust's compilation unit is a crate¹. While compilation times of entire C++ and entire Rust projects are generally comparable, modification of a single C++ file requires far less recompilation than modification of a single Rust file. In recent years, the implementation of incremental compilation in rustc has improved recompilation times after small modifications to code.

¹ Crates are entire Rust projects, including every Rust file in the project.

Rust generates machine code using LLVM, a collection of modular and reusable compiler and toolchain technologies. At LLVM's core is a language-independent optimiser, and code-generation support for multiple processors. Languages including Rust, Swift, Julia and C/C++ have compilers using LLVM as a "backend". By de-duplicating the effort of efficient code generation, LLVM enables compiler engineers to focus on implementing the parts of their compiler that are specific to the language (in the "frontend" of the compiler). Compiler frontends are expected to transform their internal representation of the user's source code into LLVM IR, an intermediate representation from LLVM which enables optimisation and code-generation to be written without knowledge of the source language.

While LLVM enables Rust to have world-class runtime performance, LLVM is a large framework which is not focused on compile-time performance. This is exacerbated by technical debt in rustc which results in the generation of low-quality LLVM IR. Recent work on MIR² optimisations has improved the quality of rustc's generated LLVM IR, and reduced the work required by LLVM to optimise and produce machine code.

² MIR is rustc's "mid-level intermediate representation" and is introduced in Section 3.1.3.

Finally, rustc monomorphises generic functions. Monomorphisation is a strategy for compiling generic code, which duplicates the definition of generic functions for every instantiation, producing significantly more generated code than other translation strategies. As a result, LLVM has to perform significant work to eliminate or optimise redundant IR. This project improves the compilation times of Rust projects by implementing a code-size optimisation in rustc to eliminate redundant monomorphisation, which will reduce the quantity of generated LLVM IR and eliminate some of the work previously required of LLVM and thus improve compilation times. In particular, the analysis implemented by this
project will detect where generic parameters to functions, closures and generators are unused, and fewer monomorphisations could be generated as a result.

Detection of unused generic parameters, while relatively simple, was deliberately chosen, as the primary contribution of this work is the infrastructure within rustc to support polymorphic MIR during code generation, which will enable more complex extensions to the initial analysis (see further discussion in Section 5.1). Despite this, there is ample evidence in the Rust ecosystem, predating this project, that detection of unused generic parameters could yield significant compilation time improvements.

Within Rayon, a data-parallelism library, a pull request [17] was submitted to reduce the usage of closures by moving them into nested functions, which do not inherit generic parameters (and thus would have fewer unnecessary monomorphisations). Similarly, in rustc, the same author submitted a pull request [16] to apply the same optimisation manually to the language's standard library.

Likewise, in Serde, a popular serialisation library, it was observed [21] that a significant proportion of the LLVM IR generated was the result of a tiny closure within a generic function (which inherited and did not use generic parameters of the parent function, resulting in unnecessary monomorphisations). In this particular case, the closure contributed more LLVM IR than all but five significantly larger functions.

This project makes all of these changes to rustc, Rayon and Serde unnecessary, as the compiler will apply this optimisation automatically.

In Section 2, this paper reviews the existing research on optimisations which aim to reduce compilation times or code size through monomorphisation strategies. Next, Section 3 describes the polymorphisation analysis implemented by this project in more detail and Section 4 presents a thorough analysis of the compilation time impact of polymorphisation implemented in rustc. Finally, Section 5 summarises the result of this research and describes future work that it enables.

2. BACKGROUND

2.0.1 Monomorphisation-related Optimisations

Optimisation of monomorphisation to improve compilation time is a subject that has received little attention. However, within the Java ecosystem, lots of literature exists on addressing runtime performance impacts that result from monomorphisation (or a lack thereof).

Ureche et al. [24] present a similar optimisation, "Miniboxing", designed to target the Java Virtual Machine and implemented in Scala. Homogeneous and heterogeneous approaches to generic code compilation are presented. Heterogeneous approaches duplicate and adapt code for each type individually, increasing code size (also known as monomorphisation, the approach taken in Rust and C++). Homogeneous translation generates a single function but requires data to have a common representation, irrespective of type, which is typically achieved through pointers to heap-allocated objects. Ureche et al. identify that larger value types (e.g. integers) can contain smaller value types (e.g. bytes) and that this can be exploited to reduce the duplication necessary in homogeneous translations. When implemented in the Scala compiler, performance matched heterogeneous translation and obtained speedups of over 22x compared to homogeneous translation, with only modest increases in code size.

Dragos et al. [3] contribute an approach for compilation of polymorphic code which allows the user to decide which code should be specialised (monomorphised). This approach allows for specialised code to be compiled separately and mixed with generic code. Their approach, again implemented in Scala, introduces an annotation which can be applied to type parameters. This annotation instructs the compiler to specialise code for that type parameter. Generic class definitions require that the approach presented must be able to specialise entire class definitions while maintaining subclass relationships, so that specialised and generic code can interoperate. Evaluation is performed on the Scala standard library and a series of benchmarks where runtime performance is improved by over 20x for code size increases of between 16% and 161%; however, this approach requires that programmers correctly identify functions to annotate.

Stucki et al. [18] identify that specialisation of generic code, which can improve performance by an order of magnitude, is only effective when called from monomorphic sites or other specialised code. Performance of specialised code within these "islands" regresses to that of generic code when invoked from generic code. Stucki et al.'s implementation provides a Scala macro which creates a "specialised body" containing the original generic code, where type parameters have Scala's specialisation annotation. Dispatch code is then added, which matches on the incoming type and invokes the correct specialisation. While this approach induces runtime overhead, this is made up for in the performance improvements achieved through specialisation. Their approach is implementable in a library without compiler modifications and achieves speedups of up to 30x, averaging 12x when benchmarked with ScalaMeter.

Ureche et al. [23] discuss efforts by the Java platform architects in Project Valhalla to update the Java Virtual Machine (JVM)'s bytecode format to allow load-time class specialisation, which will enable opt-in specialisation in Java. Java is contrasted with Scala, which has three generic compilation schemes (erasure, specialisation and Ureche et al.'s miniboxing). The interaction of these schemes can cause subtle performance issues which also affect Project Valhalla. In particular, miniboxed functions are still erased when invoked from erased functions, and boxing is required at generic compilation scheme boundaries. This paper presents three techniques to eliminate the slowdowns associated with miniboxing. The authors validate their techniques in the miniboxing plugin for Scala, producing performance improvements of 2-4x.

These papers focus on languages based on the JVM due to the performance impact that using primitives in generic code can have. In order to preserve bytecode backwards compatibility, the JVM requires that value types be converted to heap objects, typically through boxing, when interacting with generic code. Scala has been of particular focus in this research because of the inclusion of specialisation annotations as a language feature.

2.0.2 Code Size Optimisations

In order to improve Rust's compilation times, this project implements an optimisation to reduce the quantity of generated LLVM IR. It is therefore relevant to review research into optimisations which aim to reduce code-size.
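The duplication that heterogeneous translation (monomorphisation) produces can be seen directly in Rust. The following is a minimal illustrative sketch (not an example from the paper): each distinct instantiation of a generic function causes rustc to emit a separate monomorphised copy in the LLVM IR.

```rust
// A small generic function; `T` is a type parameter.
fn largest<T: PartialOrd>(a: T, b: T) -> T {
    if a > b { a } else { b }
}

fn main() {
    // Two distinct instantiations: under monomorphisation, rustc emits
    // two separate copies of the function, largest::<u32> and
    // largest::<f64>, each specialised to its concrete type.
    let a = largest(1u32, 2u32);
    let b = largest(1.5f64, 0.5f64);
    println!("{} {}", a, b);
}
```

A homogeneous scheme would instead generate a single function and require both arguments to share a common (boxed) representation, which is the trade-off the papers above explore.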
Edler von Koch et al. [4] discuss a compiler optimisation technique for reducing code size by exploiting the structural similarity of functions. Structurally similar functions have equivalent signatures and equivalent control flow graphs. Edler von Koch et al. introduce a technique for merging two structurally similar functions by combining the bodies of both functions and inserting control flow constructs to select the correct instructions where the functions differ. The original functions are replaced with stubs which then call the merged function with an additional argument to select the correct code path. The authors implemented this optimisation in LLVM and achieved code size reductions of 4% on average in Spec CPU2006 benchmarks.

Debray et al. [2] present a series of transformations which can be performed by the compiler to reduce code size. Transformations are split into two categories: classical analyses and optimisations, and techniques for code factoring. Most relevant to the techniques employed by this project is redundant-code elimination. Redundant computations exist where the same computation has been completed previously and that result is guaranteed to still be available. In their implementation, redundant code elimination techniques were particularly effective at removing redundant computation of global pointer values (on the Alpha processor where the techniques were implemented) when entering functions. Debray et al. evaluate the optimisations by implementing them in a post-link-time optimiser, alto, and comparing the code size of a set of benchmarks before and after optimisation by alto, achieving over 30% reductions.

Kim et al. [7] present an iterative approach to the procedural abstraction techniques described by Debray et al. Procedural abstraction identifies sequences of instructions (instruction patterns) which are repeated, and creates a procedure containing the pattern, replacing the original instances with a call to the procedure. In traditional procedural abstraction approaches, such as those discussed in the original paper [2], pattern identification is followed by a saving estimation before patterns are replaced. In contrast, the approach presented by Kim et al. iterates global saving estimation and pattern replacement following pattern identification. In eight benchmarks, an implementation targeting a CalmRISC8 embedded processor averaged a 14.9% reduction in code size compared to traditional procedural abstraction techniques. However, due to the small number of instructions and registers on the CalmRISC8 processor, generated code would present many opportunities to the optimisation, which could skew results higher than might be achieved on more common instruction sets.

Petrashko et al. [15] present context-sensitive call-graph construction algorithms which exploit the types provided to type parameters as contexts. They demonstrate through a motivational example that context-insensitive call graph analyses result in many calls being considered highly polymorphic and not able to be inlined. Through running analyses separately in the context of each possible type argument (possible through their context-sensitive call graph), the authors are able to remove all boxing and unboxing, in addition to producing monomorphic calls which enable Java's JIT to inline and aggressively optimise functions. Petrashko et al. find that through use of the context-sensitive analysis, performance improves by 1.4x and 20% more monomorphic call sites are discovered. Generated bytecode size was reduced by up to 5x and achieved the same performance on code as when hand-optimised with annotations.

2.0.3 Compilation Time Optimisations

Much like research which aims to reduce code-size, research into optimisations to reduce compilation time in more varied scenarios can be valuable to identify optimisation opportunities being missed.

Han et al. [5] discuss a technique for reducing compilation time in parallelising compilers. In this paper, the authors inspect the OSCAR compiler, which parallelises C or Fortran77 programs. OSCAR repeatedly applies multiple program analysis passes and restructuring passes. After OSCAR's second iteration, Han et al. identify that OSCAR will apply an analysis to a function irrespective of whether the restructuring passes changed the function. Han et al. propose an optimisation to reduce compilation time by removing redundant analyses. Compilation time reductions of 27%, 13% and 19% are achieved for three large-scale proprietary real applications.

Leopoldseder et al. [9] present a technique to determine when it is beneficial to duplicate instructions from merge blocks to their predecessors in order to attain further optimisation. The authors implement their approach, Dominance-based duplication simulation (DBDS), in the GraalVM JIT compiler, which introduces a constraint that the optimisation must consider its impact on compilation time. Leopoldseder et al. identify that constant folding, conditional elimination, partial escape analysis/scalar replacement and read elimination are achievable after code duplication has taken place. Other code duplication approaches use backtracking to tentatively perform duplication while maintaining the option to reverse the duplication if it is determined not profitable. Backtracking approaches are compile-time intensive and thus not suitable for JIT compilation. Leopoldseder et al. choose to simulate duplication opportunities and then fit those opportunities into a cost model which maximises peak performance while minimising compilation time and code size. The authors achieve performance improvements of up to 40%, mean performance improvements of 5.89%, mean code size increases of 9.93% and mean compilation time increases of 18.84% across their benchmarks.

Unfortunately, none of these projects address the problem of reducing code-size when monomorphisation is used exclusively as a strategy for compiling generic code.

3. POLYMORPHISATION

Polymorphisation, a concept introduced by this paper, is an optimisation which determines when functions, closures and generators could remain polymorphic during code generation.

Rust supports parameterising functions by constants or types - these are known as generic functions, and this feature is similar to generics in other languages, like Java's generics or C++'s templates. Generic functions and types are a desirable feature for programmers as they enable greater code reuse. When generating machine code, there are two approaches to dealing with generic functions - monomorphisation and boxing.

C++ and Rust perform monomorphisation, where multiple copies of a function are generated for each of the types or constants that the function was instantiated with. In contrast, Java performs dynamic dispatch, where each object is heap-allocated and a single copy of a function is generated,
which takes an address.

In addition, Rust compiles to LLVM IR, the intermediate representation of LLVM. LLVM IR doesn't have any concept of generics, so Rust must perform either dynamic dispatch or monomorphisation.

The initial polymorphisation analysis implemented in this project determines when a type or constant parameter to a function, closure or generator is unused, and thus when this would result in multiple redundant copies of the function being generated. By generating fewer redundant monomorphisations of functions in the LLVM IR, there would be less work for LLVM to do, reducing compilation times and code size.

Types with unused generic parameters are disallowed by rustc, but there are no checks for unused generic parameters in functions. Despite this, it is assumed that it is rare for programmers to write functions which have unused generic parameters. However, closures inherit the generic parameters of their parent functions and often don't make use of these parameters.

For example, consider the code shown in Listing 1, which was taken from Serde, a popular Rust serialisation library.

1 fn parse_value<V>(
2     &mut self,
3     visitor: V,
4 ) -> Result<V::Value>
5 where
6     V: de::Visitor<'de>,
[…]
Listing 1: Generic function taken from Serde

[…] the implementation of polymorphisation.

3.1.1 Query System

Traditionally, compilers are implemented in passes, where source code, an abstract syntax tree or an intermediate representation is processed multiple times. Each pass takes the result of the previous pass and performs an analysis or operation - such as lexical analysis, parsing, semantic analysis, optimisation and code generation. Multi-pass compilers are well-established in the literature, but the expectations on compilers have changed in recent years.

Modern programming languages are expected to have high-quality integration into development environments [10], and assist in powering code completion engines and convenience features such as jump-to-definition. Moreover, users expect this information quickly and while they are writing their code (when it might not parse correctly or type-check). In addition, many programming languages' designs make it hard or impossible to statically determine the correct processing order, as expected by traditional multi-pass compiler architectures.

rustc's query system enables incremental and "demand-driven" compilation, and is integral to satisfying these expectations. Polymorphisation, as implemented in this project, uses the query system, and limitations of the query system are reflected in its design (see Section 3.1.4).

In the query system, the compiler can be thought of as having a "database" of knowledge about the current crate, […]
[…] actual parsing from input data. As such, queries form a directed-acyclic graph.

Queries could form a cyclic graph by invoking themselves. To prevent this, the query engine checks for cyclic invocations and aborts execution.

As all queries are invoked through the query context, accesses can be recorded and a dependency graph can be produced. To avoid unnecessary computation when inputs change, the query system utilises the dependency graph in the "red-green algorithm" used for query re-evaluation.

try_mark_green is the primary operation in the red-green algorithm. Before evaluating a query, the red-green algorithm checks each of the dependencies of the query. Each dependent query will have a colour: red, green or unknown. Green queries' inputs haven't changed, and the cached result can be re-used. Red queries' inputs have changed, and the result must be recomputed. If a dependency's colour is unknown, then try_mark_green is invoked recursively. Should the dependency be successfully marked as green, then the algorithm continues with the next of the current query's dependencies. However, if the dependency was marked as red, then the dependent query must be recomputed. If the result of the dependent query is the same as before re-execution, then it does not invalidate the current node.

3.1.2 Types and Substitutions

rustc represents types with the ty::Ty type, which represents the semantics of a type (in contrast to hir::Ty from the HIR (high-level IR), rustc's desugared AST, which represents the syntax of a type). ty::Ty is produced during type inference on the HIR, and then used for type-checking. Ty in the HIR assumes that any two types are distinct until proven otherwise: u32 (Rust's 32-bit unsigned integer type) at line 10, column 20 would be considered different to u32 at line 20, column 4.

The "substitutions" for a type (conceptually, a list of types that are to be substituted for the generic parameters of the type) are represented by SubstsRef, a list of GenericArg, where GenericArg is a space-efficient wrapper around GenericArgKind. GenericArgKind represents what kind of generic parameter is present - type, lifetime or const.

Ty contains TyKind::Param instances, each of which has a name and an index. Only the index is required; the name is for debugging and error reporting. The index of the generic parameter is an integer which determines its order in the list of generic parameters in scope. To perform a substitution, the subst function on Ty is invoked with a SubstsRef, which replaces each TyKind::Param with the type from the SubstsRef at the corresponding index.

The details of how generic parameters are implemented within rustc are directly leveraged by the polymorphisation implemented in this project, as is the TypeFoldable infrastructure used to implement subst.

TypeFoldable is a trait (Rust's equivalent of a Java interface) implemented by types that embed type information, allowing the recursive processing of the contents of the TypeFoldable. TypeFolder is another trait used in conjunction with TypeFoldable. TypeFolder is implemented on types (like SubstsFolder, which is used by subst) and is used to implement behaviour when a type, const, region or binder is encountered in a type. Used together, the TypeFolder and TypeFoldable traits allow for traversal and modification of complex recursive type structures.

3.1.3 MIR

MIR (mid-level IR) is an intermediate representation of functions, closures, generators and statics used in rustc, constructed from the HIR and type inference results. It is a simplified version of Rust that is better suited to flow-sensitive analyses. For example, all loops are simplified through use of a "goto" construct; match expressions are simplified by introducing a "discriminant" statement; method calls are replaced with explicit calls to trait functions; and drops/panics are made explicit.

MIR is based on a control-flow graph, composed of basic blocks. Each basic block contains zero or more statements, each with a single successor, and ends with a terminator, which can have multiple successors.

Memory locations on the stack are known as "locals" in the MIR. Locals include function arguments, temporaries and user-defined variables.

Within rustc, the MIR is a set of data structures, but it can be dumped textually for debugging. Consider the MIR for the Rust code in Listing 2: it starts with the name and signature of the function in a pseudo-Rust syntax, shown in Listing 3. Following the function signature, the variable declarations
for all locals are shown (Listing 4). Locals are printed with a leading underscore, followed by their index. Temporaries are intermingled with user-defined variables.

1 // WARNING: This output format is intended for
2 // human consumers only and is subject to
3 // change without notice. Knock yourself out.
4 fn main() -> () {
5     ...
6 }
Listing 3: Function prelude in MIR

1 let mut _0: ();
2 let _2: bool;
3 let mut _3: &mut std::collections::HashSet<i32>;
4 let _4: bool;
5 let mut _5: &mut std::collections::HashSet<i32>;
6 let _6: bool;
7 let mut _7: &mut std::collections::HashSet<i32>;
Listing 4: Variables in MIR

To distinguish which locals are temporaries and which are user-defined variables, the MIR output then displays debuginfo for user-defined variables, such as set, shown in Listing 5. Within each scope block, user-defined variables are listed with the user's name for the variable and the variable's place. In this example, the mapping is direct, but the compiler can store multiple user variables in a single local, and perform field accesses or dereferences. Scope blocks represent the lexical structure of the source program (used in generating debuginfo). Each statement and terminator in the MIR output (omitted in these snippets) is followed by a comment which states which scope it exists in.

1 scope 1 {
2     debug set => _1;
3 }
Listing 5: Scope in MIR

All of the remaining MIR output is used by the basic blocks of the current MIR body, starting with bb0, shown in Listing 6. Basic blocks are numbered from zero upwards, and contain statements followed by a terminator on the last line.

1 bb0: {
2     StorageLive(_1);
3     _1 = const std::collections::HashSet::<i32>::new() -> bb2;
4 }
Listing 6: bb0 in MIR

bb0 starts with a StorageLive statement, which indicates that the local _1 is live (until a matching StorageDead statement, not included in this listing). Code generation uses StorageLive statements to allocate stack space. At the end of bb0, there is a Call terminator, which invokes HashSet::new and resumes execution in bb2. While terminators can have multiple successors, because HashSet::new does not require cleanup if it panics, there is no panic edge.

1 bb2: {
2     StorageLive(_2);
3     StorageLive(_3);
4     _3 = &mut _1;
5     _2 = const std::collections::HashSet::<i32>::insert(
         move _3, const 1i32) -> [return: bb3, unwind: bb4];
6 }
Listing 7: bb2 in MIR

bb2 has two StorageLive statements followed by an Assign statement on line 4 into a temporary, creating a mutable borrow of _1, before ending the block with a Call terminator invoking HashSet::insert.

There are other basic blocks in the MIR for Listing 2, but bb0 and bb2 are sufficient for understanding the key concepts of the MIR.

In addition to StorageLive, Assign and StorageDead, there is a selection of other statements:

• Assign statements write an rvalue into a place.

• FakeRead statements represent the reading that a pattern match might do, and exist to improve diagnostics.

• SetDiscriminant statements write the discriminant of an enum variant (its index) to the representation of an enum.

• StorageLive and StorageDead start and end the live range for the storage of a local.

• InlineAsm executes inline assembly.

• Retag retags references in a place - this is part of the "Stacked Borrows" aliasing model by Jung et al. [6].

• AscribeUserType encodes a user's type ascription so that it can be respected by the borrow checker.

• Nop is a no-op, and is useful for deleting instructions without affecting statement indices.

Likewise, in addition to Call, there are various other terminators:

• Goto jumps to a single successor block.

• SwitchInt evaluates an operand to an integer and jumps to a target depending on the value.

• Resume and Abort indicate that a landing pad is finished and unwinding should continue or the process should be aborted.

• Return indicates a return from the function, and assumes that _0 has had an assignment.
• Drop and DropAndReplace drop a place (and optionally replace it).

• Call invokes a function.

• Assert panics if a condition doesn't hold.

• Yield indicates a suspend point in a generator.

• GeneratorDrop indicates the end of dropping a generator.

• FalseEdges and FalseUnwind are used for control flow that is impossible, but required for borrow checking to be conservative.

Closures - similar to anonymous functions or lambdas in other languages - can capture variables from the parent environment, and are also represented in the MIR. Functions and closures are represented almost identically in the MIR. Closures inherit the generic parameters of the parent function, and have additional generic parameters tracking the closure's inferred kind, signature and upvars (captured variables).

Generators are suspendable functions, which operate similarly to closures, but can yield values in addition to returning them. Like closures, generators are represented almost identically to functions in the MIR, but have extra generic parameters which track the generator's inferred resume type, yield type, return type and witness (an inferred type which is a tuple of all types that could end up in a generator frame).

rustc provides a pair of visitor traits (depending on whether mutability is required) which types can implement to simplify traversal of the MIR.

3.1.4 Shims

There are some circumstances where rustc will generate functions during compilation; these functions are known as shims, and they are often specialised for specific types.

Drop glue is an example of a shim generated by the compiler. Rust doesn't require that the user write a destructor for their type; any type will be dropped recursively in a specified order. To implement this, rustc transforms all drops into a call to drop_in_place for the given type. drop_in_place's implementation is generated by the compiler for each T. There are a variety of shims generated by the compiler:

• Drop shims implement the drop glue required to deallocate a given type.

• Clone shims implement cloning for a builtin type, like arrays, tuples and closures.

• Closure-once shims generate a call to an associated method of the FnMut trait.

• Reify shims are for fn() pointers where the function cannot be turned into a pointer.

• FnPtr shims generate a call after a dereference for a function pointer.

• Vtable shims generate a call after a dereference and move for a vtable.

During code generation, the MIR being translated into LLVM IR (or Cranelift IR) has no substitutions applied, so all of the generic parameters in the function are still present. Where required, substitutions (those collected during monomorphisation, as will be discussed in Section 3.1.5) are applied to types. However, for shims, these substitutions are typically redundant, as the MIR of a shim is generated specifically for a type and does not have any remaining generic parameters.

3.1.5 Monomorphisation

In order to determine which items will be included in the generated LLVM IR for a crate, rustc performs monomorphisation collection. Non-generic, non-const functions map to one LLVM artefact, whereas generic functions can contribute zero to N artefacts depending on the number of instantiations of the function. Monomorphisations can be produced from instantiations of functions defined in other crates.

"Mono items" are anything that results in a function or global in the LLVM IR of a codegen unit. Mono items can reference other mono items; for example, if a function baz references a function quux, then the mono item for baz will reference the mono item for quux (typically this results in the generated LLVM IR for baz referencing the generated LLVM IR for quux too). Therefore, mono items form a directed graph, known as the "mono item graph", which contains all items necessary for codegen of the entire program.

In order to compute the mono item graph, the collector starts with the roots - non-generic syntactic items in the source code. Roots are found by walking the HIR of the crate; whenever a non-generic function, method or static item is found, a mono item is created with the DefId of the item and an empty SubstsRef (since the item is non-generic).

From the roots, neighbours are discovered by inspecting the MIR of each item. Whenever a reference to another mono item is found, it is added to the set of all mono items. This process is repeated until the entire mono item graph has been discovered. By starting with non-generic roots, the current mono item will always be monomorphic, so all generic parameters of neighbours will always be known. References to other mono items can take the form of function calls, taking references to functions, closures, drop glue, unsizing casts and boxes.

 1 fn main() {
 2     bar(2u64);
 3 }
 4
 5 fn foo() {
 6     bar(2u32);
 7 }
 8
 9 fn bar<T>(t: T) {
10     baz(t, 8u16);
11 }
12
13 fn baz<F, G>(f: F, g: G) {
14 }

Listing 8: Monomorphisation Example
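The collection process described in Section 3.1.5 is essentially a graph-reachability computation over the mono item graph. The following self-contained sketch (using hypothetical item names and a plain `HashMap` in place of rustc's real data structures) mirrors how a collector could discover the mono items of Listing 8 from its non-generic roots:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Toy sketch of monomorphisation collection: start from the
// non-generic roots and repeatedly inspect each discovered item's
// neighbours until the whole mono item graph is known.
fn collect_mono_items(
    roots: &[&'static str],
    neighbours: &HashMap<&'static str, Vec<&'static str>>,
) -> HashSet<&'static str> {
    let mut items: HashSet<&'static str> = HashSet::new();
    // Worklist of discovered-but-unvisited mono items.
    let mut queue: VecDeque<&'static str> = roots.iter().copied().collect();
    while let Some(item) = queue.pop_front() {
        // `insert` returns false if the item was already collected,
        // so each item's "MIR" is only inspected once.
        if items.insert(item) {
            for &n in neighbours.get(item).into_iter().flatten() {
                queue.push_back(n);
            }
        }
    }
    items
}

fn main() {
    // Mirrors Listing 8: `main` and `foo` are the non-generic roots;
    // instantiations of `bar` and `baz` are discovered as neighbours.
    let mut graph: HashMap<&'static str, Vec<&'static str>> = HashMap::new();
    graph.insert("main", vec!["bar::<u64>"]);
    graph.insert("foo", vec!["bar::<u32>"]);
    graph.insert("bar::<u64>", vec!["baz::<u64, u16>"]);
    graph.insert("bar::<u32>", vec!["baz::<u32, u16>"]);

    let items = collect_mono_items(&["main", "foo"], &graph);
    assert_eq!(items.len(), 6);
    assert!(items.contains("baz::<u32, u16>"));
}
```

Because every root is monomorphic and substitutions are applied while walking each item, every discovered neighbour is fully monomorphic too, which is why a simple worklist suffices.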
For example, consider the code in Listing 8. Monomorphisation will start by walking the HIR of the crate and looking for non-generic syntactic items. After this step, the set of mono items will contain the functions main and foo, both of which have no generic parameters.

Next, monomorphisation will traverse the MIR of each root, looking for neighbours. In main, the terminator of bb0 (shown in Listing 9) references the bar function, with the substitutions u64.

1 bb0: {
2     StorageLive(_1);
3     _1 = const bar::<u64>(const 2u64) -> bb1;

1 query used_generic_params(key: DefId) -> BitSet {
2     desc {
3         |tcx| "determining which generic parameters are used by `{}`",
              tcx.def_path_str(key)
4     }
5 }

Listing 12: Query definition

1 fn used_generic_params(
2     tcx: TyCtxt
The analysis takes the MIR body for the current item and uses a MIR visitor to traverse it. Only three components of the MIR need to be visited by the analysis: locals, types and constants.

visit_local_decl is invoked for each local variable in the item's MIR before the MIR is traversed. This analysis' implementation of visit_local_decl calls super_local_decl (which proceeds with traversal as normal) for all but one case: if the current item is a closure or generator, and if the local currently being visited is the first local, then it returns early.

In the MIR, the first local for a closure or generator is a reference to the closure or generator itself. If the analysis proceeded to visit this local as normal, then it would eventually visit the substitutions of the closure or generator. Unfortunately, this would result in all inherited generic parameters for a closure always being considered used, defeating the point of the analysis.

visit_const and visit_ty are invoked for each constant and type in the MIR, and are where the visitor "bottoms out". This analysis' implementation of visit_const and visit_ty calls a visit_foldable function (which isn't part of the MIR visitor). visit_foldable takes any type which implements TypeFoldable and checks if the type has any type or constant parameters before visiting it with this analysis' type visitor.

In the type visitor, the analysis traverses the structure of a type or constant and looks for ty::Param and ConstKind::Param, setting the index from each within the BitSet, marking the parameter as used. In addition, the type visitor also special-cases closures and generators. In these cases, the analysis is invoked recursively on the closure or generator, and any parameters from the parent which were used in the closure or generator are marked as used in the parent.

 1 fn mark_used_by_default_parameters<'tcx>(
 2     tcx: TyCtxt<'tcx>,
 3     def_id: DefId,
 4     generics: &'tcx ty::Generics,
 5     used_parameters: &mut BitSet,
 6 ) {
 7     if !tcx.is_trait(def_id)
 8         && (tcx.is_closure(def_id)
 9             || tcx.type_of(def_id).is_generator())
10     {
11         for param in &generics.params {
12             used_parameters.insert(param.index);
13         }
14     } else {
15         for param in &generics.params {
16             if param.is_lifetime() {
17                 used_parameters.insert(
18                     param.index);
19             }
20         }
21     }
22
23     if let Some(parent) = generics.parent {
24         mark_used_by_default_parameters(
25             tcx, parent, tcx.generics_of(parent),
26             used_parameters);
27     }
28 }

Listing 14: Used-by-default parameters

The analysis was significantly more complex during implementation, and had to special-case some casts, and call and drop terminators, to avoid marking parameters as used unnecessarily, often in order to avoid parts of the compiler which were previously always monomorphic. However, during integration of the analysis, support for polymorphic MIR during codegen was improved, and these special cases were removed.

Some generic parameters are only used in the bounds of another generic parameter. After the analysis has detected which generic parameters are used in the body of the item, the predicates of the item are checked for uses of unused generic parameters in the bounds on used generic parameters.

1 fn bar<T>() {}
2
3 fn foo<I, T>(_: I)
4 where
5     I: Iterator<Item = T>,
6 {
7     bar::<I>()
8 }

Listing 15: Example of generic parameter use in predicate

Consider the example shown in Listing 15. foo has two generic parameters, I and T, and uses I in the substitutions of a call to bar. If only the body of foo was analysed, then T would be considered unused, despite it being clear that T is used in the iterator bound on I.

3.2.1 Testing

To test the analysis, a debugging flag, -Z polymorphize-errors, is added, which emits compilation errors on items with unused generic parameters; an example is shown in Listings 16 and 17. By emitting compilation "errors", the project leverages the existing error diagnostics infrastructure to enable introspection into the compiler internals from a test written at the Rust source-code level.

1 struct Foo<F>(F);
2
3 impl<F> Foo<F> {
4     // Function has an unused generic
5     // parameter from impl.
6     pub fn unused_impl() {
7     //~^ ERROR item has unused generic parameters
8     }
9 }

Listing 16: Example of an analysis test

In addition, this approach integrates well into rustc's existing test infrastructure, which writes all of the error output for a test into a stderr file and checks that the output hasn't changed by comparison with that file. Furthermore, anywhere an error is expected is annotated in the test source.
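The type visitor at the heart of the analysis can be modelled outside the compiler. The sketch below is a toy stand-in (a simplified enum rather than rustc's `ty::Ty`, and a `u64` bitmask rather than a `BitSet`): it traverses a type's structure and marks the index of every generic parameter it encounters, just as the real visitor does when it finds `ty::Param` or `ConstKind::Param`.

```rust
// Toy model of the analysis' type visitor; `Ty` is a simplified
// stand-in for rustc's type representation, not the real API.
#[derive(Debug)]
enum Ty {
    Param(u32),     // a generic parameter, like `ty::Param`
    Ref(Box<Ty>),   // a reference type, `&T`
    Tuple(Vec<Ty>), // a tuple type, `(A, B, ...)`
    Uint,           // a concrete type with no parameters
}

// Walk the structure of `ty`, setting one bit per parameter index
// seen, mirroring how the analysis marks indices in its bitset.
fn mark_used(ty: &Ty, used: &mut u64) {
    match ty {
        Ty::Param(i) => *used |= 1 << *i,
        Ty::Ref(inner) => mark_used(inner, used),
        Ty::Tuple(elems) => {
            for elem in elems {
                mark_used(elem, used);
            }
        }
        Ty::Uint => {}
    }
}

fn main() {
    // For an item with two parameters whose MIR only mentions the
    // type `(&T0, u64)`, parameter 0 is used and parameter 1 is not.
    let ty = Ty::Tuple(vec![Ty::Ref(Box::new(Ty::Param(0))), Ty::Uint]);
    let mut used = 0u64;
    mark_used(&ty, &mut used);
    assert_eq!(used, 0b01);
    assert_eq!(used & (1 << 1), 0); // parameter 1 is unused
}
```

A parameter whose bit remains unset after traversal is a candidate for polymorphisation, subject to the predicate check described above.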
1 error: item has unused generic parameters
2   --> $DIR/functions.rs:38:12
3    |
4 LL | impl<F> Foo<F> {
5    |      - generic parameter `F` is unused
6 LL |     // Function has an unused generic
7 LL |     // parameter from impl.
8 LL |     pub fn unused_impl() {
9    |            ^^^^^^^^^^^

Listing 17: Example of an analysis test error

3.3 Implementing polymorphisation in rustc

Integration of the analysis' results into the compiler required significant debugging and many targeted modifications where previously only monomorphic types and MIR were expected.

Throughout rustc's code generation infrastructure, the ty::Instance type is used. An Instance is constructed through use of Instance::resolve, which takes a (DefId, SubstsRef). For each substitution, the matching index in the analysis' BitSet is checked to determine whether the parameter was used. If the parameter was used, then the current parameter is kept. If the parameter was unused, then the parameter is replaced with an identity substitution. During implementation, replacing the parameter with a new ty::Unused type was considered, but it was hypothesised that this would require more changes to the compiler than an identity substitution.

For example, given an unsubstituted type containing ty::Param(0): if the parameter were considered used by the analysis, then it would be substituted by the real type; but, if the parameter were considered unused, then it would be substituted with ty::Param(0) and remain the same.

During monomorphisation collection, as described in Section 3.1.5, when a mono item is identified, it is added to a set. This project concerns itself with one kind of mono item, MonoItem::Fn, which contains an Instance type. When a new MonoItem::Fn is created, Instance::polymorphize is invoked before it is added to the final mono item set. This has the effect of reducing the number of mono items.

1 fn foo(_: B) { }
2
3 fn main() {
4     foo::(1);
5     foo::(2);
6     foo::(3);
7 }

Listing 18: Example of polymorphisation deduplication

Consider the example in Listing 18: monomorphisation would normally produce three mono items (and three copies of the function):

• foo::

• foo::

• foo::

However, with the analysis, only a single mono item would be produced: foo::.

rustc's code generation iterates over the mono items collected and generates LLVM IR for each, adding the result to an LLVM module. By de-duplicating during monomorphisation collection, polymorphisation "falls out" of the way the compiler is structured (though some more changes are required).

In particular, various changes to code generation were required wherever another Instance is referenced. For example, when generating LLVM IR for a TerminatorKind::Call (a call or invoke instruction in LLVM), an Instance is generated to determine the exact item which is being invoked and its mangled symbol name. Without applying the Instance::polymorphize function, the instruction would refer to the mangled symbol for the unpolymorphised function, which no longer exists.

Reuse of monomorphisations from dependencies is handled by a should_codegen_locally function, which checked whether upstream monomorphisations were available for a given Instance. should_codegen_locally is invoked before a mono item is created (and polymorphisation occurs), so it looks for an unpolymorphised monomorphisation in the upstream crate. However, if the item would be polymorphised, then the upstream crate would only contain a polymorphised copy, which would result in unnecessary code generation when compiling the current crate. should_codegen_locally was changed to always check for a polymorphised monomorphisation in the upstream crate.

3.3.2 Specialisation

In Rust, data structures are defined separately from their methods. Multiple impl blocks can exist which implement methods on a type. For generic types, impl blocks can pick concrete types for parameters or remain generic, but multiple impls cannot overlap. Specialisation is an unstable feature in Rust which enables impl blocks to overlap if one block is clearly "more specific" than another.

Instance::resolve can only resolve specialised functions of traits under certain circumstances, and detects whether further specialisation can occur by whether the item needs substitution (i.e. has unsubstituted type or constant parameters). After integration of polymorphisation, this check had to be more robust. In particular, polymorphisation meant that closure types in Instance::resolve could need substitution (by containing a ty::Param from an unused parent parameter) and would thus "need substitution" for the purposes of the specialisation check, despite this having no impact on whether the item was further specialisable.

Initially, this was resolved by implementing a fast type visitor which checked for generic parameters, but would not visit the parent substitutions of a closure or generator. However, this check is in the "fast path" and a visitor would have been too expensive, so this was replaced by an addition to the TypeFlags infrastructure within the compiler (after some refactoring [1] enabled this change), shown partially in:

2     &ty::Closure(_, ref substs) => {

For normal functions, InstanceDef::Item(DefId) is the variant of InstanceDef which is used. However, for drop glue, an InstanceDef::DropGlue(DefId, Option<Ty>) is used, where the optional Ty is the type that the shim is being generated to drop. InstanceDef::DropGlue's DefId always referred to the drop_in_place function. With polymorphisation, the same identity parameters would be applied to the Ty contained in the InstanceDef.

4. EVALUATION

In this section, the compile-time performance of the modified compiler on a set of benchmarks is compared with a build of the
compiler without this project’s changes applied. mono items generated for these benchmarks was also com- For the purposes of evaluation, the compiler is built in pared and the results are shown in Table 2. release configuration from the Git commit with the hash 2f2ccd2 [26] (PR #69749 [28] at 9ef1d94 [27] merged with • script-servo is the script crate from Servo, the par- master, 150322f [8]). 150322f is also built in release config- allel browser engine used by Firefox. uration to act as a baseline for comparisons. • regression-31157 is a small program which caused Benchmarking has been performed two times previously large performance regressions in earlier versions of the and performance improvements identified from those earlier Rust compiler. runs are implemented in the versions being compared in this section, as discussed in Section 4.1. • ctfe-stress-4 is a small program which is used to There are 40 benchmarks that are run in rustc’s stan- stress compile-time function evaluation. dard performance benchmarking suite [29]. Benchmarks are either real Rust programs from the ecosystem which are im- In applicable benchmarks, compilation time performance portant; real Rust programs from the ecosystem which are on clean runs generally improves by 3-5%. Incremental runs known to stress the compiler; or artificial stress tests. generally have lower performance improvements, if any. script- All benchmarks are compiled with both debug and release servo in the release configuration’s results show a 11.4% profiles, some benchmarks are run with the check profile. slowdown in patched incremental runs. Query profiling shows Each benchmark is run at least four times: that this slowdown comes from LLVM LTO’s optimisations - this could be due to LTO being more effective and thus • clean: a non-incremental build. more optimisation taking place or that the polymorphised functions make LTO more expensive to perform. 
However, • baseline incremental: an incremental build starting LTO does not appear to impact performance in the debug with empty cache. configuration or on other benchmarks. script-servo in the • clean incremental: an incremental build starting debug configuration shows a 5.1% performance improvement with complete cache, and clean source directory – the which corresponds to an 11 second decrease in compilation “perfect” scenario for incremental. time. ctfe-stress-4’s mono-item count shows no reduction. • patched incremental: an incremental build starting This isn’t surprising as the test contains mostly compile- with complete cache, and an altered source directory. time evaluated functions (as expected for a test which aims The typical variant of this is “println” which represents to stress compile-time function evaluation). This suggests the addition of a println! macro somewhere in the that that polymorphisation may have reduced the number source code. of functions which had to be evaluated at compile-time. Memory usage across the benchmarks was very varied, perf is used to collect information on the execution of each ranging from 19% increases to 15% decreases, depending on benchmark, notably: CPU clock; clock cycles; page faults; the benchmark. Like instructions executed, memory usage memory usage; instructions executed; task clock and wall increases typically happened when incremental compilation time. In addition, for each run, the execution count and was enabled, while decreases happened when incremental total time spent on each query in rustc’s query system is compilation was disabled. also captured - this allows for detailed comparisons between Earlier benchmarking revealed that re-execution of the runs to better determine why there has been a performance used_generic_params query resulted in increased loads from impact. 
crate metadata (intermediate data from compilation of de- Furthermore, for a subset of the benchmarks which had an pendencies) so the version of the query benchmarked in appreciable performance impact, the number of mono-items this paper’s results were added to crate metadata and in- generated are compared. cremental compilation storage. In addition, the BitSet re- Benchmarks were performed on a Linux host with an turned from the query was replaced with a u64 as a micro- AMD Ryzen 5 3600 6-Core Processor and 64GB of RAM optimisation. with ASLR disabled in both the kernel and userspace. 4.1 Discussion 5. CONCLUSIONS AND FUTURE WORK For a majority of the benchmarks, there was no apprecia- This paper has presented a compilation time optimisation ble difference in compilation time performance. This result in the Rust compiler which reduces redundant monomor- is expected, the optimisation would have limited impact in phisation. Currently, this implementation from this project programs which do not contain highly generic code contain- is being reviewed as PR #69749 [28] in the upstream Rust ing closures. project and preliminary work, such as PR #69935 [25], has In those benchmarks that did not show any significant landed. improvement or regression, any performance differences be- There are no aspects of the Rust language itself which tween the compiler versions are considered noise. The thresh- makes implementation of this optimisation particularly chal- old used for noise is determined by comparing compiler ver- lenging or complex, most of the difficulty arises from imple- sions that had no functional changes (“noise runs”) and ob- mentation details of rustc. With that said, rustc is well serving the impact on performance results on each bench- suited to having this optimisation implemented. mark. 
Due to the structure of the compiler, in an ideal scenario, Three benchmarks were appreciably impacted by the opti- only modifications to monomorphisation and call genera- misation implemented in this PR, results for these in Table tion are necessary to enable polymorphisation. Alternate 1 (complete results available online [29]). The number of approaches, such as changing the the MIR of closures and 12
generators to only inherit the parameters that they use from their parents, would be significantly more invasive and challenging, as the current system has emergent properties which enable simplifications in other areas of the compiler.

However, rustc will not always fail eagerly when an invalid case is encountered, which can significantly lengthen debugging sessions as the root cause of a failure is determined. This project has provided an excellent opportunity to learn more about the structure of the Rust compiler, and the techniques used in modern, production compilers to implement LLVM IR generation.

Implementation of polymorphisation resulted in 3-5% compilation time improvements in some benchmarks. While there are further improvements that could be made to improve the integration and improve performance when incremental compilation is enabled (see Section 5.1), this is a promising result.

5.1 Future Work

Investigation into the interaction between polymorphisation and LTO should be performed to determine the source of the compile-time regression noticed in the script-servo benchmark. In addition, tweaking of the circumstances under which the query's results are cached (both incrementally and cross-crate) should be performed to reduce the impact on performance experienced when incremental compilation is enabled.

With infrastructure in place to support polymorphic code generation, the polymorphisation analysis can be extended to detect more advanced cases. For example, when only the size of a type is used, then one monomorphisation can occur for each distinct size of instantiated type; or, when the memory representation of a type is independent of the representation of its generic parameters, such as Vec<T> (because the T is behind indirection), functions like Vec::len can be polymorphised.

Furthermore, the integration currently doesn't fully work with Rust's new symbol mangling scheme, which isn't yet enabled by default - support for this could be added.

6. ACKNOWLEDGEMENTS

Thanks to Eduard-Mihai Burtescu, Niko Matsakis and everyone else in the Rust compiler team for their assistance and time during this project. In addition, thanks to Michel Steuwer for his comments and guidance throughout the year.

7. REFERENCES

[1] E.-M. Burtescu. rustc: keep upvars tupled in {Closure,Generator}Substs, 2020 (accessed March 25, 2020). https://github.com/rust-lang/rust/pull/69968.

[2] S. K. Debray, W. Evans, R. Muth, and B. De Sutter. Compiler techniques for code compaction. ACM Trans. Program. Lang. Syst., 22(2):378–415, Mar. 2000.

[3] I. Dragos and M. Odersky. Compiling generics through user-directed type specialization. In Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Systems, ICOOOLPS '09, page 42–47, New York, NY, USA, 2009. Association for Computing Machinery.

[4] T. J. Edler von Koch, B. Franke, P. Bhandarkar, and A. Dasgupta. Exploiting function similarity for code size reduction. In Proceedings of the 2014 SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems, LCTES '14, page 85–94, New York, NY, USA, 2014. Association for Computing Machinery.

[5] J. Han, R. Fujino, R. Tamura, M. Shimaoka, H. Mikami, M. Takamura, S. Kamiya, K. Suzuki, T. Miyajima, K. Kimura, and et al. Reducing parallelizing compilation time by removing redundant analysis. In Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems, SEPS 2016, page 1–9, New York, NY, USA, 2016. Association for Computing Machinery.

[6] R. Jung, H.-H. Dang, J. Kang, and D. Dreyer. Stacked borrows: An aliasing model for Rust. Proc. ACM Program. Lang., 4(POPL), Dec. 2019.

[7] D.-H. Kim and H. J. Lee. Iterative procedural abstraction for code size reduction. In Proceedings of the 2002 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES '02, page 277–279, New York, NY, USA, 2002. Association for Computing Machinery.

[8] The Rust Programming Language. commit 150322f, 2020 (accessed March 25, 2020). https://github.com/rust-lang/rust/commit/150322f86d441752874a8bed603d71119f190b8b.

[9] D. Leopoldseder, L. Stadler, T. Würthinger, J. Eisl, D. Simon, and H. Mössenböck. Dominance-based duplication simulation (DBDS): Code duplication to enable compiler optimizations. In Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, page 126–137, New York, NY, USA, 2018. Association for Computing Machinery.

[10] Microsoft. Language server protocol, 2020 (accessed March 29, 2020). https://microsoft.github.io/language-server-protocol/.

[11] S. Overflow. Stack overflow developer survey 2016, 2016 (accessed March 25, 2020). https://insights.stackoverflow.com/survey/2016#technology-most-loved-dreaded-and-wanted.

[12] S. Overflow. Stack overflow developer survey 2017, 2017 (accessed March 25, 2020). https://insights.stackoverflow.com/survey/2017#most-loved-dreaded-and-wanted.

[13] S. Overflow. Stack overflow developer survey 2018, 2018 (accessed March 25, 2020). https://insights.stackoverflow.com/survey/2018/#most-loved-dreaded-and-wanted.

[14] S. Overflow. Stack overflow developer survey 2019, 2019 (accessed March 25, 2020). https://insights.stackoverflow.com/survey/2019#technology-_-most-loved-dreaded-and-wanted-languages.

[15] D. Petrashko, V. Ureche, O. Lhoták, and M. Odersky. Call graphs for languages with parametric polymorphism. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2016, page 394–409, New York, NY, USA, 2016. Association for Computing Machinery.