Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset
Appears in Computer Architecture News (CAN), September 2005

Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset

Milo M. K. Martin(1), Daniel J. Sorin(2), Bradford M. Beckmann(3), Michael R. Marty(3), Min Xu(4), Alaa R. Alameldeen(3), Kevin E. Moore(3), Mark D. Hill(3,4), and David A. Wood(3,4)

http://www.cs.wisc.edu/gems/

(1) Computer and Information Sciences Dept., Univ. of Pennsylvania
(2) Electrical and Computer Engineering Dept., Duke Univ.
(3) Computer Sciences Dept., Univ. of Wisconsin-Madison
(4) Electrical and Computer Engineering Dept., Univ. of Wisconsin-Madison

Abstract

The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under the GNU GPL [9].

1 Introduction

Simulation is one of the most important techniques used by computer architects to evaluate their innovations. Not only does the target machine need to be simulated with sufficient detail, but it also must be driven with a realistic workload. For example, SimpleScalar [4] has been widely used in the architectural research community to evaluate new ideas. However, it and other similar simulators run only user-mode, single-threaded workloads such as the SPEC CPU benchmarks [24]. Many designers are interested in multiprocessor systems that run more complicated, multithreaded workloads such as databases, web servers, and parallel scientific codes. These workloads depend upon many operating system services (e.g., I/O, synchronization, thread scheduling and migration). Furthermore, as the single-chip microprocessor evolves to a chip-multiprocessor, the ability to simulate these machines running realistic multithreaded workloads is paramount to continued innovation in architecture research.

Simulation Challenges. Creating a timing simulator for evaluating multiprocessor systems with workloads that require operating system support is difficult. First, creating even a functional simulator, which provides no modeling of timing, is a substantial task. Providing sufficient functional fidelity to boot an unmodified operating system requires implementing supervisor instructions and interfacing with functional models of many I/O devices. Such simulators are called full-system simulators [14, 20]. Second, creating a detailed timing simulator that executes only user-level code is a substantial undertaking, although the wide availability of such tools reduces redundant effort. Finally, creating a simulation toolset that supports both full-system and timing simulation is substantially more complicated than either endeavor alone.

Our Approach: Decoupled Functionality and Timing Simulation. To address these simulation challenges, we designed a modular simulation infrastructure (GEMS) that decouples simulation functionality and timing. To expedite simulator development, we used Simics [14], a full-system functional simulator, as a foundation on which various timing simulation modules can be dynamically loaded. By decoupling functionality and timing simulation in GEMS, we leverage both the efficiency and the robustness of a functional simulator. Using a modular design provides the flexibility to simulate various system components at different levels of detail.
[Figure 1. A view of the GEMS architecture: Ruby, our memory system simulator (interconnection network, coherence controllers, caches, and memory), can be driven by one of four memory system request generators: the random tester, microbenchmarks (e.g., contended locks, basic verification), Simics, or the Opal detailed processor model.]

While some researchers have taken the approach of adding full-system simulation capabilities to existing user-level-only timing simulators [6, 21], we adopted the approach of leveraging an existing full-system functional simulation environment (an approach also used by other simulators [7, 22]). This strategy enabled us to begin workload development and characterization in parallel with development of the timing modules. It also allowed us to perform initial evaluations of our research ideas using a more approximate processor model while the more detailed processor model was still in development.

We use the timing-first simulation approach [18], which decouples the functional and timing aspects of the simulator. Since Simics is robust enough to boot an unmodified OS, we used its functional simulation to help us avoid implementing rare but effort-consuming instructions in the timing simulator. Our timing modules interact with Simics to determine when Simics should execute an instruction; the result of that execution, however, is ultimately dependent on Simics. Such a decoupling allows the timing models to focus on the most common 99.9% of all dynamic instructions. This task is much easier than requiring a monolithic simulator to model the timing and correctness of every function in all aspects of the full-system simulation. For example, a small mistake in handling an I/O request or in modeling a special case in floating point arithmetic is unlikely to cause a significant change in timing fidelity. However, such a mistake will likely affect functional fidelity, which may prevent the simulation from continuing to execute. By allowing Simics to always determine the result of execution, the program will always continue to execute correctly.

Our approach is different from trace-driven simulation. Although our approach decouples functional simulation and timing simulation, the functional simulator is still affected by the timing simulator, allowing the system to capture timing-dependent effects. For example, the timing model will determine the winner of two processors that are trying to access the same software lock in memory. Since the timing simulator determines when the functional simulator advances, such timing-dependent effects are captured.
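To make this gating concrete, consider the following minimal sketch of the control loop. The class names and the readyToAdvance/executeInstruction interface are invented for illustration; they are not the actual GEMS or Simics APIs.

```cpp
// Minimal sketch of timing-gated functional execution (hypothetical API).
#include <cstdint>

struct FunctionalSim {                  // stand-in for the Simics side
    void executeInstruction(int cpu) { /* functional result computed here */ }
};

class TimingModel {
public:
    // True when this processor's next instruction may complete this cycle
    // (e.g., it is not stalled on an outstanding cache miss).
    bool readyToAdvance(int cpu) { return true; /* placeholder */ }
    void tick() { /* advance caches, network, etc. by one cycle */ }
};

void simulate(FunctionalSim& simics, TimingModel& timing,
              int ncpus, uint64_t cycles) {
    for (uint64_t c = 0; c < cycles; ++c) {
        for (int cpu = 0; cpu < ncpus; ++cpu) {
            // Timing decides *when* an instruction executes; the functional
            // simulator decides *what* its result is. A stalled processor
            // simply does not advance, so races such as lock acquisition
            // order are resolved by the timing model, unlike in a
            // trace-driven simulator.
            if (timing.readyToAdvance(cpu))
                simics.executeInstruction(cpu);
        }
        timing.tick();
    }
}
```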
In contrast, trace-driven simulation fails to capture these important effects. In the limit, if our timing simulator were 100% functionally correct, it would always agree with the functional simulator, making the functional simulation redundant. Such an approach allows for "correctness tuning" during simulator development.

Design Goals. As the GEMS simulation system has primarily been used to study cache-coherent shared memory systems (both on-chip and off-chip) and related issues, those aspects of GEMS release 1.0 are the most detailed and the most flexible. For example, we model the transient states of cache coherence protocols in great detail. However, the tools have a more approximate timing model for the interconnection network, a simplified DRAM subsystem, and a simple I/O timing model. Although we do include a detailed model of a modern dynamically-scheduled processor, our goal was to provide a more realistic driver for evaluating the memory system. Therefore, our processor model may lack some details, and it may lack flexibility that would be appropriate for certain detailed microarchitectural experiments.

Availability. The first release of GEMS is available at http://www.cs.wisc.edu/gems/. GEMS is open-source software and is licensed under the GNU GPL [9]. However, GEMS relies on Virtutech's Simics, a commercial product, for full-system functional simulation. At this time, Virtutech provides evaluation licenses for academic users at no charge. More information about Simics can be found at http://www.virtutech.com/.

The remainder of this paper provides an overview of GEMS (Section 2) and then describes the two main pieces of GEMS: the multiprocessor memory system timing simulator Ruby (Section 3), which includes the SLICC domain-specific language for specifying cache-coherence protocols and systems, and the detailed microarchitectural processor timing model Opal (Section 4). Section 5 discusses some constraints and caveats of GEMS, and we conclude in Section 6.

2 GEMS Overview

The heart of GEMS is the Ruby memory system simulator. As illustrated in Figure 1, GEMS provides multiple drivers that can serve as a source of memory operation requests to Ruby:

1) Random tester module: The simplest driver of Ruby is a random testing module used to stress test the corner cases of the memory system. It uses false sharing and action/check pairs to detect many possible memory system and coherence errors and race conditions [25]. Several features are available in Ruby to help debug the modeled system, including deadlock detection and protocol tracing.

2) Micro-benchmark module: This driver supports various micro-benchmarks through a common interface. The module can be used for basic timing verification, as well as for detailed performance analysis of specific conditions (e.g., lock contention or widely-shared data).

3) Simics: This driver uses Simics' functional simulator to approximate a simple in-order processor with no pipeline stalls. Simics passes all load, store, and instruction fetch requests to Ruby, which performs the first-level cache access to determine if the operation hits or misses in the primary cache. On a hit, Simics continues executing instructions, switching between processors in a multiple-processor setting. On a miss, Ruby stalls Simics' request from the issuing processor and then simulates the cache miss. Each processor can have only a single miss outstanding, but contention and other timing effects among the processors will determine when the request completes. By controlling the timing of when Simics advances, Ruby determines the timing-dependent functional simulation in Simics (e.g., which processor next acquires a memory block).

4) Opal: This driver models a dynamically-scheduled SPARC v9 processor and uses Simics to verify its functional correctness. Opal (previously known as TFSim [18]) is described in more detail in Section 4.

The first two drivers are part of a stand-alone executable that is independent of Simics or any actual simulated program. In addition, Ruby is specifically designed to support additional drivers (beyond the four mentioned above) using a well-defined interface.
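As an illustration of what such a driver interface might look like, consider the following C++ sketch. The names (RubyDriver, issueRequest, requestCompleted) are hypothetical and simplified relative to the actual GEMS source.

```cpp
// Hypothetical sketch of a Ruby driver interface. Each driver issues
// memory requests to Ruby and is called back when a simulated miss
// completes; the names and signatures here are invented for exposition.
#include <cstdint>

enum class RequestType { Load, Store, InstructionFetch };

struct MemoryRequest {
    uint64_t    physicalAddress;
    RequestType type;
    int         processor;
};

class RubyDriver {
public:
    virtual ~RubyDriver() = default;
    // Ruby invokes this once a previously issued miss has been simulated.
    // The Simics driver, for example, would unstall the issuing processor
    // here; the random tester would perform its "check" action.
    virtual void requestCompleted(const MemoryRequest& req) = 0;
};

class RubyMemorySystem {
public:
    // Returns true on a first-level cache hit, letting the driver proceed
    // immediately. On a miss, Ruby simulates the access and later calls
    // driver->requestCompleted(); with the Simics driver, each processor
    // has at most one such miss outstanding.
    bool issueRequest(const MemoryRequest& req, RubyDriver* driver) {
        (void)req; (void)driver;
        return true;  // placeholder: the real code performs the L1 lookup
    }
};
```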
GEMS' modular design provides significant simulator configuration flexibility. For example, our memory system simulator is independent of our out-of-order processor simulator. A researcher can obtain preliminary results for a memory system enhancement using the simple in-order processor model provided by Simics, which runs much faster than Opal. Based on these preliminary results, the researcher can then determine whether the accompanying processor enhancement should be implemented and simulated in the detailed out-of-order simulator.

GEMS also provides flexibility in specifying many different cache coherence protocols that can be simulated by our timing simulator. We separated the protocol-dependent details from the protocol-independent system components and mechanisms. To facilitate specifying different protocols and systems, we proposed and implemented the protocol specification language SLICC (Section 3.2). In the next two sections, we describe our two main simulation modules: Ruby and Opal.

3 Multiprocessor Memory System (Ruby)

Ruby is a timing simulator of a multiprocessor memory system that models caches, cache controllers, the system interconnect, memory controllers, and banks of main memory. Ruby combines hard-coded timing simulation for components that are largely independent of the cache coherence protocol (e.g., the interconnection network) with the ability to specify the protocol-dependent components (e.g., cache controllers) in a domain-specific language called SLICC (Specification Language for Implementing Cache Coherence).

Implementation. Ruby is implemented in C++ and uses a queue-driven event model to simulate timing. Components communicate using message buffers of varying latency and bandwidth, and the component at the receiving end of a buffer is scheduled to wake up when the next message will be available to be read from that buffer. Although many buffers are used in a strictly first-in-first-out (FIFO) manner, the buffers are not restricted to FIFO-only behavior. The simulation proceeds by invoking the wakeup method for the next scheduled event on the event queue. Conceptually, the simulation would be identical if all components were woken up each cycle; thus, the event queue can be thought of as an optimization to avoid unnecessary processing during each cycle.
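A minimal sketch of this queue-driven style, with invented names rather than Ruby's actual classes, might look like the following.

```cpp
// Sketch of a queue-driven event model: a consumer wakes only when the
// next message in one of its buffers becomes visible, instead of polling
// every cycle. Names are illustrative, not Ruby's actual classes.
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

using Cycle = uint64_t;

struct Consumer {                        // e.g., a cache controller
    virtual ~Consumer() = default;
    virtual void wakeup(Cycle now) = 0;
};

struct Event {
    Cycle     time;
    Consumer* consumer;
    bool operator>(const Event& o) const { return time > o.time; }
};

class EventQueue {
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> pq_;
    Cycle now_ = 0;
public:
    Cycle now() const { return now_; }
    // A message buffer with latency L schedules its consumer L cycles ahead.
    void schedule(Consumer* c, Cycle latency) { pq_.push({now_ + latency, c}); }
    // Jump from event to event; conceptually identical to waking every
    // component on every cycle, but without the wasted work.
    void run() {
        while (!pq_.empty()) {
            Event e = pq_.top();
            pq_.pop();
            now_ = e.time;
            e.consumer->wakeup(now_);
        }
    }
};
```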
3.1 Protocol-Independent Components

The protocol-independent components of Ruby include the interconnection network, cache arrays, memory arrays, message buffers, and assorted glue logic. The only two components that merit discussion are the caches and the interconnection network.

Caches. Ruby models a hierarchy of caches associated with each single processor, as well as shared caches used in chip multiprocessors (CMPs) and other hierarchical coherence systems. Cache characteristics, such as size and associativity, are configuration parameters.

Interconnection Network. The interconnection network is the unified communication substrate used to communicate between cache and memory controllers. A single monolithic interconnection network model is used to simulate all communication, even between controllers that would be on the same chip in a simulated CMP system. As such, all intra-chip and inter-chip communication is handled as part of the interconnect, although each individual link can have different latency and bandwidth parameters. This design provides sufficient flexibility to simulate the timing of almost any kind of system.

A controller communicates by sending messages to other controllers. Ruby's interconnection network models the timing of the messages as they traverse the system. Messages sent to multiple destinations (such as a broadcast) use traffic-efficient multicast-based routing to fan out the request to the various destinations.

Ruby models a point-to-point switched interconnection network that can be configured similarly to interconnection networks in current high-end multiprocessor systems, including both directory-based and snooping-based systems. For simulating systems based on directory protocols, Ruby release 1.0 supports three non-ordered networks: a simplified fully-connected point-to-point network, a dynamically-routed 2D-torus interconnect inspired by the Alpha 21364 [19], and a flexible user-defined network interface. The first two networks are automatically generated using certain simulator configuration parameters, while the third creates an arbitrary network by reading a user-defined configuration file.
This file-specified network can create complicated networks such as a CMP-DNUCA network [5].

For snooping-based systems, Ruby has two totally-ordered networks: a crossbar network and a hierarchical switch network. Both ordered networks use a hierarchy of one or more switches to create a total order of coherence requests at the network's root. This total order is enough for many broadcast-based snooping protocols, but it requires that the specific cache-coherence protocol does not rely on stronger timing properties provided by the more traditional bus-based interconnect. In addition, mechanisms for synchronous snoop response combining and other aspects of some bus-based protocols are not supported.

The topology of the interconnect is specified by a set of links between switches, and the actual routing tables are re-calculated for each execution, allowing additional topologies to be easily added to the system. The interconnect models virtual networks for different types and classes of messages, and it allows dynamic routing to be enabled or disabled on a per-virtual-network basis (to provide point-to-point order if required). Each link of the interconnect has limited bandwidth, but the interconnect does not model the details of the physical or link-level layer. By default, infinite network buffering is assumed at the switches, but Ruby also supports finite buffering in certain networks. We believe that Ruby's interconnect model is sufficient for coherence protocol and memory hierarchy research, but a more detailed model of the interconnection network may need to be integrated for research focusing on low-level interconnection network issues.
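For illustration, the following sketch shows one way routing tables could be recomputed from a link list at startup using an all-pairs shortest-path pass. The data layout and algorithm choice are assumptions for exposition, not necessarily what Ruby does.

```cpp
// Illustrative sketch: rebuild per-switch next-hop tables from a link
// list, so a new topology needs only a new set of links.
#include <limits>
#include <vector>

struct Link { int from, to, latency; };

// next[s][d] = the neighbor to which switch s forwards traffic bound for d.
std::vector<std::vector<int>>
buildRoutingTables(int numSwitches, const std::vector<Link>& links) {
    const int INF = std::numeric_limits<int>::max() / 2;
    std::vector<std::vector<int>> dist(numSwitches,
                                       std::vector<int>(numSwitches, INF));
    std::vector<std::vector<int>> next(numSwitches,
                                       std::vector<int>(numSwitches, -1));
    for (int s = 0; s < numSwitches; ++s) { dist[s][s] = 0; next[s][s] = s; }
    for (const Link& l : links) {           // assume bidirectional links
        dist[l.from][l.to] = l.latency;  next[l.from][l.to] = l.to;
        dist[l.to][l.from] = l.latency;  next[l.to][l.from] = l.from;
    }
    for (int k = 0; k < numSwitches; ++k)   // Floyd-Warshall relaxation
        for (int i = 0; i < numSwitches; ++i)
            for (int j = 0; j < numSwitches; ++j)
                if (dist[i][k] + dist[k][j] < dist[i][j]) {
                    dist[i][j] = dist[i][k] + dist[k][j];
                    next[i][j] = next[i][k];  // route through k's first hop
                }
    return next;
}
```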
3.2 Specification Language for Implementing Cache Coherence (SLICC)

One of our main motivations for creating GEMS was to evaluate different coherence protocols and coherence-based prediction. As such, flexibility in specifying cache coherence protocols was essential. Building upon our earlier work on table-driven specification of coherence protocols [23], we created SLICC (Specification Language for Implementing Cache Coherence), a domain-specific language that codifies our table-driven methodology.

SLICC is based upon the idea of specifying individual controller state machines that represent system components such as cache controllers and directory controllers. Each controller is conceptually a per-memory-block state machine, which includes:

• States: the set of possible states for each cache block,
• Events: conditions that trigger state transitions, such as message arrivals,
• Transitions: the cross-product of states and events (based on the state and event, a transition performs an atomic sequence of actions and changes the block to a new state), and
• Actions: the specific operations performed during a transition.

For example, the SLICC code might specify a "Shared" state that allows read-only access for a block in a cache. When an external invalidation message arrives at the cache for a block in Shared, it triggers an "Invalidation" event, which causes a "Shared x Invalidation" transition to occur. This transition specifies that the block should change to the "Invalid" state. Before a transition can begin, all required resources must be available; this check prevents mid-transition blocking. Such resource checking includes available cache frames, in-flight transaction buffers, space in an outgoing message queue, etc. This resource check allows the controller to always complete the entire sequence of actions associated with a transition without blocking.

SLICC is syntactically similar to C or C++, but it is intentionally limited to constrain the specification to hardware-like structures. For example, no local variables or loops are allowed in the language. We also added special language constructs for inserting messages into buffers and reading information from the next message in a buffer.
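The following hand-written C++ sketch illustrates the table-driven controller style that SLICC codifies, using the "Shared x Invalidation" transition and resource check described above. SLICC itself generates considerably richer code; all names here are illustrative.

```cpp
// Sketch of table-driven coherence controller logic in the SLICC style.
#include <cassert>

enum class State { Invalid, Shared, Modified };
enum class Event { Load, Store, Invalidation };

struct CacheEntry { State state = State::Invalid; };

struct Resources {
    bool outgoingQueueHasSpace;   // room to send the invalidation ack
};

// Returns false if the transition must wait for resources. Because the
// check happens before any action runs, a transition never blocks midway:
// either all of its actions complete atomically, or none are started.
bool doTransition(CacheEntry& entry, Event event, const Resources& res) {
    if (entry.state == State::Shared && event == Event::Invalidation) {
        if (!res.outgoingQueueHasSpace)
            return false;               // retry when resources free up
        // Actions: send the acknowledgment, give up read permission.
        // sendInvalidationAck();       // elided
        entry.state = State::Invalid;   // next state from the transition table
        return true;
    }
    // ... the remaining (state, event) pairs of the cross-product ...
    assert(false && "transition not specified");
    return false;
}
```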
Each controller specified in SLICC consists of protocol-independent components, such as cache memories and directories, as well as all fields in caches, per-block directory information at the home node, in-flight transaction buffers, messages, and any coherence predictors. These fields consist of primitive types such as addresses, bit-fields, sets, counters, and user-specified enumerations. Messages contain a message type tag (for statistics gathering) and a size field (for simulating contention on the interconnection network). A controller uses these messages to communicate with other controllers. Messages travel along the intra-chip and inter-chip interconnection networks. When a message arrives at its destination, it generates a specific type of event determined by the input message control logic of the particular controller (also specified in SLICC).

SLICC allows for the specification of many types of invalidation-based cache coherence protocols and systems. As invalidation-based protocols are ubiquitous in current commercial systems, we constructed SLICC to perform all operations at cache-block granularity (configurable, but canonically 64 bytes). As such, the word-level granularity required for update-based protocols is currently not supported. SLICC is perhaps best suited for specifying directory-based protocols (e.g., the protocols used in the Stanford DASH [13] and the SGI Origin [12]) and other related protocols such as AMD's Opteron protocol [1, 10]. Although SLICC can be used to specify broadcast snooping protocols, SLICC assumes all protocols use an asynchronous point-to-point network, and not the simpler (but less scalable) synchronous system bus. The GEMS release 1.0 distribution contains SLICC specifications for an aggressive snooping protocol, a flat directory protocol, a protocol based on the AMD Opteron [1, 10], two hierarchical directory protocols suitable for CMP systems, and a Token Coherence protocol [16] for a hierarchical CMP system [17].

The SLICC compiler translates a SLICC specification into C++ code that links with the protocol-independent portions of the Ruby memory system simulator. In this way, Ruby and SLICC are tightly integrated, to the extent of being inseparable. In addition to generating code for Ruby, the SLICC language is intended to be used for a variety of purposes. First, the SLICC compiler generates HTML-based tables as documentation for each controller. This concise and continuously-updated documentation is helpful when developing and debugging protocols. Example SLICC code and corresponding HTML tables can be found online [15]. Second, the SLICC code has also served as the basis for translating protocols into a model-checkable format such as TLA+ [2, 11] or Murphi [8]. Although such efforts have thus far been manual translations, we are hopeful the process can be partially or fully automated in the future. Finally, we have restricted SLICC in ways (e.g., no loops) that we believe will allow automatic translation of a SLICC specification directly into a synthesizable hardware description language (such as VHDL or Verilog). Such efforts are future work.

3.3 Ruby's Release 1.0 Limitations

Most of the limitations in Ruby release 1.0 are specific to the implementation rather than the general framework. For example, Ruby release 1.0 supports only physically-indexed caches, although support for indexing the primary caches with virtual addresses could be added. Also, Ruby does not model the memory system traffic due to direct memory access (DMA) operations or memory-mapped I/O loads and stores. Instead of modeling these I/O operations, we simply count the number that occur. For our workloads, these operations are infrequent enough (compared to cache misses) to have negligible relative impact on our simulations. Researchers who wish to study more I/O-intensive workloads may find it necessary to model such effects.

4 Detailed Processor Model (Opal)

Although GEMS can use Simics' functional simulator as a driver that approximates a system with simple in-order processor cores, capturing the timing of today's dynamically-scheduled superscalar processors requires a more detailed timing model. GEMS includes Opal (also known as TFSim [18]) as a detailed timing model using the timing-first approach. Opal runs ahead of Simics' functional simulation by fetching, decoding, predicting branches, dynamically scheduling, executing instructions, and speculatively accessing the memory hierarchy. When Opal has determined that the time has come for an instruction to retire, it instructs the functional simulation of the corresponding Simics processor to advance one instruction. Opal then compares its processor state with that of Simics to ensure that it executed the instruction correctly. The vast majority of the time, Opal and Simics agree on the instruction execution; however, when an interrupt, an I/O operation, or a rare kernel-only instruction not implemented by Opal occurs, Opal will detect the discrepancy and recover as necessary.
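A minimal sketch of this retire-time check appears below. The interface names are invented for illustration; the real mechanism is described in the TFSim paper [18].

```cpp
// Sketch of Opal's retire-time check under the timing-first approach
// (hypothetical interface; see TFSim [18] for the real design).
struct ArchState {
    // PC, integer/FP registers, condition codes, ... (elided)
    bool operator==(const ArchState&) const = default;  // C++20
};

struct SimicsProcessor {                 // stand-in for the functional side
    void stepOneInstruction() {}
    ArchState architecturalState() const { return {}; }
};

class OpalCore {
public:
    ArchState retiredState() const { return {}; }      // Opal's own view
    void squashAndReloadFrom(const ArchState&) {}      // recovery path
    // Called when Opal decides an instruction is ready to retire.
    void retire(SimicsProcessor& simics) {
        simics.stepOneInstruction();     // functional execution is definitive
        if (!(retiredState() == simics.architecturalState())) {
            // Rare case: an interrupt, I/O operation, or unimplemented
            // kernel-only instruction. Adopt Simics' state and refetch.
            squashAndReloadFrom(simics.architecturalState());
        }
    }
};
```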
Appears in Computer Architecture News (CAN), September 2005 recover as necessary. The paper on TFSim/Opal 5 Constraints and Caveats [18] provides more discussion of the timing-first As with any complicated research tool, GEMS is philosophy, implementation, effectiveness, and not without its constraints and caveats. One caveat related work. of GEMS is that although individual components Features. Opal models a modern dynamically- and some entire system timing testing and sanity scheduled, superscalar, deeply-pipelined processor checking has been performed, a full end-to-end core. Opal is configured by default to use a two- validation of GEMS has not been performed. For level gshare branch predictor, MIPS R10000 style instance, through the random tester we have veri- register renaming, dynamic instruction issue, mul- fied Ruby coherently transfers cache blocks, how- tiple execution units, and a load/store queue to ever, we have not performed exhaustive timing allow for out-of-order memory operations and verification or verified that it strictly adheres to the memory bypassing. As Opal simulates the SPARC sequential consistency memory model. Neverthe- ISA, condition codes and other architectural state less, the relative performance comparisons gener- are renamed as necessary to allow for highly-con- ated by GEMS should suffice to provide insight into current execution. Because Opal runs ahead of the many types of proposed design enhancements. Simics functional processor, it models all wrong- Another caveat of GEMS is that we are unable to path effects of instructions that are not eventually distribute our commercial workload suite [3], as it retired. Opal implements an aggressive implemen- contains proprietary software that cannot be redis- tation of sequential consistency, allowing memory tributed. Although commercial workloads can be operations to occur out of order and detecting pos- set up under Simics, the lack of ready-made work- sible memory ordering violations as necessary. loads will increase the effort required to use GEMS Limitations. Opal is sufficient for modeling a pro- to generate timing results for commercial servers cessor that generates multiple outstanding misses and other multiprocessing systems. for the Ruby memory system. However, Opal’s microarchtectual model, although detailed, does 6 Conclusions not model all the most advanced features of some In this paper we presented Multifacet’s General modern processors (e.g., a trace cache and Execution-driven Multiprocessor Simulator advanced memory-dependence speculation predic- (GEMS) as a simulation toolset to evaluate multi- tors). Thus, the model may need to be extended processor architectures. By using the full-system and further validated in order to be used as a basis simulator Simics, we were able to decouple the for experiments that focus on the microarchitec- development of the GEMS timing model from ture of the system. Also, the first release of Opal ensuring the necessary functional correctness does not support hardware multithreading, but at required to run an unmodified operating system least one of our internal projects has preliminarily and commercial workloads. The multiprocessor added such support to Opal. Similar to most timing simulator Ruby allows for detailed simula- microarchitecture simulators, Opal is heavily tied tion of cache hierarchies. 
The domain-specific lan- to its target ISA (in this case, SPARC), and porting guage SLICC grants GEMS the flexibility to it to a different ISA, though possible, would not be implement different coherence protocols and sys- trivial. tems under a single simulation infrastructure. The Perhaps Opal’s biggest constraints deal with its processor timing simulator Opal can be used to dependence on the Simics functional execution simulate a dynamically-scheduled superscalar pro- mode. For instance, because Simics uses a flat cessor, and it relies on Simics for functional cor- memory image, Opal cannot execute a non- rectness. To enables others to more easily perform sequentially consistent execution. Also we devel- research on multiprocessors with commercial oped Opal to simulate systems based on the SPARC workloads, we have released GEMS under the ISA, and therefore it is limited to the specific TLB GNU GPL, and the first release is available at configurations supported by Simics and the simu- http://www.cs.wisc.edu/gems/. lated operating system. 7
7 Acknowledgments

We thank Ross Dickson, Pacia Harper, Carl Mauer, Manoj Plakal, and Luke Yen for their contributions to the GEMS infrastructure. We thank the University of Illinois for original beta testing. This work is supported in part by the National Science Foundation (CCR-0324878, EIA/CNS-0205286, and CCR-0105721) and donations from Compaq, IBM, Intel Corporation, and Sun Microsystems. We thank Amir Roth for suggesting the acronym SLICC. We thank Virtutech AB, the Wisconsin Condor group, and the Wisconsin Computer Systems Lab for their help and support.

References

[1] Ardsher Ahmed, Pat Conway, Bill Hughes, and Fred Weber. AMD Opteron Shared Memory MP Systems. In Proceedings of the 14th HotChips Symposium, August 2002.
[2] Homayoon Akhiani, Damien Doligez, Paul Harter, Leslie Lamport, Joshua Scheid, Mark Tuttle, and Yuan Yu. Cache Coherence Verification with TLA+. In FM'99: Formal Methods, Volume II, volume 1709 of Lecture Notes in Computer Science, page 1871. Springer Verlag, 1999.
[3] Alaa R. Alameldeen, Milo M. K. Martin, Carl J. Mauer, Kevin E. Moore, Min Xu, Daniel J. Sorin, Mark D. Hill, and David A. Wood. Simulating a $2M Commercial Server on a $2K PC. IEEE Computer, 36(2):50–57, February 2003.
[4] Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. IEEE Computer, 35(2):59–67, February 2002.
[5] Bradford M. Beckmann and David A. Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, December 2004.
[6] Nathan L. Binkert, Erik G. Hallnor, and Steven K. Reinhardt. Network-Oriented Full-System Simulation using M5. In Proceedings of the Sixth Workshop on Computer Architecture Evaluation Using Commercial Workloads, February 2003.
[7] Harold W. Cain, Kevin M. Lepak, Brandon A. Schwartz, and Mikko H. Lipasti. Precise and Accurate Processor Simulation. In Proceedings of the Fifth Workshop on Computer Architecture Evaluation Using Commercial Workloads, pages 13–22, February 2002.
[8] David L. Dill, Andreas J. Drexler, Alan J. Hu, and C. Han Yang. Protocol Verification as a Hardware Design Aid. In International Conference on Computer Design. IEEE, October 1992.
[9] Free Software Foundation. GNU General Public License (GPL). http://www.gnu.org/copyleft/gpl.html.
[10] Chetana N. Keltcher, Kevin J. McGrath, Ardsher Ahmed, and Pat Conway. The AMD Opteron Processor for Multiprocessor Servers. IEEE Micro, 23(2):66–76, March-April 2003.
[11] Leslie Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley, 2002.
[12] James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 241–251, June 1997.
[13] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 148–159, May 1990.
[14] Peter S. Magnusson et al. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50–58, February 2002.
[15] Milo M. K. Martin et al. Protocol Specifications and Tables for Four Comparable MOESI Coherence Protocols: Token Coherence, Snooping, Directory, and Hammer. http://www.cs.wisc.edu/multifacet/theses/milo_martin_phd/, 2003.
[16] Milo M. K. Martin, Mark D. Hill, and David A. Wood. Token Coherence: A New Framework for Shared-Memory Multiprocessors. IEEE Micro, 23(6), November/December 2003.
[17] Michael R. Marty, Jesse D. Bingham, Mark D. Hill, Alan J. Hu, Milo M. K. Martin, and David A. Wood. Improving Multiple-CMP Systems Using Token Coherence. In Proceedings of the Eleventh IEEE Symposium on High-Performance Computer Architecture, February 2005.
[18] Carl J. Mauer, Mark D. Hill, and David A. Wood. Full System Timing-First Simulation. In Proceedings of the 2002 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 108–116, June 2002.
[19] Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, and David Webb. The Alpha 21364 Network Architecture. In Proceedings of the 9th Hot Interconnects Symposium, August 2001.
[20] Mendel Rosenblum, Stephen A. Herrod, Emmett Witchel, and Anoop Gupta. Complete Computer System Simulation: The SimOS Approach. IEEE Parallel and Distributed Technology: Systems and Applications, 3(4):34–43, 1995.
[21] Lambert Schaelicke and Mike Parker. ML-RSIM Reference Manual. Technical Report 02-10, Department of Computer Science and Engineering, Univ. of Notre Dame, Notre Dame, IN, 2002.
[22] Jared Smolens, Brian Gold, Jangwoo Kim, Babak Falsafi, James C. Hoe, and Andreas G. Nowatzyk. Fingerprinting: Bounding the Soft-Error Detection Latency and Bandwidth. In Proceedings of the Eleventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 224–234, October 2004.
[23] Daniel J. Sorin, Manoj Plakal, Mark D. Hill, Anne E. Condon, Milo M. K. Martin, and David A. Wood. Specifying and Verifying a Broadcast and a Multicast Snooping Cache Coherence Protocol. IEEE Transactions on Parallel and Distributed Systems, 13(6):556–578, June 2002.
[24] Standard Performance Evaluation Corporation. SPEC Benchmarks. http://www.spec.org.
[25] David A. Wood, Garth A. Gibson, and Randy H. Katz. Verifying a Multiprocessor Cache Controller Using Random Test Generation. IEEE Design and Test of Computers, pages 13–25, August 1990.