VIRTUAL DIVIDE AND CONQUER SCAN TEST ARCHITECTURE FOR MULTI-CLOCK DOMAIN SOC
Virtual Divide and Conquer Scan Test Architecture for Multi-Clock Domain SoC

A Thesis Submitted for the Degree of Master of Science (Engineering) in the Faculty of Engineering

by Senthil Arasu T

Supercomputer Education and Research Center
Indian Institute of Science
Bangalore – 560 012
October 2006
© Senthil Arasu T, October 2006. All rights reserved.
... to Babu, Vai, Aylakka & Sangee
Abstract

In a modern SoC, there can be a number of different clock domains, as many as 20 in some communication-related ASICs. Scan testing of designs with multiple clock domains poses several problems. In a multi-clock domain design, in order to balance scan chains, scan elements clocked by different clocks are often connected to the same scan chain. The clock skew present on different clock trees makes it unsafe to pulse all the clocks simultaneously for shift and capture in a scan chain with clock mixing. To ensure safe shift operations, lockup latches are inserted at the clock domain crossings. Similarly, in order to capture correct data launched in one clock domain into a flop in another clock domain, lockup latches must be inserted in the functional path. This may affect the functional timing of the path or design.

Although lockup latches solve the problem of shifting data in a scan chain with clock mixing, capturing the response of the circuit under test in such a scan chain requires careful analysis from a timing perspective. A simple solution used in many designs is to capture in only one clock domain at a time. This is done by pulsing all the clocks during shift, but only one clock during capture. This results in wasted clock cycles. In this thesis, we leverage the wasted clock cycles to develop a new architecture that saves the test time and test power spent in the wasted cycles. We present a scan test architecture that uses "Virtual Divide and Conquer" (VDNC) to handle the multi-clock domain scan test problem with a reduction in test data volume and test power.
Acknowledgments

First and foremost, I would like to express my sincere gratitude to Prof. S.K. Nandy and Dr. C.P. Ravikumar for their active guidance and encouragement throughout the course of this research. This research would not have taken its current shape without their insightful comments, critical remarks and constant feedback.

I am extremely grateful to Texas Instruments for sponsoring this research. My colleagues in Texas Instruments have been a great source of encouragement and help. My special thanks are due to Dr. Ken Butler, Dr. Graham Hetherington, V R Devanathan, R Raghuraman, Phani Kumar, P Sundar, Narasimha Murthy, Sathya Kaginele and R Madhu for their continuous support and cooperation.

Senthil Arasu T
Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Testing of Modern Day SoC
  1.2 Scan Testing
    1.2.1 Load-Unload Procedure
    1.2.2 Capture Procedure
    1.2.3 At-Speed Testing
    1.2.4 Scan Test Pattern and Volume
  1.3 Test Application Time Trends
  1.4 Test Power Trends
  1.5 Multi-Clock Domain SoCs
    1.5.1 Multi-Clock Domain Scan Shift
    1.5.2 Multi-Clock Domain Scan Capture
  1.6 Organization of the Thesis
2 Background and Related Work
  2.1 Scan Architecture
    2.1.1 Vanilla Scan Architecture
    2.1.2 Divide and Conquer Scan Architecture
  2.2 Survey of Test Power Reduction Techniques
    2.2.1 Using Scan Architecture
    2.2.2 Using Selective Test Sets
  2.3 Survey of Test Time Reduction Techniques
  2.4 ATPG solution to Multi-Clock Domain
  2.5 Summary
3 Virtual Divide and Conquer
  3.1 Architecture
  3.2 Test Time Analysis
  3.3 Test Power Analysis
  3.4 Implementation
    3.4.1 Scan Router
    3.4.2 Scan Stitching
    3.4.3 Scan Router Integration
    3.4.4 Scan Test Pattern Generation
  3.5 Summary
4 Experiments and Results
  4.1 Experimental Setup
    4.1.1 Vanilla FullScan
    4.1.2 Divide and Conquer
    4.1.3 Virtual Divide and Conquer
  4.2 Results and Analysis
  4.3 Summary
5 Conclusions and Future Work
  5.1 Thesis Summary
  5.2 Future Work
Appendix I - Scan Router RTL
Bibliography
Publications from this Thesis
List of Tables

3.1 Test Modes and Scan Paths for VDNC Scan for XDSL chip
3.2 Test Time Calculation for VDNC Scan on XDSL chip
4.1 Experimental Results for Stuck-at patterns
4.2 Experimental Results for Transition fault patterns
List of Figures

1.1 Normal D Flop, a Mux Scan Flop, a Scan Chain
1.2 Test Quality Trade-offs (Source: ITRS 2005)
1.3 A Scan Chain with Clock Mixing
1.4 Arrival time of clkb earlier to clka
1.5 Arrival time of clkb later than clka
1.6 Multi-Clock Domain Scan Shift Solution - Lockup Latches
1.7 Waveform of Multi-Clock Domain Scan Shift Solution
1.8 Multi-Clock Domain Scan Capture Solution - Single Domain Capture
2.1 Vanilla FullScan: Illustration
2.2 Vanilla FullScan Implementation on XDSL
2.3 Wasted Shift Cycles in the Single Clock Domain Capture Scheme
2.4 Divide and Conquer Scan: Illustration
2.5 One Divide and Conquer Scan Implementation on XDSL
2.6 Staggered Scan Capture in ATPG based solution
3.1 Virtual Divide and Conquer: Illustration
3.2 One Implementation of Virtual Divide and Conquer on XDSL
3.3 Virtual Divide and Conquer Scan Test Waveform
3.4 Scan Router Integration
3.5 VDNC Implementation Flow
4.1 Test Time Comparison
4.2 Daisy Mode Test Time Comparison
4.3 Test Power Comparison
4.4 VDNC Test Power for Stuck-at Faults
Chapter 1

Introduction

1.1 Testing of Modern Day SoC

Testing of modern-day VLSI systems has become expensive. In what he called the “Moore’s Law for Test,” Gelsinger [9] showed that the test cost per transistor has been steadily increasing since 1991 and will overtake the manufacturing cost per transistor by around 2010. Already, it is recognized that test cost is a significant portion of the total cost of a system-on-chip (SoC) due to the increasing logic and memory density in modern VLSI systems. Several techniques have been invented to reduce test application time: built-in self-test (BIST) for both logic and memories, and scan test compression for logic. While memory BIST is now a de facto standard for memories, scan testing continues to be the most popular form of Design-for-Testability (DFT) for on-chip logic in modern-day SoCs. Although logic BIST is expected to become an alternative in the future, scan test continues to enjoy popularity due to its relatively smaller area overhead, fewer timing closure challenges, the availability of scan test compression techniques, and the support available in EDA tools and automated flows.

When scan test was first proposed, it was practiced using a single scan chain that threaded all the sequential elements in the circuit, and was intended for circuits with a single clock. The biggest advantage of scan test was the ease of testability offered by the additional controllability and observability points. The main drawbacks came from
converting normal flip-flops to scan flip-flops, which resulted in increased area and performance overhead, and from the test application time, since each test vector must be scanned in through the scan chains. In order to address the test application time problem, the multiscan architecture was proposed: instead of a single scan chain holding all the scan flip-flops, there were multiple scan chains with the scannable flip-flops distributed among them, which resulted in shorter scan chains and hence reduced time to load a test pattern. The partial scan architecture was proposed to address the area and performance overheads due to scan flops. The early scan architectures were mainly intended for testing of stuck-at faults, but scan testing has since been extended to at-speed testing. Related techniques such as boundary scan have been invented to extend the idea of scan test to the system level. Using scan test for multi-million gate system-on-chip designs presents several challenges.

1.2 Scan Testing

In a full or partial scan design, the sequential elements are converted to scan flops and are stitched back to back along the Q → SD path to form a chain of scan flops referred to as the scan chain. Data can be shifted in at one end through the scanin port and shifted out at the other end through the scanout port. Scan testing for stuck-at faults like opens and shorts consists of two steps, viz. the load-unload procedure and the capture procedure.

1.2.1 Load-Unload Procedure

In the load-unload step, the state of the internal sequential elements required to excite a fault is loaded into the elements by shifting the test vector into the scan chain. As the test vector is loaded into the scan chain, the current state of the design gets scanned out. Thus in this step, a new state of the design is directly loaded into the sequential elements and the existing state is read directly as part of the unload operation.
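The overlap of load and unload described above can be illustrated with a small behavioural model. The following Python sketch is not part of the thesis flow; the function name `load_unload` and the ordering convention (index 0 nearest scan-in, last index nearest scan-out) are assumptions made only for illustration. It treats the scan chain as an ideal shift register with one bit shifted per clock.

```python
from collections import deque


def load_unload(chain_state, test_vector):
    """Shift test_vector into a scan chain, one bit per shift clock.

    Assumed convention for this sketch: chain_state[0] is the flop nearest
    the scanin port and chain_state[-1] is the flop nearest the scanout port.
    Returns (new_state, unloaded_bits).
    """
    if len(test_vector) != len(chain_state):
        raise ValueError("test vector must have one bit per scan flop")
    chain = deque(chain_state)
    unloaded = []
    for bit in test_vector:
        # The flop nearest scanout drives the scanout port on this cycle.
        unloaded.append(chain.pop())
        # The scanin bit enters the first flop; every other bit moves by one.
        chain.appendleft(bit)
    return list(chain), unloaded


if __name__ == "__main__":
    old_state = [1, 1, 0]          # hypothetical state left by a previous capture
    vector = [0, 0, 1]             # hypothetical test vector, first-shifted bit first
    new_state, unloaded = load_unload(old_state, vector)
    print("loaded state :", new_state)
    print("unloaded bits:", unloaded)
```

In this model the first bit shifted in ends up in the flop nearest scanout, and the old state appears at scanout starting with the flop nearest scanout, so one pass of n shift cycles performs both the load and the unload.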
During the load-unload procedure, the scan-enable pin is asserted high and the value in
the SD is captured into the flop at the arrival of a clock edge in an edge-triggered mux scan flop.

Figure 1.1: Normal D Flop, a Mux Scan Flop, a Scan Chain

1.2.2 Capture Procedure

The load-unload procedure initializes the design into a state which can excite a fault. The fault is excited and the response is captured into another flop in the capture procedure. During the capture cycle, the scan-enable pin is asserted low and the value in the D is captured into the flop at the arrival of a clock edge in an edge-triggered mux scan flop.

1.2.3 At-Speed Testing

Scan testing can be used to perform at-speed testing to detect timing-related faults in the logic. At-speed defects manifest themselves as slow-to-rise or slow-to-fall faults. To detect a speed-related fault, two patterns are required. First pattern would launch the
transition and the second pattern would capture. There are two popular methodologies based on the mechanism employed to launch the transition, viz. launch-off-shift and launch-off-capture.

Launch-off-shift: For a scan chain of length n, the chain is initialized in the first n − 1 cycles. In the last shift cycle, the transition is launched and the scan enable is asserted low. This is followed by an at-speed capture pulse to capture the response [14, 15, 16]. The advantage of this method is that the transition can be initialized using the shift path, which results in fewer patterns because of higher controllability. The disadvantage is that timing on the scan enable must be closed at the frequency of scan capture. From a timing closure perspective this involves considerable effort, since a clock tree must be built for the scan enable path.

Launch-off-capture: The scan chain is initialized with the n shift cycles and the scan enable is asserted low. The transition is launched using a separate clock pulse, apart from the shift cycles, which is immediately followed by an at-speed capture pulse [14, 15, 16]. The advantage of this scheme is that timing on the scan enable need not be closed at the frequency of scan capture. The disadvantage is an increase in pattern volume because of sequential ATPG.

1.2.4 Scan Test Pattern and Volume

A scan test pattern consists of a state to be loaded, a capture cycle, and a state to be unloaded.

P0 --- Capture --- xx
P1 --- Capture --- P0'
P2 --- Capture --- P1'

That is, when P1 is scanned in, P0', which is the result of the previous capture, is scanned out. Since each scan pattern contains a state to be loaded for all the chain elements, the size of a pattern is equal to the number of elements in the scan chain. The number of
scan patterns generated depends on the design, and the number of patterns required grows exponentially with test coverage. For higher coverage, the number of patterns required is huge. Pattern generation becomes increasingly difficult as the tool tries to detect the Random Pattern Resistant Faults (RPRF). The pattern volume for at-speed faults is usually a few orders of magnitude larger than the stuck-at scan test volume because of the nature of sensitization.

1.3 Test Application Time Trends

According to ITRS 2005 [29]:

“Unconstrained, increased digital logic die complexity and content drives proportional increases on the test data volume (number and width of vectors). Unconstrained, this additional test data volume drives increases in test capital and operational costs by requiring additional vector memory depth per digital channel of the test tools (ATEs) and by increasing test application time per DUT . . . To keep the test quality level of embedded cores against deep sub-micron defects, such as resistive opens or small delay defects, additional delay test is required. Therefore, the number of test patterns increases inevitably and imperatively, and test application time may be as much as thirty times larger than today in 2010. This means various techniques for the significant reduction of test time, such as test pattern compaction, scan architecture improvement, and scan shift speed acceleration, are strongly needed.”

Design and defect complexity cause the test data volume required to ship quality chips to explode. The need for lower Defective Parts Per Million (DPPM) is dictated by market needs. For example, a device used in the automotive industry mandates a DPPM close to zero. With device complexity and process uncertainties, producing a device with a DPPM close to zero requires an exponential increase in test volume and hence in test application time and cost.
Figure 1.2 shows the exponential increase in test cost as the target DPPM is lowered. Quality is clearly traded off against test cost: as Figure 1.2 also shows, the cost of shipping defective parts increases with DPPM.
Figure 1.2: Test Quality Trade-offs (Source: ITRS 2005)

1.4 Test Power Trends

Power consumed during test has been shown to be twice as high as power consumed during normal mode [35]. The reasons for the increased power consumption during test mode could be:

• Increased switching activity during test mode compared to normal operation of the chip.

• Parallel testing of modules or sub-chips to reduce test application time, which results in excessive energy and power dissipation.

• Test logic designed to reduce test complexity is idle during normal mode but is intensively used in test mode.

• Successive functional input vectors are correlated, in contrast to test vectors, where the correlation is low since they are random patterns [30].

With growing test cost concerns, multi-site testing is being seriously explored and deployed to reduce test cost by a factor proportional to the number of sites.
Multi-site testing involves testing more than one device concurrently on the same load-board using the same Automatic Test Equipment (ATE). As also emphasized by ITRS 2005 [29], a limiting factor for multi-site testing is the number of power supplies and the power supply range. With increased power dissipation during scan testing, application of multi-site testing becomes a challenge. A commonly pursued solution to this problem is to reduce the scan shift frequency so that average power is reduced.

$$\text{Power} = \frac{1}{2} C V^{2} f \alpha \qquad (1.1)$$

Equation 1.1 gives a first-order estimate of test power, where C and V are the total load capacitance and the operating voltage respectively, f is the frequency of operation, and α is the activity factor. Lowering f results in power reduction. In the case of scan shift test power, f is the scan shift speed, and reducing the scan shift speed results in lower scan shift power during manufacturing test. Though this solves the test power problem, it results in increased test application time, which defeats the primary motivation for multi-site testing. In order to successfully apply multi-site testing and reduce test application time, DFT methodologies to reduce test power become a necessity.

Test power can be broadly classified into two measures, viz. average test power and instantaneous peak power.

Average Test Power: Average test power is the total distribution of power over a time period. The ratio of energy to test time gives the average power. Elevated average power increases the thermal load that must be vented away from the device under test to prevent structural damage to silicon, bonding wires or package [10]. Test power during scan shift is equated to average test power.

Instantaneous Peak Power: Instantaneous power is the value of power consumed at any given instant. Usually, it is defined as the power consumed right after the application of the synchronizing clock signal [10].
Elevated instantaneous power might overload the power distribution system and cause IR drop problems, which manifest themselves as timing-related faults. Peak power during scan capture is equated with instantaneous peak
test power.

1.5 Multi-Clock Domain SoCs

SoC designs with multiple clock domains are now common for several reasons:

• IP reuse

• Ability to reduce functional power by turning off clocks for unused blocks at any time

• Difficulties in clock tree synthesis for a single clock domain in a large SoC

In a modern SoC, there can be a number of different clock domains, as many as twenty in some communication-related ASICs, where each subsystem operates in a different clock domain, depending on its functionality and interface with other components on the board. Scan testing of SoCs with multiple clocks poses several challenges in terms of clocking and timing.

Balanced scan chains are recommended for optimal test time. In an SoC with multiple clock domains, balancing scan elements across chains requires that scan elements from different clock domains be mixed on the same chain. Having scan elements from different clock domains on the same chain raises the following issues.

1.5.1 Multi-Clock Domain Scan Shift

Problem: Consider two flops FA and FB, clocked by clka and clkb respectively, as shown in Figure 1.3. There is a cloud of combinational logic from the Q of FA to the D of FB. Since the clock sources are different, the insertion delays on these clock trees, and hence the arrival times of the rising clock edges at these flops, are different. There are two cases, depending on the difference in the clock arrival times.

• Case 1: As illustrated in Figure 1.4, if, due to the insertion delays, clkb arrives before clka, then the correct data is latched, i.e., the old data is moved from FA to FB and the new data is latched on FA when clka arrives.
• Case 2: As illustrated in Figure 1.5, if, again due to the insertion delays, clkb arrives after clka, then the new data from FA is latched onto FB, which is wrong data.

Figure 1.3: A Scan Chain with Clock Mixing

Solution: A naive solution, yet one that is difficult to implement, would be to balance all the clocks. Matching clock insertion delays can be difficult because, owing to functional requirements, a clock domain may carry an insertion delay exception that is much smaller than the insertion delay required for scan shift. For such clock domains to meet the worst-case insertion delay during test mode, clock buffers would have to be added on the clock path, which costs area and power.

Another solution to the case illustrated in Figure 1.5 is to insert a lockup latch at the clock domain crossing, on the Q → SD path, to delay the data from FA being available at FB by half a cycle, as shown in Figure 1.6. A lockup latch is a normal latch clocked by the same clock as the launch flop in the scan chain, i.e., in the case illustrated, by clka; it opens at the negative edge of the clock and closes at the positive edge. So, when the new data is latched onto FA, it is not available to FB until the negative edge of the clock, thereby delaying the data by half a cycle. With the data delayed by half a cycle, if clkb arrives after clka, early data will not be captured as long as the arrival time of clkb lags that of clka by less than half a cycle.
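The two cases and the effect of the lockup latch can be summarized as a simple check on clock arrival times. The Python sketch below is a deliberately idealized model and not a substitute for static timing analysis: it assumes zero clock-to-Q delay, zero setup and hold times, a 50% duty cycle shift clock, and a wire-only Q → SD path. The function name `shift_is_hold_safe` and its parameters are hypothetical.

```python
def shift_is_hold_safe(t_clka, t_clkb, shift_period, lockup_latch=False):
    """Return True if FB captures the OLD value of FA during scan shift.

    t_clka, t_clkb: arrival times of the rising shift edge at FA and FB.
    shift_period:   shift clock period, in the same time unit.
    Idealized assumptions: zero clock-to-Q, setup and hold times, and a
    50% duty cycle, so a lockup latch on clka releases FA's new data half
    a period after clka arrives.
    """
    skew = t_clkb - t_clka  # positive when clkb arrives after clka
    if not lockup_latch:
        # Case 1 (clkb not later than clka) is safe; Case 2 is not.
        return skew <= 0
    # With the latch, the new data is held back until the falling edge of clka.
    return skew < shift_period / 2


if __name__ == "__main__":
    period = 10.0  # hypothetical shift period in ns
    for t_a, t_b in [(2.0, 1.0), (1.0, 2.0), (1.0, 7.0)]:
        print(
            f"clka={t_a} clkb={t_b}:",
            "no latch ->", shift_is_hold_safe(t_a, t_b, period),
            "| latch ->", shift_is_hold_safe(t_a, t_b, period, lockup_latch=True),
        )
```

Under these assumptions, the latch widens the tolerated arrival-time difference from zero to just under half the shift period. A difference of half a period or more remains unsafe.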
Figure 1.4: Arrival time of clkb earlier to clka

When the difference between the insertion delays of the clock domains crossing each other is greater than half a cycle, several timing closure tricks can be applied. One of them is to invert the clock of the domain that has the larger insertion delay, so that the arrival time of the edge differs by no more than half the shift clock period. Though the problem of capturing early data (a hold violation) in the scan shift path is eased by the lockup latch, meeting timing is made tougher because paths that had one full shift clock period now have, due to the lockup latches, only half a shift clock period.

Figure 1.7 shows a waveform for the design with scan shift fixed. It shows Case 2, where clkb arrives after clka: the data a is available only at the negative edge of the clock, and the right data is latched as long as clkb arrives no later than half a cycle after clka.

1.5.2 Multi-Clock Domain Scan Capture

Problem: The problem described in the previous section for scan shift concerns the Q → SD path, and the same can happen on the Q → D path. For scan shift, the problem was addressed with ease by adding a lockup latch in the Q → SD path, since the Q → SD path is purely a test logic path and making it a half-cycle path did not
affect the functionality of the chip. But since Q → D is a functional path, adding lockup latches could have serious repercussions on functional timing, because the path would have just half a cycle to meet timing. Consider a case where data from more than one clock domain is captured in a flop: adding lockup latches would be cumbersome and, in the worst case, would end up adding one latch per flop. Hence the solution of adding a lockup latch in the Q → D path is infeasible.

Figure 1.5: Arrival time of clkb later than clka

Figure 1.6: Multi-Clock Domain Scan Shift Solution - Lockup Latches

Though the method of balancing all clocks so that the insertion delays are matched during test
mode would work, it amounts to considerable over-design and is also difficult from a timing closure point of view.

Figure 1.7: Waveform of Multi-Clock Domain Scan Shift Solution

Solution: A solution that has been adopted by the industry, in the absence of any other efficient solution, is to capture in one domain at a time, as shown in Figure 1.8. In the illustration, where there are two clock domains clka and clkb, test data is shifted into both domains, i.e., into flops FA and FB, but during capture the capture pulse is applied to only one domain, either clka or clkb, depending on whether the targeted faults are in the clka domain or the clkb domain respectively. Scan shift happens without any data error because of the presence of the lockup latch at the crossing of clock domains.

Figure 1.8: Multi-Clock Domain Scan Capture Solution - Single Domain Capture

It must be noticed that the test application time spent in shifting through the flops
of the domain in which the test response is not captured is wasted. The test power dissipated during this operation is also higher overall because of the amount of logic toggling simultaneously. The main idea behind this thesis is to remove or reduce the wasted test cycles used to shift data into domains in which data is not captured. This results in a reduction in test application time and test power.

1.6 Organization of the Thesis

The thesis is organized as follows. Chapter 2 reviews past work on popular scan architectures (vanilla and DNC), test power reduction techniques, test application time reduction techniques, and ATPG-based solutions to the multiple clock domain scan test problem. Chapter 3 provides details of the proposed virtual divide and conquer architecture, with detailed analysis of test time and test power. Chapter 4 describes the experiments and the results obtained to substantiate the theory. The thesis closes with conclusions and scope for future work in Chapter 5.
Chapter 2

Background and Related Work

In the previous chapter, we pointed out the challenges in VLSI testing, viz. reduction of test application time and test power. We also pointed out the challenge of testing multi-clock domain SoCs.

In this chapter, popular industry scan architectures such as vanilla scan and divide-and-conquer scan are described in detail, along with their test time and test power analysis. Existing literature on test time and test power reduction techniques is also reviewed.

2.1 Scan Architecture

The scan architecture is a key element that decides the testability of the chip and hence determines the test cost and the quality of the device in the market. Many scan architectures are implemented on industry designs. The most popular are the vanilla scan architecture, appreciated for its simplicity, and the divide-and-conquer (DNC) scan architecture, used for implementing hierarchical design-for-testability. In this section, both vanilla and DNC scan are discussed in detail, along with an analysis of test time and test power on a toy design, XDSL.
2.1.1 Vanilla Scan Architecture

Architecture

Consider an XDSL SoC with two sub-chips and four blocks: ARM, EMIF, CPU and DDR (Figure 2.2). Assume that there are four clock domains, one corresponding to each block, as shown in the figure. The four clock domains may exist for functional reasons, or because logically isolated blocks are given different clocks to manage clock skew and ease timing closure. The Automatic Test Equipment (ATE) limits the number of clocks it can supply, and its high-speed scan memory determines the number of scan chains supported. Suppose the tester permits k scan-input and k scan-output pins. The vanilla full-scan architecture will enforce that k scan chains are inserted in each of the four blocks and that these chains are concatenated at the top level, as illustrated in Figure 2.2. For any particular fault type (stuck-at, transition-delay, etc.), scan test involves a single “test mode”, where the scan chains are loaded through the k scan inputs and the responses are unloaded through the k scan outputs [6].

In a bottom-up flow, there is limited opportunity to balance scan chains, and in any block the length of the longest chain may be much larger than the average chain length. To avoid this problem, we can consider an alternative where scan chains are balanced through “clock mixing”. For example, we can consider balancing the scan chains across the ARM and EMIF blocks, and across the CPU and DDR blocks. However, this solution poses some implementation challenges. The clock trees for the blocks are routed separately to keep the skew in the individual trees under control. But due to clock skew between the ARM and EMIF sub-blocks, it is not safe to pulse both the ARM clock and the EMIF clock simultaneously during shift or capture operations. The same is true for the CPU and DDR blocks. Lockup latches will be needed at every clock domain crossing in the ARM+EMIF sub-chip [28].
Similarly, in order to correctly capture data launched in one clock domain into a flop in another clock domain, lockup latches must be inserted in the functional path. This impacts the functional timing of the design. While lockup latches address the problem of shifting data in a scan chain with clock mixing, capturing the response of the circuit under test in such a scan chain
requires careful timing analysis. A simplified solution used in many designs is to capture in only one clock domain at a time. This is implemented by pulsing all the clocks during shift, but only one clock during capture. This is illustrated for the XDSL chip in Figure 2.3.

Figure 2.1: Vanilla FullScan: Illustration

Test Time Analysis

Assuming that the lengths of the longest chains in the blocks are $l_{ARM}$, $l_{EMIF}$, $l_{DDR}$ and $l_{CPU}$, the scan test application time will be proportional to $l_{ARM} + l_{EMIF} + l_{DDR} + l_{CPU}$. Each pattern time consists of a shift time and a capture time. The capture time is negligible compared to the shift time. Since all the elements in the scan chain are scanned for every pattern, let the numbers of capture patterns required for the blocks be $P_{EMIF}$, $P_{ARM}$, $P_{CPU}$ and $P_{DDR}$. Then the test time is given by

$$T_{vanilla} = (P_{EMIF} + P_{ARM} + P_{CPU} + P_{DDR}) \times (l_{ARM} + l_{EMIF} + l_{DDR} + l_{CPU}) \qquad (2.1)$$
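As a worked illustration of Equation 2.1, the following Python sketch evaluates the vanilla test time for the four XDSL blocks and, in addition, counts the shift cycles spent in blocks whose response is not being captured, computed here as $P_j \times (L - l_j)$ summed over blocks, where $L$ is the total chain length. All chain lengths and pattern counts below are made-up numbers chosen only to make the arithmetic easy to follow. They are not measurements from the XDSL design, and the function names are hypothetical.

```python
def vanilla_test_time(patterns, chain_len):
    """Shift cycles for vanilla scan: every pattern shifts through all blocks.

    patterns:  dict block -> number of capture patterns for that block
    chain_len: dict block -> length of the longest scan chain in that block
    Capture cycles are ignored, as they are negligible next to shift cycles.
    """
    total_len = sum(chain_len.values())
    return sum(patterns.values()) * total_len


def wasted_shift_cycles(patterns, chain_len):
    """Shift cycles spent in blocks other than the one being captured."""
    total_len = sum(chain_len.values())
    return sum(p * (total_len - chain_len[block]) for block, p in patterns.items())


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    chain_len = {"ARM": 100, "EMIF": 50, "DDR": 150, "CPU": 200}
    patterns = {"ARM": 10, "EMIF": 20, "DDR": 30, "CPU": 40}

    total = vanilla_test_time(patterns, chain_len)
    wasted = wasted_shift_cycles(patterns, chain_len)
    print("vanilla shift cycles:", total)
    print("wasted shift cycles :", wasted)
    print("wasted fraction     :", wasted / total)
```

With these illustrative numbers, the total is 100 patterns × 500 flops = 50000 shift cycles, of which 35500 are spent shifting through blocks that are not being captured.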
Figure 2.2: Vanilla FullScan Implementation on XDSL

Since the scan chain length can be written as

$$L_{vanilla} = l_{ARM} + l_{EMIF} + l_{DDR} + l_{CPU} \qquad (2.2)$$

the above equation can be written as

$$T_{vanilla} = \sum_{j \in \text{blocks}} P_j \times L_{vanilla} \qquad (2.3)$$

Since the fault is excited and captured in only one domain at a time, the flops of the other domains are just shifted through, and the test time wasted in shifting through these flops can be written as

$$T_{wasted} = \sum_{j \in \text{blocks}} P_j \times (L_{vanilla} - l_j) \qquad (2.4)$$

Test Power Analysis

If the numbers of flops in the sub-blocks are $n_{ARM}$, $n_{EMIF}$, $n_{DDR}$ and $n_{CPU}$, the activity factor during shift is directly proportional to $n_{ARM} + n_{EMIF} + n_{DDR} + n_{CPU}$, since all the flops toggle at the same time during scan load/unload.
$$\alpha \propto n_{ARM} + n_{EMIF} + n_{DDR} + n_{CPU} \tag{2.5}$$

$$TestPower_{vanilla} = \frac{1}{2} C V^2 f \times \alpha \tag{2.6}$$

Figure 2.3: Wasted Shift Cycles in the Single Clock Domain Capture Scheme

Since shifting through the flops that do not capture could be avoided, the wasted test power can be written as

$$TestPower_{wasted} = TestPower_{vanilla} - \max_{i \in blocks} \left| \frac{1}{2} C V^2 f \cdot n_i \right| \tag{2.7}$$

2.1.2 Divide and Conquer Scan Architecture

Architecture

Divide-and-Conquer (DNC) scan is a hierarchical scan test method [1, 5, 24] whose essential idea is to provide a scan access mechanism that allows scan testing of individual portions of the SoC. For example, if there are n sub-chips in the SoC, DNC scan will use the available bandwidth of k scan pins to route k scan chains through each of the sub-chips. A scan multiplexer logic (also known as a scan router) is used to permit testing
of one sub-chip at a time. Since sub-chips may interact through glue logic, it is also necessary to permit a daisy-chain mode, which is essentially the vanilla fullscan mode. In the daisy-chain mode, the target fault list includes all faults that are not already caught in the n individual scan test modes. Since only portions of the SoC are tested at a time, the sequential elements in the remaining parts of the chip can be initialized to constant values to reduce test power [24, 5].

Figure 2.5 illustrates how DNC can be applied to the XDSL SoC. The chip is partitioned into 2 sub-chips, namely (ARM + EMIF) and (CPU + DDR). If the chip has k scanin and k scanout ports, we must insert (balanced) scan chains in the two sub-chips and connect the scan chains to a scan router as indicated in the figure. In test mode 0, the (ARM + EMIF) sub-chip is scan-tested through the scan path

scanin → ARM → EMIF → scanout

and the flops in the DDR and CPU sub-blocks are initialized to constants. In test mode 1, the (CPU + DDR) sub-chip is scan-tested through the scan path

scanin → DDR → CPU → scanout

and the flops in the ARM and EMIF sub-blocks are initialized to constants. In mode 2, the daisy-chain mode, the scan path is

scanin → ARM → EMIF → DDR → CPU → scanout

Note that DNC fits well into a physical design hierarchy; in a hierarchical physical design flow, it is natural to partition the chip into logical partitions such as (ARM + EMIF) and (CPU + DDR) so as to balance the gate counts across partitions. Another
consideration during physical partitioning is the connectivity between the blocks, so that an effective floorplan can be derived. This partitioning strategy also works well from the viewpoint of DNC scan, since balancing the gate counts tends to balance the number of faults across the partitions, leading to balanced ATPG run-times on the individual partitions. Similarly, keeping physically related modules together leads to a smaller target fault set for the daisy-chain mode.

As shown in [24], the DNC scan architecture allows the ATPG runs for the partitions to proceed concurrently. The only dependence in the ATPG flow is that the daisy-chain mode ATPG cannot be started before the ATPG runs for the partitions are complete, since the daisy-chain mode targets the faults that are not detected during the test group ATPGs and therefore depends on the test group fault lists. Consequently, the speedup of a distributed implementation of the ATPG is adversely impacted by a long daisy-chain run. Equation 2.8 gives the speedup S obtained for the XDSL chip. In this equation, $T_M$ refers to the test application time for module M.

$$S_{XDSL} = \frac{T_{XDSL}}{\max(T_{ARM+EMIF}, T_{CPU+DDR}) + T_{DAISY}} \tag{2.8}$$

Test Time Analysis

The DNC scan architecture for the XDSL example includes 3 modes. In mode 0, which corresponds to the (ARM + EMIF) test group, the length of the scan chain can be taken to be $L_0 = l_{ARM} + l_{EMIF}$. The number of test cycles in this mode of operation is given by

$$T_0 = L_0 \times (P_{ARM} + P_{EMIF}) \tag{2.9}$$

Similarly, in mode 1, the test cycle count is given by

$$T_1 = L_1 \times (P_{DDR} + P_{CPU}) \tag{2.10}$$
where $L_1 = l_{DDR} + l_{CPU}$. During test application, when a pattern is shifted into EMIF, there are $l_{ARM}$ wasted cycles in ARM (and vice versa). Based on this argument, and writing $L_X$ for the chain length $l_X$ of block X, the number of wasted cycles is given by

$$T_{wasted} = P_{ARM} \cdot L_{EMIF} + P_{EMIF} \cdot L_{ARM} + P_{DDR} \cdot L_{CPU} + P_{CPU} \cdot L_{DDR} \tag{2.11}$$

Figure 2.4: Divide and Conquer Scan: Illustration

In general, consider a DNC scan architecture with m + 1 modes, including the daisy-chain mode. During mode j, 0 ≤ j < m, let the subset of blocks that are tested be given by $B_{j,1}, B_{j,2}, \cdots, B_{j,M_j}$. Here, $M_j$ is the number of modules in test group (or test mode) j. The effective length of the scan chain in mode j is given by
$$L_j = \sum_{k=1}^{M_j} L_{B_{j,k}} \tag{2.12}$$

Figure 2.5: One Divide and Conquer Scan Implementation on XDSL

The wasted cycles are given by

$$T_{wasted} = \sum_{j=0}^{m-1} \sum_{k=1}^{M_j} P_{j,k} \cdot (L_j - L_{B_{j,k}}) \tag{2.13}$$

Test Power Analysis

The dynamic power dissipated in a CMOS circuit is proportional to the amount of node toggling. During scan test, there are two major sources of toggling that contribute to power, namely scan shifting (which causes switching in the entire circuit) and clocking (which causes toggling in the clock tree). The only way to minimize the power in clock trees is through clock gating. However, scan shift power can be reduced through techniques such as pattern optimization and reordering [26]. During shift operation, the test power is proportional to the length of the scan chain, since longer scan chains typically result in more switching activity. In a clock-mixed scan chain, the power dissipated in shifting data into scan flops that do not capture data is wasted. For example, consider the DNC scan architecture for XDSL, with three modes
of operation, namely (ARM + EMIF), (CPU + DDR), and the daisy-chain mode. During mode 0, ARM + EMIF are tested through a scan chain of length $L_{ARM} + L_{EMIF}$. Although all the flops in the chain toggle during pattern shift, for a pattern that captures in the ARM clock domain only the ARM core flops capture data. The toggling activity in EMIF is therefore wasted. Similarly, the power dissipated in the EMIF clock tree during scan shift is also wasted. Power savings can be achieved by toggling and clocking only the flops that take part in the capture cycle. DNC places constant values on scan chains that are not part of the test mode. This reduces shift test power to an extent [5]. However, clock power reduction is not addressed in this architecture.

$$TestPower = \max_{j} \left| TestPower_{M_j} \right| \tag{2.14}$$

where

$$TestPower_{M_j} = \frac{1}{2} C V^2 f \times \alpha_{M_j} + \beta_{M_j} \tag{2.15}$$

Here $\alpha_{M_j}$ is the activity factor during mode $M_j$, which involves the flops present in the scan chain used during that mode, and $\beta_{M_j}$ is the constant power dissipated in the other parts of the circuit, which are not scanned but are clocked and therefore dissipate power.

2.2 Survey of Test Power Reduction Techniques

The motivation behind test power reduction is twofold. The first is to ensure a safe, non-self-destructive test. The second is that reducing the power consumed during test allows the test speed to be increased, so that the same testing can be performed in less time within the power limits. The test power reduction techniques published so far can be classified into two main categories, namely architecture-based and test-data-based techniques.
2.2.1 Using Scan Architecture

Reducing test power by improving the scan architecture is one approach that results in guaranteed test power reduction [7, 2, 3]. The advantage of these methods is that they are independent of the vendor tools, i.e., the scan stitching and test pattern generation tools. The disadvantage is that they must be implemented as an architecture in the design; they therefore require upfront planning and implementation effort, and the applicability of the architecture must be analyzed for each design. Based on the design, a particular architecture must be selected to yield the best results in terms of test power reduction. Average test power, which in the general case is directly related to the scan shift power, can be reduced by controlling either the clock to the scan elements during scan shift or the data to the scan elements.

Controlling Test Clock:

Several techniques have been proposed for reducing scan shift test power by controlling the clock. Saxena and Bonhomme [4, 28] have described techniques that use clock gating to reduce the power dissipated in the clock tree during scan shift. Sankaralingam et al. [27] proposed a technique using programmable scan chain disable. The technique disables the clock to scan chains and, combined with test pattern ordering, achieves a greater reduction in test power. The clock tree has been shown to consume a significant portion of the total test power [22]. However, clock gating is not easy to implement due to the physical design and timing closure challenges it poses [8].

Controlling Scan Chain:

Other techniques use scan chain transformation, which alters the scan chain by selectively enabling and disabling selected scan chains during shift mode. Whetzel [31] uses an approach that transforms the conventional scan architecture into a scan path having a desired number of selectable, separate scan paths.
Lee [17] proposed an interleaving scan architecture based on adding delay buffers among the scan chains. In these techniques, the amount of logic that toggles during the test is controlled and hence the test power is reduced. The control can be exercised either through the scan chain or through the clock to
the scan elements. With these test power reduction techniques, the test time either improved marginally or remained the same. A few techniques had area overhead, and others, such as clock gating, had physical design and timing closure restrictions.

2.2.2 Using Selective Test Sets

By selecting test sets, either by reordering the generated test patterns or by compaction techniques [11, 25, 26, 34], the test power can be reduced. Test pattern reordering reduces the switching activity in the circuit during scan shifting. There has been active research on reducing test power in this way because the approach is design independent, though the patterns themselves are design dependent. Test patterns are generated and ordered to detect as many faults as possible with the minimum number of patterns. Typically, production test patterns are ordered to place the patterns that detect the maximum number of faults at the top of the set, since early rejection of a bad part reduces the total testing time. We may lose this advantage when manufacturing test patterns are reordered for power minimization.

2.3 Survey of Test Time Reduction Techniques

Since test time translates directly to test cost and hence to product cost, research to reduce test time while still guaranteeing quality has been conducted for a few decades now. Compared to functional test, scan test provided a quantitative means of ensuring quality. However, scan test suffers from large pattern volume and hence long test time. Current test time research is focused on improving scan architectures to reduce test time while maintaining the quality goals.

Logic BIST:

Logic BIST solutions are often the best in terms of test application time, since patterns are generated on-chip and compacted on-chip. Logic BIST is also attractive from a field test perspective: after the chip is part of a system, the logic BIST can
be activated in the field and failures over time can be identified, which is infeasible with other test solutions that require an ATE. However, logic BIST comes with area overhead, physical design challenges, and functional timing closure challenges [12]. Since the input pattern is generated completely from an on-chip LFSR, all applied patterns are random, and a category of faults, "Random Pattern Resistant Faults" (RPRF), either cannot be detected or requires a large number of patterns to detect.

Scan Compression:

Several solutions are available for reducing the scan data volume and scan test application time. These solutions use compression techniques [23, 32]: compressed data is accepted, decompressed on-chip, and loaded, after which the response is compressed and unloaded. By compressing and decompressing the scan data, the volume of data applied to the chip is reduced, and hence so is the test time. In these solutions, the input pattern is not generated purely from the LFSR; instead, a seed is generated externally and fed to the LFSR, which then generates the pattern used to excite the fault. Using external algorithms, RPRFs are targeted by loading a seed into the LFSR that in turn generates a pattern that excites the RPRF.

Illinois Scan Architecture:

Another popular architecture is the Illinois Scan Architecture, where the same input is fed to multiple scan chains and the test response is compressed and scanned out. The difference compared to the other techniques is that there is no decompression logic in the input path. This is advantageous because any pattern generated by an external algorithm can be applied directly to a single chain. The drawback is that the input pattern applies only to a single chain, and all other chains receive copies of the first chain's input.
The common disadvantage of all of the above schemes is the compressor logic. The compressor logic, which compresses the test response for later comparison with the fault-free circuit signature, can suffer from aliasing errors. Due to an aliasing error, a fault can be masked and hence a bad device could be categorized as a good device. There has
been extensive research in this field to devise compression methods that are free of aliasing errors [33, 20].

Most of the research and publications in the field of test time reduction have focused on overcoming the limitation on the number of tester channels available for scan shifting. Techniques like Illinois Scan, Mentor Graphics TestKompress™, and Synopsys DBIST™ increase the number of virtual tester channels by having a larger number of internal scan chains, with on-chip circuitry to decompress the scan inputs and compress the scan outputs. There was not much focus on test power because, when there is a power issue, the tests can still be run at a slower speed.

2.4 ATPG solution to Multi-Clock Domain

Makar [19] summarizes the scan-test related problems in SoCs with multiple clock domains. ATPG vendors are beginning to address these problems [13, 18]. Custom clocking procedures are supported by some ATPG tools, where the ATPG algorithm is enhanced to generate patterns by pulsing more than one clock when it is safe to do so. This requires analysis of the clock domains to ensure that they are not interacting. Clearly, the effectiveness of custom clocking depends on the number of non-interacting clocks; designs with many non-interacting clock domains are expected to benefit from custom clocking through a smaller number of test patterns.

Figure 2.6: Staggered Scan Capture in ATPG based solution
Pulsing the capture clocks in a staggered fashion for interacting clock domains results in sequential ATPG for the second pulse, because the first pulse changes the contents of the sequential elements. Sequential ATPG poses challenges in terms of ATPG run-times and pattern volume. The techniques discussed in [13, 18] did not report any reduction in test power.

2.5 Summary

In this chapter, we reviewed the Vanilla Scan and Divide and Conquer Scan architectures. We also briefly surveyed test power reduction techniques and test application time reduction techniques, and described the ATPG solution to the multi-clock-domain scan test problem. In the next chapter, we describe in detail the proposed scan test architecture, Virtual Divide and Conquer Scan.
Chapter 3

Virtual Divide and Conquer

In the previous chapter, we discussed the existing scan test architectures that are related to the proposed scan architecture. A survey of the techniques used for test time and test power reduction was also presented.

In this chapter, we propose a new scan test architecture, the Virtual Divide and Conquer Scan Architecture, and analyze its test time and test power benefits theoretically.

3.1 Architecture

The Divide-and-Conquer (DNC) scan architecture provides a definite advantage in terms of test power and enables a distributed hierarchical ATPG flow when compared to vanilla full scan. When it is applied to designs with multiple clock domains, however, it runs into the same problems as the vanilla fullscan architecture. Since the partitioning strategy is based on balancing gate counts to balance ATPG run-times [24], it becomes necessary to group together blocks operating in different clock domains. As a result, clock mixing in scan chains becomes inevitable, bringing with it all the problems associated with clock mixing. The DNC architecture also uses the policy of activating a single clock domain during the capture cycle [24]; therefore, the problem of wasted clock cycles persists (Figure 2.3). The DNC scan architecture is extended to alleviate the two problems
mentioned above.

Figure 3.1: Virtual Divide and Conquer: Illustration

In Virtual Divide and Conquer (VDNC) scan, the design is partitioned into test groups based on clock domain information. Since the partition may not preserve hierarchical boundaries, it is referred to as virtual partitioning. A test group in VDNC consists of scan chains that are clocked by a single clock, or by clock domains of the same frequency that are independent of each other. Two clock domains are considered independent if there exists no path between them or if all the paths between them are false paths. Test patterns are generated for each test group separately. Since there is only one clock per test group, shift and capture are completely safe on all flops in the scan chains. Hence all flops scanned with test data are also used to capture new data. A simple illustration of VDNC is shown in Figure 3.1. One implementation of VDNC scan for the XDSL example is shown in Figure 3.2. Since the ARM core and the CPU work at 100 MHz and are independent
from the clock analysis, it is possible to group them together. We will therefore have the following test modes (Table 3.1).

Mode           | Frequency                   | Scan Path                                | Comments
---------------+-----------------------------+------------------------------------------+---------------------------------------
$VDNC_{100}$   | 100 MHz                     | scanin → ARM → CPU → scanout             | Faults in ARM and CPU
$VDNC_{200}$   | 200 MHz                     | scanin → DDR → scanout                   | Faults in DDR
$VDNC_{75}$    | 75 MHz                      | scanin → EMIF → scanout                  | Faults in EMIF
$VDNC_{daisy}$ | One clock at-a-time capture | scanin → ARM → CPU → DDR → EMIF → scanout | Daisy mode; inter clock domain faults

Table 3.1: Test Modes and Scan Paths for VDNC Scan for XDSL chip

Figure 3.2: One Implementation of Virtual Divide and Conquer on XDSL

The XDSL example may not fully bring out the nuances of the VDNC architecture. For example, assume that the CPU block has two sub-blocks with two different clock domains, say CPU-1 (100 MHz) and CPU-2 (75 MHz). VDNC will then group (ARM+CPU-1), (EMIF+CPU-2), and (DDR). In general, a design hierarchy can be represented as

$$Design = B_1 \cdot B_{1,1} + B_1 \cdot B_{1,2} + B_2 \cdot B_{2,1} \cdot B_{2,1,1} + \cdots \tag{3.1}$$

Here, $B_1, B_2, \cdots$ represent the blocks at the first level of hierarchy, $B_{1,1}, B_{1,2}, \cdots$
represent the blocks at the second level of hierarchy under $B_1$, and so on. Suppose that we add clock domain information to the leaf-level blocks that work on a single clock, by attaching the clock information to the name of the block. For example, if $B_{1,1}$ works at 100 MHz, we can write it as $B^{100}_{1,1}$. Note that VDNC may combine $B^{75}_{1,1,3}$, $B^{75}_{2,1,1,3}$, and $B^{75}_{3,1}$ provided there is no skew in the clocks reaching these blocks. We shall use the notation $B^{75}_{1,1,3} + B^{75}_{2,1,1,3} + B^{75}_{3,1}$ to indicate this test mode.

Figure 3.3 illustrates the test time benefit offered by VDNC. Here, we shall assume that there are three test modes, $VDNC_{100}$, $VDNC_{200}$, and $VDNC_{75}$, other than the daisy-chain mode. Note that the pattern is shifted into the ARM chain and the response capture happens immediately thereafter. This eliminates the wasted cycles that are inevitable in a clock-mixing solution.

Figure 3.3: Virtual Divide and Conquer Scan Test Waveform

In the next section, we quantify the number of test cycles saved in the VDNC scheme.
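The grouping rule described in this section can be sketched in Python as follows. This is only an illustrative sketch, not the flow used in the thesis: the function name `vdnc_test_groups` and the `interacting` set (standing in for the result of clock-domain analysis) are hypothetical, clock skew between same-frequency blocks is assumed to be absent, and the block frequencies are those of the CPU-1/CPU-2 example above.

```python
from collections import defaultdict

# Leaf blocks and their clock frequency in MHz (CPU-1/CPU-2 example from the text).
blocks = {"ARM": 100, "EMIF": 75, "DDR": 200, "CPU-1": 100, "CPU-2": 75}

# Hypothetical output of clock-domain analysis: pairs of blocks connected by
# real (non-false) paths, which therefore must not share a test group.
interacting = set()  # e.g. {frozenset({"ARM", "CPU-1"})}


def vdnc_test_groups(blocks, interacting):
    """Greedily place each block in a group of the same frequency whose
    members are all independent of it; otherwise open a new group."""
    groups = defaultdict(list)  # frequency -> list of test groups
    for name, freq in blocks.items():
        for group in groups[freq]:
            if all(frozenset({name, other}) not in interacting for other in group):
                group.append(name)
                break
        else:
            # No compatible group exists at this frequency.
            groups[freq].append([name])
    return dict(groups)


groups = vdnc_test_groups(blocks, interacting)
print(groups)  # {100: [['ARM', 'CPU-1']], 75: [['EMIF', 'CPU-2']], 200: [['DDR']]}
```

With no interacting pairs, the sketch reproduces the (ARM+CPU-1), (EMIF+CPU-2), (DDR) grouping; declaring ARM and CPU-1 as interacting would split the 100 MHz blocks into two separate test groups.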
3.2 Test Time Analysis

Again, we use the XDSL example to understand the test time calculation for VDNC before generalizing it. Assume that VDNC scan for XDSL has three individual test modes, one for each clock frequency, and a daisy-chain mode. The three individual modes together cover the four clock domains, namely ARM, EMIF, DDR, and CPU. Since the daisy mode is used only to test inter-clock-domain faults, the number of patterns generated in this mode is quite small. Let $P_j$ be the number of patterns for mode j and $L_j$ be the length of the scan chain in mode j. Table 3.2 lists the test cycles consumed in the different modes. Also refer to Figure 3.3.

Mode j       | $P_j$         | $L_j$         | Test Cycles
-------------+---------------+---------------+-----------------------------
$VDNC_{100}$ | $P_{ARM+CPU}$ | $L_{ARM+CPU}$ | $P_{ARM+CPU} \cdot L_{ARM+CPU}$
$VDNC_{75}$  | $P_{EMIF}$    | $L_{EMIF}$    | $P_{EMIF} \cdot L_{EMIF}$
$VDNC_{200}$ | $P_{DDR}$     | $L_{DDR}$     | $P_{DDR} \cdot L_{DDR}$

Table 3.2: Test Time Calculation for VDNC Scan on XDSL chip

As we saw in the previous section, the VDNC scan technique may combine into the same test mode j several blocks at possibly different levels of hierarchy, based on clock domain information. Let $L_j$ be the total length of the scan chain in mode j; this is the sum of the chain lengths of the blocks that are assigned to mode j. Let $P_j$ be the number of patterns in mode j. Then the total number of test cycles for the test modes 0, 1, 2, ···, m − 1 is given by $\sum_{j=0}^{m-1} P_j \cdot L_j$. The reduction in the number of test cycles in VDNC scan is equal to the number of wasted cycles in DNC scan. Compared to VDNC, the number of cycles wasted in DNC is

$$T_{wasted} = \sum_{j=0}^{m-1} L_j \cdot (P_T - P_j) \tag{3.2}$$

where $P_T$ is the total pattern volume, i.e., $P_T = \sum_{i=0}^{m-1} P_i$. In the case of XDSL it is equal to
$$T_{wasted} = L_1 \cdot (P_2 + P_3) + L_2 \cdot (P_1 + P_3) + L_3 \cdot (P_1 + P_2)$$

When the same is written in terms of the total scan chain length $L = \sum_{j=0}^{m-1} L_j$,

$$T_{wasted} = \sum_{j=0}^{m-1} P_j \cdot (L - L_j) \tag{3.3}$$

3.3 Test Power Analysis

In the VDNC architecture, since test partitioning is based on clock domains, it is possible to reduce not only the scan shift power but also the clock tree power. Consider the VDNC scan architecture for XDSL with three individual test modes and the daisy-chain mode, as explained in the previous section.

Scan Shift Power:

Consider the $VDNC_{200}$ test mode. Scan shifting only impacts the flops in the DDR core. The clock to the remaining blocks is not pulsed, which eliminates the scan shift power as well as the clock tree power in these blocks. Further, since the length of the scan chain in the $VDNC_{200}$ test mode is smaller than the value $L_{ARM+EMIF+DDR+CPU}$ for vanilla scan, a reduction by a factor of $L_{ARM+EMIF+DDR+CPU} / L_{DDR}$ in scan shift power can be expected.

Clock Tree Power:

In addition, the clocks to the EMIF, ARM, and CPU cores are not pulsed at the chip level, so the clock tree power consumption due to these clocks is avoided during the $VDNC_{200}$ mode of operation. This reduction in clock power does not require clock gating. As a result, the physical design challenges associated with clock gating do not impact VDNC scan implementation [8].

In order to analyze the test power reduction in VDNC, we shall use the following simple model. We neglect the capture power in our average power calculation, since it is negligible compared to the other components, namely scan shift power and clock tree power. Assume that there are n blocks in the chip, with $f_i$ flops in block i. Also, assume that there are m clock domains and that the number of buffers in clock domain j is $C_j$. If
vanilla scan were to be used, the total test power can be written as

$$P_{vanilla} = \sum_{j=1}^{m} \alpha \cdot C_j + \sum_{i=1}^{n} \beta \cdot f_i \tag{3.4}$$

where α and β are constants of proportionality. The first term corresponds to the power dissipation in the clock tree buffers and the second term corresponds to the power dissipation in the flops during scan shifting and capture. When DNC is used with p partitions, assume that the number of flops in partition k is $f_k$. The power dissipation in DNC is taken as that of the DNC mode in which the average power is the maximum; it is given by

$$P_{DNC} = \max_{0 \le k \le p} \left( \sum_{j=1}^{m} \alpha \cdot C_j + \beta \cdot f_k \right) \tag{3.5}$$

The test power dissipation for the DNC architecture can be expected to be lower than $P_{vanilla}$ because only a portion of the flops is involved in scan shifting at a time. It is also evident that, since the clock trees of the entire chip are active, there are no savings in clock tree power. In the Virtual Divide-and-Conquer technique, the test power for the mode in which the power dissipation is maximum can be written as

$$P_{VDNC} = \max_{0 \le k \le p} \left( \alpha \cdot C_k + \beta \cdot f_k \right) \tag{3.6}$$

where $C_k$ and $f_k$ are the numbers of clock buffers and flops that are active in test mode k. The test power dissipation for the VDNC architecture is lower than that of the Vanilla and DNC architectures because there is a reduction in both the clock tree power and the scan shift power. This results in an overall power reduction. It was shown by Pouya et al. [22] that during test, the clock tree power contributes to 99% of the total test power. Hence with VDNC, cutting down on the clock tree power results in a significant reduction in test power.

3.4 Implementation

Implementation of a VDNC scan architecture involves the following steps
• Scan Router Design
• Scan Stitching
• Scan Router Integration
• Scan Test Pattern Generation

3.4.1 Scan Router

The scan router is a block of muxing logic that allows the test modes to be programmed. The test mode selects the scan chains to be connected between the chip-level scan in and scan out pins. The scan router logic accepts the scan select bits, the chip-level scan in ports, and the test group scan out ports as inputs, and drives the chip-level scan out and the test group scan in ports through its outputs. Appendix I shows the scan router RTL used to collect the experimental data. It is written in Verilog and assumes the design has 8 scan chains, but this can be configured in the RTL to a different number depending on the limitations of the tester.

3.4.2 Scan Stitching

In order to stitch a balanced set of scan chains per clock domain, the scan stitching tool is run multiple times, defining one clock at a time and stitching chains for that particular clock domain. Most scan stitching tools operate by tracing the defined clocks to identify the scan elements to be stitched as part of the chain.

3.4.3 Scan Router Integration

The scan router RTL is synthesized and integrated with the design core, which has the scan chains stitched per clock domain. Figure 3.4 shows the scan router RTL integrated with the core.
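The routing behaviour of the scan router described in Section 3.4.1 can be sketched as a small behavioural model in Python. This is not the Verilog RTL of Appendix I: the mode encoding (0 to 2 select one test group, 3 selects the daisy chain), the group names, and the choice to tie the scan inputs of unselected groups to 0 are all assumptions made for this illustration, and only one scan chain (one bit per port) is modelled.

```python
# Hypothetical test groups for XDSL: ARM+CPU (100 MHz), DDR (200 MHz), EMIF (75 MHz).
GROUPS = ["G100", "G200", "G75"]
DAISY = 3  # assumed mode encoding: 0, 1, 2 select a single group; 3 is the daisy chain


def scan_router(mode, chip_scan_in, group_scan_out):
    """Combinational routing for one scan chain.

    Returns (group_scan_in, chip_scan_out). Scan inputs of unselected
    groups are tied to 0 (an assumption of this sketch).
    """
    group_scan_in = {g: 0 for g in GROUPS}
    if mode == DAISY:
        # scanin -> G100 -> G200 -> G75 -> scanout
        source = chip_scan_in
        for g in GROUPS:
            group_scan_in[g] = source
            source = group_scan_out[g]
        return group_scan_in, source
    selected = GROUPS[mode]
    group_scan_in[selected] = chip_scan_in
    return group_scan_in, group_scan_out[selected]


# Mode 1 connects only the DDR group between the chip-level scan pins.
print(scan_router(1, 1, {"G100": 0, "G200": 1, "G75": 0}))
# ({'G100': 0, 'G200': 1, 'G75': 0}, 1)
```

A multi-chain router, such as the 8-chain configuration mentioned above, would apply the same selection to each chain in parallel.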