VIRTUAL DIVIDE AND CONQUER SCAN TEST ARCHITECTURE FOR MULTI-CLOCK DOMAIN SOC

Virtual Divide and Conquer Scan Test Architecture for
               Multi-Clock Domain SoC

                             A Thesis
                 Submitted For the Degree of
                Master of Science (Engineering)
                 in the Faculty of Engineering

                                 by

                       Senthil Arasu T

            Supercomputer Education and Research Center
                     Indian Institute of Science
                      BANGALORE – 560 012

                         OCTOBER 2006
© Senthil Arasu T
OCTOBER 2006
All rights reserved
... to Babu, Vai, Aylakka & Sangee
Abstract

In a modern SoC, there can be a number of different clock domains, as many as 20 in some
communication-related ASICs.
   Scan testing of designs with multiple clock domains poses several problems.
   In a multi-clock domain design, in order to balance scan chains, scan elements clocked
by different clocks are often connected to the same scan chain. The clock skew present
on different clock trees makes it unsafe to pulse all the clocks simultaneously for shift
and capture in a scan chain with clock mixing. To ensure safe shift operations, lockup
latches are inserted at the clock domain crossings. Similarly, in order to capture correct
data launched in one clock domain into a flop in another clock domain, lockup latches
must be inserted in the functional path. This may affect the functional timing of the
path/design. Although lockup latches solve the problem of shifting data in a scan chain
with clock mixing, capturing the response of the circuit under test in such a scan chain
requires careful analysis from a timing perspective. A simple solution often used in many
designs is to capture in only one clock domain at a time. This is done by pulsing all
the clocks during shift, but only one clock during capture. This results in wasted clock
cycles.
   In this thesis, we leverage the wasted clock cycles to develop a new architecture that
saves the test time and test power spent in the wasted cycles. We present a
scan test architecture that uses “Virtual Divide and Conquer” (VDNC) to handle the
multi-clock domain scan test problem with a reduction in test data volume and test power.

Acknowledgments

First and foremost, I would like to express my sincere gratitude to Prof. S.K. Nandy
and Dr. C.P. Ravikumar for their active guidance and encouragement throughout the
course of this research. This research would not have taken its current shape without
their insightful comments, critical remarks and constant feedback.

I am extremely grateful to Texas Instruments for sponsoring this research. My col-
leagues in Texas Instruments have been a great source of encouragement and help. My
special thanks are due to Dr. Ken Butler, Dr. Graham Hetherington, V R Devanathan,
R Raghuraman, Phani Kumar, P Sundar, Narasimha Murthy, Sathya Kaginele and R
Madhu for their continuous support and cooperation.

                                                                     Senthil Arasu T

Contents

Abstract

Acknowledgments

1 Introduction
  1.1 Testing of Modern Day SoC
  1.2 Scan Testing
       1.2.1 Load-Unload Procedure
       1.2.2 Capture Procedure
       1.2.3 At-Speed Testing
       1.2.4 Scan Test Pattern and Volume
  1.3 Test Application Time Trends
  1.4 Test Power Trends
  1.5 Multi-Clock Domain SoCs
       1.5.1 Multi-Clock Domain Scan Shift
       1.5.2 Multi-Clock Domain Scan Capture
  1.6 Organization of the Thesis

2 Background and Related Work
  2.1 Scan Architecture
       2.1.1 Vanilla Scan Architecture
       2.1.2 Divide and Conquer Scan Architecture
  2.2 Survey of Test Power Reduction Techniques
       2.2.1 Using Scan Architecture
       2.2.2 Using Selective Test Sets
  2.3 Survey of Test Time Reduction Techniques
  2.4 ATPG solution to Multi-Clock Domain
  2.5 Summary

3 Virtual Divide and Conquer
  3.1 Architecture
  3.2 Test Time Analysis
  3.3 Test Power Analysis
  3.4 Implementation
       3.4.1 Scan Router
       3.4.2 Scan Stitching
       3.4.3 Scan Router Integration
       3.4.4 Scan Test Pattern Generation
  3.5 Summary

4 Experiments and Results
  4.1 Experimental Setup
       4.1.1 Vanilla FullScan
       4.1.2 Divide and Conquer
       4.1.3 Virtual Divide and Conquer
  4.2 Results and Analysis
  4.3 Summary

5 Conclusions and Future Work
  5.1 Thesis Summary
  5.2 Future Work

Appendix I - Scan Router RTL

Bibliography

Publications from this Thesis
List of Tables

 3.1   Test Modes and Scan Paths for VDNC Scan for XDSL chip
 3.2   Test Time Calculation for VDNC Scan on XDSL chip

 4.1   Experimental Results for Stuck-at patterns
 4.2   Experimental Results for Transition fault patterns
List of Figures

 1.1   Normal D Flop, a Mux Scan Flop, a Scan Chain
 1.2   Test Quality Trade-offs (Source : ITRS 2005)
 1.3   A Scan Chain with Clock Mixing
 1.4   Arrival time of clkb earlier to clka
 1.5   Arrival time of clkb later than clka
 1.6   Multi-Clock Domain Scan Shift Solution - Lockup Latches
 1.7   Waveform of Multi-Clock Domain Scan Shift Solution
 1.8   Multi-Clock Domain Scan Capture Solution - Single Domain Capture

 2.1   Vanilla FullScan : Illustration
 2.2   Vanilla FullScan Implementation on XDSL
 2.3   Wasted Shift Cycles in the Single Clock Domain Capture Scheme
 2.4   Divide and Conquer Scan : Illustration
 2.5   One Divide and Conquer Scan Implementation on XDSL
 2.6   Staggered Scan Capture in ATPG based solution

 3.1   Virtual Divide and Conquer: Illustration
 3.2   One Implementation of Virtual Divide and Conquer on XDSL
 3.3   Virtual Divide and Conquer Scan Test Waveform
 3.4   Scan Router Integration
 3.5   VDNC Implementation Flow

 4.1   Test Time Comparison
 4.2   Daisy Mode Test Time Comparison
 4.3   Test Power Comparison
 4.4   VDNC Test Power for Stuckat Faults
Chapter 1

Introduction

1.1      Testing of Modern Day SoC
Testing of modern-day VLSI systems has become expensive. In what he called the
“Moore’s Law for Test,” Gelsinger [9] showed that the test cost per transistor has been
increasing steadily since 1991 and will overtake the manufacturing cost per transistor by
around 2010. Already, it is recognized that test cost is a significant portion of the total
cost of a system-on-chip (SoC) due to the increasing logic and memory density in modern
VLSI systems. Several techniques have been invented to reduce test application time:
built-in self-test (BIST) for both logic and memories, and scan test compression for logic.
While memory BIST is now a de facto standard for memories, scan testing continues to be
the most popular form of Design-for-Testability (DFT) for on-chip logic in modern-day
SoCs. Although logic BIST is expected to become an alternative in the future, scan test
continues to enjoy popularity due to its relatively small area overhead, fewer timing
closure challenges, the availability of scan test compression techniques, and the support
available in EDA tools and automated flows.
   When scan test was first proposed, it was practiced using a single scan chain
that threaded all the sequential elements in the circuit, and was intended for circuits
with a single clock. The biggest advantage of scan test was the ease of testability that
came with the additional controllability and observability points. The main drawbacks were
the increased area and performance overhead that resulted from converting normal flip-flops
to scan flip-flops, and the test application time, since each test vector must be shifted in
through the scan chain. In order to address the test application time problem, the multiscan
architecture was proposed: instead of a single scan chain carrying all the scan flip-flops,
the scannable flip-flops are distributed among multiple scan chains, which results in
shorter scan chains and hence reduced time to load a test pattern. The partial scan
architecture was proposed to address the area and performance overheads due to scan flops.
   The early scan architectures were mainly intended for testing of stuck-at faults, but
scan testing has since been extended to at-speed testing. Related techniques such as
boundary scan have been invented to extend the idea of scan test to the system level. Using
scan test for multi-million gate system-on-chip designs presents several challenges.

1.2      Scan Testing
In a full or partial scan design, the sequential elements are converted to scan flops and
are stitched back to back along the Q → SD path to form a chain of scan flops referred
to as the scan chain. Data is shifted into the chain at one end through the scan-in port and
shifted out at the other end through the scan-out port. Scan testing for stuck-at
faults, which model defects such as opens and shorts, consists of two steps, viz. the
load-unload procedure and the capture procedure.

1.2.1     Load-Unload Procedure

In the load-unload step, the state of the internal sequential elements required to excite a
fault is loaded into those elements by shifting the test vector into the scan chain. As the
test vector is loaded into the scan chain, the current state of the design is scanned out.
Thus, in this step, a new state of the design is loaded directly into the sequential
elements and the existing state is read directly as part of the unload operation.
   During the load-unload procedure, the scan-enable pin is asserted high, and in an
edge-triggered mux scan flop the value at SD is captured into the flop at the arrival of a
clock edge.

               Figure 1.1: Normal D Flop, a Mux Scan Flop, a Scan Chain

1.2.2        Capture Procedure

The load-unload procedure initializes the design into a state that can excite a fault. In
the capture procedure, the fault is excited and the response is captured into another flop.
During the capture cycle, the scan-enable pin is asserted low, and in an edge-triggered mux
scan flop the value at D is captured into the flop at the arrival of a clock edge.
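
   The load-unload and capture procedures described above can be illustrated with a small
behavioural model. This is only a sketch under stated assumptions: the chain is a Python
list in which index 0 is the flop next to the scan-in port, and the function names and the
`invert_all` combinational logic are hypothetical, introduced purely for illustration.

```python
from typing import Callable, List, Tuple


def scan_flop_next(d: int, sd: int, scan_enable: int) -> int:
    """Value a mux scan flop captures at a clock edge: SD if scan-enable is high, else D."""
    return sd if scan_enable else d


def shift(chain: List[int], scan_in: int) -> Tuple[List[int], int]:
    """One shift clock with scan-enable high; chain[-1] drives the scan-out port."""
    scan_out = chain[-1]
    sd_inputs = [scan_in] + chain[:-1]  # each SD is driven by the previous flop's Q
    new_chain = [scan_flop_next(d=0, sd=sd, scan_enable=1) for sd in sd_inputs]
    return new_chain, scan_out


def load_unload(chain: List[int], vector: List[int]) -> Tuple[List[int], List[int]]:
    """Shift a whole vector in; return (loaded state, state that was unloaded)."""
    unloaded = []
    for bit in reversed(vector):  # the bit meant for the last flop goes in first
        chain, out = shift(chain, bit)
        unloaded.append(out)
    return chain, unloaded[::-1]  # reorder so that index i corresponds to flop i


def capture(chain: List[int], logic: Callable[[List[int]], List[int]]) -> List[int]:
    """One capture clock with scan-enable low: every flop captures its D input."""
    d_inputs = logic(chain)
    return [scan_flop_next(d=d, sd=0, scan_enable=0) for d in d_inputs]


def invert_all(state: List[int]) -> List[int]:
    """Hypothetical combinational logic: each D is the inverse of the same flop's Q."""
    return [1 - q for q in state]


state, _ = load_unload([0, 0, 0], [1, 1, 0])             # load the first pattern
response = capture(state, invert_all)                     # capture the response
next_state, unloaded = load_unload(response, [1, 1, 1])   # load next pattern, unload response
```

Loading the next pattern returns the previous response in `unloaded`, which is the overlap
of load and unload that the text describes.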

1.2.3        At-Speed Testing

Scan testing can also be used to perform at-speed testing to detect timing-related faults in
the logic. An at-speed defect manifests itself as a slow-to-rise or a slow-to-fall fault.
To detect a speed-related fault, two patterns are required. The first pattern launches the
transition and the second pattern captures the response. There are two popular
methodologies, named after the mechanism used to launch the transition, viz.
launch-off-shift and launch-off-capture.
   Launch-off-shift : For a scan chain of length n, the chain is initialized in the first
n − 1 cycles. In the last shift cycle, the transition is launched and the scan enable
is asserted low. This is followed by an at-speed capture pulse to capture the response
[14, 15, 16]. The advantage of this method is that the transition can be initialized through
the shift path, which results in fewer patterns because of the higher controllability. The
disadvantage is that the scan enable must switch within one at-speed cycle, so its timing
must be closed at the frequency of scan capture. From a timing closure perspective this
involves considerable effort, since a clock tree must be built for the scan enable path.
   Launch-off-capture : The scan chain is initialized with the n shift cycles and the
scan enable is asserted low. The transition is then launched using a clock pulse separate
from the shift cycles, which is immediately followed by an at-speed capture pulse
[14, 15, 16]. The advantage of this scheme is that the scan enable need not be closed at
the frequency of scan capture. The disadvantage is an increase in pattern volume because
sequential ATPG is required.
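
   The difference between the two schemes can be seen by listing the clock pulses of one
pattern together with the scan-enable value seen at each pulse. The sketch below is an
illustrative encoding, not a tool format: the `Cycle` tuple and the function names are
assumptions made for this example.

```python
from typing import List, Tuple

Cycle = Tuple[str, int]  # (kind of clock pulse, scan-enable value when the pulse arrives)


def launch_off_shift(n: int) -> List[Cycle]:
    """n - 1 initializing shifts, then the last shift launches the transition."""
    return [("shift", 1)] * (n - 1) + [("launch", 1), ("capture", 0)]


def launch_off_capture(n: int) -> List[Cycle]:
    """n shifts, then a separate launch pulse and a capture pulse with scan-enable low."""
    return [("shift", 1)] * n + [("launch", 0), ("capture", 0)]


def scan_enable_fall(cycles: List[Cycle]) -> Tuple[str, str]:
    """Return the pair of pulses between which scan-enable goes from 1 to 0."""
    for (kind_a, se_a), (kind_b, se_b) in zip(cycles, cycles[1:]):
        if se_a == 1 and se_b == 0:
            return kind_a, kind_b
    raise ValueError("scan-enable never falls")


los = launch_off_shift(4)
loc = launch_off_capture(4)
los_fall = scan_enable_fall(los)  # falls between launch and capture (at-speed interval)
loc_fall = scan_enable_fall(loc)  # falls between the last shift and launch (no at-speed limit)
```

In launch-off-shift the scan enable falls between the launch and capture pulses, which are
one at-speed period apart; in launch-off-capture it falls before the launch pulse, where
the tester can insert as much dead time as needed.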

1.2.4     Scan Test Pattern and Volume

A scan test pattern consists of a state to be loaded, a capture cycle, and a state to be
unloaded.

        P0 --- Capture --- xx
        P1 --- Capture --- P0'
        P2 --- Capture --- P1'

   That is, when P1 is scanned in, P0', which is the result of the previous capture, is
scanned out.
   Each scan pattern contains a state to be loaded for all the chain elements, so the size
of a pattern is equal to the number of elements in the scan chain. The number of
scan patterns generated depends on the design, and the number of patterns required grows
exponentially with test coverage. For higher coverage, the number of patterns required
is huge, and generation becomes increasingly difficult as the tool tries to detect the
random-pattern-resistant faults (RPRF).
   The pattern volume for at-speed faults is usually a few orders of magnitude larger than
the stuck-at scan test volume because of the nature of sensitization.
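
   Because the unload of one pattern overlaps the load of the next, a first-order count of
tester cycles follows directly from the pattern count and the chain length. The sketch
below assumes one capture cycle per pattern and one final unload after the last pattern;
the function names and the flop, chain and pattern counts are hypothetical.

```python
import math


def chain_length(num_flops: int, num_chains: int) -> int:
    """Length of the longest chain when flops are spread evenly over the chains."""
    return math.ceil(num_flops / num_chains)


def scan_test_cycles(num_patterns: int, length: int) -> int:
    """Each pattern costs `length` shift cycles plus one capture; one extra unload at the end."""
    return num_patterns * (length + 1) + length


# Hypothetical design: 10,000 scan flops, 1,000 patterns.
single_chain = scan_test_cycles(1000, chain_length(10_000, 1))
ten_chains = scan_test_cycles(1000, chain_length(10_000, 10))
```

With these assumed numbers, distributing the flops over ten balanced chains reduces the
cycle count by roughly a factor of ten, which is the motivation for multiple balanced scan
chains.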

1.3      Test Application Time Trends
According to ITRS2005 [29] – “Unconstrained, increased digital logic die complexity
and content drives proportional increases on the test data volume (number and width of
vectors). Unconstrained, this additional test data volume drives increases in test capital
and operational costs by requiring additional vector memory depth per digital channel of
the test tools (ATEs) and by increasing test application time per DUT . . .
   To keep the test quality level of embedded cores against deep sub-micron defects, such
as resistive opens or small delay defects, additional delay test is required. Therefore,
the number of test patterns increases inevitably and imperatively, and test application
time may be as much as thirty times larger than today in 2010. This means various
techniques for the significant reduction of test time, such as test pattern compaction, scan
architecture improvement, and scan shift speed acceleration, are strongly needed.”
   Design and defect complexity explode the test data volume required to ship
quality chips. The need for lower Defective Parts Per Million (DPPM) is dictated by
market needs. For example, a device used in the automotive industry requires a DPPM
close to zero. Given the device complexity and the process uncertainties, producing a
device with close to zero DPPM requires an exponential increase in test volume
and hence in test application time and cost.
   Figure 1.2 shows the exponential increase in test cost for lower DPPM. Quality
is clearly traded off against test cost: figure 1.2 also shows that the cost of shipping
defective parts increases with DPPM.

                Figure 1.2: Test Quality Trade-offs (Source : ITRS 2005)

1.4     Test Power Trends
Power consumed during test has been shown to be twice as high as power consumed
during normal mode [35]. The reasons for the increased power consumption during test
mode include:

   • Increased switching activity during test mode compared to normal operation of the
      chip.

   • Parallel testing of modules or sub-chips to reduce test application time, which results
      in excessive energy and power dissipation.

   • Test logic designed to reduce test complexity is idle during normal mode but is
      intensively used in test mode.

   • Successive functional input vectors are correlated, whereas the correlation between
      test vectors is low since they are random patterns [30].

   With growing test cost concerns, multi-site testing is being seriously explored
and deployed to reduce test cost by a factor proportional to the number of sites.
Multi-site testing involves testing more than one device concurrently on the same load
board using the same Automatic Test Equipment (ATE). As also emphasized by ITRS2005
[29], a limiting factor for multi-site testing is the number of power supplies and the
power supply range. With increased power dissipation during scan testing, applying
multi-site testing becomes a challenge. A commonly pursued solution to this problem is to
reduce the scan shift frequency so that the average power is reduced.

                    \[ \mathrm{Power} = \frac{1}{2} C V^{2} f \alpha \]                    (1.1)
   Equation 1.1 gives a first-order estimate of test power, where C and V are the total
load capacitance and the operating voltage respectively, f is the frequency of operation,
and α is the activity factor. Lowering f reduces power. In the case of scan shift
power, f is the scan shift frequency, and reducing it lowers the scan shift power
during manufacturing test.
   Though this solves the test power problem, it results in increased test application
time, which defeats the primary motivation for multi-site testing.
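
   The trade-off can be made concrete by evaluating equation 1.1 at two shift frequencies.
The sketch below uses hypothetical values for the load capacitance, supply voltage,
activity factor and cycle count; only the form of the equation comes from the text.

```python
def dynamic_power(c_load: float, v_dd: float, f_shift: float, alpha: float) -> float:
    """First-order power estimate of equation 1.1: (1/2) * C * V^2 * f * alpha, in watts."""
    return 0.5 * c_load * v_dd ** 2 * f_shift * alpha


def shift_time(cycles: int, f_shift: float) -> float:
    """Time in seconds needed to apply `cycles` shift cycles at frequency `f_shift`."""
    return cycles / f_shift


# Hypothetical values, chosen only for illustration.
C_LOAD = 1e-9       # total switched capacitance in farads
V_DD = 1.2          # operating voltage in volts
ALPHA = 0.5         # activity factor during scan shift
CYCLES = 1_000_000  # shift cycles in the test

power_fast = dynamic_power(C_LOAD, V_DD, 50e6, ALPHA)
power_slow = dynamic_power(C_LOAD, V_DD, 25e6, ALPHA)
time_fast = shift_time(CYCLES, 50e6)
time_slow = shift_time(CYCLES, 25e6)
```

Halving the shift frequency halves the estimated power but doubles the shift time, so the
energy spent on shifting is unchanged while the test application time grows.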
   In order to apply multi-site testing successfully and reduce test application time, DFT
methodologies that reduce test power become a necessity.
   Test power can be broadly classified into two measures, viz. average test power and
instantaneous peak power.
   Average Test Power : Average test power is the power averaged over a period of
time; the ratio of energy to test time gives the average power. Elevated average
power increases the thermal load that must be vented away from the device under test to
prevent structural damage to the silicon, bonding wires or package [10]. Test power during
scan shift is equated to average test power.
   Instantaneous Peak Power : Instantaneous power is the value of power consumed at
any given instant. Usually, it is defined as the power consumed right after the application
of the synchronizing clock signal [10]. Elevated instantaneous power might overload the
power distribution system and cause IR drop problems, which manifest themselves as
timing-related faults. Peak power during scan capture is equated to instantaneous peak
test power.
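
   The two measures can be computed from a per-cycle energy trace. The sketch below
assumes that the energy consumed in each clock cycle is known (for instance from a
simulator) and uses the per-cycle average as a stand-in for instantaneous power; the
trace values and the function names are hypothetical.

```python
from typing import List


def average_power(energy_per_cycle: List[float], period: float) -> float:
    """Total energy divided by total time, in watts."""
    return sum(energy_per_cycle) / (len(energy_per_cycle) * period)


def peak_power(energy_per_cycle: List[float], period: float) -> float:
    """Largest per-cycle power, used here as a proxy for instantaneous peak power."""
    return max(energy_per_cycle) / period


# Hypothetical trace: three shift cycles followed by one capture cycle (joules per cycle).
trace = [2e-9, 3e-9, 2e-9, 9e-9]
PERIOD = 20e-9  # clock period in seconds

avg = average_power(trace, PERIOD)
peak = peak_power(trace, PERIOD)
```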

1.5      Multi-Clock Domain SoCs
SoC designs with multiple clock domains are now common for several reasons:

   • IP reuse

   • Ability to reduce functional power by turning off clocks for unused blocks at any
      time

   • Difficulties in clock tree synthesis for a single clock domain in a large SoC

   In a modern SoC, there can be a number of different clock domains, as many as twenty
in some communication-related ASICs, where each subsystem operates in a different
clock domain, depending on its functionality and interface with other components on
the board.
   Scan testing of SoCs with multiple clocks poses several challenges in terms of clocking
and timing. Balanced scan chains are recommended for optimal test time. In an SoC with
multiple clock domains, balancing scan elements across chains requires that scan elements
from different clock domains be mixed on the same chain. Having scan elements from
different clock domains on the same chain raises the following issues.

1.5.1     Multi-Clock Domain Scan Shift

Problem : Consider two flops FA and FB, clocked by clka and clkb respectively, as shown
in figure 1.3. There is a cloud of combinational logic from the Q of FA to the D of FB.
Since the clock sources are different, the insertion delays on these clock trees, and hence
the arrival times of the rising clock edges at these flops, are different.

                      Figure 1.3: A Scan Chain with Clock Mixing

   Two cases arise from the difference in the clock arrival times.

   • Case 1 : As illustrated in figure 1.4, if the insertion delays are such
      that clkb arrives before clka, then the correct data is latched, i.e., the old data is
      moved from FA to FB and the new data is latched into FA when clka arrives.

   • Case 2 : As illustrated in figure 1.5, if the insertion delays are such that
      clkb arrives after clka, then the new data from FA is latched into FB, which
      is the wrong data.
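
   The two cases can be reproduced with a minimal model of one shift edge. This is a
sketch under simplifying assumptions: clock-to-Q and path delays are taken as zero, so any
positive skew of clkb relative to clka lets the new data race through. The function name
and the arrival times are hypothetical.

```python
from typing import Tuple


def shift_edge(fa: int, fb: int, scan_in: int, t_clka: float, t_clkb: float) -> Tuple[int, int]:
    """Return the new (FA, FB) values after one shift edge on the path FA.Q -> FB.SD."""
    old_fa = fa
    new_fa = scan_in  # FA always captures the incoming scan data
    # FB sees FA's old value only if its clock edge does not arrive after FA's edge.
    new_fb = old_fa if t_clkb <= t_clka else new_fa
    return new_fa, new_fb


# FA holds 1, FB holds 0, and a 0 is being shifted in.
case1 = shift_edge(fa=1, fb=0, scan_in=0, t_clka=1.0, t_clkb=0.8)  # clkb early
case2 = shift_edge(fa=1, fb=0, scan_in=0, t_clka=1.0, t_clkb=1.2)  # clkb late
```

In Case 1, FB receives the old value of FA, as a correct shift requires. In Case 2, FB
receives the bit that was just shifted into FA, so one bit of the scan data is lost.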

   Solution : A naive solution, yet one that is difficult to implement, would be to balance
all the clocks. Matching clock insertion delays can be difficult because, due to a
functional requirement, an insertion delay exception may be set on a clock domain that is
much smaller than the insertion delay required for scan shift. For such clock domains to
meet the worst-case insertion delay during test mode, clock buffers would have to be added
on the clock path, which costs area and power.
   Another solution to the case illustrated in figure 1.5 is to insert a lockup latch at the
clock domain crossing, on the Q → SD path, to delay the data from FA being available at FB
by half a cycle, as shown in figure 1.6.
   A lockup latch is a normal latch clocked by the same clock as the launch flop in the
scan chain, i.e., by clka in the case illustrated. It opens at the negative edge of
the clock and closes at the positive edge.
   So, when the new data is latched into FA, it is not available to FB until the negative
edge of the clock, thereby delaying the data by half a cycle. With the data delayed
by half a cycle, if clkb arrives after clka, the early data is not captured
as long as the arrival time of clkb lags that of clka by less than half a cycle.
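
   The half-cycle window provided by the lockup latch can be expressed as a small model of
one shift edge. This is a sketch under stated assumptions: a 50% duty cycle, zero
clock-to-Q and path delays, and a single edge analysed in isolation. The function name,
the `skew` parameter (arrival time of clkb minus that of clka) and the numeric values are
hypothetical.

```python
from typing import Tuple


def shift_edge_with_lockup(fa: int, fb: int, scan_in: int,
                           skew: float, period: float) -> Tuple[int, int]:
    """New (FA, FB) values when a lockup latch on clka sits between FA.Q and FB.SD.

    The latch is transparent while clka is low, so it holds FA's old value from the
    rising edge of clka until the falling edge half a period later.
    """
    if skew <= -period / 2:
        raise ValueError("skew outside the range this single-edge model covers")
    old_fa, new_fa = fa, scan_in
    latch_out = old_fa if skew < period / 2 else new_fa
    return new_fa, latch_out


PERIOD = 10.0  # shift clock period, arbitrary time units

safe = shift_edge_with_lockup(fa=1, fb=0, scan_in=0, skew=2.0, period=PERIOD)
unsafe = shift_edge_with_lockup(fa=1, fb=0, scan_in=0, skew=6.0, period=PERIOD)
```

With a skew of 2 time units, FB still receives the old value of FA even though clkb is
late. With a skew of 6 units, which exceeds half the period, the new data passes through
the now transparent latch and the shift fails again.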

                      Figure 1.4: Arrival time of clkb earlier to clka

   When the difference between the insertion delays of the crossing clock domains is greater
than half a cycle, several timing closure tricks can be applied, one of which is to invert
the clock to the clock domain that has the larger insertion delay, so that effectively the
arrival of its edge is no more than half the shift clock period away.
   Though the lockup latch eases the problem of capturing early data (a HOLD violation) in
the scan shift path, it makes meeting timing tougher, because paths that had one full shift
clock period now have, due to the lockup latches, half a shift clock period.
   Figure 1.7 shows a waveform for the design with scan shift fixed. It shows
case 2, where clkb arrives after clka: the data a is available only
from the negative edge of the clock, and the right data is latched as long as clkb
arrives no later than half a cycle after clka.

1.5.2     Multi-Clock Domain Scan Capture

Problem: The problem described in the previous section for scan shift arises on the Q → SD
path, and the same can happen on the Q → D path. For scan shift, the problem was addressed
with ease by adding a lockup latch in the Q → SD path, since the Q → SD path is
purely a test logic path and making it a half-cycle path does not
affect the functionality of the chip. But the Q → D path is a functional path, and adding
lockup latches to it could have serious repercussions on functional timing because the
path would have just half a cycle to meet timing.

                    Figure 1.5: Arrival time of clkb later than clka

        Figure 1.6: Multi-Clock Domain Scan Shift Solution - Lockup Latches

   Consider a case where data from more than one clock domain is captured in a
flop. The addition of lockup latches would then be cumbersome and would end up
adding one latch per flop in the worst case.
   Hence the solution of adding a lockup latch in the Q → D path is infeasible. Though
balancing all clocks so that the insertion delays are matched during test
mode would work, it is considerable over-design and is also difficult from a timing
closure point of view.

           Figure 1.7: Waveform of Multi-Clock Domain Scan Shift Solution

   Solution: A solution that the industry has adopted, in the absence of a more
efficient one, is to capture in one domain at a time, as shown in figure 1.8. In the
illustration, where there are two clock domains clka and clkb, test data is shifted into
both domains, i.e., flops FA and FB, but during capture the capture pulse is given in
only one domain, i.e., either clka or clkb, depending on whether the targeted faults are in
the clka domain or the clkb domain. Scan shift happens without any data error because
of the presence of the lockup latch at the crossing of the clock domains.

   Figure 1.8: Multi-Clock Domain Scan Capture Solution - Single Domain Capture

   Note that the test application time spent in shifting through the flops
of the domain in which the test response is not captured is wasted. The test power
dissipated during this operation is also higher overall, because of the amount of logic
that toggles simultaneously.
   The main idea behind this thesis is to remove or reduce the wasted test cycles
used to shift data into domains in which data is not captured. This results in a reduction
in test application time and test power.

1.6     Organization of the Thesis
The thesis is organized as follows. Chapter 2 reviews past work on popular scan
architectures (vanilla and DNC), test power reduction techniques, test application time
reduction techniques, and ATPG based solutions to the multiple clock domain scan test
problem. Chapter 3 describes the proposed virtual divide and conquer architecture with a
detailed analysis of test time and test power. Chapter 4 describes the experiments and the
results obtained to substantiate the theory. Chapter 5 presents the conclusions and the
scope for future work.
Chapter 2

Background and Related Work

In the previous chapter, we pointed out the challenges in VLSI testing viz., reduction
of test application time and test power. We also pointed out the challenge of testing
multi-clock domain SoC.
   In this chapter, popular industry scan architectures like vanilla scan and divide-and-
conquer scan are described in detail along with their test time and test power analysis.
Existing literature on test time and test power reduction techniques is also reviewed.

2.1     Scan Architecture
A scan architecture is a key element that decides the testability of the chip and hence
determines the test cost and quality of the device in the market. Many scan architectures
are implemented on industry designs. The most popular ones are the vanilla scan
architecture, appreciated for its simplicity, and the divide-and-conquer (DNC) scan
architecture, used for implementing hierarchical design-for-testability.
   In this section both the vanilla and DNC architectures are discussed in detail, along
with an analysis of test time and test power on a toy design, XDSL.


2.1.1     Vanilla Scan Architecture

Architecture
Consider an XDSL SoC with two sub-chips and four blocks: ARM, EMIF, CPU, and
DDR (Figure 2.2). Assume that there are four clock domains, one corresponding to each
block, as shown in the figure. The four clock domains could arise from functionality,
or from logically isolated blocks that are given different clocks to manage clock skew and
ease timing closure. The Automatic Test Equipment (ATE) has a limit on the number
of clocks that it can supply, and its high speed scan memory decides the number
of scan chains supported. Suppose the tester permits k scan-input and k scan-output pins.
The vanilla full-scan architecture will enforce that k scan chains are inserted in each of
the four blocks and that these chains are concatenated at the top level, as illustrated in
Figure 2.2. For any particular fault type (stuck-at, transition-delay, etc.) scan test
involves a single “test mode”, where the scan chains are loaded through the k scan inputs,
and the responses are unloaded using the k scan outputs [6].
   In a bottom-up flow, there is limited opportunity to balance scan chains, and in any
block, the length of the longest chain may be much larger than the average length of the
scan chain. To avoid this problem, we can consider an alternative where scan chains are
balanced through “clock mixing”. For example, we can consider balancing the scan chains
across the ARM and EMIF blocks, and across the CPU and DDR blocks.
   However, this solution poses some implementation challenges. The clock trees for
the blocks are routed separately to keep skew in the individual trees in control. But
due to clock skew between the ARM and EMIF sub-blocks, it is not safe to pulse both
the ARM clock and the EMIF clock simultaneously during shift or capture operations.
The same is true for the CPU and DDR blocks. Lockup latches will be needed at every
clock domain crossing in the ARM+EMIF sub-chip [28]. Similarly, in order to correctly
capture data launched in one clock domain into a flop in another clock domain, lockup
latches must be inserted in the functional path. This impacts the functional timing of
the design. While lockup latches address the problem of shifting data in a scan chain
with clock mixing, capturing the response of the circuit under test in such a scan chain
requires careful timing analysis. A simplified solution often used in many designs is to
capture only in one clock domain at a time. This is implemented by pulsing all the clocks
during shift, but only one clock during capture. This is illustrated for the XDSL chip in
Figure 2.3.

                        Figure 2.1: Vanilla FullScan: Illustration

   Test Time Analysis
Assuming that the lengths of the longest chains in the blocks are $l_{ARM}$, $l_{EMIF}$,
$l_{DDR}$, and $l_{CPU}$, the scan test application time will be proportional to
$l_{ARM} + l_{EMIF} + l_{DDR} + l_{CPU}$. Each pattern time consists of a shift time and a
capture time, and the capture time is negligible compared to the shift time. Every
pattern is shifted through all the elements in the scan chain, whichever block it targets.
Let the numbers of patterns required for the blocks be $P_{EMIF}$, $P_{ARM}$, $P_{CPU}$,
and $P_{DDR}$. Then the test time can be given as

$$T_{vanilla} = (P_{EMIF} + P_{ARM} + P_{CPU} + P_{DDR}) \times (l_{ARM} + l_{EMIF} + l_{DDR} + l_{CPU}) \tag{2.1}$$

                Figure 2.2: Vanilla FullScan Implementation on XDSL

   Since the scan chain length can be written as

$$L_{vanilla} = l_{ARM} + l_{EMIF} + l_{DDR} + l_{CPU} \tag{2.2}$$

   the above equation can be written as

$$T_{vanilla} = \sum_{j \in blocks} P_j \times L_{vanilla} \tag{2.3}$$

   Since the fault is excited and captured in only one domain at a time, the flops of the
other domains are just shifted through, and the test time wasted in shifting through these
flops can be written as

$$T_{wasted} = \sum_{j \in blocks} P_j \times (L_{vanilla} - l_j) \tag{2.4}$$
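
   As a rough numerical sketch of Equations 2.2 to 2.4, the following Python snippet
evaluates the vanilla test time and the wasted shift cycles. The chain lengths and pattern
counts are made-up illustrative values, not data from the XDSL design, and the function
names are hypothetical.

```python
# Hypothetical chain lengths l_j and pattern counts P_j (illustrative values only).
chain_len = {"ARM": 1000, "EMIF": 400, "DDR": 600, "CPU": 2000}
patterns = {"ARM": 500, "EMIF": 200, "DDR": 300, "CPU": 800}


def vanilla_test_time(chain_len, patterns):
    """Equation 2.3: every pattern is shifted through the full concatenated chain."""
    l_vanilla = sum(chain_len.values())  # Equation 2.2
    return sum(patterns.values()) * l_vanilla


def vanilla_wasted(chain_len, patterns):
    """Equation 2.4: cycles spent shifting through blocks that do not capture."""
    l_vanilla = sum(chain_len.values())
    return sum(p * (l_vanilla - chain_len[block]) for block, p in patterns.items())


t_vanilla = vanilla_test_time(chain_len, patterns)
t_wasted = vanilla_wasted(chain_len, patterns)
print(t_vanilla, t_wasted, round(t_wasted / t_vanilla, 3))
```

   With these assumed numbers, roughly two thirds of the shift cycles fall in blocks that do
not capture, which is the overhead the rest of the chapter tries to reduce.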

   Test Power Analysis

   If the numbers of flops in the sub-blocks are $n_{ARM}$, $n_{EMIF}$, $n_{DDR}$, and
$n_{CPU}$, the activity factor during shift is directly proportional to
$n_{ARM} + n_{EMIF} + n_{DDR} + n_{CPU}$, since all the flops toggle at the same time during
scan load/unload.

$$\alpha \propto n_{ARM} + n_{EMIF} + n_{DDR} + n_{CPU} \tag{2.5}$$

$$\mathit{TestPower}_{vanilla} = \frac{1}{2} C V^{2} f \times \alpha \tag{2.6}$$

     Figure 2.3: Wasted Shift Cycles in the Single Clock Domain Capture Scheme

   Since toggling the flops that do not capture could be avoided, the test power wasted
can be written as

$$\mathit{TestPower}_{wasted} = \mathit{TestPower}_{vanilla} - \max_{i \in blocks} \left| \frac{1}{2} C V^{2} f \cdot n_i \right| \tag{2.7}$$
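
   The power expressions in Equations 2.6 and 2.7 can be illustrated with a short Python
sketch. It assumes that the activity factor equals the flop count (a proportionality
constant of 1), and the flop counts, capacitance, voltage, and frequency are hypothetical
values chosen only for illustration.

```python
# Hypothetical flop counts n_i per block (illustrative only).
flops = {"ARM": 40000, "EMIF": 10000, "DDR": 15000, "CPU": 60000}

# Assumed switched capacitance per flop (F), supply voltage (V), shift frequency (Hz).
C, V, F_SHIFT = 1e-15, 1.2, 10e6


def shift_power(n_flops):
    """0.5 * C * V^2 * f * alpha, with alpha taken equal to the flop count (assumption)."""
    return 0.5 * C * V**2 * F_SHIFT * n_flops


power_vanilla = shift_power(sum(flops.values()))  # Equation 2.6
power_wasted = power_vanilla - max(shift_power(n) for n in flops.values())  # Equation 2.7
wasted_fraction = power_wasted / power_vanilla
print(f"wasted fraction of shift power: {wasted_fraction:.2f}")
```

   Because the electrical constants cancel in the ratio, the wasted fraction depends only on
the assumed flop counts.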

2.1.2    Divide and Conquer Scan Architecture

Architecture
Divide-and-Conquer (DNC) scan is a hierarchical scan test method [1, 5, 24] where the
essential idea is to provide a scan access mechanism that allows scan testing of individual
portions of the SoC. For example, if there are n sub-chips in the SoC, DNC scan will
use the available bandwidth of k scan pins to route k scan chains through each of the
sub-chips. Scan multiplexer logic (also known as a scan router) is used to permit testing
of one sub-chip at a time. Since sub-chips may interact through glue logic, it becomes
necessary to also permit a daisy-chain mode, which is essentially the vanilla fullscan
mode. In the daisy chain mode, the target fault list includes all faults that are not
already caught in the n individual scan test modes. Since only portions of the SoC
are tested at a time, the sequential elements in the remaining parts of the chip can be
initialized to constant values to reduce test power [24, 5].
   In Figure 2.5, we have illustrated how DNC can be applied to the XDSL SoC. The
chip is partitioned into 2 sub-chips, namely (ARM + EMIF) and (CPU + DDR). If
the chip has k scanin and k scanout ports, we must insert (balanced) scan chains in the
two sub-chips and connect the scan chains to a scan router as indicated in the figure. In
test mode 0, the (ARM + EMIF) sub-chip will be scan-tested through the scan path

                        scanin → ARM → EMIF → scanout

   while the flops in the DDR and CPU sub-blocks are initialized to constants. In test
mode 1, the (CPU + DDR) sub-chip will be scan-tested through the scan path

                         scanin → DDR → CPU → scanout

   while the flops in the ARM and EMIF sub-blocks are initialized to constants. In mode
2, the daisy chain mode, the scan path would be

              scanin → ARM → EMIF → DDR → CPU → scanout
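
   The three modes above can be summarized as a small lookup, sketched here in Python.
This only models which blocks lie on the scan path and which are held at constants in
each mode; the names SCAN_PATHS and route are hypothetical and do not describe the
actual scan router logic.

```python
# Scan paths per DNC test mode for the XDSL example, as listed in the text.
SCAN_PATHS = {
    0: ["ARM", "EMIF"],                # test mode 0
    1: ["DDR", "CPU"],                 # test mode 1
    2: ["ARM", "EMIF", "DDR", "CPU"],  # mode 2: daisy chain
}
ALL_BLOCKS = ["ARM", "EMIF", "DDR", "CPU"]


def route(mode):
    """Return (blocks on the scan path, blocks initialized to constants) for a mode."""
    on_path = SCAN_PATHS[mode]
    held_constant = [b for b in ALL_BLOCKS if b not in on_path]
    return on_path, held_constant


for mode in SCAN_PATHS:
    on_path, held = route(mode)
    print(mode, " -> ".join(["scanin", *on_path, "scanout"]), "| constants:", held)
```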

   Note that DNC fits well into a physical design hierarchy; in a hierarchical physical
design flow, it is natural to partition the chip into logical partitions such as (ARM +
EMIF) and (CPU + DDR) so as to balance the gate counts across partitions. Another
consideration during physical partitioning is the connectivity between the blocks, so that
an effective floorplan can be derived. This partitioning strategy also works well from the
viewpoint of DNC scan, since balancing the gate counts would tend to balance the
number of faults across the partitions, leading to balanced ATPG run-times on the
individual partitions. Similarly, keeping physically related modules together will lead
to a smaller target fault set for the daisy chain mode. As shown in [24], the DNC scan
architecture allows us to run the ATPG for the partitions concurrently, and the only
dependence in the ATPG flow is that the daisy-chain mode ATPG cannot be started
without completing the ATPG runs for the partitions. The daisy chain mode ATPG
depends on the test group fault lists, since the daisy chain mode targets faults that
are not detected during the test group ATPGs. Therefore, the speedup of a distributed
implementation of the ATPG is impacted adversely by a long run of the daisy chain.
Equation 2.8 provides the speedup S obtained for the XDSL chip. In this
equation, $T_M$ refers to the test application time for module M.

$$S_{XDSL} = \frac{T_{XDSL}}{\max(T_{ARM+EMIF},\, T_{CPU+DDR}) + T_{DAISY}} \tag{2.8}$$
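
   A minimal sketch of Equation 2.8 in Python follows. The times are hypothetical values
in arbitrary units, and the function name dnc_speedup is an assumption made for
illustration.

```python
def dnc_speedup(t_full, t_groups, t_daisy):
    """Equation 2.8: the test groups run concurrently, then the daisy-chain run follows."""
    return t_full / (max(t_groups) + t_daisy)


# Hypothetical times in arbitrary units.
short_daisy = dnc_speedup(t_full=100.0, t_groups=[40.0, 45.0], t_daisy=5.0)
long_daisy = dnc_speedup(t_full=100.0, t_groups=[40.0, 45.0], t_daisy=55.0)
print(short_daisy, long_daisy)
```

   With the assumed numbers, lengthening the daisy-chain run from 5 to 55 units removes the
speedup entirely, which is the adverse effect noted in the text.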

   Test Time Analysis

   The DNC scan architecture for the XDSL example includes 3 modes. In mode 0,
which corresponds to the (ARM + EMIF) test group, the length of the scan chain can
be taken to be $L_0 = l_{ARM} + l_{EMIF}$. The number of test cycles in this mode of operation
is given by

$$T_0 = L_0 \times (P_{ARM} + P_{EMIF}) \tag{2.9}$$

   Similarly, in mode 1, the test cycle count is given by

$$T_1 = L_1 \times (P_{DDR} + P_{CPU}) \tag{2.10}$$

   where $L_1 = l_{DDR} + l_{CPU}$.

                   Figure 2.4: Divide and Conquer Scan : Illustration

   During test application, when a pattern targeting EMIF is shifted in, there are
$L_{ARM}$ wasted cycles in ARM (and vice versa), where $L_{ARM}$ denotes the chain length of
the ARM block, written $l_{ARM}$ above. Based on this argument, the number of wasted cycles
is given by

$$T_{wasted} = P_{ARM} \cdot L_{EMIF} + P_{EMIF} \cdot L_{ARM} + P_{DDR} \cdot L_{CPU} + P_{CPU} \cdot L_{DDR} \tag{2.11}$$

   In general, consider a DNC scan architecture with m + 1 modes, including the daisy-
chain mode. During mode j, $0 \le j < m$, let the subset of blocks that are tested be given
by $B_{j,1}, B_{j,2}, \cdots, B_{j,M_j}$. Here, $M_j$ is the number of modules in test group (or test
mode) j. The effective length of the scan chain in mode j is given by

$$L_j = \sum_{k=1}^{M_j} L_{B_{j,k}} \tag{2.12}$$

          Figure 2.5: One Divide and Conquer Scan Implementation on XDSL

   The wasted cycles are given by

$$T_{wasted} = \sum_{j=0}^{m-1} \sum_{k=1}^{M_j} P_{j,k} \cdot (L_j - L_{B_{j,k}}) \tag{2.13}$$

   where $P_{j,k}$ is the number of patterns that capture in block $B_{j,k}$.
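
   As a check on the general expression, the sketch below evaluates Equations 2.12 and
2.13 in Python for the two XDSL test groups and compares the result with Equation 2.11
written out term by term. The chain lengths and pattern counts are hypothetical values.

```python
# Each test mode is a list of (block, chain length L, pattern count P); values are hypothetical.
modes = [
    [("ARM", 1000, 500), ("EMIF", 400, 200)],  # mode 0
    [("DDR", 600, 300), ("CPU", 2000, 800)],   # mode 1
]


def dnc_wasted(modes):
    """Equation 2.13: for each block's patterns, the rest of the mode's chain is wasted."""
    total = 0
    for blocks in modes:
        l_mode = sum(length for _, length, _ in blocks)  # Equation 2.12
        total += sum(p * (l_mode - length) for _, length, p in blocks)
    return total


t_wasted_dnc = dnc_wasted(modes)

# Equation 2.11 written out for XDSL with the same numbers.
t_wasted_xdsl = 500 * 400 + 200 * 1000 + 300 * 2000 + 800 * 600
print(t_wasted_dnc, t_wasted_xdsl)
```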

   Test Power Analysis
The dynamic power dissipated in a CMOS circuit is proportional to the amount of node
toggling. During scan test, there are two major sources of toggling that contribute to
power, namely scan shifting (which causes switching in the entire circuit) and clocking
(which causes toggling in the clock tree). The only way to minimize the power in
clock trees is through clock gating. However, scan shift power can be reduced through
techniques such as pattern optimization and reordering [26]. During shift operation, it
is easy to see that the test power is proportional to the length of the scan chain, since
longer scan chains will typically result in more switching activity. In a clock mixed scan
chain, the power dissipated in shifting data into scan flops that do not capture data is
wasteful. For example, consider the DNC scan architecture for XDSL, with three modes
of operation, namely (ARM + EMIF), (CPU + DDR), and the daisy chain mode. During
mode 0, ARM + EMIF are tested through a scan chain of length $L_{ARM} + L_{EMIF}$.
Although all the flops in the chain toggle during pattern shift, only the ARM core flops
capture data when the capture pulse is applied to the ARM clock. The toggling activity in
EMIF, therefore, is wasteful. Similarly, the clock tree power dissipated in the EMIF
clock tree during scan shift is also wasteful. Power savings can be achieved by toggling
and clocking only the flops that take part in the capture cycle. DNC places constant
values on scan chains that are not part of the test mode. This reduces shift test power
to an extent [5]. However, clock power reduction is not addressed in this architecture.

$$\mathit{TestPower} = \max_{j} \left| \mathit{TestPower}_{M_j} \right| \tag{2.14}$$

   where

$$\mathit{TestPower}_{M_j} = \frac{1}{2} C V^{2} f \times \alpha_{M_j} + \beta_{M_j} \tag{2.15}$$

   Here $\alpha_{M_j}$ is the activity factor during mode $M_j$, which involves the flops present
in the scan chain used during mode $M_j$, and $\beta_{M_j}$ is the constant power dissipated in
other parts of the circuit which are not scanned but are clocked and dissipate power.
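
   Equations 2.14 and 2.15 can be sketched in Python as follows. The activity factors are
taken equal to assumed flop counts, and the constant terms and electrical parameters are
hypothetical values; only the structure (a per-mode power and a maximum over modes) comes
from the equations.

```python
# Assumed switched capacitance per flop (F), supply voltage (V), shift frequency (Hz).
C, V, F_SHIFT = 1e-15, 1.2, 10e6

# Hypothetical (alpha, beta) per mode: alpha as flop count, beta in watts.
mode_params = {
    "ARM+EMIF": (50000, 1e-6),
    "CPU+DDR": (75000, 2e-6),
    "daisy": (125000, 0.0),
}


def mode_power(alpha, beta):
    """Equation 2.15: switching power of the scanned flops plus a constant term."""
    return 0.5 * C * V**2 * F_SHIFT * alpha + beta


power_by_mode = {name: mode_power(a, b) for name, (a, b) in mode_params.items()}
peak_mode = max(power_by_mode, key=power_by_mode.get)  # Equation 2.14
print(peak_mode, power_by_mode[peak_mode])
```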

2.2     Survey of Test Power Reduction Techniques
The motivation behind test power reduction is twofold. The first is to ensure a safe test
that does not damage the device. The second is that reducing the power consumed during
test allows the test speed to be increased, so the same testing can be performed in less
time within the power limits.
   Test power reduction techniques devised and published so far can be classified into
two main categories, viz., architecture based and test data based.

2.2.1    Using Scan Architecture

Reducing test power by improving the scan architecture is one means that results in
guaranteed test power reduction [7, 2, 3]. The advantage of these methods is that they
are independent of the vendor tools, i.e., the scan stitching and test pattern generation
tools. The disadvantage is that they must be implemented as an architecture
in the design and hence require upfront planning and implementation effort, and an
analysis of the applicability of the architecture is mandatory for each design. Based on
the design, a particular architecture must be selected to yield the best results in terms
of test power reduction.
   Average test power, which in the general case is directly related to the scan shift
power, can be reduced by controlling either the clock to the scan elements or the data
to the scan elements during the scan shift operation.
   Controlling Test Clock :
   Several techniques have been proposed for scan shift test power reduction by
controlling the clock. Saxena and Bonhomme [4, 28] have described techniques that use
clock gating to reduce the power dissipated in the clock tree during scan shift.
Sankaralingam et al. [27] proposed a technique using programmable scan chain disable.
The technique disables the clock to scan chains and, combined with test pattern ordering,
achieves a greater reduction in test power. The clock tree has been shown to consume a
significant portion of the total test power [22]. However, clock gating is not easy to
implement due to the physical design and timing closure challenges it poses [8].
   Controlling Scan Chain :
   Other techniques use scan chain transformation, which alters the scan chain by
selectively enabling and disabling selected scan chains during shift mode. Whetzel [31]
uses an approach that transforms the conventional scan architecture into a scan path having
a desired number of selectable, separate scan paths. Lee [17] proposed an interleaving
scan architecture based on adding delay buffers among the scan chains.
   In these techniques the amount of logic that toggles during the test is controlled, and
hence the test power is reduced. The control is applied either to the scan chain or to the
clock to the scan elements. When these test power reduction techniques were applied, the
test time either improved marginally or remained the same. A few techniques had area
overhead, and other techniques like clock gating had physical design and timing closure
restrictions.

2.2.2    Using Selective Test Sets

Test power can be reduced by the selection of test sets, either by reordering the
generated test patterns or by compaction techniques [11, 25, 26, 34]. Test pattern
reordering reduces the switching activity in the circuit during scan shifting. There has
been active research on reducing test power this way because the approach is design
independent, although the patterns themselves are design dependent.
The generated test patterns are ordered to detect as many faults as possible with the
minimum number of patterns. Typically, production test patterns are ordered to place the
patterns that detect the maximum number of faults at the top of the set, since early
rejection of a bad part reduces the total testing time. We may potentially lose this
advantage when manufacturing test patterns are reordered for power minimization.

2.3     Survey of Test Time Reduction Techniques
Since test time translates directly to test cost and hence to product cost, research
to reduce test time while still guaranteeing quality has been conducted for a few decades
now.
   Compared to functional test, scan test provided a quantitative means of ensuring
quality. Scan test, however, suffers from large pattern volume and hence long test
time. Current test time research is focused on improving the scan architectures to reduce
test time while maintaining the quality goals.
   Logic BIST :
   Logic BIST solutions are often the best in terms of test application time, since
patterns are generated on-chip and compacted on-chip. Logic BIST is also attractive from
a field test perspective: after the chip is part of a system, the logic BIST can
be activated in the field and failures over time can be identified, which is infeasible
with other test solutions that require an ATE. However, logic BIST comes with area
overhead, physical design challenges, and functional timing closure challenges [12]. Since
the input pattern is generated completely from an on-chip LFSR, all patterns applied are
random, and a category of faults, “Random Pattern Resistant Faults” (RPRF), either cannot
be detected or requires a large number of patterns to detect.
   Scan Compression :
   Several solutions are available for reducing the scan data volume and scan test
application time. These solutions use compression techniques [23, 32] to accept compressed
data, decompress it on-chip and load it, and then compress the response and unload it.
By compressing and decompressing the scan data, the volume of data applied
to the chip is reduced, and hence so is the test time. In these solutions, the input
pattern is not generated purely from the LFSR. Instead, a seed is generated externally and
fed to the LFSR, which generates from it a pattern that is used to excite the fault.
Using external algorithms, RPRF faults are targeted by loading a seed into the LFSR
that in turn generates a pattern which excites the RPRF.
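
   To make the seed idea concrete, here is a minimal Python sketch of a 4-bit Fibonacci
LFSR that expands a seed into a bit stream. The register width, tap positions, and seed
are illustrative assumptions and are not taken from any of the cited tools; practical
schemes add phase shifters and seed computation that this sketch omits.

```python
def lfsr_bits(seed, taps, width, count):
    """Return `count` output bits from a Fibonacci LFSR.

    seed  : initial non-zero register state, as an integer
    taps  : bit positions (0 = LSB) XORed together to form the feedback bit
    width : register width in bits
    """
    state = seed
    out = []
    for _ in range(count):
        out.append(state & 1)  # the LSB is the output bit
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        # Shift right and insert the feedback bit at the MSB.
        state = (state >> 1) | (feedback << (width - 1))
    return out


# Taps (0, 1) on a 4-bit register give a maximal-length sequence of period 15.
bits = lfsr_bits(seed=0b0001, taps=(0, 1), width=4, count=30)
print(bits[:15])
```

   Loading a different seed starts the register at a different point of the same sequence,
which is how an externally computed seed steers the pattern that reaches the scan chains.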
   Illinois Scan Architecture :
   Another popular architecture is the Illinois Scan Architecture, where the same input
is fed to multiple scan chains and the test response is compressed and scanned out. The
difference compared to the other techniques is that there is no decompression logic in the
input path. This is advantageous because any pattern generated by an external algorithm
can be applied directly to a chain. The drawback
is that the input pattern applies only to a single chain, and all other chains receive
copies of the data shifted into the first chain.
   The common disadvantage of all of the above schemes is the compressor logic. The
compressor logic, which is used to compress the test response so that it can later be
compared with the fault free circuit signature, could have aliasing errors. Due to an
aliasing error a fault can be masked, and hence a bad device could be categorized as a
good device. There has been extensive research in this field to devise compression methods
that would be free of aliasing errors [33, 20].
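
   A minimal sketch of aliasing in Python, assuming a single-bit parity compactor in place
of the signature logic used in practice (an assumption chosen for brevity): an even number
of bit errors leaves the signature unchanged, so the faulty response cannot be told apart
from the good one. The response values are made up.

```python
from functools import reduce
from operator import xor


def parity_signature(response_bits):
    """Compress a response into a single parity bit (a deliberately weak compactor)."""
    return reduce(xor, response_bits, 0)


good = [1, 0, 1, 1, 0, 0, 1, 0]

# One flipped bit changes the parity, so the fault is detected.
one_error = list(good)
one_error[2] ^= 1

# A second flipped bit restores the parity, so the fault is masked (aliasing).
two_errors = list(one_error)
two_errors[5] ^= 1

print(parity_signature(good), parity_signature(one_error), parity_signature(two_errors))
```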
   Most of the research and publications in the field of test time reduction have focused
on overcoming the limitation on the number of tester channels available for scan
shifting. Techniques like Illinois Scan, Mentor Graphics TestKompress™, and Synopsys
DBIST™ increase the number of virtual tester channels by using a larger number of internal
scan chains, with on-chip circuitry to decompress the scan inputs and compress the scan
outputs. There was not much focus on test power because, when there is a
power issue, the tests can still be run at a slower speed.

2.4      ATPG solution to Multi-Clock Domain

              Figure 2.6: Staggered Scan Capture in ATPG based solution

   Makar [19] summarizes the scan-test related problems in SoC with multiple clock
domains. ATPG vendors are beginning to address these problems [13, 18].
   Custom clocking procedures are supported by some ATPG tools, where the ATPG
algorithm is enhanced to generate patterns by pulsing more than one clock when it is
safe to do so. This requires analysis of the clock domains to ensure that they are not
interacting. Clearly, the effectiveness of custom clocking depends on the number of non-
interacting clocks; it is expected that designs with many non-interacting clock domains
will benefit from custom clocking through smaller number of test patterns.
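
   One way to picture the clock domain analysis is as a grouping problem: clocks that do
not interact may share a capture pulse. The Python sketch below uses a simple greedy
grouping over a hypothetical interaction list. It is an illustration only and is not the
algorithm of any ATPG tool; the clock names and the function group_clocks are assumptions.

```python
def group_clocks(clocks, interacts):
    """Greedily place each clock in the first group that has no interacting member."""
    groups = []
    for clk in clocks:
        for group in groups:
            if all(frozenset((clk, other)) not in interacts for other in group):
                group.append(clk)
                break
        else:
            # No compatible group exists, so this clock starts a new one.
            groups.append([clk])
    return groups


# Hypothetical interaction list: clka-clkb and clkb-clkc have paths between them.
interacts = {frozenset(("clka", "clkb")), frozenset(("clkb", "clkc"))}
groups = group_clocks(["clka", "clkb", "clkc", "clkd"], interacts)
print(groups)
```

   In this made-up case, three of the four clocks can be pulsed together, so two capture
procedures suffice instead of four.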
   Pulsing the capture clocks of interacting clock domains in a staggered fashion
results in sequential ATPG for the second pulse, because the first pulse changes the
contents of the sequential elements. Sequential ATPG poses challenges in terms of ATPG
runtimes and pattern volume.
   Techniques discussed in [13, 18] did not report any reduction in test power.

2.5     Summary
In this chapter we reviewed the Vanilla Scan and Divide and Conquer Scan architectures.
We also did a brief survey of the test power reduction techniques and test application
time reduction techniques. The ATPG solution to the multi clock domain scan test
problem was also described.
   In the next chapter we describe in detail the proposed scan test architecture, Virtual
Divide and Conquer Scan.
Chapter 3

Virtual Divide and Conquer

In the previous chapter, we discussed the existing scan test architectures that are related
to the proposed scan architecture. A survey of the techniques used for test time and test
power reduction was also presented.
    In this chapter, we propose a new scan test architecture, the Virtual Divide and
Conquer Scan Architecture, and analyze its test time and test power benefits
theoretically.

3.1      Architecture
The Divide-and-Conquer (DNC) scan architecture provides a definite advantage in terms
of test power and enables a distributed hierarchical ATPG flow, when compared to vanilla
full scan. When it is applied to designs with multiple clock domains, however, it runs
into the same problems as the vanilla fullscan architecture. Since the partitioning
strategy is based on balancing gate counts to balance ATPG run-times [24], it becomes
necessary to group together blocks operating in different clock domains. As a result,
clock mixing in scan chains becomes inevitable, bringing with it all the problems
associated with clock mixing. The DNC architecture also uses the policy of activating a
single clock domain during the capture cycle [24]; therefore, the problem of wasted clock
cycles persists (Figure 2.3). The DNC scan architecture is extended to alleviate the two
problems mentioned above.

                  Figure 3.1: Virtual Divide and Conquer: Illustration

   In Virtual Divide and Conquer (VDNC) scan, the design is partitioned into test
groups based on clock domain information. Since the partition may not preserve
hierarchical boundaries, it is referred to as virtual partitioning. A test group in VDNC
consists of scan chains that are clocked by a single clock, or by clock domains of the same
frequency that are independent of each other. Two clock domains are considered
independent if there exists no path between them or all the paths between them are false
paths. Test patterns are generated for each test group separately. Since there is only one
clock per test group, the shift and capture are completely safe on all flops in the scan
chains. Hence all flops scanned with test data are also used to capture new data. A simple
illustration of VDNC is shown in Figure 3.1. One implementation of VDNC scan for the XDSL
example is shown in Figure 3.2. Since the ARM core and the CPU work at 100 MHz and the
clock analysis shows them to be independent, it is possible to group them together. We
will therefore have the following test modes (Table 3.1).

 Mode          Frequency                     Scan Path                                    Comments
 VDNC^100      100 MHz                       scanin → ARM → CPU → scanout                 Faults in ARM and CPU
 VDNC^200      200 MHz                       scanin → DDR → scanout                       Faults in DDR
 VDNC^75       75 MHz                        scanin → EMIF → scanout                      Faults in EMIF
 VDNC^daisy    One clock at-a-time capture   scanin → ARM → CPU → DDR → EMIF → scanout    Daisy mode; inter-clock-domain faults

        Table 3.1: Test Modes and Scan Paths for VDNC Scan for XDSL chip

      Figure 3.2: One Implementation of Virtual Divide and Conquer on XDSL

   The XDSL example may not fully bring out the nuances of the VDNC architecture.
For example, assume that the CPU block has two sub-blocks with two different
clock domains, say CPU-1 (100 MHz) and CPU-2 (75 MHz). VDNC will then group
(ARM+CPU-1), (EMIF+CPU-2), and (DDR). In general, a design hierarchy can be
represented as

$$Design = B_1 \cdot B_{1,1} + B_1 \cdot B_{1,2} + B_2 \cdot B_{2,1} \cdot B_{2,1,1} + \cdots \tag{3.1}$$

   Here, $B_1, B_2, \cdots$ represent the blocks at the first level of hierarchy, $B_{1,1}, B_{1,2}, \cdots$
represent the blocks at the second level of hierarchy under $B_1$, and so on. Suppose that
we add clock domain information to the leaf-level blocks that work on a single clock, by
annotating the name of the block with its clock frequency. For example, if $B_{1,1}$ works at
100 MHz, we can write it as $B^{100}_{1,1}$. Note that VDNC may combine $B^{75}_{1,1,3}$,
$B^{75}_{2,1,1,3}$, and $B^{75}_{3,1}$ provided there is no skew in the clocks reaching these
blocks. We shall use the notation $B^{75}_{1,1,3} + B^{75}_{2,1,1,3} + B^{75}_{3,1}$ to indicate
this test mode.
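
   The grouping rule can be sketched in Python as a partition of leaf blocks by clock
frequency. The sketch assumes that same-frequency domains have already been shown to be
independent by clock analysis, which the code does not check; the function name
vdnc_groups is hypothetical, and the frequencies are those of the XDSL example.

```python
from collections import defaultdict


def vdnc_groups(block_freq_mhz):
    """Group leaf blocks into test groups by clock frequency.

    Assumes same-frequency domains are mutually independent; this must be
    established separately by clock analysis.
    """
    groups = defaultdict(list)
    for block, freq in block_freq_mhz.items():
        groups[freq].append(block)
    return dict(groups)


xdsl = {"ARM": 100, "EMIF": 75, "CPU": 100, "DDR": 200}
xdsl_split_cpu = {"ARM": 100, "EMIF": 75, "CPU-1": 100, "CPU-2": 75, "DDR": 200}

print(vdnc_groups(xdsl))
print(vdnc_groups(xdsl_split_cpu))
```

   The second call reproduces the (ARM+CPU-1), (EMIF+CPU-2), and (DDR) grouping described
for the split CPU, showing how a test group can cut across hierarchical boundaries.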
   Figure 3.3 illustrates the test time benefit offered by VDNC. Here, we shall assume
that there are three test modes, $VDNC^{100}$, $VDNC^{200}$, and $VDNC^{75}$, other than the
daisy-chain mode. Note that the pattern is shifted into the ARM chain and the response
capture happens immediately thereafter.

             Figure 3.3: Virtual Divide and Conquer Scan Test Waveform

   This eliminates the wasted cycles which become inevitable in a clock-mixing solution.
In the next section, we quantify the number of test cycles saved in the VDNC scheme.

3.2       Test Time Analysis
Again, we use the example of XDSL to understand the test time calculation for VDNC
before generalizing the calculation to a generic case. Assume that VDNC scan for XDSL
has three independent modes, one for each clock frequency, and a daisy chain mode. The
three independent modes cover the four clock domains, namely ARM, EMIF,
DDR, and CPU, because ARM and CPU share a mode. Since the daisy mode is used to test
inter-clock-domain faults, the number of patterns generated in this mode is quite small.
Let $P_j$ be the number of patterns for mode j and $L_j$ be the length of the scan chain in
mode j. The following table provides the test cycles consumed in the different modes.
Also refer to Figure 3.3.

              Mode $j$        $P_j$            $L_j$            Test cycles
              $VDNC^{100}$    $P_{ARM+CPU}$    $L_{ARM+CPU}$    $P_{ARM+CPU} \cdot L_{ARM+CPU}$
              $VDNC^{75}$     $P_{EMIF}$       $L_{EMIF}$       $P_{EMIF} \cdot L_{EMIF}$
              $VDNC^{200}$    $P_{DDR}$        $L_{DDR}$        $P_{DDR} \cdot L_{DDR}$

             Table 3.2: Test Time Calculation for VDNC Scan on XDSL chip

    As we saw in the previous section, the VDNC scan technique may combine into the
same test mode $j$ several blocks at possibly different levels of hierarchy based on clock
domain information. Let $L_j$ be the total length of the scan chain in mode $j$; this is
the sum of the chain lengths for the blocks that are assigned to mode $j$. Let $P_j$ be the
number of patterns in mode $j$. Then the total number of test cycles for the test modes
$0, 1, 2, \cdots, m-1$ is given by $\sum_{j=0}^{m-1} P_j \cdot L_j$. The reduction in the number of test cycles in
VDNC scan is equal to the number of wasted cycles in DNC scan. Compared to VDNC,
the number of cycles wasted in DNC is

$$T_{wasted} = \sum_{j=0}^{m-1} L_j \cdot (P_T - P_j) \tag{3.2}$$

    where $P_T$ is the total pattern volume, i.e., $P_T = \sum_{i=0}^{m-1} P_i$.
    In the case of XDSL it is equal to

$$T_{wasted} = L_1 \cdot (P_2 + P_3) + L_2 \cdot (P_1 + P_3) + L_3 \cdot (P_1 + P_2)$$

   When the same is written in terms of the total scan chain length $L = \sum_{j=0}^{m-1} L_j$,

$$T_{wasted} = \sum_{j=0}^{m-1} P_j \cdot (L - L_j) \tag{3.3}$$
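
   The two forms of $T_{wasted}$ can be checked numerically with a short sketch. The function names and the per-mode $(P_j, L_j)$ values below are assumptions chosen for illustration; they are not measurements from the XDSL chip.

```python
def total_test_cycles(modes):
    """Sum of P_j * L_j over all modes; modes is a list of (P_j, L_j)."""
    return sum(p * l for p, l in modes)


def wasted_by_chain(modes):
    """Equation (3.2): sum over j of L_j * (P_T - P_j)."""
    p_total = sum(p for p, _ in modes)
    return sum(l * (p_total - p) for p, l in modes)


def wasted_by_pattern(modes):
    """Equation (3.3): sum over j of P_j * (L - L_j)."""
    l_total = sum(l for _, l in modes)
    return sum(p * (l_total - l) for p, l in modes)


# Illustrative (P_j, L_j) pairs for three modes; hypothetical values only.
EXAMPLE_MODES = [(1000, 400), (600, 150), (800, 250)]
```

   Both functions return the same value for any input, because each expression expands to the sum of $P_i \cdot L_j$ over all pairs with $i \neq j$. By the same algebra, adding `total_test_cycles` to the wasted cycles gives $P_T \cdot L$.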

3.3     Test Power Analysis
In the VDNC architecture, since test partitioning is based on clock domains, it is possible
to not only reduce scan shift power, but also the clock tree power. Consider the VDNC
scan architecture for XDSL with three individual test modes and the daisy chain mode,
as explained in the previous section.
   Scan Shift Power: Consider the $VDNC^{200}$ test mode. Scan shifting only impacts
the flops in the DDR core. The clock to the remaining blocks is not pulsed, which
eliminates the scan shift power as well as the clock tree power in these blocks. Further,
since the length of the scan chain in the $VDNC^{200}$ test mode, $L_{DDR}$, is smaller than the value
$L_{ARM+EMIF+DDR+CPU}$ for vanilla scan, a reduction by a factor of $L_{ARM+EMIF+DDR+CPU}/L_{DDR}$
in scan shift power can be expected.
   Clock Tree Power: The clocks to the EMIF, ARM, and CPU cores are also not pulsed at
the chip level, hence the clock tree power consumption due to these clocks is avoided during
the $VDNC^{200}$ mode of operation. This reduction in clock power does not require clock
gating. As a result, the physical design challenges associated with clock gating do not
impact VDNC scan implementation [8].
   In order to analyze the test power reduction in VDNC, we shall use the following
simple model. We neglect the capture power in our average power calculation, since it
is negligible compared to the other components, namely, scan shift power and clock tree
power. Assume that there are $n$ blocks in the chip, with $f_i$ flops in block $i$. Also, assume
that there are $m$ clock domains and the number of buffers in clock domain $j$ is $C_j$. If
vanilla scan were to be used, the total test power can be written as

$$P_{vanilla} = \alpha \cdot \sum_{j=1}^{m} C_j + \beta \cdot \sum_{i=1}^{n} f_i \tag{3.4}$$

   where $\alpha$ and $\beta$ are constants of proportionality. The first term corresponds to the
power dissipation in the clock tree buffers and the second term corresponds to the power
dissipation in the flops during scan shifting and capture. When DNC is used with $p$
partitions, assume that the number of flops in partition $k$ is $f_k$. The power dissipation
in DNC is taken as that of the DNC mode in which the average power is the maximum; it is
given by

$$P_{DNC} = \max_{0 \le k \le p} \left( \alpha \cdot \sum_{j=1}^{m} C_j + \beta \cdot f_k \right) \tag{3.5}$$

   The test power dissipation for the DNC architecture can be expected to be lower than
$P_{vanilla}$ because only a portion of the flops is involved in scan shifting at a time. It is also evident
that, since all the clock trees of the chip remain active, there is no saving in clock tree power.
   In the Virtual Divide-and-conquer technique, the test power for the mode in which
the power dissipation is maximum could be written as

$$P_{VDNC} = \max_{0 \le k \le p} \left( \alpha \cdot C_k + \beta \cdot f_k \right) \tag{3.6}$$

   The test power dissipation for the VDNC architecture is much lower than that of the vanilla
and DNC architectures because there is a reduction in both the clock tree power and the
scan shift power. This results in an overall power reduction.
   Pouya et al. [22] showed that, during test, the clock tree power accounts for
99% of the total test power. Hence with VDNC, cutting down on the clock tree power
results in a significant reduction in test power.
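
   The three power expressions (3.4) to (3.6) can be compared with a small sketch. The function names, the values of $\alpha$ and $\beta$, and the buffer and flop counts are all assumptions for illustration. The example further assumes that each DNC partition coincides with one clock-domain test mode, so that the same $f_k$ values can be used for both DNC and VDNC.

```python
def p_vanilla(alpha, beta, clock_buffers, block_flops):
    """Equation (3.4): every clock tree and every flop is active."""
    return alpha * sum(clock_buffers) + beta * sum(block_flops)


def p_dnc(alpha, beta, clock_buffers, partition_flops):
    """Equation (3.5): all clock trees active, one partition shifting."""
    clock_term = alpha * sum(clock_buffers)
    return max(clock_term + beta * f for f in partition_flops)


def p_vdnc(alpha, beta, mode_buffers, mode_flops):
    """Equation (3.6): only the clock tree of the active mode is pulsed."""
    return max(alpha * c + beta * f for c, f in zip(mode_buffers, mode_flops))


# Hypothetical model parameters and design sizes (not from the thesis).
ALPHA, BETA = 2, 1
BUFFERS = [300, 100, 200]    # C_k: clock buffers per clock domain / mode
FLOPS = [4000, 1500, 2500]   # f_k: flops per partition / mode

VANILLA = p_vanilla(ALPHA, BETA, BUFFERS, FLOPS)
DNC = p_dnc(ALPHA, BETA, BUFFERS, FLOPS)
VDNC = p_vdnc(ALPHA, BETA, BUFFERS, FLOPS)
```

   With non-negative inputs the model always gives `VDNC <= DNC <= VANILLA`, since each step removes terms from the sum; how large the gap is depends entirely on the assumed $\alpha$, $\beta$, $C_k$, and $f_k$.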

3.4        Implementation
Implementation of a VDNC scan architecture involves the following steps:

   • Scan Router Design

   • Scan Stitching

   • Scan Router Integration

   • Scan Test Pattern Generation

3.4.1     Scan Router

The scan router is a block of muxing logic that allows the test modes to be programmed.
The selected test mode determines which scan chain is connected between the chip-level scan-in
and scan-out pins.
   The scan router logic accepts the scan select bits, the chip-level scan-in, and the
test group scan-out ports as inputs, and drives the chip-level scan-out and the
test group scan-in ports as outputs.
   Appendix I shows the scan router RTL used to collect the experimental data. It is
written in Verilog and assumes the design has 8 scan chains, but this can be configured
in the RTL to a different number depending on the limitations of the tester.
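
   The routing behaviour described above can be modelled in a few lines. This is a behavioural sketch in Python for a single scan chain, not the Verilog RTL of Appendix I; the function name `scan_router`, the `DAISY` mode value, and the choice to tie unselected scan-ins to 0 are assumptions made for the example.

```python
DAISY = "daisy"  # hypothetical select value for the daisy-chain mode


def scan_router(select, chip_scan_in, group_scan_out):
    """Combinational routing for one scan chain.

    select         -- index of the test group to test, or DAISY
    chip_scan_in   -- bit (0 or 1) on the chip-level scan-in pin
    group_scan_out -- list of bits, one per test group scan-out port
    Returns (chip_scan_out, group_scan_in), where group_scan_in is a list
    of bits driven onto each test group's scan-in port.
    """
    n = len(group_scan_out)
    group_scan_in = [0] * n  # assumption: unselected groups are tied to 0
    if select == DAISY:
        # Chain the groups: chip in -> group 0 -> group 1 -> ... -> chip out.
        group_scan_in[0] = chip_scan_in
        for k in range(1, n):
            group_scan_in[k] = group_scan_out[k - 1]
        chip_scan_out = group_scan_out[n - 1]
    else:
        # Connect only the selected group between the chip-level pins.
        group_scan_in[select] = chip_scan_in
        chip_scan_out = group_scan_out[select]
    return chip_scan_out, group_scan_in
```

   For example, `scan_router(1, 1, [0, 1, 0])` connects only the second test group to the chip-level pins, while `scan_router(DAISY, 1, [0, 1, 1])` concatenates all three groups. A design with 8 chains would instantiate this routing once per chain.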

3.4.2     Scan Stitching

In order to stitch a balanced set of scan chains per clock domain, the scan stitching tool is
run multiple times, defining one clock at a time and stitching the chains for that particular
clock domain. Most scan stitching tools operate by tracing the defined clocks to
identify the scan elements to be stitched into a chain.
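
   The per-clock stitching flow can be illustrated with a simplified model. It is a sketch under stated assumptions, not the algorithm of any particular stitching tool: the flop list, the clock names, and the function `stitch_per_clock` are hypothetical, and chains are balanced by simple round-robin assignment without regard to placement.

```python
def stitch_per_clock(flops, num_chains):
    """Build balanced scan chains separately for each clock domain.

    flops      -- list of (flop_name, clock_name) pairs
    num_chains -- number of chains to build per clock domain
    Returns {clock_name: [chain_0, chain_1, ...]}, each chain a list of
    flop names. No chain ever mixes flops from two clock domains.
    """
    # Identify the scan elements of each clock, as a tool would by
    # tracing one defined clock per run.
    by_clock = {}
    for name, clock in flops:
        by_clock.setdefault(clock, []).append(name)

    chains = {}
    for clock, members in by_clock.items():
        # Round-robin split keeps chain lengths within one flop of each other.
        chains[clock] = [members[i::num_chains] for i in range(num_chains)]
    return chains


# Hypothetical netlist: five flops on clk_a and three on clk_b.
EXAMPLE_FLOPS = [("f%d" % i, "clk_a") for i in range(5)] + \
                [("g%d" % i, "clk_b") for i in range(3)]
EXAMPLE_CHAINS = stitch_per_clock(EXAMPLE_FLOPS, 2)
```

   Each iteration of the second loop plays the role of one run of the stitching tool with a single clock defined.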

3.4.3     Scan Router Integration

The scan router RTL is synthesized and integrated with the design core which has the
scan chains stitched per clock domain.
   Figure 3.4 shows the scan router RTL integrated with the core.