Machine Learning Clustering Techniques for Selective Mitigation of Critical Design Features


Thomas Lange∗†, Aneesh Balakrishnan∗‡, Maximilien Glorieux∗, Dan Alexandrescu∗, Luca Sterpone†
∗ iRoC Technologies, Grenoble, France
† Dipartimento di Informatica e Automatica, Politecnico di Torino, Torino, Italy
‡ Department of Computer Systems, Tallinn University of Technology, Tallinn, Estonia

{thomas.lange, aneesh.balakrishnan, maximilien.glorieux, dan.alexandrescu}@iroctech.com, luca.sterpone@polito.it
arXiv:2008.13664v1 [cs.AR] 31 Aug 2020

Abstract—Selective mitigation or selective hardening is an effective technique to obtain a good trade-off between the improvements in the overall reliability of a circuit and the hardware overhead induced by the hardening techniques. Selective mitigation relies on preferentially protecting circuit instances according to their susceptibility and criticality. However, ranking circuit parts in terms of vulnerability usually requires computationally intensive fault-injection simulation campaigns. This paper presents a new methodology which uses machine learning clustering techniques to group flip-flops with similar expected contributions to the overall functional failure rate, based on the analysis of a compact set of features combining attributes from static elements and dynamic elements. Fault simulation campaigns can then be executed on a per-group basis, significantly reducing the time and cost of the evaluation. The effectiveness of grouping similarly sensitive flip-flops by machine learning clustering algorithms is evaluated on a practical example. Different clustering algorithms are applied and the results are compared to an ideal selective mitigation obtained by exhaustive fault-injection simulation.

Index Terms—Transient Faults, Single-Event Upsets, Selective Mitigation, Selective Hardening, Soft Error Protection

This work was supported by the RESCUE project which has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 722325.

I. INTRODUCTION

The advancement of process technologies in recent years has made it possible to manufacture chips with tens of millions of flip-flops. At the same time, due to technology scaling, lower supply voltages and higher operating frequencies, circuits have become more vulnerable to reliability threats, such as transient faults. Soft Errors in flip-flops in particular are a major concern, and countermeasures have to be taken by using hardening techniques, such as Triple Modular Redundancy (TMR). However, a fully protected chip might not meet the system requirements in terms of area, power or target frequency. Since for many applications it is not necessary to decrease the vulnerability to the lowest possible level, Selective Mitigation can be used. Thereby, only the most critical elements of the circuit are protected against Soft Errors and thus, the failure rate of the system is decreased to meet all requirements [1]–[3].

In order to perform Selective Mitigation, an exhaustive failure analysis is required to identify and rank the most vulnerable elements of the circuit. In particular, the effort for the failure analysis on a functional level grows with the design size, the number of workloads to analyse and the duration in cycles of each workload. A detailed functional failure analysis requires a significant investment in terms of human effort, processing resources and tool licenses. Studies have shown that exhaustive fault simulation is not feasible for today’s complex circuits [4].

A. Objective of This Work

Identifying and ranking the sequential elements which are most vulnerable to transient faults usually requires computationally intensive fault-injection simulation campaigns. This procedure can be optimized by grouping together flip-flops which are expected to have a similar sensitivity to faults. Fault injection campaigns can then be performed on a per-group basis and thus significantly reduce the time and cost of the evaluation [5]. However, this optimization heavily relies on the effectiveness of the grouping methodology. Therefore, we propose a new approach to effectively group together flip-flops which are expected to have a similar sensitivity to functional failures. The approach is based on machine learning clustering techniques which use a set of features to characterize each flip-flop in the circuit. The feature set combines attributes from static and dynamic elements. Machine learning clustering algorithms are evaluated and compared to an ideal selective mitigation obtained by exhaustive fault-injection simulation.

B. Organisation of the Paper

In the following sections, first, the use of clustering techniques for selective mitigation is summarized. Further, the general principle of machine learning clustering techniques is explained. Section III presents the proposed methodology to group flip-flops together based on machine learning clustering and the used feature set. The approach is evaluated on a practical example in Section IV for different machine learning algorithms. Section V concludes the paper and gives some prospects for future work.

II. CLUSTERING TECHNIQUES FOR SELECTIVE MITIGATION

The approach of protecting only the smallest set of elements in a circuit to meet a specified reliability target is called selective mitigation. Therefore, the individual circuit elements of a circuit need to be ranked from the most to the least sensitive. In the case of transient faults in the sequential

logic which lead to a functional failure, this usually requires exhaustive fault injection simulation, which might not be feasible for large and complex circuits.

In order to reduce the mentioned fault injection efforts, fault simulation campaigns can be performed on a group basis. Therefore, prior to any simulations, flip-flops are grouped together and the statistical fault injection is performed on each of these groups. This can significantly reduce the time and cost of the evaluation. However, the accuracy of this coarse-grained fault injection solely relies on an effective approach to group together flip-flops which have highly similar sensitivity to faults.

The previously studied techniques for clustering are based on buses, design hierarchy, or a hybrid approach using buses, hierarchy and signal naming information [5]. These approaches have several drawbacks. The bus-based and hierarchy-based approaches are only able to provide a fixed number of clusters. Thereby, the hierarchy-based approach often provides a low number of clusters which can be very heterogeneous. The bus-based approach often results in a high number of clusters with a small number of flip-flops per cluster and one larger cluster containing all the flip-flops which do not belong to any bus. Thus, the reduction of the number of fault injections is limited and the large cluster tends to be heterogeneous, which negatively impacts the effectiveness of the clustering. The hybrid approach overcomes the problem of the fixed number of clusters by combining the bus-based and hierarchy-based approaches and also taking the signal names into account. The approach assumes that flip-flops with a similar naming also have a similar function and thus have a similar sensitivity to faults. However, this relies on a consistent naming convention and a strong correlation between the naming and the function within the circuit.

A. Machine Learning Clustering Techniques

In order to tackle the drawbacks of the previously studied clustering approaches, this paper investigates if machine learning clustering techniques are suitable to group the flip-flops. Clustering techniques in the machine learning domain belong to the unsupervised learning category. In general these algorithms try to group similar objects together based on a given set of features. The feature set characterizes the objects and the clustering algorithms group objects together which have similar feature values, while objects from a different group should have highly dissimilar feature values [6].

In this paper, the flip-flops are characterized by a set of features which will be described in the next section. Machine learning clustering algorithms use this feature set to group flip-flops together. Afterwards, the effectiveness of the grouping will be evaluated on a practical example and it will be determined if flip-flops with similar fault sensitivity are grouped together.

III. PROPOSED MACHINE LEARNING CLUSTERING APPROACH

The proposed methodology is based on clustering techniques for selective mitigation. In contrast to previous work, the clustering approach uses machine learning clustering algorithms and a feature set to characterize each flip-flop in the circuit. In this way no assumptions are made about the design and a general methodology is provided.

[Figure 1: flowchart. Circuit Representation (RTL Design) and Testbench → Extract Features per Flip-Flop → Flip-Flop Feature Set → Apply ML Clustering Algorithm (parameter Nc) → Flip-Flop Groups → Fault Injection Campaign on Group Basis → Functional Failure Rate per Group → Rank and Select Groups to Mitigate]
Fig. 1. Selective Mitigation by Using Machine Learning Clustering

The steps to perform a selective mitigation are illustrated in Fig. 1. First, the features for each flip-flop in the design are extracted by using the RTL description and a corresponding testbench. Second, the machine learning clustering algorithm is applied to the obtained feature set and the flip-flop groups are obtained. The resulting number of groups Nc can be

TABLE I
FEATURE SET TO CHARACTERISE A FLIP-FLOP INSTANCE FFi

Feature Name                                              Description

Structural Related Features
# FF at Startpoint/Endpoint                               The number of flip-flops directly connected to/by FFi.
# Connections from/to FF                                  The number of flip-flops connected to/by FFi within the circuit.
# Connections from/to Primary Input/Output                The number of primary inputs/outputs connected to/by FFi.
# FF Stages to/from Primary Input/Output (max/avg/min)    Number of flip-flop stages from/to the primary input/output of the circuit.
Feedback Depth                                            The depth (in terms of flip-flop stages) of the shortest feedback loop.
Bus Position                                              Describes the position of FFi within the bus.
Bus Length                                                Describes the total length of the bus signal FFi is part of.
Bus Label                                                 The number/label of the bus signal FFi is part of.
Module Label                                              The number/label of the hierarchical module FFi is part of.

Signal Activity Related Features
@0/@1                                                     The relative time the FFi output is at logical 0/1.
State Changes                                             The number of state changes.
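
As an illustration of the signal activity related features in Tab. I, the following minimal Python sketch derives @0, @1 and the number of state changes from a per-cycle trace of one flip-flop output. The function name `signal_activity_features`, the dictionary keys and the sample trace are assumptions made for this example; they are not taken from the tooling used in this work, and a real flow would obtain the trace from the simulation.

```python
from typing import Dict, List


def signal_activity_features(samples: List[int]) -> Dict[str, float]:
    """Compute @0, @1 and the number of state changes for one flip-flop.

    `samples` is a hypothetical per-cycle trace of the flip-flop output,
    where every entry is either 0 or 1.
    """
    if not samples:
        raise ValueError("trace must contain at least one sample")
    n = len(samples)
    ones = sum(samples)
    # A state change is counted whenever two consecutive samples differ.
    changes = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    return {
        "at0": (n - ones) / n,          # relative time at logical 0
        "at1": ones / n,                # relative time at logical 1
        "state_changes": float(changes),
    }


# Hypothetical eight-cycle trace of a single flip-flop output.
trace = [0, 0, 1, 1, 1, 0, 1, 1]
features = signal_activity_features(trace)
```

One such feature dictionary per flip-flop, combined with the structural features, would form one row of the feature set passed to the clustering algorithm.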

adjusted by the parameters of the machine learning clustering algorithm. The number of groups also dictates the effort needed for the next step: the statistical fault injection. The fault injection campaign is performed on the computed flip-flop clusters and thus needs more effort (in terms of computing resources, human effort and tool licenses) with a higher number of clusters and vice versa. Eventually, the sensitivity to faults for each cluster is obtained and the clusters can be ranked from the most sensitive to the least sensitive. The selective mitigation will be applied starting from the most sensitive cluster until the reliability requirement is met.

A. The Feature Set

Previous work has shown that the masking effects and the vulnerability of a flip-flop can be related to certain characteristics of the circuit, such as circuit structure and signal probability [7]–[9]. Motivated by this idea, a feature set has been developed in [10] which is adapted for the approach presented in this paper. The feature set characterizes each flip-flop instance in the circuit and combines attributes from static elements, such as the circuit structure, as well as dynamic elements, such as the signal activity.

The original approach was based on the gate-level netlist of a design and contained features which correspond to synthesis attributes. In order to obtain a more general approach which can already be applied in an early design stage, the methodology presented in this paper uses only features which can be derived from the RTL description of the design (e.g. by performing a fast logic elaboration). Further, two additional features are extracted to reflect the bus-based and hierarchy-based clustering approaches described in the previous section. In total the feature set consists of 20 features.

The features extracted for each flip-flop FFi are described in detail in Tab. I. They are divided into two parts, the structural and the signal activity related features. The structural related features describe a flip-flop in relation to other flip-flops in the circuit without taking the (technology dependent) combinatorial logic into account. To consider the workload of the circuit, features are extracted which describe the dynamic behaviour of the flip-flops. Therefore, the information related to the signal activity is considered, such as the state distribution and transitions.

IV. EVALUATING MACHINE LEARNING CLUSTERING FOR SELECTIVE MITIGATION

In this section the machine learning clustering approach is evaluated on a practical example. Therefore, different clustering algorithms are used to group the flip-flops based on the presented feature set. The effectiveness of the clustering algorithms is measured by evaluating the created flip-flop clusters. The goal is to create flip-flop groups which have a similar vulnerability to critical failures. Therefore, first, an exhaustive full flat statistical fault injection campaign was performed to obtain the sensitivity to critical failures for each flip-flop and to provide an independent measure of the sensitivity. Afterwards, this data is used to evaluate the different clustering algorithms against an ideal and a random approach.

A. Circuit Under Test

For the practical example, the Ethernet 10GE MAC Core from OpenCores is used. This circuit implements the Media Access Control (MAC) functions as defined in the IEEE 802.3ae standard. The 10GE MAC core has a 10 Gbps interface (XGMII TX/RX) to connect it to different types of Ethernet PHYs and one packet interface to transmit and receive packets to/from the user logic [11]. The circuit consists of control logic, state machines, FIFOs and memory interfaces. It is implemented at the Register-Transfer Level (RTL) and is publicly available on OpenCores. The RTL description of the design consists of 1054 flip-flops.

The corresponding testbench writes several packets to the 10GE MAC transmit packet interface. As packet frames become available in the transmit FIFO, the MAC calculates a CRC and sends them out to the XGMII transmitter. The XGMII TX interface is looped back to the XGMII RX interface in the testbench. The frames are thus processed by the MAC receive engine and stored in the receive FIFO. Eventually, the

testbench reads frames from the packet receive interface and prints out the results [11].

B. Full Flat Statistical Fault Injection Results

In order to evaluate the effectiveness of the clustering algorithms, the sensitivity of the considered design was measured by performing a flat statistical fault injection campaign. The simulations are performed at the RT-Level using the corresponding testbench, which allows a functional verification. Thus, it is possible to evaluate the system-level impact of errors. The fault injection mechanism consisted of inverting the value stored in a flip-flop by using a simulator function.

In networking applications, such as the considered design, important data is protected by checksums. This means that a minor payload corruption can be handled by the error correction algorithm. However, in case the fault causes the circuit to stop working and interrupts the flow of sending packets, or data is continuously corrupted, the effect can be considered as critical. These failures in particular are highly problematic and should be mitigated by selective mitigation.

In each flip-flop 200 faults are injected at a random time during the active phase of the test-case. The functional failure rate of each flip-flop is calculated by dividing the number of simulation runs which lead to a functional failure by the total number of simulation runs. The overall results of the flat statistical fault-injection campaign are presented in Table II. The average critical failure rate is 5.13 % and the most critical flip-flops were identified and ranked.

TABLE II
SEU FAULT INJECTION CAMPAIGN RESULTS

                           Total     Per Injection
Injection Targets (FFs)    1054      -
Injected Faults (SEU)      210800    -
Functional Failure         10814     5.13 %

C. Machine Learning Clustering for Selective Mitigation

The machine learning clustering algorithms are used to group flip-flops based on the feature set described in Section III. The features are extracted from the RTL design of the circuit. Therefore, a fast elaboration is performed and the design is converted into a graph representation. The structural features are extracted from the graph by using graph algorithms and the features regarding the signal activity are extracted by tracing the simulation. The feature extraction is automated and overall takes less than 5 minutes.

The effectiveness of the clustering is evaluated by considering how the clusters would be used to selectively mitigate against the critical failures. Therefore, the data obtained from the exhaustive statistical fault injection is used to compute the sensitivity to critical faults for each cluster. Then, the reduction of the overall sensitivity of the circuit was calculated by varying the number of groups being protected. For the protection it is assumed that the considered flip-flops within the group are substituted by hardened cells, TMR or other approaches. It is assumed that after mitigation, the sensitivity to Single-Event Upsets is zero¹. The flip-flops to protect were selected starting with the most sensitive clusters first. The results are compared against an ideal and a random approach. In the ideal approach the most sensitive flip-flops are selected based on the exhaustive fault injection campaign. The random approach selects flip-flops to protect randomly (averaged over 100 independent runs).

¹ The authors are aware that this is a simplification and important aspects related to physical design are not considered. However, the approach presented here focuses on the evaluation on the functional level by using the RTL description of the circuit. To obtain a more complete analysis, the presented approach could be combined with e.g. the classical analysis for electrical and temporal masking by using a post place and route gate-level netlist.

The clustering algorithms were implemented by using Python’s scikit-learn Machine Learning framework [12] and applied to the extracted flip-flop feature set. This process took only a few seconds, which is negligible. Each considered clustering algorithm has different parameters which can be adjusted. They affect the performance of the clustering and also the resulting number of clusters. For some of the algorithms the number of clusters can be specified directly. Other algorithms try to find the optimal number of clusters within the constrained parameters. In the following evaluation the parameters were chosen in a way that the number of resulting clusters is about 20 %, 10 % and 5 % of the number of flip-flops in the circuit. This would correspond to a reduction of the fault injection efforts by 5×, 10× and 20× respectively.

1) K-Means Clustering: K-Means clustering aims to partition the given data points into k clusters. The algorithm computes the Euclidean distance between each data point and the cluster centres in the feature space. The data samples are then separated into groups of equal variance by minimizing the within-cluster sum of squared distances. The algorithm requires the number of clusters Nc to be specified.

Fig. 2 shows the overall critical sensitivity when the grouping is performed by the K-Means clustering with different numbers of clusters Nc. The flip-flops to protect were selected starting with the most sensitive clusters first until all flip-flops are mitigated. It can be noted that the grouping performs better when the number of clusters is higher.

2) Agglomerative Clustering: Agglomerative Clustering performs a hierarchical clustering by creating nested clusters, which are represented as a tree. The advantage of hierarchical clustering is that any valid measure of distance can be used, in comparison to K-Means clustering which operates on a Euclidean distance metric. Agglomerative Clustering uses a bottom-up approach where each data point starts as its own cluster. Clusters are merged together by following the linkage criterion. This criterion defines the metric used for the merge strategy. It was noted that the best results were obtained by using the maximum or complete linkage, which minimizes the maximum distance between data points of pairs of clusters, and a Manhattan distance metric (l1 norm).

In Fig. 3 the results of the selective mitigation are shown when Agglomerative Clustering is used for different numbers of clusters Nc. Similar to the K-Means clustering the effec-

[Figure 2: Overall Critical Sensitivity versus % of Mitigated Flip-Flops for Ideal, Random, and K-Means with Nc = 53, Nc = 105 and Nc = 211. (a) From 0 % to 100 % of mitigated flip-flops. (b) Section from 0 % to 25 % of mitigated flip-flops.]
Fig. 2. K-Means Clustering

[Figure 3: Overall Critical Sensitivity versus % of Mitigated Flip-Flops for Ideal, Random, and Agglomerative Clustering with Nc = 53, Nc = 105 and Nc = 211. (a) From 0 % to 100 % of mitigated flip-flops. (b) Section from 0 % to 25 % of mitigated flip-flops.]
Fig. 3. Agglomerative Clustering
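
The curves in Fig. 2 and Fig. 3 plot the overall critical sensitivity against the share of mitigated flip-flops. A minimal Python sketch of how such a curve could be computed is given below, under stated assumptions: the per-cluster sensitivity is taken as the mean functional failure rate of the flip-flops in the cluster, clusters are protected from the most to the least sensitive, and a protected flip-flop contributes zero sensitivity afterwards. The function `mitigation_curve` and the example data are hypothetical. The cluster labels are assumed to come from a clustering algorithm, for instance the output of `fit_predict` of scikit-learn's `KMeans`.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def mitigation_curve(
    failure_rates: Sequence[float], labels: Sequence[int]
) -> List[Tuple[float, float]]:
    """Return (fraction of mitigated flip-flops, overall sensitivity) points.

    Clusters are protected one after another, most sensitive first.
    """
    if len(failure_rates) != len(labels) or not failure_rates:
        raise ValueError("need one label per flip-flop and at least one flip-flop")

    # Collect the flip-flop indices that belong to each cluster label.
    members: Dict[int, List[int]] = defaultdict(list)
    for ff_index, label in enumerate(labels):
        members[label].append(ff_index)

    # Assumed per-cluster sensitivity: mean failure rate of its flip-flops.
    def cluster_rate(label: int) -> float:
        idx = members[label]
        return sum(failure_rates[i] for i in idx) / len(idx)

    ranked = sorted(members, key=cluster_rate, reverse=True)

    n = len(failure_rates)
    remaining = sum(failure_rates)
    protected = 0
    curve = [(0.0, remaining / n)]  # nothing mitigated yet
    for label in ranked:
        idx = members[label]
        remaining -= sum(failure_rates[i] for i in idx)
        protected += len(idx)
        # Clamp tiny negative values caused by floating-point rounding.
        curve.append((protected / n, max(remaining, 0.0) / n))
    return curve


# Hypothetical data: six flip-flops grouped into three clusters.
rates = [0.40, 0.30, 0.05, 0.05, 0.00, 0.00]
labels = [0, 0, 1, 1, 2, 2]
curve = mitigation_curve(rates, labels)
```

Passing one cluster per flip-flop (a unique label for each) reproduces the ideal ordering, which gives a simple reference against which a clustering result can be compared.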

tiveness of the selective mitigation is increasing with a higher number of clusters. However, the improvements from 53 to 105 and 211 clusters are very minor. When comparing the Agglomerative Clustering to K-Means Clustering it can be noted that Agglomerative Clustering generally performs better.

3) Mean Shift Clustering: Mean Shift Clustering aims to find dense areas of data points. The algorithm is based on a sliding window which is shifted towards regions of higher density. The density of the sliding window is proportional to the number of data points within the window. The goal is to locate centre points for each group in the dataset. For the previous clustering algorithms the number of clusters had to be specified manually. In Mean Shift clustering the number of clusters is determined by the algorithm. Further, K-Means clustering assumes a spherical distribution shape of the clusters in the feature space, whereas Mean Shift does not. For the algorithm the window size w needs to be specified.

A general problem when using Mean Shift Clustering is to choose the correct window size. In this analysis the window sizes were chosen in a way that the resulting number of clusters is close to the number of clusters used for the previous algorithms. By using window sizes of w = 2.8, w = 1.7 and w = 0.85 the resulting number of clusters was 52, 105 and 210, respectively. Fig. 4 shows the effectiveness of the algorithm when using the obtained clusters for selective mitigation. As for the previous algorithms the effectiveness is increasing with a higher number of clusters (smaller window sizes). It can be seen that the result with a window size of w = 0.85, which results in 210 groups, is almost as good as the ideal solution.

D. Comparison and Discussion

The results of the considered clustering algorithms with different resulting numbers of clusters are summarized in Tab. III. Since the number of resulting clusters was chosen to be about the same, the average cluster size is identical or very similar. The standard deviation of the cluster sizes however shows that Agglomerative and Mean Shift clustering tend to create clusters with more variety in the cluster size.

In order to quantify the effectiveness of the algorithm to create flip-flop groups with similar functional failure rate, two metrics were derived from the results: the average variance of the functional failure rate and the maximum difference of the functional failure rate of flip-flops within the same cluster. An additional metric was created where the average variance of the functional failure rate within one cluster is weighted

TABLE III
COMPARISON OF THE DIFFERENT CLUSTERING ALGORITHMS

Clustering       #           Mean       Std. Dev.    Mean          Mean Weighted   Max           Davies-
Algorithm        Clusters    Cluster    Cluster      Functional    Functional      Functional    Bouldin
                             Size       Size         Failure       Failure         Failure       Index
                                                     Variance      Variance        Difference
k-Means          53          19.89      13.01        0.009         0.095           0.92          0.73
k-Means          105         10.04      7.45         0.010         0.041           0.92          0.68
k-Means          211         4.99       3.80         0.003         0.008           0.64          0.51
Agglomerative    53          19.89      21.39        0.015         0.094           0.92          0.78
Agglomerative    105         10.04      11.76        0.011         0.040           0.92          0.57
Agglomerative    211         4.99       5.32         0.003         0.007           0.64          0.54
Mean Shift       52          20.27      28.71        0.017         0.167           1             0.60
Mean Shift       105         10.04      14.12        0.006         0.038           0.92          0.42
Mean Shift       210         5.02       6.64         0.002         0.005           0.47          0.39
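
As a rough illustration of how the columns of Tab. III could be computed, the following sketch uses scikit-learn [12] on synthetic stand-in data. The feature matrix, the failure rates, the function name `cluster_metrics` and all parameter values are assumptions made for this example and are not taken from the analysis. In particular, multiplying each cluster's variance by its size is only one possible reading of the weighted metric, and the `bandwidth` parameter is assumed to play the role of the window size w.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, MeanShift
from sklearn.metrics import davies_bouldin_score


def cluster_metrics(features, failure_rate, labels):
    """Summarize one clustering result in the style of the Tab. III columns."""
    cluster_ids = np.unique(labels)
    members = [failure_rate[labels == c] for c in cluster_ids]
    sizes = np.array([len(m) for m in members])
    variances = np.array([m.var() for m in members])
    return {
        "n_clusters": len(cluster_ids),
        "mean_size": sizes.mean(),
        "std_size": sizes.std(),
        "mean_variance": variances.mean(),
        # Assumption: each cluster's variance is weighted by its size before averaging.
        "mean_weighted_variance": (variances * sizes).mean(),
        "max_difference": max(m.max() - m.min() for m in members),
        # Computed from the features only, so no reference failure rates are needed.
        "davies_bouldin": davies_bouldin_score(features, labels),
    }


# Synthetic stand-in data (hypothetical): three groups of flip-flops in a 2-D feature space.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.vstack([c + 0.5 * rng.standard_normal((30, 2)) for c in centers])
group_rate = np.repeat([0.1, 0.5, 0.9], 30)
failure_rate = np.clip(group_rate + 0.05 * rng.standard_normal(90), 0.0, 1.0)

algorithms = {
    "k-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "Mean Shift": MeanShift(bandwidth=2.0),  # bandwidth stands in for the window size w
}

results = {}
for name, algorithm in algorithms.items():
    labels = algorithm.fit_predict(features)
    results[name] = cluster_metrics(features, failure_rate, labels)
    print(name, results[name])
```

For k-Means and Agglomerative Clustering the number of clusters is passed in explicitly, while for Mean Shift it follows from the chosen bandwidth, which mirrors the way the window sizes had to be tuned to reach comparable cluster counts.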

with the cluster size. In this way a large cluster, which has the same variance within the cluster
as a small cluster, is penalized more.
   The results verify what was observed in Fig. 2, 3 and 4. In general, the algorithms perform
better with a higher number of clusters. Agglomerative Clustering performs slightly better than
k-Means, and Mean Shift clustering performs worst with a low number of clusters and close to ideal
with the highest considered number of clusters.

[Fig. 4. Mean Shift Clustering. Overall critical sensitivity versus % of mitigated flip-flops;
curves: Ideal, Random, Mean Shift w = 2.8, Mean Shift w = 1.7, Mean Shift w = 0.85.
(a) From 0 % to 100 % of mitigated flip-flops. (b) Section from 0 % to 25 % of mitigated flip-flops.]

   In order to evaluate the performance of a clustering algorithm without knowledge of the
actual/reference values, several metrics exist which quantify certain characteristics of the
created clusters (such as the separation of the data). These metrics can be used when the
presented approach is applied to a new, unknown circuit and different algorithms are evaluated or
the algorithm parameters are fine-tuned. For this analysis the Davies-Bouldin index was used and
the results are shown in Tab. III. This index can be used to evaluate the separation between the
clusters. It signifies the average similarity between clusters by comparing the distance between
clusters with the size of the clusters themselves. An index of 0 is the lowest possible score and
values closer to zero indicate a better partition.
   The Davies-Bouldin index does not fully correlate with the functional failure variance or
difference metrics. However, it follows the general trend and a lower index can be observed with
a higher number of clusters. Further, similar scores are obtained for k-Means and Agglomerative
Clustering, as also seen when comparing the functional failure metrics. The best index, the one
closest to zero, is also obtained for the best result, Mean Shift clustering with 210 clusters.

V. CONCLUSION AND FUTURE WORK
   This paper proposes a new methodology to group together flip-flops which are expected to have
similar contributions to the overall functional failure rate. The grouping is based on machine
learning clustering and uses a compact set of features combining attributes from static elements
and dynamic elements. The advantage in comparison to other existing approaches is that the
approach is more flexible and no assumption about the circuit or its representation is made.
Further, the number of clusters can be chosen by the user, which determines the effort needed for
the fault injection campaign.
   The effectiveness of the grouping by different machine learning clustering algorithms was
evaluated on a practical example and compared to an ideal solution. Good results were obtained by
choosing a number of clusters equal to 5 % of the total number of flip-flops in the circuit, and
results close to the ideal solution were obtained with a number of clusters corresponding to 10 %
to 20 % of the number of flip-flops. This would mean that the fault injection effort could be
reduced by a factor of 20×, 10× or 5×, respectively.
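
The relation between the number of clusters and the fault injection effort can be checked with a few lines of arithmetic. The sketch below takes the cluster counts from Tab. III and assumes that the campaign effort scales with the number of clusters rather than with the number of flip-flops. The total of about 1054 flip-flops is not stated in this section; it is inferred here from 53 clusters with a mean cluster size of 19.89 and should be treated as an assumption.

```python
# Cluster counts of the k-Means and Agglomerative rows in Tab. III.
cluster_counts = [53, 105, 211]
# Assumption: about 1054 flip-flops, inferred from 53 clusters x 19.89 mean cluster size.
total_flip_flops = 1054

reductions = []
for n_clusters in cluster_counts:
    # Share of the flip-flop count that the cluster count corresponds to.
    fraction = n_clusters / total_flip_flops
    # Assumed effort reduction: one campaign target per cluster instead of per flip-flop.
    reduction = total_flip_flops / n_clusters
    reductions.append(reduction)
    print(f"{n_clusters:4d} clusters = {fraction:5.1%} of flip-flops, about {reduction:.0f}x fewer targets")
```

Under these assumptions the reduction factor is simply the mean cluster size, so fewer and larger clusters save more effort at the cost of a coarser grouping.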
   Future work will focus on identifying new features to
improve the effectiveness of the clustering algorithms as well
as applying the techniques to a broader range of circuits.
Further, the approach could be extended by using features
which take physical design aspects into account, thus
reducing the effort needed to perform a complete failure
analysis on several levels.
                               REFERENCES
 [1] I. Polian, S. M. Reddy, and B. Becker, “Scalable Calculation of Logical
     Masking Effects for Selective Hardening Against Soft Errors,” in 2008
     IEEE Computer Society Annual Symposium on VLSI, Apr. 2008,
     pp. 257–262.
 [2] M. Maniatakos and Y. Makris, “Workload-driven selective hardening of
     control state elements in modern microprocessors,” in 2010 28th VLSI
     Test Symposium (VTS), Apr. 2010, pp. 159–164.
 [3] M. G. Valderas, M. P. Garcia, C. Lopez, and L. Entrena, “Extensive
     SEU Impact Analysis of a PIC Microprocessor for Selective Hardening,”
     IEEE Transactions on Nuclear Science, vol. 57, no. 4, pp. 1986–1991,
     Aug. 2010.
 [4] Y. Yu, B. Bastien, and B. W. Johnson, “A state of research review on
     fault injection techniques and a case study,” in Annual Reliability and
     Maintainability Symposium, 2005. Proceedings., Jan. 2005, pp. 386–392.
 [5] A. Evans, M. Nicolaidis, S. Wen, and T. Asis, “Clustering techniques and
     statistical fault injection for selective mitigation of SEUs in flip-flops,”
     in International Symposium on Quality Electronic Design (ISQED),
     Mar. 2013, pp. 727–732.
 [6] E. Alpaydin and F. Bach, Introduction to Machine Learning, ser.
     Adaptive Computation and Machine Learning Series. MIT Press, 2014.
 [7] I. Wali, B. Deveautour, A. Virazel, A. Bosio, P. Girard, and
     M. Sonza Reorda, “A Low-Cost Reliability vs. Cost Trade-Off Methodology
     to Selectively Harden Logic Circuits,” Journal of Electronic Testing,
     vol. 33, no. 1, pp. 25–36, Feb. 2017.
 [8] O. Ruano, J. A. Maestro, and P. Reviriego, “A Methodology for
     Automatic Insertion of Selective TMR in Digital Circuits Affected by
     SEUs,” IEEE Transactions on Nuclear Science, vol. 56, no. 4,
     pp. 2091–2102, Aug. 2009.
 [9] P. K. Samudrala, J. Ramos, and S. Katkoori, “Selective Triple Modular
     Redundancy (STMR) Based Single-Event Upset (SEU) Tolerant Synthesis
     for FPGAs,” IEEE Transactions on Nuclear Science, vol. 51, no. 5,
     pp. 2957–2969, Oct. 2004.
[10] T. Lange, A. Balakrishnan, M. Glorieux, D. Alexandrescu, and
     L. Sterpone, “Machine Learning to Tackle the Challenges of Transient and
     Soft Errors in Complex Circuits,” in 2019 IEEE 25th International
     Symposium on On-Line Testing and Robust System Design (IOLTS),
     Jul. 2019, pp. 7–14.
[11] A. Tanguay, “10GE MAC Core Specification,” Jan. 2013.
[12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
     O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg,
     J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and
     E. Duchesnay, “Scikit-learn: Machine Learning in Python,” Journal of
     Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
