RiskNet: Neural Risk Assessment in Networks of Unreliable Resources - arXiv

Page created by Sandra Hawkins
 
CONTINUE READING
1

                                                                                       Pattern Recognition Letters
                                                                                      journal homepage: www.elsevier.com

                                         RiskNet: Neural Risk Assessment in Networks of Unreliable Resources

                                         Krzysztof Ruseka,∗∗, Piotr Boryloa , Piotr Jaglarza , Fabien Geyerb , Albert Cabellosc , Piotr Choldaa
                                         a AGH   University of Science and Technology, Institute of Telecommunications, Al. Mickiewicza 30, 30-059 Kraków, Poland;
                                         b TechnicalUniversity of Munich, Department of Informatics, Boltzmannstr. 3, 85748 Garching bei München, Germany;
                                         c Barcelona Neural Networking, Universitat Politècnica de Catalunya, Barcelona, Spain
arXiv:2201.12263v1 [cs.NI] 28 Jan 2022

                                         Article history:
                                                                                      ABSTRACT

                                         Graph neural networks (GNNs),
                                                                                      We propose a graph neural network (GNN)-based method to predict the distri-
                                         message-passing neural network, re-          bution of penalties induced by outages in communication networks, where con-
                                         silience management, risk engineer-          nections are protected by resources shared between working and backup paths.
                                         ing, shared protection.                      The GNN-based algorithm is trained only with random graphs generated with
                                                                                      the Barabási–Albert model. Even though, the obtained test results show that we
                                                                                      can precisely model the penalties in a wide range of various existing topologies.
                                                                                      GNNs eliminate the need to simulate complex outage scenarios for the network
                                                                                      topologies under study. In practice, the whole design operation is limited by 4 ms
                                                                                      on modern hardware. This way, we can gain as much as over 12 000 times in the
                                                                                      speed improvement.
                                                                                                                              © 2022 Elsevier Ltd. All rights reserved.

                                         1. Introduction                                                            In communication networks, it is typically assumed
                                                                                                                that resilience to outages is based on the so-called pro-
                                              Applications of machine learning (ML) in the area of              tection, where each connection uses a precalculated pair
                                         communication networks1 (IP/MPLS-based, optical trans-                 of working and backup paths. The former is used be-
                                         port networks and the like) focus mainly on enabling data-             fore the outage and the latter serves afterwards to bypass
                                         driven self-management. The aim is to enable a network                 faulty components (routers, cross-connects, links, etc.).
                                         infrastructure to intelligently react in the presence of ran-          Resilience provisioning can be based on dedicated protec-
                                         dom and adversary events. We follow this path and con-                 tion, where backup resources are designated to be used in
                                         tribute with a concept of resilience-aware network design,             the case of faults in working paths and there is no danger of
                                         where randomness relates to failures and consecutive re-               their shortage. However, this method is extremely costly
                                         covery events. Here, we elaborate on provisioning an ef-               in term of infrastructure. Furthermore, the prediction of
                                         ficient method to predict the quality of resilience when               its behaviour is easy to model (we can assume indepen-
                                         complex recovery settings are provided.                                dence between connections). On the contrary, a practical
                                                                                                                method, yet complex in modelling, is based on the so-
                                           ?? Funding: This work was supported by the Polish Ministry of Sci-
                                                                                                                called shared backup path protection (SBPP). The notion
                                         ence and Higher Education with the subvention funds of the Faculty     of ‘sharing’ consists in having the same backup resource
                                         of Computer Science, Electronics and Telecommunications of AGH         pool for various connections, this makes the connections
                                         University and by the PL-Grid Infrastructure.                          dependent.
                                           ∗∗ Corresponding author.
                                                                                                                    From the application viewpoint, i.e., communication
                                              e-mail: krusek@agh.edu.pl (Krzysztof Rusek),
                                         piotr.borylo@agh.edu.pl (Piotr Borylo), pjaglarz@agh.edu.pl            network management and operations, this approach can
                                         (Piotr Jaglarz), fgeyer@net.in.tum.de (Fabien Geyer),                  incur the penalties imposed on a network operator due
                                         alberto.cabellos@upc.edu (Albert Cabellos),                            to outages. They happen when a backup path cannot be
                                         piotr.cholda@agh.edu.pl (Piotr Cholda)
                                            1 To avoid equivocation, we will denote ‘communication network’     established for a connection in the presence of resource
                                         as the ‘topology’. The word ‘network’ will be limited to ‘neural       shortage. Typically, a penalty is a monetary value propor-
                                         network’ only.
Rusek et al.: RiskNet: Neural Risk Assessment in Networks of Unreliable Resources                                             2

tional to the outage time experienced by a user due to a        on the total number of failures experienced or the sum of
given service level agreement (SLA). From a business point      squares of downtimes, etc. Such the relationship between
of view, this is a random expense that should be included       physical time and monetary units connects traditional re-
in the risk management process.                                 liability analysis with business-oriented risk analysis.
    In this paper, the penalties are treated as the main per-
formance metric to assess the quality of resilience and its     2.1. Related Work
financial aspect. The used modeling has to take into ac-             This research lays at the intersection of reliability anal-
count probabilistic estimates (similar to the traditionally     ysis, risk management and machine learning. Network re-
used availability, mean downtime, etc.), but also consid-       silience provisioning has been well studied [1]. However,
ers the amplitude of the outage impact in order to use an       less emphasis is put to approaches aligned with the busi-
indicator inspired by risk engineering. Since SBPP has          ness perspective. Here, we assume risk-based quantifica-
no known analytical results for those indicators (e.g. ex-      tion, where not only the probability of outages, but also
pected total downtime per year), the only solution is to        their impact is directly taken into account. This way, we
use time-consuming simulations. Due to the rare nature          are able to characterize resilience in a more informative
of failures, the simulation must be very long as accord-        way than with the usage of classical resilience measures
ing to the best authors’ knowledge, there is no rare event      (e.g., reliability or availability functions). In the commu-
simulation technique addressing SBPP.                           nications sector, risk has been associated with quantifica-
    The main contribution of this paper is proposition to       tion of deviations from the desired quality levels, includ-
pose the risk estimation as a regression problem defined        ing security [2]. From the mathematical viewpoint, the
on a bipartite meta-graph solved with the help of a graph       penalty is expressed on the basis of the risk theory dealing
neural network (GNN). The GNN maps reliability param-           with extreme events. Generally, the most important as-
eters of network’s components (e.g. links) to parameters        pect here is to estimate the whole distribution of the total
of a distribution family. Such a distribution is a univer-      penalty. Then, it is possible to quantify the penalty level
sal tool to determine various performance characteristics       with, for instance, the 95th percentile, known as value-
for any network topology. By ‘universal’, we mean that a        at-risk (VaR5 % ) [3]. However, a more robust measure is
single instance of the trained GNN can be effectively used      suggested. It is called the conditional VaR (CVaR 5% ).
with various network topologies, no matter what kind of         In this case, the average of all the penalties beyond the
topologies were used during the training process. This          assumed percentile is calculated. An important advan-
way, a high level of generalization is obtained, increasing     tage of CVaR is its property of subadditivity. It is helpful
the applicability potential of the proposed approach. We        since with it we can assume as if the penalties for the
have also eliminated the need for long simulations as they      individual connections were independent. Then, it is rea-
are needed only during the training phase. Inference is ex-     sonable to sum the penalties for individual connections,
tremely fast as since the GNN outputs the entire distribu-      since this sum is the upper bound for the total penalty
tion, we are not restricted to any particular risk measure-     in the network. Therefore, we can use the pessimistic ap-
this is left to the decision of risk analyst.                   proximation [4]. However, the common requirement for
    The next section presents the literature review to show     risk measures is the penalty distribution from which both
the background and emphasize the originality of our ap-         measures can be derived. In the networking content, such
proach. Section 3 theoretically elaborates on our GNN-          distribution depends on the reliability parameters of net-
based approach to the prediction of penalty levels. Next,       work components.
we devote Section 4 for illustration of the whole experi-            In the modeling of up- and down-times, the typical ap-
mental setup and for the presentation of the results and        proach is to assume that the failures arise due to a homo-
discussion, proving that our approach meets the practi-         geneous Poisson process [5]: the times between failures are
cal requirements. We conclude the paper and present the         exponentially distributed. It is statistically valid for many
ideas for the extension of the given concept in Section 5.      cases in communication networks [6], but has appeared to
                                                                be limited since other distributions for times between con-
2. Rationale and Related Work                                   secutive failures are also reported (e.g., Weibull distribu-
                                                                tion). The latter case considerably hinders the prediction
    The rationale for this work comes from the field of com-    of resilience parameters. Modeling of outage times is even
munication and computer networks, where data transmis-          more controversial. While the simplest approach also uses
sion must be protected against the potential impact of          exponential times, these times in real networks appear to
component failures. An operator that cannot provide re-         be log-normal [7] or Pareto-like [8]. Combing both main
liable transmission for its customers must pay fines ac-        fields it can be stated that machine learning algorithms
cording to the particular service level agreements. The fee     have also been used in the context of risk management. A
incurred in relation to a failure impact is calculated ac-      general framework for risk assessment using DL was pro-
cording to the so-called compensation policy. Usually, it       posed in [9] and validated in drive-off scenario involving an
is based on the total downtime averaged over a period such      Oil & Gas drilling rig. Authors provided comprehensive
as a year. Other possible compensation policies are based       analysis and indicated several limitations of deep neural
Rusek et al.: RiskNet: Neural Risk Assessment in Networks of Unreliable Resources                                           3

networks (DNNs) in terms of risk assessment. Moreover,           be carried out with the help of resilient connections and
DNNs were also successfully used for risk management in          this fact is reflected in service level agreements (SLAs). In
customs [10] or financial market [11]. The most promi-           essence, we identify a connection with its business-oriented
nent communication network models related to resilience          description given by the SLA.
are based on Markov theory. However, this type of mod-               In general, such properties (features) of both compo-
elling is not approachable and does not provide general and      nents and SLAs are denoted by vectors of data: xci and
concise formulas for realistic cases, such as SBPP, where        xsk , respectively. Vector xci ∈ R4+ contains resilience pa-
up and down-times do not follow simplistic assumptions           rameters (parameters of up- and down-time distributions)
of exponentiality. Due to the limitations of the modelling       and design parameters (i.e., backup bandwidth reserved
approaches, we decided to base on graph neural networks          in links) of individual component ci . On the other hand,
(GNNs) [12]. The additional advantage of this approach           xsk ∈ R1+ contains SLA’s parameters, in our case the de-
is that it allows us to build a solution independent from a      mand volume for connection sk . Both features xsk , xck
network topology and any particular routing scheme (by           and the routing S are jointly denoted as x = (xsk , xck , S).
‘routing’ we mean any function mapping every SLA (con-
nection) to the a pair of working/backup paths, i.e., sets       3.1. Approximate Penalty Distribution
of components (links))                                               The ultimate goal in risk analysis of communication
    GNNs have been already used for computer networks,           networks is the evaluation of the conditional distribution
for instance by Geyer et al. [13]. The fundamental dif-          P = P(Y |x) and in particular its quantiles to obtain VaR
ference when comparing our work with this paper is that          or its derivatives. For simple protection schemes (e.g.,
we consider routing paths as an input to the GNN-based           dedicated protection) with the additional assumption of
algorithm, while [13] aim at finding routing paths accord-       exponential up- and down-time distributions, P can be
ing to the specified policies. Additionally, Geyer et al. do     obtained analytically [16]. According to the best authors
not address the resilience aspect in the sense considered in     knowledge, there are no analytical results for more realistic
our work. Namely, we are mainly focused on limitations           scenarios of shared protection procedures and non-Poisson
in bandwidth of backup resources in SBPP. The other pa-          downtimes. Having said that, we point out that it is rela-
pers of the same research group (e.g., [14]) also differ with    tively simple to sample from P by simulation. In principle,
our work significantly: while they focus on control plane        one can estimate the risk measures with a Monte-Carlo
faults, we put emphasis on a topology-agnostic solution          method. However, the simulations take a lot of time to
for the resilience of the data plane based on the business-      obtain reliable estimates due to the rare nature of the fail-
oriented approach. Authors of [15] also focus on network         ures. The method proposed in this paper is to approximate
performance by using GNN to extract features. They are           P by a surrogate distribution Q = nn(x), where nn is a
used to calculate the routing that maximizes of bandwidth        neural network mapping topology properties to the family
utilization. Resilience is also indirectly addressed, as the     of parametric distributions. The network consists of its
proposed solution is able to cope with router and link fail-     own parameters (weights) θ learned from the simulations.
ures. Despite some similarities, our approach to resilience          The training objective for nn used in this study is to
is more business-oriented (due to adoption of the risk ap-       minimize the Kullback–Leibler divergence DKL (P k Q)
proach) and not bounded to any particular method of gen-         between distributions. In particular, for given network
erating working/backup paths.                                    parameters and simulated penalty y ∈ Rns we use Monte-
                                                                 Carlo approximation to the true KL divergence:
3. Methods                                                                 DKL (P k Q) ≈ log PP (y) − log PQ (y).         (1)
     The customers carry data on connections established         Since log PP (y) does not depend on θ, we can ignore this
in the network. We assume that the network topology is           term and use the negative log-likelihood function of the
represented by the following: (a) Set C = {ci }i=1:nc of         surrogate distribution as the loss function. With the ad-
basic communication components ci (links) prone to fail-         ditional assumption of conditional independence of the
ure. Only the link bandwidth can limit the shared pro-           SLAs, the loss function simplifies considerably to
tection capabilities. We assume realistic distributions of
up- and down-times of links and treat them as the re-                                      1 X
                                                                            `(θ, x, y) = −       log PQ (yk |x, θ) ,   (2)
silience attributes of these components. For the sake of                                   ns
                                                                                               k
this study, the routers are treated as fully reliable. (b) Set
S = {sk }k=1:ns of connections sk between pairs of end-              Due to its generalization potential, we applied the Stu-
points (routers) in the network topology. The routing for        dent t-distributions as the parametric family. As used in
each SLA                                                         many cases in statistical modeling (e.g., robust regression),
           is defined as a pair of sets of components sk =      we assume five degrees of freedom. This is a justified ap-
 skp , skb , where skp is the set of components in the work-
                                                                 proach ensuring a proper heavy tail and is convenient for
ing path and skb contains components in the backup path
                                                                 the training of neural networks (i.e., incurres a relatively
prepared for the k-th SLA. The connections are charac-
                                                                 simple likelihood function). Additionally, it ensures the ex-
terized by their demand volumes. These demands should
                                                                 istence of the mean value and variance. This way, we still
Rusek et al.: RiskNet: Neural Risk Assessment in Networks of Unreliable Resources                                                                                                      4
                     Topology of a network with SLAs (i.e., connections);
                               typically not a bipartite graph
                                                                            Bipartite meta-graph of relationships
                                                                                 between components and
                                                                                                                            in the vectors associated with the component and SLA-
                                                                                                                            related nodes, denoted as htc (for component c) and hts
                                                                                        connections

  Connection (SLA)                                                                                                          (for SLA s), respectively. These vectors, known as inter-
   Working                             c
                                           1
                                                                              c
                                                                                  1
                                                                                                                    SLA s
                                                                                                                            nal (hidden) states, are found as one of the results of our
    path
                                               c
                                                   3                          c
                                                                                  2
                                                                                                                        A
                                                                                                                            algorithm operation. The forward pass begins with zero-
   Backup
    path                               c                                      c
                                                                                                                    SLA s
                                                                                                                        B   padded components and SLA feature vectors, followed by
                                           2                                      3

                                                                SLA s                                      SLAs
                                                                                                                            an iterative exchange of messages and state updates. In
                                                                     B            Components           (connections)
                                                                                                                            particular, the result of embedding every edge in the meta-
                     SLA s
                          A
                                                                                                                            graph gives vectors m̃tr,u→z , where t represents the itera-
                                                                                                                            tion; r ∈ {p, b} represents the message related either to
                                                                                                                            working (p) or backup (b) path, and u, z ∈ {c, s} rep-
Fig. 1. Transformation of the given topology onto the asso-                                                                 resent in which direction the message is passed (s → c
ciated bipartite metagraph for GNN
                                                                                                                            denotes SLA to component message, while c → s denote
                                                                                                                            the message in the opposite way). The message is cal-
have a bell-shaped distribution with an analytical PDF                                                                      culated as an output of a neural network represented as
for efficient training of the nn. Other candidate distribu-                                                                 m̃t+1          t
                                                                                                                               r,u→z = Mr,u→z (hu , hz ) (the abovementioned notation
tions involve normal and log-normal distributions. During                                                                   related to m̃tr,u→z is again valid). These neural networks
the initial calculations, the normal distribution (suggested                                                                are called message functions. We can see that the working
by the Central Limit Theorem) gave us similar results to                                                                    and backup paths have different message functions, and
t-distribution, upon inspection of a few simulations we ob-                                                                 they can deal with two directions. Therefore, we have
                                                                                                                                                                 t         t        t
served that t-distribution matches the tail more accurately.                                                                four message functions used (Mp,s→c       , Mb,s→c , Mp,c→s ,
                                                                                                                               t
On the other hand, log-normal distribution has a too heavy                                                                  Mb,c→s ). These functions can be also different in various
tail, and in this case, it is sufficiently accurate only for a                                                              iterations (that is, why we also use the superscript t for
smaller number of failures. The bell-shaped distribution                                                                    them). To calculate the change of a hidden state, we use
is suitable for our simulation setup, as we always observe                                                                  neural networks denoted as Uct and Ust . These update func-
a few failures a year in a communication topology. In the                                                                   tions encode the combined incoming      information into the
case of an extremely resilient topology with the possibil-                                                                  hidden state: ht+1v   = Uvt htv , mt+1
                                                                                                                                                               v    ,where v ∈ {s, c} and
ity of no-failures, a zero-inflated log-normal distribution is                                                              mt+1
                                                                                                                               v   is the sum of the embedding vectors for all adjacent
recommended [17], as it can model both the probability of                                                                   edges in the metagraph.
no failure and the cost distribution given the failure has                                                                       The whole process is typically convergent after a few
occurred.                                                                                                                   (T ) iterations. This parameter controls the range of inter-
    The architecture of nn can be as simple as MLP, how-                                                                    actions between SLAs, e.g., for T = 1 each SLA receives
ever, since there is no particular order of the SLAs nor the                                                                messages only from its components and the model is ba-
components, we additionally require model equivariance                                                                      sically a DeepSet [18]. Higher T allows for information
under permutations. This makes a Graph Neural Network                                                                       to be exchanged between SLAs trough components. The
(GNN) a perfect candidate for nn.                                                                                           predicted penalty related to an SLA is found using a small
                                                                                                                            readout neural network (represented as F ) applied to the
3.2. GNN                                                                                                                    final hidden state of the SLA. Its output represents the
    The association between the set of components and                                                                       parameters of the penalty distribution for this particular
the set of SLAs can be represented as a bipartite meta-                                                                     SLA. In terms of particular neural architectures we used
graph (for simple illustration, see Fig. 1). Both compo-                                                                    affine functions for message propagation (M ) and GRU
nents and SLAs can be represented as nodes2 of the asso-                                                                    units for the update (U ). These are the proven units used
ciation graph, with edges connecting an SLA and a com-                                                                      in many GNN architectures. They are selected as a trade-
ponent if and only if the latter is used in the path for the                                                                off between simplicity, performance, and flexibility of the
given SLA. There are two kinds of edges in the metagraph:                                                                   model. Similarly to the previously proposed models, the
one representing a working path (i.e., that a component is                                                                  weights of message and update functions are reused for
used in SLA’s working path), and the other for the backup                                                                   subsequent message-passing iterations. The readout func-
path.                                                                                                                       tion proposed in this paper is a multilayer perceptron with
    Our main idea is to run a heterogeneous GNN named                                                                       the SeLU activation function. The mapping from the raw
RiskNet on the presented bipartite metagraph to get the                                                                     readout output to Student t-distribution’s location param-
surrogate penalty distribution. We base on a version of                                                                     eter is the identity; however, the scale must be constrained
GNNs known as message-passing neural networks. Such                                                                         to positive numbers, so we use the softplus function.
a network embeds the knowledge of internal relationships
                                                                                                                            3.3. Simulation
                                                                                                                                All experiments reported in this paper are supported
   2 Toavoid misunderstanding, the vertices of the bipartite meta-                                                          by the discrete event risk simulator previously used and
graph are denoted as ‘nodes’, while the vertices of the studied com-                                                        verified by comparison with the theoretical results in [16]
munication network topology are called ‘routers’.
Rusek et al.: RiskNet: Neural Risk Assessment in Networks of Unreliable Resources                                           5

in series of unit tests. The software is written in C++. The     4.1. Training and hyperparameters
simulator is treated as the source of ground truth and it            During the training phase, first we select different net-
is used for the creation of training datasets for neural net-    work topologies with randomly generated parameters (train-
work (GNN). A network topology is the starting point for         ing samples). They fed the GNN and a discrete-event sim-
building the configuration. In the experiments, the topolo-      ulator and the result is stored for offline training. Second,
gies under study are either the ones based on the random         the message-passing is run in GNN to obtain the conver-
Barabási–Albert model (with the output power distribu-          gence and provide prediction of the total penalties ŷ. The
tion of node degrees) or the existing topologies retrieved       predicted penalties are confronted with y, i.e., the ground
from the SNDLib library (http://sndlib.zib.de). Ac-              truth (real penalty levels) provided by the simulator. On
cording to the random model, every new router is attached        this basis, the learning error (loss) is calculated and the
to at least two existing ones. This method always makes it       classical backpropagation algorithm is used to update the
possible to construct the working and backup path for any        internal weights of the neural networks forming GNN. As
connection between a pair of routers. Afterwards, for a se-      concerns the inference, the output from the prediction
lected topology, first we generate the working and backup        model follows the same path as in the case of the training.
path for every connection (SLA) between all routers. For         The only exception to when one wants to use dropout to
the training dataset creation, we do not use a typical ap-       estimate the uncertainty of the prediction. Then, the out-
proach of looking for 2-shortest candidate paths. Instead,       put is equal to the average of multiple stochastic forward
we choose the router-disjoint path randomization from the        passes.
set of all disjoint paths found for all connections. Namely,         In the spirit of the modern deep learning approach, we
we set the probability of drawing a given path as decreas-       used the raw simulator parameters as the input to RiskNet
ing in the function of a path’s length (e.g., a number of        and let the model learn a meaningful internal representa-
components on the path). We select a parameter ξ and             tion. The SLA feature contains the demand volume only.
then draw paths with probability pk ∼ e−ξ|sk | . This value      The components’ feature vector has four dimensions (fail-
is normalized. Then, for small values of ξ, the distribution     ure intensity, α and β parameters of the Pareto down-
is flat and long paths have a high probability to be selected,   time distribution, and the capacity reserved for protec-
while for large ξ, we have mainly the shortest paths, as pk      tion). The only transformation applied to the data is the
drops quickly with the length. For the training phase, we        z-score normalization, as it improves the training process.
use ξ = 0.1. This approach allows us to explore the con-         We do not consider routing as a feature, but rather as
figuration space and gives the GNN a highly diverged set         metadata.
of training samples. Demand volume of a connection is                Training of a deep neural network typically requires
proportional to the product of the sizes of source and sink      hundreds of thousands of samples. Learning from sim-
routers for the connection. The size of a router depends         ulations makes it easier to obtain samples; however, we
on its degree d and is uniformly distributed in the inter-       are still limited by the training and simulation time. Our
val 10 × (d ± 1). At the end, the resilience parameters          training set is constructed from a simulation of 1000 ran-
of different components are generated. Here, we assume           dom network topologies (generated according to Barabási–
Poissonian failures (i.e., exponential up-times) and Pareto-     Albert model) with a number of routers uniformly dis-
distributed down-times. Both are parametrized according          tributed in the range [10, 40]. Every topology was simu-
to the estimates reported in [8]. Link lengths are obtained      lated for up to 1000 years of operation. Since for some of
from the scaled spring layout of the network topology. The       the largest networks not all simulations finished in the as-
output of the simulation is the total penalty for each SLA       sumed time constraint, we ended up with around 829 000
in every simulated year of operation. This value is used as      training examples. Using the same procedure, we gener-
a target (label value y) in the process of GNN training.         ated additional 20 000 test samples to spot signs of overfit-
                                                                 ting. With this training set, we tested multiple configura-
4. Numerical Results                                             tions of the RiskNet mostly differentiated by the hyperpa-
                                                                 rameter values. On this basis and according to our prior
    To simplify, we can say that we use a type of supervised     knowledge about GNNs, we chose the final configuration.
learning (heteroscedastic regression), since we are able to      Both hidden vectors have 32 dimensions. The message
provide the real values (ground truth) of the penalties and      has 64 dimensions. The kernel of the affine message func-
confront them with the predicted values to train the model.      tion is regularized with the coefficient 0.01, and the bias is
The real values are provided by simulations for specific         not regularized. The message passing loop is iterated six
configurations. On the other hand, simulations take a lot        times. Finally, the readout function has three hidden lay-
of time, that is why we can afford for them only in the          ers of sizes (64, 64, 32) interleaved with two dropout layers
training phase. During operation of the whole system, a          with dropout rates 0.2 and 0.1, respectively. The model
very fast GNN model predicts the values for given work-          characterized above was trained for 54 epochs of 12 900
ing/protection path settings.                                    iterations of the Adam optimizer on a batch of 64 topolo-
                                                                 gies. The learning rate was set to 0.0001 for the first 20
                                                                 epochs. Then it was decayed by 0.99 per epoch. The whole
Rusek et al.: RiskNet: Neural Risk Assessment in Networks of Unreliable Resources                                               6

                                                                   theory. The difference between log-likelihoods is a mea-
               Table 1. GNN evaluation loss
  Topology          GNN T = 6   Baseline    GNN T = 1
                                                                   sure of information the model has learned. By changing
                                                                   the base of the logarithm to 2, we can express this infor-
  train                      -1.37           –          -1.34
  test                       -1.43           –          -1.38
                                                                   mation in bits. In particular, for the validation set, the
  validation                 -1.42        1.10          -1.39      RiskNet system reduces the description length on average
  Average SNDLib             -0.88        1.62          -0.74      by 3.6 bits per path (in comparison to the baseline). For
  dfn-bwin                    0.73        4.09           1.19      reference, the entropy of the baseline distribution is 1.6
  abilene                    -1.67        0.87          -1.62
  nobel-germany              -1.37        0.88          -1.32
                                                                   bits.
  cost266                    -1.32        1.00          -1.25          Despite the fact that the negative log-likelihood ap-
  geant                      -1.38        1.01          -1.33      plied as a loss function is a theoretically justified measure
  nobel-eu                   -1.32        0.97          -1.20      used for parameter estimation, it is difficult to state how
  janos-us                   -1.14        1.11          -1.04
                                                                   accurate the fit is by reasoning on the basis of a single
                                                                   value. Therefore, to provide a more intuitive measure,
                                                                   we propose to use probability plots (pp-plots, see Fig. 2)
training took 11 hours 40 min with T = 6.
                                                                   produced according to the following procedure. For every
    The model was implemented entirely in TensorFlow.
                                                                   network configuration x, RiskNet predicts the whole dis-
The hardware supporting calculations embraces 36 cores
                                                                   tribution of penalties Q. The ground truth y is obtained
Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz, while the
                                                                   from the simulation. Given some probability value q, we
graphics processor unit used is GPU Tesla V100 SXM2. As
                                                                   construct a Bernoulli random variable 1y
Rusek et al.: RiskNet: Neural Risk Assessment in Networks of Unreliable Resources                                                                                                         7
                  net = validation              net = SNDLib average                     net = geant                by using an ML-based module, providing very good re-
     1.0
                                                                                                                    sults in comparison to the baseline case; (c) replace time-
     0.8
                                                                                                                    consuming simulations, being an intuitive alternative to
     0.6
                                                                                                                    our proposal, by a very fast prediction method used — to
q̂

     0.4

     0.2                                                                                                            can be applied by a network designer during connection
     0.0                                                                                                            optimization.
                  net = janos-us                     net = dfn-bwin                     net = abilene

     1.0                                                                                                            References
     0.8

     0.6                                                                                                             [1] D. A. Schupke, Multilayer and multidomain resilience in optical
q̂

     0.4                                                                                                                 networks, Proceedings of the IEEE 100 (5) (2012) 1140–1148.
     0.2                                                                                                             [2] A. Teixeira, K. C. Sou, H. Sandberg, K. H. Johansson, Secure
     0.0                                                                                                                 control systems: A quantitative risk management approach,
                                                                                 0.00   0.25   0.50   0.75   1.00
                                                                                                                         IEEE Control Systems Magazine 35 (1) (2015) 24–45.
                   net = cost266                 net = nobel-germany                            q                    [3] J. Wang, A. Chaudhury, H. R. Rao, A Value-at-Risk approach
     1.0                                                                                                                 to information security investment, Info. Sys. Research 19 (1)
     0.8                                                                           model                                 (2008) 106–120.
     0.6
                                                                                         GNN                         [4] M. B. Righi, P. S. Ceretta, A comparison of expected short-
                                                                                         Baseline
                                                                                                                         fall estimation models, Journal of Economics and Business 78
q̂

     0.4

     0.2
                                                                                                                         (2015) 14 – 47.
     0.0
                                                                                                                     [5] W. Ahmad, O. Hasan, U. Pervez, J. Qadir, Reliability modeling
                                                                                                                         and analysis of communication networks, Journal of Network
           0.00   0.25   0.50   0.75   1.00   0.00   0.25   0.50   0.75   1.00
                                                                                                                         and Computer Applications 78 (2017) 191 – 215.
                          q                                  q
                                                                                                                     [6] A. J. Gonzalez, B. E. Helvik, J. K. Hellan, P. Kuusela, Analysis
                                                                                                                         of dependencies between failures in the UNINETT IP backbone
Fig. 2. Probability plots for topologies simulated for 1000                                                              network, in: 2010 IEEE 16th Pacific Rim International Sympo-
years of operation                                                                                                       sium on Dependable Computing, 2010, pp. 149–156.
                                                                                                                     [7] P. Garraghan, I. S. Moreno, P. Townend, J. Xu, An analysis
                                                                                                                         of failure-related energy waste in a large-scale cloud environ-
                                                                                                                         ment, IEEE Transactions on Emerging Topics in Computing
model3 (see Table 1) without any interactions and whose
                                                                                                                         2 (2) (2014) 166–180.
test loss quickly begins to grow during the training. The                                                            [8] P. Kuusela, I. Norros, On/Off process modeling of IP network
similar behaviour was observed in other weekly interact-                                                                 failures, in: 2010 IEEE/IFIP International Conference on De-
ing models — all of them suffered from overfitting. We                                                                   pendable Systems Networks (DSN), 2010, pp. 585–594.
                                                                                                                     [9] N. Paltrinieri, L. Comfort, G. Reniers, Learning about risk:
conclude that a high value of T acts as a regularizer for                                                                Machine learning for risk assessment, Safety Science 118 (2019)
the model. We explain this by the fact that networks in                                                                  475 – 486.
the simulation were highly reliable and most of the con-                                                            [10] R. H. Regmi, A. K. Timalsina, Risk management in customs
tributions to the penalty were due to a single failure only.                                                             using deep neural network, in: 2018 IEEE 3rd International
                                                                                                                         Conference on Computing, Communication and Security (IC-
Having said that, we emphasize that in general SLAs must                                                                 CCS), 2018, pp. 133–137.
exchange information since, by definition, they do interact                                                         [11] S. K. Chandrinos, G. Sakkas, N. D. Lagaros, AIRMS: A risk
in the case of shared protection.                                                                                        management tool using machine learning, Expert Systems with
                                                                                                                         Applications 105 (2018) 34 – 48.
                                                                                                                    [12] L. Ruiz, F. Gama, A. Ribeiro, Graph neural networks: Architec-
5. Conclusions                                                                                                           tures, stability and transferability (2020). arXiv:2008.01767.
                                                                                                                    [13] F. Geyer, DeepComNet: Performance evaluation of network
                                                                                                                         topologies using graph-based deep learning, Performance Eval-
    In this letter, we show that a GNN can parametrize
                                                                                                                         uation 130 (2019) 1–16.
the t-Student distribution to approximate the penalty cost                                                          [14] F. Geyer, S. Bondorf, DeepTMA: Predicting effective contention
distribution due to component failures in a network. Even                                                                models for network calculus using graph neural networks, in:
though this work is derived and motivated by our experi-                                                                 IEEE INFOCOM 2019 — IEEE Conference on Computer Com-
                                                                                                                         munications, 2019, pp. 1009–1017.
ence in the field of communication and computer networks,                                                           [15] K. Sawada, D. Kotani, Y. Okabe, Network routing optimiza-
it generalizes well beyond this area. Similar concepts of                                                                tion based on machine learning using graph networks robust
network or unreliable components arise in logistics and                                                                  against topology change, in: 2020 International Conference on
other areas of business importance. Since the penalty un-                                                                Information Networking (ICOIN), 2020, pp. 608–615.
                                                                                                                    [16] K. Rusek, P. Guzik, P. Cholda, Effective risk assessment in
der consideration is based on downtimes, we expect this                                                                  resilient communication networks, Journal of Network and Sys-
work to be extendable to the downtime related quantities.                                                                tems Management 24 (3) (2016) 491–515.
    Our approach allows us to (a) omit very complex, time-                                                          [17] A. McDavid, R. Gottardo, N. Simon, M. Drton, Graphical mod-
consuming, and ineffective modeling of resilience param-                                                                 els for zero-inflated single cell gene expression, Ann. Appl. Stat.
                                                                                                                         13 (2) (2019) 848–873.
eters for shared protection; (b) improve practically use-                                                           [18] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R.
ful prediction quality of business-oriented risk parameters                                                              Salakhutdinov, A. J. Smola, Deep sets, in: I. Guyon, U. V.
                                                                                                                         Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
                                                                                                                         R. Garnett (Eds.), Advances in Neural Information Processing
     3 The
                                                                                                                         Systems, Vol. 30, Curran Associates, Inc., 2017.
              information difference between models is 0.04 bit per SLA.
You can also read