Self-Healing and Resilience in Future 5G Cognitive Autonomous Networks - J. Ali-Tolppa, S. Kocsis, B. Schultz, L. Bodrog, M. Kajo Nokia Bell Labs ...

Self-Healing and Resilience in
Future 5G Cognitive Autonomous
Networks
J. Ali-Tolppa, S. Kocsis, B. Schultz, L. Bodrog, M. Kajo
Nokia Bell Labs
janne.ali-tolppa@nokia-bell-labs.com

26-28 November
Santa Fe, Argentina
Robustness
       •    “Capability of performing without failure under a wide range of conditions”
                                                                     Merriam-Webster Dictionary

Resilience
       •    “An ability to recover from or adjust easily to misfortune or change”
                                                                     Merriam-Webster Dictionary

Why is resiliency important in 5G?

 •    5G is by nature dynamic and complex
 → Unforeseen circumstances are bound to happen
 •    Use cases requiring ultra-high reliability (URLLC)

       Robustness (redundancy etc.) alone is no longer
       enough!

How to design for resilience?

 • Monitor and adapt        Focus

 • Decoupling,              Common core
   modularity                principles

Self-Healing in Radio Access Networks

Detecting Anomalies without Labelled Training Data
  Which are anomalous?

     Example 1         Example 2     Example 3         Example 4

Meaningless question     Red          Green

Anomaly Detection
 Feature selection

 Relevant feature:

           Color      Shape   Color and shape

Radio Access Network Anomaly Detection
 Feature and context selection

 •    Input features typically include Performance Management (PM)
      Key Performance Indicators (KPIs) and Fault Management (FM)
      alarms, but additional inputs, e.g. log analysis, can be used
      as well
 •    Is the whole input space profiled, including cross-correlations, or
      only selected projections of it (single KPIs, selected KPI pairs etc.)?
 •    The context needs to be decided, e.g. whether profiling is done per
      network function or per group of network functions, with hourly or
      diurnal profiles for network-traffic-dependent KPIs etc.
 •    In our work we used PM data only and created diurnal profiles for
      traffic-dependent KPIs and cross-correlations for selected KPI
      pairs.

Radio Access Network Anomaly Detection
 Simple time-context dependent profiling of a timeseries
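
 A minimal sketch of time-context dependent profiling, assuming the context
 is the hour of day and the anomaly level is an absolute z-score against the
 profile for that hour. The function names, the z-score choice and the sample
 values are illustrative assumptions, not the method used in the presented work.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_diurnal_profile(samples):
    """Build {hour: (mean, std)} from (hour_of_day, kpi_value) samples."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour % 24].append(value)
    return {h: (mean(v), pstdev(v)) for h, v in by_hour.items()}


def anomaly_level(profile, hour, value, min_std=1e-6):
    """Absolute z-score of a value against the profile of its hour."""
    mu, sigma = profile[hour % 24]
    return abs(value - mu) / max(sigma, min_std)


# Hypothetical history of a traffic-dependent KPI: quiet at 03:00, busy at 12:00.
history = [(3, 10.0), (3, 12.0), (3, 11.0),
           (12, 100.0), (12, 104.0), (12, 96.0)]
profile = build_diurnal_profile(history)

level_noon = anomaly_level(profile, 12, 100.0)   # typical for noon
level_night = anomaly_level(profile, 3, 100.0)   # same value, unusual at night
print(level_noon, level_night)
```

 The same KPI value scores very differently depending on the hour, which is
 the reason for profiling traffic-dependent KPIs per time context.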

Radio Access Network Anomaly Detection
 Cross-correlation profiling with clustering

 • First, a clustering algorithm is
   applied, which omits the
   most probable outliers to
   clean up the data
 • Correlation is modelled only
   inside the clusters
 • Can also model non-normal
   multivariate distributions
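
 The steps above can be sketched as follows. The slide does not name the
 clustering algorithm, so this sketch assumes a plain k-means with fixed
 initial centroids, drops the points farthest from each centroid as probable
 outliers, and models the correlation inside each cluster with a 2x2
 covariance matrix. The anomaly score is the Mahalanobis distance to the
 closest cluster. All names and data values are hypothetical.

```python
import math


def kmeans(points, init_centroids, iterations=20):
    """Plain Lloyd's k-means on 2-D points; returns (centroids, labels)."""
    centroids = list(init_centroids)

    def nearest(p):
        return min(range(len(centroids)),
                   key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iterations):
        labels = [nearest(p) for p in points]
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return centroids, [nearest(p) for p in points]


def fit_cluster_models(points, init_centroids, trim=0.1):
    """Per cluster: drop the farthest `trim` share, then fit mean/covariance."""
    centroids, labels = kmeans(points, init_centroids)
    models = []
    for c, center in enumerate(centroids):
        members = sorted((p for p, lab in zip(points, labels) if lab == c),
                         key=lambda p: math.dist(p, center))
        kept = members[:int(len(members) * (1 - trim))]
        if len(kept) < 3:
            continue  # too few points to estimate a covariance
        n = len(kept)
        mx = sum(x for x, _ in kept) / n
        my = sum(y for _, y in kept) / n
        sxx = sum((x - mx) ** 2 for x, _ in kept) / n
        syy = sum((y - my) ** 2 for _, y in kept) / n
        sxy = sum((x - mx) * (y - my) for x, y in kept) / n
        models.append((mx, my, sxx, syy, sxy))
    return models


def anomaly_score(models, point):
    """Mahalanobis distance of a KPI pair to the closest cluster model."""
    scores = []
    for mx, my, sxx, syy, sxy in models:
        dx, dy = point[0] - mx, point[1] - my
        det = sxx * syy - sxy ** 2
        if det <= 0:
            continue  # degenerate (perfectly collinear) cluster
        q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        scores.append(math.sqrt(max(q, 0.0)))
    return min(scores, default=math.inf)


def noise(i):
    """Small deterministic jitter so the example is reproducible."""
    return ((i * 7) % 5 - 2) * 0.3


# Hypothetical KPI pair with two operating regimes and one gross outlier.
regime_a = [(float(i), 2.0 * i + noise(i)) for i in range(20)]
regime_b = [(100.0 + i, 50.0 - i + noise(i)) for i in range(20)]
points = regime_a + regime_b + [(50.0, 200.0)]

models = fit_cluster_models(points, [(0.0, 0.0), (100.0, 50.0)])
on_trend = anomaly_score(models, (10.0, 20.0))   # follows regime A
off_trend = anomaly_score(models, (10.0, 0.0))   # breaks the correlation
print(on_trend, off_trend)
```

 The off-trend point lies within the individual value ranges of both KPIs, so
 single-KPI profiles would not flag it; only the correlation model does.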

Diagnosis
 Anomaly event detection and diagnosis
 [Figure: time series of KPI1, KPI2 and KPI3 with an anomalous timeframe marked;
  the average value of each KPI within the timeframe forms the anomaly pattern
  (KPI1, KPI2, KPI3)]

 •    Anomalous timeframes are detected by applying the DBSCAN algorithm to the
      anomaly levels of selected features against their profiles
 •    By aggregating the selected feature (KPI) values over the anomaly event
      timeframe, the event is represented as an anomaly pattern
       – The diagnosis feature set can be, and often is, different from the one used in
         detection!
 •    The root causes of the detected anomaly patterns are diagnosed against a
      diagnosis knowledge base
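
 The three steps above can be sketched as follows, under stated assumptions:
 DBSCAN is run on the timestamps of samples whose anomaly level exceeds a
 threshold, the pattern is the per-KPI mean over the event, and diagnosis
 picks the knowledge base entry with the smallest Euclidean distance. The
 threshold, the distance measure, the fault names and all values are
 hypothetical.

```python
import math


def dbscan_1d(values, eps, min_samples):
    """Minimal DBSCAN on scalars; returns one label per value (-1 = noise)."""
    labels = [None] * len(values)

    def neighbors(i):
        return [j for j, v in enumerate(values) if abs(v - values[i]) <= eps]

    cluster = -1
    for i in range(len(values)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1          # noise for now, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)
    return labels


def detect_events(anomaly_levels, threshold=3.0, eps=1, min_samples=3):
    """Return (start, end) sample indices of dense runs of high anomaly levels."""
    times = [t for t, level in enumerate(anomaly_levels) if level > threshold]
    events = {}
    for t, lab in zip(times, dbscan_1d(times, eps, min_samples)):
        if lab != -1:
            events.setdefault(lab, []).append(t)
    return [(min(ts), max(ts)) for ts in events.values()]


def anomaly_pattern(kpis, start, end):
    """Average each diagnosis KPI over the event timeframe."""
    return {name: sum(series[start:end + 1]) / (end - start + 1)
            for name, series in kpis.items()}


def diagnose(pattern, knowledge_base):
    """Return the known root cause whose pattern is closest to the event."""
    def distance(known):
        return math.sqrt(sum((pattern[k] - known[k]) ** 2 for k in known))
    return min(knowledge_base, key=lambda cause: distance(knowledge_base[cause]))


# Hypothetical anomaly levels: one isolated spike (t=2) and one sustained event.
levels = [0.5, 0.2, 4.0, 0.3, 0.1, 5.0, 6.0, 5.5, 4.2, 0.4, 0.2, 0.1]
kpis = {
    "KPI1": [1] * 5 + [5, 5, 5, 5] + [1] * 3,
    "KPI2": [2] * 12,
    "KPI3": [3] * 5 + [0, 0, 1, 1] + [3] * 3,
}
knowledge_base = {
    "fault_A": {"KPI1": 5, "KPI2": 2, "KPI3": 0},
    "fault_B": {"KPI1": 1, "KPI2": 6, "KPI3": 3},
}

events = detect_events(levels)
pattern = anomaly_pattern(kpis, *events[0])
cause = diagnose(pattern, knowledge_base)
print(events, pattern, cause)
```

 The isolated spike is treated as noise by the density criterion, so only the
 sustained run of anomalous samples becomes an event.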
Diagnosis
 Active learning assisted diagnosis

                                                            structured view of the data

                                  rethink                             loop                    restructure

                                                             interpretation of the data

   a)   A human operator provides the machine with his own interpretation of the data
          – By attaching labels to anomaly points or clusters while considering information from step b)

   b)   The machine provides the operator with a structured view of the data
          – By clustering the data points while taking into account information from step a)
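
   A minimal sketch of such a loop, assuming the machine groups anomaly patterns
   under the nearest centroid of the operator's labels and asks about the most
   ambiguous unlabelled pattern next (uncertainty sampling). Both choices, the
   function names and the data are assumptions made for illustration; the slides
   do not specify the clustering or query strategy.

```python
import math


def structured_view(patterns, labels):
    """Step b): assign each pattern to the nearest labelled centroid.

    labels maps pattern index -> label given by the operator so far.
    Returns {index: (label, margin)}; a small margin means an ambiguous point.
    """
    groups = {}
    for i, lab in labels.items():
        groups.setdefault(lab, []).append(patterns[i])
    centroids = {lab: tuple(sum(v) / len(ps) for v in zip(*ps))
                 for lab, ps in groups.items()}
    view = {}
    for i, p in enumerate(patterns):
        ranked = sorted(centroids, key=lambda lab: math.dist(p, centroids[lab]))
        if len(ranked) > 1:
            margin = (math.dist(p, centroids[ranked[1]])
                      - math.dist(p, centroids[ranked[0]]))
        else:
            margin = math.inf
        view[i] = (labels.get(i, ranked[0]), margin)
    return view


def next_query(view, labels):
    """Pick the unlabelled pattern with the smallest margin, if any."""
    unlabelled = [i for i in view if i not in labels]
    return min(unlabelled, key=lambda i: view[i][1]) if unlabelled else None


def active_learning_loop(patterns, labels, ask_operator, rounds=3):
    """Alternate between step b) (machine view) and step a) (operator label)."""
    labels = dict(labels)
    for _ in range(rounds):
        view = structured_view(patterns, labels)
        query = next_query(view, labels)
        if query is None:
            break
        labels[query] = ask_operator(query, view)
    return structured_view(patterns, labels), labels


# Hypothetical two-KPI anomaly patterns; the operator is simulated by a lookup.
patterns = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (2.4, 3.0)]
truth = {1: "fault_A", 3: "fault_B", 4: "fault_A"}
asked = []


def simulated_operator(index, view):
    asked.append(index)
    return truth[index]


view, labels = active_learning_loop(
    patterns, {0: "fault_A", 2: "fault_B"}, simulated_operator, rounds=1)
print(asked, labels, view)
```

   In this run the machine first groups the ambiguous pattern with fault_B, asks
   the operator about it, and restructures its view once the operator labels it
   as fault_A.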

Holistic Self-Healing
 Across domains and management areas in mobile networks

       In a complex system, improving the resilience of only one part or level of organization can sometimes
       (unintentionally) introduce fragility in another. To improve the resilience, it is often necessary to work
       in more than one domain and scale at a time. - A. Zolli, A. M. Healy, “Resilience – Why Things
       Bounce Back”

    Coordination is required between the self-healing actions of, for example:
    • Network Management (NM): Management automation aggregated on a (Virtual) Network Function (V)NF level
    • Quality of Experience (QoE) driven management: Optimizing the end-to-end customer experience at the
        application and individual subscriber level
    • VNF and Service Orchestration

Knowledge Cloud
 Transferring diagnosis knowledge

    •    Collecting the diagnosis knowledge base is a significant effort
    •    It would be desirable to be able to diagnose previously unforeseen problems
    •    Both could be addressed by sharing diagnosis knowledge between self-healing function deployments
    •    However, translating, i.e. generalizing and re-applying, diagnosis knowledge from other
         deployments is a difficult problem
           – Transfer learning methods

Demonstration in SON Experimental System

Evaluation Results

  • Fault injection in a testbed
      • Radio attenuation
      • Backhaul misconfiguration
  • Background traffic increased until the optimization functions can no longer
    remedy the problem
  • Our solution detected and diagnosed both conditions before they led to a
    service degradation

Conclusion
 • In 5G, networks are becoming ever more complex and dynamic
 • At the same time, new use cases are requiring increased reliability
 • We need intelligent resilient networks that can react to unforeseen
   problems and adapt to changes in their context
 • A step in this direction is the self-healing method presented in this
   paper, based on anomaly detection and diagnosis
 • We need methods to share knowledge and coordinate the self-
   healing actions across management domains, areas and
   deployments
       – Standardized interfaces not only for sharing data, but also for sharing
         knowledge and machine learning models

Thank you