Self-Healing and Resilience in Future 5G Cognitive Autonomous Networks - J. Ali-Tolppa, S. Kocsis, B. Schultz, L. Bodrog, M. Kajo Nokia Bell Labs ...
Self-Healing and Resilience in Future 5G Cognitive Autonomous Networks J. Ali-Tolppa, S. Kocsis, B. Schultz, L. Bodrog, M. Kajo Nokia Bell Labs janne.ali-tolppa@nokia-bell-labs.com 26-28 November Santa Fe, Argentina
Robustness
• “Capability of performing without failure under a wide range of conditions” Merriam-Webster Dictionary
Resilience
• “An ability to recover from or adjust easily to misfortune or change” Merriam-Webster Dictionary

Why is resiliency important in 5G?
• 5G is by nature dynamic and complex → Unforeseen circumstances are bound to happen
• Use cases requiring ultra-high reliability (URLLC)
Robustness alone (redundancy etc.) is no longer enough!

How to design for resilience?
• Monitor and adapt (Focus)
• Decoupling, modularity
• Common core principles

Self-Healing in Radio Access Networks

Detecting Anomalies without Labelled Training Data
Which are anomalous?
[Figure: Examples 1 to 4; labels: Meaningless question, Red, Green]

Anomaly Detection
Feature selection
[Figure: Relevant feature: Color; Shape; Color and shape]

Radio Access Network Anomaly Detection
Feature and context selection
• Input features typically include Performance Management (PM) Key Performance Indicators (KPIs) and Fault Management (FM) alarms, but other (additional) inputs can be used as well, e.g. log analysis
• Is the whole input space profiled, including cross-correlations, or only selected projections of it (single KPIs, selected KPI pairs etc.)?
• The context needs to be decided, e.g. will the profiling be done per network function or per group of network functions, with hourly or diurnal profiles for network-traffic-dependent KPIs etc.
• In our work we used PM data only and created diurnal profiles for traffic-dependent KPIs and cross-correlations for selected KPI pairs.

Radio Access Network Anomaly Detection
Simple time-context-dependent profiling of a time series

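One way to picture time-context-dependent profiling is a per-hour-of-day profile of a single KPI. The sketch below is only an illustration under stated assumptions: the slides do not give the profile statistics or the scoring rule, so the mean and standard deviation per hour, the z-score style anomaly level, the function names and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_diurnal_profile(samples):
    """Build a per-hour profile from (hour_of_day, kpi_value) samples.

    Assumption: the samples come from a period considered normal.
    Returns {hour: (mean, standard deviation)}.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def anomaly_level(profile, hour, value, min_std=1e-9):
    """Distance of a value from the profile of its hour, in standard deviations."""
    mu, sigma = profile[hour]
    return abs(value - mu) / max(sigma, min_std)


# Hypothetical traffic-dependent KPI: low at 03:00, high at 18:00.
training = [(3, 10.0), (3, 12.0), (3, 11.0), (18, 90.0), (18, 100.0), (18, 95.0)]
profile = build_diurnal_profile(training)

# The same value is normal in the evening but highly anomalous at night.
evening_level = anomaly_level(profile, 18, 95.0)
night_level = anomaly_level(profile, 3, 95.0)
```

The point of the time context is visible in the last two lines: a single global threshold on the KPI would treat both observations identically.
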
Radio Access Network Anomaly Detection
Cross-correlation profiling with clustering
• First a clustering algorithm is applied, which omits the most probable outliers to clean up the data
• Correlation is modelled only inside the clusters
• Can also model non-normal multivariate distributions

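The idea of clustering first and modelling the correlation only inside each cluster can be sketched as follows. The slides name neither the clustering algorithm nor the correlation model, so everything concrete here is an assumption: plain k-means with given starting centers, trimming the points farthest from each center as probable outliers, a least-squares line per cluster, and the hypothetical KPI pair in the example data.

```python
from math import dist
from statistics import mean, pstdev


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda c: dist(point, centers[c]))


def kmeans(points, centers, iterations=20):
    """Plain k-means on 2-D points, starting from the given centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        centers = [
            (mean([x for x, _ in g]), mean([y for _, y in g])) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


def fit_line(members):
    """Least-squares line y = slope * x + intercept, plus residual spread."""
    mx = mean([x for x, _ in members])
    my = mean([y for _, y in members])
    sxx = sum((x - mx) ** 2 for x, _ in members)
    slope = sum((x - mx) * (y - my) for x, y in members) / sxx if sxx else 0.0
    intercept = my - slope * mx
    spread = pstdev([y - (slope * x + intercept) for x, y in members])
    return slope, intercept, spread


def build_profile(points, initial_centers, trim=0.1):
    """Cluster, drop the farthest `trim` share per cluster, fit a line in each."""
    centers = kmeans(points, initial_centers)
    groups = [[] for _ in centers]
    for p in points:
        groups[nearest(p, centers)].append(p)
    profile = []
    for center, members in zip(centers, groups):
        if len(members) < 2:
            continue  # too few points to model a correlation
        members.sort(key=lambda p: dist(p, center))
        kept = members[: max(2, int(len(members) * (1 - trim)))]
        profile.append((center, fit_line(kept)))
    return profile


def anomaly_level(profile, point, min_std=1e-9):
    """Residual against the line of the nearest cluster, in residual spreads."""
    _, (slope, intercept, spread) = min(profile, key=lambda e: dist(point, e[0]))
    x, y = point
    return abs(y - (slope * x + intercept)) / max(spread, min_std)


# Hypothetical KPI pair with two operating regimes and one training outlier.
low = [(x, 2 * x + (0.1 if x % 2 else -0.1)) for x in range(1, 11)]
high = [(x, 100 + 0.5 * (x - 50) + (0.1 if x % 2 else -0.1)) for x in range(50, 61)]
profile = build_profile(low + high + [(5, 30.0)], [(5.0, 10.0), (55.0, 100.0)])

normal_level = anomaly_level(profile, (4, 8.0))
broken_level = anomaly_level(profile, (4, 20.0))
```

Because each regime gets its own line, the joint distribution of the pair does not have to be a single normal blob; a single global correlation fitted across both regimes would describe neither of them well.
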
Diagnosis
Anomaly event detection and diagnosis
[Figure: time series of KPI1, KPI2 and KPI3 over time with an anomalous timeframe marked; the average values of KPI1, KPI2 and KPI3 in that timeframe form the anomaly pattern]
• Anomalous timeframes are detected by applying the DBSCAN algorithm to the anomaly levels of selected features against their profiles
• By aggregating the selected feature (KPI) values in the anomaly event timeframe, the event is represented as an anomaly pattern
– The diagnosis feature set can be, and often is, different from the one used in the detection!
• The root causes of the detected anomaly patterns are diagnosed against a diagnosis knowledge base

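The three steps above (timeframe detection with DBSCAN, aggregation into an anomaly pattern, matching against a knowledge base) can be sketched end to end. This is a minimal illustration, not the authors' implementation: running DBSCAN over the time indices whose anomaly level exceeds a threshold, averaging per KPI, nearest-pattern matching, and all names, parameters and data values are assumptions made for the example.

```python
from statistics import mean


def dbscan(points, eps, min_pts, distance):
    """Textbook DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point, keep expanding
                queue.extend(reachable)
    return labels


def find_timeframes(levels, threshold, eps=1, min_pts=3):
    """Group the time steps with a high anomaly level into (start, end) frames."""
    times = [t for t, level in enumerate(levels) if level > threshold]
    labels = dbscan(times, eps, min_pts, lambda a, b: abs(a - b))
    frames = {}
    for t, label in zip(times, labels):
        if label != -1:
            start, end = frames.get(label, (t, t))
            frames[label] = (min(start, t), max(end, t))
    return sorted(frames.values())


def anomaly_pattern(deviations, frame):
    """Average each KPI's deviation from its profile over the timeframe."""
    start, end = frame
    return {kpi: mean(series[start:end + 1]) for kpi, series in deviations.items()}


def diagnose(pattern, knowledge_base):
    """Return the knowledge-base label whose stored pattern is closest."""
    def gap(entry):
        return sum((pattern[k] - entry[k]) ** 2 for k in pattern)
    return min(knowledge_base, key=lambda cause: gap(knowledge_base[cause]))


# Hypothetical signed deviations of three KPIs from their profiles.
deviations = {
    "kpi1": [0, 0, 0, 0, 4, 5, 4, 5, 0, 0, 0, 0],
    "kpi2": [0, 0, 0, 0, -3, -4, -3, -4, 0, 0, 6, 0],
    "kpi3": [0] * 12,
}
levels = [max(abs(s[t]) for s in deviations.values()) for t in range(12)]

frames = find_timeframes(levels, threshold=3)
pattern = anomaly_pattern(deviations, frames[0])

# Hypothetical knowledge base; labels and patterns are placeholders.
knowledge_base = {
    "hypothetical fault A": {"kpi1": 4.0, "kpi2": -4.0, "kpi3": 0.0},
    "hypothetical fault B": {"kpi1": 0.0, "kpi2": 0.0, "kpi3": 5.0},
}
cause = diagnose(pattern, knowledge_base)
```

In this example the isolated spike at time step 10 is labelled as noise by DBSCAN and does not open an anomaly event, while the sustained deviation at steps 4 to 7 does. The pattern here reuses the detection KPIs only for brevity; as the slide notes, the diagnosis feature set may differ.
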
Diagnosis
Active learning assisted diagnosis
[Figure: loop between "interpretation of the data" and "structured view of the data"; labels: rethink, restructure]
a) A human operator provides the machine with his own interpretation of the data
– By attaching labels to anomaly points or clusters while considering information from step b)
b) The machine provides the operator with a structured view of the data
– By clustering the data points while taking into account information from step a)

Holistic Self-Healing
Across domains and management areas in mobile networks
“In a complex system, improving the resilience of only one part or level of organization can sometimes (unintentionally) introduce fragility in another. To improve the resilience, it is often necessary to work in more than one domain and scale at a time.”
- A. Zolli, A. M. Healy, “Resilience – Why Things Bounce Back”
Coordination is required between the self-healing actions of, for example:
• Network Management (NM): Management automation aggregated on a (Virtual) Network Function (V)NF level
• Quality of Experience (QoE) driven management: Optimizing the end-to-end customer experience at the application and individual subscriber level
• VNF and Service Orchestration

Knowledge Cloud
Transferring diagnosis knowledge
• Collecting the diagnosis knowledge base is a significant effort
• It would be desirable to be able to diagnose previously unforeseen problems
• Both issues could be mitigated by sharing diagnosis knowledge between self-healing function deployments
• However, translating, i.e. generalizing and re-applying, diagnosis knowledge from other deployments is a difficult problem
– Transfer learning methods

Demonstration in SON Experimental System

Evaluation Results
• Fault injection in a testbed
– Radio attenuation
– Backhaul misconfiguration
• Background traffic was increased until the optimization functions could no longer remedy the problem
• Our solution detected and diagnosed both conditions before they led to a service degradation

Conclusion
• In 5G, networks are becoming ever more complex and dynamic
• At the same time, new use cases are requiring increased reliability
• We need intelligent resilient networks that can react to unforeseen problems and adapt to changes in their context
• A step in this direction is the self-healing method presented in this paper, based on anomaly detection and diagnosis
• We need methods to share knowledge and coordinate the self-healing actions across management domains, areas and deployments
– Standardized interfaces not only for sharing data, but also for sharing knowledge and machine learning models

Thank you