Application failure predictions from neural networks analyzing telemetry data - Filip Hultgren Max Rylander
UPTEC IT 21016
Degree project, 30 credits, June 2021
Application failure predictions from neural networks analyzing telemetry data
Filip Hultgren, Max Rylander
Institutionen för informationsteknologi / Department of Information Technology
Abstract

Application failure predictions from neural networks analyzing telemetry data
Filip Hultgren, Max Rylander

With the revolution of the internet, new applications have emerged in our daily lives. People depend on services for transportation, bank matters, and communication. Service availability is crucial for survival and for competition against other service providers. Achieving good availability is a challenging task. The latest trend is migrating systems to the cloud. The cloud provides numerous methods to prevent downtime, such as auto-scaling, continuous deployment, continuous monitoring, and more. However, failures can still occur even though the preemptive techniques fulfill their purpose. Monitoring the system gives insights into the system's actual state, but it is up to the maintainer to interpret these insights. This thesis investigates how machine learning can predict future crashes of Kubernetes pods based on the metrics collected from them. At the start of the project, there was no available data on pod crashes, and the solution was to simulate a 10-tier microservice system in a Kubernetes cluster to create generic data. The project applies two different models, a Random Forest model and a Temporal Convolutional Networks model, where the former acted as a baseline model. They predict whether a failure will occur within a given prediction time window based upon 15 minutes of data. The project evaluated three different prediction time windows. The five-minute prediction time window resulted in the best foresight based on the models' accuracy. The Random Forest model achieved an accuracy of 73.4 %, while the TCN model achieved an accuracy of 77.7 %. The models' predictions can act as an early alert of an incoming failure, which the system or a maintainer can act upon to improve the availability of the system.

Supervisor: Johan Hernefeldt
Subject reader: Filip Malmberg
Examiner: Lars-Åke Nordén
ISSN: 1401-5749, UPTEC IT 21016
Printed by: Reprocentralen ITC
Sammanfattning

With the revolution of the internet, new applications emerge in our daily lives. People depend on services such as transportation, banking, and communication. The availability of these services is decisive for an application's survival and its struggle against competitors. Achieving good availability is a challenge. The latest trend is the migration of systems to the cloud. The cloud offers several methods for avoiding downtime, such as auto-scaling, continuous deployment, continuous monitoring, and more. However, faults in the system can occur regardless of whether the preventive techniques fulfill their purpose. Monitoring a system gives insight into the system's state, but it is up to the maintainer to interpret this information. This thesis investigates how machine learning can predict future Kubernetes pod crashes based on the monitoring metrics from each pod. No available data for pod crashes existed at the start of the project. The solution was to simulate a 10-tier microservice system in a Kubernetes cluster to generate generic data. This project applies two different models, a Random Forest model and a Temporal Convolutional Networks model, of which the former is a baseline model. The models predict whether a crash occurs within a time interval based on a 15-minute interval of data. This thesis evaluated three different prediction intervals. The 5-minute prediction interval resulted in the best forecast based on the models' precision. The Random Forest model achieved an accuracy of 73.4 %, while the TCN model achieved an accuracy of 77.7 %. This prediction can be used as an early warning of an incoming fault that the system or the maintainer can act on to improve the system's availability.
Acknowledgement

Throughout the writing of this thesis, we have received a great deal of support and shared knowledge. We would like to thank our supervisor Johan Hernefeldt at Telia, whose guidance was invaluable for the result of the thesis. Your expertise in the area, and your ability to share it and exhibit its value, have given us invaluable insight and knowledge about modern technologies.
Contents

1 Introduction
2 Background
  2.1 Cloud Native's origin
  2.2 Orchestration Framework - Kubernetes
  2.3 Monitoring in Cloud Environment
  2.4 Metrics in a Cloud Environment
  2.5 Resilience Engineering
  2.6 Chaos Engineering
3 Theory
  3.1 Artificial Neural Network
  3.2 Convolution Neural Networks
    3.2.1 Temporal Convolutional Networks
    3.2.2 Transfer learning
  3.3 Decision trees
    3.3.1 Ensemble learning - Random Forest
  3.4 Regularization techniques
    3.4.1 Label Smoothing
    3.4.2 One cycle training
    3.4.3 Dropout
    3.4.4 Layer normalization
4 Related work
  4.1 Predicting Node Failure in Cloud Service Systems
  4.2 Failure Prediction in Hardware Systems
  4.3 System-level hardware failure prediction using deep learning
  4.4 Predicting Software Anomalies using Machine Learning Techniques
  4.5 Netflix ChAP - The Chaos Automation Platform
5 Methodology
  5.1 AWS Cluster
    5.1.1 Boutique - Sample Application By Google
    5.1.2 System monitoring - Prometheus
    5.1.3 Locust - load framework
    5.1.4 Provoking pod failures
  5.2 Data
    5.2.1 Collecting data
    5.2.2 Prometheus queries
    5.2.3 Data formatting
  5.3 Classification
    5.3.1 Data loaders
    5.3.2 Random Forest model
    5.3.3 Temporal Convolutional Networks model
  5.4 Evaluation
6 Result
  6.1 Data set
  6.2 Classification
7 Discussion
  7.1 Data generation
  7.2 Choice of monitoring metrics
  7.3 Data pipeline
  7.4 Classification
    7.4.1 Random Forest model
    7.4.2 Temporal Convolutional Networks model
  7.5 Improving resilience with failure prediction
8 Conclusion
9 Future work
A Data set
  A.1 Value distribution
    A.1.1 Crash data samples
    A.1.2 Healthy data samples
1 Introduction

The requirements for modern distributed systems continue to increase, and to manage the change, systems are migrated to cloud-native architecture to achieve numerous advantages such as low cost, scalability, and robustness [25]. Achieving full availability of the system is a desired characteristic but is no easy task. Full availability means that a system can serve a customer at any given point in time. In other words, there will be no downtime. In recent years, resilience engineering and chaos engineering have changed how we think about building distributed enterprise solutions. These methodologies teach developers to continuously develop systems that are robust, scalable, and safe. However, there are more primitive techniques to improve availability, such as scaling the system as the traffic increases [25]. Scaling includes increasing the number of server instances, increasing resource limits, and more [30]. Another method is to use telemetry data to improve the observability of the system [30].

The information from telemetry data provides insights into the current state of the system. The insights need to be interpreted correctly in order to know if an availability degradation will occur. A system's telemetry data could consist of metrics such as requests per second (RPS), error rate, request latency, and request duration [30]. However, the value provided by a metric depends on the behavior and architecture of the system. The Four Golden Signals, the RED Method, and the USE Method are three generic sets of metrics to monitor a system and will suffice in many situations [2][3][14].

A maintainer of the system interprets the telemetry data, and it is up to that individual to determine the state of the system. Human error can occur in the interpretation of telemetry data, which could lead to availability degradation. Manual monitoring requires extensive human resources as it scales with the system's size and the number of instances. Autonomous techniques release human resources, but auto-scaling and similar preemptive techniques use constant thresholds to determine if a service is being exhausted. So instead of understanding the data, the maintainer has to understand the system's thresholds, which merely shifts the original human effort.

This thesis investigates the research question: Is it possible for a machine learning model to predict Kubernetes pod failures based on telemetry data? This prediction could then warn the maintainer and aid in preventing the predicted pod failure.

At the start of this project, no relevant data was freely accessible. Simulation of a 10-tier microservice solution over an extensive period to provoke pod crashes resulted in more than 1270 crashes. The simulation took place in a Kubernetes cluster with the help of Prometheus, and the telemetry data were continuously stored, consisting of the metrics presented in Section 5.1.2. This data was then formatted into time windows and
labeled as either crash or healthy data samples.

The project applies two different models, a Random Forest model and a Temporal Convolutional Networks model, where the first-mentioned acted as a baseline model. The Random Forest model needs minimal hyperparameter optimization and requires little setup, making it well suited as a baseline model. The baseline model's performance was the target for the more complex Temporal Convolutional Networks model. The input to these models is a 15-minute time series of telemetry data with a 15-second interval between metric measurements. The models then predict whether a failure will occur within the prediction time window. The thesis explores different prediction time windows to analyze how they impact the accuracy and precision of the predictions. As healthy data samples occur throughout the whole day, the collected data is highly unbalanced. Undersampling rebalanced the data set to a 50/50 balance between the classes.

The five-minute prediction time window resulted in the best accuracy and precision for the two models. The Random Forest model achieved an accuracy of 73.4 %, and the temporal convolutional model achieved an accuracy of 77.7 %. However, the difference in precision was more noticeable, where the Random Forest model achieved a precision of 78.9 %, and the Temporal Convolutional Networks model achieved a precision of 96.4 %.
2 Background

Telia offers enterprise customers a contact center solution, designed for seamless customer meetings and intelligent conversations, called ACE. ACE is market-leading in the Nordic and the Baltic states. This solution allows enterprises to communicate with their customers over a variety of communication channels. This unique feature gives the operator of an event a richer picture of the current state, as the system aggregates information from multiple channels into a single source of truth, an omnichannel [8]. To be competitive in this industry, the product has to provide full availability, where the system is available to a user at any given time. Unavailability may cause the customer to experience an interruption in the service and, in the worst case, a complete outage of the service. Therefore, in recent years Telia ACE has started a migration to a cloud-native microservice architecture to improve the system's availability and resilience.

The transition to the cloud has provided solutions to update services without downtime, but avoiding failing services is still an issue. Kubernetes and related orchestration frameworks provide continuous health checks to analyze the state of the services, but there is still no standard approach to predict a failing service and its cause of failure.

2.1 Cloud Native's origin

The current understanding of the term cloud-native originates back to 2012 and comes from research rather than industry [27]. There exists no precise definition of what a cloud-native application (CNA) is. However, there are characteristics of a CNA which are commonly understood by researchers. A base characteristic is that a CNA must be operated on an automated platform with migration and interoperability in mind. Achieving the previously mentioned characteristic enables a CNA architecture which consists of service-based applications. These service-based applications possess characteristics such as horizontal scalability, elasticity, and resiliency. There are pattern-based methods, such as Kubernetes [1], to achieve a CNA architecture. These pattern-based methods are used to automate the process of delivery and infrastructure changes [27].

2.2 Orchestration Framework - Kubernetes

The orchestration framework Kubernetes provides support to run distributed systems resiliently. Kubernetes offers similar features as PaaS (Platform as a Service), such as deployment, scaling, and load balancing, and allows users to integrate logging, monitoring,
and alerting solutions. The distinction from PaaS is that these default solutions are optional and pluggable. Kubernetes yields the fundamental base for the platform's development and preserves user choice and flexibility where it is crucial [1].

A Kubernetes cluster consists of nodes (worker machines) that run the containerized environment. The nodes host the pods, which are the components of the application workload. A pod can host multiple containerized applications, and pods can be spread across multiple physical machines, enabling scaling with a dynamically changing workload [1].

Figure 1 Illustration of a Kubernetes cluster

2.3 Monitoring in Cloud Environment

Monitoring complex distributed systems plays a crucial role in every aspect of a software-orientated organization [47]. Obtaining, storing, and processing information about a system is a difficult task. The raw data may not provide any insight on its own; its value comes when it is processed. If a system achieves effective monitoring, it can help eliminate performance bottlenecks and security flaws and aid the engineers in making informed decisions to improve the system. The design of monitoring tools for cloud computing is still an under-researched area. There is no standard monitoring technique in cloud environments. All existing techniques have their benefits and drawbacks [47].

Figure 2 visualizes a three-stage process on how to monitor. The system collects relevant data from the target's current state. Another component analyzes the data, produces visualizations and alerts to the operator, and sends the results to a decision engine that executes preemptive actions to sustain a healthy environment [47]. This automated
decision engine can handle faults that it has been engineered to detect but stumbles on out-of-domain faults. The most common analysis technique in monitoring systems is threshold analysis. Threshold analysis consists of continuously comparing metric values with respective predefined conditions. If a metric violates the predefined condition, the monitoring system will raise an alert. The alerts the decision engine can resolve are primarily trivial issues, as more complex issues may have an underlying origin outside the analyzed metrics. The most trivial automatic recovery mechanism is to terminate the faulty virtual machine (VM) and initialize a replacement [47]. The other option, and the most common, is that the engineers will manually respond to the alert with an informed decision. Finding an appropriate response is a challenging task. The engineers have to analyze the current state, which requires them to consider all known states, identify the issue, and then form a set of actions to resolve the issue, which requires significant operations personnel to deal with this process [47].

Figure 2 Monitoring Process

Though monitoring plays a crucial role in maintaining distributed systems, developing an effective monitoring strategy is extremely difficult. A monitoring strategy states how to gather and interpret information from the system. The vast majority of monitoring strategies are devised during the design phase and revised throughout the development of the system. The monitoring strategy is thus always one step behind as it adapts to an ever-changing environment. As long as the system changes before the monitoring strategy, the strategy will always be insufficient, derived from a system that no longer exists in its original form. Many high-profile outages were possible due to monitoring strategies that failed to detect anomalies and thus prevented engineers from acting preemptively to avoid the outage. Monitoring strategies detect only the anomalies that the engineers have predefined. Thus, they suffer from a limited range of anomalies they can detect [47].
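The threshold analysis described above can be sketched in a few lines of Python; the metric names and limits below are hypothetical and are not taken from the thesis or from any particular monitoring tool:

# Hypothetical thresholds for a few pod metrics; a real system would load
# these from an alerting configuration (e.g., Prometheus alert rules).
THRESHOLDS = {
    "cpu_usage_fraction": 0.90,      # alert above 90 % CPU usage
    "memory_usage_fraction": 0.85,   # alert above 85 % memory usage
    "error_rate_per_second": 5.0,    # alert above 5 failed requests/s
}

def check_thresholds(sample: dict) -> list:
    """Compare one metric sample against predefined limits and
    return an alert message for every violated condition."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = sample.get(metric)
        if value is not None and value > limit:
            alerts.append(f"ALERT: {metric}={value:.2f} exceeds {limit:.2f}")
    return alerts

print(check_thresholds({"cpu_usage_fraction": 0.97, "error_rate_per_second": 1.2}))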
In recent years, two disciplines have appeared that address this issue. Resilience and chaos engineering advocate continuous verification of the system to limit and unveil the system's anomalies. These two approaches are further explained in Sections 2.5 and 2.6.

2.4 Metrics in a Cloud Environment

As mentioned before, choosing metrics is a challenging task, and regardless of the chosen strategy, there is no assurance that it can detect future issues. However, some sets of metrics have been popular among monitoring strategies and have proven useful in many systems. These are the Four Golden Signals, the RED Method, and the USE Method.

Google's SRE teams introduced the four golden signals. The metrics are latency, traffic, errors, and saturation [14].

• Latency is the time it takes to serve a request, from when the client sends the request to when the response is received at the client [14].

• Traffic is a measurement of the current demand for the system. The observed metric depends on what the system's functionality is. The metric is usually HTTP requests per minute. For a streaming system, the measurement might be the network I/O rate [14].

• Errors are the rate of requests that fail. The engineering team needs to define what an error is. It could be an HTTP response code or perhaps a policy for requests, e.g., requests with latency over 1 second are classified as errors [14].

• Saturation is a measurement of how full the service is. It provides information on how much load the system can handle. This measurement usually observes the extra work that the service cannot handle, which results in errors or delayed responses [14].

The next set of metrics is the RED Method. The RED Method takes inspiration from the four golden signals but excludes saturation. The author of the RED Method excludes saturation with the motivation that it only applies to advanced use cases. The three metrics that the RED Method consists of are rate, errors, and duration, where rate is equivalent to traffic and duration is equivalent to latency. The RED Method predefines that these three metrics observe HTTP requests and is thereby only applicable to request-driven services. These three metrics combined are sufficient to monitor a vast majority of services [2].
The USE Method contains the metrics utilization, saturation, and errors. This methodology analyzes the performance of a system. In comparison with the RED Method, which emphasizes the client experience, the USE Method emphasizes the performance of the service's underlying hardware [3].

• Utilization is the average time that the service was busy servicing work [3].

2.5 Resilience Engineering

The definition of resilience engineering follows: "Resilience engineering is a paradigm for safety management with a focus on how to help people cope with complexity under pressure to achieve success", as stated in the book [48]. In other words, resilience describes how well the system can manage unpredicted issues. Robustness, on the other hand, refers to system designs that handle predicted issues [10].

Resilience engineering appears in numerous industries in diverse forms, e.g., aviation, medicine, space flight, nuclear power, and rail. Resilience is critical in these industries to avoid catastrophic failures, or even casualties [10]. However, resilience is viewed differently within cloud engineering: the mentioned industries require personnel to follow strict procedures to prevent issues, but this is uncommon within cloud engineering. One aspect of resilience engineering that is similar throughout the industries is the increased adoption of automation. Automation introduces challenges, and they are the topics of numerous resilience engineering papers [10].

Resilience engineering stretches not just over systems but also organizations. Resilience concerns the ability to identify and adapt to manage unpredicted issues by changing any relevant factor. This factor could be a software change or changes within the organization, such as modifications to processes, strategies, and coordination [48].

2.6 Chaos Engineering

Chaos engineering is a relatively new discipline within software development that has emerged to improve robustness and resilience within systems. This discipline originates back to Netflix in 2008 when they moved from the data center to the cloud. At that time, Amazon Web Services (AWS) was considerably less sophisticated than now [26]. Cloud computing suffered from various defects and failures, such as instances that would blink out of existence without warning. Therefore, a system had to be resilient to cope with these failures. The vanishing instances introduced numerous practices to
deal with them automatically, but Netflix could not adopt these practices due to their unique management philosophy, where the engineering teams were highly aligned and loosely coupled. There was no mechanism to issue an edict to the entire engineering organization demanding that they follow these practices [26].

Netflix then introduced Chaos Monkey [26]. Chaos Monkey is an application that shuts down a random instance without warning, one in each cluster, during business hours, and the process repeats every day. This proactively tested every engineering team's resilience, and each team had to apply methods to adapt to these unexpected failures [26]. In response to a devastating region failure within AWS that affected Netflix and others, Netflix introduced Chaos Kong, which disables a whole region, and thus each engineering team was required to adapt to this kind of failure [26].

The Chaos Engineering team at Netflix created the Principles of Chaos Engineering, which is the fundamental core of Chaos Engineering [26][9]. The definition of Chaos Engineering follows: "Chaos Engineering is the discipline of experimenting on a system to build confidence in the system's capability to withstand turbulent conditions in production" [9]. Chaos Engineering validates that some form of failure is always prone to be provoked within a system.
3 Theory

In the following sections, the theory required to perform the methods of the project is presented.

3.1 Artificial Neural Network

Neural networks originate back to 1958, when Frank Rosenblatt invented what he called a Perceptron [35]. The Perceptron is based upon the concept of artificial neurons called linear threshold units (LTU) [19]. A neuron comprises several inputs, weights, and an output, where each input is associated with a weight. The equation for how the LTU calculates the weighted sum of its inputs follows:

z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n = \sum_{i=1}^{n} w_i x_i

The LTU's output is the weighted sum passed through a step function:

h_w(x) = \mathrm{step}(z) = \mathrm{step}\left( \sum_{i=1}^{n} w_i x_i \right)

See Figure 3 for a graphical overview of an LTU.

Figure 3 Linear Threshold Unit (LTU)

The Heaviside step function is the most common step function used in the perceptron:

\mathrm{heaviside}(z) = \begin{cases} 0, & \text{if } z < 0 \\ 1, & \text{if } z \geq 0 \end{cases}

In other cases, the sign function is used:

\mathrm{sgn}(z) = \begin{cases} -1, & \text{if } z < 0 \\ 0, & \text{if } z = 0 \\ 1, & \text{if } z > 0 \end{cases}
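As a concrete illustration of the formulas above, a minimal sketch of an LTU forward pass in Python follows; the example inputs and weights are made up and are not taken from the thesis:

import numpy as np

def heaviside(z: float) -> int:
    # Heaviside step function: 0 if z < 0, otherwise 1.
    return 0 if z < 0 else 1

def ltu_output(x: np.ndarray, w: np.ndarray) -> int:
    # Weighted sum z = sum_i w_i * x_i, passed through the step function.
    z = np.dot(w, x)
    return heaviside(z)

x = np.array([0.5, -1.0, 2.0])   # example inputs (made up)
w = np.array([0.8, 0.2, -0.3])   # example weights (made up)
print(ltu_output(x, w))          # prints 0, since z = 0.4 - 0.2 - 0.6 = -0.4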
The linear threshold unit computes a linear relationship between the inputs, and the perceptron can solve trivial linear binary classification tasks when it utilizes multiple LTUs [19]. A perceptron consists of two fully connected layers, an input layer and a layer of LTUs. The input layer consists of input neurons and a bias neuron. The input neurons forward the model's input to the LTUs, and the bias neuron constantly forwards a 1 to shift the activation function of the LTUs by a constant factor [19]. The LTU layer consists of multiple units, with weights for the inputs and the bias neuron. See Figure 4 for an overview of the Perceptron. The Perceptron in the figure can classify three different binary classes based on two inputs.

Figure 4 Perceptron

The Perceptron training algorithm, proposed by Frank Rosenblatt, fits the weights of the LTUs for a given training instance [35]. The algorithm feeds the perceptron with inputs and compares the output to the target of the input. The weights are then recalibrated based on the difference between the output and the target [19]. The equation for updating a weight follows:

w_{i,j} \leftarrow w_{i,j} + \eta (y_j - \hat{y}_j) x_i

• w_{i,j} is the weight between the ith input neuron and the jth output neuron.

• \eta is the learning rate.

• \hat{y}_j is the output of the jth output neuron for the current training instance.

• x_i is the ith input value of the current training instance.
• y_j is the target output of the jth output neuron for the current training instance.

Perceptrons are incapable of solving trivial problems such as the Exclusive OR (XOR) classification problem [19]. The Multi-Layer Perceptron (MLP) solves numerous problems that Perceptrons have. The architecture of the MLP consists of stacking layers on top of each other. One layer's output will be another layer's input. The MLP consists of one input layer, one or more hidden layers of LTUs, and one final output layer of LTUs. See Figure 5 for a graphical overview. If the number of hidden layers exceeds one, the architecture is called a deep neural network (DNN). The MLP is capable of solving more complex classification problems in comparison with a single Perceptron [19].

Figure 5 Multi-Layer Perceptron

The Multi-Layer Perceptron suffered from insufficient training algorithms, which resulted in bad performance. In 1986, D. E. Rumelhart et al. [36] introduced the backpropagation training algorithm. The algorithm measures the network's output error. The output error is the difference between the network's desired output and actual output. The measurement of the output error consists of computations of how much each neuron contributed to each output neuron's error. These computations are recursive: they begin by computing the output error contribution of each neuron in the last hidden layer and proceed back to the input layer. This establishes a measurement of the error gradient across all weights. Finally, a Gradient Descent step is applied with the previously measured error gradients to tweak the connection weights and reduce the output error [19].
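A minimal sketch of the weight update rule listed above, applied to a single-layer perceptron, is given below; the training data (logical AND) is made up for illustration and is not the thesis's data:

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    # X: (n_samples, n_inputs), y: (n_samples,) with values 0 or 1.
    n_inputs = X.shape[1]
    w = np.zeros(n_inputs)       # weights for the inputs
    b = 0.0                      # weight for the bias neuron (constant input 1)
    for _ in range(epochs):
        for x_i, target in zip(X, y):
            output = 1 if np.dot(w, x_i) + b >= 0 else 0   # step function
            # w_i <- w_i + eta * (y - y_hat) * x_i
            w += lr * (target - output) * x_i
            b += lr * (target - output)
    return w, b

# Made-up linearly separable data: logical AND of two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(w, b)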
With backpropagation, the step function was replaced by new activation functions. There is no gradient to work with in the step function because it consists only of flat surfaces. The new activation functions consist of surfaces with gradients, and these functions can be either linear or non-linear. Two popular activation functions are the hyperbolic tangent function and the ReLU function [19].

When a Multi-Layer Perceptron is applied to classification problems, the output layer's activation function is a shared softmax function. The softmax function transforms the input values so that the sum of the values is equal to one. Each output value represents the probability of each class [19].

Loss functions enable neural networks to learn and are used within the backpropagation algorithm to determine how faulty an output is [37]. The purpose of training networks is to find connection weights that lead to the lowest value of the loss function. In other words, the loss function indicates the poorness of a neural network's ability. One common loss function is the Mean Squared Error (MSE) function, which follows: E = \frac{1}{2} \sum_k (y_k - t_k)^2, where y_k is the output of the network, t_k is the labeled data, and k ranges over the dimensions of the data. The result of the loss function determines how much the connection weights to the activated neurons should be adjusted during training [37].

3.2 Convolution Neural Networks

Convolutional Neural Networks (CNN) are an advancement of the Multi-Layer Perceptron. The evolution is a result of image analysis. An MLP model for image analysis would need to have input neurons for each pixel and their dimensions, which would result in large networks, even for small images [32]. For example, a 25x25 RGB image consists of 625 pixels, and each pixel has three dimensions for the colors red, green, and blue. The required number of weights for each hidden neuron is 1875, and each hidden layer consists of 1875 neurons. The number of weight variables increases with the number of layers as they are fully connected, making deep neural networks unsustainable.

Convolutional neural networks solve this by weight sharing. To enable weight sharing, a CNN utilizes a different architecture than an MLP. A CNN model consists of layers. These layers transform the input and forward the output to the following layer. Common transformations of a CNN are convolutions, activation layers, pooling layers, linear layers, and layers that apply regularization techniques [32]. Regularization counteracts overfitting, and common layers are batch normalization and dropout.

Convolutional layers apply a kernel to the input and output an activation map. The activation map consists of the sums from multiple element-wise multiplications between the input and the kernel [32]. The kernel slides over the input as seen in Figure 6. The size of
the kernel, the stride (how far the kernel slides in each step), and the padding determine the shape and receptive field of the output. The receptive field consists of the input values that are multiplied with the kernel to obtain the output value. The trainable parameters of a convolutional layer are the variables of the kernel. The kernel's variables are shared for the whole input (weight sharing) [21], thereby making the model size depend on the architecture and not the input.

Figure 6 The figure shows two element-wise multiplications and summations under convolution with a 2-dimensional input. The convolution involves an input of shape 5x5 and a 3x3 kernel with no padding. The activation map's shape is 3x3.

Activation and pooling layers are parameter-free - no optimization applies in these layers. The layers transform the input by a fixed function that only depends on hyperparameters. Rectified linear activation and sigmoid activation are two examples of activation layers, and the transformation is element-wise. Pooling layers reduce the shape of the input [32]. For example, the max-pooling layer applies a filter that extracts the maximum value in its receptive field, and the filter slides over the input like a kernel. Reducing the shape of the output counteracts the overfitting of the model [32].

Linear layers are fully connected layers and act similarly to hidden layers in MLPs. They have weights associated with each input, a bias, and an output with a specified shape [32]. Linear layers can act as the final layer of the model, as their output size is adjustable. The output size can represent the classes the model should predict.

The inputs of CNN models are matrices of values. The number of dimensions depends on the data and on whether batch learning is applied. An image, for example, would have height, width, and color spectrum as dimensions, as seen in Figure 7. Batch learning increases the input dimension as it feeds the model with a mini-batch (multiple data samples) at once and calculates the output in parallel with the help of linear algebra, resulting in improved training time. The previous example's dimensions would then be batch size, height, width, and color spectrum, as seen in Figure 7.
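A minimal sketch of the convolution described above (technically cross-correlation, as in most deep learning frameworks) with a made-up 5x5 input and 3x3 kernel and no padding, producing a 3x3 activation map as in Figure 6:

import numpy as np

def conv2d_valid(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # Slide the kernel over the input with stride 1 and no padding;
    # each output value is the sum of element-wise multiplications.
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)     # made-up 5x5 input
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])               # made-up 3x3 kernel
print(conv2d_valid(x, kernel).shape)             # (3, 3) activation map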
Figure 7 Two input examples. To the left in the figure, a three-dimensional image with 5 pixels in height, 5 pixels in width, and RGB as the color spectrum. To the right in the figure, a four-dimensional input that is n images in a mini-batch.

3.2.1 Temporal Convolutional Networks

Temporal Convolutional Networks (TCN) is an architecture designed for sequential modeling. TCN emphasizes a flexible receptive field, stable gradients, and parallelism [28, 12]. It distinguishes itself in that its convolutions are causal, and the model can take an input of any length and produce an output of the same size [12]. TCN makes use of residual blocks similar to ResNet [20], with the distinction that each block applies dilated causal convolution.

Causal convolution is convolution with an input sequence i_0, ..., i_t and an output y_0, ..., y_t, where y_i only depends on the inputs i_0, ..., i_i, where 0 ≤ i ≤ t. Causal convolution leads to no data leakage across the time dimension, i.e., look-ahead bias [46, 12], which happens if standard convolution is applied. The two convolutions can be seen in Figure 8.
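A minimal sketch of how causal convolution is often realized in practice, here assuming PyTorch's Conv1d with left-side zero padding; the channel sizes and sequence length are made up, and the dilation parameter is explained in the paragraphs that follow:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution where output t only depends on inputs 0..t,
    implemented by padding (kernel_size - 1) * dilation zeros on the left."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                       # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))        # pad only the past side
        return self.conv(x)                     # output keeps the input length

x = torch.randn(1, 8, 60)                       # e.g. 8 metrics, 60 time steps (made up)
layer = CausalConv1d(in_ch=8, out_ch=16, kernel_size=3, dilation=2)
print(layer(x).shape)                           # torch.Size([1, 16, 60])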
Figure 8 To the left: standard convolution with 1 padding. To the right: causal convolution with 2 padding. Both use a kernel of size 3 × 1 and have input and output sizes of 8 × 1. In the standard convolution, the output at y_1 depends on i_0, i_1, i_2, thus leading to data leakage over the time dimension.

Dilated convolution is convolution with a dynamic receptive field. The receptive field depends on the dilation factor d and the kernel size k [46]. Convolution with a dilation factor d = 1 is equal to standard convolution, and its receptive field is k. When d > 1, there is a space of d − 1 between each input to the kernel, and thus the receptive field is (k × d) − 1. In Figure 9, dilated convolution occurs with a dilation factor of 2, and each output y_i has a receptive field of 5.

Figure 9 Dilated convolution with k = 3, d = 2, and input and output sizes of 8 × 1.

Residual blocks were conceived from the empirical finding that adding more layers to
a model leads to degradation of its performance, even if the layers are identity mappings [20]. The purpose of residual blocks is to learn modifications to the identity mapping [12] by splitting the input x into an identity branch and a transformation branch. The identity branch forwards x to the end of the block, and the transformation branch applies a set of transformations F = {f_0, ..., f_n} to x. If the shapes of F(x) and x differ, the identity branch applies a 1 × 1 convolution to make them compatible when summed in the output [12, 21]. The branches are joined and passed through an activation function σ, such as the Rectified Linear Unit activation function. The output is y = σ(x + F(x)), and thereby the weights of F are updated to minimize the difference between x and y.

Dilated causal convolution in combination with residual blocks makes the TCN model stable when handling long time series. A common practice for dilated convolution is to increase the dilation factor exponentially for each layer (block for TCN) [12, 21], and with the stability of residual blocks, the network's receptive field can include large time series.

3.2.2 Transfer learning

Transfer learning is the use of an existing model for a new task. There are three possible advantages of using a transfer learning model compared to a model trained from scratch: higher initial performance of the model, lower training time, and higher final performance [31]. Bozinovski et al. [15] defined the concept of transfer learning in the mid-1970s, but the method's breakthrough came in recent years as neural networks got deeper and more expensive to train. Transfer learning is widely adopted in image recognition, as it is favorable for convolutional neural networks with a high number of trainable parameters.

During transfer learning, a model M is trained on a task T_o with data set D_o. M consists of a body and a head. The body is the model architecture, and the head is the last layer(s) of the model. The head is unique for the data set, as its output shape depends on the data set's class cardinality. Model M can be applied to a new task T_n with data set D_n. Before applying the model to the new task, the head has to be removed, as it predicts the classes of D_o and not D_n [22]. A new layer L with output shape B × C replaces the head, where B is the batch size and C is the cardinality of classes in D_n. In the early stage of training, M's body is frozen, i.e., its parameters are not changed, as they are already fit to recognize patterns in a similar task. The body can be unfrozen when the newly added layer(s) converge to make the model more coherent to the new task [21].
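A minimal sketch of the head replacement and freezing described above, assuming PyTorch and a pretrained torchvision ResNet body; the class count is made up and this is not the thesis's transfer learning setup:

import torch.nn as nn
from torchvision import models

NUM_CLASSES = 2                       # e.g. crash / healthy (made up for this sketch)

model = models.resnet18(weights="IMAGENET1K_V1")   # pretrained body + old head

# Freeze the body: its parameters are not updated during the early training stage.
for param in model.parameters():
    param.requires_grad = False

# Replace the head: a new linear layer whose output size matches the new classes.
# The new layer's parameters are trainable by default, so only the head trains.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Later, once the new head has converged, the body can be unfrozen:
# for param in model.parameters():
#     param.requires_grad = True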
3.3 Decision trees

Decision tree algorithms generate trees by evaluating data sets. A generated decision tree consists of decision nodes and leaves. Each decision node splits the data set and partitions it into descending decision nodes or leaves. Decision trees predict by the label of the reached leaf when traversing the tree from an input of attributes [34].

The data set S consists of n samples. A sample is a tuple of attributes and a label, and the data set is denoted by S = {T_1, ..., T_n}, where T_i = {A_i, y_i} and A_i = {a_1, ..., a_n}. The attribute a_j's type is either qualitative or quantitative [34]. In classification, the label y_i is the class instance, and in regression, it is the target value.

There are two groups of decision tree algorithms, top-down and bottom-up algorithms. The most common decision tree algorithm type is top-down and includes ID3, C4.5, and CART [49]. Top-down algorithms are greedy algorithms and consist of two phases, growing and pruning. The growing phase builds the decision tree, and the pruning phase counteracts the downsides of greedy algorithms, where an optimal split only depends on the information in the node [34]. Pruning decreases the complexity of the tree and lowers the chance of overfitting.

The growing phase can be generalized as follows: start at the root node with the whole data set S. Iterate over the attributes A of S to find the attribute a_j that results in the best split of S, according to the algorithm's splitting criterion. The split partitions the data set into m partitions, and m descending nodes are created, one for each of these partitions. Repeat the splitting procedure in the newly created nodes until a node satisfies one of the algorithm's stopping criteria. Stopped nodes (leaves) are then labeled based on their partition of the data set. For classification, the label is the class instance with the highest occurrence, and for regression, it is a statistical measure such as the mean. The pruning phase traverses the decision tree from the root node and searches for branches of succeeding nodes that result in a worse distribution of the samples in the leaves than the upper node. Found branches are pruned from the tree.

The decision tree classifies an attribute set A by traversing the decision tree by its splitting rules, and the reached leaf's label is the prediction of the sample.

3.3.1 Ensemble learning - Random Forest

Condorcet's Jury theorem states that if there are n voters and their probability p of making the correct decision is p > 0.5, then increasing the number of voters leads to a higher probability that the majority makes the correct decision. Ensemble learning
makes use of this theorem by combining multiple machine learning models to achieve higher predictive performance. For classification models, the ensemble decision is obtained by the majority vote or by the weighted majority vote. The weighted majority vote favors selected models' predictions over others and thus increases the magnitude of their votes. If all models have the same error rate p and the errors are uncorrelated, the ensemble's error rate p_ens can be defined as:

p_{ens} = \sum_{k = T/2 + 1}^{T} \binom{T}{k} p^k (1 - p)^{T - k}    (1)

T is the number of models, and k denotes the minimum number of models whose prediction has to be true [50].

Random Forest is an ensemble learning method consisting of decision trees. Tin Kam Ho introduced the Random Forest model in the paper [44] and proposed that the usage of multiple decision trees could lead to higher predictive accuracy and lower variance. The paper compared how the splitting rule affected the decision tree's complexity and proposed creating decision trees from a random subset of the data set's features [44]. Random feature selection leads to better generalization in the ensemble because the decision trees' predictions are invariant for samples with variations in the features excluded from their feature subsets [44], thus making the decision trees more decoupled. Breiman continued developing Random Forest models and proposed using bagging in combination with random feature selection [17]. Bagging samples the data set for each decision tree with replacement [16], which decreases the variance of each model without affecting the ensemble's bias.

3.4 Regularization techniques

Regularization techniques are methods to counteract overfitting and improve the generalization of the model. Regularization works by injecting noise or randomness into the model's pipeline or weights during training.

3.4.1 Label Smoothing

Label Smoothing is a regularization method that applies to classification models. When training a classification model, the weights are updated based on the loss. The loss function depends on the difference between the model's output v = {v_0, ..., v_K} and
the target t = {t_0, ..., t_K}, where K is the number of classes, v_K is the activation for class K, and t is the one-hot encoding of the input's class. In the loss function, v passes through a softmax function

\sigma(v)_i = \frac{e^{v_i}}{\sum_{j=1}^{K} e^{v_j}}, \quad i = 1, \ldots, K

that transforms v so that

\sum_{i=1}^{K} v_i = 1, \quad v_i \in [0, 1]

The boundary values 0 and 1 are only obtainable if v_i → ±∞ for an input of class t_i. As an effect, the model will update its weights to make the output approach infinity and thereby overfit [43].

Label smoothing solves this issue by modifying the targets the model predicts during training. The modifications are 0 → ε/N and 1 → 1 − ε/N, where ε states the uncertainty of a prediction and N is the number of classes [21]. When training with label smoothing, the model generalizes, as no targets are 0 or 1. The model will never predict a class with a probability of 0 or 1, and thereby the result of the softmax function can approach the targets without v_i → ±∞.

3.4.2 One cycle training

1cycle training is a regularization technique that divides the training phase into two phases, warmup and annealing. In the warmup phase, the learning rate r increases, and in the annealing phase it decreases back to the original level. Smith stated in [40] that using a too high r at the start makes the loss diverge, as the randomly initiated weights have not converged to the given task. Using a too high r at the end makes the model's optimization method miss local minima. However, using a higher learning rate throughout the rest of the process will make the model's optimization method converge faster.

3.4.3 Dropout

Dropout is a regularization method that randomly drops inputs of the units in the neural network under training. Dropout multiplies the units' inputs by independent variables drawn from a Bernoulli distribution, where the variables are one with probability p. The
randomness makes the units produce outputs that are generalized, thereby decreasing the variance of the model [41]. When testing the neural network, the units' weights are multiplied by p to account for the output difference from not dropping inputs. In convolutional neural networks, the architecture consists of layers and not units. Dropout is available through a dropout layer. The layer drops variables in the input with a probability p during training. When frozen, the dropout layer simply forwards its input unchanged.

3.4.4 Layer normalization

Batch normalization and weight normalization are two regularization techniques for normalizing the layers in a convolutional neural network.

Batch normalization normalizes the layers' activation values before the non-linearity function in a neural network. Batch normalization uses statistical values derived from the activation values of layer i for all samples in the mini-batch [24]. This method does not comply with correlated time series, as the standard deviation (one of the statistical values) is only linear when the data is uncorrelated.

Weight normalization takes inspiration from batch normalization without introducing a dependence between the samples in the mini-batches. Weight normalization improves the training speed and robustness and thus enables higher learning rates. Weight normalization reparameterizes the weight matrices to decouple the weight vector's norm from its direction. The optimization method is applied to the new parameters, thereby achieving a speedup in its convergence [38].
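A minimal sketch of how the regularization techniques above are typically combined in PyTorch; the smoothing factor, learning rates, and layer sizes are made up, and this is not the thesis's training code:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),          # randomly drops inputs during training only
    nn.Linear(128, 2),
)

# Label smoothing: soft targets instead of hard 0/1 one-hot targets.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 1cycle schedule: the learning rate warms up towards max_lr, then anneals back down.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=1000
)

x = torch.randn(32, 64)               # made-up mini-batch
y = torch.randint(0, 2, (32,))        # made-up binary targets
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()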
4 Related work

The related work consists of papers on predicting failures in hardware and systems, as no paper was found that targets Kubernetes pod failures. The following articles inspired the decisions concerning data representation, how to generate data, and which machine learning models are suitable for the thesis.

4.1 Predicting Node Failure in Cloud Service Systems

In 2018, Lin et al. [29] published an article that proposed a model for predicting node failures in a cloud service system. The study discusses how node failures impact a cloud provider's availability; even if the cloud provider has over 99.999 % availability, the virtual machines suffer from 26 seconds of downtime per month [29]. The paper presented the model MING, which achieved an accuracy above 60 % in predicting a failure in the next day, and as a result of this, they could migrate the affected virtual machines to healthy nodes and thus reduce the downtime.

The study emphasizes the challenge of a highly imbalanced data set and complex failure-indicating signals [29]. The data set in the study has an imbalance of 1:1000 between failing and healthy nodes. One conventional method to handle an imbalanced data set is to re-balance the data set with sampling techniques. Re-balancing a data set could increase the recall of a model, but a potential side effect is decreased precision due to an increase of false-positive predictions [29]. Therefore, the study proposes a ranking model to address the imbalanced data set. A ranking model ranks a node's failure-proneness and focuses on optimizing the top k results instead of trying to separate all positive and negative samples [29], as conventional classifiers do. Thereby, it is less affected by an imbalanced data set.

As a node can fail in over a hundred different ways [29], the signals indicating a failure are complex and might have their origin in multiple sources. The study addresses this problem by including both temporal and spatial data in the model. Temporal data is time-series data from the individual node, such as performance counters, logs, sensor data, and OS events [29]. Spatial data includes a node's relation to other nodes, such as shared routers, rack location, resource family, load balance group, and more [29]. The temporal and spatial data are the inputs to the model's sub-models, a Long Short-Term Memory model and a Random Forest model. The two models' outputs are concatenated and fed to the ranking model, which generates the output of the whole model. In conclusion, a model consisting of sub-models that handle data from different sources and uses a ranking model for output performs overall better than individual models [29].
4.2 Failure Prediction in Hardware Systems

In 2003, Doug Turnbull and Neil Alldrin published an article [45] that proposed a classification model that predicts hardware failures. The effect of hardware failures could be fatal for a vendor. It could lead to high expenses related to damage to equipment, loss of confidence of clients, violation of the service agreement between vendor and client, and loss of mission-critical data [23].

The goal of the model was to obtain a high true-positive rate (tpr) and a low false-positive rate (fpr). A high tpr could assist the vendor and potentially prevent failures. A low fpr is vital as the occurrences of failures are rare, and false-positive predictions could initiate expensive actions by the vendor to handle a failure that does not exist.

The article defines two abstractions for the data, the Sensor Window and the Potential Failure Window. The Sensor Window is the input of the model, and the Potential Failure Window is the target of the model. A Sensor Window consists of n entries, and each entry is an aggregation of raw or processed data from the sensors. The Potential Failure Window consists of n entries, and each entry states whether a failure occurred in the hardware during its time period. The number of entries in each window sets its time period, and each entry represents one minute. The two windows combined are either a positive feature vector, if a failure occurs, or a negative feature vector, if no failure occurs.

The article concludes with a benchmark of the model on different data sets. The data sets vary by how many entries the sensor window consists of and whether the sensor data is raw or processed. The model achieves the best result with a sensor window of ten minutes of processed sensor data.

4.3 System-level hardware failure prediction using deep learning

Sun et al. [42] conducted a study comparing different machine learning models to predict disk and memory failures based on safety parameters. The study developed techniques to normalize the data, as there is no standard on how to log the attributes, to train a neural network with an imbalanced data set, as a hardware failure is a rare event, and a base model for transfer learning that is trained on the normalized data set and then fine-tuned on data from a specific manufacturer.

The "Self-Monitoring Analysis And Reporting Technology" (SMART) is a standard that requires disk manufacturers to record specific attributes throughout a disk's lifetime.
However, the standard does not state how manufacturers should record the attributes. Thereby, data from different manufacturers cannot be combined into a data set without normalization. The study normalizes the data to a unified distribution based on the historical data of both healthy and failed samples. The result is a generic data set independent of the manufacturer.

The study creates a loss function for imbalanced data sets. The loss function is an extension of the binary cross-entropy loss function: it reduces the loss for correct predictions and magnifies the loss for misclassified inputs [42]. Their extension of binary cross-entropy is to multiply the loss with a coefficient whose value is similar to a sign function and depends on the classification of the input [42].

The study provides two benchmarks: first, how different models trained on the normalized data set perform, and second, how a transfer learning model performs on a specific manufacturer's data set. The transfer learning model trains on the normalized data set and is then fine-tuned on the target manufacturer's data set. The study concludes that a temporal convolutional neural network achieves the best overall performance [42]. The model achieves it when it is trained on the normalized data set and then fine-tuned on a specific manufacturer's data set. The model uses the created loss function for both training periods.

4.4 Predicting Software Anomalies using Machine Learning Techniques

This paper introduces a detailed evaluation of a set of machine learning classifiers to predict dynamic and non-deterministic software anomalies [11]. The predictions are based upon monitoring metrics, similar to this thesis. The machine learning methods evaluated in this paper were Rpart (Decision Tree), Naive Bayes (NB), Support Vector Machine Classifiers (SVM-C), K-nearest neighbors (kNN), Random Forest (RF), and LDA/QDA (linear and quadratic discriminant analyses).

The study conducted three scenarios with different failure injections that resulted in three separate data sets, which were used to train the models. The system setup used to provoke the failures is a multi-tier e-commerce site that simulates an online book store, written in Java with a MySQL database. Apache Tomcat was utilized as the application server. The parameters that were altered during the simulations were the threads and memory, individually or in combination. Three different definitions were used to classify the current state of the system [11].
• Red zone: the system is at least 5 minutes away from a crash.

• Orange zone: the system is 5 minutes distance to the red zone.

• Green zone: the rest of the timeline.

These are the labels of the time windows that have been used in the data set, and thus the targets of the models. The paper concluded that the Random Forest algorithm had the lowest validation error rate (less than 1 %) in all three scenarios [11].

4.5 Netflix ChAP - The Chaos Automation Platform

Netflix conducts several deployments a day, and each change affects the resilience of the system [13]. Therefore, relying on previous experiments to verify the service's resilience in an ever-changing environment is not a good strategy. From this, the Chaos Automation Platform (ChAP) was born. ChAP is a continuous verification tool that is integrated with Netflix's continuous integration and deployment pipeline [13]. It also complies with the advanced principles of chaos engineering [26][9].

ChAP conducts a chaos experiment by creating a new cluster of the service in scope. To retain availability, it creates hundreds of instances of that service [13]. The experiment allocates two instances, and the load balancer redirects one percent of the requests to them. The first instance is exposed to the chaos experiment, and the second acts as a reference for the steady state of the service [26]. After a completed experiment, ChAP informs the service owner about its performance. The report may contain required changes to improve the quality of the service.
5 Methodology

This project touches a wide range of fields in computer science. In the following sections, the implementation procedures are defined and described.

5.1 AWS Cluster

Data is generated by exposing a demo application to different levels of load over time. For this project, the demo application has been deployed on a Kubernetes cluster with the following services: Prometheus, Istio, and Grafana. These services provide observability and inter-communication abstraction for the application. The cluster is deployed on Amazon Web Services and has 24 CPU cores and 91 GiB of memory.

5.1.1 Boutique - Sample Application By Google

Online Boutique is Google's showcase application for cloud-native technologies [4]. It is a 10-tier microservice system written in Python, Golang, Node.js, Java, and C# and utilizes technologies such as Kubernetes, Istio, gRPC, and OpenCensus. Boutique mimics a modern e-commerce site and includes a load generator written in Locust. The Boutique is treated as a black box during the generation of data, and crashes are initiated by increasing the load or injecting latency into the system, not by exploiting the weaknesses of the system. See Figure 10 for an overview of the services. See Figure 11 for an overview of the pods and containers of each service. The server is the container where the application is stored.
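A minimal sketch of a Locust load profile of the kind that could drive such a simulation; the endpoints, weights, and wait times below are hypothetical and are not taken from Boutique's bundled load generator:

from locust import HttpUser, task, between

class BoutiqueUser(HttpUser):
    # Simulated user think time between requests; shrinking this range
    # (or raising the number of users) increases the load on the system.
    wait_time = between(1, 3)

    @task(3)
    def browse_frontpage(self):
        self.client.get("/")                       # hypothetical endpoint

    @task(1)
    def view_product(self):
        self.client.get("/product/OLJCESPC7Z")     # hypothetical product id

# Run with, for example:
#   locust -f loadtest.py --host http://<frontend-service> --users 200 --spawn-rate 10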