Active Learning for Network Traffic Classification: A Technical Study - arXiv
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING 1 Active Learning for Network Traffic Classification: A Technical Study Amin Shahraki, Mahmoud Abbasi, Amir Taherkordi and Anca Delia Jurcut [Note: This work has been submitted to the IEEE Trans- networks and maintain their performance, such as Monitor- actions on Cognitive Communications and Networking jour- Analyze-Plan-Execute (MAPE), and Observe-Orient-Decide- nal for possible publication. Copyright may be transferred Act (OODA) [2]. without notice, after which this version may no longer be In networking, the process of analyzing the network traffic accessible] behavior is mainly known as Network Traffic Monitoring and arXiv:2106.06933v2 [cs.NI] 5 Aug 2021 Abstract—Network Traffic Classification (NTC) has become an Analysis (NTMA) [3]. NTMA has attracted much interest important feature in various network management operations, in recent years and become an important research topic in e.g., Quality of Service (QoS) provisioning and security services. Machine Learning (ML) algorithms as a popular approach for the field of communication systems and networks [4]. The NTC can promise reasonable accuracy in classification and deal importance of NTMA lies in the properties and challenges with encrypted traffic. However, ML-based NTC techniques of modern networking, e.g., heterogeneity, complexity, and suffer from the shortage of labeled traffic data which is the dynamicity, resulting in instability in data transmission [5]. case in many real-world applications. This study investigates the NTMA is an essential approach to measure the performance of applicability of an active form of ML, called Active Learning (AL), in NTC. AL reduces the need for a large number of applications and services, and to discover network inefficien- labeled examples by actively choosing the instances that should cies. Indeed, NTMA allows us to shed light on the functioning be labeled. The study first provides an overview of NTC and of communication systems and to deal with unexpected events, its fundamental challenges along with surveying the literature especially in complex and large-scale networks, such as the on ML-based NTC methods. Then, it introduces the concepts of Internet. AL, discusses it in the context of NTC, and review the literature in this field. Further, challenges and open issues in AL-based NTMA applications are generally categorized into eight classification of network traffic are discussed. Moreover, as a groups, including Network Traffic Classification (NTC), traffic technical survey, some experiments are conducted to show the prediction, fault management, network security, traffic routing, broad applicability of AL in NTC. The simulation results show congestion control, resource management, and Quality of that AL can achieve high accuracy with a small amount of data. Service (QoS) and Quality of Experience (QoE) management [6]. In this study, we focus on NTC as an important and open Index Terms—Survey, Network Traffic Classification, Active issue in NTMA. NTC refers to techniques for categorizing Learning, Machine Learning, NTMA network traffic into different classes based on their properties. The classification of network traffic is highly beneficial in I. I NTRODUCTION various network services from QoS (e.g., traffic policing and During the last decades, emerging new networking shaping) and pricing to malware and intrusion detection [7]. paradigms, such as Internet of Things (IoT), have introduced NTC provides detailed knowledge on network traffic, which various network management challenges. Given the prolif- is very useful for those who investigate the changes in traffic eration of IoT devices and the distinguishing characteristics characteristics and long-term requirements of networks [8], of IoT traffic, such as heterogeneity, spatio-temporal depen- e.g., Network Management and Orchestration (NMO) tools, dencies, dominating uplink traffic, and low duty-cycle traffic and performance management models. patterns, network management and monitoring has become NTC techniques can be broadly grouped into three cate- challenging. Gaining deep insight into such complex networks gories: port-based, payload-based, and flow-based methods for performance evaluation and network planning purposes is [9]. Port-based techniques associate a standard port number not a trivial task with respect to processing time, human effort, to a service or application, while payload-based methods and computational overhead. Understanding network traffic carefully inspect the content of the captured packets to classify behavior plays a vital role in a wide variety of network man- them. Last but not least, flow-based techniques utilize the agement aspects, e.g., fault management, accounting, security, network traffic flow characteristics (e.g., round-trip time and and network performance management [1]. Some general inter-arrival times) to associate produced traffic to the related approaches have been introduced to analyze the behavior of sources. The two latter methods cannot be used in some network types (e.g., Virtual Private Network (VPN)), or violate Amin Shahraki is with School of Computer Science, University College the privacy of users by accessing their personal data. Flow- Dublin, Ireland. Corresponding author e-mail: (am.shahraki@ieee.org) Mahmoud Abbasi was with Department of Computer Sciences, Islamic based techniques are the most common techniques for NTC Azad University, Mashhad, Iran, email: mahmoud.abbasi@ieee.org as instead of inspecting all packets passing through a given Amir Taherkordi is with the Department of Informatics (IFI), University of link, they examine network traffic flows or an aggregated form Oslo, Norway. email: amirhost@ifi.uio.no Anca Delia Jurcut is with Department of Computer Sciences, University of the network header packets information. As a result, the College Dublin, Dublin, Ireland, email: anca.jurcut@ucd.ie volume of data needed to be examined will be reduced, and
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING 2 the encrypted traffic will no longer be a problem. Flow-based CFS Correlation based Feature Selection techniques assume that each application’s traffic has almost CNN Convolutional Neural Network unique statistical or time-series features that can be utilized DDoS Distributed Denial of Service by classifiers to categorize both encrypted and regular traffics. DL Deep Learning In flow-based methods, the traffic classifier may leverage DPI Deep Packet Inspection Machine Learning (ML) algorithms to automate the classifica- EER Expected error reduction tion process, discover different traffic patterns produced by de- GAN Generative Adversarial Network vices, and classify encrypted traffic. Although ML algorithms i.i.d Identically and independently distributed are powerful techniques to classify network traffic flows [10], IDSs Intrusion Detection System [11], the accuracy of learning-based approaches is limited by IoT Internet of Things their need for a massive number of labeled instances. As the LAL Learning Active Learning authors in [12] mentioned, most of the real-world application LSTM Long Short-Term Memory data is semi-labeled or unlabeled data. Moreover, the data M2M Machine-to-Machine labeling process for ML tasks can be challenging in terms MAPE Monitor-Analyze-Plan-Execute of human effort and cost [13]. ML Machine Learning Fortunately, Active Learning (AL), as a sub-field of ML, is MLP Multi-layer Perceptron a promising approach to deal with the need for a huge amount NMO Network Management and Orchestration of labeled instances. AL aims to reduce the need for labeled NTC Network Traffic Classification examples by intelligently querying the labels during training. NTMA Network Traffic Monitoring and Analysis The query goes for the examples that the AL algorithm OODA Observe-Orient-Decide-Act believes will help build the best model [14]. Therefore, based P2P Peer-to-peer on the aforementioned challenges, AL can be considered as QBC Query-By-Committee an appropriate and efficient technique for flow-based NTC. QoE Quality of Experience Providing a thorough study on the usefulness of AL in NTC QoS Quality of Service and reviewing the state-of-the-art techniques in this field can RAE Relief Attribute Evaluation significantly help the network research community in better RAL Reinforcement AL adoption of AL for classification of network traffic in various RL Reinforcement Learning domains. To the best of our knowledge, this is the first and only SDAE Stacked Denoising Autoencoder study that technically reviews the efficiency and importance of SDN Software Defined Networking AL for NTC along with surveying the literature in this field. SFEM Structural Feature Extraction Methodology In this paper, we study the NTC techniques and discuss AL SVDD Support Vector Data Dscription as a useful approach in this field. The main contributions of SVM Support Vector Machine our work are summarized as follows: TLS Transport Layer Security UNC Uncertainty sampling • Discussing NTC techniques and their correlations with VAE Variational Autoencoder ML techniques VPN Virtual Private Network • Reviewing existing work in AL-based NTC WSNs Wireless Sensor Networks • Empirical evaluation of the performance of AL for NTC purposes • Discussing the challenges, and future directions in using II. R ELATED S URVEY A RTICLES AL for NTC There exist several literature studies reviewing the use The rest of this paper is structured as follows: In Section of ML techniques in communication systems and wireless II, we review existing survey works on traffic classification networks, e.g., [15], [16]. There are also some surveys that techniques. In Section III, we provide an overview of the focus on specific ML techniques, e.g., Deep Learning (DL) NTC problem and the use of ML techniques. Then, we devote [17] and Reinforcement Learning (RL) [18] , or specific types Section IV to discussing the fundamental elements of AL and of networking, e.g., Software Defined Networking (SDN) [19] query strategies. Next, in Section V, we discuss the advantages and optical networks [20]. Moreover, some survey works com- of using AL for NTC purposes and carry out a literature review pare, evaluate or review different techniques, e.g., ML-based on this topic. In Section VI, we evaluate the performance of techniques, heuristic models and statistical-based techniques AL in NTC. In Section VII, we discuss the challenges and for NTC e.g., [21]. Considering the volume of survey literature future directions in using AL for NTC, and finally we conclude in this field, in this section, we focus only on surveys that the paper in Section VIII. review NTC or the use of various ML techniques in NTC. L IST OF ABBREVIATIONS • General literature reviews on NTC: In [22], Dainotti AL Active Learning et al. reviewed the issues and future research directions ALBL AL by learning of NTC, especially in case of applicability, reliability ASVM AL Support Vector Machine and privacy. They outlined the research and policy future directions of NTC, e.g., validating the NTC models, effect
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING 3 of network speed in NTC and NTC tools. In [23], Fin- Pacheco et al. comprehensively surveyed the use of ML sterbusch et al. reviewed the payload-based NTC based techniques in NTC for different cases, e.g., encrypted on Deep Packet Inspection (DPI). They also practically network traffic. By understanding the challenges of using analysed the most significant open-source DPI modules ML techniques in NTC, they studied the reliable label to show their performance in terms of accuracy and assignment, dynamic feature selection, integrating the requirements. Additionally, they provided a guideline on meta-learning processes. They considered these solutions how to design and implement DPI-based NTC modules. to solve several issues, including imbalance network data, In [24], Velan et al. studied NTC models for encrypted dynamicity of networks, and online strategies for re- network traffics to measure the traffic and improve the training the ML models. security, e.g., detecting anomalies. They have reviewed In Table I, a summary of the surveys above is provided different types of encrypted traffics and how payload- based on their vision of NTC, the reviewed solutions, network based and feature-based NTC techniques can classify type and practical evaluation of studied solutions. As indicated encrypted network traffics. Zhao et al. [7] reviewed the in the table, our survey is for flow-based NTC for the use use of NTC in IoT and Machine-to-Machine (M2M) in Internet communications and specifically considers AL as networks. They reviewed the current NTC within the IoT one of the most important ML-based solution. To the best context based on the differences between IoT and non-IoT of our knowledge, our study is one of the rare literature network traffics. By reviewing the literature, the authors surveys that evaluates such specific ML solutions for NTC as showed that in IoT research area, most of NTC techniques most of existing surveys consider general ML models, e.g., are proposed to solve security challenges. The authors in supervised learning solutions for NTC. Studying AL-based [25] reviewed the NTC techniques, i.e., statistics-based solutions makes our work different from all existing survey classification, correlation-based classification, behavior- works. based classification, payload-based classification, and port-based classification. They also quantified classifica- III. OVERVIEW ON NTC AND ML tion granularity based on four levels, i.e., application type layer, protocol layer, application layer and service layer. In NTC, one should clarify the goals of classification based Last but not least, they classified network traffic features on the intended use, such as for accounting purposes, malware and the existing public datasets that are commonly used detection, intrusion detection, providing QoS, and identifying in the proposed NTC techniques. types of applications based on the network traffic (e.g., VPN • Literature reviews on the use of ML in NTC: As one of and nonVPN traffics or Tor and nonTor traffics). Indeed, there the earliest study in the use of ML in NTC, Nguyen et are different factors that one can use to categorize network al. [26] reviewed the literature between the years 2004 to traffic, including applications (e.g., Facebook and Hangouts), 2007. They studied how ML models can be employed protocols (e.g., HTTP and BitTorrent), traffic types (e.g., Web for NTC in IP networks, e.g., clustering approaches, Browsing and Chat), browsers (e.g., Firefox and Chrome), supervised learning approaches and hybrid approaches. operating systems, and websites. Therefore, the purpose is to They also reviewed the literature that compares ML tech- determine the label of each network flow truly, e.g., browsing, niques or non-ML techniques for NTC. They mentioned interactive, and video stream. NTC can be further categorized that offline analysis models, e.g., AutoClass, Decision into online and offline classification. In online NTC, the input Tree and Naive Bayes can achieve a high accuracy for traffic needs to be classified in a real-time or near real-time about 99%. They also outlined some critical operational manner (e.g., QoS provisioning). On the other hand, offline requirements for real-time NTCs models compared to classification is appropriate for applications such as anomaly offline models. In [21], Singh evaluated the unsupervised detection and billing systems. Despite their importance, exist- ML techniques including K-means and Expectation Max- ing NTC techniques suffer from general networking challenges imization algorithm for NTC. The results show that the as listed below: accuracy of K-Means is better than Expectation Maxi- • While the literature on traffic classification is mature mization algorithm. In [27], Perera et al. compared six to adapt to old-fashioned networking paradigms, e.g., ML algorithms including Naive Bayes, Bayes Net, Naive legacy cellular systems, the dramatic growth and evolu- Bayes Tree, Random Forest, Decision Tree and Multi- tion of online applications and services have made traffic player Perceptron along with two feature extraction tech- classification a non-trivial task. Due to the traffic char- niques, i.e., Correlation based Feature Selection (CFS) acteristics of modern networks, e.g., being large-scale, and Relief Attribute Evaluation (RAE). Their results show heterogeneity, multimodal data, and big data, emerging that Decision Tree and Random Forest have better perfor- NTC methods must meet strict requirements in terms mance compared to other techniques. In [28], Gomez et of system performance, accuracy, and robustness. For al. compared seven ensemble ML techniques including example, the vast amount of raw data generated by IoT OneVsRest, OneVsOne, Error-Correcting Output-code, and cellular devices pose severe challenges to ML-based Adaboost classifier, Bagging algorithm, Random Forest NTC methods as they need clean and pre-processed data and Extremely Randomized Trees which are all based for training purposes. on decision trees in NTC. They compared them in case • NTC is a multi-factor procedure in which an automated of model accuracy, latency and byte accuracy. In [29], program categorizes the network traffic based on the
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING 4 Table I: An overview of existing literature surveys on NTC and ML. Practical Study Year NTC vision Reviewed Solution(s) Type of network Evaluation [26] 2008 Analysing Statistical traffic ML Solutions IP networks No Characteristics [22] 2012 General NTC Not Specified TCP Networks No [23] 2014 Payload-Based NTC DPI-based techniques Internet Yes techniques [21] 2015 Comparative Study Comparing unsupervised ML techniques Internet Yes [24] 2015 Analysing encrypted network ML techniques and hybrid techniques Not Specified No traffic by payload-based and feature-based NTC technique [27] 2017 Comparative Study Comparing six ML Solutions Communication Networks Yes [28] 2017 Comparative Study Comparing Decision-tree based ensemble techniques Internet Yes [29] 2018 ML-based NTC Most existing ML solutions IP Networks No [7] 2020 NTC for M2M network traffic Generic solutions IoT No [25] 2021 Reviewing various types of Most existing ML solutions Internet No NTC models Our Study 2021 Flow-based NTC Active Learning Internet Yes network traffic features, e.g., types of network protocols, A. Data gathering applications, hosts, etc. As a challenge, NTC techniques Since ML algorithms learn to classify the data based on need to select the best features to classify the network sample datasets, representative data must be collected as the traffic with high accuracy, while each of them can be ef- data gathering step. While a few publicly available network ficient or inefficient from one network to another network. traffic datasets have been released, using these to train a In other words, feature engineering is a challenge when traffic classification model can be difficult [33]. In addition, it comes to using classical ML for traffic classification. since the behavior of the network traffic is different from one • The recent increase of encrypted network traffic and network to another one, it is highly recommended to train protocol encapsulation methods limit the effectiveness the ML algorithm for the target network [2]. Additionally, of many traffic classification techniques since the packet the number of network traffic classes can be high, and it is inspection techniques are unable to extract network man- rather impractical to consider all classes in one public dataset. agement information from network traffics. For example, Furthermore, there are a variety of data gathering and labeling a significant portion of the Internet traffic is associated techniques that lead to different feature sets. Hence, in real- with Peer-to-peer (P2P) applications. However, classifi- world applications, the goal is to use datasets that are tailored cation of P2P traffic is a difficult task [7] as many P2P to the intended use of NTC, mainly gathered from the target applications, such as online video and P2P downloading, network. use encryption and obfuscation protocols to remove the limitations posed by Internet service providers. B. Data pre-processing To overcome the above challenges, various techniques have been introduced, e.g., graphical techniques, statistical methods After gathering, the data must be pre-processed such that and ML-based methods [24]. In the scope of ML, various it is represented in a form that the target ML algorithms can solutions for port-based, payload-based, and flow-based have discover different patterns. In traffic classification, header data been proposed as the most promising solutions for NTC [30] and payload are two major data structures. These structures [31]. Multiple steps are needed for building a ML-based often need to be pre-processed because they contain irrelevant network traffic classifier as presented in [32]. Figure 1 shows or redundant information, such as network management data, a graphical description of all steps. In the rest of this section, which is not needed for traffic classification, e.g., source and we discuss each individual step. destination IP addresses, and protocol information. Moreover, changes in the distribution of packet-level features can occur in real-world environments because of unexpected events like Steps towards building a ML-based network traffic classifier the re-transmission of packets. In short, performing some Data gathering Data pre- Feature Model Model pre-processing steps such as packet filtering, elimination of processing engineering selection evaluation noisy samples, header removal, and data quality assessment is needed to ease the learning process for the ML algorithms Public Packet Time series Header+ datasets filtering features time series [34]. Header-related Header+ Exclusive Header features payload datasets removal Data quality assessment Statistical features Statistical features C. Feature engineering Conventional classification solutions, e.g., ML- and Header removal statistical-based techniques, need to go through a feature engineering procedure, in which domain knowledge is used to Figure 1: The main steps in building a network traffic classifier. extract features or patterns from the raw data [35], [23]. Fea- ture engineering is a crucial step in ML-based NTC methods
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING 5 because of the fact that choosing appropriate features can ease features of network flows generated by different services the difficulties of the modelling phase, and vice versa [36]. It is or applications are almost unique. Nevertheless, a big worth mentioning that considering privacy, the risk associated challenge with the methods that use statistical features with feature engineering and representation procedures is also is that they are not suitable for online classification. This crucially important, especially in the payload feature-based is mainly due to the fact that a classifier needs to monitor techniques. Indeed, there are some legal restrictions on using the entire or significant part of a network flow in order payload-based methods in many environments or recognizing to extract statistical features. all communication protocols. This is mainly due to the user’s privacy policies, as such methods inspect the content of the D. Model selection network packets [37]. Generally, there are four major types of input features for Another step towards building a traffic classifier is selecting NTC: the right ML model. In the context of ML, choosing a model can carry different meanings, such as the selection of hyper- • Time series: Considering time series related features, parameters and parameters, as well as algorithm selection. one can refer to maximum packet inter-arrival time, Given NTC, several factors can be involved in the selection of maximum number of bytes in packet, and inter-packet the classification model (e.g., model performance, available timings. According to [38], the length of time series (or resources, model complexity, and feature selection). One of the number of packets within a flow) has a visible effect the most significant factors is feature selection. This is due on classification accuracy and computational overhead. to the fact that there is a direct correlation between features Specifically, increasing the number of considered packets and input dimensions of the model, and consequently the can improve the classification performance but at the computational and memory complexities of the model, which cost of higher computational overhead. In [38], only are crucial factors in NTC. This implies that the dimensions the first 20 traffic packets in a flow are used for the and structure of input data for training purposes should be experiments. The authors in [39] use the time-series optimized. Moreover, the selected features directly affect the features of packets, e.g., source and destination ports, performance of the final learning task (e.g., classification and payload size, and TCP window size (bytes) as input for regression) and the dimensions of the input data for training. a semi-supervised model to perform traffic classification Hence, one should consider the right number of informative related to the five Google services, including Hangout features. In the context of traffic classification, it may be not Chat, Hangout Voice Call, YouTube, File transfer, and sufficient to consider the model performance as the only factor Google play music. The simulation result shows excellent for model selection. Thus, one can also consider other criteria, accuracy, despite using a limited number of labeled data such as training time and model explainability. samples. This is mainly because they conducted a pre- training step on the entire unlabeled network flows in E. Model Evaluation order to learn statistical features, and then they re-trained the model using a small labeled dataset for fine-tuning. Finally, the evaluation of the selected model is the final • Header: The header of a network packet contains infor- step in building a network traffic classifier. In this step, the mation related to different layers (e.g., the network layer). performance of the ML model on unseen data is measured. Features such as port number and protocol number are The ML model should be able to give accurate predictions widely used as informative features in traffic classification to be useful for the given task. However, the accuracy is not tasks. However, some modern NTC techniques, especially the only evaluation metric for a classification task, and other DL-based, accept entire packets as the input feature. For metrics such as confusion matrix, F1 score, recall, etc. should example, in [40] the authors used hexadecimal raw packet be considered. NTC is a classification task, and we use the header and convolutional networks to classify Tor/non- same metrics to evaluate the performance of the proposed Tor traffic. To this end, they utilized TCP/IP headers, model. especially the first 54 bytes of packets, because TCP is associated with around 90% of all the Internet traffic. F. Existing Work • Payload: NTC techniques can also use layer-related information above the transport layer to classify network Recently, several ML techniques have been proposed for traffic. As a prime example, in [41] the authors utilize network traffic classification. In this subsection, we categorize BitTorrent handshake packets on layer 4 to classify the existing work in the literature based on the goals of network BitTorrent traffic. BT generates the highest amount of traffic classification, including identifying applications (also P2P traffic. Moreover, some works use packets related to called apps), cyber security purposes, fault detection, website the Transport Layer Security (TLS) handshake process to fingerprinting, user activities identification, and operating sys- identify HTTPS services [42]. tems identification. We discuss these goals in more details in • Statistical features: The statistical features of network the sequel. flows, such as minimum inter-arrival time and size of • Mobile apps identification: This goal refers to analyzing the IP packets can be used for NTC [43]. The main and finally identifying the network traffic related to a idea behind using statistical features is that the statistical particular mobile app. Given the ever-increasing number
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING 6 of mobile apps, network administrators and telecommu- al. utilize federated learning for malware detection in IoT nications companies are actively looking for rigorous devices through one supervised model (based on Multi- methods to secure their infrastructure. Apps identification layer Perceptron (MLP)) and one unsupervised model based on analyzing the network traffic of mobile apps can (based on autoencoder). To evaluate the framework, they assist network administrators with resource management use N-BaIoT dataset, which models the traffic of IoT and planning, and app-specific policy establishment (e.g., systems impacted by malware. In [52], McLaughlin et security policy establishment and access management for al. present a DL-based method for Android malware a specific app). Furthermore, the identification of apps can detection using the raw opcode sequence as the in- help protect smartphone platforms (e.g., Android) against put of a CNN model which can automatically learn emerging security threats and uncover sensitive apps. the features of malware instances. The authors claimed Moreover, by app identification, it is possible to forbid the that the proposed method has a more straightforward use of some particular apps (e.g., Google+ and Instagram) training pipeline than the previously proposed works in an enterprise network [44]. Several papers have been (e.g., n-gram-based malware detection). Huang et al. [53] published on app identification. Ajaeiya et al. in [44] combine the unsupervised spatiotemporal encoder with present a framework for the classification of Android LSTM to detect abnormal network traffic. The spatial apps. The proposed framework identifies apps traffic from feature of network traffic data was extracted in the first a network viewpoint without adding any overhead on stage by the spatiotemporal model. Then, the obtained users’ mobile phones. Moreover, the authors provide a features are used to train another LSTM layer for the pre-processing method for traffic flows to extract the classification purpose. NSL-KDD dataset was used for most informative features for ML-based techniques. The the evaluation of the model. Based on the experimental work in [45] leverages Variational Autoencoder (VAE) for results, using the proposed DL model, the efficiency of the identification of mobile apps. The authors claimed intrusion detection is significantly high compared to the that their method is able to label a massive number of traditional techniques. instances and extract the features in mobile apps traffic • Fault detection: Fault detection is part of a more ex- automatically. To this end, the authors first transform the tensive network management process, called fault man- mobile apps traffic to meaningful images, and then use agement. Fault management points to a set of processes VAE as a classifier. Similar work was carried out by Wang to detect, isolate, and then correct unusual situations of et al. in [46], in which the authors design three DL- a network. Failure occurs when a system (e.g., an IoT based models, including Stacked Denoising Autoencoder network) cannot adequately provide a service, where a (SDAE), 1D Convolutional Neural Network (CNN), and fault is the source cause of a failure. Fault manage- Long Short-Term Memory (LSTM) for mobile apps ment, especially fault detection, play an essential role in identifications. The authors in [47] provide a multi- today’s network management (e.g., QoS provisioning). classification scheme for the classification of mobile apps Hence, many works have been conducted to improve traffic. More specifically, they combine several mobile the fault management process. In [54], Huang et al. traffic classifiers’ decisions (knowledge) to classify their survey fault detection techniques in IoT networks and traffic samples. introduce a fault-detection framework for Self-Driving • Cybersecurity purposes: One of the main goals of Network (SelfDN)-enabled IoT. Moreover, the authors traffic classification is detecting security breaches in propose an algorithm called Gaussian Bernoulli restricted communication systems, e.g., intrusion detection, mal- Boltzmann machines auto-encoder to change the fault- ware detection, anomaly detection, and worm detection. detection into a classification task. The simulation result Cybersecurity tools/techniques (e.g., intrusion detection demonstrates the superiority of the proposed method to systems) aim to defend communication systems from other adopted methods, such as linear discriminant anal- internal/external threats. Traffic classification methods ysis and SVM. In [55], the authors focus on the problem can be used to assess network traffic behavior through of cell coverage degradation detection through a deep detecting malicious traffic flow/link, and then prevent neural network. They propose a deep recurrent model attacks. A large body of work in the literature has for diagnosing cell radio performance deterioration and focused on ML-based malware and intrusion detection. complete cell outages in a mobile phone network. In [56], The authors in [48] propose an intrusion detection ap- Noshad et al. adopt the Random Forest classifier for fault proach based on deep neural networks and compare the detection in Wireless Sensor Networks (WSNs). They use performance of DL with classical ML classifiers, demon- a dataset with six types of faults at the sensor levels for strating the superiority of DL models. Similarly, in [49], performance evaluation, such as data loss, offset, and out- Shone et al. propose a non-symmetric deep auto-encoder- of-bounds. Moreover, they compare the performance of based learning solution for intrusion detection. The auto- the proposed method with other well-known techniques, encoder network has been used for learning features in e.g., MLP, CNN, and probabilistic neural networks. an unsupervised manner. Then, they employ a stacked • Website fingerprinting: It refers to methods for identify- non-symmetric auto-encoder as a traffic classifier. In [50], ing and collecting data about websites visited by a mobile Nguyen et al. propose a federated self-learning method to device, which is essential for the advertising industry, detect anomalies in IoT systems. Similarly, in [51], Rey et identifying the characteristics of attacks (e.g., botnets
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING 7 and sniffing) and protecting users’ privacy. Website fin- obtain the more stable traffic features. Hou et al. in [63] gerprinting can help recognition of fraudsters and other categorize user activities of the WeChat application by unusual activities. Moreover, website fingerprinting can performing a detailed analysis on the encryption protocol be considered as a type of traffic analysis attack that of this application, called MMTLS, to find the typical user allows eavesdroppers to get information on the victim’s activities of the application (e.g., advertisement click and activities. Given the importance of website fingerprinting, browsing moments). Then, they adopt different learning there is a large body of literature on this topic. In algorithms, such as Naive Bayes, Random forest, and [57], Rahman et al. leverage the idea of adversarial ML Logistic Regression, to classify these activities. to defend users against website fingerprinting attackers. • Operating systems identification: This refers to identi- The authors propose a method to generate adversarial fying the operating system installed on a mobile device examples to decline the accuracy of the attacks that use by analyzing its generated traffic. Adversaries can use learning-based techniques for robust traffic classification. operating systems identification to launch more serious The simulation results show that the proposed method attacks against a specific mobile operating system. More- can decline the accuracy of the state-of-the-art attack over, it is desirable to use this analysis to investigate the by half. The work in [58] focuses on the concept drift popularity of the mobile operating systems (e.g., Android problem in static website fingerprinting attacks for the and iOS) among users. Hagos et al. in [64] introduce a Tor network. The authors refer to the fact that it is costly learning-based technique for passive operating systems to update static attacks in dataset updating and retrain fingerprinting. They use classical ML (i.e., Support Vec- the model. Hence, they introduce AdaWFPA, an adap- tor Machine (SVM), Random Forest, k-nearest neighbors, tive online website fingerprinting attack that leverages and Naive Bayes) and DL algorithms (i.e., MLP and adaptive stream mining techniques. Luo et al. in [59] LSTM ) for classification purposes. Moreover, the authors propose Random Bidirectional Padding (RBP), a website propose to use the underlying TCP variant as a practical fingerprinting obfuscation technique against intelligent feature for improving classification accuracy. The authors fingerprinting attacks. It uses time sampling and random in [65] compare the performance of the ML-based tech- bidirectional packets padding to change the inter-arrival niques, such as k-nearest neighbors and Decision Tree, time characteristics in the traffic flow, and consequently, with the traditional commercial rule-based strategy for to identify more complex patterns in network packets. operating systems fingerprinting. The simulation result • User activities identification: Such traffic analysis can demonstrates the superiority of the learning-based tech- be used to obtain exciting pieces of information about a niques to the traditional method. Lastovicka et al. in [66] specific action that a mobile subscriber carries out on investigate the performance of the three famous operating his/her device (e.g., posting a video on Twitter). The system fingerprinting techniques, including user-agent, identification of the user activities may also be made TCP/IP parameters fingerprint, and specific domains com- to get information about a specific activity, such as the munication. Performance measures reveal that the method length of a message sent by a user within a particular chat based on user-agents provides better performance than its application. User activity identification can be utilized counterparts. by adversaries/researchers to reveal the identity behind an unknown user, e.g., in a social media, that prefers IV. OVERVIEW ON AL to remain anonymous. This can be done by behavioral profiling for the users of a network, which is helpful for A supervised machine learns to discriminate the different identifying reconnaissance within the network. Moreover, traffic classes by being trained on labeled training data. While such traffic analysis offers a possibility to character- capturing large quantities of network data is relatively easy, ize the users’ habits in a network, e.g., chatting with analysing the data by ML techniques can be a very time- friends in the morning and watching the video stream consuming, expensive, or human-labor intensive process. This in the evening. The user’s behavior information can be is mainly because of the complexity of ML techniques or the employed next time to detect the user presence in the shortage of labeled data resulting in inefficient training. In network. In [60], Conti et al. use ML techniques (i.e., Dy- order to reduce the number of needed labeled examples and, namic Time Warping (DTW), hierarchical clustering, and consequently, reduce the effect of ground truth challenge, AL Random Forest) for analyzing Android encrypted network can be used to facilitate labeling. traffic, and consequently, to identify user actions (e.g., AL systems can participate in the gathering and selection email actions, including sending email, replying, and of training instances, such that only the most informative Facebook actions). The authors in [61] leverage transfer examples are required to be labeled. Using AL, a learner learning to analyze encrypted mobile traffic to deal with follows an iterative strategy in which it interacts with an oracle the problem of diversity of app releases, mobile operating to choose the most useful data instances to be labeled, thereby, systems, and model of devices, and identify user actions. it reduces the cost of data labeling by using only a few labeled The work in [62] focuses on the identification of the examples to deliver satisfactory performance in a reasonable Instagram user behavior. Unlike previous works that used time. The AL paradigm is illustrated in Fig. 2, in which the the statistical features of encrypted traffic, this work three core components are: query strategy, annotator, and provides a new technique based on maximum entropy to ML model. The query strategy is responsible for choosing
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING 8 unlabeled data according to a pre-defined policy. A label the performance of the query strategies in Section VI. Note is then provided for the selected data by a human/machine that in ML terminology, hypothesis space refers to the all annotator, and the data is added to the set of training instances. possible legal hypotheses, where a hypothesis is a particular Afterwards, the model is updated, and the process repeated computational model that best explains the target data in as long as new data is available, or a stopping criterion is supervised ML. In active learning settings, a query strategy can satisfied. Different stopping criteria can be defined to end search the hypothesis space through testing unlabeled samples this iterative process, such as reaching the desired accuracy, to reduce the number of legal hypotheses under attention. running time, or a maximum number of queries, which can directly affect the performance of using AL. • Uncertainty sampling (UNC): In UNC, a learner prefers There are mainly two AL scenarios to consider, namely, to label the instances where the model is most uncertain stream-based selective sampling and pool-based sampling about the class of the example. The idea behind the strat- (presented in Fig. 3). In the former, the distribution of un- egy is that those examples on which the model exhibits labeled instances is known, and the instances are considered the most degree of uncertainty are most likely to improve one at a time. The learner then observes each instance in the performance of the model over time. Different criteria, sequence and decides whether the instance should be labeled also called uncertainty strategies, for measuring uncer- or discarded. AL is a promising technique to alleviate the tainty, have been proposed including posterior probability, challenge of streaming-based learning scenarios [67], [68]. smallest margin, and entropy [70]. Entropy is one of the AL algorithms designed for streaming scenarios can control most popular uncertainty strategies in many AL problems. the labeling process and gradually perform this process over In an n-class classification problem, assume the estimated time [69]. Using this strategy, it is expected that the labeling probabilities of the n classes are p1 , . . . , pn , respectively. process will be in balance and the algorithms will detect Given the currently labeledPdata instances, the entropy n the changes. In the case of pool-based sampling, a pool of is defined as E(X) = − i=1 pi . log(pi ). Given this unlabeled data is provided, and the aim of the learner is to expression, a larger value of the entropy means a higher select the most informative instances from the pool to be level of uncertainty. Accordingly, this objective function labeled by the annotator. Pool-based sampling is attractive for can be considered as a maximization problem. many real-world learning scenarios as it is possible to collect • Query-By-Committee (QBC): In QBC, an AL system a large body of unlabeled data at once. Pool-based sampling consists of a committee of different learners trained on presumes that a limited amount of labeled data and a big pool the current labeled data. These learners are then used of unlabeled data are available. to make a prediction on the labels of unlabeled data. The instances for which the committee members disagree the most on the correct label are selected for labeling. A. Active learning query strategies Then, the committee of learners will use the new labeled The fundamental question in AL is that what is the most data examples for training purposes. The QBC strategy effective strategy for querying data instances? In NTC applica- creates wider diversity than UNC because it considers tions, different query strategies can be used based on various the differences in the predictions of several different network circumstances, e.g., new unknown flows, changes in learners, instead of measuring the level of uncertainty the behavior of network traffic, and discovering unclassified of labeling using only a single learner. However, the network traffics. We first, introduce the most well-known query technique for measuring the disagreement is often similar strategies of AL widely used in literature and then evaluate for both query strategies [71]. In the QBC strategy, the vote entropy and KL-divergence metrics are usually applied to measure the disagreement. In the literature, to Train a model Requiring new data construct a committee of learners, two major approaches Machine learning model have been proposed. In the former, one can change the parameters/hyperparameters of a particular model (e.g., by sampling) in order to generate different models and, consequently, the committee models (or learners). In Unlabeled contrast, in the latter, the committee is built by a bag data Labeled data of different learners (i.e., ensemble of learners). • Learning Active Learning (LAL): The main idea behind this strategy is to train a regressor that forecasts the Ex- A dd gy in pected error reduction (EER) for an instance in a specific te g ra to st learning state. Indeed, this technique formulates the query th ry e ue tr ai Q strategy of unlabeled data as a regression problem. Then, ni ng da regarding a trained classifier and its output for specific ta Annotator unlabelled instance, the Learning Active Learning (LAL) (human or machine) forecasts the decrease in generalization error that can be reached by labeling that instance. The interested readers Figure 2: Graphical description of active learning. are referred to read [72] for details.
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING 9 Train a model Machine Train a model learning model Machine learning model Observe an Select the most instance informative instance Labeled Make Input source data decision Labeled Labeled data data gy La he te ad gy to Discard ra be tra d te t st La e to ra l a ini ry st be tra th nd ng ue ry l a ini ad da Q ue nd ng d ta Q da ta Annotator (human or machine) Annotator (human or machine) (a) Stream-based sampling (b) Pool-based sampling Figure 3: (a) Stream-based selective sampling, and (b) Pool-based sampling. • Random: It refers to the conventional supervised learning to increase the accuracy of the method or recognize new scheme in which instances are randomly selected to be applications, protocols, or protocol versions. The update is labeled. Since data labeling is an expensive procedure, essentially performed using new labeled data. random sampling may not lead to the best learner, es- AL is a promising research field in this context as it greatly pecially when the query of each sample is costly, and reduces the cost of training and dramatically speeds up the consequently, few labels will finally be available [71]. learning phase [74]. This is advantageous to ML-based traffic • Information Density (Density): Uncertainty sampling, classification methods to better satisfy the aforementioned QBC, and LAL query strategies are all prone to choosing requirements, precisely data requirements and the need for outliers or unrepresentative instances and, consequently, updating to identify new types of traffic through attaching this can lead to sub-optimal queries. A solution is to labels on the most informative instances and the need for use the representativeness of an instance to ensure the updating to identify new types of traffic. selected instances resemble the overall distribution. When considering whether to query an instance, a combination A. Advantages of using AL for NTC purposes of representativeness and the informativeness instances is typically used [73]. In the density query, to measure AL is potentially a good candidate to perform NTC. Below, the representativeness of a data instance, the closeness we summarize the advantage of using AL techniques in the of the data instance to all other data instances is often field of NTC: considered. • Less amount of data needed for labeling: As mentioned before, most conventional networks generate unlabeled V. ACTIVE LEARNING FOR NETWORK TRAFFIC and semi-labeled data. Meanwhile, one of the key chal- CLASSIFICATION lenges to use the learning-based techniques for NTMA As explained in Section III, NTC has attracted much in- is the lack or limited accessibility of labeled instances. terest in recent years and different ML methods have been Moreover, data labeling is not often a straightforward proposed to solve the NTC problem. However, most of these procedure and can raise the cost in terms of time, human methods suffer from various challenges such as requiring effort, and the computational overhead. Other than that, a large amount of fully labeled data, existence of a con- if data labeling is performed manually or by online siderable amount of semi-labeled or unlabeled data in real- tools, it can reduce the data quality, since not all data world network scenarios, and complex, costly, and time- instances are informative. AL can tackle this concern by consuming methodology for data labeling. Providing labels to labeling only the most informative instances. To this end, data instances is especially challenging for NTC techniques, a comprehensive set of querying strategies in AL has been because one must consider several requirements in terms of proposed to determine the quality of instances for labeling traffic data granularity in order to satisfy the desired traffic [14]. classification objectives. One can, for example, refer to classes • Concept Drift: Due to high dynamicity of computer on the application level (i.e., Skype or Facebook), protocols networks, ML techniques must be re-trained frequently level (i.e., TCP or HTTP), or at the service group level (i.e., because of various reasons, e.g., new network behavior browsing or streaming) as typical examples of data granularity and new classes of network traffic [75]. In most ML [24]. Moreover, updating a traffic classification method is time- techniques, such as DL, retraining a model from scratch consuming. However, updating the models may be needed is a resource-intensive task in terms of time and power
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING 10 computation in addition to their need for huge amount B. Literature Review on using AL in NTC of new data samples. Most well-known ML techniques become useless in NTC as the network cannot be unat- In this section, we review existing work on the application tended for a long time due to retraining purposes. AL is of AL in NTC. able to (re-)train the models very fast with high accuracy Torres textitet al. [80] proposed a botnet detection technique by continuous provisioning of new labeled instances. This based on AL. The authors provided a novel AL strategy to is demonstrated in Section VI where AL performance is label network traffic that contains normal and botnet traffics. evaluated with regard to the training time. The AL strategy is used to create a random forest model • Dealing with the shortage of labeled data samples: In that benefits from the user’s previously-labeled instances. The case of retraining, the number of labeled samples to train primary objective of the proposed technique is to help the user the model is very limited due to the cost of labelling in the labeling process. Similarly, the work in [81] employed process, e.g., time, complexity, need for domain knowl- AL for a security purpose, i.e., malware classification. In this edge, etc. Most ML models, e.g., DL, need a considerable work, SVMs and AL by learning (AL) have been combined amount of data to train. As shown in Section VI, AL to tackle the lack of labeled instances in malware detection. can train the model with a high accuracy using a limited The simulation results reveal that using AL can enhance the number of data samples. performance of classification in terms of accuracy and the • Incremental Learning: Although AL is not essentially quality of labeled instances. In addition, the authors claimed considered as an online learning technique, using the that by using different training algorithms, e.g., Generative stream-based sampling can possibly turns it into an Adversarial Networks (GANs), one can solve issues such as incremental learning technique to be adaptable with the the diversity of security-related datasets. nature of highly dynamic networks. As most of conven- The work in [82] is another attempt to develop an accurate tional and emerging networking paradigms are highly malware detection system. The system is based on AL, where dynamic in different aspects, AL can be used to learn a new Structural Feature Extraction Methodology (SFEM) is the behavior of network traffic online. In addition, pool- introduced to extract from docx files. The proposed system is based sampling can help reduce the time complexity of able to identify new unknown malicious docx files. To have learning from scratch, as the number of training samples an updatable detection model and identify new malicious files, becomes limited. Although labeling is a time-consuming the system benefits from AL to update and complement the task, using different query strategies based on the network signature database with new unknown malware. traffic circumstances can reduce the time complexity of Common cybersecurity attack vectors, such as viruses, bot- learning. nets, and malware are known for Intrusion Detection Systems • Monitoring incoming stream traffic: Using passive learn- (IDSss). Nevertheless, malicious users continuously create ing methods for NTC tasks, such as security and intrusion new attacks that can bypass the IDSss. Analyzing anoma- detection is no longer reasonable, as these methods lous behaviors calls for a considerable amount of time and cannot handle changes in the statistical characteristics effort. Preparing a significant of labeled data for the training of the target data (i.e., concept drift). To address this process is both increasingly costly and inefficient, because issue, one can investigate the great abilities of stream- of the continuous design of new attacks. In this case, one based AL [76]. Several AL-based strategies have been can use AL to reduce the number of the required labeled proposed to detect concept drift and instantly adapt to instances, while increasing the accuracy of anomaly detection. evolving characteristics of data [77] [78] [79]. In [83], a semi-supervised IDS has been designed that works • Addressing Theory of network: In Internet Engineering effectively with a small number of labeled instances. The Task Force 97 (IETF97)1 , the challenge is introduced as proposed learning algorithm for the IDS benefits from two networks suffer from the lack of a unified theory that can ML techniques, including AL Support Vector Machine (AL) be applied to all networks. It means that the behaviors of and Fuzzy C-Means clustering. Furthermore, [83] reported different networks are various based on their topology, that the proposed learning algorithm enables the IDSs to add equipment, scale, applications, etc. Theory of Network new training instances with minimum computational overhead. causes an important problem that ML techniques should Due to the fact that domain knowledge is required for the be trained for each network separately. AL can be con- annotations of unlabeled instances, adopting new cost-effective sidered as a suitable online learning choice in such cases labeling techniques is desired. To this end, the work in [84] thanks to its ability to be learned by a limited number by Beaugnon et al. developed an interactive labeling strategy, of data samples. This is beneficial for highly dynamic namely ILAB, to assist the experts in the labeling process of networks with a huge volume of starting and stopping large intrusion detection datasets. ILAB adopts divide and con- network traffics. AL also allows frequent retraining which quer approach to lower the computation cost. Deka et al. [85] eliminates the necessity of using representative datasets. investigated the important role of AL in the selection of more informative instances. Then, they used these instances to train a binary IDSs for Distributed Denial of Service (DDoS) attack classification. In addition, since there are massive amounts of traffic in modern networks, a parallel computation method has 1 https://www.ietf.org/blog/reflections-ietf-97/ been employed. The authors referred to this fact that using AL
You can also read