Android Malware Detection using Large-scale Network Representation Learning - arXiv
Android Malware Detection using Large-scale Network Representation Learning

Rui Zhu, Chenglin Li, Di Niu
University of Alberta, Edmonton, AB, Canada
{rzhu3,ch11,dniu}@ualberta.ca

Hongwen Zhang, Husam Kinawi
Wedge Networks Inc., Calgary, AB, Canada
{hongwen.zhang,husam.kinawi}@wedgenetworks.com

arXiv:1806.04847v1 [cs.CR] 13 Jun 2018

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). © 2018 Copyright held by the owner/author(s).

ABSTRACT
With the growth of mobile devices and applications, the amount of malicious software, or malware, has been increasing rapidly in recent years, which calls for the development of advanced and effective malware detection approaches. Traditional methods such as signature-based ones cannot defend users from an increasing number of new types of malware or rapid malware behavior changes. In this paper, we propose a new Android malware detection approach based on deep learning and static analysis. Instead of using Application Programming Interfaces (APIs) only, we further analyze the source code of Android applications and create their higher-level graphical semantics, which makes it harder for attackers to evade detection. In particular, we use a call graph from method invocations in an Android application to represent the application, and further analyze method attributes to form a structured Program Representation Graph (PRG) with node attributes. Then, we use a graph convolutional network (GCN) to yield a graph representation of the application by embedding the entire graph into a dense vector, and classify whether it is malware or not. To efficiently train such a graph convolutional network, we propose a batch training scheme that allows multiple heterogeneous graphs to be input as a batch. To the best of our knowledge, this is the first work to use graph representation learning for malware detection. We conduct extensive experiments on real-world sample collections and demonstrate that our developed system outperforms multiple other existing malware detection techniques.

KEYWORDS
Android Malware Detection; Call Graph; Graph Convolution Networks

1 INTRODUCTION
Recent years have witnessed the rapid growth of smart phone usage in daily life, e.g., for online shopping, online banking, entertainment, and even for remote control. As the major operating system for smart phones, Android is now powering tablets, TVs, wearable devices and even embedded systems in cars and IoT devices. The large market share of Android and its open-source development ecosystem have not only brought about opportunities for Android applications, but also posed challenges to defending against attacks from a proliferation of malware (short for malicious software). Due to a lack of trustworthy review methods, it is possible that some developers may upload their Android apps with malicious components, which can be found in a number of third-party Android markets, and even in Google's official Android market, Google Play. According to a report [22], the quantity of mobile malware detected in 2016 was about 18.4 million, representing an increase of 105% from that in 2015.

To protect users from malware threats, a number of anti-malware solution providers (e.g., Norton, McAfee, Symantec, Kingsoft) provide software products as a major means of defence. Their products typically use the signature-based method to detect threats. In this method, a unique signature is generated from a known type of malware, and malware detection amounts to matching a suspicious app against the existing signatures in a maintained database. However, attackers can easily evade detection, for example, by changing signatures using code obfuscation or repackaging. To overcome these limitations of the signature-based method, the heuristic-based method was introduced in the late 1990s, which operates based on explicit rules crafted by security analysts. However, such rules are prone to biases of human expertise; it is also hard to generate rules fast enough to match the speed of malware creation.

To overcome these challenges, there is an emerging trend of developing automatic malware detection methods using machine learning. These techniques are capable of classifying previously unseen malware samples as well as identifying the malware families of malicious samples. In these systems, detection has two phases: feature extraction and classification. In the first phase, various features, such as API calls and binary strings, are extracted from the original file samples. In the second phase, machine learning is used to automatically categorize the file samples into several classes based on the feature representation. Different machine-learning-based malware detection methods differ in both phases.

In this paper, instead of only using API calls as features, we further analyze the control flow graphs (CFGs) that represent the control flows of Android applications. CFGs are widely used in software analysis and have been widely studied in the literature, since they not only provide information about API calls, but also reveal how these API calls interact in the application. Since some APIs are more security sensitive than others, we further extract features for all APIs, such as requested permissions or hardware resources, and represent each Android application as a graph with node attributes.

To make classification decisions from such graph structures, we use graph convolutional neural networks (GCNs), a generalization of classical CNNs to handle graph structures. Convolutional neural networks (CNNs) have proven to be successful on a wide range of
machine learning problems, including image classification, object detection, and deep reinforcement learning. However, in these problems, data can be represented on a regular grid, e.g., pixels in digital images and game states in Go are grids with fixed numbers of rows or columns. To overcome the challenges of representing graph structures for classification, we use GCNs to embed the derived control flow graphs as points in a vector space for graph classification. Since the input graphs vary in shapes and structures, it is challenging to learn and train GCNs on arbitrary graphs. We propose a batch training algorithm to overcome this issue.

Our contribution in this paper is summarized as follows:

• Novel feature representation: instead of using APIs or binary OpCode (operation code) only, we extract control flow graphs (CFGs) for all Android applications in question, and further analyze their API security sensitive attributes, including requested permissions and hardware resources. Based on these features, we represent each Android application as a graph with node attributes, where a graph convolutional network is subsequently used to classify and detect malware.
• Graph convolutional network with global context: Traditional GCNs only consider information from graph and node attributes. However, in Android malware detection, a wide range of contextual information can also be utilized. In this paper, we use diverse information from manifest files, which are included in all Android applications, as the global contextual information, and extend the traditional GCNs to take these additional global features into account.
• Batch training for GCNs: A graph classifier is hard to train, since input graphs can have arbitrary sizes and structures. Unlike images, it is unreasonable to resize all input graphs into a fixed shape. As compatibility with diverse topologies is necessary for convolutional operations on graphs, we propose a batch training algorithm to solve this issue.

2 BACKGROUND AND SYSTEM OVERVIEW
In this section, we introduce preliminaries of the Android operating system, which are crucial for further data preprocessing and designing machine learning algorithms. Then, we will introduce how we design an end-to-end Android malware detection system from large-scale representation learning.

2.1 Preliminaries
We first introduce some background on the Android system and the components in Android application files (known as apk files). Applications in the Android system are written in Java and executed within a custom Java virtual machine, and each application package is contained in a jar file with the extension apk. Each Android application consists of many components of different types. These components are the essential building blocks of an Android application. Each component is an entry point through which the system or a user can enter your application, and applications interact via components. Therefore, it is essential to analyze the component API for security concerns. There are four different types of app components:

• Activities are the entry points for interacting with the user. Activities handle events triggered by users and determine how users navigate within and between apps.
• Services are general-purpose entry points for keeping an application running in the background for all kinds of reasons. For example, the user might listen to music, which is running in a background service, and use another application. In addition, components can be bound to services to interact with them, and even perform inter-process communication (IPC).
• Broadcast receivers are components that enable the system to deliver events to the application outside of a regular user flow, allowing the application to respond to system-wide broadcast announcements. As broadcast receivers are also well-defined entry points, the system can deliver broadcast events to apps that are currently not running. Mostly, they originate from the system, and some are initiated by apps, usually to notify other apps.
• Content providers are components managing a shared set of application data that you can store in the file system, in a SQLite database, on the web, or on any other persistent storage location that your application can access. If the content provider allows, other apps can query or modify the data.

All components must be declared in the application manifest file before they can actually be used. Communications between different components are through intents and intent filters. Intents are messaging objects that can be used to request actions from other application components. An intent-filter is an expression declared in the application manifest file that specifies the intent type that the component will receive.

2.2 System overview
An overview of our proposed malware detection system is shown in Fig. 1, which mainly consists of four components: unpacking, static analysis, feature extraction, and classification.

2.2.1 Unpacking and Decompiling. We first take a look at what is inside the apk file. Each apk file is actually a zipped file that includes the application code, resources, assets, and manifest file. The manifest file plays the central role in Android apps. It contains various important and security sensitive information, including components, permissions, hardware features, etc. In fact, the Android system requires all apps to declare this information in their manifest files; otherwise, they are not recognized by the system and will be ignored.

The manifest file is not sufficient to provide the whole picture of an app. In addition to this, source code is our next and major source of features. The code included in apk files is dex code, i.e., Dalvik executable files, that can be interpreted by the Dalvik Virtual Machine (DVM). Unfortunately, the dex file is hard to understand and we need to convert it into a human-readable format, smali code, which is the intermediate code and can be obtained by disassembling the dex file. Fortunately, the nature of DVM provides tools for such decompiling purpose. In particular, we use APKTool [5] to unpack the apk file and decompile the dex files to smali code.
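The unpacking step can be sketched as a thin Python wrapper around the Apktool command line. This is only an illustration, not the authors' pipeline: the function names, the output layout assumptions (an `AndroidManifest.xml` and one or more `smali*` directories in the output folder), and the choice to overwrite an existing output directory with `-f` are assumptions made for this example.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Tuple


def build_apktool_command(apk_path: str, out_dir: str) -> List[str]:
    """Build the 'apktool d' (decode) command line.

    -o selects the output directory, -f overwrites it if it already exists.
    """
    return ["apktool", "d", str(apk_path), "-o", str(out_dir), "-f"]


def unpack_apk(apk_path: str, out_dir: str) -> Tuple[Path, List[Path]]:
    """Decode an apk and return the manifest path and the smali files found.

    Hypothetical helper: assumes apktool is installed and on PATH.
    """
    if shutil.which("apktool") is None:
        raise RuntimeError("apktool was not found on PATH")
    subprocess.run(build_apktool_command(apk_path, out_dir), check=True)
    out = Path(out_dir)
    manifest = out / "AndroidManifest.xml"
    # Multi-dex apps are typically decoded into smali, smali_classes2, ...
    smali_files = sorted(out.glob("smali*/**/*.smali"))
    return manifest, smali_files
```

Keeping the command construction separate from the call that runs it makes the wrapper easy to check without having Apktool installed.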
[Figure 1: An overview of the system architecture. Stages: Unpacking (1. unzipping, 2. decompile); Static Analysis (1. scanning methods, 2. constructing call graph); Feature Extraction (extracting 1. hardware, 2. permissions, 3. intent filters, 4. suspicious API); Detection (using graph convolution network to detect malware).]

2.2.2 Construct Call Graph. As smali code is readable and close to the original Java code, we can analyze it to extract both statistical and relational features for further use. This step is called static analysis, which aims at analyzing a program without execution.

[Figure 2: An example of call graph. Nodes: com.fa.c.RootCommand: Execute(); com.stericson.RootTools: getShell(); android.util.Log: d(); java.lang.Throwable: getMessage(); com.stericson.RootShell.execution.Shell: add().]

The major feature to represent an Android application is its call graph, which represents calling relationships between methods. In this graph, each node represents a method and each edge, for example (f, g), denotes a method invocation in which method f calls method g. If there is a recursive call, meaning a method f calls itself, a cycle will be used. A call graph is a good visualization of the internal structure of any kind of computer program and has been widely used in many fields. An example of a call graph is shown in Fig. 2. In this call graph there are five methods and four invocations. Each node in the call graph is assigned a unique ID, which consists of class name, method name, function argument types and return type. Call relations are represented as directed edges; for example, the directed edge from com.fa.c.RootCommand: Execute() to android.util.Log: d() indicates that there is an invocation statement for android.util.Log: d() inside the com.fa.c.RootCommand: Execute() method. However, we cannot affirm that this invocation will be executed when the app is running com.fa.c.RootCommand: Execute(), because the programmer can also set constraints on this invocation. We ignore those constraints for our call graph and try to include as many invocations as possible.

2.2.3 Node Feature Extraction. After construction of the call graph, the whole program is represented by how methods invoke each other, and each method is represented by a node. In this step, we further extract security sensitive properties of each method that act as node attributes of our constructed call graph. In other words, our goal is to construct a graph with node attributes. In particular, we extract the following five types of attributes for each method:

• Method type: We categorize all methods into four types: Android system API, third-party API, component methods and others. Android system API includes all APIs listed in the Android official references, and libraries provided by Java. Third-party APIs include some widely used APIs, including Google Map APIs, Facebook APIs, Yahoo! Weather APIs. Since all components should be declared in the manifest file, all methods in such component classes are categorized as component methods. All other methods are categorized as others.
• Hardware features: If a method requests some hardware resources, like camera, GPS, sensors, etc., these hardware resources are considered as the method's features.
• Requested permissions: Like the previous one, if a method needs special permissions to execute, these permissions are considered as the method's features.
• Component permissions: Sometimes it is the component class, not the method, that requests permissions. If so, all methods in this component have permission features.
• Intent filter: Like the previous one, if a component declares some intent filters, these intent filters are features for all methods in this component class.

2.2.4 Classification. The outcome of previous steps is a graph with node attributes. With proper transformation, we can apply
graph convolutional networks (GCN) for malware detection. Traditional methods involve hand-crafted feature extraction on top of such graph structured data in order to measure the neighborhood of two nodes in the graph. In this paper, however, we turn to graph convolutional networks that are capable of end-to-end training. In this approach, a graph will be embedded as a point in a vector space, and classification can be done in this vector space. A key benefit of this approach is that learning the embedding mapping and the classification scheme can be done jointly.

2.3 Additional details
In the last part of this section, we give some details of our proposed system. We start from the construction of call graphs. Generating a precise call graph is challenging for the following reasons.

(1) When a calling statement is found, the binding between two methods may be resolved at compilation time or at runtime. An example is when a method is inherited from its parent class.
(2) Unlike Java programs, Android applications do not have a main method but multiple entry points instead. These entry points are implicitly called by the Android framework in the back end.
(3) Callbacks are prevalent in Android applications. There is some existing work, like FlowDroid [7], that addresses these issues. However, we found that these tools are quite complicated and time consuming, with limited benefits for malware detection. Therefore, we use a simple yet effective way to construct the call graph: we add an additional dummyMain node that connects to all methods listed in the smali code.

Once the call graph is constructed, we need to extract features for all nodes, including the dummyMain node. As discussed above, each node contains four kinds of attributes, including method types, permissions, and hardware features. Permissions are definitely the most security sensitive attributes; in fact, many operations need specific permissions to execute and these permissions are granted by the user at installation, and malware samples are prone to request a special set of permissions. We actually have two types of permissions from the manifest file, namely, requested permissions and component permissions. As the dummyMain node connects to all methods in this apk file, we assign requested permissions as its attribute, and assign component permissions to the corresponding methods in the component.

Other attributes are also crucial for malware detection. For example, we also collect all intent filters since they can be used for eavesdropping on specific intents. Malware samples are sensitive to a special set of system events, so intent filters can be hints.

Note that we should be aware of a special set of APIs that can lead to malicious behaviors without requesting permissions. For example, cryptography functions in the Java library are considered to be math functions, so no permissions are needed. However, these functions can be used by malware samples for code obfuscation purposes, so unusual usage of these functions should be paid attention to. We mark these types of functions as suspicious APIs, as done in [6].

Finally, we introduce how we extract these features by static analysis. Most of these node attributes can be extracted from the manifest file, for example, intent filters. Requested permissions can be found in the tag uses-permission and component permissions can be found in the attribute android:permission. In our system, we use some offline tools to obtain the system API permission mapping, including official references [3] and PSCout [8], then assign component permissions to the corresponding methods, and finally assign all permissions shown in uses-permission to the dummyMain node. Note that we only gather permissions or hardware features that a method requests, no matter where they are collected from. Hardware features are trickier. In Android, the tag uses-feature is used to declare hardware or software features. Sometimes we may not find any hardware features in the manifest file, since they are implicitly declared by permissions. A common practice is to use the aapt tool from the Android SDK to determine what hardware features are declared.

3 A GRAPH APPROACH FOR MALWARE DETECTION
In this section, we exploit the graph representation of Android malware samples provided by the control flow graph, and propose the Call Graph based Graph Convolutional Network (CG-GCN) for malware detection. We illustrate the overall architecture of our proposed model in Fig. 3. We first formulate the malware detection problem as a classification problem. Then we apply a graph convolutional network (GCN) to solve it. To speed up training, we further propose a batch training scheme that allows graph representation vectors to be learned simultaneously in a batch.

3.1 Feature Transformation and Graph Representation Learning
So far, we have obtained a number of call graphs as well as their corresponding node attributes. Recall that the nodes in a call graph are actually methods, and each method may have certain permissions to request, or hardware resources to use, etc. Each node is associated with a set of such attributes, and the empty set is also allowed here.

We now formulate the above idea as follows. A call graph is denoted by $G = (V, E)$, and we use $A$ to denote its adjacency matrix. Each node $v \in V$ is associated with a set $F_v$ extracted in the previous stage, and the goal of feature transformation is to find a proper function $\phi(\cdot)$ that converts the node attribute set $F_v$ into a vector $x_v \in \mathbb{R}^{h_0}$, where $h_0$ is the dimension of the destination vector space. By doing so, we obtain a new matrix $X$ whose $v$-th row is $x_v$, and the role of the graph convolutional network (GCN) is to classify the input tuple $(A, X)$ into the categories malicious or benign.

Traditional approaches often involve careful feature engineering techniques to design $\phi$ and measure local neighborhood structures from $A$, and then use existing machine learning algorithms for non-structural data. However, these hand-crafted approaches are inflexible and limited under the rapidly changing trend of Android malware samples. In fact, as we will see in Sec. 4, the call graph structure can vary a lot, both in scale and in complexity, and designing these features can be time-consuming.
[Figure 3: Architecture of Graph Convolution Network for Malware. Labels: GCN Layer, BN Layer (Layer Combo), Aggregation Layer, Combination Layer, Softmax, Detection Result.]

Recently, a surge of new approaches attempt to learn representations of the graph by learning a mapping that embeds nodes or entire (sub)graphs as points in a vector space, which is usually low-dimensional. A good mapping should reflect the graph structure through geometric relationships among the learned vectors in this space, which are called embeddings and can be used as feature inputs for further machine learning tasks. In these approaches, representations of nodes and the whole graph (or subgraph) are no longer designed from kernel functions or other carefully engineered schemes; instead, we design algorithms that can automatically learn them.

In this spirit, a good feature transformation should keep as much raw information as possible. Therefore, we consider a one-hot encoding scheme to convert sets into vectors. Specifically, we denote by $S$ the set of all possible values in $F_v$, and we have $S = \{s_1, \ldots, s_{|S|}\}$. Then, we assign a vector $x_v \in \mathbb{R}^{|S|}$ such that the $i$-th entry $x_v(i) = 1$ if $s_i$ appears in $F_v$, and $x_v(i) = 0$ otherwise. Fig. 4 illustrates details of the one-hot encoding used in our system.

[Figure 4: One hot encoding for node features. A = {SEND_SMS, BIND_ADMIN, BLUETOOTH}; B = {SEND_SMS, CHANGE_WIFI_STATE, NFC}; S = {SEND_SMS, BIND_ADMIN, BLUETOOTH, CHANGE_WIFI_STATE, NFC}. Here we use different color blocks to represent different permissions. In this example, node A requests three permissions, and node B also requests three permissions. The block with color will be encoded as 1 and the white block will be encoded as 0. Obviously, such a one-hot encoding scheme results in sparse node features.]

Once we have the tuple $(A, X)$ on hand, the next task is to learn a representation of graph $G$ that embeds $G$ into a low-dimensional vector space. More formally, the goal is to find $z_G \in \mathbb{R}^d$ for all $G$ given its adjacency matrix $A$ and node feature matrix $X$, and $z_G$ will be used for further classification. This is the role of the GCN.

3.2 Deep Graph Convolutional Networks for Malware Detection
Unlike conventional network representation learning algorithms, which attempt to learn node representations in unsupervised learning settings, learning the graph representation $z_G$ is quite challenging and often involves a supervised learning setting. In the remainder of this section, we will see how our proposed model CG-GCN can be utilized for graph classification while learning low-dimensional vectors for all graphs.

The basic idea in graph neural networks is to generate a node embedding vector by iteratively aggregating vectors from its neighbor nodes. The operations at each layer are illustrated in Fig. 5. In each layer, node $v$ is associated with a hidden vector $h_v$, and we let $h_v^0 = x_v$ at the beginning. At layer $k$, the hidden vector of node $v$ aggregates hidden vectors from its neighbors as follows:

$$\tilde{h}_v = \sum_{v' \in N(v) \cup \{v\}} h_{v'}^{k-1}, \tag{1}$$

$$h_v^k = \sigma(\tilde{h}_v W^k), \tag{2}$$

where $N(v)$ denotes the set of neighbors of node $v$, $\sigma$ denotes the activation function at layer $k$, and $W^k \in \mathbb{R}^{d_{k-1} \times d_k}$ is the weight matrix at layer $k$. Here we denote by $d_k$ the dimension of hidden vectors at layer $k$. By iteratively performing the above equations for all nodes at all layers, we finally obtain node embedding vectors at the last layer $K$ for all nodes. These final representation vectors are regarded as the embedding vectors: for node $v$, its embedding vector is defined as $z_v := h_v^K$.

[Figure 5: Aggregation of Graph Convolution Network]

Similar to $X$, we can stack all node embedding vectors into an embedding matrix $Z$. As each node aggregates from its neighbors, the node features propagate to more distant nodes in the graph in deeper layers, and the formation of $Z$ implicitly depicts local neighborhood structure. After obtaining node
embeddings $Z$, we sum all the individual node embeddings in the graph to form the representation vector $z_G$:

$$z_G = \sum_{v \in V} z_v. \tag{3}$$

As discussed in the previous section, we have some special information that is encoded as the node feature of the dummyMain node, including permission requests and hardware requests. These features provide a global context for other methods to learn node embedding vectors. For malware detection purposes, however, they are also important features to discriminate whether this app is malicious. For this reason, we create a shortcut from the node feature of the dummyMain node to the last layer, and the input vector for classification is actually the concatenation of the graph embedding vector and the node feature vector of the dummyMain node. More formally, we denote by $x_G$ the row vector of the dummyMain node in $X$, and the vector for graph classification is actually $[z_G, x_G]$, where $[\cdot]$ denotes the concatenation operator.

A deep GCN model is well suited for our malware detection problem for the following reasons. First, it enables us to capture structural information. A full malicious behavior in an app often corresponds to a long trace on the call graph. For example, when eavesdropping on messages in a smartphone, a malware sample should first execute the API that can read messages, and then send them out. In the call graph, this simple action involves two sites: a source site that gets the user's message, and a sink site that sends the message out. Considering individual API calls is not enough to analyze such malicious behaviors. Also, we can obtain a graph embedding as well as node embeddings using GCN. This means we can simultaneously have representations for both methods and the entire app, which encode structural information for both.

3.3 Batch Training GCN
Now we turn to how we train the GCN and propose our batch training approach to speed it up. Suppose we have a training set $D := \{(A_i, X_i, y_i)\}$ of $n$ graphs, where $(A_i, X_i)$ denotes the input tuple $(A, X)$ for the $i$-th graph in $D$. The GCN takes $(A_i, X_i)$ as input and we obtain $[z_i, x_i]$ as the vector for graph classification. By adopting the sigmoid loss, we obtain the optimization problem for embedding parameters and discriminative classifier estimation as

$$\min_{u, \{W\}} \frac{1}{n} \sum_{i=1}^{n} -y_i \log\big(\sigma(\langle [z_i, x_i], u \rangle)\big), \tag{4}$$

where $u$ is the weight parameter for classification and $\{W\}$ is the collection of weight parameters for the GCN. Here $[z_i, x_i]$ is the concatenation of the graph representation vector $z_i$ for the $i$-th input graph and $x_i$, the node feature of the dummyMain node in the same graph. We can also add a regularizer term in (4) to prevent overfitting.

One of the greatest challenges for conducting convolution on graph-structured data is the difficulty of training graphs in a batch [18]. Due to the irregular structure and shapes, some existing techniques in conventional CNNs, like resizing or reshaping, are not suitable for GCN, which weakens the compatibility of GCN. Here we propose our approach for training GCN in batch mode.

Here we denote by $H^k$ the matrix of hidden vectors whose $v$-th row is the hidden vector for node $v$. From graph theory we can rewrite (1) as

$$\tilde{H}^k = A H^{k-1} + H^{k-1},$$

since $A$ only accounts for links with neighbor nodes. Denote $\hat{A} := A + I$, and let $\hat{D}$ be the diagonal node degree matrix of $\hat{A}$ that is used for normalization. Then, we can combine (1) and (2) as follows [17]:

$$H^k = \sigma(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{k-1} W^k). \tag{5}$$

We denote $\tilde{A} = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$, which is the same for all layers. Therefore, we can simply precompute $\tilde{A}$ before passing it into the neural network, and at layer $k$ we have:

$$H^k = \sigma(\tilde{A} H^{k-1} W^k). \tag{6}$$

In summary, sequential training first computes $\tilde{A}_i$ for the $i$-th sample. Then, the sample passes through the GCN and we get $[z_i, x_i]$ as the final representation. By calculating the loss on this individual sample, we can obtain its derivatives and in turn update the weights using stochastic gradient descent (SGD) or one of its variants.

Now let us extend the above idea from single-sample training to batch training. Suppose we want to train $m$ samples as a minibatch, denoted as $(A_1, X_1, y_1), \ldots, (A_m, X_m, y_m)$. Similarly, we can precompute $\tilde{A}_i$ for all samples in this minibatch. As graphs can have various numbers of nodes, we concatenate all $\tilde{A}_i$ in the minibatch as a block-diagonal matrix:

$$\tilde{A} = \begin{pmatrix} \tilde{A}_1 & & & \\ & \tilde{A}_2 & & \\ & & \ddots & \\ & & & \tilde{A}_m \end{pmatrix}. \tag{7}$$

So now we can calculate a minibatch of graphs simultaneously as follows:

$$H^k = \sigma\left( \begin{pmatrix} \tilde{A}_1 & & & \\ & \tilde{A}_2 & & \\ & & \ddots & \\ & & & \tilde{A}_m \end{pmatrix} \begin{pmatrix} H_1^{k-1} \\ H_2^{k-1} \\ \vdots \\ H_m^{k-1} \end{pmatrix} W^k \right).$$

If we denote by $\hat{H}^{k-1}$ the concatenation matrix of all $H_i^{k-1}$, for $i = 1, \ldots, m$, updating $\hat{H}^k$ can be written simply as

$$\hat{H}^k = \sigma(\tilde{A} \hat{H}^{k-1} W^k), \tag{8}$$

which is now similar to the sequential case. By iteratively performing the above equation, we can obtain node embeddings for all nodes in all input graphs, and further obtain graph embeddings.

4 EXPERIMENTAL RESULTS
In this section, we evaluate the performance of our proposed CG-GCN model on the Android malware detection task. We will first introduce the dataset of malware samples and clean files for this task. After that, to evaluate our model's efficiency, we will compare our model with a wide variety of existing machine-learning based malware detection approaches as well as some commercial anti-virus engines. Finally, we qualitatively evaluate the final representations learned from our proposed CG-GCN model.
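The batch scheme of Eqs. (5) to (8), followed by the sum pooling of Eq. (3), can be sketched in a few lines of NumPy. This is an illustrative forward pass only, not the authors' TensorFlow implementation: the helper names are hypothetical, ReLU is assumed for the activation $\sigma$, the batch normalization and dummyMain shortcut are omitted, and a dense block-diagonal matrix is used for clarity where a sparse one would be preferable for large call graphs.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def block_diag(mats):
    """Place square matrices of arbitrary sizes on the diagonal, as in Eq. (7)."""
    n = sum(m.shape[0] for m in mats)
    out = np.zeros((n, n))
    offset = 0
    for m in mats:
        k = m.shape[0]
        out[offset:offset + k, offset:offset + k] = m
        offset += k
    return out


def batch_graph_embeddings(adjs, feats, weights):
    """Run all graphs of a minibatch through the GCN layers at once (Eq. 8),
    then sum node embeddings per graph (Eq. 3)."""
    A_tilde = block_diag([normalize_adjacency(A) for A in adjs])
    H = np.vstack(feats)  # stacked node features of all graphs
    for W in weights:
        H = np.maximum(A_tilde @ H @ W, 0.0)  # ReLU assumed for sigma
    boundaries = np.cumsum([A.shape[0] for A in adjs])[:-1]
    return np.stack([Z.sum(axis=0) for Z in np.split(H, boundaries)])


# Two toy graphs with different numbers of nodes (hypothetical data).
rng = np.random.default_rng(0)
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A2 = np.array([[0, 1], [1, 0]], dtype=float)
X1, X2 = rng.random((3, 4)), rng.random((2, 4))
weights = [rng.standard_normal((4, 3)), rng.standard_normal((3, 2))]
batched = batch_graph_embeddings([A1, A2], [X1, X2], weights)
```

Because the batched adjacency is block diagonal, no information flows between graphs, so each row of `batched` matches what the same layers would produce for that graph processed on its own.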
4.1 Experiment Setup

In this paper, we evaluate our algorithm on the DREBIN dataset [6], which contains 5,560 malware files collected from August 2010 to October 2012. All malware samples are labeled with one of 179 malware families. Along with this malware dataset, we also collected a number of real-world Android applications from the Internet. Sources of these files include Apkpure [4] with 5,400 samples, 360.com with 700 samples, and the HKUST Wake Lock Misuse Detection Project [19] with over 13,000 commercial applications. In summary, we have collected 19,100 real-world applications.

Although these Android applications are mostly collected from well-known Android markets and research projects, we still need to verify that they are clean. To do so, we uploaded all collected files to the VirusTotal service, a public anti-virus service with 78 popular engines, and inspected the scanning report for each file. Each engine in VirusTotal reports one of three detection results: True for "malicious", False for "clean", and NK for "not known". If an application has more than one True result, we label it as malware; otherwise, we label it as clean. As a result, only 16,753 out of the 19K collected samples passed all scanners on the VirusTotal service, and we take 5,877 of them as clean files for the evaluation in this paper.

Table 2 shows some statistics of the DREBIN Android malware dataset and the dataset of clean files. A key observation is the high skewness of the node and edge distributions among Android apk files. Due to the high diversity of Android application developers, file sizes on the Internet can range from KB to GB. In the dataset we use for our evaluation, the largest file in the DREBIN dataset is 29 MB and the largest clean file is 62 MB. Such diversity brings severe challenges in training and learning GCN.

Table 2: Malware and clean file datasets.

Dataset           DREBIN       Benign
# samples         5,560        5,877
# nodes (avg.)    9,590.23     28,973.35
# nodes (max)     41,905       65,439
# edges (avg.)    19,377.96    39,031.57
# edges (max)     132,731      207,997

Table 3 shows the number of features extracted as described in Sec. 2. As hardware resources are restricted by devices and the Android system, their value set is quite small. Most commonly used permissions are provided by the Android system as well, so a large overlap of permissions among apps is expected. Intent filters, however, can be easily created by users and developers, so they have the most variety and sparsity.

Table 3: Size of extracted node feature sets on the DREBIN dataset.

Feature set          DREBIN
Hardware features    86
Permissions          3,830
Intent filters       9,317
Total features       13,233

All experiments were conducted on a Compute Engine instance on Google Cloud with 4 cores and 16 GB RAM running Ubuntu 16.04. This instance is also equipped with an NVidia Tesla P100 GPU to speed up the graph convolutional network, which is implemented on top of TensorFlow [2]. We evaluate the Android malware detection performance of different methods using the measures shown in Table 1. Note that precision and false positive rate are the two most important measures for a security system.

Table 1: Performance metrics of Android malware detection.

Metric       Description
TP           # of malicious apps correctly detected
TN           # of benign apps correctly classified
FP           # of apps falsely predicted as malicious
FN           # of apps falsely predicted as clean
ACC          (TP + TN)/(TP + TN + FP + FN)
Precision    TP/(TP + FP)
Recall       TP/(TP + FN)
F1           2 * Precision * Recall/(Precision + Recall)
ER           Error rate, which is 1 - ACC
FPR          False positive rate, FP/(TN + FP)
DF           Detection failure rate, FN/(TP + FN)

4.2 Performance Evaluation on Benchmark Dataset

In this experiment, we randomly select 80% of the data for training and the remaining 20% for testing. During the training stage, all training samples are used in 4-fold cross validation to train our model and tune the hyperparameters, and the testing samples are only used for performance evaluation at the testing stage. We repeat this procedure 5 times and average the results. All baseline methods as well as our proposed one are evaluated with the same procedure.

To evaluate our model, we first compare it with various machine-learning based malware detection algorithms. In particular, we compare its performance against methods using static analysis without graph structure. For those baseline algorithms, we extract features from static analysis of both the manifest and the source code for all samples, similarly to Arp et al. [6], except that we do not extract network addresses here. All features were encoded in one-hot fashion as shown in Fig. 4.

We compare with five baseline classifiers on these features: Random Forest (RF), Support Vector Machine (SVM), and three variants of Naive Bayes (NB): Gaussian (NB-G), Bernoulli (NB-B) and Multinomial (NB-M). For RF, we set the maximum depth to 6 to trade off time and performance. For SVM, which is also the classifier used in [6], we use LibSVM in our experiment and set the penalty to 2.

The results of this experiment are shown in Table 4. In this table, "GCN" refers to the algorithm without concatenating the feature vector x, and "GCN+" refers to the one we introduced in Sec. 3. From this table, we can clearly observe that our proposed GCN significantly outperforms the other approaches by nearly 2.94%
in terms of prediction precision. All the baseline algorithms have also achieved good performance, and most of them have a precision above 90%. However, without modeling the semantics between API nodes, these algorithms cannot reach a performance comparable to our GCN approach. This would become even more obvious as malware becomes more complicated.

The most significant improvement of GCN is in the false positive rate. Table 4 shows that our proposed method drops the false positive rate from 5% to 0.09%, corresponding to nearly 100 fewer false alarms when evaluating 2346 samples. This is remarkable because false alarms have always been a big concern in security and cost system users considerable time and energy to resolve. Although Naive Bayes with the Bernoulli kernel provides the highest detection rate of 99.91% and the best detection failure rate of 0.09%, this comes at the expense of a high false positive rate (10.12%), which diminishes its effectiveness and overall performance. Note that in this table, GCN+ outperforms GCN, which indicates that we can enhance detection performance by incorporating global contextual information.

This can be attributed to two characteristics of our model. First, the input Android apk files are transformed into call graphs, which provide a more detailed picture of Android applications, such as data flow. In contrast, traditional machine learning based malware detection algorithms only account for static features. Second, some features are more expressive in our model. Consider permissions as an example. A malware sample may declare more permissions than necessary in order to let a remote server control a device. Traditional methods only scan the permissions that are actually used, while our model can explicitly learn the "permission distribution" in call graphs. When permissions are requested by isolated methods, which are often executed by command & control servers, the apk file is more likely to be a malware sample. Such information can only be exploited by learning from structural data as our model does, which traditional models lack.

Note that the thresholds for predicting malicious or benign applications in Table 4 are all set to 0.5. To further investigate detection performance under different thresholds, we plot the receiver operating characteristic curves (ROC curves) in Fig. 6. From Fig. 6, we can see that at a given false positive level, our detection rate is higher than that of the other methods in almost all regions, which is a significant improvement and means that we can achieve a very high malware detection rate with few false alarms at the same time.

[Figure 6 plot, titled "ROC curves on AMD set". x-axis: False Positive Rate; y-axis: True Positive Rate. Legend: RF (area = 0.988), NB-Multinomial (area = 0.998), NB-Bernoulli (area = 0.997), GCN (area = 0.998), SVM (area = 0.998), NB-Gaussian (area = 0.936).]

Figure 6: ROC curves for all the baseline algorithms and our proposed GCN method.

We also notice that the performance of the classical "flat features + SVM" model is fairly good. This is reasonable, as the features are carefully extracted. However, our model provides a method to detect Android malware without manually designed features: we only need to construct a call graph and the method specifications such as permissions and hardware resources. We can also easily incorporate handcrafted features into our model for a specific purpose by concatenating them with our embedded vector at the last layer.

4.3 Compare with AV engines

We also compared the performance of our malware detection algorithm with existing Anti-Virus engines on VirusTotal [1]. A critical point is that all of the 'truly' clean files used in our experiments are actually labeled by these AV engines using the rule described in subsection 4.1. Therefore, the AV engines are expected to show a better false positive rate here than in normal operation. In addition, even though we obtained the scan results of all 78 AV engines from VirusTotal, here we only list the ones with the best performance or ones that are already popular and widely used in security programs, such as Kaspersky, Cylance and McAfee.

Table 5 lists part of the AV engines' scanning results on the testing split of our experiment dataset. Compared with these results, our GCN method outperforms most of the AV engines, with a precision of 99.91% and an FPR of 0.09%. In terms of false positive rate, our method outperforms 7 out of 10 antivirus engines, which is remarkable because, as noted above, the labeling procedure favors the engines' FPR on this dataset. Most of the antivirus engines that have a better recall or F1 score than our method end up with either much worse precision or a higher FPR, e.g. AV5, AV7 and AV10. Only a few engines have comparable overall results, e.g. AV4 and AV2.

4.4 Qualitative evaluation of interpretable representations

A core problem of machine learning is to interpret the trained model. To qualitatively assess how interpretable our model is, we randomly chose 2,000 samples from our dataset, of which 1,000 are malware samples and 1,000 are clean samples. We first generate t-SNE plots [20] of all samples using the feature vectors $x_G$ of their dummyMain nodes in Fig. 7(a). From this figure, we can clearly see clusters of clean samples and malware samples, which implies that the features we have extracted are highly relevant for detecting malware. However, the two classes are not separable. A number of points in the right half plane are heavily entangled, and any simple hyperplane or curve would fail to classify these points. This helps explain why traditional machine learning algorithms using flat features find it hard to improve performance.

By further exploiting the graph structure of call graphs, our proposed CG-GCN model can simultaneously learn classification and graph representation. We generate t-SNE plots for all samples using their embedding vectors $z_G$ in Fig. 7(b). This time, malware points and clean points are separable by a significant margin. An interesting finding is that clean points are clustered at the top, while malware points form several small spirals that are disconnected
with each other. This implies that the rich semantics encoded in the call graph and node features can bring more information for malware detection.

Table 4: Test result on DREBIN set. All values are in percentage for better readability. The boldface denotes the best algorithms in terms of corresponding metric.

Algorithm    GCN+     GCN      RF       SVM      NB-G     NB-B     NB-M
Precision    99.91    98.83    95.75    97.93    88.46    90.32    94.54
Recall       99.45    99.82    93.34    97.75    99.37    99.91    99.73
F1           99.68    99.32    94.53    97.84    93.6     94.87    97.07
ACC          99.69    99.34    94.75    97.9     93.4     94.75    97.07
FPR          0.09     1.11     3.91     1.96     12.24    10.12    5.44
DF           0.51     0.17     6.15     2.12     0.67     0.09     0.27
ER           0.31     0.66     5.25     2.1      6.6      5.25     2.93

Table 5: Performance of VirusTotal scanners on DREBIN test set.

Scanner       GCN      AV1      AV2      AV3      AV4      AV5      AV6      AV7      AV8      AV9      AV10
Precision%    99.91    99.91    99.64    62.84    99.91    59.03    99.63    50.09    99.84    97.78    61.42
Recall%       99.45    98.74    99.46    99.10    99.28    100.0    97.21    99.91    54.64    94.96    100.0
F1%           99.68    99.32    99.55    76.91    99.59    74.24    98.41    66.73    70.62    96.35    76.10
FPR%          0.09     0.089    0.357    58.073   0.089    68.778   0.357    98.66    0.089    2.141    62.27

5 RELATED WORK

To keep combating the increasing number of malicious applications, there have been a number of research studies on developing Android malware detection systems using machine learning and data mining, e.g., [6, 11, 15, 24, 25]. The major difference among them is how they extract features from packed applications. One category uses dynamic analysis to capture API calls or environmental variables during execution and to obtain the original code from packed Android applications. For example, DroidDolphin [25] uses DroidBox and APE to record thirteen activity features. Another example is CopperDroid [23], a Virtual Machine Introspection (VMI) based dynamic analysis system that extracts operating system interactions and process communications as features, in which both intra-process and inter-process communications are considered. However, the coverage of dynamic analysis is limited, since not all malicious behaviors appear in a single execution, so dynamic analysis usually takes a long time.

In contrast, static analysis focuses on analyzing the internal components of an application, and it is able to explore all possible execution paths in malware samples. For example, DroidMat [24] and DREBIN [6] perform static analysis on the manifest file and source code to extract multiple features including permissions, hardware resources and API calls, where the former uses k-means clustering and k-NN classification and the latter trains a support vector machine (SVM) on the one-hot encoded feature vectors for Android malware detection. There are other classifiers in the literature. For example, Peiravian and Zhu [21] consider SVM, decision tree and ensemble classifiers. Different from existing works, we analyze the method invocations to form a call graph and extract attributes for all methods in an application, which provides a more complete picture of the application. Based on these extracted features, each Android app is represented by a graph with node features and a sparse vector consisting of global contextual information.

Another trend of static analysis in Android security is to detect specific malicious behaviors such as privacy breaches and over-privilege. For example, [13, 16] go through source code with predefined sources and sinks to find potential privacy breaches. Fu et al. [12] attempt to protect against the theft of private information by examining all URL addresses in the source code. However, we note that static taint analysis and over-privilege detection are prone to false positives.

To differentiate malicious apps from clean ones, we use a graph convolutional neural network for machine learning on graph-structured data. Several convolutional neural network architectures have been proposed for learning over graphs in recent years, most of which can be categorized as spectral graph convolutional neural networks. The seminal work was done by Bruna et al. [9] and later by Defferrard et al. [10] with fast localized convolutions. Kipf and Welling [17] propose a first-order approximation scheme to reduce the computational cost of the graph filter spectrum. One interesting aspect of these two works is that although they consider spectral convolution, all convolution operations in their papers are actually done in the spatial domain only, which is convenient to implement on various deep learning frameworks. A more recent work by Hamilton et al. [14] further extends GCNs by considering a more generic form of aggregation functions and allowing nodes to sample their neighborhoods. In our application, we extend these existing works, which focus on node embedding, to graph embedding models. Another extension is that we jointly train the deep GCN model with a wide model for global contextual features. Finally, to efficiently train on a large number of graphs with arbitrary shapes, we propose a batch training algorithm that allows multiple graphs as input in a minibatch.
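As a minimal sketch of how the deep and wide parts could be combined at prediction time, the code below scores the concatenated vector $[z, x]$ from Section 3.3 with a logistic classifier $\sigma(\langle [z, x], u \rangle)$ and applies the 0.5 threshold mentioned in Section 4.2. The function names `malware_score` and `predict`, as well as all numeric values, are hypothetical and chosen for illustration; how $z$ is produced by the GCN is not shown here.

```python
import math


def sigmoid(s):
    """Logistic function sigma(s)."""
    return 1.0 / (1.0 + math.exp(-s))


def malware_score(z, x, u):
    """Score sigma(<[z, x], u>) of the concatenated representation.

    z: graph embedding from the deep GCN part (toy values in this sketch)
    x: sparse global context vector from the wide part (toy values)
    u: classifier weights, one per entry of the concatenation [z, x]
    """
    features = list(z) + list(x)
    if len(features) != len(u):
        raise ValueError("u must have one weight per entry of [z, x]")
    return sigmoid(sum(f * w for f, w in zip(features, u)))


def predict(z, x, u, threshold=0.5):
    """Label an app as malware when its score reaches the threshold."""
    return malware_score(z, x, u) >= threshold


# Hypothetical toy inputs: a 2-dimensional embedding and 3 one-hot context features.
z = [0.8, -0.3]
x = [1.0, 0.0, 1.0]
u = [1.5, -2.0, 0.7, 0.1, -0.4]

score = malware_score(z, x, u)
is_malware = predict(z, x, u)
print(score, is_malware)
```

Dropping `x` from the concatenation corresponds to the "GCN" variant reported in Table 4, while keeping it corresponds to "GCN+".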
[Figure 7 plots: two t-SNE scatter panels, each titled "t-SNE view of feature representation", with malicious (malware) and benign (clean) samples. (a) Before embedding. (b) After embedding.]

Figure 7: Scatter plot of app files before embedding and after embedding. (a) Scatter plot of sparse representations from dummyMain node feature vectors. (b) Scatter plot of final representations $z_G$ learned from our proposed GCN.

6 CONCLUSIONS

In this paper, we present an Android malware detection framework based on deep graph convolutional networks. Instead of using API calls only, we utilize static analysis to generate call graphs and method attributes to represent Android applications. Such a feature representation not only provides higher-level semantics but also includes detailed execution information that makes it hard for attackers to evade detection. Based on the extracted features, we present a novel Android malware detection framework based on graph convolutional networks. We extend existing convolutional networks for graph classification and incorporate global contextual information extracted from manifest files. To further enhance training efficiency, we propose a batch training algorithm that accepts multiple graphs of various shapes as an input minibatch. To the best of our knowledge, this is the first work to use GCN for Android malware detection. A comprehensive experimental study on real sample collections is performed to compare various malware detection approaches, and the results reveal that our algorithm outperforms state-of-the-art techniques.

REFERENCES

[1] [n. d.]. VirusTotal. ([n. d.]). https://www.virustotal.com/#/home/upload [Online; accessed 9-May-2018].
[2] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, 265–283.
[3] Android API Reference. [n. d.]. https://developer.android.com/reference/index.html. ([n. d.]).
[4] Apkpure.com. 2018. apkpure. (2018). https://apkpure.com/ [Online; accessed 9-May-2018].
[5] APKTool. [n. d.]. https://ibotpeaches.github.io/Apktool/. ([n. d.]).
[6] Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. 2014. DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket. In NDSS, Vol. 14. 23–26.
[7] Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques Klein, Yves Le Traon, Damien Octeau, and Patrick McDaniel. 2014. Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps. ACM SIGPLAN Notices 49, 6 (2014), 259–269.
[8] Kathy Wain Yee Au, Yi Fan Zhou, Zhen Huang, and David Lie. 2012. Pscout: analyzing the android permission specification. In Proceedings of the 2012 ACM conference on Computer and communications security. ACM, 217–228.
[9] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
[10] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
[11] Marko Dimjašević, Simone Atzeni, Ivo Ugrina, and Zvonimir Rakamaric. 2016. Evaluation of android malware detection based on system calls. In Proceedings of the 2016 ACM on International Workshop on Security And Privacy Analytics. ACM, 1–8.
[12] Hao Fu, Zizhan Zheng, Somdutta Bose, Matt Bishop, and Prasant Mohapatra. 2017. Leaksemantic: Identifying abnormal sensitive network transmissions in mobile applications. In INFOCOM 2017-IEEE Conference on Computer Communications, IEEE. IEEE, 1–9.
[13] Clint Gibler, Jonathan Crussell, Jeremy Erickson, and Hao Chen. 2012. AndroidLeaks: automatically detecting potential privacy leaks in android applications on a large scale. In International Conference on Trust and Trustworthy Computing. Springer, 291–307.
[14] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1025–1035.
[15] Shifu Hou, Yanfang Ye, Yangqiu Song, and Melih Abdulhayoglu. 2017. Hindroid: An intelligent android malware detection system based on structured heterogeneous information network. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1507–1515.
[16] Jinyung Kim, Yongho Yoon, Kwangkeun Yi, Junbum Shin, and SWRD Center. 2012. ScanDal: Static analyzer for detecting privacy leaks in android applications. MoST 12 (2012).
[17] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[18] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. 2018. Adaptive Graph Convolutional Neural Networks. arXiv preprint arXiv:1801.03226 (2018).
[19] Yepang Liu, Chang Xu, Shing-Chi Cheung, and Valerio Terragni. 2016. Understanding and detecting wake lock misuses for android applications. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 396–409.
[20] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
[21] Naser Peiravian and Xingquan Zhu. 2013. Machine learning for android malware detection using permission and api calls. In Tools with Artificial Intelligence (ICTAI), 2013 IEEE 25th International Conference on. IEEE, 300–305.
[22] Symantec Internet Threat Report. [n. d.]. https://www.symantec.com/content/dam/symantec/docs/reports/istr-22-2017-en.pdf. ([n. d.]).
[23] Kimberly Tam, Salahuddin J Khan, Aristide Fattori, and Lorenzo Cavallaro. 2015. CopperDroid: Automatic Reconstruction of Android Malware Behaviors. In NDSS.
[24] Dong-Jie Wu, Ching-Hao Mao, Te-En Wei, Hahn-Ming Lee, and Kuo-Ping Wu. 2012. Droidmat: Android malware detection through manifest and api calls tracing. In Information Security (Asia JCIS), 2012 Seventh Asia Joint Conference on. IEEE, 62–69.
[25] Wen-Chieh Wu and Shih-Hao Hung. 2014. DroidDolphin: a dynamic Android malware detection framework using big data and machine learning. In Proceedings of the 2014 Conference on Research in Adaptive and Convergent Systems. ACM, 247–252.