Semantics-Aware Android Malware Classification Using Weighted Contextual API Dependency Graphs
Mu Zhang  Yue Duan  Heng Yin  Zhiruo Zhao
Department of EECS, Syracuse University, Syracuse, NY, USA
{muzhang,yuduan,heyin,zzhao11}@syr.edu

ABSTRACT

The drastic increase of Android malware has led to a strong interest in developing methods to automate the malware analysis process. Existing automated Android malware detection and classification methods fall into two general categories: 1) signature-based and 2) machine learning-based. Signature-based approaches can be easily evaded by bytecode-level transformation attacks. Prior learning-based works extract features from application syntax, rather than program semantics, and are also subject to evasion. In this paper, we propose a novel semantic-based approach that classifies Android malware via dependency graphs. To battle transformation attacks, we extract a weighted contextual API dependency graph as program semantics to construct feature sets. To fight against malware variants and zero-day malware, we introduce graph similarity metrics to uncover homogeneous application behaviors while tolerating minor implementation differences. We implement a prototype system, DroidSIFT, in 23 thousand lines of Java code. We evaluate our system using 2200 malware samples and 13500 benign samples. Experiments show that our signature detection can correctly label 93% of malware instances; our anomaly detector is capable of detecting zero-day malware with a low false negative rate (2%) and an acceptable false positive rate (5.15%) for a vetting purpose.

Categories and Subject Descriptors

D.2.4 [Software Engineering]: Software/Program Verification—Validation; D.4.6 [Operating Systems]: Security and Protection—Invasive software

General Terms

Security

Keywords

Android; Malware classification; Semantics-aware; Graph similarity; Signature detection; Anomaly detection

1. INTRODUCTION

The number of new Android malware instances has grown exponentially in recent years. McAfee reports [3] that 2.47 million new mobile malware samples were collected in 2013, which represents a 197% increase over 2012. Greater and greater amounts of manual effort are required to analyze the increasing number of new malware instances. This has led to a strong interest in developing methods to automate the malware analysis process.

Existing automated Android malware detection and classification methods fall into two general categories: 1) signature-based and 2) machine learning-based. Signature-based approaches [18, 37] look for specific patterns in the bytecode and API calls, but they are easily evaded by bytecode-level transformation attacks [28]. Machine learning-based approaches [5, 6, 27] extract features from an application's behavior (such as permission requests and critical API calls) and apply standard machine learning algorithms to perform binary classification. Because the extracted features are associated with application syntax, rather than program semantics, these detectors are also susceptible to evasion.

To directly address malware that evades automated detection, prior works distill program semantics into graph representations, such as control-flow graphs [11], data dependency graphs [16, 21] and permission event graphs [10]. These graphs are checked against manually-crafted specifications to detect malware. However, these detectors tend to seek an exact match for a given specification and therefore can potentially be evaded by polymorphic variants. Furthermore, the specifications used for detection are produced from known malware families and cannot be used to battle zero-day malware.
CCS'14, November 3–7, 2014, Scottsdale, Arizona, USA.
Copyright 2014 ACM 978-1-4503-2957-6/14/11. http://dx.doi.org/10.1145/2660267.2660359

In this paper, we propose a novel semantic-based approach that classifies Android malware via dependency graphs. To battle transformation attacks [28], we extract a weighted contextual API dependency graph as program semantics to construct feature sets. The subsequent classification then depends on more robust semantic-level behavior rather than program syntax. It is much harder for an adversary to use an elaborate bytecode-level transformation to evade such a training system. To fight against malware variants and zero-day malware, we introduce graph similarity metrics to uncover homogeneous essential application behaviors while tolerating minor implementation differences. Consequently, new or polymorphic malware that has a unique implementation, but performs common malicious behaviors, cannot evade detection.

To our knowledge, when compared to traditional semantics-aware approaches for desktop malware detection, we are the first to examine program semantics within the context of Android malware classification. We also take a step further to defeat malware variants and zero-day malware by comparing the similarity of these programs to that of known malware at the behavioral level.
We build a database of behavior graphs for a collection of Android apps. Each graph models the API usage scenario and program semantics of the app that it represents. Given a new app, a query is made for the app's behavior graphs to search for the most similar counterpart in the database. The query result is a similarity score which sets the corresponding element in the feature vector of the app. Every element of this feature vector is associated with an individual graph in the database.

We build graph databases for two sets of behaviors: benign and malicious. Feature vectors extracted from these two sets are then used to train two separate classifiers for anomaly detection and signature detection. The former is capable of discovering zero-day malware, and the latter is used to identify malware variants.

We implement a prototype system, DroidSIFT, in 23 thousand lines of Java code. Our dependency graph generation is built on top of Soot [2], while our graph similarity query leverages a graph matching toolkit [29] to compute graph edit distance. We evaluate our system using 2200 malware samples and 13500 benign samples. Experiments show that our signature detection can correctly label 93% of malware instances; our anomaly detector is capable of detecting zero-day malware with a low false negative rate (2%) and an acceptable false positive rate (5.15%) for vetting purposes.

In summary, this paper makes the following contributions:

• We propose a semantic-based malware classification to accurately detect both zero-day malware and unknown variants of known malware families. We model program semantics with weighted contextual API dependency graphs, introduce graph similarity metrics to capture homogeneous semantics in malware variants or zero-day malware samples, and encode similarity scores into graph-based feature vectors to enable classifiers for anomaly and signature detections.

• We implement a prototype system, DroidSIFT, that performs static program analysis to generate dependency graphs. DroidSIFT builds graph databases and produces graph-based feature vectors by performing graph similarity queries. It then uses the feature vectors of benign and malware apps to construct two separate classifiers that enable anomaly and signature detections, respectively.

• We evaluate the effectiveness of DroidSIFT using 2200 malware samples and 13500 benign samples. Experiments show that our signature detection can correctly label 93% of malware instances; our anomaly detector is capable of detecting zero-day malware with a low false negative rate (2%) and an acceptable false positive rate (5.15%) for vetting purposes.

Paper Organization. The rest of the paper is organized as follows. Section 2 describes the deployment of our learning-based detection mechanism and gives an overview of our malware classification technique. Section 3 defines our graph representation of Android program semantics and explains our graph generation methodology. Section 4 elaborates on graph similarity queries, the extraction of graph-based feature vectors, and learning-based anomaly & signature detection. Section 5 shows the experimental results of our approach. Discussion and related work are presented in Sections 6 and 7, respectively. Finally, the paper concludes with Section 8.
2. OVERVIEW

In this section, we present an overview of the problem and the deployment and architecture of our proposed solution.

2.1 Problem Statement

An effective vetting process for discovering malicious software is essential for maintaining a healthy ecosystem in the Android app markets. Unfortunately, existing vetting processes are still fairly rudimentary. As an example, consider the Bouncer [22] vetting system that is used by the official Google Play Android market. Though the technical details of Bouncer are not publicly available, experiments by Oberheide and Miller [24] show that Bouncer performs dynamic analysis to examine apps within an emulated Android environment for a limited period of time. This method of analysis can be easily evaded by apps that perform emulator detection, contain hidden behaviors that are timer-based, or otherwise avoid triggering malicious behavior during the short time period when the app is being vetted. Signature detection techniques adopted by the current anti-malware products have also been shown to be trivially evaded by simple bytecode-level transformations [28].

Figure 1: Deployment of DroidSIFT (a developer submits an app to the market; offline graph database construction & training produce the database and classifiers used by online vetting, which returns a report)

We propose a new technique, DroidSIFT, illustrated in Figure 1, that addresses these shortcomings and can be deployed as a replacement for existing vetting techniques currently in use by the Android app markets. This technique is based on static analysis, which is immune to emulation detection and is capable of analyzing the entirety of an app's code. Further, to defeat bytecode-level transformations, our static analysis is semantics-aware and extracts program behaviors at the semantic level. More specifically, we achieve the following design goals:

• Semantic-based Detection. Our approach detects malware instances based on their program semantics. It does not rely on malicious code patterns, external symptoms, or heuristics. Our approach is able to perform program analysis for both the interpretation and demonstration of inherent program dependencies and execution contexts.

• High Scalability. Our approach is scalable to cope with millions of unique benign and malicious Android app samples. It addresses the complexity of static program analysis, which can be considerably expensive in terms of both time and memory resources, to perform precisely.

• Variant Resiliency. Our approach is resilient to polymorphic variants. It is common for attackers to implement malicious functionalities in slightly different manners and still be able to perform the expected malicious behaviors. This malware polymorphism can defeat detection methods that are based on exact behavior matching, which is the method prevalently adopted by existing signature-based detection and graph-based model checking. To address this, our technique is able to measure the similarity of app behaviors and tolerate such implementation variants by using similarity scores.

Consequently, we are able to conduct two kinds of classifications: anomaly detection and signature detection. Upon receiving a new app submission, our vetting process will conduct anomaly detection to determine whether it contains behaviors that significantly deviate from the benign apps within our database. If such a deviation is discovered, a potential malware instance is identified. Then, we conduct further signature detection on it to determine if this app falls into any malware family within our signature database. If so, the app is flagged as malicious and bounced back to the developer immediately.

If the app passes this hurdle, it is still possible that a new malware species has been found. We bounce the app back to the developer with a detailed report when suspicious behaviors that deviate from benign behaviors are discovered, and we request justification for the deviation. The app is approved only after the developer makes a convincing justification for the deviation. Otherwise, after further investigation, we may confirm it to indeed be a new malware species. By placing this information into our malware database, we further improve our signature detection to detect this new malware species in the future.

It is also possible to deploy our technique via a more ad-hoc scheme. For example, our detection mechanism can be deployed as a public service that allows a cautious app user to examine an app prior to its installation. An enterprise that maintains its own private app repository could utilize such a security service. The enterprise service conducts vetting prior to adding an app to the internal app pool, thereby protecting employees from apps that contain malware behaviors.
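To make the two-stage vetting flow concrete, the following is a minimal sketch (not from the paper); the type names VettingPipeline, AnomalyDetector, SignatureClassifier, App, and the verdict values are hypothetical names introduced here for illustration.

```java
// Hypothetical facades over the two classifiers described above.
interface AnomalyDetector { boolean isAbnormal(App app); }
interface SignatureClassifier { String classify(App app); } // null if no family matches
final class App { /* the APK under vetting */ }

enum Verdict { BENIGN, KNOWN_FAMILY, SUSPECTED_NEW_SPECIES }

final class VettingPipeline {
    private final AnomalyDetector anomalyDetector;
    private final SignatureClassifier signatureClassifier;

    VettingPipeline(AnomalyDetector a, SignatureClassifier s) {
        this.anomalyDetector = a;
        this.signatureClassifier = s;
    }

    Verdict vet(App app) {
        // Stage 1: does any behavior graph deviate from known benign ones?
        if (!anomalyDetector.isAbnormal(app)) {
            return Verdict.BENIGN;            // approve the submission
        }
        // Stage 2: does the anomaly match a known malware family?
        if (signatureClassifier.classify(app) != null) {
            return Verdict.KNOWN_FAMILY;      // bounce back immediately
        }
        // Otherwise request justification from the developer; a confirmed
        // anomaly becomes a new signature in the malware database.
        return Verdict.SUSPECTED_NEW_SPECIES;
    }
}
```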
2.2 Architecture Overview

Figure 2: Overview of DroidSIFT (Android apps flow through behavior graph generation, scalable graph similarity query against bucketed databases, graph-based feature vector extraction, and anomaly & signature detection)

Figure 2 depicts the workflow of our graph-based Android malware classification. It takes the following steps:

(1) Behavior Graph Generation. Our malware classification considers graph similarity as the feature vector. To this end, we first perform static program analysis to transform Android bytecode programs into their corresponding graph representations. Our program analysis includes entry point discovery and call graph analysis to better understand the API calling contexts, and it leverages both forward and backward dataflow analysis to explore API dependencies and uncover any constant parameters. The result of this analysis is expressed via Weighted Contextual API Dependency Graphs that expose security-related behaviors of Android apps.

(2) Scalable Graph Similarity Query. Having generated graphs for both benign and malicious apps, we then query the graph database for the one graph most similar to a given graph. To address the scalability challenge, we utilize a bucket-based indexing scheme to improve search efficiency. Each bucket contains graphs bearing APIs from the same Android packages, and it is indexed with a bitvector that indicates the presence of such packages. Given a graph query, we can quickly seek to the corresponding bucket index by matching the package's vector to the bucket's bitvector. Once a matching bucket is located, we further iterate through this bucket to find the best-matching graph. Finding the best-matching graph, instead of an exact match, is necessary to identify polymorphic malware.

(3) Graph-based Feature Vector Extraction. Given an app, we attempt to find the best match for each of its graphs from the database. This produces a similarity feature vector. Each element of the vector is associated with an existing graph in the database. This vector bears a non-zero similarity score in one element only if the corresponding graph is the best match to one of the graphs for the given app.

(4) Anomaly & Signature Detection. We have implemented a signature classifier and an anomaly detector. We have produced feature vectors for malicious apps, and these vectors are used to train the classifier for signature detection. The anomaly detection discovers zero-day Android malware, and the signature detector uncovers the type (family) of the malware.
3. WEIGHTED CONTEXTUAL API DEPENDENCY GRAPH

In this section, we describe how we capture the semantic-level behaviors of Android malware in the form of graphs. We start by identifying the key behavioral aspects that must be captured, present a formal definition, and then present a real example to demonstrate these aspects.

3.1 Key Behavioral Aspects

We consider the following aspects as essential when describing the semantic-level behaviors of an Android malware sample:

1) API Dependency. API calls (including reflective calls to the private framework functions) indicate how an app interacts with the Android framework. It is essential to capture what API calls an app makes and the dependencies among those calls. Prior works on semantic- and behavior-based malware detection and classification for desktop environments all make use of API dependency information [16, 21]. Android malware shares the same characteristics.

2) Context. An entry point of an API call is a program entry point that directly or indirectly triggers the call. From a user-awareness point of view, there are two kinds of entry points: user interfaces and background callbacks. Malware authors commonly exploit background callbacks to enable malicious functionalities without the user's knowledge. From a security analyst's perspective, it is a suspicious behavior when a typically user-interactive API (e.g., AudioRecord.startRecording()) is called stealthily [10]. As a result, we must pay special attention to APIs activated from background callbacks.

3) Constant. Constants convey semantic information by revealing the values of critical parameters and uncovering fine-grained API semantics. For instance, Runtime.exec() may execute varied shell commands, such as ps or chmod, depending upon the input string constant. Constant analysis also discloses the data dependencies of certain security-sensitive APIs whose benignness is dependent upon whether an input is constant. For example, a sendTextMessage() call taking a constant premium-rate phone number as a parameter is a more suspicious behavior than a call to the same API receiving the phone number from user input via getText(). Consequently, it is crucial to extract information about the usage of constants for security analysis (a brief code illustration follows after this list).

Once we look at app behaviors from these three perspectives, we perform similarity checking, rather than seeking an exact match, on the behavioral graphs. Since each individual API node plays a distinctive role in an app, it contributes differently to the graph similarity. With regard to malware detection, we emphasize security-sensitive APIs combined with critical contexts or constant parameters. We assign weights to different API nodes, giving greater weights to the nodes containing critical calls, to improve the "quality" of behavior graphs when measuring similarity. Moreover, the weight generation is automated. Thus, similar graphs have higher similarity scores by design.
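To make aspect 3 concrete, here is a minimal sketch (our own example, not from the paper) contrasting a constant-parameter call with one whose argument flows from user input; the destination number and message strings are hypothetical.

```java
import android.telephony.SmsManager;
import android.widget.EditText;

class ConstantExample {
    void send(EditText input) {
        SmsManager sms = SmsManager.getDefault();

        // Constant destination: the number is a string literal, so constant
        // analysis can attach it to the sendTextMessage() node in the
        // WC-ADG. A constant premium-rate number here is highly suspicious.
        sms.sendTextMessage("1065502709", null, "subscribe", null, null);

        // Non-constant destination: the number comes from user input via
        // getText(), so the very same API call is far less suspicious.
        String dest = input.getText().toString();
        sms.sendTextMessage(dest, null, "hello", null, null);
    }
}
```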
3.2 Formal Definition

To address all of the aforementioned factors, we describe app behaviors using Weighted Contextual API Dependency Graphs (WC-ADG). At a high level, a WC-ADG consists of API operations, where some of the operations have data dependencies. A formal definition is presented as follows.

Definition 1. A Weighted Contextual API Dependency Graph is a directed graph G = (V, E, α, β) over a set of API operations Σ and a weight space W, where:

• The set of vertices V corresponds to the contextual API operations in Σ;
• The set of edges E ⊆ V × V corresponds to the data dependencies between operations;
• The labeling function α : V → Σ associates nodes with the labels of the corresponding contextual API operations, where each label is comprised of 3 elements: API prototype, entry point and constant parameter;
• The labeling function β : V → W associates nodes with their corresponding weights, where ∀w ∈ W, w ∈ R, and R represents the space of real numbers.
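A minimal in-memory representation of Definition 1 might look like the following; this is an illustrative sketch rather than DroidSIFT's actual classes, and all field names are our own.

```java
import java.util.*;

// One vertex: a contextual API operation, i.e., Definition 1's label
// alpha(v) plus its weight beta(v).
final class ApiNode {
    final String apiPrototype;      // e.g. "SmsManager.sendTextMessage(...)"
    final Set<String> entryPoints;  // e.g. {"BroadcastReceiver.onReceive"}
    final Set<String> constants;    // constant parameter set, may be empty
    double weight = 1.0;            // beta(v); the default weight is 1

    ApiNode(String proto, Set<String> entries, Set<String> consts) {
        this.apiPrototype = proto;
        this.entryPoints = entries;
        this.constants = consts;
    }
}

// The graph itself: vertices plus directed data-dependency edges E ⊆ V × V.
final class WcAdg {
    final List<ApiNode> nodes = new ArrayList<>();
    final Map<ApiNode, Set<ApiNode>> dataDeps = new HashMap<>();

    void addDependency(ApiNode from, ApiNode to) {
        dataDeps.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }
}
```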
3.3 A Real Example

Zitmo is a class of banking trojan malware that steals a user's SMS messages to discover banking information (e.g., mTANs). Figure 3 presents an example WC-ADG that depicts the malicious behavior of a Zitmo malware sample in a concise, yet complete, manner.

Figure 3: WC-ADG of Zitmo (nodes createFromPdu, getOriginatingAddress, getMessageBody, setEntity, and execute, each annotated with the entry point BroadcastReceiver.onReceive; the execute node carries the constant set Setc = {"http://softthrifty.com/security.jsp"})

This graph contains five API call nodes. Each node contains the call's prototype, a set of any constant parameters, and the entry points of the call. Dashed arrows that connect a pair of nodes indicate that a data dependency exists between the two calls in those nodes.

By combining the knowledge of API prototypes with the data dependency information shown in the graph, we know that the app is forwarding an incoming SMS to the network. Once an SMS is received by the mobile phone, Zitmo creates an SMS object from the raw Protocol Data Unit by calling createFromPdu(byte[]). It extracts the sender's phone number and message content by calling getOriginatingAddress() and getMessageBody(). Both strings are encoded into a UrlEncodedFormEntity object and enclosed in an HttpEntityEnclosingRequestBase by using the setEntity() call. Finally, this HTTP request is sent to the network via AbstractHttpClient.execute().

Zitmo variants may also exploit various other communication-related API calls for the sending purpose. Another Zitmo instance uses SmsManager.sendTextMessage() to deliver the stolen information as a text message to the attacker's phone. Such variations motivate us to consider graph similarity metrics, rather than an exact matching of API call behavior, when determining whether a sample app is benign or malicious.

The context provided by the entry points of these API calls informs us that the user is not aware of this SMS forwarding behavior. These consecutive API invocations start within the entry point method onReceive() with a call to createFromPdu(byte[]). onReceive() is a broadcast receiver registered by the app to receive incoming SMS messages in the background. Therefore, createFromPdu(byte[]) and the subsequent API calls are activated from a non-user-interactive entry point and are hidden from the user.

Constant analysis of the graph further indicates that the forwarding destination is suspicious. The parameter of execute() is neither the sender (i.e., the bank) nor any familiar party from the contacts. It is a constant URL belonging to an unknown third party.

3.4 Graph Generation

We have implemented a graph generation tool on top of Soot [2] in 20k lines of code. This tool examines an Android app to conduct entry point discovery and perform context-sensitive, flow-sensitive, and interprocedural dataflow analyses. These analyses locate API call parameters and return values of interest, extract constant parameters, and determine the data dependencies among the API calls.

Entry Point Discovery.

Entry point discovery is essential to revealing whether the user is aware that a certain API call has been made. However, this identification is not straightforward. Consider the callgraph seen in Figure 4. This graph describes a code snippet that registers an onClick() event handler for a button. From within the event handler, the code starts a thread instance by calling Thread.start(), which invokes the run() method implementing Runnable.run(). The run() method passes an android.os.Message object to the message queue of the hosting thread via Handler.sendMessage(). A Handler object created in the same thread is then bound to this message queue, and its Handler.handleMessage() callback will process the message and later execute sendTextMessage().

Figure 4: Callgraph for asynchronously sending an SMS message. "e" and "a" stand for "event handler" and "action", respectively.

The sole entry point to the graph is the user-interactive callback onClick(). However, prior work [23] on the identification of program entry points does not consider asynchronous calls and recognizes all three callbacks in the program as individual entry points. This confuses the determination of whether the user is aware that an API call has been made in response to a user-interactive callback. To address this limitation, we propose Algorithm 1 to remove any potential entry points that are actually part of an asynchronous call chain with only a single entry point.
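The code snippet behind Figure 4 is not shown in the paper; a minimal reconstruction of that pattern, under our own naming and with a hypothetical destination number, could look like this:

```java
import android.os.Handler;
import android.os.Message;
import android.telephony.SmsManager;
import android.view.View;

class AsyncSmsExample implements View.OnClickListener {
    // Handler bound to the creating thread's message queue; its
    // handleMessage() callback is the third "entry point" a naive
    // analysis would report.
    private final Handler handler = new Handler() {
        @Override public void handleMessage(Message msg) {
            SmsManager.getDefault()
                      .sendTextMessage((String) msg.obj, null, "...", null, null);
        }
    };

    // The only true (user-interactive) entry point.
    @Override public void onClick(View v) {
        new Thread(new Runnable() {           // Runnable.run() is the second
            @Override public void run() {     // callback a naive analysis flags
                Message msg = handler.obtainMessage();
                msg.obj = "5554";             // hypothetical destination number
                handler.sendMessage(msg);
            }
        }).start();                           // Thread.start() triggers run()
    }
}
```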
Table 1: Calling Convention of Asynchronous Calls

Top-level Class   Run Method          Start Method
Runnable          run()               start()
AsyncTask         onPreExecute()      execute()
AsyncTask         doInBackground()    onPreExecute()
AsyncTask         onPostExecute()     doInBackground()
Message           handleMessage()     sendMessage()

Algorithm 1 Entry Point Reduction for Asynchronous Callbacks

Mentry ← {possible entry point callback methods}
CMasync ← {pairs of (BaseClass, RunMethod) for asynchronous calls in the framework}
RSasync ← {map from RunMethod to StartMethod for asynchronous calls in the framework}
for mentry ∈ Mentry do
    c ← the class declaring mentry
    base ← the top-level base class of c
    if (base, mentry) ∈ CMasync then
        mstart ← Lookup(mentry) in RSasync
        for ∀ call to mstart do
            r ← "this" reference of the call
            PointsToSet ← PointsToAnalysis(r)
            if c ∈ PointsToSet then
                Mentry ← Mentry − {mentry}
                BuildDependencyStub(mstart, mentry)
            end if
        end for
    end if
end for
output Mentry as the reduced entry point set

Algorithm 1 accepts three inputs and provides one output. The first input is Mentry, which is a set of possible entry points. The second is CMasync, which is a set of (BaseClass, RunMethod) pairs. BaseClass represents a top-level asynchronous base class (e.g., Runnable) in the Android framework, and RunMethod is the asynchronous call target (e.g., Runnable.run()) declared in this class. The third input is RSasync, which maps RunMethod to StartMethod. RunMethod and StartMethod are the callee and caller in an asynchronous call (e.g., Runnable.run() and Runnable.start()). The output is a reduced Mentry set.

We compute the Mentry input by applying the algorithm proposed by Lu et al. [23], which discovers all reachable callback methods defined by the app that are intended to be called only by the Android framework. To further consider the logical order between Intent senders and receivers, we leverage Epicc [25] to resolve the inter-component communications and then remove the Intent receivers from Mentry.

Through examination of the Android framework code, we generate a list of 3-tuples consisting of BaseClass, RunMethod, and StartMethod. For example, we capture the Android-specific calling convention of AsyncTask with AsyncTask.onPreExecute() being triggered by AsyncTask.execute(). When a new asynchronous call is introduced into the framework code, this list is updated to include the change. Table 1 presents our current model for the calling convention of the top-level base asynchronous classes in the Android framework.

Given these inputs, our algorithm iterates over Mentry. For every method mentry in this set, it finds the class c that declares this method and the top-level base class base that c inherits from. Then, it searches for the pair of base and mentry in the CMasync set. If a match is found, the method mentry is a "callee" by convention. The algorithm thus looks up mentry in the map RSasync to find the corresponding "caller" mstart. Each call to mstart is further examined, and a points-to analysis is performed on the "this" reference making the call. If class c of method mentry belongs to the points-to set, we can ensure the calling relationship between the caller mstart and the callee mentry and remove the callee from the entry point set.

To indicate the data dependency between these two methods, we introduce a stub which connects the parameters of the asynchronous call to the corresponding parameters of its callee. Figure 5 depicts the example stub code for AsyncTask, where the parameter of execute() is first passed to doInBackground() through the stub executeStub(), and then the return value from this asynchronous execution is further transferred to onPostExecute() via onPostExecuteStub().

public class AsyncTask {
    public AsyncTask execute(Params... params) {
        executeStub(params);
    }
    // Stub: connects execute()'s parameters to doInBackground() and
    // forwards the result to onPostExecute().
    public AsyncTask executeStub(Params... params) {
        onPreExecute();
        Result result = doInBackground(params);
        onPostExecuteStub(result);
    }
    public void onPostExecuteStub(Result result) {
        onPostExecute(result);
    }
}

Figure 5: Stub code for dataflow of AsyncTask.execute
Once the algorithm has reduced the number of entry point methods in Mentry, we explore all code reachable from those entry points, including both synchronous and asynchronous calls. We further determine the user interactivity of an entry point by examining its top-level base class. If the entry point callback overrides a counterpart declared in one of the three top-level UI-related interfaces (i.e., android.graphics.drawable.Drawable.Callback, android.view.accessibility.AccessibilityEventSource, and android.view.KeyEvent.Callback), we then consider the derived entry point method to be a user interface.

Constant Analysis.

We conduct constant analysis for any critical parameters of security-sensitive API calls. These calls may expose security-related behaviors depending upon the values of their constant parameters. For example, Runtime.exec() can directly execute shell commands, and file or database operations can interact with distinctive targets by providing the proper URIs as input parameters.

To understand these semantic-level differences, we perform backward dataflow analysis on selected parameters and collect all possible constant values on the backward trace. We generate a constant set for each critical API argument and mark the parameter as "Constant" in the corresponding node on the WC-ADG. While a more complete string constant analysis is also possible, the computation of regular expressions is fairly expensive for static analysis. The substring set currently generated effectively reflects the semantics of a critical API call and is sufficient for further feature extraction.
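As an illustration of this step, here is a simplified worklist sketch of backward constant collection. The Value and DefUse interfaces are hypothetical stand-ins for the underlying dataflow framework; DroidSIFT builds on Soot, whose actual API differs.

```java
import java.util.*;

// Hypothetical minimal dataflow facade: each value is either a string
// constant or has a set of reaching operands to walk backwards over.
interface Value { boolean isStringConstant(); String constantValue(); }
interface DefUse { Set<Value> reachingOperands(Value v); }

final class ConstantCollector {
    // Collect all string constants that may flow into one critical API
    // argument, walking def-use chains backwards from the call site.
    static Set<String> constantsFor(Value arg, DefUse du) {
        Set<String> constants = new HashSet<>();
        Deque<Value> worklist = new ArrayDeque<>();
        Set<Value> visited = new HashSet<>();
        worklist.push(arg);
        while (!worklist.isEmpty()) {
            Value v = worklist.pop();
            if (!visited.add(v)) continue;          // avoid cycles
            if (v.isStringConstant()) {
                constants.add(v.constantValue());   // leaf of the backward trace
            } else {
                worklist.addAll(du.reachingOperands(v));
            }
        }
        return constants;  // attached to the WC-ADG node as its constant set
    }
}
```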
API Dependency Construction.

We perform global dataflow analysis to discover data dependencies between API nodes and build the edges of the WC-ADG. However, it is very expensive to analyze every single API call made by an app. To address computational efficiency and our interest in security analysis, we choose to analyze only the security-related API calls. Permissions are strong indicators of security sensitivity in Android systems, so we leverage the API-permission mapping from PScout [8] to focus on permission-related API calls.

Our static dataflow analysis is similar to the "split"-based approach used by CHEX [23]. Each program split includes all code reachable from a single entry point. Dataflow analysis is performed on each split, and then cross-split dataflows are examined. The difference between our analysis and that of CHEX lies in the fact that we compute larger splits due to the consideration of asynchronous calling conventions.

We make a special consideration for reflective calls within our analysis. In Android programs, reflection is realized by calling the method java.lang.reflect.Method.invoke(). The "this" reference of this API call is a Method object, which is usually obtained by invoking either getMethod() or getDeclaredMethod() on a java.lang.Class object. The class is often acquired in a reflective manner too, through Class.forName(). This API call resolves a string input and retrieves the associated Class object.

We consider any reflective invoke() call as a sink and conduct backward dataflow analysis to find any prior data dependencies. If such an analysis reaches string constants, we are able to statically resolve the class and method information. Otherwise, the reflective call is not statically resolvable. However, statically unresolvable behavior is still represented within the WC-ADG as nodes which contain no constant parameters. Instead, this reflective call may have several preceding APIs, from a dataflow perspective, which are the sources of its metadata.
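For concreteness, the reflective call pattern described above looks like the following; this is our own example, and the class/method names passed in are hypothetical targets.

```java
import java.lang.reflect.Method;

class ReflectionExample {
    Object call(String className, String methodName, Object arg)
            throws Exception {
        // Class.forName() resolves a string into a Class object ...
        Class<?> clazz = Class.forName(className);
        // ... getMethod() resolves a Method from another string ...
        Method m = clazz.getMethod(methodName, Object.class);
        // ... and invoke() is the sink the backward analysis starts from.
        // If className/methodName trace back to string constants, the
        // target is statically resolvable; otherwise the node stays in
        // the WC-ADG without constant parameters.
        return m.invoke(clazz.newInstance(), arg);
    }
}
```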
4. ANDROID MALWARE CLASSIFICATION

We generate WC-ADGs for both benign and malicious apps. Each unique graph is associated with a feature that we use to classify Android malware and benign applications.

4.1 Graph Matching Score

To quantify the similarity of two graphs, we first compute a graph edit distance. To our knowledge, all existing graph edit distance algorithms treat nodes and edges uniformly. However, in our case, the graph edit distance calculation must take into account the different weights of different API nodes. At present, we do not consider assigning different weights to edges because this would lead to prohibitively high complexity in graph matching. Moreover, to emphasize the differences between two nodes with different labels, we do not seek to relabel them. Instead, we delete the old node and insert the new one. This is because the node "relabeling" cost, in our context, is not the string edit distance between the API labels of two nodes; it is the cost of deleting the old node plus that of adding the new node.

Definition 2. The Weighted Graph Edit Distance (WGED) of two Weighted Contextual API Dependency Graphs G and G', with a uniform weight function β, is the minimum cost to transform G into G':

    wged(G, G', β) = min( Σ_{v_I ∈ {V'−V}} β(v_I) + Σ_{v_D ∈ {V−V'}} β(v_D) + |E_I| + |E_D| )    (1)

where V and V' are respectively the vertices of the two graphs, v_I and v_D are individual vertices inserted into and deleted from G, while E_I and E_D are the edges added to and removed from G.

WGED represents the absolute difference between two graphs. This implies that wged(G, G') is roughly proportional to the sum of the graph sizes, and therefore two larger graphs are likely to be more distant from one another. To eliminate this bias, we normalize the resulting distance and further define Weighted Graph Similarity based upon it.

Definition 3. The Weighted Graph Similarity of two Weighted Contextual API Dependency Graphs G and G', with a weight function β, is

    wgs(G, G', β) = 1 − wged(G, G', β) / ( wged(G, ∅, β) + wged(∅, G', β) )    (2)

where ∅ is an empty graph. wged(G, ∅, β) + wged(∅, G', β) then equates to the maximum possible edit cost to transform G into G'.

4.2 Weight Assignment

Instead of manually specifying the weights for different APIs (in combination with their attributes), we wish to find a near-optimal weight assignment.

Selection of Critical API Labels.

Given a large number of API labels (unique combinations of API names and attributes), it is unrealistic to automatically assign weights to every one of them. Our goal is malware classification, so we concentrate on assigning weights to labels for the security-sensitive APIs and critical combinations of their attributes. To this end, we perform concept learning to discover critical API labels. Given a positive example set (PES) containing malware graphs and a negative example set (NES) containing benign graphs, we seek a critical API label (CA) based on two requirements: 1) frequency(CA, PES) > frequency(CA, NES), and 2) frequency(CA, NES) is less than the median frequency of all critical API labels in NES. The first requirement guarantees that a critical API label is more sensitive to a malware sample than a benign one, while the second requirement ensures the infrequent presence of such an API label in the benign set. Consequently, we have selected 108 critical API labels. Our goal becomes the assignment of appropriate weights to these 108 labels while assigning a default weight of 1 to all remaining API labels.

Weight Assignment.

Intuitively, if two graphs come from the same malware family and share one or more critical API labels, we must maximize the similarity between the two. We call such a pair of graphs a "homogeneous pair". Conversely, if one graph is malicious and the other is benign, even if they share one or more critical API labels, we must minimize the similarity between the two. We call such a pair of graphs a "heterogeneous pair". Therefore, we cast the problem of weight assignment as an optimization problem.

Definition 4. The Weight Assignment is an optimization problem to maximize the result of an objective function for a given set of graph pairs {⟨G, G'⟩}:

    max f({⟨G, G'⟩}, β) = Σ_{⟨G,G'⟩ is a homogeneous pair} wgs(G, G', β) − Σ_{⟨G,G'⟩ is a heterogeneous pair} wgs(G, G', β)

    s.t.  1 ≤ β(v) ≤ θ, if v is a critical node;  β(v) = 1, otherwise.    (3)

where β is the weight function that requires optimization, and θ is the upper bound of a weight. Empirically, we set θ to be 20.
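To illustrate Equations 2 and 3, a direct transcription into code might read as follows. The wged() computation is assumed to come from the weight-aware bipartite graph matcher described in Section 4.3; GraphPair is our own illustrative type, and WcAdg is reused from the sketch after Definition 1.

```java
import java.util.List;

final class GraphPair {
    final WcAdg g, gPrime; final boolean homogeneous;
    GraphPair(WcAdg a, WcAdg b, boolean h) { g = a; gPrime = b; homogeneous = h; }
}

final class Similarity {
    // Assumed: weighted graph edit distance from the bipartite matcher.
    interface Wged { double of(WcAdg g1, WcAdg g2, double[] beta); }

    // Equation 2: similarity normalized into [0, 1] by the maximum
    // possible edit cost (deleting all of G, inserting all of G').
    static double wgs(WcAdg g, WcAdg gPrime, double[] beta, Wged wged) {
        double maxCost = wged.of(g, new WcAdg(), beta)       // empty graph
                       + wged.of(new WcAdg(), gPrime, beta);
        return 1.0 - wged.of(g, gPrime, beta) / maxCost;
    }

    // Equation 3's objective: reward homogeneous pairs, penalize
    // heterogeneous ones.
    static double objective(List<GraphPair> pairs, double[] beta, Wged wged) {
        double f = 0.0;
        for (GraphPair p : pairs) {
            double s = wgs(p.g, p.gPrime, beta, wged);
            f += p.homogeneous ? s : -s;
        }
        return f;
    }
}
```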
Figure 6: A Feedback Loop to Solve the Optimization Problem (homogeneous and heterogeneous graph pairs feed f({⟨G,G'⟩}, β), which a Hill Climber iteratively improves by adjusting β)

To achieve the optimization of Equation 3, we use the Hill Climbing algorithm [30] to implement a feedback loop that gradually improves the quality of the weight assignment. Figure 6 presents such a system, which takes two sets of graph pairs and an initial weight function β as inputs. β is a discrete function which is represented as a weight vector. At each iteration, Hill Climbing adjusts a single element in the weight vector and determines whether the change improves the value of the objective function f({⟨G,G'⟩}, β). Any change that improves f({⟨G,G'⟩}, β) is accepted, and the process continues until no change can be found that further improves the value.
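A minimal sketch of that feedback loop, reusing objective() and the types from the previous sketch, might look like this; the step size and sweep order are our assumptions, since the paper only specifies single-element adjustments and the bound θ = 20.

```java
import java.util.List;

final class HillClimber {
    static final double THETA = 20.0;  // upper bound on critical weights
    static final double STEP  = 1.0;   // assumed adjustment granularity

    // beta[i] is the weight of the i-th critical API label; all other
    // labels implicitly keep the default weight of 1.
    static double[] optimize(double[] beta, List<GraphPair> pairs,
                             Similarity.Wged wged) {
        double best = Similarity.objective(pairs, beta, wged);
        boolean improved = true;
        while (improved) {                       // stop at a local optimum
            improved = false;
            for (int i = 0; i < beta.length; i++) {
                for (double delta : new double[] {STEP, -STEP}) {
                    double old = beta[i];
                    // adjust one element, clamped to [1, theta]
                    beta[i] = Math.max(1.0, Math.min(THETA, old + delta));
                    double f = Similarity.objective(pairs, beta, wged);
                    if (f > best) { best = f; improved = true; }
                    else          { beta[i] = old; }   // reject the change
                }
            }
        }
        return beta;
    }
}
```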
database, we limit the comparison to only the graphs within a par- Given an app, the detector provides a binary result that indicates ticular bucket that possesses graphs containing a corresponding set whether the app is abnormal or not. To achieve this goal, we build of critical APIs. Critical APIs generally have higher weights than a graph database for benign apps. The detector then attempts to regular APIs, so graphs in other buckets will not be very similar to match the WC-ADGs of the given app against the ones in database. the input graph and are safe to ignore. However, API-based bucket If a sufficiently similar one for any of the behavior graphs is not indexing may be overly strict because APIs from the same package found, an anomaly is reported by the detector. We have set the usually share similar functionality. For instance, both getDeviceId() similarity threshold to be 70% per our empirical studies. and getSubscriberId() are located in TelephonyManager pack- age, and both retrieve identity-related information. Therefore, we Signature Detection. instead index buckets based on the package names of critical APIs. We next use a classifier to perform signature detection. Our sig- More specifically, to build a graph database, we must first build nature detector is a multi-label classifier designed to identify the an API package bitvector for all existing graphs within the database. malware families of unknown malware instances. Such a bitvector has n elements, each of which indicates the pres- To enable classification, we first build a malware graph database. ence of a particular Android API package. For example, a graph To this end, we conduct static analysis on the malware samples that calls sendTextMessage() and getDeviceId() will set the from the Android Malware Genome Project [1,36] to extract WC-ADGs. corresponding bits for the android.telephony.SmsManager and In order to consider only the unique graphs, we remove any graphs android.telephony.TelephonyManager packages. Graphs that that have a high level of similarity to existing ones. With experi- share the same bitvector (i.e., the same API package combination) mental study, we consider a high similarity to be greater than 80%. are then placed into the same bucket. When querying a new graph Further, to guarantee the distinctiveness of malware behaviors, we against the database, we encode its API package combination into compare these malware graphs against our benign graph set and a bitvector and compare that bitvector against each database index. remove the common ones. Notice that, to ensure the scalability, we implement the bucket- Next, given an app, we generate its feature vector for classifi- based indexing with a hash map where the key is the API package cation purpose. In such a vector, each element is associated with bitvector and the value is a corresponding graph set. a graph in our database. And, in turn, all the existing graphs are Empirically, we found this one-level indexing efficient enough projected to a feature vector. In other words, there exists a one-to- for our problem. If the database grows much larger, we can tran- one correspondence between the elements in a feature vector and
4.5 Malware Classification

Anomaly Detection.

We have implemented a detector to conduct anomaly detection. Given an app, the detector provides a binary result that indicates whether the app is abnormal or not. To achieve this goal, we build a graph database for benign apps. The detector then attempts to match the WC-ADGs of the given app against the ones in the database. If a sufficiently similar counterpart cannot be found for any of the behavior graphs, an anomaly is reported by the detector. We have set the similarity threshold to 70% per our empirical studies.

Signature Detection.

We next use a classifier to perform signature detection. Our signature detector is a multi-label classifier designed to identify the malware families of unknown malware instances.

To enable classification, we first build a malware graph database. To this end, we conduct static analysis on the malware samples from the Android Malware Genome Project [1, 36] to extract WC-ADGs. In order to consider only the unique graphs, we remove any graphs that have a high level of similarity to existing ones. Based on experimental study, we consider a high similarity to be greater than 80%. Further, to guarantee the distinctiveness of malware behaviors, we compare these malware graphs against our benign graph set and remove the common ones.

Next, given an app, we generate its feature vector for classification purposes. In such a vector, each element is associated with a graph in our database, and, in turn, all the existing graphs are projected onto a feature vector. In other words, there exists a one-to-one correspondence between the elements in a feature vector and the existing graphs in the database. To construct the feature vector of a given app, we produce its WC-ADGs and then query the graph database for all the generated graphs. For each query, a best-matching graph is found. The similarity score is then put into the feature vector at the position corresponding to this best-matching graph. Specifically, the feature vector of a known malware sample is attached with its family label so that the classifier can understand the discrepancy between different malware families.

Figure 8: An Example of Feature Vectors

              G1   G2   G3   G4   G5   G6   G7   G8   ...  G861 G862
ADRD          0    0    0    0    0    0.8  0.9  0    ...  0    0
DroidDream    0.9  0    0    0    0.8  0.7  0.7  0    ...  0    0
DroidKungFu   0    0.7  0    0    0.6  0    0.6  0    ...  0    0.9

Figure 8 gives an example of feature vectors. With our malware graph database of 862 graphs, a feature vector of 862 elements is constructed for each app. The two behavior graphs of ADRD are most similar to graphs G6 and G7, respectively, from the database. The corresponding elements of the feature vector are set to the similarity scores of those matches. The rest of the elements remain set to zero.

Once we have produced the feature vectors for the training samples, we can use them to train a classifier. We select the Naïve Bayes algorithm for the classification. In fact, we could choose different algorithms for the same purpose. However, since our graph-based features are fairly strong, even Naïve Bayes can produce satisfying results. Naïve Bayes also has several advantages: it requires only a small amount of training data; parameter adjustment is straightforward and simple; and runtime performance is favorable.
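The following sketch ties these pieces together: best-match queries produce the feature vector, and the same best-match scores drive the 70% anomaly threshold. Types are reused from the earlier sketches, and bestMatch() is an assumed helper wrapping the bucket lookup plus the similarity metric.

```java
import java.util.List;

final class FeatureExtractor {
    static final double ANOMALY_THRESHOLD = 0.70;

    // Assumed helper: best similarity between g and any database graph,
    // returning the matched graph's database position and its score.
    interface Matcher { double[] bestMatch(WcAdg g); /* {index, score} */ }

    // One element per database graph; non-zero only at best-match slots.
    static double[] featureVector(List<WcAdg> appGraphs, int dbSize,
                                  Matcher matcher) {
        double[] vector = new double[dbSize];
        for (WcAdg g : appGraphs) {
            double[] m = matcher.bestMatch(g);
            int idx = (int) m[0];
            vector[idx] = Math.max(vector[idx], m[1]);
        }
        return vector;
    }

    // Anomaly detection against the *benign* database: report an anomaly
    // if any behavior graph has no sufficiently similar benign counterpart.
    static boolean isAbnormal(List<WcAdg> appGraphs, Matcher benignMatcher) {
        for (WcAdg g : appGraphs) {
            if (benignMatcher.bestMatch(g)[1] < ANOMALY_THRESHOLD) return true;
        }
        return false;
    }
}
```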
5. EVALUATION

5.1 Dataset & Experiment Setup

We collected 2200 malware samples from the Android Malware Genome Project [1] and from McAfee, Inc., a leading antivirus company. To build a benign dataset, we received a number of benign samples from McAfee, and we downloaded a variety of popular, highly ranked apps from Google Play. To further sanitize this benign dataset, we sent these apps to the VirusTotal service for inspection. The final benign dataset consisted of 13500 samples. We performed the behavior graph generation, graph database creation, graph similarity query, and feature vector extraction using this dataset. We conducted the experiments on a test machine equipped with an Intel(R) Xeon(R) E5-2650 CPU (20M cache, 2GHz) and 128GB of physical memory. The operating system is Ubuntu 12.04.3 (64-bit).

5.2 Summary of Graph Generation

Figure 9: Graph Generation Summary. (a) Graphs per Benign App. (b) Graphs per Malware. (c) Nodes per Benign Graph. (d) Nodes per Malware Graph.

Figure 9 summarizes the characteristics of the behavior graphs generated from both benign and malicious apps. Figures 9a and 9b illustrate the number of graphs generated from benign and malicious apps. On average, 7.8 graphs are computed from each benign app, while 9.8 graphs are generated from each malware instance. Most apps focus on limited functionalities and do not produce a large number of behavior graphs. In 92% of benign samples and 98% of malicious ones, no more than 20 graphs are produced from an individual app.

Figures 9c and 9d present the number of nodes in benign and malicious behavior graphs. A benign graph, on average, has 15 nodes, while a malicious graph carries 16.4. Again, most of the activities are not intensive, so the majority of these graphs have a small number of nodes. Statistics show that 94% of the benign graphs and 91% of the malicious ones carry fewer than 50 nodes. These facts serve as the basic requirements for the scalability of our approach, since the runtime performance of graph matching and query is largely dependent upon the number of nodes and graphs, respectively.
5.3 Classification Results

Signature Detection.

We use multi-label classification to identify the malware family of unrecognized malicious samples. Therefore, we expect to include in the database only those malware behavior graphs that are well labeled with family information. To this end, we rely on the malware samples from the Android Malware Genome Project and use them to construct the malware graph database. Consequently, we built a database of 862 unique behavior graphs, with each graph labeled with a specific malware family.

We then selected 1050 malware samples from the Android Malware Genome Project and used them as a training set. Next, we would like to collect testing samples from the rest of our collection. However, a majority of the malware samples from McAfee are not strongly labeled. Over 90% of the samples are coarsely labeled as "Trojan" or "Downloader" but in fact belong to a specific malware family (e.g., DroidDream). Moreover, even VirusTotal cannot provide reliable malware family information for a given sample, because the antivirus products used by VirusTotal seldom reach a consensus. This fact tells us two things: 1) it is a non-trivial task to collect evident samples as ground truth in the context of multi-label classification; 2) multi-label malware detection or classification is, in general, a challenging real-world problem.

Despite the difficulty, we obtained 193 samples, each of which is detected as the same malware by the major AVs. We then used those samples as testing data. The experimental result shows that our classifier can correctly label 93% of these malware instances.

Among the successfully labeled malware samples are two types of Zitmo variants. One uses HTTP for communication, and the other uses SMS. While the former is present in our malware database, the latter was not. Nevertheless, our signature detector is still able to capture this variant. This indicates that our similarity metrics effectively tolerate variations in behavior graphs.

We further examined the 7% of the samples that were mislabeled. It turns out that the mislabeled cases can be roughly put into two categories. First, DroidDream samples are labeled as DroidKungFu. DroidDream and DroidKungFu share multiple malicious behaviors, such as gathering privacy-related information and hidden network I/O. Consequently, there exists a significant overlap between their WC-ADGs. Second, Zitmo, Zsone, and YZHC instances are labeled as one another. These three families are SMS Trojans. Though their behaviors are slightly different from each other, they all exploit sendTextMessage() to deliver the user's information to an attacker-specified phone number. Despite the mislabeled cases, we still manage to successfully label 93% of the malware samples with a Naïve Bayes classifier. Applying a more advanced classification algorithm could further improve the accuracy.

Anomaly Detection.

Since we wish to perform anomaly detection using our benign graph database, the coverage of this database is essential. In theory, the more benign apps the database collects, the more benign behaviors it covers. However, in practice, it is extremely difficult to retrieve benign apps exhaustively. Luckily, different benign apps may share the same behaviors. Therefore, we can focus on unique behaviors (rather than unique apps). Moreover, as more and more apps are fed into the benign database, the database size grows more and more slowly. Figure 10 depicts our discovery. When the number of apps increases from 3000 to 4000, there is a sharp increase (2087) in unique graphs. However, when the number of apps grows from 10000 to 11000, only 220 new unique graphs are generated, and the curve begins to flatten.

Figure 10: Convergence of Unique Graphs in Benign Apps (number of unique graphs vs. number of benign apps, 3000 to 11000)

We built a database of 10420 unique graphs from 11400 benign apps. Then, we tested 2200 malware samples against the benign classifier. The false negative rate was 2%, which indicates that 42 malware instances were not detected. However, we noted that most of the missed samples are exploits or Downloaders.
In these cases, their bytecode programs do not bear significant API-level behaviors, and therefore the generated WC-ADGs do not necessarily look abnormal when compared to benign ones. At this point, we have only considered the presence of constant parameters in an API call; we did not further differentiate API behaviors based upon constant values. Therefore, we cannot distinguish the behaviors of Runtime.exec() calls or network I/O APIs with varied string inputs. Nevertheless, if we create a custom filter for these string constants, we will then be able to identify these malware samples, and the false negative rate will drop to 0.

Next, we used the remaining 2100 benign apps as test samples to evaluate the false positive rate of our anomaly detector. The result shows that 5.15% of clean apps are mistakenly recognized as suspicious during anomaly detection. This means that, if our anomaly detector were applied to Google Play, among the approximately 1200 new apps per day [4], around 60 apps would be mislabeled as containing anomalies and be bounced back to the developers. We believe that this is an acceptable ratio for vetting purposes. Moreover, since we do not reject the suspicious apps immediately, but rather ask the developers for justifications, we can further eliminate these false positives during this interactive process. In addition, as we add more benign samples into the dataset, the false positive rate will further decrease.

Further, we would like to evaluate our anomaly detector with malicious samples from new malware families. To this end, we retrieved a new piece of malware called Android.HeHe, which was first found and reported in January 2014 [12]. Android.HeHe exercises a variety of malicious functionalities, such as SMS interception, information stealing, and command-and-control. This new malware family does not appear in the samples from the Android Malware Genome Project, which were collected from August 2010 to October 2011, and therefore it cannot be labeled via signature detection. DroidSIFT generates 49 WC-ADGs for this sample and, once these graphs are presented to the anomaly detector, a warning is raised indicating the abnormal behaviors expressed by these graphs.

Detection of Transformation Attacks.

We collected 23 DroidDream samples, which are all intentionally obfuscated using a transformation technique [28], and 2 benign apps that are deliberately disguised as malware instances by applying the same technique. We ran these samples through our anomaly detection engine and then sent the detected abnormal ones through the signature detector. The result shows that while the 23 true malware instances are flagged as abnormal in anomaly detection, the 2 clean ones correctly pass the detection without raising any warnings. We then compared our signature detection results with antivirus products. To obtain the detection results of antivirus software, we sent these samples to VirusTotal and selected the 10 anti-virus (AV) products (i.e., AegisLab, F-Prot, ESET-NOD32, DrWeb, AntiVir, CAT-QuickHeal, Sophos, F-Secure, Avast, and Ad-Aware) that bear the highest detection rates. Notice that we consider the detection to be successful only if the AV can correctly flag a piece of malware as DroidDream or one of its variants. In fact, from our observation, many AV products can provide partial detection results based upon the native exploit code included in the app package or common network I/O behaviors. As a result, they usually recognize these DroidDream samples as "exploits" or "Downloaders" while missing many other important malicious behaviors.

Figure 11: Detection Ratio for Obfuscated Malware (true and false positive counts per detector)

Figure 11 presents the detection ratios for "DroidDream" across the different detectors. While none of the antivirus products can achieve a detection rate higher than 61%, DroidSIFT successfully flags all the obfuscated samples as DroidDream instances. In addition, we also notice that AV2 produces a relatively high detection ratio (52.17%), but it also mistakenly flags the two clean samples as malicious apps. Since the disguising technique simply renames the benign app package to one commonly used by DroidDream (and thus confuses this AV detector), such false positives again demonstrate that external symptoms are not robust and reliable features for malware detection.

5.4 Runtime Performance

Figure 12 illustrates the runtime performance of DroidSIFT. Specifically, it demonstrates the cumulative time consumption of graph