Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds
Thomas Ristenpart*   Eran Tromer†   Hovav Shacham*   Stefan Savage*
* Dept. of Computer Science and Engineering, University of California, San Diego, USA — {tristenp,hovav,savage}@cs.ucsd.edu
† Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, USA — tromer@csail.mit.edu

ABSTRACT
Third-party cloud computing represents the promise of outsourcing as applied to computation. Services, such as Microsoft's Azure and Amazon's EC2, allow users to instantiate virtual machines (VMs) on demand and thus purchase precisely the capacity they require when they require it. In turn, the use of virtualization allows third-party cloud providers to maximize the utilization of their sunk capital costs by multiplexing many customer VMs across a shared physical infrastructure. However, in this paper, we show that this approach can also introduce new vulnerabilities. Using the Amazon EC2 service as a case study, we show that it is possible to map the internal cloud infrastructure, identify where a particular target VM is likely to reside, and then instantiate new VMs until one is placed co-resident with the target. We explore how such placement can then be used to mount cross-VM side-channel attacks to extract information from a target VM on the same machine.

Categories and Subject Descriptors
K.6.5 [Security and Protection]: UNAUTHORIZED ACCESS

General Terms
Security, Measurement, Experimentation

Keywords
Cloud computing, Virtual machine security, Side channels

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CCS'09, November 9–13, 2009, Chicago, Illinois, USA. Copyright 2009 ACM 978-1-60558-352-5/09/11 ...$10.00.

1. INTRODUCTION
It has become increasingly popular to talk of "cloud computing" as the next infrastructure for hosting data and deploying software and services. In addition to the plethora of technical approaches associated with the term, cloud computing is also used to refer to a new business model in which core computing and software capabilities are outsourced on demand to shared third-party infrastructure. While this model, exemplified by Amazon's Elastic Compute Cloud (EC2) [5], Microsoft's Azure Service Platform [20], and Rackspace's Mosso [27], provides a number of advantages — including economies of scale, dynamic provisioning, and low capital expenditures — it also introduces a range of new risks.

Some of these risks are self-evident and relate to the new trust relationship between customer and cloud provider. For example, customers must trust their cloud providers to respect the privacy of their data and the integrity of their computations. However, cloud infrastructures can also introduce non-obvious threats from other customers due to the subtleties of how physical resources can be transparently shared between virtual machines (VMs).

In particular, to maximize efficiency multiple VMs may be simultaneously assigned to execute on the same physical server. Moreover, many cloud providers allow "multi-tenancy" — multiplexing the virtual machines of disjoint customers upon the same physical hardware. Thus it is conceivable that a customer's VM could be assigned to the same physical server as their adversary. This, in turn, engenders a new threat — that the adversary might penetrate the isolation between VMs (e.g., via a vulnerability that allows an "escape" to the hypervisor or via side-channels between VMs) and violate customer confidentiality. This paper explores the practicality of mounting such cross-VM attacks in existing third-party compute clouds.

The attacks we consider require two main steps: placement and extraction. Placement refers to the adversary arranging to place their malicious VM on the same physical machine as that of a target customer. Using Amazon's EC2 as a case study, we demonstrate that careful empirical "mapping" can reveal how to launch VMs in a way that maximizes the likelihood of an advantageous placement. We find that in some natural attack scenarios, just a few dollars invested in launching VMs can produce a 40% chance of placing a malicious VM on the same physical server as a target customer. Using the same platform we also demonstrate the existence of simple, low-overhead, "co-residence" checks to determine when such an advantageous placement has taken place. While we focus on EC2, we believe that variants of our techniques are likely to generalize to other services, such as Microsoft's Azure [20] or Rackspace's Mosso [27], as we only utilize standard customer capabilities and do not require that cloud providers disclose details of their infrastructure or assignment policies.
Having managed to place a VM co-resident with the tar- In this setting, we consider two kinds of attackers: those get, the next step is to extract confidential information via who cast a wide net and are interested in being able to attack a cross-VM attack. While there are a number of avenues some known hosted service and those focused on attacking a for such an attack, in this paper we focus on side-channels: particular victim service. The latter’s task is more expensive cross-VM information leakage due to the sharing of physical and time-consuming than the former’s, but both rely on the resources (e.g., the CPU’s data caches). In the multi-process same fundamental attack. environment, such attacks have been shown to enable ex- In this work, we initiate a rigorous research program aimed traction of RSA [26] and AES [22] secret keys. However, we at exploring the risk of such attacks, using a concrete cloud are unaware of published extensions of these attacks to the service provider (Amazon EC2) as a case study. We address virtual machine environment; indeed, there are significant these concrete questions in subsequent sections: practical challenges in doing so. • Can one determine where in the cloud infrastructure an We show preliminary results on cross-VM side channel at- instance is located? (Section 5) tacks, including a range of building blocks (e.g., cache load • Can one easily determine if two instances are co-resident measurements in EC2) and coarse-grained attacks such as on the same physical machine? (Section 6) measuring activity burst timing (e.g., for cross-VM keystroke • Can an adversary launch instances that will be co-resident monitoring). These point to the practicality of side-channel with other user’s instances? (Section 7) attacks in cloud-computing environments. • Can an adversary exploit cross-VM information leakage Overall, our results indicate that there exist tangible dan- once co-resident? (Section 8) gers when deploying sensitive tasks to third-party compute Throughout we offer discussions of defenses a cloud provider clouds. In the remainder of this paper, we explain these might try in order to prevent the success of the various at- findings in more detail and then discuss means to mitigate tack steps. the problem. We argue that the best solution is for cloud providers to expose this risk explicitly and give some place- ment control directly to customers. 3. THE EC2 SERVICE By far the best known example of a third-party compute 2. THREAT MODEL cloud is Amazon’s Elastic Compute Cloud (EC2) service, As more and more applications become exported to third- which enables users to flexibly rent computational resources party compute clouds, it becomes increasingly important to for use by their applications [5]. EC2 provides the ability quantify any threats to confidentiality that exist in this set- to run Linux, FreeBSD, OpenSolaris and Windows as guest ting. For example, cloud computing services are already operating systems within a virtual machine (VM) provided used for e-commerce applications, medical record services [7, by a version of the Xen hypervisor [9].1 The hypervisor 11], and back-office business applications [29], all of which plays the role of a virtual machine monitor and provides require strong confidentiality guarantees. An obvious threat isolation between VMs, intermediating access to physical to these consumers of cloud computing is malicious behav- memory and devices. 
A privileged virtual machine, called ior by the cloud provider, who is certainly in a position to Domain0 (Dom0) in the Xen vernacular, is used to manage violate customer confidentiality or integrity. However, this guest images, their physical resource provisioning, and any is a known risk with obvious analogs in virtually any in- access control rights. In EC2 the Dom0 VM is configured dustry practicing outsourcing. In this work, we consider to route packets for its guest images and reports itself as a the provider and its infrastructure to be trusted. This also hop in traceroutes. means we do not consider attacks that rely upon subverting When first registering with EC2, each user creates an ac- a cloud’s administrative functions, via insider abuse or vul- count — uniquely specified by its contact e-mail address — nerabilities in the cloud management systems (e.g., virtual and provides credit card information for billing compute and machine monitors). I/O charges. With a valid account, a user creates one or In our threat model, adversaries are non-provider-affiliated more VM images, based on a supplied Xen-compatible ker- malicious parties. Victims are users running confidentiality- nel, but with an otherwise arbitrary configuration. He can requiring services in the cloud. A traditional threat in such a run one or more copies of these images on Amazon’s network setting is direct compromise, where an attacker attempts re- of machines. One such running image is called an instance, mote exploitation of vulnerabilities in the software running and when the instance is launched, it is assigned to a single on the system. Of course, this threat exists for cloud appli- physical machine within the EC2 network for its lifetime; cations as well. These kinds of attacks (while important) are EC2 does not appear to currently support live migration of a known threat and the risks they present are understood. instances, although this should be technically feasible. By We instead focus on where third-party cloud computing default, each user account is limited to 20 concurrently run- gives attackers novel abilities; implicitly expanding the at- ning instances. tack surface of the victim. We assume that, like any cus- In addition, there are three degrees of freedom in specify- tomer, a malicious party can run and control many instances ing the physical infrastructure upon which instances should in the cloud, simply by contracting for them. Further, since run. At the time of this writing, Amazon provides two the economies offered by third-party compute clouds derive “regions”, one located in the United States and the more from multiplexing physical infrastructure, we assume (and recently established one in Europe. Each region contains later validate) that an attacker’s instances might even run three “availability zones” which are meant to specify in- on the same physical hardware as potential victims. From frastructures with distinct and independent failure modes this vantage, an attacker might manipulate shared physical 1 We will limit our subsequent discussion to the Linux ker- resources (e.g., CPU caches, branch target buffers, network nel. The same issues should apply for other guest operating queues, etc.) to learn otherwise confidential information. systems.
(e.g., with separate power and network connectivity). When acceptable use policy, whereas external probing is not (we requesting launch of an instance, a user specifies the re- discuss the legal, ethical and contractual issues around such gion and may choose a specific availability zone (otherwise probing in Appendix A). one is assigned on the user’s behalf). As well, the user We use DNS resolution queries to determine the external can specify an “instance type”, indicating a particular com- name of an instance and also to determine the internal IP bination of computational power, memory and persistent address of an instance associated with some public IP ad- storage space available to the virtual machine. There are dress. The latter queries are always performed from an EC2 five Linux instance types documented at present, referred instance. to as ‘m1.small’, ‘c1.medium’, ‘m1.large’, ‘m1.xlarge’, and ‘c1.xlarge’. The first two are 32-bit architectures, the latter three are 64-bit. To give some sense of relative scale, the 5. CLOUD CARTOGRAPHY “small compute slot” (m1.small) is described as a single vir- In this section we ‘map’ the EC2 service to understand tual core providing one ECU (EC2 Compute Unit, claimed to where potential targets are located in the cloud and the be equivalent to a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon instance creation parameters needed to attempt establish- processor) combined with 1.7 GB of memory and 160 GB ing co-residence of an adversarial instance. This will speed of local storage, while the “large compute slot” (m1.large) up significantly adversarial strategies for placing a malicious provides 2 virtual cores each with 2 ECUs, 7.5GB of mem- VM on the same machine as a target. In the next section we ory and 850GB of local storage. As expected, instances with will treat the task of confirming when successful co-residence more resources incur greater hourly charges (e.g., ‘m1.small’ is achieved. in the United States region is currently $0.10 per hour, while To map EC2, we begin with the hypothesis that different ‘m1.large’ is currently $0.40 per hour). When launching an availability zones are likely to correspond to different inter- instance, the user specifies the instance type along with a nal IP address ranges and the same may be true for instance compatible virtual machine image. types as well. Thus, mapping the use of the EC2 internal Given these constraints, virtual machines are placed on address space allows an adversary to determine which IP ad- available physical servers shared among multiple instances. dresses correspond to which creation parameters. Moreover, Each instance is given Internet connectivity via both an since EC2’s DNS service provides a means to map public IP external IPv4 address and domain name and an internal address to private IP address, an adversary might use such a RFC 1918 private address and domain name. For example, map to infer the instance type and availability zone of a tar- an instance might be assigned external IP 75.101.210.100, get service — thereby dramatically reducing the number of external name ec2-75-101-210-100.compute-1.amazonaws instances needed before a co-resident placement is achieved. .com, internal IP 10.252.146.52, and internal name domU- We evaluate this theory using two data sets: one created 12-31-38-00-8D-C6.compute-1.internal. 
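The cartography in the next section hinges on this dual naming: as noted just below, resolving an instance's public DNS name from inside EC2 yields its internal RFC 1918 address, while the same query from outside returns the public address. The following is a minimal sketch of that lookup using only the Python standard library; the example addresses are the ones quoted above, and no EC2-specific API is assumed.

import socket

# Example names from the text; substitute any public EC2 address of interest.
EXTERNAL_IP = "75.101.210.100"
EXTERNAL_NAME = "ec2-75-101-210-100.compute-1.amazonaws.com"

def resolve(name):
    """Return the set of IPv4 addresses a hostname resolves to."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(name, None, socket.AF_INET)}
    except socket.gaierror:
        return set()

# Run from a host outside EC2, this returns the external IP.
# Run from an instance inside EC2 (using Amazon's resolver), the same
# query instead returns the internal 10.x.y.z address.
print(resolve(EXTERNAL_NAME))

# Reverse lookup of a public IP yields the external name, which can then
# be resolved from within the cloud as above.
try:
    print(socket.gethostbyaddr(EXTERNAL_IP)[0])
except socket.herror:
    print("no PTR record")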
Within the cloud, by enumerating public EC2-based web servers using external both domain names resolve to the internal IP address; out- probes and translating responsive public IPs to internal IPs side the cloud the external name is mapped to the external (via DNS queries within the cloud), and another created by IP address. launching a number of EC2 instances of varying types and Note that we focus on the United States region — in the surveying the resulting IP address assigned. rest of the paper EC2 implicitly means this region of EC2. To fully leverage the latter data, we present a heuristic algorithm that helps label /24 prefixes with an estimate of 4. NETWORK PROBING the availability zone and instance type of the included Inter- In the next several sections, we describe an empirical mea- nal IPs. These heuristics utilize several beneficial features surement study focused on understanding VM placement of EC2’s addressing regime. The output of this process is a in the EC2 system and achieving co-resident placement for map of the internal EC2 address space which allows one to an adversary. To do this, we make use of network probing estimate the availability zone and instance type of any tar- both to identify public services hosted on EC2 and to pro- get public EC2 server. Next, we enumerate a set of public vide evidence of co-residence (that two instances share the EC2-based Web servers same physical server). In particular, we utilize nmap, hping, Surveying public servers on EC2. Utilizing WHOIS and wget to perform network probes to determine liveness queries, we identified four distinct IP address prefixes, a /16, of EC2 instances. We use nmap to perform TCP connect /17, /18, and /19, as being associated with EC2. The last probes, which attempt to complete a 3-way hand-shake be- three contained public IPs observed as assigned to EC2 in- tween a source and target. We use hping to perform TCP stances. We had not yet observed EC2 instances with public SYN traceroutes, which iteratively sends TCP SYN pack- IPs in the /16, and therefore did not include it in our sur- ets with increasing time-to-lives (TTLs) until no ACK is vey. For the remaining IP addresses (57 344 IP addresses), received. Both TCP connect probes and SYN traceroutes we performed a TCP connect probe on port 80. This re- require a target port; we only targeted ports 80 or 443. We sulted in 11 315 responsive IPs. Of these 9 558 responded used wget to retrieve web pages, but capped so that at most (with some HTTP response) to a follow-up wget on port 1024 bytes are retrieved from any individual web server. 80. We also performed a TCP port 443 scan of all 57 344 IP We distinguish between two types of probes: external addresses, which resulted in 8 375 responsive IPs. Via an ap- probes and internal probes. A probe is external when it propriate DNS lookup from within EC2, we translated each originates from a system outside EC2 and has destination public IP address that responded to either the port 80 or an EC2 instance. A probe is internal if it originates from port 443 scan into an internal EC2 address. This resulted in an EC2 instance (under our control) and has destination a list of 14 054 unique internal IPs. One of the goals of this another EC2 instance. This dichotomy is of relevance par- section is to enable identification of the instance type and ticularly because internal probing is subject to Amazon’s availability zone of one or more of these potential targets.
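A rough sketch of this survey step follows, under the assumption that it runs where the paper's tooling ran: the probes from anywhere, the DNS translation from an instance inside EC2 using Amazon's resolver. The prefix in the example is a placeholder rather than one of the actual EC2 ranges, and the standard-library socket calls stand in for the nmap/wget tooling described in Section 4.

import ipaddress
import socket

def tcp_connect_probe(ip, port, timeout=2.0):
    """Return True if a TCP 3-way handshake completes (cf. a TCP connect probe)."""
    try:
        with socket.create_connection((str(ip), port), timeout=timeout):
            return True
    except OSError:
        return False

def to_internal(public_ip):
    """From inside EC2: public IP -> external name -> internal 10/8 address."""
    try:
        name = socket.gethostbyaddr(str(public_ip))[0]   # PTR lookup
        return socket.gethostbyname(name)                # resolves internally in EC2
    except OSError:
        return None

def survey(prefix, ports=(80, 443)):
    """Probe a prefix on ports 80/443 and map responsive IPs to internal addresses."""
    internal_ips = set()
    for ip in ipaddress.ip_network(prefix).hosts():
        if any(tcp_connect_probe(ip, p) for p in ports):
            internal = to_internal(ip)
            if internal:
                internal_ips.add(internal)
    return internal_ips

if __name__ == "__main__":
    # Placeholder prefix for illustration only; the real survey covered the
    # /17, /18, and /19 ranges identified via WHOIS (57,344 addresses).
    print(survey("198.51.100.0/28"))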
IP address mod 64 Zone 1 Zone 2 Zone 3 64 32 0 10.249.0.0 10.250.0.0 10.251.0.0 10.252.0.0 10.253.0.0 10.254.0.0 10.255.0.0 Internal IP address Account B Account A c1.medium c1.xlarge m1.large m1.small m1.xlarge 64 32 0 64 32 0 10.252.0.0 10.253.0.0 10.254.0.0 Internal IP address Figure 1: (Top) A plot of the internal IP addresses assigned to instances launched during the initial mapping exper- iment using Account A. (Bottom) A plot of the internal IP addresses of instances launched in Zone 3 by Account A and, 39 hours later, by Account B. Fifty-five of the Account B IPs were repeats of those assigned to instances for Account A. Instance placement parameters. Recall that there are 88 had unique /24 prefixes, while six of the /24 prefixes had three availability zones and five instance types in the present two instances each. A single /24 had both an m1.large and EC2 system. While these parameters could be assigned in- an m1.xlarge instance. No IP addresses were ever observed dependently from the underlying infrastructure, in practice being assigned to more than one instance type. Of the 100 this is not so. In particular, the Amazon EC2 internal IP ad- Acount B IP’s, 55 were repeats of IP addresses assigned to dress space is cleanly partitioned between availability zones instances for Acount A. (likely to make it easy to manage separate network con- A fuller map of EC2. We would like to infer the instance nectivity for these zones) and instance types within these type and availability zone of any public EC2 instance, but zones also show considerable regularity. Moreover, different our sampling data is relatively sparse. We could sample accounts exhibit similar placement. more (and did), but to take full advantage of the sampling To establish these facts, we iteratively launched 20 in- data at hand we should take advantage of the significant stances for each of the 15 availability zone/instance type regularity of the EC2 addressing regime. For example, the pairs. We used a single account, call it “Account A”. The top above data suggests that /24 prefixes rarely have IPs as- graph in Figure 1 depicts a plot of the internal IP address signed to distinct instance types. We utilized data from assigned to each of the 300 instances, partitioned according 4 499 instances launched under several accounts under our to availability zone. It can be readily seen that the sam- control; these instances were also used in many of the exper- ples from each zone are assigned IP addresses from disjoint iments described in the rest of the paper. These included portions of the observed internal address space. For ex- 977 unique internal IPs and 611 unique Dom0 IPs associated ample, samples from Zone 3 were assigned addresses within with these instances. 10.252.0.0/16 and from discrete prefixes within 10.253.0.0/16. Using manual inspection of the resultant data, we derived If we make the assumption that internal IP addresses are a set of heuristics to label /24 prefixes with both availability statically assigned to physical machines (doing otherwise zone and instance type: would make IP routing far more difficult to implement), this data supports the assessment that availability zones use sep- • All IPs from a /16 are from the same availability zone. arate physical infrastructure. Indeed, none of the data gath- • A /24 inherits any included sampled instance type. 
If there ered in the rest of the paper’s described experiments have are multiple instances with distinct types, then we label the cast doubt on this conclusion. /24 with each distinct type (i.e., it is ambiguous). While it is perhaps not surprising that availability zones • A /24 containing a Dom0 IP address only contains Dom0 enjoy disjoint IP assignment, what about instance type and IP addresses. We associate to this /24 the type of the accounts? We launched 100 instances (20 of each type, 39 Dom0’s associated instance. hours after terminating the Account A instances) in Zone 3 from a second account, “Account B”. The bottom graph • All /24’s between two consecutive Dom0 /24’s inherit the in Figure 1 plots the Zone 3 instances from Account A and former’s associated type. Account B, here using distinct labels for instance type. Of The last heuristic, which enables us to label /24’s that have the 100 Account A Zone 3 instances, 92 had unique /24 no included instance, is derived from the observation that prefixes, while eight /24 prefixes each had two instances, Dom0 IPs are consistently assigned a prefix that immedi- though of the same type. Of the 100 Account B instances, ately precedes the instance IPs they are associated with. (For example, 10.250.8.0/24 contained Dom0 IPs associated
with m1.small instances in prefixes 10.254.9.0/24 and port) from another instance and inspecting the last hop. For 10.254.10.0/24.) There were 869 /24’s in the data, and ap- the second test, we noticed that round-trip times (RTTs) re- plying the heuristics resulted in assigning a unique zone and quired a “warm-up”: the first reported RTT in any sequence unique type to 723 of these; a unique zone and two types to of probes was almost always an order of magnitude slower 23 of these; and left 123 unlabeled. These last were due to than subsequent probes. Thus for this method we perform areas (such as the lower portion of 10.253.0.0/16) for which 10 probes and just discard the first. The third check makes we had no sampling data at all. use of the manner in which internal IP addresses appear to While the map might contain errors (for example, in areas be assigned by EC2. The same Dom0 IP will be shared by in- of low instance sample numbers), we have yet to encounter stances with a contiguous sequence of internal IP addresses. an instance that contradicts the /24 labeling and we used (Note that m1.small instances are reported by CPUID as the map for many of the future experiments. For instance, having two CPUs each with two cores and these EC2 in- we applied it to a subset of the public servers derived from stance types are limited to 50% core usage, implying that our survey, those that responded to wget requests with an one such machine could handle eight instances.) HTTP 200 or 206. The resulting 6 057 servers were used as Veracity of the co-residence checks. We verify the cor- stand-ins for targets in some of the experiments in Section 7. rectness of our network-based co-residence checks using as Figure 7 in the appendix graphs the result of mapping these ground truth the ability to send messages over a cross-VM servers. covert channel. That is, if two instances (under our control) Preventing cloud cartography. Providers likely have in- can successfully transmit via the covert channel then they centive to prevent cloud cartography for several reasons, be- are co-resident, otherwise not. If the checks above (which do yond the use we outline here (that of exploiting placement not require both instances to be under our control) have suf- vulnerabilities). Namely, they might wish to hide their in- ficiently low false positive rates relative to this check, then frastructure and the amount of use it is enjoying by cus- we can use them for inferring co-residence against arbitrary tomers. Several features of EC2 made cartography signif- victims. We utilized for this experiment a hard-disk-based icantly easier. Paramount is that local IP addresses are covert channel. At a very high level, the channel works as statically (at least over the observed period of time) as- follows. To send a one bit, the sender instance reads from sociated to availability zone and instance type. Changing random locations on a shared disk volume. To send a zero this would likely make administration tasks more challeng- bit, the sender does nothing. The receiver times reading ing (and costly) for providers. Also, using the map requires from a fixed location on the disk volume. Longer read times translating a victim instance’s external IP to an internal mean a 1 is being set, shorter read times give a 0. IP, and the provider might inhibit this by isolating each We performed the following experiment. Three EC2 ac- account’s view of the internal IP address space (e.g. via counts were utilized: a control, a victim, and a probe. 
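For concreteness, here is a minimal sketch of the hard-disk contention channel just described. The file paths, bit duration, block size, and decision threshold are illustrative choices rather than values from the paper, and practical details (bypassing the OS page cache, sampling repeatedly within a bit period, synchronizing sender and receiver) are omitted.

import os
import random
import time

BIT_SECONDS = 2.0            # illustrative bit period; must be tuned in practice
BLOCK = 64 * 1024
SPAN = 512 * 1024 * 1024     # region of a large scratch file to seek over

def send_bit(path, bit):
    """Sender VM: to transmit 1, hammer the disk with random reads; for 0, idle."""
    deadline = time.time() + BIT_SECONDS
    if not bit:
        time.sleep(BIT_SECONDS)
        return
    with open(path, "rb", buffering=0) as f:
        while time.time() < deadline:
            f.seek(random.randrange(0, SPAN, BLOCK))
            f.read(BLOCK)

def recv_bit(path, threshold):
    """Receiver VM: time a read from a fixed offset of its own file on the
    same physical drive; a slow read indicates contention, i.e. a 1."""
    with open(path, "rb", buffering=0) as f:
        start = time.perf_counter()
        f.seek(0)
        f.read(BLOCK)
    return 1 if (time.perf_counter() - start) > threshold else 0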
(The VLANs and bridging). Even so, this would only appear to “victim” and “probe” are arbitrary labels, since they were slow down our particular technique for locating an instance both under our control.) All instances launched were of in the LAN — one might instead use ping timing measure- type m1.small. Two instances were launched by the control ments or traceroutes (both discuss more in the next section) account in each of the three availability zones. Then 20 in- to help “triangulate” on a victim. stances on the victim account and 20 instances on the probe account were launched, all in Zone 3. We determined the Dom0 IPs of each instance. For each (ordered) pair (A, B) 6. DETERMINING CO-RESIDENCE of these 40 instances, if the Dom0 IPs passed (check 1) then Given a set of targets, the EC2 map from the previous we had A probe B and each control to determine packet section educates choice of instance launch parameters for RTTs and we also sent a 5-bit message from A to B over attempting to achieve placement on the same physical ma- the hard-drive covert channel. chine. Recall that we refer to instances that are running We performed three independent trials. These generated, on the same physical machine as being co-resident. In this in total, 31 pairs of instances for which the Dom0 IPs were section we describe several easy-to-implement co-residence equal. The internal IP addresses of each pair were within 7 checks. Looking ahead, our eventual check of choice will be of each other. Of the 31 (potentially) co-resident instance to compare instances’ Dom0 IP addresses. We confirm the pairs, 12 were ‘repeats’ (a pair from a later round had the accuracy of this (and other) co-residence checks by exploit- same Dom0 as a pair from an earlier round). ing a hard-disk-based covert channel between EC2 instances. The 31 pairs give 62 ordered pairs. The hard-drive covert Network-based co-residence checks. Using our expe- channel successfully sent a 5-bit message for 60 of these rience running instances while mapping EC2 and inspect- pairs. The last two failed due to a single bit error each, and ing data collected about them, we identify several poten- we point out that these two failures were not for the same tial methods for checking if two instances are co-resident. pair of instances (i.e. sending a message in the reverse direc- Namely, instances are likely co-resident if they have tion succeeded). The results of the RTT probes are shown (1) matching Dom0 IP address, in Figure 2. The median RTT for co-resident instances was significantly smaller than those to any of the controls. The (2) small packet round-trip times, or RTTs to the controls in the same availability zone as the (3) numerically close internal IP addresses (e.g. within 7). probe (Zone 3) and victim instances were also noticeably As mentioned, an instance’s network traffic’s first hop is the smaller than those to other zones. Dom0 privileged VM. An instance owner can determine its Discussion. From this experiment we conclude an effec- Dom0 IP from the first hop on any route out from the in- tive false positive rate of zero for the Dom0 IP co-residence stance. One can determine an uncontrolled instance’s Dom0 check. In the rest of the paper we will therefore utilize the IP by performing a TCP SYN traceroute to it (on some open
Count Median RTT (ms) or might even trigger launching of victims, making this at- Co-resident instance 62 0.242 tack scenario practical. Zone 1 Control A 62 1.164 Zone 1 Control B 62 1.027 Towards understanding placement. Before we describe Zone 2 Control A 61 1.113 these strategies, we first collect several observations we ini- Zone 2 Control B 62 1.187 tially made regarding Amazon’s (unknown) placement algo- Zone 3 Control A 62 0.550 rithms. Subsequent interactions with EC2 only reinforced Zone 3 Control B 62 0.436 these observations. A single account was never seen to have two instances Figure 2: Median round trip times in seconds for probes simultaneously running on the same physical machine, so sent during the 62 co-residence checks. (A probe to running n instances in parallel under a single account results Zone 2 Control A timed out.) in placement on n separate machines. No more than eight m1.small instances were ever observed to be simultaneously co-resident. (This lends more evidence to support our earlier following when checking for co-residence of an instance with estimate that each physical machine supports a maximum of a target instance we do not control. First one compares the eight m1.small instances.) While a machine is full (assigned internal IP addresses of the two instances, to see if they are its maximum number of instances) an attacker has no chance numerically close. (For m1.small instances close is within of being assigned to it. seven.) If this is the case, the instance performs a TCP We observed strong placement locality. Sequential place- SYN traceroute to an open port on the target, and sees if ment locality exists when two instances run sequentially (the there is only a single hop, that being the Dom0 IP. (This first terminated before launching the second) are often as- instantiates the Dom0 IP equivalence check.) Note that this signed to the same machine. Parallel placement locality check requires sending (at most) two TCP SYN packets and exists when two instances run (from distinct accounts) at is therefore very “quiet”. roughly the same time are often assigned to the same ma- Obfuscating co-residence. A cloud provider could likely chine. In our experience, launched instances exhibited both render the network-based co-residence checks we use moot. strong sequential and strong parallel locality. For example, a provider might have Dom0 not respond in Our experiences suggest a correlation between instance traceroutes, might randomly assign internal IP addresses at density, the number of instances assigned to a machine, and the time of instance launch, and/or might use virtual LANs a machine’s affinity for having a new instance assigned to to isolate accounts. If such precautions are taken, attack- it. In Appendix B we discuss an experiment that revealed a ers might need to turn to co-residence checks that do not bias in placement towards machines with fewer instances al- rely on network measurements. In Section 8.1 we show ex- ready assigned. This would make sense from an operational perimentally that side-channels can be utilized to establish viewpoint under the hypothesis that Amazon balances load co-residence in a way completely agnostic to network con- across running machines. figuration. Even so, inhibiting network-based co-residence We concentrate in the following on the m1.small instance checks would impede attackers to some degree, and so de- type. 
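Returning to the co-residence procedure at the end of Section 6, the sketch below implements the "quiet" check: compare internal IPs for numerical closeness, then confirm that a TTL-1 TCP SYN probe to the target is answered by our own Dom0. It shells out to hping3; the exact flags and the parsing of its output are approximations that may need adjusting, and the within-7 threshold applies to m1.small instances as stated above.

import ipaddress
import re
import subprocess

def ip_distance(a, b):
    return abs(int(ipaddress.ip_address(a)) - int(ipaddress.ip_address(b)))

def first_hop_via_syn_ttl1(target_ip, port=80):
    """Send one TCP SYN with TTL=1 and return the 10/8 address of the hop
    that answers; for EC2 instances this should be the target's Dom0."""
    out = subprocess.run(
        ["hping3", "-S", "-c", "1", "-t", "1", "-p", str(port), target_ip],
        capture_output=True, text=True, check=False)
    m = re.search(r"ip=(10\.\d+\.\d+\.\d+)", out.stdout + out.stderr)
    return m.group(1) if m else None

def likely_coresident(my_dom0_ip, my_internal_ip, target_internal_ip, port=80):
    """Internal IPs numerically close, and the target's first hop equals our Dom0."""
    if ip_distance(my_internal_ip, target_internal_ip) > 7:
        return False
    return first_hop_via_syn_ttl1(target_internal_ip, port) == my_dom0_ip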
However, we have also achieved active co-residence termining the most efficient means of obfuscating internal between two m1.large instances under our control, and have cloud infrastructure from adversaries is a good potential av- observed m1.large and c1.medium instances with co-resident enue for defense. commercial instances. Based on the reported (using CPUID) system configurations of the m1.xlarge and c1.xlarge in- stance types, we assume that these instances have machines 7. EXPLOITING PLACEMENT IN EC2 to themselves, and indeed we never observed co-residence of Consider an adversary wishing to attack one or more EC2 multiple such instances. instances. Can the attacker arrange for an instance to be placed on the same physical machine as (one of) these vic- 7.1 Brute-forcing placement tims? In this section we assess the feasibility of achieving We start by assessing an obvious attack strategy: run nu- co-residence with such target victims, saying the attacker is merous instances over a (relatively) long period of time and successful if he or she achieves good coverage (co-residence see how many targets one can achieve co-residence with. with a notable fraction of the target set). We offer two adver- While such a brute-force strategy does nothing clever (once sarial strategies that make crucial use of the map developed the results of the previous sections are in place), our hypoth- in Section 5 and the cheap co-residence checks we introduced esis is that for large target sets this strategy will already in Section 6. The brute-force strategy has an attacker sim- allow reasonable success rates. ply launch many instances over a relatively long period of The strategy works as follows. The attacker enumerates time. Such a naive strategy already achieves reasonable suc- a set of potential target victims. The adversary then infers cess rates (though for relatively large target sets). A more which of these targets belong to a particular availability zone refined strategy has the attacker target recently-launched and are of a particular instance type using the map from instances. This takes advantage of the tendency for EC2 Section 5. Then, over some (relatively long) period of time to assign fresh instances to the same small set of machines. the adversary repeatedly runs probe instances in the target Our experiments show that this feature (combined with the zone and of the target type. Each probe instance checks if ability to map EC2 and perform co-residence checks) repre- it is co-resident with any of the targets. If not the instance sents an exploitable placement vulnerability: measurements is quickly terminated. show that the strategy achieves co-residence with a specific We experimentally gauged this strategy’s potential effi- (m1.small) instance almost half the time. As we discuss be- cacy. We utilized as “victims” the subset of public EC2- low, an attacker can infer when a victim instance is launched based web servers surveyed in Section 5 that responded with
HTTP 200 or 206 to a wget request on port 80. (This re- running probe instances temporally near the launch of a striction is arbitrary. It only makes the task harder since victim allows the attacker to effectively take advantage of the it cut down on the number of potential victims.) This left parallel placement locality exhibited by the EC2 placement 6 577 servers. We targeted Zone 3 and m1.small instances algorithms. and used our cloud map to infer which of the servers match But why would we expect that an attacker can launch this zone/type. This left 1 686 servers. (The choice of zone instances soon after a particular target victim is launched? was arbitrary. The choice of instance type was due to the Here the dynamic nature of cloud computing plays well into fact that m1.small instances enjoy the greatest use.) We the hands of creative adversaries. Recall that one of the collected data from numerous m1.small probe instances we main features of cloud computing is to only run servers launched in Zone 3. (These instances were also used in the when needed. This suggests that servers are often run on in- course of our other experiments.) The probes were instru- stances, terminated when not needed, and later run again. mented to perform the cheap co-residence check procedure So for example, an attacker can monitor a server’s state described at the end of Section 6 for all of the targets. For (e.g., via network probing), wait until the instance disap- any co-resident target, the probe performed a wget on port pears, and then if it reappears as a new instance, engage 80 (to ensure the target was still serving web pages). The in instance flooding. Even more interestingly, an attacker wget scan of the EC2 servers was conducted on October 21, might be able to actively trigger new victim instances due 2008, and the probes we analyzed were launched over the to the use of auto scaling systems. These automatically grow course of 18 days, starting on October 23, 2008. The time the number of instances used by a service to meet increases between individual probe launches varied, and most were in demand. (Examples include scalr [30] and RightGrid [28]. launched in sets of 20. See also [6].) We believe clever adversaries can find many We analyzed 1 785 such probe instances. These probes other practical realizations of this attack scenario. had 78 unique Dom0 IPs. (Thus, they landed on 78 different The rest of this section is devoted to quantifying several physical machines.) Of the 1 686 target victims, the probes aspects of this attack strategy. We assess typical success achieved co-residency with 141 victim servers. Thus the rates, whether the availability zone, attacking account, or “attack” achieved 8.4% coverage of the target set. the time of day has some bearing on success, and the effect Discussion. We point out that the reported numbers are of increased time lag between victim and attacker launches. conservative in several ways, representing only a lower bound Looking ahead, 40% of the time the attacker (launching just on the true success rate. We only report co-residence if the 20 probes) achieves co-residence against a specific target in- server is still serving web pages, even if the server was ac- stance; zone, account, and time of day do not meaningfully tually still running. 
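Both the brute-force strategy above and the placement-locality strategy of Section 7.2 reduce to the same probe loop: launch a batch of instances with the target's zone and type, keep any that pass a co-residence check, and terminate the rest. The sketch below uses the boto EC2 interface; the AMI ID and region are placeholders, and the co-residence test is assumed to be run from on board each probe (for example over SSH) using the checks of Section 6.

import time
import boto.ec2

def flood_once(conn, ami_id, zone, coresident, batch=20):
    """Launch `batch` m1.small probes in the target's zone, keep any that the
    co-residence check flags, terminate the rest.  coresident(instance) is an
    assumed callback that runs the internal-IP / Dom0 comparison of Section 6
    from the probe itself."""
    reservation = conn.run_instances(
        ami_id, min_count=batch, max_count=batch,
        instance_type="m1.small", placement=zone)
    winners = []
    for inst in reservation.instances:
        while inst.update() == "pending":
            time.sleep(5)
        if inst.state == "running" and coresident(inst):
            winners.append(inst)
        else:
            inst.terminate()
    return winners

# Example usage (placeholder AMI and region; credentials from the environment):
# conn = boto.ec2.connect_to_region("us-east-1")
# probes = flood_once(conn, "ami-xxxxxxxx", "us-east-1a", my_coresidence_check)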
The gap in time between our survey of impact success; and even if the adversary launches its in- the public EC2 servers and the launching of probes means stances two days after the victims’ launch it still enjoys the that new web servers or ones that changed IPs (i.e. by being same rate of success. taken down and then relaunched) were not detected, even In the following we will often use instances run by one of when we in fact achieved co-residence with them. We could our own accounts as proxies for victims. However we will have corrected some sources of false negatives by actively also discuss achieving co-residence with recently launched performing more internal port scans, but we limited our- commercial servers. Unless otherwise noted, we use m1.small selves to probing ports we knew to already be serving public instances. Co-residence checks were performed via compar- web pages (as per the discussion in Section 4). ison of Dom0 IPs. Our results suggest that even a very naive attack strategy The effects of zone, account, and time of day. We can successfully achieve co-residence against a not-so-small start with finding a base-line for success rates when run- fraction of targets. Of course, we considered here a large ning probe instances soon (on the order of 5 minutes) af- target set, and so we did not provide evidence of efficacy ter victims. The first experiment worked as follows, and against an individual instance or a small sets of targets. We was repeated for each availability zone. A victim account observed very strong sequential locality in the data, which launched either 1, 10, or 20 instances simultaneously. No hinders the effectiveness of the attack. In particular, the sooner than five minutes later, a separate attacker account growth in target set coverage as a function of number of requested launch of 20 instances simultaneously. The num- launched probes levels off quickly. (For example, in the data ber of collisions (attacker instances co-resident with a victim above, the first 510 launched probes had already achieved instance) are reported in the left table of Figure 3. As can co-residence with 90% of the eventual 141 victims covered.) be seen, collisions are quickly found for large percentages of This suggests that fuller coverage of the target set could victim instances. The availability zone used does not mean- require many more probes. ingfully affect co-residence rates. We now focus on a single availability zone, Zone 1, for 7.2 Abusing Placement Locality the next experiment. We repeated, at three different time We would like to find attack strategies that do better than periods over the course of a day, the following steps: A sin- brute-force for individual targets or small target sets. Here gle victim instance was launched. No more than 5 minutes we discuss an alternate adversarial strategy. We assume later 20 probe instances were launched by another account, that an attacker can launch instances relatively soon after and co-residence checks were performed. This process was the launch of a target victim. The attacker then engages repeated 10 times (with at least 5 minutes in between con- in instance flooding: running as many instances in parallel clusion of one iteration and beginning of the next). Each as possible (or as many as he or she is willing to pay for) iteration used a fresh victim; odd iterations used one ac- in the appropriate availability zone and of the appropriate count and even iterations used another. The right table in type. 
While an individual account is limited to 20 instances, Figure 3 displays the results. The results show a likelihood it is trivial to gain access to more accounts. As we show, of achieving co-residence as 40% — slightly less than half the
# victims v # probes p coverage 1 20 1/1 Account Zone 1 10 20 5/10 Trial A B Total 20 20 7/20 Midday 2/5 2/5 4/10 1 20 0/1 (11:13 – 14:22 PST) Zone 2 10 18 3/10 Afternoon 1/5 3/5 4/10 20 19 8/20 (14:12 – 17:19 PST) 1 20 1/1 Night 2/5 2/5 4/10 Zone 3 10 20 2/10 (23:18 – 2:11 PST) 20 20 8/20 Figure 3: (Left) Results of launching p probes 5 minutes after the launch of v victims. The rightmost column specifies success coverage: the number of victims for which a probe instance was co-resident over the total number of victims. (Right) The number of victims for which a probe achieved co-residence for three separate runs of 10 repetitions of launching 1 victim instance and, 5 minutes later, 20 probe instances. Odd-numbered repetitions used Account A; even-numbered repetitions used Account B. time a recently launched victim is quickly and easily “found” 38 instances (using two accounts). In the second round, we in the cloud. Moreover, neither the account used for the vic- achieved a three-way co-residency: an instance from each tims nor the portion of the day during which the experiment of our accounts was placed on the same machine as the was conducted significantly affected the rate of success. RightScale server. The effect of increased time lag. Here we show that rPath is another company that offers ready-to-run Inter- the window of opportunity an attacker has for launching net appliances powered by EC2 instances [29]. As with instances is quite large. We performed the following exper- RightScale, they currently offer free demonstrations, launch- iment. Forty victim instances (across two accounts) were ing on demand a fresh EC2 instance to host systems such initially launched in Zone 3 and continued running through- as Sugar CRM, described as a “customer relationship man- out the experiment. These were placed on 36 unique ma- agement system for your small business or enterprise” [29]. chines (8 victims were co-resident with another victim). Ev- We were able to successfully establish a co-resident instance ery hour a set of 20 attack instances (from a third account) against an rPath demonstration box using 40 instances. Sub- were launched in the same zone and co-residence checks were sequent attempts with fresh rPath instances on a second oc- performed. These instances were terminated immediately casion proved less fruitful; we failed to achieve co-residence after completion of the checks. Figure 4 contains a graph even after several rounds of flooding. We believe that the showing the success rate of each attack round, which stays target in this case was placed on a full system and was there- essentially the same over the course of the whole experiment. fore unassailable. (No probes were reported upon for the hours 34–43 due to Discussion. We have seen that attackers can frequently our scripts not gracefully handling some kinds of EC2-caused achieve co-residence with specific targets. Why did the strat- launch failures, but nevertheless reveals useful information: egy fail when it did? We hypothesize that instance flooding the obvious trends were maintained regardless of continuous failed when targets were being assigned to machines with probing or not.) Ultimately, co-residence with 24 of the 36 high instance density (discussed further in Appendix B) or machines running victim instances was established. Addi- even that became full. 
While we would like to use network tionally, probes were placed on all four machines which had probing to better understand this effect, this would require two victim instances, thus giving three-way collisions. port scanning IP addresses near that of targets, which would The right graph in Figure 4 shows the cumulative num- perhaps violate (the spirit of) Amazon’s AUP. ber of unique Dom0 IP addresses seen by the probes over the course of the experiment. This shows that the growth 7.3 Patching placement vulnerabilities in the number of machines probes were placed on levels off The EC2 placement algorithms allow attackers to use rel- rapidly — quantitative evidence of sequential placement lo- atively simple strategies to achieve co-residence with victims cality. (that are not on fully-allocated machines). As discussed ear- On targeting commercial instances. We briefly exper- lier, inhibiting cartography or co-residence checking (which imented with targeted instance flooding against instances would make exploiting placement more difficult) would seem run by other user’s accounts. RightScale is a company that insufficient to stop a dedicated attacker. On the other hand, offers “platform and consulting services that enable compa- there is a straightforward way to “patch” all placement vul- nies to create scalable web solutions running on Amazon nerabilities: offload choice to users. Namely, let users re- Web Services” [28]. Presently, they provide a free demon- quest placement of their VMs on machines that can only be stration of their services, complete with the ability to launch populated by VMs from their (or other trusted) accounts. a custom EC2 instance. On two separate occasions, we setup In exchange, the users can pay the opportunity cost of leav- distinct accounts with RightScale and used their web inter- ing some of these machines under-utilized. In an optimal face to launch one of their Internet appliances (on EC2). assignment policy (for any particular instance type), this We then applied our attack strategy (mapping the fresh in- additional overhead should never need to exceed the cost of stance and then flooding). On the first occasion we sequen- a single physical machine. tially launched two rounds of 20 instances (using a single account) before achieving co-residence with the RightScale instance. On the second occasion, we launched two rounds of
Unique Dom0 assignments 16 Total co-resident 60 Number of instances 14 New co-resident 55 12 50 10 45 8 40 6 35 4 30 2 25 0 20 0 10 20 30 40 50 0 10 20 30 40 50 Hours since victims launched Hours since victims launched Figure 4: Results for the experiment measuring the effects of increasing time lag between victim launch and probe launch. Probe instances were not run for the hours 34–43. (Left) “Total co-resident” corresponds to the number of probe instances at the indicated hour offset that were co-resident with at least one of the victims. “New co-resident” is the number of victim instances that were collided with for the first time at the indicated hour offset. (Right) The cumulative number of unique Dom0 IP addresses assigned to attack instances for each round of flooding. 8. CROSS-VM INFORMATION LEAKAGE mented and measured simple covert channels (in which two The previous sections have established that an attacker instances cooperate to send a message via shared resource) can often place his or her instance on the same physical using memory bus contention, obtaining a 0.006bps channel machine as a target instance. In this section, we show the between co-resident large instances, and using hard disk con- ability of a malicious instance to utilize side channels to tention, obtaining a 0.0005bps channel between co-resident learn information about co-resident instances. Namely we m1.small instances. In both cases no attempts were made show that (time-shared) caches allow an attacker to measure at optimizing the bandwidth of the covert channel. (The when other instances are experiencing computational load. hard disk contention channel was used in Section 6 for estab- Leaking such information might seem innocuous, but in fact lishing co-residence of instances.) Covert channels provide it can already be quite useful to clever attackers. We intro- evidence that exploitable side channels may exist. duce several novel applications of this side channel: robust Though this is not our focus, we further observe that the co-residence detection (agnostic to network configuration), same resources can also be used to mount cross-VM per- surreptitious detection of the rate of web traffic a co-resident formance degradation and denial-of-service attacks, analo- site receives, and even timing keystrokes by an honest user gously to those demonstrated for non-virtualized multipro- (via SSH) of a co-resident instance. We have experimentally cessing [12, 13, 21]. investigated the first two on running EC2 instances. For the keystroke timing attack, we performed experiments on an 8.1 Measuring cache usage EC2-like virtualized environment. An attacking instance can measure the utilization of CPU caches on its physical machine. These measurements can On stealing cryptographic keys. There has been a long be used to estimate the current load of the machine; a high line of work (e.g., [10, 22, 26]) on extracting cryptographic load indicates activity on co-resident instances. Here we secrets via cache-based side channels. Such attacks, in the describe how to measure cache utilization in EC2 instances context of third-party compute clouds, would be incredibly by adapting the Prime+Probe technique [22, 32]. We also damaging — and since the same hardware channels exist, are demonstrate exploiting such cache measurements as a covert fundamentally just as feasible. In practice, cryptographic channel. 
cross-VM attacks turn out to be somewhat more difficult to realize due to factors such as core migration, coarser schedul- Load measurement. We utilize the Prime+Probe tech- ing algorithms, double indirection of memory addresses, and nique [22, 32] to measure cache activity, and extend it to (in the case of EC2) unknown load from other instances the following Prime+Trigger+Probe measurement to sup- and a fortuitous choice of CPU configuration (e.g. no hy- port the setting of time-shared virtual machines (as present perthreading). The side channel attacks we report on in on Amazon EC2). The probing instance first allocates a con- the rest of this section are more coarse-grained than those tiguous buffer B of b bytes. Here b should be large enough required to extract cryptographic keys. While this means that a significant portion of the cache is filled by B. Let s the attacks extract less bits of information, it also means be the cache line size, in bytes. Then the probing instance they are more robust and potentially simpler to implement performs the following steps to generate each load sample: in noisy environments such as EC2. (1) Prime: Read B at s-byte offsets in order to ensure it is Other channels; denial of service. Not just the data cached. cache but any physical machine resources multiplexed be- (2) Trigger: Busy-loop until the CPU’s cycle counter jumps tween the attacker and target forms a potentially useful by a large value. (This means our VM was preempted by channel: network access, CPU branch predictors and in- the Xen scheduler, hopefully in favor of the sender VM.) struction cache [1, 2, 3, 12], DRAM memory bus [21], CPU (3) Probe: Measure the time it takes to again read B at pipelines (e.g., floating-point units) [4], scheduling of CPU s-byte offsets. cores and timeslices, disk access [16], etc. We have imple- When reading the b/s memory locations in B, we use a pseudorandom order, and the pointer-chasing technique de-
scribed in [32], to prevent the CPU’s hardware prefetcher (2) Sleep briefly (to build up credit with Xen’s scheduler). from hiding the access latencies. The time of the final step’s (3) Prime: Read all of B to make sure it’s fully cached. read is the load sample, measured in number of CPU cycles. (4) Trigger: Busy-loop until the CPU’s cycle counter jumps These load samples will be strongly correlated with use of by a large value. (This means our VM was preempted by the cache during the trigger step, since that usage will evict the Xen scheduler, hopefully in favor of the sender VM.) some portion of the buffer B and thereby drive up the read (5) Probe: Measure the time it takes to read all even ad- time during the probe phase. In the next few sections we dresses in B, likewise for the odd addresses. Decide “0” describe several applications of this load measurement side iff the difference is positive. channel. First we describe how to modify it to form a robust On EC2 we need to deal with the noise induced by the fact covert channel. that each VM’s virtual CPU is occasionally migrated be- Cache-based covert channel. Cache load measurements tween the (m1.small) machine’s four cores. This also leads create very effective covert channels between cooperating to sometimes capturing noise generated by VMs other than processes running in different VMs. In practice, this is not the target (sender). Due to the noise-cancelling property of a major threat for current deployments since in most cases differential encoding, we can use a straightforward strategy: the cooperating processes can simply talk to each other over the receiver takes the average of multiple samples for making a network. However, covert channels become significant his decision, and also reverts to the prime stage whenever it when communication is (supposedly) forbidden by informa- detects that Xen scheduled it to a different core during the tion flow control (IFC) mechanisms such as sandboxing and trigger or probe stages. This simple solution already yields IFC kernels [34, 18, 19]. The latter are a promising emerg- a bandwidth of approximately 0.2bps, running on EC2. ing approach to improving security (e.g., web-server func- tionality [18]), and our results highlight a caveat to their 8.2 Load-based co-residence detection effectiveness. Here we positively answer the following question: can one In the simplest cache covert-channel attack [15], the sender test co-residence without relying on the network-based tech- idles to transmit “0” and frantically accesses memory to niques of Section 6? We show this is indeed possible, given transmit “1”. The receiver accesses a memory block of his some knowledge of computational load variation on the tar- own and observes the access latencies. High latencies are in- get instance. This condition holds when an adversary can dicative that the sender is evicting the receiver’s data from actively cause load variation due to a publicly-accessible ser- the caches, i.e., that “1” is transmitted. This attack is ap- vice running on the target. It might also hold in cases where plicable across VMs, though it tends to be unreliable (and an adversary has a priori information about load variation thus has very low bandwidth) in a noisy setting. on the target and this load variation is (relatively) unique We have created a much more reliable and efficient cross- to the target. VM covert channel by using finer-grained measurements. 
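The following is a structural sketch of the Prime+Trigger+Probe load measurement described above. Pure Python cannot time individual cache lines or read the cycle counter directly, so this illustrates the control flow only; a faithful implementation would be C with rdtsc and the pointer-chasing trick of [32]. The buffer size and stride follow the values reported in Section 8.2 (b = 768*1024, s = 128); the preemption-gap threshold is an illustrative choice.

import random
import time

B_SIZE = 768 * 1024     # b: large enough to cover much of the attacked cache
LINE = 128              # s: stride used for the EC2 measurements in the paper

def prime_trigger_probe(buf, gap_ns=50_000):
    """Return one load sample: the time to re-read the buffer after yielding
    the CPU, which rises when a co-resident VM evicts parts of it."""
    offsets = list(range(0, len(buf), LINE))
    total = 0
    # Prime: touch every line-sized offset so the buffer is cached.
    for off in offsets:
        total += buf[off]
    # Trigger: busy-loop until the observed clock jumps, i.e. this VM was
    # preempted by the Xen scheduler (hopefully in favor of another VM).
    prev = time.perf_counter_ns()
    for _ in range(50_000_000):          # safety bound on the spin
        now = time.perf_counter_ns()
        if now - prev > gap_ns:
            break
        prev = now
    # Probe: time re-reading the buffer in pseudorandom order; evictions
    # caused by co-resident activity drive this number up.
    random.shuffle(offsets)
    start = time.perf_counter_ns()
    for off in offsets:
        total += buf[off]
    return time.perf_counter_ns() - start

if __name__ == "__main__":
    buf = bytearray(B_SIZE)
    print([prime_trigger_probe(buf) for _ in range(10)])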
Consider target instances for which we can induce compu- We adapted the Prime+Trigger+Probe cache measurement tational load — for example, an instance running a (public) technique as follows. Recall that in a set-associative cache, web server. In this case, an attacker instance can check for the pool of cache lines is partitioned into associativity sets, co-residence with a target instance by observing differences such that each memory address is mapped into a specific in load samples taken when externally inducing load on the associativity set determined by certain bits in the address target versus when not. We experimentally verified the effi- (for brevity, we ignore here details of virtual versus physi- cacy of this approach on EC2 m1.small instances. The target cal addresses). Our attack partitions the cache sets into two instance ran Fedora Core 4 with Apache 2.0. A single 1 024- classes, “odd sets” and “even sets”, and manipulates the load byte text-only HTML page was made publicly accessible. across each class. For resilience against noise, we use differ- Then the co-residence check worked as follows. First, the ential coding where the signal is carried in the difference attacker VM took 100 load samples. (We set b = 768 ∗ 1024 between the load on the two classes. Noise will typically and s = 128. Taking 100 load samples took about 12 sec- be balanced between the two classes, and thus preserve the onds.) We then paused for ten seconds. Then we took 100 signal. (This argument can be made rigorous by using a further load samples while simultaneously making numer- random-number generator for the choice of classes, but the ous HTTP get requests from a third system to the target following simpler protocol works well in practice.) via jmeter 2.3.4 (a utility for load testing HTTP servers). The protocol has three parameters: a which is larger We set jmeter to simulate 100 users (100 separate threads). than the attacked cache level (e.g., a = 221 to attack the Each user made HTTP get requests as fast as possible. EC2’s Opteron L2 cache), b which is slightly smaller than The results of three trials with three pairs of m1.small the attacked cache level (here, b = 219 ), and d which is instances are plotted in Figure 5. In the first trial we used the cache line size times a power of 2. Define even ad- two instances known to be co-resident (via network-based dresses (resp. odd addresses) as those that are equal to co-residence checks). One can see the difference between 0 mod 2d (resp. d mod 2d). Define the class of even cache the load samples when performing HTTP gets and when sets (resp. odd cache sets) as those cache sets to which even not. In the second trial we used a fresh pair of instances (resp. odd) addresses are mapped. co-resident on a different machine, and again one can easily The sender allocates a contiguous buffer A of a bytes. To see the effect of the HTTP gets on the load samples. In the transmit “0” (resp. 1) he reads the even (resp. odd) addresses third trial, we used two instances that were not co-resident. in A. This ensures that the one class of cache sets is fully Here the load sample timings are, as expected, very similar. evicted from the cache, while the other is mostly untouched. 
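A sketch of this load-based co-residence test follows. The threading/urllib load generator is a crude stand-in for jmeter, sampler() is assumed to be a load-measurement routine such as the Prime+Trigger+Probe sketch above, and comparing medians is an illustrative simplification of inspecting the two sample distributions as in Figure 5.

import statistics
import threading
import time
import urllib.request

def http_load(url, stop, threads=100):
    """Stand-in for jmeter: `threads` workers issuing GET requests against
    the target's public page as fast as possible until `stop` is set."""
    def worker():
        while not stop.is_set():
            try:
                urllib.request.urlopen(url, timeout=2).read()
            except OSError:
                pass
    for _ in range(threads):
        threading.Thread(target=worker, daemon=True).start()

def load_based_coresidence_check(sampler, target_url, n=100):
    """Take n load samples with the target quiescent, then n more while
    inducing HTTP load on it, and return the two medians for comparison."""
    quiet = [sampler() for _ in range(n)]      # target left alone
    time.sleep(10)
    stop = threading.Event()
    http_load(target_url, stop)                # now induce load on the target
    busy = [sampler() for _ in range(n)]
    stop.set()
    # A co-resident target shows up as markedly slower probe reads in `busy`;
    # for a non-co-resident target the two medians look alike.
    return statistics.median(quiet), statistics.median(busy)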
We emphasize that these measurements were performed on The receiver defines the difference by the following mea- live EC2 instances, without any knowledge of what other surement procedure: instances may (or may not) have been running on the same (1) Allocate a contiguous buffer B of b bytes machines. Indeed, the several spikes present in Trial 2’s