Privacy and Accountability for Location-based Aggregate Statistics
Raluca Ada Popa, MIT, ralucap@mit.edu
Andrew J. Blumberg, University of Texas, Austin, blumberg@math.utexas.edu
Hari Balakrishnan, MIT, hari@mit.edu
Frank H. Li, MIT, frankli@mit.edu

ABSTRACT

A significant and growing class of location-based mobile applications aggregate position data from individual devices at a server and compute aggregate statistics over these position streams. Because these devices can be linked to the movement of individuals, there is significant danger that the aggregate computation will violate the location privacy of individuals. This paper develops and evaluates PrivStats, a system for computing aggregate statistics over location data that simultaneously achieves two properties: first, provable guarantees on location privacy even in the face of any side information about users known to the server, and second, privacy-preserving accountability (i.e., protection against abusive clients uploading large amounts of spurious data). PrivStats achieves these properties using a new protocol for uploading and aggregating data anonymously as well as an efficient zero-knowledge proof of knowledge protocol we developed from scratch for accountability. We implemented our system on Nexus One smartphones and commodity servers. Our experimental results demonstrate that PrivStats is a practical system: computing a common aggregate (e.g., count) over the data of 10,000 clients takes less than 0.46 s at the server, and the protocol has modest latency (0.6 s) to upload data from a Nexus phone. We also validated our protocols on real driver traces from the CarTel project.

Categories and subject descriptors: C.2.0 [Computer Communication Networks]: General–Security and protection
General terms: Security

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CCS'11, October 17–21, 2011, Chicago, Illinois, USA. Copyright 2011 ACM 978-1-4503-0948-6/11/10 ...$10.00.

1. INTRODUCTION

The emergence of location-based mobile services and the interest in using them in road transportation, participatory sensing [28, 21], and various social mobile crowdsourcing applications has led to a fertile area of research and commercial activity. At the core of many of these applications are mobile nodes (smartphones, in-car devices, etc.) equipped with GPS or other position sensors, which (periodically) upload time and location coordinates to a server. This information is then processed by the server to compute a variety of aggregate statistics.

As primary motivation, consider applications that process streams of GPS position and/or speed samples along vehicle trajectories, sourced from smartphones, to determine current traffic statistics such as average speed, average delay on a road segment, or congestion at an intersection. Several research projects (e.g., CarTel [21], Mobile Millennium [27]) and commercial products (e.g., TomTom [34]) provide such services. Another motivation is social mobile crowdsourcing, for example, estimating the number of people at a restaurant to determine availability by aggregating data from position samples provided by smartphones carried by participating users.

A significant concern about existing implementations of these services is the violation of user location privacy. Even though the service only needs to compute an aggregate (such as a mean, standard deviation, density of users, etc.), most implementations simply continuously record time-location pairs of all the clients and deliver them to the server, labeled by which client they belong to [21, 27, 22]. In such a system, the server can piece together all the locations and times belonging to a particular client and obtain the client's path, violating her location privacy.

Location privacy concerns are important to address because many users perceive them to be significant (and may refuse to use or even oppose a service) and because they may threaten personal security. Recently (as of the time of writing this paper), two users have sued Google [26] over location data that Android phones collect, citing as one of the concerns a "serious risk of privacy invasions, including stalking." The lawsuit attempts to prevent Google from selling phones with software that can track user location. Just a week before, two users sued Apple [25] for violating privacy laws by keeping a log of user locations without offering users a way to disable this tracking or delete the log. Users of TomTom, a satellite navigation system, have expressed concern over the fact that TomTom logs user paths and sells aggregate statistics (such as speeding hotspots) to the police, who in turn install speed cameras [34]. A study by Riley [35] shows even wider location privacy fears: a significant number of drivers in the San Francisco Bay Area will not install toll transponders in their cars because of privacy concerns. Moreover, online databases are routinely broken into or abused by insiders with access; if that happens, detailed records of user mobility may become known to criminals, who can then attempt security attacks, such as burglarizing homes when they know that their residents are away [40].

In this paper, we design, implement, and evaluate PrivStats, a practical system for computing aggregate statistics in the context of mobile, location-based applications that achieves both strong guarantees of location privacy and protection against cheating clients. PrivStats solves two major problems: it provides formal location privacy guarantees against any side information (SI) attacks, and it provides client accountability without a trusted party. Because of these contributions, in comparison to previous systems [16], CliqueCloak [14], [19], [24], [20], and Triplines [18], PrivStats provides the strongest formal privacy and correctness guarantees while making the weakest trust assumptions.

Side information refers to any out-of-band information the server may have, which, when used together with the data the server gets in a system, can help the server compromise user location privacy. Some previous work ensures that clients upload data free of identifiers (they upload approximate time, location, and data sample), but even when all data is anonymized, a considerable amount of private location information can still leak due to side information. As a simple example, if the server knows that Alice is the only person living on a street, it can infer that Alice just left or arrived at her house when receiving some speed updates from that street. SI can come in many forms and has been shown to leak as much as full client paths in some cases: Krumm [24], as well as Gruteser and Hoh [17], show that one can infer driver paths from anonymized data and then link them to driver identities using side information such as knowledge of the map, typical driving behavior patterns, timing between uploads, and a public web service with addresses and names. Despite SI's ability to compromise considerable location privacy, previous systems for mobile-system aggregates either do not address the problem at all, or they only treat specific cases of it, such as areas of low density (see §10).

Our first contribution is to provide an aggregate statistics protocol with strong location privacy guarantees, including protection against any general side information attack. We formalize our guarantees by providing a definition of location privacy, termed strict location privacy or SLP (Def. 2), which specifies that the server learns nothing further about the location of clients in the face of arbitrary side information other than the desired aggregate result; then, in §5, we provide a protocol, also termed SLP, that provably achieves our definition.

While we manage to hide most sources of privacy leakage in our SLP protocol without any trust assumptions, hiding the number of tuples generated by clients information-theoretically can be reduced, under reasonable model assumptions, to a problem in distributed algorithms that can be shown to be impossible to solve. The reason is that it requires synchronization among a set of clients that do not know of each other. Our solution is to use a lightweight and restricted module, the smoothing module (SM), that helps clients synchronize with respect to the number of tuples uploaded and performs one decryption per aggregate. We distribute the SM on the clients (each "client SM" is responsible for handling a few aggregates) so as to ensure that at most a small fraction of SMs misbehave. We provide our strong SLP guarantees for all aggregates with honest SMs, while still ensuring a high level of protection for compromised SMs:
- A compromised SM cannot change aggregate results undetectably and cannot collude with clients to allow them to corrupt aggregates.
- Although a malicious SM opens the doors to potential SI leakage from the number of uploads and the values uploaded, a significant degree of privacy remains because client uploads are always anonymized, free of timing and origin information. Our SM is distributed on the clients, resulting in few compromised client SMs in practice.
- Our design of the SM (§5) facilitates distribution and ensuring its correctness: the SM has light load, ≈ 50 times less load than the server, performs two simple tasks summing up to only 62 lines of code (excluding libraries) which can be easily scrutinized for correctness, and only uses constant storage per aggregate.
- In contrast to our limited SM, trusted parties in previous work [18, 14, 20, 16] had the ability to fully compromise client paths and modify aggregate results when corrupted; at the same time, their use still did not provide general SI attack protection.

Since clients are anonymous, the problem of accountability becomes serious: malicious clients can significantly affect the correctness of the aggregate results. Hoh et al. [19] present a variety of attacks clients may use to change the aggregate result. For example, clients might try to divert traffic away from a road to reduce a particular driver's travel time or to keep low traffic in front of one's residence, or divert traffic toward a particular roadway to increase revenue at a particular store. A particular challenge is that accountability and privacy goals are in tension: if each client upload is anonymous, it seems that the client could upload as many times as she wishes. Indeed, most previous systems [16, 14, 24, 20] do not provide any protection against such client biasing attacks, and the few that provide such verification ([18, 19]) place this task on heavily loaded trusted parties that, when compromised, can release full location paths with user identities and can corrupt aggregate results.

Our second contribution is a privacy-preserving accountability protocol without any trusted parties (§6). We provide a cryptographic protocol for ensuring that each client can upload at most a fixed quota of values for each aggregate. At the core of the protocol is an efficient zero-knowledge proof of knowledge (ZKPoK) that we designed from scratch for this protocol. The zero-knowledge property is key in maintaining the anonymity of clients. Therefore, no trusted party is needed; in particular, the SM is not involved in this protocol.

Finally, our third contribution is an implementation of the overall system on Nexus One smartphones and commodity servers (§9). One of our main goals was to design a practical system, so we strived to keep our cryptographic protocols practical, including designing our ZKPoK from scratch. Computing a common aggregate (e.g., count) over the data of 10^4 clients takes less than 0.46 s at the server, the protocol has modest latency of about 0.6 s to upload data from a Nexus phone, and the throughput is linearly scalable in the number of processing cores. We believe these measurements indicate that PrivStats is practical. We also validated that our protocols introduce little or no error in statistics computation using real driver traces from the CarTel project [21].

2. MODEL

In this section, we discuss the model for PrivStats. Our setup captures a wide range of aggregate statistics problems for mobile systems.

2.1 Setting

In our model, we have many clients, a server, and a smoothing module (SM). Clients are mobile nodes equipped with smartphones: drivers in a vehicular network or peers in a social crowdsourcing application. The server is an entity interested in computing certain aggregate statistics over data from clients. The SM is a third party involved in the protocol that is distributed on clients.

We assume that clients communicate with the server and the SM using an anonymization network (e.g., Tor), a proxy forwarder, or another anonymizing protocol to ensure that privacy is not compromised by the underlying networking protocol. In this paper, we assume these systems succeed in hiding the origin of a packet from an adversary; it is out of our scope to ensure this. We assume that these systems also hide network travel time; otherwise clients must introduce an appropriate delay so that all communications have roughly the same round trip times. Clients can choose which aggregates they want to participate in; they can opt out if the result is deemed too revealing.

A sample is a client's contribution to an aggregate, such as average speed, delay on a road, or a bit indicating presence in a certain place; samples are generated at sample points in vehicular systems, or at certain location/times, as is the case for social crowdsourcing applications.

Figure 1: Example run of PrivStats: clients generate data at a sample point, contact the SM to learn how many tuples to upload, and upload encrypted tuples to the server.
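To make the model concrete, here is a small sketch of the kind of aggregate statistics the server wants over ⟨id, sample⟩ tuples. This is a hypothetical example of ours, not from the paper: the sample-point ids and speed values are made up, and in PrivStats the uploads would arrive anonymized and encrypted rather than in the clear.

```python
from collections import defaultdict

# Hypothetical raw uploads: (sample-point id, speed sample).
# In PrivStats the server never sees these in the clear; this only
# illustrates the aggregates of interest (count and average per id).
uploads = [(14, 30.0), (14, 45.0), (7, 60.0), (14, 60.0)]

sums = defaultdict(float)
counts = defaultdict(int)
for sid, sample in uploads:
    sums[sid] += sample
    counts[sid] += 1

# Average speed per sample point, e.g., id 14 -> (30+45+60)/3 = 45.0
averages = {sid: sums[sid] / counts[sid] for sid in sums}
assert counts[14] == 3 and averages[14] == 45.0
```

The rest of the paper is about computing exactly such counts and averages while the server sees only encrypted, anonymized, quota-limited uploads.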
I. System setup (Runs once, when the system starts).
1: Server generates public and private keys for accountability by running System setup from §6.
2: Both Server and SM generate public and private keys for the aggregation protocols by running System setup from §5.
II. Client join (Runs once per client, when a client signs up to the system).
1: Client identifies herself and obtains public keys and capabilities to upload data for accountability (by running Client join, §6) from Server, and public keys for aggregation (by running Client join, §5) from SM. Server also informs Client of the aggregates Server wants to compute, and Client decides to which aggregates she wants to contribute (i.e., generate tuples).
III. Client uploads for aggregate id (Runs when a client generates a tuple ⟨id, sample⟩).
1: Client runs SLP's Upload protocol (§5) by communicating with the SM for each ⟨id, sample⟩ and produces a set of transformed tuples T1, ..., Tk and a set of times, t1, ..., tk, when each should be uploaded to Server.
2: For each i, Client uploads Ti to Server at time ti.
3: Client also proves to Server, using the Accountability upload protocol (§6), that the upload is within the permissible quotas and acceptable sample interval.
IV. Server computes aggregate result id (Runs at the end of an aggregate's time interval).
1: Server puts together all the tuples for the same id and aggregates them by running Aggregation with SM, §5.

Figure 2: Overview of PrivStats.

For a security parameter k and an aggregate function F (which can also be a collection of aggregate functions), consider a protocol P = PF(k) for computing F. For example, the SLP protocol (§5) is such a protocol P, where F can be any of the aggregates discussed in §7. Let R be a collection of raw tuples generated by users in some time period; that is, R is a set of tuples of the form ⟨id, sample⟩ together with the precise time, client network information, and client id when they are generated. R is therefore the private data containing the paths of all users and should be hidden from the server. We refer to R as a raw-tuple configuration. Let SI be some side information available at the server about R. Using P, clients transform the tuples in R before uploading them to the server. (The sizes of R, F, and SI are assumed to be polynomial in k.)

The following definition characterizes the information available at the server when running protocol P. Since we consider a real system, the server observes the timing and network origin of the packets it receives; a privacy definition should take these into account.

DEF. 1 (SERVER'S VIEW). The view of the server, ViewP(R), is all the tuples the server receives from clients or SM in P, associated with time of receipt and any packet network information, when clients generate raw tuples R and run protocol P.

Let D be the domain of all raw-tuple configurations R. Let res be some aggregate result. Let RF(res) = {R ∈ D : F(R) = res} be all possible collections of raw tuples where the associated aggregate result is res.

Security game. We describe the SLP definition using a game between a challenger and an adversary. The adversary is a party that would get access to all the information the server gets. Consider a protocol P, a security parameter k, a raw-tuple domain D, and an adversary Adv.
1: The challenger sets up P by choosing any secret and public keys required by P with security parameter k and sends the public keys to Adv.
2: Adv chooses SI, res, and R0, R1 ∈ RF(res) (to facilitate guessing based on SI) and sends R0 and R1 to the challenger.
3: Challenger runs P producing ViewP(R0) and ViewP(R1), and sends them to Adv. It chooses a fair random bit b and sends ViewP(Rb) to Adv. (ViewP usually contains probabilistic encryption, so two encryptions of Rb will lead to different values.)
4: The adversary outputs its best guess b∗ and wins this game if b = b∗. Let winP(Adv, k) := Pr[b = b∗] be the probability that Adv wins this game.

DEF. 2 (STRICT LOCATION PRIVACY – SLP). A protocol P has strict location privacy with respect to a raw-tuple domain D if, for any polynomial-time adversary Adv, winP(Adv, k) ≤ 1/2 + a negligible function of k.

Intuitively, this definition says that, when the server examines the data it receives and any side information, all possible configurations of client paths having the same aggregate result are equally likely. Therefore, the server learns nothing new beyond the aggregate result.

Strict location privacy and differential privacy. The guarantee of strict location privacy is complementary to those of differential privacy [11], and these two approaches address different models. In SLP, the server is untrusted and all that it learns is the aggregate result. In a differential privacy setting, the server is trusted and knows all private information of the clients; clients issue queries to the database at the server, and clients only learn aggregate results that do not reveal individual tuples. Of course, allowing the server to know all clients' private path information is unacceptable in our setting. PrivStats' model takes the first and most important step for privacy: hiding the actual paths of the users from the server. Actual paths leak much more than common aggregate results in practice. A natural question is whether one can add differential privacy on top of PrivStats to reduce leakage from the aggregate result, while retaining the guarantees of PrivStats; we discuss this in §8.

4. OVERVIEW

PrivStats consists of running an aggregation protocol and an accountability protocol. The aggregation protocol (§5) achieves our strict location privacy (SLP) definition, Def. 2, and leaks virtually nothing about the clients other than the aggregate result.
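As a rough intuition for the security game behind Def. 2, the following toy simulation (our own sketch, not part of PrivStats) models a protocol whose view depends only on the aggregate result plus fresh randomness standing in for probabilistic encryption and randomized timing; any guessing strategy then wins the game with probability close to 1/2.

```python
import random

random.seed(0)

# Two raw-tuple configurations with the same aggregate result (here, the sum).
R0 = [10, 20, 30]
R1 = [15, 15, 30]
assert sum(R0) == sum(R1)

def view(R):
    # The "view" reveals only the aggregate result plus fresh randomness;
    # nothing in it depends on which configuration produced it.
    return (sum(R), random.random())

def adversary_guess(v0, v1, vb):
    # A best-effort adversary: compare the challenge view against the two
    # sample views. Since the views differ only in fresh randomness,
    # this can do no better than chance.
    return 0 if abs(vb[1] - v0[1]) < abs(vb[1] - v1[1]) else 1

trials, wins = 10000, 0
for _ in range(trials):
    b = random.randrange(2)
    v0, v1 = view(R0), view(R1)
    vb = view(R0 if b == 0 else R1)
    wins += (adversary_guess(v0, v1, vb) == b)

rate = wins / trials
assert 0.45 < rate < 0.55   # advantage over 1/2 is negligible
```

A protocol fails this game exactly when some component of the view (count, timing, origin, or ciphertexts) correlates with the chosen configuration, which is why the SLP protocol below must hide all of them.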
The accountability protocol (§6) enables the server to check three properties of each client's upload without learning anything about the identity of the client: the client did not exceed a server-imposed quota of uploads for each sample point, did not exceed a total quota of uploads over all sample points, and the sample uploaded is in an acceptable interval of values.

Figure 1 illustrates the interaction of the three components in our system on an example: computation of aggregate statistic with id 14, average speed. The figure shows Alice and Bob generating data when passing through the corresponding sample point and then contacting the SM to know how many tuples to upload. As we explain in §5, the data that reaches the server is anonymized, contains encrypted speeds, arrives at random times independent of the time of generation, and is accompanied by accountability proofs to prevent malicious uploads. Moreover, the number of tuples arriving is uncorrelated with the real number. Figure 2 provides an overview of our protocols, with components elaborated in the upcoming sections.

Figure 3: Staged randomized timing upload for a client C: (1) is the generation interval, (2) the synchronization interval, and (3) the upload interval. C passes by a sample point and generates a tuple in (1), contacts the SM to synchronize at random timing in (2), and uploads a real and a junk tuple to the server at random timing in (3).

5. AGGREGATION: THE SLP (STRICT LOCATION PRIVACY) PROTOCOL

A protocol with strict location privacy must hide all five leakage vectors (included in Def. 1): client identifier, network origin of packet, time of upload, sample value, and number of tuples generated for each aggregate (i.e., the number of clients passing by a sample point). The need to hide the first two is evident, and we already discussed the last three vectors in §3.

Hiding the identifier and network origin. As discussed in §2, clients never upload their identities (which is possible due to our accountability protocol described in §6). We hide the network origin using an anonymizing network as discussed in §2.1.

Hiding the sample. Clients encrypt their samples using a (semantically secure) homomorphic encryption scheme. Various schemes can be used, depending on the aggregate to be computed; Paillier [30] is our running example and it already enables most common aggregates. Paillier has the property that E(a) · E(b) = E(a + b), where E(a) denotes encryption of a under the public key, and the multiplication and addition are performed in appropriate groups. Using this setup, the server computes the desired aggregate on encrypted data; the only decrypted value the server sees is the final aggregate result (due to the SM, as described later in this section).

It is important for the encryption scheme to be verifiable to prevent the SM from corrupting the decrypted aggregate result. That is, given the public key PK and a ciphertext E(a), when given a and some randomness r, the server can verify that E(a) was indeed an encryption of a using r, and there is no b such that E(a) could have been an encryption of b for some randomness. Also, the holder of the secret key should be able to compute r efficiently. Fortunately, Paillier has this property because it is a trapdoor permutation; computing r involves one exponentiation [30].

Hiding the number of tuples. The server needs to receive a number of tuples that is independent of the actual number of tuples generated. The idea is to arrange that the clients will upload in total a constant number of tuples, Uid, for each aggregate id. Uid is a publicly-known value, usually an upper bound on the number of clients generating tuples, computed based on historical estimates of how many clients participate in the aggregate. §9 explains how to choose Uid.

The difficulty with uploading Uid in total is that clients do not know how many other clients pass through the same sample point, and any synchronization point may itself become a point of leakage for the number of clients. Assuming that there are no trusted parties (i.e., everyone tries to learn the number of tuples generated), under a reasonable formalization of our specific practical setting, the problem of hiding the number of tuples can be reduced to a simple distributed algorithms problem which is shown to be impossible (as presented in a longer version of our paper at http://nms.csail.mit.edu/projects/privacy/). For this reason, we use a party called the smoothing module (SM) that clients use to synchronize and upload a constant number of tuples at each aggregate. The SM will also perform the final decryption of the aggregate result. The trust requirement on the SM is that it does not decrypt more than one value for the server per aggregate and it does not leak the actual number of tuples. Although the impossibility result provides evidence that some form of trust is needed for SLP, we do not claim that hiding the number of tuples would remain impossible if one weakened the privacy guarantees or altered the model. However, we think our trust assumption is reasonable, as we explain in §8: a malicious SM has limited effect on privacy (it does not see the timing, origin, and identifier of client tuples) and virtually no effect on aggregate results (it cannot decrypt an incorrect value or affect accountability). Moreover, the SM has light load, and can be distributed on clients, ensuring that most aggregates have our full guarantees. For simplicity, we describe the SLP protocol with one SM and subsequently explain how to distribute it on clients.

Since the number of real tuples generated may be less than Uid, we will arrange to have clients occasionally upload junk tuples: tuples that are indistinguishable from legitimate tuples because the encryption scheme is probabilistic. Using the SM, clients can figure out how many junk tuples they should upload. The value of a junk tuple is a neutral value for the aggregate to be computed (e.g., zero for summations), as we explain in §7.

To summarize, the SM needs to perform only two simple tasks:
• Keep track of the number of tuples that have been uploaded so far (sid) for an aggregate (without leaking this number to an adversary). Clients use this information to figure out how many more tuples they should upload at the server to reach a total of Uid.
• At the end of the time period for an aggregate id, decrypt the value of the aggregate result from the server.

Hiding timing. Clients upload tuples according to a staged randomized timing algorithm described in Figure 3; this protocol hides the tuple generation time from both the server and the SM. The generation interval corresponds to the interval of aggregation when clients generate tuples, and the synchronization and upload intervals are added at the end of the generation interval, but are much shorter (they are intervals rather than single points only to ensure reasonable load per second at the SM and server).

We now put together all these components and specify the SLP protocol. Denote an aggregate by id and let quota be the number of samples a client can upload at id. Let [ts0, ts1] be the sync. interval.
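The Paillier-based aggregation described above can be sketched as follows. This is a minimal textbook Paillier implementation with toy parameters (insecure key sizes; our own demo values), only to show that multiplying ciphertexts adds the underlying samples, so the server needs just one decryption at the end:

```python
import math, random

# Toy Paillier keypair with small demo primes -- illustration only,
# not secure; real deployments use primes of >= 1024 bits.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1                      # standard choice of generator
lam = math.lcm(p - 1, q - 1)   # Carmichael function of n

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)   # precomputed decryption constant

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2   # probabilistic: fresh r each time

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Server-side aggregation: multiplying ciphertexts adds the plaintexts,
# E(a) * E(b) = E(a + b), so the server sums samples without seeing them.
speeds = [30, 45, 60]                  # clients' samples (e.g., speeds)
agg = 1
for c in (encrypt(s) for s in speeds):
    agg = (agg * c) % n2
assert decrypt(agg) == sum(speeds)     # one decryption yields the aggregate
```

A junk tuple here is simply `encrypt(0)`: it changes nothing in the sum and, because encryption is probabilistic, is indistinguishable from a real sample.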
per (http://nms.csail.mit.edu/projects/privacy/); nev- I. System setup: SM generates a Paillier secret and public key. ertheless, the proof is easy enough to understand in such terms. II. Client joins: Client obtains the public key from the SM. According to Def. 2, we need to argue that the information in the III. Upload: view of the server (besides the aggregate result and any SI the server ◦ G ENERATION INTERVAL already knows) when clients generate two different collections of 1: Client generates a tuple hid, samplei and computes T1 := raw tuples, R0 and R1 , is indistinguishable. The information that hid, E[sample]i using SM’s public key. reaches the server for id is a set of tuples consisting of three compo- 2: Client selects a random time ts in the sync. interval and does nents: time of receipt, network origin, and encrypted sample. Due nothing until ts . to our Uid scheme, the number of tuples arriving at the server in both ◦ S YNCHRONIZATION INTERVAL cases is the same, Uid . Next, our staged upload protocol ensures 3: Client ↔ SM : At time ts , Client requests sid , the number that the timing of upload is uniformly at random distributed and the of tuples other clients already engaged to upload for id from network origin is hidden. The encryption scheme of the samples is SM (sid = 0 if Client is the first to ask sid for id). semantically-secure so, by definition, any two samples encrypted 4: Client should upload ∆s := min(quota, dUid (ts − with this scheme are indistinguishable to any polynomial adversary. ts0 )/(ts1 − ts0 )e − sid ) tuples. If ∆s ≤ 0, but sid < Uid , as- Since all these three components are independent of each other, the sign ∆s := 1. (In parallel, SM updates sid to sid := ∆s+sid total set of tuples for R0 and R1 are indistinguishable. because it also knows quota and it assumes that Client will Note that this theorem assumes that the clients, server, and SM upload her share.) 
run the SLP protocol correctly; specifically, the server performs all tasks in the protocol correctly, although it may try to infer private information from the data it receives. In §8, we discuss the protection PrivStats provides when these parties deviate from the SLP protocol in various ways.

   5: Client picks ∆s random times, t1, ..., t∆s, in the upload interval to use when uploading tuples.
   ◦ UPLOAD INTERVAL:
   6: Client → Server: The client uploads ∆s tuples at times t1, ..., t∆s. One of the tuples is the real tuple T1; the others, if any, are junk tuples.
   IV. Aggregation:
   1: At the end of the interval, Server computes the aggregate on encrypted samples by homomorphically adding all samples for id.
   2: Server ↔ SM: Server asks SM to decrypt the aggregate result. SM must decrypt only one value per aggregate. Server verifies that SM returned the correct decrypted value, as discussed.

The equation for ∆s is designed so that for each time t in the synchronization interval, approximately a total of Uid · (t − ts0)/(ts1 − ts0) tuples have been scheduled by clients to be uploaded during the upload interval. Therefore, when t becomes ts1, Uid tuples should have been scheduled for upload.

We assign ∆s := 1 when ∆s comes out negative or zero because, as long as sid < Uid, it is better to schedule clients for uploading in order to reach the value of Uid by the end of the time interval; this approach avoids too many junk tuples (which do not carry useful information) being uploaded later in the interval if insufficiently many clients show up then. Moreover, this ensures that, as long as Uid is an upper bound on the number of clients passing by, few clients will refrain from uploading in practice.

Choice of Uid and quota. A reasonable choice is quota = 3 and Uid = avgid + stdid, where avgid is an estimate of the average number of clients passing through a sample point and stdid of the standard deviation; both these values can be deduced from historical records or known trends of the sample point id. We justify and evaluate this choice of Uid and quota in §9.2.

Let D∗ be the set of all raw-tuple configurations in which clients succeed in uploading Uid tuples at sample point id. Raw-tuple configurations in practice should be a subset of this set for properly chosen Uid and quota.

THEOREM 1. The SLP protocol achieves the strict location privacy definition from Def. 2 for D∗.

PROOF. Due to space constraints, we provide a high-level proof, leaving a mathematical exposition for an extended version of this paper.

Note that we provide the SLP guarantees as long as clients succeed in uploading Uid tuples (as captured in the definition of D∗). In the special case when an aggregate is usually popular (so it has high Uid) but suddenly falls in popularity (at least quota = 3 times less), clients may not be able to reach Uid. One potential solution is to have clients volunteer to "watch" over certain sample points they do not necessarily traverse, by uploading junk tuples within the same quota restrictions, if sid is significantly lower than Uid towards the end of the synchronization interval.

Also note that the result of an aggregate statistic is available at the server only at the end of the generation, synchronization, and upload intervals. The last two intervals are short, but the first one is as long as the time interval for which the server wants to compute the aggregate. This approach inherently does not provide real-time statistics; however, it is possible to bring it closer to real-time statistics if the server chooses the generation interval to be short, e.g., the server runs the aggregation for every 5 minutes of an hour of interest.

5.1 Distributing the Smoothing Module

So far, we referred to the smoothing module as one module for simplicity, but we recommend the SM be distributed on the clients: this distributes the already limited trust placed on the SM. Each client will be responsible for running the SM for one or a few aggregate statistics. We denote by a client SM the program running on a client performing the task of the SM for a subset of the sample points. Therefore, each physical client will be running the protocol for the client described above and the protocol for one client SM (these two roles being logically separated).

A set of K clients will handle each aggregate to ensure that the SLP protocol proceeds even if K − 1 clients become unavailable (e.g., fail, lose connectivity). Assuming system membership is public, anyone can compute which K clients are assigned to an aggregate id using consistent hashing [23]; this approach is a standard distributed systems technique with the property that, roughly, load is balanced among clients and clients are assigned to aggregates in a random way (see [23] for more details).

The design of the SM facilitates distribution to smartphones: the SM has light load (≈ 50 times less than the server), performs two simple tasks, and uses constant storage per aggregate (see §9).

We now discuss how the two tasks of a client SM – aggregate
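The scheduling rule just described can be sketched in a few lines (an illustrative sketch, not the paper's exact formula; the function name and the uniform choice of upload times are our own assumptions):

```python
import random

def schedule_uploads(U_id, s_id, t, ts0, ts1, tu0, tu1):
    """Staged-timing sketch: how many tuples (Delta_s) a client arriving
    at time t should schedule, and at which random times in the upload
    interval [tu0, tu1] to send them.

    U_id is the target tuple count for the aggregate, s_id the count of
    tuples already scheduled (as reported by the SM), and [ts0, ts1]
    the synchronization interval.
    """
    # By time t, roughly U_id * (t - ts0) / (ts1 - ts0) tuples should
    # already have been scheduled; this client covers the shortfall.
    target = U_id * (t - ts0) / (ts1 - ts0)
    delta_s = round(target - s_id)
    if delta_s <= 0:
        delta_s = 1  # always schedule at least the client's real tuple
    return delta_s, sorted(random.uniform(tu0, tu1) for _ in range(delta_s))
```

For example, halfway through the synchronization interval with Uid = 100 and sid = 0, an arriving client would schedule 50 tuples (one real, the rest junk); once sid has caught up with the target, each new client schedules just its real tuple.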
result decryption and maintaining the total number of tuples – are accomplished in a distributed setting.

During system setup, one of the K clients chooses a secret and a public key and shares them with the other K − 1 clients. The server can ask any SM to decrypt the aggregate result: since the server can verify the correctness of the decryption, the server can ask a different SM in case it establishes that the first one misbehaved. Client SMs should notify each other if the server asked for a decryption, to prevent the server from asking different client SMs for different decryptions. Decryption is a light task for client devices because it happens only once per aggregate.

For a given aggregate, the aggregate result will be successfully decrypted if at least one of the K client SMs is available. K should be chosen based on the probability that clients become unavailable, which depends on the application. Since it is enough for only one SM to function for correctness, in general we expect that a small value of K should suffice in practice (e.g., K = 4).

For maintaining the total number of tuples, we designed our staged timing protocol (Fig. 3) in such a way that client SMs need only be available for the synchronization time interval. This interval is shorter than the generation interval. A client should ask each client SM corresponding to a sample point for the value of sid in parallel. Since some client may not succeed in contacting all the client SMs, the client SMs should synchronize from time to time. To facilitate client SM synchronization, each client should choose a random identifier for the request and provide it to each client SM so that SMs can track which requests they missed when they synchronize. (As an optimization, for very popular aggregates, one might consider not trying to hide the number of tuples because the leakage is less for such aggregates. For example, an aggregate can be conservatively considered popular if there are at least 1000 clients passing per hour.)

Our strict privacy guarantees will hold for an aggregate if none of the K client SMs for that aggregate are malicious. Since the fraction of malicious clients is small in practice, our strong privacy guarantees will be achieved for most of the aggregates.

6. ACCOUNTABILITY

Since each client uploads tuples anonymously, malicious clients might attempt to bias the aggregates by uploading multiple times. In this section, we discuss PrivStats's accountability protocol, which enforces three restrictions to protect against such attacks: a quota on each client's uploads for each sample point, a total upload quota per client, and a guarantee that each sample value must be within an acceptable interval. The protocol must also achieve the conflicting goal of protecting location privacy. The most challenging part is to ensure that each client uploads within quota at each sample point; the other two requirements are provided by previous work.

We note that the SM is not involved in any part of accountability; the server performs the accountability checks by itself. This is one of the main contributions of PrivStats: with no reliance on any (even partially or distributed) trusted party, the server is able to enforce a quota on the number of uploads of each client without learning who performed any given upload.

The idea is to have each client upload a cryptographic token whose legitimacy the server can check and which does not reveal any information about the client; furthermore, a client cannot create more than her quota of legitimate tokens for a given aggregate. Thus, if a client exceeds her quota, she will have to re-use a token, and the server will note the duplication, discarding the tuple.

Notation and conventions. Let SKs be the server's secret signing key and PKs the corresponding public verification key. Consider an aggregate with identifier id. Let Tid be the token uploaded by a client for aggregate id. All the values in this protocol are computed in Zn, where n = p1·p2 for two safe primes p1 and p2, even if we do not always mark this fact. We make standard cryptographic assumptions that have been used in the literature, such as the strong-RSA assumption.

6.1 Protocol

For clarity, we explain the protocol for a quota of one tuple per aggregate id per client and then explain how to use the protocol for a larger quota.

The accountability protocol consists of an efficient zero-knowledge proof of knowledge protocol we designed from scratch for this problem. Proofs of knowledge [15], [3], [36] are proofs by which one can prove that she knows some value that satisfies a certain relation. For example, Schnorr [36] provided a simple and efficient algorithm for proving possession of a discrete log. The zero-knowledge [15] property provides the guarantee that no information about the value in question is leaked.

System setup. The server generates a public key (PKs) and a private key (SKs) for the signature scheme from [10].

Client joins. Client identifies herself to Server (using her real identity) and Server gives Client one capability (allowing her to upload one tuple per aggregate) as follows. Client picks a random number s and obtains a blind signature from Server on s, denoted sig(s) (using the scheme in [10]). The pair (s, sig(s)) is a capability. A capability enables a client to create exactly one correct token Tid for each aggregate id. The purpose of sig(s) is to attest that the capability of the client is correct. Without the certification provided by the blind signature, Client could create her own capabilities and upload more than she is allowed.

The client keeps s and sig(s) secret. Since the signature is blind, Server never sees s or sig(s). Otherwise, the server could link the identity of Client to sig(s); it could then test each token Tid received to see if it was produced using sig(s) and learn for which tuples Client uploaded, and hence her path. By running the signing protocol only once with Client, Server can ensure that it gives only one capability to Client.

Accountability upload:
   1. Client → Server: Client computes Tid = id^s mod n and uploads it together with a tuple. Client also uploads a zero-knowledge proof of knowledge (ZKPoK, below in Sec. 6.2) with which Client proves that she knows a value s such that id^s = Tid mod n and for which she has a signature from the server.
   2. Server → Client: Server checks the proof and discards the tuple if the proof fails or if the value Tid has been uploaded before for id (signaling an over-upload).

We prove the security of the scheme in Sec. 6.2 and the appendix. Intuitively, Tid does not leak s because of the hardness of the discrete log problem. The client's proof is a zero-knowledge proof of knowledge (ZKPoK), so it does not leak s either. Since Tid is computed deterministically, more than one upload for the same id will result in the same value of Tid. The server can detect these over-uploads and throw them away. The client cannot produce a different Tid∗ for the same id by using a different s∗ because she cannot forge a signature of the server for s∗; without such a signature, Client cannot convince Server that she has a signature for the exponent of id in Tid∗. Here we assumed that id is randomly distributed.

Quotas. To enforce a quota > 1, the server simply issues a quota of capabilities during registration. In addition, the server may want to tie quotas to aggregates. To do so, the server divides the aggregates into categories. For each category, the server runs system
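The token mechanics can be illustrated with toy numbers (a sketch only: the blind signature and the ZKPoK are omitted, and the tiny modulus and constants are purely illustrative — in the protocol, n is a large product of two safe primes):

```python
# Toy illustration of the accountability token T_id = id^s mod n and the
# server's duplicate detection. The capability s would be certified by a
# blind signature in the real protocol; that part is simplified away here.
n = 101 * 103   # toy modulus
s = 12345       # client's secret capability value

def token(agg_id: int) -> int:
    # Deterministic: re-uploading for the same aggregate id necessarily
    # reproduces the same token.
    return pow(agg_id, s, n)

seen = set()
def server_accept(agg_id: int, T: int) -> bool:
    # The server remembers tokens seen per aggregate; a duplicate signals
    # an over-upload, without revealing which client sent it.
    if (agg_id, T) in seen:
        return False
    seen.add((agg_id, T))
    return True
```

A first upload for an aggregate is accepted, a second one built from the same capability is rejected, while uploads for other aggregates (or by clients holding other values of s) yield unrelated tokens.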
setup separately, obtaining different SKs and PKs, and then gives a different number of capabilities to clients for each category.

Quota on total uploads. We also have a simple cryptographic protocol to enforce a quota on how much a client can upload in total over all aggregates, not presented here due to space constraints; however, any existing e-cash scheme suffices here [8].

Ensuring reasonable values. As mentioned, in addition to quota enforcement, for each tuple uploaded, clients prove to the server that the encrypted samples are in an expected interval of values (e.g., for speed, the interval is (0, 150)) to prevent clients from uploading huge numbers that affect the aggregate result. Such a proof is done using the efficient interval zero-knowledge proof of [5]; the server makes such an interval publicly available for each aggregate. We discuss the degree to which fake tuples can affect the result in section 9.2.

6.2 A Zero-Knowledge Proof of Knowledge

We require a commitment scheme (such as the one in [31]): recall that a commitment scheme allows a client Alice to commit to a value x by computing a ciphertext ciph and giving it to another client Bob. Bob cannot learn x from ciph. Later, Alice can open the commitment by providing x and a decommitment key that are checked by Bob. Alice cannot open the commitment for x0 ≠ x and pass Bob's verification check.

   Public inputs: id, Tid, PKs, g, h
   Client's input: s, σ = sig(s)
   Server's input: SKs

CONSTRUCTION 1. Proof that Client knows s and σ such that id^s = Tid and σ is a signature by Server on s.

   1. Client computes a Pedersen commitment to s: com = g^s · h^r mod n, where r is random. Client proves to Server that she knows a signature sig(s) from the server on the value committed in com, using the protocol in [10].
   2. Client proves that she knows s and r such that Tid = id^s and com = g^s · h^r as follows:
      (a) Client picks k1 and k2 at random in Zn. She computes T1 = g^k1 mod n, T2 = h^k2 mod n, and T3 = id^k1 mod n and gives them to the server.
      (b) Server picks c, a prime number, at random and sends it to Client.
      (c) Client computes r1 = k1 + s·c and r2 = k2 + r·c and sends them to Server.
      (d) Server checks whether com^c · T1 · T2 ≡ g^r1 · h^r2 (mod n) and Tid^c · T3 ≡ id^r1 (mod n). If the checks succeed, it outputs "ACCEPT"; else, it outputs "REJECT".

This proof can be made non-interactive using the Fiat-Shamir heuristic [13] or following the proofs in [5]; due to space limits, we do not present the protocol here. The idea is that the client computes c herself using a hash of the values she sends to the server, in a way that makes it computationally infeasible for the client to cheat.

THEOREM 2. Under the strong-RSA assumption, Construction 1 is a zero-knowledge proof of knowledge that the client knows s and σ such that id^s = Tid mod n and σ is a signature by the server on s.

Due to space constraints, the proof is presented in a longer version of this paper at http://nms.csail.mit.edu/projects/privacy/.

6.3 Optimization

To avoid performing the accountability check for each tuple uploaded, the server can perform this check probabilistically; with a probability q, it checks each tuple for aggregate id. If the server notices an attempt to bias the aggregate (a proof failing or a duplicate token), it should then check all proofs for the aggregate from then on or, if it stored the old proofs it did not verify, it can check them all now to determine which to disregard from the computation. Note that the probability of detecting an attempt to violate the upload quota, Q, increases exponentially in the number of repeated uploads, n: Q = 1 − (1 − q)^n. For q = 20%, after 10 repeated uploads, the chance of detection is already 90%. We recommend q = 20%, although this value should be adjusted based on the expected number of clients uploading for an aggregate.

6.4 Generality of Accountability

The accountability protocol is general and not tied to our setting. It can be used to enable a server to prevent clients from uploading more than a quota for each of a set of "subjects" while preserving their anonymity. For example, a class of applications is online reviews and ratings in which the server does not need to be able to link uploads from the same user, e.g., course evaluations. Students would like to preserve their anonymity, while the system needs to prevent them from uploading more than once for each class.

7. AGGREGATE STATISTICS SUPPORTED

SLP supports any aggregates for which efficient homomorphic encryption schemes and efficient verification or proof of decryption exist. Additive homomorphic encryption (e.g., Paillier [30]) already supports many aggregates used in practice, as we explain below. Note that if verification of decryption by the SM is not needed (e.g., in a distributed setting), a wider range of schemes is available. Also note that if the sample value is believed not to leak, arbitrary aggregates could be supported by simply not encrypting the sample.

Table 1 lists some representative functions of practical interest supported by PrivStats. Note the generality of sum and average of functions — F could include conditionals or arbitrary transformations based on the tuple. When computing the average of functions, some error is introduced due to the presence of junk tuples because each sample is weighted by the number of uploads. However, as we show in §9.2, the error is small (e.g., < 3% for average speed for CarTel).

Note in Table 1 that count is a special case for which we can make some simplifications: we do not need the SM at all. The reason is that the result of the aggregate itself equals the number of tuples generated at the sample point, so there is no reason to hide this number any more. Also, since each sample of a client passing through the sample point is 1, there is no reason to hide this value, so we remove the sample encryptions. Instead of contacting the SM in the sync. interval (Fig. 3), clients directly upload to the server at the same random timing.

Additive homomorphic encryption does not support median, min, and max. For the median, one possible solution is to approximate it with the mean. Another possible solution is to use order-preserving encryption [4]: for two values a < b, the encryption of a will be smaller than the encryption of b. This approach enables the server to directly compute median, min, and max. However, it has the drawback that the server learns the order on the values. An alternative that leaks less information is as follows: to compute the min, a client uploads an additional bit with her data. The bit is 1 if her sample is smaller than a certain threshold. The server can collect all values with the bit set and ask the SM to identify and decrypt the minimum. (Special-purpose protocols for other statistics are similar.)
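The algebra of steps 2(a)–(d) can be checked mechanically; the sketch below runs one round of Construction 1 with tiny illustrative parameters (our own choices of n, g, h, s, r) and skips the signature sub-proof of step 1:

```python
import random

n = 101 * 103            # toy modulus (in practice a large product of safe primes)
g, h = 2, 3              # toy bases for the Pedersen commitment
agg_id = 7               # the aggregate identifier id
s, r = 4321, 987         # client's secret s and commitment randomness r

T_id = pow(agg_id, s, n)                   # token T_id = id^s mod n
com = (pow(g, s, n) * pow(h, r, n)) % n    # com = g^s * h^r mod n

# (a) client's first message
k1, k2 = random.randrange(1, n), random.randrange(1, n)
T1, T2, T3 = pow(g, k1, n), pow(h, k2, n), pow(agg_id, k1, n)

# (b) server's challenge; primality matters for the security proof's
# extraction argument, not for an honest run, so any c works here
c = 11

# (c) client's responses, computed over the integers
r1, r2 = k1 + s * c, k2 + r * c

# (d) server's verification equations
check1 = (pow(com, c, n) * T1 * T2) % n == (pow(g, r1, n) * pow(h, r2, n)) % n
check2 = (pow(T_id, c, n) * T3) % n == pow(agg_id, r1, n)
assert check1 and check2   # an honest client always passes
```

Both checks pass for any honest transcript: com^c · T1 · T2 expands to g^(k1 + s·c) · h^(k2 + r·c) = g^r1 · h^r2, and Tid^c · T3 expands to id^(k1 + s·c) = id^r1.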
Table 1: Aggregate statistics supported by PrivStats, with example applications and implementation.

   Summation — Example: number of people carpooled through an intersection. Implementation: each client uploads an encryption of the number of people in the car; junk tuples are encryptions of zero.
   Sum of functions, Σi F(tuple_i) — Example: count of people exceeding the speed limit; this is a generalization of summation. Implementation: each client applies F to her tuple and uploads an encryption of the resulting value; junk tuples are encryptions of zero.
   Average — Example: average speed or delay. Implementation: see average of functions with F(tuple) = sample.
   Standard deviation — Example: standard deviation of delays. Implementation: compute average of functions (see below) with F(tuple) = sample^2 and denote the result Avg1; separately compute the average with F(tuple) = sample and denote the result Avg2. The server computes sqrt(Avg1 − (Avg2)^2). This procedure also reveals the average to the server, but likely one would also want to compute the average when computing the standard deviation.
   Average of functions, (1/N)·Σi F(tuple_i) — Example: average number of people in a certain speed range; average speed and delay; this is a generalization of average. Implementation: if count (see below) is also computed for the sample point, compute Σi F(tuple_i) instead, and then the server can divide by the count. Otherwise, compute the sum of functions, where junk tuples equal real tuples, and the server divides the result by the corresponding Uid.
   Count — Example: traffic congestion, i.e., the number of drivers at an intersection. Implementation: the SM is not needed at all for the count computation. Clients do not upload any sample value (uploading simply ⟨id⟩ is enough to indicate passing through the sample point). There are no junk tuples.

8. PRIVACY ANALYSIS

We saw that if the clients, smoothing module, and the server follow the SLP protocol, clients have strong location privacy guarantees. We now discuss the protection offered by PrivStats if the server, SM, or clients maliciously deviate from our strict location privacy protocol.

Malicious server. The server may attempt to ask the SM to decrypt the value of one sample as opposed to the aggregate result. However, this prevents the server from obtaining the aggregate result, since the SM will only decrypt once per aggregate id, and we assume that the server has incentives to compute the correct result. Nevertheless, to deter such an attack, the SM could occasionally audit the server by asking the server to supply all samples it aggregated. The SM checks that there are roughly Uid samples provided and that their aggregation is indeed the supplied value.

The server may attempt to act as a client and ask the SM for sid (because the upload is anonymous) to figure out the total number of tuples. Note that, if the aggregate is not underpopulated, when the server asks for sid at time t, it will receive as answer approximately Uid · (t − ts0)/(ts1 − ts0), a value it knew even without asking the SM. This is one main reason why we used the "growing sid" idea for the sync. interval (Fig. 3), as opposed to just having the SM count how many clients plan to upload and then informing each client of the total number at the end of the sync. interval so that they can scale up their uploads to reach Uid. Therefore, to learn useful information, the server must ask the SM for sid frequently to catch changes in sid caused by other clients. However, when the server asks the SM for sid, the SM increases sid as well. Therefore, the SM will reach the Uid quota early in the sync. interval and not allow other clients to upload; hence, the server will not get an accurate aggregate result, which is against its interest.

Malicious SM. Since the server verifies the decryption, the SM cannot change the result. Even if the SM colludes with clients, the SM has no effect on the accountability protocol because it is entirely enforced by the server. If the SM misbehaves, the damage is limited because clients upload tuples without identifiers or network origin, and our staged timing protocol for upload hides the time when the tuples were generated even from the SM. A compromised SM does permit SI attacks based on sample value and number of tuples; however, if the SM is distributed, we expect the fraction of aggregates with a compromised SM to be small and hence our SLP guarantees to hold for most aggregates.

The SM may attempt a DoS attack on an aggregate by telling clients that Uid tuples have already been uploaded and only allowing one client with a desired value to upload. However, the server can easily detect such cheating when getting few uploads.

Malicious clients. Our accountability protocol prevents clients from uploading too much at an aggregate, too much over all aggregates, and out-of-range values. Therefore, clients cannot affect the correctness of an aggregate result in this manner. As mentioned in §2, we do not check whether the precise value a client uploads is correct, but we show in §9.2 that an incorrect value in the allowable range will likely not introduce a significant error in the aggregate result.

Colluding clients may attempt to learn the private path of a specific other client. However, since our SLP definition models a general adversary with access to server data, such clients will not learn anything beyond the aggregate result and the SI they already knew (which includes, for example, their own paths). A client may try a DoS attack by repeatedly contacting the SM so that the SM thinks Uid tuples have been uploaded. This attack can be prevented by having the SM also run the check for the total number of tuples a client can upload; Uid for a popular aggregate should be larger than the number of aggregates a client can contribute to in a day. For a more precise, but expensive, check, the SM could run a check for the quota of tuples per aggregate, as the server does.

Aggregate result. In some cases, the aggregate result itself may reveal something about clients' paths. However, our goal was to enable the server to know this result accurately, while not leaking additional information. As mentioned, clients can choose not to participate in certain aggregates (see §2).

Differential privacy protocols [11] add noise to the aggregate result to avoid leaking individual tuples; however, most such protocols leak all the private paths to the server by the nature of the model. §3 explains how the SLP and differential privacy models are complementary. A natural question is whether one can add differential privacy on top of the PrivStats protocol while retaining PrivStats's guarantees. At a high level, the SM, upon decrypting the result, may decide how much noise to add to the result to achieve differential privacy, also using the number of true uploads it knows (the number of times it was contacted during the sync. interval). The design of a concrete protocol for this problem is future work. Related to this question is the work of Shi et al. [37], who proposed a protocol combining data hiding from the server with differentially-private aggregate computation at the server; they consider the different
setting of n parties streaming data to a server computing an overall summation. While this is an excellent attempt at providing both types of privacy guarantees together, their model is inappropriate for our setting: they assume that a trusted party can run a setup phase for each specific set of clients that will contribute to an aggregate ahead of time. Besides the lack of such a trusted party in our model, importantly, one does not know during system setup which specific clients will pass through each sample point in the future; even the number n of such clients is unknown.

Table 2: End-to-end latency measurements.
   Setup: 0.16 s
   Join, Nexus client: 0.92 s
   Join, laptop client: 0.42 s
   Upload without account., Nexus: 0.29 s
   Upload with account., Nexus: 2.0 s
   Upload with 20% account., Nexus: 0.6 s
   Upload without account., laptop: 0.094 s
   Upload with account., laptop: 0.84 s
   Aggregation (10^3 samples): 0.2 s
   Aggregation (10^4 samples): 0.46 s
   Aggregation (10^5 samples): 3.1 s

Figure 4: Breakdown of Nexus upload latency. (Bars, on a 0–2 s scale: contact SM; prepare tuples; prepare accountability proof; upload to server and wait for reply.)

Table 3: Runtime of the accountability protocol.
   Server: 0.29 s
   Client laptop: 0.46 s
   Client Nexus: 1.16 s

Table 4: Server evaluation for uploads. Latency indicates the time it takes for the server to process an upload from the moment the request reaches the server until it is completed. Throughput indicates the number of uploads per minute the server can handle.
   Upload latency, with account.: 0.3 s
   Upload latency, no account.: 0.02 s
   Throughput with 0% account.: 2400 uploads/core/min
   Throughput with 20% account.: 860 uploads/core/min
   Throughput with 100% account.: 170 uploads/core/min

9. IMPLEMENTATION AND EVALUATION

In this section, we demonstrate that PrivStats can run on top of commodity smartphones and hardware at reasonable costs.
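The 20% accountability configuration in Tables 2 and 4 corresponds to the probabilistic checking of §6.3, whose detection guarantee is easy to reproduce (a direct transcription of Q = 1 − (1 − q)^n):

```python
def detection_probability(q: float, n: int) -> float:
    # Each upload is checked independently with probability q, so n
    # repeated over-quota uploads all escape detection with
    # probability (1 - q)^n.
    return 1 - (1 - q) ** n
```

With q = 0.2, ten repeated uploads are caught with probability ≈ 0.89, matching the roughly 90% figure quoted in §6.3.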
We implemented an end-to-end system; the clients are smartphones (Nexus One) or commodity laptops (for some social crowd-sourcing applications), the server is a commodity server, and the SM was evaluated on smartphones because it runs on clients. The system implements our protocols (Fig. 2), with SLP and with enforcement of a quota on the number of uploads at each aggregate per client for accountability.

We wrote the code in both C++ and Java. For our evaluation below, the server runs C++ for efficiency, while clients and SM run Java. Android smartphones run Java because Android does not fully support C++. (As of the time of the evaluation for this paper, the Android NDK lacks support for basic libraries we require.) We implemented our cryptographic protocols using NTL for C++ and BigInteger for Java. Our implementation is ≈ 1300 lines of code for all three parties, not counting libraries; accountability forms 55% of the code. The core code of the SM is only 62 lines of code (not including libraries), making it easy to secure.

9.1 Performance Evaluation

In our experiments, we used Google Nexus One smartphones (1 GHz Scorpion CPU running Android 2.2.2, 512 MB RAM), a commodity laptop (2.13 GHz Intel Pentium CPU, 2 cores, 3 GB RAM), a commodity server (2.53 GHz Intel i5 CPU, 2 cores, 4 GB RAM), and a server with many cores for our scalability measurements (Intel Xeon CPU, 16 cores, 1.60 GHz, 8 GB RAM). In what follows, we report results with accountability, without accountability (to show the overhead of the aggregation protocols), and with 20% of uploads using accountability, as recommended in §6.3.

Table 2 shows end-to-end measurements for the four main operations in our system (Fig. 2); these include our operations and network times. We can see that the latencies of setup and join are insignificant, especially since they only happen once for the service or once for each client, respectively.

The latency of upload, the most frequent operation, is measured from the time the client wants to generate an upload until the upload is acknowledged at the server, including interaction with the SM. The latency of upload is reasonable: 0.6 s and up to 2 s for Nexus. Since this occurs either in the background or after the client triggered the upload, the user does not need to wait for completion. Figure 4 shows the breakdown of an upload into its various operations. We can see that the accountability protocol (at client and server) takes most of the computation time (86%). The cost of accountability is summarized in Table 3.

For aggregation, we used the Paillier encryption scheme, which takes 33 ms to encrypt, 16.5 ms to decrypt, and 0.03 ms for one homomorphic aggregation on the client laptop, and 135 ms to encrypt and 69 ms to decrypt on the Nexus. The aggregation time includes server computation and interaction with the SM. The latency of this operation is small: 0.46 s for 10^4 tuples per aggregate, more tuples than in the common case. Moreover, the server can aggregate samples as it receives them, rather than waiting until the end.

In order to understand how much capacity one needs for an application, it is important to determine the throughput and latency at the server, as well as whether the throughput scales with the number of cores. We issued many simultaneous uploads to the server to measure these metrics, summarized in Table 4. We can see that the server only needs to perform 0.3 s of computation to verify a cryptographic proof, and one commodity core can process 860 uploads per minute, a reasonable number. We parallelized the server using an adjustable number of threads: each thread processes a different set of aggregate identifiers. Moreover, no synchronization between these threads was needed because each aggregate is independent. We ran the experiment on a 16-core machine: Fig. 5 shows that the throughput indeed scales linearly with the number of cores.

We proceed to estimate the number of cores needed in an application. In a social crowd-sourcing application, suppose that a client uploads samples on the order of 10 times a day when it visits a place of interest (e.g., a restaurant). In this case, one commodity core can already serve about 120,000 clients.

In the vehicular case, clients upload more frequently. If n is the number of cars passing through a sample point, a server core working 24 h can thus support ≈ 24 · 60 · 860/n statistics in one day. For instance, California Department of Transportation statistics
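The capacity arithmetic above can be reproduced directly (a sketch; 860 uploads/core/min is Table 4's measured throughput with 20% accountability, and 10 uploads per client per day is the text's assumption for the crowd-sourcing case):

```python
THROUGHPUT = 860  # uploads per core per minute, 20% accountability (Table 4)

uploads_per_core_per_day = THROUGHPUT * 60 * 24      # one core, 24 hours
clients_per_core = uploads_per_core_per_day // 10    # ~10 uploads/client/day

def statistics_per_core_per_day(n: int) -> int:
    # n = number of cars passing through a sample point per day;
    # this is the 24 * 60 * 860 / n estimate from the text.
    return uploads_per_core_per_day // n
```

This gives about 1.24 million uploads per core per day, i.e., roughly the 120,000 clients per core quoted above.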