Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons
Proceedings on Privacy Enhancing Technologies; 2021 (3):28–48

Kerem Ayoz, Erman Ayday, and A. Ercument Cicek

Abstract: Sharing genome data in a privacy-preserving way stands as a major bottleneck in front of the scientific progress promised by the big data era in genomics. A community-driven protocol named the genomic data-sharing beacon protocol has been widely adopted for sharing genomic data. The system aims to provide a secure, easy to implement, and standardized interface for data sharing by only allowing yes/no queries on the presence of specific alleles in the dataset. However, the beacon protocol was recently shown to be vulnerable against membership inference attacks. In this paper, we show that privacy threats against genomic data-sharing beacons are not limited to membership inference. We identify and analyze a novel vulnerability of genomic data-sharing beacons: genome reconstruction. We show that it is possible to successfully reconstruct a substantial part of the genome of a victim when the attacker knows the victim has been added to the beacon in a recent update. In particular, we show how an attacker can use the inherent correlations in the genome and clustering techniques to run such an attack in an efficient and accurate way. We also show that even if multiple individuals are added to the beacon during the same update, it is possible to identify the victim's genome with high confidence using traits that are easily accessible by the attacker (e.g., eye color or hair type). Moreover, we show how a genome reconstructed using a beacon that is not associated with a sensitive phenotype can be used for membership inference attacks against beacons with sensitive phenotypes (e.g., HIV+). The outcome of this work will guide beacon operators on when and how to update the content of the beacon and help them (along with the beacon participants) make informed decisions.

Keywords: Privacy, Genome Reconstruction Attack, Genomic Data-Sharing Beacons, Genomics

DOI 10.2478/popets-2021-0036
Received 2020-11-30; revised 2021-03-15; accepted 2021-03-16.

Kerem Ayoz: Bilkent University, E-mail: kerem.ayoz@bilkent.edu.tr
Erman Ayday: Case Western Reserve University, E-mail: exa208@case.edu
A. Ercument Cicek: Bilkent University, Carnegie Mellon University, E-mail: cicek@cs.bilkent.edu.tr

1 Introduction

With plummeting sequencing costs, we look forward to reaching a capacity of sequencing one billion individuals over the next 15-20 years, resulting in the availability of very large genomic datasets [20, 49, 64]. Although such large datasets promise a revolution in medicine, numerous studies have shown that it is not straightforward to ensure the anonymity of the participants in such datasets [19, 36, 42, 63, 71].

The human genome is the utmost personal identifier, and sharing genomic data for research while preserving the privacy of individuals has long challenged many different fields (e.g., medicine, bioinformatics, computer science, law, and ethics), due to possibly dire ethical, monetary, and legal consequences. To address this challenge and create frameworks and standards to enable the responsible, voluntary, and secure sharing of genomic data, the Global Alliance for Genomics and Health (GA4GH) was formed by the community [1]. The current genomic data-sharing standard of the GA4GH is called the genomic data-sharing beacon. Beacons are the gateways that let users (researchers) and data owners exchange information without, in theory, disclosing any personal information. A user who wants to apply for access to a dataset can learn whether individuals with specific alleles (nucleotides) of interest are present in the beacon through an online interface. That is, a user can submit a query, asking whether a genome exists in the beacon with a certain nucleotide at a certain position, and the beacon answers "yes" or "no". If the dataset does not contain the desired genome, genomic data is not shared and distributed unnecessarily. In addition, researchers do not have to go through the paperwork to obtain a dataset which will not be helpful for their research. The GA4GH provides a shared beacon interface [2] that, as of December 2020, provides access to 81 beacons and acts as a hub where researchers and data owners meet.

Beacons are typically associated with a particular sensitive phenotype (e.g., the SFARI beacon that hosts individuals with autism). Therefore, the presence of an individual in a particular beacon is considered privacy-sensitive information and the main aim of the beacons is to protect this information. An attacker, using the responses of a beacon and genomic data of a victim, may try to infer the membership of the victim in a particular beacon by running a membership inference attack. The beacon framework sets a barrier against membership inference attacks by allowing only presence/absence queries for variants and not tying any response to any specific individual. In that sense, beacons are considered to have stronger privacy measures compared to other statistical genomic databases. Despite these barriers, several works have shown that beacons are not bulletproof and that they are vulnerable to membership inference attacks [59, 65, 73].

However, threats against genomic data-sharing beacons are not limited to membership inference attacks. In this paper, for the first time, we identify and analyze the vulnerability of genomic data-sharing beacons to the "genome reconstruction" attack. We consider a scenario in which the attacker knows the membership of a victim to a beacon that may not be associated with a sensitive phenotype. Therefore, we consider a targeted attack, in which the attacker either (i) knows that the victim donated their genome to take part in a study or (ii) infers the membership of the victim from the beacon's metadata (as done in [65]). Then, we show how the attacker can accurately infer the genome of the victim by using the beacon responses. Such an attack may result in serious consequences if the attacker uses the reconstructed genome to infer sensitive information (e.g., disease diagnosis) about the victim or to infer the victim's membership to another statistical genomic database of interest (e.g., another beacon that is associated with a sensitive phenotype). In particular, we show how the attacker can use the inherent correlations in the genome to run such an attack in an efficient and accurate way compared to a baseline approach. We also show how clustering techniques can be used to further improve the accuracy of such an attack.

Previous works in the literature assume beacons are static and do not change over time. However, beacons are dynamic datasets (donors join and leave) and this results in an increased risk of genome reconstruction attacks. An attacker can monitor the number of newly added donors to the beacon and the number of donors leaving the beacon from the meta-information of the beacon. With this information, newly joined donors (or donors leaving the beacon) become more vulnerable to genome reconstruction attacks. Thus, for the first time, we consider the beacons as dynamic databases and formulate the genome reconstruction attack accordingly. Privacy vulnerabilities due to dynamic changes in a system have recently been explored in the context of dynamic model changes in machine learning models [61]. It has been shown that different model outputs can constitute a new attack surface for an adversary to infer information about the dataset used to perform a model update [61]. Here, rather than model updates, we focus on the changes in the query responses to a dynamic database.

In a genome reconstruction attack, the attacker reconstructs all or a subset of the genomes in the beacon. Among the reconstructed genomes, it is not trivial to infer which one belongs to the victim. Therefore, we also show how the attacker can identify the victim's genome among the set of reconstructed genomes using moderate auxiliary information about the victim (i.e., a set of visible physical characteristics of the victim, which is public information). Finally, to show one of the consequences of the identified genome reconstruction attack, we show how the attacker can utilize the outcome of this attack to initiate a membership inference attack against the same victim in another beacon, which can be associated with a sensitive phenotype. To do this, we combine the identified genome reconstruction attack with the membership inference attacks against beacons from the literature.

We implement and evaluate the identified vulnerability using real genome data obtained from the OpenSNP [32] and HapMap [21] datasets. We particularly evaluate the success of the attacker in reconstructing a victim's point mutations that include at least one rare nucleotide (i.e., minor allele), since minor alleles (i) reveal sensitive attributes of individuals (e.g., predispositions to privacy-sensitive diseases); and (ii) provide rich information to the attacker for membership inference attacks [59, 73]. We show that for a beacon with 50 individuals, precision and recall of the reconstruction reach up to 0.9 (each) when 3 individuals are added to the beacon and the victim is one of the newcomers. Even when 10 new participants are added to the beacon (causing a 20% increase in beacon size), we show that the attacker has a precision of 0.7 and a recall of 0.8. Furthermore, our results show that when more than one individual is added to the beacon, the attacker can accurately pinpoint the victim's reconstructed genome by using moderate (and publicly available) auxiliary information about the victim. For this, we show how the attacker can match the victim's phenotypical characteristics to the reconstructed genomes using machine learning algorithms. We also show via experiments that
the outcome of the genome reconstruction attack can be accurately used for the membership inference attack on another beacon and it helps an attacker infer the membership of a victim with only a few queries.

Overall, we identify an important vulnerability and show how it can be exploited. We notably show how dependencies between point mutations can be used in a clustering algorithm to achieve high accuracy in a genome reconstruction attack. Furthermore, our methodology consists of a complete pipeline, showing how an attacker uses the information it infers in the genome reconstruction attack in a subsequent membership inference attack. Therefore, this study clearly shows that privacy risks for genomic data-sharing beacons are much more severe than perceived. This is particularly important since the number of beacon participants, and hence the privacy risk of individuals, increases rapidly.

2 Related Work

Genomic privacy has recently been explored by many studies [11, 27, 56]. In the following subsections, we summarize existing work on privacy in statistical genomic databases, inference attacks, and privacy of genomic data-sharing beacons.

2.1 Privacy in Statistical Genomic Databases and Inference Attacks on Genomic Privacy

Several works have shown that anonymization does not effectively protect the privacy of genomic data [30, 33, 35, 45, 50, 53, 66]. It has been shown that the identity of a participant of a genomic study can be revealed by using a second sample (e.g., part of the DNA information from the individual) and the results of the clinical study [19, 37, 41, 75, 77]. The differential privacy (DP) [26] concept has been frequently used to mitigate membership inference attacks when releasing summary statistics from genomic databases [28, 44, 68, 76]. Compared to statistical databases, genomic data-sharing beacons have stronger privacy measures since they only allow presence/absence (or yes/no) queries for variants.

Humbert et al. proposed an inference attack on kin genomic privacy using the family ties between individuals, pairwise correlations between the SNPs, and publicly available statistics about DNA [38]. Then, Deznabi et al. demonstrated that stronger inference techniques can be generated by combining high-order correlations and family ties [25]. Furthermore, several studies have examined phenotype prediction from genomic data, as a means of tracing identity [10, 18, 39, 46, 51, 52, 54, 58, 74, 78]. To mitigate such attribute inference attacks, cryptographic solutions have been proposed for privacy-preserving processing and sharing of genomic data (e.g., to outsource the computation to a public cloud or to conduct collaborative association studies). Existing cryptographic solutions mainly focus on (i) private pattern-matching and the comparison of genomic sequences [15, 24, 43, 55, 69] and (ii) privacy-preserving personalized medicine [12, 13]. In this work, we identify and analyze a different type of attribute inference attack, particularly against genomic data-sharing beacons.

2.2 Privacy in Genomic Data Sharing Beacons

Researchers showed that the presence (membership) of an individual in a genome sharing beacon can be inferred by repeatedly querying the beacon. Here, the attacker is assumed to be an active (or authorized) user of the beacon; in practice, it can ask as many queries as it wishes (there are no limits or costs for this in the current beacon protocol), and it can decide which queries to ask. Furthermore, the attacker is assumed to have access to the set of SNPs of the victim. Shringarpure and Bustamante introduced a likelihood-ratio test (LRT) that can predict whether an individual is in the beacon by querying the beacon for multiple SNPs of a victim [65]. Note that inferring the membership of an individual in a beacon that is associated with a sensitive phenotype is equivalent to uncovering the sensitive phenotype of the victim. Then, Raisaro et al. showed that if the attacker first queries the SNPs with low minor allele frequency (MAF) values, it needs fewer queries for a successful attack [59]. In Section 6.5, we use this attack when we show how the proposed genome reconstruction attack can be combined with the membership inference attack. We provide further background information about this attack in Appendix A. Later, von Thenen et al. showed that even if the attacker does not have the victim's low-MAF SNPs, it is still possible to infer membership by exploiting the correlations in the genome [73]. Furthermore, they showed that beacon responses can also be inferred using such correlations (via a query inference, or QI-attack). In an orthogonal work, Hagestedt et al. hypothesized that while current beacon systems are limited to genomic data, in the near future, the community is going to need a similar system for other biomedical data types. They proposed a beacon system for sharing DNA methylation data (an epigenetic mechanism to regulate transcriptional activity) and then showed
that it is possible to successfully launch a membership inference attack against this system. They proposed a DP-based solution in their MBeacon [34] system. The approach retains utility by adjusting the noise level for high-risk methylation regions that might leak phenotypic information (i.e., regions which are related to disease).

Contribution of this paper. In this paper, we identify and analyze a genome reconstruction attack against genomic data-sharing beacons by particularly exploiting the information leaked due to beacon updates and the correlations between the point mutations. So far, all works in the literature have focused on membership inference attacks against genomic data-sharing beacons. To the best of our knowledge, this is the first work that identifies, thoroughly analyzes, and shows the consequences of the genome reconstruction attack against the beacons. Furthermore, as opposed to existing work (that only considers a snapshot of the beacon), we show the privacy risk in dynamic beacons, in which new donors may join or existing donors may leave.

3 Genomics Background

Approximately 99.9% of all individuals' DNA is identical and the remaining 0.1% is responsible for our differences. Single nucleotide polymorphism (SNP) is the most common source of variation in the human genome. A SNP is a point mutation (e.g., substitution of a single nucleotide in the genome: A, T, C, or G) and there are around 50 million known SNPs in the human genome [3]. The alternative nucleotides for each locus (SNP position) are called alleles and each allele of a SNP can be either the major or the minor allele for that SNP. The major allele is the most frequently observed nucleotide for a SNP position and the minor allele is the rare nucleotide (i.e., the second most common). The frequency (or probability) of observing the minor allele at a SNP position is called the minor allele frequency (MAF) of that SNP. The human genome has two copies for each locus (one per chromosome) and a SNP can be represented in terms of the number of its minor alleles (i.e., 0 for homozygous major, 1 for heterozygous, or 2 for homozygous minor).

Particular SNPs in the human population are inherently correlated and this correlation model may change for different populations. Linkage disequilibrium (LD) is the non-random association of alleles at two or more loci. If two SNPs are in LD, they are correlated and co-occur more frequently than expected. Some SNPs are pathogenic and cause genetic diseases [6] and hence, they may carry sensitive information regarding individuals' health conditions. As discussed in Section 2, most existing works in the genomic privacy literature focus on the protection of the SNPs to prevent the risk of genetic discrimination.

4 System Model

As shown in Figure 1, we consider a system between the beacon participants (e.g., donors), the beacon, and the beacon users (which may include the attacker). The donor shares their genome with the beacon. It is possible that the donor may share their genome with multiple beacons that may or may not be associated with sensitive traits. The genome donor is not active during the protocol after they share their data with the beacon. Also, the beacon never publicly shares its dataset, but some beacons may share metadata about (i) their content (e.g., size) or (ii) their donors (e.g., their gender, age, or ethnicity). In general, we consider the beacon as a dynamic dataset, in which new donors may join and existing donors may leave over time. Beacon users issue queries to the beacon. As discussed, the beacon user can only ask about the presence of a genome with a particular allele (nucleotide) at a particular position of a given chromosome and the beacon only responds "yes" or "no". In this work, we assume the beacon honestly reports the result of each query to the user (e.g., without introducing intentional noise to the query results) and we do not consider a query limit for the users, as it is usually trivial to overcome such limits (e.g., by registering several times with different accounts).

Fig. 1. Proposed system model. (1) At time t, the attacker takes the snapshot and metadata of the beacon (n participants). (2) Including the victim, m donors join the beacon. (3) At time t + δ, the attacker takes the snapshot and metadata of the beacon again (n + m participants). (4) The attacker reconstructs $m'$ partial genomes. (5) The attacker identifies the genome of the victim using auxiliary information. (6) The attacker may initiate a membership inference attack against the victim on another beacon (with a sensitive phenotype, e.g., HIV+).

5 Threat Model

Depending on the attacker's objective, two attacks that can be launched against genomic data-sharing beacons are: (i) the membership inference attack and (ii) the genome reconstruction attack. In both attacks (including this work), the attacker is assumed to be a registered beacon user who can send an unlimited number of queries to the beacon. In this work, for the first time, we identify and study the genome reconstruction attack. We assume that the attacker knows the membership of an individual to a beacon. Thus, we consider a targeted attack, in which the attacker knows that the victim donated their genome (to take part in a study). Given the current rise in personal genomics (people uploading their genomes to public sites), this is feasible. Also, beacons with no sensitive phenotype report metadata about their donors. For instance, Shringarpure and Bustamante [65] verified a specific person being in the PGP and Kaviar [31] beacons via metadata, and hence the attacker can also identify the membership of the victim using such metadata. Using the membership information, the goal of the attacker is to reconstruct the victim's genome by issuing queries to the corresponding beacon.

The genome inference attack can be considered for both static and dynamic beacons. In static beacons, knowing that the victim is a member of the beacon, only the "no" responses would provide certain information about the victim's genome to the attacker. "Yes" responses may be due to any other participant of the beacon and as the size of the beacon increases, "yes" responses do not provide much information to the attacker. However, in dynamic beacons, when the beacon is updated, using the change in the responses of the beacon, the attacker can learn more about the genomes of new participants. Thus, in this paper, we analyze this vulnerability for dynamic beacons and we assume that the victim is added between times t and t + δ along with (m − 1) other newly added donors to the beacon. As discussed before, the attacker can monitor the number of newly added donors to the beacon and the number of donors leaving the beacon from the metadata of the beacon.

We assume that, along with the fact that the victim is among the newly joined participants to the beacon, the attacker also knows (i) the number of other newly joined individuals that are added to the beacon along with the victim; (ii) a snapshot of the beacon before the victim is added (at time t), that is, the responses to all queries before the victim joins the beacon. The beacon protocol does not bar someone from taking a complete snapshot. Thus, querying a beacon to take a complete snapshot only requires a high-bandwidth internet connection. The economic cost of such an internet service is around 79 dollars per month [70] and there is no other economic cost, as the system is publicly available at [2]. Even though the number of SNPs in a complete snapshot is large, typically, only low-MAF SNPs are useful for the attacker (as they are typically the sensitive ones); (iii) auxiliary information about the victim to identify the victim's genome among the reconstructed ones. For this, we assume the attacker has moderate information, such as a set of the victim's visible characteristics (phenotype); and (iv) publicly available information about genomics, such as minor allele frequencies (MAF values) of SNPs and correlations between the SNPs in the population of interest. Finally, we assume that the attacker does not collude with the beacon.

In the genome reconstruction attack, due to the nature of beacon responses, the attacker can infer whether a victim has at least one minor allele at each SNP position. This is because the response of the beacon only tells if there is an individual in the beacon with at least one minor allele at a given SNP position. Thus, for each SNP j of victim v ($S_j^v$), the goal of the attacker is to infer $\Pr(S_j^v = 0)$ and $\Pr(S_j^v \neq 0)$ (i.e., $\Pr(S_j^v = 1)$ or $\Pr(S_j^v = 2)$). For simplicity, we define the event $\hat{S}_j^v = \mathbb{1}_{S_j^v = 1 \vee S_j^v = 2}$. Thus, $\hat{S}_j^v = 0$ if $S_j^v = 0$, and $\hat{S}_j^v = 1$ otherwise. Note that inferring this information for a victim results in a serious privacy concern. As we will discuss and show later, using this information, an attacker can associate the genotype of the victim to related phenotypes (e.g., diseases) and initiate a membership inference attack for the victim by targeting another beacon that is associated with a sensitive phenotype (e.g., cancer or HIV+).

Our methodology consists of a complete pipeline, showing how an attacker uses the information it infers in the genome reconstruction attack in a subsequent membership inference attack. Therefore, we evaluate the success of the attacker using different metrics in different parts of the pipeline as follows. For genome reconstruction (in Section 6.3), we use precision and recall to quantify the inference power of the attacker. As we will show in Section 7, the success of genome reconstruction mainly depends on the size of the beacon, the number of newly added donors to the beacon between times t and t + δ, and the fraction of the attacker's snapshot at time t. In real life, sizes of beacons show a large variation. The size of a beacon can be as small as 100, such as the NBDC Human Database [4], or as large as 100K, such as the Genome Aggregation Database (gnomAD) [5]. As discussed, these numbers can be monitored from the metadata of such beacons. Thus, as we will show, for small-size beacons, even if the size of the beacon is significantly increased (compared to its original size), the attacker's success may be high. For large-size beacons, on the other hand, the number of newly added donors should be a small fraction of the original size for a successful attack. As a result of the genome reconstruction, the attacker potentially reconstructs multiple genomes and among these, one belongs to the victim. For this part, we show how the attacker can utilize machine learning techniques to identify the victim's genome among the reconstructed ones (in Section 6.4) and we use the classification accuracy of the attacker as its success metric. Finally, to quantify the success of the membership inference (Section 6.5), we use a power analysis as the success metric. To evaluate the success of the attacker in the membership inference attack, we first let the attacker run the genome reconstruction attack and then use the proposed machine learning technique to identify the victim's genome among the reconstructed ones. Thus, the success metric for the membership inference considers the attacker's success in the entire pipeline.

6 Genome Reconstruction Attack on Genomic Data-Sharing Beacons

As discussed, we define the genome reconstruction attack as inferring genomic data of a genome donor (i.e., victim) given their membership information to the beacon. To show the effect of the genome reconstruction attack more clearly, we consider dynamic beacons and we assume the victim is among the newly joined donors to the beacon. For clarity of the discussion, we present the identified attack only considering newly joined donors. Considering the donors that leave the beacon is symmetrical and trivial. We discuss this case in Section 8.2.

We consider a scenario in which the attacker has no information about the victim's genome, but it knows that the victim is added to the beacon between times t and t + δ. Let n and (n + m) represent the number of individuals in the beacon at times t and t + δ, respectively. As discussed, for most real-life beacons, the attacker knows m (by monitoring the changes in the beacon using the metadata of the beacon). In all attack scenarios, we assume that the attacker reconstructs $m'$ genomes ($m'$ can be different from m and the selection of $m'$ affects the precision and recall of the attacker). Our goal is to evaluate the performance for different $m'$ values to show the attack is robust even if the attacker does not know how many people are added. When the metadata of the beacon, and hence m, is not available, the attacker can determine a potential upper bound (k) for the number of newly added donors (m) by examining the number of flipped responses (from "no" to "yes"). Then, for each i from 1 to k, it can reconstruct genomes using $R_{N \to Y}$ assuming m = i, and hence instead of m, the attacker ends up having $k(k+1)/2$ potential genomes from which to identify the victim's best matching reconstructed genome.

Using its auxiliary information (as discussed in Section 5), the attacker can probabilistically infer the genome of the victim by utilizing the changes in the beacon's responses (at times t and t + δ) as follows: (i) if the previous response (at time t) was "no" and the current response (at time t + δ) is "yes", the probability that the victim has a minor allele at the corresponding query position increases depending on how many new individuals are added to the beacon in this time interval; (ii) if the previous response was "yes" and the current response is also "yes", the attacker cannot infer much about the victim's genome, especially if the total size of the beacon is large; and (iii) if both the previous and the current responses are "no", the attacker understands that the victim does not have a minor allele at the corresponding query position.

Here, the most important (or the most sensitive) information for the attacker can be considered as the "no" responses at time t that turn to "yes" at time t + δ. This is because such responses let the attacker infer the positions at which the victim has at least one minor allele with a high probability (depending on how many new individuals are added to the beacon in this time interval). Since minor alleles of individuals are typically the indicators for privacy-sensitive information about them, in this work, we focus on the success of the attacker based on its success in inferring the minor alleles of a victim using the beacon responses that turn to "yes". Exhaustively generating all potential solutions of this problem would
result in a total of $2^{\beta \ast m'}$ genomes, where β is the total number of responses that turn to “yes” at time t + δ (which can be on the order of tens of thousands), and hence it is intractable. In the following, we first describe a baseline method that provides a tractable solution to this problem. Next, we present a greedy approach to run such an attack more accurately, and then we will detail a more sophisticated, clustering-based approach for the genome reconstruction attack.

6.1 Baseline Approach for Genome Reconstruction

Here, we describe a baseline approach, in which the attacker, using the responses of the beacon, reconstructs the genomes (of the newly joined donors) by assigning them to $m'$ bins according to MAF values of the SNPs. Genome reconstruction attack using the baseline algorithm for a particular victim v at time t + δ can be described as follows. The input of the attacker is (i) responses of the beacon to all possible queries at time t (i.e., complete snapshot of the beacon at time t); (ii) the fact that m new donors are added to the beacon between times t and t + δ; (iii) the fact that the victim is among the newly added donors; and (iv) publicly available MAF values of the SNPs.

First, the attacker identifies the set of SNPs for which the response of the beacon was “no” at time t and it becomes “yes” at time t + δ. Thus, the attacker constructs a set $R_{N \to Y}$, consisting of these SNPs. Then, the attacker creates $m'$ empty bins representing SNP sets of newcomer donors. For each SNP j in set $R_{N \to Y}$, the attacker retrieves its MAF value, $\mathrm{MAF}_j$. Next, the attacker assigns the value of SNP j for each individual i (in each bin) consistent with the SNP’s MAF value as follows: (i) $\hat{S}_j^i = 0$ with probability $(1 - \mathrm{MAF}_j)^2$ and (ii) $\hat{S}_j^i = 1$ with probability $\mathrm{MAF}_j^2 + 2\,\mathrm{MAF}_j (1 - \mathrm{MAF}_j)$. Since the beacon’s response for SNPs in $R_{N \to Y}$ has flipped from “no” to “yes”, for all SNPs in $R_{N \to Y}$, there should be at least one bin (among $m'$ bins) with at least one mutation (i.e., homozygous minor or heterozygous SNP). Thus, once the values of the SNPs in $R_{N \to Y}$ for all $m'$ bins are determined, the attacker checks if there is any SNP in set $R_{N \to Y}$ that is not assigned to any bin. If there is such a SNP, the attacker randomly picks a bin and assigns the value of the corresponding SNP as $\hat{S}_j^i = 1$ for the corresponding bin. The details of this baseline approach are also shown in Algorithm 2 (in Appendix B).

6.2 Greedy Algorithm for Genome Reconstruction

The above-mentioned baseline algorithm assumes every SNP is independent and the correlations among them are disregarded. However, SNPs are inherently correlated and considering such correlations in the genome reconstruction attack may result in significantly more accurate results. In the greedy algorithm discussed here, the attacker forms the bins considering the correlations between the SNPs in set $R_{N \to Y}$. Using an iterative approach, the attacker assigns each SNP (minor allele) to an individual such that the probability of assignment is proportional to the average correlation of the new SNP with the already assigned SNPs of the individual (i.e., bin i). If no assignment is made this way, a random individual is selected to make sure there is at least one person with the corresponding new SNP.

Genome reconstruction attack using the greedy algorithm for a particular victim v at time t + δ can be described as follows. The input of the attacker includes everything in the baseline approach and also a correlation model between the SNPs that is consistent with the population structure of the beacon (that can be computed using publicly available genomic datasets). Different correlation models have been explored for genomic data before. In [62], authors showed how the correlations in the genome can be modelled using a Markov chain model. We create our correlation model by considering the pairwise correlations between all the SNPs in the beacon (which results in richer information for the attacker). The attacker calculates the likelihood of the victim v having at least one minor allele at a SNP position j as $P_k(\hat{S}_j^v) = P(\hat{S}_j^v \mid \hat{S}_k^v)$, where k may be any other position in the genome. We use Sokal-Michener distance to compute correlations between SNPs as follows:

$$A = 2\left(n_{\hat{S}_j^v = 1,\, \hat{S}_k^v = 0} + n_{\hat{S}_j^v = 0,\, \hat{S}_k^v = 1}\right)$$

$$B = n_{\hat{S}_j^v = 1,\, \hat{S}_k^v = 1} + n_{\hat{S}_j^v = 0,\, \hat{S}_k^v = 0}$$

$$D_{\text{Sokal-Michener}}(\hat{S}_j^v, \hat{S}_k^v) = \frac{A}{A + B}$$

In the greedy approach, first, the attacker constructs set $R_{N \to Y}$. Then, it creates $m'$ empty bins ($m'$ does not have to be equal to m) representing the number of rare SNPs in $R_{N \to Y}$. We assume that the SNPs with an MAF value below a threshold τ are categorized as rare SNPs. Observing that rare SNPs do not have correlations among each other, assigning the rare SNPs in $R_{N \to Y}$ to different bins as seeds is assumed to result in an accurate initial separation of individuals. Next, for each remaining SNP j in $R_{N \to Y}$, the attacker computes
the average correlation between that and all the previously assigned SNPs in bin i using the aforementioned correlation model. This is done for each bin i. Let $\hat{S}_j^i$ be a binary random variable for SNP j and bin i. The attacker assigns $\hat{S}_j^c = 1$ for bin c which has the highest average correlation value and $\hat{S}_j^i = 0$, $\forall i \in [1, m']$ and $i \neq c$. Eventually, the attacker constructs $m'$ potential genomes (in $m'$ bins) belonging to m newcomer donors.

6.3 Clustering-Based Algorithm for Genome Reconstruction

Greedy algorithm (in Section 6.2) reconstructs genomes by following a particular order (determined based on the MAFs of the SNPs). Different orders may provide different solutions. Thus, to consider all query responses together in a collective way, we propose clustering-based approaches for the genome reconstruction attack that cluster the identified minor alleles for the newly joined donors to the beacon. The proposed clustering techniques essentially use the correlations between the SNPs (that are computed using the aforementioned correlation model) to distribute SNPs into different bins. We use two types of clustering techniques: (i) hard clustering to create non-overlapping bins and (ii) soft or fuzzy clustering to assign a SNP into multiple bins.

For (i), we employ spectral clustering, in which a standard clustering method (such as k-means clustering) is applied on certain eigenvectors of the Laplacian matrix of a graph [57]. In this graph, the SNPs correspond to vertices and correlations between the SNPs correspond to weights of edges. Spectral clustering is our method of choice as it has been shown to provide favorable results in many high dimensional feature spaces like ours [60]. And, for (ii) we employ the fuzzy c-means clustering (FCM) algorithm [14], which is a common choice for these types of tasks. The algorithm is similar to k-means clustering, but it also allows probabilistic assignments of samples to multiple clusters. Different from k-means clustering, FCM assigns a membership value $u_{ij} = P(\hat{S}_j^i = 1)$ for each element j and for each cluster i. These membership values are used as weights in the objective function. After convergence, these membership values are used as the probability of assignments of elements to each cluster. The description of both clustering methods are similar except for the clustering steps. Thus, in the following, we describe both methods together.

The input of both clustering-based algorithms is the same as the input of the greedy algorithm. First, the attacker identifies the set of SNP positions for which the response of the beacon was “no” at time t and it becomes “yes” at time t + δ and constructs set $R_{N \to Y}$. Then, the attacker builds a graph of SNPs using the correlation model, in which the vertices are the SNPs in $R_{N \to Y}$ and undirected edges are weighted by the correlation values between these SNPs. This graph represents a pairwise similarity model for the SNPs and is used for a quantitative assessment of the correlation of each SNP pair in $R_{N \to Y}$.

Next, the attacker applies either the spectral or fuzzy clustering algorithms on the constructed graph. The outcome of spectral clustering is a set of disjoint clusters. Fuzzy clustering results in groups of SNPs that maximizes the similarity in a group while allowing a SNP to be shared by multiple individuals. Thus, in fuzzy clustering, each SNP i is assigned to clusters for which the algorithm returns a relatively high probability of association. After clustering, the attacker obtains $m'$ different clusters which corresponds to $m'$ reconstructed genomes. The details are shown in Algorithm 1.

Algorithm 1: Clustering-Based Algorithm for Genome Reconstruction Attack

Input: b: beacon; m: Number of added people to b; Population P that represents the composition in b
Output: $m'$ reconstructed genomes

    // Step 1: Query Beacon
    snapshot1 ← queryBeacon(b, t)
    // Including victim, m donors join Beacon between time t and t + δ
    snapshot2 ← queryBeacon(b, t + δ)

    // Step 2: Obtain No-Yes SNPs
    NoYesResponses ← []
    for i ← 0 to snapshot1.length − 1 do
        if snapshot1[i] = "No" and snapshot2[i] = "Yes" then
            NoYesResponses.append(i)
        end
    end

    // Step 3: Cluster No-Yes SNPs
    G ← Graph()
    for i ← 0 to NoYesResponses.length − 1 do
        for j ← i + 1 to NoYesResponses.length − 1 do
            ri ← NoYesResponses[i]
            rj ← NoYesResponses[j]
            c ← corr(P, ri, rj)
            G.addEdge(ri, rj, c)
        end
    end
    clusters ← graphClustering(G, m')

    // Step 4: Reconstruct genomes
    S ← []
    for i ← 0 to m' − 1 do
        S[i] ← getReferenceGenome(P)
        foreach s in clusters[i] do
            S[i][s] ← getMinorAllele(P, s)
        end
    end
    return S

6.4 Identifying the Victim Using Genotype-Phenotype Associations

In previous sections, for genome reconstruction, we assumed that the attacker can correctly identify the victim’s genome among several reconstructed bins. Assuming the attacker has some moderate auxiliary information about the victim, here, we study and show how accurately the attacker can identify the victim’s genome among other candidates. For this, we assume the attacker uses information about some phenotypic characteristics of the victim and it relies upon the fact that SNPs are intrinsically linked to phenotypic traits (such as eye color, hair color, etc.). This provides a complete methodology for the genome reconstruction attack against beacons in real-life. As we will discuss later, the success of the attacker to correctly identify the victim’s genome among the reconstructed ones increases if the attacker has access to more auxiliary information about the victim.

Assume victim v is among the m new additions to the beacon (it is trivial to extend the methodology if there are more than one victim). The attacker is assumed to have access to two distinct sets: (i) a set $S = \{\vec{S}_1, \vec{S}_2, \ldots, \vec{S}_{m'}\}$ of $m'$ reconstructed genotypes as a result of the genome reconstruction attack, where $\vec{S}_i = (\hat{S}_1^i, \ldots, \hat{S}_k^i)$ is a vector containing the SNP values of genotype i (or bin i); and (ii) a set $P_v = (p_1^v, \ldots, p_t^v)$ containing the values of t phenotypic traits of victim v. Such phenotype information can be obtained from publicly available resources or using the physical traits of the victim. For instance, the attacker can obtain such information from victim’s social media accounts. The goal of the attacker is to correctly match the victim’s phenotype to the correct reconstructed genome (that is the most similar to the victim’s) among all candidate reconstructed genome sequences. In the test phase, the attacker has m newly added donors and $m'$ reconstructed genomes. Attacker’s task is to match each donor with the best matching reconstructed genome. Thus, for each newly added donor, the attacker calculates the likelihood scores of matching with all $m'$ reconstructed genomes.

In [40], Humbert et al. focused on the deanonymization risk and modelled genotype-phenotype association as an assignment problem. They showed this risk by using the Hungarian algorithm [47]. Different from [40], here, we rely on machine learning for maximizing the matching likelihood and genotype-phenotype associations. We observe that such a formulation provides more accurate results. Also, rather than using SNP values (0, 1 or 2), due to the nature of the proposed attack, we represent the state of each SNP j of individual i as $\hat{S}_j^i$, which can be either 0 or 1, as discussed before.

For phenotype inference, we train a separate model for each of the considered phenotypes, where SNPs with flipped responses (from “no” to “yes”) are used as features. Since phenotype datasets are highly imbalanced, we apply Synthetic Minority Oversampling Technique (SMOTE) [16] for each of these datasets to resolve this problem. In SMOTE, a minority class instance is selected along with its nearest neighbors at random. Then, a new sample is generated as a combination of the original instance and a random neighbor. Next, we train a random forest model for each phenotype. We use repeated stratified 5-fold cross validation to tune the hyperparameters. After training the phenotype models, we form the ensemble classifier using the ones that have better validation F1-macro score than random guess. We discard the other models.

Ensemble classifier calculates the matching likelihood of given genome and set of phenotypic traits. Softmax output of each phenotype model corresponding to a given phenotypic trait of the victim (e.g., the probability that a reconstructed genome has blue eyes) are summed to calculate the matching likelihood. For single victim, this calculation is done for each reconstructed genome and the victim is matched with the reconstructed genome with the highest matching likelihood score. Note that this matching does not need to be one-to-one; a single reconstructed genome might match with different set of phenotypic traits. We discuss the performance of identification of victim’s reconstructed genome under different settings in Section 7.4.

6.5 Using Genome Reconstruction in Membership Inference Attack

To show one consequence of the proposed genome reconstruction attack, we also model and analyze how the proposed attack can be utilized for membership inference attack (introduced in Appendix A). We consider a scenario in which the attacker knows the membership of an individual to a beacon with no associated sensitive phenotype (i.e., a phenotype-neutral beacon). The attacker first utilizes the responses of this beacon to infer specific parts of a victim’s genome (i.e., SNPs). Then, it uses these inferred SNPs to infer the membership of the victim to a beacon with a sensitive phenotype. This attack is important and realistic, because knowing the membership of an individual to a phenotype neutral beacon (e.g., Kaviar Beacon) may not seem to pose a privacy issue. However, using the proposed genome reconstruction attack and the membership information of the victim to the beacon with non-sensitive phenotype, the attacker can first infer the SNPs of the victim and then, infer the membership of the victim to another beacon which is associated with a sensitive phenotype (e.g., SFARI beacon which is associated with autism phenotype).

To show this, first, we run the proposed genome reconstruction attack that is explained in Section 6.3 and infer the SNPs of the victim with at least one minor allele on a beacon $B_1$. Using these inferred SNPs, we then run the membership inference attack to infer the membership of the victim in another beacon $B_2$. For membership inference attack, we use the optimal attack in [59] (described in Appendix A), which is shown to be an effective attack for membership inference (for our scenario, optimal attack in [59] and the QI-attack in [73] perform similarly, so we choose to use the optimal attack due to its simplicity). However, in contrast to the original optimal attack, in the null and alternate hypothesis equations in (1) and (2), there is an additional error due to the inference error of the genome reconstruction attack. This is because the attacker queries the alleles of the victim that it infers as a result of the genome reconstruction attack and there is a degree of uncertainty. Thus, we first experimentally compute the error rate of the genome reconstruction attack for a particular scenario (e.g., for particular m and n values). We then include this additional error on the γ parameter in (2), which represents the probability that the attacker’s copy of the victim’s genome does not match the beacon’s copy for a SNP. Furthermore, as opposed to original optimal attack, here the attacker may not have access to the SNPs of the victim with the lowest MAF values; instead the attacker only knows the SNPs that are inferred as a result of the genome reconstruction attack.

We evaluate the success of this attack in terms of the power of the attacker in Section 7.5. Similar to Raisaro et al. and von Thenen et al., we plot the power curve of the membership inference attack at 5% false positive rate. We empirically build the null hypothesis ($H_0$ in Appendix A). For every query, we determine the distribution of Λ under the null hypothesis using 20 individuals that are not in $B_2$. In this work, in order to model the uncertainty of correctly matching the victim (using phenotype inference as in Section 6.4), we first experimentally compute the error rate of the overall process. For instance, if the accuracy of correctly matching the phenotype of the victim to their reconstructed genome is p%, then p% of the 20 individuals are selected from correctly identified reconstructions and remaining individuals are selected from other new people added to the beacon along with the victim (incorrect identifications).

When Λ is less than a threshold $t_\alpha$, the null hypothesis is rejected and we find $t_\alpha$ from the null hypothesis with α = 0.05 (corresponding to 5% false positive rate). Then, we computed the power as proportion of the individuals in the alternate hypothesis (including 20 different individuals in $B_2$) having a Λ value that is less than $t_\alpha$. As before, p% of the 20 individuals are selected from correctly identified reconstructions and remaining people are selected from other new people added to the beacon along with the victim.

7 Evaluation

To evaluate the identified vulnerabilities, we evaluated our methods using real-life genomic datasets. Here, we describe the datasets and present the evaluation results.

7.1 Datasets

We used two different genome datasets for evaluation: (i) genome dataset of CEU population from the HapMap dataset [29] and (ii) OpenSNP genome dataset [7]. Using the HapMap dataset, we created the beacons and victims from CEU population which contains 164 donors and around 4 million SNPs for each donor. We created the correlation model (i.e., SNP-SNP relation network or similarity model) for this beacon using individuals from the same HapMap dataset that are not in the constructed beacon and set of victims. Using the OpenSNP dataset, we created the beacons and victims from a random population which contains 2980 donors and around 2 million SNPs for each donor. We created the correlation model using the rest of the OpenSNP dataset.

For the OpenSNP dataset, we also collected the reported phenotypes of individuals. Since sample sizes are small, we used the reported phenotypes in a binary form. From OpenSNP, we used the following commonly reported phenotypes: (i) eye color, 967 samples, (ii) hair type, 371 samples, (iii) hair color, 468 samples, (iv) tan ability, 287 samples, (v) asthma, 226 samples, (vi) lactose intolerance, 347 samples, (vii) earwax, 244
samples, (viii) tongue rolling, 434 samples, (ix) intol-

the ith query is calculated from given set of l case people as P i = ( Λi ∈Λi 1Λi
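The power calculation outlined in Section 6.5 (reject the null hypothesis when Λ falls below $t_\alpha$, with $t_\alpha$ chosen from the null distribution at α = 0.05) can be sketched roughly as follows. This is only an illustration: the function names, the order-statistic rule used to pick $t_\alpha$, and the toy Λ values are assumptions, not the authors' implementation or data.

```python
import math

def threshold_at_alpha(null_lambdas, alpha=0.05):
    """Pick t_alpha so that about an alpha fraction of null statistics lie strictly below it.

    Assumption: t_alpha is read off the empirical null distribution as an
    order statistic; the text only states that t_alpha is found from the
    null hypothesis with alpha = 0.05.
    """
    ordered = sorted(null_lambdas)
    k = min(int(math.floor(alpha * len(ordered))), len(ordered) - 1)
    return ordered[k]

def power(case_lambdas, t_alpha):
    """Fraction of case individuals (alternate hypothesis) with Lambda < t_alpha."""
    rejected = sum(1 for lam in case_lambdas if lam < t_alpha)
    return rejected / len(case_lambdas)

# Toy, made-up statistics: 20 null individuals and 5 case individuals.
null_lambdas = [float(x) for x in range(-19, 1)]   # -19.0, ..., 0.0
case_lambdas = [-25.0, -20.0, -18.0, -3.0, 0.0]

t_alpha = threshold_at_alpha(null_lambdas)   # second-smallest null value
attack_power = power(case_lambdas, t_alpha)  # share of cases below t_alpha
```

With 20 null individuals, this rule leaves exactly one null statistic below the threshold, which corresponds to the 5% false positive rate used in the text.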
Fig. 2. Precision and recall for the genome reconstruction of a newly added donor to OpenSNP beacon with varying number of newly added donors.

Section 6.1) performs substantially worse than the proposed clustering-based approach. The results also show that spectral clustering-based genome reconstruction is slightly better than the fuzzy clustering-based approach. We observed that allowing a SNP (that includes at least one minor allele) to be in multiple bins results in high false positives. Therefore, in the remainder of this section, we use spectral clustering-based genome reconstruction for the evaluations.

To show the benefit of utilizing a beacon (and beacon update) in its genome reconstruction attack, we also computed the reconstruction accuracy of an attacker when it only uses publicly available information (e.g., population statistics and victim’s phenotype). As discussed, each victim we consider has a subset of 21 phenotypes listed in Section 7.1. Using the associations of victim’s phenotypes with the corresponding SNPs (extracted from SNPedia [8]), we assigned some SNP values of the victim. We observed that, on the average, such a reconstruction achieves a precision of 18% and a recall of 47% on total of 232 SNPs. Therefore, we conclude that having access to a beacon and knowing the membership of a victim to a beacon significantly increases the success of the genome reconstruction attack.

To show the effect of varying number of bins ($m'$) in the genome reconstruction attack, in Figures 3 and 10 (in Appendix C), we show the attacker’s success when the number of newly added donors m = 5 and beacon size n = 50 for OpenSNP and HapMap beacons, respectively. We observed that for both beacons, precision increases and recall decreases with increasing $m'$. Also, as expected, precision and recall become balanced when $m' = m$.

Next, in Figures 4 and 11 (in Appendix C), we show the effect of the beacon size (n) at time t when 5 new donors are added between times t and t + δ for OpenSNP and HapMap beacons, respectively. Here, we assume that the number of bins ($m'$) is equal to the number of newly added donors (m). We observed that as the size of the beacon increases, both the precision and recall of the reconstruction attack almost remain the same (for a fixed number of newly added donors).

Even if the success of the genome reconstruction remains high, the number of flipped responses (from “no” to “yes”) may decrease when beacon size is increased (as shown in Figure 4). In other words, the number of vulnerable SNPs (the ones that can be inferred using the change in the beacon responses) of a victim decreases and this might result in lower performance in phenotype inference and membership inference parts of the attack. However, with high probability, as the beacon size increases, low-MAF SNPs of the victim (which typically provide the most valuable information for the membership inference attack) still remain vulnerable, since with high probability, such SNPs are not observed in other donors in the beacon. For example, in the previous experiment (in Figure 4), when the size of the beacon is increased from 50 to 400, total number of vulnerable SNPs of a victim reduces by 94%, however, number of vulnerable SNPs of a victim with MAF value smaller than 0.01 only reduces by 52%.

Keeping the ratio of newly added donors fixed (to 5%), we also observed the change in the success of the attack with increasing beacon size when $m' = m$ in Figure 5 (we did this evaluation only for the OpenSNP beacon since HapMap beacon did not have more than 100 donors). We observed that, when the beacon size increases beyond 100, although the recall of the attacker still remains high, its precision starts decreasing. This shows that the success of the identified attack mainly relies on the number of clusters the attacker needs to generate (in the proposed clustering-based algorithm). For small or mid-size beacons (e.g., NBDC Human Database [4] with slightly more than 100 individuals), even if the beacon update significantly increases beacon’s size, the identified attack is still effective. On the other hand, for large size beacons (e.g., gnomAD [5], with more than 100K individuals), the update size should be small to have a vulnerability.

Fig. 3. Precision and recall for the genome reconstruction of a newly added donor to OpenSNP beacon with varying number of bins/clusters ($m'$) in the genome reconstruction attack. Number of newly added donors (m) is 5.

Fig. 4. Precision and recall for the genome reconstruction of a newly added donor to OpenSNP beacon with varying number of beacon size (n). Number of newly added donors m is 5 and $m' = m$ for all plots.

Finally, we explored the scenario, in which the attacker only has a partial snapshot of the beacon (instead of a full snapshot). In Figure 6, we show the success of the reconstruction attack when m = 5 donors are added (at time t + δ) into the OpenSNP beacon with size 50 when the attacker has varying snapshots of the beacon at time t and when $m = m'$. We observed that the success (precision and recall) of reconstruction does not change with varying snapshots. However, the number of inferred SNPs (as a result of the genome reconstruction attack) decreases linearly with the decreasing snapshot that is known by the attacker at time t.

7.4 Identifying the Victim’s Genome Using Phenotype Inference

Here, we evaluate the success of the attacker in identifying the reconstructed genome of the victim among all reconstructed genomes using the algorithm in Section 6.4. Since HapMap dataset does not include phenotype information about the genome donors, we only use the OpenSNP beacon for this evaluation.

We employed and compared several machine learning models for genotype-phenotype associations, including: Logistic Regression [23], SVM [22], Multi-layer Perceptron [72], Random Forest [67], and XGBoost [17]. Among these, we obtained the highest classifier accuracy with the Random Forest, and hence all reported results are based on this model.

In Figure 7, we show the ensemble classifier accuracy for varying number of newly added donors to the beacon (here, we assumed $m' = m$ and we observed similar patterns when $m' \neq m$ as well). We used the original genomes of individuals in the training dataset when building the model. For test, we used reconstructed genomes of the victims (that may have noise due to reconstruction error). Beacon size is 50 in these experiments (i.e., n = 50).

We observed that the proposed algorithm provides 70% accuracy when the size of the beacon is increased by adding 2 individuals in the update, and the accuracy slightly decreases with increasing number of newly added donors. These results show that the attacker can identify the reconstructed genome of the victim among all $m'$ reconstructed genomes with high accuracy. As discussed before, in this experiment, we assumed the attacker has moderate auxiliary knowledge about the victim (i.e., phenotypic-traits, which can be easily learnt from social network profiles of the victim). However, since genotype-phenotype associations are not strong yet, there is an accuracy bottleneck in the overall process due to this step. A stronger attacker (that has access to richer auxiliary information about the victim)
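The matching step of Section 6.4, in which the per-phenotype model outputs for the victim's known traits are summed and the highest-scoring reconstructed genome is selected, can be sketched as follows. This is a minimal illustration under stated assumptions: the two toy "models" are hand-written stand-ins for trained random forests, the SNP identifiers `snp_a` and `snp_b` are hypothetical, and all probabilities are made up.

```python
def matching_likelihood(genome, victim_traits, models):
    """Sum, over phenotypes, the probability each model assigns to the victim's trait value."""
    return sum(
        models[name](genome).get(value, 0.0)
        for name, value in victim_traits.items()
        if name in models
    )

def identify_victim(genomes, victim_traits, models):
    """Return the index of the best-matching reconstructed genome and all scores."""
    scores = [matching_likelihood(g, victim_traits, models) for g in genomes]
    best = max(range(len(genomes)), key=scores.__getitem__)
    return best, scores

# Hypothetical stand-ins for trained phenotype models: each maps a genome
# (here, the set of SNPs inferred to carry a minor allele) to class probabilities.
def eye_color_model(genome):
    p_blue = 0.8 if "snp_a" in genome else 0.3
    return {"blue": p_blue, "brown": 1.0 - p_blue}

def hair_type_model(genome):
    p_curly = 0.7 if "snp_b" in genome else 0.4
    return {"curly": p_curly, "straight": 1.0 - p_curly}

models = {"eye_color": eye_color_model, "hair_type": hair_type_model}
reconstructed = [{"snp_a"}, {"snp_b"}, {"snp_a", "snp_b"}]  # m' = 3 toy bins
victim_traits = {"eye_color": "blue", "hair_type": "curly"}

best, scores = identify_victim(reconstructed, victim_traits, models)
```

In this toy setting the third bin carries both trait-associated SNPs and therefore receives the highest summed score. Traits without a retained model are skipped, mirroring the text's rule of discarding models that do not beat a random guess.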