Privacy-Preserving Machine Learning: Methods, Challenges and Directions
Runhua Xu1∗, Nathalie Baracaldo1 and James Joshi2†
1 IBM Research - Almaden Research Center, San Jose, CA, United States, 95120
2 School of Computing and Information, University of Pittsburgh, Pittsburgh, PA, United States, 15260
runhua@ibm.com, baracald@us.ibm.com, jjoshi@pitt.edu

arXiv:2108.04417v1 [cs.LG] 10 Aug 2021

Abstract

Machine learning (ML) is increasingly being adopted in a wide variety of application domains. Usually, a well-performing ML model, especially an emerging deep neural network model, relies on a large volume of training data and high-powered computational resources. The need for a vast volume of available data raises serious privacy concerns because of the risk of leakage of highly privacy-sensitive information and the evolving regulatory environments that increasingly restrict access to and use of privacy-sensitive data. Furthermore, a trained ML model may also be vulnerable to adversarial attacks such as membership/property inference attacks and model inversion attacks. Hence, well-designed privacy-preserving ML (PPML) solutions are crucial and have attracted increasing research interest from academia and industry. An increasing number of PPML efforts integrate privacy-preserving techniques into ML algorithms, fuse privacy-preserving approaches into the ML pipeline, or design various privacy-preserving architectures for existing ML systems. In particular, existing PPML work cross-cuts ML, systems, security, and privacy; hence, there is a critical need to understand state-of-the-art studies, related challenges, and a roadmap for future research. This paper systematically reviews and summarizes existing privacy-preserving approaches and proposes a PGU model to guide the evaluation of various PPML solutions by decomposing their privacy-preserving functionalities. The PGU model is designed as the triad of Phase, Guarantee, and technical Utility. Furthermore, we also discuss the unique characteristics and challenges of PPML and outline possible directions of future work that benefit a wide range of research communities among the ML, distributed systems, security, and privacy areas.

Key Phrases: Survey, Machine Learning, Privacy-Preserving Machine Learning

∗ Part of this work was done while Runhua Xu was at the School of Computing and Information, University of Pittsburgh.
† This work was performed while James Joshi was serving as a program director at NSF; and the work represents the views of the authors and not that of NSF.

Preprint.
Contents

1 Introduction
2 Machine Learning Pipeline in a Nutshell
  2.1 Differentiate Computation Tasks in Model Training and Serving
  2.2 An Illustration of Third-party Facility-related ML Pipeline
3 Privacy-Preserving Phases in PPML
  3.1 Privacy-Preserving Model Creation
    3.1.1 Privacy-Preserving Data Preparation
    3.1.2 Privacy-Preserving Model Training
  3.2 Privacy-Preserving Model Serving
  3.3 Full Privacy-Preserving Pipeline
4 Privacy Guarantee in PPML
  4.1 Object-Oriented Privacy Guarantee
  4.2 Pipeline-Oriented Privacy Guarantee
5 Technical Utility in PPML
  5.1 Type I: Data Publishing Approaches
    5.1.1 Elimination-based Approaches
    5.1.2 Perturbation-based Approaches
    5.1.3 Confusion-based Approaches
  5.2 Type II: Data Processing Approaches
    5.2.1 Additive Mask Approaches
    5.2.2 Garbled Circuits Approaches
    5.2.3 Modern Cryptographic Approaches
    5.2.4 Mixed-Protocol Approach
    5.2.5 Trusted Execution Environment Approach
  5.3 Type III: Architecture Approaches
    5.3.1 Delegation-based ML Architecture
    5.3.2 Distributed Selective SGD Architecture
    5.3.3 Federated Learning (FL) Architecture
    5.3.4 Knowledge Transfer Architecture
  5.4 Type IV: Hybrid Approaches
  5.5 Technical Path and Utility Cost
6 Challenges and Potential Directions
  6.1 Open Problems and Challenges
  6.2 Research Directions
    6.2.1 Systematic Definition, Measurement and Evaluation of Privacy
    6.2.2 Strategies of Attack and Defense
    6.2.3 Communication Efficiency Improvement
    6.2.4 Computation Efficiency Improvement
    6.2.5 Privacy Perturbation Budget and Model Utility
    6.2.6 New Deployment Approaches of Differential Privacy in PPML
    6.2.7 Compatibility of Privacy, Fairness, and Robustness
    6.2.8 Novel Architecture of PPML
    6.2.9 New Model Publishing Method for PPML
    6.2.10 Interpretability in PPML
    6.2.11 Benchmarking
7 Conclusion
1 Introduction

Machine learning (ML) is increasingly being applied in a wide variety of application domains. For instance, emerging deep neural networks, also known as deep learning (DL), have shown significant improvements in model accuracy and performance, especially in application areas such as computer vision, natural language processing, and speech or audio recognition [1, 2, 3]. Emerging federated learning (FL) is another collaborative ML technique that enables training a high-quality model while the training data remains distributed over multiple decentralized devices [4, 5]. FL has shown its promise in various application domains, including healthcare, vehicular networks, and intelligent manufacturing, among others [6, 7, 8]. Although these models have shown considerable success in AI-powered or ML-driven applications, they still face several challenges, such as (i) lack of powerful computational resources and (ii) availability of vast volumes of data for model training.

In general, the performance of an ML system relies on a large volume of training data and high-powered computational resources to support both the training and inference phases. To address the need for computing resources with high-performance CPUs and GPUs, large memory storage, etc., existing commercial ML-related infrastructure service providers, such as Amazon, Microsoft, Google, and IBM, have devoted significant effort toward building infrastructure as a service (IaaS) or machine learning as a service (MLaaS) platforms offered for an appropriate rental fee. Resource-limited clients can employ ML-related IaaS to manage and train their models first, and then provide data analytics and prediction services directly through their applications.

Availability of a massive volume of training data is another challenge for ML systems. Intuitively, more training data indicates better performance of an ML model; thus, there is a need for collecting large volumes of data that are potentially from multiple sources. However, the collection and use of the data, and even the creation and use of ML models, raise serious privacy concerns because of the risk of leakage of highly secure or private/confidential information. For instance, recent data breaches have increased the privacy concerns around large-scale collection and use of personal data [9, 10]. An adversary can also infer private information by exploiting an ML model via various inference attacks such as membership inference attacks [11, 12, 13, 14, 15], model inversion attacks [16, 17, 18], property inference attacks [19, 20], and privacy leakage from the exchanged gradients in distributed ML settings [21, 22]. In an example of a membership inference attack, an attacker can infer whether or not data related to a particular patient has been included in the training of an HIV-related ML model. In addition, existing regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and more recent ones such as the European General Data Protection Regulation (GDPR), the Cybersecurity Law of China, the California Consumer Privacy Act (CCPA), etc., are increasingly restricting the availability and use of privacy-sensitive data. Such privacy concerns of users and requirements imposed by regulations pose a significant challenge for the training of a well-performing ML model, and consequently hinder the adoption of ML models in real-world applications.
To tackle the increasing privacy concerns related to using ML in applications in which users' privacy-sensitive data, such as electronic health/medical records, location information, etc., are stored and processed, it is crucial to devise innovative privacy-preserving ML solutions. More recently, there have been increasing efforts focused on PPML research that integrate existing traditional anonymization mechanisms into ML pipelines or design privacy-preserving methods and architectures for ML systems. Recent ML-related or FL-oriented surveys such as [23, 24, 25, 26, 27, 28] have partially illustrated or discussed the specific privacy and security issues in ML or FL systems.

Each existing PPML approach addresses only part of the privacy concerns or is applicable only to limited scenarios. There is no unified or perfect PPML solution that comes without any sacrifice. For instance, the adoption of differential privacy in ML systems can lead to model utility loss, e.g., reduced model accuracy. Similarly, the use of secure multi-party computation approaches incurs high communication overhead, because a large volume of intermediate data such as garbled tables of circuit gates needs to be transmitted during the execution of the protocols, or high computation overhead due to the adopted ciphertext-computational cryptosystems.

ML security issues, such as stealing ML models, injecting Trojans, and the availability of ML services, along with the corresponding countermeasures, have been thoroughly discussed in the systematization of knowledge [29], surveys [30, 31], and comprehensive analysis [32]. However, there is still a lack of systematization-of-knowledge discussion and evaluation from the privacy perspective. Inspired by the CIA model (a.k.a. the confidentiality, integrity and availability triad) designed to guide policies for
Figure 1: An overview of the PGU model to evaluate privacy-preserving machine learning systems and an illustration of selected PPML examples in the PGU model. The demonstrated PPML examples in the figure are HybridAlpha [33], DP-SGD [34], NN-EMD [35], and SA-FL [36].

information security within an organization, this paper proposes a PGU model to guide the evaluation of various privacy-preserving machine learning solutions by decomposing their privacy-preserving functionalities, as illustrated in Figure 1. The PGU model is designed as the triad of Phase, Guarantee, and technical Utility: the phase represents the stage of the ML pipeline in which the privacy-preserving functionality occurs; the guarantee denotes the intensity and scope of the privacy protection under the given threat model and trust assumptions; the technical utility denotes the utility impact of the techniques or designs adopted to achieve the privacy-preserving goals in an ML system. Based on the PGU analysis framework, this paper also discusses possible challenges and potential future research directions of PPML.

More specifically, this paper first introduces the general pipeline in an ML application scenario. Then we discuss the PPML pipeline from the perspective of the phases in which privacy-preserving functionalities occur in the process chain of the system, usually including privacy-preserving data preparation, privacy-preserving model training and evaluation, privacy-preserving model deployment, and privacy-preserving model inference. Next, based on common threat model settings and trust assumptions, we discuss the privacy guarantee of existing PPML solutions by analyzing the intensity and scope of the privacy protection from two aspects: object-oriented privacy protection and pipeline-oriented privacy protection. From the object perspective, PPML solutions either target data privacy, to prevent private information leakage from the training or inference data samples, or model privacy, to mitigate privacy disclosure through the trained model. From the pipeline perspective, PPML solutions focus on the privacy-preserving functionality at one entity or a set of entities in the pipeline of an ML solution, e.g., a distributed ML system or an IaaS/MLaaS-related ML system; usually this includes local privacy, global privacy, and full-chain privacy. In addition, the paper evaluates PPML systems by investigating their technical utility. In particular, we first elaborate on the underlying privacy-preserving techniques by decomposing and classifying existing PPML solutions into four categories: data publishing approaches, data processing approaches, architecture-based approaches, and hybrid approaches. Then, we analyze their impact on an ML system from a set of aspects: computation utility, communication utility, model utility, scalability utility, scenario utility, and privacy utility. Finally, we present our viewpoints on the challenges of designing PPML solutions and the future research directions in the PPML area. Our discussion and analysis broadly contribute to the machine learning, distributed systems, security, and privacy areas.

Organization. The remainder of this paper is organized as follows. We briefly present the ML pipeline in Section 2 by reviewing the critical tasks in ML-related systems, as well as illustrating third-party facility-related ML solutions. In Section 3, we present a general discussion of existing privacy-preserving methods with consideration of the phase where they are applied, and discuss the
privacy guarantee in Section 4. We investigate the technical utility of PPML solutions in Section 5 by summarizing and classifying privacy-preserving techniques and their impact on ML systems. Furthermore, we also discuss the challenges and open problems in PPML solutions and outline promising directions of future research in Section 6. Finally, we conclude the paper in Section 7.

2 Machine Learning Pipeline in a Nutshell

An ML system typically includes four phases: data preparation or preprocessing, model training and evaluation, model deployment, and model serving. More coarsely, it can be summarized as training and serving. The former includes operations such as collecting and preprocessing data, training a model, and evaluating the model performance. The latter mainly focuses on using a trained model, such as how to deploy the model and provide an inference service.

In this section, we first formally differentiate the computation tasks of model training and model serving, which will support the follow-up discussion of privacy-preserving training and privacy-preserving serving approaches from the perspective of the computation task. Next, we illustrate generic process paths in the ML pipeline scenario, including the adoption of both self-owned and third-party facilities, to cover most ML applications. Specifically, we divide the processing pipeline into three trust domains: the data owner trust domain, the third-party trust domain, and the model user trust domain. Based on these trust domains, we can set up various trust assumptions and potential adversary behaviors to analyze the privacy guarantee of existing PPML solutions.

2.1 Differentiate Computation Tasks in Model Training and Serving

From the underlying computation task perspective, there is no strict boundary between model training and model serving (i.e., inference). The computed function in the serving procedure can be viewed as a simplified version of the training procedure without loss computation, regularization, (partial) derivatives, and model weight updates. For instance, in a stochastic gradient descent (SGD) based training approach, the computation that occurs at the inference phase can be viewed as one round of computation in the training phase without the operations related to model gradient updates. In a more complex neural network example, the training computation is a process where a set of data is continuously fed to the designed network for multiple training epochs. In contrast, the inference computation can be treated as only one epoch of computation on one data sample to generate a predicted label, without the back-propagation procedure and the related regularization or normalization.

Formally, given a set of training samples denoted as (x_1, y_1), ..., (x_n, y_n), where x_i ∈ R^m and y_i ∈ R, the goal of ML model training (for simplicity, assume a linear model) is to learn a fitted function denoted as

f_{w,b}(x) = w^T x + b,    (1)

where w ∈ R^m is the vector of model parameters and b is the intercept. To find proper model parameters, we usually need to minimize the regularized training error given by

E(w, b) = (1/n) ∑_{i=1}^{n} L(y_i, f_{w,b}(x_i)) + α R(w),    (2)

where L(·) is a loss function that measures model fit and R(·) is a regularization term (a.k.a. penalty) that penalizes model complexity; α is a non-negative hyperparameter that controls the regularization strength. Regardless of the various choices of L(·) and R(·), stochastic gradient descent (SGD) is a common optimization method for the unconstrained optimization problem discussed above.
A simple SGD method iterates over the training samples and, for each sample, updates the model parameters according to the update rule given by

w ← w − η ∇_w E = w − η [α ∇_w R + ∇_w L],    (3)
b ← b − η ∇_b E = b − η [α ∇_b R + ∇_b L],    (4)

where η is the learning rate, which controls the step size in the parameter space. Given the trained model (w_trained, b_trained), the goal of model serving is to predict a value ŷ for a target sample x as follows:

ŷ = f_{w_trained, b_trained}(x).    (5)
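To make the distinction between the two computation tasks concrete, the following NumPy sketch (our own illustration, not taken from any cited system) implements the per-sample SGD update of Equations (3)-(4) for the linear model of Equation (1) with a squared loss and L2 regularization, and reuses the same forward computation of Equation (5) for serving; the hyperparameter values are arbitrary.

```python
import numpy as np

def predict(w, b, X):
    # Model serving (Equation (5)): a single forward pass, no gradients.
    return X @ w + b

def sgd_train(X, y, alpha=0.01, eta=0.05, epochs=20, seed=0):
    # Model training (Equations (2)-(4)): iterate over samples and update (w, b).
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            err = (X[i] @ w + b) - y[i]          # dL/df for squared loss L = 0.5*(f - y)^2
            w -= eta * (alpha * w + err * X[i])  # w <- w - eta * (alpha * dR/dw + dL/dw)
            b -= eta * err                       # b <- b - eta * dL/db
    return w, b

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.3
    w, b = sgd_train(X, y)
    print(predict(w, b, X[:2]), y[:2])
```

Note that serving touches neither the loss nor the gradients, which is one way to see why privacy-preserving serving is the computationally easier of the two tasks.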
Figure 2: An illustration of the machine learning pipeline (upper part) and a demonstration of the corresponding processes with trust domain partitions (lower part) in various scenarios of ML applications.

As demonstrated above, the computed function in the inference phase (i.e., Equation (5)) is part of the computation procedure of SGD training (i.e., Equation (2)). Taking an emerging deep neural network as an example, the model training process is a model inference process with extra back-propagation to compute the partial derivatives of the weights. In particular, the task of privacy-preserving training is more challenging than the task of privacy-preserving serving. Most existing computation-oriented privacy-preserving training solutions imply the ability to achieve privacy-preserving serving even if the proposal does not explicitly state so, but not vice versa. A more specific demonstration is presented in the rest of the paper.

2.2 An Illustration of Third-party Facility-related ML Pipeline

As depicted in Figure 2, we briefly overview the ML pipeline and the corresponding process chains and facilities, including self-owned resources and third-party computational resources (e.g., IaaS, PaaS, and MLaaS). Note that the third-party facility-related ML pipeline is also widely adopted in recent ML-related applications. The pipeline is divided into four stages: data preparation, which collects and preprocesses the dataset; model training and evaluation, which applies or designs a proper ML algorithm to train and evaluate an ML model; model deployment, which delivers the model to the target user or deploys the model to a third-party agency; and model serving, which generates prediction results or provides an inference service.

From the perspective of the types of participants, the ML pipeline includes three entities/roles: the data producer (DP), the model consumer (MC), and the computational facility (CF) that may be owned by the data producer and model consumer themselves or employed from a third-party entity. The data producer owns and provides a vast volume of training data to learn various ML models. At the same time, the model consumer also owns vast amounts of target data and expects to acquire various ML inference services such as classification of the target data with labels, prediction of values based on the target data, and aggregation into several groups. For the stage of model training/evaluation or model deployment/serving, there exist two possible cases: the data producer (or the model consumer) (i) intends to use locally owned CFs, or (ii) prefers to employ third-party CFs instead of local CFs. As a result, different computation options exist for model training/evaluation and model deployment/serving. As illustrated in Figure 2, in case (i), the data producer can train a complete model locally (T1) or locally train a partial model that can be used to aggregate a global ML model in a collaborative or distributed manner (T2). In case (ii), the data producer directly sends its data to third-party entities that have computational facilities, employing their computational resources to train an ML model (T3). Such third-party facilities may include edge nodes in an edge computing environment and/or IaaS servers in a cloud computing environment.
Accordingly, the model consumer may acquire the trained model directly (D1) for a model inference service with unlimited access (S1) if it has local computation facilities; otherwise, the model consumer can utilize a third-party facility where the trained model is deployed (D2) to acquire the prediction service (S2).
From the perspective of privacy-preserving phases in the ML pipeline, we classify PPML research into two directions: the privacy-preserving training phase, including private data preparation and model creation, and the privacy-preserving serving phase, involving private model deployment and serving. We discuss the affected phases of privacy-preserving approaches in Section 3 in detail. It is also important to consider the trust domains to characterize the trust assumptions and related threat models. We divide the ML system into three domains: the data owner's local trust domain, the third-party CF trust domain, and the model user's trust domain. Based on those trust domains, we analyze various types of privacy guarantees provided by a PPML system. We present more details in Section 4.

From the perspective of the underlying adopted techniques, we decompose recently proposed PPML solutions to summarize and classify the critical privacy-preserving components and to evaluate their potential utility impact. Intuitively, underlying privacy-preserving techniques such as differential privacy, conventional multi-party computation built on garbled circuits and oblivious transfer, or customized secure protocols are widely employed in PPML solutions. Besides, well-designed learning architectures have been broadly studied under specific trust domain and threat model settings. Furthermore, emerging advanced cryptosystems such as homomorphic encryption and functional encryption also show their promise for PPML with strong privacy guarantees. A more detailed taxonomy and analysis will be introduced in Section 5.

3 Privacy-Preserving Phases in PPML

This section evaluates existing PPML solutions from the perspective of privacy-preserving phases in an ML pipeline. Figure 2 illustrates four phases in a typical ML pipeline: data preparation, model training and evaluation, model deployment, and model serving. Correspondingly, the existing PPML pipeline involves (i) privacy-preserving data preparation, (ii) privacy-preserving model training and evaluation, (iii) privacy-preserving model deployment, and (iv) privacy-preserving inference. For simplicity, this section analyzes PPML mainly in terms of privacy-preserving model creation, covering phases (i)-(ii), and privacy-preserving model serving, covering phases (iii)-(iv), to avoid overlapping discussion among those phases.

3.1 Privacy-Preserving Model Creation

Most privacy-preserving model creation solutions emphasize that the adopted privacy-preserving approaches should prevent the private information in the training data from leaking beyond the trusted scope of the data sources. In particular, the key factors of model creation are data and computation; consequently, the research community tries to tackle the challenge of privacy-preserving model creation from the following two aspects: (i) how to distill/filter the training data such that the data used in subsequent processing includes less or even no private information; (ii) how to process the training data in a private manner.
3.1.1 Privacy-Preserving Data Preparation

From the perspective of data, existing privacy-preserving approaches work in the following directions: (i) adopting traditional anonymization mechanisms such as k-anonymity [37], l-diversity [38], and t-closeness [39] to remove the identifier information in the training data before sending out the data for training; (ii) representing the raw dataset as a surrogate dataset by grouping the anonymized data [40] or abstracting the dataset by sketch techniques [41, 42]; (iii) employing ε-differential privacy mechanisms [43, 44, 45] to add calibrated noise, controlled by a privacy budget, to the dataset to avoid private information leakage through statistical queries.

Specifically, the proposal in [46] tries to provide k-anonymity in the data mining algorithm, while the works in [47, 48] focus on the utility metric and provide a suite of anonymization algorithms to produce an anonymous view based on ML workloads.
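As a small illustration of the elimination-based direction (i) above, the pandas sketch below (our own toy example; the column names, records, and bin boundaries are hypothetical) checks whether a tabular dataset satisfies k-anonymity over a chosen set of quasi-identifiers and shows how generalizing an attribute can restore the guarantee.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    # Every combination of quasi-identifier values must occur at least k times.
    return df.groupby(quasi_identifiers).size().min() >= k

# Hypothetical person-specific records; 'zip' and 'age' act as quasi-identifiers.
df = pd.DataFrame({
    "zip":  ["95120", "95120", "95121", "95121", "95120", "95121"],
    "age":  [34, 36, 41, 44, 33, 46],
    "diag": ["HIV", "Flu", "Flu", "HIV", "Flu", "Flu"],   # sensitive attribute
})

print(is_k_anonymous(df, ["zip", "age"], k=2))   # False: exact ages are identifying
# Generalize age into coarse ranges (a simple form of suppression/generalization).
df["age"] = pd.cut(df["age"], bins=[30, 40, 50], labels=["30-40", "40-50"]).astype(str)
print(is_k_anonymous(df, ["zip", "age"], k=2))   # True after generalization
```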
On the other hand, the recent differential privacy mechanism has shown its promise in emerging DL models that rely on training over large-scale datasets. For example, an initial work in [34] proposes a differentially private stochastic gradient descent approach to train a privacy-preserving DL model. Among the most recently proposed representative works, the proposal in [49] demonstrates that it is possible to train large recurrent language models with user-level differential privacy guarantees at only a negligible cost in predictive accuracy. Recent parameter-transfer meta-learning (i.e., applications including few-shot learning, federated learning, and reinforcement learning) often requires the task owners to share model parameters, which may result in privacy disclosure. Proposals for privacy-preserving meta-learning such as those presented in [50, 33] try to tackle the problem of private information leakage in federated learning (FL) by proposing an algorithm to achieve client-side (local) differential privacy. The work in [51] formalizes the notion of task-global differential privacy and proposes a differentially private algorithm for gradient-based parameter transfer that satisfies the privacy requirement while retaining provable transfer learning guarantees in convex settings.

Thanks to recent successful studies on computation over ciphertext (i.e., practical computation over encrypted data) in the cryptography community, data encryption is becoming another promising approach to protect the privacy of the training data. Unlike the traditional anonymization mechanisms or differential privacy mechanisms, which are still susceptible to inference or de-anonymization attacks such as those demonstrated in [52, 53, 11, 54], wherein the adversary may have additional background knowledge, the encryption approaches can provide a strong privacy guarantee - called confidential-level privacy in the rest of the paper - and hence are receiving more and more attention in recent studies [55, 56, 33, 57, 58, 59, 60, 61, 62], wherein the training data or the transferred model parameters are protected by cryptosystems while still allowing the subsequent computation outside of the trusted scope of the data sources.

3.1.2 Privacy-Preserving Model Training

From the perspective of computation, existing privacy-preserving approaches are correspondingly divided into two directions: (i) in the case that the training data is processed by conventional anonymization or differential privacy mechanisms, the subsequent training computation is as normal as vanilla model training; (ii) in the case that the training data is protected via cryptosystems, due to the confidential-level privacy, the privacy-preserving (i.e., crypto-based) training computation is more complex than normal model training. The demand for training computation over ciphertext means that the direct use of traditional cryptosystems such as AES and DES is not applicable here. Crypto-based training relies on recently proposed modern cryptography schemes, mainly homomorphic encryption [63, 64, 65, 66, 67] and functional encryption [68, 69, 70, 71, 72, 73, 74], which allow computation over the encrypted data. In general, homomorphic encryption (HE) is a form of public-key encryption that allows one to perform computation over encrypted data without decrypting it. The result of the computation is still in an encrypted form and is the same as if the computation had been performed on the original data. Similarly, functional encryption (FE) is also a generalization of public-key encryption, in which the holder of an issued functional decryption key can learn a function of what the ciphertext is encrypting without learning the underlying protected inputs.
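To give a flavor of computation over ciphertext, the following self-contained sketch (our own illustration, not one of the schemes cited above, and using insecure toy key sizes) implements a textbook Paillier cryptosystem, an additively homomorphic public-key scheme in which multiplying two ciphertexts yields an encryption of the sum of the plaintexts; this additive property is the basic primitive that many HE-based PPML protocols build on.

```python
import math, random

def keygen(p=100003, q=100019):
    # Toy primes only; real deployments use moduli of 2048 bits or more.
    n = p * q
    lam = math.lcm(p - 1, q - 1)          # Carmichael's function of n = p*q
    mu = pow(lam, -1, n)                  # valid because the generator is g = n + 1
    return (n,), (n, lam, mu)

def encrypt(pk, m):
    (n,) = pk
    n2 = n * n
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2    # c = g^m * r^n mod n^2

def decrypt(sk, c):
    n, lam, mu = sk
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n        # L(c^lam mod n^2) * mu mod n

def add_encrypted(pk, c1, c2):
    (n,) = pk
    return (c1 * c2) % (n * n)                          # E(m1) * E(m2) = E(m1 + m2)

pk, sk = keygen()
c = add_encrypted(pk, encrypt(pk, 1234), encrypt(pk, 4321))
print(decrypt(sk, c))   # 5555, computed without ever decrypting the operands
```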
Unlike the normal training process, it is worth noting that there may be an extra step - data conversion or data encoding - in crypto-based training, since most of those cryptosystems, such as multi-party functional encryption [70, 75] and the BGV scheme [76] (i.e., an implemented homomorphic encryption scheme), are built on integer groups, while the training data or exchanged model parameters are always in floating-point format, especially when the training samples are preprocessed via commonly adopted methods such as feature encoding, discretization, normalization, etc. However, this is not a strict requirement for all crypto-based training approaches. For instance, the emerging CKKS scheme [77] - an instance of homomorphic encryption - can support approximate arithmetic computation. The data conversion usually includes a pair of encoding and decoding operations. The encoding step is commonly adopted to convert the floating-point numbers into integers so that the data can be encrypted and then used in crypto-based training. Conversely, the decoding step is applied to the trained model or the crypto-based training result to recover the floating-point numbers. Naturally, the effectiveness and efficiency of those rescaling procedures rely on the conversion precision setting. We discuss the potential impact of the data conversion in detail in Section 5.
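The encoding/decoding pair can be as simple as fixed-point scaling. The sketch below is an illustrative example of ours, with a hypothetical precision parameter and modulus; it shows how floating-point values are mapped into a finite integer group before encryption and recovered afterwards, and why the precision setting bounds the rescaling error.

```python
PRECISION = 2 ** 16          # hypothetical scaling factor: roughly 4-5 decimal digits
MODULUS = 2 ** 61 - 1        # integer group size assumed by the (toy) cryptosystem

def encode(x: float) -> int:
    # Map a signed float to a non-negative integer in [0, MODULUS).
    return round(x * PRECISION) % MODULUS

def decode(v: int) -> float:
    # Interpret the upper half of the group as negative numbers, then rescale.
    if v > MODULUS // 2:
        v -= MODULUS
    return v / PRECISION

a, b = 3.14159, -2.71828
# Addition commutes with this encoding, so an additively homomorphic scheme can
# operate on encoded values; the result is decoded once by the key holder.
encoded_sum = (encode(a) + encode(b)) % MODULUS
print(decode(encoded_sum), a + b)   # equal up to a rounding error of about 1/PRECISION
```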
3.2 Privacy-Preserving Model Serving

There is no clear boundary between privacy-preserving model deployment and model inference in most PPML solutions; hence we combine the discussion here as privacy-preserving model serving.

From the perspective of computation, unlike the task of privacy-preserving training, tackling the privacy-preserving model serving problem is relatively simpler, as demonstrated in Section 2.1. Except for emerging complex deep neural network models, there are not many separate studies focused on privacy-preserving inference. According to our exploration, most PPML solutions such as [60, 78, 79, 80, 61, 81, 82, 83] that apply modern cryptosystems (mainly homomorphic encryption and its related schemes) only target privacy-preserving inference, as those crypto-based solutions are not efficient enough to be applied to the complex and massive training computation of neural networks. Note that those privacy-preserving serving solutions mainly focus on protecting the data samples to be inferred, or the trained model, or both.

Besides, another branch of privacy-preserving model serving research focuses on privacy-preserving model query or publication in the case that the trained model is deployed in a privacy-preserving manner, where the model consumer and model owner are usually separated. The core problem in this branch is how to prevent an adversarial model user from inferring private information about the training data. Depending on the model inference attack setting, the adversary has limited (or unlimited) access to query the trained model. In addition, the adversary may (or may not) have extra background knowledge of the trained model specification, usually called white-box (or black-box) attacks. To address those inference attack issues, a naive privacy-preserving solution is to limit the number of queries or reduce the background information of the released model for a specific model user. Beyond that, potential prevention methods include the following directions: (i) private aggregation of teacher ensembles (PATE) approaches [84, 85, 86], wherein the knowledge of an ensemble of "teacher" models is transferred to a "student" model, with intuitive privacy provided by training teachers on disjoint data and strong privacy guaranteed by noisy aggregation of teachers' answers; (ii) model transformation approaches such as MiniONN [87] and a variant solution [88], where an existing model is transformed into an oblivious neural network supporting privacy-preserving predictions with reasonable efficiency; (iii) model compression approaches, especially applied to emerging deep learning models with a large number of model parameters, where knowledge distillation methods [89, 90] are adopted to compress the trained deep neural network model. Even though the main goal of knowledge distillation is to reduce the size of the deep neural network model, such a method also brings additional privacy-preserving functionality [91, 92]. Intuitively, the distillation procedure not only removes redundant information in the model, but also reduces the probability that the adversary can infer potential private information in the model by iterative queries.
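The noisy-aggregation idea behind PATE can be summarized in a few lines. The sketch below is our simplified rendering of that mechanism (the noise scale and teacher votes are illustrative, and the formal privacy accounting of [84] is omitted): teacher votes are counted per class, Laplace noise is added to each count, and only the winning label is released to the student.

```python
import numpy as np

def pate_aggregate(teacher_votes, num_classes, gamma=0.05, rng=None):
    """Noisy-max aggregation: count teacher votes per class, add Laplace noise
    with scale 1/gamma to each count, and release only the winning label."""
    if rng is None:
        rng = np.random.default_rng()
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(loc=0.0, scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(counts))

# 250 hypothetical teachers vote on one unlabeled student query (10 classes).
rng = np.random.default_rng(7)
votes = rng.choice(10, size=250, p=[0.55] + [0.05] * 9)   # class 0 clearly dominates
print(pate_aggregate(votes, num_classes=10, rng=rng))      # almost surely 0
```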
3.3 Full Privacy-Preserving Pipeline

The notion of a full privacy-preserving pipeline is rarely mentioned in PPML proposals. Existing PPML solutions either claim support for privacy-preserving model creation or focus on privacy-preserving model serving.

As demonstrated in Section 2.1, the computation task at the model inference phase can be viewed as a non-iterative and simplified version of the operations of model training. Thus, from the perspective of the computation task, the problem of privacy-preserving inference can be viewed as a sub-problem of privacy-preserving training. Existing computation-oriented PPML proposals that focus on privacy-preserving training generally also support privacy-preserving inference, as illustrated in [93, 94, 95, 56, 62, 35], and hence they can be evaluated as full privacy-preserving pipeline approaches.

We note that the PPML solutions relying on privacy-preserving data preparation approaches such as anonymization, sketching, and differential privacy are usually not compatible with privacy-preserving inference. The goal of inference is to acquire an accurate prediction for a single data point; however, those approximation or perturbation techniques are either not applicable to the data to be predicted
or reduce the data utility for inference. Thus, the data-oriented PPML proposals introduced in Section 3.1 are incompatible with the privacy-preserving inference goal.

Besides, another direction toward a complete privacy-preserving pipeline could be to trivially integrate privacy-preserving model creation approaches with privacy-preserving model serving approaches (i.e., privacy-preserving model query or publication-based methods for model deployment accompanied by computation-oriented privacy-preserving inference). For instance, it is possible to produce a deep neural network model with most privacy-preserving training approaches; the trained model can then be transformed into an oblivious neural network to support privacy-preserving predictions.

4 Privacy Guarantee in PPML

Privacy is a concept in disarray. It is hard to accurately articulate what privacy means. In general, privacy is a sweeping concept, encompassing freedom of thought, control over one's body, solitude in one's home, control over personal information, freedom from surveillance, protection of one's reputation, and protection from searches and interrogations [96]. In our viewpoint, privacy is a subjective assessment of the acceptable degree to which personal information can be disclosed to untrusted domains and of how much personal information is publicly available. It is a challenge to define what private information is and how to measure privacy, since privacy is a subjective notion where different people may have different perceptions or viewpoints.

In the digital environment, a widely accepted notion of minimum privacy coverage is personal identity [37, 43, 97]. Some common approaches include differential privacy mechanisms [43, 97] and the k-anonymity mechanism [37] and its follow-up work such as l-diversity [38] and t-closeness [39]. Differential privacy aims to hide individuals from the patterns of a dataset, e.g., the results of predefined functions queried from the dataset. The general idea behind differential privacy is that if the effect of making an arbitrary single substitution in the database is small enough, the query result cannot be used to infer much about any single individual, and therefore such a query result provides privacy protection [43, 97]. The goal of the anonymization mechanisms is to remove the identity-related information from the dataset directly. Given a person-specific field-structured dataset, k-anonymity and its variants focus on defining and hiding the identifiers and quasi-identifiers with guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain useful [37].

Various privacy-related terms and notions are used in PPML proposals to claim privacy-preserving functionality, and as a result there is no universal definition of the privacy guarantee in PPML. This paper explores those widely discussed privacy-related notions to analyze the privacy guarantee of PPML from two aspects: the object-oriented privacy guarantee and the pipeline-oriented privacy guarantee. The former focuses on measuring the privacy protection of specific objects in PPML, namely, critical objects such as the trained model weights, the exchanged gradients, and the training or inference data samples, among others. The latter evaluates the privacy guarantee by inspecting the entire pipeline, as illustrated in Figure 2. We elaborate on each perspective in the following sections.
4.1 Object-Oriented Privacy Guarantee

The privacy claim of most early PPML solutions is object-oriented, where the privacy protection objective is a specific object in the PPML system. To achieve the privacy goal, a set of PPML solutions protect the dataset directly, such as by removing identifiers and quasi-identifiers from the dataset using empirical anonymization mechanisms [37, 38, 39]. To overcome concerns about such empirical privacy guarantees, the differential privacy (DP) mechanism [43, 97] was proposed and has been widely adopted across various domains, since the privacy guarantee of DP is mathematically proven. In addition, encryption is a stricter approach to protecting the data, where any subsequent learning must operate over data that has been confused and diffused (i.e., ciphertext). We summarize this kind of privacy guarantee as the data-oriented privacy guarantee, or data privacy for short in the rest of the paper, as described in Definition 1.

Definition 1 (Data-Oriented Privacy Guarantee) A PPML solution claims the data-oriented privacy guarantee if no PPML adversary can learn private information directly from the training/inference data samples or link private information to a specific person's identifier.
In short, data-oriented privacy-preserving approaches aim to prevent privacy leakage from the dataset directly. However, privacy is not free; an obvious limitation of data privacy is the sacrifice of data utility. For instance, the empirical anonymization mechanisms need to aggregate and remove proper feature values. At the same time, some values of quasi-identifier features are entirely removed or partially deleted to satisfy the l-diversity and t-closeness definitions. Besides, the differential privacy mechanism requires adding noise, controlled by a privacy budget, to the data samples. Both approaches impact the accuracy of the trained model. Encrypted data can provide confidentiality for the dataset but brings extra computation overhead in the subsequent machine learning training.

Another set of PPML solutions focus on providing privacy-preserving models, where the privacy protection objective is the trained model in the PPML system. The goal of the privacy-preserving model is to prevent privacy leakage through the trained model and its use. Examples of such private information include membership, properties, etc. As described in Definition 2, we summarize such a privacy guarantee as the model-oriented privacy guarantee, or model privacy for short in the rest of the paper.

Definition 2 (Model-Oriented Privacy Guarantee) A PPML solution claims the model-oriented privacy guarantee if no PPML adversary can infer private information from the model via limited or unlimited periodic queries to a specific model.

To achieve the model privacy guarantee, existing PPML solutions follow two directions: (i) introducing differentially private training algorithms to perturb the trained model parameters; (ii) controlling the model access times and model access patterns to limit the adversary's ability. For instance, Abadi et al. [34] propose a differentially private stochastic gradient descent (DP-SGD) algorithm that adds differential privacy noise, calibrated by a privacy budget, to the clipped gradients to achieve a differentially private model. The private aggregation of teacher ensembles (PATE) framework [84] designs a remarkable model deployment architecture, where a set of ensemble models are trained as teacher models to provide a model inference service for a student model.
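A minimal sketch of the per-example gradient clipping and noise addition at the core of DP-SGD is shown below. It is our own simplification for a logistic regression model; the clipping norm C and noise multiplier sigma are illustrative, and the moments-accountant analysis that maps them to an (ε, δ) guarantee is omitted.

```python
import numpy as np

def dp_sgd_step(w, X_batch, y_batch, lr=0.1, C=1.0, sigma=1.1, rng=None):
    """One DP-SGD step for logistic regression: clip each per-example gradient
    to L2 norm at most C, average, then add Gaussian noise scaled by sigma * C."""
    if rng is None:
        rng = np.random.default_rng()
    grads = []
    for x, y in zip(X_batch, y_batch):
        p = 1.0 / (1.0 + np.exp(-x @ w))          # predicted probability
        g = (p - y) * x                           # per-example gradient
        g = g / max(1.0, np.linalg.norm(g) / C)   # clip to norm at most C
        grads.append(g)
    g_avg = np.mean(grads, axis=0)
    noise = rng.normal(0.0, sigma * C / len(X_batch), size=w.shape)
    return w - lr * (g_avg + noise)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(5)
for _ in range(100):
    w = dp_sgd_step(w, X, y, rng=rng)
print(w)   # noisy, but correlated with the true signal on feature 0
```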
4.2 Pipeline-Oriented Privacy Guarantee

Existing privacy measurement approaches such as differential privacy and k-anonymity are only applicable to specific objects such as datasets and models, and cannot be directly adopted to assess the privacy guarantee of a PPML pipeline. There is a lack of formal or informal approaches for assessing the intensity and scope of privacy protection provided by an ML pipeline. We believe that assessing the privacy guarantee relies on defining (i) the boundaries of data processing and (ii) the trust assumption on each processing domain in the pipeline. For instance, suppose that a data owner employs a third-party computational facility (CF) to process its privacy-sensitive data; this divides the data processing workflow into two parts: the data owner's local domain and the CF's domain. If the data owner fully trusts the CF, there may be no privacy concerns; otherwise, there is a demand for a privacy guarantee on the data processing. As depicted in Figure 2, we establish the trust boundaries of the processing pipeline in PPML as data producers, local CFs, third-party CFs, and model consumers.

From the perspective of a data owner (i.e., data producer), it may have different levels of trust in other domains. For instance, the data producer may fully trust its local CF, semi-trust the third-party CF, and have no trust at all in the model consumer. Based on such trust assumptions for each boundary, we present the taxonomy of privacy guarantee notions from the data owner's perspective as follows:

• No Privacy Guarantee: Here, the raw training data is shared with third-party CFs, regardless of whether these CFs are trusted or not, without applying any privacy-preserving approaches. Each entity is able to acquire the original raw data to process, or the ML model to consume, according to its role in the ML pipeline.

• Global Privacy Guarantee: The global privacy guarantee focuses on the ML model serving phase. The data producer has generated the trained ML model with its own CFs, or a trusted third-party entity with powerful CFs has helped train the ML model based on the raw data shared by the data producer. The global privacy guarantee aims to ensure that there is no privacy-sensitive information leakage during the model deployment and model inference phases. Essentially, the ML model is a statistical abstraction or pattern computed from the raw data, and hence any privacy disclosure that may occur here is viewed as statistical
leakage. Typical privacy leakage examples include the leakage of membership, class representatives, properties, etc. For instance, an ML model for helping diagnose HIV can be trained using existing healthcare records of HIV patients. A successful membership attack on the model allows an adversary to check whether a selected target sample (i.e., a patient's record) was used in the training or not, thereby revealing whether the target person is an HIV patient.

• Vanilla Local Privacy Guarantee: The basic local privacy guarantee ensures that the privacy-sensitive raw data is not directly shared with other honest entities in the ML pipeline. The indirect-sharing approaches include: (i) the raw data is pre-processed to remove privacy-sensitive information, or to obfuscate private information with noise, before being sent out for model training; (ii) the raw data is pre-trained into a local model in a collaborative manner, while the generated model update is revealed to other entities.

• Primary Local Privacy Guarantee: The primary local privacy guarantee is built upon the vanilla local privacy guarantee. The main difference is the trust assumption about the rest of the pipeline. Beyond the basic requirement of the vanilla privacy guarantee, it also requires that the shared local model update remain private against curious third-party CF entities, as opposed to honest CF entities such as the training server in an IaaS platform or the coordinator server in a distributed collaborative ML system.

• Enhanced Local Privacy Guarantee: The enhanced local privacy guarantee is built upon the primary local privacy guarantee, with the additional assumption that the third-party CF entities may be totally untrusted.

• Full Privacy Guarantee: The requirement of a full privacy guarantee includes both local privacy and global privacy. As per the definitions above, the global privacy guarantee focuses on the ML model serving phase, while local privacy ensures the privacy guarantee in the model creation phase; full privacy thus ensures privacy protection at each step of the ML pipeline, as illustrated in Figure 2.

In particular, privacy leakage is specific to the threat model setting, which considers the adversary's behaviors and abilities and assumes the worst situation that an ML system can deal with. Simultaneously, the threat model also reflects users' confidence in the entire data processing pipeline and the trustworthiness of each entity. Consequently, as mentioned above, the privacy guarantees strongly correlate with the specific threat model defined in the PPML system.

5 Technical Utility in PPML

In this section, we first classify and discuss PPML solutions in a more specific manner, namely, from the aspect of privacy-preserving techniques, by decomposing those solutions and tracking how they tackle the following questions:

• How is the privacy-sensitive data released/published?
• How is the privacy-sensitive data processed/trained?
• Does the architecture of the ML system prevent the disclosure of privacy-sensitive information?

Correspondingly, we summarize the privacy-preserving approaches into four categories: data publishing-based approaches, data processing-based approaches, architecture-based approaches, and hybrid approaches that may combine or fuse two or three of the aforementioned approaches.
Next, we analyze the potential impact (i.e., utility cost) of those privacy-preserving techniques on normal ML solutions in terms of computation utility, communication utility, model utility, scalability utility, etc.

5.1 Type I: Data Publishing Approaches

In general, the data publishing based privacy-preserving approaches in PPML systems include the following types: approaches that partially eliminate the identifiers and/or quasi-identifiers in the raw data; approaches that partially perturb the statistical result of the raw data; and approaches that totally confuse the raw data.
5.1.1 Elimination-based Approaches

The traditional anonymization mechanisms are classified as private information elimination approaches, where techniques such as k-anonymity [37], l-diversity [38], and t-closeness [39] are applied to the raw privacy-sensitive data to remove private information and resist potential inference attacks. Specifically, the k-anonymity mechanism ensures that the information for each person contained in the released dataset cannot be distinguished from that of at least k-1 other individuals whose information is also released in the dataset. To achieve this, k-anonymity defines the identifiers and quasi-identifiers for each data attribute, then removes the identifiers and partially hides the quasi-identifier information. The l-diversity mechanism introduces the concept of an equivalence class, where an equivalence class is l-diverse if there are at least l "well-represented" values for the sensitive attribute. Essentially, as an extension of the k-anonymity mechanism, the l-diversity mechanism reduces the granularity of the data representation and additionally maintains the diversity of sensitive fields by adopting techniques like generalization and suppression, such that any given record maps to at least k − 1 other records in the dataset. The t-closeness mechanism is a further refinement of l-diversity that introduces an additional restriction on the value distribution within an equivalence class: an equivalence class satisfies t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole dataset is no more than a threshold t.

Examples of emerging anonymization-based PPML solutions include [40, 98], which focus on secure or privacy-preserving federated gradient boosted tree models. Yang et al. [40] employ a modified k-anonymity based data aggregation method to compute the gradient and hessian by projecting the original data in each feature, instead of directly transmitting all exact data for each feature, to avoid privacy leakage. Furthermore, Ong et al. [98] propose an adaptive histogram-based federated gradient boosted tree method using a data surrogate representation approach that is compatible with the k-anonymity method or the differential privacy mechanism.

5.1.2 Perturbation-based Approaches

We present two typical perturbation-based approaches: differential privacy mechanisms and sketching methods.

Differential privacy based perturbation: The conventional perturbation-based privacy-preserving data publishing approaches mainly rely on the ε-differential privacy technique [43, 44, 45]. Formally, according to [97, 44], (ε, δ)-differential privacy is defined as follows: a randomized mechanism M: D → R with domain D and range R satisfies (ε, δ)-differential privacy if for any two adjacent inputs d, d′ ∈ D and for any subset of outputs S ⊆ R, it holds that

Pr[M(d) ∈ S] ≤ e^ε · Pr[M(d′) ∈ S] + δ.    (6)

The additive term δ allows for the possibility that plain ε-differential privacy is broken with probability δ (which is preferably smaller than 1/|d|, the inverse of the database size). A common paradigm for approximating a deterministic function f: D → R with a differentially private mechanism is via additive noise calibrated to the function's sensitivity S_f, defined as the maximum of the absolute distance |f(d) − f(d′)| over adjacent inputs d, d′.
The representative and most common additive noise mechanisms for real-valued functions are the Laplace mechanism and the Gaussian mechanism, respectively defined as follows:

M_Gauss(d; f, ε, δ) = f(d) + N(0, S_f² σ²),    (7)
M_Lap(d; f, ε) = f(d) + Lap(0, S_f / ε),    (8)

where σ denotes the noise scale parameter calibrated to (ε, δ). The typical usage of differential privacy in PPML solutions falls into two directions: (i) directly adopting the above-mentioned additive noise mechanisms on the raw dataset in the case of publishing data, as illustrated in [99, 100]; (ii) transforming the original training method into a differentially private training method so that the trained model has an ε-differential privacy guarantee when the model is published, as illustrated in [34, 50, 51].
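Both additive-noise mechanisms are straightforward to implement. The sketch below is our own illustration for a counting query with sensitivity 1; the Gaussian variant is calibrated with the classical bound sigma = sqrt(2 ln(1.25/delta)) * S_f / epsilon, which is valid for epsilon < 1.

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Equation (8): release f(d) + Lap(0, S_f / epsilon).
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def gaussian_mechanism(true_value, sensitivity, epsilon, delta):
    # Equation (7): release f(d) + N(0, sigma^2), with sigma set by the
    # classical calibration sqrt(2 ln(1.25/delta)) * S_f / epsilon (epsilon < 1).
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon
    return true_value + rng.normal(loc=0.0, scale=sigma)

# A counting query "how many records satisfy a predicate" has sensitivity 1:
# adding or removing one individual changes the count by at most 1.
exact_count = 812
print(laplace_mechanism(exact_count, sensitivity=1, epsilon=0.5))
print(gaussian_mechanism(exact_count, sensitivity=1, epsilon=0.5, delta=1e-5))
```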
Sketching-based perturbation: Sketching is a simple, approximate approach for data stream summarization that builds a probabilistic data structure serving as a frequency table of events, like counting Bloom filters. Recent theoretical advances [101, 102] have shown that differential privacy is achievable on sketches with additional modifications. For instance, the work in [103] focuses on privacy-preserving collaborative filtering, a popular technique for recommendation systems, using sketching techniques to implicitly provide differential privacy guarantees by taking advantage of the inherent randomness of the data structure. Recent work reported in [41] proposes a novel sketch-based framework for distributed learning, where the transmitted messages are compressed via sketches to achieve communication efficiency and provable privacy benefits simultaneously.

In short, the traditional anonymization mechanisms and perturbation-based approaches were designed to tackle general data publishing problems; however, those techniques are by no means outdated in the domain of PPML. In particular, the differential privacy technique has been widely adopted in recent privacy-preserving deep learning and privacy-preserving federated learning systems such as those proposed in [34, 49, 50, 51, 55, 33]. Furthermore, differential privacy is not only employed as a privacy-preserving approach, but also shows its promise in generating synthetic data [43, 104, 105] and in emerging generative adversarial networks (GANs) [106, 107].

5.1.3 Confusion-based Approaches

The confusion-based approach mainly refers to cryptographic techniques that confuse the raw data to achieve a much stronger privacy guarantee (i.e., confidential-level privacy) than traditional anonymization mechanisms and perturbation-based approaches. Existing cryptographic approaches for data publishing for machine learning training fall into the following two directions: (i) applying traditional symmetric encryption schemes such as AES, associated with garbled-circuits based secure multi-party computation protocols [108, 109, 110], or authenticated encryption, associated with customized secure computation protocols [36]; (ii) applying recent advanced modern cryptosystems such as homomorphic encryption schemes [63, 64, 65, 66, 67] and functional encryption schemes [68, 69, 70, 71, 72, 73, 74] that include the necessary algorithms to compute over the ciphertext, such that a party holding the issued key is able to acquire the computation result. Typical PPML systems such as the works proposed in [94, 111] can be classified under the first direction of crypto-based data publishing, while more and more recent works such as those proposed in [55, 56, 33, 57, 58, 59, 60, 61, 62] focus on direction (ii). In particular, the crypto-based approaches cannot work independently and are usually associated with corresponding secure processing approaches, since the data receiver is expected to learn only the data processing result rather than the original data. The crypto-based approaches provide a promising candidate for data publishing, but the discussion of these approaches mainly concerns how to share the one-time symmetric encryption keys or how to process the encrypted data; hence, more details will be presented in the next section.

5.2 Type II: Data Processing Approaches

Based on the different data/model publishing approaches, the data processing approaches for training and inference fall into two corresponding categories: normal computation and secure computation.
As discussed for the Type I approaches in Section 5.1, if the data is published using traditional anonymization mechanisms or perturbation-based approaches, where the personal identifiers in the data are eliminated and the statistical result of the data is perturbed by adding differential privacy noise or by building a probabilistic data structure, the subsequent training computation is as normal as the training computation in vanilla machine learning systems. Thus, the privacy-preserving data processing approach here mainly refers to the secure computation that occurs at the training and inference phases.

The secure computation problem and its first solution were introduced by Andrew Yao in 1982 in the form of a garbled-circuits protocol for two-party computation problems [112]. The primary goal of secure computation is to enable two or more parties to evaluate an arbitrary function of their inputs without revealing anything to either party except for the function's output. Based on the number of enrolled participants, secure computation approaches can be classified into basic secure two-party computation (2PC) and secure multi-party computation (MPC or SMC). From the perspective of the threat model (a.k.a. security model or security guarantee), secure computation protocols provide two types of security according to the adversary setting: semi-honest (passive) security and malicious (active) security. We refer the reader to [113, 114] for a systematization of knowledge on general secure multi-party computation solutions and the corresponding threat models.
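Before examining the individual families of secure computation techniques, the toy sketch below (our own example over a 64-bit ring) illustrates the additive masking idea behind the additive-mask approaches discussed in Section 5.2.1: each party splits its private input into uniformly random shares, any single share reveals nothing, yet summing the parties' aggregated shares reconstructs exactly the sum of the inputs.

```python
import secrets

RING = 2 ** 64   # all arithmetic is done modulo 2^64

def share(secret, num_parties):
    """Split an integer into additive shares that sum to the secret mod RING."""
    shares = [secrets.randbelow(RING) for _ in range(num_parties - 1)]
    shares.append((secret - sum(shares)) % RING)
    return shares

def reconstruct(shares):
    return sum(shares) % RING

# Three parties hold private inputs and want only the aggregate to be revealed.
inputs = [1_000, 2_500, 4_200]
all_shares = [share(x, num_parties=3) for x in inputs]

# Party j locally adds up the j-th share of every input and publishes the result;
# individual shares are uniformly random and leak nothing about the inputs.
partial_sums = [sum(s[j] for s in all_shares) % RING for j in range(3)]
print(reconstruct(partial_sums))   # 7700, the sum of the private inputs
```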