L8: Introduction to privacy-preserving computations - Privacy-preserving Technologies / LTAT.04.007
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
L8: Introduction to privacy-preserving computations Privacy-preserving Technologies / LTAT.04.007 Dan Bogdanov dan.bogdanov@cyber.ee
Using privacy technologies to solve it Source data: Estonian Education Information 10 million tax records, records System's Authority 600 000 education records. Ministry of Education and Research Each record upload using secret sharing (think: “encryption”) Ministry of Records linked and processed using Employment tax records Finance secure multi-party computation (think: IT Center Estonian Tax and “data not decrypted for processing”) Customs Board Data never existed outside the source in an unencrypted state. Cybernetica Solution based on Sharemind MPC. 5
Tax and Aggregate by year Customs Board Monthly Average income yearly income Recover Extract data results from Aggregate Expand by years and shares Employment by month aggregate by person tax payments Employment Employment Analysis Analysis tax payments record of a person results results Secret share and upload ? Merge by Complete record Analysis person's ID of a person table Higher study Higher study Compute additional events events attributes and Extract data Aggregate University career align tax payments by person of a person Statistical Ministry of Data stored with secret sharing and analyst Education and Science processed with secure multi-party computation 7
Sharemind-powered Analytics Data scientists used analytics tools based Estonian Information on secure multi-party computation. System's Authority The MPC system also prevented queries outside the study plan. Reports were given to industry, universities Data Universities and the government. Ministry of Finance Analyst Companies Policymakers Result: no clear relation between working IT Center during studies and not graduating. Cybernetica 8
A privacy-preserving statistics tool inspired by R 9
10
Non-IT graduation rate is around 40% IT graduation rate is around 20% 11 Joonis 1. Nominaalajaga lõpetajate osakaal immatrikuleerimisaastate lõikes, IKT- ja mitte-IKT õppekavad, bakalaureuseõpe
Non-IT and IT students have similar employment ratios, but IT students lost more in the financial crisis Joonis 4. Nominaalaja jooksul töötanud tudengite osakaal kõigist tudengitest aastati, IKT- ja mitte-IKT 12 õppekavad, bakalaureuseõpe
DATA-DRIVEN SERVICES ON CONFIDENTIAL DATA Regulatory status of the project In an official response, after a study of the system, the Estonian DPA suggested that neither the hosts of the servers running the statistics nor the analysts making the queries could feasibly re-identify individuals in the source database (this was pre-GDPR). The Internal Supervision Department of the Tax and Customs Board agreed to provide unmodified tax records after a code and process review. Follow-up legal review in the FP7 PRACTICE by a research from the University of Göttingen suggested that the same precedent could hold under GDPR as well. 13
A general model for privacy-preserving computing
Concept of secure computing encrypted database standard When a standard computer tools encrypts data, it must be decrypted before analysis secure Secure computing systems computing can analyze data without removing the encryption. 15
Extended definition of secure multi-party computation Input parties Computing parties Result parties x11 IP1 x1 ... xk1 CP1 y1 y1 RP1 x1i ... ... ... yj ... xki x1l IPk xk ... CPl yl ym RPm xkl Step 1: Step 2: Step 3: upload and secure publishing 16 storage of inputs computing of results
Technique: property-preserving cryptography Analogy: symmetric crypto that preserves a relation on inputs (e.g., order, equality). Pros: Low performance overhead. Fits well into existing systems. Cons: Only allows a few operations (e.g., only equality comparison or ordering). Multi-user systems are a challenge, but can be done with proxy re-encryption. 17
Technique: homomorphic encryption Analogy: asymmetric crypto that allows addition and multiplication of ciphertexts. Pros: Fits well into existing systems. Cons: High performance overhead. Multi-user systems are a challenge, but can be done with proxy re-encryption. 18
Technique: garbled circuits Analogy: cryptographic versions of electrical circuits. Pros: Flexible programming model. Cons: Medium performance overhead. Fixed number of parties (can be solved by combining with other techniques). 19
Example: millionaire’s problem 20
Technique: secret sharing Analogy: give a number of people a random piece of each secret value and let them collaborate to compute results. Pros: Low-to-medium performance overhead. Flexible programming model. Cons: Distributed deployments do not fit into all existing systems. 21
Technique: trusted execution environments Analogy: think of a computer process that hides the data from its owner Ik Pros: Minimal performance overhead. Relatively easy to convert applications to work SC C on trusted execution environments Cons: Side-channel attack mitigations are Rn complicated to implement. 22
Lecture exercise: modelling parties for a COVID-19 social distancing tracking application
Lecture task Think of an application that would support social distancing and limit infection rates. Write down very clearly, what is the expected benefit of the system. Write down the list of input parties and the data they would provide. Write down the list of computing parties and describe the kind of processing they would perform. Write down the list of result parties and describe the outputs they would receive. Bonus tasks, time permitting: Think of minimizing personal data processing using process redesign. See if any of the secure computing paradigms described above could support your application. Prepare in 12 minutes and then we’ll have 1-2 students present their ideas. 24
Programmable privacy-preserving computations
PDK as an abstraction of a secure computing paradigm A protection domain kind (PDK) is a set of data representations, algorithms and protocols for storing and computing on protected data. Examples: SMC based on secret sharing, SMC based on garbled circuits, (fully) homomorphic encryption, trusted hardware (e.g., Intel SGX). 26
Protection domain as an instance of a PDK A protection domain (PD) is a set of data that is protected with the same resources and for which there is a well-defined set of algorithms and protocols for computing on that data while keeping the protection. Examples: data held by a fixed group of servers performing secure multi-party computation, data encrypted under a fixed key of a homomorphic encryption scheme. 27
Application model for privacy-preserving computing Secure Privacy- Application Application primitive preserving logic operations algorithms • private outputs from private • publish selected results to inputs, make system useful, • have privacy proofs, • do not leak private inputs or • remain private under show leakage as acceptable, sequential or parallel • compositions of secure composition, primitive operations, • optimized to have a • optimize for running low resource footprint. time. 28
Converting an algorithm to a privacy-preserving one We pick frequent itemset mining as a problem of choice. Frequent itemset mining is a data mining problem that helps with shopping basket analysis and the simplest kinds of recommender systems. What kind of things do people buy from stores together most often? If the service provider knows this, they can recommend one to a customer who is planning to buy the other. The simpler algorithms include Apriori (breadth-first search) and Eclat (depth- first search). We will know look at the basic primitive of frequent itemset mining and then build a privacy-preserving approach. 29
Privacy-preserving data representations Private data representations are the key toward desaigning privacy-preserving algorithms. nasi chicken rendang lontong lemak satay chicken t1 rendang nasi lemak satay t1 1 1 0 1 t2 nasi lemak lontong t2 0 1 1 0 chicken t3 satay t3 0 0 0 1 t4 rendang nasi lemak t4 1 1 0 0 chicken t5 nasi lemak satay t5 0 1 0 1 chicken t6 nasi lemak satay t6 0 1 0 1 t7 lontong t7 0 0 1 0 30
Calculating the support of an item The data representation allows for very efficient calculation of item supports. nasi chicken nasi rendang lontong lemak satay lemak t1 1 1 0 1 1 t2 0 1 1 0 1 t3 0 0 0 1 0 t4 1 1 0 0 1 t5 0 1 0 1 1 t6 0 1 0 1 1 t7 0 0 1 0 0 31 ∑= 5
Calculating support for a set of items Checking the joint support of a pair of items simply requires a multiplication nasi chicken nasi chicken nasi lemak & rendang lontong chicken satay lemak satay lemak satay t1 1 1 0 1 1 x 1 = 1 t2 0 1 1 0 1 x 0 = 0 t3 0 0 0 1 0 x 1 = 0 t4 1 1 0 0 1 x 0 = 0 t5 0 1 0 1 1 x 1 = 1 t6 0 1 0 1 1 x 1 = 1 0 0 1 0 0 x 0 = 0 t7 32 ∑= 3
Evaluating itemsets with a depth-first strategy Depth-first search would be intuitive for pruning. { rendang } { nasi lemak } { lontong } { chicken satay } {rendang, } { nasi lemak rendang, lontong } {rendang, } { chicken satay nasi lemak, lontong } { nasi lemak, } { chicken satay lontong, } chicken satay { } { }{ }{ } rendang, rendang, rendang, nasi lemak, nasi lemak, nasi lemak, lontong, lontong lontong chicken satay chicken satay chicken satay { } rendang, nasi lemak, lontong chicken satay 33
Evaluating itemsets with a breadth-first strategy However, breadth-first search can be done in parallel. { rendang } { nasi lemak } { lontong } { chicken satay } {rendang, } { nasi lemak rendang, lontong } {rendang, } { chicken satay nasi lemak, lontong } { nasi lemak, } { chicken satay lontong, } chicken satay { } { }{ }{ } rendang, rendang, rendang, nasi lemak, nasi lemak, nasi lemak, lontong, lontong lontong chicken satay chicken satay chicken satay { } rendang, nasi lemak, lontong chicken satay 34
Balancing optimizations with privacy preservation Challenge: exploring all possible itemsets leads is slow due to combinatorial explosion. Pruning the search tree requires us to declassify itemset supports during computation (leak?). Solution: consider that the algorithm will publish all frequent itemsets, as that is its intended goal. We will compare support to the threshold privately, only declassifying the result bit. We will prune the search tree based on that bit. Not a leak - if the itemset is frequent, we would have learned it from the outputs anyway. 35
You can also read