Private Continual Release of Real-Valued Data Streams - Date: February 26, 2019 Victor Perrier, Hassan Jameel Asghar, Dali Kaafar Macquarie ...

Page created by Gordon Watson

Current Events

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Private Continual Release of Real-Valued Data Streams - Date: February 26, 2019 Victor Perrier, Hassan Jameel Asghar, Dali Kaafar Macquarie ...

Private Continual Release of
 Real-Valued Data Streams
                   Date: February 26, 2019
                        Victor Perrier,
                    Hassan Jameel Asghar,
                           Dali Kaafar
     Macquarie University, Data61 (CSIRO), ISAE-SUPAERO

Streaming Data and Statistics
• Real-time monitoring of customer data can improve services
   • Real-time updates
   • Analysts/planners can optimize services

          Service     Event                 Real-time statistics
          Energy      Smart-meter reading   Electricity usage in a neighborhood
          Transport   Tap-on/off time       Peak hour commute times
          Retail      Supermarket bill      Average expenditure in a supermarket
          Location    Check-in/out time     Average time spent in a restaurant

                                                                                   2

Issue: Privacy                       Smart meter readings

• Raw stats may reveal
  sensitive events
   • Unusual presence at home
     (smart meter)
   • Trip to beach instead of work
     (transport)

• Events (observations) can be
  linked to real-life activities
  [MSF+10]
                                        Unusual activity [MSF+10]

                                                                    3

✏
                                                                           on a given day. Notice that this is significantly less than
            Since GS = B , we get a linear term in B . We will             the total time possible in a 24 hour period, which is
            use this algorithm as the building block, but we shall         obviously 1,440 minutes.

Privacy-Preserving Statistics
            find ways to improve the dependence on B . Our gain
            will be through the assumption that actual distribution
            of the string      is concentrated far below B . We will                          CDF of observations
            later justify this assumption by analyzing two real-world
            datasets. Then, instead of using the global sensitivity,
• Differential privacy a natural candidate
            we shall use smooth sensitivity tailored to a threshold
               ⌧ B , and therefore we expect considerable gain in
   •   Most⌧utility.
              work     on static databases

                                                                             % readings
   • Some work on binary data streams [DNPR10,
     CSS11]
         A. Motivation
               We consider scenarios where the upper bound B on a
            generic element of the incoming stream is overly conser-
• Our problem
            vative resulting in utility that is begging to be improved
            in practice. Consider the following likely scenarios
   •   Data from
               • Thean    event
                      bound        is real-valued
                              B might                      withinFor
                                         not be known in advance.    a                                                    !
       public upper        a bound!
                         bound
                 instance,           on the expenditure during a trip to                                               public
                                                                           Fig. 2: Histogram of journey times on Sydneybound
                                                                                                                        trains on
                 the supermarket. Any guess on the bound B would
   •   Release updated
                 be taking intosum/average
                                 account instances atofeach   event
                                                        unusually  high
                                                                           a particular day What is the scale of the x-axis.

   •   Event-level     privacy
                 spendings.   This will result in a very conservative
                                                                              Likewise, we see a similar trend in the total expen-
                 upper bound.
        • Peculiar   events protected
               • In some cases, a natural bound B exists. For in-
                                                                           diture during a trip to a supermarket chain in Australia
                                                                                                                        time
                                                                           on a given day. Converted into an empirical cumulative
                 stance, the commute time per day has a natural
                                                                           distribution function (CDF), the two distributions
                                                                                      events                            4       are
                 bound of B = 24 hours. However, most commute
                                                                           shown in Figures 3 and 4 respectively. An interesting

for strings of length n, where                              trains on a particular day (including multiple trips). We
                                     ✓       r                ◆                can see that the peak is around 20 to 30 mins, and very
                                           1      1
                             ↵ = O GS ·       8 ln log2 (n)1.5 .               few customers take more than 150 minutes of train ride
                                           ✏
                                                                               on a given day. Notice that this is significantly less than
How to Release the Average?
              Since GS = B , we get a linear term in B . We will
              use this algorithm as the building block, but we shall
                                                                               the total time possible in a 24 hour period, which i
                                                                               obviously 1,440 minutes.
              find ways to improve the dependence on B . Our gain
              will be through the assumption that actual distribution
                                                                                                    CDF of observations
•   Basic: addofLaplace
                  the string noise      of scalefar!below
                                  is concentrated      to each
                                                            B . We will
    observation
              later justify this assumption by analyzing two real-world
              datasets.
     • Error !"    afterThen,    instead of using the global sensitivity,
                          " events
              we shall use smooth sensitivity tailored to a threshold
              ⌧ ⌧ B , and therefore we expect considerable gain in
•   Generalized
              utility.binary stream algorithm fairs better

                                                                                  % readings
      • Error !log & " [DNPR10, CSS11]
                   A. Motivation
             We consider scenarios where the upper bound B on a
• Problem: generic
           error     still proportional to !
                   element of the incoming stream is overly conser-
      • In manyvative resulting !
                 situations     in is toothat
                                   utility loose    or unknown
                                              is begging to be improved
            • E.g.,inUnlikely
                      practice.someone    commuting
                                Consider the following for  fullscenarios
                                                        likely   24 hours!
                                                                                                         '                         !
                     •  The bound B might not be known in advance. For                               threshold
      •   Most readings        concentrated
                        instance,  a bound on thebelow        during a trip to'
                                                          a threshold
                                                  expenditure                                                                   public
                                                                                Fig. 2: Histogram of journey times on Sydney bound
                                                                                                                                trains on
                        the supermarket. Any guess on the bound B would
                                                                                a particular day What is the scale of the x-axis.
      •                 be  taking  into account
          If ' known, error is only 'log & "     instances of unusually  high
                        spendings.
             • Significant           This will result in a very conservative
                            if !: ' large                                          Likewise, we see a similar trend in the total expen
                        upper bound.
                                                                                diture during a trip to a supermarket chain5 in Australia
                      • In some cases, a natural bound B exists. For in-

Validation of Data Concentration
                                                                              Train trips
• Is data really concentrated well below a
  conceivable !?

                                                            % trips
• Train trips dataset
   • 50 million trips over four weeks (Sydney, Australia)                        minutes
   • Conceivable bound ! = 24 hours
                                                                             Supermarket

• Supermarket dataset

                                                            % transactions
   • 140,000 transactions by 1,000 customers (Australia)
   • Conceivable bound ! = ?

                                                                                  dollars
                                                                                            6

How to Estimate Threshold with Privacy?
• Need to observe a subset ! of observations
  – time lag                                                                Stream Readings
                                                   90
                                                   80

• Time lag needs to be optimized for accuracy      70

                                                   60
   • Too early: high outlier error                 50

   • Too late: marginal gain (may just use " as    40
                                                   30
     estimate)                                     20

                                                   10
                                                   0
                                                        1   2   3   4   5     6   7   8   9   10   11   12   13   14   15

• Naively estimating # violates privacy                                    !
                                                                        Time lag
                                                                                                                       $

   • E.g., maximum of ! observations is an exact
     event!

                                                                                                             7

Our Work
• A method to estimate threshold ! using a subset of observations
   • With differential privacy
   • and utility optimized for moving average

• Mechanism is generic – can also be used for
   • Average over a sliding window
   • Releasing histogram of streaming data
   • Estimating scale of distribution

                                                                    8

Background: Binary Tree Algorithm

• Binary tree (BT) algorithm [DNPR10, CSS11]
  • Find at most log $% nodes in tree whose union
    equals sum up to & events
                                 '()*+ ,                ',
  • Add Laplace noise of scale      -
                                           instead of    -

• Goal: Use BT as sub-module but noise
  scaled to . instead of /
                                                             Computing private sum of first 7 observations

                                                                                                   9

Global Mechanism
1. Estimate threshold ! using first " observations using budget #$

                                    %
2. Use Laplace noise with scale to release sum of first "
                               &'
   observations

3. Update & release sum for each event after " with Laplace noise of
   scale ! log + ,/# using BT algorithm

• Overall: (#, 0)-differential privacy

                                                                       10

What are the Choices for Threshold?
• False starts
   • Differentially private max of ! values?
      • max function is highly sensitive
      • Adjacent streams can differ by any value in [0, %]        Distribution of '
   • Standard deviation of distribution of '?
      • Need to know distribution in advance

                                                                      ( fraction of values
• Statistic of choice: (-quantile
   • E.g., ( = 0.005 (0.5% of values)
                                                                ,(
                                                             (-quantile

                                                                                      11

Privately Estimating !-Quantile
• Need to estimate !-quantile through first " readings
   • Satisfying # ≫ " ≫ 1/!                     required for stable estimate of !

• Roadmap
   • Obtain the empirical estimate (' ! of (!
   • Add differentially private noise to (' !
   • Set the result as threshold )

• Complication: cannot use Global Sensitivity (GS) for DP noise
   • Maximum change in function over all adjacent streams
   • GS of !-quantile is close to *

                                                                                    12

Using Smooth Sensitivity
• Local sensitivity (LS)
   • Maximum change in !-quantile over streams adjacent to input stream only
   • Unfortunately, LS itself can be sensitive
      • E.g., big differences in LS over nearby streams

• Smooth sensitivity (SS) [NRS07]
   • " #, # % : Hamming distance between streams # and # %
                          234  .,. /
   • SS(#, () = max./ {1             5 LS./ }
      • Smooths out change in LS as we move away from input stream

                                                                               13

Privately Obtaining the Threshold
• Obtain threshold as

     ! = $# %+ Laplace noise with SS

• We have swept some details under the
  rug
   • $# % and ! should be ≥ $% to bound error
   • We assume $# % ≥ $%

                                                14

Utility Analysis
• Light-tailed distributions                               Train trips dataset
    • Lighter than exponential distribution with the
      same !-quantile

• True for train trips and supermarket datasets
  for sufficiently small !

• If distribution is light-tailed
    • We show that error "log & '/) (as required)
    • Note: Privacy definition not dependent on
      distribution assumption                          ! = 0.005-quantile

                                                                                 15

Utility Analysis for Light-tailed Distributions
• Exponential distribution has the property
                          !" # $ ≥ !&' for all $ ≥ 1
   • For light-tailed distributions: !) " # $ ≥ !&'

• Idea:
   • Estimate "-quantile using 1/" readings
                                                                  !"           !& '
   • Set threshold + to !) " # $
   • Benefits:                                                             ≈ + = !) " # $
          • Estimate threshold with a much smaller time lag ,
          • Minimise outlier error                                            project
           .                                                    estimate
   • -         log 3 4
           /

                                                                                   16

What Values to Use in Practice?
• Improvement Factor (IF) metric
   • Ratio of error through BT versus       Impact of epsilon   Impact of time lag
     our method

• Epsilon: IF increases with larger !
  but then drops
   • Due to truncation: any value greater
     than threshold is fixed to threshold

• Time lag: Noticeable increase in
  impact factor with " ≈ 50,000
                                                        !                "
                                                                               17

Heuristics for Choosing Parameters
• Optimization suggests

          Parameter   Interpretation                         Value
              !       !-quantile                             0.005
              "       Shifting !-quantile                    Between 1 and 2
             #$       Budget to estimate threshold           0.8 of overall privacy budget
             #%       Budget to release sum of first & terms Derive from #$
             &        Time lag                               50,000

                                                                                             18

Experimental Evaluation
                                     Train trips dataset
• Max error on the sum (at step !)
   • 20k repetitions

• Train trips
   •   ! = 250, 000, 000
   •   " = 50,000
   •   # = 1440 mins (24 hrs)
   •   Improvement factor: 3.5
                                     Supermarket dataset

• Supermarkets
   •   ! = 150, 000
   •   " = 50,000
   •   # = 3,000 dollars
   •   Improvement factor: 9

                                                           19

Discussion
• Improved private release of moving average if distributions are light-tailed

• Question: which data have light-tailed distribution?
   • Any data coming from short-lived, time constrained events
       •   Smart-meter data
       •   Phone-call durations
       •   Length of posts (on social media)
       •   Daily average inter-arrivals of check-in times

• Heavy-tailed distributions are not “directly” time-constrained
   • Income distribution
   • File sizes in computer systems

                                                                                 20

Conclusion
• Shown a way to privately estimate the bulk of a distribution of streaming real-
  valued data

• Can be estimated by sacrificing a time lag

• Heuristics for choosing parameters in practice

• In worst-case, threshold is close to public bound !
   • We do not need to abort as in the propose-test-release approach [DL09]

• Moving average release is just one application – can be used in other applications
                                                                                    21

Questions

            22

References
• [CSS11]Chan, T.H.H., Shi, E. and Song, D., 2011. Private and continual release of
  statistics.ACM Transactions on Information and System Security (TISSEC), 14(3), p.26.
• [DL09] Dwork, C. and Lei, J., 2009, May. Differential privacy and robust statistics.
  In STOC (Vol. 9, pp. 371-380).
• [DNPR10] Dwork, C., Naor, M., Pitassi, T. and Rothblum, G.N., 2010, June. Differential
  privacy under continual observation. In Proceedings of the forty-second ACM symposium
  on Theory of computing (pp. 715-724). ACM.
• [MSF+10] Molina-Markham, A., Shenoy, P., Fu, K., Cecchet, E. and Irwin, D., 2010,
  November. Private memoirs of a smart meter. In Proceedings of the 2nd ACM workshop
  on embedded sensing systems for energy-efficiency in building (pp. 61-66). ACM.
• [NRS07] Nissim, K., Raskhodnikova, S. and Smith, A., 2007, June. Smooth sensitivity and
  sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on
  Theory of computing (pp. 75-84). ACM.

                                                                                             23

You can also read