CHEETAH: An Ultra-Fast, Approximation-Free, and Privacy-Preserved Neural Network Framework based on Joint Obscure Linear and Nonlinear Computations
Qiao Zhang, Cong Wang, Chunsheng Xin, and Hongyi Wu

Qiao Zhang, Chunsheng Xin, and Hongyi Wu are with the Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA, 23529. Cong Wang is with the Department of Computer Science, Old Dominion University, Norfolk, VA, 23529. E-mail: {qzhan002, c1wang, cxin, h1wu}@odu.edu

arXiv:1911.05184v2 [cs.LG] 11 Feb 2021

Abstract—Machine Learning as a Service (MLaaS) is enabling a wide range of smart applications on end devices. However, such convenience comes with a cost of privacy because users have to upload their private data to the cloud. This research aims to provide effective and efficient MLaaS such that the cloud server learns nothing about user data and the users cannot infer the proprietary model parameters owned by the server. This work makes the following contributions. First, it unveils the fundamental performance bottleneck of existing schemes due to the heavy permutations in computing linear transformation and the use of communication-intensive Garbled Circuits for nonlinear transformation. Second, it introduces an ultra-fast secure MLaaS framework, CHEETAH, which features a carefully crafted secret sharing scheme that runs significantly faster than existing schemes without accuracy loss. Third, CHEETAH is evaluated on the benchmark of well-known, practical deep networks such as AlexNet and VGG-16 on the MNIST and ImageNet datasets. The results demonstrate more than 100× speedup over the fastest GAZELLE (Usenix Security'18), 2000× speedup over MiniONN (ACM CCS'17) and five orders of magnitude speedup over CryptoNets (ICML'16). This significant speedup enables a wide range of practical applications based on privacy-preserved deep neural networks.

Index Terms—privacy; machine learning as a service; secure two party computation; joint obscure neural computing

1 INTRODUCTION

From Alexa and Google Assistant to self-driving vehicles and Cyborg technologies, deep learning is rapidly advancing and transforming the way we work and live. It is becoming prevalent and pervasive, embedded in many systems, e.g., for pattern recognition [1], medical diagnosis [2], speech recognition [3] and credit-risk assessment [4]. In particular, the deep Convolutional Neural Network (CNN) has demonstrated superior performance in computer vision tasks such as image classification [5], [6] and facial recognition [7], among many others.

Since training a deep neural network model is resource-intensive, cloud providers have begun to offer Machine Learning as a Service (MLaaS) [8], where a proprietary model is trained and hosted on clouds, and clients make queries (inference) and receive results through a web portal. While this emerging cloud service is embraced as an important tool for efficiency and productivity, the interaction between clients and cloud servers creates new vulnerabilities for unauthorized access to private information. This work focuses on ensuring privacy-preserved yet efficient inference in MLaaS.

Although communication can be readily secured from end to end, privacy still remains a fundamental challenge. On the one hand, the clients must submit their data to the cloud for inference, but they want the data privacy well protected, preventing a curious cloud provider from mining valuable information. In many domains such as health care [9] and finance [10], data are extremely sensitive. For example, when patients transmit their physiological data to the server for medical diagnosis, they do not want anyone (including the cloud provider) to see it. Regulations such as the Health Insurance Portability and Accountability Act (HIPAA) [11] and the recent General Data Protection Regulation (GDPR) in Europe [12] have been put in place to impose restrictions on sharing sensitive user information. On the other hand, cloud providers do not want users to be able to extract their proprietary, valuable model that has been trained with significant resources and effort, as it may turn customers into one-time shoppers [13]. Furthermore, the trained model contains private information about the training data set and can be exploited by malicious users [14], [15], [16]. To this end, there is an urgent need to develop effective and efficient schemes to ensure that, in MLaaS, a cloud server does not have access to users' data and a user cannot learn the server's model.

1.1 Retrospection: Evolvement of Privacy-Preserved Neural Networks

The quest began in 2016 when CryptoNets [17] was proposed to embed Homomorphic Encryption (HE) [29] into CNN. It was the first work that successfully demonstrated the feasibility of computing inference over Homomorphically encrypted data. While the idea is conceptually straightforward, its prohibitively high computation cost renders it
impractical for most applications that rely on non-trivial deep neural networks with a practical size in order to characterize complex feature relations [6]. For instance, CryptoNets takes about 300s to compute inference even on a simple three-layer CNN architecture. With the increase of depth, the computation time grows exponentially. Moreover, several key functions in neural networks (e.g., activation and pooling) are nonlinear. CryptoNets had to use Taylor approximation, e.g., replacing the original activation function with a square function. Such approximation leads to not only degraded accuracy compared with the original model, but also instability and failure in training.

Following CryptoNets, the past two years have seen a multitude of works aiming to improve the computation accuracy and efficiency (as summarized in Table 1). A neural network essentially consists of two types of computations, i.e., linear and nonlinear computations. The former focuses on matrix calculation to compute the dot product (for fully-connected dense layers) and convolution (for convolutional layers). The latter includes nonlinear functions such as activation, pooling and softmax. A series of studies have been carried out to accelerate the linear computation, the nonlinear computation, or both.

TABLE 1
Comparison of Privacy-Preserved Neural Networks.

Scheme | Scheme for Linear Computation | Scheme for Non-Linear Computation | Speedup over [17]
CryptoNets [17] | HE | HE (square approx.) | –
Faster CryptoNets [18] | HE | HE (polynomial approx.) | 10×
GELU-Net [19] | HE | Plaintext (no approx.) | 14×
E2DM [20] | Packed HE & Matrix optimization | HE (square approx.) | 30×
SecureML [21] | HE & Secret share | GC (piecewise linear approx.) | 60×
Chameleon [22] | Secret share | GMW & GC (piecewise linear approx.) | 150×
MiniONN [23] | Packed HE & Secret share | GC (piecewise linear approx.) | 230×
DeepSecure [24] | GC | GC (polynomial approx.) | 527×
SecureNN [25] | Secret share | GMW (piecewise linear approx.) | 1000×
FALCON [26] | Packed HE with FFT | GC (piecewise linear approx.) | 1000×
XONN [27] | GC | GC (piecewise linear approx.) | 1000×
GAZELLE [28] | Packed HE & Matrix optimization | GC (piecewise linear approx.) | 1000×
CHEETAH | Packed HE & Obscure matrix cal. | Obscure HE & SS (no approx.) | 100,000×

For example, Faster CryptoNets [18] leveraged sparse polynomial multiplication to accelerate the linear computation. It achieved about 10 times speedup over CryptoNets. SecureML [21], Chameleon [22] and MiniONN [23] adopted a similar design concept. Among them, MiniONN achieved the highest performance gain. It applied Secret Share (SS) for linear computation, and packed HE [30] to pre-share a noise vector between the client and server offline, in order to cancel the noise during secure online computation. In [23], non-linear functions were approximated by piece-wise linear segments, and computed by using Garbled Circuits (GC), which resulted in 230 times speedup over CryptoNets. DeepSecure [24] took an all-GC approach, i.e., implemented both linear and nonlinear computations using GC. It optimized the gates in the traditional GC module to achieve a speedup of 527 times over CryptoNets. Finally, GAZELLE [28] focused on the linear computation, to accelerate the matrix-vector multiplication based on packed HE, such that Homomorphic computations can be efficiently parallelized on multiple packed ciphertexts. GAZELLE demonstrated an impressive speedup of about 20 times compared with MiniONN and three orders of magnitude faster than CryptoNets. So far, GAZELLE is considered the state-of-the-art framework for secure inference computation.

Two recent works unofficially published on arXiv reported new designs that achieved computation speed at the same order of magnitude as GAZELLE. FALCON [26] leveraged the fast Fourier Transform (FFT) to accelerate linear computation. Its computing speed is similar to GAZELLE, while the communication cost is higher. SecureNN [25] adopted a design philosophy similar to Chameleon and MiniONN, but exploited a 3-party setting to accelerate the secure computation, obtaining a 4 times speedup over GAZELLE at the cost of using a semi-trusted third party. Additionally, XONN [27] worked in line with DeepSecure to explore the GC-based design for Binary Neural Networks (BNN), achieving up to 7 times speedup over GAZELLE, at the cost of an accuracy drop due to the binary quantization in BNN.

In addition, a few approaches were introduced to not just improve computation efficiency but also provide other desirable features. For example, GELU-Net [19] aims to avoid approximation of non-linear functions. It partitioned computation onto non-colluding parties: one party performs the linear computations on encrypted data, and the other executes the nonpolynomial computation in an unencrypted but secure manner. It showed over 14 times speedup over CryptoNets and does not have accuracy loss. E2DM [20] aimed to encrypt both data and neural network models, assuming the latter are uploaded by users to untrusted clouds. It focused on matrix optimization by combining Homomorphic operation and ciphertext permutation, demonstrating 30 times speedup over CryptoNets.

1.2 Contribution of This Work

Despite the fast and promising improvement in computation speed, there is still a significant performance gap to apply privacy-preserved neural networks in practical applications. The time constraints in many real-time applications (such as speech recognition in Alexa and Google Assistant) are within 10 seconds [31], [32]; self-driving cars even demand an immediate response of less than a second [33]. In contrast, our benchmark has shown that GAZELLE, which has achieved the best performance so far in terms of
3 speed among existing schemes, takes 161s and 1731s to run solutions rely on piece-wise or polynomial approximation the well-known practical deep neural networks AlexNet [5] for nonlinear functions such as activation. This leads to and VGG-16 [6], which renders it impractical in real-world degraded accuracy and the accuracy loss is often significant. applications. The proposed scheme takes a secret sharing approach with In this paper, we propose CHEETAH, an ultra-fast, 0-multiplicative-depth packed HE to avoid the use of com- secure MLaaS framework that features a carefully crafted putationally expensive GC. A novel design is developed to secret sharing scheme to enable efficient, joint linear and allow the server and client to each obtain a share of Homo- nonlinear computation, so that it can run significantly faster morphic encrypted nonlinear transformation result based than the state-of-the-art schemes. It eliminates the need on the obscure linear transformation as discussed above. to use approximation for nonlinear computations; hence, This approach eliminates the need to use approximation for unlike the existing schemes, CHEETAH does not have accu- nonlinear functions and achieves enormous speedup. For racy loss. It, for the first time, reduces the computation delay example, it is 1793 times faster than GAZELLE in comput- to milliseconds and thus enables a wide range of practical ing the most common nonlinear ReLu activation function, applications to utilize privacy-preserved deep neural net- under the output dimension of 10K. works. To the best of knowledge, this is also the first work Overall, the proposed CHEETAH is an ultra-fast privacy- that demonstrates privacy-preserved inference based on the preserved neural network inference framework without well-known, practical deep architectures such as AlexNet accuracy loss. It enables obscure neural computing that and VGG. intrinsically merges the calculation of linear and nonlinear The significant performance improvement of CHEETAH transformations and effectively reduces the computation stems from a creative design, called joint obscure neural time. We benchmark the performance of CHEETAH with computing. Computations in neural networks follow a se- well-known deep networks for secure inference. Our results ries of operations alternating between linear and nonlinear show that it is 218 and 334 times faster than GAZELLE, transformations for feature extraction. Each operation takes respectively, for a 3-layer and a 4-layer CNN used in pre- the output from the previous layer as the input. For exam- vious works. It achieves a significant speedup of 130 and ple, the nonlinear activation is computed on the weighted 140 times, respectively, over GAZELLE in the well-known, values of linear transformations (i.e., the dot product or practical deep networks AlexNet and VGG-16. Compared convolution). All existing approaches discussed in Sec. 1.1 with CryptoNets, CHEETAH achieves a speedup of five essentially follow the same framework, aiming to securely orders of magnitudes. compute the results for each layer and then propagate to The rest of the paper is organized as follows. Section 2 in- the next layer. This seemingly logic approach, however, troduces the system and threat models. Section 3 elaborates becomes the fundamental performance hurdle as revealed the system design of CHEETAH, followed by the security by our analysis. analysis in Section 4. 
Experimental results are discussed in First, although matrix computation has been deeply op- Section 5. Finally, Section 6 concludes the paper. timized based on packed HE for the linear transformation in the state-of-the-art GAZELLE, it is still costly. The com- 2 S YSTEM AND T HREAT M ODELS putation time of the linear transformation is dominated by In this section, we introduce the overall system architecture the operation called ciphertext permutation (or Perm) [28], and threat model, as well as the background knowledge which generates the sum based on a packed vector. It is about cryptographic tools used in our design. required in both convolution (for a convolutional layer) and dot product (for a dense layer). From our experiments, one Perm is 56 times slower than one Homomorphic addition 2.1 System Model and 34 times slower than one Homomorphic multiplica- We consider a MLaaS system as shown in Fig. 1. The client is tion. We propose an approach to enable an incomplete the party that generates or owns the private data. The server (or obscure) linear transformation result to propagate to is the party that has a well-trained deep learning model and the next nonlinear transformation as the input to continue provides the inference service based on the client’s data. For the neural computation, reducing the number of ciphertext example, a doctor performs a chest X-ray for her patient and permutations to zero in both convolution and linear dot sends the X-ray image to the server on the cloud, which runs product computation. the neural network model and returns the inference result Second, most existing schemes (including GAZELLE) to assist the doctor’s diagnosis. adopted GC to compute the nonlinear transformation (such While various deep learning techniques can be em- as activation, pooling and softmax), because GC generally ployed to enable MLaaS, we focus on the Convolutional performs better than HE when the multiplicative depth is Private chest X-ray greater than 0 (i.e., nonlinear) [28]. However, the GC-based approach is still costly. The overall network must be repre- Client Server sented as circuits and involves interactive communications between two parties to jointly evaluate neural functions over their private inputs. The time cost is often significant for large and deep networks. Specifically, our benchmark Inference result: shows that it takes about 263 seconds to compute a nonlin- Pneumonia ear ReLu function with 3.2M input values, which is part of the VGG-16 framework [6]. Moreover, all existing GC-based Fig. 1. An overview of the MLaaS system.
Neural Network (CNN), which has achieved wide success and demonstrated superior performance in computer vision tasks such as image classification [5], [6] and face recognition [7]. A CNN consists of a stack of layers that learn a complex relation among the input data, e.g., the relations between pixels of an input image. It operates on a sequence of linear and nonlinear transformations to infer a result, e.g., whether an input medical image indicates that the patient has pneumonia. The linear transformations come in two typical forms: dot product and convolution. The nonlinear transformations leverage activations such as the Rectified Linear Unit (ReLu) to approximate complex functions [34] and pooling (e.g., max pooling and mean pooling) for dimensionality reduction. A CNN repeats the linear and nonlinear transformations recursively to reduce the high-dimensional input data to a low-dimensional feature vector for classification at the fully connected layer. Without losing generality, we use image classification as an example in the following discussion, aiming to provide a lucid understanding of the CNN architecture as illustrated in Fig. 2.

Convolutional Layer. As shown in Fig. 2(b), the input to a convolutional layer has the dimensions wi × hi × ci, where wi and hi are the width and height of the input feature map and ci is the number of feature maps (or channels). For the first layer, the feature maps are simply the input images. Hereafter, we use the subscript i to denote input and o to denote output. The input is convolved with co groups of kernels. The size of each group of kernels is kp × kq × ci, in which kp and kq are the width and height of the kernel. The number of channels of the kernel group must match that of the input, i.e., ci. The convolution produces the feature output, with a size of wo × ho × co. More specifically, the (m, n)-th element in the t-th (1 ≤ t ≤ co) output feature is calculated as follows:

z(m, n, t) = Σ_{j=1}^{ci} Σ_{u=0}^{kp−1} Σ_{v=0}^{kq−1} k(u, v, j, t) x(m − u, n − v, j),   (1)

where k and x are the kernel and input, respectively. For ease of description, we omit the bias in Eq. (1). Nevertheless, it can be easily transformed into the convolution or weight matrix multiplication [35].

Fig. 2. A three-layer CNN: (a) overall network structure, (b) convolutional layer, (c) pooling, (d) fully connected layer.

The last convolutional layer is typically connected with the fully-connected layer, which computes the weighted sum, i.e., a dot product between the weight matrix w of size no × ni and a flattened feature vector of size ni × 1. The output is a vector of size no × 1. Each element of the output vector is calculated below:

z(i) = Σ_{j=1}^{ni} w(i, j) x(j).   (2)

Activation. Nonlinear activation is applied to convolutional and weighted-sum outputs in an elementwise manner. In this work, we mainly target the ReLu activation function, f(x) = max{0, x}, which is widely adopted in state-of-the-art neural networks such as AlexNet [5] and VGG-16 [6].

Pooling. Pooling conducts downsampling to reduce dimensionality. In this work, we consider mean pooling, which is implemented in CryptoNets and also commonly adopted in state-of-the-art CNNs. It splits a feature map into regions and averages the regional elements. Compared to max pooling (another pooling function, which selects the maximum value in each region), the authors in [36] have claimed that while the max and mean pooling functions are rather similar, the use of mean pooling encourages the network to identify the complete extent of the object, which builds a generic localizable deep representation that exposes the implicit attention of CNNs on an image.
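To make the four building blocks above concrete, the following plaintext C++ sketch evaluates a single-channel variant of the convolution in Eq. (1), the weighted sum of Eq. (2), ReLu, and mean pooling on small float matrices. It is purely illustrative and is not taken from the CHEETAH implementation: the function names are ours, there is no encryption, no bias, a single channel, and no zero padding (so the input must be at least as large as the kernel, unlike the padded 2 × 2 / 3 × 3 toy example used later in the paper).

```cpp
#include <vector>
#include <algorithm>

using Mat = std::vector<std::vector<float>>;

// Single-channel, padding-free variant of Eq. (1):
// z[m][n] = sum_{u,v} k[u][v] * x[m+u][n+v]   (input must be >= kernel size).
Mat conv2d(const Mat& x, const Mat& k) {
    int ho = (int)x.size() - (int)k.size() + 1;
    int wo = (int)x[0].size() - (int)k[0].size() + 1;
    Mat z(ho, std::vector<float>(wo, 0.0f));
    for (int m = 0; m < ho; ++m)
        for (int n = 0; n < wo; ++n)
            for (size_t u = 0; u < k.size(); ++u)
                for (size_t v = 0; v < k[0].size(); ++v)
                    z[m][n] += k[u][v] * x[m + u][n + v];
    return z;
}

// Eq. (2): weighted sum of a fully-connected layer, z(i) = sum_j w(i,j) x(j).
std::vector<float> dense(const Mat& w, const std::vector<float>& x) {
    std::vector<float> z(w.size(), 0.0f);
    for (size_t i = 0; i < w.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j)
            z[i] += w[i][j] * x[j];
    return z;
}

// ReLu activation f(x) = max{0, x}, applied elementwise.
float relu(float v) { return std::max(0.0f, v); }

// Mean pooling over non-overlapping s x s regions.
Mat mean_pool(const Mat& a, int s) {
    Mat out(a.size() / s, std::vector<float>(a[0].size() / s, 0.0f));
    for (size_t m = 0; m < out.size(); ++m)
        for (size_t n = 0; n < out[0].size(); ++n) {
            float sum = 0.0f;
            for (int u = 0; u < s; ++u)
                for (int v = 0; v < s; ++v)
                    sum += a[m * s + u][n * s + v];
            out[m][n] = sum / (s * s);
        }
    return out;
}
```

A CNN layer in this plaintext view is simply conv2d (or dense) followed by relu on every element, optionally followed by mean_pool; the rest of the paper is about performing exactly these steps when x is encrypted and k, w are private.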
2.2 Threat Model

Similar to [21], [23], [24], [28], we adopt the semi-honest model, in which both parties try to learn additional information from the messages received (assuming they have bounded computational capability). That is, the client C and server S will follow the protocol, but C wants to learn the model parameters and S attempts to learn the data. Hence, the goal is to make the server oblivious of the private data from the clients, and to prevent the client from learning the model parameters of the server. We prove that the proposed framework is secure under semi-honest corruption using ideal/real security [37]. Our framework targets protecting clients' sensitive data and service providers' models, which have been trained by the service providers with significant resources (e.g., private training data and computing power). Protecting a model is usually sufficient through protecting the model parameters, which are the most critical information for a model. Moreover, many applications are even built on well-known deep network structures such as AlexNet [5], VGG16/19 [6] and ResNet50 [38]. Hence it is typically not necessary to protect the structure (number of layers, kernel size, etc.). In the case that the implemented structure is proprietary and has to be protected, service providers can introduce redundant layers and kernels to hide the real structure [23], [28].

There is also an array of emerging attacks on the security and privacy of neural networks [13], [14], [16], [39], [40], [41]. They can be further classified by the processes that they target: training, inference (model) and input.
5 (1) Training. The attack in [16] attempts to steal the hy- homomorphic addition (Add), multiplication (Mult) and perparameters during training. The membership inference permutation (Perm). Add([x],[y]) outputs a ciphertext [x+y] attack [14] wants to find out whether an input belongs to which encrypts the elementwise sum of x and y . Mult([x],u) the training set based on the similarities between models outputs a ciphertext [x ◦ u] which encrypts the elementwise that are privately trained or duplicated by the attacker. This multiplication of x and plaintext u. It is worth pointing paper focuses on the inference stage and does not consider out that CHEETAH is designed to require multiplication such attacks in training, since the necessary variables for between a ciphertext and a plaintext only, but not the launching these attacks have been released in memory and much more expensive multiplication between two cipher- the training API is not provided. texts. Perm([x]) permutes the n elements in [x] into another (2) Model. The model extraction attack [13] exploits the ciphertext [xπ ], where xπ = (x(π 0 ), x(π 1 ), · · · ) and πi is a linear transformation at the inference stage to extract the permutation of {0, 1, · · · , n − 1}. model parameters and the model inversion attack [39] at- The run-time complexities of Add and Mult are signifi- tempts to deduce the training sets by finding the input that cantly lower than Perm. From our experiments, one Perm is maximizes the classification probability. The success of these 56 times slower than one Add and 34 times slower than one attacks requires full knowledge of the softmax probability Mult. This observation motivates the design of CHEETAH, vectors. To mitigate them, the server can return only the which completely eliminates permutations in convolution predicted label but not the probability vector or limits and dot product transformations, thus substantially reduc- the number of queries from the attacker. The Generative ing the overall computation time. Adversarial Networks (GAN) based attacks [40] can recover It is worth pointing out that neural networks always the training data by accessing the model. In this research, deal with floating point numbers while the PHE is in the since the model parameters are successfully protected from integer domain. Specifically, neural networks typically use the clients, this attack can be defended effectively. real number arithmetic, not modular arithmetic. On the (3) Input. A plethora of attacks adopt adversarial exam- other hand, direct increasing plaintext modulus in PHE ples by adding a small perturbation to the input in order to increases noise budget consumption, and also decreases the cause the neural network to misclassify [41]. Since rational initial noise budget, which causes limited Homomorphic clients pay for prediction services, it is not of their interest to operations. As for the original floating point numbers in obtain an erroneous output. Thus, this attack does not apply neural networks, they are firstly quantized into 8-bit signed in our framework. integers with fix-point encoding. As for the transforma- tion from fix-point number to integer, our implementation 2.3 Cryptographic Tools adopts the highly efficient encoding for BFV in Microsoft The proposed privacy-preserved deep neural network SEAL library [44] to establish a mapping from real numbers framework, i.e., CHEETAH, employs two fundamental in neural network to plaintext elements in PHE. 
This makes cryptographic tools as outlined below. real number arithmetic workable in PHE without data over- (1) Packed Homomorphic Encryption. Homomorphic En- flow. Thereafter, our design is described in floating point cryption (HE) is a cryptographic primitive that supports domain with real number input. meaningful computations on encrypted data without the (2) Secret Sharing. In the secret sharing protocol, a value decryption key. It has found increasing applications in data is shared between two parties, such that combining the two communication, storage and computation [42]. Traditional secrets yields the true value [22]. In order to additively HE operates on individual ciphertext [19], while the packed share a secret m, a random number, s, is selected and two homomorphic encryption (PHE) enables packing of multiple shares are created as hmi0 = s and hmi1 = m − s. Here, values into a single ciphertext and performs component- m can be either plaintext or ciphertext. A party that wants wise homomorphic computation in a Single Instruction to share a secret sends one of the shares to the other party. Multiple Data (SIMD) manner [43] to take the advantages To reconstruct the secret, one needs to only add two shares of parallelism. Among various PHE techniques, our work m = hmi0 + hmi1 . builds on the private-key Brakerski-Fan-Vercauteren (BFV) While the overall idea of secret share (SS) is straightfor- scheme [30], which involves four parameters1: 1) ciphertext ward, creative designs are often required to enable its ef- modulus q , 2) plaintext modulus p, 3) number of ciphertext fective application in practice, because in many applications slots n, and 4) a Gaussian noise with a standard deviation σ . the two parties need to perform complex nonlinear compu- The secure computation involves two parties, i.e., the client tation on their respective shares and thus it is non-trivial to C and server S . reconstruct the final result based on the computed shares. In PHE, the encryption algorithm encrypts a plaintext Due to this fundamental hurdle, the existing approaches message vector x from Zn into a ciphertext [x] with n discussed in Sec. 1.1 predominately chose to use GC, instead slots. We denote [x]C and [x]S as the ciphertexts encrypted of SS, to implement the nonlinear functions. However, GC is by client C and server S , respectively. The decryption al- computationally costly for large input [24], [28], [45]. Specifi- gorithm returns the plaintext vector x from the cipher- cally, our benchmark shows that GC takes about 263 seconds text [x]. Computation can be performed on the cipher- to compute a nonlinear ReLu function with 3.2M input text. In a general sense, an evaluation algorithm inputs values, which is part of the VGG-16 framework [6]. In this several ciphertexts [x1 ], [x2 ], · · · and outputs a ciphertext work, we propose a creative PHE-based SS for CHEETAH [x′ ] = f ([x1 ], [x2 ], · · · ). The function f is constructed by to implement secret nonlinear computation, which requires only 21 round communication for each nonlinear function, 1. The readers are referred to [28] for more detail. thus achieving multiple orders of magnitude reduction of
the computation time. For example, CHEETAH achieves a speedup of 1793 times over GAZELLE in computing the nonlinear ReLu function.

3 DESIGN OF PRIVACY PRESERVED INFERENCE

A neural network is organized into layers. For example, a CNN consists of convolutional layers and fully-connected dense layers. Each layer includes a linear transformation (i.e., a weighted sum for a fully-connected dense layer or a convolution for a convolutional layer), followed by a nonlinear transformation (such as activation and pooling). All existing schemes intend to securely compute the results of the linear transformation first, and then perform the nonlinear computation. Although it appears logical, such a design leads to a fundamental performance bottleneck as discussed in Sec. 1. The proposed approach, CHEETAH, is based on a creative design, named joint obscure neural computing, which only computes a partial linear transformation output and uses it to complete the nonlinear transformation. It achieves several orders of magnitude speedup compared with existing schemes.

We introduce the basic idea of CHEETAH via a simple example based on a two-layer CNN (with a convolutional layer and a dense layer), which can be formulated as follows:

z = w · f(k ∗ x),   (3)

where f(·) is the activation function, x is the 2 × 2 input data, k is a 3 × 3 kernel for the convolutional layer, ∗ stands for convolution and w is the weight matrix for the dense layer:

x = [ x(1,1)  x(1,2) ]
    [ x(2,1)  x(2,2) ],

k = [ k(1,1)  k(1,2)  k(1,3) ]
    [ k(2,1)  k(2,2)  k(2,3) ]
    [ k(3,1)  k(3,2)  k(3,3) ],  and

w = [ w(1,1)  w(1,2)  w(1,3)  w(1,4) ]
    [ w(2,1)  w(2,2)  w(2,3)  w(2,4) ].

Note that while we use the simple two-layer CNN to lucidly describe the main idea, CHEETAH is applicable to any neural network with any layer structure and input data size. In the rest of this section, we first present CHEETAH for a Single Input Single Output (SISO) convolutional layer and then discuss the cases of Multiple Input Multiple Output (MIMO) convolution and fully connected dense layers.

3.1 SISO Convolutional Layer

The process of convolution can be visualized as placing the kernel at different locations of the input data. At each location, an element-wise sum of products is computed between the kernel and the corresponding data values. If the convolution of the above example, i.e., k ∗ x, is computed in plaintext, the result, denoted as Con, should include four elements, Con = [Con1, Con2, Con3, Con4]:

Con1: k(2,2)x(1,1) + k(2,3)x(1,2) + k(3,2)x(2,1) + k(3,3)x(2,2),
Con2: k(2,1)x(1,1) + k(2,2)x(1,2) + k(3,1)x(2,1) + k(3,2)x(2,2),
Con3: k(1,2)x(1,1) + k(1,3)x(1,2) + k(2,2)x(2,1) + k(2,3)x(2,2),
Con4: k(1,1)x(1,1) + k(1,2)x(1,2) + k(2,1)x(2,1) + k(2,2)x(2,2).

In the problem setting of secure MLaaS (as introduced in Sec. 2), the client C owns the data x, while the server S owns the CNN model (including k and w). The goal is to ensure that the server does not have access to x and the client cannot learn the server's model parameters. To this end, in GAZELLE, C encrypts x into [x]C using HE and sends it to S. In the following discussion, both server and client use private-key BFV encryption [30]. The subscript [·]C denotes a ciphertext encrypted by the client's private key, while [·]S denotes a ciphertext encrypted by the private key of the server.

S performs HE computation to calculate the convolution k ∗ [x]C. To accelerate the computation, packed HE is employed. For example, to compute the first element of the convolution (i.e., Con1), a single ciphertext can be created to contain the vector [x(1,1), x(1,2), x(2,1), x(2,2)]C. On the other hand, a packed plaintext vector is created for [k(2,2), k(2,3), k(3,2), k(3,3)]. The packed HE supports the computation of element-wise multiplication between the two vectors in a single operation, yielding a single ciphertext for the vector [k(2,2)x(1,1), k(2,3)x(1,2), · · · , k(3,3)x(2,2)]C. However, we still need to add the vector's elements together to compute Con1. Since the vector is in a single ciphertext, direct addition is not possible. GAZELLE uses permutation (Perm) to compute the sum [28]. For example, given a ciphertext that has four elements, it is first permuted such that the last two elements are moved to the first two slots in the ciphertext. Then the permuted ciphertext is added to the original counterpart, which results in a ciphertext whose first two elements are the partial sums of the four elements. Then that added ciphertext is permuted such that its second element is moved to the first slot. The sum of the four elements is obtained by adding the permuted ciphertext and the non-permuted one. The resultant sum is in the first slot of the final ciphertext.
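The rotate-and-add pattern just described can be mimicked on a plaintext slot vector: each round rotates the vector and adds it to itself, halving the number of partial sums, so a power-of-two block of 4 products needs 2 rotations before the full sum sits in slot 0. In GAZELLE each of these rotations is a ciphertext Perm; the sketch below (ours, not GAZELLE code) only simulates the slot movement with std::rotate to show where the log2(block-size) permutations per output come from.

```cpp
#include <vector>
#include <algorithm>
#include <iostream>

// Simulate GAZELLE-style summation of one packed block (power-of-two length).
// Every round rotates the slot vector by half the remaining block size and adds
// it to the original; on ciphertexts each rotation would be one costly Perm.
float rotate_and_add_sum(std::vector<float> slots) {
    for (size_t step = slots.size() / 2; step >= 1; step /= 2) {
        std::vector<float> rotated(slots);
        std::rotate(rotated.begin(), rotated.begin() + step, rotated.end());
        for (size_t i = 0; i < slots.size(); ++i)
            slots[i] += rotated[i];   // Add() of the original and the permuted copy
    }
    return slots[0];                  // after log2(len) rounds the sum is in slot 0
}

int main() {
    // Elementwise products k(2,2)x(1,1), k(2,3)x(1,2), k(3,2)x(2,1), k(3,3)x(2,2)
    std::vector<float> block = {0.5f, -1.0f, 2.0f, 0.25f};
    std::cout << "Con1 = " << rotate_and_add_sum(block) << "\n";  // prints 1.75
}
```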
However, computing the sum using Perm is costly, with a complexity of O(r²) for convolution and O(log(n/no) + ni·no/n) for the weighted sum in the dense layer, where no, ni and r are the output dimension, input dimension, and kernel size, respectively. From our experiments, one Perm is 56 times slower than one Add and 34 times slower than one Mult.

In this paper, we propose a novel idea to enable an incomplete (or obscure) linear transformation result to propagate to the next nonlinear transformation to continue the neural computation, thus eliminating the need for ciphertext permutations. The overall design is motivated by the double-secret scheme for solving linear systems of equations [46]. Our scheme is illustrated in Fig. 3.

(1) Packed HE Encryption. C and S transform the data x and kernel k into x′ and k′, respectively, as follows:

x′ = [x(1,1), x(1,2), x(2,1), x(2,2), x(1,1), x(1,2), x(2,1), x(2,2),
      x(1,1), x(1,2), x(2,1), x(2,2), x(1,1), x(1,2), x(2,1), x(2,2)],

k′ = [k(2,2), k(2,3), k(3,2), k(3,3), k(2,1), k(2,2), k(3,1), k(3,2),
      k(1,2), k(1,3), k(2,2), k(2,3), k(1,1), k(1,2), k(2,1), k(2,2)].

As illustrated in Fig. 4, four convolutional blocks are computed. For example, the first convolutional block computes x(1,1) × k(2,2) + x(1,2) × k(2,3) + x(2,1) × k(3,2) + x(2,2) × k(3,3). The elements in each convolutional block are sequentially extracted into a packed ciphertext [x′]C.
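The transformation from x and k to x′ and k′ can be written down directly: the client repeats the flattened 2 × 2 input once per convolution output, while the server extracts, for each output position, the kernel entries that overlap the input. The C++ sketch below reproduces the 16-element x′ and k′ listed above; the kernel indices are hard-coded for this specific 2 × 2 / 3 × 3 example (it is our illustration, not a general im2col routine from the CHEETAH code base).

```cpp
#include <array>
#include <vector>

// Client side: x' repeats the flattened 2x2 input once per convolution output.
std::vector<float> build_x_prime(const std::array<std::array<float, 2>, 2>& x) {
    std::vector<float> xp;
    for (int block = 0; block < 4; ++block)   // one block per Con1..Con4
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                xp.push_back(x[i][j]);
    return xp;                                // 16 slots
}

// Server side: k' lists, per block, the kernel entries that overlap the input
// when the 3x3 kernel is placed at each of the four output positions.
std::vector<float> build_k_prime(const std::array<std::array<float, 3>, 3>& k) {
    // 1-based kernel indices copied from the k' layout above.
    const int idx[4][4][2] = {
        {{2, 2}, {2, 3}, {3, 2}, {3, 3}},     // block for Con1
        {{2, 1}, {2, 2}, {3, 1}, {3, 2}},     // block for Con2
        {{1, 2}, {1, 3}, {2, 2}, {2, 3}},     // block for Con3
        {{1, 1}, {1, 2}, {2, 1}, {2, 2}},     // block for Con4
    };
    std::vector<float> kp;
    for (int b = 0; b < 4; ++b)
        for (int e = 0; e < 4; ++e)
            kp.push_back(k[idx[b][e][0] - 1][idx[b][e][1] - 1]);
    return kp;                                // 16 slots, aligned with x'
}
// With these layouts, summing each 4-element block of the elementwise product
// x' ◦ k' yields Con1..Con4 without any slot permutation.
```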
Fig. 3. The overall design of CHEETAH. (Figure: message flow between Client C, holding (x, s1), and Server S, holding (k, w): offline, [s1]C and [ID1]S & [ID2]S are exchanged; online, C sends the encrypted data [x′]C, S returns the obscure linear result [x′ ◦ k′ ◦ v + b]C, C recovers the S-encrypted ReLu via Eq. (6) and returns the secret share for the server [f(k ∗ x) − s1]S, and S recovers the C-encrypted ReLu [f(k ∗ x)]C.)

Meanwhile, S also transforms the kernel k into k′ according to each convolutional block. Note that the transformation is completed offline. C encrypts x′ and sends [x′]C to S.

(2) Perm-free Secure Linear Computation. Upon receiving [x′]C, S performs the linear computation based on the client-encrypted data. A distinguished feature of the proposed design is to eliminate the costly permutations.

Let x′ ◦ k′ denote the elementwise multiplication between x′ and k′. As we can see, the sum of the four elements of each block in x′ ◦ k′ corresponds to one element of the convolution result. For example, the four elements of the first block, i.e., [x(1,1), x(1,2), x(2,1), x(2,2)] and [k(2,2), k(2,3), k(3,2), k(3,3)], correspond to Con1. The next block (i.e., [x(1,1), x(1,2), x(2,1), x(2,2)] and [k(2,1), k(2,2), k(3,1), k(3,2)]) corresponds to Con2, and so on and so forth.

S performs Mult([x′]C, k′) to obtain [x′ ◦ k′]C. The result is the client-encrypted elementwise multiplication between x′ and k′. But S does not intend to calculate the sum of each block to obtain the final convolution result as GAZELLE does, because it would need the costly permutations. Instead, it intends to let C decrypt [x′ ◦ k′]C to compute the sums in plaintext.

However, naively sending [x′ ◦ k′]C to the client would allow the client to obtain the neural network model information, i.e., k. To this end, S disturbs each element of the convolution result with a randomly multiplicative blinding factor. Specifically, S pre-generates a pair of random numbers that satisfy vi1 · vi2 = 1, for each i-th to-be-summed block in [x′ ◦ k′]C, where i ∈ {1, 2, 3, 4} in this example. S constructs the following vector v by using vi1:

v = [v11, v11, v11, v11, v21, v21, v21, v21, v31, v31, v31, v31, v41, v41, v41, v41],

which will be used to scramble [x′ ◦ k′]C by multiplying it with v before it is sent to C. Note that, as each individual element in the i-th four-element block is multiplied with the same factor (since v11, v21, v31, v41 are each repeated four times in v), it would leak the relative magnitude among those four elements in each block. To this end, S further constructs a noise vector as follows:

b = [b11, b12, b13, b14, b21, b22, b23, b24, b31, b32, b33, b34, b41, b42, b43, b44],

where the bij are random numbers subject to Σ_{j=1}^{4} bij = vi1 · δi, with δi uniformly distributed in [−ε, ε], where ε is a model parameter known to the server S.

At the same time, S uses vi2 to create the following vectors:

ID1 = [ID11, ID21, ID31, ID41],
ID2 = [ID12, ID22, ID32, ID42],

where (IDi1, IDi2) is a pair of polar indicators,

(IDi1, IDi2) = { (0, vi2),     if vi1 > 0
               { (vi2, −vi2),  if vi1 < 0.   (4)

S encrypts ID1 and ID2 by using packed HE. The encrypted values, i.e., [ID1]S and [ID2]S, will be sent to C for the nonlinear computation, as discussed later. Note that [ID1]S and [ID2]S can be transmitted to C offline, as vi1 and vi2 are pre-generated by S.
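The server-side masking can be sketched in plaintext C++ as follows: it draws vi1, sets vi2 = 1/vi1, expands v, draws the per-block noise bij so that each block sums to vi1 · δi with δi uniform in [−ε, ε], and fills ID1/ID2 according to Eq. (4). The concrete random-number choices (a normal distribution for vi1 and rejecting values near zero) are our own illustrative assumptions; the paper only requires vi1 · vi2 = 1 and the stated block-sum property of b.

```cpp
#include <vector>
#include <random>
#include <cmath>

struct Masks {
    std::vector<double> v;    // 16 slots: vi1 repeated over each block
    std::vector<double> b;    // 16 slots: noise, each block sums to vi1 * delta_i
    std::vector<double> id1;  // 4 slots: first polar indicator of Eq. (4)
    std::vector<double> id2;  // 4 slots: second polar indicator of Eq. (4)
};

Masks make_masks(double eps, std::mt19937& rng) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::uniform_real_distribution<double> delta_dist(-eps, eps);
    std::uniform_real_distribution<double> unit(-1.0, 1.0);
    Masks m;
    for (int i = 0; i < 4; ++i) {
        double vi1 = 0.0;
        while (std::fabs(vi1) < 0.1) vi1 = gauss(rng);   // keep vi1 away from 0
        double vi2 = 1.0 / vi1;                          // enforce vi1 * vi2 = 1
        for (int j = 0; j < 4; ++j) m.v.push_back(vi1);

        // Noise block: three free terms, the fourth fixes the sum to vi1 * delta_i.
        double delta_i = delta_dist(rng);
        double target = vi1 * delta_i, partial = 0.0;
        for (int j = 0; j < 3; ++j) {
            double bij = unit(rng);
            m.b.push_back(bij);
            partial += bij;
        }
        m.b.push_back(target - partial);

        // Polar indicators of Eq. (4).
        if (vi1 > 0) { m.id1.push_back(0.0);  m.id2.push_back(vi2);  }
        else         { m.id1.push_back(vi2);  m.id2.push_back(-vi2); }
    }
    return m;
}
```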
Now, let us put all pieces together for the secure computation of convolution: C encrypts x′ and sends [x′]C to S. S pre-computes v ◦ k′ in plaintext and then multiplies the result with [x′]C to obtain [x′ ◦ k′ ◦ v]C. As we can see, the i-th convolution element (which corresponds to the sum of the i-th four-element block in [x′ ◦ k′ ◦ v]C) is actually multiplied with a random number vi1. Finally, S adds the noise vector b by Add([x′ ◦ k′ ◦ v]C, b) = [x′ ◦ k′ ◦ v + b]C. In this way, b disturbs each element of the convolution result (the sum of the four elements in each block) with a random noise δi, while v scales each noised element. Next, we will show that, although the convolution result at C is not explicitly calculated, the partial (obscure) result, i.e., [x′ ◦ k′ ◦ v + b]C, is sufficient to compute the nonlinear transformation (e.g., activation and pooling).

Fig. 4. Data transformation at client and server. (Figure: the 2 × 2 input x at C is repeated into the four convolutional blocks forming x′ at C, while the kernel k at S is rearranged into k′ at S.)

(3) PHE-based Secret Share for Non-Linear Transformation. S sends [x′ ◦ k′ ◦ v + b]C, [ID1]S and [ID2]S to C (note that [ID1]S and [ID2]S are transmitted to C offline).

C decrypts [x′ ◦ k′ ◦ v + b]C and sums up each four-element block in plaintext, yielding y = [y(1), y(2), y(3), y(4)]. It is not difficult to show that y(i) is vi1 times the disturbed convolution, i.e., y(i) = vi1 × (Coni + δi).

If C had the true convolution outcome, i.e., Coni, it would compute the ReLu function as follows:

fR(Coni) = { Coni,  if Coni ≥ 0
           { 0,     if Coni < 0.   (5)
However, C only has y(i) = vi1 × (Coni + δi). Since vi1 is a random number that could be positive or negative, it is infeasible to obtain the correct activation directly. Instead, C computes

Add(Mult([ID1]S, y), Mult([ID2]S, fR(y))).   (6)

We can show that the above calculation essentially recovers the server-encrypted ReLu function of Coni + δi, i.e., [f(k ∗ x + δ)]S where δ = {δi}. Since y(i) = vi1 × (Coni + δi), fR(y(i)) may yield four possible outputs, depending on the signs of vi1 and Coni + δi:

fR(y(i)) = { y(i),  if vi1 > 0 and (Coni + δi) ≥ 0
           { y(i),  if vi1 < 0 and (Coni + δi) < 0
           { 0,     if vi1 > 0 and (Coni + δi) < 0
           { 0,     if vi1 < 0 and (Coni + δi) ≥ 0.   (7)

For example, when vi1 > 0 and (Coni + δi) ≥ 0, we have ID1 = {0} according to Eq. (4) and thus Mult([ID1]S, y) = [0]S. On the other hand, ID2 = [v12, v22, v32, v42]. Since y(i) = vi1 × (Coni + δi), we have Mult([ID2]S, fR(y)) = [v12 v11 (Con1 + δ1), · · · , v42 v41 (Con4 + δ4)]S. Note that we have chosen vi1 vi2 = 1. Therefore, Eq. (6) yields [Con1 + δ1, Con2 + δ2, Con3 + δ3, Con4 + δ4]S. This is clearly the server-encrypted ReLu output of Con + δ. Similarly, we can examine the other cases of vi1 and Coni + δi in Eq. (7) and show that Eq. (6) always produces the server-encrypted ReLu outcome f(k ∗ x + δ). We will show in Sec. 5 that the ReLu function of the noised linear result introduces negligible accuracy loss to the neural networks, while δi and vi1 prevent the client from inferring the right Coni.

Subsequently, C creates a ReLu share s1 and computes the server's share as Add([f(k ∗ x + δ)]S, −s1) = [f(k ∗ x + δ) − s1]S. C sends it along with [s1]C (i.e., the client-encrypted share s1, which can be pre-generated by C) to S.

S decrypts [f(k ∗ x + δ) − s1]S to obtain a share of the plaintext activation result, i.e., f(k ∗ x + δ) − s1. It then computes Add([s1]C, f(k ∗ x + δ) − s1) to obtain [a]C = [f(k ∗ x + δ)]C, i.e., the client-encrypted nonlinear transformation result. Note that the introduction of δ does not affect the neural network performance, as shown in Sec. 5.

Till now, the computation of the current layer (including the linear convolution and nonlinear activation) is completed. The output of this layer (i.e., [f(k ∗ x + δ)]C) will serve as the input for the next layer. If the next layer is still a convolution, the server simply repeats the above process. Otherwise, if the next layer is a fully-connected dense layer, a similar approach can be taken, as discussed in Sec. 3.3.

Note that some CNN models employ pooling after activation to reduce its dimensionality. For example, mean pooling takes the activations as the input, which is divided into a number of regions. The averaged value of each region is used to represent that region. Both C and S can respectively average their activation shares (i.e., s1 and f(k ∗ x + δ) − s1) to obtain a share of the mean pooling result. Meanwhile, a similar scheme can be applied if the bias is included.
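The client-side algebra of Eqs. (5)-(7) can be checked with plaintext arithmetic. The sketch below (ours, for illustration) sums each block of the blinded product, applies fR, combines the result with the indicator vectors exactly as in Eq. (6), and then splits the recovered f(Coni + δi) into the two additive shares exchanged above. In the real protocol y and fR(y) are multiplied into the ciphertexts [ID1]S and [ID2]S, so the values below that appear in the clear are only stand-ins for the encrypted counterparts.

```cpp
#include <vector>
#include <random>
#include <algorithm>

// Client: sum each 4-element block of the decrypted x' ◦ k' ◦ v + b,
// giving y(i) = vi1 * (Con_i + delta_i).
std::vector<double> block_sums(const std::vector<double>& masked) {
    std::vector<double> y(masked.size() / 4, 0.0);
    for (size_t t = 0; t < masked.size(); ++t) y[t / 4] += masked[t];
    return y;
}

// Eq. (5): plain ReLu.
double f_r(double v) { return std::max(0.0, v); }

// Eq. (6): ID1 ◦ y + ID2 ◦ fR(y) recovers f(Con_i + delta_i) in all four sign
// cases of Eq. (7). Here id1/id2 are plaintext stand-ins for [ID1]S and [ID2]S.
std::vector<double> recover_relu(const std::vector<double>& y,
                                 const std::vector<double>& id1,
                                 const std::vector<double>& id2) {
    std::vector<double> out(y.size());
    for (size_t i = 0; i < y.size(); ++i)
        out[i] = id1[i] * y[i] + id2[i] * f_r(y[i]);
    return out;
}

// Split the recovered activation into the client share s1 and the server
// share f(Con + delta) - s1, as in the secret-sharing step above.
struct Shares { std::vector<double> client, server; };

Shares split_shares(const std::vector<double>& act, std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    Shares s;
    for (double a : act) {
        double s1 = dist(rng);
        s.client.push_back(s1);
        s.server.push_back(a - s1);   // adding the two shares reconstructs a
    }
    return s;
}
```

Tracing the four cases of Eq. (7) through recover_relu reproduces the argument above: when vi1 > 0, the id1 slot is 0 and the id2 slot is vi2, so the output is vi2 · fR(y(i)); when vi1 < 0, the two indicator terms either cancel or reduce to vi2 · y(i). In every case the result equals max{0, Coni + δi}.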
3.2 MIMO Convolutional Layer

The above SISO method can be readily extended to a MIMO convolutional layer in order to process multiple inputs simultaneously. Assume there are ci input data (i.e., x). Let cn be the number of input data that can be packed into one ciphertext. Recall that each x must be transformed to x′ as discussed in Sec. 3.1. Let co denote the number of kernels and r the size of each kernel. After transformation, the size of x′ is r² times that of the original x. Therefore, each ciphertext can hold cn/r² such transformed input data. Accordingly, the ci input data are transformed and encrypted into ci·r²/cn ciphertexts.

The remaining process for linear and nonlinear computation is similar to SISO, except that the computation on a ciphertext actually calculates multiple input data simultaneously and that the convolutions of all input ciphertexts based on one kernel are combined into one output ciphertext, yielding a total of co output ciphertexts. MIMO is obviously more efficient in processing batches of input data.

3.3 Fully-connected Dense Layer

In a fully-connected dense layer, S uses the output of the previous layer (i.e., [a]C) to compute the weighted sum. Taking the simple two-layer CNN as an example, the weighted sum computes

c1 = w(1,1)[a(1)]C + w(1,2)[a(2)]C + w(1,3)[a(3)]C + w(1,4)[a(4)]C,
c2 = w(2,1)[a(1)]C + w(2,2)[a(2)]C + w(2,3)[a(3)]C + w(2,4)[a(4)]C.

The computation of c1 and c2 is intrinsically the same as the computation of each convolution element (i.e., Con1, . . . , Con4) as discussed above.

3.4 Complexity Analysis

In this subsection, we analyze the computation and communication cost of CHEETAH and compare it with other schemes.

(1) Computation Complexity. The analysis of the computation complexity focuses on the number of ciphertext permutations (Perm), multiplications (Mult), and additions (Add). The notations to be used in the analysis are summarized as follows:

• n is the number of slots in a ciphertext.
• q is the ciphertext space.
• n log q is the number of bits of a ciphertext.
• ni is the input dimension of a fully connected layer.
• no is the output dimension of a fully connected layer.
• r is the kernel size.
• ci is the number of input data (channels) in MIMO.
• co is the number of kernels or the number of output feature maps in MIMO.
• cn is the number of input data that can be packed into one ciphertext.

In SISO, recall that a ciphertext [x′]C is first sent to S. S conducts one ciphertext multiplication and one addition to get [v ◦ k′ ◦ x′ + b]C. Then C receives [v ◦ k′ ◦ x′ + b]C, performs the decryption, and gets the summed convolution y in plaintext, which is followed by 2 multiplications and 1 addition to get the encrypted ReLu, according to Eq. (6). Finally, C does another addition, namely Add([f(k ∗ x + δ)]S, −s1), to generate S's ReLu share. S finally recovers the encrypted nonlinear result with another addition. Therefore, a total of 3 multiplications and 4 additions are required in SISO. The complexity is O(1).
9 In MIMO, C sends S ci r2 /cn ciphertexts. Then S per- TABLE 2 forms ci r2 /cn Mult and (ci r2 /cn − 1) Add to get an in- Comparison of computation complexity. complete ciphertext for each of co kernels. After that, each Method Perm Mult Add of co incomplete ciphertext is added with noise vector by GA-SISO O(r ) 2 2 O(r ) O(r 2 ) one addition. Then S sends those co cipheretxts to C , which CH-SISO 0 O(1) O(1) ci co r 2 ci co r 2 decrypts them and obtain co output features, creating co /cn IR-MIMO O(ci r 2 ) O( cn ) O( cn ) plaintext. Based on Eq. (6), C gets the encrypted ReLu with c c r2 ci co r 2 c c r2 OR-MIMO O( i con ) O( cn ) O( i con ) 2co /cn multiplications and co /cn additions, because each CH-MIMO 0 c co r 2 O( i cn ) c co r 2 O( i cn ) of co /cn plaintext associates with 2 multiplications and 1 NA-FC [28] O(no log ni ) O(no ) O(no log ni ) addition. Finally, C performs another addition on each of HS-FC [47] O(ni ) O(ni ) O(ni ) n n n n n n co /cn ReLu ciphertexts to generate the ReLu share for S . GA-FC O(log nno + in o ) O( in o ) O(log nno + in o ) n n n n CH-FC 0 O( in o ) O( in o ) S then gets its ReLu share by decryption and recovers the nonlinear result by co r2 /cn Add. Therefore, MIMO needs 2 (ci +1)co r 2 ( ci cconr + 2c cn ) multiplications and ( o cn + 2c cn ) addi- o changed. In the second transmission, since S can simulta- tions, both with the complexity of 2 O( ci cconr ). neously send each of co cipheretexts after each calculation, the actual communication cost is on transmitting the last In a fully-connected (FC) dense layer, S conducts ni no /n one of co ciphertexts. Thus, CHEETAH has a pipelined multiplications to get ni no /n intermediate ciphertext, where communication cost as ( ccni + 1)n log q bits. n is usually much larger than ni and no . After that, the In the FC layer, the two transmissions are 1) C sends S an zero-sum vector is added on each of ni no /n intermediate input ciphertext; 2) S sends C ni no /n cipheretexts. As each ciphertext to form [x′ ◦ w′ ◦ v + b′ ]C 2 which is sent to C . C of ni no /n cipheretexts can be simultaneously transmitted does the decryption and gets the summed result in plaintext. after each calculation, the actual communication cost is the Then C calculates the encrypted ReLu with 2 multiplications and 1 addition by Eq. (6). Finally, one addition is performed last one of ni no /n ciphertexts. The total pipelined cost is thus 2n log q bits. The quantitative communication compar- to generate the ReLu share for S , and S needs another Add ison to other approaches is given in Sec. 5. to recover the encrypted nonlinear result. So the FC layer needs ( ninno + 2) multiplications and ( ninno + 3) additions, resulting in the complexity of O( ninno ). 4 S ECURITY A NALYSIS Table 2 compares the computation complexity between CHEETAH and other schemes. Specifically, In the SISO We follow the ideal/real world paradigm [37], [48], [49] to case, CHEETAH (CH) has a constant complexity without prove the security of CHEETAH. We start with defining the permutation while GAZELLE (GA) has the complexity r2 . ideal functionality f OMI which captures the security prop- In the MIMO case, GAZELLE has two traditional options for erties we want to achieve for Outsourced MLaaS Inference. permutation, i.e., Input Rotation (IR) and Output Rotation Defintion 1. The ideal functionality f OMI of outsourced MLaaS (OR) [28]. 
CHEETAH eliminates the expensive permutation inference consists of the following parts: without incurring more multiplications and additions, thus - Input. The server sends model parameters M , e.g., kernel yielding a considerable gain. In the FC layer, we compare k ∈ M , to f OMI . The client sends private input x to f OMI . CHEETAH with a naive method (NA) in [28] (the base- - Computation. Upon receiving the model parameters from line of GAZELLE), Halevi-Shoup (HS) [47] and GAZELLE. server and the private input x from client, f OMI conducts Through the obscure matrix calculation, obscure HE and MLaaS inference by linear and nonlinear computation with x and secret share, CHEETAH further reduces the complexity of produces the nonlinear result f (x ∗ k) = ReLu(x ∗ k). addition by O(log nno ) compared to GAZELLE. In particular, - Output: The f OMI sends respective share of the nonlinear n is usually much larger than no , which makes this reduc- result f (x ∗ k) = ReLu(x ∗ k) to client and server. As for the tion significant. It is worth pointing out that CHEETAH last layer, the f OMI sends the obscure linear result to client with completes both the linear and nonlinear operations with one random number in v . the above complexity while the existing schemes such as Given the ideal functionality f OMI , we give the formal GAZELLE only finish the linear operation. security definition as follows. (2) Communication Complexity. In the SISO case, CHEE- TAH has two transmissions: 1) C sends the encrypted data Definition 2. A protocol Π securely computes the f OMI in the [x′ ]C to S ; 2) S sends [x′ ◦ k ′ ◦ v + b]C to C . Thus the semi-honest adversary setting with static corruption if it provides communication cost is 2n log q bits. Note that the third the following guarantees: transmission in Fig. 4 where C sends the encrypted ReLu - Corrupted server. We require that a corrupted and semi- share to S is the beginning of the next layer. honest server does not learn any information about the values Similarly, in MIMO, the two transmissions are 1) C sends in the client’s private input x. Formally, there should exist a S ci r2 /cn ciphertexts for ci input images; 2) S sends C co ci- Probabilistic Polynomial Time (PPT) simulator simS such that c pheretexts for co kernels. Note that, in the first transmission, viewSΠ ≈ simS (M , out), where viewSΠ denotes the view of the ci r2 /cn ciphertexts are transmitted at the first convolutional server in the real protocol execution (including the server’s input, layer while only ci /cn ciphertexts are needed in other layers. randomness, and the transcript of the protocol). simS (M , out) This is because the size of S -encrypted ReLu will not be is the simulation based on S ’s input, i.e., M , and its final output c ‘out’, e.g., the share of nonlinear function. The “≈” denotes 2. The structure of b′ is similar with b. “computationally indistinguishable”.
10 - Corrupted client. We require that a corrupted and semi- of simC (x, out) is computationally indistinguishable to honest client does not learn any information about the server’s the viewCΠ of the corrupted client. model parameters beyond some generic meta-parameters, i.e, the b) The case of last layer. simC 1) chooses an uniform number of input and output channels and the number of layers. random tape for the client; 2) sends private input x Formally, there should exist a PPT simulator simC such that to f ODT and gets the obscure linear result as out; 3) c viewCΠ ≈ simC (x, out), where viewCΠ denotes the view of the receives from client the C -encrypted input as [x]C ; 3) client in the real protocol execution (including the client’s input, enceypts out with client’s public key as [out]C ; 4) sends randomness, and the transcript of the protocol). simC (x, out) is [out]C to client and outputs whatever C outputs. Here the simulation based on C ’s input, i.e., x, and its final output the view of client in real protocol execution and the ‘out’, e.g., the share of nonlinear function. simulated counterpart is identical. So the output of simC (x, out) are computationally indistinguishable to Theorem 1. Our protocol provides a secure realization of the ideal the viewCΠ of the corrupted client. The proof of Theorem functionality f ODT according to Definition 2. 1 is completed. Proof. According to our security definition, we need to show a simulator for different corrupted parties i.e., the server and 5 P ERFORMANCE E VALUATION the client. We implement CHEETAH with C++ based on Microsoft - Simulator for the corrupted server: SEAL Library [44], and compare it with the best existing a) The case of intermediate layer. simS 1) chooses an scheme, GAZELLE3 . We use two workstations as the client uniform random tape for the server; 2) sends model and server. Both machines run Ubuntu with Intel i7-8700 parameters M to f ODT and gets the share of the 3.2GHz CPU with 12 threads and 16 GB RAM. The network nonlinear result as out; 2) randomly picks a public link between them is a Gigabit Ethernet. Recall that the four key pk and encrypts all-zero input as [0]simS ; 3) sends parameters in BFV scheme are: 1) ciphertext modulus q ; 2) [0]simS to server and receives the obscure linear result plaintext modulus p; 3) number of ciphertext slots n and 4) from the server; 4) encrypts out with S ’s public key as a Gaussian noise with a standard deviation σ . A larger q/p [out]S ; 5) sends [out]S to server and outputs whatever tolerates more noise. We set p to be a 20-bit number and q S outputs. Here the view of server in real protocol to be a 60-bit psuedo-Mersenne prime. The number of slots execution is the client-encrypted input and the share for the packed encryption is set to 10,000. of nonlinear function, while the simulated view is simS -encrypted input and the same share of nonlinear 5.1 Component-wise Benchmark function. On the one hand, the client-encrypted input and simS -encrypted input are indistinguishable due to We first examine the performance of each functional com- the semantic security of HE. On the other hand, the ponent including Conv, FC and ReLu. share of nonlinear function are identical in real and Convolution Benchmark. We define the time of the con- simulated execution. So the output of simS (M , out) is volution operation as the duration between S receives the computationally indistinguishable to the viewSΠ of the encrypted data or secret share from the previous layer (e.g., corrupted server. 
ReLu) till S completes the convolution computation, just before sending the (partial) convolution results to C . It does b) The case of last layer. simS 1) chooses an uniform not contain the communication time between S and C , such random tape for the server; 2) sends model parameters M to f ODT and gets the None as out; 2) randomly as transmitting the (partial) convolution results to C , or picks a public key pk and encrypts all-zero input as secret share to S , or in the case of GAZELLE, the time [0]simS ; 3) sends [0]simS to server and receives the for the HE to GC transformation between S and C for fair obscure linear result from the server. Here the view of comparison. All such communication time is accounted in ReLu and pooling discussed later. server in real protocol execution is the client-encrypted Table 3 benchmarks the convolution with different in- input while the simulated view is simS -encrypted in- put and kernel sizes. The ‘In rot’ and ‘Out rot’ indicate put. As the client-encrypted input and simS -encrypted two GAZELLE variants with the input or output rotation, input are indistinguishable due to the semantic security of HE, the output of simS (M , out) is computationally from which, one of them has to be used for convolution indistinguishable to the viewSΠ of the corrupted server. (see [28] for details). From Table 3, CHEETAH significantly outperforms GAZELLE. E.g., with the kernel size 5 × 5@5, - Simulator for the corrupted client: both the GAZELLE In rot and Out rot variants need more a) The case of intermediate layer. simC 1) chooses an than 25 Mult, 24 Add and 24 Perm operations to yield the uniform random tape for the client; 2) sends private in- result of convolution. In contrast, CHEETAH needs only 5 put x to f ODT and gets the share of the nonlinear result Mult and 5 Add operations, one for each kernel, to obtain as out; 2) receives from client the C -encrypted input as the (partial) convolution results. Those results are then [x]C ; 3) randomly forms a vector r and encrypts it with sent to C for computing ReLu (to be discussed). Overall, client’s public key as [r]C ; 4) sends [r]C to client and CHEETAH accomplishes a speedup of 247 and 207 times receives the S -encrypted share of nonlinear function for compared with the GAZELLE In rot and Out rot variants, server. Here the view of client in real protocol execution respectively, for the case with the kernel size 5 × 5@5 and is the obscure linear result, e.g., x′ ◦ k ′ ◦ v + b, while input data size 28 × 28@1. the simulated view is r. As the v , b and r are random, x′ ◦ k ′ ◦ v + b and r are indistinguishable. So the output 3. Available at: https://github.com/chiraag/gazelle mpc