The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Secure Bilevel Asynchronous Vertical Federated Learning with Backward Updating

Qingsong Zhang^{1,2}, Bin Gu^{3,4}, Cheng Deng^{1,*}, and Heng Huang^{3,5,*}
^1 School of Electronic Engineering, Xidian University, Xi'an 710071, China; ^2 JD Tech, Beijing 100176, China
^3 JD Finance America Corporation, Mountain View, CA, USA; ^4 MBZUAI, United Arab Emirates
^5 Electrical and Computer Engineering, University of Pittsburgh, PA, USA
{qszhang1995, jsgubin, chdeng.xd, henghuanghh}@gmail.com
(* Corresponding Authors)

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Vertical federated learning (VFL) attracts increasing attention due to the emerging demands of multi-party collaborative modeling and concerns over privacy leakage. In real VFL applications, usually only one or a few parties hold labels, which makes it challenging for all parties to collaboratively learn the model without privacy leakage. Meanwhile, most existing VFL algorithms are trapped in synchronous computations, which leads to inefficiency in real-world applications. To address these challenging problems, we propose a novel VFL framework integrating a new backward updating mechanism and a bilevel asynchronous parallel architecture (VFB²), under which three new algorithms, VFB²-SGD, -SVRG, and -SAGA, are proposed. We derive the convergence rates of these three algorithms under both strongly convex and nonconvex conditions. We also prove the security of VFB² under semi-honest threat models. Extensive experiments on benchmark datasets demonstrate that our algorithms are efficient, scalable, and lossless.

Introduction

Federated learning (McMahan et al. 2016; Smith et al. 2017; Kairouz et al. 2019) has emerged as a paradigm for collaborative modeling with privacy preservation. A line of recent works (McMahan et al. 2016; Smith et al. 2017) focuses on horizontal federated learning, where each party has a subset of samples with complete features. Other works (Gascón et al. 2016; Yang et al. 2019b; Dang et al. 2020) study vertical federated learning (VFL), where each party holds a disjoint subset of features for all samples. In this paper, we focus on VFL, which has attracted much attention due to its wide applicability to emerging multi-party collaborative modeling with privacy preservation.

Currently, there are two mainstream approaches to VFL: homomorphic encryption (HE) based methods and methods based on exchanging the raw computational results (ERCR). HE based methods (Hardy et al. 2017; Cheng et al. 2019) leverage HE techniques to encrypt the raw data and then use the encrypted data (ciphertext) to train the model with privacy preservation. However, HE based methods have two major drawbacks. First, the complexity of homomorphic operations on the ciphertext field is very high, so HE is extremely time-consuming for modeling (Liu, Ng, and Zhang 2015; Liu et al. 2019). Second, HE requires approximations to support non-linear functions, such as the sigmoid and logarithm, which inevitably costs accuracy for the many machine learning models that use non-linear functions (Kim et al. 2018; Yang et al. 2019a). The inefficiency and inaccuracy of HE based methods thus dramatically limit their applicability to realistic VFL tasks.

ERCR based methods (Zhang et al. 2018; Hu et al. 2019; Gu et al. 2020b) leverage the labels and the raw intermediate computational results transmitted from the other parties to compute stochastic gradients, and thus can use distributed stochastic gradient descent (SGD) methods to train VFL models efficiently. Although ERCR based methods circumvent the aforementioned drawbacks of HE based methods, existing ERCR based methods are designed under the assumption that all parties hold labels, which is usually not the case in real-world VFL tasks. In realistic VFL applications, usually only one or a few parties (denoted as active parties) have the labels, while the other parties (denoted as passive parties) can only provide extra feature data. When existing ERCR based methods are applied to this real situation, they cannot even guarantee convergence: only the active parties can compute the gradient of the loss function from the labels, so the model parameters held by passive parties are never optimized during training. The crux is therefore to design a proper algorithm for real-world VFL tasks in which only one or a few parties hold labels.

Moreover, algorithms using synchronous computation (Gong, Fang, and Guo 2016; Zhang et al. 2018) are inefficient when applied to real-world VFL tasks, especially when computational resources in the VFL system are unbalanced. Efficient asynchronous algorithms are therefore desirable. Although several works have studied asynchronous VFL algorithms (Hu et al. 2019; Gu et al. 2020b), it remains an open problem to design asynchronous algorithms for real-world VFL tasks where only one or a few parties hold labels.
In this paper, we address these challenging problems by proposing a novel framework (VFB²) integrating a novel backward updating mechanism (BUM) and a bilevel asynchronous parallel architecture (BAPA). Specifically, the BUM enables all parties, rather than only the active parties, to collaboratively and securely update the model, and it makes the final model lossless; the BAPA is designed for efficient asynchronous backward updating. Given the advantages of SGD-type algorithms for optimizing machine learning models, we propose three new SGD-type algorithms, VFB²-SGD, -SVRG, and -SAGA, under this framework. We summarize the contributions of this paper as follows.

• We are the first to propose a backward updating mechanism for ERCR based VFL algorithms, which enables all parties, rather than only the parties holding labels, to collaboratively learn the model with privacy preservation and without hampering the accuracy of the final model.

• We design a bilevel asynchronous parallel architecture that enables all parties to asynchronously update the model through backward updating, which is efficient and scalable.

• We propose three new VFL algorithms, VFB²-SGD, -SVRG, and -SAGA, under VFB². Moreover, we theoretically prove their convergence rates for both strongly convex and nonconvex problems.

Notations. $\hat{w}$ denotes an inconsistent read of $w$. $\bar{w}$ denotes the copy of $w$ a collaborator uses to compute its local stochastic gradient of the loss function, which may be stale due to communication delay. $\psi(t)$ is the party performing the $t$-th global iteration. Given a finite set $S$, $|S|$ denotes its cardinality.

Problem Formulation

Given a training set $\{x_i, y_i\}_{i=1}^n$, where $y_i \in \{-1, +1\}$ for binary classification or $y_i \in \mathbb{R}$ for regression and $x_i \in \mathbb{R}^d$, we consider a linear model $w^\top x$, where $w \in \mathbb{R}^d$ are the model parameters. For VFL, $x_i$ is vertically distributed among $q \geq 2$ parties, i.e., $x_i = [(x_i)_{\mathcal{G}_1}; \cdots; (x_i)_{\mathcal{G}_q}]$, where $(x_i)_{\mathcal{G}_\ell} \in \mathbb{R}^{d_\ell}$ is stored on the $\ell$-th party and $\sum_{\ell=1}^q d_\ell = d$. Similarly, $w = [w_{\mathcal{G}_1}; \cdots; w_{\mathcal{G}_q}]$. We focus on the following regularized empirical risk minimization problem:

$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n} \sum_{i=1}^n f_i(w), \quad f_i(w) := L(w^\top x_i, y_i) + \lambda \sum_{\ell=1}^q g(w_{\mathcal{G}_\ell})$,  (P)

where $w^\top x_i = \sum_{\ell=1}^q w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell}$, $L$ denotes the loss function, $\sum_{\ell=1}^q g(w_{\mathcal{G}_\ell})$ is the regularization term, and $f_i: \mathbb{R}^d \to \mathbb{R}$ is smooth and possibly nonconvex. Examples of problem (P) include models for binary classification (Conroy and Sajda 2012; Wang et al. 2017) and for regression (Shen et al. 2013; Wang et al. 2019).

In this paper, we introduce two types of parties: active parties, which are data providers holding labels, and passive parties, which are not. In our problem setting there are $m$ ($1 \leq m \leq q$) active parties. Each active party can play the role of dominator in model updating by actively launching updates. All parties, both active and passive, play the role of collaborator when passively launching updates. To guarantee model security, only active parties know the form of the loss function. Moreover, we assume that the labels can be shared with all parties after training; note that this does not conflict with our premise that only active parties hold the labels before training. The problem studied in this paper is stated as follows:

Given: vertically partitioned data $\{x_{\mathcal{G}_\ell}\}_{\ell=1}^q$ stored on $q$ parties, with labels held only by the active parties.
Learn: a machine learning model $\mathcal{M}$ collaboratively learned by both active and passive parties without leaking privacy.
Lossless constraint: the accuracy of $\mathcal{M}$ must be comparable to that of a model $\mathcal{M}_0$ learned under non-federated learning.

Algorithm 1: Safe algorithm for obtaining $w^\top x_i$
Input: $\{w_{\mathcal{G}_{\ell'}}\}_{\ell'=1}^q$ and $\{(x_i)_{\mathcal{G}_{\ell'}}\}_{\ell'=1}^q$ allocated to the respective parties; index $i$.
Do this in parallel:
1: for $\ell' = 1, \cdots, q$ do
2:   Generate a random number $\delta_{\ell'}$ and calculate $w_{\mathcal{G}_{\ell'}}^\top (x_i)_{\mathcal{G}_{\ell'}} + \delta_{\ell'}$.
3: end for
4: Obtain $\xi_1 = \sum_{\ell'=1}^q \big(w_{\mathcal{G}_{\ell'}}^\top (x_i)_{\mathcal{G}_{\ell'}} + \delta_{\ell'}\big)$ through tree structure $T_1$.
5: Obtain $\xi_2 = \sum_{\ell'=1}^q \delta_{\ell'}$ through a totally different tree structure $T_2 \neq T_1$.
Output: $w^\top x_i = \xi_1 - \xi_2$.

VFB² Framework

In this section, we propose the novel VFB² framework. VFB² is composed of three components, and its system structure is illustrated in Fig. 1a. The details of these components are presented in the following.

The key to a proper algorithm for real-world VFL tasks with both active and passive parties is to let the passive parties utilize the label information for model training. This is challenging to achieve, because directly sharing the labels held by active parties would leak them before training even starts. To address this problem, we carefully design the BUM.

Backward Updating Mechanism: The key idea of the BUM is to let passive parties use the labels indirectly to compute stochastic gradients, without ever accessing the raw label data. Specifically, the BUM embeds the label $y_i$ into an intermediate value $\vartheta := \frac{\partial L(w^\top x_i, y_i)}{\partial (w^\top x_i)}$. Then $\vartheta$ and $i$ are distributed backward to the other parties. Consequently, the passive parties can also compute stochastic gradients and update the model using the received $\vartheta$ and $i$ (please refer to Algorithms 2 and 3 for details). Fig. 1b depicts the case where $\vartheta$ is distributed from party 1 to the remaining parties. In this way, all parties, rather than only the active parties, collaboratively learn the model without privacy leakage.
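To make the BUM concrete, here is a minimal sketch (Python; the function names and example values are ours, not from the paper) of the intermediate value $\vartheta$ for two common losses. The collaborators receive only the scalar $\vartheta$ and the index $i$; the label $y_i$ never leaves the active party.

```python
import numpy as np

def theta_logistic(margin, y):
    # L(m, y) = log(1 + exp(-y*m))  =>  dL/dm = -y / (1 + exp(y*m))
    return -y / (1.0 + np.exp(y * margin))

def theta_squared(margin, y):
    # L(m, y) = 0.5 * (m - y)^2  =>  dL/dm = m - y
    return margin - y

# Active party: the margin w^T x_i comes from secure aggregation (Algorithm 1).
margin = 0.37            # example value of w^T x_i
y_i = 1                  # label, known only to the active party
theta = theta_logistic(margin, y_i)
# Only (theta, i) is distributed backward; y_i itself is never transmitted.
```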
[Figure 1: (a) System structure of the VFB² framework: a coordinator performs secure aggregation of $\sum_{\ell=1}^q w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell}$ over active parties 1, ..., m and passive parties m+1, ..., q, each updating asynchronously. (b) Illustration of the BUM and BAPA, where k is the number of threads.]

For VFL algorithms with the BUM, dominated updates launched by different active parties are performed in distributed-memory parallel, while collaborative updates within a party are performed in shared-memory parallel. This difference in parallelism makes it impossible to simply adopt an existing asynchronous parallel architecture for VFL; to tackle this challenge, we elaborately design a novel BAPA.

Bilevel Asynchronous Parallel Architecture: The BAPA includes two levels of parallelism, where the upper level is inter-party parallelism and the lower level is intra-party parallelism. More specifically, the inter-party level is distributed-memory parallelism across active parties, which enables all active parties to asynchronously launch dominated updates; the intra-party level is shared-memory parallelism of collaborative updates within a party, which enables multiple threads within a specific party to asynchronously perform collaborative updates. Fig. 1b illustrates the BAPA with $m$ active parties.

To utilize feature data provided by other parties, a party needs to obtain $w^\top x_i = \sum_{\ell=1}^q w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell}$. Many recent works achieve this by securely aggregating the local intermediate computational results (Hu et al. 2019; Gu et al. 2020a). In this paper, we use the efficient tree-structured communication scheme (Zhang et al. 2018) for secure aggregation, whose security was proved in (Gu et al. 2020a).

Secure Aggregation Strategy: The details are summarized in Algorithm 1. Specifically, at step 2, $w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell}$ is computed locally on the $\ell$-th party to prevent direct leakage of $w_{\mathcal{G}_\ell}$ and $(x_i)_{\mathcal{G}_\ell}$; in particular, a random number $\delta_\ell$ is added to mask the value of $w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell}$, which enhances security during the aggregation process. At steps 4 and 5, $\xi_1$ and $\xi_2$ are aggregated through tree structures $T_1$ and $T_2$, respectively. $T_2$ is totally different from $T_1$, which prevents the random values from being removed under threat model 1 (defined in the Security Analysis section). Finally, the value of $w^\top x_i = \sum_{\ell=1}^q w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell}$ is recovered by removing the term $\sum_{\ell=1}^q \delta_\ell$ from $\sum_{\ell=1}^q \big(w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell} + \delta_\ell\big)$ at the output step. Under this aggregation strategy, $(x_i)_{\mathcal{G}_\ell}$ and $w_{\mathcal{G}_\ell}$ are prevented from leaking during aggregation.
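A minimal single-process sketch of the masking idea in Algorithm 1 (Python; in the real system each share is computed on a separate party and the two sums are reduced over the distinct communication trees $T_1$ and $T_2$, which we flatten into plain sums here):

```python
import numpy as np

rng = np.random.default_rng(0)

def local_masked_share(w_block, x_block):
    """Run locally on each party: reveal only the masked inner product."""
    delta = rng.normal()                      # random mask, kept private
    return w_block @ x_block + delta, delta

# Simulate q parties holding vertical blocks of w and x_i.
q, d_block = 4, 5
w_blocks = [rng.normal(size=d_block) for _ in range(q)]
x_blocks = [rng.normal(size=d_block) for _ in range(q)]

shares = [local_masked_share(w, x) for w, x in zip(w_blocks, x_blocks)]
xi_1 = sum(s for s, _ in shares)   # aggregated over tree T1 in the real system
xi_2 = sum(d for _, d in shares)   # aggregated over a different tree T2

wx = xi_1 - xi_2                   # equals w^T x_i: the masks cancel
assert np.isclose(wx, sum(w @ x for w, x in zip(w_blocks, x_blocks)))
```

Because every transmitted value is $w_{\mathcal{G}_{\ell'}}^\top (x_i)_{\mathcal{G}_{\ell'}} + \delta_{\ell'}$, no single party, and no observer of either tree alone, ever sees an unmasked local inner product.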
Secure Bilevel Asynchronous VFL Algorithms with Backward Updating

SGD (Bottou 2010) is a popular method for learning machine learning models, but it has a poor convergence rate due to the intrinsic variance of the stochastic gradient. Many variance reduction techniques have therefore been proposed, including SVRG, SAGA, and SPIDER (Johnson and Zhang 2013; Defazio, Bach, and Lacoste-Julien 2014; Wang et al. 2019), with applications to many other problems (Huang, Chen, and Huang 2019; Huang et al. 2020; Zhang et al. 2020; Dang et al. 2020; Yang et al. 2020a,b; Li et al. 2020; Wei et al. 2019). In this section we present three SGD-type algorithms based on SGD, SVRG, and SAGA, which are the most popular SGD-type methods owing to their appealing performance in practice. We summarize the detailed steps of VFB²-SGD in Algorithms 2 and 3; for VFB²-SVRG and -SAGA, one only needs to replace the update rule with the corresponding one.
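The paper does not restate those rules; for reference, the standard block-coordinate variance-reduced gradients that would replace $\tilde{v}_\ell = \nabla_{\mathcal{G}_\ell} f_i(\hat{w})$ in line 6 of Algorithm 2 (and the analogous step of Algorithm 3) take the usual forms:

```latex
\text{SVRG:}\quad \tilde{v}_\ell
   = \nabla_{\mathcal{G}_\ell} f_i(\hat{w})
   - \nabla_{\mathcal{G}_\ell} f_i(\tilde{w})
   + \nabla_{\mathcal{G}_\ell} f(\tilde{w}),
\qquad
\text{SAGA:}\quad \tilde{v}_\ell
   = \nabla_{\mathcal{G}_\ell} f_i(\hat{w})
   - (\alpha_i)_{\mathcal{G}_\ell}
   + \frac{1}{n}\sum_{j=1}^{n} (\alpha_j)_{\mathcal{G}_\ell},
```

where $\tilde{w}$ is a snapshot refreshed at each outer loop and $\alpha_j$ is the most recently stored stochastic gradient of sample $j$, updated after each use.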
Algorithm 2: VFB²-SGD for active party $\ell$ to actively launch dominated updates
Input: local data $\{(x_i)_{\mathcal{G}_\ell}, y_i\}_{i=1}^n$ stored on the $\ell$-th party; learning rate $\gamma$.
1: Initialize the necessary parameters.
Keep doing in parallel (distributed-memory parallel for multiple active parties):
2: Pick an index $i$ randomly from $\{1, \ldots, n\}$.
3: Compute $\hat{w}^\top x_i = \sum_{\ell'=1}^q \hat{w}_{\mathcal{G}_{\ell'}}^\top (x_i)_{\mathcal{G}_{\ell'}}$ based on Algorithm 1.
4: Compute $\vartheta = \frac{\partial L(\hat{w}^\top x_i, y_i)}{\partial (\hat{w}^\top x_i)}$.
5: Send $\vartheta$ and the index $i$ to the collaborators.
6: Compute $\tilde{v}_\ell = \nabla_{\mathcal{G}_\ell} f_i(\hat{w})$.
7: Update $w_{\mathcal{G}_\ell} \leftarrow w_{\mathcal{G}_\ell} - \gamma \tilde{v}_\ell$.
End parallel

Algorithm 3: VFB²-SGD for the $\ell$-th party to passively launch collaborative updates
Input: local data $\{(x_i)_{\mathcal{G}_\ell}\}_{i=1}^n$ stored on the $\ell$-th party; learning rate $\gamma$.
1: Initialize the necessary parameters (for passive parties).
Keep doing in parallel (shared-memory parallel for multiple threads):
2: Receive $\vartheta$ and the index $i$ from the dominator.
3: Compute $\tilde{v}_\ell = \nabla_{\mathcal{G}_\ell} L(\bar{w}) + \lambda \nabla_{\mathcal{G}_\ell} g(\hat{w}) = \vartheta \cdot (x_i)_{\mathcal{G}_\ell} + \lambda \nabla g(\hat{w}_{\mathcal{G}_\ell})$.
4: Update $w_{\mathcal{G}_\ell} \leftarrow w_{\mathcal{G}_\ell} - \gamma \tilde{v}_\ell$.
End parallel

As shown in Algorithm 2, at each dominated update, the dominator (an active party) calculates $\vartheta$ and distributes $\vartheta$ together with $i$ to the collaborators (the remaining $q - 1$ parties). As shown in Algorithm 3, once party $\ell$ has received $\vartheta$ and $i$, it launches a new collaborative update asynchronously. The dominator computes its local stochastic gradient as $\nabla_{\mathcal{G}_\ell} f_i(\hat{w}) = \nabla_{\mathcal{G}_\ell} L(\hat{w}) + \lambda \nabla g(\hat{w}_{\mathcal{G}_\ell})$, while a collaborator uses the received $\vartheta$ to compute $\nabla_{\mathcal{G}_\ell} L$ and its local $\hat{w}$ to compute $\nabla_{\mathcal{G}_\ell} g$, as shown at step 3 of Algorithm 3. Note that active parties also need to run Algorithm 3 to collaborate with other dominators, which ensures that the model parameters of all parties are updated.
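A compressed sketch of how the two algorithms interact (Python-style; `send`, `recv`, `secure_aggregate`, `dL_dmargin`, `grad_g`, and the party objects are hypothetical helpers we introduce for illustration, and the two loops run concurrently under the BAPA):

```python
def dominated_update(party, gamma, lam):
    """Algorithm 2: one update actively launched by an active party."""
    i = party.sample_index()                  # pick i uniformly from {1, ..., n}
    margin = secure_aggregate(i)              # w^T x_i via Algorithm 1
    theta = dL_dmargin(margin, party.y[i])    # label embedded into theta
    for peer in party.collaborators:
        send(peer, (theta, i))                # backward distribution of (theta, i)
    v = theta * party.x_block(i) + lam * grad_g(party.w_block)
    party.w_block -= gamma * v                # update own coordinate block

def collaborative_update(party, gamma, lam):
    """Algorithm 3: one update passively launched on receiving (theta, i)."""
    theta, i = recv()                         # from whichever dominator fired
    v = theta * party.x_block(i) + lam * grad_g(party.w_block)
    party.w_block -= gamma * v                # passive blocks get optimized too
```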
Theoretical Analysis

In this section, we provide the convergence analyses; please see the arXiv version for more details. We first present preliminaries for the strongly convex and nonconvex problems.

Assumption 1. For $f_i(w)$ in problem (P), we assume the following conditions hold:
1. Lipschitz gradient: for each $f_i$, $i = 1, \ldots, n$, there exists $L > 0$ such that for all $w, w' \in \mathbb{R}^d$,
$\|\nabla f_i(w) - \nabla f_i(w')\| \leq L \|w - w'\|$.  (1)
2. Block-coordinate Lipschitz gradient: for $i = 1, \ldots, n$, there exists $L_\ell > 0$ for the $\ell$-th block $\mathcal{G}_\ell$, $\ell = 1, \cdots, q$, such that
$\|\nabla_{\mathcal{G}_\ell} f_i(w + U_\ell \Delta_\ell) - \nabla_{\mathcal{G}_\ell} f_i(w)\| \leq L_\ell \|\Delta_\ell\|$,  (2)
where $\Delta_\ell \in \mathbb{R}^{d_\ell}$, $U_\ell \in \mathbb{R}^{d \times d_\ell}$, and $[U_1, \cdots, U_q] = I_d$.
3. Bounded block-coordinate gradient: there exists a constant $G$ such that for all $f_i$, $i = 1, \cdots, n$, and blocks $\mathcal{G}_\ell$, $\ell = 1, \cdots, q$, we have $\|\nabla_{\mathcal{G}_\ell} f_i(w)\|^2 \leq G$.

Assumption 2. The regularization term $g$ is $L_g$-smooth: there exists $L_g > 0$ such that for $\ell = 1, \ldots, q$ and all $w_{\mathcal{G}_\ell}, w'_{\mathcal{G}_\ell} \in \mathbb{R}^{d_\ell}$,
$\|\nabla g(w_{\mathcal{G}_\ell}) - \nabla g(w'_{\mathcal{G}_\ell})\| \leq L_g \|w_{\mathcal{G}_\ell} - w'_{\mathcal{G}_\ell}\|$.  (3)

Assumption 2 imposes smoothness on $g$, which is necessary for the convergence analyses: a specific collaborator uses the received $\hat{w}$ (denoted $\bar{w}$) to compute $\nabla_{\mathcal{G}_\ell} L$ but its local $\hat{w}$ to compute $\nabla_{\mathcal{G}_\ell} g = \nabla g(w_{\mathcal{G}_\ell})$, which makes it necessary to track the behavior of $g$ individually.

Similar to previous research works (Lian et al. 2015; Huo and Huang 2017; Leblond, Pedregosa, and Lacoste-Julien 2017), we introduce bounded delays.

Assumption 3. Bounded delay: the time delays of inconsistent reading and of communication between a dominator and its collaborators are upper bounded by $\tau_1$ and $\tau_2$, respectively.

Given $\hat{w}$, the inconsistent read of $w$ used to compute the stochastic gradient in dominated updates, and following the analysis in (Gu et al. 2020b), we have
$\hat{w}_t - w_t = \gamma \sum_{u \in D(t)} U_{\psi(u)} \tilde{v}_u^{\psi(u)}$,  (4)
where $D(t) \subseteq \{t-1, \cdots, t-\tau_0\}$ is a subset of non-overlapped previous iterations with $\tau_0 \leq \tau_1$. Given $\bar{w}$, the parameter used to compute $\nabla_{\mathcal{G}_\ell} L$ in collaborative updates, which is a stale state of $\hat{w}$ due to the communication delay between the specific dominator and its corresponding collaborators, and following the analyses in (Huo and Huang 2017), we have
$\bar{w}_t = \hat{w}_{t-\tau_0} = \hat{w}_t + \gamma \sum_{t' \in D'(t)} U_{\psi(t')} \tilde{v}_{t'}^{\psi(t')}$,  (5)
where $D'(t) \subseteq \{t-1, \cdots, t-\tau_0\}$ is a subset of previous iterations performed during the communication and $\tau_0 \leq \tau_2$.

Convergence Analysis for the Strongly Convex Problem

Assumption 4. Each function $f_i$, $i = 1, \ldots, n$, is $\mu$-strongly convex, i.e., for all $w, w' \in \mathbb{R}^d$ there exists $\mu > 0$ such that
$f_i(w) \geq f_i(w') + \langle \nabla f_i(w'), w - w' \rangle + \frac{\mu}{2} \|w - w'\|^2$.  (6)

For the strongly convex problem, we introduce the notation $K(t)$, which denotes a minimal set of successive iterations fully visiting all coordinates starting from global iteration $t$; this is necessary for the asynchronous convergence analysis of the global model. We assume that the size of $K(t)$ is upper bounded by $\eta_1$, i.e., $|K(t)| \leq \eta_1$. Based on $K(t)$, we introduce the epoch number $v(t)$ as follows.

Definition 1. Let $P(t)$ be a partition of $\{0, 1, \cdots, t - \sigma'\}$, where $\sigma' \geq 0$, such that for every $\kappa \in P(t)$ there exists $t' \leq t$ with $K(t') = \kappa$, and there exists $\kappa_1 \in P(t)$ with $K(0) = \kappa_1$. The epoch number $v(t)$ for the $t$-th global iteration is defined as the maximum cardinality of $P(t)$.

Given the definition of the epoch number $v(t)$, we have the following theoretical results for the $\mu$-strongly convex problem.

Theorem 1. Under Assumptions 1-3 and 4, to achieve an $\epsilon$-accurate solution of problem (P) with VFB²-SGD, i.e., $\mathbb{E}(f(w_t) - f(w^*)) \leq \epsilon$, let $\gamma \leq \frac{\mu^{1/3} \epsilon^{1/3}}{(96 G L_*^2)^{1/3}}$. If $\tau \leq \min\{\epsilon^{-4/3}, \frac{(G L_*^2)^{2/3}}{\mu^{2/3} \epsilon^{2/3}}\}$, the epoch number $v(t)$ should satisfy
$v(t) \geq \frac{44 (G L_*^2)^{1/3}}{\mu^{4/3} \epsilon^{1/3}} \log\Big(\frac{2 (f(w^0) - f(w^*))}{\epsilon}\Big)$,
where $L_* = \max\{L, \{L_\ell\}_{\ell=1}^q, L_g\}$, $\tau = \max\{\tau_1^2, \tau_2^2, \eta_1^2\}$, and $w^0$ and $w^*$ denote the initial point and the optimal point, respectively.

Theorem 2. Under Assumptions 1-3 and 4, to achieve an $\epsilon$-accurate solution of problem (P) with VFB²-SVRG, let $C = (L_*^2 \gamma + L_*) \frac{\gamma^2}{2}$ and $\rho = \frac{\gamma \mu}{2} - \frac{16 L_*^2 \eta_1 C}{\mu}$. We can carefully choose $\gamma$ such that
1) $1 - 2 L_*^2 \gamma^2 \tau > 0$;  2) $\rho > 0$;  3) $\frac{8 L_*^2 \tau^{1/2} C}{\rho \mu} \leq 0.05$;  4) $L_*^2 \gamma^2 \tau^{3/2} (28 C + 5 \gamma) \frac{36 G}{\rho (1 - 2 L_*^2 \gamma^2 \tau)} \leq \frac{\epsilon}{8}$,  (7)
where $v(t)$ should satisfy $v(t) \geq \frac{\log 0.25}{\log(1 - \rho)}$ and the outer-loop number $S$ should satisfy $S \geq \frac{1}{\log(4/3)} \log \frac{2 (f(w^0) - f(w^*))}{\epsilon}$.
[Figure 2: q-parties speedup scalability with m = 2 on D4: (a) SGD-based, (b) SVRG-based, (c) SAGA-based; each panel plots the multiple-parties speedup against #Party (1 to 8) for Ideal, Async (ours), and Sync.]

Theorem 3. Under Assumptions 1-3 and 4, to achieve an $\epsilon$-accurate solution of problem (P) with VFB²-SAGA, let
$c_0 = 2 \gamma^3 \tau^{3/2} + (L_*^2 \gamma^3 \tau + L_* \gamma^2)\, 180 \gamma^2 \tau^{3/2} + 8 \gamma^2 \tau \frac{18 G L_*^2}{1 - 72 L_*^2 \gamma^2 \tau}$, $c_1 = 2 L_*^2 \tau (L_*^2 \gamma^3 \tau + L_* \gamma^2)$, $c_2 = 4 (L_*^2 \gamma^3 \tau + L_* \gamma^2) \frac{L_*^2 \tau}{n}$, and $\rho \in (1 - \frac{1}{n}, 1)$. We can choose $\gamma$ such that
1) $1 - 72 L_*^2 \gamma^2 \tau > 0$;  2) $0 < 1 - \frac{\gamma \mu}{4} < 1$;  3) $\frac{20 c_0}{\frac{\gamma \mu (1 - \rho)}{4} - 2 c_1 - c_2} \leq \epsilon$;
4) $-\frac{\gamma \mu^2}{4} + 2 c_1 + c_2 \big(1 + (1 - \frac{1}{n}) \rho^{-1}\big) \leq 0$;
5) $-\frac{\gamma \mu^2}{4} + c_2 + c_1 \big(2 + (1 - \frac{1}{n}) \rho^{-1}\big) \leq 0$,  (8)
then the epoch number $v(t)$ should satisfy
$v(t) \geq \frac{1}{\log(1/\rho)} \log \frac{2 (2 \rho - 1 + \frac{\gamma \mu}{4}) (f(w^0) - f(w^*))}{\epsilon (\rho - 1 + \frac{\gamma \mu}{4}) (\frac{\gamma \mu}{4} - 2 c_1 - c_2)}$.

Remark 1. For strongly convex problems, given the assumptions and parameters in the corresponding theorems, the convergence rate of VFB²-SGD is $O(\frac{1}{\epsilon} \log(\frac{1}{\epsilon}))$, and those of VFB²-SVRG and VFB²-SAGA are $O(\log(\frac{1}{\epsilon}))$.

Convergence Analysis for the Nonconvex Problem

Assumption 5. The nonconvex function $f(w)$ is bounded below:
$f^* := \inf_{w \in \mathbb{R}^d} f(w) > -\infty$.  (9)

Assumption 5 guarantees the feasibility of the nonconvex problem (P). For the nonconvex problem, we introduce the notation $K'(t)$, which denotes a set of $q$ iterations fully visiting all coordinates, i.e., $K'(t) = \{\{t, t + \bar{t}_1, \cdots, t + \bar{t}_{q-1}\} : \psi(\{t, t + \bar{t}_1, \cdots, t + \bar{t}_{q-1}\}) = \{1, \cdots, q\}\}$, where the $t$-th global iteration denotes a dominated update. These iterations are performed respectively on a dominator and the $q - 1$ different collaborators receiving the $\vartheta$ calculated at the $t$-th global iteration. Moreover, we assume that $K'(t)$ can be completed within $\eta_2$ global iterations, i.e., for all $t' \in A(t)$, $\eta_2 \geq \max\{u \mid u \in K'(t')\} - t'$. Note that, different from $K(t)$, we have $|K'(t)| = q$, and the definition of $K'(t)$ does not emphasize "successive iterations," owing to the difference in analysis techniques between the strongly convex and nonconvex problems. Based on $K'(t)$, we introduce the epoch number $v'(t)$ as follows.

Definition 2. $A(t)$ denotes a set of global iterations such that every $t' \in A(t)$ is a dominated update and $\cup_{t' \in A(t)} K'(t') = \{0, 1, \cdots, t\}$. The epoch number $v'(t)$ is defined as $|A(t)|$.

Given the definition of the epoch number $v'(t)$, we have the following theoretical results for the nonconvex problem.

Theorem 4. Under Assumptions 1-3 and 5, to achieve an $\epsilon$-first-order stationary point of problem (P), i.e., $\mathbb{E}\|\nabla f(w)\| \leq \epsilon$ for the stochastic variable $w$, with VFB²-SGD, let $\gamma = \frac{\epsilon^2}{L_* q G}$. If $\tau \leq \frac{q G}{512 \epsilon^4}$, the total epoch number $T$ should satisfy
$T \geq \frac{2 \big(\mathbb{E} f(w^0) - f^*\big) L_* q G}{\epsilon^4}$,  (10)
where $L_* = \max\{L, \{L_\ell\}_{\ell=1}^q, L_g\}$, $\tau = \max\{\tau_1^2, \tau_2^2, \eta_2^2\}$, $f(w^0)$ is the initial function value, and $f^*$ is defined in Eq. (9).

Theorem 5. Under Assumptions 1-3 and 5, to solve problem (P) with VFB²-SVRG, let $\gamma = \frac{m_0}{L_* n^\alpha}$, where $0 < m_0 < \frac{1}{8}$ and $0 < \alpha \leq 1$. If the epoch number $N$ in an outer loop satisfies $N \leq \lfloor \frac{n^\alpha}{2 m_0} \rfloor$ and $\tau < \min\{\frac{n^{2\alpha}}{20 m_0^2}, \frac{n^{2\alpha} (1 - 8 m_0)}{40 m_0^2}\}$, then
$\frac{1}{T} \sum_{s=1}^S \sum_{t=0}^{N-1} \mathbb{E}\|\nabla f(w_{t_0}^s)\|^2 \leq \frac{L_* n^\alpha\, \mathbb{E}[f(w^0) - f(w^*)]}{T \sigma}$,  (11)
where $T$ is the total number of epochs, $t_0$ is the start iteration of epoch $t$, and $\sigma$ is a small value independent of $n$.

Theorem 6. Under Assumptions 1-3 and 5, to solve problem (P) with VFB²-SAGA, let $\gamma = \frac{m_0}{L_* n^\alpha}$, where $0 < m_0 < \frac{1}{20}$ and $0 < \alpha \leq 1$. If the total epoch number $T$ satisfies $T \leq \lfloor \frac{n^\alpha}{4 m_0} \rfloor$ and $\tau < \min\{\frac{n^{2\alpha}}{180 m_0^2}, \frac{n^{2\alpha} (1 - 20 m_0)}{40 m_0^2}\}$, then
$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(w_{t_0})\|^2 \leq \frac{L_* n^\alpha\, \mathbb{E}[f(w^0) - f(w^*)]}{T \sigma}$.  (12)

Remark 2. For nonconvex problems, given the conditions in the theorems, the convergence rate of VFB²-SGD is $O(1/\sqrt{T})$, and those of VFB²-SVRG and VFB²-SAGA are $O(1/T)$.

Security Analysis

We discuss the data security and model security of VFB² under two semi-honest threat models commonly used in security analysis (Cheng et al. 2019; Xu et al. 2019; Gu et al. 2020a). The two threat models have different capabilities: threat model 2 allows collusion between parties, while threat model 1 does not.

• Honest-but-curious (threat model 1): all workers follow the algorithm and perform the correct computations. However, they may use their own retained records of intermediate computation results to infer other workers' data and models.

• Honest-but-colluding (threat model 2): all workers follow the algorithm and perform the correct computations. However, some workers may collude to infer other workers' data and models by sharing their retained records of intermediate computation results.
Similar to (Gu et al. 2020a), we prove the security of VFB² by analyzing its ability to prevent the inference attack defined as follows.

Definition 3 (Inference attack). An inference attack on the $\ell$-th party is to infer $(x_i)_{\mathcal{G}_\ell}$ (or $w_{\mathcal{G}_\ell}$) belonging to other parties, or $y_i$ held by active parties, without directly accessing them.

Lemma 1. Given an equation $o_i = w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell}$ or $o_i = \frac{\partial L(\hat{w}^\top x_i, y_i)}{\partial (\hat{w}^\top x_i)}$ with only $o_i$ known, there are infinitely many different solutions to the equation.

The proof of Lemma 1 is given in the arXiv version. Based on Lemma 1, we obtain the following theorem.

Theorem 7. Under the two semi-honest threat models, VFB² can prevent the inference attack.

Feature and model security: During aggregation, the value $o_i = w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell}$ is masked by $\delta_\ell$, and only $w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell} + \delta_\ell$ is transmitted. Under threat model 1, an adversary cannot even access the true value of $o_i$, let alone use the relation $o_i = w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell}$ to infer $w_{\mathcal{G}_\ell}$ and $(x_i)_{\mathcal{G}_\ell}$. Under threat model 2, the random value $\delta_\ell$ risks being removed from $w_{\mathcal{G}_\ell}^\top (x_i)_{\mathcal{G}_\ell} + \delta_\ell$ through collusion with other parties; applying Lemma 1 to this circumstance, even if the random value is removed, it is still impossible to exactly infer $w_{\mathcal{G}_\ell}$ and $(x_i)_{\mathcal{G}_\ell}$. Thus, the aggregation process prevents the inference attack under both semi-honest threat models.

Label security: When analyzing label security, we do not consider collusion between active and passive parties, which would make preventing label leakage meaningless. In the backward updating process, if a passive party $\ell$ wants to infer $y_i$ from the received $\vartheta$, it must solve the equation $\vartheta = \frac{\partial L(\hat{w}^\top x_i, y_i)}{\partial (\hat{w}^\top x_i)}$. However, only $\vartheta$ is known to party $\ell$; thus, by Lemma 1, it is impossible to exactly infer the labels. Moreover, collusion between passive parties poses no threat to label security. Therefore, backward updating prevents the inference attack under both semi-honest threat models.

From the above analyses, feature security, label security, and model security are all guaranteed in VFB².
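The intuition behind Lemma 1 is that a single scalar observation is one equation in many unknowns, so it is underdetermined. A toy numerical check (Python; the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
w_true, x_true = rng.normal(size=d), rng.normal(size=d)
o = w_true @ x_true                    # the only value an attacker observes

# Construct a different pair (w', x') with the same observation:
w_alt = rng.normal(size=d)             # any vector with w_alt @ x_true != 0
x_alt = x_true * (o / (w_alt @ x_true))
assert np.isclose(w_alt @ x_alt, o)    # indistinguishable from the true pair
# Rescaling w_alt by any c != 0 and x_alt by 1/c yields infinitely many such pairs.
```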
rithms, there is a synthetic straggler party which may be 30% From above analyses, we have that the feature security, to 50% slower than the faster party to simulate the real appli- label security and model security are guaranteed in VFB2 . cation scenario with unbalanced computational resource. Asynchronous Efficiency: In these experiments, we set q = 8, m = 3 and fix the γ for algorithms with a same Experiments SGD-type but in different parallel fashions. As shown in In this section, extensive experiments are conducted to Figs. 3 and 4, the loss v.s. run time curves demonstrate that demonstrate the efficiency, scalability and losslessness of our algorithms consistently outperform their synchronous our algorithms. More experiments are presented in the arXiv counterparts regarding the efficiency. version. Moreover, from the perspective of loss v.s. epoch number, Experiment Settings: All experiments are implemented on we have that algorithms based on SVRG and SAGA have the a machine with four sockets, and each sockets has 12 cores. better convergence rate than that of SGD-based algorithms To simulate the environment with multiple machines (or par- which is consistent to the theoretical results. ties), we arrange an extra thread for each party to schedule its Asynchronous Scalability: We also consider the asyn- k threads and support communication with (threads of) the chronous speedup scalability in terms of the number of total other parties. We use MPI to implement the communication parties q. Given a fixed m, q-parties speedup is defined as scheme. The data are partitioned vertically and randomly Run time of using 1 party into q non-overlapped parts with nearly equal number of fea- q-parties speedup = , (14) Run time of using q parties tures. The number of threads within each parties, i.e. k, is ∗ set as m. We use the training dataset or randomly select 80% https://www.kaggle.com/datasets † samples as the training data, and the testing dataset or the https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/ 10901
Evaluations of Asynchronous Efficiency and Scalability

To demonstrate the asynchronous efficiency, we compare against the synchronous counterparts of our algorithms (synchronous VFL algorithms with BUM, denoted VFB). When implementing the synchronous algorithms, we include a synthetic straggler party that is 30% to 50% slower than the fastest party, simulating real application scenarios with unbalanced computational resources.

Asynchronous efficiency: In these experiments, we set q = 8 and m = 3, and fix $\gamma$ for algorithms of the same SGD type but different parallel fashions. As shown in Figs. 3 and 4, the loss versus run-time curves demonstrate that our algorithms consistently outperform their synchronous counterparts in efficiency. Moreover, in terms of loss versus epoch number, the SVRG- and SAGA-based algorithms have better convergence rates than the SGD-based ones, which is consistent with the theoretical results.

[Figure 3: Sub-optimality versus time for solving the µ-strongly convex VFL model (Problem 13) on D1-D4, comparing VFB²-SGD/-SVRG/-SAGA (ours) with VFB-SGD/-SVRG/-SAGA; the number of epochs (points) denotes how many passes over the dataset the algorithm makes.]

[Figure 4: Sub-optimality versus time for solving the nonconvex VFL model (Problem 14) on D1-D4, with the same algorithms and conventions as Figure 3.]

Asynchronous scalability: We also consider the asynchronous speedup scalability in terms of the total number of parties q. Given a fixed m, the q-parties speedup is defined as
$\text{q-parties speedup} = \frac{\text{run time using 1 party}}{\text{run time using } q \text{ parties}}$,  (15)
where run time is the time spent reaching a given precision of sub-optimality (1e−3 for D4). We run this experiment for Problem (14); the results are shown in Fig. 2. As depicted there, our asynchronous algorithms have much better q-parties speedup scalability than the synchronous ones and achieve near-linear speedup.

Evaluation of Losslessness

To demonstrate the losslessness of our algorithms, we compare VFB²-SVRG with its non-federated (NonF) counterpart, in which all data are integrated for centralized modeling, and with an ERCR based algorithm without BUM, AFSVRG-VP (Gu et al. 2020b). AFSVRG-VP also uses a distributed SGD method but cannot optimize the parameters corresponding to passive parties, because those parties lack labels. When implementing AFSVRG-VP, we assume that only half of the parties have labels, i.e., the parameters corresponding to the features held by the other parties are not optimized. Each comparison is repeated 10 times with m = 3, q = 8, and the same stopping criterion (e.g., 1e−5 for D1). As shown in Table 2, the accuracy of our algorithms matches that of the NonF algorithms and is much better than that of AFSVRG-VP, which is consistent with our claims.

Table 2: Accuracy of different algorithms, evaluating the losslessness of our algorithms (10 trials).

               Algorithm    D1              D2              D3              D4
Problem (13)   NonF         81.96%±0.25%    93.56%±0.19%    98.29%±0.21%    92.17%±0.12%
               AFSVRG-VP    79.35%±0.19%    93.35%±0.18%    97.24%±0.11%    89.17%±0.10%
               Ours         81.96%±0.22%    93.56%±0.20%    98.29%±0.20%    92.17%±0.13%
Problem (14)   NonF         82.03%±0.32%    93.56%±0.25%    98.45%±0.29%    92.71%±0.24%
               AFSVRG-VP    79.36%±0.24%    93.35%±0.22%    97.59%±0.13%    89.98%±0.14%
               Ours         82.03%±0.34%    93.56%±0.24%    98.45%±0.33%    92.71%±0.27%

Conclusion

In this paper, we proposed a novel backward updating mechanism for real VFL systems in which only one or a few parties hold labels for training. Our new algorithms enable all parties, rather than only the active parties, to collaboratively update the model, and they guarantee convergence, which does not hold for other recently proposed ERCR based VFL methods in this real-world setting. Moreover, we proposed a bilevel asynchronous parallel architecture that makes ERCR based algorithms with backward updating more efficient in real-world tasks. Three practical SGD-type algorithms were proposed with theoretical guarantees.
Acknowledgments

Q.S. Zhang and C. Deng were supported in part by the National Natural Science Foundation of China under Grant 62071361, the National Key R&D Program of China under Grant 2017YFE0104100, and the China Research Project under Grant 6141B07270429.

References

Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, 177–186. Springer.

Chang, C.-C.; and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3): 27.

Cheng, K.; Fan, T.; Jin, Y.; Liu, Y.; Chen, T.; and Yang, Q. 2019. SecureBoost: A Lossless Federated Learning Framework. arXiv preprint arXiv:1901.08755.

Conroy, B.; and Sajda, P. 2012. Fast, exact model selection and permutation testing for l2-regularized logistic regression. In Artificial Intelligence and Statistics, 246–254.

Dang, Z.; Li, X.; Gu, B.; Deng, C.; and Huang, H. 2020. Large-Scale Nonlinear AUC Maximization via Triply Stochastic Gradients. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Defazio, A.; Bach, F.; and Lacoste-Julien, S. 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in NIPS, 1646–1654.

Gascón, A.; Schoppmann, P.; Balle, B.; Raykova, M.; Doerner, J.; Zahur, S.; and Evans, D. 2016. Secure Linear Regression on Vertically Partitioned Datasets. IACR Cryptology ePrint Archive 2016: 892.

Gong, Y.; Fang, Y.; and Guo, Y. 2016. Private data analytics on biomedical sensing data via distributed computation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 13(3): 431–444.

Gu, B.; Dang, Z.; Li, X.; and Huang, H. 2020a. Federated Doubly Stochastic Kernel Learning for Vertically Partitioned Data. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2483–2493.

Gu, B.; Xu, A.; Deng, C.; and Huang, H. 2020b. Privacy-Preserving Asynchronous Federated Learning Algorithms for Multi-Party Vertically Collaborative Learning. arXiv preprint arXiv:2008.06233.

Hardy, S.; Henecka, W.; Ivey-Law, H.; Nock, R.; Patrini, G.; Smith, G.; and Thorne, B. 2017. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv preprint arXiv:1711.10677.

Hu, Y.; Niu, D.; Yang, J.; and Zhou, S. 2019. FDML: A collaborative machine learning framework for distributed features. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2232–2240.

Huang, F.; Chen, S.; and Huang, H. 2019. Faster Stochastic Alternating Direction Method of Multipliers for Nonconvex Optimization. In ICML, 2839–2848.

Huang, F.; Gao, S.; Pei, J.; and Huang, H. 2020. Accelerated zeroth-order momentum methods from mini to minimax optimization. arXiv preprint arXiv:2008.08170.

Huo, Z.; and Huang, H. 2017. Asynchronous mini-batch gradient descent with variance reduction for non-convex optimization. In Thirty-First AAAI Conference on Artificial Intelligence.

Johnson, R.; and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in NIPS, 315–323.

Kairouz, P.; McMahan, H. B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A. N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. 2019. Advances and Open Problems in Federated Learning. arXiv preprint arXiv:1912.04977.

Kim, M.; Song, Y.; Wang, S.; Xia, Y.; and Jiang, X. 2018. Secure logistic regression based on homomorphic encryption: Design and evaluation. JMIR Medical Informatics 6(2): e19.

Leblond, R.; Pedregosa, F.; and Lacoste-Julien, S. 2017. Asaga: Asynchronous Parallel Saga. In 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017.

Li, M.; Deng, C.; Li, T.; Yan, J.; Gao, X.; and Huang, H. 2020. Towards Transferable Targeted Attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 641–649.

Lian, X.; Huang, Y.; Li, Y.; and Liu, J. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, 2737–2745.

Liu, F.; Ng, W. K.; and Zhang, W. 2015. Encrypted gradient descent protocol for outsourced data mining. In 2015 IEEE 29th International Conference on Advanced Information Networking and Applications, 339–346. IEEE.

Liu, Y.; Liu, Y.; Liu, Z.; Zhang, J.; Meng, C.; and Zheng, Y. 2019. Federated Forest. arXiv preprint arXiv:1905.10053.

McMahan, H. B.; Moore, E.; Ramage, D.; Hampson, S.; et al. 2016. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.

Shen, X.; Alam, M.; Fikse, F.; and Rönnegård, L. 2013. A novel generalized ridge regression method for quantitative genetics. Genetics 193(4): 1255–1268.

Smith, V.; Chiang, C.-K.; Sanjabi, M.; and Talwalkar, A. S. 2017. Federated multi-task learning. In Advances in Neural Information Processing Systems, 4424–4434.

Wang, X.; Ma, S.; Goldfarb, D.; and Liu, W. 2017. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization 27(2): 927–956.

Wang, Z.; Ji, K.; Zhou, Y.; Liang, Y.; and Tarokh, V. 2019. SpiderBoost and Momentum: Faster Variance Reduction Algorithms. In Advances in NIPS, 2403–2413.

Wei, K.; Yang, M.; Wang, H.; Deng, C.; and Liu, X. 2019. Adversarial Fine-Grained Composition Learning for Unseen Attribute-Object Recognition. In Proceedings of the IEEE International Conference on Computer Vision, 3741–3749.

Xu, R.; Baracaldo, N.; Zhou, Y.; Anwar, A.; and Ludwig, H. 2019. HybridAlpha: An efficient approach for privacy-preserving federated learning. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, 13–23.

Yang, K.; Fan, T.; Chen, T.; Shi, Y.; and Yang, Q. 2019a. A Quasi-Newton Method Based Vertical Federated Learning Framework for Logistic Regression. arXiv preprint arXiv:1912.00513.

Yang, M.; Deng, C.; Yan, J.; Liu, X.; and Tao, D. 2020a. Learning Unseen Concepts via Hierarchical Decomposition and Composition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10248–10256.

Yang, Q.; Liu, Y.; Chen, T.; and Tong, Y. 2019b. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10(2): 12.

Yang, X.; Deng, C.; Wei, K.; Yan, J.; and Liu, W. 2020b. Adversarial Learning for Robust Deep Clustering. Advances in Neural Information Processing Systems 33.

Zhang, G.-D.; Zhao, S.-Y.; Gao, H.; and Li, W.-J. 2018. Feature-Distributed SVRG for High-Dimensional Linear Classification. arXiv preprint arXiv:1802.03604.

Zhang, Q.; Huang, F.; Deng, C.; and Huang, H. 2020. Faster Stochastic Quasi-Newton Methods. arXiv preprint arXiv:2004.06479.