Impact of Barren Plateaus on the Hessian and Higher Order Derivatives
M. Cerezo¹,² and Patrick J. Coles¹

¹ Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
² Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM, USA

arXiv:2008.07454v1 [quant-ph] 17 Aug 2020

The Barren Plateau (BP) phenomenon is an issue for certain quantum neural networks and variational quantum algorithms, whereby the gradient vanishes exponentially in the system size n. The question of whether high-order derivative information such as the Hessian could help escape a BP was recently posed in the literature. Here we show that the elements of the Hessian are exponentially suppressed in a BP, so estimating the Hessian in this situation would require a precision that scales exponentially with n. Hence, Hessian-based approaches do not circumvent the exponential scaling associated with BPs. We also show the exponential suppression of higher order derivatives. Hence, BPs will impact optimization strategies that go beyond (first-order) gradient descent.

I. Introduction

Training parameterized quantum circuits is a promising approach for quantum computing in the Noisy Intermediate Scale Quantum (NISQ) era. This strategy is employed in two closely related paradigms: Variational Quantum Algorithms (VQAs) for chemistry, optimization, and other applications [1–12], and Quantum Neural Networks (QNNs) for classification applications [13–16]. In both cases, one utilizes a quantum computer to efficiently evaluate a cost (or loss) function C(θ) or its gradient ∇C(θ), while employing a classical optimizer to train the parameters θ of a parameterized quantum circuit V(θ). While many novel VQAs and QNNs have been developed, more rigorous scaling analysis is needed for these architectures.

One of the few known results is the so-called barren plateau phenomenon [17–20], where the cost function gradient vanishes exponentially with the system size. This can arise due to deep unstructured ansatzes [17, 19], global cost functions [18, 19], or noise [20]. Regardless of the origin, when a cost landscape exhibits a barren plateau, one requires exponential precision to determine a minimizing direction in order to navigate the landscape. Since the standard goal of quantum algorithms is polynomial scaling with the system size (in contrast to the exponential scaling of classical algorithms), the exponential scaling due to barren plateaus can destroy quantum speedup. Hence, the study and analysis of barren plateaus should be viewed as a fundamental step in the development of VQAs and QNNs to guarantee that they can, in fact, provide a speedup over classical algorithms.

Recently there have been multiple strategies proposed for avoiding barren plateaus, such as employing local cost functions [18], pre-training [21], parameter correlation [22], layer-by-layer training [23], and initializing layers to the identity [24]. These strategies are aimed at either avoiding or preventing the existence of a barren plateau, and they appear to be promising, with more research needed on their efficacy on general classes of problems. In a recent manuscript [25], an alternative idea was proposed, involving a method for actually training inside and escaping a barren plateau. Specifically, the proposal was to compute the Hessian H of the cost function, and the claim was that taking a learning rate proportional to the inverse of the largest eigenvalue of the Hessian leads to an optimization method that could escape the barren plateau.

The question of whether higher-order derivative information (beyond the first-order gradient) is useful for escaping a barren plateau is interesting and is the subject of our work here. Our main results are presented in the form of two propositions and corollaries. First, we show that the matrix elements H_ij of the Hessian are exponentially vanishing when the cost exhibits a barren plateau. This implies that the calculation of H_ij requires exponential precision. In our second result we show that the magnitude of any higher-order partial derivative of the cost will also be exponentially small in a barren plateau. Our results suggest that optimization methods that use higher-order derivative information, such as the Hessian, will also face exponential scaling, and hence do not circumvent the scaling issues arising from barren plateaus.

II. Preliminaries

In what follows, we consider the case when the cost can be expressed as a sum of expectation values:

    C(θ) = Σ_{x=1}^{N} C_x ,  with  C_x = Tr[O_x V(θ) ρ_x V†(θ)] ,   (1)

where {ρ_x} is a set (of size N) of input states to the parameterized circuit V(θ). In order for this cost to be efficiently computable, the number of states in the input set should grow at most polynomially with the number of qubits n, that is, N ∈ O(poly(n)). In the context of QNNs, the states {ρ_x} can be viewed as training data points, and hence (1) is a natural cost function for QNNs. In the context of VQAs, one typically chooses N = 1, corresponding to a single input state. In this sense, the cost function in (1) is general enough to be relevant to both QNNs and VQAs.
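To make the setup concrete, here is a minimal numerical sketch of the cost in Eq. (1) for a single qubit. The specific ansatz, the generators, and the choice ρ = |0⟩⟨0|, O = Z are illustrative assumptions of ours, not taken from the paper; the only structural requirements are that each generator has eigenvalues ±1 (as introduced next) and that the input set has size N ∈ O(poly(n)).

```python
import numpy as np

# Illustrative single-qubit model (our choice, not from the paper): the only
# structural requirement is that each generator sigma_i has eigenvalues +/-1.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
GENERATORS = [Z, X, Z]  # sigma_i for each trainable angle theta_i

def rot(P, angle):
    """exp(-i * angle * P / 2) in closed form, valid since P^2 = I."""
    return np.cos(angle / 2) * I2 - 1j * np.sin(angle / 2) * P

def V(theta):
    """Parameterized circuit V(theta) = prod_i exp(-i theta_i sigma_i / 2)."""
    U = I2
    for P, t in zip(GENERATORS, theta):
        U = rot(P, t) @ U
    return U

def cost(theta, data=None):
    """C(theta) = sum_x Tr[O_x V(theta) rho_x V(theta)^dagger], Eq. (1).
    Default: the VQA case N = 1 with rho = |0><0| and O = Z."""
    if data is None:
        rho0 = np.array([[1, 0], [0, 0]], dtype=complex)
        data = [(Z, rho0)]
    U = V(theta)
    return float(np.real(sum(np.trace(O @ U @ rho @ U.conj().T)
                             for O, rho in data)))

print(cost(np.array([0.3, 0.7, 0.1])))
```

For n qubits one would tensor such rotations together; nothing in what follows depends on this particular toy model.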
Let θ_i be an angle which parameterizes a unitary in V(θ) as e^{−iθ_i σ_i/2}, with σ_i a Hermitian operator with eigenvalues ±1. Then, the partial derivative ∂C(θ)/∂θ_i = ∂_i C(θ) can be computed via the parameter shift rule [26, 27] as

    ∂_i C(θ) = (1/2) [ C(θ_ī, θ_i^(1/2)) − C(θ_ī, θ_i^(−1/2)) ] ,   (2)

where θ_ī denotes the remaining angles and where we define

    θ_i^(δ) = θ_i + δπ .   (3)

Note that the parameter shift rule in (2) allows one to exactly write the first-order partial derivative as a difference of cost function values evaluated at two different points.
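As a sanity check of Eqs. (2) and (3), the following sketch implements the shift rule on an analytic stand-in cost. Any cost of the form (1) with ±1-eigenvalue generators is degree-one trigonometric in each angle, so C(θ) = cos θ₁ cos θ₂ (an assumption made purely for brevity, not a cost from the paper) obeys the same rule, and the shift-rule value can be compared against the exact derivative.

```python
import numpy as np

def cost(theta):
    """Analytic stand-in for Eq. (1) (illustrative choice, not from the
    paper): with +/-1-eigenvalue generators the cost is degree-one
    trigonometric in each angle, so this C obeys the same shift rule."""
    return np.cos(theta[0]) * np.cos(theta[1])

def shift(theta, i, delta):
    """theta with theta_i -> theta_i + delta*pi, Eq. (3)."""
    out = np.array(theta, dtype=float)
    out[i] += delta * np.pi
    return out

def d_cost(theta, i):
    """First derivative via the parameter shift rule, Eq. (2)."""
    return 0.5 * (cost(shift(theta, i, 0.5)) - cost(shift(theta, i, -0.5)))

theta = np.array([0.3, 1.1])
exact = -np.sin(theta[0]) * np.cos(theta[1])   # exact dC/dtheta_1
print(d_cost(theta, 0), exact)                 # the two values agree
```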
III. Hessian matrix elements

Let us now state our results for the Hessian. The Hessian H of the cost function is a square matrix whose matrix elements are the second derivatives of C(θ), i.e.,

    H_ij = ∂²C(θ)/∂θ_i ∂θ_j = ∂_i ∂_j C(θ) .   (4)

Reference [25] noted that the matrix elements of the Hessian can be written according to the parameter shift rule. Namely, one can first write

    H_ij = (1/2) [ ∂_i C(θ_j̄, θ_j^(1/2)) − ∂_i C(θ_j̄, θ_j^(−1/2)) ]   (5)

and then apply the parameter shift rule a second time:

    H_ij = (1/4) [ C(θ_īj̄, θ_i^(1/2), θ_j^(1/2)) + C(θ_īj̄, θ_i^(−1/2), θ_j^(−1/2))
                 − C(θ_īj̄, θ_i^(1/2), θ_j^(−1/2)) − C(θ_īj̄, θ_i^(−1/2), θ_j^(1/2)) ] .   (6)

Hence, the second derivatives of the cost can be expressed as a sum of cost functions evaluated at (up to) four points. From the parameter shift rule we can then derive the following bound on the probability that the magnitude of a matrix element |H_ij| is larger than a given c > 0.

Proposition 1. Consider a cost function of the form (1), for which the parameter shift rule of (2) holds. Let H_ij be the matrix elements of the Hessian as defined in (5). Then, assuming that ⟨∂_i C⟩_θ = 0, the following inequality holds for any c > 0:

    Pr(|H_ij| ≥ c) ≤ 2 Var_θ[∂_i C] / c² .   (7)

Here Var_θ[∂_i C] = ⟨(∂_i C)²⟩_θ − ⟨∂_i C⟩_θ², where the expectation values are taken over θ.

Proof. Equation (5) implies that the magnitudes of the Hessian matrix elements are bounded as

    |H_ij| ≤ (1/2) [ |∂_i C(θ_j̄, θ_j^(1/2))| + |∂_i C(θ_j̄, θ_j^(−1/2))| ] .   (8)

From Chebyshev's inequality we can bound the probability that the cost derivative deviates from its mean value (of zero) as

    Pr(|∂_i C| ≥ c) ≤ Var_θ[∂_i C] / c² ,   (9)

for all c > 0 and for all i. Then, let E_± be defined as the event that |∂_i C(θ^±)| ≥ c, where θ^± = (θ_j̄, θ_j^(±1/2)). Note that the set of events where |H_ij| ≥ c is a subset of the set E_+ ∪ E_−. Then, from the union bound and Eq. (9) we can recover (7) as follows:

    Pr(|H_ij| ≥ c) ≤ Pr(E_+ ∪ E_−)   (10)
                   ≤ Pr(E_+) + Pr(E_−)   (11)
                   ≤ Var_{θ^+}[∂_i C]/c² + Var_{θ^−}[∂_i C]/c²   (12)
                   = 2 Var_θ[∂_i C]/c² ,   (13)

where we used the fact that ⟨·⟩_θ = ⟨·⟩_{θ^±}. ∎

When the cost exhibits a barren plateau as in [17–19], the variance of the cost function partial derivative Var_θ[∂_i C] is exponentially vanishing, i.e.,

    Var_θ[∂_i C] ≤ F(n) ,  with  F(n) ∈ O(1/b^n) ,   (14)

for some b > 1. Then, the following corollary holds.

Corollary 1. Consider the bound in Eq. (7) of Proposition 1. If the cost exhibits a barren plateau, such that (14) holds, then the matrix elements of the Hessian are exponentially vanishing since

    Pr(|H_ij| ≥ c) ≤ 2 F(n) / c² ,   (15)

where F(n) ∈ O(1/b^n) for some b > 1.

The proof follows by combining (7) and (14). Corollary 1 shows that when the cost landscape exhibits a barren plateau, the matrix elements of the Hessian are exponentially vanishing with high probability. This implies that any algorithm that requires the estimation of the Hessian will require a precision that grows exponentially with the system size.
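The four-point formula (6) is easy to exercise numerically. The sketch below again uses the illustrative stand-in C(θ) = cos θ₁ cos θ₂ (our choice, not a cost from the paper) and checks the doubly shifted evaluation against the exact mixed derivative. Because the shifts accumulate, the same function also covers the diagonal case i = j, where (6) reduces to (1/4)[C(θ_i + π) + C(θ_i − π) − 2C(θ)].

```python
import numpy as np

def cost(theta):
    """Analytic stand-in obeying the parameter shift rule in each angle
    (illustrative choice, not a cost from the paper)."""
    return np.cos(theta[0]) * np.cos(theta[1])

def shifted(theta, shifts):
    """Apply theta_i -> theta_i + delta*pi for each (i, delta) in shifts."""
    out = np.array(theta, dtype=float)
    for i, delta in shifts:
        out[i] += delta * np.pi
    return out

def hessian_element(theta, i, j):
    """H_ij from the doubly shifted four-point formula, Eq. (6)."""
    return 0.25 * (cost(shifted(theta, [(i, 0.5), (j, 0.5)]))
                   + cost(shifted(theta, [(i, -0.5), (j, -0.5)]))
                   - cost(shifted(theta, [(i, 0.5), (j, -0.5)]))
                   - cost(shifted(theta, [(i, -0.5), (j, 0.5)])))

theta = np.array([0.3, 1.1])
exact = np.sin(theta[0]) * np.sin(theta[1])   # exact d^2 C / dtheta_1 dtheta_2
print(hessian_element(theta, 0, 1), exact)    # the two values agree
```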
IV. Higher order partial derivatives

Let us now analyze the magnitude of higher order partial derivatives in a barren plateau. We use the following notation for the |α|th-order derivative:

    D^α C(θ) = ∂_{α_1} ∂_{α_2} ··· ∂_{α_|α|} C(θ) ,   (16)

where α is an |α|-tuple. Since one can take the derivative with respect to the same angle multiple times, we define the set Θ (of size M = |Θ|) as the set of distinct angles with respect to which we take the partial derivative. Similarly, let Θ̄ be the complement of Θ, so that Θ ∪ Θ̄ = θ. Then, for any Θ_k ∈ Θ we define N_k as the multiplicity of Θ_k in α, such that Σ_{k=1}^{M} N_k = |α|. Since the cost function and any of its higher order partial derivatives are continuous functions of the parameters (as can be seen below via multiple applications of the parameter shift rule), one can extend Clairaut's Theorem [28] to rewrite

    D^α C(θ) = ∂_{Θ_1}^{N_1} ··· ∂_{Θ_M}^{N_M} C(θ) .   (17)

Then, applying the parameter shift rule |α| times, we find that the |α|th-order partial derivative can be expressed as a summation of cost functions evaluated at (up to) 2^|α| points:

    D^α C(θ) = (1/2^|α|) Σ_ω W_ω C(Θ̄, Θ^(ω)) .   (18)

Here we defined Θ^(ω) = (Θ_1^(ω_1), ..., Θ_M^(ω_M)), with Θ_k^(ω_k) = Θ_k + ω_k π defined analogously to (3), and where

    W_ω = Π_{l=1}^{M} d_(ω_l,N_l) ,  such that  Σ_ω |W_ω| = 2^|α| .   (19)

Also, ω = (ω_1, ..., ω_M), where ω_l ∈ {0, ±1} if N_l is even, and ω_l ∈ {±1/2, ±3/2} if N_l is odd. Additionally, the coefficients d_(ω_l,N_l) can be obtained from the Pascal tree which we introduce in Fig. 1. In the Appendix we provide additional details regarding the coefficients d_(ω_l,N_l) and the Pascal tree.

[FIG. 1. The Pascal tree. a) The Pascal tree can be obtained by modifying how a Pascal triangle is constructed. In a Pascal triangle, each entry of a row is obtained by adding together the numbers directly above to the left and above to the right, with blank entries considered to be equal to zero. The entries of a Pascal tree are obtained following the aforementioned rule, with the additional constraint that the width of the triangle is restricted to always be smaller than a given even number. Moreover, once an entry in a row is outside the maximum width, its value is added to the central entry in that row. Here the maximum width is four. b) The coefficients d_(ω_l,N_l) in (19) can be obtained from the Pascal tree of (a) by adding signs to the entries of the tree. All entries in a diagonal going from top left to bottom right have the same sign, with the first entry in the first row having a positive sign. Here, each row corresponds to a given N_l, while entries in a row correspond to different values of ω_l, with ω_l ∈ {0, ±1} if N_l is even, and ω_l ∈ {±1/2, ±3/2} if N_l is odd. For instance, d_(−1/2,5) = −12.]

From (18) we obtain that the (|α|+1)th-order derivative, which we denote as ∂_i D^α C(θ) = D^{i,α} C(θ), is obtained as the sum of (up to) 2^|α| first-order partial derivatives:

    D^{i,α} C(θ) = (1/2^|α|) Σ_ω W_ω ∂_i C(Θ̄, Θ^(ω)) .   (20)

Since one has to individually evaluate each term in (20), and since there are up to 2^|α| terms, we henceforth assume that |α| ∈ O(log(n)). This guarantees that the computation of D^{i,α} C(θ) leads to an overhead which is (at most) O(poly(n)).
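The signed Pascal-tree rows can be generated directly from the recursion that produces them: each application of the shift rule sends a coefficient d at shift ω to +d at ω + 1/2 and −d at ω − 1/2, and a full ±2π shift folds back onto ω = 0, which is exactly the width restriction of Fig. 1. The sketch below is our implementation of that rule (the toy check on C(θ) = cos θ is an illustrative assumption); it verifies the sum rule Σ_ω |d_(ω,N)| = 2^N from Eq. (19) and the expansion (18) for a single repeated angle.

```python
import numpy as np
from fractions import Fraction

def next_row(row):
    """One shift-rule step on a coefficient table {omega: d}: each term sends
    +d to omega + 1/2 and -d to omega - 1/2; shifts of +/-2 (i.e. +/-2*pi)
    fold back onto omega = 0, which limits the tree's width (Fig. 1)."""
    new = {}
    for omega, d in row.items():
        for step, sign in ((Fraction(1, 2), 1), (Fraction(-1, 2), -1)):
            w = omega + step
            if abs(w) == 2:
                w = Fraction(0)
            new[w] = new.get(w, 0) + sign * d
    return new

row = {Fraction(0): 1}        # N = 0: the cost itself
theta = 0.37                  # arbitrary test point
for N in range(1, 6):
    row = next_row(row)       # row N holds the signed coefficients d_(omega,N)
    assert sum(abs(d) for d in row.values()) == 2 ** N          # Eq. (19)
    # Check Eq. (18) for one repeated angle on the toy cost C(t) = cos(t),
    # whose exact N-th derivative is cos(t + N*pi/2).
    psr = sum(d * np.cos(theta + float(w) * np.pi)
              for w, d in row.items()) / 2 ** N
    assert np.isclose(psr, np.cos(theta + N * np.pi / 2))
    print(N, {str(w): d for w, d in sorted(row.items())})
```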
The following proposition, which generalizes Proposition 1, allows us to bound the probability that the magnitude of D^{i,α} C(θ) is larger than a given c > 0.

Proposition 2. Consider a cost function of the form (1), for which the parameter shift rule of (2) holds. Let D^{i,α} C(θ) be a higher order partial derivative of the cost as defined in (16). Then, assuming that ⟨∂_i C⟩_θ = 0, the following inequality holds for any c > 0:

    Pr(|D^{i,α} C(θ)| ≥ c) ≤ 2^|α| Var_θ[∂_i C] / c² .   (21)

Proof. From Eq. (20) we can obtain the following bound:

    |D^{i,α} C(θ)| ≤ (1/2^|α|) Σ_ω |W_ω| |∂_i C(Θ̄, Θ^(ω))| .   (22)

Let us define E_ω as the event that |∂_i C(Θ̄, Θ^(ω))| ≥ c. Since (22) holds, and since Σ_ω |W_ω| = 2^|α| by (19), the event |D^{i,α} C(θ)| ≥ c is contained in ∪_ω E_ω, and the following chain of inequalities holds:

    Pr(|D^{i,α} C(θ)| ≥ c) ≤ Pr(∪_ω E_ω)   (23)
                           ≤ Σ_ω Pr(E_ω)   (24)
                           ≤ Σ_ω Var_{(Θ̄,Θ^(ω))}[∂_i C] / c²   (25)
                           ≤ 2^|α| Var_θ[∂_i C] / c² ,   (26)

where we invoked the union bound, and where we recall that ⟨·⟩_θ = ⟨·⟩_{(Θ̄,Θ^(ω))} for all ω. In addition, for (26) we used the fact that the summation in (25) has at most 2^|α| terms. ∎
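Equation (20) can also be checked numerically in a mixed case. Below, still on the illustrative stand-in C(θ) = cos θ₁ cos θ₂ (our choice, not the paper's), we take α = (1, 1), so that Θ = {θ₁} with N₁ = 2, followed by a final shift-rule derivative with respect to i = 2, and compare the Pascal-tree-weighted expansion against the exact third derivative.

```python
import numpy as np

def cost(theta):
    """Analytic stand-in obeying the shift rule in each angle (illustrative
    choice, not a cost from the paper)."""
    return np.cos(theta[0]) * np.cos(theta[1])

def d_shift(f, theta, i):
    """First-order parameter-shift derivative of f with respect to theta_i."""
    up, dn = np.array(theta, float), np.array(theta, float)
    up[i] += np.pi / 2
    dn[i] -= np.pi / 2
    return 0.5 * (f(up) - f(dn))

# Eq. (20) with alpha = (1, 1), i.e. Theta = {theta_1}, N_1 = 2, and i = 2:
# the N = 2 Pascal-tree row gives d_(1,2) = 1, d_(0,2) = -2, d_(-1,2) = 1.
ROW2 = {1: 1, 0: -2, -1: 1}

def D_third(theta):
    """D^{i,alpha} C via Eq. (20): Pascal-tree-weighted first derivatives."""
    total = 0.0
    for omega, d in ROW2.items():
        shifted = np.array(theta, float)
        shifted[0] += omega * np.pi
        total += d * d_shift(cost, shifted, 1)
    return total / 2 ** 2

theta = np.array([0.4, 1.3])
exact = np.cos(theta[0]) * np.sin(theta[1])   # exact d_2 d_1 d_1 C for cos*cos
print(D_third(theta), exact)                  # the two values agree
```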
If the cost function exhibits a barren plateau, the following corollary holds.

Corollary 2. Consider the bound in Eq. (21) of Proposition 2. If the cost exhibits a barren plateau, such that (14) holds, then higher order partial derivatives of the cost function are exponentially vanishing since

    Pr(|D^{i,α} C(θ)| ≥ c) ≤ G(n) / c² ,   (27)

where G(n) ∈ O(1/q^n) for some q > 1.

Proof. Combining (14) and (21) leads to

    Pr(|D^{i,α} C(θ)| ≥ c) ≤ 2^|α| F(n) / c² .   (28)

Then, let us define G(n) = 2^|α| F(n). Since |α| ∈ O(log(n)) and F(n) ∈ O(1/b^n), we know that there exist κ, κ′, and n₀ such that for all n ≥ n₀ we respectively have 2^|α| ≤ n^κ and F(n) ≤ κ′/b^n. Combining these two results we find

    G(n) ≤ κ′ n^κ / b^n = κ′ / b^{L(n)} ,  ∀ n ≥ n₀ ,   (29)

where L(n) = n − κ log_b(n). Equation (29) shows that G(n) ∈ O(1/b^{L(n)}). Then, since

    lim_{n→∞} L(n)/n = 1 ,   (30)

we have L(n) ∈ Ω(n), meaning that there exist κ̂ > 0 and n̂₀ such that for all n ≥ n̂₀ we have L(n) ≥ κ̂ n. The latter implies G(n) ≤ κ′/b^{κ̂ n} for all n ≥ max{n₀, n̂₀}, which means that G(n) ∈ O(1/q^n) where q = b^κ̂. Also, q > 1 follows from b > 1 and κ̂ > 0. ∎

Corollary 2 shows that, in a barren plateau, the magnitude of any efficiently computable higher order partial derivative (i.e., any partial derivative where |α| ∈ O(log(n))) is exponentially vanishing in n with high probability.

V. Discussion

In this work, we investigated the impact of barren plateaus on higher order derivatives. This issue is important in light of a recent proposal to use higher order derivative information to escape a barren plateau. We considered a cost function C that is relevant to both VQAs and QNNs, as barren plateaus are relevant to both of these applications.

Our main result was that, when a barren plateau exists, the Hessian and other higher order partial derivatives of C are exponentially vanishing in n with high probability. Our proof relied on the parameter shift rule, which we showed can be applied iteratively to relate higher order partial derivatives to the first order partial derivative (analogous to what Ref. [25] did for the Hessian). Hence, the parameter shift rule allowed us to state the vanishing of higher order derivatives as essentially a corollary of the vanishing of the first order derivative. We remark that iterative applications of the parameter shift rule led us to a mathematically interesting construct that we call the Pascal tree, depicted in Fig. 1.

Our results imply that estimating higher order partial derivatives in a barren plateau is exponentially hard. Hence, any optimization strategy that requires information about partial derivatives beyond first order (such as the Hessian) will require a precision that grows exponentially with n. We therefore surmise that, by themselves, optimizers that go beyond first-order gradient descent do not appear to be a feasible solution to the barren plateau problem. More generally, our results suggest that it is better to develop strategies that avoid the appearance of a barren plateau altogether, rather than to try to escape an existing one.

Acknowledgements.—We thank Kunal Sharma for helpful discussions. Research presented in this article was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number 20180628ECR. MC was also supported by the Center for Nonlinear Studies at LANL. PJC also acknowledges support from the LANL ASC Beyond Moore's Law project. This work was also supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under the Accelerated Research in Quantum Computing (ARQC) program.
[1] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O'Brien, "A variational eigenvalue solver on a photonic quantum processor," Nature Communications 5, 4213 (2014).
[2] J. R. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, "The theory of variational hybrid quantum-classical algorithms," New Journal of Physics 18, 023023 (2016).
[3] E. Farhi, J. Goldstone, and S. Gutmann, "A quantum approximate optimization algorithm," arXiv:1411.4028 [quant-ph].
[4] J. Romero, J. P. Olson, and A. Aspuru-Guzik, "Quantum autoencoders for efficient compression of quantum data," Quantum Science and Technology 2, 045001 (2017).
[5] S. Khatri, R. LaRose, A. Poremba, L. Cincio, A. T. Sornborger, and P. J. Coles, "Quantum-assisted quantum compiling," Quantum 3, 140 (2019).
[6] R. LaRose, A. Tikku, É. O'Neel-Judy, L. Cincio, and P. J. Coles, "Variational quantum state diagonalization," npj Quantum Information 5, 1–10 (2018).
[7] A. Arrasmith, L. Cincio, A. T. Sornborger, W. H. Zurek, and P. J. Coles, "Variational consistent histories as a hybrid algorithm for quantum foundations," Nature Communications 10, 3438 (2019).
[8] M. Cerezo, A. Poremba, L. Cincio, and P. J. Coles, "Variational quantum fidelity estimation," Quantum 4, 248 (2020).
[9] C. Cirstoiu, Z. Holmes, J. Iosue, L. Cincio, P. J. Coles, and A. Sornborger, "Variational fast forwarding for quantum simulation beyond the coherence time," arXiv:1910.04292.
[10] C. Bravo-Prieto, R. LaRose, M. Cerezo, Y. Subasi, L. Cincio, and P. J. Coles, "Variational quantum linear solver: A hybrid algorithm for linear systems," arXiv:1909.05820.
[11] X. Xu, J. Sun, S. Endo, Y. Li, S. C. Benjamin, and X. Yuan, "Variational algorithms for linear algebra," arXiv:1909.03898.
[12] M. Cerezo, K. Sharma, A. Arrasmith, and P. J. Coles, "Variational quantum state eigensolver," arXiv:2004.01372.
[13] M. Schuld, I. Sinayskiy, and F. Petruccione, "The quest for a quantum neural network," Quantum Information Processing 13, 2567–2586 (2014).
[14] I. Cong, S. Choi, and M. D. Lukin, "Quantum convolutional neural networks," Nature Physics 15, 1273–1278 (2019).
[15] K. Beer, D. Bondarenko, T. Farrelly, T. J. Osborne, R. Salzmann, D. Scheiermann, and R. Wolf, "Training deep quantum neural networks," Nature Communications 11, 1–6 (2020).
[16] G. Verdon, J. Pye, and M. Broughton, "A universal training algorithm for quantum deep learning," arXiv:1806.09729.
[17] J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush, and H. Neven, "Barren plateaus in quantum neural network training landscapes," Nature Communications 9, 4812 (2018).
[18] M. Cerezo, A. Sone, T. Volkoff, L. Cincio, and P. J. Coles, "Cost-function-dependent barren plateaus in shallow quantum neural networks," arXiv:2001.00550.
[19] K. Sharma, M. Cerezo, L. Cincio, and P. J. Coles, "Trainability of dissipative perceptron-based quantum neural networks," arXiv:2005.12458.
[20] S. Wang, E. Fontana, M. Cerezo, K. Sharma, A. Sone, L. Cincio, and P. J. Coles, "Noise-induced barren plateaus in variational quantum algorithms," arXiv:2007.14384.
[21] G. Verdon, M. Broughton, J. R. McClean, K. J. Sung, R. Babbush, Z. Jiang, H. Neven, and M. Mohseni, "Learning to learn with quantum neural networks via classical neural networks," arXiv:1907.05415.
[22] T. Volkoff and P. J. Coles, "Large gradients via correlation in random parameterized quantum circuits," arXiv:2005.12200.
[23] A. Skolik, J. R. McClean, M. Mohseni, P. van der Smagt, and M. Leib, "Layerwise learning for quantum neural networks," arXiv:2006.14904.
[24] E. Grant, L. Wossnig, M. Ostaszewski, and M. Benedetti, "An initialization strategy for addressing barren plateaus in parametrized quantum circuits," Quantum 3, 214 (2019).
[25] P. Huembeli and A. Dauphin, "Characterizing the loss landscape of variational quantum circuits," arXiv:2008.02785.
[26] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, "Quantum circuit learning," Physical Review A 98, 032309 (2018).
[27] M. Schuld, V. Bergholm, C. Gogolin, J. Izaac, and N. Killoran, "Evaluating analytic gradients on quantum hardware," Physical Review A 99, 032331 (2019).
[28] J. Stewart, Multivariable Calculus (Nelson Education, 2015).
Appendix A: Explicit description of d_(ω_l,N_k)

In this appendix we first discuss how the parameter shift rule leads to the Pascal tree. Then, we provide analytical formulas for d_(ω_l,N_k).

Let us consider the first and second order partial derivatives of the cost function with respect to the same angle. From the parameter shift rule of Eq. (2) we find

    ∂_i C(θ) = (1/2) [ C(θ_ī, θ_i^(1/2)) − C(θ_ī, θ_i^(−1/2)) ] ,

    ∂_i² C(θ) = (1/4) [ C(θ_ī, θ_i^(1)) + C(θ_ī, θ_i^(−1)) − 2 C(θ) ] ,

from which we can read off the coefficients d_(±1/2,1) = ±1 in the first line, and d_(±1,2) = 1, d_(0,2) = −2 in the second, so that |d_(0,2)| = |d_(−1/2,1)| + |d_(1/2,1)| = 2. Similarly, if we were to take the third partial derivative with respect to θ_i we would find |d_(±1/2,3)| = |d_(0,2)| + |d_(±1,2)|, and |d_(±3/2,3)| = |d_(±1,2)|. Note that this procedure forms the first four rows of the Pascal tree, which actually coincide with the first four rows of the Pascal triangle. When taking the fourth partial derivative we have to take into account the fact that C(θ_ī, θ_i^(−2)) = C(θ_ī, θ_i^(2)) = C(θ), since e^{−iθσ/2} is equal to e^{−i(θ+2π)σ/2} up to an unobservable global phase. Hence, the equivalence θ ≡ θ^(2) (mod 2π) imposes a restriction on the width of the Pascal tree. Following this procedure one can recover the entries in Fig. 1.

For arbitrary ω_l and N_k, the coefficients d_(ω_l,N_k) can be obtained analytically as follows. If N_k < 2 we have d_(±1,0) = 0, d_(0,0) = 1, d_(±1/2,1) = ±1, and d_(±3/2,1) = 0. Then, for N_k ≥ 2,

    d_(ω_l,N_k) =  (−1)^{N_k/2} 2^{N_k−1}               if ω_l = 0 ,
                   ±(−1)^{(N_k−1)/2} 3·2^{N_k−3}        if ω_l = ±1/2 ,
                   (−1)^{(N_k−2)/2} 2^{N_k−2}           if ω_l = ±1 ,
                   ∓(−1)^{(N_k−1)/2} 2^{N_k−3}          if ω_l = ±3/2 .

Note that for all N_k we have Σ_{ω_l} |d_(ω_l,N_k)| = 2^{N_k}.
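The closed-form expressions can be cross-checked against the shift-rule recursion that builds the Pascal tree. The sketch below is our code (the fold of ω = ±2 onto ω = 0 encodes the θ ≡ θ^(2) (mod 2π) identification); it confirms the formulas for the first several rows and reproduces d_(−1/2,5) = −12, as quoted in Fig. 1.

```python
from fractions import Fraction

def d_closed_form(omega, N):
    """Closed-form d_(omega, N) from this appendix, for N >= 2."""
    sgn = 1 if omega > 0 else -1
    if omega == 0:
        return (-1) ** (N // 2) * 2 ** (N - 1)
    if abs(omega) == 1:
        return (-1) ** ((N - 2) // 2) * 2 ** (N - 2)
    if abs(omega) == Fraction(1, 2):
        return sgn * (-1) ** ((N - 1) // 2) * 3 * 2 ** (N - 3)
    if abs(omega) == Fraction(3, 2):
        return -sgn * (-1) ** ((N - 1) // 2) * 2 ** (N - 3)

def next_row(row):
    """One parameter-shift step: +d at omega + 1/2, -d at omega - 1/2, with
    omega = +/-2 folded back onto 0 (theta and theta + 2*pi coincide)."""
    new = {}
    for omega, d in row.items():
        for step, sign in ((Fraction(1, 2), 1), (Fraction(-1, 2), -1)):
            w = omega + step
            if abs(w) == 2:
                w = Fraction(0)
            new[w] = new.get(w, 0) + sign * d
    return new

row = {Fraction(0): 1}
for N in range(1, 9):
    row = next_row(row)
    assert sum(abs(d) for d in row.values()) == 2 ** N   # sum rule
    if N >= 2:
        assert all(d == d_closed_form(w, N) for w, d in row.items())

print(d_closed_form(Fraction(-1, 2), 5))   # -12, as quoted in Fig. 1
```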