Hyper-optimized tensor network contraction

Johnnie Gray¹,² and Stefanos Kourtis¹,³,⁴

¹ Blackett Laboratory, Imperial College London, London SW7 2AZ, United Kingdom
² Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, USA
³ Department of Physics, Boston University, Boston, MA 02215, USA
⁴ Institut quantique & Département de physique, Université de Sherbrooke, Québec J1K 2R1, Canada

March 12, 2021 (arXiv:2002.01935v4 [quant-ph], 11 Mar 2021)

Abstract: Tensor networks represent the state-of-the-art in computational methods across many disciplines, including the classical simulation of quantum many-body systems and quantum circuits. Several applications of current interest give rise to tensor networks with irregular geometries. Finding the best possible contraction path for such networks is a central problem, with an exponential effect on computation time and memory footprint. In this work, we implement new randomized protocols that find very high quality contraction paths for arbitrary and large tensor networks. We test our methods on a variety of benchmarks, including the random quantum circuit instances recently implemented on Google quantum chips. We find that the paths obtained can be very close to optimal, and often many orders of magnitude better than the most established approaches. As different underlying geometries suit different methods, we also introduce a hyper-optimization approach, where both the method applied and its algorithmic parameters are tuned during the path finding. The increase in quality of contraction schemes found has significant practical implications for the simulation of quantum many-body systems and particularly for the benchmarking of new quantum chips. Concretely, we estimate a speed-up of over 10,000× compared to the original expectation for the classical simulation of the Sycamore 'supremacy' circuits.

1 Introduction

Since the advent of the density-matrix renormalization group algorithm, invented to study one-dimensional lattice systems of quantum degrees of freedom, tensor networks have permeated a plethora of scientific disciplines, finding use in fields such as quantum condensed matter [1–4], classical statistical mechanics [5–7], information science and big-data processing [8, 9], systems engineering [10], quantum computation [11], machine learning and artificial reasoning [12–14] and more. The underlying idea of tensor network methods is to use sparse networks of interconnected low-rank tensors to represent data structures that would otherwise be expressed in (very) high-rank tensor form, which is hard to manipulate. Due to this ubiquity, techniques to perform (multi)linear algebraic operations on tensor networks accurately and efficiently are very useful to a highly interdisciplinary community of researchers and engineers. Of these operations, tensor network contraction, i.e., the evaluation of a scalar quantity that has been expressed as a tensor network, is the most common.

When a system under consideration gives rise to a tensor network with a regular structure, such as a lattice, the renormalization group apparatus is often employed to perform tensor network contractions with controllable accuracy. This approach has been successful in tackling a variety of classical and quantum many-body problems [5–7, 15–20]. Efficient tensor network contraction is also possible in special cases in
which network topology (e.g., trees), values of tensor entries, or both are restricted [21–26]. Despite these results, contracting tensor networks with arbitrary structure remains (at least) #P-hard in the general case [27, 28]. This is true, in particular, for tensor networks that model random quantum circuits, a fact that has recently inspired proposals for quantum algorithms running on these circuits that aim towards a practically demonstrable quantum computational advantage over classical computers [11, 29–39].
The key idea is that, unlike quantum algorithms (e.g., Shor or Grover) that require deep quantum circuits and high gate fidelities — inaccessible in the near future — to become manifestly advantageous, the task of sampling bit strings from the output of random quantum circuits is expected to be hard to simulate classically even for low-depth circuits and low-fidelity gates. The precise threshold for observing such a quantum advantage is nonuniversal and ultimately depends on the efficiency of the classical simulation for each particular combination of circuit model and quantum chip architecture. This motivates the development of high-performance simulation techniques for these quantum systems, predominantly based on finding good contraction paths for tensor networks, which runs in parallel to the race for the development of higher qubit count and quality devices [40–42].

Inspired by the classical simulation of quantum circuits, here we introduce a new framework for exact contraction of large tensor networks with arbitrary structure (see examples in Fig. 1). The first key idea of this framework is to explicitly construct the contraction tree for a given tensor network, combining agglomerative, divisive, and optimal drivers for forming sub-trees at different scales. The second key idea is to hyper-optimize the generation of these trees, and to do this with respect to the entire tree and thus the total contraction cost, rather than just the leading scaling, given by the line-graph tree-width for example. We also establish a powerful set of simplifications for efficiently pre-processing tensor networks prior to contraction.

Figure 1: Sample tensor networks: (a) simplified network for a rectangular 7×7 qubit 1+40+1 depth random quantum circuit with 742 rank-3 tensors; (b) a random 5-regular network with 100 tensors, arising in, e.g., SAT problems; and (c) a random planar network with 184 tensors, arising in, e.g., the statistical-mechanical evaluation of knot invariants.

Using this framework we are able to find very high-quality contraction paths, achieving speedups that scale exponentially with the number of tensors in the network compared to established approaches, for a variety of problems. The drivers we test include recently introduced contraction algorithms based on graph partitioning and community structure detection [43], previously theorized [11] and recently implemented [44] algorithms based on the tree decomposition of graphs, as well as new heuristics that we introduce in this work. Furthermore, observing that different graph structures favor different algorithms, we implement a hyper-optimization approach, where both the method applied and its parameters are varied throughout the contraction path search, leading to automatically customized contraction algorithms that often achieve near-optimal performance.

We demonstrate the new methodology introduced here on a range of benchmarks. First, we test on problems defined on random graph families, such as simulation of solving MAX-CUT with quantum approximate optimization, as well as weighted model counting. We find substantial improvements in performance compared to previous methods reported in the literature. We then simulate random quantum circuits recently implemented by Google on the Bristlecone and Sycamore architectures. We estimate a speed-up of over 10,000× in the classical simulation of the Sycamore 'supremacy' circuits compared to what is given in [45]. In general, our algorithms outperform all others for the same task, by a wide margin on general networks and by a narrower margin on planar structures. These findings thus illustrate that our methods can lead to significant performance gains across a spectrum of tensor network applications. This is the main result of this paper.

The remainder of this paper is organized as follows. In Sec. 2 we formalize the problem of
finding the optimal contraction path for arbitrary tensor networks. In Sec. 3 we introduce and explain the various algorithms employed in our heuristics. In Sec. 4 we test our methods on a variety of benchmarks, including the random quantum circuit instances recently implemented on Google Bristlecone and Sycamore quantum chips, the simulation of the quantum adiabatic optimization algorithm for solving the MAX-CUT problem on random regular graphs, and exact weighted model counting on problem instances from a recent competition. We conclude in Sec. 5.

2 Problem statement

We denote an edge-weighted graph by G = (V, E), where V is the vertex set and the set of 2-tuples of vertex indices E ⊂ {(u, v) : u, v ∈ V} is the edge set, along with a weight function w : E → R⁺ that assigns a positive real number to each edge. For each vertex v, define the incidence set s_v = {e : e ∈ E and v ∈ e}, which is the set of edges incident to vertex v, such that |s_v| = d_v, the degree of vertex v.

To define a tensor network, we augment G with (i) a discrete variable x_e for each edge e ∈ E, whose set of possible values is given by D(e) with |D(e)| = w(e), (ii) an ordered tuple t_v : N_{d_v} → s_v for each vertex v ∈ V, and (iii) a multivariate function or tensor T_v : D(t_v(1)) × ··· × D(t_v(d_v)) → C, where t_v(i) denotes the i-th element of tuple t_v, for every vertex v ∈ V. That w is defined to be a real-valued function even though D(e) ∈ Z⁺ ∀ e ∈ E is simply a choice that allows for extra flexibility in the design of contraction algorithms; see, e.g., the Boltzmann greedy algorithm below.

With these definitions, a tensor network contraction can be represented as a sequence of vertex contractions in graph G. Each vertex contraction removes common edges between pairs of tensors, if any, and represents a product operation on the corresponding tensors, in which one takes the inner product over common indices, or an outer product if there are no common indices. For simplicity, in what follows we consider only pairwise contractions, which are common practice. Multiway contractions are also possible, but they can always be decomposed into sequences of pairwise contractions. For some applications, only a subset of V must be contracted, while in others all vertices in V are contracted into a single vertex. Here we will focus on the latter case, as it underlies the former. We will assume that G initially has no loops, i.e., edges connecting vertices to themselves, and that multiple edges are always contracted simultaneously, so that no loops occur throughout the contraction.

To represent the sequence of vertex contractions, we define a rooted binary tree B = (V_B, E_B), with the first |V| vertex indices denoting leaves, using two tuples l and r such that l(v) and r(v) are the indices of the 'left' and 'right' children of vertex v ∈ V_B, respectively, if any. This defines a tree embedding of G [46]. Finally, we assign an incidence set s_v to each v ∈ V_B, starting with the leaves, according to

    s_v = \begin{cases} \{e : e \in E \text{ and } v \in e\} & \text{if } v \text{ is a leaf index}, \\ s_{l(v)} \oplus s_{r(v)} & \text{otherwise}, \end{cases}    (1)

with s_i ⊕ s_j = (s_i ∪ s_j) \ (s_i ∩ s_j). The composite (B, S), where S = {s_v : v ∈ V_B}, defines a contraction tree of G.

For a given tensor network contraction tree, one can quantify the space and time cost of contracting the network. First, the total space required for the contraction of a network is given, up to an O(|V|) prefactor, by 2^W, for contraction width

    W = \mathrm{ec}_{\max}(B, S),    (2)

where ec_max is the maximum edge congestion for this tree embedding of G [47]. In our notation,

    \mathrm{ec}_{\max}(B, S) = \max_{v \in V_B} \sum_{e \in s_v} \log_2 w(e).    (3)

A space-optimal contraction tree for G is then defined by

    B_{\mathrm{space}}(G) = \mathrm{argmin}_{B \in \mathcal{B}_{|V|}} \mathrm{ec}_{\max}(B, S),    (4)

where B_{|V|} is the set of all rooted binary trees with |V| leaves. For systems of boolean variables or qubits, w = 2 and ec_max(B, S) = max_{v∈V_B} |s_v|.
The contraction width is then equal to the maximum vertex degree in the minors of G obtained throughout the contraction path represented by B [43], as illustrated in the example of Fig. 2. The same logic extends to any constant w.

Similarly, the time complexity of the contraction is captured by the contraction cost

    C(B, S) = \sum_{v \in V_B} 2^{\mathrm{vc}(B,S,v)},    (5)

where vc is the vertex congestion [47]

    \mathrm{vc}(B, S, v) = \sum_{e \in s_{l(v)} \cup s_{r(v)}} \log_2 w(e).    (6)

Again using the case of qubits as an example, the number of operations required to obtain the tensor corresponding to a non-leaf vertex v by contracting its children is proportional to 2^{|s_{l(v)} ∪ s_{r(v)}|}. More precisely, assuming every contraction is an inner product, for real (complex) tensors the associated FLOP count will be a factor of two (eight) times more than C: one (six) FLOP(s) for the multiplication and one (two) FLOP(s) for the addition. A time-optimal contraction tree for G is then

    B_{\mathrm{time}}(G) = \mathrm{argmin}_{B \in \mathcal{B}_{|V|}} C(B, S).    (7)

B_time(G) and B_space(G) are not necessarily the same, and hence a strategy that aims to find one is not guaranteed to also find or approximate the other.
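To make these definitions concrete, the following is a minimal Python sketch (ours, not from the original work) that scores a given contraction tree by computing W and C directly from Eqs. (1)–(6); the encoding of the tree via `leaves` and `children` dictionaries is an illustrative convention of ours, and internal nodes are assumed to be listed bottom-up.

```python
import math

def tree_stats(leaves, children, weights):
    """Score a contraction tree: width W (Eqs. 2-3) and cost C (Eqs. 5-6).

    leaves   : dict leaf node -> set of incident edge labels (Eq. 1, leaf case)
    children : dict internal node -> (left, right), iterated bottom-up
    weights  : dict edge label -> bond dimension w(e)
    """
    s = dict(leaves)
    W = max(sum(math.log2(weights[e]) for e in sv) for sv in s.values())
    C = 0
    for v, (l, r) in children.items():
        union = s[l] | s[r]                        # indices involved in this contraction
        C += math.prod(weights[e] for e in union)  # 2**vc(B, S, v), Eqs. (5)-(6)
        s[v] = s[l] ^ s[r]                         # symmetric difference, Eq. (1)
        W = max(W, sum(math.log2(weights[e]) for e in s[v]))
    return W, C

# a three-tensor chain a-b-c with all bond dimensions 2:
leaves = {0: {'ab'}, 1: {'ab', 'bc'}, 2: {'bc'}}
children = {3: (0, 1), 4: (3, 2)}
weights = {'ab': 2, 'bc': 2}
print(tree_stats(leaves, children, weights))  # (2.0, 6)
```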
Table 1: Contraction path optimization methods detailed in Secs. 3.1–3.5. For each method, we list its name, whether it is guaranteed to find the optimal contraction path, whether it incorporates edge weights (i.e., bond dimensions), whether it naturally handles hyper-edges, and whether it targets the total contraction cost or just the leading cost (single most expensive contraction).

    Method                          Optimal     Edge weights   Hyper edges   Targets
    Exhaustive search               yes         yes            yes           total cost
    Line graph tree decomposition   depends^a   no             yes           leading cost
    Community detection             no          yes            no            total cost
    Boltzmann-greedy                no          yes            yes           total cost
    Hyper-graph partitioning        no          yes            yes           total cost

    ^a QuickBB will eventually find the optimal contraction with respect to leading cost, but FlowCutter will not.

Figure 2: For the graph shown in (a), two possible contraction trees (b) and (c), showing intermediate tensors and congestions. Each edge in a tree has an associated tensor and subgraph. The size of the tensor is exponential in the number of indices (denoted by unique colors) running along that edge — the edge congestion. Each vertex in a tree represents a pairwise contraction of two tensors, as well as a bi-partitioning of the parent edge's subgraph (the dashed grey line shows one example of this). The cost of that pairwise contraction is exponential in the number of indices passing through that vertex — the vertex congestion. Assuming each index is the same size, the tree (c) thus has both a higher maximum contraction width (in bold) and total contraction cost than tree (b).

3 Tensor network contraction path optimization

We have shown that the optimization of the contraction path for a given tensor network corresponds to minimization of a vertex or edge congestion measure over the possible tree embeddings of the network. Instead of performing this minimization, here we will use methods that optimize contraction paths based on quantities that are proxies to these congestion measures, as explained below. Our heuristics are based on established algorithms for a variety of common graph theoretic tasks, such as balanced bipartitioning or community detection, some of which, unlike tree embedding, have seen decades of development and improvement, thus affording great benefits in performance to our methods. We stress, however, that all contraction path optimization tools studied in this work except for those introduced in Secs. 3.1 and 3.2 are original contributions, and that the graph theory algorithms used to perform a particular task (e.g., graph partitioning) are interchangeable with any other algorithm that can perform the same task. Finally, we also note that all the algorithms we test except for the exhaustive search of Sec. 3.1 are not guaranteed to find the global minimum of the congestion measures. Nevertheless, as will be seen below, they can often get arbitrarily close to the optimum. A summary of the methods we introduce below is shown in Tab. 1.
3.1 Exhaustive search

One method for finding contraction trees is to exhaustively search through all of them and return whichever minimizes the desired target W or C. Since outer products are rarely ever beneficial, an efficient but virtually optimal way to perform this search is to adopt a dynamic programming approach that builds the tree considering connected subgraphs only [48]. We refer to this optimizer as Optimal and for our results use the version implemented in opt_einsum [49].
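As a usage illustration (ours, not the authors' benchmark code), opt_einsum exposes this search directly; recent versions also offer a 'dp' optimizer implementing the connected-subgraph dynamic programming of [48], though the available option names depend on the installed version.

```python
import numpy as np
import opt_einsum as oe

# a small matrix chain: the search space is tiny, so exhaustive search is feasible
a, b, c = (np.random.rand(8, 8) for _ in range(3))

path, info = oe.contract_path('ab,bc,cd->ad', a, b, c, optimize='optimal')
print(path)                       # e.g. [(0, 1), (0, 1)]: a sequence of pairwise contractions
print(info.opt_cost)              # total scalar operations, i.e. C up to a constant factor
print(info.largest_intermediate)  # number of elements of the biggest tensor, 2**W
```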
3.2 Line-Graph Tree Decompositions - QuickBB & FlowCutter

The most common approach to contracting arbitrary tensor networks in recent years, motivated by the results of Markov and Shi [11], has been to find a tree decomposition of the line graph of G. From this tree decomposition, an edge elimination ordering can be constructed such that the complexity of the corresponding contraction is upper bounded by the tree-width of the line graph minus one. Practically speaking, we turn an edge ordering (e₁, e₂, e₃, ...) into a contraction tree as follows. First, find the subgraph of G induced by the next edge in the ordering, e_i. Update G by contracting all of the tensors in this subgraph to form a single vertex (if there are more than 2 tensors, use an exhaustive or greedy approach to find a contraction sequence for this small subgraph only). Repeat until all edges in the ordering have been processed.

In the tensor network literature the most commonly used tree decomposition finder is QuickBB [50], which implements a depth-first 'branch and bound' search. Broadly speaking, this approach emphasizes performance for graphs with modest numbers of edges, where indeed QuickBB has been shown to work well [42]. More recently, the FlowCutter tree decomposition finder [51, 52] has been applied to tensor networks [44]. FlowCutter takes more of a 'top-down' approach, which emphasizes performance on graphs with large numbers of edges. Both function as 'any-time' algorithms, able to yield the best found solution after an arbitrary set time. On the other hand, neither of these optimizers takes edge weights into account, which may be a significant disadvantage in the many-body setting, where, unlike in quantum circuits, bond dimensions can vary significantly.
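The conversion from an edge elimination ordering into a pairwise contraction path can be sketched as follows. This is our illustrative code, which contracts multi-tensor subgraphs in an arbitrary pairwise order rather than via the exhaustive or greedy sub-search described above:

```python
def ordering_to_path(num_tensors, edge_order, edge_map):
    """Turn an edge elimination ordering into a list of pairwise contractions.

    num_tensors : number of leaf tensors in the network
    edge_order  : edge labels, as produced from the tree decomposition
    edge_map    : dict edge label -> set of leaf tensor ids sharing that index
    """
    owner = list(range(num_tensors))       # which current vertex each leaf sits in
    groups = {i: {i} for i in range(num_tensors)}
    path, nxt = [], num_tensors
    for e in edge_order:
        verts = sorted({owner[t] for t in edge_map[e]})
        while len(verts) > 1:              # contract the induced subgraph to one vertex
            a, b = verts.pop(), verts.pop()
            path.append((a, b))
            groups[nxt] = groups.pop(a) | groups.pop(b)
            for t in groups[nxt]:
                owner[t] = nxt
            verts.append(nxt)
            nxt += 1
    return path
```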
3.3 Community detection via edge betweenness - Hyper-GN

One of the methods for the contraction of tensor networks with arbitrary structure introduced in Ref. [43] is based on detecting communities in the network. Qualitatively, a community is a subset of the vertices in a network that is densely connected internally and sparsely connected with its complement. Detecting communities in networks is a central problem in the study of complex networks [53, 54].

The intuition behind using the community structure to contract an arbitrary tensor network is that it is advantageous to contract all the edges between vertices that belong to a community first. That is because the vertex that results from the contraction of all edges within a community, which we call a community vertex, is sparsely connected with the rest of the network. Thus, when a community structure exists and is detected in the network, the adherence of contractions to this community structure is expected to lead to community vertices with a maximum degree that is lower than that of the same number of vertices reached by an arbitrary sequence of contractions of the original network. This approach hence effectively minimizes the contraction cost, i.e., yields a contraction sequence that approximates the one defined by the space-optimal contraction tree.

A popular community structure detection algorithm is the one of Girvan and Newman [55]. It operates by evaluating a quantity called edge betweenness centrality, defined as

    g(e) = \sum_{s,t \in V} \sigma_{st}(e) / \sigma_{st},    (8)

where σ_{st} is the total number of shortest paths between vertices s and t, and σ_{st}(e) is the number of those paths that pass through edge e ∈ E. The algorithm starts with an empty edge list and repeats two steps:

1. remove e′ = argmax_{e∈E} g(e) from E and add it to the list,
2. recalculate g(e) ∀ e ∈ E,

until E is exhausted. Multiple edges can be processed simultaneously, since they have the same g. The resulting list of edges, sometimes called a dendrogram, defines the detected community structure: if one sequentially removes the list entries from E until G becomes disconnected, then the resulting connected components are the communities of G. The algorithm then proceeds by splitting each connected component into smaller communities, and the process repeats all the way down to the individual vertex level.

The output of the Girvan-Newman method is also a contraction path: one simply has to traverse the edge list in reverse, each entry defining a contraction of the endpoints of the corresponding edge. One can incorporate edge weights (and thus bond dimensions) into Eq. (8), possibly randomized with some strength τ, to generate varied paths. We call the optimizer based on repeated sampling of these paths Hyper-GN.
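For illustration, networkx implements the betweenness-based dendrogram construction; Hyper-GN builds on this idea but additionally incorporates edge weights and randomization (the graph below is just a stand-in example of ours):

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.random_regular_graph(3, 12, seed=1)

# each iterate is a successively finer community split, obtained by repeatedly
# removing the highest-betweenness edge of Eq. (8)
for communities in girvan_newman(G):
    print([sorted(c) for c in communities])
    if len(communities) >= 4:
        break
```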
3.4 Agglomerative contraction trees - Hyper-Greedy

One simple way to construct a contraction tree is greedily, from the bottom up. Here, one ignores any overall structure of the graph G and instead heuristically scores each possible pairwise contraction. Based on these scores, a pair of tensors can be chosen and contracted into a new vertex, and the list of scores then updated with any new possible contractions. Whilst we know the exact cost and output size of each pairwise contraction, we do not know the effect it might have on the cost and size of later contractions, meaning we must instead carefully choose the heuristic score function.

Given two tensors T_i and T_j whose contraction yields T_k, one reasonable choice for the heuristic cost function is

    \mathrm{cost}(T_i, T_j) = \mathrm{size}(T_k) - \alpha\,(\mathrm{size}(T_i) + \mathrm{size}(T_j)),    (9)

with α a tunable constant. If we take α = 1 then this cost is directly proportional to the change in memory should we perform the contraction, whereas taking α = 0 essentially just prioritizes the rank of the new tensor. Since we will want to sample many greedy paths, we also introduce a 'Boltzmann factor' weighting of the costs, such that the probability of selecting a pairwise contraction is

    p(T_i, T_j) \propto \exp(-\mathrm{cost}(T_i, T_j)/\tau),    (10)

with τ an effective temperature governing how 'adventurous' the path finding should be. Repeatedly generating contraction trees using this combination of cost and weighting, whilst potentially tuning both α and τ, leads to the Hyper-Greedy optimizer. Hyper-Greedy generally outperforms other greedy approaches and is quick to run, making it a simple but useful reference algorithm.
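A minimal sketch of one Boltzmann-weighted greedy sample (Eqs. (9)–(10)) might look as follows; it assumes a connected, plain tensor network in which every shared index is summed exactly once (no hyperedges), so the indices of a new tensor are the symmetric difference of its parents' indices:

```python
import math
import random

def sample_greedy_path(inputs, shapes, alpha=1.0, tau=1.0, seed=None):
    """Sample one greedy contraction path with Boltzmann weighting.

    inputs : list of sets of index labels, one per tensor
    shapes : dict index label -> dimension
    """
    rng = random.Random(seed)
    size = lambda ix: math.prod(shapes[i] for i in ix)
    inputs = [set(ix) for ix in inputs]
    path = []
    while len(inputs) > 1:
        cands = []
        for i in range(len(inputs)):
            for j in range(i + 1, len(inputs)):
                if inputs[i] & inputs[j]:             # skip outer products
                    k = inputs[i] ^ inputs[j]         # indices of the new tensor
                    cost = size(k) - alpha * (size(inputs[i]) + size(inputs[j]))
                    cands.append((i, j, k, cost))
        low = min(c for (_, _, _, c) in cands)        # shift costs for numerical safety
        ws = [math.exp(-(c - low) / tau) for (_, _, _, c) in cands]
        i, j, k, _ = rng.choices(cands, weights=ws)[0]
        path.append((i, j))
        inputs = [ix for n, ix in enumerate(inputs) if n not in (i, j)] + [k]
    return path
```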
3.5 Divisive contraction trees - Hyper-Par

The greedy or agglomerative approach is a natural way to think about building contraction trees from the bottom up. However, as introduced in [43], we can also try to build contraction trees from the top down in a divisive manner. The key here is that each node in a contraction tree represents not only an effective tensor but a subgraph of the initial graph describing the full tensor network. As we ascend a contraction tree, merging two nodes corresponds to a pairwise contraction of the two effective tensors. In reverse, as we descend a contraction tree, splitting a node corresponds to a bipartitioning of the subgraph associated with that node.

Practically, we start with the list of 'childless' vertices - initially just the root of the tree corresponding to the full graph, {V_G}. We take the next childless vertex, V, and partition it into V = V₁ ∪ V₂. If |V₁| > 1 we append it to the list of childless vertices, and similarly if |V₂| > 1. This process can be repeated until the full contraction tree is generated. Such a divisive approach is very similar to the community detection scheme introduced earlier; however, whilst the Girvan-Newman algorithm naturally yields the entire contraction tree, here we create single contractions one at a time. This allows one to combine partitioning with other optimizers. For example, we can instead partition a vertex V into k partitions, V₁, V₂, ..., V_k, and then use the Optimal or Hyper-Greedy optimizer to 'fill in' the contraction tree — essentially finding the contraction path for a tensor network composed just of the tensors corresponding to each of these new subgraphs. Similarly, if the size of V drops below some threshold, we can again use either Optimal or Hyper-Greedy to find the remaining part of the contraction tree corresponding just to the leaf tensors in V.

The cost of an individual contraction - a vertex bi-partitioning - is given by the product of the dimensions of the involved indices. These include any outer indices of the subgraph, plus any indices that cross the newly created partition. Since the outer indices are independent of the partition, minimizing the number of indices cut by a partition also minimizes the cost of the corresponding contraction. This is still essentially a greedy approach - it only considers the cost of a single contraction, and strictly minimizing this cost (corresponding to choosing a min-cut) could likely create more expensive contractions down the line. However, one way to heuristically adjust this is to control how balanced to make the partitions, in other words, how much to match the size of each partition. Specifically, we can define the imbalance parameter, ε, such that |V_i| ≤ (1 + ε)|V|/k for i = 1 ... k, where k is the number of partitions. If ε is close to zero, then the partitions are forced to be very similar in size, whilst if ε is close to k the partitions are allowed to be of any size.

Taking into account the internal structure of the tensors in a problem allows for further flexibility in the recursive bipartition process, which in turn can lead to significant performance gains. As an example, consider the case of a COPY tensor, whose entries are 1 only when all indices are equal and 0 otherwise. These tensors appear, for example, when modeling circuits of controlled gates (see, e.g., Sec. 4.6.1) or satisfiability formulas [26, 43]. Each COPY tensor in a network can be replaced by any connected graph of COPY tensors without changing the result of the contraction [4]. By replacing all COPY tensors in the network with hyperedges, one can perform recursive hypergraph bipartitioning with more freedom in the search for short cuts compared to the original graph. To revert back to a 'traditional' tensor network after partitioning, each hyperedge can be replaced by a low-rank COPY tensor subgraph that cuts each separator at most once, as illustrated in Fig. 3. Another important use-case for hyperedges is to efficiently treat batch and output indices, though these are not benchmarked in this work.

Figure 3: (a) Segment of tensor network with six tensors, one of which (black filled circle) is a COPY tensor. (b) COPY tensor replaced by a hyperedge. Recursive hypergraph bipartitioning yields the separator hierarchy drawn as dashed lines, with thicker lines for higher level in the hierarchy. (c) After a separator hierarchy is found, the hyperedge is replaced by a connected subgraph of COPY tensors whose edges intersect each separator at most once. The results of the contraction of networks (a) and (c) are identical.

We employ the partitioner KaHyPar [56, 57] to generate our contraction trees for a number of reasons. Aside from offering state-of-the-art performance, it can also handle hypergraphs (and thus arbitrary tensor expressions), allows key parameters such as the imbalance ε to be specified, and takes into account edge weights (and thus arbitrary bond dimensions). Repeatedly sampling contraction trees whilst tuning the parameters k, ε and the cut-off to stop partitioning leads us to the optimizer we call Hyper-Par. Note that the line graph and greedy methods of Secs. 3.2 and 3.4, respectively, also support hypergraphs natively.

In passing, we note that (hyper)graph partitioning has been used as a simplification tool for computational tasks in other research fields; see, e.g., [58].
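The recursive scheme can be sketched generically. In the sketch below (ours), networkx's Kernighan–Lin bisection stands in for KaHyPar, which, unlike this stand-in, handles hyperedges, edge weights, k-way partitions and the imbalance ε:

```python
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def divisive_tree(G, cutoff=2):
    """Build a contraction tree top-down by recursive bisection.

    Returns nested (left, right) tuples over vertex ids.
    """
    nodes = list(G.nodes)
    if len(nodes) == 1:
        return nodes[0]
    if len(nodes) <= cutoff:
        # small subgraph: fill in with any (e.g. optimal or greedy) sub-order
        tree = nodes[0]
        for n in nodes[1:]:
            tree = (tree, n)
        return tree
    part1, part2 = kernighan_lin_bisection(G)
    return (divisive_tree(G.subgraph(part1), cutoff),
            divisive_tree(G.subgraph(part2), cutoff))

G = nx.random_regular_graph(3, 16, seed=0)
print(divisive_tree(G))
```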
3.6 Stochastic Bayesian Optimization

The Optimal contraction tree optimizer runs until completion, whilst QuickBB and FlowCutter are natively any-time algorithms. For the remaining three optimizers – Hyper-GN, Hyper-Greedy and Hyper-Par – we use a combination of randomization and Bayesian optimization [59] to intelligently sample ever better contraction paths. This allows all three of them to run as parallel any-time algorithms.

For the Hyper-GN and Hyper-Par optimizers, randomization can be introduced as noise on the edge weights of the initial graph G. For the Hyper-Greedy optimizer, the Boltzmann sampling of greedy contractions yields another source of randomization. Due to the high sensitivity of the contraction width W and cost C to the contraction path, simply sampling many paths and keeping the best already offers significant improvements over single-shot versions of these same algorithms. However, we can further improve the performance if we allow the heuristic parameters of each optimizer to be tuned as the sampling continues. We use the baytune [60] library to perform this optimization, which uses Gaussian processes [61] to model the effect of the parameters on the target score – either W or C – and suggest new combinations which are likely to perform well.
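Schematically, the hyper-optimization loop looks as follows. This sketch uses plain random search and hypothetical `find_tree` callables purely for illustration — the actual implementation replaces the random suggestions with Gaussian-process Bayesian optimization via baytune over the same search space:

```python
import random

def hyper_optimize(tn, drivers, n_trials=64, seed=0):
    """Sample contraction trees while tuning both the driver and its parameters.

    drivers : dict name -> (find_tree, param_space); param_space maps a
              parameter name to candidate values, and find_tree(tn, **params)
              returns (score, tree), with score the target W or C.
    """
    rng = random.Random(seed)
    best_score, best_tree = float('inf'), None
    for _ in range(n_trials):
        name = rng.choice(list(drivers))          # e.g. one of the three Hyper-* drivers
        find_tree, space = drivers[name]
        params = {p: rng.choice(vals) for p, vals in space.items()}
        score, tree = find_tree(tn, **params)     # one randomized tree sample
        if score < best_score:
            best_score, best_tree = score, tree
    return best_tree
```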
3.7 Tensor Network Simplifications

Next we describe a series of simplifications, based simply on tensor network structure and the sparsity of the tensors, that we perform iteratively until no more operations are possible. These are all designed to decrease the complexity of the tensor network prior to invoking the full contraction path finders, and are performed as efficient local searches.

The first of these is diagonal-reduction of tensor axes, as introduced for quantum circuits in [62]. For a k-dimensional tensor t_{i₁i₂...i_k} with indices i₁i₂...i_k, if for any pair of indices {i_x, i_y}

    t_{i_1 i_2 \ldots i_k} = 0 \;\; \forall \; i_x \neq i_y,    (11)

then we can replace t with a (k−1)-dimensional tensor t̃ with elements t̃_{...i_x} = t_{...i_x i_y} δ^{i_x}_{i_y}, where the δ copy can be implemented by re-indexing i_y → i_x everywhere else in the tensor network, thus resulting in i_x becoming a hyperedge. This enables the use of the hypergraph machinery detailed in Sec. 3.5.

The second pre-processing step we perform is rank-simplification. Here we generate a greedy contraction path that targets rank reduction only (i.e., with respect to Eqs. (9) and (10), sets α = τ = 0). We then perform any of the pairwise contractions such that the rank of the output tensor is not larger than the rank of either input tensor. If the tensor network has no hyperedges, this corresponds to absorbing all rank-1 and rank-2 tensors into neighbouring tensors, a process which cannot increase the cut-weight across any partition, for example.

The third pre-processing step we perform is antidiagonal-gauging. Here, again assuming we have a k-dimensional tensor t_{i₁i₂...i_k}, if for any pair of indices {i_x, i_y} of matching size d we find

    t_{i_1 i_2 \ldots i_k} = 0 \;\; \forall \; i_x \neq d - i_y,    (12)

then we can flip the order of either index i_x or i_y throughout the tensor network. This corresponds to gauging that index with a 'reflected' identity, for example, if d = 2, the Pauli matrix X. This simplification does not help on its own but merely produces tensors which can then be diagonally reduced using the prior scheme.

The fourth simplification we perform is column-reduction. Here, if for any k-dimensional tensor t_{i₁i₂...i_k} we find an index i_x and 'column' c such that

    t_{i_1 i_2 \ldots i_k} = 0 \;\; \forall \; i_x \neq c,    (13)

then we can replace every tensor t_{...i_x} featuring that index with the (k−1)-dimensional tensor t̃ corresponding to the slice t_{...[i_x=c]}, removing that index from the network entirely. This can be pictured as projecting the index into the basis state |c⟩.

The final possible processing step is split-simplification. Here, if any tensor t has an exact low-rank decomposition across any bipartition of its indices – i.e., t_{i₁...j₁...} = Σ_k l_{i₁...,k} r_{j₁...,k} with max(size(l), size(r)) < size(t) – we perform it. This is done using the SVD, and is the one simplification that increases the number of tensors, in order to decrease the cut-weight across partitions.
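As an example of how cheap these structural checks are, the following sketch (ours) detects a pair of axes on which a dense tensor is diagonal in the sense of Eq. (11); a CZ gate reshaped to a rank-4 tensor is diagonal on each of its (input, output) pairs:

```python
import numpy as np

def find_diagonal_axes(t, atol=1e-12):
    """Return a pair of axes (x, y) on which tensor t is diagonal, if any (Eq. 11)."""
    for x in range(t.ndim):
        for y in range(x + 1, t.ndim):
            if t.shape[x] != t.shape[y]:
                continue
            tm = np.moveaxis(t, (x, y), (0, 1)).copy()
            d = np.arange(t.shape[x])
            tm[d, d] = 0                      # zero the diagonal ...
            if np.allclose(tm, 0, atol=atol):
                return x, y                   # ... nothing remains off it
    return None

# CZ = diag(1, 1, 1, -1) with index order (o_a, o_b, i_a, i_b)
CZ = np.diag([1., 1., 1., -1.]).reshape(2, 2, 2, 2)
print(find_diagonal_axes(CZ))  # (0, 2): the (o_a, i_a) pair
```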
We apply the above set of simplifications iteratively but deterministically until no method can find any operation to perform. For all methods that compare to zero we use a relative precision of 10⁻¹² unless otherwise stated. The order they are applied in can produce very different networks – we find that cycling through the order {antidiagonal-gauging, diagonal-reduction, column-reduction, rank-simplification, split-simplification} produces good results. Indeed, for quantum circuits the resulting tensor networks generally have almost no sparsity left among tensor entries. Note that for methods such as Hyper-GN which cannot handle hyperedges, we skip the diagonal-reduction. Finally, if aiming to reuse a contraction path, one needs to maintain the sparsity structure from network to network, possibly excluding any variable tensors from the simplification steps that detect sparsity. For most circuits terminated with a layer of Hadamard gates, if one only changes the sampled bit-string x, then even this is not usually necessary.

4 Results

We benchmark our contractors on six classes of tensor networks with complex geometry – random regular graphs, random planar graphs, square lattices, weighted model counting formulae, QAOA energy computation, and random quantum circuits. In each set of results we set a time limit or maximum number of shots for each of the optimizers to run for, and then target either the contraction width, W, or contraction cost, C. As a reminder, W is essentially the space requirement of the contraction (log₂ of the size of the largest intermediate tensor) whilst C is the time requirement (the total number of scalar operations). The Optimal algorithm is able to search for either the minimum W or C, whilst Hyper-GN, Hyper-Greedy and Hyper-Par can target either through the guided Bayesian optimization. Finally, there is no way to specifically bias QuickBB and FlowCutter towards either W or C, so in each case the optimizer runs identically. If an optimizer can run in parallel, we allow it 4 cores to do so. An open source implementation of the optimizers, compatible with opt_einsum [49] and quimb [63], is available at [64].

To give some context to the relative scale of W and C: a complex, single precision tensor of size 2²⁷ requires 1GB of memory, and a consumer grade GPU can usually achieve a few teraFLOPs in terms of performance, corresponding to C ∼ 10¹⁵ over an hour. In the final results section we benchmark various contractions and indeed find this real-world performance. At the extreme end of the scale, the most powerful supercomputer in the world currently, Summit, has a few petabytes of memory, corresponding very roughly to W ∼ 47, though this is obviously distributed among nodes, and utilizing it for a single contraction would need, among many other technical considerations, significant inter-node communication. Summit has also achieved sustained performance of a few hundred petaFLOPs [65], which over an hour might correspond to C ∼ 10²⁰, but it is unlikely to do so if distributed contraction is required (i.e., for high W).
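These back-of-envelope conversions from W and C to real resources are simple to reproduce (the GPU rate below is an assumed round number, matching the 'few teraFLOPs' quoted above):

```python
W, C = 27, 1e15
itemsize = 8                          # bytes per single precision complex scalar
mem_gib = 2**W * itemsize / 2**30     # -> 1.0 GiB for W = 27
gpu_rate = 4e12                       # assumed: a few teraFLOPs, consumer GPU
hours = 8 * C / gpu_rate / 3600       # factor 8: complex FLOPs per scalar op (Sec. 2)
print(mem_gib, hours)                 # 1.0 GiB, ~0.6 hours
```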
4.1 Random Regular Graphs

We start by benchmarking tensor networks with geometries defined by random regular graphs, as studied in [43, 44]. These graphs arise in the study of many computational problems, such as satisfiability, and problems defined on graphs with nonuniform degree distribution can also often be reduced to equivalent problems on low-degree regular graphs [66]. For such a k-regular graph, every vertex is connected randomly to k others, with total number of vertices |V|. We treat each of the edges as tensor indices of size 2 and associate a rank-k tensor with each vertex. None of the simplifications of Sec. 3.7 are applicable. An example of such a network is shown in Fig. 1(b). For each size |V|, degree k and target ∈ {W, C}, we generate 100 sample regular graphs uniformly [67], and allow 5 minutes of search time per instance for each optimizer. The reference Optimal path finder we instead run for 24 hours, and we only show data points where all but one or two of the instances successfully terminated, so as not to bias those points towards easy instances.

The results are shown in Figs. 4(a)-(f). First of all we note that for small sizes all optimizers return similar performance, indeed close to Optimal. As |V| increases, however, the same ranking emerges in each combination of k and {W, C}: (from worst to best) QuickBB, Hyper-Greedy, FlowCutter, Hyper-GN, then finally Hyper-Par. We attribute the improvement of Hyper-GN over previous studies [44] to the use of guided stochastic sampling.
Figure 4: Mean contraction width (top row) and cost (bottom row) of random regular graphs of degree k = 3, 4, 5 (left, centre and right columns respectively) as a function of the number of vertices (tensors) in the network, |V|, for various contraction path optimizers each allowed 5 minutes to search. The shaded regions show standard deviations across 100 random graph instances. An example graph with k = 5 is shown in Fig. 1(b).

There are some interesting performance comparisons when it comes to targeting contraction width W or cost C. For example, while Hyper-Greedy beats QuickBB for width across the board, the results are much closer for contraction cost. On the other hand, the advantage of Hyper-Par over Hyper-GN and FlowCutter is much more pronounced when considering cost rather than width.

4.2 Random Planar Graphs

A contrasting class of geometries to consider is that of planar graphs, encountered for example in the study of physical systems defined on a 2D lattice or in evaluating knot invariants [68]. To investigate these in a generic fashion, we generate random planar graphs with |V| ∈ [20, 200] using the Boltzmann sampler described in [69]. An instance of the generated graphs is shown in Fig. 1(c). Whilst these are much more random than square lattices, for example, we find nonetheless that the results are broadly representative. Similarly to the random regular graphs, for each vertex with k edges we associate a rank-k tensor with bond dimensions of size 2, and allow each optimizer 5 minutes per instance to explore contraction paths. In [44] it was shown that the optimal contraction path with respect to W for planar graphs can be found in polynomial time. Also, planar tensor networks can be contracted in subexponential time O(2^√|V|) as a consequence of the planar separator theorem [22, 43, 70]. In Fig. 5(a) and (b) we plot the mean contraction width, W, and cost, C, as a function of the 'side length' of the graph, √|V|. Alongside a sub-exponential scaling for all the optimizers, we see a very different ranking of optimizer performance as compared to random regular graphs, with Hyper-Greedy performing best. For small sizes, again the performance of all optimizers is close to Optimal, and in fact the difference between methods remains relatively small throughout the size range.
Figure 5: Mean contraction width W (top) and cost C (bottom) for randomly generated planar graphs as a function of number of vertices |V|, for various path optimizers each allowed 5 minutes to search. The shaded regions show standard deviations across random graph instances. The 35,162 graph instances studied are approximately uniformly distributed over the √|V| bins shown, and an example instance is shown in Fig. 1(c).

Figure 6: Contraction width W (top row) and contraction cost C (bottom row) for square lattice geometry - either with vertices representing the underlying lattice (left column) or hyper-edges (right column). Insets illustrate the four possible TNs with L = 5. Note that the hyper-edge case can be exactly transformed into the normal case, but the reverse is not generally true.

4.3 Regular Square Lattice

To emphasize that the utility of these optimizers is not restricted to randomly structured graphs, we now compare the best of them with a naive Time Evolving Block Decimation (TEBD) style approach on a square 2D lattice. While such an approach – contracting a Matrix Product State boundary from one side to the other – would usually be combined with canonicalization and compression, doing it exactly yields a natural comparison point for a simple, manually chosen contraction path. In Fig. 6 we show W and C for such an approach (labelled TEBD-Exact), for the best of Hyper-Greedy or Hyper-Par, as well as for Optimal, for 2D square lattice TNs with bond dimension 2. As well as showing open and periodic boundary conditions (OBC and PBC), we show the case where the lattice geometry is defined on hyper-edges rather than on the vertices. This is a common scenario when evaluating partition functions of classical spin models. While the hyper-edges can be converted to COPY tensors to yield the standard TN geometry, this makes the TN harder to contract.

For OBC, we find W is significantly reduced from the TEBD-Exact scaling¹ of 2L (Fig. 6(a)), as is C (Fig. 6(b)). Contracting the hyper-edge form of the TN also yields an advantage for both. For PBC, the TEBD-Exact path yields the same, optimal contraction width (Fig. 6(c)) but carries a significantly worse scaling of contraction cost (Fig. 6(d)). Contracting the hyper-edge form of the TN again yields an advantage for both. In all cases we see either Hyper-Greedy or Hyper-Par very closely track the Optimal width and cost at accessible sizes.

¹ With canonicalization but no compression the scaling would be W ∼ L.

4.4 Exact Weighted Model Counting

We now move on to exact weighted model counting, an important #P-complete task, central to problems of inference in graphical models, evaluating partition functions in statistical physics, calculating network reliabilities, and many others [71–73]. The problem can be cast as computing the following sum:

    x = \sum_{\{v\}} \prod_v^{\#\mathrm{vars}} w_v \prod_i^{\#\mathrm{clauses}} C_{\bar{v}_i},    (14)

where {v} runs over all combinatorial assignments of every binary variable, w_v is a vector with the 'positive' and 'negative' weight of variable v, and C_{v̄_i} is the i-th clause containing variables v̄_i, given by the tensorization of the OR function. Such an expression can directly be thought of as a hyper tensor network, with tensors (nodes) w_v, C_{v̄_i} and tensor indices (hyper-edges) v. Key here is that we directly handle constructing contraction trees for such hyper-graphs, and thus do not need to map Eq. (14) into a 'normal' tensor network form.
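As a toy illustration of Eq. (14) (our example, not a competition instance), the weighted count of the formula (v₁ ∨ v₂) ∧ (¬v₂ ∨ v₃) with uniform weights can be written directly as an einsum over shared variable hyper-indices:

```python
import numpy as np
import opt_einsum as oe

OR = np.array([[0., 1.], [1., 1.]])   # clause tensor: OR of two positive literals
w = np.array([0.5, 0.5])              # 'negative' and 'positive' weight of a variable

# a negated literal just flips the corresponding clause axis
NOT_OR = OR[::-1, :]                  # tensorization of (NOT v2) OR v3

# variables are shared (hyper)indices: a = v1, b = v2, c = v3; note that 'b'
# appears three times - exactly the hyper-edge structure handled natively here
count = oe.contract('a,b,c,ab,bc->', w, w, w, OR, NOT_OR)
print(count)                          # 0.5 = 4 satisfying assignments x 0.125
```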
To test our contraction optimizers, we assess all 100 private weighted model counting (track-2) instances from the Model Counting 2020 competition [74]. After constructing the tensor network representation of x, we run the simplification procedure, actively renormalizing the tensors since for some instances x > 10²⁰⁰⁰. We find the simplifications to be very powerful here – of the 100 instances, 63 simplify all the way down to a single scalar, whilst the remaining 37 instances require actual contraction of a much reduced tensor network. We invoke our hyper-optimizer on these, allowing 64 repeats and access to both the greedy and KaHyPar drivers. Of these, 1 instance was exceptionally difficult (W ≳ 100), whilst the remaining instances (shown in Fig. 7) all had contraction paths with W < 20 and C < 10⁸, making them easily contractable. Overall, the 99 solved instances compare favourably with the best score of 69 achieved in the competition [74]. For those 69 instances we confirmed all results against the ADDMC solver [75].

Figure 7: Example hyper tensor networks, post-simplification, representing weighted model counting formulae from the MCC2020 model counting competition.

4.5 QAOA Energy Evaluation

The Quantum Approximate Optimization Algorithm (QAOA) [76] is a promising approach for optimization on near-term quantum devices. It involves optimizing the energy of an ansatz circuit, followed by the sampling of potential solution bitstrings. Here we explore the first part, a task that has been studied before [77] and is identical to computing the energy of a unitary ansatz for a many-body model. The p-layer ansatz circuit for target graph G with constraint weights w_{j,k} for j, k ∈ E(G) is given by:

    |\bar\gamma, \bar\beta\rangle = U_B(\beta_p) U_C(\gamma_p) \cdots U_B(\beta_1) U_C(\gamma_1) |+\rangle,    (15)

where

    U_C(\gamma) = \prod_{j,k \in E(G)} e^{-i \gamma w_{jk} Z_j Z_k},    (16)

    U_B(\beta) = \prod_{j \in G} e^{-i \beta X_j},    (17)

for the two length-p vectors of parameters γ̄ and β̄. The energy of this state is given by a sum of local terms:

    E = \sum_{j,k \in E(G)} w_{j,k} \langle \bar\gamma, \bar\beta | Z_j Z_k | \bar\gamma, \bar\beta \rangle,    (18)

where for each term any unitaries outside the 'reverse lightcone' of j, k can be cancelled.

We study MAX-CUT problems on random 3-regular graphs of size N, for which w_{j,k} = 1, equivalent to an antiferromagnetic Ising model. Note that whilst the problem is defined on such a graph, G, the actual tensor networks for each energy term have very different geometries compared to Sec. 4.1, since they arise from the repeated application of 3p layers of gates followed by unitary cancellation. Indeed, in the limit of large N, they are not random at all [77]. First we form the 3N/2 energy term tensor networks, and simplify each using all five methods from Sec. 3.7.
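For reference, the open source implementation [64] plugs into quimb [63], where the per-term lightcone construction and contraction of Eq. (18) can be reproduced in a few lines. This is a sketch assuming a recent quimb version; the parameter values are arbitrary and un-optimized:

```python
import networkx as nx
import quimb as qu
import quimb.tensor as qtn

G = nx.random_regular_graph(3, 12, seed=0)
terms = {(i, j): 1.0 for i, j in G.edges}        # w_jk = 1: antiferromagnetic Ising
p = 2
gammas, betas = [0.1, 0.2], [0.3, 0.4]

circ = qtn.circ_qaoa(terms, p, gammas, betas)    # the ansatz of Eqs. (15)-(17)
ZZ = qu.pauli('Z') & qu.pauli('Z')
# one lightcone-reduced tensor network contraction per edge term, Eq. (18)
energy = sum(circ.local_expectation(ZZ, edge).real for edge in terms)
print(energy)
```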
Figure 8: Maximum contraction width (a) and total contraction cost (b) for computing the energy of a p-layer QAOA circuit, averaged across 10 instances of random 3-regular graphs of size N. The shaded region shows the standard deviation across these instances.

We invoke our hyper-optimizer on these, allowing 64 repeats and access to both the greedy and KaHyPar drivers. In Fig. 8 we report the maximum contraction width, W_max, and total contraction cost, C_total, across terms, averaged over 10 instances of the random regular graphs, as a function of N and p.

We note that up to and including p = 4, throughout the range of N, W_max remains less than ∼28 and C_total less than ∼10¹⁰, putting such simulations easily within the range of single workstations. As an example, on a CPU with 4 cores, performing all of the contractions for N = 54 and p = 4 takes on the order of seconds. Stepping up to p = 5 increases the difficulty significantly, especially in the N = 40–120 range. The peak here is due to cycles of length ≤ p appearing in G for small enough N, which dramatically increase the complexity of each tensor network.

4.6 Random Quantum Circuits

The final class of tensor networks we study is those corresponding to random quantum circuits executed on a range of quantum chip geometries. In particular, we look at sizes and depths previously explored in the context of so-called 'quantum supremacy' [37, 38, 45, 78]. Quantum circuits can be naturally cast as tensor networks and then simulated via contraction, as shown in [11]. In recent years, random quantum circuits have been used both as a test-bed for tensor network contraction schemes and for setting the benchmark for demonstrating 'quantum supremacy' [41, 62, 79–82]. Practically speaking, such simulations can also allow the fidelity of real quantum chips to be benchmarked and calibrated [38, 45, 81].

The simplest quantity to compute here is the 'transition amplitude' from one computational basis state to another through a unitary describing the quantum circuit. Assuming we start with the N qubit all-zero bit-string |0^⊗N⟩, the transition amplitude for output bit-string x can be written:

    c_x = \langle x | U_d U_{d-1} \cdots U_2 U_1 | 0^{\otimes N} \rangle,    (19)

where we have assumed some notion of circuit depth, d, such that each unitary U_i contains a 'layer' of entangling gates, the exact composition of which depends on the specific circuit definition. The process for computing c_x takes place in several steps: (a) construct the tensor network corresponding to the circuit; (b) perform some purely structure dependent simplifications of the tensor network; (c) find the contraction path for this simplified network; and (d) actually perform the contraction using the found path. Steps (a) and (b) are very cheap, and moreover we can re-use the path found in step (c) to contract any tensor network with matching structure but different tensor entries, such as varying x.
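Steps (a)–(d) are bundled in, e.g., quimb; the toy sketch below (ours) computes one transition amplitude of Eq. (19) for an arbitrary small circuit:

```python
import quimb.tensor as qtn

circ = qtn.Circuit(N=4)
for d in range(3):                      # a toy circuit of depth 3
    for q in range(4):
        circ.apply_gate('H', q)
    for q in range(d % 2, 3, 2):        # staggered layer of entangling gates
        circ.apply_gate('CZ', q, q + 1)

# builds the TN of Eq. (19), simplifies it, finds a path, and contracts;
# re-running with a different bit-string x re-uses the same path structure
print(circ.amplitude('0110'))
```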
4.6.1 Gate Decompositions

We find that pre-processing the tensor networks with the methods from Sec. 3.7 before attempting to find contraction paths is an important step, particularly for optimizers such as QuickBB and Hyper-Greedy that scale badly with the number of edges and vertices. A tensor network for c_x initially consists of: rank-1 tensors describing each of the input and output qubit states; rank-2 tensors describing single qubit gates; and rank-4 tensors describing two-qubit gates. The first processing step is deciding how to treat the two-qubit gates.
A tensor describing such a gate can be written g^{o_a o_b}_{i_a i_b}, such that i_a (i_b) is the input index and o_a (o_b) the output index of qubit a (b). Whilst g^{o_a o_b}_{i_a i_b} is unitary with respect to i_a i_b → o_a o_b, a low rank decomposition can potentially be found by grouping the indices {i_a, o_a}, {i_b, o_b} or {i_a, o_b}, {i_b, o_a} and performing an SVD on the resulting matrix. In the first case this yields two rank-3 tensors:

    g^{o_a o_b}_{i_a i_b} = \sum_{\xi=1}^{\chi} l_{i_a o_a \xi}\, r_{i_b o_b \xi},    (20)

where we have dropped any zero singular vectors and absorbed the remaining singular values into either of the left and right tensors l and r, each of which is now 'local' to either qubit a or b, connected by a bond of size χ. The second case yields the same but with an effective SWAP (which can be implemented purely as a relabelling of tensor indices) of the qubit states first:

    g^{o_a o_b}_{i_a i_b} = \sum_{\xi=1}^{\chi} \sum_{i'_a, i'_b = 1}^{2} l_{i'_a o_a \xi}\, r_{i'_b o_b \xi}\, \delta^{i'_a}_{i_b}\, \delta^{i'_b}_{i_a}.    (21)

The options for a gate are thus to: (a) perform no decomposition; (b) perform a spatial decomposition – Eq. (20); or (c) perform a swapped decomposition – Eq. (21). By default we only perform a decomposition if the bond dimension, χ, yielded is less than 4; all controlled gates fall into this category for a spatial decomposition, whereas the ISWAP gate, for instance, has χ = 2 for the swapped decomposition. Such exact decompositions would also be performed automatically using the split-simplification scheme of Sec. 3.7. Another option is to discard small but non-zero singular values, which will result in a drop in the fidelity of c_x [45, 83] – unless explicitly noted, we do not perform this form of 'compression'.
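A numpy sketch of the spatial decomposition, Eq. (20), follows (our code; here the singular values are split symmetrically between l and r rather than absorbed into one side):

```python
import numpy as np

def spatial_split(g, atol=1e-12):
    """Split a two-qubit gate g[o_a, o_b, i_a, i_b] as in Eq. (20)."""
    # group {i_a, o_a} and {i_b, o_b}, then SVD the resulting 4x4 matrix
    m = np.transpose(g, (2, 0, 3, 1)).reshape(4, 4)
    u, s, vh = np.linalg.svd(m)
    chi = int(np.sum(s > atol))                              # drop zero singular values
    l = (u[:, :chi] * np.sqrt(s[:chi])).reshape(2, 2, chi)   # l[i_a, o_a, xi]
    r = (vh[:chi].T * np.sqrt(s[:chi])).reshape(2, 2, chi)   # r[i_b, o_b, xi]
    return l, r

# CZ = diag(1, 1, 1, -1), index order (o_a, o_b, i_a, i_b): splits with chi = 2
CZ = np.diag([1., 1., 1., -1.]).reshape(2, 2, 2, 2)
l, r = spatial_split(CZ)
print(l.shape, r.shape)   # (2, 2, 2) (2, 2, 2)
```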
4.6.2 Random Quantum Circuit Geometries

We benchmark the contraction path optimizers against different random quantum circuits executing on three different quantum chip geometries: (i) a rectangular 7×7 lattice of 49 qubits; (ii) a 70 qubit 'Bristlecone' lattice; and (iii) a 53-qubit 'Sycamore' lattice.

For the first two we use the updated, harder versions of the random circuit definitions first suggested in [38], which are available at [84]. We adopt the notation (1+d+1) for depth d to emphasize that the technically first and last layers of single qubit gates (which add no real complexity) are not counted. In both cases the entangling gate used is the controlled-Z, which has a χ = 2 spatial decomposition.

For the Sycamore architecture, we use the same circuits that were defined and also actually executed in the recent work [45]. Here each two-qubit gate is a separately tuned 'fermionic simulation' gate which has no low-rank decomposition if treated exactly. On the other hand, if a swapped decomposition is performed, the two smallest singular values are quite small, and on average discarding them leads to a fidelity drop of a fraction of a percentage point – for a single gate. If this approximation is used for every single entangling gate in the circuit, however, the error is compounded. For our main results, labelled 'Sycamore-53', we thus perform no gate decomposition and consider perfect fidelity transition amplitude calculations only. Results where the χ = 2 swapped decomposition has been used we label 'Sycamore-53*'. We also note that the definition of circuit 'cycles', m, used in [45] is about twice as hard as the rectangular and Bristlecone circuit definition of depth, d, since per layer almost all qubits, rather than approximately half, are acted on with an entangling gate.

In the following table we report the number of network vertices and edges for representative depths of each circuit geometry after simplifications. The first two columns, |V|, |E|, are for the case where hyperedge introduction is avoided; the last two columns, |Ṽ|, |Ẽ|, are for the case where the full simplification scheme introduced above has been applied. Using the ratio |Ṽ|/|Ẽ| as a heuristic figure of merit, we see that the networks resulting from the Sycamore circuit model are considerably denser. One may thus anticipate that Sycamore benchmarks will be more challenging for our methods. This expectation will be borne out in Sec. 4.6.4.
    Circuit                     |V|     |E|     |Ṽ|     |Ẽ|
    Rectangular-7×7 (1+40+1)    734     1101    790     425
    Bristlecone-70 (1+40+1)     1036    1554    1086    574
    Sycamore-53 (m=20)          381     754     381     754
    Sycamore-53* (m=20)         754     1131    1125    748

We note that if the swap decomposition is not applied to the Sycamore circuits, then no diagonal-reductions can take place, and the resulting simplified tensor network is the same in both cases.

4.6.3 2D Circuit Specific Optimizers - qFlex/PEPs

Before presenting results for contraction width and cost for these random circuits, we introduce one final form of contraction path optimizer, which has been successfully applied to circuits acting on 2D lattices [81, 82]. Here one performs the spatial decomposition of the entangling gates, regardless of rank, such that every tensor is uniquely localized above a single qubit register. One can then contract every tensor in each of these spatial slices, resulting in a planar tensor network representing c_x with a single tensor per site. Although the two works, [81] and [82], have significant differences in terms of details (and goals beyond the computation of a single perfect fidelity amplitude), the core object treated by each is ultimately this planar tensor network, which is small enough that we can report optimal contraction widths and costs for it. We call this optimizer – which flattens the circuit tensor network into the plane before finding the optimal W or C from that point onwards – qFlex/PEPs. With regards to a swapped decomposition, in order to maintain the spatial locality of the tensors, this method can only benefit in the first and last layers of gates [45].

4.6.4 Results
In Fig. 9(a)-(f) we report the mean contraction width, W, and cost, C, for each geometry and optimizer as a function of circuit depth, d, or cycles, m. For these large tensor networks we allow each optimizer one hour to search for a contraction path. While this is not an insignificant amount of time, we note that many optimizers converge to their best contraction paths much quicker, and moreover that contraction paths can be re-used if only changing tensor values from run to run. We show the variance in W and C across 10 instances, despite the fact that the tensor network structure is the same, since all the optimizers aside from qFlex/PEPs are naturally stochastic.

We first note that across the board, the Hyper-Par optimizer again performs best, with little variance from instance to instance. Performance of the remaining optimizers is more difficult to rank. The tensor network simplification scheme employed here results in significant improvement over previous results even when using QuickBB to perform the actual path optimization, particularly when |E| or |Ẽ| is moderate. As the tensor networks get larger, QuickBB is consistently outperformed by the other line-graph based optimizer, FlowCutter.

For the Rectangular-7×7 and Bristlecone-70 circuits, which both use a CZ entangling gate, the diagonal reduction of tensors greatly simplifies the tensor networks. The methods that make use of this, aside from Hyper-Greedy, perform best here, with similar values of C, though interestingly Hyper-Par is able to target a lower contraction width. Hyper-GN and qFlex/PEPs do not use the diagonal simplification, and here show similar performance.

In the case of Sycamore-53, the entangling fSim [85] gates are close to, but not exactly, ISWAP gates. As a result there are no diagonal reductions to be made, and the simplified tensor network has no hyper-edges. Whilst FlowCutter, Hyper-GN and Hyper-Par find similar contraction widths, Hyper-Par achieves a much lower contraction cost. This is likely due to its ability to search imbalanced partition contraction trees, such as 'Schrödinger style' (full wavefunction) evolution. Note that for the entangling gates an approximate swapped χ = 2 decomposition can be made, resulting in a drop in fidelity based on how many of the m layers of gates this is applied to. The qFlex/PEPs method results in [45] make use of this in the first and last layer of gates, for a drop in total fidelity of ∼5% that reduces W by ∼4 and C by ∼2⁴. We only show the exact results here so as to compare all methods on exactly the same footing. If the swapped decomposition is used for all layers (Sycamore-53*), then at m = 20 the corresponding drop in total fidelity is likely to be ∼50%. For the best performing optimizers in Fig. 9(c) and (f) we find little gain in doing so. We also emphasize that for the highest values of m, the estimates for classical computation cost in [45] are not based on the qFlex [81] simulator