Qulacs: a fast and versatile quantum circuit simulator for research purpose
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Qulacs: a fast and versatile quantum circuit simulator for research purpose Yasunari Suzuki1,2 , Yoshiaki Kawase3 , Yuya Masumura4 , Yuria Hiraga5 , Masahiro Nakadai6 , Jiabao Chen7 , Ken M. Nakanishi7,8 , Kosuke Mitarai3,7,9 , Ryosuke Imai7 , Shiro Tamiya7,10 , Takahiro Yamamoto7 , Tennin Yan7 , Toru Kawakubo7 , Yuya O. Nakagawa7 , Yohei Ibe7 , Youyuan Zhang7,8 , Hirotsugu Yamashita11 , Hikaru Yoshimura11 , Akihiro Hayashi12 , and Keisuke Fujii2,3,9,13 1 NTT Computer and Data Science Laboratories, NTT Corporation, Musashino 180-8585, Japan 2 JST PRESTO, Kawaguchi, Saitama 332-0012, Japan 3 Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan arXiv:2011.13524v4 [quant-ph] 5 Oct 2021 4 Graduate School of Information Science and Technology, Osaka University, 1-1 Yamadaoka, Suita, Osaka 565-0871, Japan 5 Graduate School of Information and Science, Nara Institute of Science and Technology, Takayama, Ikoma, Nara 630-0192, Japan 6 Graduate School of Science, Kyoto University, Yoshida-Ushinomiya, Sakyo, Kyoto 606-8302, Japan 7 QunaSys Inc., Aqua Hakusan Building 9F, 1-13-7 Hakusan, Bunkyo, Tokyo 113-0001, Japan 8 Graduate School of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan 9 Center for Quantum Information and Quantum Biology, Institute for Open and Transdisciplinary Research Initiatives, Osaka University, Japan 10 Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan 11 Individual researcher 12 School of Computer Science, Georgia Institute of Technology, Atlanta, GA, 30332, USA 13 Center for Emergent Matter Science, RIKEN, Wako Saitama 351-0198, Japan To explore the possibilities of a near- 1 Introduction term intermediate-scale quantum algo- rithm and long-term fault-tolerant quan- Many theoretical groups have explored quantum tum computing, a fast and versatile quan- computing applications due to the rapid improve- tum circuit simulator is needed. Here, ments in quantum technologies and huge efforts we introduce Qulacs, a fast simulator for of experimental groups to develop quantum com- quantum circuits intended for research puters [1, 2]. Although classical simulation of purpose. We show the main concepts of quantum circuits is a vital tool to develop quan- Qulacs, explain how to use its features via tum computers, the simulation time increases ex- examples, describe numerical techniques ponentially with the number of qubits. to speed-up simulation, and demonstrate its performance with numerical bench- The primary reasons for using a classical sim- marks. ulator are the following: (i) Quantum devices suffer from much higher error rates. A classical simulator is necessary to determine the ideal re- sults for comparison. (ii) In certain cases, not all necessary values can be directly measured from experiments such as the full quantum state vec- Yasunari Suzuki: yasunari.suzuki.gz@hco.ntt.co.jp tor and marginal probabilities of measurements. Jiabao Chen: qulacs@qunasys.com (iii) To analyze the performance of quantum error Ryosuke Imai: qulacs@qunasys.com correction and noise characterization for an arbi- Tennin Yan: qulacs@qunasys.com trary noise model, noisy quantum circuits must Toru Kawakubo: qulacs@qunasys.com be simulated. Hence, there are broad demands Yuya O. Nakagawa: qulacs@qunasys.com on classical simulators for research on quantum Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 1
computing. Although full simulations of quan- tremendous. Qulacs offers optimized and paral- tum circuits are not efficient in the sense of com- lelized codes to update a quantum state and eval- putational complexity theory, it is important to uate probability distributions, observables, and implement a fast classical simulator as much as so on. possible. Here, we introduce a quantum circuit simula- Small overhead for simulating small quan- tor, which is called Qulacs [3]. The main feature tum circuits: Often small but noisy quantum of Qulacs is that it meets many popular demands circuits (up to 10 qubits) are simulated many in quantum computing research such as evaluat- times rather than simulating a single-shot large ing near-term applications, quantum error correc- ideal quantum circuit. In this case, the over- tion, and quantum benchmark methods. In ad- head due to pre- and post-processing for calling dition, Qulacs is available on many popular en- core API functions is not negligible relative to vironments, and is one of the fastest quantum the overall simulation time. Qulacs is designed circuit simulators. Herein we demonstrate the to minimize such overhead by focusing on core structure of our library, optimization methods, features and avoiding complicated functionalities. and numerical benchmarks for simulating typical quantum gates and circuits. Available on different environments: This paper consists of the following topics. Sec- While numerical analysis is typically performed tion. 2 overviews the main features and struc- on workstations or high-performance computers, ture of Qulacs as well as compares it with exist- most software development occurs on laptops ing libraries. Section. 3 introduces the expected or desktop personal computers. For versatility, use of Qulacs. Section. 4 discusses how to write Qulacs provides interfaces for both Python and codes with Qulacs using examples codes. Sec- C++ languages, while most of the codes of tion. 5 introduces further optimization techniques Qulacs are written in C and C++ languages. to speed-up the simulation of quantum circuits. In addition, Qulacs supports several compilers Section. 6 shows the numerical benchmarks for such as GNU Compiler Collection (GCC) and Qulacs. Section 7 compares the performance of Microsoft Visual Studio C++ (MSVC). Qulacs Qulacs with that of existing simulators. Finally, is tested on several operating systems such as Section. 8 is devoted to summary and discussion. Linux, Windows, and Mac OS. 2 Overview Many useful utilities for research: To meet various demands in quantum computing research, 2.1 Features of Qulacs a simulator should support general quantum op- erations. Qulacs can create not only common uni- Qulacs is designed to accelerate research on quan- tary gates and projection measurements, but also tum computing. Thus, Qulacs prioritizes the fol- general operations such as completely positive lowing: instruments and adaptive quantum gates condi- tioned on measurement results. Fast simulation of large quantum circuits: A full simulation of quantum circuits requires a 2.2 Structure of Qulacs time that grows exponentially with the number of qubits. This problem can be mitigated by op- Qulacs consists of three shared libraries. The first timizing and parallelizing the simulation codes one is a core library written in C language for op- for single- or multi-core CPUs and SIMD (Sin- timized memory management, update functions gle Instruction Multiple Data) units and even of quantum states, and property evaluation of GPUs (Graphics Processing Units), and optimiz- quantum states. The second library is built on ing quantum circuits before a simulation. These top of the first library, and is written in C++ techniques can enable a few orders of magnitude language. This allows users to easily create and performance improvement compared to naive im- control quantum gates and circuits in an object- plementations. Although this is a constant fac- oriented way, thereby improving the programma- tor speed-up, the effect on practical research is bility. Also, at runtime, it adaptively chooses Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 2
the best performing implementation of quantum tree-width is small, such as shallow quantum cir- gates depending on the number of qubits. The cuits with a large number of qubits, a tree-width third library explores variational methods with becomes equal to the number of qubits. More- quantum circuits. This library wraps quantum over, the Schrödinger’s method is much faster gates and circuits so that quantum circuits can than Feynman’s path integral when the number be treated as variational objects. of qubits is limited and the memory size is suf- We expect that users who work on variational ficient for storing a full state vector. Although quantum algorithms use Qulacs via the third li- we can also consider Schrödinger-Feynman ap- brary, and the other users access Qulacs via the proach [1, 10] as a hybrid method, this method is second library. Additionally, Qulacs can be used typically faster than Schrödinger’s method simu- as a Python library, while there is some over- lating large quantum circuits with a small depth. head in interfacing between Python and C++. If quantum circuits have specific features, then Figure 1 shows the overview of the structure of the simulation time can be reduced. For exam- Qulacs. The components of Qulacs and their us- ple, if quantum circuits are dominated by Clifford ages are explained in Sec. 4 with example codes. gates, non-Clifford gates can be treated as a per- Qulacs uses Eigen [4] to treat a sparse matrix turbation to the simulation. This treatment can and to manage matrix representations of quan- reduce the time for simulation [11, 12]. This con- tum gates and pybind11 [5] for exporting func- dition is satisfied in several situations, e.g., quan- tions and classes from C++ to python. The fea- tum circuits of stabilizer measurements where tures introduced in this paper is fully tested with non-Clifford errors happens with a very small GoogleTest [6] and pytest [7]. probability or fault-tolerant quantum computing with a limited number of T and TOFFOLI-gates. 2.3 Simulation methods However, most of the quantum circuits of typical quantum algorithms do not satisfy these condi- There are several approaches to simulate general tions. quantum circuits with classical computers. The simplest approach is to update quantum states 2.4 Relation to the existing libraries represented by state vectors or density matrices sequentially by applying quantum gates as gen- To date, many groups have published a vari- eral maps. This method is called Schrödinger’s ety of quantum circuit simulators. Cirq [13], method [1]. Qulacs implements this method for Qiskit [14], PyQuil [15], and PennyLane [16] are simulating quantum circuits due to its fast and published by Google, IBM, Rigetti computing, versatile simulations of quantum circuits com- and Xanadu, respectively. Since these groups pared with the other methods introduced in this are developers of hardware for quantum com- section. A detail of implementation with this puting, these libraries are designed for submit- method is described in Sec. 4. ting quantum tasks as a job without thinking Another method is Feynman’s approach [1, 8, about detailed experimental procedures or se- 9], which computes the sum of all Feynman’s tups. Q# [17] by Microsoft focuses on provid- path contributions. This technique greatly de- ing higher levels of abstraction, such as pack- creases the memory size requirement, allowing aged instructions for integer arithmetic or tool- for the single amplitude of the final quantum chains for compilation. To simulate quantum state to be quickly known. While the number circuits with a low depth and a large num- of Feynman’s paths increases exponentially ac- ber of qubits, tensor-network-based simulators cording to not only the number of qubits but also are used [18–20]. However, these are not good the number of quantum gates, this requirement for higher depths or a fewer number of qubits. can be relaxed by using tensor-network-based ap- For quantum circuit simulations on state-of-the- proach. With tensor-network-based simulator, art supercomputers, several works have reported we can make the simulation time increases ex- on the performance and optimizations [21–24]. ponentially according to a tree-width of the net- Qiskit-Aer [14], Intel-QS [25, 26], QX Simula- work [9], which is a characteristic value of graph tor [27, 28], ProjectQ [29], QuEST [30], qsim [31], representing how a tensor network is close to a Yao [32], QCGPU [33], and Qibo [34] were all de- tree graph. However, except special cases where veloped with motivations similar to ours. They Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 3
Qulacs Front-end C++ classes Optimized low-level functions Quantum Circuit Memory management Allocate / deallocate array ( ) Concatenate / drop qubits Copy / load state ⋯ Computing devices depolarize Quantum state evaluation C++ user codes Inner product Circuit optimization for fast simulation CPUs Expectation value Parameter controls and differentiation for variational algorithms Marginal probability - SIMD Sampling with Z-basis - Multi-threading Quantum Gates ⋯ ↦ † Python interfaces ↦ † Quantum state manipulation GPUs Quantum State Basic gate Dense matrix Diagonal matrix 0 00 01 ⋯ Sparse matrix Permutation matrix = 1 = 10 ⋱ ⋮ ⋮ Pauli matrix ⋯ Observable General quantum maps Python user codes ෨ = CPTP-map CP-instrument ∈ Adaptive op ⋯ Figure 1: The overview of the structure of Qulacs. focused on optimizing quantum circuit simula- gates as variational parameters of a cost func- tions for quantum computing researchers. By tion and optimize them by repeatedly simulating contrast, Qulacs is one of the fastest simulators, quantum circuits with a relatively small number which minimizes overhead even for small simula- of qubits. The other application is a quantum tions, supports general quantum operations such simulator [36]. In this approach, large quantum as completely-positive and trace-preserving maps systems are simulated to explore physics in many- and completely-positive instruments, and runs on body quantum systems. In a typical scenario, we various environments. In addition, Qulacs pro- perform a single simulation with quantum circuits vides many useful utilities frequently used in re- as large as possible. Thus, the size of memory or search such as calculations of transition ampli- allowed time limits the size of quantum circuits. tudes and reversible Boolean functions. 3 Expected usages of Qulacs To date, Qulacs has been used in a few tens Quantum circuit simulators should be designed of research papers. For example, Qulacs is for specific targets. Qulacs is designed to help used by papers related to noisy intermediate- researchers of quantum computing. In particular, scale quantum (NISQ) applications [37, 37–42] we expect the following usages: and fault-tolerant quantum computing [43]. Al- though Qulacs does not support gate decompo- 3.1 Exploration of near-term applications and sition, which is supported by high-layer libraries, error mitigation techniques Qulacs can be used as a faster backend library. For example, Qulacs can serve as a backend of To highlight typical evaluation targets, here we Cirq [13] using a library cirq-qulacs [44]. Qulacs show two popular directions of near-term appli- has also been chosen as a fast backend simulator cations. One is optimizing a target function using in several libraries and services such as Penny- variational quantum circuits such as the varia- Lane (Xanadu) [16], t|ket> (Cambridge Quan- tional quantum eigensolver (VQE) [35]. In a typ- tum Computing) [45], Orquestra (Zapata com- ical scenario, we assume rotation angles of Pauli puting) [46], and Tequila [47]. Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 4
3.2 Performance analysis of quantum error represents a density matrix. These classes have correction schemes some basic utilities as their member functions such as initialization to a certain quantum state, Another important usage of the simulator for computing marginal probabilities, and sampling quantum circuits is performance analysis of measurement results. The QuantumStateBase the quantum error correction and fault-tolerant class also contains a variable-length integer array quantum computing. To construct a quantum called classical registers, which are used to store computer large enough for Shor’s algorithm [48, measurement results. 49], a quantum simulation for quantum many- body systems [36, 50], or algorithms for linear When a quantum state is updated by quan- systems [51], quantum error correction [52] is nec- tum operations, subclasses of QuantumGateBase essary to reduce logical error rates to an arbitrar- are instantiated and applied to a quantum state. ily small value. Many types of quantum error- This class supports not only unitary operations correcting codes and schemes have been pro- and projection measurements but also a vari- posed. However, the number of qubits needed ety of operations for general quantum mapping for a specific application is highly dependent such as a completely-positive instrument and a on the performance of quantum error-correcting completely-positive trace-preserving map. schemes. Consequently, it is difficult to control To evaluate the expectation values of observ- the noise properties on real devices and the per- ables, there is a class named Observable. We formance of quantum error correction must be assume the Observable class is described as a analyzed with classical simulation in near-term linear combination of Pauli operators. Thus, the development. Unfortunately, the time to accu- Observable instance can be constructed directly rately simulate quantum error-correcting codes or from an output of an external library such as with practical noise models grows exponentially OpenFermion [54]. Additionally, a Trotterized with the number of qubits. Thus, we need a fast quantum circuit can be created from a given ob- and accurate simulator of noisy quantum circuits servable. of quantum error correction. By default, Qulacs performs a simulation by allocating and manipulating StateVector on a 3.3 Generation of a reference of experimental CPU. Also, since the use of GPUs can signifi- data cantly outperform a CPU in certain cases, Qulacs To characterize and calibrate controls of qubits, supports GPU execution. This can be done by sometimes experimental data must be compared using the StateVectorGpu. Once a state vector with the ideal one. For example, several verifi- is allocated on a GPU, all the computations such cation methods for large quantum devices [1, 53] as update and evaluation of state vectors are per- require a full simulation of large quantum sys- formed in a GPU unless the state is explicitly tems. While quantum circuits in quantum com- converted into StateVector. Qulacs does not putational supremacy regime require supercom- support allocating a quantum state on multiple puters for simulations, portable and fast quan- GPUs. One GPU can be selected by supplying tum circuit simulators remain useful for generat- an index if multiple GPUs are installed. ing small-scale experimental references. In the discussion below, we explain the case where quantum states are allocated as state vec- tors on the main RAM and processed within a 4 Implementation of Qulacs CPU. Although Qulacs can be used as a C++ library, we show examples in the Python lan- 4.1 Overview guage for simplicity in the main text. See Ap- In Qulacs, any state of a quantum system is rep- pendix. A for the example codes of the C++ lan- resented as a subclass of the QuantumStateBase guage. Here, we show typical examples and their class. Thus far, Qulacs supports two represen- basic features. Qulacs supports more operations tations of quantum states: state vector and den- than those explained here. For a detailed expla- sity matrix. The StateVector class represents nation, please see the documentation on the offi- a state vector, while the DensityMatrix class cial website [3]. Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 5
4.2 Quantum state 1 from qulacs import StateVector 2 num_qubit = 3 4.2.1 Initialization 3 state = StateVector ( num_qubit ) 4 state . set_ Haar _ran dom_s tate (0) In Qulacs, instantiating the StateVector class 5 allocates a state vector. By default, a state 6 # Get the state vector as numpy array vector is initialized to zero-state |0i⊗n . How- 7 vec = state . get_vector () 8 ever, the state can be initialized to other states 9 # Get the marginal probability such as a computational basis, a state vector of 10 # # The below example obtains prob of a given complex array, or a random pure state |021 > 11 # # where "2" is a wild card with its member functions. The instance of the 12 # # matching to both 0 and 1. StateVector class can create its copy or load 13 prob = state . g et _ m a rg i n a l_ p r o ba b i l it y ( the contents of another StateVector. Listing. 1 measured_value =[1 , 2 , 0]) shows example codes. 14 15 # Sampling results of the Z - basis 1 import numpy as np measurements 2 from qulacs import StateVector 16 samples = state . sampling ( count =100 , seed 3 =42) 4 # Allocate a state vector 17 5 num_qubit = 2 18 # Computing the squared norm 6 state = StateVector ( num_qubit ) 19 squared_norm = state . get_squared_norm () 7 20 8 # Reset to a computational basis 21 # Computing the inner product of two 9 # # (0:|00 > , 1:|01 > , 2:|10 > , 3:|11 >) quantum states 10 # # Note that the right - most digit 22 from qulacs . state import inner_product corresponds to 23 state_bra = StateVector ( num_qubit ) 11 # # the 0 - th qubit in Qulacs . 24 state_bra . set_ Haar _ran dom_s tate () 12 state . s e t _ c o mp u ta t io n al _b a si s ( index = 2) 25 state_ket = StateVector ( num_qubit ) 13 26 state_ket . set_ Haar _ran dom_s tate () 14 # Create a copy of the state vector 27 value = inner_product ( state_bra , 15 sub_state = state . copy () state_ket ) 16 17 # Load a given list , numpy array , Listing 2: An example Python program that evaluates 18 # or another StateVector the properties of quantum states. 19 state . load ( state =[0.5 , 0.5 , 0.5 , -0.5]) 20 state . load ( np . ones (4) /2) 21 state . load ( sub_state ) 4.2.3 Update 22 23 # Prepare a randomized pure quantum Several member functions of StateVector can be state 24 state . s e t _ H a ar _rand om_s tate ( seed = 42) used for quickly updating a state vector. The multiply_coef function multiplies a complex Listing 1: An example Python program that initializes quantum states. number to each element of a quantum state, while the multiply_elementwise_function multi- plies index-dependent coefficients to a state vec- tor with a function that returns a coefficient ac- 4.2.2 Analysis cording to a given index. The add_state func- Qulacs implements several functions to evaluate tion adds two state vectors. While these op- the properties of quantum states. Although the erations are not physically achievable, they are get_vector functions provide a full state vec- useful for analysis in theoretical studies such tor, evaluating the properties with built-in func- as creating a superposition of two given states tions is fast. For example, Listing. 2 shows ex- and supplying specific phases to each element ample codes to compute a marginal probability, of a state vector. A list of qubits in a state squared norm, and inner-product of two states. vector can be concatenated, permutated, or re- Note that the sampling functions create a cumu- duced with tensor_product, permutate_qubit, lative probability distribution as pre-processing or drop_qubit, respectively. Listing. 3 shows for the fast sampling, which temporally allocates code examples for multiplication kernels. an additional 2n -length array. 1 from qulacs import StateVector Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 6
2 from qulacs . state import tensor_product , tiple qubits, TOFFOLI-gate, and stabilizer pro- permutate_qubit , drop_qubit jection to +1 eigenspace. Quantum maps that 3 are not basic gates such as CPTP-map, projec- 4 state = StateVector (2) 5 state . s e t _ H a ar _rand om_s tate () tion measurements, and adaptive operations, are 6 represented using basic gates. Here, we show sev- 7 # Normalize the state vector eral popular types of operations which are imple- 8 squared_norm = state . get_squared_norm () mented as basic operations. 9 state . normalize ( squared_norm ) 10 11 # Multiply a complex number to each 4.3.2 Basic gate element 12 state . multiply_coef (0.5+0.1 j ) A basic gate is an operation that can be repre- 13 sented by ρ 7→ KρK † . In the case of a pure 14 # Perform element - wise multiply state, its action on the state vector is represented 15 def func ( index ) : 16 return (0.5 if index %2 else 0) by |ψi 7→ K |ψi. Note that K is not neces- 17 sarily unitary. Suppose that this quantum gate 18 state . m u l t i p l y _ e l e m e n t w i s e _ f u n c t i o n ( func acts on a pure quantum state |ψi and obtain an ) updated quantum state |ψ 0 i = K |ψi. We de- 19 20 # Perform element - wise addition note a state vector of the computational basis as |xi = i |xi i for x ∈ {0, 1}n , a coefficient of the N 21 state . add_state ( state ) 22 state vector along with the computational basis 23 # Make a tensor product of states . as ψx = hx|ψi, and a matrix representation of 24 # The resultant state has 4 qubits 25 sub_state = StateVector (2) K as Kx,y = hx|K|yi. Then, a coefficient of the 26 state = tensor_product ( state , sub_state ) updated quantum state can be written by 27 ψx0 = X 28 # Permutate qubit indices from [0 ,1 ,2 ,3] Kx,y ψy . (1) to [3 ,1 ,2 ,0] y∈{0,1}n 29 state = permutate_qubit ( state , [3 ,1 ,2 ,0]) Typically, quantum gates non-trivially act only 30 on a few qubits. Suppose that the total number of 31 # Drop the 1 - st and 2 - nd qubits from state qubits is n, the list of qubits on which the quan- 32 # and project to |0 > and |0 > subspace . tum gate non-trivially acts is M , and the num- 33 new_state = drop_qubit ( state , [1 ,2] , ber of the target qubits is m. Then, the action of [0 ,0]) the quantum gate K can be simplified by using a Listing 3: An example Python program that update and 2m × 2m complex matrix K̃, which we call a gate modify quantum states. matrix, as follows. We define the following two lists of n-bit strings: 4.3 Quantum gates B (0) = {x ∈ {0, 1}n |∀i ∈ M, xi = 0} (2) 4.3.1 Gate type B (1) = {x ∈ {0, 1}n |∀i ∈ / M, xi = 0}. (3) As explained in the overview, quantum gates in- Here, B (0) is a list of n-bit strings where all the clude not only typical quantum gates such as values at indices contained in M are zero, and unitary operators and Pauli-Z basis measure- B (1) is a list of n-bit strings where all the values ments, but also contain all operations that up- at indices not contained in M are zero. There are date quantum states. All the classes of quan- 2n−m elements in B (0) and 2m elements in B (1) . tum gates are defined as a subclass of the An arbitrary n-bit string x can be uniquely de- QuantumGateBase class, which has a function composed as x = x(0) + x(1) where x(i) ∈ Bi . update_quantum_state that acts on derived We use this decomposition implicitly in the fol- classes of QuantumStateBase. In Qulacs, quan- lowing discussion. Since the quantum gate only tum gates in which the action can be written as acts on the target qubits, a transition amplitude |ψi 7→ K |ψi, where |ψi is a state vector and between two computational basis states is zero K is a certain complex matrix, are called basic if their n-bit strings are different at indices not gates. Examples include a Pauli rotation on mul- contained in the target qubits, and a transition Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 7
amplitude is independent of the values at indices 10 temp_vector = np . dot ( gate_matrix , not contained in the target qubit, i.e., temp_vector ) 11 # Write values to the state vector 12 for ind , x1 in enumerate ( B1 ) : Kx,y = Kx(0) +x(1) ,y(0) +y(1) 13 state_vector [ x0 + x1 ] = temp_vector [ = Kx(1) ,y(1) δx(0) ,y(0) (4) ind ] Listing 4: An example Python code that naively for an arbitrary x, y ∈ {0, 1}n . Then, Eq. (1) can implements an update function of a quantum state be rephrased as Note that in the example code, without loss of ψx0 (0) +x(1) generality, B1 is considered to be arranged so that X = Kx(1) ,y(1) ψx(0) +y(1) . (5) y (1) ∈B (1) the i-th element of B1 is s(bin(i)). In practice, a time for reading/writing temporal state vectors We define a bijective function r : B (1) → {0, 1}m (i.e., temp_vector in the example code) from/to such that (x0 , · · · , xn−1 ) 7→ (xM0 , · · · , xMm−1 ) the whole state vector is not negligible. Thus, where Mi is the qubit index of the i-th target the update function essentially performs 2n−m it- qubit, and also define s : {0, 1}m → B (1) as the erations of the following three parts: read 2m - inverse of r. Then, a 2m × 2m gate matrix K̃ of dim complex numbers from a memory, perform the quantum gate K is defined as matrix-vector multiplication, and write 2m com- plex numbers to the memory. K̃z,w = hs(z)|K|s(w)i (6) The DenseMatrix function generates a ba- sic gate K with a gate matrix K̃, and the for z, w ∈ {0, 1}m . With the gate matrix, we can RandomUnitary function generates that with a write Eq. (5) as follows: random unitary matrix sampled from a Haar- random distribution. Listing. 5 shows quantum ψx0 (0) +s(z) = X K̃z,w ψx(0) +s(w) . (7) gates instantiated with dense matrices. w∈{0,1}m 1 from qulacs import StateVector 2 from qulacs . gate import DenseMatrix , We also define a temporal state vector |ψ̃x(0) i as a RandomUnitary 2m -dim complex vector of which the i-th element 3 state = StateVector (10) is ψx(0) +s(bin(i)) where bin(i) represents the m-bit 4 binary representation of the integer i. Then, we 5 # Update quantum state with given gate can simplify Eq. (7) as matrix 6 target_list = [1] 7 gate_matrix = [[0 , 1] ,[1 , 0]] |ψ̃ 0 x(0) i = K̃ |ψ̃x(0) i . (8) 8 gate = DenseMatrix ( target_list , gate_matrix ) Since for all x(0) ∈ B (0) , Eq. (8) has to be calcu- 9 gate . update_quantum_state ( state ) lated, an update function for K consists of 2n−m 10 11 # Update quantum state with random matrix-vector multiplications with a gate matrix unitary K̃ and the temporal state vectors |ψ̃x(0) i. List- 12 # # Matrix is drawn from Haar - random ing. 4 shows a naive implementation of update distribution functions in the Python language, where B0 and 13 random_gate = RandomUnitary ( target_list ) 14 random_gate . update_quantum_state ( state ) B1 are a list of n-bit integers corresponding to B0 and B1 , respectively. Listing 5: An example Python program that applies dense matrix gates to quantum states. 1 import numpy as np 2 The computation time to apply a quantum gate 3 def func ( state_vector , gate_matrix , B0 , is mainly determined by two factors: time for pro- B1 , m ) : 4 temp_vector = np . zeros (2** m ) cessing arithmetic operations to perform a calcu- 5 for x0 in B0 : lation with complex numbers and time for pro- 6 # Read values from the state vector cessing memory operations to transfer complex 7 for ind , x1 in enumerate ( B1 ) : numbers between the CPU and main RAM; we 8 temp_vector [ ind ] = state_vector [ x0 + x1 ] call these arithmetic-operation cost and memory- 9 # Perform matrix - vector operation cost, respectively. They are propor- multiplication tional to the number of arithmetic and mem- Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 8
ory operations in update functions. The heavier add_control_qubit. of the two determines the application time of a When a gate matrix has a small number of function. The number of arithmetic operations non-zero elements, it is called sparse. When K̃ in each iteration of an update function of dense is a sparse matrix, the arithmetic- and memory- matrix gates is O(22m ) and that of memory op- operation costs can be decreased according to the erations is O(2m + 22m ), where O is a Landau number of non-zero elements. A quantum gate notation. Since this iteration is looped for 2n−m with a sparse gate matrix can be generated with times, the total number of arithmetic and mem- SparseMatrix. ory operations are O(2n+m ) and O(2n + 2n+m ). Another special case is a diagonal matrix, In the case of small m, the gate matrix K̃, which which is a matrix with non-zero elements only in is re-used in all the iterations, is expected to re- the diagonal elements. In this case, arithmetic- side on cache memory, and the memory-operation and memory-operation costs become O(2n ). cost for the gate matrix is counted at once. Thus, These costs are independent of the number of tar- the number of memory operation is effectively get qubits m. In this case, DiagonalMatrix can O(2n + 22m ). Typically, because we consider the be used for generating diagonal matrix gates. case when m n, the memory-operation cost is An action of reversible Boolean functions in further approximated as O(2n ). classical computing can always be represented as Although any basic gate can be treated as a permutation matrix, which is a matrix where a dense matrix gate, quantum gates in quan- there is a single unity element in each row and tum computing research sometimes have an addi- column. ReversibleBoolean creates a unitary tional structure in a gate matrix K̃. By utilizing operation with a permutation matrix by supply- this structure, the arithmetic-operation cost, the ing a function that returns the index of a column memory-operation cost, or both can be decreased. with a unity element from the index of a given This motivates us to define specialized subclasses row. This function is applicable not only for re- of dense matrix gates for quantum gates with versible circuits but also for creating and anni- structured gate matrices. Here, we show func- hilating operators as a product of a permutation tions for several types of basic gates with a struc- matrix and a diagonal matrix. Its arithmetic- ture. Note that the relation between arithmetic- and memory-operation costs are O(2n ) and are and memory-operation costs and the total com- also independent of the number of target qubits putation time is discussed in Sec. 5. m. A set of m-qubit Pauli matrices is defined as a Let Mc be a subset of target qubits and c (0 ≤ tensor product of Pauli matrices {I, X, Y, Z}⊗m , c < 2|Mc | ) be an integer, where | · | represents where the number of elements in a given set. Any ! ! gate matrix K̃ can be represented in the form 1 0 0 1 P |Mc | I= ,X = , K̃ = 2x,y=0−1 |xi hy| ⊗ Lx,y , where the first part 0 1 1 0 ! ! of the tensor product acts on the space of qubits 0 −i 1 0 in Mc , and the latter part acts on the space of Y = ,Z = . (9) i 0 0 −1 target qubits except for Mc . Suppose a gate ma- trix K̃ which satisfies Lx,y = 0 if x 6= y and A Pauli gate is a gate in which the gate matrix is a Lx,x = I if x 6= c. A quantum gate with such Pauli matrix. By assigning numbers {0, 1, 2, 3} to a gate matrix K̃ is called controlled quantum Pauli matrices {I, X, Y, Z}, respectively, a Pauli gates. We can describe a gate matrix of a con- matrix can be represented with a sequence of in- trolled quantum gate with an integer c and com- tegers {0, 1, 2, 3}m . The Pauli function generates plex matrix Lc,c . Let mc := |Mc | be the num- a basic gate with a Pauli matrix represented by ber of control qubits, and mt := m − mc be the a sequence of assigned integers. Its arithmetic- number of qubits on which Lc,c act. Then, this and memory-operation costs are O(2n ) and are gate can be applied with arithmetic-operation independent of m. costs O(2n−mc +mt ) and memory-operation costs Since a set of m-qubit Pauli matrices is a ba- O(2n−mc ). In Qulacs, we can specify the sis of 2m × 2m matrices, any 2m × 2m matrix pair of the digit of control index c and corre- can be represented as a linear combination of sponding item in Mc with a member function m-qubit Pauli matrices. Furthermore, any self- Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 9
adjoint matrix can be represented as a linear com- 29 diagonal_gate . update_quantum_state ( state bination of m-qubit Pauli matrices with real co- ) 30 efficients. Therefore, any unitary gate matrix can P 31 # Update a quantum state with a be represented in the form K̃ = exp(i P θP P ), permutation matrix gate where θP is a real coefficient. Suppose a quan- 32 def basis_to_basis ( index , dim ) : tum gate such that θP = 0 if P 6= Q, where Q 33 return ( index +3) % dim 34 is a certain Pauli matrix. Such a quantum gate 35 target_list = [0 , 3 , 4] is called a Pauli rotation gate. PauliRotation 36 rev_gate = ReversibleBoolean ( target_list can be used for generating a Pauli rotation gate , basis_to_basis ) with a description of Pauli matrix Q and rotation 37 rev_gate . update_quantum_state ( state ) 38 angle θQ . Its arithmetic- and memory-operation 39 # Update a quantum state with Pauli gate costs are also O(2n ). Note that quantum gates 40 target_list = [1 ,2] with multiple non-zero rotation angles, which are 41 pauli_ids = [3 ,2] vital for simulating the dynamics of quantum sys- 42 pauli_gate = Pauli ( target_list , tems under a given Hamiltonian, can be gener- pauli_ids ) 43 pauli_gate . update_quantum_state ( state ) ated with DenseMatrix function with an explicit 44 P matrix representation of exp(i P θP P ) or gen- 45 # Update a quantum state with a Pauli erated as a Trotterized quantum circuit with ob- rotation gate servable. For the latter, see Sec. 4.5. Listing. 6 46 target_list = [1 ,2] 47 pauli_ids = [3 ,2] shows examples. Qulacs has several other spe- 48 rotation_angle = np . pi /5 cializations for basic gates, which are detailed in 49 rot_gate = PauliRotation ( target_list , the online manuals [3]. pauli_ids , rotation_angle ) 50 rot_gate . update_quantum_state ( state ) 1 import numpy as np 2 from scipy . sparse import csr_matrix Listing 6: An example Python program that applies 3 from qulacs import StateVector several basic gates to quantum states. 4 from qulacs . gate import DenseMatrix , SparseMatrix , DiagonalMatrix , Pauli , PauliRotation , ReversibleBoolean 5 state = StateVector (10) 4.3.3 Quantum map 6 7 # Update a quantum state with a Quantum maps are a general operation, which controlled dense matrix gate includes all the quantum maps that cannot be 8 target_list = [1] represented as basic gates such as measurement, 9 gate_matrix = [[0 , 1] ,[1 , 0]] noisy operation, and feedback operation. 10 control_gate = DenseMatrix ( target_list , gate_matrix ) The most general form of physical operations 11 # # Act when the 2 - nd qubit is |0 > without measurements is a completely-positive 12 control_gate . add_control_qubit (2 , 0) trace-preserving (CPTP) [55]. According to the 13 # # Act when the 3 - rd qubit is |1 > operator-sum representation, this map can be control_gate . add_control_qubit (3 , 1) represented as ρ 7→ i Ki ρKi† , where ρ is the den- 14 P 15 control_gate . update_quantum_state ( state ) 16 sity matrix and Ki is called the Kraus operator. The map must satisfy the condition i Ki† Ki = P 17 # Update a quantum state with a sparse matrix gate I. In Qulacs, a list of basic gates, which represent 18 target_list = [2 , 1] 19 sparse_matrix = csr_matrix ( the action of Kraus operators {Ki }, is required to 20 ([1 ,1] , ([0 ,3] , [0 ,3]) ) , create a CPTP-map. When a density matrix is 21 shape =(4 ,4) , dtype = complex ) used as a representation of quantum states, ρ is sparse_gate = SparseMatrix ( target_list , mapped to i Ki ρKi† . On the other hand, when 22 P sparse_matrix ) 23 sparse_gate . update_quantum_state ( state ) a state vector is used as a representation of quan- 24 tum states, the i-th Kraus operator is chosen with 25 # update a quantum state with a diagonal probability pi = |Ki |ψi |2 . Then, the state vector matrix gate Ki |ψi 26 target_list = [3 , 5] is mapped to √ . Suppose that the number 27 diagonal_element = [1 , -1 , -1 , 1] pi 28 diagonal_gate = DiagonalMatrix ( of Kraus operators is k and each Kraus opera- target_list , diagonal_element ) tor acts on m-qubits, then in the worst case, the Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 10
arithmetic- and memory-operation costs become 8 O(2n+m k). The CPTP-map can be created with 9 # Update a quantum state with a CPTP map 10 gate_list = [ P0 (0) , P1 (0) ] a CPTP function. 11 cptp_gate = CPTP ( gate_list ) One of the most general representations of all 12 cptp_gate . update_quantum_state ( state ) physically achievable operations, including mea- 13 surements, is the completely-positive (CP) in- 14 # Update a quantum state with a CP instrument strument. In Qulacs, this operation is the same 15 classical_register = 0 as a CPTP-map except that the index of the 16 gate_list = [ P0 (0) , P1 (0) ] chosen Kraus operator is stored in the classical 17 inst_gate = Instrument ( gate_list , register of QuantumStateBase and can be used classical_register ) later. This map can be generated with a func- 18 inst_gate . update_quantum_state ( state ) 19 tion Instrument. The arithmetic- and memory- 20 # Get and set values in the classical operation costs are the same as CPTP-map. register A CPTP-map is called unital when it maps a 21 value = state . get_classical_value ( maximally mixed state to itself. A unital CPTP- classical_register ) 22 state . set_classical_value ( map can be represented as a probabilistic applica- classical_register , 1 - value ) tion of unitary operations (i.e., for all i, Ki has a 23 √ form Ki = pi Ui , where pi is a real value and Ui 24 # Update a quantum state with a unital is unitary). Unlike a CPTP-map, the arithmetic- gate 25 gate_list = [ X (0) , Y (0) , Z (0) ] and memory-operation costs of unital maps de- 26 prob_list = [0.2 , 0.3 , 0.1] crease to O(2n+m ). The costs become indepen- 27 prob_gate = Probabilistic ( prob_list , dent of the number of Kraus operators since the gate_list ) probability distribution for sampling a Kraus op- 28 prob_gate . update_quantum_state ( state ) 29 erator is independent of the input state and can 30 # Update a quantum state with an be given in advance. A unital CPTP-map can adaptive gate generated with Probabilistic function. 31 def func ( cl a ss i ca l _r eg i st e r_ li s t ) : An adaptive map is one that acts on the quan- 32 return cl a ss ic a l_ r eg is t er _ li st [0] == tum states only when a given classical condition 0 33 is satisfied. This map requires a Boolean func- 34 gate = X (0) tion that determines an output according to the 35 adap_gate = Adaptive ( gate , condition = classical registers. Then, a map is applied only func ) when the returned value of the Boolean function 36 adap_gate . update_quantum_state ( state ) is True. This map is useful for treating feedback Listing 7: An example Python program that applies and feedforward operations such as the heralded general quantum maps to quantum states. operation, readout initialization, measurement- based quantum computation, and look-up ta- ble decoder for quantum error correction. This gate can be generated with a function named Adaptive. Its arithmetic- and memory-operation costs are dependent on a given gate and the prob- 4.3.4 Named gate ability that the given condition is satisfied. Listing. 7 shows the example codes of these gen- Since quantum gates with several specific gate eral maps. There are several other forms of gen- matrices and Kraus operators are frequently used eral gates in Qulacs for a specific research pur- in quantum computing research, functions to pose. See the online manual [3] for details. generate these gates are defined. TABLE.1 1 from qulacs import StateVector lists these named gates. Although some of 2 from qulacs . gate import X , Y , Z , P0 , P1 these function calls are simply redirected to 3 from qulacs . gate import Instrument , CPTP the definition as quantum maps, the following , Probabilistic , Adaptive gates are redirected optimized update functions: 4 5 state = StateVector (3) X,Y,Z,H,CNOT,SWAP,CZ. Thus, when these quan- 6 gate_list = [ X (0) , Y (0) , Z (0) ] tum gates are used, they should be instantiated 7 using these functions. Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 11
Category Name Description Single-qubit gate X Pauli-X gate Y Pauli-Y gate Z Pauli-Z gate sqrtX π/4 rotation of Pauli-X gate sqrtXdag −π/4 rotation of Pauli-X gate sqrtY π/4 rotation of Pauli-Y gate sqrtYdag −π/4 rotation of Pauli-Y gate S π/4 rotation of Pauli-Z gate Sdag −π/4 rotation of Pauli-Z gate T π/8 rotation of Pauli-Z gate Tdag −π/8 rotation of Pauli-Z gate H Hadamard gate Two-qubit gate CNOT Controlled-NOT gate CZ Controlled-Z gate SWAP SWAP gate Three-qubit gate TOFFOLI TOFFOLI gate FREDKIN FREDKIN gate Single-qubit RX Pauli-X rotation: exp(iθX/2) rotation gate RY Pauli-Y rotation: exp(iθY /2) RZ Pauli-Z rotation: exp(iθZ/2) U1 Rotate phase of LO (Local oscillator) U2 Rotate phase of LO with single π/2-pulse U3 Rotate phase of LO with two π/2-pulses Projection and P0 Projection matrix to |0i state measurement P1 Projection matrix to |1i state Measurement Single qubit measurement with Z-basis Noise BitFlipNoise Probabilistic Pauli-X operation DephasingNoise Probabilistic Pauli-Z operation DepolarizingNoise Single-qubit uniform depolarizing noise TwoQubitDepolarizingNoise Two-qubit uniform depolarizing noise AmplitudeDampingNoise Single-qubit amplitude damping noise Table 1: Partial listing of named gates in Qulacs. These are all defined in the qulacs.gate module. Several gates have optimized functions, while others are alias to quantum gates. For the definition of U1, U2, and U3, see the reference of IBMQ [56] 4.4 Quantum circuit update_quantum_state, all the contained quan- tum gates are applied to a given state sequen- In Qulacs, a quantum circuit is represented tially. as a simple array of quantum gates. The When users need to treat paramet- QuantumCircuit class instantiates quantum cir- ric quantum gates and circuits, the cuits. The add_gate function inserts a quan- ParametricQuantumCircuit class should tum gate to a circuit with a given position. be used, which provides functions to treat If a position is not given, the gate is ap- parameters in quantum circuits. By adding pended to the last of a quantum circuit. The ParametricRX, ParametricRY, ParametricRZ, remove_gate function removes a quantum gate and PrametricPauliRotation gates with the at a given position. By calling a member function add_parametric_gate function, their rota- Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 12
tion angles can be set and varied with the position get_parameter and set_parameter functions. 27 position = 0 28 gate = circuit . get_gate ( position ) Listing. 8 shows some examples. 29 30 # Update a state vector with the quantum 1 from qulacs import StateVector , circuit QuantumCircuit , 31 state = StateVector ( n ) P a r a m e t r i c Q u an t u m Ci r c u it 32 circuit . update_quantum_state ( state ) 2 from qulacs . gate import DenseMatrix , H , 33 CNOT 34 # Create parametric quantum circuit 3 from qulacs . gate import ParametricRX , 35 par_circuit = Pa r a m et r i c Qu a n t u mC i r c ui t ( n ParametricRY , Pa r am et r ic P au li R ot a ti o n ) 4 36 par_circuit . add_parametric_gate ( 5 # Create a quantum circuit and add ParametricRX (0 , 0.1) ) quantum gates 37 par_circuit . add_parametric_gate ( 6 n = 5 ParametricRY (1 , 0.1) ) 7 circuit = QuantumCircuit ( n ) 38 par_circuit . add_parametric_gate ( 8 circuit . add_gate ( DenseMatrix ([1] , P ar a me tr i cP a ul iR o ta t io n ([0 ,1] , [1 ,1] , [[0 ,1] ,[1 ,0]]) ) 0.1) ) 9 for index in range ( n ) : 39 10 circuit . add_gate ( H ( index ) ) 40 # Get the number of parameters in the 11 quantum circuit 12 # Insert quantum gates into a given 41 par_count = par_circuit . position get_parameter_count () 13 position = 1 42 14 circuit . add_gate ( CNOT (2 ,3) , position ) 43 # Get and set a parameter at a position 15 44 index = 0 16 # Remove a quantum gate at a position 45 angle = 0.2 17 position = 0 46 value = par_circuit . get_parameter ( index ) 18 circuit . remove_gate ( position ) 47 par_circuit . set_parameter ( index , angle ) 19 48 20 # Compute the depth of the quantum 49 # Get the position of a parametric gate circuit from the index of a parameter 21 depth = circuit . calculate_depth () 50 position = par_circuit . 22 g e t _ p a r a m e t r i c _ g a t e _ p o s i t i o n ( index ) 23 # Get the number of quantum gates 24 gate_count = circuit . get_gate_count () Listing 8: An example Python program that generates 25 quantum circuits and parametric ones. 26 # Get a copy of the quantum gate at a 4.5 Observable In quantum physics, physical values are obtained as an expectation value of a self-adjoint operator named observable. In Qulacs, any observable O is represented as a linear combination of Pauli ma- trices with real coefficients, i.e., O = P αP P where αP ∈ R. A Pauli term in observable αP P is P constructed with the PauliOperator class. An observable is generated with the Observable class. The add_operator function adds a Pauli term to an observable. Then, the get_expectation_value function computes the expectation value of an observable according to a given quantum state, and the get_transition_amplitude function computes the transition amplitude of an observable according to two quantum states. A Hamiltonian is an observable for the energy of a quantum system, and a unitary operator for time evolution is described as an exponential of the Hamiltonian operator with an imaginary coefficient. The add_observable_rotation_gate function adds a set of quantum gates for simulating the time evolution under a given Hamiltonian, which is generated with the Trotter decomposition, to a quantum circuit. Expectation values or transition amplitudes of Hamiltonian of molecules are frequently studied in the field of NISQ applications [41, 57, 58]. They are usually represented as a linear combi- nation of products of fermionic operators. They can be converted to an observable with Pauli Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 13
operators using the Jordan-Wigner transformation [59] or the Bravyi-Kitaev transformation [60], which are implemented in OpenFermion [54]. The format of OpenFermion is loaded with the create_quantum_operator_from_openfermion_text function, which allows the output of Open- Fermion to be interpreted as a format of Qulacs. The GeneralQuantumOperator class can be used to generate observables that are not self-adjoint. Listing. 9 shows example codes for treating observables and evaluating expectation values. 1 from qulacs import Observable , PauliOperator , StateVector , QuantumCircuit 2 from qulacs . quantum_operator import c r e a t e _ q u a n t u m _ o p e r a t o r _ f r o m _ o p e n f e r m i o n _ t e x t 3 4 # Construct a Pauli operator 5 coef = 2.0 6 Pauli_string = " X 0 X 1 Y 2 Z 4 " 7 pauli = PauliOperator ( Pauli_string , coef ) 8 9 # Create an observable acting on n qubits 10 n = 5 11 observable = Observable ( n ) 12 # Add a Pauli operator to the observable 13 observable . add_operator ( pauli ) 14 # or directly add it with coef and str 15 observable . add_operator (0.5 , " Y 1 Z 4 " ) 16 17 # Get the number of terms in the observable 18 term_count = observable . get_term_count () 19 20 # Get the number of qubit on which the observable acts 21 qubit_count = observable . get_qubit_count () 22 23 # Get a specific term as PauliOperator 24 index = 1 25 pauli = observable . get_term ( index ) 26 27 # Calculate the expectation value 28 state = StateVector ( n ) 29 state . s e t _ H a ar _rand om_s tate (0) 30 expect = observable . g et_ex pect atio n_val ue ( state ) 31 32 # Calculate the transition amplitude 33 bra = StateVector ( n ) 34 bra . s e t _ H a a r _ rand om_st ate (1) 35 trans_amp = observable . g e t _ tr a n s it i o n _a m p l it u d e ( bra , state ) 36 37 # Create a quantum circuit to simulate 38 # the time evolution by a given observable 39 # Observable is Trotterized with given slice count . 40 circuit = QuantumCircuit ( n ) 41 angle = 0.1 42 t_slice = 100 43 circuit . a d d _ o b s e r v a b l e _ r o t a t i o n _ g a t e ( obs , angle , t_slice ) 44 circuit . u p d a te_quantum_state ( state ) 45 46 # Load an observable from OpenFermion text 47 ope n_ferm ion_text = """ 48 ( -0.8126100000000005+0 j ) [] + 49 (0.04532175+0 j ) [ X0 Z1 X2 ] + 50 (0.04532175+0 j ) [ X0 Z1 X2 Z3 ] + 51 (0.04532175+0 j ) [ Y0 Z1 Y2 ] + 52 (0.04532175+0 j ) [ Y0 Z1 Y2 Z3 ] + 53 (0 . 1 7 1 2 0 1 0 0 0 0 00 000 02 +0 j ) [ Z0 ] + 54 (0 . 1 7 1 2 0 1 0 0 0 0 00 000 02 +0 j ) [ Z0 Z1 ] + 55 (0.165868+0 j ) [ Z0 Z1 Z2 ] + 56 (0.165868+0 j ) [ Z0 Z1 Z2 Z3 ] + Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 14
57 (0.12054625+0 j ) [ Z0 Z2 ] + 58 (0.12054625+0 j ) [ Z0 Z2 Z3 ] + 59 (0.16862325+0 j ) [ Z1 ] + 60 ( -0.22279649999999998+0 j ) [ Z1 Z2 Z3 ] + 61 (0.17434925+0 j ) [ Z1 Z3 ] + 62 ( -0.22279649999999998+0 j ) [ Z2 ] 63 """ 64 obs_of = c r e a t e _ q u a n t u m _ o p e r a t o r _ f r o m _ o p e n f e r m i o n _ t e x t ( open_fermion_text ) Listing 9: An example Python program that generates and evaluates observables. 5 Optimizations Tgate is lower bounded as In this section, we discuss possible performance Tgate ≥ Tover + max(Ncom /VFLOPS , Nmem /VBW ) bottlenecks in quantum simulation, and discuss (10) several optimization techniques for different com- puting devices. when computation and memory operations do not block each other. While actual processing is not 5.1 Background necessarily simplified to this equation, estimation with this equation works well when we develop a Since an application of a quantum gate is a large quantum circuit simulator. number of simple iterations, a time for circuit To develop a quantum circuit simulator satis- simulation can be roughly estimated as the sum fying the demands shown in Sec. 3, we can find of constant-time overheads for invoking functions several basic directions from this equation. In the and times for processing arithmetic and memory case of a small number of qubits, Tover becomes operations. Several factors such as an overhead a dominant factor. Thus, the pre- and post- to call C++ function via python interfaces, func- processing for applying quantum gates should be tions with parallelization, and GPU kernels in- minimized. In Qulacs, every core function is de- cur additional overheads in computation. These signed to minimize overheads. When the number additional overheads are quantitatively discussed of qubits n increases, values of Ncom /VFLOPS and later in Sec. 6. The times for processing arith- Nmem /VBW grow exponentially to n, and they metic and memory operations are determined by become larger than the overhead Tover . In this the number of operations divided by throughput. region, the values Ncom /VFLOPS and Nmem /VBW For the time for arithmetic operations, the num- should be minimized. If there are two ways to ber of operations that can be processed in a unit update quantum states, and if one has smaller second is vital, which is known as floating-point Ncom and Nmem than the other, the first one operations per second (FLOPS). To process com- should be chosen to minimize Tgate . To this plex numbers in the CPU, we need to load com- end, Qulacs provides several specialized update plex numbers representing quantum states from functions that utilize the structure of gate ma- the CPU cache or main RAM to the CPU reg- trices as introduced in Sec. 4. In this section, isters. The size of data per unit time that we we show four additional techniques to minimize can transfer between a memory and processor is Tgate when Ncom /VFLOPS and Nmem /VBW are called the bandwidth of the memory. Let the time dominant: SIMD Optimization, multi-threading for the additional overheads be Tover , the number with OpenMP, quantum circuit optimization, and of arithmetic operations be Ncom , the number of GPU acceleration. memory operations be Nmem , the FLOPS of CPU be VFLOPS , and the memory bandwidth, i.e., the 5.2 SIMD optimization number of complex numbers which we can trans- fer, be VBW . Tover is determined by a design of Recent processors support SIMD (single- a library, Nmem and Ncom are determined by a instruction, multiple data) instructions, which quantum gate to apply, and VFLOPS and VBW are can apply the same operation to multiple data determined by a computing device. Then, an ap- simultaneously. Qulacs utilizes instructions proximate total time for applying a quantum gate named Intel AVX2 [61], in which up to 256-bit Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 15
data can be processed simultaneously. When uation functions in parallel can increase the ef- a quantum state is represented as an array of fective instruction throughput VFLOPS . The use double-precision real values, we can load, store, of multiple cores is effective particularly when and process four real values (i.e. two complex FLOPS is a performance bottleneck (compute- numbers) simultaneously. Thus, the use of AVX2 bound), and each core has a certain amount of can reduce the number of instructions Ncom by a workload. Qulacs parallelizes the execution of factor of four at most. In Qulacs, several update update functions using OpenMP directives [62]. functions are optimized with AVX2 instructions The number of threads used in these func- by hand. When Qulacs is being installed, the tions can be controlled with environment vari- installer checks if a system supports such a able OMP_NUM_THREADS. The naive implementa- feature. If it is supported, the library is built tion shown in Listing. 4 consists of two loops: with AVX2 instructions enabled. B0 -loop and B1 -loop. In Qulacs, the B0 -loop is Here, we discuss our SIMD optimization tech- parallelized to maximize the amount of workload niques using dense matrix gates. However, it is for each core and to minimize the overhead in- worth noting that our techniques are applicable curred by multi-threading. Specifically, the par- to the other quantum gates. The naive imple- allelized loop iterates over [0..2n−m − 1] and the mentation of dense matrix gates is shown in List- iteration space is chunked into T chunks, where ing. 4. We implemented two SIMD versions of the T is the number of threads. In the loop body, the B1 -loop. When all the indices of the target qubits i-th element of B0 is computed on-the-fly from are large, it is not possible to SIMDize it as it is the loop index, which enables the even distribu- because the state vector elements required in one tion of workload across threads. In contrast, the iteration of the B0 -loop are scattered across non- list of B1 is created before executing the parallel contiguous memory locations. However, since the loop. While the data of B1 is accessed from ev- value of the adjacent memory location is always ery thread, its overhead is expected to be not so loaded in the next iteration, we unroll the B0 - high because B1 is not updated during the loop loop according to the number of target qubits to and its size |B1 | = 2m is typically small enough enable AVX2’s SIMD load/store operations. On to store within CPU registers or caches. A buffer the other hand, when there is a target qubit with for a temporal state vector (temp_vector in List- a small index, we can SIMDize it without such ing. 4) is a thread-local array that can be read an unrolling because the required state vector el- and written by each thread independently. Thus, ements are already adjacent. Note that the over- a buffer space for T × 2m data is allocated be- head of enumerating B0 and B1 is not negligible fore the B0 -loop, so that each 2m block can be when the number of target qubits m is small. We used by a corresponding thread, and is deallo- reduce the cost of the enumeration of B0 and B1 cated at the end of the parallel loop. Note that as follows: Instead of listing all the items in B0 when n and m are small, the amount of workload beforehand, the i-th element of B0 is computed for each thread becomes small. In that case, the from the index i in each iteration using bit-wise overhead due to multi-threading becomes larger operation techniques. In contrast, the list of B1 is than the speed-up by multi-threading. Therefore, computed before the B0 -loop since the size of B1 Qulacs automatically disables multi-threading if is typically small. This implementation is useful the number of qubits n is smaller than a thresh- for parallelizing iterations with multi-threading old value even when OMP_NUM_THREADS is set to by OpenMP, which is explained in the Sec. 5.3. 2 or more. This threshold value is empirically When a gate matrix has a structure and basic determined according to the number of m and a gates other than a dense matrix can be utilized, type of basic gates. an updated state vector can be calculated without matrix-vector multiplication, and thus a different 5.4 Circuit optimization optimization is applied. When the SIMD and multi-threading with OpenMP are enabled, a time for processing arith- 5.3 Multi-threading with OpenMP metic operations Ncom /VFLOPS becomes smaller Since recent CPUs contain multiple processing than that for processing memory operations cores, executing iterations in update and eval- Nmem /VBW , and a computing time Tgate is de- Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 16
You can also read