Qulacs: a fast and versatile quantum circuit simulator for research purpose

Yasunari Suzuki (1,2), Yoshiaki Kawase (3), Yuya Masumura (4), Yuria Hiraga (5), Masahiro Nakadai (6),
Jiabao Chen (7), Ken M. Nakanishi (7,8), Kosuke Mitarai (3,7,9), Ryosuke Imai (7), Shiro Tamiya (7,10),
Takahiro Yamamoto (7), Tennin Yan (7), Toru Kawakubo (7), Yuya O. Nakagawa (7), Yohei Ibe (7),
Youyuan Zhang (7,8), Hirotsugu Yamashita (11), Hikaru Yoshimura (11), Akihiro Hayashi (12), and
Keisuke Fujii (2,3,9,13)

(1) NTT Computer and Data Science Laboratories, NTT Corporation, Musashino 180-8585, Japan
(2) JST PRESTO, Kawaguchi, Saitama 332-0012, Japan
(3) Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan
(4) Graduate School of Information Science and Technology, Osaka University, 1-1 Yamadaoka, Suita, Osaka 565-0871, Japan
(5) Graduate School of Information and Science, Nara Institute of Science and Technology, Takayama, Ikoma, Nara 630-0192, Japan
(6) Graduate School of Science, Kyoto University, Yoshida-Ushinomiya, Sakyo, Kyoto 606-8302, Japan
(7) QunaSys Inc., Aqua Hakusan Building 9F, 1-13-7 Hakusan, Bunkyo, Tokyo 113-0001, Japan
(8) Graduate School of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
(9) Center for Quantum Information and Quantum Biology, Institute for Open and Transdisciplinary Research Initiatives, Osaka University, Japan
(10) Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
(11) Individual researcher
(12) School of Computer Science, Georgia Institute of Technology, Atlanta, GA, 30332, USA
(13) Center for Emergent Matter Science, RIKEN, Wako Saitama 351-0198, Japan

arXiv:2011.13524v4 [quant-ph] 5 Oct 2021

To explore the possibilities of a near-term intermediate-scale quantum algorithm and long-term fault-tolerant quantum computing, a fast and versatile quantum circuit simulator is needed. Here, we introduce Qulacs, a fast simulator for quantum circuits intended for research purpose. We show the main concepts of Qulacs, explain how to use its features via examples, describe numerical techniques to speed-up simulation, and demonstrate its performance with numerical benchmarks.

Yasunari Suzuki: yasunari.suzuki.gz@hco.ntt.co.jp
Jiabao Chen: qulacs@qunasys.com
Ryosuke Imai: qulacs@qunasys.com
Tennin Yan: qulacs@qunasys.com
Toru Kawakubo: qulacs@qunasys.com
Yuya O. Nakagawa: qulacs@qunasys.com

1 Introduction

Many theoretical groups have explored quantum computing applications due to the rapid improvements in quantum technologies and huge efforts of experimental groups to develop quantum computers [1, 2]. Although classical simulation of quantum circuits is a vital tool to develop quantum computers, the simulation time increases exponentially with the number of qubits.

The primary reasons for using a classical simulator are the following: (i) Quantum devices suffer from much higher error rates. A classical simulator is necessary to determine the ideal results for comparison. (ii) In certain cases, not all necessary values can be directly measured from experiments such as the full quantum state vector and marginal probabilities of measurements. (iii) To analyze the performance of quantum error correction and noise characterization for an arbitrary noise model, noisy quantum circuits must be simulated. Hence, there are broad demands on classical simulators for research on quantum

 Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 1
computing. Although full simulations of quantum circuits are not efficient in the sense of computational complexity theory, it is important to implement a classical simulator that is as fast as possible.

Here, we introduce a quantum circuit simulator called Qulacs [3]. The main feature of Qulacs is that it meets many popular demands in quantum computing research such as evaluating near-term applications, quantum error correction, and quantum benchmark methods. In addition, Qulacs is available on many popular environments, and is one of the fastest quantum circuit simulators. Herein we demonstrate the structure of our library, optimization methods, and numerical benchmarks for simulating typical quantum gates and circuits.

This paper consists of the following topics. Section 2 overviews the main features and structure of Qulacs and compares it with existing libraries. Section 3 introduces the expected use of Qulacs. Section 4 discusses how to write codes with Qulacs using example codes. Section 5 introduces further optimization techniques to speed up the simulation of quantum circuits. Section 6 shows the numerical benchmarks for Qulacs. Section 7 compares the performance of Qulacs with that of existing simulators. Finally, Section 8 is devoted to summary and discussion.

2 Overview

2.1 Features of Qulacs

Qulacs is designed to accelerate research on quantum computing. Thus, Qulacs prioritizes the following:

Fast simulation of large quantum circuits: A full simulation of quantum circuits requires a time that grows exponentially with the number of qubits. This problem can be mitigated by optimizing and parallelizing the simulation codes for single- or multi-core CPUs and SIMD (Single Instruction Multiple Data) units and even GPUs (Graphics Processing Units), and optimizing quantum circuits before a simulation. These techniques can enable a few orders of magnitude performance improvement compared to naive implementations. Although this is a constant factor speed-up, the effect on practical research is tremendous. Qulacs offers optimized and parallelized codes to update a quantum state and evaluate probability distributions, observables, and so on.

Small overhead for simulating small quantum circuits: Often small but noisy quantum circuits (up to 10 qubits) are simulated many times rather than simulating a single-shot large ideal quantum circuit. In this case, the overhead due to pre- and post-processing for calling core API functions is not negligible relative to the overall simulation time. Qulacs is designed to minimize such overhead by focusing on core features and avoiding complicated functionalities.

Available on different environments: While numerical analysis is typically performed on workstations or high-performance computers, most software development occurs on laptops or desktop personal computers. For versatility, Qulacs provides interfaces for both Python and C++ languages, while most of the codes of Qulacs are written in C and C++ languages. In addition, Qulacs supports several compilers such as GNU Compiler Collection (GCC) and Microsoft Visual Studio C++ (MSVC). Qulacs is tested on several operating systems such as Linux, Windows, and Mac OS.

Many useful utilities for research: To meet various demands in quantum computing research, a simulator should support general quantum operations. Qulacs can create not only common unitary gates and projection measurements, but also general operations such as completely positive instruments and adaptive quantum gates conditioned on measurement results.

2.2 Structure of Qulacs

Qulacs consists of three shared libraries. The first one is a core library written in C language for optimized memory management, update functions of quantum states, and property evaluation of quantum states. The second library is built on top of the first library, and is written in C++ language. This allows users to easily create and control quantum gates and circuits in an object-oriented way, thereby improving the programmability. Also, at runtime, it adaptively chooses

the best performing implementation of quantum gates depending on the number of qubits. The third library explores variational methods with quantum circuits. This library wraps quantum gates and circuits so that quantum circuits can be treated as variational objects.

We expect that users who work on variational quantum algorithms use Qulacs via the third library, and the other users access Qulacs via the second library. Additionally, Qulacs can be used as a Python library, although there is some overhead in interfacing between Python and C++. Figure 1 shows the overview of the structure of Qulacs. The components of Qulacs and their usages are explained in Sec. 4 with example codes. Qulacs uses Eigen [4] to treat sparse matrices and to manage matrix representations of quantum gates, and pybind11 [5] for exporting functions and classes from C++ to Python. The features introduced in this paper are fully tested with GoogleTest [6] and pytest [7].

2.3 Simulation methods

There are several approaches to simulate general quantum circuits with classical computers. The simplest approach is to update quantum states represented by state vectors or density matrices sequentially by applying quantum gates as general maps. This method is called Schrödinger's method [1]. Qulacs implements this method because it allows fast and versatile simulations of quantum circuits compared with the other methods introduced in this section. The implementation of this method is detailed in Sec. 4.

Another method is Feynman's approach [1, 8, 9], which computes the sum of all Feynman's path contributions. This technique greatly decreases the memory size requirement, allowing a single amplitude of the final quantum state to be obtained quickly. While the number of Feynman's paths increases exponentially with not only the number of qubits but also the number of quantum gates, this requirement can be relaxed by using a tensor-network-based approach. With a tensor-network-based simulator, the simulation time increases exponentially with the tree-width of the network [9], which is a characteristic value of a graph representing how close a tensor network is to a tree graph. However, except in special cases where the tree-width is small, such as shallow quantum circuits with a large number of qubits, the tree-width becomes equal to the number of qubits. Moreover, Schrödinger's method is much faster than Feynman's path integral when the number of qubits is limited and the memory size is sufficient for storing a full state vector. Although we can also consider the Schrödinger-Feynman approach [1, 10] as a hybrid method, this method is typically faster than Schrödinger's method when simulating large quantum circuits with a small depth.

If quantum circuits have specific features, then the simulation time can be reduced. For example, if quantum circuits are dominated by Clifford gates, non-Clifford gates can be treated as a perturbation to the simulation. This treatment can reduce the time for simulation [11, 12]. This condition is satisfied in several situations, e.g., quantum circuits of stabilizer measurements where non-Clifford errors happen with a very small probability, or fault-tolerant quantum computing with a limited number of T and TOFFOLI gates. However, most of the quantum circuits of typical quantum algorithms do not satisfy these conditions.

2.4 Relation to the existing libraries

To date, many groups have published a variety of quantum circuit simulators. Cirq [13], Qiskit [14], PyQuil [15], and PennyLane [16] are published by Google, IBM, Rigetti Computing, and Xanadu, respectively. Since these groups are developers of hardware for quantum computing, these libraries are designed for submitting quantum tasks as a job without thinking about detailed experimental procedures or setups. Q# [17] by Microsoft focuses on providing higher levels of abstraction, such as packaged instructions for integer arithmetic or toolchains for compilation. To simulate quantum circuits with a low depth and a large number of qubits, tensor-network-based simulators are used [18–20]. However, these are not well suited to higher depths or smaller numbers of qubits. For quantum circuit simulations on state-of-the-art supercomputers, several works have reported on the performance and optimizations [21–24]. Qiskit-Aer [14], Intel-QS [25, 26], QX Simulator [27, 28], ProjectQ [29], QuEST [30], qsim [31], Yao [32], QCGPU [33], and Qibo [34] were all developed with motivations similar to ours. They

[Figure 1 (diagram). C++ user codes and Python user codes (via Python interfaces) access the front-end C++ classes: Quantum Circuit, Quantum Gates, Quantum State, and Observable, together with circuit optimization for fast simulation and parameter controls and differentiation for variational algorithms. These classes call optimized low-level functions: memory management (allocate/deallocate array, concatenate/drop qubits, copy/load state), quantum state evaluation (inner product, expectation value, marginal probability, sampling with Z-basis), and quantum state manipulation (basic gates with dense, diagonal, sparse, permutation, or Pauli matrices; general quantum maps such as CPTP-map, CP-instrument, and adaptive operations). The low-level functions run on computing devices: CPUs (SIMD, multi-threading) and GPUs.]

 Figure 1: The overview of the structure of Qulacs.

focused on optimizing quantum circuit simulations for quantum computing researchers. By contrast, Qulacs is one of the fastest simulators, which minimizes overhead even for small simulations, supports general quantum operations such as completely-positive and trace-preserving maps and completely-positive instruments, and runs on various environments. In addition, Qulacs provides many useful utilities frequently used in research such as calculations of transition amplitudes and reversible Boolean functions.

3 Expected usages of Qulacs

Quantum circuit simulators should be designed for specific targets. Qulacs is designed to help researchers of quantum computing. In particular, we expect the following usages:

3.1 Exploration of near-term applications and error mitigation techniques

To highlight typical evaluation targets, here we show two popular directions of near-term applications. One is optimizing a target function using variational quantum circuits such as the variational quantum eigensolver (VQE) [35]. In a typical scenario, we treat the rotation angles of Pauli gates as variational parameters of a cost function and optimize them by repeatedly simulating quantum circuits with a relatively small number of qubits. The other application is a quantum simulator [36]. In this approach, large quantum systems are simulated to explore physics in many-body quantum systems. In a typical scenario, we perform a single simulation with quantum circuits as large as possible. Thus, the size of memory or the allowed time limits the size of quantum circuits.

To date, Qulacs has been used in a few tens of research papers. For example, Qulacs is used by papers related to noisy intermediate-scale quantum (NISQ) applications [37–42] and fault-tolerant quantum computing [43]. Although Qulacs does not support gate decomposition, which is supported by high-layer libraries, Qulacs can be used as a faster backend library. For example, Qulacs can serve as a backend of Cirq [13] using a library cirq-qulacs [44]. Qulacs has also been chosen as a fast backend simulator in several libraries and services such as PennyLane (Xanadu) [16], t|ket> (Cambridge Quantum Computing) [45], Orquestra (Zapata Computing) [46], and Tequila [47].
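
The variational workflow of Sec. 3.1, in which rotation angles are tuned by repeatedly simulating a small circuit, can be sketched without any simulator library. The following minimal sketch is not Qulacs code. It assumes a single qubit, the ansatz RY(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>, the cost function <Z>, and plain gradient descent with the parameter-shift rule; the function names, step size, and iteration count are illustrative choices only.

```python
import math

def ry_state(theta):
    """Amplitudes of RY(theta)|0> under the assumed convention (real-valued)."""
    return [math.cos(theta / 2), math.sin(theta / 2)]

def expect_z(state):
    """<psi|Z|psi> for a real single-qubit state."""
    return state[0] ** 2 - state[1] ** 2

def cost(theta):
    """Cost function: one 'circuit simulation' per call."""
    return expect_z(ry_state(theta))

def parameter_shift_grad(theta):
    """Gradient of the cost for a Pauli rotation angle (parameter-shift rule)."""
    return 0.5 * (cost(theta + math.pi / 2) - cost(theta - math.pi / 2))

# Hypothetical optimizer settings: start angle, learning rate, iteration count.
theta = 0.3
for _ in range(200):
    theta -= 0.2 * parameter_shift_grad(theta)

final_energy = cost(theta)
```

For this ansatz the cost equals cos(theta), so the loop should drive theta toward pi and the cost toward its minimum of -1. In a realistic study each call to `cost` would be replaced by a simulation of a multi-qubit parametric circuit, which is why the per-call overhead of the simulator matters.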

3.2 Performance analysis of quantum error correction schemes

Another important usage of the simulator for quantum circuits is performance analysis of quantum error correction and fault-tolerant quantum computing. To construct a quantum computer large enough for Shor's algorithm [48, 49], a quantum simulation for quantum many-body systems [36, 50], or algorithms for linear systems [51], quantum error correction [52] is necessary to reduce logical error rates to an arbitrarily small value. Many types of quantum error-correcting codes and schemes have been proposed. However, the number of qubits needed for a specific application is highly dependent on the performance of quantum error-correcting schemes. Consequently, and because it is difficult to control the noise properties on real devices, the performance of quantum error correction must be analyzed with classical simulation in near-term development. Unfortunately, the time to accurately simulate quantum error-correcting codes with practical noise models grows exponentially with the number of qubits. Thus, we need a fast and accurate simulator of noisy quantum circuits of quantum error correction.

3.3 Generation of a reference of experimental data

To characterize and calibrate controls of qubits, sometimes experimental data must be compared with the ideal one. For example, several verification methods for large quantum devices [1, 53] require a full simulation of large quantum systems. While quantum circuits in the quantum computational supremacy regime require supercomputers for simulations, portable and fast quantum circuit simulators remain useful for generating small-scale experimental references.

4 Implementation of Qulacs

4.1 Overview

In Qulacs, any state of a quantum system is represented as a subclass of the QuantumStateBase class. Thus far, Qulacs supports two representations of quantum states: state vector and density matrix. The StateVector class represents a state vector, while the DensityMatrix class represents a density matrix. These classes have some basic utilities as their member functions such as initialization to a certain quantum state, computing marginal probabilities, and sampling measurement results. The QuantumStateBase class also contains a variable-length integer array called classical registers, which is used to store measurement results.

When a quantum state is updated by quantum operations, subclasses of QuantumGateBase are instantiated and applied to a quantum state. This class supports not only unitary operations and projection measurements but also a variety of operations for general quantum mapping such as a completely-positive instrument and a completely-positive trace-preserving map.

To evaluate the expectation values of observables, there is a class named Observable. We assume the Observable class is described as a linear combination of Pauli operators. Thus, the Observable instance can be constructed directly from an output of an external library such as OpenFermion [54]. Additionally, a Trotterized quantum circuit can be created from a given observable.

By default, Qulacs performs a simulation by allocating and manipulating StateVector on a CPU. Also, since a GPU can significantly outperform a CPU in certain cases, Qulacs supports GPU execution. This can be done by using StateVectorGpu. Once a state vector is allocated on a GPU, all the computations such as update and evaluation of state vectors are performed on the GPU unless the state is explicitly converted into StateVector. Qulacs does not support allocating a quantum state on multiple GPUs. One GPU can be selected by supplying an index if multiple GPUs are installed.

In the discussion below, we explain the case where quantum states are allocated as state vectors on the main RAM and processed within a CPU. Although Qulacs can be used as a C++ library, we show examples in the Python language for simplicity in the main text. See Appendix A for the example codes in the C++ language. Here, we show typical examples and their basic features. Qulacs supports more operations than those explained here. For a detailed explanation, please see the documentation on the official website [3].
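
To make the notion of an observable as a linear combination of Pauli operators concrete, the sketch below evaluates such an expectation value on a small state vector in plain Python. It does not use the Qulacs API: the (coefficient, dictionary) term format and the helper names `apply_pauli_string` and `expectation` are hypothetical, and the sketch assumes that qubit 0 is the least significant bit of the basis index and that the observable is Hermitian.

```python
import math

def apply_pauli_string(state, paulis):
    """Return P|state>, where `paulis` maps a qubit index to "X", "Y" or "Z"."""
    out = [0j] * len(state)
    for index, amp in enumerate(state):
        new_index, phase = index, 1 + 0j
        for qubit, pauli in paulis.items():
            bit = (index >> qubit) & 1
            if pauli == "X":      # X flips the bit
                new_index ^= 1 << qubit
            elif pauli == "Y":    # Y|0> = i|1>, Y|1> = -i|0>
                new_index ^= 1 << qubit
                phase *= 1j if bit == 0 else -1j
            elif pauli == "Z":    # Z|b> = (-1)^b |b>
                phase *= -1 if bit else 1
            else:
                raise ValueError(f"unknown Pauli label: {pauli}")
        out[new_index] += phase * amp
    return out

def expectation(state, terms):
    """<state|H|state> for H = sum_k coef_k * P_k, given as (coef, paulis) pairs."""
    total = 0j
    for coef, paulis in terms:
        p_state = apply_pauli_string(state, paulis)
        total += coef * sum(a.conjugate() * b for a, b in zip(state, p_state))
    return total.real  # imaginary part vanishes for a Hermitian observable

# Bell state (|00> + |11>)/sqrt(2) on two qubits
amp = 1 / math.sqrt(2)
bell = [amp + 0j, 0j, 0j, amp + 0j]

# Illustrative observable: H = 0.5 Z0 Z1 + 0.25 X0 X1 - 1.0 Y0 Y1
hamiltonian = [
    (0.5, {0: "Z", 1: "Z"}),
    (0.25, {0: "X", 1: "X"}),
    (-1.0, {0: "Y", 1: "Y"}),
]
energy = expectation(bell, hamiltonian)
```

On the Bell state the three Pauli strings have expectation values +1, +1, and -1, so `energy` should evaluate to 1.75. Each Pauli string is applied in a single pass over the amplitudes without building a matrix, which is one reason a Pauli-sum representation is convenient for a state-vector simulator.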

4.2 Quantum state

4.2.1 Initialization

In Qulacs, instantiating the StateVector class allocates a state vector. By default, a state vector is initialized to the zero-state $|0\rangle^{\otimes n}$. However, the state can be initialized to other states such as a computational basis state, a state vector given as a complex array, or a random pure state with its member functions. An instance of the StateVector class can create its copy or load the contents of another StateVector. Listing 1 shows example codes.

    import numpy as np
    from qulacs import StateVector

    # Allocate a state vector
    num_qubit = 2
    state = StateVector(num_qubit)

    # Reset to a computational basis
    ## (0:|00>, 1:|01>, 2:|10>, 3:|11>)
    ## Note that the right-most digit corresponds to
    ## the 0-th qubit in Qulacs.
    state.set_computational_basis(index=2)

    # Create a copy of the state vector
    sub_state = state.copy()

    # Load a given list, numpy array,
    # or another StateVector
    state.load(state=[0.5, 0.5, 0.5, -0.5])
    state.load(np.ones(4)/2)
    state.load(sub_state)

    # Prepare a randomized pure quantum state
    state.set_Haar_random_state(seed=42)

Listing 1: An example Python program that initializes quantum states.

4.2.2 Analysis

Qulacs implements several functions to evaluate the properties of quantum states. Although the get_vector function provides a full state vector, evaluating the properties with built-in functions is fast. For example, Listing 2 shows example codes to compute a marginal probability, squared norm, and inner product of two states. Note that the sampling function creates a cumulative probability distribution as pre-processing for fast sampling, which temporarily allocates an additional array of length $2^n$.

    from qulacs import StateVector
    num_qubit = 3
    state = StateVector(num_qubit)
    state.set_Haar_random_state(0)

    # Get the state vector as numpy array
    vec = state.get_vector()

    # Get the marginal probability
    ## The below example obtains prob of |021>
    ## where "2" is a wild card
    ## matching to both 0 and 1.
    prob = state.get_marginal_probability(measured_value=[1, 2, 0])

    # Sampling results of the Z-basis measurements
    samples = state.sampling(count=100, seed=42)

    # Computing the squared norm
    squared_norm = state.get_squared_norm()

    # Computing the inner product of two quantum states
    from qulacs.state import inner_product
    state_bra = StateVector(num_qubit)
    state_bra.set_Haar_random_state()
    state_ket = StateVector(num_qubit)
    state_ket.set_Haar_random_state()
    value = inner_product(state_bra, state_ket)

Listing 2: An example Python program that evaluates the properties of quantum states.

4.2.3 Update

Several member functions of StateVector can be used for quickly updating a state vector. The multiply_coef function multiplies each element of a quantum state by a complex number, while the multiply_elementwise_function multiplies a state vector by index-dependent coefficients, using a function that returns a coefficient according to a given index. The add_state function adds two state vectors. While these operations are not physically achievable, they are useful for analysis in theoretical studies such as creating a superposition of two given states and supplying specific phases to each element of a state vector. A list of qubits in a state vector can be concatenated, permuted, or reduced with tensor_product, permutate_qubit, or drop_qubit, respectively. Listing 3 shows example codes for these operations.

    from qulacs import StateVector

    from qulacs.state import tensor_product, permutate_qubit, drop_qubit

    state = StateVector(2)
    state.set_Haar_random_state()

    # Normalize the state vector
    squared_norm = state.get_squared_norm()
    state.normalize(squared_norm)

    # Multiply a complex number to each element
    state.multiply_coef(0.5+0.1j)

    # Perform element-wise multiply
    def func(index):
        return (0.5 if index%2 else 0)

    state.multiply_elementwise_function(func)

    # Perform element-wise addition
    state.add_state(state)

    # Make a tensor product of states.
    # The resultant state has 4 qubits
    sub_state = StateVector(2)
    state = tensor_product(state, sub_state)

    # Permutate qubit indices from [0,1,2,3] to [3,1,2,0]
    state = permutate_qubit(state, [3,1,2,0])

    # Drop the 1-st and 2-nd qubits from state
    # and project to |0> and |0> subspace.
    new_state = drop_qubit(state, [1,2], [0,0])

Listing 3: An example Python program that updates and modifies quantum states.

4.3 Quantum gates

4.3.1 Gate type

As explained in the overview, quantum gates include not only typical quantum gates such as unitary operators and Pauli-Z basis measurements, but also all operations that update quantum states. All the classes of quantum gates are defined as subclasses of the QuantumGateBase class, which has a function update_quantum_state that acts on derived classes of QuantumStateBase. In Qulacs, quantum gates whose action can be written as $|\psi\rangle \mapsto K|\psi\rangle$, where $|\psi\rangle$ is a state vector and $K$ is a certain complex matrix, are called basic gates. Examples include a Pauli rotation on multiple qubits, the TOFFOLI gate, and a stabilizer projection to the +1 eigenspace. Quantum maps that are not basic gates, such as CPTP-maps, projection measurements, and adaptive operations, are represented using basic gates. Here, we show several popular types of operations which are implemented as basic operations.

4.3.2 Basic gate

A basic gate is an operation that can be represented by $\rho \mapsto K \rho K^\dagger$. In the case of a pure state, its action on the state vector is represented by $|\psi\rangle \mapsto K|\psi\rangle$. Note that $K$ is not necessarily unitary. Suppose that this quantum gate acts on a pure quantum state $|\psi\rangle$ and we obtain an updated quantum state $|\psi'\rangle = K|\psi\rangle$. We denote a state vector of the computational basis as $|x\rangle = \bigotimes_i |x_i\rangle$ for $x \in \{0,1\}^n$, a coefficient of the state vector in the computational basis as $\psi_x = \langle x|\psi\rangle$, and a matrix representation of $K$ as $K_{x,y} = \langle x|K|y\rangle$. Then, a coefficient of the updated quantum state can be written as

$$\psi'_x = \sum_{y \in \{0,1\}^n} K_{x,y}\, \psi_y. \tag{1}$$

Typically, quantum gates act non-trivially only on a few qubits. Suppose that the total number of qubits is $n$, the list of qubits on which the quantum gate non-trivially acts is $M$, and the number of the target qubits is $m$. Then, the action of the quantum gate $K$ can be simplified by using a $2^m \times 2^m$ complex matrix $\tilde{K}$, which we call a gate matrix, as follows. We define the following two lists of $n$-bit strings:

$$B^{(0)} = \{x \in \{0,1\}^n \mid \forall i \in M,\ x_i = 0\} \tag{2}$$

$$B^{(1)} = \{x \in \{0,1\}^n \mid \forall i \notin M,\ x_i = 0\}. \tag{3}$$

Here, $B^{(0)}$ is a list of $n$-bit strings where all the values at indices contained in $M$ are zero, and $B^{(1)}$ is a list of $n$-bit strings where all the values at indices not contained in $M$ are zero. There are $2^{n-m}$ elements in $B^{(0)}$ and $2^m$ elements in $B^{(1)}$. An arbitrary $n$-bit string $x$ can be uniquely decomposed as $x = x^{(0)} + x^{(1)}$ where $x^{(i)} \in B^{(i)}$. We use this decomposition implicitly in the following discussion. Since the quantum gate only acts on the target qubits, a transition amplitude between two computational basis states is zero if their $n$-bit strings are different at indices not contained in the target qubits, and a transition

amplitude is independent of the values at indices not contained in the target qubits, i.e.,

$$K_{x,y} = K_{x^{(0)}+x^{(1)},\, y^{(0)}+y^{(1)}} = K_{x^{(1)},y^{(1)}}\, \delta_{x^{(0)},y^{(0)}} \tag{4}$$

for an arbitrary $x, y \in \{0,1\}^n$. Then, Eq. (1) can be rephrased as

$$\psi'_{x^{(0)}+x^{(1)}} = \sum_{y^{(1)} \in B^{(1)}} K_{x^{(1)},y^{(1)}}\, \psi_{x^{(0)}+y^{(1)}}. \tag{5}$$

We define a bijective function $r : B^{(1)} \to \{0,1\}^m$ such that $(x_0, \cdots, x_{n-1}) \mapsto (x_{M_0}, \cdots, x_{M_{m-1}})$ where $M_i$ is the qubit index of the $i$-th target qubit, and also define $s : \{0,1\}^m \to B^{(1)}$ as the inverse of $r$. Then, a $2^m \times 2^m$ gate matrix $\tilde{K}$ of the quantum gate $K$ is defined as

$$\tilde{K}_{z,w} = \langle s(z)|K|s(w)\rangle \tag{6}$$

for $z, w \in \{0,1\}^m$. With the gate matrix, we can write Eq. (5) as follows:

$$\psi'_{x^{(0)}+s(z)} = \sum_{w \in \{0,1\}^m} \tilde{K}_{z,w}\, \psi_{x^{(0)}+s(w)}. \tag{7}$$

We also define a temporary state vector $|\tilde{\psi}_{x^{(0)}}\rangle$ as a $2^m$-dim complex vector of which the $i$-th element is $\psi_{x^{(0)}+s(\mathrm{bin}(i))}$, where $\mathrm{bin}(i)$ represents the $m$-bit binary representation of the integer $i$. Then, we can simplify Eq. (7) as

$$|\tilde{\psi}'_{x^{(0)}}\rangle = \tilde{K}\, |\tilde{\psi}_{x^{(0)}}\rangle. \tag{8}$$

Since Eq. (8) has to be calculated for all $x^{(0)} \in B^{(0)}$, an update function for $K$ consists of $2^{n-m}$ matrix-vector multiplications with a gate matrix $\tilde{K}$ and the temporary state vectors $|\tilde{\psi}_{x^{(0)}}\rangle$. Listing 4 shows a naive implementation of update functions in the Python language, where B0 and B1 are lists of $n$-bit integers corresponding to $B^{(0)}$ and $B^{(1)}$, respectively.

    import numpy as np

    def func(state_vector, gate_matrix, B0, B1, m):
        # Complex dtype so that complex amplitudes can be stored
        temp_vector = np.zeros(2**m, dtype=complex)
        for x0 in B0:
            # Read values from the state vector
            for ind, x1 in enumerate(B1):
                temp_vector[ind] = state_vector[x0 + x1]
            # Perform matrix-vector multiplication
            temp_vector = np.dot(gate_matrix, temp_vector)
            # Write values to the state vector
            for ind, x1 in enumerate(B1):
                state_vector[x0 + x1] = temp_vector[ind]

Listing 4: An example Python code that naively implements an update function of a quantum state

Note that in the example code, without loss of generality, B1 is considered to be arranged so that the $i$-th element of B1 is $s(\mathrm{bin}(i))$. In practice, the time for reading/writing temporary state vectors (i.e., temp_vector in the example code) from/to the whole state vector is not negligible. Thus, the update function essentially performs $2^{n-m}$ iterations of the following three parts: read $2^m$ complex numbers from memory, perform a matrix-vector multiplication, and write $2^m$ complex numbers to memory.

The DenseMatrix function generates a basic gate $K$ with a gate matrix $\tilde{K}$, and the RandomUnitary function generates one with a random unitary matrix sampled from a Haar-random distribution. Listing 5 shows quantum gates instantiated with dense matrices.

    from qulacs import StateVector
    from qulacs.gate import DenseMatrix, RandomUnitary
    state = StateVector(10)

    # Update quantum state with given gate matrix
    target_list = [1]
    gate_matrix = [[0, 1], [1, 0]]
    gate = DenseMatrix(target_list, gate_matrix)
    gate.update_quantum_state(state)

    # Update quantum state with random unitary
    ## Matrix is drawn from Haar-random distribution
    random_gate = RandomUnitary(target_list)
    random_gate.update_quantum_state(state)

Listing 5: An example Python program that applies dense matrix gates to quantum states.

The computation time to apply a quantum gate is mainly determined by two factors: time for processing arithmetic operations to perform a calculation with complex numbers and time for processing memory operations to transfer complex numbers between the CPU and main RAM; we call these arithmetic-operation cost and memory-operation cost, respectively. They are proportional to the number of arithmetic and memory operations in update functions. The heavier of the two determines the application time of a function. The number of arithmetic operations in each iteration of an update function of dense matrix gates is $O(2^{2m})$ and that of memory operations is $O(2^m + 2^{2m})$, where $O$ is the Landau notation. Since this iteration is looped $2^{n-m}$ times, the total numbers of arithmetic and memory operations are $O(2^{n+m})$ and $O(2^n + 2^{n+m})$, respectively. In the case of small $m$, the gate matrix $\tilde{K}$, which is re-used in all the iterations, is expected to reside in cache memory, and the memory-operation cost for the gate matrix is counted only once. Thus, the number of memory operations is effectively $O(2^n + 2^{2m})$. Typically, because we consider the case when $m \ll n$, the memory-operation cost is further approximated as $O(2^n)$.

Although any basic gate can be treated as a dense matrix gate, quantum gates in quantum computing research sometimes have an additional structure in a gate matrix $\tilde{K}$. By utilizing this structure, the arithmetic-operation cost, the memory-operation cost, or both can be decreased.

add_control_qubit.

When a gate matrix has a small number of non-zero elements, it is called sparse. When $\tilde{K}$ is a sparse matrix, the arithmetic- and memory-operation costs can be decreased according to the number of non-zero elements. A quantum gate with a sparse gate matrix can be generated with SparseMatrix.

Another special case is a diagonal matrix, which is a matrix with non-zero elements only in the diagonal elements. In this case, arithmetic- and memory-operation costs become $O(2^n)$. These costs are independent of the number of target qubits $m$. In this case, DiagonalMatrix can be used for generating diagonal matrix gates.

An action of reversible Boolean functions in classical computing can always be represented as a permutation matrix, which is a matrix where there is a single unity element in each row and column. ReversibleBoolean creates a unitary operation with a permutation matrix by supplying a function that returns the index of a column with a unity element from the index of a given
This motivates us to define specialized subclasses row. This function is applicable not only for re-
of dense matrix gates for quantum gates with versible circuits but also for creating and anni-
structured gate matrices. Here, we show func- hilating operators as a product of a permutation
tions for several types of basic gates with a struc- matrix and a diagonal matrix. Its arithmetic-
ture. Note that the relation between arithmetic- and memory-operation costs are O(2n ) and are
and memory-operation costs and the total com- also independent of the number of target qubits
putation time is discussed in Sec. 5. m.
 A set of m-qubit Pauli matrices is defined as a
 Let Mc be a subset of target qubits and c (0 ≤ tensor product of Pauli matrices {I, X, Y, Z}⊗m ,
c < 2|Mc | ) be an integer, where | · | represents where
the number of elements in a given set. Any ! !
gate matrix K̃ can be represented in the form 1 0 0 1
 P |Mc | I= ,X = ,
K̃ = 2x,y=0−1 |xi hy| ⊗ Lx,y , where the first part 0 1 1 0
 ! !
of the tensor product acts on the space of qubits 0 −i 1 0
in Mc , and the latter part acts on the space of Y = ,Z = . (9)
 i 0 0 −1
target qubits except for Mc . Suppose a gate ma-
trix K̃ which satisfies Lx,y = 0 if x 6= y and A Pauli gate is a gate in which the gate matrix is a
Lx,x = I if x 6= c. A quantum gate with such Pauli matrix. By assigning numbers {0, 1, 2, 3} to
a gate matrix K̃ is called controlled quantum Pauli matrices {I, X, Y, Z}, respectively, a Pauli
gates. We can describe a gate matrix of a con- matrix can be represented with a sequence of in-
trolled quantum gate with an integer c and com- tegers {0, 1, 2, 3}m . The Pauli function generates
plex matrix Lc,c . Let mc := |Mc | be the num- a basic gate with a Pauli matrix represented by
ber of control qubits, and mt := m − mc be the a sequence of assigned integers. Its arithmetic-
number of qubits on which Lc,c act. Then, this and memory-operation costs are O(2n ) and are
gate can be applied with arithmetic-operation independent of m.
costs O(2n−mc +mt ) and memory-operation costs Since a set of m-qubit Pauli matrices is a ba-
O(2n−mc ). In Qulacs, we can specify the sis of 2m × 2m matrices, any 2m × 2m matrix
pair of the digit of control index c and corre- can be represented as a linear combination of
sponding item in Mc with a member function m-qubit Pauli matrices. Furthermore, any self-

Accepted in Quantum 2021-07-26, click title to verify. Published under CC-BY 4.0. 9
adjoint matrix can be represented as a linear com- 29 diagonal_gate . update_quantum_state ( state
 bination of m-qubit Pauli matrices with real co- )
 30
 efficients. Therefore, any unitary gate matrix can
 P 31 # Update a quantum state with a
 be represented in the form K̃ = exp(i P θP P ), permutation matrix gate
 where θP is a real coefficient. Suppose a quan- 32 def basis_to_basis ( index , dim ) :
 tum gate such that θP = 0 if P 6= Q, where Q 33 return ( index +3) % dim
 34
 is a certain Pauli matrix. Such a quantum gate
 35 target_list = [0 , 3 , 4]
 is called a Pauli rotation gate. PauliRotation 36 rev_gate = ReversibleBoolean ( target_list
 can be used for generating a Pauli rotation gate , basis_to_basis )
 with a description of Pauli matrix Q and rotation 37 rev_gate . update_quantum_state ( state )
 38
 angle θQ . Its arithmetic- and memory-operation
 39 # Update a quantum state with Pauli gate
 costs are also O(2n ). Note that quantum gates 40 target_list = [1 ,2]
 with multiple non-zero rotation angles, which are 41 pauli_ids = [3 ,2]
 vital for simulating the dynamics of quantum sys- 42 pauli_gate = Pauli ( target_list ,
 tems under a given Hamiltonian, can be gener- pauli_ids )
 43 pauli_gate . update_quantum_state ( state )
 ated with DenseMatrix function with an explicit 44
 P
 matrix representation of exp(i P θP P ) or gen- 45 # Update a quantum state with a Pauli
 erated as a Trotterized quantum circuit with ob- rotation gate
 servable. For the latter, see Sec. 4.5. Listing. 6 46 target_list = [1 ,2]
 47 pauli_ids = [3 ,2]
 shows examples. Qulacs has several other spe-
 48 rotation_angle = np . pi /5
 cializations for basic gates, which are detailed in 49 rot_gate = PauliRotation ( target_list ,
 the online manuals [3]. pauli_ids , rotation_angle )
 50 rot_gate . update_quantum_state ( state )
 1 import numpy as np
 2 from scipy . sparse import csr_matrix Listing 6: An example Python program that applies
 3 from qulacs import StateVector several basic gates to quantum states.
 4 from qulacs . gate import DenseMatrix ,
 SparseMatrix , DiagonalMatrix , Pauli ,
 PauliRotation , ReversibleBoolean
 5 state = StateVector (10) 4.3.3 Quantum map
 6
 7 # Update a quantum state with a Quantum maps are a general operation, which
 controlled dense matrix gate includes all the quantum maps that cannot be
 8 target_list = [1] represented as basic gates such as measurement,
 9 gate_matrix = [[0 , 1] ,[1 , 0]] noisy operation, and feedback operation.
10 control_gate = DenseMatrix ( target_list ,
 gate_matrix ) The most general form of physical operations
11 # # Act when the 2 - nd qubit is |0 > without measurements is a completely-positive
12 control_gate . add_control_qubit (2 , 0) trace-preserving (CPTP) [55]. According to the
13 # # Act when the 3 - rd qubit is |1 > operator-sum representation, this map can be
 control_gate . add_control_qubit (3 , 1)
 represented as ρ 7→ i Ki ρKi† , where ρ is the den-
14 P
15 control_gate . update_quantum_state ( state )
16 sity matrix and Ki is called the Kraus operator.
 The map must satisfy the condition i Ki† Ki =
 P
17 # Update a quantum state with a sparse
 matrix gate I. In Qulacs, a list of basic gates, which represent
18 target_list = [2 , 1]
19 sparse_matrix = csr_matrix (
 the action of Kraus operators {Ki }, is required to
20 ([1 ,1] , ([0 ,3] , [0 ,3]) ) , create a CPTP-map. When a density matrix is
21 shape =(4 ,4) , dtype = complex ) used as a representation of quantum states, ρ is
 sparse_gate = SparseMatrix ( target_list , mapped to i Ki ρKi† . On the other hand, when
22
 P
 sparse_matrix )
23 sparse_gate . update_quantum_state ( state )
 a state vector is used as a representation of quan-
24 tum states, the i-th Kraus operator is chosen with
25 # update a quantum state with a diagonal probability pi = |Ki |ψi |2 . Then, the state vector
 matrix gate Ki |ψi
26 target_list = [3 , 5] is mapped to √ . Suppose that the number
27 diagonal_element = [1 , -1 , -1 , 1] pi
28 diagonal_gate = DiagonalMatrix ( of Kraus operators is k and each Kraus opera-
 target_list , diagonal_element ) tor acts on m-qubits, then in the worst case, the

arithmetic- and memory-operation costs become $O(2^{n+m} k)$. The CPTP-map can be created with the CPTP function.

One of the most general representations of all physically achievable operations, including measurements, is the completely-positive (CP) instrument. In Qulacs, this operation is the same as a CPTP-map except that the index of the chosen Kraus operator is stored in the classical register of QuantumStateBase and can be used later. This map can be generated with the Instrument function. The arithmetic- and memory-operation costs are the same as those of a CPTP-map.

A CPTP-map is called unital when it maps a maximally mixed state to itself. A unital CPTP-map can be represented as a probabilistic application of unitary operations (i.e., for all $i$, $K_i$ has the form $K_i = \sqrt{p_i}\, U_i$, where $p_i$ is a real value and $U_i$ is unitary). Unlike a CPTP-map, the arithmetic- and memory-operation costs of unital maps decrease to $O(2^{n+m})$. The costs become independent of the number of Kraus operators since the probability distribution for sampling a Kraus operator is independent of the input state and can be given in advance. A unital CPTP-map can be generated with the Probabilistic function.

An adaptive map is one that acts on the quantum states only when a given classical condition is satisfied. This map requires a Boolean function that determines an output according to the classical registers. Then, a map is applied only when the returned value of the Boolean function is True. This map is useful for treating feedback and feedforward operations such as the heralded operation, readout initialization, measurement-based quantum computation, and look-up table decoders for quantum error correction. This gate can be generated with a function named Adaptive. Its arithmetic- and memory-operation costs depend on the given gate and the probability that the given condition is satisfied.

Listing 7 shows example code for these general maps. There are several other forms of general gates in Qulacs for specific research purposes. See the online manual [3] for details.

    from qulacs import StateVector
    from qulacs.gate import X, Y, Z, P0, P1
    from qulacs.gate import Instrument, CPTP, Probabilistic, Adaptive

    state = StateVector(3)
    gate_list = [X(0), Y(0), Z(0)]

    # Update a quantum state with a CPTP map
    gate_list = [P0(0), P1(0)]
    cptp_gate = CPTP(gate_list)
    cptp_gate.update_quantum_state(state)

    # Update a quantum state with a CP instrument
    classical_register = 0
    gate_list = [P0(0), P1(0)]
    inst_gate = Instrument(gate_list, classical_register)
    inst_gate.update_quantum_state(state)

    # Get and set values in the classical register
    value = state.get_classical_value(classical_register)
    state.set_classical_value(classical_register, 1 - value)

    # Update a quantum state with a unital gate
    gate_list = [X(0), Y(0), Z(0)]
    prob_list = [0.2, 0.3, 0.1]
    prob_gate = Probabilistic(prob_list, gate_list)
    prob_gate.update_quantum_state(state)

    # Update a quantum state with an adaptive gate
    def func(classical_register_list):
        return classical_register_list[0] == 0

    gate = X(0)
    adap_gate = Adaptive(gate, condition=func)
    adap_gate.update_quantum_state(state)

Listing 7: An example Python program that applies general quantum maps to quantum states.

4.3.4 Named gate

Since quantum gates with several specific gate matrices and Kraus operators are frequently used in quantum computing research, functions to generate these gates are defined. Table 1 lists these named gates. Although some of these function calls are simply redirected to the definition as quantum maps, the following gates are redirected to optimized update functions: X, Y, Z, H, CNOT, SWAP, CZ. Thus, when these quantum gates are used, they should be instantiated using these functions.
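The state-vector rule for CPTP maps described in Sec. 4.3.3 (choose the $i$-th Kraus operator with probability $p_i = |K_i|\psi\rangle|^2$, then renormalize) can be illustrated with a small pure-Python sketch. This is not the Qulacs implementation: the helper names `apply_single_qubit` and `sample_kraus` are hypothetical, the Kraus operators are restricted to single-qubit 2x2 matrices, and qubit 0 is assumed to be the least significant bit of the basis index.

```python
import math
import random

def apply_single_qubit(matrix, state, target):
    """Return a new state with the 2x2 `matrix` applied to qubit `target`."""
    new_state = list(state)
    mask = 1 << target
    for i in range(len(state)):
        if i & mask == 0:
            a0, a1 = state[i], state[i | mask]
            new_state[i] = matrix[0][0] * a0 + matrix[0][1] * a1
            new_state[i | mask] = matrix[1][0] * a0 + matrix[1][1] * a1
    return new_state

def sample_kraus(kraus_list, state, target, rng):
    """Pick K_i with probability p_i = ||K_i|psi>||^2 and renormalize the result."""
    r = rng.random()
    cumulative = 0.0
    for index, kraus in enumerate(kraus_list):
        candidate = apply_single_qubit(kraus, state, target)
        p = sum(abs(a) ** 2 for a in candidate)
        cumulative += p
        if r < cumulative:
            norm = math.sqrt(p)
            return index, [a / norm for a in candidate]
    raise ValueError("probabilities do not sum to one; check the Kraus operators")

# Projectors onto |0> and |1>, i.e. a Z-basis measurement written as a CPTP map
P0 = [[1, 0], [0, 0]]
P1 = [[0, 0], [0, 1]]

# Two-qubit state with qubit 0 in |+> and qubit 1 in |0>
amp = 1 / math.sqrt(2)
state = [amp, amp, 0, 0]
index, post_state = sample_kraus([P0, P1], state, 0, random.Random(0))
```

In the worst case every Kraus operator is applied once before one is accepted, which is where the factor $k$ in the $O(2^{n+m}k)$ cost comes from. Returning `index` alongside the state mirrors what the text describes for an instrument, where the chosen index is kept for later use.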

Category                     Name                       Description
Single-qubit gate            X                          Pauli-X gate
                             Y                          Pauli-Y gate
                             Z                          Pauli-Z gate
                             sqrtX                      π/4 rotation of Pauli-X gate
                             sqrtXdag                   −π/4 rotation of Pauli-X gate
                             sqrtY                      π/4 rotation of Pauli-Y gate
                             sqrtYdag                   −π/4 rotation of Pauli-Y gate
                             S                          π/4 rotation of Pauli-Z gate
                             Sdag                       −π/4 rotation of Pauli-Z gate
                             T                          π/8 rotation of Pauli-Z gate
                             Tdag                       −π/8 rotation of Pauli-Z gate
                             H                          Hadamard gate
Two-qubit gate               CNOT                       Controlled-NOT gate
                             CZ                         Controlled-Z gate
                             SWAP                       SWAP gate
Three-qubit gate             TOFFOLI                    TOFFOLI gate
                             FREDKIN                    FREDKIN gate
Single-qubit rotation gate   RX                         Pauli-X rotation: exp(iθX/2)
                             RY                         Pauli-Y rotation: exp(iθY/2)
                             RZ                         Pauli-Z rotation: exp(iθZ/2)
                             U1                         Rotate phase of LO (local oscillator)
                             U2                         Rotate phase of LO with single π/2-pulse
                             U3                         Rotate phase of LO with two π/2-pulses
Projection and measurement   P0                         Projection matrix to |0⟩ state
                             P1                         Projection matrix to |1⟩ state
                             Measurement                Single-qubit measurement with Z-basis
Noise                        BitFlipNoise               Probabilistic Pauli-X operation
                             DephasingNoise             Probabilistic Pauli-Z operation
                             DepolarizingNoise          Single-qubit uniform depolarizing noise
                             TwoQubitDepolarizingNoise  Two-qubit uniform depolarizing noise
                             AmplitudeDampingNoise      Single-qubit amplitude damping noise

Table 1: Partial listing of named gates in Qulacs. These are all defined in the qulacs.gate module. Several gates
have optimized functions, while others are aliases to quantum gates. For the definition of U1, U2, and U3, see the
reference of IBMQ [56].
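The relations between some of the named gates in Table 1 can be checked numerically. The sketch below uses the common textbook matrices for S, T, and the square root of X, which are an assumption here: the matrices used inside Qulacs may differ from these by a global phase, so this is an illustration of the relations rather than a statement about the library.

```python
import cmath

def matmul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def close(a, b, tol=1e-12):
    """Element-wise comparison of two 2x2 matrices."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))

# Textbook conventions (assumed, see the note above)
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
S = [[1, 0], [0, 1j]]
T = [[1, 0], [0, cmath.exp(1j * cmath.pi / 4)]]
SQRT_X = [[0.5 + 0.5j, 0.5 - 0.5j], [0.5 - 0.5j, 0.5 + 0.5j]]

t_squared_is_s = close(matmul(T, T), S)            # two T gates give S
s_squared_is_z = close(matmul(S, S), Z)            # two S gates give Z
sqrt_x_squared_is_x = close(matmul(SQRT_X, SQRT_X), X)
```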

4.4 Quantum circuit

In Qulacs, a quantum circuit is represented as a simple array of quantum gates. The QuantumCircuit class instantiates quantum circuits. The add_gate function inserts a quantum gate into a circuit at a given position. If a position is not given, the gate is appended to the end of the quantum circuit. The remove_gate function removes a quantum gate at a given position. By calling the member function update_quantum_state, all the contained quantum gates are applied to a given state sequentially.

When users need to treat parametric quantum gates and circuits, the ParametricQuantumCircuit class should be used, which provides functions to treat parameters in quantum circuits. By adding ParametricRX, ParametricRY, ParametricRZ, and ParametricPauliRotation gates with the add_parametric_gate function, their rotation angles can be set and varied with the get_parameter and set_parameter functions. Listing 8 shows some examples.

    from qulacs import StateVector, QuantumCircuit, ParametricQuantumCircuit
    from qulacs.gate import DenseMatrix, H, CNOT
    from qulacs.gate import ParametricRX, ParametricRY, ParametricPauliRotation

    # Create a quantum circuit and add quantum gates
    n = 5
    circuit = QuantumCircuit(n)
    circuit.add_gate(DenseMatrix([1], [[0, 1], [1, 0]]))
    for index in range(n):
        circuit.add_gate(H(index))

    # Insert quantum gates into a given position
    position = 1
    circuit.add_gate(CNOT(2, 3), position)

    # Remove a quantum gate at a position
    position = 0
    circuit.remove_gate(position)

    # Compute the depth of the quantum circuit
    depth = circuit.calculate_depth()

    # Get the number of quantum gates
    gate_count = circuit.get_gate_count()

    # Get a copy of the quantum gate at a position
    position = 0
    gate = circuit.get_gate(position)

    # Update a state vector with the quantum circuit
    state = StateVector(n)
    circuit.update_quantum_state(state)

    # Create parametric quantum circuit
    par_circuit = ParametricQuantumCircuit(n)
    par_circuit.add_parametric_gate(ParametricRX(0, 0.1))
    par_circuit.add_parametric_gate(ParametricRY(1, 0.1))
    par_circuit.add_parametric_gate(ParametricPauliRotation([0, 1], [1, 1], 0.1))

    # Get the number of parameters in the quantum circuit
    par_count = par_circuit.get_parameter_count()

    # Get and set a parameter at a position
    index = 0
    angle = 0.2
    value = par_circuit.get_parameter(index)
    par_circuit.set_parameter(index, angle)

    # Get the position of a parametric gate from the index of a parameter
    position = par_circuit.get_parametric_gate_position(index)

Listing 8: An example Python program that generates quantum circuits and parametric ones.
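The idea that a circuit is just an ordered array of gates applied one after another can be made concrete with a toy model. The class `TinyCircuit` below is hypothetical and only mimics the method names used in Listing 8; it stores single-qubit gates as (matrix, target) pairs and assumes qubit 0 is the least significant bit of the basis index. It is a sketch of the data structure, not of how Qulacs implements it.

```python
import math

def apply_single_qubit(matrix, state, target):
    """Apply a 2x2 matrix to qubit `target` of `state` in place."""
    mask = 1 << target
    for i in range(len(state)):
        if i & mask == 0:
            a0, a1 = state[i], state[i | mask]
            state[i] = matrix[0][0] * a0 + matrix[0][1] * a1
            state[i | mask] = matrix[1][0] * a0 + matrix[1][1] * a1

class TinyCircuit:
    """Hypothetical stand-in for a circuit: an ordered list of (matrix, target) pairs."""

    def __init__(self, qubit_count):
        self.qubit_count = qubit_count
        self.gates = []

    def add_gate(self, gate, position=None):
        # Append when no position is given, otherwise insert at that position
        if position is None:
            self.gates.append(gate)
        else:
            self.gates.insert(position, gate)

    def remove_gate(self, position):
        del self.gates[position]

    def update_quantum_state(self, state):
        # Gates are applied sequentially, in list order
        for matrix, target in self.gates:
            apply_single_qubit(matrix, state, target)

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
X = [[0, 1], [1, 0]]

circuit = TinyCircuit(2)
circuit.add_gate((H, 0))       # appended to the end
circuit.add_gate((X, 0), 0)    # inserted before the Hadamard gate
state = [1, 0, 0, 0]           # |00>
circuit.update_quantum_state(state)
```

Because X is applied before H here, the first qubit ends in the state $(|0\rangle - |1\rangle)/\sqrt{2}$, which shows that the insertion position changes the result.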

 4.5 Observable
 In quantum physics, physical values are obtained as an expectation value of a self-adjoint operator
 named observable. In Qulacs, any observable $O$ is represented as a linear combination of Pauli matrices
 with real coefficients, i.e., $O = \sum_P \alpha_P P$ where $\alpha_P \in \mathbb{R}$. A Pauli term $\alpha_P P$ in an observable is
 constructed with the PauliOperator class. An observable is generated with the Observable class.
 The add_operator function adds a Pauli term to an observable. Then, the get_expectation_value
 function computes the expectation value of an observable according to a given quantum state, and the
 get_transition_amplitude function computes the transition amplitude of an observable according
 to two quantum states.
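To make the expectation value of a Pauli-sum observable concrete, the sketch below evaluates $\langle\psi|O|\psi\rangle$ term by term without building any matrix. The representation is an assumption made for illustration: a Pauli term is a dictionary from qubit index to a letter in "XYZ", an observable is a list of (coefficient, term) pairs, and qubit 0 is the least significant bit of the basis index. The function names are hypothetical and unrelated to the Qulacs internals.

```python
import math

def pauli_expectation(pauli, state):
    """Return <state| P |state> for a Pauli string P given as {qubit: 'X'|'Y'|'Z'}."""
    flip = 0
    for qubit, label in pauli.items():
        if label in "XY":
            flip |= 1 << qubit      # X and Y flip the bit of their qubit
    total = 0j
    for i, amp in enumerate(state):
        phase = 1 + 0j
        for qubit, label in pauli.items():
            bit = (i >> qubit) & 1
            if label == "Z":
                phase *= -1 if bit else 1
            elif label == "Y":
                phase *= -1j if bit else 1j
        # P|i> = phase * |i XOR flip>
        total += complex(state[i ^ flip]).conjugate() * phase * amp
    return total

def expectation(observable, state):
    """Sum coefficient * <P> over all terms; real for real coefficients."""
    return sum(coef * pauli_expectation(p, state) for coef, p in observable).real

# Bell state (|00> + |11>) / sqrt(2)
amp = 1 / math.sqrt(2)
bell = [amp, 0, 0, amp]

xx = pauli_expectation({0: "X", 1: "X"}, bell).real
yy = pauli_expectation({0: "Y", 1: "Y"}, bell).real
zz = pauli_expectation({0: "Z", 1: "Z"}, bell).real
value = expectation([(2.0, {0: "X", 1: "X"}), (0.5, {0: "Z"})], bell)
```

Each term costs one pass over the $2^n$ amplitudes, so the total cost grows linearly with the number of Pauli terms in the observable.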
 A Hamiltonian is an observable for the energy of a quantum system, and a unitary operator for time
 evolution is described as an exponential of the Hamiltonian operator with an imaginary coefficient.
 The add_observable_rotation_gate function adds a set of quantum gates for simulating the time
 evolution under a given Hamiltonian, which is generated with the Trotter decomposition, to a quantum
 circuit.
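The effect of the number of Trotter slices can be seen in a single-qubit example. The sketch below assumes a Hamiltonian $H = aX + bZ$, uses the sign convention $\exp(i\,\theta H)$ of the Pauli rotation gates above, and applies a first-order product formula; whether Qulacs uses exactly this ordering or order of decomposition is not stated in the text, so this is only an illustration of the principle. The exact reference uses $(aX+bZ)^2 = (a^2+b^2)I$.

```python
import math

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def pauli_rotation(pauli, theta):
    """exp(i * theta * P) = cos(theta) I + i sin(theta) P for a Pauli matrix P."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * I2[p][q] + 1j * s * pauli[p][q] for q in range(2)] for p in range(2)]

def exact_evolution(a, b, angle):
    """exp(i * angle * (aX + bZ)), using (aX + bZ)^2 = (a^2 + b^2) I."""
    r = math.hypot(a, b)
    c, s = math.cos(r * angle), math.sin(r * angle)
    return [[c * I2[p][q] + 1j * s * (a * X[p][q] + b * Z[p][q]) / r
             for q in range(2)] for p in range(2)]

def trotter_evolution(a, b, angle, n_slices):
    """First-order product formula: (exp(i a t/N X) exp(i b t/N Z))^N."""
    step = matmul(pauli_rotation(X, a * angle / n_slices),
                  pauli_rotation(Z, b * angle / n_slices))
    result = I2
    for _ in range(n_slices):
        result = matmul(result, step)
    return result

def max_abs_diff(u, v):
    return max(abs(u[p][q] - v[p][q]) for p in range(2) for q in range(2))

exact = exact_evolution(1.0, 0.5, 1.0)
errors = [max_abs_diff(trotter_evolution(1.0, 0.5, 1.0, n), exact)
          for n in (1, 10, 100)]
```

The entries of `errors` shrink as the slice count grows, which is the trade-off behind choosing a slice count when adding a Trotterized evolution to a circuit: more slices mean a more accurate evolution but a longer gate sequence.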
 Expectation values or transition amplitudes of Hamiltonians of molecules are frequently studied
 in the field of NISQ applications [41, 57, 58]. They are usually represented as a linear combi-
 nation of products of fermionic operators. They can be converted to an observable with Pauli

operators using the Jordan-Wigner transformation [59] or the Bravyi-Kitaev transformation [60],
 which are implemented in OpenFermion [54]. The text format of OpenFermion can be loaded with the
 create_quantum_operator_from_openfermion_text function, which converts the output of OpenFermion
 into the representation used by Qulacs. The GeneralQuantumOperator class can be used to
 generate operators that are not self-adjoint. Listing 9 shows example code for treating observables
 and evaluating expectation values.
    from qulacs import Observable, PauliOperator, StateVector, QuantumCircuit
    from qulacs.quantum_operator import create_quantum_operator_from_openfermion_text

    # Construct a Pauli operator
    coef = 2.0
    Pauli_string = "X 0 X 1 Y 2 Z 4"
    pauli = PauliOperator(Pauli_string, coef)

    # Create an observable acting on n qubits
    n = 5
    observable = Observable(n)
    # Add a Pauli operator to the observable
    observable.add_operator(pauli)
    # or directly add it with coef and str
    observable.add_operator(0.5, "Y 1 Z 4")

    # Get the number of terms in the observable
    term_count = observable.get_term_count()

    # Get the number of qubits on which the observable acts
    qubit_count = observable.get_qubit_count()

    # Get a specific term as PauliOperator
    index = 1
    pauli = observable.get_term(index)

    # Calculate the expectation value <a|H|a>
    state = StateVector(n)
    state.set_Haar_random_state(0)
    expect = observable.get_expectation_value(state)

    # Calculate the transition amplitude <a|H|b>
    bra = StateVector(n)
    bra.set_Haar_random_state(1)
    trans_amp = observable.get_transition_amplitude(bra, state)

    # Create a quantum circuit to simulate
    # the time evolution by a given observable
    # Observable is Trotterized with given slice count.
    circuit = QuantumCircuit(n)
    angle = 0.1
    t_slice = 100
    # Use the observable defined above (the original listing passed an undefined name `obs`)
    circuit.add_observable_rotation_gate(observable, angle, t_slice)
    circuit.update_quantum_state(state)

    # Load an observable from OpenFermion text
    open_fermion_text = """
    (-0.8126100000000005+0j) [] +
    (0.04532175+0j) [X0 Z1 X2] +
    (0.04532175+0j) [X0 Z1 X2 Z3] +
    (0.04532175+0j) [Y0 Z1 Y2] +
    (0.04532175+0j) [Y0 Z1 Y2 Z3] +
    (0.17120100000000002+0j) [Z0] +
    (0.17120100000000002+0j) [Z0 Z1] +
    (0.165868+0j) [Z0 Z1 Z2] +
    (0.165868+0j) [Z0 Z1 Z2 Z3] +
    (0.12054625+0j) [Z0 Z2] +
    (0.12054625+0j) [Z0 Z2 Z3] +
    (0.16862325+0j) [Z1] +
    (-0.22279649999999998+0j) [Z1 Z2 Z3] +
    (0.17434925+0j) [Z1 Z3] +
    (-0.22279649999999998+0j) [Z2]
    """
    obs_of = create_quantum_operator_from_openfermion_text(open_fermion_text)
 Listing 9: An example Python program that generates and evaluates observables.
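The OpenFermion text shown at the end of Listing 9 has a simple line structure: a complex coefficient in parentheses followed by a bracketed list of Pauli letters with qubit indices. The sketch below parses that structure with a regular expression. It is inferred only from the sample in Listing 9, the function name `parse_openfermion_text` is hypothetical, and it is not how create_quantum_operator_from_openfermion_text is implemented.

```python
import re

# "(coefficient) [P0 P1 ...]" where the bracket may be empty (identity term)
TERM_PATTERN = re.compile(r"\(([^)]+)\)\s*\[([^\]]*)\]")

def parse_openfermion_text(text):
    """Return a list of (coefficient, [(pauli_letter, qubit_index), ...]) tuples."""
    terms = []
    for coef_text, pauli_text in TERM_PATTERN.findall(text):
        coef = complex(coef_text.replace(" ", ""))
        paulis = [(token[0], int(token[1:])) for token in pauli_text.split()]
        terms.append((coef, paulis))
    return terms

sample = """
(-0.8126100000000005+0j) [] +
(0.04532175+0j) [X0 Z1 X2] +
(0.16862325+0j) [Z1]
"""
terms = parse_openfermion_text(sample)
```

An empty bracket yields an empty Pauli list, which corresponds to a constant (identity) term of the observable.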

 5 Optimizations

In this section, we discuss possible performance bottlenecks in quantum simulation and several optimization techniques for different computing devices.

 5.1 Background

Since an application of a quantum gate is a large number of simple iterations, the time for circuit simulation can be roughly estimated as the sum of constant-time overheads for invoking functions and the times for processing arithmetic and memory operations. Several factors, such as the overhead of calling C++ functions via Python interfaces, functions with parallelization, and GPU kernels, incur additional overheads in computation. These additional overheads are quantitatively discussed later in Sec. 6. The times for processing arithmetic and memory operations are determined by the number of operations divided by throughput. For the time for arithmetic operations, the number of operations that can be processed per second is vital, which is known as floating-point operations per second (FLOPS). To process complex numbers in the CPU, we need to load complex numbers representing quantum states from the CPU cache or main RAM to the CPU registers. The size of data per unit time that we can transfer between a memory and a processor is called the bandwidth of the memory. Let the time for the additional overheads be $T_{\mathrm{over}}$, the number of arithmetic operations be $N_{\mathrm{com}}$, the number of memory operations be $N_{\mathrm{mem}}$, the FLOPS of the CPU be $V_{\mathrm{FLOPS}}$, and the memory bandwidth, i.e., the number of complex numbers which we can transfer per unit time, be $V_{\mathrm{BW}}$. $T_{\mathrm{over}}$ is determined by the design of a library, $N_{\mathrm{mem}}$ and $N_{\mathrm{com}}$ are determined by the quantum gate to apply, and $V_{\mathrm{FLOPS}}$ and $V_{\mathrm{BW}}$ are determined by the computing device. Then, an approximate total time for applying a quantum gate $T_{\mathrm{gate}}$ is lower bounded as

$$T_{\mathrm{gate}} \geq T_{\mathrm{over}} + \max\left(N_{\mathrm{com}}/V_{\mathrm{FLOPS}},\ N_{\mathrm{mem}}/V_{\mathrm{BW}}\right) \tag{10}$$

when computation and memory operations do not block each other. While actual processing is not necessarily simplified to this equation, estimation with this equation works well when we develop a quantum circuit simulator.

To develop a quantum circuit simulator satisfying the demands shown in Sec. 3, we can find several basic directions from this equation. In the case of a small number of qubits, $T_{\mathrm{over}}$ becomes a dominant factor. Thus, the pre- and post-processing for applying quantum gates should be minimized. In Qulacs, every core function is designed to minimize overheads. When the number of qubits $n$ increases, the values of $N_{\mathrm{com}}/V_{\mathrm{FLOPS}}$ and $N_{\mathrm{mem}}/V_{\mathrm{BW}}$ grow exponentially with $n$, and they become larger than the overhead $T_{\mathrm{over}}$. In this region, the values $N_{\mathrm{com}}/V_{\mathrm{FLOPS}}$ and $N_{\mathrm{mem}}/V_{\mathrm{BW}}$ should be minimized. If there are two ways to update quantum states, and if one has smaller $N_{\mathrm{com}}$ and $N_{\mathrm{mem}}$ than the other, the first one should be chosen to minimize $T_{\mathrm{gate}}$. To this end, Qulacs provides several specialized update functions that utilize the structure of gate matrices as introduced in Sec. 4. In this section, we show four additional techniques to minimize $T_{\mathrm{gate}}$ when $N_{\mathrm{com}}/V_{\mathrm{FLOPS}}$ and $N_{\mathrm{mem}}/V_{\mathrm{BW}}$ are dominant: SIMD optimization, multi-threading with OpenMP, quantum circuit optimization, and GPU acceleration.

 5.2 SIMD optimization

Recent processors support SIMD (single-instruction, multiple data) instructions, which can apply the same operation to multiple data simultaneously. Qulacs utilizes instructions named Intel AVX2 [61], in which up to 256-bit data can be processed simultaneously. When a quantum state is represented as an array of double-precision real values, we can load, store, and process four real values (i.e., two complex numbers) simultaneously. Thus, the use of AVX2 can reduce the number of instructions $N_{\mathrm{com}}$ by a factor of four at most. In Qulacs, several update functions are optimized with AVX2 instructions by hand. When Qulacs is being installed, the installer checks if the system supports this feature. If it is supported, the library is built with AVX2 instructions enabled.

Here, we discuss our SIMD optimization techniques using dense matrix gates. However, it is worth noting that our techniques are applicable to the other quantum gates. The naive implementation of dense matrix gates is shown in Listing 4. We implemented two SIMD versions of the B1-loop. When all the indices of the target qubits are large, it is not possible to vectorize the loop with SIMD as it is, because the state vector elements required in one iteration of the B0-loop are scattered across non-contiguous memory locations. However, since the value of the adjacent memory location is always loaded in the next iteration, we unroll the B0-loop according to the number of target qubits to enable AVX2's SIMD load/store operations. On the other hand, when there is a target qubit with a small index, we can vectorize the loop without such an unrolling because the required state vector elements are already adjacent. Note that the overhead of enumerating B0 and B1 is not negligible when the number of target qubits $m$ is small. We reduce the cost of the enumeration of B0 and B1 as follows: instead of listing all the items in B0 beforehand, the $i$-th element of B0 is computed from the index $i$ in each iteration using bit-wise operation techniques. In contrast, the list of B1 is computed before the B0-loop since the size of B1 is typically small. This implementation is useful for parallelizing iterations with multi-threading by OpenMP, which is explained in Sec. 5.3. When a gate matrix has a structure and basic gates other than a dense matrix can be utilized, an updated state vector can be calculated without matrix-vector multiplication, and thus a different optimization is applied.

 5.3 Multi-threading with OpenMP

Since recent CPUs contain multiple processing cores, executing iterations in update and evaluation functions in parallel can increase the effective instruction throughput $V_{\mathrm{FLOPS}}$. The use of multiple cores is effective particularly when FLOPS is a performance bottleneck (compute-bound) and each core has a certain amount of workload. Qulacs parallelizes the execution of update functions using OpenMP directives [62]. The number of threads used in these functions can be controlled with the environment variable OMP_NUM_THREADS. The naive implementation shown in Listing 4 consists of two loops: the B0-loop and the B1-loop. In Qulacs, the B0-loop is parallelized to maximize the amount of workload for each core and to minimize the overhead incurred by multi-threading. Specifically, the parallelized loop iterates over $[0, 2^{n-m}-1]$ and the iteration space is chunked into $T$ chunks, where $T$ is the number of threads. In the loop body, the $i$-th element of B0 is computed on-the-fly from the loop index, which enables the even distribution of workload across threads. In contrast, the list of B1 is created before executing the parallel loop. While the data of B1 is accessed from every thread, its overhead is expected to be low because B1 is not updated during the loop and its size $|B_1| = 2^m$ is typically small enough to store within CPU registers or caches. A buffer for a temporary state vector (temp_vector in Listing 4) is a thread-local array that can be read and written by each thread independently. Thus, a buffer space for $T \times 2^m$ data is allocated before the B0-loop, so that each $2^m$ block can be used by a corresponding thread, and is deallocated at the end of the parallel loop. Note that when $n$ and $m$ are small, the amount of workload for each thread becomes small. In that case, the overhead due to multi-threading becomes larger than the speed-up by multi-threading. Therefore, Qulacs automatically disables multi-threading if the number of qubits $n$ is smaller than a threshold value even when OMP_NUM_THREADS is set to 2 or more. This threshold value is empirically determined according to the value of $m$ and the type of basic gate.

 5.4 Circuit optimization

When SIMD and multi-threading with OpenMP are enabled, the time for processing arithmetic operations $N_{\mathrm{com}}/V_{\mathrm{FLOPS}}$ becomes smaller than that for processing memory operations $N_{\mathrm{mem}}/V_{\mathrm{BW}}$, and the computing time $T_{\mathrm{gate}}$ is de-
