Scaling Up Silicon Photonic-based Accelerators: Challenges and Opportunities, and Roadmapping with Silicon Photonics 2.0
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Scaling Up Silicon Photonic-based Accelerators: Challenges and Opportunities, and Roadmapping with Silicon Photonics 2.0 M. A. Al-Qadasi,1 L. Chrostowski,1 B. J. Shastri,2, 3 and S. Shekhar1 1) The University of British Columbia,Vancouver, BC V6T 1Z4, Canada. 2) Queen’s University, Kingston, ON K7L 3N6, Canada. 3) Vector Institute, Toronto, ON M5G 1M1, Canada. (Dated: 09 September 2021) Digital accelerators in the latest generation of CMOS processes support multiply and accumulate (MAC) operations at energy efficiencies spanning 10-to-100 fJ/Op. But the operating speed for such MAC operations are often limited to a few hundreds of MHz. Optical or optoelectronic MAC operations on today’s SOI-based silicon photonic integrated circuit platforms can be realized at a speed of tens of GHz, leading to much lower latency and higher throughput. In this arXiv:2109.08025v2 [cs.ET] 18 Sep 2021 paper, we study the energy efficiency of integrated silicon photonic MAC circuits based on Mach-Zehnder modulators and microring resonators. We describe the bounds on energy efficiency and scaling limits for N × N optical networks with today’s technology, based on the optical and electrical link budget. We also describe research directions that can overcome the current limitations. The next generation of silicon photonics, which we term Silicon Photonics 2.0, promises many exciting opportunities to enable the next frontier in optical computing and communication. I. INTRODUCTION: speeds. For tasks where memory access is not the bot- tleneck, such as inference with fixed weights, an opti- Vector matrix multiplication operations represents the core cal implementation can reduce latency and improve the of artificial neural networks (ANNs) and other comput- energy efficiency at such high speeds. Digital CMOS ing applications of hardware accelerators. ANNs are re- counterparts, on the hand, are limited by the clock fre- alized in digital complementary metal–oxide-semiconductor quency. (CMOS) circuits with multiple processing elements imple- menting multiply and accumulate (MAC) operations which 2. Multiplication operations can be intrinsically imple- calculates the product of two numbers and adds the result to mented in parallel in which the analog nature of the an accumulator1 . The processing elements can be arranged computation allows all matrix operations to take place in a systolic architecture, where data is passed through con- at the same time for each input fetch18 . Therefore, opti- nected processing elements in a rhythmic sequence, to per- cal MAC implementations can increase the throughput form MACs either spatially or temporally over several clock and improve the energy efficiency at such high speeds. cycles2 . MACs implemented with digital circuits in CMOS are Integrated silicon photonics (SiP) circuits have been pop- limited by the wiring interconnect density. ularly employed in high-speed links to move data at a rate There are also some challenges of implementing MACs us- of tens of Gb/s, where optical modulation is more efficient ing SiP: than electronic switching for transmitting data over signif- icant distances3 . Optical modulation can be realized using 1. The losses in optical circuits severely limit the size of Mach Zehnder modulators (MZMs) or microring modulators the MAC networks that can be physically realized, in (MRMs)4 . MZMs are broadband and easily support complex comparison to a digital CMOS implementation where modulation schemes3 . MRMs have significantly smaller foot- signal gain and regeneration is easily available. This, in print and driver power consumption5 . As a technology, the turn, limits the applications of SiP MACs. current generation of SiP has now matured with high volume shipments for datacenter transceivers from companies such as 2. Although the power consumed in the optical MAC is Intel and Cisco. SiP circuits comprising of Mach Zehnder small, when accounting for the losses and the power interferometers (MZIs) or microring resonators (MRRs) have consumed by the laser and CMOS electronic circuits been used also for other applications such as high-speed opti- that drive and control the optical circuits, the energy ef- cal switches and filters6–10 . ficiency is degraded. SiP is also being used for computing applications11–16 , where devices such as MZIs, MZMs, MRRs, and MRMs are 3. Unlike digital CMOS implementations which can sup- used for computation in optical analog domain. Compared to port 16b/32b resolution, analog photonic MACs have digital CMOS, implementing MACs using SiP has two dis- a maximum demonstrated resolution of 8.5b19 . Never- tinct advantages17 : theless, such a resolution has been shown to be adequate for many inference tasks20–22 . 1. Optical MAC operations can be scaled to frequencies at tens of GHz, whereas MACs in digital CMOS are lim- 4. Achieving a high throughput optical network entails ac- ited to a few hundreds of MHz or at most GHz operating cessing data at high speed. High-speed input/output
2 (I/O) data can be streamed in/out from/to an off- In other words, the weight matrix performs linear trans- chip DRAM; the corresponding energy consumption formation for the input vector, VN×1 , and delivers N out- in moving the data must be considered for the overall puts, YN×1 , that are routed to an array of photodetectors implementation of an accelerator23 . The energy con- (PDs). These PDs are then connected to the electrical com- sumption in data fetch from off-chip DRAM is signifi- ponents, including TIAs, main amplifiers, and sense ampli- cantly large. However, for a given dataset, the DRAM fiers based comparators, which are either built in a separate associated penalty is similar for both optical and dig- CMOS/BiCMOS chip, or monolithically integrated with the ital CMOS implementations. We exclude that penalty SiP devices in the same process. in our work. For weight-stationary implementations Singular value decomposition (SVD) is an effective ap- where weights do not need to change frequently, an on- proach to represent a given matrix as a factorization of multi- chip SRAM can be used, which can be adequately large ple matrices27,28 . SVD decomposes a real matrix into a prod- due to the limited network size of the photonic acceler- uct of unitary matrices and a diagonal matrix. This is useful ators. in the experimental realization of an N × N matrix topology in which the sequential product of rotation matrices represents 5. To carry out optoelectronic computing with several bits the sequential arrangement of linear transformation units in of resolutions at high speed, CMOS or biCMOS drivers the overall matrix grid29 . A 2 × 2 linear transformation unit in and transimpedance amplifiers (TIAs) are needed that the whole grid arrangement is practically implemented using must operate with multi-level pulse-amplitude modula- a tunable beam splitter (TBS) as seen in Fig. 16,30 . A TBS tion (PAM) signaling. Although many PAM2 (1b) and comprises of an MZI with a phase shifter (θ ) in at least one PAM4 (2b) transceivers have been demonstrated24,25 , of the arms, along with either an outer phase shifter (φ ) or a higher levels of modulation require linear drivers and tunable directional coupler31,32 . The transfer function for a TIAs which are challenging to design at high speed and single TBS can be described using the matrix representations good energy efficiency3 .However, if data rates in opti- for ideal 50:50 beam splitters and lossless phase shifters as: cal computing are limited to a few tens of GBaud, this challenge is surmountable. 1 1 i eiθ 0 1 i eiφ 0 TBS = 6. Packaging considerations in optics (e.g., laser, fiber and 2 i 1 0 1 i 1 0 1 iφ (2) SOA attach) are far more challenging than the packag- θ i θ2 e sin( 2 ) cos( 2 ) θ ing considerations for electronic dies due to alignment = ie eiφ cos( θ2 ) −sin( θ2 ) accuracy and thermal management requirements26 . Thus, any arbitrary light redistribution can be obtained by In this paper, we describe the advantages and challenges of changing θ and φ . implementing MACs using SiP, and comment on how to ad- The SVD decomposition of the weight matrix can be de- dress the first two challenges described above. The paper is scribed as: organized as follows: Section II and III describe the link bud- get, energy efficiency and scaling opportunities for SiP MACs W = D ∏ Tm,n (3) implemented with MZM and MRM, respectively. Section IV m,n explores the possible approaches to further improve SiP MAC systems. It introduces the ongoing research in the field of SiP where D is a diagonal matrix, and Tm,n represents the transfor- that when fully realized, will lead to significant changes in the mation matrix for a 2 × 2 node between two input terminals, field of optical computing and communication. We term that m and n, within the N × N multiplication matrix30 , given as: next-generation technology as Silicon Photonics 2.0. Section V concludes the paper. 1 0 ... 0 0 0 1 . . . 0 eiφ sin( θ2 ) cos( θ2 ) II. MZM BASED SI-PHOTONIC IMPLEMENTATION i 2 . θ . Tm,n = ie eiφ cos( θ2 ) −sin( θ2 ) m,n (4) . . . . . . A. System Architecture 0 . 1 0 0 0 ... 0 1 Fig. 1 illustrates an MZM-based SiP implementation of an optical accelerator. Using off-chip lasers, light is guided by a polarization-maintaining (PM) SMF fiber, gets coupled to B. Optical Network Link Budget the SiP chip via an edge coupler and then split to N parts. These parts are modulated by an array of N MZI modulators and fed into an N × N weight transformation (multiplication) The photonic components shown in Fig. 1 are simulated in matrix, WN×N . The optical intensities at the MAC outputs, Cadence Spectre for 8 × 8 and 32 × 32 transformation matrix YN×1 , are thus described by the multiplication product of the sizes to verify the optical link budget analysis. The optical input vector, VN×1 and the weight matrix, WN×N , as: components are modeled in Verilog-A to enable electronics- photonics co-simulation33,34 . The models of some of the com- YN×1 = WN×N VN×1 (1) ponents used in this paper are derived from34 , with some mod-
3 FIG. 1. SiP circuit diagram of an N × N MZM-based accelerator. High-speed PN phase shifters and low-speed thermo-optic phase shifters are colored in blue and red, respectively. ifications to account for the laser electrical power consump- for analysis. tion, the wall plug efficiency, ηW PE , link losses, etc. The opti- To study the optical attenuation of the 8 × 8 MZM imple- cal power of laser is set to 0 dBm to estimate the optical power mentation shown in Fig. 1 and verify its functionality, all in- at the output terminals of the network, incident on the PDs. puts, except for the uppermost terminal (m = 1), are driven by Coupling light to the SiP chip introduces loss in the range Vπ voltage that creates a π phase shift difference between their of 0.6-to-3 dB based on the coupling scheme used35,36 . The MZM’s arms and null their outputs. As light is set to propa- overall coupling loss from the laser to the SiP chip is collec- gate through the uppermost input terminal, the optical depth is tively estimated as 1.6 dB considering possible optimizations defined by the route passing through the diagonal TBS nodes in the coupling efficiency. 1.6 dB is also a realistic estimate for with i = j. photonic wire bonds (PWBs), an emerging (SiP 2.0) technol- For a rectangular mesh arrangement, the matrix optical ogy which involves writing three-dimensional waveguides in depth is equal to N with a total number of N(N−1) optical 2 a photosensitive polymer. PWBs have demonstrated efficient crossings29,30 . Hence, the 8 × 8 matrix implementation shown interfacing between the external sources to the silicon waveg- in Fig. 1, has an optical depth of 8 with 28 crossings. uide with coupling losses as low as 0.4 dB up to 1.7 dB37–41 . The total optical link budgets are calculated based on (5), For an input vector size of N, light passes through log2 N where PSMF−att , PEC−IL , PSi−att , Psplitter−IL,EL , PPS−IL , PDC−IL splitters before modulation, resulting in a total insertion loss and Ppenalty represent the attenuation introduced by the SMF of 10log10 N + ELsplitter · log2 N dB, where ELsplitter is the es- fiber, fiber to chip coupling loss, silicon waveguide attenua- timated excess loss for a single splitter and ranges between tion, splitter insertion and excess loss, phase shifters’ inser- 0.01 dB to 0.5 dB42–46 . The estimated excess loss for beam tion loss, total adiabatic coupling insertion loss and network splitters and combiners in this study is 0.01 dB. The over- penalty, respectively. The network penalty takes into account all attenuation in the silicon waveguide is a function of the further impairments due to extinction ratio, cross talk, inter- depth of the network. The MZM based implementation is symbol interference (ISI) and laser RIN4 . based on Clement’s arrangement which is composed of beam splitters and phase shifters that can be programmed to imple- PO/p (dBm) = Plaser − PSMF−att − PEC−IL ment linear transformation30 . With an MZM representing one node in Clements arrangement30 , the waveguide attenuation − PSi−att − Psplitter−IL,EL (5) can be approximated as Nηwg LMZI , where ηwg represents the − NPPS−IL − NPDC−IL − Ppenalty optical intensity attenuation in the Si waveguide and LMZI is the length of an MZM arm, with chosen values of 3 dB/cm Fig. 2 shows the calculated optical power throughout N × N and 0.5 mm respectively. The insertion loss introduced by an networks with different inputs vector sizes. Besides the atten- MZM’s PN phase shifter is approximated as 1 dB/mm44 . uation due to the splitting, it can be noticed that the losses The insertion loss through a node is dependent on the ex- introduced by the optical components in the multiplication cess loss for cross and through states, ELcross and ELthru , re- matrix (i.e. directional couplers and phase shifters) pose a spectively, both of which are typically
4 0 0 TABLE I. SNR Calculation Parameters n i/p =6 b -10 n i/p =5 b -10 Parameter Description Value Optical Power (dBm) Optical Power (dBm) n i/p =4 b n i/p =3 b n i/p =2 b Plaser Laser Power Intensity 10 dBm -20 n i/p =1 b -20 R PD responsivity 1 A/W 49 -30 -30 RL Load Resistance 50 Ω Id Dark Current 35 nA49 -40 N=8 -40 T Absolute Temperature 300 K -50 N=16 -50 DR Data Rate 10 GS/s N=32 Bo Optical Bandwidth 25√GHz -60 N=64 -60 N=128 Be Electrical Bandwidth DR/ 2 GHz -70 -70 λ Wavelength 1550 nm RIN Relative Intensity Noise −140 dB/Hz50,51 EC G r F lty C r ZM se te SM W D na lit 10% La M W PE Wall Plug E f f iciency Si Sp Pe I/p Stage Within Optical Network enhancement due to an SOA. Assuming an SOA introducing a FIG. 2. Optical link budget analysis for N × N matrix sizes N=8, 16, gain of ηSOA (in dB), the corresponding enhancement is ρSOA = ηSOA 32, 64, 128 for R = 1.2 A/W and DR = 10 GS/s. The blue dotted 10 10 . The use of an SOA is discussed later in Section IV, but lines illustrate the optical intensities required at the output AFE to it can be inferred from Eq. (6) that the use of an SOA always detect a signal with a resolution of ni/p . degrades the overall energy efficiency. The amount of power dissipated by the laser is represented in terms of the laser’s wall plug efficiency, ηW PE , the optical front-end (AFE) to detect a signal with a resolution of ni/p insertion losses introduced by the SMF fiber, ILSMF , the fiber bit. This is obtained by representing the desired output signal to chip coupling ILEC , the silicon waveguide loss, ILW G , the and current noises in terms of the received optical intensity as input MZM loss, ILi/p−MZM , the weight phase shifter loss, given in Eq. (8). Fig. 2 is plotted with an optical power of ILweight−PS , the directional coupler loss, ILDC , as well as the laser set to 0 dBm. For a laser with a maximum optical in- receiver’s PD sensitivity, PPD−opt , as given in Eq. (7). tensity of 10 dBm, it can be shown that the maximum achiev- ILWG[dB] N(LMZI ) able matrix size is ∼ 35 × 35 for binary networks operating 10 10 N at DR = 10 GS/s. We revisit this calculation again in Sec- Plaser = ILSMF ILEC (ELsplitter ) 2 N ILi/p-MZM (ILPS )2N log (7) tion II D. PPD-opt × ηWPE (ILDC )2N ILpenalty C. Energy Efficiency The total length of the waveguide was roughly approxi- mated as the length spanning the optical depth of the matrix The total electrical power dissipation of the whole network only. The output sensitivity is solved based on the targeted bit comprises of the power consumed by the laser, input modula- resolution, ni/p as well as the total noise at the output front- tors, thermo-optic tuning of the matrix phase-shifters, and the end due to the photodetector shot noise, dark current Id , ther- AFE including PDs. Accordingly, for a configuration of size mal noise and laser relative intensity noise (RIN) as given in N × N operating at a data rate of DR, the energy efficiency Eq. (8), with values reported in Table I. The parameters R, RL , (J/Op) can be calculated as: k and T represent the PD responsivity, output load resistance, Boltzmann’s constant and the absolute temperature. Plaser NPi/p−drivers + 2Pmem−inter f ace The MZM drivers consume power that scales linearly with E(J/Op) = 2 + γ · 2N · DR 2N 2 · DR N as Pmod−driver = N · DR · EMZM−driver , where EMZM−driver 2(N − 1)Pmat−tuning + PSOA + Po/p−AFE represents the energy efficiency of the MZM driver. For bi- + nary resolution, the power consumed by the drivers and AFEs 2N · DR (6) are extracted from recent work on PAM2, and is typically in where Plaser , Pi/p−drivers , Pmem−inter f ace , Pmat−tuning and the range of ∼2pJ/b52 . Po/p−AFE represent the electrical power dissipated due to the The matrix weights are tuned using thermo-optic phase laser, input modulator drivers, data fetch interfacing circuits, shifters (TO-PS). Doped Si heaters on SOI platform typi- matrix tuning and the output AFE circuits, respectively. PSOA cally dissipate about ∼ 20 mW for a π-shift6,53,54 . The ef- represent the electrical power dissipated if a semiconductor ficiency of TO-PS can be improved using other heater ma- optical amplifier (SOA) is used to recover the loss. terials such as TiN, substrate undercut to improve insula- The factor γ refers to the energy efficiency enhancement. tion and deep trenches to reduce thermal cross-talk53,55 . As- 2ρ It can be represented as γ = ρopt suming uniformly distributed weights, the expected energy SOA where ρopt represents the energy scaling due to the loss of precision factor, and will be consumption of the thermo-optic phase shifter is ET O−PS = 1 R Pπ Pπ described later in Section II D. ρSOA represents the efficiency Pπ 0 Pheater dPheater = 2 , where Pπ denotes the amount of
5 " ! # 1 R×PPD−opt i√ ni/p = 6.02 20log10 hq q √ − 1.76 (8) 2q(RPPD−opt +Id )+ 4kT R +R 2 P2 PD−opt RIN+ 2qId + 4kT R DR/ 2 L L tuning power is N(N−1)4 Pπ . After optical processing, the optical data needs to be con- verted to the electrical domain to be processed, stored or reused in other networks. Efficient opto-electronic receivers, comprising of a PD, TIA and main amplifiers have been shown to have energy efficiencies ∼ 0.4 − 2.4 pJ/b56–61 . For an AFE operating at 10 Gb/s and realized in 40 nm CMOS technology, an energy efficiency of 0.4 pJ/b56 is assumed for the calculation of binary resolution AFE, which scales with a factor of N for the whole output array. Higher AFE res- olutions entail the use of linear TIAs along with analog to digital converter (ADC) circuits to recover the digital data. High speed linear TIAs have shown efficiencies as low as 0.6 pJ/b62 . The energy consumption for ADCs is extracted from the energy per conversion figure of merit (FOM) such that EADC (J/b) = 2N × FOM. The energy consumption val- ues used in this work for 2b, 3b and 4b are 1.7 pJ/b, 3.1 pJ/b and 5.7 pJ/b based on a FOM of 0.335 pJ/conversion for an ADC designed to operate at a sampling rate of 28 GS/s63 . Providing high-speed serial inputs to the SiP accelerator FIG. 3. Input data fetch for a SiP implementation with ni/p = 4b. requires FIFOs and multiplexers to interface the data trans- FIFO and serializers are used to address the high-speed throughput fer with DRAM, as shown in Fig. 3. For a fair compari- requirement of SiP accelerators. son to digital CMOS implementations, the power dissipation for both input and output interfacing circuits, represented by Pmem−inter f ace is taken into consideration in the energy effi- ciency calculation of SiP implementations. The power dissi- pated by the FIFO, multiplexers, clock dividers and retimers is estimated as 5.77 mW in 28 nm CMOS based on the data reported in 64 in 180 nm CMOS technology. D. Efficiency Tradeoff Factors It can be inferred from Eq. (7) that the required input laser power increases as a function of N as a result of the exponen- tially increasing optical losses in the MZM based accelerator. But the total energy efficiency Eq. (6) starts improving as N scales up due to the quadratic increase in the number of accel- erated operations performed by the optical matrix as shown in Fig. 4. Taking all optical losses into account shows that there is a scaling limit beyond which optical losses grow signifi- cantly and the overall efficiency drops and an optimal network FIG. 4. Total energy efficiency (pJ/Op) for an N × N MZM imple- size exists for minimum energy efficiency. Unfortunately, the mentation with PD responsivity R = 1.2 A/W and binary resolution, maximum network scale, Nltd , is limited by the rated output ni/p = 1b. Energy efficiency improves as the network scales up due optical power of the laser and the SNR required for any given to the increased number of operations. Improving the tuning effi- signal resolution, ni/p ≥ 1b, as given in Eq. (8) and illustrated ciency with insulation has significant enhancement on the overall en- in Fig. 2. ergy efficiency. Fig. 5 shows the total energy efficiency and scaling limit for various input resolutions considering thermo-optic phase shifters with and without insulation. The energy efficien- electrical power required to create a phase shift of π. Calcu- cies in Fig. 5 are calculated for accelerators to be operated lating for all the nodes in Clement’s topology, the total average at binary and higher resolution, ni/p = {1, 2, 3, 4}b. Although
6 TABLE II. MZM-based Implementation Characteristics ηWPE ILSMF ILEC ILWG ELSplitter ILMZI LMZI ILDC 2000 70 E(fJ/Op):R=0.7 A/W, opt =1, TOPS [dB] [dB] [dB/mm] [dB] [dB/mm] [mm] [dB] E(fJ/Op)R=1.2 A/W, opt =1, TOPS 0.1 0 1.6 0.3 0.01 1 0.5 0.01 1800 E(fJ/Op)R=0.7 A/W, opt = N, TOPS E(fJ/Op)R=1.2 A/W, opt = N, TOPS E(fJ/Op):R=0.7 A/W, =1, TOPS w/ Insulation 60 opt E(fJ/Op)R=1.2 A/W, =1, TOPS w/ Insulation 1600 E(fJ/Op)R=0.7 A/W, opt = N, TOPS w/ Insulation the probability of transition reduces for multilevel signaling5 , E(fJ/Op)R=1.2 A/W, opt opt = N, TOPS w/ Insulation the requirement on driver’s linearity or segmentation also in- Nltd :R=0.7 A/W 50 1400 Nltd :R=1.2 A/W creases. Furthermore, the energy consumed in the serializing and clocking remains the same5 . Thus, we assume similar en- Network Scale (N) ergy efficiency for multilevel signaling as PAM2, ∼ 2pJ/b, 1200 for MZM modulators25 . For binary resolution, the power con- 40 E(fJ/Op) sumed by the drivers and AFEs are extracted from recent work 1000 on PAM2 transceivers5,25,62,65 . Therefore, the energy effi- ciency for 2b, 3b and 4b input MZM drivers are estimated 30 in our calculation as ∼ 4pJ, 6pJ and 8pJ per symbol, respec- 800 tively. It can be concluded from Fig. 5 that opting for PDs with 600 20 higher responsivities improves the energy efficiency of the network. This compensates for the optical system loss, relaxes 400 the need to inject high optical power at the network input and improves the overall energy efficiency. Utilizing avalanche 10 PDs (APDs) is a possible way to significantly improve the op- 200 tical sensitivity66 . Fig. 5 also suggests that taking advantage of the loss of precision, when possible, shows minor improve- 0 0 ment in the energy efficiency and the network scaling. 1 2 3 4 Considering a matrix that scales with N, the laser optical in- ni/p (bit) tensity should be typically scaled by a factor of N to account for the splitting loss in a lossless network. For mesh-like con- figurations similar to Fig. 1, scaling the input vector size in- FIG. 5. Total energy efficiency and scaling limit of an MZM based creases the dynamic range of the output intensities. In other network with an input resolution of ni/p = {1, 2, 3, 4}b considering words, for a given matrix output, the intensity can be as low as thermo-optic phase shifters with and without insulation at the scaling that of a single input or as high as N times that amount. With limit, Nltd , where the laser rated power output (10 dBm) is reached. the input’s digital resolution being ni/p , the effective overall Implementing networks with higher resolution requires scaling down output resolution due to the network scaling is ni/p + log2 N. the network to improve the SNR at the AFE. This leads to a degrada- tion of the energy efficiency due to the reduced number of operations. The conservative estimate of scaling the input power by N may not be necessary in some computational context such as convolutional neural network (CNN) layers with adapt- able hidden layer resolutions67 ; the increased output resolu- R∞ 1 −u2 /2 where the Q function is defined as Q(x) = √ e 2π du, tion might be higher than that needed by the AFE to detect. x Therefore, an energy scaling vs. loss of precision trade-off and iirn represents the total input referred noise at the AFE factor, ρopt , can be introduced to take advantage of the net- input with contributions from the PD, TIA, main amplifiers work scaling,68,69 , as illustrated in Fig. 6. Full accuracy is and comparators (if applicable). Therefore, the laser power, described by ρopt = 1 corresponding to reduced output pre- Plaser = NPopt−o/p , can be traded off for loss of output resolu- cision of log2 (ρopt ) = 0, at which the input optical intensity tion. is scaled by N. Generally, for log2 (ρopt ) bit reduction, the The IL difference between the bar and cross states of a tun- input is scaled by N/ρopt . Therefore, the maximum amount able beam splitter impacts the interference between the nodes of energy saving is achieved when the log2 N bit reduction is in the mesh. For an MZM with intensity loss of α1 in one tolerable at the optical output (AFE input). arm and α2 in the other, it can be shown that the output in- To get a meaningful sense of the trade-off between the en- tensity √ at the cross state is given by Icross = Iin1 [α1 + α2 + ergy scaling and the loss of precision, ρopt is quantified in 2 α1 α2 cos(θ2 )] when Iin2=0 , where Iin1 and Iin2 represent Eq. (9) in terms of the probability of bit errors for binary net- the MZM input intensities at its input ports 1 and 2, respec- works (networks with binary weights) at the output such that: tively. To get the transmission response of an MZM with equal losses, α1 on both arms, θ should be modified such that cos(θ ) = ∆α+2α√ 1 cos(θ1 ) in order to account for the loss Popt−o/p R 2 α1 α2 Proberror = Q (9) ρopt 2iirn difference between the two MZM arms, where ∆α = α1 − α2
7 arrays. Operating at ∼ 10× lower clock speeds further de- creases their throughput in comparison to optical implementa- tions. Assuming a number α of clock cycles needed for digital MAC, the latency of a digital systolic array with a 1 × N input vector size and a N × N matrix size is 2Nα/ fCLK−CMOS where fCLK−CMOS represent the clock frequency of the digital CMOS implementation. Hence, the throughput ratio of the optical to CMOS implementations is 2Nα fCLK−OPT / fCLK−CMOS , where fCLK−OPT represents the clock speed at which an optical im- plementation operates at. We also investigate the energy efficiency and network size at lower data rates in Fig. 7. Intuitively, the energy efficiency degrades at lower data rates because less number of opera- tions are conducted with respect to the dissipated static power. On the other hand, the network size, shown in Fig. 7, can be increased by making use of the SNR improvement at lower data rates, as inferred from Eq. (8). If maximizing the op- tical throughput is not an overarching goal, opting for lower data rates (relative to 10 GS/s) leads to larger networks while not sacrificing much on the energy efficiency in implementa- tions incorporating phase shifters with insulation (which do not consume much static power as shown by the dotted curves FIG. 6. The loss of precision factor for a case when 4 input signals in Fig. 7 (a)). (2b digital resolution) combine to form the output Pout1 . Scaling down the input optical power by N = 4 can be achieved at the expense of reducing the digital precision at the output by log2 4 = 2b. and θ1 describes the phase shift when both arms have atten- III. MRR BASED SI-PHOTONIC IMPLEMENTATION uation of α1 . This difference in insertion losses can be ob- served in single-arm beam splitters in which phase shifters A. System Architecture are controlled by a single arm only. Using dual-arm tunable beam splitters, where θ is implemented differentially using phase shifters on both arms, introduces equal insertion losses The SiP MRR-based implementation of an optical acceler- for the bar and cross transmissions of each node. A dummy ator is illustrated in Fig. 8. A comb CW laser source is used phase shifter can also be used in single-arm topologies to ob- to provide wavelengths λ1 through λn which are coupled into tain equal losses. the chip and then modulated by an array of N MRMs. Unlike For sake of comparison, the energy consumption for a dig- mesh-like topologies, implementing vector matrix multiplica- ital MAC is estimated based on the energy consumed by tion in the form of dot products has the advantage of maintain- multiplication and accumulation operations as well as regis- ing equal path loss for all the outputs. The modulated input ter file access in a 28 nm CMOS implementation70 . With vector, Xi/p (λ ), is then split (broadcast) into N branches to an estimated energy consumption 0.046 pJ for 8b MAC and be modulated by the weight bank arrays72 ; each output repre- 0.0117 pJ for register file access, the calculated energy con- sents the dot product of the input vector and one of the row sumption is ∼ (0.046 pJ+0.0117 pJ)/2=28.85 fJ for a single arrays of the weight matrix. In order to achieve weights with operation. Conversely, it can be observed from Fig. 5 that SiP positive and negative polarities, the thru and drop transmis- networks based on MZMs need to be scaled down to achieve sions of the weight arrays are routed to balanced PDs in a higher resolutions which further degrades their energy effi- push-pull configuration at the receiver. The current difference ciency. Compared to their 8b digital CMOS counterpart, the at the output, YO/p , can be represented as14 : energy efficiencies for SiP MZM MACs (using low-power thermo-optic phase shifters with insulation and 1.2 A/W71 PD responsivity) at ni/p = 1b to 4b resolutions are 3.5× to Z ∞ 17.5× worse. Despite the lower energy efficiency, MZM- YO/p = |E0 (λ )|2 Xi/p (λ )Wdt (λ )R(λ )dλ (10) −∞ based MAC operations are performed at 2N× higher operat- ing speed and lower latency (for a weight-stationary systolic array with input vector size of N × 1) than the corresponding where E0 (λ ), Wdt (λ ) and R(λ ) represent the amplitude of the digital CMOS implementation. input optical field, the difference between the weight’s drop In addition, multiple clock cycles are needed for digital and thru intensity transmissions and the PD responsivity at a multipliers and adders to provide the MAC output in systolic wavelength λ , respectively.
8 FIG. 7. (a) Total energy efficiency and (b) MZM Network sizes for R = 1.2 A/W , DR = 1 − 10 GS/s. Operating at higher data rates helps reduce the contribution of the static power consumption of thermo-optic phase shifters to the energy efficiency. Network scales up at lower data rates due to the SNR improvement. FIG. 8. SiP circuit diagram of an N × N MRR-based accelerator B. Optical Network Link Budget where PMRM−I/p−IL represents the transmission insertion loss of the MRM for the input vector, PMRM−I/p−OBL represents Similar to Eq. (5), the optical link budget is calculated out of band insertion loss (OBL) of the MRM for the input based on Eq. (11): vector when the MRM resonance wavelength does not match the input vector wavelength, PMRR−W −IL represents transmis- PO/p (dBm) = Plaser − PSMF−att − PEC−IL − PSi−att sion insertion loss of the MRR for the weight vector, and PMRR−W −OBL represents out of band insertion loss of the MRR − PMRM−I/p−IL − (N − 1)PMRM−I/p−OBL for the weight vector. Other terms have been defined already − Psplitter−IL,EL − PMRR−W −IL − (N − 1)PMRR−W −OBL − Ppenalty (11)
9 be calculated as: Plaser NPi/p−drivers + 2Pmem−inter f ace E(J/Op) = + ρSOA · 2N 2 · DR 2N 2 · DR NPmat−tuning + PSOA + Po/p−AFE + 2N · DR (12) Plaser here represents the total electrical power consumed by the optical source (either a single comb laser source or mul- tiple sources generating all the desired input wavelengths). To better study the energy efficiency based on a targeted sig- nal resolution, ni/p , the power consumption of the input laser source is formulated as a function of the optical power reach- ing the PDs at the output AFE as given in (13). ηWG[dB] N(dMRR ) Stage Within Optical Network Plaser = 10 10 N ηSMF ηEC ILi/p-MRM (OBLMRM )N−1 (ELsplitter )log2 N (13) FIG. 9. Optical link budget analysis for matrix sizes, N=8, 16, 32, PPD-opt 64, 128 for R=1.2 A/W and DR=10 GS/s. The blue dotted lines × ηWPE ILweight-MRR (OBLweight-MRR )N−1 ILpenalty illustrate the optical intensities required at the output AFE to detect a signal with a resolution of ni/p . where dMRR represents the gap between the centers of two adja- cent microrings and is dictated by the thermal crosstalk which should be taken for design considerations. A dMRR of 15µm has been shown to be sufficient to avoid thermal crosstalk TABLE III. MRR-based Implementation Characteristics in a photonic switch implementation74 . We assume a dMRR of Parameter Value 20 µm in this work for an optimistic realization of the system ηWPE 0.1 with RMRR = 6 µm. ILSMF [dB] 0 Since the number of operations scale quadratically with the ILEC [dB] 1.6 weight vector size, N, while the energy consumption of the ILWG [dB/mm] 0.3 modulators and AFE increases linearly as can be seen in (12), ELSplitter [dB] 0.01 the overall energy efficiency improves with scaling as shown ILMRM [dB]73 4 in Fig. 10. We assume similar MRM energy efficiency for OBLMRM [dB] 0.01 multilevel signaling as PAM2, ∼ 0.3pJ/b,5 which takes into ILMRR [dB] 0.01 account both the contribution of the modulator driver as well dMRR [µm] 20 ILpenalty [dB] 4.8 as the serializers. Therefore, the energy efficiency for 2b, 3b and 4b input MRM drivers are estimated in our calculation as ∼ 0.6pJ, 0.9pJ and 1.2pJ per symbol, respectively. MRMs offer smaller footprint and lower input capacitance when describing Eq. (5). which leads to significant reduction in their driving power. However, they are also more sensitive to fabrication mismatch Fig. 9 shows the calculated optical power throughout an and thermal drift which entails the need to use heaters for cal- MRR-based implementation with different input vector sizes, ibration across a wide spectral range. Excluding heater power, N, using the values shown in Table III. Similar to MZM-based the power consumed by a closed loop controller implemented implementations, the attenuation introduced due to the split- for a low-power WDM topology is ∼ 0.2 mW 75 . The average ting and cascading of microrings significantly degrade the op- energy efficiency for state-of-the-art MRR heaters on an SOI tical power and pose a limitation on the energy efficiency as platform is ∼ 20 mW /π 7,76,77 . The scaling of the power con- will be discussed in Section III C. The blue dotted lines il- sumed by the heaters with O(N 2 ) degrades the overall energy lustrate the optical intensities required at the output AFE to efficiency of the network. Fig. 11 shows the total energy effi- detect a signal with a resolution of ni/p bit. For a laser with ciency and scaling limit for various input resolutions consid- a maximum optical intensity of 10 dBm, it can be shown that ering thermo-optic phase shifters with and without insulation. the maximum achievable matrix size is ∼ 85 × 85 for binary We assume power consumption values, Pheater , of 2.8 mW 77 networks. We revisit this calculation again in Section III D. and 40 mW 7 for phase shifters with and without insulation, respectively, to provide phase shift of one free spectral range (FSR). Driving inputs at high speeds entails the need to multiplex C. Energy Efficiency data fetched from the memory as shown in Fig. 3. As per the calculations in section II C, Pmem−inter f ace is taken as 5.77 mW The energy efficiency (J/Op) of the MRM-based imple- for the energy efficiency calculation of input or output mem- mentation with size N × N operating at a data rate of DR can ory interfacing circuits.
10 1600 120 E(fJ/Op):R=0.7 A/W, TOPS E(fJ/Op)R=1.2 A/W, TOPS E(fJ/Op):R=0.7 A/W, TOPS w/ Insulation 1400 E(fJ/Op)R=1.2 A/W, TOPS w/ Insulation Nltd :R=0.7 A/W 100 Nltd :R=1.2 A/W 1200 80 Network Scale (N) 1000 E(fJ/Op) 800 60 600 40 400 FIG. 10. Total energy efficiency (pJ/OP) for an N × N MRM im- 20 plementation with PD responsivity R = 1.2 A/W and binary resolu- 200 tion, ni/p = 1b. Energy efficiency improves as the network scales up due to the increased number of operations. Improving the tuning efficiency with insulation has significant enhancement on the overall 0 0 energy efficiency. 1 2 3 4 ni/p (bit) D. Scaling Limitations FIG. 11. Total energy efficiency and scaling limit of an MRM based network with an input resolution of ni/p = {1b, 2b, 3b, 4b} consider- It can be inferred from Fig. 11 that it is feasible to imple- ing thermo-optic phase shifters with and without insulation at the ment vector matrix multiplication using MRR based networks scaling limit, Nltd , where the laser rated power output (10 dBm) with sizes scaling up to N = 85 which is within the maximum is reached. Implementing networks with higher resolution requires number of microrings permitted for WDM implementations scaling down the network to improve the SNR at the AFE. This de- due to FSR limitations and crosstalk78 . Attempting to engi- grades the energy efficiency due to the reduced number of operations. neer the MRR’s dimensions and coupling ratio compromises the quality factor which degrades the channel spacings. This translates to a limit in the MRR vector size of N < FSR/∆λ . pling which increases the power consumption. Another at- Channel spacings are typically set according to the amount of tempt to achieve an FSR-free filter has been demonstrated us- acceptable crosstalk between channels (interchannel interfer- ing tunable couplers along with modified vernier filters that ence). As an example, for a 50 nm transmission window with use higher-order coupled MRRs83 . However, this topology channel spacings of 0.8 nm, the maximum number of channels is associated with penalty in terms of design complexity, in- is 6278,79 . creased footprint and tuning power. Using series coupling to increase the filter order has been Introducing contra-directional coupling (CDC) in a micror- experimentally shown to reduce both interchannel and intra- ing combines the wavelength selectivity of the CDC with the channel crosstalks, thus, maximizing the filter finesse and compact feature size of the MRR, thus reaping the advantages the channel count79 . However, this comes at the expense of of both and providing an FSR-free response84 . Implementing higher footprint, lower drop port transmission and extra tun- this design technique allows the potential use of several chan- ing power. To cascade several MRRs for MAC operations, nels in MRR-based accelerators. This comes with a tradeoff it is necessary to maintain channel spacings to avoid the ad- of using extra heaters in the CDC and in the region of the jacent weight-dependent cross-talk. The number of channels MRR that does not include corrugated structures. that can be supported by optimized MRRs with finesse of 368 As shown in Fig. 11, the optimum energy per operation of and 540 are calculated to be 108 and 148, respectively78,80,81 . binary SiP networks based on MRRs (∼ 75 fJ) is obtained at Two-point coupling scheme has been proposed to address N = 85 for PD responsivity of R = 1.2 A/W. It can also be the post-fabrication correction of MRM spectral features for shown that reducing the power consumption of weight tuning large-scale MRM implementations82 . Although it mitigates circuits by one order of magnitude improves the energy effi- the secondary resonances of an MRM and doubles the FSR, ciency by roughly one order of magnitude as well. Compared an extra micro-heater is introduced to correct for the cou- to their 8b digital CMOS counterpart, the energy efficiencies
11 180 ni/p =1b, TOPS ni/p =1b, TOPS ni/p =2b, TOPS ni/p =2b, TOPS 10 1 ni/p =3b, TOPS 160 ni/p =3b, TOPS ni/p =4b, TOPS ni/p =4b, TOPS ni/p =1b, TOPS w/ Insulation 140 ni/p =2b, TOPS w/ Insulation ni/p =3b, TOPS w/ Insulation 120 Network Scale (N) ni/p =4b, TOPS w/ Insulation E(pJ/Op) 100 0 10 80 60 40 20 10 -1 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 DR (GS/s) DR (GS/s) (a) (b) FIG. 12. (a) Total energy efficiency and (b) MRM Network sizes for R = 1.2 A/W , DR = 1 − 10 GS/s. Operating at higher data rates helps reduce the contribution of the static power consumption of thermo-optic phase shifters to the energy efficiency. Network scales up at lower data rates due to the SNR improvement. for SiP MRM MACs (using low-power thermo-optic phase A. Optical Loss Reduction shifters with insulation and 1.2 A/W PD responsivity) at ni/p = 1b to 4b resolutions are 2.6× to 13× worse. In com- As described in the previous sections, optical losses limit parison to MZM-based implementations, MRR-based imple- the scalability of the SiP 1.0 networks. Losses must be mini- mentations can have 1.8× bigger network scale and achieve mized at the coupling interfaces and in the components. PWB 1.3× lower energy consumption per operation. Similar to is one way to ensure efficient coupling between the chip and MZM-based implementations, MRR MACs are performed at the optical fiber with insertion loss ∼ 1 dB with negligible a 2Nα fCLK−OPT / fCLK−CMOS × higher throughput than its dig- variation38 . Passive alignment to SMF optical fibers can be ital CMOS counterparts. accomplished using V-grooves arrays. Such fiber to chip Fig. 12 shows the energy efficiency and scaling at lower self-alignment has been shown to have coupling efficiency of data rates. Similar to MZM implementations, reducing the ∼ −1.3 dB85 . In another demonstration, coupling losses as data rate degrades the energy efficiency while scaling up the low as ∼ 0.5 dB and ∼ 0.35 dB have also been reported for network size due to the reduced noise levels at the AFE. passive and active alignments, respectively86 . For scaling up the networks, the optical signal attenuation can be compensated by using SOAs. On-chip SOAs can be utilized to pre-amplify the input signal and also exploited as IV. RESEARCH OPPORTUNITIES weight matrix elements to provide weights magnitudes > 187 . However, the non-linear gain-current curve entails a need for As summarized in previous sections, SiP accelerators oper- calibration. ate at much higher speed and lower latency than their CMOS Improving the responsivity of the AFE is yet another way counterparts. Nevertheless, it is further desired to improve the to tolerate the optical losses. It relaxes the need to in- size of the MAC networks in SiP, especially for neural net- crease the laser power to compensate for the losses. The work applications, and improve the energy efficiency. There high multiplication gain and responsivity of APDs have been has been several promising research in the field of SiP. Classi- shown to improve the sensitivities of optoelectronic receivers fying the existing commercial SiP technology as the first gen- front-end66 . Improving dark current and quantum efficiency eration, we describe several emerging technologies that will by careful design of the APD geometry has been projected make up the next generation of SiP, which we call SiP 2.0. to improve the sensitivity of Si-Ge APD receivers up to Fig. 14 summarizes the advancements in SiP 2.0 that can be −29 dBm at 12.5 Gb/s88 as compared to −18.5 dBm for leveraged by SiP-based accelerators to reduce optical loss, im- Ge PIN detectors89 . Limiting the bandwidth of the AFE and prove the energy efficiency and incorporate heterogeneous in- using equalization techniques90 such as continuous time lin- tegration techniques for performance improvement. ear equalization (CTLE) and decision feedback equalization
12 1.8 1.8 Laser Laser 4 , %&' = 15 Input Drivers Input Drivers 1.6 Tuning 1.6 Tuning 4 , %&' = 15 3 , %&' = 21 Output AFE Output AFE 4 , %&' = 7 3 , %&' = 29 1.4 1.4 2 , %&' = 28 2 , %&' = 52 4 , %&' = 7 1 , %&' = 85 1 , %&' = 35 1.2 1.2 E(pJ/OP) E(pJ/OP) 1 1 3 , %&' = 9 4 , %&' = 15 3 , %&' =9 0.8 0.8 4 , %&' = 15 2 , %&' = 12 4 , %&' = 14 4 , %&' = 15 0.6 0.6 2 , %&' = 11 3 , %&' = 21 4 , %&' = 14 4 , %&' = 15 3 , %&' = 21 1 , %&' = 14 2 , %&' = 28 3 , %&' = 29 3 , %&' = 28 2 , %&' = 28 2 , %&' = 52 1 , %&' = 14 1 , %&' = 35 0.4 0.4 2 , %&' = 49 3 , %&' = 27 1 , %&' = 85 3 , %&' = 29 1 , %&' = 81 2 , %&' = 52 2 , %&' = 49 1 , %&' = 35 1 , %&' = 85 1 , %&' = 81 0.2 0.2 0 Thermal Thermal 0 Thermal Thermal LCOS 0 5 10 NOEMS 15 LCOS 20 25PCM 0 5 10 NOEMS 15 20 25PCM w/ Insulation w/ Insulation (a) (b) (a) (b) FIG. 13. Energy efficiency breakdown considering various weight tuning options for bit resolutions, ni/p = {1, 2, 3, 4}b for (a) an MZM-based implementation, and (b) an MRM-based implementation. Phase shifters with low insertion loss and static power consumption (e.g. TOPS with insulation and NOEMS) are good candidates to enhance the energy efficiency of SiP implementations. The high insertion loss of LCOS and PCM poses limitations on the network sizes for a target resolution. In addition, PCMs, with their large dynamic power consumption, are promising for MRM implementations provided a high weight reuse is possible. system92,93 . Hybrid-integrated silicon photonic lasers have been shown to provide ∼ 12.2% WPE94 . Although introducing on-chip CW lasers mitigates the cou- pling losses, the feasibility of using them, especially for net- works using WDM, require wavelength stabilization and re- flection cancellation95,96 . Reducing the power consumption of the phase shifters in the weight matrix is a critical requirement given that their overall energy consumption scales quadratically with network size (Eq. (6) and Eq. (12)). Thermo-optic phase shifters dis- sipate high power consumption given their resistive nature. FIG. 14. Evolution of SiP technology from current (1.0) to the next Introducing trenches, undercuts and back-side substrate re- generation (2.0). moval has been shown to improve the tuning efficiency of the rings by an order of magnitude with measured reported power consumption of ∼ 4 mW per FSR77,97,98 . However, thermal (DFE)91 can reduce the input referred noise of the AFE and isolation and substrate removal exacerbates self-heating and further improve the sensitivity. must be taken into consideration while designing a CMOS controller99 . Several post-fabrication schemes have been investigated to B. Improving energy efficiency correct for the fabrication-induced variations. Reducing the process variations was investigated by patterning SiN on top Commercial CW lasers suffer from low WPE in the range of the Si waveguide to introduce field perturbations which ef- of ∼ 1% − 10%, which impacts the energy efficiency on the fectively adjusts the optical path length100 . Another demon-
13 strated technique relies on trimming using Ge ion implanta- lation of energy efficiencies in this work. For networks where tion followed by laser annealing to tune MRR resonant wave- the weight reuse is low, the contribution of the dynamic en- length across the whole FSR without introducing any excess ergy per operation can be considerably higher for PCM than loss. Its accuracy, CMOS-compatibility, and feasibility for all the other weight tuning alternatives, degrading the energy wafer-scale correction renders it a potential technique to be efficiency by orders of magnitude. utilized in optical neuromorphic implementations to reduce Fig. 13 shows the energy efficiency breakdown for both the tuning power101 . the MZM-based and MRR-based architectures for weight Alternatives such as nano-opto-electro-mechanical systems tuning that rely on thermo-optic phase shifters without and (NOEMS)16,102 and liquid crystal on silicon (LCOS)103 have with insulation77 , NOEMS, LCOS and PCM. For each im- the potential to reduce the tuning power overhead signifi- plementation, energy calculations are based on the network cantly. The dynamic energy consumption of NOEMS was scales that can satisfy the SNR requirements to compute with reported in the range of 0.13 fJ and 0.32 fJ for digital pulse bit resolutions, ni/p = {1, 2, 3, 4}b. The insertion loss for a signals102 , and is assumed as ∼ 1 fJ for our study. On the 35 µm long LCOS used to realize a π phase shift is taken as other hand, LCOS have been shown to dissipate power as low 0.35 dB103 . For MZM implementations, thermo-optic phase as 2 nW103 . shifters with insulation seem currently attractive for energy Phase change materials (PCMs) such as efficiency and network size. For a large weight reuse fac- Ge2 Sb2 Se4 Te(GSST ) have been demonstrated as com- tor, NOEMS-based phase shifters promise further energy re- pact phase shifters in which the optical phase shift is obtained duction. For MRM implementations, similar conclusions can by tuning the state of the material from amorphous and be drawn except that LCOS-based phase shifters also seem crystalline104,105 . Being able to sustain their crystallization promising. state with the absence of power renders them as good candi- For both MZM-based and MRR-based architectures, it is dates for tuning low-speed weights in SiP implementations evident that opting for matrix weight tuning alternatives with with no static power consumption. Given their compact almost zero power consumption significantly improves the to- sizes and non-volatile nature, the efficiency of implementing tal energy efficiency, with values approaching < 100 fJ/Op for them in large-scale SiP networks is investigated as shown both architectures. Further research is still needed to demon- in Fig. 13. Although not as lossy as PN phase shifters, strate the feasibility of these approaches in high volume pro- PCMs have an IL = 0.32 dB which is relatively high for duction to realize such energy efficiency regime. cascaded phase shifters in a large-scale implementation104 . This limits the network sizes for computation with several bits of resolutions. The resolution of the weights can be set C. SOA Cascadability by adjusting the level of crystallization of a PCM cell106 . The pulse energy consumption for writing and erasing lev- SOAs are used to amplify optical signals over a given spec- els 1-7 were reported in the range of 372pJ − 601pJ and trum and can be implemented off-chip or using hybrid inte- 562pJ − 373pJ 106 , respectively. Assuming the weights to be gration. Cascading SOAs has been conventionally used to re- uniformly distributed, the average energy consumption, EPCM , store signal levels in interconnect links and has been recently for setting the PCM to various weights can be calculated as explored in deep neural network implementations87 . in Eq. (14), where EA and EC represent the pulse energy re- Although SOAs help compensate for the optical losses to quired to write (amorphization) and erase (crystallization) the increase the scale of the network, they contribute significantly first level (L1 ), respectively. For levels, Li , where i > 1, ∆EA to the energy consumption. In addition, they suffer from sev- and ∆EC represent the average amount of energy required to eral downsides, including the noises produced due to the op- transition to one level higher or a lower, respectively, with all tical amplification, the ripples in their gain spectrum, and am- levels assumed to be equally spaced for the sake of simplicity. plification nonlinearity108–110 . The major noise component 2n − 1 is attributed to the amplified spontaneous emission (ASE) of EPCM = (EA + EC ) photons towards the input and output of an SOA110,111 . There- 22n (14) fore, the build-up of ASE noise due to the cascade degrades (1/3)(22n − 1)2n−1 − (2n − 1) the SNR of the optical signal reaching the output108,110 . + (∆EA + ∆EC ) 22n To investigate the effect of incorporating SOAs on the scal- Assuming EA = 372 pJ, EC = 373 pJ, ∆EA = (601 − ability of the network as well as the received signal resolution, 372)/(2n − 2) pJ and ∆EC = (562 − 373)/(2n − 2) pJ, the the overall SNR is quantified in Eq. (15), with an SOA gain estimated average energy consumption for a PCM phase value as G = 17 dB50,112 . Bo and Be stand for the optical band- shifter with n = {1, 2, 3, 4}b equals {186, 231, 165, 121} pJ. width of the amplifier and electrical bandwidth of the AFE, For phase shifters with zero static power dissipation, the respectively. ρASE represents the ASE noise and is calculated dynamic energy consumption is divided by the number of as: times weights have been reused for vector matrix multiplica- hc tion. This is done by introducing a weight reuse factor, αw , ρASE = 2NSOA nsp (G − 1) (16) λ such that Pmat−tuning = PNOEMS,PCM /αw , where αw typically ranges between 26 to 218 in general matrix multiplications where NSOA and nsp represent the number of SOAs used in the (GEMMs)107 . A value of αw = 4096 is chosen for the calcu- network and the spontaneous emission factor of the optical
14 (RGPo/p )2 SNR = r q 2 (15) 2q(RGPo/p +Id )+ 4kT 2 2 2 2 2 4kT 2 2 R +2ρASE R GPo/p +ρASE R (2Bo −Be )+R Po/p RIN + (2qId + R +ρASE R (2Bo −Be )) Be L L amplifier, respectively. RIN represents the relative intensity noise of the input laser source caused by the random sponta- neous emission over time50,51 . The parameters given in Table I are used for calculating the SNR and NSOA . For a target network resolution, ni/p , the number of SOAs are calculated based on the resultant SNR whose signal and noise power values vary with the network scale. The SNR is calculated as in Eq. (17). SNR[dB] − 1.76 ni/p = (17) 6.02 Fig. 15 and Fig. 16 illustrate the number of SOAs in terms of the network resolution and scale for MZM and MRM based implementations, respectively. It can be observed that the res- olution which a SiP network is desired to operate at is depen- dant on the network scale. As can be inferred form Fig. 2 and Fig. 9 for an SOA-less network, the SNR degrades since the signal optical intensity at the AFE Po/p is attenuated with the scale of N. Incorporating an SOA helps replenish the sig- nal intensity. SOAs can be added in the network before the SNR degrades below the threshold for a given resolution due to the insertion losses. However, the signal dependent noises FIG. 15. Network scale vs. resolution for various number of SOAs are amplified as well which do not align in favor of the net- for an MZM-based implementation. The scale of the network can be work resolution. For calculating the resolution, the AFE is traded off with its resolution. Incorporating SOAs help in extending the network scale for a fixed resolution or increasing the resolution assumed to tolerate signals with maximum Popt intensity as for a given network scale. The number of SOAs that can be added to high as 10 dBm, beyond which the number of SOAs are lim- a network are limited to 1 or 2. ited. Regions with no SOAs at the right side of Fig. 15 and Fig. 16 represent regions where the desired SNR cannot be achieved, indicating an infeasible resolution for a given scale. It can be inferred that an MZM-based network can support signal resolutions of 4b for a network size of Nltd = 55, with a single SOA. In comparison, incorporating SOAs in MRM im- plementations increases the network limited scale to Nltd = 94 for ni/p = 4b as can be shown in Fig. 16. SOAs require introducing III-V or II-VI compound semi- conductor materials to the SiP platform which is non- compliant with the standard SiP CMOS foundry runs. Back propagating ASE noise from the SOA emphasizes the need to employ an optical isolator for the input laser source108 . Use of narrowband filters are also possible to reduce the out-of- band noise87,108 , but these further reduce the maximum chan- nel count, thus limiting the scaling of the network. To be used for linear analog computations, SOAs should deliver gains that are independent of the input intensities. Cross-gain mod- ulation (XGM) is one type of non-linearity observed in SOA amplifiers in which the combination of all the input intensities impacts the gain of a single channel113,114 . The aforementioned SOA limitations should be addressed in order to maintain the linearity of the weight matrix and al- FIG. 16. Network scale vs. resolution for various number of SOAs low further scaling of networks. Designing highly efficient for an MRM-based implementation. Compared to their MZM coun- SOAs has the potential to increase the size of the networks. terpart, SOAs have bigger impact in scaling up the network for a However, improving the network energy efficiency requires given resolution.
You can also read