Cross-layer Codesign for Resilient Hardware - Xinfei Guo, Ph.D. U of Virginia & NVIDIA Corporation - GitHub Pages
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Boston (and Beyond) Area Architecture Workshop (BARC) Cross-layer Codesign for Resilient Hardware Xinfei Guo, Ph.D. U of Virginia & NVIDIA Corporation Westborough, MA Jan 29th, 2021 xfguo@ieee.org www.xinfeiguo.com © Xinfei Guo | Confidential
Starting from Transistors – Aging/Wearout n Time-dependent device degradations q Transistors (P/NMOS) q Metal layers (Interconnect, PDN) PDN BTI BTI → Bias Temperature EM BTI Instability Vth Ids EM → Electromigration PMOS BTI PDN → Power Delivery Interconnect Network Vth → Threshold Voltage NMOS EM R td Ids → Current R → Resistance BTI EM BTI td → Propagation Delay PDN EM © Xinfei Guo | Confidential BARC 2021 3
Back to applications – lifetime is critical! q Longer expected lifetime q Higher susceptibility to aging q Extreme environmental conditions q Replacement costs q Longer run time (higher utilization) q … Source: https://semiengineering.com/making-chips-to-last- © Xinfei Guo | Confidential BARC 2021 their-lifetime/ 4
Why do we care about aging now? Wire broke due to EM n Reliability threat! q Permanent errors q Shorten lifetime q Worsen metrics such as performance, power and area Figure: [N. Cheung et al., UC Berkeley] n Getting worse with technology scaling! q Increased power density → Heat q Increased effective electrical field → More stress q More components → Require FinFETs More Aging lower single failure rate q Advanced nodes → New stress and issues: e.g. self-heating Source: [S.M. Ramey, et al. (Intel), 2018] © Xinfei Guo | Confidential BARC 2021 6
A cross-layer effect n Device level q Threshold voltage Vth increase (BTI) q Resistance increase (EM) n Circuit level q Performance degradation q Timing failures q Leakage power n Architecture level q Failures q Errors n System level q MTTF © Xinfei Guo | Confidential BARC 2021 7
Traditional solutions Adapting Passive Recovery Adding Margins (Sensing + Actuation) (More idle time) • Over-estimation • The worst case is • Very slow • Under-estimation getting worse • Unpredictable • Uncertain operation • Aging is unchecked • Permanent part will conditions • Tracking power (over keep accumulating • e.g. 10% for a 3-year 10 sensors per lifetime constraint partition) Single-layer solution is not adequate to deal with aging any more as it is becoming a bottleneck! © Xinfei Guo | Confidential BARC 2021 8
Introducing Accelerated Self-Healing n Fact - Aging is partially recoverable under passive recovery, but it is very slow. n Key Idea: Reverse the directions of aging and enable active Recovery t i ve t i ve A c Ac ing /Ag ?? g / n eali very Sleep/ H co Rejuven ation Re © Xinfei Guo | Confidential BARC 2021 9
Accelerated Self-Healing Key Idea: Recover by reversing the directions of Aging BTI Accelerated Self-Healing 1 Active Recovery: 2 3 Accelerated & Active 4 Passive Recovery Activate the recovery Accelerated Recovery Recovery ne s eli B a Vsg = 0, room Vsg = negative Vsg = 0, high Vsg = negative temperature room temperature temperature high temperature EM Accelerated Self-Healing 1 Active Recovery: 2 3 Accelerated & Active 4 Passive Recovery Activate the recovery Accelerated Recovery Recovery ne eli a s B I = 0, room I = negative I = 0, high I = negative temperature room temperature temperature high temperature © Xinfei Guo | Confidential BARC 2021 10
Experiments for demonstration Thermal Chamber Circuit Under Test (CUT) rst 75 LUTs En En in 16 16-b Cout Counter fref clk BTI Test Setup(a) Data Sampling Interface Board (b) Chip 40nm FPGA EM Test Setup Thermal Chamber Probe Technology 180nm Pads Material Copper Thickness 0.8um Length 2.673mm Metal Width 1.57um Wire Resistance (@rt) 35.76W Resistance Recording Constant Current Supply Device (Wire) under test On-chip Metal Wires © Xinfei Guo | Confidential BARC 2021 11
Measurement results summary n Recovery from aging can be made active and be accelerated, even the irreversible component can be fully eliminated or avoided through various techniques such as higher temperatures, negative voltages, active vs. sleep ratio … n What does this mean for chip designers and architects? A: Cross-layer Accelerated Self-Healing © Xinfei Guo | Confidential BARC 2021 X. Guo, etc. DAC ‘14, ASP-DAC ‘16, DSN ‘17 12
Implication – Metric Improvements q >60x reduction of necessary margin for all cases q The average performance is close to the fresh during the whole lifetime q Both metrics don’t scale with the increase of the lifetime constraint ~ 2X ~ 1X Reduction of Necessary Design Margin Average Performance Improvement © Xinfei Guo | Confidential BARC 2021 X. Guo, etc. ASP-DAC ‘16 13
Cross-layer Accelerated Self-Healing System Virtual Proactive Scheduler Load Balancer Sensors +1 Architecture Redundant Program Dark Silicon Counters Resources Active Accelerated Recovery EN -∆V EN Recovery Circuit Negative Voltage Heating Elements Wearout Generator Sensors © Xinfei Guo | Confidential BARC 2021 X. Guo, etc. VLSI, Integration ‘17 14
Circuit Components for Self-Healing Non-overlapping clock generator 638ns Clock frequency = 66.7MHz clk BTI Accelerated EM Accelerated Wearout Sensing Circuits Self-Healing Circuits Self-Healing Circuits clk1 4.36mV clk2 -300.6mV Vdd On-Chip Negative 4.33mV BTI Sensing clk1 charging charge Voltage Generator Ripple: 1.45% EM and BTI redistribution Recovery Assist RO-based Metastable C1 clk1 C2 Power Gating Circuitry Vout for N/PBTI Element clk2 -300.6mV enables Recovery Sensing Based Reconfigurable Heating Element Sensor On-Chip Heater Accelerated Recovery Reconfigurab le output EM Sensing ROs Boosted Vdd Power Gating Block Vdd_high Vdd Always ON L L/2 Sleep_buf Sleep To other power C4 Bump VDD_PAD gating blocks Retention L/4 Sleep Negative Registers M10 track poll1 track poll1 Voltage D Q Global path0 2% path1 10% M9 Vddv Vdd PDN poll1 poll2 track poll2 M8 5% path2 10% poll2 M7 Logic Blocks Negative Voltage poll3 track poll3 EM VDD Grid 10% path3 10% Generator En Sleep M6 (EM hazards) VDD Via poll3 M5 M4 track MUX Connect to EM Connect to Out outa ref outb ref VSS Grid M3 EM VSS Grid M2 VDD discharg discharg discharg discharg P2 P1 P4 P3 Load BTI (a) (b) © Xinfei Guo | Confidential BARC 2021 X. Guo, etc. Springer ‘20 15
Costs n Area ↓ Power ↓ Extra Heat ↓ q Optimal ways of distributing circuit IPs in a large system q Avoid unwanted heat q Trigger only when necessary Design Name Leakage Power Dynamic Power Area Performance Neg. Voltage 68.85nW 64.47uW 4300um2 >66.7MHz Generator On-Chip Heater 16.8nW 75uW 16um2 - Multi-mode - - 58.24um2 Wakeup time ~170ns Recovery Circuit Are there any other opportunities beyond circuit level? © Xinfei Guo | Confidential BARC 2021 16
Architectural Simulation Framework for Architecture Level Exploration – “OldSpot” More Aging Critical A CPU Layout Example Output of the tool -> Aging HotSpot! https://github.com/hplp/oldspot © Xinfei Guo | Confidential BARC 2021 A. Roelke, X. Guo, etc. ICCD ‘18 17
Unit-level Accelerated Self-Healing n Goal q Less area and power overhead n Solution q Placing self-healing IPs only for aging-critical units heaters rob Voltage Neg. Heat Map Wearout Map (From “HotSpot”) (From “OldSpot*”) © Xinfei Guo | Confidential BARC 2021 X. Guo, etc. Springer ‘20 18
Utilize Intrinsic Heat n Goal q Avoid power overhead for generating extra heat n Solution q Take advantage of dark silicon or core redundancy q Utilize intrinsic sleep behaviors Shared Memory Shared Memory t1 t2 © Xinfei Guo | Confidential BARC 2021 X. Guo, etc. Springer ‘20 19
Scheduling for Recovery n Goal q Recover effectively only when necessary Normal Operation Accelerated and Active Recovery 344.3min Passive Recovery Wearout Sensors Trigger Point 57.8min 23.2min Reactive Recovery Performance (a.u.) a% - Preset Threshold Proactive 104.5min Recovery Preset Schedule Accumulated AC Full recovery time after 12-hour Clock Wearout Stress constant stress under normal Frequency condition 0 Time Proactive Recovery Application-dependent Scheduling © Xinfei Guo | Confidential BARC 2021 20
Putting it All Together: CLASH - Cross-layer Accelerated Self-Healing System • Application dependent scheduling • Scheduling for Recovery • Proactive Recovery • Unit-level Accelerated Self-Healing • Take advantage of dark silicon or core redundancy • Utilize intrinsic sleep behaviors • Accelerated Self-Healing Assist Circuitry • On-chip negative voltage generator • Power gating recovery enabler • On-chip Heater • Aging and Recovery Sensing Circuitry • BTI Sensor • RO Based • Metastable element based • EM Sensor © Xinfei Guo | Confidential BARC 2021 21
CLASH System – Hardware view Normal EM Active BTI Active Recovery Heat Flow Operation Recovery Recovery Circuits Wearout Sensors & Recovery Circuitry EM/BTI Assist Circuitry © Xinfei Guo | Confidential BARC 2021 22
What is the key benefit of doing cross-layer codesign here? Before… Accelerated Self-Healing • Margin (e.g. 10 – 20 %) • Margin (0.21%) • Track and Adaptation (Track • Only track the reversible part (~ 8X during the entire lifetime) tracking power reduction) • Passive recovery (
Sleep for rejuvenating and healing neurons? https://www.scientificamerican.com/article/lack-of-sleep-could-be-a- © Xinfei Guo | Confidential problem-for-ais/ BARC 2021 24
Key takeaways n Device level behaviors will have a lasting impact to all upper layers n Cross-layer codesign is an essential way of enlarging the search space n Challenges co-exist with opportunities q Infrastructures q Transparency q Design cycle q … © Xinfei Guo | Confidential BARC 2021 26
References 1. X. Guo, W. Burleson, M. Stan, “Modeling and Experimental Demonstration of Accelerated Self-Healing Techniques,” ACM/IEEE Design Automation Conference (DAC), 2014. 2. X. Guo, M. Stan, “Work hard, sleep well - Avoid irreversible IC wearout with proactive rejuvenation,” ACM/IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), 2016. 3. X. Guo, M. Stan, “Deep Healing: Ease the BTI and EM Wearout Crisis by Activating Recovery,” International Conference on Dependable Systems and Networks (DSN), 2017. 4. X. Guo, M. Stan, "Implications of Accelerated Self-Healing as a Key Design Knob for Cross-Layer Resilience", INTEGRATION, the VLSI journal (VLSI), vol. 56, pp. 167-180, 2017. 5. A. Roelke, X. Guo, M. Stan, “OldSpot: A Pre-RTL Model for Aging and Lifetime Optimization,” ICCD, 2018. 6. X. Guo, V. Verma, P. Guerrero, M. Stan, “When things get older - Exploring Circuit Aging in IoT Applications," International Symposium on Quality Electronic Design (ISQED), 2018. 7. X. Guo, M. Stan, “Circadian Rhythms for Future Resilient Electronic Systems - - Accelerated Active Self-Healing for Integrated Circuits”, Springer, 2020. © Xinfei Guo | Confidential BARC 2021 27
Stay Safe and Healthy! xfguo@ieee.org © Xinfei Guo | Confidential BARC 2021 28
You can also read