System Hardening against Upsets and Real Space Applications - CNES
Toulouse – IRIT-ENSEEIHT December 13, 2013 System Hardening against Upsets and Real Space Applications Michel PIGNOL CNES DCT/TV/IN 18 avenue Edouard Belin 31401 Toulouse Cedex 9 - FRANCE michel.pignol@cnes.fr http://www.cnes.fr
Motivation
Rationale for fault-tolerant architectures in the space domain
■ Up to now, space computers have mainly been developed with rad-hard ICs
■ Mainly for performance reasons (not for cost reasons), commercial electronic integrated components (COTS ICs) will probably be used more and more
  - For microprocessors (µP), the performance gap is around 50 (average value)
    • LEON2 = 100 MIPS peak; PowerPC7448 = 5100 MIPS peak
  - This gap is growing
    • PowerPC is superscalar, LEON2 is not
■ Due to the SEE sensitivity of COTS, they must be protected by fault-tolerant mechanisms or architectures
■ SEE protections = high cost / schedule overheads => it is important to assess carefully the safety/availability requirements of the project in order to select the optimal fault-tolerant solution
  - Such solutions range from very simple mechanisms with limited error detection/recovery capabilities to complete protection with an FT architecture
M. Pignol – System Hardening TORRENTS 2013, Dec. 13 CNES
OUTLINE
1 – INTRODUCTORY PART
■ Avionics architecture of a satellite
■ SEE – Effects of radiation on digital parts
2 – ARCHITECTURE AND SYSTEM PROTECTIONS
■ 2-A – FDIR overview
■ 2-B – Links
  - Avionics buses
  - Sensor/actuator links
  - High speed serial links
■ 2-C – Memory units
■ 2-D – Processing units
  - Time replication
  - Structural duplex
  - Triplex / Quadruplex
  - Micro-synchronized triplex
  - Fault-tolerant trade-off with analysis of theoretical case studies
3 – REAL CASE STUDIES
■ ATV, the ESA Automated Transfer Vehicle
■ MYRIADE, the CNES micro-satellite family
■ CALIPSO, a Franco-American mini-satellite
■ REIMEI (INDEX), a Japanese small satellite
4 – CONCLUSION
Acronyms
CNES    Centre National d'Etudes Spatiales, the French Space Agency
ESA     European Space Agency
ADC     Analog to Digital Converter
Acq     Acquisition
ALU     Arithmetic and Logic Unit
ATV     Automated Transfer Vehicle
Cmd     Command (actuation)
Cntl    Control
COTS    Commercial Off-The-Shelf
CPU     Central Processing Unit
CRC     Cyclic Redundancy Check
CTXT    Context (software variable)
DMT     Duplex Multiplexed in Time (time replication at task level, CNES architecture)
DRAM    Dynamic RAM
DSP     Digital Signal Processor
DT2     Double Duplex Tolerant to Transient (mini structural duplex at task level, CNES architecture)
EDAC    Error Detection And Correction
FDIR    Fault Detection, Isolation and Recovery
FT      Fault-Tolerant
FTC     Fault-Tolerant Computer
GIPS    Giga Instructions Per Second
IC      Integrated Circuit
I/O     Input/Output
ISS     International Space Station
NG      Next Generation
OBC     On-Board Computer
PARAM   Parameter (software variable)
PF      PlatForm (of a satellite)
PL      PayLoad (of a satellite)
R/W     Read/Write
TC      TeleCommand
TM      TeleMetry
Tx/Rx   Transmitter/Receiver
µP      Microprocessor
µSL     Micro-Satellite
VLIW    Very Long Instruction Word (superscalar DSP having several execution units working in parallel)
WD      WatchDog
wrt     With Regard To
N-MR    N-Modular Redundancy
DMR     Double-MR = Duplex
TMR     Triple-MR = Triplex
QMR     Quad-MR = Quadruplex
MBU     Multiple Bit Upsets
SEE     Single Event Effect
SEFI    Single Event Functional Interrupt
SEL     Single Event Latch-up
µSEL    Micro latch-up
SET     Single Event Transient
SEU     Single Event Upset
TID     Total Ionizing Dose
1 – INTRODUCTORY PART
[Figure: MEGHA-TROPIQUES, a French / Indian mission to improve our knowledge of the tropical climate system; launched in 2011. © CNES / ISRO]
Avionics architecture of a satellite
[Figure: block diagram of a satellite avionics architecture. Nominal and redundant computers (central processing, TM & TC & reconfiguration, power converter, I/O units 1 to 4) with cross-strapped interconnection; nominal and redundant avionic buses; platform and payload sensors/actuators; image sensor with video electronics (ADC); nominal and redundant data compression units and mass memory; HSSL links; TM/TC Tx/Rx and high-rate video data Tx (label "1 to 10 Gbit"); unregulated power and reconfiguration signals. The legend distinguishes internal links and buses, the main elements sensitive to SEE, and hot / warm / cold redundancy.]
Number of input/output interfaces for small/large satellites:
  - Thermistor acquisitions: 30 to 200
  - Analog acquisitions: 30 to 100
  - Status acquisitions: 30 to 60
  - Heater commands: 10 to 100
  - Bi-level commands: 20 to 50
  - Low rate serial links: 5 to 15
  - Plus additional specific I/O interfaces for: pyros, reaction wheels, magneto-torquers, gyroscopes, magnetometers, GPS, thrusters, etc.
Usual budgets for monitoring & control computers:
  - Small satellites: no redundancy; volume = 3.3 litres / 5 boards; mass = 3 kg; power dissipation = 6 W
  - Large satellites: nominal + redundant units; volume = 30 litres / 15 boards; mass = 19 kg; power dissipation = 55 W
[Figure: examples of satellites]
■ SPOT 5 for Earth observation: launched in 2002; 3000 kg; 2400 W; 5.7x3.1x3.1 m; 2x60 km swath; 2.5 m resolution
■ INMARSAT 4-F1 for mobile-to-mobile telecommunications: launched in 2005; 6000 kg; 14000 W; 7x2.9x2.3 m; 45 m solar arrays; 10 m diameter antenna
■ ALPHABUS, a family of European Telecom satellites with a common platform from EADS ASTRIUM and THALES ALENIA SPACE: max 8800 kg; max 18000 W; the first launch is ALPHASAT in 2013
Image credits: © EADS ASTRIUM, © ESA / CNES, © CNES / D. Ducros, © CNES / P. Le Doare
EADS ASTRIUM computers
[Figure: photographs, © courtesy of EADS ASTRIUM]
■ GSTB-V2 computer (Galileo System Test Bench; proposal for Galileo satellites)
■ TM & TC & Reconf. board, including 3 ASTRIUM ASICs:
  - TC processing and reconfiguration
  - TM formatting and routing
  - Storage control for reconfiguration
■ CPU board using the ASTRIUM Multi-Chip-Module (2003) based on the ERC32SC space µP
■ Fiber Optic Gyro Electronics Module (I/O board)
SEE – Effects of radiation on digital parts
SEE: Single Event Effect
■ SEE covers all effects due to a single particle
  - SEE in digital ICs = SEL + µSEL + SEFI + SET + SEU + MBU
■ SEL – Single Event Latch-up
  - Local short-circuit
  - Detection: loss of functionality or over-consumption / Protection: power cycling
  - It is good practice to avoid components which are sensitive to SEL
  - If that is not possible, limit their usage and protect them with adequate solutions
■ SEFI – Single Event Functional Interrupt
  - The component is put in a blocked state, and a reset is not always able to bring it back into an operational state
  - Detection: loss of functionality (as for SEL)
  - Protection: reset (optional but recommended) and/or power cycling (mandatory)
■ SEU/MBU – Single Event (Multiple Bit) Upset / SET – Single Event Transient
■ Goal of fault-tolerant architecture protections
  - Thanks to DSM (deep sub-micron) technologies, more and more COTS parts are compliant with TID (Total Ionizing Dose) and SEL space constraints
  - But all digital COTS components are sensitive to transients and upsets
  - The presentation targets SEFI / SEU / MBU / SET mitigation, mainly on µP
■ An ingenious SEL mitigation example (MYRIADE real case): minimizing the number of parts and, nevertheless, implementing both detection methods and both mitigation methods
[Figure: a watchdog timer (CLOCK input, outputs Q0, Q1 .. Qn, refreshed by the microprocessor through RefreshWD) drives the microprocessor Reset and the OFF/ON switching of Vcc; a current limitation with R-threshold raises the ERROR signal.]
■ Whatever the detection method is, it is good practice to have a gradual recovery process based on several levels, for instance:
  - First attempt following a detection: a quick "standard" recovery (i.e. without reset) is tried (in case of a simple effect of an SET/SEU/MBU)
  - Second attempt: if the first attempt is not successful, the computer is reset (in case of a more complex effect of an SET/SEU/MBU, or in case of SEFI)
  - Third attempt: if the computer still does not become operational, the power supply is cycled (in case of SEFI or SEL)
  - Such a multi-level recovery process is implemented on the CNES MYRIADE micro-satellite: see Section "3 – Real Case Studies"
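The gradual recovery process above can be sketched as a small escalation loop. This is only an illustration, not the MYRIADE implementation: the `Recovery` levels, the `SimulatedComputer` class and its fault labels are hypothetical names introduced here, and the fault model (which level clears which fault) is a deliberate simplification.

```python
from enum import Enum


class Recovery(Enum):
    QUICK = 1        # "standard" recovery, no reset
    RESET = 2        # reset of the computer
    POWER_CYCLE = 3  # power supply cycling


class SimulatedComputer:
    """Toy fault model (assumption): each level clears a given set of faults."""
    CLEARS = {
        Recovery.QUICK: {"simple_upset"},
        Recovery.RESET: {"simple_upset", "complex_upset"},
        Recovery.POWER_CYCLE: {"simple_upset", "complex_upset", "latch_up"},
    }

    def __init__(self, fault):
        self.fault = fault

    def apply(self, level):
        if self.fault in self.CLEARS[level]:
            self.fault = None

    def is_operational(self):
        return self.fault is None


def gradual_recovery(computer):
    """Escalate level by level; return the level that restored operation."""
    for level in Recovery:          # Enum iterates in definition order
        computer.apply(level)
        if computer.is_operational():
            return level
    return None                     # nothing worked: hand over to the next FDIR layer


if __name__ == "__main__":
    for fault in ("simple_upset", "complex_upset", "latch_up"):
        print(fault, "->", gradual_recovery(SimulatedComputer(fault)))
```

The point of the ordering is that the cheapest action is always tried first, so a power cycle is only paid for when the lighter levels have failed.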
THALES ALENIA SPACE computers
[Figure: photographs, © courtesy of THALES ALENIA SPACE]
■ SMU-V1 computer (Satellite Management Unit; platform computer for the SpaceBus4000 Telecom satellite family and the Globalstar2 satellite)
■ TM & TC & Reconf. board, including 2 THALES ASICs (TC processing and reconfiguration; TM formatting and routing) and 4 THALES hybrids for generating command signals
■ CPU board using the ATMEL ERC32SC space µP and the COPRES THALES ASIC
■ Satellite Distribution and Interface Unit for Telecom satellites (I/O board)
2 – ARCHITECTURE AND SYSTEM PROTECTIONS
[Figure: EUCLID, an ESA mission to map galaxies and to analyse their distribution and their apparent deformation under the effect of dark matter, for a better understanding of dark matter and its influence on the origin of the accelerating expansion of the Universe; launch planned in 2020. © ESA]
2-A – FDIR overview
The FDIR strategy – Fault Detection, Isolation and Recovery
■ Main objective of the FDIR strategy
  - To keep the integrity of the satellite (i.e. its operational capability) in the presence of anomalies
    • There is no universal strategy; it is defined case by case, depending on the mission and on the faulty unit considered
■ Usual FDIR strategies when an anomaly is detected
  - Note: "satellite survival mode" = minimal mode that keeps the electrical power, the internal temperature and the TM/TC link with the ground control station at an acceptable level
  - Earth observation satellites: switch to the survival mode and leave it to the ground control station to find the source of the anomaly and then to select the best recovery strategy
  - Telecom satellites: reconfigure the avionics architecture to try to passivate the anomaly, in order to remain in operational mode as long as possible and comply with the availability requirements; the survival mode is limited to exceptional cases
=> Telecom satellites have a higher autonomy than Earth observation satellites
The FDIR strategy (cont.)
■ Recovery action when an anomaly is detected
  - Only a few alarms are highly critical and directly start a recovery action
    • Examples of such critical alarms: power falling down; software watchdog; Earth sensor alarm for some missions
    • Associated recovery action in case of cold redundancy: switch off the nominal computer and the nominal peripheral units (sensors & actuators); switch on the redundant computer and a minimum of redundant peripherals; then start from scratch and put the spacecraft in "attitude acquisition & safe hold" mode
  - For all the other alarms, the general rule is "to try to confirm the alarm before starting a recovery action", thanks to the "anomaly filtering process"
■ Some examples of the "anomaly filtering process"
  - Time redundancy at the system level
    • When a task (thermal control, attitude and orbit control system, etc.) triggers an alarm during a given iteration, it is checked whether the same alarm is still triggered during the next iteration(s) of this task
  - Comparison between sensors to confirm inconsistent data
    • Coupling, with dedicated algorithms, of linked data issued from the gyro sensors and from the star sensor
  - Start a BIST (Built-In Self Test) in the intelligent sensor which has issued the inconsistent data
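The "time redundancy at the system level" filter can be illustrated with a simple confirmation counter. This is a minimal sketch under assumptions: the class name `AlarmFilter` and the number of confirmations are hypothetical, and a real FDIR would tune the confirmation count per alarm.

```python
class AlarmFilter:
    """Confirm an alarm only if it is raised on N consecutive task iterations."""

    def __init__(self, confirmations):
        self.confirmations = confirmations  # assumed tuning parameter
        self.count = 0

    def update(self, alarm_raised):
        # A single iteration without alarm clears the history: an isolated
        # transient (e.g. one corrupted acquisition) is filtered out.
        self.count = self.count + 1 if alarm_raised else 0
        return self.count >= self.confirmations


if __name__ == "__main__":
    thermal_filter = AlarmFilter(confirmations=3)
    for raised in (True, False, True, True, True):
        print(raised, "->", thermal_filter.update(raised))
```

Only the last iteration in the usage example confirms the alarm, because it is the first time the alarm has been present three iterations in a row.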
2-B – Links
■ Avionics bus: e.g. MIL-STD-1553B data bus
  - Detection: parity bit (or checksum, or CRC)
  - Recovery: it is the responsibility of the higher level (e.g. software application level) to decide the best suited strategy wrt the application context => a "retry" is usually done (i.e. the message is retransmitted)
■ For sensors
  - Complex sensor: same as for an avionics bus
  - Simple sensor: triplication (e.g. all thermistors on the CNES SPOT satellite family), time redundancy
■ For some actuators, protection with the "Arm & Fire" concept
  - Such a command requires a first signal (Arm) then a second signal (Fire) sent by a distinct path, both being ANDed; typically used for pyro elements
■ For HSSL (High Speed Serial Link)
  - For image data, retries are not possible (too much data to buffer)
  - Thus, the usual strategy is to select or design an HSSL with a very good BER performance, i.e. a very low bit error rate (SEE robustness), and no protection is implemented
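The detection-plus-retry strategy described for avionics buses can be sketched as follows. This is an illustration only: MIL-STD-1553B itself relies on a parity bit per word, whereas this sketch uses a CRC-32 from Python's `zlib` as a stand-in for the detection code; `frame`, `check`, `send_with_retry` and `flaky_channel` are hypothetical helper names.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(received: bytes):
    """Return the payload if the CRC matches, else None (error detected)."""
    payload, crc = received[:-4], received[-4:]
    return payload if zlib.crc32(payload).to_bytes(4, "big") == crc else None


def send_with_retry(payload, channel, max_tries=3):
    """Higher-level recovery: retransmit until the check passes."""
    for attempt in range(1, max_tries + 1):
        received = check(channel(frame(payload)))
        if received is not None:
            return received, attempt
    return None, max_tries  # give up: escalate to the FDIR


def flaky_channel(corrupt_first_n):
    """Simulated link that flips one bit in the first N transmissions."""
    state = {"sent": 0}

    def channel(data: bytes) -> bytes:
        state["sent"] += 1
        if state["sent"] <= corrupt_first_n:
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data

    return channel


if __name__ == "__main__":
    print(send_with_retry(b"TM packet", flaky_channel(1)))
```

The retry count is bounded so that a permanent fault on the link is reported upward instead of blocking the application.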
2-C – Memory units
Comparison of the efficiency of several protection codes
Contribution from: A. Peus (CNES - DCT/SB/PS)

Capability                           | Code                                                     | Data bits  | Check bits
Detect 1 error                       | Parity                                                   | 32         | 1
Detect/correct 1 error               | Hamming                                                  | 32 (= 2^5) | 6
Detect 2 errors and correct 1 error  | Extended Hamming                                         | 32 (= 2^5) | 7
Detect/correct MBU                   | Reed-Solomon (correct 2 symbols of 4 bits per word)      | 32         | 16

■ TMR: critical data are tripled
■ Other methods for detection only, mainly usable for protection of a block of data: checksum, CRC, signature (check-field sizes shown on the slide: 8, 32 and 2 x 32 bits)
■ Other labels on the slide: "Not implemented", "No critical data"
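The check-bit counts in the comparison above can be reproduced with the standard sizing rules: a single-error-correcting Hamming code needs the smallest r such that 2^r >= m + r + 1 for m data bits, the extended (SEC-DED) version adds one overall parity bit, and a Reed-Solomon code correcting t symbols needs 2t check symbols. The function names below are illustrative only.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def extended_hamming_check_bits(data_bits: int) -> int:
    """SEC-DED: Hamming check bits plus one overall parity bit."""
    return hamming_check_bits(data_bits) + 1


def reed_solomon_check_bits(correctable_symbols: int, symbol_bits: int) -> int:
    """A Reed-Solomon code needs 2*t check symbols to correct t symbols."""
    return 2 * correctable_symbols * symbol_bits


if __name__ == "__main__":
    print("Hamming, 32 data bits:", hamming_check_bits(32))
    print("Extended Hamming, 32 data bits:", extended_hamming_check_bits(32))
    print("Reed-Solomon, 2 symbols of 4 bits:", reed_solomon_check_bits(2, 4))
```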
The 1st generation of CNES solid-state mass memory
Flying on the SPOT 4 satellite and the VEGETATION payload
Sextant Avionique (now Thales Alenia Space) and Dassault Electronique (now Thales Aerospace) development (1995)
  - 4 Mbits / DRAM; 16 DRAMs / hybrid
  - 8 hybrids / memory board; 512 Mbits / memory board
  - 18 memory boards / unit; 9 Gbits / unit
  - Unit = 37 kg; 28 W in hold mode; 50 W in R/W mode
2-D – Processing units / Fault-tolerant architectures
A general remark
■ Comparators and voters are usually implemented in an FPGA / ASIC which is either insensitive to SEE by design (D-FF triplication, etc. => thus COTS are usable) or built in a radiation-tolerant technology
[Figure: a prototype of a QUADRUPLEX computer from Matra Marconi Space (now EADS Astrium France), developed for the former HERMES European shuttle project (1994)]
2-D – Processing units / Fault-tolerant (FT) architectures
■ Time replication
  - Time replication at instruction level – Example of Time-TMR from SPACE MICRO Inc.
  - Granularity for CNES FT architectures
  - Time replication at task level – Example of DMT from CNES
■ Structural duplex – Example of DT2 from CNES
■ TMR-Triplex & QMR-Quadruplex – Examples issued from the SHUTTLE, GUARDS and ATV
■ Micro-synchronized triplex – Example of SCS750 from MAXWELL Tech.
■ FT architectures trade-off
■ Other methods and elementary protection mechanisms
Time replication
■ Principle
  - No hardware replication => no extra recurring cost
  - The same software is processed N times successively on the same CPU
  - Detection capability: the results of the different replicas are compared
■ Time replication at instruction level
  - See the talk "Software hardening" by Politecnico di Torino
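The principle above (run the same software N times on one CPU, then compare) can be sketched generically. This is an illustrative sketch, not one of the architectures presented in this talk: `run_replicated` and `make_task` are hypothetical names, and the upset is injected artificially.

```python
from collections import Counter


def run_replicated(task, inputs, n=3):
    """Run `task` n times in sequence and compare the results."""
    results = [task(inputs) for _ in range(n)]
    value, votes = Counter(results).most_common(1)[0]
    if votes == n:
        return value, "ok"
    if votes > n // 2:
        return value, "corrected"     # a strict majority masks the upset
    return None, "error detected"     # e.g. n = 2: detection only


def make_task(upset_on_call=None):
    """Toy computation with an optional simulated bit flip on one call."""
    calls = {"n": 0}

    def task(x):
        calls["n"] += 1
        result = x * x + 1
        if calls["n"] == upset_on_call:
            result ^= 0x1             # simulated SEU on the result
        return result

    return task


if __name__ == "__main__":
    print(run_replicated(make_task(), 3))
    print(run_replicated(make_task(upset_on_call=2), 3))
    print(run_replicated(make_task(upset_on_call=2), 3, n=2))
```

With two replicas a mismatch can only be detected; from three replicas onward a single corrupted result can also be outvoted.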
■ Time replication at instruction level: real case example of an industrial development
TTMR – Time-TMR (Space Micro Inc. – USA)
[Figure, adapted from © IEEE – Space Micro Inc.: three panels.
 (1) TMR architecture: three CPUs (#1, #2, #3) each execute lines A, B, C of a piece of code in time-slots T=1 to T=3, with voting logic in hardware.
 (2) Time redundancy architecture: a single CPU executes A1, A2, A3 then votes A1-A2-A3 (T=1 to T=4), then B1, B2, B3 and vote (T=5 to T=8), then C1, C2, C3 and vote (T=9 to T=12).
 (3) One TTMR possibility, with a weakness: on a VLIW DSP, ALU #1, #2 and #3 execute the three copies of each instruction in the same clock cycle (A at T=1, B at T=2, C at T=3), followed by the votes at T=4; a single SEU in the shared logic of the IC (MMU, cache, clock control, control logic, bus interface, bus control, parallel I/O) can affect the replicas.]
TTMR – Time-TMR (Space Micro Inc. – USA) (cont.)
[Figure, adapted from © IEEE – Space Micro Inc.: two panels.
 (1) TTMR architecture: the three copies of each instruction are spread over different ALUs and different clock cycles (e.g. A1 on ALU #1 at T=1, A2 on ALU #2 at T=2, A3 on ALU #3 at T=3), followed by the votes at T=4.
 (2) Improved TTMR architecture: only 2 instructions (A1 on ALU #1, A2 on ALU #2) are repeated 100% of the time and compared (A1-A2) with a "free" branch; the third instruction is not required 99% of the time; when there is NO match, instruction A3 is completed on ALU #3 with an additional compare (A1-A3).]
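The idea of the improved scheme (two copies always, a third copy only on mismatch) can be illustrated at function level. This is only a sketch of the principle as read from the figure, not Space Micro's patented post-compiler; `ttmr_execute` and `faulty_adder` are hypothetical names.

```python
def ttmr_execute(instr, operands):
    """Run two copies; run a third copy only when the first two disagree.

    Returns (result, number_of_executions).
    """
    a1 = instr(*operands)
    a2 = instr(*operands)
    if a1 == a2:                      # common case: no third execution
        return a1, 2
    a3 = instr(*operands)             # tie-breaker
    if a3 == a1 or a3 == a2:
        return a3, 3
    raise RuntimeError("no two copies agree")


def faulty_adder(upset_on_call=None):
    """Toy 'instruction' with an optional simulated upset on one call."""
    calls = {"n": 0}

    def add(x, y):
        calls["n"] += 1
        result = x + y
        if calls["n"] == upset_on_call:
            result ^= 0x1
        return result

    return add


if __name__ == "__main__":
    print(ttmr_execute(faulty_adder(), (2, 3)))
    print(ttmr_execute(faulty_adder(upset_on_call=2), (2, 3)))
```

The average overhead stays close to a duplication, since the third execution is paid for only when an upset actually occurs.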
TTMR pros/cons (cont.)
■ Proprietary architecture
  - Space Micro Inc. patent
■ Dedicated to VLIW DSP (Very Long Instruction Word - Digital Signal Processor)
  - Given that the ALUs are generally not all fully used, not too much time is lost due to the time replication
■ The TTMR algorithm is coded into a "post-compiler"
  - All the know-how lies in the "post-compiler": instruction replication + vote insertion + instruction-to-ALU assignment + instruction reordering to avoid empty slots
  - The "post-compiler" must be developed for each targeted DSP
■ The SEFIs are handled by a patented rad-hard watchdog circuit
Granularity for CNES DMT and DT2 fault-tolerant architectures
■ Granularity deeply impacts the definition and the latency/overhead of FT mechanisms
■ Coarse-grained granularity (macro-granularity) => task operational cycle
  - The checking procedure runs at the end of each iteration of each task
  - A low number of data to check => minimisation of overheads
  - The main fault-containment region
  - Only the output data must be checked (but not the huge number of local data)
[Figure: simple example of static scheduling. One iteration (RTC, Real Time Cycle) of a three-task flight software in a platform computer, started by the real-time interrupt: Task A (OBT), Task B (AOCS), Task C (Thermal, with preemption) and a background task; the macro-granularity corresponds to one task operational cycle.]
OBT = On-Board Time; AOCS = Attitude and Orbit Control System
DMT - Duplex Multiplexed in Time (CNES – Fr)
[Figure: Processing Unit Core (PUC) without DMT and with DMT. In both cases the nominal computer contains a memory with EDAC, a µP and a companion chip (CC) connected to the Acq/Cmd I/O bus; with DMT, the CESAM function is added to the CC. The redundant computer is switched off in a cold-redundancy strategy.]
■ CC = Companion Chip (watchdog, timers, interrupt controller, I/O controller, …)
■ CESAM segments the memory in order to monitor access rights:
  - 1/ Avoid fault propagation between virtual channels
  - 2/ Secure the context data even if the µP is faulty
■ CESAM works as a Block Protection Unit (of a Memory Management Unit) with specific mechanisms
DMT – Scheduling and fault detection principle
[Figure: timeline of iteration #i of a given task #T on the PUC.
 Without DMT: acquisitions #i, then processing + commands generation #i.
 With DMT:
 - In phase: acquisitions #i (if protection of sensors and acquisition electronics is required: two acquisitions followed by an acquisition comparison)
 - Processing phase: processing #i on virtual channel VC#1, then processing #i on VC#2. No in-out during processing: all results (CMD, CTXT, PARAM) are stored in the temporary tables of VC#1 and of VC#2 respectively
 - Out phase: results comparison, then results generation #i]
DT2 - Double Duplex Tolerant to Transients (CNES – Fr)
[Figure: Processing Unit Core (PUC) without DT2 and with DT2. Without DT2, the nominal computer contains a memory with EDAC, a µP and a companion chip (CC) connected to the Acq/Cmd I/O bus. With DT2, the nominal computer contains two cores, PUC#1 and PUC#2, each with its memory, EDAC, µP and CC + CESAM, both connected to SYCLOPES, which drives the Acq/Cmd I/O bus and raises the Error signal. The redundant computer is switched off in a cold-redundancy strategy.]
■ CC = Companion Chip (watchdog, timers, interrupt controller, I/O controller, …)
■ CESAM: monitoring of memory access rights
■ SYCLOPES:
  - 1/ Macro-synchronization on each I/O request
  - 2/ Comparator
  - 3/ Input/output controller
DT2 – Scheduling and fault detection principle
[Figure: timeline of iteration #i of a given task #T.
 Without DT2: acquisitions #i, then processing + commands generation #i on the PUC.
 With DT2, PUC#1 and PUC#2 run in parallel and SYCLOPES sits between them and the I/O:
 - In phase: each PUC issues a request for acquisitions (input request) and waits; SYCLOPES macro-synchronizes and compares the two requests, performs acquisitions #i and acknowledges
 - Processing phase: each PUC runs processing #i. No in-out during processing: all results (CMD, CTXT, PARAM) are stored in temporary tables
 - Out phase: each PUC issues a request for results generation (output request) and waits; SYCLOPES macro-synchronizes and compares the results, performs results generation #i and acknowledges]
Problem of recovery with a duplex
■ A duplex is able to detect
  - Comparison
■ A duplex is not able to recover
  - No information is available for determining which is the healthy/faulty channel (unlike a triplex architecture)
=> A duplex is intrinsically a "fail-stop" architecture
=> Specific mechanisms are required for implementing a recovery with a duplex architecture
Example of a DT2 backward recovery
[Figure: two timelines.
 Nominal timing for PUC#1 and PUC#2: on each real-time interrupt, iterations #i-1, #i, #i+1 each save their context and issue their commands (Cmd #i-1, Cmd #i, Cmd #i+1).
 Backward recovery: an SEU hits PUC#1 during iteration #i; SYCLOPES reports NOK at the comparison; both PUCs restart iteration #i from the saved context; the second comparison is OK and Cmd #i is issued.]
■ 1/ Timing margin for recovery if static task scheduling
■ 2/ Detection
■ 3/ Stop & reset & rollback signal
■ 4/ No data communication between PUCs
Two main conditions to recover successfully
■ The context data, which is the basis of the recovery, must be healthy
  - The memory is considered SEE-free, thanks to an EDAC
  - A completely crashed µP must not be able to erroneously write in the memory zone where all the context data is stored
    • Thanks to CESAM, which checks the memory access rights
    • The final location of the context data is updated only after the comparison of all results, and only if 100 % of the results (CMD + CTXT + PARAM) are healthy
■ A completely crashed / hung µP must be detected, and a warm restart of the software must be done
  - A µP crash or hang will be detected
    • By several mechanisms, e.g. memory access right monitoring
    • In the DT2: by the very short timeout monitoring each macro-synchro request
    • In the DMT: at least by the usual watchdog timer
  - A µP reset passivates a SEFI
  - The software warm restart is possible thanks to the healthy context
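The "commit only after a successful comparison" rule and the resulting backward recovery can be sketched in a few lines. This is a software-only illustration under assumptions, not the DMT or DT2 implementation: `run_iteration` and `make_task` are hypothetical names, the two channels are simply two successive calls, and the CESAM/EDAC hardware protection of the context is replaced by working on deep copies.

```python
import copy


def run_iteration(context, inputs, task, channels=2, max_retries=1):
    """Execute one task iteration on duplicated channels.

    Each channel works on a private copy of the context (temporary tables).
    The caller's context is never modified here: the new context is returned
    for commit only when all channels agree on 100 % of the results.
    """
    for attempt in range(max_retries + 1):
        results = [task(copy.deepcopy(context), inputs) for _ in range(channels)]
        if all(r == results[0] for r in results):
            commands, new_context = results[0]
            return commands, new_context, attempt
        # Mismatch: rollback is implicit, the saved context is untouched,
        # so the iteration is simply replayed from it.
    raise RuntimeError("persistent mismatch: escalate (reset, power cycle)")


def make_task(upset_on_call=None):
    """Toy control task with an optional simulated upset on one call."""
    calls = {"n": 0}

    def task(ctx, inputs):
        calls["n"] += 1
        ctx["integrator"] += inputs
        command = 2 * ctx["integrator"]
        if calls["n"] == upset_on_call:
            command ^= 0x4            # simulated SEU on a result
        return command, ctx

    return task


if __name__ == "__main__":
    saved = {"integrator": 10}
    cmd, new_ctx, retries = run_iteration(saved, 5, make_task(upset_on_call=1))
    print(cmd, new_ctx, retries, saved)
```

In the usage example the first attempt mismatches, the replay from the untouched context succeeds, and only then is the new context handed back for commit.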
DMT / DT2 pros/cons
☺ Agency proprietary architectures
  - Available to every company
  - Open and scalable architectures
    • Possibility to implement evolutions
    • Possibility to select a subset of the validated mechanisms
☺ Generic architectures independent from the microprocessor choice
  - DSP or general purpose µP, single or multi-core, superscalar or not, VLIW or not
  - No new development required when used on a new microprocessor
☺ A single know-how for a two-fold architecture
  - Same general principles for DMT and DT2 => one development for two different implementations, compatible with a larger part of potential applications
☺ Low cost architectures
☹ Error coverage rate lower than that of a triplex architecture …
☺ … nevertheless sufficient for payloads
TMR-Triplex & QMR-Quadruplex architecture
[Figure: four channels CPU1 to CPU4, each with its own power supply (Pw), clock generator (Ck), memory and µP, linked by the ICN to four voters V1 to V4; each voter feeds an I/O bus controller (BC, with its own Pw and Ck) connected to the nominal (N) and redundant (R) IO-Bus. The redundant channel can be switched off in a cold-redundancy strategy; some elements are options that can be switched off.]
Legend: ICN = Inter-Channel Network; BC = I/O Bus Controller; IO-Bus = e.g. MIL-STD-1553; (N) = Nominal; (R) = Redundant; Pw = Power supply; Ck = Clock generator
  - The ICN allows several-round interactive consistency exchanges, to be robust to byzantine faults
  - There are a lot of implementation possibilities, depending on robustness and mission requirements
■ Detection is done by the majority vote
■ Recovery in two steps
  - Fault-masking: the channels continue the processing for a short period of time; the results of the faulty channel are continuously masked thanks to the healthy data issued by the healthy channels => all commands and actuations will be correct
  - Channel alignment: the faulty channel is reinserted later, because the realignment takes a long time
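Detection by majority vote and fault-masking can be illustrated with a minimal voter that returns both the voted value and the list of disagreeing channels (the candidates for a later channel alignment). This is a sketch only: `majority_vote` is a hypothetical name, and a real voter would be implemented in hardened logic rather than in software.

```python
from collections import Counter


def majority_vote(outputs):
    """Vote over {channel_name: value}.

    Returns (voted_value, faulty_channels). Raises if no strict majority
    exists, e.g. a 2-2 split in a quadruplex.
    """
    value, votes = Counter(outputs.values()).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("no majority")
    faulty = sorted(ch for ch, v in outputs.items() if v != value)
    return value, faulty


if __name__ == "__main__":
    # Triplex: CPU2 is hit by an upset, its result is masked by the vote.
    print(majority_vote({"CPU1": 0b00, "CPU2": 0b01, "CPU3": 0b00}))
```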
TMR & QMR (cont.)
Based on publications from the GUARDS European R&D study (LAAS-CNRS, EADS Astrium France, Technicatome, Siemens, etc.), the pre-development of the HERMES project (a European shuttle project cancelled in the early 1990s) and the ATV development
Support from: J-P. Blanquart (EADS Astrium France)
■ The vote may be distributed to avoid an SPF (Single Point Failure)
  - Multiple voters must look like a single virtual voter => ICN (Inter-Channel Network) for data exchanges between voters
  - For the ICN, a "broadcast bus" avoids propagating a common fault (e.g. a "stuck at" at the bus level) to every data item, which would not be detectable
  - Note: in the "Byzantine theory", a "broadcast bus" also includes a protocol that makes the interactive consistency of data robust to byzantine faults
  - Multiplexed bus generating an SPF at bus level
    [Figure: voters V1, V2, V3 share one multiplexed bus (each with Tx and Rx) whose LSB is stuck at '1' at bus level. '00' is the correct value; every voter holds its own '00' but receives '01' from the two others => majority vote result = '01' ☹]
  - Broadcast bus
    [Figure: voters V1, V2, V3 are linked by an ICN broadcast bus (each with one Tx and separate Rx). A definitive failure "LSB stuck at '1'" affects a single receiver (Rx1), so each voter sees at most one corrupted '01' among its three values. '00' is the correct value => majority vote result = '00' ☺]
TMR & QMR (cont.)
=> Nevertheless, some faults are able to corrupt only one received data item
  • Voltage and clocks at marginal level
  • Faulty connectors: trouble at contact level
  • Physical damage
  • Electrical noise: cross-talk, EMI (Electro-Magnetic Interference)
  • Cosmic rays: upset
  • etc.
=> These faults are named "byzantine faults"
  • L. Lamport, R. Shostak and M. Pease (SRI International), "The Byzantine Generals Problem", ACM Transactions on Programming Languages and Systems, vol. 4, no. 3, July 1982, pp. 382-401
TMR & QMR (cont.)
■ The vote may be distributed to avoid an SPF
  - Multiple voters mean multiple decisions!
  - => The decisional algorithm must allow each healthy voter to take the same decision, and obviously the correct one
    • A lot of PhD works and R&D studies
=> Specific decisional algorithms make it possible to be robust to byzantine faults
  • Several problems: byzantine agreement, interactive consistency, unitary reliable broadcast
  • For "f" faults, it requires at least "3f+1" channels, "2f+1" disjoint links, and "f+1" exchange rounds
For QMR: with "f=1" fault, the problem has a solution with 4 channels, 3 disjoint links (ICN) and 2 rounds
For TMR: with "1" fault, the problem has a solution with 3 channels, 3 disjoint links (ICN) and 2 rounds only if we add an authentication capability, which can be: "a message relayed during the 2nd round cannot be corrupted without the corruption being seen" (thanks to a CRC)
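The sizing rule quoted above (3f+1 channels, 2f+1 disjoint links, f+1 rounds) is easy to tabulate. The helper below is just the arithmetic of that rule for the unauthenticated case; the function name is illustrative.

```python
def byzantine_requirements(f: int) -> dict:
    """Minimum resources to tolerate f byzantine faults without authentication."""
    return {
        "channels": 3 * f + 1,
        "disjoint_links": 2 * f + 1,
        "rounds": f + 1,
    }


if __name__ == "__main__":
    for f in (1, 2):
        print(f, byzantine_requirements(f))
```

For f = 1 this gives the quadruplex case of the slide: 4 channels, 3 disjoint links and 2 rounds; the triplex only fits because the added authentication (the CRC on relayed messages) relaxes the 3f+1 bound.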
TMR & QMR (cont.)
■ The clock generator may be distributed to avoid an SPF
  - It requires voting and resynchronizing the local clocks regularly => thanks to a cyclic exchange of a specific synchronization message through the ICN
    • A lot of PhD works and R&D studies; a lot of algorithms have been proposed
    • Different implementations: HW, SW, mixed
  - Example with a mixed HW/SW implementation (cf. HERMES project):
    • A: Once per RTC, send a "synchro message" and date it
    • B: Are all "synchro messages" received? If no, keep waiting; if yes, continue
    • C: Read the datation (timestamp) of each "synchro message"
    • D: Run the FT algorithm to select the "best date" (an example of FT algorithm for voting the three "synchro messages" is given on the next slide)
    • E: Tune the local RTC period with the difference between the date of its own "synchro message" and the "best date"
[Figure: channel #1 with its µP, its local RTC CLK generator (with tuning input), the datation logic and the detection of the "sync msg" on ICN#1-out, ICN#2-in, ICN#3-in and ICN#4-in; steps A to E are mapped onto these blocks.]
RTC = Real-Time Cycle; CLK = Clock; "sync msg" = "synchro message"
TMR & QMR (cont.)
    • Compute the difference between the "best date" and the "local physical date" to tune the local RTC period generator
=> The FT algorithm is the median value
[Figure: time stamps of the "synchro messages" A, B and C exchanged through the ICN, for cases labelled 1 to 8, with the median value ("best date") issued by the FT algo. Legend: allowed jitter period; correct values (w/o fault); faulty values]
=> In the presence of any fault configuration, the bounds of the "best date" are inside the allowed jitter period ☺
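The median vote on the time stamps can be sketched as follows. This is a minimal illustration, not the HERMES implementation: the function names, the integer time unit and the sign convention of the correction ("best date" minus own date) are assumptions.

```python
from statistics import median


def best_date(time_stamps):
    """Median of the time stamps read from the received synchro messages."""
    return median(time_stamps)


def period_correction(own_time_stamp, time_stamps):
    """Signed offset between the voted best date and this channel's own date.

    The sign convention (best date minus own date) is an assumption.
    """
    return best_date(time_stamps) - own_time_stamp


# Three synchro messages; the third one comes from a faulty channel.
received = [1000, 1003, 250000]
print(best_date(received))                # 1003
print(period_correction(1000, received))  # 3
```

With three values and a single faulty one, the median always lies between the two healthy values, whatever the faulty value is, which is the property the slide summarizes with the allowed jitter period.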
TMR & QMR (cont.)
■ Recovery – State Restoration (SR)
  ! A full context (content of µP's registers/caches + external main mem.) coming from one of the healthy channels is loaded into the faulty channel
    • This "transfusion" takes a very long time (several RTCs)
    • Thus, the recovery is not started just after the detection of an error => the masking capability of the TMR/QMR is exploited to wait for an adequate time at which the computer can be switched to a minimal mode
    • When the time is adequate, the SR starts: the computer is switched to a minimal mode (i.e. it runs only critical tasks) to have a maximum of bandwidth (buses + µP) for this "transfusion" and to reduce the evolution of the memory content
    • Such a "transfusion" is not so easy: even if the computer is switched to a minimal mode, the memory content of the healthy channels is continuously evolving
      – Lots of PhD works and R&D studies, lots of algorithms have been proposed
      – Different implementations: HW, SW, mixed
TMR & QMR (cont.)
  ! One real case example (cf. GUARDS):
    • Example with a mixed HW/SW implementation
      – Split the memory into K segments (e.g. 1 kb segments); one HW tag (e.g. one D-FF) is associated with each segment, and an NST (Number of Segments to Transfer) counter counts the number of segments still to be transferred
      – Start SR: reset tags, and preset NST to the total number of segments (= K)
      – Start the first scan of the memory from the first segment to the last segment K: transfer one segment, then set its associated tag, then decrement NST, then let the application (critical tasks) write into the memory if required; if not, continue the scan
      – Each time the critical tasks write data into the memory, the associated tag is reset and NST is incremented
      – After the transfer of the last segment having a reset tag:
        - if NST > threshold, then start a new scan to transfer only the segments having a reset tag
        - if NST ≤ threshold, then stop the application (all tasks) in order to complete the "transfusion" in a single shot; after that, the SR is completed and normal processing can be resumed
TMR & QMR (cont.)
[Figure: worked example of the State Restoration scans on a memory of K = 10 segments with Threshold = 2; each segment is shown with its tag, set by a scan and reset by a write]
Start SR: reset tags, preset NST with K, then start scan
First scan: start scan; 6 x transfer + 2 x write, 4 x transfer + 1 x write
Second scan: start scan; 3 x transfer + 2 x write
Complete SR: start last scan; 2 x transfer
NST counter values shown along the sequence: 10, 4, 6, 3, 2, 0
At the end of the 1st scan and at the end of the 2nd scan: if NST > threshold then start a new scan; if NST ≤ threshold then complete SR (stop all the tasks to complete the transfusion in a single shot)
End of SR (transfusion)
Legend: segment (M ko) with data not yet transferred; segment with data already transferred
NST = Number of Segments to Transfer; K = Number of segments in the memory = 10; Threshold = 2; SR = State Restoration
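The tag/NST procedure of the GUARDS example can be simulated in a few lines. This is a sketch under stated assumptions, not the GUARDS implementation: the function `restore_state` and the `write_plan` structure are hypothetical, the critical-task writes are modelled as a fixed plan per scan, and a write is assumed to increment NST only when the tag of the segment was set (otherwise the segment is already counted).

```python
def restore_state(k, threshold, write_plan):
    """Simulate State Restoration with one tag per segment and an NST counter.

    write_plan[i] maps a scan position to the segments written by the critical
    tasks right after that position is processed during scan i (hypothetical
    model of the application writes). Returns (scans, transfers, final NST).
    """
    tags = [False] * k      # False = data of this segment still to be transferred
    nst = k                 # Number of Segments to Transfer
    scans = 0
    transfers = 0
    while True:
        writes = write_plan[scans] if scans < len(write_plan) else {}
        scans += 1
        for seg in range(k):
            if not tags[seg]:
                tags[seg] = True      # transfer the segment, then set its tag
                nst -= 1
                transfers += 1
            for target in writes.get(seg, []):
                if tags[target]:      # assumption: count a segment only once
                    tags[target] = False
                    nst += 1
        if nst <= threshold:
            break
    # All tasks stopped: the remaining segments are transferred in a single shot.
    for seg in range(k):
        if not tags[seg]:
            tags[seg] = True
            nst -= 1
            transfers += 1
    return scans, transfers, nst


# K = 10 and threshold = 2 as in the figure; the write plan is an assumption
# chosen to give 2 + 1 writes in the first scan and 2 writes in the second.
print(restore_state(10, 2, [{5: [0, 1], 9: [3]}, {3: [0, 1]}]))  # (2, 15, 0)
```

With this plan NST goes 10, 4, 6, 2, 3 during the first scan, 0 then 2 during the second one, and 0 after the final single-shot transfer. Because the plan is finite, the loop always terminates: a scan without any write leaves NST at 0.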
TMR & QMR (cont.)
■ Miscellaneous issues
  ! Vote can be on a bit-to-bit basis or versus a threshold
    • Vote done on a bit-to-bit basis (simpler): since the acquisition of duplicated/triplicated sensors is done asynchronously by each channel, an alignment of input data is required, specifically for analog acquisitions => a two-round interactive consistency exchange over the ICN must be done to find an agreement on bit-to-bit common values (for example the median value)
    • Vote not done on a bit-to-bit basis (more complex): the alignment of duplicated/triplicated input data is no longer required. This method generates difficulties because data may not have exactly the same values between channels.
      > For data checked versus a threshold (e.g. temperature monitoring): even if the channels are healthy, output data may not reach the threshold value at exactly the same RTC => these values require a time-filtering process over several RTCs
      > For other data: the difference between voted data must lie within a defined interval
  ! If SW is very asynchronous, or if some non-critical tasks are not triplicated (asymmetrical processing), data to be voted may not be sent in the same order by all channels; the data must then be stamped at the source (by each CPU), and each voter must re-order the data so as to vote only data of the same kind
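The two non-bit-to-bit checks can be sketched as below. This is one possible reading of the slide, not a documented implementation: the function names, the majority rule and the "n consecutive RTCs" filter are assumptions.

```python
def agree_within(values, tolerance):
    """Interval check: the channel values may differ, but by at most tolerance."""
    return max(values) - min(values) <= tolerance


def persistent_disagreement(flags_per_rtc, n_rtc):
    """Time filtering of threshold flags over several RTCs.

    flags_per_rtc holds one tuple of booleans per RTC (one flag per channel,
    True = threshold reached). A channel is reported only if its flag differs
    from the majority during n_rtc consecutive RTCs.
    """
    streaks = [0] * len(flags_per_rtc[0])
    suspects = set()
    for flags in flags_per_rtc:
        majority = sum(flags) * 2 > len(flags)
        for channel, flag in enumerate(flags):
            streaks[channel] = streaks[channel] + 1 if flag != majority else 0
            if streaks[channel] >= n_rtc:
                suspects.add(channel)
    return suspects


# Channel 2 reaches the threshold one RTC later than the others: tolerated.
late = [(False, False, False), (True, True, False), (True, True, True)]
print(persistent_disagreement(late, n_rtc=2))   # set()

# Channel 2 never reaches the threshold: reported after 2 RTCs.
stuck = [(True, True, False)] * 3
print(persistent_disagreement(stuck, n_rtc=2))  # {2}
```

The filter length trades detection latency against false alarms caused by the asynchronism between channels.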
TMR & QMR (cont.)
■ One example of a triplex scheduling among others
Channel software: Channels #1, #2 and #3 run the same software, but in an asynchronous way between channels => a macro-synchronization is done on each I/O request
[Figure: timeline of iteration #i of a given task #T with TMR, identical for Channel#1, Channel#2 and Channel#3: Acquisitions #i, ACQ consolidation, Processing #i, CMD vote, CMD #i. The time axis is divided into an In phase, a Processing phase and an Out phase. Labels along the timeline: macro-synchronization, agreement on bit-to-bit common values, macro-synchronization]
Processing #i: no ACQ-in & no CMD-out: all CMD are generated, stored in temporary tables
Data movements:
  1 – Read sensors -> ACQ
  4 – Write CMD to actuators
  5 – Two-round interactive consistency exchange over the ICN
  6 – Single-round exchange
TMR & QMR pros/cons (cont.)
■ Specificities / constraints
  ! Architecture pertaining to the distributed computing domain
  ! Architecture requiring the highest level of theoretical analysis
  ! Architecture generating an incredible number of theoretical studies (PhD, R&D, …), and a lot of different implementations depending on the user needs and system requirements
■ Pros/cons
  ☺ The best level of error coverage + masking capability (delayed recovery), well suited to some kinds of applications
  ☹ Overheads:
    • Mass
    • Recurring cost (extra ICs)
    • Power consumption
    • Complexity
2-D – Processing units / Fault-tolerant (FT) architectures
■ Time replication
  ! Time replication at instruction level – Example of Time-TMR from SPACE MICRO Inc.
  ! Granularity for CNES FT architectures
  ! Time replication at task level – Example of DMT from CNES
■ Structural duplex – Example of DT2 from CNES
■ TMR-Triplex & QMR-Quadruplex – Examples issued from the SHUTTLE, GUARDS and ATV
■ Micro-synchronized triplex – Example of SCS750 from MAXWELL Tech.
■ FT architectures trade-off
■ Other methods and elementary protection mechanisms
Micro-synchronized triplex architecture ("lock-stepping")
[Figure: nominal (N) and redundant (R) computers, each with three µP, a voter, Mem, BC, Pw and Ck, connected to IO-Bus (N) and IO-Bus (R); the redundant computer is switched off in a cold-redundancy strategy]
BC = I/O Bus Controller; IO-Bus = e.g. MIL-STD-1553; (N) = Nominal; (R) = Redundant; Pw = Power supply; Ck = Clock generator
■ All the µPs execute the same instruction at exactly the same clock cycle
■ It requires a µP having a lock-stepping capability (e.g. synchronization of internal clock generators, bus comparators, …)
    • Very old µP: Intel Pentium & i960, IBM RH6000, Atmel three-chip ERC-32
    • Old µP: IBM PowerPC740/750
    • Recent µP: ARM Cortex-R family (dual-core)
Micro-synchronized triplex (cont.)
[Figure: six successive states of the three µPs (registers and caches), the voter and the main memory during an SEU and its recovery]
  1 – Nominal processing phase
  2 – SEU, then processing before detection (fault propagation through registers and cache)
  3 – Error detect => Start recov: Flush reg/caches… (few ms); the healthy µP context is mirrored into Mem
  4 – …+ invalidate caches + reset faulty µP… (few ms)
  5 – …+ load reg/conf reg…
  6 – …+ resume
Micro-synchronized triplex (cont.)
■ When an error is detected on 1 of the 3 µPs, a recovery phase is started
  ! Flush all the registers / caches of the µPs to the single main memory
    • Thanks to the masking capability of the voter, the data set written back in main memory is 100 % healthy => a full and healthy µP context is saved into memory
  ! Then invalidate the caches (i.e. reset the caches)
    • To force the µP to read back the main memory for all data without exception
  ! Then reset the faulty µP
  ! Then load the faulty µP registers (including configuration registers)
  ! Then start again the processing phase
    • The three µPs must read all their data in the main mem. (due to cache invalidation)
    • So the faulty µP will be "aligned" on the two healthy ones thanks to the healthy context mirrored into the external memory
=> µP alignment in 3 steps:
    • Flush = Ctxt mirrored
    • Cache invalidation
    • Resume: alignment is inherent
Alignment performance:
    • Flush = few ms
    • µP reset = few ms
    • Resume: processing is slowed down (cache empty)
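The masking capability of the voter relied on above can be illustrated with a bit-wise majority function. This is a common way to build a 2-out-of-3 voter and is given here as an assumption; the slides do not describe the internal logic of any particular voter, and the names `vote3` and `faulty_channel` are hypothetical.

```python
def vote3(a, b, c):
    """Bit-wise 2-out-of-3 majority of three bus words."""
    return (a & b) | (a & c) | (b & c)


def faulty_channel(a, b, c):
    """Index of the first µP whose word differs from the voted word, or None."""
    voted = vote3(a, b, c)
    for index, word in enumerate((a, b, c)):
        if word != voted:
            return index
    return None


# The third µP was hit by an SEU: the word written to memory is still healthy.
print(bin(vote3(0b1010, 0b1010, 0b0110)))      # 0b1010
print(faulty_channel(0b1010, 0b1010, 0b0110))  # 2
```

Since every flushed word goes through such a vote, the context written back to main memory is the healthy one as long as a single µP is faulty.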
Micro-synchronized triplex (cont.)
■ Real case example of an industrial development
SCS750 - Super Computer for Space (Maxwell Tech. – USA)
7 – 25 W (typ) depending on clock rate
[Photos: SCS750P Prototype Model; SCS750F Flight Model. © Maxwell Technologies]
Micro-synchro. triplex pros/cons
SCS750 (cont.)
■ SCS750 has been selected for one major space program and one mini-satellite
  ! Large satellites: GAIA (Europe)
  ! Mini-satellite: GLORY (USA)
    • NASA Earth sciences mission, 545 kg, lifetime 3 years (5 years goal), failed launch in 2011
■ SCS750 is a proprietary product
■ µ-synchro. archi. is dedicated to µP having a lock-stepping capability
  ☹ This capability is becoming obsolescent due to deep submicron techno
  ! TID effects => asymmetric modif. of internal propagation delays between µPs
  ! Fully deterministic timing is less and less feasible
  ! Low-level fix-up routines to tolerate timing violations and soft-errors
  ! Multiple and complex clock trees
  ! etc.
Nevertheless …
Micro-synchro. duplex
… Nevertheless
■ Recent automotive safety norm: ISO 26262 (adaptation of IEC 61508)
  ! "Functional safety standard" which stipulates regulations for HW and SW in electronic control systems to manage the risk of hazardous events
  ! ARM Cortex-R is oriented "Real-Time" for deeply embedded systems
    • with a focus on fast/deterministic response to interrupts, determinism (tightly-coupled memories) and safety/dependability (memory protection unit, ECC/parity, lock-step)
    • dual-core µP allowing implementation of a lock-step configuration to ease the compliance with ISO 26262
    • Texas Instr., LSI, Infineon, Fujitsu, Toshiba, Broadcom, …
    • on the other hand, the processing performance of the ARM Cortex-R family is significantly lower than that of e.g. the PowerPC family
Fault-tolerant architectures trade-off
■ There is no universal solution …
  ! Optimization is predominant over standardization
Real cases of COTS-based computers:
    • UCTM-C/D (ARIANE 5, first launch 1996) = Double structural duplex, recovery without context
    • ARGOS (launched in 1999) = EDDI / Time replication at instruction level
    • BIRD (2001) = Double structural duplex, specific recovery mechanisms
    • MYRIADE (2004) = Mix of elementary protection mechanisms
    • REIMEI (2005) = Macro-synchronized triplex with a single voter
    • ROADRUNNER (2006) = TTMR / Time-TMR at instruction level
    • CALIPSO (2006) = Lock-stepping quadruplex with a redundant voter
    • GLORY (2011) & GAIA (2013) = Lock-stepping triplex with a single voter
HiRel-based:
    • Shuttle (first launch 1981) = "4+1"-MR (QMR + 1 backup)
    • ATV (2008) = Triplex + Duplex
    • DMS-R (on ISS) = Triplex
■ … the final choice of the best suited architecture for a given project is application dependent
  ! Only 'detection', or 'detection and recovery'
  ! Hardware and software cost overhead
  ! Development and recurring cost overhead
  ! Power consumption overhead
  ! The time required for the recovery process
Fault-tolerant architectures trade-off (cont.)
a = Nb of main items µP + Mem + Asic => Mass
b = Nb of main items ON => Power consumption
c1 = Computing pwr available (detection only)
c2 = Computing pwr available (detect + recov.)
d1 = Availability
d2 = Correct actuations

Architecture                  a     b     c1 / c2      d1 / d2
DMT                           6     3     0.5 / 0.3    0.95 / 0.99
DT2                           10    5     1 / 0.7      0.99
Micro-synchronized triplex    10    5     1*           0.995
Triplex/quadruplex            12    9     1*           0.999
* = Requires time for computers alignment

[Figure: block diagrams of the four architectures, each connected to IO-Bus N and IO-Bus R. DMT: Mem, µP, Cesam. DT2: Mem1, Mem2, µP1, µP2, Cesam1+Cesam2+Syclopes. Micro-synchronized triplex: µP1, µP2, µP3, Voter, Mem. Triplex/quadruplex: Mem1 to Mem4, µP1 to µP4, Voter1 to Voter4, ICN. Redundant computer: switched off in cold-redundancy strategy]
Fault-tolerant architectures trade-off (cont.)
■ If we are designing a COTS computer for a payload of a micro-satellite (100 kg)
  ! We could choose e.g. a time replication architecture:
    • Some of the most constraining requirements are power consumption, and sometimes mass
    • The computing power requirement is, generally speaking, not too high
■ If we are designing a COTS computer for a manned spacecraft (very hypothetical!) such as the Shuttle, focusing on the short but critical re-entry phase
  ! We could choose e.g. a 4-MR (Quadruplex) architecture:
    • Human life issue: the higher the availability, the better suited the architecture
    • During the re-entry phase, there is no possible contact with the ground station: an architecture with masking capability is very well suited to the shortness of this very critical phase
■ If we are designing a COTS computer for a payload of a large scientific mission (1000 kg)
  ! We could choose e.g. a duplex with (or without) recovery capability or a micro-synchronized triplex:
    • The computing power requirement could be very high, requiring several payload computers
    • PowerPC7448 = 30 W max => a computer could reach 100 W => Several computers in parallel could reach 1000 W … instead of 30 – 40 W for a usual computer! Thus, there is a power issue
Other methods and elementary protection mechanisms
■ ABFT – Algorithm-Based Fault Tolerance
■ BIST – Built-In Self-Test
■ WDP – WatchDog Processor (signature analysis)
■ Wrappers
■ etc.
■ Mix of different elementary protection mechanisms
  ! For protection at component level: ASIC
    • e.g. ERC32 and LEON European space microprocessors
  ! For protection at the system level
    • e.g. The CNES MYRIADE micro-satellite => See Part III
3 – REAL CASE STUDIES
[Image © CNES / D. Ducros] PICARD: a CNES mission on a MYRIADE platform to take precise measurements of the Sun and of its variability; launched in 2010
The ATV example (with rad-hard ICs)
ESA Automated Transfer Vehicle servicing the ISS (1st launch in 2008)
Triplex + Duplex
Implementation mainly for failure robustness, but also usable for SEE robustness
    • The main monitoring and control computer FTC (triplex)
    • The checker computer MSU (duplex) monitoring the critical docking phase (collision avoidance)
Duplex goal: tolerance to software bugs
[Images: © ESA; © ESA / D. Ducros]
MYRIADE: a CNES µSL family developed mainly with COTS
Contribution from: J-L. Carayon (CNES - DCT/TV/AV)
■ TID: Switch-off sensitive ICs when not used
■ Protons: The Transputer µP is protected with a 2 mm tungsten shield
■ SEL: Serial resistors on power supply tracks or current limiter
■ SET: Filtering of analog acquisitions
  ! Time redundancy + average value computation
■ SEU: Protection of link/bus data exchanges
  ! Checksum/CRC and recovery protocols
■ SEU-SET: Flash and FRAM are protected
  ! Redundant data, checksum or CRC
  ! Flash and FRAMs are switched off after the boot of the flight software
■ SEU: FPGA with critical registers implemented with a TMR structure
■ SEU-MBU: TMR for critical data stored in the Transputer memory
  ! For flight software memory (4 Mbytes), not for TM memory (120 Mbytes)
■ SEU: Monitoring of some µP internal critical registers (timers, …)
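The "Checksum/CRC" protection of link/bus data exchanges listed above can be sketched with the CRC-32 from the Python standard library. This is only an illustration: the frame layout, the function names and the choice of CRC-32 are assumptions, since the slides do not state which CRC or protocol MYRIADE uses.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 to the payload (hypothetical layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if the CRC matches, else None (caller may ask for a resend)."""
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None


frame = make_frame(b"TM packet")
print(check_frame(frame))                    # b'TM packet'

# A single bit flipped on the link (e.g. by an SEU) is detected.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
print(check_frame(corrupted))                # None
```

A `None` result is where a recovery protocol would step in, for example by requesting a retransmission of the frame.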
MYRIADE (cont.)
■ SEU-SEL: Watchdog (WD) implemented with several levels
  ! Note: Each I/O block is constituted by a PIC nanocontroller and i/f ICs
  ! Internal PIC WD set to 100 ms: protection of PIC itself against SEL/µSEL or software hang due to a SEE (SEFI)
  ! Global WD for each I/O block set to 250 ms
  ! Local WD for Transputer CPU set to 500 ms
  ! Global WD for computer set to 1 sec with four levels of actions having deeper and deeper effect on the computer
    • Transputer reset
    • Transputer Off/On (in case of SEL)
    • CPU board Off/On (at this level, the Transputer memory content is lost)
    • Computer Off/On (in order to passivate any residual SEL)
=> MYRIADE is a typical example of a computer developed with commercial components and protected by a mix of elementary mechanisms for a mission without high availability requirements
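The four-level global watchdog can be pictured as an escalation ladder. The sketch below is a hypothetical model: the class name, the time handling and above all the escalation rule (one level deeper at each consecutive expiry, back to the first level once the software refreshes the watchdog) are assumptions, as the slide only lists the four actions and the 1 sec timeout.

```python
ACTIONS = [
    "Transputer reset",
    "Transputer Off/On",
    "CPU board Off/On",
    "Computer Off/On",
]


class EscalatingWatchdog:
    """Global watchdog with deeper and deeper actions (illustrative model)."""

    def __init__(self, timeout_s=1.0):
        self.timeout_s = timeout_s
        self.last_kick = 0.0
        self.level = 0

    def kick(self, now):
        """Called by healthy software: refresh the WD and reset the escalation."""
        self.last_kick = now
        self.level = 0

    def poll(self, now):
        """Return the action to apply if the WD has expired, else None."""
        if now - self.last_kick < self.timeout_s:
            return None
        action = ACTIONS[min(self.level, len(ACTIONS) - 1)]
        self.level += 1
        self.last_kick = now   # give the action one timeout period to take effect
        return action


wd = EscalatingWatchdog(timeout_s=1.0)
wd.kick(0.0)
print(wd.poll(0.5))   # None
print(wd.poll(1.0))   # Transputer reset
print(wd.poll(2.0))   # Transputer Off/On
```

The ordering follows the slide: each action has a deeper effect than the previous one, the last one power-cycling the whole computer to passivate any residual SEL.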
MYRIADE (cont.)
[Photos: MYRIADE computer (CNES and Steel Electronique development); MYRIADE platform during integration; MYRIADE CPU board]
[Image © CNES / D. Ducros] DEMETER: 1st mission based on a MYRIADE platform (launched in 2004)
CALIPSO: a US fault-tolerant COTS-based space computer developed by GDAIS (General Dynamics Advanced Information Systems)
■ CALIPSO is a Franco-American payload on a CNES PROTEUS mini-satellite platform for cloud-aerosol and infrared observations, launched in 2006
    • COTS µP = Freescale PowerPC603r
    • 4-MR architecture
    • Voter is not a SPF
    • Micro-synchro. / lock-stepping
[Figure, adapted from © SPIE: block diagram with four PowerPC 603r, Voter 1 ASIC, Voter 2 ASIC, memory controller ASIC and SDRAM array; labels "with cache" and "128/64 MB"]
REIMEI (INDEX): a Japanese fault-tolerant COTS-based space computer developed by ISAS/JAXA + University of Tokyo
■ REIMEI is a small satellite for aurora observation and technology demonstration, launched in 2005
    • COTS µP = Hitachi SH-3
    • TMR architecture
    • Voter is a SPF (Single Point Failure)
    • Macro-synchro.
    • Reinsertion phase = stop the computer for 2 sec
[Figure, adapted from © IAF: block diagram with DRAM, VOTER, CPU and ROM]