RAS Technical White Paper - Huawei FusionServer Pro 2488H V6 Server - HUAWEI TECHNOLOGIES CO., LTD.
Huawei FusionServer Pro 2488H V6 Server RAS Technical White Paper Issue 02 Date 2021-01-25 HUAWEI TECHNOLOGIES CO., LTD.
Copyright © Huawei Technologies Co., Ltd. 2021. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd. Trademarks and Permissions and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders. Notice The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied. The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied. Huawei Technologies Co., Ltd. Address: Huawei Industrial Base Bantian, Longgang Shenzhen 518129 People's Republic of China Website: https://e.huawei.com Issue 02 (2021-01-25) Copyright © Huawei Technologies Co., Ltd. i
About This Document

Purpose
This document describes the reliability, availability, and serviceability (RAS) features and technologies of the Huawei FusionServer Pro 2488H V6 server (2488H V6 for short).

Symbol Conventions
The symbols that may be found in this document are defined as follows.

Symbol | Description
DANGER | Indicates a hazard with a high level of risk which, if not avoided, will result in death or serious injury.
WARNING | Indicates a hazard with a medium level of risk which, if not avoided, could result in death or serious injury.
CAUTION | Indicates a hazard with a low level of risk which, if not avoided, could result in minor or moderate injury.
NOTICE | Indicates a potentially hazardous situation which, if not avoided, could result in equipment damage, data loss, performance deterioration, or unanticipated results. NOTICE is used to address practices not related to personal injury.
NOTE | Supplements the important information in the main text. NOTE is used to address information not related to personal injury, equipment damage, and environment deterioration.
Change History

Issue | Date | Description
02 | 2021-01-25 | This issue is the second official release.
01 | 2020-10-23 | This issue is the first official release.
Contents

About This Document
1 Introduction
1.1 2488H V6 Overview
1.2 RAS Definition
1.3 RAS Measurements
1.3.1 Reliability Measurement
1.3.2 Serviceability Measurement
1.3.3 Availability Measurement
1.4 RAS Importance
2 RAS Basis
2.1 Component Selection and Derating Design
2.2 Reliability Filtering
2.3 Testing
3 Fault Management System (FMS)
3.1 Fault Management Methodology
3.1.1 Fault Management Architecture
3.1.2 Fault Types and Troubleshooting
3.2 Fault Management System (FMS)
3.3 Basic Hardware Faults
3.4 Service Hardware Faults
4 RAS Feature
4.1 Architecture Design
4.2 Comprehensive Memory Protection
4.2.1 End-to-End Memory Protection
4.2.2 Memory Data Protection
4.2.3 High-Reliability Memory Application Design
4.3 RAS Feature Summary
4.4 RAS Feature Description
4.4.1 System-Level RAS Features
4.4.2 Memory RAS Features
4.4.3 PMem RAS Features
4.4.4 I/O RAS Features
4.4.5 UPI RAS Features
4.4.6 Hardware RAS Features
4.4.7 FDM RAS Features
5 Glossary
6 Summary
1 Introduction

This section gives an overview of the Huawei 2488H V6 server and describes the definition, measurement, and importance of RAS.

1.1 2488H V6 Overview
1.2 RAS Definition
1.3 RAS Measurements
1.4 RAS Importance

1.1 2488H V6 Overview

Huawei FusionServer Pro 2488H V6 (2488H V6) is a new-generation 2U 4-socket rack server designed for Internet, Internet Data Center (IDC), cloud computing, enterprise, and telecom applications. Powered by third-generation Intel® Xeon® Scalable processors (Cooper Lake), the 2488H V6 provides up to 28 cores per processor, a frequency of up to 3.1 GHz, a 38.5 MB L3 cache, and six 10.4 GT/s UPI links between the processors, delivering high processing performance.

The major product specifications are as follows:
● The server supports a maximum of 48 DDR4 ECC DIMMs running at up to 3200 MT/s. Both registered DIMMs (RDIMMs) and load-reduced DIMMs (LRDIMMs) are supported, which provide high speed and availability.
● The server supports a maximum of 24 Intel® Optane™ PMem 200 series modules (PMem modules for short). When PMem modules are used together with DDR4 DIMMs, the server supports a maximum memory capacity of 18 TB. This figure is calculated based on a maximum of 256 GB per DDR4 DIMM and a maximum of 512 GB per PMem module, which corresponds to 24 x 256 GB + 24 x 512 GB.
● Flexible drive configurations meet a variety of business requirements and ensure high elasticity and scalability of storage resources.
● All-SSD configurations are supported. A solid-state drive (SSD) supports up to 100 times more I/O operations per second (IOPS) than a typical hard disk drive (HDD), so an all-SSD configuration provides higher I/O performance than an all-HDD configuration or a combination of HDDs and SSDs.
● Internal storage uses 12 Gbit/s Serial Attached SCSI (SAS) connections, which double the data transmission rate of 6 Gbit/s SAS connections and maximize the performance of I/O-intensive applications.
● With Intel integrated I/O, the third-generation Intel® Xeon® Scalable processors integrate the PCIe 3.0 controller to shorten I/O latency and improve overall system performance.
● The server supports a maximum of 11 PCIe 3.0 slots, including one for the OCP 3.0 network adapter.
● The server supports one GE, 10GE, 25GE, or 100GE OCP 3.0 network adapter that supports hot swap, network controller sideband interface (NC-SI), Preboot eXecution Environment (PXE), and Wake on LAN (WoL).

Based on the rich RAS features of Intel processors and the fault diagnosis and management (FDM) system of Huawei servers, the server supports precise fault locating, timely fault alarms, redundant fans and power modules, and hot replacement, providing customers with leading availability, serviceability, and reliability.

1.2 RAS Definition

RAS stands for reliability, availability, and serviceability.
● Reliability: the capability of a product to sustain specific functions for a given time under given conditions. For a server, it is the capability to keep operating properly, free from faults.
● Availability: the capability of a product to be in an operable state at any given time. For a server, it is the capability to remain available for as long as possible.
● Serviceability: the possibility of completing specific maintenance actions in a given time. For a server, it is the capability to recover quickly from faults.

Figure 1-1 shows the top-layer framework of RAS.

Figure 1-1 Top-layer framework of RAS

The core idea behind the RAS design of Huawei V6 servers is to maximize customer service availability and minimize the possibility of breakdown. A highly available standalone system must have highly reliable underlying hardware and software design, high error tolerance, and fast repair and service recovery capabilities.

1.3 RAS Measurements

Server RAS is measured in terms of mean time between failures (MTBF), mean time to repair (MTTR), availability, and other factors.

1.3.1 Reliability Measurement

The major indicators of reliability are the failure rate (λ) and the MTBF. They are related as follows:

λ = 1 / MTBF

A larger MTBF means a smaller failure rate and higher system reliability.

NOTICE
The MTBF does not indicate the service life, but indicates the availability of a component in its service life. The service life indicates the longest time during which a component can be used.

1.3.2 Serviceability Measurement

Serviceability is often measured by the MTTR. The MTTR excludes the time required for administration and logistics as well as the time required for preventive maintenance. A smaller MTTR means better product serviceability.

1.3.3 Availability Measurement

A indicates availability.

A = MUT / (MUT + MDT) x 100%

Mean up time (MUT) is the average time during which the product is available. Mean down time (MDT) is the average interruption time, that is, the average downtime.

Availability (A) in a broad sense is not suitable for describing the inherent features of a product because the MDT includes the time required for administration and logistics. For example, even if a fault is easy to handle, the product appears to have low availability when the fault is not promptly rectified because of delayed fault reporting or misoperations by management personnel. Generally, the inherent availability parameter Ai is used to describe the inherent availability of a product.

Ai = MTBF / (MTBF + MTTR) x 100%

A larger value of Ai means higher product availability.
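The three formulas in Section 1.3 can be worked through numerically. The following Python sketch is only an illustration: the function names and the MTBF and MTTR figures are hypothetical and are not taken from this white paper, and the annual downtime estimate simply assumes 8760 operating hours per year.

```python
HOURS_PER_YEAR = 8760  # assumption: continuous operation, non-leap year


def failure_rate(mtbf_hours: float) -> float:
    """Failure rate: lambda = 1 / MTBF (failures per hour)."""
    return 1.0 / mtbf_hours


def availability(mut_hours: float, mdt_hours: float) -> float:
    """Availability in the broad sense: A = MUT / (MUT + MDT)."""
    return mut_hours / (mut_hours + mdt_hours)


def inherent_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Inherent availability: Ai = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per year implied by an availability value."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical figures chosen only to illustrate the arithmetic.
mtbf = 100_000.0  # hours
mttr = 2.0        # hours

ai = inherent_availability(mtbf, mttr)
print(f"failure rate:          {failure_rate(mtbf):.2e} per hour")
print(f"inherent availability: {ai:.5%}")
print(f"implied downtime:      {annual_downtime_hours(ai):.3f} hours per year")
```

Because Ai uses the MTTR rather than the MDT, it leaves out administrative and logistics delays, which is why the same product can show a lower A than Ai in the field.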
1.4 RAS Importance

As server processing capability grows and each server carries more services, the impact of an unexpected server breakdown increases. According to an ITIC survey report, the hourly cost of downtime is as follows:
● For about 98% of services, the downtime cost may exceed US$100,000/hour.
● For about 88% of services, the downtime cost may exceed US$300,000/hour.
● For about 33% of services, the downtime cost may exceed US$1 million/hour.
Data source: May 2017, Information Technology Intelligence Consulting Corp. (ITIC)

Unexpected downtime not only causes financial loss but also has other negative effects: damage to the corporate image due to extensive media reporting, a higher customer churn rate, and disruption of employees' work schedules.
2 RAS Basis

This section describes the RAS design basis, component-level reliability design, and production filtering requirements.

Component-level reliability is a basic requirement for RAS design. That is, hardware must operate properly for as long as possible. To achieve component-level reliability, designers need to ensure that the correct components are used and that they are used correctly. Using the correct components requires careful component selection and introduction. Using components correctly requires sound design (for example, derating design).

Component-level reliability quality assurance includes three procedures: supplier material reliability management, product reliability design, and production reliability filtering. The three procedures are closely related, and designers need to consider all three thoroughly. The following sections describe the procedures from different perspectives.

2.1 Component Selection and Derating Design
2.2 Reliability Filtering
2.3 Testing

2.1 Component Selection and Derating Design

Benefiting from Huawei's long-term hardware experience in the communications technology (CT) industry, the 2488H V6 is subject to stringent requirements on component selection and derating design.
● 2488H V6 component selection strategy
Huawei applies a strict examination process to the introduction of new components, including supplier qualification review and component application reliability assessment and testing. Huawei also has complete certification processes and sufficient test capabilities to ensure the reliability of new components.
● 2488H V6 derating design
In terms of component application, Huawei servers comply with the same derating standards as communication products. Derating design keeps the working stress of a component or device appropriately lower than its specified rating to decrease the failure rate and improve reliability. Derating design improves component reliability or extends the service life of components in the following ways:
– Minimizing the possibility that a component operating close to overstress fails during its service life.
– Minimizing the impact of the initial tolerance of component parameters (such as differences among individual components, differences among components of different batches, and technology changes).
– Minimizing the impact of long-term drift in component parameter values.
– Providing allowance for uncertainties during stress calculation.
– Providing allowance for unexpected events, such as air conditioner faults in the equipment room and transient stress at the peak voltage.
Derating design for the 2488H V6 goes through several phases in the R&D process:
– Component selection: Select appropriate components that satisfy derating requirements.
– Design: Component derating design must comply with applicable specifications.
– Testing: Product test engineers conduct tests to determine whether components meet derating specifications. Product reliability engineers conduct technical reviews of derating inspection, testing, and issue resolution.

2.2 Reliability Filtering

Figure 2-1 shows how the failure rate of electronic components changes with use time.
Figure 2-1 Bathtub curve

After the early failure period ends, the device enters a stable working period, that is, the random failure period. At the end of the service life, the device enters the wear-out failure period, in which the probability of device failure increases. Reliability filtering makes the 2488H V6 enter the random failure period as soon as possible to improve device stability.

Reliability filtering aims to:
● Screen out early failures to ensure inherent design reliability.
● Reduce the product failure rate after delivery, and improve the MTBF and inherent product availability.
● Establish a long-term, large-sample failure analysis mechanism and continuously optimize front-end design to improve product reliability.

Huawei has developed an effective reliability filtering method through long-term R&D experience in the CT industry. Based on this experience and the characteristics of server products, Huawei has established a server reliability filtering mechanism. The following table lists some of the reliability filtering tests.

Test: CPU large-stress test, UPI large-stress test, memory large-stress test, hard disk large-stress test
Content: Maximum workload, increased temperature stress, increased electrical stress, long-term continuous operation

Each 2488H V6 server must pass these large-stress tests before delivery.
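The shape of the bathtub curve described above can be approximated with a common textbook model: the sum of three Weibull hazard functions, one decreasing (early failures), one constant (random failures), and one increasing (wear-out). The sketch below is an assumption made for illustration; the white paper does not state which model underlies Figure 2-1, and all shape and scale parameters are hypothetical.

```python
def weibull_hazard(t_hours: float, shape: float, scale_hours: float) -> float:
    """Weibull hazard rate h(t) = (k / s) * (t / s) ** (k - 1) for t > 0.

    shape < 1 gives a decreasing rate, shape == 1 a constant rate,
    and shape > 1 an increasing rate.
    """
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)


def bathtub_hazard(t_hours: float) -> float:
    """Illustrative bathtub curve as the sum of three failure mechanisms.

    All parameters are hypothetical and chosen only to produce the shape.
    """
    early = weibull_hazard(t_hours, shape=0.5, scale_hours=50_000.0)
    random_failures = weibull_hazard(t_hours, shape=1.0, scale_hours=100_000.0)
    wear_out = weibull_hazard(t_hours, shape=5.0, scale_hours=80_000.0)
    return early + random_failures + wear_out


for hours in (10, 1_000, 20_000, 60_000, 90_000):
    print(f"t = {hours:>6} h  failure rate = {bathtub_hazard(hours):.2e} per hour")
```

In this model, the stress tests performed before delivery correspond to consuming the steep left-hand part of the curve in the factory, so that a delivered server starts closer to the flat random failure period.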
2.3 Testing

Signal testing is a necessary and important means of ensuring the reliability of electronic products. For example, the latch-up effect is an important failure mode of CMOS circuits. It is a parasitic effect specific to the CMOS process and may cause a circuit failure or even burn out a chip. Voltage overshoot is an important cause of the latch-up effect. The latch-up effect is difficult to test for during manufacturing because it occurs sporadically and usually only after long-term use. An effective way to avoid it is to ensure that all signals are intact and that no signal overshoot affects device functions. This task is usually completed in the R&D process.

Huawei conducts the following tests during the server R&D process:
● Integrity test for all signals: All signals are tested to ensure that they meet the component application requirements, improving design reliability from the bottom layer.
● Test for all power features: Power-on and power-off, input/output (I/O) features, and short circuits are tested for all power supply units (PSUs) to ensure that power supplies meet various application requirements. Special tests are conducted for key power supplies (for example, CPU VRD power supplies) to ensure that servers can operate stably under extreme workloads and in extreme application environments.
● Multi-sample test for key high-speed links: Discrete tests are conducted on boards of different batches and from different vendors to assess key high-speed signals.
● Error tolerance test: System-level and chip-level (FIT) error tolerance tests are conducted to improve server reliability.
● Stability test: Extreme tests (such as large-stress tests and repeated power cycling) are conducted on a large number of servers in different application scenarios and extreme environments. The stability test ensures high availability of the entire server system.
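As a rough illustration of what a signal overshoot check in the integrity test above might compute, the following Python sketch measures the overshoot of a sampled waveform relative to its steady-state level and compares the peak against an absolute maximum rating. The function names, waveform samples, and voltage limit are hypothetical; real signal integrity testing relies on oscilloscope measurements and the limits in each component's datasheet.

```python
from typing import Sequence


def overshoot_percent(samples: Sequence[float], steady_state: float) -> float:
    """Overshoot of the waveform peak relative to the steady-state level, in percent."""
    peak = max(samples)
    return max(0.0, (peak - steady_state) / steady_state * 100.0)


def within_absolute_maximum(samples: Sequence[float], abs_max_voltage: float) -> bool:
    """True if no sample exceeds the absolute maximum rating (hypothetical limit)."""
    return max(samples) <= abs_max_voltage


# Hypothetical rising edge of a 3.3 V signal, sampled in volts.
waveform = [0.0, 1.2, 2.9, 3.6, 3.4, 3.3, 3.3]

print(f"overshoot: {overshoot_percent(waveform, steady_state=3.3):.1f}%")
print(f"within limit: {within_absolute_maximum(waveform, abs_max_voltage=3.6)}")
```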
3 Fault Management System (FMS)

This section describes the fault management system and methods of the 2488H V6.

3.1 Fault Management Methodology
3.2 Fault Management System (FMS)
3.3 Basic Hardware Faults
3.4 Service Hardware Faults

3.1 Fault Management Methodology

As servers evolve, they use more and more modules and components, and the risk of faults increases as a result. Faults are handled hierarchically, in line with the hierarchical architecture of the server.

3.1.1 Fault Management Architecture

Server faults generally refer to hardware faults. Handling these faults requires the hardware, firmware, BIOS, OS, and management software to work together. Hardware is at the bottom layer of the entire system. It provides multiple fault handling mechanisms, such as:
● Performing comprehensive fault detection based on chips (including CPUs and memory expansion chips) and multiple sensors.
● Correcting correctable faults using mechanisms such as ECC and retry.
● Adopting redundancy design to avoid faults that can be prevented.
● Recording detected errors in various registers, such as the CPU MCA registers, for the fault management system to use.

Many faults that hardware cannot rectify can be rectified by the firmware, BIOS, or OS, for example through page offlining and core disabling.

For application software, a high availability (HA) solution can switch customer applications over in time when a hardware fault occurs, ensuring application continuity. Note that HA support is application specific. For details, contact the application provider.

Figure 3-1 shows the fault management architecture.

Figure 3-1 Fault management architecture

In addition to the fault management architecture, the 2488H V6 supports fully autonomous fault diagnosis management (FDM). The management system supports remote device management and provides one-stop services such as device configuration, software and firmware upgrade, and fault management. For details, see 3.2 Fault Management System (FMS).

3.1.2 Fault Types and Troubleshooting

System faults can be classified into the types listed in Table 3-1.

Table 3-1 Types of server faults

Category | Fault Type | Impact on the System | Example
Category A | Correctable chip errors | Minor | Frequent memory ECC errors may have minor impact in the short term but pose significant risks in the long term.
Category A | Lockable chip errors | Minor | Huawei ES3000 is a standard PCIe SSD card based on NAND flash. The ES3000 controller confines chip errors internally and prevents any impact on the system.
Category B | Recoverable software errors | Medium | A server contains a large number of inter-integrated circuit (I2C) components. If an I2C component is faulty, the entire I2C link may be interrupted, which has a major impact. The iBMC can reset the link to restore it.
Category B | Failover to spare parts | Medium | Server components, such as PSUs, fan modules, and memory ranks, work in redundancy mode.
Category B | Degrading | Major | If a UPI link between CPUs is interrupted, the width of the UPI link is automatically reduced. The overall performance decreases, but the system still functions.
Category B | Lockable system errors | Major | If an error occurs in a memory unit, the OS detects the error and takes the faulty memory page offline so that the error source is isolated.
Category C | Uncorrectable errors | Major | If the output of a key clock source is abnormal, the iBMC can detect the error source but cannot correct the error.
Category C | Undetectable errors | Uncertain | This type of error is usually sporadic and difficult to locate, varies in severity, and may be caused by hardware or software.

Errors of categories A and B do not cause service interruption, but errors of category C do. Categories A and B are similar in that, when such an error occurs, the system does not break down immediately, and the FMS reports the error to maintenance personnel so that they can correct it as scheduled. Errors of category C must be corrected as soon as possible.
To achieve this goal, the system design must meet two requirements: a reliable FMS for accurate fault locating, and good structural design for fast parts replacement.

Table 3-1 lists error categories in theory. In practice, an error may have a different impact in different application scenarios. For example, the unstable output of a PSU falls into category B if the PSUs are configured in N+N redundancy mode, but falls into category C if the PSUs are not redundant.
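The classification in Table 3-1 and the PSU example can be expressed as a small lookup. The Python sketch below is only an illustration of the rules stated in the text; the names `Category`, `FAULT_CATEGORY`, `interrupts_service`, and `classify_psu_fault` are hypothetical and do not correspond to any iBMC or FMS interface.

```python
from enum import Enum


class Category(Enum):
    A = "A"
    B = "B"
    C = "C"


# Fault type to category, following Table 3-1.
FAULT_CATEGORY = {
    "correctable chip error": Category.A,
    "lockable chip error": Category.A,
    "recoverable software error": Category.B,
    "failover to spare parts": Category.B,
    "degrading": Category.B,
    "lockable system error": Category.B,
    "uncorrectable error": Category.C,
    "undetectable error": Category.C,
}


def interrupts_service(category: Category) -> bool:
    """Only category C errors cause service interruption."""
    return category is Category.C


def classify_psu_fault(psus_redundant: bool) -> Category:
    """Unstable PSU output: category B with N+N redundancy, category C without."""
    return Category.B if psus_redundant else Category.C


for redundant in (True, False):
    category = classify_psu_fault(redundant)
    print(
        f"PSU redundancy={redundant}: category {category.value}, "
        f"service interruption={interrupts_service(category)}"
    )
```

Keeping the scenario-dependent decision (`classify_psu_fault`) separate from the static table reflects the point made above: the table gives the theoretical category, while the actual configuration determines the real impact.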
Different resolution policies should be adopted for different types of faults. Figure 3-2 shows the resolution policies for various faults based on the FMS. As shown in the figure, all faults must be detected first. If an error occurs without being detected, the error handling process cannot be triggered. Different processing policies, such as software recovery, spare parts switchover, degradation, and fault isolation, are used for faults that can be detected. All detected errors are reported to the fault management system so that it can collect fault information and locate faults.

NOTE
Given the current technical capability of the industry, some errors cannot be detected or handled. For such errors, offline maintenance should be included in the maintenance plan.

Figure 3-2 Fault classification and management

3.2 Fault Management System (FMS)

Quickly locating fault sources among a large number of components is an important means of ensuring availability and can greatly shorten the maintenance time.

2488H V6 hardware faults can be classified into two types by hardware location: basic hardware faults and service hardware faults.
● Basic hardware faults: Basic hardware includes PSUs, fan modules, board power modules, and clocks. Basic hardware is not directly associated with upper-layer services, and the fault detection process does not necessarily involve the service system. Therefore, the iBMC on the 2488H V6 handles basic hardware errors independently.
● Service hardware faults: Service hardware includes processors, DIMMs, PCIe devices, and drives. These devices are in the execution path of applications and are closely related to customer services. Most service hardware faults are located, analyzed, and handled by the BIOS and iBMC, and some faults require the OS.

In addition to accurate fault locating and prompt fault rectification, the FMS needs to provide fault warning, that is, identify potential faults so that users can hot-swap components or schedule a planned shutdown to minimize the impact on services.

The 2488H V6 integrates the fault diagnosis and management (FDM) system, as shown in Figure 3-3. The FDM consists of sensors, complex programmable logic devices (CPLDs), the out-of-band management system iBMC, BIOS, platform controller hub (PCH), CPUs, Huawei baseboard management agent (iBMA, optional), and FusionServer Tools (optional).

Figure 3-3 FMS components

The FMS of the 2488H V6 covers the hardware layer, BIOS layer, CPU platform, and out-of-band management system, and provides the interface protocols required for OS-layer fault locating. Figure 3-4 shows the FMS framework.

Figure 3-4 FMS framework

The FMS consists of the following components:
● iBMC: Huawei's latest-generation server management system, which is the core of the fault locating system. Based on the Huawei-developed Hi1711 chip, the iBMC collects, summarizes, and analyzes faults, and displays fault information using the WebUI, LCD, and logs to implement server management. The iBMC is an independent system decoupled from the OS and application software of the service system. Its chips and upper-layer software are developed by Huawei to meet the service requirements of different customers.
● Processor platform: The 2488H V6 uses Intel® Xeon® Scalable processors (Cooper Lake). In addition to basic RAS features, the 2488H V6 provides advanced RAS capabilities, greatly improving the capability of handling service hardware faults.
● CPLD: Collects basic hardware faults, and connects to hardware module interfaces and the iBMC over Huawei's proprietary CPLD-Bus interface.
● BIOS: Collects and locates service hardware faults, provides fault locating results to the iBMC, and provides fault management interfaces for the OS.
● (Optional) iBMA: The Huawei baseboard management agent runs on the OS and obtains service-side hardware information, which helps with fault locating and warning.
● (Optional) FusionServer Tools: A tool suite developed for Huawei servers that facilitates server installation, configuration, fault diagnosis, and fault prediction.
● User interface: An iBMC WebUI, a local LCD, and fault indicators for key components are provided to facilitate remote or local system maintenance.
● Various protocols: The FMS uses the following interfaces and protocols: Huawei CPLD-Bus, low pin count (LPC), SML, Platform Environment Control Interface (PECI), PCIe, universal asynchronous receiver/transmitter (UART), I2C, and PMBus.

3.3 Basic Hardware Faults

Basic hardware modules include PSUs, fan modules, and the underlying hardware of other components (excluding CPUs, DIMMs, drives, and standard PCIe cards), such as the compute module, front I/O module, rear I/O module, and converged console.
There are different types of basic hardware faults. During troubleshooting, the CPLD aggregates the fault information, which includes the fault type and fault location, and reports it to the iBMC. The iBMC parses the received fault information and displays it on the WebUI. During parsing, the iBMC identifies the fault level and type and provides corresponding handling suggestions, which helps the customer rectify the fault quickly.

In addition to monitoring basic hardware faults, the CPLD also monitors some service hardware conditions, including CPU faults and excessively high CPU and memory temperatures. In this way, the iBMC can monitor key hardware as quickly as possible without depending on the BIOS or OS (which may be unavailable when a serious fault occurs on the processor or memory). It can then take timely measures, such as increasing the fan speed, to prevent faults from damaging key components and, in turn, the entire system.
3.4 Service Hardware Faults

Service hardware includes CPUs, memory, PCIe devices, and the local storage system. Due to the characteristics of the local storage system, the 2488H V6 can manage its faults as basic hardware faults, or use FusionServer Tools or BMA to implement in-band fault management for the local storage system. In this section, service hardware refers to CPUs, DIMMs, and PCIe devices.

Based on the MCA architecture provided by Intel® Xeon® Scalable processors (Cooper Lake), the 2488H V6 integrates the hardware, BIOS, iBMC, and OS fault handling mechanisms into a unique FMS. After a fault occurs in the system, the FMS provides functions such as fault diagnosis, fault locating, fault rectification, fault information collection, and fault reporting. In addition, the core modules of the FMS run on the BIOS and iBMC and do not depend on the OS. Therefore, the FMS is always running and can take measures immediately when an error occurs to prevent the system from breaking down.

Figure 3-5 shows the flowchart for handling service hardware faults.

Figure 3-5 Flowchart for handling service hardware faults

● The process for handling a correctable error is as follows: If the leaky bucket algorithm is used and the number of correctable errors reaches the specified threshold, a system management interrupt (SMI) is triggered to instruct the BIOS to handle the error. After receiving the SMI, the BIOS handles the error based on the SMI type. After ensuring that the system is running properly, the BIOS locates and isolates the faulty component, collects error status register information, and reports the error and detailed error status register information to the iBMC. The information helps users or maintenance personnel further analyze the cause of the error. (The purple arrows in Figure 3-5 show the flow for handling a correctable error.)
● The process for handling an uncorrectable, recoverable error is as follows: An uncorrectable, recoverable error has no adverse impact on the system. The error is marked with an error tag, and an SMI is triggered. After receiving the SMI, the BIOS collects error status register information, locates the faulty components, and reports the error information and detailed error status register information to the iBMC. (The dark-blue arrows in Figure 3-5 show the flow for handling an uncorrectable, recoverable error.)
● The process for handling an uncorrectable, unrecoverable error in the x86 system is as follows: If an uncorrectable, unrecoverable error occurs, the CATERR_N pin is pulled down and the system stops responding. The error triggers the error collection program of the iBMC, which obtains the error status register information of the x86 system. Based on this onsite error information, the error collection program diagnoses the error and promptly displays error information to users. (The brown arrows in Figure 3-5 show the flow for handling an uncorrectable, unrecoverable error.)
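The correctable-error path above depends on a leaky bucket counter. The following is a minimal Python sketch of that idea, not the 2488H V6 implementation: the class name, the threshold, and the leak interval are hypothetical values chosen for illustration, and on a real system the counting is performed by the processor and BIOS rather than by host software.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    """Counts correctable errors; the count drains by one every leak_interval seconds."""
    threshold: int          # hypothetical: errors needed to signal an SMI
    leak_interval: float    # hypothetical: seconds needed to drain one error
    level: int = 0
    last_leak: float = 0.0

    def record_error(self, now: float) -> bool:
        """Register one correctable error at time `now`; return True if an SMI should fire."""
        # Drain the bucket for every whole interval that has elapsed.
        leaked = int((now - self.last_leak) // self.leak_interval)
        if leaked > 0:
            self.level = max(0, self.level - leaked)
            self.last_leak += leaked * self.leak_interval
        self.level += 1
        if self.level >= self.threshold:
            self.level = 0  # start counting afresh after signalling
            return True
        return False


burst = LeakyBucket(threshold=3, leak_interval=10.0)
print([burst.record_error(t) for t in (0.0, 1.0, 2.0)])
```

With these assumed parameters, three errors within two seconds reach the threshold, whereas errors spaced 15 seconds apart drain away before the bucket fills. This is the point of the design: isolated transient errors are tolerated, and only a sustained error rate interrupts the BIOS.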
4 RAS Feature

This section describes some key RAS features of the 2488H V6, lists all RAS features that have been implemented on the 2488H V6, and provides application scenarios.

4.1 Architecture Design
4.2 Comprehensive Memory Protection
4.3 RAS Feature Summary
4.4 RAS Feature Description

4.1 Architecture Design

The system architecture design principles for the 2488H V6 are high availability, high performance, good compatibility, and evolvability. High availability is the core requirement of RAS design. Compatibility and evolvability improve the serviceability of servers.

● High availability means using various design and troubleshooting methods to increase system uptime, minimize unplanned downtime, and reduce the impact of downtime on services.
● Good compatibility refers to the decoupling of RAS features from customer service systems or upper-layer applications. For example, the FMS components of the 2488H V6 reside mainly on the out-of-band management chip (BMC), and no FMS component is placed on the OS. This decouples the fault management module from the OS, so that OS problems cannot prevent the fault management module from working properly.

Based on Huawei's powerful hardware platform, excellent overall structure design, and the management software of the Huawei-developed Hi1711 management chip, the architecture design of the 2488H V6 implements the following functions:

● The modular design makes modules loosely coupled with each other, which facilitates parts replacement.
● The fully autonomous management system iBMC supports remote, one-stop management, including server configuration, software and firmware upgrades, and fault management.
● The separate airflow design and Huawei's efficient fan modules ensure that the 2488H V6 operates stably at 45°C (113°F) even if some air conditioners in the equipment room fail.

The 2488H V6 provides enhanced RAS features for core server components, including comprehensive memory protection against the memory faults that are common in the industry.

4.2 Comprehensive Memory Protection

As memory technologies develop rapidly, the chip manufacturing process is improving, the chip operating voltage is decreasing, and the memory capacity is increasing. As a result, memory reliability has become a top-priority issue. Without protective mechanisms for the memory, serious memory faults often result in severe consequences, such as system breakdown and service interruption. As the number of DIMMs increases, the consequences of serious memory faults will become even more severe. The 2488H V6 invests heavily in memory RAS to address these problems.

4.2.1 End-to-End Memory Protection

To ensure memory availability, the 2488H V6 provides an end-to-end memory protection mechanism with the help of the FMS. This mechanism prevents memory faults from spreading or escalating, which would otherwise affect the entire system. Figure 4-1 shows the mechanism.

Figure 4-1 End-to-end memory protection mechanism

To ensure memory availability, key measures include:
● During DIMM purchase, only DIMMs from mainstream vendors are selected, and the purchased DIMMs are strictly tested and screened.
● On top of the extensive memory RAS features provided by the CPU, memory-related algorithms are enhanced to provide multiple layers of algorithmic protection. For example, the memory fault storm suppression algorithm and re-examination algorithm are optimized to ensure accurate locating and quick processing of memory faults. For details, see 4.2.3 High-Reliability Memory Application Design.
● Based on the FMS, the fault prediction algorithm is used to implement fault warning for risky DIMMs.
● During both POST and runtime, faulty DIMMs can be accurately located, and the faulty memory units are isolated through startup isolation or runtime page offline.
● The management software reports alarms immediately to prompt users to replace risky DIMMs in time.

4.2.2 Memory Data Protection

The 2488H V6 supports multiple memory data protection features, such as DDR bus data CRC check and retry, memory data error checking and correction (ECC), and faulty chip isolation.

Memory chips are the storage entities of a DIMM. In the x86 architecture, each time a CPU reads data from or writes data to memory, several memory chips are involved. Some chips provide data bits and others provide check bits. Together, these chips complete the read or write of the minimum access unit (usually called a cache line). ECC is a basic feature that uses this check mechanism. However, it can correct only a single-bit error in the accessed data. The Cooper Lake processor is capable of correcting multi-bit data errors on the same memory chip. This enhanced correction capability has little impact on performance.
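To make the single-bit correction of basic ECC concrete, the following toy sketch uses a Hamming(7,4) code in Python. It is an illustration only: the function names are hypothetical, real DIMM ECC uses much wider codewords with dedicated check-bit chips, and the multi-bit same-chip correction described above is a stronger scheme that this example does not implement.

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c: list[int]) -> tuple[list[int], int]:
    """Return (data bits, 1-based position of the corrected bit, or 0 if none)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3   # the syndrome spells out the failing position
    if pos:
        c[pos - 1] ^= 1          # flip the single faulty bit back
    return [c[2], c[4], c[5], c[6]], pos


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                     # simulate a single-bit memory error
print(hamming74_decode(word))
```

The check bits play the role of the check-bit chips mentioned above: the syndrome computed on read identifies which single bit is wrong, so the data can be corrected without any software involvement. Two simultaneous bit errors would defeat this toy code, which is why stronger schemes are needed for whole-chip failures.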
4.2.3 High-Reliability Memory Application Design

The 2488H V6 uses multiple high-reliability memory application technologies to implement memory error prediction and self-healing, minimizing the impact on services.

HiRAS Technology

The 2488H V6 supports the HiRAS mode (high reliability mode). In HiRAS mode, the system provides enhanced RAS capabilities, including memory fault self-healing and stable system running technologies, to ensure high system reliability and reduce the memory failure rate by 50% (without affecting services).

Memory Fault Self-Healing Technology

The memory fault self-healing technology is a Huawei-developed patented technology. Based on the Huawei server log big data system, this technology uses a machine learning algorithm to obtain a memory fault feature model, embeds the model into the Huawei-developed BMC chip, and uses the AI self-healing algorithm of the chip to accurately identify each memory fault feature and proactively trigger the software and hardware isolation and self-healing mechanisms.

Precise Memory Prediction Technology

For memory faults that cannot be self-healed, the system uses the AI prediction algorithm of the Huawei-developed chip to identify the severity of the memory fault and sends a pre-warning to remind the customer to migrate services or replace the memory in a timely manner. The warning accuracy reaches 79%, and the memory breakdown rate decreases by 40%.

Fault Storm Suppression Technology

The occurrence, correction, and recording of a single error have little impact on system performance. However, when upper-layer applications frequently access a memory area in which multiple faults occur, an interrupt storm of memory errors arises, which adversely affects system performance. In severe cases, services may be suspended.

The fault storm suppression algorithm of the 2488H V6 effectively reduces the number of interrupts that need to be triggered. In this way, the interrupt storm is suppressed when an error storm occurs, greatly reducing the impact on services. After a storm is over, the server management system checks the fault type. If the faults are transient, which is usually the result of environment changes, the management system records the storm event. If the faults are permanent, which is usually the result of electrical aging or damage, the management system isolates the faulty components. In addition, to prevent storm suppression from losing useful information, the BMC proactively polls the fault registers during storm suppression and feeds the fault information into the subsequent processing mechanism.
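The suppression idea described above can be sketched as a sliding-window rate limiter. This Python example is a simplified illustration under stated assumptions, not Huawei's algorithm: the class name, the window length, and the per-window limit are hypothetical, and the `suppressed` counter merely stands in for the fault information that the BMC gathers by polling during a storm.

```python
from collections import deque


class StormSuppressor:
    """Decide per error whether to raise an interrupt or fold it into a polled count."""

    def __init__(self, max_per_window: int, window: float):
        self.max_per_window = max_per_window  # hypothetical storm threshold
        self.window = window                  # hypothetical window length, seconds
        self.recent = deque()                 # timestamps of errors inside the window
        self.suppressed = 0                   # errors kept for later analysis

    def on_error(self, now: float) -> bool:
        """Return True if this error should interrupt the host, False if suppressed."""
        # Forget errors that have fallen out of the sliding window.
        while self.recent and now - self.recent[0] > self.window:
            self.recent.popleft()
        self.recent.append(now)
        if len(self.recent) > self.max_per_window:
            self.suppressed += 1  # still counted, so no information is lost
            return False
        return True


s = StormSuppressor(max_per_window=3, window=1.0)
delivered = [s.on_error(i / 10) for i in range(10)]
print(delivered.count(True), s.suppressed)
```

In this sketch only the first few errors of a burst interrupt the host, and the rest are counted silently. Once the burst ends and the window empties, interrupts are delivered normally again, which corresponds to the point at which the management system would classify the storm as transient or permanent.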
4.3 RAS Feature Summary

Table 4-1 lists the 2488H V6 RAS features, which are classified into seven categories: system-level RAS, memory RAS, PMem RAS, IIO RAS, UPI RAS, hardware RAS, and FDM RAS.

Table 4-1 2488H V6 RAS feature summary

| Type | Feature ID | CPU and System RAS Feature | 2488H V6 |
|---|---|---|---|
| SYSTEM | SYSTEM_01 | CPU Built-in Self Test (BIST) | Supported |
| SYSTEM | SYSTEM_02 | Core Disable for Fault Resilient Boot (FRB) | Supported |
| SYSTEM | SYSTEM_03 | Corrupt Data Containment – Core | Supported |
| SYSTEM | SYSTEM_04 | Corrupt Data Containment – Uncore | Supported |
| SYSTEM | SYSTEM_05 | Socket Disable for FRB | Supported |
| SYSTEM | SYSTEM_06 | Advanced Error Detection and Correction (AEDC) | Supported |
| SYSTEM | SYSTEM_07 | Time-out Timer Schemes | Supported |
| SYSTEM | SYSTEM_08 | Error Injection | Supported |
| SYSTEM | SYSTEM_09 | Machine Check Architecture (MCA) Recovery | Supported |
| SYSTEM | SYSTEM_10 | MCA | Supported |
| SYSTEM | SYSTEM_11 | Machine Check Exception | Supported |
| SYSTEM | SYSTEM_12 | Local MCE | Supported |
| SYSTEM | SYSTEM_13 | Enhanced MCA (EMCA) Gen2 | Supported |
| SYSTEM | SYSTEM_14 | Out-of-Band (OOB) Access to MCA Registers | Supported |
| SYSTEM | SYSTEM_15 | Error Reporting via IOMCA | Supported |
| SYSTEM | SYSTEM_16 | Failed DIMM Isolation | Supported |
| SYSTEM | SYSTEM_17 | HiRAS mode (High RAS) | Supported |
| Memory | MEMORY_01 | Memory Thermal Throttling | Supported |
| Memory | MEMORY_02 | Memory Single Device Data Correction | Supported |
| Memory | MEMORY_03 | DDR4 Command and Address Parity Check and Retry | Supported |
| Memory | MEMORY_04 | Memory Demand and Patrol Scrubbing | Supported |
| Memory | MEMORY_05 | Memory Mirroring | Supported |
| Memory | MEMORY_06 | DDR4 Write Data CRC Check and Retry | Supported |
| Memory | MEMORY_07 | Memory Data Scrambling with Command and Address | Supported |
| Memory | MEMORY_08 | DDR4 Post Package Repair (PPR) | Supported |
| Memory | MEMORY_09 | Adaptive Data Correction – Single-Region | Supported |
| Memory | MEMORY_10 | Adaptive Double Device Data Correction – Multiple-Region (ADDDC-MR, +1) | Supported |
| Memory | MEMORY_11 | DDR4 Memory Multi Rank Sparing | Supported |
| Memory | MEMORY_12 | Address Range/Partial Memory Mirroring | Supported |
| Memory | MEMORY_13 | Memory SMBus Hang Recovery | Supported |
| Memory | MEMORY_14 | Memory Disable/map-out for FRB | Supported |
| Memory | MEMORY_15 | MEMHOT Pin Support for Error Reporting | Supported |
| Memory | MEMORY_16 | Failure Prediction and Correction | Supported |
| Memory | MEMORY_17 | Fault self-healing result reporting | Supported |
| Memory | MEMORY_18 | Precise warning of memory faults | Supported |
| Memory | MEMORY_19 | iBMA page isolation | Supported |
| PMem | PMem MEMORY_01 | PMem Module Error Detection and Correction | Supported |
| PMem | PMem MEMORY_02 | SDDC – Single Device Data Correct | Supported |
| PMem | PMem MEMORY_03 | PMem Module Package Sparing | Supported |
| PMem | PMem MEMORY_04 | PMem Module Patrol Scrub | Supported |
| PMem | PMem MEMORY_05 | PMem Module Media Address Error Detection and Verification | Supported |
| PMem | PMem MEMORY_06 | PMem Module Data Poisoning | Supported |
| PMem | PMem MEMORY_07 | PMem Module Viral Mode for Containment | Supported |
| PMem | PMem MEMORY_08 | PMem Module Address Range Scrub (ARS) | Supported |
| PMem | PMem MEMORY_09 | PMem Module Error Injection | Supported |
| PMem | PMem MEMORY_10 | DDR-T Command/Address Parity Check and Retry | Supported |
| PMem | PMem MEMORY_11 | Read/Write Data ECC Check and Retry | Supported |
| PMem | PMem MEMORY_12 | Failed PMem Module Isolation | Supported |
| PMem | PMem MEMORY_13 | PMem Module Error Reporting | Supported |
| IIO | IIO_01 | PCIe Advanced Error Reporting | Supported |
| IIO | IIO_02 | PCIe Corrupt Data Containment (Data Poisoning) | Supported |
| IIO | IIO_03 | PCIe Link CRC Error Check and Retry | Supported |
| IIO | IIO_04 | PCIe End to End CRC (ECRC) | Supported |
| IIO | IIO_05 | PCIe Link Retraining and Recovery | Supported |
| IIO | IIO_06 | PCIe Card Hot-Plug Surprise | Supported |
| IIO | IIO_07 | PCIe "Stop and Scream" | Supported |
| UPI | UPI_01 | Intel UPI Link Level Retry | Supported |
| UPI | UPI_02 | Intel UPI Protocol Protection via 32 bit Rolling CRC | Supported |
| UPI | UPI_03 | Intel UPI Dynamic Link Width Reduction | Supported |
| UPI | UPI_04 | UPI Virus Mode | Supported |
| UPI | UPI_05 | UPI Topology Downgrade for Failed Link Isolation | Supported |
| Hardware | HW_01 | Hot-Swappable PSUs in N+N Backup Mode | Supported |
| Hardware | HW_02 | Hot-Swappable Fan Modules in N+1 Backup Mode | Supported |
| Hardware | HW_03 | RAID and Hot Swap Supported by Hard Drives | Supported |
| FDM | FDM_01 | Fault Diagnosis System | Supported |
| FDM | FDM_02 | Proactive Failure Analysis Engine (PFAE) | Supported |
| FDM | FDM_03 | Faulty CPU Locating | Supported |
| FDM | FDM_04 | Faulty DIMM Locating | Supported |
| FDM | FDM_05 | Faulty PSU Locating | Supported |
| FDM | FDM_06 | Faulty Fan Module Locating | Supported |
| FDM | FDM_07 | Faulty Hard Drive Locating | Supported |
| FDM | FDM_08 | Hard Drive PFA | Supported |
| FDM | FDM_09 | Service Life Prediction of Huawei ES3000 (Standard PCIe SSD) | Supported |
| FDM | FDM_10 | iBMC CPU Self-Check | Supported |
| FDM | FDM_11 | Remote System Software and Firmware Upgrade by the iBMC | Supported |
| FDM | FDM_12 | Black Box of the iBMC | Supported |
| FDM | FDM_13 | Breakdown Screenshot Capturing of the iBMC | Supported |
| FDM | FDM_14 | Breakdown Video Recording of the iBMC | Supported |

NOTE
● Some RAS features are not enabled by default. You can enable them using the BIOS setup. For details, see the server BIOS setup documentation.
● Some RAS features vary slightly depending on the CPU.

4.4 RAS Feature Description

4.4.1 System-Level RAS Features

Feature ID: SYSTEM_01
Feature: CPU Built-in Self Test (BIST)
Description: The internal self-check module of a CPU checks each core of the CPU during BIOS startup, and records the self-check results.
Category: Reliability and serviceability
Customer Benefit/Application Scenario: Customers can detect faults in the CPU.
Usage: This feature automatically takes effect and cannot be disabled.
Constraints/Limitations: None
Feature ID: SYSTEM_02
Feature: Core Disable for Fault Resilient Boot (FRB)
Description: After the CPU BIST, the faulty CPU core is isolated according to the BIST result, and the remaining CPU cores are started.
Category: Reliability and availability
Customer Benefit/Application Scenario: The system retains maximum CPU availability even when some CPU cores are faulty.
Usage: This feature automatically takes effect and cannot be disabled.
Constraints/Limitations: None

Feature ID: SYSTEM_03
Feature: Corrupt Data Containment – Core
Description: When the CPU core cache receives error data that is not corrected by hardware algorithms, the CPU cores do not crash immediately. Instead, an interrupt request (IRQ) is sent to the OS to perform recovery or retry according to the data usage.
Category: Reliability
Customer Benefit/Application Scenario: The probability that the system remains available when uncorrected hardware errors exist is increased.
Usage: This feature automatically takes effect and is enabled by default.
Constraints/Limitations: None

Feature ID: SYSTEM_04
Feature: Corrupt Data Containment – Uncore
Description: When the CPU peripheral devices, including the memory controller, cache agent module, internal I/O module, and UPI proxy module, receive error data that is not corrected by the hardware algorithm, the CPU peripheral devices do not crash immediately. Instead, after the data enters the CPU core cache, an IRQ is sent to the OS to perform recovery or retry according to the data usage.
Category: Reliability
Customer Benefit/Application Scenario: The probability that the system remains available when uncorrected hardware errors exist is increased. Data whose destination is an external device, for example, an erroneous pixel displayed on the screen, can be directly discarded without any processing.
Usage: This feature automatically takes effect and is enabled by default.
Constraints/Limitations: None

Feature ID: SYSTEM_05
Feature: Socket Disable for FRB
Description: If a socket has failed or cannot be directly or indirectly connected to the PCH due to a UPI fault, the system isolates the socket. Compared with a normal, complete CPU interconnection topology, the topology is downgraded and the system starts with fewer CPUs.
Category: Reliability and availability
Customer Benefit/Application Scenario: The system retains maximum CPU availability when a CPU socket or the UPI bus is faulty.
Usage: This feature automatically takes effect and cannot be disabled.
Constraints/Limitations: None

Feature ID: SYSTEM_06
Feature: Advanced Error Detection and Correction (AEDC)