RAS Technical White Paper - Huawei FusionServer Pro 2488H V6 Server - HUAWEI TECHNOLOGIES CO., LTD.

Page created by Terrence Rowe
 
Huawei FusionServer Pro 2488H V6 Server

RAS Technical White Paper

Issue           02
Date            2021-01-25

HUAWEI TECHNOLOGIES CO., LTD.
Copyright © Huawei Technologies Co., Ltd. 2021. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any means without prior
written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

      and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their respective
holders.

Notice
The purchased products, services and features are stipulated by the contract made between Huawei and
the customer. All or part of the products, services and features described in this document may not be
within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements,
information, and recommendations in this document are provided "AS IS" without warranties, guarantees
or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the
preparation of this document to ensure accuracy of the contents, but all statements, information, and
recommendations in this document do not constitute a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.
Address:       Huawei Industrial Base
               Bantian, Longgang
               Shenzhen 518129
               People's Republic of China

Website:       https://e.huawei.com


                                                     About This Document

Purpose
                 This document describes the Reliability, Availability, and Serviceability (RAS)
                 features and technologies of the Huawei FusionServer Pro 2488H V6 server
                 (2488H V6 for short).

Symbol Conventions
                 The symbols that may be found in this document are defined as follows.

                  Symbol                    Description

                                            Indicates a hazard with a high level of risk which, if not
                                            avoided, will result in death or serious injury.

                                            Indicates a hazard with a medium level of risk which, if
                                            not avoided, could result in death or serious injury.

                                            Indicates a hazard with a low level of risk which, if not
                                            avoided, could result in minor or moderate injury.

                                            Indicates a potentially hazardous situation which, if not
                                            avoided, could result in equipment damage, data loss,
                                            performance deterioration, or unanticipated results.
                                            NOTICE is used to address practices not related to
                                            personal injury.

                                            Supplements the important information in the main
                                            text.
                                            NOTE is used to address information not related to
                                            personal injury, equipment damage, and environment
                                            deterioration.


Change History
                  Issue           Date             Description

                  02              2021-01-25       ● This issue is the second official release.

                  01              2020-10-23       ● This issue is the first official release.


                                                                                                                                                               Contents

About This Document
1 Introduction
1.1 2488H V6 Overview
1.2 RAS Definition
1.3 RAS Measurements
1.3.1 Reliability Measurement
1.3.2 Serviceability Measurement
1.3.3 Availability Measurement
1.4 RAS Importance

2 RAS Basis
2.1 Component Selection and Derating Design
2.2 Reliability Filtering
2.3 Testing

3 Fault Management System (FMS)
3.1 Fault Management Methodology
3.1.1 Fault Management Architecture
3.1.2 Fault Types and Troubleshooting
3.2 Fault Management System (FMS)
3.3 Basic Hardware Faults
3.4 Service Hardware Faults

4 RAS Feature
4.1 Architecture Design
4.2 Comprehensive Memory Protection
4.2.1 End-to-End Memory Protection
4.2.2 Memory Data Protection
4.2.3 High-Reliability Memory Application Design
4.3 RAS Feature Summary
4.4 RAS Feature Description
4.4.1 System-Level RAS Features
4.4.2 Memory RAS Features
4.4.3 PMem RAS Features
4.4.4 I/O RAS Features
4.4.5 UPI RAS Features
4.4.6 Hardware RAS Features
4.4.7 FDM RAS Features

5 Glossary
6 Summary


                                                                 1          Introduction

                 This chapter describes the 2488H V6 server and the definition, measurement,
                 and importance of RAS.

                 1.1 2488H V6 Overview
                 1.2 RAS Definition
                 1.3 RAS Measurements
                 1.4 RAS Importance

1.1 2488H V6 Overview
                 Huawei FusionServer Pro 2488H V6 (2488H V6) is a new-generation 2U 4-socket
                 rack server designed for Internet, Internet Data Center (IDC), cloud computing,
                 enterprise, and telecom applications. Powered by the third-generation Intel® Xeon®
                 Cooper Lake processors, the 2488H V6 provides up to 28 cores, 3.1 GHz frequency,
                 a 38.5 MB L3 cache, and six 10.4 GT/s UPI links between the processors, which
                 deliver supreme processing performance. The major product specifications are as
                 follows:
                 ●      The server supports a maximum of 48 DDR4 ECC 3200 MT/s DIMMs. Both
                        registered DIMMs (RDIMMs) and load-reduced DIMMs (LRDIMMs) are
                        supported, which provide high speed and availability.
                 ●      The server supports a maximum of 24 Intel® Optane™ PMem 200 series
                        modules (PMem modules for short). When PMem modules are used together
                        with DDR4 DIMMs, the server supports a maximum of 18 TB memory capacity
                        (calculated based on a maximum of 256 GB capacity per DDR4 DIMM and a
                        maximum of 512 GB capacity per PMem module).
                 ●      Flexible drive configurations meet a variety of business requirements and
                        ensure high elasticity and scalability of storage resources.
                 ●      The use of all solid-state drives (SSDs) is supported. An SSD supports up to
                        100 times more I/O operations per second (IOPS) than a typical hard disk
                        drive (HDD). The use of all SSDs provides higher I/O performance than the
                        use of all HDDs or a combination of HDDs and SSDs.

                 ●      The use of 12 Gbit/s Serial Attached SCSI (SAS) connections for internal
                        storage provides twice the data transmission rate of 6 Gbit/s SAS
                        connections, maximizing the performance of I/O-intensive applications.
                 ●      With Intel integrated I/O, the third-generation Intel® Xeon® Scalable processors
                        integrate the PCIe 3.0 controller to shorten I/O latency and improve overall
                        system performance.
                 ●      The server supports a maximum of 11 PCIe 3.0 slots, including one for the
                        OCP 3.0 network adapter.
                 ●      The server supports one GE, 10GE, 25GE, or 100GE OCP 3.0 network adapter
                        that supports hot swap, network controller sideband interface (NC-SI),
                        Preboot eXecution Environment (PXE), and Wake on LAN (WoL).
                 ●      Based on the rich RAS features of Intel processors and the fault diagnosis
                        system (FDM) of Huawei servers, the server supports precise fault locating,
                        timely fault alarms, redundant fans and power modules, and hot replacement,
                        providing customers with leading availability, serviceability, and reliability.
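                 The 18 TB maximum memory capacity quoted in the specifications above can be
                 checked with a short calculation. The sketch below assumes that the 48 memory
                 slots are split evenly between DDR4 DIMMs and PMem modules (24 each); the
                 source gives only the per-module maximums and the PMem module count, so the
                 even split and the function name are illustrative assumptions.

```python
# Sketch: reproduce the 18 TB maximum memory capacity from the overview.
# Assumption (not stated explicitly in the source): PMem modules occupy
# DIMM slots, so 24 PMem modules leave 24 slots for DDR4 DIMMs.

TOTAL_SLOTS = 48            # maximum number of DIMM slots (from the text)
PMEM_MODULE_MAX_COUNT = 24  # maximum number of PMem modules (from the text)
DDR4_DIMM_MAX_GB = 256      # maximum capacity per DDR4 DIMM (from the text)
PMEM_MODULE_MAX_GB = 512    # maximum capacity per PMem module (from the text)


def max_mixed_capacity_tb() -> float:
    """Return the maximum DDR4 + PMem capacity in TB (1 TB = 1024 GB)."""
    ddr4_count = TOTAL_SLOTS - PMEM_MODULE_MAX_COUNT
    total_gb = (ddr4_count * DDR4_DIMM_MAX_GB
                + PMEM_MODULE_MAX_COUNT * PMEM_MODULE_MAX_GB)
    return total_gb / 1024


print(max_mixed_capacity_tb())  # 18.0
```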

1.2 RAS Definition
                 RAS stands for Reliability, Availability and Serviceability.

                 ●      Reliability: refers to the capability of a product to sustain specific functions in
                        a given time under given conditions. It is the capability of a server to keep
                        operating properly, free from faults.
                 ●      Availability: refers to the capability of a product to be in an operable state at
                        any given time. It is the capability of a server to maximize the time during
                        which the system is available.
                 ●      Serviceability: refers to the probability of completing specific maintenance
                        actions in a given time. It is the capability of a server to quickly recover
                        from faults.

                 Figure 1-1 shows the top-layer framework of RAS.

                 Figure 1-1 Top-layer framework of RAS

                 The core idea behind the RAS design of Huawei V6 servers is to maximize
                 customer service availability and minimize the possibility of breakdowns. A highly
                 available standalone system must have highly reliable underlying hardware and
                 software design, high error tolerance, and fast repair and service
                 recovery capabilities.

1.3 RAS Measurements
                 Server RAS is measured in terms of Mean Time Between Failures (MTBF),
                 Mean Time to Repair (MTTR), availability, and other factors.

1.3.1 Reliability Measurement
                 The major indicators that measure reliability are the failure rate (λ) and MTBF.

                 The relationship between the λ and MTBF is as follows: λ = 1/MTBF

                 A larger MTBF means a smaller failure rate and higher system reliability.

                        NOTICE

                 The MTBF does not indicate the service life. It indicates the reliability of a
                 component during its service life. The service life is the longest time for
                 which a component can be used.
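                 The relationship λ = 1/MTBF, and the point made in the NOTICE that MTBF is not
                 a service life, can be illustrated with a small sketch. It assumes a constant
                 failure rate (exponential lifetime model), which the source does not state;
                 the function names and the 100,000-hour MTBF are hypothetical.

```python
import math


def failure_rate(mtbf_hours: float) -> float:
    """Failure rate (failures per hour) from MTBF: lambda = 1 / MTBF."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 / mtbf_hours


def survival_probability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of running t_hours without a failure.

    Assumption: constant failure rate, so R(t) = exp(-lambda * t).
    """
    return math.exp(-failure_rate(mtbf_hours) * t_hours)


mtbf = 100_000.0  # hypothetical MTBF in hours
print(failure_rate(mtbf))                 # 1e-05 failures per hour
# Under this model only about 37% of units run for a full MTBF without
# a failure, which is why MTBF must not be read as a service life.
print(survival_probability(mtbf, mtbf))
```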

1.3.2 Serviceability Measurement
                 Serviceability is often measured by the MTTR. The MTTR excludes the time
                 required for administration and logistics as well as the time required for preventive
                 maintenance. A smaller MTTR means better product serviceability.

1.3.3 Availability Measurement
                 A indicates availability.

                 A = MUT/(MUT + MDT) × 100%

                 Mean Up Time (MUT) indicates the average available time. Mean Down Time
                 (MDT) indicates the average interruption time and can also be considered as the
                 average downtime.

                 Availability (A) in a broad sense is not suitable for describing inherent features of
                 a product because the MDT includes the time required for administration and
                 logistics. For example, even if a fault is easy to handle, the product does not
                 appear highly available if the fault is not promptly rectified due to delayed fault
                 reporting or misoperations by management personnel. Generally, the inherent
                 availability parameter Ai is used to describe the inherent availability of a product.

                 Ai = MTBF/(MTBF + MTTR) × 100%

                 A larger value of Ai means higher product availability.
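                 The two availability formulas can be evaluated as follows. This is a minimal
                 sketch: the function names and the sample MTBF and MTTR values are
                 hypothetical, and the functions return fractions (multiply by 100 for the
                 percentage form used in the formulas above).

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def availability(mut_hours: float, mdt_hours: float) -> float:
    """A = MUT / (MUT + MDT), returned as a fraction between 0 and 1."""
    return mut_hours / (mut_hours + mdt_hours)


def inherent_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Ai = MTBF / (MTBF + MTTR), returned as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per year implied by an availability fraction."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical values: MTBF of 100,000 hours and MTTR of 4 hours.
ai = inherent_availability(100_000.0, 4.0)
print(f"Ai = {ai * 100:.4f}%")
print(f"Downtime per year = {annual_downtime_hours(ai):.2f} hours")
```

                 Because Ai depends only on MTBF and MTTR, halving the MTTR has roughly the
                 same effect on the unavailable fraction as doubling the MTBF.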

1.4 RAS Importance
                 As server processing capabilities grow and the number of services carried by each
                 server increases, the impact of unexpected server breakdown is increasing.
                 According to the ITIC survey report, the cost of downtime per hour is as follows:
                 ●      For about 98% of services, the downtime cost may exceed US$100,000/hour.
                 ●      For about 88% of services, the downtime cost may exceed US$300,000/hour.
                 ●      For about 33% of services, the downtime cost may exceed US$1 million/hour.
                 Data source: May 2017, Information Technology Intelligence Consulting Corp.
                 (ITIC)
                 Unexpected downtime not only causes financial loss but also has other negative
                 effects: damage to the corporate image due to extensive media reporting, a higher
                 customer churn rate, and disruption of employees' work schedules.

                                                                       2         RAS Basis

                 This section describes the RAS design basis, component-level reliability design, and
                 production filtering requirements.
                 Component-level reliability is a basic requirement for RAS design. That is,
                 hardware must operate properly for as long as possible.
                 To achieve component-level reliability, designers need to ensure that correct
                 components are used and components are used correctly. To ensure that correct
                 components are used, careful component selection and introduction are required.
                 To ensure that components are used correctly, excellent design (for example,
                 derating design) is required.
                 Component-level reliability quality assurance includes three procedures: supplier
                 materials reliability management, product reliability design, and production
                 reliability filtering. The three procedures are closely related to each other.
                 Designers need to thoroughly consider all the three procedures.
                 The following sections describe the procedures from different perspectives.
                 2.1 Component Selection and Derating Design
                 2.2 Reliability Filtering
                 2.3 Testing

2.1 Component Selection and Derating Design
                 Drawing on Huawei's long-term experience with hardware in the communications
                 technology (CT) industry, the 2488H V6 has stringent requirements on component
                 selection and derating design.
                 ●      2488H V6 component selection strategy
                        Huawei has a strict examination process for the introduction of new
                        components, including supplier qualification review and component
                        application reliability assessment and testing. Huawei also has complete
                        certification processes and sufficient test capabilities to ensure the
                        reliability of new components.
                 ●      2488H V6 derating design
                        In terms of component application, Huawei servers comply with the same
                        derating standards as communication products. Derating design keeps the
                        working stress of a component or device appropriately lower than the
                        rating specified for that component or device, which decreases the failure
                        rate and improves reliability.
                        Derating design improves component reliability or extends the service life of
                        components in the following ways:
                        –   Minimizing the possibility that a component operating at the edge of
                            overstress fails during its service life.
                        –   Minimizing the impact exerted by the initial tolerance of component
                            parameters (such as differences among individual components,
                            differences among components of different batches, and technology
                            changes).
                        –   Minimizing the impact exerted by long-term deviation of component
                            parameter values.
                        –   Providing allowance for uncertainties during stress calculation.
                        –   Providing allowance for the occurrence of unexpected events, such as air
                            conditioner faults in the equipment room and transient stress at the peak
                            voltage.
                        Derating design for the 2488H V6 goes through several phases in the entire
                        R&D process:
                        –   Component selection: Select appropriate components that satisfy
                            derating requirements.
                        –   Design: Component derating design must comply with applicable
                            specifications.
                        –   Testing: Product test engineers inspect component derating by conducting
                            tests to determine whether components meet derating specifications.
                            Product reliability engineers conduct technical reviews of derating
                            inspection, testing, and issue resolution.
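                 The derating principle described above (working stress kept below the rated
                 value by a margin) can be expressed as a simple check. The sketch below is
                 illustrative only: the class, the component names, and the derating limits
                 are hypothetical and are not taken from Huawei's derating specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StressPoint:
    """One stress parameter of one component (all values hypothetical)."""
    name: str
    working: float         # measured or calculated working stress
    rated: float           # rating from the component datasheet, same unit
    derating_limit: float  # maximum allowed working/rated ratio (0 to 1)

    def stress_ratio(self) -> float:
        return self.working / self.rated

    def meets_derating(self) -> bool:
        return self.stress_ratio() <= self.derating_limit


def failing_points(points: list[StressPoint]) -> list[str]:
    """Return the names of stress points that violate their derating limit."""
    return [p.name for p in points if not p.meets_derating()]


points = [
    StressPoint("C12 voltage (V)", working=12.0, rated=25.0, derating_limit=0.6),
    StressPoint("R17 power (W)", working=0.2, rated=0.25, derating_limit=0.5),
]
print(failing_points(points))  # ['R17 power (W)']
```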

2.2 Reliability Filtering
                 Figure 2-1 shows how the failure rate of electronic components changes with
                 operating time.

                 Figure 2-1 Bathtub curve

                 After the early failure period ends, the device enters a stable working period, that
                 is, the random failure period. At the end of the service life, the device enters the
                 wear-out failure period, in which the failure probability increases. Reliability
                 filtering brings the 2488H V6 into the random failure period as soon as possible
                 to improve device stability.
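                 The three periods of the bathtub curve can be sketched numerically. The model
                 below is an assumption made for illustration, not the model behind Figure 2-1:
                 it adds three Weibull hazard terms (decreasing, constant, and increasing), and
                 all shape and scale parameters are hypothetical.

```python
def weibull_hazard(t_hours: float, shape: float, scale_hours: float) -> float:
    """Weibull hazard rate h(t) = (k / s) * (t / s) ** (k - 1), for t > 0."""
    if t_hours <= 0:
        raise ValueError("t_hours must be positive")
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)


def bathtub_failure_rate(t_hours: float) -> float:
    """Illustrative bathtub curve as the sum of three hazard terms."""
    early = weibull_hazard(t_hours, shape=0.5, scale_hours=50_000)     # early failures
    random_ = weibull_hazard(t_hours, shape=1.0, scale_hours=100_000)  # constant rate
    wear_out = weibull_hazard(t_hours, shape=5.0, scale_hours=80_000)  # wear-out
    return early + random_ + wear_out


# The rate is high at first, lowest in the random failure period,
# and rises again as wear-out sets in.
for t in (100, 20_000, 100_000):
    print(t, bathtub_failure_rate(t))
```

                 In this picture, reliability filtering corresponds to operating each unit
                 through the steep early part of the curve before delivery.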

                 Reliability filtering aims to:

                 ●      Screen out early failures to ensure inherent design reliability.
                 ●      Reduce the product failure rate after delivery, and improve MTBF and inherent
                        product availability.
                 ●      Establish a long-term large-sample failure analysis mechanism and
                        continuously optimize front-end design to improve product reliability.

                 Huawei has developed an effective method for reliability filtering through its
                 long-term R&D experience in the CT industry. Based on this experience and the
                 characteristics of server products, Huawei has established a server reliability
                 filtering mechanism. Some of the reliability filtering tests and their content are
                 listed below.

                 Tests:
                 ●      CPU large-stress test
                 ●      UPI large-stress test
                 ●      Memory large-stress test
                 ●      Hard disk large-stress test

                 Test content:
                 ●      Maximum workload
                 ●      Increased temperature stress
                 ●      Increased electrical stress
                 ●      Long-term continuous operation

                 Each 2488H V6 server must pass these large-stress tests before delivery.

2.3 Testing
                 Signal testing is a necessary and important means to ensure reliability of
                 electronic products.
                 For example, the latch-up effect is an important failure mode of CMOS circuits.
                 It is a parasitic effect unique to the CMOS process and may even cause circuit
                 failure or chip burnout. Voltage overshoot is an important cause of the latch-up
                 effect. The latch-up effect is not easy to detect during the manufacturing process
                 because it is highly random and usually occurs after long-term use.
                 An effective way to avoid the latch-up effect is to ensure the integrity of all
                 signals and that no signal overshoot affects device functions. This task is usually
                 completed in the R&D process.
                 Huawei has conducted the following tests during the server R&D process:
                 ●      Integrity test for all signals: All signals are tested to ensure that the signals
                        meet the component application requirements to improve design reliability
                        from the bottom layer.
                 ●      Test for all power features: Power-on and power-off, input/output (I/O)
                        features, and short circuits are tested for all power supply units (PSUs) to
                        ensure that power supplies meet various application requirements. Special
                        tests are conducted for key power supplies (for example, CPU VRD power
                        supplies) to ensure that servers can operate stably in extreme workload and
                        application environments.
                 ●      Multi-sample test for key high-speed links: Discrete tests are conducted on
                        boards of different batches and from different vendors to assess key high-
                        speed signals.
                 ●      Error tolerance test: The system-level and chip-level (FIT) error tolerance tests
                        are conducted to improve server reliability.
                 ●      Stability test: Extreme tests (such as large-stress tests and repeated power
                        cycling) are conducted on a large number of servers in different
                        application scenarios and extreme environments. The stability test ensures
                        high availability of the entire server system.

               3           Fault Management System (FMS)

                 This document describes the fault management system and methods of the
                 2488H V6.

                 3.1 Fault Management Methodology
                 3.2 Fault Management System (FMS)
                 3.3 Basic Hardware Faults
                 3.4 Service Hardware Faults

3.1 Fault Management Methodology
                 As servers evolve, they incorporate more and more modules and components,
                 which increases the risk of faults.
                 Faults are handled hierarchically, in line with the layered architecture of
                 the server.

3.1.1 Fault Management Architecture
                 Server faults generally refer to hardware faults. Handling them requires the
                 hardware, firmware, BIOS, OS, and management software to work together.
                 Hardware is at the bottom layer of the entire system. It provides multiple
                 fault handling mechanisms:
                 ●      Detects faults comprehensively using chips (including CPUs and
                        memory expansion chips) and multiple sensors.
                 ●      Corrects correctable faults using mechanisms such as ECC and retry.
                 ●      Uses redundancy design to avoid preventable faults.
                 ●      Records detected errors in registers, such as the CPU MCA registers, for
                        the fault management system to use.
                 Many faults that cannot be rectified by hardware can be rectified by the
                 firmware, BIOS, or OS, for example, through page offline or core disable. For
                 application software, an HA solution can switch customer applications over
                 in time when a hardware fault occurs, ensuring customer application
                 continuity. Note that HA support is application specific. For details, contact the
                 application provider.
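
                 As an illustration of how register records of this kind can be consumed, the
                 following minimal Python sketch decodes the architectural flag bits of an
                 IA32_MCi_STATUS value, using the bit layout documented in the Intel Software
                 Developer's Manual. The names McaRecord, decode_mci_status, and severity, and
                 the three severity labels, are assumptions made for this example. They do not
                 describe the actual BIOS or iBMC implementation of the 2488H V6.

```python
from dataclasses import dataclass

# Bit positions of architectural flags in IA32_MCi_STATUS (Intel SDM).
VAL, OVER, UC, EN, MISCV, ADDRV, PCC = 63, 62, 61, 60, 59, 58, 57


@dataclass
class McaRecord:
    valid: bool            # VAL: the bank holds a logged error
    overflow: bool         # OVER: an earlier error was overwritten
    uncorrected: bool      # UC: hardware could not correct the error
    addr_valid: bool       # ADDRV: the matching ADDR register is valid
    context_corrupt: bool  # PCC: processor context may be corrupted
    mca_error_code: int    # bits 15:0, architectural error code


def decode_mci_status(status: int) -> McaRecord:
    """Split a raw 64-bit status value into its flag fields."""
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    return McaRecord(
        valid=bit(VAL),
        overflow=bit(OVER),
        uncorrected=bit(UC),
        addr_valid=bit(ADDRV),
        context_corrupt=bit(PCC),
        mca_error_code=status & 0xFFFF,
    )


def severity(rec: McaRecord) -> str:
    """Simplified severity reading (labels are hypothetical)."""
    if not rec.valid:
        return "none"
    if not rec.uncorrected:
        return "corrected"
    return "uncorrected_fatal" if rec.context_corrupt else "uncorrected_recoverable"
```

                 For example, a status value with only VAL and ADDRV set decodes to a corrected
                 error with a valid address, whereas a value with VAL, UC, and PCC set is read
                 as fatal in this simplified scheme.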

                 Figure 3-1 shows the fault management architecture.

                 Figure 3-1 Fault management architecture

                 In addition to the fault management architecture, the 2488H V6 supports fully
                 autonomous fault diagnosis and management (FDM).

                 The management system supports remote device management and provides one-
                 stop services such as device configuration, software and firmware upgrade, and
                 fault management. For details, see 3.2 Fault Management System (FMS).

3.1.2 Fault Types and Troubleshooting
                 System faults can be classified into the following types, as listed in Table 3-1.

                 Table 3-1 Types of server faults

                  Category    Fault Type                    Impact on     Example
                                                            the System

                  Category A  Correctable chip errors       Minor         Frequent memory ECC errors may have minor
                                                                          impact in the short term but pose a serious
                                                                          risk in the long term.

                              Lockable chip errors          Minor         Huawei ES3000 is a standard PCIe SSD card
                                                                          based on NAND flash. The ES3000 controller
                                                                          locks chip errors internally and prevents
                                                                          any impact on the system.

                  Category B  Recoverable software errors   Medium        The server contains a large number of
                                                                          inter-integrated circuit (I2C) components.
                                                                          If an I2C component is faulty, the entire
                                                                          I2C link may be interrupted, which has a
                                                                          major impact. The iBMC can reset the link
                                                                          to restore it.

                              Failover to spare parts       Medium        Server components, such as PSUs, fan
                                                                          modules, and memory ranks, work in
                                                                          redundancy mode.

                              Degrading                     Major         If a UPI link between CPUs is interrupted,
                                                                          the width of the UPI link is automatically
                                                                          reduced. The overall performance decreases,
                                                                          but the system still functions.

                              Lockable system errors        Major         If an error occurs in a memory unit, the OS
                                                                          detects the error and takes the faulty
                                                                          memory page offline so that the error
                                                                          source is isolated.

                  Category C  Uncorrectable errors          Major         If the output of a key clock source is
                                                                          abnormal, the iBMC can detect the error
                                                                          source but cannot correct the error.

                              Undetectable errors           Uncertain     This type of error is usually accidental
                                                                          and difficult to locate, varies in
                                                                          severity, and may be caused by hardware or
                                                                          software.

                 Errors of categories A and B will not cause service interruption, but errors of
                 category C will cause service interruption.
                 The similarity between category A and category B is as follows: When an error of
                 category A or B occurs, the system does not break down immediately, and the
                 FMS reports the error to maintenance personnel so that they can correct the error
                 as scheduled.
                 Errors of category C must be corrected as soon as possible. To achieve this goal,
                 system design must meet the following requirements: A reliable FMS is required
                 for accurate fault locating, and good structure design is required for fast parts
                 replacement.
                 Table 3-1 lists the error categories in theory. In practice, the same error may
                 have a different impact in different application scenarios. For example, the
                 unstable output of a PSU falls into category B if the PSUs are configured in N+N
                 redundancy mode, but falls into category C if the PSUs are not redundant.
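
                 The classification in Table 3-1 and the PSU example can be sketched in a few
                 lines of Python. This is a minimal illustration only: the dictionary keys, the
                 enum, and the function names are hypothetical and are not taken from any
                 Huawei software.

```python
from enum import Enum


class Category(Enum):
    A = "A"
    B = "B"
    C = "C"


# Nominal category of each fault type, as listed in Table 3-1.
NOMINAL_CATEGORY = {
    "correctable chip error": Category.A,
    "lockable chip error": Category.A,
    "recoverable software error": Category.B,
    "failover to spare parts": Category.B,
    "degrading": Category.B,
    "lockable system error": Category.B,
    "uncorrectable error": Category.C,
    "undetectable error": Category.C,
}


def interrupts_service(category: Category) -> bool:
    """Only category C errors interrupt services."""
    return category is Category.C


def psu_fault_category(redundant: bool) -> Category:
    """Unstable PSU output: failover is possible only with redundant PSUs."""
    return Category.B if redundant else Category.C
```

                 With these definitions, psu_fault_category(True) yields category B (the spare
                 PSU takes over), while psu_fault_category(False) yields category C, which
                 interrupts services.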

                 Different resolution policies should be adopted for different types of faults. Figure
                 3-2 shows the resolution policies for various faults based on FMS.
                 As shown in the following figure, all faults must be detected first. If an error
                 occurs without being detected, the error handling process cannot be triggered.
                 Different processing policies, such as software recovery, spare parts switchover,
                 downgrade, and fault isolation, are used for faults that can be detected. All
                 detected errors are reported to the fault management system for it to collect fault
                 information and locate faults.

                         NOTE

                        Given the current technical limits of the industry, errors that cannot be
                        detected or handled should be addressed through offline maintenance
                        scheduled in the maintenance plan.

                 Figure 3-2 Fault classification and management

3.2 Fault Management System (FMS)
                 Quickly locating fault sources among a large number of components is an
                 important means to ensure availability and can greatly shorten the maintenance
                 time. 2488H V6 hardware faults can be classified into two types by hardware
                 location: basic hardware faults and service hardware faults.
                 ●      Basic hardware faults: Basic hardware includes PSUs, fan modules, board
                        power modules, and clocks. Basic hardware is not directly associated with
                        upper-layer services, and fault detection does not necessarily involve the
                        service system. Therefore, the iBMC on the 2488H V6 handles basic
                        hardware errors independently.
                 ●      Service hardware faults: Service hardware includes processors, DIMMs, PCIe
                        devices, and drives. These devices are in the execution path of applications
                        and are closely related to customer services. Most service hardware faults are
                        located, analyzed, and handled by the BIOS and iBMC, and some faults
                        require the OS.
                 In addition to accurate fault locating and prompt fault rectification, the FMS must
                 provide fault warning, that is, identify potential faults so that users can
                 hot-swap components or schedule a planned shutdown to minimize the impact on
                 services.
                 The 2488H V6 integrates the fault diagnosis and management (FDM) system, as
                 shown in Figure 3-3. The FDM consists of sensors, complex programmable logic
                 devices (CPLDs), the out-of-band management system iBMC, BIOS, platform
                 controller hub (PCH), CPUs, Huawei baseboard management agent (iBMA,
                 optional), and FusionServer Tools (optional).

                 Figure 3-3 FMS components

                 The FMS of the 2488H V6 covers the hardware layer, BIOS layer, CPU platform,
                 and out-of-band management system, and provides the interface protocols
                 required for OS-layer fault locating. Figure 3-4 shows the FMS framework.

                 Figure 3-4 FMS framework

                 The FMS consists of the following components:
                 ●      iBMC: Huawei's latest-generation server management system, which is the
                        core of the fault location system. Based on the Huawei-developed Hi1711
                        chip, the iBMC collects, summarizes, and analyzes faults, and displays fault
                        information using the WebUI, LCD, and logs to implement server
                        management. The iBMC is an independent system decoupled from the OS
                        and application software of the service system. Its chips and upper-layer
                        software are developed by Huawei to meet various service requirements of
                        different customers.
                 ●      Processor platform: The 2488H V6 uses Intel® Xeon® Scalable processors
                        (Cooper Lake). In addition to basic RAS features, the 2488H V6 provides
                        advanced RAS capabilities, greatly improving its ability to handle service
                        hardware faults.
                 ●      CPLD: It collects basic hardware faults, and connects to hardware module
                        interfaces and iBMC over Huawei's proprietary CPLD-Bus interface.
                 ●      BIOS: It collects and locates service hardware faults, provides fault locating
                        results for the iBMC, and provides fault management interfaces for the OS.
                 ●      (Optional) iBMA: The Huawei baseboard management agent runs on the OS and
                        obtains service-side hardware information, which is helpful for fault locating
                        and warning.
                 ●      (optional) FusionServer Tools: The tool suite developed for Huawei servers
                        facilitates server installation, configuration, fault diagnosis, and fault
                        prediction.
                 ●      User interface: A BMC WebUI, a local LCD, and fault indicators for key
                        components are provided to facilitate remote or local system maintenance.
                 ●      Various protocols: The FMS uses the following interfaces and protocols:
                        Huawei CPLD-Bus, low pin count (LPC), SML, Platform Environment Control
                        Interface (PECI), PCIe, universal asynchronous receiver/transmitter (UART),
                        I2C, and PMBus.

3.3 Basic Hardware Faults
                 Basic hardware modules include PSUs, fan modules, and underlying hardware of
                 other components (excluding CPUs, DIMMs, drives, and standard PCIe cards), such
                 as the compute module, front I/O module, rear I/O module, and converged
                 console.

                 Basic hardware faults are of different types. During troubleshooting, the
                 CPLD aggregates the fault information, which includes the fault type and fault
                 location, and reports it to the iBMC. The iBMC parses the received fault
                 information and displays it on the WebUI. During parsing, the iBMC identifies the
                 fault level and type and provides corresponding handling suggestions, helping
                 the customer rectify the fault quickly.
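
                 The parsing step described above can be pictured as a lookup from a reported
                 fault record to a level and a handling suggestion. The following Python sketch
                 is purely illustrative: the record fields, the fault type names, the levels,
                 and the suggestion texts are all hypothetical and do not reflect the actual
                 CPLD report format or the iBMC knowledge base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    fault_type: str  # hypothetical identifier reported by the CPLD
    location: str    # hypothetical component location, e.g. "Fan3"


@dataclass(frozen=True)
class ParsedFault:
    location: str
    fault_type: str
    level: str
    suggestion: str


# Hypothetical knowledge base: fault type -> (level, handling suggestion).
KNOWLEDGE_BASE = {
    "fan_speed_low": ("Minor", "Check the fan module and replace it if the alarm persists."),
    "psu_output_lost": ("Major", "Check the power input and replace the PSU if the alarm persists."),
}


def parse_fault(record: FaultRecord) -> ParsedFault:
    """Attach a level and a handling suggestion to a reported fault."""
    level, suggestion = KNOWLEDGE_BASE.get(
        record.fault_type, ("Unknown", "Collect logs and contact technical support.")
    )
    return ParsedFault(record.location, record.fault_type, level, suggestion)
```

                 A record such as FaultRecord("fan_speed_low", "Fan3") is resolved to a level
                 and a suggestion that a user interface could display next to the location.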

                 In addition to basic hardware faults, the CPLD also monitors some service
                 hardware conditions, including CPU faults and excessively high CPU and memory
                 temperatures. In this way, the iBMC monitors key hardware with minimal delay
                 and independently of the BIOS and OS, which may be unavailable when a serious
                 fault occurs on the processor or memory. The iBMC can then take timely
                 measures, such as increasing the fan speed, to prevent key components from
                 being damaged and the damage from spreading to the entire system.

3.4 Service Hardware Faults
                 Service hardware includes CPUs, memory, PCIe devices, and the local storage
                 system. Due to the characteristics of the local storage system, the 2488H V6 can
                 manage its faults as basic hardware faults, or use FusionServer Tools or BMA to
                 implement in-band fault management for the local storage system. In this section,
                 service hardware therefore refers only to CPUs, DIMMs, and PCIe devices.

                 Based on the Machine Check Architecture (MCA) provided by Intel® Xeon® Scalable
                 processors (Cooper Lake), the 2488H V6 integrates the hardware, BIOS, iBMC, and
                 OS fault handling mechanisms into a single FMS. After a fault occurs in the
                 system, the FMS provides fault diagnosis, fault locating, fault rectification,
                 fault information collection, and fault reporting. In addition, the core
                 modules of the FMS run on the BIOS and iBMC and do not depend on the OS.
                 Therefore, the FMS is always running and can take measures
                 immediately when an error occurs to prevent the system from breaking down.

                 Figure 3-5 shows the flowchart for handling service hardware faults.

                 Figure 3-5 Flowchart for handling service hardware faults

                 ●      If the leaky bucket algorithm is used and the number of correctable errors
                        reaches the specified threshold, a system management interrupt (SMI) is
                        triggered to instruct the BIOS to handle the error. After receiving the SMI, the
                        BIOS handles the error based on the SMI type. After ensuring that the system
                        is running properly, the BIOS locates and isolates the faulty component,
                        collects error status register information, and reports the error and detailed
                        error status register information to the iBMC. The information helps users or
                        maintenance personnel further analyze the error cause. (The purple arrow
                        lines in Figure 3-5 show the flowchart for handling a correctable error.)
                 ●      The process for handling an uncorrectable, recoverable error is as follows: An
                        uncorrectable, recoverable error has no adverse impact on the system. This
                        error is marked with an error tag, and an SMI is triggered. After receiving the
                        SMI, the BIOS collects error status register information, locates the faulty
                        components, and reports error information and detailed error status register
                        information to the iBMC. (The dark-blue arrow lines in Figure 3-5 show the
                        flowchart for handling an uncorrectable, recoverable error.)
                 ●      The process for handling an uncorrectable, unrecoverable error in the x86
                        system is as follows: If an uncorrectable, unrecoverable error occurs, the
                        CATERR_N pin is pulled down. This error causes the system to stop
                        responding. This error triggers the error collection program of the iBMC to
                        obtain error status register information of the x86 system. Based on the onsite
                        error information, the error collection program diagnoses the error and
                        displays error information to users promptly. (The brown arrow lines in
                        Figure 3-5 show the flowchart for handling an uncorrectable, unrecoverable
                        error.)
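
                 The leaky bucket thresholding mentioned for correctable errors can be sketched
                 as follows. This is a generic Python illustration of the technique, not the
                 processor or BIOS implementation: the class name, the parameters, and the use
                 of a return value in place of an SMI are assumptions made for this example.

```python
class LeakyBucket:
    """Counts correctable errors and drains one count per leak interval."""

    def __init__(self, threshold: int, leak_interval_s: float):
        self.threshold = threshold
        self.leak_interval_s = leak_interval_s
        self.level = 0
        self.last_leak = 0.0

    def record_error(self, now: float) -> bool:
        """Record one error at time `now` (seconds).

        Returns True when the threshold is reached, which is the point at
        which real hardware would signal firmware (here: an SMI).
        """
        # Drain one count for every full leak interval that has elapsed.
        leaked = int((now - self.last_leak) // self.leak_interval_s)
        if leaked > 0:
            self.level = max(0, self.level - leaked)
            self.last_leak += leaked * self.leak_interval_s

        self.level += 1
        if self.level >= self.threshold:
            self.level = 0  # start counting afresh after signalling
            return True
        return False
```

                 With a threshold of 3 and a leak interval of 10 seconds, three errors within
                 2 seconds trigger the signal, whereas three errors spaced 15 seconds apart do
                 not, because the bucket drains between them. Sparse errors are therefore
                 tolerated, and only sustained error rates cause an interrupt.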


                                                                 4         RAS Feature

                 This section describes some key RAS features of the 2488H V6, lists all RAS
                 features that have been implemented on the 2488H V6, and provides application
                 scenarios.

                 4.1 Architecture Design
                 4.2 Comprehensive Memory Protection
                 4.3 RAS Feature Summary
                 4.4 RAS Feature Description

4.1 Architecture Design
                 The system architecture design principles for the 2488H V6 are high availability,
                 high performance, good compatibility, and evolvability. High availability is the
                 core requirement of RAS design. Compatibility and evolvability improve the
                 serviceability of servers.
                 ●      High availability means using various design and troubleshooting methods to
                        maximize system uptime, minimize unplanned downtime, and reduce the
                        impact of downtime on services.
                 ●      Good compatibility refers to the decoupling of RAS features from customer
                        service systems or upper-layer applications. For example, the FMS
                        components of the 2488H V6 reside mainly on the out-of-band management
                        chip (BMC). No FMS component is placed on the OS. This decouples the fault
                        management module from the OS so that OS problems do not prevent the
                        fault management module from working properly.
                 Based on Huawei's powerful hardware platform, excellent overall structure design,
                 and powerful management software of Huawei-developed Hi1711 management
                 chips, the architecture design of the 2488H V6 implements the following
                 functions:
                 ●      The modular design makes modules loosely coupled with each other, which
                        facilitates parts replacement.
                 ●      The fully autonomous management system iBMC supports remote, one-stop
                        management, including server configuration, software and firmware
                        upgrades, and fault management.
                 ●      Separate airflow design and Huawei efficient fan modules ensure that the
                        2488H V6 stably operates at 45°C (113°F) even if some air conditioners in the
                        equipment room fail.
                 The 2488H V6 provides enhanced RAS features for core server components. The
                 2488H V6 provides comprehensive memory protection against common memory
                 faults in the industry.

4.2 Comprehensive Memory Protection
                 As memory technologies develop rapidly, the chip manufacturing process keeps
                 improving, the chip operating voltage keeps decreasing, and the memory capacity
                 keeps increasing. However, memory reliability has also become a top-priority issue.
                 Without protective mechanisms for the memory, serious memory faults
                 often result in severe consequences, such as system breakdown and service
                 interruption. As the number of DIMMs increases, the consequences of
                 serious memory faults become even worse.
                 The 2488H V6 implements a range of memory RAS measures to address these
                 problems.

4.2.1 End-to-End Memory Protection
                 To ensure memory availability, the 2488H V6 provides an end-to-end memory
                 protection mechanism with the help of the FMS. This mechanism prevents
                 memory faults from spreading or escalating and affecting the entire
                 system. Figure 4-1 shows the mechanism.

                 Figure 4-1 End-to-end memory protection mechanism

                 To ensure memory availability, key measures include:
                 ●      During DIMM procurement, only DIMMs from mainstream vendors are selected,
                        and the purchased DIMMs are strictly tested and screened.
                 ●      Building on the extensive memory RAS features of the CPU, memory-related
                        algorithms are enhanced to provide multiple layers of algorithmic
                        protection. For example, the memory fault storm suppression algorithm and
                        re-examination algorithm are optimized to ensure accurate locating and
                        quick processing of memory faults. For details, see 4.2.3 High-Reliability
                        Memory Application Design.
                 ●      Based on the FMS, the fault prediction algorithm is used to implement fault
                        warning for risky DIMMs.
                 ●      During both POST and runtime, faulty DIMMs can be accurately located,
                        and the faulty memory units are isolated through startup isolation or runtime
                        page offline.
                 ●      The management software reports alarms immediately to prompt users to
                        replace risky DIMMs in time.

4.2.2 Memory Data Protection
                 The 2488H V6 supports multiple memory data protection features, such as DDR
                 bus data CRC check and retry, memory data error checking and correction (ECC),
                 and faulty chip isolation.
                 Memory chips are the storage entities of DIMMs. In the x86 architecture, each
                 time a CPU reads data from or writes data to memory, several memory chips are
                 involved. Some chips provide data bits and others provide check bits. Together,
                 these chips complete the read or write of the minimum access unit (usually
                 called a cache line).
                 ECC is a basic feature that uses this check mechanism. However, it can correct only
                 a single-bit error in a cache line.
                 The Cooper Lake processor can correct multi-bit data errors on the
                 same memory chip. This enhanced correction capability has little impact on
                 performance.
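
                 How check bits allow a single-bit error to be located and corrected can be
                 shown with the classic Hamming(7,4) code. The Python sketch below is a teaching
                 example only. It is not the code used on DDR buses, where ECC DIMMs typically
                 carry 8 check bits per 64 data bits, and it does not represent the stronger
                 chip-level correction of the processor. The function names are chosen for this
                 example.

```python
def hamming74_encode(data):
    """Encode 4 data bits as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(code):
    """Return (data bits, error position); position 0 means no error found."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome is the 1-based error position
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position
```

                 Encoding the data bits 1, 0, 1, 1 and then flipping any one of the seven stored
                 bits still returns the original data, together with the position of the bit
                 that was flipped. Two simultaneous bit errors exceed what this code can
                 correct, which is why stronger codes are needed for chip-level failures.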

4.2.3 High-Reliability Memory Application Design
                 The 2488H V6 uses multiple high-reliability memory application technologies to
                 implement memory error prediction and self-healing, minimizing the impact on
                 services.

HiRAS Technology
                 The 2488H V6 supports the HiRAS mode (high reliability mode). In HiRAS mode,
                 the system provides enhanced RAS capabilities, including memory fault self-
                 healing and stable system running technologies, to ensure high system reliability
                 and reduce the memory failure rate by 50% (without affecting services).

Memory Fault Self-Healing Technology
                 The memory fault self-healing technology is a Huawei-developed patented
                 technology. Based on the Huawei server log big data system, this technology uses
                 the machine learning algorithm to obtain the memory fault feature model,
                 embeds the fault feature model into the Huawei-developed BMC chip, and uses
                 the AI self-healing algorithm of the chip to accurately identify each memory fault
                 feature, and proactively trigger the software and hardware isolation and self-
                 healing mechanism.

Precise Memory Prediction Technology
                 For memory faults that cannot be self-healed in the system, the system uses the
                 AI prediction algorithm of the Huawei-developed chip to identify the severity of
                 the memory fault, and sends a pre-warning to remind the customer to migrate
                 services or replace the memory in a timely manner. The warning accuracy reaches
                 79%, and the memory breakdown rate decreases by 40%.

Fault Storm Suppression Technology
                 The occurrence, correction, and recording of a single error have little impact on
                 the system performance. However, when upper-layer applications frequently
                 access a memory area and multiple faults occur in the area, an interrupt storm of
                 memory errors occurs in the system, which adversely affects the system
                 performance. In severe cases, services may be suspended. The fault storm
                 suppression algorithm of the 2488H V6 can effectively reduce the number of
                 interrupts that need to be triggered. In this way, an interrupt storm is suppressed
                 when an error storm occurs, greatly reducing the impact on services. After a storm
                 is over, the server management system checks the fault type. If the faults are
                 transient ones, which are usually caused by environment changes, the
                 management system records the storm event. If the faults are permanent ones,
                 which are usually caused by electrical aging or damage, the management system
                 isolates faulty components. In addition, to prevent useful information loss caused
                 by storm suppression, the BMC proactively polls the fault register during storm
                 suppression and incorporates the fault information into the subsequent processing
                 mechanism.
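
                 The general idea of storm suppression, that is, limiting interrupts during a
                 burst while still capturing error records by polling, can be sketched in Python
                 as follows. The actual 2488H V6 algorithm is not described in this document, so
                 the windowing scheme, the parameters, and all names below are assumptions made
                 for illustration only.

```python
class StormSuppressor:
    """Delivers error interrupts until a burst exceeds a per-window limit."""

    def __init__(self, storm_limit: int, window_s: float):
        self.storm_limit = storm_limit
        self.window_s = window_s
        self.window_start = 0.0
        self.count = 0
        self.suppressed = False
        self.polled = []  # records captured by polling while suppressed

    def on_error(self, now: float, record) -> bool:
        """Return True if an interrupt is delivered for this error."""
        elapsed = now - self.window_start
        if elapsed >= self.window_s:
            # New window: the storm is over if the last window stayed within
            # the limit or at least one whole window passed without errors.
            if self.count <= self.storm_limit or elapsed >= 2 * self.window_s:
                self.suppressed = False
            self.window_start, self.count = now, 0
        self.count += 1
        if self.count > self.storm_limit:
            self.suppressed = True
        if self.suppressed:
            self.polled.append(record)  # keep the information, skip the interrupt
            return False
        return True


def post_storm_action(fault_is_permanent: bool) -> str:
    """Follow-up after a storm, as described in the text above."""
    return "isolate faulty component" if fault_is_permanent else "record storm event"
```

                 With a limit of 3 errors per 1-second window, the first three errors of a burst
                 raise interrupts and later ones are only recorded through polling. Once a quiet
                 period has passed, interrupts are delivered again.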

4.3 RAS Feature Summary
                 Table 4-1 lists the 2488H V6 RAS features, which are classified into categories
                 such as system-level RAS, memory RAS, PMem RAS, IIO RAS, hardware RAS,
                 and FDM RAS.

                 Table 4-1 2488H V6 RAS feature summary

                  Type        Feature ID      CPU and System RAS Feature                                    2488H V6

                  SYSTEM      SYSTEM_01       CPU Built-in Self Test (BIST)                                 Supported
                  SYSTEM      SYSTEM_02       Core Disable for Fault Resilient Boot (FRB)                   Supported
                  SYSTEM      SYSTEM_03       Corrupt Data Containment – Core                               Supported
                  SYSTEM      SYSTEM_04       Corrupt Data Containment – Uncore                             Supported
                  SYSTEM      SYSTEM_05       Socket Disable for FRB                                        Supported
                  SYSTEM      SYSTEM_06       Advanced Error Detection and Correction (AEDC)                Supported
                  SYSTEM      SYSTEM_07       Time-out Timer Schemes                                        Supported
                  SYSTEM      SYSTEM_08       Error Injection                                               Supported
                  SYSTEM      SYSTEM_09       Machine Check Architecture (MCA) Recovery                     Supported
                  SYSTEM      SYSTEM_10       MCA                                                           Supported
                  SYSTEM      SYSTEM_11       Machine Check Exception                                       Supported
                  SYSTEM      SYSTEM_12       Local MCE                                                     Supported
                  SYSTEM      SYSTEM_13       Enhanced MCA (EMCA) Gen2                                      Supported
                  SYSTEM      SYSTEM_14       Out-of-Band (OOB) Access to MCA Registers                     Supported
                  SYSTEM      SYSTEM_15       Error Reporting via IOMCA                                     Supported
                  SYSTEM      SYSTEM_16       Failed DIMM Isolation                                         Supported
                  SYSTEM      SYSTEM_17       HiRAS mode (High RAS)                                         Supported
                  Memory      MEMORY_01       Memory Thermal Throttling                                     Supported
                  Memory      MEMORY_02       Memory Single Device Data Correction                          Supported
                  Memory      MEMORY_03       DDR4 Command and Address Parity Check and Retry               Supported
                  Memory      MEMORY_04       Memory Demand and Patrol Scrubbing                            Supported
                  Memory      MEMORY_05       Memory Mirroring                                              Supported
                  Memory      MEMORY_06       DDR4 Write Data CRC Check and Retry                           Supported
                  Memory      MEMORY_07       Memory Data Scrambling with Command and Address               Supported
                  Memory      MEMORY_08       DDR4 Post Package Repair (PPR)                                Supported
                  Memory      MEMORY_09       Adaptive Data Correction – Single-Region                      Supported
                  Memory      MEMORY_10       Adaptive Double Device Data Correction –                      Supported
                                              Multiple-Region (ADDDC-MR, +1)
                  Memory      MEMORY_11       DDR4 Memory Multi Rank Sparing                                Supported
                  Memory      MEMORY_12       Address Range/Partial Memory Mirroring                        Supported
                  Memory      MEMORY_13       Memory SMBus Hang Recovery                                    Supported
                  Memory      MEMORY_14       Memory Disable/map-out for FRB                                Supported
                  Memory      MEMORY_15       MEMHOT Pin Support for Error Reporting                        Supported
                  Memory      MEMORY_16       Failure Prediction and Correction                             Supported
                  Memory      MEMORY_17       Fault self-healing result reporting                           Supported
                  Memory      MEMORY_18       Precise warning of memory faults                              Supported
                  Memory      MEMORY_19       iBMA page isolation                                           Supported
                  PMem        PMem MEMORY_01  PMem Module Error Detection and Correction                    Supported
                  PMem        PMem MEMORY_02  SDDC – Single Device Data Correct                             Supported
                  PMem        PMem MEMORY_03  PMem Module Package Sparing                                   Supported
                  PMem        PMem MEMORY_04  PMem Module Patrol Scrub                                      Supported
                  PMem        PMem MEMORY_05  PMem Module Media Address Error Detection and Verification    Supported
                  PMem        PMem MEMORY_06  PMem Module Data Poisoning                                    Supported
                  PMem        PMem MEMORY_07  PMem Module Viral Mode for Containment                        Supported
                  PMem        PMem MEMORY_08  PMem Module Address Range Scrub (ARS)                         Supported
                  PMem        PMem MEMORY_09  PMem Module Error Injection                                   Supported
                  PMem        PMem MEMORY_10  DDR-T Command/Address Parity Check and Retry                  Supported
                  PMem        PMem MEMORY_11  Read/Write Data ECC Check and Retry                           Supported
                  PMem        PMem MEMORY_12  Failed PMem Module Isolation                                  Supported
                  PMem        PMem MEMORY_13  PMem Module Error Reporting                                   Supported
                  IIO         IIO_01          PCIe Advanced Error Reporting                                 Supported
                  IIO         IIO_02          PCIe Corrupt Data Containment (Data Poisoning)                Supported
                  IIO         IIO_03          PCIe Link CRC Error Check and Retry                           Supported
                  IIO         IIO_04          PCIe End to End CRC (ECRC)                                    Supported
                  IIO         IIO_05          PCIe Link Retraining and Recovery                             Supported
                  IIO         IIO_06          PCIe Card Hot-Plug Surprise                                   Supported
                  IIO         IIO_07          PCIe "Stop and Scream"                                        Supported
                  UPI         UPI_01          Intel UPI Link Level Retry                                    Supported
                  UPI         UPI_02          Intel UPI Protocol Protection via 32-bit Rolling CRC          Supported
                  UPI         UPI_03          Intel UPI Dynamic Link Width Reduction                        Supported
                  UPI         UPI_04          UPI Virus Mode                                                Supported
                  UPI         UPI_05          UPI Topology Downgrade for Failed Link Isolation              Supported
                  Hardware    HW_01           Hot-Swappable PSUs in N+N Backup Mode                         Supported
                  Hardware    HW_02           Hot-Swappable Fan Modules in N+1 Backup Mode                  Supported
                  Hardware    HW_03           RAID and Hot Swap Supported by Hard Drives                    Supported
                  FDM         FDM_01          Fault Diagnosis System                                        Supported
                  FDM         FDM_02          Proactive Failure Analysis Engine (PFAE)                      Supported
                  FDM         FDM_03          Faulty CPU Locating                                           Supported
                  FDM         FDM_04          Faulty DIMM Locating                                          Supported
                  FDM         FDM_05          Faulty PSU Locating                                           Supported
                  FDM         FDM_06          Faulty Fan Module Locating                                    Supported
                  FDM         FDM_07          Faulty Hard Drive Locating                                    Supported
                  FDM         FDM_08          Hard Drive PFA                                                Supported
                  FDM         FDM_09          Service Life Prediction of Huawei ES3000 (Standard PCIe SSD)  Supported
                  FDM         FDM_10          iBMC CPU Self-Check                                           Supported
                  FDM         FDM_11          Remote System Software and Firmware Upgrade by the iBMC       Supported
                  FDM         FDM_12          Black Box of the iBMC                                         Supported
                  FDM         FDM_13          Breakdown Screenshot Capturing of the iBMC                    Supported
                  FDM         FDM_14          Breakdown Video Recording of the iBMC                         Supported

                         NOTE

                        ● Some RAS features are not enabled by default. You can enable them using the BIOS
                          setup. For details, see the server BIOS setup documentation.
                        ● Some RAS features vary slightly depending on the CPU.

4.4 RAS Feature Description

4.4.1 System-Level RAS Features
                  Feature ID                    SYSTEM_01

                  Feature                       CPU Built-in Self Test (BIST)

                  Description                   The internal self-check module of a CPU checks each
                                                core of the CPU during BIOS startup, and records the
                                                self-check results.

                  Category                      Reliability and serviceability

                  Customer Benefit/             Customers can detect faults in the CPU.
                  Application Scenario

                  Usage                         This feature automatically takes effect and cannot be
                                                disabled.

                  Constraints/                  None
                  Limitations

                  Feature ID                SYSTEM_02

                  Feature                   Core Disable for Fault Resilient Boot (FRB)

                  Description               After the CPU BIST, the faulty CPU core is isolated
                                            according to the BIST result, and the remaining CPU
                                            cores are started.

                  Category                  Reliability and availability

                  Customer Benefit/         The system still starts on the remaining healthy
                  Application Scenario      cores, so CPU availability is maximized even when
                                            some CPU cores are faulty.

                  Usage                     This feature automatically takes effect and cannot be
                                            disabled.

                  Constraints/              None
                  Limitations
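
The BIST and core-disable steps (SYSTEM_01 and SYSTEM_02) can be pictured with a short sketch. It is illustrative only: the cores_to_boot function and the dictionary of per-core BIST results are hypothetical and do not represent the BIOS implementation.

```python
from __future__ import annotations


def cores_to_boot(bist_results: dict[int, bool]) -> list[int]:
    """Return the IDs of cores that passed BIST; failed cores stay disabled."""
    healthy = [core for core, passed in sorted(bist_results.items()) if passed]
    if not healthy:
        # Assumption: with no healthy core left, this CPU cannot be started.
        raise RuntimeError("no healthy core available to boot")
    return healthy


# Hypothetical BIST record for a four-core CPU in which core 2 failed.
bist = {0: True, 1: True, 2: False, 3: True}
boot_cores = cores_to_boot(bist)
```

Here the faulty core 2 is isolated and the system starts with cores 0, 1, and 3.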

                  Feature ID                SYSTEM_03

                  Feature                   Corrupt Data Containment – Core

                  Description               When the CPU core cache receives erroneous data that
                                            hardware algorithms cannot correct, the CPU cores
                                            do not crash immediately. Instead, an interrupt request
                                            (IRQ) is sent to the OS, which performs recovery or
                                            retry according to how the data is used.

                  Category                  Reliability

                  Customer Benefit/         The system is more likely to remain available when
                  Application Scenario      uncorrected hardware errors exist.

                  Usage                     This feature automatically takes effect and is enabled
                                            by default.

                  Constraints/              None
                  Limitations
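
How an OS might choose between recovery and retry "according to the data usage" can be sketched as below. The white paper does not define the OS policy, so the three usage classes and the actions chosen for them are assumptions made purely for illustration.

```python
from enum import Enum


class Usage(Enum):
    # Hypothetical usage classes; a real OS distinguishes many more cases.
    CLEAN_CACHED_COPY = "clean copy of data that also exists elsewhere"
    USER_PROCESS_DATA = "data private to one application"
    KERNEL_CRITICAL = "data the OS itself cannot run without"


def recovery_action(usage: Usage) -> str:
    """Pick an action for uncorrected data based on how the data is used."""
    if usage is Usage.CLEAN_CACHED_COPY:
        return "retry: discard the bad copy and reload the data"
    if usage is Usage.USER_PROCESS_DATA:
        return "recover: terminate only the affected application"
    return "fail: reset the system"


actions = {usage.name: recovery_action(usage) for usage in Usage}
```

Only the last case takes the whole system down, which is why containment raises the probability that the system stays available.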

                  Feature ID                SYSTEM_04

                  Feature                   Corrupt Data Containment – Uncore

                  Description               When the uncore (non-core) components of the CPU,
                                            including the memory controller, cache agent module,
                                            internal I/O module, and UPI agent module, receive
                                            erroneous data that the hardware algorithm cannot
                                            correct, these components do not crash immediately.
                                            Instead, after the data enters the CPU core cache, an
                                            IRQ is sent to the OS, which performs recovery or retry
                                            according to how the data is used.

                  Category                  Reliability

                  Customer Benefit/         The system is more likely to remain available when
                  Application Scenario      uncorrected hardware errors exist. Erroneous data
                                            destined for an external device, for example, data for
                                            a single pixel displayed on the screen, can be
                                            discarded directly without further processing.

                  Usage                     This feature automatically takes effect and is enabled
                                            by default.

                  Constraints/              None
                  Limitations
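
The deferred handling described for SYSTEM_04 can be sketched as follows. The "corrupt" flag, the Data class, and the destination names are assumptions used to model the behavior; they are not the hardware interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Data:
    value: int
    corrupt: bool = False  # marks data that carries an uncorrected error


def pass_through_uncore(data: Data) -> Data:
    """Uncore components forward corrupt data unchanged instead of crashing."""
    return data


def deliver(data: Data, destination: str) -> str:
    """Decide the outcome once the data reaches its destination."""
    if not data.corrupt:
        return "use"
    if destination == "external_device":
        return "discard"   # for example, one wrong pixel; no recovery needed
    return "irq_to_os"     # data reached the core cache; the OS recovers or retries


bad = pass_through_uncore(Data(value=7, corrupt=True))
outcome_core = deliver(bad, "core_cache")
outcome_device = deliver(bad, "external_device")
```

The error is acted on only where the data is consumed, so corrupt data that is never used by a core does not interrupt services.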

                  Feature ID                SYSTEM_05

                  Feature                   Socket Disable for FRB

                  Description               If a socket has failed or cannot be directly or indirectly
                                            connected to the PCH due to a UPI fault, the system
                                            isolates the socket. Compared with a normal complete
                                            CPU interconnection topology, the topology
                                            downgrades and the system starts with fewer CPUs.

                  Category                  Reliability and availability

                  Customer Benefit/         When a CPU socket or the UPI bus is faulty, the
                  Application Scenario      system still starts with the remaining CPUs,
                                            maximizing CPU availability.

                  Usage                     This feature automatically takes effect and cannot be
                                            disabled.

                  Constraints/              None
                  Limitations
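
The isolation rule for SYSTEM_05 (keep only sockets that can reach the PCH directly or indirectly) amounts to a reachability check over healthy UPI links. The sketch below assumes a hypothetical four-socket ring topology with the PCH attached to socket 0; the actual 2488H V6 topology and firmware logic may differ.

```python
from collections import deque


def reachable_sockets(links, failed_sockets=(), failed_links=(), pch_socket=0):
    """Return the sockets that can reach the PCH-attached socket over healthy links."""
    if pch_socket in failed_sockets:
        return set()  # assumption: nothing can boot without the PCH-attached socket
    failed = {frozenset(link) for link in failed_links}
    adjacency = {}
    for a, b in links:
        if frozenset((a, b)) in failed or a in failed_sockets or b in failed_sockets:
            continue  # skip faulty links and links that touch a failed socket
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    # Breadth-first search outward from the PCH-attached socket.
    seen, queue = {pch_socket}, deque([pch_socket])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]  # hypothetical UPI ring
one_link_down = reachable_sockets(ring, failed_links=[(1, 2)])
two_links_down = reachable_sockets(ring, failed_links=[(1, 2), (2, 3)])
```

With one link down, socket 2 is still reached indirectly through socket 3 and all four sockets start. With both of its links down, socket 2 is isolated and the system starts with three CPUs.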

                  Feature ID                SYSTEM_06

                  Feature                   Advanced Error Detection and Correction (AEDC)
