Fabric Resiliency Best Practices - DATA CENTER
DATA CENTER Fabric Resiliency Best Practices Provides best practices for deploying and utilizing advanced Brocade Fabric OS (FOS) features to identify, monitor, and protect Fibre Channel (FC) SANs from problematic device and media behavior.
DATA CENTER BEST PRACTICES GUIDE

CONTENTS
Introduction
Fabric Resiliency
Maintaining an Optimal FC SAN Environment
  Bottleneck Detection
  Fabric Watch/Port Fencing
  Edge Hold Time Configuration
Device Latencies
  Assessing Device Latency Severity
    Moderate Device Latencies
    Severe Device Latencies
  Latency Detection
    Timeout Notification on F_Ports
  Latency Mitigation Action
    Mitigation Action Based on Bottleneck Detection
    Action Based on Timeouts
    Fabric Configuration
Faulty Media
  Automatically Detecting and Mitigating Faulty Media
    Fabric Watch
    Bottleneck Detection
Summary of Recommendations
Appendix A: Configuring Bottleneck Detection
  Enabling/Disabling Bottleneck Detection
  Displaying List of Ports with Bottleneck Detection Enabled
  Changing Bottleneck Detection Settings on a Port
  Displaying the History of Bottlenecks on a Port
  Bottleneck Alerts
Appendix B: Configuring Bottleneck Detection Mitigation Behavior
Appendix C: Configuring Port Fencing
  Port Fencing Threshold Recommendations
Appendix D: Configuring Edge Hold Time
INTRODUCTION

Faulty or improperly configured devices, misbehaving hosts, and faulty or substandard Fibre Channel (FC) media can significantly impact the performance of FC fabrics and the applications they support. In most real-world scenarios, these issues cannot be corrected or completely mitigated within the fabric itself; the behavior must be addressed directly at its source. However, with the proper knowledge and capabilities, the fabric can often identify and in some cases mitigate or protect against the effects of these “misbehaving” components to provide better fabric resiliency. This document provides a high-level description of the most commonly experienced detrimental device behaviors and how to use Brocade® products and features to protect your data center.

Brocade has a long history of developing industry-leading hardware and software. Brocade also dedicates a significant amount of engineering development and resources to providing features and functionality to assist the SAN administrator in monitoring the fabric and troubleshooting problems. Capabilities such as Advanced Performance Monitoring and Fabric Watch focus specifically on these areas.

As FC fabrics have evolved in size, capabilities, and complexity, Brocade has responded with more advanced monitoring and troubleshooting features. In Fabric OS (FOS®) 6.1, Brocade introduced Port Fencing as part of the optional Fabric Watch offering. In FOS 6.3, Brocade added a new set of base features referred to as Bottleneck Detection. This was extended in FOS 6.4 with broader monitoring, improved configuration, and detection capabilities for additional types of bottlenecks.

Although certain aspects of today’s data centers are common to most environments, no two data centers are exactly alike, and no “one size fits all” set of configuration parameters applies universally to all environments.
Brocade has worked directly with customers to develop recommendations and guidelines for most environments. However, you should always validate all recommendations for your particular needs. Brocade also offers extensive Professional Services to assist you with tuning and optimizing all of the features discussed in this document customized for your data center. For details, visit: http://www.brocade.com/services-support/professional-services/index.page.

Further details on all of the features described in this document can be found in the following product documents:
• Fabric OS 6.4 Administrator’s Guide (53-1001763-01)
• Fabric OS 6.4 Command Reference Manual (53-1001764-01)
• Fabric OS 6.4 Fabric Watch Administrator’s Guide (53-1001770-01)

It is assumed that readers of this document are already familiar with basic functionality and how to use features such as Port Fencing, Bottleneck Detection, and Fabric Watch.
FABRIC RESILIENCY

Two primary aspects of fabric resiliency are captured in this document:
1. Detecting “abnormal behavior” in external components (typically servers/hosts or storage devices) that can negatively impact the fabric so that you can identify and fix the faulty device.
2. Providing mechanisms that protect the fabric from adverse effects caused by a faulty component, including one or more actions that can be invoked automatically by a switch when faulty behavior is detected to contain and isolate the impact of the misbehaving component in the fabric. This should be considered a temporary measure: ultimately, the faulty or improperly configured component must be addressed to resolve the problem completely and permanently.

There are two common classes of abnormal behavior originating from fabric components:
• Misbehaving high-latency end devices (hosts or storage): End devices that do not respond as quickly as expected and cause the fabric to hold frames for excessive periods of time. This can result in application performance degradation or in extreme cases even I/O failure. Common examples of moderate device latency include disk arrays that are overloaded and hosts that cannot process data as fast as they request it. Severe latencies are caused by badly misbehaving devices that stop receiving, accepting, or acknowledging frames for excessive periods of time.
• Faulty media (fiber optic cables and SFPs/optics): Faulty media can cause frame loss due to excessive CRC errors, invalid transmission words, and other conditions. This may result in I/O failure and application performance degradation.

Note that FC switches cannot correct bad node behavior or faulty media; they can only attempt to alert and compensate for it. Ultimately the problems must be addressed in the host or target devices or media where they actually occur.
MAINTAINING AN OPTIMAL FC SAN ENVIRONMENT

While there are many features available in FOS to assist with monitoring, protecting, and troubleshooting fabrics, several recent enhancements have been implemented that focus exclusively on this area. This document focuses specifically on those newer features and related capabilities that help provide optimum fabric resiliency. Most are available and supported on the majority of 4 Gigabit per second (Gbps) and 8 Gbps platforms, provided that the most recent FOS releases are used. (Visit http://my.brocade.com or consult with your vendor for the latest supported FOS releases.) Some features may require optional licensing.

This section discusses these features, minimum release levels, licensing requirements, and platform limitations. Brocade highly recommends that you review the additional documentation listed earlier to understand all of the tools available for maintaining an FC SAN environment. Be sure also to read the FOS Release Notes.

NOTE: In order to use all of the capabilities described in this document, switches need to be running FOS 6.4.0 or later.

Bottleneck Detection
Bottleneck Detection was introduced in FOS 6.3.0 with monitoring for device latency conditions and then enhanced in FOS 6.4.0 with added support for congestion detection on both E_Ports and F_Ports. FOS 6.4 also added improved reporting options and simplified configuration capabilities. The FOS 6.3.1b release (and later) included an enhancement to the algorithm for detecting device latency, making it more accurate. Bottleneck Detection does not require a license and is supported on both 4 and 8 Gbps platforms.
Fabric Watch/Port Fencing
Fabric Watch is an optional (licensed) feature that was enhanced in FOS 6.1.0 with the addition of Port Fencing. This capability allows a switch to monitor specific behaviors and protect a switch by blocking a port when specified thresholds have been reached.

Edge Hold Time Configuration
Edge Hold Time configuration is a new capability added in the FOS 6.3.1b release. However, it is not documented in the FOS 6.3 or FOS 6.4 Command Reference. See Appendix D for details on its use. There is no license required to configure the Edge Hold Time setting.

DEVICE LATENCIES

A device experiencing latencies responds more slowly than expected. The device does not return buffer credits (through R_RDY primitives) to the transmitting switch fast enough to support the offered load, even though the offered load is less than the maximum physical capacity of the link connected to the device, as shown in Figure 1.

Figure 1. Buffer backup on ingress port 6 on B1 causes congestion upstream on S1, port 3

Once it exhausts all available credits, the switch port connected to the device needs to hold additional outbound frames until a buffer credit is returned by the device. When a device isn’t responding in a timely fashion, the transmitting switch is forced to hold frames for longer periods of time, resulting in high buffer occupancy. This in turn results in the switch lowering the rate at which it returns buffer credits to other transmitting switches. This effect propagates from switch to switch (potentially reaching multiple switches whose devices are attempting to send frames to devices attached to the switch with the high-latency device) and ultimately impacts the fabric.

Figure 2. Latency on a switch can propagate through the fabric
[Diagram of hosts, Switch A, Switch B, and storage arrays, with these callouts:
1. Buffer credits exhausted
2. Backflow of credit exhaustion to ISL port on Switch B
3. Continuing backflow of credit exhaustion depleting credits on ISL port on Switch A
4. All servers using ISL impacted
5. Connection to second storage device now at risk]
NOTE: The impact to the fabric (and other traffic flows) varies based on the severity of the latency exhibited by the device. The longer the delay caused by the device in returning credits to the switch, the more severe the problem.

Assessing Device Latency Severity

Moderate Device Latencies
Moderate device latencies are defined as those not severe enough to cause frame loss. If the time between successive credit returns by the device is between a few hundred microseconds and tens of milliseconds, then the device exhibits moderate latencies, since this delay is typically not enough to cause frame loss (frame loss typically occurs above 100 ms). This lowers the performance of traffic flows using the fabric, and applications may see a drop in performance, but it typically does not cause frame drops or I/O failures. However, the higher the latency, the greater the chance that an end user will experience degraded performance.

Severe Device Latencies
Severe device latencies result in frame loss, which triggers the host SCSI stack to detect failures and to retry I/Os. This process can take tens of seconds (possibly as long as 30 to 60 seconds), which can cause a very noticeable application delay and potentially result in application errors. If the time between successive credit returns by the device is in excess of 100 milliseconds, then the device is exhibiting severe latency.

When a device exhibits severe latency, the switch is forced to hold frames for excessively long periods of time (on the order of hundreds of milliseconds). When this time becomes greater than the established timeout threshold, the switch drops the frame (per Fibre Channel standards). Frame loss in switches is also known as C3 (Class 3) discards or timeouts.
Since the effect of device latencies often spreads through the fabric, frames can be dropped due to timeouts not just on the F_Port to which the misbehaving device is connected but also on E_Ports carrying traffic to the F_Port. Dropped frames typically cause I/O errors that result in a host retry and can result in significant decreases in application performance. The impact is compounded because frame drops on the affected F_Port result in I/O failures to the misbehaving device (which would be expected), while frame drops on E_Ports may also cause I/O failures for unrelated traffic flows involving other hosts (which would not typically be expected).

Latency Detection

Bottleneck Detection on F_Ports
It is a recommended best practice to use the Bottleneck Detection feature to detect devices that exhibit latency. Bottleneck Detection is a comprehensive feature that can be used to detect a wide range of device latencies from mild to severe. See “Appendix A: Configuring Bottleneck Detection” for details on how to enable Bottleneck Detection.

Once Bottleneck Detection is enabled, the switch monitors F_Ports for latency symptoms. Specifically, it looks for conditions in which the time delay between successive buffer credit returns from a device is higher than expected. When the condition is detected, Bottleneck Detection reports latency bottlenecks at F_Ports based on configurable thresholds. These reports can then be leveraged to:
• Determine the severity and duration of the latency behavior
• Determine the specific device port on which device latencies are occurring
• Determine the actual device latency in the range of 100 microseconds to hundreds of milliseconds
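The severity bands described under Assessing Device Latency Severity can be expressed as a small classifier. The following Python sketch is illustrative only: the 100 ms boundary comes from the text, while the 100 microsecond floor, the function name, and the "normal" band are assumptions made for the example and are not part of FOS.

```python
from enum import Enum

# Boundary taken from the text: frame loss typically occurs above 100 ms.
SEVERE_THRESHOLD_US = 100_000
# Assumed floor: the text says reported latencies start around 100 microseconds.
DETECTION_FLOOR_US = 100


class Severity(Enum):
    NORMAL = "normal"
    MODERATE = "moderate"
    SEVERE = "severe"


def classify_credit_return_gap(gap_us: float) -> Severity:
    """Classify the time between successive credit returns, in microseconds."""
    if gap_us < 0:
        raise ValueError("gap_us must be non-negative")
    if gap_us > SEVERE_THRESHOLD_US:
        return Severity.SEVERE      # frame loss (C3 discards) becomes likely
    if gap_us >= DETECTION_FLOOR_US:
        return Severity.MODERATE    # performance drop, typically no frame loss
    return Severity.NORMAL
```

For example, a gap of 20,000 microseconds (20 ms) falls in the moderate band, while 150,000 microseconds (150 ms) is classified as severe.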
Timeout Notification on F_Ports
It is a recommended best practice to use Fabric Watch to detect frame timeouts, that is, frames that have been dropped because of severe latency conditions (the Fabric Watch “C3TX_TO” area, available in FOS 6.3 for 8 Gbps ports and in FOS 6.3.1b/6.4.0 and later for 4 Gbps ports). If the number of timed-out frames on an F_Port exceeds the currently effective threshold settings, Fabric Watch can notify the user through one of the following mechanisms:
• Send an SNMP trap
• Log a RASlog message
• Send an e-mail alert
• Log a SYSlog message

Latency Mitigation Action

Mitigation Action Based on Bottleneck Detection
FOS 6.4.0 and later includes an enhancement to Bottleneck Detection that allows the switch to provide some fabric-level mitigation when device latency is detected but Port Fencing thresholds have not yet been reached. When latency is detected on a port, frames held in the transmit port connected to the misbehaving device are dropped for a short period of time. This allows the switch to return credits to other transmitting switches, thus allowing other traffic flows to move at a faster rate. This action protects other flows from a severe performance drop resulting from a single misbehaving device. If a misbehaving device continues to exhibit latencies for several seconds, the port is disabled via Port Fencing (if Port Fencing has been enabled and configured). See “Appendix B: Configuring Bottleneck Detection Mitigation Behavior” for information on configuring the Bottleneck Detection mitigation capability. Note that this capability is not documented in any existing FOS product documentation.

Action Based on Timeouts
Quarantining a misbehaving high-latency device based on timeouts: When a misbehaving device exhibits extremely high latencies causing frame timeouts, it is likely also causing a severe fabric impact and should be removed from the fabric.
Port Fencing based on timeouts is another optional feature that can be used to quarantine a high-latency device and mitigate the impact on the fabric (8 Gbps platform support available in FOS 6.3 and later, 4 Gbps platform support available in FOS 6.3.1b and later). Brocade recommends enabling Port Fencing for transmit timeouts on F_Ports. See “Appendix C: Configuring Port Fencing” for details on how to enable Fabric Watch Port Fencing. When the number of frames dropped due to timeouts on an F_Port reaches a user-configured threshold, the port is fenced (blocked). This disables the port, requiring user intervention to bring it back online. Once the F_Port of the offending device is fenced, no further actions are required. The default/recommended threshold settings noted in Appendix C can safely disable the misbehaving device, preventing an impact to the fabric without causing a false trigger (fencing a port when there isn’t a high-latency device).

Reducing timeouts on unrelated flows: Per Fibre Channel standards, frames are dropped in switches if they’ve been held in the switch buffers for longer than the established Hold Time, a value calculated from several configurable fabric parameters. Unless any of these fabric parameters (R_A_TOV, E_D_TOV, WAN_TOV, or MAX_HOPs) have been changed from their defaults, the Hold Time is calculated to be 500 ms. In most environments, fabric parameters on all switches in a fabric should match, and thus the Hold Time should be consistent throughout a fabric. When congestion conditions cause frames to drop in the core of the fabric, where there tend to be more flows/traffic, there will be more disruption. In order to reduce frame drops on E_Ports on core switches, the edge switches that host the end server/storage devices can be configured to have a shorter Hold Time compared to the core switches by using the Edge Hold Time feature (available in FOS 6.3.1b and later). This setting lowers the Hold Time on the edge of the network, which reduces the likelihood of frame loss on the core of the network, effectively mitigating the impact of the misbehaving device.

It is a recommended best practice to enable the Edge Hold Time feature. See “Appendix D: Configuring Edge Hold Time” for details on how to enable the Edge Hold Time feature. Note that enabling and configuring the Edge Hold Time is a non-disruptive operation.

Fabric Configuration
Fabrics can be architected to mitigate some impacts of device latency. Isolating the device flows (host/storage pair) that exhibit high latencies by either putting them in their own fabric or on their own blade/switch will contain the impact of the latencies to the fabric or blade/switch containing the high-latency device flows. Features such as Integrated Routing (Fibre Channel Routing) and local switching provide architectural-level solutions that limit the need for more complex monitoring and mitigation capabilities. However, using fabric design as a protection mechanism does require some knowledge of which devices are likely to exhibit latency.

FAULTY MEDIA

In addition to high-latency devices causing disruptions to data centers, fabric problems are often the result of faulty media. Faulty media can include bad cables, SFPs, extension equipment, receptacles, patch panels, improper connections, and so on. Media can fault on any port type (E_Port or F_Port) and fail, often unpredictably and intermittently, making it even harder to diagnose. Faulty media involving F_Ports results in an impact to the end device attached to the F_Port and to devices communicating with this device.

Failures on E_Ports can have an even greater impact. Many flows (host/target pairs) can simultaneously traverse a single E_Port.
In large fabrics, this can be hundreds or even thousands of flows. In the event of a media failure involving one of these links, it is possible to disrupt some or all of the flows utilizing the path.

Severe cases of faulty media, such as a disconnected cable, can result in a complete failure of the media, which effectively brings a port offline. This is typically easy to detect and identify. When this occurs on an F_Port, the impact is specific to flows involving the F_Port. E_Ports are typically redundant, so severe failures on E_Ports typically result in only a minor drop in bandwidth as the fabric automatically utilizes redundant paths. Error reporting built into FOS readily identifies the failed link and port, allowing for simple corrective action and repair.

With moderate cases of faulty media, failures occur but the port can remain online or transition between online and offline. This can cause repeated errors, which can occur indefinitely or until the media fails completely. When these types of failures occur on E_Ports, the result can be devastating, as there can be repeated errors that impact many flows. This can result in significant impacts to applications that last for prolonged durations. Signatures of these types of failures include the following:
• CRC errors on frames
• Invalid Words (includes encoder out errors)
• State Changes (ports going offline/online repeatedly)
• Credit loss: Complete loss of credit on a VC on an E_Port prevents traffic from flowing on that VC, which results in frame loss and I/O failures for devices utilizing the VC
Automatically Detecting and Mitigating Faulty Media

Fabric Watch
It is a recommended best practice to enable Fabric Watch to monitor for CRC errors, Invalid Words, and State Changes. It is recommended to configure for alerts on reaching low thresholds and to fence (disable) a port when reaching high thresholds. See “Appendix C: Configuring Port Fencing” for details on how to enable and configure Fabric Watch Port Fencing.

Fabric Watch monitoring. Fabric Watch monitors can be enabled to automatically detect most of the faulty media conditions previously noted. For example, Fabric Watch can monitor CRC errors (available in FOS 6.1.x), Invalid Words (available in FOS 6.1.x), and State Changes (ports transitioning between offline and online, available in FOS 6.3). Fabric Watch generates alerts based on user-defined thresholds for these conditions.

The most common cause of credit loss is corruption of credit return messages (VC_RDY or R_RDY) due to faulty media. Credit corruption is tracked by an encoder out error, which is an Invalid Word error. Monitoring and mitigating Invalid Word issues protects against credit loss.

Fabric Watch quarantine. Fabric Watch also provides a mechanism that quarantines the badly behaving component with the optional action of Port Fencing. Port Fencing is available for each of the previously noted conditions and is recommended to automatically protect the fabric from these error conditions. The recommended thresholds specified in “Appendix C: Configuring Port Fencing” have been tested and tuned to quarantine components that are misbehaving to the point at which they are likely to cause a fabric-wide impact. They do not falsely trigger on normally behaving components.

Bottleneck Detection
The Bottleneck Detection feature can detect different types of congestion in a fabric. Lost buffer credits can result in extreme congestion by slowing the aggregate throughput of a connection.
Bottleneck Detection can detect ports that are blocked due to lost credits and generate special “stuck VC” and “lost credit” alerts for the E_Port with the lost credits (available in FOS 6.3.1b and later). Bottleneck Detection can also generate alerts on downstream E_Ports that are also blocked due to an upstream congestion condition such as an E_Port with lost credits or a high-latency device. See “Appendix A: Configuring Bottleneck Detection” for best practice recommendations on configuring and using this feature.

SUMMARY OF RECOMMENDATIONS

The following are recommended features and capabilities to improve the overall resiliency of Brocade FOS-based FC fabric environments:
• Enable Fabric Watch to detect frame timeouts (the Fabric Watch “C3TX_TO” area).
• Enable Port Fencing for transmit timeouts on F_Ports.
• Enable the Edge Hold Time feature.
• Enable Fabric Watch to monitor (alert) for CRC errors, Invalid Words, and State Changes and fence on extreme behavior.
• Enable Edge Hold Time in core/edge configurations.
• Enable Bottleneck Detection for congestion conditions.
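The Fabric Watch recommendation above (alert on reaching a low threshold, fence the port on reaching a high threshold) follows a simple two-tier pattern. The Python sketch below illustrates only that decision logic; the class name, the action strings, and any threshold numbers used with it are hypothetical placeholders, not the recommended values from Appendix C and not a Fabric Watch API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoTierThreshold:
    """Hypothetical low/high thresholds for one error counter per interval."""
    low: int    # reaching this value raises an alert
    high: int   # reaching this value fences (disables) the port

    def __post_init__(self):
        if not 0 < self.low < self.high:
            raise ValueError("expected 0 < low < high")

    def action_for(self, count: int) -> str:
        """Return 'fence', 'alert', or 'none' for an observed error count."""
        if count >= self.high:
            return "fence"
        if count >= self.low:
            return "alert"
        return "none"
```

With placeholder thresholds of low=5 and high=20, a count of 7 would produce an alert and a count of 20 would fence the port. Real thresholds should be taken from the vendor recommendations and validated for the environment.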
APPENDIX A: CONFIGURING BOTTLENECK DETECTION

Enabling/Disabling Bottleneck Detection
When Bottleneck Detection is enabled, RASlog alerts can also be enabled to be sent when the bottleneck conditions at a port exceed a specified threshold.

On the switch with target port connections, log in with admin level privileges. Enter bottleneckmon --enable to enable Bottleneck Detection on an F_Port or FL_Port.

    bottleneckmon --enable [ -alert ] [ -thresh threshold ] [ -time window ] [ -qtime quiet_time ] [slot/]portlist [[slot/]portlist]...

If the alert parameter is not specified, alerts are not sent, but a history of bottleneck conditions for the port can be viewed. The thresh, time, and qtime parameters are also ignored if the alert parameter is not specified.

It is a recommended best practice to use the default values for the thresh (0.1), time (300), and qtime (300) parameters. If the time parameter is changed, it should be set to 300 or higher.

Example of enabling bottleneck detection (preferred use case):
The following example enables bottleneck detection on all F_Ports and FL_Ports in the switch with RASlog alerts using default values for threshold and time. Alerts are logged when a port is experiencing a bottleneck condition for 10% of the time (default value) over any period of 300 seconds (default value) with a minimum of 300 seconds (default value) between alerts.

    switch:admin> bottleneckmon --enable -alert *

The following example enables bottleneck detection on ports 3–7 using default values for threshold and time. No alerts will be delivered to report bottleneck conditions, but the bottleneck history can be viewed using the CLI.

    switch:admin> bottleneckmon --enable 3-7

To disable bottleneck detection:
1. Connect to the switch to which the target port belongs and log in as admin.
2. Enter bottleneckmon --disable to disable bottleneck detection on a port.
Example of disabling bottleneck detection on port 3:

    switch:admin> bottleneckmon --disable 3
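To make the interaction of the thresh, time, and qtime parameters concrete, the following Python sketch models the alert rule as described above: an alert fires when the fraction of affected 1-second intervals in the window reaches the threshold, and no more often than once per quiet time. It is an illustration under stated assumptions, not the algorithm the switch implements. The class name is hypothetical, and the use of a sliding window with a fixed denominator is an assumption.

```python
from collections import deque


class BottleneckAlerter:
    """Sliding-window alert logic; an illustration, not the switch's algorithm."""

    def __init__(self, thresh: float = 0.1, window: int = 300, quiet_time: int = 300):
        self.thresh = thresh
        self.window = window
        self.quiet_time = quiet_time
        self._samples = deque(maxlen=window)  # most recent 1-second samples
        self._affected = 0                    # affected seconds currently in the window
        self._now = 0                         # seconds observed so far
        self._last_alert = None               # time of the previous alert, if any

    def observe(self, affected: bool) -> bool:
        """Record one 1-second sample; return True if an alert would be logged."""
        if len(self._samples) == self.window:
            self._affected -= self._samples[0]  # oldest sample is about to be evicted
        self._samples.append(int(affected))
        self._affected += int(affected)
        self._now += 1

        over_threshold = self._affected / self.window >= self.thresh
        quiet_elapsed = (
            self._last_alert is None
            or self._now - self._last_alert >= self.quiet_time
        )
        if over_threshold and quiet_elapsed:
            self._last_alert = self._now
            return True
        return False
```

With the default values (0.1, 300, 300), a port that is affected continuously triggers its first alert in this model after 30 affected seconds, since 30 of 300 seconds is 10% of the window.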
Displaying List of Ports with Bottleneck Detection Enabled
1. Connect to the switch to which the target ports belong and log in as admin.
2. Enter bottleneckmon --status to display the ports on which Bottleneck Detection is enabled.

NOTE: When using Virtual Fabrics, the output displays ports that do not belong to the logical switch if the ports were moved out of the logical switch after Bottleneck Detection was enabled on them.

    switch:admin> bottleneckmon --status
    Port   Alerts?   Threshold   Time (s)   Quiet Time (s)
    ======================================================
    3      N         --          --         --
    4      Y         0.100       300        300
    5      Y         0.100       300        300
    6      N         --          --         --

Changing Bottleneck Detection Settings on a Port
The default settings for Bottleneck Detection are the recommended settings. The settings are configurable in the event that a user has specific reasons for modifying them, but in most cases, the default settings should not be changed. One reason to change the defaults is an environment in which transient events that cause moderate congestion are considered normal. Increasing the time or threshold may accommodate such events.

Using the procedure described below, RASlog alerts can be enabled or disabled along with configuration of the following settings:
• Threshold: the percentage of 1-second intervals required to generate an alert
• Time: the time window in seconds in which bottleneck conditions are monitored and compared against the threshold
• Quiet Time options

Bottleneck Detection must be disabled on a port before any of the settings can be modified.

To change settings on a port:
1. Connect to the switch to which the target port belongs and log in as admin.
2. Enter bottleneckmon --disable to disable Bottleneck Detection on the port.
3. Enter bottleneckmon --enable to enable Bottleneck Detection, specify the new threshold values, and set the alert option.

The following example changes the Bottleneck Detection settings on port 4.
In this example, the bottleneckmon --status commands show the before and after settings.

    switch:admin> bottleneckmon --status
    Port    Alerts?    Threshold    Time (s)    Quiet Time (s)
    ==========================================================
    4       Y          0.800        300         300

    switch:admin> bottleneckmon --disable 4
    switch:admin> bottleneckmon --enable -thresh 0.6 -time 420 4
    switch:admin> bottleneckmon --status
    Port    Alerts?    Threshold    Time (s)    Quiet Time (s)
    ==========================================================
    4       Y          0.600        420         300
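As a rough illustration of checking such before-and-after settings programmatically, the following Python sketch parses captured bottleneckmon --status text into per-port settings. It is not part of Fabric OS: the function name parse_status and the dictionary keys are hypothetical, and the parser assumes the column layout of the listings above (plain numeric port numbers, with -- for unset values).

```python
# Minimal sketch: parse captured "bottleneckmon --status" text.
# Assumptions (not from Fabric OS documentation): five whitespace-separated
# columns per data row, numeric port numbers, "--" for unset values.

SAMPLE = """\
switch:admin> bottleneckmon --status
Port Alerts? Threshold Time (s) Quiet Time (s)
==============================================
3 N -- -- --
4 Y 0.600 420 300
"""


def _num(value, cast):
    """Convert a column value, mapping the '--' placeholder to None."""
    return None if value == "--" else cast(value)


def parse_status(text):
    """Return {port: settings} for every data row found in the text."""
    ports = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the prompt, header, and separator lines.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        port, alerts, thresh, time_s, quiet_s = fields
        ports[int(port)] = {
            "alerts": alerts == "Y",
            "threshold": _num(thresh, float),
            "time_s": _num(time_s, int),
            "quiet_time_s": _num(quiet_s, int),
        }
    return ports


if __name__ == "__main__":
    for port, settings in sorted(parse_status(SAMPLE).items()):
        print(port, settings)
```

A script like this could confirm that a change such as -thresh 0.6 -time 420 actually took effect on the intended port. Chassis-style slot/port identifiers would need a different port pattern.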
Displaying the History of Bottlenecks on a Port

Use bottleneckmon --show to display a 3-hour history of bottleneck conditions for an individual port:

1. Connect to the switch to which the target port belongs and log in as admin.
2. Enter the bottleneckmon --show command to display a history of the bottleneck severity for a specific port.

Example of displaying the bottleneck history for port 3 in 5-second windows over a period of 30 seconds:

    fcr_saturn1:root> bottleneckmon --show -interval 5 -span 30 3
    =============================================================
    Mon Jun 15 18:54:35 UTC 2010
    =============================================================
    From               To                 affected secs
    =============================================================
    Jun 15 18:54:30    Jun 15 18:54:35    80.00%
    Jun 15 18:54:25    Jun 15 18:54:30    40.00%
    Jun 15 18:54:20    Jun 15 18:54:25    0.00%
    Jun 15 18:54:15    Jun 15 18:54:20    0.00%
    Jun 15 18:54:10    Jun 15 18:54:15    20.00%
    Jun 15 18:54:05    Jun 15 18:54:10    80.00%

Bottleneck Alerts

Example of a bottleneck detection alert on an F_Port:

    2010/03/16-03:40:47, [AN-1003], 21760, FID 128, WARNING, sw0, Latency bottleneck at slot 0, port 38. 100.00 percent of last 300 seconds were affected. Avg. time b/w transmits 80407.3975 us.

Example of a "stuck VC" alert on an E_Port:

    2010/03/16-03:40:48, [AN-1010], 21761, FID 128, WARNING, sw0, Severe latency bottleneck detected at slot 0 port 38.

Example of a lost credit alert on an E_Port:

    2010/03/15-22:50:41:187781, [C2-5021], 2/0, SLOT 4 | CHASSIS, WARNING, ?, S1,P0(16): Link Timeout, ftx=83276 tov=2000 (>1000), OID:0x43128010, c2_ops.c, line: 4058, comp:insmod, ltime:2010/03/15-22:50:41:187778.

NOTE: This is an internal RASlog message.
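To relate the --show history to the alert threshold, the following Python sketch computes the overall fraction of affected seconds from the six 5-second rows in the history example above and compares it with a threshold. It is only an illustration: the names parse_history and affected_fraction are hypothetical, the rows are assumed to be equal-width windows, and the 0.1 threshold is taken from the --status listings in this guide rather than from any device.

```python
import re

# Rows copied from the 30-second history example (From, To, affected secs).
HISTORY = """\
Jun 15 18:54:30 Jun 15 18:54:35 80.00%
Jun 15 18:54:25 Jun 15 18:54:30 40.00%
Jun 15 18:54:20 Jun 15 18:54:25 0.00%
Jun 15 18:54:15 Jun 15 18:54:20 0.00%
Jun 15 18:54:10 Jun 15 18:54:15 20.00%
Jun 15 18:54:05 Jun 15 18:54:10 80.00%
"""

# Assumed row layout: two "Mon DD HH:MM:SS" timestamps, then a percentage.
STAMP = r"\w{3} +\d+ \d\d:\d\d:\d\d"
ROW = re.compile(rf"^({STAMP})\s+({STAMP})\s+(\d+(?:\.\d+)?)%$")


def parse_history(text):
    """Return a list of (start, end, affected_fraction) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((match.group(1), match.group(2),
                         float(match.group(3)) / 100.0))
    return rows


def affected_fraction(rows):
    """Average the per-window fractions (valid for equal-width windows)."""
    if not rows:
        return 0.0
    return sum(fraction for _, _, fraction in rows) / len(rows)


if __name__ == "__main__":
    rows = parse_history(HISTORY)
    overall = affected_fraction(rows)
    threshold = 0.1  # value shown in the --status examples
    print(f"affected fraction: {overall:.4f}, exceeds threshold: {overall > threshold}")
```

For these rows, 4 + 2 + 0 + 0 + 1 + 4 = 11 of the 30 seconds were affected, a fraction of about 0.37, which is above a 0.1 threshold.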
APPENDIX B: CONFIGURING BOTTLENECK DETECTION MITIGATION BEHAVIOR

To enable Bottleneck Detection-based mitigation action on all F_Ports in a switch, use:

    bottleneckmon --enable -act

To enable/disable mitigation action after enabling Bottleneck Detection, use:

    bottleneckmon --config -act
    OR
    bottleneckmon --config -noact

To enable/disable mitigation action after enabling Bottleneck Detection for a specific port, use:

    bottleneckmon --config -act
    OR
    bottleneckmon --config -noact

All F_Ports with Bottleneck Detection enabled and the -act flag set are subject to mitigation action. Ports excluded from Bottleneck Detection (using the --exclude operation) are also excluded from mitigation action.
APPENDIX C: CONFIGURING PORT FENCING

Use the portFencing CLI command to enable error reporting for the Fabric Watch Port Fencing feature on all ports of a specified type and to configure the ports to report errors for a specific area. Supported port types include E_Ports, F_Ports, and physical ports. A specified port type can be configured to report errors for one or more areas.

Port Fencing monitors ports for erratic behavior and disables a port if specified error conditions are met. The portFencing CLI command enables or disables the Port Fencing feature for an area of a class. You can customize or tune the threshold of an area using the portThConfig CLI command.

Use portFencing to configure Port Fencing for C3_TX_TO. For example:

    portfencing --enable fop-port -area C3TX_TO

The same command can be used to configure Port Fencing on link reset. For example:

    portfencing --enable fop-port -area LR

Use portThConfig to customize Port Fencing thresholds:

    switch:admin> portthconfig --set port -area crc -highthreshold -value 2 -trigger above -action email
    switch:admin> portthconfig --set port -area crc -highthreshold -trigger below -action email
    switch:admin> portthconfig --set port -ar crc -lowthreshold -value 1 -trigger above -action email
    switch:admin> portthconfig --set port -ar crc -lowthreshold -trigger below -action email

To apply the new custom settings so they become effective:

    switch:admin> portthconfig --apply port -area crc -action cust -thresh_level custom

To display the port threshold configuration for all port types and areas:

    switch:admin> portthconfig --show

Port Fencing Threshold Recommendations

Port Fencing threshold recommendations for three areas are shown in the following table.

    Area            Recommended Threshold
    Link Reset      5
    State Change    7
    TX_TO           5
Threshold recommendations for CRC errors and Invalid Words

CRC errors and Invalid Words can occur on normal links. They have also been known to occur during certain transitions, such as server reboots. When these errors occur more frequently, they can cause a severe impact. While most systems can tolerate infrequent CRC errors or Invalid Words, other environments can be sensitive to even infrequent instances. The overall quality of the fabric interconnects is also a factor.

When establishing thresholds for CRC errors and Invalid Words, consider the following. In general, "cleaner" interconnects can have lower thresholds, as they should be less likely to introduce errors on the links. Moderate (recommended), conservative, and aggressive threshold recommendations are provided in the table below.

After selecting the type of thresholds for an environment, set the low threshold with an action of ALERT (RASlog, e-mail, SNMP trap). The alert is triggered whenever the low threshold is exceeded. Set the high threshold with an action of Fence. The port is fenced (disabled) whenever the high threshold is exceeded. Aggressive threshold suggestions do not include a low setting; they provide only the high value that triggers the fencing action.

    Area            Moderate/Recommended    Aggressive    Conservative
                    Threshold               Threshold     Threshold
    CRC             Low 5, High 20          High 2        Low 5, High 40
    Invalid Word    Low 25, High 40         High 25       Low 25, High 80
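The low/high scheme described above can be sketched as a small decision function. The Python below is an illustration only, not Fabric Watch behavior: the names THRESHOLDS and action_for are hypothetical, the values assume the table reads as CRC moderate Low 5/High 20, aggressive High 2, conservative Low 5/High 40 (and Invalid Word 25/40, 25, 25/80), and "exceeded" is assumed to mean strictly greater than the threshold within whatever counting period the switch uses.

```python
# Threshold profiles as (low, high); None means no low threshold is defined.
# Values are an assumed reading of the recommendation table in this guide.
THRESHOLDS = {
    "moderate": {"CRC": (5, 20), "Invalid Word": (25, 40)},
    "aggressive": {"CRC": (None, 2), "Invalid Word": (None, 25)},
    "conservative": {"CRC": (5, 40), "Invalid Word": (25, 80)},
}


def action_for(profile, area, count):
    """Return 'fence', 'alert', or 'none' for an error count.

    Assumption: a threshold is exceeded when count is strictly greater
    than it; the counting period is left to the switch configuration.
    """
    low, high = THRESHOLDS[profile][area]
    if count > high:
        return "fence"  # high threshold exceeded: port would be disabled
    if low is not None and count > low:
        return "alert"  # low threshold exceeded: RASlog, e-mail, or SNMP trap
    return "none"


if __name__ == "__main__":
    for profile in THRESHOLDS:
        print(profile, action_for(profile, "CRC", 6))
```

With these assumed values, a count of 6 CRC errors produces an alert under the moderate and conservative profiles but fences the port under the aggressive profile, which shows how much more sensitive the aggressive setting is.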
APPENDIX D: CONFIGURING EDGE HOLD TIME

A user can configure the Edge Hold Time using the configure command, as shown below. The switch does not need to be disabled to modify the hold time. Use the Configure edge hold time option to turn this feature on or off.

    configure
    Not all options will be available on an enabled switch.
    To disable the switch, use the "switchDisable" command.
    Configure...
      Fabric parameters (yes, y, no, n): [no] yes
        Configure edge hold time (yes, y, no, n): [yes]
        Edge hold time: (100..500) [100]

The edge_hold_time value is persistently stored in the configuration file. All configuration file operations, such as upload and download, are supported for this feature.

NOTE: This setting is available only in FOS 6.3.1b and later.

© 2010 Brocade Communications Systems, Inc. All Rights Reserved. 06/10 GA-BP-300-00

Brocade, the B-wing symbol, BigIron, DCFM, DCX, Fabric OS, FastIron, IronView, NetIron, SAN Health, ServerIron, TurboIron, and Wingspan are registered trademarks, and Brocade Assurance, Brocade NET Health, Brocade One, Extraordinary Networks, MyBrocade, and VCS are trademarks of Brocade Communications Systems, Inc., in the United States and/or in other countries. Other brands, products, or service names mentioned are or may be trademarks or service marks of their respective owners.

Notice: This document is for informational purposes only and does not set forth any warranty, expressed or implied, concerning any equipment, equipment feature, or service offered or to be offered by Brocade. Brocade reserves the right to make changes to this document at any time, without notice, and assumes no responsibility for its use. This informational document describes features that may not be currently available. Contact a Brocade sales office for information on feature and product availability. Export of technical data contained in this document may require an export license from the United States government.