SYSTEMS LANDSCAPE FOR OPTICAL INTERCONNECT
Mike Woodacre, HPE Fellow, CTO for High-Performance Computing and Mission Critical Systems
woodacre@hpe.com
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity
FACTORS AT WORK FOR OPTICAL INTERCONNECT
• Electrical reach is falling
  • HPE exascale systems use 2.3m electrical links; IEEE 802.3ck (100 Gbps/lane) is targeting 2m
  • The power required to drive these links is high
  • Optical cables are required for inter-rack links
• Optical link costs are improving
  • COGS is falling, but significant investment is required to enable SiPh in broad use with low-cost, reliable endpoints
• New generations of systems will grow the optical footprint
  • Power density of systems is reaching the limit
  • Changing resource utilization requires increasing flexibility

HPC System Challenges
• Power density of processors going up
• Scale is key, driving cost reduction of the resources supporting the processor (memory capacity, IO)
• Integrated HBM capacity opens up the opportunity for additional memory/persistent memory further away from the processor
• High end dominated by direct liquid cooling
• Reliability at scale is key

Enterprise System Challenges
• Power density of processors going up
• DRAM reliability is sensitive to temperature; need for air cooling in typical enterprise data centers
• Transactional applications look like random memory access, driving large memory footprints
• Reliability at small scale is key
• Minimize stranded resources / increase resource utilization
BIG DATA ANALYTICS × ARTIFICIAL INTELLIGENCE × MODELING & SIMULATION: THE EXASCALE ERA
INTRODUCING HPE SLINGSHOT

Traditional Ethernet Networks
• Ubiquitous & interoperable
• Broad connectivity ecosystem
• Broadly converged network
• Native IP protocol
• Efficient for large payloads only
• High latency
• Limited scalability for HPC
• Limited HPC features

HPE Slingshot
• Standards based / interoperable
• Broad connectivity
• Converged network
• Native IP support
• Low latency
• Efficient for small to large payloads
• Full set of HPC features
• Very scalable for HPC & Big Data

Traditional HPC Interconnects
• Proprietary (single vendor)
• Limited connectivity
• HPC interconnect only
• Expensive / slow gateways
• Low latency
• Efficient for small to large payloads
• Full set of HPC features
• Very scalable for HPC & Big Data

✓ Consistent, predictable, reliable high performance: high bandwidth + low latency, from one rack to exascale
✓ Excellent for emerging infrastructures: mix tightly coupled HPC, AI, analytics, and cloud workloads
✓ Native connectivity to data center resources
EXASCALE TECHNOLOGY WINS FOR HPE CRAY EX SYSTEMS
[Slide graphic: five system wins, announced 30-OCT-2018, 18-MAR-2019, 7-MAY-2019, 5-MAR-2020, and 1-OCT-2020]
SLINGSHOT OVERVIEW
• 100GbE/200GbE interfaces
• 64 ports at 100/200 Gbps
• 12.8 Tbps total bandwidth

"Rosetta" 64×200 switch ASIC; Slingshot Top of Rack Switch (air cooled); Slingshot Integrated Blade Switch (direct liquid cooled)

• High Performance Switch Microarchitecture: over 250K endpoints with a diameter of just three hops
• Ethernet Compliant: easy connectivity to datacenters and third-party storage
• World Class Adaptive Routing and QoS: high utilization at scale; flawless support for hybrid workloads
• Effective and Efficient Congestion Control: performance isolation between workloads
• Low, Uniform Latency: focus on tail latency, because real apps synchronize
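The headline bandwidth figure is just the product of the port count and the per-port rate; a minimal Python sketch (assuming the usual decimal Gbps-to-Tbps conversion) that reproduces it:

```python
# Aggregate bandwidth of a 64-port, 200 Gbps-per-port switch.
PORTS = 64
PORT_SPEED_GBPS = 200

total_tbps = PORTS * PORT_SPEED_GBPS / 1000   # 12,800 Gbps = 12.8 Tbps
print(f"{total_tbps:.1f} Tbps total bandwidth")
```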
SLINGSHOT PACKAGING – SHASTA MOUNTAIN/OLYMPUS (SCALE OPTIMIZED)
• The "Colorado" switch mounts horizontally in the compute chassis: internal connections go to the blades, and the rear faces the fabric, carrying group/global links over QSFP-DD group and global cables
• Compute blades carry a dual-NIC mezzanine card and mount vertically, so switch and NICs mate at a 90-degree orientation
WHAT IS A DRAGONFLY NETWORK? A WAY TO MINIMIZE LONG CABLES
• Nodes are organized into groups, usually racks or cabinets, with electrical links from NIC to switch
• All-to-all amongst switches in a group (electrical links between switches)
• All-to-all between groups (global optical links between groups)
• Groups can have different characteristics (e.g. flexible compute and I/O, high-density compute)
• The global network can be tapered to reduce cost
• Enabled by a high-radix router with host:local:global ports in a 1:2:1 ratio
• Max ports ≈ 4×(radix/4)^4
  • E.g. with 64 ports: host links (16), local links (32), global links (16)
  • Max endpoints = 262,656 (see the sizing sketch below)
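To make the scaling numbers concrete, here is a minimal Python sketch. The rule-of-thumb formula comes straight from the slide; the concrete construction that lands on 262,656 assumes a group of 32 switches with one global link to every other group, which is an assumption of this sketch rather than something stated above.

```python
# Dragonfly scaling from a single switch radix (illustrative sketch).
RADIX = 64
HOST_LINKS, LOCAL_LINKS, GLOBAL_LINKS = 16, 32, 16   # host:local:global = 1:2:1

# Rule-of-thumb upper bound quoted on the slide: ~4 * (radix/4)^4
approx_max = 4 * (RADIX // 4) ** 4
print(f"approximate max endpoints: {approx_max:,}")          # 262,144

# One concrete construction that reproduces the slide's 262,656 figure:
switches_per_group = 32                                      # assumed group size
assert switches_per_group <= LOCAL_LINKS + 1                 # local ports can mesh the group all-to-all
hosts_per_group = switches_per_group * HOST_LINKS            # 512 hosts per group
globals_per_group = switches_per_group * GLOBAL_LINKS        # 512 global links per group
max_groups = globals_per_group + 1                           # 513: one link to every other group
print(f"max endpoints: {max_groups * hosts_per_group:,}")    # 262,656
```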
WHY DRAGONFLY?
• Dragonfly lets us use more copper cables, which has several advantages:
  • Cost: each link is about 10X less expensive when you use copper
  • Reliability: links built with copper cables are more than an order of magnitude more reliable than links built with AOCs
  • Power: copper cables are passive and use no additional power; active electrical cables use about half the power of an AOC currently
• Dragonfly allows you to cable a network of a given global bandwidth with about half of the optical cables needed in a fat tree
• With a high-radix switch, Dragonfly gives very low diameter topologies, meaning we don't care about placement of workloads
BENEFITS OF LOW DIAMETER NETWORKS

Fat-tree (classical topology), 512 servers:
• 16 "in-row" 64p switches
• 16 Top of Rack 64p switches
• 512 optical cables
• 512 copper cables
• 512 NICs

All-to-all topology, 512 servers:
• 0 "in-row" switches
• 16 Top of Rack 64p switches
• 240 optical cables
• 512 copper cables
• 512 NICs

The arithmetic behind these counts is worked through in the sketch below.
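A minimal Python sketch that reproduces the comparison. It assumes 64-port switches split 32 ports down to servers / 32 ports up to the fabric, and two parallel links between each pair of ToR switches in the all-to-all case; those splits are assumptions chosen to match the slide's numbers, not details given on the slide.

```python
from math import comb

SERVERS = 512
TOR_DOWN_PORTS = 32      # assumed: half of each 64-port ToR switch faces servers

tor_switches = SERVERS // TOR_DOWN_PORTS                  # 16
copper_cables = SERVERS                                   # server-to-ToR links
nics = SERVERS

# Fat-tree: every ToR uplink runs over optics to an "in-row" (spine) switch
fat_tree_optical = tor_switches * (64 - TOR_DOWN_PORTS)   # 512 uplinks
in_row_switches = fat_tree_optical // 32                  # 16 spine switches, 32 ports used each

# All-to-all: ToR switches meshed directly, 2 parallel links per switch pair
all_to_all_optical = 2 * comb(tor_switches, 2)            # 2 * 120 = 240

print(f"fat-tree  : {in_row_switches} in-row switches, {fat_tree_optical} optical cables")
print(f"all-to-all: 0 in-row switches, {all_to_all_optical} optical cables")
print(f"both      : {tor_switches} ToR switches, {copper_cables} copper cables, {nics} NICs")
```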
ENTERPRISE SHARED MEMORY TECHNOLOGY: MODULAR BUILDING BLOCKS WITH MEMORY FABRIC
Superdome Flex scales up seamlessly as a single system, from a 4-socket building block:
• 4 sockets (up to 6TB)
• 8 sockets (up to 12TB)
• 16 sockets (up to 24TB)
• 20 sockets (up to 30TB)
• 32 sockets (up to 48TB)
12-socket, 24-socket, and 28-socket configurations not pictured; the per-socket arithmetic is sketched below.
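Each capacity point is just the socket count times the per-socket memory implied by the 4-socket building block; a minimal Python sketch, assuming the not-pictured configurations follow the same 1.5 TB-per-socket ratio:

```python
TB_PER_SOCKET = 6 / 4    # 1.5 TB, derived from the 4-socket / 6 TB building block

for sockets in (4, 8, 12, 16, 20, 24, 28, 32):
    print(f"{sockets:2d} sockets: up to {sockets * TB_PER_SOCKET:.0f} TB")
```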
SUPERDOME FLEX CHASSIS
• Minimizing the PCB cost of the UPI interconnect between the 4 sockets (with 12 DIMMs per socket) pushes the longer PCIe connections to the rear of the chassis
• DRAM memory wants to remain cool for reliability
• QSFP interconnect ports
• Custom PCIe ribbon cables to the rear of the chassis enable use of the entire set of PCIe pins from each socket
HPE SUPERDOME FLEX SERVER: 8/16/32 SOCKET ARCHITECTURE
SUPERDOME FLEX PROTOTYPE USE OF OPTICAL CABLING
• 64-socket SD-Flex spans 2 racks; in production, up to 30 QSFP 100Gb cables per chassis
• Improving serviceability: optical cabling would significantly reduce the complexity and support burden of the fabric, with an MBO solution similar to that shown in the 'Machine' prototype
• Production copper cable vs. optical cable example: a 4:1 reduction
TRADITIONAL VS. MEMORY-DRIVEN COMPUTING ARCHITECTURE
• Today's architecture is constrained by the CPU and its attached interfaces (PCI, SATA, DDR, Ethernet): if you exceed what can be connected to one CPU, you need another CPU
• Memory-Driven Computing: mix and match resources at the speed of memory
DRIVING OPEN PROCESSOR INTERFACES: PCIE/CXL/GEN-Z
Past: a CPU/SoC with dedicated local interfaces ((LP)DDR, PCIe, SAS/SATA, Ethernet/IB, proprietary links) and x16 PCIe/CXL cards over a x16 connector (PCIe channel SERDES).
Future: an SoC processor whose PCIe/CXL ports and Gen-Z links carry memory-semantic protocols.
• CXL runs across the standard PCIe physical layer with new protocols optimized for cache and memory
• CXL uses a flexible processor port that can auto-negotiate to either the standard PCIe transaction protocol or the alternate CXL transaction protocols
• CXL speeds are based on PCIe Gen5/Gen6
  • First generation CXL aligns to the 32 GT/s PCIe 5.0 specification
  • CXL usages are expected to be a key driver for an aggressive timeline to the PCIe 6.0 architecture
• Memory-semantic protocols replace processor-local interconnects: (LP)DDR, PCIe, SAS/SATA, etc.
• CXL/Gen-Z enable memory semantics outside the processor's proprietary domain
• Disaggregation of devices, especially memory/persistent memory and accelerators

PCIe lane rate vs. x16 bandwidth (PCIe enabling processor use of higher Ethernet speeds):
• Gen4 (16 Gb/s per lane): x16 ≈ 200Gb
• Gen5 (32 Gb/s per lane): x16 ≈ 400Gb
• Gen6 (64 Gb/s per lane): x16 ≈ 800Gb
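A minimal Python sketch of the lane-rate-to-Ethernet pairing in the table above. The assumption that the slide matches raw x16 bandwidth to the largest standard Ethernet rate it can feed (ignoring encoding overhead) is mine, not stated on the slide.

```python
LANES = 16
LANE_RATE_GBPS = {"Gen4": 16, "Gen5": 32, "Gen6": 64}    # per-lane signalling rate
ETHERNET_RATES_GBPS = [100, 200, 400, 800]

for gen, rate in LANE_RATE_GBPS.items():
    raw_x16 = rate * LANES                               # Gb/s before encoding overhead
    feeds = max(e for e in ETHERNET_RATES_GBPS if e <= raw_x16)
    print(f"PCIe {gen}: {rate} Gb/s per lane -> x16 ~{raw_x16} Gb/s raw, matches {feeds} GbE")
```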
GEN-Z CONSORTIUM: DEMONSTRATED ACHIEVEMENTS
Demonstration of media modules and a "box of slots" enclosure at Flash Memory Summit & Supercomputing (2019):
• Fully operational Smart Modular 256GB DRAM ZMM modules (ZFF form factor, built with an FPGA and Samsung DRAM)
• Prototype Gen-Z Media Box enclosure ("box of slots", holding 6-8 ZMM modules):
  • FPGA module implementing a 12-port switch (48 lanes total)
  • ZMM midplane with 8 ZMM module bays (only 6 electrically connected in the demo)
  • Gen-Z x4 25G links (copper cabling internal to the box)
  • Four QSFP28 Gen-Z uplink ports/links
  • Standard 5m QSFP DAC cables between boxes, switches, and servers
Looking to the future, fabric-attached memory can supplement processor memory:
• Fastest 'storage' with SCM (Storage Class Memory), e.g. 3D XPoint
• A memory pool used to grow the processor memory footprint dynamically
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity
PROCESSING TECHNOLOGY - HETEROGENEITY
• The "Cambrian explosion" of processors is driven by:
  • Demand side: diversification of workloads
  • Supply side: CMOS process limitations
• Achieving performance through specialization: 50+ start-ups are working on custom ASICs (e.g. the Cerebras Wafer Scale Engine)
• Key requirements looking forward:
  • System architecture that can adapt to a wide range of compute and storage devices
  • System software that can rapidly adopt new silicon
  • Programming environment that can abstract specialized ASICs to accelerate workload productivity
• HBM-enabled processors deliver the needed memory bandwidth, opening up the opportunity to package without DIMMs and driving the need for greater interconnect bandwidth per node to balance the system
COMPLEX WORKFLOWS – MANY INTERCONNECTS
One data scientist, one DL engine, one workstation: PSC's Neocortex system
• 2× Cerebras CS-1, each with:
  • 400,000 sparse linear algebra "cores"
  • 18 GB of on-chip SRAM memory
  • 9.6 PB/s memory bandwidth
  • 100 Pb/s on-chip interconnect bandwidth
  • 1.2 Tb/s I/O bandwidth
  • 15 RU
• HPE Superdome Flex:
  • 32 Xeon "Cascade Lake" CPUs
  • 24.5 TB system memory
  • 200 TB NVMe local storage (200TB raw flash filesystem)
  • 2 × 12 × 100G Ethernet (to the 2× CS-1)
  • 16 × 100G HDR100 to Bridges-II and the Lustre filesystem
• 120 memory fabric connections; 104 PCIe fabric connections (64 internal, 40 external)
• Aggregate throughput per stage: 1200Gb (raw) to each CS-1, 1600Gb (raw) to Bridges-II & Lustre
Diagram: the Superdome Flex fans out through 100GbE switches (12 × 100Gb Ethernet per CS-1) to the CS-1 systems and through HDR200 IB leaf switches to Bridges-II and the Lustre filesystem, with internal NVMe drives forming the flash filesystem.
Courtesy of Nick Nystrom, PSC – http://psc.edu
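The aggregate throughput figures follow directly from the link counts quoted above; a minimal Python sketch (raw line rates, no protocol overhead accounted for):

```python
LINKS_PER_CS1 = 12     # 100G Ethernet links from the Superdome Flex to each CS-1
HDR100_LINKS = 16      # 100G HDR100 links to Bridges-II and the Lustre filesystem
LINK_GBPS = 100

print(f"to each CS-1        : {LINKS_PER_CS1 * LINK_GBPS} Gb raw")   # 1200 Gb
print(f"to Bridges-II/Lustre: {HDR100_LINKS * LINK_GBPS} Gb raw")    # 1600 Gb
```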
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity
OPTICAL INTERCONNECT OPPORTUNITIES
• Host links: 16 × 400 Gbps links (64 × 112 Gbps lanes), path length < 1m; might replace the orthogonal connector with a passive optical equivalent
• Local links: 30 × 400 Gbps links, 1m < path length < 2.5m
• Global & I/O links: 18 × 400 Gbps links, 5m < path length < 50m; Ethernet interoperability on external links
• The optimal number of optical links varies between use cases, depending on the relative cost of Cu and SiPh links and on interoperability requirements
The per-class bandwidth budget is tallied in the sketch below.
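A minimal Python sketch that tallies the link budget above. The reading that these three classes together account for all 64 ports of a 400 Gbps-per-port switch is an assumption of the sketch (the counts do sum to 64), not something stated on the slide.

```python
LINK_CLASSES = {               # class: (links, Gbps per link)
    "host   (<1m)":    (16, 400),
    "local  (1-2.5m)": (30, 400),
    "global (5-50m)":  (18, 400),
}

total_links = 0
total_gbps = 0
for name, (links, gbps) in LINK_CLASSES.items():
    total_links += links
    total_gbps += links * gbps
    print(f"{name}: {links} x {gbps} Gbps = {links * gbps / 1000:.1f} Tbps")

print(f"total: {total_links} ports, {total_gbps / 1000:.1f} Tbps")   # 64 ports, 25.6 Tbps
```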
WHAT DOES THE FUTURE HOLD FOR SYSTEM DESIGN?
• HPC systems are reaching the limit of power density; the reach of optical interconnect opens up innovation opportunities to rethink system packaging
• Enterprise systems want to increase resource utilization; optical interconnect opens up disaggregation of resources and RAS (Reliability, Availability, Serviceability) improvements
• Cost is always a factor, and people will cling to the technology offering lower cost until it breaks
• Reliability is key: existing AOCs have proven to be at least an order of magnitude less reliable in large deployments
• Interconnect power is important, but it needs to be considered in the context of overall system power consumption
• The key need is to continue driving optical technology innovation for production-grade, reliable deployment across connectors/MBO/co-packaging at 400Gb, to enable critical use cases at 800Gb, as well as PCIe/CXL use cases
THANKS!
woodacre@hpe.com