SYSTEMS LANDSCAPE FOR OPTICAL INTERCONNECT

Mike Woodacre
HPE Fellow
CTO for High-Performance Computing and Mission Critical Systems
woodacre@hpe.com
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity

FACTORS AT WORK FOR OPTICAL INTERCONNECT
• Electrical reach is falling
  • HPE exascale systems use 2.3 m electrical links
  • IEEE 802.3ck (100 Gbps/lane) is targeting 2 m
  • The power required to drive these links is high
  • Optical cables are required for inter-rack links
• Optical link costs improving
  • COGS is falling, but significant investment is required to enable SiPh in broad use with low-cost, reliable endpoints
• New generations of systems will grow optical footprint
  • Power density of systems is reaching its limit
  • Changing resource utilization requires increasing flexibility

HPC System Challenges
• Power density of processors going up
• Scale is key, driving cost reduction of supporting resources to the processor (memory capacity, I/O)
• Integrated HBM capacity opens up the opportunity for additional memory/persistent memory further away from the processor
• High end dominated by direct liquid cooling
• Reliability at scale is key

Enterprise System Challenges
• Power density of processors going up
• DRAM reliability sensitive to temperature
• Need for air cooling in typical enterprise data centers
• Transactional applications look like random memory access, driving large memory footprints
• Reliability at small scale is key
• Minimize stranded resources / increase resource utilization
BIG DATA ANALYTICS × ARTIFICIAL INTELLIGENCE × MODELING & SIMULATION → EXASCALE ERA
INTRODUCING HPE SLINGSHOT

  Traditional Ethernet Networks                       HPE Slingshot                      Traditional HPC Interconnects
  Ubiquitous & interoperable                     Standards based / interoperable         Proprietary (single vendor)
  Broad connectivity ecosystem                   Broad connectivity                      Limited connectivity
  Broadly converged network                      Converged network                       HPC interconnect only
  Native IP protocol                             Native IP Support                       Expensive/ slow gateways
  Efficient for large payloads only              Low latency                             Low latency
  High latency                                   Efficient for small to large payloads   Efficient for small to large payloads
  Limited scalability for HPC                    Full set of HPC features                Full set of HPC features
  Limited HPC features                           Very scalable for HPC & Big Data        Very scalable for HPC & Big Data

                       ✓ Consistent, predictable, reliable high performance
                              High bandwidth + low latency, from one rack to exascale
                       ✓ Excellent for emerging infrastructures
                              Mix tightly-coupled HPC, AI, analytics, and cloud workloads
                       ✓ Native connectivity to data center resources
EXASCALE TECHNOLOGY WINS FOR HPE CRAY EX SYSTEMS

Announced: 30-Oct-2018, 18-Mar-2019, 7-May-2019, 5-Mar-2020, 1-Oct-2020
SLINGSHOT OVERVIEW
• 100GbE/200GbE interfaces
• 64 ports at 100/200 Gbps
• 12.8 Tbps total bandwidth

(Figures: "Rosetta" switch ASIC, the Rosetta 64×200 switch; Slingshot Top of Rack Switch, air cooled; Slingshot Integrated Blade Switch, direct liquid cooled)

• High Performance Switch Microarchitecture: over 250K endpoints with a diameter of just three hops
• Ethernet Compliant: easy connectivity to datacenters and third-party storage
• World Class Adaptive Routing and QoS: high utilization at scale; flawless support for hybrid workloads
• Effective and Efficient Congestion Control: performance isolation between workloads
• Low, Uniform Latency: focus on tail latency, because real apps synchronize

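As a quick check on the headline numbers above, a minimal Python sketch (the port count and per-port rate are taken from the bullets on this slide; "total bandwidth" is assumed to count all ports at the full 200 Gbps rate):

```python
# Sanity check of the Rosetta switch headline numbers above.
PORTS = 64
GBPS_PER_PORT = 200  # each port can also run at 100 Gbps (100GbE mode)

total_tbps = PORTS * GBPS_PER_PORT / 1000
print(f"Aggregate switch bandwidth: {total_tbps:.1f} Tbps")  # -> 12.8 Tbps
```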
SLINGSHOT PACKAGING – SHASTA MOUNTAIN/OLYMPUS (SCALE OPTIMIZED)
• Compute chassis and switch chassis; "Colorado" switch blades mount horizontally, compute blades vertically
• Switch rear faces the fabric: group/global links over group cables and global cables (QSFP-DD cables)
• Switch internal side faces the blades: switch and NICs mate at a 90-degree orientation
• Each compute blade carries a dual-NIC mezzanine card
CABLING THE SLINGSHOT NETWORK

WHAT IS A DRAGONFLY NETWORK? A WAY TO MINIMIZE LONG CABLES

• Nodes are organized into groups
  • Usually racks or cabinets
  • Electrical links from NIC to switch

• All-to-all amongst switches in a group
  • Electrical links between switches
• All-to-all between groups
  • Optical links
• Groups can have different characteristics
• Global network can be tapered to reduce cost

• Enabled by a high-radix router
  • Host:local:global = 1:2:1
  • Max ports ≈ 4×(radix/4)^4
• E.g., 64 ports:
  • Host links (16)
  • Local links (32)
  • Global links (16)
  • Max ports = 262,656

(Figure: Groups 0-7 connected all-to-all by global optical links; groups can be flexible compute and I/O or high-density compute)
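The two "max ports" figures follow from the 1:2:1 split of the 64 switch ports. A minimal Python sketch of that arithmetic is below; the grouping assumptions (32 switches per group, all-to-all locally, every global link landing in a distinct group) are my reading of the figure, chosen because they reproduce the 262,656 figure exactly:

```python
# Dragonfly scaling from a 64-port ("radix 64") switch with a 1:2:1 split:
# 16 host links, 32 local links, 16 global links per switch.
# Assumption: 32 switches per group (all-to-all locally), and every global
# link reaches a distinct remote group (all-to-all between groups).

radix = 64
hosts_per_switch  = radix // 4        # 16
local_per_switch  = radix // 2        # 32
global_per_switch = radix // 4        # 16

switches_per_group = 32                                        # assumption
globals_per_group  = switches_per_group * global_per_switch    # 512
max_groups         = globals_per_group + 1                     # 513 (peers plus own group)
max_endpoints      = max_groups * switches_per_group * hosts_per_switch

print(max_endpoints)                  # 262,656 -- matches the slide
print(4 * (radix // 4) ** 4)          # 262,144 -- the ~4*(radix/4)^4 approximation
```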
WHY DRAGONFLY?
• Dragonfly lets us use more copper cables, which has several advantages
  • Cost: each link is about 10X less expensive when you use copper
  • Reliability: links built with copper cables are more than an order of magnitude more reliable than links built with AOCs
  • Power: copper cables are passive and use no additional power; active electrical cables currently use about half the power of an AOC
• Dragonfly allows you to cable a network of a given global bandwidth with about half of the optical cables needed in a fat tree
• With a high-radix switch, Dragonfly gives very low-diameter topologies, meaning we don't care about placement of workloads

BENEFITS OF LOW DIAMETER NETWORKS

Fat-tree (classical topology), 512 servers
  •   16 “in-row” 64p switches
  •   16 Top of Rack 64p switches
  •   512 Optical cables
  •   512 Copper cables
  •   512 NICs

All-to-All topology, 512 servers
  •   0 “in-row” switches
  •   16 Top of Rack 64p switches
  •   240 Optical cables
  •   512 Copper cables
  •   512 NICs

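The cable counts above can be reproduced with a short sketch. The detailed topology assumptions are mine (a two-tier fat tree with 16 leaf and 16 "in-row" 64-port switches, and an all-to-all with two parallel optical links between each pair of ToR switches); they are chosen because they match the numbers on this slide:

```python
# Cabling comparison for 512 servers on 64-port switches.
# Fat-tree assumption: 16 ToR (leaf) switches, 32 servers each via copper,
# and 32 optical uplinks each into the 16 "in-row" switches.
# All-to-all assumption: 16 ToR switches, 32 servers each via copper,
# and 2 parallel optical links between every pair of ToR switches.

servers, tor_switches, ports = 512, 16, 64
servers_per_tor = servers // tor_switches                            # 32 copper links per ToR

fat_tree_optical   = tor_switches * (ports - servers_per_tor)        # 16 * 32 = 512
all_to_all_optical = 2 * tor_switches * (tor_switches - 1) // 2      # 2 * 120 = 240

print(fat_tree_optical, all_to_all_optical)   # 512 240  -- matches the slide
print(servers)                                # 512 copper cables and 512 NICs in both cases
```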
ENTERPRISE SHARED MEMORY TECHNOLOGY:
MODULAR BUILDING BLOCKS WITH MEMORY FABRIC
                                                     Superdome Flex Scales up Seamlessly as a Single System

4-socket building block → 4 sockets (up to 6 TB) → 8 sockets (up to 12 TB) → 16 sockets (up to 24 TB) → 20 sockets (up to 30 TB) → 32 sockets (up to 48 TB)

         12-socket, 24-socket, and 28-socket configurations not pictured

SUPERDOME FLEX CHASSIS

Minimizing the PCB cost of the UPI interconnect between the 4 sockets (with 12 DIMMs/socket) pushes the longer PCIe connections to the rear of the chassis; custom PCIe ribbon cables to the rear of the chassis enable use of all the PCIe pins from each socket. DRAM wants to remain cool for reliability.

(Figure: chassis internals, with QSFP interconnect ports)
HPE SUPERDOME FLEX SERVER
8/16/32 SOCKET ARCHITECTURE

SUPERDOME FLEX PROTOTYPE USE OF OPTICAL CABLING
64-socket SD Flex, 2-rack configuration; up to 30 QSFP 100 Gb cables per chassis with production copper cables today.

Optical cabling would significantly reduce the complexity and support burden of the fabric and improve serviceability, using an MBO solution similar to the one shown in the "Machine" prototype; the optical cable example gives a 4:1 reduction in cables.
TRADITIONAL VS. MEMORY-DRIVEN COMPUTING ARCHITECTURE
Today's architecture is constrained by the CPU: DDR memory, PCI, SATA, and Ethernet devices all hang off a processor, and if you exceed what can be connected to one CPU, you need another CPU.

Memory-Driven Computing: mix and match resources at the speed of memory.
DRIVING OPEN PROCESSOR INTERFACES: PCIE/CXL/GEN-Z

Past: CPU/SoC with separate (LP)DDR, PCIe, SAS/SATA, Ethernet/IB, and proprietary interfaces.
Future: SoC with PCIe/CXL and Gen-Z interfaces throughout.
(Figure: an x16 PCIe card and an x16 CXL card share the same x16 connector, PCIe channel, and SERDES into the processor)

Memory Semantic Protocols
• Replace processor-local interconnects: (LP)DDR, PCIe, SAS/SATA, etc.
• CXL/Gen-Z enable memory semantics outside the processor's proprietary domain
• CXL speeds are based on PCIe Gen5/Gen6
• Disaggregation of devices, especially memory/persistent memory and accelerators

CXL
• CXL runs across the standard PCIe physical layer with new protocols optimized for cache and memory
• CXL uses a flexible processor port that can auto-negotiate to either the standard PCIe transaction protocol or the alternate CXL transaction protocols
• First-generation CXL aligns to the 32 GT/s PCIe 5.0 specification
• CXL usages are expected to be a key driver for an aggressive timeline to the PCIe 6.0 architecture
                              PCIE          Gen4/16Gb         Gen5/32Gb         Gen6/64Gb

                              PCIe x16      200Gb             400Gb             800Gb
PCIe enabling processor use of higher Ethernet speeds
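The table reads as: each PCIe generation's x16 slot carries enough raw bandwidth to feed the next Ethernet speed step. A small sketch of that arithmetic; the per-lane rates (16/32/64 GT/s) are the standard PCIe signalling rates, the mapping to Ethernet rates is from the caption above, and encoding/protocol overhead is ignored:

```python
# Raw x16 link bandwidth per PCIe generation vs. the Ethernet rate it can feed.
LANES = 16
generations = {"Gen4": (16, "200GbE"), "Gen5": (32, "400GbE"), "Gen6": (64, "800GbE")}

for gen, (gt_per_lane, eth) in generations.items():
    raw_gbps = gt_per_lane * LANES              # GT/s per lane * 16 lanes
    print(f"{gen}: x16 raw ~ {raw_gbps} Gb/s -> enough to feed one {eth} port")
```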
GEN-Z CONSORTIUM: DEMONSTRATED ACHIEVEMENTS
Demonstration of Media Modules and “Box of Slots” Enclosure
• Flash Memory Summit & Supercomputing (2019)
• Fully operational Smart Modular 256GB ZMM modules
  (Figure: Gen-Z Media Box "box of slots" with 6-8 ZMM modules)

• Prototype Gen-Z Media Box enclosure
  • FPGA module implementing a 12-port switch (x48 lanes total)
  • ZMM midplane
  • 8 ZMM module bays (only 6 electrically connected in the demo)
  • x4 25G Gen-Z links (copper cabling internal to the box)
  • Four QSFP28 Gen-Z uplink ports/links
  • Standard 5 m QSFP DAC cables between boxes, switches, and servers

• Looking to the future, fabric-attached memory can supplement processor memory
  • Fastest "storage" with SCM (Storage Class Memory), e.g. 3D XPoint
  • Memory pool used to grow the processor memory footprint dynamically

(Figure: memory fabric QSFP connectors; 256 GB DRAM ZFF module from Smart Modular using an FPGA and Samsung DRAM)
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity

PROCESSING TECHNOLOGY - HETEROGENEITY
• The “Cambrian explosion” is driven by:
  • Demand side: diversification of workloads
  • Supply side: CMOS process limitations
 Achieving performance through specialization
 50+ start-ups working on custom ASICs

(Figure: Cerebras Wafer Scale Engine)
• Key requirements looking forward:
  • System architecture that can adapt to a wide range of compute
    and storage devices
  • System software that can rapidly adopt new silicon
  • Programming environment that can abstract specialized ASICs
    to accelerate workload productivity

• HBM-enabled processors delivering memory bandwidth needs
• Opens up the opportunity to package without DIMMs
• Drives the need for greater interconnect bandwidth per node to balance the system
COMPLEX WORKFLOWS – MANY INTERCONNECTS
One Data Scientist, One DL Engine, One Workstation

PSC's Neocortex system
• 2x Cerebras CS-1, with each:
  • 400,000 sparse linear algebra "cores"
  • 18 GB SRAM on-chip memory
  • 9.6 PB/s memory bandwidth
  • 100 Pb/s on-chip interconnect bandwidth
  • 1.2 Tb/s I/O bandwidth
  • 15 RU
• HPE Superdome Flex
  • 32 Xeon "Cascade Lake" CPUs
  • 24.5 TB system memory
  • 200 TB NVMe local storage (200 TB raw flash filesystem)
  • 2 x 12 x 100G Ethernet (to the 2x CS-1)
  • 16 x 100G HDR100 to Bridges-II and the Lustre filesystem
• 120 memory fabric connections
• 104 PCIe fabric connections (64 internal, 40 external)
• Aggregate throughput per stage: 1200 Gb each (raw) on the 100GbE stages, 1600 Gb (raw) on the HDR InfiniBand stage

(Diagram: each CS-1 connects over 12 x 100Gb Ethernet to a 100GbE switch and into the Superdome Flex with its NVMe flash tier; HDR200 IB leaf switches connect onward to Bridges-II and the Lustre filesystem)

Courtesy of Nick Nystrom, PSC – http://psc.edu
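The per-stage aggregate throughputs fall out of the link counts listed above; a minimal sketch (link counts are from this slide, everything is raw line rate with no protocol overhead, and treating each hop as a "stage" is my reading of the diagram):

```python
# Aggregate raw throughput per stage of PSC's Neocortex, from the link counts above.
GBE = 100        # 100 Gb/s Ethernet link
HDR = 100        # HDR100 InfiniBand link

cs1_to_switch     = 12 * GBE      # per CS-1: 12 x 100GbE            -> 1200 Gb/s
switch_to_sdflex  = 12 * GBE      # per 100GbE switch into SD Flex   -> 1200 Gb/s
sdflex_to_bridges = 16 * HDR      # 16 x HDR100 to Bridges-II/Lustre -> 1600 Gb/s

print(cs1_to_switch, switch_to_sdflex, sdflex_to_bridges)   # 1200 1200 1600 (raw)
```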
AGENDA
• What do systems look like today
• Processing technology directions
• Optical interconnect opportunity

OPTICAL INTERCONNECT OPPORTUNITIES
• Host links: 16×400 Gbps links (64×112 Gbps lanes), path length < 1 m
  Might replace the orthogonal connector with a passive optical equivalent
• Local links: 30×400 Gbps links, 1 m < path length < 2.5 m
• Global & I/O links: 18×400 Gbps links, 5 m < path length < 50 m
  Ethernet interoperability on external links

The optimal number of optical links varies between use cases:
• Relative cost of Cu and SiPh links
• Interoperability requirements
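Adding up the three reach classes accounts for 64 links per switch; a quick sketch of that budget (link counts and the 400 Gbps per-link rate are from this slide, and the per-class totals are plain arithmetic):

```python
# Per-switch link budget by reach class, using the counts and per-link rate above.
link_classes = {                    # reach class: (number of links, Gbps per link)
    "host   (< 1 m)":    (16, 400),
    "local  (1-2.5 m)":  (30, 400),
    "global (5-50 m)":   (18, 400),
}

total_links = 0
for name, (links, gbps) in link_classes.items():
    total_links += links
    print(f"{name}: {links} x {gbps} Gbps = {links * gbps / 1000:.1f} Tbps")

print("total links per switch:", total_links)   # 64
```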
WHAT DOES THE FUTURE HOLD FOR SYSTEM DESIGN?
• HPC systems are reaching the limit of power density
  • Optical interconnect reach opens up innovation opportunities to re-think system packaging
• Enterprise systems want to increase resource utilization
  • Optical interconnect opens up disaggregation of resources and RAS (Reliability, Availability, Serviceability) improvements

• Cost is always a factor, and people will cling to the technology offering lower cost until it breaks
• Reliability is key: existing AOCs have been shown to be at least an order of magnitude less reliable than copper links in large deployments
• Interconnect power is important, but it needs to be considered in the context of overall system power consumption

• There is a key need to continue driving optical technology innovation toward production-ready, reliable deployment across connectors, MBO, and co-packaging at 400 Gb, to enable critical use cases at 800 Gb, as well as PCIe/CXL use cases

THANKS!
woodacre@hpe.com
