Realizing One Big Switch Performance Abstraction using P4
Jeongkeun “JK” Lee, Principal Engineer, Intel Barefoot
Petr Lapukhov, Network Engineer, Facebook
Agenda
• Need for a high-performance cluster for ML training
• Vision: One Big Switch with Distributed VoQ
• Problem: network congestion
• Switch-side building blocks and P4 constructs
• Contributions from: Anurag Agrawal, Jeremias Blendin, Andy Fingerhut, Grzegorz Jereczek, Yanfang Le, Georgios Nikolaidis, Rong Pan, Mickey Spiegel
A Training Cluster Primer
(Diagram: training hosts, each with CPUs, NICs, accelerators, and RDMA NICs, attach through ToRs to a Frontend fabric toward storage and to a Backend compute fabric. BE – Backend Fabric, FE – Frontend, ToR – Top of Rack.)
Distributed ML training
• Combines data & model parallelism
• Bulk-synchronous parallel (BSP) realization
• Collective-style communications
• All trainers arrive at a barrier in each cycle
Ref: https://arxiv.org/pdf/2104.05158.pdf
The problem statement
• Synchronous training cycle = [Compute + Network]
• Network collectives: All-to-All, All-Reduce, …
• A collective produces a combination of flows
• Objective: min(collective completion time)
• Large messages – O(1MB)
What’s the challenge?
• Accelerators require HW-offloaded transport
• All-to-All is bisection-BW hungry
• All-Reduce may cause incast; flow entropy varies
• Large Clos fabric requires tuning with RoCE
• Small scale is fine – a single large crossbar switch
Vision: One Big Switch with Distributed VoQ
• The “Network as OBS” model has been used in:
  • SDN for network-wide control plane and policy program
  • OVS/OVN network virtualization
  • Service mesh for µServices
• OBS performance model = hose model
  • Non-blocking fabric
  • Congestion only at network egress (output queue)
• Distributed VoQ (Virtual Output Queue)
  • VoQ reflects egress congestion & BW towards receivers
  • Move queueing from egress to ingress
  • OBS ingress = 1st-hop switches or senders
(Figure: hose model – endpoints X, Y, Z attached to a non-blocking fabric with hose bandwidths BX, BY, BZ. Source: http://lib.cnfolio.com/)
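In hose-model terms (the standard definition, not spelled out on the slide), the one-big-switch abstraction admits any traffic matrix in which no endpoint oversubscribes its own hose in either direction, which is why congestion can only appear at the egress hose:

$$\sum_{j \ne i} T_{ij} \le B_i^{\mathrm{tx}} \;\; \forall i, \qquad \sum_{i \ne j} T_{ij} \le B_j^{\mathrm{rx}} \;\; \forall j,$$

where $T_{ij}$ is the offered rate from endpoint $i$ to endpoint $j$, and $B_i^{\mathrm{tx}}, B_j^{\mathrm{rx}}$ are the endpoint hose bandwidths (the $B_X, B_Y, B_Z$ in the figure, for symmetric hoses).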
Reality: Congestion in DC Clos
• Uplink/core ← challenge for non-blocking fabric
  • Cause: ECMP/LAG hash collision
  • Worse in oversubscribed networks
• Incast ← challenge for VoQ
  • Cause: many-to-one traffic pattern
  • Congestion surge at the last hop
  • Slows down the e2e signal loop
• Receiver NIC ← challenge for VoQ
  • Cause: slow software/CPU, PCIe bottleneck
Switch-side building blocks
1. Provide rich congestion and BW metrics → for applications to acquire available BW ASAP while controlling congestion
2. Fine-grained load balancing with minimal or no out-of-order delivery → for a non-blocking, full-bisection fabric
3. Cut-payload signaling to receivers → for NDP-style receiver pulling
4. Sub-RTT signaling back to senders → for sudden changes of congestion/BW state
5. React to RX NIC congestion → leverage switch programmability where a smart NIC is not available or applicable
1. Provide Congestion and BW metrics
• Goal: the switch provides rich info for the sender/receiver to consume and act on
• For whom: incoming data pkts, control pkts (RTS/CTS solicitation)
• What to provide: queue depth, drain time, TX rate or available BW, arrival rate (incast rate), # of flows, congestion locator (node/port/Q IDs)
• How: in-band on forwarded pkts, a separate signal pkt back to the sender, or a signal pkt to the receiver
• Where: last-hop or core switch, ingress or egress
• When: threshold crossing, bloom-filter suppression, when the switch metric diverges from the in-pkt metric, when the NIC sends PFC, upon pkt drops, …
P4 Primitives (w/ PSA/TNA/T2NA)
• For whom: incoming data pkts, control pkts (RTS/CTS solicitation) → P4 parser
• What to provide: queue depth, drain time, per-port or per-queue TX rate, arrival rate (incast rate), # of flows, congestion locator (node/port/Q IDs) → standard/intrinsic metadata, register with HW reset, meter or LPF extern
• How: in-band on forwarded pkts, a separate signal pkt back to the sender, or a signal pkt to the receiver (at high priority, like NDP cut-payload) → modify, mirror, recirculate, multicast actions
• Where: last-hop or core switch, ingress or egress, depending on use case → T2NA: ingress visibility of egress queue status (P4 Expert Round Table 2020)
• When: threshold crossing, bloom-filter suppression, when the switch metric diverges from the in-pkt metric, when the NIC sends PFC, upon pkt drops, … → register, meter, P4 processing of RX PFC frames, TNA: Deflect on Drop
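To illustrate the mapping, here is a minimal P4_16 sketch written against the open-source v1model architecture (the talk targets PSA/TNA/T2NA, whose intrinsic metadata and externs differ): an egress control that stamps queue depth and drain time into an assumed in-band telemetry header and mirrors a separate signal packet when a queue-depth threshold is crossed. The header layout, threshold, and mirror session are illustrative assumptions, not part of the talk.

```p4
/* Minimal sketch, not the talk's TNA/T2NA code: assumes the usual v1model
 * skeleton (parser, deparser, checksum controls, V1Switch main) around it. */
#include <core.p4>
#include <v1model.p4>

// Hypothetical in-band telemetry header carried on data packets.
header cong_report_t {
    bit<19> qdepth;       // egress queue depth (v1model deq_qdepth width)
    bit<32> drain_time;   // time the packet spent in the queue
    bit<9>  egress_port;  // congestion locator: which port/queue is hot
    bit<4>  pad;          // pad to a byte-aligned 64-bit header
}

struct headers_t  { cong_report_t report; }
struct metadata_t { }

const bit<19> QDEPTH_THRESHOLD      = 1000; // "when": threshold crossing
const bit<32> SIGNAL_MIRROR_SESSION = 5;    // points back toward the sender

control CongMetricsEgress(inout headers_t hdr,
                          inout metadata_t meta,
                          inout standard_metadata_t standard_metadata) {
    apply {
        if (hdr.report.isValid()) {
            // "What": queue depth, drain time, congestion locator,
            // carried in-band on the forwarded data packet.
            hdr.report.qdepth      = standard_metadata.deq_qdepth;
            hdr.report.drain_time  = standard_metadata.deq_timedelta;
            hdr.report.egress_port = standard_metadata.egress_port;
        }
        if (standard_metadata.deq_qdepth > QDEPTH_THRESHOLD) {
            // "How": also emit a separate signal packet via a mirror
            // session the control plane aims at the sender (or receiver).
            clone(CloneType.E2E, SIGNAL_MIRROR_SESSION);
        }
    }
}
```

On TNA/T2NA the same "when" logic would typically use registers, meters, or LPFs together with the mirror extern, as listed on the slide.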
2. Fine-grained Load Balancing
• For a non-blocking fabric
• Sender NIC may spray packets by changing the L4 src port number (e.g., AWS SRD)
  • Pros: ECMP switches don't need to change; handles brownfield better with path tracking at the NIC
  • Cons: transport & congestion control change; the 5-tuple connection ID may change
• Switch alternative 1: flowlet switching (e.g., CONGA)
  • Pros: no change to transport, no OOO delivery
  • Cons: flow scale is limited
• Switch alternative 2: DRR (Deficit Round Robin) packet spraying
  • Pros: DRR minimizes load imbalance and hence the OOO delivery window; DRR is possible in P4
  • Cons: needs greenfield (at least the ToR layer); needs packet re-ordering at the RX NIC/host
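As a sketch of "switch alternative 1", the fragment below implements flowlet-style path selection on v1model: a flow is re-hashed onto a new uplink only when the gap since its previous packet exceeds a flowlet timeout, so bursts stay on one path and ordering is preserved. The flow-ID metadata, table sizes, and the 200µs timeout are assumptions, not values from the talk (CONGA additionally picks the least-congested path rather than re-hashing).

```p4
/* Minimal sketch: flowlet-style path selection in a v1model ingress control.
 * Assumes meta.flow_id (a 5-tuple hash) is set earlier in the pipeline and
 * that a nexthop table keyed on meta.path_id follows. */
#include <core.p4>
#include <v1model.p4>

struct headers_t  { }
struct metadata_t {
    bit<32> flow_id;      // 5-tuple hash, assumed computed earlier
    bit<16> flowlet_idx;  // slot in the flowlet tables
    bit<16> path_id;      // chosen ECMP member / uplink
}

const bit<48> FLOWLET_TIMEOUT_US = 200;   // bmv2 timestamps are in microseconds
const bit<32> NUM_FLOWLET_SLOTS  = 8192;
const bit<16> NUM_PATHS          = 64;

control FlowletIngress(inout headers_t hdr,
                       inout metadata_t meta,
                       inout standard_metadata_t standard_metadata) {
    register<bit<48>>(NUM_FLOWLET_SLOTS) last_seen;     // last packet time per slot
    register<bit<16>>(NUM_FLOWLET_SLOTS) flowlet_path;  // current path per slot

    apply {
        // Map the flow to a slot; a collision only merges flowlets,
        // it never splits one flow across paths.
        hash(meta.flowlet_idx, HashAlgorithm.crc16, (bit<16>)0,
             { meta.flow_id }, NUM_FLOWLET_SLOTS);

        bit<48> last;
        bit<16> path;
        last_seen.read(last, (bit<32>)meta.flowlet_idx);
        flowlet_path.read(path, (bit<32>)meta.flowlet_idx);

        if (standard_metadata.ingress_global_timestamp - last > FLOWLET_TIMEOUT_US) {
            // Idle gap exceeded the timeout: re-hash onto a (possibly) new path.
            // No OOO delivery because in-flight packets have already drained.
            hash(path, HashAlgorithm.crc16, (bit<16>)0,
                 { meta.flow_id, standard_metadata.ingress_global_timestamp },
                 NUM_PATHS);
            flowlet_path.write((bit<32>)meta.flowlet_idx, path);
        }
        last_seen.write((bit<32>)meta.flowlet_idx,
                        standard_metadata.ingress_global_timestamp);
        meta.path_id = path;
    }
}
```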
3. Cut-Payload to Receiver
• Signal last-hop switch congestion & drop events to the receiver at high priority
• NDP P4 design by Correct Networks and Intel, using
  • ingress meter + mirror
  • deflect-on-drop + multicast
  • on Tofino1
• P4 source will be posted soon at https://github.com/p4lang/p4-applications
• Check out “NDP with SONiC-PINS: A low latency and high performance data-center transport architecture integrated into SONiC” by Rong P. and Reshma S.
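Conceptually, the trimming step can be sketched in a few lines of v1model P4: when the egress queue exceeds a threshold, keep only the headers and flag the packet so the receiver can pull a retransmission. This is not the referenced Tofino1 design (which uses an ingress meter + mirror and deflect-on-drop + multicast); the flag header, threshold, and truncation length below are assumptions.

```p4
/* Minimal conceptual sketch of payload trimming in a v1model egress control;
 * assumes the usual v1model skeleton around it. */
#include <core.p4>
#include <v1model.p4>

// Hypothetical one-byte flag so the receiver can tell the payload was
// trimmed and pull a retransmission.
header trim_flag_t { bit<8> trimmed; }

struct headers_t  { trim_flag_t trim; }
struct metadata_t { }

const bit<19> TRIM_QDEPTH_THRESHOLD = 2000; // illustrative, in queue cells
const bit<32> HEADER_BYTES          = 64;   // keep only the packet headers

control CutPayloadEgress(inout headers_t hdr,
                         inout metadata_t meta,
                         inout standard_metadata_t standard_metadata) {
    apply {
        if (standard_metadata.deq_qdepth > TRIM_QDEPTH_THRESHOLD) {
            // Queue past the threshold: drop the payload but keep the headers
            // flowing to the receiver (a full NDP-style design would also
            // forward the trimmed header at high priority).
            hdr.trim.setValid();
            hdr.trim.trimmed = 1;
            truncate(HEADER_BYTES);
        }
    }
}
```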
4. Sub-RTT, L3 Signaling back to sender
• Edge-to-edge signaling of congestion info
  • (Building block 5) Able to carry the rcv NIC congestion signal (e.g., NIC-to-switch PFC) to sender hosts
• Senders can use the signal (VoQ) in various ways, e.g., instant flow control to ‘flatten the curve’
• Example below: SFC (Source Flow Control) = new L3 switch-to-switch signaling + PFC as the existing flow control at the sender NIC (no PFC between switches)
• Future work: fine-grained per-receiver VoQ at the sender
(Figure: SFC signaling path – incast traffic (1) congests an egress queue at the last-hop switch; a congestion signal (2) with the IP addresses reversed and a pause time travels back across the Agg + Core switches to the 1st-hop switch, which caches [dstIP: pause time] and, after a delay, sends PFC (5) to the sender host; ECN (6) and ACKs (7) follow the normal path.)
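A minimal sketch of the 1st-hop switch's role (the "cache [dstIP: pause time]" step in the figure), again on v1model rather than the actual SFC implementation: an SFC signal writes a pause deadline into a register indexed by a hash of the destination IP, and subsequent data packets toward that destination mark a "pause the sender" decision until the deadline expires. Field names, the hash-indexed cache, and the decision flag are illustrative assumptions.

```p4
/* Minimal sketch of a [dstIP -> pause time] cache at the 1st-hop switch.
 * Parsing the SFC signal and generating PFC toward the sender NIC are
 * assumed to happen elsewhere in the program. */
#include <core.p4>
#include <v1model.p4>

struct headers_t  { }
struct metadata_t {
    bit<32> dst_ip;         // destination IP of the signaled flow / data packet
    bit<1>  is_sfc_signal;  // set by the parser for SFC congestion signals
    bit<48> pause_until;    // pause deadline carried by the signal
    bit<1>  pause_sender;   // decision: hold traffic to this destination
}

const bit<32> NUM_PAUSE_SLOTS = 4096;

control SfcPauseIngress(inout headers_t hdr,
                        inout metadata_t meta,
                        inout standard_metadata_t standard_metadata) {
    // Pause deadline per hashed destination IP (a real design would handle
    // hash collisions and per-queue granularity more carefully).
    register<bit<48>>(NUM_PAUSE_SLOTS) pause_deadline;

    apply {
        bit<32> slot;
        hash(slot, HashAlgorithm.crc32, (bit<32>)0, { meta.dst_ip },
             NUM_PAUSE_SLOTS);

        if (meta.is_sfc_signal == 1) {
            // Congestion signal from the last-hop switch: cache its pause time.
            pause_deadline.write(slot, meta.pause_until);
        } else {
            bit<48> deadline;
            pause_deadline.read(deadline, slot);
            if (standard_metadata.ingress_global_timestamp < deadline) {
                // Destination still paused: signal the sender NIC (e.g., via
                // PFC on the host-facing port) instead of queueing in-fabric.
                meta.pause_sender = 1;
            }
        }
    }
}
```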
SFC: System Demo
(Testbed figure: hosts H0–H5 attached to switches SW5 and SW6 over 25G/100G links, with a 100G link between the switches; H0 is the victim while the other hosts carry normal traffic. Each host issues 160 flows (RDMA Reliable Connection).)
Flow Completion Time
(Plot: CDF of flow completion time (ms), comparing PFC and SFC.)
Queue Depth
(Plot: switch queue depth in the demo topology; H0 is the victim, H1–H5 carry normal traffic via SW5/SW6.)
Summary
• True One-Big-Switch with performance is possible
• P4 primitives can build
  • a non-blocking Clos fabric
  • sub-RTT signaling of congestion & BW metrics
• Research questions
  • What is the scope of OBS; granularity of the distributed VoQ
    • From last-hop switch queue to rxNIC port to service node (ML worker or µService)
    • Should consider the major incast bottleneck point and VoQ scale
  • NIC-side building blocks to enable distributed VoQ at senders
  • How QoS infra and congestion control benefit from VoQ
Thank You jk.lee@intel.com petr@fb.com
Simulation setup
• Cluster: 3-tier, 320 servers, full bisection, 12µs base RTT
• Switch buffer: 16MB, Dynamic Threshold
• Congestion control: DCQCN+window, HPCC
• SFC parameters
  • SFC trigger threshold = ECN threshold = 100KB, SFC drain target = 10KB
• Workload: RDMA writes
  • 50% background load: shuffle, msg size follows public traces from RPC, Hadoop, DCTCP
  • 8% incast bursts: 120-to-1, msg size 250KB, synchronized starts within 145µs
• Metrics
  • FCT slowdown: FCT normalized to the FCT of a same-size flow at line rate
  • Goodput, switch buffer occupancy
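For reference, the slowdown metric defined above is simply

$$\mathrm{slowdown}(f) \;=\; \frac{\mathrm{FCT}(f)}{\mathrm{FCT}_{\text{line rate}}\big(\mathrm{size}(f)\big)},$$

i.e., the measured completion time of flow $f$ divided by the completion time a same-size flow would achieve at line rate, so 1 is ideal and larger is worse.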
Simulation with RPC-inspired msg size dist.
(Plots: CDFs of 50%, 95%, and 99% FCT slowdown vs. message size (Bytes), plus CDFs of switch buffer occupancy (MB) and goodput (Gbps), comparing PFC- and SFC-based configurations under DCQCN+Window (top row) and HPCC (bottom row); lower slowdown and buffer occupancy and higher goodput are better.)