EXS AND ROEE FROM UNH-IOL - BASED IN PART UPON DR. ROBERT RUSSELL'S SLIDES PRESENTER: MIKKEL HAGEN

Page created by Mildred Mack
 
CONTINUE READING
EXS and RoEE from UNH-IOL

Based in part upon Dr. Robert Russell's slides
          Presenter: Mikkel Hagen
              www.openfabrics.org
Outline

¾InterOperability Laboratory (IOL)
¾Extended Sockets API (EXS-API)
¾EXS in user space
¾Mapping EXS onto RDMA
¾Performance
¾RDMA over Enhanced Ethernet (RoEE)

www.openfabrics.org                   2
UNH-IOL

¾Neutral 3rd party testing
¾Hosts the OFA-IWG interoperability cluster
    ƒ ~30 hosts, with targets and switches from many
      different vendors
    ƒ Both iWARP and IB
¾Maintains the OFW-IWG Logo List
¾Related testing through:
    ƒ iWARP, 10Gig, IB/10G Cable Testing and Data
      Center Bridging Consortia
www.openfabrics.org                                    3
Extended Sockets (EXS-API)

¾Published by the Open Group
¾Extensions to “standard” sockets
¾Two major new features:
    ƒ Memory Registration for “zero-copy” I/O
    ƒ Event-based Asynchronous I/O
¾Useful for high-level access to RDMA

www.openfabrics.org                             4
IOL EXS Stack for OFA

                      Application Program
        user space
                       IOL EXS Library

                       OFED user library

                      OFED kernel modules
       kernel space
                       IWARP / IB Driver

                      iWARP RNIC / IB HCA
          hardware
                             Cable

www.openfabrics.org                         5
IOL EXS

¾Runs entirely in user space
¾Utilizes OFED library to access RDMA
¾Requires no change to OFED or Linux
¾Extends Open Group ES-API specifications
    ƒ ES-API designed to run in kernel
    ƒ ES-API requires modifications to existing socket
      functions

www.openfabrics.org                                      6
EXS Programming

¾Two major differences from “normal” sockets
    ƒ Memory registration
         • To lock memory regions for “zero-copy” I/O
    ƒ Event queues
         • To determine asynchronous I/O completion

www.openfabrics.org                                     7
Memory Registration

¾exs_mregister()
    ƒ Registers a region of user virtual memory for
      “zero-copy” I/O
¾exs_mderegister()
    ƒ Unregisters previously registered memory region
¾Memory regions are used in I/O operations
    ƒ exs_send()
    ƒ exs_recv()

www.openfabrics.org                                     8
Event Queues

¾exs_qcreate()
    ƒ Creates a new queue object
¾exs_qdelete()
    ƒ Deletes an existing queue object
¾exs_qdequeue()
    ƒ Blocks until event is posted to queue object
¾Event posted to a queue when an I/O
 completes

www.openfabrics.org                                  9
Parameters to exs_send()

¾ Socket file descriptor – normal
¾ Buffer containing data to send – normal
¾ Number of bytes of data to send – normal
¾ Flags – normal
¾ Event queue for posting completion event – new
¾ User-defined request identification – new
¾ Registered memory region for data buffer - new

www.openfabrics.org                                10
Parameters to exs_recv()

¾ Socket file descriptor – normal
¾ Buffer into which data is received – normal
¾ Max number of bytes of data to read – normal
¾ Flags – normal
¾ Event queue for posting completion event – new
¾ User-defined request identification – new
¾ Registered memory region for data buffer - new

www.openfabrics.org                                11
Other EXS functions

¾ exs_init()
¾ exs_fcntl() - IOL Extension
¾ exs_socket() - IOL Extension
¾ exs_bind() - IOL Extension
¾ exs_listen() - IOL Extension
¾ exs_connect()
¾ exs_accept()
¾ exs_close() - IOL Extension

www.openfabrics.org              12
Mapping IOL-EXS onto RDMA

¾ Transfer controlled by recipient of data
¾ exs_send() translates into RDMA “send” to advertise
  sender's data buffer to receiver
¾ exs_recv() sets up asynchronous wait for
  advertisement, then issues RDMA “read_request” to
  asynchronously transfer data directly from sender's
  memory into receiver's memory
¾ Asynchronous completion of transfer translates into
  short RDMA “send” of ACK back to sender

www.openfabrics.org                                13
Typical EXS Data Transfer
     user     EXS      OFED   RNIC        wire         RNIC   OFED   EXS    user

exs_send                               RDMA Send                           exs_recv
                                      advertisement

                                     RDMA Read-
                                       Request
                                                 RDMA Read-
                                                  Response
                                       RDMA Send
                                     acknowledgment
                                                                      exs_qdequeue
exs_qdequeue

 www.openfabrics.org                                                               14
IOL EXS Internal Flow Control

¾Each side maintains local “send credit” value
¾Initial credits negotiated when connection
 established
¾exs_send() decrements local “send credit” by
 1, blocks if none remaining
¾Receipt of ACK increments local “send credit”
 by 1, unblocks any waiting

www.openfabrics.org                          15
IOL-EXS Small Packets

¾Size limit configured by user with exs_fcntl()
 prior to connection establishment
¾Causes “small” amounts of data in exs_send()
 to travel as “immediate data” in the
 advertisement
¾Saves extra RDMA “read_request” /
 “read_response” exchange (hidden from user)
¾Costs extra data copy on receiver when data
 advertisement is matched with exs_recv()
www.openfabrics.org                          16
Initial Performance Measurements

¾Platform configuration
    ƒ 4 64-bit Intel 2.66 GHz processors
    ƒ 4 Gigabytes memory
    ƒ 1 NetEffect (Intel) 10 Gigabit/second RNIC
¾Test
    ƒ “blasting” data from one workstation to other
¾Metrics measured
    ƒ User-level throughput in Megabits/second
    ƒ CPU utilization as a percentage of 1 CPU
www.openfabrics.org                                   17
EXS Throughput in Mbits/sec

www.openfabrics.org           18
Bandwidth Utilization

¾ Standard Ethernet Frame        1538 bytes
   ƒ Ethernet gap, preamble, header and CRC    38
     bytes
   ƒ IP and TCP headers                      40
     bytes
   ƒ MPA, DDP, RDMAP headers and MPA CRC
     20 bytes
¾ Available user payload data     1440 bytes
¾ Percentage available for data     93.63%
¾ Percentage actually utilized      93.29%
www.openfabrics.org                             19
Percent CPU Utilization

www.openfabrics.org       20
EXS Conclusions

¾Application programs using EXS can attain
    ƒ High bandwidth utilization
    ƒ Low CPU overhead
¾Application programs using EXS can tune
 their performance
¾Programmers don't need to learn verbs in
 order to make use of RDMA technology

www.openfabrics.org                          21
EXS Future Work

¾Compare various RNIC hardware
¾Compare standard and jumbo Ethernet
 frames
¾Compare EXS over IB and iWARP
¾Compare various applications
¾Look at multiple outstanding transmissions
¾Look at multiple simultaneous connections

www.openfabrics.org                           22
RoEE

¾With the introduction of Data Center Bridging
 standards into Ethernet, applications sensitive
 to loss can now be placed directly upon
 Ethernet such as Fibre Channel and RDMA
¾DCB has four new technologies being
 introduced: PFC, ETS, CN, DCBX

www.openfabrics.org                           23
Priority-based Flow Control (PFC)

¾Introduces a new MAC Control Frame that
 extends existing PAUSE mechanism
¾New frame has fields to define pause time for
 all eight defined priorities
¾This allows end nodes to inhibit only the traffic
 classes that are over utilizing their fair share
 of bandwidth

www.openfabrics.org                             24
Enhanced Transmission Selection (ETS)

¾ Adds another field, Priority Group ID (PGID), to
  frames
¾ Priorities are added to different groups and the
  bandwidth of the link is divided up between the
  groups.
¾ This allows network admins to provide minimum
  QoS bandwidth requirements to different traffic
  classes
¾ Rate limiters on the transmitter maintains the BW
  allocations
www.openfabrics.org                                   25
Congestion Notification (CN)

¾Adds another field, FlowID, to frames
¾Provides end-to-end flow control
¾PFC causes back pressure congestion
 cascades that can pull in nodes and flows not
 directly involved in the congestion
¾CN allows devices to limit congestion pro-
 actively and hopefully prevent this
 phenomenon

www.openfabrics.org                          26
DCB Exchange (DCBX)

¾Allows end nodes to communicate
 configuration and even actively configure each
 other
¾Designed to provide a means of identifying
 config problems and limited means of fixing
 them

www.openfabrics.org                          27
DCB and UNH-IOL

¾ UNH-IOL is currently testing all of the above
  technologies in the new Data Center Bridging
  Consortium
¾ Future events are already planned with EA and FCIA to
  test DCB and FCoE (FC over DCB)
¾ Future events may also include Enterprise iSCSI (iSCSI
  over DCB)
¾ First test plan is complete:
    ƒ http://www.iol.unh.edu/services/testing/dcb/testsuites.php

www.openfabrics.org                                            28
RoEE and UNH-IOL

¾ UNH-IOL has over 20 years of experience working with
  standards bodies including IEEE, IETF, ISO, ITU, etc.
¾ UNH-IOL has worked with RDMA technologies including
  iWARP, IB and OFA for several years – testing,
  documenting and developing applications
¾ UNH-IOL has worked on DCB from the beginning and
  will lead the way with testing and developing this new
  technology
¾ UNH-IOL is uniquely qualified with experience in all
  areas touching RoEE

www.openfabrics.org                                        29
Contacts
¾Mikkel Hagen
    ƒ Research and Development
    ƒ Phone: 1 (603) 862 5083
    ƒ Email: mhagen@iol.unh.edu
¾Bob Noseworthy
    ƒ Chief Engineer
    ƒ Phone: 1 (603) 862 0090
    ƒ Email: ren@iol.unh.edu
¾Website: http://www.iol.unh.edu
www.openfabrics.org                30
You can also read