An Introduction to the Linux Kernel Block I/O Stack
Based on Linux 5.11

Benjamin Block ‹bblock@de.ibm.com›
March 14th, 2021
IBM Deutschland Research & Development GmbH
Trademark Attribution Statement
The following are trademarks of the International Business Machines Corporation in the United States, other countries, or both.

 Not all common law marks used by IBM are listed on this page. Because of the large number of products marketed by IBM, IBM’s practice is to list only the most important of its common law marks.
 Failure of a mark to appear on this page does not mean that IBM does not use the mark nor does it mean that the product is not actively marketed or is not significant within its relevant market.

 A current list of IBM trademarks is available on the Web at “Copyright and trademark information”: https://www.ibm.com/legal/copytrade.

 IBM®, the IBM® logo, ibm.com®, AIX®, CICS®, Db2®, DB2®, developerWorks®, DS8000®, eServer™, Fiberlink®, FICON®, FlashCopy®, GDPS®, HyperSwap®, IBM Elastic Storage®, IBM FlashCore®, IBM
 FlashSystem®, IBM Plex®, IBM Spectrum®, IBM Z®, IBM z Systems®, IBM z13®, IBM z13s®, IBM z14®, OS/390®, Parallel Sysplex®, Power®, POWER®, POWER8®, POWER9™, Power Architecture®,
 PowerVM®, RACF®, RED BOOK®, Redbooks®, S390-Tools®, S/390®, Storwize®, System z®, System z9®, System z10®, System/390®, WebSphere®, XIV®, z Systems®, z9®, z13®, z13s®, z15™,
 z/Architecture®, z/OS®, z/VM®, z/VSE®, and zPDT® are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and
 service names might be trademarks of IBM or other companies.

The following are trademarks or registered trademarks of other companies.

 UNIX is a registered trademark of The Open Group in the United States and other countries.
 The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis.
 Red Hat®, JBoss®, OpenShift®, Fedora®, Hibernate®, Ansible®, CloudForms®, RHCA®, RHCE®, RHCSA®, Ceph®, and Gluster® are trademarks or registered trademarks of Red Hat, Inc. or its
 subsidiaries in the United States and other countries.

All other products may be trademarks or registered trademarks of their respective companies.

Note:

 Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user
 will experience will vary depending upon considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload
 processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here.
 IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.
 All customer examples cited or described in this presentation are presented as illustrations of the manner in which some customers have used IBM products and the results they may have
 achieved. Actual environmental costs and performance characteristics will vary depending on individual customer configurations and conditions.
 All statements regarding IBM’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
 Information about non-IBM products is obtained from the manufacturers of those products or their published announcements. IBM has not tested those products and cannot confirm the
 performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
Outline

   What is a Block Device?

   Anatomy of a Block Device

   I/O Flow in the Block Layer
What is a Block Device?
A Try at a Definition

     In Linux, a Block Device is a hardware abstraction. It represents hardware whose data is
     stored and accessed in fixed size blocks of n bytes (e.g. 512, 2048, or 4096 bytes) [18].
     In contrast to Character Devices, blocks on block devices can be accessed in a
     random-access pattern, whereas the former only allow a sequential access pattern [19].
     Typically, and for this talk, block devices represent persistent mass storage hardware.

     But not all block devices in Linux are backed by persistent storage (e.g. RAM Disks whose
     data is stored in memory), nor must all of them organize their data in fixed blocks (e.g.
     ECKD formatted DASDs whose data is stored in variable length records). Even so, they
     can be represented as such in Linux, because of the abstraction provided by the Kernel.

© Copyright IBM Corp. 2021                                                                       2
What is a ‘Block’ Anyway?

     A Block is a fixed number of bytes used in the communication with a block device
     and the associated hardware. Different layers in the software stack differ in the
     exact meaning and size:

         • Userspace Software: application-specific meaning; usually how much data is read
           from/written to files via a single system call.
         • VFS: unit in which file systems in Linux perform I/O; between 512 bytes and
           PAGE_SIZE (e.g. 4 KiB for x86 and s390x; may be as big as 1 MiB).
         • Hardware: also referred to as Sector.
               • Logical: smallest unit in bytes that is addressable on the device.
               • Physical: smallest unit in bytes that the device can operate on without resorting to
                 read-modify-write.
               • The Physical block size may be bigger than the Logical block size; both can be
                 queried from userspace (see the sketch below).
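
      As a hedged illustration (not part of the original slides), a small userspace program can
      query both sector sizes, plus the total capacity, with the BLKSSZGET, BLKPBSZGET, and
      BLKGETSIZE64 ioctls; the device path is only an example.

        /* Query logical and physical block (sector) sizes of a block device.
         * Illustrative sketch; the device path is an example. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>       /* BLKSSZGET, BLKPBSZGET, BLKGETSIZE64 */
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                const char *dev = argc > 1 ? argv[1] : "/dev/sda";
                int fd = open(dev, O_RDONLY);
                if (fd < 0) { perror("open"); return 1; }

                int logical = 0;
                unsigned int physical = 0;
                unsigned long long bytes = 0;

                ioctl(fd, BLKSSZGET, &logical);    /* smallest addressable unit */
                ioctl(fd, BLKPBSZGET, &physical);  /* smallest unit without read-modify-write */
                ioctl(fd, BLKGETSIZE64, &bytes);   /* total capacity in bytes */

                printf("%s: logical=%d physical=%u size=%llu bytes\n",
                       dev, logical, physical, bytes);
                close(fd);
                return 0;
        }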

© Copyright IBM Corp. 2021                                                                              3
Using Block Devices in Linux (Examples)
     Listing available block devices:
         # ls /sys/class/block
         dasda  dasda1  dasda2  dm-0  dm-1  dm-2  dm-3  scma  sda  sdb

     Reading from a block device:
         # dd if=/dev/sdf of=/dev/null bs=2MiB
         10737418240 bytes (11 GB, 10 GiB) copied

      Listing the topology of stacked block devices:
         # lsblk -s /dev/mapper/rhel_t3545003-root
         NAME                 MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
         rhel_t3545003-root   253:11   0    9G  0 lvm   /
         └─mpathf2            253:8    0    9G  0 part
           └─mpathf           253:5    0   10G  0 mpath
             ├─sdf              8:80   0   10G  0 disk
             └─sdam            66:96   0   10G  0 disk

© Copyright IBM Corp. 2021                                                               4
Some More Examples for Block Devices

               [Figure 1: Examples for block device setups with different hardware backends [14]:
               a local disk (/dev/nvme0n1, kernel driver nvme, NVMe hardware), a remote disk
               (/dev/sda via sd and iscsi_tcp, reached over iSCSI to a RAID target), and a
               virtualized disk (/dev/vda via virtio_blk in the guest kernel, backed by a RAM
               disk /dev/ram0 via brd in the host kernel).]

© Copyright IBM Corp. 2021                                                                        5
Anatomy of a Block Device
Structure of a Block Device: User Facing

    block_device Userspace interface; represents the special file in /dev, and
        links to the other kernel objects for the block device [4, 16]; partitions
        point to the same gendisk and request_queue as the whole device.

    inode Each block device gets a virtual inode assigned, so it can be used in
        the VFS.

    file and address_space Userspace processes open the special file in /dev; in
        the kernel it is represented as a file with an assigned address_space that
        points to the block device inode.

    [Figure: /dev/sda (bd_partno = 0, bd_start_sect = 0) and /dev/sda1 (bd_partno = 1,
    bd_start_sect = 32) as block_device objects with bd_disk, bd_queue, and bd_dev = devt,
    each backed by a VFS inode with i_mode = S_IFBLK, i_rdev = devt, i_bdev, and i_size;
    the partition points to the same gendisk and request_queue as the whole device.]
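
    As a hedged userspace illustration (not from the slides): the special file’s inode
    properties — the S_IFBLK mode and the device number devt — are visible through an
    ordinary stat(2) call; the device path is only an example.

        /* Show that /dev/sda is a block special file and print its devt (major:minor). */
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/sysmacros.h>   /* major(), minor() */

        int main(void)
        {
                struct stat st;
                if (stat("/dev/sda", &st) != 0) { perror("stat"); return 1; }

                printf("block special file: %s\n", S_ISBLK(st.st_mode) ? "yes" : "no");
                printf("devt: %u:%u\n", major(st.st_rdev), minor(st.st_rdev));
                return 0;
        }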

© Copyright IBM Corp. 2021                                                                       6
Structure of a Block Device: Hardware Facing

    gendisk and request_queue Central part of any block device; abstract
        hardware details for higher layers; the gendisk represents the whole
        addressable space; the request_queue defines how requests can be served.

    disk_part_tbl Points to the partitions — represented as block_devices —
        backed by the gendisk.

    scsi_device and scsi_disk Device-driver structures; the SCSI common/mid layer
        provides them for all SCSI-like hardware (incl. Serial ATA, SAS, iSCSI,
        FCP, …).

    [Figure: scsi_device and scsi_disk point to the request_queue (queuedata, tag_set,
    queue_ctx[C], queue_hw_ctx[N], limits → queue_limits) and to the gendisk (queue, part0,
    part_tbl, fops); the disk_part_tbl references N block_devices, and the
    block_device_operations provide *submit_bio(*bio) and *open(*bdev, mode).]
© Copyright IBM Corp. 2021                                                                            7
Queue Limits

         • Attached to the Request Queue structure of a Block device
         • Abstract hardware, firmware and device driver properties that influence how
           requests must be laid out
         • Very important for stacked block devices
         • For example:

      logical_block_size Smallest possible unit in bytes that is addressable in a request.
      physical_block_size Smallest unit in bytes handled without read-modify-write.
      max_hw_sectors Maximum number of sectors (512 bytes) that a device can handle per request.
      io_opt Preferred size in bytes for requests to the device.
      max_sectors Soft limit used by the VFS for buffered I/O (can be changed).
      max_segment_size Maximum size a segment in a request’s scatter/gather list can have.
      max_segments Maximum number of scatter/gather elements in a request.
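
      A hedged illustration (not part of the slides): most of these limits are exported as
      files under /sys/class/block/<dev>/queue/, so userspace can read them directly; the
      device name is an example, and the *_kb attributes report values in KiB.

        /* Read a few queue limits of a block device from sysfs (illustrative sketch). */
        #include <stdio.h>

        static void show(const char *dev, const char *attr)
        {
                char path[256], value[64];
                snprintf(path, sizeof(path), "/sys/class/block/%s/queue/%s", dev, attr);
                FILE *f = fopen(path, "r");
                if (f && fgets(value, sizeof(value), f))
                        printf("%-20s = %s", attr, value);   /* value already ends in '\n' */
                if (f)
                        fclose(f);
        }

        int main(void)
        {
                const char *dev = "sda";              /* example device name */
                show(dev, "logical_block_size");
                show(dev, "physical_block_size");
                show(dev, "max_hw_sectors_kb");       /* max_hw_sectors, in KiB */
                show(dev, "max_sectors_kb");          /* max_sectors soft limit, in KiB */
                show(dev, "optimal_io_size");         /* io_opt */
                show(dev, "max_segments");
                show(dev, "max_segment_size");
                return 0;
        }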

© Copyright IBM Corp. 2021                                                                  8
Multi-Queue Request Queues

         • In the past, request queues in Linux worked single threaded and without
           associating requests with particular processors
         • Couldn’t properly exploit many-core systems and new storage hardware with more
           than one command queue (e.g.: NVMe); lots of cache thrashing
         • Explicit Multi-Queue (MQ) support [3] was added with Linux 3.13: blk-mq
         • I/O requests are scheduled on a hardware queue assigned to the I/O generating
           processor; responses are meant to be received on the same processor
          • Structures necessary for I/O submission and response-handling are kept per
            processor; shared state is avoided as much as possible
         • With Linux 5.0 old single threaded queue implementation was removed

© Copyright IBM Corp. 2021                                                                  9
Block MQ Tag Set: Hardware Resource Allocation

      • Per hardware queue resource allocation and management
      • Requests (Reqs) are pre-allocated per HW queue (blk_mq_tags)
      • Tags are indices into the request array per queue, or per tag set (new in 5.10 [17])
      • Allocation of tags is handled via a special data structure: sbitmap [20]
      • Tag set also provides the mapping between CPU and hardware queue; the objective is
        either a 1 : 1 mapping, or cache proximity

      [Figure: a controller (Ctrl) with N = 4 hardware queues; each queue has a blk_mq_tags
      object with Tags[M] and pre-allocated Reqs[M] (M = 8); the blk_mq_tag_set holds
      tags[N] and a map that assigns the C = 8 CPU IDs to the hardware queues
      (here 0 0 1 1 2 2 3 3).]
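
      As a hedged kernel-side sketch (not from the talk; the my_* names and struct my_cmd are
      hypothetical), a blk-mq driver describes its hardware queues and per-queue request depth
      through a blk_mq_tag_set before creating the request queue, roughly like this on 5.11:

        /* Hypothetical blk-mq driver fragment (sketch only). */
        #include <linux/blk-mq.h>

        struct my_cmd {                       /* per-request driver data (cmd_size) */
                u32 hw_tag;
        };

        static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                        const struct blk_mq_queue_data *bd)
        {
                blk_mq_start_request(bd->rq);
                /* hand bd->rq to the hardware queue behind hctx */
                return BLK_STS_OK;
        }

        static const struct blk_mq_ops my_mq_ops = {
                .queue_rq = my_queue_rq,
        };

        static struct blk_mq_tag_set my_tag_set = {
                .ops          = &my_mq_ops,
                .nr_hw_queues = 4,                      /* N hardware queues */
                .queue_depth  = 8,                      /* M tags/requests per queue */
                .cmd_size     = sizeof(struct my_cmd),
                .numa_node    = NUMA_NO_NODE,
                .flags        = BLK_MQ_F_SHOULD_MERGE,
        };

        static int my_create_queue(struct request_queue **q)
        {
                int ret = blk_mq_alloc_tag_set(&my_tag_set); /* allocates tags + requests */
                if (ret)
                        return ret;

                *q = blk_mq_init_queue(&my_tag_set);         /* request_queue bound to the set */
                return IS_ERR(*q) ? PTR_ERR(*q) : 0;
        }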
© Copyright IBM Corp. 2021                                                                         10
Block MQ Soft- and Hardware-Context

      • For a request queue (see the pointers on page 7):
      • Hardware context (blk_mq_hw_ctx = hctx) exists per hardware queue; hosts a
        work item (kblockd work queue) scheduled on a matching CPU; pulls requests
        out of the associated ctx and submits them to the hardware
      • Software context (blk_mq_ctx = ctx) exists per CPU; queues requests in a
        simple FIFO in absence of an Elevator; associated with its assigned HCTX as
        per the tag set mapping

      [Figure: each of the four hardware queues has a blk_mq_tags and a blk_mq_hw_ctx
      (tags, *work()); CPUs 0–7 (cores grouped into a NUMA domain) each own a blk_mq_ctx
      with a list of queued requests (rqL), and pairs of ctx (0/1, 2/3, 4/5, 6/7) are
      mapped to one hctx, whose work item is scheduled on a matching CPU.]
© Copyright IBM Corp. 2021                                                                               11
Block MQ Elevator / Scheduler

         • Elevator = I/O Scheduler
          • Can be set optionally per request queue
            (/sys/class/block/<dev>/queue/scheduler)

      mq-deadline Forward port of the old deadline scheduler; doesn’t handle MQ context
                affinities; default for devices with 1 hardware queue; limits wait-time for
                requests to prevent starvation (500 ms for reads, 5 s for writes) [10, 18]
      kyber     Only MQ-native scheduler [10]; aims to meet certain latency targets (2 ms
                for reads, 10 ms for writes) by limiting the queue-depth dynamically
      bfq       Only non-trivial I/O scheduler [6, 8, 9, 11] (replaces the old CFQ scheduler);
                doesn’t handle MQ context affinities; aims at providing fairness between
                I/O issuing processes
      none      Default for devices with more than 1 hardware queue; simple FIFO via the MQ
                software context
© Copyright IBM Corp. 2021                                                                        12
What About Stacked Block Devices?

         • Device-Mapper (dm) and Raid (md) use virtual/stacked block device on top of
           existing hardware-backed block devices ([15, 21])
               • Examples: RAID, LVM2, Multipathing
         • Same structure as shown on page 6 and 7, without hardware specific structures,
           stacked on other block_devices
                • BIO based: has neither an Elevator, its own tag set, nor any soft- or
                  hardware-contexts; modifies I/O (BIOs) after submission and immediately passes it on
                • Request based: has the full set of infrastructure (only dm-multipath at the moment);
                  can queue requests; bypasses lower-level queueing
         • queue_limits of lower-level devices are aggregated into the “greatest common
           divisor”, so that requests can be scheduled on any of them
         • holders/slaves directories in sysfs show relationship

© Copyright IBM Corp. 2021                                                                            13
I/O Flow in the Block Layer
Submission of I/O Requests from Userspace (Simplified)

       • I/O submission mainly categorized in 2 disciplines:

    Buffered I/O Requests served via the Page Cache; Writes are cached and eventually —
        usually asynchronously — written to disk via Writeback; Reads are served directly
        if the cached data is fresh, otherwise read from disk synchronously

    Direct I/O Requests served directly by the backing disk; alignment and possibly size
        requirements; DMA directly into/from User memory is possible

       • For synchronous I/O system calls, tasks wait in state TASK_UNINTERRUPTIBLE

       [Figure: Direct I/O and Buffered I/O read/write paths crossing from User to Kernel;
       Buffered I/O passes through the file systems (VFS), Page Cache, Writeback and
       Read(-ahead), while Direct I/O goes straight to the block_device; both reach the
       Block Layer via submit_bio(bio) on the gendisk/fops/queue and its request_queue.]
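
       A hedged userspace illustration (not from the slides): the same read can be issued
       buffered or, with O_DIRECT and a suitably aligned buffer, directly against the backing
       disk; the device path and the 4096-byte alignment/size are example values.

        /* Buffered vs. Direct I/O read from a block device (illustrative sketch). */
        #define _GNU_SOURCE                 /* for O_DIRECT */
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main(void)
        {
                const char *dev = "/dev/sdf";       /* example device */
                char buf[4096];
                void *dbuf;

                /* Buffered I/O: served via the page cache, no alignment requirements. */
                int fd = open(dev, O_RDONLY);
                if (fd < 0 || pread(fd, buf, sizeof(buf), 0) < 0)
                        perror("buffered read");
                if (fd >= 0)
                        close(fd);

                /* Direct I/O: bypasses the page cache; buffer, length and offset
                 * typically must be aligned to the logical block size. */
                fd = open(dev, O_RDONLY | O_DIRECT);
                if (fd < 0 || posix_memalign(&dbuf, 4096, 4096) != 0)
                        return 1;
                if (pread(fd, dbuf, 4096, 0) < 0)
                        perror("direct read");
                close(fd);
                free(dbuf);
                return 0;
        }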
© Copyright IBM Corp. 2021                                                                     14
A New Asynchronous I/O Interface: io_uring

       • With Linux 5.1 a new I/O submission API has been added: io_uring [12, 2]
       • New set of System Calls to create a set of ring structures: Submission Queue (SQ),
         Completion Queue (CQ), and Submission Queue Entries (SQE) array
       • Structures are shared between Kernel and User via mmap(2)
       • Submission and Completion work asynchronously
       • Utilizes the standard syscall backends for calls like readv(2), writev(2), or
         fsync(2); with the same categories as on Page 14

       [Figure: the SQE array, SQ ring, and CQ ring are mmap’d into userspace; the
       application (1) fills an SQE, (2) advances the SQ tail, the kernel consumes the
       entries, feeds them into the existing syscall paths, and posts a CQE per submitted
       id; the application then (3) advances the CQ head.]
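
       As a hedged illustration (not part of the talk), a minimal read through io_uring,
       using the liburing helper library instead of the raw io_uring_setup(2)/io_uring_enter(2)
       calls; the target path and sizes are examples.

        /* Minimal io_uring readv using liburing (illustrative sketch).
         * Build: cc -o uring_read uring_read.c -luring */
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/uio.h>
        #include <liburing.h>

        int main(void)
        {
                struct io_uring ring;
                struct io_uring_sqe *sqe;
                struct io_uring_cqe *cqe;
                char buf[4096];
                struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

                int fd = open("/dev/sda", O_RDONLY);        /* example target */
                if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
                        return 1;

                sqe = io_uring_get_sqe(&ring);              /* 1. fill an SQE */
                io_uring_prep_readv(sqe, fd, &iov, 1, 0);

                io_uring_submit(&ring);                     /* 2. advance SQ tail, tell kernel */

                io_uring_wait_cqe(&ring, &cqe);             /* wait for a completion */
                printf("readv returned %d\n", cqe->res);
                io_uring_cqe_seen(&ring, cqe);              /* 3. advance CQ head */

                io_uring_queue_exit(&ring);
                return 0;
        }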
© Copyright IBM Corp. 2021                                                                                  15
The I/O Unit of the Block Layer: BIO
          • BIOs represent in-flight I/O
          • Application data is kept separate; an array of bio_vecs holds pointers to the
            pages with the application data (scatter/gather list)
          • Position and progress are managed in the bvec_iter:
                bi_sector start sector
                bi_size size in bytes
                bi_idx current bvec index
                bi_bvec_done finished work in bytes
          • BIOs might be split when queue limits (see page 8) are exceeded; or cloned when
            the same data goes to different places
          • Data of a single BIO is limited to 4 GiB

       [Figure: a chain of 0 … N bios (bi_next, bi_opf : req_opf, bi_disk, bi_partno, bi_iter,
       *bi_end_io(), bi_io_vec[]); each bio has one bvec_iter and 1 … 256 bio_vec elements
       (bv_page, bv_len, bv_offset) that point to pinned data pages; the data/LBA range is
       contiguous even though the pages need not be physically contiguous.]
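
       A hedged kernel-side sketch (not from the talk; names and sizes are hypothetical) of how
       such a BIO is typically built and submitted on 5.11: allocate it, point it at a disk and
       start sector, attach a data page, set the completion callback, and call submit_bio().

        /* Hypothetical kernel fragment: read one page from a block device via a BIO. */
        #include <linux/bio.h>
        #include <linux/blkdev.h>

        static void my_end_io(struct bio *bio)
        {
                /* called asynchronously once the request completes */
                pr_info("bio done, status=%d\n", bio->bi_status);
                __free_page(bio_page(bio));
                bio_put(bio);
        }

        static int my_read_sector(struct block_device *bdev, sector_t sector)
        {
                struct page *page = alloc_page(GFP_KERNEL);
                struct bio *bio = bio_alloc(GFP_KERNEL, 1);    /* room for 1 bio_vec */

                if (!page || !bio) {
                        if (page)
                                __free_page(page);
                        if (bio)
                                bio_put(bio);
                        return -ENOMEM;
                }

                bio_set_dev(bio, bdev);                        /* bi_disk / bi_partno */
                bio->bi_opf = REQ_OP_READ;                     /* operation in bi_opf */
                bio->bi_iter.bi_sector = sector;               /* start sector */
                bio->bi_end_io = my_end_io;                    /* completion callback */
                bio_add_page(bio, page, PAGE_SIZE, 0);         /* append to bi_io_vec[] */

                submit_bio(bio);                               /* asynchronous */
                return 0;
        }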
© Copyright IBM Corp. 2021                                                                            16
Plugging and Merging

     Plugging:

          • When the VFS layer generates I/O requests and submits them for processing, it
            plugs the request queue of the target block device ([1])
          • Requests generated while the plug is active are not immediately submitted, but
            saved until unplugging
          • Unplugging happens either explicitly, or during scheduled context switches
            (see the sketch below)
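
      A hedged kernel-side sketch (not from the talk; my_* names are hypothetical): in-kernel
      submitters batch their BIOs between blk_start_plug() and blk_finish_plug(); everything
      submitted in between may be held back and merged before it reaches the device.

        /* Hypothetical kernel fragment: plug the queue around a burst of submissions. */
        #include <linux/blkdev.h>

        static void my_submit_batch(struct bio **bios, int nr)
        {
                struct blk_plug plug;
                int i;

                blk_start_plug(&plug);          /* requests are now held back / merged */
                for (i = 0; i < nr; i++)
                        submit_bio(bios[i]);
                blk_finish_plug(&plug);         /* explicit unplug: flush to the queues */
        }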

     Merging:

          • The block layer tries to merge BIOs and requests with already queued or plugged requests
            Back-Merging: the new data fits at the end of an existing request
            Front-Merging: the new data fits at the beginning of an existing request
          • Merging is done by concatenating BIOs via bi_next
          • Merges must not exceed queue limits

© Copyright IBM Corp. 2021                                                                    17
Entry Function into the Block Layer
       • When I/O is necessary, BIOs are generated and submitted into the block layer via
         submit_bio() ([4]).
       • Doesn’t guarantee synchronous processing; callback via bio->bi_end_io(bio).
       • One submitted BIO can turn into several more (block queue limits, stacked device, …);
         each is also submitted via submit_bio().
       • Especially for stacked devices this could exhaust kernel stack space → turn recursion
         into iteration (approx.: depth-first search with stack)

       Pseudo-code representation of the functionality:

    submit_bio(bio) {
      if (exists(current.bios)) {          // current.bios is thread local
        current.bios += bio                // already iterating: just queue and return
        return
      }
      On Stack: bios, old_bios
      current.bios ← bios
      do {
        On Stack: q = queue(bio)
        old_bios ← current.bios
        current.bios = List()
        disk(bio)->submit_bio(bio)         // may add new BIOs to current.bios
        On Stack: same, lower
        lower = List()
        same = List()
        while (bio ← pop(current.bios)) {  // sort newly generated BIOs
          if (q == queue(bio))
            same += bio
          else
            lower += bio
        }
        current.bios += lower              // lower-level devices first
        current.bios += same
        current.bios += old_bios
      } while (bio ← pop(current.bios))
      current.bios ← Null
      return
    }
© Copyright IBM Corp. 2021                                                                          18
Request Submission and Dispatch

       • Once a BIO reaches a request queue (see Pages 12 and 11) the submitting task tries
         to get a Tag from the associated HCTX
             • Creates back pressure if not possible
       • The BIO is added to the associated Request; via linking, this could be multiple
         BIOs per request
       • The Request is inserted into the software context FIFO queue, or the Elevator (if
         enabled); the HCTX work-item is queued into kblockd
       • The associated CPU executes the HCTX work-item
       • The work-item pulls queued requests out of the associated software contexts or
         Elevators and hands them to the HW device driver

       [Figure: on CPU 0, requests queued in the software context (CTX) are pulled into the
       HCTX by the kblockd work-item and pushed through the HW device driver into the
       controller’s hardware queue 0.]
© Copyright IBM Corp. 2021                                                                             19
Request Completion

       • Once a request is processed, device drivers usually get notified via an interrupt
       • To keep data CPU-local, interrupts should be bound to the CPU of the associated MQ
         software context (platform-/controller-/driver-dependent)
             • blk-mq resorts to IPI or SoftIRQ otherwise
       • The device driver is responsible for determining the corresponding block layer
         request for the signaled completion
       • Progress is measured by how many bytes were successfully completed; might cause a
         re-schedule
       • The process of completing a request is a bunch of callbacks to notify the waiting
         user-/kernel-thread

       [Figure: the controller raises an IRQ on CPU 0; the device driver handler finds the
       request and calls blk_mq_complete_req(rq), which either queues the completion locally
       or redirects it with an IPI/SoftIRQ; q→mq_ops→complete(rq) then calls
       bio→bi_end_io(bio) for each associated BIO and finally notifies the waiter.]
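
       A hedged kernel-side sketch (not from the talk; the my_* names are hypothetical) of the
       completion half on 5.11: the interrupt handler maps the hardware completion back to the
       request and calls blk_mq_complete_request(), and the driver’s ->complete() callback
       finishes the request with blk_mq_end_request(), which ends all associated BIOs.

        /* Hypothetical driver fragment: completion path of a blk-mq request. */
        #include <linux/blk-mq.h>
        #include <linux/interrupt.h>

        /* Interrupt handler: the hardware reported a finished command; look the
         * block layer request up by its tag and hand it to the block layer. */
        static irqreturn_t my_irq_handler(int irq, void *data)
        {
                struct blk_mq_hw_ctx *hctx = data;
                unsigned int tag = 0 /* read the completed tag from the hardware */;
                struct request *rq = blk_mq_tag_to_rq(hctx->tags, tag);

                if (rq)
                        blk_mq_complete_request(rq);   /* may redirect via IPI/SoftIRQ */
                return IRQ_HANDLED;
        }

        /* ->complete() callback of blk_mq_ops: runs local to the submitting CPU. */
        static void my_complete_rq(struct request *rq)
        {
                blk_mq_end_request(rq, BLK_STS_OK);    /* ends all BIOs -> bi_end_io() */
        }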
© Copyright IBM Corp. 2021                                                                               20
Block Layer Polling

         • Similar to high speed networking, with high speed storage targets it can be
           beneficial to use Polling instead of Waiting for Interrupts to handle request
           completion
               • Decreases response times and reduces overhead produced by interrupts on fast devices
         • Available in Linux since 4.10 [7]; only supported by NVMe at this point (support for
           SCSI merged for 5.13 [13], support for dm in work [26])
                • Enable per request queue: echo 1 > /sys/class/block/<dev>/queue/io_poll
               • Enable for NVMe with module parameter: nvme.poll_queues=N
         • Device driver creates separate HW queues that have interrupts disabled
         • Whether polling is used is controlled by applications (only with Direct I/O currently):
                • Pass RWF_HIPRI to preadv2(2)/pwritev2(2) (see the sketch below)
               • Pass IORING_SETUP_IOPOLL to io_uring_setup(2) for io_uring
         • When used, application threads that issued I/O, or io_uring worker threads,
           actively poll in HW queues whether the issued request has been completed
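
      A hedged userspace illustration (not from the slides, assuming a glibc that exposes
      preadv2(2) and RWF_HIPRI): a polled Direct I/O read; the device path, alignment, and
      size are example values.

        /* Polled Direct I/O read using RWF_HIPRI (illustrative sketch). */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/uio.h>
        #include <unistd.h>

        int main(void)
        {
                void *buf;
                struct iovec iov;

                int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* example device */
                if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0)
                        return 1;

                iov.iov_base = buf;
                iov.iov_len  = 4096;

                /* RWF_HIPRI: the calling thread busy-polls the HW queue for completion
                 * instead of sleeping until an interrupt arrives. */
                ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
                printf("preadv2 returned %zd\n", ret);

                close(fd);
                free(buf);
                return 0;
        }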

© Copyright IBM Corp. 2021                                                                              21
Closing
Summary

         • Block devices are a hardware abstraction to allow uniform access to a range of
           diverse hardware
          • The entry into the block layer is provided by block_device device-nodes, which
            are backed by a gendisk — representing the block-addressable storage space —
            and a request_queue — providing a generic way to queue requests against the
            hardware
         • Special care is taken to allow processor local processing of I/O requests and
           responses
         • Userspace requests mainly categorized in Buffered and Direct I/O
         • Central structure for transporting information about in-flight I/O is the BIO; it allows
           for cloning, splitting and merging without copying payload
         • Processing of I/O is fundamentally asynchronous in the kernel, requests happen in
           a different context than responses, and are only synchronized via wait/notify
           mechanisms
© Copyright IBM Corp. 2021                                                                            22
Made with LaTeX and Inkscape

         Questions?
IBM Deutschland Research & Development

                                 • Big parts of the support for Linux on IBM Z — Kernel
        Headquarters Böblingen
                                   and Userland — are done at the IBM Laboratory in
                                   Böblingen
                                 • We follow a strict upstream policy and do not — with
                                   scarce exceptions — ship code that is not accepted
                                   in the respective upstream project
                                 • Parts of the hard- and firmware for the IBM
                                   Mainframes are also done in Böblingen
                                 •   https://www.ibm.com/de-de/marketing/entwicklung/
                                     about.html

© Copyright IBM Corp. 2021                                                            23
Backup Slides
But What About Zoned Storage?

         • Introduced with the advent of SMR Disks, but now also in NVMe specification
         • Tracks on the disk overlap; overwriting a single track means overwriting a bunch of
           other tracks as well [24]

      →     Data is organized in “bands” of tracks: zones. Each zone is only written
            sequentially; out-of-order writes are only possible after a reset of the whole
            zone. Random read access remains possible.

          • Supported by the Linux block layer, but breaks with the previous definition
         • Direct use only via aid by special ioctls [23], via special device-mapper target [22],
           or via special file system [25]
         • Gonna ignore this for the rest of the talk

© Copyright IBM Corp. 2021
Structure of a Block Device: Whole Picture
       [Figure: whole-picture view of the structures from pages 6 and 7: scsi_device and
       scsi_disk (SCSI layer) point to the request_queue (elevator → elevator_queue,
       backing_dev_info, limits → queue_limits) and to the gendisk (part0, queue, fops →
       block_device_operations with *submit_bio(*bio) and *open(*bdev, mode), part_tbl →
       disk_part_tbl with part[len]); each partition is a block_device (bd_disk, bd_queue,
       bd_partno = N, bd_dev = devt, bd_start_sect = M) embedded in a bdev_inode together
       with its VFS inode (i_mode = S_IFBLK, i_rdev = devt, i_bdev).]
© Copyright IBM Corp. 2021
Structure of a MQ Request Queue: Whole Picture

       [Figure: whole-picture view of a MQ request_queue: mq_ops → blk_mq_ops (*queue_rq(),
       *complete(), *timeout()); tag_set → blk_mq_tag_set (queue_depth, map[MAX_TYPES] →
       blk_mq_queue_map with mq_map[nr_cpu_ids], tags[nr_hw_queues] → blk_mq_tags with
       nr_tags = queue_depth, bitmap_tags : sbitmap, rqs[nr_tags], static_rqs[nr_tags], and
       a pages list holding the pre-allocated requests); queue_hw_ctx[] → one blk_mq_hw_ctx
       per hardware queue (dispatch list, cpumask, nr_ctxs, ctx[nr_ctx], ctx_map : sbitmap,
       tags, queue); queue_ctx[] → per-CPU blk_mq_ctx (rq_lists[MAX_TYPES], cpu,
       hctx[MAX_TYPES], queue); plus elevator → elevator_queue (type, elevator_data, e.g.
       mq_deadline), limits → queue_limits, and backing_dev_info (ra_pages, min_ratio,
       max_ratio, wb : bdi_writeback).]
© Copyright IBM Corp. 2021
Glossary i

                  BIO BIO: represents metadata and data for I/O in the Linux block layer; no hardware specific information.

                  CFQ Completely Fair Queuing: deprecated I/O scheduler for single queue block layer.

                 DASD Direct-Access Storage Device: disk storage type used by IBM Z via FICON.
                    dm Device-Mapper: low-level volume manager; allows specifying mappings for ranges of logical sectors; higher-level
                       volume managers such as LVM2 use this driver.
                 DMA Direct Memory Access: hardware components can access main memory without CPU involvement.

                 ECKD Extended Count Key Data: a recording format of data stored on DASDs.
              Elevator Synonym for “I/O Scheduler” in the Linux Kernel.

                  FCP Fibre Channel Protocol: transport for the SCSI command set over Fibre Channel networks.
                 FIFO First in, First out.

                 HCTX Hardware Context of a request queue.

                   ioctl input/output control: system call that allows querying device-specific information, or executing
                         device-specific operations.

© Copyright IBM Corp. 2021
Glossary ii

                    IPI Inter-Processor Interrupt: interrupt another processor to communicate some required action.
                 iSCSI Internet SCSI: transport for the SCSI command set over TCP/IP.

                  LVM2 Logical Volume Manager: flexible methods of allocating (non-linear) space on mass-storage devices.

                   md Multiple devices support: support multiple physical block devices through a single logical device; required
                      for RAID and logical volume management.
                   MQ Short for: Multi-Queue.
          Multipathing Accessing one storage target via multiple independent paths with the purposes of redundancy and
                       load-balancing.

                NVMe Non-Volatile Memory Express: interface for accessing persistent storage device over PCI Express.

                 RAID Redundant Array of Inexpensive Disks: combines multiple physical disks into a logical one with the
                      purposes of data redundancy and load-balancing.
                  RAM Random-Access Memory: a form of information storage, random-accessible, and normally volatile.

                  SAS Serial Attached SCSI: transport for the SCSI command set over a serial point-to-point bus.

© Copyright IBM Corp. 2021
Glossary iii

                 SCSI Small Computer System Interface: set of standards for commands, protocols, and physical interfaces to
                      connect computers with peripheral devices.
            Serial ATA Serial AT Attachment: serial bus that connects host bus adapters with mass storage devices.
                  SMR Shingled Magnetic Recording: a magnetic storage data recording technology used to provide increased
                      areal density.

                  VFS Virtual File System: abstraction layer in Linux that provides a common interface to file systems and devices
                      for software.

© Copyright IBM Corp. 2021
References i

     [1]   J. Axboe.
           Explicit block device plugging, Apr. 2011.
           https://lwn.net/Articles/438256/.
     [2]   J. Axboe.
           Efficient io with io_uring.
           https://kernel.dk/io_uring.pdf, Oct. 2019.
     [3]   M. Bjørling, J. Axboe, D. Nellans, and P. Bonnet.
           Linux block io: Introducing multi-queue ssd access on multi-core systems.
           In Proceedings of the 6th International Systems and Storage Conference, SYSTOR ’13, pages 22:1–22:10, New York, NY, USA, 2013. ACM.
     [4]   N. Brown.
           A block layer introduction part 1: the bio layer, Oct. 2017.
           https://lwn.net/Articles/736534/.
     [5]   N. Brown.
           Block layer introduction part 2: the request layer, Nov. 2017.
           https://lwn.net/Articles/738449/.
     [6]   J. Corbet.
           The bfq i/o scheduler, June 2014.
           https://lwn.net/Articles/601799/.

© Copyright IBM Corp. 2021
References ii

     [7]   J. Corbet.
           Block-layer i/o polling, Nov. 2015.
           https://lwn.net/Articles/663879/.
     [8]   J. Corbet.
           The return of the bfq i/o scheduler, Feb. 2016.
           https://lwn.net/Articles/674308/.
     [9]   J. Corbet.
           A way forward for bfq, Dec. 2016.
           https://lwn.net/Articles/709202/.
     [10] J. Corbet.
          Two new block i/o schedulers for 4.12, Apr. 2017.
          https://lwn.net/Articles/720675/.
     [11] J. Corbet.
          I/o scheduling for single-queue devices.
          Oct. 2018.
          https://lwn.net/Articles/767987/.
     [12] J. Corbet.
          Ringing in a new asynchronous i/o api, Jan. 2019.
          https://lwn.net/Articles/776703/.

© Copyright IBM Corp. 2021
References iii

     [13] K. Desai.
          io_uring iopoll in scsi layer, Feb. 2021.
          https://lore.kernel.org/linux-scsi/20210215074048.19424-1-kashyap.desai@broadcom.com/T/#.
     [14] W. Fischer and G. Schönberger.
          Linux storage stack diagramm, Mar. 2017.
          https://www.thomas-krenn.com/de/wiki/Linux_Storage_Stack_Diagramm.
     [15] E. Goggin, A. Kergon, C. Varoqui, and D. Olien.
          Linux multipathing.
          In Proceedings of the Linux Symposium, volume 1, pages 155–176, July 2005.
          https://www.kernel.org/doc/ols/2005/ols2005v1-pages-155-176.pdf.
     [16] kernel development community.
          Block documentation.
          https://www.kernel.org/doc/html/latest/block/index.html.
     [17] M. Lei, H. Reinecke, J. Garry, and K. Desai.
          blk-mq/scsi: Provide hostwide shared tags for scsi hbas, Aug. 2020.
          https://lore.kernel.org/linux-scsi/1597850436-116171-1-git-send-email-john.garry@huawei.com/T/#.
     [18] R. Love.
          Linux Kernel Development.
          Addison-Wesley Professional, 3 edition, June 2010.

© Copyright IBM Corp. 2021
References iv

     [19] O. Purdila, R. Chitu, and R. Chitu.
          Linux kernel labs: Block device drivers, May 2019.
          https://linux-kernel-labs.github.io/refs/heads/master/labs/block_device_drivers.html.
     [20] O. Sandoval.
          blk-mq: abstract tag allocation out into sbitmap library, Sept. 2016.
          https://lore.kernel.org/linux-block/cover.1474100040.git.osandov@fb.com/T/#.
     [21] K. Ueda, J. Nomura, and M. Christie.
          Request-based device-mapper multipath and dynamic load balancing.
          In Proceedings of the Linux Symposium, volume 2, pages 235–244, June 2007.
          https://www.kernel.org/doc/ols/2007/ols2007v2-pages-235-244.pdf.
     [22] Western Digital Corporation.
          dm-zoned.
          https://www.zonedstorage.io/linux/dm/#dm-zoned.
     [23] Western Digital Corporation.
          Zoned block device user interface.
          https://www.zonedstorage.io/linux/zbd-api/.
     [24] Western Digital Corporation.
          Zoned storage overview.
          https://www.zonedstorage.io/introduction/zoned-storage/.

© Copyright IBM Corp. 2021
References v

     [25] Western Digital Corporation.
          zonefs.
          https://www.zonedstorage.io/linux/fs/#zonefs.
     [26] J. Xu.
          dm: support polling, Mar. 2021.
          https://lore.kernel.org/linux-block/20210303115740.127001-1-jefflexu@linux.alibaba.com/T/#.

© Copyright IBM Corp. 2021