An Introduction to the Linux Kernel Block I/O Stack Based on Linux 5.11 Benjamin Block ‹bblock@de.ibm.com› March 14th, 2021 IBM Deutschland Research & Development GmbH
Outline What is a Block Device? Anatomy of a Block Device I/O Flow in the Block Layer
What is a Block Device?
A Try at a Definition In Linux, a Block Device is a hardware abstraction. It represents hardware whose data is stored and accessed in fixed-size blocks of n bytes (e.g. 512, 2048, or 4096 bytes) [18]. In contrast to Character Devices, blocks on a block device can be accessed in a random-access pattern, whereas the former only allow a sequential access pattern [19]. Typically, and for this talk, block devices represent persistent mass-storage hardware. But not all block devices in Linux are backed by persistent storage (e.g. RAM disks, whose data is stored in memory), nor must all of them organize their data in fixed blocks (e.g. ECKD-formatted DASDs, whose data is stored in variable-length records). Even so, they can be represented as block devices in Linux because of the abstraction provided by the kernel. © Copyright IBM Corp. 2021 2
What is a ‘Block’ Anyway? A Block is a fixed number of bytes used in the communication with a block device and the associated hardware. Different layers in the software stack differ in the exact meaning and size:
• Userspace software: application-specific meaning; usually how much data is read from/written to files via a single system call.
• VFS: unit of bytes in which I/O is done by file systems in Linux; between 512 bytes and PAGE_SIZE (e.g. 4 KiB for x86 and s390x, may be as big as 1 MiB).
• Hardware: also referred to as Sector.
  • Logical: smallest unit in bytes that is addressable on the device.
  • Physical: smallest unit in bytes that the device can operate on without resorting to read-modify-write.
  • The physical block size may be bigger than the logical block size.
© Copyright IBM Corp. 2021 3
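To make the logical/physical distinction concrete, a userspace program can query both values (plus the device size) through well-known ioctls; a minimal sketch, with /dev/sda used purely as an example path:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>      /* BLKSSZGET, BLKPBSZGET, BLKGETSIZE64 */
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/dev/sda", O_RDONLY);   /* example device */
          int lbs = 0;
          unsigned int pbs = 0;
          unsigned long long bytes = 0;

          if (fd < 0)
                  return 1;
          ioctl(fd, BLKSSZGET, &lbs);      /* logical block (sector) size */
          ioctl(fd, BLKPBSZGET, &pbs);     /* physical block size */
          ioctl(fd, BLKGETSIZE64, &bytes); /* device size in bytes */
          printf("logical=%d physical=%u size=%llu\n", lbs, pbs, bytes);
          close(fd);
          return 0;
  }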
Using Block Devices in Linux (Examples)
Listing available block devices:
  # ls /sys/class/block
  dasda  dasda1  dasda2  dm-0  dm-1  dm-2  dm-3  scma  sda  sdb
Reading from a block device:
  # dd if=/dev/sdf of=/dev/null bs=2MiB
  10737418240 bytes (11 GB, 10 GiB) copied
Listing the topology of a stacked block device:
  # lsblk -s /dev/mapper/rhel_t3545003-root
  NAME                 MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
  rhel_t3545003-root   253:11   0   9G  0 lvm   /
  └─mpathf2            253:8    0   9G  0 part
    └─mpathf           253:5    0  10G  0 mpath
      ├─sdf              8:80   0  10G  0 disk
      └─sdam            66:96   0  10G  0 disk
© Copyright IBM Corp. 2021 4
Some More Examples for Block Devices
• Local disk: /dev/nvme0n1, NVMe hardware driven by the nvme driver.
• Remote disk: /dev/sda, an iSCSI target accessed through sd and iscsi_tcp.
• Virtualized disk: /dev/vda in a guest kernel, served by virtio_blk and backed by storage in the host kernel (e.g. a RAID).
• RAM disk: /dev/ram0, provided by the brd driver, with data kept in RAM.
Figure 1: Examples for block device setups with different hardware backends [14].
© Copyright IBM Corp. 2021 5
Anatomy of a Block Device
Structure of a Block Device: User Facing
• block_device: the userspace interface; represents a special file in /dev (e.g. /dev/sda, /dev/sda1) and links to the other kernel objects for the block device [4, 16]. Partition block_devices (bd_partno, bd_start_sect) point to the same disk (bd_disk) and queue (bd_queue) as the block_device for the whole disk.
• inode: each block device gets a virtual inode assigned (i_mode = S_IFBLK, i_rdev = devt), so it can be used in the VFS.
• file and address_space: userspace processes open the special file in /dev; in the kernel it is represented as a file with an assigned address_space that points to the block device inode.
© Copyright IBM Corp. 2021 6
Structure of a Block Device: Hardware Facing
• gendisk and request_queue: the central part of any block device; they abstract hardware details for the higher layers. The gendisk represents the whole disk-addressable space; the request_queue, with its queue_limits, determines how requests can be served.
• disk_part_tbl: points to the partitions (represented as block_devices, with part0 for the whole disk) backed by the gendisk.
• block_device_operations (fops): disk-specific entry points such as *submit_bio(*bio) and *open(*bdev, mode).
• scsi_device and scsi_disk: device-driver structures; the SCSI common/mid layer serves all SCSI-like hardware (incl. Serial ATA, SAS, iSCSI, FCP, …).
© Copyright IBM Corp. 2021 7
Queue Limits
• Attached to the request_queue structure of a block device.
• Abstract hardware, firmware, and device-driver properties that influence how requests must be laid out.
• Very important for stacked block devices.
• For example:
  logical_block_size    Smallest unit in bytes that is addressable in a request.
  physical_block_size   Smallest unit in bytes handled without read-modify-write.
  max_hw_sectors        Number of sectors (512 bytes) that the device can handle per request.
  io_opt                Preferred size in bytes for requests to the device.
  max_sectors           Soft limit used by the VFS for buffered I/O (can be changed).
  max_segment_size      Maximum size in bytes of a segment in a request’s scatter/gather list.
  max_segments          Maximum number of scatter/gather elements in a request.
© Copyright IBM Corp. 2021 8
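As an illustration of where these values originate, a block driver announces most of them when it configures its request queue. A rough sketch with made-up example values, assuming a request_queue pointer q obtained during driver setup (the blk_queue_* setters shown exist in the 5.11 block layer):

  /* Sketch: how a driver might announce its limits for request_queue *q. */
  blk_queue_logical_block_size(q, 512);       /* smallest addressable unit        */
  blk_queue_physical_block_size(q, 4096);     /* no read-modify-write below this  */
  blk_queue_io_opt(q, 1024 * 1024);           /* preferred request size in bytes  */
  blk_queue_max_hw_sectors(q, 8192);          /* 8192 * 512 B = 4 MiB per request */
  blk_queue_max_segments(q, 128);             /* scatter/gather entries per request */
  blk_queue_max_segment_size(q, 65536);       /* bytes per scatter/gather entry   */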
Multi-Queue Request Queues
• In the past, request queues in Linux worked single-threaded and without associating requests with particular processors.
• This couldn’t properly exploit many-core systems and new storage hardware with more than one command queue (e.g. NVMe), and caused lots of cache thrashing.
• Explicit Multi-Queue (MQ) support [3] was added with Linux 3.13: blk-mq.
• I/O requests are scheduled on a hardware queue assigned to the I/O-generating processor; responses are meant to be received on the same processor.
• Structures necessary for I/O submission and response handling are kept per processor; shared state is avoided as much as possible.
• With Linux 5.0 the old single-threaded queue implementation was removed.
© Copyright IBM Corp. 2021 9
Block MQ Tag Set: Hardware Resource Allocation
• Per-hardware-queue resource allocation and management.
• Requests (Reqs) are pre-allocated per HW queue (blk_mq_tags).
• Tags are indexes into the request array per queue, or per tag set (new in 5.10 [17]).
• Allocation of tags is handled via a special data structure: sbitmap [20].
• The tag set also provides the mapping between CPUs and hardware queues; the objective is either a 1:1 mapping or cache proximity (e.g. C = 8 CPUs mapped onto N = 4 hardware queues with M = 8 tags/requests each).
© Copyright IBM Corp. 2021 10
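A sketch of how a driver might set up such a tag set before creating its request queue; my_mq_ops and struct my_cmd are hypothetical driver pieces, and error handling is omitted:

  static struct blk_mq_tag_set tag_set;          /* hypothetical driver-global set */
  struct request_queue *q;

  tag_set.ops          = &my_mq_ops;             /* driver callbacks (.queue_rq, ...) */
  tag_set.nr_hw_queues = 4;                      /* N hardware queues                 */
  tag_set.queue_depth  = 8;                      /* M tags/requests per queue         */
  tag_set.numa_node    = NUMA_NO_NODE;
  tag_set.cmd_size     = sizeof(struct my_cmd);  /* per-request driver data           */
  tag_set.flags        = BLK_MQ_F_SHOULD_MERGE;

  if (blk_mq_alloc_tag_set(&tag_set))            /* pre-allocates tags and requests   */
          goto fail;
  q = blk_mq_init_queue(&tag_set);               /* request queue bound to the set    */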
Block MQ Soft- and Hardware-Context
• Both exist per request queue (see the pointers on page 7).
• A hardware context (blk_mq_hw_ctx = hctx) exists per hardware queue; it hosts a work item (on the kblockd work queue) that is scheduled on a matching CPU, pulls requests out of the associated software contexts, and submits them to the hardware.
• A software context (blk_mq_ctx = ctx) exists per CPU; it queues requests in a simple FIFO in the absence of an Elevator. Each ctx is associated with an assigned hctx as per the tag-set mapping.
© Copyright IBM Corp. 2021 11
Block MQ Elevator / Scheduler
• Elevator = I/O Scheduler
• Can be set optionally per request queue (/sys/class/block/<dev>/queue/scheduler):
  mq-deadline  Forward port of the old deadline scheduler; doesn’t handle MQ context affinities; default for devices with 1 hardware queue; limits wait time for requests to prevent starvation (500 ms for reads, 5 s for writes) [10, 18].
  kyber        The only MQ-native scheduler [10]; aims to meet certain latency targets (2 ms for reads, 10 ms for writes) by limiting the queue depth dynamically.
  bfq          The only non-trivial I/O scheduler [6, 8, 9, 11] (replaces the old CFQ scheduler); doesn’t handle MQ context affinities; aims at providing fairness between I/O-issuing processes.
  none         Default for devices with more than 1 hardware queue; simple FIFO via the MQ software contexts.
© Copyright IBM Corp. 2021 12
What About Stacked Block Devices?
• Device-Mapper (dm) and RAID (md) use virtual/stacked block devices on top of existing hardware-backed block devices ([15, 21]).
• Examples: RAID, LVM2, Multipathing.
• Same structure as shown on pages 6 and 7, but without the hardware-specific structures; stacked on other block_devices.
• BIO based: doesn’t have an Elevator, an own tag set, nor any software or hardware contexts; modifies I/O (BIOs) after submission and immediately passes it on.
• Request based: has the full set of infrastructure (only dm-multipath at the moment); can queue requests; bypasses lower-level queueing.
• The queue_limits of the lower-level devices are aggregated into a “greatest common divisor”, so that requests can be scheduled on any of them.
• The holders/slaves directories in sysfs show the relationship.
© Copyright IBM Corp. 2021 13
I/O Flow in the Block Layer
Submission of I/O Requests from Userspace (Simplified)
• I/O submission is mainly categorized into 2 disciplines:
  Buffered I/O  Requests are served via the Page Cache. Writes are cached and eventually (usually asynchronously) written to disk via Writeback. Reads are served directly from the cache if fresh, otherwise read from disk synchronously (with read-ahead).
  Direct I/O    Requests are served directly by the backing disk, via submit_bio(bio) on the block_device/gendisk; alignment and possibly size requirements apply; DMA directly into/from user memory is possible.
• For synchronous I/O system calls, tasks wait in state TASK_UNINTERRUPTIBLE.
© Copyright IBM Corp. 2021 14
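To make the Direct I/O requirements tangible, here is a minimal userspace sketch that reads 1 MiB past the page cache; /dev/sdf is just an example, and the 4096-byte alignment is an assumption that satisfies typical logical block sizes:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(void)
  {
          void *buf;
          int fd = open("/dev/sdf", O_RDONLY | O_DIRECT);  /* example device */

          if (fd < 0)
                  return 1;
          /* O_DIRECT requires alignment: buffer, file offset, and length
           * must be multiples of the logical block size. */
          if (posix_memalign(&buf, 4096, 1 << 20))
                  return 1;
          ssize_t ret = pread(fd, buf, 1 << 20, 0);  /* DMA into user memory */
          free(buf);
          close(fd);
          return ret < 0;
  }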
A New Asynchronous I/O Interface: io_uring
• With Linux 5.1 a new I/O submission API has been added: io_uring [12, 2].
• A new set of system calls creates a set of ring structures: a Submission Queue (SQ), a Completion Queue (CQ), and a Submission Queue Entries (SQE) array.
• The structures are shared between kernel and userspace via mmap(2).
• Submission and completion work asynchronously: userspace fills an SQE and advances the SQ tail; the kernel consumes entries and advances the SQ head; completions arrive as CQEs that reference the submitted entry by id.
• It utilizes the standard syscall backends for calls like readv(2), writev(2), or fsync(2), with the same categories as on page 14.
© Copyright IBM Corp. 2021 15
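A minimal submission/completion round trip, sketched with the liburing helper library (not part of the talk itself) rather than the raw io_uring_setup(2)/io_uring_enter(2) calls; /dev/sda is an example device:

  #include <liburing.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          struct io_uring ring;
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;
          char buf[4096];
          int fd = open("/dev/sda", O_RDONLY);            /* example device       */

          if (fd < 0 || io_uring_queue_init(8, &ring, 0)) /* SQ/CQ with 8 entries */
                  return 1;
          sqe = io_uring_get_sqe(&ring);                  /* 1. fill an SQE       */
          io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
          io_uring_submit(&ring);                         /* 2. advance SQ tail   */
          io_uring_wait_cqe(&ring, &cqe);                 /* wait for a CQE       */
          printf("read returned %d\n", cqe->res);
          io_uring_cqe_seen(&ring, cqe);                  /* 3. advance CQ head   */
          io_uring_queue_exit(&ring);
          close(fd);
          return 0;
  }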
The I/O Unit of the Block Layer: BIO
• BIOs represent in-flight I/O.
• Application data is kept separate; an array of bio_vecs holds pointers to the pages with the application data (a scatter/gather list). The data pages are pinned; they need not be physically contiguous, but they map a contiguous range of data/LBAs.
• Position and progress are managed in a bvec_iter:
  bi_sector     start sector
  bi_size       size in bytes
  bi_idx        current bvec index
  bi_bvec_done  finished work in bytes
• BIOs might be split when queue limits (see page 8) are exceeded, or cloned when the same data goes to different places.
• The data of a single BIO is limited to 4 GiB.
© Copyright IBM Corp. 2021 16
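How a kernel component might build and submit such a BIO, sketched against the ~5.11 in-kernel API; bdev, page, and done_fn are assumed to exist in the surrounding driver or file-system code:

  /* Sketch (kernel, ~5.11 API): read one page from sector 2048 of bdev. */
  struct bio *bio = bio_alloc(GFP_NOIO, 1);     /* room for one bio_vec          */

  bio_set_dev(bio, bdev);                       /* target block_device           */
  bio->bi_opf = REQ_OP_READ;                    /* operation + flags             */
  bio->bi_iter.bi_sector = 2048;                /* start sector (512 B units)    */
  bio_add_page(bio, page, PAGE_SIZE, 0);        /* attach data page as a bio_vec */
  bio->bi_end_io = done_fn;                     /* completion callback           */
  submit_bio(bio);                              /* hand off to the block layer   */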
Plugging and Merging
Plugging:
• When the VFS layer generates I/O requests and submits them for processing, it plugs the request queue of the target block device ([1]).
• Requests generated while the plug is active are not immediately submitted, but saved until unplugging.
• Unplugging happens either explicitly, or during scheduled context switches.
Merging:
• The block layer tries to merge BIOs and requests with already queued or plugged requests.
  Back-merging: the new data fits at the end of an existing request.
  Front-merging: the new data fits at the beginning of an existing request.
• Merging is done by concatenating BIOs via bi_next.
• Merges must not exceed the queue limits.
© Copyright IBM Corp. 2021 17
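In the kernel, plugging by a submitter looks roughly like this sketch (the BIOs bio1..bio3 are assumed to be prepared elsewhere):

  struct blk_plug plug;

  blk_start_plug(&plug);          /* plug: hold back newly submitted requests    */
  submit_bio(bio1);               /* collected on the plug list, merge candidates */
  submit_bio(bio2);
  submit_bio(bio3);
  blk_finish_plug(&plug);         /* unplug: flush merged requests to the queue  */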
Entry Function into the Block Layer
• When I/O is necessary, BIOs are generated and submitted into the block layer via submit_bio() ([4]).
• This doesn’t guarantee synchronous processing; completion is signaled via the bio->bi_end_io(bio) callback.
• One submitted BIO can turn into several more (block queue limits, stacked devices, …); each of them is also submitted via submit_bio().
• Especially for stacked devices this could exhaust the kernel stack space, so the recursion is turned into an iteration over a thread-local list (approximately a depth-first search with an explicit stack). Pseudo-code representation of the functionality:

  submit_bio(bio) {
      if (exists(current.bios)) {        // thread-local list exists: already iterating,
          current.bios += bio            // so just queue the BIO and return
          return
      }
      On Stack: bios = List(), old_bios, same, lower, q
      current.bios ← bios
      do {
          q = queue(bio)
          old_bios ← current.bios
          current.bios = List()
          disk(bio)->submit_bio(bio)     // may submit further BIOs into current.bios
          same = List(); lower = List()
          while (bio ← pop(current.bios)) {
              if (q == queue(bio)) same += bio
              else                 lower += bio
          }
          current.bios += lower          // lower-level (stacked) devices first
          current.bios += same
          current.bios += old_bios
      } while (bio ← pop(current.bios))
      current.bios ← Null
  }
© Copyright IBM Corp. 2021 18
Request Submission and Dispatch
• Once a BIO reaches a request queue (see pages 11 and 12), the submitting task tries to get a Tag from the associated HCTX. This creates back pressure if no tag is available.
• The BIO is added to the associated Request; via linking, there may be multiple BIOs per request.
• The Request is inserted into the software-context (CTX) FIFO queue, or into the Elevator (if enabled); the HCTX work item is queued onto the kblockd work queue.
• The associated CPU executes the HCTX work item.
• The work item pulls queued requests out of the associated software contexts or Elevator and pushes them to the device driver, which places them on the hardware queue.
© Copyright IBM Corp. 2021 19
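The hand-off to the driver ends in its blk_mq_ops->queue_rq callback; a minimal sketch of such a callback, where my_hw_submit() stands in for a hypothetical hardware submission path:

  static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                  const struct blk_mq_queue_data *bd)
  {
          struct request *rq = bd->rq;

          blk_mq_start_request(rq);                  /* mark request as in flight     */
          if (!my_hw_submit(hctx->driver_data, rq))  /* hand to hardware (hypothetical) */
                  return BLK_STS_RESOURCE;           /* back pressure: retry later     */
          return BLK_STS_OK;                         /* completion comes via IRQ       */
  }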
Request Completion
• Once a request is processed, device drivers usually get notified via an interrupt.
• To keep data CPU-local, interrupts should be bound to the CPU of the associated MQ software context (platform-/controller-/driver-dependent); blk-mq resorts to an IPI or SoftIRQ otherwise to redirect the completion (blk_mq_complete_request(rq)).
• The device driver is responsible for determining the block layer request that corresponds to the signaled completion.
• Progress is measured by how many bytes were successfully completed; partial completion might cause a re-schedule of the request.
• Completing a request is a chain of callbacks, q->mq_ops->complete(rq) and then bio->bi_end_io(bio) for each associated BIO, that ultimately notifies the waiting user or kernel thread.
© Copyright IBM Corp. 2021 20
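A sketch of the two driver-side halves of completion, with my_find_request() as a hypothetical lookup from the hardware completion to the block layer request:

  static irqreturn_t my_irq_handler(int irq, void *data)
  {
          struct request *rq = my_find_request(data);  /* map HW completion to rq (hypothetical) */

          blk_mq_complete_request(rq);  /* redirect to submitting CPU (IPI/SoftIRQ if needed) */
          return IRQ_HANDLED;
  }

  static void my_complete(struct request *rq)          /* blk_mq_ops->complete */
  {
          blk_mq_end_request(rq, BLK_STS_OK);  /* ends BIOs, calls bi_end_io, notifies waiters */
  }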
Block Layer Polling
• Similar to high-speed networking: with high-speed storage targets it can be beneficial to use polling instead of waiting for interrupts to handle request completion.
• Decreases response times and reduces the overhead produced by interrupts on fast devices.
• Available in Linux since 4.10 [7]; only supported by NVMe at this point (support for SCSI merged for 5.13 [13], support for dm in work [26]).
• Enable per request queue: echo 1 > /sys/class/block/<dev>/queue/io_poll
• Enable for NVMe with the module parameter: nvme.poll_queues=N
• The device driver creates separate HW queues that have interrupts disabled.
• Whether polling is used is controlled by applications (only with Direct I/O currently), as shown in the sketch below:
  • Pass RWF_HIPRI to readv(2)/writev(2)
  • Pass IORING_SETUP_IOPOLL to io_uring_setup(2) for io_uring
• When used, application threads that issued I/O, or io_uring worker threads, actively poll the HW queues to check whether the issued request has been completed.
© Copyright IBM Corp. 2021 21
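From userspace, a polled read might look like the following sketch, combining O_DIRECT with RWF_HIPRI; /dev/nvme0n1 and the 4 KiB alignment are example choices:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <sys/uio.h>
  #include <unistd.h>

  int main(void)
  {
          void *buf;
          struct iovec iov;
          int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* example device */

          if (fd < 0 || posix_memalign(&buf, 4096, 4096))
                  return 1;
          iov.iov_base = buf;
          iov.iov_len  = 4096;
          /* RWF_HIPRI: the submitting thread polls the HW queue for completion
           * instead of sleeping until an interrupt arrives. */
          ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
          close(fd);
          return ret < 0;
  }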
Closing
Summary • Block devices are a hardware abstraction to allow uniform access to a range of diverse hardware • The entry into the block layer is provided by block_device device nodes, which are backed by a gendisk (representing the block-addressable storage space) and a request_queue (providing a generic way to queue requests against the hardware) • Special care is taken to allow processor-local processing of I/O requests and responses • Userspace requests are mainly categorized into Buffered and Direct I/O • The central structure for transporting information about in-flight I/O is the BIO; it allows for cloning, splitting, and merging without copying payload • Processing of I/O is fundamentally asynchronous in the kernel; requests happen in a different context than responses, and are only synchronized via wait/notify mechanisms © Copyright IBM Corp. 2021 22
Made with LaTeX and Inkscape. Questions?
IBM Deutschland Research & Development (Headquarters: Böblingen) • Big parts of the support for Linux on IBM Z (Kernel and Userland) are done at the IBM Laboratory in Böblingen • We follow a strict upstream policy and do not (with scarce exceptions) ship code that is not accepted in the respective upstream project • Parts of the hard- and firmware for the IBM Mainframes are also done in Böblingen • https://www.ibm.com/de-de/marketing/entwicklung/about.html © Copyright IBM Corp. 2021 23
Backup Slides
But What About Zoned Storage? • Introduced with the advent of SMR disks, but now also in the NVMe specification • Tracks on the disk overlap; overwriting a single track means overwriting a bunch of other tracks as well [24] → data is organized in “bands” of tracks: zones. Each zone is only written sequentially; out-of-order writes are only possible after a reset of the whole zone. Random read access remains possible. • Supported by the Linux block layer, but it breaks with the definition given earlier • Direct use only via the aid of special ioctls [23], via a special device-mapper target [22], or via a special file system [25] • We will ignore this for the rest of the talk © Copyright IBM Corp. 2021
Structure of a Block Device: Whole Picture
Diagram: the complete relationship between the (V)FS layer (inode, block_device, block_device_operations), the block layer (gendisk, disk_part_tbl, request_queue with elevator_queue, backing_dev_info, and queue_limits), and the SCSI layer (scsi_disk, scsi_device).
© Copyright IBM Corp. 2021
Structure of a MQ Request Queue: Whole Picture
Diagram: the complete MQ request queue, from the blk_mq_tag_set (blk_mq_ops, CPU-to-queue map, blk_mq_tags per hardware queue) through the request_queue (queue_hw_ctx[], queue_ctx[], elevator_queue, backing_dev_info, queue_limits) down to the per-CPU blk_mq_ctx and per-hardware-queue blk_mq_hw_ctx structures and their pre-allocated requests.
© Copyright IBM Corp. 2021
Glossary i BIO BIO: represents metadata and data for I/O in the Linux block layer; no hardware specific information. CFQ Completely Fair Queuing: deprecated I/O scheduler for the single queue block layer. DASD Direct-Access Storage Device: disk storage type used by IBM Z via FICON. dm Device-Mapper: low level volume manager; allows specifying mappings for ranges of logical sectors; higher level volume managers such as LVM2 use this driver. DMA Direct Memory Access: hardware components can access main memory without CPU involvement. ECKD Extended Count Key Data: a recording format of data stored on DASDs. Elevator Synonym for “I/O Scheduler” in the Linux Kernel. FCP Fibre Channel Protocol: transport for the SCSI command set over Fibre Channel networks. FIFO First in, First out. HCTX Hardware Context of a request queue. ioctl input/output control: system call that allows querying device-specific information, or executing device-specific operations. © Copyright IBM Corp. 2021
Glossary ii IPI Inter-Processor Interrupt: interrupt another processor to communicate some required action. iSCSI Internet SCSI: transport for the SCSI command set over TCP/IP. LVM2 Logical Volume Manager: flexible methods of allocating (non-linear) space on mass-storage devices. md Multiple devices support: support multiple physical block devices through a single logical device; required for RAID and logical volume management. MQ Short for: Multi-Queue. Multipathing Accessing one storage target via multiple independent paths with the purposes of redundancy and load-balancing. NVMe Non-Volatile Memory Express: interface for accessing persistent storage devices over PCI Express. RAID Redundant Array of Inexpensive Disks: combines multiple physical disks into a logical one with the purposes of data redundancy and load-balancing. RAM Random-Access Memory: a form of information storage, random-accessible, and normally volatile. SAS Serial Attached SCSI: transport for the SCSI command set over a serial point-to-point bus. © Copyright IBM Corp. 2021
Glossary iii SCSI Small Computer System Interface: set of standards for commands, protocols, and physical interfaces to connect computers with peripheral devices. Serial ATA Serial AT Attachment: serial bus that connects host bus adapters with mass storage devices. SMR Shingled Magnetic Recording: a magnetic storage data recording technology used to provide increased areal density. VFS Virtual File System: abstraction layer in Linux that provides a common interface to file systems and devices for software. © Copyright IBM Corp. 2021
References i [1] J. Axboe. Explicit block device plugging, Apr. 2011. https://lwn.net/Articles/438256/. [2] J. Axboe. Efficient io with io_uring. https://kernel.dk/io_uring.pdf, Oct. 2019. [3] M. Bjørling, J. Axboe, D. Nellans, and P. Bonnet. Linux block io: Introducing multi-queue ssd access on multi-core systems. In Proceedings of the 6th International Systems and Storage Conference, SYSTOR ’13, pages 22:1–22:10, New York, NY, USA, 2013. ACM. [4] N. Brown. A block layer introduction part 1: the bio layer, Oct. 2017. https://lwn.net/Articles/736534/. [5] N. Brown. Block layer introduction part 2: the request layer, Nov. 2017. https://lwn.net/Articles/738449/. [6] J. Corbet. The bfq i/o scheduler, June 2014. https://lwn.net/Articles/601799/. © Copyright IBM Corp. 2021
References ii [7] J. Corbet. Block-layer i/o polling, Nov. 2015. https://lwn.net/Articles/663879/. [8] J. Corbet. The return of the bfq i/o scheduler, Feb. 2016. https://lwn.net/Articles/674308/. [9] J. Corbet. A way forward for bfq, Dec. 2016. https://lwn.net/Articles/709202/. [10] J. Corbet. Two new block i/o schedulers for 4.12, Apr. 2017. https://lwn.net/Articles/720675/. [11] J. Corbet. I/o scheduling for single-queue devices. Oct. 2018. https://lwn.net/Articles/767987/. [12] J. Corbet. Ringing in a new asynchronous i/o api, Jan. 2019. https://lwn.net/Articles/776703/. © Copyright IBM Corp. 2021
References iii [13] K. Desai. io_uring iopoll in scsi layer, Feb. 2021. https://lore.kernel.org/linux-scsi/20210215074048.19424-1-kashyap.desai@broadcom.com/T/#. [14] W. Fischer and G. Schönberger. Linux storage stack diagramm, Mar. 2017. https://www.thomas-krenn.com/de/wiki/Linux_Storage_Stack_Diagramm. [15] E. Goggin, A. Kergon, C. Varoqui, and D. Olien. Linux multipathing. In Proceedings of the Linux Symposium, volume 1, pages 155–176, July 2005. https://www.kernel.org/doc/ols/2005/ols2005v1-pages-155-176.pdf. [16] kernel development community. Block documentation. https://www.kernel.org/doc/html/latest/block/index.html. [17] M. Lei, H. Reinecke, J. Garry, and K. Desai. blk-mq/scsi: Provide hostwide shared tags for scsi hbas, Aug. 2020. https://lore.kernel.org/linux-scsi/1597850436-116171-1-git-send-email-john.garry@huawei.com/T/#. [18] R. Love. Linux Kernel Development. Addison-Wesley Professional, 3 edition, June 2010. © Copyright IBM Corp. 2021
References iv [19] O. Purdila, R. Chitu, and R. Chitu. Linux kernel labs: Block device drivers, May 2019. https://linux-kernel-labs.github.io/refs/heads/master/labs/block_device_drivers.html. [20] O. Sandoval. blk-mq: abstract tag allocation out into sbitmap library, Sept. 2016. https://lore.kernel.org/linux-block/cover.1474100040.git.osandov@fb.com/T/#. [21] K. Ueda, J. Nomura, and M. Christie. Request-based device-mapper multipath and dynamic load balancing. In Proceedings of the Linux Symposium, volume 2, pages 235–244, June 2007. https://www.kernel.org/doc/ols/2007/ols2007v2-pages-235-244.pdf. [22] Western Digital Corporation. dm-zoned. https://www.zonedstorage.io/linux/dm/#dm-zoned. [23] Western Digital Corporation. Zoned block device user interface. https://www.zonedstorage.io/linux/zbd-api/. [24] Western Digital Corporation. Zoned storage overview. https://www.zonedstorage.io/introduction/zoned-storage/. © Copyright IBM Corp. 2021
References v [25] Western Digital Corporation. zonefs. https://www.zonedstorage.io/linux/fs/#zonefs. [26] J. Xu. dm: support polling, Mar. 2021. https://lore.kernel.org/linux-block/20210303115740.127001-1-jefflexu@linux.alibaba.com/T/#. © Copyright IBM Corp. 2021