From Open-Channel SSDs to Zoned Namespaces - Matias Bjørling Director, Solid-State System Software - Usenix

Page created by Felix Martin
 
CONTINUE READING
From Open-Channel SSDs to Zoned Namespaces - Matias Bjørling Director, Solid-State System Software - Usenix
From Open-Channel SSDs to
Zoned Namespaces
Matias Bjørling
Director, Solid-State System Software
23th January 2019

© 2019 Western Digital Corporation or its affiliates. All rights reserved.   06.03.19
From Open-Channel SSDs to Zoned Namespaces - Matias Bjørling Director, Solid-State System Software - Usenix
Forward-Looking Statements
Safe Harbor | Disclaimers
This presentation contains forward-looking statements that involve risks and uncertainties, including, but not limited to, statements regarding our solid-
state technologies, product development efforts, software development and potential contributions, growth opportunities, and demand and market
trends. Forward-looking statements should not be read as a guarantee of future performance or results, and will not necessarily be accurate indications of
the times at, or by, which such performance or results will be achieved, if at all. Forward-looking statements are subject to risks and uncertainties that
could cause actual performance or results to differ materially from those expressed in or suggested by the forward-looking statements.

Key risks and uncertainties include volatility in global economic conditions, business conditions and growth in the storage ecosystem, impact of
competitive products and pricing, market acceptance and cost of commodity materials and specialized product components, actions by competitors,
unexpected advances in competing technologies, difficulties or delays in manufacturing, and other risks and uncertainties listed in the company’s filings
with the Securities and Exchange Commission (the “SEC”) and available on the SEC’s website at www.sec.gov, including our most recently filed periodic
report, to which your attention is directed. We do not undertake any obligation to publicly update or revise any forward-looking statement, whether as a
result of new information, future developments or otherwise, except as required by law.

                     © 2019 Western Digital Corporation or its affiliates. All rights reserved.                                               06.03.19       2
Ubiquitous Workloads
Efficiency of the Cloud requires many different workloads to a single
SSD

        Databases                                Sensors                               Analytics     Virtualization   Video

                                                                                 Solid-State Drive
          © 2019 Western Digital Corporation or its affiliates. All rights reserved.                                          06.03.19   3
                                                                                                                                             3
Solid State Drive Internals                                                                                                                          Read/Write
                                                                                                                                                 Host Interface (NVMe)

• NAND Read/Program/Erase                                                                                             Solid State Drive
                                                                                                                           Logical to Physical
                                                                                                                                                             Wear-Leveling          Garbage Collection
• Highly Parallel Architecture                                                                                              Translation Map

       – Tens of Dies
                                                                                                                      Bad Block Management                Media Error Handling              …
• NAND Access Latencies
                                                                                                                                                    Media Interface Controller
• Translation Layer
       –     Logical to Physical Translation Layer                                                                                                        Read/Program/Erase
       –     Wear-leveling                                                                                                                                                                                  Read (50-100us)

       –     Garbage Collection                                                                                            Die0                    Die0                      Die0                Die0      Write (1-10ms)

       –     Bad block management                                                                                          Die1                    Die1                      Die1                Die1
                                                                                                                                                                                                           Erase (3-15ms)

       –     Media error handling
                                                                                                                           Die2                    Die2                      Die2                Die2
       –     Etc.
                                                                                                                           Die3                    Die3                      Die3                Die3
• Read/Write/Erase -> Read/Write
                                                                                                                                            Highly Parallel Architecture
©2018 Western Digital Corporation or its affiliates. All rights reserved.
                                              © 2019 Western Digital Corporation or its affiliates. All rights reserved.                                                                                 06.03.19
                                                                                                                                                                                                                        44
Open-Channel Concepts
Chunks & Parallel Units

• Chunks                                                                                     • Parallel Units
 – Write sequentially within an LBA range                                                             – Host can direct I/Os to separate workloads
 – Requires reset for rewrites                                                                        – Stripes across single or multiple dies.
 – Borrows from HDD’s SMR specification                                                               – The parallel units inherits the throughput and
   (ZAC/ZBC)                                                                                            latency characteristics of the underlying media
 – Optimized for SSD physical constraints                                                             – Similar concept to I/O Determinism in NVMe
   • Align writes to media layout
                                                                                                                                              Application-1   Application-2         Application-3
                                                                                             Application-1   Application-2   Application-3
                                                                                                                                                 FTL-1           FTL-2                  FTL3

                                                                                                              SSD Firmware                                     SSD Firmware

              Disk LBA range divided in chunks                                                                                               Flash
                                                                                              Flash

   Chunk 0   Chunk 1        Chunk 2          Chunk 3                       Chunk X

                © 2019 Western Digital Corporation or its affiliates. All rights reserved.                                                                                    06.03.19              5
Open-Channel SSDs

    I/O Isolation                                                Predictable Latency   Data Placement &
                                                                                        I/O Scheduling

      © 2019 Western Digital Corporation or its affiliates. All rights reserved.                      06.03.19   6
                                                                                                                     6
Open-Channel SSDs
Interface Adoption

        Open-Channel SSD architectures are being adopted
                    by major hyper-scalers

   Concepts ready to be part of the NVMe standard specification
         © 2019 Western Digital Corporation or its affiliates. All rights reserved.   06.03.19   7
NVMe Approach
Drive features into NVMe which address key OCSSD use cases

          Zoned Namespaces (e.g. Chunks)                                                      Endurance Group Mgmt (e.g. Parallel Units)

                                                                                         • Explicit Provisioning of Device Topology
                                                                                         • Build on IO Determinism

                   Today’s talk
            © 2019 Western Digital Corporation or its affiliates. All rights reserved.                                                06.03.19   8
Zoned Namespaces (ZNS)

• Technical Proposal in the NVMe working group
• Standardizes zone interface as an approach to:
 – Reduce device-side write amplification
 – Reduce over-provisioning
   • “Note that excessive over-provisioning is similar to
     early replacement -- in both cases you buy more devices.”
                                                 - Mark Callaghan, Facebook
 – Reduce DRAM in SSDs
   • Highest cost after NAND itself
 – Improve latency outliers and throughput
   • Less device-side data movement
   • Tail at scale
 – Enable software eco-system.
   • Everyone benefits from improvements!

              © 2019 Western Digital Corporation or its affiliates. All rights reserved.   06.03.19   9
                                                                                                          9
Zoned Namespaces similar to ZBC/ZAC for SMR
HDDs
• Storage capacity is divided into zones                                                                  Disk LBA range divided in zones

• Each zone is written sequentially                                                            Zone 0   Zone 1      Zone 2       Zone 3            Zone X

• Interface optimized for SSDs
 – Align with media characteristics
   • Zone size aligned to media (E.g., NAND block sizes)
   • Zone capacity aligned to physical media sizes
 – Reduce NAND media erase cycles (Write amp.)
                                                                                                                                          Write commands
                                                                                                                                          advance the write
                                                                                                                 Write pointer
                                                                                                                                              pointer
                                                                                                                   position

                                                                                                 Reset write pointer commands
                                                                                                    rewind the write pointer

                  © 2019 Western Digital Corporation or its affiliates. All rights reserved.                                                      06.03.19    10
Zone Information

• Zone State
 – Empty, Implicitly Opened, Explicitly Opened, Closed,
   Full, Read Only, Offline
 – Empty -> Open -> Full -> Empty -> ….
• Zone Reset
 – Full -> Empty
• Zone Size & Zone Capacity
 – Zone Size is fixed
 – Zone Capacity is the writeable area within a zone

                                                                                            Zone Size (e.g., 512MB)

                      Zone X - 1                                                                   Zone X                                 Zone X + 1

                                                            Zone Start LBA                                  Zone Capacity (E.g., 500MB)

               © 2019 Western Digital Corporation or its affiliates. All rights reserved.                                                              06.03.19   11
Zone Append
                                                                                                    QEMU (i7-6700K, 4K, QD1, RW, libaio)
                                                                                                    1500
                                                                                                    1300

• Low scalability on multiple writers to a zone                                                     1100
                                                                                                     900

                                                                                           K IOPS
                                                                                                     700
  – Write Queue Depth per Zone = 1                                                                   500

  – IOPS: 80K vs 880K on Qemu and 300K vs 1400K on metal
                                                                                                     300
                                                                                                     100
                                                                                                    -100            1             2            3               4
• ZAC/ZBC requires strict write ordering                                                                                          Number of Jobs
                                                                                                                                                                   Zone Lock
                                                                                                                                                                   Contention
• Limits write performance and increases host overhead
                                                                                                           1 Zone       4 Zones       Zone Append / No Zones

                                                                                                     Metal (i7-6700K, 4K, QD1, RW, libaio)
• Big challenge with software eco-system, HBAs, etc.                                                1500
                                                                                                    1300
• Introducing Zone Append                                                                           1100
                                                                                                     900

                                                                                           K IOPS
• Append data to a zone without defining offset
                                                                                                     700
                                                                                                     500
                                                                                                     300
  •   Drive returns where data was written in the zone                                               100
                                                                                                    -100            1             2            3               4
                                                                                                                                  Number of Jobs

                                                                                                           1 Zone       4 Zones       Zone Append / No Zones

              © 2019 Western Digital Corporation or its affiliates. All rights reserved.                                                            06.03.19           12
Zone Write Example
3x Writes (4K, 8K, 16K) – Queue Depth = 1

                                                                4K Write0                  16K Write2
                                                                                 8K Write1

                                                    Zone

                                                                WPWP                  WP       WP

         © 2019 Western Digital Corporation or its affiliates. All rights reserved.                     06.03.19   13
Zone Append Example
3x Writes (4K, 8K, 16K) – Queue Depth = 3

                                                                4K Write0
                                                                                             16K Write2
                                                                                 8K Write1

                                                    Zone

                                                                WPWP                  WP      WP

         © 2019 Western Digital Corporation or its affiliates. All rights reserved.                       06.03.19   14
ZNS: Synergy w/ ZAC/ZBC software ecosystem

• Existing ZAC/ZBC-aware file systems & device                                            User Space

  mappers “just work”                                                                                          Applications
                                                                                                                                                  util-linux         fio

 – Few changes to support to ZNS                                                                                                                   libzbc         blktests

• Reuse existing work already applied with for
                                                                                                                                                                           Linux
                                                                                                                        File-System with SMR                               Kernel
                                                                                               Regular File-
                                                                                                                                Support
                                                                                               Systems (xfs)
  ZAC/ZBC hard-drives (SMR)                                                                                                   (f2fs, btrfs)

                                                                                            Logical Block Device
                                                                                                (dm-zoned)
• Integrate directly with file-systems                                                                                           Block Layer
 – No host-side FTL
                                                                                                           SCSI/ATA                                      NVMe
 – No 1GB DRAM per 1TB Media requirement
 – Better utilization of SSD                                                              SMR HDD
                                                                                          ZBC/ZAC
                                                                                                                                        ZNS SSD

• Code is already in production at hyper-scalers                                          *= Enhanced data paths for SMR drives

  and available in the Linux eco-system.

             © 2019 Western Digital Corporation or its affiliates. All rights reserved.                                                                         06.03.19
                                                                                                                                                                                     06.01
                                                                                                                                                                                    15

                                                                                                                                                                                     3.195
Mature Software Stack
ZNS to use existing storage stack

• User-space libraries
 –   Libzbd
 –   Nvme-cli
 –   Blktests
 –   Util-linux (blkzone)
 –   fio
 –   libzns
• Kernel-space libraries
 – NVMe support for Zones
 – XFS, Btrfs, F2FS, dm-zoned, etc…
• Qemu with ZNS support

                © 2019 Western Digital Corporation or its affiliates. All rights reserved.   06.03.19   16
ZBC State Model

• States (Empty, Implicitly Opened,
  Explicitly Opened, Full, Offline)
• Each zone
 –   Size
 –   Capacity
 –   Write Pointer
 –   And other fields
• Zone Management command
 – ZM Open, ZM Close, ZM Reset, ZM
   Finish

               © 2019 Western Digital Corporation or its affiliates. All rights reserved.   06.03.19   17
© 2019 Western Digital Corporation or its affiliates. All rights reserved.   06.03.19
You can also read