Scaling the Linux VFS
Nick Piggin, SuSE Labs, Novell Inc.
September 19, 2009
Outline
I will cover the following areas:
• Introduce each of the scalability bottlenecks
• Describe common operations they protect
• Outline my approach to improving synchronisation
• Report progress, results, problems, future work
Goal
• Improve the scalability of common VFS operations;
• with minimal impact on single-threaded performance;
• and without an overly complex design.
• Single-sb scalability: scale within one superblock, not just across many mounted filesystems.
VFS overview
• Virtual FileSystem, or Virtual Filesystem Switch
• Entry point for filesystem operations (e.g. syscalls)
• Delegates operations to the appropriate mounted filesystems
• Caches things to reduce or eliminate fs responsibility
• Provides a library of functions to be used by filesystems
The contenders
• files_lock
• vfsmount_lock
• mnt_count
• dcache_lock
• inode_lock
• Several other pieces of write-heavy shared data
files_lock
• Protects modification and walking of a per-sb list of open files
• Also protects a per-tty list of files open for ttys
• open(2) and close(2) add and delete files from the list
• remount,ro walks the list to check for files open read-write
files_lock ideas
• We can move the tty usage into its own private lock
• Per-sb locks would help, but I want scalability within a single fs
• The fastpath is updates, the slowpath is reading: RCU won't work
• Modifying a single object (the list head) cannot be made scalable:
• must reduce the number of modifications (e.g. batching),
• or split the modifications across multiple objects
• The slowpath that reads the list is very rarely used!
files_lock: my implementation
• This suggests per-CPU lists, protected by per-CPU locks (see the sketch below)
• The slowpath can take all locks and walk all lists
• Pros: “perfect” scalability for file open/close, no extra atomics
• Cons: larger superblock struct, slow list walking on huge systems
• Cons: potential cross-CPU file removal
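Below is a minimal kernel-style sketch of this design, with hypothetical names (files_cpu, file_list_add, file_list_for_each); the real patch hangs these structures off each superblock rather than using one global set.

    #include <linux/fs.h>
    #include <linux/list.h>
    #include <linux/percpu.h>
    #include <linux/spinlock.h>

    /* Hypothetical: one lock+list pair per CPU (per superblock in the
     * real design). Each lock needs spin_lock_init() at setup, omitted. */
    struct files_cpu {
            spinlock_t lock;
            struct list_head list;
    };
    static DEFINE_PER_CPU(struct files_cpu, sb_files);

    /* Fastpath, e.g. from open(2): touch only this CPU's lock and list */
    static void file_list_add(struct file *file)
    {
            struct files_cpu *fc = &get_cpu_var(sb_files);

            spin_lock(&fc->lock);
            list_add(&file->f_u.fu_list, &fc->list);
            spin_unlock(&fc->lock);
            put_cpu_var(sb_files);
    }

    /* Slowpath, e.g. remount,ro: take every lock and walk every list.
     * Rarely used, so the O(num CPUs) cost is acceptable. */
    static void file_list_for_each(void (*fn)(struct file *))
    {
            int cpu;

            for_each_possible_cpu(cpu) {
                    struct files_cpu *fc = &per_cpu(sb_files, cpu);
                    struct file *file;

                    spin_lock(&fc->lock);
                    list_for_each_entry(file, &fc->list, f_u.fu_list)
                            fn(file);
                    spin_unlock(&fc->lock);
            }
    }

Removal is the awkward case: close(2) may run on a different CPU than the open(2) that added the file, so the file must record which CPU's list it sits on. That is the cross-CPU removal con above.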
vfsmount_lock
• Largely protects reading and writing the mount hash
• Looking up the vfsmount hash for a given mount point
• Publishing changes to the mount hierarchy in the mount hash
• Mounting and unmounting filesystems modifies the data
• Path walking across filesystem mounts reads the data
vfsmount_lock ideas
• The fastpath is lookups, the slowpath is updates
• RCU could help here, but there is a complex issue:
• we need to prevent umounts for a period after lookup (while we hold a reference)
• usual implementations use a per-object lock, but we want per-sb scalability
• umount could use synchronize_rcu(), but that can sleep and be very slow
vfsmount_lock: my implementation
• Per-CPU locks again, this time optimised for reading
• A “brlock”: readers take their per-CPU lock, writers take all locks (sketch below)
• Pros: “perfect” scalability for mount lookups, no extra atomics
• Cons: slower umounts
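A sketch of the brlock built from per-CPU spinlocks (simplified: lock initialisation and lockdep annotations omitted). Readers touch only their own CPU's lock, so uncontended read-side acquisition never bounces a shared cacheline; writers take every CPU's lock, which is exactly what makes umount slower.

    #include <linux/percpu.h>
    #include <linux/spinlock.h>

    static DEFINE_PER_CPU(spinlock_t, vfsmount_lock);

    static void vfsmount_read_lock(void)
    {
            /* get_cpu_var() disables preemption, pinning us to this CPU */
            spinlock_t *lock = &get_cpu_var(vfsmount_lock);
            spin_lock(lock);
    }

    static void vfsmount_read_unlock(void)
    {
            spinlock_t *lock = &__get_cpu_var(vfsmount_lock);
            spin_unlock(lock);
            put_cpu_var(vfsmount_lock);
    }

    /* Writers (mount, umount) exclude all readers by sweeping all locks */
    static void vfsmount_write_lock(void)
    {
            int cpu;

            for_each_possible_cpu(cpu)
                    spin_lock(&per_cpu(vfsmount_lock, cpu));
    }

    static void vfsmount_write_unlock(void)
    {
            int cpu;

            for_each_possible_cpu(cpu)
                    spin_unlock(&per_cpu(vfsmount_lock, cpu));
    }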
mnt_count
• A refcount on the vfsmount, though not quite a simple refcount
• Used in important paths: open(2), close(2), and path walks over mounts
mnt_count: my implementation
• The fastpath is get/put
• A “put” must also check for count==0, which makes a per-CPU counter hard
• However, count==0 is always false while the vfsmount is attached
• So we only need to check for 0 when the vfsmount is not mounted (the rare case)
• Then per-CPU counters can be used, together with the per-CPU vfsmount_lock (sketch below)
• Pros: “perfect” scalability for vfsmount refcounting
• Cons: larger vfsmount struct
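A rough sketch of the fastpath, with hypothetical names (mount_counts, attached, mnt_put_slow); the real synchronisation of the attached flag against a racing umount is subtler and relies on the per-CPU vfsmount_lock above.

    #include <linux/kernel.h>
    #include <linux/percpu.h>
    #include <linux/smp.h>

    struct mount_counts {
            int *count;     /* per-CPU counters from alloc_percpu(int) */
            bool attached;  /* only changed under the write-side brlock */
    };

    /* Hypothetical slowpath: take all per-CPU vfsmount locks, sum the
     * counters, and free the vfsmount if the total has reached zero. */
    static void mnt_put_slow(struct mount_counts *m);

    static void mnt_get(struct mount_counts *m)
    {
            (*per_cpu_ptr(m->count, get_cpu()))++;
            put_cpu();
    }

    static void mnt_put(struct mount_counts *m)
    {
            (*per_cpu_ptr(m->count, get_cpu()))--;
            put_cpu();
            /* While the vfsmount is attached the total can never be zero,
             * so the common case never has to sum the per-CPU counters. */
            if (unlikely(!m->attached))
                    mnt_put_slow(m);
    }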
dcache_lock
• Most dcache operations require dcache_lock
• except name lookup, which was converted to RCU in 2.5
• dput of a last reference (except for “simple” filesystems)
• any fs namespace modification (create, delete, rename)
• any uncached namespace population (uncached path walks)
• dcache LRU scanning and reclaim
• socket open/close operations
dcache_lock is hard
• The code and semantics can be complex
• It is exported to filesystems and held over methods
• It is hard to know what it protects in each instance it is taken
• Lots of places to audit and check
• Hard to verify that the result is correct
• This is why I need VFS experts and fs developers
dcache_lock approach
• Identify what the lock protects in each place it is taken
• Implement new locking schemes to protect the usage classes
• Remove dcache_lock
• Improve the scalability of the (now simplified) classes of locks
dcache locking classes
• dcache hash
• dcache LRU list
• per-inode dentry list
• dentry children list
• dentry fields (d_count, d_flags, list membership)
• dentry refcount
• reverse path traversal
• dentry counters
dcache: my implementation outline
• All dentry fields, including list membership, protected by d_lock
• The children list is protected by d_lock (it is a dentry field too)
• dcache hash, LRU list, and inode dentry list protected by new locks
• Lock ordering can be difficult; trylock helps (see the sketch below)
• Walking up multiple parents requires RCU and rename blocking. Hard!
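For instance, with the parent->child ordering of d_lock, code that starts with only the child dentry cannot simply lock the parent as well. A common pattern, sketched here, is trylock with back-off and retry (the in-tree code is more careful about refcounts):

    #include <linux/dcache.h>
    #include <linux/spinlock.h>

    /* Lock a dentry and its parent given only the child, without
     * deadlocking against the parent->child lock order. On return
     * both d_locks are held (only one if the dentry is a root). */
    static struct dentry *lock_parent_sketch(struct dentry *dentry)
    {
            struct dentry *parent;

    again:
            spin_lock(&dentry->d_lock);
            /* d_parent is stable here: a rename must take our d_lock */
            parent = dentry->d_parent;
            if (parent == dentry)   /* a root is its own parent */
                    return parent;
            if (!spin_trylock(&parent->d_lock)) {
                    /* Would invert the lock order: back off and retry */
                    spin_unlock(&dentry->d_lock);
                    cpu_relax();
                    goto again;
            }
            return parent;
    }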
dcache locking difficulties 1
• The “locking classes” are not independent.

    spin_lock(&dcache_lock);
    list_add(&dentry->d_lru, &dentry_lru);
    hlist_add_head(&dentry->d_hash, &hash_list);
    spin_unlock(&dcache_lock);

is not the same as

    spin_lock(&dcache_lru_lock);
    list_add(&dentry->d_lru, &dentry_lru);
    spin_unlock(&dcache_lru_lock);
    spin_lock(&dcache_hash_lock);
    hlist_add_head(&dentry->d_hash, &hash_list);
    spin_unlock(&dcache_hash_lock);

Each dcache_lock site has to be considered carefully, in context. d_lock does help a lot.
dcache locking difficulties 2
• EXPORT_SYMBOL(dcache_lock);
• ->d_delete (a method called with the lock held)
Filesystems may use dcache_lock in non-trivial ways to protect their own data structures, and to lock out parts of the dcache code from executing. Autofs4 seems to do this, for example.
dcache locking difficulties 3
• Reverse path walking (from child to parent)
We have a parent->child lock ordering in the dcache. Walking the other way is tough. dcache_lock would freeze the state of the entire dcache tree. I use RCU to prevent the parent from being freed while dropping the child's lock to take the parent's lock. rename_lock, or seqlock/retry logic, can prevent renames from making our walk incorrect (sketch below).
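A sketch of the walk using the existing rename_lock seqlock from fs/dcache.c (per-step processing and termination details elided): RCU guarantees the dentries' memory stays valid even though no d_lock is held across the step up, and the seqlock retry catches any concurrent rename that might have routed the walk through a stale parent.

    #include <linux/dcache.h>
    #include <linux/rcupdate.h>
    #include <linux/seqlock.h>

    static void walk_to_root_sketch(struct dentry *dentry)
    {
            unsigned seq;

    restart:
            seq = read_seqbegin(&rename_lock);
            rcu_read_lock();
            while (!IS_ROOT(dentry)) {
                    /* Process this dentry, e.g. prepend its name to a
                     * path buffer; d_lock may be taken briefly here. */
                    dentry = dentry->d_parent;
            }
            rcu_read_unlock();
            /* A rename ran concurrently; what we walked may be stale */
            if (read_seqretry(&rename_lock, seq))
                    goto restart;
    }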
dcache scaling in my implementation
• dcache hash lock made per-bucket
• per-inode dentry list lock made per-inode
• dcache stats counters made per-CPU
• dcache LRU list is the last global dcache lock; it could be made per-zone
• pseudo filesystems don't attach dentries to a global parent
dcache implementation complexity
• Lock ordering can be difficult
• There is no longer a way to globally freeze the tree
• Otherwise, in some ways it is actually simpler
inode_lock
• Most inode operations require inode_lock
• except dentry->inode lookup and refcounting
• inode lookup, cached and uncached; inode creation and destruction
• including socket and other pseudo-sb operations
• inode dirtying, writeback, syncing
• icache LRU walking and reclaim
• socket open/close operations
inode_lock approach
• The same as the approach for the dcache
icache locking classes
• inode hash
• inode LRU list
• per-superblock inode list
• inode dirty list
• inode fields (i_state, i_count, list membership)
• iunique
• last_ino
• inode counters
icache implementation outline
• Largely similar to the dcache
• All inode fields, including list membership, protected by i_lock
• icache hash, superblock list, and LRU and dirty lists protected by new locks
• last_ino and iunique given private locks
• Not simple, but easier than the dcache! (less complex, and less code)
icache scaling: my implementation
• inodes made RCU-freed, to simplify lock orderings and reduce complexity
• icache hash lock made per-bucket, with lockless lookup (sketch below)
• icache LRU list made lazy like the dcache; could be made per-zone
• per-CPU, per-sb inode lists
• per-CPU inode counter
• per-CPU inode number allocator (Eric Dumazet)
• the inode dirty list remains problematic
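A sketch of the lockless lookup that RCU-freed inodes permit, using the 2009-era hlist RCU API (the inode-state checks of the real icache are elided, and unhashing is assumed to use hlist_del_init_rcu() so that hlist_unhashed() works): the reader takes no bucket lock at all, then revalidates under i_lock, because the inode may have been unhashed between the RCU traversal and taking the lock.

    #include <linux/fs.h>
    #include <linux/rculist.h>
    #include <linux/spinlock.h>

    struct inode_hash_bucket {
            spinlock_t lock;        /* taken for insert/remove only */
            struct hlist_head head;
    };

    static struct inode *ifind_rcu_sketch(struct inode_hash_bucket *b,
                                          struct super_block *sb,
                                          unsigned long ino)
    {
            struct inode *inode;
            struct hlist_node *node;

            rcu_read_lock();
            hlist_for_each_entry_rcu(inode, node, &b->head, i_hash) {
                    if (inode->i_ino != ino || inode->i_sb != sb)
                            continue;
                    spin_lock(&inode->i_lock);
                    /* Revalidate under i_lock: we may have raced with
                     * the inode being unhashed and heading for freeing */
                    if (hlist_unhashed(&inode->i_hash)) {
                            spin_unlock(&inode->i_lock);
                            continue;
                    }
                    atomic_inc(&inode->i_count);    /* grab a reference */
                    spin_unlock(&inode->i_lock);
                    rcu_read_unlock();
                    return inode;
            }
            rcu_read_unlock();
            return NULL;
    }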
Current progress
• Very few fundamentally global cachelines remain
• I'm using tmpfs, ramfs, ext2/3, nfs, nfsd, and autofs4
• Most other filesystems require some work
• In particular, the dcache changes have not been audited in all filesystems
• Still stamping out bugs and doing some basic performance testing
• Still working to improve single-threaded performance
Performance results
• The abstract was a lie!
• open(2)/close(2) in separate subdirectories seems perfectly scalable
• creat(2)/unlink(2) seems perfectly scalable
• Path lookup is less scalable with a common cwd, due to d_lock taken for refcounting
• Single-threaded performance is worse in some cases, better in others
close(open("path")) on independent files, same cwd 3e+07 standard vfs-scale 2.5e+07 Total time (lower is better) 2e+07 1.5e+07 1e+07 5e+06 0 1 2 3 4 5 6 7 8 CPUs used unlink(creat("path")) on independent files, same cwd 8e+06 standard vfs-scale 7e+06 Total time (lower is better) 6e+06 5e+06 4e+06 3e+06 2e+06 1e+06 0 1 2 3 4 5 6 7 8 CPUs used 30
close(open("path")) on independent files, different cwd 1.6e+07 standard vfs-scale 1.4e+07 Total time (lower is better) 1.2e+07 1e+07 8e+06 6e+06 4e+06 2e+06 0 1 2 3 4 5 6 7 8 CPUs used unlink(creat("path")) on independent files, different cwd 4.5e+06 standard 4e+06 vfs-scale Total time (lower is better) 3.5e+06 3e+06 2.5e+06 2e+06 1.5e+06 1e+06 500000 0 1 2 3 4 5 6 7 8 CPUs used 31
[Graph: multi-process close of lots of sockets; total time (lower is better), plain vs. vfs-scale]
[Graph: OSDL reaim-7, Peter Chubb's workload; max jobs/min (higher is better), plain vs. vfs-scale]
Future work
• Improve scalability (e.g. LRU lists, inode dirty list)
• Look at single-threaded performance and code simplifications
Interesting future possibilities:
• Path walking without taking d_lock
• Paves the way for NUMA-aware dcache/icache reclaim
• Can expand the choice of data structures (simplicity, RCU requirement)
How can you help
• Review code
• Audit filesystems
• Suggest alternative approaches to scalability
• Implement improvements, “future work”, etc.
• Test your workload
Conclusion
VFS is hard. That's the only thing I can conclude so far.
Thank you