15
This paper is included in the Proceedings of the 2019 USENIX Annual Technical Conference. July 10–12, 2019 • Renton, WA, USA ISBN 978-1-939133-03-8 Open access to the Proceedings of the 2019 USENIX Annual Technical Conference is sponsored by USENIX. Extension Framework for File Systems in User space Ashish Bijlani and Umakishore Ramachandran, Georgia Institute of Technology https://www.usenix.org/conference/atc19/presentation/bijlani

Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

This paper is included in the Proceedings of the 2019 USENIX Annual Technical Conference

July 10ndash12 2019 bull Renton WA USA

ISBN 978-1-939133-03-8

Open access to the Proceedings of the 2019 USENIX Annual Technical Conference

is sponsored by USENIX

Extension Framework for File Systems in User space

Ashish Bijlani and Umakishore Ramachandran Georgia Institute of Technology

httpswwwusenixorgconferenceatc19presentationbijlani

Extension Framework for File Systems in User space

Ashish BijlaniGeorgia Institute of Technology

Umakishore RamachandranGeorgia Institute of Technology

AbstractUser file systems offer numerous advantages over their in-kernel implementations such as ease of development and bet-ter system reliability However they incur heavy performancepenalty We observe that existing user file system frameworksare highly general they consist of a minimal interpositionlayer in the kernel that simply forwards all low-level requeststo user space While this design offers flexibility it alsoseverely degrades performance due to frequent kernel-usercontext switching

This work introduces EXTFUSE a framework for develop-ing extensible user file systems that also allows applicationsto register ldquothinrdquo specialized request handlers in the kernelto meet their specific operative needs while retaining thecomplex functionality in user space Our evaluation with twoFUSE file systems shows that EXTFUSE can improve theperformance of user file systems with less than a few hundredlines on average EXTFUSE is available on GitHub

1 IntroductionUser file systems not only offer better security (ie un-privileged execution) and reliability [46] when compared toin-kernel implementations but also ease the developmentand maintenancedebugging processes Therefore manyapproaches to develop user space file systems have alsobeen proposed for monolithic Operating Systems (OS) suchas Linux and FreeBSD While some approaches targetspecific systems [25 45 53] a number of general-purposeframeworks for implementing user file systems also ex-ist [3 14 29 33 47] FUSE [47] in particular is thestate-of-the-art framework for developing user file systemsOver a hundred FUSE file system have been created in aca-demicresearch [113237414449] as well as in productionsettings [9 24 40 52]

Being general-purpose the primary goal of the aforemen-tioned frameworks is to enable easy yet fully-functional im-plementation of file systems in user space supporting multipledifferent functionalities To do so they implement a mini-mal kernel driver that interfaces with the Virtual File System(VFS) operations and simply forwards all low-level requeststo user space For example when an application (app) makesan open() system call the VFS issues a lookup request foreach path component Similarly getxattr requests are is-sued to read security labels while serving write() systemcalls Such low-level requests are simply forwarded to user

space This design offers flexibility to developers to easilyimplement their functionality and apply custom optimizationsbut also incurs a high overhead due to frequent user-kernelswitching and data copying For example despite severalrecent optimizations even a simple passthrough FUSE filesystem can introduce up to 83 overhead on an SSD [50]As a result some FUSE file systems have been replaced byalternative implementations in production [8 22 24]

There have been attempts to address performance issuesof user file system frameworks for example by eliminatinguser-kernel switching in FUSE under certain scenarios [2834] Nevertheless the optimizations proposed pertain to theirspecific use cases and do not address the inherent designlimitations of existing frameworks

We observe that the interfaces exported by existing userfile system frameworks are too low-level and general-purposeAs such they fail to match the specific operative needs offile systems For example getxattr requests during write()can be completely eliminated for files containing no secu-rity labels lookup replies from the daemon could be cachedand validated in the kernel to reduce context switches to userspace Similarly when stacking sandboxing functionalityfor enforcing custom permissions checks in open() systemcall IO requests (eg readwrite) could be passed directlythrough the host file system Nevertheless modifying exist-ing frameworks to efficiently address specific functional andperformance requirements of each use case is impractical

We borrow the idea of safely extending system servicesat runtime from past works [6 13 43 56] and propose toaddress the performance issues in existing user file systemsframeworks by allowing developers to safely extend the func-tionality of the kernel driver at runtime for specialized han-dling of their use case This work introduces EXTFUSE anextension framework for file systems in user space that al-lows developers to define specialized ldquothinrdquo extensions alongwith auxiliary data structures for handling low-level requestsin the kernel The extensions are safely executed under asandboxed runtime environment in the kernel immediately asthe requests are issued from the upper file system layer (egVFS) thereby offering a fine-grained ability to either serveeach request entirely in the kernel (fast path) or fall back tothe existing complex logic in user space (slow path) to achievethe desired operative goals of functionality and performanceThe fast and slow paths can access (and modify) the auxiliarydata structures to define custom logic for handling requests

USENIX Association 2019 USENIX Annual Technical Conference 121

EXTFUSE consists of three components First a helperuser library that provides a familiar set of file system APIsto register extensions and implement custom fast-path func-tionality in a subset of the C language Second a wrapper(no-ops) interposition driver that bridges with the low-levelVFS interfaces and provides the necessary support to for-ward requests to the registered kernel extensions as well as tothe lower file system as needed Third an in-kernel VirtualMachine (VM) runtime that safely executes the extensions

We have built EXTFUSE to work in concert with existinguser file system frameworks to allow both the fast and theexisting slow path to coexist with no overhauling changes tothe design of the target user file system Although there areseveral user file system frameworks this work focuses onlyon FUSE because of its wide-spread use Nonetheless wehave implemented EXTFUSE as in a modular fashion so itcan be easily adopted for others

We added support for EXTFUSE in four popular FUSEfile systems namely LoggedFS [16] Android sdcard daemonMergerFS and BindFS [35] Our evaluation of the first twoshows that EXTFUSE can offer substantial performance im-provements of user file systems by adding less than a fewhundred lines on average

This paper makes the following contributionsbull We identify optimization opportunities in user file sys-

tem frameworks for monolithic OSes ( sect2) and proposeextensible user file systems

bull We present the design (sect3) and architecture (sect34) ofEXTFUSE an extension framework for user file systemsthat offers the performance of kernel file systems whileretaining the safety properties of user file systems

bull We demonstrate the applicability of EXTFUSE by adopt-ing it for FUSE the state-of-the-art user file systemframework and evaluating it on Linux (sect6)

bull We show the practical benefits and limitations of theEXTFUSE with two popular FUSE file systems onedeployed in production (sect62)

2 Background and Extended MotivationThis section provides a brief technical background on FUSEand its limitations that motivate our work

21 FUSEFUSE is the state-of-the-art framework for developing userfile systems It consists of a loadable kernel driver and ahelper user-space library that provides a set of portable APIsto allow users to implement their own fully-functional filesystem as an unprivileged daemon process on Unix-basedsystems with no additional kernel support The driver is asimple interposition layer that only serves as a communica-tion channel between the user-space daemon and the VFSIt registers a new file system to interface with the VFS op-erations and directly forwards all low-level requests to thedaemon as received

The library provides two different sets of APIs Firsta fuse_lowlevel_ops interface that exports all VFS oper-ations such as lookup for path to inode mapping It is usedby file systems that need to access low-level abstractions(eg inodes) for custom optimizations Second a high-levelfuse_operations interface that builds on the low-level APIsIt hides complex abstractions and offers a simple (eg path-based) API for the ease of development Depending on theirparticular use case developers can adopt either of the twoAPIs Furthermore many operations in both APIs are op-tional For example developers can choose to not handleextended attributes (xattrs) operations (eg getxattr)

As apps make system calls the VFS layer forwards all low-level requests to the kernel driver For example to serve theopen() system call on Linux the VFS issues multiple lookuprequests to the driver one for each input path componentSimilarly every write() request is preceded by a getxattrrequest from the VFS to fetch the security labels The driverqueues up all requests along with the relevant parametersfor the daemon through devfuse device file and blocks thecalling app thread until the requests are served The daemonthrough the libfuse interface retrieves the requests from thequeue processes them as needed and enqueues the results forthe driver to read The driver copies the results populatingVFS caches as appropriate (eg page cache for readwritedcache for dir entries from lookup) and wakes the app threadand returns the results to it Once cached the subsequentaccesses to the same data are served with no user-kernelround-trip communication

In addition to the common benefits of user file systems suchas ease of development and maintenance FUSE also providesapp-transparency and fine-grained control to developers overthe low-level API to easily support their custom functionalityFurthermore on Linux the FUSE driver is GPL-licensedwhich detaches the implementation in user space from legalobligations [57] It also supports multiple language bindings(eg Python Go etc) thereby enabling access to ecosystemof third-party libraries for reuse

Given its general-purpose design and numerous advantagesover a hundred FUSE file systems with different functionali-ties have been created A large majority of them are stackablethat is they introduce incremental functionality on the hostfile system [38] FUSE has long served as a popular tool forquick experimentation and prototyping new file systems inacademic and research settings [11 32 37 41 44 49] How-ever more recently a host of FUSE file systems have also beendeployed in production Both Gluster [40] and Ceph [52] clus-ter file systems use a FUSE network client implementationAndroid v44 introduced FUSE sdcard daemon (stackable) toadd multi-user support and emulate FAT functionality on theEXT4 file system [24]

While exporting low-level abstractions and VFS interfacesoffers more control and flexibility to developers this designcomes with a cost It induces frequent user-kernel round-

122 2019 USENIX Annual Technical Conference USENIX Association

Media W xattr (Diff) WO xattr (Diff)

SW RW SW RW

HDD -029 -380 -017 -283SSD -115 -2338 +031 -1224

Table 1 Percentage difference between IO throughput (opssec) forEXT4 vs FUSE StackfsOpt (see sect6) under single thread 4K SeqWrite (SW) and Rand Write (RW) settings on a 60GB file acrossdifferent storage media as reported by Filebench [48] Received 15million getxattr requests

trip communication and data copying and thus inevitablyyields poor runtime performance Nonetheless FUSE hasevolved significantly over the years several optimizationshave been added to minimize the user-kernel communicationzero-copy data transfer (splicing) and utilizing system-wideshared VFS caches (page cache for data IO and dentry cachefor metadata) However despite these optimizations FUSEseverely degrades runtime performance in a host of scenar-ios For instance data caching improves IO throughput bybatching requests (eg read-aheads small writes) but it doesnot help apps that perform random reads or demand low writelatency (eg databases) Splicing is only used for over 4K re-quests Therefore apps with small writesreads (eg Androidapps etc) will incur data copying overheads Worse yet theoverhead is higher with faster storage media Even a simplepassthrough (no-ops) FUSE file system can introduce up to83 overhead for metadata-heavy workloads on SSD [50]

The performance penalty incurred by user file systems over-shadows their benefits Consequently some user file systemshave been replaced with alternative implementations in pro-duction For instance Android v70 replaced the sdcard dae-mon with its in-kernel implementation after several years [24]Ceph [52] adopted an in-kernel client for Linux kernel [8]

22 Generality vs SpecializationWe observe that being highly general-purpose the FUSEframework induces unnecessary user-kernel communicationin many cases yielding low throughput and high latency Forinstance getxattr requests generated by the VFS duringwrite() system calls (one per write) can double the numberof user-kernel transitions which decreases the IO through-put of sequential writes by over 27 and random writes byover 44 compared to native (EXT4) performance (Table 1)Moreover the penalty incurred is higher with the faster media

Nevertheless simple filtering or caching metadata repliesin the kernel can substantially reduce user-kernel communi-cation For instance by caching the last getxattr reply inthe FUSE driver and simply validating the cached state forevery subsequent getxattr request from the VFS on the samefile unnecessary user-kernel transitions can be eliminated toachieve significant improvement in write performance Simi-larly replies from other metadata operations such as lookup

and getattr can be cached validated and served entirelywithin the kernel Note that custom validation of cached meta-data is imperative and lack of support to do so may result inincorrect behavior as happens in the case of optimized FUSEthat leverages VFS caches to serve requests (sect51)

Many stackable user file systems add a thin layer of func-tionality they perform simple checks in a few operations andpass remaining requests directly through the host (lower) filesystem LoggedFS [16] filters requests that must be loggedand do so by accessing host file system services Union filesystems such as MergerFS [20] determine the backend hostfile in open() and redirects IO requests to it Android sdcarddaemon performs access permission checks only in metadataoperations (eg open lookup) but data IO requests (egread write etc) are simply forwarded to the lower file sys-tem Thin functionality that realizes such use cases does notneed any complex processing in user space and thereforecan easily be stacked in the kernel thereby avoiding expen-sive user-kernel switching to yield lower latency and higherthroughput for the same functionality Furthermore data IOrequests could be directly forwarded to the host file system

The FUSE framework offers a few configuration (config)options to the developers to tune the behavior of its kerneldriver for their use case However those options are coarse-grained and implemented as fixed checks embedded in thedriver code thus it offers limited static control For examplefor file systems that do not support certain operations (eggetxattr) the FUSE driver caches this knowledge upon firstENOSUPPORT reply and does not issue such requests subse-quently While this benefits the file systems that completelyomit certain functionality (eg security xattr labels) it doesnot allow custom and fine-grained filtering of requests thatmay be desired by some file systems supporting only partialfunctionality For example file systems providing encryp-tion (or compression) functionality may desire a fine-grainedcustom control over the kernel driver to skip the decryption(or decompression) operation in user space during read()requests on non-sensitive (or unzipped) files

As such the FUSE framework proves to be too low-leveland general-purpose in many cases While it enables a numberof different use cases modifying the framework to efficientlyhandle special needs of each use case is impractical This isa typical unbalanced generality vs specialization problemwhich can be addressed by extending the functionality of theFUSE driver in the kernel [6 13 43]

3 DesignIn this section we 1) present an overview of EXTFUSE 2)discuss the design goals and challenges we faced and 3)mechanisms we adopted to address those challenges

31 OverviewEXTFUSE is a framework for developing extensible FUSEfile systems for UNIX-like monolithic OSes It allows the

USENIX Association 2019 USENIX Annual Technical Conference 123

unprivileged FUSE daemon processes to register ldquothinrdquo exten-sions in the kernel for specialized handling of low-level filesystem requests while retaining their existing complex logicin user space to achieve the desired level of performance

The registered extensions are safely executed under a sand-boxed eBPF runtime environment in the kernel ( sect33) im-mediately as requests are issued from the upper file system(eg VFS) Sandboxing enables the FUSE daemon to safelyextend the functionality of the driver at runtime and offers afine-grained ability to either serve each request entirely in thekernel or fall back to user space thereby offering safety ofuser space and performance of kernel file systems

EXTFUSE also provides a set of APIs to create sharedkey-value data structures (called maps) that can host abstractdata blobs The user-space daemon and its kernel extensionscan leverage maps to storemanipulate their custom data typesas needed to work in concert and serve file system requests inthe kernel without incurring expensive user-kernel round-tripcommunication if deemed unnecessary

32 Goals and ChallengesThe over-arching goal of EXTFUSE is to carefully balancethe safety and runtime extensibility of user file systems toachieve the desired level of performance and specializedfunctionality Nevertheless in order for developers to useEXTFUSE it must also be easy to adopt We identify thefollowing concrete design goalsDesign Compatibility The abstractions and interfaces of-fered by EXTFUSE framework must be compatible withFUSE without hindering existing functionality or proper-ties It must be general-purpose so as to support multipledifferent use cases Developers must further be able to adoptEXTFUSE for their use case without overhauling changesto their existing design It must be easy for them to imple-ment specialized extensions with little to no knowledge of theunderlying implementation detailsModular Extensibility EXTFUSE must be highly modularand limit any unnecessary new changes to FUSE Particularlydevelopers must be able to retain their existing user-spacelogic and introduce specialized extensions only if neededBalancing Safety and Performance Finally withEXTFUSE even unprivileged (and untrusted) FUSEdaemon processes must be able to safely extend thefunctionality of the driver as needed to offer performancethat is as close as possible to the in-kernel implementationHowever unlike Microkernels [2 23 30] that host systemservices in separate protection domains as user processesor the OSes that have been developed with safe runtimeextensibility as a design goal [6 13] extending systemservices of general-purpose UNIX-like monolithic OSesposes a design trade-off question between the safety andperformance requirements

Untrusted extensions must be as lightweight as possiblewith their access restricted to only a few well-defined APIs

to guarantee safety For example kernel file systems offernear-native performance but executing complex logic in thekernel results in questionable reliability Additionally mostOS kernels employ Address Space Randomization [36] DataExecution Prevention [5] etc for code protection and hid-ing memory pointers Providing unrestricted kernel accessto extensions can render such protections useless There-fore extensions must not be able to access arbitrary memoryaddresses or leak pointer values to user space

However severely restricting extensibility can prevent userfile systems from fully meeting their operative performanceand functionality goals thus defeating the purpose of exten-sions in the first place Therefore EXTFUSE must carefullybalance safety and performance goalsCorrectness Furthermore specialized extensions can alterthe existing design of user file systems which can lead tocorrectness issues For example there will be two separatepaths (fast and slow) both operating on the requests and datastructs at the same time The framework must provide a wayfor them to synchronize and offer safe concurrent accesses

33 eBPFEXTFUSE leverages extended BPF (eBPF) [26] an in-kernelVirtual Machine (VM) runtime framework to load and safelyexecute user file system extensionsRicher functionality eBPF is an extension of classic Berke-ley Packet Filters (BPF) an in-kernel interpreter for a pseudomachine architecture designed to only accept simple networkfiltering rules from user space It enhances BPF to includemore versatility such as 64-bit support a richer instruction set(eg call cond jump) more registers and native performancethrough JIT compilationHigh-level language support The eBPF bytecode backendis also supported by ClangLLVM compiler toolchain whichallows functionality logic to be written in a familiar high-levellanguage such as C and GoSafety The eBPF framework provides a safe execution en-vironment in the kernel It prohibits execution of arbitrarycode and access to arbitrary kernel memory regions insteadthe framework restricts access to a set of kernel helper APIsdepending on the target kernel subsystem (eg network) andrequired functionality (eg packet handling) The frame-work includes a static analyzer (called verifier) that checksthe correctness of the bytecode by performing an exhaus-tive depth-first search through its control flow graph to detectproblems such as infinite loops out-of-bound and illegalmemory errors The framework can also be configured toallow or deny eBPF bytecode execution request from unprivi-leged processesKey-Value Maps eBPF allows user space to create map datastructures to store arbitrary key-value blobs using systemcalls and access them using file descriptors Maps are alsoaccessible to eBPF bytecode in the kernel thus providing acommunication channel between user space and the bytecode

124 2019 USENIX Annual Technical Conference USENIX Association

to define custom key-value types and share execution state ordata Concurrent accesses to maps are protected under read-copy update (RCU) synchronization mechanism Howevermaps consume unswappable kernel memory Furthermorethey are either accessible to everyone (eg by passing filedescriptors) or only to CAP_SYS_ADMIN processes

eBPF is a part of the Linux kernel and is already used heav-ily by networking tracing and profiling subsystems Givenits rich functionality and safety properties we adopt eBPFfor providing support for extensible user file systems Specifi-cally we define a white-list of kernel APIs (including theirparameters and return types) and abstractions that user filesystem extensions can safely use to realize their specializedfunctionality The eBPF verifier utilizes the whitelist to vali-date the correctness of the extensions We also build on eBPFabstractions (eg maps) and apply further access restrictionsto enable safe in-kernel execution as needed

34 Architecture

Figure 1 Architectural view of the EXTFUSE framework Thecomponents modified or introduced have been highlighted

Figure 1 shows the architecture of the EXTFUSE frame-work It is enabled by three core components namely a kernelfile system (driver) a user library (libExtFUSE) and an in-kernel eBPF virtual machine runtime (VM)

The EXTFUSE driver uses interposition technique to in-terface with FUSE at low-level file system operations How-ever unlike the FUSE driver that simply forwards file systemrequests to user space the EXTFUSE driver is capable ofdirectly delivering requests to in-kernel handlers (extensions)It can also forward a few restricted set of requests (eg readwrite) to the host (lower) file system if present The latteris needed for stackable user file systems that add thin func-tionality on top of the host file system libExtFUSE exports aset of APIs and abstractions for serving requests in the kernelhiding the underlying implementation details

Use of libExtFUSE is optional and independent of libfuseThe existing file system handlers registered with libfuse

FS Interface API(s) Abstractions Description

Low-level fuse_lowlevel_ops Inode FS Ops

Kernel Access API(s) Abstractions Description

eBPF Funcs bpf_ UID PID etc Helper FuncsFUSE extfuse_reply_ fuse_reply_ Req OutputKernel bpf_set_pasthru FileDesc Enable PthruKernel bpf_clear_pasthru FileDesc Disable Pthru

DataStructs API(s) Abstractions Description

SHashMap CRUD KeyVal Hosts arbitrary data blobsInodeMap CRUD FileDesc Hosts upper-lower inode pairs

Table 2 APIs and abstractions provided by EXTFUSE It providesFUSE-like file system interface for easy portability CRUD (createread update and delete) APIs are offered for map data structuresto operate on KeyValue pairs Kernel accesses are restricted tostandard eBPF kernel helper functions We introduced APIs toaccess the same FUSE request parameters as available to user space

continue to reside in user space Therefore their invocationincurs context switches and thus we refer to their executionas the slow path With EXTFUSE user space can also registerkernel extensions that are invoked immediately as file systemrequests are received from the VFS in order to allow servingthem in the kernel We refer to the in-kernel execution as thefast path Depending upon the return values from the fastpath the requests can be marked as served or be sent to theuser-space daemon via the slow path to avail any complexprocessing as needed Fast path can also return a special valuethat instructs the EXTFUSE driver to interpose and forwardthe request to the lower file system However this feature isonly available to stackable user file systems and is verifiedwhen the extensions are loaded in the kernel

The fast path interfaces exported by libExtFUSE are thesame as those exported by libfuse to the slow path Thisis important for easy transfer of design and portability Weleverage eBPF support in the LLVMClang compiler toolchainto provide developers with a familiar set of APIs and allowthem to implement their custom functionality logic in a subsetof the C language

The extensions are loaded and executed inside the kernelunder the eBPF VM sandbox thereby providing user space afine-grained ability to safely extend the functionality of FUSEkernel driver at runtime for specialized handling of each filesystem request

35 EXTFUSE APIs and AbstractionslibExtFUSE provides a set of high-level APIs and abstractionsto the developers for easy implementation of their specializedextensions hiding the complex implementation details Ta-ble 2 summarizes the APIs For handling file system opera-tions libExtFUSE exports the familiar set of FUSE interfacesand corresponding abstractions (eg inode) for design com-patibility Both low-level as well as high-level file systeminterfaces are available offering flexibility and developmentease Furthermore as with libfuse the daemon can reg-

USENIX Association 2019 USENIX Annual Technical Conference 125

ister extensions for a few or all of the file system APIs of-fering them flexibility to implement their functionality withno additional development burden The extensions receivethe same request parameters (struct fuse_[inout]) as theuser-space daemon This design choice not only conformsto the principle of least privilege but also offers the user-space daemon and the extensions the same interface for easyportability

For hostingsharing data between the user daemon andkernel extensions libExtFUSE provides a secure variant ofeBPF HashMap keyvalue data structure called SHashMap thatstores arbitrary keyvalue blobs Unlike regular eBPF mapsthat are either accessible to all user processes or only toCAP_SYS_ADMIN processes SHashMap is only accessible bythe unprivileged daemon that creates it libExtFUSE furtherabstracts low-level details of SHashMap and provides high-level CRUD APIs to create read update and delete entries(keyvalue pairs)

EXTFUSE also provides a special InodeMap to enablepassthrough IO feature for stackable EXTFUSE file sys-tems (sect52) Unlike SHashMap that stores arbitrary entriesInodeMap takes open file handle as key and stores a pointer tothe corresponding lower (host) inode as value Furthermoreto prevent leakage of inode object to user space the InodeMapvalues can only be read by the EXTFUSE driver

36 WorkflowTo understand how EXTFUSE facilitates implementation ofextensible user file systems we describe its workflow in de-tail Upon mounting the user file system FUSE driver sendsFUSE_INIT request to the user-space daemon At this pointthe user daemon checks if the OS kernel supports EXTFUSEframework by looking for FUSE_CAP_ExtFUSE flag in the re-quest parameters If supported the daemon must invokelibExtFUSE init API to load the eBPF program that containsspecialized handlers (extensions) into the kernel and regis-ter them with the EXTFUSE driver This is achieved usingbpf_load_prog system call which invokes eBPF verifier tocheck the integrity of the extensions If failed the programis discarded and the user-space daemon is notified of the er-rors The daemon can then either exit or continue with defaultFUSE functionality If the verification step succeeds and theJIT engine is enabled the extensions are processed by theJIT compiler to generate machine assembly code ready forexecution as needed

Extensions are installed in a bpf_prog_type map (calledextension map) which serves effectively as a jump tableTo invoke an extension the FUSE driver simply executesa bpf_tail_call (far jump) with the FUSE operation code(eg FUSE_OPEN) as an index into the extension map Oncethe eBPF program is loaded the daemon must informEXTFUSE driver about the kernel extensions by replyingto FUSE_INIT containing identifiers to the extension map

Once notified EXTFUSE can safely load and execute the

Component Version Loc Modified Loc New

FUSE kernel driver 4110 312 874FUSE user-space library 320 23 84EXTFUSE user-space library - - 581

Table 3 Changes made to the existing Linux FUSE framework tosupport EXTFUSE functionality

extensions at runtime under the eBPF VM environment Ev-ery request is first delivered to the fast path which may decideto 1) serve it (eg using data shared between the fast andslow paths) 2) pass the request through to the lower file sys-tem (eg after modifying parameters or performing accesschecks) or 3) take the slow path and deliver the request to userspace for complex processing logic (eg data encryption)as needed Since the execution path is chosen per-requestindependently and the fast path is always invoked first thekernel extensions and user daemon can work in concert andsynchronize access to requests and shared data structures Itis important to note that the EXTFUSE driver only acts as athin interposition layer between the FUSE driver and kernelextensions and in some cases between the FUSE driver andthe lower file system As such it does not perform any IOoperation or attempts to serve requests on its own

4 Implementation

To implement EXTFUSE we provided eBPF support forFUSE Specifically we added additional kernel helper func-tions and designed two new map types to support secure com-munication between the user-space daemon and kernel exten-sions as well as support for passthrough access in readwriteWe modified FUSE driver to first invoke registered eBPF han-dlers (extensions) Passthrough implementation is adoptedfrom WrapFS [54] a wrapper stackable in-kernel file systemSpecifically we modified FUSE driver to pass IO requestsdirectly to the lower file system

Since with EXTFUSE developers can install extensions tobypass the user-space daemon and pass IO requests directlyto the lower file system a malicious process could stack anumber of EXTFUSE file systems on top of each other andcause the kernel stack to overflow To guard against suchattacks we limit the number of EXTFUSE layers that couldbe stacked on a mount point We rely on s_stack_depthfield in the super-block to track the number of stacked layersand check it against FILESYSTEM_MAX_STACK_DEPTH whichwe limit to two Table 3 reports the number of lines of codefor EXTFUSE We also modified libfuse to allow apps toregister kernel extensions

5 Optimizations

Here we describe a set of optimizations that can be enabledby leveraging custom kernel extensions in EXTFUSE to im-plement in-kernel handling of file system requests

126 2019 USENIX Annual Technical Conference USENIX Association

1 void handle_lookup(fuse_req_t req fuse_ino_t pino2 const char name) 3 lookup or create node cname parent pino 4 struct fuse_entry_param e5 if (find_or_create_node(req pino name ampe)) return6 + lookup_key_t key = pino name7 + lookup_val_t val = 0not stale ampe8 + extfuse_insert_shmap(ampkey ampval) cache this entry 9 fuse_reply_entry(req ampe)

10

Figure 2 FUSE daemon lookup handler in user space WithEXTFUSE lines 6-8 (+) enable caching replies in the kernel

51 Customized in-kernel metadata cachingMetadata operations such as lookup and getattr are fre-quently issued and thus form high sources of latency in FUSEfile systems [50] Unlike VFS caches that are only reactiveand fixed in functionality EXTFUSE can be leveraged toproactively cache metadata replies in the kernel Kernel ex-tensions can be installed to manage and serve subsequentoperations from caches without switching to user spaceExample To illustrate let us consider the lookup operationIt is the most common operation issued internally by the VFSfor serving open() stat() and unlink() system calls Eachcomponent of the input path string is searched using lookupto fetch the corresponding inode data structure Figure 2lists code fragment for FUSE daemon handler that serveslookup requests in user space (slow path) The FUSE lookupAPI takes two input parameters the parent node ID andthe next path component name The node ID is a 64-bitinteger that uniquely identifies the parent inode The daemonhandler function traverses the parent directory searching forthe child entry corresponding to the next path componentUpon successful search it populates the fuse_entry_paramdata structure with the node ID and attributes (eg size) ofthe child and sends it to the FUSE driver which creates anew inode for the dentry object representing the child entry

With EXTFUSE developers could define a SHashMap thathosts fuse_entry_param replies in the kernel (lines 7-10) Acomposite key generated from the parent node identifier andthe next path component string arguments is used as an indexinto the map for inserting corresponding replies Since themap is also accessible to the extensions in the kernel sub-sequent requests could be served from the map by installingthe EXTFUSE lookup extension (fast path) Figure 3 listsits code fragment The extension uses the same compositekey as an index into the hash map to search whether the cor-responding fuse_entry_param entry exists If a valid entryis found the reference count (nlookup) is incremented and areply is sent to the FUSE driver

Similarly replies from user space daemon for other meta-data operations such as getattr getxattr and readlinkcould be cached using maps and served in the kernel by re-spective extensions (Table 4) Network FUSE file systemssuch as SshFS [39] and Gluster [40] already perform aggres-sive metadata caching and batching at client to reduce thenumber of remote calls to the server SshFS [39] for example

1 int lookup_extension(extfuse_req_t req fuse_ino_t pino2 const char name) 3 lookup in map bail out if not cached or stale 4 lookup_key_t key = pino name5 lookup_val_t val = extfuse_lookup_shmap(ampkey)6 if (val || atomic_read(ampval-gtstale)) return UPCALL7 EXAMPLE Android sdcard daemon perm check 8 if (check_caller_access(pino name)) return -EACCES9 populate output incr count (used in FUSE_FORGET)

10 extfuse_reply_entry(req ampval-gte)11 atomic_incr(ampval-gtnlookup 1)12 return SUCCESS13

Figure 3 EXTFUSE lookup kernel extension that serves validcached replies without incurring any context switches Customizedchecks could further be included Android sdcard daemon permis-sion check is shown as an example (see Figure 10)

implements its own directory attribute and symlink cachesWith EXTFUSE such caches could be implemented in thekernel for further performance gainsInvalidation While caching metadata in the kernel reducesthe number of context switches to user space developers mustalso carefully invalidate replies as necessary For examplewhen a file (or dir) is deleted or renamed the correspondingcached lookup replies must be invalidated Invalidations canbe performed in user space by the relevant request handlers orin the kernel by installing their extensions before new changesare made However the former case may introduce raceconditions and produce incorrect results because all requeststo user space daemon are queued up by the FUSE driverwhereas requests to the extensions are not Cached lookupreplies can be invalidated in extensions for unlink rmdir andrename operations Similarly when attributes or permissionson a file change cached getattr replies can be invalidated insetattr extension Our design ensures race-free invalidationby executing the extensions before forwarding requests touser space daemon where the changes may be madeAdvantages over VFS caching As previously mentionedrecent optimizations added to FUSE framework leverage VFScaches to reduce user-kernel context switching For instanceby specifying non-zero entry_valid and attr_valid timeoutvalues dentries and inodes cached by the VFS from previ-ous lookup operations could be utilized to serve subsequentlookup and getattr requests respectively However theVFS offers no control to the user file system over the cacheddata For example if the file system is mounted withoutthe default_permissions parameter VFS caching of inodesintroduces a security bug [21] This is because the cached per-missions are only checked for first accessing user In contrastwith EXTFUSE developers can define their own metadatacaches and install custom code to manage them For instanceextensions can perform uid-based access permission checksbefore serving requests from the caches to obviate the afore-mentioned security issue (Figure 10)

Additionally unlike VFS caches that are only reactiveEXTFUSE enables proactive caching For example since areaddir request is expected after an opendir call the user-space daemon could proactively cache directory entries in the

USENIX Association 2019 USENIX Annual Technical Conference 127

Metadata Map Key Map Value Caching Operations Serving Extensions Invalidation Operations

Inode ltnodeID namegt fuse_entry_param lookup create mkdir mknod lookup unlink rmdir renameAttrs ltnodeIDgt fuse_attr_out getattr lookup getattr setattr unlink rmdirSymlink ltnodeIDgt link path symlink readlink readlink unlinkDentry ltnodeIDgt fuse_dirent opendir readdir readdir releasedir unlink rmdir renameXAttrs ltnodeID labelgt xattr value open getxattr listxattr getattr listxattr close setxattr removexattr

Table 4 Metadata can be cached in the kernel using eBPF maps by the user-space daemon and served by kernel extensions

kernel by inserting them in a BPF map while serving opendirrequests to reduce transitions to user space Alternativelysimilar to read-ahead optimization proactive caching of sub-sequent directory entries could be performed during the firstreaddir call to the user-space daemon Memory occupied bycached entries could then be freed by the releasedir handlerin user space that deletes them from the map Similarly se-curity labels on a file could be cached during the open call touser space and served in the kernel by getxattr extensionsNonetheless since eBPF maps consume kernel memory de-velopers must carefully manage caches and limit the numberof map entries to keep memory usage under check

52 Passthrough IO for stacking functionality

Many user file systems are stackable with a thin layer offunctionality that does not require complex processing inthe user-space For example LoggedFS [16] filters requeststhat must be logged logs them as needed and then simplyforwards them to the lower file system User-space unionfile systems such as MergerFS [20] determine the backendhost file in open and redirects IO requests to it BindFS [35]mirrors another mount point with custom permissions checksAndroid sdcard daemon performs access permission checksand emulates the case-insensitive behavior of FAT only inmetadata operations (eg lookup open etc) but forwardsdata IO requests directly to the lower file system For suchsimple cases the FUSE API proves to be too low-level andincurs unnecessarily high overhead due to context switching

With EXTFUSE readwrite IO requests can take the fastpath and directly be forwarded to the lower (host) file systemwithout incurring any context-switching if the complex slow-path user-space logic is not needed Figure 4 shows howthe user-space daemon can install the lower file descriptorin InodeMap while handling open() system call for notifyingthe EXTFUSE driver to store a reference to the lower inodekernel object With the custom_filtering_logic(path) con-dition this can be done selectively for example if accesspermission checks pass in Android sdcard daemon Simi-larly BindFS and MergerFS can adopt EXTFUSE to availpassthrough optimization The readwrite kernel extensionscan check in InodeMap to detect whether the target file is setupfor passthrough access If found EXTFUSE driver can be in-structed with a special return code to directly forward the IOrequest to the lower file system with the corresponding lowerinode object as parameter Figure 5 shows a template read

1 void handle_open(fuse_req_t req fuse_ino_t ino2 const struct fuse_open_in in) 3 file represented by ino inode num 4 struct fuse_open_out out char path[PATH_MAX]5 int len fd = open_file(ino in-gtflags path ampout)6 if (fd gt 0 ampamp custom_filtering_logic(path)) 7 + install fd in inode map for pasthru 8 + imap_key_t key = out-gtfh9 + imap_val_t val = fd lower fd

10 + extfuse_insert_imap(ampkey ampval)11

Figure 4 FUSE daemon open handler in user space WithEXTFUSE lines 7-9 (+) enable passthrough access on the file1 int read_extension(extfuse_req_t req fuse_ino_t ino2 const struct fuse_read_in in) 3 lookup in inode map passthrough if exists 4 imap_key_t key = in-gtfh5 if (extfuse_lookup_imap(ampkey)) return UPCALL6 EXAMPLE LoggedFS log operation 7 log_op(req ino FUSE_READ in sizeof(in))8 return PASSTHRU forward req to lower FS 9

Figure 5 The EXTFUSE read kernel extension returns PASSTHRU toforward request directly to the lower file system Custom thin func-tionality could further be pushed in the kernel LoggedFS loggingfunction is shown as an example (see Figure 11)

extension Kernel extensions can include additional logic orchecks before returning For instance LoggedFS readwriteextensions can filter and log operations as needed sect62

6 EvaluationTo evaluate EXTFUSE we answer the following questionsbull Baseline Performance How does an EXTFUSE imple-

mentation of a file system perform when compared to itsin-kernel and FUSE implementations (sect61)

bull Use cases What kind of existing FUSE file systems canbenefit from EXTFUSE and what performance improve-ments can they expect ( sect62)

61 PerformanceTo measure the baseline performance of EXTFUSE weadopted the simple no-ops (null) stackable FUSE file sys-tem called Stackfs [50] This user-space daemon serves allrequests by forwarding them to the host (lower) file system Itincludes all recent FUSE optimizations (Table 5) We evaluateStackfs under all possible EXTFUSE configs listed in Table 5Each config represents a particular level of performance thatcould potentially be achieved for example by caching meta-data in the kernel or directly passing readwrite requeststhrough the host file system for stacking functionality To putour results in context we compare our results with EXT4 and

128 2019 USENIX Annual Technical Conference USENIX Association

Figure 6 Throughput(opssec) for EXT4 and FUSEEXTFUSE Stackfs (w xattr) file systems under different configs (Table 5) as measuredby Random Read(RR)Write(RW) Sequential Read(SR)Write(SW) Filebench [48] data micro-workloads with IO Sizes between 4KB-1MBand settings Nth N threads Nf N files We use the same workloads as in [50]

Figure 7 Number of file system request received by the daemon inFUSEEXTFUSE Stackfs (w xattr) under workloads in Figure 6Only a few relevant request types are shown

Figure 8 Throughput(Opssec) for EXT4 and FUSEEXTFUSEStackfs (w xattr) under different configs (Table 5) as measuredby Filebench [48] Creation(C) Deletion(D) Reading(R) metadatamicro-workloads on 4KB files and FileServer(F) WebServer(W)macro-workloads with settings NthN threads NfN files

the optimized FUSE implementation of Stackfs (Opt)Testbed We use the same experiments and settings as in [50]Specifically we used EXT4 because of its popularity as thehost file system and ran benchmarks to evaluate Howeversince FUSE performance problems were reported to be moreprominent with a faster storage medium we only carry outour experiments with SSDs We used a Samsung 850 EVO250GB SSD installed on an Asus machine with Intel Quad-Core i5-3550 33 GHz and 16GB RAM running Ubuntu16043 Further to minimize any variability we formatted the

Config File System Optimizations

Opt [50] FUSE 128K Writes Splice WBCache MltThrdMDOpt EXTFUSE Opt + Caches lookup attrs xattrsAllOpt EXTFUSE MDOpt + Pass RW reqs through host FS

Table 5 Different Stackfs configs evaluated

SSD before each experiment and disabled EXT4 lazy inodeinitialization To evaluate file systems that implement xattroperations for handling security labels (eg in Android) ourimplementation of Opt supports xattrs and thus differs fromthe implementation in [50]Workloads Our workload consists of Filebench [48] microand synthetic macro benchmarks to test each config withmetadata- and data-heavy operations across a wide range ofIO sizes and parallelism settings We measure the low-levelthroughput (opssec) Our macro-benchmarks consist of asynthetic file server and web serverMicro Results Figure 6 shows the results of micro workloadunder different configs listed in Table 5

Reads Due to the default 128KB read-ahead feature ofFUSE the sequential read throughput on a single file with asingle thread for all IO sizes and under all Stackfs configsremained the same Multi-threading improved for the sequen-tial read benchmark with 32 threads and 32 files Only onerequest was generated per thread for lookup and getattr op-erations Hence metadata caching in MDOpt was not effectiveSince FUSE Opt performance is already at par with EXT4the passthrough feature in AllOpt was not utilized

Unlike sequential reads small random reads could not takeadvantage of the read-ahead feature of FUSE Additionally4KB reads are not spliced and incur data copying across user-kernel boundary With 32 threads operating on a single filethe throughput improves due to multi-threading in Opt How-ever degradation is observed with 4KB reads AllOpt passesall reads through EXT4 and hence offers near-native through-put In some cases the performance was slightly better than

USENIX Association 2019 USENIX Annual Technical Conference 129

EXT4 We believe that this minor improvement is due todouble caching at the VFS layer Due to a single request perthread for metadata operations no improvement was seenwith EXTFUSE metadata caching

Writes During sequential writes the 128K big writes andwriteback caching in Opt allow the FUSE driver to batch smallwrites (up to 128KB) together in the page cache to offer ahigher throughput However random writes are not batchedAs a result more write requests are delivered to user spacewhich negatively affects the throughput Multiple threads on asingle file perform better for requests bigger than 4KB as theyare spliced With EXTFUSE AllOpt all writes are passedthrough the EXT4 file system to offer improved performance

Write throughput degrades severely for FUSE file systemsthat support extended attributes because the VFS issues agetxattr request before every write Small IO requestsperform worse as they require more write which generatemore getxattr requests Opt random writes generated 30xfewer getxattr requests for 128KB compared to 4KB writesresulting in a 23 decrease in the throughput of 4KB writes

In contrast MDOpt caches the getxattr reply in the kernelupon the first call and serves subsequent getxattr requestswithout incurring further transitions to user space Figure 7compares the number of requests received by the user-spacedaemon in Opt and MDOpt Caching replies reduced the over-head for 4KB workload to less than 5 Similar behavior wasobserved with both sequential writes and random writesMacro Results Figure 8 shows the results of macro-workloads and synthetic server workloads emulated usingFilebench under various configs Neither of the EXTFUSEconfigs offer improvements over FUSE Opt under creationand deletion workloads as these metadata-heavy workloadscreated and deleted a number of files respectively This isbecause no metadata caching could be utilized by MDOpt Sim-ilarly no passthrough writes were utilized with AllOpt since4KB files were created and closed in user space In contrastthe File and Web server workloads under EXTFUSE utilizedboth metadata caching and passthrough access features andimproved performance We saw a 47 89 and 100 dropin lookup getattr and getattr requests to user space un-der MDOpt respectively when configured to cache up to 64Kfor each type of request AllOpt further enabled passthroughreadwrite requests to offer near native throughput for bothmacro reads and server workloadsReal Workload We also evaluated EXTFUSE with two realworkloads namely kernel decompression and compilation of4180 Linux kernel We created three separate caches forhosting lookup getattr and getxattr replies Each cachecould host up to 64K entries resulting in allocation of up to atotal of 50MB memory when fully populated

The kernel compilation make tinyconfig make -j4 ex-periment on our test machine (see sect6) reported a 52 drop incompilation time from 3974 secs under FUSE Opt to 3768secs with EXTFUSE MDOpt compared to 3091 secs with

EXT4 This was due to over 75 99 and 100 decreasein lookup getattr and getxattr requests to user spacerespectively (Figure 9) getxattr replies were proactivelycached while handling open requests thus no transitions touser space were observed for serving xattr requests WithEXTFUSE AllOpt the compilation time further dropped to3364 secs because of 100 reduction in read and writerequests to user space

In contrast the kernel decompression tar xf experimentreported a 635 drop in the completion time from 1102secs under FUSE Opt to 1032 secs with EXTFUSE MDOptcompared to 527 secs with EXT4 With EXTFUSE AllOptthe decompression time further dropped to 867 secs due to100 reduction in read and write requests to user spaceas shown in Figure 9 Nevertheless reducing the numberof cached entries for metadata requests to 4K resulted in adecompression time of 1087 secs (253 increase) due to3555 more getattr requests to user space This suggeststhat developers must efficiently manage caches

Figure 9 Linux kernel 4180 untar (decompress) and compilationtime taken with StackFS under FUSE and EXTFUSE settings Num-ber of metadata and IO requests are reduced with EXTFUSE

62 Use casesWe ported four real-world stackable FUSE file systemsnamely LoggedFS Android sdcard daemon MergerFSand BindFS to EXTFUSE and enabled both metadatacaching sect51 and passthrough IO sect52 optimizations

File System Functionality Ext Loc

StackFS [50] No-ops File System 664BindFS [35] Mirroring File System 792Android sdcard [24] Perm checks amp FAT Emu 928MergerFS [20] Union File System 686LoggedFS [16] Logging File System 748

Table 6 Lines of code (Loc) of kernel extensions required to adoptEXTFUSE for existing FUSE file systems We added support formetadata caching as well as RW passthrough

As EXTFUSE allows file systems to retain their exist-ing FUSE daemon code as the default slow path adoptingEXTFUSE for real-world file systems is easy On averagewe made less than 100 lines of changes to the existing FUSEcode to invoke EXTFUSE helper library functions for ma-nipulating kernel extensions including maps We added ker-

130 2019 USENIX Annual Technical Conference USENIX Association

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 2: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

Extension Framework for File Systems in User space

Ashish BijlaniGeorgia Institute of Technology

Umakishore RamachandranGeorgia Institute of Technology

AbstractUser file systems offer numerous advantages over their in-kernel implementations such as ease of development and bet-ter system reliability However they incur heavy performancepenalty We observe that existing user file system frameworksare highly general they consist of a minimal interpositionlayer in the kernel that simply forwards all low-level requeststo user space While this design offers flexibility it alsoseverely degrades performance due to frequent kernel-usercontext switching

This work introduces EXTFUSE a framework for develop-ing extensible user file systems that also allows applicationsto register ldquothinrdquo specialized request handlers in the kernelto meet their specific operative needs while retaining thecomplex functionality in user space Our evaluation with twoFUSE file systems shows that EXTFUSE can improve theperformance of user file systems with less than a few hundredlines on average EXTFUSE is available on GitHub

1 IntroductionUser file systems not only offer better security (ie un-privileged execution) and reliability [46] when compared toin-kernel implementations but also ease the developmentand maintenancedebugging processes Therefore manyapproaches to develop user space file systems have alsobeen proposed for monolithic Operating Systems (OS) suchas Linux and FreeBSD While some approaches targetspecific systems [25 45 53] a number of general-purposeframeworks for implementing user file systems also ex-ist [3 14 29 33 47] FUSE [47] in particular is thestate-of-the-art framework for developing user file systemsOver a hundred FUSE file system have been created in aca-demicresearch [113237414449] as well as in productionsettings [9 24 40 52]

Being general-purpose the primary goal of the aforemen-tioned frameworks is to enable easy yet fully-functional im-plementation of file systems in user space supporting multipledifferent functionalities To do so they implement a mini-mal kernel driver that interfaces with the Virtual File System(VFS) operations and simply forwards all low-level requeststo user space For example when an application (app) makesan open() system call the VFS issues a lookup request foreach path component Similarly getxattr requests are is-sued to read security labels while serving write() systemcalls Such low-level requests are simply forwarded to user

space This design offers flexibility to developers to easilyimplement their functionality and apply custom optimizationsbut also incurs a high overhead due to frequent user-kernelswitching and data copying For example despite severalrecent optimizations even a simple passthrough FUSE filesystem can introduce up to 83 overhead on an SSD [50]As a result some FUSE file systems have been replaced byalternative implementations in production [8 22 24]

There have been attempts to address performance issuesof user file system frameworks for example by eliminatinguser-kernel switching in FUSE under certain scenarios [2834] Nevertheless the optimizations proposed pertain to theirspecific use cases and do not address the inherent designlimitations of existing frameworks

We observe that the interfaces exported by existing userfile system frameworks are too low-level and general-purposeAs such they fail to match the specific operative needs offile systems For example getxattr requests during write()can be completely eliminated for files containing no secu-rity labels lookup replies from the daemon could be cachedand validated in the kernel to reduce context switches to userspace Similarly when stacking sandboxing functionalityfor enforcing custom permissions checks in open() systemcall IO requests (eg readwrite) could be passed directlythrough the host file system Nevertheless modifying exist-ing frameworks to efficiently address specific functional andperformance requirements of each use case is impractical

We borrow the idea of safely extending system servicesat runtime from past works [6 13 43 56] and propose toaddress the performance issues in existing user file systemsframeworks by allowing developers to safely extend the func-tionality of the kernel driver at runtime for specialized han-dling of their use case This work introduces EXTFUSE anextension framework for file systems in user space that al-lows developers to define specialized ldquothinrdquo extensions alongwith auxiliary data structures for handling low-level requestsin the kernel The extensions are safely executed under asandboxed runtime environment in the kernel immediately asthe requests are issued from the upper file system layer (egVFS) thereby offering a fine-grained ability to either serveeach request entirely in the kernel (fast path) or fall back tothe existing complex logic in user space (slow path) to achievethe desired operative goals of functionality and performanceThe fast and slow paths can access (and modify) the auxiliarydata structures to define custom logic for handling requests

USENIX Association 2019 USENIX Annual Technical Conference 121

EXTFUSE consists of three components First a helperuser library that provides a familiar set of file system APIsto register extensions and implement custom fast-path func-tionality in a subset of the C language Second a wrapper(no-ops) interposition driver that bridges with the low-levelVFS interfaces and provides the necessary support to for-ward requests to the registered kernel extensions as well as tothe lower file system as needed Third an in-kernel VirtualMachine (VM) runtime that safely executes the extensions

We have built EXTFUSE to work in concert with existinguser file system frameworks to allow both the fast and theexisting slow path to coexist with no overhauling changes tothe design of the target user file system Although there areseveral user file system frameworks this work focuses onlyon FUSE because of its wide-spread use Nonetheless wehave implemented EXTFUSE as in a modular fashion so itcan be easily adopted for others

We added support for EXTFUSE in four popular FUSEfile systems namely LoggedFS [16] Android sdcard daemonMergerFS and BindFS [35] Our evaluation of the first twoshows that EXTFUSE can offer substantial performance im-provements of user file systems by adding less than a fewhundred lines on average

This paper makes the following contributionsbull We identify optimization opportunities in user file sys-

tem frameworks for monolithic OSes ( sect2) and proposeextensible user file systems

bull We present the design (sect3) and architecture (sect34) ofEXTFUSE an extension framework for user file systemsthat offers the performance of kernel file systems whileretaining the safety properties of user file systems

bull We demonstrate the applicability of EXTFUSE by adopt-ing it for FUSE the state-of-the-art user file systemframework and evaluating it on Linux (sect6)

bull We show the practical benefits and limitations of theEXTFUSE with two popular FUSE file systems onedeployed in production (sect62)

2 Background and Extended MotivationThis section provides a brief technical background on FUSEand its limitations that motivate our work

21 FUSEFUSE is the state-of-the-art framework for developing userfile systems It consists of a loadable kernel driver and ahelper user-space library that provides a set of portable APIsto allow users to implement their own fully-functional filesystem as an unprivileged daemon process on Unix-basedsystems with no additional kernel support The driver is asimple interposition layer that only serves as a communica-tion channel between the user-space daemon and the VFSIt registers a new file system to interface with the VFS op-erations and directly forwards all low-level requests to thedaemon as received

The library provides two different sets of APIs Firsta fuse_lowlevel_ops interface that exports all VFS oper-ations such as lookup for path to inode mapping It is usedby file systems that need to access low-level abstractions(eg inodes) for custom optimizations Second a high-levelfuse_operations interface that builds on the low-level APIsIt hides complex abstractions and offers a simple (eg path-based) API for the ease of development Depending on theirparticular use case developers can adopt either of the twoAPIs Furthermore many operations in both APIs are op-tional For example developers can choose to not handleextended attributes (xattrs) operations (eg getxattr)

As apps make system calls the VFS layer forwards all low-level requests to the kernel driver For example to serve theopen() system call on Linux the VFS issues multiple lookuprequests to the driver one for each input path componentSimilarly every write() request is preceded by a getxattrrequest from the VFS to fetch the security labels The driverqueues up all requests along with the relevant parametersfor the daemon through devfuse device file and blocks thecalling app thread until the requests are served The daemonthrough the libfuse interface retrieves the requests from thequeue processes them as needed and enqueues the results forthe driver to read The driver copies the results populatingVFS caches as appropriate (eg page cache for readwritedcache for dir entries from lookup) and wakes the app threadand returns the results to it Once cached the subsequentaccesses to the same data are served with no user-kernelround-trip communication

In addition to the common benefits of user file systems suchas ease of development and maintenance FUSE also providesapp-transparency and fine-grained control to developers overthe low-level API to easily support their custom functionalityFurthermore on Linux the FUSE driver is GPL-licensedwhich detaches the implementation in user space from legalobligations [57] It also supports multiple language bindings(eg Python Go etc) thereby enabling access to ecosystemof third-party libraries for reuse

Given its general-purpose design and numerous advantagesover a hundred FUSE file systems with different functionali-ties have been created A large majority of them are stackablethat is they introduce incremental functionality on the hostfile system [38] FUSE has long served as a popular tool forquick experimentation and prototyping new file systems inacademic and research settings [11 32 37 41 44 49] How-ever more recently a host of FUSE file systems have also beendeployed in production Both Gluster [40] and Ceph [52] clus-ter file systems use a FUSE network client implementationAndroid v44 introduced FUSE sdcard daemon (stackable) toadd multi-user support and emulate FAT functionality on theEXT4 file system [24]

While exporting low-level abstractions and VFS interfacesoffers more control and flexibility to developers this designcomes with a cost It induces frequent user-kernel round-

122 2019 USENIX Annual Technical Conference USENIX Association

Media W xattr (Diff) WO xattr (Diff)

SW RW SW RW

HDD -029 -380 -017 -283SSD -115 -2338 +031 -1224

Table 1 Percentage difference between IO throughput (opssec) forEXT4 vs FUSE StackfsOpt (see sect6) under single thread 4K SeqWrite (SW) and Rand Write (RW) settings on a 60GB file acrossdifferent storage media as reported by Filebench [48] Received 15million getxattr requests

trip communication and data copying and thus inevitablyyields poor runtime performance Nonetheless FUSE hasevolved significantly over the years several optimizationshave been added to minimize the user-kernel communicationzero-copy data transfer (splicing) and utilizing system-wideshared VFS caches (page cache for data IO and dentry cachefor metadata) However despite these optimizations FUSEseverely degrades runtime performance in a host of scenar-ios For instance data caching improves IO throughput bybatching requests (eg read-aheads small writes) but it doesnot help apps that perform random reads or demand low writelatency (eg databases) Splicing is only used for over 4K re-quests Therefore apps with small writesreads (eg Androidapps etc) will incur data copying overheads Worse yet theoverhead is higher with faster storage media Even a simplepassthrough (no-ops) FUSE file system can introduce up to83 overhead for metadata-heavy workloads on SSD [50]

The performance penalty incurred by user file systems over-shadows their benefits Consequently some user file systemshave been replaced with alternative implementations in pro-duction For instance Android v70 replaced the sdcard dae-mon with its in-kernel implementation after several years [24]Ceph [52] adopted an in-kernel client for Linux kernel [8]

22 Generality vs SpecializationWe observe that being highly general-purpose the FUSEframework induces unnecessary user-kernel communicationin many cases yielding low throughput and high latency Forinstance getxattr requests generated by the VFS duringwrite() system calls (one per write) can double the numberof user-kernel transitions which decreases the IO through-put of sequential writes by over 27 and random writes byover 44 compared to native (EXT4) performance (Table 1)Moreover the penalty incurred is higher with the faster media

Nevertheless simple filtering or caching metadata repliesin the kernel can substantially reduce user-kernel communi-cation For instance by caching the last getxattr reply inthe FUSE driver and simply validating the cached state forevery subsequent getxattr request from the VFS on the samefile unnecessary user-kernel transitions can be eliminated toachieve significant improvement in write performance Simi-larly replies from other metadata operations such as lookup

and getattr can be cached validated and served entirelywithin the kernel Note that custom validation of cached meta-data is imperative and lack of support to do so may result inincorrect behavior as happens in the case of optimized FUSEthat leverages VFS caches to serve requests (sect51)

Many stackable user file systems add a thin layer of func-tionality they perform simple checks in a few operations andpass remaining requests directly through the host (lower) filesystem LoggedFS [16] filters requests that must be loggedand do so by accessing host file system services Union filesystems such as MergerFS [20] determine the backend hostfile in open() and redirects IO requests to it Android sdcarddaemon performs access permission checks only in metadataoperations (eg open lookup) but data IO requests (egread write etc) are simply forwarded to the lower file sys-tem Thin functionality that realizes such use cases does notneed any complex processing in user space and thereforecan easily be stacked in the kernel thereby avoiding expen-sive user-kernel switching to yield lower latency and higherthroughput for the same functionality Furthermore data IOrequests could be directly forwarded to the host file system

The FUSE framework offers a few configuration (config)options to the developers to tune the behavior of its kerneldriver for their use case However those options are coarse-grained and implemented as fixed checks embedded in thedriver code thus it offers limited static control For examplefor file systems that do not support certain operations (eggetxattr) the FUSE driver caches this knowledge upon firstENOSUPPORT reply and does not issue such requests subse-quently While this benefits the file systems that completelyomit certain functionality (eg security xattr labels) it doesnot allow custom and fine-grained filtering of requests thatmay be desired by some file systems supporting only partialfunctionality For example file systems providing encryp-tion (or compression) functionality may desire a fine-grainedcustom control over the kernel driver to skip the decryption(or decompression) operation in user space during read()requests on non-sensitive (or unzipped) files

As such the FUSE framework proves to be too low-leveland general-purpose in many cases While it enables a numberof different use cases modifying the framework to efficientlyhandle special needs of each use case is impractical This isa typical unbalanced generality vs specialization problemwhich can be addressed by extending the functionality of theFUSE driver in the kernel [6 13 43]

3 DesignIn this section we 1) present an overview of EXTFUSE 2)discuss the design goals and challenges we faced and 3)mechanisms we adopted to address those challenges

31 OverviewEXTFUSE is a framework for developing extensible FUSEfile systems for UNIX-like monolithic OSes It allows the

USENIX Association 2019 USENIX Annual Technical Conference 123

unprivileged FUSE daemon processes to register ldquothinrdquo exten-sions in the kernel for specialized handling of low-level filesystem requests while retaining their existing complex logicin user space to achieve the desired level of performance

The registered extensions are safely executed under a sand-boxed eBPF runtime environment in the kernel ( sect33) im-mediately as requests are issued from the upper file system(eg VFS) Sandboxing enables the FUSE daemon to safelyextend the functionality of the driver at runtime and offers afine-grained ability to either serve each request entirely in thekernel or fall back to user space thereby offering safety ofuser space and performance of kernel file systems

EXTFUSE also provides a set of APIs to create sharedkey-value data structures (called maps) that can host abstractdata blobs The user-space daemon and its kernel extensionscan leverage maps to storemanipulate their custom data typesas needed to work in concert and serve file system requests inthe kernel without incurring expensive user-kernel round-tripcommunication if deemed unnecessary

32 Goals and ChallengesThe over-arching goal of EXTFUSE is to carefully balancethe safety and runtime extensibility of user file systems toachieve the desired level of performance and specializedfunctionality Nevertheless in order for developers to useEXTFUSE it must also be easy to adopt We identify thefollowing concrete design goalsDesign Compatibility The abstractions and interfaces of-fered by EXTFUSE framework must be compatible withFUSE without hindering existing functionality or proper-ties It must be general-purpose so as to support multipledifferent use cases Developers must further be able to adoptEXTFUSE for their use case without overhauling changesto their existing design It must be easy for them to imple-ment specialized extensions with little to no knowledge of theunderlying implementation detailsModular Extensibility EXTFUSE must be highly modularand limit any unnecessary new changes to FUSE Particularlydevelopers must be able to retain their existing user-spacelogic and introduce specialized extensions only if neededBalancing Safety and Performance Finally withEXTFUSE even unprivileged (and untrusted) FUSEdaemon processes must be able to safely extend thefunctionality of the driver as needed to offer performancethat is as close as possible to the in-kernel implementationHowever unlike Microkernels [2 23 30] that host systemservices in separate protection domains as user processesor the OSes that have been developed with safe runtimeextensibility as a design goal [6 13] extending systemservices of general-purpose UNIX-like monolithic OSesposes a design trade-off question between the safety andperformance requirements

Untrusted extensions must be as lightweight as possiblewith their access restricted to only a few well-defined APIs

to guarantee safety For example kernel file systems offernear-native performance but executing complex logic in thekernel results in questionable reliability Additionally mostOS kernels employ Address Space Randomization [36] DataExecution Prevention [5] etc for code protection and hid-ing memory pointers Providing unrestricted kernel accessto extensions can render such protections useless There-fore extensions must not be able to access arbitrary memoryaddresses or leak pointer values to user space

However severely restricting extensibility can prevent userfile systems from fully meeting their operative performanceand functionality goals thus defeating the purpose of exten-sions in the first place Therefore EXTFUSE must carefullybalance safety and performance goalsCorrectness Furthermore specialized extensions can alterthe existing design of user file systems which can lead tocorrectness issues For example there will be two separatepaths (fast and slow) both operating on the requests and datastructs at the same time The framework must provide a wayfor them to synchronize and offer safe concurrent accesses

33 eBPFEXTFUSE leverages extended BPF (eBPF) [26] an in-kernelVirtual Machine (VM) runtime framework to load and safelyexecute user file system extensionsRicher functionality eBPF is an extension of classic Berke-ley Packet Filters (BPF) an in-kernel interpreter for a pseudomachine architecture designed to only accept simple networkfiltering rules from user space It enhances BPF to includemore versatility such as 64-bit support a richer instruction set(eg call cond jump) more registers and native performancethrough JIT compilationHigh-level language support The eBPF bytecode backendis also supported by ClangLLVM compiler toolchain whichallows functionality logic to be written in a familiar high-levellanguage such as C and GoSafety The eBPF framework provides a safe execution en-vironment in the kernel It prohibits execution of arbitrarycode and access to arbitrary kernel memory regions insteadthe framework restricts access to a set of kernel helper APIsdepending on the target kernel subsystem (eg network) andrequired functionality (eg packet handling) The frame-work includes a static analyzer (called verifier) that checksthe correctness of the bytecode by performing an exhaus-tive depth-first search through its control flow graph to detectproblems such as infinite loops out-of-bound and illegalmemory errors The framework can also be configured toallow or deny eBPF bytecode execution request from unprivi-leged processesKey-Value Maps eBPF allows user space to create map datastructures to store arbitrary key-value blobs using systemcalls and access them using file descriptors Maps are alsoaccessible to eBPF bytecode in the kernel thus providing acommunication channel between user space and the bytecode

124 2019 USENIX Annual Technical Conference USENIX Association

to define custom key-value types and share execution state ordata Concurrent accesses to maps are protected under read-copy update (RCU) synchronization mechanism Howevermaps consume unswappable kernel memory Furthermorethey are either accessible to everyone (eg by passing filedescriptors) or only to CAP_SYS_ADMIN processes

eBPF is a part of the Linux kernel and is already used heav-ily by networking tracing and profiling subsystems Givenits rich functionality and safety properties we adopt eBPFfor providing support for extensible user file systems Specifi-cally we define a white-list of kernel APIs (including theirparameters and return types) and abstractions that user filesystem extensions can safely use to realize their specializedfunctionality The eBPF verifier utilizes the whitelist to vali-date the correctness of the extensions We also build on eBPFabstractions (eg maps) and apply further access restrictionsto enable safe in-kernel execution as needed

34 Architecture

Figure 1 Architectural view of the EXTFUSE framework Thecomponents modified or introduced have been highlighted

Figure 1 shows the architecture of the EXTFUSE frame-work It is enabled by three core components namely a kernelfile system (driver) a user library (libExtFUSE) and an in-kernel eBPF virtual machine runtime (VM)

The EXTFUSE driver uses interposition technique to in-terface with FUSE at low-level file system operations How-ever unlike the FUSE driver that simply forwards file systemrequests to user space the EXTFUSE driver is capable ofdirectly delivering requests to in-kernel handlers (extensions)It can also forward a few restricted set of requests (eg readwrite) to the host (lower) file system if present The latteris needed for stackable user file systems that add thin func-tionality on top of the host file system libExtFUSE exports aset of APIs and abstractions for serving requests in the kernelhiding the underlying implementation details

Use of libExtFUSE is optional and independent of libfuseThe existing file system handlers registered with libfuse

FS Interface API(s) Abstractions Description

Low-level fuse_lowlevel_ops Inode FS Ops

Kernel Access API(s) Abstractions Description

eBPF Funcs bpf_ UID PID etc Helper FuncsFUSE extfuse_reply_ fuse_reply_ Req OutputKernel bpf_set_pasthru FileDesc Enable PthruKernel bpf_clear_pasthru FileDesc Disable Pthru

DataStructs API(s) Abstractions Description

SHashMap CRUD KeyVal Hosts arbitrary data blobsInodeMap CRUD FileDesc Hosts upper-lower inode pairs

Table 2 APIs and abstractions provided by EXTFUSE It providesFUSE-like file system interface for easy portability CRUD (createread update and delete) APIs are offered for map data structuresto operate on KeyValue pairs Kernel accesses are restricted tostandard eBPF kernel helper functions We introduced APIs toaccess the same FUSE request parameters as available to user space

continue to reside in user space Therefore their invocationincurs context switches and thus we refer to their executionas the slow path With EXTFUSE user space can also registerkernel extensions that are invoked immediately as file systemrequests are received from the VFS in order to allow servingthem in the kernel We refer to the in-kernel execution as thefast path Depending upon the return values from the fastpath the requests can be marked as served or be sent to theuser-space daemon via the slow path to avail any complexprocessing as needed Fast path can also return a special valuethat instructs the EXTFUSE driver to interpose and forwardthe request to the lower file system However this feature isonly available to stackable user file systems and is verifiedwhen the extensions are loaded in the kernel

The fast path interfaces exported by libExtFUSE are thesame as those exported by libfuse to the slow path Thisis important for easy transfer of design and portability Weleverage eBPF support in the LLVMClang compiler toolchainto provide developers with a familiar set of APIs and allowthem to implement their custom functionality logic in a subsetof the C language

The extensions are loaded and executed inside the kernelunder the eBPF VM sandbox thereby providing user space afine-grained ability to safely extend the functionality of FUSEkernel driver at runtime for specialized handling of each filesystem request

35 EXTFUSE APIs and AbstractionslibExtFUSE provides a set of high-level APIs and abstractionsto the developers for easy implementation of their specializedextensions hiding the complex implementation details Ta-ble 2 summarizes the APIs For handling file system opera-tions libExtFUSE exports the familiar set of FUSE interfacesand corresponding abstractions (eg inode) for design com-patibility Both low-level as well as high-level file systeminterfaces are available offering flexibility and developmentease Furthermore as with libfuse the daemon can reg-

USENIX Association 2019 USENIX Annual Technical Conference 125

ister extensions for a few or all of the file system APIs of-fering them flexibility to implement their functionality withno additional development burden The extensions receivethe same request parameters (struct fuse_[inout]) as theuser-space daemon This design choice not only conformsto the principle of least privilege but also offers the user-space daemon and the extensions the same interface for easyportability

For hostingsharing data between the user daemon andkernel extensions libExtFUSE provides a secure variant ofeBPF HashMap keyvalue data structure called SHashMap thatstores arbitrary keyvalue blobs Unlike regular eBPF mapsthat are either accessible to all user processes or only toCAP_SYS_ADMIN processes SHashMap is only accessible bythe unprivileged daemon that creates it libExtFUSE furtherabstracts low-level details of SHashMap and provides high-level CRUD APIs to create read update and delete entries(keyvalue pairs)

EXTFUSE also provides a special InodeMap to enablepassthrough IO feature for stackable EXTFUSE file sys-tems (sect52) Unlike SHashMap that stores arbitrary entriesInodeMap takes open file handle as key and stores a pointer tothe corresponding lower (host) inode as value Furthermoreto prevent leakage of inode object to user space the InodeMapvalues can only be read by the EXTFUSE driver

36 WorkflowTo understand how EXTFUSE facilitates implementation ofextensible user file systems we describe its workflow in de-tail Upon mounting the user file system FUSE driver sendsFUSE_INIT request to the user-space daemon At this pointthe user daemon checks if the OS kernel supports EXTFUSEframework by looking for FUSE_CAP_ExtFUSE flag in the re-quest parameters If supported the daemon must invokelibExtFUSE init API to load the eBPF program that containsspecialized handlers (extensions) into the kernel and regis-ter them with the EXTFUSE driver This is achieved usingbpf_load_prog system call which invokes eBPF verifier tocheck the integrity of the extensions If failed the programis discarded and the user-space daemon is notified of the er-rors The daemon can then either exit or continue with defaultFUSE functionality If the verification step succeeds and theJIT engine is enabled the extensions are processed by theJIT compiler to generate machine assembly code ready forexecution as needed

Extensions are installed in a bpf_prog_type map (calledextension map) which serves effectively as a jump tableTo invoke an extension the FUSE driver simply executesa bpf_tail_call (far jump) with the FUSE operation code(eg FUSE_OPEN) as an index into the extension map Oncethe eBPF program is loaded the daemon must informEXTFUSE driver about the kernel extensions by replyingto FUSE_INIT containing identifiers to the extension map

Once notified EXTFUSE can safely load and execute the

Component Version Loc Modified Loc New

FUSE kernel driver 4110 312 874FUSE user-space library 320 23 84EXTFUSE user-space library - - 581

Table 3 Changes made to the existing Linux FUSE framework tosupport EXTFUSE functionality

extensions at runtime under the eBPF VM environment Ev-ery request is first delivered to the fast path which may decideto 1) serve it (eg using data shared between the fast andslow paths) 2) pass the request through to the lower file sys-tem (eg after modifying parameters or performing accesschecks) or 3) take the slow path and deliver the request to userspace for complex processing logic (eg data encryption)as needed Since the execution path is chosen per-requestindependently and the fast path is always invoked first thekernel extensions and user daemon can work in concert andsynchronize access to requests and shared data structures Itis important to note that the EXTFUSE driver only acts as athin interposition layer between the FUSE driver and kernelextensions and in some cases between the FUSE driver andthe lower file system As such it does not perform any IOoperation or attempts to serve requests on its own

4 Implementation

To implement EXTFUSE we provided eBPF support forFUSE Specifically we added additional kernel helper func-tions and designed two new map types to support secure com-munication between the user-space daemon and kernel exten-sions as well as support for passthrough access in readwriteWe modified FUSE driver to first invoke registered eBPF han-dlers (extensions) Passthrough implementation is adoptedfrom WrapFS [54] a wrapper stackable in-kernel file systemSpecifically we modified FUSE driver to pass IO requestsdirectly to the lower file system

Since with EXTFUSE developers can install extensions tobypass the user-space daemon and pass IO requests directlyto the lower file system a malicious process could stack anumber of EXTFUSE file systems on top of each other andcause the kernel stack to overflow To guard against suchattacks we limit the number of EXTFUSE layers that couldbe stacked on a mount point We rely on s_stack_depthfield in the super-block to track the number of stacked layersand check it against FILESYSTEM_MAX_STACK_DEPTH whichwe limit to two Table 3 reports the number of lines of codefor EXTFUSE We also modified libfuse to allow apps toregister kernel extensions

5 Optimizations

Here we describe a set of optimizations that can be enabledby leveraging custom kernel extensions in EXTFUSE to im-plement in-kernel handling of file system requests

126 2019 USENIX Annual Technical Conference USENIX Association

1 void handle_lookup(fuse_req_t req fuse_ino_t pino2 const char name) 3 lookup or create node cname parent pino 4 struct fuse_entry_param e5 if (find_or_create_node(req pino name ampe)) return6 + lookup_key_t key = pino name7 + lookup_val_t val = 0not stale ampe8 + extfuse_insert_shmap(ampkey ampval) cache this entry 9 fuse_reply_entry(req ampe)

10

Figure 2 FUSE daemon lookup handler in user space WithEXTFUSE lines 6-8 (+) enable caching replies in the kernel

51 Customized in-kernel metadata cachingMetadata operations such as lookup and getattr are fre-quently issued and thus form high sources of latency in FUSEfile systems [50] Unlike VFS caches that are only reactiveand fixed in functionality EXTFUSE can be leveraged toproactively cache metadata replies in the kernel Kernel ex-tensions can be installed to manage and serve subsequentoperations from caches without switching to user spaceExample To illustrate let us consider the lookup operationIt is the most common operation issued internally by the VFSfor serving open() stat() and unlink() system calls Eachcomponent of the input path string is searched using lookupto fetch the corresponding inode data structure Figure 2lists code fragment for FUSE daemon handler that serveslookup requests in user space (slow path) The FUSE lookupAPI takes two input parameters the parent node ID andthe next path component name The node ID is a 64-bitinteger that uniquely identifies the parent inode The daemonhandler function traverses the parent directory searching forthe child entry corresponding to the next path componentUpon successful search it populates the fuse_entry_paramdata structure with the node ID and attributes (eg size) ofthe child and sends it to the FUSE driver which creates anew inode for the dentry object representing the child entry

With EXTFUSE developers could define a SHashMap thathosts fuse_entry_param replies in the kernel (lines 7-10) Acomposite key generated from the parent node identifier andthe next path component string arguments is used as an indexinto the map for inserting corresponding replies Since themap is also accessible to the extensions in the kernel sub-sequent requests could be served from the map by installingthe EXTFUSE lookup extension (fast path) Figure 3 listsits code fragment The extension uses the same compositekey as an index into the hash map to search whether the cor-responding fuse_entry_param entry exists If a valid entryis found the reference count (nlookup) is incremented and areply is sent to the FUSE driver

Similarly replies from user space daemon for other meta-data operations such as getattr getxattr and readlinkcould be cached using maps and served in the kernel by re-spective extensions (Table 4) Network FUSE file systemssuch as SshFS [39] and Gluster [40] already perform aggres-sive metadata caching and batching at client to reduce thenumber of remote calls to the server SshFS [39] for example

1 int lookup_extension(extfuse_req_t req fuse_ino_t pino2 const char name) 3 lookup in map bail out if not cached or stale 4 lookup_key_t key = pino name5 lookup_val_t val = extfuse_lookup_shmap(ampkey)6 if (val || atomic_read(ampval-gtstale)) return UPCALL7 EXAMPLE Android sdcard daemon perm check 8 if (check_caller_access(pino name)) return -EACCES9 populate output incr count (used in FUSE_FORGET)

10 extfuse_reply_entry(req ampval-gte)11 atomic_incr(ampval-gtnlookup 1)12 return SUCCESS13

Figure 3 EXTFUSE lookup kernel extension that serves validcached replies without incurring any context switches Customizedchecks could further be included Android sdcard daemon permis-sion check is shown as an example (see Figure 10)

implements its own directory attribute and symlink cachesWith EXTFUSE such caches could be implemented in thekernel for further performance gainsInvalidation While caching metadata in the kernel reducesthe number of context switches to user space developers mustalso carefully invalidate replies as necessary For examplewhen a file (or dir) is deleted or renamed the correspondingcached lookup replies must be invalidated Invalidations canbe performed in user space by the relevant request handlers orin the kernel by installing their extensions before new changesare made However the former case may introduce raceconditions and produce incorrect results because all requeststo user space daemon are queued up by the FUSE driverwhereas requests to the extensions are not Cached lookupreplies can be invalidated in extensions for unlink rmdir andrename operations Similarly when attributes or permissionson a file change cached getattr replies can be invalidated insetattr extension Our design ensures race-free invalidationby executing the extensions before forwarding requests touser space daemon where the changes may be madeAdvantages over VFS caching As previously mentionedrecent optimizations added to FUSE framework leverage VFScaches to reduce user-kernel context switching For instanceby specifying non-zero entry_valid and attr_valid timeoutvalues dentries and inodes cached by the VFS from previ-ous lookup operations could be utilized to serve subsequentlookup and getattr requests respectively However theVFS offers no control to the user file system over the cacheddata For example if the file system is mounted withoutthe default_permissions parameter VFS caching of inodesintroduces a security bug [21] This is because the cached per-missions are only checked for first accessing user In contrastwith EXTFUSE developers can define their own metadatacaches and install custom code to manage them For instanceextensions can perform uid-based access permission checksbefore serving requests from the caches to obviate the afore-mentioned security issue (Figure 10)

Additionally unlike VFS caches that are only reactiveEXTFUSE enables proactive caching For example since areaddir request is expected after an opendir call the user-space daemon could proactively cache directory entries in the

USENIX Association 2019 USENIX Annual Technical Conference 127

Metadata Map Key Map Value Caching Operations Serving Extensions Invalidation Operations

Inode ltnodeID namegt fuse_entry_param lookup create mkdir mknod lookup unlink rmdir renameAttrs ltnodeIDgt fuse_attr_out getattr lookup getattr setattr unlink rmdirSymlink ltnodeIDgt link path symlink readlink readlink unlinkDentry ltnodeIDgt fuse_dirent opendir readdir readdir releasedir unlink rmdir renameXAttrs ltnodeID labelgt xattr value open getxattr listxattr getattr listxattr close setxattr removexattr

Table 4 Metadata can be cached in the kernel using eBPF maps by the user-space daemon and served by kernel extensions

kernel by inserting them in a BPF map while serving opendirrequests to reduce transitions to user space Alternativelysimilar to read-ahead optimization proactive caching of sub-sequent directory entries could be performed during the firstreaddir call to the user-space daemon Memory occupied bycached entries could then be freed by the releasedir handlerin user space that deletes them from the map Similarly se-curity labels on a file could be cached during the open call touser space and served in the kernel by getxattr extensionsNonetheless since eBPF maps consume kernel memory de-velopers must carefully manage caches and limit the numberof map entries to keep memory usage under check

52 Passthrough IO for stacking functionality

Many user file systems are stackable with a thin layer offunctionality that does not require complex processing inthe user-space For example LoggedFS [16] filters requeststhat must be logged logs them as needed and then simplyforwards them to the lower file system User-space unionfile systems such as MergerFS [20] determine the backendhost file in open and redirects IO requests to it BindFS [35]mirrors another mount point with custom permissions checksAndroid sdcard daemon performs access permission checksand emulates the case-insensitive behavior of FAT only inmetadata operations (eg lookup open etc) but forwardsdata IO requests directly to the lower file system For suchsimple cases the FUSE API proves to be too low-level andincurs unnecessarily high overhead due to context switching

With EXTFUSE readwrite IO requests can take the fastpath and directly be forwarded to the lower (host) file systemwithout incurring any context-switching if the complex slow-path user-space logic is not needed Figure 4 shows howthe user-space daemon can install the lower file descriptorin InodeMap while handling open() system call for notifyingthe EXTFUSE driver to store a reference to the lower inodekernel object With the custom_filtering_logic(path) con-dition this can be done selectively for example if accesspermission checks pass in Android sdcard daemon Simi-larly BindFS and MergerFS can adopt EXTFUSE to availpassthrough optimization The readwrite kernel extensionscan check in InodeMap to detect whether the target file is setupfor passthrough access If found EXTFUSE driver can be in-structed with a special return code to directly forward the IOrequest to the lower file system with the corresponding lowerinode object as parameter Figure 5 shows a template read

1 void handle_open(fuse_req_t req fuse_ino_t ino2 const struct fuse_open_in in) 3 file represented by ino inode num 4 struct fuse_open_out out char path[PATH_MAX]5 int len fd = open_file(ino in-gtflags path ampout)6 if (fd gt 0 ampamp custom_filtering_logic(path)) 7 + install fd in inode map for pasthru 8 + imap_key_t key = out-gtfh9 + imap_val_t val = fd lower fd

10 + extfuse_insert_imap(ampkey ampval)11

Figure 4 FUSE daemon open handler in user space WithEXTFUSE lines 7-9 (+) enable passthrough access on the file1 int read_extension(extfuse_req_t req fuse_ino_t ino2 const struct fuse_read_in in) 3 lookup in inode map passthrough if exists 4 imap_key_t key = in-gtfh5 if (extfuse_lookup_imap(ampkey)) return UPCALL6 EXAMPLE LoggedFS log operation 7 log_op(req ino FUSE_READ in sizeof(in))8 return PASSTHRU forward req to lower FS 9

Figure 5 The EXTFUSE read kernel extension returns PASSTHRU toforward request directly to the lower file system Custom thin func-tionality could further be pushed in the kernel LoggedFS loggingfunction is shown as an example (see Figure 11)

extension Kernel extensions can include additional logic orchecks before returning For instance LoggedFS readwriteextensions can filter and log operations as needed sect62

6 EvaluationTo evaluate EXTFUSE we answer the following questionsbull Baseline Performance How does an EXTFUSE imple-

mentation of a file system perform when compared to itsin-kernel and FUSE implementations (sect61)

bull Use cases What kind of existing FUSE file systems canbenefit from EXTFUSE and what performance improve-ments can they expect ( sect62)

61 PerformanceTo measure the baseline performance of EXTFUSE weadopted the simple no-ops (null) stackable FUSE file sys-tem called Stackfs [50] This user-space daemon serves allrequests by forwarding them to the host (lower) file system Itincludes all recent FUSE optimizations (Table 5) We evaluateStackfs under all possible EXTFUSE configs listed in Table 5Each config represents a particular level of performance thatcould potentially be achieved for example by caching meta-data in the kernel or directly passing readwrite requeststhrough the host file system for stacking functionality To putour results in context we compare our results with EXT4 and

128 2019 USENIX Annual Technical Conference USENIX Association

Figure 6 Throughput(opssec) for EXT4 and FUSEEXTFUSE Stackfs (w xattr) file systems under different configs (Table 5) as measuredby Random Read(RR)Write(RW) Sequential Read(SR)Write(SW) Filebench [48] data micro-workloads with IO Sizes between 4KB-1MBand settings Nth N threads Nf N files We use the same workloads as in [50]

Figure 7 Number of file system request received by the daemon inFUSEEXTFUSE Stackfs (w xattr) under workloads in Figure 6Only a few relevant request types are shown

Figure 8 Throughput(Opssec) for EXT4 and FUSEEXTFUSEStackfs (w xattr) under different configs (Table 5) as measuredby Filebench [48] Creation(C) Deletion(D) Reading(R) metadatamicro-workloads on 4KB files and FileServer(F) WebServer(W)macro-workloads with settings NthN threads NfN files

the optimized FUSE implementation of Stackfs (Opt)Testbed We use the same experiments and settings as in [50]Specifically we used EXT4 because of its popularity as thehost file system and ran benchmarks to evaluate Howeversince FUSE performance problems were reported to be moreprominent with a faster storage medium we only carry outour experiments with SSDs We used a Samsung 850 EVO250GB SSD installed on an Asus machine with Intel Quad-Core i5-3550 33 GHz and 16GB RAM running Ubuntu16043 Further to minimize any variability we formatted the

Config File System Optimizations

Opt [50] FUSE 128K Writes Splice WBCache MltThrdMDOpt EXTFUSE Opt + Caches lookup attrs xattrsAllOpt EXTFUSE MDOpt + Pass RW reqs through host FS

Table 5 Different Stackfs configs evaluated

SSD before each experiment and disabled EXT4 lazy inodeinitialization To evaluate file systems that implement xattroperations for handling security labels (eg in Android) ourimplementation of Opt supports xattrs and thus differs fromthe implementation in [50]Workloads Our workload consists of Filebench [48] microand synthetic macro benchmarks to test each config withmetadata- and data-heavy operations across a wide range ofIO sizes and parallelism settings We measure the low-levelthroughput (opssec) Our macro-benchmarks consist of asynthetic file server and web serverMicro Results Figure 6 shows the results of micro workloadunder different configs listed in Table 5

Reads Due to the default 128KB read-ahead feature ofFUSE the sequential read throughput on a single file with asingle thread for all IO sizes and under all Stackfs configsremained the same Multi-threading improved for the sequen-tial read benchmark with 32 threads and 32 files Only onerequest was generated per thread for lookup and getattr op-erations Hence metadata caching in MDOpt was not effectiveSince FUSE Opt performance is already at par with EXT4the passthrough feature in AllOpt was not utilized

Unlike sequential reads small random reads could not takeadvantage of the read-ahead feature of FUSE Additionally4KB reads are not spliced and incur data copying across user-kernel boundary With 32 threads operating on a single filethe throughput improves due to multi-threading in Opt How-ever degradation is observed with 4KB reads AllOpt passesall reads through EXT4 and hence offers near-native through-put In some cases the performance was slightly better than

USENIX Association 2019 USENIX Annual Technical Conference 129

EXT4 We believe that this minor improvement is due todouble caching at the VFS layer Due to a single request perthread for metadata operations no improvement was seenwith EXTFUSE metadata caching

Writes During sequential writes the 128K big writes andwriteback caching in Opt allow the FUSE driver to batch smallwrites (up to 128KB) together in the page cache to offer ahigher throughput However random writes are not batchedAs a result more write requests are delivered to user spacewhich negatively affects the throughput Multiple threads on asingle file perform better for requests bigger than 4KB as theyare spliced With EXTFUSE AllOpt all writes are passedthrough the EXT4 file system to offer improved performance

Write throughput degrades severely for FUSE file systemsthat support extended attributes because the VFS issues agetxattr request before every write Small IO requestsperform worse as they require more write which generatemore getxattr requests Opt random writes generated 30xfewer getxattr requests for 128KB compared to 4KB writesresulting in a 23 decrease in the throughput of 4KB writes

In contrast MDOpt caches the getxattr reply in the kernelupon the first call and serves subsequent getxattr requestswithout incurring further transitions to user space Figure 7compares the number of requests received by the user-spacedaemon in Opt and MDOpt Caching replies reduced the over-head for 4KB workload to less than 5 Similar behavior wasobserved with both sequential writes and random writesMacro Results Figure 8 shows the results of macro-workloads and synthetic server workloads emulated usingFilebench under various configs Neither of the EXTFUSEconfigs offer improvements over FUSE Opt under creationand deletion workloads as these metadata-heavy workloadscreated and deleted a number of files respectively This isbecause no metadata caching could be utilized by MDOpt Sim-ilarly no passthrough writes were utilized with AllOpt since4KB files were created and closed in user space In contrastthe File and Web server workloads under EXTFUSE utilizedboth metadata caching and passthrough access features andimproved performance We saw a 47 89 and 100 dropin lookup getattr and getattr requests to user space un-der MDOpt respectively when configured to cache up to 64Kfor each type of request AllOpt further enabled passthroughreadwrite requests to offer near native throughput for bothmacro reads and server workloadsReal Workload We also evaluated EXTFUSE with two realworkloads namely kernel decompression and compilation of4180 Linux kernel We created three separate caches forhosting lookup getattr and getxattr replies Each cachecould host up to 64K entries resulting in allocation of up to atotal of 50MB memory when fully populated

The kernel compilation make tinyconfig make -j4 ex-periment on our test machine (see sect6) reported a 52 drop incompilation time from 3974 secs under FUSE Opt to 3768secs with EXTFUSE MDOpt compared to 3091 secs with

EXT4 This was due to over 75 99 and 100 decreasein lookup getattr and getxattr requests to user spacerespectively (Figure 9) getxattr replies were proactivelycached while handling open requests thus no transitions touser space were observed for serving xattr requests WithEXTFUSE AllOpt the compilation time further dropped to3364 secs because of 100 reduction in read and writerequests to user space

In contrast the kernel decompression tar xf experimentreported a 635 drop in the completion time from 1102secs under FUSE Opt to 1032 secs with EXTFUSE MDOptcompared to 527 secs with EXT4 With EXTFUSE AllOptthe decompression time further dropped to 867 secs due to100 reduction in read and write requests to user spaceas shown in Figure 9 Nevertheless reducing the numberof cached entries for metadata requests to 4K resulted in adecompression time of 1087 secs (253 increase) due to3555 more getattr requests to user space This suggeststhat developers must efficiently manage caches

Figure 9 Linux kernel 4180 untar (decompress) and compilationtime taken with StackFS under FUSE and EXTFUSE settings Num-ber of metadata and IO requests are reduced with EXTFUSE

62 Use casesWe ported four real-world stackable FUSE file systemsnamely LoggedFS Android sdcard daemon MergerFSand BindFS to EXTFUSE and enabled both metadatacaching sect51 and passthrough IO sect52 optimizations

File System Functionality Ext Loc

StackFS [50] No-ops File System 664BindFS [35] Mirroring File System 792Android sdcard [24] Perm checks amp FAT Emu 928MergerFS [20] Union File System 686LoggedFS [16] Logging File System 748

Table 6 Lines of code (Loc) of kernel extensions required to adoptEXTFUSE for existing FUSE file systems We added support formetadata caching as well as RW passthrough

As EXTFUSE allows file systems to retain their exist-ing FUSE daemon code as the default slow path adoptingEXTFUSE for real-world file systems is easy On averagewe made less than 100 lines of changes to the existing FUSEcode to invoke EXTFUSE helper library functions for ma-nipulating kernel extensions including maps We added ker-

130 2019 USENIX Annual Technical Conference USENIX Association

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 3: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

EXTFUSE consists of three components First a helperuser library that provides a familiar set of file system APIsto register extensions and implement custom fast-path func-tionality in a subset of the C language Second a wrapper(no-ops) interposition driver that bridges with the low-levelVFS interfaces and provides the necessary support to for-ward requests to the registered kernel extensions as well as tothe lower file system as needed Third an in-kernel VirtualMachine (VM) runtime that safely executes the extensions

We have built EXTFUSE to work in concert with existinguser file system frameworks to allow both the fast and theexisting slow path to coexist with no overhauling changes tothe design of the target user file system Although there areseveral user file system frameworks this work focuses onlyon FUSE because of its wide-spread use Nonetheless wehave implemented EXTFUSE as in a modular fashion so itcan be easily adopted for others

We added support for EXTFUSE in four popular FUSEfile systems namely LoggedFS [16] Android sdcard daemonMergerFS and BindFS [35] Our evaluation of the first twoshows that EXTFUSE can offer substantial performance im-provements of user file systems by adding less than a fewhundred lines on average

This paper makes the following contributionsbull We identify optimization opportunities in user file sys-

tem frameworks for monolithic OSes ( sect2) and proposeextensible user file systems

bull We present the design (sect3) and architecture (sect34) ofEXTFUSE an extension framework for user file systemsthat offers the performance of kernel file systems whileretaining the safety properties of user file systems

bull We demonstrate the applicability of EXTFUSE by adopt-ing it for FUSE the state-of-the-art user file systemframework and evaluating it on Linux (sect6)

bull We show the practical benefits and limitations of theEXTFUSE with two popular FUSE file systems onedeployed in production (sect62)

2 Background and Extended MotivationThis section provides a brief technical background on FUSEand its limitations that motivate our work

21 FUSEFUSE is the state-of-the-art framework for developing userfile systems It consists of a loadable kernel driver and ahelper user-space library that provides a set of portable APIsto allow users to implement their own fully-functional filesystem as an unprivileged daemon process on Unix-basedsystems with no additional kernel support The driver is asimple interposition layer that only serves as a communica-tion channel between the user-space daemon and the VFSIt registers a new file system to interface with the VFS op-erations and directly forwards all low-level requests to thedaemon as received

The library provides two different sets of APIs Firsta fuse_lowlevel_ops interface that exports all VFS oper-ations such as lookup for path to inode mapping It is usedby file systems that need to access low-level abstractions(eg inodes) for custom optimizations Second a high-levelfuse_operations interface that builds on the low-level APIsIt hides complex abstractions and offers a simple (eg path-based) API for the ease of development Depending on theirparticular use case developers can adopt either of the twoAPIs Furthermore many operations in both APIs are op-tional For example developers can choose to not handleextended attributes (xattrs) operations (eg getxattr)

As apps make system calls the VFS layer forwards all low-level requests to the kernel driver For example to serve theopen() system call on Linux the VFS issues multiple lookuprequests to the driver one for each input path componentSimilarly every write() request is preceded by a getxattrrequest from the VFS to fetch the security labels The driverqueues up all requests along with the relevant parametersfor the daemon through devfuse device file and blocks thecalling app thread until the requests are served The daemonthrough the libfuse interface retrieves the requests from thequeue processes them as needed and enqueues the results forthe driver to read The driver copies the results populatingVFS caches as appropriate (eg page cache for readwritedcache for dir entries from lookup) and wakes the app threadand returns the results to it Once cached the subsequentaccesses to the same data are served with no user-kernelround-trip communication

In addition to the common benefits of user file systems suchas ease of development and maintenance FUSE also providesapp-transparency and fine-grained control to developers overthe low-level API to easily support their custom functionalityFurthermore on Linux the FUSE driver is GPL-licensedwhich detaches the implementation in user space from legalobligations [57] It also supports multiple language bindings(eg Python Go etc) thereby enabling access to ecosystemof third-party libraries for reuse

Given its general-purpose design and numerous advantagesover a hundred FUSE file systems with different functionali-ties have been created A large majority of them are stackablethat is they introduce incremental functionality on the hostfile system [38] FUSE has long served as a popular tool forquick experimentation and prototyping new file systems inacademic and research settings [11 32 37 41 44 49] How-ever more recently a host of FUSE file systems have also beendeployed in production Both Gluster [40] and Ceph [52] clus-ter file systems use a FUSE network client implementationAndroid v44 introduced FUSE sdcard daemon (stackable) toadd multi-user support and emulate FAT functionality on theEXT4 file system [24]

While exporting low-level abstractions and VFS interfacesoffers more control and flexibility to developers this designcomes with a cost It induces frequent user-kernel round-

122 2019 USENIX Annual Technical Conference USENIX Association

Media W xattr (Diff) WO xattr (Diff)

SW RW SW RW

HDD -029 -380 -017 -283SSD -115 -2338 +031 -1224

Table 1 Percentage difference between IO throughput (opssec) forEXT4 vs FUSE StackfsOpt (see sect6) under single thread 4K SeqWrite (SW) and Rand Write (RW) settings on a 60GB file acrossdifferent storage media as reported by Filebench [48] Received 15million getxattr requests

trip communication and data copying and thus inevitablyyields poor runtime performance Nonetheless FUSE hasevolved significantly over the years several optimizationshave been added to minimize the user-kernel communicationzero-copy data transfer (splicing) and utilizing system-wideshared VFS caches (page cache for data IO and dentry cachefor metadata) However despite these optimizations FUSEseverely degrades runtime performance in a host of scenar-ios For instance data caching improves IO throughput bybatching requests (eg read-aheads small writes) but it doesnot help apps that perform random reads or demand low writelatency (eg databases) Splicing is only used for over 4K re-quests Therefore apps with small writesreads (eg Androidapps etc) will incur data copying overheads Worse yet theoverhead is higher with faster storage media Even a simplepassthrough (no-ops) FUSE file system can introduce up to83 overhead for metadata-heavy workloads on SSD [50]

The performance penalty incurred by user file systems over-shadows their benefits Consequently some user file systemshave been replaced with alternative implementations in pro-duction For instance Android v70 replaced the sdcard dae-mon with its in-kernel implementation after several years [24]Ceph [52] adopted an in-kernel client for Linux kernel [8]

22 Generality vs SpecializationWe observe that being highly general-purpose the FUSEframework induces unnecessary user-kernel communicationin many cases yielding low throughput and high latency Forinstance getxattr requests generated by the VFS duringwrite() system calls (one per write) can double the numberof user-kernel transitions which decreases the IO through-put of sequential writes by over 27 and random writes byover 44 compared to native (EXT4) performance (Table 1)Moreover the penalty incurred is higher with the faster media

Nevertheless simple filtering or caching metadata repliesin the kernel can substantially reduce user-kernel communi-cation For instance by caching the last getxattr reply inthe FUSE driver and simply validating the cached state forevery subsequent getxattr request from the VFS on the samefile unnecessary user-kernel transitions can be eliminated toachieve significant improvement in write performance Simi-larly replies from other metadata operations such as lookup

and getattr can be cached validated and served entirelywithin the kernel Note that custom validation of cached meta-data is imperative and lack of support to do so may result inincorrect behavior as happens in the case of optimized FUSEthat leverages VFS caches to serve requests (sect51)

Many stackable user file systems add a thin layer of func-tionality they perform simple checks in a few operations andpass remaining requests directly through the host (lower) filesystem LoggedFS [16] filters requests that must be loggedand do so by accessing host file system services Union filesystems such as MergerFS [20] determine the backend hostfile in open() and redirects IO requests to it Android sdcarddaemon performs access permission checks only in metadataoperations (eg open lookup) but data IO requests (egread write etc) are simply forwarded to the lower file sys-tem Thin functionality that realizes such use cases does notneed any complex processing in user space and thereforecan easily be stacked in the kernel thereby avoiding expen-sive user-kernel switching to yield lower latency and higherthroughput for the same functionality Furthermore data IOrequests could be directly forwarded to the host file system

The FUSE framework offers a few configuration (config)options to the developers to tune the behavior of its kerneldriver for their use case However those options are coarse-grained and implemented as fixed checks embedded in thedriver code thus it offers limited static control For examplefor file systems that do not support certain operations (eggetxattr) the FUSE driver caches this knowledge upon firstENOSUPPORT reply and does not issue such requests subse-quently While this benefits the file systems that completelyomit certain functionality (eg security xattr labels) it doesnot allow custom and fine-grained filtering of requests thatmay be desired by some file systems supporting only partialfunctionality For example file systems providing encryp-tion (or compression) functionality may desire a fine-grainedcustom control over the kernel driver to skip the decryption(or decompression) operation in user space during read()requests on non-sensitive (or unzipped) files

As such the FUSE framework proves to be too low-leveland general-purpose in many cases While it enables a numberof different use cases modifying the framework to efficientlyhandle special needs of each use case is impractical This isa typical unbalanced generality vs specialization problemwhich can be addressed by extending the functionality of theFUSE driver in the kernel [6 13 43]

3 DesignIn this section we 1) present an overview of EXTFUSE 2)discuss the design goals and challenges we faced and 3)mechanisms we adopted to address those challenges

31 OverviewEXTFUSE is a framework for developing extensible FUSEfile systems for UNIX-like monolithic OSes It allows the

USENIX Association 2019 USENIX Annual Technical Conference 123

unprivileged FUSE daemon processes to register ldquothinrdquo exten-sions in the kernel for specialized handling of low-level filesystem requests while retaining their existing complex logicin user space to achieve the desired level of performance

The registered extensions are safely executed under a sand-boxed eBPF runtime environment in the kernel ( sect33) im-mediately as requests are issued from the upper file system(eg VFS) Sandboxing enables the FUSE daemon to safelyextend the functionality of the driver at runtime and offers afine-grained ability to either serve each request entirely in thekernel or fall back to user space thereby offering safety ofuser space and performance of kernel file systems

EXTFUSE also provides a set of APIs to create sharedkey-value data structures (called maps) that can host abstractdata blobs The user-space daemon and its kernel extensionscan leverage maps to storemanipulate their custom data typesas needed to work in concert and serve file system requests inthe kernel without incurring expensive user-kernel round-tripcommunication if deemed unnecessary

32 Goals and ChallengesThe over-arching goal of EXTFUSE is to carefully balancethe safety and runtime extensibility of user file systems toachieve the desired level of performance and specializedfunctionality Nevertheless in order for developers to useEXTFUSE it must also be easy to adopt We identify thefollowing concrete design goalsDesign Compatibility The abstractions and interfaces of-fered by EXTFUSE framework must be compatible withFUSE without hindering existing functionality or proper-ties It must be general-purpose so as to support multipledifferent use cases Developers must further be able to adoptEXTFUSE for their use case without overhauling changesto their existing design It must be easy for them to imple-ment specialized extensions with little to no knowledge of theunderlying implementation detailsModular Extensibility EXTFUSE must be highly modularand limit any unnecessary new changes to FUSE Particularlydevelopers must be able to retain their existing user-spacelogic and introduce specialized extensions only if neededBalancing Safety and Performance Finally withEXTFUSE even unprivileged (and untrusted) FUSEdaemon processes must be able to safely extend thefunctionality of the driver as needed to offer performancethat is as close as possible to the in-kernel implementationHowever unlike Microkernels [2 23 30] that host systemservices in separate protection domains as user processesor the OSes that have been developed with safe runtimeextensibility as a design goal [6 13] extending systemservices of general-purpose UNIX-like monolithic OSesposes a design trade-off question between the safety andperformance requirements

Untrusted extensions must be as lightweight as possiblewith their access restricted to only a few well-defined APIs

to guarantee safety For example kernel file systems offernear-native performance but executing complex logic in thekernel results in questionable reliability Additionally mostOS kernels employ Address Space Randomization [36] DataExecution Prevention [5] etc for code protection and hid-ing memory pointers Providing unrestricted kernel accessto extensions can render such protections useless There-fore extensions must not be able to access arbitrary memoryaddresses or leak pointer values to user space

However severely restricting extensibility can prevent userfile systems from fully meeting their operative performanceand functionality goals thus defeating the purpose of exten-sions in the first place Therefore EXTFUSE must carefullybalance safety and performance goalsCorrectness Furthermore specialized extensions can alterthe existing design of user file systems which can lead tocorrectness issues For example there will be two separatepaths (fast and slow) both operating on the requests and datastructs at the same time The framework must provide a wayfor them to synchronize and offer safe concurrent accesses

33 eBPFEXTFUSE leverages extended BPF (eBPF) [26] an in-kernelVirtual Machine (VM) runtime framework to load and safelyexecute user file system extensionsRicher functionality eBPF is an extension of classic Berke-ley Packet Filters (BPF) an in-kernel interpreter for a pseudomachine architecture designed to only accept simple networkfiltering rules from user space It enhances BPF to includemore versatility such as 64-bit support a richer instruction set(eg call cond jump) more registers and native performancethrough JIT compilationHigh-level language support The eBPF bytecode backendis also supported by ClangLLVM compiler toolchain whichallows functionality logic to be written in a familiar high-levellanguage such as C and GoSafety The eBPF framework provides a safe execution en-vironment in the kernel It prohibits execution of arbitrarycode and access to arbitrary kernel memory regions insteadthe framework restricts access to a set of kernel helper APIsdepending on the target kernel subsystem (eg network) andrequired functionality (eg packet handling) The frame-work includes a static analyzer (called verifier) that checksthe correctness of the bytecode by performing an exhaus-tive depth-first search through its control flow graph to detectproblems such as infinite loops out-of-bound and illegalmemory errors The framework can also be configured toallow or deny eBPF bytecode execution request from unprivi-leged processesKey-Value Maps eBPF allows user space to create map datastructures to store arbitrary key-value blobs using systemcalls and access them using file descriptors Maps are alsoaccessible to eBPF bytecode in the kernel thus providing acommunication channel between user space and the bytecode

124 2019 USENIX Annual Technical Conference USENIX Association

to define custom key-value types and share execution state ordata Concurrent accesses to maps are protected under read-copy update (RCU) synchronization mechanism Howevermaps consume unswappable kernel memory Furthermorethey are either accessible to everyone (eg by passing filedescriptors) or only to CAP_SYS_ADMIN processes

eBPF is a part of the Linux kernel and is already used heav-ily by networking tracing and profiling subsystems Givenits rich functionality and safety properties we adopt eBPFfor providing support for extensible user file systems Specifi-cally we define a white-list of kernel APIs (including theirparameters and return types) and abstractions that user filesystem extensions can safely use to realize their specializedfunctionality The eBPF verifier utilizes the whitelist to vali-date the correctness of the extensions We also build on eBPFabstractions (eg maps) and apply further access restrictionsto enable safe in-kernel execution as needed

34 Architecture

Figure 1 Architectural view of the EXTFUSE framework Thecomponents modified or introduced have been highlighted

Figure 1 shows the architecture of the EXTFUSE frame-work It is enabled by three core components namely a kernelfile system (driver) a user library (libExtFUSE) and an in-kernel eBPF virtual machine runtime (VM)

The EXTFUSE driver uses interposition technique to in-terface with FUSE at low-level file system operations How-ever unlike the FUSE driver that simply forwards file systemrequests to user space the EXTFUSE driver is capable ofdirectly delivering requests to in-kernel handlers (extensions)It can also forward a few restricted set of requests (eg readwrite) to the host (lower) file system if present The latteris needed for stackable user file systems that add thin func-tionality on top of the host file system libExtFUSE exports aset of APIs and abstractions for serving requests in the kernelhiding the underlying implementation details

Use of libExtFUSE is optional and independent of libfuseThe existing file system handlers registered with libfuse

FS Interface API(s) Abstractions Description

Low-level fuse_lowlevel_ops Inode FS Ops

Kernel Access API(s) Abstractions Description

eBPF Funcs bpf_ UID PID etc Helper FuncsFUSE extfuse_reply_ fuse_reply_ Req OutputKernel bpf_set_pasthru FileDesc Enable PthruKernel bpf_clear_pasthru FileDesc Disable Pthru

DataStructs API(s) Abstractions Description

SHashMap CRUD KeyVal Hosts arbitrary data blobsInodeMap CRUD FileDesc Hosts upper-lower inode pairs

Table 2 APIs and abstractions provided by EXTFUSE It providesFUSE-like file system interface for easy portability CRUD (createread update and delete) APIs are offered for map data structuresto operate on KeyValue pairs Kernel accesses are restricted tostandard eBPF kernel helper functions We introduced APIs toaccess the same FUSE request parameters as available to user space

continue to reside in user space Therefore their invocationincurs context switches and thus we refer to their executionas the slow path With EXTFUSE user space can also registerkernel extensions that are invoked immediately as file systemrequests are received from the VFS in order to allow servingthem in the kernel We refer to the in-kernel execution as thefast path Depending upon the return values from the fastpath the requests can be marked as served or be sent to theuser-space daemon via the slow path to avail any complexprocessing as needed Fast path can also return a special valuethat instructs the EXTFUSE driver to interpose and forwardthe request to the lower file system However this feature isonly available to stackable user file systems and is verifiedwhen the extensions are loaded in the kernel

The fast path interfaces exported by libExtFUSE are thesame as those exported by libfuse to the slow path Thisis important for easy transfer of design and portability Weleverage eBPF support in the LLVMClang compiler toolchainto provide developers with a familiar set of APIs and allowthem to implement their custom functionality logic in a subsetof the C language

The extensions are loaded and executed inside the kernelunder the eBPF VM sandbox thereby providing user space afine-grained ability to safely extend the functionality of FUSEkernel driver at runtime for specialized handling of each filesystem request

35 EXTFUSE APIs and AbstractionslibExtFUSE provides a set of high-level APIs and abstractionsto the developers for easy implementation of their specializedextensions hiding the complex implementation details Ta-ble 2 summarizes the APIs For handling file system opera-tions libExtFUSE exports the familiar set of FUSE interfacesand corresponding abstractions (eg inode) for design com-patibility Both low-level as well as high-level file systeminterfaces are available offering flexibility and developmentease Furthermore as with libfuse the daemon can reg-

USENIX Association 2019 USENIX Annual Technical Conference 125

ister extensions for a few or all of the file system APIs of-fering them flexibility to implement their functionality withno additional development burden The extensions receivethe same request parameters (struct fuse_[inout]) as theuser-space daemon This design choice not only conformsto the principle of least privilege but also offers the user-space daemon and the extensions the same interface for easyportability

For hostingsharing data between the user daemon andkernel extensions libExtFUSE provides a secure variant ofeBPF HashMap keyvalue data structure called SHashMap thatstores arbitrary keyvalue blobs Unlike regular eBPF mapsthat are either accessible to all user processes or only toCAP_SYS_ADMIN processes SHashMap is only accessible bythe unprivileged daemon that creates it libExtFUSE furtherabstracts low-level details of SHashMap and provides high-level CRUD APIs to create read update and delete entries(keyvalue pairs)

EXTFUSE also provides a special InodeMap to enablepassthrough IO feature for stackable EXTFUSE file sys-tems (sect52) Unlike SHashMap that stores arbitrary entriesInodeMap takes open file handle as key and stores a pointer tothe corresponding lower (host) inode as value Furthermoreto prevent leakage of inode object to user space the InodeMapvalues can only be read by the EXTFUSE driver

36 WorkflowTo understand how EXTFUSE facilitates implementation ofextensible user file systems we describe its workflow in de-tail Upon mounting the user file system FUSE driver sendsFUSE_INIT request to the user-space daemon At this pointthe user daemon checks if the OS kernel supports EXTFUSEframework by looking for FUSE_CAP_ExtFUSE flag in the re-quest parameters If supported the daemon must invokelibExtFUSE init API to load the eBPF program that containsspecialized handlers (extensions) into the kernel and regis-ter them with the EXTFUSE driver This is achieved usingbpf_load_prog system call which invokes eBPF verifier tocheck the integrity of the extensions If failed the programis discarded and the user-space daemon is notified of the er-rors The daemon can then either exit or continue with defaultFUSE functionality If the verification step succeeds and theJIT engine is enabled the extensions are processed by theJIT compiler to generate machine assembly code ready forexecution as needed

Extensions are installed in a bpf_prog_type map (calledextension map) which serves effectively as a jump tableTo invoke an extension the FUSE driver simply executesa bpf_tail_call (far jump) with the FUSE operation code(eg FUSE_OPEN) as an index into the extension map Oncethe eBPF program is loaded the daemon must informEXTFUSE driver about the kernel extensions by replyingto FUSE_INIT containing identifiers to the extension map

Once notified EXTFUSE can safely load and execute the

Component Version Loc Modified Loc New

FUSE kernel driver 4110 312 874FUSE user-space library 320 23 84EXTFUSE user-space library - - 581

Table 3 Changes made to the existing Linux FUSE framework tosupport EXTFUSE functionality

extensions at runtime under the eBPF VM environment Ev-ery request is first delivered to the fast path which may decideto 1) serve it (eg using data shared between the fast andslow paths) 2) pass the request through to the lower file sys-tem (eg after modifying parameters or performing accesschecks) or 3) take the slow path and deliver the request to userspace for complex processing logic (eg data encryption)as needed Since the execution path is chosen per-requestindependently and the fast path is always invoked first thekernel extensions and user daemon can work in concert andsynchronize access to requests and shared data structures Itis important to note that the EXTFUSE driver only acts as athin interposition layer between the FUSE driver and kernelextensions and in some cases between the FUSE driver andthe lower file system As such it does not perform any IOoperation or attempts to serve requests on its own

4 Implementation

To implement EXTFUSE we provided eBPF support forFUSE Specifically we added additional kernel helper func-tions and designed two new map types to support secure com-munication between the user-space daemon and kernel exten-sions as well as support for passthrough access in readwriteWe modified FUSE driver to first invoke registered eBPF han-dlers (extensions) Passthrough implementation is adoptedfrom WrapFS [54] a wrapper stackable in-kernel file systemSpecifically we modified FUSE driver to pass IO requestsdirectly to the lower file system

Since with EXTFUSE developers can install extensions tobypass the user-space daemon and pass IO requests directlyto the lower file system a malicious process could stack anumber of EXTFUSE file systems on top of each other andcause the kernel stack to overflow To guard against suchattacks we limit the number of EXTFUSE layers that couldbe stacked on a mount point We rely on s_stack_depthfield in the super-block to track the number of stacked layersand check it against FILESYSTEM_MAX_STACK_DEPTH whichwe limit to two Table 3 reports the number of lines of codefor EXTFUSE We also modified libfuse to allow apps toregister kernel extensions

5 Optimizations

Here we describe a set of optimizations that can be enabledby leveraging custom kernel extensions in EXTFUSE to im-plement in-kernel handling of file system requests

126 2019 USENIX Annual Technical Conference USENIX Association

1 void handle_lookup(fuse_req_t req fuse_ino_t pino2 const char name) 3 lookup or create node cname parent pino 4 struct fuse_entry_param e5 if (find_or_create_node(req pino name ampe)) return6 + lookup_key_t key = pino name7 + lookup_val_t val = 0not stale ampe8 + extfuse_insert_shmap(ampkey ampval) cache this entry 9 fuse_reply_entry(req ampe)

10

Figure 2 FUSE daemon lookup handler in user space WithEXTFUSE lines 6-8 (+) enable caching replies in the kernel

51 Customized in-kernel metadata cachingMetadata operations such as lookup and getattr are fre-quently issued and thus form high sources of latency in FUSEfile systems [50] Unlike VFS caches that are only reactiveand fixed in functionality EXTFUSE can be leveraged toproactively cache metadata replies in the kernel Kernel ex-tensions can be installed to manage and serve subsequentoperations from caches without switching to user spaceExample To illustrate let us consider the lookup operationIt is the most common operation issued internally by the VFSfor serving open() stat() and unlink() system calls Eachcomponent of the input path string is searched using lookupto fetch the corresponding inode data structure Figure 2lists code fragment for FUSE daemon handler that serveslookup requests in user space (slow path) The FUSE lookupAPI takes two input parameters the parent node ID andthe next path component name The node ID is a 64-bitinteger that uniquely identifies the parent inode The daemonhandler function traverses the parent directory searching forthe child entry corresponding to the next path componentUpon successful search it populates the fuse_entry_paramdata structure with the node ID and attributes (eg size) ofthe child and sends it to the FUSE driver which creates anew inode for the dentry object representing the child entry

With EXTFUSE developers could define a SHashMap thathosts fuse_entry_param replies in the kernel (lines 7-10) Acomposite key generated from the parent node identifier andthe next path component string arguments is used as an indexinto the map for inserting corresponding replies Since themap is also accessible to the extensions in the kernel sub-sequent requests could be served from the map by installingthe EXTFUSE lookup extension (fast path) Figure 3 listsits code fragment The extension uses the same compositekey as an index into the hash map to search whether the cor-responding fuse_entry_param entry exists If a valid entryis found the reference count (nlookup) is incremented and areply is sent to the FUSE driver

Similarly replies from user space daemon for other meta-data operations such as getattr getxattr and readlinkcould be cached using maps and served in the kernel by re-spective extensions (Table 4) Network FUSE file systemssuch as SshFS [39] and Gluster [40] already perform aggres-sive metadata caching and batching at client to reduce thenumber of remote calls to the server SshFS [39] for example

1 int lookup_extension(extfuse_req_t req fuse_ino_t pino2 const char name) 3 lookup in map bail out if not cached or stale 4 lookup_key_t key = pino name5 lookup_val_t val = extfuse_lookup_shmap(ampkey)6 if (val || atomic_read(ampval-gtstale)) return UPCALL7 EXAMPLE Android sdcard daemon perm check 8 if (check_caller_access(pino name)) return -EACCES9 populate output incr count (used in FUSE_FORGET)

10 extfuse_reply_entry(req ampval-gte)11 atomic_incr(ampval-gtnlookup 1)12 return SUCCESS13

Figure 3 EXTFUSE lookup kernel extension that serves validcached replies without incurring any context switches Customizedchecks could further be included Android sdcard daemon permis-sion check is shown as an example (see Figure 10)

implements its own directory attribute and symlink cachesWith EXTFUSE such caches could be implemented in thekernel for further performance gainsInvalidation While caching metadata in the kernel reducesthe number of context switches to user space developers mustalso carefully invalidate replies as necessary For examplewhen a file (or dir) is deleted or renamed the correspondingcached lookup replies must be invalidated Invalidations canbe performed in user space by the relevant request handlers orin the kernel by installing their extensions before new changesare made However the former case may introduce raceconditions and produce incorrect results because all requeststo user space daemon are queued up by the FUSE driverwhereas requests to the extensions are not Cached lookupreplies can be invalidated in extensions for unlink rmdir andrename operations Similarly when attributes or permissionson a file change cached getattr replies can be invalidated insetattr extension Our design ensures race-free invalidationby executing the extensions before forwarding requests touser space daemon where the changes may be madeAdvantages over VFS caching As previously mentionedrecent optimizations added to FUSE framework leverage VFScaches to reduce user-kernel context switching For instanceby specifying non-zero entry_valid and attr_valid timeoutvalues dentries and inodes cached by the VFS from previ-ous lookup operations could be utilized to serve subsequentlookup and getattr requests respectively However theVFS offers no control to the user file system over the cacheddata For example if the file system is mounted withoutthe default_permissions parameter VFS caching of inodesintroduces a security bug [21] This is because the cached per-missions are only checked for first accessing user In contrastwith EXTFUSE developers can define their own metadatacaches and install custom code to manage them For instanceextensions can perform uid-based access permission checksbefore serving requests from the caches to obviate the afore-mentioned security issue (Figure 10)

Additionally unlike VFS caches that are only reactiveEXTFUSE enables proactive caching For example since areaddir request is expected after an opendir call the user-space daemon could proactively cache directory entries in the

USENIX Association 2019 USENIX Annual Technical Conference 127

Metadata Map Key Map Value Caching Operations Serving Extensions Invalidation Operations

Inode ltnodeID namegt fuse_entry_param lookup create mkdir mknod lookup unlink rmdir renameAttrs ltnodeIDgt fuse_attr_out getattr lookup getattr setattr unlink rmdirSymlink ltnodeIDgt link path symlink readlink readlink unlinkDentry ltnodeIDgt fuse_dirent opendir readdir readdir releasedir unlink rmdir renameXAttrs ltnodeID labelgt xattr value open getxattr listxattr getattr listxattr close setxattr removexattr

Table 4 Metadata can be cached in the kernel using eBPF maps by the user-space daemon and served by kernel extensions

kernel by inserting them in a BPF map while serving opendirrequests to reduce transitions to user space Alternativelysimilar to read-ahead optimization proactive caching of sub-sequent directory entries could be performed during the firstreaddir call to the user-space daemon Memory occupied bycached entries could then be freed by the releasedir handlerin user space that deletes them from the map Similarly se-curity labels on a file could be cached during the open call touser space and served in the kernel by getxattr extensionsNonetheless since eBPF maps consume kernel memory de-velopers must carefully manage caches and limit the numberof map entries to keep memory usage under check

52 Passthrough IO for stacking functionality

Many user file systems are stackable with a thin layer offunctionality that does not require complex processing inthe user-space For example LoggedFS [16] filters requeststhat must be logged logs them as needed and then simplyforwards them to the lower file system User-space unionfile systems such as MergerFS [20] determine the backendhost file in open and redirects IO requests to it BindFS [35]mirrors another mount point with custom permissions checksAndroid sdcard daemon performs access permission checksand emulates the case-insensitive behavior of FAT only inmetadata operations (eg lookup open etc) but forwardsdata IO requests directly to the lower file system For suchsimple cases the FUSE API proves to be too low-level andincurs unnecessarily high overhead due to context switching

With EXTFUSE readwrite IO requests can take the fastpath and directly be forwarded to the lower (host) file systemwithout incurring any context-switching if the complex slow-path user-space logic is not needed Figure 4 shows howthe user-space daemon can install the lower file descriptorin InodeMap while handling open() system call for notifyingthe EXTFUSE driver to store a reference to the lower inodekernel object With the custom_filtering_logic(path) con-dition this can be done selectively for example if accesspermission checks pass in Android sdcard daemon Simi-larly BindFS and MergerFS can adopt EXTFUSE to availpassthrough optimization The readwrite kernel extensionscan check in InodeMap to detect whether the target file is setupfor passthrough access If found EXTFUSE driver can be in-structed with a special return code to directly forward the IOrequest to the lower file system with the corresponding lowerinode object as parameter Figure 5 shows a template read

1 void handle_open(fuse_req_t req fuse_ino_t ino2 const struct fuse_open_in in) 3 file represented by ino inode num 4 struct fuse_open_out out char path[PATH_MAX]5 int len fd = open_file(ino in-gtflags path ampout)6 if (fd gt 0 ampamp custom_filtering_logic(path)) 7 + install fd in inode map for pasthru 8 + imap_key_t key = out-gtfh9 + imap_val_t val = fd lower fd

10 + extfuse_insert_imap(ampkey ampval)11

Figure 4 FUSE daemon open handler in user space WithEXTFUSE lines 7-9 (+) enable passthrough access on the file1 int read_extension(extfuse_req_t req fuse_ino_t ino2 const struct fuse_read_in in) 3 lookup in inode map passthrough if exists 4 imap_key_t key = in-gtfh5 if (extfuse_lookup_imap(ampkey)) return UPCALL6 EXAMPLE LoggedFS log operation 7 log_op(req ino FUSE_READ in sizeof(in))8 return PASSTHRU forward req to lower FS 9

Figure 5 The EXTFUSE read kernel extension returns PASSTHRU toforward request directly to the lower file system Custom thin func-tionality could further be pushed in the kernel LoggedFS loggingfunction is shown as an example (see Figure 11)

extension Kernel extensions can include additional logic orchecks before returning For instance LoggedFS readwriteextensions can filter and log operations as needed sect62

6 EvaluationTo evaluate EXTFUSE we answer the following questionsbull Baseline Performance How does an EXTFUSE imple-

mentation of a file system perform when compared to itsin-kernel and FUSE implementations (sect61)

bull Use cases What kind of existing FUSE file systems canbenefit from EXTFUSE and what performance improve-ments can they expect ( sect62)

61 PerformanceTo measure the baseline performance of EXTFUSE weadopted the simple no-ops (null) stackable FUSE file sys-tem called Stackfs [50] This user-space daemon serves allrequests by forwarding them to the host (lower) file system Itincludes all recent FUSE optimizations (Table 5) We evaluateStackfs under all possible EXTFUSE configs listed in Table 5Each config represents a particular level of performance thatcould potentially be achieved for example by caching meta-data in the kernel or directly passing readwrite requeststhrough the host file system for stacking functionality To putour results in context we compare our results with EXT4 and

128 2019 USENIX Annual Technical Conference USENIX Association

Figure 6 Throughput(opssec) for EXT4 and FUSEEXTFUSE Stackfs (w xattr) file systems under different configs (Table 5) as measuredby Random Read(RR)Write(RW) Sequential Read(SR)Write(SW) Filebench [48] data micro-workloads with IO Sizes between 4KB-1MBand settings Nth N threads Nf N files We use the same workloads as in [50]

Figure 7 Number of file system request received by the daemon inFUSEEXTFUSE Stackfs (w xattr) under workloads in Figure 6Only a few relevant request types are shown

Figure 8 Throughput(Opssec) for EXT4 and FUSEEXTFUSEStackfs (w xattr) under different configs (Table 5) as measuredby Filebench [48] Creation(C) Deletion(D) Reading(R) metadatamicro-workloads on 4KB files and FileServer(F) WebServer(W)macro-workloads with settings NthN threads NfN files

the optimized FUSE implementation of Stackfs (Opt)Testbed We use the same experiments and settings as in [50]Specifically we used EXT4 because of its popularity as thehost file system and ran benchmarks to evaluate Howeversince FUSE performance problems were reported to be moreprominent with a faster storage medium we only carry outour experiments with SSDs We used a Samsung 850 EVO250GB SSD installed on an Asus machine with Intel Quad-Core i5-3550 33 GHz and 16GB RAM running Ubuntu16043 Further to minimize any variability we formatted the

Config File System Optimizations

Opt [50] FUSE 128K Writes Splice WBCache MltThrdMDOpt EXTFUSE Opt + Caches lookup attrs xattrsAllOpt EXTFUSE MDOpt + Pass RW reqs through host FS

Table 5 Different Stackfs configs evaluated

SSD before each experiment and disabled EXT4 lazy inodeinitialization To evaluate file systems that implement xattroperations for handling security labels (eg in Android) ourimplementation of Opt supports xattrs and thus differs fromthe implementation in [50]Workloads Our workload consists of Filebench [48] microand synthetic macro benchmarks to test each config withmetadata- and data-heavy operations across a wide range ofIO sizes and parallelism settings We measure the low-levelthroughput (opssec) Our macro-benchmarks consist of asynthetic file server and web serverMicro Results Figure 6 shows the results of micro workloadunder different configs listed in Table 5

Reads Due to the default 128KB read-ahead feature ofFUSE the sequential read throughput on a single file with asingle thread for all IO sizes and under all Stackfs configsremained the same Multi-threading improved for the sequen-tial read benchmark with 32 threads and 32 files Only onerequest was generated per thread for lookup and getattr op-erations Hence metadata caching in MDOpt was not effectiveSince FUSE Opt performance is already at par with EXT4the passthrough feature in AllOpt was not utilized

Unlike sequential reads small random reads could not takeadvantage of the read-ahead feature of FUSE Additionally4KB reads are not spliced and incur data copying across user-kernel boundary With 32 threads operating on a single filethe throughput improves due to multi-threading in Opt How-ever degradation is observed with 4KB reads AllOpt passesall reads through EXT4 and hence offers near-native through-put In some cases the performance was slightly better than

USENIX Association 2019 USENIX Annual Technical Conference 129

EXT4 We believe that this minor improvement is due todouble caching at the VFS layer Due to a single request perthread for metadata operations no improvement was seenwith EXTFUSE metadata caching

Writes During sequential writes the 128K big writes andwriteback caching in Opt allow the FUSE driver to batch smallwrites (up to 128KB) together in the page cache to offer ahigher throughput However random writes are not batchedAs a result more write requests are delivered to user spacewhich negatively affects the throughput Multiple threads on asingle file perform better for requests bigger than 4KB as theyare spliced With EXTFUSE AllOpt all writes are passedthrough the EXT4 file system to offer improved performance

Write throughput degrades severely for FUSE file systemsthat support extended attributes because the VFS issues agetxattr request before every write Small IO requestsperform worse as they require more write which generatemore getxattr requests Opt random writes generated 30xfewer getxattr requests for 128KB compared to 4KB writesresulting in a 23 decrease in the throughput of 4KB writes

In contrast MDOpt caches the getxattr reply in the kernelupon the first call and serves subsequent getxattr requestswithout incurring further transitions to user space Figure 7compares the number of requests received by the user-spacedaemon in Opt and MDOpt Caching replies reduced the over-head for 4KB workload to less than 5 Similar behavior wasobserved with both sequential writes and random writesMacro Results Figure 8 shows the results of macro-workloads and synthetic server workloads emulated usingFilebench under various configs Neither of the EXTFUSEconfigs offer improvements over FUSE Opt under creationand deletion workloads as these metadata-heavy workloadscreated and deleted a number of files respectively This isbecause no metadata caching could be utilized by MDOpt Sim-ilarly no passthrough writes were utilized with AllOpt since4KB files were created and closed in user space In contrastthe File and Web server workloads under EXTFUSE utilizedboth metadata caching and passthrough access features andimproved performance We saw a 47 89 and 100 dropin lookup getattr and getattr requests to user space un-der MDOpt respectively when configured to cache up to 64Kfor each type of request AllOpt further enabled passthroughreadwrite requests to offer near native throughput for bothmacro reads and server workloadsReal Workload We also evaluated EXTFUSE with two realworkloads namely kernel decompression and compilation of4180 Linux kernel We created three separate caches forhosting lookup getattr and getxattr replies Each cachecould host up to 64K entries resulting in allocation of up to atotal of 50MB memory when fully populated

The kernel compilation make tinyconfig make -j4 ex-periment on our test machine (see sect6) reported a 52 drop incompilation time from 3974 secs under FUSE Opt to 3768secs with EXTFUSE MDOpt compared to 3091 secs with

EXT4 This was due to over 75 99 and 100 decreasein lookup getattr and getxattr requests to user spacerespectively (Figure 9) getxattr replies were proactivelycached while handling open requests thus no transitions touser space were observed for serving xattr requests WithEXTFUSE AllOpt the compilation time further dropped to3364 secs because of 100 reduction in read and writerequests to user space

In contrast the kernel decompression tar xf experimentreported a 635 drop in the completion time from 1102secs under FUSE Opt to 1032 secs with EXTFUSE MDOptcompared to 527 secs with EXT4 With EXTFUSE AllOptthe decompression time further dropped to 867 secs due to100 reduction in read and write requests to user spaceas shown in Figure 9 Nevertheless reducing the numberof cached entries for metadata requests to 4K resulted in adecompression time of 1087 secs (253 increase) due to3555 more getattr requests to user space This suggeststhat developers must efficiently manage caches

Figure 9 Linux kernel 4180 untar (decompress) and compilationtime taken with StackFS under FUSE and EXTFUSE settings Num-ber of metadata and IO requests are reduced with EXTFUSE

62 Use casesWe ported four real-world stackable FUSE file systemsnamely LoggedFS Android sdcard daemon MergerFSand BindFS to EXTFUSE and enabled both metadatacaching sect51 and passthrough IO sect52 optimizations

File System Functionality Ext Loc

StackFS [50] No-ops File System 664BindFS [35] Mirroring File System 792Android sdcard [24] Perm checks amp FAT Emu 928MergerFS [20] Union File System 686LoggedFS [16] Logging File System 748

Table 6 Lines of code (Loc) of kernel extensions required to adoptEXTFUSE for existing FUSE file systems We added support formetadata caching as well as RW passthrough

As EXTFUSE allows file systems to retain their exist-ing FUSE daemon code as the default slow path adoptingEXTFUSE for real-world file systems is easy On averagewe made less than 100 lines of changes to the existing FUSEcode to invoke EXTFUSE helper library functions for ma-nipulating kernel extensions including maps We added ker-

130 2019 USENIX Annual Technical Conference USENIX Association

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 4: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

Media W xattr (Diff) WO xattr (Diff)

SW RW SW RW

HDD -029 -380 -017 -283SSD -115 -2338 +031 -1224

Table 1 Percentage difference between IO throughput (opssec) forEXT4 vs FUSE StackfsOpt (see sect6) under single thread 4K SeqWrite (SW) and Rand Write (RW) settings on a 60GB file acrossdifferent storage media as reported by Filebench [48] Received 15million getxattr requests

trip communication and data copying and thus inevitablyyields poor runtime performance Nonetheless FUSE hasevolved significantly over the years several optimizationshave been added to minimize the user-kernel communicationzero-copy data transfer (splicing) and utilizing system-wideshared VFS caches (page cache for data IO and dentry cachefor metadata) However despite these optimizations FUSEseverely degrades runtime performance in a host of scenar-ios For instance data caching improves IO throughput bybatching requests (eg read-aheads small writes) but it doesnot help apps that perform random reads or demand low writelatency (eg databases) Splicing is only used for over 4K re-quests Therefore apps with small writesreads (eg Androidapps etc) will incur data copying overheads Worse yet theoverhead is higher with faster storage media Even a simplepassthrough (no-ops) FUSE file system can introduce up to83 overhead for metadata-heavy workloads on SSD [50]

The performance penalty incurred by user file systems over-shadows their benefits Consequently some user file systemshave been replaced with alternative implementations in pro-duction For instance Android v70 replaced the sdcard dae-mon with its in-kernel implementation after several years [24]Ceph [52] adopted an in-kernel client for Linux kernel [8]

22 Generality vs SpecializationWe observe that being highly general-purpose the FUSEframework induces unnecessary user-kernel communicationin many cases yielding low throughput and high latency Forinstance getxattr requests generated by the VFS duringwrite() system calls (one per write) can double the numberof user-kernel transitions which decreases the IO through-put of sequential writes by over 27 and random writes byover 44 compared to native (EXT4) performance (Table 1)Moreover the penalty incurred is higher with the faster media

Nevertheless simple filtering or caching metadata repliesin the kernel can substantially reduce user-kernel communi-cation For instance by caching the last getxattr reply inthe FUSE driver and simply validating the cached state forevery subsequent getxattr request from the VFS on the samefile unnecessary user-kernel transitions can be eliminated toachieve significant improvement in write performance Simi-larly replies from other metadata operations such as lookup

and getattr can be cached validated and served entirelywithin the kernel Note that custom validation of cached meta-data is imperative and lack of support to do so may result inincorrect behavior as happens in the case of optimized FUSEthat leverages VFS caches to serve requests (sect51)

Many stackable user file systems add a thin layer of func-tionality they perform simple checks in a few operations andpass remaining requests directly through the host (lower) filesystem LoggedFS [16] filters requests that must be loggedand do so by accessing host file system services Union filesystems such as MergerFS [20] determine the backend hostfile in open() and redirects IO requests to it Android sdcarddaemon performs access permission checks only in metadataoperations (eg open lookup) but data IO requests (egread write etc) are simply forwarded to the lower file sys-tem Thin functionality that realizes such use cases does notneed any complex processing in user space and thereforecan easily be stacked in the kernel thereby avoiding expen-sive user-kernel switching to yield lower latency and higherthroughput for the same functionality Furthermore data IOrequests could be directly forwarded to the host file system

The FUSE framework offers a few configuration (config)options to the developers to tune the behavior of its kerneldriver for their use case However those options are coarse-grained and implemented as fixed checks embedded in thedriver code thus it offers limited static control For examplefor file systems that do not support certain operations (eggetxattr) the FUSE driver caches this knowledge upon firstENOSUPPORT reply and does not issue such requests subse-quently While this benefits the file systems that completelyomit certain functionality (eg security xattr labels) it doesnot allow custom and fine-grained filtering of requests thatmay be desired by some file systems supporting only partialfunctionality For example file systems providing encryp-tion (or compression) functionality may desire a fine-grainedcustom control over the kernel driver to skip the decryption(or decompression) operation in user space during read()requests on non-sensitive (or unzipped) files

As such the FUSE framework proves to be too low-leveland general-purpose in many cases While it enables a numberof different use cases modifying the framework to efficientlyhandle special needs of each use case is impractical This isa typical unbalanced generality vs specialization problemwhich can be addressed by extending the functionality of theFUSE driver in the kernel [6 13 43]

3 DesignIn this section we 1) present an overview of EXTFUSE 2)discuss the design goals and challenges we faced and 3)mechanisms we adopted to address those challenges

31 OverviewEXTFUSE is a framework for developing extensible FUSEfile systems for UNIX-like monolithic OSes It allows the

USENIX Association 2019 USENIX Annual Technical Conference 123

unprivileged FUSE daemon processes to register ldquothinrdquo exten-sions in the kernel for specialized handling of low-level filesystem requests while retaining their existing complex logicin user space to achieve the desired level of performance

The registered extensions are safely executed under a sand-boxed eBPF runtime environment in the kernel ( sect33) im-mediately as requests are issued from the upper file system(eg VFS) Sandboxing enables the FUSE daemon to safelyextend the functionality of the driver at runtime and offers afine-grained ability to either serve each request entirely in thekernel or fall back to user space thereby offering safety ofuser space and performance of kernel file systems

EXTFUSE also provides a set of APIs to create sharedkey-value data structures (called maps) that can host abstractdata blobs The user-space daemon and its kernel extensionscan leverage maps to storemanipulate their custom data typesas needed to work in concert and serve file system requests inthe kernel without incurring expensive user-kernel round-tripcommunication if deemed unnecessary

32 Goals and ChallengesThe over-arching goal of EXTFUSE is to carefully balancethe safety and runtime extensibility of user file systems toachieve the desired level of performance and specializedfunctionality Nevertheless in order for developers to useEXTFUSE it must also be easy to adopt We identify thefollowing concrete design goalsDesign Compatibility The abstractions and interfaces of-fered by EXTFUSE framework must be compatible withFUSE without hindering existing functionality or proper-ties It must be general-purpose so as to support multipledifferent use cases Developers must further be able to adoptEXTFUSE for their use case without overhauling changesto their existing design It must be easy for them to imple-ment specialized extensions with little to no knowledge of theunderlying implementation detailsModular Extensibility EXTFUSE must be highly modularand limit any unnecessary new changes to FUSE Particularlydevelopers must be able to retain their existing user-spacelogic and introduce specialized extensions only if neededBalancing Safety and Performance Finally withEXTFUSE even unprivileged (and untrusted) FUSEdaemon processes must be able to safely extend thefunctionality of the driver as needed to offer performancethat is as close as possible to the in-kernel implementationHowever unlike Microkernels [2 23 30] that host systemservices in separate protection domains as user processesor the OSes that have been developed with safe runtimeextensibility as a design goal [6 13] extending systemservices of general-purpose UNIX-like monolithic OSesposes a design trade-off question between the safety andperformance requirements

Untrusted extensions must be as lightweight as possiblewith their access restricted to only a few well-defined APIs

to guarantee safety For example kernel file systems offernear-native performance but executing complex logic in thekernel results in questionable reliability Additionally mostOS kernels employ Address Space Randomization [36] DataExecution Prevention [5] etc for code protection and hid-ing memory pointers Providing unrestricted kernel accessto extensions can render such protections useless There-fore extensions must not be able to access arbitrary memoryaddresses or leak pointer values to user space

However severely restricting extensibility can prevent userfile systems from fully meeting their operative performanceand functionality goals thus defeating the purpose of exten-sions in the first place Therefore EXTFUSE must carefullybalance safety and performance goalsCorrectness Furthermore specialized extensions can alterthe existing design of user file systems which can lead tocorrectness issues For example there will be two separatepaths (fast and slow) both operating on the requests and datastructs at the same time The framework must provide a wayfor them to synchronize and offer safe concurrent accesses

33 eBPFEXTFUSE leverages extended BPF (eBPF) [26] an in-kernelVirtual Machine (VM) runtime framework to load and safelyexecute user file system extensionsRicher functionality eBPF is an extension of classic Berke-ley Packet Filters (BPF) an in-kernel interpreter for a pseudomachine architecture designed to only accept simple networkfiltering rules from user space It enhances BPF to includemore versatility such as 64-bit support a richer instruction set(eg call cond jump) more registers and native performancethrough JIT compilationHigh-level language support The eBPF bytecode backendis also supported by ClangLLVM compiler toolchain whichallows functionality logic to be written in a familiar high-levellanguage such as C and GoSafety The eBPF framework provides a safe execution en-vironment in the kernel It prohibits execution of arbitrarycode and access to arbitrary kernel memory regions insteadthe framework restricts access to a set of kernel helper APIsdepending on the target kernel subsystem (eg network) andrequired functionality (eg packet handling) The frame-work includes a static analyzer (called verifier) that checksthe correctness of the bytecode by performing an exhaus-tive depth-first search through its control flow graph to detectproblems such as infinite loops out-of-bound and illegalmemory errors The framework can also be configured toallow or deny eBPF bytecode execution request from unprivi-leged processesKey-Value Maps eBPF allows user space to create map datastructures to store arbitrary key-value blobs using systemcalls and access them using file descriptors Maps are alsoaccessible to eBPF bytecode in the kernel thus providing acommunication channel between user space and the bytecode

124 2019 USENIX Annual Technical Conference USENIX Association

to define custom key-value types and share execution state ordata Concurrent accesses to maps are protected under read-copy update (RCU) synchronization mechanism Howevermaps consume unswappable kernel memory Furthermorethey are either accessible to everyone (eg by passing filedescriptors) or only to CAP_SYS_ADMIN processes

eBPF is a part of the Linux kernel and is already used heav-ily by networking tracing and profiling subsystems Givenits rich functionality and safety properties we adopt eBPFfor providing support for extensible user file systems Specifi-cally we define a white-list of kernel APIs (including theirparameters and return types) and abstractions that user filesystem extensions can safely use to realize their specializedfunctionality The eBPF verifier utilizes the whitelist to vali-date the correctness of the extensions We also build on eBPFabstractions (eg maps) and apply further access restrictionsto enable safe in-kernel execution as needed

34 Architecture

Figure 1 Architectural view of the EXTFUSE framework Thecomponents modified or introduced have been highlighted

Figure 1 shows the architecture of the EXTFUSE frame-work It is enabled by three core components namely a kernelfile system (driver) a user library (libExtFUSE) and an in-kernel eBPF virtual machine runtime (VM)

The EXTFUSE driver uses interposition technique to in-terface with FUSE at low-level file system operations How-ever unlike the FUSE driver that simply forwards file systemrequests to user space the EXTFUSE driver is capable ofdirectly delivering requests to in-kernel handlers (extensions)It can also forward a few restricted set of requests (eg readwrite) to the host (lower) file system if present The latteris needed for stackable user file systems that add thin func-tionality on top of the host file system libExtFUSE exports aset of APIs and abstractions for serving requests in the kernelhiding the underlying implementation details

Use of libExtFUSE is optional and independent of libfuseThe existing file system handlers registered with libfuse

FS Interface API(s) Abstractions Description

Low-level fuse_lowlevel_ops Inode FS Ops

Kernel Access API(s) Abstractions Description

eBPF Funcs bpf_ UID PID etc Helper FuncsFUSE extfuse_reply_ fuse_reply_ Req OutputKernel bpf_set_pasthru FileDesc Enable PthruKernel bpf_clear_pasthru FileDesc Disable Pthru

DataStructs API(s) Abstractions Description

SHashMap CRUD KeyVal Hosts arbitrary data blobsInodeMap CRUD FileDesc Hosts upper-lower inode pairs

Table 2 APIs and abstractions provided by EXTFUSE It providesFUSE-like file system interface for easy portability CRUD (createread update and delete) APIs are offered for map data structuresto operate on KeyValue pairs Kernel accesses are restricted tostandard eBPF kernel helper functions We introduced APIs toaccess the same FUSE request parameters as available to user space

continue to reside in user space Therefore their invocationincurs context switches and thus we refer to their executionas the slow path With EXTFUSE user space can also registerkernel extensions that are invoked immediately as file systemrequests are received from the VFS in order to allow servingthem in the kernel We refer to the in-kernel execution as thefast path Depending upon the return values from the fastpath the requests can be marked as served or be sent to theuser-space daemon via the slow path to avail any complexprocessing as needed Fast path can also return a special valuethat instructs the EXTFUSE driver to interpose and forwardthe request to the lower file system However this feature isonly available to stackable user file systems and is verifiedwhen the extensions are loaded in the kernel

The fast path interfaces exported by libExtFUSE are thesame as those exported by libfuse to the slow path Thisis important for easy transfer of design and portability Weleverage eBPF support in the LLVMClang compiler toolchainto provide developers with a familiar set of APIs and allowthem to implement their custom functionality logic in a subsetof the C language

The extensions are loaded and executed inside the kernelunder the eBPF VM sandbox thereby providing user space afine-grained ability to safely extend the functionality of FUSEkernel driver at runtime for specialized handling of each filesystem request

35 EXTFUSE APIs and AbstractionslibExtFUSE provides a set of high-level APIs and abstractionsto the developers for easy implementation of their specializedextensions hiding the complex implementation details Ta-ble 2 summarizes the APIs For handling file system opera-tions libExtFUSE exports the familiar set of FUSE interfacesand corresponding abstractions (eg inode) for design com-patibility Both low-level as well as high-level file systeminterfaces are available offering flexibility and developmentease Furthermore as with libfuse the daemon can reg-

USENIX Association 2019 USENIX Annual Technical Conference 125

ister extensions for a few or all of the file system APIs of-fering them flexibility to implement their functionality withno additional development burden The extensions receivethe same request parameters (struct fuse_[inout]) as theuser-space daemon This design choice not only conformsto the principle of least privilege but also offers the user-space daemon and the extensions the same interface for easyportability

For hostingsharing data between the user daemon andkernel extensions libExtFUSE provides a secure variant ofeBPF HashMap keyvalue data structure called SHashMap thatstores arbitrary keyvalue blobs Unlike regular eBPF mapsthat are either accessible to all user processes or only toCAP_SYS_ADMIN processes SHashMap is only accessible bythe unprivileged daemon that creates it libExtFUSE furtherabstracts low-level details of SHashMap and provides high-level CRUD APIs to create read update and delete entries(keyvalue pairs)

EXTFUSE also provides a special InodeMap to enablepassthrough IO feature for stackable EXTFUSE file sys-tems (sect52) Unlike SHashMap that stores arbitrary entriesInodeMap takes open file handle as key and stores a pointer tothe corresponding lower (host) inode as value Furthermoreto prevent leakage of inode object to user space the InodeMapvalues can only be read by the EXTFUSE driver

36 WorkflowTo understand how EXTFUSE facilitates implementation ofextensible user file systems we describe its workflow in de-tail Upon mounting the user file system FUSE driver sendsFUSE_INIT request to the user-space daemon At this pointthe user daemon checks if the OS kernel supports EXTFUSEframework by looking for FUSE_CAP_ExtFUSE flag in the re-quest parameters If supported the daemon must invokelibExtFUSE init API to load the eBPF program that containsspecialized handlers (extensions) into the kernel and regis-ter them with the EXTFUSE driver This is achieved usingbpf_load_prog system call which invokes eBPF verifier tocheck the integrity of the extensions If failed the programis discarded and the user-space daemon is notified of the er-rors The daemon can then either exit or continue with defaultFUSE functionality If the verification step succeeds and theJIT engine is enabled the extensions are processed by theJIT compiler to generate machine assembly code ready forexecution as needed

Extensions are installed in a bpf_prog_type map (calledextension map) which serves effectively as a jump tableTo invoke an extension the FUSE driver simply executesa bpf_tail_call (far jump) with the FUSE operation code(eg FUSE_OPEN) as an index into the extension map Oncethe eBPF program is loaded the daemon must informEXTFUSE driver about the kernel extensions by replyingto FUSE_INIT containing identifiers to the extension map

Once notified EXTFUSE can safely load and execute the

Component Version Loc Modified Loc New

FUSE kernel driver 4110 312 874FUSE user-space library 320 23 84EXTFUSE user-space library - - 581

Table 3 Changes made to the existing Linux FUSE framework tosupport EXTFUSE functionality

extensions at runtime under the eBPF VM environment Ev-ery request is first delivered to the fast path which may decideto 1) serve it (eg using data shared between the fast andslow paths) 2) pass the request through to the lower file sys-tem (eg after modifying parameters or performing accesschecks) or 3) take the slow path and deliver the request to userspace for complex processing logic (eg data encryption)as needed Since the execution path is chosen per-requestindependently and the fast path is always invoked first thekernel extensions and user daemon can work in concert andsynchronize access to requests and shared data structures Itis important to note that the EXTFUSE driver only acts as athin interposition layer between the FUSE driver and kernelextensions and in some cases between the FUSE driver andthe lower file system As such it does not perform any IOoperation or attempts to serve requests on its own

4 Implementation

To implement EXTFUSE we provided eBPF support forFUSE Specifically we added additional kernel helper func-tions and designed two new map types to support secure com-munication between the user-space daemon and kernel exten-sions as well as support for passthrough access in readwriteWe modified FUSE driver to first invoke registered eBPF han-dlers (extensions) Passthrough implementation is adoptedfrom WrapFS [54] a wrapper stackable in-kernel file systemSpecifically we modified FUSE driver to pass IO requestsdirectly to the lower file system

Since with EXTFUSE developers can install extensions tobypass the user-space daemon and pass IO requests directlyto the lower file system a malicious process could stack anumber of EXTFUSE file systems on top of each other andcause the kernel stack to overflow To guard against suchattacks we limit the number of EXTFUSE layers that couldbe stacked on a mount point We rely on s_stack_depthfield in the super-block to track the number of stacked layersand check it against FILESYSTEM_MAX_STACK_DEPTH whichwe limit to two Table 3 reports the number of lines of codefor EXTFUSE We also modified libfuse to allow apps toregister kernel extensions

5 Optimizations

Here we describe a set of optimizations that can be enabledby leveraging custom kernel extensions in EXTFUSE to im-plement in-kernel handling of file system requests

126 2019 USENIX Annual Technical Conference USENIX Association

1 void handle_lookup(fuse_req_t req fuse_ino_t pino2 const char name) 3 lookup or create node cname parent pino 4 struct fuse_entry_param e5 if (find_or_create_node(req pino name ampe)) return6 + lookup_key_t key = pino name7 + lookup_val_t val = 0not stale ampe8 + extfuse_insert_shmap(ampkey ampval) cache this entry 9 fuse_reply_entry(req ampe)

10

Figure 2 FUSE daemon lookup handler in user space WithEXTFUSE lines 6-8 (+) enable caching replies in the kernel

51 Customized in-kernel metadata cachingMetadata operations such as lookup and getattr are fre-quently issued and thus form high sources of latency in FUSEfile systems [50] Unlike VFS caches that are only reactiveand fixed in functionality EXTFUSE can be leveraged toproactively cache metadata replies in the kernel Kernel ex-tensions can be installed to manage and serve subsequentoperations from caches without switching to user spaceExample To illustrate let us consider the lookup operationIt is the most common operation issued internally by the VFSfor serving open() stat() and unlink() system calls Eachcomponent of the input path string is searched using lookupto fetch the corresponding inode data structure Figure 2lists code fragment for FUSE daemon handler that serveslookup requests in user space (slow path) The FUSE lookupAPI takes two input parameters the parent node ID andthe next path component name The node ID is a 64-bitinteger that uniquely identifies the parent inode The daemonhandler function traverses the parent directory searching forthe child entry corresponding to the next path componentUpon successful search it populates the fuse_entry_paramdata structure with the node ID and attributes (eg size) ofthe child and sends it to the FUSE driver which creates anew inode for the dentry object representing the child entry

With EXTFUSE developers could define a SHashMap thathosts fuse_entry_param replies in the kernel (lines 7-10) Acomposite key generated from the parent node identifier andthe next path component string arguments is used as an indexinto the map for inserting corresponding replies Since themap is also accessible to the extensions in the kernel sub-sequent requests could be served from the map by installingthe EXTFUSE lookup extension (fast path) Figure 3 listsits code fragment The extension uses the same compositekey as an index into the hash map to search whether the cor-responding fuse_entry_param entry exists If a valid entryis found the reference count (nlookup) is incremented and areply is sent to the FUSE driver

Similarly replies from user space daemon for other meta-data operations such as getattr getxattr and readlinkcould be cached using maps and served in the kernel by re-spective extensions (Table 4) Network FUSE file systemssuch as SshFS [39] and Gluster [40] already perform aggres-sive metadata caching and batching at client to reduce thenumber of remote calls to the server SshFS [39] for example

1 int lookup_extension(extfuse_req_t req fuse_ino_t pino2 const char name) 3 lookup in map bail out if not cached or stale 4 lookup_key_t key = pino name5 lookup_val_t val = extfuse_lookup_shmap(ampkey)6 if (val || atomic_read(ampval-gtstale)) return UPCALL7 EXAMPLE Android sdcard daemon perm check 8 if (check_caller_access(pino name)) return -EACCES9 populate output incr count (used in FUSE_FORGET)

10 extfuse_reply_entry(req ampval-gte)11 atomic_incr(ampval-gtnlookup 1)12 return SUCCESS13

Figure 3 EXTFUSE lookup kernel extension that serves validcached replies without incurring any context switches Customizedchecks could further be included Android sdcard daemon permis-sion check is shown as an example (see Figure 10)

implements its own directory attribute and symlink cachesWith EXTFUSE such caches could be implemented in thekernel for further performance gainsInvalidation While caching metadata in the kernel reducesthe number of context switches to user space developers mustalso carefully invalidate replies as necessary For examplewhen a file (or dir) is deleted or renamed the correspondingcached lookup replies must be invalidated Invalidations canbe performed in user space by the relevant request handlers orin the kernel by installing their extensions before new changesare made However the former case may introduce raceconditions and produce incorrect results because all requeststo user space daemon are queued up by the FUSE driverwhereas requests to the extensions are not Cached lookupreplies can be invalidated in extensions for unlink rmdir andrename operations Similarly when attributes or permissionson a file change cached getattr replies can be invalidated insetattr extension Our design ensures race-free invalidationby executing the extensions before forwarding requests touser space daemon where the changes may be madeAdvantages over VFS caching As previously mentionedrecent optimizations added to FUSE framework leverage VFScaches to reduce user-kernel context switching For instanceby specifying non-zero entry_valid and attr_valid timeoutvalues dentries and inodes cached by the VFS from previ-ous lookup operations could be utilized to serve subsequentlookup and getattr requests respectively However theVFS offers no control to the user file system over the cacheddata For example if the file system is mounted withoutthe default_permissions parameter VFS caching of inodesintroduces a security bug [21] This is because the cached per-missions are only checked for first accessing user In contrastwith EXTFUSE developers can define their own metadatacaches and install custom code to manage them For instanceextensions can perform uid-based access permission checksbefore serving requests from the caches to obviate the afore-mentioned security issue (Figure 10)

Additionally unlike VFS caches that are only reactiveEXTFUSE enables proactive caching For example since areaddir request is expected after an opendir call the user-space daemon could proactively cache directory entries in the

USENIX Association 2019 USENIX Annual Technical Conference 127

Metadata Map Key Map Value Caching Operations Serving Extensions Invalidation Operations

Inode ltnodeID namegt fuse_entry_param lookup create mkdir mknod lookup unlink rmdir renameAttrs ltnodeIDgt fuse_attr_out getattr lookup getattr setattr unlink rmdirSymlink ltnodeIDgt link path symlink readlink readlink unlinkDentry ltnodeIDgt fuse_dirent opendir readdir readdir releasedir unlink rmdir renameXAttrs ltnodeID labelgt xattr value open getxattr listxattr getattr listxattr close setxattr removexattr

Table 4 Metadata can be cached in the kernel using eBPF maps by the user-space daemon and served by kernel extensions

kernel by inserting them in a BPF map while serving opendirrequests to reduce transitions to user space Alternativelysimilar to read-ahead optimization proactive caching of sub-sequent directory entries could be performed during the firstreaddir call to the user-space daemon Memory occupied bycached entries could then be freed by the releasedir handlerin user space that deletes them from the map Similarly se-curity labels on a file could be cached during the open call touser space and served in the kernel by getxattr extensionsNonetheless since eBPF maps consume kernel memory de-velopers must carefully manage caches and limit the numberof map entries to keep memory usage under check

52 Passthrough IO for stacking functionality

Many user file systems are stackable with a thin layer offunctionality that does not require complex processing inthe user-space For example LoggedFS [16] filters requeststhat must be logged logs them as needed and then simplyforwards them to the lower file system User-space unionfile systems such as MergerFS [20] determine the backendhost file in open and redirects IO requests to it BindFS [35]mirrors another mount point with custom permissions checksAndroid sdcard daemon performs access permission checksand emulates the case-insensitive behavior of FAT only inmetadata operations (eg lookup open etc) but forwardsdata IO requests directly to the lower file system For suchsimple cases the FUSE API proves to be too low-level andincurs unnecessarily high overhead due to context switching

With EXTFUSE readwrite IO requests can take the fastpath and directly be forwarded to the lower (host) file systemwithout incurring any context-switching if the complex slow-path user-space logic is not needed Figure 4 shows howthe user-space daemon can install the lower file descriptorin InodeMap while handling open() system call for notifyingthe EXTFUSE driver to store a reference to the lower inodekernel object With the custom_filtering_logic(path) con-dition this can be done selectively for example if accesspermission checks pass in Android sdcard daemon Simi-larly BindFS and MergerFS can adopt EXTFUSE to availpassthrough optimization The readwrite kernel extensionscan check in InodeMap to detect whether the target file is setupfor passthrough access If found EXTFUSE driver can be in-structed with a special return code to directly forward the IOrequest to the lower file system with the corresponding lowerinode object as parameter Figure 5 shows a template read

1 void handle_open(fuse_req_t req fuse_ino_t ino2 const struct fuse_open_in in) 3 file represented by ino inode num 4 struct fuse_open_out out char path[PATH_MAX]5 int len fd = open_file(ino in-gtflags path ampout)6 if (fd gt 0 ampamp custom_filtering_logic(path)) 7 + install fd in inode map for pasthru 8 + imap_key_t key = out-gtfh9 + imap_val_t val = fd lower fd

10 + extfuse_insert_imap(ampkey ampval)11

Figure 4 FUSE daemon open handler in user space WithEXTFUSE lines 7-9 (+) enable passthrough access on the file1 int read_extension(extfuse_req_t req fuse_ino_t ino2 const struct fuse_read_in in) 3 lookup in inode map passthrough if exists 4 imap_key_t key = in-gtfh5 if (extfuse_lookup_imap(ampkey)) return UPCALL6 EXAMPLE LoggedFS log operation 7 log_op(req ino FUSE_READ in sizeof(in))8 return PASSTHRU forward req to lower FS 9

Figure 5 The EXTFUSE read kernel extension returns PASSTHRU toforward request directly to the lower file system Custom thin func-tionality could further be pushed in the kernel LoggedFS loggingfunction is shown as an example (see Figure 11)

extension Kernel extensions can include additional logic orchecks before returning For instance LoggedFS readwriteextensions can filter and log operations as needed sect62

6 EvaluationTo evaluate EXTFUSE we answer the following questionsbull Baseline Performance How does an EXTFUSE imple-

mentation of a file system perform when compared to itsin-kernel and FUSE implementations (sect61)

bull Use cases What kind of existing FUSE file systems canbenefit from EXTFUSE and what performance improve-ments can they expect ( sect62)

61 PerformanceTo measure the baseline performance of EXTFUSE weadopted the simple no-ops (null) stackable FUSE file sys-tem called Stackfs [50] This user-space daemon serves allrequests by forwarding them to the host (lower) file system Itincludes all recent FUSE optimizations (Table 5) We evaluateStackfs under all possible EXTFUSE configs listed in Table 5Each config represents a particular level of performance thatcould potentially be achieved for example by caching meta-data in the kernel or directly passing readwrite requeststhrough the host file system for stacking functionality To putour results in context we compare our results with EXT4 and

128 2019 USENIX Annual Technical Conference USENIX Association

Figure 6 Throughput(opssec) for EXT4 and FUSEEXTFUSE Stackfs (w xattr) file systems under different configs (Table 5) as measuredby Random Read(RR)Write(RW) Sequential Read(SR)Write(SW) Filebench [48] data micro-workloads with IO Sizes between 4KB-1MBand settings Nth N threads Nf N files We use the same workloads as in [50]

Figure 7 Number of file system request received by the daemon inFUSEEXTFUSE Stackfs (w xattr) under workloads in Figure 6Only a few relevant request types are shown

Figure 8 Throughput(Opssec) for EXT4 and FUSEEXTFUSEStackfs (w xattr) under different configs (Table 5) as measuredby Filebench [48] Creation(C) Deletion(D) Reading(R) metadatamicro-workloads on 4KB files and FileServer(F) WebServer(W)macro-workloads with settings NthN threads NfN files

the optimized FUSE implementation of Stackfs (Opt)Testbed We use the same experiments and settings as in [50]Specifically we used EXT4 because of its popularity as thehost file system and ran benchmarks to evaluate Howeversince FUSE performance problems were reported to be moreprominent with a faster storage medium we only carry outour experiments with SSDs We used a Samsung 850 EVO250GB SSD installed on an Asus machine with Intel Quad-Core i5-3550 33 GHz and 16GB RAM running Ubuntu16043 Further to minimize any variability we formatted the

Config File System Optimizations

Opt [50] FUSE 128K Writes Splice WBCache MltThrdMDOpt EXTFUSE Opt + Caches lookup attrs xattrsAllOpt EXTFUSE MDOpt + Pass RW reqs through host FS

Table 5 Different Stackfs configs evaluated

SSD before each experiment and disabled EXT4 lazy inodeinitialization To evaluate file systems that implement xattroperations for handling security labels (eg in Android) ourimplementation of Opt supports xattrs and thus differs fromthe implementation in [50]Workloads Our workload consists of Filebench [48] microand synthetic macro benchmarks to test each config withmetadata- and data-heavy operations across a wide range ofIO sizes and parallelism settings We measure the low-levelthroughput (opssec) Our macro-benchmarks consist of asynthetic file server and web serverMicro Results Figure 6 shows the results of micro workloadunder different configs listed in Table 5

Reads Due to the default 128KB read-ahead feature ofFUSE the sequential read throughput on a single file with asingle thread for all IO sizes and under all Stackfs configsremained the same Multi-threading improved for the sequen-tial read benchmark with 32 threads and 32 files Only onerequest was generated per thread for lookup and getattr op-erations Hence metadata caching in MDOpt was not effectiveSince FUSE Opt performance is already at par with EXT4the passthrough feature in AllOpt was not utilized

Unlike sequential reads small random reads could not takeadvantage of the read-ahead feature of FUSE Additionally4KB reads are not spliced and incur data copying across user-kernel boundary With 32 threads operating on a single filethe throughput improves due to multi-threading in Opt How-ever degradation is observed with 4KB reads AllOpt passesall reads through EXT4 and hence offers near-native through-put In some cases the performance was slightly better than

USENIX Association 2019 USENIX Annual Technical Conference 129

EXT4 We believe that this minor improvement is due todouble caching at the VFS layer Due to a single request perthread for metadata operations no improvement was seenwith EXTFUSE metadata caching

Writes During sequential writes the 128K big writes andwriteback caching in Opt allow the FUSE driver to batch smallwrites (up to 128KB) together in the page cache to offer ahigher throughput However random writes are not batchedAs a result more write requests are delivered to user spacewhich negatively affects the throughput Multiple threads on asingle file perform better for requests bigger than 4KB as theyare spliced With EXTFUSE AllOpt all writes are passedthrough the EXT4 file system to offer improved performance

Write throughput degrades severely for FUSE file systemsthat support extended attributes because the VFS issues agetxattr request before every write Small IO requestsperform worse as they require more write which generatemore getxattr requests Opt random writes generated 30xfewer getxattr requests for 128KB compared to 4KB writesresulting in a 23 decrease in the throughput of 4KB writes

In contrast MDOpt caches the getxattr reply in the kernelupon the first call and serves subsequent getxattr requestswithout incurring further transitions to user space Figure 7compares the number of requests received by the user-spacedaemon in Opt and MDOpt Caching replies reduced the over-head for 4KB workload to less than 5 Similar behavior wasobserved with both sequential writes and random writesMacro Results Figure 8 shows the results of macro-workloads and synthetic server workloads emulated usingFilebench under various configs Neither of the EXTFUSEconfigs offer improvements over FUSE Opt under creationand deletion workloads as these metadata-heavy workloadscreated and deleted a number of files respectively This isbecause no metadata caching could be utilized by MDOpt Sim-ilarly no passthrough writes were utilized with AllOpt since4KB files were created and closed in user space In contrastthe File and Web server workloads under EXTFUSE utilizedboth metadata caching and passthrough access features andimproved performance We saw a 47 89 and 100 dropin lookup getattr and getattr requests to user space un-der MDOpt respectively when configured to cache up to 64Kfor each type of request AllOpt further enabled passthroughreadwrite requests to offer near native throughput for bothmacro reads and server workloadsReal Workload We also evaluated EXTFUSE with two realworkloads namely kernel decompression and compilation of4180 Linux kernel We created three separate caches forhosting lookup getattr and getxattr replies Each cachecould host up to 64K entries resulting in allocation of up to atotal of 50MB memory when fully populated

The kernel compilation make tinyconfig make -j4 ex-periment on our test machine (see sect6) reported a 52 drop incompilation time from 3974 secs under FUSE Opt to 3768secs with EXTFUSE MDOpt compared to 3091 secs with

EXT4 This was due to over 75 99 and 100 decreasein lookup getattr and getxattr requests to user spacerespectively (Figure 9) getxattr replies were proactivelycached while handling open requests thus no transitions touser space were observed for serving xattr requests WithEXTFUSE AllOpt the compilation time further dropped to3364 secs because of 100 reduction in read and writerequests to user space

In contrast the kernel decompression tar xf experimentreported a 635 drop in the completion time from 1102secs under FUSE Opt to 1032 secs with EXTFUSE MDOptcompared to 527 secs with EXT4 With EXTFUSE AllOptthe decompression time further dropped to 867 secs due to100 reduction in read and write requests to user spaceas shown in Figure 9 Nevertheless reducing the numberof cached entries for metadata requests to 4K resulted in adecompression time of 1087 secs (253 increase) due to3555 more getattr requests to user space This suggeststhat developers must efficiently manage caches

Figure 9 Linux kernel 4180 untar (decompress) and compilationtime taken with StackFS under FUSE and EXTFUSE settings Num-ber of metadata and IO requests are reduced with EXTFUSE

62 Use casesWe ported four real-world stackable FUSE file systemsnamely LoggedFS Android sdcard daemon MergerFSand BindFS to EXTFUSE and enabled both metadatacaching sect51 and passthrough IO sect52 optimizations

File System Functionality Ext Loc

StackFS [50] No-ops File System 664BindFS [35] Mirroring File System 792Android sdcard [24] Perm checks amp FAT Emu 928MergerFS [20] Union File System 686LoggedFS [16] Logging File System 748

Table 6 Lines of code (Loc) of kernel extensions required to adoptEXTFUSE for existing FUSE file systems We added support formetadata caching as well as RW passthrough

As EXTFUSE allows file systems to retain their exist-ing FUSE daemon code as the default slow path adoptingEXTFUSE for real-world file systems is easy On averagewe made less than 100 lines of changes to the existing FUSEcode to invoke EXTFUSE helper library functions for ma-nipulating kernel extensions including maps We added ker-

130 2019 USENIX Annual Technical Conference USENIX Association

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 5: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

unprivileged FUSE daemon processes to register ldquothinrdquo exten-sions in the kernel for specialized handling of low-level filesystem requests while retaining their existing complex logicin user space to achieve the desired level of performance

The registered extensions are safely executed under a sand-boxed eBPF runtime environment in the kernel ( sect33) im-mediately as requests are issued from the upper file system(eg VFS) Sandboxing enables the FUSE daemon to safelyextend the functionality of the driver at runtime and offers afine-grained ability to either serve each request entirely in thekernel or fall back to user space thereby offering safety ofuser space and performance of kernel file systems

EXTFUSE also provides a set of APIs to create sharedkey-value data structures (called maps) that can host abstractdata blobs The user-space daemon and its kernel extensionscan leverage maps to storemanipulate their custom data typesas needed to work in concert and serve file system requests inthe kernel without incurring expensive user-kernel round-tripcommunication if deemed unnecessary

32 Goals and ChallengesThe over-arching goal of EXTFUSE is to carefully balancethe safety and runtime extensibility of user file systems toachieve the desired level of performance and specializedfunctionality Nevertheless in order for developers to useEXTFUSE it must also be easy to adopt We identify thefollowing concrete design goalsDesign Compatibility The abstractions and interfaces of-fered by EXTFUSE framework must be compatible withFUSE without hindering existing functionality or proper-ties It must be general-purpose so as to support multipledifferent use cases Developers must further be able to adoptEXTFUSE for their use case without overhauling changesto their existing design It must be easy for them to imple-ment specialized extensions with little to no knowledge of theunderlying implementation detailsModular Extensibility EXTFUSE must be highly modularand limit any unnecessary new changes to FUSE Particularlydevelopers must be able to retain their existing user-spacelogic and introduce specialized extensions only if neededBalancing Safety and Performance Finally withEXTFUSE even unprivileged (and untrusted) FUSEdaemon processes must be able to safely extend thefunctionality of the driver as needed to offer performancethat is as close as possible to the in-kernel implementationHowever unlike Microkernels [2 23 30] that host systemservices in separate protection domains as user processesor the OSes that have been developed with safe runtimeextensibility as a design goal [6 13] extending systemservices of general-purpose UNIX-like monolithic OSesposes a design trade-off question between the safety andperformance requirements

Untrusted extensions must be as lightweight as possiblewith their access restricted to only a few well-defined APIs

to guarantee safety For example kernel file systems offernear-native performance but executing complex logic in thekernel results in questionable reliability Additionally mostOS kernels employ Address Space Randomization [36] DataExecution Prevention [5] etc for code protection and hid-ing memory pointers Providing unrestricted kernel accessto extensions can render such protections useless There-fore extensions must not be able to access arbitrary memoryaddresses or leak pointer values to user space

However severely restricting extensibility can prevent userfile systems from fully meeting their operative performanceand functionality goals thus defeating the purpose of exten-sions in the first place Therefore EXTFUSE must carefullybalance safety and performance goalsCorrectness Furthermore specialized extensions can alterthe existing design of user file systems which can lead tocorrectness issues For example there will be two separatepaths (fast and slow) both operating on the requests and datastructs at the same time The framework must provide a wayfor them to synchronize and offer safe concurrent accesses

33 eBPFEXTFUSE leverages extended BPF (eBPF) [26] an in-kernelVirtual Machine (VM) runtime framework to load and safelyexecute user file system extensionsRicher functionality eBPF is an extension of classic Berke-ley Packet Filters (BPF) an in-kernel interpreter for a pseudomachine architecture designed to only accept simple networkfiltering rules from user space It enhances BPF to includemore versatility such as 64-bit support a richer instruction set(eg call cond jump) more registers and native performancethrough JIT compilationHigh-level language support The eBPF bytecode backendis also supported by ClangLLVM compiler toolchain whichallows functionality logic to be written in a familiar high-levellanguage such as C and GoSafety The eBPF framework provides a safe execution en-vironment in the kernel It prohibits execution of arbitrarycode and access to arbitrary kernel memory regions insteadthe framework restricts access to a set of kernel helper APIsdepending on the target kernel subsystem (eg network) andrequired functionality (eg packet handling) The frame-work includes a static analyzer (called verifier) that checksthe correctness of the bytecode by performing an exhaus-tive depth-first search through its control flow graph to detectproblems such as infinite loops out-of-bound and illegalmemory errors The framework can also be configured toallow or deny eBPF bytecode execution request from unprivi-leged processesKey-Value Maps eBPF allows user space to create map datastructures to store arbitrary key-value blobs using systemcalls and access them using file descriptors Maps are alsoaccessible to eBPF bytecode in the kernel thus providing acommunication channel between user space and the bytecode

124 2019 USENIX Annual Technical Conference USENIX Association

to define custom key-value types and share execution state ordata Concurrent accesses to maps are protected under read-copy update (RCU) synchronization mechanism Howevermaps consume unswappable kernel memory Furthermorethey are either accessible to everyone (eg by passing filedescriptors) or only to CAP_SYS_ADMIN processes

eBPF is a part of the Linux kernel and is already used heav-ily by networking tracing and profiling subsystems Givenits rich functionality and safety properties we adopt eBPFfor providing support for extensible user file systems Specifi-cally we define a white-list of kernel APIs (including theirparameters and return types) and abstractions that user filesystem extensions can safely use to realize their specializedfunctionality The eBPF verifier utilizes the whitelist to vali-date the correctness of the extensions We also build on eBPFabstractions (eg maps) and apply further access restrictionsto enable safe in-kernel execution as needed

34 Architecture

Figure 1 Architectural view of the EXTFUSE framework Thecomponents modified or introduced have been highlighted

Figure 1 shows the architecture of the EXTFUSE frame-work It is enabled by three core components namely a kernelfile system (driver) a user library (libExtFUSE) and an in-kernel eBPF virtual machine runtime (VM)

The EXTFUSE driver uses interposition technique to in-terface with FUSE at low-level file system operations How-ever unlike the FUSE driver that simply forwards file systemrequests to user space the EXTFUSE driver is capable ofdirectly delivering requests to in-kernel handlers (extensions)It can also forward a few restricted set of requests (eg readwrite) to the host (lower) file system if present The latteris needed for stackable user file systems that add thin func-tionality on top of the host file system libExtFUSE exports aset of APIs and abstractions for serving requests in the kernelhiding the underlying implementation details

Use of libExtFUSE is optional and independent of libfuseThe existing file system handlers registered with libfuse

FS Interface API(s) Abstractions Description

Low-level fuse_lowlevel_ops Inode FS Ops

Kernel Access API(s) Abstractions Description

eBPF Funcs bpf_ UID PID etc Helper FuncsFUSE extfuse_reply_ fuse_reply_ Req OutputKernel bpf_set_pasthru FileDesc Enable PthruKernel bpf_clear_pasthru FileDesc Disable Pthru

DataStructs API(s) Abstractions Description

SHashMap CRUD KeyVal Hosts arbitrary data blobsInodeMap CRUD FileDesc Hosts upper-lower inode pairs

Table 2 APIs and abstractions provided by EXTFUSE It providesFUSE-like file system interface for easy portability CRUD (createread update and delete) APIs are offered for map data structuresto operate on KeyValue pairs Kernel accesses are restricted tostandard eBPF kernel helper functions We introduced APIs toaccess the same FUSE request parameters as available to user space

continue to reside in user space Therefore their invocationincurs context switches and thus we refer to their executionas the slow path With EXTFUSE user space can also registerkernel extensions that are invoked immediately as file systemrequests are received from the VFS in order to allow servingthem in the kernel We refer to the in-kernel execution as thefast path Depending upon the return values from the fastpath the requests can be marked as served or be sent to theuser-space daemon via the slow path to avail any complexprocessing as needed Fast path can also return a special valuethat instructs the EXTFUSE driver to interpose and forwardthe request to the lower file system However this feature isonly available to stackable user file systems and is verifiedwhen the extensions are loaded in the kernel

The fast path interfaces exported by libExtFUSE are thesame as those exported by libfuse to the slow path Thisis important for easy transfer of design and portability Weleverage eBPF support in the LLVMClang compiler toolchainto provide developers with a familiar set of APIs and allowthem to implement their custom functionality logic in a subsetof the C language

The extensions are loaded and executed inside the kernelunder the eBPF VM sandbox thereby providing user space afine-grained ability to safely extend the functionality of FUSEkernel driver at runtime for specialized handling of each filesystem request

35 EXTFUSE APIs and AbstractionslibExtFUSE provides a set of high-level APIs and abstractionsto the developers for easy implementation of their specializedextensions hiding the complex implementation details Ta-ble 2 summarizes the APIs For handling file system opera-tions libExtFUSE exports the familiar set of FUSE interfacesand corresponding abstractions (eg inode) for design com-patibility Both low-level as well as high-level file systeminterfaces are available offering flexibility and developmentease Furthermore as with libfuse the daemon can reg-

USENIX Association 2019 USENIX Annual Technical Conference 125

ister extensions for a few or all of the file system APIs of-fering them flexibility to implement their functionality withno additional development burden The extensions receivethe same request parameters (struct fuse_[inout]) as theuser-space daemon This design choice not only conformsto the principle of least privilege but also offers the user-space daemon and the extensions the same interface for easyportability

For hostingsharing data between the user daemon andkernel extensions libExtFUSE provides a secure variant ofeBPF HashMap keyvalue data structure called SHashMap thatstores arbitrary keyvalue blobs Unlike regular eBPF mapsthat are either accessible to all user processes or only toCAP_SYS_ADMIN processes SHashMap is only accessible bythe unprivileged daemon that creates it libExtFUSE furtherabstracts low-level details of SHashMap and provides high-level CRUD APIs to create read update and delete entries(keyvalue pairs)

EXTFUSE also provides a special InodeMap to enablepassthrough IO feature for stackable EXTFUSE file sys-tems (sect52) Unlike SHashMap that stores arbitrary entriesInodeMap takes open file handle as key and stores a pointer tothe corresponding lower (host) inode as value Furthermoreto prevent leakage of inode object to user space the InodeMapvalues can only be read by the EXTFUSE driver

36 WorkflowTo understand how EXTFUSE facilitates implementation ofextensible user file systems we describe its workflow in de-tail Upon mounting the user file system FUSE driver sendsFUSE_INIT request to the user-space daemon At this pointthe user daemon checks if the OS kernel supports EXTFUSEframework by looking for FUSE_CAP_ExtFUSE flag in the re-quest parameters If supported the daemon must invokelibExtFUSE init API to load the eBPF program that containsspecialized handlers (extensions) into the kernel and regis-ter them with the EXTFUSE driver This is achieved usingbpf_load_prog system call which invokes eBPF verifier tocheck the integrity of the extensions If failed the programis discarded and the user-space daemon is notified of the er-rors The daemon can then either exit or continue with defaultFUSE functionality If the verification step succeeds and theJIT engine is enabled the extensions are processed by theJIT compiler to generate machine assembly code ready forexecution as needed

Extensions are installed in a bpf_prog_type map (calledextension map) which serves effectively as a jump tableTo invoke an extension the FUSE driver simply executesa bpf_tail_call (far jump) with the FUSE operation code(eg FUSE_OPEN) as an index into the extension map Oncethe eBPF program is loaded the daemon must informEXTFUSE driver about the kernel extensions by replyingto FUSE_INIT containing identifiers to the extension map

Once notified EXTFUSE can safely load and execute the

Component Version Loc Modified Loc New

FUSE kernel driver 4110 312 874FUSE user-space library 320 23 84EXTFUSE user-space library - - 581

Table 3 Changes made to the existing Linux FUSE framework tosupport EXTFUSE functionality

extensions at runtime under the eBPF VM environment Ev-ery request is first delivered to the fast path which may decideto 1) serve it (eg using data shared between the fast andslow paths) 2) pass the request through to the lower file sys-tem (eg after modifying parameters or performing accesschecks) or 3) take the slow path and deliver the request to userspace for complex processing logic (eg data encryption)as needed Since the execution path is chosen per-requestindependently and the fast path is always invoked first thekernel extensions and user daemon can work in concert andsynchronize access to requests and shared data structures Itis important to note that the EXTFUSE driver only acts as athin interposition layer between the FUSE driver and kernelextensions and in some cases between the FUSE driver andthe lower file system As such it does not perform any IOoperation or attempts to serve requests on its own

4 Implementation

To implement EXTFUSE we provided eBPF support forFUSE Specifically we added additional kernel helper func-tions and designed two new map types to support secure com-munication between the user-space daemon and kernel exten-sions as well as support for passthrough access in readwriteWe modified FUSE driver to first invoke registered eBPF han-dlers (extensions) Passthrough implementation is adoptedfrom WrapFS [54] a wrapper stackable in-kernel file systemSpecifically we modified FUSE driver to pass IO requestsdirectly to the lower file system

Since with EXTFUSE developers can install extensions tobypass the user-space daemon and pass IO requests directlyto the lower file system a malicious process could stack anumber of EXTFUSE file systems on top of each other andcause the kernel stack to overflow To guard against suchattacks we limit the number of EXTFUSE layers that couldbe stacked on a mount point We rely on s_stack_depthfield in the super-block to track the number of stacked layersand check it against FILESYSTEM_MAX_STACK_DEPTH whichwe limit to two Table 3 reports the number of lines of codefor EXTFUSE We also modified libfuse to allow apps toregister kernel extensions

5 Optimizations

Here we describe a set of optimizations that can be enabledby leveraging custom kernel extensions in EXTFUSE to im-plement in-kernel handling of file system requests

126 2019 USENIX Annual Technical Conference USENIX Association

1 void handle_lookup(fuse_req_t req fuse_ino_t pino2 const char name) 3 lookup or create node cname parent pino 4 struct fuse_entry_param e5 if (find_or_create_node(req pino name ampe)) return6 + lookup_key_t key = pino name7 + lookup_val_t val = 0not stale ampe8 + extfuse_insert_shmap(ampkey ampval) cache this entry 9 fuse_reply_entry(req ampe)

10

Figure 2 FUSE daemon lookup handler in user space WithEXTFUSE lines 6-8 (+) enable caching replies in the kernel

51 Customized in-kernel metadata cachingMetadata operations such as lookup and getattr are fre-quently issued and thus form high sources of latency in FUSEfile systems [50] Unlike VFS caches that are only reactiveand fixed in functionality EXTFUSE can be leveraged toproactively cache metadata replies in the kernel Kernel ex-tensions can be installed to manage and serve subsequentoperations from caches without switching to user spaceExample To illustrate let us consider the lookup operationIt is the most common operation issued internally by the VFSfor serving open() stat() and unlink() system calls Eachcomponent of the input path string is searched using lookupto fetch the corresponding inode data structure Figure 2lists code fragment for FUSE daemon handler that serveslookup requests in user space (slow path) The FUSE lookupAPI takes two input parameters the parent node ID andthe next path component name The node ID is a 64-bitinteger that uniquely identifies the parent inode The daemonhandler function traverses the parent directory searching forthe child entry corresponding to the next path componentUpon successful search it populates the fuse_entry_paramdata structure with the node ID and attributes (eg size) ofthe child and sends it to the FUSE driver which creates anew inode for the dentry object representing the child entry

With EXTFUSE developers could define a SHashMap thathosts fuse_entry_param replies in the kernel (lines 7-10) Acomposite key generated from the parent node identifier andthe next path component string arguments is used as an indexinto the map for inserting corresponding replies Since themap is also accessible to the extensions in the kernel sub-sequent requests could be served from the map by installingthe EXTFUSE lookup extension (fast path) Figure 3 listsits code fragment The extension uses the same compositekey as an index into the hash map to search whether the cor-responding fuse_entry_param entry exists If a valid entryis found the reference count (nlookup) is incremented and areply is sent to the FUSE driver

Similarly replies from user space daemon for other meta-data operations such as getattr getxattr and readlinkcould be cached using maps and served in the kernel by re-spective extensions (Table 4) Network FUSE file systemssuch as SshFS [39] and Gluster [40] already perform aggres-sive metadata caching and batching at client to reduce thenumber of remote calls to the server SshFS [39] for example

1 int lookup_extension(extfuse_req_t req fuse_ino_t pino2 const char name) 3 lookup in map bail out if not cached or stale 4 lookup_key_t key = pino name5 lookup_val_t val = extfuse_lookup_shmap(ampkey)6 if (val || atomic_read(ampval-gtstale)) return UPCALL7 EXAMPLE Android sdcard daemon perm check 8 if (check_caller_access(pino name)) return -EACCES9 populate output incr count (used in FUSE_FORGET)

10 extfuse_reply_entry(req ampval-gte)11 atomic_incr(ampval-gtnlookup 1)12 return SUCCESS13

Figure 3 EXTFUSE lookup kernel extension that serves validcached replies without incurring any context switches Customizedchecks could further be included Android sdcard daemon permis-sion check is shown as an example (see Figure 10)

implements its own directory attribute and symlink cachesWith EXTFUSE such caches could be implemented in thekernel for further performance gainsInvalidation While caching metadata in the kernel reducesthe number of context switches to user space developers mustalso carefully invalidate replies as necessary For examplewhen a file (or dir) is deleted or renamed the correspondingcached lookup replies must be invalidated Invalidations canbe performed in user space by the relevant request handlers orin the kernel by installing their extensions before new changesare made However the former case may introduce raceconditions and produce incorrect results because all requeststo user space daemon are queued up by the FUSE driverwhereas requests to the extensions are not Cached lookupreplies can be invalidated in extensions for unlink rmdir andrename operations Similarly when attributes or permissionson a file change cached getattr replies can be invalidated insetattr extension Our design ensures race-free invalidationby executing the extensions before forwarding requests touser space daemon where the changes may be madeAdvantages over VFS caching As previously mentionedrecent optimizations added to FUSE framework leverage VFScaches to reduce user-kernel context switching For instanceby specifying non-zero entry_valid and attr_valid timeoutvalues dentries and inodes cached by the VFS from previ-ous lookup operations could be utilized to serve subsequentlookup and getattr requests respectively However theVFS offers no control to the user file system over the cacheddata For example if the file system is mounted withoutthe default_permissions parameter VFS caching of inodesintroduces a security bug [21] This is because the cached per-missions are only checked for first accessing user In contrastwith EXTFUSE developers can define their own metadatacaches and install custom code to manage them For instanceextensions can perform uid-based access permission checksbefore serving requests from the caches to obviate the afore-mentioned security issue (Figure 10)

Additionally unlike VFS caches that are only reactiveEXTFUSE enables proactive caching For example since areaddir request is expected after an opendir call the user-space daemon could proactively cache directory entries in the

USENIX Association 2019 USENIX Annual Technical Conference 127

Metadata Map Key Map Value Caching Operations Serving Extensions Invalidation Operations

Inode ltnodeID namegt fuse_entry_param lookup create mkdir mknod lookup unlink rmdir renameAttrs ltnodeIDgt fuse_attr_out getattr lookup getattr setattr unlink rmdirSymlink ltnodeIDgt link path symlink readlink readlink unlinkDentry ltnodeIDgt fuse_dirent opendir readdir readdir releasedir unlink rmdir renameXAttrs ltnodeID labelgt xattr value open getxattr listxattr getattr listxattr close setxattr removexattr

Table 4 Metadata can be cached in the kernel using eBPF maps by the user-space daemon and served by kernel extensions

kernel by inserting them in a BPF map while serving opendirrequests to reduce transitions to user space Alternativelysimilar to read-ahead optimization proactive caching of sub-sequent directory entries could be performed during the firstreaddir call to the user-space daemon Memory occupied bycached entries could then be freed by the releasedir handlerin user space that deletes them from the map Similarly se-curity labels on a file could be cached during the open call touser space and served in the kernel by getxattr extensionsNonetheless since eBPF maps consume kernel memory de-velopers must carefully manage caches and limit the numberof map entries to keep memory usage under check

52 Passthrough IO for stacking functionality

Many user file systems are stackable with a thin layer offunctionality that does not require complex processing inthe user-space For example LoggedFS [16] filters requeststhat must be logged logs them as needed and then simplyforwards them to the lower file system User-space unionfile systems such as MergerFS [20] determine the backendhost file in open and redirects IO requests to it BindFS [35]mirrors another mount point with custom permissions checksAndroid sdcard daemon performs access permission checksand emulates the case-insensitive behavior of FAT only inmetadata operations (eg lookup open etc) but forwardsdata IO requests directly to the lower file system For suchsimple cases the FUSE API proves to be too low-level andincurs unnecessarily high overhead due to context switching

With EXTFUSE readwrite IO requests can take the fastpath and directly be forwarded to the lower (host) file systemwithout incurring any context-switching if the complex slow-path user-space logic is not needed Figure 4 shows howthe user-space daemon can install the lower file descriptorin InodeMap while handling open() system call for notifyingthe EXTFUSE driver to store a reference to the lower inodekernel object With the custom_filtering_logic(path) con-dition this can be done selectively for example if accesspermission checks pass in Android sdcard daemon Simi-larly BindFS and MergerFS can adopt EXTFUSE to availpassthrough optimization The readwrite kernel extensionscan check in InodeMap to detect whether the target file is setupfor passthrough access If found EXTFUSE driver can be in-structed with a special return code to directly forward the IOrequest to the lower file system with the corresponding lowerinode object as parameter Figure 5 shows a template read

1 void handle_open(fuse_req_t req fuse_ino_t ino2 const struct fuse_open_in in) 3 file represented by ino inode num 4 struct fuse_open_out out char path[PATH_MAX]5 int len fd = open_file(ino in-gtflags path ampout)6 if (fd gt 0 ampamp custom_filtering_logic(path)) 7 + install fd in inode map for pasthru 8 + imap_key_t key = out-gtfh9 + imap_val_t val = fd lower fd

10 + extfuse_insert_imap(ampkey ampval)11

Figure 4 FUSE daemon open handler in user space WithEXTFUSE lines 7-9 (+) enable passthrough access on the file1 int read_extension(extfuse_req_t req fuse_ino_t ino2 const struct fuse_read_in in) 3 lookup in inode map passthrough if exists 4 imap_key_t key = in-gtfh5 if (extfuse_lookup_imap(ampkey)) return UPCALL6 EXAMPLE LoggedFS log operation 7 log_op(req ino FUSE_READ in sizeof(in))8 return PASSTHRU forward req to lower FS 9

Figure 5 The EXTFUSE read kernel extension returns PASSTHRU toforward request directly to the lower file system Custom thin func-tionality could further be pushed in the kernel LoggedFS loggingfunction is shown as an example (see Figure 11)

extension Kernel extensions can include additional logic orchecks before returning For instance LoggedFS readwriteextensions can filter and log operations as needed sect62

6 EvaluationTo evaluate EXTFUSE we answer the following questionsbull Baseline Performance How does an EXTFUSE imple-

mentation of a file system perform when compared to itsin-kernel and FUSE implementations (sect61)

bull Use cases What kind of existing FUSE file systems canbenefit from EXTFUSE and what performance improve-ments can they expect ( sect62)

61 PerformanceTo measure the baseline performance of EXTFUSE weadopted the simple no-ops (null) stackable FUSE file sys-tem called Stackfs [50] This user-space daemon serves allrequests by forwarding them to the host (lower) file system Itincludes all recent FUSE optimizations (Table 5) We evaluateStackfs under all possible EXTFUSE configs listed in Table 5Each config represents a particular level of performance thatcould potentially be achieved for example by caching meta-data in the kernel or directly passing readwrite requeststhrough the host file system for stacking functionality To putour results in context we compare our results with EXT4 and

128 2019 USENIX Annual Technical Conference USENIX Association

Figure 6 Throughput(opssec) for EXT4 and FUSEEXTFUSE Stackfs (w xattr) file systems under different configs (Table 5) as measuredby Random Read(RR)Write(RW) Sequential Read(SR)Write(SW) Filebench [48] data micro-workloads with IO Sizes between 4KB-1MBand settings Nth N threads Nf N files We use the same workloads as in [50]

Figure 7 Number of file system request received by the daemon inFUSEEXTFUSE Stackfs (w xattr) under workloads in Figure 6Only a few relevant request types are shown

Figure 8 Throughput(Opssec) for EXT4 and FUSEEXTFUSEStackfs (w xattr) under different configs (Table 5) as measuredby Filebench [48] Creation(C) Deletion(D) Reading(R) metadatamicro-workloads on 4KB files and FileServer(F) WebServer(W)macro-workloads with settings NthN threads NfN files

the optimized FUSE implementation of Stackfs (Opt)Testbed We use the same experiments and settings as in [50]Specifically we used EXT4 because of its popularity as thehost file system and ran benchmarks to evaluate Howeversince FUSE performance problems were reported to be moreprominent with a faster storage medium we only carry outour experiments with SSDs We used a Samsung 850 EVO250GB SSD installed on an Asus machine with Intel Quad-Core i5-3550 33 GHz and 16GB RAM running Ubuntu16043 Further to minimize any variability we formatted the

Config File System Optimizations

Opt [50] FUSE 128K Writes Splice WBCache MltThrdMDOpt EXTFUSE Opt + Caches lookup attrs xattrsAllOpt EXTFUSE MDOpt + Pass RW reqs through host FS

Table 5 Different Stackfs configs evaluated

SSD before each experiment and disabled EXT4 lazy inodeinitialization To evaluate file systems that implement xattroperations for handling security labels (eg in Android) ourimplementation of Opt supports xattrs and thus differs fromthe implementation in [50]Workloads Our workload consists of Filebench [48] microand synthetic macro benchmarks to test each config withmetadata- and data-heavy operations across a wide range ofIO sizes and parallelism settings We measure the low-levelthroughput (opssec) Our macro-benchmarks consist of asynthetic file server and web serverMicro Results Figure 6 shows the results of micro workloadunder different configs listed in Table 5

Reads Due to the default 128KB read-ahead feature ofFUSE the sequential read throughput on a single file with asingle thread for all IO sizes and under all Stackfs configsremained the same Multi-threading improved for the sequen-tial read benchmark with 32 threads and 32 files Only onerequest was generated per thread for lookup and getattr op-erations Hence metadata caching in MDOpt was not effectiveSince FUSE Opt performance is already at par with EXT4the passthrough feature in AllOpt was not utilized

Unlike sequential reads small random reads could not takeadvantage of the read-ahead feature of FUSE Additionally4KB reads are not spliced and incur data copying across user-kernel boundary With 32 threads operating on a single filethe throughput improves due to multi-threading in Opt How-ever degradation is observed with 4KB reads AllOpt passesall reads through EXT4 and hence offers near-native through-put In some cases the performance was slightly better than

USENIX Association 2019 USENIX Annual Technical Conference 129

EXT4 We believe that this minor improvement is due todouble caching at the VFS layer Due to a single request perthread for metadata operations no improvement was seenwith EXTFUSE metadata caching

Writes During sequential writes the 128K big writes andwriteback caching in Opt allow the FUSE driver to batch smallwrites (up to 128KB) together in the page cache to offer ahigher throughput However random writes are not batchedAs a result more write requests are delivered to user spacewhich negatively affects the throughput Multiple threads on asingle file perform better for requests bigger than 4KB as theyare spliced With EXTFUSE AllOpt all writes are passedthrough the EXT4 file system to offer improved performance

Write throughput degrades severely for FUSE file systemsthat support extended attributes because the VFS issues agetxattr request before every write Small IO requestsperform worse as they require more write which generatemore getxattr requests Opt random writes generated 30xfewer getxattr requests for 128KB compared to 4KB writesresulting in a 23 decrease in the throughput of 4KB writes

In contrast MDOpt caches the getxattr reply in the kernelupon the first call and serves subsequent getxattr requestswithout incurring further transitions to user space Figure 7compares the number of requests received by the user-spacedaemon in Opt and MDOpt Caching replies reduced the over-head for 4KB workload to less than 5 Similar behavior wasobserved with both sequential writes and random writesMacro Results Figure 8 shows the results of macro-workloads and synthetic server workloads emulated usingFilebench under various configs Neither of the EXTFUSEconfigs offer improvements over FUSE Opt under creationand deletion workloads as these metadata-heavy workloadscreated and deleted a number of files respectively This isbecause no metadata caching could be utilized by MDOpt Sim-ilarly no passthrough writes were utilized with AllOpt since4KB files were created and closed in user space In contrastthe File and Web server workloads under EXTFUSE utilizedboth metadata caching and passthrough access features andimproved performance We saw a 47 89 and 100 dropin lookup getattr and getattr requests to user space un-der MDOpt respectively when configured to cache up to 64Kfor each type of request AllOpt further enabled passthroughreadwrite requests to offer near native throughput for bothmacro reads and server workloadsReal Workload We also evaluated EXTFUSE with two realworkloads namely kernel decompression and compilation of4180 Linux kernel We created three separate caches forhosting lookup getattr and getxattr replies Each cachecould host up to 64K entries resulting in allocation of up to atotal of 50MB memory when fully populated

The kernel compilation make tinyconfig make -j4 ex-periment on our test machine (see sect6) reported a 52 drop incompilation time from 3974 secs under FUSE Opt to 3768secs with EXTFUSE MDOpt compared to 3091 secs with

EXT4 This was due to over 75 99 and 100 decreasein lookup getattr and getxattr requests to user spacerespectively (Figure 9) getxattr replies were proactivelycached while handling open requests thus no transitions touser space were observed for serving xattr requests WithEXTFUSE AllOpt the compilation time further dropped to3364 secs because of 100 reduction in read and writerequests to user space

In contrast the kernel decompression tar xf experimentreported a 635 drop in the completion time from 1102secs under FUSE Opt to 1032 secs with EXTFUSE MDOptcompared to 527 secs with EXT4 With EXTFUSE AllOptthe decompression time further dropped to 867 secs due to100 reduction in read and write requests to user spaceas shown in Figure 9 Nevertheless reducing the numberof cached entries for metadata requests to 4K resulted in adecompression time of 1087 secs (253 increase) due to3555 more getattr requests to user space This suggeststhat developers must efficiently manage caches

Figure 9 Linux kernel 4180 untar (decompress) and compilationtime taken with StackFS under FUSE and EXTFUSE settings Num-ber of metadata and IO requests are reduced with EXTFUSE

62 Use casesWe ported four real-world stackable FUSE file systemsnamely LoggedFS Android sdcard daemon MergerFSand BindFS to EXTFUSE and enabled both metadatacaching sect51 and passthrough IO sect52 optimizations

File System Functionality Ext Loc

StackFS [50] No-ops File System 664BindFS [35] Mirroring File System 792Android sdcard [24] Perm checks amp FAT Emu 928MergerFS [20] Union File System 686LoggedFS [16] Logging File System 748

Table 6 Lines of code (Loc) of kernel extensions required to adoptEXTFUSE for existing FUSE file systems We added support formetadata caching as well as RW passthrough

As EXTFUSE allows file systems to retain their exist-ing FUSE daemon code as the default slow path adoptingEXTFUSE for real-world file systems is easy On averagewe made less than 100 lines of changes to the existing FUSEcode to invoke EXTFUSE helper library functions for ma-nipulating kernel extensions including maps We added ker-

130 2019 USENIX Annual Technical Conference USENIX Association

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 6: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

to define custom key-value types and share execution state ordata Concurrent accesses to maps are protected under read-copy update (RCU) synchronization mechanism Howevermaps consume unswappable kernel memory Furthermorethey are either accessible to everyone (eg by passing filedescriptors) or only to CAP_SYS_ADMIN processes

eBPF is a part of the Linux kernel and is already used heav-ily by networking tracing and profiling subsystems Givenits rich functionality and safety properties we adopt eBPFfor providing support for extensible user file systems Specifi-cally we define a white-list of kernel APIs (including theirparameters and return types) and abstractions that user filesystem extensions can safely use to realize their specializedfunctionality The eBPF verifier utilizes the whitelist to vali-date the correctness of the extensions We also build on eBPFabstractions (eg maps) and apply further access restrictionsto enable safe in-kernel execution as needed

34 Architecture

Figure 1 Architectural view of the EXTFUSE framework Thecomponents modified or introduced have been highlighted

Figure 1 shows the architecture of the EXTFUSE frame-work It is enabled by three core components namely a kernelfile system (driver) a user library (libExtFUSE) and an in-kernel eBPF virtual machine runtime (VM)

The EXTFUSE driver uses interposition technique to in-terface with FUSE at low-level file system operations How-ever unlike the FUSE driver that simply forwards file systemrequests to user space the EXTFUSE driver is capable ofdirectly delivering requests to in-kernel handlers (extensions)It can also forward a few restricted set of requests (eg readwrite) to the host (lower) file system if present The latteris needed for stackable user file systems that add thin func-tionality on top of the host file system libExtFUSE exports aset of APIs and abstractions for serving requests in the kernelhiding the underlying implementation details

Use of libExtFUSE is optional and independent of libfuseThe existing file system handlers registered with libfuse

FS Interface API(s) Abstractions Description

Low-level fuse_lowlevel_ops Inode FS Ops

Kernel Access API(s) Abstractions Description

eBPF Funcs bpf_ UID PID etc Helper FuncsFUSE extfuse_reply_ fuse_reply_ Req OutputKernel bpf_set_pasthru FileDesc Enable PthruKernel bpf_clear_pasthru FileDesc Disable Pthru

DataStructs API(s) Abstractions Description

SHashMap CRUD KeyVal Hosts arbitrary data blobsInodeMap CRUD FileDesc Hosts upper-lower inode pairs

Table 2 APIs and abstractions provided by EXTFUSE It providesFUSE-like file system interface for easy portability CRUD (createread update and delete) APIs are offered for map data structuresto operate on KeyValue pairs Kernel accesses are restricted tostandard eBPF kernel helper functions We introduced APIs toaccess the same FUSE request parameters as available to user space

continue to reside in user space Therefore their invocationincurs context switches and thus we refer to their executionas the slow path With EXTFUSE user space can also registerkernel extensions that are invoked immediately as file systemrequests are received from the VFS in order to allow servingthem in the kernel We refer to the in-kernel execution as thefast path Depending upon the return values from the fastpath the requests can be marked as served or be sent to theuser-space daemon via the slow path to avail any complexprocessing as needed Fast path can also return a special valuethat instructs the EXTFUSE driver to interpose and forwardthe request to the lower file system However this feature isonly available to stackable user file systems and is verifiedwhen the extensions are loaded in the kernel

The fast path interfaces exported by libExtFUSE are thesame as those exported by libfuse to the slow path Thisis important for easy transfer of design and portability Weleverage eBPF support in the LLVMClang compiler toolchainto provide developers with a familiar set of APIs and allowthem to implement their custom functionality logic in a subsetof the C language

The extensions are loaded and executed inside the kernelunder the eBPF VM sandbox thereby providing user space afine-grained ability to safely extend the functionality of FUSEkernel driver at runtime for specialized handling of each filesystem request

35 EXTFUSE APIs and AbstractionslibExtFUSE provides a set of high-level APIs and abstractionsto the developers for easy implementation of their specializedextensions hiding the complex implementation details Ta-ble 2 summarizes the APIs For handling file system opera-tions libExtFUSE exports the familiar set of FUSE interfacesand corresponding abstractions (eg inode) for design com-patibility Both low-level as well as high-level file systeminterfaces are available offering flexibility and developmentease Furthermore as with libfuse the daemon can reg-

USENIX Association 2019 USENIX Annual Technical Conference 125

ister extensions for a few or all of the file system APIs of-fering them flexibility to implement their functionality withno additional development burden The extensions receivethe same request parameters (struct fuse_[inout]) as theuser-space daemon This design choice not only conformsto the principle of least privilege but also offers the user-space daemon and the extensions the same interface for easyportability

For hostingsharing data between the user daemon andkernel extensions libExtFUSE provides a secure variant ofeBPF HashMap keyvalue data structure called SHashMap thatstores arbitrary keyvalue blobs Unlike regular eBPF mapsthat are either accessible to all user processes or only toCAP_SYS_ADMIN processes SHashMap is only accessible bythe unprivileged daemon that creates it libExtFUSE furtherabstracts low-level details of SHashMap and provides high-level CRUD APIs to create read update and delete entries(keyvalue pairs)

EXTFUSE also provides a special InodeMap to enablepassthrough IO feature for stackable EXTFUSE file sys-tems (sect52) Unlike SHashMap that stores arbitrary entriesInodeMap takes open file handle as key and stores a pointer tothe corresponding lower (host) inode as value Furthermoreto prevent leakage of inode object to user space the InodeMapvalues can only be read by the EXTFUSE driver

36 WorkflowTo understand how EXTFUSE facilitates implementation ofextensible user file systems we describe its workflow in de-tail Upon mounting the user file system FUSE driver sendsFUSE_INIT request to the user-space daemon At this pointthe user daemon checks if the OS kernel supports EXTFUSEframework by looking for FUSE_CAP_ExtFUSE flag in the re-quest parameters If supported the daemon must invokelibExtFUSE init API to load the eBPF program that containsspecialized handlers (extensions) into the kernel and regis-ter them with the EXTFUSE driver This is achieved usingbpf_load_prog system call which invokes eBPF verifier tocheck the integrity of the extensions If failed the programis discarded and the user-space daemon is notified of the er-rors The daemon can then either exit or continue with defaultFUSE functionality If the verification step succeeds and theJIT engine is enabled the extensions are processed by theJIT compiler to generate machine assembly code ready forexecution as needed

Extensions are installed in a bpf_prog_type map (calledextension map) which serves effectively as a jump tableTo invoke an extension the FUSE driver simply executesa bpf_tail_call (far jump) with the FUSE operation code(eg FUSE_OPEN) as an index into the extension map Oncethe eBPF program is loaded the daemon must informEXTFUSE driver about the kernel extensions by replyingto FUSE_INIT containing identifiers to the extension map

Once notified EXTFUSE can safely load and execute the

Component Version Loc Modified Loc New

FUSE kernel driver 4110 312 874FUSE user-space library 320 23 84EXTFUSE user-space library - - 581

Table 3 Changes made to the existing Linux FUSE framework tosupport EXTFUSE functionality

extensions at runtime under the eBPF VM environment Ev-ery request is first delivered to the fast path which may decideto 1) serve it (eg using data shared between the fast andslow paths) 2) pass the request through to the lower file sys-tem (eg after modifying parameters or performing accesschecks) or 3) take the slow path and deliver the request to userspace for complex processing logic (eg data encryption)as needed Since the execution path is chosen per-requestindependently and the fast path is always invoked first thekernel extensions and user daemon can work in concert andsynchronize access to requests and shared data structures Itis important to note that the EXTFUSE driver only acts as athin interposition layer between the FUSE driver and kernelextensions and in some cases between the FUSE driver andthe lower file system As such it does not perform any IOoperation or attempts to serve requests on its own

4 Implementation

To implement EXTFUSE we provided eBPF support forFUSE Specifically we added additional kernel helper func-tions and designed two new map types to support secure com-munication between the user-space daemon and kernel exten-sions as well as support for passthrough access in readwriteWe modified FUSE driver to first invoke registered eBPF han-dlers (extensions) Passthrough implementation is adoptedfrom WrapFS [54] a wrapper stackable in-kernel file systemSpecifically we modified FUSE driver to pass IO requestsdirectly to the lower file system

Since with EXTFUSE developers can install extensions tobypass the user-space daemon and pass IO requests directlyto the lower file system a malicious process could stack anumber of EXTFUSE file systems on top of each other andcause the kernel stack to overflow To guard against suchattacks we limit the number of EXTFUSE layers that couldbe stacked on a mount point We rely on s_stack_depthfield in the super-block to track the number of stacked layersand check it against FILESYSTEM_MAX_STACK_DEPTH whichwe limit to two Table 3 reports the number of lines of codefor EXTFUSE We also modified libfuse to allow apps toregister kernel extensions

5 Optimizations

Here we describe a set of optimizations that can be enabledby leveraging custom kernel extensions in EXTFUSE to im-plement in-kernel handling of file system requests

126 2019 USENIX Annual Technical Conference USENIX Association

1 void handle_lookup(fuse_req_t req fuse_ino_t pino2 const char name) 3 lookup or create node cname parent pino 4 struct fuse_entry_param e5 if (find_or_create_node(req pino name ampe)) return6 + lookup_key_t key = pino name7 + lookup_val_t val = 0not stale ampe8 + extfuse_insert_shmap(ampkey ampval) cache this entry 9 fuse_reply_entry(req ampe)

10

Figure 2 FUSE daemon lookup handler in user space WithEXTFUSE lines 6-8 (+) enable caching replies in the kernel

51 Customized in-kernel metadata cachingMetadata operations such as lookup and getattr are fre-quently issued and thus form high sources of latency in FUSEfile systems [50] Unlike VFS caches that are only reactiveand fixed in functionality EXTFUSE can be leveraged toproactively cache metadata replies in the kernel Kernel ex-tensions can be installed to manage and serve subsequentoperations from caches without switching to user spaceExample To illustrate let us consider the lookup operationIt is the most common operation issued internally by the VFSfor serving open() stat() and unlink() system calls Eachcomponent of the input path string is searched using lookupto fetch the corresponding inode data structure Figure 2lists code fragment for FUSE daemon handler that serveslookup requests in user space (slow path) The FUSE lookupAPI takes two input parameters the parent node ID andthe next path component name The node ID is a 64-bitinteger that uniquely identifies the parent inode The daemonhandler function traverses the parent directory searching forthe child entry corresponding to the next path componentUpon successful search it populates the fuse_entry_paramdata structure with the node ID and attributes (eg size) ofthe child and sends it to the FUSE driver which creates anew inode for the dentry object representing the child entry

With EXTFUSE developers could define a SHashMap thathosts fuse_entry_param replies in the kernel (lines 7-10) Acomposite key generated from the parent node identifier andthe next path component string arguments is used as an indexinto the map for inserting corresponding replies Since themap is also accessible to the extensions in the kernel sub-sequent requests could be served from the map by installingthe EXTFUSE lookup extension (fast path) Figure 3 listsits code fragment The extension uses the same compositekey as an index into the hash map to search whether the cor-responding fuse_entry_param entry exists If a valid entryis found the reference count (nlookup) is incremented and areply is sent to the FUSE driver

Similarly replies from user space daemon for other meta-data operations such as getattr getxattr and readlinkcould be cached using maps and served in the kernel by re-spective extensions (Table 4) Network FUSE file systemssuch as SshFS [39] and Gluster [40] already perform aggres-sive metadata caching and batching at client to reduce thenumber of remote calls to the server SshFS [39] for example

1 int lookup_extension(extfuse_req_t req fuse_ino_t pino2 const char name) 3 lookup in map bail out if not cached or stale 4 lookup_key_t key = pino name5 lookup_val_t val = extfuse_lookup_shmap(ampkey)6 if (val || atomic_read(ampval-gtstale)) return UPCALL7 EXAMPLE Android sdcard daemon perm check 8 if (check_caller_access(pino name)) return -EACCES9 populate output incr count (used in FUSE_FORGET)

10 extfuse_reply_entry(req ampval-gte)11 atomic_incr(ampval-gtnlookup 1)12 return SUCCESS13

Figure 3 EXTFUSE lookup kernel extension that serves validcached replies without incurring any context switches Customizedchecks could further be included Android sdcard daemon permis-sion check is shown as an example (see Figure 10)

implements its own directory attribute and symlink cachesWith EXTFUSE such caches could be implemented in thekernel for further performance gainsInvalidation While caching metadata in the kernel reducesthe number of context switches to user space developers mustalso carefully invalidate replies as necessary For examplewhen a file (or dir) is deleted or renamed the correspondingcached lookup replies must be invalidated Invalidations canbe performed in user space by the relevant request handlers orin the kernel by installing their extensions before new changesare made However the former case may introduce raceconditions and produce incorrect results because all requeststo user space daemon are queued up by the FUSE driverwhereas requests to the extensions are not Cached lookupreplies can be invalidated in extensions for unlink rmdir andrename operations Similarly when attributes or permissionson a file change cached getattr replies can be invalidated insetattr extension Our design ensures race-free invalidationby executing the extensions before forwarding requests touser space daemon where the changes may be madeAdvantages over VFS caching As previously mentionedrecent optimizations added to FUSE framework leverage VFScaches to reduce user-kernel context switching For instanceby specifying non-zero entry_valid and attr_valid timeoutvalues dentries and inodes cached by the VFS from previ-ous lookup operations could be utilized to serve subsequentlookup and getattr requests respectively However theVFS offers no control to the user file system over the cacheddata For example if the file system is mounted withoutthe default_permissions parameter VFS caching of inodesintroduces a security bug [21] This is because the cached per-missions are only checked for first accessing user In contrastwith EXTFUSE developers can define their own metadatacaches and install custom code to manage them For instanceextensions can perform uid-based access permission checksbefore serving requests from the caches to obviate the afore-mentioned security issue (Figure 10)

Additionally unlike VFS caches that are only reactiveEXTFUSE enables proactive caching For example since areaddir request is expected after an opendir call the user-space daemon could proactively cache directory entries in the

USENIX Association 2019 USENIX Annual Technical Conference 127

Metadata Map Key Map Value Caching Operations Serving Extensions Invalidation Operations

Inode ltnodeID namegt fuse_entry_param lookup create mkdir mknod lookup unlink rmdir renameAttrs ltnodeIDgt fuse_attr_out getattr lookup getattr setattr unlink rmdirSymlink ltnodeIDgt link path symlink readlink readlink unlinkDentry ltnodeIDgt fuse_dirent opendir readdir readdir releasedir unlink rmdir renameXAttrs ltnodeID labelgt xattr value open getxattr listxattr getattr listxattr close setxattr removexattr

Table 4 Metadata can be cached in the kernel using eBPF maps by the user-space daemon and served by kernel extensions

kernel by inserting them in a BPF map while serving opendirrequests to reduce transitions to user space Alternativelysimilar to read-ahead optimization proactive caching of sub-sequent directory entries could be performed during the firstreaddir call to the user-space daemon Memory occupied bycached entries could then be freed by the releasedir handlerin user space that deletes them from the map Similarly se-curity labels on a file could be cached during the open call touser space and served in the kernel by getxattr extensionsNonetheless since eBPF maps consume kernel memory de-velopers must carefully manage caches and limit the numberof map entries to keep memory usage under check

52 Passthrough IO for stacking functionality

Many user file systems are stackable with a thin layer offunctionality that does not require complex processing inthe user-space For example LoggedFS [16] filters requeststhat must be logged logs them as needed and then simplyforwards them to the lower file system User-space unionfile systems such as MergerFS [20] determine the backendhost file in open and redirects IO requests to it BindFS [35]mirrors another mount point with custom permissions checksAndroid sdcard daemon performs access permission checksand emulates the case-insensitive behavior of FAT only inmetadata operations (eg lookup open etc) but forwardsdata IO requests directly to the lower file system For suchsimple cases the FUSE API proves to be too low-level andincurs unnecessarily high overhead due to context switching

With EXTFUSE readwrite IO requests can take the fastpath and directly be forwarded to the lower (host) file systemwithout incurring any context-switching if the complex slow-path user-space logic is not needed Figure 4 shows howthe user-space daemon can install the lower file descriptorin InodeMap while handling open() system call for notifyingthe EXTFUSE driver to store a reference to the lower inodekernel object With the custom_filtering_logic(path) con-dition this can be done selectively for example if accesspermission checks pass in Android sdcard daemon Simi-larly BindFS and MergerFS can adopt EXTFUSE to availpassthrough optimization The readwrite kernel extensionscan check in InodeMap to detect whether the target file is setupfor passthrough access If found EXTFUSE driver can be in-structed with a special return code to directly forward the IOrequest to the lower file system with the corresponding lowerinode object as parameter Figure 5 shows a template read

1 void handle_open(fuse_req_t req fuse_ino_t ino2 const struct fuse_open_in in) 3 file represented by ino inode num 4 struct fuse_open_out out char path[PATH_MAX]5 int len fd = open_file(ino in-gtflags path ampout)6 if (fd gt 0 ampamp custom_filtering_logic(path)) 7 + install fd in inode map for pasthru 8 + imap_key_t key = out-gtfh9 + imap_val_t val = fd lower fd

10 + extfuse_insert_imap(ampkey ampval)11

Figure 4 FUSE daemon open handler in user space WithEXTFUSE lines 7-9 (+) enable passthrough access on the file1 int read_extension(extfuse_req_t req fuse_ino_t ino2 const struct fuse_read_in in) 3 lookup in inode map passthrough if exists 4 imap_key_t key = in-gtfh5 if (extfuse_lookup_imap(ampkey)) return UPCALL6 EXAMPLE LoggedFS log operation 7 log_op(req ino FUSE_READ in sizeof(in))8 return PASSTHRU forward req to lower FS 9

Figure 5 The EXTFUSE read kernel extension returns PASSTHRU toforward request directly to the lower file system Custom thin func-tionality could further be pushed in the kernel LoggedFS loggingfunction is shown as an example (see Figure 11)

extension Kernel extensions can include additional logic orchecks before returning For instance LoggedFS readwriteextensions can filter and log operations as needed sect62

6 EvaluationTo evaluate EXTFUSE we answer the following questionsbull Baseline Performance How does an EXTFUSE imple-

mentation of a file system perform when compared to itsin-kernel and FUSE implementations (sect61)

bull Use cases What kind of existing FUSE file systems canbenefit from EXTFUSE and what performance improve-ments can they expect ( sect62)

61 PerformanceTo measure the baseline performance of EXTFUSE weadopted the simple no-ops (null) stackable FUSE file sys-tem called Stackfs [50] This user-space daemon serves allrequests by forwarding them to the host (lower) file system Itincludes all recent FUSE optimizations (Table 5) We evaluateStackfs under all possible EXTFUSE configs listed in Table 5Each config represents a particular level of performance thatcould potentially be achieved for example by caching meta-data in the kernel or directly passing readwrite requeststhrough the host file system for stacking functionality To putour results in context we compare our results with EXT4 and

128 2019 USENIX Annual Technical Conference USENIX Association

Figure 6 Throughput(opssec) for EXT4 and FUSEEXTFUSE Stackfs (w xattr) file systems under different configs (Table 5) as measuredby Random Read(RR)Write(RW) Sequential Read(SR)Write(SW) Filebench [48] data micro-workloads with IO Sizes between 4KB-1MBand settings Nth N threads Nf N files We use the same workloads as in [50]

Figure 7 Number of file system request received by the daemon inFUSEEXTFUSE Stackfs (w xattr) under workloads in Figure 6Only a few relevant request types are shown

Figure 8 Throughput(Opssec) for EXT4 and FUSEEXTFUSEStackfs (w xattr) under different configs (Table 5) as measuredby Filebench [48] Creation(C) Deletion(D) Reading(R) metadatamicro-workloads on 4KB files and FileServer(F) WebServer(W)macro-workloads with settings NthN threads NfN files

the optimized FUSE implementation of Stackfs (Opt)Testbed We use the same experiments and settings as in [50]Specifically we used EXT4 because of its popularity as thehost file system and ran benchmarks to evaluate Howeversince FUSE performance problems were reported to be moreprominent with a faster storage medium we only carry outour experiments with SSDs We used a Samsung 850 EVO250GB SSD installed on an Asus machine with Intel Quad-Core i5-3550 33 GHz and 16GB RAM running Ubuntu16043 Further to minimize any variability we formatted the

Config File System Optimizations

Opt [50] FUSE 128K Writes Splice WBCache MltThrdMDOpt EXTFUSE Opt + Caches lookup attrs xattrsAllOpt EXTFUSE MDOpt + Pass RW reqs through host FS

Table 5 Different Stackfs configs evaluated

SSD before each experiment and disabled EXT4 lazy inodeinitialization To evaluate file systems that implement xattroperations for handling security labels (eg in Android) ourimplementation of Opt supports xattrs and thus differs fromthe implementation in [50]Workloads Our workload consists of Filebench [48] microand synthetic macro benchmarks to test each config withmetadata- and data-heavy operations across a wide range ofIO sizes and parallelism settings We measure the low-levelthroughput (opssec) Our macro-benchmarks consist of asynthetic file server and web serverMicro Results Figure 6 shows the results of micro workloadunder different configs listed in Table 5

Reads Due to the default 128KB read-ahead feature ofFUSE the sequential read throughput on a single file with asingle thread for all IO sizes and under all Stackfs configsremained the same Multi-threading improved for the sequen-tial read benchmark with 32 threads and 32 files Only onerequest was generated per thread for lookup and getattr op-erations Hence metadata caching in MDOpt was not effectiveSince FUSE Opt performance is already at par with EXT4the passthrough feature in AllOpt was not utilized

Unlike sequential reads small random reads could not takeadvantage of the read-ahead feature of FUSE Additionally4KB reads are not spliced and incur data copying across user-kernel boundary With 32 threads operating on a single filethe throughput improves due to multi-threading in Opt How-ever degradation is observed with 4KB reads AllOpt passesall reads through EXT4 and hence offers near-native through-put In some cases the performance was slightly better than

USENIX Association 2019 USENIX Annual Technical Conference 129

EXT4 We believe that this minor improvement is due todouble caching at the VFS layer Due to a single request perthread for metadata operations no improvement was seenwith EXTFUSE metadata caching

Writes During sequential writes the 128K big writes andwriteback caching in Opt allow the FUSE driver to batch smallwrites (up to 128KB) together in the page cache to offer ahigher throughput However random writes are not batchedAs a result more write requests are delivered to user spacewhich negatively affects the throughput Multiple threads on asingle file perform better for requests bigger than 4KB as theyare spliced With EXTFUSE AllOpt all writes are passedthrough the EXT4 file system to offer improved performance

Write throughput degrades severely for FUSE file systemsthat support extended attributes because the VFS issues agetxattr request before every write Small IO requestsperform worse as they require more write which generatemore getxattr requests Opt random writes generated 30xfewer getxattr requests for 128KB compared to 4KB writesresulting in a 23 decrease in the throughput of 4KB writes

In contrast MDOpt caches the getxattr reply in the kernelupon the first call and serves subsequent getxattr requestswithout incurring further transitions to user space Figure 7compares the number of requests received by the user-spacedaemon in Opt and MDOpt Caching replies reduced the over-head for 4KB workload to less than 5 Similar behavior wasobserved with both sequential writes and random writesMacro Results Figure 8 shows the results of macro-workloads and synthetic server workloads emulated usingFilebench under various configs Neither of the EXTFUSEconfigs offer improvements over FUSE Opt under creationand deletion workloads as these metadata-heavy workloadscreated and deleted a number of files respectively This isbecause no metadata caching could be utilized by MDOpt Sim-ilarly no passthrough writes were utilized with AllOpt since4KB files were created and closed in user space In contrastthe File and Web server workloads under EXTFUSE utilizedboth metadata caching and passthrough access features andimproved performance We saw a 47 89 and 100 dropin lookup getattr and getattr requests to user space un-der MDOpt respectively when configured to cache up to 64Kfor each type of request AllOpt further enabled passthroughreadwrite requests to offer near native throughput for bothmacro reads and server workloadsReal Workload We also evaluated EXTFUSE with two realworkloads namely kernel decompression and compilation of4180 Linux kernel We created three separate caches forhosting lookup getattr and getxattr replies Each cachecould host up to 64K entries resulting in allocation of up to atotal of 50MB memory when fully populated

The kernel compilation make tinyconfig make -j4 ex-periment on our test machine (see sect6) reported a 52 drop incompilation time from 3974 secs under FUSE Opt to 3768secs with EXTFUSE MDOpt compared to 3091 secs with

EXT4 This was due to over 75 99 and 100 decreasein lookup getattr and getxattr requests to user spacerespectively (Figure 9) getxattr replies were proactivelycached while handling open requests thus no transitions touser space were observed for serving xattr requests WithEXTFUSE AllOpt the compilation time further dropped to3364 secs because of 100 reduction in read and writerequests to user space

In contrast the kernel decompression tar xf experimentreported a 635 drop in the completion time from 1102secs under FUSE Opt to 1032 secs with EXTFUSE MDOptcompared to 527 secs with EXT4 With EXTFUSE AllOptthe decompression time further dropped to 867 secs due to100 reduction in read and write requests to user spaceas shown in Figure 9 Nevertheless reducing the numberof cached entries for metadata requests to 4K resulted in adecompression time of 1087 secs (253 increase) due to3555 more getattr requests to user space This suggeststhat developers must efficiently manage caches

Figure 9 Linux kernel 4180 untar (decompress) and compilationtime taken with StackFS under FUSE and EXTFUSE settings Num-ber of metadata and IO requests are reduced with EXTFUSE

62 Use casesWe ported four real-world stackable FUSE file systemsnamely LoggedFS Android sdcard daemon MergerFSand BindFS to EXTFUSE and enabled both metadatacaching sect51 and passthrough IO sect52 optimizations

File System Functionality Ext Loc

StackFS [50] No-ops File System 664BindFS [35] Mirroring File System 792Android sdcard [24] Perm checks amp FAT Emu 928MergerFS [20] Union File System 686LoggedFS [16] Logging File System 748

Table 6 Lines of code (Loc) of kernel extensions required to adoptEXTFUSE for existing FUSE file systems We added support formetadata caching as well as RW passthrough

As EXTFUSE allows file systems to retain their exist-ing FUSE daemon code as the default slow path adoptingEXTFUSE for real-world file systems is easy On averagewe made less than 100 lines of changes to the existing FUSEcode to invoke EXTFUSE helper library functions for ma-nipulating kernel extensions including maps We added ker-

130 2019 USENIX Annual Technical Conference USENIX Association

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 7: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

ister extensions for a few or all of the file system APIs of-fering them flexibility to implement their functionality withno additional development burden The extensions receivethe same request parameters (struct fuse_[inout]) as theuser-space daemon This design choice not only conformsto the principle of least privilege but also offers the user-space daemon and the extensions the same interface for easyportability

For hostingsharing data between the user daemon andkernel extensions libExtFUSE provides a secure variant ofeBPF HashMap keyvalue data structure called SHashMap thatstores arbitrary keyvalue blobs Unlike regular eBPF mapsthat are either accessible to all user processes or only toCAP_SYS_ADMIN processes SHashMap is only accessible bythe unprivileged daemon that creates it libExtFUSE furtherabstracts low-level details of SHashMap and provides high-level CRUD APIs to create read update and delete entries(keyvalue pairs)

EXTFUSE also provides a special InodeMap to enablepassthrough IO feature for stackable EXTFUSE file sys-tems (sect52) Unlike SHashMap that stores arbitrary entriesInodeMap takes open file handle as key and stores a pointer tothe corresponding lower (host) inode as value Furthermoreto prevent leakage of inode object to user space the InodeMapvalues can only be read by the EXTFUSE driver

36 WorkflowTo understand how EXTFUSE facilitates implementation ofextensible user file systems we describe its workflow in de-tail Upon mounting the user file system FUSE driver sendsFUSE_INIT request to the user-space daemon At this pointthe user daemon checks if the OS kernel supports EXTFUSEframework by looking for FUSE_CAP_ExtFUSE flag in the re-quest parameters If supported the daemon must invokelibExtFUSE init API to load the eBPF program that containsspecialized handlers (extensions) into the kernel and regis-ter them with the EXTFUSE driver This is achieved usingbpf_load_prog system call which invokes eBPF verifier tocheck the integrity of the extensions If failed the programis discarded and the user-space daemon is notified of the er-rors The daemon can then either exit or continue with defaultFUSE functionality If the verification step succeeds and theJIT engine is enabled the extensions are processed by theJIT compiler to generate machine assembly code ready forexecution as needed

Extensions are installed in a bpf_prog_type map (calledextension map) which serves effectively as a jump tableTo invoke an extension the FUSE driver simply executesa bpf_tail_call (far jump) with the FUSE operation code(eg FUSE_OPEN) as an index into the extension map Oncethe eBPF program is loaded the daemon must informEXTFUSE driver about the kernel extensions by replyingto FUSE_INIT containing identifiers to the extension map

Once notified EXTFUSE can safely load and execute the

Component Version Loc Modified Loc New

FUSE kernel driver 4110 312 874FUSE user-space library 320 23 84EXTFUSE user-space library - - 581

Table 3 Changes made to the existing Linux FUSE framework tosupport EXTFUSE functionality

extensions at runtime under the eBPF VM environment Ev-ery request is first delivered to the fast path which may decideto 1) serve it (eg using data shared between the fast andslow paths) 2) pass the request through to the lower file sys-tem (eg after modifying parameters or performing accesschecks) or 3) take the slow path and deliver the request to userspace for complex processing logic (eg data encryption)as needed Since the execution path is chosen per-requestindependently and the fast path is always invoked first thekernel extensions and user daemon can work in concert andsynchronize access to requests and shared data structures Itis important to note that the EXTFUSE driver only acts as athin interposition layer between the FUSE driver and kernelextensions and in some cases between the FUSE driver andthe lower file system As such it does not perform any IOoperation or attempts to serve requests on its own

4 Implementation

To implement EXTFUSE we provided eBPF support forFUSE Specifically we added additional kernel helper func-tions and designed two new map types to support secure com-munication between the user-space daemon and kernel exten-sions as well as support for passthrough access in readwriteWe modified FUSE driver to first invoke registered eBPF han-dlers (extensions) Passthrough implementation is adoptedfrom WrapFS [54] a wrapper stackable in-kernel file systemSpecifically we modified FUSE driver to pass IO requestsdirectly to the lower file system

Since with EXTFUSE developers can install extensions tobypass the user-space daemon and pass IO requests directlyto the lower file system a malicious process could stack anumber of EXTFUSE file systems on top of each other andcause the kernel stack to overflow To guard against suchattacks we limit the number of EXTFUSE layers that couldbe stacked on a mount point We rely on s_stack_depthfield in the super-block to track the number of stacked layersand check it against FILESYSTEM_MAX_STACK_DEPTH whichwe limit to two Table 3 reports the number of lines of codefor EXTFUSE We also modified libfuse to allow apps toregister kernel extensions

5 Optimizations

Here we describe a set of optimizations that can be enabledby leveraging custom kernel extensions in EXTFUSE to im-plement in-kernel handling of file system requests

126 2019 USENIX Annual Technical Conference USENIX Association

1 void handle_lookup(fuse_req_t req fuse_ino_t pino2 const char name) 3 lookup or create node cname parent pino 4 struct fuse_entry_param e5 if (find_or_create_node(req pino name ampe)) return6 + lookup_key_t key = pino name7 + lookup_val_t val = 0not stale ampe8 + extfuse_insert_shmap(ampkey ampval) cache this entry 9 fuse_reply_entry(req ampe)

10

Figure 2 FUSE daemon lookup handler in user space WithEXTFUSE lines 6-8 (+) enable caching replies in the kernel

51 Customized in-kernel metadata cachingMetadata operations such as lookup and getattr are fre-quently issued and thus form high sources of latency in FUSEfile systems [50] Unlike VFS caches that are only reactiveand fixed in functionality EXTFUSE can be leveraged toproactively cache metadata replies in the kernel Kernel ex-tensions can be installed to manage and serve subsequentoperations from caches without switching to user spaceExample To illustrate let us consider the lookup operationIt is the most common operation issued internally by the VFSfor serving open() stat() and unlink() system calls Eachcomponent of the input path string is searched using lookupto fetch the corresponding inode data structure Figure 2lists code fragment for FUSE daemon handler that serveslookup requests in user space (slow path) The FUSE lookupAPI takes two input parameters the parent node ID andthe next path component name The node ID is a 64-bitinteger that uniquely identifies the parent inode The daemonhandler function traverses the parent directory searching forthe child entry corresponding to the next path componentUpon successful search it populates the fuse_entry_paramdata structure with the node ID and attributes (eg size) ofthe child and sends it to the FUSE driver which creates anew inode for the dentry object representing the child entry

With EXTFUSE developers could define a SHashMap thathosts fuse_entry_param replies in the kernel (lines 7-10) Acomposite key generated from the parent node identifier andthe next path component string arguments is used as an indexinto the map for inserting corresponding replies Since themap is also accessible to the extensions in the kernel sub-sequent requests could be served from the map by installingthe EXTFUSE lookup extension (fast path) Figure 3 listsits code fragment The extension uses the same compositekey as an index into the hash map to search whether the cor-responding fuse_entry_param entry exists If a valid entryis found the reference count (nlookup) is incremented and areply is sent to the FUSE driver

Similarly replies from user space daemon for other meta-data operations such as getattr getxattr and readlinkcould be cached using maps and served in the kernel by re-spective extensions (Table 4) Network FUSE file systemssuch as SshFS [39] and Gluster [40] already perform aggres-sive metadata caching and batching at client to reduce thenumber of remote calls to the server SshFS [39] for example

1 int lookup_extension(extfuse_req_t req fuse_ino_t pino2 const char name) 3 lookup in map bail out if not cached or stale 4 lookup_key_t key = pino name5 lookup_val_t val = extfuse_lookup_shmap(ampkey)6 if (val || atomic_read(ampval-gtstale)) return UPCALL7 EXAMPLE Android sdcard daemon perm check 8 if (check_caller_access(pino name)) return -EACCES9 populate output incr count (used in FUSE_FORGET)

10 extfuse_reply_entry(req ampval-gte)11 atomic_incr(ampval-gtnlookup 1)12 return SUCCESS13

Figure 3 EXTFUSE lookup kernel extension that serves validcached replies without incurring any context switches Customizedchecks could further be included Android sdcard daemon permis-sion check is shown as an example (see Figure 10)

implements its own directory attribute and symlink cachesWith EXTFUSE such caches could be implemented in thekernel for further performance gainsInvalidation While caching metadata in the kernel reducesthe number of context switches to user space developers mustalso carefully invalidate replies as necessary For examplewhen a file (or dir) is deleted or renamed the correspondingcached lookup replies must be invalidated Invalidations canbe performed in user space by the relevant request handlers orin the kernel by installing their extensions before new changesare made However the former case may introduce raceconditions and produce incorrect results because all requeststo user space daemon are queued up by the FUSE driverwhereas requests to the extensions are not Cached lookupreplies can be invalidated in extensions for unlink rmdir andrename operations Similarly when attributes or permissionson a file change cached getattr replies can be invalidated insetattr extension Our design ensures race-free invalidationby executing the extensions before forwarding requests touser space daemon where the changes may be madeAdvantages over VFS caching As previously mentionedrecent optimizations added to FUSE framework leverage VFScaches to reduce user-kernel context switching For instanceby specifying non-zero entry_valid and attr_valid timeoutvalues dentries and inodes cached by the VFS from previ-ous lookup operations could be utilized to serve subsequentlookup and getattr requests respectively However theVFS offers no control to the user file system over the cacheddata For example if the file system is mounted withoutthe default_permissions parameter VFS caching of inodesintroduces a security bug [21] This is because the cached per-missions are only checked for first accessing user In contrastwith EXTFUSE developers can define their own metadatacaches and install custom code to manage them For instanceextensions can perform uid-based access permission checksbefore serving requests from the caches to obviate the afore-mentioned security issue (Figure 10)

Additionally unlike VFS caches that are only reactiveEXTFUSE enables proactive caching For example since areaddir request is expected after an opendir call the user-space daemon could proactively cache directory entries in the

USENIX Association 2019 USENIX Annual Technical Conference 127

Metadata Map Key Map Value Caching Operations Serving Extensions Invalidation Operations

Inode ltnodeID namegt fuse_entry_param lookup create mkdir mknod lookup unlink rmdir renameAttrs ltnodeIDgt fuse_attr_out getattr lookup getattr setattr unlink rmdirSymlink ltnodeIDgt link path symlink readlink readlink unlinkDentry ltnodeIDgt fuse_dirent opendir readdir readdir releasedir unlink rmdir renameXAttrs ltnodeID labelgt xattr value open getxattr listxattr getattr listxattr close setxattr removexattr

Table 4 Metadata can be cached in the kernel using eBPF maps by the user-space daemon and served by kernel extensions

kernel by inserting them in a BPF map while serving opendirrequests to reduce transitions to user space Alternativelysimilar to read-ahead optimization proactive caching of sub-sequent directory entries could be performed during the firstreaddir call to the user-space daemon Memory occupied bycached entries could then be freed by the releasedir handlerin user space that deletes them from the map Similarly se-curity labels on a file could be cached during the open call touser space and served in the kernel by getxattr extensionsNonetheless since eBPF maps consume kernel memory de-velopers must carefully manage caches and limit the numberof map entries to keep memory usage under check

52 Passthrough IO for stacking functionality

Many user file systems are stackable with a thin layer offunctionality that does not require complex processing inthe user-space For example LoggedFS [16] filters requeststhat must be logged logs them as needed and then simplyforwards them to the lower file system User-space unionfile systems such as MergerFS [20] determine the backendhost file in open and redirects IO requests to it BindFS [35]mirrors another mount point with custom permissions checksAndroid sdcard daemon performs access permission checksand emulates the case-insensitive behavior of FAT only inmetadata operations (eg lookup open etc) but forwardsdata IO requests directly to the lower file system For suchsimple cases the FUSE API proves to be too low-level andincurs unnecessarily high overhead due to context switching

With EXTFUSE readwrite IO requests can take the fastpath and directly be forwarded to the lower (host) file systemwithout incurring any context-switching if the complex slow-path user-space logic is not needed Figure 4 shows howthe user-space daemon can install the lower file descriptorin InodeMap while handling open() system call for notifyingthe EXTFUSE driver to store a reference to the lower inodekernel object With the custom_filtering_logic(path) con-dition this can be done selectively for example if accesspermission checks pass in Android sdcard daemon Simi-larly BindFS and MergerFS can adopt EXTFUSE to availpassthrough optimization The readwrite kernel extensionscan check in InodeMap to detect whether the target file is setupfor passthrough access If found EXTFUSE driver can be in-structed with a special return code to directly forward the IOrequest to the lower file system with the corresponding lowerinode object as parameter Figure 5 shows a template read

1 void handle_open(fuse_req_t req fuse_ino_t ino2 const struct fuse_open_in in) 3 file represented by ino inode num 4 struct fuse_open_out out char path[PATH_MAX]5 int len fd = open_file(ino in-gtflags path ampout)6 if (fd gt 0 ampamp custom_filtering_logic(path)) 7 + install fd in inode map for pasthru 8 + imap_key_t key = out-gtfh9 + imap_val_t val = fd lower fd

10 + extfuse_insert_imap(ampkey ampval)11

Figure 4 FUSE daemon open handler in user space WithEXTFUSE lines 7-9 (+) enable passthrough access on the file1 int read_extension(extfuse_req_t req fuse_ino_t ino2 const struct fuse_read_in in) 3 lookup in inode map passthrough if exists 4 imap_key_t key = in-gtfh5 if (extfuse_lookup_imap(ampkey)) return UPCALL6 EXAMPLE LoggedFS log operation 7 log_op(req ino FUSE_READ in sizeof(in))8 return PASSTHRU forward req to lower FS 9

Figure 5 The EXTFUSE read kernel extension returns PASSTHRU toforward request directly to the lower file system Custom thin func-tionality could further be pushed in the kernel LoggedFS loggingfunction is shown as an example (see Figure 11)

extension Kernel extensions can include additional logic orchecks before returning For instance LoggedFS readwriteextensions can filter and log operations as needed sect62

6 EvaluationTo evaluate EXTFUSE we answer the following questionsbull Baseline Performance How does an EXTFUSE imple-

mentation of a file system perform when compared to itsin-kernel and FUSE implementations (sect61)

bull Use cases What kind of existing FUSE file systems canbenefit from EXTFUSE and what performance improve-ments can they expect ( sect62)

61 PerformanceTo measure the baseline performance of EXTFUSE weadopted the simple no-ops (null) stackable FUSE file sys-tem called Stackfs [50] This user-space daemon serves allrequests by forwarding them to the host (lower) file system Itincludes all recent FUSE optimizations (Table 5) We evaluateStackfs under all possible EXTFUSE configs listed in Table 5Each config represents a particular level of performance thatcould potentially be achieved for example by caching meta-data in the kernel or directly passing readwrite requeststhrough the host file system for stacking functionality To putour results in context we compare our results with EXT4 and

128 2019 USENIX Annual Technical Conference USENIX Association

Figure 6 Throughput(opssec) for EXT4 and FUSEEXTFUSE Stackfs (w xattr) file systems under different configs (Table 5) as measuredby Random Read(RR)Write(RW) Sequential Read(SR)Write(SW) Filebench [48] data micro-workloads with IO Sizes between 4KB-1MBand settings Nth N threads Nf N files We use the same workloads as in [50]

Figure 7 Number of file system request received by the daemon inFUSEEXTFUSE Stackfs (w xattr) under workloads in Figure 6Only a few relevant request types are shown

Figure 8 Throughput(Opssec) for EXT4 and FUSEEXTFUSEStackfs (w xattr) under different configs (Table 5) as measuredby Filebench [48] Creation(C) Deletion(D) Reading(R) metadatamicro-workloads on 4KB files and FileServer(F) WebServer(W)macro-workloads with settings NthN threads NfN files

the optimized FUSE implementation of Stackfs (Opt)Testbed We use the same experiments and settings as in [50]Specifically we used EXT4 because of its popularity as thehost file system and ran benchmarks to evaluate Howeversince FUSE performance problems were reported to be moreprominent with a faster storage medium we only carry outour experiments with SSDs We used a Samsung 850 EVO250GB SSD installed on an Asus machine with Intel Quad-Core i5-3550 33 GHz and 16GB RAM running Ubuntu16043 Further to minimize any variability we formatted the

Config File System Optimizations

Opt [50] FUSE 128K Writes Splice WBCache MltThrdMDOpt EXTFUSE Opt + Caches lookup attrs xattrsAllOpt EXTFUSE MDOpt + Pass RW reqs through host FS

Table 5 Different Stackfs configs evaluated

SSD before each experiment and disabled EXT4 lazy inodeinitialization To evaluate file systems that implement xattroperations for handling security labels (eg in Android) ourimplementation of Opt supports xattrs and thus differs fromthe implementation in [50]Workloads Our workload consists of Filebench [48] microand synthetic macro benchmarks to test each config withmetadata- and data-heavy operations across a wide range ofIO sizes and parallelism settings We measure the low-levelthroughput (opssec) Our macro-benchmarks consist of asynthetic file server and web serverMicro Results Figure 6 shows the results of micro workloadunder different configs listed in Table 5

Reads Due to the default 128KB read-ahead feature ofFUSE the sequential read throughput on a single file with asingle thread for all IO sizes and under all Stackfs configsremained the same Multi-threading improved for the sequen-tial read benchmark with 32 threads and 32 files Only onerequest was generated per thread for lookup and getattr op-erations Hence metadata caching in MDOpt was not effectiveSince FUSE Opt performance is already at par with EXT4the passthrough feature in AllOpt was not utilized

Unlike sequential reads small random reads could not takeadvantage of the read-ahead feature of FUSE Additionally4KB reads are not spliced and incur data copying across user-kernel boundary With 32 threads operating on a single filethe throughput improves due to multi-threading in Opt How-ever degradation is observed with 4KB reads AllOpt passesall reads through EXT4 and hence offers near-native through-put In some cases the performance was slightly better than

USENIX Association 2019 USENIX Annual Technical Conference 129

EXT4 We believe that this minor improvement is due todouble caching at the VFS layer Due to a single request perthread for metadata operations no improvement was seenwith EXTFUSE metadata caching

Writes During sequential writes the 128K big writes andwriteback caching in Opt allow the FUSE driver to batch smallwrites (up to 128KB) together in the page cache to offer ahigher throughput However random writes are not batchedAs a result more write requests are delivered to user spacewhich negatively affects the throughput Multiple threads on asingle file perform better for requests bigger than 4KB as theyare spliced With EXTFUSE AllOpt all writes are passedthrough the EXT4 file system to offer improved performance

Write throughput degrades severely for FUSE file systemsthat support extended attributes because the VFS issues agetxattr request before every write Small IO requestsperform worse as they require more write which generatemore getxattr requests Opt random writes generated 30xfewer getxattr requests for 128KB compared to 4KB writesresulting in a 23 decrease in the throughput of 4KB writes

In contrast MDOpt caches the getxattr reply in the kernelupon the first call and serves subsequent getxattr requestswithout incurring further transitions to user space Figure 7compares the number of requests received by the user-spacedaemon in Opt and MDOpt Caching replies reduced the over-head for 4KB workload to less than 5 Similar behavior wasobserved with both sequential writes and random writesMacro Results Figure 8 shows the results of macro-workloads and synthetic server workloads emulated usingFilebench under various configs Neither of the EXTFUSEconfigs offer improvements over FUSE Opt under creationand deletion workloads as these metadata-heavy workloadscreated and deleted a number of files respectively This isbecause no metadata caching could be utilized by MDOpt Sim-ilarly no passthrough writes were utilized with AllOpt since4KB files were created and closed in user space In contrastthe File and Web server workloads under EXTFUSE utilizedboth metadata caching and passthrough access features andimproved performance We saw a 47 89 and 100 dropin lookup getattr and getattr requests to user space un-der MDOpt respectively when configured to cache up to 64Kfor each type of request AllOpt further enabled passthroughreadwrite requests to offer near native throughput for bothmacro reads and server workloadsReal Workload We also evaluated EXTFUSE with two realworkloads namely kernel decompression and compilation of4180 Linux kernel We created three separate caches forhosting lookup getattr and getxattr replies Each cachecould host up to 64K entries resulting in allocation of up to atotal of 50MB memory when fully populated

The kernel compilation make tinyconfig make -j4 ex-periment on our test machine (see sect6) reported a 52 drop incompilation time from 3974 secs under FUSE Opt to 3768secs with EXTFUSE MDOpt compared to 3091 secs with

EXT4 This was due to over 75 99 and 100 decreasein lookup getattr and getxattr requests to user spacerespectively (Figure 9) getxattr replies were proactivelycached while handling open requests thus no transitions touser space were observed for serving xattr requests WithEXTFUSE AllOpt the compilation time further dropped to3364 secs because of 100 reduction in read and writerequests to user space

In contrast the kernel decompression tar xf experimentreported a 635 drop in the completion time from 1102secs under FUSE Opt to 1032 secs with EXTFUSE MDOptcompared to 527 secs with EXT4 With EXTFUSE AllOptthe decompression time further dropped to 867 secs due to100 reduction in read and write requests to user spaceas shown in Figure 9 Nevertheless reducing the numberof cached entries for metadata requests to 4K resulted in adecompression time of 1087 secs (253 increase) due to3555 more getattr requests to user space This suggeststhat developers must efficiently manage caches

Figure 9 Linux kernel 4180 untar (decompress) and compilationtime taken with StackFS under FUSE and EXTFUSE settings Num-ber of metadata and IO requests are reduced with EXTFUSE

62 Use casesWe ported four real-world stackable FUSE file systemsnamely LoggedFS Android sdcard daemon MergerFSand BindFS to EXTFUSE and enabled both metadatacaching sect51 and passthrough IO sect52 optimizations

File System Functionality Ext Loc

StackFS [50] No-ops File System 664BindFS [35] Mirroring File System 792Android sdcard [24] Perm checks amp FAT Emu 928MergerFS [20] Union File System 686LoggedFS [16] Logging File System 748

Table 6 Lines of code (Loc) of kernel extensions required to adoptEXTFUSE for existing FUSE file systems We added support formetadata caching as well as RW passthrough

As EXTFUSE allows file systems to retain their exist-ing FUSE daemon code as the default slow path adoptingEXTFUSE for real-world file systems is easy On averagewe made less than 100 lines of changes to the existing FUSEcode to invoke EXTFUSE helper library functions for ma-nipulating kernel extensions including maps We added ker-

130 2019 USENIX Annual Technical Conference USENIX Association

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 8: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

1 void handle_lookup(fuse_req_t req fuse_ino_t pino2 const char name) 3 lookup or create node cname parent pino 4 struct fuse_entry_param e5 if (find_or_create_node(req pino name ampe)) return6 + lookup_key_t key = pino name7 + lookup_val_t val = 0not stale ampe8 + extfuse_insert_shmap(ampkey ampval) cache this entry 9 fuse_reply_entry(req ampe)

10

Figure 2 FUSE daemon lookup handler in user space WithEXTFUSE lines 6-8 (+) enable caching replies in the kernel

51 Customized in-kernel metadata cachingMetadata operations such as lookup and getattr are fre-quently issued and thus form high sources of latency in FUSEfile systems [50] Unlike VFS caches that are only reactiveand fixed in functionality EXTFUSE can be leveraged toproactively cache metadata replies in the kernel Kernel ex-tensions can be installed to manage and serve subsequentoperations from caches without switching to user spaceExample To illustrate let us consider the lookup operationIt is the most common operation issued internally by the VFSfor serving open() stat() and unlink() system calls Eachcomponent of the input path string is searched using lookupto fetch the corresponding inode data structure Figure 2lists code fragment for FUSE daemon handler that serveslookup requests in user space (slow path) The FUSE lookupAPI takes two input parameters the parent node ID andthe next path component name The node ID is a 64-bitinteger that uniquely identifies the parent inode The daemonhandler function traverses the parent directory searching forthe child entry corresponding to the next path componentUpon successful search it populates the fuse_entry_paramdata structure with the node ID and attributes (eg size) ofthe child and sends it to the FUSE driver which creates anew inode for the dentry object representing the child entry

With EXTFUSE developers could define a SHashMap thathosts fuse_entry_param replies in the kernel (lines 7-10) Acomposite key generated from the parent node identifier andthe next path component string arguments is used as an indexinto the map for inserting corresponding replies Since themap is also accessible to the extensions in the kernel sub-sequent requests could be served from the map by installingthe EXTFUSE lookup extension (fast path) Figure 3 listsits code fragment The extension uses the same compositekey as an index into the hash map to search whether the cor-responding fuse_entry_param entry exists If a valid entryis found the reference count (nlookup) is incremented and areply is sent to the FUSE driver

Similarly replies from user space daemon for other meta-data operations such as getattr getxattr and readlinkcould be cached using maps and served in the kernel by re-spective extensions (Table 4) Network FUSE file systemssuch as SshFS [39] and Gluster [40] already perform aggres-sive metadata caching and batching at client to reduce thenumber of remote calls to the server SshFS [39] for example

1 int lookup_extension(extfuse_req_t req fuse_ino_t pino2 const char name) 3 lookup in map bail out if not cached or stale 4 lookup_key_t key = pino name5 lookup_val_t val = extfuse_lookup_shmap(ampkey)6 if (val || atomic_read(ampval-gtstale)) return UPCALL7 EXAMPLE Android sdcard daemon perm check 8 if (check_caller_access(pino name)) return -EACCES9 populate output incr count (used in FUSE_FORGET)

10 extfuse_reply_entry(req ampval-gte)11 atomic_incr(ampval-gtnlookup 1)12 return SUCCESS13

Figure 3 EXTFUSE lookup kernel extension that serves validcached replies without incurring any context switches Customizedchecks could further be included Android sdcard daemon permis-sion check is shown as an example (see Figure 10)

implements its own directory attribute and symlink cachesWith EXTFUSE such caches could be implemented in thekernel for further performance gainsInvalidation While caching metadata in the kernel reducesthe number of context switches to user space developers mustalso carefully invalidate replies as necessary For examplewhen a file (or dir) is deleted or renamed the correspondingcached lookup replies must be invalidated Invalidations canbe performed in user space by the relevant request handlers orin the kernel by installing their extensions before new changesare made However the former case may introduce raceconditions and produce incorrect results because all requeststo user space daemon are queued up by the FUSE driverwhereas requests to the extensions are not Cached lookupreplies can be invalidated in extensions for unlink rmdir andrename operations Similarly when attributes or permissionson a file change cached getattr replies can be invalidated insetattr extension Our design ensures race-free invalidationby executing the extensions before forwarding requests touser space daemon where the changes may be madeAdvantages over VFS caching As previously mentionedrecent optimizations added to FUSE framework leverage VFScaches to reduce user-kernel context switching For instanceby specifying non-zero entry_valid and attr_valid timeoutvalues dentries and inodes cached by the VFS from previ-ous lookup operations could be utilized to serve subsequentlookup and getattr requests respectively However theVFS offers no control to the user file system over the cacheddata For example if the file system is mounted withoutthe default_permissions parameter VFS caching of inodesintroduces a security bug [21] This is because the cached per-missions are only checked for first accessing user In contrastwith EXTFUSE developers can define their own metadatacaches and install custom code to manage them For instanceextensions can perform uid-based access permission checksbefore serving requests from the caches to obviate the afore-mentioned security issue (Figure 10)

Additionally unlike VFS caches that are only reactiveEXTFUSE enables proactive caching For example since areaddir request is expected after an opendir call the user-space daemon could proactively cache directory entries in the

USENIX Association 2019 USENIX Annual Technical Conference 127

Metadata Map Key Map Value Caching Operations Serving Extensions Invalidation Operations

Inode ltnodeID namegt fuse_entry_param lookup create mkdir mknod lookup unlink rmdir renameAttrs ltnodeIDgt fuse_attr_out getattr lookup getattr setattr unlink rmdirSymlink ltnodeIDgt link path symlink readlink readlink unlinkDentry ltnodeIDgt fuse_dirent opendir readdir readdir releasedir unlink rmdir renameXAttrs ltnodeID labelgt xattr value open getxattr listxattr getattr listxattr close setxattr removexattr

Table 4 Metadata can be cached in the kernel using eBPF maps by the user-space daemon and served by kernel extensions

kernel by inserting them in a BPF map while serving opendirrequests to reduce transitions to user space Alternativelysimilar to read-ahead optimization proactive caching of sub-sequent directory entries could be performed during the firstreaddir call to the user-space daemon Memory occupied bycached entries could then be freed by the releasedir handlerin user space that deletes them from the map Similarly se-curity labels on a file could be cached during the open call touser space and served in the kernel by getxattr extensionsNonetheless since eBPF maps consume kernel memory de-velopers must carefully manage caches and limit the numberof map entries to keep memory usage under check

52 Passthrough IO for stacking functionality

Many user file systems are stackable with a thin layer offunctionality that does not require complex processing inthe user-space For example LoggedFS [16] filters requeststhat must be logged logs them as needed and then simplyforwards them to the lower file system User-space unionfile systems such as MergerFS [20] determine the backendhost file in open and redirects IO requests to it BindFS [35]mirrors another mount point with custom permissions checksAndroid sdcard daemon performs access permission checksand emulates the case-insensitive behavior of FAT only inmetadata operations (eg lookup open etc) but forwardsdata IO requests directly to the lower file system For suchsimple cases the FUSE API proves to be too low-level andincurs unnecessarily high overhead due to context switching

With EXTFUSE readwrite IO requests can take the fastpath and directly be forwarded to the lower (host) file systemwithout incurring any context-switching if the complex slow-path user-space logic is not needed Figure 4 shows howthe user-space daemon can install the lower file descriptorin InodeMap while handling open() system call for notifyingthe EXTFUSE driver to store a reference to the lower inodekernel object With the custom_filtering_logic(path) con-dition this can be done selectively for example if accesspermission checks pass in Android sdcard daemon Simi-larly BindFS and MergerFS can adopt EXTFUSE to availpassthrough optimization The readwrite kernel extensionscan check in InodeMap to detect whether the target file is setupfor passthrough access If found EXTFUSE driver can be in-structed with a special return code to directly forward the IOrequest to the lower file system with the corresponding lowerinode object as parameter Figure 5 shows a template read

1 void handle_open(fuse_req_t req fuse_ino_t ino2 const struct fuse_open_in in) 3 file represented by ino inode num 4 struct fuse_open_out out char path[PATH_MAX]5 int len fd = open_file(ino in-gtflags path ampout)6 if (fd gt 0 ampamp custom_filtering_logic(path)) 7 + install fd in inode map for pasthru 8 + imap_key_t key = out-gtfh9 + imap_val_t val = fd lower fd

10 + extfuse_insert_imap(ampkey ampval)11

Figure 4 FUSE daemon open handler in user space WithEXTFUSE lines 7-9 (+) enable passthrough access on the file1 int read_extension(extfuse_req_t req fuse_ino_t ino2 const struct fuse_read_in in) 3 lookup in inode map passthrough if exists 4 imap_key_t key = in-gtfh5 if (extfuse_lookup_imap(ampkey)) return UPCALL6 EXAMPLE LoggedFS log operation 7 log_op(req ino FUSE_READ in sizeof(in))8 return PASSTHRU forward req to lower FS 9

Figure 5 The EXTFUSE read kernel extension returns PASSTHRU toforward request directly to the lower file system Custom thin func-tionality could further be pushed in the kernel LoggedFS loggingfunction is shown as an example (see Figure 11)

extension Kernel extensions can include additional logic orchecks before returning For instance LoggedFS readwriteextensions can filter and log operations as needed sect62

6 EvaluationTo evaluate EXTFUSE we answer the following questionsbull Baseline Performance How does an EXTFUSE imple-

mentation of a file system perform when compared to itsin-kernel and FUSE implementations (sect61)

bull Use cases What kind of existing FUSE file systems canbenefit from EXTFUSE and what performance improve-ments can they expect ( sect62)

61 PerformanceTo measure the baseline performance of EXTFUSE weadopted the simple no-ops (null) stackable FUSE file sys-tem called Stackfs [50] This user-space daemon serves allrequests by forwarding them to the host (lower) file system Itincludes all recent FUSE optimizations (Table 5) We evaluateStackfs under all possible EXTFUSE configs listed in Table 5Each config represents a particular level of performance thatcould potentially be achieved for example by caching meta-data in the kernel or directly passing readwrite requeststhrough the host file system for stacking functionality To putour results in context we compare our results with EXT4 and

128 2019 USENIX Annual Technical Conference USENIX Association

Figure 6 Throughput(opssec) for EXT4 and FUSEEXTFUSE Stackfs (w xattr) file systems under different configs (Table 5) as measuredby Random Read(RR)Write(RW) Sequential Read(SR)Write(SW) Filebench [48] data micro-workloads with IO Sizes between 4KB-1MBand settings Nth N threads Nf N files We use the same workloads as in [50]

Figure 7 Number of file system request received by the daemon inFUSEEXTFUSE Stackfs (w xattr) under workloads in Figure 6Only a few relevant request types are shown

Figure 8 Throughput(Opssec) for EXT4 and FUSEEXTFUSEStackfs (w xattr) under different configs (Table 5) as measuredby Filebench [48] Creation(C) Deletion(D) Reading(R) metadatamicro-workloads on 4KB files and FileServer(F) WebServer(W)macro-workloads with settings NthN threads NfN files

the optimized FUSE implementation of Stackfs (Opt)Testbed We use the same experiments and settings as in [50]Specifically we used EXT4 because of its popularity as thehost file system and ran benchmarks to evaluate Howeversince FUSE performance problems were reported to be moreprominent with a faster storage medium we only carry outour experiments with SSDs We used a Samsung 850 EVO250GB SSD installed on an Asus machine with Intel Quad-Core i5-3550 33 GHz and 16GB RAM running Ubuntu16043 Further to minimize any variability we formatted the

Config File System Optimizations

Opt [50] FUSE 128K Writes Splice WBCache MltThrdMDOpt EXTFUSE Opt + Caches lookup attrs xattrsAllOpt EXTFUSE MDOpt + Pass RW reqs through host FS

Table 5 Different Stackfs configs evaluated

SSD before each experiment and disabled EXT4 lazy inodeinitialization To evaluate file systems that implement xattroperations for handling security labels (eg in Android) ourimplementation of Opt supports xattrs and thus differs fromthe implementation in [50]Workloads Our workload consists of Filebench [48] microand synthetic macro benchmarks to test each config withmetadata- and data-heavy operations across a wide range ofIO sizes and parallelism settings We measure the low-levelthroughput (opssec) Our macro-benchmarks consist of asynthetic file server and web serverMicro Results Figure 6 shows the results of micro workloadunder different configs listed in Table 5

Reads Due to the default 128KB read-ahead feature ofFUSE the sequential read throughput on a single file with asingle thread for all IO sizes and under all Stackfs configsremained the same Multi-threading improved for the sequen-tial read benchmark with 32 threads and 32 files Only onerequest was generated per thread for lookup and getattr op-erations Hence metadata caching in MDOpt was not effectiveSince FUSE Opt performance is already at par with EXT4the passthrough feature in AllOpt was not utilized

Unlike sequential reads small random reads could not takeadvantage of the read-ahead feature of FUSE Additionally4KB reads are not spliced and incur data copying across user-kernel boundary With 32 threads operating on a single filethe throughput improves due to multi-threading in Opt How-ever degradation is observed with 4KB reads AllOpt passesall reads through EXT4 and hence offers near-native through-put In some cases the performance was slightly better than

USENIX Association 2019 USENIX Annual Technical Conference 129

EXT4 We believe that this minor improvement is due todouble caching at the VFS layer Due to a single request perthread for metadata operations no improvement was seenwith EXTFUSE metadata caching

Writes During sequential writes the 128K big writes andwriteback caching in Opt allow the FUSE driver to batch smallwrites (up to 128KB) together in the page cache to offer ahigher throughput However random writes are not batchedAs a result more write requests are delivered to user spacewhich negatively affects the throughput Multiple threads on asingle file perform better for requests bigger than 4KB as theyare spliced With EXTFUSE AllOpt all writes are passedthrough the EXT4 file system to offer improved performance

Write throughput degrades severely for FUSE file systemsthat support extended attributes because the VFS issues agetxattr request before every write Small IO requestsperform worse as they require more write which generatemore getxattr requests Opt random writes generated 30xfewer getxattr requests for 128KB compared to 4KB writesresulting in a 23 decrease in the throughput of 4KB writes

In contrast MDOpt caches the getxattr reply in the kernelupon the first call and serves subsequent getxattr requestswithout incurring further transitions to user space Figure 7compares the number of requests received by the user-spacedaemon in Opt and MDOpt Caching replies reduced the over-head for 4KB workload to less than 5 Similar behavior wasobserved with both sequential writes and random writesMacro Results Figure 8 shows the results of macro-workloads and synthetic server workloads emulated usingFilebench under various configs Neither of the EXTFUSEconfigs offer improvements over FUSE Opt under creationand deletion workloads as these metadata-heavy workloadscreated and deleted a number of files respectively This isbecause no metadata caching could be utilized by MDOpt Sim-ilarly no passthrough writes were utilized with AllOpt since4KB files were created and closed in user space In contrastthe File and Web server workloads under EXTFUSE utilizedboth metadata caching and passthrough access features andimproved performance We saw a 47 89 and 100 dropin lookup getattr and getattr requests to user space un-der MDOpt respectively when configured to cache up to 64Kfor each type of request AllOpt further enabled passthroughreadwrite requests to offer near native throughput for bothmacro reads and server workloadsReal Workload We also evaluated EXTFUSE with two realworkloads namely kernel decompression and compilation of4180 Linux kernel We created three separate caches forhosting lookup getattr and getxattr replies Each cachecould host up to 64K entries resulting in allocation of up to atotal of 50MB memory when fully populated

The kernel compilation make tinyconfig make -j4 ex-periment on our test machine (see sect6) reported a 52 drop incompilation time from 3974 secs under FUSE Opt to 3768secs with EXTFUSE MDOpt compared to 3091 secs with

EXT4 This was due to over 75 99 and 100 decreasein lookup getattr and getxattr requests to user spacerespectively (Figure 9) getxattr replies were proactivelycached while handling open requests thus no transitions touser space were observed for serving xattr requests WithEXTFUSE AllOpt the compilation time further dropped to3364 secs because of 100 reduction in read and writerequests to user space

In contrast the kernel decompression tar xf experimentreported a 635 drop in the completion time from 1102secs under FUSE Opt to 1032 secs with EXTFUSE MDOptcompared to 527 secs with EXT4 With EXTFUSE AllOptthe decompression time further dropped to 867 secs due to100 reduction in read and write requests to user spaceas shown in Figure 9 Nevertheless reducing the numberof cached entries for metadata requests to 4K resulted in adecompression time of 1087 secs (253 increase) due to3555 more getattr requests to user space This suggeststhat developers must efficiently manage caches

Figure 9 Linux kernel 4180 untar (decompress) and compilationtime taken with StackFS under FUSE and EXTFUSE settings Num-ber of metadata and IO requests are reduced with EXTFUSE

62 Use casesWe ported four real-world stackable FUSE file systemsnamely LoggedFS Android sdcard daemon MergerFSand BindFS to EXTFUSE and enabled both metadatacaching sect51 and passthrough IO sect52 optimizations

File System Functionality Ext Loc

StackFS [50] No-ops File System 664BindFS [35] Mirroring File System 792Android sdcard [24] Perm checks amp FAT Emu 928MergerFS [20] Union File System 686LoggedFS [16] Logging File System 748

Table 6 Lines of code (Loc) of kernel extensions required to adoptEXTFUSE for existing FUSE file systems We added support formetadata caching as well as RW passthrough

As EXTFUSE allows file systems to retain their exist-ing FUSE daemon code as the default slow path adoptingEXTFUSE for real-world file systems is easy On averagewe made less than 100 lines of changes to the existing FUSEcode to invoke EXTFUSE helper library functions for ma-nipulating kernel extensions including maps We added ker-

130 2019 USENIX Annual Technical Conference USENIX Association

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 9: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

Metadata Map Key Map Value Caching Operations Serving Extensions Invalidation Operations

Inode ltnodeID namegt fuse_entry_param lookup create mkdir mknod lookup unlink rmdir renameAttrs ltnodeIDgt fuse_attr_out getattr lookup getattr setattr unlink rmdirSymlink ltnodeIDgt link path symlink readlink readlink unlinkDentry ltnodeIDgt fuse_dirent opendir readdir readdir releasedir unlink rmdir renameXAttrs ltnodeID labelgt xattr value open getxattr listxattr getattr listxattr close setxattr removexattr

Table 4 Metadata can be cached in the kernel using eBPF maps by the user-space daemon and served by kernel extensions

kernel by inserting them in a BPF map while serving opendirrequests to reduce transitions to user space Alternativelysimilar to read-ahead optimization proactive caching of sub-sequent directory entries could be performed during the firstreaddir call to the user-space daemon Memory occupied bycached entries could then be freed by the releasedir handlerin user space that deletes them from the map Similarly se-curity labels on a file could be cached during the open call touser space and served in the kernel by getxattr extensionsNonetheless since eBPF maps consume kernel memory de-velopers must carefully manage caches and limit the numberof map entries to keep memory usage under check

52 Passthrough IO for stacking functionality

Many user file systems are stackable with a thin layer offunctionality that does not require complex processing inthe user-space For example LoggedFS [16] filters requeststhat must be logged logs them as needed and then simplyforwards them to the lower file system User-space unionfile systems such as MergerFS [20] determine the backendhost file in open and redirects IO requests to it BindFS [35]mirrors another mount point with custom permissions checksAndroid sdcard daemon performs access permission checksand emulates the case-insensitive behavior of FAT only inmetadata operations (eg lookup open etc) but forwardsdata IO requests directly to the lower file system For suchsimple cases the FUSE API proves to be too low-level andincurs unnecessarily high overhead due to context switching

With EXTFUSE readwrite IO requests can take the fastpath and directly be forwarded to the lower (host) file systemwithout incurring any context-switching if the complex slow-path user-space logic is not needed Figure 4 shows howthe user-space daemon can install the lower file descriptorin InodeMap while handling open() system call for notifyingthe EXTFUSE driver to store a reference to the lower inodekernel object With the custom_filtering_logic(path) con-dition this can be done selectively for example if accesspermission checks pass in Android sdcard daemon Simi-larly BindFS and MergerFS can adopt EXTFUSE to availpassthrough optimization The readwrite kernel extensionscan check in InodeMap to detect whether the target file is setupfor passthrough access If found EXTFUSE driver can be in-structed with a special return code to directly forward the IOrequest to the lower file system with the corresponding lowerinode object as parameter Figure 5 shows a template read

1 void handle_open(fuse_req_t req fuse_ino_t ino2 const struct fuse_open_in in) 3 file represented by ino inode num 4 struct fuse_open_out out char path[PATH_MAX]5 int len fd = open_file(ino in-gtflags path ampout)6 if (fd gt 0 ampamp custom_filtering_logic(path)) 7 + install fd in inode map for pasthru 8 + imap_key_t key = out-gtfh9 + imap_val_t val = fd lower fd

10 + extfuse_insert_imap(ampkey ampval)11

Figure 4 FUSE daemon open handler in user space WithEXTFUSE lines 7-9 (+) enable passthrough access on the file1 int read_extension(extfuse_req_t req fuse_ino_t ino2 const struct fuse_read_in in) 3 lookup in inode map passthrough if exists 4 imap_key_t key = in-gtfh5 if (extfuse_lookup_imap(ampkey)) return UPCALL6 EXAMPLE LoggedFS log operation 7 log_op(req ino FUSE_READ in sizeof(in))8 return PASSTHRU forward req to lower FS 9

Figure 5 The EXTFUSE read kernel extension returns PASSTHRU toforward request directly to the lower file system Custom thin func-tionality could further be pushed in the kernel LoggedFS loggingfunction is shown as an example (see Figure 11)

extension Kernel extensions can include additional logic orchecks before returning For instance LoggedFS readwriteextensions can filter and log operations as needed sect62

6 EvaluationTo evaluate EXTFUSE we answer the following questionsbull Baseline Performance How does an EXTFUSE imple-

mentation of a file system perform when compared to itsin-kernel and FUSE implementations (sect61)

bull Use cases What kind of existing FUSE file systems canbenefit from EXTFUSE and what performance improve-ments can they expect ( sect62)

61 PerformanceTo measure the baseline performance of EXTFUSE weadopted the simple no-ops (null) stackable FUSE file sys-tem called Stackfs [50] This user-space daemon serves allrequests by forwarding them to the host (lower) file system Itincludes all recent FUSE optimizations (Table 5) We evaluateStackfs under all possible EXTFUSE configs listed in Table 5Each config represents a particular level of performance thatcould potentially be achieved for example by caching meta-data in the kernel or directly passing readwrite requeststhrough the host file system for stacking functionality To putour results in context we compare our results with EXT4 and

128 2019 USENIX Annual Technical Conference USENIX Association

Figure 6 Throughput(opssec) for EXT4 and FUSEEXTFUSE Stackfs (w xattr) file systems under different configs (Table 5) as measuredby Random Read(RR)Write(RW) Sequential Read(SR)Write(SW) Filebench [48] data micro-workloads with IO Sizes between 4KB-1MBand settings Nth N threads Nf N files We use the same workloads as in [50]

Figure 7 Number of file system request received by the daemon inFUSEEXTFUSE Stackfs (w xattr) under workloads in Figure 6Only a few relevant request types are shown

Figure 8 Throughput(Opssec) for EXT4 and FUSEEXTFUSEStackfs (w xattr) under different configs (Table 5) as measuredby Filebench [48] Creation(C) Deletion(D) Reading(R) metadatamicro-workloads on 4KB files and FileServer(F) WebServer(W)macro-workloads with settings NthN threads NfN files

the optimized FUSE implementation of Stackfs (Opt)Testbed We use the same experiments and settings as in [50]Specifically we used EXT4 because of its popularity as thehost file system and ran benchmarks to evaluate Howeversince FUSE performance problems were reported to be moreprominent with a faster storage medium we only carry outour experiments with SSDs We used a Samsung 850 EVO250GB SSD installed on an Asus machine with Intel Quad-Core i5-3550 33 GHz and 16GB RAM running Ubuntu16043 Further to minimize any variability we formatted the

Config File System Optimizations

Opt [50] FUSE 128K Writes Splice WBCache MltThrdMDOpt EXTFUSE Opt + Caches lookup attrs xattrsAllOpt EXTFUSE MDOpt + Pass RW reqs through host FS

Table 5 Different Stackfs configs evaluated

SSD before each experiment and disabled EXT4 lazy inodeinitialization To evaluate file systems that implement xattroperations for handling security labels (eg in Android) ourimplementation of Opt supports xattrs and thus differs fromthe implementation in [50]Workloads Our workload consists of Filebench [48] microand synthetic macro benchmarks to test each config withmetadata- and data-heavy operations across a wide range ofIO sizes and parallelism settings We measure the low-levelthroughput (opssec) Our macro-benchmarks consist of asynthetic file server and web serverMicro Results Figure 6 shows the results of micro workloadunder different configs listed in Table 5

Reads Due to the default 128KB read-ahead feature ofFUSE the sequential read throughput on a single file with asingle thread for all IO sizes and under all Stackfs configsremained the same Multi-threading improved for the sequen-tial read benchmark with 32 threads and 32 files Only onerequest was generated per thread for lookup and getattr op-erations Hence metadata caching in MDOpt was not effectiveSince FUSE Opt performance is already at par with EXT4the passthrough feature in AllOpt was not utilized

Unlike sequential reads small random reads could not takeadvantage of the read-ahead feature of FUSE Additionally4KB reads are not spliced and incur data copying across user-kernel boundary With 32 threads operating on a single filethe throughput improves due to multi-threading in Opt How-ever degradation is observed with 4KB reads AllOpt passesall reads through EXT4 and hence offers near-native through-put In some cases the performance was slightly better than

USENIX Association 2019 USENIX Annual Technical Conference 129

EXT4 We believe that this minor improvement is due todouble caching at the VFS layer Due to a single request perthread for metadata operations no improvement was seenwith EXTFUSE metadata caching

Writes During sequential writes the 128K big writes andwriteback caching in Opt allow the FUSE driver to batch smallwrites (up to 128KB) together in the page cache to offer ahigher throughput However random writes are not batchedAs a result more write requests are delivered to user spacewhich negatively affects the throughput Multiple threads on asingle file perform better for requests bigger than 4KB as theyare spliced With EXTFUSE AllOpt all writes are passedthrough the EXT4 file system to offer improved performance

Write throughput degrades severely for FUSE file systemsthat support extended attributes because the VFS issues agetxattr request before every write Small IO requestsperform worse as they require more write which generatemore getxattr requests Opt random writes generated 30xfewer getxattr requests for 128KB compared to 4KB writesresulting in a 23 decrease in the throughput of 4KB writes

In contrast MDOpt caches the getxattr reply in the kernelupon the first call and serves subsequent getxattr requestswithout incurring further transitions to user space Figure 7compares the number of requests received by the user-spacedaemon in Opt and MDOpt Caching replies reduced the over-head for 4KB workload to less than 5 Similar behavior wasobserved with both sequential writes and random writesMacro Results Figure 8 shows the results of macro-workloads and synthetic server workloads emulated usingFilebench under various configs Neither of the EXTFUSEconfigs offer improvements over FUSE Opt under creationand deletion workloads as these metadata-heavy workloadscreated and deleted a number of files respectively This isbecause no metadata caching could be utilized by MDOpt Sim-ilarly no passthrough writes were utilized with AllOpt since4KB files were created and closed in user space In contrastthe File and Web server workloads under EXTFUSE utilizedboth metadata caching and passthrough access features andimproved performance We saw a 47 89 and 100 dropin lookup getattr and getattr requests to user space un-der MDOpt respectively when configured to cache up to 64Kfor each type of request AllOpt further enabled passthroughreadwrite requests to offer near native throughput for bothmacro reads and server workloadsReal Workload We also evaluated EXTFUSE with two realworkloads namely kernel decompression and compilation of4180 Linux kernel We created three separate caches forhosting lookup getattr and getxattr replies Each cachecould host up to 64K entries resulting in allocation of up to atotal of 50MB memory when fully populated

The kernel compilation make tinyconfig make -j4 ex-periment on our test machine (see sect6) reported a 52 drop incompilation time from 3974 secs under FUSE Opt to 3768secs with EXTFUSE MDOpt compared to 3091 secs with

EXT4 This was due to over 75 99 and 100 decreasein lookup getattr and getxattr requests to user spacerespectively (Figure 9) getxattr replies were proactivelycached while handling open requests thus no transitions touser space were observed for serving xattr requests WithEXTFUSE AllOpt the compilation time further dropped to3364 secs because of 100 reduction in read and writerequests to user space

In contrast the kernel decompression tar xf experimentreported a 635 drop in the completion time from 1102secs under FUSE Opt to 1032 secs with EXTFUSE MDOptcompared to 527 secs with EXT4 With EXTFUSE AllOptthe decompression time further dropped to 867 secs due to100 reduction in read and write requests to user spaceas shown in Figure 9 Nevertheless reducing the numberof cached entries for metadata requests to 4K resulted in adecompression time of 1087 secs (253 increase) due to3555 more getattr requests to user space This suggeststhat developers must efficiently manage caches

Figure 9 Linux kernel 4180 untar (decompress) and compilationtime taken with StackFS under FUSE and EXTFUSE settings Num-ber of metadata and IO requests are reduced with EXTFUSE

62 Use casesWe ported four real-world stackable FUSE file systemsnamely LoggedFS Android sdcard daemon MergerFSand BindFS to EXTFUSE and enabled both metadatacaching sect51 and passthrough IO sect52 optimizations

File System Functionality Ext Loc

StackFS [50] No-ops File System 664BindFS [35] Mirroring File System 792Android sdcard [24] Perm checks amp FAT Emu 928MergerFS [20] Union File System 686LoggedFS [16] Logging File System 748

Table 6 Lines of code (Loc) of kernel extensions required to adoptEXTFUSE for existing FUSE file systems We added support formetadata caching as well as RW passthrough

As EXTFUSE allows file systems to retain their exist-ing FUSE daemon code as the default slow path adoptingEXTFUSE for real-world file systems is easy On averagewe made less than 100 lines of changes to the existing FUSEcode to invoke EXTFUSE helper library functions for ma-nipulating kernel extensions including maps We added ker-

130 2019 USENIX Annual Technical Conference USENIX Association

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 10: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

Figure 6 Throughput(opssec) for EXT4 and FUSEEXTFUSE Stackfs (w xattr) file systems under different configs (Table 5) as measuredby Random Read(RR)Write(RW) Sequential Read(SR)Write(SW) Filebench [48] data micro-workloads with IO Sizes between 4KB-1MBand settings Nth N threads Nf N files We use the same workloads as in [50]

Figure 7 Number of file system request received by the daemon inFUSEEXTFUSE Stackfs (w xattr) under workloads in Figure 6Only a few relevant request types are shown

Figure 8 Throughput(Opssec) for EXT4 and FUSEEXTFUSEStackfs (w xattr) under different configs (Table 5) as measuredby Filebench [48] Creation(C) Deletion(D) Reading(R) metadatamicro-workloads on 4KB files and FileServer(F) WebServer(W)macro-workloads with settings NthN threads NfN files

the optimized FUSE implementation of Stackfs (Opt)Testbed We use the same experiments and settings as in [50]Specifically we used EXT4 because of its popularity as thehost file system and ran benchmarks to evaluate Howeversince FUSE performance problems were reported to be moreprominent with a faster storage medium we only carry outour experiments with SSDs We used a Samsung 850 EVO250GB SSD installed on an Asus machine with Intel Quad-Core i5-3550 33 GHz and 16GB RAM running Ubuntu16043 Further to minimize any variability we formatted the

Config File System Optimizations

Opt [50] FUSE 128K Writes Splice WBCache MltThrdMDOpt EXTFUSE Opt + Caches lookup attrs xattrsAllOpt EXTFUSE MDOpt + Pass RW reqs through host FS

Table 5 Different Stackfs configs evaluated

SSD before each experiment and disabled EXT4 lazy inodeinitialization To evaluate file systems that implement xattroperations for handling security labels (eg in Android) ourimplementation of Opt supports xattrs and thus differs fromthe implementation in [50]Workloads Our workload consists of Filebench [48] microand synthetic macro benchmarks to test each config withmetadata- and data-heavy operations across a wide range ofIO sizes and parallelism settings We measure the low-levelthroughput (opssec) Our macro-benchmarks consist of asynthetic file server and web serverMicro Results Figure 6 shows the results of micro workloadunder different configs listed in Table 5

Reads Due to the default 128KB read-ahead feature ofFUSE the sequential read throughput on a single file with asingle thread for all IO sizes and under all Stackfs configsremained the same Multi-threading improved for the sequen-tial read benchmark with 32 threads and 32 files Only onerequest was generated per thread for lookup and getattr op-erations Hence metadata caching in MDOpt was not effectiveSince FUSE Opt performance is already at par with EXT4the passthrough feature in AllOpt was not utilized

Unlike sequential reads small random reads could not takeadvantage of the read-ahead feature of FUSE Additionally4KB reads are not spliced and incur data copying across user-kernel boundary With 32 threads operating on a single filethe throughput improves due to multi-threading in Opt How-ever degradation is observed with 4KB reads AllOpt passesall reads through EXT4 and hence offers near-native through-put In some cases the performance was slightly better than

USENIX Association 2019 USENIX Annual Technical Conference 129

EXT4 We believe that this minor improvement is due todouble caching at the VFS layer Due to a single request perthread for metadata operations no improvement was seenwith EXTFUSE metadata caching

Writes During sequential writes the 128K big writes andwriteback caching in Opt allow the FUSE driver to batch smallwrites (up to 128KB) together in the page cache to offer ahigher throughput However random writes are not batchedAs a result more write requests are delivered to user spacewhich negatively affects the throughput Multiple threads on asingle file perform better for requests bigger than 4KB as theyare spliced With EXTFUSE AllOpt all writes are passedthrough the EXT4 file system to offer improved performance

Write throughput degrades severely for FUSE file systemsthat support extended attributes because the VFS issues agetxattr request before every write Small IO requestsperform worse as they require more write which generatemore getxattr requests Opt random writes generated 30xfewer getxattr requests for 128KB compared to 4KB writesresulting in a 23 decrease in the throughput of 4KB writes

In contrast MDOpt caches the getxattr reply in the kernelupon the first call and serves subsequent getxattr requestswithout incurring further transitions to user space Figure 7compares the number of requests received by the user-spacedaemon in Opt and MDOpt Caching replies reduced the over-head for 4KB workload to less than 5 Similar behavior wasobserved with both sequential writes and random writesMacro Results Figure 8 shows the results of macro-workloads and synthetic server workloads emulated usingFilebench under various configs Neither of the EXTFUSEconfigs offer improvements over FUSE Opt under creationand deletion workloads as these metadata-heavy workloadscreated and deleted a number of files respectively This isbecause no metadata caching could be utilized by MDOpt Sim-ilarly no passthrough writes were utilized with AllOpt since4KB files were created and closed in user space In contrastthe File and Web server workloads under EXTFUSE utilizedboth metadata caching and passthrough access features andimproved performance We saw a 47 89 and 100 dropin lookup getattr and getattr requests to user space un-der MDOpt respectively when configured to cache up to 64Kfor each type of request AllOpt further enabled passthroughreadwrite requests to offer near native throughput for bothmacro reads and server workloadsReal Workload We also evaluated EXTFUSE with two realworkloads namely kernel decompression and compilation of4180 Linux kernel We created three separate caches forhosting lookup getattr and getxattr replies Each cachecould host up to 64K entries resulting in allocation of up to atotal of 50MB memory when fully populated

The kernel compilation make tinyconfig make -j4 ex-periment on our test machine (see sect6) reported a 52 drop incompilation time from 3974 secs under FUSE Opt to 3768secs with EXTFUSE MDOpt compared to 3091 secs with

EXT4 This was due to over 75 99 and 100 decreasein lookup getattr and getxattr requests to user spacerespectively (Figure 9) getxattr replies were proactivelycached while handling open requests thus no transitions touser space were observed for serving xattr requests WithEXTFUSE AllOpt the compilation time further dropped to3364 secs because of 100 reduction in read and writerequests to user space

In contrast the kernel decompression tar xf experimentreported a 635 drop in the completion time from 1102secs under FUSE Opt to 1032 secs with EXTFUSE MDOptcompared to 527 secs with EXT4 With EXTFUSE AllOptthe decompression time further dropped to 867 secs due to100 reduction in read and write requests to user spaceas shown in Figure 9 Nevertheless reducing the numberof cached entries for metadata requests to 4K resulted in adecompression time of 1087 secs (253 increase) due to3555 more getattr requests to user space This suggeststhat developers must efficiently manage caches

Figure 9 Linux kernel 4180 untar (decompress) and compilationtime taken with StackFS under FUSE and EXTFUSE settings Num-ber of metadata and IO requests are reduced with EXTFUSE

62 Use casesWe ported four real-world stackable FUSE file systemsnamely LoggedFS Android sdcard daemon MergerFSand BindFS to EXTFUSE and enabled both metadatacaching sect51 and passthrough IO sect52 optimizations

File System Functionality Ext Loc

StackFS [50] No-ops File System 664BindFS [35] Mirroring File System 792Android sdcard [24] Perm checks amp FAT Emu 928MergerFS [20] Union File System 686LoggedFS [16] Logging File System 748

Table 6 Lines of code (Loc) of kernel extensions required to adoptEXTFUSE for existing FUSE file systems We added support formetadata caching as well as RW passthrough

As EXTFUSE allows file systems to retain their exist-ing FUSE daemon code as the default slow path adoptingEXTFUSE for real-world file systems is easy On averagewe made less than 100 lines of changes to the existing FUSEcode to invoke EXTFUSE helper library functions for ma-nipulating kernel extensions including maps We added ker-

130 2019 USENIX Annual Technical Conference USENIX Association

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 11: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

EXT4 We believe that this minor improvement is due todouble caching at the VFS layer Due to a single request perthread for metadata operations no improvement was seenwith EXTFUSE metadata caching

Writes During sequential writes the 128K big writes andwriteback caching in Opt allow the FUSE driver to batch smallwrites (up to 128KB) together in the page cache to offer ahigher throughput However random writes are not batchedAs a result more write requests are delivered to user spacewhich negatively affects the throughput Multiple threads on asingle file perform better for requests bigger than 4KB as theyare spliced With EXTFUSE AllOpt all writes are passedthrough the EXT4 file system to offer improved performance

Write throughput degrades severely for FUSE file systemsthat support extended attributes because the VFS issues agetxattr request before every write Small IO requestsperform worse as they require more write which generatemore getxattr requests Opt random writes generated 30xfewer getxattr requests for 128KB compared to 4KB writesresulting in a 23 decrease in the throughput of 4KB writes

In contrast MDOpt caches the getxattr reply in the kernelupon the first call and serves subsequent getxattr requestswithout incurring further transitions to user space Figure 7compares the number of requests received by the user-spacedaemon in Opt and MDOpt Caching replies reduced the over-head for 4KB workload to less than 5 Similar behavior wasobserved with both sequential writes and random writesMacro Results Figure 8 shows the results of macro-workloads and synthetic server workloads emulated usingFilebench under various configs Neither of the EXTFUSEconfigs offer improvements over FUSE Opt under creationand deletion workloads as these metadata-heavy workloadscreated and deleted a number of files respectively This isbecause no metadata caching could be utilized by MDOpt Sim-ilarly no passthrough writes were utilized with AllOpt since4KB files were created and closed in user space In contrastthe File and Web server workloads under EXTFUSE utilizedboth metadata caching and passthrough access features andimproved performance We saw a 47 89 and 100 dropin lookup getattr and getattr requests to user space un-der MDOpt respectively when configured to cache up to 64Kfor each type of request AllOpt further enabled passthroughreadwrite requests to offer near native throughput for bothmacro reads and server workloadsReal Workload We also evaluated EXTFUSE with two realworkloads namely kernel decompression and compilation of4180 Linux kernel We created three separate caches forhosting lookup getattr and getxattr replies Each cachecould host up to 64K entries resulting in allocation of up to atotal of 50MB memory when fully populated

The kernel compilation make tinyconfig make -j4 ex-periment on our test machine (see sect6) reported a 52 drop incompilation time from 3974 secs under FUSE Opt to 3768secs with EXTFUSE MDOpt compared to 3091 secs with

EXT4 This was due to over 75 99 and 100 decreasein lookup getattr and getxattr requests to user spacerespectively (Figure 9) getxattr replies were proactivelycached while handling open requests thus no transitions touser space were observed for serving xattr requests WithEXTFUSE AllOpt the compilation time further dropped to3364 secs because of 100 reduction in read and writerequests to user space

In contrast the kernel decompression tar xf experimentreported a 635 drop in the completion time from 1102secs under FUSE Opt to 1032 secs with EXTFUSE MDOptcompared to 527 secs with EXT4 With EXTFUSE AllOptthe decompression time further dropped to 867 secs due to100 reduction in read and write requests to user spaceas shown in Figure 9 Nevertheless reducing the numberof cached entries for metadata requests to 4K resulted in adecompression time of 1087 secs (253 increase) due to3555 more getattr requests to user space This suggeststhat developers must efficiently manage caches

Figure 9 Linux kernel 4180 untar (decompress) and compilationtime taken with StackFS under FUSE and EXTFUSE settings Num-ber of metadata and IO requests are reduced with EXTFUSE

62 Use casesWe ported four real-world stackable FUSE file systemsnamely LoggedFS Android sdcard daemon MergerFSand BindFS to EXTFUSE and enabled both metadatacaching sect51 and passthrough IO sect52 optimizations

File System Functionality Ext Loc

StackFS [50] No-ops File System 664BindFS [35] Mirroring File System 792Android sdcard [24] Perm checks amp FAT Emu 928MergerFS [20] Union File System 686LoggedFS [16] Logging File System 748

Table 6 Lines of code (Loc) of kernel extensions required to adoptEXTFUSE for existing FUSE file systems We added support formetadata caching as well as RW passthrough

As EXTFUSE allows file systems to retain their exist-ing FUSE daemon code as the default slow path adoptingEXTFUSE for real-world file systems is easy On averagewe made less than 100 lines of changes to the existing FUSEcode to invoke EXTFUSE helper library functions for ma-nipulating kernel extensions including maps We added ker-

130 2019 USENIX Annual Technical Conference USENIX Association

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 12: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

nel extensions to support metadata caching as well as IOpassthrough Overall it required fewer than 1000 lines of newcode in the kernel Table 6 We now present detailed evalua-tion of Android sdcard daemon and LoggedFS to present anidea on expected performance improvementsAndroid sdcard daemon Starting version 30 Android in-troduced the support for FUSE to allow a large part of in-ternal storage (eg datamedia) to be mounted as externalFUSE-managed storage (called sdcard) Being large in sizesdcard hosts user data such as videos and photos as wellas any auxiliary Opaque Binary Blobs (OBB) needed by An-droid apps The FUSE daemon enforces permission checks inmetadata operations (eglookup etc) on files under sdcardto enable multi-user support and emulates case-insensitiveFAT functionality on the host (eg EXT4) file system OBBfiles are compressed archives and typically used by gamesto host multiple small binaries (eg shade rs textures) andmultimedia objects (eg images etc)

However FUSE incurs high runtime performance over-head For instance accessing OBB archive content throughthe FUSE layer leads to high launch latency and CPU uti-lization for gaming apps Therefore Android version 70replaced sdcard daemon with with an in-kernel file systemcalled SDCardFS [24] to manage external storage It is a wrap-per (thin) stackable file system based on WrapFS [54] thatenforces permission checks and performs FAT emulation inthe kernel As such it imposes little to no overhead comparedto its user-space implementation Nevertheless it introducessecurity risks and maintenance costs [12]

We ported Android sdcard FUSE daemon to EXTFUSEframework First we leverage eBPF kernel helper functionsto push metadata checks into the kernel For example weembed access permission check (Figure 10) in lookup kernelextension to validate access before serving lookup repliesfrom the cache (Figure 3) Similar permission checks areperformed in the kernel to validate accesses to files undersdcard before serving cached getattr requests We alsoenabled passthrough on readwrite using InodeMap

We evaluated its performance on a 1GB RAM HiKey620board [1] with two popular game apps containing OBBfiles of different sizes Our results show that under AllOptpassthrough mode the app launch latency and the correspond-ing peak CPU consumption reduces by over 90 and 80respectively Furthermore we found that the larger the OBBfile the more penalty is incurred by FUSE due to many moresmall files in the OBB archive

LoggedFS is a FUSE-based stackable user-space file systemIt logs every file system operation for monitoring purposesBy default it writes to syslog buffer and logs all operations(eg open read etc) However it can be configured to writeto a file or log selectively Despite being a simple file systemit has a very important use case Unlike existing monitor-ing mechanisms (eg Inotify [31]) that suffer from a host oflimitations [10] LoggedFS can reliably post all file system

App Stats CPU () Latency (ms)

Name OBB Size D P D P

Disney Palace Pets 51 374MB 20 29 2235 1766Dead Effect 4 11GB 205 32 8895 4579

Table 7 App launch latency and peak CPU consumption of sdcarddaemon under default (D) and passthrough (P) settings on Androidfor two popular games In passthrough mode the FUSE driver neverforwards readwrite requests to user space but always passes themthrough the host (EXT4) file system See Table 5 for config details1 bool check_caller_access_to_name(int64_t key const char name) 2 define a shmap for hosting permissions 3 int val = extfuse_lookup_shmap(ampkey)4 Always block security-sensitive files at root 5 if (val || val == PERM_ROOT) return false6 special reserved files 7 if (strncasecmp(name autoruninf 11) ||8 strncasecmp(name android_secure 15) ||9 strncasecmp(name android_secure 14))

10 return false11 return true12

Figure 10 Android sdcard permission checks EXTFUSE code

events Various apps such as file system indexers backuptools Cloud storage clients such as Dropbox integrity check-ers and antivirus software subscribe to file system events forefficiently tracking modifications to files

We ported LoggedFS to EXTFUSE framework Figure 11shows the common logging code that is called from variousextensions which serve requests in the kernel (eg read ex-tension Figure 5) To evaluate its performance we ran theFileServer macro benchmark with synthetic a workload of200000 files and 50 threads from Filebench suite We foundover 9 improvement in throughput under MDOpt comparedto FUSE Opt due to 53 99 and 100 fewer lookupgetattr and getxattr requests to user space respectivelyFigure 12 shows the results AllOpt reported an additional20 improvement by directly forwarding all readwrite re-quests to the host file system offering near-native throughput

1 void log_op(extfuse_req_t req fuse_ino_t ino2 int op const void arg int arglen) 3 struct data log record 4 u32 op u32 pid u64 ts u64 ino char data[MAXLEN]5 example filter only whitelisted UIDs in map 6 u16 uid = bpf_get_current_uid_gid()7 if (extfuse_lookup_shmap(uid_wlist ampuid)) return8 log opcode timestamp(ns) and requesting process 9 dataopcode = op datats = bpf_ktime_get_ns()

10 datapid = bpf_get_current_pid_tgid() dataino = ino11 memcpy(datadata arg arglen)12 submit to per-cpu mmaprsquod ring buffer 13 u32 key = bpf_get_smp_processor_id()14 bpf_perf_event_output(req ampbuf ampkey ampdata sizeof(data))15

Figure 11 LoggedFS kernel extension that logs requests

7 DiscussionFuture use cases Given negligible overhead of EXTFUSEand direct passthrough access to the host file system for stack-ing incremental functionality multiple app-defined ldquothinrdquo filesystem functions (eg security checks logging etc) can be

USENIX Association 2019 USENIX Annual Technical Conference 131

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 13: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

Figure 12 LoggedFS performance measured by Filebench File-Server benchmark under EXT4 FUSE and EXTFUSE Fewer meta-data and IO requests were delivered to user space with EXTFUSE

stacked with low overhead which otherwise would have beenvery expensive in user space with FUSESafety EXTFUSE sandboxes untrusted user extensions toguarantee safety For example the eBPF runtime allowsaccess to only a few simple non-blocking kernel helper func-tions Map data structures are of fixed size Extensions arenot allowed to allocate memory or directly perform any IOoperations Even so EXTFUSE offers significant perfor-mance boost across a number of use cases sect62 by offloadingsimple logic in the kernel Nevertheless with EXTFUSEuser file systems can retain their existing slow-path logic forperforming complex operations such as encryption in userspace Future work can extend the EXTFUSE framework totake advantage of existing generic in-kernel services such asVFS encryption and compression APIs to even serve requeststhat require such complex operations entirely in the kernel

8 Related WorkHere we compare our work with related existing researchUser File System Frameworks There exists a number offrameworks to develop user file systems A number of userfile systems have been implemented using NFS loopbackservers [19] UserFS [14] exports generic VFS-like filesystem requests to the user space through a file descriptorArla [53] is an AFS client system that lets apps implementa file system by sending messages through a device file in-terface devxfs0 Coda file system [42] exported a simi-lar interface through devcfs0 NetBSD provides Pass-to-Userspace Framework FileSystem (PUFFS) Maziegraveres et alproposed a C++ toolkit that exposes a NFS-like interface forallowing file systems to be implemented in user space [33]UFO [3] is a global file system implemented in user space byintroducing a specialized layer between the apps and the OSthat intercepts file system callsExtensible Systems Past works have explored the idea of let-ting apps extend system services at runtime to meet their per-formance and functionality needs SPIN [6] and VINO [43]allow apps to safely insert kernel extensions SPIN uses atype-safe language runtime whereas VINO uses softwarefault isolation to provide safety ExoKernel [13] is anotherOS design that lets apps define their functionality Systemssuch as ASHs [17 51] and Plexus [15] introduced the con-cept of network stack extension handlers inserted into the

kernel SLIC [18] extends services in monolithic OS usinginterposition to enable incremental functionality and compo-sition SLIC assumes that extensions are trusted EXTFUSEis a framework that allows user file systems to add ldquothinrdquo ex-tensions in the kernel that serve as specialized interpositionlayers to support both in-kernel and user space processing toco-exist in monolithic OSeseBPF EXTFUSE is not the first system to use eBPF for safeextensibility eXpress DataPath (XDP) [27] allows apps toinsert eBPF hooks in the kernel for faster packet process-ing and filtering Amit et al proposed Hyperupcalls [4] aseBPF helper functions for guest VMs that are executed bythe hypervisor More recently SandFS [7] uses eBPF to pro-vide an extensible file system sandboxing framework LikeEXTFUSE it also allows unprivileged apps to insert customsecurity checks into the kernel

FUSE File System Translator (FiST) [55] is a tool forsimplifying the development of stackable file system It pro-vides boilerplate template code and allows developers to onlyimplement the core functionality of the file system FiSTdoes not offer safety and reliability as offered by user spacefile system implementation Additionally it requires learninga slightly simplified file system language that describes theoperation of the stackable file system Furthermore it onlyapplies to stackable file systems

Narayan et al [34] combined in-kernel stackable FiSTdriver with FUSE to offload data from IO requests to userspace to apply complex functionality logic and pass processedresults to the lower file system Their approach is only ap-plicable to stackable file systems They further rely on staticper-file policies based on extended attributes labels to en-able or disable certain functionality In contrast EXTFUSEdownloads and safely executes thin extensions from user filesystems in the kernel that encapsulate their rich and special-ized logic to serve requests in the kernel and skip unnecessaryuser-kernel switching

9 ConclusionWe propose the idea of extensible user file systems and presentthe design and architecture of EXTFUSE an extension frame-work for FUSE file system EXTFUSE allows FUSE filesystems to define ldquothinrdquo extensions along with auxiliary datastructures for specialized handling of low-level requests inthe kernel while retaining their existing complex logic in userspace EXTFUSE provides familiar FUSE-like APIs and ab-stractions for easy adoption We demonstrate its practicalusefulness suitability for adoption and performance benefitsby porting and evaluating existing FUSE implementations

10 AcknowledgmentsWe thank our shepherd Dan Williams and all anonymousreviewers for their feedback which improved the contentof this paper This work was funded in part by NSF CPSprogram Award 1446801 and a gift from Microsoft Corp

132 2019 USENIX Annual Technical Conference USENIX Association

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 14: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

References[1] 96boards Hikey (lemaker) development boards May 2019

[2] Mike Accetta Robert Baron William Bolosky David Golub RichardRashid Avadis Tevanian and Michael Young Mach A new kernelfoundation for unix development pages 93ndash112 1986

[3] Albert D Alexandrov Maximilian Ibel Klaus E Schauser and Chris JScheiman Extending the operating system at the user level theUfo global file system In Proceedings of the 1997 USENIX AnnualTechnical Conference (ATC) pages 6ndash6 Anaheim California January1997

[4] Nadav Amit and Michael Wei The design and implementation ofhyperupcalls In Proceedings of the 2017 USENIX Annual TechnicalConference (ATC) pages 97ndash112 Boston MA July 2018

[5] Starr Andersen and Vincent Abella Changes to Functionality inWindows XP Service Pack 2 Part 3 Memory Protection Technolo-gies 2004 httpstechnetmicrosoftcomen-uslibrarybb457155aspx

[6] B N Bershad S Savage P Pardyak E G Sirer M E FiuczynskiD Becker C Chambers and S Eggers Extensibility Safety andPerformance in the SPIN Operating System In Proceedings of the 15thACM Symposium on Operating Systems Principles (SOSP) CopperMountain CO December 1995

[7] Ashish Bijlani and Umakishore Ramachandran A lightweight andfine-grained file system sandboxing framework In Proceedings ofthe 9th Asia-Pacific Workshop on Systems (APSys) Jeju Island SouthKorea August 2018

[8] Ceph Ceph kernel client April 2018 httpsgithubcomcephceph-client

[9] Open ZFS Community ZFS on Linux httpszfsonlinuxorgApril 2018

[10] J Corbet Superblock watch for fsnotify April 2017

[11] Brian Cornell Peter A Dinda and Fabiaacuten E Bustamante Wayback Auser-level versioning file system for linux In Proceedings of the 2004USENIX Annual Technical Conference (ATC) pages 27ndash27 BostonMA JunendashJuly 2004

[12] Exploit Database Android - sdcardfs changes current-gtfs withoutproper locking 2019

[13] Dawson R Engler M Frans Kaashoek and James OrsquoToole Jr Exoker-nel An Operating System Architecture for Application-Level ResourceManagement In Proceedings of the 15th ACM Symposium on Oper-ating Systems Principles (SOSP) Copper Mountain CO December1995

[14] Jeremy Fitzhardinge Userfs March 2018 httpwwwgooporg~jeremyuserfs

[15] Marc E Fiuczynski and Briyan N Bershad An Extensible ProtocolArchitecture for Application-Specific Networking In Proceedings ofthe 1996 USENIX Annual Technical Conference (ATC) San Diego CAJanuary 1996

[16] R Flament LoggedFS - Filesystem monitoring with Fuse March2018 httpsrflamentgithubiologgedfs

[17] Gregory R Ganger Dawson R Engler M Frans Kaashoek Hec-tor M Bricentildeo Russell Hunt and Thomas Pinckney Fast and FlexibleApplication-level Networking on Exokernel Systems ACM Transac-tions on Computer Systems (TOCS) 20(1)49ndash83 2002

[18] Douglas P Ghormley David Petrou Steven H Rodrigues andThomas E Anderson Slic An extensibility system for commod-ity operating systems In Proceedings of the 1998 USENIX AnnualTechnical Conference (ATC) New Orleans Louisiana June 1998

[19] David K Gifford Pierre Jouvelot Mark A Sheldon et al Semantic filesystems In Proceedings of the 13th ACM Symposium on OperatingSystems Principles (SOSP) pages 16ndash25 Pacific Grove CA October1991

[20] A Featureful Union Filesystem March 2018 httpsgithubcomtrapexitmergerfs

[21] LibFuse | GitHub Without lsquodefault_permissionslsquo cached permis-sions are only checked on first access 2018 httpsgithubcomlibfuselibfuseissues15

[22] Gluster libgfapi April 2018 httpstaged-gluster-docsreadthedocsioenrelease370beta1Featureslibgfapi

[23] Gnu hurd April 2018 wwwgnuorgsoftwarehurdhurdhtml

[24] Storage | Android Open Source Project September 2018 httpssourceandroidcomdevicesstorage

[25] John H Hartman and John K Ousterhout Performance measurementsof a multiprocessor sprite kernel In Proceedings of the Summer1990 USENIX Annual Technical Conference (ATC) pages 279ndash288Anaheim CA 1990

[26] eBPF extended Berkley Packet Filter 2017 httpswwwiovisororgtechnologyebpf

[27] IOVisor Xdp - io visor project May 2019

[28] Shun Ishiguro Jun Murakami Yoshihiro Oyama and Osamu TatebeOptimizing local file accesses for fuse-based distributed storage InHigh Performance Computing Networking Storage and Analysis(SCC) 2012 SC Companion pages 760ndash765 2012

[29] Antti Kantee puffs-pass-to-userspace framework file system In Pro-ceedings of the Asian BSD Conference (AsiaBSDCon) Tokyo JapanMarch 2007

[30] Jochen Liedtke Improving ipc by kernel design In Proceedings ofthe 14th ACM Symposium on Operating Systems Principles (SOSP)pages 175ndash188 Asheville NC December 1993

[31] Robert Love Kernel korner Intro to inotify Linux Journal 200582005

[32] Ali Joseacute Mashtizadeh Andrea Bittau Yifeng Frank Huang and DavidMaziegraveres Replication history and grafting in the ori file systemIn Proceedings of the 24th ACM Symposium on Operating SystemsPrinciples (SOSP) Farmington PA November 2013

[33] David Maziegraveres A Toolkit for User-Level File Systems In Proceed-ings of the 2001 USENIX Annual Technical Conference (ATC) pages261ndash274 June 2001

[34] S Narayan R K Mehta and J A Chandy User space storage systemstack modules with file level control In Proceedings of the LinuxSymposium pages 189ndash196 Ottawa Canada July 2010

[35] Martin Paumlrtel bindfs 2018 httpsbindfsorg

[36] PaX Team PaX address space layout randomization (ASLR) 2003httpspaxgrsecuritynetdocsaslrtxt

[37] Rogeacuterio Pontes Dorian Burihabwa Francisco Maia Joatildeo Paulo Vale-rio Schiavoni Pascal Felber Hugues Mercier and Rui Oliveira SafefsA modular architecture for secure user-space file systems One fuseto rule them all In Proceedings of the 10th ACM International onSystems and Storage Conference pages 91ndash912 Haifa Israel May2017

[38] Nikolaus Rath List of fuse file systems 2011 httpsgithubcomlibfuselibfusewikiFilesystems

[39] Nicholas Rauth A network filesystem client to connect to SSH serversApril 2018 httpsgithubcomlibfusesshfs

[40] Gluster April 2018 httpglusterorg

[41] Kai Ren and Garth Gibson Tablefs Enhancing metadata efficiencyin the local file system In Proceedings of the 2013 USENIX AnnualTechnical Conference (ATC) pages 145ndash156 San Jose CA June 2013

USENIX Association 2019 USENIX Annual Technical Conference 133

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments
Page 15: Extension Framework for File Systems in User space · FUSE file systems shows thatEXTFUSE can improve the performance of user file systems with less than a few hundred lines on average

[42] Mahadev Satyanarayanan James J Kistler Puneet Kumar Maria EOkasaki Ellen H Siegel and David C Steere Coda A highlyavailable file system for a distributed workstation environment IEEETransactions on Computers 39(4)447ndash459 April 1990

[43] Margo Seltzer Yasuhiro Endo Christopher Small and Keith A SmithAn introduction to the architecture of the vino kernel Technical reportTechnical Report 34-94 Harvard Computer Center for Research inComputing Technology October 1994

[44] Helgi Sigurbjarnarson Petur O Ragnarsson Juncheng Yang YmirVigfusson and Mahesh Balakrishnan Enabling space elasticity instorage systems In Proceedings of the 9th ACM International onSystems and Storage Conference pages 61ndash611 Haifa Israel June2016

[45] David C Steere James J Kistler and Mahadev Satyanarayanan Effi-cient user-level file cache management on the sun vnode interface InProceedings of the Summer 1990 USENIX Annual Technical Confer-ence (ATC) Anaheim CA 1990

[46] Swaminathan Sundararaman Laxman Visampalli Andrea C Arpaci-Dusseau and Remzi H Arpaci-Dusseau Refuse to Crash with Re-FUSE In Proceedings of the 6th European Conference on ComputerSystems (EuroSys) Salzburg Austria April 2011

[47] M Szeredi and NRauth Fuse - filesystems in userspace 2018 httpsgithubcomlibfuselibfuse

[48] Vasily Tarasov Erez Zadok and Spencer Shepler Filebench Aflexible framework for file system benchmarking login The USENIXMagazine 41(1)6ndash12 March 2016

[49] Ungureanu Cristian and Atkin Benjamin and Aranya Akshat andGokhale Salil and Rago Stephen and Całkowski Grzegorz and Dub-nicki Cezary and Bohra Aniruddha HydraFS A High-throughputFile System for the HYDRAstor Content-addressable Storage SystemIn 10th USENIX Conference on File and Storage Technologies (FAST)(FAST 10) pages 17ndash17 San Jose California February 2010

[50] Bharath Kumar Reddy Vangoor Vasily Tarasov and Erez Zadok ToFUSE or Not to FUSE Performance of User-Space File Systems In15th USENIX Conference on File and Storage Technologies (FAST)(FAST 17) Santa Clara CA February 2017

[51] Deborah A Wallach Dawson R Engler and M Frans KaashoekASHs Application-Specific Handlers for High-Performance Messag-ing In Proceedings of the 7th ACM SIGCOMM Palo Alto CA August1996

[52] Sage A Weil Scott A Brandt Ethan L Miller Darrell D E Long andCarlos Maltzahn Ceph A scalable high-performance distributed filesystem In Proceedings of the 7th USENIX Symposium on OperatingSystems Design and Implementation (OSDI) pages 307ndash320 SeattleWA November 2006

[53] Assar Westerlund and Johan Danielsson Arla a free afs client InProceedings of the 1998 USENIX Annual Technical Conference (ATC)pages 32ndash32 New Orleans Louisiana June 1998

[54] E Zadok I Badulescu and A Shender Extending File Systems UsingStackable Templates In Proceedings of the 1999 USENIX AnnualTechnical Conference (ATC) pages 57ndash70 June 1999

[55] E Zadok and J Nieh FiST A Language for Stackable File SystemsIn Proceedings of the 2000 USENIX Annual Technical Conference(ATC) June 2000

[56] Erez Zadok Sean Callanan Abhishek Rai Gopalan Sivathanu andAvishay Traeger Efficient and safe execution of user-level code in thekernel In Parallel and Distributed Processing Symposium 19th IEEEInternational pages 8ndash8 2005

[57] ZFS-FUSE April 2018 httpsgithubcomzfs-fusezfs-fuse

134 2019 USENIX Annual Technical Conference USENIX Association

  • Introduction
  • Background and Extended Motivation
    • FUSE
    • Generality vs Specialization
      • Design
        • Overview
        • Goals and Challenges
        • eBPF
        • Architecture
        • ExtFUSE APIs and Abstractions
        • Workflow
          • Implementation
          • Optimizations
            • Customized in-kernel metadata caching
            • Passthrough IO for stacking functionality
              • Evaluation
                • Performance
                • Use cases
                  • Discussion
                  • Related Work
                  • Conclusion
                  • Acknowledgments