
Story Book: An Efficient Extensible Provenance Framework



R. Spillane, R. Sears∗, C. Yalamanchili, S. Gaikwad, M. Chinni, and E. Zadok

Stony Brook University and ∗University of California, Berkeley

Abstract

Most application provenance systems are hard coded for a particular type of system or data, while current provenance file systems maintain in-memory provenance graphs and reside in kernel space, leading to complex and constrained implementations. Story Book resides in user space and treats provenance events as a generic event log, leading to a simple, flexible, and easily optimized system.

We demonstrate the flexibility of our design by adding provenance to a number of different systems, including a file system, a database, and a number of file types, and by implementing two separate storage backends. Although Story Book is nearly 2.5 times slower than ext3 under worst-case workloads, this is mostly due to FUSE message-passing overhead. Our experiments show that coupling our simple design with existing storage optimizations provides higher throughput than existing systems.

1 Introduction

Existing provenance systems are designed to deal with specific applications and system architectures, and are difficult to adapt to other systems and types of data. Story Book decouples the application-specific aspects of provenance tracking from dependency tracking, queries, and other mechanisms common across provenance systems.

Story Book runs in user space, simplifying its implementation and allowing it to make use of existing, off-the-shelf components. Implementing a provenance (or any other) file system in user space incurs significant overhead. However, our user-space design significantly reduces communication costs for Story Book's application-specific provenance extensions, which may use existing IPC mechanisms, or even run inside the process generating provenance data.

Story Book supports application-specific extensions, allowing it to bring provenance to new classes of systems. It currently supports file system and database provenance, and can be extended to other types of systems, such as Web or email servers. Story Book's file system provenance system also supports extensions that record additional information based on file type. We have implemented two file-type modules: one records the changes made by applications to .txt files, and the other records modifications to .docx metadata, such as the author name.

Our experiments show that Story Book's storage and query performance are adequate for file system provenance workloads where each process performs a handful of provenance-related actions. However, systems that service many small requests across a wide range of users, such as databases and Web servers, generate provenance information at much higher rates and with finer granularity. As such systems handle most of today's multi-user workloads, we believe they also provide higher-value provenance information than interactive desktop systems. This space is where Story Book's extensible user-space architecture and high write throughput are most valuable.

The rest of this paper is organized as follows: Section 2 describes existing approaches to provenance, Section 3 describes the design and implementation of Story Book, and Section 4 describes our experiments and performance comparisons with PASSv2. We conclude in Section 5.

2 Background

A survey of provenance systems [14] provides a taxonomy of existing provenance databases. The taxonomy shows that Story Book is flexible enough to cover much of the design space targeted by existing systems. This flexibility comes from Story Book's layered, modular approach to provenance tracking (Figure 1).

A provenance source intercepts user interaction events with application data and sends these events to application-specific extensions, which interpret them and generate provenance inserts into one of Story Book's storage backends. Queries are handled by the Story Book API. Story Book relies on external libraries to implement each of its modules. FUSE [8] and MySQL [18] intercept file system and database events. Extensions to Story Book's FUSE file system, such as the .txt and .docx modules, annotate events with application-specific provenance. These provenance records are then inserted into either Stasis [12] or Berkeley DB [15]. Stasis stores provenance data using database-style no-Force/Steal recovery for its hashtables, and compressed log-structured merge (LSM)


[Figure 1 diagram: provenance sources (FUSE FS, MySQL, network logs) generate events for application-specific extensions (TXT, DOCX, versioning, realtime query), which produce inserts into the storage backends (Stasis/Rose, BDB) behind the Story Book API; supporting components include durability protocols (Valor), event timestamps, process hierarchy, and compression.]

Figure 1: Story Book's modular approach to provenance tracking.

trees [11] for its Rose [13] indexes. Story Book utilizes Valor [16] to maintain write ordering between its logs and the kernel page cache. This reduces the number of disk flushes, greatly improving performance.

Although Story Book utilizes a modular design that separates different aspects of its operation into external modules (i.e., Rose, Valor, and FUSE), it does hardcode some aspects of its implementation. Simmhan's survey discusses provenance systems that store metadata separately from application data, and provenance systems that do not. One design decision hardcoded into Story Book is that metadata is always stored separately from application data for performance reasons. Story Book stores blob data in a physically separate location on disk in order to preserve locality of provenance graph nodes and efficiently support queries.

Story Book avoids hardcoding most of the other design decisions mentioned by the survey. Because Story Book does not directly support versioning, application-specific provenance systems are able to decide whether to store logical undo/redo records of application operations to save space and improve performance, or to save raw value records to reduce implementation time. Story Book's .txt module stores patches between text files rather than whole blocks or pages that were changed.

Story Book allows applications to determine the granularity of their provenance records, and whether to focus on information about data, processes, or both. Similarly, Story Book's implementation is independent of the format of extended records, allowing applications to use metadata standards such as RDF [19] or Dublin Core [2] as they see fit.

The primary limitation of Story Book compared to systems that make native use of semantic metadata is that Story Book's built-in provenance queries would

[Figure 2 diagram: PASSv2's in-kernel components (graph pruning, event reordering, FS instrumentation, cycle detection and breaking, network consensus protocol, WAL implementation) feed WALDO, which consumes and indexes the WAL and answers queries over stale data.]

Figure 2: PASSv2's monolithic approach to provenance tracking. Although Story Book contains more components than PASSv2, Story Book addresses a superset of PASSv2's applications, and leverages existing systems to avoid complex, provenance-specific code.

need to be extended to understand annotations such as "version-of" or "derived-from," and act accordingly. Applications could implement custom queries that make use of these attributes, but such queries would incur significant I/O cost. This is because our schema stores (potentially large) application-specific annotations in a separate physical location on disk. Of course, user-defined annotations could be stored in the provenance graph, but this would increase complexity, both in the schema and in provenance query implementations.

Databases such as Trio [20] reason about the quality or reliability of input data and the processes that modify it over time. Such systems typically cope with provenance information, as unreliable inputs reflect uncertainty in the outputs of queries. Story Book targets provenance over exact data. This is especially appropriate for file system provenance and regulatory compliance applications; in such circumstances, it is more natural to ask which version of an input was used rather than to calculate the probability that a particular input is correct.

2.1 Comparison to PASS

Existing provenance systems couple general provenance concepts to system architectures, leading to complex in-kernel mechanisms and complicating integration with application data representations. Although PASSv2 [9] suffers from these issues, it is also the closest system to Story Book (Figure 2).

PASSv2 distinguishes between data attributes, such as name or creation time, and relationships indicating things like data flow and versioning. Story Book performs a similar layering, except that it treats versioning information (if any) as an attribute. Furthermore, whereas PASSv2 places such mechanisms in kernel space, Story Book places them in user space.


On the one hand, this means that user-level provenance inspectors could tamper with provenance information; on the other, it has significant performance and usability advantages. In regulatory environments (which cope with untrusted end-users), Story Book should run in a trusted file or other application server. The security implications of this approach are minimal: in both cases a malicious user requires root access to the provenance server in order to tamper with provenance information.

PASSv2 is a system-call-level provenance system, and requires that all provenance data pass through relevant system calls. Application-level provenance data is tracked by applications, then used to annotate corresponding system calls that are intercepted by PASSv2. In contrast, Story Book is capable of system call interception, but does not require it. This avoids the need for applications to maintain in-memory graphs of provenance information. Furthermore, it is unclear how system-call-level instrumentation interacts with mechanisms such as database buffer managers, which destroy the relationship between application-level operations and system calls. Story Book's ability to track MySQL provenance shows that such systems are easily handled with our approach.

PASSv2 supersedes PASS, which directly wrote provenance information to a database. The authors of the PASS systems concluded that direct database access was neither "efficient nor scalable," and moved database operations into an asynchronous background thread.

Our experiments show that, when properly tuned, Story Book's Fable storage system and PASSv2's Waldo storage system both perform well enough to allow direct database access, and that the bulk of provenance overhead is introduced by system instrumentation and insertion-time manipulation of provenance graphs. Story Book suffers from the first bottleneck, while PASSv2 suffers from the latter.

This finding has a number of implications. First, Story Book performs database operations synchronously, so its provenance queries always run against up-to-date information. Second, improving Story Book's performance involves improvements to general-purpose code, while PASSv2's bottleneck is in provenance-specific mechanisms.

From a design perspective, then, Story Book is at a significant advantage; improving our performance is a matter of utilizing a different instrumentation technology. In contrast, PASSv2's bottlenecks stem from a special-purpose component that is fundamental to its design.

PASSv2 uses a recovery protocol called write-ahead provenance, which is similar to Story Book's file-system recovery approach. However, Story Book is slightly more general, as its recovery mechanisms are capable of entering into system-specific commit protocols. This is important when tracking database provenance, as it allows Story Book to reuse existing durability mechanisms, such as Valor, and inexpensive database replication techniques. For our experiments, we extended write-ahead provenance with some optimizations employed by Fable so that PASSv2's write throughput would be within an order of magnitude of ours.

PASSv2 supports a number of workloads currently unaddressed by Story Book, such as network operation and unified naming across provenance implementations. The mechanisms used to implement these primitives in PASSv2 follow from the decision to allow multiple provenance sources to build provenance graphs before applying them to the database. Even during local operation, PASSv2 employs special techniques to avoid cycles in provenance graphs due to reordering of operations and record suppression. In the distributed case, clock skew and network partitions further complicate matters.

In contrast, Story Book appends each provenance operation to a unified log, guaranteeing a consistent, cycle-free dependency graph (see Section 3.8). We prefer this approach to custom provenance consensus algorithms, as the problem of imposing a partial order over events in a distributed system is well understood; building a distributed version of Story Book is simply a matter of building a provenance source that makes use of well-known distributed systems techniques such as logical clocks [6] or Paxos [7].
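To make the logical-clock option concrete, the following is a minimal sketch of the kind of Lamport-clock stamping a distributed provenance source could apply before appending events to the unified log. The class and event tuples are illustrative only and are not part of Story Book.

# Minimal Lamport-clock sketch (illustrative; not Story Book code).
# A distributed provenance source could stamp each event before appending
# it to the unified log, so merged logs preserve a consistent partial order.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock.
        self.time += 1
        return self.time

    def merge(self, remote_time):
        # Event received from another node: take the max, then advance.
        self.time = max(self.time, remote_time) + 1
        return self.time

clock = LamportClock()
event_a = ("open", "/data/report.txt", clock.tick())
event_b = ("write", "/data/report.txt", clock.merge(remote_time=7))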

3 Design and Implementation

Story Book has two primary scalability goals: (1) decrease the overhead of logging provenance information, and (2) decrease the implementation effort required to track the provenance of applications which use the file system or other instrumentable storage to organize data.

We discuss Story Book's application development model in Section 3.1, and Story Book's architecture and provenance file system write path in Section 3.2. We explain Story Book's indexing optimizations in Section 3.3 and recovery in Section 3.4. We discuss reuse of third-party code in Section 3.5 and the types of provenance that are recorded on disk in Section 3.6. We explain how Story Book accomplishes a provenance query in Section 3.7 and describe Story Book's ability to compress and suppress redundant provenance information in Section 3.8.

3.1 Application Development Model

Story Book provides a number of approaches to application-specific provenance. The most powerful, efficient, and precise approach directly instruments applications, allowing the granularity of operations and actors recorded in the provenance records to match the application. This approach also allows for direct communication


between the application and Story Book, minimizing communication overheads.

However, some applications cannot be easily instrumented, and file formats are often modified by many different applications. In such cases, it makes sense to track application-specific provenance information at the file system layer. Story Book addresses this problem by allowing developers to implement provenance inspectors, which intercept requests to modify particular sets of files. When relevant files are accessed, the appropriate provenance inspector is alerted and allowed to log additional provenance information. Provenance inspectors work best when the intercepted operations are indicative of the application's intent (e.g., an update to a document represents an edit to the document). Servers that service many unrelated requests or use the file system primarily as a block device (e.g., Web servers, databases) are still able to record their provenance using Story Book's file-system provenance API, though they would be better served by Story Book's application-level provenance.

For our evaluation of Story Book's file system-level provenance, we implemented a .txt and a .docx provenance inspector. The .txt inspector records patches that document changes made to a text file between the open and close. The .docx inspector records changes to document metadata (e.g., author names, title) between the time it was opened and closed. We describe provenance inspectors in more detail in Section 3.6.3.
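The following is a simplified sketch of what a .txt inspector of this kind might do between open and close: snapshot the file at open time, diff it at close time, and attach the patch as application-specific provenance. The hook names (on_open, on_close) and the log_extended_record callback are hypothetical; Story Book's real inspector API is not shown in the paper.

# Hypothetical .txt provenance inspector sketch: snapshot on open,
# unified diff on close, and the patch becomes an extended record.
import difflib

snapshots = {}  # path -> list of lines at open time

def on_open(path):
    with open(path, errors="replace") as f:
        snapshots[path] = f.readlines()

def on_close(path, log_extended_record):
    with open(path, errors="replace") as f:
        new_lines = f.readlines()
    patch = "".join(difflib.unified_diff(snapshots.pop(path, []), new_lines,
                                         fromfile=path + "@open",
                                         tofile=path + "@close"))
    if patch:
        log_extended_record(path, patch.encode())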

3.2 Architecture

Story Book stores three kinds of records: (1) basic records, which record an open, close, read, or write; (2) process records, which record a caller-callee relationship; and (3) extended records, which record application-specific provenance information. These records are generated by the provenance sources and application-specific extensions layers. The storage backend in Story Book that is optimized for write throughput using Rose and Stasis is called Fable. Basic, process, and extended records are inserted into Fable, or into the Berkeley DB backend if it is being used.

Rather than synchronously committing new basic, process, and extended records to disk before allowing updates to propagate to the file system, we use Stasis' non-durable commit mechanism and Valor's write ordering to ensure that write-ahead records reach disk before file system operations. This approach is a simplified version of Speculator's [10] external synchrony support, in that it performs disk synchronization before page write-back, but unlike Speculator, it will not sync the disk before any event which presents information to the user (e.g., via tty1).

This allows Fable to use a single I/O operation to commit batches of provenance operations just before the file system flushes its dirty pages. Although writes to Stasis' page file never block log writes or file system writeback, hash index operations may block if the Stasis pages they manipulate are not in memory. Unless the system is under heavy memory pressure, the hash pages will be resident. Therefore, most application requests that block on Fable I/O also block on file system I/O.

Figure 3 describes what happens when an application P writes to a document. The kernel receives the request from P and forwards the request to the FUSE file system (1). FUSE forwards the request to Story Book's FUSE daemon (2), which determines the type of file being accessed. Once the file type is determined, the request is forwarded to the appropriate provenance inspector (3). The provenance inspector generates basic and extended log records (4) and schedules them for asynchronous insertion into a write-ahead log (5). Fable synchronously updates (cached) hash table pages and Rose's in-memory component (6) and then allows the provenance inspector to return to the caller P (7). After some time, the file system flushes dirty pages (8), and Valor ensures that any scheduled entries in Fable's log are written (9) before the file system pages (10). Stasis' dirty pages are occasionally written to disk to enable log truncation or to respond to memory pressure (11).

[Figure 3 diagram: an application P issues file system requests that pass through the Linux page cache and FUSE into Story Book's FUSE daemon and the appropriate provenance inspector; Fable updates its Rose RB-tree and hashtable page cache and appends to the write-ahead log, while Valor's write ordering ensures log entries reach disk before the file system pages and the Stasis page file. Numbered arrows correspond to the steps described in the text.]

Figure 3: Write path of Story Book’s provenance file system

3.3 Log Structured Merge Trees

Rose indexes are compressed LSM-trees that have been optimized for high throughput and asynchronous, durable commit. Their contents are stored on disk in compressed, bulk-loaded B-trees. Insertions are serviced by inserting data into an in-memory red-black tree. Once


this tree is big enough, it is merged with the smallest on-disk tree and emptied. Since the on-disk tree's leaf nodes are laid out sequentially on disk, this merge does not cause a significant number of disk seeks. Similarly, the merge produces data in sorted order, allowing it to contiguously lay out the new version of the tree on disk.

The small on-disk tree component will eventually become large, causing merges with in-memory trees to become prohibitively expensive. Before this happens, we merge the smaller on-disk tree with a larger on-disk tree, emptying the smaller tree. By using two on-disk trees and scheduling merges correctly, the amortized cost of insertion is reduced to O(log n · √n), with a constant factor proportional to the cost of compressed sequential I/O. B-trees' O(log n) insertion cost is proportional to the cost of random I/O. The ratio of random I/O cost to sequential I/O cost is increasing exponentially over time, both for disks and for in-memory operations.
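The following toy sketch illustrates the merge scheduling described above. Sorted Python lists stand in for Rose's in-memory red-black tree (C0) and its two compressed, bulk-loaded on-disk trees (C1, C2); the size thresholds are arbitrary illustrative values, not Rose's.

# Toy two-on-disk-component LSM sketch (illustrative; not Rose's code).
import bisect
from heapq import merge

class ToyLSM:
    def __init__(self, c0_limit=4, c1_limit=16):
        self.c0, self.c1, self.c2 = [], [], []   # memory tree, small and large on-disk trees
        self.c0_limit, self.c1_limit = c0_limit, c1_limit

    def insert(self, key):
        bisect.insort(self.c0, key)              # stands in for the red-black tree
        if len(self.c0) >= self.c0_limit:        # C0 full: sequential merge into C1
            self.c1 = list(merge(self.c0, self.c1))
            self.c0 = []
        if len(self.c1) >= self.c1_limit:        # before C1 grows too large, fold it into C2
            self.c2 = list(merge(self.c1, self.c2))
            self.c1 = []

    def lookup(self, key):
        # A point query must check every component, which is one reason the
        # query experiment in Section 4.5 favors a single B-tree.
        return any(key in c for c in (self.c0, self.c1, self.c2))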

Analysis of asymptotic LSM-tree performance shows that, on current hardware, worst-case LSM-tree throughput will dominate worst-case B-tree write throughput across all reasonable index sizes, but that in the best case for a B-tree, insertions arrive in sorted order, allowing it to write back dirty pages serially to disk with perfect locality. Provenance queries require fast lookup of basic records based on inode, not arrival time, forcing Story Book's indexes to re-sort basic records before placing them on disk. If insertions do show good locality (due to skew based on inode, for example), then Rose's tree merges will needlessly touch large amounts of clean data. Partitioning the index across multiple Rose indexes would lessen the impact of this problem [4].

Sorted data compresses well, and Story Book's tables have been chosen to be easily compressible. This allows Rose to use simple compression techniques, such as run-length encoding, to conserve I/O bandwidth. Superscalar implementations of such compression algorithms compress and decompress with GiB/s throughput on modern processor cores. Also, once the data has been compressed for writeback, it is kept in compressed form in memory, conserving RAM.

3.4 Recovery

As we will see in Section 3.6, the basic records stored in Fable contain pointers into extended record logs. At runtime, Fable ensures that extended records reach disk before the basic records that reference them. This ensures that all pointers from basic records encountered during recovery point to complete extended records, so recovery does not need to take any special action to ensure that the logs are consistent.

If the system crashes while file system (or other application) writes are in flight, Story Book must ensure that a record of the in-flight operations is available in the provenance records so that future queries are aware of any potential corruption due to the crash.

Since Valor guarantees that Stasis' write-ahead log entries reach disk before the corresponding file system operations, Fable simply relies upon Stasis recovery to recover from a crash. Provenance over transactional systems (such as MySQL) may make use of any commit and replication mechanisms supported by the application.

3.5 Reusing User-Level Libraries

Provenance inspectors are implemented as user-level plugins to Story Book's FUSE daemon. Therefore, they can use libraries linked by the application to safely and efficiently extract application-specific provenance. For example, our .docx inspector is linked against the XML parser ExPat [3]. Story Book's FUSE-based design facilitates transparent provenance inspectors that require no application modification and do not force developers to port user-level functionality into the kernel [17].

3.6 Provenance Schema

Story Book models the system's provenance using basic records, process records, and extended records. To support efficient queries, Fable maintains several different Rose trees and persistent hash indexes. In this section we discuss the format of these record types and indexes.

3.6.1 Basic Records

Story Book uses basic records to record general provenance. To determine the provenance or history of an arbitrary file, Story Book must maintain a record of which processes read from and wrote to which objects. Story Book clients achieve this by recording a basic record for every create, open, close, read, and write, regardless of type. The format of a basic record is as follows:

[Figure 4 diagram: the extended record log (magic number and record data entries), the basic record Rose table keyed by inode (ino, GiD, pGid, ex#, op, LSN), the process record Rose table keyed by pid (ino|pGid, ex#, LSN), and hash tables mapping exe to ex#, ex# to exe, and file name to ino.]

Figure 4: Story Book database layout. Table keys are bold; 'H' indicates the table is a hash table.


GLOBAL ID: The ID of the process performing the operation. This ID is globally unique across reboots.

PARENT GLOBAL ID: The global ID of the parent of the process performing the operation.

INODE: The inode number of a file or object id.

EXECUTABLE ID: The global ID corresponding to the absolute path of the executable or other name associated with this process.

OPERATION: Indicates if the process performed a read, write, open, or close.

LSN: The log sequence number of the record, used for event ordering and lookup of application-specific extended records.
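The field list above can be restated compactly as a Python structure; the field names follow the paper, while the concrete types are assumptions made only for illustration.

# Restatement of the basic-record fields as a simple structure (types assumed).
from dataclasses import dataclass

@dataclass
class BasicRecord:
    global_id: int         # process ID, globally unique across reboots
    parent_global_id: int  # global ID of the parent process
    inode: int             # inode number of the file or object id
    executable_id: int     # ex#, mapped to the executable path via the hash tables of Figure 4
    operation: str         # one of: create, open, close, read, write
    lsn: int               # log sequence number; also locates the extended record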

Figure 4 illustrates how Fable stores basic records. Executable paths are stored using two hash tables that map between executable IDs and executable names ('exe' to 'ex#' and 'ex#' to 'exe'). The first is used during insertion; the second to retrieve executable names during queries. File names are handled in a similar fashion. Fable stores the remaining basic record attributes in a Rose table, sorted by inode / object id.

3.6.2 Process Records

Story Book uses process records to record parent-child relationships. When processes perform file operations, Story Book records their executable name and process ID within a basic record. However, Story Book cannot rely on every process to perform a file operation (e.g., the cat program in Figure 5), and must also record process records alongside basic records and application-specific provenance. Process records capture the existence of processes that might never access a file. The format of a process record is as follows:

[Figure 5 diagram: a process tree (P1 through P5, running programs such as indexor, indexor.thread, cat, and tee) reading and writing files F1, F2, F3, Makefiles, and index, shown alongside the basic and process record table entries used to reconstruct the graph.]

Figure 5: A provenance query. On the left is the data used to construct the provenance graph on the right. The query begins with a hash lookup on the inode of the file name whose provenance we determine (< "index", 3 >).

GLOBAL ID: The ID of the process performing the file access. This ID is globally unique across reboots.

PARENT GLOBAL ID OR FILE INODE: Contains either the global ID of the parent of this process or the inode of a file this process modified or read.

EXECUTABLE ID: The global ID corresponding to the absolute path of the executable.

LSN: The log sequence number of the record, used to establish the ordering of child process creation and file operations during provenance queries.

Before Story Book emits a basic record for a process, it first acquires all the global IDs of that process's parents by calling a specially added system call. It records a process record for every (global ID, parent global ID) pair it has not already recorded as a process record at some earlier time.
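A sketch of this deduplicated emission of parent-child pairs follows. The helper names get_ancestor_global_ids (standing in for the specially added system call) and emit_process_record (standing in for the database insert) are made up for illustration.

# Sketch of deduplicated process-record emission (helper names hypothetical).
seen_pairs = set()

def record_process_chain(global_id, get_ancestor_global_ids, emit_process_record):
    ancestors = get_ancestor_global_ids(global_id)   # [parent, grandparent, ...]
    child = global_id
    for parent in ancestors:
        pair = (child, parent)
        if pair not in seen_pairs:                   # only log each pair once
            seen_pairs.add(pair)
            emit_process_record(child_gid=child, parent_gid=parent)
        child = parent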

The exact mechanisms used to track processes vary with the application being tracked, but file system provenance is an informative example. By generating process records during file access rather than at fork and exit, file system provenance is able to avoid logging process records for processes that do not modify files. This allows it to avoid maintenance of a provenance graph in kernel RAM during operation, but is not sufficient to reflect process reparenting that occurs before a file access or a change in the executable name that occurs after the last file access.

3.6.3 Extended Records

Story Book manages application-specific provenance by having each application's provenance inspector attach additional application-specific information to a basic record. This additional information is called an extended record. Extended records are not stored in Rose, but rather are appended in chronological order to a log. This is because their format and length may vary, and extended records are not necessary to reconstruct the provenance of a file in the majority of cases. However, the extended record log cannot be automatically truncated, since it contains the only copy of the provenance information stored in the extended records. An extended record can have multiple application-specific entries or none at all. The format of an extended record is as follows:

MAGIC NUMBER: Indicates the type of the record data that is to follow. For basic or process records with no application-specific data to store, this value is 0.

RECORD DATA: Contains any application-specific data.

Each basic and process record must refer to an extended record because the offset of the extended record is used as an LSN. If this were not true, Rose would have to add another column to the basic record table: in addition to an LSN, it would need the offset into the extended record log for basic records which refer to application-specific provenance. Compressing the basic record


table with this additional column would be more difficult and would increase Story Book's overhead. Records that have no application-specific data to store need not incur an additional write, since the magic number for an empty extended record is zero and can be recorded by making a hole in the file. Reading the extended record need only be done when explicitly checking for a particular kind of application-specific provenance in a basic record.
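A minimal sketch of such a log is shown below: the byte offset of each appended record doubles as its LSN, and an empty (magic number 0) record is "written" by extending the sparse file with zeros rather than writing any data. The framing, the EMPTY_SLOT size, and the file handling are assumptions for illustration, not Fable's on-disk format.

# Sketch of an extended-record log with holes (format details assumed).
import struct

EMPTY_SLOT = 8  # assumed size reserved for an empty (magic == 0) record

def append_extended(log, magic, data=b""):
    # log is a binary file object, e.g. open("extended.log", "w+b")
    lsn = log.seek(0, 2)                # offset at the end of the log becomes the LSN
    if magic == 0:
        log.truncate(lsn + EMPTY_SLOT)  # extend with zeros: a hole, nothing is written
    else:
        log.write(struct.pack("<II", magic, len(data)) + data)
    return lsn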

3.7 Provenance Query

Story Book acquires the provenance of a file by looking up its inode in the (file name, ino) hash table. The inode is used to begin a breadth-first search in the basic and process record tables. We perform a transitive closure over the sets of processes that wrote to the given file (or other files encountered during the search), over the set of files read by these processes, and over the parents of these processes. We search in order of decreasing LSN and return results in reverse chronological order.
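The sketch below restates this traversal as a breadth-first transitive closure. Plain dictionaries stand in for Fable's hash tables and Rose indexes, and the LSN-ordered iteration is omitted for brevity; none of these names are Story Book APIs.

# Sketch of the provenance query as a breadth-first transitive closure.
from collections import deque

def provenance_query(name, name_to_ino, writers, reads, parents):
    start = name_to_ino[name]                      # hash lookup on the file name
    graph, seen = [], set()
    queue = deque([("file", start)])
    while queue:
        kind, node = queue.popleft()
        if (kind, node) in seen:
            continue
        seen.add((kind, node))
        if kind == "file":
            for pid in writers.get(node, []):      # processes that wrote this file
                graph.append((pid, "wrote", node))
                queue.append(("proc", pid))
        else:
            for ino in reads.get(node, []):        # files this process read
                graph.append((node, "read", ino))
                queue.append(("file", ino))
            for ppid in parents.get(node, []):     # parents of this process
                graph.append((ppid, "forked", node))
                queue.append(("proc", ppid))
    return graph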

3.8 Compression and Record Suppression

Rose tables provide high-performance column-wise data compression. Rose's basic record table run-length encodes all basic record columns except the LSN, which is stored as a 16-bit diff in the common case, and the opcode, which is stored as an uncompressed 8-bit value. The process record table applies run-length encoding to the process id and ex#. It attempts to store the remaining columns as 16-bit deltas against the first entry on the page. The hash tables do not support compression, but are relatively small, as they only store one value per executable name and file in the system.
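For readers unfamiliar with run-length encoding, the following is a minimal sketch of encoding a sorted column into (value, run length) pairs; Rose's actual on-page format and its 16-bit delta codes are not reproduced here.

# Minimal run-length encoding of a sorted column (illustrative only).
def rle_encode(column):
    runs = []                                  # list of [value, run_length] pairs
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

# Example: an inode-sorted column compresses to a handful of runs.
assert rle_encode([7, 7, 7, 9, 9, 12]) == [[7, 3], [9, 2], [12, 1]]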

In addition to compression, Story Book supports record suppression, which ignores multiple calls to read and write from the same process to the same file. This is important for programs that deal with large files; without record suppression, the number of (compressed) entries stored by Rose is proportional to file sizes.

Processes that wish to use record suppression need only store the first and last read and write to each file they access. This guarantees that the provenance graphs generated by Story Book's queries contain all tainting processes and files, but it loses some information regarding the number of times each dependency was established at runtime. Unlike existing approaches to record suppression, this scheme cannot create cyclic dependency graphs. Therefore, Story Book dependency graphs can never contain cycles, and no special cycle detection or removal techniques are required.
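A small sketch of this keep-first-and-last policy is shown below; records are simplified to tuples, and the real Fable insertion path is not reproduced.

# Sketch of record suppression: keep only the first and last occurrence of
# each (process, file, operation) triple. rec = (gid, ino, op, lsn).
def suppress(records):
    first, last = {}, {}
    for rec in records:
        key = rec[:3]
        first.setdefault(key, rec)
        last[key] = rec
    kept = set()
    for key in first:
        kept.add(first[key])
        kept.add(last[key])                    # same record if the op happened once
    return sorted(kept, key=lambda r: r[3])    # back in LSN order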

However, record suppression can force queries to treat suppressed operations as though they happened in a race. If this is an issue, then record suppression can be disabled or (in some circumstances) implemented using application-specific techniques that cannot lead to the detection of false races.

Note that this scheme never increases the number of records generated by a process, as we can interpret sequences such as "open, read, close" to mean a single read. Record suppression complicates recovery slightly. If the sequence "open, read, ..., crash" is encountered, it must be interpreted as: "open, begin read, ..., end read, crash." Although this scheme requires up to two entries per file accessed by each process, run-length encoding usually ensures that the second entry compresses well.

4 Evaluation

Story Book must provide fast, high-throughput writes in order to avoid impacting operating system and existing application performance. However, it must also index its provenance records so that it can provide reasonable query performance. Therefore, we focus on Story Book write throughput in Section 4.2, a traditional file system workload in Section 4.3, and a traditional database workload in Section 4.4. We measure provenance read performance in Section 4.5.

4.1 Experimental Setup

We used eight identical machines, each with a 2.8GHz Xeon CPU and 1GB of RAM, for benchmarking. Each machine was equipped with six Maxtor DiamondMax 10 7,200 RPM 250GB SATA disks and ran CentOS 5.2 with the latest updates as of September 6, 2008. For the file system workload in Section 4.3, Story Book uses a modified 2.6.25 kernel to ensure proper write ordering; for all other systems, an unmodified 2.6.18 kernel was used. To ensure a cold cache and an equivalent block layout on disk, we ran each iteration of the relevant benchmark on a newly formatted file system with as few services running as possible. For query benchmarks, database files are copied fresh to the file system each time. We ran all tests at least four times and computed 95% confidence intervals for the mean elapsed, system, user, and wait times using the Student's-t distribution. In each case, unless otherwise noted, the half widths of the intervals were less than 5% of the mean. Wait time is elapsed time less system and user time, and mostly measures time performing I/O, though it can also be affected by process scheduling. Except in Figure 6 and Figure 7, where we measure throughput and vary the C0 size explicitly, all cache sizes are consistent for all systems across all experiments. Berkeley DB's cache is set to 128MiB, and Stasis' page cache is set to 100MiB with a Rose C0 of 22MiB, allocating a total of 122MiB to Fable.

Provenance Workloads. The workload we use to evaluate our throughput is a shell script which creates a file hierarchy with an arity of 14 that is five levels


[Figure 6 plot: elapsed and user time (sec) for the bdb, rose, and waldo backends as the percentage of the tree dataset processed increases from 10% to 100%.]

Figure 6: Performance of Waldo and Story Book's insert tool on logs of different sizes.

deep, where each directory contains a single file. Separate processes copy files in blocks from parent nodes to child nodes. This script was designed to generate a dense provenance graph to provide a lower bound on query performance and to demonstrate write performance. We refer to this script as the tree dataset. With no WAL truncation, the script generates a 3.5GiB log of uncompressed provenance data in Story Book, and a 244MiB log of compressed (pruned) provenance data in PASSv2. Fable compresses provenance events on the fly, so that the database file does not exceed 400MiB in size.
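A rough Python equivalent of the tree-dataset generator is sketched below (the paper's version is a shell script, and it forks a separate process per copy). The arity and depth come from the text; the 64KiB file size is an arbitrary placeholder.

# Rough sketch of the tree dataset: arity 14, five levels deep, one file per
# directory, each child file copied from its parent's file. Illustrative only.
import os, shutil

ARITY, DEPTH = 14, 5

def build(path, level, parent_file=None):
    os.makedirs(path, exist_ok=True)
    data_file = os.path.join(path, "data")
    if parent_file is None:
        with open(data_file, "wb") as f:
            f.write(os.urandom(64 * 1024))       # placeholder file size
    else:
        shutil.copyfile(parent_file, data_file)  # the paper copies in blocks, one process per copy
    if level < DEPTH:
        for i in range(ARITY):
            build(os.path.join(path, f"d{i}"), level + 1, data_file)

build("tree_dataset", 1)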

4.2 Provenance Throughput

PASSv2 consists of two primary components: (1) an in-kernel set of hooks that log provenance to a write-ahead log, and (2) a user-level daemon called Waldo which reads records from the write-ahead log and stores them in a Berkeley DB (BDB) database. We used a pre-release version of PASSv2, which could not reliably complete I/O-intensive benchmarks. We found that their pre-release kernel was approximately 3x slower than Story Book's FUSE-based provenance file system, but expect their kernel's performance to change significantly as they stabilize their implementation. Therefore, we focus on log processing comparisons here.

For both Story Book and PASSv2, log generation was at least an order of magnitude slower than log processing. However, as we will see below, most of Story Book's overhead is due to FUSE instrumentation, not provenance handling. We expect log processing throughput to be Story Book's primary bottleneck when it is used to track provenance in systems with lower instrumentation overheads.

By default, Waldo durably commits each insertion into BDB. When performing durable inserts, Waldo is at least ten times slower than Fable. Under the assumption that it will eventually be possible for PASSv2

[Figure 7 plot: Fable throughput (MiB/sec) as the in-memory C0 size varies from 0 to 200MiB, for a 1.5GiB log and a 3GiB log.]

Figure 7: Performance of Story Book's Fable backend with varying in-memory tree component sizes.

to safely avoid these durable commits, we set BDB's TXN_NOSYNC option, which allows BDB to commit without performing synchronous disk writes.

We processed the provenance events generated by the tree dataset with Waldo and Story Book. This produced a 3.5GiB Story Book log and a 244MiB Waldo log. We then measured the amount of time it took to process these logs using Waldo, Fable, and Story Book's Berkeley DB backend.

Rose uses 100MiB for its page cache and 172MiB for its C0 tree. Both Berkeley DB and Waldo use 272MiB for their page cache size. Our optimized version of Waldo performs 3.5 times faster than Fable. However, PASSv2 performs a significant amount of preprocessing, record suppression, and graph pruning on its provenance data before writing it to the Waldo log. Fable interleaves these optimizations with provenance event handling and database operations. We are unable to measure the overhead of PASSv2's preprocessing steps, making it difficult to compare Waldo and Fable performance.

Story Book's Berkeley DB backend is 2.8 times slower than Fable. This is largely due to Fable's amortization of in-memory buffer manager requests, as user CPU time dominated Berkeley DB's performance during this test. Because Rose's LSM-trees are optimized for extremely large data sets, we expect the performance gap between Berkeley DB and Fable to increase with larger provenance workloads.

4.3 File System Performance

Story Book's write performance depends directly on the size of its in-memory C0 tree. In Figure 7 we compare the throughput of our Rose backend when processing a 1536MiB WAL and a 3072MiB WAL generated from our tree dataset. Since Rose must perform tree merges each time C0 becomes full, throughput drops significantly if we allocate less than 22MiB to C0. We set C0 to 22MiB for the remainder of the tests.

Figure 8 measures the performance of a multithreaded


[Figure 8 plot: elapsed time (sec) for Ext3, FuseFS, and TxtFS as the percentage of text files in the working set increases from 5% to 25%.]

Figure 8: Performance of Story Book compared to ext3 and FUSE for I/O-intensive file system workloads.

version of Postmark [5] on top of ext3; on top of FuseFS, a "pass-through" FUSE file system that measures Story Book's instrumentation overhead; and on top of TxtFS, a Story Book provenance file system that records diffs when .txt files are modified. When running under TxtFS, each modification to a text file causes it to be copied to an object store in the root partition. When the file is closed, a diff is computed and stored as an external provenance record. This significantly impacts performance as the number of .txt files increases.

We initialized Postmark's number of directories to 890, its number of files to 9,000, its number of transactions to 3,600, its average file size to 190KiB, and its read and write size to 28KiB. These numbers are based on the arithmetic mean of file system size from Agrawal's five-year study [1]. We scaled Agrawal's numbers down by 10, as they represented an entire set of file system modifications, but the consistent scaling across all values maintains their ratios with respect to each other. We ran this test using 30 threads, as Postmark was intended to emulate a mail server, which currently would use worker threads to do work. Because FUSE aggregates message passing across threads, both TxtFS and FuseFS perform better under concurrent workloads with respect to Ext3. We repeated each run, increasing the percentage of .txt files in the working set from 5% to 25%, while keeping all the other parameters the same (text files are the same size as the other files in this test).

FuseFS and ext3 remain constant throughout the benchmark. TxtFS's initial overhead is 9% on top of FuseFS, and 76% on top of Ext3. As the number of copies to TxtFS's object store and the number of diffs increase, TxtFS's performance worsens. At 25% .txt files, TxtFS has a 40% overhead on top of FuseFS, and is 2.2 times slower than Ext3.

                   Regular             Story Book
                   Response Time (s)   Response Time (s)
Delivery           52.6352             53.423
New Order          45.8146             48.037
Order Status       44.4822             47.4768
Payment            44.2592             46.6032
Stock Level        53.6332             55.1508
Throughput (tpmC)  201.82              196.176

Table 1: Average response time for the TPC-C benchmark.

[Figure 9 plot: elapsed, user, and system time (seconds) for batches of the first 10 through 50 provenance queries against the rose (Fable) and bdb backends.]

Figure 9: Performance of a provenance query.

4.4 MySQL TPC-C

Next, we instrumented MySQL in order to measure the performance of a user-level Story Book provenance system. This workload has significantly lower instrumentation overhead than our FUSE file system. We implemented a tool that reads MySQL's general query log and determines which transactions read from and wrote to which tables. It translates this information into Story Book provenance records by treating each operation in the general log as a process (we set the process "executable name" to the query text).
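The sketch below conveys the flavor of such a log-translation tool: scan each statement in the general query log, classify it as a read or write of the tables it names, and use the query text as the "executable name". The regular expressions are deliberately simplistic and purely illustrative; they are not the tool described in the paper.

# Hedged sketch of classifying general-query-log statements (illustrative only).
import re

WRITE = re.compile(r"^\s*(insert\s+into|update|delete\s+from)\s+`?(\w+)", re.I)
READ = re.compile(r"\bfrom\s+`?(\w+)", re.I)

def classify(query):
    m = WRITE.match(query)
    if m:
        return ("write", m.group(2), query)
    m = READ.search(query)
    if m:
        return ("read", m.group(1), query)
    return None

print(classify("INSERT INTO orders VALUES (1, 'x')"))   # ('write', 'orders', ...)
print(classify("SELECT * FROM stock WHERE id = 3"))     # ('read', 'stock', ...)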

We measured the performance impact of our provenance tracking system on a copy of TPC-C configured for 50 warehouses and 20 clients. In Table 1 we list two configurations: Regular and Story Book. Response times under Story Book increased by an average of 4.24%, while total throughput dropped by 2.80%.

4.5 Provenance Read

To evaluate Story Book's query performance, we compared the query performance of our Berkeley DB backend to that of Fable. Our BDB database uses the same table schema and indexes as Fable, except that it uses B-trees in place of Stasis' Rose indexes.

We generated our database by processing the entire tree dataset, which is designed to create dense provenance graphs. This results in a 400MiB database for


Fable, and a 905MiB database for Berkeley DB. We randomly selected a sequence of 50 files from our tree dataset, and we perform just the first 10 queries in our first data point, then the first 20 queries, and so on.

In both cases, the data set fits in RAM, so both systems spend the majority of their time in user time. As expected, runtime increases linearly with the number of queries for both systems. Fable is 3.8 times slower than Berkeley DB in this experiment.

Provenance queries on the tree dataset result in a large number of point queries because our database is indexed by inode, and each inode only has a few entries. Although Rose provides very fast tree iterators, it must check multiple tree components per point query. In contrast, Berkeley DB must only consult a single B-tree per index lookup. Rose was originally intended for use in versioning systems, where queries against recently modified data can be serviced by looking at a single tree component. Our provenance queries must obtain all index entries, and cannot make use of this optimization.

5 Conclusions

Our implementation of Story Book focuses on minimizing the amount of state held in RAM and on simplifying the design of provenance systems. It is able to either record exact provenance information or to apply a straightforward record-suppression scheme. Although our Berkeley DB based backend provides superior query performance, the Fable backend provides superior write throughput, and should scale gracefully to extremely large data sets.

Story Book's support for system-specific durability protocols allows us to apply Story Book to a wide range of applications, including MySQL. It also allows us to use Valor to avoid synchronous log writes, removing the primary bottleneck of existing systems.

Story Book's application development model and the simplicity of our provenance data model make it easy to add support for new types of provenance systems, and to store additional information on a per-file basis.

Story Book's primary overhead comes from its FUSE file system instrumentation layer. Our experiments show that Story Book's underlying provenance database provides much greater throughput than necessary for most file system provenance workloads. This, coupled with Story Book's extensibility, allows it to be applied in more demanding environments such as databases and other multi-user applications.

Acknowledgments. We would like to thank the reviewers for their comments and advice. We would also like to thank Leif Walsh and Karthikeyan Srinaivasan for their extensive help in running and analyzing experiments for the camera-ready copy.

References

[1] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch. A five-year study of file-system metadata. In FAST '07: Proc. of the 5th USENIX Conference on File and Storage Technologies, pp. 3–3, Berkeley, CA, USA, 2007.

[2] Dublin Core Metadata Initiative. Dublin Core. dublincore.org, 2008.

[3] J. Clark et al. ExPat. expat.sf.net, Dec. 2008.

[4] C. Jermaine, E. Omiecinski, and W. G. Yee. The partitioned exponential file for database storage management. The VLDB Journal, 16(4):417–437, 2007.

[5] J. Katcher. PostMark: A new filesystem benchmark. Technical Report TR3022, Network Appliance, 1997. www.netapp.com/tech_library/3022.html.

[6] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, Jul. 1978.

[7] L. Lamport. Paxos made simple. SIGACT News, 2001.

[8] D. Morozhnikov. FUSE ISO File System, Jan. 2006. http://fuse.sf.net/wiki/index.php/FuseIso.

[9] K. Muniswamy-Reddy, J. Barillari, U. Braun, D. A. Holland, D. Maclean, M. Seltzer, and S. D. Holland. Layering in provenance-aware storage systems. Technical Report TR-04-08, Harvard University Computer Science, 2008.

[10] E. B. Nightingale, K. Veeraraghavan, P. M. Chen, and J. Flinn. Rethink the sync. In Proc. of the 7th Symposium on Operating Systems Design and Implementation, pp. 1–14, Seattle, WA, Nov. 2006.

[11] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.

[12] R. Sears and E. Brewer. Stasis: Flexible transactional storage. In Proc. of the 7th Symposium on Operating Systems Design and Implementation, Seattle, WA, Nov. 2006.

[13] R. Sears, M. Callaghan, and E. Brewer. Rose: Compressed, log-structured replication. In Proc. of the VLDB Endowment, volume 1, Auckland, New Zealand, 2008.

[14] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. ACM SIGMOD Record, 34(3):31–36, Sept. 2005.

[15] Sleepycat Software, Inc. Berkeley DB Reference Guide, 4.3.27 edition, Dec. 2004. www.oracle.com/technology/documentation/berkeley-db/db/api_c/frame.html.

[16] R. P. Spillane, S. Gaikwad, E. Zadok, C. P. Wright, and M. Chinni. Enabling transactional file access via lightweight kernel extensions. In Proc. of the Seventh USENIX Conference on File and Storage Technologies, San Francisco, CA, Feb. 2009.

[17] R. P. Spillane, C. P. Wright, G. Sivathanu, and E. Zadok. Rapid file system development using ptrace. In Proc. of the Workshop on Experimental Computer Science, in conjunction with ACM FCRC, Article No. 22, San Diego, CA, Jun. 2007.

[18] Sun Microsystems. MySQL. www.mysql.com, Dec. 2008.

[19] W3C. Resource Description Framework. w3.org/RDF, 2008.

[20] J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In Proc. of the 2005 CIDR Conference, 2005.