
This paper is included in the Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST '14).

February 17-20, 2014, Santa Clara, CA, USA
ISBN 978-1-931971-08-9


Log-structured Memory for DRAM-based Storage
Stephen M. Rumble, Ankita Kejriwal, and John Ousterhout, Stanford University

    https://www.usenix.org/conference/fast14/technical-sessions/presentation/rumble


Log-structured Memory for DRAM-based Storage

Stephen M. Rumble, Ankita Kejriwal, and John Ousterhout

Stanford University
{rumble, ankitak, ouster}@cs.stanford.edu

    Abstract

Traditional memory allocation mechanisms are not suitable for new DRAM-based storage systems because they use memory inefficiently, particularly under changing access patterns. In contrast, a log-structured approach to memory management allows 80-90% memory utilization while offering high performance. The RAMCloud storage system implements a unified log-structured mechanism both for active information in memory and backup data on disk. The RAMCloud implementation of log-structured memory uses a two-level cleaning policy, which conserves disk bandwidth and improves performance up to 6x at high memory utilization. The cleaner runs concurrently with normal operations and employs multiple threads to hide most of the cost of cleaning.

1 Introduction

In recent years a new class of storage systems has arisen in which all data is stored in DRAM. Examples include memcached [2], Redis [3], RAMCloud [30], and Spark [38]. Because of the relatively high cost of DRAM, it is important for these systems to use their memory efficiently. Unfortunately, efficient memory usage is not possible with existing general-purpose storage allocators: they can easily waste half or more of memory, particularly in the face of changing access patterns.

In this paper we show how a log-structured approach to memory management (treating memory as a sequentially-written log) supports memory utilizations of 80-90% while providing high performance. In comparison to non-copying allocators such as malloc, the log-structured approach allows data to be copied to eliminate fragmentation. Copying allows the system to make a fundamental space-time trade-off: for the price of additional CPU cycles and memory bandwidth, copying allows for more efficient use of storage space in DRAM. In comparison to copying garbage collectors, which eventually require a global scan of all data, the log-structured approach provides garbage collection that is more incremental. This results in more efficient collection, which enables higher memory utilization.

We have implemented log-structured memory in the RAMCloud storage system, using a unified approach that handles both information in memory and backup replicas stored on disk or flash memory. The overall architecture is similar to that of a log-structured file system [32], but with several novel aspects:

• In contrast to log-structured file systems, log-structured memory is simpler because it stores very little metadata in the log. The only metadata consists of log digests to enable log reassembly after crashes, and tombstones to prevent the resurrection of deleted objects.

• RAMCloud uses a two-level approach to cleaning, with different policies for cleaning data in memory versus secondary storage. This maximizes DRAM utilization while minimizing disk and network bandwidth usage.

• Since log data is immutable once appended, the log cleaner can run concurrently with normal read and write operations. Furthermore, multiple cleaners can run in separate threads. As a result, parallel cleaning hides most of the cost of garbage collection.

Performance measurements of log-structured memory in RAMCloud show that it enables high client throughput at 80-90% memory utilization, even with artificially stressful workloads. In the most stressful workload, a single RAMCloud server can support 270,000-410,000 durable 100-byte writes per second at 90% memory utilization. The two-level approach to cleaning improves performance by up to 6x over a single-level approach at high memory utilization, and reduces disk bandwidth overhead by 7-87x for medium-sized objects (1 to 10 KB). Parallel cleaning effectively hides the cost of cleaning: an active cleaner adds only about 2% to the latency of typical client write requests.

2 Why Not Use Malloc?

An off-the-shelf memory allocator such as the C library's malloc function might seem like a natural choice for an in-memory storage system. However, existing allocators are not able to use memory efficiently, particularly in the face of changing access patterns. We measured a variety of allocators under synthetic workloads and found that all of them waste at least 50% of memory under conditions that seem plausible for a storage system.

Memory allocators fall into two general classes: non-copying allocators and copying allocators. Non-copying allocators such as malloc cannot move an object once it has been allocated, so they are vulnerable to fragmentation. Non-copying allocators work well for individual applications with a consistent distribution of object sizes, but Figure 1 shows that they can easily waste half of memory when allocation patterns change. For example, every allocator we measured performed poorly when 10 GB of small objects were mostly deleted, then replaced with 10 GB of much larger objects.

Changes in size distributions may be rare in individual applications, but they are more likely in storage systems that serve many applications over a long period of time. Such shifts can be caused by changes in the set of applications using the system (adding new ones and/or removing old ones), by changes in application phases (switching from map to reduce), or by application upgrades that increase the size of common records (to include additional fields for new features). For example, workload W2 in Figure 1 models the case where the records of a table are expanded from 100 bytes to 130 bytes. Facebook encountered distribution changes like this in its memcached storage systems and was forced to introduce special-purpose cache eviction code for specific situations [28]. Non-copying allocators will work well in many cases, but they are unstable: a small application change could dramatically change the efficiency of the storage system. Unless excess memory is retained to handle the worst-case change, an application could suddenly find itself unable to make progress.

[Figure 1 bar chart (omitted): GB of memory used by each allocator under workloads W1-W8, compared with the amount of live data.]

Figure 1: Total memory needed by allocators to support 10 GB of live data under the changing workloads described in Table 1 (average of 5 runs). "Live" indicates the amount of live data, and represents an optimal result. "glibc" is the allocator typically used by C and C++ applications on Linux. Hoard [10], jemalloc [19], and tcmalloc [1] are non-copying allocators designed for speed and multiprocessor scalability. "Memcached" is the slab-based allocator used in the memcached [2] object caching system. "Java" is the JVM's default parallel scavenging collector with no maximum heap size restriction (it ran out of memory if given less than 16 GB of total space). "Boehm GC" is a non-copying garbage collector for C and C++. Hoard could not complete the W8 workload (it overburdened the kernel by mmap'ing each large allocation separately).

Workload  Before                      Delete  After
W1        Fixed 100 Bytes             N/A     N/A
W2        Fixed 100 Bytes             0%      Fixed 130 Bytes
W3        Fixed 100 Bytes             90%     Fixed 130 Bytes
W4        Uniform 100 - 150 Bytes     0%      Uniform 200 - 250 Bytes
W5        Uniform 100 - 150 Bytes     90%     Uniform 200 - 250 Bytes
W6        Uniform 100 - 200 Bytes     50%     Uniform 1,000 - 2,000 Bytes
W7        Uniform 1,000 - 2,000 Bytes 90%     Uniform 1,500 - 2,500 Bytes
W8        Uniform 50 - 150 Bytes      90%     Uniform 5,000 - 15,000 Bytes

Table 1: Summary of workloads used in Figure 1. The workloads were not intended to be representative of actual application behavior, but rather to illustrate plausible workload changes that might occur in a shared storage system. Each workload consists of three phases. First, the workload allocates 50 GB of memory using objects from a particular size distribution; it deletes existing objects at random in order to keep the amount of live data from exceeding 10 GB. In the second phase the workload deletes a fraction of the existing objects at random. The third phase is identical to the first except that it uses a different size distribution (objects from the new distribution gradually displace those from the old distribution). Two size distributions were used: "Fixed" means all objects had the same size, and "Uniform" means objects were chosen uniformly at random over a range (non-uniform distributions yielded similar results). All workloads were single-threaded and ran on a Xeon E5-2670 system with Linux 2.6.32.

The second class of memory allocators consists of those that can move objects after they have been created, such as copying garbage collectors. In principle, garbage collectors can solve the fragmentation problem by moving live data to coalesce free heap space. However, this comes with a trade-off: at some point all of these collectors (even those that label themselves as incremental) must walk all live data, relocate it, and update references. This is an expensive operation that scales poorly, so garbage collectors delay global collections until a large amount of garbage has accumulated. As a result, they typically require 1.5-5x as much space as is actually used in order to maintain high performance [39, 23]. This erases any space savings gained by defragmenting memory.

Pause times are another concern with copying garbage collectors. At some point all collectors must halt the process's threads to update references when objects are moved. Although there has been considerable work on real-time garbage collectors, even state-of-the-art solutions have maximum pause times of hundreds of microseconds, or even milliseconds [8, 13, 36]; this is 100 to 1,000 times longer than the round-trip time for a RAMCloud RPC. All of the standard Java collectors we measured exhibited pauses of 3 to 4 seconds by default (2-4 times longer than it takes RAMCloud to detect a failed server and reconstitute 64 GB of lost data [29]). We experimented with features of the JVM collectors that reduce pause times, but memory consumption increased by an additional 30% and we still experienced occasional pauses of one second or more.

An ideal memory allocator for a DRAM-based storage system such as RAMCloud should have two properties. First, it must be able to copy objects in order to eliminate fragmentation. Second, it must not require a global scan of memory: instead, it must be able to perform the copying incrementally, garbage collecting small regions of memory independently with cost proportional to the size of a region. Among other advantages, the incremental approach allows the garbage collector to focus on regions with the most free space. In the rest of this paper we will show how a log-structured approach to memory management achieves these properties.

In order for incremental garbage collection to work, it must be possible to find the pointers to an object without scanning all of memory. Fortunately, storage systems typically have this property: pointers are confined to index structures where they can be located easily. Traditional storage allocators work in a harsher environment where the allocator has no control over pointers; the log-structured approach could not work in such environments.

3 RAMCloud Overview

Our need for a memory allocator arose in the context of RAMCloud. This section summarizes the features of RAMCloud that relate to its mechanisms for storage management, and motivates why we used log-structured memory instead of a traditional allocator.

RAMCloud is a storage system that stores data in the DRAM of hundreds or thousands of servers within a datacenter, as shown in Figure 2. It takes advantage of low-latency networks to offer remote read times of 5 µs and write times of 16 µs (for small objects). Each storage server contains two components. A master module manages the main memory of the server to store RAMCloud objects; it handles read and write requests from clients. A backup module uses local disk or flash memory to store backup copies of data owned by masters on other servers. The masters and backups are managed by a central coordinator that handles configuration-related issues such as cluster membership and the distribution of data among the servers. The coordinator is not normally involved in common operations such as reads and writes. All RAMCloud data is present in DRAM at all times; secondary storage is used only to hold duplicate copies for crash recovery.

RAMCloud provides a simple key-value data model consisting of uninterpreted data blobs called objects that are named by variable-length keys. Objects are grouped into tables that may span one or more servers in the cluster. Objects must be read or written in their entirety. RAMCloud is optimized for small objects (a few hundred bytes or less) but supports objects up to 1 MB.

Each master's memory contains a collection of objects stored in DRAM and a hash table (see Figure 3). The hash table contains one entry for each object stored on that master; it allows any object to be located quickly, given its table and key. Each live object has exactly one pointer, which is stored in its hash table entry.

[Figure 2 diagram (omitted): clients connect over the datacenter network to a coordinator and to storage servers, each containing a master and a backup with local disk.]

Figure 2: RAMCloud cluster architecture.

[Figure 3 diagram (omitted): the master's hash table points into segments of log-structured memory; each segment is buffered and written to disk on multiple backups.]

Figure 3: Master servers consist primarily of a hash table and an in-memory log, which is replicated across several backups for durability.

In order to ensure data durability in the face of server crashes and power failures, each master must keep backup copies of its objects on the secondary storage of other servers. The backup data is organized as a log for maximum efficiency. Each master has its own log, which is divided into 8 MB pieces called segments. Each segment is replicated on several backups (typically two or three). A master uses a different set of backups to replicate each segment, so that its segment replicas end up scattered across the entire cluster.

When a master receives a write request from a client, it adds the new object to its memory, then forwards information about that object to the backups for its current head segment. The backups append the new object to segment replicas stored in nonvolatile buffers; they respond to the master as soon as the object has been copied into their buffer, without issuing an I/O to secondary storage (backups must ensure that data in buffers can survive power failures). Once the master has received replies from all the backups, it responds to the client. Each backup accumulates data in its buffer until the segment is complete. At that point it writes the segment to secondary storage and reallocates the buffer for another segment. This approach has two performance advantages: writes complete without waiting for I/O to secondary storage, and backups use secondary storage bandwidth efficiently by performing I/O in large blocks, even if objects are small.
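To make the write path concrete, here is a minimal single-process sketch of the protocol described above. It is not RAMCloud's actual code or API; the class names (Master, Backup) and methods are invented for illustration, and replication is modeled as direct function calls rather than RPCs.

```cpp
#include <string>
#include <vector>

// Minimal single-process model of the durable write path described above.
// These classes are illustrative stand-ins, not RAMCloud's real API; real
// backups are separate servers reached via RPC.
struct Backup {
    std::vector<std::string> buffer;   // nonvolatile buffer for the current segment replica

    // Append the object to the buffered replica and acknowledge immediately;
    // no disk I/O happens until the whole segment has accumulated.
    void appendToBuffer(const std::string& object) { buffer.push_back(object); }

    void flushCompletedSegment() {
        // ... write the full 8 MB replica to disk in one large I/O ...
        buffer.clear();                // then reuse the buffer for the next segment
    }
};

struct Master {
    std::vector<std::string> headSegment;   // in-memory head segment of the log
    std::vector<Backup*> headReplicas;      // backups holding replicas of the head segment

    // Returns only after the object is in memory and buffered on every backup.
    void write(const std::string& object) {
        headSegment.push_back(object);        // 1. append to the in-memory log
        for (Backup* b : headReplicas)
            b->appendToBuffer(object);        // 2. forward to each backup's buffer
        // 3. reply to the client here; no synchronous disk I/O was required
    }
};
```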


RAMCloud could have used a traditional storage allocator for the objects stored in a master's memory, but we chose instead to use the same log structure in DRAM that is used on disk. Thus a master's object storage consists of 8 MB segments that are identical to those on secondary storage. This approach has three advantages. First, it avoids the allocation inefficiencies described in Section 2. Second, it simplifies RAMCloud by using a single unified mechanism for information both in memory and on disk. Third, it saves memory: in order to perform log cleaning (described below), the master must enumerate all of the objects in a segment; if objects were stored in separately allocated areas, they would need to be linked together by segment, which would add an extra 8-byte pointer per object (an 8% memory overhead for 100-byte objects).

The segment replicas stored on backups are never read during normal operation; most are deleted before they have ever been read. Backup replicas are only read during crash recovery (for details, see [29]). Data is never read from secondary storage in small chunks; the only read operation is to read a master's entire log.

RAMCloud uses a log cleaner to reclaim free space that accumulates in the logs when objects are deleted or overwritten. Each master runs a separate cleaner, using a basic mechanism similar to that of LFS [32]:

• The cleaner selects several segments to clean, using the same cost-benefit approach as LFS (segments are chosen for cleaning based on the amount of free space and the age of the data).

• For each of these segments, the cleaner scans the segment stored in memory and copies any live objects to new survivor segments. Liveness is determined by checking for a reference to the object in the hash table. The live objects are sorted by age to improve the efficiency of cleaning in the future. Unlike LFS, RAMCloud need not read objects from secondary storage during cleaning.

• The cleaner makes the old segments' memory available for new segments, and it notifies the backups for those segments that they can reclaim the replicas' storage.
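The following sketch strings these three steps together. It assumes simplified types (fixed 64-bit keys, a single survivor segment) and uses the classic LFS cost-benefit score; RAMCloud's real cleaner differs in its data structures and exact selection policy.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative types only; not RAMCloud's real classes.
struct Object  { uint64_t key; uint64_t age; /* value bytes ... */ };
struct Segment { std::vector<Object> entries; double freeFraction; double age; };

// key -> address of the object's current (live) copy in the in-memory log.
using HashTable = std::unordered_map<uint64_t, const Object*>;

// LFS-style cost-benefit score: favor segments with more free space and older data.
double costBenefit(const Segment& s) {
    double u = 1.0 - s.freeFraction;              // utilization
    return (s.freeFraction * s.age) / (1.0 + u);  // benefit / cost
}

void cleanPass(std::vector<Segment>& candidates, HashTable& hashTable,
               std::vector<Segment>& survivors) {
    // 1. Rank candidate segments; a real cleaner would clean only the top few.
    std::sort(candidates.begin(), candidates.end(),
              [](const Segment& a, const Segment& b) {
                  return costBenefit(a) > costBenefit(b);
              });

    // 2. Gather live objects (those the hash table still points at), sort them
    //    by age, and copy them into a survivor segment, updating the hash table.
    std::vector<const Object*> live;
    for (const Segment& seg : candidates)
        for (const Object& obj : seg.entries)
            if (auto it = hashTable.find(obj.key);
                it != hashTable.end() && it->second == &obj)
                live.push_back(&obj);
    std::sort(live.begin(), live.end(),
              [](const Object* a, const Object* b) { return a->age < b->age; });

    survivors.push_back(Segment{});
    survivors.back().entries.reserve(live.size());   // keep pointers stable below
    for (const Object* obj : live) {
        survivors.back().entries.push_back(*obj);                 // copy the object
        hashTable[obj->key] = &survivors.back().entries.back();   // redirect reference
    }

    // 3. The cleaned segments' memory can now be reused, and backups can be told
    //    to discard their replicas (Sections 6.3 and 6.4 cover when that is safe).
}
```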

The logging approach meets the goals from Section 2: it copies data to eliminate fragmentation, and it operates incrementally, cleaning a few segments at a time. However, it introduces two additional issues. First, the log must contain metadata in addition to objects, in order to ensure safe crash recovery; this issue is addressed in Section 4. Second, log cleaning can be quite expensive at high memory utilization [34, 35]. RAMCloud uses two techniques to reduce the impact of log cleaning: two-level cleaning (Section 5) and parallel cleaning with multiple threads (Section 6).

4 Log Metadata

In log-structured file systems, the log contains a lot of indexing information in order to provide fast random access to data in the log. In contrast, RAMCloud has a separate hash table that provides fast access to information in memory. The on-disk log is never read during normal use; it is used only during recovery, at which point it is read in its entirety. As a result, RAMCloud requires only three kinds of metadata in its log, which are described below.

First, each object in the log must be self-identifying: it contains the table identifier, key, and version number for the object in addition to its value. When the log is scanned during crash recovery, this information allows RAMCloud to identify the most recent version of an object and reconstruct the hash table.

Second, each new log segment contains a log digest that describes the entire log. Every segment has a unique identifier, and the log digest is a list of identifiers for all the segments that currently belong to the log. Log digests avoid the need for a central repository of log information (which would create a scalability bottleneck and introduce other crash recovery problems). To replay a crashed master's log, RAMCloud locates the latest digest and loads each segment enumerated in it (see [29] for details).

The third kind of log metadata is tombstones that identify deleted objects. When an object is deleted or modified, RAMCloud does not modify the object's existing record in the log. Instead, it appends a tombstone record to the log. The tombstone contains the table identifier, key, and version number for the object that was deleted. Tombstones are ignored during normal operation, but they distinguish live objects from dead ones during crash recovery. Without tombstones, deleted objects would come back to life when logs are replayed during crash recovery.

Tombstones have proven to be a mixed blessing in RAMCloud: they provide a simple mechanism to prevent object resurrection, but they introduce additional problems of their own. One problem is tombstone garbage collection. Tombstones must eventually be removed from the log, but this is only safe if the corresponding objects have been cleaned (so they will never be seen during crash recovery). To enable tombstone deletion, each tombstone includes the identifier of the segment containing the obsolete object. When the cleaner encounters a tombstone in the log, it checks the segment referenced in the tombstone. If that segment is no longer part of the log, then it must have been cleaned, so the old object no longer exists and the tombstone can be deleted. If the segment still exists in the log, then the tombstone must be preserved.
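A rough sketch of the three metadata record types and the tombstone-retention check might look like this (field names and layouts are illustrative, not RAMCloud's actual on-log formats):

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Sketch of the three kinds of log metadata. Field names and layouts are
// illustrative; RAMCloud's actual on-log record formats differ in detail.
struct ObjectRecord {                    // self-identifying object
    uint64_t tableId;
    uint64_t version;
    std::vector<uint8_t> key;            // variable-length key
    std::vector<uint8_t> value;
};

struct LogDigest {                       // written in each new head segment
    std::vector<uint64_t> segmentIds;    // every segment currently in the log
};

struct Tombstone {                       // records a deletion or overwrite
    uint64_t tableId;
    uint64_t objectVersion;
    std::vector<uint8_t> key;
    uint64_t segmentId;                  // segment that held the now-obsolete object
};

// A tombstone must be kept as long as the segment containing the dead object
// is still part of the log; once that segment has been cleaned, the tombstone
// can safely be dropped.
bool tombstoneStillNeeded(const Tombstone& t,
                          const std::unordered_set<uint64_t>& segmentsInLog) {
    return segmentsInLog.count(t.segmentId) != 0;
}
```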

5 Two-level Cleaning

Almost all of the overhead for log-structured memory is due to cleaning. Allocating new storage is trivial; new objects are simply appended at the end of the head segment. However, reclaiming free space is much more expensive. It requires running the log cleaner, which must copy live data out of the segments it chooses for cleaning, as described in Section 3. Unfortunately, the cost of log cleaning rises rapidly as memory utilization increases. For example, if segments are cleaned when 80% of their data are still live, the cleaner must copy 8 bytes of live data for every 2 bytes it frees. At 90% utilization, the cleaner must copy 9 bytes of live data for every 1 byte freed. Eventually the system will run out of bandwidth and write throughput will be limited by the speed of the cleaner. Techniques like cost-benefit segment selection [32] help by skewing the distribution of free space, so that segments chosen for cleaning have lower utilization than the overall average, but they cannot eliminate the fundamental tradeoff between utilization and cleaning cost. Any copying storage allocator will suffer from intolerable overheads as utilization approaches 100%.
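The blow-up described here follows directly from the utilization: if cleaned segments are still a fraction u live, the cleaner copies u/(1 - u) bytes for every byte it frees. A tiny helper makes the trend explicit (purely illustrative):

```cpp
#include <cstdio>

// Bytes of live data copied per byte of space reclaimed, when cleaned segments
// are still a fraction `u` live (0 <= u < 1).
double copyCostPerByteFreed(double u) { return u / (1.0 - u); }

int main() {
    // 80% live -> 4 bytes copied per byte freed; 90% -> 9; 98% -> 49.
    std::printf("%.0f %.0f %.0f\n", copyCostPerByteFreed(0.80),
                copyCostPerByteFreed(0.90), copyCostPerByteFreed(0.98));
}
```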

Originally, disk and memory cleaning were tied together in RAMCloud: cleaning was first performed on segments in memory, then the results were reflected to the backup copies on disk. This made it impossible to achieve both high memory utilization and high write throughput. For example, if we used memory at high utilization (80-90%), write throughput would be severely limited by the cleaner's usage of disk bandwidth (see Section 8). On the other hand, we could have improved write bandwidth by increasing the size of the disk log to reduce its average utilization. For example, at 50% disk utilization we could achieve high write throughput. Furthermore, disks are cheap enough that the cost of the extra space would not be significant. However, disk and memory were fundamentally tied together: if we reduced the utilization of disk space, we would also have reduced the utilization of DRAM, which was unacceptable.

The solution is to clean the disk and memory logs independently; we call this two-level cleaning. With two-level cleaning, memory can be cleaned without reflecting the updates on backups. As a result, memory can have higher utilization than disk. The cleaning cost for memory will be high, but DRAM can easily provide the bandwidth required to clean at 90% utilization or higher. Disk cleaning happens less often. The disk log becomes larger than the in-memory log, so it has lower overall utilization, and this reduces the bandwidth required for cleaning.

The first level of cleaning, called segment compaction, operates only on the in-memory segments on masters and consumes no network or disk I/O. It compacts a single segment at a time, copying its live data into a smaller region of memory and freeing the original storage for new segments. Segment compaction maintains the same logical log in memory and on disk: each segment in memory still has a corresponding segment on disk. However, the segment in memory takes less space because deleted objects and obsolete tombstones were removed (Figure 4).

The second level of cleaning is just the mechanism described in Section 3. We call this combined cleaning because it cleans both disk and memory together. Segment compaction makes combined cleaning more efficient by postponing it. The effect of cleaning a segment later is that more objects have been deleted, so the segment's utilization will be lower. The result is that when combined cleaning does happen, less bandwidth is required to reclaim the same amount of free space. For example, if the disk log is allowed to grow until it consumes twice as much space as the log in memory, the utilization of segments cleaned on disk will never be greater than 50%, which makes cleaning relatively efficient.

[Figure 4 diagram (omitted): compacted and uncompacted segments in memory alongside the corresponding full-sized segments on backups.]

Figure 4: Compacted segments in memory have variable length because unneeded objects and tombstones have been removed, but the corresponding segments on disk remain full-size. As a result, the utilization of memory is higher than that of disk, and disk can be cleaned more efficiently.

Two-level cleaning leverages the strengths of memory and disk to compensate for their weaknesses. For memory, space is precious but bandwidth for cleaning is plentiful, so we use extra bandwidth to enable higher utilization. For disk, space is plentiful but bandwidth is precious, so we use extra space to save bandwidth.

    5.1 Seglets

In the absence of segment compaction, all segments are the same size, which makes memory management simple. With compaction, however, segments in memory can have different sizes. One possible solution is to use a standard heap allocator to allocate segments, but this would result in the fragmentation problems described in Section 2. Instead, each RAMCloud master divides its log memory into fixed-size 64 KB seglets. A segment consists of a collection of seglets, and the number of seglets varies with the size of the segment. Because seglets are fixed-size, they introduce a small amount of internal fragmentation (one-half seglet for each segment, on average). In practice, fragmentation should be less than 1% of memory space, since we expect compacted segments to average at least half the length of a full-size segment. In addition, seglets require extra mechanism to handle log entries that span discontiguous seglets (before seglets, log entries were always contiguous).
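The sub-1% fragmentation claim can be checked with simple arithmetic, assuming the 8 MB segments and 64 KB seglets described above (the helper below is only an illustration):

```cpp
#include <cstdio>

// Expected internal fragmentation from fixed-size seglets: on average the last
// seglet of a segment is half empty, so the waste is half a seglet per segment.
double segletFragmentationFraction(double segletKB, double avgSegmentKB) {
    return (segletKB / 2.0) / avgSegmentKB;
}

int main() {
    // With 64 KB seglets and compacted segments averaging at least half of a
    // full 8 MB segment (4096 KB), waste is 32/4096, i.e., under 1% of memory.
    std::printf("%.2f%%\n", 100.0 * segletFragmentationFraction(64.0, 4096.0));
}
```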

    5.2 When to Clean on Disk?

Two-level cleaning introduces a new policy question: when should the system choose memory compaction over combined cleaning, and vice-versa? This choice has an important impact on system performance because combined cleaning consumes precious disk and network I/O resources. However, as we explain below, memory compaction is not always more efficient. This section explains how these considerations resulted in RAMCloud's current policy module; we refer to it as the balancer. For a more complete discussion of the balancer, see [33].

There is no point in running either cleaner until the system is running low on memory or disk space. The reason is that cleaning early is never cheaper than cleaning later on. The longer the system delays cleaning, the more time it has to accumulate dead objects, which lowers the fraction of live data in segments and makes them less expensive to clean.

The balancer determines that memory is running low as follows. Let L be the fraction of all memory occupied by live objects and F be the fraction of memory in unallocated seglets. One of the cleaners will run whenever F ≤ min(0.1, (1 - L)/2). In other words, cleaning occurs if the unallocated seglet pool has dropped to less than 10% of memory and at least half of the free memory is in active segments (vs. unallocated seglets). This formula represents a tradeoff: on the one hand, it delays cleaning to make it more efficient; on the other hand, it starts cleaning soon enough for the cleaner to collect free memory before the system runs out of unallocated seglets.

Given that the cleaner must run, the balancer must choose which cleaner to use. In general, compaction is preferred because it is more efficient, but there are two cases in which the balancer must choose combined cleaning. The first is when too many tombstones have accumulated. The problem with tombstones is that memory compaction alone cannot remove them: the combined cleaner must first remove dead objects from disk before their tombstones can be erased. As live tombstones pile up, segment utilizations increase and compaction becomes more and more expensive. Eventually, tombstones would eat up all free memory. Combined cleaning ensures that tombstones do not exhaust memory and makes future compactions more efficient.

The balancer detects tombstone accumulation as follows. Let T be the fraction of memory occupied by live tombstones, and L be the fraction of live objects (as above). Too many tombstones have accumulated once T/(1 - L) ≥ 40%. In other words, there are too many tombstones when they account for 40% of the freeable space in a master (1 - L; i.e., all tombstones and dead objects). The 40% value was chosen empirically based on measurements of different workloads, object sizes, and amounts of available disk bandwidth. This policy tends to run the combined cleaner more frequently under workloads that make heavy use of small objects (tombstone space accumulates more quickly as a fraction of freeable space, because tombstones are nearly as large as the objects they delete).

The second reason the combined cleaner must run is to bound the growth of the on-disk log. The size must be limited both to avoid running out of disk space and to keep crash recovery fast (since the entire log must be replayed, its size directly affects recovery speed). RAMCloud implements a configurable disk expansion factor that sets the maximum on-disk log size as a multiple of the in-memory log size. The combined cleaner runs when the on-disk log size exceeds 90% of this limit.

Finally, the balancer chooses memory compaction when unallocated memory is low and combined cleaning is not needed (disk space is not low and tombstones have not accumulated yet).
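Taken together, the balancer's rules can be summarized in a short decision function. The sketch below uses the thresholds quoted in this section; the function and parameter names are invented for the example and do not correspond to RAMCloud's code.

```cpp
#include <algorithm>

enum class CleanerAction { None, Compaction, CombinedCleaning };

// Inputs are all fractions in [0, 1]:
//   liveFraction        (L) - fraction of memory occupied by live objects
//   unallocatedFraction (F) - fraction of memory in unallocated seglets
//   tombstoneFraction   (T) - fraction of memory occupied by live tombstones
//   diskLogFraction         - on-disk log size as a fraction of its configured limit
CleanerAction chooseCleaner(double liveFraction, double unallocatedFraction,
                            double tombstoneFraction, double diskLogFraction) {
    double freeable = 1.0 - liveFraction;          // tombstones + dead objects

    // Memory is low when F <= min(0.1, (1 - L) / 2).
    bool memoryLow = unallocatedFraction <= std::min(0.1, freeable / 2.0);

    // Tombstones dominate once T / (1 - L) >= 40%.
    bool tooManyTombstones = freeable > 0.0 && (tombstoneFraction / freeable) >= 0.4;

    // The on-disk log must not exceed 90% of its configured size limit.
    if (diskLogFraction >= 0.9) return CleanerAction::CombinedCleaning;
    if (memoryLow)
        return tooManyTombstones ? CleanerAction::CombinedCleaning
                                 : CleanerAction::Compaction;
    return CleanerAction::None;
}
```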

6 Parallel Cleaning

Two-level cleaning reduces the cost of combined cleaning, but it adds a significant new cost in the form of segment compaction. Fortunately, the cost of cleaning can be hidden by performing both combined cleaning and segment compaction concurrently with normal read and write requests. RAMCloud employs multiple cleaner threads simultaneously to take advantage of multi-core CPUs.

Parallel cleaning in RAMCloud is greatly simplified by the use of a log structure and simple metadata. For example, since segments are immutable after they are created, the cleaner need not worry about objects being modified while the cleaner is copying them. Furthermore, the hash table provides a simple way of redirecting references to objects that are relocated by the cleaner (all objects are accessed indirectly through it). This means that the basic cleaning mechanism is very straightforward: the cleaner copies live data to new segments, atomically updates references in the hash table, and frees the cleaned segments.

There are three points of contention between cleaner threads and service threads handling read and write requests. First, both cleaner and service threads need to add data at the head of the log. Second, the threads may conflict in updates to the hash table. Third, the cleaner must not free segments that are still in use by service threads. These issues and their solutions are discussed in the subsections below.

    6.1 Concurrent Log Updates

The most obvious way to perform cleaning is to copy the live data to the head of the log. Unfortunately, this would create contention for the log head between cleaner threads and service threads that are writing new data.

RAMCloud's solution is for the cleaner to write survivor data to different segments than the log head. Each cleaner thread allocates a separate set of segments for its survivor data. Synchronization is required when allocating segments, but once segments are allocated, each cleaner thread can copy data to its own survivor segments without additional synchronization. Meanwhile, request-processing threads can write new data to the log head. Once a cleaner thread finishes a cleaning pass, it arranges for its survivor segments to be included in the next log digest, which inserts them into the log; it also arranges for the cleaned segments to be dropped from the next digest.

Using separate segments for survivor data has the additional benefit that the replicas for survivor segments will be stored on a different set of backups than the replicas of the head segment. This allows the survivor segment replicas to be written in parallel with the log head replicas without contending for the same backup disks, which increases the total throughput for a single master.

    6.2 Hash Table Contention

The main source of thread contention during cleaning is the hash table. This data structure is used both by service threads and cleaner threads, as it indicates which objects are alive and points to their current locations in the in-memory log. The cleaner uses the hash table to check whether an object is alive (by seeing if the hash table currently points to that exact object). If the object is alive, the cleaner copies it and updates the hash table to refer to the new location in a survivor segment. Meanwhile, service threads may be using the hash table to find objects during read requests, and they may update the hash table during write or delete requests. To ensure consistency while reducing contention, RAMCloud currently uses fine-grained locks on individual hash table buckets. In the future we plan to explore lockless approaches to eliminate this overhead.
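A minimal sketch of per-bucket locking is shown below, assuming a fixed bucket count, standard mutexes, and hashed 64-bit keys; RAMCloud's actual hash table is more elaborate.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hash table with fine-grained per-bucket locks; a simplified stand-in for
// RAMCloud's table-id/key -> log-location index.
class ShardedHashTable {
    static constexpr std::size_t kBuckets = 1024;
    struct Bucket {
        std::mutex lock;
        std::unordered_map<uint64_t, const void*> entries;  // key hash -> object location
    };
    std::array<Bucket, kBuckets> buckets_;

    Bucket& bucketFor(uint64_t keyHash) { return buckets_[keyHash % kBuckets]; }

public:
    // Service threads: look up an object's current location in the log.
    const void* lookup(uint64_t keyHash) {
        Bucket& b = bucketFor(keyHash);
        std::lock_guard<std::mutex> guard(b.lock);
        auto it = b.entries.find(keyHash);
        return it == b.entries.end() ? nullptr : it->second;
    }

    // Cleaner threads: relocate an entry only if it still points at the old
    // copy; otherwise the object was overwritten or deleted and is dead.
    bool relocateIfLive(uint64_t keyHash, const void* oldLocation, const void* newLocation) {
        Bucket& b = bucketFor(keyHash);
        std::lock_guard<std::mutex> guard(b.lock);
        auto it = b.entries.find(keyHash);
        if (it == b.entries.end() || it->second != oldLocation) return false;  // dead object
        it->second = newLocation;
        return true;
    }
};
```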

    6.3 Freeing Segments in Memory

Once a cleaner thread has cleaned a segment, the segment's storage in memory can be freed for reuse. At this point, future service threads will not use data in the cleaned segment, because there are no hash table entries pointing into it. However, it could be that a service thread began using the data in the segment before the cleaner updated the hash table; if so, the cleaner must not free the segment until the service thread has finished using it.

To solve this problem, RAMCloud uses a simple mechanism similar to RCU's [27] wait-for-readers primitive and Tornado/K42's generations [6]: after a segment has been cleaned, the system will not free it until all RPCs currently being processed complete. At this point it is safe to reuse the segment's memory, since new RPCs cannot reference the segment. This approach has the advantage of not requiring additional locks for normal reads and writes.
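One common way to realize such a wait-for-readers rule is a simple epoch scheme: tag each cleaned segment with an epoch and free it only after every RPC that began at or before that epoch has completed. The sketch below is one such scheme under those assumptions, not RAMCloud's implementation.

```cpp
#include <atomic>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>

// Epoch-based deferral of segment frees: a cleaned segment is released only
// once every RPC that might still reference it has completed.
class SegmentReclaimer {
    std::atomic<uint64_t> currentEpoch_{0};
    std::atomic<uint64_t> oldestActiveRpcEpoch_{UINT64_MAX};  // reported by the RPC layer
    std::mutex mutex_;
    std::deque<std::pair<uint64_t, void*>> pending_;          // (epoch when cleaned, segment memory)

public:
    // Called by the cleaner after removing all hash table references to `segment`.
    void deferFree(void* segment) {
        std::lock_guard<std::mutex> guard(mutex_);
        pending_.emplace_back(currentEpoch_.fetch_add(1), segment);
    }

    // Called by the RPC layer to report the epoch of the oldest RPC still running.
    void setOldestActiveRpcEpoch(uint64_t epoch) { oldestActiveRpcEpoch_ = epoch; }

    // Release every segment that was cleaned before all currently running RPCs began.
    void reclaim() {
        std::lock_guard<std::mutex> guard(mutex_);
        while (!pending_.empty() && pending_.front().first < oldestActiveRpcEpoch_) {
            // ... return pending_.front().second to the seglet pool here ...
            pending_.pop_front();
        }
    }
};
```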

    6.4 Freeing Segments on Disk

Once a segment has been cleaned, its replicas on backups must also be freed. However, this must not be done until the corresponding survivor segments have been safely incorporated into the on-disk log. This takes two steps. First, the survivor segments must be fully replicated on backups. Survivor segments are transmitted to backups asynchronously during cleaning, so at the end of each cleaning pass the cleaner must wait for all of its survivor segments to be received by backups. Second, a new log digest must be written, which includes the survivor segments and excludes the cleaned segments. Once the digest has been durably written to backups, RPCs are issued to free the replicas for the cleaned segments.
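The ordering constraint can be written down directly. The helpers below (waitForFullReplication, writeDigestDurably, freeBackupReplicas) are hypothetical stand-ins for the corresponding RPC machinery, not RAMCloud's real interfaces.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the replication and digest machinery.
void waitForFullReplication(const std::vector<uint64_t>& survivorIds) {
    // ... block until every backup acknowledges each survivor segment replica ...
    (void)survivorIds;
}
void writeDigestDurably(const std::vector<uint64_t>& survivorIds,
                        const std::vector<uint64_t>& cleanedIds) {
    // ... write a new head segment whose digest lists the survivors and omits
    //     the cleaned segments, and wait until it is durable on backups ...
    (void)survivorIds; (void)cleanedIds;
}
void freeBackupReplicas(const std::vector<uint64_t>& cleanedIds) {
    // ... issue RPCs telling backups to discard the old replicas ...
    (void)cleanedIds;
}

// Replicas of cleaned segments may be freed only after the survivors are fully
// replicated and a digest excluding the cleaned segments is durably on disk.
void finishCleaningPass(const std::vector<uint64_t>& survivorIds,
                        const std::vector<uint64_t>& cleanedIds) {
    waitForFullReplication(survivorIds);          // step 1
    writeDigestDurably(survivorIds, cleanedIds);  // step 2
    freeBackupReplicas(cleanedIds);               // now safe
}
```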

[Figure 5 diagram (omitted): two cleaned 80-byte segments at 75/80 utilization and the three resulting survivor segments.]

Figure 5: A simplified situation in which cleaning uses more space than it frees. Two 80-byte segments at about 94% utilization are cleaned: their objects are reordered by age (not depicted) and written to survivor segments. The label in each object indicates its size. Because of fragmentation, the last object (size 14) overflows into a third survivor segment.

    7 Avoiding Cleaner Deadlock

Since log cleaning copies data before freeing it, the cleaner must have free memory space to work with before it can generate more. If there is no free memory, the cleaner cannot proceed and the system will deadlock. RAMCloud increases the risk of memory exhaustion by using memory at high utilization. Furthermore, it delays cleaning as long as possible in order to allow more objects to be deleted. Finally, two-level cleaning allows tombstones to accumulate, which consumes even more free space. This section describes how RAMCloud prevents cleaner deadlock while maximizing memory utilization.

The first step is to ensure that there are always free seglets for the cleaner to use. This is accomplished by reserving a special pool of seglets for the cleaner. When seglets are freed, they are used to replenish the cleaner pool before making space available for other uses.

The cleaner pool can only be maintained if each cleaning pass frees as much space as it uses; otherwise the cleaner could gradually consume its own reserve and then deadlock. However, RAMCloud does not allow objects to cross segment boundaries, which results in some wasted space at the end of each segment. When the cleaner reorganizes objects, it is possible for the survivor segments to have greater fragmentation than the original segments, and this could result in the survivors taking more total space than the original segments (see Figure 5).

To ensure that the cleaner always makes forward progress, it must produce at least enough free space to compensate for space lost to fragmentation. Suppose that N segments are cleaned in a particular pass and the fraction of free space in these segments is F; furthermore, let S be the size of a full segment and O the maximum object size. The cleaner will produce N S (1 - F) bytes of live data in this pass. Each survivor segment could contain as little as S - O + 1 bytes of live data (if an object of size O couldn't quite fit at the end of the segment), so the maximum number of survivor segments will be ⌈N S (1 - F) / (S - O + 1)⌉.

CPU         Xeon X3470 (4x2.93 GHz cores, 3.6 GHz Turbo)
RAM         24 GB DDR3 at 800 MHz
Flash Disks 2x Crucial M4 CT128M4SSD2 SSDs (128 GB)
NIC         Mellanox ConnectX-2 Infiniband HCA
Switch      Mellanox SX6036 (4X FDR)

Table 2: The server hardware configuration used for benchmarking. All nodes ran Linux 2.6.32 and were connected to an Infiniband fabric.

The last seglet of each survivor segment could be empty except for a single byte, resulting in almost a full seglet of fragmentation for each survivor segment. Thus, F must be large enough to produce a bit more than one seglet's worth of free data for each survivor segment generated. For RAMCloud, we conservatively require 2% of free space per cleaned segment, which is a bit more than two seglets. This number could be reduced by making seglets smaller.
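As a back-of-the-envelope check of the 2% requirement, using the paper's 8 MB segments, 64 KB seglets, and 1 MB maximum object size (this is only illustrative arithmetic, not RAMCloud's actual accounting):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double S = 8.0 * 1024 * 1024;    // full segment size (bytes)
    const double O = 1.0 * 1024 * 1024;    // maximum object size (bytes)
    const double seglet = 64.0 * 1024;     // seglet size (bytes)
    const double F = 0.02;                 // free fraction required per cleaned segment
    const int N = 1;                       // segments cleaned in this pass

    double liveBytes = N * S * (1.0 - F);
    // Worst case: each survivor holds as little as S - O + 1 bytes of live data...
    double maxSurvivors = std::ceil(liveBytes / (S - O + 1.0));
    // ...and wastes almost one full seglet at its end.
    double worstWaste = maxSurvivors * seglet;
    double freed = N * S * F;

    // Roughly 164 KB freed vs. 128 KB of worst-case waste: the pass still nets free space.
    std::printf("freed=%.0f KB, worst-case waste=%.0f KB\n",
                freed / 1024, worstWaste / 1024);
}
```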

There is one additional problem that could result in memory deadlock. Before freeing segments after cleaning, RAMCloud must write a new log digest to add the survivors to the log and remove the old segments. Writing a new log digest means writing a new log head segment (survivor segments do not contain digests). Unfortunately, this consumes yet another segment, which could contribute to memory exhaustion. Our initial solution was to require each cleaner pass to produce enough free space for the new log head segment, in addition to replacing the segments used for survivor data. However, it is hard to guarantee better than break-even cleaner performance when there is very little free space.

The current solution takes a different approach: it reserves two special emergency head segments that contain only log digests; no other data is permitted. If there is no free memory after cleaning, one of these segments is allocated for the head segment that will hold the new digest. Since the segment contains no objects or tombstones, it does not need to be cleaned; it is immediately freed when the next head segment is written (the emergency head is not included in the log digest for the next head segment). By keeping two emergency head segments in reserve, RAMCloud can alternate between them until a full segment's worth of space is freed and a proper log head can be allocated. As a result, each cleaner pass only needs to produce as much free space as it uses.

By combining these techniques, RAMCloud can guarantee deadlock-free cleaning with total memory utilization as high as 98%. When utilization reaches this limit, no new data (or tombstones) can be appended to the log until the cleaner has freed space. However, RAMCloud sets a lower utilization limit for writes, in order to reserve space for tombstones. Otherwise all available log space could be consumed with live data and there would be no way to add tombstones to delete objects.

    8 Evaluation

All of the features described in the previous sections are implemented in RAMCloud version 1.0, which was released in January 2014. This section describes a series of experiments we ran to evaluate log-structured memory and its implementation in RAMCloud. The key results are:

• RAMCloud supports memory utilizations of 80-90% without significant loss in performance.

• At high memory utilizations, two-level cleaning improves client throughput up to 6x over a single-level approach.

• Log-structured memory also makes sense for other DRAM-based storage systems, such as memcached.

• RAMCloud provides a better combination of durability and performance than other storage systems such as HyperDex and Redis.

Note: all plots in this section show the average of 3 or more runs, with error bars for minimum and maximum values.

    8.1 Performance vs. Utilization

The most important metric for log-structured memory is how it performs at high memory utilization. In Section 2 we found that other allocators could not achieve high memory utilization in the face of changing workloads. With log-structured memory, we can choose any utilization up to the deadlock limit of about 98% described in Section 7. However, system performance will degrade as memory utilization increases; thus, the key question is how efficiently memory can be used before performance drops significantly. Our hope at the beginning of the project was that log-structured memory could support memory utilizations in the range of 80-90%.

The measurements in this section used an 80-node cluster of identical commodity servers (see Table 2). Our primary concern was the throughput of a single master, so we divided the cluster into groups of five servers and used different groups to measure different data points in parallel. Within each group, one node ran a master server, three nodes ran backups, and the last node ran the coordinator and client benchmark. This configuration provided each master with about 700 MB/s of back-end bandwidth. In an actual RAMCloud system the back-end bandwidth available to one master could be either more or less than this; we experimented with different back-end bandwidths and found that it did not change any of our conclusions. Each byte stored on a master was replicated to three different backups for durability.

All of our experiments used a maximum of two threads for cleaning. Our cluster machines have only four cores, and the main RAMCloud server requires two of them, so there were only two cores available for cleaning (we have not yet evaluated the effect of hyperthreading on RAMCloud's throughput or latency).

In each experiment, the master was given 16 GB of log space and the client created objects with sequential keys until it reached a target memory utilization; then it overwrote objects (maintaining a fixed amount of live data continuously) until the overhead for cleaning converged to a stable value.

[Figure 6 plots (omitted): client write throughput in MB/s and objects/s (x1,000) versus memory utilization (30-90%) for 100-byte, 1,000-byte, and 10,000-byte objects; curves compare two-level and one-level cleaning under Zipfian and uniform access, plus a Sequential curve.]

Figure 6: End-to-end client write performance as a function of memory utilization. For some experiments two-level cleaning was disabled, so only the combined cleaner was used. The "Sequential" curve used two-level cleaning and uniform access patterns with a single outstanding write request at a time. All other curves used the high-stress workload with concurrent multi-writes. Each point is averaged over 3 runs on different groups of servers.

We varied the workload in four ways to measure system behavior under different operating conditions:

1. Object Size: RAMCloud's performance depends on average object size (e.g., per-object overheads versus memory copying overheads), but not on the exact size distribution (see Section 8.5 for supporting evidence). Thus, unless otherwise noted, the objects for each test had the same fixed size. We ran different tests with sizes of 100, 1,000, 10,000, and 100,000 bytes (we omit the 100 KB measurements, since they were nearly identical to 10 KB).

2. Memory Utilization: The percentage of DRAM used for holding live data (not including tombstones) was fixed in each test. For example, at 50% and 90% utilization there were 8 GB and 14.4 GB of live data, respectively. In some experiments, total memory utilization was significantly higher than the listed number due to an accumulation of tombstones.

3. Locality: We ran experiments with both uniform random overwrites of objects and a Zipfian distribution in which 90% of writes were made to 15% of the objects. The uniform random case represents a workload with no locality; Zipfian represents locality similar to what has been observed in memcached deployments [7].

[Figure 7 plots (omitted): cleaner bytes written per new byte of log data versus memory utilization (30-90%) for 100-byte, 1,000-byte, and 10,000-byte objects; curves compare one-level and two-level cleaning under uniform and Zipfian access.]

Figure 7: Cleaner bandwidth overhead (ratio of cleaner bandwidth to regular log write bandwidth) for the workloads in Figure 6. A ratio of 1 means that for every byte of new data written to backups, the cleaner writes 1 byte of live data to backups while freeing segment space. The optimal ratio is 0.

4. Stress Level: For most of the tests we created an artificially high workload in order to stress the master to its limit. To do this, the client issued write requests asynchronously, with 10 requests outstanding at any given time. Furthermore, each request was a multi-write containing 75 individual writes. We also ran tests where the client issued one synchronous request at a time, with a single write operation in each request; these tests are labeled "Sequential" in the graphs.

Figure 6 graphs the overall throughput of a RAMCloud master with different memory utilizations and workloads. With two-level cleaning enabled, client throughput drops only 10-20% as memory utilization increases from 30% to 80%, even with an artificially high workload. Throughput drops more significantly at 90% utilization: in the worst case (small objects with no locality), throughput at 90% utilization is about half that at 30%. At high utilization the cleaner is limited by disk bandwidth and cannot keep up with write traffic; new writes quickly exhaust all available segments and must wait for the cleaner.


These results exceed our original performance goals for RAMCloud. At the start of the project, we hoped that each RAMCloud server could support 100K small writes per second, out of a total of one million small operations per second. Even at 90% utilization, RAMCloud can support almost 410K small writes per second with some locality and nearly 270K with no locality.

If actual RAMCloud workloads are similar to our Sequential case, then it should be reasonable to run RAMCloud clusters at 90% memory utilization (for 100 and 1,000 B objects there is almost no performance degradation). If workloads include many bulk writes, like most of the measurements in Figure 6, then it makes more sense to run at 80% utilization: the higher throughput will more than offset the 12.5% additional cost for memory.

Compared to the traditional storage allocators measured in Section 2, log-structured memory permits significantly higher memory utilization.

    8.2 Two-Level Cleaning

Figure 6 also demonstrates the benefits of two-level cleaning. The figure contains additional measurements in which segment compaction was disabled ("One-level"); in these experiments, the system used RAMCloud's original one-level approach where only the combined cleaner ran. The two-level cleaning approach provides a considerable performance improvement: at 90% utilization, client throughput is up to 6x higher with two-level cleaning than single-level cleaning.

One of the motivations for two-level cleaning was to reduce the disk bandwidth used by cleaning, in order to make more bandwidth available for normal writes. Figure 7 shows that two-level cleaning reduces disk and network bandwidth overheads at high memory utilizations. The greatest benefits occur with larger object sizes, where two-level cleaning reduces overheads by 7-87x. Compaction is much more efficient in these cases because there are fewer objects to process.

    8.3 CPU Overhead of Cleaning

Figure 8 shows the CPU time required for cleaning in two of the workloads from Figure 6. Each bar represents the average number of fully active cores used for combined cleaning and compaction in the master, as well as for backup RPC and disk I/O processing in the backups.

At low memory utilization a master under heavy load uses about 30-50% of one core for cleaning; backups account for the equivalent of at most 60% of one core across all six of them. Smaller objects require more CPU time for cleaning on the master due to per-object overheads, while larger objects stress backups more because the master can write up to 5 times as many megabytes per second (Figure 6). As free space becomes more scarce, the two cleaner threads are eventually active nearly all of the time. In the 100 B case, RAMCloud's balancer prefers to run combined cleaning due to the accumulation of tombstones. With larger objects compaction tends to be more efficient, so combined cleaning accounts for only a small fraction of the CPU time.

[Figure 8 bar chart (omitted): average number of active cores (combined cleaning, compaction, backup user time, backup kernel time) versus memory utilization (30-90%) for the 100-byte and 1,000-byte workloads.]

Figure 8: CPU overheads for two-level cleaning under the 100 and 1,000-byte Zipfian workloads in Figure 6, measured in average number of active cores. "Backup Kern" represents kernel time spent issuing I/Os to disks, and "Backup User" represents time spent servicing segment write RPCs on backup servers. Both of these bars are aggregated across all backups, and include traffic for normal writes as well as cleaning. "Compaction" and "Combined" represent time spent on the master in memory compaction and combined cleaning. Additional core usage unrelated to cleaning is not depicted. Each bar is averaged over 3 runs.

[Figure 9 plot (omitted): reverse cumulative distribution (log-log) of client write latency in microseconds, with and without the cleaner.]

Figure 9: Reverse cumulative distribution of client write latencies when a single client issues back-to-back write requests for 100-byte objects using the uniform distribution. The "No cleaner" curve was measured with cleaning disabled. The "Cleaner" curve shows write latencies at 90% memory utilization with cleaning enabled. For example, about 10% of all write requests took longer than 18 µs in both cases; with cleaning enabled, about 0.1% of all write requests took 1 ms or more. The median latency was 16.70 µs with cleaning enabled and 16.35 µs with the cleaner disabled.

    8.4 Can Cleaning Costs be Hidden?

One of the goals for RAMCloud's implementation of log-structured memory was to hide the cleaning costs so they don't affect client requests. Figure 9 graphs the latency of client write requests in normal operation with a cleaner running, and also in a special setup where the cleaner was disabled. The distributions are nearly identical up to about the 99.9th percentile, and cleaning only increased the median latency by 2% (from 16.35 to 16.70 µs). About 0.1% of write requests suffer an additional 1 ms or greater delay when cleaning. Preliminary experiments both with larger pools of backups and with replication disabled (not depicted) suggest that these delays are primarily due to contention for the NIC and RPC queueing delays in the single-threaded backup servers.

[Figure 10 bar chart (omitted): ratio of performance with and without cleaning for workloads W1-W8 at 70%, 80%, and 90% memory utilization, plus 90% (Sequential).]

Figure 10: Client performance in RAMCloud under the same workloads as in Figure 1 from Section 2. Each bar measures the performance of a workload (with cleaning enabled) relative to the performance of the same workload with cleaning disabled. Higher is better and 1.0 is optimal; it means that the cleaner has no impact on the processing of normal requests. As in Figure 1, 100 GB of allocations were made and at most 10 GB of data was alive at once. The 70%, 80%, and 90% utilization bars were measured with the high-stress request pattern using concurrent multi-writes. The "Sequential" bars used a single outstanding write request at a time; the data size was scaled down by a factor of 10x for these experiments to make running times manageable. The master in these experiments ran on the same Xeon E5-2670 system as in Table 1.

    8.5 Performance Under Changing Workloads

Section 2 showed that changing workloads caused poor memory utilization in traditional storage allocators. For comparison, we ran those same workloads on RAMCloud, using the same general setup as for earlier experiments. The results are shown in Figure 10 (this figure is formatted differently than Figure 1 in order to show RAMCloud's performance as a function of memory utilization). We expected these workloads to exhibit performance similar to the workloads in Figure 6 (i.e., we expected the performance to be determined by the average object sizes and access patterns; workload changes per se should have no impact). Figure 10 confirms this hypothesis: with the high-stress request pattern, performance degradation due to cleaning was 10-20% at 70% utilization and 40-50% at 90% utilization. With the Sequential request pattern, performance degradation was 5% or less, even at 90% utilization.

    8.6 Other Uses for Log-Structured Memory

Our implementation of log-structured memory is tied to RAMCloud's distributed replication mechanism, but we believe that log-structured memory also makes sense in other environments. To demonstrate this, we performed two additional experiments.

First, we re-ran some of the experiments from Figure 6 with replication disabled in order to simulate a DRAM-only storage system. We also disabled compaction

[Plot: write throughput in MB/s and objects/s (× 1,000) vs. memory utilization (%), for Zipfian and uniform access with R = 0 and R = 3.]

Figure 11: Two-level cleaning with (R = 3) and without replication (R = 0) for 1000-byte objects. The two lower curves are the same as in Figure 6.

Allocator      Fixed 25-byte    Zipfian 0 - 8 KB
Slab           8737             982
Log            11411            1125
Improvement    30.6%            14.6%

Table 3: Average number of objects stored per megabyte of cache in memcached, with its normal slab allocator and with a log-structured allocator. The "Fixed" column shows savings from reduced metadata (there is no fragmentation, since the 25-byte objects fit perfectly in one of the slab allocator's buckets). The "Zipfian" column shows savings from eliminating internal fragmentation in buckets. All experiments ran on a 16-core E5-2670 system with both client and server on the same machine to minimize network overhead. Memcached was given 2 GB of slab or log space for storing objects, and the slab rebalancer was enabled. YCSB [15] was used to generate the access patterns. Each run wrote 100 million objects with Zipfian-distributed key popularity and either fixed 25-byte or Zipfian-distributed sizes between 0 and 8 KB. Results were averaged over 5 runs.

(since there is no backup I/O to conserve) and had the server run the combined cleaner on in-memory segments only. Figure 11 shows that without replication, log-structured memory supports significantly higher throughput: RAMCloud's single writer thread scales to nearly 600K 1,000-byte operations per second. Under very high memory pressure throughput drops by 20-50% depending on access locality. At this object size, one writer thread and two cleaner threads suffice to handle between one quarter and one half of a 10 gigabit Ethernet link's worth of write requests.
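
As a rough sanity check (our back-of-envelope arithmetic, ignoring protocol and RPC overheads): a 10 gigabit Ethernet link carries about 1.25 GB/s, and 600K operations per second on 1,000-byte objects is roughly 600 MB/s, or about half the link; with the 20-50% throughput drop under very high memory pressure, the sustained rate falls to roughly a quarter to two fifths of the link.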

Second, we modified the popular memcached [2] 1.4.15 object caching server to use RAMCloud's log and cleaner instead of its slab allocator. To make room for new cache entries, we modified the log cleaner to evict cold objects as it cleaned, rather than using memcached's slab-based LRU lists. Our policy was simple: segments

Allocator    Throughput (Writes/s x 1,000)    % CPU Cleaning
Slab         259.9 ± 0.6                      0%
Log          268.0 ± 0.6                      5.37 ± 0.3%

Table 4: Average throughput and percentage of CPU used for cleaning under the same Zipfian write-only workload as in Table 3. Results were averaged over 5 runs.


[Bar chart: aggregate operations/s (millions) for YCSB workloads A, B, C, D, and F; systems: HyperDex 1.0rc4, Redis 2.6.14, RAMCloud 75%, RAMCloud 90%, RAMCloud 75% Verbs, RAMCloud 90% Verbs.]

Figure 12: Performance of HyperDex, RAMCloud, and Redis under the default YCSB [15] workloads. B, C, and D are read-heavy workloads, while A and F are write-heavy; workload E was omitted because RAMCloud does not support scans. Y-values represent aggregate average throughput of 24 YCSB clients running on 24 separate nodes (see Table 2). Each client performed 100 million operations on a data set of 100 million keys. Objects were 1 KB each (the workload default). An additional 12 nodes ran the storage servers. HyperDex and Redis used kernel-level sockets over Infiniband. The "RAMCloud 75%" and "RAMCloud 90%" bars were measured with kernel-level sockets over Infiniband at 75% and 90% memory utilization, respectively (each server's share of the 10 million total records corresponded to 75% or 90% of log memory). The "RAMCloud 75% Verbs" and "RAMCloud 90% Verbs" bars were measured with RAMCloud's kernel-bypass user-level Infiniband transport layer, which uses reliably-connected queue pairs via the Infiniband Verbs API. Each data point is averaged over 3 runs.

were selected for cleaning based on how many recent reads were made to objects in them (fewer requests indicate colder segments). After selecting segments, 75% of their most recently accessed objects were written to

survivor segments (in order of access time); the rest were discarded. Porting the log to memcached was straightforward, requiring only minor changes to the RAMCloud sources and about 350 lines of changes to memcached.
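
The eviction policy described above can be sketched roughly as follows. This is an illustrative reimplementation under assumed data structures (Segment, Object, recentReads, and lastAccessTime are hypothetical stand-ins), not the code that was actually added to memcached:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Hypothetical stand-ins for the real segment/object structures.
    struct Object {
        uint64_t lastAccessTime;   // timestamp of the most recent read
        // ... key, value, etc.
    };

    struct Segment {
        uint64_t recentReads = 0;  // reads to this segment's objects since it was last cleaned
        std::vector<Object*> liveObjects;
    };

    // Pick the coldest segments: fewer recent reads indicate colder data.
    std::vector<Segment*> selectSegmentsToClean(const std::vector<Segment*>& segments,
                                                size_t count) {
        std::vector<Segment*> candidates(segments);
        std::sort(candidates.begin(), candidates.end(),
                  [](const Segment* a, const Segment* b) {
                      return a->recentReads < b->recentReads;
                  });
        if (candidates.size() > count)
            candidates.resize(count);
        return candidates;
    }

    // During cleaning, keep the 75% most recently accessed live objects
    // (appended to survivor segments in order of access time) and evict the rest.
    void cleanAndEvict(Segment* segment, std::vector<Object*>* survivors) {
        auto& objs = segment->liveObjects;
        std::sort(objs.begin(), objs.end(),
                  [](const Object* a, const Object* b) {
                      return a->lastAccessTime > b->lastAccessTime;  // most recent first
                  });
        size_t keep = (objs.size() * 3) / 4;
        survivors->insert(survivors->end(), objs.begin(), objs.begin() + keep);
        // The remaining 25% are simply dropped (cache eviction).
    }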

Table 3 illustrates the main benefit of log-structured memory in memcached: increased memory efficiency. By using a log we were able to reduce per-object metadata overheads by 50% (primarily by eliminating LRU list pointers, like MemC3 [20]). This meant that small objects could be stored much more efficiently. Furthermore, using a log reduced internal fragmentation: the slab allocator must pick one of several fixed-size buckets for each object, whereas the log can pack objects of different sizes

into a single segment. Table 4 shows that these benefits also came with no loss in throughput and only minimal cleaning overhead.

    8.7 How does RAMCloud compare to other systems?

Figure 12 compares the performance of RAMCloud to HyperDex [18] and Redis [3] using the YCSB [15] benchmark suite. All systems were configured with triple replication. Since HyperDex is a disk-based store, we configured it to use a RAM-based file system to ensure that no

operations were limited by disk I/O latencies, which the other systems specifically avoid. Both RAMCloud and Redis wrote to SSDs (Redis' append-only logging mechanism was used with a 1s fsync interval). It is worth noting that Redis is distributed with jemalloc [19], whose fragmentation issues we explored in Section 2.

RAMCloud outperforms HyperDex in every case, even when running at very high memory utilization and despite configuring HyperDex so that it does not write to disks. RAMCloud also outperforms Redis, except in write-dominated workloads A and F when kernel sockets are used. In these cases RAMCloud is limited by RPC latency, rather than allocation speed. In particular, RAMCloud must wait until data is replicated to all backups before replying to a client's write request. Redis, on the other hand, offers no durability guarantee; it responds immediately and batches updates to replicas. This unsafe mode of operation means that Redis is much less reliant on RPC latency for throughput.

Unlike the other two systems, RAMCloud was optimized for high-performance networking. For fairness, the "RAMCloud 75%" and "RAMCloud 90%" bars depict performance using the same kernel-level sockets as Redis and HyperDex. To show RAMCloud's full potential, however, we also included measurements using the Infiniband Verbs API, which permits low-latency access to the network card without going through the kernel. This is the normal transport used in RAMCloud; it more than doubles read throughput, and matches Redis' write throughput at 75% memory utilization (RAMCloud is 25% slower than Redis for workload A at 90% utilization). Since Redis is less reliant on latency for performance, we do not expect it to benefit substantially if ported to use the Verbs API.

9 LFS Cost-Benefit Revisited

Like LFS [32], RAMCloud's combined cleaner uses

a cost-benefit policy to choose which segments to clean. However, while evaluating cleaning techniques for RAMCloud we discovered a significant flaw in the original LFS policy for segment selection. A small change to the formula for segment selection fixes this flaw and improves cleaner performance by 50% or more at high utilization under a wide range of access localities (e.g., the Zipfian and uniform access patterns in Section 8.1). This improvement applies to any implementation of log-structured storage.

LFS selected segments to clean by evaluating the following formula for each segment and choosing the segments with the highest ratios of benefit to cost:

    benefit / cost = ((1 - u) × objectAge) / (1 + u)

In this formula, u is the segment's utilization (fraction of data still live), and objectAge is the age of the youngest data in the segment. The cost of cleaning a segment is


[Plot: write cost vs. disk utilization (%); curves: Original Simulator, New Simulator (Youngest File Age), New Simulator (Segment Age).]

Figure 13: An original LFS simulation from [31]'s Figure 5-6 compared to results from our reimplemented simulator. The graph depicts how the I/O overhead of cleaning under a particular synthetic workload (see [31] for details) increases with disk utilization. Only by using segment age were we able to reproduce the original results (note that the bottom two lines coincide).

determined by the number of bytes that must be read or written from disk (the entire segment must be read, then the live bytes must be rewritten). The benefit of cleaning

includes two factors: the amount of free space that will be reclaimed (1 - u), and an additional factor intended to represent the stability of the data. If data in a segment is being overwritten rapidly then it is better to delay cleaning so that u will drop; if data in a segment is stable, it makes more sense to reclaim the free space now. objectAge was used as an approximation for stability. LFS showed that cleaning can be made much more efficient by taking all these factors into account.
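
As an illustrative calculation (our numbers, not from the original LFS work): with equal objectAge, a segment at u = 0.8 scores (1 - 0.8)/(1 + 0.8), about 0.11 × objectAge, while a segment at u = 0.5 scores (1 - 0.5)/(1 + 0.5), about 0.33 × objectAge. The emptier segment is strongly preferred, and a highly utilized segment is chosen only if its data has remained stable long enough for its age to outweigh the difference.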

RAMCloud uses a slightly different formula for segment selection:

    benefit / cost = ((1 - u) × segmentAge) / u

This differs from LFS in two ways. First, the cost has changed from 1 + u to u. This reflects the fact that RAMCloud keeps live segment contents in memory at all times, so the only cleaning cost is for rewriting live data.

The second change to RAMCloud's segment selection formula is in the way that data stability is estimated; this has a significant impact on cleaner performance. Using object age produces pathological cleaning behavior when there are very old objects. Eventually, some segments' objects become old enough to force the policy into cleaning the segments at extremely high utilization, which is very inefficient. Moreover, since live data is written to survivor segments in age-order (to segregate hot and cold data and make future cleaning more efficient), a vicious cycle ensues because the cleaner generates new segments with similarly high ages. These segments are then cleaned at high utilization, producing new survivors with high ages, and so on. In general, object age is not a reliable estimator of stability. For example, if objects are deleted uniform-randomly, then an object's age provides no indication of how long it may persist.

To fix this problem, RAMCloud uses the age of the segment, not the age of its objects, in the formula for segment

selection. This provides a better approximation to the stability of the segment's data: if a segment is very old, then its overall rate of decay must be low, otherwise its u-value would have dropped to the point of it being selected for cleaning. Furthermore, this age metric resets when a segment is cleaned, which prevents very old ages from accumulating. Figure 13 shows that this change improves overall write performance by 70% at 90% disk utilization. This improvement applies not just to RAMCloud, but to any log-structured system.
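
For concreteness, here is a minimal sketch of the two selection scores described above; this is our own illustrative code, not RAMCloud's source, and the Segment fields are hypothetical stand-ins:

    #include <cstdint>

    struct Segment {
        double utilization;          // u: fraction of the segment's bytes still live (0 < u <= 1)
        uint64_t youngestObjectAge;  // age of the youngest object in the segment
        uint64_t segmentAge;         // time since this segment was written by the log head or cleaner
    };

    // Original LFS selection score: (1 - u) * objectAge / (1 + u).
    // The 1 + u cost models reading the whole segment from disk and rewriting its live bytes.
    double lfsScore(const Segment& s) {
        return (1.0 - s.utilization) * static_cast<double>(s.youngestObjectAge)
               / (1.0 + s.utilization);
    }

    // RAMCloud-style selection score: (1 - u) * segmentAge / u.
    // The cost is only u because live data already resides in memory, so cleaning never
    // rereads the segment; segment age replaces object age as the stability estimate.
    double ramcloudScore(const Segment& s) {
        return (1.0 - s.utilization) * static_cast<double>(s.segmentAge)
               / s.utilization;
    }

A cleaner would evaluate one of these scores for every candidate segment and clean the highest-scoring segments first.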

Intriguingly, although Sprite LFS used youngest object age in its cost-benefit formula, we believe that the LFS simulator, which was originally used to develop the cost-benefit policy, inadvertently used segment age instead. We reached this conclusion when we attempted to reproduce the original LFS simulation results and failed. Our initial simulation results were much worse than those reported for LFS (see Figure 13); when we switched from objectAge to segmentAge, our simulations matched those for LFS exactly. Further evidence can be found in [26], which was based on a descendant of the original LFS simulator and describes the LFS cost-benefit policy as using the segment's age. Unfortunately, source code is no longer available for either of these simulators.

10 Future Work

There are additional opportunities to improve the

performance of log-structured memory that we have not yet explored. One approach that has been used in many other storage systems is to compress the data being stored. This would allow memory to be used even more efficiently, but it would create additional CPU overheads both for reading and writing objects. Another possibility is to take advantage of periods of low load (in the middle of the night, for example) to clean aggressively in order to generate as much free space as possible; this could potentially reduce the cleaning overheads during periods of higher load.

Many of our experiments focused on worst-case synthetic scenarios (for example, heavy write loads at very high memory utilization, simple object size distributions and access patterns, etc.). In doing so we wanted to stress the system as much as possible to understand its limits. However, realistic workloads may be much less demanding. When RAMCloud begins to be deployed and used we hope to learn much more about its performance under real-world access patterns.

11 Related Work

DRAM has long been used to improve performance in

main-memory database systems [17, 21], and large-scale Web applications have rekindled interest in DRAM-based storage in recent years. In addition to special-purpose systems like Web search engines [9], general-purpose storage systems like H-Store [25] and Bigtable [12] also keep part or all of their data in memory to maximize performance.

RAMCloud's storage management is superficially similar


to Bigtable [12] and its related LevelDB [4] library. For example, writes to Bigtable are first logged to GFS [22] and then stored in a DRAM buffer. Bigtable has several different mechanisms referred to as compactions, which flush the DRAM buffer to a GFS file when it grows too large, reduce the number of files on disk, and reclaim space used by delete entries (analogous to tombstones in RAMCloud and called deletion markers in LevelDB). Unlike RAMCloud, the purpose of these compactions is not to reduce backup I/O, nor is it clear that these design choices improve memory efficiency. Bigtable does not incrementally remove delete entries from tables; instead it must rewrite them entirely. LevelDB's generational garbage collection mechanism [5], however, is more similar to RAMCloud's segmented log and cleaning.

    Cleaning in log-structured memory serves a functionsimilar to copying garbage collectors in many commonprogramming languages such as Java and LISP [24, 37].

Section 2 has already discussed these systems. Log-structured memory in RAMCloud was influenced

by ideas introduced in log-structured file systems [32]. Much of the nomenclature and general techniques are shared (log segmentation, cleaning, and cost-benefit selection, for example). However, RAMCloud differs in its design and application. The key-value data model, for instance, allows RAMCloud to use simpler metadata structures than LFS. Furthermore, as a cluster system, RAMCloud has many disks at its disposal, which reduces contention between cleaning and regular log appends.

Efficiency has been a controversial topic in log-

structured file systems [34, 35]. Additional techniques were introduced to reduce or hide the cost of cleaning [11, 26]. However, as an in-memory store, RAMCloud's use of a log is more efficient than LFS. First, RAMCloud need not read segments from disk during cleaning, which reduces cleaner I/O. Second, RAMCloud may run its disks at low utilization, making disk cleaning much cheaper with two-level cleaning. Third, since reads are always serviced from DRAM they are always fast, regardless of locality of access or placement in the log.

RAMCloud's data model and use of DRAM as the location of record for all data are similar to various NoSQL storage systems. Redis [3] is an in-memory store that supports a persistence log for durability, but does not do cleaning to reclaim free space, and offers weak durability guarantees. Memcached [2] stores all data in DRAM, but it is a volatile cache with no durability. Other NoSQL systems like Dynamo [16] and PNUTS [14] also have simplified data models, but do not service all reads from memory. HyperDex [18] offers similar durability and consistency to RAMCloud, but is a disk-based system and supports a richer data model, including range scans and efficient searches across multiple columns.

12 Conclusion

Logging has been used for decades to ensure durability

and consistency in storage systems. When we began designing RAMCloud, it was a natural choice to use a logging approach on disk to back up the data stored in main memory. However, it was surprising to discover that logging

also makes sense as a technique for managing the data in DRAM. Log-structured memory takes advantage of the restricted use of pointers in storage systems to eliminate the global memory scans that fundamentally limit existing garbage collectors. The result is an efficient and highly incremental form of copying garbage collector that allows memory to be used efficiently even at utilizations of 80-90%. A pleasant side effect of this discovery was that we were able to use a single technique for managing both disk and main memory, with small policy differences that optimize the usage of each medium.

Although we developed log-structured memory for RAMCloud, we believe that the ideas are generally applicable

and that log-structured memory is a good candidate for managing memory in DRAM-based storage systems.

13 Acknowledgements

We would like to thank Asaf Cidon, Satoshi Matsushita,

Diego Ongaro, Henry Qin, Mendel Rosenblum, Ryan Stutsman, Stephen Yang, the anonymous reviewers from FAST 2013, SOSP 2013, and FAST 2014, and our shepherd, Randy Katz, for their helpful comments. This work was supported in part by the Gigascale Systems Research Center and the Multiscale Systems Center, two of six research centers funded under the Focus Center Research Program, a Semiconductor Research

Corporation program, by C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program, sponsored by MARCO and DARPA, and by the National Science Foundation under Grant No. 0963859. Additional support was provided by Stanford Experimental Data Center Laboratory affiliates Facebook, Mellanox, NEC, Cisco, Emulex, NetApp, SAP, Inventec, Google, VMware, and Samsung. Steve Rumble was supported by a Natural Sciences and Engineering Research Council of Canada Postgraduate Scholarship.

References

[1] Google performance tools, Mar. 2013. http://goog-perftools.sourceforge.net/.

[2] memcached: a distributed memory object caching system, Mar. 2013. http://www.memcached.org/.

[3] Redis, Mar. 2013. http://www.redis.io/.

[4] leveldb - a fast and lightweight key/value database library by google, Jan. 2014. http://code.google.com/p/leveldb/.

[5] Leveldb file layouts and compactions, Jan. 2014. http://leveldb.googlecode.com/svn/trunk/doc/impl.html.

[6] Appavoo, J., Hui, K., Soules, C. A. N., Wisniewski, R. W., Da Silva, D. M., Krieger, O., Auslander, M. A., Edelsohn, D. J., Gamsa, B., Ganger, G. R., McKenney, P., Ostrowski, M., Rosenburg, B., Stumm, M., and Xenidis, J. Enabling autonomic behavior in systems software with hot swapping. IBM Syst. J. 42, 1 (Jan. 2003), 60-76.

[7] Atikoglu, B., Xu, Y., Frachtenberg, E., Jiang, S., and Paleczny, M. Workload analysis of a large-scale key-value store. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 2012), SIGMETRICS '12, ACM, pp. 53-64.

[8] Bacon, D. F., Cheng, P., and Rajan, V. T. A real-time garbage collector with low overhead and consistent utilization. In Proceedings of the 30th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (New York, NY, USA, 2003), POPL '03, ACM, pp. 285-298.

[9] Barroso, L. A., Dean, J., and Hölzle, U. Web search for a planet: The google cluster architecture. IEEE Micro 23, 2 (Mar. 2003), 22-28.

[10] Berger, E. D., McKinley, K. S., Blumofe, R. D., and Wilson, P. R. Hoard: a scalable memory allocator for multithreaded applications. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2000), ASPLOS IX, ACM, pp. 117-128.

[11] Blackwell, T., Harris, J., and Seltzer, M. Heuristic cleaning algorithms in log-structured file systems. In Proceedings of the USENIX 1995 Technical Conference (Berkeley, CA, USA, 1995), TCON '95, USENIX Association, pp. 277-288.

[12] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: A distributed storage system for structured data. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 2006), OSDI '06, USENIX Association, pp. 205-218.

[13] Cheng, P., and Blelloch, G. E. A parallel, real-time garbage collector. In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation (New York, NY, USA, 2001), PLDI '01, ACM, pp. 125-136.

[14] Cooper, B. F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H.-A., Puz, N., Weaver, D., and Yerneni, R. Pnuts: Yahoo!'s hosted data serving platform. Proc. VLDB Endow. 1 (August 2008), 1277-1288.

[15] Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. Benchmarking cloud serving systems with ycsb. In Proceedings of the 1st ACM Symposium on Cloud Computing (New York, NY, USA, 2010), SoCC '10, ACM, pp. 143-154.

[16] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. Dynamo: amazon's highly available key-value store. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles (New York, NY, USA, 2007), SOSP '07, ACM, pp. 205-220.

[17] DeWitt, D. J., Katz, R. H., Olken, F., Shapiro, L. D., Stonebraker, M. R., and Wood, D. A. Implementation techniques for main memory database systems. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 1984), SIGMOD '84, ACM, pp. 1-8.

[18] Escriva, R., Wong, B., and Sirer, E. G. Hyperdex: a distributed, searchable key-value store. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (New York, NY, USA, 2012), SIGCOMM '12, ACM, pp. 25-36.

[19] Evans, J. A scalable concurrent malloc(3) implementation for freebsd. In Proceedings of the BSDCan Conference (Apr. 2006).

[20] Fan, B., Andersen, D. G., and Kaminsky, M. Memc3: compact and concurrent memcache with dumber caching and smarter hashing. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2013), NSDI '13, USENIX Association, pp. 371-384.

[21] Garcia-Molina, H., and Salem, K. Main memory database systems: An overview. IEEE Trans. on Knowl. and Data Eng. 4 (December 1992), 509-516.

[22] Ghemawat, S., Gobioff, H., and Leung, S.-T. The google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 29-43.

[23] Hertz, M., and Berger, E. D. Quantifying the performance of garbage collection vs. explicit memory management. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (New York, NY, USA, 2005), OOPSLA '05, ACM, pp. 313-326.

[24] Jones, R., Hosking, A., and Moss, E. The Garbage Collection Handbook: The Art of Automatic Memory Management, 1st ed. Chapman & Hall/CRC, 2011.

[25] Kallman, R., Kimura, H., Natkins, J., Pavlo, A., Rasin, A., Zdonik, S., Jones, E. P. C., Madden, S., Stonebraker, M., Zhang, Y., Hugg, J., and Abadi, D. J. H-store: a high-performance, distributed main memory transaction processing system. Proc. VLDB Endow. 1 (August 2008), 1496-1499.

[26] Matthews, J. N., Roselli, D., Costello, A. M., Wang, R. Y., and Anderson, T. E. Improving the performance of log-structured file systems with adaptive methods. SIGOPS Oper. Syst. Rev. 31, 5 (Oct. 1997), 238-251.

[27] McKenney, P. E., and Slingwine, J. D. Read-copy update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Systems (Las Vegas, NV, Oct. 1998), pp. 509-518.

[28] Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., McElroy, R., Paleczny, M., Peek, D., Saab, P., Stafford, D., Tung, T., and Venkataramani, V. Scaling memcache at facebook. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2013), NSDI '13, USENIX Association, pp. 385-398.

[29] Ongaro, D., Rumble, S. M., Stutsman, R., Ousterhout, J., and Rosenblum, M. Fast crash recovery in ramcloud. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (New York, NY, USA, 2011), SOSP '11, ACM, pp. 29-41.

[30] Ousterhout, J., Agrawal, P., Erickson, D., Kozyrakis, C., Leverich, J., Mazières, D., Mitra, S., Narayanan, A., Ongaro, D., Parulkar, G., Rosenblum, M., Rumble, S. M., Stratmann, E., and Stutsman, R. The case for ramcloud. Commun. ACM 54 (July 2011), 121-130.

[31] Rosenblum, M. The design and implementation of a log-structured file system. PhD thesis, Berkeley, CA, USA, 1992. UMI Order No. GAX93-30713.

[32] Rosenblum, M., and Ousterhout, J. K. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10 (February 1992), 26-52.

[33] Rumble, S. M. Memory and Object Management in RAMCloud. PhD thesis, Stanford, CA, USA, 2014.

[34] Seltzer, M., Bostic, K., McKusick, M. K., and Staelin, C. An implementation of a log-structured file system for unix. In Proceedings of the 1993 Winter USENIX Technical Conference (Berkeley, CA, USA, 1993), USENIX '93, USENIX Association, pp. 307-326.


[35] Seltzer, M., Smith, K. A., Balakrishnan, H., Chang, J., McMains, S., and Padmanabhan, V. File system logging versus clustering: a performance comparison. In Proceedings of the USENIX 1995 Technical Conference (Berkeley, CA, USA, 1995), TCON '95, USENIX Association, pp. 249-264.

[36] Tene, G., Iyengar, B., and Wolf, M. C4: the continuously concurrent compacting collector. In Proceedings of the International Symposium on Memory Management (New York, NY, USA, 2011), ISMM '11, ACM, pp. 79-88.

[37] Wilson, P. R. Uniprocessor garbage collection techniques. In Proceedings of the International Workshop on Memory Management (London, UK, 1992), IWMM '92, Springer-Verlag, pp. 1-42.

[38] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2012), NSDI '12, USENIX Association.

[39] Zorn, B. The measured cost of conservative garbage collection. Softw. Pract. Exper. 23, 7 (July 1993), 733-756.