GeePS: Scalable deep learning on distributed GPUs …Caffe [21], a state-of-the-art open-source system for deep learning on a single GPU, to store its data in GeePS, we can improve

GeePS: Scalable deep learning on distributed GPUswith a GPU-specialized parameter server

Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, Eric P. Xing

Carnegie Mellon University

AbstractLarge-scale deep learning requires huge computational re-

sources to train a multi-layer neural network. Recent systems

propose using 100s to 1000s of machines to train networks

with tens of layers and billions of connections. While the

computation involved can be done more efficiently on GPUs

than on more traditional CPU cores, training such networks

on a single GPU is too slow and training on distributed GPUs

can be inefficient, due to data movement overheads, GPU

stalls, and limited GPU memory. This paper describes a new

parameter server, called GeePS, that supports scalable deep

learning across GPUs distributed among multiple machines,

overcoming these obstacles. We show that GeePS enables

a state-of-the-art single-node GPU implementation to scale

well, such as to 13 times the number of training images pro-

cessed per second on 16 machines (relative to the original

optimized single-node code). Moreover, GeePS achieves a

higher training throughput with just four GPU machines than

that a state-of-the-art CPU-only system achieves with 108

machines.

1. IntroductionLarge-scale deep learning is emerging as a primary machine

learning approach for important, challenging problems such

as image classification [7, 14, 23, 33] and speech recogni-

tion [13, 18]. In deep learning, large multi-layer neural net-

works are trained without pre-conceived models to learn com-

plex features from raw input data, such as the pixels of labeled

images. Given sufficient training data and computing power,

deep learning approaches far outperform other approaches

for such tasks.

The computation required, however, is substantial—prior

studies have reported that satisfactory accuracy requires

training large (billion-plus connection) neural networks on

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee

provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and

the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.

Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, contact

the Owner/Author(s). Request permissions from [email protected] or Publications Dept., ACM, Inc., fax +1 (212)

869-0481.

EuroSys ’16 April 18–21, 2016, London, United Kingdom

Copyright c© 2016 held by owner/author(s). Publication rights licensed to ACM.

ACM 978-1-4503-4240-7/16/04. . . $15.00

DOI: http://dx.doi.org/10.1145/2901318.2901323

100s or 1000s of servers for days [7, 14]. Neural network

training is known to map well to GPUs [8, 23], but it has

been argued that this approach is only efficient for smaller

scale neural networks that can fit on GPUs attached to a single

machine [7]. The challenges of limited GPU memory and

inter-machine communication have been identified as major

limitations.

This paper describes GeePS, a parameter server system

specialized for scaling deep learning applications across

GPUs distributed among multiple server machines. Like pre-

vious CPU-based parameter servers [10, 24], GeePS handles

the synchronization and communication complexities asso-

ciated with sharing the model parameters being learned (the

weights on the connections, for neural networks) across paral-

lel workers. Unlike such previous systems, GeePS performs a

number of optimizations specially tailored to making efficient

use of GPUs, including pre-built indexes for “gathering” the

parameter values being updated in order to enable parallel

updates of many model parameters in the GPU, along with

GPU-friendly caching, data staging, and memory manage-

ment techniques.

GeePS supports data-parallel model training, in which

the input data is partitioned among workers on different

machines that collectively update shared model parameters

(that themselves are sharded across machines). This avoids

the excessive communication delays that would arise in

model-parallel approaches, in which the model parameters

are partitioned among the workers on different machines,

given the rich dependency structure of neural networks [33].

Data-parallel approaches are limited by the desire to fit the

entire model in each worker’s memory, and, as observed by

prior work [7, 9, 14, 22, 30, 33], this would seem to imply

that GPU-based systems (with their limited GPU memory)

are suited only for relatively small neural networks. GeePS

overcomes this apparent limitation by assuming control

over memory management and placement, and carefully

orchestrating data movement between CPU and GPU memory

based on its observation of the access patterns at each layer

of the neural network.

Experiments show that single-GPU codes can be easily

modified to run with GeePS and obtain good scalable per-

formance across multiple GPUs. For example, by modifying

Caffe [21], a state-of-the-art open-source system for deep

learning on a single GPU, to store its data in GeePS, we can

improve Caffe’s training throughput (images per second) by

13× using 16 machines. Using GeePS, less than 8% of the

GPU’s time is lost to stalls (e.g., for communication, synchro-

nization, and data movement), as compared to 65% when

using an efficient CPU-based parameter server implementa-

tion. In terms of image classification accuracy, GeePS’s rate

of improvement on 16 machines is 8× faster than the single-

GPU optimized Caffe’s. The training throughput achieved

with just four GPU machines beats that reported recently

for a state-of-the-art 108-machine CPU-only system (Pro-

jectAdam) [7], and the accuracy improvement with just 16

GPU machines is 4× faster than what was reported for a 58-

machine ProjectAdam configuration. Experiments with video

classification via recurrent neural networks show similar re-

sults. Interestingly, in contrast with recent work [7, 19, 24],

we find that for deep learning on GPUs, BSP-style execu-

tion leads to faster accuracy improvements than more asyn-

chronous parameter consistency models, because the negative

impact of staleness outweighs the benefits of reduced com-

munication delays.

Experiments also confirm the efficacy of GeePS’s support

for data-parallel training of very large neural networks on

GPUs. For example, results are shown for a 20 GB neural

network (5.6 billion connections) trained on GPUs with only

5 GB memory, with the larger CPU memory holding most of

the parameters and intermediate layer state most of the time.

By moving data between CPU memory and GPU memory in

the background, GeePS is able to keep the GPU engines busy

without suffering a significant decrease in training throughput

relative to the case of all data fitting into GPU memory.

This paper makes three primary contributions. First, it de-

scribes the first GPU-specialized parameter server design and

the changes needed to achieve efficient data-parallel multi-

machine deep learning with GPUs. Second, it reports on

large-scale experiments showing that GeePS indeed supports

scalable data parallel execution via a parameter server, in

contrast to previous expectations [7]. Third, it introduces new

parameter server support for enabling such data-parallel deep

learning on GPUs even when models are too big to fit in GPU

memory, by explicitly managing GPU memory as a cache for

parameters and intermediate layer state.

The remainder of this paper is organized as follows.

Section 2 motivates GeePS’s design with background on deep

learning, GPU architecture, and parameter servers for ML.

Section 3 describes how GeePS’s design differs from previous

CPU-based parameter server systems. Section 4 describes the

GeePS implementation. Section 5 presents results from deep

learning experiments with models of various sizes, including

comparison to a state-of-the-art CPU-based parameter server.

Section 6 discusses additional related work.

2. High performance deep learningThis section briefly describes deep learning, using image clas-

sification as a concrete example, and common approaches

to achieving good performance by using GPUs or by paral-

lelizing over 100s of traditional CPU-based machines with

a parameter server architecture. Our goal is to enable the

two approaches to be combined, with a parameter server

system that effectively supports parallelizing over multiple

distributed GPUs.

2.1 Deep learningIn deep learning, the ML programmer/user does not specify

which specific features of the raw input data correlate with

the outcomes being associated. Instead, the ML algorithm

determines which features correlate most strongly by training

a neural network with a large number of hidden layers [5],

which consists of a layered network of nodes and edges

(connections), as depicted in Figure 1.

Because deep learning is somewhat difficult to describe in

the abstract, this section instead does so by describing how it

works for the specific case of image classification [7, 14, 23,

28] and video classification [16, 34]. An image classification

network classifies images (raw pixel maps) into pre-defined

labels and is trained using a set of training images with known

labels. The video classification network classifies videos (a

sequence of image frames) into pre-defined labels, and often

uses an image classification network as a submodule. This

subsection describes an image classification network first and

then describes how a video classification network can be built

on it.

Convolutional layer

Input layer

Fully connected layer

Fully connected layer Label probabilities

Image pixels

Intermediate states

Connection weights

Figure 1. A convolutional neural network, with one convo-

lutional layer and two fully connected layers.

The image classification task often uses a type of model

called a convolutional neural network (CNN). The first layer

of the nodes (input of the network) are the raw pixels of

the input image, and the last layer of the nodes (output of

the network) are the probabilities that this image should

be assigned to each label. The nodes in the middle are

intermediate states. To classify an image using such a neural

network, the image pixels will be assigned as the values

for the first layer of nodes, and these nodes will activatetheir connected nodes of the next layer. There is a weightassociated with each connection, and the value of each node

at the next layer is a prespecified function of the weighted

values of its connected nodes. Each layer of nodes is activated,

one by one, by the setting of the node values for the layer

below. This procedure is called a forward pass.

There are two types of layers: those with weights to be

trained and those with fixed functions (no weights to be

trained). Common examples of the former are fully connectedlayers, in which the value of each node is a weighted sum of

all the node values at the prior level, and convolutional layers,

in which the value of each node is the result of applying a

convolution function over a (typically small) subset of the

node values.

A common way of training a neural network is to use

a stochastic gradient descent (SGD) algorithm. For each

training image, a forward pass is done to activate all nodes

using the current weights. The values computed for each

node are retained as intermediate states. At the last layer, an

error term is calculated by comparing the predicted label

probabilities with the true label. Then, the error terms are

propagated back through the network with a backward pass.

During the backward pass, the gradient of each connection

weight is calculated from the error terms and the retained

node values, and the connection weights (i.e., the model

parameters) are updated using these gradients.

For efficiency, most training applications do each forward

and backward pass with a batch of images (called a mini-batch) instead of just one image. For each image inside the

mini-batch, there will be one set of node activation values

during the forward pass and one set of error terms during the

backward pass. Convolutional layers tend to have far fewer

weights than fully connected layers, both because there are far

fewer connections and because, by design, many connections

share the same weight.

Video classification tasks often use a structure called a

recurrent neural network (RNN). An RNN is made up of

multiple layers with recurrent (i.e. feedback) connections,

called recurrent layers, such that a static unrolling of the

RNN would be a very deep network with shared weights

between some of the layers. The Long-Short Term Memory(LSTM) layer [20] is one popular type of recurrent layers

that is frequently used for vision and speech tasks to capture

sequence information of the data [16, 29, 34]. The LSTM

layer contains memory cells that “remember” knowledge of

previous timestamps, and the memory is updated selectively

at each timestamp, controlled by special gate functions. A

common approach for using RNNs on vision tasks, such as

video classification [16, 34] and image captioning [29], is to

stack LSTM layers on top of CNN layers. The CNN layers

serve as an encoder that converts each frame of a video into

a feature vector and feeds the video as a sequence of feature

vectors into the LSTM layers. In order to train the LSTM

layers, a complete sequence of the image frames need to be

in one mini-batch.

2.2 Deep learning using GPUsGPUs are often used to train deep neural networks, be-

cause the primary computational steps match their single-

instruction-multiple-data (SIMD) nature and they provide

much more raw computing capability than traditional CPU

cores. Most high end GPUs are on self-contained devices

that can be inserted into a server machine, as illustrated in

Figure 2. One key aspect of GPU devices is that they have

dedicated local memory, which we will refer to as “GPU

memory,” and their computing elements are only efficient

when working on data in that GPU memory. Data stored out-

side the device, in CPU memory, must first be brought into

the GPU memory (e.g., via PCI DMA) for it to be accessed

efficiently.

DRAM(CPU memory)

GPU device

GPUmemory

(a few GB)

GPU coresNIC

Network

CPU cores

...

Localstorage

Figure 2. A machine with a GPU device.

Neural network training is an excellent match to the GPU

computing model. For example, the forward pass of a fully

connected layer, for which the value of each output node

is calculated as the weighted sum of all input nodes, can

be expressed as a matrix-matrix multiplication for a whole

mini-batch. During the backward pass, the error terms and

gradients can also be computed with similar matrix-matrix

multiplications. These computations can be easily decom-

posed into SIMD operations and be performed efficiently with

the GPU cores. Computations of other layers, such as con-

volution, have similar SIMD properties and are also efficient

on GPUs. NVIDIA provides libraries for launching these

computations on GPUs, such as the cuBLAS library [1] for

basic linear algebra computations and the cuDNN library [2]

for neural-network specific computations (e.g., convolution).

Caffe [21] is an open-source deep learning system that

uses GPUs. In Caffe, a single-threaded worker launches and

joins with GPU computations, by calling NVIDIA cuBLAS

and cuDNN libraries, as well as some customized CUDA

kernels. Each mini-batch of training data is read from an

input file via the CPU, moved to GPU memory, and then

processed as described above. For efficiency, Caffe keeps all

model parameters and intermediate states in the GPU memory.

As such, it is effective only for models and mini-batches small

enough to be fully held in GPU memory. Figure 3 illustrates

the CPU and GPU memory usage for a basic Caffe scenario.

2.3 Scaling ML with a parameter serverWhile early parallel ML implementations used direct message

passing (e.g., via MPI) among threads for update exchanges, a

parameter server architecture has become a popular approach

to making it easier to build and scale ML applications

Inputdata

GPU memoryCPU memory

Intermediatedata

Parameter data

Staging memoryfor input data batch

Input data file(training data)

Figure 3. Single GPU ML, such as with default Caffe.

across CPU-based clusters [3, 4, 7, 10, 11, 14, 19, 25, 31,

36], particularly for data-parallel execution. Indeed, two of

the largest efforts to address deep learning have used this

architecture [7, 14].

Figure 4 illustrates the basic parameter server architec-

ture. All state shared among application workers (i.e., the

model parameters being learned) is kept in distributed shared

memory implemented as a specialized key-value store called

a “parameter server”. An ML application’s workers process

their assigned input data and use simple Read and Update

methods to fetch or apply a delta to parameter values, leaving

the communication and consistency issues to the parameter

server. The value type is often application defined, but is re-

quired to be serializable and be defined with an associative

and commutative aggregation function, such as plus or multi-

ply, so that updates from different workers can be applied in

any order. In our image classification example, the value type

could be an array of floating point values and the aggregation

function could be plus.

Parameter cache

...Worker WorkerMachine

...

Parameter ServerParameter data

Parameter cache

...Worker WorkerMachine

Figure 4. Parallel ML with parameter server.

To reduce remote communication, a parameter server sys-

tem includes client-side caches that serve most operations

locally. While some systems rely entirely on best-effort asyn-

chronous propagation of parameter updates, many include an

explicit Clock method to identify a point (e.g., the end of an

iteration or mini-batch) at which a worker’s cached updates

should be pushed to the shared key-value store and its local

cache state should be refreshed. The consistency model can

conform to the Bulk Synchronous Parallel (BSP) model, in

which all updates from the previous clock must be visible

before proceeding to the next clock, or can use a looser but

still bounded model. For example, the Stale Synchronous

Parallel (SSP) model [10, 19] allows the fastest worker to be

ahead of the slowest worker by a bounded number of clocks.

Both models have been shown to converge, experimentally

and theoretically, with different tradeoffs.

While the picture illustrates the parameter server as sepa-

rate from the machines executing worker threads, and some

systems do work that way, the server-side parameter server

state is commonly sharded across the same machines as the

worker threads. The latter approach is particularly appro-

priate when considering a parameter server architecture for

GPU-based ML execution, since the CPU cores and CPU

memory is otherwise used only for input data buffering and

preparation.

Given its proven value in CPU-based distributed ML, it

is natural to use the same basic architecture and program-

ming model with distributed ML on GPUs. To explore its

effectiveness, we ported two applications (the Caffe system

discussed above and a multi-class logistic regression (MLR)

program) to a state-of-the-art parameter server system (Iter-

Store [11]). Doing so was straightforward and immediately

enabled distributed deep learning on GPUs, confirming the

application programmability benefits of the data-parallel pa-

rameter server approach. Figure 5 illustrates what sits where

in memory, to allow comparison to Figure 3 and designs

described later.

Inputdata

GPU memoryCPU memory

Intermediatedata

Param working copyParameter cacheParameter

server shard 0

Staging memoryfor PS access Staging memory

for input data batch

Network


Figure 5. Distributed ML on GPUs using a CPU-based

parameter server. The right side of the picture is much like the

single-GPU illustration in Figure 3. But, a parameter server shard

and client-side parameter cache are added to the CPU memory, and

the parameter data originally only in the GPU memory is replaced

in GPU memory by a local working copy of the parameter data.

Parameter updates must be moved between CPU memory and GPU

memory, in both directions, which requires an additional application-

level staging area since the CPU-based parameter server is unaware

of the separate memories.

While it was easy to get working, the performance was not

acceptable. As noted by Chilimbi et al. [7], the GPU’s com-

puting structure makes it “extremely difficult to support data

parallelism via a parameter server” using current implementa-

tions, because of GPU stalls, insufficient synchronization/con-

sistency, or both. Also as noted by them and others [30, 33],

the need to fit the full model, as well as a mini-batch of input

data and intermediate neural network states, in the GPU mem-

ory limits the size of models that can be trained. The next

section describes our design for overcoming these obstacles.

3. GPU-specialized parameter server designThis section describes three primary specializations to a

parameter server to enable efficient support of parallel ML

applications running on distributed GPUs: explicit use of

GPU memory for the parameter cache, batch-based parameter

access methods, and parameter server management of GPU

memory on behalf of the application. The first two address

performance, and the third expands the range of problem sizes

that can be addressed with data-parallel execution on GPUs.

Also discussed is the topic of execution model synchrony,

which empirically involves a different choice for data-parallel

GPU-based training than for CPU-based training.

3.1 Maintaining the parameter cache in GPU memoryOne important change needed to improve parameter server

performance for GPUs is to keep the parameter cache in GPU

memory, as shown in Figure 6. (Section 3.3 discusses the case

where everything does not fit.) Perhaps counter-intuitively,

this change is not about reducing data movement between

CPU memory and GPU memory—the updates from the local

GPU must still be moved to CPU memory to be sent to other

machines, and the updates from other machines must still be

moved from CPU memory to GPU memory. Rather, moving

the parameter cache into GPU memory enables the parameter

server client library to perform these data movement steps

in the background, overlapping them with GPU computing

activity. Then, when the application uses the read or update

functions, they proceed within the GPU memory. Putting the

parameter cache in GPU memory also enables updating of

the parameter cache state using GPU parallelism.

Parameter cache

Inputdata

Intermediatedata

Parameter server shard 0

Staging memory for parameter cache

GPU memoryCPU memoryNetwork

Param working copy



Figure 6. Parameter cache in GPU memory. In addition to the

movement of the parameter cache box from CPU memory to GPU

memory, this illustration differs from Figure 5 in that the associated

staging memory is now inside the parameter server library. It is used

for staging updates between the network and the parameter cache,

rather than between the parameter cache and the GPU portion of the

application.

3.2 Pre-built indexes and batch operationsGiven the SIMD-style parallelism of GPU devices, per-value

read and update operations of arbitrary model parameter val-

ues can significantly slow execution. In particular, perfor-

mance problems arise from per-value locking, index lookups,

and data movement. To realize sufficient performance, our

GPU-specialized parameter server supports batch-based in-

terfaces for reads and updates. Moreover, GeePS exploits

the iterative nature of model training [11] to provide batch-

wide optimizations, such as pre-built indexes for an entire

batch that enable GPU-efficient parallel “gathering” and up-

dating of the set of parameters accessed in a batch. These

changes make parameter server accesses much more efficient

for GPU-based training.

3.3 Managing limited GPU device memoryAs noted earlier, the limited size of GPU device memory

was viewed as a serious impediment to data-parallel CNN

implementations, limiting the size of the model to what could

fit in a single device memory. Our parameter server design

addresses this problem by managing the GPU memory for

the application and swapping the data that is not currently

being used to CPU memory. It moves the data between GPU

and CPU memory in the background, minimizing overhead

by overlapping the transfers with the training computation,

and our results demonstrate that the two do not interfere with

one another.

Managing GPU memory inside the parameter server.Our GPU-specialized parameter server design provides read

and update interfaces with parameter-server-managed buffers.

When the application reads parameter data, the parameter

server client library will allocate (user-level allocation im-

plemented by the parameter server) a buffer in GPU memory

for it and return the pointer to this buffer to the applica-

tion, instead of copying the parameter data to an application-

provided buffer. When the application finishes using the pa-

rameter data, it returns the buffer to the parameter server. We

call those two interfaces Read and PostRead. When the ap-

plication wants to update parameter data, it will first request

a buffer from the parameter server using PreUpdate and use

this buffer to store its updates. The application calls Update

to pass that buffer back, and the parameter server library will

apply the updates stored in the buffer and reclaim the buffer

memory.

The application can also store their local non-parameter

data (e.g., intermediate states) in the parameter server using

similar interfaces. The local data will not be shared with the

other application workers, so accessing the local data will be

much faster than accessing the parameter data. For example,

when the application reads the local data, the parameter

server will just return a pointer that points to the stored

local data, without copying it to a separate buffer. Similarly,

the application can directly modify the requested local data,

without needing to issue an explicit Update operation.

Method name Input Description BlockingRead list of keys and data staleness bound request a buffer, filled with parameter data yes

PostRead buffer from Read call release the buffer no

PreUpdate list of keys request an empty buffer, structured for parameter data yes

Update buffer from PreUpdate call release the buffer and save the updates no

LocalAccess list of keys for local data request a buffer, (by default) filled with local data yes

PostLocalAccess buffer from LocalAccess call release the buffer and (by default) save the data in it no

TableClock table ID commit all updates to one table no

Table 1. GeePS API calls used for access to parameter data and GeePS-managed local data.

Swapping data to CPU memory when it does not fit.The parameter server client library will be able to manage

all the GPU memory on a machine, if the application keeps

all its local data in the parameter server and uses the PS-

managed buffers. When the GPU memory of a machine is not

big enough to host all data, the parameter server will store

part of the data in the CPU memory. The application still

accesses everything through GPU memory, as before, and

the parameter server library will do the data movement for

it. When the application Reads parameter data that is stored

in CPU memory, the parameter server will perform this read

using CPU cores and copy the data from CPU memory to an

allocated GPU buffer, likewise for local data Reads. Figure 7

illustrates the resulting data layout in the GPU and CPU

memories.

Pinned local data


Parameter server shard 0

Staging memory for parameter cache

GPU memoryCPU memoryNetwork

Access buffer pool

Local data(CPU part)

Parameter cache(CPU part)

Pinned param cache


Figure 7. Parameter cache and local data partitioned across

CPU and GPU memories. When all parameter and local data

(input data and intermediate data) cannot fit within GPU memory,

our parameter server can use CPU memory to hold the excess.

Whatever amount fits can be pinned in GPU memory, while the

remainder is transferred to and from buffers that the application can

use, as needed.

GPU/CPU data movement in the background. Copy-

ing data between GPU and CPU memory could significantly

slow down data access. To minimize slowdowns, our param-

eter server uses separate threads to perform the Read and

Update operations in the background. For an Update oper-

ation, because the parameter server owns the update buffer,

it can apply the updates in the background and reclaim the

update buffer after it finishes. In order to perform the Read

operations in the background, the parameter server will need

to know in advance the sets of parameter data that the ap-

plication will access. Fortunately, iterative applications like

neural network training typically apply the same parameter

data accesses every iteration [11], so the parameter server

can easily predict the Read operations and perform them in

advance in the background.

3.4 Eschewing asynchronyMany recent ML model training systems, including for neural

network training, use a parameter server architecture to

share state among data-parallel workers executing on CPUs.

Consistent reports indicate that, in such an architecture,

some degree of asynchrony (bounded or not) in parameter

update exchanges among workers leads to significantly faster

convergence than when using BSP [3, 7, 10, 14, 19, 24,

32]. We observe the opposite with data-parallel workers

executing on GPUs—while synchronization delays can be

largely eliminated, as expected, convergence is much slower

with the more asynchronous models because of reduced

training quality. This somewhat surprising observation is

supported and discussed further in Section 5.4.

4. GeePS implementationThis section describes GeePS, a GPU-specialized parameter

server system that implements the design aspects described

in Section 3.

4.1 GeePS data model and APIGeePS is a C++ library that manages both the parameter data

and local data for GPU-based machine learning applications

(such as Caffe). The distributed application program usually

creates one ML worker process on each machine and each of

them links to one instance of the GeePS library. Algorithm 1

gives an example structure of a deep learning application

using GeePS. The ML application worker often runs in a

single CPU thread that launches NVIDIA library calls or

customized CUDA kernels to perform computations on GPUs,

and it calls GeePS functions to access and release GeePS-

managed data. The GeePS APIs are summarized in Table 1.

GeePS manages all data as a collection of rows indexed

by keys. The rows are then logically grouped into tables, and

rows in the same table share the same attributes (e.g., data

age). In our current implementation, each row is defined as

a fixed sized array of float values, allowing efficient cross-

machine communication without any marshalling. In our

deep learning application, because the model parameters (i.e.,

connection weights of each layer) can have different sizes,

we store each model parameter as multiple rows in the same

table.

GeePS implements the read and update operations with

PS-managed buffers for parameter data access, and a pair of

operations for local data access, with which the application

can directly modify the accessed local data without an ex-

plicit update operation. GeePS also provides a TableClock

operation for application workers to signal the completion of

per-table updates, and the data age of a table (and the rows

in it) is defined as the number of times that the TableClock

operation is called on that table by all workers. Among all the

API calls, Read, PreUpdate, and LocalAccess are block-

ing, forcing the application worker to wait when data or buffer

space is not ready, and the other calls are all asynchronous

and return immediately. By making the application worker

wait on Read, GeePS supports three execution synchrony

models: BSP, SSP [19], and Asynchrony.

Some of our specializations (pre-built indices, background

Read, and data placement decisions) exploit knowledge of the

operation sequence of the application. Previous work shows

that one can easily get such operation sequence information

from many iterative ML applications (including deep learn-

ing), because they do the same (or nearly the same) sequence

of operations every iteration [11]. GeePS implements an op-

eration sequence gathering mechanism like that described by

Cui et al. [11]. It can gather the operation sequence either

in the first iteration or in a virtual iteration. For example,

in Algorithm 1, before the real training iterations start, the

application performs a virtual iteration, with all GeePS calls

being marked with a virtual flag, so that the operations are

only recorded by GeePS but no real actions are taken. GeePS

uses the gathered operation sequence knowledge as a hint

to build the data structures, build the access indices, make

GPU/CPU data placement decisions, and perform prefetching.

Since the gathered access information is used only as a hint,

knowing the exact operation sequence is not a requirement

for correctness, but a performance optimization.1

4.2 GeePS architectureStoring data. GeePS shards the parameter data across all

instances, and each GeePS instance stores one shard of the

parameter data in its parameter server shard. The parameter

server shards are not replicated, and fault tolerance is handled

by checkpointing. In order to reduce communication traffic,

each instance has a parameter cache that stores a local

1 For most DNN applications (including CNN and RNN), the application

accesses all model parameters every mini-batch, so the gathered information

is exact. For some applications with sparse training data (e.g., BOW

representation for NLP tasks), the bottom layer of the network might just

use a subset of the weights. Even for these tasks, the operation sequence

of a whole epoch still repeats. The operation sequence only changes when

the training data is shuffled across epochs, and, for this special case, we can

choose to prefetch all the parameter data that can possibly be used, when

there is enough memory.

Algorithm 1 A DNN application with GeePS

L← number of layers in the network

paramDataKeys← decide row keys for param data

localDataKeys← decide row keys for local data

# Report access information with a virtual iterationTRAINMINIBATCH(null, virtual = yes)

# Real training iterationswhile not done do

TRAINMINIBATCH(nextTrainData, virtual = false)

end whilefunction TRAINMINIBATCH(trainData, virtual)

# Forward passfor i = 0 ∼ (L− 1) do

paramDataPtr ←geeps.Read(paramDataKeysi, virtual)

localDataP tr ←geeps.LocalAccess(localDataKeysi, virtual)

if not virtual thenSetup layeri with data pointers

Forward computation of layeriend ifgeeps.PostRead(paramDataPtr)geeps.PostLocalAccess(localDataP tr)

end for# Backward passfor i = (L− 1) ∼ 0 do

paramDataPtr ←geeps.Read(paramDataKeysi, virtual)

paramUpdateP tr ←geeps.PreUpdate(paramDataKeysi, virtual)

localDataP tr ←geeps.LocalAccess(localDataKeysi, virtual)

if not virtual thenSetup layeri with data pointers

Backward computation of layeriend ifgeeps.PostRead(paramDataPtr)geeps.Update(paramUpdateP tr)geeps.PostLocalAccess(localDataP tr)geeps.TableClock(table = i, virtual)

end forend function

snapshot of the parameter data, and the parameter cache

is refreshed from the parameter server shards, such as at

every clock for BSP. When the application applies updates

to the parameter data, those updates are also stored in the

parameter cache (a write-back cache) and will be submitted

to the parameter server shards at the end of every clock

(when a TableClock is called). The parameter cache has two

parts, a GPU-pinned parameter cache and a CPU parameter

cache. If everything fits in GPU memory, only the GPU

parameter cache is used. But, if the GPU memory is not big

enough, GeePS will keep some parameter data in the CPU

parameter cache. (The data placement policies are described

in Section 4.4.) Each GeePS instance also has an access bufferpool in GPU memory, and GeePS allocates GPU buffers for

Read and PreUpdate operations from the buffer pool. When

PostRead or Update operations are called, the memory will

be reclaimed by the buffer pool. GeePS manages application’s

input data and intermediate states as local data. The local

data also has a GPU-pinned part and a CPU part, with the

CPU part only used if necessary. GeePS divides the key space

into multiple partitions, and the rows in different partitions

are physically managed in different data structures and with

different sets of communication threads.

Data movement across machines. GeePS performs com-

munication across machines asynchronously with three types

of background threads: keeper threads manage the parameter

data in parameter server shards; pusher threads send parame-

ter data updates from parameter caches to parameter server

shards, by sending messages to keeper threads; puller threads

receive parameter data from parameter server shards to pa-

rameter caches, by receiving messages from keeper threads.

The communication is implemented using sockets, so the

data needs to be copied to some CPU staging memory before

being sent through the network, and the received data will also

be in the CPU staging memory. The pusher/puller threads

perform data movement between CPU memory and GPU

memory using CUDA APIs.

Data movement inside a machine. GeePS uses two back-

ground threads to perform the data access operations for

the application workers. The allocator thread performs the

Read, PreUpdate, and LocalAccess operations by allo-

cating buffers from the buffer pool and copying the re-

quested data to the buffers. The reclaimer thread performs the

PostRead, Update, and PostLocalAccess operations by

saving the data to parameter cache or local store and reclaim-

ing the buffers back to the buffer pool. These threads assign

and update parameter data in large batches with pre-built

indices by launching CUDA kernels on GPUs, as described

in Section 4.3.

Synchronization and data freshness guarantees. GeePS

supports BSP, asynchrony, and the Staleness Synchronous

Parallel (SSP) model [19], wherein a worker at clock t is

guaranteed to see all updates from all workers up to clock

t − 1 − slack, where the slack parameter controls the data

freshness. SSP with a slack of zero is the same as BSP.

To enforce SSP bounds, each parameter server shard

keeps a vector clock for each table, where each vector

clock entry stores the number of times each worker calls

the TableClock operation on that table. The data age of

each table in a parameter server shard is the minimal value

of the corresponding vector clock. The parameter cache also

keeps the data age information with the cached data, and the

allocator thread is blocked when the data is not fresh enough.

Locking. GeePS’s background threads synchronize with

each other, as well as the application threads, using mutex

locks and condition variables. Unlike some other CPU-based

parameter servers that use per-row locks [10, 11, 32], we

employ a coarse-grained locking design, where one set of

mutex lock and condition variable is used for a whole key

partition. We make this design decision for two reasons. First,

with coarse-grained locking, batched data operations can be

easily performed on a whole partition of rows. Second, unlike

CPU applications, where one application thread is launched

for each CPU core, a GPU application often has just one CPU

host thread interacting with each GeePS instance, making

lock contention less of an issue.

4.3 Parallelizing batched accessGeePS provides a key-value store interface to the appli-

cation, where each parameter row is named by a unique

key. When the application issues a read or update opera-

tion (for accessing a set of model parameters), it will pro-

vide a list of keys for the target rows. GeePS could use

a hash map to map the row keys to the locations where

the rows are stored. But, in order to make the batched ac-

cess be executed by all GPU cores, GeePS will use the fol-

lowing mechanism. Suppose the application update n rows,

each with m floating point values, in one Update opera-

tion, it will provide an array of n parameter row updates

{{updates[i][j]}mj=1}ni=1, and (provided in PreUpdate) an

array of n keys {keys[i]}ni=1. GeePS will use an index with nentries, where each of {index[i]}ni=1 stores the location of the

cached parameter update. Then, it will do the following data

operation for this Update: {{parameters[index[i]][j] +=updates[i][j]}mj=1}ni=1. This operation can be executed with

all the GPU cores. Moreover, the index can be built just once

for each batch of keys, based on the operation sequence gath-

ered as described earlier, and re-used for each instance of the

given batch access.

4.4 GPU memory managementGeePS keeps the GPU-pinned parameter cache, GPU-pinned

local data, and access buffer pool in GPU memory. They

will be all the GPU memory allocated in a machine if the

application keeps all its input data and intermediate states in

GeePS and uses the GeePS-managed buffers. GeePS will pin

as much parameter data and local data in GPU memory as

possible. But, if the GPU memory is not large enough, GeePS

will keep some of the data in CPU memory (the CPU part of

the parameter cache and/or CPU part of the local data).

In the extreme case, GeePS can keep all parameter data

and local data in the CPU memory. But, it will still need the

buffer pool to be in the GPU memory, and the buffer pool

needs to be large enough to keep all the actively used dataeven at peak usage. We refer to this peak memory usage as

peak size. In order to perform the GPU/CPU data movement

in the background, GeePS does double buffering by making

the buffer pool twice as large as the peak size.

Data placement policy. We will now describe our policy

for choosing which data to pin in GPU memory. In our

implementation, any local data that is pinned in GPU memory

does not need to use any access buffer space. The allocator

thread will just give the pointer to the pinned GPU local

data to the application, without copying the data. For the

parameter data, even though it is pinned in GPU memory,

the allocator thread still needs to copy it from the parameter

cache to an access buffer, because the parameter cache could

be modified by the background communication thread (the

puller thread) while the application is doing computation. As

a result, pinning local data in GPU memory gives us more

benefit than pinning parameter cache data. Moreover, if we

pin the local data that is used at the peak usage, we can reduce

the peak access buffer usage, because it does not need the

buffer, allowing us to reserve less memory for the access

buffer.

Algorithm 2 GPU/CPU data placement policy

Input: {paramData}, {localData} ← entries of all parameter data

and local data accessed at each layer

Input: totalMem← the amount of GPU memory to use

# Start with everything in CPU memory{cpuMem} ← {paramData} ∪ {localData}{gpuMem} ← ∅# Set access buffer twice the peak size for double bufferingpeakSize← peak data usage, excluding GPU local data

bufferSize← 2× peakSizeavailMem← totalMem− bufferSize# First pin local data used at peakwhile ∃data ∈ {localData} ∩ {cpuMem} ∩ {peakLayer} do

peakSizeDelta←peakSize change if data is moved to {gpuMem}

memSizeDelta← size(data) + 2× peakSizeDeltaif availMem < memSizeDelta then

breakend ifMove data from {cpuMem} to {gpuMem}availMem← availMem−memSizeDelta

end while# Pin more local data using the available memoryfor each data ∈ {localData} ∩ {cpuMem} do

if availMem ≥ size(data) thenMove data from {cpuMem} to {gpuMem}availMem← availMem− size(data)

end ifend for# Pin parameter data using the available memoryfor each data ∈ {paramData} ∩ {cpuMem} do

if availMem ≥ size(paramData) thenMove data from {cpuMem} to {gpuMem}availMem← availMem− size(data)

end ifend for# Dedicate the remaining available memory to the access bufferIncrease bufferSize by availMem

Algorithm 2 illustrates our GPU/CPU data placement

policy, and it only runs at the setup stage, after the access

information is gathered. The algorithm chooses the entries to

pin in GPU memory based on the gathered access information

and a given GPU memory budget. While keeping the access

buffer pool twice the peak size for double buffering, our

policy will first try to pin the local data that is used at the peak

in GPU memory, in order to reduce the peak size and thus the

size of the buffer pool. Then, it will try to use the available

capacity to pin more local data and parameter cache data in

GPU memory. Finally, it will add any remaining available

GPU memory to the access buffer.

Avoiding unnecessary data movement. When the appli-

cation accesses/post-accesses the local data that is stored

in CPU memory, by default, the allocator/reclaimer thread

will need to copy the data between the CPU memory and

the allocated GPU memory. However, sometimes this data

movement is not necessary. For example, when we train a

deep neural network, the input data and intermediate data

are overwritten every new mini-batch, and the old values

from the last mini-batch can be safely thrown away. To avoid

this unnecessary data movement, we allow the application

to specify a no-fetch flag when calling LocalAccess, and

it tells GeePS to just allocate an uninitialized piece of GPU

memory, without fetching the data from CPU memory. Sim-

ilarly, when the application calls PostLocalAccess with a

no-save flag, GeePS will just free the GPU memory, without

saving the data to CPU memory.

5. EvaluationThis section evaluates GeePS’s support for parallel deep learn-

ing over distributed GPUs, using two recent image classifica-

tion models and a video classification model executed in the

original and modified Caffe application. The evaluation con-

firms four main findings: (1) GeePS provides effective data-

parallel scaling of training throughput and training conver-

gence rate, at least up to 16 machines with GPUs. (2) GeePS’s

efficiency is much higher, for GPU-based training, than a tra-

ditional CPU-based parameter server and also much higher

than parallel CPU-based training performance reported in the

literature. (3) GeePS’s dynamic management of GPU memory

allows data-parallel GPU-based training on models that are

much larger than used in state-of-the-art deep learning for im-

age classification and video classification. (4) For GPU-based

training, unlike for CPU-based training, loose consistency

models (e.g., SSP and asynchronous) significantly reduce

convergence rate compared to BSP. Fortunately, GeePS’s ef-

ficiency enables significant scaling benefits even with larger

BSP-induced communication delays.

A specific non-goal of our evaluation is comparing the

classification accuracies of the different models. Our focus is

on enabling faster training of whichever model is being used,

which is why we measure performance for several.

5.1 Experimental setupApplication setup. We use Caffe [21], the open-source

single-GPU convolutional neural network application dis-

cussed earlier.2 Our experiments use unmodified Caffe to

represent the optimized single-GPU case and a minimally

modified instance (GeePS-Caffe) that uses GeePS for data-

parallel execution. GeePS-Caffe uses GeePS to manage all

its parameter data and local data (including input data and

intermediate data), using the same structure as illustrated in

Algorithm 1. The parameter data of each layer is stored as

rows of a distinct GeePS table, allowing GeePS to propagate

2 For the image classification application, we used the version of Caffe

from https://github.com/BVLC/caffe as of June 19, 2015. Since their

master branch version does not support RNN, for the video classification

application, we used the version from https://github.com/LisaAnne/

lisa-caffe-public/tree/lstm_video_deploy as of Nov 9, 2015.

the each layer’s updates during the computations of other

layers, as suggested by Zhang et al. [35]. Each GeePS row is

configured to be an array of 128 float values.

Cluster setup. Each machine in our cluster has one

NVIDIA Tesla K20C GPU, which has 5 GB of GPU device

memory. In addition to the GPU, each machine has four

2-die 2.1 GHz 16-core AMD R© Opteron 6272 packages

and 128 GB of RAM. Each machine is installed with 64-

bit Ubuntu 14.04, CUDA toolkit 7.5, and cuDNN v2. The

machines are inter-connected via a 40 Gbps Ethernet interface

(12 Gbps measured via iperf), and Caffe reads the input

training data from remote file servers via a separate 1 Gbps

Ethernet interface.

Image classification datasets and models. The experi-

ments use two datasets for image classification. The first one

is the ImageNet22K dataset [15], which contains 14 million

images labeled to 22,000 classes. Because of the computation

work required to train such a large dataset, CPU-based sys-

tems running on this dataset typically need a hundred or more

machines and spend over a week to reach convergence [7]. We

use half of the images (7 million images) as the training set

and the other half as the testing set, which is the same setup

as described by Chilimbi et al. [7]. For the ImageNet22K

dataset, we use a similar model to the one used to evaluate

ProjectAdam [7], which we refer to as the AdamLike model.3

The AdamLike model has five convolutional layers and three

fully connected layers. It contains 2.4 billion connections for

each image, and the model parameters are 470 MB in size.

The second dataset we used is Large Scale Visual Recogni-

tion Challenge 2012 (ILSVRC12) [26]. It is a subset of the Im-

ageNet22K dataset, with 1.3 million images labeled to 1000

classes. For this dataset, we use the GoogLeNet model [28], a

recent inception model from Google. The network has about

100 layers, and 22 of them have model parameters. Though

the number of layers is large, the model parameters are only

57 MB in size, because they use mostly convolutional layers.

Video classification datasets and models. We use the

UCF-101 dataset [27] for our video classification experiments.

UCF-101 has about 8,000 training videos and 4,000 testing

videos categorized into 101 human action classes. We use a

recurrent neural network model for this application, following

the approach described by Donahue et al. [16]. We use the

GoogLeNet network as the CNN layers and stack LSTM

layers with 256 hidden units on top of them. The weights of

the GoogLeNet layers have been pre-trained with the single

frames of the training videos. Following the same approach

as is described by Donahue et al. [16], we extract the video

frames at a rate of 30 frames per second and train the model

with randomly selected video clips of 32 frames each.

3 We were not able to obtain the exact model that ProjectAdam used, so

we emulated it based on the descriptions in the paper. Our emulated model

has the same number and types of layers and connections, and we believe

our training performance evaluations are representative even if the resulting

model accuracy may not be.

Training algorithm setup. Both the unmodified and the

GeePS-hosted Caffe train the models using the SGD algo-

rithm with a momentum of 0.9. Unless otherwise specified,

we use the configurations listed in Table 2. Our experiments

in Section 5.3 evaluate performance with different mini-batch

sizes. For AdamLike and GoogLeNet network, we used the

same learning rate for both single-machine training and dis-

tributed training, because we empirically found that this learn-

ing rate is the best setting for both. For the RNN model, we

used a different learning rate for single-machine training,

because it leads to faster convergence.

Model Mini-batch size(per machine) Learning rate

AdamLike 200 images0.0036,

divided by 10 every 3 epochs

GoogLeNet 32 images0.0036,


RNN1 video,

32 frames each

0.0000125 for 8 machines,

0.0001 for single-machine,


Table 2. Model training configurations.

GeePS setup. Unless otherwise specified, we let GeePS

keep the parameter cache and local data in GPU memory

for our experiments, since it all fits for all of the models

used; Section 5.3 evaluates performance when keeping part

of the data in CPU memory, including for a very large model

scenario. Unless otherwise specified, the BSP mode is used;

Section 5.4 analyzes the effect of looser synchronization

models.

5.2 Scaling deep learning with GeePSThis section evaluates how well GeePS supports data-parallel

scaling of GPU-based training on both image classification

and video classification application. We compare GeePS with

three classes of systems: (1) Single-GPU optimized training:

the original unmodified Caffe system (referred to as “Caffe”)

represents training optimized for execution on a single GPU.

(2) GPU workers with CPU-based parameter server: multiple

instances of the modified Caffe linked via IterStore [11] a

state-of-the-art CPU-based parameter server (“CPU-PS”). (3)

CPU workers with CPU-based parameter server: reported

performance numbers from recent literature are used to put

the GPU-based performance into context relative to state-of-

the-art CPU-based deep learning.

Image classification. Figure 8 shows the training through-

put scalability of the image classification application, in terms

of both the number of images trained per second and the num-

ber of network connections trained per second. Note that there

is a linear relationship between those two metrics. GeePS

scales almost linearly when we add more machines. Com-

pared to the single-machine optimized Caffe, GeePS achieves

13× speedups on both GoogLeNet and AdamLike model

with 16 machines. Compared to CPU-PS, GeePS achieves

over 2× more throughput. The GPU stall time of GeePS is

1 2 4 8 16Number of machines

0

400

800

1200

1600

2000

Images t

rain

ed/s

ec

GeePS

CPU-PS

Caffe

Ideal Caffe scaling

0

1

2

3

4

Connecti

ons t

rain

ed/s

ec1e12

(a) AdamLike model on ImageNet22K dataset.

1 2 4 8 16Number of machines

0

200

400

600

800

1000

1200

Images t

rain

ed/s

ec

GeePS

CPU-PS

Caffe

Ideal Caffe scaling

(b) GoogLeNet model on ILSVRC12 dataset.

Figure 8. Image classification throughput scalability. Both

GeePS and CPU-PS run in the BSP mode.

only 8% for both GoogLeNet and AdamLike model, so 92%

of the total runtime is spent on the application’s computa-

tional work. While using CPU-PS, the GPU stall time is 51%

and 65% respectively.

Chilimbi et al. [7] report that ProjectAdam trains 570 bil-

lion connections per second on the ImageNet22K dataset

when with 108 machines (88 CPU-based worker machines

with 20 parameter server machines) [7]. Figure 8(a) shows

that GeePS achieves higher throughput with only 4 GPU

machines, because of its efficient data-parallel execution on

GPUs.

Figure 9 shows the image classification top-1 testing

accuracies of the trained models. The top-1 classification

accuracy is defined as the fraction of the testing images that

are correctly classified. To evaluate convergence speed, we

will compare the amount of time required to reach a given

level of accuracy, which is a combination of image training

throughput and model convergence per trained image. For the

AdamLike model on the ImageNet22K dataset, Caffe needs

26.9 hours to reach 10% accuracy, while GeePS needs only

4.6 hours with 8 machines (6× speedup) or 3.3 hours with

16 machines (8× speedup). For the GoogLeNet model on

the ILSVRC12 dataset, Caffe needs 13.7 hours to reach 30%

accuracy, while GeePS needs only 2.8 hours with 8 machines

(5× speedup) or 1.8 hours with 16 machines (8× speedup).

The model training time speedups compared to the single-

GPU optimized Caffe are lower than the image training

throughput speedups, as expected, because each machine

0 4 8 12 16 20 24 28 32Time (hours)

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Testi

ng t

op-1

accura

cy

GeePS, 16 machines

GeePS, 8 machines

Caffe, 1 machine

(a) AdamLike model on ImageNet22K dataset.

0 2 4 6 8 10 12 14 16Time (hours)

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Testi

ng t

op-1

accura

cy

GeePS, 16 machines

GeePS, 8 machines

Caffe, 1 machine

(b) GoogLeNet model on ILSVRC12 dataset.

Figure 9. Image classification top-1 accuracies.

determines gradients independently. Even using BSP, more

training is needed than with a single worker to make the

model converge. But, the speedups are still substantial.

For the AdamLike model on the ImageNet22K dataset,

Chilimbi et al. [7] report that ProjectAdam needs one day

to reach 13.6% accuracy with 58 machines (48 CPU-based

worker machines with 10 parameter server machines). GeePS

needs only 6 hours to reach the same accuracy with 16 ma-

chines (about 4× speedup). To reach 13.6% accuracy, the

DistBelief system trained (a different model) with 2,000 ma-

chines for a week [14].

Because both GeePS and CPU-PS run in the BSP mode,

with the same number of machines, the accuracy improve-

ment speedups of GeePS over CPU-PS are the same as the

throughput speedups, so we leave them out of the graphs.

Video classification. Figure 10(a) shows the training

throughput of the video classification application. GeePS

scales linearly from Caffe (8× throughput with 8 machines).

Figure 10(b) shows the top-1 testing accuracies. To reach 60%

accuracy, Caffe used 3.6 hours, while GeePS with 8 machines

used 0.5 hours (7× speedup); to reach 68% accuracy, Caffe

used 8.4 hours, while GeePS with 8 machines used 2.4 hours

(3.5× speedup).

5.3 Dealing with limited GPU memoryAn oft-mentioned concern with data-parallel deep learning

on GPUs is that it can only be used when the entire model,

as well as all intermediate state and the input mini-batch, fit

1 2 4 8Number of machines

0

5

10

15

20V

ideos t

rain

ed/s

ec

GeePS

CPU-PS

Caffe

Ideal Caffe scaling

(a) Training throughput

0 2 4 6 8 10 12Time (hours)

0.00.10.20.30.40.50.60.70.8

Testi

ng t

op-1

accura

cy

GeePS, 8 machines

Caffe, 1 machine

(b) Top-1 accuracies.

Figure 10. Video classification scalability.

in GPU memory. GeePS eliminates this limitation with its

support for managing GPU memory and using it to buffer

data from the much larger CPU memory. Although all of

the models we experiment with (and most state-of-the-art

models) fit in our GPUs’ 5 GB memories, we demonstrate the

efficacy of GeePS’s mechanisms in two ways: by using only a

fraction of the GPU memory for the largest case (AdamLike)

and by experimenting with a much larger synthetic model.

We also show that GeePS’s memory management support

allows us to do video classification on longer videos.

Artificially shrinking available GPU memory. With a

mini-batch of 200 images per machine, training the Adam-

Like model on the ImageNet22K dataset requires only

3.67 GB memory per machine, with 123 MB for input data,

2.6 GB for intermediate states, and 474 MB each for parame-

ter data and computed parameter updates. Note that the sizes

of the parameter data and parameter updates are determined

by the model, while the input data and intermediate states

grow linearly with the mini-batch size. For best throughput,

GeePS also requires use of an access buffer that is large

enough to keep the actively used parameter data and parame-

ter updates at the peak usage, which is 528 MB minimal and

1.06 GB for double buffering (the default) to maximize over-

lapping of data movement with computation. So, in order to

keep everything in GPU memory, the GeePS-based training

needs 4.73 GB of GPU memory.

Recall, however, that GeePS can manage GPU memory

usage such that only the data needed for the layers being

processed at a given point need to be in GPU memory.

0 10forward

20 20 10backward

0

Neural network layers

0.0

0.2

0.4

0.6

0.8

1.0

Mem

ory

usage (

GB

)

Parameter updates

Parameter data

Intermediate states

Input data

0.00

0.04

0.08

0.12

0.16

Mem

ory

usage f

racti

on

Figure 11. Per-layer memory usage of AdamLike model on

ImageNet22K dataset.

Figure 11 shows the per-layer memory usage for training

the AdamLike model, showing that it is consistently much

smaller than the total memory usage. The left Y axis shows

the absolute size (in GB) for a given layer, and the right Y

axis shows the fraction of the absolute size over the total

size of 4.73 GB. Each bar is partitioned into the sizes of

input data, intermediate states, parameter data, and parameter

updates for the given layer. Most layers have little or no

parameter data, and most of the memory is consumed by the

intermediate states for neuron activations and error terms.

The layer that consumes the most memory uses about 17%

of the total memory usage, meaning that about 35% of the

4.73 GB is needed for full double buffering.

0 1 2 3 4 5GPU memory per machine (GB)

0

200

400

600

800

Images t

rain

ed p

er

sec 200 images per batch

100 images per batch

50 images per batch

Figure 12. Throughput of AdamLike model on Ima-

geNet22K dataset with different GPU memory budgets.

Figure 12 shows data-parallel training throughput using 8

machines, when we restrict GeePS to using different amounts

of GPU memory to emulate GPUs with smaller memories.

When there is not enough GPU memory to fit everything,

GeePS must swap data to CPU memory. For the case of 200

images per batch, when we swap all data in CPU memory,

we need only 35% of the GPU memory compared to keeping

all data in GPU memory, but we are still able to get 73% of

the throughput.

When the GPU memory limits the scale, people are often

forced to use smaller mini-batch sizes to let everything fit

in GPU memory. Our results in Figure 12 also shows that

using our memory management mechanism is more efficient

than shrinking the mini-batch size. For the three mini-batch

sizes compared, we keep the inter-machine communication

the same by doing multiple mini-batches per clock as needed

(e.g., four 50-image mini-batches per clock). For the case

of 100 images per batch and 50 images per batch, 3.7 GB

and 3.3 GB respectively are needed to keep everything in

GPU memory (including access buffers for double buffering).

While smaller mini-batches reduce the total memory require-

ment, they perform significantly less well for two primary

reasons: (1) the GPU computation is more efficient with a

larger mini-batch size, and (2) the time for reading and up-

dating the parameter data locally, which does not shrink with

mini-batch size, is amortized over more data.

Training a very large neural network. To evaluate per-

formance for much larger neural networks, we create and

train huge synthetic models. Each such neural network con-

tains only fully connected layers with no weight sharing, so

there is one model parameter (weight) for every connection.

The model parameters of each layer is about 373 MB. We

create multiple such layers and measure the throughput (in

terms of # connections trained per second) of training differ-

ent sized networks, as shown in Figure 13. For all sizes tested,

up to a 20 GB model (56 layers) that requires over 70 GB

total (including local data), GeePS is able to train the neural

network without excessive overhead. The overall result is

that GeePS’s GPU memory management mechanisms allows

data-parallel training of very large neural networks, bounded

by the largest layer rather than the overall model size.

0 5 10 15 20Model parameter size (GB)

0.0

0.4

0.8

1.2

Connecti

ons

train

ed/s

ec

1e12

Figure 13. Training throughput on very large models. Note

that the number of connections increases linearly with model size,

so the per-image training time grows with model size because the

per-connection training time stays relatively constant.

Video classification on longer videos. As is described

in Section 2.1, the video classification application requires

complete sequences of image frames to be in the same mini-

batch, so the GPU memory size will either limit the maximum

number of frames per video or force the model to be split

across multiple machines, incurring extra complexity and

network communication overhead. Using unmodified Caffe,

for example, our RNN can support a maximum video length

of 48 frames.4 Because the videos are often sampled at a

rate of 30 frames per second, a 48-frame video is less than

2 seconds in length. Ng et al. [34] find that using more frames

in a video improves the classification accuracy. In order to

use a video length of 120 frames, Ng et al. used a model-

4 Here, we are just considering the memory consumption of the training

stage. If we further consider the memory used for testing, the supported

maximum video length will be shorter.

parallel approach to split the model across four machines,

which incurs extra network communication overhead. By

contrast, with the memory management support of GeePS,

we are able to train videos with up to 192 frames, using solely

data parallelism.

5.4 The effects of looser synchronizationUse of looser synchronization models, such as Stale Syn-

chronous Parallel (SSP) or even unbounded asynchronous,

has been shown to provide significantly faster convergence

rates in data-parallel CPU-based model training systems [3,

7, 10, 14, 19, 24, 32]. This section shows results confirming

our experience that this does not hold true with GPU-based

deep learning.

GeePS GeePS-single-table CPU-PS0

2

4

6

8

Tim

e p

er

batc

h (

sec)

Slack 0 Slack 1Slack Inf

Slack 0

Slack 1Slack Inf

Slack 0Slack 1Slack Inf

Stall time

Computation time

Figure 14. Data-parallel per-mini-batch training time for

AdamLike model under different configurations. We refer to

“stall time” as any part of the runtime when GPUs are not

doing computations.

Figure 14 compares the per-mini-batch AdamLike model

training time with 8 machines, when using GeePS, GeePS-

single-table, and CPU-PS, and each of three synchroniza-

tion models: BSP (“Slack 0”), SSP (“Slack 1”), and Asyn-

chronous (“Slack Inf”). The GeePS-single-table setup has

the application store all its parameter data in a single table,

so that the parameter updates of all layers are sent to the

server together as a whole batch. While in the default GeePS

setup, where the parameter data of each layer is stored in a

distinct table, the parameter updates of a layer can be sent

to the server (and propagated to other workers) before the

computation of other layers finish. Showing the performance

of GeePS-single-table helps us understand the performance

improvement coming from decoupling the parameter data

of different layers. Each bar in Figure 14 divides the total

into two parts. First, the computation time is the time that the

application spends on training, which is mostly unaffected

by the parameter server artifact and synchronization model;

the non-BSP GeePS cases show a slight increase because of

minor GPU-internal resource contention when there is great

overlap between computation and background data move-

ment. Second, the stall time includes any additional time

spent on reading and updating parameter data, such as wait-

ing time for slow workers to catch up or updated parameters

to arrive, and time for moving data between GPU and CPU

memory.

There are two interesting observations here. First, even for

the Slack 0 (BSP) case, GeePS has little stall time. That is

because, with our specializations, the per-layer updates can

be propagated in the background, before the computations of

other layers finish. The second observation is that expensive

GPU/CPU data movement stalls CPU-PS execution, even

with asynchronous communication (Slack Inf). Though CPU-

PS also keeps parameter data of different layers in separate

tables, the application has to perform expensive GPU/CPU

data movement in the foreground each time it accesses the

parameter data.

0 5 10 15 20 25 30 35 40Million images trained

0.00

0.05

0.10

0.15

0.20

0.25

Testi

ng t

op-1

accura

cy

Staleness bound 0 (BSP)

Staleness bound 1 (SSP)

Staleness bound Inf (Async)

Figure 15. AdamLike model top-1 accuracy as a function of

the number of training images processed, for BSP, SSP with

slack 1, and Async.

Figure 15 compares the classification accuracy as a func-

tion of the number of training images processed, for GeePS

with BSP, SSP (Slack 1), and Async. The result shows that,

when using SSP (Slack 1) or Async, many more images must

be processed to reach the same accuracy as with BSP (e.g.,

2× more for Slack 1 and 3× more for Async to reach 10%

accuracy). The training throughput speedups achieved with

Slack 1 or Async are not sufficient to overcome the reduced

training quality per image processed—using BSP leads to

much faster convergence. We believe there are two reasons

causing this outcome. First, with our specializations, there is

little to no communication delay for DNN applications, so

adding data staleness does not increase the throughput much.

Second, the conditions in the SSP proofs of previous liter-

atures [19] do not apply to DNN, because training a DNN

is a non-convex problem. Interestingly, our observation is

consistent with the concern about using parameter servers for

data-parallel GPU-based training expressed by Chilimbi et

al. [7]: “Either the GPU must constantly stall while waiting

for model parameter updates or the models will likely diverge

due to insufficient synchronization.” Fortunately, GeePS’s

GPU-specialized design greatly reduces the former effect,

allowing BSP-based execution to scale well.

6. Additional related workThis section augments the background related work discussed

in Section 2, which covered use of parameter servers and

individual GPUs for deep learning.

Coates et al. [9] describe a specialized multi-machine

GPU-based system for parallel deep learning. The architec-

ture used is very different than GeePS. Most notably, it relies

on model parallelism to partition work across GPUs, rather

than the simpler data-parallel model. It also uses special-

ized MPI-based communication over Infiniband, rather than

a general parameter server architecture, regular sockets, and

Ethernet.

Deep Image [33] is a custom-built supercomputer for deep

learning via GPUs. The GPUs used have large memory ca-

pacity (12 GB), and their image classification network fits

within it, allowing use of data-parallel execution. They also

support for model-parallel execution, with ideas borrowed

from Krizhevsky et al. [22], by partitioning the model on fully

connected layers. The machines are interconnected by Infini-

band with GPUDirect RDMA, so no CPU involvement is

required, and they do not use the CPU cores or CPU memory

to enhance scalability like GeePS does. Deep Image exploits

its low latency GPU-direct networking for specialized param-

eter state exchanges rather than using a general parameter

server architecture like GeePS.

MXNet [6] and Poseidon [35] are two concurrently de-

veloped ([12]) systems for multi-GPU deep learning. Both

systems take the data-parallel approach by making use of

CPU-based parameter servers (Poseidon uses Bosen [32]

and MXNet uses ParameterServer [24]). GeePS differs from

MXNet and Poseidon in two primary ways. First, in order

to overcome the inefficiency of using CPU-based parameter

servers, both MXNet and Poseidon rely on optimizations to

their specific GPU application system (Poseidon is built on

Caffe and MXNet writes their own). GeePS, on the other

hand, specializes its reusable parameter server module, pro-

viding efficiency for all GPU deep learning applications it

hosts. Indeed, the application improvements made for MXNet

and Poseidon would further improve our reported perfor-

mance numbers. Second, both MXNet and Poseidon require

that each of their GPU machines has enough GPU memory

to store all model parameters and intermediate states, limit-

ing the size of their neural networks. GeePS’s explicit GPU

memory management support, on the other hand, allows the

training of neural networks that are much bigger than the

available GPU memory.

7. ConclusionsGeePS is a new parameter server for data-parallel deep learn-

ing on GPUs. Experimental results show that GeePS enables

scalable training throughput, resulting in faster convergence

of model parameters when using multiple GPUs and much

faster convergence than CPU-based training. In addition,

GeePS’s explicit GPU memory management support enables

GPU-based training of neural networks that are much larger

than the GPU memory, swapping data to and from CPU mem-

ory in the background. Combined, GeePS enables use of

data-parallel execution and the general-purpose parameter

server model to achieve efficient, scalable deep learning on

distributed GPUs.

AcknowledgmentsWe thank our shepherd Derek Murray and the EuroSys re-

viewers for their detailed feedback. We thank the members

and companies of the PDL Consortium (including Avago,

Citadel, EMC, Facebook, Google, Hewlett-Packard Labs, Hi-

tachi, Intel, Microsoft Research, MongoDB, NetApp, Oracle,

Samsung, Seagate, Symantec, Two Sigma, and Western Digi-

tal) for their interest, insights, feedback, and support. This re-

search is supported in part by Intel as part of the Intel Science

and Technology Center for Cloud Computing (ISTC-CC),

National Science Foundation under awards CNS-1042537

and CNS-1042543 (PRObE [17]).

References[1] NVIDIA cuBLAS https://developer.nvidia.com/

cublas.

[2] NVIDIA cuDNN https://developer.nvidia.com/

cudnn.

[3] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J.

Smola. Scalable inference in latent variable models. In WSDM,

2012.

[4] S. Ahn, B. Shahbaba, and M. Welling. Distributed stochastic

gradient MCMC. In ICML, 2014.

[5] Y. Bengio, Y. LeCun, et al. Scaling learning algorithms towards

AI. Large-scale kernel machines, 34(5), 2007.

[6] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao,

B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and

efficient machine learning library for heterogeneous distributed

systems. arXiv preprint arXiv:1512.01274, 2015.

[7] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman.

Project Adam: Building an efficient and scalable deep learning

training system. In OSDI, 2014.

[8] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep

neural networks for image classification. In CVPR, 2012.

[9] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and

N. Andrew. Deep learning with COTS HPC systems. In

ICML, 2013.

[10] H. Cui, J. Cipar, Q. Ho, J. K. Kim, S. Lee, A. Kumar, J. Wei,

W. Dai, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P.

Xing. Exploiting bounded staleness to speed up big data

analytics. In USENIX ATC, 2014.

[11] H. Cui, A. Tumanov, J. Wei, L. Xu, W. Dai, J. Haber-Kucharsky,

Q. Ho, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P.

Xing. Exploiting iterative-ness for parallel ML computations.

In SoCC, 2014.

[12] H. Cui, G. R. Ganger, and P. B. Gibbons. Scalable deep learn-

ing on distributed GPUs with a GPU-specialized parameter

server. CMU PDL Technical Report (CMU-PDL-15-107),

2015.

[13] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent

pre-trained deep neural networks for large-vocabulary speech

recognition. IEEE Transactions on Audio, Speech, and Lan-guage Processing, 20(1), 2012.

[14] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao,

A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale

distributed deep networks. In NIPS, 2012.

[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei.

ImageNet: A large-scale hierarchical image database. In CVPR,

2009.

[16] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach,

S. Venugopalan, K. Saenko, and T. Darrell. Long-term recur-

rent convolutional networks for visual recognition and descrip-

tion. arXiv preprint arXiv:1411.4389, 2014.

[17] G. Gibson, G. Grider, A. Jacobson, and W. Lloyd. PRObE:

A thousand-node experimental cluster for computer systems

research. USENIX ;login:, 2013.

[18] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed,

N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath,

et al. Deep neural networks for acoustic modeling in speech

recognition: The shared views of four research groups. IEEESignal Processing Magazine, 29(6), 2012.

[19] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons,

G. A. Gibson, G. R. Ganger, and E. P. Xing. More effective

distributed ML via a Stale Synchronous Parallel parameter

server. In NIPS, 2013.

[20] S. Hochreiter and J. Schmidhuber. Long short-term memory.

Neural computation, 9(8), 1997.

[21] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-

shick, S. Guadarrama, and T. Darrell. Caffe: Convolutional

architecture for fast feature embedding. arXiv preprintarXiv:1408.5093, 2014.

[22] A. Krizhevsky. One weird trick for parallelizing convolutional

neural networks. arXiv preprint arXiv:1404.5997, 2014.

[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet

classification with deep convolutional neural networks. In

NIPS, 2012.

[24] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed,

V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling

distributed machine learning with the parameter server. In

OSDI, 2014.

[25] R. Power and J. Li. Piccolo: Building fast, distributed programs

with partitioned tables. In OSDI, 2010.

[26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,

Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg,

and L. Fei-Fei. ImageNet large scale visual recognition

challenge. International Journal of Computer Vision, 2015.

[27] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset

of 101 human actions classes from videos in the wild. arXivpreprint arXiv:1212.0402, 2012.

[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,

D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper

with convolutions. arXiv preprint arXiv:1409.4842, 2014.

[29] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show

and tell: A neural image caption generator. arXiv preprintarXiv:1411.4555, 2014.

[30] M. Wang, T. Xiao, J. Li, J. Zhang, C. Hong, and Z. Zhang.

Minerva: A scalable and highly efficient training platform for

deep learning. NIPS 2014 Workshop of Distributed Matrix

Computations, 2014.

[31] Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang,

Y. Gao, J. Zeng, Q. Yang, et al. Towards topic modeling for

big data. arXiv preprint arXiv:1405.4402, 2014.

[32] J. Wei, W. Dai, A. Qiao, Q. Ho, H. Cui, G. R. Ganger,

P. B. Gibbons, G. A. Gibson, and E. P. Xing. Managed

communication and consistency for fast data-parallel iterative

analytics. In SoCC, 2015.

[33] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep

image: Scaling up image recognition. arXiv preprint

arXiv:1501.02876, 2015.

[34] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan,

O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets:

Deep networks for video classification. In CVPR, 2015.

[35] H. Zhang, Z. Hu, J. Wei, P. Xie, G. Kim, Q. Ho, and

E. Xing. Poseidon: A system architecture for efficient GPU-

based deep learning on multiple machines. arXiv preprintarXiv:1512.06216, 2015.

[36] R. Zhang and J. Kwok. Asynchronous distributed ADMM

algorithm for global variable consensus optimization. In ICML,

2014.

Documents

GeePS: Scalable deep learning on distributed GPUs …Caffe [21], a state-of-the-art open-source system for deep learning on a single GPU, to store its data in GeePS, we can improve