Performance and Control Flow in TensorFlow
Pang Wei Koh, Stanford Department of Computer Science
[email protected]
Emma Pierson, Stanford Department of Computer Science
[email protected]
1 SUMMARY
The success of deep learning on a variety of statistical tasks has
sparked a revolution. This, in turn, raises new challenges in
programming language design: how should we design a language that
allows users to efficiently crunch a huge amount of data in parallel?
How can we make use of specialized hardware? What sort of control
flows are most effective? In this project, we study TensorFlow –
probably the most popular deep learning framework around – and how it
attempts to answer these challenges. To understand TensorFlow from the
bottom up, we begin by benchmarking and assessing the ease of use of
CUDA, the low-level framework upon which TensorFlow relies. We then
investigate two recent higher-level TensorFlow innovations – the
Dataset API and eager execution. Last, to get a bird’s-eye view of the
design considerations underlying the TensorFlow framework, we conduct
three lengthy interviews with world-class TensorFlow practitioners,
all of whom have more than a thousand hours of practical deep
learning experience.
2 BACKGROUND
TensorFlow is a deep learning framework developed by
Google Brain and released in November 2015. As of this writing, it
is the most popular deep learning framework on a variety of metrics
[2]; the initial paper describing it [1] has more than 2,000
citations and the GitHub repository has more than 81,000 stars.
TensorFlow has been used as the framework underlying numerous
advances in the state-of-the-art, including the defeat of the human
Go world champion by a neural network trained from tabula rasa in
less than a day [15]. TensorFlow’s success has even given rise to
specialized hardware like the Tensor Processing Unit [7, 8].
In this project, we examine several fundamental building blocks
underlying TensorFlow. At a low level, TensorFlow relies on GPUs to
efficiently do its number crunching; it interfaces with these GPUs
through kernels written in CUDA, a parallel computing platform from
NVIDIA. TensorFlow programs also often need to ingest and process a
large amount of data, so much so that data loading speed is a
common performance bottleneck [18]: for high-dimensional data, like
images, data loading speed can be the rate-limiting factor,
reducing CPU/GPU utilization. TensorFlow has developed several
utilities to increase dataset loading speed, including a Dataset
API and its own data format, TFRecord. We will study both GPU usage
through CUDA and the Dataset API, and explore how they affect
performance.
At a higher level, a core abstraction that TensorFlow relies on to
perform computation is the data flow graph. (The name “TensorFlow”
combines two objects fundamental to its operation: tensors, which are
higher-dimensional generalizations of vectors, and data flow graphs.)
A data flow graph is an old programming language construct which is
not original to TensorFlow [9]: it expresses a sequence of
computations as a graph in which nodes represent computations that
transform input values.
Data flow graphs are useful abstractions for a number of reasons:
they can be easily visualized, making it easier to understand and
debug increasingly complex neural networks (Figure 1);
computations can be distributed, by assigning different nodes in
the data flow graph to different computers; and the graph structure
facilitates optimizations.
Figure 1: A data flow graph for part of a convolutional neural
network, visualized using TensorBoard. Source:
https://www.tensorflow.org/get_started/graph_viz.
In standard TensorFlow, a data flow graph is first constructed:
i.e., the operations that need to be performed are described, but
not actually performed. After the graph is constructed, its nodes
can be evaluated. For example, in many neural networks, a forward
pass first computes the values of critical nodes like the loss; a
backward pass, which makes use of TensorFlow’s symbolic
differentiation capabilities, then computes the gradient of the
loss with respect to the network’s parameters so that those
parameters can be optimized.
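The define-then-evaluate pattern described in this paragraph can be illustrated with a toy sketch in plain Python. This is not TensorFlow code: the `Node` class and the `const`, `evaluate`, and `grad` helpers are hypothetical names chosen for illustration, and the gradient is obtained by recursively applying the chain rule, which is far simpler than what a real framework does.

```python
class Node:
    """A node in a tiny data flow graph: an operation plus its input nodes."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op                # "const", "add", or "mul"
        self.inputs = list(inputs)
        self.value = value          # only used by "const" nodes

    def __add__(self, other):
        return Node("add", [self, other])

    def __mul__(self, other):
        return Node("mul", [self, other])


def const(value):
    return Node("const", value=value)


def evaluate(node):
    """Second phase: compute the value of a node from its inputs."""
    if node.op == "const":
        return node.value
    left, right = (evaluate(n) for n in node.inputs)
    return left + right if node.op == "add" else left * right


def grad(node, wrt):
    """Derivative of `node` with respect to the node `wrt` (chain rule)."""
    if node is wrt:
        return 1.0
    if node.op == "const":
        return 0.0
    a, b = node.inputs
    if node.op == "add":
        return grad(a, wrt) + grad(b, wrt)
    return grad(a, wrt) * evaluate(b) + evaluate(a) * grad(b, wrt)


# Phase 1: describe the computation; nothing is computed yet.
w, x, y = const(3.0), const(2.0), const(5.0)
err = w * x + const(-1.0) * y       # w*x - y
loss = err * err                    # squared error

# Phase 2: evaluate only the nodes we care about.
loss_value = evaluate(loss)         # value of the loss node
loss_grad_w = grad(loss, w)         # d(loss)/dw
```

Because `loss` is only a description of the computation until `evaluate` is called, a framework is free to inspect or rewrite the graph before running it.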
This two-part approach of defining and then evaluating the
computation graph has disadvantages, however. Because the graph
structure must be predefined, it is difficult to implement dynamic
computation graphs, in which the graph structure depends on the
input. Such models are used, for example, in work using
tree-structured LSTMs to do sentiment analysis (where the structure
of the parse tree is determined by the input sentence) [16]. It is
also somewhat unintuitive for practitioners used to the standard
imperative execution paradigm to learn to define operations and
only execute them later, and this separation can make TensorFlow
somewhat difficult to debug. All these facts mean that rivals of
TensorFlow that offer imperative execution, like Torch, are
appealing.
In response to this, TensorFlow released eager execution [14],
which allows operations to be executed in an imperative fashion.
Eager execution is an extremely recent development in TensorFlow,
and as of this writing can only be run using TensorFlow’s nightly
build. In theory, it offers faster debugging, more intuitive
imperative
2017, CS242, Final Project Pang Wei Koh and Emma Pierson
execution, and support for almost all of TensorFlow’s ops, but it
is also described as “experimental”, and the nightly build is
recommended only for “adventurous” users, so rough edges are
expected. We assess the ease of use and performance of eager
execution in our experiments, described below.
With all complex computational tools, interviewing real-world
practitioners to understand how the tool is used in practice is an
invaluable supplement to theoretical analyses of the tool’s
structure and benchmark analyses on simulated data [13]. We
therefore set up interviews with three deep learning experts to
supplement our analyses with real-world perspectives.
3 APPROACH
We conducted four investigations to understand the features which
contribute to TensorFlow’s performance.
(1) We implemented benchmarks in CUDA, the low-level framework upon
which TensorFlow relies to interface with GPUs, and assessed both
performance and ease-of-use.
(2) We implemented benchmarks using the Dataset API to assess both
its runtime performance and ease-of-use.
(3) We installed the nightly build of TensorFlow so we could assess
the runtime performance and ease-of-use of the eager execution
mode.
(4) We tracked down three expert TensorFlow practitioners, including
an author on the original TensorFlow paper, and arranged lengthy
interviews so that they could give us context on its real-world
applications, history, and design choices. All our interview
subjects have more than a thousand hours of experience in
deep learning.
We describe the results of each of these investigations in separate
sections below.
4 CUDA
4.1 Model
CUDA is a parallel computing platform, introduced
by NVIDIA in 2007, that aims to allow developers to write code that
takes advantage of the computing power in GPUs. Concretely, it is a
set of APIs and libraries that developers can make use of when
writing programs in C++, Fortran, and other supported
languages.
GPUs can perform certain types of computation very efficiently
primarily because they are massively parallel: for example, the
latest GPUs contain thousands of cores and can keep many thousands
of threads in flight at once [11]. In particular, many operations in
machine learning can be parallelized: for example, in matrix-vector
multiplication, each row of the matrix can be multiplied by the
vector independently, with the per-row results then assembled into
the output vector. If done in parallel, this can potentially lead to
significant computational savings, at the slight cost of having to
transport the data from the CPU to the GPU and then back.
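As a rough illustration of this row-wise decomposition, the sketch below hands each row of a matrix to a worker in a thread pool. It only shows how the work splits into independent pieces: it runs on the CPU, the names `row_dot` and `parallel_matvec` are hypothetical, and Python threads will not reproduce GPU-style speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def row_dot(row, vector):
    """Dot product of one matrix row with the vector."""
    return sum(a * b for a, b in zip(row, vector))


def parallel_matvec(matrix, vector, workers=4):
    """Compute matrix @ vector, handing each row to a worker independently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves row order, so results line up with the output vector.
        return list(pool.map(lambda row: row_dot(row, vector), matrix))


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
vector = [1, 0, -1]
result = parallel_matvec(matrix, vector)
```

No row needs the result of any other row, which is what makes the operation a good fit for massively parallel hardware.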
4.2 Benchmarks
We set out to investigate how fast GPUs could perform a dot product
between two vectors, an operation that is ubiquitous in machine
learning. (We adapted some starter code for adding two vectors
together from NVIDIA’s developer blog [4] to do this.) A simple
implementation in C++ took an average of 0.19 seconds to take the
dot product of 2 vectors, each of length ≈ 16 million and filled
with 32-bit floats (as measured with gprof). On the other hand, the
corresponding CUDA implementation (using 32 blocks and 256 threads
per block) took less than 1 millisecond (as measured with nvprof),
which is more than 2 orders of magnitude faster. This speedup is
due to the parallelization and not because a single thread on the
GPU is inherently faster at floating point operations: when we ran
our CUDA implementation using just 1 block and 1 thread per block,
it took more than 4 seconds to complete.
4.3 Issues
However, the efficiency of running code on the GPU does not come for
free. In order to obtain speedups, developers need to write code
that explicitly indexes into the parallel structure of the GPU,
which is divided into blocks that each contain threads. This can be
quite clunky, as shown in Fig 2: while the simple C++
implementation is straightforward to code and debug, the CUDA
implementation requires reasoning about indices of threads, block
dimensions, accumulation, GPU-CPU synchronization, and the like.
Moreover, the usual problems with parallel code, like race
conditions, also come into play.
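To make the indexing concrete, the following Python sketch simulates how a dot product might be divided among blocks and threads using a grid-stride loop. It is not the code from Fig 2 and not real CUDA: the simulated threads run one after another, and the function name `simulated_dot` and the per-thread accumulator layout are assumptions made for illustration.

```python
def simulated_dot(x, y, num_blocks=32, threads_per_block=256):
    """Dot product organized the way a grid-stride CUDA kernel would be.

    Each (block, thread) pair is simulated sequentially here; on a GPU
    they would run in parallel.
    """
    n = len(x)
    stride = num_blocks * threads_per_block     # total number of threads
    partial = [0.0] * stride                    # one accumulator per thread

    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            # Global thread index: blockIdx.x * blockDim.x + threadIdx.x
            tid = block_idx * threads_per_block + thread_idx
            # Thread `tid` handles elements tid, tid + stride, ...
            for i in range(tid, n, stride):
                partial[tid] += x[i] * y[i]

    # Accumulation step: combine the per-thread partial sums.
    return sum(partial)


x = [float(i) for i in range(10000)]
y = [2.0] * 10000
result = simulated_dot(x, y)
single_thread = simulated_dot(x, y, num_blocks=1, threads_per_block=1)
```

Giving each thread its own accumulator sidesteps the race condition that would arise if all threads added into a single shared variable; a real kernel would still need a synchronized reduction to combine the partial sums.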
Fortunately, for the average user, TensorFlow abstracts all of this
away, hiding any direct GPU interfacing from the end-user.
TensorFlow developers need only install the correct GPU-enabled
version of TensorFlow to automatically benefit from the CUDA kernels
already implemented within TensorFlow, which cover most everyday
use cases (advanced users can opt to write their own kernels to
plug in). As seen in Fig 3, TensorFlow on the GPU is an order of
magnitude faster at matrix-matrix multiplication than TensorFlow on
the CPU. This ability to reap the performance benefits of GPUs
without the hindrance of coding in low-level CUDA is one of the
prime advantages of working in a framework like TensorFlow.
Figure 2: Dot product code in C++ and with CUDA.
5 DATASET API
TensorFlow v1.3 introduced a new Dataset API [17]. This API,
together with its TFRecord data format, gives us two major
benefits: 1) it allows us to write TensorFlow programs that are
more resource-efficient, and 2) it helps us prototype and iterate
on TensorFlow
Performance and Control Flow in TensorFlow 2017, CS242, Final
Project
Figure 3: Runtimes for square matrix multiplication using GPU
versus CPU (x-axis: matrix dimension; series: CPU and GPU).
programs more efficiently. It achieves these by enabling streaming
– that is, lazily loading and processing data only when required –
and by providing easy ways for developers to implement common data
processing patterns. We elaborate on these in the sequel.
To test this new API, we downloaded and processed a subset of
10,000 images from the popular ImageNet dataset [3]. We then wrote
simple TensorFlow programs, using different data ingestion methods,
to iterate over all of these images and sum up all of their pixel
values. We measured the best performance of these different
programs over repeated runs.
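A best-of-several-runs measurement of this kind could be set up along the following lines. This is a generic sketch rather than the harness used for the numbers reported here: `best_time` and the stand-in workload `sum_pixels` are hypothetical, and the fake images are tiny so that the example runs instantly.

```python
import time


def best_time(fn, repeats=5):
    """Run fn() `repeats` times; return the fastest wall-clock time (s)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def sum_pixels():
    # Stand-in workload: sum the "pixels" of 100 fake 32x32 images.
    images = ([1] * (32 * 32) for _ in range(100))
    return sum(sum(image) for image in images)


elapsed = best_time(sum_pixels)
```

Taking the minimum over repeated runs reduces the influence of one-off delays such as other processes competing for the disk or CPU.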
5.1 Performance
Naive data ingestion. The most common way of handling data
ingestion – and the easiest to code up without using the Dataset
API – is to load all of the required data into memory at the start
of the program, and then simply index into it. We implemented this
with the code shown in Fig 4, storing the data on disk as a large
10GB NumPy matrix. This is slow, taking about 2 minutes to run;
moreover, it requires being able to fit the whole matrix into
memory. However, we note that in real-world applications that make
multiple passes over the data, loading the entire dataset into
memory (if feasible) could pay off, since the loading cost would be
incurred only once.
Data streaming with the Dataset API. We can get significant savings
on time and memory by streaming the data via the Dataset API. Since
we don’t need to use more than one image at a time, we can simply
load the next example whenever we need it, which means that we only
need enough memory to store one image instead of all the images at
once. As seen in Fig 5, this method only takes about 1 minute to
run, a 2x speedup. The computational savings come from two factors:
not having to find and allocate a large contiguous chunk of memory,
and loading the data in a binary format that can be used directly
in TensorFlow without having to first convert it from the NumPy
format.
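The contrast between the two ingestion styles can be sketched in plain Python with a generator. This is an analogy for streaming in general, not the Dataset API itself; `make_image` is a hypothetical stand-in for reading one image from disk.

```python
def make_image(index, num_pixels=16):
    """Stand-in for reading one image from disk (hypothetical helper)."""
    return [index % 256] * num_pixels


def load_all(num_images):
    """Naive ingestion: materialize every image before computing."""
    return [make_image(i) for i in range(num_images)]


def stream(num_images):
    """Streaming ingestion: yield one image at a time, on demand."""
    for i in range(num_images):
        yield make_image(i)


# Both approaches compute the same sum, but the generator only ever
# holds one image in memory at a time.
total_naive = sum(sum(image) for image in load_all(1000))
total_streamed = sum(sum(image) for image in stream(1000))
```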
Pre-fetching with the Dataset API. We can obtain additional savings
by interleaving file I/O with computation. In particular, we can
pre-fetch the next examples from disk while the computation is
ongoing, so that when we need the next example, it is already
loaded into memory. This trades off a small amount of memory usage
for greater speed. We implemented this in Fig 6 by just adding a
single extra call to the Dataset API, resulting in a 25% reduction
in runtime (down to 45s from 1 minute). In our example, the
computation is quick (since it is just adding the pixel values
together), and we expect even greater savings in real applications
where the computation (e.g., taking gradients through large neural
networks) can be much more expensive.
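The idea behind pre-fetching can be sketched with a background thread and a bounded queue. This is a simplified illustration and not how the Dataset API is implemented; `prefetch` and `read_examples` are hypothetical names, and error handling in the producer thread is omitted.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, loading up to `buffer_size` items
    ahead of the consumer in a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)    # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def read_examples(n):
    """Stand-in for slow disk reads (hypothetical helper)."""
    for i in range(n):
        yield [i, i + 1]


total = sum(sum(example) for example in prefetch(read_examples(100)))
```

The `buffer_size` argument is the memory-for-speed trade-off mentioned above: a larger buffer hides longer I/O stalls at the cost of holding more examples in memory.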
The TFRecord data format. TensorFlow recommends using its own
TFRecord data format, which is a simple binary file format with
some convenience features (like having built-in serializers and
deserializers for common data types). The main benefit of
converting data into a TFRecord is data locality, especially in
cases where the data for a single example is spread over multiple
locations. For example, in a medical diagnosis setting, one might
have MRI image data stored in one place and a patient’s electronic
medical history in a different place. This slows down loading
individual examples from disk. Instead, we could first combine each
example’s data into a single TFRecord and store that on disk,
improving data locality for future look-ups. We did this for our
image data in Fig 7: in this simple example, we combine the raw
image data (X) with the image label (Y) in the same record. In
our case, we have a single TFRecord that stores all the data;
however, in cases where there is too much data to do this, we could
easily have separated each example into its own TFRecord, since the
TFRecordDataset constructor also accepts a list of TFRecord
paths. This is considerably more difficult to implement with NumPy,
which does not have a built-in way of combining data from many
different source files.
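The data-locality idea, storing each example's features and label next to each other in one binary stream, can be sketched with a simple length-prefixed record layout. This is emphatically not TensorFlow's actual TFRecord encoding; the layout and the `write_record` and `read_records` helpers are assumptions made purely for illustration.

```python
import io
import struct


def write_record(stream, pixels, label):
    """Append one example: pixel count, label, then the raw pixel bytes."""
    stream.write(struct.pack("<II", len(pixels), label))
    stream.write(bytes(pixels))


def read_records(stream):
    """Yield (pixels, label) pairs back out of the stream, in order."""
    header_size = struct.calcsize("<II")
    while True:
        header = stream.read(header_size)
        if len(header) < header_size:
            return
        num_pixels, label = struct.unpack("<II", header)
        yield list(stream.read(num_pixels)), label


# Combine features (X) and labels (Y), held separately, into one stream.
X = [[0, 128, 255], [10, 20, 30]]
Y = [1, 0]
buffer = io.BytesIO()  # stands in for a file on disk
for pixels, label in zip(X, Y):
    write_record(buffer, pixels, label)

buffer.seek(0)
records = list(read_records(buffer))
```

Once written this way, reading an example back takes a single sequential read instead of one lookup per data source.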
5.2 Ease of development
Beyond performance improvements in running
the entire program, the Dataset API makes life better for
developers in two main ways.
Facilitating rapid prototyping. When writing a program, developers
often want to be able to test it out quickly.
Unfortunately, with the naive method of loading data above, we have
to wait for the entire dataset to be loaded before we can try
anything out. In our running example, it takes 1.5 minutes to get
to the computational loop (Fig 8), so if we have a bug there it is
slow to test. However, if we lazily load in the data (Fig 9), we
can get to the computational loop much, much quicker (40ms). This
makes it far easier to develop code on the real data, instead of
having to create artificially smaller datasets just for
prototyping.
Support for common patterns. In machine learning applications,
there are many common operations that developers want to perform on
their data: for example, batching examples up into small batches
(instead of operating on individual examples), and taking multiple
passes through the data, shuffling the order of data processing
with each pass. These can be somewhat tedious to implement on our
own. For example, in Fig 10, we show an implementation of batching
and shuffling using NumPy; this makes the code messier
(interleaving data processing code with the actual computation) and
can be a big pain to implement if the batch size does not cleanly
divide the number of training examples. However, using the Dataset
Figure 4: Ingesting data by loading it all into memory at the start
of the program.
Figure 5: Ingesting data by streaming it from a TFRecord.
API, implementing batching and shuffling just requires two extra
calls to the batch and shuffle functions (Fig 11).
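For comparison, a hand-rolled version of shuffled, batched, multi-pass iteration might look like the sketch below. It is neither the code from Fig 10 nor the Dataset API; the `batches` generator is a hypothetical helper that keeps the data handling separate from the computation and copes with a final short batch.

```python
import random


def batches(examples, batch_size, epochs, seed=0):
    """Yield batches over several passes, reshuffling on each pass.

    The final batch of a pass may be smaller when batch_size does not
    evenly divide the number of examples.
    """
    rng = random.Random(seed)
    order = list(examples)
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]


all_batches = list(batches(range(10), batch_size=4, epochs=2))
batch_sizes = [len(batch) for batch in all_batches]
```

With 10 examples and a batch size of 4, each pass produces two full batches and one batch of 2, which is exactly the uneven case that makes manual implementations error-prone.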
6 EAGER EXECUTION
Eager execution is an execution mode in TensorFlow introduced in
late 2017 [14]. As of this writing, it is only part of the nightly
build of TensorFlow. Eager execution allows imperative, immediate
execution of TensorFlow commands. In contrast, as described above,
standard TensorFlow requires one to set up a computation graph and
only then evaluate parts of the graph using session.run.
We ran a series of three experiments to evaluate performance and
ease of development with eager execution. One’s a priori belief
Figure 6: Ingesting data by streaming it from a TFRecord, using
pre-fetching.
Figure 7: Improving data locality by writing to a TFRecord.
would be that eager execution might achieve comparable performance
to standard TensorFlow on computationally simple tasks without
complicated graphs. However, we might expect eager execution to
have worse performance a) on more computationally intensive tasks
(because it is still bleeding-edge) and b) on complex graphs
(because it cannot optimize the graph structure ahead of time). Our
experiments assess whether these hypotheses are accurate.
6.1 Performance
We ran a series of three experiments to compare the
performance of eager execution to performance of standard
TensorFlow. All experiments were performed on a single NVIDIA TITAN
Xp GPU with no other processes running on it.
Matrix multiplication: We compared the time to multiply two random
square matrices of varying dimensions (Figure 12). Times were very
similar, as expected, for the two methods, although at very large
matrix sizes eager execution began to be slower. (We were limited
in the size of matrices we could assess because the GPU ran out of
memory.)
Matrix inversion: We compared the time to invert a random matrix of
varying dimensions (Figure 13). Eager execution was only slightly
slower for small matrices (for 1000 × 1000 matrices, it ran in 25
milliseconds as opposed to 18 milliseconds) but became
significantly slower for larger matrices. We note that matrix
inversion is more computationally expensive than matrix
multiplication, and as such may enjoy greater benefits from the
more fully optimized static computation.
Figure 8: Pre-loading all of the data makes debugging and
prototyping slow.
Figure 9: Streaming data allows developers to prototype code more
easily.
End-to-end neural network (autoencoder): In order to assess
performance in a realistic scenario, we implemented a full
autoencoder [5] in both the standard and eager framework. To
generate simulated data X, of dimension p, we drew Z from a
k-dimensional Gaussian with k < p: Z ∼ N(0, I) and then set
X = AZ, with A a transformation matrix. This data generation
process, standard in dimensionality reduction models like
probabilistic principal components analysis, ensured that X could
in fact be represented as a transformation of a low-dimensional
latent state, making an autoencoder an appropriate model. We
assessed the time to complete three epochs for 10,000 samples,
k = 10, p = 1000, using a one-layer autoencoder with linear
activations and the Adam optimizer [10].
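The data generation process can be sketched in plain Python as follows. The text does not say how the matrix A was chosen, so drawing its entries from a standard Gaussian is an assumption, as are the function name `generate_data` and the small sizes used to keep the example fast.

```python
import random


def generate_data(n, p, k, seed=0):
    """Draw n samples X = A Z with Z ~ N(0, I_k) and a random p-by-k A."""
    rng = random.Random(seed)
    # Assumption: A has i.i.d. standard Gaussian entries.
    A = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(p)]
    samples = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(k)]
        x = [sum(A[i][j] * z[j] for j in range(k)) for i in range(p)]
        samples.append(x)
    return A, samples


# Small sizes for illustration; the experiment used n = 10,000,
# k = 10, p = 1000.
A, X = generate_data(n=5, p=8, k=2)
```

Every sample is a linear combination of the k columns of A, so the data lies in a k-dimensional subspace of the p-dimensional space, which is the structure a linear autoencoder with a k-dimensional bottleneck can capture.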
As expected, given the much more complex computational graph, eager
execution was significantly slower than the static computation
graph, with an average time of 4.9 seconds as opposed to 1.9
seconds.
Analysis of results: Our results are consistent with our prior
expectations for eager execution. On computationally easy tasks
without complex computation graphs like matrix multiplication,
eager execution performs comparably to standard TensorFlow. On more
computationally intensive tasks, like large matrix inversions, or
tasks with complex graph structure, like performing both a forward
and backward pass through an autoencoder, eager execution is
slower.
Figure 10: Batching and multiple passes with a naive Numpy
implementation.
Figure 11: Batching and multiple passes with the Dataset API.
6.2 Ease of development
Eager execution offers two advantages over standard TensorFlow from
a development standpoint:
• More intuitive programming patterns: Most programmers are used
to having operations execute immediately, rather than having to
define everything they want to run and then execute specific
parts of the computation graph. Consequently, eager execution is
much closer to standard programming patterns than standard
TensorFlow. This, in turn, can translate into more compact and
intuitive code and easier debugging.
• Dynamic graphs: Some models are difficult to implement in
standard TensorFlow. Specifically, graphs where the structure
depends on the input, like tree-structured LSTMs in NLP, are
difficult to implement with a static graph. Our interviews with
deep learning practitioners (Section 7) confirm this. Eager
execution, which allows for the computations performed to depend on
the input, more naturally facilitates dynamic graph
structures.
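The kind of input-dependent computation described in the second bullet can be illustrated with a short recursive function over parse trees, written in plain Python rather than TensorFlow. The averaging step is a placeholder for a learned composition function such as a Tree-LSTM cell, and the `encode` function and toy `embeddings` are hypothetical.

```python
def encode(tree, embeddings):
    """Recursively combine a parse tree into a single vector.

    `tree` is either a word (str) or a (left, right) pair of subtrees,
    so the sequence of operations is dictated by the input itself.
    """
    if isinstance(tree, str):
        return embeddings[tree]
    left, right = tree
    l_vec, r_vec = encode(left, embeddings), encode(right, embeddings)
    # Stand-in for a learned composition function.
    return [(a + b) / 2 for a, b in zip(l_vec, r_vec)]


embeddings = {"not": [-1.0, 0.0], "very": [0.0, 1.0], "good": [1.0, 1.0]}

# Two sentences with different parse trees trigger different computations.
vec_a = encode(("not", "good"), embeddings)
vec_b = encode(("not", ("very", "good")), embeddings)
```

In an imperative setting this is ordinary recursion; with a static graph, the number and arrangement of composition steps would have to be fixed before any sentence is seen.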
To summarize our findings, the more intuitive and flexible
interface of eager execution makes it somewhat useful for
prototyping and dynamic graph structures. However, the slower
runtimes for computationally intensive tasks or complex computation
graphs
[Figure 12: plot; x-axis: matrix dimension.]
Figure 13: Performance on the matrix inversion benchmark (x-axis:
matrix dimension).
mean that standard TensorFlow is preferable for cases where high
performance is desired and flexibility is unnecessary. We note
that, for our purposes at least, eager execution does not at
present save developer time; the time to set up the nightly build
and get used to the different way of setting up computations more
than offset any gains. Further, we were somewhat disappointed to
discover that eager execution was slower than standard TensorFlow
even on tasks like matrix inversion, with less consistent runtimes.
We refactored our eager execution code numerous times in an effort
to get its performance to match standard TensorFlow’s, and even
opened an issue on the TensorFlow GitHub repository to make
developers aware of our results; hopefully our project will be able
to contribute to the TensorFlow ecosystem. However, as eager
execution matures, and its implementation is no longer bleeding-edge,
it may come closer to approaching the performance of standard
TensorFlow in real-world applications – an exciting possibility for
developers.
7 USER CASE STUDIES
We were fortunate enough to be able to secure
lengthy interviews with three expert deep learning practitioners in
industry, some of whom would speak to us only on the condition that
we did not identify them by name.
We interviewed one of the original developers of TensorFlow (an
author on the canonical TensorFlow paper) about the thought that
went into its original structure. He explained that the
development team anticipated the limitations of the original graph
structure
when they were developing TensorFlow. Specifically, they knew it
would be hard to develop models that required dynamic graphs whose
structure changed in response to input. Such models were already
being developed at the time in natural language processing [16]. So
this was a known limitation of TensorFlow when it was developed. He
viewed the development of the TensorFlow eager execution as an
important step forward for TensorFlow, but not necessarily for the
field of machine learning as a whole, since Torch has similar
capabilities and is much faster.
A second deep learning practitioner, a research scientist at
Salesforce Einstein (formerly MetaMind), largely corroborated this
account and gave us welcome context on the history that preceded the
development of TensorFlow eager execution. He explained that
TensorFlow had initially attempted to compromise between two
competing conceptions of how to build models:
• Model as data structure: Data structures are desirable because
they are easy to optimize (e.g., compilers convert code to data
structures like trees and graphs to perform optimizations). They
are also highly portable: data structures are totally
self-contained and easier to use across platforms – e.g., Caffe
does this [6].
• Model as code: Harder to optimize and less portable. However,
much easier for programmers to write – complicated models are a
pain to implement as data structures. Also, data structures don’t
have that much flexibility (e.g., if you want graph structure to
change depending on the input).
To get the best of both worlds, the researcher explained, the
original TensorFlow framework took code and converted it to data
structures. TensorFlow’s framework is quite comprehensive, with 500
ops, so it does offer a lot of versatility, but it’s pretty hard
and opaque to accomplish certain tasks. And writing things in terms
of graphs is not the most intuitive way to run code.
As all this was going on, the researcher explained, there was also
a lot of work being done on automatic differentiation which was
incorporated into the Torch library, and after a huge amount of
work they managed to make it just as fast as TensorFlow while being
much more flexible and allowing dynamic graph structure. This made
Torch much more appealing for many NLP tasks. TensorFlow eager
execution did more to match Torch’s flexibility but still had a lot
of rough edges and was much slower, so the researcher told us he
preferred Torch at present. He added, however, that if he were in
an ecosystem that already used TensorFlow, eager execution would
represent a significant step forward.
A third deep learning researcher corroborated the above accounts
and added that the TensorFlow Fold API was another development to
facilitate creation of dynamic computation graphs. We asked him
about the merits of the Dataset API. He said he did not use it at
present but in general had observed that significant speedups (and
increases in GPU utilization) could be provided by using
TensorFlow’s many data loading utilities. He added, though, that he
rarely worried too much about optimizing such things because he was
not doing work that required highly optimized code. In general, our
impression is that the sheer size of TensorFlow’s codebase is a
mixed blessing: it has so many features, so many ways to do things,
that it isn’t always clear which features are worth picking up as a
developer. The fact that parts of the codebase are in flux also
makes development somewhat difficult.
8 DISCUSSION
TensorFlow’s progress is in some sense a metaphor for the progress
of deep learning as a whole. It has enjoyed explosive growth, in
terms of codebase, contributors, and users, but that rapid growth
has produced some rough edges. Many of its features clearly improve
both performance and ease-of-use: for example, it offers an
interface to GPUs which massively improves performance over CPUs,
while being much easier to use than CUDA. Similarly, its Dataset
API reduces both development cost and dataset loading time. The
many hours we invested in accomplishing these ubiquitous deep
learning tasks without TensorFlow’s utilities greatly increased our
appreciation for how much time it saves.
But, as in deep learning as a whole, TensorFlow’s rapid growth has
produced rough edges. For example, when we used the bleeding-edge
eager mode, we found that its performance fails to match standard
TensorFlow (as its developers warned), and its development cost
remains substantial. The fact that parts of the codebase are still
in flux also introduces challenges for practitioners: for example,
it was reported that when rounding behavior in part of the
TensorFlow codebase was changed from “truncate” to “round to even”,
one model went from 25% error to 99% error [12]. Our conclusion, on
the basis of our investigations and our interviews with real-world
practitioners, is “caveat programmer”: TensorFlow’s mature features
are invaluable, but its newer features are still being ironed out.
That said, the enormous amount of investment and engineering talent
from both Google and the open source community, and the excitement
around deep learning in general, make us confident TensorFlow will
only continue to improve.
9 ACKNOWLEDGMENTS
Both authors contributed equally to this project.
We thank the deep learning practitioners who contributed their time
and expertise.
REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng
Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu
Devin, et al. 2016. Tensorflow: Large-scale machine learning on
heterogeneous distributed systems. arXiv preprint arXiv:1603.04467
(2016).
[2] Rachel Allen and Michael Li. 2017. Ranking Popular Deep Learning
Libraries for Data Science.
https://www.kdnuggets.com/2017/10/ranking-popular-deep-learning-libraries-data-science.html.
(2017).
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei.
2009. ImageNet: A Large-Scale Hierarchical Image Database. In
CVPR09.
[4] Mark Harris. 2017. An Even Easier Introduction to CUDA.
https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/.
(2017).
[5] Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing
the dimensionality of data with neural networks. Science 313,
5786 (2006), 504–507.
[6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev,
Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor
Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature
Embedding. arXiv preprint arXiv:1408.5093 (2014).
[7] Norm Jouppi. 2016. Google supercharges machine learning tasks
with TPU custom chip.
https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.
(2016).
[8] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson,
Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan
Boden, Al Borchers, et al. 2017. In-datacenter performance analysis
of a tensor processing unit. In Proceedings of the 44th Annual
International Symposium on Computer Architecture. ACM, 1–12.
[9] Krishna M. Kavi, Bill P. Buckles, and U. Narayan Bhat. 1986. A
formal definition of data flow graph models. IEEE Transactions on
Computers 11 (1986), 940–948.
[10] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[11] NVIDIA. 2017. NVIDIA TITAN Xp specs.
https://www.nvidia.com/en-us/design-visualization/products/titan-xp/.
(2017).
[12] Ali Rahimi. 2017. NIPS 2017 test-of-time award presentation.
https://www.youtube.com/watch?v=Qi1Yry33TQE. (2017).
[13] Peter Seibel. 2009. Coders at work: Reflections on the craft
of programming. Apress.
[14] Asim Shankar and Wolff Dobson. 2017. Eager Execution: An
imperative, define-by-run interface to TensorFlow.
https://research.googleblog.com/2017/10/eager-execution-imperative-define-by.html.
(2017).
[15] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis
Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker,
Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of go
without human knowledge. Nature 550, 7676 (2017), 354–359.
[16] Kai Sheng Tai, Richard Socher, and Christopher D Manning.
2015. Improved semantic representations from tree-structured long
short-term memory networks. arXiv preprint arXiv:1503.00075
(2015).
[17] TensorFlow Development Team. 2017. Importing Data.
https://www.tensorflow.org/programmers_guide/datasets. (2017).
[18] TensorFlow Development Team. 2017. Performance Guide.
https://www.tensorflow.org/performance/performance_guide. (2017).