13
Runtime Performance Prediction for Deep Learning Models with Graph Neural Network Yanjie Gao Microsoft Research China [email protected] Xianyu Gu Microsoft Research China [email protected] Hongyu Zhang The University of Newcastle Australia [email protected] Haoxiang Lin Microsoft Research China [email protected] Mao Yang Microsoft Research China [email protected] ABSTRACT Deep learning (DL) models have been widely adopted in many appli- cation domains. Predicting the runtime performance of DL models such as GPU memory consumption and training time is important for boosting development productivity and reducing resource waste because improper configurations of hyperparameters and neural architectures can result in many failed training jobs or unsatisfac- tory models. However, the runtime performance prediction of DL models is challenging because of the hybrid DL programming para- digm, complicated hidden factors within the framework runtime, fairly huge model configuration space, and wide differences among models. In this paper, we propose DNNPerf, a novel Graph Neural Network-based tool for predicting the runtime performance of DL models using Graph Neural Network. DNNPerf represents a DL model as a directed acyclic computation graph and incorporates a rich set of performance-related features based on the computa- tional semantics of both nodes and edges. We also propose a new Attention-based Node-Edge Encoder to better encode the node and edge features. DNNPerf is evaluated on thousands of configurations of real-world and synthetic DL models to predict their GPU memory consumption and training time. The experimental results show that DNNPerf achieves accurate predictions, with an overall error of 13.684% for the GPU memory consumption prediction and an over- all error of 7.443% for the training time prediction, outperforming all the compared methods. CCS CONCEPTS Software and its engineering Extra-functional properties. KEYWORDS deep learning, AutoML, graph neural network, runtime perfor- mance, performance prediction 1 INTRODUCTION In recent years, deep learning (DL) has been widely adopted in many application domains such as computer vision [28], speech recognition [72], and natural language processing [15]. Like many traditional software systems, DL models are also highly config- urable via a set of configuration options for hyperparameters (e.g., the batch size and the dropout) and neural architectures (e.g., the Corresponding author. number of layers). To search for the optimal configuration of a DL model that satisfies specific requirements, developers usually run (e.g., via automated machine learning tools) a number of training jobs to explore diverse configurations. Different model configurations may lead to different quality attributes (i.e., non-functional properties), among which runtime performance (e.g., GPU memory consumption and training time) is one of the most important quality attributes because it directly af- fects model quality and development productivity. Improper model configurations can unexpectedly degrade runtime performance and result in many failed jobs or unsatisfactory models. For instance, an overlarge batch size causes a job to exhaust GPU memory and raise an OOM (out-of-memory) exception. According to a recent empirical study on 4,960 DL job failures in Microsoft [75], 8.8% of the failed jobs were caused by GPU OOM, which accounts for the largest category in all DL specific failures. Another example is that a sophisticated neural architecture increases the computation load and makes the training time longer than expected, as a result the job has to be stopped early because of limited budget. Even worse, in the automated machine learning (AutoML) scenario, other tens or hundreds of jobs with the same batch size or similar neural architectures can experience the same issues and fail. Therefore, predicting the runtime performance of a DL model ahead of job execution is critical for boosting development productivity and reducing resource waste. In literature, there is already much research work for estimat- ing the runtime performance of programs [2, 3, 5, 12, 64, 67] and deployed systems [2426, 5658] using program analysis and ma- chine learning (ML) techniques. Recently, some authors advanced the estimation effort to DL models [19, 33, 49, 73]. However, these works either cannot be directly applied or have limitations in pre- cise prediction for the following challenges: (1) DL frameworks provide a hybrid programming paradigm: developers only invoke high-level interfaces to construct DL models while low-level computational operations are implemented with proprietary NVIDIA CUDA, cuDNN, or cuBLAS APIs. Such a paradigm hides the internal implemen- tation details and thus makes it hard to model the runtime performance accurately. (2) There are many complicated hidden factors such as garbage collection and operator execution order within the frame- work runtime, which observably affect a model’s runtime

Runtime Performance Prediction for Deep Learning Models

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Runtime Performance Prediction for Deep Learning Models

Runtime Performance Prediction for Deep Learning Models withGraph Neural Network

Yanjie GaoMicrosoft Research

[email protected]

Xianyu GuMicrosoft Research

[email protected]

Hongyu ZhangThe University of Newcastle

[email protected]

Haoxiang Lin∗Microsoft Research

[email protected]

Mao YangMicrosoft Research

[email protected]

ABSTRACTDeep learning (DL) models have been widely adopted in many appli-cation domains. Predicting the runtime performance of DL modelssuch as GPU memory consumption and training time is importantfor boosting development productivity and reducing resource wastebecause improper configurations of hyperparameters and neuralarchitectures can result in many failed training jobs or unsatisfac-tory models. However, the runtime performance prediction of DLmodels is challenging because of the hybrid DL programming para-digm, complicated hidden factors within the framework runtime,fairly huge model configuration space, and wide differences amongmodels. In this paper, we propose DNNPerf, a novel Graph NeuralNetwork-based tool for predicting the runtime performance of DLmodels using Graph Neural Network. DNNPerf represents a DLmodel as a directed acyclic computation graph and incorporatesa rich set of performance-related features based on the computa-tional semantics of both nodes and edges. We also propose a newAttention-based Node-Edge Encoder to better encode the node andedge features. DNNPerf is evaluated on thousands of configurationsof real-world and synthetic DLmodels to predict their GPUmemoryconsumption and training time. The experimental results show thatDNNPerf achieves accurate predictions, with an overall error of13.684% for the GPU memory consumption prediction and an over-all error of 7.443% for the training time prediction, outperformingall the compared methods.

CCS CONCEPTS• Software and its engineering→ Extra-functional properties.

KEYWORDSdeep learning, AutoML, graph neural network, runtime perfor-mance, performance prediction

1 INTRODUCTIONIn recent years, deep learning (DL) has been widely adopted inmany application domains such as computer vision [28], speechrecognition [72], and natural language processing [15]. Like manytraditional software systems, DL models are also highly config-urable via a set of configuration options for hyperparameters (e.g.,the batch size and the dropout) and neural architectures (e.g., the∗Corresponding author.

number of layers). To search for the optimal configuration of a DLmodel that satisfies specific requirements, developers usually run(e.g., via automated machine learning tools) a number of trainingjobs to explore diverse configurations.

Different model configurations may lead to different qualityattributes (i.e., non-functional properties), among which runtimeperformance (e.g., GPU memory consumption and training time) isone of the most important quality attributes because it directly af-fects model quality and development productivity. Improper modelconfigurations can unexpectedly degrade runtime performance andresult in many failed jobs or unsatisfactory models. For instance,an overlarge batch size causes a job to exhaust GPU memory andraise an OOM (out-of-memory) exception. According to a recentempirical study on 4,960 DL job failures in Microsoft [75], 8.8%of the failed jobs were caused by GPU OOM, which accounts forthe largest category in all DL specific failures. Another example isthat a sophisticated neural architecture increases the computationload and makes the training time longer than expected, as a resultthe job has to be stopped early because of limited budget. Evenworse, in the automated machine learning (AutoML) scenario, othertens or hundreds of jobs with the same batch size or similar neuralarchitectures can experience the same issues and fail. Therefore,predicting the runtime performance of a DL model ahead of jobexecution is critical for boosting development productivity andreducing resource waste.

In literature, there is already much research work for estimat-ing the runtime performance of programs [2, 3, 5, 12, 64, 67] anddeployed systems [24–26, 56–58] using program analysis and ma-chine learning (ML) techniques. Recently, some authors advancedthe estimation effort to DL models [19, 33, 49, 73]. However, theseworks either cannot be directly applied or have limitations in pre-cise prediction for the following challenges:

(1) DL frameworks provide a hybrid programming paradigm:developers only invoke high-level interfaces to constructDL models while low-level computational operations areimplemented with proprietary NVIDIA CUDA, cuDNN, orcuBLASAPIs. Such a paradigm hides the internal implemen-tation details and thus makes it hard to model the runtimeperformance accurately.

(2) There are many complicated hidden factors such as garbagecollection and operator execution order within the frame-work runtime, which observably affect a model’s runtime

Page 2: Runtime Performance Prediction for Deep Learning Models

Yanjie Gao, Xianyu Gu, Hongyu Zhang, Haoxiang Lin, and Mao Yang

performance. However, understanding and extracting suchhidden factors can be very difficult, time-consuming, anderror-prone because they are volatile and fast-changingwith the rapid evolution of DL frameworks.

(3) The configuration space of a DL model is fairly huge andthere exist wide differences among various kinds of DLmodels; therefore, it is challenging to design a general MLprediction model from limited samples with high accuracy.

In this paper, we proposeDNNPerf, a novel GraphNeural Network-based tool for predicting the runtime performance of DL modelswith Graph Neural Network (GNN) [52]. Our key observation is thata DL model can be represented as a directed acyclic computationgraph [22]. Each node denotes a computational operation calledan operator (e.g., matrix addition), and an edge delivers a tensorand specifies the execution dependency between two nodes. Thealgorithmic execution of the model is then represented as iterativeforward and backward propagation on such a computation graph.Node features include the operator type (e.g., Conv2D), hyperpa-rameters (e.g., the kernel size), number of floating-point operations(FLOPs), sizes of input/output/weight/temporary tensors, etc. Edgefeatures consist of the edge type (forward or backward) and sizeof the delivered tensor. Relevant parameters of target devices suchas floating-point operations per second (FLOPS), memory capac-ity, and bandwidth are also extracted as node or edge features.Moreover, we propose a new Attention-based Node-Edge Encoderto better encode the node and edge features by adapting ideas ofGraph Attention Network (GAT) [66] and Edge Enhanced GraphNeural Network (EGNN) [21].

We have implemented DNNPerf as a prediction model trainedfrom the runtime performance data of various kinds of DL models.The full dataset includes: (1) 10,238 model configurations of ten real-world DL models (e.g., VGG [60], ResNet[28], and Vanilla RNN [50])whose representative hyperparameters’ values are randomly gen-erated within reasonable ranges; (2) 8,403 model configurationswhose neural architectures are randomly synthesized by NeuralArchitecture Search (NAS) [47] algorithms. We explore two typi-cal prediction tasks for GPU memory consumption and trainingtime; however, our approach can be easily applied to other taskssuch as predicting the inference time and GPU power consumption.The experimental results show that DNNPerf outperforms all fivecompared methods (e.g., BiRNN [3] and GBDT [43]) with an overallmean relative error of 13.684% for the GPU memory consumptionprediction and an overall mean relative error of 7.443% for thetraining time prediction, confirming its effectiveness.

We summarize our main contributions as follows:(1) We propose a novel Graph Neural Network-based approach

for accurately predicting the runtime performance of deeplearning models.

(2) We design a rich set of node and edge features to captureperformance-related factors (e.g., compute, I/O, etc.). Wealso propose a novel Attention-based Node-Edge Encoderto better encode the computation graph of a DL model.

(3) We implement an end-to-end tool named DNNPerf anddemonstrate its practical effectiveness through a set of ex-periments on thousands of configurations of real-world andsynthetic DL models.

1 from tensorflow.keras import layers, models2 ...3 model = models.Sequential()4 model.add(layers.Conv2D(filters=32, kernel_size=(3, 3),

activation='relu', input_shape=(224, 224, 3)))↩→

5 model.add(layers.AveragePooling2D(pool_size=(2, 2)))6 model.add(layers.Conv2D(filters=64, kernel_size=(3, 3),

activation='relu'))↩→

7 model.add(layers.AveragePooling2D(pool_size=(2, 2)))8 model.add(layers.Flatten())9 model.add(layers.Dense(units=1000, activation="softmax"))10 model.compile(optimizer='SGD',

loss="categorical_crossentropy")↩→

11 model.fit(train_images, train_labels, batch_size, epochs=60)

Figure 1: A sample Keras MNIST model constructed with theConv2D, AvgPool2D, Flatten, Dense, and Softmax operators.

Figure 2: Computation graph for training the above model.

2 BACKGROUND2.1 Deep Learning Models and Computation

GraphsBeing essentially mathematical functions, DLmodels are formalizedby frameworks like TensorFlow [1] and PyTorch [45] as tensor-oriented computation graphs (i.e., directed acyclic graphs). Theinputs and outputs of a computation graph or a graph node aretensors (multidimensional arrays of numerical values). Each nodedenotes a mathematical operation called an operator (e.g., matrixaddition). An edge pointing from one output of node𝐴 to one inputof 𝐵 delivers a tensor and specifies their execution dependency. ADL model usually provides a set of configuration options for itshyperparameters (e.g., the batch size and the dropout) and neuralarchitecture (e.g., the number of layers).

Figure 1 shows the snippet of a sample MNIST [14] trainingprogram written with TensorFlow’s Keras [11] API. The programconstructs a sequential model with the framework built-in Conv2D(2D convolution with a 3 × 3 kernel size), AvgPool2D (2D aver-age pooling with a 2 × 2 pool size), Flatten (collapsing the inputtensor into one dimension), Dense (fully-connected layer with 64units), and Softmax (normalizing “the probability distribution over

Page 3: Runtime Performance Prediction for Deep Learning Models

Runtime Performance Prediction for Deep Learning Models with Graph Neural Network

𝑘 different classes” [22]) operators (lines 3–9). The above numberof output channels (denoted by the variable filters), kernel size,number of units, etc. are hyperparameters that control the train-ing process. For training, the framework constructs a computationgraph shown in Figure 2 with concrete hyperparameter values andthen applies iterative forward and backward propagation on sucha graph to learn the optimal learnable parameters (aka weights).Operators in the middle of Figure 2 (e.g., AvgPool2d_BP) are au-tomatically generated by TensorFlow for backward propagation.Input data (𝐼𝑛𝑝𝑢𝑡_𝐷𝑎𝑡𝑎) is fed through the computation graph andmanipulated by the developer-specified operators (on the left of Fig-ure 2). Input labels (𝐼𝑛𝑝𝑢𝑡_𝐿𝑎𝑏𝑒𝑙𝑠) and produced outputs of Denseare then propagated backward to compute the gradients of weights.Finally, an optimizer [34] updates the weights to minimize the loss(e.g., the mean squared error between actual and predicted outputs),marking the end of one training iteration.

2.2 Graph Neural NetworksThe graph is a common data structure for representing elementsand their relations, and it is widely used in data analysis. Recently,there are emerging requirements to apply DL techniques to learn-ing from graph data. However, existing models such as Convolu-tional Neural Network (CNN) [39] and Recurrent Neural Network(RNN) [30] have limitations in handling graphs because of the irreg-ularity caused by unequal node neighbors. Graph Neural Network(GNN) [52] is then proposed to efficiently address this issue.

Graph Convolutional Network (GCN) [35] is the first popularGNN, which learns localized spatial features by convolutional filters.Graph Attention Network (GAT) [66] uses masked self-attentionallayers to address the shortcomings of prior GCN-based methods.GAT does not require costly matrix operations or knowing thegraph structure upfront by specifying different weights to differentnodes in a neighborhood. In some datasets, edge features also con-tain important graph information; however, they are not adequatelyutilized by GCN and GAT. Edge Enhanced Graph Neural Network(EGNN) [21] builds a framework for a family of new GNNs thatcan sufficiently exploit the features of edges (including both undi-rected and multi-dimensional ones). In our work, we adopt GNNand propose a novel Attention-based Node-Edge Encoder for betterpredicting the runtime performance of DL models.

3 DNNPERF: A GNN-BASED APPROACH3.1 Problem FormulationAs mentioned before, a DL modelM is represented as the followingdirected acyclic graph (DAG) [22]:

M = ⟨{𝑢𝑖 }𝑛𝑖=1, {𝑒𝑖 𝑗 = (𝑢𝑖 , 𝑢 𝑗 )}𝑖, 𝑗∈[1,𝑛] , {ℎ𝑝𝑘 }𝑚𝑘=1⟩ .

Each node𝑢𝑖 is an operator such as the above Conv2D and AvgPool2D.A directed edge 𝑒𝑖 𝑗 pointing from node 𝑢𝑖 to 𝑢 𝑗 delivers an outputtensor of 𝑢𝑖 to 𝑢 𝑗 as input and also specifies the execution depen-dency. Each ℎ𝑝𝑘 is a hyperparameter of some operator such as thebatch size and input tensor shape, whose domain is denoted as𝐵𝑘 . The modelM is assumed to be deterministic without control-flow operators (e.g., loops and conditional branches) or dynamicstructural changes; hence, the execution flow and runtime perfor-mance across different training iterations should be identical. We

Figure 3: Workflow of DNNPerf.

use M(𝑏1 ∈ 𝐵1, 𝑏2 ∈ 𝐵2, · · · , 𝑏𝑚 ∈ 𝐵𝑚) to denote one model con-figuration ofM. Then, ΔM , the configuration space ofM, is definedas follows:

ΔM = { M(𝑏1, 𝑏2, · · · , 𝑏𝑚) | 𝑏𝑘 ∈ 𝐵𝑘 for 𝑘 ∈ [1,𝑚] } .A runtime specification describes the execution environment,

which currently contains the target device’s information such asfloating-point operations per second (FLOPS) and memory capacity.All the actual runtime specifications constitute a space denoted byS. Then, the runtime performance prediction tasks forM can beformulated as a family of real regression functions:

𝑓𝑖 : ΔM × S → R .Each 𝑓𝑖 is a performance function, taking a model configuration anda runtime specification as its parameters. Hence, our GNN-basedapproach trains two typical performance functions that return theGPU memory consumption and training time, respectively.

3.2 WorkflowFigure 3 illustrates the workflow of DNNPerf. It accepts a DL modelfile, a model configuration specification, and a runtime specificationas input and reports the runtime performance values. DNNPerfimplements a front-end parser using the framework’s built-inmodeldeserialization APIs, which is responsible for reading the inputmodel file and reconstructing the corresponding computation graph.The model configuration specification includes hyperparametervalues. The runtime specification currently contains the targetdevice’s information (e.g., GPU FLOPS).

DNNPerf traverses the computation graph to automatically gen-erate the node and edge features, strictly following the operatorexecution order that is predefined by reference to the frameworkimplementations [38, 48]. It employs a novel Attention-based Node-Edge Encoder (ANEE) to embed the nodes and edges in severalrounds (Section 3.4). To improve the prediction accuracy and re-duce the amount of training data, DNNPerf further utilizes thesemantics of both nodes and edges. The details are summarized inTable 1 and Table 2. Finally, DNNPerf aggregates the updated fea-ture vectors and uses a multilayer perceptron (MLP) [27] to outputthe result.

3.3 Model EncodingWe refer to the framework source code, carefully identify hiddenfactors, and then design a set of effective performance-related fea-tures based on the computational semantics of both nodes andedges. First, DNNPerf uses an operator’s type, hyperparameters,tensor sizes, GPU memory consumption, and FLOPs to capture

Page 4: Runtime Performance Prediction for Deep Learning Models

Yanjie Gao, Xianyu Gu, Hongyu Zhang, Haoxiang Lin, and Mao Yang

Table 1: Node features.Category Name DescriptionOperator (ℎ𝑡 ) Operator Type Type of the operator (e.g., Conv2D and AvgPool2D)

Hyperparameter (ℎ𝑝 ) Hyperparameter Type and value of each hyperparameter of the operator (e.g., kernel size andchannel size)

Tensor (ℎ𝑑 )

Weight Tensor Size Sizes of weights (aka learnable parameters) and weight gradients (computedunder backward propagation for updating weights)

In/Out Tensor Size Sizes of inputs, outputs, and output gradients (computed under backward prop-agation for calculating weight gradients)

Temporary Tensor Size Sizes of temporary variables used by the operatorComputation Cost (ℎ𝑐 ) FLOPs Number of floating-point operations of the operatorDevice (ℎ𝑟 ) FLOPS Peak floating-point operations per second

Memory Capacity Total amount of the device memory

Table 2: Edge features.Category Name DescriptionEdge (𝑒𝑡 ) Edge Type Type of the edge (“forward” or “backward”)Tensor (𝑒𝑑 ) Tensor Size Size of the delivered tensorDevice (𝑒𝑟 ) Bandwidth Bandwidth for accessing the delivered tensor

operator-level factors. Secondly, DNNPerf uses an edge’s type andsize of the delivered tensor to capture the operator execution order,tensor liveness [19], and I/O costs. Lastly, some device featuressuch as FLOPS, memory capacity, and bandwidth are also extractedand associated with the nodes and edges to capture runtime fac-tors such as garbage collection, memory access, and distributedcommunication.

3.3.1 Initial Node Encoding. For each node of a computation graph(such as the graph in Figure 2), we associate it with a real-valuedfeature vector ℎ ∈ R𝑁1 where 𝑁1 is the number of designed nodefeatures. ℎ is composed by several sub-feature vectors such thatℎ = 𝐹 (ℎ𝑡 ∥ ℎ𝑝 ∥ ℎ𝑑 ∥ ℎ𝑐 ∥ ℎ𝑟 ):

(1) “∥” is the vector concatenation operation, and 𝐹 is the func-tion for raw feature encoding and vector shape alignment.

(2) ℎ𝑡 represents the operator type (e.g., Conv2D).(3) ℎ𝑝 is a vector for encoding the operator’s hyperparameters,

whose length equals the number of total hyperparameters.(4) ℎ𝑑 stores the sizes of the weight, input, output, and tempo-

rary tensors.(5) ℎ𝑐 denotes the computation cost. Currently, we use FLOPs.(6) ℎ𝑟 encodes the device information such as GPU memory

capacity and FLOPS, which affects the operator execution.Details are listed in Table 1. For categorical features (e.g., operatortype), we adopt the One-Hot encoding [70] strategy.

We take Conv2D, the 2D convolution operator [39], as an exampleto illustrate how to compute the initial values of ℎ𝑐 (i.e., FLOPs)and ℎ𝑑 . Let 𝑠𝑧 be the size of the data type in bytes (e.g., 4 bytes forfloat data). The following hyperparameters are needed: kernelsize (𝑘𝑠), stride (𝑠𝑑), padding (pad), and dilation (dil). We assumethat each of them is an array of two positive integers that specifythe height- and width-related information, respectively. Supposethat Conv2D receives an input tensor whose shape is [𝑁,𝐶𝑖 , 𝐻𝑖 ,𝑊𝑖 ](i.e., a batch of 𝑁 input samples with height 𝐻𝑖 , width𝑊𝑖 , and 𝐶𝑖input channels). Conv2D produces an output tensor whose shape is

[𝑁,𝐶𝑜 , 𝐻𝑜 ,𝑊𝑜 ]. 𝐶𝑜 is the number of output channels. The outputheight (𝐻𝑜 ) and width (𝑊𝑜 ) can be calculated as follows [62]:

𝐻𝑜 =

⌊𝐻𝑖 + 2 × pad [0] − dil [0] × (𝑘𝑠 [0] − 1) − 1

𝑠𝑑 [0] + 1⌋,

𝑊𝑜 =

⌊𝑊𝑖 + 2 × pad [1] − dil [1] × (𝑘𝑠 [1] − 1) − 1

𝑠𝑑 [1] + 1⌋.

Now we show how the FLOPs under forward propagation [8] andtensor sizes (in bytes) of Conv2D are computed:

𝐹𝐿𝑂𝑃𝑠 (Conv2D) = 2 × (𝐶𝑖 × 𝐻𝑖 ×𝑊𝑖 + 1) × 𝑁 ×𝐶𝑜 × 𝐻𝑜 ×𝑊𝑜 ,

𝐼𝑇 (Conv2D) = 𝑠𝑧 × 𝑁 ×𝐶𝑖 × 𝐻𝑖 ×𝑊𝑖 ,

𝑂𝑇 (Conv2D) = 𝑠𝑧 × 𝑁 ×𝐶𝑜 × 𝐻𝑜 ×𝑊𝑜 ,

𝑊𝑇 (Conv2D) = 𝑠𝑧 × (𝐶𝑖 × 𝐻𝑖 ×𝑊𝑖 + 1) ×𝐶𝑜 .

𝐼𝑇 , 𝑂𝑇 , and𝑊𝑇 are functions that return the sizes of the input,output, and weight tensors, respectively.

Currently, DNNPerf supports 70+ common types of DL operator.

3.3.2 Initial Edge Encoding. For each edge of a computation graph,we associate it with a real-valued feature vector 𝑒 ∈ R𝑁2 where 𝑁2is the number of designed edge features. 𝑒 is composed of severalsub-feature vectors such that 𝑒 = 𝑒𝑡 ∥ 𝑒𝑑 ∥ 𝑒𝑟 :

(1) 𝑒𝑡 represents the edge type. Currently, its initial value iseither “forward” or “backward” depending on which propa-gation stage the delivered tensor is used. 𝑒𝑡 tries to capturethe tensor liveness that affects the GPU memory consump-tion.

(2) 𝑒𝑑 and 𝑒𝑟 denote the size of the delivered tensor and devicebandwidth, respectively. These two features affect the I/Oaccess latency, which in turn affects the training time.

Details are listed in Table 2.

3.3.3 Normalization. Since feature values vary widely, DNNPerfscales the range of each feature to the interval [0, 1] by MinMaxS-caler [53]. DNNPerf does not use the neighbor-level normalization

Page 5: Runtime Performance Prediction for Deep Learning Models

Runtime Performance Prediction for Deep Learning Models with Graph Neural Network

from EGNN [21] because the average node degree of a computationgraph is much smaller than that of previously studied large graphssuch as the social network graph. Suppose 𝑔 is the computationgraph of a model configuration in the training dataset 𝐷𝑆 and 𝑢is a node of 𝑔. We use ℎ𝑔,𝑢 and ℎ𝑔,𝑢,𝑖 to denote the initial featurevector of node 𝑢 and its 𝑖-th dimension, respectively. Then, the nor-malized node feature vector ℎ̂𝑔,𝑢 = [ℎ̂𝑔,𝑢,1, · · · , ℎ̂𝑔,𝑢,𝑁1 ] is producedas follows:

ℎ̂𝑔,𝑢,𝑖 =ℎ𝑔,𝑢,𝑖 −min 𝑆𝑖max 𝑆𝑖 −min 𝑆𝑖

, 𝑆𝑖 = {ℎ𝑔′,𝑢′,𝑖 | 𝑔′ ∈ 𝐷𝑆 ∧ 𝑢′ ∈ 𝑔′} .

Similarly, we use 𝑒𝑔,𝑙 and 𝑒𝑔,𝑙,𝑖 to denote the initial feature vectorof edge 𝑙 and its 𝑖-th dimension, respectively. Then, the normalizededge feature vector 𝑒𝑔,𝑙 = [𝑒𝑔,𝑙,1, · · · , 𝑒𝑔,𝑙,𝑁2 ] is produced as follows:

𝑒𝑔,𝑙,𝑖 =𝑒𝑔,𝑙,𝑖 −min 𝑆𝑖max 𝑆𝑖 −min 𝑆𝑖

, 𝑆𝑖 = {𝑒𝑔′,𝑙 ′,𝑖 | 𝑔′ ∈ 𝐷𝑆 ∧ 𝑙 ′ ∈ 𝑔′} .

3.4 Attention-based Node-Edge EncoderTo achieve high prediction accuracy while taking into account ef-ficiency, we propose a novel Attention-based Node-Edge Encoder(ANEE) for multi-dimensional features, which is shown in the mid-dle part of Figure 3.

Firstly, the original GCN and GAT models are restricted to onlyone-dimensional, real-valued edge features, which cannot capturethe effects of an edge. In comparison, ANEE could capture both thenode and edge features. Secondly, the operators of a DL model con-tain less fan-out, and the costs of operators are different. DNNPerfuses a global normalizer that could achieve better efficiency andaccuracy. We also use weight sharing design for the encoder pa-rameter matrix (e.g., W𝑢 , W𝑒 , and W𝑚) to reduce the number ofparameters, which could benefit the training/inference time andGNN model size. Therefore, compared with EGNN [21], ANEE islightweight and can achieve better efficiency.

ANEE performs multiple rounds of computation to encode thenodes and edges. Suppose 𝑔 is the computation graph of a modelconfiguration, 𝑢 is a node, and 𝑙 = (𝑢𝑠 , 𝑢𝑑 ) is an edge pointing fromthe source node 𝑢𝑠 to the target node 𝑢𝑑 . We use ℎ𝑖𝑔,𝑢 and 𝑒𝑖

𝑔,𝑙to

denote the computed feature vectors of node𝑢 and edge 𝑙 in the 𝑖-th(𝑖 > 0) round, respectively. ℎ0𝑔,𝑢 and 𝑒0

𝑔,𝑙are the initial normalized

feature vectors mentioned before.First, ANEE computes a preliminary result of ℎ𝑖𝑔,𝑢 (denoted by

ℎ𝑖

𝑔,𝑢 ) locally as follows:

LeakyReLU(𝑥) = max(0, 𝑥) − 𝛼 ×max(0,−𝑥) ,

ℎ𝑖

𝑔,𝑢 = LeakyReLU(W𝑢 × ℎ𝑖−1𝑔,𝑢 ) .The activation function LeakyReLU is the leaky version of RectifiedLinear Unit (ReLU), 𝛼 is the slope coefficient that defaults to 0.01,andW𝑢 is a parameter matrix.

Next, ANEE updates 𝑒𝑖𝑔,𝑙

as follows:

𝑒𝑖𝑔,𝑙

= 𝜎 (𝑎𝑇 × (ℎ𝑖𝑔,𝑠 ∥ ℎ𝑖𝑔,𝑑 ) ×W𝑒 × 𝑒𝑖−1𝑔,𝑙

) .

𝜎 is the sigmoid activation function, and W𝑒 is a parameter matrix.The attention mechanism uses a single-layer feedforward neuralnetwork that is denoted by a weight vector 𝑎 ∈ R2×𝑁1 . 𝑎𝑇 is thetranspose of 𝑎.

Then, ANEE gathers the information of 𝑢’s neighbors and theassociated edges to compute the final ℎ𝑖𝑔,𝑢 as follows:

𝑓 (𝑢′, 𝑙 ′) = Softmax(W𝑚 × 𝑒𝑖𝑔,𝑙 ′ ) × ℎ

𝑖

𝑔,𝑢′ ,

ℎ𝑖𝑔,𝑢 = LeakyReLU(∑︁

𝑙 ′=(𝑢′,𝑢 ) 𝑓 (𝑢′, 𝑙 ′)) .

𝑢′ is a neighbour node of 𝑢, 𝑙 ′ is their associated edge, andW𝑚 isa parameter vector. Softmax makes coefficients easily comparableacross different features and normalizes them across all choices offeatures.

Finally, DNNPerf performs a global aggregation by summing upall the node feature vectors. The aggregated result will be fed tothe predictor (an MLP layer) to generate the runtime performancevalue.

3.5 Loss FunctionThe GNN-based prediction of GPU memory consumption and train-ing time can be reduced to a regression problem. We use the meansquared error (MSE) to design the following loss function:

L =

∑𝑁𝑖=1 (𝑦𝑖 − 𝑦𝑖 )2

𝑁.

𝑁 is the number of model configurations in the training set; 𝑦𝑖 and𝑦𝑖 are the real and predicted runtime performance values of the 𝑖-thmodel configuration, respectively. We actually tried mean relativeerror (MRE), but found no better results, so we adopted the morecommonly used MSE.

4 EXPERIMENTAL SETUP4.1 DatasetsWe collect 18,641 model configurations implemented with Ten-sorFlow v1.13.1 and divide them into two groups. The first is anHPO (Hyperparameter Optimization) [13] dataset that includes10,238 configurations of ten real-world models with various hy-perparameter combinations. In this dataset, the configurations offive randomly selected DL models (LeNet [40], ResNet-V1 [28],Inception-V3 [61], Vanilla RNN [50], and LSTM [29]) are used fornormal training, validation, and test, in a proportion of 70:10:20. Therest models (AlexNet [37], VGG [60], OverFeat [55], ResNet-V2 [28],and GRU [9]) are unseen to DNNPerf because their configurationsdo not exist in the training set and are only used to evaluate thegeneralization ability of DNNPerf (Section 5.2). These ten modelsare representative in areas of computer vision, speech recognition,and natural language processing, among which some (such as RNN,LSTM, and GRU) are also widely used in various DL for software en-gineering applications [23, 32, 42, 74]. The second is a NAS (NeuralArchitecture Search) [47] dataset that contains 8,403 configurationsof NAS-searched models.

HPO dataset. We choose representative DL models from a Ten-sorFlow public benchmark [59] and adopt a random strategy basedon the provided APIs to generate the model configurations. Weconsider the batch size (16, 32, 48, 64, 80, 96, 112, and 128), numberof input channels (1, 3, 5, 7, and 9), input height (224), and inputwidth (224) as the hyperparameters. For VGG, we use differentdomains of the batch size (interval [16, 128]) and number of inputchannels (interval [1, 10]). For AlexNet, OverFeat, and LeNet, we

Page 6: Runtime Performance Prediction for Deep Learning Models

Yanjie Gao, Xianyu Gu, Hongyu Zhang, Haoxiang Lin, and Mao Yang

Figure 4: Example neural architectures in the NAS dataset.

additionally take the number of output channels of their Conv2Doperators as a hyperparameter whose domain is set to an interval[default × 0.5, default × 2]. For VGG model configurations, we ran-domly generate them from VGG-11, VGG-16, and VGG-19 models.Similarly, for ResNet-V1 and ResNet-V2 model configurations, werandomly generate them from ResNet-50, ResNet-101, ResNet-152,and ResNet-200 models. For Inception-V3, we assign random num-bers to the Inception API’s min_depth, depth_multiplier, andend_point parameters. The statistics of the HPO dataset is shownin Table 3.

NAS dataset.We use the NAS search space illustrated in Figure 4to generate the model configurations. The model skeleton consistsof combinations of operator cells (e.g., Cell Type 1, Cell Type2, etc.).Each cell contains a diverse combination of different operators(e.g., Conv2D, Dense, and MaxPool2D) and hyperparameters. Thehyperparameters include: (1) batch size (16, 32, 48, 64, 80, 96, 112,and 128), input height (224), input width (224), and number ofinput channels (1, 3, 5, 7, and 9); (2) number of output channels(16, 32, 64, 96, 128, 160, 256, 512, and 1,024) and kernel size (1 × 1,3 × 3, and 5 × 5) for Conv2D; (3) kernel size (2 × 2) for MaxPool2D;(4) number of units (128, 256, 512, 768, and 1,024) for Dense. Weuse the “same” padding for each Conv2D operator to avoid shapeerrors. The cells also contains randomly generated Dropout andMaxPool2D operators. The statistics of the NAS dataset is shown inTable 4.

We measure the runtime performance (execution time and peakGPU memory consumption) of three iterations and compute aver-age values. The first few iterations are bypassed to make sure thatthe training has been stable.

4.2 BaselinesTo compare with the GNN-based model of DNNPerf, we considerthe following models as the baselines:

(1) BiRNN (Bidirectional RNN) [3] is a two-layer bidirectionalRNN [30] with the LSTM [29] cell. Node features includethe name, type, and computation cost of the operator, whichare computed in the same way as the initial node featuresof DNNPerf. We feed the node feature vectors to BiRNN asthe input sequence according to the topological order of thecomputation graph. This baseline allows us to identify how

Table 3: Statistics of the HPO dataset.Model Name Node # Avg Node # Edge # Avg Edge # Model #

LeNet [18, 18] 18.0 [25,25] 25.0 800

ResNet-V1 [242, 796] 494.6 [413, 1,382] 854.7 622

Inception-V3 [12, 254] 95.2 [16, 449] 162.7 1,850

Vanilla RNN [20, 56] 38.0 [28, 82] 55.0 1,280

LSTM [20, 56] 38.0 [28, 82] 55.0 1,280

AlexNet (UNSEEN) [28, 28] 28.0 [40, 40] 40.0 800

VGG (UNSEEN) [38, 54] 46.7 [55, 79] 68.0 870

OverFeat (UNSEEN) [28, 28] 28.0 [40, 40] 40.0 800

ResNet-V2 (UNSEEN) [242, 796] 493.5 [413, 1,382] 852.9 656

GRU (UNSEEN) [20, 56] 38.0 [28, 82] 55.0 1,280

Total [12, 796] 102.9 [16, 1,382] 170.6 10,238

Table 4: Statistics of the NAS dataset.

Node # Total [20, 29] [30, 39] [40, 49] [50, 70]

Model # 8,403 499 4,445 2,974 485

Edge # Total [35, 49] [50, 64] [65, 79] [80, 115]

Model # 8,403 2,071 4,183 1,797 352

effective is GNN in capturing the data flow information ofthe computation graph.

(2) ARNN (Adjacency BiRNN) [3] is an extension to BiRNN [3].The feature vector of a node is updated by computing theaverage of vectors of predecessors, successors, and itself.ARNN is a stronger baseline since it takes some structuralinformation of a computation graph into account.

(3) MLP (Multilayer Perceptron) [27] is a traditional machinelearning model. Prior work [33] used it to predict the oper-ator execution time. We evaluate MLP with the same nodefeatures that DNNPerf uses.

(4) GBDT (Gradient Boost Decision Tree) [43] is also a tradi-tional machine learning model. It was used in the work ofChen et al. [7] for encoding loop programs. GBDT uses thesame features as MLP.

(5) BRP-NAS (Binary Relation Predictor-based NAS) [18] is agraph convolutional network for the NAS dataset. It en-codes only the graph topology without considering theruntime factors (e.g., compute and I/O) of nodes and edges.

We have implemented BiRNN, ARNN, MLP, and BRP-NAS withPyTorch v1.5.0 [45]; for GBDT, we use the built-in functionality ofscikit-learn v0.20.3 [46]. After tuning these models, we select thefollowing hyperparameter values. For BiRNN and ARNN, learningrate is 0.0001, layer size is 1, and hidden size is 512. ForMLP, learningrate is 0.001, and the number of units of each layer is 512, 128, 16,and 1, respectively. For GBDT, the learning rate is 0.01, the maxtree depth is 30, and the number of trees is 200. For BRP-NAS, thelearning rate is 0.0001.

4.3 Implementation and Settings of DNNPerfWehave implementedDNNPerfwithDGL (DeepGraph Library) [68]v0.4.3, which is a package built on top of PyTorch for easy imple-mentation of graph neural networks.

We tune the hyperparameters of our GNN-based model on thevalidation dataset. The MLP uses hidden layers of 512, 128, and 16

Page 7: Runtime Performance Prediction for Deep Learning Models

Runtime Performance Prediction for Deep Learning Models with Graph Neural Network

units (the size of final output is 1). The number of message-passingrounds of ANEE is set to 3 (training time prediction) or to 1 (GPUmemory consumption prediction). We use the Adam [34] optimizerwith default parameter values (𝛼 = 0.9 and 𝛽 = 0.999), set batchsize to 64 and initial learning rate to 0.0001, and train our model for250 epochs (training time prediction) or 200 epochs (GPU memoryconsumption prediction). After tuning, our model scales to about737.2 graphs per second during the training and about 1,800 graphsper second during the inference. These graphs have 73.7 nodes and119.7 edges on average.

4.4 Evaluation MetricsTo assess the effectiveness of DNNPerf, we use the mean relativeerror (MRE) and root-mean-square error (RMSE):

MRE =

∑𝑁𝑖=1

��� �̂�𝑖−𝑦𝑖𝑦𝑖

���𝑁

× 100% , RMSE =

√︄∑𝑁𝑖=1 (𝑦𝑖 − 𝑦𝑖 )2

𝑁.

𝑁 is the number of data samples in the test set; 𝑦𝑖 and 𝑦𝑖 are thepredicted and real performance values of the 𝑖-th sample, respec-tively. We choose these two metrics because they are widely used tomeasure the accuracy of prediction models [19, 26, 33]. The smallerthe error, the higher the prediction accuracy. For the prediction ofGPU memory consumption and training time, the units of RMSEvalues are gigabyte (GB) and millisecond (ms), respectively.

5 EVALUATIONWe evaluate our proposed approach by addressing the followingResearch Questions (RQs):RQ1: How effective is DNNPerf in predicting runtime performance?RQ2: How general is DNNPerf to unseen DL models?RQ3: How effective is DNNPerf in the ablation study?

Our experiments are conducted on an Azure Standard ND24svirtual machine [44], which has 24 Intel Xeon E5-2690V4 (2.60 GHz,35M Cache) vCPUs, 448 GB main memory, and 4 NVIDIA Tesla P40(24 GB GDDR5X memory) GPUs, running Ubuntu 18.04.

5.1 RQ1: How effective is DNNPerf inpredicting the runtime performance?

In this section, we compare DNNPerf with five baselines (Sec-tion 4.2) on the same test set. Table 5 shows the MRE and RMSE val-ues of all the six approaches for predicting the runtime performance.On average (see the “Overall” rows), DNNPerf achieves 7.443%MRE and 58.5 RMSE for the training time prediction and 13.684%MRE and 1.806 RMSE for the GPU memory consumption predic-tion, which exceed the best MRE/RMSE values of the baselines:17.982%/106.2 fromMLP and 15.592%/2.213 fromGBDT, respectively.More specifically, DNNPerf outperforms BiRNN by 13.363%/92.9,ARNN by 13.497%/55.4, MLP by 10.539%/47.7, GBDT by 23.57%/64.1,and BRP-NAS by 87.805%/276.9 in terms of MRE/RMSE for the train-ing time prediction; DNNPerf outperforms BiRNN by 8.618%/1.011,ARNNby 11.813%/1.076,MLP by 7.776%/0.529, GBDT by 1.908%/0.407,and BRP-NAS by 19.97%/2.338 in terms of MRE/RMSE for the GPUmemory consumption prediction. We also experiment on the HPOmodels and the NAS models separately. The results demonstratethat DNNPerf still excels each of the baselines and confirm its ef-fectiveness. We think the reason is that DNNPerf captures richer

Table 5: Overall experimental results.Model Prediction of Training Time

Name Metrics DNNPerf BiRNN ARNN MLP GBDT BRP-NAS

OverallMRE (%) 7.443 20.806 20.940 17.982 31.013 95.248

RMSE (ms) 58.5 151.4 113.9 106.2 122.6 335.4

HPOMRE (%) 7.813 21.560 22.220 19.405 33.113 86.715

RMSE (ms) 51.9 107.7 95.2 94.7 120.4 294.3

NASMRE (%) 6.272 18.418 16.879 13.471 24.353 122.304

RMSE (ms) 75.7 242.6 159.5 136.4 129.3 441.0

Model Prediction of GPU Memory Consumption

Name Metrics DNNPerf BiRNN ARNN MLP GBDT BRP-NAS

OverallMRE (%) 13.684 22.302 25.497 21.460 15.592 33.654

RMSE (GB) 1.806 2.817 2.882 2.335 2.213 4.144

HPOMRE (%) 13.162 22.138 26.684 22.104 14.150 31.510

RMSE (GB) 1.801 2.819 3.026 2.442 2.137 4.191

NASMRE (%) 15.338 22.823 21.732 19.416 20.164 40.452

RMSE (GB) 1.820 2.811 2.369 1.955 2.437 3.991

Table 6: Experimental results on HPO models.Model Prediction of Training Time

Name Metrics DNNPerf BiRNN ARNN MLP GBDT BRP-NAS

OverallMRE (%) 8.456 28.751 31.962 22.470 27.582 69.044

RMSE (ms) 36.8 89.3 94.0 74.4 60.2 226.8

LeNetMRE (%) 3.857 27.214 15.457 14.622 43.499 69.194

RMSE (ms) 7.1 64.5 32.7 23.2 44.6 87.0

ResNet-V1MRE (%) 4.110 7.521 11.403 7.402 8.253 61.706

RMSE (ms) 60.0 84.2 109.3 74.5 130.5 505.3

Inception-V3MRE (%) 12.697 28.680 41.854 18.464 47.683 95.833

RMSE (ms) 41.2 58.3 70.1 31.3 53.6 175.2

RNNMRE (%) 8.491 41.321 40.958 36.038 21.123 77.296

RMSE (ms) 20.8 91.4 91.4 77.0 36.8 154.6

LSTMMRE (%) 7.528 27.689 29.437 26.196 7.835 29.946

RMSE (ms) 38.5 126.9 131.4 117.4 37.9 183.7

Model Prediction of GPU Memory Consumption

Name Metrics DNNPerf BiRNN ARNN MLP GBDT BRP-NAS

OverallMRE (%) 12.779 24.282 30.475 18.068 16.771 63.249

RMSE (GB) 1.450 2.700 3.202 1.830 1.300 4.601

LeNetMRE (%) 11.401 19.913 20.081 15.799 15.103 46.489

RMSE (GB) 1.313 2.557 2.236 1.561 1.587 3.466

ResNet-V1MRE (%) 8.716 15.342 16.254 9.727 7.378 31.064

RMSE (GB) 1.632 2.852 2.957 1.693 1.735 4.790

Inception-V3MRE (%) 14.827 25.750 27.437 17.464 27.844 57.183

RMSE (GB) 1.150 1.912 1.916 1.429 1.217 2.686

RNNMRE (%) 10.787 16.848 22.195 17.098 16.659 114.484

RMSE (GB) 0.813 1.055 1.451 0.940 0.861 4.735

LSTMMRE (%) 14.678 36.017 54.649 24.913 7.867 46.936

RMSE (GB) 2.105 4.225 5.487 2.847 1.336 6.503

and more diverse performance-related information from both theoperators/hyperparameters and the neural architecture than theother approaches.

Table 6 shows the prediction results on the HPO dataset. Thetest model configurations are derived from five representative real-world models: LeNet, ResNet-V1, Inception-V3, Vanilla RNN, andLSTM, which are already seen to DNNPerf because the trainingdataset contains their configurations. We choose various hyperpa-rameters and larger value ranges (Section 4.1) to bring the diversityof the runtime performance. DNNPerf achieves satisfactory pre-cision and outperforms the baseline approaches in most cases by

Page 8: Runtime Performance Prediction for Deep Learning Models

Yanjie Gao, Xianyu Gu, Hongyu Zhang, Haoxiang Lin, and Mao Yang

20 30 40 50 700

50

100

Node Number

MRE

(%)

DNNPerf BiRNN ARNNMLP GBDT BRP-NAS

20 30 40 50 70

0

200

400

600

Node NumberRM

SE(m

s)

DNNPerf BiRNN ARNNMLP GBDT BRP-NAS

35 50 65 80 115

0

50

100

Edge Number

MRE

(%)

DNNPerf BiRNN ARNNMLP GBDT BRP-NAS

35 50 65 80 115

0

200

400

600

Edge Number

RMSE

(ms)

DNNPerf BiRNN ARNNMLP GBDT BRP-NAS

Figure 5: Experimental results of the training time predictionon NAS models.

20 30 40 50 70

20

30

40

Node Number

MRE

(%)

DNNPerf BiRNN ARNNMLP GBDT BRP-NAS

20 30 40 50 70

2

3

4

5

Node Number

RMSE

(GB)

DNNPerf BiRNN ARNNMLP GBDT BRP-NAS

35 50 65 80 115

20

30

40

Edge Number

MRE

(%)

DNNPerf BiRNN ARNNMLP GBDT BRP-NAS

35 50 65 80 115

2

3

4

Edge Number

RMSE

(GB)

DNNPerf BiRNN ARNNMLP GBDT BRP-NAS

Figure 6: Experimental results of the GPUmemory consump-tion prediction on NAS models.

a large margin. The overall MRE/RMSE improvement over thebaseline approaches ranges from 14.014%/23.4 to 60.548%/190 inpredicting the training time, and from 3.992%/0.38 to 50.47%/3.151in predicting the GPU memory consumption. We notice that theoverall RMSE value of GBDT for the GPUmemory consumption pre-diction is a little better than DNNPerf since GBDT has advantageson configurations of models whose operator types are relative single(e.g., LSTM and ResNet-V1). The experimental results demonstratethat DNNPerf is very stable to diverse hyperparameter options.

The NAS dataset exhibits a great diversity of neural architecturesbecause the numbers of nodes and edges are broadly distributed.Wedivide the dataset into several subsets according to value ranges ofthe node number (internals [20, 29], [30, 39], [40, 49], and [50, 70])and the edge number (internals [35, 49], [50, 64], [65, 79], and [80,115]) separately. Then, we conduct one experiment for each subsetand show the prediction results using line charts in Figures 5 and 6.

DNNPerf still achieves satisfactory precision and outperforms allthe baseline approaches. The MRE values of DNNPerf over differentnumbers of nodes ranges from 5.824% to 7.224% for the trainingtime prediction and from 14.805% to 15.898% for the GPU memoryconsumption prediction; the RMSE values ranges from 25.4 to 154.5and from 1.557 to 2.257, respectively. The MRE values of DNNPerfover different numbers of edges ranges from 5.914% to 6.988% for thetraining time prediction and from 14.772% to 16.583% for the GPUmemory consumption prediction; the RMSE values ranges from50.6 to 153.2 and from 1.630 to 1.999, respectively. The experimentalresults demonstrate that DNNPerf has good stability on diverseneural architectures.

5.2 RQ2: How general is DNNPerf to unseen DLmodels?

In this section, we evaluate the generalization ability of DNNPerf,which is important for our tool to be used in practice. We con-struct a test set that includes model configurations of AlexNet,VGG, OverFeat, ResNet-V2, and GRU. These five models are unseento DNNPerf; that is to say, the training set does not contain anyconfigurations of them.

After tuning, our model scales to about 737.2 graphs per secondduring the training and about 1,800 graphs per second during theinference. These graphs have 73.7 nodes and 119.7 edges on average.

Table 7 shows the MRE and RMSE values, in which the overall re-sults of DNNPerf outperform the baseline approaches. The improve-ment of MRE/RMSE ranges from 10.984%/40.4 to 83.509%/253.9for the training time prediction task, and from 0.233%/0.421 to12.472%/2.202 for the GPU memory consumption prediction task.BRP-NAS does not performs well because it is designed only for spe-cific NAS search space without modeling the runtime performance-related features. Other methods do not encode the complete com-putation graph and hidden runtime factors (e.g., operator sched-uling and tensor liveness), resulting in relatively poor results. Insummary, the experimental results demonstrate that DNNPerf isachieves good generalization ability.

5.3 RQ3: How effective is DNNPerf in theablation study?

In this section, we run additional experiments to study the impact ofalternative design choices of DNNPerf such as the graph encodingand feature normalization methods:

(1) DNNPerf-GAT, which replaces the ANEE layer (Section 3.4)by GAT [66]. Since GAT cannot perform local encoding onedge features through message passing, this experiment isto evaluate the effectiveness of ANEE.

(2) DNNPerf-StandardScaler, which uses StandardScaler [54]to normalize features “by removing the mean and scalingto unit variance” [46]. This experiment is to evaluate whichnormalization method is more effective.

(3) DNNPerf-NoTensorCost, which excludes the tensor sizeand computation cost features described in Table 1 andTable 2. This experiment is to evaluate the effectiveness ofruntime performance-related features.

Page 9: Runtime Performance Prediction for Deep Learning Models

Runtime Performance Prediction for Deep Learning Models with Graph Neural Network

Table 7: Experimental results on unseen models.Model Prediction of Training Time

Name Metrics DNNPerf BiRNN ARNN MLP GBDT BRP-NAS

OverallMRE (%) 7.651 19.751 19.771 18.635 34.504 91.160

RMSE (ms) 55.1 111.8 95.5 99.1 131.2 309.0

AlexNetMRE (%) 11.942 22.545 18.394 21.633 85.414 291.101

RMSE (ms) 13.6 27.1 28.6 26.5 90.4 264.6

VGGMRE (%) 7.295 13.308 10.099 16.817 12.593 41.648

RMSE (ms) 84.2 164.6 92.3 154.4 148.6 339.0

OverFeatMRE (%) 7.462 20.735 24.892 16.211 58.530 65.409

RMSE (ms) 24.3 90.7 77.9 77.6 178.6 154.4

ResNet-V2MRE (%) 5.690 7.232 10.580 5.892 9.933 65.009

RMSE (ms) 91.1 98.1 118.1 65.9 175.2 584.0

GRUMRE (%) 6.334 28.187 28.714 26.042 15.154 29.344

RMSE (ms) 29.7 118.7 118.9 105.8 59.1 130.7

Model Prediction of GPU Memory Consumption

Name Metrics DNNPerf BiRNN ARNN MLP GBDT BRP-NAS

OverallMRE (%) 13.258 21.599 25.730 23.119 13.491 23.529

RMSE (GB) 1.879 2.848 2.980 2.573 2.300 4.081

AlexNetMRE (%) 13.453 30.288 42.645 38.239 16.214 11.158

RMSE (GB) 1.270 2.584 3.462 3.162 1.489 0.942

VGGMRE (%) 15.772 17.918 17.897 16.786 16.499 28.979

RMSE (GB) 2.341 3.234 2.579 2.378 2.989 4.634

OverFeatMRE (%) 13.597 28.719 35.111 34.502 13.986 13.625

RMSE (GB) 1.188 2.469 2.980 2.941 1.320 1.131

ResNet-V2MRE (%) 8.282 14.294 15.110 9.553 8.382 30.159

RMSE (GB) 1.640 2.755 2.851 1.684 1.784 4.683

GRUMRE (%) 13.765 17.964 20.063 17.813 12.053 30.349

RMSE (GB) 2.273 2.991 2.973 2.418 2.835 5.490

(4) DNNPerf-ConcatEdge, which sums the node features andedge features separately, concatenates the summarized fea-tures into a vector, and then deliver the vector to MLP.

(5) DNNPerf-AvgReadout, which averages all the values ofnode features before passing the features to MLP. It andthe above DNNPerf-ConcatEdge are used to evaluate theeffectiveness of different global aggregation methods.

Table 8 shows the the overall prediction results of DNNPerf andother variants on the complete test set. Firstly, after replacing ourANEE encoder with GAT, the prediction accuracy is reduced inwhich the overall MRE/RMSE values increase by 4.25%/42.8 for thetraining time prediction and 3.373%/0.201 for the GPU memoryconsumption prediction. Nevertheless, the training throughput ofDNNPerf-GAT improves some from 737.2 graphs/second to 847.3graphs/second. We do not try EGNN [21] because it is much morecostly than GAT in training (about 3X slower). Secondly, the Min-MaxScaler normalization method outperforms StandardScaler inthree out of four cases, showing its effectiveness and stability. Al-though the RMSE value of the training time prediction using Stan-dardScaler is better, the difference is very small. Thirdly, we findthat the features of tensor size and computation cost are obviouslyeffective for the runtime performance prediction. After excludingthem, the overall MRE/RMSE values increase by 8.419%/33.2 for thetraining time prediction and 10.73%/1.514 for the GPU memory con-sumption prediction. Finally, the results of DNNPerf-ConcatEdgeand DNNPerf-AvgReadout show that concatenating the node andedge features and averaging the node features do not improve theprediction accuracy. The current global aggregation which sums up

Table 8: Ablation study.Ablation Training Time GPU Memory Consumption

Description MRE (%) RMSE (ms) MRE (%) RMSE (GB)

DNNPerf 7.443 58.5 13.684 1.806

DNNPerf-GAT 11.693 101.3 17.057 2.007

DNNPerf-StandardScaler 11.763 56.0 14.998 1.991

DNNPerf-NoTensorCost 15.862 91.7 24.414 3.320

DNNPerf-ConcatEdge 7.993 60.5 15.151 1.894

DNNPerf-AvgReadout 20.884 100.9 19.836 2.447

Table 9: Training cost and accuracy of DNNPerf on fivedatasets of different sizes.

Sample Memory Memory Time Time Collection TrainingSize (Ratio) RMSE(GB) MRE(%) RMSE(ms) MRE(%) Time Time

9964 (100%) 1.806 13.684 58.5 7.443 2.91h (20 collectors) 2.5h4982 (50%) 1.974 16.496 96.4 16.244 2.91h (10 collectors) 1.5h1992 (20%) 2.415 20.547 118.9 18.504 3.8h (3 collectors) 1h996 (10%) 2.607 23.610 120.2 18.538 2.9h (2 collectors) 0.5h498 (5%) 2.722 24.439 141.5 24.058 2.9h (1 collectors) 0.45h

all the node features is more suitable to represent the accumulatedruntime performance of the computation graph.

6 DISCUSSION6.1 Extensibility of DNNPerfCurrently, DNNPerf supports representative real-world DL modelsand 70+ types of operator. Users can extend DNNPerf by incorporat-ing new operators to support other models. To add a new operator,users need to formulate the features analytically based on its se-mantics and implement the feature extraction scripts. Furthermore,users should add training data that contains the new operator.

DNNPerf can be easily applied to predict other runtime perfor-mance metrics of both model training and inference, such as infer-ence time, GPU utilization, and GPU power consumption. Usersneed to collect new training data with the metrics as the label. Notethat for a prediction task associated with model inference, the front-end parser should remove operators under back propagation fromthe computation graph.

6.2 Training Cost of DNNPerfA larger training dataset is generally helpful to increasing the ac-curacy of a prediction model; however, the training overhead alsoincreases and may exceed what users can afford. In this section, wereport the training cost and accuracy of DNNPerf on five datasetsof different sizes in Table 9. The original training dataset contains9,964 model configurations (excluding the testing and unseen con-figurations), from which other datasets are randomly generated.

Since the training data can be easily collected in parallel, thecollection cost is not significant. For example, it spends only 2.91hours to finish the collection of the original training dataset using20 collectors, which is equal to that of collecting the smallest dataset(498 model configurations) using 1 collector. The more collectors,the less collection time. Training our GNN-based model does nottake long either. For instance, the training with the original datasetcompletes after 2.5 hours. When shrinking the dataset, the trainingtime is shortened obviously. Although the accuracy of DNNPerfdecreases too, it is still acceptable because the MRE values even stay

Page 10: Runtime Performance Prediction for Deep Learning Models

Yanjie Gao, Xianyu Gu, Hongyu Zhang, Haoxiang Lin, and Mao Yang

Table 10: Experimental results on the PyTorch HPO models.Model Prediction of Training Time

Name Metrics DNNPerf BiRNN ARNN MLP GBDT BRP-NAS

OverallMRE 18.491 105.284 93.069 68.779 112.871 30.849RMSE 26.5 47.8 48.1 47.0 54.9 73.0

LeNetMRE 22.711 104.180 90.075 36.515 130.188 12.757

RMSE 20.7 22.2 21.9 21.6 20.3 21.6

ResNet-V1MRE 11.407 69.028 63.477 53.249 69.835 18.401RMSE 18.5 28.1 27.6 26.8 27.2 27.3

Inception-V3MRE 11.194 8.655 8.615 10.744 14.188 66.022RMSE 32.5 60.3 60.8 58.9 76.4 130.3

VGGMRE 39.628 248.064 199.222 230.410 250.220 46.494RMSE 42.8 83.2 84.0 83.2 93.2 96.3

Model Prediction of GPU Memory Consumption

Name Metrics DNNPerf BiRNN ARNN MLP GBDT BRP-NAS

OverallMRE 14.555 59.539 51.818 14.883 17.533 44.571RMSE 1.221 3.038 2.792 1.202 0.826 3.353

LeNetMRE 35.492 74.061 65.978 36.854 17.452 53.682RMSE 2.240 4.164 4.082 2.183 0.616 3.104

ResNet-V1MRE 4.135 26.512 17.373 4.490 4.856 8.882RMSE 0.605 2.981 1.976 0.595 0.622 1.137

Inception-V3MRE 7.459 7.642 15.528 9.647 8.980 53.888RMSE 0.748 0.753 1.071 0.922 0.917 4.040

VGGMRE 11.605 41.537 40.486 11.021 10.570 92.737RMSE 1.198 3.140 3.347 1.091 0.974 5.561

within 25% with only 498 training samples. The results also indicatethat DNNPerf can be adapted to a new deployment environmentat a reasonable cost.

6.3 Generality of DNNPerfOur GNN-based approach is general and can be adapted to other DLframeworks such as PyTorch [45] and other devices such as NVIDIAP100. To evaluate the generality of DNNPerf, we collect a total of5,064 HPO (Hyperparameter Optimization) model configurationsimplemented with PyTorch v1.5.1 on NVIDIA P100 GPU.

Because the framework implementation and hyperparameters(e.g., the padding) of PyTorch are different from those of TensorFlow,we apply Batch Normalization [31] to normalize the data beforethe MLP layer of DNNPerf. We also enhance the input tensor sizefeature to prevent the batch size information from disappearing.The number of message-passing rounds of ANEE is set to 2 and thelearning rate is set to 0.001.

The HPO dataset includes the configurations of six real-world DLmodels (AlexNet [37], LeNet [40], ResNet-V1 [28], VGG-11, VGG-16,VGG-19 [60], and Inception-V3 [61]). The hyperparameter valuecombinations are described in Section 4.1. We randomly split thefull dataset for training (80%) and test (20%). We retrain DNNPerf(for 200 epochs) and baselines over the new dataset.

The experimental results in Table 10 show that the overall MRE/RMSE improvement over the baseline approaches ranges from12.358%/20.5 to 86.793%/46.5 in predicting the training time, andfrom 0.328%/1.571 to 44.984%/2.132 in predicting the GPU memoryconsumption. These results demonstrate that DNNPerf is generalto diverse framework and device options.

6.4 Stability of Training Time Across IterationsPractical model training usually lasts many iterations. Since it isan iterative process, the training time across iterations should beidentical provided a model has no control-flow operators or dy-namic structural changes (Section 3.1). To understand such stability,we actually conducted 16 experiments on the VGG, ResNet-V1,Inception-V3, and LSTM models under both TensorFlow and Py-Torch. We trained every model configuration for an epoch of 1,000iterations. From the collected data of each iteration (excluding thewarm-up ones), we notice that the traing time across iterations isfairly stable, with a relative standard error (i.e., standard error [69]divided by the mean) ranging from 0.24% to 1.61%. Therefore, al-though DNNPerf predicts the training time per iteration, we canstill derive the total training time by multiplying the iteration count.

6.5 Threats to ValidityThreats to Internal Validity: We carefully examine the frameworksource code to extract many performance-related factors as thefeatures, such as the tensor size, computation cost (FLOPs) of theoperator, and device bandwidth (Table 1 and Table 2). However,there exist hidden factors that cause some fluctuations in the run-time performance. For example, the proprietary NVIDIA CUDA,cuDNN, or cuBLAS APIs used by the operator implementationcould use temporary, unpredictable GPU memory called workspaceto improve the runtime performance. We mitigated this threat by re-fining the feature extraction after carefully referring to the NVIDIAdevelopment documentation, profiling the APIs using nvprof, andanalyzing the framework runtime logs.

Threats to External Validity: In reality, there could exist manydifferent kinds of DL models and fairly large configuration spaces,which may reduce the effectiveness of DNNPerf. We mitigated thisthreat by: (1) enlarging the training set with more diverse neuralarchitectures and hyperparameter combinations; (2) supportingmore types of DL operator. The experimental results confirm thatDNNPerf is generally effective, even for configurations of unseenDL models (Section 5.2). Another threat is that we only collectand evaluate configurations of TensorFlow models but there areother DL frameworks such as PyTorch and MXNet [6]. However,since the frameworks take the same abstraction to represent modelsand adopt similar runtime implementations, we believe that ourapproach is general and DNNPerf can be adapted to support otherframeworks.

7 RELATEDWORKPerformance prediction for configurable systems. Many re-searchers focus on predicting the performance of configurable sys-tems in a deployment environment [24–26, 51, 56–58, 63, 76]. Forexample, DeepPerf [26] uses less training data to train a deep sparseneural network but still achieves much higher prediction accuracy.It treats a configurable system as a black box and ignores its internalmechanism because the system is very large and complex. Instead,our work carefully analyses both the DL model and the frameworkto extract performance-related features as much as possible.

Page 11: Runtime Performance Prediction for Deep Learning Models

Runtime Performance Prediction for Deep Learning Models with Graph Neural Network

Performance prediction for DL models. Recently, there issome work on predicting the runtime performance of DL mod-els [18, 19, 33, 41, 49, 73]. For example, Paleo [49] estimates the exe-cution time from FLOPs, and DNNMem [19] pre-builds an analyticmodel for the GPU memory consumption estimation. However,these analytic approaches require lots of hand-draft efforts andare specific to certain tasks. The work of [73] predicts the GPUutilization from the FLOPS, input data size, and number of con-volutional layers, but it ignores many other affecting factors suchas the neural architecture, operator execution order, I/O cost, etc.Predicting the learning performance (e.g., accuracy) of DL modelshas also attracted interests in the AutoML community [4, 10, 16–18, 20, 36, 65, 71]. BRP-NAS “proposes a GCN-based predictor forthe end-to-end latency” [18] using the operator type and computa-tion graph structure as features. However, its generalization abilityof cross-model prediction is low. PPP-Net employs a RecurrentNeural Network (RNN) to predict the accuracy from the neuralarchitecture, which “avoids time-consuming training to obtain trueaccuracy but with a slight drawback of regression error” [17]. Com-paring to the prior work, DNNPerf captures not only operator-levelfeatures but also the computation graph information and hiddenfactors within the framework. Our general learning-based approachreduces hand-crafted efforts and achieves better prediction results.

8 CONCLUSIONIn this paper, we have presented DNNPerf, a novel runtime perfor-mance prediction tool for deep learning models. DNNPerf adoptsa GNN-based approach and systematically explores performance-related features derived from the semantics of the computationgraph and hidden factors within the framework. Our experimentsdemonstrate that DNNPerf accurately predicts the GPU memoryconsumption and training time. DNNPerf is also effective and ro-bust to the choices of hyperparameters and neural architectures,even to unseen models.

REFERENCES[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey

Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Man-junath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray,Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, YuanYu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Ma-chine Learning. In 12th USENIX Symposium on Operating Systems Design andImplementation (OSDI 16). USENIX Association, Savannah, GA, 265–283. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi

[2] Elvira Albert, Samir Genaim, and Miguel Gómez-Zamalloa. 2010. ParametricInference of Memory Requirements for Garbage Collected Languages. In Proceed-ings of the 2010 International Symposium on Memory Management (Toronto,Ontario, Canada) (ISMM ’10). ACM, New York, NY, USA, 121–130. https://doi.org/10.1145/1806651.1806671

[3] Miltiadis Allamanis, Marc Brockschmidt, andMahmoud Khademi. 2018. Learningto Represent Programs with Graphs. In International Conference on LearningRepresentations. https://openreview.net/forum?id=BJOFETxR-

[4] Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. 2017. PracticalNeural Network Performance Prediction for Early Stopping. CoRR abs/1705.10823(2017). arXiv:1705.10823 http://arxiv.org/abs/1705.10823

[5] Víctor Braberman, Federico Fernández, Diego Garbervetsky, and Sergio Yovine.2008. Parametric Prediction of Heap Memory Requirements. In Proceedings ofthe 7th International Symposium on Memory Management (Tucson, AZ, USA)(ISMM ’08). ACM, New York, NY, USA, 141–150. https://doi.org/10.1145/1375634.1375655

[6] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, TianjunXiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A Flexibleand Efficient Machine Learning Library for Heterogeneous Distributed Systems.CoRR abs/1512.01274 (2015). arXiv:1512.01274 http://arxiv.org/abs/1512.01274

[7] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, HaichenShen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin,and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End OptimizingCompiler for Deep Learning. In 13th USENIX Symposium on Operating SystemsDesign and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 578–594. https://www.usenix.org/conference/osdi18/presentation/chen

[8] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, JohnTran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitivesfor Deep Learning. CoRR (2014). arXiv:1410.0759 http://arxiv.org/abs/1410.0759

[9] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau,Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning PhraseRepresentations using RNNEncoder–Decoder for StatisticalMachine Translation.In Proceedings of the 2014 Conference on Empirical Methods in Natural LanguageProcessing (EMNLP). Association for Computational Linguistics, Doha, Qatar,1724–1734. https://doi.org/10.3115/v1/D14-1179

[10] Daeyoung Choi, Hyunghun Cho, and Wonjong Rhee. 2018. On the Difficultyof DNN Hyperparameter Optimization Using Learning Curve Prediction. InTENCON 2018 - 2018 IEEE Region 10 Conference. 0651–0656. https://doi.org/10.1109/TENCON.2018.8650070

[11] François Chollet et al. 2015. Keras. https://keras.io.[12] Duc-Hiep Chu, Joxan Jaffar, and Rasool Maghareh. 2016. Symbolic Execution for

Memory ConsumptionAnalysis. In Proceedings of the 17th ACM SIGPLAN/SIGBEDConference on Languages, Compilers, Tools, and Theory for Embedded Systems(Santa Barbara, CA, USA) (LCTES 2016). ACM, New York, NY, USA, 62–71. https://doi.org/10.1145/2907950.2907955

[13] Marc Claesen and Bart De Moor. 2015. Hyperparameter Search in MachineLearning. CoRR abs/1502.02127 (2015). arXiv:1502.02127 http://arxiv.org/abs/1502.02127

[14] Li Deng. 2012. The MNIST Database of Handwritten Digit Images for MachineLearning Research [Best of the Web]. IEEE Signal Processing Magazine 29, 6(2012), 141–142. https://doi.org/10.1109/MSP.2012.2211477

[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding.In Proceedings of the 2019 Conference of the North American Chapter of the Asso-ciation for Computational Linguistics: Human Language Technologies, Volume 1(Long and Short Papers). Association for Computational Linguistics, Minneapolis,Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423

[16] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. 2015. Speedingup Automatic Hyperparameter Optimization of Deep Neural Networks by Ex-trapolation of Learning Curves. In Proceedings of the 24th International Confer-ence on Artificial Intelligence (Buenos Aires, Argentina) (IJCAI ’15). AAAI Press,3460–3468.

[17] Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun. 2018.PPP-Net: Platform-aware Progressive Search for Pareto-optimal Neural Archi-tectures. https://openreview.net/forum?id=B1NT3TAIM

[18] Lukasz Dudziak, Thomas Chau, Mohamed Abdelfattah, Royson Lee, HyejiKim, and Nicholas Lane. 2020. BRP-NAS: Prediction-based NAS usingGCNs. In Advances in Neural Information Processing Systems, H. Larochelle,M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran As-sociates, Inc., 10480–10490. https://proceedings.neurips.cc/paper/2020/file/768e78024aa8fdb9b8fe87be86f64745-Paper.pdf

[19] Yanjie Gao, Yu Liu, Hongyu Zhang, Zhengxian Li, Yonghao Zhu, Haoxiang Lin,and Mao Yang. 2020. Estimating GPU Memory Consumption of Deep LearningModels. In Proceedings of the 28th ACM Joint Meeting on European SoftwareEngineering Conference and Symposium on the Foundations of Software Engineering(Virtual Event, USA) (ESEC/FSE 2020). Association for Computing Machinery,New York, NY, USA, 1342–1352. https://doi.org/10.1145/3368089.3417050

[20] Matilde Gargiani, Aaron Klein, Stefan Falkner, and Frank Hutter. 2019. Probabilis-tic Rollouts for Learning Curve Extrapolation Across Hyperparameter Settings.CoRR abs/1910.04522 (2019). arXiv:1910.04522 http://arxiv.org/abs/1910.04522

[21] L. Gong andQ. Cheng. 2019. Exploiting Edge Features for GraphNeural Networks.In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).9203–9211. https://doi.org/10.1109/CVPR.2019.00943

[22] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MITPress. http://www.deeplearningbook.org.

[23] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. DeepAPI Learning. In Proceedings of the 2016 24th ACM SIGSOFT InternationalSymposium on Foundations of Software Engineering (Seattle, WA, USA) (FSE2016). Association for Computing Machinery, New York, NY, USA, 631–642.https://doi.org/10.1145/2950290.2950334

[24] Jianmei Guo, Krzysztof Czarnecki, Sven Apely, Norbert Siegmundy, and AndrzejWasowski. 2013. Variability-Aware Performance Prediction: A Statistical Learn-ing Approach. In Proceedings of the 28th IEEE/ACM International Conference onAutomated Software Engineering (Silicon Valley, CA, USA) (ASE ’13). IEEE Press,301–311. https://doi.org/10.1109/ASE.2013.6693089

[25] Jianmei Guo, Yang Dingyu, Norbert Siegmund, Sven Apel, Atrisha Sarkar, PavelValov, Krzysztof Czarnecki, Andrzej Wasowski, and Huiqun Yu. 2018. Data-efficient performance learning for configurable systems. Empirical Software

Page 12: Runtime Performance Prediction for Deep Learning Models

Yanjie Gao, Xianyu Gu, Hongyu Zhang, Haoxiang Lin, and Mao Yang

Engineering 23 (06 2018). https://doi.org/10.1007/s10664-017-9573-6[26] H. Ha and H. Zhang. 2019. DeepPerf: Performance Prediction for Configurable

Software with Deep Sparse Neural Network. In 2019 IEEE/ACM 41st InternationalConference on Software Engineering (ICSE). 1095–1106.

[27] Simon Haykin. 1994. Neural Networks: A Comprehensive Foundation. PrenticeHall.

[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep ResidualLearning for Image Recognition. In 2016 IEEE Conference on Computer Vision andPattern Recognition (CVPR). 770–778. https://doi.org/10.1109/CVPR.2016.90

[29] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory.Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

[30] J J Hopfield. 1982. Neural networks and physical systems with emergentcollective computational abilities. Proceedings of the National Academyof Sciences 79, 8 (1982), 2554–2558. https://doi.org/10.1073/pnas.79.8.2554arXiv:https://www.pnas.org/content/79/8/2554.full.pdf

[31] Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: AcceleratingDeep Network Training by Reducing Internal Covariate Shift. In Proceedings ofthe 32nd International Conference on International Conference onMachine Learning- Volume 37 (Lille, France) (ICML ’15). JMLR.org, 448–456.

[32] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016.Summarizing Source Code using a Neural Attention Model. In Proceedings ofthe 54th Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers). Association for Computational Linguistics, Berlin, Germany,2073–2083. https://doi.org/10.18653/v1/P16-1195

[33] Daniel Justus, John Brennan, Stephen Bonner, and Andrew Stephen McGough.2018. Predicting the Computational Cost of Deep Learning Models. In 2018 IEEEInternational Conference on Big Data (Big Data). 3873–3882. https://doi.org/10.1109/BigData.2018.8622396

[34] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Opti-mization. In 3rd International Conference on Learning Representations, ICLR 2015,San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengioand Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980

[35] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification withGraph Convolutional Networks. In International Conference on Learning Repre-sentations (ICLR).

[36] Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter.2016. Bayesian Neural Networks for Predicting Learning Curves. http://bayesiandeeplearning.org/2016/papers/BDL_38.pdf

[37] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classi-fication with Deep Convolutional Neural Networks. In Proceedings of the 25thInternational Conference on Neural Information Processing Systems - Volume 1(Lake Tahoe, Nevada) (NIPS ’12). Curran Associates Inc., Red Hook, NY, USA,1097–1105.

[38] Rasmus Munk Larsen and Tatiana Shpeisman. 2019. TensorFlow Graph Opti-mizations.

[39] Yann LeCun, Bernhard Boser, John Denker, Donnie Henderson, R. Howard,Wayne Hubbard, and Lawrence Jackel. 1990. Handwritten Digit Recognitionwith a Back-Propagation Network. In Advances in Neural Information ProcessingSystems, D. Touretzky (Ed.), Vol. 2. Morgan-Kaufmann. https://proceedings.neurips.cc/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf

[40] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324. https://doi.org/10.1109/5.726791

[41] Ying-Chiao Liao, Chuan-Chi Wang, Chia-Heng Tu, Ming-Chang Kao, Wen-YewLiang, and Shih-Hao Hung. 2020. PerfNetRT: Platform-Aware PerformanceModeling for Optimized Deep Neural Networks. In 2020 International ComputerSymposium (ICS). 153–158. https://doi.org/10.1109/ICS51289.2020.00039

[42] Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomás Kociský, An-drew W. Senior, Fumin Wang, and Phil Blunsom. 2016. Latent Predictor Net-works for Code Generation. CoRR abs/1603.06744 (2016). arXiv:1603.06744http://arxiv.org/abs/1603.06744

[43] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. 1999. BoostingAlgorithms as Gradient Descent. In Proceedings of the 12th International Con-ference on Neural Information Processing Systems (Denver, CO) (NIPS ’99). MITPress, Cambridge, MA, USA, 512–518. https://proceedings.neurips.cc/paper/1999/file/96a93ba89a5b5c6c226e49b88973f46e-Paper.pdf

[44] Microsoft. 2022. ND-series Virtual Machines. https://docs.microsoft.com/en-us/azure/virtual-machines/nd-series.

[45] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gre-gory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga,Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison,Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, andSoumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance DeepLearning Library. InAdvances in Neural Information Processing Systems 32, Vol. 32.Curran Associates, Inc., 8024–8035. https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf

[46] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel,Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, RonWeiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau,Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn:Machine Learning in Python. Journal of Machine Learning Research 12, 85 (2011),2825–2830. http://jmlr.org/papers/v12/pedregosa11a.html

[47] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018. EfficientNeural Architecture Search via Parameters Sharing. In Proceedings of the 35thInternational Conference on Machine Learning (Proceedings of Machine LearningResearch, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 4095–4104.https://proceedings.mlr.press/v80/pham18a.html

[48] PyTorch. 2020. The topological sorting algorithm for the graph transforma-tion subsystem. https://github.com/pytorch/pytorch/blob/v1.5.1/caffe2/core/nomnigraph/include/nomnigraph/Graph/TopoSort.h#L26.

[49] Hang Qi, Evan R. Sparks, and Ameet Talwalkar. 2017. Paleo: A PerformanceModel for Deep Neural Networks. In Proceedings of ICLR.

[50] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1988. LearningRepresentations by Back-Propagating Errors. MIT Press, Cambridge, MA, USA,696–699.

[51] Atrisha Sarkar, Jianmei Guo, Norbert Siegmund, Sven Apel, and Krzysztof Czar-necki. 2015. Cost-Efficient Sampling for Performance Prediction of ConfigurableSystems (T). In 2015 30th IEEE/ACM International Conference on Automated Soft-ware Engineering (ASE). 342–352. https://doi.org/10.1109/ASE.2015.45

[52] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, andGabriele Monfardini. 2009. The Graph Neural Network Model. IEEE TransactionsonNeural Networks 20, 1 (2009), 61–80. https://doi.org/10.1109/TNN.2008.2005605

[53] Scikit-learn. 2022. The sklearn.preprocessing.MinMaxScaler API. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html.

[54] Scikit-learn. 2022. The sklearn.preprocessing.StandardScaler API.https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.

[55] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, andYann LeCun. 2014. OverFeat: Integrated Recognition, Localization and Detectionusing Convolutional Networks. In International Conference on Learning Represen-tations (ICLR 2014). CBLS. https://openreview.net/forum?id=Hq5MgBFOP62-X

[56] Norbert Siegmund, Alexander Grebhahn, Sven Apel, and Christian Kästner. 2015.Performance-Influence Models for Highly Configurable Systems. In Proceedingsof the 2015 10th Joint Meeting on Foundations of Software Engineering (Bergamo,Italy) (ESEC/FSE 2015). Association for Computing Machinery, New York, NY,USA, 284–294. https://doi.org/10.1145/2786805.2786845

[57] Norbert Siegmund, Sergiy S. Kolesnikov, Christian Kästner, Sven Apel, DonBatory, Marko Rosenmüller, and Gunter Saake. 2012. Predicting performance viaautomated feature-interaction detection. In 2012 34th International Conference onSoftware Engineering (ICSE). 167–177. https://doi.org/10.1109/ICSE.2012.6227196

[58] Norbert Siegmund, Marko Rosenmuller, Christian Kastner, Paolo G. Giarrusso,Sven Apel, and Sergiy S. Kolesnikov. 2011. Scalable Prediction of Non-functionalProperties in Software Product Lines. In 2011 15th International Software ProductLine Conference. 160–169. https://doi.org/10.1109/SPLC.2011.20

[59] N. Silberman and S. Guadarrama. 2016. TensorFlow-Slim image classificationmodel library. https://github.com/tensorflow/models/tree/master/research/slim/nets.

[60] Karen Simonyan and Andrew Zisserman. 2015. Very Deep ConvolutionalNetworks for Large-Scale Image Recognition. In 3rd International Conferenceon Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015,Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1409.1556

[61] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbig-niew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision.CoRR abs/1512.00567 (2015). arXiv:1512.00567 http://arxiv.org/abs/1512.00567

[62] TensorFlow. 2019. The tf.layers.Conv2D API. https://github.com/tensorflow/docs/blob/r1.13/site/en/api_docs/python/tf/layers/Conv2D.md.

[63] Eno Thereska, Bjoern Doebel, Alice X. Zheng, and Peter Nobel. 2010. PracticalPerformance Models for Complex, Popular Applications. In Proceedings of theACM SIGMETRICS International Conference on Measurement and Modeling ofComputer Systems (New York, New York, USA) (SIGMETRICS ’10). Associationfor Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/1811039.1811041

[64] Leena Unnikrishnan, Scott D. Stoller, and Yanhong A. Liu. 2000. AutomaticAccurate Stack Space and Heap Space Analysis for High-Level Languages. TechnicalReport. Indiana University.

[65] Thomas Unterthiner, Daniel Keysers, Sylvain Gelly, Olivier Bousquet, andIlya Tolstikhin. 2021. Predicting Neural Network Accuracy from Weights.arXiv:2002.11448 [stat.ML]

[66] Petar Velic̆ković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, PietroLiò, and Yoshua Bengio. 2018. Graph Attention Networks. In International Confer-ence on Learning Representations. https://openreview.net/forum?id=rJXMpikCZ

Page 13: Runtime Performance Prediction for Deep Learning Models

Runtime Performance Prediction for Deep Learning Models with Graph Neural Network

[67] Ingrid M. Verbauwhede, Chris J. Scheers, and Jan M. Rabaey. 1994. MemoryEstimation for High Level Synthesis. In Proceedings of the 31st Annual DesignAutomation Conference (San Diego, California, USA) (DAC ’94). ACM, New York,NY, USA, 143–148. https://doi.org/10.1145/196244.196313

[68] MinjieWang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, JinjingZhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng Guo, Hao Zhang, Haibin Lin,Junbo Zhao, Jinyang Li, Alexander J. Smola, and Zheng Zhang. 2019. DeepGraph Library: Towards Efficient and Scalable Deep Learning on Graphs. CoRRabs/1909.01315 (2019). arXiv:1909.01315 http://arxiv.org/abs/1909.01315

[69] Wikipedia. 2021. Standard error — Wikipedia, The Free Encyclopedia. http://en.wikipedia.org/w/index.php?title=Standard%20error&oldid=1064719702.

[70] Wikipedia contributors. 2021. One-hot — Wikipedia, The Free Encyclopedia.https://en.wikipedia.org/w/index.php?title=One-hot&oldid=1001305138 [Online;accessed 19-January-2021].

[71] YunfengXin, Chengzhe Xu, andHangyi Zhao. 2019. Predicting Future Performanceof Convolutional Neural Networks in Early Training Stages. Technical Report.

[72] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, AndreasStolcke, Dong Yu, and Geoffrey Zweig. 2016. Achieving Human Parity in Con-versational Speech Recognition. CoRR abs/1610.05256 (2016). arXiv:1610.05256http://arxiv.org/abs/1610.05256

[73] Gingfung Yeung, Damian Borowiec, Adrian Friday, Richard Harper, and Pe-ter Garraghan. 2020. Towards GPU Utilization Prediction for Cloud DeepLearning. In 12th USENIX Workshop on Hot Topics in Cloud Computing (Hot-Cloud 20). USENIX Association. https://www.usenix.org/conference/hotcloud20/presentation/yeung

[74] Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and XudongLiu. 2019. A Novel Neural Source Code Representation Based on AbstractSyntax Tree. In Proceedings of the 41st International Conference on SoftwareEngineering (Montreal, Quebec, Canada) (ICSE ’19). IEEE Press, 783–794. https://doi.org/10.1109/ICSE.2019.00086

[75] Ru Zhang, Wencong Xiao, Hongyu Zhang, Yu Liu, Haoxiang Lin, and MaoYang. 2020. An Empirical Study on Program Failures of Deep Learning Jobs. InProceedings of the 42nd International Conference on Software Engineering (Seoul,Republic of Korea) (ICSE ’20). Association for Computing Machinery, New York,NY, USA, 1159–1170. https://doi.org/10.1145/3377811.3380362

[76] Yi Zhang, Jianmei Guo, Eric Blais, and Krzysztof Czarnecki. 2015. PerformancePrediction of Configurable Software Systems by Fourier Learning. In Proceedingsof the 30th IEEE/ACM International Conference on Automated Software Engineering(Lincoln, Nebraska) (ASE ’15). IEEE Press, 365–373. https://doi.org/10.1109/ASE.2015.15