
Design Automation for Efficient Deep Learning Computing

Song Han, Han Cai, Ligeng Zhu, Ji Lin, Kuan Wang, Zhijian Liu, Yujun Lin
Massachusetts Institute of Technology

{songhan, hancai, ligeng, jilin, kuanwang, zhijian, yujunlin}@mit.edu

Abstract—Efficient deep learning computing requires algorithm and hardware co-design to enable specialization: we usually need to change the algorithm to reduce memory footprint and improve energy efficiency. However, the extra degree of freedom from the algorithm makes the design space much larger: it's not only about designing the hardware but also about how to tweak the algorithm to best fit the hardware. Human engineers can hardly exhaust the design space by heuristics; doing so is labor intensive and sub-optimal. We propose design automation techniques for efficient neural networks. We investigate automatically designing specialized fast models, automated channel pruning, and automated mixed-precision quantization. We demonstrate that such learning-based, automated design achieves superior performance and efficiency compared with rule-based human design. Moreover, we shorten the design cycle by 200× compared with previous work, so that we can afford to design specialized neural network models for different hardware platforms.

Index Terms—AutoML, Neural Architecture Search, Channel Pruning, Mixed-Precision, Quantization, Specialization, Efficient Deep Learning.


1 INTRODUCTION

Algorithm and hardware co-design plays an important role in efficient deep learning computing. Unlike optimizing on the SPEC2006 benchmark, where we can treat the algorithm as a black box, there's plenty of room at the algorithm level to improve the hardware efficiency of deep learning. We should open the box and explore model optimization techniques. The benefit usually comes from memory saving and locality improvement. For example, model compression techniques [1], including pruning and quantization, can drastically reduce the memory footprint and save energy consumption. Another example is small model design. SqueezeNet [2] and MobileNet [3] have only 4.8MB/4.2MB of model size, which can fit in on-chip SRAM and improve locality.

However, efficient model design and compression have a large design space. Many different neural network architectures can lead to similar accuracy but drastically different hardware efficiency. This space is difficult to exhaust with rule-based heuristics, since there is a shortage of deep learning and hardware experts to hand-tune each model to make it run fast. There is a pressing need to systematically study how to design efficient neural networks under hardware constraints. We propose hardware-centric AutoML techniques that can automatically design neural networks that are hardware efficient [4, 5, 6]. Such joint optimization is systematic and transfers well between tasks. It requires fewer engineering efforts while designing better neural networks at low cost.

We explore three aspects of neural network design automation (Figure 1): automated design of specialized models, automated channel pruning, and automated mixed-precision quantization. Each aspect is summarized as follows.

There is plenty of specialized hardware for neural networks, but little research has been done on specializing the neural network architecture for a given hardware architecture (the reverse specialization). A specialized model has several advantages: it can fully utilize the parallelism of the target hardware (e.g., fitting the channel size to the PE size). Besides, a specialized model can fully utilize the on-chip buffer and improve locality and reuse. Specialization can also match the model's computation intensity with the hardware's roofline model. However, designing a specialized neural network architecture used to be difficult. First, there are limited heuristics. Second, the computation cost used to be prohibitive: even searching a model on the CIFAR-10 dataset takes 10^4 GPU hours [7, 8]. We cut the search cost by two orders of magnitude (actually more than that, since we directly search on ImageNet). The search cost is reduced by two techniques: path-level pruning and path-level binarization, which save GPU hours and GPU memory. Cutting the search cost enables us to design a specialized model for the target task and target hardware. On a mobile phone, our searched model [4] runs 1.8× faster than the best human-designed model [9].

After designing a specialized model, compression and pruning are effective techniques to further reduce the memory footprint [1]. Conventional model compression techniques rely on hand-crafted heuristics and rule-based policies that require domain experts to explore the large design space. We propose an automated design flow that leverages reinforcement learning to find the best model compression policy. This learning-based compression policy outperforms the conventional rule-based compression policy by achieving a higher compression ratio, better preserving the accuracy, and freeing human labor. We applied this automated, push-the-button compression pipeline to MobileNet and achieved 1.81× speedup of measured inference latency on an Android phone and 1.43× speedup on the Titan XP GPU, with only 0.1% loss of ImageNet Top-1 accuracy.



Fig. 1. Design automation for (a) model specialization, (b) channel pruning, and (c) mixed-precision quantization.

The last step is automated mixed-precision quantization. Emerging DNN hardware accelerators begin to support flexible bitwidth (1-8 bits), which raises a great challenge: finding the optimal bitwidth for each layer requires domain experts to explore the vast design space, trading off among accuracy, latency, energy, and model size. Conventional quantization algorithms ignore the different hardware architectures and quantize all the layers in a uniform way. We introduce an automated design flow for model quantization, and we take the hardware accelerator's feedback into the design loop. Our framework can specialize the quantization policy for different hardware architectures. It can effectively reduce the latency by 1.4-1.95× and the energy consumption by 1.9× with negligible loss of accuracy compared with fixed-bitwidth (8-bit) quantization.

2 AUTOMATED MODEL SPECIALIZATION

In order to fully utilize the hardware resources, we propose to search a specialized CNN architecture for the given hardware. The model is compact and runs fast. We start with a large design space (Figure 1(a)) that includes many candidate paths, and we learn which one is the best by gradient descent, rather than hand-picking with rule-based heuristics. Instead of just learning the weight parameters, we jointly learn the architecture parameters (shown in red in Figure 1(a)). The architecture parameter is the probability of choosing each path. The search space for each block i consists of many choices:

• ConvOp: mobile inverted bottleneck conv [9] with various kernel sizes and expansion ratios

  – Kernel size: {3×3, 5×5, 7×7}
  – Expansion ratio: {3, 6}

• ZeroOp: if ZeroOp is chosen at the i-th block, the block is skipped.

Therefore, the number of possible architectures in the design space is (3 × 2 + 1)^N = 7^N, where the 3 × 2 term counts the ConvOp choices, the +1 counts the ZeroOp, and N is the number of blocks (21 in our experiments).

Given the vast design space, it is infeasible to rely on domain experts to manually design the CNN model for each hardware platform, so we need to employ automatic architecture design techniques.
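To make the bookkeeping concrete, the sketch below (illustrative Python, not the authors' code) enumerates the seven per-block candidates, using the mb<expansion>_<kernel> naming that also appears later in Eq. (2), and counts the resulting design space.

```python
# Illustrative sketch: enumerate the per-block candidate operations and
# count the design space size (7^N architectures for N blocks).
from itertools import product

KERNEL_SIZES = [3, 5, 7]       # mobile inverted bottleneck kernel sizes
EXPANSION_RATIOS = [3, 6]      # expansion ratios

def candidate_ops():
    """Return the 7 candidate choices for one block: 6 ConvOps + 1 ZeroOp."""
    ops = [f"mb{e}_{k}x{k}" for k, e in product(KERNEL_SIZES, EXPANSION_RATIOS)]
    ops.append("zero")         # ZeroOp: skip this block entirely
    return ops

N_BLOCKS = 21
ops = candidate_ops()
print(len(ops))                # 7 choices per block
print(len(ops) ** N_BLOCKS)    # 7^21 ≈ 5.6e17 possible architectures
```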

However, early reinforcement learning-based NAS methods [7, 8] are very expensive to run (e.g., 10^4 GPU hours), since they need to iteratively sample an architecture, train it from scratch, and update the meta-controller. It typically requires tens of thousands of networks to be trained to find a good neural network architecture.

We adopt a different approach to improve the efficiency of model specialization [4]. We first build a super network that comprises all candidate architectures. Concretely, it has a similar structure to a CNN model in the design space, except that each specific operation is replaced with a mixed operation that has n parallel paths. Each path in a mixed operation corresponds to a candidate operation o_i(·), and we introduce an architecture parameter α_i for each path to learn which paths are redundant and can thereby be pruned (i.e., path-level pruning).

In the forward step, to save GPU memory, we allow only one candidate path to actively reside in GPU memory. This is achieved by hard-thresholding the probability of each candidate path to either 0 or 1 (i.e., path-level binarization). As such, the output of a mixed operation is given as

x_l = Σ_i g_i · o_i(x_{l−1}),   (1)

where g_i is sampled according to the multinomial distribution derived from the architecture parameters, i.e., p_i = softmax(α_i; α) = exp(α_i) / Σ_j exp(α_j).

In the backward step, we update the weight parameters of active paths using standard gradient descent. Since the architecture parameters are not directly involved in the computational graph (Eq. 1), we use the gradient w.r.t. the binary gates to update the corresponding architecture parameters:

∂L/∂α_i = Σ_j (∂L/∂p_j) (∂p_j/∂α_i) ≈ Σ_j (∂L/∂g_j) (∂p_j/∂α_i).
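The following is a minimal PyTorch-style sketch of the mixed operation with path-level binarization described above. It is a simplified illustration, not the ProxylessNAS [4] release; the architecture-parameter update via the binary-gate gradient approximation is omitted for brevity.

```python
# Sketch of a mixed operation with path-level binarization (Eq. 1):
# only the sampled path is executed, so only one candidate resides in memory.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)                      # o_1 ... o_n
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))   # architecture parameters

    def forward(self, x):
        probs = F.softmax(self.alpha, dim=0)        # p_i = softmax(alpha_i)
        idx = torch.multinomial(probs, 1).item()    # sample one path (binary gate = 1)
        # All other gates are hard-thresholded to 0, so only ops[idx] runs.
        # The gradient w.r.t. alpha is approximated through the binary gates
        # (see the equation above); that update step is not shown here.
        return self.ops[idx](x)
```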

In order to specialize the model for hardware, we need to take the latency on the target hardware as a design reward. However, directly measuring the inference latency is (i) slow, (ii) of high variance due to different battery conditions and thermal throttling, and (iii) non-differentiable, so it cannot be directly optimized. To address these issues, we present our latency prediction model and hardware-aware loss.

To build the latency model, we pre-compute the latency F(·) of each operator with all possible inputs and query this lookup table during the search. The overall latency of the i-th block is the sum of the candidate operators' latencies, weighted by the architecture probabilities:

E[LAT_i] = α × F(mb3_3x3) + β × F(mb3_5x5) + σ × F(identity) + ... + ζ × F(mb6_7x7),

E[LAT] = Σ_{i=1}^{N} E[LAT_i].   (2)

Model                      Top-1   Top-5   GPU Latency
MobileNet-V2 [9]           72.0    91.0    6.1 ms
ResNet-34 [10]             73.3    91.4    8.0 ms
NASNet-A [8]               74.0    91.3    38.3 ms
MnasNet [11]               74.0    91.8    6.1 ms
Specialized model for GPU  75.1    92.5    5.1 ms

TABLE 1. ImageNet accuracy (%) and GPU latency (Tesla V100).

Then we combine the latency and training loss (e.g. cross-entropy loss) using the following formula

L = L_CE × α · log(E[LAT] / LAT_ref)^β,   (3)

where α and β are hyper-parameters controlling the trade-off between accuracy and latency, and LAT_ref is the target latency. Note that our formulation not only provides a fast estimate of the searched model's latency but also makes the search process fully differentiable.
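As an illustration of how the lookup-table latency estimate (Eq. 2) and the latency-aware loss (Eq. 3) can be computed, here is a minimal PyTorch-style sketch. The operator names follow Eq. 2, but the latency values and the hyper-parameter values are placeholders, and this is not the released implementation.

```python
# Sketch of the lookup-table latency model (Eq. 2) and latency-aware loss (Eq. 3).
import torch

# Pre-profiled per-operator latency in ms (values here are made-up placeholders).
LATENCY_LUT = {"mb3_3x3": 1.2, "mb3_5x5": 1.8, "mb6_7x7": 3.5, "identity": 0.0}

def expected_block_latency(probs, op_names):
    """E[LAT_i]: architecture-probability-weighted sum of operator latencies."""
    lut = torch.tensor([LATENCY_LUT[name] for name in op_names])
    return (probs * lut).sum()

def hardware_aware_loss(ce_loss, expected_latency, lat_ref, alpha=0.2, beta=0.6):
    """L = L_CE * alpha * log(E[LAT]/LAT_ref)**beta, transcribing Eq. 3.
    As written, the latency term assumes E[LAT] >= LAT_ref (model over budget);
    alpha and beta values here are arbitrary placeholders."""
    return ce_loss * alpha * torch.log(expected_latency / lat_ref) ** beta

# Usage: sum expected_block_latency over all N blocks to get E[LAT], then combine
# with the cross-entropy loss; gradients flow back into the architecture
# probabilities, which is what makes the search differentiable.
```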

We demonstrate the effectiveness of our model specialization on the ImageNet dataset with a CPU (Xeon E5-2640 v4), a GPU (Tesla V100), and a mobile phone (Google Pixel 1). We first search for a specialized CNN model for the mobile phone (Figure 2). Compared to MobileNet-V2 (the state-of-the-art human-engineered architecture), our model improves the top-1 accuracy by 2.6% while maintaining a similar latency. Under the same level of top-1 accuracy (around 74.6%), MobileNet-V2 has 143ms latency while ours has only 78ms (1.83× faster). Compared with the state-of-the-art auto-designed model, MnasNet [11], our model achieves 0.6% higher top-1 accuracy with slightly lower mobile latency. More remarkably, we reduced the search cost by 200×, from 40,000 GPU hours to only 200 GPU hours.

Table 1 reports the speedup on the GPU. Our method achieved superior performance compared to both human-designed and automatically searched architectures. Compared with general-purpose models, our specialized model improves the top-1 accuracy by 1.1%-3.1% while being 1.2×-7.5× faster. Table 2 compares the specialized models on CPU/GPU/Mobile. As expected, models specialized for the GPU do not run fast on the CPU and mobile phone, and vice versa. It is essential to learn specialized neural networks to cater to different hardware.

Fig. 2. The automatically designed specialized model consistently outperforms the human-designed MobileNet-V2 under various latency settings (up to 1.83× faster).

Model                   Top-1   GPU      CPU        Mobile
Specialized for GPU     75.1    5.1 ms   204.9 ms   124 ms
Specialized for CPU     75.3    7.4 ms   138.7 ms   116 ms
Specialized for Mobile  74.6    7.2 ms   164.1 ms   78 ms

TABLE 2. Hardware prefers specialized models. Models optimized for the GPU do not run fast on the CPU and mobile phone, and vice versa. Our method provides an efficient solution to search a specialized neural network architecture for a target hardware architecture, while cutting down the search cost by 200× compared with the state of the art [7, 11].

Our automated design flow discovered CNN architectures that were long dismissed as being too inefficient, but in fact they are very efficient. For instance, engineers have essentially stopped using 7×7 filters, because they are computationally more expensive than multiple smaller filters (one 7×7 layer has the same receptive field as three 3×3 layers, but carries 49 weights rather than 27). However, our AI-designed model found that using 7×7 filters is very efficient on GPUs. That is because GPUs have high parallelization, and invoking one large kernel call is more efficient than invoking multiple small kernel calls. This design choice goes against previous human thinking. The larger the search space, the more unknown things you can find. You don't know whether something will be better than past human experience. Let the automated design tool figure it out [12].

3 AUTOMATED CHANNEL PRUNING

Pruning [13] is widely used in model compression and acceleration. It is very important to find the optimal sparsity for each layer during pruning: pruning too much will hurt accuracy, while pruning too little will not achieve a high compression ratio. This used to be manually determined in previous studies [1]. Our goal is to automatically find the effective sparsity for each layer. We train a reinforcement learning agent to predict the best sparsity for a given hardware [5]. We evaluate the accuracy and FLOPs after pruning, and then update the agent by rewarding smaller, faster, and more accurate models.

Our automatic model compression (AMC) leverages reinforcement learning to efficiently search the pruning ratios (Figure 1(b)). The reinforcement learning agent receives an embedding state s_t of layer L_t from the environment and then outputs a sparsity ratio as action a_t. The layer is compressed with a_t (rounded to the nearest feasible fraction). Then the agent moves to the next layer L_{t+1} and receives state s_{t+1}. After finishing the final layer L_T, the reward (accuracy evaluated on the validation set) is returned to the agent.

                   Million  Top-1  Top-5  GPU                         Android
                   MAC      Acc.   Acc.   Latency  Speed              Latency   Speed             Memory
100% MobileNet     569      70.6%  89.5%  0.46 ms  2191 fps           123.3 ms  8.1 fps           20.1 MB
75% MobileNet      325      68.4%  88.2%  0.34 ms  2944 fps           72.3 ms   13.8 fps          14.8 MB
AMC (50% FLOPs)    285      70.5%  89.3%  0.32 ms  3127 fps (1.43×)   68.3 ms   14.6 fps (1.81×)  14.3 MB
AMC (50% Latency)  272      70.2%  89.2%  0.30 ms  3350 fps (1.53×)   63.3 ms   16.0 fps (1.95×)  13.2 MB

TABLE 3. AMC speeds up MobileNet. On the Google Pixel 1 CPU, AMC achieves 1.95× measured speedup with batch size one, while saving the memory by 34%. On the NVIDIA Titan XP GPU, AMC achieves 1.53× speedup with a batch size of 50.

Model          Policy                  FLOPs  ∆Acc (%)
MobileNet-V1   uniform (0.75-224) [3]  56%    -2.5
               AMC (ours)              50%    -0.4
               uniform (0.75-192) [3]  41%    -3.7
               AMC (ours)              40%    -1.7
MobileNet-V2   uniform (0.75-224) [9]  70%    -2.0
               AMC (ours)              70%    -1.0

TABLE 4. Learning-based automated model compression (AMC) outperforms rule-based model compression. Rule-based heuristics are sub-optimal.

With our framework, we are able to push beyond the expert-tuned limit of fine-grained model pruning. For ResNet-50 on ImageNet, we can push the compression ratio from 3.4× to 5× without loss of accuracy. With further investigation, we find that AMC automatically learns to prune 3×3 convolution kernels harder than 1×1 kernels, which is similar to human heuristics, since the latter are less redundant.

We also compare AMC with heuristic-based channel reduction methods on the modern efficient neural networks MobileNet [3] and MobileNet-V2 [9] (Table 4). Since these networks are already very compact, further compressing them is a convincing test. The easiest way to reduce the channels of a model is uniform channel shrinkage, i.e., using a width multiplier to uniformly reduce the channels of each layer by a fixed ratio. Both MobileNet and MobileNet-V2 report the performance of different multipliers and input sizes, and we compare our pruned results with models of the same computation. The settings are denoted as uniform (width multiplier - input size). Our method consistently outperforms the uniform baselines. Even for the current state-of-the-art efficient model design, MobileNet-V2, AMC can still improve its accuracy by 1.0% at the same computation.

Mobile inference acceleration has drawn people's attention in recent years. Not only can AMC optimize FLOPs and model size, it can also optimize the inference latency. For all mobile inference experiments, we use the TensorFlow Lite framework for timing evaluation, and our experiment platform is the Google Pixel 1. Models pruned to 0.5× FLOPs and 0.5× inference time are shown in Table 3. For the 0.5× FLOPs setting, we achieve 1.81× speedup on the Google Pixel 1 phone. For the 0.5× latency setting, we achieve 1.95× speedup, which is very close to the 2× target, showing that AMC can directly optimize inference time and achieve an accurate speedup ratio. On the GPU, we also achieve up to 1.5× speedup, which is less than on the mobile phone but still significant for an already very compact model. The smaller speedup is because a GPU has a higher degree of parallelism than a mobile phone.
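To summarize the layer-by-layer search loop of this section in code form, here is a schematic sketch. The names `agent`, `get_state`, `apply_pruning`, and `evaluate` are placeholders for the learner, the layer embedding, the channel-pruning step, and the validation pass; they are not the authors' released API.

```python
# Schematic AMC-style search episode: one sparsity action per layer,
# one terminal reward (validation accuracy of the pruned model).
def search_episode(agent, layers, get_state, apply_pruning, evaluate):
    states, actions = [], []
    for layer in layers:
        s_t = get_state(layer)             # embedding: layer type, shape, FLOPs, ...
        a_t = agent.act(s_t)               # sparsity ratio in [0, 1]
        apply_pruning(layer, a_t)          # prune channels (rounded to a feasible count)
        states.append(s_t)
        actions.append(a_t)
    reward = evaluate()                    # accuracy after finishing the final layer
    agent.update(states, actions, reward)  # e.g., an actor-critic policy update
    return reward
```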

4 AUTOMATED MIXED-PRECISION QUANTIZATION

Conventional quantization methods quantize every layer of the model to the same precision. Mixed-precision quantization is more flexible but suffers from a huge design space that is difficult to explore. Meanwhile, as demonstrated in Table 5, the quantization solution optimized for one hardware might not be optimal for another, which raises the demand for specialized policies for different hardware architectures and further increases the design space. Assuming the bitwidth is between 1 and 8 for both weights and activations, each layer has 8² choices. If we have M different neural network models, each with N layers, on H different hardware platforms, there are in total O(H × M × 8^{2N}) possible solutions. Rather than using rule-based heuristics, we propose an automated design flow that quantizes different layers with mixed precision. Our hardware-aware automated quantization (HAQ) [6] models the quantization task as a reinforcement learning problem. We use an actor-critic model to give the quantization policy (#bits per layer) (Figure 1(c)). The goal is not only high accuracy but also low energy and low latency.

An intuitive reward could be FLOPs or the model size. However, these proxy signals are indirect; they do not translate to latency or energy improvement, since cache locality, the number of kernel calls, and memory bandwidth all matter. Instead, we use direct latency and energy feedback from the hardware simulator. Such feedback enables our RL agent to learn the hardware characteristics of different layers: e.g., vanilla convolution has more data reuse and better locality, while depthwise convolution has less reuse and worse locality, which makes it memory bound.

In real-world applications, we have limited resource budgets (i.e., latency, energy, and model size). We would like to find the quantization policy with the best performance given the resource constraint. We encourage our agent to meet the computation budget by limiting the action space. After our RL agent gives actions {a_k} to all layers, we measure the amount of resources that will be used by the quantized model. The feedback is directly obtained from the hardware simulator.


                          Inference latency on
                          HW1        HW2        HW3
Best Q. policy for HW1    16.29 ms   85.24 ms   117.44 ms
Best Q. policy for HW2    19.95 ms   64.29 ms   108.64 ms
Best Q. policy for HW3    19.94 ms   66.15 ms   99.68 ms

TABLE 5. Inference latency of MobileNet-V1 [3] on three hardware architectures under different quantization policies. The quantization policy that is optimized for one hardware is not optimal for the others. This suggests that we need a specialized quantization solution for different hardware architectures. (HW1: spatial accelerator [14], HW2: edge accelerator [15], HW3: cloud accelerator [15], batch = 16.)

Fig. 3. Quantization policy under latency constraints for MobileNet-V1. (On the edge accelerator, depthwise layers are assigned fewer bits and pointwise layers more bits; on the cloud accelerator, depthwise layers are assigned more bits and pointwise layers fewer bits. The bottom panel shows the number of OPs per byte for each layer.)

Fig. 4. HAQ improves the roofline performance (GOps/s vs. OPs/byte) of the pointwise layers in MobileNet-V1, compared with fixed 8-bit quantization.

If the current policy exceeds our resource budget (on latency, energy, or model size), we sequentially decrease the bitwidth of each layer until the constraint is finally satisfied.
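The sketch below illustrates one plausible reading of this budget-enforcement step: after the agent proposes per-layer bitwidths, they are lowered pass by pass until the simulated cost meets the budget. The `measure_cost` callable stands in for the hardware-simulator query and is an assumption, not the authors' interface.

```python
# Illustrative budget enforcement: sequentially decrease per-layer bitwidths
# until the measured cost (latency, energy, or model size) meets the budget.
def enforce_budget(bitwidths, measure_cost, budget, min_bits=1):
    bitwidths = list(bitwidths)
    while measure_cost(bitwidths) > budget:
        reduced = False
        for i in range(len(bitwidths)):        # one sequential pass over the layers
            if bitwidths[i] > min_bits:
                bitwidths[i] -= 1
                reduced = True
                if measure_cost(bitwidths) <= budget:
                    return bitwidths
        if not reduced:
            break                               # every layer already at min_bits
    return bitwidths
```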

We applied HAQ to three different hardware architectures to show the importance of a specialized quantization policy. Inference on edge devices and cloud servers can be quite different, since (1) the batch size on the cloud servers is larger, and (2) the edge devices are usually limited in computation resources and memory bandwidth. We use an embedded FPGA (Xilinx Zynq-7020) as our edge device and a server FPGA (Xilinx VU9P) as our cloud device to compare the specialized quantization policies.

                    Edge Accelerator       Cloud Accelerator
          Bitwidths  Top-1   Latency        Top-1   Latency
PACT      4 bits     62.44   45.45 ms       61.39   52.15 ms
Ours      flexible   67.40   45.51 ms       66.99   52.12 ms
PACT      5 bits     67.00   57.75 ms       68.84   66.94 ms
Ours      flexible   70.58   57.70 ms       70.90   66.92 ms
PACT      6 bits     70.46   70.67 ms       71.25   82.49 ms
Ours      flexible   71.20   70.35 ms       71.89   82.34 ms
Original  8 bits     70.82   96.20 ms       71.81   115.84 ms

TABLE 6. Latency-constrained quantization on the edge and cloud accelerators on ImageNet.

Compared to fixed 8-bit quantization (PACT [16]), our automated quantization consistently achieved better accuracy under the same latency (see Table 6). With similar accuracy, HAQ can reduce the latency by 1.4-1.95× compared with the baseline.

Our agent gave specialized quantization policies for the edge and cloud accelerators (Figure 3). The policies are quite different on different hardware. For the activations, the depthwise convolution layers are assigned much lower bitwidth than the pointwise layers on the edge device, while on the cloud device the bitwidths of these two types of layers are similar to each other. For the weights, the bitwidths of these two types of layers are nearly the same on the edge, while on the cloud the depthwise convolution layers are assigned much higher bitwidth than the pointwise convolution layers.

We interpret the difference between the edge and cloud quantization policies using the roofline model. Operation intensity is defined as operations (MACs in neural networks) per DRAM byte accessed; a lower operation intensity indicates that the model suffers more from memory access. The bottom of Figure 3 shows the operation intensity (OPs per byte) of the convolution layers in MobileNet-V1. On the edge accelerator, which has much less memory bandwidth, our RL agent allocates fewer activation bits to the depthwise convolutions, since the activations dominate the memory access. On the cloud accelerator, which has more memory bandwidth, our agent allocates more bits to the depthwise convolutions and fewer bits to the pointwise convolutions to prevent them from being computation bound. Figure 4 shows the roofline model before and after HAQ. HAQ gives a more reasonable policy for allocating the bits of each layer and pushes all the points toward the upper-right corner, which is more efficient.
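As a back-of-the-envelope illustration of the operation-intensity argument, the sketch below compares a depthwise and a pointwise layer. The layer shape, the bitwidths, and the simplistic "each tensor touched once" traffic model are assumptions made for this example only, not the simulator's cost model.

```python
# Rough operation-intensity (MACs per DRAM byte) comparison of a depthwise
# and a pointwise convolution layer under a given weight/activation bitwidth.
def op_intensity(macs, weight_params, in_acts, out_acts, w_bits, a_bits):
    bytes_accessed = (weight_params * w_bits + (in_acts + out_acts) * a_bits) / 8
    return macs / bytes_accessed

H = W = 56; C = 128                      # hypothetical feature-map size / channel count
# depthwise 3x3: MACs = H*W*C*9, weights = C*9
dw = op_intensity(H*W*C*9, C*9, H*W*C, H*W*C, w_bits=4, a_bits=4)
# pointwise 1x1 (C -> C): MACs = H*W*C*C, weights = C*C
pw = op_intensity(H*W*C*C, C*C, H*W*C, H*W*C, w_bits=4, a_bits=4)
print(f"depthwise ~{dw:.1f} MACs/byte, pointwise ~{pw:.1f} MACs/byte")
# Depthwise layers end up with far lower intensity, i.e., they are memory bound,
# which is why their activation bits matter most on the bandwidth-limited edge device.
```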

Finally, we evaluate the transfer ability of our framework: we first train our agent on one network (MobileNet-V1) and then directly apply the agent to another network (MobileNet-V2) (see Table 7). The quantization policy transferred from MobileNet-V1 to MobileNet-V2 performs much better than the fixed-bitwidth baseline and is only slightly worse than the quantization policy directly searched for MobileNet-V2.


                         Bitwidth   Top-1   Latency
PACT                     4 bits     61.39   52.15 ms
Ours (search for V2)     flexible   66.99   52.12 ms
Ours (transfer from V1)  flexible   65.80   52.06 ms
PACT                     5 bits     68.84   66.94 ms
Ours (search for V2)     flexible   70.90   66.92 ms
Ours (transfer from V1)  flexible   69.90   66.93 ms

TABLE 7. Our RL agent generalizes well to different network architectures: the quantization policy transferred from MobileNet-V1 to MobileNet-V2 performs very close to the policy directly searched for MobileNet-V2. Both outperform the fixed-bitwidth baseline.

This experiment validates that our RL agent generalizes well to different network architectures. That is to say, given a new model that the agent has not seen before, it can utilize its past knowledge to give a decent quantization policy, saving the design cycle.

5 CONCLUSION

We present design automation techniques for efficient deep learning computing. There is plenty of room at the algorithm level to improve hardware efficiency, but the large design space makes it difficult for humans to exhaust. We covered three aspects of design automation: specialized model design, compression and pruning, and mixed-precision quantization. Such learning-based design automation outperformed rule-based heuristics. Our framework reveals that the optimal design policies on different hardware architectures are drastically different; therefore, specialization is important. We interpreted those design policies and believe the insights will inspire future software and hardware co-design for efficient deep learning computing.

REFERENCES

[1] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2016.

[2] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv preprint arXiv:1602.07360, 2016.

[3] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.

[4] H. Cai, L. Zhu, and S. Han, "ProxylessNAS: Direct neural architecture search on target task and hardware," in ICLR, 2019.

[5] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "AMC: AutoML for model compression and acceleration on mobile devices," in ECCV, 2018.

[6] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, "HAQ: Hardware-aware automated quantization with mixed precision," in CVPR, 2019.

[7] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in ICLR, 2017.

[8] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in CVPR, 2018.

[9] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in CVPR, 2018.

[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.

[11] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, "MnasNet: Platform-aware neural architecture search for mobile," arXiv preprint arXiv:1807.11626, 2018.

[12] M. News.

[13] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, 2015, pp. 1135-1143.

[14] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh, "Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in ISCA, 2018.

[15] Y. Umuroglu, L. Rasnayake, and M. Sjalander, "BISMO: A scalable bit-serial matrix multiplication overlay for reconfigurable computing," in FPL, 2018.

[16] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, "PACT: Parameterized clipping activation for quantized neural networks," arXiv, 2018.