
DATA-AWARE COMPUTATIONAL IMAGING ACCELERATION

by

Mostafa Mahmoud

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

The Edward S. Rogers Sr. Department of Electrical & Computer Engineering
University of Toronto

© Copyright 2019 by Mostafa Mahmoud


Abstract

Data-Aware Computational Imaging Acceleration

Mostafa Mahmoud
Doctor of Philosophy

The Edward S. Rogers Sr. Department of Electrical & Computer Engineering
University of Toronto

2019

Computational Imaging (CI) is an essential front-line building block for a wide range of applications including computer vision, advanced driver-assistance systems, medical imaging, photography and robotics. Such applications process large amounts of sensory data and need to do so either in real time or within user-interactive latency constraints. In addition, these applications typically run on cost- and energy-constrained devices with limited compute and memory resources. CI algorithms perform sophisticated processing of the noisy sensor output to compensate for the imperfections of the image acquisition hardware, to overcome challenging imaging conditions and to reveal new features of the captured image.

This dissertation analyzes the performance of recent CI implementations on high-end commodity processors, showing that practical deployment is beyond the reach of such architectures. Accordingly, the dissertation identifies unique characteristics of the data processed by these algorithms and exploits them to propose efficient data-aware CI accelerators. Specifically, this dissertation accelerates two CI implementation approaches:

1. Analytical solutions, represented by Block-Matching and 3D filtering (BM3D), a state-of-the-art algorithm used in a wide set of core CI tasks including image denoising, sharpening, deblurring and super-resolution. We propose IDEAL, an image denoising/sharpening hardware accelerator. IDEAL incorporates Matches Reuse (MR), a novel software-hardware co-optimization that exploits typical image content to significantly reduce the computations needed without loss in output quality. IDEAL is orders of magnitude faster and more energy efficient than high-end CPUs and GPUs.

2. Machine learning-based implementations, represented by a set of recent deep neural networks (DNNs) performing several CI tasks. We characterize the strong spatiotemporal correlation between adjacent values processed by these DNNs. Accordingly, we introduce Differential Convolution, a convolution processing technique that operates on the deltas of the values rather than on the absolute values. We propose Diffy, a novel DNN accelerator that incorporates Differential Convolution, translating the characterized correlation into improved performance, reduced on- and off-chip storage and communication, and ultimately, improved energy efficiency.

The proposed hardware accelerators enable practical deployment and achieve the performance needed for user-interactive, real-time and continuous vision applications. The designs are shown to outperform state-of-the-art accelerators and commodity architectures.


To my beloved parents, Mahmoud and Amany,

my supporting sisters, Marwa and Menna,

and my lovely wife, Ann, and my little one, Cyrine


Acknowledgements

“O my Lord! so inspire and enable me that I may be grateful for Thy favours, which Thou hast bestowed upon me and upon my parents, and that I may work the righteousness that will please Thee: And admit me, by Thy Grace, to the ranks of Thy righteous Servants.” The Holy Qur’an [27:19].

All praise is due to Allah, the Lord of the worlds, the Most Merciful, the One by Whose grace good deeds are completed. This work would not have been accomplished without the patience, strength, and grace Allah bestowed upon me. The Prophet Mohamed, peace and blessings be upon Him, said, “The most grateful to Allah are those who are grateful to people.” I would like to express my sincere appreciation and gratitude to all those who supported and helped me during my PhD journey. Hopefully, I will not forget a name.

First and foremost, I would like to thank my supervisor Prof. Andreas Moshovos. Besides his insightful researcher mindset, Andreas has an extremely nice personality that shows in all his interactions and makes him an extraordinary supervisor to work with. On the technical side, Andreas guided me while improving my problem-solving, critical thinking and research skills. His influence did not stop there, as he also helped me improve my academic writing and presentation skills. He generously iterated with me over all my writings to show me how to better deliver my point. His positive attitude and wise guidance have always pushed me forward, while his sense of humor helped alleviate the tough times. Andreas not only conducts high-quality research, he makes good researchers as well. Thanks a lot, Andreas!

I would like to express my gratitude to my committee members for their insightful feedback on my research as well as on this dissertation: Prof. Natalie Enright Jerger, Prof. Paul Chow and Prof. Gennady Pekhimenko from the University of Toronto, and my external appraiser Prof. Warren Gross from McGill University. I am honored to work with amazing colleagues in the computer architecture research group (AENAO): Myrto Papadopoulou, Islam Atta, Jorge Albericio, Patrick Judd, Sayeh Sharify, Alberto Delmas, Kevin Su, Dylan Malone Stuart, Milos Nikolic, Zissis Poulos, Bojian Zheng, Omar Awad, Isak Edo Vivancos, Ali Hadi Zadeh, Meysam Roodi and Jose Munoz Cepillo (Toño). I appreciate their continuous help, support and feedback in our group meetings. I would also like to thank Shehab Elsayed, Karthik Ganesan, Mario Badr and Joshua San Miguel from Prof. Natalie’s group for their feedback and help with the cluster.

Many thanks to my collaborator Dr. Felix Heide at Stanford University for the immense help, insightful expert guidance and the many reference letters. I thank my industry collaborators Jonathan Assouline, Paul Boucher and Emmanuel Onzon at Algolux. Their expertise helped, enriched and guided my early research steps.

Special thanks go to the administrative staff at the University of Toronto, including Kelly Chan, Jayne Leake, Shawn Mitchell and Darlene Gorzo, who helped with all the logistics and paperwork to ease my PhD journey.

My research has been funded by an NSERC Engage Grant, an NSERC Discovery Grant, the Connaught International Scholarship, the Electrical and Computer Engineering Fellowship and the Rogers Scholarship.

I believe that if you have a great, supportive family, there is almost no chance you can fail to reach your goals. My wife, Ann, has been there all the time, supporting and encouraging me down the road to the finish line. I am sure she had all the patience in the world to remain faithful throughout my long working days and nights. I hope I will make up for the days I longed to spend a good time with my kid, Cyrine, and that they both forgive my absence and busyness during this long journey.

I will never be able to honor the favours of my larger family: my parents, Dr. Mahmoud and Dr. Amany, and my sisters, Marwa and Menna. Without their help and support, I may never have made it this far. On a dark, desperate night, a few minutes’ call with them has always energized and strengthened me to go on. Their advice, love, unconditional faith and confidence have always motivated me, not only during my PhD but throughout my entire life. God bless you, my family!


Contents

1 Introduction
   1.1 Analytical CI Algorithms: Challenges and Acceleration
   1.2 Deep Learning-Based Computational Imaging
   1.3 Dissertation Contributions
   1.4 Dissertation Organization

2 Background
   2.1 Computational Imaging Pipeline
      2.1.1 White Balancing
      2.1.2 Demosaicking
      2.1.3 Denoising
      2.1.4 Sharpening
      2.1.5 Deblurring
      2.1.6 Super-Resolution
   2.2 Analytical CI Algorithms
      2.2.1 Similarity-Based Collaborative Filtering (SBCF)
      2.2.2 Block Matching and 3D (BM3D) Filtering
   2.3 Learning-Based Computational Imaging
   2.4 Hybrid Analytical-Machine Learning Approaches
   2.5 Machine Learning
      2.5.1 Neural Networks
      2.5.2 Convolutional Neural Networks
   2.6 Concluding Remarks

3 IDEAL: Custom Image Denoising Accelerator
   3.1 Introduction
   3.2 BM3D Denoising
      3.2.1 Computational Blocks
   3.3 Software Implementations
      3.3.1 CPU Implementation
      3.3.2 GPU Implementation
      3.3.3 Execution Time Breakdown
   3.4 IDEAL
      3.4.1 Block-matching Engines
      3.4.2 Denoising Engine
      3.4.3 Off-chip Bandwidth
      3.4.4 Reduced Precision Arithmetic
      3.4.5 Patch Buffer Configuration
   3.5 Accelerator Optimization
      3.5.1 Matches Reuse Optimization
      3.5.2 Matches Reuse: Potential and Quality
      3.5.3 Architecture Modifications
   3.6 Evaluation
      3.6.1 Methodology
      3.6.2 Execution Time Performance
      3.6.3 Energy Efficiency
      3.6.4 Area
      3.6.5 Optimization Effect Breakdown
      3.6.6 Sensitivity Study: Scalability
      3.6.7 Sensitivity Study: Technology Node
      3.6.8 Sensitivity Study: Precision Tuning
   3.7 Augmenting Functionality
   3.8 Related Work
   3.9 Concluding Remarks

4 Diffy: Enabling ML-Based Computational Imaging
   4.1 Introduction
   4.2 Motivation
      4.2.1 Convolutional Layers
      4.2.2 Revealing Redundant Information and Work
      4.2.3 Spatial Correlation in CI-DNNs imaps
      4.2.4 Computation Reduction Potential
      4.2.5 Memory Storage and Communication Reduction Potential
   4.3 Diffy
      4.3.1 Baseline Value-Agnostic Accelerator
      4.3.2 Value-Aware Accelerator
      4.3.3 Differential Convolution
      4.3.4 Delta Dataflow
      4.3.5 Diffy Architecture
      4.3.6 Memory System
   4.4 Evaluation
      4.4.1 Methodology
      4.4.2 Relative Performance
      4.4.3 Per-Layer Analysis
      4.4.4 Absolute Performance: Frame Rate
      4.4.5 Compression and Off-Chip Memory
      4.4.6 Power, Energy Efficiency, and Area
      4.4.7 Sensitivity to Tiling Configuration
      4.4.8 Sensitivity to Dataflow Direction
      4.4.9 Scaling for Real-time HD Processing
      4.4.10 Iso-area Comparison
      4.4.11 Classification DNN Models
      4.4.12 Classification DNNs vs. CI-DNNs: Performance Discrepancy
      4.4.13 Performance Comparison with SCNN
   4.5 Related Work
   4.6 Discussion on IDEAL vs. Diffy
   4.7 Concluding Remarks

5 Summary
   5.1 Contributions
   5.2 Directions for Future Work

Bibliography


List of Tables

3.1 Microarchitectural breakdown of CPU runtime.
3.2 Accelerator hardware parameters.
3.3 CPU parameters.
3.4 GPU parameters.
3.5 ML1 & ML2 neural network parameters.
3.6 The implementations we compare along with the corresponding abbreviations.
3.7 Power breakdown for all implementations in Watts.
3.8 The effect of prefetching and on-chip buffering on IDEALMR speedup over CPU.
3.9 Area and power consumption vs. precision.

4.1 Profiling-based activation precisions per layer.
4.2 CI-DNNs studied.
4.3 Input datasets used.
4.4 VAA, PRA and Diffy configurations.
4.5 Minimum on-chip memory capacities vs. encoding scheme.
4.6 Power [W] consumption breakdown for Diffy vs. PRA and VAA.
4.7 Area [mm²] breakdown for Diffy vs. PRA and VAA.
4.8 Classification DNNs: Profiling-based activation precisions per layer.


List of Figures

2.1 Typical computational imaging pipeline stages.
2.2 Simple 2-layer neural network.
2.3 Deep neural network with several hidden layers.
2.4 One convolution layer of a CNN.

3.1 Block diagram of BM3D processing flow.
3.2 CPU runtime for images up to 16 MP.
3.3 CPU and GPU runtime for images up to 42 MP.
3.4 Runtime breakdown for the CPU and GPU implementations.
3.5 Block diagram of the basic accelerator (IDEALB).
3.6 Block-matching (BM) engine.
3.7 Denoising (DE) engine.
3.8 Overlapping search windows for consecutive reference patches.
3.9 PSNR for different fraction precisions normalized to the floating-point implementation.
3.10 MR hit rate as a function of K.
3.11 Per-image normalized PSNR as a function of K.
3.12 One lane of IDEALMR. The accelerator features 16 lanes sharing the same memory controller.
3.13 Speedup vs. single-thread CPU.
3.14 IDEALMR runtime for different resolution images.
3.15 HD frames per second processed by IDEALMR under different configurations.
3.16 Performance sensitivity to the number of lanes of IDEALMR.

4.1 Information content: entropy of raw activations H(A), conditional entropy H(A|A′) (see text) and entropy of deltas H(∆).
4.2 The imap values of CI-DNNs are spatially correlated; as a result, processing deltas instead of raw values reduces work. All results are with the Barbara image as input.
4.3 Cumulative distribution of the number of effectual terms per activation/delta over all considered CI-DNNs and datasets. Average sparsity is shown for raw activations.
4.4 Potential speedups when processing only the effectual terms of the imaps (RawE) or of their deltas (∆E). Speedups are reported over processing all imap terms.
4.5 Storage needed for three compression approaches normalized to a fixed-precision storage scheme.
4.6 Tile of our baseline Value-Agnostic Accelerator (VAA).
4.7 A PRA tile. The AM is partitioned along columns.
4.8 Differential Convolution example: processing deltas instead leads to shorter computations.
4.9 Diffy tile reconstructing the differential convolution output.
4.10 Diffy’s Deltaout engine computing deltas for the next layer.
4.11 PRA and Diffy performance normalized to VAA.
4.12 Execution time breakdown per layer for each network showing the utilized cycles, idle cycles and memory stalls.
4.13 Frames per second processed by Diffy as a function of image resolution. HD frame rate is plotted in Fig. 4.14.
4.14 HD frames per second processed by VAA, PRA and Diffy with different compression schemes.
4.15 Off-chip traffic of different compression schemes normalized to no compression.
4.16 Performance of Diffy with different off-chip DRAM technologies and activation compression schemes.
4.17 Performance sensitivity to the number of terms processed concurrently per filter.
4.18 Potential speedups for Diffy_y compared to Diffy and PRA over VAA.
4.19 Tiles needed for Diffy to sustain real-time HD processing along with the needed memory system for each compression scheme. D, P and N stand for DeltaD16, Profiled and NoCompression.
4.20 Iso-area PRA and Diffy performance normalized to VAA.
4.21 PRA and Diffy speedup for classification DNNs.
4.22 Speedup of Diffy over SCNN.


Chapter 1

Introduction

Numerous applications, such as those in medical imaging, film production, automotive and robotics, use imaging sensors (IS) to convert light to signals appropriate for further processing and storage by digital devices such as smart phones, desktop computers, and digital cameras. IS output is far from perfect and requires significant processing in the digital domain to yield acceptable results. For example, lens imperfections result in distorted output, while sensor sensitivity may yield output that is underexposed or overexposed in places and thus missing crucial information. Computational Imaging (CI) is the processing in the digital domain of IS output to: 1) compensate for imperfections in the imaging system, 2) reduce the complexity and cost of its components, and 3) reveal new features and high-level information of the depicted scene, such as veins or skin tumors in medical imaging, and vehicle and road structures in autonomous driving applications [1, 2, 3, 4]. The latter are often also considered Computer Vision (CV) tasks or components thereof.

Photos and videos have dominated the data generated in the past few years, with 1 trillion photos taken in 2015 compared to the 3.8 trillion photos taken in all of human history until 2011 [5]. By 2017, around 80% of all photos were taken with a mobile phone, a platform with limited power and computational capabilities [6]. Similar resource limitations apply to many consumer devices such as digital cameras and medical imaging devices. The market for camera-enabled consumer devices, including smart phones, tablets, digital cameras and hand-held computers, is expected to be worth $839 billion by 2020 [7]. Besides consumer devices, computational imaging is involved in scientific applications such as microscopy and telescope imaging, with massively high-resolution images of up to 1.5 billion pixels [8, 9], automation applications in manufacturing pipelines, and even server farms. Thus, there are also applications where higher cost and energy can be acceptable for better quality.

Traditionally, image signal processors were used to perform the first low-level steps of image processing, while the higher-level and more advanced CI and vision tasks were run as software implementations on commodity architectures such as general-purpose processors (CPUs) and graphics processing units (GPUs). Such software implementations offer the flexibility necessary for CI applications to evolve and be customized according to the needs of each application. However, the computation power and data bandwidth of commodity architectures now fall short of what is needed due to: 1) the increasing sophistication of CI algorithms [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], and 2) the increasing demand for imaging systems that can process higher-resolution images at higher frame rates while often supporting multiple sensors [21, 22, 23, 24, 25, 26]. Thus, practical deployment and further advances in the sophistication of state-of-the-art CI algorithms are hindered by execution time performance [10, 11, 27].

Accordingly, this dissertation develops hardware accelerators for state-of-the-art CI algorithms, enabling practical deployment and achieving the performance needed for real-time and/or user-interactive applications. The proposed architectures are optimized in terms of energy efficiency, off-chip communication and on-chip storage to fit the limited resources of embedded devices, which are typically cost-, power-, energy-, and form-factor-constrained. However, the proposed techniques and architectures should also be applicable to less constrained applications where a higher cost and energy budget are acceptable.

This dissertation characterizes and accelerates two CI implementation approaches: 1) Analytical solutions, represented by Block-Matching and 3D filtering (BM3D), a state-of-the-art CI algorithm used in a wide set of core CI tasks including denoising, sharpening, deblurring and upsampling [11, 12, 28, 29, 30, 31, 32, 33]. While the emphasis is on hardware acceleration, software-hardware co-design solutions and optimizations are developed to boost the performance of the proposed designs. 2) Machine learning-based alternatives, represented by a set of recent deep neural networks (DNNs) performing core CI tasks including denoising, demosaicking and super-resolution. We exploit unique characteristics of these DNNs to boost performance and energy efficiency over state-of-the-art DNN accelerators.

The rest of this chapter is organized as follows. Section 1.1 overviews the BM3D technique, a framework for implementing several CI algorithms, and the architectural challenges when running it on commodity architectures. It then summarizes the techniques incorporated by IDEAL, our BM3D accelerator, to enable that family of algorithms to run efficiently. Section 1.2 overviews DNN-based computational imaging models and the techniques exploited by Diffy, our DNN accelerator, to boost performance and energy efficiency over state-of-the-art DNN accelerators while running these models. Section 1.3 summarizes the contributions made by this dissertation and Section 1.4 describes the thesis organization.

1.1 Analytical CI Algorithms: Challenges and Acceleration

A state-of-the-art technique used in a wide set of CI applications is Non-local Self-Similarity (NSS), sometimes referred to as Similarity-Based Collaborative Filtering. The technique exploits the fact that in a natural image similar patterns, i.e., small image patches, repeat many times at different positions within the image. These repeated patches are used to produce a higher-quality reconstruction of a given distorted patch. Many recent works have adopted the NSS technique to boost output quality [11, 34, 35, 36, 37, 38, 39, 40]. BM3D is a realization of this technique that achieves state-of-the-art quality in several CI tasks such as denoising, sharpening, deblurring and upsampling [11, 12, 28, 29, 30, 31, 32, 33]. Since it was proposed, BM3D has been widely adopted as the quality standard to compare against for competing analytical and learning-based methods [37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49].
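To make the block-matching step at the heart of NSS concrete, the following sketch finds, for one reference patch, the most similar patches within a local search window using a sum-of-squared-differences metric. This is an illustrative minimal sketch only: the function name and the parameter defaults (8×8 patches, a 39×39 search window, 16 matches) are assumptions for exposition, not the exact configuration evaluated in this dissertation.

```python
import numpy as np

def block_match(image, ref_y, ref_x, patch=8, search=19, k=16):
    """Return the k candidate patches most similar to the reference patch.

    Hypothetical sketch of NSS/BM3D-style block matching; parameter
    names and defaults are illustrative assumptions.
    """
    ref = image[ref_y:ref_y + patch, ref_x:ref_x + patch]
    candidates = []
    # Exhaustively scan a (2*search+1) x (2*search+1) window of positions.
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = ref_y + dy, ref_x + dx
            # Skip candidates that would fall outside the image.
            if 0 <= y <= image.shape[0] - patch and 0 <= x <= image.shape[1] - patch:
                cand = image[y:y + patch, x:x + patch]
                ssd = float(((ref - cand) ** 2).sum())  # similarity metric
                candidates.append((ssd, y, x))
    candidates.sort(key=lambda t: t[0])
    return candidates[:k]  # k best matches (the reference itself scores 0)
```

The matched patches would then be stacked into a 3D group and filtered collaboratively; it is this exhaustive per-reference-patch search that makes block matching the dominant cost.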

However, the excellent output quality of BM3D comes at significant computational cost. Thus, it is challenging to deploy BM3D in practical imaging systems. For example, and as corroborated by this dissertation, BM3D can take from a few minutes up to half an hour, depending on the application, to process one image of moderate resolution (13 megapixels, MP) running on a high-end CPU [10]. This is exceedingly slow for real-time or even user-interactive applications, where processing an image should take at most half a second if not less [50, 51, 52, 53]. BM3D also requires significant memory bandwidth and storage footprint. Chapter 3 illustrates the processing steps, the bottlenecks and the high computation demands of BM3D.

Several recent works have aimed to reduce BM3D's execution time. Algorithmic simplifications were proposed to reduce the computations needed at the cost of some drop in output quality [27, 54]. Other works vectorized BM3D on single-instruction multiple-data (SIMD) architectures such as CPUs with vector extensions and GPUs [10, 27, 55, 56, 57]. The vectorization exploited the embarrassingly parallel nature of most of BM3D's computations. However, performance remains unsatisfactory even on high-end GPUs, a finding corroborated by this dissertation. Recent works proposed BM3D accelerators on Field-Programmable Gate Arrays (FPGAs) [58] or as fully custom Application-Specific Integrated Circuit (ASIC) implementations [59, 60]. However, none of these attempts managed to achieve real-time BM3D processing for high-resolution images within a limited power and energy envelope. Moreover, the simplifications incorporated by these attempts sacrifice quality for performance.

This dissertation develops and optimizes software implementations of BM3D, taking advantage of the capabilities of a high-end CPU with vector extensions and a high-end GPU. The performance results corroborate the findings of past work [10, 11, 27] showing that BM3D is impractical to deploy even for modest image resolutions and even after intensive software and algorithmic optimization efforts. Our performance analysis shows that the block matching step is the most time-consuming task, taking 67% and 87% of the processing time on the CPU and the GPU respectively. Thus, acceleration efforts should be devoted to it first.

Accordingly, this dissertation proposes IDEAL, a BM3D denoising hardware accelerator that incorporates the following techniques: 1) Matches Reuse (MR), a novel software-hardware co-optimization that exploits typical image content to significantly reduce the computations needed without loss in output quality, 2) prefetching and judicious use of on-chip buffering to minimize execution stalls and off-chip bandwidth consumption, and 3) a careful arrangement of specialized computing blocks.

In natural images, subsequent patches taken at small strides are nearly identical and their surrounding search windows overlap almost completely. The MR technique exploits this common case to significantly reduce the computations needed by the block matching step of BM3D. MR boosts performance and slightly improves output quality, especially for images with more homogeneous color areas. Output quality is measured using the traditional Signal-to-Noise Ratio (SNR) metric and the Structural Similarity Index (SSIM) metric, which better matches known characteristics of the human visual system [61]. IDEAL exploits the regular memory access pattern of BM3D along with appropriately sized on-chip buffers to decouple the processing of the current input pixels from the off-chip transfers needed for subsequent inputs, hiding most of the off-chip access latency. The specialized computation blocks are arranged to maximize hardware reuse in a structure-based acceleration fashion. The MR-enhanced IDEAL (IDEALMR) can process a 42MP image in 0.45 seconds and a 2MP high-definition (HD) frame in real-time (more than 30 frames per second). On average, IDEALMR is 11,352× and 591× faster than the CPU and GPU implementations respectively, and 4 and 3 orders of magnitude more energy efficient respectively.

Accelerating BM3D and the proposed techniques can benefit a wider range of CI applications. BM3D is a flexible CI framework with an underlying generic natural image model. Thus, by adjusting its filtering stage it can implement many other CI applications including image sharpening, deblurring and super-resolution [11, 28, 29, 30, 31, 32, 33]. This dissertation shows that IDEAL can take advantage of this property by demonstrating that it can be extended, with appropriate modifications to a limited set of its modules, to support additional CI applications. We show in Section 3.7 an example where IDEAL is extended to support sharpening by adding a small amount of extra hardware to its filtering module. Moreover, the MR optimization we propose can benefit a wider range of NSS-based algorithms that involve a block-matching step.

1.2 Deep Learning-Based Computational Imaging

For years, state-of-the-art CI algorithms were handcrafted by combining analytical techniques tailored to each application. This is a cumbersome process, and it is especially hard to provision for every possible case in such algorithms. Recent works, however, propose to use machine learning (ML) instead. Such efforts have devised computational imaging DNN models (CI-DNNs) that rival and often surpass handcrafted methods in terms of output quality for most of the core CI tasks, including image denoising [46, 47, 48, 62, 63], demosaicking [49], sharpening [64], deblurring [65, 66, 67], and super-resolution [68, 69, 70, 71, 72]. A further advantage of this approach is that it could potentially generalize to cover additional CI tasks.

As opposed to accelerating handcrafted CI algorithms, DNN acceleration has the advantage of being less rigid: the accelerator can be used for other applications as well. However, as generality often comes at a cost, a comparison quantifying the performance and energy efficiency gap between the two approaches is needed. We compare IDEAL, our BM3D accelerator, to DaDianNao, a state-of-the-art DNN accelerator [73], running recent image denoising DNN models [62, 74]. Our results show that further execution time improvements are needed for DNN accelerators to meet the requirements of real-time and user-interactive computational imaging applications. Specifically, IDEAL is found to be at least 5.4× faster and 3.95× more energy efficient than DaDianNao running the studied DNNs. This performance gap motivates our next contribution, where we devise a DNN accelerator tailored for such DNN models to enhance performance and energy efficiency over state-of-the-art DNN accelerators.

Recent works have proposed a plethora of DNN accelerators boosting performance and energy efficiency over commodity CPUs and GPUs while running DNN models, e.g., DaDianNao [73], Cnvlutin [75], Minerva [76], Eyeriss [77], EIE [78], Stripes [79], Bit-Pragmatic [80] and SCNN [81]. These designs were motivated by: 1) the successes of DNN models in high-level classification applications such as image recognition [82, 83, 84], object segmentation [85, 86, 87] and speech recognition [88], and 2) the observation that, over time, models performing such applications are becoming more complex, with deeper pipelines of layers and more filters and weights per layer. This increases computation and data intensity and stresses both the underlying compute cores and the memory system. Accordingly, previously proposed accelerators have taken advantage of the computation structure, the data reuse, the static and dynamic ineffectual value content, and the precision requirements of these DNNs to improve performance and energy efficiency and to reduce the data storage and communication needed while executing these models.

CI-DNNs introduce new challenges and improvement opportunities due to their unique characteristics, highlighted by this dissertation. While classification models extract features and identify high-level abstractions, CI-DNNs perform low-level per-pixel prediction, i.e., for each input pixel the model predicts a corresponding output pixel. For example, image classification models take an image and identify the depicted object; the output is typically a vector of probabilities, one per possible object class. A denoising DNN model, on the other hand, takes a noisy image and produces a denoised version, typically at the same resolution as the input. Thus, the structure and behavior of CI-DNNs differ from those of image classification models. Three key differences are: 1) While DNN models generally include a variety of layers, the per-pixel prediction models are fully convolutional. 2) CI-DNNs naturally scale with the input resolution, whereas classification models are resolution-specific. 3) CI-DNN models exhibit significantly higher spatial correlation in their runtime-calculated values. That is, the inputs (activations) used during the calculation of neighboring outputs tend to be close in value. This is a known property of images which these models exhibit throughout their layers. The first two characteristics exacerbate the typical bottlenecks of DNN processing, causing CI-DNNs to have much higher computation demands, produce a much larger volume of intermediate results, and thus need much more on-chip storage and off-chip bandwidth. We exploit the third characteristic to mitigate the exacerbated bottlenecks.

This dissertation identifies and exploits a runtime behavior specific to CI-DNNs to inform further innovation in DNN accelerator design. Specifically, this work characterizes the strong spatial correlation in the runtime-calculated value stream of CI-DNNs. The dissertation further demonstrates that this property has tangible practical applications by presenting Diffy, a practical hardware accelerator that exploits this spatial correlation to transparently reduce: 1) the number of bits needed to store the activation values on- and off-chip, and 2) the computations that need to be performed. Combined, these reductions increase performance and energy efficiency over state-of-the-art accelerator designs. To take advantage of the spatial correlation in activation values, we introduce Differential Convolution, which operates on the differences, or deltas, of the activations rather than on their absolute values. This approach greatly reduces the amount of work required to execute a CI-DNN. Diffy demonstrates that differential convolution can be practically implemented to translate the reduced precision and the reduced effectual bit content of these deltas into improved performance, reduced on- and off-chip storage and communication, and, ultimately, improved energy efficiency. While Diffy targets CI-DNNs, we demonstrate that it is robust, as it also benefits other models such as image classification and image segmentation models.
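The arithmetic identity behind differential convolution can be sketched in a few lines. The example below is our own simplification (1D, a single filter, no quantization; the actual Diffy datapath is described in Chapter 4): each output is computed as the previous output plus a convolution over activation deltas, which stay small when neighboring activations are close in value.

```python
import numpy as np

def conv1d(a, w):
    """Dense 1D convolution ("valid" mode) over raw activation values."""
    n = len(a) - len(w) + 1
    return np.array([np.dot(a[i:i + len(w)], w) for i in range(n)])

def diff_conv1d(a, w):
    """Same outputs, computed from deltas between adjacent windows.

    out[i] - out[i-1] = sum_j w[j] * (a[i+j] - a[i-1+j]), so each output
    is the previous output plus a convolution over small deltas.
    """
    k = len(w)
    n = len(a) - k + 1
    out = np.empty(n)
    out[0] = np.dot(a[:k], w)                    # first output uses raw values
    for i in range(1, n):
        delta = a[i:i + k] - a[i - 1:i - 1 + k]  # deltas of correlated values
        out[i] = out[i - 1] + np.dot(delta, w)   # reuse the previous output
    return out

# Spatially correlated activations produce small deltas (few effectual bits).
a = np.array([10.0, 11.0, 11.0, 12.0, 13.0, 13.0])
w = np.array([0.25, 0.5, 0.25])
```

Both routines produce identical outputs; the benefit is that the deltas need fewer effectual bits than the raw activations, which Diffy translates into fewer compute cycles and narrower on- and off-chip storage.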

Diffy boosts average performance by 7.1× over a baseline value-agnostic accelerator (VAA) [73] and by 1.41× over a state-of-the-art value-aware accelerator that processes only the effectual content of the raw activation values (PRA) [80]. In addition, Diffy is respectively 1.83× and 1.36× more energy efficient when considering only the on-chip energy. Moreover, Diffy requires much less off-chip traffic, which would translate to even higher energy efficiency. More importantly, Diffy delivers the performance necessary to achieve real-time processing of HD-resolution images with practical configurations. Finally, Diffy is a robust CNN accelerator, as it improves performance even for image classification, object detection and image segmentation models, by 6.1× and 1.16× over VAA and PRA respectively.

1.3 Dissertation Contributions

The contributions of this dissertation are:

• It characterizes the performance and energy consumption of highly optimized software implementations of BM3D, a state-of-the-art algorithm covering several core CI applications, running on high-end CPUs with vector extensions and high-end GPUs. In the context of image denoising, it demonstrates that such general-purpose compute architectures cannot sustain the compute requirements of real-time and user-interactive applications, even for HD frames.

• It proposes IDEAL, a dedicated hardware accelerator for BM3D. IDEAL incorporates a novel software-hardware co-optimization technique, Matches Reuse (MR), that reduces the amount of computation performed. Experiments demonstrate that IDEAL makes it possible to process image frames of up to 42MP for user-interactive applications (≈ 2 fps) and HD frames in real-time (30 fps). On average, IDEAL is 11,352× and 591× faster than the CPU and GPU implementations respectively, and 4 and 3 orders of magnitude more energy efficient respectively. As evidence that IDEAL can be extended to support additional CI tasks, the thesis shows how sharpening can be supported with limited modifications to its pipeline.

• It characterizes emerging DNN-based computational imaging implementations (CI-DNNs) on state-of-the-art DNN accelerators. While far faster than software implementations, the performance of these accelerators still falls short of real-time and user-interactive processing requirements. The DNN accelerator we use as a baseline is found to be at least 5.4× slower and 3.95× less energy efficient than IDEAL. Fortunately, our characterization results show that there is significant room to improve the performance and energy efficiency of DNN accelerators running computational imaging DNNs.

• It introduces Differential Convolution, a novel technique to compute convolutions using the deltas between successive input activations instead of the raw values. The technique exploits the strong spatial correlation between adjacent activation values, an inherent characteristic of CI-DNN models. The dissertation proposes Diffy, a DNN hardware accelerator that incorporates differential convolution to significantly reduce the computation, communication and storage needed. Diffy improves performance over our baseline accelerator by 7.1× and improves on-chip energy efficiency by 1.83×. Moreover, Diffy needs 2.8× less on-chip memory and consumes 4.57× less off-chip bandwidth. Thus, even higher overall energy efficiency is expected.

1.4 Dissertation Organization

The rest of this thesis is organized as follows.

• Chapter 2 overviews computational imaging, the typical CI pipeline structure and several state-of-the-art CI algorithms, including analytical, learning-based and hybrid models.

• Chapter 3 first analyzes the performance of BM3D image denoising on a commodity CPU with vector extensions and on a high-end GPU. It presents Matches Reuse (MR), a novel software-hardware co-optimization that exploits the underlying data properties to reduce the computation needed. It introduces IDEAL, a custom hardware BM3D accelerator that incorporates the MR technique. Then, the chapter demonstrates the extensibility of IDEAL by modifying its processing pipeline to support sharpening as well. Finally, it compares IDEAL against a state-of-the-art DNN accelerator running CI-DNN models that were recently shown to rival the output quality of BM3D. This comparison motivates us to consider improving hardware acceleration for CI-DNNs.

• Chapter 4 characterizes a set of state-of-the-art CI-DNN models whose output quality was shown to rival or exceed that of handcrafted analytical models. The analysis shows remarkable spatial correlation in the data values processed by CI-DNNs. This motivates Differential Convolution, a technique that exploits the inherent spatial correlation to reduce the computation, communication and storage needed by these models. It then presents Diffy, a DNN hardware accelerator incorporating differential convolution to improve performance and energy efficiency over state-of-the-art DNN accelerators for both CI-DNNs and higher-level computer vision models.

• Chapter 5 summarizes the contributions of this dissertation and points to potential directions for future work.


Chapter 2

Background

This chapter reviews computational imaging and a typical CI pipeline in Section 2.1. Section 2.2 reviews state-of-the-art analytical CI methods and Section 2.3 revisits learning-based approaches, especially DNN-based implementations. Section 2.4 introduces hybrid approaches, where parts of the CI algorithm are handcrafted while others are implemented as trainable DNN models. Finally, Section 2.5 reviews the key concepts of machine learning with emphasis on Convolutional Neural Networks (CNNs), the type typically used for image processing tasks.

2.1 Computational Imaging Pipeline

A computer vision pipeline typically comprises a set of processing stages starting with image acquisition from an imaging sensor, followed by a low-level image processing and enhancement sub-pipeline, and ending with higher-level processing steps such as object detection, segmentation and classification [89]. This dissertation focuses on the low-level image processing steps, known as Computational Imaging. CI compensates for the imperfections of the imaging system and reduces its cost by processing the output of the imaging sensor in the digital domain. Typically, a CI pipeline includes white balancing, demosaicking, denoising, sharpening, deblurring and super-resolution, as shown in Fig. 2.1 [90, 91, 92]. The following is an overview of each of these operations.

2.1.1 White Balancing

This step is sometimes referred to as color correction. It adjusts the intensity of the colors in an image to be more appropriate for reproduction or display. This correction is essential for several reasons, including the mismatch between the acquisition sensors and the perception sensors in the human eye, the need to account for the target display medium, and the difference between the ambient viewing conditions at acquisition and those at display. The adjustment is done such that objects believed to be neutral, for example white or grey, appear neutral in the reproduction.

Figure 2.1: Typical computational imaging pipeline stages.

The process conceptually consists of two steps: 1) determining the illuminant under which an image was captured, then 2) scaling or transforming the color components (e.g., R, G, and B) of the image so they conform to the viewing illuminant. It can be applied in most of the known color spaces, such as camera RGB and monitor RGB [93]. However, Viggiano found that white balancing in the camera's native RGB color model produces more color constancy (i.e., less distortion of the colors) than doing so in monitor RGB, under thousands of hypothetical camera sensitivity settings [93]. Thus, it is advantageous to perform white balancing right at the time of image acquisition rather than editing it later on a monitor.

Several algorithms have been proposed to estimate the ambient lighting from the camera data and then use this information for the subsequent transformation step. Examples include algorithms [94, 95, 96] based on the Retinex theory proposed by Land [97]. These algorithms attempt to estimate the reflectances, i.e., the true color components, of each point in the input RGB image. One such algorithm operates as follows. The maximum red, green and blue values rmax, gmax and bmax over all pixels are determined. Assuming that the scene contains objects which reflect all red light, others which reflect all green light and still others which reflect all blue light, the illuminating light source can be described by (rmax, gmax, bmax). For each pixel with values (r, g, b), its reflectance is estimated as (r/rmax, g/gmax, b/bmax). Other proposed algorithms are based on neural networks [98, 99, 100] and Bayesian inference [101, 102].
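The max-RGB scheme just described reduces to a few lines of NumPy. The sketch below is our own illustration (function names are ours), not code from any of the cited works:

```python
import numpy as np

def white_balance_max_rgb(img):
    """img: H x W x 3 float array in camera RGB.

    Step 1: estimate the illuminant as (rmax, gmax, bmax) over all pixels.
    Step 2: estimate each pixel's reflectance as (r/rmax, g/gmax, b/bmax).
    """
    illuminant = img.reshape(-1, 3).max(axis=0)
    return img / illuminant

# A scene under a reddish illuminant: red components are inflated.
img = np.array([[[0.8, 0.4, 0.4]],
                [[0.4, 0.2, 0.2]]])
balanced = white_balance_max_rgb(img)
# The brightest surface maps to neutral, removing the color cast.
```

Note the white-patch assumption: if no object in the scene reflects a given channel fully, the per-channel maxima underestimate the illuminant and the correction is only approximate.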

2.1.2 Demosaicking

For cost reduction and due to manufacturing limitations, a digital camera is typically equipped with just one CMOS (complementary metal-oxide-semiconductor) or CCD (charge-coupled device) sensor overlaid with a color filter array (CFA). Thus, the output is an image with spatially under-sampled color channels, called a mosaic. Demosaicking is the process of reconstructing a full-resolution color image from the incomplete color samples acquired from the imaging sensor of a digital camera [103]. While the Bayer pattern is the most commonly used CFA in both the literature and commercial products [104], there are several other filter arrangements [105, 106, 107, 108]. A demosaicking algorithm needs to interpolate the acquired samples while avoiding color artifacts such as zippering and chromatic aliases near edges and detailed regions [103]. For efficient hardware implementation, the algorithm needs to exhibit low computational complexity. Since samples are typically noisy, the demosaicking algorithm should also be amenable to analysis to simplify the subsequent noise reduction step.

There are a number of simple demosaicking algorithms where nearby samples of the same channel are interpolated, such as the nearest-neighbor, bilinear, bicubic, spline and Lanczos interpolation algorithms [109]. More complex methods exploit spatial and spectral correlation to deliver higher quality output, such as the threshold-based variable number of gradients method [110], the pixel grouping method [111] and the adaptive homogeneity-directed method [112]. The similarities between demosaicking and super-resolution have motivated jointly addressing them in a unified context [113]. Demosaicking has also been jointly addressed with denoising using deep learning to tackle hard cases, such as thin edges, where conventional handcrafted algorithms exhibit visual artifacts [49]. This dissertation characterizes one such CI-DNN, demosaicnet [49], and exploits its runtime value characteristics in Diffy, boosting performance and energy efficiency over previous DNN accelerators.
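To make the simple end of this spectrum concrete, the sketch below is our own illustration of bilinear demosaicking for an RGGB Bayer mosaic (not code from any cited work): each color plane keeps its sparse samples and fills the gaps by averaging the nearest samples of the same channel.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same' 2D convolution with zero padding (NumPy only)."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += k[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def demosaic_bilinear(mosaic):
    """mosaic: H x W single-channel RGGB Bayer image -> H x W x 3 RGB."""
    h, w = mosaic.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r_mask = ((yy % 2 == 0) & (xx % 2 == 0)).astype(float)
    b_mask = ((yy % 2 == 1) & (xx % 2 == 1)).astype(float)
    g_mask = 1.0 - r_mask - b_mask
    # These kernels average the nearest same-channel samples (bilinear).
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0
    return np.dstack([conv2d_same(mosaic * r_mask, k_rb),
                      conv2d_same(mosaic * g_mask, k_g),
                      conv2d_same(mosaic * b_mask, k_rb)])
```

Bilinear interpolation reconstructs flat regions exactly but blurs edges and produces the zippering artifacts mentioned above, which is what motivates the more complex gradient- and homogeneity-based methods.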



2.1.3 Denoising

This step aims to improve the signal-to-noise ratio in the output image frame. Besides improving visual quality, denoising enhances the visibility of low-contrast objects and recovers the fine details of an object. This is critical for subsequent computer vision tasks. The most commonly used noise model is Additive White Gaussian Noise (AWGN), which assumes that each pixel in the image has been perturbed by a typically small added noise value following a Gaussian distribution. Such noise arises from fluctuations in the number of photons received by the imaging sensor cells, in accordance with the central limit theorem [114]. Salt-and-pepper is another type of noise, caused by dusty, overheated or faulty imaging sensor cells [115].
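The AWGN model is straightforward to state in code. The sketch below (our own, for illustration) perturbs each pixel with an independent zero-mean Gaussian sample and measures the resulting peak signal-to-noise ratio (PSNR), the quantity a denoiser tries to improve:

```python
import numpy as np

def add_awgn(img, sigma, rng):
    """Perturb each pixel with an independent N(0, sigma^2) sample."""
    return img + rng.normal(0.0, sigma, img.shape)

def psnr(clean, noisy, peak=255.0):
    """Peak signal-to-noise ratio in dB against the clean reference."""
    mse = np.mean((clean - noisy) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
clean = np.full((64, 64), 128.0)          # a flat 8-bit-scale patch
noisy = add_awgn(clean, sigma=25.0, rng=rng)
# With sigma = 25 on an 8-bit scale the expected MSE is sigma^2 = 625,
# i.e., a PSNR of roughly 10*log10(255^2 / 625), about 20 dB.
```

A denoiser's output PSNR (and SSIM) is then measured against the clean image in exactly the same way, which is how the quality comparisons in this thesis are framed.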

Denoising algorithms range in complexity from simplistic linear filtering [116], i.e., convolution low-pass filters that smooth out noisy pixels but introduce undesirable artifacts including blurry edges and loss of fine detail, up to the state-of-the-art Block-Matching and 3D Filtering (BM3D) technique, which delivers the highest quality known to date [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. However, this superior quality comes at significant computational cost. On high-end CPUs and GPUs, the performance of BM3D software implementations was found to be far from satisfactory even after intensive software optimization efforts [10, 11, 27]. This dissertation corroborates these findings and shows that a BM3D denoiser running on a modern CPU takes around 1600 seconds to process a 16MP image. Heide et al. report that BM3D denoising accounts for 95% of the processing time of a typical front-end CI pipeline [10]. Accordingly, we tackle this bottleneck by introducing IDEAL, a custom hardware accelerator for BM3D enabling real-time processing of HD frames (30 fps or more).

Recently, deep learning methods have also been proposed to tackle the denoising problem with output quality that rivals the handcrafted BM3D technique [46, 47, 48, 117]. This dissertation characterizes and exploits the properties of DNN models performing denoising as well as other CI tasks such as demosaicking and super-resolution. It proposes Diffy, which translates these properties into improved performance and energy efficiency over previous deep learning accelerators.

2.1.4 Sharpening

Sharpening improves the visual quality of an image by enhancing its high-frequency components. The output image typically exhibits increased contrast, highlighting edges and small details. Sharpening is a preprocessing step prior to higher-level vision tasks such as segmentation, binarization, classification and pattern recognition [28]. It is also used to produce visually appealing output in consumer imaging applications, since the human visual system is more sensitive to edges and fine details. Unsharp masking is a conventional sharpening technique where the original image is passed through a high-pass filter to extract the high-frequency components. Then, the filter output is scaled and added to the original image, amplifying the high-frequency components corresponding to edges and fine details while keeping the low-frequency components, related to homogeneous regions, unchanged. More recent techniques apply sharpening in a transform domain by amplifying certain parts of the spectrum, leading to improved output quality [118, 119, 120].
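The unsharp-masking recipe above reduces to a few lines. The 1D sketch below is our own illustration, with a box filter standing in for the low-pass filter; flat regions pass through unchanged while samples around an edge overshoot, which is perceived as increased sharpness:

```python
import numpy as np

def box_blur1d(x, radius=1):
    """Simple low-pass filter: moving average with edge padding."""
    k = 2 * radius + 1
    xp = np.pad(x, radius, mode="edge")
    return np.convolve(xp, np.ones(k) / k, mode="valid")

def unsharp_mask1d(x, amount=1.0, radius=1):
    high = x - box_blur1d(x, radius)   # high-frequency residual
    return x + amount * high           # amplify edges and fine detail

x = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])   # a step edge
y = unsharp_mask1d(x, amount=1.0)
# Samples far from the edge keep their values; the two samples
# adjacent to the edge over- and undershoot, steepening the transition.
```

The `amount` parameter is the scaling factor mentioned above; larger values sharpen more aggressively but also amplify noise, which is why sharpening usually follows denoising in the pipeline of Fig. 2.1.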

Alpha-rooting is a well-established sharpening technique where the amplification is done by taking the α-root of the transform coefficients, with α being a constant larger than 1. This dissertation demonstrates that IDEAL, our proposed BM3D denoising accelerator, can be extended to perform sharpening with little extra hardware overhead, where the processing pipeline is augmented with α-rooting engines. The extension is inspired by a joint denoising-sharpening technique proposed in previous works [28, 30].
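A minimal sketch of alpha-rooting follows. This is our own illustration in the 2D FFT domain (the cited works apply it to other transforms, and the DC-preserving normalization is our choice for the example):

```python
import numpy as np

def alpha_root_sharpen(img, alpha=1.5):
    """Raise transform-coefficient magnitudes to the power 1/alpha (alpha > 1).

    Normalizing by the largest magnitude leaves the dominant (low-frequency)
    coefficient unchanged while relatively boosting the weaker high-frequency
    coefficients; phases are preserved so image structure is not shifted.
    """
    coeffs = np.fft.fft2(img)
    mag, phase = np.abs(coeffs), np.angle(coeffs)
    mag_max = mag.max()
    new_mag = mag_max * (mag / mag_max) ** (1.0 / alpha)
    return np.real(np.fft.ifft2(new_mag * np.exp(1j * phase)))
```

With α = 1 the image is returned unchanged; α > 1 compresses the dynamic range of coefficient magnitudes, which amplifies high frequencies relative to the dominant low-frequency content.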



2.1.5 Deblurring

Image deblurring aims to recover a sharp image from the acquired blurry one. Image blur corrupts valuable information that is essential for display or for further high-level processing steps such as object classification and segmentation. Blur can be due to several causes, including motion, camera shake, defocus, atmospheric turbulence and intrinsic malfunction of the imaging system [121]. The blur effect is typically modeled as a convolution of a spatially- and/or temporally-varying kernel with the sharp image to be recovered. In practical applications, both the kernel and the sharp image are unknown, making deblurring an ill-posed inverse problem [122]. Based on how they tackle this ill-posed problem to limit the solution search space, existing methods can be grouped into five categories: Bayesian inference, variational methods, sparse representation methods, homography-based modeling, and region-based methods [123]. Bayesian inference and variational methods introduce priors and regularization techniques respectively to limit the space of possible solutions. Sparse representation methods, which have recently become popular, transform the natural image signal to other domains where it exhibits a sparse representation. Such sparsity facilitates tackling many ill-posed CI problems, including denoising [124, 125, 126] and deblurring [127, 128, 129, 130]. The homography-based and region-based methods address the case of spatially-varying blur, which cannot be modeled by a single kernel. These methods approximate the blur effect with multiple kernels or homographies and incorporate techniques to overcome the growing problem size.
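The forward model described above can be written down directly. The sketch below is our own 1D illustration (using circular convolution to keep it compact, an assumption for the example): the observed signal is the unknown blur kernel convolved with the unknown sharp signal, plus noise, and deblurring must invert this with both factors unknown.

```python
import numpy as np

def blur1d(sharp, kernel, noise):
    """Observed = (kernel * sharp) + noise, via circular convolution."""
    n = len(sharp)
    blurred = np.real(np.fft.ifft(np.fft.fft(sharp) * np.fft.fft(kernel, n)))
    return blurred + noise

sharp = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
kernel = np.array([0.5, 0.5])            # a simple 2-tap motion blur
observed = blur1d(sharp, kernel, noise=0.0)
# The step edges of the sharp signal are smeared across neighboring samples.
```

Even in this noiseless toy case, many (kernel, sharp) pairs explain the same observation, which is exactly the ill-posedness that the priors, regularizers and sparsity assumptions above are introduced to resolve.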

Dabov et al. propose a BM3D-based method that combines the strengths of both the variational methods and the sparse representation methods [29]. The technique uses BM3D denoising as a regularization step after the blur inversion step. Unless accelerated, the significant computational demands of BM3D make this deblurring technique impractical to deploy, especially for real-time and user-interactive applications.

2.1.6 Super-Resolution

Super-resolution is the process of constructing a high-resolution image from one or more observed low-resolution images. Super-resolution reveals finer details of the imaged objects without the need for higher-cost imaging system components such as a higher resolution imaging sensor. Typically, a super-resolution technique extracts non-redundant information from multiple sub-pixel-shifted low-resolution frames to construct a high-resolution version [131]. The sub-pixel shifts between the frames can be obtained with controlled or uncontrolled motion between the imaging system and the scene. Super-resolution is an essential processing step for various applications including object recognition and video surveillance [132], remote sensing [133], medical imaging [134, 135, 136], and video format up-conversion, for example from NTSC to high definition (HD).

Recently, analytical techniques have been proposed to construct a high-resolution image from a single low-resolution image [137]. Deep learning-based single-image super-resolution has recently achieved state-of-the-art output quality [138, 139, 140, 141]. This dissertation exploits unique characteristics of super-resolution DNNs, as well as other CI-DNNs, in Diffy, boosting performance and reducing the communication and storage needed compared to state-of-the-art DNN accelerators.

2.2 Analytical CI Algorithms

Computational Imaging was conventionally dominated by simple analytical algorithms running on commodity architectures such as CPUs and GPUs. However, the complexity of imaging systems is increasing to support higher resolutions and frame rates as well as to handle challenging imaging conditions. Moreover, image-based applications are becoming more dominant, including computer vision, medical diagnosis, film production, remote sensing, robotics and automated driving systems. These factors have pushed experts to devise sophisticated CI algorithms that are able to reveal and enhance the finest details in a captured image or video [10, 11, 29, 120, 142, 143]. Generally, analytical CI techniques can be categorized into spatial filtering methods and wavelet transform-based methods. The spatial filtering techniques process the input image in the spatial color domain to smooth out distortions while preserving fine details and sharp edges as much as possible. This category includes bilateral filters [144], nonlocal means filters [13], anisotropic diffusion filters [145] and guided filters [146]. On the other hand, the wavelet transform-based methods process the wavelet transform coefficients to adaptively remove distortions at different frequencies [147, 148, 149]. Using other transform domains, such as the gradient domain, has also been investigated [150].

Similarity-Based Collaborative Filtering (SBCF), alternatively known as Non-local Self-Similarity (NSS), is an analytical CI technique that has received increasing attention. It has been widely adopted by several CI algorithms addressing different applications such as image denoising, deblurring, super-resolution and sharpening [11, 34, 35, 36, 37, 38, 39, 40]. Block Matching and 3D Filtering (BM3D) is a state-of-the-art realization of the SBCF technique. BM3D is widely adopted as a generic natural image model and as a quality benchmark for subsequent image restoration techniques. Accordingly, this dissertation studies, develops and optimizes software implementations of BM3D for commodity architectures, showing that their performance is unsatisfactory. Section 2.2.1 overviews how SBCF processes an input image while Section 2.2.2 highlights BM3D, its applications and acceleration attempts.

2.2.1 Similarity-Based Collaborative Filtering (SBCF)

The technique exploits the fact that a natural image exhibits repeated patterns, or patches, in close proximity to produce a high-quality reconstruction of a given distorted patch. Many recent works adopted the SBCF technique to boost the output quality for images and videos [11, 13, 34, 35, 36, 37, 38, 39, 40]. Typically, an SBCF-based algorithm processes an input image as overlapping patches where, for each patch, three processing steps are performed: 1) searching for other nearby patches that are similar to the patch at hand, 2) collaboratively passing the found similar patches through a filter implementing the target output effect, then 3) accumulating weighted versions of the filtered patches back to their original locations in the output image. Various implementations differ in the way they estimate patch similarity, whether processing is done in the color domain or some transform domain, and the filter design. SBCF is a generic natural image model that facilitates a wide set of CI applications including denoising, deblurring, sharpening, super-resolution and in-painting. We next overview BM3D, a family of algorithms that realizes SBCF and achieves state-of-the-art output quality in many CI applications.
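The three steps above can be sketched in code. The following is a minimal illustrative sketch, not an implementation from this dissertation: the collaborative filter is reduced to plain averaging of the matched patches, and all names and parameter values (sbcf_denoise, patch, search, n_matches) are hypothetical.

```python
import numpy as np

def sbcf_denoise(img, patch=4, search=8, n_matches=4):
    """Minimal SBCF sketch over a single-channel image: for each reference
    patch, 1) search a window for the most similar patches, 2) filter them
    collaboratively (here: plain averaging), 3) accumulate the result back
    into the output with uniform weights."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=np.float64)
    weight = np.zeros_like(img, dtype=np.float64)
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            ref = img[y:y + patch, x:x + patch]
            # Step 1: scan a search window around (y, x) for similar patches.
            cands = []
            for dy in range(max(0, y - search), min(h - patch, y + search) + 1):
                for dx in range(max(0, x - search), min(w - patch, x + search) + 1):
                    p = img[dy:dy + patch, dx:dx + patch]
                    cands.append((float(np.sum((p - ref) ** 2)), p))
            cands.sort(key=lambda t: t[0])  # smallest L2 distance first
            group = np.stack([p for _, p in cands[:n_matches]])
            # Step 2: collaborative filtering (trivially, averaging the group).
            filtered = group.mean(axis=0)
            # Step 3: weighted accumulation into the output image.
            out[y:y + patch, x:x + patch] += filtered
            weight[y:y + patch, x:x + patch] += 1.0
    return out / np.maximum(weight, 1.0)
```

A real SBCF algorithm such as BM3D would replace the averaging with transform-domain filtering and use similarity-derived aggregation weights.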

2.2.2 Block Matching and 3D (BM3D) Filtering

BM3D is a realization of the SBCF technique that was initially proposed for image denoising and then adopted as a generic natural image model used either directly or as a regularization prior for other CI applications [11, 28, 29, 30, 31, 32, 33, 151]. The technique is consistently considered a state-of-the-art quality benchmark for all subsequent application-specific algorithms, including analytical and learning-based methods [37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]. However, BM3D is known for its significant computational power needs. Thus, practical deployment is unlikely even on high-end commodity architectures such as CPUs with vector extensions and GPUs [10, 11]. Previous works proposed GPU implementations of BM3D that achieved one order of magnitude speedups over commodity CPUs [57, 152]. Tsai et al. proposed an approximate nearest-neighbor (ANN) search technique to tackle the block-matching step in order to improve GPU performance at little output quality loss [27]. However, high-end GPU performance still falls far short of that needed for real-time and user-interactive applications. We corroborate these results and conduct microarchitectural analysis of the execution bottlenecks on commodity platforms in Section 3.3. Hardware acceleration on Field-Programmable Gate Arrays (FPGAs) [58] and Application-Specific Integrated Circuits (ASICs) [59, 60] has been explored. However, those implementations used simplistic approximations of the computations involved along with relaxed algorithm configurations. More importantly, none of those proposals enabled BM3D to process HD frames in real time; they were restricted to low-resolution inputs.

Accordingly, this dissertation: 1) analyzes the performance bottlenecks of BM3D to judiciously address time-consuming parts of the algorithm, 2) proposes a novel software-hardware co-design optimization, Matches Reuse (MR), exploiting natural image characteristics to significantly reduce the computational cost of BM3D without sacrificing output quality, and 3) proposes IDEAL, a dedicated hardware BM3D accelerator that incorporates MR, enabling real-time processing of HD frames. Chapter 3 analyzes software implementations of BM3D in the context of image denoising, proposes IDEAL in basic and MR-optimized versions, and extends the accelerator to perform sharpening as well.

2.3 Learning-Based Computational Imaging

Machine Learning (ML) has recently achieved remarkable success in a wide range of applications, including computer vision tasks such as image classification and object segmentation [82, 83, 84, 85, 86, 87] as well as speech recognition [88]. Researchers have also exploited machine learning to propose novel computational imaging techniques that rival and sometimes outperform conventional analytical techniques in terms of output quality. These techniques include DNN models [62, 63, 153, 154, 155], learned natural image priors [156, 157], and dictionary learning and sparse coding [158, 159, 160].

This dissertation focuses on DNN-based computational imaging models (CI-DNNs) for several reasons: 1) DNNs are flexible models with large capacity to learn different applications, 2) the proposed DNN models achieve state-of-the-art output quality, 3) accelerating CI-DNNs overlaps with and might benefit a wider range of other DNN-based applications including classification, semantic segmentation and speech recognition, and 4) the plethora of previously proposed DNN accelerators targeting other applications is a good base to build upon towards DNN accelerator designs capable of efficiently running models from different application domains.

In Chapter 4, this dissertation characterizes a set of modern CI-DNNs performing different applications including denoising [46, 47, 48], joint demosaicking-denoising [49] and super-resolution [68]. Our analysis reveals strong spatial correlation between the activation values processed by such models. This runtime value characteristic is exhibited by every layer in the DNN. We propose Diffy, a DNN hardware accelerator that translates this correlation into reductions in the needed computation, communication and data storage compared to state-of-the-art DNN accelerators. Moreover, Diffy is a robust DNN accelerator as it also improves performance for DNN models from other application domains such as classification and segmentation.

2.4 Hybrid Analytical-Machine Learning Approaches

To achieve adequate representation capacity, an end-to-end CI-DNN tends to be deep, with a pipeline of tens of convolutional layers each of which applies tens to hundreds of filters. This requires many hyper-parameters to be set, such as the number of layers, filter sizes and learning rate. Thus, model training time and design space exploration are significantly long for such end-to-end models. More relevant to this dissertation is that the resulting trained model is typically compute- and memory-intensive, rendering practical deployment unlikely, especially on platforms with limited resources. To tackle these challenges, hybrid implementations have been proposed to combine the strengths of analytical and machine learning techniques. For instance, Yang et al. propose BM3D-Net [161] where they unfold the BM3D processing pipeline into: 1) a block-matching step that is performed as in conventional BM3D implementations, then 2) a lightweight convolutional neural network approximating the rest of the BM3D pipeline. The neural network comprises: 1) an “extraction” front-end layer corresponding to the 3D patch group formation, 2) three hidden layers performing convolution, a nonlinear transform, then one more convolution, which collectively implement the 3D Filtering (Shrinkage) step in BM3D, and finally 3) an “aggregation” layer performing the weighted sum updates to the output image.

While we do not study such hybrid approaches, we believe they can benefit from the techniques we propose in this dissertation as follows. Matches Reuse (MR), incorporated by IDEAL, can benefit the block-matching step by reusing the previous search results when possible and thus skipping most of the unnecessary computations. The spatial correlation among activations, exploited by Diffy, can benefit the lightweight network part of BM3D-Net if such correlation is strong enough among its runtime-calculated values. We leave investigating the benefits of our techniques for such hybrid approaches to future work.

2.5 Machine Learning

With the ever-increasing complexity of the tasks that compute machines need to perform, it is becoming harder for programmers to understand all the details of such tasks and explicitly program a computer to perform them. Machine learning paves a completely new way of computing where a machine is taught by example how to perform a task. The machine can then generalize what it has learned to solve new problem instances that it has never seen before. The programmer just needs to describe a trainable model, initialize its parameters with random values and finally train the model by iterating through a large set of training examples. Training is an optimization process where, with each example, the model parameters are updated such that the error between the actual output and the expected output is reduced [162]. Upon convergence to a local minimum, or luckily the global minimum, the average accuracy of the model will likely be sufficient to perform inference, i.e., to solve new unseen inputs. While the trainable model should be designed with enough capacity to capture the underlying features of the inputs and correlate these features with the expected outputs, overly sized models can lead to training problems such as overfitting, where the model turns into no more than an encoded dictionary of the training examples and fails to generalize to new unseen inputs [163].
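As a concrete toy example of this train-by-example loop, the sketch below fits a two-parameter linear model by repeatedly nudging randomly initialized parameters against the error on each training example. The model and all names are illustrative and not taken from the thesis.

```python
import random

def train_linear(examples, lr=0.05, epochs=500):
    """Toy training loop: fit y = w*x + b by stochastic gradient descent.
    Parameters start at random values and are updated per example so that
    the error between actual and expected output shrinks."""
    w, b = random.uniform(-1, 1), random.uniform(-1, 1)  # random init
    for _ in range(epochs):
        for x, y in examples:
            err = (w * x + b) - y   # actual output minus expected output
            w -= lr * err * x       # gradient step on each parameter
            b -= lr * err
    return w, b
```

On data drawn exactly from y = 2x + 1, the loop converges to w close to 2 and b close to 1 regardless of the random starting point.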

2.5.1 Neural Networks

Neural networks are one of many machine learning techniques and probably the most popular one today, with state-of-the-art accuracy in a variety of tasks including computer vision, speech recognition, machine translation and medical diagnosis. The technique is inspired by the biological neural networks that constitute the human brain. A neural network consists of a set of nodes called neurons connected by edges called synapses. Upon receiving input signals, or values, a neuron computes its output as a non-linear activation function of the sum of its weighted inputs. The neuron then communicates the output signal to the subsequent neurons connected to it through synapses. Each synapse has an associated weight, where all the weights of a neural network model are collectively called model parameters. Another type of model parameter is the bias, a per-neuron constant added to the weighted sum of the input values before applying the activation function. Typically, neurons are aggregated into layers arranged in a pipeline fashion where each layer feeds its outputs into the next layer.

Figure 2.2: Simple 2-layer neural network.

Figure 2.3: Deep Neural Network with several hidden layers.

A feed-forward neural network does not contain cycles, as feed-back connections are not permitted. However, recurrent neural networks, with in-network memory nodes and feed-back connections, are needed for applications such as speech recognition and natural language processing. A neural network is a trainable model where the model parameters are adjusted during the training process in order to eventually minimize the error between the output of the final layer and the expected output per training example.

Fig. 2.2 shows how neurons and synapses are connected in a simple 2-layer neural network with 4 input neurons and 2 output neurons. The input layer, the first layer in the pipeline, takes its inputs directly from the example data. The activation value for the j-th output neuron is then computed according to Eq. (2.1).

O_j = f\left( \sum_{i=1}^{N} A_i \times W_{i,j} + B_j \right) \quad (2.1)

where N is the number of input activations, A_i is the i-th input activation, W_{i,j} is the weight of the synapse connecting the i-th input neuron to the j-th output neuron, B_j is the bias associated with the j-th output neuron, and f is a non-linear activation function such as the Rectified Linear Unit (ReLU).
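Eq. (2.1) translates directly into code. The sketch below uses ReLU as the activation function f; the function and argument names are illustrative.

```python
def neuron_output(A, W_col, bias, f=lambda z: max(0.0, z)):
    """Compute one output activation per Eq. (2.1): a non-linear function f
    (ReLU by default) applied to the bias plus the weighted sum of the
    input activations A using the synapse weights W_col."""
    z = sum(a * w for a, w in zip(A, W_col)) + bias
    return f(z)
```

For example, with inputs [1, 2, 3, 4], weights [0.5, -1.0, 0.25, 0.0] and bias 1.0, the weighted sum is -0.75, so the pre-activation is 0.25 and ReLU leaves it unchanged.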

For complex tasks, a neural network would typically comprise several hidden layers, i.e., layers between the input and output layers. A network with one or more hidden layers is called a Multi-Layer Perceptron (MLP), with networks having more than one hidden layer generally called Deep Neural Networks (DNNs) as shown in Fig. 2.3. For clarity of the figure, we replace the individual synapse connections from one layer of neurons to the next with a single bold arrow labeled W_i. Generally, increasing the number of hidden layers increases model capacity and thus its ability to perform more sophisticated tasks. However, beyond some depth, adding more layers might have a negative effect on output accuracy due to overfitting [163, 164].

Figure 2.4: One convolution layer of a CNN.

2.5.2 Convolutional Neural Networks

Typically, the layers of an MLP are fully connected layers; that is, a layer is an all-to-all connection where all the neurons of a layer are connected to all the neurons of the subsequent layer. While this model architecture is needed for some applications, it is expensive in terms of the number of model parameters, leading to excessive memory and computation requirements. A Convolutional Neural Network (CNN) architecture tackles this problem by introducing convolutional layers instead of the fully connected layers. A convolutional layer needs a significantly lower number of parameters since: 1) only a localized subset of the input activations is used to compute an output activation, and 2) the weights are reused across different output activations. This leads to reduced memory and computation demands compared to a fully connected layer with the same number of input and output neurons. CNNs are typically used for processing multi-dimensional input data such as images to implement several applications including image classification, segmentation, computational imaging and object detection.
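The parameter savings can be made concrete with a small counting sketch; the function names and the example layer sizes below are illustrative, not from the thesis.

```python
def fc_params(n_in, n_out):
    """Fully connected layer: an all-to-all weight matrix
    plus one bias per output neuron."""
    return n_in * n_out + n_out

def conv_params(C, HF, WF, K):
    """Convolutional layer: K filters of size C x HF x WF plus one bias
    per filter; the same weights are reused at every window position."""
    return K * (C * HF * WF) + K
```

For a 3×32×32 input, a convolutional layer with 64 filters of size 3×3×3 needs conv_params(3, 3, 3, 64) = 1,792 parameters, while a fully connected layer producing the same 64×30×30 = 57,600 output neurons would need fc_params(3072, 57600) = 177,004,800.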

Fig. 2.4 shows how convolution is applied in one CNN layer, where the layer comprises a number of 3D convolution filters. Each filter is applied in a sliding window fashion along the 3D input activations (such as an input image with 3-channel depth) with some stride in the horizontal and vertical dimensions. The application of a filter produces a 2D output feature map. The output feature maps produced by the filters are then stacked along the depth dimension to form a 3D feature map that is fed as input to the next layer. Specifically, assuming an input feature map with dimensions C×H×W (channels, height, and width), K 3D filters of size C×H_F×W_F, and a sliding window with stride S, the output feature map is a 3D array of activations of size K×H_O×W_O where the output height is:

H_O = (H − H_F)/S + 1 \quad (2.2)

and the output width is:

W_O = (W − W_F)/S + 1 \quad (2.3)
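Eqs. (2.2) and (2.3) translate into a small helper (illustrative names; no padding is assumed, matching the formulas above):

```python
def conv_output_shape(C, H, W, K, HF, WF, S):
    """Output feature-map shape (K, HO, WO) of a convolution layer per
    Eqs. (2.2) and (2.3), for a C x H x W input, K filters of size
    C x HF x WF, and stride S, with no padding."""
    HO = (H - HF) // S + 1
    WO = (W - WF) // S + 1
    return (K, HO, WO)
```

For instance, a 3×224×224 input convolved with 64 filters of size 3×7×7 at stride 2 yields a 64×109×109 output.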

Each output activation is the inner product of a filter and a window, a sub-array of the 3D input feature map with the same size as the filter. Assuming a, o and w_n are respectively the input, the output and the n-th filter, the output activation o(n, y, x) is computed as the following inner product:

o(n, y, x) = \sum_{c=0}^{C-1} \sum_{j=0}^{H_F-1} \sum_{i=0}^{W_F-1} w_n(c, j, i) \times a(c,\, j + y \times S,\, i + x \times S) \quad (2.4)

After the inner product in Eq. (2.4) is computed, a bias is added and then a non-linear activation function is applied to compute the output activation.

An output activation is computed using only the input activations in the corresponding window. In addition, the weights of a filter are reused across all the input windows. This leads to a more compact memory footprint for the model parameters as well as considerably fewer computations compared to fully connected layers.
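The full layer computation of Eq. (2.4), with the bias and ReLU applied afterwards, can be sketched as a naive reference implementation. Names are illustrative; real frameworks and accelerators use far more efficient schedules than these explicit loops.

```python
import numpy as np

def conv_layer(a, w, biases, S=1):
    """Naive convolution layer per Eq. (2.4): a is C x H x W, w is
    K x C x HF x WF. Each output o(n, y, x) is the inner product of
    filter n with the input window at (y*S, x*S), plus that filter's
    bias, followed by ReLU."""
    K, C, HF, WF = w.shape
    _, H, W = a.shape
    HO, WO = (H - HF) // S + 1, (W - WF) // S + 1   # Eqs. (2.2), (2.3)
    o = np.zeros((K, HO, WO))
    for n in range(K):
        for y in range(HO):
            for x in range(WO):
                window = a[:, y * S:y * S + HF, x * S:x * S + WF]
                o[n, y, x] = max(0.0, float(np.sum(w[n] * window)) + biases[n])
    return o
```

On a 1×3×3 input of ones with a single 1×1×2×2 filter of ones and zero bias, every output is the sum of a 2×2 window, i.e. 4.0, in a 1×2×2 map.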

2.6 Concluding Remarks

Computational imaging is essential for all devices equipped with imaging sensors. A computational imaging pipeline performs a set of processing steps on the raw output of an imaging sensor, including denoising, sharpening, demosaicking and up-sampling. For decades, analytical solutions have dominated computational imaging applications, where algorithms have been specifically engineered to perform CI tasks. However, recently proposed DNNs performing CI tasks took over the field and surpassed the output quality of engineered algorithms. In this thesis, we study and accelerate state-of-the-art CI implementations covering both the analytical approach and DNN-based computational imaging. The DNN implementations of computational imaging are typically fully convolutional CNNs where all the layers are convolution layers. A convolution layer comprises a set of 3D filters that are convolved with a 3D input feature map in a sliding window fashion to produce a 3D output feature map. Understanding how convolution works in CNNs is essential to motivate our proposed computational imaging DNN accelerator in Chapter 4.


Chapter 3

IDEAL: Custom Image Denoising Accelerator

Computational imaging pipelines (CIPs) convert the raw output of imaging sensors into the high-quality images that are used for further processing. This work studies how Block-Matching and 3D filtering (BM3D), a state-of-the-art denoising algorithm, can be implemented to meet the demands of real-time and user-interactive applications. Denoising is the most computationally demanding stage of a CIP, accounting for more than 95% of the processing time of a highly-optimized software implementation [10]. We analyze the performance and energy consumption of optimized software implementations on three commodity platforms and find that their performance is unsatisfactory for practical applications.

Accordingly, we consider two acceleration alternatives: 1) developing a dedicated custom hardware accelerator, and 2) running recently proposed Neural Network (NN) based approximations of BM3D [62, 74] on an NN accelerator. We propose the Image DEnoising AcceLerator (IDEAL) [165], a hardware BM3D accelerator which incorporates the following techniques: 1) a novel software-hardware optimization, Matches Reuse (MR), that exploits typical image content to reduce the computations needed by BM3D, 2) prefetching and judicious use of on-chip buffering to minimize execution stalls and off-chip bandwidth consumption, 3) a careful arrangement of specialized computing blocks to maximize hardware reuse, and 4) data type precision tuning to reduce hardware overhead. Over a dataset of images with resolutions ranging from 8 megapixels (MP) up to 42 MP, we show that IDEAL is 11,352× and 591× faster than software implementations running on high-end general-purpose (CPU) and graphics (GPU) processors. Moreover, IDEAL is orders of magnitude more energy efficient. Even compared to NN approximations of BM3D running on DaDianNao [73], a server-class hardware NN accelerator, IDEAL is 5.4× faster and 3.95× more energy efficient.

3.1 Introduction

Numerous applications, such as those in medical imaging, film production, automotive, and robotics, use imaging sensors (IS) to convert light to signals appropriate for further processing and storage by digital devices such as smartphones, desktop computers, digital cameras, and embedded systems. IS output is far from perfect and requires significant processing in the digital domain to yield acceptable results [91, 92]. For example, lens imperfections result in distorted output, while sensor imperfections such as non-uniform sensitivity may yield output that is underexposed or overexposed in places, and thus missing crucial information, or that contains other artifacts.



Computational Imaging (CI) is the processing in the digital domain of IS output to compensate for these limitations.A Computational Imaging Pipeline (CIP) comprises a sequence of processing steps implementing CI.

An essential step in practically all CIPs is denoising, a process that aims to improve the signal-to-noise ratio in the output image frame. The state-of-the-art denoising algorithm is Block-Matching and 3D filtering (BM3D) [11], as it delivers the highest known image quality compared to other techniques [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. However, this superior quality comes at a significant computational cost. Previous software implementations on high-end general purpose and graphics processors found that BM3D is impractical to deploy even after intensive software optimization efforts [10]. We corroborated this observation by profiling a typical front-end CIP, the process of converting the raw sensor signal into a typical image representation (e.g., RGB color channels), that was developed by Heide et al. [10]. Out of the 184 seconds needed to process a raw 2MP image on a high-end CPU, 95% is devoted to BM3D denoising. Further, the profiling shows that BM3D is compute bound, meaning that the compute resources on commodity hardware are not sufficient to perform the needed computations within the time limits of real-time or even user-interactive applications.

User-interactive applications expect processing to take at most half a second, if not less [50, 51, 52, 53], while real-time computer vision (CV) applications expect much shorter processing times. Mobile phones and photo cameras incorporate a wide range of image-capturing user-interactive applications, while Advanced Driver-Assistance Systems (ADAS) incorporate camera-driven real-time CV components for scene understanding and multi-object segmentation and tracking. The resolution of an ADAS camera dictates the distance at which a given object can be detected, while the frame rate dictates system responsiveness for critical functions such as the vehicle's stopping distance [166]. While current ADAS use 1MP cameras at 10-15 frames per second (FPS), next generation systems will use 2MP at 30 FPS and soon thereafter 8MP [166, 167]. Such systems require an accelerated, reliable and power-efficient CI processing pipeline. Finally, video capturing applications need to denoise raw video frames in real-time before encoding. The denoised frames require substantially less compression and therefore lead to significant bandwidth savings, as denoising itself can be considered a compression mechanism [147]. Accordingly, this dissertation explores energy-efficient acceleration of BM3D, enabling practical deployment for user-interactive and real-time applications running on energy- and compute-constrained devices.

We consider the following alternatives: 1) optimized software implementations running on commodity hardware, including a high-end and an embedded general purpose processor and a graphics processor, 2) a dedicated custom hardware accelerator, and 3) running recently proposed NN-based approximations of BM3D on a high-end NN hardware accelerator.

Unfortunately, performance with the software implementations falls far short of the needs of user-interactive applications. Accordingly, we present IDEAL, a dedicated hardware BM3D accelerator that allows for the first time a BM3D variant to be used for user-interactive applications. IDEAL incorporates the following techniques: 1) a novel software-hardware optimization, Matches Reuse (MR), that exploits typical image content to reduce the computations needed by BM3D without sacrificing quality, 2) prefetching and judicious use of on-chip buffering to minimize execution stalls and off-chip bandwidth consumption, 3) a careful arrangement of specialized computing blocks to maximize hardware reuse, and 4) data type precision tuning to minimize hardware footprint.

Recent work has shown that it is possible to approximate BM3D using Deep Neural Networks (DNNs) [62, 74]. A DNN approach has the advantage of being less rigid, as a DNN accelerator could be used for other applications as well. Unfortunately, we find that further execution time improvements are needed to meet the requirements of user-interactive applications even when a high-end server-class DNN accelerator is used. This is the motivation for our next contribution, Diffy, detailed in Chapter 4. Diffy is a value-aware DNN accelerator that improves performance and energy efficiency over state-of-the-art DNN accelerators and closes the performance gap between IDEAL, as a specialized BM3D accelerator, and DNN accelerators running computational imaging neural networks.

In summary, the contributions and findings of this part of the dissertation are:

• We characterize the performance and energy consumption of highly optimized software implementations of BM3D for three commodity platforms: 1) a high-end general purpose processor with vector extensions (CPU), 2) a desktop graphics processor (GPU), and 3) a general purpose processor targeting embedded systems. We find that these BM3D implementations fall far short of the requirements of user-interactive applications even for High Definition (HD) frames (roughly 2MP resolution).

• We propose IDEAL, a dedicated hardware accelerator for BM3D. The design incorporates a novel software-hardware co-optimization that takes advantage of the common case, reducing the amount of computation performed. Experiments using cycle-accurate simulation and synthesis results demonstrate that IDEAL makes it possible to process image frames of up to 42MP for user-interactive applications within a limited energy budget. On average, IDEAL is 11,352× and 591× faster than the CPU and the GPU implementations respectively, and 4 and 3 orders of magnitude more energy efficient respectively. We also show that IDEAL can be modified with little effort to support another CI application, sharpening.

• We characterize the performance of a high-end state-of-the-art DNN accelerator [73] running two recent NN implementations. While far faster than software implementations, we show that the performance still falls short of user-interactive processing requirements. Specifically, IDEAL is found to be at least 5.4× faster and 3.95× more energy efficient than the better of the two NN alternatives.

While this work focuses on using BM3D for denoising, BM3D has many more applications as it is an instance of the Similarity-Based Collaborative Filtering (SBCF) technique, a state-of-the-art filtering technique that can implement a wide variety of CI building blocks. For example, BM3D can be used for image sharpening, deblurring, up-sampling, video denoising, and super-resolution just by changing the filter it implements [11, 28, 29, 30, 31, 32, 33]. Thus, investigating the architectural support necessary for BM3D can be valuable for other CI building blocks. Moreover, the MR optimization, applied here to BM3D, can be applicable to a wide range of SBCF-based algorithms as illustrated in Section 3.5.1.

The rest of this chapter is organized as follows: Section 3.2 reviews the BM3D denoising algorithm. Section 3.3 analyzes software implementations of BM3D. Section 3.4 presents a basic accelerator design, IDEAL_B, while Section 3.5 further refines the design into IDEAL_MR by introducing the MR technique and other optimizations. Section 3.6 evaluates the performance and energy consumption, and where appropriate hardware area footprint, of IDEAL compared to software implementations and hardware acceleration alternatives. Section 3.7 illustrates how IDEAL can be extended to implement additional CI functionality via an example. Finally, Section 3.8 reviews related work and Section 3.9 concludes.

3.2 BM3D Denoising

BM3D treats the input 3-channel (Red-Green-Blue) image as a grid of overlapping sub-images, or patches, with some stride of P_s pixels. BM3D transforms the input image as follows: for each patch, BM3D searches through a surrounding window of N_s×N_s pixels, with a stride of S_s, for the best set of P similar patches given some similarity metric. The P matching patches, including the reference patch, are then all transformed and the results affect all the corresponding output image patches. Accordingly, an output pixel might receive multiple updates from overlapping patches.

Figure 3.1: Block diagram of BM3D processing flow: (a) BM3D processing stages, (b) Block-matching step, (c) Denoising step.

As Fig. 3.1a shows, BM3D denoising processes the input image in two stages: a) Hard-Thresholding, and b) Wiener Filtering. Each stage comprises two steps which are almost identical across the two stages: 1) Block Matching (Fig. 3.1b), and 2) Denoising (Fig. 3.1c). This results in four steps: Block Matching #1 (BM1), Denoising #1 (DE1), Block Matching #2 (BM2), and Denoising #2 (DE2). The input is a raw image from an imaging sensor. We first describe BM1 and DE1, the steps of the “Hard-Thresholding” stage, followed by the modifications needed by the “Wiener Filtering” stage to implement BM2 and DE2. The section then comments on how easy it is to use BM3D for other applications and concludes by detailing the computations performed by the BM3D building blocks.

Block-Matching #1: As Fig. 3.1b shows, BM1 searches an area of N_s×N_s pixels centered at the reference patch for similar patches. Patches are typically 4×4 pixels. According to Heide et al. [10], the following configuration provides the best quality: reference patch stride P_s = 1, search patch stride S_s = 1 and search window size N_s = 49 (39 for BM2). Block-matching uses only the first channel. Accordingly, for each reference patch, 49×49 = 2401 candidate patches are processed in BM1.

In detail, BM1 operates as follows: a) For every reference patch Pref, the surrounding search area is read patch by patch and transformed to the frequency domain through the Discrete Cosine Transform (DCT) (Path A in Fig. 3.1b). The search includes and starts with Pref itself. b) The DCT patches are hard-thresholded: coefficients that are below a preset threshold Tht are eliminated. c) The Euclidean distance (l2-Norm) between


CHAPTER 3. IDEAL: CUSTOM IMAGE DENOISING ACCELERATOR 21

every patch and Pref is calculated. If the distance is below a certain threshold Tmatch, the patch is reported as a match of Pref to the next module. d) The 3D Block Formation step keeps the 16 patches with the least distance from Pref, including Pref itself. At the end, the closest 16 patches are sorted according to their distance and arranged as a 3D stack of patches that is fed to DE1. The stack contains the patch coordinates only. BM3D uses 16 best matches per reference patch as this many were found to be sufficient to compensate for Additive White Gaussian Noise (AWGN) even with a high standard deviation (σ) of up to 75 [12].
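The BM1 flow above can be restated as a minimal software sketch. The function and data-layout names below are ours (not the thesis's), and the candidate patches are assumed to have been DCT-transformed and hard-thresholded already:

```python
import heapq

def block_match(dct_patches, ref_coord, t_match, n_best=16):
    """Scan all candidate patches, keep those whose l2 distance to the
    reference is below Tmatch, and return the n_best closest, sorted
    by distance (the reference itself is always distance 0)."""
    ref = dct_patches[ref_coord]
    candidates = []
    for coord, patch in dct_patches.items():
        dist = sum((a - b) ** 2 for a, b in zip(patch, ref))
        if dist < t_match:                 # only close-enough patches qualify
            candidates.append((dist, coord))
    # keep only the coordinates of the n_best matches, as the 3D stack does
    return heapq.nsmallest(n_best, candidates)

# toy run with three 2-element "patches"
patches = {(0, 0): [1.0, 1.0], (0, 1): [1.1, 1.0], (5, 5): [9.0, 9.0]}
best = block_match(patches, (0, 0), t_match=4.0)
# best[0] is the reference itself; (5, 5) is filtered out by Tmatch
```

In the accelerator this scan is performed by the BM engines described in Section 3.4.1; the sketch only mirrors the algorithmic steps.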

Denoising Stage #1: As Fig. 3.1c shows, this step processes the 3D patch stacks produced by BM1 and updates the corresponding output image patches. Each of the 3 color channels is processed separately. The processing proceeds as follows: a) Saving the DCT of channel 1 patches in BM1 avoids recomputing it for the stack patches in DE1 (Path C in Fig. 3.1c). For channels 2 and 3, stack patches are passed through the DCT. b) The 3D stack in the DCT domain is read in vectors along the depth dimension (z-dimension) as input to the Haar transform module. c) The Spectrum Shrinkage module uses hard-thresholding to eliminate those Haar coefficients that are below a preset threshold Thard. d) The number M of non-zero coefficients in the entire 3D block is counted. e) The inverse Haar and inverse DCT restore the 3D block of patches to the color domain. f) Each restored patch is weighted by 1/M before being accumulated to its original location in the output image.
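Steps (c), (d) and (f) of this flow can be illustrated with a short sketch; the function names are ours and the data is a toy example:

```python
def hard_threshold(coeffs, t_hard):
    """Step (c): zero Haar coefficients below the threshold; step (d):
    count the non-zero survivors M."""
    shrunk = [c if abs(c) >= t_hard else 0.0 for c in coeffs]
    return shrunk, sum(1 for c in shrunk if c != 0.0)

def accumulate(output, patch, weight):
    """Step (f): weight a restored patch by 1/M and accumulate it into
    the output buffer (both given as flat pixel lists)."""
    return [o + weight * p for o, p in zip(output, patch)]

coeffs, m = hard_threshold([5.0, 0.2, -3.0, 0.1], t_hard=1.0)
# coeffs -> [5.0, 0.0, -3.0, 0.0], m -> 2
out = accumulate([0.0, 0.0], [4.0, 8.0], weight=1.0 / m)
# out -> [2.0, 4.0]
```

The 1/M weighting favors stacks that were shrunk to few coefficients, i.e. stacks whose matches were mutually consistent.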

Wiener Filtering Stage: The differences in the Wiener filtering stage are as follows: a) The search window is smaller, with Ns set to 39. b) In BM2, searching for the best matches is done in the color domain instead of the DCT domain (Path B in Fig. 3.1b). c) Since channel 1 patches bypass the DCT in BM2, the needed channel 1 patches go through the DCT in DE2 (Path D in Fig. 3.1c) like the other channels. d) Spectrum shrinkage in DE2 implements a “Wiener Filter” that attenuates the Haar coefficients instead of applying a hard threshold.
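The attenuation formula is not spelled out in this section; in the standard BM3D formulation the second stage applies an empirical Wiener weight derived from the first-stage (hard-threshold) estimate, which can be sketched as follows (names ours):

```python
def wiener_shrink(noisy_coeffs, basic_coeffs, sigma):
    """Attenuate each transform coefficient by w = b^2 / (b^2 + sigma^2),
    where b is the matching coefficient of the first-stage estimate:
    strong coefficients pass almost unchanged, empty ones are damped."""
    return [n * (b * b) / (b * b + sigma * sigma)
            for n, b in zip(noisy_coeffs, basic_coeffs)]

shrunk = wiener_shrink([4.0, 1.0], [4.0, 0.0], sigma=2.0)
# first coefficient scaled by 16/20 = 0.8; second fully suppressed
```

This is a sketch of the usual empirical Wiener shrinkage, not necessarily the exact variant implemented in the accelerator.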

Other BM3D Applications: The Haar-transformed stack of patches is a sparse-coded representation that can be used to implement not only denoising [11], but also other CI applications such as sharpening [28], deblurring [29] or up-sampling [31]. A combined implementation can achieve more than one effect at once [30]. This is achievable by modifying/adding filters in the DE step with the rest of the pipeline remaining as-is. The BM3D class of algorithms has also been extended beyond the imaging domain to video processing, including denoising [32] and super-resolution [33].

3.2.1 Computational Blocks

The main computational blocks of BM3D are: DCT, Haar transform, l2-Norm (distance), inverse Haar transform, and inverse DCT.

DCT and Inverse DCT: The DCT consists of a 1D DCT along the rows of the input patch, a transpose of the output, followed by one more 1D DCT applied along the rows. The 1D DCT is a matrix multiplication of the patch by a matrix of constant coefficients. For a 4×4 patch, this matrix multiplication involves 64 multiplications and 64 additions. The computation of the DCT of a patch P follows this equation:

P_DCT = C (C P)^T    (3.1)

where C is the transform coefficients matrix and (·)^T is the transpose operator. The inverse DCT uses the same computation but with the transpose of the transform coefficients matrix.
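Eq. (3.1) and its inverse can be checked with a small round-trip example. A 2×2 orthonormal DCT matrix keeps the sketch short (the accelerator operates on 4×4 patches); all helper names are ours:

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def dct2(P, C):
    """Eq. (3.1): P_DCT = C (C P)^T, i.e. a row-wise 1D DCT, a
    transpose, and another row-wise 1D DCT."""
    return matmul(C, transpose(matmul(C, P)))

def idct2(D, C):
    """The same computation with the transposed coefficient matrix."""
    return dct2(D, transpose(C))

# 2x2 orthonormal DCT matrix
s = 1.0 / math.sqrt(2.0)
C2 = [[s, s], [s, -s]]
P = [[1.0, 2.0], [3.0, 4.0]]
restored = idct2(dct2(P, C2), C2)   # round-trips back to P
```

For an N×N patch each 1D pass is an N×N by N×N matrix product, i.e. N³ multiplications, which gives the 64 multiplications per pass quoted above for N = 4.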

Haar and Inverse Haar Transforms: These are 1D transforms along the z-dimension of the stacked 16 best matches of a reference patch Pref. The transform is a matrix-vector multiplication of a 16×16 matrix containing constant coefficients by each 16-element vector along the z-dimension. Each such multiplication entails 256 scalar multiplications and 256 scalar additions.
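The same matrix-vector structure can be shown with a 4-point sketch (the hardware applies the 16-point version along the z-dimension). The constant matrix's zeros and repeated power-of-two entries are exactly what the accelerator later exploits to share resources:

```python
import math

s = 1.0 / math.sqrt(2.0)
# 4-point orthonormal Haar matrix; note the zeros and repeated entries
H4 = [[0.5,  0.5,  0.5,  0.5],
      [0.5,  0.5, -0.5, -0.5],
      [s,   -s,   0.0,  0.0],
      [0.0,  0.0,  s,   -s]]

def matvec(M, v):
    # a naive NxN matrix-vector product costs N*N multiplications,
    # which for N = 16 gives the 256 scalar multiplications cited above
    return [sum(m * x for m, x in zip(row, v)) for row in M]

coeffs = matvec(H4, [1.0, 1.0, 1.0, 1.0])              # a constant stripe
inverse = matvec([list(r) for r in zip(*H4)], coeffs)  # transpose = inverse
```

For the constant stripe, all the energy collapses into the first coefficient, illustrating why well-matched (mutually similar) stacks become sparse after the transform.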


[Figure 3.2 plots runtime (seconds, 0 to 8000) against image resolution (MegaPixel, 0 to 16) for the Orig, Basic, AVX Vect and ARM Vect implementations.]

Figure 3.2: CPU runtime for images up to 16MP.

l2-Norm: The l2-Norm (distance) between two H×H patches P1 and P2 needs H² subtractions, H² multiplications and H² additions. The computation implements this equation:

distance = Σ_{i=0}^{H−1} Σ_{j=0}^{H−1} (P1(i, j) − P2(i, j))²    (3.2)
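Eq. (3.2) in code form, as an illustrative sketch (the function name is ours):

```python
def l2_distance(p1, p2):
    """Eq. (3.2): sum of squared differences over two HxH patches,
    costing H*H subtractions and H*H multiplications."""
    return sum((a - b) ** 2
               for row1, row2 in zip(p1, p2)
               for a, b in zip(row1, row2))

d = l2_distance([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 4.0]])
# (2-0)^2 + (3-0)^2 = 13
```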

3.3 Software Implementations

This section justifies the need for accelerating BM3D by analyzing the execution time of optimized software implementations on commodity hardware. The analysis highlights the bottlenecks that should be targeted by a custom hardware accelerator.

3.3.1 CPU Implementation

The original reference implementation of BM3D is highly optimized using Intel’s Threading Building Blocks and Intel’s Math Kernel Library [168]. On average, its Instructions Per Cycle (IPC) throughput is 2.45 on a processor with an ideal IPC of 4. Unfortunately, as the reference implementation is only available in pre-compiled binary form, reverse engineering would have been needed to associate performance events with code and data structures. Thus, we implemented BM3D in C++ exploiting the Intel AVX extended vector instruction set to match or exceed the performance of the reference implementation on a high-end Intel Xeon E5-2650 v2 (Section 3.6.1 describes our methodology). We also developed a vectorized implementation optimized for the ARM Cortex-A15 mobile processor. Fig. 3.2 reports the processing time in seconds for images of different resolutions up to 16MP for the reference (Orig), non-vectorized Xeon (Basic), vectorized Xeon (AVX Vect) and vectorized ARM (ARM Vect) implementations. The performance of AVX Vect almost matches that of Orig. The best runtime for moderate-resolution images is far beyond the acceptable range; a 16MP image takes 1400 seconds to process on the Xeon, whereas the ARM implementation is 5.2× slower on average. Fig. 3.3a shows the processing time for a wider range of input resolutions, and only for the vectorized Xeon implementation. Given that user-interactive applications require processing times below half a second, performance is far below that needed even for low-resolution images.


Table 3.1: Microarchitectural breakdown of CPU runtime.

Retiring cycles           62.4%
Front-end stalls           4.1%
Mis-speculation stalls     5.4%
Back-end (Memory) stalls   5.5%
Back-end (Core) stalls    22.8%

Microarchitectural Analysis

Using Intel’s VTune profiling tools [169], we analyzed the microarchitectural behavior of our vectorized Xeon implementation. Table 3.1 shows that instructions are successfully retired in 62.4% of the total processor cycles. Front-end stalls and mis-speculation stalls are as low as 4.1% and 5.4% respectively. Back-end stalls account for 28.3% of the total cycles, but only 5.5% of the total is due to memory while the remaining 22.8% are compute-core-related stalls. This breakdown demonstrates that BM3D running on commodity CPUs is compute-bound: the bottleneck is the computational resources, not stalls due to memory accesses or other causes. Our implementation achieved a high IPC of 2.7 on a machine with an ideal IPC of 4.

3.3.2 GPU Implementation

We implemented BM3D in CUDA and analyzed its runtime on a high-end NVIDIA GTX 980 with 4GB of memory. The implementation uses an accurate nearest neighbor search for block-matching and divides the image into tiles to fit the intermediate results in the 4GB memory. Tiling was also used to exploit the faster, but smaller, on-chip shared memory. Fig. 3.3b shows the runtime for different image resolutions. While the GPU implementation is much faster than the CPU one, its performance remains unsatisfactory for user-interactive and real-time applications. For example, processing a 16MP image takes 86 seconds and a 42MP image takes 226 seconds. Heide et al. [10] report that using the approximate nearest neighbor search technique proposed by Tsai et al. [27] for block-matching improves GPU performance by 4×. We did not implement this modification as performance would still be unsatisfactory.

[Figure 3.3 plots runtime (seconds) against image resolution (MegaPixel, 5 to 45): (a) CPU, up to 5000 s; (b) GPU, up to 250 s.]

Figure 3.3: CPU and GPU runtime for images up to 42MP.

3.3.3 Execution Time Breakdown

Fig. 3.4 shows a per-algorithm-step breakdown of the runtime on the Xeon and the GTX 980. Recall that BM3D entails two stages: Hard-threshold filtering followed by Wiener filtering. Each stage consists of the following


[Figure 3.4 is a stacked bar chart of the fraction of runtime (0% to 100%) spent in DCT1, BM1, DE1, BM2, DCT2 and DE2 for the vectorized CPU and the GPU implementations.]

Figure 3.4: Runtime breakdown for the CPU and GPU implementations.

Figure 3.5: Block diagram of the basic accelerator (IDEALB).

steps: computing the DCT transformation of all possible patches (DCTx), block-matching (BMx) to find the best 16 matches of each reference patch, and finally the actual denoising (either hard-threshold filter or Wiener filter – DEx). As Fig. 3.4 shows, the block-matching step is the bottleneck as it searches through 49×49 patches (39×39 for the Wiener stage) for every reference patch. This step accounts for 67% of the CPU runtime combined over both stages. On the GPU, the BM step dominates even more, taking 87% of the runtime combined over the two stages. The accurate nearest neighbor search requires some synchronization among GPU threads to exchange information about the best matches found so far. This limits the amount of concurrency that can be extracted on the GPU for block-matching.

In order to reduce processing latency to acceptable levels, all steps need to be accelerated in hardware. However, the breakdown above can guide the design process to judiciously partition the power and area budgets among the pipeline stages.

3.4 IDEAL

We first describe a basic accelerator configuration (IDEALB) for BM3D denoising, which we later enhance to meet the performance requirements of real-time and user-interactive applications. We refer to the four main processing steps of BM3D detailed in the beginning of Section 3.2: BM1, DE1, BM2 and DE2. Fig. 3.5 shows the main components of IDEALB: 1) the DCT engine (DCTE), 2) the Patch Buffer (PB), 3) the 16 Block-Matching Engines (BMEs), and 4) the Denoising Engine (DEE). The dashed boxes in Fig. 3.5 and those shown in Fig. 3.1b and Fig. 3.1c provide the mapping of the algorithmic blocks to IDEALB components. Instead of directly mirroring the BM3D pipeline in hardware, IDEALB uses only as many resources as necessary to keep all units as busy as possible and to maximize hardware reuse. Since the bulk of the computation happens in the BMEs, a single DCTE and a single DEE are sufficient to sustain the throughput of the 16 BMEs. The PB stores channel 1 patches in the frequency domain for step BM1 (Path A in Fig. 3.5) or in the color domain for step BM2 (Path B). The PB also feeds the DEE with channel 1 DCT patches along Path C for step DE1, and along Path D through the DCTE then Path E for step DE2. We design the DCTE to perform both the DCT and the inverse DCT, reusing internal resources for both. It accepts jobs from three queues: a) the BM Patch Queue (QBMP) along Path A to compute channel 1 DCT patches for step BM1, b) the Denoiser DCT Queue (QD) to compute the DCT of channels 2 and 3 before step DE1 and channels 1, 2 and 3 before step DE2, and c) the Denoiser Inverse DCT Queue (QiD) to compute the inverse DCT of the denoised patches after each of the DE1 and DE2 steps.

Figure 3.6: Block-matching (BM) engine.

Processing starts by fetching channel 1 patches from memory into QBMP. The patches either go through the DCTE along Path A for step BM1 or bypass it along Path B for step BM2, before being stored in the PB. The 16 BMEs concurrently process 16 adjacent reference patches, one per BME. The PB, detailed in Section 3.4.5, holds all the channel 1 patches of an area covering the search windows for these 16 reference patches. Every cycle, one patch is read out of the PB and is broadcast to all the BMEs. Each BME calculates the distance between this patch and its assigned reference patch and keeps a list of the 16 best matching patches. Once the search window is exhausted, each BME enqueues the coordinates of the 16 best matches into the Denoising Jobs Queue (QDJ), which feeds the DEE. The DEE then processes these jobs one at a time. For each job, the three color channels are processed one at a time. As channel 1 patches are already buffered in the PB, the DEE gets those needed for denoising from there along Path C for step DE1, and along Path D through QD and the DCTE then Path E for step DE2. For the two other channels, the patches are read from off-chip into QD, passed through the DCTE and then follow Path E. The DEE stacks the 16 best matching DCT patches, performs a Haar transform along the depth dimension, either hard-thresholds the resulting coefficients in case of step DE1 or attenuates them with a Wiener Filter in case of step DE2, and finally performs an inverse Haar transform. The resulting patches are then sent back along Path F and through QiD to the DCTE to transform them back to the color domain. At the very end, the patches are weighted and accumulated back to the output image in the main memory. The rest of this section details the individual processing components, illustrates design optimizations and motivates the improved IDEAL design.

3.4.1 Block-matching Engines

Fig. 3.6 shows the structure of the BM engine (BME) with its three main components: 1) a buffer to hold the assigned reference patch (RPB), 2) a Euclidean distance engine (EDE), and 3) a priority queue (MQ) that keeps the coordinates of the 16 best matches found so far, sorted according to the distance from the reference patch. Processing starts by reading a reference patch from the PB into the RPB and then continues as follows: 1) the BME reads the patches of the corresponding search window one at a time from the PB, 2) the EDE calculates the distance of the read patch from the reference patch saved in the RPB, and 3) if the distance is below a preset threshold Tmatch and is among


[Figure 3.7 shows 16 4×4 DCT patches feeding 16 parallel Denoising Lanes, each comprising a Haar Engine, a Spectrum Shrinkage Engine and an Inv-Haar Engine, producing 16 4×4 denoised DCT patches.]

Figure 3.7: Denoising (DE) engine.

the 16 shortest distances found so far, the coordinates of the patch are inserted into the MQ. The EDE uses 16 subtractors, 16 multipliers, and a 16-input adder tree to calculate the 4×4 patch distance according to Eq. (3.2) in one cycle.

3.4.2 Denoising Engine

Fig. 3.7 shows the DEE, which processes the 16 best matches found by a BME in the DCT domain. Those come from the PB directly for channel 1 while processing step DE1, from the PB through the DCTE for channel 1 while processing step DE2, or from off-chip through the DCTE otherwise. The 16-patch stack is split into stripes along the z-dimension. The stripes are each assigned to one of the 16 parallel Denoising Lanes (DeL). Each DeL comprises three pipelined stages: the Haar Transform Engine (HTE), the Spectrum Shrinkage Engine (SSE), and the Inverse Haar Transform Engine (iHTE). The HTE implements the Haar transformation, a matrix-vector multiplication of a 16×16 matrix of constant coefficients and the input stripe. The coefficients matrix is a sparse matrix with many repeated and/or power-of-two coefficients. Thus, we exploit the structure and the values of the coefficients matrix to reduce the resources needed for the HTE to only 32 multipliers, 10 2-input adders, four 4-input adder trees, and four 8-input adder trees. The optimized engine can perform the transform of one input stripe in parallel in one cycle. The SSE takes the 16-element Haar-transformed stripe and takes one cycle to either zero those elements that are below a preset threshold in case of step DE1, or to perform an element-wise multiplication by Wiener filter coefficients to attenuate noisy elements in case of DE2. Finally, the iHTE takes one cycle to perform a matrix-vector multiplication of the transpose of the Haar transform coefficients matrix by the filtered stripe. As in the HTE, we exploit the sparsity, power-of-2 values and repetitions of the elements of the transposed matrix to reduce the number of needed multipliers to 10, along with 16 5-input adder trees.

Assuming a 3-channel image and with the maximum number of best matches set to 16, the DEE processes only 16×3 = 48 patches per reference patch. Meanwhile, each BME processes 49×49 candidate patches per reference patch in case of step BM1, or 39×39 for step BM2. Thus, a single DEE can ideally sustain up to 39×39/48 = 31 BMEs. However, as Section 3.6.6 explains, we use 16 BMEs and one DEE for a practical implementation.
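This throughput-balance arithmetic can be restated as a short illustrative calculation (function name ours):

```python
def max_bmes_per_dee(ns, n_best=16, channels=3):
    """Candidate patches a BME scans per reference window (Ns*Ns),
    divided by the patches the DEE must denoise for that window
    (n_best matches * color channels)."""
    return (ns * ns) // (n_best * channels)

ratio = max_bmes_per_dee(39)   # 39*39 // 48 = 31, matching the text
```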


[Figure 3.8 depicts the overlapping search windows of two consecutive reference patches within the image; only the non-overlapping strip of the second search window needs to be read off-chip.]

Figure 3.8: Overlapping search windows for consecutive reference patches.

[Figure 3.9 plots the minimum, maximum and average normalized PSNR (0.7 to 1.1) for fraction precisions from 12-bit down to 7-bit.]

Figure 3.9: PSNR for different fraction precisions normalized to floating-point implementation.

3.4.3 Off-chip Bandwidth

Since the stride Ps between the reference patches is typically small, the search windows of consecutive reference patches are almost completely overlapping, as Fig. 3.8 shows. For example, assuming Ps = 1 and Ns = 49, out of the 49×49 patches in the search window of reference patch 1, the next search window can reuse 48×49 patches. Buffering these reusable patches in the on-chip PB has the following benefits: 1) It reduces off-chip accesses, since only the gray area in Fig. 3.8 needs to be read off-chip for each subsequent reference patch. Precisely, the number of pixels read off-chip for each subsequent search window is only (Ns + PD − 1)×Ps, where PD is the dimension of the patch. 2) It avoids recomputing the DCT several times for the search window patches during step BM1 and, thus, significantly reduces the pressure on the DCTE, allowing it to be reused for the two other job queues QD and QiD. This is achieved by buffering the patches after they are transformed through the DCTE (QBMP along Path A through the DCTE and the multiplexer to the PB in Fig. 3.5).
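The off-chip traffic saving can be quantified with a short illustrative calculation (names ours):

```python
def new_pixels_per_window(ns, pd, ps):
    """Pixels fetched off-chip per subsequent search window: a strip of
    Ps columns spanning the (Ns + PD - 1)-pixel-tall window."""
    return (ns + pd - 1) * ps

without_reuse = (49 + 4 - 1) ** 2             # full window: 2704 pixels
with_reuse = new_pixels_per_window(49, 4, 1)  # only 52 pixels
```

With Ns = 49, PD = 4 and Ps = 1, the PB cuts the per-reference off-chip traffic from a full 52×52-pixel window to a single 52-pixel strip.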

3.4.4 Reduced Precision Arithmetic

The BM3D software implementations typically use floating-point arithmetic. For hardware acceleration, using a fixed-point representation and reducing precision are well-known techniques with many applications in signal and image processing [170, 171]. Fig. 3.9 shows how the output image quality of BM3D changes for various fixed-point representations with the number of fractional bits ranging from 12 down to 7. The quality is measured using the


peak signal-to-noise ratio (PSNR) metric defined as follows:

PSNR = 10 × log10(peakValue² / MSE)    (3.3)

where peakValue is 255 given an 8-bit color channel depth and MSE is the mean squared error between the denoised image and the ground-truth version. MSE is defined as follows:

MSE = (1 / (C×W×H)) × Σ_{c=0}^{C−1} Σ_{x=0}^{W−1} Σ_{y=0}^{H−1} (I_denoised(c, x, y) − I_GT(c, x, y))²    (3.4)

where C, W and H are the number of channels, the width and the height of the image, and I_denoised and I_GT are the denoised image and the ground-truth version respectively.

For each precision configuration and over our image dataset, Fig. 3.9 shows the average, minimum, and maximum PSNR normalized to that of the floating-point implementation. Even with 10 fractional bits, the minimum relative PSNR is at least 98.9%. We configured IDEAL with 12 fraction bits to guarantee that the minimum image quality is within 99.9% of that of the floating-point implementation. For the integer part, we customize the number of bits along the pipeline stages to fit the dynamic range of values after each transformation. Assuming the input image has an 8-bit channel depth and considering how each transformation scales its input values, we used 11, 13 and 15 bits for the integer part of the DCT, Haar transform and inverse Haar transform values respectively.
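Eqs. (3.3) and (3.4), together with the fixed-point rounding being evaluated, can be sketched as follows. Function names are ours, and `quantize` is a generic round-to-nearest model, not necessarily the hardware's exact rounding:

```python
import math

def mse(img_a, img_b):
    """Eq. (3.4) over flat, already-interleaved pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)

def psnr(img, ground_truth, peak=255.0):
    """Eq. (3.3); assumes the two images differ somewhere (MSE > 0)."""
    return 10.0 * math.log10(peak * peak / mse(img, ground_truth))

def quantize(x, frac_bits):
    """Round-to-nearest fixed-point model with the given number of
    fractional bits; the quantization error is bounded by 2^-(bits+1)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

p = psnr([100.0, 102.0], [100.0, 100.0])   # MSE = 2 in this toy case
q = quantize(0.1234, 12)                   # 12 fraction bits, as in IDEAL
```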

3.4.5 Patch Buffer Configuration

The PB must serve multiple BMEs concurrently processing adjacent reference patches. A naïve PB implementation would use multiple ports, consuming more area and power. As the search windows of adjacent reference patches are almost entirely overlapping, a single-port PB provides adequate performance. Specifically, the buffered patches are read one at a time and broadcast to the BMEs, which use or copy the ones they need. At times a BME may need to stall waiting for the next necessary patch. This single-port PB degrades performance by only 12.5% on average compared to a multi-port PB. However, it significantly reduces area and power consumption.

For a grid of M×M BMEs, a search window size of Ns×Ns and a stride Ps between reference patches, the collective search area comprises (Ns + (M−1)×Ps)² patches. For our specific configuration with 4×4 BMEs, a 49×49 search window, and Ps = 1, a PB of at most 128KB is sufficient assuming 4×4-pixel patches and a 3-byte fixed-point representation of DCT values. The Wiener stage needs less buffering as Ns = 39 and patches are buffered in the color domain with 1 byte per pixel.
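The 128KB figure can be reproduced with a short illustrative calculation (function name ours):

```python
def pb_bytes(m, ns, ps, patch_dim, bytes_per_value):
    """Collective search area for an m x m grid of BMEs, in patches,
    times the storage footprint of one DCT patch."""
    patches = (ns + (m - 1) * ps) ** 2
    return patches * patch_dim * patch_dim * bytes_per_value

size = pb_bytes(m=4, ns=49, ps=1, patch_dim=4, bytes_per_value=3)
# 52*52 patches * 48 B = 129,792 B, which fits in a 128 KiB buffer
```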

3.5 Accelerator Optimization

As Section 3.6.2 demonstrates, while IDEALB outperforms the optimized CPU and GPU implementations by 363× and 18.9× respectively, it meets the needs of user-interactive applications only for up to 4MP images. Accordingly, Section 3.5.1 motivates and presents Matches Reuse (MR), a software-hardware co-optimization technique that significantly reduces the needed amount of computation. Since MR relies on the nature of typical images, Section 3.5.2 investigates not only its potential performance gains but also its effect on the output quality. Finally, Section 3.5.3 presents the MR-optimized accelerator IDEALMR.


[Figure 3.10 plots the reuse hit rate (%) for BM1 and BM2 as a function of the MR aggressiveness factor K from 0.1 to 1.]

Figure 3.10: MR hit rate as a function of K.

3.5.1 Matches Reuse Optimization

Since most of the processing time is spent in the two block-matching steps BM1 and BM2, MR targets these bottleneck steps. For block-matching, the algorithm examines every patch in a large search window centered around a reference patch to find the 16 patches that are most similar to the reference. The similarity metric used is the l2-Norm between the patch and the reference patch, where only those patches with an l2-Norm less than Tmatch, a preset threshold, are considered similarity candidates. In typical images and with a small stride Ps, adjacent reference patches are likely almost similar and thus typically have almost the same list of 16 best matches. Exploiting this observation, MR shortcuts the search effort by reusing the best matches of a reference patch for the subsequent reference patch if the two references are similar enough. To avoid deterioration in output quality, MR should not reuse the best matches of a reference patch for a subsequent dissimilar reference patch. Thus, the threshold used by MR for reference similarity should be stricter than Tmatch, the threshold used by the block-matching search.

Accordingly, MR amends the block-matching step as follows: calculate the l2-Norm (similarity metric) between the current reference patch Pc and the previous reference patch Pp. If the similarity is less than a stricter threshold K×Tmatch, where 0 < K < 1, the 16 best matches for Pc are found by having the BM engine search: 1) the 16 best matches of Pp that also fall within Pc’s search window, and 2) the rightmost column of Ns×Ps patches of Pc’s search window, which constitutes the part of Pc’s search window that does not overlap with Pp’s search window. Thus, MR ideally reduces the number of patches searched from Ns×Ns to Ns×Ps + 16. This reduces computation by 37× for Ns = 49 and Ps = 1. If Pc is not similar enough to Pp, MR reverts to the original exhaustive search of the Ns×Ns window. As K approaches 1, the performance gain is expected to increase at the expense of output quality, as less similar reference patches reuse the best matches. The speedup also depends on the image content; rapid changes in colors limit the likelihood that successive reference patches are similar enough for best-matches reuse.
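The MR decision and its search-size saving can be sketched as follows. Names are ours, and `mr_hit` models only the similarity test, not the pruned search itself:

```python
def mr_search_size(ns, ps, n_best=16, hit=True):
    """Patches examined per reference patch with (hit) and without
    the MR shortcut."""
    return ns * ps + n_best if hit else ns * ns

def mr_hit(ref_cur, ref_prev, t_match, k):
    """Reuse the previous reference's matches only if the two reference
    patches (flat lists) satisfy the stricter K * Tmatch threshold."""
    dist = sum((a - b) ** 2 for a, b in zip(ref_cur, ref_prev))
    return dist < k * t_match

full = mr_search_size(49, 1, hit=False)    # 2401 patches
short = mr_search_size(49, 1, hit=True)    # 65 patches, roughly 37x fewer
```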

MR can be extended to reuse the matches of earlier reference patches as well. However, checking for reuse with just the immediately preceding reference patch proved sufficient. In addition, MR can be used in many SBCF algorithms since they typically involve block-matching steps that dominate the execution time.

3.5.2 Matches Reuse: Potential and Quality

Fig. 3.10 shows the MR hit rate, that is, the fraction of reference patches for which MR attempts reuse instead of the original exhaustive search, as a function of the MR aggressiveness factor K. A hit rate of X% does not mean that patches are reused at the same rate, but that MR restricts its search this often. The colored bars in the figure report the average hit rate over all our dataset images for the two block-matching steps BM1 and BM2. The vertical


[Figure 3.11 plots the per-image PSNR normalized to the original BM3D (96% to 112%) and its average as a function of K from 0 to 1.]

Figure 3.11: Per image normalized PSNR as a function of K.

line bars are error bars showing the minimum and maximum hit rate over the images. The results show that MR is highly effective at reducing computation even though it depends on image content. For BM1, the average hit rate even with a strict K = 0.1 is 96%, saturating at 99.9% for K > 0.5. However, as the error bars illustrate, the hit rate of MR noticeably depends on image content, especially for stricter K values: the minimum hit rate is 74% with K = 0.1 and never gets below 99.4% once K > 0.5. Such high hit rates should not be surprising, as even abrupt changes in an image affect neighboring reference patches gradually because BM3D slides its reference window by a small stride, typically one pixel at a time. BM2 follows a similar trend as BM1, but the hit rate is generally lower and the sensitivity to image content is higher. This is expected as BM1 and BM2 operate in the DCT and the color domains respectively.

Fig. 3.11 reports how the output image quality, reported as per-image PSNR relative to the original BM3D, reacts to the MR optimization and its aggressiveness factor K. The curve highlights the average relative PSNR. For K = 0.1, the average PSNR is 2.64% higher than the original BM3D, and as K increases, the improvement drops to 2% since the reference patch similarity criterion becomes less strict for higher K. Image content also affects the image quality when applying the MR optimization. Images depicting mostly homogeneous areas tend to benefit more, with relative PSNR improving by up to 10%, while others with less homogeneous areas and more abrupt changes may see up to 2% output quality deterioration for all K configurations and the images we studied.

The experiments show that MR may even improve quality over BM3D. This occurs because MR is more resistant to noise artifacts than BM3D, which may consider them to be features of the original image. This property of MR is easier to illustrate in the extreme case of a uniform color image. In an example run with such an image, MR improved quality by 49% over BM3D. BM3D tends to over-fit the matches to a given reference patch since, due to happenstance, it is often possible to find matches with a nearly identical noise pattern. Effectively, BM3D considers the specific noise pattern an embedded feature of these patches and fails to eliminate it. MR, by reusing the matches of the previous reference patch, is less prone to over-fitting, leading to better diversity and more sparsity of the coefficients in the transform domain, such that the noise can be more easily identified, isolated and eliminated.

3.5.3 Architecture Modifications

In IDEALB, all BMEs operate in lockstep across reference patches since the search effort is constant per reference. With MR, each BME needs to advance independently as each may be performing a different amount of computation. Accordingly, IDEALMR comprises 16 processing lanes sharing the same memory controller. Fig. 3.12 shows one


Figure 3.12: One lane of IDEALMR. The accelerator features 16 lanes sharing the same memory controller.

lane of IDEALMR. The modifications made are described in the following subsections.

Per-BM Denoising Engine

Since MR significantly prunes the search done by a block-matching engine BME, the pressure now increases on the denoising engine DEE. We found that using MR increases the average throughput of each BME to almost that of a DEE. Hence, per BME, IDEALMR dedicates one DEE and three DCTEs: 1) one performs the DCT of channel 1 patches for the BM1 step, 2) one performs the DCT of channels 2 and 3 for the DE1 step or channels 1, 2 and 3 for the DE2 step, and 3) one performs the inverse DCT of all channels after the two denoising steps DE1 and DE2.

Per-BM Search Window Buffer

Since the BMEs advance independently, their search windows are likely non-overlapping. Thus, as Fig. 3.12 shows, IDEALMR uses smaller per-BME Search Window Buffers (SWBs) instead of a shared PB. While PB held DCT-transformed patches (BM1) or color-domain patches (BM2), each SWB holds the search window for the corresponding BME in the color domain for both BM1 and BM2. Each SWB needs to hold (Ns + PD − 1)² pixels where PD is the patch size. This comes at the expense of recalculating the DCT of the patches that are searched for the subsequent reference patch. However, in IDEALMR: 1) there are dedicated DCTEs per BME making recalculation possible, 2) the MR optimization significantly reduces the number of patches searched per reference patch, and the energy overhead of recalculating the DCT for those is negligible compared to the MR energy savings, and 3) the aggregate capacity of all SWBs is less than that of PB.

Scheduling

In IDEALMR, work is divided among the BMEs at image-row granularity. A BME processes a whole image row and then proceeds with the next available image row if one remains to be processed. This assignment increases the chances of reducing computation through MR, as adjacent reference patches along a row are processed by the same BME. Furthermore, this allows the buffered search window to be reused across successive reference patches and thus reduces off-chip bandwidth consumption.
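This row-granularity, pull-based assignment can be sketched as follows. This is a simplified software model, not the RTL; the function name, the per-row cost inputs, and the earliest-free-lane policy are illustrative assumptions:

```python
import heapq

def schedule_rows(row_costs, num_lanes=16):
    """Sketch of IDEALMR's row-granularity scheduling: whichever lane
    (BME) becomes free first claims the next unprocessed image row.
    `row_costs` models the data-dependent, per-row work MR leaves behind."""
    lanes = [(0, lane) for lane in range(num_lanes)]  # (free_time, lane_id)
    heapq.heapify(lanes)
    assignment = {lane: [] for lane in range(num_lanes)}
    for row, cost in enumerate(row_costs):
        free_at, lane = heapq.heappop(lanes)  # earliest-free lane claims the row
        assignment[lane].append(row)
        heapq.heappush(lanes, (free_at + cost, lane))
    makespan = max(t for t, _ in lanes)       # time when the last lane finishes
    return assignment, makespan
```

With two lanes and rows costing [5, 1, 1, 1] cycles, one lane absorbs the expensive row while the other drains the remaining three, mirroring how independently advancing lanes keep utilization high.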

Since off-chip memory accesses typically return a 64-byte block, IDEALMR uses SWBs that can each hold up to two search windows to efficiently utilize the off-chip bandwidth and to minimize stalls. Specifically, each SWB has (Ns + PD − 1) entries, as many as the rows of the search window. Each entry holds two 64B memory blocks.


Table 3.2: Accelerator hardware parameters.

Parameter            IDEALB       IDEALMR

Technology           65nm         65nm
Frequency            1 GHz        1 GHz
BM Engines           16           16
Denoising Engines    1 shared     16
DCT Engines          1 shared     16×3
On-chip Buffer       126.75 KB    16×6.5 KB
Fraction Precision   12-bit       12-bit
Memory Controller    2-channel, 32 in-flight requests
Off-chip DRAM        4GB, DDR3-1333

Using two blocks per SWB entry handles the case where each search window row spans two memory blocks due to alignment. Moreover, it allows the SWB to effectively prefetch the next search window along the image row. This eliminates most of the memory-related stalls and boosts the performance of IDEALMR to within 9.5% of that possible with an ideal off-chip memory system.
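The SWB geometry arithmetic can be sketched as below. The Ns = 45 and PD = 8 values are assumptions chosen only to illustrate the formula (they happen to reproduce the 6.5 KB per-SWB figure of Table 3.2):

```python
def swb_bytes(Ns, PD, blocks_per_entry=2, block_bytes=64):
    """Capacity of one per-BME Search Window Buffer: (Ns + PD - 1)
    entries, one per search-window row, each holding two 64B memory
    blocks to absorb alignment and to prefetch the next window."""
    entries = Ns + PD - 1
    return entries * blocks_per_entry * block_bytes

# Assumed, illustrative parameters (not given in the text):
# 52 entries x 128 B = 6656 B = 6.5 KB per SWB.
size = swb_bytes(45, 8)
```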

Exploiting MR across image rows could further reduce the processing time but would also increase the implementation complexity; matches would have to be shared and communicated across BMEs. Thus, IDEALMR does not implement it.

3.6 Evaluation

This section presents and compares the performance, power, and, where possible, the area of various implementations of BM3D-based image denoisers. We evaluate the following: 1) highly-optimized software implementations of BM3D targeting high-end CPUs and GPUs, 2) two neural network (NN) models performing image denoising accelerated on a server-class NN hardware accelerator, and 3) our proposed BM3D hardware accelerator in the basic IDEALB and the optimized IDEALMR versions. We also present a sensitivity study for the various design options of IDEAL.

3.6.1 Methodology

This section presents the evaluation methodology, configurations, tools, and test platforms we used to estimate the performance, power consumption, and hardware footprint of our proposed accelerator compared to those of the software implementations and accelerated machine learning models.

Accelerator Modeling

We developed cycle-accurate simulators for IDEALB and IDEALMR, which were configured with the parameters shown in Table 3.2. The simulator integrates with DRAMSim2 [172] to model the off-chip accesses to main memory. We implemented the accelerators in VHDL and synthesized the designs using the Synopsys Design Compiler and a commercial TSMC 65nm cell library. We use the McPAT [173] version of CACTI to model the area and power consumption of the on-chip buffers as an SRAM compiler is not available to us. The target frequency of the accelerator is set to 1GHz given CACTI's estimate for the buffer speed. To show how IDEAL scales on a newer process technology, we also report the area and power consumption of IDEAL using an STM 28nm cell library. The input data set comprises 30 publicly available RAW format images [174] with resolutions varying from 8MP


Table 3.3: CPU parameters.

Processor     Intel Xeon E5-2650 v2
Technology    22nm
Frequency     2.60 GHz
Cores         8 (×2 HW threads)
L1, L2, L3    32 KB D + 32 KB I, 256 KB, 20 MB
Memory        4-channel, 48 GB

Table 3.4: GPU parameters.

GPU              NVIDIA GeForce GTX 980
Technology       28nm
Frequency        1.126 GHz
CUDA Cores       2048
L1/texture, L2   24 KB, 2 MB
Shared memory    96 KB
Memory           4 GB GDDR5, 224 GB/s

up to 42MP. The images depict nature, street, and texture scenes. We report results for two MR aggressiveness configurations with K = 0.25 and K = 0.5.

CPU Implementation

We implemented three versions of the BM3D algorithm in C++ targeting general-purpose CPUs: 1) single-thread, 2) multi-threaded to exploit all the hardware threads available on the processor, and 3) an MR-optimized single-thread implementation with two configurations, K = 0.25 and K = 0.5. The implementations were optimized to exploit the Intel AVX vector instruction set, were compiled with GCC 5.1.1 at the -O3 optimization level, and were run on a high-end Intel Xeon E5-2650 v2 detailed in Table 3.3. For energy and power measurements on such a commodity system, we followed the methodology of Yazdani et al. [175]. Specifically, we measured the energy consumption of these runs using the PAPI API [176], which is based on the Intel RAPL library [177]. This library uses a software power model that estimates energy usage by using hardware performance counters and I/O models.

GPU Implementation

We implemented BM3D in CUDA and ran the experiments on an NVIDIA GTX 980 GPU with the specifications shown in Table 3.4. The code was compiled with CUDA Toolkit v8.0 at the -O3 optimization level. For performance and power consumption estimates, we used the NVIDIA Visual Profiler [178] following the methodology of Yazdani et al. [175]. The measurements do not include the memory transfers between the CPU and the GPU. Even without including these overheads, performance is shown to be unsatisfactory.

Machine Learning Implementations

Several recent works exploit machine learning (ML) for computational imaging applications [62, 63, 74, 179, 180, 181]. We chose two state-of-the-art ML-based denoisers shown to rival BM3D in terms of output quality. The denoiser proposed by Burger et al., referred to as ML1, is a 5-layer fully-connected NN (FCNN) with the dimensions shown in Table 3.5 [62]. The denoiser proposed by Gharbi et al., referred to as ML2, uses a 15-layer


Table 3.5: ML1 & ML2 neural network parameters.

                           ML1             ML2
NN type                    FCNN            CNN
Number of layers           5               15
Input patch/tile size      39×39           320×320
Output patch size          17×17           256×256
Layer dimensions           L1: 1522×3072   each layer: 64×64
                           L2: 3073×3072   kernel: 3×3
                           L3: 3073×2559
                           L4: 2560×2047
                           L5: 2048×289
Model size (# of weights)  27.8M           560K

Table 3.6: The implementations we compare along with the corresponding abbreviations.

SW/HW Implementation Abbreviation

SW   Single-thread CPU                CPU
SW   Multi-threaded CPU               Threads
SW   Single-thread CPU + MR, K=0.25   MR (0.25)
SW   Single-thread CPU + MR, K=0.5    MR (0.5)
SW   GPU                              GPU
HW   FCNN on DaDianNao                ML1
HW   CNN on DaDianNao                 ML2
HW   IDEALB                           IDEAL_B
HW   IDEALMR, K=0.25                  IDEAL (0.25)
HW   IDEALMR, K=0.5                   IDEAL (0.5)

convolutional neural network (CNN) to jointly demosaick and denoise an input image [74]. The parameters and architecture of the CNN are shown in Table 3.5.

We measured the execution time of these models on DaDianNao, a DNN accelerator proposed by Chen et al. [73], using a cycle-accurate simulator. To model power consumption, we synthesize a Verilog model of DaDianNao on the same 65nm technology as IDEAL and model the on-chip eDRAMs using Destiny [182]. While the ML1 model requires 56MB, we assume it fits in the 32MB on-chip weights memory as even then the performance remains unsatisfactory. For ML2, the model size is much smaller and thus we replace the original 32MB weights eDRAM with a 1.125MB SRAM for a fair comparison with our accelerator.
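The quoted model footprints are consistent with two bytes (16 bits) per weight, which we assume here purely for illustration:

```python
def model_bytes(num_weights, bits_per_weight=16):
    """Weight-storage footprint; 16-bit weights are an assumption that
    is consistent with the totals quoted in the text."""
    return num_weights * bits_per_weight // 8

ml1 = model_bytes(27_800_000)  # 55,600,000 B, i.e. the ~56 MB quoted for ML1
ml2 = model_bytes(560_000)     # 1,120,000 B, close to ML2's 1.125 MB SRAM
```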

3.6.2 Execution Time Performance

All performance results presented in this subsection are normalized to the baseline single-thread CPU implementation and are averaged over all input images. First, we show the relative performance of the multi-threaded CPU and GPU software implementations, followed by the hardware-accelerated implementations, including the NN models running on a deep learning hardware accelerator and our custom hardware accelerator IDEAL.

Software Implementations

Fig. 3.13a shows that the multi-threaded CPU and the GPU implementations improve performance by 12.6× and 19× respectively. The figure also shows that using the MR optimization with the single-thread CPU implementation results in a 3× speedup for both aggressiveness configurations (K = 0.25 and K = 0.5). This is expected as: 1) the


Figure 3.13: Speedup vs. single-thread CPU. (a) SW implementations (Threads, GPU, MR (0.25), MR (0.5)). (b) Accelerators.

block-matching steps consume 67% of the processing time (Fig. 3.4), and 2) the MR optimization was found to reduce the search effort of the block-matching steps by 29× and 31× on average for the two K configurations respectively. Incorporating MR into the GPU implementation would ideally improve its performance by at most 6.4× given that the block-matching steps originally account for 87% of the total execution time. Thus, the performance of an MR-optimized GPU implementation would remain unsatisfactory.
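Both bounds follow from Amdahl's law applied to the block-matching fraction of the runtime; a quick check (fractions and reduction factors from the text, helper name ours):

```python
def amdahl_speedup(fraction, factor):
    """Whole-application speedup when `fraction` of the runtime is
    accelerated by `factor` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# CPU: block matching is 67% of runtime and MR prunes it ~29x,
# bounding the overall gain near the reported ~3x.
cpu_gain = amdahl_speedup(0.67, 29)            # ~2.8x
# GPU: block matching is 87% of runtime; a ~31x search reduction
# would cap the gain at ~6.3x, and an infinite one at ~7.7x.
gpu_gain = amdahl_speedup(0.87, 31)
gpu_limit = amdahl_speedup(0.87, float("inf"))
```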

Hardware Implementations

Fig. 3.13b shows speedup with the hardware-accelerated implementations. Running ML1 and ML2 on DaDianNao is 131× and 2,243× faster respectively. ML1 achieves a much lower speedup than ML2 since ML1 is a fully-connected neural network (FCNN) while ML2 is a convolutional neural network (CNN). Accelerating FCNNs is a harder problem for the following reasons: 1) the model weights are not reused across several activations, leading to much more under-utilization of the processing elements of DaDianNao, and 2) there is more stress on the off-chip memory bandwidth due to the larger volume of model weights for the fully-connected layers. Most of the recent neural networks used for image-based applications are CNNs since convolution naturally fits those applications: a more compact set of convolution filters is reused over several subsets of the input activations.

IDEALB is on average 363× faster than the baseline CPU and 18.9× faster than the GPU. Although it does not outperform ML2, IDEALB is a lower-footprint design that consumes much less energy, as Section 3.6.3 shows. However, the optimized IDEALMR design outperforms all other implementations and is 9,446× and 11,352× faster than the baseline CPU for K = 0.25 and K = 0.5 respectively. Since the accelerator pipelines the BM and the DE steps, the speedup over IDEALB scales linearly with the reduction in the search effort done by the BMEs. IDEALMR is 27× and 31× faster than IDEALB for the two K configurations respectively. This brings the performance to a satisfactory level for real-time and user-interactive applications, as the next subsections show.

Per Image Performance

Fig. 3.14 shows the runtime of IDEALMR for images of different resolutions. The runtime depends on image content as this affects the probability of reusing the best matches. For the images studied, processing time remains within user-interactive application limits even for high-resolution images: IDEALMR can process a 42MP image in


Figure 3.14: IDEALMR runtime (seconds) for images of different resolutions (8 to 42 MegaPixels), for IDEAL(0.25) and IDEAL(0.5).

Figure 3.15: HD frames per second processed by IDEALMR under different configurations.

less than 0.5 seconds and the more common 16MP images in 0.13 to 0.18 seconds. Fig. 3.15 shows the average, minimum, and maximum frames per second (FPS) performance of different IDEALMR configurations for HD frames taken from a different dataset of 34 HD frames depicting nature, city, and texture scenes. We use the abbreviation IDEAL_x_y for IDEALMR configured with K = x and reference patch stride Ps = y. On average, all configurations achieve 30 FPS or higher except for IDEAL_0.25_1. IDEALMR can reach 90 FPS for the relaxed configuration IDEAL_1_3, with which FPS does not drop below 22. Higher performance would be possible if the search window dimensions could be reduced.

3.6.3 Energy Efficiency

Energy efficiency (EE) is a metric typically used to compare different compute machines. It captures how efficiently a machine uses energy to perform the needed computation. The energy efficiency of a compute machine M2 over another machine M1 is defined as follows:

EE(M2, M1) = (P_M1 × T_M1) / (P_M2 × T_M2)        (3.5)

where P is the power consumption and T is the processing time. Table 3.7 shows the average total power dissipation for the studied implementations along with a breakdown for the compute core, on-chip memory, and off-chip DRAM. The GPU power measurement tool does not provide a breakdown, hence the table lists only the


Table 3.7: Power breakdown for all implementations in Watts.

          Core    LLC               DRAM    Total
CPU       25.9    11.9              4.7     42.5
Threads   96.8    24.2              9.1     130.1
GPU       -       -                 -       144

          Core    On-Chip Buffers   DRAM    Total
ML1       40.91   -                 NC      NC
ML2       9.04    3.97              0.44    13.45
IDEALB    1.29    0.39              3.83    5.51
IDEALMR   9.2     2.84              6.16    18.2

Table 3.8: The effect of prefetching and on-chip buffering on IDEALMR speedup over CPU.

Configuration   Pref+Buff   No Pref   None
IDEAL 0.25      9,445×      7,144×    278×
IDEAL 0.5       11,352×     8,176×    286×

total power. The multi-threaded software implementation dissipates 3× the power of the baseline CPU as it uses all 16 available hardware threads. The GPU at full utilization is the most power-hungry implementation, dissipating 144W. Meanwhile, IDEALB is the lowest power-consuming solution at 1.68W on-chip and 5.5W total. However, IDEALMR is the most energy efficient machine as it consumes 12.05W on-chip and 18.2W in total while being 31× faster than IDEALB.

ML1 consumes 41W of on-chip power while being considerably slower than IDEALMR, and thus we did not measure its off-chip memory power consumption. ML2 consumes 13W on-chip, almost 1W more than the on-chip power of IDEALMR, but performs fewer off-chip accesses since it uses a 4MB on-chip activation memory. While its total power is lower than that of IDEALMR, so is its performance. Thus, IDEALMR is 3.95× more energy efficient than ML2 running on the DaDianNao accelerator.

Given the speedup of IDEALMR, a direct comparison shows that it is 35,595× and 7,064× more energy efficient than the CPU and the GPU implementations respectively. However, the CPU and GPU power measurements are on actual systems and a direct comparison may not be appropriate. Nevertheless, given the three to four orders of magnitude, there should be no doubt that IDEALMR is more energy efficient as well.
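The comparisons above all instantiate Eq. 3.5, which reduces to a one-liner; the example uses toy numbers rather than the measured values:

```python
def energy_efficiency(p_m1, t_m1, p_m2, t_m2):
    """Eq. 3.5: EE(M2, M1) = (P_M1 * T_M1) / (P_M2 * T_M2), i.e. the
    energy machine M1 spends divided by the energy M2 spends."""
    return (p_m1 * t_m1) / (p_m2 * t_m2)

# Toy example, not the thesis measurements: a machine that finishes
# twice as fast at the same power is 2x more energy efficient.
ee = energy_efficiency(10.0, 1.0, 10.0, 0.5)
```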

3.6.4 Area

IDEALB, with 16 BM engines, 1 DE engine, 1 DCT engine, and a shared 126.75KB PB, occupies 5.5mm². IDEALMR, with 16 BM engines, 16 DEs, 48 DCT engines, and 16 per-BM 6.5KB SWBs, requires 23.08mm². The DEs are the most expensive components, accounting for 79% and 62% of the area and power consumption of IDEALMR respectively. The original DaDianNao with the 32MB eDRAM weight memory needed to run ML1 has an area of 80.4mm² while the customized version running ML2 needs 41mm².

3.6.5 Optimization Effect Breakdown

Besides the MR optimization, IDEALMR incorporates prefetching and on-chip buffering. To quantify the effect of these optimizations, Table 3.8 reports the performance of IDEALMR relative to the baseline CPU when these optimizations are selectively disabled. Three configurations are shown: 1) Prefetching + Buffering, 2) No


Figure 3.16: Performance sensitivity to the number of lanes of IDEALMR (speedup over the single-thread CPU for IDEAL(0.25) and IDEAL(0.5) as the number of lanes scales from 16 to 128).

prefetching, and 3) No prefetching and no buffering. Disabling the prefetcher reduces the speedups to 7,144× and 8,176× for IDEAL 0.25 and IDEAL 0.5 respectively. Eliminating the on-chip buffers as well further degrades the speedups down to 278× and 286× respectively.

3.6.6 Sensitivity Study: Scalability

For IDEALB, the single denoising engine DEE can ideally serve up to 31 block-matching engines (BMEs). However, we found that the utilization of each BME degrades below 90% for configurations with more than 16 BMEs. This is due to the single-ported PB that reads and broadcasts one patch at a time to all the BMEs. As the number of BMEs increases, the non-overlapping area of the corresponding search windows increases, causing more stalls. Thus, performance scales sub-linearly as the number of BMEs in the processing pipeline increases. One way to scale up IDEALB is to replicate the whole processing pipeline shown in Fig. 3.5 while tiling the input image across the replicated pipelines. However, we found that the hardware footprint and power consumption of such a design are too significant for the target real-time and user-interactive performance needs.

Given that the BM1 and BM2 steps dominate the execution time of BM3D, the number of independent processing lanes IDEALMR uses is the key design parameter that directly affects performance. Accordingly, Fig. 3.16 reports how performance varies relative to the baseline CPU while scaling the number of lanes in IDEALMR from 16 up to 128. While performance scales linearly going from 16 to 32 lanes, at 64 lanes and beyond the improvements become increasingly sublinear, and more so for K = 0.25. These diminishing performance returns are due to the limited off-chip bandwidth. The evaluated design uses a dual-channel DDR3-1333 memory controller that can deliver up to 21 GB/s. IDEAL 0.25 hits this bandwidth ceiling earlier, at 64 lanes, while IDEAL 0.5 hits it at the 128-lane configuration. A higher K value results in higher and more regular reuse across lanes, which as a result tend to advance more synchronously. When the lanes stay close to one another in terms of their assigned image rows, their memory requests often coalesce, leading to less off-chip bandwidth consumption.
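The 21 GB/s ceiling is the usual peak-bandwidth arithmetic for a dual-channel, 64-bit DDR3-1333 interface; a quick check (helper name ours):

```python
def peak_dram_gb_s(megatransfers_per_s, bus_bytes=8, channels=2):
    """Peak DRAM bandwidth: transfer rate x bus width x channels.
    A 64-bit (8-byte) data bus per channel is assumed, as is standard
    for DDR3 DIMM channels."""
    return megatransfers_per_s * 1e6 * bus_bytes * channels / 1e9

peak = peak_dram_gb_s(1333)   # ~21.3 GB/s, the ceiling cited above
```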

3.6.7 Sensitivity Study: Technology Node

On the more recent STM 28nm technology library, IDEALB requires an area of 1.44mm² and consumes 0.65W of on-chip power. Similarly, IDEALMR needs an area of 7.9mm² and consumes 5.1W of on-chip power. Thus, on newer technology nodes, the proposed designs become more affordable for the possible target systems.


Table 3.9: Area and power consumption vs. precision.

Precision    12-bit   11-bit   10-bit   9-bit   8-bit
Area (mm²)   23.08    21.45    19.97    17.54   15.4
Power (W)    12.05    11.65    11.41    10.21   9.07

3.6.8 Sensitivity Study: Precision Tuning

A design parameter that greatly affects area and power consumption is precision. Accordingly, Table 3.9 shows how the area and power consumption of IDEALMR vary for precisions ranging from 12-bit down to 8-bit fractions. Section 3.4.4 reported no visual artifacts even with 9 bits of precision. The results show to what extent designers can use precision as a design knob to meet the constraints of different applications.

3.7 Augmenting Functionality

This section presents an example where IDEALMR is extended to support an additional CI application: sharpening. As expected, extending IDEALMR to support sharpening required only a surgical addition to the denoising engine DEE. By changing the DEE, it should be possible to implement other BM3D variants that have demonstrated superior quality, such as deblurring [29] and up-sampling [31].

The modified IDEALMR implements the technique of Dabov et al. [30] to jointly denoise and sharpen images. It uses the same BM3D pipeline with a single minor modification: after denoising the 3D transform-domain coefficients, sharpening is achieved by taking the α-root of their magnitude for some α > 1. The modified accelerator incorporates an α-rooting component in the DEE pipeline right after the inverse Haar engine (see Fig. 3.7). For the 65nm technology, these modifications require an extra 0.09mm² of area and 0.12W of power while the processing throughput remains unaffected.
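A literal reading of the α-rooting step can be sketched as below. This is an illustrative floating-point model with names of our choosing, not the accelerator's fixed-point datapath; Dabov et al. [30] define the exact transfer function:

```python
def alpha_root(coeff, alpha):
    """Sign-preserving alpha-rooting of one transform-domain coefficient:
    the magnitude is replaced by its alpha-th root."""
    magnitude = abs(coeff) ** (1.0 / alpha)
    return magnitude if coeff >= 0 else -magnitude

def sharpen_coeffs(coeffs, alpha=1.5):
    """Apply alpha-rooting (alpha > 1 sharpens) to every denoised 3D
    transform coefficient before the inverse transforms run."""
    return [alpha_root(c, alpha) for c in coeffs]
```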

3.8 Related Work

Previous work on accelerating BM3D includes algorithmic approximations as well as implementations targeting different computing platforms such as GPUs, FPGAs, and custom hardware. Sarjanoja et al. presented a heterogeneous implementation of BM3D using OpenCL and CUDA [152]. On an NVIDIA GeForce GTX 650, their implementation was shown to be 7.5× faster than a CPU implementation. Honzátko's CUDA implementation running on an NVIDIA GeForce GTX 980 was reported to be 10× faster than a CPU implementation running on an Intel Core i7 processor [57]. Both of the aforementioned implementations restrict BM3D configuration parameters such as the search window size Ns and the reference patch stride Ps. Tsai et al. accelerated block-matching using an approximate-nearest-neighbor (ANN) search in order to process a 512×512 image in 0.7 seconds on a high-end GPU [27]. Our exact CUDA implementation with the same algorithmic parameters achieves a 19× speedup over our CPU implementation, which would still be unsatisfactory even if it incorporated ANN.

Zhang et al. proposed a custom hardware accelerator for BM3D which can process 25 BT656 PAL frames per second (0.4MP video resolution) [59, 60]. They restrict the BM3D parameters to Ns = 15 and Ps = 4 to reduce computations by two orders of magnitude. Our experiments show that IDEALMR can process 52 FPS for 0.4MP frames even with the unmodified BM3D parameters.

Cardoso proposed an FPGA implementation of BM3D [58] whose pipeline is similar to IDEALB, 16 BM modules followed by one DE, but with several approximations: 1) using the much simpler l1-Norm instead of


the l2-Norm for the block-matching similarity metric, 2) not implementing the full Haar transform but just a single-level Haar decomposition, and 3) restricting the search parameters to Ns = 39 and Ps = 4. On the Xilinx 28nm ZYNQ-7000 ZC706 SoC, the design operates at 125 MHz, consumes 2.9W, and processes an 8MP image in 1.5 seconds. Our IDEALB would be 21× faster and 94× more energy efficient under the same search parameter configuration and technology node.

Clemons et al. proposed a patch memory system tailored to applications that process 2D and 3D data structures such as images [183]. The system exploits multi-dimensional locality and provides efficient caching, prefetching, and address calculation. Leveraging such a memory system for IDEALMR is left for future work.

3.9 Concluding Remarks

We characterize highly optimized software implementations of BM3D for image frame denoising on high-end commodity processors, including a CPU with vector extensions and a GPU. We show that the performance on such platforms falls short of that needed for real-time processing of HD frames and user-interactive processing of higher resolutions. We also consider accelerating machine-learning-based implementations on a server-class hardware accelerator. We propose IDEALMR, a BM3D custom hardware accelerator that outperforms the CPU and GPU implementations by four and three orders of magnitude respectively while being at least 5.4× faster than the hardware-accelerated ML alternatives. Follow-up work may focus on modifying our accelerators to support additional filters, and thus more CI applications, or on improving the performance and energy efficiency of deep learning accelerators to enable ML-based implementations to be used for user-interactive and real-time applications.


Chapter 4

Diffy: Enabling ML-Based Computational Imaging

We show that Deep Convolutional Neural Network (CNN) implementations of computational imaging tasks exhibit spatially correlated values. We exploit this correlation to reduce the amount of computation, communication, and storage needed to execute such CNNs by introducing Diffy [184], a hardware accelerator that performs Differential Convolution. Diffy stores, communicates, and processes the bulk of the activation values as deltas. Experiments show that, over five state-of-the-art CNN models and for HD resolution inputs, Diffy boosts the average performance by 7.1× over a baseline value-agnostic accelerator [73] and by 1.41× over a state-of-the-art accelerator that processes only the effectual content of the raw activation values [80]. Further, Diffy is respectively 1.83× and 1.36× more energy efficient when considering only the on-chip energy. The total energy efficiency would be higher given that Diffy requires 55% less on-chip storage and 2.5× less off-chip bandwidth compared to storing the raw values using profiled per-layer precisions [185]. Compared to using dynamic per-group precisions [186], Diffy requires 32% less storage and 1.43× less off-chip memory bandwidth. With practical configurations, Diffy provides the performance necessary to process 2MP high-definition frames in real time, i.e., 30 HD frames or more per second. Finally, Diffy is robust and can serve as a general CNN accelerator as it improves performance even for image classification models.

4.1 Introduction

In addition to the well-known successes of Deep Neural Networks (DNNs) in high-level classification applications such as image recognition [82, 83, 84], object segmentation [85, 86, 87], and speech recognition [88], DNNs have also recently achieved state-of-the-art output quality in a wide range of Computational Imaging (CI) and low-level computer vision tasks. These CI applications, including but not limited to image denoising [46, 47, 48], demosaicking [49], sharpening [64], deblurring [65, 66, 67], and super-resolution [68, 69, 70, 71, 72], were traditionally dominated by analytical solutions. These are essential tasks for virtually all imaging-sensor-based systems such as mobile devices, smartphones, tablets, medical devices, digital cameras, automation systems, and imaging-based embedded systems in general. Such devices are typically cost-, power-, energy-, and form-factor-constrained. Accordingly, one goal of this part of the thesis is to investigate whether Computational Imaging DNNs (CI-DNNs) can be deployed on such devices. While the emphasis of this work is on such devices, there are also applications where higher cost and energy can be acceptable for better quality. For example, practical



deployment of CI-DNNs can benefit scientific applications such as telescope imaging with input images of up to 1.5 billion pixels [8, 9], automation applications in manufacturing pipelines, or even server farms.

Due to the high computational and data demands of DNNs, several DNN accelerators have been proposed to boost performance and energy efficiency over commodity Graphics Processing Units (GPUs) and general-purpose processors (CPUs) [73, 75, 76, 77, 78, 79, 80, 81]. To date, accelerators have taken advantage of the computation structure, the data reuse, the static and dynamic ineffectual value content, and the varying precision requirements of DNNs. These past acceleration successes demonstrate that identifying additional runtime behaviors in DNNs is invaluable, as it can inform further innovation in accelerator design.

While CNN acceleration efforts have focused primarily on classification DNNs, CI-DNNs introduce new challenges and improvement opportunities due to their unique characteristics highlighted by this dissertation. While classification models extract features and identify high-level abstractions, CI-DNNs perform low-level per-pixel prediction, i.e., for each input pixel the model predicts a corresponding output pixel. For example, image classification models take an image and identify the depicted object; the output is typically a vector of probabilities, one per possible object class. On the other hand, a denoising DNN model takes a noisy image and produces a denoised version, typically at the same resolution as the input. Thus, the structure and behavior of CI-DNNs differ from those of image classification models. Three key differences are: 1) while DNN models generally include a variety of layers, the per-pixel prediction models are fully convolutional; 2) CI-DNNs naturally scale with the input resolution whereas classification models are resolution-specific; and 3) CI-DNN models exhibit significantly higher spatial correlation in their runtime-calculated values, that is, the inputs (activations) used during the calculation of neighboring outputs tend to be close in value. This is a known property of images which these models exhibit throughout their layers. The first two characteristics exacerbate the typical bottlenecks of DNN processing: CI-DNNs typically have much higher computation demands, produce a much larger volume of intermediate results, and thus need much more on-chip storage and off-chip bandwidth. We exploit the third characteristic to mitigate the exacerbated bottlenecks.

To take advantage of the spatial correlation between activation values, we introduce Differential Convolution, which operates on the differences, or deltas, of the activations rather than on their absolute values. We demonstrate that differential convolution can be practically implemented by proposing Diffy, a CI-DNN accelerator that translates the reduced precision and the reduced effectual bit-content of these deltas into improved performance, reduced on- and off-chip storage and communication, and, ultimately, improved energy efficiency over state-of-the-art designs. While Diffy targets CI-DNNs, it also benefits other models, albeit to a lesser extent; we experiment with image classification and image segmentation. This shows that Diffy is robust and not CI-DNN specific.

In summary, the conceptual contributions and findings of this work are:

• We study an emerging class of CNN models that performs per-pixel prediction, showing that they exhibit strong spatial correlation in their value stream. This property is exhibited by all the layers in the DNN pipeline.

• We present Differential Convolution (DC), a novel convolution processing technique which exploits the preceding property of CI-DNNs to process the activation values as deltas, leading to significant reductions in the work necessary to compute convolutions.

• We propose to store and communicate values as deltas both off- and on-chip, reducing the amount of storage and communication needed, or equivalently boosting the effective capacity of on- and off-chip storage and communication links.


CHAPTER 4. DIFFY: ENABLING ML-BASED COMPUTATIONAL IMAGING 43

We propose Diffy, a practical DC-based architecture that boosts performance and energy efficiency for CI-DNNs and other convolutional neural networks (CNNs) over state-of-the-art techniques. Our experimental results show that:

• Over a set of recent CI-DNNs, a Diffy configuration that can compute the equivalent of 1K 16×16b multiply-accumulate operations per cycle boosts performance by 7.1× and 1.41× over a baseline value-agnostic accelerator (VAA) [73] and a state-of-the-art value-aware accelerator (PRA) [80], respectively. This Diffy configuration processes HD frames (1920×1080) at 3.9 to 28.5 frames per second (FPS), depending on the target application. By comparison, VAA achieves 0.7 to 3.9 FPS while PRA can process 2.6 to 18.9 FPS.

• Compared to a state-of-the-art compression technique that uses dynamic precision per group of raw values [186], Diffy with its delta-based compression reduces on-chip storage needs by 32% and off-chip traffic by 1.43×.

• Diffy consistently outperforms SCNN, a state-of-the-art sparse DNN accelerator [81], even when the CI-DNN models are aggressively sparsified. For instance, Diffy is 4.5× faster when the models are 50% sparse.

• Considering only on-chip energy, Diffy is 1.83× and 1.36× more energy efficient than VAA and PRA, respectively. Taking into account the reduction in off-chip traffic, even higher total energy efficiency is expected.

• Diffy scales much more efficiently and enables real-time processing of HD frames with considerably fewer resources.

• Diffy benefits image classification CNNs as well, improving performance on average by 6.1× and by 1.16× compared to VAA and PRA, respectively. Most of the benefits appear in the earlier layers of these networks, where Diffy proves to be up to 2.1× faster than PRA.

The rest of this chapter is organized as follows: Section 4.2 motivates Diffy, showing the potential of replacing normal convolutions with Differential Convolutions, especially for per-pixel prediction DNN models. Section 4.3 presents the needed background on the baseline DNN accelerator and the serial-parallel architecture Diffy builds upon, then proposes Differential Convolution and details Diffy's implementation. Section 4.4 presents the methodology, compares the performance, energy efficiency and area of the considered accelerators, and conducts sensitivity studies for some design decisions. Finally, Section 4.5 goes over related work and Section 4.7 concludes.

4.2 Motivation

To motivate Diffy, we first briefly review information-theory concepts related to ideal information representation. The average amount of information, in bits, carried by a random variable A taking n possible values {a1, a2, ..., an} is defined as the entropy H(A) and is given by the following equation:

H(A) = −∑_{i=1}^{n} P(a_i) × log₂(P(a_i))    (4.1)

where P(a_i) is the probability that the random variable A takes the value a_i. Now, given another random variable A′ taking the same n possible values, the conditional entropy H(A|A′) is defined as the amount of information carried by A given that we know the value of A′. The two extreme cases here are either that A and A′ are independent


Figure 4.1: Information content for DnCNN, FFDNet, IRCNN, JointNet and VDSR: entropy of raw activations H(A), conditional entropy H(A|A′) (see text) and entropy of deltas H(∆).

random variables, or that A is completely determined once A′ is known. Between these two extremes lies a range of possible correlation strengths between the two random variables. If A and A′ are independent, then knowing A′ does not remove any uncertainty about A, in which case:

H(A|A′) = H(A) (4.2)

However, if A′ completely determines the value of A, then A carries no new information once we know A′. In this case:

H(A|A′) = 0    (4.3)

Thus, the lower H(A|A′) is, the stronger the correlation between the two random variables, and hence the less new information A carries given the value of A′.
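As a concrete illustration (our own sketch, not from the evaluation), the three quantities can be estimated empirically from a stream of quantized activations; the sample row below is hypothetical, spatially correlated data:

```python
from collections import Counter
import math

def entropy(values):
    """Shannon entropy H(A) in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(pairs):
    """H(A|A') in bits from (a, a_prev) pairs, via H(A|A') = H(A,A') - H(A')."""
    return entropy(pairs) - entropy([p for _, p in pairs])

# Hypothetical, spatially correlated activation row.
row = [10, 11, 12, 12, 13, 14, 14, 15, 16, 16, 17, 18, 18, 19, 20, 20]
pairs = list(zip(row[1:], row[:-1]))       # (A, A') neighbours along X
deltas = [a - p for a, p in pairs]         # delta = A - A'

h_a, h_cond, h_delta = entropy(row), conditional_entropy(pairs), entropy(deltas)
# For correlated data: H(A|A') <= H(delta) < H(A), mirroring Fig. 4.1.
```

On such correlated data the raw values are nearly all distinct (high H(A)) while the deltas concentrate on a few small values, which is exactly the gap Fig. 4.1 measures for real CI-DNN activations.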

Ideally, to avoid redundant computation and communication, we should process only the new, or essential, information carried by the data at hand. Fig. 4.1 presents the first set of evidence that, compared to the raw values, the deltas between adjacent values processed by CI-DNNs convey the essential information content more compactly. For each of the five CI-DNN models that we study, the figure shows: 1) H(A), the entropy of the activations processed by the model; 2) H(A|A′), the conditional entropy of an activation A given its neighbour A′ along the X-axis; and 3) H(∆), the entropy of the activation deltas along the X-axis, i.e., H(A−A′). The results shown were collected over all the input datasets we detail later in Section 4.4.1. We find very similar results when we study adjacent activations along the Y-axis.

While H(A) represents the average amount of information, in bits, encoded within an activation value, H(A|A′) determines the amount of new information carried by A if we already know A′. H(∆) shows how much of the redundant information can be removed from A if we replace the activations with their deltas. Compared to the entropy of the activations H(A), the lower conditional entropy H(A|A′) proves that there is considerable redundant information from one activation to the next. This redundancy can be exploited to compress the encoded information by a factor of H(A)/H(A|A′), which ranges from 1.29× for IRCNN up to 1.62× for VDSR. H(∆) shows that for some networks the activation deltas can compress the information more aggressively than H(A|A′) suggests, while for other models the deltas do not capture the full potential. However, on average over all the models, the potential to compress the underlying information with H(A|A′) and H(∆) is nearly identical, at 1.41× and 1.4× respectively.

Next we review the operation of the convolutional layers (Section 4.2.1) so that we can explain how using deltas can in principle reduce computation, communication and data footprint (Section 4.2.2). We finally motivate


Diffy by reporting the spatial locality in CI-DNNs (Section 4.2.3) and the potential of delta encoding to reduce computation (Section 4.2.4), communication and data storage (Section 4.2.5).

4.2.1 Convolutional Layers

A convolutional layer takes an input feature map, or imap, which is a 3D array of activations of size C×H×W (channels, height, and width), applies K 3D filter maps, or fmaps, of size C×H_F×W_F in a sliding-window fashion with stride S, and produces an output map, or omap, which is a 3D array of activations of size K×H_O×W_O. Each output activation is the inner product of a filter and a window, a sub-array of the imap of the same size as the filter. Assuming a, o and w_n are respectively the imap, the omap and the n-th filter, the output activation o(n,y,x) is computed as the following inner product:

o(n,y,x) = ∑_{k=0}^{C−1} ∑_{j=0}^{H_F−1} ∑_{i=0}^{W_F−1} w_n(k,j,i) × a(k, j+y×S, i+x×S)    (4.4)

Input windows form a grid with stride S. As a result, the omap dimensions are respectively K, H_O = (H−H_F)/S+1 and W_O = (W−W_F)/S+1. In the discussion that follows we assume, without loss of generality, that S = 1, which is the common case for CI-DNNs; the concepts apply regardless.
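A direct, unoptimized implementation of Eq. (4.4) can serve as a reference point (a sketch; array shapes follow the C×H×W convention above):

```python
import numpy as np

def conv_layer(imap, fmaps, stride=1):
    """Direct convolution per Eq. (4.4).
    imap: C x H x W activations; fmaps: K x C x HF x WF filters.
    Returns omap of size K x HO x WO."""
    C, H, W = imap.shape
    K, Cf, HF, WF = fmaps.shape
    assert C == Cf, "filter depth must match imap channels"
    HO = (H - HF) // stride + 1
    WO = (W - WF) // stride + 1
    omap = np.zeros((K, HO, WO), dtype=imap.dtype)
    for n in range(K):                      # each filter
        for y in range(HO):
            for x in range(WO):             # each sliding window
                win = imap[:, y*stride:y*stride+HF, x*stride:x*stride+WF]
                omap[n, y, x] = np.sum(fmaps[n] * win)   # inner product
    return omap
```

Every output activation is one inner product between a filter and an imap window, which is why the multiplications inside this triple loop dominate the cost discussed next.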

4.2.2 Revealing Redundant Information and Work

Given the abundant reuse in the convolutional layers, it is beneficial to transform the input activations from their raw value space R to some space D, i.e., R =⇒ D, where: 1) the operations performed on R can still be applied seamlessly on D, and 2) the representation of values in D is compressed, leading to less communication and computation. One such transformation is delta encoding, where adjacent activations are represented by their differences instead of their raw values. First, deltas are subject to the distributive and associative properties of multiplication and addition, the main operations of convolution. Second, if the raw values are sufficiently correlated, delta encoding is a compressed and more space- and communication-efficient representation of the values.

Multiplications account for the bulk of the computational work in CI-DNNs. For this reason, strong spatial correlation in the imaps presents an opportunity to reduce the amount of work needed. To understand why this is so, consider the multiplication a×w of an activation a with a weight w. If a is represented using p bits, the multiplication amounts to adding p terms, where the i-th term is the result of "multiplying" the i-th bit of the multiplier a with the multiplicand w shifted by i bit positions:

a×w = ∑_{i=0}^{p−1} a_i × (w ≪ i)    (4.5)

Only those bits of a that are 1 yield effectual work and are thus called effectual bits. Using Booth encoding, a value can be represented as a sum of signed terms where each term is a power of 2 [187]. This can further reduce the number of effectual terms as long as we allow both addition and subtraction.
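The two term counts can be sketched as follows; `signed_terms` uses the non-adjacent form (NAF), a Booth-style signed-digit recoding, as an illustrative stand-in for the modified Booth encoding of [187], not the thesis's exact encoder:

```python
def effectual_bits(v):
    """Effectual bits of a non-negative value: the 1 bits in Eq. (4.5),
    each contributing one shift-and-add term."""
    return bin(v).count("1")

def signed_terms(v):
    """Signed power-of-two terms in the non-adjacent form (NAF) of a
    non-negative value; allowing subtraction never needs more terms."""
    count = 0
    while v:
        if v & 1:
            count += 1
            v -= 1 if (v & 3) == 1 else -1   # pick digit +1 or -1 so the rest is even
        v >>= 1
    return count

# Example: 0b1110111 (= 119) needs 6 shift-and-add terms bit by bit,
# but only 3 signed terms (128 - 8 - 1).
```

A small delta such as 1 needs a single term either way, which is why shrinking values toward zero shrinks the term count.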

Since convolutional layers process the imap as overlapping windows, a weight w that was multiplied with an activation a for some window I will also be multiplied with the adjacent activation a′ while processing the adjacent window I′ along the X or Y axis. Thus, rather than calculating a′×w directly, we can instead calculate it relative to a×w:

a′×w = (a×w) + (a′−a)×w = (a×w) + (∆a×w)    (4.6)


While the previous equation holds for any two windows I and I′, the spatially closer the two windows are, the more likely their corresponding activations are similar in value. In this case, calculating a′×w from scratch is just a déjà vu of a×w, repeating almost the same long multiplication. Their difference ∆a, however, will be relatively small, with typically fewer effectual terms to process than a or a′. Given that we have already calculated a×w, this approach reduces the amount of work needed for a′×w. Representing the imap using deltas can also reduce its footprint and the amount of information to communicate to and from the compute units, as long as the deltas can be represented using a shorter datatype than the original imap values.

4.2.3 Spatial Correlation in CI-DNNs imaps

This section investigates the extent to which the activations are spatially correlated, and how this correlation can lead to fewer terms being computed if only the deltas, the bits that carry new information, are processed instead of the raw activation values. Fig. 4.2 shows an example demonstrating that the activation values of a CI-DNN feature map are spatially correlated.

Fig. 4.2a shows a heatmap of the raw imap values processed by the third convolutional layer of DnCNN, one of the CI-DNNs we study, while denoising the Barbara image. A heatmap is a graphical representation of data that maps the range of possible values to a range of colors; in our case, smaller values map to darker colors and vice versa. Fig. 4.2a shows that, even though this is an intermediate layer, the image is still discernible. More relevant to our discussion, Fig. 4.2b shows a heatmap of the differences, or the deltas, between activations adjacent along the X-axis. The smaller values of the deltas compared to the raw activations, manifested by the darker pixels of the deltas heatmap, reveal a strong correlation between adjacent activations; the deltas peak only around the edges of the original image. Thus, a delta typically has fewer effectual terms than the corresponding raw activation. Fig. 4.2c shows a heatmap of the number of saved effectual terms (blue) or extra effectual terms (red) of the deltas compared to the corresponding activations. In Section 4.3.3 we propose Differential Convolution, a novel convolution processing technique that translates this property into compute-time savings. For the specific imap shown in the figure, the average number of terms per value is 3.65 for activations and 1.9 for deltas. Thus, there is a potential to reduce the amount of compute work needed by 3.65/1.9 = 1.9×. Savings are higher, reaching up to 6 terms, in areas of homogeneous color. However, deltas do not always yield fewer terms than the raw values: in areas with rapid color changes, such as edges, deltas may have up to 4 more terms than the raw activations. Fortunately, in typical images the former is by far the dominant case.

Fig. 4.3 shows the cumulative distribution of the number of effectual terms per activation and per delta. That is, for each number of effectual terms x, the figure shows the percentage of values that have at most x effectual terms, for both raw activations and deltas. The distribution is measured over all the CI-DNN models and all the image datasets detailed later in Section 4.4.1. The figure shows that the deltas contain considerably fewer effectual terms per value than the raw activations. For example, 92% of the deltas have two terms or fewer, as opposed to 67% of the raw activations. Thus, there is significant potential to reduce the amount of computation needed if the deltas are processed instead of the raw imap. The intersection of the curves with the Y-axis reflects the sparsity of the values, since a value with 0 effectual terms is a zero. The sparsity of the raw imap values is 43%, and it is higher at 48% for the deltas. Thus, processing the deltas improves the potential performance benefit of any technique exploiting activation sparsity.


Figure 4.2: The imap values of CI-DNNs are spatially correlated, and as a result processing deltas instead of raw values reduces work: (a) activation values, (b) deltas along the X-axis, (c) reduction in effectual terms per imap activation. All results are with the Barbara image as input.

Figure 4.3: Cumulative distribution of the number of effectual terms per activation/delta over all considered CI-DNNs and datasets. Average sparsity is shown for raw activations.

Figure 4.4: Potential speedups for DnCNN, FFDNet, IRCNN, JointNet, VDSR and their geometric mean when processing only the effectual terms of the imaps (Raw_E) or of their deltas (∆_E). Speedups are reported over processing all imap terms.


Figure 4.5: Storage needed for three compression approaches (Profiled, RawD16 and DeltaD16) normalized to a fixed-precision storage scheme.

Table 4.1: Profiling-based activation precisions per layer.

Network   | Per-Layer Profile-Derived Precisions
DnCNN     | 9-9-10-11-10-9-10-9-10-10-9-9-9-9-9-9-9-9-11-13
FFDNet    | 10-9-10-10-10-10-10-10-9-9
IRCNN     | 9-9-9-8-7-8-8
JointNet  | 9-9-10-9-9-9-9-9-9-10-9-9-9-9-10-8-10-10-11
VDSR      | 9-10-9-7-7-7-7-7-7-8-8-6-7-8-7-7-7-7-9-8

4.2.4 Computation Reduction Potential

Fig. 4.4 compares the amount of computation that has to be performed by three multiplication processing approaches: 1) ALL, the baseline value-agnostic approach, which processes all product terms of a multiplication between a weight and an activation; 2) Raw_E, a value-aware approach that processes only the effectual terms of the raw activation values; and 3) ∆_E, a value-aware approach that processes only the effectual terms of the activation deltas. Given a fixed-point p-bit representation of activations, a multiplication of a weight and an activation needs p terms for ALL, according to Eq. (4.5). The value-aware approaches, however, pick and process only the effectual terms of the binary representation of the activation value. For each studied CI-DNN model, the figure reports the reduction in the total number of terms processed by Raw_E and ∆_E as a potential speedup over ALL. Results are collected over all the image datasets we detail in the evaluation section.

On average, ∆_E needs to process 18.13× fewer terms than ALL. All CI-DNNs benefit from ∆_E, with work savings over ALL ranging from 11.3× for FFDNet to 69.6× for VDSR. VDSR exhibits much higher imap sparsity (77%) than the other models (around 23% is typical), which explains its much higher potential. Compared to Raw_E, ∆_E can potentially reduce work by 1.72× on average; the smallest potential is 1.59× for VDSR and the largest is 1.95× for DnCNN. This suggests that processing the activation deltas has the potential to further boost performance and energy efficiency even over processing techniques that target the effectual terms of the activations.

4.2.5 Memory Storage and Communication Reduction Potential

The imaps of CI-DNNs occupy far more space than their fmaps. The latter tend to be small, on the order of a few tens of KB, whereas the imaps scale proportionally with the input image resolution and dominate off-chip bandwidth needs. Fig. 4.5 compares the amount of storage needed for the imaps of all layers under four compression approaches:


Figure 4.6: Tile of our Baseline Value-Agnostic Accelerator (VAA).

1) NoCompression, where all imap values are stored using a 16-bit fixed-point representation; 2) Profiled, where, following the methodology of Judd et al. [185], we profiled the target CI-DNNs, found the lowest per-layer precisions (Table 4.1) adequate to preserve 99.9% of the output quality of the floating-point representation, and then used the found precisions to compress the imap values; 3) RawD16, where the imap values are stored using a dynamically detected precision per group of 16 activations [188]; and 4) DeltaD16, where we store the imap values using a per-group dynamically detected precision for the deltas. For each CI-DNN, the figure shows the amount of storage needed by Profiled, RawD16 and DeltaD16 normalized to that of NoCompression. While Profiled reduces the needed storage to 47%−61% of NoCompression, RawD16 further compresses it down to 9.7%−38.6%. Our DeltaD16 leads to storage needs of only 8%−30%, a 23% compression on average over RawD16. The number of bits that need to be communicated depends on the dataflow and the tiling approach used, so we defer these measurements until Section 4.4.5.
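To make the RawD16/DeltaD16 comparison concrete, the sketch below packs each group of 16 values with the dynamically detected precision of its largest member. The 4-bit per-group precision header and the sample data are illustrative assumptions, not the exact format of [188]:

```python
def bits_needed(v):
    """Bits to represent a non-negative magnitude (at least 1)."""
    return max(1, v.bit_length())

def group_storage(values, group=16, signed=False):
    """Total payload bits when each group of `group` values is stored at the
    dynamically detected precision of its largest member, plus an assumed
    4-bit per-group precision header (enough to encode precisions up to 16b)."""
    total = 0
    for i in range(0, len(values), group):
        g = values[i:i + group]
        prec = max(bits_needed(abs(v)) for v in g) + (1 if signed else 0)
        total += 4 + prec * len(g)   # header + packed values
    return total

# Hypothetical correlated activation row (unsigned fixed-point raw values).
raw = [200 + (i % 5) for i in range(64)]
deltas = [raw[0]] + [b - a for a, b in zip(raw, raw[1:])]
raw_bits = group_storage(raw)                    # every group needs 8b values
delta_bits = group_storage(deltas, signed=True)  # mostly tiny signed deltas
```

Only the first group of the delta stream pays for a full-magnitude value (the row's starting point); the remaining groups shrink to the precision of the small deltas, which is where DeltaD16's advantage over RawD16 comes from.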

4.3 Diffy

We first describe our baseline value-agnostic accelerator, which resembles DaDianNao [73]. This well-understood design is an appropriate baseline not only because it is a well-optimized, data-parallel yet value-agnostic design, but also because it is widely referenced in the literature. Thus, taking DaDianNao as a baseline enables rough comparison with the plethora of accelerator designs that have emerged since.

Diffy builds on top of and modifies the Bit-Pragmatic accelerator (PRA), whose execution time is proportional to the number of effectual terms of the imap [80]. Since Diffy targets processing deltas, which we have shown to have fewer effectual terms than the raw activation values, PRA's processing approach can be adapted to translate this phenomenon into performance improvement. However, PRA was designed to process raw activation values and needs to be modified to enable delta processing. Before we describe Diffy, we first review PRA's tile design. Ultimately, we implement the additional functionality Diffy needs with only a modest investment in extra hardware, a major advantage for any hardware proposal. We expect that the proposed techniques can be incorporated into other designs, and this work serves as the necessary motivation for such follow-up investigations. That said, demonstrating the specific implementation is essential and sufficient.


Figure 4.7: A PRA tile. The AM is partitioned along columns.

4.3.1 Baseline Value-Agnostic Accelerator

Figure 4.6 shows a tile of a data-parallel, value-agnostic accelerator (VAA) modeled after DaDianNao [73]. A tile of VAA comprises 16 inner-product units (IPs) operating in parallel over the same set of 16 activations, each producing a partial output activation per cycle. Each cycle, each IP reads 16 weights, one per input activation, calculates 16 products, reduces them via an adder tree, and accumulates the result into an output register. A per-tile Weight Memory (WM) and an Activation Memory (AM) respectively store fmaps and imaps/omaps. The WM and the AM can supply 16×16 weights and 16 activations per cycle, respectively. Each AM can broadcast its values to all tiles. Only one AM slice operates per cycle, and all tiles see the same set of 16 activations. The AM slices are single-ported and banked. For each layer, half of the banks are used for the imaps and the other half for the omaps. Depending on their size, WM and AM can be banked, while weight and activation buffers can be used to hide their latency. An output activation buffer collects the results prior to writing them back to AM. Both activations and weights are read from and written to an off-chip memory. The number of filters, tiles, weights per filter, precision, etc., are all design parameters that can be adjusted as necessary. For clarity, in our discussion we assume that all data per layer fits in WM and AM; the evaluation, however, considers the effects of limited on-chip storage and of the external memory bandwidth.
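Functionally, one cycle of such a tile reduces to a 16×16 matrix-vector product against the broadcast activations. The following is a behavioral model only (our sketch, ignoring memory timing and banking):

```python
import numpy as np

def vaa_tile_cycle(activations, weights, accumulators):
    """One VAA tile cycle: the same 16 activations are broadcast to 16
    inner-product units; IP i multiplies them with its own 16 weights,
    reduces them through its adder tree, and accumulates the result.
    activations: (16,), weights: (16, 16), accumulators: (16,)."""
    assert activations.shape == (16,) and weights.shape == (16, 16)
    return accumulators + weights @ activations
```

Calling this once per brick of 16 input channels accumulates the 16 partial output activations of Eq. (4.4) for 16 different filters.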

4.3.2 Value-Aware Accelerator

Our Diffy implementation builds upon the Bit-Pragmatic accelerator (PRA) [80]. PRA processes activations "bit"-serially, one effectual term at a time. Offset generators convert the activations into a stream of effectual powers of two after applying a modified Booth encoding. Each cycle, PRA multiplies a weight with a term, a signed power of two, using a shifter; the term's sign determines whether to add or subtract the shifted weight. PRA always matches or exceeds the throughput of an equivalent VAA by concurrently processing 16 activation windows while reusing the same set of weights.

Figure 4.7 shows a PRA tile. The 16 IP units have been expanded into a grid of 16×16 simpler Serial IP units (SIPs). Each SIP column corresponds to a different window. Each cycle, the tile broadcasts a set of 16 activation


Figure 4.8: Differential Convolution example: deltas are processed instead, leading to shorter computations. A 2×2 filter [[2,1],[3,2]] slides over three overlapping windows; direct convolution produces the outputs 373, 388 and 386 from the raw windows, while differential convolution computes 373 directly from window 0 and obtains 388 and 386 by adding the delta inner products 15 and −2.

terms to each SIP column, for a total of 256 activation terms per cycle. PRA restricts the distance of concurrently processed terms to better balance area and performance. Accordingly, each term needs 4 bits: 2 bits for the power of 2, a sign bit, and a valid bit. Each SIP has a 16-input adder tree and, instead of 16 multipliers, 16 shifters, each of which shifts its 16b weight input as directed by the term. All SIPs along the same row share the same set of 16 weights. While VAA processes 16 activations per cycle, PRA processes 256 activations term-serially. The dataflow we use for VAA processes {a(c,x,y), ..., a(c+15,x,y)} (a brick a_B(c,x,y) in PRA's terminology) concurrently, where (c MOD 16) = 0. PRA processes {a_B(c,x,y), a_B(c,x+1,y), ..., a_B(c,x+15,y)} (a pallet in PRA's terminology) concurrently and over multiple cycles.

4.3.3 Differential Convolution

Formally, given an output activation o(n,y,x) that has been computed directly as per Eq. (4.4), and exploiting the distributive and associative properties of multiplication and addition, it is possible to compute o(n,y,x+1) differentially as follows:

o(n,y,x+1) = o(n,y,x) + ⟨w_n, ∆a⟩    (4.7)

where ∆a holds the element-wise deltas of the imap windows corresponding to o(n,y,x+1) and o(n,y,x):

∆a(k,j,i) = a(k, j+y×S, i+(x+1)×S) − a(k, j+y×S, i+x×S)

where S is the stride between the two imap windows. The above method can be applied along the H or the W dimension and, in general, along any traversal sequence through the imap. A design can choose an appropriate ratio of output calculations to compute directly as per Eq. (4.4) versus differentially as per Eq. (4.7). Furthermore, the design can choose a convenient dataflow.

Fig. 4.8 shows an example of Differential Convolution processing: it shows how Diffy would apply a 2×2 filter on three consecutive activation windows using differential convolution of the deltas, as opposed to direct convolution of the raw activation values.
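The identity of Eq. (4.7) can be checked against direct evaluation in a few lines (a sketch; integer data keeps the comparison exact):

```python
import numpy as np

def direct(imap, filt, y, x, S=1):
    """o(n,y,x) computed directly per Eq. (4.4) for one filter filt (C x HF x WF)."""
    C, HF, WF = filt.shape
    return np.sum(filt * imap[:, y*S:y*S+HF, x*S:x*S+WF])

def differential(imap, filt, y, x, o_prev, S=1):
    """o(n,y,x+1) per Eq. (4.7): the previous output plus <w_n, delta_a>."""
    C, HF, WF = filt.shape
    cur = imap[:, y*S:y*S+HF, (x+1)*S:(x+1)*S+WF]
    prev = imap[:, y*S:y*S+HF, x*S:x*S+WF]
    return o_prev + np.sum(filt * (cur - prev))

rng = np.random.default_rng(1)
imap = rng.integers(0, 16, (3, 5, 7))     # C x H x W
filt = rng.integers(-3, 4, (3, 3, 3))     # C x HF x WF
o0 = direct(imap, filt, 0, 0)
o1 = differential(imap, filt, 0, 0, o0)   # differential result
```

Chaining `differential` along a row reproduces every direct output from one initial inner product, mirroring the figure's three-window walk-through.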


4.3.4 Delta Dataflow

In the designs we evaluate, we choose to calculate only the leftmost output activation of each row using normal convolution with the raw input activations, while the remaining outputs along the row are calculated using differential convolution of the deltas of the input activations. We do so because this is compatible with designs that buffer two complete rows of imap windows on chip. This dataflow strategy reduces off-chip bandwidth when it is not possible to store the full imap on chip.

Timing-wise, Diffy calculates each output row in two pipelined phases. During the first phase, Diffy calculates the leftmost output of the row in parallel with calculating the ⟨w_n, ∆a⟩ terms for the remaining outputs of the row. During the second phase, starting from the leftmost output, Diffy propagates the direct components in a cascaded fashion; a single addition per output is all that is needed. Given that the bulk of the time goes to processing the leftmost inner product and the ⟨w_n, ∆a⟩ terms, a set of adders provides sufficient compute bandwidth for the second phase. Each phase can process the whole row, or part of the row, to balance the number of adders and buffers needed.
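Using the worked numbers from the Fig. 4.8 example, the second phase reduces to a cascade of single additions over the phase-1 results (a behavioral sketch):

```python
def reconstruct_row(leftmost, delta_products):
    """Phase 2 of the per-row dataflow: starting from the directly computed
    leftmost output, cascade one addition per remaining output to turn the
    <w, delta_a> partial results of phase 1 into final outputs."""
    outputs = [leftmost]
    for dp in delta_products:
        outputs.append(outputs[-1] + dp)   # a single add per output
    return outputs

# Fig. 4.8: the leftmost window yields 373 directly; the delta inner
# products for the next two windows are 15 and -2.
assert reconstruct_row(373, [15, -2]) == [373, 388, 386]
```

The cascade is a running sum, so the adders only need to keep pace with the much slower phase-1 inner-product computation.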

4.3.5 Diffy Architecture

There are two possible design alternatives for Diffy: one that computes deltas on the fly at the input, and another that computes deltas at the output of the SIPs prior to writing to the AM. The latter has the potential to reduce the needed AM capacity and bandwidth if the output activations are stored in compressed format, so we opted for it.

Diffy modifies the PRA architecture by introducing: 1) a Differential Reconstruction Engine (DR) per SIP, as shown in Fig. 4.9, and 2) a Delta Output Calculation engine (Deltaout) per tile, as shown in Fig. 4.10. The DR engines reconstruct the original outputs after the differential convolution of the activation deltas finishes. The reconstruction proceeds in a cascaded fashion across the tile columns, per row, as follows. Consider the processing of the first 16 windows of a layer. The imap windows are fed into the 16 columns as deltas, except for the very first window of a row, which is given raw values. The SIPs in columns 1-15 process their assigned windows differentially, while the SIPs of column 0 do so normally. When the SIPs of column 0 finish computing their current output brick, they pass it along through their ABout to the SIPs of column 1. The column 1 SIPs can then update their differential outputs to their normal values. They then forward their results to the column 2 SIPs, and so on, in a round-robin fashion along the columns. Since processing the next set of 16 windows typically takes hundreds of cycles, there is plenty of time to read the output activations out and pass them through the activation function units. The multiplexer per DR has the following uses: 1) it allows the first window of a row, which is calculated using the raw imap values, to be written unmodified to ABout; 2) it allows intermediate results to be written to ABout if needed to support other dataflow processing orders; and 3) it makes it possible to revert to normal convolution of raw activations for those layers whose performance might be negatively affected by differential convolution.

We have investigated two schemes for calculating the deltas. The first stores imaps raw in AM and calculates the deltas as the values are read out. We do not present this scheme for two reasons: first, it recomputes deltas any time values are read from AM; second, it does not take advantage of deltas to reduce on-chip storage and communication. We instead present the scheme where the deltas are calculated at the output of each layer, once per omap value, and stored as such in AM.

Fig. 4.10 shows the architecture of the Deltaout engine that Diffy uses to write the output bricks of the current layer back to the AM in the delta format. Deltaout computes the delta bricks of columns 0 to 15 one at a time to reuse the hardware. Assuming the next layer's stride is Snext, computing the delta brick for column Colout is done in

Page 64: by Mostafa Mahmoud...I will never be able to honor the favours of my larger family; my parents Dr. Mahmoud and Dr. Amany and my sisters Marwa and Menna. Without their help and support,

CHAPTER 4. DIFFY: ENABLING ML-BASED COMPUTATIONAL IMAGING 53


Figure 4.9: Diffy tile reconstructing differential convolution output.


Figure 4.10: Diffy’s Deltaout engine computing deltas for next layer.

two steps: 1) reading the output brick Snext columns to the left of Colout from the corresponding ABout, passing it through the activation function f, then storing it in the Bricks buffer. This process might need to wrap around to read an output brick corresponding to a previous pallet, depending on the stride Snext. 2) Reading the brick of column Colout from the corresponding ABout, passing it through the activation function f, and computing the delta brick using element-wise subtractors before writing the results to the AM. The 16-to-1 multiplexer controls from which ABout to read at each step (the multiplexer is implemented as a read port across the ABouts of a row). For example, if Snext = 2 and we want to compute the delta brick for column Colout = 0, we need the output brick of column 14, which belongs to the previous pallet of output bricks. Thus, for step 1, the selection lines Colselect are set to (Colout − Snext) MOD 16, while for step 2 they are set to Colout. Each ABout can store up to 4 output bricks corresponding to 4 consecutive output pallets. Thus, Diffy can handle any stride up to 48, which is far beyond what is needed by current models.
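The two-step brick selection and the delta computation can be sketched as follows (the function names are ours; f stands in for the activation function, ReLU here):

```python
def colselect(col_out, s_next, step):
    # Step 1 reads the brick s_next columns to the left, wrapping around
    # into the previous pallet of 16 bricks; step 2 reads col_out itself.
    return (col_out - s_next) % 16 if step == 1 else col_out

def delta_brick(curr_brick, left_brick, f=lambda x: max(x, 0)):
    # Element-wise subtraction after applying the activation function f.
    return [f(c) - f(l) for c, l in zip(curr_brick, left_brick)]
```

For the example in the text (Snext = 2, Colout = 0), step 1 selects column (0 − 2) mod 16 = 14 and step 2 selects column 0.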

4.3.6 Memory System

For per-pixel models it is imap/omap storage and traffic that dominates, for the following reasons: 1) The models naturally scale with the input image resolution, which is typically much higher than that used by most publicly available image classification models (for example, ImageNet models process image frames of roughly


Table 4.2: CI-DNNs studied.

Model                                 DnCNN      FFDNet     IRCNN      JointNet                  VDSR
Application                           Denoising  Denoising  Denoising  Demosaicking + Denoising  Super-resolution
Conv. Layers                          20         10         7          19                        20
ReLU Layers                           19         9          6          16                        19
Max Filter Size (KB)                  1.13       1.69       1.13       1.13                      1.13
Max Total Filter Size per Layer (KB)  72         162        72         144                       72

Table 4.3: Input Datasets Used

Dataset     Samples  Resolution         Description
CBSD68      68       481×321            test section of the Berkeley data set [157, 190]
McMaster    18       500×500            CDM dataset, modified McMaster [191]
Kodak24     24       500×500            Kodak dataset [192]
RNI15       15       370×280 – 700×700  noisy images covering real noise such as from the camera or from JPEG compression [48, 193, 194]
LIVE1       29       634×438 – 768×512  widely used to evaluate super-resolution algorithms [195, 196, 197, 198]
Set5+Set14  5 + 14   256×256 – 720×576  images used for testing super-resolution algorithms [68, 199, 200]
HD33        33       1920×1080          HD frames depicting nature, city and texture scenes [174]

230×230 resolution). 2) While all layers maintain the resolution, most intermediate layers increase the number of channels. 3) The fmaps are comparatively small, do not increase with resolution, and may be dilated (e.g., a 3×3 filter expanded to 9×9 by adding zero elements). Accordingly, it may not be reasonable to assume that the imaps/omaps can fit on chip, and off-chip bandwidth becomes a major concern. Similarly, it may not be reasonable to assume that we can fit the fmaps on chip for the full model. For this reason, an effective dataflow that utilizes fmap and imap/omap reuse is essential.

Diffy opts for an off-chip strategy that reads each weight or input activation once per layer, and that writes each output activation at most once per layer. For this purpose, the AM is sized to fit two complete rows of activation windows plus two output rows. This way Diffy can process the windows of one row from on-chip storage (which requires buffering a row of output activations), while loading the activations for the next row of windows from off-chip memory, and while simultaneously writing the previous row of output activations to off-chip memory (which requires buffering another row of output activations). For the fmaps, the WM is sized to be large enough to hold all fmaps that will be processed concurrently. This depends on the number of fmaps per tile and the number of tiles. To completely hide the loading of the next set of fmaps, we need the buffer to also have space for the next set of fmaps of the same layer or for the first set of the next layer. Section 4.4.5 shows that the AM and WM memories needed are reasonable for today's technologies and demonstrates that delta encoding can further reduce their size. If smaller WM and AM are desired, off-chip bandwidth will increase. Yang et al. present an algorithm for determining energy-efficient blocking dataflows which can be adapted for our purposes [189]. To reduce off-chip traffic, Diffy encodes activations as deltas using a dynamically detected precision per group of activations instead of the raw values. Dynamic precision detection follows the technique proposed in Dynamic Stripes [188].
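The delta encoding with per-group dynamic precision can be sketched as follows. This is our simplification (sign-magnitude precision per group and an assumed 4-bit precision field); the exact bit-level detection follows Dynamic Stripes [188]:

```python
def encode_delta_d16(activations, group=16):
    """Replace each activation with its delta from the left neighbor, then
    store each group of `group` deltas at one dynamically detected
    precision. Returns (payload_bits, metadata_bits)."""
    deltas = [activations[0]] + [b - a for a, b in zip(activations, activations[1:])]
    payload, n_groups = 0, 0
    for i in range(0, len(deltas), group):
        chunk = deltas[i:i + group]
        # sign bit plus the magnitude bits of the widest delta in the group
        prec = 1 + max(abs(d) for d in chunk).bit_length()
        payload += prec * len(chunk)
        n_groups += 1
    return payload, n_groups * 4   # assume a 4-bit precision field per group
```

With spatially smooth activations the deltas are small, so most groups need only a few bits per value instead of the full-width raw representation.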

4.4 Evaluation

We evaluate Diffy and compare it to two previously proposed accelerators, VAA [73] and PRA [80], in terms of performance, area, power consumption, and energy efficiency.


Table 4.4: VAA, PRA and Diffy configurations.

VAA, PRA and Diffy (common):
  Tiles             4       WM/Tile              8KB×16 banks
  Filters/Tile      16      ABin/ABout per Tile  2KB
  Weights/Filter    16      Off-chip Memory      4GB LPDDR4-3200
  Tech Node         65nm    Frequency            1GHz
Diffy:        AM/Tile  8KB×16 banks  = 128KB
VAA and PRA:  AM/Tile  16KB×16 banks = 256KB


Figure 4.11: PRA and Diffy performance normalized to VAA.

4.4.1 Methodology

Table 4.2 details the CI-DNNs we study. DnCNN [47], FFDNet [48] and IRCNN [46] are state-of-the-art image denoising DNN models that rival the output quality of non-local similarity-based methods such as BM3D [11] and WNNM [37]. JointNet performs demosaicking and denoising. VDSR is a 20-layer DNN model that delivers state-of-the-art quality single-image super-resolution [68]. Some of these models can be retrained to perform other image tasks. For example, the same DnCNN network architecture can be trained to perform single-image super-resolution and JPEG deblocking as well [49], and IRCNN can be trained to perform in-painting and deblurring [46]. IRCNN uses dilated filters where the effective filter size is 3×3 but scaled up to be 5×5, 7×7 or 9×9 with the gaps being filled with zeros. Since the perforation of dilated filters is regular and known in advance, all the designs we consider can seamlessly avoid the unnecessary computations involving these zero weights. Table 4.3 shows the image datasets we used for evaluating the studied architectures and CI-DNN models.
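The dilation arithmetic above is straightforward to sketch (a generic formula, not code from any of the designs):

```python
def effective_size(k, d):
    # A k×k filter dilated by factor d spans k + (k - 1) * (d - 1)
    # positions per dimension; the inserted positions hold zero weights.
    return k + (k - 1) * (d - 1)

def weight_offsets(k, d):
    # Only every d-th tap carries a real weight: a static pattern that
    # lets all the studied designs skip the zero taps.
    return [i * d for i in range(k)]
```

With k = 3 this reproduces the 5×5, 7×7 and 9×9 effective footprints quoted for IRCNN (dilation factors 2, 3 and 4).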

To measure performance, all architectures are modeled using cycle-accurate simulators. Table 4.4 reports the default configurations. For area and power consumption, all designs were implemented in Verilog and synthesized with the Synopsys Design Compiler [201]. Layout was performed using Cadence Innovus [202] for a 65nm TSMC technology, which was the best technology available to us. While not the most recent, it is still relevant, especially for accelerator designs, and the measurements presented enable reasonable projections to newer technologies. For power estimation we used Mentor Graphics ModelSim to capture circuit activity and used that as input to Innovus. We use CACTI to model the area and power consumption of the on-chip SRAM memories and buffers, as an SRAM compiler is not available to us. The accelerator target frequency is set to 1GHz given CACTI's estimate for the speed of the buffers. We study the effect of off-chip memory by considering several technologies from DDR3-1600 up to HBM2.


4.4.2 Relative Performance

Since the design space is large, we start by investigating performance for HD images and with a DDR4-3200 off-chip memory interface, which is a typical memory grade in use today in high-end mobile devices. First, we look at relative performance. Fig. 4.11 shows the performance of PRA and of Diffy normalized to VAA while taking into account the following off-chip compression schemes: a) no compression (NoCompression), b) encoding using a per-layer profile-derived precision (Profiled) [185], c) per-group dynamic precision for groups of 16 activations, where activations are stored as deltas (DeltaD16), and d) infinite off-chip bandwidth (Ideal). Since off-chip bandwidth is not a bottleneck for VAA, compression can only improve its energy efficiency, not its performance. Compression, however, enables PRA and Diffy to deliver their full performance benefits. Specifically, PRA can ideally improve performance by 5.1× over VAA. However, depending on the model, PRA needs the Profiled or the DeltaD16 compression to avoid stalling noticeably due to off-chip memory transfers. The DeltaD16 scheme is necessary for JointNet and VDSR. Under this scheme PRA achieves almost the ideal speedup, at 5× over VAA.

Diffy outperforms both VAA and PRA by 7.1× and 1.41× respectively on average. As with PRA, it needs DeltaD16 compression to avoid stalling noticeably for off-chip memory. It is only for JointNet that off-chip memory stalls remain noticeable, at about 8.2%. Benefits with both PRA and Diffy are nearly double for VDSR compared to the other models. VDSR exhibits high activation sparsity in the intermediate layers and requires shorter precisions compared to the other models. In general, the benefits are proportional to but lower than the potential measured in Section 4.2. There are two reasons why the full potential is not achieved: underutilization due to the number of filters available per layer, and cross-lane synchronization due to imbalance in the number of effectual terms per activation. The latter is the most significant.

4.4.3 Per-Layer Analysis

Fig. 4.12 reports a per-layer breakdown of lane utilization for Diffy in the following categories: a) useful cycles, b) idle cycles, which may be due to cross-lane synchronization or to filter underutilization, and c) stalls due to off-chip delays. Utilization varies considerably per layer and per network. Off-chip delays appear noticeably only for certain layers of FFDNet and JointNet. For FFDNet these layers account for a small percentage of overall execution time and hence do not impact overall performance as much as they do for JointNet. Utilization is very low for VDSR. This is due to cross-lane synchronization, since VDSR has high activation sparsity and the few non-zero activations dominate execution time. The first layer of all networks incurs relatively low utilization because the input image has 3 channels; thus 13 out of the 16 available activation lanes are typically idle. FFDNet is an exception, since the input to its first layer is a 15-channel feature map: the input image is pre-split into 4 tiles stacked along the channel dimension, with 3 extra channels describing the noise standard deviation of each of the RGB color channels. Also, the last layer of all networks exhibits very low utilization. This layer produces the final 3-channel output and has only 3 filters, and thus can keep only 3 of the 64 filter lanes busy. Allowing each tile to use its activations locally could enable Diffy to partition the input activation space across tiles and to improve utilization. Moreover, idle cycles could be used to reduce energy. However, these optimization techniques are left for future work.

The per-layer relative speedup of Diffy over PRA is fairly uniform, with a mean of 1.42× and a standard deviation of 0.32. Diffy underperforms PRA only on a few noncritical layers in JointNet and VDSR, and there by at most 10%. We have experimented with a variant of Diffy that uses profiling to apply differential convolution selectively per layer and only when it is beneficial. While this eliminated the few per-layer slowdowns compared to PRA, the overall improvement was negligible (below 1% at best).


Figure 4.12: Execution time breakdown per layer for each network showing the utilized cycles, idle cycles and memory stalls. Panels: (a) DnCNN, (b) FFDNet, (c) IRCNN, (d) JointNet, (e) VDSR.

4.4.4 Absolute Performance: Frame Rate

We next report absolute performance measured as the number of frames processed per second (FPS). Fig. 4.13 reports FPS for all resolutions but HD (1920×1080), which is shown in Fig. 4.14. Fig. 4.13 shows that for the lower resolution images, Diffy can achieve real-time processing of 30 or more FPS for all models except for DnCNN when used with resolutions above 0.25MP. Even for DnCNN, Diffy can process 19 FPS at 0.4MP frames. These results suggest that Diffy can be used for applications where processing lower resolution frames is sufficient.

Since our interest is in HD resolution processing, Fig. 4.14 reports detailed measurements for this resolution. The figure shows the FPS for VAA, PRA, and Diffy, where for each accelerator configuration we use the best off-chip compression scheme as per Fig. 4.11. The results show that Diffy can robustly boost the FPS over PRA and VAA. The achievable FPS varies depending on the image content, with the variance being ±7.5% and ±15% of the average FPS for PRA and Diffy respectively.



Figure 4.13: Frames per second processed by Diffy as a function of image resolution. HD frame rate is plotted in Fig. 4.14.


Figure 4.14: HD frames per second processed by VAA, PRA and Diffy with different compression schemes.

While Diffy is much faster than the alternatives, it is only for JointNet that the FPS is near the real-time 30 FPS rate. For real-time performance to be possible, more processing tiles are needed. Section 4.4.9 explores such designs. With the current configuration, Diffy is more appropriate for user-interactive applications such as photography with a smartphone.

4.4.5 Compression and Off-Chip Memory

We study the effect of delta encoding on on-chip storage and off-chip traffic. Table 4.5 reports the total on-chip storage needed for fmaps and imaps/omaps. The total weight memory needed for these networks is 324KB, which can be rounded up to 512KB, or 128KB per tile for a four-tile configuration. Since our DeltaD16 scheme targets activations, the table reports the total activation memory size needed for three storage schemes that mirror the compression schemes of Fig. 4.5. Our DeltaD16 can reduce the on-chip activation memory or boost its effective capacity. Without any compression the AM needs to be 964KB. Profiled reduces the storage needed by 19% to 782KB, whereas RawD16 reduces AM by 46% to 514KB. Unfortunately, if we were to round all these up to the

Table 4.5: Minimum on-chip memory capacities vs. encoding scheme.

Mem. Type  Baseline  Profiled  RawD16  DeltaD16
AM SRAM    964KB     782KB     514KB   348KB
WM SRAM    324KB (unaffected by activation encoding)



Figure 4.15: Off-chip traffic of different compression schemes normalized to no compression.


Figure 4.16: Performance of Diffy with different off-chip DRAM technologies and activation compression schemes.

next power-of-two sized capacity, they would all lead to a 1MB AM. Finally, using our proposed DeltaD16 reduces AM to just 348KB, a 64% reduction over the baseline, which we can round up to 512KB as needed. Regardless of the rounding scheme used, our DeltaD16 compression considerably reduces the on-chip AM capacity that is required. For the rest of the evaluation we round up the AM capacity to the nearest higher power of two.

Fig. 4.15 reports off-chip traffic normalized to NoCompression. Profiled reduces off-chip traffic to about 54%. Using dynamic per-group precisions reduces off-chip traffic further. Using a group of 256 activations (RawD256) reduces off-chip traffic to 39% on average, whereas using smaller groups of 16 (RawD16) or 8 (RawD8) activations reduces traffic to about 28%. Using smaller group sizes increases the overhead of the metadata, as we need to specify the precision per group. Accordingly, the difference between a group of 16 and a group of 8 is negligible. Storing activations as deltas with per-group precision (DeltaD16) further reduces off-chip traffic, resulting in just 22% of the uncompressed traffic, an improvement of 27% over RawD16. Since off-chip accesses are two orders of magnitude more expensive than on-chip accesses, this reduction in off-chip traffic should greatly improve overall energy efficiency. While using a group size of 16 (DeltaD16) reduces traffic considerably compared to using a group size of 256 (DeltaD256), the metadata overhead prevents further reduction with the smaller group size (DeltaD8). In the rest of the evaluation we restrict attention to DeltaD16 for on-chip and off-chip encoding of imaps/omaps.
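The group-size tradeoff can be illustrated with a toy model. This is our simplification (one assumed 4-bit precision field per group, sign-magnitude precisions); the traffic in Fig. 4.15 is measured, not modeled this way:

```python
def encoded_bits(deltas, group, prec_field=4):
    """Total bits when each group of `group` deltas shares one dynamically
    detected precision plus one metadata precision field per group."""
    total = 0
    for i in range(0, len(deltas), group):
        chunk = deltas[i:i + group]
        # sign bit plus the magnitude bits of the widest delta in the group
        prec = 1 + max(abs(d) for d in chunk).bit_length()
        total += prec * len(chunk) + prec_field
    return total
```

A single wide delta forces a large precision on its whole group, so smaller groups confine the damage; but every extra group costs metadata, so shrinking groups below 16 stops paying off, mirroring the RawD8/DeltaD8 behavior above.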

Finally, we study the effect of the off-chip memory on overall performance by considering six memory technologies, ranging from the now low-end LPDDR3-1600 up to the high-end HBM2. We do so to demonstrate that our off-chip compression scheme is essential to sustain the performance gains with realistic off-chip memory.


Table 4.6: Power [W] consumption breakdown for Diffy vs. PRA and VAA.

                   Diffy           PRA             VAA
                   Power    %      Power    %      Power    %
Compute            11.77    86.67  10.80    82.75  1.90     54.42
AM                 0.79     5.85   1.36     10.45  1.36     38.96
WM                 0.37     2.73   0.27     2.07   0.08     2.29
ABin+ABout         0.15     1.12   0.15     1.16   0.15     4.33
Dispatcher         0.25     1.85   0.25     1.92   -        -
Offset Gens.       0.21     1.58   0.21     1.64   -        -
Deltaout           0.03     0.21   -        -      -        -
Total              13.58    100%   13.05    100%   3.50     100%
Normalized         3.88×           3.73×           1×
Energy Efficiency  1.83×           1.34×           1×

Fig. 4.16 shows the performance of Diffy normalized to VAA with the compression schemes shown as stacked bars. Without any compression, all models require at least an HBM2 memory to incur no slowdown. JointNet and VDSR are the most sensitive, since even with Profiled and the high-end LPDDR4X-4267 they sustain only 77% and 68% of their maximum performance respectively. The other networks perform within 2.5% of their peak. For any of the less capable memory nodes, the performance slowdowns are much more noticeable. Our DeltaD16 allows all networks to operate at nearly their maximum for all memory nodes starting from LPDDR4-3200, where only JointNet incurs an 8.2% slowdown. Even with the LPDDR3E-2133 node, performance with DeltaD16 is within 2% of the maximum possible for all networks except JointNet, which is within 22% of its maximum. With a 2-channel LPDDR4X-4267 memory system, Diffy sustains only 87% and 65% of its maximum performance for JointNet and VDSR under no compression. With Profiled, two channels of LPDDR4X-3733 are sufficient to preserve 94% and 98% of their performance respectively. Finally, with DeltaD16 and a dual-channel LPDDR3E-2133 memory system, VDSR incurs no slowdowns while JointNet performs within 5% of its maximum.

4.4.6 Power, Energy Efficiency, and Area

Table 4.6 reports a breakdown of power for the three architectures. While both PRA and Diffy consume more power, their speedup is higher than the increase in power, and thus they are 1.34× and 1.83× more energy efficient than VAA respectively. Even if Diffy used the same uncompressed on-chip 1MB AM as PRA or VAA, it would still be 1.76× more energy efficient with 1.57× the area of the baseline VAA. Thus, it would still be preferable over a scaled-up VAA (performance would not scale linearly for VAA: it would suffer more from underutilization and would require wider WMs). Moreover, these measurements ignore the off-chip traffic reduction achieved by Diffy. As off-chip accesses are orders of magnitude more expensive than on-chip accesses and computation, the overall energy efficiency for Diffy will be higher. The power and area of the compute core of Diffy are higher than PRA's due to the additional DR engines.

Table 4.7 reports a breakdown of area for the architectures. Since Diffy uses DeltaD16 for the AM, its overall overhead over VAA is lower than PRA's; furthermore, its area overhead is far less than its performance advantage. Section 4.4.10 presents an iso-area comparison between the three designs.

4.4.7 Sensitivity to Tiling Configuration

Fig. 4.17 reports the performance for different tile configurations Tx, where x is the number of (weight × activation) pairs processed concurrently per filter. So far we considered only T16. The speedup of Diffy over VAA, when


Table 4.7: Area [mm²] breakdown for Diffy vs. PRA and VAA.

              Diffy           PRA             VAA
              Area     %      Area     %      Area     %
Compute       15.50    53.05  14.49    40.19  3.36     14.26
AM            6.05     20.70  13.93    38.62  13.93    59.11
WM            6.05     20.70  6.05     16.77  6.05     25.67
ABin+ABout    0.23     0.77   0.23     0.62   0.23     0.96
Dispatcher    0.37     1.28   0.37     1.04   -        -
Offset Gens.  1.00     3.42   1.00     2.77   -        -
Deltaout      0.02     0.09   -        -      -        -
Total         29.22    100%   36.07    100%   23.56    100%
Normalized    1.24×           1.53×           1×


Figure 4.17: Performance sensitivity to the number of terms processed concurrently per filter.

both are configured with T1, is 11.9× on average. This is higher than the 7.1× speedup achieved when both engines are configured with T16. Increasing the number of concurrently processed pairs exaggerates the cross-lane synchronization problem discussed in Section 4.4.3. When many pairs are processed concurrently, the pair whose activation has the largest number of effectual terms dominates the execution time, introducing idle cycles for the other processing lanes with fewer effectual terms per activation. The T1 configuration does not suffer these idle cycles and realizes most of the potential speedup for all models except VDSR, due to its high activation sparsity. This finding does not imply that Diffy with the T1 configuration is faster than Diffy with T16; it is just that the speedup of Diffy over VAA is higher when both are configured with a smaller Tx.
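The cross-lane effect can be captured by a toy cycle model (ours, not the cycle-accurate simulator used for the results):

```python
def group_cycles(terms_per_activation, lanes):
    # Lanes in a group proceed in lockstep, so each group of `lanes`
    # activations costs as many cycles as its worst activation's number
    # of effectual terms. lanes=1 models T1, lanes=16 models T16.
    t = terms_per_activation
    return sum(max(t[i:i + lanes]) for i in range(0, len(t), lanes))
```

For example, with effectual-term counts [1, 1, 1, 8] repeated four times, a 16-wide group takes 8 cycles where the work alone would need only ceil(44/16) = 3, while with one-wide groups every lane-cycle does useful work.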

Although it avoids cross-lane synchronization, scaling the processing lane width down to T1 has two drawbacks: 1) it does not amortize the hardware and energy cost associated with partial sum accumulation as efficiently as a reasonably wide processing lane, and 2) it has to be compensated for by scaling up other dimensions of the compute structures if we were to keep the same total throughput of the accelerator. Dimensions that might be scaled up for this purpose include: 1) the number of windows processed per tile (the number of columns per tile), 2) the number of filters processed per tile (the number of rows per tile), or 3) the number of tiles. Scaling up the number of windows requires the weight wires to travel a longer distance to span more columns. Moreover, the slowest window will dominate execution time, leading to a similar cross-window synchronization phenomenon. Scaling the number of filters per tile or the number of tiles leads to underutilization for layers that do not have enough filters to keep the filter lanes busy. Due to these drawbacks, we found that a T1 configuration incurs higher hardware and energy overheads compared to the T16 configuration we used.



Figure 4.18: Potential speedups for Diffy_y compared to Diffy and PRA over VAA.


Figure 4.19: Tiles needed for Diffy to sustain real-time HD processing along with the needed memory system for each compression scheme. D, P and N stand for DeltaD16, Profiled and NoCompression respectively.

4.4.8 Sensitivity to Dataflow Direction

So far, we assumed the input activation window slides horizontally from left to right. Fig. 4.18 reports the potential speedup with Diffy_y, which applies differential convolution along the vertical dimension instead. Performance is nearly identical except for a modest improvement for VDSR under Diffy_y, suggesting that the benefits of differential convolution are robust regardless of the dataflow direction. Studying alternative dataflows, such as the diagonally sliding window used in Eyeriss [77], is left for future work.

4.4.9 Scaling for Real-time HD Processing

To achieve real-time 30 FPS processing of HD frames, Diffy needs to be scaled up as shown in Fig. 4.19. For each model and each compression scheme we report the minimum configuration needed, that is, the number of tiles (Y-axis) and the external memory system (X-axis). The X-axis reports the off-chip memory configuration as v-r-x, where v is the DDR version, r is the transfer rate and x is the number of channels. For example, 3-1600-2x is a DDR3-1600 dual-channel memory system. As the figure shows, DnCNN is the most demanding, requiring 32 tiles and an HBM2 memory under DeltaD16 and Profiled, or an HBM3 memory otherwise. The second most demanding



Figure 4.20: Iso-area PRA and Diffy performance normalized to VAA.

model is VDSR, which needs 16 tiles and a lower-cost 3E-2133-2x memory system under the DeltaD16 compression due to the activation sparsity of that model. FFDNet and JointNet need 8 tiles with a 3-1600-2x memory system, while IRCNN needs 12 tiles with a 3E-2133-2x memory system under our DeltaD16 compression.

4.4.10 Iso-area Comparison

Iso-area is one way to compare several architectures and to study how they stack up against each other under fair assumptions accepted by the computer architecture community: that is, given a constant silicon budget, what performance and energy efficiency can the studied architectures deliver. We scale both Diffy and PRA down to the same area as the baseline VAA. As explained in Section 4.3.6, the capacity of the activation (AM) and weight (WM) memories is mainly dictated by the input image dimensions and the architecture of the DNN models respectively. Hence, we opted to scale down the compute cores of Diffy and PRA by reducing the number of columns per tile and thus the number of concurrently processed windows. This is equivalent to taking a vertical slice of each tile. There are other approaches to scale down the compute cores, such as keeping the number of columns unchanged while reducing the number of filters processed per tile, i.e., taking a horizontal slice of each tile. The differences between the two approaches in terms of area and power of the compute cores are negligible. Following the first scaling approach, Diffy could keep 11 out of the 16 columns while PRA could keep only 3 columns. As illustrated in Section 4.4.5, Diffy with its delta encoding for activations needs a 512KB AM, while PRA and VAA need 1MB. Thus, Diffy could devote more of the area budget to the compute cores with less area overhead compared to PRA.

Fig. 4.20 shows the speedup of the scaled Diffy and PRA over the baseline VAA. On average, Diffy still achieves a speedup of 4.9×, while PRA incurs a slight slowdown with 0.96× relative performance. Our results show that the average number of terms per activation value exceeds 3, so PRA with 3 columns could barely maintain the performance of the single-column baseline. In terms of energy efficiency, Diffy maintains almost the same energy efficiency under iso-area constraints at 1.84× relative to the baseline VAA, while PRA is notably less energy efficient at 0.88× of the baseline.

4.4.11 Classification DNN Models

While Diffy targets CI-DNNs, it can execute any CNN. Since classification remains an important workload, we run several well-known ImageNet classification models on VAA, PRA and Diffy. We include FCN_Seg, a per-pixel prediction model performing semantic segmentation, another form of classification where pixels are


Table 4.8: Classification DNNs: Profiling-based activation precisions per layer.

Network    Per-Layer Profile-Derived Precisions
AlexNet    9-8-5-5-7
GoogLeNet  10-8-10-9-8-10-9-8-9-10-7
NiN        8-8-8-9-7-8-8-9-9-8-8-8
VGG_19     12-12-12-11-12-10-11-11-13-12-13-13-13-13-13-13
VGG_m      7-7-7-8-7
VGG_s      7-8-9-7-9
FCN_Seg    7-9-7-7-9-8-8-7-8-6-7-6-7-5-5-6-5-5
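The precisions in Table 4.8 come from profiling, per the cited prior work [79]. As an illustration only (not the methodology of that work, which validates precisions against output accuracy), the range-covering part of such a profile could be sketched as follows; the function name, the `quantile` knob, and the sample values are hypothetical:

```python
import numpy as np

def profile_precision(layer_activations, quantile=1.0):
    """Smallest bit width covering the observed activation magnitudes
    of a layer (sign bit excluded); quantile < 1.0 would trade a few
    clipped outliers for a shorter representation."""
    m = np.quantile(np.abs(layer_activations), quantile)
    return max(1, int(np.ceil(np.log2(m + 1))))

# Hypothetical fixed-point activation samples from three profiled layers.
layers = [np.array([12, 200, 455]),
          np.array([3, 60, 127]),
          np.array([1, 5, 30])]
print([profile_precision(a) for a in layers])  # [9, 7, 5]
```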


Figure 4.21: PRA and Diffy speedup for classification DNNs.

grouped into areas of interest [85]. FCN_Seg is derived from the VGG model [203] but with the fully-connected layers converted to convolutional ones. We evaluate these networks using the per-layer precisions found by previous work to maintain output accuracy [79], shown here in Table 4.8. Fig. 4.21 reports the resulting performance. Diffy's speedup is 6.1× over VAA, an improvement of 1.16× over PRA. While modest, the results show that differential convolution not only does not degrade performance for this class of models but actually benefits it. Accordingly, Diffy can be used as a general CNN accelerator.

4.4.12 Classification DNNs vs. CI-DNNs: Performance Discrepancy

Unlike CI-DNNs, where the per-layer speedup of Diffy over VAA and PRA is consistent along the layer pipeline, the speedup achieved for classification models is highest at the beginning of the layer pipeline and then gradually attenuates deeper in the model. This discrepancy has two causes: 1) Unlike CI-DNNs, classification models are discriminative models trained to extract and amplify the features corresponding to the class of the object depicted in the input image while attenuating those of the other classes. Thus, from one layer to the next deeper one, the output activations lose most of their spatial correlation and become sparser due to the higher abstraction levels [204]. 2) The max-pooling layers conventionally used in such classification models disturb the spatial correlation of the activations, which translates to lower potential speedup for the next convolution layer. Per-pixel prediction CI-DNNs depart from both characteristics: they are fully convolutional with no max-pooling layers, and they are trained to sustain the spatial correlation of the activations that is inherent in natural images, which typically exhibit smooth color changes. Nevertheless, the per-layer performance of Diffy rarely drops below that of PRA, with relative per-layer performance ranging from 0.9× up to 2.1×.
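The effect of spatial correlation on delta magnitudes can be illustrated numerically. The sketch below is an illustration, not the thesis's measurement methodology; the `bits_needed` helper and the two synthetic activation rows are assumptions. It contrasts a spatially smooth row, as CI-DNNs produce, with a sparse decorrelated row, as deep classification layers produce:

```python
import numpy as np

def bits_needed(x):
    """Minimum bits to represent each magnitude (sign handled separately)."""
    return np.ceil(np.log2(np.abs(x) + 1)).astype(int)

rng = np.random.default_rng(0)

# A smooth row of activations, as in CI-DNNs processing natural images.
smooth = np.cumsum(rng.integers(-3, 4, size=1024)) + 512
# A decorrelated, sparse row, as in deep classification layers.
sparse = rng.integers(0, 1024, size=1024) * (rng.random(1024) < 0.3)

for name, row in [("smooth", smooth), ("sparse", sparse)]:
    deltas = np.diff(row)
    print(name,
          "raw bits:", bits_needed(row).mean().round(1),
          "delta bits:", bits_needed(deltas).mean().round(2))
```

For the smooth row, the deltas need only a couple of bits on average while the raw values need around ten, which is exactly the headroom delta encoding and differential convolution exploit; for the sparse row, the deltas are no cheaper than the raw values.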



Figure 4.22: Speedup of Diffy over SCNN.

4.4.13 Performance Comparison with SCNN

Fig. 4.22 shows the speedup of Diffy over SCNN [81] under various assumptions about weight sparsity, where SCNNO, SCNN50, SCNN75 and SCNN90 refer to SCNN running the unmodified, 50%, 75% and 90% randomly sparsified versions of the models, respectively. SCNN is configured with 64 processing elements (PEs), each with 16 bit-parallel multipliers, for a total of 1024 multipliers. This configuration is equivalent to 4-tile VAA, PRA and Diffy.

On average, Diffy consistently outperforms SCNN, even against the 90% sparse models. Specifically, Diffy is 5.4×, 4.5×, 2.4× and 1.04× faster than SCNN under the four sparsity assumptions, respectively. We believe that even 50% weight sparsity is optimistic for these models since, in the analytical methods these CI-DNNs mimic, each pixel value depends on a large neighborhood of other pixel values. Moreover, activation sparsity is lower for these models than for the classification models used in the original SCNN study.

4.5 Related Work

DNN acceleration has been receiving tremendous attention and it is not possible to exhaust the topic here. Due to space limitations we limit our attention to the most relevant works. We have already compared against PRA. Other accelerators that could potentially benefit from differential convolution are Stripes (STR) [79] and Dynamic Stripes (DSTR) [188]. The performance of the two designs varies with the precision of the activations by incorporating serial-parallel multiplication and reduced activation precision. Both designs represent an activation a in p bits and a filter weight w in 16 bits. While p is statically profiled and fixed per layer in STR, DSTR further trims the leading and trailing ineffectual bits from the currently processed set of activations, allowing for finer-grain dynamic precision tuning. Both designs process a bit-serially over p cycles such that one bit of a is multiplied by w, effectively an AND-then-shift, producing one term at a time that is accumulated towards the final product. This can ideally improve performance by 16/p over a 16-bit fixed-point bit-parallel accelerator. To maintain the throughput of the worst case of 16-bit effective precision, they concurrently process 16× more activation-weight pairs. PRA improves over STR and DSTR by not only avoiding prefix and suffix ineffectual bits, but also by skipping the zero bits within the binary representation of the activation value, effectively processing just the bits with value one. STR, DSTR and PRA improve performance by 1.92×, 2.61× and 4.31× on average compared to VAA over a set of image classification DNNs. Since deltas are smaller than the activations, their precision requirements are lower as well. Thus, incorporating differential convolution to process deltas instead of raw activations in STR and DSTR would potentially boost their performance. While the two designs are outperformed by PRA, they are simpler and lower-cost designs in terms of power consumption and area.
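The term-serial arithmetic described above can be sketched in a few lines. These functions model only the arithmetic and the cycle counts, not the actual STR or PRA datapaths; the names and the 16-bit precision are assumptions:

```python
def serial_multiply(a, w, p=16):
    """STR/DSTR-style: multiply w by a one activation bit per cycle
    (AND-then-shift), accumulating one term per cycle."""
    acc, cycles = 0, 0
    for bit in range(p):
        acc += ((a >> bit) & 1) * (w << bit)  # one term per cycle
        cycles += 1
    return acc, cycles

def pra_multiply(a, w):
    """PRA-style: process only the one-bits of a, skipping zero bits."""
    acc, cycles = 0, 0
    bit = 0
    while a >> bit:
        if (a >> bit) & 1:
            acc += w << bit   # effectual term
            cycles += 1
        bit += 1
    return acc, cycles

a, w = 0b0000000001010001, 7   # activation 81, weight 7
print(serial_multiply(a, w))   # (567, 16): full 16 cycles
print(pra_multiply(a, w))      # (567, 3): one cycle per set bit
```

Both functions return the same product; only the cycle count differs, which is the source of the 16/p (STR/DSTR) and one-cycle-per-effectual-bit (PRA) speedups cited above.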


CBInfer is an algorithm that reduces inference time for CNNs processing video frames by computing convolutions only for activation values that have changed across frames [205]. While CBInfer targets temporal changes in values across frames, Diffy targets spatial changes in values within a frame. CBInfer is limited to frame-by-frame video processing, whereas Diffy works for more general computational imaging tasks. Unlike CBInfer, which requires additional storage for the previous frame's values, Diffy can potentially reduce the needed storage and bandwidth through delta compression. Moreover, CBInfer is a software implementation for graphics processors. Nevertheless, the two concepts could potentially be combined.

Similarly, Euphrates exploits motion vectors, generated from the cross-frame temporal changes in pixel data, to accelerate CNNs performing high-level semantic tasks such as object detection and tracking [206]. Instead of performing an expensive CNN inference for each frame, Euphrates extrapolates the results based on the motion vectors naturally generated by the early stages of the imaging pipeline. Our approach is complementary as it operates within each frame. Moreover, we target the lower-level computational imaging tasks.

4.6 Discussion on IDEAL vs. Diffy

In Chapter 3, we showed that IDEAL, our BM3D image denoising accelerator, outperforms VAA running the JointNet model (referred to as ML2 in Fig. 3.13b) by 5.4×. Diffy revisits CI-DNN acceleration to close this performance gap: it boosts performance by 6.5× over the same baseline VAA for that specific DNN model and by 7.1× on average over a wider set of CI-DNNs. In terms of area, Diffy needs 29.22mm² while IDEALMR is more compact at 23.08mm². In addition, the more aggressive version of IDEALMR, IDEAL 0.5, is more energy efficient than Diffy; the two designs are 1.58× and 1.34× more energy efficient than IDEAL 0.25, respectively. However, we believe Diffy is the better approach for computational imaging acceleration for several reasons: 1) While IDEAL is a fixed-function accelerator, CI-DNN acceleration is a less rigid approach where the accelerator can be used for better and lighter CI models as they emerge and for models performing other CI tasks such as super-resolution and deblurring. 2) Diffy enables clever DNN models like JointNet, which jointly performs image denoising and demosaicking. 3) Diffy matches, and sometimes exceeds, the frame rates of IDEALMR for some of the studied CI-DNNs. 4) Diffy benefits other application domains as well, such as image classification, object detection and image segmentation models. 5) Most mobile and embedded devices are expected to feature a DNN accelerator for other application domains; thus, Diffy widens the domain of applications those accelerators can run efficiently.

4.7 Concluding Remarks

In this chapter, we studied an emerging class of CNNs for computational imaging tasks that rival conventional analytical methods. We demonstrated that they exhibit spatial correlation in their value stream. Accordingly, we introduced differential convolution, a novel technique that computes convolutions between the filter weights and the deltas of the activations instead of the raw activation values. We proposed Diffy, a deep learning accelerator that translates this spatial correlation property into communication, storage, and computation benefits, boosting performance and energy efficiency. The accelerator makes it more practical to deploy such CNNs, enabling real-time processing. Diffy outperforms the baseline VAA and the state-of-the-art PRA by 7.1× and 1.41× while being 1.83× and 1.36× more energy efficient, respectively. In addition, Diffy exploits the same property to compress activations, requiring 55% less on-chip memory and 2.5× less off-chip bandwidth than a technique that stores the raw values using profiled per-layer precisions. Compared to a state-of-the-art technique that uses
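The core recurrence behind differential convolution can be sketched in 1D: with stride-1 windows, each output equals the previous output plus a dot product over the (typically small) activation deltas. This is a minimal illustrative simplification of Diffy's multi-dimensional hardware formulation; the function names are ours:

```python
import numpy as np

def conv1d_direct(a, w):
    """Reference: dot product of w against every stride-1 window of a."""
    K = len(w)
    return np.array([np.dot(w, a[i:i + K]) for i in range(len(a) - K + 1)])

def conv1d_differential(a, w):
    """Each output is the previous output plus a convolution over
    deltas; for smooth inputs the deltas are small, so the delta
    products need fewer effectual terms than the raw products."""
    K = len(w)
    d = np.diff(a)                     # spatial deltas
    out = [np.dot(w, a[:K])]           # first window from raw values
    for i in range(1, len(a) - K + 1):
        out.append(out[-1] + np.dot(w, d[i - 1:i - 1 + K]))
    return np.array(out)

a = np.array([10, 11, 11, 12, 14, 13, 12, 12])   # smooth activations
w = np.array([1, -2, 3])
assert np.array_equal(conv1d_direct(a, w), conv1d_differential(a, w))
```

The assertion holds because, for stride 1, sum_k w_k·a(i+k) = sum_k w_k·a(i-1+k) + sum_k w_k·(a(i+k) - a(i-1+k)); the delta path is exact, not approximate.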


dynamic precision per group of activations, Diffy needs 32% less on-chip memory with 1.43× less off-chip bandwidth.


Chapter 5

Summary

5.1 Contributions

In summary, this dissertation makes the following main contributions:

Characterizing State-of-the-art Computational Imaging Implementations

• We develop and analyze the performance and energy consumption of highly optimized software implementations of BM3D, a state-of-the-art algorithm widely used in various computational imaging applications, on three commodity platforms: 1) a high-end general-purpose processor with vector extensions (CPU), 2) a desktop graphics processor (GPU), and 3) a general-purpose processor targeting embedded systems.

• We show that the CPU and GPU implementations are respectively 4 and 3 orders of magnitude slower than what is needed for real-time and user-interactive applications, even for 1080p HD image frames.

• We conduct microarchitectural analysis of the implementations, showing that they are compute-bound; the lack of adequate compute resources hinders satisfactory performance on commodity architectures. In addition, our analysis shows that the block-matching step is a major bottleneck (67% and 87% of processing time on the CPU and GPU, respectively) and thus extra acceleration effort should be devoted to this step.

• We characterize DNN-based computational imaging models on state-of-the-art value-agnostic DNN accelerators, showing that, while much faster than commodity architectures, such accelerators still fall short of real-time and user-interactive processing requirements.

• For both analytical and DNN-based computational imaging implementations, our analysis shows strong spatial correlation between successively processed data items. This correlation suggests that a significant amount of the computation involved is actually redundant. Data-aware architectures should be designed to exploit this phenomenon to boost performance over data-agnostic approaches.

IDEAL: Enabling Real-Time BM3D Processing

• We develop IDEAL, a dedicated hardware accelerator for BM3D. IDEAL incorporates a novel software-hardware optimization, Matches Reuse (MR), that exploits typical image content to reduce the computations


needed by BM3D without sacrificing quality. Instead of the exhaustive search performed by the block-matching step for every successive patch of the input image, MR performs a shorter search, reusing the results of a previous patch if there is enough similarity with the current one.
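As an illustration of the MR idea only (not IDEAL's exact hardware algorithm): when the current patch is similar enough to the previous one, the previous patch's matches are re-ranked instead of searching the whole window. The similarity threshold `tau`, the match count, and the patch representation below are assumptions:

```python
import numpy as np

def block_match(patch, window_patches):
    """Exhaustive search: distances to every candidate in the window."""
    dists = [np.sum((patch - c) ** 2) for c in window_patches]
    return np.argsort(dists)[:8]          # indices of the 8 best matches

def block_match_mr(patch, prev_patch, prev_matches, window_patches, tau=50.0):
    """Matches Reuse (illustrative): if the current patch is close to
    the previous one, re-rank only the previous patch's matches
    instead of searching the whole window."""
    if np.sum((patch - prev_patch) ** 2) < tau:
        dists = [np.sum((patch - window_patches[i]) ** 2) for i in prev_matches]
        return prev_matches[np.argsort(dists)]   # short search
    return block_match(patch, window_patches)    # fall back to full search
```

The reuse path touches only the handful of previously matched candidates, which is the source of MR's savings over the exhaustive window search.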

• We demonstrate that IDEAL makes it possible to process image frames of up to 42MP for user-interactive applications. On average, IDEAL is 11,352× and 591× faster than the CPU and GPU implementations, respectively, and 4 and 3 orders of magnitude more energy efficient.

• We demonstrate that IDEAL is at least 5.4× faster and 3.95× more energy efficient than the faster of two DNN models running on a high-end state-of-the-art value-agnostic DNN accelerator. This motivates our next work, Diffy, which closes the performance gap of DNN-based computational imaging acceleration.

• We show that IDEAL can be modified with little effort to support another computational imaging application, sharpening. This is possible since BM3D is a generic image model widely adopted in several computational imaging applications.

Diffy: Exploiting Value Stream Characteristics to Accelerate DNN-based Computational Imaging

• We analyze the potential performance gains achievable from exploiting the spatial correlation between the activations processed by Computational Imaging DNN models (CI-DNNs).

• To exploit this property, we propose Differential Convolution (DC), a novel convolution technique that processes the activation values as deltas to reduce the amount of work involved.

• We propose Diffy, a practical DC-based architecture that significantly reduces the amount of computation, communication and storage needed to process CI-DNNs.

• We demonstrate that Diffy boosts performance by 7.1× over a baseline value-agnostic accelerator (VAA) and by 1.41× over a state-of-the-art accelerator that processes only the effectual content of the raw activation values (PRA). Layout results show that Diffy is 1.83× and 1.36× more energy efficient, respectively. Diffy also reduces off-chip traffic by 4.57× and on-chip storage by 2.8×.

• We show that Diffy scales much more efficiently and enables real-time processing of HD frames with considerably fewer resources.

• We demonstrate that Diffy is robust and can serve as a general CNN accelerator, as it improves performance even for image classification, object detection and image segmentation models, by 6.1× and 1.16× compared to the value-agnostic (VAA) and value-aware (PRA) accelerators, respectively.

5.2 Directions for Future Work

As Dennard scaling has ceased and we are approaching the end of the multi-core scaling era [207, 208], advances in technology can no longer cope with the ever-increasing processing demands of a wide range of applications. Even with highly parallel high-end GPUs loaded with thousands of Single Instruction Multiple Data (SIMD) cores, the available processing power still falls short of what is needed. Computational imaging and low-level computer vision applications are one such category: despite being embarrassingly parallel for most of the computations involved, they remain hard to tackle even on high-end commodity architectures, let alone mobile platforms [10, 11, 27, 57, 152, 209, 210]. Thus, there is increasing interest in proposing specialized hardware


accelerators for those applications to achieve high performance and energy efficiency [58, 59, 60]. On the other hand, deep learning continues to achieve state-of-the-art results both in high-level classification applications such as image recognition [82, 83, 84], object segmentation [85, 86, 87] and speech recognition [88], and in low-level computer vision and image processing applications such as image denoising [46, 47, 48], demosaicking [49], sharpening [64], deblurring [65, 66, 67], and super-resolution [68, 69, 70, 71, 72]. However, these DNN models are computationally demanding and energy inefficient to run on commodity architectures.

Future work may build on the techniques proposed in this dissertation to exploit more of the underlying data properties and further squeeze the amount of computation and data communication involved. Studying a wider set of CI applications or CI-DNNs is another path, to explore to what extent the value properties discovered here hold. Combining the data-aware techniques proposed by this work with other approaches may also reach promising trade-off points in the design space or reveal new design possibilities. We itemize possible future work paths as follows:

For Analytical Computational Imaging: Generalizing and Extending IDEAL

• Incorporating the Matches Reuse (MR) technique into other Nonlocal Self-Similarity (NSS) based applications and studying the potential performance gains as well as the effect on output quality. NSS-based techniques can make use of MR since they all share the same idea of searching a window for the best matches of the currently processed patch.

• Extending IDEAL into a more generic CI accelerator by augmenting its pipeline with more functionality. Following the technique proposed by Dabov et al. to jointly denoise and sharpen images [30], we showed one such extension by adding sharpening compute units to IDEAL; the units apply α-rooting to the transform coefficients to produce the sharpening effect. Other extensions are possible thanks to the fact that BM3D is a generic image model that has been used for various CI applications including image deblurring, up-sampling, video denoising, and super-resolution [29, 30, 31, 32, 33].

• Exploring hybrid analytical-machine-learning techniques and the possibility of designing a specialized hardware accelerator that combines IDEAL's block-matching engines, incorporating the MR optimization, with Diffy as an accelerator for DNN-based approximation models of the filtering step of BM3D. This approach would be a hardware realization of a number of hybrid models recently proposed by computational imaging researchers [161, 211]. The advantage would be much simpler DNN models that approximate just the filtering step instead of the conventional end-to-end models.

For DNN-based Computational Imaging: More Aggressive Data-Aware Approaches

• Sparsifying CI-DNN models to reduce their significant computational intensity. CI-DNNs, unlike classification DNNs, typically process images whose high resolution is maintained along the model pipeline. Thus, these models need sparsification the most to facilitate their acceleration on the recently proposed sparse DNN accelerators, which boost performance over dense architectures [81, 212, 213]. A following step would be exploiting the runtime activation properties revealed by this dissertation to further enhance the performance of such sparse accelerators.

• Diffy applies Differential Convolution (DC) at the granularity of the whole activation window. Although this implementation results in fairly low hardware overhead, it sacrifices significant potential performance gains. An extremely aggressive implementation would arbitrarily process activation


values as deltas on a per-activation basis, i.e., apply DC only to those activations for which processing deltas results in less work or fewer effectual terms. This would incur extra hardware overhead, which could be mitigated by trade-offs. For example, a future study might profile input activation feature maps and identify the activation planes that tend to be suitable for DC, i.e., apply DC on activation window slices along the depth of a window. This finer-grain Selective Differential Convolution (SDC) may boost the performance of Diffy at an affordable extra hardware overhead.
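A minimal sketch of the per-activation selection SDC would perform is shown below. It counts only effectual (one-bit) terms and ignores the datapath cost of recovering the raw product from a delta; the function names and example values are hypothetical:

```python
def terms(x):
    """Effectual terms = one-bits in the magnitude of the value."""
    return bin(abs(x)).count("1")

def sdc_terms(activations):
    """Selective Differential Convolution (sketch): for each activation,
    process whichever of the raw value or the delta from its left
    neighbor has fewer effectual terms."""
    total, prev = 0, 0
    for a in activations:
        total += min(terms(a), terms(a - prev))
        prev = a
    return total

acts = [128, 130, 129, 133, 64, 65]   # a smooth run of activations
print(sum(terms(a) for a in acts), "raw terms vs", sdc_terms(acts), "with SDC")
# prints "11 raw terms vs 6 with SDC"
```

Note how the per-activation choice also protects the one case where the delta is worse (64 after 133): SDC falls back to the raw value there instead of paying for a large delta.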

• This dissertation studied the spatial correlation of the activations processed by CI-DNN models. Future work may investigate the spatial correlation of the weights as well. If such correlation proves strong, it can be orthogonally exploited in an accelerator that processes weights bit-serially, such as Loom [214].

• Developing a standardized DNN benchmark suite comprising different categories of DNN-based applications, such as high-level image classification, object detection, image segmentation and speech recognition, as well as low-level vision and image processing tasks such as computational imaging applications.

• Exploring the spatial/temporal correlation of activations for other types of applications and DNN architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. For example, Neil et al. exploit the slowly changing activation values over time in RNNs to reduce memory accesses and computations [215]. They propose an RNN architecture, the delta network, where each neuron communicates its value only when the change in its activation exceeds a preset threshold. Their work focuses on the model architecture but does not exploit these properties in a hardware accelerator design. Future work might explore extending Diffy to translate such properties into boosted accelerator performance.
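A sketch of the thresholded update Neil et al. describe, assuming a simple recurrent input path; the function, threshold value, and state layout are illustrative, not their published implementation:

```python
import numpy as np

def delta_rnn_step(x, h_hat, Wx, threshold=0.1):
    """Delta-network update (sketch): a neuron sends its value only
    when it has drifted more than `threshold` from the last value it
    communicated (h_hat); unsent neurons contribute no MACs."""
    delta = np.where(np.abs(x - h_hat) > threshold, x - h_hat, 0.0)
    h_hat = h_hat + delta              # last-communicated state
    contribution = Wx @ delta          # only nonzero deltas cost work
    return contribution, h_hat, np.count_nonzero(delta)
```

An accelerator in the spirit of Diffy could skip the multiplications for the zeroed deltas entirely, turning the model-level sparsity into cycle and memory-access savings.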


Bibliography

[1] Achuta Kadambi, Hayato Ikoma, Xing Lin, Gordon Wetzstein, and Ramesh Raskar. Subsurface Enhancementthrough Sparse Representations of Multispectral Direct/Global Decomposition. In Imaging and Applied

Optics, page CTh1B.4. Optical Society of America, 2013.

[2] G. Satat, C. Barsi, and R. Raskar. Skin perfusion photography. In 2014 IEEE International Conference on

Computational Photography (ICCP), pages 1–8, May 2014.

[3] Martin Roser and Andreas Geiger. Video-based raindrop detection for improved image registration. InInternational Conference on Computer Vision (ICCV) Workshops, 2009.

[4] Moritz Menze, Christian Heipke, and Andreas Geiger. Joint 3D estimation of vehicles and scene flow. InISPRS Workshop on Image Sequence Analysis (ISA), 2015.

[5] Gerald C. Kane and Alexandra Pear. The Rise of Visual Content Online. http://sloanreview.mit.edu/article/the-rise-of-visual-content-online/, January 4 2016.

[6] Paul Worthington. One Trillion Photos in 2015. http://mylio.com/true-stories/tech-today/one-trillion-photos-in-2015-2, December 11 2014.

[7] Consumer Electronics Market Worth $838.85 Billion By 2020. http://www.grandviewresearch.com/press-release/global-consumer-electronics-market, September 3 2015.

[8] Lynn Jenner. Hubble’s High-Definition Panoramic View of the Andromeda Galaxy. https:

//www.nasa.gov/content/goddard/hubble-s-high-definition-panoramic-view-of-the-

andromeda-galaxy, January 5 2015.

[9] John E Krist. Deconvolution of hubble space telescope images using simulated point spread functions. InAstronomical Data Analysis Software and Systems I, volume 25, page 226, 1992.

[10] Felix Heide, Markus Steinberger, Yun-Ta Tsai, Mushfiqur Rouf, Dawid PajÄEk, Dikpal Reddy, OrazioGallo, Jing Liu abd Wolfgang Heidrich, Karen Egiazarian, Jan Kautz, and Kari Pulli. FlexISP: A FlexibleCamera Image Processing Framework. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia

2014), 33(6), December 2014.

[11] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising with block-matching and 3D filtering. In Electronic Imaging 2006, pages 606414–606414. International Society forOptics and Photonics, 2006.

[12] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domaincollaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, Aug 2007.

72

Page 84: by Mostafa Mahmoud...I will never be able to honor the favours of my larger family; my parents Dr. Mahmoud and Dr. Amany and my sisters Marwa and Menna. Without their help and support,

BIBLIOGRAPHY 73

[13] A. Buades, B. Coll, and J. M. Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer

Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 60–65 vol. 2,June 2005.

[14] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Non-Local Means Denoising. Image Processing

On Line, 1:208–212, 2011.

[15] Karen Egiazarian, Jaakko Astola, Mika Helsingius, and Pauli Kuosmanen. Adaptive denoising and lossycompression of images in transform domain. Journal of Electronic Imaging, 8(3):233–245, 1999.

[16] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures ofGaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338–1351, Nov 2003.

[17] M. Elad and M. Aharon. Image Denoising Via Sparse and Redundant Representations Over LearnedDictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, Dec 2006.

[18] A. Foi, V. Katkovnik, and K. Egiazarian. Pointwise Shape-Adaptive DCT for High-Quality Denoising andDeblocking of Grayscale and Color Images. IEEE Transactions on Image Processing, 16(5):1395–1411,May 2007.

[19] C. Kervrann and J. Boulanger. Optimal Spatial Adaptation for Patch-Based Image Denoising. IEEE

Transactions on Image Processing, 15(10):2866–2878, Oct 2006.

[20] J. A. Guerrero-Colon and J. Portilla. Two-level adaptive denoising using Gaussian scale mixtures inovercomplete oriented pyramids. In IEEE International Conference on Image Processing 2005, volume 1,pages I–105–8, Sept 2005.

[21] Samsung S9+. http://www.samsung.com/ca/smartphones/galaxy-s9/camera/.

[22] Huawei P20 Pro. https://consumer.huawei.com/ca/phones/p20-pro/.

[23] Sony Xperia XA2. https://www.sonymobile.com/ca-en/products/phones/xperia-xa2/.

[24] Canon EOS 5DS. https://www.usa.canon.com/internet/portal/us/home/products/details/

cameras/dslr/eos-5ds.

[25] Nikon D850. http://en.nikon.ca/nikon-products/product/dslr-cameras/d850.html.

[26] Sony Alpha A7R III. https://www.sony.com/electronics/interchangeable-lens-cameras/

ilce-7rm3.

[27] Yun-Ta Tsai, Markus Steinberger, Dawid Pajak, and Kari Pulli. Fast ANN for High-Quality CollaborativeFiltering. In High-Performance Graphics, 2014.

[28] M. Mäkitalo and A. Foi. Spatially adaptive alpha-rooting in BM3D sharpening. In Image Processing:

Algorithms and Systems IX, volume 7870 of Proc. SPIE, page 787012, March 2011.

[29] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image restoration by sparse 3Dtransform-domain collaborative filtering. In Electronic Imaging 2008, pages 681207–681207. InternationalSociety for Optics and Photonics, 2008.

Page 85: by Mostafa Mahmoud...I will never be able to honor the favours of my larger family; my parents Dr. Mahmoud and Dr. Amany and my sisters Marwa and Menna. Without their help and support,

BIBLIOGRAPHY 74

[30] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Joint image sharpening anddenoising by 3D transform-domain collaborative filtering. In Proc. 2007 Int. TICSP Workshop Spectral

Meth. Multirate Signal Process., SMMSP, volume 2007. Citeseer, 2007.

[31] A. Danielyan, A. Foi, V. Katkovnik, and K. Egiazarian. Image upsampling via spatially adaptive block-matching filtering. In 2008 16th European Signal Processing Conference, pages 1–5, Aug 2008.

[32] K. Dabov, A. Foi, and K. Egiazarian. Video denoising by sparse 3D transform-domain collaborative filtering. In 2007 15th European Signal Processing Conference, pages 145–149, Sept 2007.

[33] Aram Danielyan, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image and video super-resolution via spatially adaptive block-matching filtering. In Proceedings of International Workshop on Local and non-Local Approximation in Image Processing (LNLA), 2008.

[34] A. Buades, B. Coll, and J. M. Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 60–65 vol. 2, June 2005.

[35] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Nonlocal image and movie denoising. International Journal of Computer Vision, 76(2):123–139, Feb 2008.

[36] J. Xu, L. Zhang, W. Zuo, D. Zhang, and X. Feng. Patch Group Based Nonlocal Self-Similarity Prior Learning for Image Denoising. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 244–252, Dec 2015.

[37] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’14, pages 2862–2869, Washington, DC, USA, 2014. IEEE Computer Society.

[38] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In 2009 IEEE 12th International Conference on Computer Vision, pages 2272–2279, Sept 2009.

[39] W. Dong, L. Zhang, G. Shi, and X. Li. Nonlocally Centralized Sparse Representation for Image Restoration. IEEE Transactions on Image Processing, 22(4):1620–1630, April 2013.

[40] W. Dong, G. Shi, and X. Li. Nonlocal Image Restoration With Bilateral Variance Estimation: A Low-Rank Approach. IEEE Transactions on Image Processing, 22(2):700–711, Feb 2013.

[41] U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2774–2781, June 2014.

[42] Jeremy Jancsary, Sebastian Nowozin, and Carsten Rother. Loss-specific Training of Non-parametric Image Restoration Models: A New State of the Art. In Proceedings of the 12th European Conference on Computer Vision - Volume Part VII, ECCV’12, pages 112–125, Berlin, Heidelberg, 2012. Springer-Verlag.

[43] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In 2011 International Conference on Computer Vision, pages 479–486, Nov 2011.

[44] Yunjin Chen, Thomas Pock, René Ranftl, and Horst Bischof. Revisiting loss-specific training of filter-based MRFs for image restoration. In Joachim Weickert, Matthias Hein, and Bernt Schiele, editors, Pattern Recognition, pages 271–281, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.


[45] Y. Chen and T. Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256–1272, June 2017.

[46] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep CNN denoiser prior for image restoration. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2808–2817, 2017.

[47] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, July 2017.

[48] Kai Zhang, Wangmeng Zuo, and Lei Zhang. FFDNet: Toward a fast and flexible solution for CNN based image denoising. CoRR, abs/1710.04026, 2017.

[49] Michaël Gharbi, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand. Deep joint demosaicking and denoising. ACM Trans. Graph., 35(6):191:1–191:12, November 2016.

[50] Gerd Waloszek and Ulrich Kreichgauer. User-Centered Evaluation of the Responsiveness of Applications. In Proceedings of the 12th IFIP TC 13 International Conference on Human-Computer Interaction: Part I, INTERACT ’09, pages 239–242, Berlin, Heidelberg, 2009. Springer-Verlag.

[51] Stuart K. Card, George G. Robertson, and Jock D. Mackinlay. The Information Visualizer, an Information Workspace. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’91, pages 181–186, New York, NY, USA, 1991. ACM.

[52] Jakob Nielsen. Powers of 10: Time Scales in User Experience. https://www.nngroup.com/articles/powers-of-10-time-scales-in-ux/, October 5, 2009.

[53] Robert B. Miller. Response Time in Man-computer Conversational Transactions. In Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I, AFIPS ’68 (Fall, part I), pages 267–277, New York, NY, USA, 1968. ACM.

[54] Marc Lebrun. An Analysis and Implementation of the BM3D Image Denoising Method. Image Processing On Line, 2:175–213, 2012.

[55] Emanouil Atanassov, Michaela Barth, Mikko Byckling, Vali Codreanu, Nevena Ilieva, Tomas Karasek, Jorge Rodriguez, Sami Saarinen, Ole Widar Saastad, Michael Schliephake, Martin Stachon, Janko Strassburg, and Volker Weinberg. Best Practice Guide Intel Xeon Phi v2.0. http://www.prace-ri.eu/best-practice-guide-intel-xeon-phi-january-2017/, January 29, 2017.

[56] Yu Pei. Research on GPU parallelization of patch match algorithm in image processing. Master’s thesis, Electronics and Communication Engineering, Shanghai Jiao Tong University, Shanghai, China, 2013.

[57] David Honzátko. GPU Acceleration of Advanced Image Denoising. PhD thesis, Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic, 2015.

[58] Bernardo Manuel Aguiar Silva Teixeira Cardoso. Algorithm and Hardware Design for Image Restoration. Master’s thesis, Faculty of Engineering, the University of Porto, Porto, Portugal, 2015.


[59] Hao Zhang, Wenjiang Liu, Ruolin Wang, Tao Liu, and Mengtian Rong. Hardware architecture design of block-matching and 3D-filtering denoising algorithm. Journal of Shanghai Jiaotong University (Science), 21(2):173–183, 2016.

[60] BM3D assembly device designed on basis of ASIC, July 28, 2010. CN Patent App. CN 201,010,102,701.

[61] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, April 2004.

[62] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2392–2399, June 2012.

[63] Viren Jain and Sebastian Seung. Natural image denoising with convolutional networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 769–776. Curran Associates, Inc., 2009.

[64] S. Zhang and E. Salari. A neural network-based nonlinear filter for image enhancement. International Journal of Imaging Systems and Technology, 12(2):56–62, 2002.

[65] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[66] Michal Hradiš, Jan Kotera, Pavel Zemčík, and Filip Šroubek. Convolutional neural networks for direct text deblurring. In Proceedings of BMVC 2015. The British Machine Vision Association and Society for Pattern Recognition, 2015.

[67] C. J. Schuler, M. Hirsch, S. Harmeling, and B. Schölkopf. Learning to deblur. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1439–1451, July 2016.

[68] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. CoRR, abs/1511.04587, 2015.

[69] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, Feb 2016.

[70] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1874–1883, 2016.

[71] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. CoRR, abs/1511.04491, 2015.

[72] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 105–114, 2017.


[73] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam. DaDianNao: A machine-learning supercomputer. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 609–622, Dec 2014.

[74] Michaël Gharbi, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand. Deep Joint Demosaicking and Denoising. ACM Trans. Graph., 35(6):191:1–191:12, November 2016.

[75] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. Enright Jerger, and A. Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 1–13, June 2016.

[76] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G. Y. Wei, and D. Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 267–278, June 2016.

[77] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, Jan 2017.

[78] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, pages 243–254, Piscataway, NJ, USA, 2016. IEEE Press.

[79] P. Judd, J. Albericio, and A. Moshovos. Stripes: Bit-serial deep neural network computing. IEEE Computer Architecture Letters, 16(1):80–83, Jan 2017.

[80] Jorge Albericio, Alberto Delmás, Patrick Judd, Sayeh Sharify, Gerard O’Leary, Roman Genov, and Andreas Moshovos. Bit-pragmatic deep neural network computing. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, pages 382–394, New York, NY, USA, 2017. ACM.

[81] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 27–40, June 2017.

[82] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’14, pages 580–587, Washington, DC, USA, 2014. IEEE Computer Society.

[83] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, June 2015.

[84] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.

[85] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.


[86] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, Dec 2017.

[87] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. CoRR, abs/1611.06612, 2016.

[88] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: Scaling up end-to-end speech recognition. CoRR, abs/1412.5567, 2014.

[89] Scott Krig. Vision Pipelines and Optimizations. In Computer Vision Metrics: Textbook Edition, chapter 8, pages 273–317. Springer International Publishing, Cham, 2016.

[90] Michael S. Brown. Understanding the in-camera image processing pipeline for computer vision.

[91] R. Ramanath, W. E. Snyder, Y. Yoo, and M. S. Drew. Color image processing pipeline. IEEE Signal Processing Magazine, 22(1):34–43, Jan 2005.

[92] Junichi Nakamura. Image Sensors and Signal Processing for Digital Still Cameras. CRC Press, Inc., Boca Raton, FL, USA, 2005.

[93] JA Stephen Viggiano. Comparison of the accuracy of different white-balancing options as quantified by their color constancy. In Sensors and Camera Systems for Scientific, Industrial, and Digital Photography Applications V, volume 5301, pages 323–334. International Society for Optics and Photonics, 2004.

[94] Jean-Michel Morel, Ana B Petro, and Catalina Sbert. Fast implementation of color constancy algorithms. In Color Imaging XIV: Displaying, Processing, Hardcopy, and Applications, volume 7241, page 724106. International Society for Optics and Photonics, 2009.

[95] Ron Kimmel, Michael Elad, Doron Shaked, Renato Keshet, and Irwin Sobel. A variational framework for retinex. International Journal of Computer Vision, 52(1):7–23, 2003.

[96] Lauren Barghout. Visual taxometric approach to image segmentation using fuzzy-spatial taxon cut yields contextually relevant regions. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 163–173. Springer, 2014.

[97] Edwin H. Land. The retinex theory of color vision. Scientific American, 237(6):108–129, 1977.

[98] Brian Funt, Vlad Cardei, and Kobus Barnard. Learning color constancy. In Color and Imaging Conference, volume 1996, no. 1, pages 58–60. Society for Imaging Science and Technology, 1996.

[99] C. L. Chen and S. H. Lin. Automatic white balance based on estimation of light source using fuzzy neural network. In 2009 4th IEEE Conference on Industrial Electronics and Applications, pages 1905–1910, May 2009.

[100] Mahmoud Afifi. Semantic white balance: Semantic color constancy using convolutional neural network. CoRR, abs/1802.00153, 2018.

[101] David H Brainard and William T Freeman. Bayesian color constancy. JOSA A, 14(7):1393–1411, 1997.


[102] G. D. Finlayson, S. D. Hordley, and P. M. Hubel. Color by correlation: a simple, unifying framework for color constancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1209–1221, Nov 2001.

[103] Daniele Menon and Giancarlo Calvagno. Color image demosaicking: An overview. Signal Processing: Image Communication, 26(8-9):518–533, 2011.

[104] Bryce E Bayer. Color imaging array. United States Patent 3,971,065, 1976.

[105] Rastislav Lukac and Konstantinos N Plataniotis. Color filter arrays: Design and performance analysis. IEEE Transactions on Consumer Electronics, 51(4):1260–1267, 2005.

[106] John F Hamilton Jr and John T Compton. Processing color and panchromatic pixels, April 29, 2014. US Patent 8,711,452.

[107] Keigo Hirakawa and Patrick J Wolfe. Spatio-spectral color filter array design for optimal image recovery. IEEE Transactions on Image Processing, 17(10):1876–1890, 2008.

[108] Yue M Lu and Martin Vetterli. Optimal color filter array design: Quantitative conditions and an efficient search procedure. In Digital Photography V, volume 7250, page 725009. International Society for Optics and Photonics, 2009.

[109] C. E. Duchon. Lanczos Filtering in One and Two Dimensions. Journal of Applied Meteorology, 18:1016–1022, August 1979.

[110] Edward Chang, Shiufun Cheung, and Davis Y Pan. Color filter array recovery using a threshold-based variable number of gradients. In Sensors, Cameras, and Applications for Digital Photography, volume 3650, pages 36–44. International Society for Optics and Photonics, 1999.

[111] Chuan-kai Lin. Pixel Grouping. https://sites.google.com/site/chklin/demosaic/, 2003.

[112] Keigo Hirakawa and Thomas W Parks. Adaptive homogeneity-directed demosaicing algorithm. IEEE Transactions on Image Processing, 14(3):360–369, 2005.

[113] Sina Farsiu, Michael Elad, and Peyman Milanfar. Multiframe demosaicing and super-resolution of color images. IEEE Transactions on Image Processing, 15(1):141–159, 2006.

[114] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A review of image denoising algorithms, with a new one. Multiscale Modeling & Simulation, 4(2):490–530, 2005.

[115] Rohit Verma and Jahid Ali. A comparative study of various types of image noise and efficient noise removal techniques. International Journal of Advanced Research in Computer Science and Software Engineering, 3(10), 2013.

[116] Scott E. Umbaugh. Computer Vision and Image Processing: A Practical Approach Using CVIPtools with CD-ROM. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 1997.

[117] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Deep image prior. CoRR, abs/1711.10925, 2017.

[118] D. Y. Tsai, Yongbum Lee, M. Sekiya, S. Sakaguchi, and I. Yamada. A method of medical image enhancement using wavelet analysis. In 6th International Conference on Signal Processing, 2002, volume 1, pages 723–726 vol. 1, Aug 2002.


[119] S. Hatami, R. Hosseini, M. Kamarei, and H. Ahmadi. Wavelet based fingerprint image enhancement. In 2005 IEEE International Symposium on Circuits and Systems, pages 4610–4613 Vol. 5, May 2005.

[120] S. S. Agaian, B. Silver, and K. A. Panetta. Transform coefficient histogram-based image enhancement algorithms using contrast entropy. IEEE Transactions on Image Processing, 16(3):741–758, March 2007.

[121] Seungyong Lee and Sunghyun Cho. Recent advances in image deblurring. In SIGGRAPH Asia 2013 Courses, SA ’13, pages 6:1–6:108, New York, NY, USA, 2013. ACM.

[122] Mario Bertero and Patrizia Boccacci. Introduction to inverse problems in imaging. CRC Press, 1998.

[123] Ruxin Wang and Dacheng Tao. Recent progress in image deblurring. arXiv preprint arXiv:1409.6838, 2014.

[124] Emmanuel J Candès and David L Donoho. Curvelets: A surprisingly effective nonadaptive representation for objects with edges. Technical report, Stanford University, CA, Department of Statistics, 2000.

[125] Minh N Do and Martin Vetterli. The contourlet transform: an efficient directional multiresolution image representation. IEEE Transactions on Image Processing, 14(12):2091–2106, 2005.

[126] Emmanuel J Candès and David L Donoho. Ridgelets: A key to higher-dimensional intermittency? Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 357(1760):2495–2509, 1999.

[127] Jack Tzeng, Chun-Chen Liu, and Truong Q Nguyen. Contourlet domain multiband deblurring based on color correlation for fluid lens cameras. IEEE Transactions on Image Processing, 19(10):2659–2668, 2010.

[128] Yi Zhang and Keigo Hirakawa. Blur processing using double discrete wavelet transform. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1091–1098. IEEE, 2013.

[129] Hui Ji and Kang Wang. Robust image deblurring with an inaccurate blur kernel. IEEE Transactions on Image Processing, 21(4):1624–1634, 2012.

[130] Jian-Feng Cai, Hui Ji, Chaoqiang Liu, and Zuowei Shen. Framelet-based blind motion deblurring from a single image. IEEE Transactions on Image Processing, 21(2):562–572, 2012.

[131] Thomas Huang and Jianchao Yang. Image super-resolution: Historical overview and future challenges. In Super-Resolution Imaging, pages 19–52. CRC Press, 2010.

[132] Marco Cristani, Dong Seon Cheng, Vittorio Murino, and Donato Pannullo. Distilling information with super-resolution for video surveillance. In Proceedings of the ACM 2nd International Workshop on Video Surveillance & Sensor Networks, VSSN ’04, pages 2–11, New York, NY, USA, 2004. ACM.

[133] F. Li, X. Jia, and D. Fraser. Universal HMT based super-resolution for remote sensing images. In 2008 15th IEEE International Conference on Image Processing, pages 333–336, Oct 2008.

[134] John A Kennedy, Ora Israel, Alex Frenkel, Rachel Bar-Shalom, and Haim Azhari. Super-resolution in PET imaging. IEEE Transactions on Medical Imaging, 25(2):137–147, 2006.

[135] K. Malczewski and R. Stasinski. Toeplitz-based iterative image fusion scheme for MRI. In 2008 15th IEEE International Conference on Image Processing, pages 341–344, Oct 2008.


[136] Sharon Peled and Yehezkel Yeshurun. Super-resolution in MRI: application to human white matter fiber tract visualization by diffusion tensor imaging. Magnetic Resonance in Medicine, 45(1):29–35, 2001.

[137] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, Nov 2010.

[138] Hojjat Seyed Mousavi, Tiantong Guo, and Vishal Monga. Deep image super-resolution via natural image priors. CoRR, abs/1802.02721, 2018.

[139] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016.

[140] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.

[141] Mehdi S. M. Sajjadi, Bernhard Schölkopf, and Michael Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. CoRR, abs/1612.07919, 2016.

[142] C. Y. Yang and M. H. Yang. Fast direct super-resolution by simple functions. In 2013 IEEE International Conference on Computer Vision, pages 561–568, Dec 2013.

[143] J. Yang, Z. Lin, and S. Cohen. Fast image super-resolution based on in-place example regression. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 1059–1066, June 2013.

[144] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), pages 839–846, Jan 1998.

[145] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629–639, Jul 1990.

[146] K. He, J. Sun, and X. Tang. Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6):1397–1409, June 2013.

[147] S. Grace Chang, Bin Yu, and Martin Vetterli. Adaptive Wavelet Thresholding for Image Denoising and Compression. IEEE Transactions on Image Processing, 9(9):1532–1546, 2000.

[148] David L Donoho and Iain M Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432):1200–1224, 1995.

[149] David L Donoho and Iain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.

[150] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.

[151] A. Danielyan, V. Katkovnik, and K. Egiazarian. BM3D frames and variational image deblurring. IEEE Transactions on Image Processing, 21(4):1715–1728, April 2012.


[152] S. Sarjanoja, J. Boutellier, and J. Hannuksela. BM3D image denoising using heterogeneous computing platforms. In 2015 Conference on Design and Architectures for Signal and Image Processing (DASIP), pages 1–8, Sept 2015.

[153] S. Lefkimmiatis. Non-local color image denoising with convolutional neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5882–5891, July 2017.

[154] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, December 2010.

[155] Junyuan Xie, Linli Xu, and Enhong Chen. Image Denoising and Inpainting with Deep Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS’12, pages 341–349, USA, 2012. Curran Associates Inc.

[156] Stan Z Li. Markov random field modeling in image analysis. Springer Science & Business Media, 2009.

[157] S. Roth and M. J. Black. Fields of experts: a framework for learning image priors. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 860–867 vol. 2, June 2005.

[158] M. Elad and M. Aharon. Image denoising via learned dictionaries and sparse representation. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 1, pages 895–900, June 2006.

[159] J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1):53–69, Jan 2008.

[160] Lei Zhang, Weisheng Dong, David Zhang, and Guangming Shi. Two-stage image denoising by principal component analysis with local pixel grouping. Pattern Recognition, 43(4):1531–1549, 2010.

[161] D. Yang and J. Sun. BM3D-Net: A convolutional neural network for transform-domain collaborative filtering. IEEE Signal Processing Letters, 25(1):55–59, Jan 2018.

[162] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

[163] Tom Dietterich. Overfitting and undercomputing in machine learning. ACM Comput. Surv., 27(3):326–327, September 1995.

[164] Igor V. Tetko, David J. Livingstone, and Alexander I. Luik. Neural network studies. 1. Comparison of overfitting and overtraining. Journal of Chemical Information and Computer Sciences, 35(5):826–833, 1995.

[165] Mostafa Mahmoud, Bojian Zheng, Alberto Delmás Lascorz, Felix Heide, Jonathan Assouline, Paul Boucher, Emmanuel Onzon, and Andreas Moshovos. IDEAL: Image denoising accelerator. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, pages 82–95, 2017.

[166] Mihir Mody. ADAS Front Camera: Demystifying Resolution and Frame-Rate. http://www.eetimes.com/author.asp?section_id=36&doc_id=1329109, March 7, 2016.


[167] Bosch’s Driver assistance systems - Predictive pedestrian protection. http://products.bosch-mobility-solutions.com/en/de/_technik/component/SF_PC_DA_Predictive-Pedestrian-Protection_SF_PC_Driver-Assistance-Systems_5251.html?compId=2880, 2017.

[168] http://www.cs.tut.fi/~foi/GCF-BM3D/.

[169] A. Yasin. A top-down method for performance analysis and counters architecture. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 35–44, March 2014.

[170] W.T. Padgett and D.V. Anderson. Fixed-Point Signal Processing. Synthesis Lectures on Signal Processing. Morgan & Claypool, 2009.

[171] F. Cabello, J. León, Y. Iano, and R. Arthur. Implementation of a fixed-point 2D Gaussian Filter for Image Processing based on FPGA. In 2015 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), pages 28–33, Sept 2015.

[172] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A Cycle Accurate Memory System Simulator. IEEE Computer Architecture Letters, 10(1):16–19, Jan 2011.

[173] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 469–480, Dec 2009.

[174] http://www.photographyblog.com/.

[175] R. Yazdani, A. Segura, J. M. Arnau, and A. Gonzalez. An ultra low-power hardware accelerator for automatic speech recognition. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–12, Oct 2016.

[176] V. M. Weaver, M. Johnson, K. Kasichayanula, J. Ralph, P. Luszczek, D. Terpstra, and S. Moore. Measuring Energy and Power with PAPI. In 2012 41st International Conference on Parallel Processing Workshops, pages 262–268, Sept 2012.

[177] Intel 64 and IA-32 Architectures Software Developer’s Manual. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf.

[178] NVIDIA Visual Profiler. https://developer.nvidia.com/nvidia-visual-profiler.

[179] Marc’Aurelio Ranzato, Y-Lan Boureau, Sumit Chopra, and Yann LeCun. A Unified Energy-Based Framework for Unsupervised Learning. In Marina Meila and Xiaotong Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-07), volume 2, pages 371–379. Journal of Machine Learning Research - Proceedings Track, 2007.

[180] S. Zhang and E. Salari. Image denoising using a neural network based non-linear filter in wavelet domain. In Proceedings (ICASSP ’05), IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, volume 2, pages ii/989–ii/992 Vol. 2, March 2005.

[181] H.C. Burger, C.J. Schuler, and S. Harmeling. Learning how to combine internal and external denoising methods. In Proceedings of the 35th German Conference on Pattern Recognition (GCPR 2013), 2013.


[182] M. Poremba, S. Mittal, D. Li, J. S. Vetter, and Y. Xie. DESTINY: A tool for modeling emerging 3D NVMand eDRAM caches. In 2015 Design, Automation Test in Europe Conference Exhibition (DATE), pages1543–1546, March 2015.

[183] Jason Clemons, Chih C. Cheng, Iuri Frosio, Daniel Johnson, and Stephen W. Keckler. A patch memorysystem for image processing and computer vision. In 2016 49th Annual IEEE/ACM International Symposium

on Microarchitecture (MICRO), pages 1–13, Oct 2016.

[184] Mostafa Mahmoud, Kevin Siu, and Andreas Moshovos. Diffy: a Déjà vu-free differential deep neuralnetwork accelerator. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture

(MICRO), pages 134–147, Oct 2018.

[185] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M. Aamodt, Natalie Enright Jerger, and AndreasMoshovos. Proteus: Exploiting numerical precision variability in deep neural networks. In Proceedings of

the 2016 International Conference on Supercomputing, ICS ’16, pages 23:1–23:12, New York, NY, USA,2016. ACM.

[186] Alberto Delmas, Sayeh Sharify, Patrick Judd, Milos Nikolic, and Andreas Moshovos. DPRed: Makingtypical activation values matter in deep learning computing. CoRR, abs/1804.06732, 2018.

[187] Wen-Chang Yeh and Chein-Wei Jen. High-speed booth encoded parallel multiplier design. IEEE Transac-

tions on Computers, 49(7):692–701, July 2000.

[188] Alberto Delmas, Patrick Judd, Sayeh Sharify, and Andreas Moshovos. Dynamic stripes: Exploiting thedynamic precision requirements of activation values in neural networks. CoRR, abs/1706.00504, 2017.

[189] Xuan Yang, Jing Pu, Blaine Burton Rister, Nikhil Bhagdikar, Stephen Richardson, Shahar Kvatinsky,Jonathan Ragan-Kelley, Ardavan Pedram, and Mark Horowitz. A systematic approach to blocking convolu-tional neural networks. CoRR, abs/1606.04209, 2016.

[190] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and itsapplication to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth

IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 416–423 vol.2, 2001.

[191] Lei Zhang, Xiaolin Wu, Antoni Buades, and Xin Li. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. Journal of Electronic Imaging, 20(2):023016, 2011.

[192] Rich Franzen. Kodak Lossless True Color Image Suite. http://r0k.us/graphics/kodak, November 15, 1999.

[193] Marc Lebrun, Miguel Colom, and Jean-Michel Morel. The Noise Clinic: a Blind Image Denoising Algorithm. Image Processing On Line, 5:1–54, 2015.

[194] Neat Image. https://ni.neatvideo.com/home.

[195] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision, pages 576–584, 2015.

[196] Hamid Rahim Sheikh, Zhou Wang, Lawrence Cormack, and Alan C. Bovik. Live Image Quality Assessment Database Release 2. http://live.ece.utexas.edu/research/quality/subjective.htm.

[197] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing, 15(11):3440–3451, Nov 2006.

[198] Chih-Yuan Yang, Chao Ma, and Ming-Hsuan Yang. Single-image super-resolution: A benchmark. In Proceedings of European Conference on Computer Vision, 2014.

[199] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on non-negative neighbor embedding. In British Machine Vision Conference (BMVC), September 2012.

[200] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In Jean-Daniel Boissonnat, Patrick Chenin, Albert Cohen, Christian Gout, Tom Lyche, Marie-Laurence Mazure, and Larry Schumaker, editors, Curves and Surfaces, pages 711–730, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[201] Synopsys. Design Compiler Graphical. https://www.synopsys.com/implementation-and-signoff/rtl-synthesis-test/design-compiler-graphical.html.

[202] Cadence. Innovus Implementation System. https://www.cadence.com/content/cadence-www/global/en_US/home/tools/digital-design-and-signoff/hierarchical-design-and-floorplanning/innovus-implementation-system.html.

[203] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[204] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, June 2015.

[205] Lukas Cavigelli, Philippe Degen, and Luca Benini. CBinfer: Change-based inference for convolutional neural networks on video data. In Proceedings of the 11th International Conference on Distributed Smart Cameras, ICDSC 2017, pages 1–8, New York, NY, USA, 2017. ACM.

[206] Yuhao Zhu, Anand Samajdar, Matthew Mattina, and Paul Whatmough. Euphrates: Algorithm-SoC co-design for low-power mobile continuous vision. CoRR, abs/1803.11232, 2018.

[207] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA ’11, pages 365–376, New York, NY, USA, 2011. ACM.

[208] M. Bohr. A 30 Year Retrospective on Dennard’s MOSFET Scaling Paper. IEEE Solid-State Circuits Society Newsletter, 12(1):11–13, Winter 2007.

[209] L. Robert Hocking, Russell MacKenzie, and Carola-Bibiane Schönlieb. Guidefill: GPU accelerated, artist guided geometric inpainting for 3D conversion of film. SIAM Journal on Imaging Sciences, 10(4):2049–2090, 2017.

[210] K. G. W. Karunaratne, P. U. Wickramasinghe, and J. G. Samarawickrama. GPU based acceleration for Fergus’ image deblurring algorithm. In 7th International Conference on Information and Automation for Sustainability, pages 1–6, Dec 2014.

[211] B. Ahn and N. I. Cho. Block-matching convolutional neural network for image denoising. CoRR, abs/1704.00524, 2017.

[212] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–12, Oct 2016.

[213] Alberto Delmas Lascorz, Patrick Judd, Dylan Malone Stuart, Zissis Poulos, Mostafa Mahmoud, Sayeh Sharify, Milos Nikolic, Kevin Siu, and Andreas Moshovos. Bit-Tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks. In The 24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Providence, RI, USA, 2019.

[214] Sayeh Sharify, Alberto Delmas Lascorz, Kevin Siu, Patrick Judd, and Andreas Moshovos. Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks. In Proceedings of the 55th Annual Design Automation Conference, DAC ’18, pages 20:1–20:6, San Francisco, CA, USA, 2018.

[215] Daniel Neil, Junhaeng Lee, Tobi Delbrück, and Shih-Chii Liu. Delta networks for optimized recurrent network computation. CoRR, abs/1612.05571, 2016.