Neural Architecture Search for Deep Image Prior

Kary Ho [1], Andrew Gilbert [1], Hailin Jin [2], John Collomosse [1,2]

[1] Centre for Vision Speech and Signal Processing, University of Surrey
[2] Creative Intelligence Lab, Adobe Research

Abstract

We present a neural architecture search (NAS) technique to enhance the performance of unsupervised image de-noising, in-painting and super-resolution under the recently proposed Deep Image Prior (DIP). We show that evolutionary search can automatically optimize the encoder-decoder (E-D) structure and meta-parameters of the DIP network, which serves as a content-specific prior to regularize these single-image restoration tasks. Our binary representation encodes the design space for an asymmetric E-D network that typically converges to yield a content-specific DIP within 10-20 generations using a population size of 500. The optimized architectures consistently improve upon the visual quality of classical DIP for a diverse range of photographic and artistic content.

1. Introduction

Many common image restoration tasks require the estimation of missing pixel data: de-noising and artefact removal; in-painting; and super-resolution. Usually, this missing data is estimated from surrounding pixel data, under a smoothness prior. Recently it was shown that the architecture of a randomly initialized convolutional neural network (CNN) can serve as an effective prior, regularizing estimates for missing pixel data to fall within the manifold of natural images. This regularization technique, referred to as the Deep Image Prior (DIP) [34], exploits both texture self-similarity within an image and the translation equivariance of CNNs to produce competitive results for the image restoration tasks mentioned above. However, the efficacy of DIP is dependent on the architecture of the CNN used; different content requires different CNN architectures for excellent performance, and care over meta-parameter choices, e.g. filter sizes, channel depths, epoch count, etc.

This paper contributes an evolutionary strategy to automatically search for the optimal CNN architecture and associated meta-parameters given a single input image. The core technical contribution is a genetic algorithm (GA) [8] for representing and optimizing a content-specific network architecture for use as a DIP regularizer in unsupervised image restoration tasks. We show that superior results can be achieved through architecture search versus standard DIP backbones or random architectures. We demonstrate these performance improvements for image de-noising, Content-Aware Fill (otherwise known as in-painting), and content up-scaling for static images; Fig. 1 contrasts the output of classic DIP [34] with our neural architecture search (NAS-DIP).

Figure 1. Neural Architecture Search yields a content-specific deep image prior (NAS-DIP, right) that enhances performance over classical DIP (left) for unsupervised image restoration tasks, including de-noising (top) and in-painting (bottom). Source inset (red).

Unlike image classification, to which NAS has been extensively applied [29, 28, 23, 11], there is no ground truth available in the DIP formulation. An encoder-decoder network (whose architecture we seek) is overfitted to reconstruct the input image from a random noise field, so acquiring a generative model of that image's structure. Parallel work in GANs [40] has shown that architectural design, particularly of the decoder, is critical to learning a sufficient generative model, and moreover, that this architecture is content specific. Our work is the first to both automate the design of encoder-decoder architectures and to do so without supervision, leveraging a state-of-the-art perceptual metric to guide the optimization [44]. Further, we demonstrate that the optimized architectures are content specific, and enable clustering on visual style without supervision.

2. Related Work

Neural architecture search (NAS) seeks to automate the design of deep neural network architectures through data-driven optimization [12], most commonly for image classification [47, 28], but recently also for object detection [47] and semantic segmentation [6]. NAS addresses a facet of the automated machine learning (AutoML) [20] problem, which more generally addresses hyper-parameter optimization and the tuning of training meta-parameters.

Early NAS leveraged Bayesian optimization for MLP networks [4], and was extended to CNNs [9] for CIFAR-10. In their seminal work, Zoph and Le [46] applied reinforcement learning (RL) to construct image classification networks via an action space, tokenizing actions into an RNN-synthesised string, with reward driven by accuracy on validation data. The initially high computational overhead (800 GPUs for 4 weeks) was reduced while further enhancing accuracy, for example by exploring alternative policies such as proximal policy optimization [31] or Q-learning [2]. RL approaches over RNNs now scale to contemporary datasets, e.g. NASNet for ImageNet classification [26, 47], and were recently explored for GANs over CIFAR-10 [16]. Cai et al. [5] similarly tokenize the architecture, but explore the solution space via sequential transformation of the string using function-preserving mutation operations on an LSTM-learned embedding of the string. Our work also encodes architecture as a (in our case, binary) string, but optimizes via an evolutionary strategy rather than training a sequential model under RL.

Evolutionary strategies for network architecture optimization were first explored for MLPs in the early nineties [25]. Genetic algorithms (GAs) were used to optimize both the architecture and weights [1, 33], rather than relying upon backprop, in order to reduce the GA evaluation bottleneck; however, this is not practical for contemporary CNNs. Selection and population culling strategies [11, 29, 28] have since been explored to develop high-performing image classification networks over ImageNet [23, 28]. Our work is unique in that we explore image restoration encoder-decoder network architectures via GA optimization, and as such, our architecture representation differs from prior work.

Single image restoration has been extensively studied in classical vision and deep learning, where priors are prescribed or learned from representative data. A common prior for texture synthesis is the Markov Random Field (MRF), in which the pair-wise term encodes spatial coherence. Several in-painting works exploit MRF formulations of this kind [22, 18, 24], including methods that source patches from the input image [10] or from multiple external images [17, 14], or use random propagation of a few well-matched patches [3].

Patch self-similarity within single images has also been exploited for single image super-resolution [15] and de-noising. The Deep Image Prior (DIP) [34] (and its video extension [42]) exploits the translation equivariance of CNNs to learn and transfer patches within the receptive field. Very recently, a single image GAN [32] has been proposed to learn a multi-resolution model of appearance, and DIP has been applied to image layer decomposition [13]. Our work targets the use cases for DIP proposed within the original paper [34]; namely super-resolution, de-noising and region in-painting. All three of these use cases have been investigated substantially in the computer vision and graphics literature, and all have demonstrated significant performance benefits when solved using a deep neural network trained to perform that task on representative data. Generative Adversarial Networks (GANs) are more widely used for in-painting and super-resolution [43, 36], learning structure and appearance from a representative corpus of image data [41], in some cases explicitly maintaining both local and global consistency through independent models [21]. Our approach, like that of DIP, differs in that we do not train a network to perform a specific task. Instead, we use an untrained (randomly initialized) network to perform the task, by overfitting such a network to a single image under a task-specific loss using neural architecture search.

3. Architecture Search for DIP

The core principle of DIP is to learn a generative CNN $G_\theta$ (where $\theta$ are the learned network parameters, e.g. weights) to reconstruct $x$ from a noise field $\mathcal{N}$ of identical height and width to $x$, with pixels drawn from a uniform random distribution. Ulyanov et al. [34] propose a symmetric encoder-decoder network architecture with skip connections for $G_\theta$, comprising five pairs of (up-)convolutional layers, with varying architectures depending on the image restoration application (de-noising, in-painting or super-resolution) and the image content to be processed. A reconstruction loss is applied to learn $G_\theta$ for a single given image $x$:

$$\theta^* = \operatorname*{argmin}_{\theta} \lVert G_\theta(\mathcal{N}) - x \rVert_2^2 \qquad (1)$$

Our core contribution is to optimize not only for $\theta$ but also for the architecture $G$, using a genetic algorithm (GA) guided via a perceptual quality metric [44], as now described.
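To make Eq. (1) concrete, the following is a minimal sketch of the DIP fitting step, written against a PyTorch-style API (the published implementation is in TensorFlow); the network argument is any E-D network, and only the objective follows Eq. (1).

```python
# Minimal sketch of the DIP fit in Eq. (1), assuming a PyTorch-style API.
# `net` is any encoder-decoder network mapping an image-shaped tensor to an
# image-shaped tensor; the noise field N is held constant throughout.
import torch
import torch.nn.functional as F

def fit_dip(net, x, num_epochs=2000, lr=1e-3):
    """Overfit `net` to reconstruct image x (shape 1xCxHxW) from fixed noise."""
    noise = torch.rand_like(x)               # uniform random noise field N
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(num_epochs):
        opt.zero_grad()
        loss = F.mse_loss(net(noise), x)     # || G_theta(N) - x ||_2^2
        loss.backward()
        opt.step()
    with torch.no_grad():
        return net(noise)                    # the reconstruction x_hat
```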

3.1. Network Representation

Figure 2. Architecture search space of NAS-DIP(-T). The encoder-decoder (E-D) network $G$ (right) is formed of several E-D units ($U_n$), each with a paired encoder ($E_n$) and decoder ($D_n$) stage (zoomed, left), each represented by 7 bits, plus an additional 4N-bit block $R_n$ encoding gated skip connections from $E_n$ to the decoder blocks in the network. Optionally, the training epoch count $T$ is encoded. The binary representation for E-D units is discussed further in subsec. 3.1. Under DIP, images are reconstructed from a constant noise field $\mathcal{N}$ by optimizing to find weights $\theta$, thus overfitting the network to the input image $x$ under a reconstruction loss, e.g. here for de-noising (Eq. 3).

We encode the space of encoder-decoder (E-D) architectures from which to sample $G$ as a constant-length binary sequence, representing $N$ paired encoder-decoder units $U = \{U_1, \ldots, U_N\}$. Following [34], $G$ is a fully convolutional E-D network, and we optimize for the connectivity and meta-parameters of its convolutional layers. A given unit $U_n$ comprises an encoder stage $E_n$ and a decoder stage $D_n$, each of which requires 7 bits to encode its parameter tuple. Unit $U_n$ requires a total of $14 + 4N$ bits to encode, as an additional 4N-bit block per unit encodes a tuple specifying the configuration of skip connections from its encoder stage to each of the decoder stages, i.e. both within itself and other units. Thus, the total binary representation for an architecture in neural architecture search for DIP (NAS-DIP) is $N(14 + 4N)$ bits. For our experiments, we use $N = 6$, but note that the effective number of encoder or decoder stages varies according to skip connections, which may bypass a stage. Fig. 2 illustrates the organisation of the E-D units (right) and the detailed architecture of a single E-D unit (left). Each unit $U_n$ comprises the following binary representation, where superscripts indicate elements of the parameter tuple:

• $E^s_n \in \{0,1\}$ (1 bit): a binary indicator of whether the encoder stage of unit $U_n$ is skipped (bypassed).

• $E^f_n \in [0,7]$ (3 bits): encodes the filter size $f = 2E^f_n + 1$ learned by the convolutional encoder stage.

• $E^h_n \in [0,7]$ (3 bits): encodes the number of filters $h = 2^{E^h_n - 1}$, and so the number of channels output by the encoder stage.

• $D^s_n \in \{0,1\}$ (1 bit): a binary indicator of whether the decoder stage of unit $U_n$ is skipped (bypassed).

• $D^f_n \in [0,7]$ (3 bits): encodes the filter size $f = 2D^f_n + 1$ learned by the up-convolutional decoder stage.

• $D^h_n \in [0,7]$ (3 bits): encodes the number of filters $h = 2^{D^h_n - 1}$, and so the number of channels output by the decoder stage.

• $R_n \in \mathbb{B}^{4N}$ (4N bits): encodes gated skip connections as $[\rho^1_n, \ldots, \rho^N_n]$; each 4-bit group $\rho^i_n \in [0,15]$ determines whether gated skip path $r^i_n$ connects from $E_n$ to $D_i$ and, if so, how many filters/channels it carries (the skip gate is closed if $\rho^i_n = 0$).

NAS-DIP-T. We explore a variant of the representation encoding a maximum epoch count $T = 500 \times (2^t - 1)$ via two additional bits coding for $t$, and thus a representation length for NAS-DIP-T of $N(14 + 4N) + 2$.

Symmetric NAS-DIP. We also explore a compact variant of the representation that forces $E_n = D_n$ for all parameters, constraining the search to symmetric E-D architectures and requiring only 10 bits to encode $U_n$. However, we later show that such symmetric architectures are almost always outperformed by the asymmetric architectures discovered via our unconstrained search (subsec. 4.1).
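As an illustration, decoding one unit's 7-bit encoder (or decoder) block might look as follows; the bit ordering within the block is an assumption made for this sketch, while the filter-size and channel-count mappings follow the formulas above.

```python
# Sketch: decoding a 7-bit encoder or decoder stage from the binary genome of
# Sec. 3.1. The assumed field order is [skip][3-bit filter code][3-bit channel
# code]; the actual ordering in the published representation may differ.
def bits_to_int(bits):
    """Interpret a list of 0/1 values as an unsigned integer (MSB first)."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def decode_stage(bits7):
    """Decode one 7-bit stage into its parameter tuple."""
    assert len(bits7) == 7
    skip = bits7[0]                     # 1 => this stage is bypassed
    f_code = bits_to_int(bits7[1:4])    # E^f_n or D^f_n, in [0, 7]
    h_code = bits_to_int(bits7[4:7])    # E^h_n or D^h_n, in [0, 7]
    return {
        "skipped": bool(skip),
        "filter_size": 2 * f_code + 1,  # odd filter sizes 1..15, f = 2v + 1
        "channel_code": h_code,         # channels h = 2**(h_code - 1), per Sec. 3.1
    }
```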

3.2. Evolutionary Optimization

DIP provides an architecture-specific prior that regularises the reconstruction of a given source image $x$ from a uniform random noise field $\mathcal{N}$. Fixing $\mathcal{N}$ constant, we use a genetic algorithm (GA) to search the architecture space for the optimal architecture $G^*$ to recover a 'restored' (e.g. de-noised, in-painted or upscaled) version $\hat{x}$ of that source image:

$$\hat{x} = G^*_{\theta^*}(\mathcal{N}) \qquad (2)$$

GAs simulate the process of natural selection by breeding successive generations of individuals through the processes of cross-over, fitness-proportionate reproduction and mutation. In our implementation, such individuals are network configurations encoded via our binary representation.

Figure 3. NAS-DIP convergence for the in-painting task on BAM! [39]. Left: input (top) and converged (bottom) result at generation 13. Right: sampling the top, middle and bottom performing architectures (shown in respective rows) from a population of 500 architectures, at the first (1st, middle) and final (13th, rightmost) generations. Inset scores: fitness selection is driven using LPIPS [44]; evaluation by PSNR/SSIM.

Individuals are evaluated by running a pre-specified DIP image restoration task using the encoded architecture. We consider unsupervised or 'blind' restoration tasks (de-noising, in-painting, super-resolution) in which an ideal $\hat{x}$ is unknown (it is sought) and so cannot guide the GA search. In lieu of this ground truth, we employ a perceptual measure (subsec. 3.2.1) to assess the visual quality generated by any candidate architecture, training $G_\theta(\mathcal{N})$ via backpropagation to minimize a task-specific reconstruction loss:

$$L_{\mathrm{de\text{-}noise}}(x; G) = \min_\theta \lvert G_\theta(\mathcal{N}) - x \rvert \qquad (3)$$
$$L_{\mathrm{in\text{-}paint}}(x; G) = \min_\theta \lvert M(G_\theta(\mathcal{N})) - M(x) \rvert \qquad (4)$$
$$L_{\mathrm{upscale}}(x; G) = \min_\theta \lvert D(G_\theta(\mathcal{N})) - x \rvert \qquad (5)$$

where $D$ is a down-sampling operator reducing its target to the size of $x$ via bi-linear interpolation, and $M(.)$ is a masking operator that returns zero within the region to be in-painted.
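A minimal sketch of the residuals behind Eqs. (3)-(5), in NumPy; `mask` plays the role of the operator M (zero inside the in-painting region) and `downsample` stands in for the bi-linear operator D, both names being illustrative.

```python
# Sketch of the task-specific residuals behind Eqs. (3)-(5), using NumPy.
# `g_out` is the network reconstruction G_theta(N) and `x` the source image,
# both arrays of the same shape (g_out larger for upscaling).
import numpy as np

def denoise_loss(g_out, x):
    return np.abs(g_out - x).sum()                  # |G(N) - x|

def inpaint_loss(g_out, x, mask):
    return np.abs(mask * g_out - mask * x).sum()    # |M(G(N)) - M(x)|

def upscale_loss(g_out, x, downsample):
    return np.abs(downsample(g_out) - x).sum()      # |D(G(N)) - x|
```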

We now describe a single iteration of the GA search, which is repeated until the improvements gained over the previous few generations are marginal (i.e. the changes in both average and maximum population fitness over a sliding time window fall below a threshold).

3.2.1 Population Sampling and Fitness

We initialize a population of $K = 500$ solutions by uniformly sampling $\mathbb{B}^{N(14+4N)}$ to seed the initial architectures $G = \{G_1, G_2, \ldots, G_K\}$. The visual quality of $\hat{x}$ under each architecture is assessed 'blind' using a learned perceptual measure (the LPIPS score) [44] as a proxy for individual fitness:

$$f(G_i) = P\big(\operatorname*{argmin}_{x} L(x; G_\theta)\big) \qquad (6)$$

where $L$ is the task-specific loss (Eqs. 3-5) and $P(.)$ is the perceptual score [44] for a given individual $G_i$ in the population. We explored several blind perceptual scores from the GAN literature, including Inception Score [30] and Fréchet Inception Distance [19], but found LPIPS to correlate better with PSNR and SSIM during our evaluation, improving convergence for NAS-DIP.
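As a sketch, the fitness evaluation of Eq. (6) could be written as below; `restore` runs the DIP fit for one encoded architecture, and `perceptual_score` stands in for an LPIPS-based measure [44] oriented so that higher means better (e.g. a negated LPIPS distance). Both names are illustrative, not part of the published code; the zero-fitness handling of invalid genomes follows subsec. 3.2.2.

```python
# Sketch of the fitness proxy in Eq. (6). `restore` fits a decoded architecture
# to the source image and returns the reconstruction x_hat (the inner argmin);
# `perceptual_score` must be oriented so that a higher value means higher
# visual quality.
def fitness(genome, x, restore, perceptual_score):
    try:
        x_hat = restore(genome, x)     # argmin_x L(x; G_theta), via backprop
    except ValueError:                 # syntactically invalid architecture
        return 0.0                     # zero fitness bars it from selection
    return perceptual_score(x_hat)
```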

Individuals are selected stochastically (and with replacement) for breeding, to populate the next generation of solutions. Architectures that produce higher quality outputs are more likely to be selected. However, population diversity is encouraged via the stochasticity of the process and the introduction of random mutations into the offspring genome. We apply elitism: the bottom 5% of the population is culled, and the top 5% passes unperturbed to the next generation, so the fittest individual in successive generations is at least as fit as those in the past. The middle 90% is used to produce the remainder of the next generation. Two individuals are selected stochastically with a bias to fitness, $p(G_i) = f(G_i) / \sum_{j=1}^{K} f(G_j)$, and bred via cross-over and mutation (subsec. 3.2.2) to produce a novel offspring for the successive generation. This process repeats, with replacement, until the population count matches that of the prior generation.
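One generation of this scheme can be sketched as follows, under the stated 5%/90% splits; `breed` applies the cross-over and mutation described next (subsec. 3.2.2), and the helper is a sketch rather than the authors' implementation.

```python
# Sketch of one GA generation: cull the bottom 5%, copy the top 5% unchanged
# (elitism), and breed the remainder from the middle 90% by fitness-
# proportionate (roulette-wheel) sampling with replacement.
import random

def next_generation(population, fitnesses, breed, elite_frac=0.05):
    ranked = sorted(zip(population, fitnesses), key=lambda p: p[1], reverse=True)
    k = len(ranked)
    n = max(1, int(elite_frac * k))
    elites = [g for g, _ in ranked[:n]]   # top 5% pass through unperturbed
    middle = ranked[n:k - n]              # bottom 5% culled from the pool
    genomes = [g for g, _ in middle]
    weights = [f for _, f in middle]      # p(G_i) = f(G_i) / sum_j f(G_j)
    children = []
    while len(children) < k - n:          # breed the remainder of the population
        a, b = random.choices(genomes, weights=weights, k=2)  # with replacement
        children.append(breed(a, b))
    return elites + children
```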

3.2.2 Cross-over and Mutation

Individuals are bred via genetic cross-over: given two constant-length binary genomes of the form $G = \{(E_1, R_1, D_1), \ldots, (E_N, R_N, D_N)\} \in \mathbb{B}^{N(14+4N)}$, a splice point $S \in [1, N]$ is randomly selected, such that units from a pair of parents A and B are combined by copying units $U_{n<S}$ from A and $U_{n \geq S}$ from B. Such cross-over could generate syntactically invalid genomes, e.g. due to tensor size incompatibilities between units in the genome. During evaluation, an invalid architecture evaluates to zero fitness, prohibiting its selection for subsequent generations.

Figure 4. Evaluation of the proposed NAS-DIP (and variants) vs. classical DIP [34] for in-painting, de-noising and 4x super-resolution on the Library, Plane and Zebra images of the DIP dataset. For each task, from left to right, the maximum population fitness graph is shown. We compare the proposed unconstrained (asymmetric) E-D network search of NAS-DIP (optionally with epoch count T freely optimized; NAS-DIP-T) with NAS-DIP constrained to symmetric E-D architectures only. The baseline (dotted blue) for classical DIP uses the architecture published in [34]. Examples are further quantified in Table 2.

Population diversity is encouraged via the stochasticity of the selection process and the introduction of random mutations into the offspring genome. Each bit within the offspring genome is subject to a random flip with low probability $p_m$; we explore the trade-off of this rate against convergence in subsec. 4.2.
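A sketch of the breeding step under the representation of subsec. 3.1, assuming genomes are flat lists of bits grouped into N equal-length units; the grouping convention is an assumption for the sketch.

```python
# Sketch of breeding: single-point cross-over at a unit boundary, followed by
# independent bit flips with probability p_m. Genomes are flat lists of 0/1 of
# length N * (14 + 4N), treated here as N contiguous equal-length units.
import random

def breed(parent_a, parent_b, n_units=5, p_m=0.05):
    unit_len = len(parent_a) // n_units        # 14 + 4N bits per unit
    splice = random.randint(1, n_units)        # splice point S in [1, N]
    cut = (splice - 1) * unit_len              # units U_{n<S} come from A
    child = parent_a[:cut] + parent_b[cut:]    # units U_{n>=S} come from B
    return [b ^ 1 if random.random() < p_m else b for b in child]  # mutation
```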

4. Experiments and Discussion

We evaluate the proposed neural architecture search technique for DIP (NAS-DIP) on each of the three blind image restoration tasks proposed originally for DIP [34]: image in-painting, super-resolution and de-noising.

Datasets. We evaluate over three public datasets: 1) Places2 [45], a dataset of photos commonly used for in-painting (test partition sampled as in [21]); 2) Behance Artistic Media (BAM!) [38], a dataset of artwork, using the test partition of BAM! selected to evaluate in-painting in [14], comprising 8 media styles; 3) DIP [34], the dataset of images used to evaluate the original DIP algorithm of Ulyanov et al. Where baseline comparison is made to images from the latter, the specific network architecture published in [34] is replicated according to [35] and the authors' public implementation. Results are quantified via three objective metrics: PSNR (as in DIP) and structural similarity (SSIM) [37] are used to evaluate against ground truth, and we also report the perceptual metric (LPIPS) [44] used as the fitness score in the GA optimization.

Training details. We implement NAS-DIP in TensorFlow, using Adam and a learning rate of $10^{-3}$ for all experiments. For NAS-DIP the epoch count is fixed at 2000; under NAS-DIP-T this, like other meta-parameters, is searched via the optimization. To alleviate the evaluation bottleneck in NAS (which takes, on average, 2-3 minutes per architecture proposal), we distribute the evaluation step over 16 Nvidia Titan-X GPUs, each capable of evaluating two proposals concurrently, for an average total NAS-DIP search time of 3-6 hours (for an average of 10-20 generations to convergence). Our DIP implementation extends the authors' public code for DIP [34].

Dataset        Task               PSNR (higher better)   LPIPS (lower better)   SSIM (higher better)
                                  DIP     Proposed       DIP     Proposed       DIP     Proposed
BAM! [39]      In-painting        16.8    19.76          0.42    0.25           0.32    0.62
Places2 [21]   In-painting        12.4    15.43          0.37    0.27           0.37    0.58
DIP [34]       In-painting        23.7    27.3           0.33    0.04           0.76    0.92
DIP [34]       De-noising         17.6    18.95          0.13    0.09           0.73    0.85
DIP [34]       Super-Resolution   18.9    19.3           0.38    0.16           0.48    0.62

Table 1. Per-dataset average performance of NAS-DIP (asymmetric representation) versus DIP, over the dataset proposed in DIP and the in-painting datasets BAM! and Places2.

Task               Image      PSNR (higher better)        LPIPS (lower better)         SSIM (higher better)
                              DIP   Sym   Asym  Asym-T    DIP    Sym    Asym  Asym-T   DIP   Sym   Asym  Asym-T
In-painting        Vase       18.3  29.8  29.2  30.2      0.76   0.02   0.01  0.02     0.48  0.95  0.96  0.95
                   Library    19.4  19.7  20.4  20.0      0.15   0.10   0.09  0.12     0.83  0.81  0.84  0.83
                   Face       33.4  34.2  36.0  35.9      0.078  0.03   0.01  0.04     0.95  0.95  0.96  0.95
De-noising         Plane      23.1  25.8  25.7  25.9      0.15   0.096  0.09  0.07     0.85  0.90  0.92  0.94
                   Snail      12.0  12.6  12.2  12.7      0.11   0.12   0.09  0.06     0.61  0.52  0.74  0.84
Super-Resolution   Zebra 4x   19.1  20.2  19.6  20.0      0.19   0.13   0.14  0.14     0.67  0.51  0.75  0.72
                   Zebra 8x   18.7  19.1  18.96 19.1      0.57   0.29   0.19  0.20     0.28  0.34  0.48  0.47

Table 2. Detailed quantitative comparison of visual quality under three image restoration tasks, for the DIP dataset. The architecture presented in the original DIP for each image is compared with the best architecture found under each of three variants of the proposed method: symmetric NAS-DIP (Sym), NAS-DIP (Asym) and NAS-DIP-T (Asym-T) (subsec. 3.1). Quality metrics: LPIPS [44] (used in the NAS objective); and both PSNR and SSIM [37] as external metrics.

Figure 5. t-SNE visualizations in architecture space ($\mathbb{B}^{N(14+4N)}$) for a NAS-DIP, N = 5 symmetric network: a) population distribution at convergence; a multi-modal distribution of similar best DIP architectures is obtained for the super-resolution task (image inset); b) the best discovered architecture from the population (lower), and the classical DIP default architecture [35] (upper) for this task.

Reproducibility is critical for NAS-DIP optimization; the same genome must yield the same perceptual score for a given source image. In addition to fixing all random seeds (e.g. for batch sampling, and fixing the initial noise field), we take additional technical steps to avoid non-determinism in cuDNN, e.g. avoiding the atomic add operations in the image padding present in the original DIP code.
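The gist of the seed fixing can be sketched as below; the exact determinism switches vary by framework version, so the environment variable shown (honoured by recent TensorFlow releases) is indicative rather than a complete recipe.

```python
# Illustrative seed fixing for reproducible fitness evaluation. This is not a
# complete recipe on its own: as noted above, the original DIP code also
# needed its non-deterministic padding ops patched.
import os
import random
import numpy as np

def fix_seeds(seed=0):
    os.environ["TF_DETERMINISTIC_OPS"] = "1"  # request deterministic GPU kernels
    random.seed(seed)                         # Python-level RNG (e.g. GA sampling)
    np.random.seed(seed)                      # NumPy RNG (e.g. the noise field N)
```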

4.1. Network Representation

We evaluate the proposed NAS method over three variants of the architecture representation proposed in subsec. 3.1: NAS-DIP, NAS-DIP-T (in which epoch count is also optimized), and a constrained version of NAS-DIP forcing a symmetric E-D network. For all experiments, N = 5, enabling E-D networks of up to 10 (up-)convolutional layers with gated skip connections. Performance is evaluated using PSNR, for comparison with the original DIP [34], as well as SSIM; the LPIPS score used to guide NAS is also reported. Table 2 provides a direct comparison on images from Ulyanov et al. [34]; visual output and convergence graphs are shown in Fig. 4. Following random initialization, relative fitness gains of up to 30% are observed after 10-20 generations, after which performance converges to values above the DIP baseline [34] in all cases.

For both the in-painting and de-noising tasks, asymmetric networks are found that outperform any symmetric network, including the networks published in the original DIP for those images and tasks. For super-resolution, a symmetric network is found to outperform asymmetric networks, and constraining the genome in this way aids in discovering a performant network. In all cases, it was unhelpful to optimize for epoch count T, despite prior observations on the importance of T in obtaining good reconstructions under DIP [34]. Table 1 broadens the experiment for NAS-DIP, averaging performance over three datasets: BAM!, with 80 randomly sampled images, 10 from each of 8 styles; Places2, with the 50-image subset included in [21]; and DIP, with the 7 images used in [34]. For all three tasks and all three datasets, NAS-DIP discovers networks outperforming those of DIP [34].

Task               Image      PSNR (higher better)           LPIPS (lower better)           SSIM (higher better)
                              DIP   0.01  0.02  0.05  0.10   DIP   0.01  0.02  0.05  0.10   DIP   0.01  0.02  0.05  0.10
In-painting        Vase       18.3  24.5  28.5  30.2  29.4   0.76  0.12  0.10  0.01  0.07   0.48  0.57  0.67  0.96  0.69
                   Library    19.4  20.0  19.5  20.4  19.8   0.15  0.12  0.11  0.09  0.09   0.83  0.81  0.81  0.84  0.82
                   Face       33.4  34.5  35.6  35.9  35.4   0.01  0.01  0.01  0.01  0.01   0.95  0.95  0.95  0.96  0.95
De-noising         Plane      23.1  22.9  23.7  25.9  23.6   0.15  0.11  0.01  0.09  0.09   0.85  0.85  0.92  0.94  0.91
                   Snail      12.0  12.0  12.3  12.7  12.2   0.11  0.10  0.09  0.00  0.02   0.61  0.78  0.82  0.84  0.81
Super-Resolution   Zebra 4x   19.1  19.6  19.7  20.0  19.6   0.19  0.15  0.15  0.14  0.16   0.67  0.70  0.70  0.72  0.71
                   Zebra 8x   18.7  18.9  19.0  19.1  19.0   0.57  0.34  0.24  0.19  0.28   0.32  0.35  0.38  0.47  0.43

Table 3. Quantitative results for varying mutation rate $p_m$ (the chance of a bit flip in the offspring genome; columns 0.01-0.10), compared against the original DIP. Quality metrics: LPIPS [44] (used in the NAS objective); and both PSNR and SSIM as external metrics.

Figure 6. Result of varying the mutation (bit flip) rate $p_m$; improved visual quality (left, zoom recommended) is obtained with a mutation rate of $p_m = 0.05$, equating to an average of 4 bit flips per offspring. Convergence graphs for each visual example (right).

4.2. Evaluating Mutation Rate

Population elitism requires a moderate mutation rate $p_m$ to ensure population diversity, and so convergence; yet raised too high, convergence occurs at a lower quality level. We evaluate $p_m = \{0.01, 0.02, 0.05, 0.10\}$ in Table 3, observing that for all tasks a bit flip probability of 5% (i.e. an average of 4 bit flips per offspring for N = 5) encourages convergence to the highest fitness value. This in turn correlates with the highest visual quality according to both the PSNR and SSIM external metrics. Fig. 6 provides representative visual output alongside convergence graphs. In general, convergence is achieved within 10-20 generations, taking a few hours on a single GPU for a constant population size K = 500.

4.3. Content-Aware In-painting

To qualitatively assess Content-Aware in-painting performance, we compare our proposed approach against several contemporary baselines: PatchMatch [3] and Image Melding [7], which sample patches from within a single source image; and methods that use millions of training images: the generative convnet approach Context Encoder [27], a recent GAN in-painting work [21], and style-aware in-painting [14]. We compare against these baselines using scenic photographs in Places2; Fig. 7 presents a visual comparison, and quantitative results are given in Table 1 over the image set included in [21], with the same mask regions published in that work.

4.4. Discovered Architectures

Architectures discovered by NAS-DIP are visualized in Fig. 5a, using a t-SNE projection of the architecture space ($\mathbb{B}^{N(14+4N)}$) for a solution population at convergence (generation 20), on a representative super-resolution task for which a symmetric E-D network was sought (source image inset). Distinct clusters of stronger candidate architectures have emerged, each utilising all 10 convolutional stages and similar filter sizes, but with distinct skip activations and channel widths; colour coding indicates fitness. Fig. 5b shows a schematic of the default architecture for this task published in [35], alongside the best discovered architecture for this task and image.

Figure 7. Qualitative visual comparison of NAS-DIP in-painting quality against 5 baseline methods; examples from the Places2 dataset [45]. Columns: Source/Mask (red), PatchMatch [3], Image Melding [7], Context Encoder [27], ImgComp [21], Style-aware [14], Proposed.

Figure 8. Content-specific architecture discovery: NAS-DIP in-painted artwork results from BAM! [39]. a) t-SNE visualization of discovered architectures in $\mathbb{B}^{N(14+4N)}$ for 80 in-painted artworks (7 BAM! styles + DIP photographs); common visual styles yield a common best performing architecture under NAS-DIP. b) Sample in-painted results (source/mask left, output right).

Fig. 8 visualizes the 80 best performing networks for in-painting BAM! artworks of identical content (flowers) but exhibiting 8 different artistic styles. Each image has been run through NAS-DIP under the loss of Eq. 4; representative output is included in Fig. 8. Visualizing the best performing architectures via t-SNE in $\mathbb{B}^{N(14+4N)}$ reveals style-specific clusters; pen-and-ink, graphite sketches, 3D graphics renderings, comics, and vector artwork all form distinct groups, while others, e.g. watercolor, form several clusters. We conclude that images sharing a common visual style exhibit commonality in the network architecture necessary to perform image reconstruction well. This discovery suggests a potential application for NAS-DIP beyond architecture search, to unsupervised clustering, e.g. for visual aesthetics.

5. Conclusion

We reported the first neural architecture search (NAS) for image reconstruction under the recently proposed Deep Image Prior (DIP) [34], learning a content-specific prior for a given source image in the form of an encoder-decoder (E-D) network architecture. Following the success of evolutionary search techniques for image classification networks [29, 28], we leveraged a genetic algorithm to search a binary space representing asymmetric E-D architectures, and demonstrated its efficacy for image de-noising, in-painting and super-resolution. For the latter task, we observed that a constrained version of our genome, yielding symmetric networks, exceeded the performance of the asymmetric networks that benefited the other two tasks. All image restoration tasks were 'blind', and optimization was guided via a proxy measure for perceptual quality [44]. In all cases, we observed the performance of the discovered networks to significantly exceed that of classical DIP, and noted the potential for content-specific architectures beyond image restoration, e.g. in unsupervised style clustering. Future work could pursue the latter parallel application, as well as explore further generalizations of the architecture space beyond fully convolutional E-Ds, to incorporate pooling and normalization strategies.

References

[1] P. Angeline, G. Saunders, and J. Pollack. An evolutionary algorithm that constructs recurrent neural networks. IEEE Trans. on Neural Networks, 5(1):54-65, 1994.
[2] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In Proc. ICLR, 2017.
[3] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. In Proc. ACM SIGGRAPH, 2009.
[4] J. Bergstra, D. Yamins, and D. Cox. Making a science of model search: Hyper-parameter optimization in hundreds of dimensions for vision architectures. In Proc. ICML, 2013.
[5] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. In Proc. AAAI, 2018.
[6] L. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens. Searching for efficient multi-scale architectures for dense image prediction. In Proc. NeurIPS, pages 8713-8724, 2018.
[7] S. Darabi, E. Shechtman, C. Barnes, D. B. Goldman, and P. Sen. Image melding: Combining inconsistent images using patch-based synthesis. ACM TOG, 2012.
[8] K. de Jong. Learning with genetic algorithms. J. Machine Learning, 3:121-138, 1988.
[9] T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proc. IJCAI, 2015.
[10] A. Efros and T. Leung. Texture synthesis by non-parametric sampling. In Proc. ICCV, 1999.
[11] T. Elsken, J. Metzen, and F. Hutter. Efficient multi-objective neural architecture search via Lamarckian evolution. In Proc. ICLR, 2019.
[12] T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: A survey. Journal of Machine Learning Research (JMLR), 20:1-21, 2019.
[13] Y. Gandelsman, A. Shocher, and M. Irani. Double-DIP: Unsupervised image decomposition via coupled deep-image-priors. In Proc. CVPR, 2018.
[14] A. Gilbert, J. Collomosse, H. Jin, and B. Price. Disentangling structure and aesthetics for content-aware image completion. In Proc. CVPR, 2018.
[15] D. Glasner, S. Bagon, and M. Irani. Super-resolution from a single image. In Proc. ICCV, 2009.
[16] X. Gong, S. Chang, Y. Jiang, and Z. Wang. AutoGAN: Neural architecture search for generative adversarial networks. In Proc. ICCV, 2019.
[17] J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM TOG, 2007.
[18] K. He and J. Sun. Statistics of patch offsets for image completion. In Proc. ECCV, 2012.
[19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. NeurIPS, pages 6629-6640, 2017.
[20] F. Hutter, L. Kotthoff, and J. Vanschoren, editors. Automated Machine Learning: Methods, Systems, Challenges. Springer, 2019. Available at http://automl.org/book.
[21] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM TOG (Proc. SIGGRAPH), 36(4):107:1-107:14, 2017.
[22] V. Kwatra, A. Schodl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: Image and video synthesis using graph cuts. ACM TOG, 3(22):277-286, 2003.
[23] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, F.-F. Li, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In Proc. ECCV, 2018.
[24] Y. Liu and V. Caselles. Exemplar-based image inpainting using multiscale graph cuts. IEEE Trans. on Image Processing, pages 1699-1711, 2013.
[25] G. F. Miller, P. M. Todd, and S. U. Hedge. Designing neural networks using genetic algorithms. In Proc. Intl. Conf. on Genetic Algorithms, 1989.
[26] R. Negrinho and G. Gordon. DeepArchitect: Automatically designing and training deep architectures. arXiv:1704.08792, 2017.
[27] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In Proc. CVPR, 2016.
[28] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Aging evolution for image classifier architecture search. In Proc. AAAI, 2019.
[29] E. Real, S. Moore, A. Selle, S. Saxena, Y. Suematsu, Q. V. Le, and A. Kurakin. Large-scale evolution of image classifiers. In Proc. ICLR, 2017.
[30] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Proc. NeurIPS, pages 2234-2242, 2016.
[31] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
[32] T. R. Shaham, T. Dekel, and T. Michaeli. SinGAN: Learning a generative model from a single natural image. In Proc. ICCV, 2019.
[33] K. Stanley and R. Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10:99-127, 2002.
[34] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In Proc. CVPR, 2018.
[35] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. Intl. Journal of Computer Vision (IJCV), 2019.
[36] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proc. CVPR, 2018.
[37] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Trans. on Image Processing, 13(4):600-612, 2004.
[38] M. Wilber, C. Fang, H. Jin, A. Hertzmann, J. Collomosse, and S. Belongie. BAM! The Behance Artistic Media dataset for recognition beyond photography. In Proc. ICCV, 2017.
[39] M. J. Wilber, C. Fang, H. Jin, A. Hertzmann, J. Collomosse, and S. Belongie. BAM! The Behance Artistic Media dataset for recognition beyond photography. arXiv:1704.08614, 2017.
[40] Z. Wojna, V. Ferrari, S. Guadarrama, N. Silberman, L. Chen, A. Fathi, and J. Uijlings. The devil is in the decoder: Classification, regression and GANs. Intl. Journal of Computer Vision (IJCV), 127:1694-1706, 2019.
[41] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with perceptual and contextual losses. arXiv:1607.07539, 2016.
[42] H. Zhang, L. Mai, H. Jin, Z. Wang, N. Xu, and J. Collomosse. Video inpainting: An internal learning approach. In Proc. ICCV, 2019.
[43] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, S. Huang, and D. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Trans. PAMI, 41(8):1947-1962, 2017.
[44] R. Zhang, P. Isola, A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. CVPR, 2018.
[45] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for deep scene understanding. arXiv:1610.02055, 2016.
[46] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In Proc. ICLR, 2017.
[47] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proc. CVPR, 2018.