
Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform

Xintao Wang, Ke Yu, Chao Dong, Chen Change Loy

Problem

[Figure: a low-resolution image enlarged 4× to a high-resolution image]

Previous work

• Contemporary SR algorithms are mostly CNN-based methods [1].

• Most CNN-based methods use a pixel-wise loss function (MSE-based models): good at recovering edges and smooth areas, but not good at texture recovery.

• Adversarial loss was introduced in SRGAN [2] and EnhanceNet [3] (GAN-based models): it encourages the network to favor solutions that look more like natural images, and the visual quality of the reconstruction is significantly improved.

[Figure: SRCNN vs. SRGAN vs. ground truth]

[1] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, 2014.

[2] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.

[3] M. S. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In ICCV, 2017.

Motivation

[Figure: a building patch and a plant patch upscaled ×4; swapping the plant prior and the building prior changes the restored textures]

Semantic categorical prior: animal, building, water, sky, grass, mountain, plant.

Issues

1. How do we represent the semantic categorical prior?

2. How can the categorical prior be incorporated into the reconstruction process effectively?

Our approach: explore semantic segmentation probability maps as the categorical prior, down to the pixel level.

Our approach: propose a novel Spatial Feature Transform (SFT) that is capable of altering the network behavior conditioned on other information.

Represent categorical prior

• Contemporary CNN segmentation network [1] (ResNet-101)

• Fine-tuned on LR images

Pipeline: LR image → segmentation network → probability maps over K categories → argmax → semantic categorical prior.

[1] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
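As a rough sketch (PyTorch), the probability maps and the hard prior could be derived as below; `seg_model` is a hypothetical stand-in for the ResNet-101-based segmentation network fine-tuned on LR images:

```python
import torch.nn.functional as F

def categorical_prior(seg_model, lr_image):
    """Sketch: derive the categorical prior from a segmentation network.
    `seg_model` is assumed to return per-pixel logits of shape (N, K, H, W)
    for K semantic categories."""
    logits = seg_model(lr_image)
    prob_maps = F.softmax(logits, dim=1)   # Psi = (P_1, ..., P_K)
    labels = prob_maps.argmax(dim=1)       # hard semantic categorical prior
    return prob_maps, labels
```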

Examples on segmentation

[Figure: input LR images with segments on LR images, segments on HR images, and ground truth; categories: animal, sky, grass, building, mountain, plant, water, background]

Incorporate conditions

• CNN for SR: y = G_θ(x), where x is the input LR image, y is the restored image, and G is the network parametrized by θ.

• Categorical prior: Ψ = (P₁, P₂, …, P_K), the segmentation probability maps.

• Conditioned SR: y = G_θ(x | Ψ). How should the prior Ψ be fed into G?

Spatial Feature Transform

• By learning a mapping function ℳ, the prior Ψ is modeled by a pair of affine transformation parameters (γ, β): ℳ: Ψ ↦ (γ, β).

• The modulation is then carried out as an affine transformation of the feature maps F: SFT(F | γ, β) = γ ⊙ F + β.

• Together, y = G_θ(x | Ψ) with ℳ: Ψ ↦ (γ, β) becomes y = G_θ(x | γ, β).
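A minimal PyTorch sketch of an SFT layer (channel sizes and the small condition-to-(γ, β) convolutions are assumptions, not the paper's exact configuration):

```python
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial Feature Transform: SFT(F | gamma, beta) = gamma ⊙ F + beta,
    where (gamma, beta) are predicted per-pixel from the shared conditions."""
    def __init__(self, feat_channels=64, cond_channels=32):
        super().__init__()
        self.to_gamma = nn.Sequential(
            nn.Conv2d(cond_channels, cond_channels, 1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(cond_channels, feat_channels, 1))
        self.to_beta = nn.Sequential(
            nn.Conv2d(cond_channels, cond_channels, 1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(cond_channels, feat_channels, 1))

    def forward(self, feat, cond):
        # Affine transform on the feature maps; "spatial" because
        # gamma and beta are full maps rather than per-channel scalars.
        return self.to_gamma(cond) * feat + self.to_beta(cond)
```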

Spatial Feature Transform

[Figure: SFT-Net architecture. A condition network (a stack of Conv layers) maps the segmentation probability maps to shared SFT conditions. The SR branch stacks residual blocks, each containing SFT layers interleaved with Conv layers; an SFT layer predicts γᵢ and βᵢ from the conditions with small Conv layers and modulates the features as γᵢ ⊙ F + βᵢ. Upsampling and final Conv layers produce the output.]

Loss function

• Adversarial loss [1]: the generator G_θ and the discriminator D_η compete,

min_θ max_η  E_{y∼p_HR}[log D_η(y)] + E_{x∼p_LR}[log(1 − D_η(G_θ(x)))]
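A hedged sketch of this minimax objective in PyTorch (`D` and `G` are placeholder modules; practical implementations typically use a BCE or non-saturating variant):

```python
import torch

def adversarial_losses(D, G, lr_batch, hr_batch, eps=1e-8):
    """Vanilla minimax GAN losses as written on the slide (sketch only)."""
    sr = G(lr_batch)
    # Discriminator maximizes log D(y) + log(1 - D(G(x))).
    d_loss = -(torch.log(D(hr_batch) + eps).mean()
               + torch.log(1 - D(sr.detach()) + eps).mean())
    # Generator minimizes log(1 - D(G(x))).
    g_loss = torch.log(1 - D(sr) + eps).mean()
    return d_loss, g_loss
```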

• Perceptual loss [2]: encourages the network to generate images that reside on the manifold of natural images,

‖φ_VGG(ŷ) − φ_VGG(y)‖₂²

using a pre-trained 19-layer VGG network (features before the activation of conv5_4); this optimizes the super-resolution model in a feature space rather than in pixel space.
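A sketch of this loss with torchvision's VGG-19 (the exact layer cut is an assumption; conv5_4 is index 34 of `vgg19().features`, so slicing `[:35]` takes its pre-activation output):

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

class VGGPerceptualLoss(nn.Module):
    """||phi_VGG(y_hat) - phi_VGG(y)||_2^2 on conv5_4 features (sketch)."""
    def __init__(self):
        super().__init__()
        self.phi = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:35].eval()
        for p in self.phi.parameters():
            p.requires_grad = False  # VGG stays fixed; only the SR model trains

    def forward(self, sr, hr):
        # MSE in VGG feature space rather than pixel space.
        return F.mse_loss(self.phi(sr), self.phi(hr))
```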

[1] I. Goodfellow et al. Generative adversarial nets. In NIPS, 2014.

[2] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.

Spatial condition

• The modulation parameters (γ, β) have a close relationship with the probability maps P and contain spatial information.

[Figure: input, restored LR patch, P_building map, P_grass map, γ map of C6, β map of C7]

Delicate modulation

[Figure: restored LR patch with P_plant and P_grass maps; γ maps of C14 and C51, β maps of C1 and C5]

Results

[Figure: qualitative comparison — GT, SRCNN (PSNR 24.83 dB), SRGAN (23.36 dB), EnhanceNet (22.71 dB), SFT-Net (ours, 22.90 dB)]

[Figure: comparison across Bicubic, SRCNN, VDSR, LapSRN, DRRN, MemNet, EnhanceNet, SRGAN, SFT-Net (ours), and GT]

Results

[Charts: user study. Part I: pairwise preference between ours and each GAN-based method (SRGAN, EnhanceNet) across categories (sky, building, grass, animal, plant, water, mountain). Part II: rank distribution (Rank-1 to Rank-4) among GT, ours, MemNet, and SRCNN.]

Impact of different priors

[Figure: a building patch restored with each prior — building, sky, grass, mountain, water, plant, animal — and bicubic]

[Figure: further examples, including a mountain patch, restored with each prior and bicubic]

Other conditioning methods

[Figure: comparison with other conditioning methods — input concatenation, compositional mapping [1], and FiLM [2] — versus SFT-Net (ours)]

[1] S. Zhu, S. Fidler, R. Urtasun, D. Lin, and C. C. Loy. Be your own Prada: Fashion synthesis with structural coherence. In ICCV, 2017.

[2] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville. FiLM: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.

Robustness to out-of-category

[Figure: SRGAN vs. ours on out-of-category examples]

Conclusion

• Explore semantic segmentation maps as a categorical prior for realistic texture recovery.

• Propose a novel Spatial Feature Transform layer to efficiently incorporate the categorical conditions into a CNN-based SR network.

• Extensive comparisons and a user study demonstrate the capability of SFT-Net in generating realistic and visually pleasing textures.

Crafting a Toolchain for Image Restoration by Deep Reinforcement Learning

Ke Yu, Chao Dong, Liang Lin, Chen Change Loy

Image Restoration

• There are many individual tasks:
  • Denoising
  • Deblurring
  • JPEG deblocking
  • Super-resolution
  • …

• Towards more complicated distortions:
  • Address multiple levels of degradation in one task [1, 2]
  • Address multiple individual tasks [3]

Image Restoration – A New Setting

• Consider multiple distortions simultaneously:
  • Real-world: image capture and storage
  • Synthetic: Gaussian blur, Gaussian noise, and JPEG compression

[Figure: our new task — a real-world scenario modeled by a synthetic setting of Gaussian blur, Gaussian noise, and JPEG compression]

Motivation

• Can we use a single CNN to address multiple distortions?
  • Inefficient: it would require a huge network to handle all the possibilities
  • Inflexible: all kinds of distorted images are processed with the same structure

• Find a more efficient and flexible approach: process different distortions in different ways!

Method – Decision Making

• Progressively restore the image quality

• Treat image restoration as a decision-making process, sketched below:

"Noisy! Try a denoising tool" → "Blurry! Try a deblurring tool" → "Artifacts! Try a deblocking tool" → "Good enough :)"
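The loop below is an illustrative sketch of this decision process (the `agent` and `toolbox` interfaces are assumptions, mirroring the agent slide later):

```python
import torch

NUM_TOOLS = 12
STOP = NUM_TOOLS  # action index 12 = stopping action

@torch.no_grad()
def restore(image, agent, toolbox, max_steps=3):
    """Apply the agent's chosen tool step by step until it decides to stop."""
    state = None
    last_action = torch.zeros(1, NUM_TOOLS + 1)  # one-hot of previous action
    for _ in range(max_steps):
        scores, state = agent(image, last_action, state)
        action = scores.argmax(dim=1).item()
        if action == STOP:
            break  # "Good enough :)"
        image = toolbox[action](image)  # e.g., a denoising/deblurring/deblocking CNN
        last_action = torch.zeros(1, NUM_TOOLS + 1)
        last_action[0, action] = 1.0
    return image
```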

Method – Overview

• Our framework requires a toolbox and an agent

[Figure: agent and toolbox applied alternately over successive restoration steps]

Method – Toolbox

• We design 12 tools, each of which addresses a simple task:
  • 3-layer CNNs [4]
  • 8-layer CNNs

Method – Agent

• Use reinforcement learning to address tool selection

• State: the current distorted image and the action taken at the last step

• Action: the 12 tools plus a stopping action

• Reward: PSNR gain at each step

• Structure: [Figure: input image I₁ → feature extractor; last action → one-hot encoder; both feed an LSTM, which outputs the value vector v₁ for state S₁]
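A minimal sketch of such an agent in PyTorch (feature sizes and the LSTM interface are assumptions; it matches the `agent(image, last_action, state)` signature used in the earlier loop):

```python
import torch
import torch.nn as nn

NUM_TOOLS = 12  # plus one stopping action

class Agent(nn.Module):
    """Image features + one-hot of the last action feed an LSTM cell,
    whose hidden state scores the 12 tools and the stopping action."""
    def __init__(self, feat_dim=32, hidden_dim=64):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(3, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTMCell(feat_dim + NUM_TOOLS + 1, hidden_dim)
        self.head = nn.Linear(hidden_dim, NUM_TOOLS + 1)

    def forward(self, image, last_action_onehot, state=None):
        x = torch.cat([self.feature_extractor(image), last_action_onehot], dim=1)
        h, c = self.lstm(x, state)  # recurrent state carries the toolchain history
        return self.head(h), (h, c)
```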

Method – Joint Training

• Challenge of the 'middle state':
  • Intermediate results appear after several steps of processing
  • None of the tools has seen these intermediate results during individual training

• Joint training: unroll whole toolchains and train all tools together, as sketched below.

[Figure: joint training — forward and backward passes through toolchain 1 and toolchain 2, each supervised with an MSE loss on the final output]
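A sketch of one joint-training step (interfaces are assumed; the key point is that gradients flow through the whole unrolled toolchain, so every tool adapts to the middle states):

```python
import torch.nn.functional as F

def joint_training_step(toolchain, distorted, target, optimizer):
    """Forward a distorted image through an entire toolchain and backpropagate
    a single MSE loss on the final output through all tools."""
    x = distorted
    for tool in toolchain:        # e.g., [deblur_cnn, denoise_cnn, deblock_cnn]
        x = tool(x)               # each tool now sees intermediate results
    loss = F.mse_loss(x, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```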

Experimental Results

• Dataset: DIV2K [5]

• Comparison with generic models for image restoration:
  • VDSR [1]
  • DnCNN [3]

Experimental Results

• Quantitative results on DIV2K and runtime analyses

[Table: quantitative results on DIV2K and runtime analyses. Takeaways: more efficient, better generality, competitive performance]

Experimental Results

• Qualitative results on DIV2K

[Figure: input and restored images after the 1st, 2nd, and 3rd steps, compared with VDSR-s and VDSR [1], under mild (unseen), moderate, and severe (unseen) distortions]

Experimental Results

• Qualitative results on real-world images

[Figure: input and results after the 1st, 2nd, and 3rd steps, compared with VDSR [1]]

Experimental Results

• Ablation study:
  • Joint training
  • Stopping action

Conclusion

• Contributions:
  • Address image restoration in a reinforcement learning framework
  • Propose joint training to cope with the middle processing state
  • A dynamically formed toolchain performs competitively against human-designed networks with lower computational complexity

• Future work:
  • Incorporate more tools (e.g., trained with a GAN loss)
  • Handle spatially variant distortions

Thanks! Q & A

References

[1] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.

[2] Y. Tai, J. Yang, X. Liu, and C. Xu. MemNet: A persistent memory network for image restoration. In ICCV, 2017.

[3] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. TIP, 2017.

[4] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. TPAMI, 38(2):295–307, 2016.

[5] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPR Workshops, 2017.
