Spatial-Angular Interaction for Light Field Image Super-Resolution

Yingqian Wang1, Longguang Wang1, Jungang Yang1, Wei An1, Jingyi Yu2, Yulan Guo1,3

1 College of Electronic Science and Technology, National University of Defense Technology, China
2 School of Information Science and Technology, ShanghaiTech University, China

3 School of Electronics and Communication Engineering, Sun Yat-sen University, China
{wangyingqian16, wanglongguang15, yangjungang, anwei, yulan.guo}@nudt.edu.cn,

[email protected]

Abstract

Light field (LF) cameras record both the intensity and the directions of light rays, and capture scenes from a number of viewpoints. Both the information within each perspective (i.e., spatial information) and that among different perspectives (i.e., angular information) is beneficial to image super-resolution (SR). In this paper, we propose a spatial-angular interactive network (namely, LF-InterNet) for LF image SR. In our method, spatial and angular features are separately extracted from the input LF using two specifically designed convolutions. These extracted features are then repetitively interacted to incorporate both spatial and angular information. Finally, the interacted spatial and angular features are fused to super-resolve each sub-aperture image. Experiments on 6 public LF datasets have demonstrated the superiority of our method. As compared to existing LF and single image SR methods, our method can recover many more details, and achieves significant improvements over the state-of-the-art methods in terms of PSNR and SSIM.

1. Introduction

Light field (LF) cameras provide multiple views of a scene, and thus enable many attractive applications such as post-capture refocusing [33], depth sensing [26], saliency detection [19], and de-occlusion [32]. However, LF cameras face a trade-off between spatial and angular resolutions [49]. That is, they either provide dense angular samplings with a low image resolution (e.g., Lytro¹ and RayTrix²), or capture high-resolution (HR) sub-aperture images (SAIs) with sparse angular samplings (e.g., camera arrays [37, 30]). Consequently, many efforts have been made to improve the angular resolution through LF reconstruction [39, 38], or the spatial resolution through LF image super-resolution (SR) [1, 46, 23, 31, 41]. In this paper, we focus on the LF image SR problem, namely, to reconstruct HR SAIs from their corresponding low-resolution (LR) SAIs.

1 https://www.lytro.com
2 https://www.raytrix.de


Figure 1. Average PSNR and SSIM values achieved by state-of-the-art SR methods on 6 public LF datasets [22, 10, 36, 17, 29, 21]. Note that, our LF-InterNet improves PSNR and SSIM values by a large margin as compared to single image SR methods (VDSR [15], EDSR [20], RCAN [47]) and LF image SR methods (LFBM5D [1], resLF [46], GBSQ [23], LFSSR_4D [41]).


Image SR is a long-standing problem in computer vision. To achieve high reconstruction performance, SR methods need to incorporate as much useful information as possible from the LR inputs. In the area of single image SR, good performance can be achieved by fully exploiting the neighborhood context (i.e., spatial information) in an image. Using the spatial information, single image SR methods [5, 15, 20, 47] can successfully hallucinate missing details. In contrast, LF cameras capture scenes from multiple views. The complementary information among different views (i.e., angular information) can be used to further improve the performance of LF image SR.

However, due to the complicated 4D structure of LFs [18], it is highly challenging to incorporate spatial and angular information in an LF.



Existing LF image SR methods fail to fully exploit both the angular information and the spatial information, resulting in limited SR performance. Specifically, in [43, 42, 44], SAIs are first super-resolved separately using single image SR methods [5, 20], and then fine-tuned together to incorporate the angular information. The angular information is ignored by these two-stage methods [43, 42, 44] during their upsampling process. In [31, 46], only part of the SAIs are used to super-resolve one view, and the angular information in the discarded views is not incorporated. In contrast, Rossi et al. proposed a graph-based method [23] to consider all angular views in an optimization process. However, this method [23] cannot fully use the spatial information, and is inferior to deep learning-based SR methods [20, 47, 46, 41]. It is worth noting that, even if all views are fed to a deep network, it is still challenging to achieve superior performance. Yeung et al. proposed a deep network named LFSSR [41] to consider all views for LF image SR. However, as shown in Fig. 1, LFSSR [41] is inferior to resLF [46], EDSR [20], and RCAN [47].

The spatial information and the angular information are highly coupled in 4D LFs, and contribute to LF image SR in different manners. Consequently, it is difficult for networks to perform well by using this coupled information directly. To efficiently incorporate spatial and angular information, we propose a spatial-angular interactive network (i.e., LF-InterNet) for LF image SR. We first specifically design two convolutions to decouple spatial and angular features from an input LF. Then, we develop LF-InterNet to repetitively interact and incorporate spatial and angular information. Extensive ablation studies have been conducted to validate our designs. We compare our method to state-of-the-art single and LF image SR methods on 6 public LF datasets. As shown in Fig. 1, our LF-InterNet substantially improves the PSNR and SSIM performance as compared to existing SR methods.

2. Related Works

In this section, we review several major works on single image SR and LF image SR.

2.1. Single Image SR

In the area of single image SR, deep learning-based methods have been extensively explored. Readers are referred to recent surveys [34, 3, 40] for more details on single image SR. Here, we only review several milestone works. Dong et al. proposed the first CNN-based SR method (i.e., SRCNN [5]) by cascading 3 convolutional layers. Although SRCNN [5] is shallow and simple, it achieves significant improvements over traditional SR methods [28, 14, 45]. Afterwards, SR networks became increasingly deep and complex, and thus more powerful in spatial information exploitation. Kim et al. proposed a very deep SR network (i.e., VDSR [15]) with 20 convolutional layers. Global residual learning is applied in VDSR [15] to avoid slow convergence. Lim et al. proposed an enhanced deep SR network (i.e., EDSR [20]) with 65 convolutional layers by cascading residual blocks [9]. EDSR substantially improves its performance by applying both local and global residual learning, and won the NTIRE 2017 challenge on single image SR [27]. More recently, Zhang et al. proposed a residual dense network (i.e., RDN [48]) with 149 convolutional layers by combining ResNet [9] with DenseNet [12]. Using residual dense connections, RDN [48] can fully extract hierarchical features for image SR, and thus achieves further improvements over EDSR [20]. Subsequently, Zhang et al. proposed a residual channel attention network (i.e., RCAN) [47] by applying both a recursive residual mechanism and a channel attention module [11]. RCAN [47] has 500 convolutional layers, and is one of the most powerful SR methods to date.

2.2. LF Image SR

In the area of LF image SR, different paradigms have been proposed. Most early works follow the traditional paradigm. Bishop et al. [4] first estimated the scene depth and then used a de-convolution approach to estimate HR SAIs. Wanner et al. [35] proposed a variational LF image SR framework using the estimated disparity map. Farrugia et al. [7] decomposed HR-LR patches into several subspaces, and achieved LF image SR via PCA analysis. Alain et al. extended SR-BM3D [6] to LFs, and super-resolved SAIs using LFBM5D filtering [1]. Rossi et al. [23] formulated LF image SR as a graph optimization problem. These traditional methods [4, 35, 7, 1, 23] use different approaches to exploit angular information, but cannot fully exploit spatial information.

In contrast, deep learning-based SR methods are more effective in exploiting spatial information, and thus can achieve promising performance. Many deep learning-based methods have recently been developed for LF image SR. In the pioneering work proposed by Yoon et al. (i.e., LFCNN [43]), SAIs are first super-resolved separately using SRCNN [5], and then fine-tuned in pairs to incorporate angular information. Similarly, Yuan et al. proposed LF-DCNN [44], in which they used EDSR [20] to super-resolve each SAI and then fine-tuned the results. Both LFCNN [43] and LF-DCNN [44] handle the LF image SR problem in two stages, and do not use angular information in the first stage. Different from [43, 44], Wang et al. proposed LFNet [31] by extending BRCN [13] to LF image SR. In their method, SAIs from the same row (or column) are fed to a recurrent network to incorporate the angular information. Zhang et al. stacked SAIs along different angular directions to generate input volumes, and then proposed a multi-stream residual network named resLF [46]. Both LFNet [31] and resLF [46] reduce the 4D LF to a 3D LF by using part of the SAIs to super-resolve one view. Consequently, the angular information in the discarded views cannot be incorporated. To consider all views for LF image SR, Yeung et al. proposed LFSSR [41] to alternately shuffle LF features between the SAI pattern and the macro-pixel image (MPI) pattern for convolution. However, the complicated LF structure and coupled information have hindered the performance gain of LFSSR [41].



Figure 2. SAI array (left) and MPI (right) representations of LFs. Both the SAI array and the MPI representations have the same size of $\mathbb{R}^{UH \times VW}$. Note that, to convert an SAI array representation into an MPI representation, pixels at the same spatial coordinates of each SAI need to be extracted and organized according to their angular coordinates to generate a macro-pixel. Then, an MPI can be generated by organizing these macro-pixels according to their spatial coordinates. More details are presented in the supplemental material.


3. Method

In this section, we first introduce the approach to decouple spatial and angular features in Section 3.1, and then present our network in detail in Section 3.2.

3.1. Spatial-Angular Feature Decoupling

An LF has a 4D structure and can be denoted as $L \in \mathbb{R}^{U \times V \times H \times W}$, where $U$ and $V$ represent the angular dimensions (e.g., $U = 3$, $V = 4$ for a $3 \times 4$ LF), and $H$ and $W$ represent the height and width of each SAI. Intuitively, an LF can be considered as a 2D angular collection of SAIs, and the SAI at each angular coordinate $(u, v)$ can be denoted as $L(u, v, :, :) \in \mathbb{R}^{H \times W}$. Similarly, an LF can also be organized into an MPI, namely, a 2D spatial collection of macro-pixels. The macro-pixel at each spatial coordinate $(h, w)$ can be denoted as $L(:, :, h, w) \in \mathbb{R}^{U \times V}$. An illustration of these two LF representations is shown in Fig. 2. Note that, when an LF is organized as a 2D SAI array, the angular information is implicitly contained among different SAIs and thus is hard to extract. Therefore, we use the MPI representation in our method, and specifically design two convolutions (i.e., an Angular Feature Extractor (AFE) and a Spatial Feature Extractor (SFE)) to extract and decouple angular and spatial features.
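For concreteness, the re-organization between the two representations amounts to a few tensor reshapes. The following is a minimal sketch (ours, not the released code), assuming the LF is stored as a 4D tensor of shape [U, V, H, W]:

```python
import torch

def sai_to_mpi(lf):
    # lf: [U, V, H, W] light field stored as a 2D collection of SAIs
    U, V, H, W = lf.shape
    # group pixels sharing the same spatial coordinate into U x V macro-pixels
    return lf.permute(2, 0, 3, 1).reshape(U * H, V * W)   # MPI of size UH x VW

def mpi_to_sai(mpi, U, V):
    # mpi: [U*H, V*W] macro-pixel image
    H, W = mpi.shape[0] // U, mpi.shape[1] // V
    return mpi.reshape(H, U, W, V).permute(1, 3, 0, 2)    # back to [U, V, H, W]
```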


Figure 3. An illustration of the angular and spatial feature extractors. Here, an LF of size $\mathbb{R}^{3 \times 3 \times 3 \times 3}$ is used as a toy example. For better visualization, pixels from different SAIs are represented with different labels (e.g., red arrays or green squares), while different macro-pixels are painted with different background colors. Note that, AFE only extracts angular features and SFE only extracts spatial features, resulting in spatial-angular information decoupling.

Since most methods use SAIs distributed in a square array as their input, we follow [1, 23, 43, 42, 41, 46] to set $U = V = A$ in our method, where $A$ denotes the angular resolution. Given an LF of size $\mathbb{R}^{A \times A \times H \times W}$, an MPI of size $\mathbb{R}^{AH \times AW}$ can be generated by organizing macro-pixels of size $A \times A$ according to their spatial coordinates. Here, we use a toy example in Fig. 3 to illustrate the processes of angular and spatial feature extraction. Specifically, AFE is defined as a convolution with a kernel size of $A \times A$ and a stride of $A$. Padding is not performed, so the features generated by AFE have a size of $\mathbb{R}^{H \times W \times C}$, where $C$ represents the feature depth. In contrast, SFE is defined as a convolution with a kernel size of $3 \times 3$, a stride of 1, and a dilation of $A$. We perform zero padding to ensure that the output features have the same spatial size $AH \times AW$ as the input MPI. It is worth noting that, during angular feature extraction, each macro-pixel can be exactly convolved by AFE, while the information across different macro-pixels will not be aliased. Similarly, during spatial feature extraction, pixels in each SAI can be convolved by the SFE, while the angular information will not be involved. In this way, the spatial and angular information in an LF is decoupled.
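In PyTorch, the two extractors can be written as plain 2D convolutions. The snippet below is a sketch under the settings described above; the angular resolution A and feature depth C are example values, not fixed by the method:

```python
import torch.nn as nn

A, C = 5, 64  # example angular resolution and feature depth

# Angular Feature Extractor: kernel A x A, stride A, no padding.
# Each A x A macro-pixel is convolved exactly once, so an MPI feature of size
# [B, C, A*H, A*W] is mapped to an angular feature of size [B, C, H, W].
afe = nn.Conv2d(C, C, kernel_size=A, stride=A, padding=0)

# Spatial Feature Extractor: kernel 3 x 3, stride 1, dilation A, padding A.
# The dilation makes every sampled pixel come from the same SAI, and the zero
# padding keeps the output size at [B, C, A*H, A*W].
sfe = nn.Conv2d(C, C, kernel_size=3, stride=1, dilation=A, padding=A)
```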

Due to the 3D nature of real scenes, objects at different depths have different disparity values in LFs. Consequently, the pixels of an object in different views do not always lie within a single macro-pixel. To handle this problem, we apply AFE and SFE multiple times (i.e., performing spatial-angular interaction) in our network. As shown in Fig. 4, in this way, the receptive field can be enlarged to cover pixels with disparities.

3.2. Network Design

Our LF-InterNet takes an LR MPI of size $\mathbb{R}^{AH \times AW}$ as its input and produces an HR SAI array of size $\mathbb{R}^{\alpha AH \times \alpha AW}$, where $\alpha$ denotes the upsampling factor. Following [46, 41, 31, 44], we convert RGB images into the YCbCr color space and only super-resolve the Y-channel images.



Figure 4. A visualization of the receptive field of our LF-InterNet using the Grad-CAM method [24]. We performed 2× SR on the 5 × 5 central views of scene HCInew_bicycle [10], and investigated the contribution of input pixels to the specified output pixel (marked in the zoom-in image of (a)). (a) Center-view SAI. (b) Heat maps generated by Grad-CAM [24]. The contributive pixels are highlighted. (c) Epipolar-plane images (EPIs) of the output LF and the heat maps. In summary, our LF-InterNet can handle the disparity problem in LF image SR, and its receptive field can cover the corresponding pixels in each LR image.

An overview of our network is shown in Fig. 5.

3.2.1 Overall Architecture

Given an LR MPI $I_{LR} \in \mathbb{R}^{AH \times AW}$, the angular and spatial features are first extracted by AFE and SFE, respectively:

$F_{A,0} = H_A(I_{LR}), \quad F_{S,0} = H_S(I_{LR})$,    (1)

where $F_{A,0} \in \mathbb{R}^{H \times W \times C}$ and $F_{S,0} \in \mathbb{R}^{AH \times AW \times C}$ respectively represent the extracted angular and spatial features, and $H_A$ and $H_S$ respectively represent the angular and spatial feature extractors (as described in Section 3.1). After initial feature extraction, the features $F_{A,0}$ and $F_{S,0}$ are further processed by a series of interaction groups (i.e., Inter-Groups, see Section 3.2.2) to achieve spatial-angular feature interaction:

$(F_{A,n}, F_{S,n}) = H_{IG,n}(F_{A,n-1}, F_{S,n-1}), \quad (n = 1, 2, \cdots, N)$,    (2)

where $H_{IG,n}$ denotes the $n$-th Inter-Group and $N$ denotes the total number of Inter-Groups.

Inspired by RDN [48], we cascade all these Inter-Groups to fully use the information interacted at different stages. Specifically, the features generated by each Inter-Group are concatenated and fed to a bottleneck block to fuse the interacted information. The feature generated by the bottleneck block is further added to the initial feature $F_{S,0}$ to achieve global residual learning. The fused feature $F_{S,t}$ can be obtained by

$F_{S,t} = H_B([F_{A,1}, \cdots, F_{A,N}], [F_{S,1}, \cdots, F_{S,N}]) + F_{S,0}$,    (3)

where $H_B$ denotes the bottleneck block and $[\cdot]$ denotes the concatenation operation. Finally, the fused feature $F_{S,t}$ is fed to the reconstruction module, and an HR SAI array $I_{SR} \in \mathbb{R}^{\alpha AH \times \alpha AW}$ can be obtained by

$I_{SR} = H_{1 \times 1}(S_{pix}(S_{lf}(H_S(F_{S,t}))))$,    (4)

where $S_{lf}$, $S_{pix}$, and $H_{1 \times 1}$ represent LF shuffle, pixel shuffle, and $1 \times 1$ convolution, respectively. More details about feature fusion and reconstruction are introduced in Section 3.2.3.
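The data flow of Eqs. (1)-(4) can be summarized in a few lines of pseudo-PyTorch. The sub-modules below (afe, sfe, inter_groups, bottleneck, recon) are assumed to be built elsewhere; the sketch only mirrors the equations:

```python
def lf_internet_forward(mpi_lr, afe, sfe, inter_groups, bottleneck, recon):
    # mpi_lr: [B, 1, A*H, A*W] low-resolution macro-pixel image
    fa = afe(mpi_lr)                   # Eq. (1): initial angular feature F_{A,0}
    fs = sfe(mpi_lr)                   # Eq. (1): initial spatial feature F_{S,0}
    fs0, fa_all, fs_all = fs, [], []
    for group in inter_groups:         # Eq. (2): N cascaded Inter-Groups
        fa, fs = group(fa, fs)
        fa_all.append(fa)
        fs_all.append(fs)
    fs_t = bottleneck(fa_all, fs_all) + fs0   # Eq. (3): fusion + global residual
    return recon(fs_t)                 # Eq. (4): SFE -> LF shuffle -> pixel shuffle -> 1x1 conv
```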

3.2.2 Spatial-Angular Feature Interaction

The basic module for spatial-angular interaction is the interaction block (i.e., Inter-Block). As shown in Fig. 5 (b), the Inter-Block takes a pair of angular and spatial features as inputs to achieve interaction. Specifically, the input angular feature is first upsampled by a factor of $A$. Here, a $1 \times 1$ convolution followed by a pixel shuffle layer is used for upsampling. Then, the upsampled angular feature is concatenated with the input spatial feature and further fed to an SFE to incorporate the spatial and angular information. In this way, the complementary angular information can be used to guide spatial feature extraction. Simultaneously, a new angular feature is extracted from the input spatial feature by an AFE and then concatenated with the input angular feature. The concatenated angular feature is further fed to a $1 \times 1$ convolution to integrate and update the angular information. Note that, the fused angular and spatial features are added to their respective input features to achieve local residual learning. In this paper, we cascade $K$ Inter-Blocks in an Inter-Group, i.e., the output of an Inter-Block forms the input of its subsequent Inter-Block. In summary, the spatial-angular feature interaction can be formulated as

$F_{S,n}^{(k)} = H_S([F_{S,n}^{(k-1)}, (F_{A,n}^{(k-1)})\!\uparrow]) + F_{S,n}^{(k-1)}$,
$F_{A,n}^{(k)} = H_{1 \times 1}([F_{A,n}^{(k-1)}, H_A(F_{S,n}^{(k-1)})]) + F_{A,n}^{(k-1)}$,
$(k = 1, 2, \cdots, K)$,    (5)

where $F_{S,n}^{(k)}$ and $F_{A,n}^{(k)}$ represent the output spatial and angular features of the $k$-th Inter-Block in the $n$-th Inter-Group, respectively, and $\uparrow$ represents the upsampling operation.
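A minimal sketch of one Inter-Block following Eq. (5) is given below; the activation layers shown in Fig. 5 (b) are omitted for brevity, and the module names are ours:

```python
import torch
import torch.nn as nn

class InterBlock(nn.Module):
    def __init__(self, C, A):
        super().__init__()
        # angular-to-spatial upsampling by a factor of A: 1x1 conv + pixel shuffle
        self.up = nn.Sequential(nn.Conv2d(C, C * A * A, 1), nn.PixelShuffle(A))
        self.sfe = nn.Conv2d(2 * C, C, 3, stride=1, dilation=A, padding=A)
        self.afe = nn.Conv2d(C, C, A, stride=A)
        self.fuse = nn.Conv2d(2 * C, C, 1)   # 1x1 conv that updates the angular feature

    def forward(self, fa, fs):
        # fa: [B, C, H, W] angular feature, fs: [B, C, A*H, A*W] spatial feature
        fs_out = self.sfe(torch.cat([fs, self.up(fa)], dim=1)) + fs   # Eq. (5), first line
        fa_out = self.fuse(torch.cat([fa, self.afe(fs)], dim=1)) + fa # Eq. (5), second line
        return fa_out, fs_out
```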

3.2.3 Feature Fusion and Reconstruction

The objective of this stage is to fuse the interacted features to reconstruct an HR SAI array. The fusion and reconstruction stage mainly consists of bottleneck fusion (Fig. 5 (c)), channel extension, LF shuffle (Fig. 5 (d)), pixel shuffle (Fig. 5 (e)), and final reconstruction.


Figure 5. An overview of our LF-InterNet: (a) overall architecture (feature extraction, spatial-angular feature interaction, and feature fusion & reconstruction), (b) Inter-Block, (c) Bottleneck, (d) LF Shuffle, and (e) Pixel Shuffle.


In the bottleneck, the concatenated angular features $[F_{A,1}, \cdots, F_{A,N}] \in \mathbb{R}^{H \times W \times NC}$ are first fed to a $1 \times 1$ convolution and a ReLU layer to generate a feature map $F_A \in \mathbb{R}^{H \times W \times C}$. Then, the squeezed angular feature $F_A$ is upsampled and concatenated with the spatial features. The final fused feature $F_{S,t}$ can be obtained as

$F_{S,t} = H_S([F_{S,1}, \cdots, F_{S,N}, (F_A)\!\uparrow]) + F_{S,0}$.    (6)

After the bottleneck, we apply another SFE layer to extend the channel size of $F_{S,t}$ to $\alpha^2 C$ for pixel shuffle [25]. However, since $F_{S,t}$ is organized in the MPI pattern, we apply LF shuffle to convert $F_{S,t}$ into an SAI array representation for pixel shuffle. To achieve LF shuffle, we first extract pixels with the same angular coordinates in the MPI feature, and then re-organize these pixels according to their spatial coordinates, which can be formulated as

$I_{SAIs}(x, y) = I_{MPI}(\xi, \eta)$,    (7)

where

$x = H(\xi - 1) + \lfloor \xi / A \rfloor (1 - AH) + 1$,
$y = W(\eta - 1) + \lfloor \eta / A \rfloor (1 - AW) + 1$.    (8)

Here, $x = 1, 2, \cdots, AH$ and $y = 1, 2, \cdots, AW$ denote the pixel coordinates in the shuffled SAI arrays, $\xi$ and $\eta$ denote the corresponding coordinates in the input MPI, and $\lfloor \cdot \rfloor$ represents the round-down (floor) operation. The derivation of Eqs. (7) and (8) is presented in the supplemental material.

Finally, a $1 \times 1$ convolution is applied to squeeze the number of feature channels to 1 for HR SAI reconstruction.
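Putting the pieces of this stage together, a sketch of the reconstruction path (channel extension, LF shuffle, pixel shuffle, and the final 1×1 convolution) could look as follows; lf_shuffle is assumed to implement the MPI-to-SAI re-indexing of Eqs. (7)-(8):

```python
import torch.nn as nn

class Reconstruction(nn.Module):
    def __init__(self, C, A, scale, lf_shuffle):
        super().__init__()
        # SFE that extends the channel size to scale^2 * C for pixel shuffle
        self.extend = nn.Conv2d(C, C * scale * scale, 3, stride=1, dilation=A, padding=A)
        self.lf_shuffle = lf_shuffle              # MPI pattern -> SAI-array pattern
        self.pixel_shuffle = nn.PixelShuffle(scale)
        self.final = nn.Conv2d(C, 1, 1)           # squeeze channels to 1 (Y channel)

    def forward(self, fs_t):
        # fs_t: [B, C, A*H, A*W] fused MPI-pattern feature
        x = self.extend(fs_t)                     # [B, scale^2 * C, A*H, A*W]
        x = self.lf_shuffle(x)                    # re-organized into an SAI array
        x = self.pixel_shuffle(x)                 # [B, C, scale*A*H, scale*A*W]
        return self.final(x)                      # HR SAI array
```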

Table 1. Datasets used in our experiments.

Dataset          Type        Training  Test
EPFL [22]        real-world  70        10
HCInew [10]      synthetic   20        4
HCIold [36]      synthetic   10        2
INRIA [17]       real-world  35        5
STFgantry [29]   real-world  9         2
STFlytro [21]    real-world  250       50
Total            —           394       73

4. Experiments

In this section, we first introduce the datasets and our implementation details, then conduct ablation studies to investigate our network. Finally, we compare our LF-InterNet to recent LF image SR and single image SR methods.

4.1. Datasets and Implementation Details

As listed in Table 1, we used 6 public LF datasetsin our experiments. All the LFs in the training and testsets have an angular resolution of 9 × 9. In the trainingstage, we first cropped each SAI into patches with a size of64 × 64, and then used bicubic downsampling with a fac-tor of α (α = 2, 4) to generate LR patches. The generatedLR patches were re-organized into MPI pattern to form theinput of our network. The L1 loss function was used sinceit can generate good results for the SR task and is robust tooutliers [2]. Following the recent works [46, 26], we aug-mented the training data by 8 times using random horizon-tal flipping, vertical flipping, and 90-degree rotation. Notethat, during each data augmentation, all SAIs need to beflipped and rotated along both spatial and angular directionsto maintain their LF structures.
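A sketch of such LF-preserving augmentation (ours, not the released code) for an LF patch stored as a [U, V, H, W] tensor is given below; each flip or rotation is applied to the spatial and angular dimensions jointly, and the same transform must of course be applied to the paired LR and HR patches:

```python
import random
import torch

def augment_lf(lf):
    # lf: [U, V, H, W]; flip/rotate spatial and angular axes together
    if random.random() < 0.5:                     # horizontal flip
        lf = torch.flip(lf, dims=[1, 3])
    if random.random() < 0.5:                     # vertical flip
        lf = torch.flip(lf, dims=[0, 2])
    if random.random() < 0.5:                     # 90-degree rotation
        lf = torch.rot90(lf, k=1, dims=[2, 3])    # rotate each SAI
        lf = torch.rot90(lf, k=1, dims=[0, 1])    # rotate the angular grid accordingly
    return lf
```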

By default, we used the model with $N = 4$, $K = 4$, $C = 64$, and an angular resolution of 5 × 5 for both 2× and 4× SR.


Table 2. Comparative results achieved on the STFlytro dataset [21] by our LF-InterNet with different settings for 4× SR. Note that, the results of bicubic interpolation, VDSR [15], and EDSR [20] are also listed as baselines.

Model                     PSNR   SSIM   Params.
LF-InterNet-onlySpatial   29.75  0.893  1.27M
LF-InterNet-onlyAngular   26.50  0.822  3.58M
LF-InterNet-SAcoupled     31.02  0.918  5.10M
LF-InterNet               31.65  0.925  5.23M
Bicubic                   27.84  0.855  —
VDSR [15]                 29.17  0.880  0.66M
EDSR [20]                 30.29  0.903  1.45M

We also investigated the performance of other variants of our LF-InterNet in Section 4.2. We used PSNR and SSIM as quantitative metrics for performance evaluation. Note that, PSNR and SSIM were calculated separately on the Y channel of each SAI. To obtain the overall metric score for a dataset with $M$ scenes (each with an angular resolution of $A \times A$), we first obtain the score for a scene by averaging its $A^2$ per-view scores, and then obtain the overall score by averaging the scores of all $M$ scenes.
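The aggregation rule can be written compactly as below; psnr_y is an assumed helper that computes PSNR on a single Y-channel SAI, and the same scheme applies to SSIM:

```python
def dataset_score(scenes, psnr_y):
    # scenes: list of (sr_lf, hr_lf) pairs, each LF of shape [A, A, H, W] (Y channel)
    scene_scores = []
    for sr_lf, hr_lf in scenes:
        A = sr_lf.shape[0]
        view_scores = [psnr_y(sr_lf[u, v], hr_lf[u, v])
                       for u in range(A) for v in range(A)]
        scene_scores.append(sum(view_scores) / len(view_scores))  # average over A^2 views
    return sum(scene_scores) / len(scene_scores)                  # average over M scenes
```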

Our LF-InterNet was implemented in PyTorch on a PC with an Nvidia RTX 2080Ti GPU. Our model was initialized using the Xavier method [8] and optimized using the Adam method [16]. The batch size was set to 12, and the learning rate was initially set to 5 × 10⁻⁴ and decreased by a factor of 0.5 every 10 epochs. The training was stopped after 40 epochs and took about 1 day.
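A sketch of this training configuration (values taken from the text; the model and data loader are assumed to be defined elsewhere):

```python
import torch

def train(model, train_loader, epochs=40):
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    # halve the learning rate every 10 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    criterion = torch.nn.L1Loss()
    for _ in range(epochs):
        for lr_mpi, hr_sais in train_loader:      # batch size 12 in our experiments
            optimizer.zero_grad()
            loss = criterion(model(lr_mpi), hr_sais)
            loss.backward()
            optimizer.step()
        scheduler.step()
```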

4.2. Ablation Study

In this subsection, we compare the performance of our LF-InterNet with different architectures and angular resolutions to investigate the potential benefits introduced by different modules.

4.2.1 Network Architecture

Angular information. We investigated the benefit of angular information by removing the angular path in LF-InterNet. That is, we only use SFE for LF image SR. Consequently, the network is identical to a single image SR network, and can only incorporate spatial information within each SAI. As shown in Table 2, using only the spatial information, the network (i.e., LF-InterNet-onlySpatial) achieves a PSNR of 29.75 and an SSIM of 0.893. Both the performance and the number of parameters of LF-InterNet-onlySpatial lie between those of VDSR [15] and EDSR [20].

Spatial information. To investigate the benefit introduced by spatial information, we changed the kernel size of all SFEs from 3 × 3 to 1 × 1. In this case, the spatial information cannot be exploited and integrated by convolutions. As shown in Table 2, the performance of LF-InterNet-onlyAngular is even inferior to that of bicubic interpolation. That is because the neighborhood context in an image is highly significant in recovering details. Consequently, spatial information plays a major role in LF image SR, while angular information can only be used as a complementary part to spatial information but cannot be used alone.

Table 3. Comparative results achieved on the STFlytro dataset [21] by our LF-InterNet with different numbers of interactions for 4× SR.

IG 1  IG 2  IG 3  IG 4  PSNR   SSIM   Params.
–     –     –     –     29.84  0.894  1.48M
✓     –     –     –     31.44  0.922  2.42M
✓     ✓     –     –     31.61  0.924  3.35M
✓     ✓     ✓     –     31.66  0.925  4.28M
✓     ✓     ✓     ✓     31.84  0.927  5.23M

Table 4. Comparative results achieved on the STFlytro dataset [21] by our LF-InterNet with different angular resolutions for 2× and 4× SR.

AngRes  Scale  PSNR   SSIM   Params.
3×3     ×2     37.95  0.980  2.73M
5×5     ×2     38.81  0.983  4.80M
7×7     ×2     39.05  0.984  7.90M
9×9     ×2     39.08  0.985  12.02M
3×3     ×4     31.30  0.918  3.15M
5×5     ×4     31.84  0.927  5.23M
7×7     ×4     32.04  0.931  8.33M
9×9     ×4     32.07  0.933  12.48M


Information decoupling. To investigate the benefit of spatial-angular information decoupling, we stacked all SAIs along the channel dimension as input, and used 3 × 3 convolutions with a stride of 1 to extract both spatial and angular information from these stacked images. Note that, the cascaded framework with global and local residual learning was maintained to keep the overall network architecture unchanged, and the feature depth was set to 128 to keep the number of parameters comparable to that of LF-InterNet. As shown in Table 2, LF-InterNet-SAcoupled is inferior to LF-InterNet. That is, with a comparable number of parameters, LF-InterNet can handle the 4D LF structure and achieve LF image SR in a more efficient way.

Spatial-Angular interaction. We investigated the benefits introduced by our spatial-angular interaction mechanism. Specifically, we canceled feature interaction in each Inter-Group by removing the upsampling and AFE modules in each Inter-Block (see Fig. 5 (b)). In this case, spatial and angular features can only be processed separately. When all interactions are removed, these spatial and angular features can only be incorporated by the bottleneck block. Table 3 presents the results achieved by our LF-InterNet with different numbers of interactions. It can be observed that, without any feature interaction, our network achieves comparable performance to the LF-InterNet-onlySpatial model (29.84 vs 29.75 in PSNR and 0.894 vs 0.893 in SSIM). That is, the angular and spatial information cannot be effectively incorporated by the bottleneck block without interactions.


Table 5. PSNR/SSIM values achieved by different methods for 2× and 4× SR. The best results are in bold faces and the second best results are underlined.

Method          Scale  Params.  EPFL [22]     HCInew [10]   HCIold [36]   INRIA [17]    STFgantry [29]  STFlytro [21]  Average
Bicubic         2×     —        29.50/0.935   31.69/0.934   37.46/0.978   31.10/0.956   30.82/0.947     33.02/0.950    32.27/0.950
VDSR [15]       2×     0.66M    32.01/0.959   34.37/0.956   40.34/0.985   33.80/0.972   35.80/0.980     35.91/0.970    35.37/0.970
EDSR [20]       2×     1.31M    32.86/0.965   35.02/0.961   41.11/0.988   34.61/0.977   37.08/0.985     36.87/0.975    36.26/0.975
RCAN [47]       2×     14.8M    33.46/0.967   35.56/0.963   41.59/0.989   35.18/0.978   38.18/0.988     37.32/0.977    36.88/0.977
LFBM5D [1]      2×     —        31.15/0.955   33.72/0.955   39.62/0.985   32.85/0.969   33.55/0.972     35.01/0.966    34.32/0.967
GBSQ [23]       2×     —        31.22/0.959   35.25/0.969   40.21/0.988   32.76/0.972   35.44/0.983     35.04/0.956    34.99/0.971
LFSSR_4D [41]   2×     3.36M    32.56/0.967   34.47/0.960   41.04/0.989   34.06/0.976   34.08/0.975     36.62/0.976    35.47/0.974
resLF [46]      2×     6.35M    33.22/0.969   35.79/0.969   42.30/0.991   34.86/0.979   36.28/0.985     35.80/0.970    36.38/0.977
LF-InterNet_32  2×     1.20M    34.43/0.975   36.96/0.974   43.99/0.994   36.31/0.983   37.40/0.989     38.47/0.982    37.88/0.983
LF-InterNet_64  2×     4.80M    34.76/0.976   37.20/0.976   44.65/0.995   36.64/0.984   38.48/0.991     38.81/0.983    38.42/0.984
Bicubic         4×     —        25.14/0.831   27.61/0.851   32.42/0.934   26.82/0.886   25.93/0.843     27.84/0.855    27.63/0.867
VDSR [15]       4×     0.66M    26.82/0.869   29.12/0.876   34.01/0.943   28.87/0.914   28.31/0.893     29.17/0.880    29.38/0.896
EDSR [20]       4×     1.48M    27.82/0.892   29.94/0.893   35.53/0.957   29.86/0.931   29.43/0.921     30.29/0.903    30.48/0.916
RCAN [47]       4×     14.9M    28.31/0.899   30.25/0.896   35.89/0.959   30.36/0.936   30.25/0.934     30.66/0.909    30.95/0.922
LFBM5D [1]      4×     —        26.61/0.869   29.13/0.882   34.23/0.951   28.49/0.914   28.30/0.900     29.07/0.881    29.31/0.900
GBSQ [23]       4×     —        26.02/0.863   28.92/0.884   33.74/0.950   27.73/0.909   28.11/0.901     28.37/0.973    28.82/0.913
LFSSR_4D [41]   4×     3.36M    27.39/0.894   29.61/0.893   35.40/0.962   29.26/0.930   28.53/0.908     30.26/0.908    30.08/0.916
resLF¹ [46]     4×     6.79M    27.86/0.899   30.37/0.907   36.12/0.966   29.72/0.936   29.64/0.927     28.94/0.891    30.44/0.921
LF-InterNet_32  4×     1.31M    29.16/0.912   30.74/0.913   36.78/0.970   31.30/0.947   29.92/0.934     31.49/0.923    31.57/0.933
LF-InterNet_64  4×     5.23M    29.52/0.917   31.01/0.917   37.23/0.972   31.65/0.950   30.44/0.941     31.84/0.927    31.95/0.937

Note: ¹ Since the 4×SR model of resLF [46] is unavailable, we cascaded two 2×SR models for 4×SR.

As the number of interactions increases, the performance is steadily improved. This clearly demonstrates the effectiveness of our spatial-angular feature interaction mechanism.

4.2.2 Angular Resolution

We compared the performance of LF-InterNet with different angular resolutions. Specifically, we extracted the central A × A (A = 3, 5, 7, 9) SAIs from the input LFs, and trained different models for both 2× and 4× SR. As shown in Table 4, the PSNR and SSIM values for both 2× and 4× SR improve as the angular resolution increases. That is because additional views provide rich angular information for LF image SR. It is also notable that the improvements tend to saturate as the angular resolution increases from 7 × 7 to 9 × 9 (only a 0.03 dB improvement in PSNR). That is because the complementary information provided by additional views is already sufficient: once the angular information is fully exploited, further increasing the number of views provides only minor performance improvements.
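For reference, cropping the central A × A views from a 9 × 9 LF can be done as in the sketch below (assuming the LF is stored as a [9, 9, H, W] tensor):

```python
def central_views(lf, a):
    # lf: [U, U, H, W] with U >= a; return the centered a x a angular crop
    start = (lf.shape[0] - a) // 2
    return lf[start:start + a, start:start + a]
```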

4.3. Comparison to State-of-the-Art Methods

We compare our method to three milestone single image SR methods (i.e., VDSR [15], EDSR [20], and RCAN [47]) and four state-of-the-art LF image SR methods (i.e., LFBM5D [1], GBSQ [23], LFSSR [41], and resLF [46]). All these methods were implemented using their released codes and pre-trained models. We also present the results of bicubic interpolation as the baseline. For simplicity, we only present results on 5 × 5 LFs for 2× and 4× SR. Since the angular resolution of LFSSR [41] is fixed, we use its original version with 8 × 8 input SAIs.

Quantitative Results. Quantitative results are presented in Table 5. For both 2× and 4× SR, our method (i.e., LF-InterNet_64) achieves the best results on all 6 datasets and surpasses existing methods by a large margin. For example, average PSNR improvements of 2.04 dB and 1.51 dB over the state-of-the-art LF image SR method resLF [46] can be observed for 2× and 4× SR, respectively. It is worth noting that, even when the feature depth of our model is halved to 32, our method (i.e., LF-InterNet_32) still achieves the highest SSIM scores on all 6 datasets and the highest PSNR scores on 5 of the 6 datasets as compared to existing methods. Note that, the numbers of parameters of LF-InterNet_32 are only 1.20M for 2× SR and 1.31M for 4× SR, which are significantly smaller than those of recent deep learning-based SR methods [47, 41, 46].

Qualitative Results. Qualitative results of 2× and 4× SR are shown in Figs. 6 and 7, with more visual comparisons provided in our supplemental material. It can be observed from Fig. 6 that our method preserves textures and details well (e.g., the horizontal stripes in the scene HCInew_origami and the stairway in the scene INRIA_Sculpture) in the super-resolved images. In contrast, although the single image SR method RCAN [47] achieves high PSNR and SSIM scores, the images generated by RCAN [47] are over-smoothed and poor in details. It can be observed from Fig. 7 that the visual superiority of our method is more obvious for 4× SR. Since the input LR images are severely degraded by the downsampling operation, 4× SR is highly ill-posed. Single image SR methods use spatial information only to hallucinate missing details, and they usually generate ambiguous and even fake textures (e.g., the window frame in scene EPFL_Palais generated by RCAN [47]). In contrast, LF image SR methods can use complementary angular information among different views to produce authentic results. However, the results generated by existing LF image SR methods [23, 41, 46] are relatively blurry. As compared to these single image and LF image SR methods, the results produced by our LF-InterNet are much closer to the groundtruth images.



Figure 6. Visual results of 2×SR.


Figure 7. Visual results of 4×SR.


Performance w.r.t. Perspectives. Since our LF-InterNet can super-resolve all SAIs in an LF, we further investigate the reconstruction quality with respect to different perspectives. We used the central 7 × 7 views of scene HCIold_MonasRoom [36] as input to perform both 2× and 4× SR. The PSNR and SSIM values are calculated for each perspective and are visualized in Fig. 8. Since resLF [46] uses only part of the views to super-resolve different perspectives, the reconstruction qualities of resLF [46] for non-central views are relatively low. In contrast, our LF-InterNet jointly uses the angular information from all input views to super-resolve each perspective, and thus achieves much higher reconstruction qualities with a more balanced distribution among different perspectives.


Figure 8. Visualizations of PSNR and SSIM values achieved by resLF [46] and LF-InterNet on each perspective of scene HCIold_MonasRoom [36]. Here, 7 × 7 input views are used to perform both 2× and 4× SR. Our LF-InterNet achieves high reconstruction qualities with a balanced distribution among different SAIs.



5. Conclusion

In this paper, we proposed a deep convolutional network, LF-InterNet, for LF image SR. We first introduced an approach to extract and decouple spatial and angular features, and then designed a feature interaction mechanism to incorporate spatial and angular information. Experimental results have clearly demonstrated the superiority of our method. Our LF-InterNet outperforms state-of-the-art SR methods by a large margin in terms of PSNR and SSIM, and can recover rich details in the reconstructed images.

6. Acknowledgement

This work was partially supported by the National Natural Science Foundation of China (Nos. 61972435 and 61602499), the Natural Science Foundation of Guangdong Province, and the Fundamental Research Funds for the Central Universities (No. 18lgzd06).

References

[1] Martin Alain and Aljosa Smolic. Light field super-resolution via LFBM5D sparse coding. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 2501–2505. IEEE, 2018.
[2] Yildiray Anagun, Sahin Isik, and Erol Seke. SRLibrary: Comparing different loss functions for super-resolution over various convolutional architectures. Journal of Visual Communication and Image Representation, 61:178–187, 2019.
[3] Saeed Anwar, Salman Khan, and Nick Barnes. A deep journey into super-resolution: A survey. arXiv preprint arXiv:1904.07523, 2019.
[4] Tom E Bishop and Paolo Favaro. The light field camera: Extended depth of field, aliasing, and superresolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):972–986, 2011.
[5] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision (ECCV), pages 184–199. Springer, 2014.
[6] Karen Egiazarian and Vladimir Katkovnik. Single image super-resolution via BM3D sparse coding. In European Signal Processing Conference (EUSIPCO), pages 2849–2853. IEEE, 2015.
[7] Reuben A Farrugia, Christian Galea, and Christine Guillemot. Super resolution of light field images using linear subspace projection of patch-volumes. IEEE Journal of Selected Topics in Signal Processing, 11(7):1058–1071, 2017.
[8] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[10] Katrin Honauer, Ole Johannsen, Daniel Kondermann, and Bastian Goldluecke. A dataset and evaluation methodology for depth estimation on 4D light fields. In Asian Conference on Computer Vision (ACCV), pages 19–34. Springer, 2016.
[11] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7132–7141, 2018.
[12] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–4708, 2017.
[13] Yan Huang, Wei Wang, and Liang Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in Neural Information Processing Systems (NeurIPS), pages 235–243, 2015.
[14] Yang Jianchao, Wright John, Huang Thomas, and Ma Yi. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
[15] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1646–1654, 2016.
[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[17] Mikael Le Pendu, Xiaoran Jiang, and Christine Guillemot. Light field inpainting propagation via low rank matrix completion. IEEE Transactions on Image Processing, 27(4):1981–1993, 2018.
[18] Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 31–42. ACM, 1996.
[19] Nianyi Li, Jinwei Ye, Yu Ji, Haibin Ling, and Jingyi Yu. Saliency detection on light field. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2806–2813, 2014.
[20] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 136–144, 2017.
[21] Abhilash Sunder Raj, Michael Lowney, Raj Shah, and Gordon Wetzstein. Stanford Lytro light field archive, 2016.
[22] Martin Rerabek and Touradj Ebrahimi. New light field image dataset. In International Conference on Quality of Multimedia Experience (QoMEX), 2016.
[23] Mattia Rossi and Pascal Frossard. Geometry-consistent light field super-resolution via graph-based regularization. IEEE Transactions on Image Processing, 27(9):4207–4218, 2018.
[24] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
[25] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1874–1883, 2016.
[26] Changha Shin, Hae-Gon Jeon, Youngjin Yoon, In So Kweon, and Seon Joo Kim. EPINET: A fully-convolutional neural network using epipolar geometry for depth from light field images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4748–4757, 2018.
[27] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang. NTIRE 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 114–125, 2017.
[28] Radu Timofte, Vincent De Smet, and Luc Van Gool. Anchored neighborhood regression for fast example-based super-resolution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1920–1927, 2013.
[29] Vaibhav Vaish and Andrew Adams. The (new) Stanford light field archive. Computer Graphics Laboratory, Stanford University, 6(7), 2008.
[30] Kartik Venkataraman, Dan Lelescu, Jacques Duparre, Andrew McMahon, Gabriel Molina, Priyam Chatterjee, Robert Mullis, and Shree Nayar. PiCam: An ultra-thin high performance monolithic camera array. ACM Transactions on Graphics, 32(6):166, 2013.
[31] Yunlong Wang, Fei Liu, Kunbo Zhang, Guangqi Hou, Zhenan Sun, and Tieniu Tan. LFNet: A novel bidirectional recurrent convolutional neural network for light-field image super-resolution. IEEE Transactions on Image Processing, 27(9):4274–4286, 2018.
[32] Yingqian Wang, Tianhao Wu, Jungang Yang, Longguang Wang, Wei An, and Yulan Guo. DeOccNet: Learning to see through foreground occlusions in light fields. In Winter Conference on Applications of Computer Vision (WACV), 2020. arXiv:1912.04459.
[33] Yingqian Wang, Jungang Yang, Yulan Guo, Chao Xiao, and Wei An. Selective light field refocusing for camera arrays using bokeh rendering and superresolution. IEEE Signal Processing Letters, 26(1):204–208, 2018.
[34] Zhihao Wang, Jian Chen, and Steven CH Hoi. Deep learning for image super-resolution: A survey. arXiv preprint arXiv:1902.06068, 2019.
[35] Sven Wanner and Bastian Goldluecke. Variational light field analysis for disparity estimation and super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):606–619, 2013.
[36] Sven Wanner, Stephan Meister, and Bastian Goldluecke. Datasets and benchmarks for densely sampled 4D light fields. In Vision, Modelling and Visualization (VMV), volume 13, pages 225–226. Citeseer, 2013.
[37] Bennett Wilburn, Neel Joshi, Vaibhav Vaish, Eino-Ville Talvala, Emilio Antunez, Adam Barth, Andrew Adams, Mark Horowitz, and Marc Levoy. High performance imaging using large camera arrays. In ACM Transactions on Graphics, volume 24, pages 765–776. ACM, 2005.
[38] Gaochang Wu, Yebin Liu, Qionghai Dai, and Tianyou Chai. Learning sheared EPI structure for light field reconstruction. IEEE Transactions on Image Processing, 28(7):3261–3273, 2019.
[39] Gaochang Wu, Mandan Zhao, Liangyong Wang, Qionghai Dai, Tianyou Chai, and Yebin Liu. Light field reconstruction using deep convolutional network on EPI. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6319–6327, 2017.
[40] Wenming Yang, Xuechen Zhang, Yapeng Tian, Wei Wang, Jing-Hao Xue, and Qingmin Liao. Deep learning for single image super-resolution: A brief review. IEEE Transactions on Multimedia, 2019.
[41] Henry Wing Fung Yeung, Junhui Hou, Xiaoming Chen, Jie Chen, Zhibo Chen, and Yuk Ying Chung. Light field spatial super-resolution using deep efficient spatial-angular separable convolution. IEEE Transactions on Image Processing, 28(5):2319–2330, 2018.
[42] Youngjin Yoon, Hae-Gon Jeon, Donggeun Yoo, Joon-Young Lee, and In So Kweon. Light-field image super-resolution using convolutional neural network. IEEE Signal Processing Letters, 24(6):848–852, 2017.
[43] Youngjin Yoon, Hae-Gon Jeon, Donggeun Yoo, Joon-Young Lee, and In So Kweon. Learning a deep convolutional network for light-field image super-resolution. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), pages 24–32, 2015.
[44] Yan Yuan, Ziqi Cao, and Lijuan Su. Light-field image super-resolution using a combined deep CNN based on EPI. IEEE Signal Processing Letters, 25(9):1359–1363, 2018.
[45] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.
[46] Shuo Zhang, Youfang Lin, and Hao Sheng. Residual networks for light field image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11046–11055, 2019.
[47] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018.
[48] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2472–2481, 2018.
[49] Hao Zhu, Mantang Guo, Hongdong Li, Qing Wang, and Antonio Robles-Kelly. Revisiting spatio-angular trade-off in light field cameras and extended applications in super-resolution. IEEE Transactions on Visualization and Computer Graphics, 2019.


Spatial-Angular Interaction for Light Field Image Super-Resolution

Supplemental Material

Section A presents details of light field (LF) shuffle. Section B provides additional visual comparisons.

A. Light Field Shuffle

We use the notations in Table 6 for formulation. As shown in Fig. 9, an LF $L \in \mathbb{R}^{U \times V \times H \times W}$ can be organized into a macro-pixel image $I_{MPI} \in \mathbb{R}^{UH \times VW}$ or an array of sub-aperture images $I_{SAIs} \in \mathbb{R}^{UH \times VW}$. Consequently, LF shuffle is defined as the transformation between these two representations. To convert LFs from one representation to the other, the one-to-one mapping function between the MPI and the SAIs needs to be built. Without loss of generality, we take the spatial-to-angular shuffle as an example, namely, to find the point $(\xi, \eta) \in I_{MPI}$ corresponding to a known point $(x, y) \in I_{SAIs}$. We first calculate the angular coordinates $u$ and $v$ of point $(x, y)$ according to

$u = \lceil x/H \rceil = \lfloor x/H \rfloor + 1$,
$v = \lceil y/W \rceil = \lfloor y/W \rfloor + 1$.    (9)

Using the angular coordinates, the spatial coordinates $h$ and $w$ can be derived by

$h = x - (u-1) \cdot H = x - \lfloor x/H \rfloor \cdot H$,
$w = y - (v-1) \cdot W = y - \lfloor y/W \rfloor \cdot W$.    (10)

Since $I_{SAIs}$ and $I_{MPI}$ represent the same LF, $(x, y)$ and $(\xi, \eta)$ in these two representations have the same spatial and angular coordinates. Therefore, we find $(\xi, \eta)$ corresponding to $(u, v, h, w)$ as follows:

$\xi = U \cdot (h-1) + u = U \cdot (x - \lfloor x/H \rfloor \cdot H - 1) + \lfloor x/H \rfloor + 1 = U \cdot (x-1) + \lfloor x/H \rfloor \cdot (1 - U \cdot H) + 1$,    (11)

$\eta = V \cdot (w-1) + v = V \cdot (y - \lfloor y/W \rfloor \cdot W - 1) + \lfloor y/W \rfloor + 1 = V \cdot (y-1) + \lfloor y/W \rfloor \cdot (1 - V \cdot W) + 1$.    (12)

The angular-to-spatial shuffle can be derived following a similar approach. That is,

$x = H \cdot (u-1) + h = H \cdot (\xi-1) + \lfloor \xi/U \rfloor \cdot (1 - U \cdot H) + 1$,    (13)

$y = W \cdot (v-1) + w = W \cdot (\eta-1) + \lfloor \eta/V \rfloor \cdot (1 - V \cdot W) + 1$.    (14)


Figure 9. An illustration of LF shuffle. Since the SAIs and the MPI denote the same LF, the objective of LF shuffle is to re-organize LFs between these two representations.

Table 6. Notations used in this supplemental material.

Notation                                              Representation
$L \in \mathbb{R}^{U \times V \times H \times W}$     a 4D LF
$I_{SAIs} \in \mathbb{R}^{UH \times VW}$              a 2D SAI array
$I_{MPI} \in \mathbb{R}^{UH \times VW}$               a 2D MPI
$U, V \in \mathbb{Z}^{+}$                             angular size
$H, W \in \mathbb{Z}^{+}$                             spatial size
$u, v \in \mathbb{Z}^{+}$                             angular coordinates
$h, w \in \mathbb{Z}^{+}$                             spatial coordinates
$(x, y) \in \mathbb{Z}^{2}_{+}$                       coordinates in $I_{SAIs}$
$(\xi, \eta) \in \mathbb{Z}^{2}_{+}$                  coordinates in $I_{MPI}$
$\lfloor \cdot \rfloor$                               round-down (floor) operation


Note that, Eq. (8) in the main body of our manuscript can be derived from the above equations (i.e., Eqs. (13)-(14)) by assigning $A$ to $U$ and $V$.
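In practice, the angular-to-spatial shuffle of Eqs. (13)-(14) reduces to an index re-grouping and can be implemented without explicit loops. The sketch below (0-based indexing, $U = V = A$) is our reading of it and could serve as the $S_{lf}$ operator used in Eq. (4):

```python
import torch

def lf_shuffle(mpi_feat, A):
    # mpi_feat: [B, C, A*H, A*W] feature in the MPI pattern
    B, C, AH, AW = mpi_feat.shape
    H, W = AH // A, AW // A
    x = mpi_feat.view(B, C, H, A, W, A)        # rows split into (h, u), cols into (w, v)
    x = x.permute(0, 1, 3, 2, 5, 4)            # -> [B, C, u, h, v, w]
    return x.reshape(B, C, A * H, A * W)       # SAI-array pattern: rows (u, h), cols (v, w)
```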

B. Additional Visual Comparisons

Additional visual comparisons for 2× and 4× SR are shown in Figs. 10 and 11, respectively.


Figure 10. Additional visual results for 2×SR on scenes HCIold_buddha, EPFL_luxembourg, HCInew_bicycle, INRIA_messydesk, STFgantry_cards, STFlytro_cars_52, STFlytro_general_23, and STFlytro_people_3.


Figure 11. Additional visual results for 4×SR on scenes HCInew_bedroom, INRIA_hublais, STFgantry_cards, STFlytro_cars_40, STFlytro_flowers_29, HCIold_mona, STFlytro_reflective_21, and EPFL_friends_1.