Attentive Deep Stitching and Quality Assessment for 360° Omnidirectional Images
Jia Li, Senior Member, IEEE, Yifan Zhao, Weihua Ye, Kaiwen Yu, and Shiming Ge, Senior Member, IEEE
Abstract—360° omnidirectional images are very helpful in creating immersive multimedia contents, which creates a huge demand for their efficient generation and effective assessment. In this paper, we leverage an attentive idea to meet this demand by addressing two concerns: how to generate a good omnidirectional image in a fast and robust way, and what makes a good omnidirectional image for humans. To this end, we propose an attentive deep stitching approach to facilitate the efficient generation of omnidirectional images, which is composed of two modules. The low-resolution deformation module aims to learn the deformation rules from dual-fisheye to omnidirectional images with joint implicit and explicit attention mechanisms, while the high-resolution recurrence module enhances the resolution of the stitching results with high-resolution guidance in a recurrent manner. In this way, the stitching approach can efficiently generate high-resolution omnidirectional images that are highly consistent with human immersive experiences. Beyond the efficient generation, we further present an attention-driven omnidirectional image quality assessment (IQA) method which uses joint evaluation with both global and local metrics. In particular, the local metrics mainly focus on the stitching region and the attention region that most affect the Mean Opinion Score (MOS), leading to an evaluation consistent with human perception. To verify the effectiveness of our proposed assessment and stitching approaches, we construct a hybrid benchmark evaluation with 7 stitching models and 8 IQA metrics. Qualitative and quantitative experiments show that our stitching approach generates preferable results compared with the state-of-the-art models at a 6× faster speed, and that the proposed quality assessment approach surpasses other methods by a large margin and is highly consistent with human subjective evaluations.
Index Terms—360° omnidirectional image, image quality assessment (IQA), attentive deep stitching.
Manuscript received April 21, 2019; revised September 3, 2019; accepted November 8, 2019. This work was supported in part by the National Natural Science Foundation of China under Grant 61922006, Grant 61532003, and Grant 61772513, and in part by the Beijing Nova Program Z181100006218063. The work of S. Ge was supported by the Youth Innovation Promotion Association, Chinese Academy of Sciences. The guest editor coordinating the review of this manuscript and approving it for publication was Prof. Patrick Le Callet. (Corresponding author: Jia Li.)
J. Li is with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing 100191, China, with the Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing 100191, China, and also with the Peng Cheng Laboratory, Shenzhen 518000, China (e-mail: [email protected]).
Y. Zhao, W. Ye, and K. Yu are with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing 100191, China (e-mail: [email protected]; [email protected]; [email protected]).
S. Ge is with the Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100095, China (e-mail: [email protected]).
I. INTRODUCTION

IN THE rapid development of virtual reality (VR) over the last decade, high-quality 360° omnidirectional images have played an increasingly important role in producing multimedia contents, which require natural immersion in real-world scenarios through head-mounted displays. Along with the rapid spread of omnidirectional acquisition devices, there is a huge demand for efficient omnidirectional image generation and accurate quality assessment, which can further be of great use in biology [1], [2], medical imaging [3], modeling [4] and virtual reality [5]. To obtain high-quality omnidirectional images, tens of models have been proposed to stitch dual-fisheye images into 360° omnidirectional images. With the large number of stitched images and stitching models, two important concerns arise: how to generate a good omnidirectional image in a fast and robust way, and what makes a good omnidirectional image for humans?
Regarding the first concern, there are two main categories of automatic stitching methods for generating omnidirectional images, as opposed to manual calibration methods [4], [6]: direct stitching and feature-based stitching. Direct stitching approaches [1], [7], [8] have the advantage of making full use of the available image data and hence can provide accurate registration, but they require a close initialization. In contrast, feature-based stitching methods [3], [9]–[11] do not require a complicated initialization procedure. They usually detect invariant local features automatically and construct matching correspondences instead of relying on manual registration. Classical feature detectors [12], [13] usually perform well on conventional planar natural images but may lack invariance when handling dual-fisheye images, which can cause distortions or shape breakages in stitching regions.
Regarding the second concern, many image quality assessment (IQA) methods [14]–[18] have been proposed. Early IQA research mainly focused on common daily images with photometric quality indexes such as MSE [19], PSNR [20] and SSIM [21]. With the development of Convolutional Neural Networks (CNNs), several representative models [16], [22], [23] built on deep features have been proposed. However, these models usually focus on photometric quality indexes such as blurring, noise and color distortion, which may not be suitable for omnidirectional images. Moreover, only a few works [24], [25] address the quality assessment of panoramic images. For example, Yang et al. [17] proposed a light-weight model to evaluate stitched panoramic images based on ghosting and
Fig. 1. Dual-fisheye image and its stitched 360° panoramic image with human attention (white) and stitching region focusing (blue). The immersive experience is mainly affected by the human attention mechanism, and distortions are most likely to occur in specific stitching regions.
structure inconsistency. These metrics are designed for normal 2-D planar image stitching, so they cannot handle 360° omnidirectional images, which are generated from dual-fisheye images and suffer large distortion and information loss in the stitching areas.
In summary, these general stitching and assessment approaches usually treat every pixel of the stitched images equally. However, immersive experiences of omnidirectional images are affected by two attentive cues: the attention region and the stitching region. As shown in Fig. 1, the attention region is where human gaze mainly focuses, while the stitching region in the middle is where distortions or shape breakages are most likely to occur. To this end, we address the efficient generation and effective assessment of 360° omnidirectional images with two human perception-driven approaches, namely attentive deep stitching (ADS) and attentive quality assessment (AQA).
ADS adopts a progressive manner for the efficient generation of 360° omnidirectional images and is composed of two main modules equipped with implicit-attention and explicit-attention mechanisms, respectively. The first, a low-resolution deformation module, learns the deformation features from the dual-fisheye image with multiple implicit-attention blocks. By combining the learned deformation features and the high-resolution dual-fisheye image, the second, a high-resolution recurrence module, resolves the deformation relationship under high-resolution pixel guidance. With this recurrent refinement scheme, a high-resolution omnidirectional image is obtained. At the end of these two modules, an explicit attention map of human gaze is introduced to regularize the consistency between stitching results and human subjective experience.
AQA is a novel full-reference quality assessment approach designed to evaluate the quality of stitched 360° omnidirectional images in accordance with human perception. Based on the accurate cross-reference omnidirectional image dataset [26], we propose a joint approach combining local and global metrics, where the global metrics mainly consider environmental differences such as color chromatism and the blind zone phenomenon. For the local metrics, we develop an attentive sampling strategy that focuses on the attention region and the stitching region, the two regions that most affect the stitching quality in terms of attentive frequency and stitching distortion, respectively. We adopt sparse reconstruction and appearance difference to represent the local metrics and finally use a linear learning procedure to match human subjective evaluations.
The contributions of this paper can be summarized as follows: 1) We propose a novel attentive deep stitching approach to facilitate the generation of high-resolution 360° omnidirectional images from dual-fisheye images in an end-to-end deep manner, which runs at a 6× faster speed than the state-of-the-art methods. 2) We propose an attentive quality assessment approach to automatically assess the stitching quality of 360° omnidirectional images, which provides an evaluation more consistent with human perception. 3) Qualitative and quantitative experiments are conducted to demonstrate the effectiveness of the proposed stitching approach and to show that the proposed quality assessment approach is highly consistent with human subjective evaluation.
The rest of this paper is organized as follows: Section II reviews related works and Section III proposes the attentive deep stitching method. In Section IV, the attentive quality assessment approach for omnidirectional stitching is proposed. We further conduct quantitative and qualitative experiments in Section V and finally conclude this paper in Section VI.
II. LITERATURE REVIEW

A. Omnidirectional Image Stitching
Classical stitching models: Image stitching has become a research hotspot with wide applications [7], [9], [27]. Classical stitching models, such as Stereoscopic Vision Projection (SVP) [28], Isometric Projection (IP) [29] and Equidistant Projection (EP) [30], have been widely used to generate 360° omnidirectional images in an automatic way. Most modern digital cameras, including many mobile devices, now offer a panoramic mode.
Classical image stitching methods can be roughly divided into two categories: camera-calibration-based image stitching and keypoint-based image stitching. Recently, several works [31]–[33] have made progress in improving the traditional image stitching algorithm [11]. Charles et al. [34] address the problem that using a single registration often leads to errors, especially in scenes with significant depth variation or object motion. With the portability and low cost of dual-fisheye cameras, research [35], [36] on fisheye image stitching has become more and more practical. Lo et al. [37] stitched dual-fisheye images into a 360° panoramic image following four basic steps: right-angle transformation, feature extraction, mesh deformation and blending.
Deep convolutional models: At present, most convolutional neural networks keep the image prototype and only extract specific information. With the high demand on daily images, traditional pixel-drift problems, such as image stitching and fisheye image distortion correction [35], [38], also need to be studied further. In recent years, a few researchers have attempted to solve the pixel drift issue with convolutional networks. Yin et al. [39] proposed an end-to-end multi-context collaborative deep network for removing distortions from single fisheye images, which learns high-level semantics and low-level appearance features simultaneously to estimate the distortion parameters. Deng et al. [40] proposed restricted deformable convolution for semantic segmentation, which can effectively model geometric transformations by learning the shapes of convolutional filters.
B. Image Quality Assessment
Many IQA methods [41] have been proposed in the past decades, which can be roughly grouped into three categories. Some pioneering works on IQA [15], [22], [42] focus on both traditional IQA and common panoramic stitching IQA. In this paper, we mainly focus on the quality assessment of omnidirectional images, which is a less-explored task with increasing demands.
Classical IQA metrics: Most recent IQA research has focused on no-reference image quality assessment (NR-IQA) [43]–[48] and full-reference image quality assessment (FR-IQA) [18], [19], [49]–[51]. NR-IQA does not need a specific reference image, which is convenient for various image assessment tasks. In FR-IQA, the assessed quality is compared against a reference image, for instance with MSE [19], PSNR [20] and SSIM [52]. Full-reference assessment methods can also be adopted for immersive stitching IQA to some extent. However, the assessment results may not be accurate and are sometimes even opposite to human visual judgements.
Learnable IQA metrics: Due to the rapid development of deep learning in recent years, many existing problems can achieve better results with deep learning approaches. Therefore, many researchers use deep learning models to evaluate daily images in the field of image quality evaluation. For these deep learning models [15], [23], the biggest problem is that there is no suitable large dataset for training. Kang et al. [23] addressed the lack of datasets suitable for deep learning and proposed training on 32 × 32 patches. On the one hand, this method increases the amount of data through simplified image processing; on the other hand, the discontinuity of the main structure in the image may lead to inaccuracy. Liu et al. [16] trained a Siamese network to rank images and learned the relationship between images by sharing the network weights. Some researchers adopted convolutional sparse coding to locate specific distortions [53]–[55] and designed kernels to quantify the mixed effects of multiple distortion types in local regions.
Stitching IQA metrics: There is little research on the quality assessment of stitched images. For example, Yang et al. [17] addressed the problems of ghosting and structural discontinuity in image stitching by using a perceptual geometric error metric and a local structure-guided metric; however, for immersive images this evaluation method is not comprehensive enough to detect global color differences and blind zone conditions. Huang et al. [56] proposed a quality evaluation of immersive images that mainly focuses on resolution and compression rather than on the quality of stitching. In [53], the authors adopted convolutional sparse coding and compound feature selection focusing on the stitching region for stitched image assessment. Moreover, some subjective omnidirectional video quality assessment methods [57], [58] have been proposed for this less-explored task.
III. ATTENTIVE DEEP STITCHING

A. Overview
The classical omnidirectional stitching problem is a transformation of an optical refraction operation: the pixels of dual-fisheye images are transferred to equirectangular coordinates through a two-dimensional plane stretching. Commonly, transforming a dual-fisheye image $\mathbf{x}$ to an omnidirectional image $\mathbf{y}$ can be formally represented as:

$$\mathbf{y} = \mathcal{T}(\mathbf{x}; \theta), \tag{1}$$

where $\mathcal{T}: \mathbb{R}^{W\times H} \rightarrow \mathbb{R}^{W\times H}$ is the transformation function and $\theta$ denotes the parameters of the stitching model. However, the transformation relationship varies significantly across different images and cameras.
Instead of manual calibration or automatic registration, we advocate using deep convolutional neural networks to learn this transformation rather than hand-crafted designs. To this end, we propose an attentive deep stitching approach which efficiently solves the transformation $\mathcal{T}$ in two phases:

$$\mathbf{f} = \mathcal{F}(\bar{\mathbf{x}}; \theta_F), \quad \theta_F \subset \theta_L, \tag{2}$$

$$\mathbf{y} = \mathcal{H}(\mathbf{f}, \mathbf{x}; \theta_H), \tag{3}$$

where $\mathcal{F}$ is the low-resolution deformation module and $\mathcal{H}$ is the high-resolution recurrence module. $\bar{\mathbf{x}}$ is the down-sampled input of the dual-fisheye image $\mathbf{x}$, $\mathbf{f}$ are the learnable deformation features of $\bar{\mathbf{x}}$, and $\theta_{\{L,H\}}$ are the learnable parameters of the low-resolution deformation and high-resolution recurrence phases, respectively. $\theta_F$, which is a part of $\theta_L$, is used to extract the deformation features.
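To make this two-phase decomposition concrete, the following is a minimal PyTorch-style sketch of how Eqns. (2) and (3) could be chained at inference time. The argument names, the bilinear down-sampling and the down-sampling factor are illustrative assumptions rather than the exact implementation described above.

```python
import torch.nn.functional as F_nn

def stitch(x_hires, deformation_net, recurrence_net, down_factor=4):
    """Two-phase stitching sketch following Eqns. (2)-(3).

    x_hires:         high-resolution dual-fisheye tensor, shape (B, 3, H, W)
    deformation_net: F(.; theta_F), run on the down-sampled copy (Eqn. 2)
    recurrence_net:  H(.; theta_H), fuses deformation features with the
                     full-resolution input (Eqn. 3)
    """
    # Low-resolution copy of the input used by the deformation module
    x_low = F_nn.interpolate(x_hires, scale_factor=1.0 / down_factor,
                             mode='bilinear', align_corners=False)
    f = deformation_net(x_low)        # deformation features f = F(x_bar)
    y = recurrence_net(f, x_hires)    # omnidirectional output y = H(f, x)
    return y
```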
B. Low-Resolution Deformation
As shown in Fig. 2, the low-resolution deformation module aims to decode the transformation information $\mathcal{F}(\cdot)$ in a learning procedure. As verified by many representative studies on panoramic attention, human gaze [59], [60] usually focuses on special regions which contain the most attractive information. Keeping these cues in mind, we develop an attention-based deformation learning process, which is jointly optimized with implicit-attention and explicit-attention mechanisms.

Considering the computation cost and stitching efficiency, we resort to the low-resolution image $\bar{\mathbf{x}}$ to learn the deformation information. Inspired by the successful U-Net [61] architecture, we develop a light-weight deformation module which
Fig. 2. Framework of the proposed Attentive Deep Stitching (ADS). Our framework is mainly composed of two modules. The first, a deformation module, learns the transformation rules from dual-fisheye to omnidirectional images with joint human-supervised explicit attention and implicit attention mechanisms. The second, a recurrence module, utilizes the high-resolution fisheye image as guidance to refine the stitching results in a recurrent manner.
Fig. 3. Implicit-attention block. σ: element summation in the channel dimension followed by a softmax operation. ⊗: scalar product. ⊕: element-wise sum.
consists of a contracting path and an expansive path. Based on this design, we add multiple implicit-attention blocks as transition layers to pass the low-level attention to the high-level decoders, which attaches more importance to the attention region and can further reduce the gradient loss.
Fig. 4. Illustration of local attentive sampling. First row: Gaussian-based stitching region sampling. Second row: attention-based region sampling.

The detailed architecture of the implicit-attention block is shown in Fig. 3, where σ denotes the channel-wise summation and softmax operation. With the input feature $\mathbf{P} \in \mathbb{R}^{W\times H\times C}$, this block can be formally represented as:

$$\mathbf{V} = \tanh(\mathbf{w}_v \mathbf{P} + \mathbf{b}_v), \quad \mathbf{K} = \tanh(\mathbf{w}_k \mathbf{P} + \mathbf{b}_k), \tag{4}$$

where $\mathbf{w}_k, \mathbf{b}_k, \mathbf{w}_v, \mathbf{b}_v$ are the parameters and $\odot$ is the scalar-product operation. After obtaining these transformed features, the final output $S(\mathbf{P})$ with the implicit attention is formulated as:

$$S(\mathbf{P}_{i,j,k}) = \frac{e^{M_{i,j}}}{\sum_{i,j} e^{M_{i,j}}} \odot \mathbf{Q}_{i,j,k} + \mathbf{P}_{i,j,k}, \quad M_{i,j} = \sum_{k} \mathbf{K}_{i,j,k}. \tag{5}$$
After that, the outputs $S(\mathbf{P})$ are passed from these implicit-attention blocks to the corresponding decoder features through a concatenation operation.
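The following PyTorch-style sketch shows one possible realization of an implicit-attention block following Eqns. (4) and (5). The 1 × 1 convolutions for the value and key transforms are an assumption, and since the text does not define Q explicitly, the value map V is reused in its place.

```python
import torch
import torch.nn as nn

class ImplicitAttentionBlock(nn.Module):
    """Sketch of the implicit-attention block of Eqns. (4)-(5)."""

    def __init__(self, channels):
        super().__init__()
        self.value = nn.Conv2d(channels, channels, kernel_size=1)  # w_v, b_v
        self.key = nn.Conv2d(channels, channels, kernel_size=1)    # w_k, b_k

    def forward(self, p):                      # p: (B, C, H, W)
        v = torch.tanh(self.value(p))          # V = tanh(w_v P + b_v)
        k = torch.tanh(self.key(p))            # K = tanh(w_k P + b_k)
        m = k.sum(dim=1, keepdim=True)         # M_ij = sum_k K_ijk
        b, _, h, w = m.shape
        # spatial softmax over all (i, j) positions
        attn = torch.softmax(m.view(b, 1, -1), dim=-1).view(b, 1, h, w)
        return attn * v + p                    # S(P) = softmax(M) ⊙ V + P
```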
In addition to the implicit-attention mechanism, explicit attention is usually generated by human gaze estimation, which provides reliable regional attention information. To this end, we resort to the state-of-the-art SALICON [62] approach to estimate human gaze. We train the gaze model on large daily-image datasets and fine-tune the network with omnidirectional image annotations.
With the reliable generation of the explicit attention model, we develop a region-sensitive MSE loss $\mathcal{L}_L$ to attach more importance to attention regions:

$$\mathcal{L}_L(\theta_L) = \frac{1}{2N} \sum_{i=1}^{N} W_i \left\| \mathcal{F}(\bar{x}_i; \theta_L) - \bar{y}_i \right\|^2, \quad W_i = \frac{1}{1 + e^{-A(\bar{y}_i)}}, \tag{6}$$

where $N$ is the total number of elements and $\mathcal{F}(\bar{x}_i; \theta_L)$ denotes the low-resolution prediction of the deformation module. More exactly, $\theta_L$ parameterizes the full deformation module up to the final prediction, while $\theta_F$ is used to obtain the transformation features. $W_i$, $\bar{x}_i$ and $\bar{y}_i$ are the $i$th elements of the explicit-attention weight $\mathbf{W}$, the low-resolution dual-fisheye image $\bar{\mathbf{x}}$ and the low-resolution omnidirectional image $\bar{\mathbf{y}}$, respectively. $A(\bar{y}_i)$ is the $i$th attention value of the omnidirectional image $\bar{\mathbf{y}}$, normalized to [0, 1].
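A minimal sketch of the region-sensitive loss of Eqn. (6) is given below, assuming the explicit attention map is already aligned with the prediction; tensor shapes and broadcasting are illustrative.

```python
import torch

def region_sensitive_mse(pred, target, attention):
    """Attention-weighted MSE of Eqn. (6).

    pred, target: low-resolution prediction and ground-truth omnidirectional image
    attention:    explicit attention map A(y) with values normalized to [0, 1]
    """
    weight = 1.0 / (1.0 + torch.exp(-attention))         # W_i = 1 / (1 + e^{-A(y_i)})
    return 0.5 * (weight * (pred - target) ** 2).mean()  # (1 / 2N) sum_i W_i ||.||^2
```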
Regularized with these two attention mechanisms, our deformation module is built with $d = 32$ times down-sampling through chained max-pooling. In this manner, each pixel at the highest level is responsible for learning the transformation from $k^2 \times d^2$ pixels of the original input, where $k$ is the kernel size of the current feature map. Moreover, we employ a chained max-pooling operation in each down-sampling step and a $\tanh(\cdot)$ activation at the end of the network, which emphasize local extrema and largely preserve the transformation information. The $\tanh(\cdot)$ function also accelerates the convergence of our network.
C. High-Resolution Recurrence
After the initial deformation, we develop a progressive high-resolution generation module in a recurrent manner. As shown in Fig. 2, the high-resolution fisheye image passes through a feature encoder (shown in red) to obtain accurate pixel guidance. In parallel, the feature from the second-to-last feature map (the features before the output) is up-sampled as the deformation guidance. The high-resolution pixel guidance and the deformation guidance are then concatenated with 1 × 1 convolutions. Finally, these fused features pass through an hourglass network without down-sampling operations to produce the higher-resolution output.
In this manner, the deformation branch provides the transformation regulation $\mathcal{F}$ and the high-resolution input provides the pixel-level guidance $G$. In the $s$th stage, we concatenate the up-sampled transformation regulation $\mathcal{F}$ and the high-resolution guidance $G$ and decode them with a high-resolution hourglass decoder $\varphi_s(\cdot)$, which is composed of 8 convolutional layers with 3 × 3 kernels. The recurrence in the $s$th stage can be formulated as:

$$\mathcal{H}_s(x_s, x_{s+1}) = \begin{cases} \varphi_s\big(G(x_{s+1}) \circledast \mathcal{F}(x_s; \theta_F)\big), & \text{if } s = 1, \\ \varphi_s\big(G(x_{s+1}) \circledast \mathcal{H}_{s-1}(\cdot)\big), & \text{if } 1 < s \le S, \end{cases} \tag{7}$$

where $x_s$ is the down-sampled input at the $s$th scale and $\circledast$ is the feature concatenation operation with 1 × 1 convolutions. $\mathcal{H}_s(\cdot)$ is the output feature of the $s$th iteration, and $S$ is the maximum stage with the largest input resolution, which is set to 3 in our experiments. The first iteration adopts the deformation feature from our first module, and the following iterations adopt the high-resolution features of the previous stage as guidance. At the end of the third iteration, the width and height of the recurrence output are fixed due to computation limitations. We recommend using this result as the final prediction considering the time efficiency.
Our recurrence network follows the common stage-wise training process; the loss at the $s$th stage can be represented as:

$$\mathcal{L}_H(\theta_H^s) = \frac{1}{2N} \sum_{i=1}^{N} W_i \left\| \mathcal{H}(x_{i,s}, x_{i,s+1}; \theta_H^s) - y_{i,s+1} \right\|^2, \tag{8}$$

where $s = 1, \ldots, S$ denotes the iteration stage. The weight $W_i$ is generated by the explicit attention map in Eqn. (6). With the recurrent iterations from low resolution to high resolution, a finer stitching result is obtained with an end-to-end inference.
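The recurrence of Eqn. (7) can be sketched as the loop below, which is a simplified reading of the module: per-stage encoders provide the pixel guidance G, per-stage hourglass decoders play the role of φ_s, and the 1 × 1 fusion convolutions are assumed to live inside each decoder.

```python
import torch
import torch.nn.functional as F_nn

def recurrent_refine(f, pyramid, encoders, decoders):
    """Sketch of the high-resolution recurrence of Eqn. (7).

    f:        deformation features from the low-resolution module (stage s = 1)
    pyramid:  dual-fisheye inputs [x_2, ..., x_{S+1}], ordered from low to high resolution
    encoders: per-stage pixel-guidance encoders G
    decoders: per-stage hourglass decoders phi_s (no internal down-sampling)
    """
    h = f
    for x_next, enc, dec in zip(pyramid, encoders, decoders):
        g = enc(x_next)                                   # pixel guidance G(x_{s+1})
        h_up = F_nn.interpolate(h, size=g.shape[-2:],     # up-sample previous stage
                                mode='bilinear', align_corners=False)
        h = dec(torch.cat([g, h_up], dim=1))              # phi_s(G(x_{s+1}) ⊛ H_{s-1})
    return h                                              # highest-resolution output
```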
IV. ATTENTIVE QUALITY ASSESSMENT
Given the stitching models, the main concern is how to evaluate these models on the stitched images. In this section, we propose a novel approach to evaluate the less-explored omnidirectional stitching task with joint local and global quality assessment metrics. Both the global and local indexes are full-reference metrics evaluated against the cross-reference ground-truth. The local indexes mainly focus on the attentive region and the stitching seam region, while the global ones mainly focus on the environmental immersion experience.
A. Local Attentive Assessment
The main distortions in omnidirectional images are most likely to occur in regions near the stitching seam, whereas regions far from the stitching seam usually contain fewer distortions. On the other hand, human gaze [63]–[65] usually focuses on regions with special patterns, which drives us not to treat every pixel equally. In particular, the stitched images and the ground-truth reference may exhibit slight misalignments, which makes per-pixel alignment between the stitched image and the ground-truth unsuitable.
To this end, we sample patches instead of performing per-pixel matching in both the stitched images and the ground-truth image. The stitching regions $\Omega_{sti}$ are sampled with a Gaussian sigma-criterion as in Eqn. (9):

$$R_s(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right), \quad x \in \Omega_{sti}, \tag{9}$$

where the region indicator $\mu$ is set to 0.5 times the width of the stitching region. To sample more patches from the stitched images and to eliminate pixel shifting, we set $\sigma$ for the stitching regions and the reference regions to 220 and 350, respectively. The patch size is set to 8 × 8 and patches are sampled with a sliding-window strategy. Moreover, the human attention patches $R_a(x)$ are sampled at the highest-brightness scores of the attention map, which is calculated by the same SALICON [62] model fine-tuned on omnidirectional images. Similarly, the ground-truth patches $D_a(x)$ and $D_s(x)$ are obtained. With these attentively sampled patches, two local-region metrics are further proposed.
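The Gaussian sigma-criterion of Eqn. (9) can be read as a sampling weight over patch locations around the stitching seam. The sketch below draws patch centres with probability proportional to that Gaussian; the exact sliding-window procedure and the seam parameterization are assumptions.

```python
import numpy as np

def sample_stitching_patches(image, seam_x, sigma=220.0, patch=8,
                             n_patches=200, rng=None):
    """Gaussian-weighted patch sampling around the stitching seam (Eqn. 9).

    sigma is 220 for the stitched image and 350 for the reference, as in the text.
    """
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    xs = np.arange(w)
    weights = np.exp(-((xs - seam_x) ** 2) / (2.0 * sigma ** 2))  # R_s(x)
    weights /= weights.sum()
    half, patches = patch // 2, []
    for _ in range(n_patches):
        cx = int(np.clip(rng.choice(w, p=weights), half, w - half - 1))
        cy = int(rng.integers(half, h - half))
        patches.append(image[cy - half: cy + half, cx - half: cx + half])
    return np.stack(patches)
```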
1) Sparse reconstruction: To robustly measure the region similarity at various levels of detail, we propose to adopt the sparse reconstruction error as a local metric. The foundation of this metric is that similar patches can represent each other with a minimal length of sparse code. To this end, an over-complete sparse dictionary $\mathbf{D} = \{\mathbf{D}_a, \mathbf{D}_s\}$ is constructed from the ground-truth patches, and the stitched patches are stacked as $\mathbf{R} = \{\mathbf{R}_a, \mathbf{R}_s\}$. The minimal reconstruction code $\mathbf{X}^*$ is solved as:

$$\mathbf{X}^{*} = \arg\min_{\mathbf{X}} \frac{1}{2}\|\mathbf{R} - \mathbf{D}\mathbf{X}\|_F^2 + \lambda \|\mathbf{X}\|_1. \tag{10}$$

This can be easily solved with Online Dictionary Learning [66]. With the optimized sparse representation $\mathbf{X}^*$, we further apply an SVD decomposition and take the principal components with $F_{PCA}$. The final score is evaluated with the L1-norm:

$$\mathbf{X}^{*} = \sum_{i=1}^{r} \mathbf{U}_i \boldsymbol{\Sigma}_i \mathbf{V}_i^{T}, \quad E_{sparse} = -\sum_{i=1}^{r} \left\| F_{PCA}(\boldsymbol{\Sigma}_i) \right\|_1, \tag{11}$$

where $\boldsymbol{\Sigma}$ contains the singular values that represent the sparse reconstructions.
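A sketch of this metric under common tooling is shown below: the lasso coding of Eqn. (10) is delegated to scikit-learn's `sparse_encode`, and F_PCA of Eqn. (11) is approximated by keeping the leading singular values; the regularization weight and the retained rank are illustrative values.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def sparse_reconstruction_score(stitched_patches, gt_patches, lam=0.1, rank=10):
    """Sparse-reconstruction metric sketch of Eqns. (10)-(11).

    stitched_patches: R, shape (n_patches, patch_dim), attention + stitching patches
    gt_patches:       D, shape (n_atoms, patch_dim), dictionary of ground-truth patches
    """
    # Eqn. (10): X* = argmin 0.5 ||R - D X||_F^2 + lam ||X||_1 (lasso sparse coding)
    code = sparse_encode(stitched_patches, gt_patches,
                         algorithm='lasso_lars', alpha=lam)
    # Eqn. (11): leading singular values of the sparse code stand in for F_PCA(Sigma)
    singular_values = np.linalg.svd(code, compute_uv=False)
    return -float(np.abs(singular_values[:rank]).sum())
```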
2) Appearance similarity: To evaluate the appearance similarity, we resort to the commonly used Gray-Level Co-occurrence Matrix (GLCM) [67] at angles of [45, 90, 135, 180] degrees to extract texture features. With these features, we adopt a histogram calculation to measure the texture similarity between the sampled patches. In this manner, we calculate the cosine similarity between the divided bins as in Eqn. (12):

$$E_{app} = \sum_{d=1}^{4} \sum_{i}^{n} \sum_{j}^{n} \frac{\cos(\mathbf{h}_i, \mathbf{h}_j)}{\|\mathbf{h}_i\|_F^2 \, \|\mathbf{h}_j\|_F^2}, \tag{12}$$

where $\mathbf{h}_i$ is the $i$th histogram bin of the GLCM with $n = 10$ divisions and $d$ indexes the four angles of the GLCM.
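A sketch of this texture comparison with scikit-image is given below; it assumes uint8 grayscale patches, reads Eqn. (12) as a normalized cosine similarity between per-angle GLCM histograms of a stitched patch and its ground-truth counterpart, and uses n = 10 bins as in the text.

```python
import numpy as np
from skimage.feature import graycomatrix

ANGLES = np.deg2rad([45, 90, 135, 180])          # four GLCM directions of Eqn. (12)

def glcm_histograms(patch, n_bins=10):
    """One n_bins histogram of the GLCM per direction (patch: uint8 grayscale)."""
    glcm = graycomatrix(patch, distances=[1], angles=ANGLES,
                        levels=256, symmetric=True, normed=True)
    return [np.histogram(glcm[:, :, 0, d].ravel(), bins=n_bins, range=(0, 1))[0]
            .astype(np.float64) for d in range(len(ANGLES))]

def appearance_similarity(stitched_patch, gt_patch):
    """E_app: cosine similarity between GLCM histograms, summed over the angles."""
    score = 0.0
    for h_s, h_g in zip(glcm_histograms(stitched_patch), glcm_histograms(gt_patch)):
        denom = np.linalg.norm(h_s) * np.linalg.norm(h_g) + 1e-12
        score += float(h_s @ h_g) / denom
    return score
```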
B. Global Environmental Assessment
To evaluate the environmental immersion experience, we further propose two global metrics that assess global regions of the stitched images, summarized as follows:
1) Color chromatism: Most stitching methods adjust some optical parameters to match the two images, which can further introduce chromatic aberrations. To evaluate the point-wise color difference, we adopt SIFT [68] matching to find the pixel correspondences between the stitched image and the ground-truth image. For each matched pair of points, we compute the K-nearest neighbors to eliminate mismatches, and we denote $\mathbf{S}$ and $\mathbf{G}$ as the sampled patches of the stitching regions and the referenced ground-truth regions, respectively. The SIFT matching procedure can be formulated as:

$$\mathbf{G}^{*} = \arg\min_{\mathbf{G}} \left\| \mathbf{S} - H_{sift}(\mathbf{G}_i) \right\|_F^2, \quad i = 1 \ldots K. \tag{13}$$

After matching every $\mathbf{S}$ with its nearest $\mathbf{G}$, the color chromatism score can be calculated as:

$$E_{color} = -\sum_{i=1}^{M} \sum_{k=1}^{C} \frac{\lambda \left\| \mathbf{S}_{ik} - \mathbf{G}^{*}_{ik} \right\|_F^2}{M \times C}, \tag{14}$$

where $M$ is the number of corresponding pairs and $C$ is the number of channels. $\lambda$ is set to 100 to balance the final score.
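The sketch below illustrates this score with OpenCV; SIFT matching with a 2-nearest-neighbour ratio test stands in for the K-nearest-neighbour mismatch removal, and the patch size around each matched keypoint is an illustrative choice.

```python
import cv2
import numpy as np

def color_chromatism(stitched, reference, lam=100.0, patch=5):
    """SIFT-based color chromatism sketch of Eqns. (13)-(14) (BGR uint8 inputs)."""
    sift = cv2.SIFT_create()
    kp_s, des_s = sift.detectAndCompute(cv2.cvtColor(stitched, cv2.COLOR_BGR2GRAY), None)
    kp_r, des_r = sift.detectAndCompute(cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher().knnMatch(des_s, des_r, k=2)
    good = [m[0] for m in matches
            if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]

    half, diffs = patch // 2, []
    for m in good:
        xs, ys = map(int, kp_s[m.queryIdx].pt)
        xr, yr = map(int, kp_r[m.trainIdx].pt)
        ps = stitched[ys - half: ys + half + 1, xs - half: xs + half + 1]
        pr = reference[yr - half: yr + half + 1, xr - half: xr + half + 1]
        if ps.shape == pr.shape and ps.size:       # skip patches clipped at the border
            diffs.append(np.mean((ps / 255.0 - pr / 255.0) ** 2))
    # Eqn. (14): negative mean squared color difference over matches, scaled by lam
    return -lam * float(np.mean(diffs)) if diffs else 0.0
```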
2) Blind zone: Blind zones are blank areas caused by information loss during the transformation process, which affect the visual comfort of immersive experiences. To accurately measure the impact of blind zones, we propose an attention-weighted blind zone evaluation metric. We adopt the same SALICON model [62] to calculate the attentive regions of the ground-truth. The attention map $w^b$ for the blind zone is generated from the ground-truth omnidirectional image, and its values $w^b_i$ are normalized to [0, 1]. The score $E_{blind}$ can be calculated as:

$$E_{blind} = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{B_i}{1 + e^{-w^b_i}}, \tag{15}$$

where $B_i$ denotes the $i$th pixel of the blind zone mask, which is set to 1 when the pixel lies in a blind zone, and $N$ denotes the number of pixels in the stitched image. The blind zone mask can be simply calculated from the top and bottom stitching regions with continuous blank areas (zero-valued pixels). If an attentive region with important content is missing, $E_{blind}$ produces a lower score.
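A sketch of this metric is given below; blind-zone pixels are detected here simply as zero-valued pixels within top and bottom bands of the stitched image, and the band height is an illustrative assumption.

```python
import numpy as np

def blind_zone_score(stitched_gray, attention_map, band_ratio=0.125):
    """Attention-weighted blind-zone metric sketch of Eqn. (15).

    stitched_gray: grayscale stitched image (zeros mark blank pixels)
    attention_map: ground-truth attention w_b, values normalized to [0, 1]
    """
    h, w = stitched_gray.shape
    band = max(1, int(h * band_ratio))
    mask = np.zeros((h, w), dtype=bool)
    mask[:band], mask[-band:] = True, True
    blind = mask & (stitched_gray == 0)              # B_i = 1 for blank pixels in the bands

    weight = 1.0 / (1.0 + np.exp(-attention_map))    # 1 / (1 + e^{-w_b})
    return 1.0 - float((blind * weight).sum()) / (h * w)
```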
C. Joint Assessment With Human-Guided Classifier
With the two proposed local metrics and the two global ones, we further introduce human subjective evaluations to supervise a linear classifier. We use the collected pair-wise evaluation scores, namely the Mean Opinion Scores (MOS) of the CROSS dataset [26], as the final reference.

The aim of our classifier is to learn weights that make the metric consistent with human subjective assessment. To this end, we adopt multiple linear regression (MLR) [69], [70] to fit the human subjective ground-truth. Stacking the proposed metrics into a vector $\mathbf{x} = \{E_{app}, E_{sparse}, E_{color}, E_{blind}\}$ and adopting the MOS scores as the ground-truth $\mathbf{M}$, the weight-balance parameters $\boldsymbol{\beta}$ can be learned by generalized least squares estimation, as shown in Eq. (16):

$$\mathbf{M} = \boldsymbol{\beta} \cdot \mathbf{x}, \quad \boldsymbol{\beta}^{*} = \arg\min_{\boldsymbol{\beta}} (\mathbf{x}^{T}\boldsymbol{\Omega}^{-1}\mathbf{x})^{-1}\mathbf{x}^{T}\boldsymbol{\Omega}^{-1}\mathbf{M}, \tag{16}$$

where $\boldsymbol{\Omega}$ is the covariance matrix of the residual error. Finally, the assessment score can be calculated as:

$$R = \boldsymbol{\beta}^{*} \cdot \mathbf{x}, \tag{17}$$

which can be further used to rank different stitching results.
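Eqns. (16)-(17) amount to a generalized-least-squares fit of the four metric scores against the MOS values, as in the sketch below; when the residual covariance Ω is unavailable it falls back to ordinary least squares, which is an assumption on our part.

```python
import numpy as np

def fit_metric_weights(features, mos, residual_cov=None):
    """GLS fit of Eqn. (16).

    features: (n_images, 4) stacked [E_app, E_sparse, E_color, E_blind]
    mos:      (n_images,) mean opinion scores M
    """
    omega = residual_cov if residual_cov is not None else np.eye(len(mos))
    omega_inv = np.linalg.inv(omega)
    beta = np.linalg.solve(features.T @ omega_inv @ features,
                           features.T @ omega_inv @ mos)
    return beta

def assessment_score(features, beta):
    """Final score R = beta* . x of Eqn. (17), used to rank stitching results."""
    return features @ beta
```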
V. EXPERIMENTS

A. Experiment Settings
1) Cross-Reference Dataset: We conduct our experiments on the CROSS dataset [26], which contains 292 fisheye images organized as quaternions. Each quaternion is composed of images captured from the standard quarters of 0, 90, 180 and 270 degrees. Taking two images in opposite directions for stitching, the other two images can provide high-quality ground-truth references. In this manner, a high-quality ground-truth stitching image is obtained.

To make a fair comparison with the state-of-the-art models, we randomly select 192 fisheye images from 12 different scenarios as the training set and use the remaining 100 images as the test set. Moreover, all the dual-fisheye images and corresponding 360° panoramic images are mirrored horizontally and vertically to obtain four times the original number. To verify the robustness of our network, we only use the original and horizontally mirrored images in training but add the vertically mirrored images at test time.
2) Evaluation Criterion: Our evaluation criterion is composed of two systems. The first system consists of the five most commonly used quality assessment evaluation metrics to match the mean opinion scores (MOS) provided by [26], which were collected through pair-wise ranking with 14,847 comparisons. The evaluation metrics include Cosine Similarity (CS), the Pearson Rank Correlation Coefficient (PRCC), the Spearman Rank Order Correlation Coefficient (SROCC), the Kendall Rank Correlation Coefficient (KRCC) and the Root Mean Square Error (RMSE). We also adopt the quality metric evaluation framework of [74] to assess our AQA approach. The detailed formulations of these evaluation metrics can be found in Table II. The MOS scores are stacked as vectors to calculate correlation similarities with the IQA methods.
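For reference, the five correlation criteria can be computed with SciPy as in the sketch below; the exact normalizations used in Table II may differ slightly, so this is only an approximation of the evaluation protocol.

```python
import numpy as np
from scipy import stats

def correlation_criteria(predicted, mos):
    """CS, PRCC, SROCC, KRCC and RMSE between predicted IQA scores and MOS."""
    predicted, mos = np.asarray(predicted, float), np.asarray(mos, float)
    cs = float(predicted @ mos /
               (np.linalg.norm(predicted) * np.linalg.norm(mos)))
    return {
        'CS': cs,
        'PRCC': float(stats.pearsonr(predicted, mos)[0]),
        'SROCC': float(stats.spearmanr(predicted, mos)[0]),
        'KRCC': float(stats.kendalltau(predicted, mos)[0]),
        'RMSE': float(np.sqrt(np.mean((predicted - mos) ** 2))),
    }
```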
The second criterion system is adopted to evaluate the quality of the stitched images. Besides the proposed omnidirectional quality assessment, 9 widely used IQA methods are adopted for our benchmark, including the classical methods MSE [19], PSNR [20] and SSIM [49], the no-reference quality assessment metrics BRISQUE [43], NIQE [73], PIQE [14], CCF [47] and CEIQ [48], and the machine-learning-based method CNNIQA [15].
3) Implementation Details: Our deep deformation model is built with 10 convolutional blocks (3 × 3, ReLU, batch norm), each followed by a 2 × 2 max-pooling in the encoder and an upsampling in the decoder. The high-resolution recurrence module is built with the same convolutional blocks and ends with a tanh function. The whole model is trained on a single NVIDIA GeForce GTX 1080 GPU and a single Intel i7-6700 CPU. The learning rates of the deformation module and the recurrence module both start at 0.001 and are halved when the validation loss reaches a plateau. The deformation module is trained at a resolution of 512 × 256. Owing to the limitation of GPU memory, we adopt 2048 × 1024 as the final high-resolution output and use 3 recurrence iterations to obtain the final output in our experiments. The deformation stage is trained for 30k iterations and each recurrent stage is trained for 15k iterations. The adopted attention model SALICON [62] is pre-trained on the SALICON fixation dataset [75], which contains 20,000 annotated daily images [76]. We then fine-tune it on the fixation annotations of our training set for 10k iterations with a learning rate of 1e-6. The ground-truth annotation follows the original dataset [75].
B. The Omnidirectional Stitching Benchmark
We adopt 7 widely used state-of-the-art stitching methods to construct our benchmark, namely Samsung Gear 360 [71], OpenSource [72], Stereoscopic Vision Projection (SVP) [28], Isometric Projection (IP) [29], Equidistant Projection (EP) [30], ManMethod (Manual Method) and our proposed Attentive Deep Stitching (ADS), which finally yields 1344 stitched images in total for comparison.
We first use the 7 compared IQA methods to evaluate the results of the 7 selected stitching models, which yields 49 scores, as shown in Table I. To compare these IQA indexes, we use the ranking order (shown in blue) to evaluate the methods. We can see that on most of the indexes, our deep stitching method generates preferable results compared with the time-consuming or labour-intensive methods, and it ranks first on the referenced IQA metrics, which demonstrates the quality of its stitching deformation. However, our method loses many details under the no-reference IQA comparisons such as BRISQUE [43], NIQE [73] and PIQE [14] because of the resolution limitation: most of the stitching methods output at a resolution of 5792 × 2896, whereas our stitching results are upsampled from 2048 × 1024, which loses many details under the no-reference IQA methods. We will further discuss this in Section V-D.
The qualitative results are shown in Fig. 5; our model generates favorable results with a fast inference procedure. The proposed evaluation scores are shown in the top left corner (in blue), and the NIQE [73] and PIQE [14] scores are shown in green in the second and third rows, respectively. On the one hand, the proposed AQA method is sensitive to local distortions and global color chromatism, while NIQE [73] and PIQE [14] do not generate favorable results that match the visual experience. On the other hand, our proposed attentive deep stitching approach shows less breakage and color chromatism, while other methods show apparent stitching errors in the stitching regions. Moreover,
TABLE I. JOINT BENCHMARKING OF 8 STITCHING MODELS WITH 10 IQA METRICS. ADS: ATTENTIVE DEEP STITCHING METHOD. AQA: PROPOSED ATTENTIVE QUALITY ASSESSMENT. THE ASSESSMENT RANKS OF THE 1ST AND 2ND PLACE IN EACH ROW ARE SHOWN IN BOLD AND UNDERLINED. ↑: THE HIGHER THE BETTER. ↓: THE LOWER THE BETTER
Fig. 5. Benchmark evaluations. The proposed evaluation scores are shown in the top left corner (in blue). The NIQE [73] and PIQE [14] scores are shown in green in the second and third rows, respectively. Our proposed deep stitching method, with the second highest scores, shows fewer distortions and less color chromatism, especially in the attention regions in the second row.
benefiting from our network architecture, the blind zones in our results are also smaller than those of the state-of-the-art models.
C. Analysis of Image Quality Assessment
To further evaluate the effectiveness of the proposed quality assessment metric, we use the five commonly used correlation metrics (i.e., CS, PRCC, SROCC, KRCC and RMSE) to evaluate our IQA scores. We adopt the pair-wise ranking of 14,847 image comparisons as the MOS results. As shown in Table III, our evaluation shows a large margin of superiority over the second best result, PIQE [14], in CS, PRCC, SROCC and KRCC; on the remaining criterion our method is comparable to PIQE, being only about 3% lower, while most of the classical metrics fail to handle this kind of image. Besides the existing evaluation metrics in Table II, we adopt the MOS evaluation framework of [74] to transform the subjective MOS scores into pair-wise significance. As shown in Table IV, our proposed AQA method also matches the human subjective evaluations closely, with an AUC value of 0.965, and surpasses the state-of-the-art methods.

To verify the robustness of the proposed AQA algorithm, we split our dataset into different parts and test our algorithm under
TABLE II. MATHEMATICAL FORMULATION OF 5 CORRELATION METRICS
various scenarios. The quantitative results are shown in Table V, where our model shows robustness under the various conditions. Jointly analyzing with Table III, even the lowest results of our proposed algorithm still show superiority over the state-of-the-arts.

From Table V, we can easily conclude that indoor scenes are more challenging than outdoor ones, mainly because of the varied objects and lighting changes. To further examine the lightness factors in our assessment metric, we divided the test images into two groups, natural light and non-natural light (e.g., indoor light,
TABLE III. MOS SIMILARITIES OF 10 STATE-OF-THE-ART IQA METHODS WITH 5 CORRELATION CRITERIA. ↑: THE HIGHER THE BETTER. ↓: THE LOWER THE BETTER
TABLE IV. MOS SIMILARITIES OF 8 STATE-OF-THE-ART IQA METHODS USING THE PAIR-WISE MOS FRAMEWORK [74]
TABLE V. COMPARISONS WITH HUMAN SUBJECTIVE EVALUATIONS. EVALUATION SCORES: TOTAL WINNING PROPORTIONS VIA PAIR-WISE COMPARISONS
TABLE VI. MOS SIMILARITY EVALUATIONS WITH LOCAL INDICATORS. AQA (ALL): AQA METHOD WITH ALL LOCAL AND GLOBAL METRICS
TABLE VII. ABLATION STUDY OF ADS WITH STATE-OF-THE-ART IQA METRICS
streetlight). Our assessment still faces a challenge in handling these conditions, with a sharp 17% drop in the sensitive PRCC and SROCC metrics, while obtaining acceptable results in the CS metric (3% lower).
We also use single indicators to compute the scores and measure their CS similarity with the MOS scores. The results can be found in Table VI. With our single appearance similarity indicator, the CS similarity with MOS reaches 0.811, which is higher than the classical MSE and SSIM indicators. The single global color chromatism indicator also shows a 0.803 similarity with MOS. With our full-reference AQA algorithm, the full method reaches
TABLE VIII. TIME COST OF ADS AND STATE-OF-THE-ART MODELS WITH THE SAME RESOLUTION OF 2048 × 1024 STITCHED OUTPUT
a performance of 0.948. This also verifies that our local and global modules are complementary and can jointly boost the performance.
We further evaluate the time efficiency and compare our method with state-of-the-art IQA approaches. The execution time of our method is evaluated on a single Intel i7-6700 CPU. For a single 2048 × 1024 image, our proposed AQA method costs 5.38 seconds and CCF [47] costs 6.98 seconds, while the classical SSIM and PSNR indexes, which have lower performance, cost 0.604 s and 0.275 s per image, respectively.
D. Analysis of Attentive Deep Stitching
To evaluate the effectiveness of our proposed Attentive Deep Stitching (ADS), we compare the stitching time with four state-of-the-art automatic stitching methods. It can be concluded that our stitching method runs 15 times faster than the classical Stereoscopic Vision Projection (SVP) [28], Isometric Projection (IP) [29] and Equidistant Projection (EP) [30]. Thanks to the two-stage lightweight design, our method also runs over 6 times faster than the state-of-the-art OpenSource [72] toolbox, as shown in Table VIII.
To verify the design of our proposed network architecture, we conduct an ablation study on the proposed ADS. We randomly select half of the test set for this validation with state-of-the-art IQA metrics. As shown in Table VII, the first row, Progressive-1, denotes the first iterated output of the recurrence network. With 3 recurrences, the stitching results boost PSNR from 15.52 to 16.29, while our model without the attention mechanism generates much coarser results than the final model in the third line. Our proposed method is consistent with human subjective
Fig. 6. Results of different iterative resolutions. The first row shows the stitching results at the iteration of 512 × 256. The second row shows the final results with a resolution of 2048 × 1024. The proposed progressive module provides increasing details, especially in the attention regions, e.g., the cars in the first row and the windows in the second and third rows.
Fig. 7. Visualized results of our ADS algorithm. ADS w/o Attention: ADS algorithm without the attention module. ADS-Final: the final results with the proposed attention mechanism.
evaluations, while NIQE [73] is not sensitive to this kind of image quality. The proposed AQA is also sensitive to the image quality improvement, varying from 38.36 to 51.56 over the progressive iterations. The visualized ablation results of our attention mechanism can be found in Fig. 7, with the corresponding attention maps in the third column. Compared to the second column with our full model, the results without the attention mechanism lose many details in local regions, especially the attentive regions.
The qualitative results are shown in Fig. 6. Comparing the first row, with fewer iterations, to the second row, the proposed progressive module provides increasing details, especially in the attention regions (e.g., the cars in the first row and the windows in the second and third rows). However, with the limitation of GPU memory, obtaining results at higher resolutions is still a challenging task, which is the largest limitation of our module.
The last thing we want to emphasize is the choice of the attention algorithm. As shown in Table IX, we adopt three different
TABLE IX. ADS WITH DIFFERENT ATTENTION METHODS. THE BEST PERFORMANCES ARE IN BOLD. ↑: THE HIGHER THE BETTER. ↓: THE LOWER THE BETTER. †: TRAINED WITH THE SALIENT360 DATASET [78]
ways of generating attention regions for the stitching algorithm. We adopt the super-pixel based attention generation method [77] designed for 360° omnidirectional images. The second and third rows indicate the SALICON model trained with or without the omnidirectional benchmark [78] images. It can be easily concluded that the results with the deep SALICON model [62] show better performance than the classical method [77]. With more accurate saliency annotations in 360° scenarios, most of the IQA metrics show a slight performance boost.
VI. CONCLUSIONS
In this paper, we mainly address two concerns with increasing demands in 360° omnidirectional images: how to generate a good omnidirectional image in a fast and robust way, and what makes a good omnidirectional image for humans? To address these two concerns, we develop two human perception-driven approaches, namely attentive deep stitching (ADS) and attentive quality assessment (AQA) for omnidirectional images. First, our progressive attentive deep stitching model consists of two modules, the first to learn the deformation information and the second to progressively enhance the resolution. To achieve this, we propose a joint implicit and explicit attention mechanism to make our results consistent with human subjective evaluations. Second, to accurately evaluate the stitching results, we develop a novel attentive quality assessment approach for 360° omnidirectional images, which consists of two locally sensitive metrics focusing on the human attention and stitching regions and two global ones focusing on environmental immersion. Qualitative and quantitative experiments show that our stitching approach generates preferable results compared with the state-of-the-arts at a 6× faster speed. Moreover, the proposed attentive quality assessment approach for omnidirectional images surpasses the state-of-the-art methods by a large margin and is highly consistent with human subjective evaluations.
REFERENCES
[1] J. Chalfoun et al., “Mist: Accurate and scalable microscopy image stitching679tool with stage modeling and error minimization,” Sci. Rep., vol. 7, 2017,680Art. no. 4988.681
[2] E. A. Semenishchev, V. V. Voronin, V. I. Marchuk, and I. V. Tolstova,682“Method for stitching microbial images using a neural network,” Proc.683SPIE, vol. 10221, 2017, Art. no. 102210.684
[3] D. Li, Q. He, C. Liu, and H. Yu, “Medical image stitching using parallel685sift detection and transformation fitting by particle swarm optimization,”686J. Med. Imag. Health Informat., vol. 7, no. 6, pp. 1139–1148, Oct. 2017.687
[4] L. Barazzetti, M. Previtali, and F. Roncoroni, “3D modelling with the688samsung gear 360,” ISPRS-Int. Archives Photogramm., Remote Sens.689Spatial Inf. Sci., pp. 85–90, 2017.690
[5] D. Chapman and A. Deacon, “Panoramic imaging and virtual realityłfilling691the gaps between the lines,” ISPRS J. Photogramm. Remote Sens., vol. 53,692no. 6, pp. 311–319, 1998.693
[6] G. Payen de La Garanderie, A. Atapour Abarghouei, and T. P. Breckon,694“Eliminating the blind spot: Adapting 3D object detection and monocular695depth estimation to 360-degree panoramic imagery,” in Proc. Eur. Conf.696Comput. Vis., 2018, pp. 789–807.697
[7] T. Z. Xiang, G. S. Xia, X. Bai, and L. Zhang, “Image stitching by line-698guided local warping with global similarity constraint,” Pattern Recognit.,699pp. 481–497, 2018.700
[8] W. Ye, K. Yu, Y. Yu, and J. Li, “Logical stitching: A panoramic image701stitching method based on color calibration box,” in Proc. IEEE Int. Conf.702Signal Process., 2018, pp. 1139–1143.703
[9] I. C. Lo, K. T. Shih, and H. H. Chen, “Image stitching for dual fisheye704cameras,” in Proc. IEEE Int. Conf. Image Process., Oct. 2018, pp. 3164–7053168.706
[10] T. Ho and M. Budagavi, “Dual-fisheye lens stitching for 360-degree imag-707ing,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Mar. 2017,708pp. 2172–2176.709
[11] M. Brown and D. G. Lowe, “Automatic panoramic image stitching using710invariant features,” Int. J. Comput. Vis., vol. 74, no. 1, pp. 59–73, 2007.711
[12] P. C. Ng and S. Henikoff, “Sift: Predicting amino acid changes that affect712protein function,” Nucleic Acids Res., vol. 31, no. 13, pp. 3812–3814, 2003.713
[13] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,”714in Proc. Eur. Conf. Comput. Vis., May 2006, pp. 404–417.715
[14] V. N. , P. D. , M. C. Bh., S. S. Channappayya, and S. S. Medasani, “Blind 716image quality evaluation using perception based features,” in Proc. 21st 717Nat. Conf. Commun., Feb./Mar. 2015, pp. 1–6. 718
[15] L. Kang, P. Ye, Y. Li, and D. S. Doermann, “Convolutional neural networks 719for no-reference image quality assessment,” in Proc. IEEE Conf. Comput. 720Vis. Pattern Recognit., Columbus, OH, USA, Jun. 2014, pp. 1733–1740. 721
[16] X. Liu, J. van de Weijer, and A. D. Bagdanov, “Rankiqa: Learning from 722rankings for no-reference image quality assessment,” in Proc. IEEE Int. 723Conf. Comput. Vis., Venice, Italy, Oct. 2017, pp. 1040–1049. 724
[17] G. Cheung, L. Yang, Z. Tan, and Z. Huang, “A content-aware metric for 725stitched panoramic image quality assessment,” in Proc. IEEE Int. Conf. 726Comput. Vis. Workshops, Venice, Italy, Oct. 2017, pp. 2487–2494. 727
[18] Y. Niu, H. Zhang, W. Guo, and R. Ji, “Image quality assessment for color 728correction based on color contrast similarity and color value difference,” 729IEEE Trans. Circuits Syst. Video Techn., vol. 28, no. 4, pp. 849–862, 730Apr. 2018. 731
[19] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude 732similarity deviation: A highly efficient perceptual image quality index,” 733IEEE Trans. Image Process., vol. 23, no. 2, pp. 684–695, Feb. 2014. 734
[20] A. Tanchenko, “Visual-PSNR measure of image quality,” J. Visual Com- 735mun. Image Representation, pp. 874–878, 2014. 736
[21] Z. Wang et al., “Image quality assessment: From error visibility to struc- 737tural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 738Apr. 2004. 739
[22] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, “Deep 740neural networks for no-reference and full-reference image quality assess- 741ment,” IEEE Trans. Image Process., vol. 27, no. 1, pp. 206–219, Jan. 2018. 742
[23] L. Kang, P. Ye, Y. Li, and D. S. Doermann, “Simultaneous estimation of 743image quality and distortion via multi-task convolutional neural networks,” 744in Proc. IEEE Int. Conf. Image Process., Quebec City, QC, Canada, 745Sep. 2015, pp. 2791–2795. 746
[24] G. Cheung, L. Yang, Z. Tan, and Z. Huang, “A content-aware metric for 747stitched panoramic image quality assessment,” in Proc. IEEE Int. Conf. 748Comput. Vis. Workshops, Venice, Italy, Oct. 2017, pp. 2487–2494. 749
[25] H. Duan, G. Zhai, X. Min, Y. Zhu, Y. Fang, and X. Yang, “Perceptual 750quality assessment of omnidirectional images,” in Proc. IEEE Int. Symp. 751Circuits Syst., 2018, pp. 1–5. 752
[26] J. Li, K. Yu, Y. Zhao, Y. Zhang, and L. Xu, “Cross-reference stitching 753quality assessment for 360 omnidirectional images,” in Proc. 27th ACM 754Int. Conf. Multimedia, 2019, pp. 2360–2368. 755
[27] O. S. Vaidya and S. T. Gandhe, “The study of preprocessing and postpro- 756cessing techniques of image stitching,” in Proc. Int. Conf. Adv. Commun. 757Comput. Technol., Feb. 2018, pp. 431–435. 758
[28] B. Maneshgar, L. Sujir, S. P. Mudur, and C. Poullis, “A long-range vision 759system for projection mapping of stereoscopic content in outdoor areas,” 760in Proc. 12th Int. Joint Conf. Comput. Vis., Imag. Comput. Graph. Theory 761Appl., Porto, Portugal, Feb./Mar. 2017, pp. 290–297. 762
[29] D. Cai, X. He, and J. Han, “Isometric projection,” in Proc. 22nd AAAI 763Conf. Artif. Intell., Vancouver, BC, Canada, Jul. 2007, pp. 528–533. 764
[30] D. Schneider, E. Schwalbe, and H. G. Maas, “Validation of geometric 765models for fisheye lenses,” ISPRS J. Photogrammetry Remote Sens., 766vol. 64, no. 3, pp. 259–266, 2009. 767
[31] W. Li, C.-B. Jin, M. Liu, H. Kim, and X. Cui, “Local similarity refine- 768ment of shape-preserved warping for parallax-tolerant image stitching,” 769Institution Eng. Technol., pp. 661–668, 2017. 770
[32] T. Xiang, G.-S. Xia, and L. Zhang, “Image stitching with perspective- 771preserving warping,” 2016, arXiv:1605.05019. 772
[33] T. Ho, I. D. Schizas, K. Rao, and M. Budagavi, “360-degree video stitching 773for dual-fisheye lens cameras based on rigid moving least squares,” in Proc. 774IEEE Int. Conf. Image Process., 2017, pp. 51–55. 775
[34] C. Herrmann et al., “Robust image stitching with multiple registrations,” 776in Proc. Eur. Conf. Comput. Vis., 2018. 777
[35] L. Yao, Y. Lin, C. Zhu, and Z. Wang, “An effective dual-fisheye lens 778stitching method based on feature points,” in Proc. Int. Conf. Multimedia 779Model., Jan. 2019, pp. 665–677. 780
[36] J. Tan, G. Cheung, and R. Ma, “360-degree virtual-reality cameras for the 781masses,” IEEE Multimedia, vol. 25, no. 1, pp. 87–94, Jan./Mar. 2018. 782
[37] I.-C. Lo, K.-T. Shih, and H. H. Chen, “Image stitching for dual fisheye 783cameras,” in Proc. Int. Conf. Image Process., Oct. 2018, pp. 3164–3168. 784
[38] X. Li, Y. Pi, Y. Jia, Y. Yang, Z. Chen, and W. Hou, “Fisheye image 785rectification using spherical and digital distortion models,” Multispectral 786Image Acquisition Process. Anal., 2018, Art. no. 106070G. 787
[39] X. Yin, X. Wang, J. Yu, P. Fua, M. Zhang, and D. Tao, “Fisheyerecnet: A 788multi-context collaborative deep network for fisheye image rectification,” 789in Proc. Eur. Conf. Comput. Vis., 2018, pp. 469–484. 790
[40] L. Deng, M. Yang, H. Li, T. Li, B. Hu, and C. Wang, "Restricted deformable convolution based road scene semantic segmentation using surround view cameras," 2018, arXiv:1801.00708.
[41] C. Li, M. Xu, S. Zhang, and P. Le Callet, "State-of-the-art in 360° video/image processing: Perception, assessment and compression," 2019, arXiv:1905.00161.
[42] Y. Qian, D. Liao, and J. Zhou, "Manifold alignment based color transfer for multiview image stitching," in Proc. IEEE Int. Conf. Image Process., Melbourne, Australia, Sep. 2013, pp. 1341–1345.
[43] A. Mittal, A. K. Moorthy, and A. C. Bovik, "No-reference image quality assessment in the spatial domain," IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
[44] C. Richardt, Y. Pritch, H. Zimmer, and A. Sorkine-Hornung, "Megastereo: Constructing high-resolution stereo panoramas," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Portland, OR, USA, Jun. 2013, pp. 1256–1263.
[45] J. Zaragoza, T. Chin, Q. Tran, M. S. Brown, and D. Suter, "As-projective-as-possible image stitching with moving DLT," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1285–1298, Jul. 2014.
[46] P. Ye, J. Kumar, L. Kang, and D. Doermann, "Unsupervised feature learning framework for no-reference image quality assessment," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 1098–1105.
[47] Y. Wang et al., "An imaging-inspired no-reference underwater color image quality assessment metric," Comput. Elect. Eng., vol. 70, pp. 904–913, 2018.
[48] Y. Fang, K. Ma, Z. Wang, W. Lin, Z. Fang, and G. Zhai, "No-reference quality assessment of contrast-distorted images based on natural scene statistics," IEEE Signal Process. Lett., vol. 22, no. 7, pp. 838–842, Jul. 2015.
[49] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[50] Z. Wang and Q. Li, "Information content weighting for perceptual image quality assessment," IEEE Trans. Image Process., vol. 20, no. 5, pp. 1185–1198, May 2011.
[51] L. Zhang, L. Zhang, X. Mou, and D. Zhang, "FSIM: A feature similarity index for image quality assessment," IEEE Trans. Image Process., vol. 20, no. 8, pp. 2378–2386, Aug. 2011.
[52] L. Liu, H. Dong, H. Huang, and A. C. Bovik, "No-reference image quality assessment in curvelet domain," Signal Process., Image Commun., vol. 29, pp. 494–505, 2014.
[53] S. Ling, G. Cheung, and P. Le Callet, "No-reference quality assessment for stitched panoramic images using convolutional sparse coding and compound feature selection," in Proc. IEEE Int. Conf. Multimedia Expo, San Diego, CA, USA, Jul. 2018, pp. 1–6.
[54] Y. Yuan, Q. Guo, and X. Lu, "Image quality assessment: A sparse learning way," Neurocomputing, vol. 159, pp. 227–241, 2015.
[55] C. Zhang, J. Pan, S. Chen, T. Wang, and D. Sun, "No reference image quality assessment using sparse feature representation in two dimensions spatial correlation," Neurocomputing, vol. 173, pp. 462–470, 2016.
[56] M. Huang et al., "Modeling the perceptual quality of immersive images rendered on head mounted displays: Resolution and compression," IEEE Trans. Image Process., vol. 27, no. 12, pp. 6039–6050, Dec. 2018.
[57] M. Xu, C. Li, Z. Chen, Z. Wang, and Z. Guan, "Assessing visual quality of omnidirectional videos," IEEE Trans. Circuits Syst. Video Technol., to be published.
[58] M. Xu, C. Li, Y. Liu, X. Deng, and J. Lu, "A subjective visual quality assessment method of panoramic videos," in Proc. IEEE Int. Conf. Multimedia Expo, 2017, pp. 517–522.
[59] Y. Rai, P. Le Callet, and P. Guillotel, "Which saliency weighting for omnidirectional image quality assessment?" in Proc. 9th Int. Conf. Quality Multimedia Experience, Erfurt, Germany, May/Jun. 2017, pp. 1–6.
[60] E. Upenik, M. Rerábek, and T. Ebrahimi, "Testbed for subjective evaluation of omnidirectional visual content," in Proc. Picture Coding Symp., Nuremberg, Germany, Dec. 2016, pp. 1–5.
[61] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput. Assisted Intervention, Oct. 2015, pp. 234–241.
[62] X. Huang, C. Shen, X. Boix, and Q. Zhao, "SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 262–270.
[63] M. Assens Reina, X. Giro-i-Nieto, K. McGuinness, and N. E. O'Connor, "SaltiNet: Scan-path prediction on 360 degree images using saliency volumes," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2331–2338.
[64] Y. Rai, J. Gutiérrez, and P. Le Callet, "A dataset of head and eye movements for 360 degree images," in Proc. 8th ACM Multimedia Syst. Conf., 2017, pp. 205–210.
[65] M. Startsev and M. Dorr, "360-aware saliency estimation with conventional image saliency predictors," Signal Process., Image Commun., vol. 69, pp. 43–52, 2018.
[66] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," J. Mach. Learn. Res., vol. 11, pp. 19–60, 2010.
[67] R. M. Haralick, K. Shanmugam, and I. Dinstein, "Textural features for image classification," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, no. 6, pp. 610–621, Nov. 1973.
[68] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. 7th IEEE Int. Conf. Comput. Vis., 1999, pp. 1150–1157.
[69] C. Li, A. C. Bovik, and X. Wu, "Blind image quality assessment using a general regression neural network," IEEE Trans. Neural Netw., vol. 22, no. 5, pp. 793–799, May 2011.
[70] K. J. Preacher, P. J. Curran, and D. J. Bauer, "Computational tools for probing interactions in multiple linear regression, multilevel modeling, and latent curve analysis," J. Educational Behav. Statist., pp. 437–448, 2006.
[71] Samsung Group, "Samsung gear 360 stitching toolbox," 2017. [Online]. Available: https://www.samsung.com/global/galaxy/gear-360
[72] Github.com, "DualFisheye," 2016. [Online]. Available: https://github.com/ooterness/DualFisheye
[73] A. Mittal, R. Soundararajan, and A. C. Bovik, "Making a 'completely blind' image quality analyzer," IEEE Signal Process. Lett., vol. 20, no. 3, pp. 209–212, Mar. 2013.
[74] L. Krasula, K. Fliegel, P. Le Callet, and M. Klíma, "On the accuracy of objective image and video quality models: New methodology for performance evaluation," in Proc. IEEE 8th Int. Conf. Quality Multimedia Experience, 2016, pp. 1–6.
[75] M. Jiang, S. Huang, J. Duan, and Q. Zhao, "SALICON: Saliency in context," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1072–1080.
[76] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., Springer, 2014, pp. 740–755.
[77] Y. Fang, X. Zhang, and N. Imamoglu, "A novel superpixel-based saliency detection model for 360-degree images," Signal Process., Image Commun., vol. 69, pp. 1–7, 2018.
[78] J. Gutiérrez, E. J. David, A. Coutrot, M. P. Da Silva, and P. Le Callet, "Introducing UN Salient360! benchmark: A platform for evaluating visual attention models for 360 contents," in Proc. IEEE 10th Int. Conf. Quality Multimedia Experience, 2018, pp. 1–3.
Jia Li (M'12–SM'15) received the B.E. degree from Tsinghua University, Beijing, China, in 2005, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2011. He is currently an Associate Professor with the School of Computer Science and Engineering, Beihang University, Beijing, China. Before joining Beihang University in June 2014, he conducted research at Nanyang Technological University, Peking University, and Shanda Innovations. He is the author or coauthor of more than 70 technical articles in refereed journals and conferences such as TPAMI, IJCV, TIP, CVPR, and ICCV. His research interests include computer vision and multimedia big data, especially learning-based visual content understanding. He is a senior member of CIE and CCF. More information can be found at http://cvteam.net.
Yifan Zhao received the B.E. degree from the Harbin Institute of Technology, Harbin, China, in Jul. 2016. He is currently working toward the Ph.D. degree with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China. His research interests include computer vision and image understanding.
Weihua Ye is currently working toward the M.S. degree with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China. His research interests include computer vision and virtual reality.
Kaiwen Yu is currently working toward the M.S. degree with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China. His research interests include image quality assessment and deep learning.
Shiming Ge (M'13–SM'15) received the B.S. and Ph.D. degrees in electronic engineering from the University of Science and Technology of China, Hefei, China, in 2003 and 2008, respectively. He is currently an Associate Professor with the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China. He is also a member of the Youth Innovation Promotion Association, Chinese Academy of Sciences. Prior to that, he was a Senior Researcher and Project Manager at Shanda Innovations, and a Researcher at Samsung Electronics and Nokia Research Center. His research mainly focuses on computer vision, data analysis, machine learning, and AI security, especially efficient learning models and solutions toward scalable applications.