- Here, to help focus research efforts on the hardest unsolved problems and to bridge computer and human vision, we define a battery of 5 tests that measure the gap between human and machine performance along several dimensions.
- Cases where machines are superior motivate us to design new experiments to understand the mechanisms of human vision and to reason about its failures. Cases where humans are better inspire computational researchers to learn from humans.
- In some applications (e.g., human-machine interaction or personal robots), perfect accuracy is not necessarily the goal; rather, having the same type of behavior (e.g., failing in cases where humans fail too) is favorable.
- As expected, models based on histograms are less influenced (e.g., geo_color, line_hist, HOG, texton, and LBP).
- Models correlate higher with humans over scenes (OSR and ISR) than objects, and better on outdoor scenes than indoor ones.
- Some models that use global feature statistics show high correlation only on scenes but very low correlation on objects (e.g., GIST, texton, geo_map, and LBP), since they do not capture object shape or structure.
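The human-model agreement reported here is the Pearson correlation between human and model confusion matrices. A minimal sketch in plain Python; the two 3-class matrices below are hypothetical, for illustration only:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def agreement(cm_human, cm_model):
    """Correlate the flattened confusion matrices of humans and a model."""
    flat = lambda m: [v for row in m for v in row]
    return pearson(flat(cm_human), flat(cm_model))

# Hypothetical 3-class confusion matrices (rows: response distribution).
human = [[0.80, 0.15, 0.05],
         [0.10, 0.70, 0.20],
         [0.05, 0.25, 0.70]]
model = [[0.75, 0.20, 0.05],
         [0.15, 0.65, 0.20],
         [0.10, 0.20, 0.70]]
score = agreement(human, model)
```

A model that confuses the same category pairs humans confuse scores near 1 even if its absolute accuracy differs from human accuracy.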
- A & B: We find that HOG, SSIM, texton, denseSIFT, LBP, and LBPHF outperform other models (accuracy above 70%). We note that spatial feature integration (i.e., x_pyr for model x) enhances accuracy.
- C: Animal vs. non-animal: All models perform above 70%, except tiny_image. Human accuracy here is about 80%. Interestingly, some models exceed human performance here.
- SUN dataset: Models that performed well on small datasets still rank on top, although their accuracies degrade heavily. The GIST model works well here (16.3%) but falls below the top contenders: HOG, texton, SSIM, denseSIFT, and LBP (or their variants). The models ranking at the bottom, in order, are tiny_image, line_hist, geo_color, HMAX, and geo_map8x8.
- A majority of models perform above 70% on line drawings, which is higher than human performance (a similar pattern holds on images, with humans at 77.3% and models above 80%).
- SVM trained on images and tested on line drawings: Some models (e.g., line_hist, GIST, geo_map, sparseSIFT) generalize better to line drawings.
- SVM trained on line drawings and tested on edge maps: Surprisingly, averaged over all models, Sobel and Canny perform better than gPb. GIST, line_hist, and HMAX were the most successful models across all edge detection methods; sparseSIFT, LBP, geo_color, and geo_texton were the most affected.
- Models using the Canny technique achieved the best scene classification accuracy.
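For reference, the Sobel edge maps used above come from convolving the image with two 3x3 gradient kernels. A minimal sketch in plain Python on a toy 6x6 image (real pipelines use tuned Canny/gPb implementations):

```python
def sobel_magnitude(img):
    """Gradient magnitude via 3x3 Sobel kernels on a 2D grayscale list."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]          # border pixels stay 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(kx[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(ky[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# Toy image with a vertical step edge: left half dark, right half bright.
img = [[0, 0, 0, 9, 9, 9] for _ in range(6)]
edges = sobel_magnitude(img)
```

Only the two columns adjacent to the step respond; flat regions give zero, which is the structural signal the edge-map classifiers rely on.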
- A majority of models are invariant to scaling, while a few are drastically affected by a large amount of scaling (e.g., siagianItti07, SSIM, line_hist, and sparseSIFT).
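The in-plane rotation conditions of this test (90 degrees and 180 degrees, i.e., inverted) amount to re-classifying each stimulus after a simple raster transform. A toy sketch in plain Python on a hypothetical 2x2 image:

```python
def rot90(img):
    """Rotate a 2D list (row-major raster) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def rot180(img):
    """Invert the image: two successive 90-degree rotations."""
    return rot90(rot90(img))

# Hypothetical 2x2 "image" of pixel values.
img = [[1, 2],
       [3, 4]]
rotated = rot90(img)
inverted = rot180(img)
```

Features are then re-extracted from the rotated rasters; a rotation-invariant model scores the same on `img`, `rotated`, and `inverted`.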
- Interestingly, LBP here shows a similar pattern to humans across the four stimulus categories (i.e., maximum for head, minimum for close body).
- Some models show higher similarity to human disruption over the four categories of the animal dataset: sparseSIFT, SSIM, and HOG.
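Disruption in this analysis is measured with the sensitivity index d' from signal detection theory. A minimal sketch in plain Python; the trial counts below are hypothetical (the dataset has 600 targets and 600 distractors, but these hit/false-alarm splits are made up for illustration):

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    return z(hit_rate) - z(fa_rate)

# Hypothetical animal/non-animal outcome counts for one condition.
sensitivity = d_prime(480, 120, 90, 510)
```

Comparing d' on original vs. rotated stimuli, per category, gives the disruption profile that is correlated with the human profile above.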
- On Caltech-256, HOG achieves the highest accuracy (about 33.28%), followed by SSIM, texton, and denseSIFT.
- GIST, which is specifically designed for scene categorization, achieves 27.4% accuracy, better than some models specialized for object recognition (e.g., HMAX).
- On sketch images, the shogSmooth model, specially designed for recognizing sketches, outperforms the others (acc = 57.2%). Texton histogram and SSIM ranked second and fourth, respectively. HMAX did very well (in contrast to Caltech-256), perhaps due to its success in capturing edges, corners, etc.
- Overall, models did much better on sketches than on natural objects (results are almost 2 times higher than on Caltech-256). Here, similar to Caltech-256, features relying on geometry (e.g., geo_map) did not perform well.
- Scenes were presented to subjects for 17-87 ms in a 6-alternative forced-choice task (human acc = 77.3%).
- On color images, geo_color, sparseSIFT, GIST, and SSIM showed the highest correlation (all with classification accuracy ≥ 75%), while tiny_image, texton, LBPHF, and LBP showed the least. Over the SUN dataset, HOG, denseSIFT, and texton showed high correlation with the human confusion matrix.
- It seems that models that take advantage of regional histograms of features (e.g., denseSIFT, GIST, geo_x; x = map or color) or heavily rely on edge histograms (texton and HOG) show higher correlation with humans on color images (although low in magnitude).
- Over line drawings: As with images, geo_color, SSIM, and sparseSIFT correlate well with humans. To our surprise, geo_color worked well.
Human vs. Computer in Scene and Object Recognition
Ali Borji and Laurent Itti {borji,itti}@usc.edu
University of Southern California, Computer Science, Psychology, and Neuroscience
Introduction & Motivation
Test 1: Scene Recognition
Test 2: Recognition of Line Drawings and Edge Maps
Line Drawings
Edge Maps
Test 3: Invariance Analysis
Test 4: Local vs. Global Information
Test 5: Object Recognition
1) Models outperform humans in rapid categorization tasks, indicating that discriminative information is in place but humans do not have enough time to extract it. Models outperform humans on jumbled images and score relatively high in the absence of (or with less) global information.
2) We find that some models and edge detection methods are more efficient on line drawings and edge maps. Our analysis helps objectively assess the power of edge detection algorithms to extract meaningful structural features for classification, which hints toward new directions.
3) While models are far from human performance on object and scene recognition over natural scenes, even classic models show high performance and correlation with humans on sketches.
4) Consistent with the literature, we find that some models (e.g., HOG, SSIM, geo_texton, and GIST) perform well. We find that they also resemble humans better.
5) Invariance analysis shows that only sparseSIFT and geo_color are invariant to in-plane rotation, with the former having higher accuracy (our 3rd test). GIST, a model of scene recognition, works better than many models over both the Caltech-256 and Sketch datasets.
Learned Lessons
Summary
Supported by the National Science Foundation, the General Motors Corporation, the Army Research Office, and Office of Naval Research
http://ilab.usc.edu/publications/doc/Borji_Itti14cvpr.pdf
[Figure: Bar plots of the Pearson correlation (0-0.7) between human and model confusion matrices, per model, over the 6-CAT dataset: line drawings (human acc = 66%) and color photographs (human acc = 77.3%).]

Human confusion matrix over line drawings (columns: true category; rows: response; columns sum to 1):

          beaches  city  forest  highway  mountain  office
beaches    0.38    0.01   0.07    0.02     0.07      0.01
city       0.12    0.69   0.05    0.13     0.03      0.12
forest     0.09    0.06   0.65    0.04     0.04      0.03
highway    0.13    0.13   0.06    0.70     0.04      0.06
mountain   0.21    0.04   0.12    0.04     0.78      0.02
office     0.06    0.08   0.06    0.06     0.03      0.76

Human confusion matrix over color photographs:

          beaches  city  forest  highway  mountain  office
beaches    0.59    0.02   0.03    0.01     0.06      0.01
city       0.07    0.77   0.03    0.09     0.04      0.10
forest     0.04    0.01   0.84    0.01     0.03      0.01
highway    0.12    0.16   0.03    0.82     0.04      0.07
mountain   0.15    0.01   0.06    0.04     0.82      0.01
office     0.03    0.03   0.01    0.02     0.01      0.80

Human-model agreement on the 6-CAT dataset. See our paper and its supplement for confusion matrices of models.
Geometric map: ground, porous, sky, and vertical regions.
[Figure: Edge maps for a sample image: LD, gPb, Canny, Log, Prewitt, Roberts, Sobel.]
[Figure: Per-model classification rates (%) for cross-domain generalization, with human accuracy marked. Top panel (CP): training on color photographs, testing on X (line drawings, gPb edge maps, flipped images). Bottom panel (LD): training on line drawings, testing on X (LD, gPb, Sobel, Log, Canny, Roberts, Prewitt edge maps).]

Top: training an SVM on color photographs and testing on line drawings, gPb edge maps, and inverted (flipped) images. Bottom: SVM trained on line drawings and applied to edge maps.
[Figure: Sample images from the scene and object datasets used in this study: 6-CAT (color photographs and line drawings), 15-CAT, SUN, jumbled images (Caltech, ISR, OSR), Animals (head, close body, medium body, far body), Caltech-256, and Sketches.]
[Figure: Per-model scene classification rates (%). A) 6-CAT: chance = 1/6 (16.67%), Naive Bayes chance = 18.88%. B) 15-CAT: chance = 1/15 (6.67%), Naive Bayes chance = 8.96%; 8-CAT: chance = 1/8 (12.50%), Naive Bayes chance = 15.25%; shown for original, scale-1/2, and scale-1/4 images. C) Top: animal vs. non-animal (chance 50%) over original, 90-degree rotated, and 180-degree rotated (inverted) images; bottom: four-way classification (head, close body, medium body, far body; chance 25%).]

A & B) Scene classification accuracy over the 6-, 8-, and 15-CAT datasets. Error bars represent the standard error of the mean over 10 runs. Naive Bayes chance is simply set by the size of the largest class. All models work well above chance level. C) Top: animal vs. non-animal (distractor images) classification. Bottom: classification of target images. The 4-way classification is only over target scenes (and not distractors).
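The Naive Bayes chance baseline mentioned in the caption is just the relative size of the largest class (the accuracy of always predicting it). A one-function sketch in plain Python; the class sizes below are hypothetical:

```python
def naive_bayes_chance(class_sizes):
    """Accuracy of always predicting the largest class (majority baseline)."""
    return max(class_sizes) / sum(class_sizes)

# Hypothetical unbalanced 6-class dataset.
baseline = naive_bayes_chance([120, 100, 90, 80, 60, 50])
```

For a perfectly balanced dataset this reduces to the uniform chance level 1/k.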
[Figure: Classification rate (%, log scale) on the SUN dataset vs. number of training samples per class (n = 1, 5, 10, 20, 50); chance level = 1/397 (0.25%). Legend accuracies (at n = 50): all [38.0], HOG [27.2], geo_texton [23.5], SSIM [22.5], denseSIFT [21.5], LBP [18.0], texton [17.6], GIST [16.3], all (1NN) [13.0], LBPHF [12.8], sparseSIFT [11.5], geo_color [9.1], color_hist [8.2], siagianItti07 [7.44], HMAX [6.98], geo_map8x8 [6.0], line_hist [5.7], tiny_image [5.5]. Right: Pearson correlation (0-0.3) with the human confusion matrix, ranked from tiny_image (lowest) to HOG (highest).]

Performances and correlations on the SUN dataset. We randomly chose n = {1, 5, 10, 20, 50} images per class for training and 50 for testing.
[Figure: Per-model scene classification rate (%) on edge maps from six edge detectors over the 6-CAT dataset. Average accuracy of the best-performing models per detector:

  Canny:   avg 5 best 0.882, avg 10 best 0.853
  Log:     avg 5 best 0.872, avg 10 best 0.842
  gPb:     avg 5 best 0.856, avg 10 best 0.832
  Sobel:   avg 5 best 0.849, avg 10 best 0.815
  Prewitt: avg 5 best 0.848, avg 10 best 0.813
  Roberts: avg 5 best 0.827, avg 10 best 0.794]

Scene classification results using edge-detected images over the 6-CAT dataset. The Canny edge detector leads to the best accuracies, followed by the Log and gPb methods.
[Figure: d' values (0-2.5) for each model and for humans over original, 90-degree rotated, and 180-degree rotated (inverted) animal images, broken down by stimulus category (head, close body, medium body, far body).]

d' values over original, 90-degree, and 180-degree rotated animal images.
[Figure: Per-model classification rate (%) and correlation with humans over jumbled images, for the Caltech (CAL), ISR, and OSR datasets. Human accuracy: CAL = 65.1, ISR = 56.1, OSR = 49.7.]

Correlation and classification accuracy over jumbled images.
[Figure, left: Recognition rate (%) on Caltech-256 vs. number of training samples per class (1, 5, 10, 20, 50); chance level = 1/256 (0.39%). Legend accuracies: HOG [33.28], HOG_pyr [32.66], SSIM [30.18], texton [29.86], denseSIFT [29.42], denseSIFT_pyr [28.38], texton_pyr [27.80], GIST [27.38], gistPadding [25.09], SSIM_pyr [25.01], LBP [20.66], LBP_pyr [20.54], sparseSIFT [20.38], geo_texton [20.26], LBPHF_pyr [17.76], LBPHF [17.62], siagianItti07 [16.46], tiny_image [12.96], HMAX [12.02], geo_color [8.14], line_hist [6.54], geo_map8x8 [5.28].

Figure, right: Recognition rate (%) on the Sketch dataset (chance level = 1/250, 0.4%) and correlation with humans (0-0.4). Legend accuracies: HMAX [53.03], denseSIFT [7.46], denseSIFT_pyr [41.84], geo_color [1.78], geo_map8x8 [29.25], geo_texton [22.21], GIST [51.97], gistPadding [51.74], HOG [20.68], HOG_pyr [50.69], LBP [12.08], LBP_pyr [47.15], LBPHF [9.42], LBPHF_pyr [42.02], line_hist [14.62], sparseSIFT [23.62], SSIM [26.48], SSIM_pyr [54.45], texton [22.29], texton_pyr [55.28], tiny_image [25.46], shogSmooth [57.20], shogHard [54.70].]

Left: Object recognition performance on the Caltech-256 dataset. Right: Recognition rate and correlations on the Sketch dataset.
Model            SUN    Caltech-256  Sketch  Animal/Non-Anim.  Similarity rank
siagianItti07    7.43   16.5         -       -                 13.6
HMAX             7      12           55      75.8              13.6
denseSIFT        21.5   29.4         7.6     84.4              9.3
denseSIFT_pyr    -      28.4         43.4    83.6              12.6
geo_color        9.14   4.9          1.68    73.7              10.2
geo_map8x8       6.02   5.3          30.6    72.5              13.8
geo_texton       23.5   20.3         23.4    78.8              8.4
GIST             16.3   27.4         53.7    81.5              11.2
gistPadding      13.7   25.1         53.6    81                12.4
HOG              27.2   33.3         21.2    84                5.6
HOG_pyr          -      32.7         52.3    84.2              9.2
LBP              18.0   20.7         12.8    83.1              10.0
LBP_pyr          -      20.5         48.9    85.7              10.2
LBPHF            12.8   17.6         9.6     83.1              11.7
LBPHF_pyr        -      17.8         43.3    85.8              10.0
line_hist        5.7    6.54         15.1    74.5              13.0
sparseSIFT       11.5   20.4         24.9    80.7              11.8
SSIM             22.5   30.2         27.5    84.9              9.2
SSIM_pyr         -      25.0         56.2    84.7              10.8
texton           17.6   29.9         23.1    78.3              9.6
texton_pyr       -      27.8         56.9    78.6              9.2
tiny_image       5.54   13           27.2    65                18.9

Classification results corresponding to 50 training images per class and, for testing, 50 images per class over SUN and the remaining images over Caltech-256 and Sketch. Animal vs. non-Animal corresponds to classification of 600 target vs. 600 distractor images. The top three models on each dataset are highlighted in red.