- Here, to help focus research efforts on the hardest unsolved problems and to bridge computer and human vision, we define a battery of 5 tests that measure the gap between human and machine performance along several dimensions.
- Cases where machines are superior motivate us to design new experiments to understand the mechanisms of human vision and to reason about its failures. Cases where humans are better inspire computational researchers to learn from humans.
- In some applications (e.g., human-machine interaction or personal robots), perfect accuracy is not necessarily the goal; rather, having the same type of behavior (e.g., failing in cases where humans fail too) is favorable.
- As expected, models based on histograms are less influenced (e.g., geo_color, line_hist, HOG, texton, and LBP).
- Models correlate higher with humans over scenes (OSR and ISR) than objects, and better on outdoor scenes than indoor ones.
- Some models that use global feature statistics show high correlation only on scenes but very low correlation on objects (e.g., GIST, texton, geo_map, and LBP), since they do not capture object shape or structure.
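The human-model agreement reported here is the Pearson correlation between human and model confusion matrices. A minimal sketch in plain Python; the two 3-class matrices below are hypothetical, for illustration only:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def agreement(cm_human, cm_model):
    """Correlate the flattened confusion matrices of humans and a model."""
    flat = lambda m: [v for row in m for v in row]
    return pearson(flat(cm_human), flat(cm_model))

# Hypothetical 3-class confusion matrices (rows: response distribution).
human = [[0.80, 0.15, 0.05],
         [0.10, 0.70, 0.20],
         [0.05, 0.25, 0.70]]
model = [[0.75, 0.20, 0.05],
         [0.15, 0.65, 0.20],
         [0.10, 0.20, 0.70]]
score = agreement(human, model)
```

A model that confuses the same category pairs humans confuse scores near 1 even if its absolute accuracy differs from human accuracy.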
- A & B: We find that HOG, SSIM, texton, denseSIFT, LBP, and LBPHF outperform other models (accuracy above 70%). We note that spatial feature integration (i.e., x_pyr for model x) enhances accuracy.
- C: Animal vs. non-animal: All models perform above 70%, except tiny_image. Human accuracy here is about 80%. Interestingly, some models exceed human performance here.
- SUN dataset: Models that performed well on small datasets still rank on top, although their accuracies degrade heavily. The GIST model works well here (16.3%) but falls below the top contenders: HOG, texton, SSIM, denseSIFT, and LBP (or their variants). The models ranking at the bottom, in order, are tiny_image, line_hist, geo_color, HMAX, and geo_map8x8.
- A majority of models perform above 70% on line drawings, which is higher than human performance (a similar pattern holds on images, with humans at 77.3% and models above 80%).
- SVM trained on images and tested on line drawings: Some models (e.g., line_hist, GIST, geo_map, sparseSIFT) generalize better to line drawings.
- SVM trained on line drawings and tested on edge maps: Surprisingly, averaged over all models, Sobel and Canny perform better than gPb. GIST, line_hist, and HMAX were the most successful models across all edge detection methods; sparseSIFT, LBP, geo_color, and geo_texton were the most affected.
- Models using the Canny technique achieved the best scene classification accuracy.
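For reference, the Sobel edge maps used above come from convolving the image with two 3x3 gradient kernels. A minimal sketch in plain Python on a toy 6x6 image (real pipelines use tuned Canny/gPb implementations):

```python
def sobel_magnitude(img):
    """Gradient magnitude via 3x3 Sobel kernels on a 2D grayscale list."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]          # border pixels stay 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(kx[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(ky[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# Toy image with a vertical step edge: left half dark, right half bright.
img = [[0, 0, 0, 9, 9, 9] for _ in range(6)]
edges = sobel_magnitude(img)
```

Only the two columns adjacent to the step respond; flat regions give zero, which is the structural signal the edge-map classifiers rely on.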
- A majority of models are invariant to scaling, while a few are drastically affected by a large amount of scaling (e.g., siagianItti07, SSIM, line_hist, and sparseSIFT).
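The in-plane rotation conditions of this test (90 degrees and 180 degrees, i.e., inverted) amount to re-classifying each stimulus after a simple raster transform. A toy sketch in plain Python on a hypothetical 2x2 image:

```python
def rot90(img):
    """Rotate a 2D list (row-major raster) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def rot180(img):
    """Invert the image: two successive 90-degree rotations."""
    return rot90(rot90(img))

# Hypothetical 2x2 "image" of pixel values.
img = [[1, 2],
       [3, 4]]
rotated = rot90(img)
inverted = rot180(img)
```

Features are then re-extracted from the rotated rasters; a rotation-invariant model scores the same on `img`, `rotated`, and `inverted`.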
- Interestingly, LBP here shows a similar pattern to humans across the four stimulus categories (i.e., maximum for head, minimum for close body).
- Some models show higher similarity to human disruption over the four categories of the animal dataset: sparseSIFT, SSIM, and HOG.
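Disruption in this analysis is measured with the sensitivity index d' from signal detection theory. A minimal sketch in plain Python; the trial counts below are hypothetical (the dataset has 600 targets and 600 distractors, but these hit/false-alarm splits are made up for illustration):

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    return z(hit_rate) - z(fa_rate)

# Hypothetical animal/non-animal outcome counts for one condition.
sensitivity = d_prime(480, 120, 90, 510)
```

Comparing d' on original vs. rotated stimuli, per category, gives the disruption profile that is correlated with the human profile above.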
- On Caltech-256, HOG achieves the highest accuracy (about 33.28%), followed by SSIM, texton, and denseSIFT.
- GIST, which is specifically designed for scene categorization, achieves 27.4% accuracy, better than some models specialized for object recognition (e.g., HMAX).
- On sketch images, the shogSmooth model, specially designed for recognizing sketches, outperforms the others (acc = 57.2%). Texton histogram and SSIM ranked second and fourth, respectively. HMAX did very well (in contrast to Caltech-256), perhaps due to its success in capturing edges, corners, etc.
- Overall, models did much better on sketches than on natural objects (results are almost 2 times higher than on Caltech-256). Here, similar to Caltech-256, features relying on geometry (e.g., geo_map) did not perform well.
- Scenes were presented to subjects for 17-87 ms in a 6-alternative forced-choice task (human acc = 77.3%).
- On color images, geo_color, sparseSIFT, GIST, and SSIM showed the highest correlation (all with classification accuracy ≥ 75%), while tiny_image, texton, LBPHF, and LBP showed the least. Over the SUN dataset, HOG, denseSIFT, and texton showed high correlation with the human confusion matrix.
- It seems that models that take advantage of regional histograms of features (e.g., denseSIFT, GIST, geo_x; x = map or color) or heavily rely on edge histograms (texton and HOG) show higher correlation with humans on color images (although low in magnitude).
- Over line drawings: As with images, geo_color, SSIM, and sparseSIFT correlate well with humans. To our surprise, geo_color worked well.
Human vs. Computer in Scene and Object Recognition
Ali Borji and Laurent Itti {borji,itti}@usc.edu
University of Southern California, Computer Science, Psychology, and Neuroscience
Introduction & Motivation
Test 1: Scene Recognition
Test 2: Recognition of Line Drawings and Edge Maps
Line Drawings
Edge Maps
Test 3: Invariance Analysis
Test 4: Local vs. Global Information
Test 5: Object Recognition
1) Models outperform humans in rapid categorization tasks, indicating that discriminative information is in place but humans do not have enough time to extract it. Models outperform humans on jumbled images and score relatively high in the absence of (or with less) global information.
2) We find that some models and edge detection methods are more efficient on line drawings and edge maps. Our analysis helps objectively assess the power of edge detection algorithms to extract meaningful structural features for classification, which hints toward new directions.
3) While models are far from human performance on object and scene recognition over natural scenes, even classic models show high performance and correlation with humans on sketches.
4) Consistent with the literature, we find that some models (e.g., HOG, SSIM, geo_texton, and GIST) perform well. We find that they also resemble humans better.
5) Invariance analysis shows that only sparseSIFT and geo_color are invariant to in-plane rotation, with the former having higher accuracy (our 3rd test). GIST, a model of scene recognition, works better than many models over both the Caltech-256 and Sketch datasets.
Learned Lessons
Summary
Supported by the National Science Foundation, the General Motors Corporation, the Army Research Office, and Office of Naval Research
http://ilab.usc.edu/publications/doc/Borji_Itti14cvpr.pdf
[Figure: Bar plots of the Pearson correlation (0-0.7) between human and model confusion matrices, per model, over the 6-CAT dataset: line drawings (human acc = 66%) and color photographs (human acc = 77.3%).]

Human confusion matrix over line drawings (columns: true category; rows: response; columns sum to 1):

          beaches  city  forest  highway  mountain  office
beaches    0.38    0.01   0.07    0.02     0.07      0.01
city       0.12    0.69   0.05    0.13     0.03      0.12
forest     0.09    0.06   0.65    0.04     0.04      0.03
highway    0.13    0.13   0.06    0.70     0.04      0.06
mountain   0.21    0.04   0.12    0.04     0.78      0.02
office     0.06    0.08   0.06    0.06     0.03      0.76

Human confusion matrix over color photographs:

          beaches  city  forest  highway  mountain  office
beaches    0.59    0.02   0.03    0.01     0.06      0.01
city       0.07    0.77   0.03    0.09     0.04      0.10
forest     0.04    0.01   0.84    0.01     0.03      0.01
highway    0.12    0.16   0.03    0.82     0.04      0.07
mountain   0.15    0.01   0.06    0.04     0.82      0.01
office     0.03    0.03   0.01    0.02     0.01      0.80

Human-model agreement on the 6-CAT dataset. See our paper and its supplement for confusion matrices of models.
Geometric map: ground, porous, sky, and vertical regions.
[Figure: Edge maps for a sample image: LD, gPb, Canny, Log, Prewitt, Roberts, Sobel.]
[Figure: Per-model classification rates (%) for cross-domain generalization, with human accuracy marked. Top panel (CP): training on color photographs, testing on X (line drawings, gPb edge maps, flipped images). Bottom panel (LD): training on line drawings, testing on X (LD, gPb, Sobel, Log, Canny, Roberts, Prewitt edge maps).]

Top: training an SVM on color photographs and testing on line drawings, gPb edge maps, and inverted (flipped) images. Bottom: SVM trained on line drawings and applied to edge maps.
[Figure: Sample images from the scene and object datasets used in this study: 6-CAT (color photographs and line drawings), 15-CAT, SUN, jumbled images (Caltech, ISR, OSR), Animals (head, close body, medium body, far body), Caltech-256, and Sketches.]
[Figure: Per-model scene classification rates (%). A) 6-CAT: chance = 1/6 (16.67%), Naive Bayes chance = 18.88%. B) 15-CAT: chance = 1/15 (6.67%), Naive Bayes chance = 8.96%; 8-CAT: chance = 1/8 (12.50%), Naive Bayes chance = 15.25%; shown for original, scale-1/2, and scale-1/4 images. C) Top: animal vs. non-animal (chance 50%) over original, 90-degree rotated, and 180-degree rotated (inverted) images; bottom: four-way classification (head, close body, medium body, far body; chance 25%).]

A & B) Scene classification accuracy over the 6-, 8-, and 15-CAT datasets. Error bars represent the standard error of the mean over 10 runs. Naive Bayes chance is simply set by the size of the largest class. All models work well above chance level. C) Top: animal vs. non-animal (distractor images) classification. Bottom: classification of target images. The 4-way classification is only over target scenes (and not distractors).
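The Naive Bayes chance baseline mentioned in the caption is just the relative size of the largest class (the accuracy of always predicting it). A one-function sketch in plain Python; the class sizes below are hypothetical:

```python
def naive_bayes_chance(class_sizes):
    """Accuracy of always predicting the largest class (majority baseline)."""
    return max(class_sizes) / sum(class_sizes)

# Hypothetical unbalanced 6-class dataset.
baseline = naive_bayes_chance([120, 100, 90, 80, 60, 50])
```

For a perfectly balanced dataset this reduces to the uniform chance level 1/k.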
[Figure: Classification rate (%, log scale) on the SUN dataset vs. number of training samples per class (n = 1, 5, 10, 20, 50); chance level = 1/397 (0.25%). Legend accuracies (at n = 50): all [38.0], HOG [27.2], geo_texton [23.5], SSIM [22.5], denseSIFT [21.5], LBP [18.0], texton [17.6], GIST [16.3], all (1NN) [13.0], LBPHF [12.8], sparseSIFT [11.5], geo_color [9.1], color_hist [8.2], siagianItti07 [7.44], HMAX [6.98], geo_map8x8 [6.0], line_hist [5.7], tiny_image [5.5]. Right: Pearson correlation (0-0.3) with the human confusion matrix, ranked from tiny_image (lowest) to HOG (highest).]

Performances and correlations on the SUN dataset. We randomly chose n = {1, 5, 10, 20, 50} images per class for training and 50 for testing.
[Figure: Per-model scene classification rate (%) on edge maps from six edge detectors over the 6-CAT dataset. Average accuracy of the best-performing models per detector:

  Canny:   avg 5 best 0.882, avg 10 best 0.853
  Log:     avg 5 best 0.872, avg 10 best 0.842
  gPb:     avg 5 best 0.856, avg 10 best 0.832
  Sobel:   avg 5 best 0.849, avg 10 best 0.815
  Prewitt: avg 5 best 0.848, avg 10 best 0.813
  Roberts: avg 5 best 0.827, avg 10 best 0.794]

Scene classification results using edge-detected images over the 6-CAT dataset. The Canny edge detector leads to the best accuracies, followed by the Log and gPb methods.
[Figure: d' values (0-2.5) for each model and for humans over original, 90-degree rotated, and 180-degree rotated (inverted) animal images, broken down by stimulus category (head, close body, medium body, far body).]

d' values over original, 90-degree, and 180-degree rotated animal images.
[Figure: Per-model classification rate (%) and correlation with humans over jumbled images, for the Caltech (CAL), ISR, and OSR datasets. Human accuracy: CAL = 65.1, ISR = 56.1, OSR = 49.7.]

Correlation and classification accuracy over jumbled images.
[Figure, left: Recognition rate (%) on Caltech-256 vs. number of training samples per class (1, 5, 10, 20, 50); chance level = 1/256 (0.39%). Legend accuracies: HOG [33.28], HOG_pyr [32.66], SSIM [30.18], texton [29.86], denseSIFT [29.42], denseSIFT_pyr [28.38], texton_pyr [27.80], GIST [27.38], gistPadding [25.09], SSIM_pyr [25.01], LBP [20.66], LBP_pyr [20.54], sparseSIFT [20.38], geo_texton [20.26], LBPHF_pyr [17.76], LBPHF [17.62], siagianItti07 [16.46], tiny_image [12.96], HMAX [12.02], geo_color [8.14], line_hist [6.54], geo_map8x8 [5.28].

Figure, right: Recognition rate (%) on the Sketch dataset (chance level = 1/250, 0.4%) and correlation with humans (0-0.4). Legend accuracies: HMAX [53.03], denseSIFT [7.46], denseSIFT_pyr [41.84], geo_color [1.78], geo_map8x8 [29.25], geo_texton [22.21], GIST [51.97], gistPadding [51.74], HOG [20.68], HOG_pyr [50.69], LBP [12.08], LBP_pyr [47.15], LBPHF [9.42], LBPHF_pyr [42.02], line_hist [14.62], sparseSIFT [23.62], SSIM [26.48], SSIM_pyr [54.45], texton [22.29], texton_pyr [55.28], tiny_image [25.46], shogSmooth [57.20], shogHard [54.70].]

Left: Object recognition performance on the Caltech-256 dataset. Right: Recognition rate and correlations on the Sketch dataset.
Model            SUN    Caltech-256  Sketch  Animal/Non-Anim.  Similarity rank
siagianItti07    7.43   16.5         -       -                 13.6
HMAX             7      12           55      75.8              13.6
denseSIFT        21.5   29.4         7.6     84.4              9.3
denseSIFT_pyr    -      28.4         43.4    83.6              12.6
geo_color        9.14   4.9          1.68    73.7              10.2
geo_map8x8       6.02   5.3          30.6    72.5              13.8
geo_texton       23.5   20.3         23.4    78.8              8.4
GIST             16.3   27.4         53.7    81.5              11.2
gistPadding      13.7   25.1         53.6    81                12.4
HOG              27.2   33.3         21.2    84                5.6
HOG_pyr          -      32.7         52.3    84.2              9.2
LBP              18.0   20.7         12.8    83.1              10.0
LBP_pyr          -      20.5         48.9    85.7              10.2
LBPHF            12.8   17.6         9.6     83.1              11.7
LBPHF_pyr        -      17.8         43.3    85.8              10.0
line_hist        5.7    6.54         15.1    74.5              13.0
sparseSIFT       11.5   20.4         24.9    80.7              11.8
SSIM             22.5   30.2         27.5    84.9              9.2
SSIM_pyr         -      25.0         56.2    84.7              10.8
texton           17.6   29.9         23.1    78.3              9.6
texton_pyr       -      27.8         56.9    78.6              9.2
tiny_image       5.54   13           27.2    65                18.9

Classification results corresponding to 50 training images per class and, for testing, 50 images per class over SUN and the remaining images over Caltech-256 and Sketch. Animal vs. non-Animal corresponds to classification of 600 target vs. 600 distractor images. The top three models on each dataset are highlighted in red.