
Feed-forward deep neural networks diverge from primates on core visual object recognition behavior

Rishi Rajalingham*, Elias B. Issa¹*, Kailyn Schmidt, Kohitij Kar, Pouya Bashivan, and James J. DiCarlo

Dept. of Brain and Cognitive Sciences, McGovern Institute for Brain Research, MIT
¹Department of Neuroscience, Zuckerman Mind Brain Behavior Institute, Columbia University

[Figure: Consistency to Human Pool (O2) for V1, ALEXNET, VGG, GOOGLENET, RESNET, INCEPTION-v3, and NYU, with Human Pool, Monkey Pool (n=5), and Inception-v3 reference lines; Primate Zone shaded.]

[Figure: Consistency to Human Pool (O1) for the same models; Primate Zone delimited by Monkey Pool (n=5 subjects) and Heldout Human Pool (n=4 subjects).]

[Figure: Label probability (0–1) for an example image, compared across Human, Monkey, and HCNN (Inception-v3).]

[Figure: O2 (performance per object for each distracter; confusion matrix) for Human Pool, Monkey Pool (n=5), and Inception-v3.]

[Figure: Consistency to Human Pool (I1c) for V1, ALEXNET, VGG, GOOGLENET, RESNET, INCEPTION-v3, and NYU; Primate Zone delimited by Monkey Pool (n=5 subjects) and Heldout Human Pool (n=4 subjects).]

[Figure: I1c (performance per image across distracters): HCNN performance (Δd’) and Monkey performance (Δd’) plotted against Human performance (Δd’); n=240 images.]

Introduction

Humans can rapidly and accurately recognize objects in spite of variation in viewing parameters (e.g. position, pose, and size) and background conditions.

To understand this invariant object recognition ability, one can construct and test models that aim to accurately capture human behavior, including making the same patterns of errors over all images. Here, we applied this straightforward visual “Turing test” to the leading feed-forward computational models of human vision (hierarchical convolutional neural networks, HCNNs) and to a leading animal model (rhesus macaques).

Our results reveal a failure of current HCNNs to fully replicate the image-level behavioral patterns of primates.

Methods

Behavioral paradigm: We tested object recognition performance on a two-alternative forced-choice (2AFC) match-to-sample task using brief (100ms) presentations of naturalistic synthetic images of basic-level objects.

[Figure: “MonkeyTurk” home-cage rig: Android tablet, juice reward tube, Arduino controller.]

High-throughput psychophysics: Over a million behavioral trials across hundreds of humans and five monkeys were aggregated to characterize each species on each image. To collect such large behavioral datasets, we used Amazon Mechanical Turk for humans and a novel home-cage touchscreen system for monkeys.

Stimuli: To enforce invariant object recognition behavior, we generated several thousand naturalistic synthetic images (2400 images, 24 objects, 100 images/object), each with one foreground object (by rendering a 3D model of each object with random viewing parameters) on a random natural image background.
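As an illustration of this generative recipe, the sketch below draws one random set of viewing parameters per image; the parameter ranges and the background-indexing scheme are hypothetical placeholders, not the exact settings used for these stimuli.

```python
import numpy as np

def sample_view(rng):
    """Draw hypothetical viewing parameters for one synthetic image:
    a random position, 3D pose, and size for the foreground object,
    plus an index into a set of natural-image backgrounds."""
    return {
        "position_xy": rng.uniform(-2.0, 2.0, size=2),    # horizontal/vertical shift
        "rotation_xyz": rng.uniform(0.0, 360.0, size=3),  # pose, in degrees
        "scale": rng.uniform(0.7, 1.7),                   # object size
        "background": int(rng.integers(0, 10_000)),       # natural-image index
    }

# e.g. 100 parameter draws for one object:
# views = [sample_view(np.random.default_rng(i)) for i in range(100)]
```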


[Figure: Example stimuli, ordered from HARD to EASY.]

[Figure: Trial structure: Fixation (100ms), Test Image (100ms), Choice Screen (1000ms), run on Amazon Mechanical Turk (humans) and “MonkeyTurk” (monkeys).]

[Figure: The 24 basic-level objects: Tank, Bird, Shorts, Elephant, House, Hanger, Pen, Hammer, Zebra, Fork, Guitar, Bear, Watch, Dog, Spider, Truck, Wrench, Rhino, Leg, Camel, Table, Calculator, Gun, Knife.]


Results

[Figure: I2c (performance per image for each distracter; confusion matrix): consistency to Human Pool for V1, ALEXNET, VGG, GOOGLENET, RESNET, INCEPTION-v3, and NYU, with Human Pool, Monkey Pool (n=5), and Inception-v3 reference lines; Primate Zone shaded.]


Behavioral metrics: We characterized the core object recognition behavior of humans, macaques and HCNNs over a large set of images and distracter objects using several behavioral metrics. Each behavioral metric computes a pattern of unbiased behavioral performances, using a sensitivity index (d’), at the resolution of objects (O1, O2) and images (I1, I2). To isolate image-driven behavioral variance that is object independent, we normalized image-level behavioral metrics by subtracting the mean performance values over all images of the same object (I1c, I2c).
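As a concrete sketch of these metrics (a minimal illustration using the standard signal-detection formula; the exact trial bookkeeping behind the poster's numbers may differ), the snippet below computes d’ from hit and false-alarm rates and converts a per-image pattern (I1) into its object-normalized form (I1c).

```python
import numpy as np
from scipy.stats import norm

def dprime(hit_rate, false_alarm_rate, clip=(0.01, 0.99)):
    """Sensitivity index d' = Z(HR) - Z(FAR); rates are clipped away
    from 0 and 1 so the inverse-normal transform stays finite."""
    hr = np.clip(hit_rate, *clip)
    far = np.clip(false_alarm_rate, *clip)
    return norm.ppf(hr) - norm.ppf(far)

def i1c(image_dprimes, image_objects):
    """Object-normalized image-level pattern (I1c): subtract from each
    image's d' the mean d' over all images of the same object, leaving
    only object-independent, image-driven variance."""
    out = np.asarray(image_dprimes, dtype=float).copy()
    objects = np.asarray(image_objects)
    for obj in np.unique(objects):
        mask = objects == obj
        out[mask] -= out[mask].mean()
    return out

# e.g. normalized = i1c(dprime(hr, far), object_labels)
# for 2400 images spanning 24 objects.
```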

Each object can produce myriad images under variations in viewing parameters, with some images being more challenging than others.

Consistent with previous work, macaque monkeys and all tested HCNN models are highly consistent with humans with respect to object-level metrics (O1, O2), while a lower level representation (V1) is less consistent.

[Figure: Normalized image-level performance (Δd’) for Monkey Pool and Human Pool as a function of eccentricity, size, pose difference, mean luminance, BG spatial frequency, and segmentation index.]

[Figure: Regression weights for eccentricity, size, pose difference, mean luminance, segmentation index, and BG spatial frequency.]

All tested HCNN models fail to accurately capture the image-level behavioral patterns of primates.

In contrast to the corresponding object-level metric (O1), all tested HCNNs diverge from primates with respect to the one-versus-all image-level behavioral pattern (I1c).

Zooming in on one example image of a pen, humans and monkeys make similar confusions (choices in object labels), while the HCNN model differs significantly.

Example image with corresponding behavioral patterns

(Left, top row) For each behavioral metric, we defined a “primate zone” as the range of consistency values delimited by extrapolated estimates of the consistency to the human pool of an infinitely large pool of rhesus macaque monkeys and of humans, respectively.

Thus, the primate zone defines a range of consistency values corresponding to models that capture the behavior of the human pool at least as accurately as a pool of monkeys does.
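One simple way to obtain such an extrapolated estimate is to measure consistency for pools of increasing size and extrapolate to infinitely many subjects. The sketch below assumes a hypothetical first-order form, c(n) ≈ c_inf − a/n; the functional form actually used for the poster may differ.

```python
import numpy as np

def extrapolate_consistency(pool_sizes, consistencies):
    """Estimate the infinite-pool asymptote c_inf, assuming the
    hypothetical form c(n) ~ c_inf - a/n, which is linear in 1/n;
    the intercept of a least-squares line in 1/n is the asymptote."""
    x = 1.0 / np.asarray(pool_sizes, dtype=float)
    y = np.asarray(consistencies, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    return intercept  # estimated consistency as n -> infinity
```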

(Left) Consistency at the image level could not be easily rescued by pooling together several “model subjects” (top panels) or by simple modifications (bottom panels), such as primate-like retinal input sampling or additional model training on synthetic images generated in the same manner as our test set.

(Right) Additionally, the gap in consistency could not be attributed to simple image attributes (e.g. viewpoint meta-parameters or low-level image characteristics), as sketched below.
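A minimal version of this control analysis (assuming per-image residuals, e.g. primate minus model normalized performance, and one column per attribute; the variable names are illustrative) regresses the residuals on z-scored attributes; near-zero weights indicate the gap is not explained by those attributes.

```python
import numpy as np

def attribute_regression(residuals, attributes):
    """Ordinary least-squares weights of z-scored image attributes
    (eccentricity, size, pose difference, mean luminance, BG spatial
    frequency, segmentation index) predicting per-image residuals."""
    A = np.asarray(attributes, dtype=float)
    Z = (A - A.mean(axis=0)) / A.std(axis=0)   # z-score each attribute column
    X = np.column_stack([np.ones(len(Z)), Z])  # prepend an intercept column
    beta, *_ = np.linalg.lstsq(X, residuals, rcond=None)
    return beta[1:]                            # attribute weights, intercept dropped
```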

[Figure: Consistency to Human Pool vs. # subjects in pool, for the object-level metric (O2) and the image-level metric (I2c): Monkey Pool vs. HCNN Pool; Primate Zone shaded.]

[Figure: Consistency to Human Pool vs. model performance (d’) for Inception-v3, early HCNN layers, retina-sampled HCNN, and fine-tuned HCNN; Primate Zone shaded.]

The difference in image-level behavioral pattern between HCNNs and primates is not strongly dependent on image attributes.

Conclusions

• In recent years, specific HCNN models, optimized to match human-level object recognition performance, have proven to be a dramatic advance in the state of the art of artificial vision systems.

• HCNN models match primates in their core object recognition behavior with respect to object-level difficulty and object-level confusion patterns.

• HCNN models fail to match primate behavior in image-level difficulty and in image-level confusion patterns.

• Attempts to rescue the image-level gap in consistency, including adding retinal sampling on the model front-end, adding various decoders on the model back-end, and supplementing model training with similar images, did not reduce the gap between models and primates.

• Furthermore, no particular property of the image (luminance, spatial frequency) or of the object in the image (size, pose) was strongly correlated with model failure to capture primate image-level performance.

Taken together, these results suggest that high-resolution behavior could serve as a strong constraint for discovering models that more precisely capture the neural mechanisms underlying primate object recognition.

To measure the behavioral consistency between two systems with respect to a given behavioral metric, we computed a noise-adjusted correlation, which accounts for the reliability (i.e. internal consistency) of behavioral measurements.
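A minimal sketch of this noise adjustment, assuming split-half reliabilities with a Spearman–Brown correction (one common estimator; the poster does not specify the exact procedure):

```python
import numpy as np
from scipy.stats import pearsonr

def split_half_reliability(trials, n_splits=100, seed=0):
    """Internal consistency of a per-condition score: repeatedly split
    the (n_repetitions, n_conditions) trial array in half along the
    repetition axis, correlate the two halves' condition means, and
    apply the Spearman-Brown correction for full-length reliability."""
    rng = np.random.default_rng(seed)
    n_reps = trials.shape[0]
    rs = []
    for _ in range(n_splits):
        perm = rng.permutation(n_reps)
        a = trials[perm[: n_reps // 2]].mean(axis=0)
        b = trials[perm[n_reps // 2:]].mean(axis=0)
        r, _ = pearsonr(a, b)
        rs.append(2 * r / (1 + r))  # Spearman-Brown: reliability of the full data
    return float(np.mean(rs))

def noise_adjusted_consistency(pattern_1, pattern_2, rel_1, rel_2):
    """Correlation between two behavioral patterns, divided by the
    geometric mean of their internal reliabilities (the noise ceiling)."""
    r, _ = pearsonr(pattern_1, pattern_2)
    return r / np.sqrt(rel_1 * rel_2)
```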


In contrast to the corresponding object-level metric (O2), all tested HCNNs diverge from primates with respect to the one-versus-other image-level behavioral pattern (I2c).

[Figure: O1 (performance per object across distracters): per-object d’ for the 24 objects.]