Same Data, Same Features: Modern ImageNet-Trained CNNs Learn the Same Thing
David McNeely-White
Colorado State University
29 April 2020
CNNs are dominant in modern Computer Vision
• Object detection, localization, segmentation, pose estimation, gesture recognition, etc.
• Many architectures exist, offering incremental improvements in recognition
• Here is a basic example …
By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374
Features
ImageNet and ILSVRC2012
• ImageNet
• Project for collecting and labelling images using WordNet
• ~14 million images
• ~22k hierarchical categories
• ILSVRC2012
• 2012 instance of annual challenge
• 1.3 million images
• 1k classes (120 dog breeds)
• Has drawn continuous attention since AlexNet
“Cheeseburger” “Appenzeller” “Entlebucher” “Cardoon”
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Major effort is invested in CNN design
• Often a small error reduction is enough cause for publication.
• Each advance may involve major changes to design.
• How do we know whether these CNNs are learning different but equally expressive features, or encoding equivalent features?
• I sought to compare CNN architectures.
By Atlas ML, CC BY-SA 4.0, https://paperswithcode.com/sota/image-classification-on-imagenet
A Way to Compare CNNs: Performance
• The most common method is black box comparison
• CNNs often consist of millions of parameters, trained on millions of samples
• For better or worse, black box methods hide complexity
• E.g., when two systems perform similarly, it’s not clear whether they’re learning equally discriminative features or the same features
By Atlas ML, CC BY-SA 4.0, https://paperswithcode.com/sota/image-classification-on-imagenet
Another Way to Compare: Visualize Responses
• CNNs often deal with visual data
• Pick an activation (or many) and correlate with examples
• As with earlier V1 experiments on primates
• Alternatively, pick an activation and an example, and alter the example to produce a saliency map
• Many variations exist
• Extrinsic analysis
[top] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[overlay] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.
[bottom] By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374
Visualization via example
Another Way to Compare: Visualize Responses
• Additionally, optimization-based techniques can reveal features
• Synthetic images are generated which maximize activation for a neuron, channel, layer, or class.
• Essentially, random noise is iteratively tweaked until patterns emerge.
• Intrinsic analysis
• Still, this doesn’t clearly show whether two CNNs are functionally equivalent or not
Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.
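The optimization idea above can be sketched in a toy setting. This is a minimal, hypothetical stand-in: a single linear "neuron" a(x) = w·x instead of a CNN activation, so the gradient with respect to the input is simply w. Real feature visualization backpropagates through the network, but the iterative "tweak noise until a pattern emerges" loop looks the same.

```python
import numpy as np

# Toy activation maximization for a hypothetical linear neuron a(x) = w @ x.
# Real feature visualization ascends the gradient of a CNN activation w.r.t.
# the input image; here the gradient of (w @ x) w.r.t. x is simply w.
rng = np.random.default_rng(0)
w = rng.normal(size=64)           # hypothetical neuron weights
x = rng.normal(size=64) * 0.01    # start from small random noise

for _ in range(200):
    grad = w                      # d(w @ x)/dx
    x = x + 0.1 * grad            # gradient ascent step
    x = x / np.linalg.norm(x)     # keep the "image" bounded

# After optimization, x aligns with the neuron's preferred pattern.
cosine = x @ w / (np.linalg.norm(x) * np.linalg.norm(w))
```

The synthetic input converges to the direction the neuron responds to most strongly, which is the intuition behind the Distill visualizations cited above.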
My Comparison: Are Features Related?
• Now emphasizing dataset, feature extractor, and classifier
[Diagram: ILSVRC2012 images → Feature Extractor (CNN) → Features → Classifier → Prediction (airplane, dog, spoon, etc.)]
Others finding linear relationships
• Lenc and Vedaldi, May 2019
Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. International Journal of Computer Vision, 127(5):456–476, May 2019.
“Franken-CNN”
Others finding linear relationships
Others finding linear relationships
• Lenc and Vedaldi, May 2019
• They use ground truth labels
• They map within feature extractors, requiring interpolation
• Their emphasis was similar, focusing on interchangeability of image representations
• Nonetheless, a similar and important work
• My work
• Use unsupervised training for mappings
• Use final, pooled convolutional layer’s features
• Use many, varying architectures
CNNs I Study
• I sought to find whether different CNNs are different
• First, I need different CNNs; these are:
• ResNet-v1, 152-layer
• ResNet-v2, 152-layer
• Inception-v1 (aka GoogLeNet)
• Inception-v2
• Inception-v3
• Inception-v4
• Inception-ResNet-v2
• MobileNet-v2-1.4
• NASNet-A
• PNASNet-5
ResNet
• “Shortcut connections”
• (un)activated, unparameterized activation pass-through
• Still a popular architectural feature
• Single-scale processing
• Great depth (demonstrated up to 1k layers)
• This simple feature is present in nearly all modern, high-performance CNNs
• DenseNet
• Inception-ResNet
• ResNeXt
• EfficientNet
ResNet-v1 Block
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
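The shortcut connection can be sketched in a few lines. This is a toy stand-in, assuming a two-layer dense transform F in place of the real conv-BN-ReLU stacks; the essential structure, output = input + residual, is the same.

```python
import numpy as np

# Minimal sketch of a ResNet-v1-style block: the output is the input plus
# a learned residual, y = x + F(x), with an unparameterized identity shortcut.
# W1/W2 are hypothetical stand-ins for the block's convolutional weights.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 16)) * 0.1
W2 = rng.normal(size=(16, 16)) * 0.1

def residual_block(x):
    f = np.maximum(W1 @ x, 0.0)   # "conv" + ReLU (toy stand-in)
    f = W2 @ f                    # second "conv"
    return x + f                  # identity shortcut: pass input through

x = rng.normal(size=16)
y = residual_block(x)
```

If the residual branch outputs zeros, the block reduces to the identity, which is what lets gradients flow through very deep stacks (demonstrated up to 1k layers in the paper).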
Inception
• Focus development on efficient, expressive “modules”
• Modules divide processing by scale, concatenate the results
• Many revisions and optimizations
• 1x1 convolutions and pooling for dimension-reduction
• Factorize NxN convolutions into 1xN and Nx1
• Batch normalization
• (Later) shortcut connections
• Great breadth
Inception-v1 Module
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
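A shape-level sketch of the module idea: parallel branches process the same input at different scales, and the results are concatenated along the channel axis. The `conv` helper and the channel counts below are illustrative placeholders, not an implementation of the actual convolutions.

```python
import numpy as np

# Inception-v1 module, shapes only: several branches see the same H x W x C
# feature map; their outputs are concatenated channel-wise.
x = np.zeros((28, 28, 192))                 # H x W x C input feature map

def conv(t, out_channels):                  # stand-in for a same-padded conv
    return np.zeros(t.shape[:2] + (out_channels,))

b1 = conv(x, 64)                            # 1x1 branch
b2 = conv(conv(x, 96), 128)                 # 1x1 dimension-reduction, then 3x3
b3 = conv(conv(x, 16), 32)                  # 1x1 dimension-reduction, then 5x5
b4 = conv(x, 32)                            # pooling branch + 1x1 projection

out = np.concatenate([b1, b2, b3, b4], axis=-1)   # channels: 64+128+32+32
```

The 1x1 reductions before the larger filters are what keep the module cheap despite its breadth.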
MobileNet
• Depthwise-separable convolutions
• Split filtering by channel
• Use 1x1 convolutions to combine
• Linear bottleneck layers
• Authors state non-linear activations are lossy
• Shortcut connections
• Performance-tunable
• Low-cost
MobileNet-v2 Layer
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
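The "low-cost" claim can be checked by counting parameters. A standard KxK convolution mixes space and channels at once; splitting it into a per-channel KxK depthwise filter plus a 1x1 pointwise combination is much cheaper. The layer sizes below are illustrative, not taken from the paper.

```python
# Parameter-count sketch for depthwise-separable convolutions.
k, c_in, c_out = 3, 144, 144                # hypothetical layer sizes

standard = k * k * c_in * c_out             # full KxK convolution
separable = k * k * c_in + c_in * c_out     # depthwise KxK + pointwise 1x1

ratio = standard / separable                # roughly k*k for large channel counts
```

For 3x3 filters the saving approaches 9x as the channel count grows, which is where MobileNet's performance tunability comes from.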
NASNet and PNASNet
• Use search strategies to find the best “cell”
• Define search space
• Search using small dataset (CIFAR-10) and small architecture
• Evaluate using large dataset (ILSVRC2012) and large architecture
• Complex but expressive
PNASNet-5 Cell | PNASNet ImageNet architecture
Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
[Diagram: two identity pipelines.
A → A: Image x → F_A → CNN A features → I → C_A → Softmax → Labels.
B → B: Image x → F_B → CNN B features → I → C_B → Softmax → Labels.]
Feature Mappings: Start with Identity
• Start: Images feed the feature extractor (F), producing features
• End: Features feed the classifier (C), producing predictions (labels)
• Notice the blue rectangle, currently just the identity mapping
• Thus far, a pointless extra step. But, next slide …
[Diagram: four pipelines.
A → A: Image x → F_A → CNN A features → I → C_A → Softmax → Labels.
B → B: Image x → F_B → CNN B features → I → C_B → Softmax → Labels.
A → B: Image x → F_A → CNN A features → M_{A→B} → CNN B features (mapped) → C_B → Softmax → Labels.
B → A: Image x → F_B → CNN B features → M_{B→A} → CNN A features (mapped) → C_A → Softmax → Labels.]
Affine Mapping Between CNN Feature Spaces
To be Clear, All We Mean by Affine is …
y = Mx + b
[Diagram: A → B pipeline. Image x → F_A → CNN A features → M_{A→B} → CNN B features (mapped) → C_B → Softmax → Labels → Prediction]
Solving for the Affine Mapping
• Paired inputs and outputs problem
• Input is a feature vector from CNN A for image x_i
• Output is a feature vector from CNN B for image x_i
• Produce pairs for all 1.3 million ILSVRC2012 images
• Perform ordinary least squares regression to produce the affine map
• Training labels are never used (else we would be retraining classifiers)
• This process is repeated for all 90 unique pairs of CNNs
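The fitting step above can be sketched with ordinary least squares. This is a toy sketch: `feats_a`/`feats_b` are synthetic stand-ins for the pooled final-layer features the two CNNs would produce for the same images (the real matrices would be ~1.3M rows by ~1500–2000 columns), and the "true" map is planted so the recovery is exact.

```python
import numpy as np

# Fit the affine map M_{A->B} from paired feature vectors, no labels needed.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(1000, 32))        # CNN A features, one row per image
true_map = rng.normal(size=(32, 16))
feats_b = feats_a @ true_map + 0.5           # CNN B features (toy: exactly affine in A)

# Append a ones column so the least-squares solution includes a bias term,
# making the fitted map affine rather than purely linear.
A = np.hstack([feats_a, np.ones((len(feats_a), 1))])
M, *_ = np.linalg.lstsq(A, feats_b, rcond=None)

mapped = A @ M                               # A's features expressed in B's space
```

In practice the mapped features are only approximately right, and their quality is measured by feeding them through CNN B's classifier, as in the results below.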
[Diagram: four pipelines.
R → R: Image x → F_R → ResNet features → I → C_R → Softmax → Labels.
I → I: Image x → F_I → Inception features → I → C_I → Softmax → Labels.
R → I: Image x → F_R → ResNet features → M_{R→I} → Inception features (mapped) → C_I → Softmax → Labels.
I → R: Image x → F_I → Inception features → M_{I→R} → ResNet features (mapped) → C_R → Softmax → Labels.]
# correct / 50k:
• R → R: 39,350 (78.70%)
• I → I: 40,195 (80.39%)
• R → I: 38,175 (76.35%)
• I → R: 39,785 (79.57%)
Example Results (ResNet-v2 & Inception-v4)
Comparing 10 well-known CNNs
All Results
But wait!
• Classifiers are all linear
• Classifiers are full-rank
• Thus, converting features to logits is fully reversible, lossless
• Mapping features via classifiers is perfect
• Still, this doesn’t mean the previous result is trivial
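The reversibility claim can be sketched directly. Assuming a full-rank linear classifier whose feature dimension does not exceed the number of classes' logit dimension, the pseudo-inverse recovers the features exactly; the sizes below are toy stand-ins for the real ~1500–2048 features and 1000 logits.

```python
import numpy as np

# A full-(column-)rank linear classifier W maps features to logits losslessly:
# pinv(W) @ W is the identity, so features can be recovered from logits.
rng = np.random.default_rng(0)
d, n_classes = 8, 10                   # toy sizes
W = rng.normal(size=(n_classes, d))    # classifier weights (full column rank a.s.)

x = rng.normal(size=d)                 # a feature vector
logits = W @ x                         # forward through the classifier
x_back = np.linalg.pinv(W) @ logits    # invert the classifier
```

So a perfect feature map could be built by composing one classifier with the pseudo-inverse of another; the interesting finding is that a single affine map fit without labels does nearly as well.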
Significance
• Equivalence was sought to better understand CNN architectures
• This work has demonstrated near-affine equivalence for 10 popular ImageNet CNNs
• Despite variations in architecture, pedigree, and performance
• Presumably, each network is a distinct combination of non-linear functions, yet a linear relationship exists.
• This suggests further advances in CNN performance may not rely on architecture
• Indeed, the last two advances in ImageNet performance involve pre-published nets with novel training
• Regardless, more work is necessary to fully understand these relationships.
Future Work
• Characterize mappings
• Other experiments suggest redundancy in number of dimensions
• Applying this “canonical space”
• Perhaps there is a limit to CNN architecture expressivity!
• Testing different domains
• Face identification is promising!
• Dong et al., Oct 2019 – linearity between feature spaces
• CNNs trained for face identification
• Linearity found between CNN feature space and hand-built feature space
• They conclude this would be difficult for general object detection/classification
• Testing different training paradigms
• Do modern data augmentation techniques inhibit this equivalence?
• How about extra training data?
Comparison with Transfer Learning
• Map features across domains
• Use labels to train map / fine-tune
• Map features across CNNs
• Use feature vectors to train map
Diving Deeper
• Take all 50k validation samples
• Divide into 3 categories
• Labeled correctly by ResNet
• Labeled correctly by Inception
• Labeled correctly by I → R
• Most samples (72.6%) labeled correctly by all 3
• ResNet missed 11k (22%)
• I →R corrected 3k (5.9%)
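The three-way breakdown above amounts to intersecting boolean masks over the validation set. The accuracy rates below are random placeholders standing in for the real per-sample correctness arrays, not the reported numbers.

```python
import numpy as np

# Boolean masks over the 50k validation samples: which pipelines got each right.
rng = np.random.default_rng(0)
n = 50_000
resnet_ok = rng.random(n) < 0.80       # ResNet labeled correctly (placeholder rate)
incep_ok = rng.random(n) < 0.80        # Inception labeled correctly (placeholder)
mapped_ok = rng.random(n) < 0.78       # I -> R (mapped) labeled correctly (placeholder)

all_three = resnet_ok & incep_ok & mapped_ok    # correct by all 3 pipelines
resnet_missed = ~resnet_ok                      # ResNet's misses
corrected = resnet_missed & mapped_ok           # misses the mapped pipeline fixes
```

With the real masks, `all_three` covers 72.6% of samples, `resnet_missed` about 11k, and `corrected` about 3k, as reported on the slide.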
Appendix – R to I Venn Diagram
Appendix – Affine Illustration