Same Data, Same Features: Modern ImageNet-Trained CNNs Learn the Same Thing
David McNeely-White
Colorado State University
29 April 2020
CNNs are dominant in modern Computer Vision
• Object detection, localization, segmentation, pose estimation, gesture recognition, etc.
• Many architectures exist, offering incremental improvements in recognition
• Here is a basic example …
By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374
Features
ImageNet and ILSVRC2012
• ImageNet
• Project for collecting and labelling images using WordNet
• ~14 million images
• ~22k hierarchical categories
• ILSVRC2012
• 2012 instance of annual challenge
• 1.3 million images
• 1k classes (120 dog breeds)
• Has drawn continuous attention since AlexNet
“Cheeseburger” “Appenzeller” “Entlebucher” “Cardoon”
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Major effort is invested in CNN design
• Often a small error reduction is enough cause for publication.
• Each advance may involve major changes to design.
• How do we know whether these CNNs are learning different but equally expressive features, or encoding equivalent features?
• I sought to compare CNN architectures.
By Atlas ML, CC BY-SA 4.0, https://paperswithcode.com/sota/image-classification-on-imagenet
A Way to Compare CNNs: Performance
• The most common method is black box comparison
• CNNs often consist of millions of parameters, trained on millions of samples
• For better or worse, black box methods hide complexity
• E.g., when two systems perform similarly, it’s not clear whether they’re learning equally discriminative features or the same features
By Atlas ML, CC BY-SA 4.0, https://paperswithcode.com/sota/image-classification-on-imagenet
Another Way to Compare: Visualize Responses
• CNNs often deal with visual data
• Pick an activation (or many) and correlate with examples
• As with earlier V1 experiments on primates
• Alternatively, pick an activation and an example, and alter the example to produce a saliency map
• Many variations exist
• Extrinsic analysis
[top] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[overlay] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.
[bottom] By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374
Visualization via example
Another Way to Compare: Visualize Responses
• Additionally, optimization-based techniques can reveal features
• Synthetic images are generated which maximize activation for a neuron, channel, layer, or class.
• Essentially, random noise is iteratively tweaked until patterns emerge.
• Intrinsic analysis
• Still, this doesn’t clearly show whether two CNNs are functionally equivalent or not
Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.
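The optimization idea above can be sketched in a toy setting. This is a minimal, hypothetical stand-in: a single linear "neuron" a(x) = w·x instead of a CNN activation, so the gradient with respect to the input is simply w. Real feature visualization backpropagates through the network, but the iterative "tweak noise until a pattern emerges" loop looks the same.

```python
import numpy as np

# Toy activation maximization for a hypothetical linear neuron a(x) = w @ x.
# Real feature visualization ascends the gradient of a CNN activation w.r.t.
# the input image; here the gradient of (w @ x) w.r.t. x is simply w.
rng = np.random.default_rng(0)
w = rng.normal(size=64)           # hypothetical neuron weights
x = rng.normal(size=64) * 0.01    # start from small random noise

for _ in range(200):
    grad = w                      # d(w @ x)/dx
    x = x + 0.1 * grad            # gradient ascent step
    x = x / np.linalg.norm(x)     # keep the "image" bounded

# After optimization, x aligns with the neuron's preferred pattern.
cosine = x @ w / (np.linalg.norm(x) * np.linalg.norm(w))
```

The synthetic input converges to the direction the neuron responds to most strongly, which is the intuition behind the Distill visualizations cited above.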
My Comparison: Are Features Related?
• Now emphasizing dataset, feature extractor, and classifier
[Diagram: ILSVRC2012 images → Feature Extractor (CNN) → Features → Classifier → Prediction (airplane, dog, spoon, etc.)]
Others finding linear relationships
• Lenc and Vedaldi, May 2019
Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. International Journal of Computer Vision, 127(5):456–476, May 2019.
“Franken-CNN”
Others finding linear relationships
Others finding linear relationships
• Lenc and Vedaldi, May 2019
• They use ground truth labels
• They map within feature extractors, requiring interpolation
• Their emphasis was similar, focusing on interchangeability of image representations
• Nonetheless, a similar and important work
• My work
• Use unsupervised training for mappings
• Use final, pooled convolutional layer’s features
• Use many, varying architectures
CNNs I Study
• I sought to find whether different CNNs are different
• First, I need different CNNs; these are:
• ResNet-v1, 152-layer
• ResNet-v2, 152-layer
• Inception-v1 (aka GoogLeNet)
• Inception-v2
• Inception-v3
• Inception-v4
• Inception-ResNet-v2
• MobileNet-v2-1.4
• NASNet-A
• PNASNet-5
ResNet
• “Shortcut connections”
• (un)activated, unparameterized activation pass-through
• Still a popular architectural feature
• Single-scale processing
• Great depth (demonstrated up to 1k layers)
• This simple feature is present in nearly all modern, high-performance CNNs
• DenseNet
• Inception-ResNet
• ResNeXt
• EfficientNet
ResNet-v1 Block
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
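The shortcut connection can be sketched in a few lines. This is a toy stand-in, assuming a two-layer dense transform F in place of the real conv-BN-ReLU stacks; the essential structure, output = input + residual, is the same.

```python
import numpy as np

# Minimal sketch of a ResNet-v1-style block: the output is the input plus
# a learned residual, y = x + F(x), with an unparameterized identity shortcut.
# W1/W2 are hypothetical stand-ins for the block's convolutional weights.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 16)) * 0.1
W2 = rng.normal(size=(16, 16)) * 0.1

def residual_block(x):
    f = np.maximum(W1 @ x, 0.0)   # "conv" + ReLU (toy stand-in)
    f = W2 @ f                    # second "conv"
    return x + f                  # identity shortcut: pass input through

x = rng.normal(size=16)
y = residual_block(x)
```

If the residual branch outputs zeros, the block reduces to the identity, which is what lets gradients flow through very deep stacks (demonstrated up to 1k layers in the paper).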
Inception
• Focus development on efficient, expressive “modules”
• Modules divide processing by scale, concatenate the results
• Many revisions and optimizations
• 1x1 convolutions and pooling for dimension-reduction
• Factorize NxN convolutions into 1xN and Nx1
• Batch normalization
• (Later) shortcut connections
• Great breadth
Inception-v1 Module
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
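A shape-level sketch of the module idea: parallel branches process the same input at different scales, and the results are concatenated along the channel axis. The `conv` helper and the channel counts below are illustrative placeholders, not an implementation of the actual convolutions.

```python
import numpy as np

# Inception-v1 module, shapes only: several branches see the same H x W x C
# feature map; their outputs are concatenated channel-wise.
x = np.zeros((28, 28, 192))                 # H x W x C input feature map

def conv(t, out_channels):                  # stand-in for a same-padded conv
    return np.zeros(t.shape[:2] + (out_channels,))

b1 = conv(x, 64)                            # 1x1 branch
b2 = conv(conv(x, 96), 128)                 # 1x1 dimension-reduction, then 3x3
b3 = conv(conv(x, 16), 32)                  # 1x1 dimension-reduction, then 5x5
b4 = conv(x, 32)                            # pooling branch + 1x1 projection

out = np.concatenate([b1, b2, b3, b4], axis=-1)   # channels: 64+128+32+32
```

The 1x1 reductions before the larger filters are what keep the module cheap despite its breadth.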
MobileNet
• Depthwise-separable convolutions
• Split filtering by channel
• Use 1x1 convolutions to combine
• Linear bottleneck layers
• Authors state non-linear activations are lossy
• Shortcut connections
• Performance-tunable
• Low-cost
MobileNet-v2 Layer
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
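The "low-cost" claim can be checked by counting parameters. A standard KxK convolution mixes space and channels at once; splitting it into a per-channel KxK depthwise filter plus a 1x1 pointwise combination is much cheaper. The layer sizes below are illustrative, not taken from the paper.

```python
# Parameter-count sketch for depthwise-separable convolutions.
k, c_in, c_out = 3, 144, 144                # hypothetical layer sizes

standard = k * k * c_in * c_out             # full KxK convolution
separable = k * k * c_in + c_in * c_out     # depthwise KxK + pointwise 1x1

ratio = standard / separable                # roughly k*k for large channel counts
```

For 3x3 filters the saving approaches 9x as the channel count grows, which is where MobileNet's performance tunability comes from.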
NASNet and PNASNet
• Use search strategies to find the best “cell”
• Define search space
• Search using small dataset (CIFAR-10) and small architecture
• Evaluate using large dataset (ILSVRC2012) and large architecture
• Complex but expressive
PNASNet-5 Cell | PNASNet ImageNet architecture
Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
[Diagram: two identity pipelines.
A → A: Image x → F_A → CNN A features → I → C_A → Softmax → Labels.
B → B: Image x → F_B → CNN B features → I → C_B → Softmax → Labels.]
Feature Mappings: Start with Identity
• Start: Images feed the feature extractor (F), producing features
• End: Features feed the classifier (C), producing predictions (labels)
• Notice the blue rectangle, currently just the identity mapping
• Thus far, a pointless extra step. But, next slide …
[Diagram: four pipelines.
A → A: Image x → F_A → CNN A features → I → C_A → Softmax → Labels.
B → B: Image x → F_B → CNN B features → I → C_B → Softmax → Labels.
A → B: Image x → F_A → CNN A features → M_{A→B} → CNN B features (mapped) → C_B → Softmax → Labels.
B → A: Image x → F_B → CNN B features → M_{B→A} → CNN A features (mapped) → C_A → Softmax → Labels.]
Affine Mapping Between CNN Feature Spaces
To be Clear, All We Mean by Affine is …
y = Mx + b
[Diagram: A → B pipeline. Image x → F_A → CNN A features → M_{A→B} → CNN B features (mapped) → C_B → Softmax → Labels → Prediction]
Solving for the Affine Mapping
• Paired inputs and outputs problem
• Input is a feature vector from CNN A for image x_i
• Output is a feature vector from CNN B for image x_i
• Produce pairs for all 1.3 million ILSVRC2012 images
• Perform ordinary least squares regression to produce the affine map
• Training labels are never used (else we would be retraining classifiers)
• This process is repeated for all 90 unique pairs of CNNs
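The fitting step above can be sketched with ordinary least squares. This is a toy sketch: `feats_a`/`feats_b` are synthetic stand-ins for the pooled final-layer features the two CNNs would produce for the same images (the real matrices would be ~1.3M rows by ~1500–2000 columns), and the "true" map is planted so the recovery is exact.

```python
import numpy as np

# Fit the affine map M_{A->B} from paired feature vectors, no labels needed.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(1000, 32))        # CNN A features, one row per image
true_map = rng.normal(size=(32, 16))
feats_b = feats_a @ true_map + 0.5           # CNN B features (toy: exactly affine in A)

# Append a ones column so the least-squares solution includes a bias term,
# making the fitted map affine rather than purely linear.
A = np.hstack([feats_a, np.ones((len(feats_a), 1))])
M, *_ = np.linalg.lstsq(A, feats_b, rcond=None)

mapped = A @ M                               # A's features expressed in B's space
```

In practice the mapped features are only approximately right, and their quality is measured by feeding them through CNN B's classifier, as in the results below.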
[Diagram: four pipelines.
R → R: Image x → F_R → ResNet features → I → C_R → Softmax → Labels.
I → I: Image x → F_I → Inception features → I → C_I → Softmax → Labels.
R → I: Image x → F_R → ResNet features → M_{R→I} → Inception features (mapped) → C_I → Softmax → Labels.
I → R: Image x → F_I → Inception features → M_{I→R} → ResNet features (mapped) → C_R → Softmax → Labels.]
# correct / 50k:
• R → R: 39,350 (78.70%)
• I → I: 40,195 (80.39%)
• R → I: 38,175 (76.35%)
• I → R: 39,785 (79.57%)
Example Results (ResNet-v2 & Inception-v4)
Comparing 10 well-known CNNs
All Results
But wait!
• Classifiers are all linear
• Classifiers are full-rank
• Thus, converting features to logits is fully reversible, lossless
• Mapping features via classifiers is perfect
• Still, this doesn’t mean the previous result is trivial
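The reversibility claim can be sketched directly. Assuming a full-rank linear classifier whose feature dimension does not exceed the number of classes' logit dimension, the pseudo-inverse recovers the features exactly; the sizes below are toy stand-ins for the real ~1500–2048 features and 1000 logits.

```python
import numpy as np

# A full-(column-)rank linear classifier W maps features to logits losslessly:
# pinv(W) @ W is the identity, so features can be recovered from logits.
rng = np.random.default_rng(0)
d, n_classes = 8, 10                   # toy sizes
W = rng.normal(size=(n_classes, d))    # classifier weights (full column rank a.s.)

x = rng.normal(size=d)                 # a feature vector
logits = W @ x                         # forward through the classifier
x_back = np.linalg.pinv(W) @ logits    # invert the classifier
```

So a perfect feature map could be built by composing one classifier with the pseudo-inverse of another; the interesting finding is that a single affine map fit without labels does nearly as well.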
Significance
• Equivalence was sought to better understand CNN architectures
• This work has demonstrated near-affine equivalence for 10 popular ImageNet CNNs
• Despite variations in architecture, pedigree, and performance
• Presumably, each network is a distinct combination of non-linear functions, yet a linear relationship exists.
• This suggests further advances in CNN performance may not rely on architecture
• Indeed, the last two advances in ImageNet performance involve pre-published nets with novel training
• Regardless, more work is necessary to fully understand these relationships.
Future Work
• Characterize mappings
• Other experiments suggest redundancy in number of dimensions
• Applying this “canonical space”
• Perhaps there is a limit to CNN architecture expressivity!
• Testing different domains
• Face identification is promising!
• Dong et al., Oct 2019 – linearity between feature spaces
• CNNs trained for face identification
• Linearity found between CNN feature space and hand-built feature space
• They conclude this would be difficult for general object detection/classification
• Testing different training paradigms
• Do modern data augmentation techniques inhibit this equivalence?
• How about extra training data?
Comparison with Transfer Learning
• Map features across domains
• Use labels to train map / fine-tune
• Map features across CNNs
• Use feature vectors to train map
Diving Deeper
• Take all 50k validation samples
• Divide into 3 categories
• Labeled correctly by ResNet
• Labeled correctly by Inception
• Labeled correctly by I → R
• Most samples (72.6%) labeled correctly by all 3
• ResNet missed 11k (22%)
• I →R corrected 3k (5.9%)
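The three-way breakdown above amounts to intersecting boolean masks over the validation set. The accuracy rates below are random placeholders standing in for the real per-sample correctness arrays, not the reported numbers.

```python
import numpy as np

# Boolean masks over the 50k validation samples: which pipelines got each right.
rng = np.random.default_rng(0)
n = 50_000
resnet_ok = rng.random(n) < 0.80       # ResNet labeled correctly (placeholder rate)
incep_ok = rng.random(n) < 0.80        # Inception labeled correctly (placeholder)
mapped_ok = rng.random(n) < 0.78       # I -> R (mapped) labeled correctly (placeholder)

all_three = resnet_ok & incep_ok & mapped_ok    # correct by all 3 pipelines
resnet_missed = ~resnet_ok                      # ResNet's misses
corrected = resnet_missed & mapped_ok           # misses the mapped pipeline fixes
```

With the real masks, `all_three` covers 72.6% of samples, `resnet_missed` about 11k, and `corrected` about 3k, as reported on the slide.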
Appendix – R to I Venn Diagram
Appendix – Affine Illustration