

A Systematic Evaluation: Fine-Grained CNN vs. Traditional CNN Classifiers

Saeed Anwar, Nick Barnes, and Lars Petersson

Abstract—To make the best use of the underlying minute and subtle differences, fine-grained classifiers collect information about inter-class variations. The task is very challenging due to the small differences in color, viewpoint, and structure between entities of the same class. Classification becomes more difficult still because differences due to viewpoint within a class can resemble the differences that separate it from other classes. In this work, we investigate the performance of landmark general CNN classifiers, which presented top-notch results on large-scale classification datasets, on fine-grained datasets, and compare them against state-of-the-art fine-grained classifiers. We pose two specific questions: (i) Do general CNN classifiers achieve results comparable to fine-grained classifiers? (ii) Do general CNN classifiers require any specific information to improve upon fine-grained ones? Throughout this work, we train the general CNN classifiers without introducing any aspect that is specific to fine-grained datasets. We present an extensive evaluation on six datasets to determine whether the fine-grained classifiers are able to improve upon these baselines.

Index Terms—Fine-grained classifiers, Traditional classifiers, Systematic evaluation, Deep learning, Review, Baselines.

I. INTRODUCTION

Fine-grained visual classification refers to the task of distinguishing categories within the same class. Fine-grained classification differs from traditional classification in that the former models intra-class variance while the latter is about inter-class differences. Examples of naturally occurring fine-grained classes include birds [1], [2], dogs [3], flowers [4], vegetables [5], and plants [6], while human-made categories include aeroplanes [7], cars [8], and food [9]. Fine-grained classification is helpful in numerous computer vision and image processing applications such as image captioning [10], machine teaching [11], and instance segmentation [12].

Fine-grained visual classification is a challenging problem because there are only minute and subtle differences within species of the same class, e.g., a crow and a raven, whereas in traditional classification the difference between classes is quite visible, e.g., a lion and an elephant. Fine-grained classification of species or objects of any category is a herculean task for human beings and usually requires extensive domain knowledge to identify the species or objects correctly.

As mentioned earlier, fine-grained classification in image space must cope with high intra-class variance and low inter-class variance.

S. Anwar and L. Petersson are with CSIRO and the School of Engineering and Computer Science, Australian National University, Australia (see https://saeed-anwar.github.io/; e-mail: (saeed.anwar, lars.petersson)@data61.csiro.au).

Nick Barnes is an associate professor at the School of Engineering and Computer Science, Australian National University, Australia. E-mail: [email protected]

Fig. 1. The difference between classes (inter-class variation) is limited for various classes (classes shown: Malamute, Husky, Eskimo; Crow, Raven, Jackdaw).

Fig. 2. The intra-class variation is usually high due to pose, lighting, and color.

We provide a few sample images from the dog and bird datasets in Figure 1 to highlight the difficulty of the problem. The examples in the figure show images with the same viewpoint, and the colors are also roughly similar. Although the visual variation between classes is very limited, all of these belong to different dog and bird categories. In Figure 2, we provide more examples of the same categories. Here, the differences in viewpoint and color are prominent. The visual variation is more significant compared to the images in Figure 1, yet these belong to the same class.

Many approaches have been proposed to tackle the problem of fine-grained classification; for example, earlier works converged on part detection to model the intra-class variations.



Next, algorithms exploited three-dimensional representations to handle multiple poses and viewpoints and achieve state-of-the-art results. Recently, with the advent of CNNs, most methods exploit the modeling capacity of CNNs either as a component or as a whole.

This paper aims to investigate the capability of traditional CNN networks compared to specially designed fine-grained CNN classifiers. We strive to answer the question of whether current general CNN classifiers can achieve performance comparable to fine-grained ones. To show their competitiveness, we employ several fine-grained datasets and report top-1 accuracy for both classifier types. These experiments give general classifiers a proper place in fine-grained performance charts and will serve as baselines in future comparisons for fine-grained classification problems.
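To make the evaluation protocol concrete, the sketch below shows how the reported top-1 accuracy can be computed; PyTorch is our illustrative choice here, as the paper does not tie the metric to any framework.

```python
import torch

# A minimal sketch of the top-1 accuracy metric reported throughout the paper:
# the fraction of images whose highest-scoring class matches the label.
def top1_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: (N, num_classes) scores; labels: (N,) ground-truth indices."""
    predictions = logits.argmax(dim=1)  # most confident class per image
    return (predictions == labels).float().mean().item()

# Toy usage: two of the three predictions match the labels.
logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [0.4, 3.0]])
labels = torch.tensor([0, 0, 1])
print(top1_accuracy(logits, labels))  # 0.666...
```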

II. RELATED WORKS

Fine-grained visual classification is a vital and well-studied problem. The aim of fine-grained visual classification is to differentiate between subclasses of the same category, as opposed to the traditional classification problem, where discriminative features are learned to distinguish between different classes. Some of the challenges in this domain are the following: i) the categories are highly correlated, i.e., small inter-category variance leaves only small differences to discriminate between subcategories; ii) similarly, the intra-category variation can be significant due to different viewpoints and poses. Many algorithms, such as [13], [14], [15], [16], [17], [18], [19], have been presented to achieve the desired results. In this section, we highlight recent approaches relevant to our study. FGVC research can be divided into three main branches, which are reviewed in the following paragraphs.

Part-based FGVC algorithms [20], [21], [22], [23], [24], [25] rely on the distinguishing features of objects to leverage the accuracy of visual recognition. These FGVC methods [26], [27] aim to learn the distinct features present in different parts of the object, e.g., the differences present in the beak and tail of bird species. Similarly, part-based approaches normalize the variation present due to poses and viewpoints. Many works [28], [29], [1] assume the availability of bounding boxes at the object level and the part level in all images during both training and testing. To achieve higher accuracy, [22], [30], [31] employed both object-level and part-level annotations. These assumptions restrict the applicability of the algorithms to larger datasets. A reasonable alternative setting would be the availability of a bounding box around the object of interest. Recently, Chai et al. [21] applied simultaneous segmentation and detection to enhance the performance of segmentation as well as object part localization. Similarly, a supervised method is proposed in [16] which locates the training images most similar to a test image using KNN; the object part locations from the selected training images are then regressed to the test image.

Succeeding algorithms take advantage of the annotated data during the training phase while requiring no annotations during the testing phase. These supervised methods learn from both object-level and object-part-level annotations in the training phase only. Such an approach is furnished in [32], where only object-level annotations are given during training while no supervision is provided at the object-part level. Similarly, the Spatial Transformer Network (STCNN) [33] handles data representation and outputs the location of vital regions. Furthermore, recent approaches have focused on removing the limitations of previous works, aiming for conditions where information about the location of object parts is not required in either the training or testing phase. These FGVC methods are suitable for deployment at a large scale and are helpful for the advancement of research in this direction.

Recently, Xiao et al. [25] presented two attention models to learn appropriate patches for a particular object and determine the discriminative object parts using a deep CNN. The fundamental idea is to cluster the last CNN feature maps into groups; the object patches and object parts are obtained from the activations of these clustered feature maps. [25] needs the model to be trained on the category of interest, while we only require a generally trained CNN. Similarly, DTRAM [34] learns to end the attention process for each image after a fixed number of steps. A number of methods have been proposed to take advantage of object parts; the most popular is the deformable part model (DPM) [35], which learns the constellation relative to the bounding box with Support Vector Machines (SVMs). Simon et al. [36] improved upon [37] and employed DPM to localize the parts using the constellation provided by DPM [35]. The Navigator-Teacher-Scrutinizer Network (NTSNet) [38] uses informative regions in images without employing any kind of annotation. Another teacher-student network was recently proposed, the Trilinear Attention Sampling Network (TASN) [63], which is composed of a trilinear attention module, an attention-based sampler, and a feature distiller.

Current state-of-the-art fine-grained visual categorization methods [39], [24] avoid incorporating bounding boxes during the testing and training phases. Zhang et al. [24] and Lin et al. [39] used two-stage networks for object and object-part detection and classification, employing R-CNN and Bilinear-CNN, respectively. Part-Stacked CNN [18] adopts the same two-stage strategy as [39], [24]; however, the difference lies in the stacking of the object parts at the end for classification. Fu et al. [40] proposed the multiple-scale RACNN to acquire distinguishing attention and region feature representations. Moreover, HIHCA [41] incorporated higher-order hierarchical convolutional activations via a kernel scheme.

Distance metric learning [42], [43], [44], [45] is an alternative to part-based algorithms and aims to cluster data points/objects of the same category together while moving different categories away from each other. Bromley et al. [45] trained Siamese networks using deep metrics for signature verification; in this context, [45] set the trend in this direction. Recently, Qian et al. [42] employed a multi-stage framework which accepts pre-computed feature maps and learns a distance metric for classification. The pre-computed features can be extracted from DeCAF [43], as these features are discriminative and can be used in many classification tasks. Dubey et al. [46] employ pairwise confusion (PC) on top of traditional classifiers.


Fig. 3. Basic building blocks of the VGG [58], ResNet [59], and DenseNet [60] architectures.

Many works [47], [32], [25], [24] utilized the feature representations of a CNN and employed them in many tasks [48], [49], [50]. A CNN captures global information directly, as opposed to traditional descriptors, which capture local information and require manual engineering to encode a global representation. Destruction and Construction Learning (DCL) [51] takes advantage of a standard classification network and emphasizes discriminative local details; the model then reconstructs the semantic correlation among local regions. Zeiler & Fergus [49] illustrated the reconstruction of the original image from the activations of the fifth max-pooling layer. Max-pooling ensures invariance to small-scale translation and rotation; however, robustness to larger-scale deformations might be achieved through global spatial information. To capture global information, Gong et al. [52] combined the features from fully connected layers using VLAD pooling. Similarly, Cimpoi et al. [53] pooled the features from convolutional layers instead of fully connected layers for texture recognition, based on the idea that convolutional layers are transferable and not domain-specific. Following in the footsteps of [53], [52], PDFR [17] encoded the CNN filter responses with a picking strategy via a combination of Fisher Vectors. However, considering feature encoding as an isolated element is not an optimal choice for convolutional neural networks.

Recently, feature integration from different layers of the same CNN has been adopted by several approaches. The intuition behind feature integration is to take advantage of the global semantic information captured by fully connected layers and the instance-level information preserved by convolutional layers [54]. Long et al. [55] merged the features from intermediate and high-level convolutional activations in their convolutional network to exploit both low-level details and high-level semantics for image segmentation. Similarly, for localization and segmentation, Hariharan et al. [56] concatenated the feature maps of the convolutional layers at each pixel into a vector to form a descriptor. Likewise, for edge detection, Xie & Tu [57] added several feature maps from the lower convolutional layers to guide the CNN and predict edges at different scales.

III. TRADITIONAL NETWORKS

To make the paper self-contained, we briefly describe the basic building blocks of modern state-of-the-art traditional CNN architectures. These architectures can be broadly categorized into plain networks, residual networks, and densely connected networks. We review the most prominent and pioneering traditional networks in each category and then adapt these models for the fine-grained classification task. The three architectures we investigate are VGG [58], ResNet [59], and DenseNet [60].

TABLE I
DETAILS OF THE SIX FINE-GRAINED VISUAL CATEGORIZATION DATASETS USED FOR EVALUATION.

                          No. of Images
Dataset        Classes   Train    Val     Test
NABirds [2]    555       23,929   -       24,633
Dogs [3]       120       12,000   -       8,580
CUB [1]        200       5,994    -       5,794
Aircraft [7]   100       3,334    3,333   3,333
Cars [8]       196       8,144    -       8,041
Flowers [4]    102       2,040    -       6,149

A. Plain Network

Pioneering CNN architectures such as AlexNet [61] and VGG [58] follow a single path, i.e., without any skip connections. The success of AlexNet [61] inspired VGG [58]. Both of these networks rely on small convolutional filters, because a sequence of small convolutional filters achieves the same receptive field as larger convolutional filters. For example, four 3×3 convolutional layers stacked together have the same receptive field as two 5×5 convolutional layers in sequence, yet the stack of smaller filters requires fewer parameters. The basic building block of the VGG [58] architecture is shown in Figure 3(a).
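The arithmetic behind this design choice can be checked directly; the following pure-Python sketch computes receptive fields and weight counts for stacked convolutions. The layer configurations and the channel width of 64 are illustrative values, not taken from the paper.

```python
# A minimal sketch comparing a stack of small convolutions against
# larger filters, as discussed above.

def receptive_field(kernel_sizes, strides=None):
    """Receptive field of a stack of conv layers (stride 1 by default)."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= s
    return rf

def conv_params(kernel_sizes, channels):
    """Weight count for a conv stack with a constant channel width."""
    return sum(k * k * channels * channels for k in kernel_sizes)

C = 64
print(receptive_field([3, 3, 3, 3]))  # 9: four 3x3 layers
print(receptive_field([5, 5]))        # 9: two 5x5 layers, same field
print(conv_params([3, 3, 3, 3], C))   # 4*9*C^2  = 147,456 weights
print(conv_params([5, 5], C))         # 2*25*C^2 = 204,800 weights (more)
```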

VGG [58] has many variants; we use the 19-layer convolutional network, which has shown promising results on ImageNet. As mentioned earlier, the block structure of VGG is planar (without any skip connections), and the number of feature channels is increased from 64 to 512.

B. Residual Network

To address the vanishing gradient problem, residual networks employ network elements with skip connections known as identity shortcuts, as shown in Figure 3(b). The pioneering work in this direction is ResNet [59].

The identity shortcuts help to propagate the gradient signal backward without it being diminished: the shortcuts theoretically allow the signal to "skip" over all layers and reach the initial layers of the network, so even very deep networks can learn the task at hand. Because features are summed at the end of each module, ResNet [59] learns only an offset (residual) and therefore does not need to learn the full feature mapping. The identity shortcuts allow for the successful and robust training of much deeper architectures than previously possible. Due to their successful classification results, we compare two variants of ResNet (ResNet50 & ResNet152) with fine-grained classifiers.
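As an illustration of this offset learning, a minimal PyTorch sketch of an identity-shortcut block follows; it is a simplified version of the basic ResNet block, and the channel and input sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Simplified identity-shortcut block in the spirit of Figure 3(b)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                           # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)       # the block learns only an offset

x = torch.randn(1, 64, 56, 56)                 # illustrative input size
print(BasicResidualBlock(64)(x).shape)         # torch.Size([1, 64, 56, 56])
```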

C. Dense Network

Building upon the success of ResNet [59], DenseNet [60] concatenates the output of each convolutional layer within a module, replacing the expensive element-wise addition and retaining both the current features and those from previous layers through skip connections. Furthermore, there is always a path for information to flow backward from the last layer, which deals with the diminishing gradient problem. Moreover, to improve computational efficiency, DenseNet [60] utilizes 1×1 convolutional layers to reduce the number of input feature maps before each 3×3 convolutional layer. Transition layers are applied to compress the number of channels resulting from the concatenation operations. The building block of DenseNet [60] is shown in Figure 3(c).
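A minimal PyTorch sketch of this concatenation pattern follows; the growth rate of 32 and the 4× bottleneck width mirror common DenseNet settings but are our illustrative assumptions, not values stated in the paper.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One dense layer: 1x1 bottleneck, 3x3 conv, then channel concatenation."""
    def __init__(self, in_channels, growth_rate=32):
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4 * growth_rate, 1, bias=False),  # 1x1 reduces inputs
        )
        self.conv = nn.Sequential(
            nn.BatchNorm2d(4 * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        new_features = self.conv(self.bottleneck(x))
        return torch.cat([x, new_features], dim=1)  # concatenate, not add

x = torch.randn(1, 64, 28, 28)
print(DenseLayer(64)(x).shape)  # torch.Size([1, 96, 28, 28]): 64 + 32 channels
```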


TABLE II
COMPARISON WITH STATE-OF-THE-ART FINE-GRAINED CLASSIFICATION ON THE CUB [1] DATASET.

CNN            Methods                Acc.
Fine-Grained   MGCNN [27]             81.7%
               STCNN [33]             84.1%
               FCAN [62]              84.3%
               PDFR [17]              84.5%
               RACNN [40]             85.3%
               HIHCA [41]             85.3%
               DTRAM [34]             86.0%
               BilinearCNN [39]       84.1%
               PC-BilinearCNN [46]    85.6%
               PC-DenseNet161 [46]    86.7%
               MACNN [26]             86.5%
               NTSNet [38]            87.5%
               DCL-VGG16 [51]         86.9%
               DCL-ResNet50 [51]      87.8%
               TASN [63]              87.9%
Traditional    VGG19 [58]             77.8%
               ResNet50 [59]          84.7%
               ResNet152 [59]         85.0%
               NasNet [64]            83.0%
               DenseNet161 [60]       87.7%


The performance of DenseNet [60] on ILSVRC is comparable with ResNet [59], but it has significantly fewer parameters and thus requires less computation; e.g., a DenseNet with 201 convolutional layers and 20 million parameters produces a validation error comparable to a ResNet with 101 convolutional layers and 40 million parameters. Therefore, we consider DenseNet [60] a suitable candidate for fine-grained classification.

IV. EXPERIMENTAL SETTINGS

A. Datasets

In this section, we provide the details of the six most prominent fine-grained datasets used for evaluation and comparison against the current state-of-the-art algorithms.

• Birds: The first birds dataset we compare on is Caltech-UCSD Birds-200-2011, abbreviated as CUB [1], which is composed of 11,788 photographs of 200 categories, further divided into 5,994 training and 5,794 testing images. The second dataset for fine-grained bird classification is North American Birds, generally known as NABirds [2], which is the largest in this comparison. NABirds [2] covers 555 species found in North America with 48,562 photographs.

• Dogs: Stanford Dogs [3] is a subset of ImageNet [65] gathered for the task of fine-grained categorization. The dataset is composed of 12,000 training and 8,580 testing images.

• Cars: The Cars dataset [8] has 196 classes of different make, model, and year. It has a total of 16,185 car photographs, split into 8,144 training images and 8,041 testing images, i.e., roughly 50% for each.

TABLE III
EXPERIMENTAL RESULTS ON FGVC AIRCRAFT [7] AND CARS [8].

CNN            Methods                Aircraft [7]   Cars [8]
Fine-Grained   FVCNN [66]             81.5%          -
               FCAN [62]              -              89.1%
               BilinearCNN [39]       84.1%          91.3%
               RACNN [40]             88.2%          92.5%
               HIHCA [41]             88.3%          91.7%
               PC-Bilinear [46]       85.8%          92.5%
               PC-ResNet50 [46]       83.4%          93.4%
               PC-DenseNet161 [46]    89.2%          92.7%
               MACNN [26]             89.9%          92.8%
               DTRAM [34]             -              93.1%
               TASN [63]              -              93.8%
               NTSNet [38]            91.4%          93.9%
Traditional    VGG19 [58]             85.7%          80.5%
               ResNet50 [59]          91.4%          91.7%
               ResNet152 [59]         90.7%          93.2%
               NasNet [64]            88.5%          -
               DenseNet161 [60]       92.9%          94.5%

• Aeroplanes: The fine-grained visual classification of aircraft (FGVC-Aircraft) dataset [7] contains a total of 10,200 images covering 102 variants, with 100 images for each. Airplanes are an alternative to the objects typically considered for fine-grained categorization, such as birds and pets.

• Flowers: The Flowers [4] dataset has 102 classes, with 2,040 training images and 6,149 testing images. Furthermore, there are significant variations within categories as well as similarities with other categories.

Table I summarizes the number of classes and the number of images, including the data splits for training, testing, and validation (if any), for the fine-grained visual categorization datasets.
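For reproducibility, the sketch below shows one plausible way to load such a dataset with torchvision; the folder path is hypothetical, and the ImageNet-style preprocessing is our assumption, as the paper does not describe its input pipeline.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# A minimal sketch, assuming the images are arranged one folder per class.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),       # standard input size for these networks
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet channel statistics
                         std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("CUB/train", transform=train_transform)  # hypothetical path
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
```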

V. EVALUATIONS

A. Performance on CUB Dataset

We present the comparisons on the CUB dataset [1] in Table II. The best performer on this dataset is DenseNet [60], which is unsurprising because the model concatenates the feature maps from preceding layers to preserve details. The worst performer among the traditional classifiers is NasNet [64], perhaps due to its design, which is tuned towards a specific dataset (i.e., ImageNet [65]). The ResNet models [59] perform better than NasNet [64], which suggests that networks with shortcut connections surpass those with multi-scale representations for fine-grained classification. DenseNet [60] offers higher accuracy than ResNet [59] because the former does not fuse the features and carries the details forward, unlike the latter, where the features are combined in each block.

The fine-grained classification literature considers CUB-200-2011 [1] a standard benchmark for evaluation; therefore, image-level labels, bounding boxes, and other types of annotations are employed to extract the best results on this dataset. Similarly, multi-branch networks focusing on various parts of images and multiple objective functions are combined for optimization. On the contrary, the traditional classifiers [60], [59] use a single loss without any extra information or any other annotations.


TABLE IV
COMPARISON WITH STATE-OF-THE-ART FINE-GRAINED CLASSIFICATION ON THE DOGS [3], FLOWERS [4], AND NABIRDS [2] DATASETS.

CNN            Methods                Dogs [3]   Flowers [4]   NABirds [2]
Fine-Grained   Zhang et al. [67]      80.4%      -             -
               Krause et al. [68]     80.6%      -             -
               Det.+Seg. [69]         -          80.7%         -
               Overfeat [50]          -          86.8%         -
               Branson et al. [47]    -          -             35.7%
               Van et al. [2]         -          -             75.0%
               BilinearCNN [39]       82.1%      92.5%         80.9%
               PC-ResNet50 [46]       73.4%      93.5%         68.2%
               PC-BilinearCNN [46]    83.0%      93.7%         82.0%
               PC-DenseNet161 [46]    83.8%      91.4%         82.8%
Traditional    VGG19 [58]             76.7%      88.73%        74.9%
               ResNet50 [59]          83.4%      97.2%         79.6%
               ResNet152 [59]         85.2%      97.5%         84.0%
               DenseNet161 [60]       85.2%      98.1%         86.3%

The best-performing fine-grained classifiers for CUB [1] are DCL-ResNet50 [51], TASN [63], and NTSNet [38], where gains of 0.1% and 0.2% over DenseNet [60] are recorded for DCL-ResNet50 [51] and TASN [63], respectively, while NTSNet [38] lags behind by a margin of 0.2%. The improvement over DenseNet [60] is insignificant, keeping in mind the various computationally expensive tactics employed by the fine-grained classifiers to learn distinguishable features.

B. Quantitative analysis on Aircraft and Cars

In Table III, the performances of the fine-grained classifiers are shown on the Cars [8] and Aircraft [7] datasets. Here, we again observe that the performance of the traditional classifiers is better than that of the fine-grained classifiers. DenseNet161 [60] shows an improvement of about 1.5% and 3% on Aircraft [7] over the best-performing NTSNet [38] and MACNN [26], respectively. Similarly, improvements of 0.6% and 1.4% are recorded against NTSNet [38] and DTRAM [34] on Cars [8], respectively. The fine-grained classifiers [38], [34], [63] fail to achieve the same accuracy as the traditional classifiers, although the former employ more image-specific information for learning.

C. Comparison on Stanford Dogs

The Stanford Dogs dataset [3] is another challenging dataset; performance is compared in Table IV. Here, we utilize ResNet [59] and DenseNet [60] from the traditional classifiers. The performance of the 152-layer ResNet [59] is similar to that of the 161-layer DenseNet [60]: both achieve 85.2% accuracy, which is 1.4% higher than PC-DenseNet161 [46], the best-performing fine-grained classifier. This experiment suggests that incorporating traditional classifiers into fine-grained methods requires more insight than simply plugging them into the framework. It is also worth mentioning that some of the fine-grained classifiers employ a large amount of data from other sources in addition to the Stanford Dogs training data.

D. Results on the Flowers dataset

The accuracy of DenseNet [60] on the Flowers dataset [4] is 98.1%, which is around 4.6% higher than that of

TABLE V
DIFFERENT STRATEGIES FOR INITIALIZING THE NETWORK WEIGHTS, i.e., FINE-TUNING FROM IMAGENET VS. RANDOM INITIALIZATION (SCRATCH), ON THE CARS [8] DATASET.

Initial Weights   ResNet50   ResNet152
Scratch           83.4%      36.9%
Fine-tune         91.7%      93.2%

the second-best performing state-of-the-art method (PC-ResNet50 [46]) in Table IV. Similarly, the other traditional classifiers also outperform the fine-grained ones by a significant margin. It should also be noted that the performance on this dataset is approaching saturation.

E. Performance on NABirds

Relatively few methods have reported results on this dataset; however, for the sake of completeness, we provide comparisons on the NABirds [2] dataset. Again, the leading performance on NABirds [2] is achieved by DenseNet161 [60], followed by ResNet152 [59]. The third-best performer is a fine-grained classifier, PC-DenseNet161 [46], which internally employs DenseNet161 [60] and lags behind by 3.5%. This shows the superior performance of the traditional CNN classifiers against state-of-the-art fine-grained CNN classifiers.

F. Ablation studies

Here, we present two strategies for training traditional CNN classification networks, i.e., fine-tuning weights pre-trained on ImageNet [65] and training from scratch (randomly initializing the weights), on the Cars dataset. The accuracy for each is given in Table V. ResNet50 [59] achieves higher accuracy when fine-tuned compared to the randomly initialized version. Similarly, ResNet152 [59] performs better as a fine-tuned network but fails when trained from scratch. The reason may be its large number of parameters combined with the small amount of training data.
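The sketch below illustrates how the two strategies of Table V differ in code, assuming torchvision's ResNet50; only the source of the initial weights changes, while the classifier head is always replaced to match the 196 Cars classes.

```python
import torch.nn as nn
from torchvision import models

# A minimal sketch of the two initialization strategies compared in Table V.
def build_model(num_classes: int = 196, finetune: bool = True) -> nn.Module:
    # torchvision's older `pretrained` flag: ImageNet weights vs. random init
    # (newer torchvision versions use the `weights=` argument instead).
    model = models.resnet50(pretrained=finetune)
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # replace the head
    return model

finetuned = build_model(finetune=True)       # "Fine-tune" row of Table V
from_scratch = build_model(finetune=False)   # "Scratch" row of Table V
```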

VI. CONCLUSION

In this paper, we provided comparisons between state-of-the-art traditional CNN classifiers and fine-grained CNN classifiers. We have shown that traditional models achieve state-of-the-art performance on fine-grained classification datasets and outperform fine-grained CNN classifiers; therefore, the baselines used for comparison need to be updated. It is also important to note that part of the performance increase is due to initial weights trained on the ImageNet [65] dataset. Furthermore, we have established that the DenseNet161 model achieves new state-of-the-art results on nearly all of the datasets, outperforming the fine-grained classifiers by a significant margin.

REFERENCES

[1] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," California Institute of Technology, Tech. Rep., 2011.
[2] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie, "Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection," in CVPR, 2015.


[3] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, "Novel dataset for FGVC: Stanford dogs," in CVPR Workshop, 2011.
[4] M.-E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in ICVGIP, 2008.
[5] S. Hou, Y. Feng, and Z. Wang, "VegFru: A domain-specific dataset for fine-grained visual categorization," in ICCV, 2017.
[6] J. D. Wegner, S. Branson, D. Hall, K. Schindler, and P. Perona, "Cataloging public objects using aerial and street-level images - urban trees," in CVPR, 2016.
[7] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, "Fine-grained visual classification of aircraft," arXiv, 2013.
[8] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, "3D object representations for fine-grained categorization," in CVPR Workshops, 2013.
[9] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, "PFID: Pittsburgh fast-food image dataset," in ICIP, 2009.
[10] N. Aafaq, A. Mian, W. Liu, S. Z. Gilani, and M. Shah, "Video description: A survey of methods, datasets, and evaluation metrics," ACM Computing Surveys (CSUR), vol. 52, no. 6, p. 115, 2019.
[11] G. C. Spivak, Outside in the Teaching Machine. Routledge, 2012.
[12] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in CVPR, 2018.
[13] A. Angelova, S. Zhu, and Y. Lin, "Image segmentation for large-scale subcategory flower recognition," in WACV, 2013.
[14] Y. Chai, E. Rahtu, V. Lempitsky, L. Van Gool, and A. Zisserman, "TriCoS: A tri-level class-discriminative co-segmentation method for image classification," in ECCV, 2012.
[15] J. Deng, J. Krause, and L. Fei-Fei, "Fine-grained crowdsourcing for fine-grained recognition," in CVPR, 2013.
[16] E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars, "Fine-grained categorization by alignments," in ICCV, 2013.
[17] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian, "Picking deep filter responses for fine-grained image recognition," in CVPR, 2016.
[18] S. Huang, Z. Xu, D. Tao, and Y. Zhang, "Part-stacked CNN for fine-grained visual categorization," in CVPR, 2016.
[19] X. Zhang, F. Zhou, Y. Lin, and S. Zhang, "Embedding label structures for fine-grained feature representation," in CVPR, 2016.
[20] S. Yang, L. Bo, J. Wang, and L. G. Shapiro, "Unsupervised template learning for fine-grained object recognition," in NIPS, 2012.
[21] Y. Chai, V. Lempitsky, and A. Zisserman, "Symbiotic segmentation and part localization for fine-grained categorization," in ICCV, 2013.
[22] T. Berg and P. Belhumeur, "POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation," in CVPR, 2013.
[23] N. Zhang, R. Farrell, F. Iandola, and T. Darrell, "Deformable part descriptors for fine-grained recognition and attribute prediction," in ICCV, 2013.
[24] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, "Part-based R-CNNs for fine-grained category detection," in ECCV, 2014.
[25] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, "The application of two-level attention models in deep convolutional neural network for fine-grained image classification," in CVPR, 2015.
[26] H. Zheng, J. Fu, T. Mei, and J. Luo, "Learning multi-attention convolutional neural network for fine-grained image recognition," in ICCV, 2017.
[27] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, and Z. Zhang, "Multiple granularity descriptors for fine-grained categorization," in ICCV, 2015.
[28] O. M. Parkhi, A. Vedaldi, C. Jawahar, and A. Zisserman, "The truth about cats and dogs," in ICCV, 2011.
[29] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, "Cats and dogs," in CVPR, 2012.
[30] J. Liu, A. Kanazawa, D. Jacobs, and P. Belhumeur, "Dog breed classification using part localization," in ECCV, 2012.
[31] L. Xie, Q. Tian, R. Hong, S. Yan, and B. Zhang, "Hierarchical part matching for fine-grained visual categorization," in ICCV, 2013.
[32] J. Krause, H. Jin, J. Yang, and L. Fei-Fei, "Fine-grained recognition without part annotations," in CVPR, 2015.
[33] M. Jaderberg, K. Simonyan, A. Zisserman et al., "Spatial transformer networks," in NIPS, 2015.
[34] Z. Li, Y. Yang, X. Liu, F. Zhou, S. Wen, and W. Xu, "Dynamic computational time for visual attention," in ICCV, 2017.
[35] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in CVPR, 2008.
[36] M. Simon and E. Rodner, "Neural activation constellations: Unsupervised part model discovery with convolutional networks," in ICCV, 2015.
[37] M. Simon, E. Rodner, and J. Denzler, "Part detector discovery in deep convolutional neural networks," in ACCV, 2014.
[38] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang, "Learning to navigate for fine-grained classification," in ECCV, 2018.
[39] T.-Y. Lin, A. RoyChowdhury, and S. Maji, "Bilinear CNN models for fine-grained visual recognition," in ICCV, 2015.
[40] J. Fu, H. Zheng, and T. Mei, "Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition," in CVPR, 2017.
[41] S. Cai, W. Zuo, and L. Zhang, "Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization," in ICCV, 2017.
[42] Q. Qian, R. Jin, S. Zhu, and Y. Lin, "Fine-grained visual categorization via multi-stage metric learning," in CVPR, 2015.
[43] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in ICML, 2014.
[44] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie, "Large scale fine-grained categorization and domain-specific transfer learning," in CVPR, 2018.
[45] J. Bromley, I. Guyon, Y. LeCun, E. Sackinger, and R. Shah, "Signature verification using a 'siamese' time delay neural network," in NIPS, 1994.
[46] A. Dubey, O. Gupta, P. Guo, R. Raskar, R. Farrell, and N. Naik, "Pairwise confusion for fine-grained visual classification," in ECCV, 2018.
[47] S. Branson, G. Van Horn, S. Belongie, and P. Perona, "Bird species categorization using pose normalized deep convolutional nets," arXiv, 2014.
[48] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014.
[49] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in ECCV, 2014.
[50] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in CVPR Workshops, 2014.
[51] Y. Chen, Y. Bai, W. Zhang, and T. Mei, "Destruction and construction learning for fine-grained image recognition," in CVPR, 2019.
[52] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, "Multi-scale orderless pooling of deep convolutional activation features," in ECCV, 2014.
[53] M. Cimpoi, S. Maji, and A. Vedaldi, "Deep filter banks for texture recognition and segmentation," in CVPR, 2015.
[54] A. Babenko and V. Lempitsky, "Aggregating local deep features for image retrieval," in ICCV, 2015.
[55] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015.
[56] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in CVPR, 2015.
[57] S. Xie and Z. Tu, "Holistically-nested edge detection," in ICCV, 2015.
[58] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[59] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[60] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, 2017.
[61] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[62] X. Liu, T. Xia, J. Wang, Y. Yang, F. Zhou, and Y. Lin, "Fully convolutional attention networks for fine-grained recognition," arXiv, 2016.
[63] H. Zheng, J. Fu, Z.-J. Zha, and J. Luo, "Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition," in CVPR, 2019.
[64] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in CVPR, 2018.
[65] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[66] P.-H. Gosselin, N. Murray, H. Jegou, and F. Perronnin, "Revisiting the Fisher vector for fine-grained classification," PRL, 2014.
[67] Y. Zhang, X.-S. Wei, J. Wu, J. Cai, J. Lu, V.-A. Nguyen, and M. N. Do, "Weakly supervised fine-grained categorization with part-based image representation," TIP, 2016.
[68] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei, "The unreasonable effectiveness of noisy data for fine-grained recognition," in ECCV, 2016.
[69] A. Angelova and S. Zhu, "Efficient object detection and segmentation for fine-grained recognition," in CVPR, 2013.