Multi-Evidence Filtering and Fusion for Multi-Label Classification, Object Detection and Semantic Segmentation Based on Weakly Supervised Learning

Weifeng Ge    Sibei Yang    Yizhou Yu
Department of Computer Science, The University of Hong Kong

Abstract

Supervised object detection and semantic segmentation require object or even pixel level annotations. When there exist image level labels only, it is challenging for weakly supervised algorithms to achieve accurate predictions. The accuracy achieved by top weakly supervised algorithms is still significantly lower than their fully supervised counterparts. In this paper, we propose a novel weakly supervised curriculum learning pipeline for multi-label object recognition, detection and semantic segmentation. In this pipeline, we first obtain intermediate object localization and pixel labeling results for the training images, and then use such results to train task-specific deep networks in a fully supervised manner. The entire process consists of four stages, including object localization in the training images, filtering and fusing object instances, pixel labeling for the training images, and task-specific network training. To obtain clean object instances in the training images, we propose a novel algorithm for filtering, fusing and classifying object instances collected from multiple solution mechanisms. In this algorithm, we incorporate both metric learning and density-based clustering to filter detected object instances. Experiments show that our weakly supervised pipeline achieves state-of-the-art results in multi-label image classification as well as weakly supervised object detection and very competitive results in weakly supervised semantic segmentation on MS-COCO, PASCAL VOC 2007 and PASCAL VOC 2012.

1. Introduction

Deep neural networks give rise to many breakthroughs in computer vision by using huge amounts of labeled training data. Supervised object detection and semantic segmentation require object or even pixel level annotations, which are much more labor-intensive to obtain than image level labels. On the other hand, when there exist image level labels only, due to incomplete annotations, it is very challenging to predict accurate object locations, pixel-wise labels, or even image level labels in multi-label image classification.

Given image level supervision only, researchers have proposed many weakly supervised algorithms for detecting objects and labeling pixels. These algorithms employ different mechanisms, including bottom-up, top-down [42, 23] and hybrid approaches [31], to dig out useful information. In bottom-up algorithms, pixels are usually grouped into many object proposals, which are further classified, and the classification results are merged to match groundtruth image labels. In top-down algorithms, images first go through a forward pass of a deep neural network, and the result is then propagated backward to discover which pixels actually contribute to the final result [42, 23]. There are also hybrid algorithms [31] that consider both bottom-up and top-down cues in their pipeline.

Although there exist many weakly supervised algorithms, the accuracy achieved by top weakly supervised algorithms is still significantly lower than their fully supervised counterparts. This is reflected in both the precision and recall of their results. In terms of precision, results from weakly supervised algorithms contain much more noise and outliers due to indirect and incomplete supervision. Likewise, such algorithms also achieve much lower recall because there is insufficient labeled information for them to learn comprehensive feature representations of target object categories. However, different types of weakly supervised algorithms may return different but complementary subsets of the ground truth.

These observations motivate an approach that first collects as much evidence and as many results as possible from multiple types of solution mechanisms, puts them together, and then removes noise and outliers from the fused results using powerful filtering techniques. This is in contrast to deep neural networks trained from end to end. Although this approach needs to collect results from multiple separately trained networks, the filtered and fused evidence is eventually used for training a single network used in the testing stage. Therefore, the running time of the final network during the testing stage is still comparable to that of state-of-the-art end-to-end networks.

According to the above observations, we propose a weakly supervised curriculum learning pipeline for object recognition, detection and segmentation.


At a high level, we obtain object localization and pixel-wise semantic labeling results for the training images first using their image level labels, and then use such intermediate results to train object detection, semantic segmentation, and multi-label image classification networks in a fully supervised manner.

Since image level, object level and pixel level analyses have mutual dependencies, they are not performed independently but organized into a single pipeline with four stages. In the first stage, we collect object localization results in the training images from both bottom-up and top-down weakly supervised object detection algorithms. In the second stage, we incorporate both metric learning and density-based clustering to filter detected object instances. In this way, we obtain a relatively clean and complete set of object instances. Given these object instances, we further train a single-label object classifier, which is applied to all object instances to obtain their final class labels. Third, to obtain a relatively clean pixel-wise probability map for every class and every training image, we fuse the image level attention map, object level attention maps and an object detection heat map. The pixel-wise probability maps are used for training a fully convolutional network, which is applied to all training images to obtain their final pixel-wise label maps. Finally, the obtained object instances and pixel-wise label maps for all the training images are used for training deep networks for object detection and semantic segmentation respectively. To make pixel-wise label maps of the training images help multi-label image classification, we perform multi-task learning by training a single deep network with two branches, one for multi-label image classification and the other for pixel labeling. Experiments show that our weakly supervised curriculum learning system is capable of achieving state-of-the-art results in multi-label image classification as well as weakly supervised object detection and very competitive results in weakly supervised semantic segmentation on MS-COCO [25], PASCAL VOC 2007 and PASCAL VOC 2012 [12].

In summary, this paper has the following contributions.

• We propose a novel weakly supervised pipeline for multi-label object recognition, detection and semantic segmentation. In this pipeline, we first obtain intermediate labeling results for the training images, and then use such results to train task-specific networks in a fully supervised manner.

• To obtain clean object instances detected in the training images, we propose a novel algorithm for filtering, fusing and classifying object instances collected from multiple solution mechanisms. In this algorithm, we incorporate both metric learning and density-based clustering to filter detected object instances.

• To obtain a relatively clean pixel-wise probability map for every class and every training image, we propose a novel algorithm for fusing image level and object level attention maps with an object detection heat map. The fused maps are used for training a fully convolutional network for pixel labeling.

2. Related Work

Weakly Supervised Object Detection and Segmentation. Weakly supervised object detection and segmentation respectively locate and segment objects with image-level labels only [27, 7]. They are important for two reasons: first, learning complex visual concepts from image level labels is one of the key components in image understanding; second, fully supervised deep learning is too data hungry.

Weakly supervised object detection/localization is usually performed in a bottom-up manner. Methods in [27, 10, 9] treat the weakly supervised localization problem as an image classification problem, and obtain object locations in specific pooling layers of their networks. Methods in [4, 37] extract object instances from images using selective search [39] or edge boxes [46], and convert the weakly supervised detection problem into a multi-instance learning problem [8]. The method in [8] at first learns object masks as in [10, 9], and then uses the E-M algorithm to force the network to learn object segmentation masks obtained at previous stages. Since it is very hard for a network to directly learn object locations and pixel labels without sufficient supervision, in this paper, we decompose object detection and pixel labeling into multiple easier problems, and solve them progressively in multiple stages.

Neural Attention. Neural attention aims to find out the relationship between the pixels in the input image and the neural responses in every layer of a network. Many efforts [42, 2, 23] have been made to explain how neural networks work. The method in [23] extends layer-wise relevance propagation (LRP) [1] to comprehend inherent structured reasoning of deep neural networks. To further ignore the cluttered background, a positive neural attention back-propagation scheme, called excitation back-propagation (Excitation BP), is introduced in [42]. The method in [2] locates top activations in each convolutional map, and maps these top activation areas into the input image using bilinear interpolation.

Neural attention provides a top-down mechanism to obtain pixel-wise class probabilities using image level labels only. In our pipeline, we adopt excitation BP [42] to calculate pixel-wise class probabilities. However, for images with multiple category labels, a deep neural network could fuse the activations of different categories in the same neurons. To solve this problem, we train a single-label object instance classification network and perform excitation BP in this network to obtain more accurate pixel level class probabilities.



Figure 1. The proposed weakly supervised pipeline. From left to right: (a) Image level stage: fuse the object heatmaps H and the image attention map A_g to generate object instances R for the instance level stage, and provide these two maps for information fusion at the pixel level stage. (b) Instance level stage: perform triplet-loss based metric learning and density based clustering for outlier detection, and train a single-label instance classifier φ_s(·, ·) for instance filtering. (c) Pixel level stage: integrate the object heatmaps H, the instance attention map A_l, and the image attention map A_g for pixel labeling with uncertainty.

Curriculum Learning. Curriculum learning [3] is part of the broad family of machine learning methods that start with easier subtasks and gradually increase the difficulty level of the tasks. In [3], Bengio et al. describe the concept of curriculum learning, and use a toy classification problem to show the advantage of decomposing a complex problem into several easier ones. In fact, the idea behind curriculum learning has been widely used before [3]. Hinton et al. [17] trained a deep neural network layer by layer using a restricted Boltzmann machine [35] to avoid the local minima in deep neural networks. Many machine learning algorithms [36, 14] follow a similar divide-and-conquer strategy in curriculum learning.

In this paper, we adopt this strategy to decompose the pixel labeling problem into image level learning, object instance level learning and pixel level learning. All the learning tasks in these three stages are relatively simple using the training data in the current stage and the output from the previous stage.

3. Weakly Supervised Curriculum Learning

3.1. Overview

Given an image I associated with an image level label vector y_I = [y_1, y_2, ..., y_C]^T, our weakly supervised curriculum learning aims to obtain pixel-wise labels Y_I = [y_1, y_2, ..., y_P]^T, and then use these labels to assist weakly supervised object detection, semantic segmentation and multi-label image classification. Here C is the total number of object classes, P is the total number of pixels in I, and y_l is binary. y_l = 1 means the l-th object class exists in I, and y_l = 0 otherwise. The label of a pixel p is denoted by a C-dimensional binary vector y_p. The number of object classes existing in I, which is the same as the number of positive components of y_I, is denoted by K.


Figure 2. (a) Proposals R_h and R_l generated from an object heatmap, (b) proposals generated from an attention map, (c) filtered proposals (green), heatmap proposals (red and blue), and attention proposals (purple).

Following the divide-and-conquer idea in curriculum learning [3], we decompose the pixel labeling task into three stages: the image level stage, the instance level stage and the pixel level stage.

3.2. Image Level Stage

The image level stage not only decomposes multi-label image classification into a set of single-label object instance classifications, but also provides an initial set of pixel-wise probability maps for the pixel level stage.

Object Heatmaps. Unlike the fully supervised case, weakly supervised object detection produces object instances with higher uncertainty and also misses a higher percentage of true objects. To reduce the number of missing detections, we propose to compute an object heatmap H for every object class existing in the image.

For an image I with width W and height H, a dense set of object proposals R = (R_1, R_2, ..., R_n) is generated using sliding anchor windows, and the feature stride λ_s is set to 8.


The number of locations in the input image where we can place anchor windows is H/λ_s × W/λ_s. Denote the short side of image I by L_ρ. Following the setting used for RPN [28], we let the anchor windows at a single location have four scales [L_ρ/8, L_ρ/4, L_ρ/2, L_ρ] and three aspect ratios [0.5, 1, 2]. After proposals out of image borders have been removed, there are usually 12000 remaining proposals per image. Here we define a stack of object heatmaps H = [H^1, H^2, ..., H^C] as a C × H × W matrix, and all values are set to zero initially. The object detection and classification network φ_d(·, ·) used here is the weakly supervised object testing net VGG-16 from [37]. For every proposal R_i in R, its object class probability vector φ_d(I, R_i) is added to all the pixels in the corresponding window in the heatmaps. Then every heatmap is normalized to [0, 1] as follows,

H^c = (H^c − min(H^c)) / max(H^c),

where H^c is the heatmap for the c-th object class. Note that only the heatmaps for object classes existing in I are normalized. All the other heatmaps are ignored and set to zero.
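A minimal NumPy sketch of this accumulation and normalization step is given below, assuming proposals are provided as (x1, y1, x2, y2) integer boxes; the function name and the small epsilon guard are illustrative, not prescribed by the method.

```python
import numpy as np

def object_heatmaps(image_shape, proposals, class_probs, present_classes):
    """Accumulate and normalize per-class object heatmaps from dense proposals.

    image_shape:     (H, W) of the input image.
    proposals:       iterable of (x1, y1, x2, y2) anchor-window boxes.
    class_probs:     array of shape (n, C); row i is phi_d(I, R_i).
    present_classes: set of class indices whose image-level label is 1.
    """
    H, W = image_shape
    C = class_probs.shape[1]
    heatmaps = np.zeros((C, H, W), dtype=np.float64)

    # Add each proposal's class probability vector to every pixel it covers.
    for (x1, y1, x2, y2), probs in zip(proposals, class_probs):
        heatmaps[:, y1:y2, x1:x2] += probs[:, None, None]

    # Normalize heatmaps of classes present in the image; zero out the rest.
    for c in range(C):
        if c in present_classes:
            h = heatmaps[c]
            heatmaps[c] = (h - h.min()) / (h.max() + 1e-12)  # H^c = (H^c - min) / max
        else:
            heatmaps[c] = 0.0
    return heatmaps
```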

Figure 3. (a) Input proposals of the triplet-loss network, (b) distance map computed using features from the triplet-loss network.

Multiple Evidence Fusion. The object heatmaps highlight the regions that may contain objects even when the level of supervision is very weak. However, since they are generated using sliding anchor windows at multiple scales and aspect ratios, they tend to highlight pixels near but outside true objects, as shown in Fig. 2. Given an image classification network trained using the image level labels (here we use GoogleNet V1 [42]), neural attention calculates the contribution of every pixel to the final classification result. It tends to focus on the most influential regions but not necessarily the entire objects. Note that false positive regions may occur during excitation BP [42]. To obtain more accurate object instances, we integrate the top-down attention maps A_g = [A_g^1, A_g^2, ..., A_g^C] with the object heatmaps H = [H^1, H^2, ..., H^C].

For object classes existing in image I, their corresponding heatmaps H and attention maps A_g are thresholded by distinct values. The heatmaps H are too smooth to indicate accurate object boundaries, but they provide important spatial priors to constrain object instances obtained from the attention maps. We assume that regions with a sufficiently high value in the object heatmaps should at least include parts of objects, and regions with sufficiently low values everywhere do not contain any objects. Following this assumption, we threshold the heatmaps with two values, 0.65 and 0.1, to identify highly confident object proposals R_h = (R_1^h, R_2^h, ..., R_{N_h}^h) and relatively low confident object proposals R_l = (R_1^l, R_2^l, ..., R_{N_l}^l) after connected component extraction. Then the attention maps are thresholded by 0.5 to obtain attention proposals R_a = (R_1^a, R_2^a, ..., R_{N_a}^a) as shown in Fig. 2. N_h, N_l and N_a are the numbers of proposals in R_h, R_l and R_a. All these object proposals have corresponding class labels. During the fusion, for each object class, the attention proposals in R_a which cover more than 0.5 of any proposal in R_h are preserved. We denote these proposals by R, each of which is modified slightly to completely enclose the corresponding proposal in R_h while being completely contained inside the corresponding proposal in R_l (Fig. 2).
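The thresholding and fusion step for a single class can be sketched as follows. The (x1, y1, x2, y2) box convention, the helper names, and the choice of the enclosing low-confidence proposal by maximum coverage are illustrative assumptions; the thresholds follow the values given above.

```python
import numpy as np
from scipy import ndimage

def boxes_from_map(score_map, threshold):
    """Bounding boxes (x1, y1, x2, y2) of connected components above a threshold."""
    labeled, _ = ndimage.label(score_map > threshold)
    return [(xs.start, ys.start, xs.stop, ys.stop)
            for ys, xs in ndimage.find_objects(labeled)]

def coverage(inner, outer):
    """Fraction of the area of `inner` that is covered by `outer`."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = max((inner[2] - inner[0]) * (inner[3] - inner[1]), 1)
    return inter / area

def fuse_proposals(heatmap, attention, th_high=0.65, th_low=0.1, th_att=0.5):
    """Fuse heatmap proposals R_h, R_l and attention proposals R_a for one class."""
    r_h = boxes_from_map(heatmap, th_high)    # highly confident regions
    r_l = boxes_from_map(heatmap, th_low)     # loose, low-confidence regions
    r_a = boxes_from_map(attention, th_att)   # attention proposals
    fused = []
    for ra in r_a:
        for rh in r_h:
            if coverage(rh, ra) <= 0.5:       # ra must cover more than half of rh
                continue
            # Enclose the confident proposal ...
            box = (min(ra[0], rh[0]), min(ra[1], rh[1]),
                   max(ra[2], rh[2]), max(ra[3], rh[3]))
            # ... and clip it to the low-confidence proposal that best contains rh.
            if r_l:
                rl = max(r_l, key=lambda b: coverage(rh, b))
                box = (max(box[0], rl[0]), max(box[1], rl[1]),
                       min(box[2], rl[2]), min(box[3], rl[3]))
            fused.append(box)
    return fused
```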

3.3. Instance Level Stage

Since multiple object categories present in the same image make it hard for neural attention to obtain an accurate pixel-wise attention map for each class, we train a single-label object instance classification network and compute attention maps in this network to obtain more accurate pixel level class probabilities. The fused object instances from the image level stage are further filtered by metric learning and density-based clustering. The remaining labeled object proposals are used for training this object instance classifier, which can also be used to further remove remaining false positive object instances.

Metric Learning for Feature Embedding. Metric learning is popular in face recognition [33], person re-identification and object tracking [33, 44, 38]. It embeds an image X into a multi-dimensional feature space by associating this image with a fixed size vector, φ_t(X, ·), in the feature space. This embedding makes similar images close to each other and dissimilar images apart in the feature space. Thus the similarity between two images can be measured by their distance in this space. The triplet-loss network φ_t(·, ·) proposed in [33] has the additional property that it can well separate classes even when intra-class distances have large variations. When there exist training samples associated with incorrect class labels, the loss stays at a high value and the distances between correctly labeled and mislabeled samples remain very large even after the training process has run for a long time. Now let R = [R_1, R_2, ..., R_O]^T denote the fused object instances from all training images in the image level stage, and Y = [y_1, y_2, ..., y_O]^T their labels.


Here O is the total number of fused instances, and y_l is the label vector of instance R_l. We train a triplet-loss network φ_t(·, ·) using GoogleNet V2 with BatchNorm as in [33]. Each mini-batch first chooses b object classes randomly, and then randomly chooses a instances from these classes. These instances are cropped out from the training images and fed into φ_t(·, ·). Fig. 3 visualizes a mini-batch composition and the corresponding pairwise distances among instances.
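A minimal sketch of this class-balanced mini-batch sampling and of the pairwise distance computation follows; the default values b = 8 and a = 4 and the function names are illustrative assumptions.

```python
import numpy as np

def sample_triplet_batch(labels, b=8, a=4, rng=np.random):
    """Class-balanced mini-batch sampling for the triplet-loss network.

    labels: 1-D array holding the class index of every fused object instance.
    b, a:   number of classes per batch and instances per class (illustrative).
    Returns indices of the instances to crop and feed into phi_t.
    """
    classes = rng.choice(np.unique(labels), size=b, replace=False)
    batch = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        batch.extend(rng.choice(idx, size=a, replace=len(idx) < a))
    return np.asarray(batch)

def pairwise_distances(embeddings):
    """Squared Euclidean distance matrix between instance embeddings."""
    gram = embeddings @ embeddings.T
    sq = np.diag(gram)
    return np.maximum(sq[:, None] - 2.0 * gram + sq[None, :], 0.0)
```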

Clustering for Outlier Removal. Clustering aims to remove outliers that are less similar to other object instances in the same class. Specifically, we perform density based clustering [30] to form a single cluster of normal instances within each object class independently, and instances outside this cluster are considered outliers. This is different from the scheme in [30]. Let R_c denote the instances in R with class label c, and N_c the number of instances in R_c. We calculate the pairwise distances d(·, ·) among these instances, and obtain the N_c × N_c distance matrix D_c. For an instance R_n^c, if its distance from another instance is less than λ_d (= 0.8), its density d_n^c is increased by 1. We rank these instances by their densities in descending order, and choose the instance ranked at the top as the seed of the cluster. Then we add instances to the cluster following the descending order if their distance to any element in the cluster is less than λ_d and their density is higher than N_c/4.
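The clustering rule above can be sketched for one object class as follows, assuming the pairwise distance matrix has already been computed from the triplet-loss embeddings; the function name is illustrative.

```python
import numpy as np

def cluster_inliers(dist, lambda_d=0.8):
    """Keep the single dense cluster of one object class; the rest are outliers.

    dist: (N_c, N_c) pairwise distance matrix computed from triplet-loss features.
    Returns the indices of the instances kept as normal samples.
    """
    n = dist.shape[0]
    # Density of an instance: number of other instances closer than lambda_d.
    density = (dist < lambda_d).sum(axis=1) - 1   # subtract the zero self-distance
    order = np.argsort(-density)                  # descending density
    cluster = [order[0]]                          # seed: the densest instance
    for i in order[1:]:
        near_cluster = dist[i, cluster].min() < lambda_d
        if near_cluster and density[i] > n / 4.0:
            cluster.append(i)
    return np.asarray(cluster)
```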

Instance Classifier for Re-labeling. Since metric learning and clustering screen object instances in an aggressive way and may heavily decrease their recall, we use the normal instances surviving the previous clustering step to train an instance classifier, which is in turn used to re-label all object proposals generated in the image level stage. This is a single-label classification problem as each object instance is allowed a single label. GoogleNet V1 with the SoftMax loss serves as the classifier φ_s(·, ·), and it is fine-tuned from the image level classifier. For every object proposal generated in the previous image level stage, if its label predicted by the instance classifier does not match its original label, it is labeled as an outlier and permanently discarded.
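The re-labeling filter reduces to a simple agreement check, sketched below; the callable standing in for φ_s(·, ·) and the function name are illustrative.

```python
import numpy as np

def relabel_and_filter(crops, labels, instance_classifier):
    """Discard proposals whose re-predicted class disagrees with their original label.

    crops:               cropped object instances from the image level stage.
    labels:              original class index of each crop.
    instance_classifier: callable returning class scores for a crop
                         (stands in for the single-label classifier phi_s).
    Returns indices of the proposals that are kept.
    """
    kept = []
    for i, (crop, label) in enumerate(zip(crops, labels)):
        if int(np.argmax(instance_classifier(crop))) == label:
            kept.append(i)          # label mismatches are permanently discarded
    return kept
```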

3.4. Pixel Level Stage

In previous stages, we have already built an image classifier, a weakly supervised object detector, and an object instance classifier. Each of these deep networks produces its own inference result from the input image. For example, the image classifier generates a global attention map, and the object detector generates the object heatmaps. In the pixel level stage, we still perform multi-evidence filtering and fusion to integrate the inference results from all these component networks to obtain the pixel-wise probability map indicating potential object categories at every pixel. The global attention map A_g from the image classifier has a full knowledge about the objects in an image but sometimes only focuses on the most important object parts. The object instance classifier has a local view of each individual object. With the help of object-specific local attention maps generated from the instance classifier, we can avoid missing small objects.

Instance Attention Map. Here we define the instance attention map A_l as a C × H × W matrix, and all values are zero initially. For every surviving object instance from the instance level stage, the object instance classifier φ_s(·, ·) is used to extract its local attention map, which is added to the corresponding region in the instance attention map A_l. The range of A_l is then normalized to [0, 1] as we did for the object heatmaps.

Probability Map Integration. The final attention map A is obtained by calculating the element-wise maximum between the image attention map A_g and the instance attention map A_l. That is, A = max(A_l, A_g). For both the heatmap H and the attention map A, only the classes existing in the image are considered. The background maps of A and H are defined as follows,

A^0 = max(0, 1 − Σ_{l=1}^{C} y_l A^l),

H^0 = max(0, 1 − Σ_{l=1}^{C} y_l H^l).

Now both A and H become (C + 1) × H × W matrices. For the l-th channel, if y_l = 0, then A^l = 0 and H^l = 0. Then we perform softmax on both maps along the channel dimension independently. The final probability map P is defined as the result of applying softmax to the element-wise product between A and H, by treating H as a filter. That is, P = softmax(H ⊙ A).

Pixel Labeling with Uncertainty. Pixel labels Y_I are initialized with the probability map P. For every pixel p, if the maximum element in its label vector y_p is larger than a threshold (= 0.6), we simply set the maximum element to 1 and the other elements to 0; otherwise, the class label at p is uncertain.
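The whole integration and uncertainty-thresholding step can be sketched as follows under the equations above; treating the background channel as index 0 and the exact composition of the two channel-wise softmaxes are our reading of the description, so the sketch should be taken as illustrative.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pixel_labels(H, A, y, tau=0.6):
    """Fuse heatmaps H and attention maps A into per-pixel labels with uncertainty.

    H, A: (C, h, w) object heatmaps and fused attention maps (A = max(A_l, A_g)).
    y:    (C,) binary image-level label vector.
    tau:  confidence threshold below which a pixel is left uncertain.
    Returns (labels, certain): labels in {0..C} (0 = background) and a boolean
    mask marking pixels whose label is considered certain.
    """
    # Background channels A^0 and H^0 as defined above.
    A0 = np.maximum(0.0, 1.0 - (y[:, None, None] * A).sum(axis=0))
    H0 = np.maximum(0.0, 1.0 - (y[:, None, None] * H).sum(axis=0))
    A = np.concatenate([A0[None], A * y[:, None, None]], axis=0)
    H = np.concatenate([H0[None], H * y[:, None, None]], axis=0)

    # Channel-wise softmax on both maps, then P = softmax(H * A).
    P = softmax(softmax(H, axis=0) * softmax(A, axis=0), axis=0)

    labels = P.argmax(axis=0)
    certain = P.max(axis=0) > tau
    return labels, certain
```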

4. Object Recognition, Detection and Segmentation

4.1. Semantic Segmentation

Given the pixel-wise labels generated at the end of the pixel level stage for all training images, we train a fully convolutional network (FCN) similar to the network in [26] to perform semantic segmentation. Note that all pixels with uncertain class labels are excluded during training. In the prediction part, we adopt atrous spatial pyramid pooling as in [5]. The resulting trained network can be used for labeling all pixels in any testing image as well as pixels with uncertain labels in all training images.
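Excluding uncertain pixels amounts to masking them out of the segmentation loss. A minimal per-image sketch of such a masked cross-entropy, assuming the uncertainty mask produced at the pixel level stage, is shown below; the function name is illustrative.

```python
import numpy as np

def masked_cross_entropy(logits, labels, certain):
    """Per-image segmentation loss that ignores pixels with uncertain labels.

    logits:  (C+1, h, w) raw class scores predicted by the FCN.
    labels:  (h, w) integer label map produced at the pixel level stage.
    certain: (h, w) boolean mask; False marks pixels excluded from training.
    """
    # Log-softmax over the class dimension.
    z = logits - logits.max(axis=0, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=0, keepdims=True))

    picked = np.take_along_axis(log_prob, labels[None], axis=0)[0]  # (h, w)
    return -(picked * certain).sum() / max(certain.sum(), 1)
```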


Figure 4. The pixel labeling process in the pixel level stage: (a) input image, (b) object heatmap, (c) image attention, (d) instance attention, (e) probability map, (f) segmentation. White pixels in the last column indicate pixels with uncertain labels.

4.2. Object Detection

Once all pixels with uncertain labels in the training images have been re-labeled using the above network for semantic segmentation, we generate object instances in these images by computing bounding boxes of connected pixels sharing the same semantic label. As in [37] and [24], we train Fast R-CNN [13] using these bounding boxes and their associated labels. Since the bounding boxes generated from the semantic label maps may contain noise, we filter them using our object instance classifier as in Section 3.3. VGG-16 is still the base network of our object detector, which is trained with five scales and flip as in [37].
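Extracting the bounding boxes of connected pixels sharing the same label can be sketched with standard connected-component labeling; the SciPy-based helper below is an illustrative sketch, not our exact implementation.

```python
import numpy as np
from scipy import ndimage

def boxes_from_label_map(label_map, num_classes):
    """Turn a semantic label map into class-labeled bounding boxes.

    label_map: (h, w) integer map, 0 = background, 1..num_classes = objects.
    Returns a list of (class_index, (x1, y1, x2, y2)) tuples.
    """
    boxes = []
    for c in range(1, num_classes + 1):
        components, _ = ndimage.label(label_map == c)
        for sl in ndimage.find_objects(components):
            ys, xs = sl
            boxes.append((c, (xs.start, ys.start, xs.stop, ys.stop)))
    return boxes
```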

4.3. Multi-label Classification

The main component in our multi-label classification network is the structure of ResNet-101 [16]. There are two branches after layer res4b22_relu of the main component, one branch for classification and the other for semantic segmentation. Both branches share the same structure after layer res4b22_relu. Here we adopt multi-task learning to train both branches. The idea is to use the training data for the segmentation branch to make the convolutional kernels in the main component more discriminative and powerful. This network architecture is shown in the supplemental materials. Layer pool5 of ResNet-101 in the classification branch is removed, and the output X ∈ R^{14×14×2048} of layer res5c is a 14 × 14 × 2048 matrix. X is directly fed into a 2048 × 1 × 1 × C convolutional layer, and a classification map Y_cls ∈ R^{14×14×C} is obtained. We let the semantic label map Y_seg ∈ R^{14×14×C} play the role of an attention map Y_att after the summation over each channel of the semantic label map is normalized to 1. The final image level probability vector y is the result of spatial average pooling over the element-wise product between Y_cls and Y_att. Here Y_att is used to identify important image regions and assign them larger weights. At the end, the probability vector y is fully connected to an output layer, which performs binary classification for each of the C classes. The cross-entropy loss is used for training the multi-label classification network. The segmentation branch uses atrous spatial pyramid pooling to perform semantic segmentation, and softmax is applied to enforce a single label per pixel.
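The aggregation performed by the classification branch can be sketched as follows; the function name and the epsilon guard on the per-channel normalization are illustrative assumptions.

```python
import numpy as np

def classification_scores(Y_cls, Y_seg):
    """Aggregate the 14x14 classification map using the segmentation branch.

    Y_cls: (14, 14, C) classification map from the res5c branch.
    Y_seg: (14, 14, C) semantic label map (scores) from the segmentation branch.
    Returns the C-dimensional image-level score vector y.
    """
    # Normalize each channel of the label map so it sums to 1 -> attention map.
    channel_sums = Y_seg.sum(axis=(0, 1), keepdims=True)
    Y_att = Y_seg / np.maximum(channel_sums, 1e-12)

    # Element-wise product followed by spatial average pooling.
    return (Y_cls * Y_att).mean(axis=(0, 1))
```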

5. Experimental Results

All our experiments are implemented using Caffe [18] and run on an NVIDIA TITAN X (Maxwell) GPU with 12GB memory. The hyper-parameters in Section 3 are set according to common sense and confirmed after we visually verify that the segmentation results on a few training samples are valid. The same parameter setting is used for all datasets and has not been tuned on any validation sets.

5.1. Semantic Segmentation

Datasets and performance measures. The Pascal VOC 2012 dataset [11] serves as a benchmark in most existing work on weakly-supervised semantic segmentation. It has 21 classes and 10582 training images (the VOC 2012 training set and additional data annotated in [15]), 1449 for validation and 1456 for testing. Only image tags are used as training data in our experiments. We report results on both the validation (supplemental materials) and test sets.

Implementation details. Our network is based on VGG-16. The layers after relu5_3 and layer pool4 are removed. Dilations in layers conv5_1, conv5_2, and conv5_3 are set to 2. The feature stride λ_s at layer relu5_3 is 8. We add atrous spatial pyramid pooling as in DeepLab V3 [5] after layer relu5_3. The dilations in our atrous spatial pyramid pooling layers are [1, 2, 4, 6]. This FCN is implemented in py-faster-rcnn [29]. For data augmentation, we use five image scales (480, 576, 688, 864, 1024) (the shorter side is resized to one of these scales) and horizontal flip, and cap the longer side at 1200. During testing, the original size of an input image is preserved. The network is fine-tuned from the pre-trained model for ImageNet in [34]. The learning rate γ is set to 0.001 in the first 20k iterations, and 0.0001 in the next 20k iterations. The weight decay is 0.0005, and the mini-batch size is 1. Post-processing using CRF [22] is added during testing.

Result comparison. We compare our method with existing state-of-the-art algorithms. Table 1 lists the results of weakly supervised semantic segmentation on Pascal VOC 2012. The proposed method achieves 55.6% mean IoU, comparable to the state of the art (AE-PSL [?]).

4326

Page 7: Abstract arXiv:1802.09129v1 [cs.CV] 26 Feb 2018 · arXiv:1802.09129v1 [cs.CV] 26 Feb 2018. ... Weakly supervised object detection/localization is usu-ally performed in a bottom-up

Figure 5. Detection and semantic segmentation results on the Pascal VOC 2012 test set (first row) and the Pascal VOC 2007 test set (second row). The detection results are obtained by selecting the proposal with the highest confidence for every class. The semantic segmentation results are post-processed by CRF [22].

method        bg   aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbike person plant sheep sofa train tv   mIoU

SEC [21]      83.5 56.4 28.5 64.1 23.6 46.5 70.6 58.5 71.3 23.2 54.0 28.0 68.1 62.1 70.0 55.0 38.4 58.0 39.9 38.4 48.3  51.7
FCL [31]      85.7 58.8 30.5 67.6 24.7 44.7 74.8 61.8 73.7 22.9 57.4 27.5 71.3 64.8 72.4 57.3 37.0 60.4 42.8 42.2 50.6  53.7
TP-BM [20]    83.4 62.2 26.4 71.8 18.2 49.5 66.5 63.8 73.4 19.0 56.6 35.7 69.3 61.3 71.7 69.2 39.1 66.3 44.8 35.9 45.5  53.8
AE-PSL [?]    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -     55.7
Ours+CRF      86.6 72.0 30.6 68.0 44.8 46.2 73.4 56.6 73.0 18.9 63.3 32.0 70.1 72.2 68.2 56.1 34.5 67.5 29.6 60.2 43.6  55.6

Table 1. Comparison among weakly supervised semantic segmentation methods on the PASCAL VOC 2012 segmentation test set.

Recent algorithms, including AE-PSL [?], F-B [32], FCL [31], and SEC [21], all conduct end-to-end training to learn object score maps. Our method demonstrates that if we filter and integrate multiple types of intermediate evidence at different granularities during weakly supervised training, the results become equally competitive or even better.

5.2. Object Detection

Datasets and performance measures. The performance of our object detector in Section 4.2 is evaluated on the popular Pascal VOC 2007 and Pascal VOC 2012 datasets [11]. Each of these two datasets is divided into train, val and test sets. The trainval sets (5011 images for 2007 and 11540 images for 2012) are used for training, and only image tags are used. Two measures are used to test our model: mAP and CorLoc. According to the standard Pascal VOC protocol, the mean average precision (mAP) is used for testing our trained models on the test sets, and the correct localization (CorLoc) is used for measuring the object localization accuracy [6] on the trainval sets whose image tags are already used as training data.

Implementation details. We use the code for py-faster-rcnn [29] to implement Fast R-CNN [13]. The network is still VGG-16. The learning rate is set to 0.001 in the first 30k iterations, and 0.0001 in the next 10k iterations. The momentum and weight decay are set to 0.9 and 0.0005 respectively. We follow the same data augmentation setting as in [37], use five image scales (480, 576, 688, 864, 1200) and horizontal flip, and cap the longer image side at 2000.

Result comparison. Object detection results on the Pascal VOC 2007 test set (Table 2) and the Pascal VOC 2012 test set (supplemental materials) are reported. Object localization results on the Pascal VOC 2007 trainval set and the Pascal VOC 2012 trainval set are also reported (supplemental materials). On the Pascal VOC 2012 test set, our algorithm achieves the highest mAP (47.5%), at least 5.0% higher than the latest state-of-the-art algorithms including OICR [37] and HCP+DSD+OSSH3 [19]. Our trained model also achieves the highest mAP (51.2%) among all weakly supervised algorithms on the Pascal VOC 2007 test set, 4.2% higher than the latest result from [37]. The object localization accuracies (CorLoc) of our trained model on the Pascal VOC 2007 trainval set and the Pascal VOC 2012 trainval set are respectively 67% and 69.4%, which are 2.7% and 3.8% higher than the previous best.

5.3. Multi-Label Classification

Dataset and performance measures. Microsoft COCO [25] is the most popular dataset in multi-label classification. MS-COCO was primarily built for object recognition tasks in the context of scene understanding. The training set is composed of 82081 images in 80 classes, with on average 2.9 object labels per image. Since the groundtruth labels of the test set are not available, performance evaluation is conducted on the validation set with 40504 images. We train our models on the training set and test them on the validation set.



method                      aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbike person plant sheep sofa train tv   mAP

OM+MIL+FRCNN [24]           54.5 47.4 41.3 20.8 17.7 51.9 63.5 46.1 21.8 57.1 22.1 34.4 50.5 61.8 16.2 29.9 40.7 15.9 55.3 40.2  39.5
HCP+DSD+OSSH3 [19]          54.2 52.0 35.2 25.9 15.0 59.6 67.9 58.7 10.1 67.4 27.3 37.8 54.8 67.3  5.1 19.7 52.6 43.5 56.9 62.5  43.7
OICR-Ens+FRCNN [37]         65.5 67.2 47.2 21.6 22.1 68.0 68.5 35.9  5.7 63.1 49.5 30.3 64.7 66.1 13.0 25.6 50.0 57.1 60.2 59.0  47.0
Ours+FRCNN w/o clustering   66.7 61.8 55.3 41.8  6.7 61.2 62.5 72.8 12.7 46.2 40.9 71.0 67.3 64.7 30.9 16.7 42.6 56.0 65.0 26.5  48.5
Ours+FRCNN w/o uncertainty  66.8 63.4 54.5 42.2  5.8 60.5 58.3 67.8  7.8 46.1 40.3 71.0 68.2 62.6 30.7 16.5 41.1 55.2 66.8 25.2  47.5
Ours+FRCNN w/o instances    67.7 62.9 53.1 44.4 11.2 62.4 58.5 71.2  8.3 45.7 41.5 71.0 68.0 59.2 30.3 15.0 42.4 56.0 67.2 26.8  48.1
Ours+FRCNN                  64.3 68.0 56.2 36.4 23.1 68.5 67.2 64.9  7.1 54.1 47.0 57.0 69.3 65.4 20.8 23.2 50.7 59.6 65.2 57.0  51.2

Table 2. Average precision (in %) of weakly supervised methods on the PASCAL VOC 2007 detection test set.

method                          F1-C P-C  R-C  F1-O P-O  R-O  F1-C/top3 P-C/top3 R-C/top3 F1-O/top3 P-O/top3 R-O/top3

CNN-RNN [40]                    -    -    -    -    -    -    60.4 66.0 55.6 67.8 69.2 66.4
RLSD [43]                       -    -    -    -    -    -    62.0 67.6 57.2 66.5 70.1 63.4
RNN-Attention [41]              -    -    -    -    -    -    67.4 79.1 58.7 72.0 84.0 63.0
ResNet101-SRN [45]              70.0 81.2 63.3 75.0 84.1 67.7 66.3 85.8 57.5 72.1 88.1 61.1
ResNet101 (448×448) (baseline)  72.8 73.8 72.9 76.3 77.5 75.1 69.5 78.3 63.7 73.1 83.8 64.9
Ours                            74.9 80.4 70.2 78.4 85.2 72.5 70.6 84.5 62.2 74.7 89.1 64.3

Table 3. Performance comparison among multi-label classification methods on the Microsoft COCO 2014 validation set.

Performance measures for multi-label classification are quite different from those for single-label classification. Following [45, 41], we employ macro/micro precision, macro/micro recall, and macro/micro F1-measure to evaluate our trained models. For precision, recall and F1-measure, labels with confidence higher than 0.5 are considered positive. "P-C", "R-C" and "F1-C" represent the average per-class precision, recall and F1-measure while "P-O", "R-O" and "F1-O" represent the average overall precision, recall and F1-measure. These measures do not require a fixed number of labels per image. To compare with existing state-of-the-art algorithms, we also report the results of top-3 labels with confidence higher than 0.5 as in [41].
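For reference, a minimal sketch of the per-class (macro) and overall (micro) measures is given below; it follows one common convention for averaging the per-class F1, and the function name and small epsilon guards are illustrative.

```python
import numpy as np

def multilabel_metrics(scores, gt, threshold=0.5):
    """Per-class (macro) and overall (micro) precision / recall / F1.

    scores: (N, C) predicted label confidences.
    gt:     (N, C) binary ground-truth label matrix.
    """
    pred = scores > threshold
    tp = np.logical_and(pred, gt).sum(axis=0).astype(float)
    p_c = tp / np.maximum(pred.sum(axis=0), 1)      # per-class precision
    r_c = tp / np.maximum(gt.sum(axis=0), 1)        # per-class recall
    f1_c = 2 * p_c * r_c / np.maximum(p_c + r_c, 1e-12)

    p_o = tp.sum() / max(pred.sum(), 1)             # overall precision
    r_o = tp.sum() / max(gt.sum(), 1)               # overall recall
    f1_o = 2 * p_o * r_o / max(p_o + r_o, 1e-12)

    return {"P-C": p_c.mean(), "R-C": r_c.mean(), "F1-C": f1_c.mean(),
            "P-O": p_o, "R-O": r_o, "F1-O": f1_o}
```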

Implementation details. Our main network for multi-label classification is ResNet-101 as described earlier. The resolution of the input images is 448 × 448. We first train a network with the classification branch only. As a common practice, a pre-trained model for ImageNet is fine-tuned with the learning rate γ set to 0.001 in the first 20k iterations, and 0.0001 in the next 20k iterations. The weight decay is 0.0005. Then we add the segmentation branch and train this new branch only by fixing all the layers before layer res4b22_relu and the classification branch. The learning rate is set to 0.001 in the first 20k iterations, and 0.0001 in the next 20k iterations. At last, we train the entire network with both branches using the cross-entropy loss for multi-label classification for 30k iterations with a learning rate of 0.0001 while still fixing the layers before layer res4b22_relu.

Result comparison. In addition to our two-branch network, we also train a ResNet-101 classification network as our baseline. The multi-label classification performance of both networks on MS-COCO is reported in Table 3. Since the input resolution of our baseline is 448 × 448, in comparison to the latest work (ResNet101-SRN) [45], the performance of our baseline is slightly better. Specifically, the F1-C of our baseline is 72.8%, which is 2.8% higher than the F1-C of ResNet101-SRN. In comparison to the baseline, our two-branch network further achieves overall better performance. Specifically, the P-C of our two-branch network is 6.6% higher than the baseline, the R-C is 2.7% lower, and the F1-C is 2.1% higher. All F1-measures (F1-C, F1-O, F1-C/top3 and F1-O/top3) of our two-branch network are the highest among all state-of-the-art algorithms.

5.4. Ablation Study

We perform an ablation study on the Pascal VOC 2007 detection test set by replacing or removing a single component in our pipeline every time. First, to verify the importance of object instances, we remove all steps related to object instances, including the entire instance level stage and the operations related to the instance attention map in the pixel level stage. The mAP is decreased by 3.1% as shown in Table 2. Second, the clustering and outlier detection step in the instance level stage is removed. We directly train an instance classifier using the object proposals from the image level stage. The mAP is decreased by 2.7%. Third, instead of labeling a subset of pixels only in the pixel level stage, we assign a unique label to every pixel even in the case of low confidence. The mAP drops to 47.5%, 3.7% lower than the performance of the original pipeline.

6. Conclusions

In this paper, we have presented a new pipeline for weakly supervised object recognition, detection and segmentation. Different from previous algorithms, we fuse and filter object instances from different techniques and perform pixel labeling with uncertainty. We use the resulting pixel-wise labels to generate groundtruth bounding boxes for object detection and attention maps for multi-label classification. Our pipeline has achieved clearly better performance in all of these tasks. Nevertheless, how to simplify the steps in our pipeline deserves further investigation.


References

[1] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Muller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
[2] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.
[4] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2846–2854, 2016.
[5] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[6] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. International Journal of Computer Vision, 100(3):275–293, 2012.
[7] A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, and L. Van Gool. Weakly supervised cascaded convolutional networks. arXiv preprint arXiv:1611.08258, 2016.
[8] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1):31–71, 1997.
[9] T. Durand, T. Mordan, N. Thome, and M. Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017.
[10] T. Durand, N. Thome, and M. Cord. Weldon: Weakly supervised learning of deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4743–4752, 2016.
[11] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
[12] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[13] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[14] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003, 2017.
[15] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 991–998. IEEE, 2011.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[17] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[19] Z. Jie, Y. Wei, X. Jin, J. Feng, and W. Liu. Deep self-taught learning for weakly supervised object localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[20] D. Kim, D. Cho, D. Yoo, and I. So Kweon. Two-phase learning for weakly supervised object localization. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[21] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision, pages 695–711. Springer, 2016.
[22] P. Krahenbuhl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in Neural Information Processing Systems, pages 109–117, 2011.
[23] S. Lapuschkin, A. Binder, G. Montavon, K.-R. Muller, and W. Samek. Analyzing classifiers: Fisher vectors and deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2912–2920, 2016.
[24] D. Li, J.-B. Huang, Y. Li, S. Wang, and M.-H. Yang. Weakly supervised object localization with progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3512–3520, 2016.
[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[27] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? - weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685–694, 2015.
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[29] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[30] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014.
[31] A. Roy and S. Todorovic. Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3529–3538, 2017.
[32] F. Saleh, M. S. A. Akbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez. Built-in foreground/background prior for weakly-supervised semantic segmentation. In European Conference on Computer Vision, pages 413–432. Springer, 2016.
[33] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[35] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, University of Colorado at Boulder, Dept. of Computer Science, 1986.
[36] L. Sun, Q. Huo, W. Jia, and K. Chen. A robust approach for text detection from natural scene images. Pattern Recognition, 48(9):2906–2920, 2015.
[37] P. Tang, X. Wang, X. Bai, and W. Liu. Multiple instance detection network with online instance classifier refinement. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[38] G. Tsagkatakis and A. Savakis. Online distance metric learning for object tracking. IEEE Transactions on Circuits and Systems for Video Technology, 21(12):1810–1821, 2011.
[39] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[40] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu. Cnn-rnn: A unified framework for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2285–2294, 2016.
[41] Z. Wang, T. Chen, G. Li, R. Xu, and L. Lin. Multi-label image recognition by recurrently discovering attentional regions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 464–472, 2017.
[42] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. In European Conference on Computer Vision, pages 543–559. Springer, 2016.
[43] J. Zhang, Q. Wu, C. Shen, J. Zhang, and J. Lu. Multi-label image classification with regional latent semantic dependencies. arXiv preprint arXiv:1612.01082, 2016.
[44] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
[45] F. Zhu, H. Li, W. Ouyang, N. Yu, and X. Wang. Learning spatial regularization with image-level supervisions for multi-label image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[46] C. L. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages 391–405. Springer, 2014.
