




www.data61.csiro.au

FOR FURTHER INFORMATION
Fatemehsadat Saleh, E: [email protected]

Built-in Foreground/Background Prior for Weakly-Supervised Semantic Segmentation
Fatemehsadat Saleh (1,2), M. Sadegh Ali Akbarian (1,2), Mathieu Salzmann (3), Lars Petersson (1,2), Stephen Gould (1), and Jose M. Alvarez (1,2)

(1) The Australian National University (ANU)
(2) Data61-CSIRO, Australia
(3) CVLab, EPFL, Switzerland

Introduction

Goal: Assigning a semantic label to every pixel in the image.

Problem: Acquiring a huge amount of pixel-level annotations is expensive.
Our approach: We aim to use one of the weakest levels of annotation, image-level tags.
Drawbacks of current approaches using image-level tags:
• Poor localization and inaccurate object boundaries.
• Additional priors require pixel-level annotations or bounding boxes.

[Figure: different types of annotation used in related works: pixel-level, image-level [2,4,5], point-level [3], and bounding-box [4,2] annotations]

Contributions
• A method to extract accurate masks from a network pre-trained for object recognition.
• A novel loss function to incorporate these masks during training.
• A new form of weak supervision, where the user selects the best mask among several automatically generated candidates.

Our Method

Built-in Foreground/Background Model
From a network pre-trained on ImageNet, we propose to exploit the unit activations of the hidden layers to extract a foreground/background mask. We use the fourth and fifth convolutional layers to compute foreground probabilities, which act as unary potentials in a fully-connected CRF [9].

[Figure: mask extraction pipeline: image → fourth conv. → fifth conv. → fusion → mask]
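The pipeline can be sketched in a few lines of Python. This is our illustrative reconstruction, not the authors' released code: the fusion rule (averaging), the resizing scheme, and the CRF parameters are assumptions; only the overall recipe (hidden-layer activations → foreground probability → dense-CRF unaries [9]) comes from the poster. It assumes the `pydensecrf` package.

```python
# Illustrative sketch (not the authors' code): build a foreground/background
# mask from conv4/conv5 activations and refine it with a dense CRF [9].
# The fusion rule, resizing, and CRF parameters below are assumptions.
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def foreground_mask(conv4, conv5, image, n_iters=5):
    """conv4, conv5: (C, h, w) activations; image: (H, W, 3) uint8 RGB."""
    H, W = image.shape[:2]

    def to_prob(act):
        # Channel-wise mean activation as a coarse foreground heat map,
        # min-max normalised and nearest-neighbour resized to image size.
        heat = act.mean(axis=0)
        heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
        ys = np.arange(H) * heat.shape[0] // H
        xs = np.arange(W) * heat.shape[1] // W
        return heat[np.ix_(ys, xs)]

    fg = 0.5 * (to_prob(conv4) + to_prob(conv5))         # fuse the two layers
    probs = np.stack([1.0 - fg, fg]).astype(np.float32)  # (2, H, W) bg/fg

    # Foreground probabilities act as unary potentials in a fully connected
    # CRF; the pairwise terms encourage sharp, image-aligned boundaries.
    d = dcrf.DenseCRF2D(W, H, 2)
    d.setUnaryEnergy(unary_from_softmax(np.clip(probs, 1e-8, 1.0)))
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=60, srgb=10, compat=5,
                           rgbim=np.ascontiguousarray(image))
    q = np.array(d.inference(n_iters)).reshape(2, H, W)
    return q.argmax(axis=0)  # 1 = foreground, 0 = background
```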

Benefits
• The foreground/background mask is extracted without relying on an external method.
• No additional annotations are required.

Novel Loss Function

[Figure: semantic segmentation network predicting per-pixel labels such as "horse" and "person"]



The loss encodes three constraints (a hedged formula sketch follows the list):
• Present classes should appear in the foreground mask.
• No pixel in the whole image should take on an absent label.
• Pixels predicted as background should be assigned to the background class.
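The loss equation itself did not survive extraction. As a hedged reconstruction only (the notation $s_{i,j,k}$, $M$, $\mathcal{L}$, $\mathcal{L}_I$ and the exact normalisations are our guesses, not the paper's verbatim formula), a loss encoding the three constraints could take the form:

\[
\mathcal{L} \;=\;
-\frac{1}{|\mathcal{L}_I|}\sum_{k\in\mathcal{L}_I}\log\!\Big(\frac{1}{|M|}\sum_{(i,j)\in M} s_{i,j,k}\Big)
\;-\;\frac{1}{|\mathcal{L}|-1}\sum_{\substack{k\notin\mathcal{L}_I \\ k\neq 0}}\log\!\Big(1-\max_{(i,j)} s_{i,j,k}\Big)
\;-\;\frac{1}{|\bar{M}|}\sum_{(i,j)\in\bar{M}}\log s_{i,j,0},
\]

where $s_{i,j,k}$ is the network's score for label $k$ at pixel $(i,j)$, $M$ and $\bar{M}$ are the predicted foreground and background masks, label $0$ is the background, and $\mathcal{L}_I\subset\mathcal{L}$ is the set of tags present in image $I$. The three terms match the three constraints in order.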

Our Method (Cont.)

Novel Weak Supervision

• The mask obtained by inference in the CRF is not always the desired one.
• We generate multiple, diverse, low-energy predictions (the M-best problem [1]); a sketch follows this list.
• A user can then decide which prediction is the best one.
• This manual selection takes roughly 2-3 seconds per image.
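A minimal sketch of the candidate-generation step, in the spirit of the diverse M-best approach [1]. This is our illustration, not the released implementation; the candidate count `m` and diversity weight `lam` are made-up values. After each CRF solution, the labels just used are made slightly more expensive in the unaries, pushing the next inference run towards a different, still low-energy mask:

```python
# Sketch of diverse M-best mask generation [1] (illustrative, not the
# authors' code); `m` and the diversity weight `lam` are assumptions.
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def diverse_masks(probs, image, m=5, lam=2.0, n_iters=5):
    """probs: (2, H, W) bg/fg probabilities; returns m diverse binary masks."""
    n_labels, H, W = probs.shape
    unary = unary_from_softmax(np.clip(probs, 1e-8, 1.0))  # (2, H*W)
    masks = []
    for _ in range(m):
        d = dcrf.DenseCRF2D(W, H, n_labels)
        d.setUnaryEnergy(np.ascontiguousarray(unary))
        d.addPairwiseGaussian(sxy=3, compat=3)
        d.addPairwiseBilateral(sxy=60, srgb=10, compat=5,
                               rgbim=np.ascontiguousarray(image))
        q = np.array(d.inference(n_iters)).reshape(n_labels, H * W)
        labels = q.argmax(axis=0)
        masks.append(labels.reshape(H, W))
        # Diversity term: penalise re-using the labeling just found, so the
        # next round of inference yields a different low-energy solution.
        unary = unary.copy()
        unary[labels, np.arange(H * W)] += lam
    return masks  # a user then picks the best candidate (~2-3 s per image)
```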

Experiments

Network Structure

Mask Evaluation

Semantic Segmentation Results

References:
[1] Batra, D., et al.: Diverse M-best solutions in Markov random fields. In: ECCV (2012)
[2] Pinheiro, P.O., et al.: From image-level to pixel-level labeling with convolutional networks. In: CVPR (2015)
[3] Bearman, A., et al.: What's the point: Semantic segmentation with point supervision. arXiv e-prints (2015)
[4] Papandreou, G., et al.: Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In: ICCV (2015)
[5] Pathak, D., et al.: Constrained convolutional neural networks for weakly supervised segmentation. In: ICCV (2015)
[6] Wei, Y., et al.: Learning to segment with image-level annotations. Pattern Recognition (2016)
[7] Wei, Y., et al.: STC: A simple to complex framework for weakly-supervised semantic segmentation. arXiv e-prints (2015)
[8] Alexe, B., et al.: Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence (2012)
[9] Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: NIPS (2011)

Mean IoU on PASCAL VOC using additional supervision:

Method                            Validation   Test
[2] MIL + bbox                    37.8%        37.0%
[2] MIL + seg                     42.0%        40.6%
[6] SN-B + MCG seg                41.9%        43.2%
[7] STC + add. train data         49.8%        51.2%
[3] What's the point + 1 point    42.7%        43.6%
[5] CCNN + size info.             42.4%        45.1%
Ours (CheckMask)                  51.5%        52.9%

Mean IoU on PASCAL VOC using image tags only:

Method                        Validation   Test
[2] MIL + sspxl               36.6%        35.8%
[3] What's the point w/ Obj   32.2%        ---
[4] EM-Adapt                  38.2%        39.6%
[5] CCNN                      35.3%        35.6%
Ours (Tag)                    46.6%        48.0%

Mean IoU on the PASCAL VOC validation set using Flickr for training:

Method             Mean IoU
[5] CCNN (Tags)    32.2%
Ours (Tags)        39.0%
Ours (CheckMask)   46.3%


Mask evaluation results on 10% of the PASCAL VOC training set:

Method                                      Mean IoU
Mask obtained using objectness method [8]   52.3%
Mask obtained using MCG                     50.2%
Our masks                                   60.1%
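For reference, the mean IoU used here is the standard intersection-over-union between predicted and ground-truth label maps, averaged over the classes present; a minimal, generic sketch (our illustration, not the evaluation code):

```python
import numpy as np

def mean_iou(pred, gt, n_classes=2):
    """Mean IoU between two integer label maps of the same shape."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:              # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```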

[Figure: qualitative mask comparison: image, objectness map [8], MCG map, our mask, objectness mask [8], MCG mask]
[Figure: qualitative segmentation results: image, baseline, Tags (PASCAL), CheckMask (PASCAL), Tags (Flickr), CheckMask (Flickr), ground truth (G.T.)]