
SSD: Single Shot MultiBox Detector - ECCV 2016 | October


SSD: Single Shot MultiBox Detector

Wei Liu¹, Dragomir Anguelov², Dumitru Erhan³, Christian Szegedy³, Scott Reed⁴, Cheng-Yang Fu¹, Alexander C. Berg¹

¹UNC Chapel Hill ²Zoox Inc. ³Google Inc. ⁴University of Michigan, Ann-Arbor

OVERVIEW

SSD discretizes the space of bounding boxes into a set of default box shapes per feature map location, and uses small (3 × 3) convolution kernels to predict both the bounding box offsets and the object probabilities at each location.
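As a rough illustration of the per-location prediction described above, the sketch below counts the outputs of one 3 × 3 prediction layer and decodes a single set of predicted offsets into a box. The offset parameterization (center shift scaled by the default box size, log-scale width and height) is an assumption taken from the SSD paper rather than from this poster, and the function names are hypothetical.

```python
import math

def head_channels(boxes_per_location, num_classes):
    """Output channels of a 3x3 prediction conv: per default box, 4 offsets plus class scores."""
    return boxes_per_location * (num_classes + 4)

def decode(default_box, offsets):
    """Apply predicted offsets to a default box; both use (cx, cy, w, h) relative to the image."""
    d_cx, d_cy, d_w, d_h = default_box
    t_cx, t_cy, t_w, t_h = offsets
    cx = d_cx + t_cx * d_w       # shift the center by a fraction of the default width
    cy = d_cy + t_cy * d_h       # same for the vertical direction
    w = d_w * math.exp(t_w)      # width and height are scaled multiplicatively
    h = d_h * math.exp(t_h)
    return (cx, cy, w, h)

# Assumed example: 4 default boxes per location, 21 classes (20 VOC classes + background).
print(head_channels(4, 21))
print(decode((0.5, 0.5, 0.2, 0.2), (0.1, -0.1, 0.0, 0.0)))
```

Zero offsets return the default box unchanged, so the network only has to learn corrections relative to each default shape.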

COMPARE STATE-OF-THE-ART METHODS

#1: MULTI-SCALE FEATURE MAPS

SSD uses multiple feature maps of decreasing resolution to output bounding boxes of increasing size.
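One way to picture how coarser feature maps are assigned larger boxes is the short sketch below. It assumes the linear scale rule from the SSD paper (scales spaced evenly between a minimum and a maximum, here 0.2 and 0.9); those values and the helper names are assumptions, not something stated on this poster.

```python
def default_box_scales(num_maps, s_min=0.2, s_max=0.9):
    """Evenly spaced box scales, one per feature map, finest map first."""
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]

def box_shape(scale, aspect_ratio):
    """Width and height (relative to the image) of a default box with the given aspect ratio."""
    root = aspect_ratio ** 0.5
    return scale * root, scale / root

scales = default_box_scales(6)
for s in scales:
    w, h = box_shape(s, 2.0)
    print(f"scale={s:.2f}  aspect ratio 2 box: {w:.3f} x {h:.3f}")
```

Each successive feature map has fewer locations but a larger scale, so the set of default boxes covers small objects densely and large objects sparsely.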

Number of prediction      mAP, use boundary boxes?      # Boxes
source layers used        Yes          No
6                         74.3         63.4          8732
3                         70.7         69.2          9864
1                         62.4         64.0          8664

#2: MORE DEFAULT BOXES

[Figure panels: 8 × 8 feature map; 4 × 4 feature map]

SSD discretizes the bounding box space into many bins, which prevents box coordinates from being averaged when several likely hypotheses are present in the same default box.

                            SSD300
include {1/2, 2} box?       -        ✓        ✓
include {1/3, 3} box?       -        -        ✓
number of boxes             3880     7760     8732
VOC2007 test mAP            71.6     73.7     74.3
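The box counts in this table can be reproduced with a few lines of arithmetic. The feature map sizes and the number of default boxes per location used below are assumptions taken from the SSD paper (they do not appear on the poster); under those assumptions the totals match the table, as well as the 24564 boxes reported for SSD512.

```python
# Feature map side lengths: assumed values from the SSD paper, not from the poster.
SSD300_MAPS = [38, 19, 10, 5, 3, 1]
SSD512_MAPS = [64, 32, 16, 8, 4, 2, 1]

def count_boxes(map_sizes, boxes_per_location):
    """Total default boxes: one set of boxes at every cell of every feature map."""
    return sum(size * size * k for size, k in zip(map_sizes, boxes_per_location))

# Only aspect ratio 1 (assumed to come at two scales): 2 boxes per location.
base = count_boxes(SSD300_MAPS, [2] * 6)
# Adding the {1/2, 2} boxes: 4 boxes per location.
with_half = count_boxes(SSD300_MAPS, [4] * 6)
# Adding the {1/3, 3} boxes on the middle maps only: 6 boxes per location there.
full = count_boxes(SSD300_MAPS, [4, 6, 6, 6, 4, 4])
ssd512 = count_boxes(SSD512_MAPS, [4, 6, 6, 6, 6, 4, 4])

print(base, with_half, full, ssd512)
```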

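SSD ARCHITECTURE

[Figure: SSD architecture. Labels: image; VGG; Extra Convolutional Feature Maps; Classifier: Conv layers; Detections per Class; Non-Maximum Suppression; mAP; FPS]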
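The last stage of the architecture is non-maximum suppression over the per-class detections. Below is a minimal greedy sketch of that step in plain Python; the overlap threshold of 0.45 and the cap of 200 detections are assumptions taken from the SSD paper, and the function names are hypothetical.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45, top_k=200):
    """Greedy NMS for one class: keep the best box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == top_k:
                break
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed and the indices 0 and 2 are kept.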

Method                  mAP     FPS    batch size    # Boxes    Input resolution
Faster R-CNN (VGG16)    73.2    7      1             ~6000      ~1000 × 600
Fast YOLO               52.7    155    1             98         448 × 448
YOLO (VGG16)            66.4    21     1             98         448 × 448
SSD300                  74.3    46     1             8732       300 × 300
SSD512                  76.8    19     1             24564      512 × 512
SSD300                  74.3    59     8             8732       300 × 300
SSD512                  76.8    22     8             24564      512 × 512

THE DEVIL IS IN THE DETAILS

1. Data augmentation

data augmentation                  SSD300
horizontal flip                    ✓        ✓        ✓
random crop & color distortion     -        ✓        ✓
random expansion                   -        -        ✓
VOC2007 test mAP                   65.5     74.3     77.2
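Two of the augmentations in this table, horizontal flip and random expansion, can be sketched without any image library by treating an image as a list of pixel rows. This is only an illustration under assumptions: the maximum expansion ratio of 4 per side comes from the SSD paper, the fill value stands in for the mean pixel, and the function names are hypothetical.

```python
import random

def horizontal_flip(image, boxes):
    """Mirror an image (list of rows) and its boxes (xmin, ymin, xmax, ymax) in pixels."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    new_boxes = [(width - xmax, ymin, width - xmin, ymax)
                 for xmin, ymin, xmax, ymax in boxes]
    return flipped, new_boxes

def random_expand(image, boxes, max_ratio=4.0, fill=0, rng=random):
    """Paste the image at a random position on a larger canvas, making objects smaller."""
    height, width = len(image), len(image[0])
    ratio = rng.uniform(1.0, max_ratio)
    new_h, new_w = int(height * ratio), int(width * ratio)
    top = rng.randint(0, new_h - height)
    left = rng.randint(0, new_w - width)
    canvas = [[fill] * new_w for _ in range(new_h)]
    for y in range(height):
        canvas[top + y][left:left + width] = image[y]
    # Boxes keep their size; only their position on the canvas changes.
    new_boxes = [(xmin + left, ymin + top, xmax + left, ymax + top)
                 for xmin, ymin, xmax, ymax in boxes]
    return canvas, new_boxes

image = [[1, 2], [3, 4]]
boxes = [(0, 0, 1, 2)]
print(horizontal_flip(image, boxes))
canvas, moved = random_expand(image, boxes, rng=random.Random(0))
print(len(canvas), len(canvas[0]), moved)
```

Expansion shrinks objects relative to the input, which gives the detector more small-object training examples.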

2. Ground truth to default box matching

3. Hard negative mining
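Items 2 and 3 can be sketched together, since hard negative mining operates on the result of the matching step. The poster gives no parameters for either, so the overlap threshold of 0.5 and the 3:1 negative-to-positive ratio below are assumptions taken from the SSD paper, and all function names are hypothetical.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match(gt_boxes, default_boxes, threshold=0.5):
    """For each default box, return the index of its matched ground truth, or -1."""
    matches = [-1] * len(default_boxes)
    # Any default box that overlaps a ground truth by more than the threshold is positive.
    for d, dbox in enumerate(default_boxes):
        best_iou, best_gt = 0.0, -1
        for g, gbox in enumerate(gt_boxes):
            overlap = iou(dbox, gbox)
            if overlap > best_iou:
                best_iou, best_gt = overlap, g
        if best_iou > threshold:
            matches[d] = best_gt
    # Each ground truth also claims its single best default box, whatever the overlap.
    for g, gbox in enumerate(gt_boxes):
        best_d = max(range(len(default_boxes)), key=lambda d: iou(default_boxes[d], gbox))
        matches[best_d] = g
    return matches

def hard_negatives(matches, conf_loss, neg_pos_ratio=3):
    """Indices of the highest-loss unmatched boxes, at most neg_pos_ratio per positive."""
    num_pos = sum(1 for m in matches if m >= 0)
    negatives = [i for i, m in enumerate(matches) if m < 0]
    negatives.sort(key=lambda i: conf_loss[i], reverse=True)
    return negatives[:neg_pos_ratio * num_pos]

gt = [(0, 0, 10, 8)]
defaults = [(0, 0, 10, 10), (0, 0, 4, 8), (20, 20, 30, 30),
            (40, 40, 50, 50), (60, 60, 70, 70), (80, 80, 90, 90)]
matches = match(gt, defaults)
print(matches)
print(hard_negatives(matches, [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]))
```

Here only the first default box is positive (IoU 0.8), and the three unmatched boxes with the highest confidence loss are kept as negatives; the rest are ignored when computing the loss.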

DETECTION EXAMPLES


MORE RESULTS

Method          VOC2007 test    VOC2012 test    MS COCO test-dev    ILSVRC2014 val2
Fast R-CNN      70.0            68.4            19.7                N/A
Faster R-CNN    73.2            70.4            21.9                N/A
YOLO            63.4            57.9            N/A                 N/A
SSD300          74.3            72.4            23.2                43.4
SSD512          76.8            74.9            26.8                46.4
SSD300*         77.2            75.8            25.1                N/A
SSD512*         79.8            78.5            28.8                N/A