Good Practices for Deep Feature Fusion - ImageNetimage-net.org/challenges/talks/2016/[email protected] · Good Practices for Deep Feature Fusion. ... –Multi-scale augmentation

The Third Research Institute of the Ministry of Public Security, China

09 October 2016

Jie SHAO, Xiaoteng ZHANG, Zhengyan DING, Yixin ZHAO, Yanjun CHEN, Jianying ZHOU, Wenfei WANG, Lin MEI, Chuanping HU

Good Practices for Deep Feature Fusion

Trimps-Soushen@ILSVRC2016

• Object Localization

1st place, 7.71% error

• Object Classification

1st place, 2.99% error

• Object Detection

3rd place, 0.618 mAP

• Scene Classification

3rd place, 10.30% error

• Object Detection from video

3rd place, 0.71 mAP

Object Localization-CLS

• Different kinds of deep models

– Inception-v3, Inception-v4, Residual Network, Inception-ResNet-v2, Wide Residual Network

0

0.1

0.2

0.3

0.4

0.5

0.6

bakery cloak dough hair_slide hook letter_opener loupe restaurant spotlight velvet

Cls Errors for Top-10 Difficult Categories

inception-v3 inception-v4 inception-resnet-v2 resnet-200 wrn-68-2

[1] K. He, et al. Deep Residual Learning for ImageRecognition. CVPR, 2016.[2] C. Szegedy, et al. Rethinking the InceptionArchitecture for Computer Vision. CVPR, 2016.[3] Christian Szegedy, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections onLearning. ICLR, 2016.[4] Zagoruyko S, et al. Wide Residual Networks.arXiv preprint arXiv:1605.07146, 2016.


• Details– Training

– Multi-scale augmentation & Large mini-batch size

– Identity map (Pre-activation)

– Testing

– Multi-scale & flip & dense fusion

Inception-v3

Inception-v4

Inception-Resnet-v2

Resnet-200

Wrn-68-3 Fusion（Val.） Fusion（Test）

Err. (%) 4.20 4.01 3.52 4.26 4.65 2.92 (-0.6) 2.99

Object Localization-CLS• Fusion Error Analysis

– Top-k Accuracy on Val. Dataset

Top-20 Accuracy reaches 99%


– Manually analyze 1458 error images from Val set

Classification Accuracy is hard to improve (1%)

Error Categories Numbers Percentages(%)

Label May Wrong 221 15.16

Multiple Objects (>5) 118 8.09

Non-Obvious Main Object 355 24.35

Confusing Label 206 14.13

Fine-grained Label 258 17.70

Obvious Wrong 234 16.05

Partial Object 66 4.53


• Image Examples (Label may wrong)

7

Predict:1 pencil box2 diaper3 bib4 purse5 running shoe

Ground Truth:sleeping bag


• Image Examples (Multiple objects)

8

Predict:1 lion2 web site3 frying pan4 teddy5 pop bottle

Ground Truth:ice cream


• Image Examples (Non-obvious main object)

9

Predict:1 dock2 submarine3 boathouse4 breakwater5 lifeboat

Ground Truth:paper towel


• Image Examples (Confusing label)

10

Predict:1 carton2 packet3 toilet tissue4 vending machine5 crate

Ground Truth:sunscreen


• Image Examples (Fine-grained label)

11

Predict:1 bolete2 earthstar3 gyromitra4 hen of the woods5 mushroom

Ground Truth:stinkhorn


• Image Examples (Obvious wrong)

12

Predict:1 apron2 plastic bag3 sleeping bag4 umbrella5 bulletproof vest

Ground Truth:poncho


• Image Examples (Partial object)

13

Predict:1 plate2 carbonara3 chocolate sauce4 bakery5 ice cream

Ground Truth:restaurant


– Manually analyze 1458 error images from Val set

Classification Accuracy is hard to improve (1%)

Error Categories Numbers Percentages(%)

Label May Wrong 221 15.16

Multiple Objects (>5) 118 8.09

Non-Obvious Main Object 355 24.35

Confusing Label 206 14.13

Fine-grained Label 258 17.70

Obvious Wrong 234 16.05

Partial Object 66 4.53


25.81

16.42

11.74

6.66

3.56 3.03 2.99

0

5

10

15

20

25

30

Top

-5 e

rro

r(%

)

Object Classification

Object Localization-LOC

• Based on Faster R-CNN Style Pipeline

– Pre-trained CNN Models: ResNet-152, ResNet-101, Inception-V4,

Inception-V3, WRN-68, Inception-ResNet-V2

[5] Shaoqing Ren, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS, 2015.


• Improvements

–Cascade Fast RCNN(multi)

–Learn to Rank(single)

–Fusion in single model(single)

–Fusion between models(multi)

Localization Val-Top-5 err (%)

Test-Top-5 err (%)

Baseline Single Model 10.35 /

Single Model Improvements

8.51(-1.84) /

Ensemble 7.58(-0.93) 7.71


• Cascade Fast RCNN (Region Fusion)

Model A

Model B

Model C

Faster RCNN

Faster RCNN

Faster RCNN

Region Pool

Cascade Fast RCNN

Train

Top-5 Classification

Labels

LocalizationResults

Keep Top-900 by categary


• Learn to Rank (test)

For the given category (used in 45 categories)

–Train new CNN model to rescore regions

–Average the original score and new score

–Select regions by the average score

n04019541n04228054n02825657n03355925n09256479n03825788n09288635n04264628n03961711……


• Fusion in single model

– Boost about 0.4%

• Fusion between models

– Voting by region cluster

– Boost more than 0.9%



*ResNet-152 8.51

ResNet-101 8.65

*Inception-ResNet-V2 8.81

Inception-V4 8.88

Inception-V3 9.27

*WRN-68 8.87


Ensemble all 7.58

No ResNet-152 7.93(+0.35)

No Inception-ResNet-V2 7.88(+0.30)

No WRN-68 7.75(+0.17)

Diversity between modelsis important


• Details in Localization

No weight-decay, no dropout

Enough epochs

Suitable mini-batch/iter_size, like 8 or 16

Diversity between models

Different anchor size

Better regions

Multi-scale train/test


• Improve localization accuracy is difficult

Ground truth bounding box


• Improve localization accuracy is difficult

Ground truth bounding box

Our Predicted bounding box


33.4829.88

25.32

9.06 9.01 8.74 7.71

0

5

10

15

20

25

30

35

40

Top

-5 e

rro

r(%

)

Object Localization

Scene Classification

• Details

– Training

– Torch (Memory-shared Optimization)

– Small scale range & large input crop size

– Testing

– Improve multi-scale fusion & multi-model fusion


• Improve Multi-scale Fusion

– Results

Baseline Model Single Model Multi models (Val.) Selected models (Test.)

Err. 12.40% 11.20% (-1.2%) 10.32% (-0.88%) 10.50 (+0.18%)

SingleModel


• Improve Multi-model Fusion– Fusion models two by two (Less Overfitting)

Two Models

Val. Err. 10.80(-0.4%)

7*Two Models

Val. Err. 10.39 (-0.41%)

Test. Err. 10.42 (+0.03%)


10.93 10.85 10.43 10.42 10.3 10.199.01

0

2

4

6

8

10

12To

p-5

err

or(

%)


Combine with a Places2-

pretrained models

only provided

data

Object Detection

• Testing (single: +2~3 mAP; multi: +4.3 mAP)

– 300 regions: predict boxes B from our best model

– New 300 regions: new predict boxes using B as input

– Average softmax and coordinate using 600 regions and

their flips across all models

Object Detection

55.35

60.88 61.56 61.82 62.07

65.2766.28

48

50

52

54

56

58

60

62

64

66

68

mA

P

Object Detection

Object Detection from Video

• From 200 to 30

– Using models from detection task of 200 classes to do

video detection

– Using video data to do fine-tuning

• Add extra train data

–Select some train data from ImageNet dataset

–Using part of Val data to train


64.06 64.2870.97 73.31

76.880.83

0

10

20

30

40

50

60

70

80

90

mA

P


Thank you!

Contact: [email protected]

mailto:[email protected]

Documents

Good Practices for Deep Feature Fusion - ImageNetimage-net.org/challenges/talks/2016/[email protected] · Good Practices for Deep Feature Fusion. ... –Multi-scale augmentation