Damage Detection from Aerial Images
via Convolutional Neural Networks
*National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
†Nagoya University, Aichi, Japan
5/8/2017 MVA2017@Nagoya University 1
Aito Fujita*
Riho Ito*
Ken Sakurada† Tomoyuki Imaizumi*
Shuhei Hikosaka* Ryosuke Nakamura*
Aerial images for damage detection
In the event of a catastrophic disaster, fast assessment of the extent of damage is crucial for recovery.
E.g., "Which buildings were washed away by the tsunami? And how many?"
Satellite/aerial images as a source for damage assessment:
◦ Their wide coverage can facilitate the assessment.
◦ However, the assessment is currently performed manually (by human eye).
Aerial view of Tohoku quake
Pre-quake image / Post-quake image
(2 km / 4 km)
Damage assessment in practice
Pre-quake image / Post-quake image
This building is "washed-away".
This building is "surviving".
Repeated for all buildings.
Manual assessment
RED: washed-away bldg.
YELLOW: surviving bldg.
Manual assessment takes too much time.
Can we automate such an assessment process?
Background and Contributions
Background of damage detection:
◦ Lack of labeled datasets
◦ No prior study of deep learning applied to satellite/aerial images for this task
Related works in this context:
◦ Gueguen+(CVPR2015): hand-designed features (tree-of-shapes); satellite images
◦ Cooner+(Remote Sensing 2016): shallow neural networks; aerial images
◦ Nia+(CRV2017): Convolutional networks but ground-level images
Our contributions:
◦ A new labeled dataset (ABCD dataset)
◦ Comprehensive analyses of CNNs for washed-away building detection
Image coverage
New dataset for washed-away building detection
AIST Building Change Detection (ABCD) dataset:
◦ Based on images taken before/after the Tohoku earthquake
◦ Over 10K pairs of pre/post-tsunami patches
◦ Collected from 66km2 of aerial images of the Tohoku coastal region
◦ A target building at the center of the patch
◦ A damage label assigned to each pair
Washed-away buildings / Surviving buildings
◦ The damage labels are based on the survey of the Great East Japan earthquake conducted by the Japanese government (MLIT) after March 11, 2011.
◦ The dataset is being prepared for public release.
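The structure described above (a pre/post patch pair centred on a target building, plus a binary damage label) can be sketched as follows. This is an illustrative assumption of the sample layout, not the released ABCD format; field names and the 160-pixel size follow the fixed-scale setting used later in the deck.

```python
import numpy as np

# Hypothetical sketch of one ABCD-style sample: a pre/post-tsunami patch
# pair with the target building at the patch centre, plus a damage label.
# Field names and array layout are assumptions, not the released format.
def make_sample(rng, size=160):
    return {
        "pre": rng.random((size, size, 3), dtype=np.float32),   # pre-tsunami RGB patch
        "post": rng.random((size, size, 3), dtype=np.float32),  # post-tsunami RGB patch
        "label": int(rng.integers(0, 2)),  # 0 = surviving, 1 = washed-away
    }

rng = np.random.default_rng(0)
dataset = [make_sample(rng) for _ in range(4)]  # stands in for >10K real pairs
```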
Overview of washed-away detection framework
Practical considerations
From a practical viewpoint, we investigated the following:
Input scenario (to address availability of pre-tsunami image)
Input scale (to address variability of building size)
Practical considerations: Input scenario
Input scenario (to address availability of the pre-tsunami image):
◦ Both pre- and post-tsunami images are available.
  ◦ Input to the CNN: a pair of pre/post patches.
  ◦ Two configurations (6-channel and Siamese), inspired by Zagoruyko+ (CVPR15), Lin+ (CVPR15), and Simo-Serra+ (ICCV15).
  ◦ Layer types: convolution, max pooling, fully-connected.
◦ Only post-tsunami images are available.
  ◦ In practice, pre-tsunami images may not be available.
  ◦ Input to the CNN: a post-tsunami patch only.
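The two pair-input configurations described above can be sketched in PyTorch as follows. The slides only specify the layer pattern (Conv-Pool-Conv-Pool-Conv-Conv-FC-FC with ReLU), so all channel widths, kernel sizes, and the global-pooling step before the FC head are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Shared conv trunk following the Conv-Pool-Conv-Pool-Conv-Conv pattern
# from the slides; the widths (32/64) are illustrative assumptions.
def conv_trunk(in_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # simplification before FC-FC
    )

class SixChannelNet(nn.Module):
    """Stack pre/post patches channel-wise into one 6-channel input."""
    def __init__(self):
        super().__init__()
        self.trunk = conv_trunk(6)
        self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))
    def forward(self, pre, post):
        return self.head(self.trunk(torch.cat([pre, post], dim=1)))

class SiameseNet(nn.Module):
    """Run pre/post patches through one shared trunk, then fuse features."""
    def __init__(self):
        super().__init__()
        self.trunk = conv_trunk(3)  # weights shared between the two branches
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
    def forward(self, pre, post):
        feats = torch.cat([self.trunk(pre), self.trunk(post)], dim=1)
        return self.head(feats)

pre = torch.randn(2, 3, 160, 160)   # batch of pre-tsunami patches
post = torch.randn(2, 3, 160, 160)  # batch of post-tsunami patches
```

For the post-image-only scenario, the same trunk with 3 input channels and a single branch suffices.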
Practical considerations: Input scale
Input scale (to address variability of building size): multiple crop sizes.
◦ Fixed-scale: 160 x 160 pixels (64 m x 64 m), chosen so that most buildings are encompassed.
◦ Size-adaptive: the crop size depends on the building, which makes tiny buildings more conspicuous (size-adaptively crop, then resize to the network input size).
◦ Multi-scale CNN: combines both scales; inspired by Zagoruyko+ (CVPR15).
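The fixed-scale and size-adaptive cropping strategies above can be sketched with plain numpy. The margin factor and the nearest-neighbour resize are illustrative assumptions; the deck does not specify how the adaptive crop size is derived from the building footprint.

```python
import numpy as np

# Fixed-scale: always cut a 160x160 window (64 m x 64 m) around the
# target building's centre (cy, cx).
def fixed_crop(img, cy, cx, size=160):
    h = size // 2
    return img[cy - h:cy + h, cx - h:cx + h]

# Size-adaptive: the crop scales with the building footprint (in pixels),
# then is resized to the network input size, so a tiny building fills
# the patch. margin=1.5 is a hypothetical context factor.
def adaptive_crop(img, cy, cx, building_px, margin=1.5, out_size=160):
    half = max(1, int(building_px * margin / 2))
    patch = img[cy - half:cy + half, cx - half:cx + half]
    # nearest-neighbour resize back to out_size x out_size
    idx = (np.arange(out_size) * patch.shape[0] / out_size).astype(int)
    return patch[np.ix_(idx, idx)]

img = np.arange(400 * 400, dtype=np.float32).reshape(400, 400)
fx = fixed_crop(img, 200, 200)                       # 160x160 window
ad = adaptive_crop(img, 200, 200, building_px=40)    # 60x60 crop -> 160x160
```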
Experiment
Experimental setting, the same across all conditions (input scenario and scale):
Number of data
◦ Randomly sampled 8,500 pairs; balanced class distribution
Evaluation
◦ 5-fold cross-validation (train/val per fold = 6,800 : 1,700)
◦ Classification accuracy
CNN hyper-parameters
◦ Conv-Pool-Conv-Pool-Conv-Conv-FC-FC with ReLU nonlinearity
◦ No batch normalization (due to degradation in performance)
◦ SGD with momentum; constant learning rate; weight decay
◦ Shared or unshared weights between branches in the Siamese case
Data pre-processing
◦ Zero mean and unit variance (pixel-wise)
◦ Augmentation with vertical and horizontal flips
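The pre-processing steps above can be sketched as follows. Computing the statistics per pixel position across the training batch is one reading of "pixel-wise" (per-image standardisation would be the other); that choice, and applying flips to whole batches, are assumptions for illustration.

```python
import numpy as np

# Zero mean and unit variance, computed per pixel position over the batch.
def standardize(batch, eps=1e-8):
    mean = batch.mean(axis=0, keepdims=True)
    std = batch.std(axis=0, keepdims=True)
    return (batch - mean) / (std + eps)

# Augmentation: original + horizontal flip + vertical flip (3x the data).
def augment_flips(batch):
    return np.concatenate(
        [batch, batch[:, :, ::-1], batch[:, ::-1, :]], axis=0
    )

batch = np.random.default_rng(0).random((8, 160, 160, 3)).astype(np.float32)
x = standardize(batch)
x_aug = augment_flips(x)
```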
Accuracy
(bar charts: classification accuracy on a 93.0%–96.0% axis for the Fixed, Adaptive, and Ensemble crops, without vs. with multi-scale, under the 6-channel, Siamese, and only-post-image scenarios)
Accuracy
(examples, fixed-scale vs. adaptive: a building wrongly classified as "surviving" at one scale is correctly classified as "washed-away" at the other, and vice versa)
The different scales are complementary w.r.t. prediction.
Accuracy
(bar charts, 93.0%–96.0% axis: 6-channel, Siamese, and only-post-image scenarios; without multi-scale vs. with multi-scale)
Accuracy: in terms of input scenario
Only post-tsunami images may be sufficient in the context of washed-away building detection.
(bar chart, 93.0%–96.0% axis: both pre/post images vs. only post image)
Accuracy: in terms of input scale
(bar chart, 93.0%–96.0% axis: Fixed, Adaptive, Ensemble)
There is no clear winner between fixed-scale and adaptive-scale; the ensemble is always better.
Accuracy: in terms of input scale
(bar chart, 93.0%–96.0% axis: without vs. with multi-scale for Fixed, Adaptive, Ensemble)
Multi-scale CNNs always beat their single-scale counterparts.
Qualitative result
CNN prediction / Ground truth
RED: washed-away
YELLOW: surviving
Conclusion
This work in a nutshell:
◦ Effective use of CNNs for washed-away detection was explored.
◦ To this end, we compiled a new labeled dataset.
◦ On this dataset, we performed several experiments from an application viewpoint (input scenario and input scale).
◦ Overall, the accuracy was reasonably good.
Future work:
◦ Generalizability (other regions, other events, other data and so on)
◦ End-to-end framework from localization to classification such as instance-segmentation (Pinheiro+ECCV16)
References:
◦ Cooner et al.: Detection of urban damage using remote sensing and machine learning algorithms: revisiting the 2010 Haiti earthquake, in Remote Sensing, 2016.
◦ Gueguen and Hamid: Large-scale damage detection using satellite imagery, in CVPR, 2015.
◦ Simo-Serra et al.: Discriminative learning of deep convolutional feature point descriptors, in ICCV, 2015.
◦ Zagoruyko and Komodakis: Learning to compare image patches via convolutional neural networks, in CVPR, 2015.
◦ Lin et al.: Learning deep representations for ground-to-aerial geolocalization, in CVPR, 2015.
◦ Ministry of Land, Infrastructure, Transport, and Tourism: First report on an assessment of the damage caused by the Great East Japan earthquake, http://www.mlit.go.jp/common/000162533.pdf (published in Japanese)
◦ Pinheiro et al.: Learning to refine object segments, in ECCV, 2016.
◦ Nia and Mori: Building damage assessment using deep learning and ground-level image data, in CRV, 2017.
Acknowledgement: This presentation is based on results obtained from a project commissioned by the New Energy and
Industrial Technology Development Organization (NEDO).
Appendix
Related work
Damage detection
◦ Gueguen & Hamid (CVPR, 2015)
◦ Pre- and post-disaster satellite images
◦ Semi-supervised classification (BoF + SVM)
◦ Cooner et al. (Remote Sensing, 2016)
◦ Pre- and post-disaster satellite images
◦ Two-layer neural network
◦ Nia & Mori (CRV, 2017)
◦ Only post-disaster ground-level image
◦ Dilated CNN
Related work
Two image patches as input to a CNN
◦ Simo-Serra et al. (ICCV, 2015)
◦ Lin et al. (CVPR, 2015)
◦ Zagoruyko & Komodakis (CVPR, 2015)
114 days before the quake (17/Nov/2010) / 9 days after the quake (20/Mar/2011)
Human annotations derived from the field survey are superimposed.
Red arrows mark buildings that were washed away by the tsunami.
The visual changes of buildings caused by the tsunami are clear, but how can we describe them?
Cross validation result
(learning curves for fixed_size/noAug: accuracy, test loss, and train loss on a 0–1 axis, plotted against iteration ×10)
Size-adaptive classification
Improved:
Worsened:
Effect of # of shared layers on accuracy
(plot: cross-validation mean of accuracy, 0.80–1.00 axis, vs. number of conv layers shared, 0–4)
Visualization via Global Average Pooling
Global Average Pooling (GAP) is a technique for:
1) Regularizing big networks by forgoing FC layers (e.g., NIN and GoogLeNet)
2) Visualizing the feature maps learned by networks in a way that is intuitive for humans
3) Localizing object instances even under a weakly-supervised setting
Recent papers that delve into GAP:
◦ Learning Deep Features for Discriminative Localization (Zhou+, CVPR16)
◦ Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization (Selvaraju+, arXiv16)
Visualization using GAP: Class Activation Map, in Zhou+ (CVPR16)
Replaces FC layers with GAP and requires retraining (leading to some drop in classification accuracy, but improved localization power)
Visualization using GAP: Class Activation Map, in Selvaraju+ (arXiv16)
Retraining is unnecessary thanks to gradient backprop (the model is kept intact, so classification performance is not impaired)
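The Class Activation Map construction in the style of Zhou+ (CVPR16) can be sketched as follows: the map for class c applies that class's post-GAP FC weights across the spatial grid of the last conv feature maps. All shapes here are illustrative assumptions.

```python
import numpy as np

# CAM sketch: weight each last-conv feature map by the FC weight its
# GAP-pooled value would receive for the chosen class, then sum.
def class_activation_map(feature_maps, fc_weights, class_idx):
    # feature_maps: (C, H, W) activations of the last conv layer
    # fc_weights:   (num_classes, C) weights of the FC layer after GAP
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()  # normalise to [0, 1] for display as a heat map
    return cam            # (H, W)

feats = np.random.default_rng(0).random((64, 10, 10))  # stand-in activations
w = np.random.default_rng(1).random((2, 64))           # 2 classes
cam = class_activation_map(feats, w, class_idx=1)
```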
(Class Activation Map visualizations at the resized and fixed scales)