
Towards perspective-free object counting with deep learning
Daniel Oñoro-Rubio and Roberto J. López-Sastre
GRAM, University of Alcalá
ECCV 2016

This work is supported by the projects SPIP2014-1468, SPIP2015-01809, and MINECO TEC2013-45183-R.

Our models

Counting CNN

[Figure: Counting CNN (CCNN) architecture. A 72x72 input image patch passes through six convolutional layers (Conv1-Conv6) with feature maps of 72x72x32, 31x31x32, 18x18x32, 18x18x1000, 18x18x400 and 18x18x1; the kernels are two 7x7, one 3x3 and three 1x1, and the output is the 18x18 density prediction for the patch.]

Hydra CNN

[Figure: Hydra CNN architecture. A pyramid of scaled patches S0, S1, ..., Sn (each 72x72) is processed by one CCNN head per scale (CCNN_S0, CCNN_S1, ..., CCNN_Sn); the heads feed the fully connected layers Fc6 (512), Fc7 (512) and Fc8 (324 units), whose output is reshaped into the final 18x18 density prediction.]

Our Counting CNN is a fully convolutional model that maps an input image patch into its estimated density map.
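For illustration, here is a minimal PyTorch sketch of a CCNN-style fully convolutional regressor. The filter counts (32, 32, 32, 1000, 400, 1) and the 72x72-patch-to-18x18-map sizes follow the diagram above; the kernel-to-layer assignment, the 'same' padding and the placement of the two 2x2 max-poolings are assumptions of this sketch, not the authors' released configuration.

```python
import torch
import torch.nn as nn

class CCNN(nn.Module):
    """Fully convolutional counting regressor: 72x72 patch -> 18x18 density map.

    Sketch of the CCNN diagram above; kernel sizes, padding and pooling
    placement are assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 72x72 -> 36x36
            nn.Conv2d(32, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 36x36 -> 18x18
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            # The last three 1x1 convolutions act as a per-pixel MLP.
            nn.Conv2d(32, 1000, kernel_size=1), nn.ReLU(),
            nn.Conv2d(1000, 400, kernel_size=1), nn.ReLU(),
            nn.Conv2d(400, 1, kernel_size=1),
        )

    def forward(self, x):            # x: (B, 1, 72, 72) image patches
        return self.features(x)      # -> (B, 1, 18, 18) estimated density maps
```

Training such a regressor would minimize the error between the predicted and ground-truth density patches, so that summing the predicted density yields the object count.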


Experiments

We evaluate the models on three benchmarks: TRANCOS (vehicle counting), UCSD (pedestrian counting) and UCF (crowd counting).

Table of Results (TRANCOS, GAME metric; lower is better):

METHOD              GAME 0   GAME 1   GAME 2   GAME 3
FIASCHI ET AL.       17.77    20.14    23.65    25.99
LEMPITSKY ET AL.     13.76    16.72    20.72    24.36
CCNN                 12.49    16.58    20.02    22.41
HYDRA 2S             11.41    16.36    20.89    23.67
HYDRA 3S             10.99    13.75    16.69    19.32
HYDRA 4S             12.92    15.54    18.45    20.96
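For reference, GAME(L), as usually defined for the TRANCOS benchmark, splits each image into 4^L non-overlapping subregions and accumulates the absolute count error per subregion, so GAME(0) reduces to the plain absolute count error while larger L penalizes badly localized density. A minimal per-image sketch, using a hypothetical game() helper:

```python
import numpy as np

def game(pred_density, gt_density, L):
    """GAME(L) for one image: split it into a 2**L x 2**L grid of subregions
    and sum the absolute count error (difference of density sums) per cell."""
    h, w = pred_density.shape
    cells = 2 ** L
    error = 0.0
    for i in range(cells):
        for j in range(cells):
            rs, re = i * h // cells, (i + 1) * h // cells
            cs, ce = j * w // cells, (j + 1) * w // cells
            error += abs(pred_density[rs:re, cs:ce].sum()
                         - gt_density[rs:re, cs:ce].sum())
    return error
```

The benchmark score is then the mean of this per-image value over the test set.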

Qualitative Results (TRANCOS):

[Figure: example TRANCOS frames with the estimated density maps; the predicted and ground-truth vehicle counts are shown under each example.]

TRANCOS dataset characteristics:
- Target object: cars
- Different scenes
- No perspective map
- 1244 images

UCSD dataset characteristics:
- Target object: people
- Single scene
- With perspective map
- 2000 images
- Video sequence


UCF dataset characteristics:
- Target object: people
- Multiple scenes
- No perspective map
- 50 images
- Counts between 94 and 4543
- Average of 1280 people per image

Qualitative Results (UCF):

[Figure: example UCF crowd images with the estimated density maps; the predicted and ground-truth person counts are shown under each example.]

Table of Results (UCF, MAE and MSD; lower is better):

METHOD              MAE      MSD
RODRIGUEZ ET AL.    655.7    697.8
LEMPITSKY ET AL.    493.4    487.1
ZHANG ET AL.        467.0    498.5
IDREES ET AL.       419.5    541.6
ZHANG ET AL.        377.6    509.1
CCNN                488.67   646.68
HYDRA 2S            333.73   425.26
HYDRA 3S            465.73   371.84

The Hydra model uses a pyramid of input patches cropped from the center of the target patch to provide multiscale information to the network.
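A minimal sketch of that idea, reusing the CCNN class from the sketch above: a hypothetical build_pyramid helper produces the centered, rescaled crops, one CCNN head processes each scale, and the fully connected body (Fc6/Fc7/Fc8 with 512, 512 and 324 units, as in the diagram) produces the 18x18 density map. The crop ratios and the fusion of the heads by concatenation are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_pyramid(patch, n_scales=3):
    """Center-crop progressively smaller regions of a (B, 1, 72, 72) patch and
    resize each crop back to 72x72, giving one input per Hydra head.
    The crop ratios used here are an assumption."""
    size = patch.shape[-1]
    pyramid = []
    for s in range(n_scales):
        crop = int(round(size * (1.0 - 0.25 * s)))      # e.g. 72, 54, 36 pixels
        off = (size - crop) // 2
        region = patch[..., off:off + crop, off:off + crop]
        pyramid.append(F.interpolate(region, size=(size, size),
                                     mode='bilinear', align_corners=False))
    return pyramid

class HydraCNN(nn.Module):
    def __init__(self, n_scales=3):
        super().__init__()
        self.heads = nn.ModuleList([CCNN() for _ in range(n_scales)])  # one head per scale
        self.fc6 = nn.Linear(n_scales * 18 * 18, 512)
        self.fc7 = nn.Linear(512, 512)
        self.fc8 = nn.Linear(512, 324)                                 # 324 = 18 x 18

    def forward(self, pyramid):
        # pyramid: list of (B, 1, 72, 72) tensors, one per scale.
        feats = [head(p).flatten(1) for head, p in zip(self.heads, pyramid)]
        x = torch.cat(feats, dim=1)          # fuse the heads by concatenation (assumption)
        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))
        return self.fc8(x).view(-1, 1, 18, 18)
```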

The counting-by-regression model with deep learning

[Figure: the counting pipeline. Patches are extracted from the input image, each patch is passed through the CNN regressor (forward pass), and the per-patch predictions are assembled into the full density map; during training, the ground-truth (GT) density map is the regression target.]
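A sketch of that pipeline at test time, using a hypothetical count_image helper. It assumes a single-channel image stored as a 2-D NumPy array and a 72x72 -> 18x18 regressor like the CCNN sketch earlier; the patch stride, the averaging of overlaps and the sum-preserving upsampling of the coarse prediction back to the patch footprint are assumptions of this sketch.

```python
import numpy as np
import torch

def count_image(model, image, patch=72, stride=18, out=18):
    """Patch extraction -> forward pass -> density map assembly.

    Overlapping patches are cropped from the image, each is regressed to a
    small density prediction, the predictions are pasted back onto the image
    canvas (upsampled to the patch footprint while preserving their sum) and
    the overlaps are averaged. The estimated count is the sum of the result."""
    H, W = image.shape
    density = np.zeros((H, W), dtype=np.float32)
    hits = np.zeros((H, W), dtype=np.float32)
    scale = patch // out                                    # e.g. 72 / 18 = 4
    model.eval()
    with torch.no_grad():
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                crop = image[y:y + patch, x:x + patch]
                inp = torch.from_numpy(crop).float()[None, None]   # (1, 1, 72, 72)
                pred = model(inp)[0, 0].numpy()                    # (18, 18)
                # Upsample to the patch footprint, preserving the predicted count.
                full = np.kron(pred, np.ones((scale, scale))) / (scale * scale)
                density[y:y + patch, x:x + patch] += full
                hits[y:y + patch, x:x + patch] += 1.0
    density /= np.maximum(hits, 1.0)
    return density, float(density.sum())
```

Calling density, count = count_image(ccnn, frame) would then return both the assembled density map and the estimated count for a frame.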

Understanding the counting problem

Given an image and a target object class, the goal is to estimate the total number of object instances it contains.
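Concretely, in the counting-by-regression formulation the count is recovered from a density map. Assuming the standard dot-annotation setup, the ground-truth density of an image places a normalized 2D Gaussian at every annotated object location, so the count is the integral (sum) of the density:

```latex
D_{I}(p) = \sum_{\mu \in A_{I}} \mathcal{N}(p;\, \mu, \Sigma), \qquad
N_{I} = \sum_{p \in I} D_{I}(p),
```

where A_I is the set of annotated object positions in image I and Sigma controls the spread of each Gaussian. The CNN regressor is trained to predict this density for every patch, so counting reduces to summing its output.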

[Figure: teaser examples from TRANCOS, UCF and UCSD ("Find it!"). The panel contrasts the datasets: UCSD provides a perspective map, while TRANCOS and UCF are perspective free; the proposed models are trained with a single loss function.]

Table of Results (UCSD, MAE on the four training splits; lower is better):

METHOD              MAXIMAL   DOWNSCALE   UPSCALE   MINIMAL
LEMPITSKY ET AL.       1.70        1.28      1.59      2.02
FIASCHI ET AL.         1.70        2.16      1.61      2.20
PHAM ET AL.            1.43        1.30      1.59      1.62
ARTETA ET AL.          1.24        1.31      1.69      1.49
ZHANG ET AL.           1.70        1.26      1.59      1.52
CCNN                   1.65        1.79      1.11      1.50
HYDRA 2S               2.22        1.93      1.37      2.38
HYDRA 3S               2.17        2.99      1.44      1.92

Download us!

Qualitative Results (UCSD):

[Figure: example UCSD frames with the estimated density maps and predicted vs. ground-truth pedestrian counts (e.g. 47.0 vs. 44.6 and 20.6 vs. 21.5), together with a sequence-prediction plot of the estimated count over the whole video sequence.]