Fine tuning a convolutional network for cultural event recognition

FINE-TUNING A CONVOLUTIONAL NETWORK FOR CULTURAL EVENT RECOGNITION

AUTHOR:

Andrea Calafell

ADVISORS:

Xavier Giró-i-Nieto, Amaia Salvador, Matthias Zeppelzauer

20/07/2015

OUTLINE

1. Motivation and State of the art
2. Baseline
3. Study of the dataset bias
4. Denoising
5. Fracking
6. Fine-tuning deeper layers only
7. Ensemble of event detectors
8. Conclusions and future work

MOTIVATION: Cultural Heritage

Chinese New Year

Rio Carnival

Classic onsite explorers

Onsite social media is big data...

...and online explorers need our help

CHALEARN: Looking at People

50 EVENTS

TRAINING SET      5,875 images
VALIDATION SET    2,332 images
TEST SET          3,569 images

MOTIVATION: Goals

● Improve on the results obtained in the ChaLearn Challenge.

● Exploit the noisy data collected from Flickr.

STATE OF THE ART: CaffeNet

Columns: Visual (content), Time stamp and Geolocation (context), Text

Zaharieva’15 X X X

Mattivi’11 X X

Bossard’13 X X

Cao’08 X X X

Sutanto’13 X

Schinas’12 X X

Brenner’13 X X

Nguyen’13 X X

MediaEval Social Event Detection

http://dx.doi.org/10.1109/MMUL.2015.31

http://dx.doi.org/10.1145/2072508.2072511

http://dx.doi.org/10.1109/ICCV.2013.151

http://dx.doi.org/10.1109/CVPR.2008.4587382

http://eprints.qut.edu.au/63821/

http://ceur-ws.org/Vol-927/mediaeval2012_submission_40.pdf

http://mklab2.iti.gr/sewm14/wp-content/uploads/2014/03/SEWM_2014_Proceedings.pdf


STATE OF THE ART: CaffeNet

CaffeNet

ARCHITECTURE [Krizhevsky’12]

SOFTWARE [Jia’14]

DATA [Deng’09]


STATE OF THE ART: CNN ARCHITECTURE


Convolutional Neural Network architecture

Babenko et al, Neural codes for image retrieval. In Computer Vision-ECCV, 2014

http://arxiv.org/pdf/1404.1777v2.pdf

STATE OF THE ART: Object+Scene CNNs


Object-Scene Convolutional Neural Network for event recognition

Wang et al, Object-scene convolutional neural networks for event recognition in images. In CVPRW, 2015

http://wanglimin.github.io/papers/WangWDQ_ChaLearnLAP15.pdf




BASELINE: Fine-tuning a ConvNet

50 outputs (one per event)
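
A minimal sketch of how the baseline fine-tuning could be initialised, assuming the usual recipe: keep the pretrained CaffeNet weights and replace only the 1000-way ImageNet output layer with a 50-way one. The helper `init_for_finetuning`, the layer names used as dictionary keys and the tiny weight shapes are illustrative assumptions, not the authors' code.

```python
import random


def init_for_finetuning(pretrained, new_layer, n_in, n_out, seed=0):
    """Reuse every pretrained layer except `new_layer`, which is
    re-initialised with small random weights and `n_out` outputs."""
    rng = random.Random(seed)
    weights = {name: w for name, w in pretrained.items() if name != new_layer}
    weights[new_layer] = [
        [rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)
    ]
    return weights


# Toy stand-in for a pretrained network (hypothetical shapes and values).
pretrained = {
    "conv1": [[0.1, 0.2]],
    "fc7": [[0.3, 0.4, 0.5, 0.6]],
    "fc8": [[0.0] * 4 for _ in range(1000)],  # 1000 ImageNet classes
}

finetune_init = init_for_finetuning(pretrained, "fc8", n_in=4, n_out=50)
print(len(pretrained["fc8"]), "->", len(finetune_init["fc8"]))
```

Only the output layer starts from scratch here; every other layer begins from its pretrained values and is then updated during fine-tuning.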

BASELINE: ChaLearn @ CVPRW 2015

Awarded the 2nd prize in the Cultural Event Recognition Challenge at the ChaLearn Workshop, CVPR 2015.

Salvador, A., Giró-i-Nieto, X., Calafell, A., et al. Cultural Event Recognition with Visual ConvNets and Temporal Models. In CVPRW, 2015




ConvNets need to be trained with...

a large number of labeled images,

but clean data is expensive...

and downloading noisy data in an unsupervised fashion is easier and cheaper.

NOISY DATA: Flickr Dataset

FLICKR DATASET

4,068

50 EVENTS

DATASET BIAS

Dataset bias when fine-tuning with the ChaLearn or the Flickr dataset:


DENOISING THE FLICKR DATASET

Mosaic of Queens Day from ChaLearn

Mosaic of Queens Day from Flickr

Example event: Annual Buffalo Roundup

Fine-tuned model with ChaLearn

New subset from
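
One way the denoising step could work is sketched below. The assumption is that a model fine-tuned on ChaLearn scores each noisy image, and an image is kept only when the event it was downloaded for receives a high enough score. The function `denoise`, the threshold value and the toy scores are hypothetical; the slides do not state the actual selection rule.

```python
def denoise(noisy_samples, score_fn, threshold=0.5):
    """Keep the noisy samples whose tagged event is confirmed by the model.

    noisy_samples: list of (image_id, tagged_event)
    score_fn:      image_id -> dict mapping event name to a score in [0, 1]
    threshold:     assumed cut-off, not taken from the slides
    """
    kept = []
    for image_id, tagged_event in noisy_samples:
        scores = score_fn(image_id)
        if scores.get(tagged_event, 0.0) >= threshold:
            kept.append((image_id, tagged_event))
    return kept


# Toy scores standing in for the output of the fine-tuned model.
toy_scores = {
    "img1": {"Annual Buffalo Roundup": 0.91, "Queens Day": 0.03},
    "img2": {"Annual Buffalo Roundup": 0.12, "Queens Day": 0.40},
    "img3": {"Annual Buffalo Roundup": 0.05, "Queens Day": 0.88},
}
noisy = [
    ("img1", "Annual Buffalo Roundup"),
    ("img2", "Annual Buffalo Roundup"),  # tag not confirmed by the model
    ("img3", "Queens Day"),
]

clean_subset = denoise(noisy, toy_scores.get)
print(clean_subset)
```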

BASELINE: Dataset ordering during fine-tuning

CaffeNet

FINE-TUNING JOINT:

Joint fine-tuning of the clean and noisy datasets:

baseline: 0.6136


CaffeNet

FINE-TUNING: FINE-TUNING:

Sequential fine-tuning of the clean and noisy datasets:

baseline: 0.6136


CaffeNet

FINE-TUNING: FINE-TUNING:

Sequential fine-tuning of the noisy and clean datasets:

baseline: 0.6136

+1.3%
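
The three dataset orderings compared above can be sketched as a list of training stages. This is only an illustration: `finetune_stages`, `run` and the placeholder training function are hypothetical names, and real fine-tuning would update network weights instead of recording image names.

```python
def finetune_stages(strategy, clean, noisy):
    """Return the ordered list of datasets used in each fine-tuning stage."""
    if strategy == "joint":
        return [clean + noisy]
    if strategy == "clean_then_noisy":
        return [clean, noisy]
    if strategy == "noisy_then_clean":
        return [noisy, clean]
    raise ValueError(f"unknown strategy: {strategy}")


def run(stages, train_fn, weights):
    """Run the stages in order, feeding each stage the previous result."""
    for data in stages:
        weights = train_fn(weights, data)
    return weights


clean = ["chalearn_1", "chalearn_2"]
noisy = ["flickr_1", "flickr_2", "flickr_3"]

# Placeholder "training" that only records which images were seen, in order.
history = run(
    finetune_stages("noisy_then_clean", clean, noisy),
    lambda seen, data: seen + data,
    [],
)
print(history)
```

In the noisy-then-clean ordering, the clean images are the last ones the network sees before evaluation.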


FRACKING: MINING +/- SAMPLES

FRACKING THE TRAINING DATASET

Example event: Pingxi Lantern Festival

Fine-tuned model with ChaLearn

New subset from

hard negatives

hard positives
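
A minimal sketch of mining hard samples for one event, assuming that a hard positive is an image of the event that the fine-tuned model scores low, and a hard negative is an image of another event that the model scores high. The function `frack`, the parameter `k` and the toy scores are illustrative assumptions; the slides do not give the exact selection criterion.

```python
def frack(samples, scores, event, k=1):
    """Mine the k hardest positives and negatives for `event`.

    samples: list of (image_id, label)
    scores:  image_id -> score the current model assigns to `event`
    """
    positives = [(scores[i], i) for i, label in samples if label == event]
    negatives = [(scores[i], i) for i, label in samples if label != event]
    # Hard positives: true members of the event with the lowest scores.
    hard_pos = [i for _, i in sorted(positives)[:k]]
    # Hard negatives: images of other events with the highest scores.
    hard_neg = [i for _, i in sorted(negatives, reverse=True)[:k]]
    return hard_pos, hard_neg


samples = [
    ("a", "Pingxi Lantern Festival"),
    ("b", "Pingxi Lantern Festival"),
    ("c", "Pingxi Lantern Festival"),
    ("d", "Queens Day"),
    ("e", "Queens Day"),
    ("f", "Queens Day"),
]
# Toy scores for the event "Pingxi Lantern Festival".
scores = {"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.1, "e": 0.7, "f": 0.4}

hard_pos, hard_neg = frack(samples, scores, "Pingxi Lantern Festival", k=1)
print(hard_pos, hard_neg)
```

The mined images are the ones the current model gets most wrong, so fine-tuning on them concentrates training on its mistakes.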


CaffeNet

FINE-TUNING: Fine-tuning with fracking subset from:

FRACKING THE TRAINING DATASET

Results of fine-tuning using fracking on images from ChaLearn:

baseline: 0.61365

+0.9%


FINE-TUNING DEEPER LAYERS ONLY

Layer 2 responds to corners and other edge/color conjunctions.

Layer 3 has more complex invariances, capturing similar textures.

Zeiler et al, Visualizing and Understanding Convolutional Networks. In Computer Vision-ECCV, 2014

http://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf


[Figure labels: FC6, FC7, FC8; 50 outputs]

Andrej Karpathy. Convolutional neural networks for visual recognition. In Stanford CS class CS231n.

http://cs231n.stanford.edu/
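
Fine-tuning only the deeper layers can be sketched with per-layer learning-rate multipliers: a multiplier of zero freezes a layer at its pretrained weights. The layer list follows the standard CaffeNet layout (conv1 to conv5, fc6 to fc8). Starting the trainable part at `fc6`, together with the helper names and toy numbers, is an assumption for illustration only; the slides do not say where the authors placed the cut.

```python
CAFFENET_LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]


def lr_multipliers(layers, first_trainable):
    """Freeze every layer before `first_trainable` (multiplier 0.0)."""
    start = layers.index(first_trainable)
    return {name: (1.0 if i >= start else 0.0) for i, name in enumerate(layers)}


def sgd_step(weights, grads, base_lr, mults):
    """One plain SGD update with a per-layer learning-rate multiplier."""
    return {
        name: [w - base_lr * mults[name] * g for w, g in zip(ws, grads[name])]
        for name, ws in weights.items()
    }


mults = lr_multipliers(CAFFENET_LAYERS, "fc6")  # "fc6" is a hypothetical choice
weights = {name: [1.0, 1.0] for name in CAFFENET_LAYERS}
grads = {name: [0.5, -0.5] for name in CAFFENET_LAYERS}

updated = sgd_step(weights, grads, base_lr=0.1, mults=mults)
print(updated["conv1"], updated["fc8"])
```

Frozen layers keep the generic features learned from the large pretraining dataset, while the later layers adapt to the 50 events.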


Results of fine-tuning only the deeper layers:

baseline: 0.61365

+3%

Results of fine-tuning only the deeper layers:

baseline: 0.6136

+4%




ENSEMBLE OF EVENT DETECTORS

SINGLE CONVNET FOR THE 50 EVENTS:

ONE CONVNET FOR EACH EVENT:

Results of the ensemble of binary detectors:

baseline: 0.6136

+6.6%
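
The ensemble idea can be sketched as one binary detector per event, each trained on one-vs-rest labels, with the per-event scores collected at test time. Picking the highest-scoring event is an assumption made here for illustration; the slides do not state how the detector outputs are combined. The toy detectors are plain lookup functions standing in for fine-tuned ConvNets.

```python
def one_vs_rest_labels(samples, event):
    """Relabel a multi-class dataset for the binary detector of `event`."""
    return [(image_id, 1 if label == event else 0) for image_id, label in samples]


def ensemble_predict(detectors, image_id):
    """Score an image with every binary detector and pick the top event."""
    scores = {event: detect(image_id) for event, detect in detectors.items()}
    return max(scores, key=scores.get), scores


samples = [
    ("img1", "Carnival Rio"),
    ("img2", "Chinese New Year"),
    ("img3", "Queens Day"),
]
labels = one_vs_rest_labels(samples, "Queens Day")

# Toy detectors: each returns a high score only for "its" image.
toy_detectors = {
    "Carnival Rio": lambda image_id: {"img1": 0.8}.get(image_id, 0.1),
    "Chinese New Year": lambda image_id: {"img2": 0.7}.get(image_id, 0.2),
    "Queens Day": lambda image_id: {"img3": 0.9}.get(image_id, 0.05),
}

best_event, scores = ensemble_predict(toy_detectors, "img3")
print(labels, best_event)
```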



CONCLUSIONS

● The Flickr dataset helped us improve the score once we swapped the order in which we used the clean and noisy datasets (+1.3%).

● The network actually succeeds in improving its performance by learning from its own mistakes when fracking is applied (+0.9%).

● The results are better if we keep the weights learned in the earlier layers from a very large dataset (+4%).

● Fine-tuning one convnet for each class increases the score (+6.6%).

FUTURE WORK

● Combine our solutions with a network fine-tuned from PLACES, and with other local solutions.

[Diagram labels: NOW, SCENE CNN (PLACES), LOCAL]

● Compete in (and try to win) ChaLearn @ ICCV 2015!
