Recurrent Image Annotator


Recurrent Image Annotator for Arbitrary Length Image Tagging
Jiren Jin, Nakayama Lab


1. Introduction to Automatic Image Annotation


Automatic Image Annotation (AIA)


Difficulties of the Task

Most previous work focuses on several problems:
• label sparsity
• label imbalance
• incorrect/incomplete labels

The basic approach is to exploit:
• image-to-tag correlation
• tag-to-tag correlation


Existing Methods

• generative models (joint distribution over image features and annotation tags), Yu et al.
• discriminatively trained classifiers, Claudio et al.
• K-nearest-neighbor (KNN) based methods, Guillaumin et al.
• object-detection based methods, Song et al.


2. The Missing Part: Annotation Length


Missing Part: Annotation Length

Conventional evaluation uses a fixed annotation length:
• annotate the k most relevant keywords
• evaluate retrieval performance per keyword
• average over keywords
• typical k is 5 or 3

Why did previous work do this?
• for ease of comparison with earlier results
• most existing methods cannot trivially predict the proper number of tags


Why Annotation Length Matters

A fixed annotation length is:
• not the natural way that humans annotate images
• not how realistic images are labeled: the true number of tags varies per image

Problem to solve: predict annotations of arbitrary length.

Legend used in the examples: AL = arbitrary length, T5 = top-5, GT = ground truth


3. Our Solution: Recurrent Image Annotator


Sequence generation
• output tags one by one -> arbitrary annotation length
• previous outputs influence the current output -> tag-to-tag correlation
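The idea of emitting tags one by one can be sketched as a greedy decoding loop. This is a minimal illustration with hypothetical names (`step_fn`, `stop_id`), not the paper's actual implementation: the model runs one recurrent step at a time, feeds its previous output back in, and stops when it emits a STOP token.

```python
import numpy as np

def decode_tags(step_fn, image_feat, stop_id, max_len=10):
    """Greedy tag-sequence decoding: feed the previous output back in,
    stop when the model emits the STOP token or max_len is reached."""
    tags, prev, state = [], None, None
    for _ in range(max_len):
        scores, state = step_fn(image_feat, prev, state)  # one recurrent step
        tag = int(np.argmax(scores))
        if tag == stop_id:
            break
        tags.append(tag)
        prev = tag
    return tags
```

Because the loop exits on the STOP token, the annotation length is decided per image by the model itself rather than fixed to k.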

Inspired by machine translation and image captioning:
• the image (or a sentence in language A) is encoded
• the image description (or a sentence in language B) is decoded

Natural Way for Arbitrary Length Outputs

Karpathy et al. (2014)

Vinyals et al. (2014)


What Else We Need

An order of the tags:
• Both image captioning and machine translation aim to generate sentences, which have a natural order.
• Unfortunately, in the image annotation task, no such order is available.
• We have to choose or learn an order.

Points for a useful order "rule":
• should be based on semantic image and tag information
• tag sequences in every training example should be sorted by the same rule

• easy to learn
• good for generation


Contributions

1. Analyze an insufficiency of existing methods:
◦ they cannot generate an image-dependent number of tags
2. First to formulate image annotation as a sequence generation problem:
◦ propose a novel RNN-based model, the Recurrent Image Annotator
3. Propose and evaluate several orders for sorting the input tag sequences:
◦ show the importance of tag order in the tag sequence generation problem


Recurrent Image Annotator (RIA)


4. Submodules of Recurrent Image Annotator


Neural Networks

Hidden layer: linear transformation + nonlinear activation function (e.g., the sigmoid function).

Figure: a simple fully-connected network (from Wikipedia).
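A hidden layer of this kind fits in a few lines. This NumPy sketch (weight names are illustrative) applies the linear transformation and then the sigmoid nonlinearity:

```python
import numpy as np

def dense_sigmoid(x, W, b):
    """Fully-connected hidden layer: linear transformation + sigmoid."""
    z = W @ x + b                     # linear transformation
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation, elementwise
```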


Convolutional Neural Networks

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

• local connectivity
• shared weights
• 3D volumes of neurons
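The local-connectivity and shared-weight ideas can be illustrated with a naive 2D convolution (NumPy sketch, not an efficient implementation): the same kernel slides over the image, and each output unit sees only a local patch.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2D convolution: one shared kernel slides over the image;
    each output value depends only on a local kh x kw patch."""
    H, W = img.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out
```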


Recurrent Neural Networks

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
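The core recurrence is compact enough to write out directly. A NumPy sketch of one vanilla RNN step (weight names illustrative): the new hidden state mixes the current input with the previous hidden state through a tanh nonlinearity.

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, b):
    """One vanilla RNN step: h_t = tanh(Wxh x_t + Whh h_{t-1} + b)."""
    return np.tanh(Wxh @ x + Whh @ h_prev + b)
```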


Long Short-Term Memory Networks

An improved version of the RNN:
• remembers information for long periods of time
• uses gating units to control information flow through time steps

S. Hochreiter and J. Schmidhuber, 1997

Core idea of LSTM: the cell state, which makes it easy for information to just flow along unchanged.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
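The gating mechanism can be sketched as a single LSTM step (NumPy, illustrative only; real implementations use a deep learning library). The forget, input, and output gates decide how much information enters, stays on, and leaves the cell state c.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W stacks all four gate weights (shape 4H x (X+H))."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = h_prev.size
    f = sigmoid(z[:H])             # forget gate: keep old cell content?
    i = sigmoid(z[H:2 * H])        # input gate: admit new content?
    o = sigmoid(z[2 * H:3 * H])    # output gate: expose cell content?
    g = np.tanh(z[3 * H:])         # candidate cell update
    c = f * c_prev + i * g         # cell state flows along, gated
    h = o * np.tanh(c)             # new hidden state
    return h, c
```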


5. Experiments


Dataset 1: Corel 5K

Vocabulary size: 260
Number of images: 4,493
Words per image: 3.4 (maximum is 5)
Images per word: 58.6 (maximum is 1,004)


Dataset 2: ESP Game

Vocabulary size: 269
Number of images: 18,689
Words per image: 4.7 (maximum is 15)
Images per word: 362.7 (maximum is 4,553)


Dataset 3: IAPR TC-12

Vocabulary size: 291
Number of images: 17,665
Words per image: 5.7 (maximum is 23)
Images per word: 347.7 (maximum is 4,999)


Evaluation Measures

• precision, P (averaged over classes)
• recall, R (averaged over classes)
• F-measure, F (averaged over classes)
• N+: the number of classes with non-zero recall
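These per-class measures can be computed as follows. A small sketch, assuming predictions and ground truth are given as one set of tag ids per image (the function name and data layout are illustrative, not the paper's code):

```python
import numpy as np

def per_class_metrics(pred, truth, n_classes):
    """Per-class precision, recall, F-measure (macro-averaged over
    classes) and N+, the number of classes with non-zero recall.
    pred / truth: lists of tag-id sets, one per image."""
    P, R, F, nplus = [], [], [], 0
    for c in range(n_classes):
        tp = sum(1 for p, t in zip(pred, truth) if c in p and c in t)
        n_pred = sum(1 for p in pred if c in p)    # times c was predicted
        n_true = sum(1 for t in truth if c in t)   # times c is correct
        p = tp / n_pred if n_pred else 0.0
        r = tp / n_true if n_true else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        P.append(p); R.append(r); F.append(f)
        nplus += r > 0
    return np.mean(P), np.mean(R), np.mean(F), nplus
```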


Different Orders for Tag Sequences

• dictionary order: alphabetical order
• random order: randomly shuffle the tags in each training example
• frequent-first order: put frequent tags ahead of rare tags
• rare-first order: put rare tags ahead of frequent tags
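The four orders can be implemented directly. A small sketch (names are illustrative; `freq` maps each tag to its count in the training set):

```python
import random

def order_tags(tags, freq, mode, seed=0):
    """Sort one image's tag list under the four candidate orders."""
    if mode == "dictionary":
        return sorted(tags)                          # alphabetical
    if mode == "random":
        out = list(tags)
        random.Random(seed).shuffle(out)             # per-example shuffle
        return out
    if mode == "frequent-first":
        return sorted(tags, key=lambda t: -freq[t])  # frequent tags first
    if mode == "rare-first":
        return sorted(tags, key=lambda t: freq[t])   # rare tags first
    raise ValueError(mode)
```

The same rule must be applied to every training example so that the RNN sees consistent target sequences.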


6. Analysis and Conclusion


Arbitrary Length Annotation (1)


Arbitrary Length Annotation (2)


Arbitrary Length Annotation (3)


Compare Influence of Different Orders

P: precision; R: recall; F: F-measure; N+: the number of classes with non-zero recall. Larger values indicate better performance.


Analysis of Results for Different Orders

Why rare-first outperforms frequent-first:
• "rare" means rare in the dataset; for a single image, a rare tag may carry more importance
• frequent tags are naturally easier to predict than rare tags; frequent-first order makes the easy task easier but the difficult task more difficult
• correctly predicting rare tags matters more under the per-class evaluation measures


Top-5 Annotation

P: precision; R: recall; F: F-measure; N+: the number of classes with non-zero recall.

Much faster testing speed: constant time (5 ms) per test image, instead of O(N) in KNN-based methods (N: number of training images).


Conclusion

• transformed image annotation into a sequence generation problem
• achieved performance comparable to state-of-the-art methods
• decided the appropriate annotation length automatically
• obtained much faster testing speed
• confirmed the importance of a proper tag sequence order


Output of This Work

1. Accepted by the International Conference on Pattern Recognition (ICPR) 2016 (oral)
2. Web demo for RIA: www.nlab.ci.i.u-tokyo.ac.jp/annotator


Future Work

Improve the strategy for obtaining the tag sequence order:
• e.g., use reinforcement learning to learn the order automatically

Extend to personal-preference annotation:
• consider eye-catching effects, etc.

