
16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning

Spring 2019

Image to Text Translation

CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 4/1/19 2

Input Output

• Classification tasks


Input Output

• Structured input to structured output tasks
  o Machine translation or other NLP tasks
  o Image captioning




Recap: Forward propagation in RNN


[Fig 10.3]

Recurrent connections between hidden units; output every time step

$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}$

$h^{(t)} = \tanh(a^{(t)})$

$o^{(t)} = c + V h^{(t)}$

$y^{(t)} = \mathrm{softmax}(o^{(t)})$

Softmax to get normalized probabilities
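The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture: the function names, the toy dimensions, and the random weights are all assumptions chosen only to make the shapes concrete.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over the last axis."""
    e = np.exp(o - o.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rnn_forward(xs, h0, U, W, V, b, c):
    """Run the recurrence over a sequence xs of shape (T, input_dim)."""
    h = h0
    hs, ys = [], []
    for x in xs:
        a = b + W @ h + U @ x      # a(t) = b + W h(t-1) + U x(t)
        h = np.tanh(a)             # h(t) = tanh(a(t))
        o = c + V @ h              # o(t) = c + V h(t)
        ys.append(softmax(o))      # y(t) = softmax(o(t))
        hs.append(h)
    return np.stack(hs), np.stack(ys)

# Toy sizes and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
T, input_dim, hidden_dim, vocab = 5, 3, 4, 6
U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
V = rng.normal(scale=0.1, size=(vocab, hidden_dim))
b = np.zeros(hidden_dim)
c = np.zeros(vocab)
xs = rng.normal(size=(T, input_dim))

hs, ys = rnn_forward(xs, np.zeros(hidden_dim), U, W, V, b, c)
print(hs.shape, ys.shape)   # one hidden state and one output distribution per time step
```

Each row of `ys` is a normalized distribution, which is what lets the output at step t be read as a probability over the vocabulary.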


Language modeling using RNN

• Compute the probability of a sentence s = (w1, w2, …, wT) as a product of conditional probabilities:

$p(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})$

• RNN: the probability of the next word being ‘w’ is

$p(w_{t+1} = w \mid w_1, \dots, w_t) = g_\theta^{w}(h_t, w_t)$
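As a small worked example of the chain rule, the sketch below sums per-step log-probabilities for a sentence. The lookup table is a hypothetical toy model that conditions only on the previous word; in an RNN language model the full history would instead be summarized by the hidden state.

```python
import math

def sentence_log_prob(words, next_word_probs):
    """Sum log p(w_t | w_1..w_{t-1}) over the sentence (chain rule)."""
    log_p = 0.0
    for t, w in enumerate(words):
        probs = next_word_probs(words[:t])   # distribution given the history so far
        log_p += math.log(probs[w])
    return log_p

# Hypothetical toy model: the distribution depends only on the previous word.
TABLE = {
    None:   {"a": 0.50, "dog": 0.25, "runs": 0.25},
    "a":    {"a": 0.25, "dog": 0.50, "runs": 0.25},
    "dog":  {"a": 0.25, "dog": 0.25, "runs": 0.50},
    "runs": {"a": 0.50, "dog": 0.25, "runs": 0.25},
}

def toy_model(history):
    prev = history[-1] if history else None
    return TABLE[prev]

log_p = sentence_log_prob(["a", "dog", "runs"], toy_model)
print(math.exp(log_p))   # 0.5 * 0.5 * 0.5
```

Working in log space avoids underflow when T is large, since the product of many small probabilities quickly leaves floating-point range.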


Conditional language model

• $p(w_{t+1} = w \mid w_1, \dots, w_t) = g_\theta^{w}(h_t, w_t)$

• $h_t = \phi_\theta(h_{t-1}, w_t, c)$, where c is the context
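One way to picture the conditional update is a single decoder step that takes the context c as an extra input. The sketch below is an assumption-heavy illustration: the tanh form of φ, the linear-plus-softmax form of g, and every parameter name in `p` are hypothetical choices, not the lecture's definitions.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def conditional_step(h_prev, w_id, c, p):
    """One step of h_t = phi(h_{t-1}, w_t, c), then the next-word distribution."""
    x = p["E"][w_id]                                                  # embedding of the current word w_t
    h = np.tanh(p["W"] @ h_prev + p["U"] @ x + p["C"] @ c + p["b"])   # context enters at every step
    probs = softmax(p["V"] @ h + p["Vw"] @ x)                         # g(h_t, w_t) as a softmax
    return h, probs

# Toy sizes and random parameters (assumptions for illustration only).
rng = np.random.default_rng(0)
vocab, emb, hid, ctx = 6, 3, 4, 5
p = {
    "E": rng.normal(size=(vocab, emb)),
    "W": rng.normal(scale=0.5, size=(hid, hid)),
    "U": rng.normal(scale=0.5, size=(hid, emb)),
    "C": rng.normal(scale=0.5, size=(hid, ctx)),
    "b": np.zeros(hid),
    "V": rng.normal(scale=0.5, size=(vocab, hid)),
    "Vw": rng.normal(scale=0.5, size=(vocab, emb)),
}
h0 = np.zeros(hid)
c_a, c_b = rng.normal(size=ctx), rng.normal(size=ctx)

_, probs_a = conditional_step(h0, 2, c_a, p)
_, probs_b = conditional_step(h0, 2, c_b, p)
print(probs_a.round(3))
print(probs_b.round(3))
```

With the same previous state and the same word, the two contexts give different next-word distributions, which is the sense in which the language model is conditioned on c.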


Recap: Encoder-Decoder Sequence-to-Sequence Architecture

• Map variable-length input sequence to variable-length output sequence

• Machine translation [Cho et al., 2014][Sutskever et al., 2014]



Encoder-Decoder Sequence-to-Sequence Architecture

Encoder (reader or input) RNN processes input sequence $x = (x^{(1)}, \dots, x^{(n_x)})$ and emits context C

Decoder (writer or output) RNN is conditioned on the context C to generate output sequence $y = (y^{(1)}, \dots, y^{(n_y)})$

Representations for words, phrases, or sentences

• Words/phrases into continuous vector space

• Word2vec: the skip-gram model [Mikolov’13] (prediction model)

• GloVe: Global vectors for word representation [Pennington’14] (co-occurrence matrix)

• Skip-thought vectors [Kiros’15] (representation for sentences)


Encoder-Decoder [Grid]-to-[Sequence] Architecture

Encoder (reader or input) [CNN] processes input image x and emits context C



Show & Tell [Vinyals et al., 2015]

Simple encoder-decoder model using fixed-length context vector C


Show & Tell [Vinyals et al., 2015]

CNN: Inception V1-3, Batch Normalization



The context model is too simple to guarantee that spatial, temporal, or spatio-temporal structures of input are preserved.


Attention mechanisms allow the system to sequentially focus on different subsets of the input (Cho et al., 2015).



Attention mechanism

• A structured representation of input

• e.g., a set of fixed-size vectors known as “context set” $C = \{c_1, c_2, \dots, c_M\}$

• Attention model: another neural network to map hidden state to context vector


Attention model

• Soft attention: softmax over context vectors in context set (natural for gradient back-propagation)

• Hard attention: one best match (MC sampling [Xu et al., 2015])

$e_i^t = f_{Att}\left(z_{t-1}, c_i, \{\alpha_j^{t-1}\}_{j=1}^{M}\right)$

$\alpha_i^t = \frac{\exp(e_i^t)}{\sum_{j=1}^{M} \exp(e_j^t)}$

$c_t = \varphi\left(\{c_i\}_{i=1}^{M}, \{\alpha_i^t\}_{i=1}^{M}\right) = \sum_{i=1}^{M} \alpha_i^t c_i$ (e.g., a weighted sum)

• $e_i^t$: score of context $c_i$ at time t

• $z$: hidden state

• $\alpha$: attention weight over M contexts
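The difference between soft and hard attention can be sketched as follows. The scoring function here is a plain dot product between the hidden state and each context vector, which is a hypothetical stand-in for the learned network $f_{Att}$ (it also ignores the previous attention weights); the toy vectors are assumptions.

```python
import numpy as np

def attention_weights(scores):
    """alpha_i = exp(e_i) / sum_j exp(e_j), computed stably."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def soft_attention(contexts, scores):
    """Context vector as the alpha-weighted sum of all context vectors."""
    alpha = attention_weights(scores)
    return alpha @ contexts, alpha

def hard_attention(contexts, scores, rng=None):
    """Pick a single context vector: the best match, or a sample from alpha."""
    alpha = attention_weights(scores)
    if rng is None:
        i = int(np.argmax(alpha))                    # deterministic best match
    else:
        i = int(rng.choice(len(alpha), p=alpha))     # Monte Carlo sample of one location
    return contexts[i], i

contexts = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # M = 3 toy context vectors
z = np.array([2.0, 0.0])                                    # toy hidden state z_{t-1}
scores = contexts @ z                                       # stand-in for f_Att

c_soft, alpha = soft_attention(contexts, scores)
c_hard, idx = hard_attention(contexts, scores)
print(alpha.round(3), c_soft.round(3), idx)
```

The soft version is differentiable in the scores, so it trains with ordinary back-propagation; the hard version selects an index, which is why sampling-based training comes up for it.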




Conditional RNN language model

• Computing context vector every time step instead of using a fixed-length context vector

• $h_t = \phi_\theta(h_{t-1}, x_t, c_t)$

$c_t = \varphi\left(\{c_i\}_{i=1}^{M}, \{\alpha_i^t\}_{i=1}^{M}\right) = \sum_{i=1}^{M} \alpha_i^t c_i$


Attention-based image captioning

• Representation of input image:

– Activation of the last fully-connected hidden layer as context vector in simple encoder-decoder model

– Activation of the last convolutional layer to use attention mechanism

Decoder can selectively pay attention to certain parts of an input image
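To make the convolutional case concrete, the sketch below flattens a feature map into a context set, so that each context vector corresponds to one spatial location. The 14 x 14 x 512 shape is an assumption for illustration (the slides do not give one), and the random array stands in for real CNN activations.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 14, 14, 512                          # assumed grid size and channel count
feature_map = rng.normal(size=(H, W, D))       # stand-in for last-conv-layer activations

# Context set: M = H * W vectors, one per grid cell.
context_set = feature_map.reshape(H * W, D)

# Context index i maps back to a grid location (row, col).
i = 47
row, col = divmod(i, W)

# Attention weights over the M contexts can be viewed as an H x W map over the image.
alpha = np.full(H * W, 1.0 / (H * W))          # uniform weights as a placeholder
alpha_grid = alpha.reshape(H, W)

print(context_set.shape, (row, col), alpha_grid.shape)
```

Because each context vector keeps its grid position, the attention weights can be overlaid on the image to show where the decoder is looking when it emits a word.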


Show, Attend, & Tell [Xu et al., 2015]




Generate dense descriptions of images using multimodal embedding


[Karpathy & Fei-Fei 2015]

Representing images

• Bounding box detection using:

– R-CNN [Girshick CVPR’14], pretrained on ImageNet and fine-tuned on the 200 classes of the ImageNet Detection Challenge

• Region representation: 4096 activations of the fully connected layer right before classification

Representing images

• Top 19 bounding boxes + entire input image = 20

• 1 image → 20 h-dimensional vectors: $v = W_m[\mathrm{CNN}_{\theta_c}(I_b)] + b_m$

• h: size of the multimodal embedding space, e.g., varies from 1000 to 1600 in the experiments
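The projection into the multimodal space is a single affine map applied to each region's CNN feature. In the sketch below, the random `cnn_features` array is a stand-in for the 4096-dimensional activations, and the weights are random rather than learned; h = 1000 is picked from the range quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
num_regions, cnn_dim, h = 20, 4096, 1000                  # 19 boxes + whole image

cnn_features = rng.normal(size=(num_regions, cnn_dim))    # stand-in for CNN(I_b), one row per region
W_m = rng.normal(scale=0.01, size=(h, cnn_dim))           # random, not learned
b_m = np.zeros(h)

# v = W_m CNN(I_b) + b_m, applied to all 20 regions at once.
v = cnn_features @ W_m.T + b_m
print(v.shape)
```

Stacking the regions as rows lets one matrix product embed the whole image, giving the 20 vectors that are later compared against word vectors.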


Recap: Bidirectional RNNs

[Fig. 10.11]

Forward in time

Backward in time

Representing sentences

• Bidirectional RNN

• Left to right & right to left context

• Each input word → 1-of-k vector

• Encode into h-D vector (the same embedding space as images)
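A bidirectional encoder of this kind can be sketched as two recurrences over the word embeddings, one in each direction, whose states are combined per word. The tanh nonlinearity, the summed combination, and all parameter names below are assumptions for illustration and are not claimed to match the cited model's exact equations.

```python
import numpy as np

def birnn_encode(word_ids, E, Wf, Wb, Wo, b):
    """Return one h-dimensional vector per word, using context from both directions."""
    xs = E[word_ids]                    # row lookup, equivalent to multiplying by 1-of-k vectors
    T, d = len(word_ids), Wf.shape[0]
    hf = np.zeros((T, d))
    hb = np.zeros((T, d))
    h = np.zeros(d)
    for t in range(T):                  # forward in time (left to right)
        h = np.tanh(xs[t] + Wf @ h)
        hf[t] = h
    h = np.zeros(d)
    for t in reversed(range(T)):        # backward in time (right to left)
        h = np.tanh(xs[t] + Wb @ h)
        hb[t] = h
    return np.tanh((hf + hb) @ Wo.T + b)   # project the combined states into the h-D space

# Toy sizes and random parameters (assumptions for illustration only).
rng = np.random.default_rng(0)
k, d, h_dim = 10, 8, 5                  # vocabulary size, hidden size, embedding size
E = rng.normal(size=(k, d))
Wf = rng.normal(scale=0.5, size=(d, d))
Wb = rng.normal(scale=0.5, size=(d, d))
Wo = rng.normal(scale=0.5, size=(h_dim, d))
b = np.zeros(h_dim)

s1 = birnn_encode([1, 2, 3], E, Wf, Wb, Wo, b)
s2 = birnn_encode([1, 2, 4], E, Wf, Wb, Wo, b)
print(s1.shape)
```

Changing only the last word changes the vector of the first word as well, which is the effect of the right-to-left pass; a left-to-right RNN alone would leave it unchanged.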

Alignment

• Training set:

– k: image index
– l: sentence index

• Multimodal h-D embedding

• Image → $v_1, \dots, v_{20}$

• Sentence of n words → $s_1, \dots, s_n$

• Similarity between image region & word based on dot product $v_i^T s_t$

$S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^T s_t$ (Eq. 8)

$g_k$: set of image fragments in image k; $g_l$: set of sentence fragments (words) in sentence l
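The image-sentence score in Eq. 8 reduces to a matrix product followed by a max over regions and a sum over words. The sketch below uses tiny hand-picked vectors (assumptions, chosen so the arithmetic is easy to check by hand) in place of learned embeddings.

```python
import numpy as np

def image_sentence_score(V, S):
    """S_kl = sum over words t of (max over regions i of v_i^T s_t)."""
    sims = V @ S.T                 # (num_regions, num_words) matrix of dot products
    return sims.max(axis=0).sum()  # best region per word, summed over words

def word_to_region(V, S):
    """Index of the best-matching region for each word."""
    return (V @ S.T).argmax(axis=0)

V = np.array([[1.0, 0.0],
              [0.0, 1.0]])         # 2 toy region vectors
S = np.array([[2.0, 1.0],
              [0.5, 3.0],
              [-1.0, -2.0]])       # 3 toy word vectors

score = image_sentence_score(V, S)
align = word_to_region(V, S)
print(score, align)                # 4.0 [0 1 0]
```

Here the dot products are [[2, 0.5, -1], [1, 3, -2]], so the per-word maxima are 2, 3, and -1. Every word is aligned to exactly one region, and a word with no good match can lower the score.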

Multimodal RNN for text generation

• Image CNN at t0

• START & END: special tokens

• Each word encoded into a vector

• Predict next word as probability distribution over dictionary + END
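Generation with START and END tokens can be sketched as a loop that repeatedly picks the most likely next word until END is produced. The greedy choice and the lookup-table model below are assumptions for illustration: a real captioner would compute the distribution with the RNN, conditioned on the image, and the token spellings are hypothetical.

```python
def greedy_decode(next_word_probs, start="<START>", end="<END>", max_len=20):
    """Generate words one at a time, stopping at END or after max_len words."""
    words = [start]
    while len(words) <= max_len:
        probs = next_word_probs(words)       # distribution over dictionary + END
        best = max(probs, key=probs.get)     # greedy: most probable next word
        if best == end:
            break
        words.append(best)
    return words[1:]                         # drop the START token

# Hypothetical toy model: depends only on the previous word.
TABLE = {
    "<START>": {"a": 0.6, "dog": 0.3, "<END>": 0.1},
    "a":       {"dog": 0.7, "a": 0.1, "<END>": 0.2},
    "dog":     {"runs": 0.5, "<END>": 0.4, "a": 0.1},
    "runs":    {"<END>": 0.9, "a": 0.1},
}

def toy_model(words):
    return TABLE[words[-1]]

caption = greedy_decode(toy_model)
print(caption)   # ['a', 'dog', 'runs']
```

The `max_len` cap guards against a model that never emits END; sampling or beam search could replace the greedy step without changing the loop structure.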

Qualitative result

• 20 occurrences of “man in black shirt”

• 60 occurrences of “is playing guitar”

Additional sample results


Image captioning with attributes (LSTM-A) [Yao et al., 2017]

• CNN-RNN encoder-decoder model

• Predefined set of high-level attributes

• Multiple instance learning with inter-attribute correlations





How much data do we need to achieve decent performance in image captioning?

[Young et al., TACL 2014]

30K images + 150K captions

P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2014.

MS COCO 2014


http://cocodataset.org

http://farm9.staticflickr.com/8463/8127485677_6d8d40496a_z.jpg

330K images x 5 captions



Image captioning in the wild

Sample results of Show & Tell and of Show, Attend, & Tell on arbitrary images found on the web show that there is still a lot of room for improvement.

Next classes

• Modules

• Attributes

• Task-specific datasets

• 4/8 Art and AI — Eunsu Kang, MLD

• 4/15 AI and Ethics — Mike Skirpan, SCS
