42
16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning Spring 2019 Image to Text Translation

Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 2 Input

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning

Spring 2019

Image to Text Translation

Page 2: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 4/1/19 2

Input Output

• Classification tasks

Page 3: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 4/1/19 3

Input Output

• Structured input to structured output tasks o Machine translation or other NLP tasks o Image captioning

Page 4: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Language modeling using RNN

4/1/19

Conditional probability

• Compute the probability of a sentence s = (w1, w2, …, wT)

p(w1, w2, …, wT) = Πt=1T p(wt|w1, … ,wt-1)

• RNN

4

Page 5: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Recap: Forward propagation in RNN

5

[Fig 10.3]

Recurrent connections between hidden units; output every time step

a(t ) = b+Wh(t−1) +Ux(t ) ,ht = tanh(a(t ) ),o(t ) = c+Vh(t ) ,y(t ) = softmax(o(t ) )

Softmax to get normalized probabilities

Page 6: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Language modeling using RNN• Compute the probability of a sentence s =

(w1, w2, …, wT)

p(w1, w2, …, wT) = Πt=1T p(wt|w1, … ,wt-1)

• RNN p(wt+1=w|w1, … ,wt) =gθw(ht, wt)

4/1/19 6

Probability of the next word being ‘w’

Conditional probability

Page 7: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Conditional language model• p(wt+1=w|w1, … ,wt) =gθw(ht, wt)

• ht = φθ(ht-1, wt, c)

4/1/19 7

Context

Page 8: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Recap: Encoder-Decoder Sequence-to-Sequence Architecture

• Map variable-length input sequence to variable-length output sequence

• Machine translation [Cho et al., 2014][Sutskever et al., 2014]

8

Page 9: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 9

Encoder-Decoder Sequence-to-Sequence Architecture

Encoder (reader or input) RNN processes input sequence x=(x(1), …, x(nx))and emits context C

Page 10: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 10

Encoder-Decoder Sequence-to-Sequence Architecture

Encoder (reader or input) RNN processes input sequence x=(x(1), …, x(nx))and emits context C

Decoder (writer or output) RNN is conditioned on the context C to generate output sequence y=(y(1), …, y(ny))

Page 11: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

Representions for words, phrases, or sentences

• Word/phrases into continuous vector space

• Word2vec: the skip-gram model [Mikolov’13] — prediction model

• GloVE: Global vectors for word representation [Pennington’14] — co-occurrence matrix

• Skip-thought vectors [Kiros’15] — representation for sentences

11

Page 12: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 12

Encoder-Decoder [Grid]-to-[Sequence] Architecture

Encoder (reader or input) [CNN] processes input image x and emits context C

Decoder (writer or output) RNN is conditioned on the context C to generate output sequence y=(y(1), …, y(ny))

Page 13: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Show & Tell [Vinyals et al., 2015]

4/1/19 13

C

Simple encoder-decoder model using fixed-length context vector

Page 14: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Show & Tell [Vinyals et al., 2015]

4/1/18 14

CNN: Inception V1-3 Batch Normlization

Page 15: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 4/1/18 15

Show & Tell[Vinyals et al., 2015]

Page 16: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 16

Encoder-Decoder [Grid]-to-[Sequence] Architecture

Encoder (reader or input) [CNN] processes input image x and emits context C

Decoder (writer or output) RNN is conditioned on the context C to generate output sequence y=(y(1), …, y(ny))

The context model is too simple to guarantee that spatial, temporal, or spatio-temporal structures of input are preserved.

Page 17: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Attention mechanisms allow the system to sequentially focus on different subsets of the input (Cho et al., 2015).

4/1/19 17

Page 18: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Attention mechanism• A structured representation of input

• e.g., a set of fixed-size vectors known as “context set” C = { c1, c2, …, cM}

• Attention model: another neural network to map hidden state to context vector

4/1/19 18

Page 19: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Attention model

• Soft attention: softmax over context vectors in context set

• Hard attention: one best match4/1/19 19

eit = fAtt zt−1,ci , α j

t−1{ }j=1

M⎛⎝⎜

⎞⎠⎟

ct =ϕ ci{ }i=1M, αi

t{ }i=1

M⎛⎝⎜

⎞⎠⎟ = αici

i=1

M

∑αi

t =exp(ei

t )

ejt

j=1

M∑

e: score of context ci at time t

Hidden state z

Natural for gradient back-propagationMC sampling [Xu et al., 2015]

Attention weight α over M contexts

Page 20: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Attention model

• Soft attention: softmax over context vectors in context set

• Hard attention: one best match4/1/19 20

eit = fAtt zt−1,ci , α j

t−1{ }j=1

M⎛⎝⎜

⎞⎠⎟

ct =ϕ ci{ }i=1M, αi

t{ }i=1

M⎛⎝⎜

⎞⎠⎟ = αici

i=1

M

∑αi

t =exp(ei

t )

ejt

j=1

M∑

e: score of context ci at time t

e.g., weighted sum

Hidden state z

Attention weight α over M contexts

Natural for gradient back-propagationMC sampling

Page 21: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Conditional RNN language model

• Computing context vector every time step instead of using a fixed-length context vector

• ht = φθ(ht-1, xt, ct)

4/1/19 21

ct =ϕ ci{ }i=1M, αi

t{ }i=1

M⎛⎝⎜

⎞⎠⎟ = αici

i=1

M

Page 22: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Attention-based image captioning

• Representation of input image: – Activation of the last fully-connected hidden

layer as context vector in simple encoder-decoder model

– Activation of the last convolutional layer to use attention mechanism

4/1/19 22

Decoder can selectively pay attention to certain parts of an input image

Page 23: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 4/1/18 23

Show, Attend, & Tell [Xu et al., 2015]

Page 24: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Show, Attend, & Tell [Xu et al., 2015]

4/1/18 24

Page 25: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Show, Attend, & Tell [Xu et al., 2015]

4/1/19 25

Page 26: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

Generate dense descriptions of images using multimodal embedding

4/1/19

[Karpathy & Fei-Fei 2015]

Page 27: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

Representing images• Bounding box detection using:

– R-CNN + pretrained on ImageNet + fine-tuned on 200 classes of ImageNet Detection Challenge

4/1/19

[Girshick CVPR’14]

4096 activations of fully connected layer right before classification

Page 28: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

Representing images• Top 19 bounding boxes + entire input

image = 20

• 1 image ! 20 h-dimensional vectors v = Wm[CNNθc(Ib)] + bm

4/1/19

h: multimodal embedding space, e.g., size varies 1000-1600 in the experiments

Page 29: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Recap: Bidirectional RNNs

29[Fig. 10.11]

Forward in time

Backward in time

Page 30: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

Representing sentences• Bidirectional RNN

• Left to right & right to left context

• Each input word ! 1-of-k vector

• Encode into h-D vector (the same embedding space as images)

1/31/18

Page 31: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

Alignment • Training set:

– k: image index – l: sentence index

• Multimodal h-D embedding

• Image ! v1,…v20

• Sentence n words ! s1,…,sn

• Similarity between image region & word based on dot product vk

Tst

1/31/18

Sk,l = maxi∈gkt∈gl

∑ viTst (Eq. 8)

gk: Set of image fragments in image k

Page 32: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

• Image CNN at t0

• START & END: special tokens

• Each word encoded into a vector

• Predict next word as probability distribution over dictionary + END

1/31/18

Multimodal RNN for text generation

Page 33: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

Qualitative result

1/31/18

20 occurrences of “man in black shirt” 60 occurrences of “is playing guitar”

Page 34: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

Additional sample results

1/31/18

Page 35: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Image captioning with attributes (LSTM-A) [Yao et al., 2017]

• CNN-RNN encoder-decoder model

• Predefined set of high-level attributes

• Multiple instance learning with inter-attribute correlations

4/1/18 35

Page 36: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

Image captioning with attributes (LSTM-A) [Yao et al., 2017]

4/1/18 36

Page 37: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 4/1/18 37

Image captioning with attributes (LSTM-A) [Yao et al., 2017]

Page 38: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

CMU 16-785: Integrated Intelligence in Robotics ([email protected])

How much data do we need to achieve decent performance in image

captioning?

4/1/18 38

Page 39: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

[Young et al., TACL 2014]

1/31/18

30K images + 150K captions P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2014.

Page 40: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

MS COCO 2014

1/31/18

http://cocodataset.org

http://farm9.staticflickr.com/8463/8127485677_6d8d40496a_z.jpg

330K images x 5 captions

Page 41: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

Image captioning in the wild

41

Sample results of Show & Tell, Show, Attend, & Tell on arbitrary images found on the web show that there is still a lot of room for improvement.

Page 42: Image to Text Translationjeanoh/16-785/lectures/IIR-ImageCaptioning.pdf · Image to Text Translation. 4/1/19 CMU 16-785: Integrated Intelligence in Robotics (jeanoh@cmu.edu) 2 Input

Next classes• Modules

• Attributes

• Task-specific datasets

• 4/8 Art and AI — Eunsu Kang, MLD

• 4/15 AI and Ethics — Mike Skirpan, SCS

42