16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning
Spring 2019
Image to Text Translation
CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 4/1/19 2
• Classification tasks (input → output)
CMU 16-785: Integrated Intelligence in Robotics ([email protected]) 4/1/19 3
• Structured-input to structured-output tasks (input → output)
  o Machine translation or other NLP tasks
  o Image captioning
Recap: Forward propagation in RNN
[Fig. 10.3]
Recurrent connections between hidden units; output at every time step:
a(t) = b + W h(t−1) + U x(t)
h(t) = tanh(a(t))
o(t) = c + V h(t)
y(t) = softmax(o(t))
Softmax yields normalized probabilities.
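As a concrete sketch, the four equations above can be run directly in numpy; the layer sizes (3 hidden units, a 5-word vocabulary) and random weights are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def rnn_step(x, h_prev, U, W, V, b, c):
    """One forward step of a vanilla RNN (the Fig. 10.3 equations)."""
    a = b + W @ h_prev + U @ x      # a(t) = b + W h(t-1) + U x(t)
    h = np.tanh(a)                  # h(t) = tanh(a(t))
    o = c + V @ h                   # o(t) = c + V h(t)
    y = np.exp(o - o.max())         # softmax, shifted for numerical stability
    y /= y.sum()                    # y(t) = softmax(o(t))
    return h, y

rng = np.random.default_rng(0)
n_hidden, n_vocab = 3, 5
U = rng.normal(size=(n_hidden, n_vocab))   # input-to-hidden
W = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden (recurrent)
V = rng.normal(size=(n_vocab, n_hidden))   # hidden-to-output
b, c = np.zeros(n_hidden), np.zeros(n_vocab)

x = np.eye(n_vocab)[2]                     # one-hot encoding of a word
h, y = rnn_step(x, np.zeros(n_hidden), U, W, V, b, c)
# y is a valid probability distribution over the vocabulary
```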
Language modeling using RNN
• Compute the probability of a sentence s = (w1, w2, …, wT):
  p(w1, w2, …, wT) = Π_{t=1..T} p(wt | w1, …, wt−1)
• RNN: p(wt+1 = w | w1, …, wt) = gθ(ht, wt)w
  (the conditional probability of the next word being ‘w’)
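A minimal sketch of the chain rule at work: the bigram table below is a hypothetical stand-in for the RNN's next-word distribution gθ(ht, wt), chosen only to make the product of conditionals concrete.

```python
import numpy as np

# Toy vocabulary and a hypothetical next-word model: a fixed bigram table
# stands in for g_theta(h_t, w_t); an RNN would produce these rows instead.
vocab = ["<s>", "a", "dog", "runs", "</s>"]
P = np.array([
    [0.0, 0.80, 0.10, 0.05, 0.05],   # next-word distribution after <s>
    [0.0, 0.05, 0.80, 0.10, 0.05],   # after "a"
    [0.0, 0.05, 0.05, 0.80, 0.10],   # after "dog"
    [0.0, 0.05, 0.05, 0.10, 0.80],   # after "runs"
    [0.0, 0.25, 0.25, 0.25, 0.25],   # after </s> (unused)
])

def sentence_prob(words):
    """p(w1..wT) = prod_t p(w_t | w_1..w_{t-1}) via the chain rule."""
    prob, prev = 1.0, vocab.index("<s>")
    for w in words:
        i = vocab.index(w)
        prob *= P[prev, i]
        prev = i
    return prob

p = sentence_prob(["a", "dog", "runs", "</s>"])   # 0.8 * 0.8 * 0.8 * 0.8
```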
Conditional language model
• p(wt+1 = w | w1, …, wt) = gθ(ht, wt)w
• ht = φθ(ht−1, wt, c), where c is the context
Recap: Encoder-Decoder Sequence-to-Sequence Architecture
• Map variable-length input sequence to variable-length output sequence
• Machine translation [Cho et al., 2014][Sutskever et al., 2014]
Encoder-Decoder Sequence-to-Sequence Architecture
Encoder (reader or input) RNN processes the input sequence x = (x(1), …, x(nx)) and emits a context C.
Decoder (writer or output) RNN is conditioned on the context C to generate the output sequence y = (y(1), …, y(ny)).
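A toy numpy sketch of the encoder-decoder idea, with made-up sizes and random weights (a real system would use trained LSTM or GRU cells): the encoder's final hidden state serves as the fixed-length context C that conditions every decoder step.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                       # hidden size (illustrative)

def rnn(x, h, W, U):
    return np.tanh(W @ h + U @ x)

# Encoder: read x(1..n_x) and emit the final hidden state as context C.
We, Ue = rng.normal(size=(d, d)), rng.normal(size=(d, d))
xs = [rng.normal(size=d) for _ in range(3)]  # toy input sequence
h = np.zeros(d)
for x in xs:
    h = rnn(x, h, We, Ue)
C = h                                        # fixed-length context vector

# Decoder: condition every step on C to generate y(1..n_y).
Wd = rng.normal(size=(d, d))
Ud = rng.normal(size=(d, d))
Vc = rng.normal(size=(d, d))
y_prev, hd = np.zeros(d), np.zeros(d)
ys = []
for _ in range(2):                           # fixed output length in this sketch
    hd = np.tanh(Wd @ hd + Ud @ y_prev + Vc @ C)
    y_prev = hd                              # feed the output back in
    ys.append(hd)
```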
Representations for words, phrases, or sentences
• Word/phrases into continuous vector space
• Word2vec: the skip-gram model [Mikolov’13] — prediction model
• GloVe: Global Vectors for Word Representation [Pennington’14] — co-occurrence matrix
• Skip-thought vectors [Kiros’15] — representation for sentences
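For intuition, similarity in a continuous vector space is usually measured with cosine similarity; the 4-d vectors below are invented for illustration and are not actual word2vec or GloVe embeddings.

```python
import numpy as np

# Hypothetical 4-d embeddings (real word2vec/GloVe vectors are 100-300-d).
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.7, 0.2, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(u, v):
    """Cosine similarity: dot product of the normalized vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_royal = cosine(emb["king"], emb["queen"])
sim_fruit = cosine(emb["king"], emb["apple"])
# semantically related words end up close in the continuous vector space
```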
Encoder-Decoder [Grid]-to-[Sequence] Architecture
Encoder (reader or input) [CNN] processes input image x and emits context C
Decoder (writer or output) RNN is conditioned on the context C to generate output sequence y=(y(1), …, y(ny))
Show & Tell [Vinyals et al., 2015]
Simple encoder-decoder model using fixed-length context vector
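A rough sketch of the Show & Tell decoding loop under simplifying assumptions (a plain RNN cell instead of the paper's LSTM, random weights, a tiny vocabulary): the CNN image feature initializes the hidden state, and words are then generated greedily until an END token.

```python
import numpy as np

rng = np.random.default_rng(2)
d, vocab = 4, 6                        # illustrative sizes

cnn_feature = rng.normal(size=d)       # stand-in for a CNN image embedding
Wi = rng.normal(size=(d, d))
h = np.tanh(Wi @ cnn_feature)          # image feature sets the initial state

W = rng.normal(size=(d, d))
U = rng.normal(size=(d, vocab))
V = rng.normal(size=(vocab, d))
word = 0                               # START token id (assumption: id 0)
caption = []
for _ in range(5):                     # greedy decoding, fixed max length
    x = np.eye(vocab)[word]
    h = np.tanh(W @ h + U @ x)
    logits = V @ h
    word = int(np.argmax(logits))      # pick the most probable next word
    if word == vocab - 1:              # assumption: last id is the END token
        break
    caption.append(word)
```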
Show & Tell [Vinyals et al., 2015]
CNN: Inception (v1–v3), with batch normalization
Show & Tell [Vinyals et al., 2015]
Encoder-Decoder [Grid]-to-[Sequence] Architecture
The context model is too simple to guarantee that the spatial, temporal, or spatio-temporal structure of the input is preserved.
Attention mechanisms allow the system to sequentially focus on different subsets of the input (Cho et al., 2015).
Attention mechanism
• A structured representation of the input
  – e.g., a set of fixed-size vectors known as the “context set” C = {c1, c2, …, cM}
• Attention model: another neural network that maps the hidden state to a context vector
Attention model
• Soft attention: softmax over context vectors in the context set (natural for gradient back-propagation)
• Hard attention: one best match (MC sampling [Xu et al., 2015])

e_i^t = f_Att(z^{t−1}, c_i, {α_j^{t−1}}_{j=1..M})   (e_i^t: score of context c_i at time t; z: hidden state)
α_i^t = exp(e_i^t) / Σ_{j=1..M} exp(e_j^t)   (attention weight α over M contexts)
c^t = φ({c_i}_{i=1..M}, {α_i^t}_{i=1..M}) = Σ_{i=1..M} α_i^t c_i   (e.g., a weighted sum)
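Soft attention reduces to a few lines of numpy. The dot-product scorer below is one simple choice for f_Att (the slides leave the scoring network unspecified); the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
M, d = 5, 4                            # M context vectors of size d (toy)
contexts = rng.normal(size=(M, d))     # context set C = {c_1, ..., c_M}
z = rng.normal(size=d)                 # decoder hidden state z^{t-1}

# f_Att: here a simple dot-product scorer against the hidden state
e = contexts @ z                       # e_i: score of context c_i

alpha = np.exp(e - e.max())            # softmax over the scores
alpha /= alpha.sum()                   # attention weights over M contexts

c_t = alpha @ contexts                 # soft attention: weighted sum of c_i
best = contexts[np.argmax(alpha)]      # hard attention would pick/sample one
```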
Conditional RNN language model
• Compute the context vector at every time step instead of using a fixed-length context vector
• ht = φθ(ht−1, xt, ct), where

c^t = φ({c_i}_{i=1..M}, {α_i^t}_{i=1..M}) = Σ_{i=1..M} α_i^t c_i
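Putting the pieces together, a hedged sketch of a decoder that recomputes ct from the context set at every step (plain RNN cell, random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
M, d = 5, 4
contexts = rng.normal(size=(M, d))     # fixed context set from the encoder
Wh = rng.normal(size=(d, d))
Wx = rng.normal(size=(d, d))
Wc = rng.normal(size=(d, d))

h = np.zeros(d)
x = rng.normal(size=d)                 # embedding of the previous word (toy)
hs = []
for _ in range(3):
    e = contexts @ h                   # score each context against h_{t-1}
    a = np.exp(e - e.max())
    a /= a.sum()
    c_t = a @ contexts                 # a fresh context vector every step
    h = np.tanh(Wh @ h + Wx @ x + Wc @ c_t)   # h_t = phi(h_{t-1}, x_t, c_t)
    hs.append(h)
```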
Attention-based image captioning
• Representation of the input image:
  – Activations of the last fully-connected hidden layer serve as the context vector in the simple encoder-decoder model
  – Activations of the last convolutional layer are used with the attention mechanism
The decoder can selectively pay attention to certain parts of the input image.
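The distinction above amounts to a reshape. Assuming a 14×14×512 conv feature map (sizes like those in Show, Attend & Tell; here just assumptions), the grid of activations becomes a context set of 196 vectors, whereas a fully-connected activation yields a single vector:

```python
import numpy as np

# Last conv layer: a spatial grid of feature vectors -> a context set.
feature_map = np.random.default_rng(5).normal(size=(14, 14, 512))
context_set = feature_map.reshape(-1, 512)      # M = 196 vectors of size 512

# Last fully-connected layer: one vector -> a single fixed-length context.
fc_context = np.random.default_rng(6).normal(size=4096)
```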
Show, Attend, & Tell [Xu et al., 2015]
Generate dense descriptions of images using multimodal embedding [Karpathy & Fei-Fei 2015]
Representing images
• Bounding-box detection using R-CNN [Girshick CVPR’14], pretrained on ImageNet and fine-tuned on the 200 classes of the ImageNet Detection Challenge
• 4096 activations of the fully-connected layer right before classification
Representing images
• Top 19 bounding boxes + the entire input image = 20 regions
• 1 image → 20 h-dimensional vectors: v = Wm[CNNθc(Ib)] + bm
  (h: size of the multimodal embedding space; varies from 1000 to 1600 in the experiments)
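The projection v = Wm[CNNθc(Ib)] + bm is a single affine map; the sketch below uses random weights and h = 1000 (within the 1000–1600 range mentioned) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
h = 1000                                # multimodal embedding size
cnn_act = rng.normal(size=4096)         # fc activations for one region I_b
Wm = rng.normal(size=(h, 4096)) * 0.01  # learned projection (random here)
bm = np.zeros(h)

v = Wm @ cnn_act + bm                   # v = Wm [CNN(I_b)] + bm

# 20 regions (top-19 boxes + the whole image) -> 20 h-dimensional vectors
regions = rng.normal(size=(20, 4096))
V20 = regions @ Wm.T + bm
```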
Recap: Bidirectional RNNs
[Fig. 10.11]
Forward in time
Backward in time
Representing sentences
• Bidirectional RNN
• Left-to-right & right-to-left context
• Each input word → 1-of-k vector
• Encoded into an h-D vector (the same embedding space as the images)
Alignment
• Training set:
  – k: image index
  – l: sentence index
• Multimodal h-D embedding
• Image → v1, …, v20
• Sentence of n words → s1, …, sn
• Similarity between an image region and a word based on the dot product vi^T st

S_{k,l} = Σ_{t∈gl} max_{i∈gk} vi^T st   (Eq. 8)

gk: set of image fragments in image k; gl: set of words in sentence l
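Eq. 8 vectorizes neatly: compute all region-word dot products, take the best region for each word, and sum. The toy sizes below (20 regions, 7 words, a 16-d embedding) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
h = 16                                  # embedding size (toy)
V = rng.normal(size=(20, h))            # image k: 20 region vectors v_i
S = rng.normal(size=(7, h))             # sentence l: 7 word vectors s_t

# Eq. 8: each word aligns to its best-matching region; scores then sum.
sim = V @ S.T                           # (20, 7): all v_i^T s_t dot products
S_kl = sim.max(axis=0).sum()            # sum_t max_i v_i^T s_t
```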
Multimodal RNN for text generation
• Image CNN at t0
• START & END: special tokens
• Each word encoded into a vector
• Predict the next word as a probability distribution over the dictionary + END
Qualitative result
• 20 occurrences of “man in black shirt”
• 60 occurrences of “is playing guitar”
Additional sample results
Image captioning with attributes (LSTM-A) [Yao et al., 2017]
• CNN-RNN encoder-decoder model
• Predefined set of high-level attributes
• Multiple instance learning with inter-attribute correlations
How much data do we need to achieve decent performance in image captioning?
Flickr30K [Young et al., TACL 2014]
• 30K images + 150K captions
P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2014.
MS COCO 2014 (http://cocodataset.org)
• 330K images × 5 captions
Image captioning in the wild
Sample results of Show & Tell, Show, Attend, & Tell on arbitrary images found on the web show that there is still a lot of room for improvement.
Next classes
• Modules
• Attributes
• Task-specific datasets
• 4/8: Art and AI — Eunsu Kang, MLD
• 4/15: AI and Ethics — Mike Skirpan, SCS