Sequence to Sequence - Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
Presented by Dewal Gupta, UCSD CSE 291G, Winter 2019
BACKGROUND
Challenge: create a description for a given video
Important in:
- describing videos for the blind
- human-robot interaction
Challenging because:
- diverse set of scenes and actions
- necessary to recognize the salient action in context
PREVIOUS WORK: Template Models
- Tag video with captions and use as a bag of words
- Two-stage pipeline:
  - first: tag video with semantic information on objects and actions
    - treated as a classification problem
    - FGM labels subject, verb, object, place
  - second: generate a sentence from the semantic information
- S2VT approach: avoids separating content identification from sentence generation
Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild - Mooney et al., 2014
PREVIOUS WORK: Mean Pooling
- CNN trained on object classification (subset of ImageNet)
- 2 layer LSTM with video and previous word as input
- Ignores video frame ordering
Translating Videos to Natural Language Using Deep Recurrent Neural Networks - Mooney et al., 2015
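A minimal sketch of the mean-pooling step this prior work relies on (the 4096-d fc7 feature size, tensor shapes, and function name are illustrative assumptions, not the original code):

import torch

def mean_pool_video(frame_features: torch.Tensor) -> torch.Tensor:
    # frame_features: (T, D) per-frame CNN features.
    # Averaging collapses them into a single (D,) video vector,
    # which is exactly why frame ordering is ignored.
    return frame_features.mean(dim=0)

# usage (illustrative): 40 frames of 4096-d fc7 features
# video_vec = mean_pool_video(torch.randn(40, 4096))
# video_vec is then fed to the LSTM at every decoding step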
PREVIOUS WORK: Exploiting Temporal Structure
Encoder:
- train a 3D ConvNet on action recognition
- fixed-length frame input
- exploits local temporal structure
Describing Videos by Exploiting Temporal Structure - Courville et al., 2015
PREVIOUS WORK: Exploiting Temporal Structure
Decoder:
- similar to our HW 2
- exploits global temporal structure (sketched below)
Describing Videos by Exploiting Temporal Structure - Courville et al., 2015
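A minimal sketch of soft attention over frame features, the mechanism behind the "global temporal structure" above (dimensions, class name, and variable names are illustrative assumptions, not the original implementation):

import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    # At each decoding step, score every frame feature against the decoder
    # state and take a weighted sum as the visual context.
    def __init__(self, feat_dim=1024, hidden=1000):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden, 1)

    def forward(self, frame_feats, dec_state):
        # frame_feats: (B, T, feat_dim); dec_state: (B, hidden)
        T = frame_feats.shape[1]
        expanded = dec_state.unsqueeze(1).expand(-1, T, -1)
        scores = self.score(torch.cat([frame_feats, expanded], dim=-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=1)               # attention over the T frames
        context = (weights.unsqueeze(-1) * frame_feats).sum(dim=1)
        return context, weights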
GOAL
End-to-end differentiable model that can:
1. Handle variable video length (i.e. variable input length)
2. Learn temporal structure
3. Learn a language model that is capable of generating descriptive sentences
MODEL: LSTM
Single LSTM network, stacked 2 layers deep:
- 1000 hidden units (h_t) per layer
- red layer: models visual elements
- green layer: models linguistic elements
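A rough PyTorch sketch of this two-layer stack: during encoding the frames are fed in and the language input is padded with zeros, during decoding the visual input is padded and the previous words are fed in. The 1000 hidden units and 500-d embeddings follow the slides; the feature dimension, vocabulary size, and all names are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn

class S2VTSketch(nn.Module):
    def __init__(self, feat_dim=4096, vocab_size=10000, hidden=1000, embed=500):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, embed)        # CNN feature -> 500-d input
        self.word_embed = nn.Embedding(vocab_size, embed)   # 500-d text embedding
        self.lstm1 = nn.LSTM(embed, hidden, batch_first=True)           # visual ("red") layer
        self.lstm2 = nn.LSTM(embed + hidden, hidden, batch_first=True)  # language ("green") layer
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_v, feat_dim) CNN features; captions: (B, T_w) word ids
        B, T_v, _ = frame_feats.shape
        T_w = captions.shape[1]
        # Encoding stage: frames in, zero-padded language input.
        vis_in = self.frame_proj(frame_feats)
        h1_enc, state1 = self.lstm1(vis_in)
        pad_w = torch.zeros(B, T_v, self.word_embed.embedding_dim, device=frame_feats.device)
        _, state2 = self.lstm2(torch.cat([pad_w, h1_enc], dim=-1))
        # Decoding stage: zero-padded visual input, previous words in.
        pad_v = torch.zeros(B, T_w, vis_in.shape[-1], device=frame_feats.device)
        h1_dec, _ = self.lstm1(pad_v, state1)
        h2_dec, _ = self.lstm2(torch.cat([self.word_embed(captions), h1_dec], dim=-1), state2)
        return self.out(h2_dec)  # (B, T_w, vocab_size) logits over the next word

Because the same weights are simply unrolled over however many frames a clip has, this handles variable-length video input naturally.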
MODEL: VGG-16
MODEL: AlexNet
Used for both RGB & optical flow inputs!
MODEL: Details
- Text embedding of 500 dimensions
  - self-trained, simple linear transformation
- RGB networks pre-trained on a subset of ImageNet
  - used the networks from the original works
- Optical flow network pre-trained on the UCF101 action classification task
  - original network from “Action Tubes”
- All CNN layers are frozen except the last layers during training
- Flow and RGB outputs are combined by a “shallow fusion” technique (sketched below)
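The fusion happens at decoding time: the per-step word distributions of the RGB model and the flow model are mixed with a scalar weight. A minimal sketch (the function name and the default weight are illustrative assumptions; in the paper the weight is tuned on validation data):

import torch

def shallow_fusion(p_rgb: torch.Tensor, p_flow: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # p_rgb, p_flow: (batch, vocab) word probabilities from the two single-modality models.
    return alpha * p_rgb + (1.0 - alpha) * p_flow

# at each decoding step: next_word = shallow_fusion(p_rgb, p_flow).argmax(dim=-1)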
DATASETS
3 datasets used:
- Microsoft Video Description corpus (MSVD)
- MPII Movie Description Corpus (MPII-MD)
- Montreal Video Annotation Dataset (M-VAD)
MSVD: web clips with human annotations
MPII-MD: Hollywood clips with descriptions from script & audio (originally for the visually impaired)
M-VAD: Hollywood clips with audio descriptions
All three provide single-sentence descriptions
DATASETS: Metrics
Authors use the METEOR metric:
- uses exact token, stemmed token, and WordNet synonym matches
- better correlation with human judgement than BLEU or ROUGE
- outperforms CIDEr when there are fewer references
  - MPII-MD and M-VAD have only 1 reference per clip
where:
- m is the number of unigram (or n-gram) matches after alignment
- w_r is the length of the reference
- w_t is the length of the candidate
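For reference, this is the standard METEOR computation in terms of those quantities (Banerjee & Lavie's original formulation; the constants are the defaults of that version and are not stated on the slides):

\[
P = \frac{m}{w_t}, \qquad R = \frac{m}{w_r}, \qquad
F_{mean} = \frac{10\,P\,R}{R + 9\,P}
\]
\[
\mathrm{Penalty} = 0.5\left(\frac{\#\mathrm{chunks}}{m}\right)^{3}, \qquad
\mathrm{METEOR} = F_{mean}\,(1 - \mathrm{Penalty})
\]

where #chunks is the number of contiguous runs of matched unigrams.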
RESULTS: MSVD
FGM is template based:
- not very descriptive
- predicts a noun, verb, object, place
- builds the sentence off a template
RESULTS: MSVD
Mean Pool based method:
- very similar to the authors' method
RESULTS: MSVD
Temporal Attention method:
- encoder/decoder using attention
RESULTS: Frame ordering
- Training with random ordering of frames results in “considerably lower” performance
RESULTS: Optical Flow
- Flow improves performance only when combined with RGB (not when used alone)
- Flow can be very different even for the same activity
- Flow alone can't disambiguate polysemous verbs like “play”, e.g. “play guitar” vs. “play golf”
RESULTS: SOTA
- Authors claim the accurate comparison is with the GoogLeNet model WITHOUT the 3D-CNN (global temporal attention only)
- questionable claim
RESULTS: MPII-MD, M-VAD
- Similar performance to Visual-Labels
  - VL uses more semantic information (e.g. object detection) but no temporal information
RESULTS: Edit Distance
Levenshtein distance represents the edit distance between two strings (see the sketch below)
- 42.9% of generated sentences match a sentence in the MSVD training corpus exactly
- model struggles to learn M-VAD
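A minimal sketch of the Levenshtein computation behind this comparison; this generic dynamic-programming version works on any token sequences (characters or words) and is an illustration, not the authors' evaluation script:

def levenshtein(a, b):
    # Minimum number of insertions, deletions, and substitutions to turn a into b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute x -> y
        prev = curr
    return prev[-1]

# word-level comparison of a generated caption against a training caption:
# levenshtein("a man is playing a guitar".split(), "a man is playing guitar".split()) == 1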
CRITICISM
- Model fails to learn temporal relations
  - scores only slightly better than the mean-pooling technique, which makes no use of temporal relations
- Model struggles more on the M-VAD dataset than on the others, for unclear reasons
- Authors should have reported BLEU and/or CIDEr scores as well (other studies report them)
- Could conduct a user study (humans judging the captions)?
- Could improve by using better text embeddings?
FURTHER WORK
End-to-End Video Captioning with Multitask Reinforcement Learning - Li & Gong, 2019
- Use Inception ResNet v2 as backbone CNN
- Train CNN against mined video “attributes”
- Achieves +5% METEOR on MSVD with otherwise the same captioning architecture
FURTHER WORK
- Use a 3D CNN to get better clip embeddings instead of LSTMs
  - shown to perform better in activity recognition
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification - Xie et al., 2017
CONCLUSION
Authors build an end-to-end differentiable model that can:
1. Handle variable video length (i.e. variable input length)
2. Learn temporal structure
3. Learn a language model that is capable of generating descriptive sentences
Has become a baseline for many video captioning works
EXAMPLES