Video Prediction via Example Guidance

ICML 2020 Poster

Presented by Yueyu Hu

STRUCT Paper Reading

2020/12/14

Video Prediction

• Pioneer work, Moving MNIST
  • Unsupervised Learning of Video Representations using LSTMs, ICML'15

VAE Powered Methods

• Stochastic Video Generation with a Learned Prior, ICML'18

Problems

• Prior Gaussian distribution?
  • Insufficient to cover future possibilities
• Multi-modal motion pattern
  • E.g. Moving MNIST: up or down?
• Sampling efficiency
  • How many samples are required to achieve accurate prediction?

Existing solutions

• External information
  • Bounding boxes / Skeleton (pose)
  • Compositional Video Prediction, ICCV'19 (CMU/FAIR)

Insight & Claims

• “In contrast to these works above, we are motivated by one insight that prediction is based on similarity between the current situation and the past experiences.”

• Optimization, explicit distribution modeling
• GAN, plausible predicted samples
• Human skeleton topology, preserved
• Real-world model

Step 1: Disentangle

• Adopts existing methods:
  • Stochastic Video Generation with a Learned Prior, ICML’18
• Disentangle model used in this work
  • Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction, NeurIPS’19
  • Pretrained models used as pose extractor in this work

Step 2: Retrieval

• Sequence X, motion feature F
• (symbol missing in source), the whole training set
• Nearest neighbor search, top K features
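
The retrieval step can be sketched as a brute-force top-K nearest neighbor search. This is only an illustration, not the authors' implementation: the Euclidean distance, the function name `top_k_neighbors`, and the toy 2-D feature pool are all assumptions made for the example.

```python
import heapq
import math


def top_k_neighbors(query, pool, k):
    """Return indices of the k pool features closest to `query`.

    Euclidean distance is an assumption; the slides do not name the metric.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Pair each distance with its index so the result maps back to the pool.
    scored = ((dist(query, feat), idx) for idx, feat in enumerate(pool))
    # nsmallest returns the k best pairs in ascending order of distance.
    return [idx for _, idx in heapq.nsmallest(k, scored)]


# Toy pool of 2-D motion features standing in for the training set.
pool = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.0), (5.0, 5.0)]
neighbors = top_k_neighbors((0.0, 0.1), pool, k=2)
```

On a real training set an approximate nearest neighbor index would likely replace the linear scan, but the interface (query feature in, K indices out) stays the same.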

What does it find?

• They have common pasts
• They are non-Gaussian

Next: Prediction

• Existing approaches

• z: latent representation
• How to get z? Usually with a neural network

$q_z \sim \mathcal{N}(0, 1)$

e.g. $z = \phi_z(f_{t-1})$

The problem

Approach

• Replace the prior with a new one

• Get prior from samples
• Make the predicted distribution close to the prior
• Issues: lack of diversity; the distribution of z cannot feasibly represent the samples

$q_z \sim \mathcal{N}(0, 1)$

$\mathrm{KLD}(p \,\|\, q)$

$\mu_t, \sigma_t$

$\hat{\mu}_t, \hat{\sigma}_t$
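
For reference, the KL divergence between two diagonal Gaussians has a closed form, which is what a KLD term between a predicted and a prior (mean, standard deviation) pair would typically evaluate. The sketch below assumes both distributions are diagonal Gaussians; the function name `kld_diag_gaussian` is hypothetical, and the slides do not state which pair plays the role of p and which of q.

```python
import math


def kld_diag_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) for diagonal Gaussians, summed over dimensions.

    Per dimension:
        log(sigma_q / sigma_p)
        + (sigma_p^2 + (mu_p - mu_q)^2) / (2 * sigma_q^2) - 1/2
    """
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += (math.log(sq / sp)
                  + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2)
                  - 0.5)
    return total


# Identical distributions: the divergence vanishes.
same = kld_diag_gaussian([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
# Unit-variance Gaussians whose means differ by 1: KL = 1/2.
shifted = kld_diag_gaussian([1.0], [1.0], [0.0], [1.0])
```

Note that the divergence is asymmetric, so swapping the two argument pairs generally changes the value.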

Methods

• Calculate mean and variance
• Sample z from this distribution
• Predict multiple instances

Best prediction j
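
A minimal sketch of this sampling step, assuming the statistics are taken per dimension over the K retrieved example features and that z is drawn from an independent Gaussian per dimension. The helper names `example_prior` and `sample_latents` and the toy features are hypothetical; the decoder that turns each z into a predicted sequence is not shown.

```python
import random
import statistics


def example_prior(features):
    """Per-dimension mean and standard deviation over retrieved features."""
    dims = list(zip(*features))  # transpose: one tuple per feature dimension
    mu = [statistics.fmean(d) for d in dims]
    sigma = [statistics.pstdev(d) for d in dims]  # population std. deviation
    return mu, sigma


def sample_latents(mu, sigma, n, rng):
    """Draw n latent vectors, each dimension from N(mu_i, sigma_i^2)."""
    return [[rng.gauss(m, s) for m, s in zip(mu, sigma)] for _ in range(n)]


# Three retrieved features (K = 3) with two dimensions each.
retrieved = [[0.0, 2.0], [1.0, 2.0], [2.0, 2.0]]
mu, sigma = example_prior(retrieved)
# Each sampled z would be decoded into one candidate prediction.
zs = sample_latents(mu, sigma, n=5, rng=random.Random(0))
```

In this toy case the second dimension has zero variance across the examples, so every sample keeps it fixed at 2.0, while the first dimension varies around 1.0.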

Experiment

• Datasets:
  • Moving MNIST
  • BAIR Robot Push
  • PennAction (SVG / Keypoint settings)

Moving MNIST

• Deterministic and Stochastic settings
  • D: Feed in motion information
  • S: Select best from 20 samples
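
The stochastic setting is a best-of-N evaluation: draw N predictions, score each against the ground truth, and report the best one. The sketch below uses mean squared error as the score, which is an assumption (the slides do not name the metric), and flattens each prediction into a list of numbers for simplicity; `best_of_n` is a hypothetical helper.

```python
def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_of_n(samples, target):
    """Return (index, error) of the sample closest to the ground truth."""
    errors = [mse(s, target) for s in samples]
    j = min(range(len(samples)), key=errors.__getitem__)
    return j, errors[j]


# Three candidate predictions scored against one ground-truth sequence.
target = [0.0, 1.0, 0.0]
samples = [[1.0, 1.0, 1.0], [0.0, 0.5, 0.0], [0.0, 1.0, 1.0]]
j, err = best_of_n(samples, target)
```

Because the score uses the ground truth, this protocol measures whether the model can cover the true future within N draws, not the quality of a single draw.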

Robot Arm

• Better trajectory, saturating at 100 samples
• K saturates at 5

Penn Action

• Class label and first frame are fed as inputs
• Action Recognition and Fréchet Video Distance

Cross Class Action Prediction

Facilitated by the guidance of examples, our model produces a visually natural tennis serve sequence, which clearly demonstrates the generalization capability of proposed model. We argue that the majority of previous works are (implicitly) forced to memorize motion categories in the training set. In contrast to the paradigm, our work is relieved from such burden because the retrieved examples contain the category information in assistance of prediction. We thus focus only on intra-class diversity. If given examples with unseen motion categories, our model is still able to give reasonable predictions, thanks to the example guidance.

Conclusion

• Sampling methods might be a good idea
• Video prediction techniques are still far from being usable in practical video coding

Thanks
