Few-Shot Adaptive Video-to-Video Translation Ting-Chun Wang NVIDIA


Few-Shot Adaptive Video-to-Video Translation

Ting-Chun Wang

NVIDIA


Recall the Motion Transfer Example


Behind the Scenes…


Disadvantages of vid2vid

• Separate models for each dataset

• Generalizing to new persons requires
  • Collecting new data
  • Training

[Figure: a separate model (model 1, model 2, model 3) for each dataset]


Wouldn’t it be great if…

• One model for all

• Dynamically determine the style at run time
  • Based on an exemplar image


Adaptive Video-to-Video Translation

T.-C. Wang, M.-Y. Liu, A. Tao, G. Liu, J. Kautz, B. Catanzaro, “Few-shot Adaptive Video-to-Video Synthesis,” To appear at NeurIPS 2019.


Adaptive vid2vid: overview

• Original vid2vid
  • Output frame = Hallucinated frame + Warped frame

• Adaptive vid2vid
  • Hallucinated frames
    • Generated based on example images
    • Using a filter generation scheme

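[Figure: example images go through filter generation to produce filters for the conv layers; the conv layers take the input for frame t and produce the hallucinated frame; flow generation warps (W) frame t-1 into the warped frame; the hallucinated and warped frames form output frame t]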
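The composition of an output frame from a warped previous frame and a hallucinated frame can be sketched as below. This is a minimal illustration and not the authors' code: the slide only states that the two are added, so the per-pixel soft `mask` that weights them, the integer-offset `flow` format, and the function names `warp` and `compose` are all assumptions made for the example. Frames are plain 2-D lists of grayscale values.

```python
def warp(prev_frame, flow):
    """Backward-warp prev_frame using integer (dy, dx) offsets per pixel.

    flow[y][x] = (dy, dx) means output pixel (y, x) is read from
    prev_frame[y + dy][x + dx], clamped to the image border.
    (Real optical flow is sub-pixel; integer offsets keep the sketch short.)
    """
    h, w = len(prev_frame), len(prev_frame[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dy, dx = flow[y][x]
            sy = min(max(y + dy, 0), h - 1)
            sx = min(max(x + dx, 0), w - 1)
            row.append(prev_frame[sy][sx])
        out.append(row)
    return out


def compose(warped, hallucinated, mask):
    """Blend per pixel: mask = 0 keeps the warped pixel, mask = 1 the hallucinated one.

    The mask is an assumption of this sketch, not something shown on the slide.
    """
    return [
        [(1.0 - m) * wv + m * hv for wv, hv, m in zip(w_row, h_row, m_row)]
        for w_row, h_row, m_row in zip(warped, hallucinated, mask)
    ]


prev_frame = [[0.0, 1.0], [2.0, 3.0]]
flow = [[(0, 1), (0, 0)], [(0, 0), (-1, 0)]]
hallucinated = [[9.0, 9.0], [9.0, 9.0]]
mask = [[0.0, 0.0], [1.0, 0.5]]

warped = warp(prev_frame, flow)
output = compose(warped, hallucinated, mask)
```

Pixels where the flow is reliable can lean on the warped frame, while newly revealed regions fall back to the hallucinated frame.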


Adaptive vid2vid

• Based on SPADE (GauGAN)
  • Prior work: input semantics → encoder-decoder → output image
  • Instead: input semantics → spatially-varying normalization maps → used in every BatchNorm

[Figure: SPADE block. Conv layers produce γ and β, which are applied element-wise to the output of a parameter-free Batch Norm; network input x (label free), network output y]

$$y = \frac{x - \mu}{\sigma} \cdot \gamma + \beta$$
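The normalization y = (x − μ)/σ · γ + β with spatially-varying γ and β can be sketched as below. This is a simplified illustration under stated assumptions: it handles a single channel of a single sample, and it replaces the conv layers that produce γ and β with a per-label lookup table (`gamma_table`, `beta_table`), which is what a 1x1 convolution over a one-hot label map reduces to. The function name and the small `eps` term are also assumptions of the sketch.

```python
import math


def spade_normalize(x, labels, gamma_table, beta_table, eps=1e-5):
    """Normalize one channel, then modulate each pixel by its label's gamma/beta."""
    values = [v for row in x for v in row]
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    sigma = math.sqrt(var + eps)  # eps avoids division by zero
    out = []
    for x_row, l_row in zip(x, labels):
        out.append([
            # Parameter-free normalization, then spatially-varying scale and shift.
            (v - mu) / sigma * gamma_table[l] + beta_table[l]
            for v, l in zip(x_row, l_row)
        ])
    return out


x = [[1.0, 3.0], [1.0, 3.0]]          # network input (label free)
labels = [[0, 0], [1, 1]]             # input semantics: top row label 0, bottom row label 1
gamma_table = {0: 1.0, 1: 2.0}        # hypothetical per-label scale
beta_table = {0: 0.0, 1: 10.0}        # hypothetical per-label shift

y = spade_normalize(x, labels, gamma_table, beta_table)
```

The two rows of `x` are identical, yet the output rows differ because each row carries a different semantic label, which is the point of making γ and β depend on the input semantics.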


Adaptive vid2vid

• Based on SPADE (GauGAN)
  • Prior work: input semantics → encoder-decoder → output image
  • Instead: input semantics → spatially-varying normalization maps → used in every BatchNorm

• Given an additional exemplar image
  • Dynamically configure the network weights in SPADE
  • Generate spatially-varying, style-dependent normalization maps
    • Spatial info: input semantics
    • Style info: exemplar images

[Figure: SPADE block. Conv layers produce γ and β, which are applied element-wise to the output of a parameter-free Batch Norm; network input x (label free), network output y]


Dynamic Weight Generation

[Figure: the example image is used for weight generation: filter generation (AdaPool, fc) produces convolution filters. The main image generator takes the input semantics as network input; inside SPADE, dynamic convolution with the generated filters yields spatially-varying, style-dependent norm maps for normalization, alongside normal convolution, to produce the output image]
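The idea of generating convolution weights from the example image and applying them to the input semantics can be sketched as below. This is a toy illustration under heavy assumptions, not the architecture on the slide: global average pooling stands in for the AdaPool step, a single fully connected layer predicts one 1x1 kernel weight per semantic label, and all names (`global_avg_pool`, `generate_filters`, `dynamic_conv1x1`) and shapes are hypothetical.

```python
def global_avg_pool(image):
    """image: list of channels, each a 2-D list. Returns one mean per channel."""
    return [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in image
    ]


def fully_connected(vec, weight, bias):
    """weight holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def generate_filters(example_image, fc_weight, fc_bias, num_labels):
    """Predict a 1x1 conv kernel (one weight per label) from the example image."""
    style = global_avg_pool(example_image)                 # style info
    kernel = fully_connected(style, fc_weight, fc_bias)    # generated weights
    if len(kernel) != num_labels:
        raise ValueError("fc layer must output one weight per semantic label")
    return kernel


def dynamic_conv1x1(label_map, kernel):
    """Apply the generated kernel to a one-hot label map (a per-label lookup)."""
    return [[kernel[label] for label in row] for row in label_map]


example_image = [
    [[0.5, 0.5], [0.5, 0.5]],   # channel 0
    [[0.0, 1.0], [1.0, 0.0]],   # channel 1
]
fc_weight = [[1.0, 1.0], [2.0, 0.0]]   # hypothetical learned parameters
fc_bias = [0.0, 0.5]
label_map = [[0, 1], [1, 1]]           # input semantics (spatial info)

kernel = generate_filters(example_image, fc_weight, fc_bias, num_labels=2)
gamma_map = dynamic_conv1x1(label_map, kernel)
```

Only `fc_weight` and `fc_bias` are fixed parameters here; the kernel that turns the label map into a normalization map is recomputed for every example image, so the same network can produce different styles at run time.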


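Utilizing Multiple Example Images

[Figure: example pose 1, example pose 2, and the target pose are compared and passed through a softmax to give attention maps; the attention maps weight the example features of example image 1 and example image 2 to give the combined features. A second panel shows input frames, their attention maps, and the example images (front and back views)]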
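Combining features from several example images with softmax attention can be sketched as below. This is a simplified illustration: it works on one feature vector per example rather than full spatial maps, and it scores each example by a dot product between its pose features and the target pose features. The scoring function and the names `softmax` and `combine_examples` are assumptions of the sketch, not details given on the slide.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_examples(target_pose, example_poses, example_features):
    """Weight each example's features by how well its pose matches the target pose."""
    scores = [
        sum(t * p for t, p in zip(target_pose, pose))
        for pose in example_poses
    ]
    weights = softmax(scores)  # attention over the examples
    dim = len(example_features[0])
    combined = [
        sum(w * feat[i] for w, feat in zip(weights, example_features))
        for i in range(dim)
    ]
    return weights, combined


example_features = [[2.0, 0.0], [0.0, 4.0]]

# Two examples with the same pose as the target share the attention equally.
w_equal, f_equal = combine_examples(
    [1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], example_features
)

# A "front" target attends almost entirely to the "front" example.
w_front, f_front = combine_examples(
    [4.0, 0.0], [[4.0, 0.0], [-4.0, 0.0]], example_features
)
```

In the second call the combined features are essentially those of the first example, mirroring the front/back behavior illustrated in the figure.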


Adaptive vid2vid: Training

• From a video
  • Randomly sample a clip
  • Randomly sample one or more other frames as reference frames

• Make the network generate the clip
  • Based on the reference frame(s)
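The sampling step described above can be sketched as below. This is a minimal illustration that works on frame indices only. Drawing the reference frames from outside the sampled clip whenever possible is an assumption of the sketch, as are the function name and its parameters.

```python
import random


def sample_training_example(num_frames, clip_len, num_refs, rng):
    """Pick a contiguous clip and reference frame indices from one video."""
    if not 0 < clip_len <= num_frames:
        raise ValueError("clip_len must be between 1 and num_frames")
    start = rng.randrange(num_frames - clip_len + 1)
    clip = list(range(start, start + clip_len))
    # Prefer references outside the clip (an assumption of this sketch);
    # fall back to any frame if the clip covers the whole video.
    outside = [i for i in range(num_frames) if i < start or i >= start + clip_len]
    pool = outside if outside else list(range(num_frames))
    refs = [rng.choice(pool) for _ in range(num_refs)]
    return clip, refs


rng = random.Random(0)  # fixed seed so the example is repeatable
clip, refs = sample_training_example(num_frames=100, clip_len=8, num_refs=2, rng=rng)
```

During training, the network would then be asked to synthesize the frames in `clip` from their semantic inputs, using the frames in `refs` as the example images.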


Adaptive vid2vid: Testing

• Given an example image

• Finetune on the example image
  • Network output should be the same as the example
  • Only finetune for a few iterations

• For faces: normalize keypoints
  • To match those of the example image
  • To better preserve identity
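The test-time finetuning step, a few gradient iterations that push the network output toward the example, can be sketched as below. This is a toy illustration only: a single linear unit stands in for the generator, the loss is a squared reconstruction error, and the step count and learning rate are arbitrary. None of these choices come from the slides.

```python
def predict(weights, inputs):
    """Toy stand-in for the generator: a single linear unit."""
    return sum(w * x for w, x in zip(weights, inputs))


def loss(weights, example_input, example_target):
    """Squared error between the output and the example it should reproduce."""
    return 0.5 * (predict(weights, example_input) - example_target) ** 2


def finetune(weights, example_input, example_target, steps=5, lr=0.1):
    """Run only a few gradient steps so the model adapts without drifting far."""
    w = list(weights)
    for _ in range(steps):
        err = predict(w, example_input) - example_target
        # Gradient of 0.5 * err**2 with respect to each weight is err * x.
        w = [wi - lr * err * xi for wi, xi in zip(w, example_input)]
    return w


pretrained = [0.0, 0.0]            # hypothetical pretrained weights
example_input = [1.0, 2.0]         # e.g. features derived from the example's semantics
example_target = 3.0               # the example the output should match

loss_before = loss(pretrained, example_input, example_target)
adapted = finetune(pretrained, example_input, example_target)
loss_after = loss(adapted, example_input, example_target)
```

Keeping the number of iterations small limits how far the adapted weights move from the pretrained ones while still reducing the reconstruction error on the example.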


Results

• Semantic → Street view scenes

• Edges → Human faces

• Poses → Human bodies


Street View Scenes

[Figure: example images, input segmentations, synthesized videos]


Edges → Faces

[Figure: example images, input videos, synthesized videos]


Edges → Faces

[Figure: example image, input videos, extracted edges, synthesized result]


Poses → Body

[Figure: example images, input poses, synthesized videos]


Poses → Body


Poses → Body


Poses → Body

[Figure: example image, input videos, poses, synthesized]



Conclusion

• Generative adversarial networks (GANs)
  • Unconditional
  • Conditional

• Supervised Image Translation
  • pix2pix, pix2pixHD, GauGAN

• Unsupervised Image Translation
  • UNIT, MUNIT, FUNIT

• Video Translation
  • vid2vid, Adap vid2vid, vid2game

[Figure: block diagrams with components labeled F, G, and D]


THANK YOU

Questions?