

1/32

MoFA: Model-based Deep Convolutional Face Autoencoder

for Unsupervised Monocular Reconstruction

Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido,

Florian Bernard, Patrick Pérez, Christian Theobalt

Presented by Suleyman Aslan


2/32

Outline

• Introduction

• Motivation and Related Work

• Overview

• Semantic Code Vector

• Parametric Model-based Decoder

• Loss Layer

• Experiment Results and Comparisons

• Limitations

• Conclusion


3/32

Introduction

• The challenging problem of reconstructing a 3D human face

• encoding face pose, shape, expression, reflectance, and illumination

• Combines a convolutional autoencoder with a generative model

• a novel differentiable parametric decoder

• The encoder learns to extract semantically meaningful parameters

• the code vector

• Unsupervised learning

[1]


4/32

Motivation

• A challenging problem in computer vision and computer graphics

• Previous approaches mostly use calibrated, multi-view data

• High variability in pose, facial expression, and lighting

• Face reconstruction from a single uncalibrated image is an open research problem


5/32

Related Work

• Generative methods and regression-based methods

• Generative approaches

• fit a parametric model

• optimize alignment between the projected model and the image

• require a controlled capture setup or additional input data

• e.g., detected landmarks

• Regression-based approaches

• can estimate pose, shape, and expression

• can only be trained in a supervised fashion, which is a major obstacle

• estimated illumination and reflectance do not match the best generative methods


6/32

Related Work

• Cootes et al. [2] use Active Appearance Models to match a statistical model of object shape and appearance to a new image; the method can be used for matching and tracking faces

• Roth et al. [3] reconstruct a 3D face model from unconstrained photo collections by fitting a 3D Morphable Model (3DMM)

• Zhou et al. [4] use CNN cascades for the detection of facial landmarks; the method is supervised and predicts only sparse information

• Tran et al. [5] obtain robust and discriminative 3D Morphable Models, but require annotated training data

• Richardson et al. [6] achieve 3D face reconstruction by learning from synthetic data, which lacks realistic features


7/32

Related Work

• Zhmoginov et al. [7] use autoencoders to generate face images from code vectors

• Kulkarni et al. [8] learn graphics codes for the reproduction of images under different conditions

• Yan et al. [9] achieve unsupervised volumetric 3D object reconstruction from a single view

• These works address higher-level computer vision tasks; reconstruction of a full set of semantically meaningful parameters is not considered


8/32

Overview

• New type of model-based deep convolutional autoencoder

• Makes use of state-of-the-art generative and regression approaches

• Inspired by deep convolutional autoencoders, features a CNN encoder

• Unlike previously used CNN decoders, it features a newly designed decoder: a generative image-formation model built on a parametric 3D face model

• The semantic meaning of the code vector, i.e., the input to the decoder, is ensured by design


9/32

Overview

• Combined end-to-end training of the model-based decoder and a CNN encoder

• Unsupervised training and semantically meaningful face reconstruction

• Generalizes better to real-world data


10/32

Overview

• The decoder generates a realistic synthetic image of a face and enforces semantic meaning

• Pose, shape, expression, reflectance and illumination are parameterized independently

• Synthesized image is compared to the input using a photometric loss

[1]


11/32

Semantic Code Vector

• Semantic code vector 𝑥 ∈ 𝑅^257

• facial expression δ ∈ 𝑅^64

• shape α ∈ 𝑅^80

• skin reflectance β ∈ 𝑅^80

• camera rotation T ∈ SO(3) (the 3D rotation group)

• translation t ∈ 𝑅^3

• scene illumination γ ∈ 𝑅^27
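These dimensions sum to 257 (64 + 80 + 80 + 3 + 3 + 27). A minimal Python sketch of slicing the code vector into its semantic parts; the ordering follows this slide and is an assumption, not necessarily the authors' layout:

    import numpy as np

    # Hedged sketch: slice the 257-D semantic code vector into its parts.
    # The ordering below follows the slide and is an assumption.
    LAYOUT = [("expression_delta", 64), ("shape_alpha", 80),
              ("reflectance_beta", 80), ("rotation", 3),
              ("translation", 3), ("illumination_gamma", 27)]

    def split_code_vector(x):
        assert x.shape == (257,)
        parts, offset = {}, 0
        for name, dim in LAYOUT:
            parts[name] = x[offset:offset + dim]
            offset += dim
        return parts

    codes = split_code_vector(np.zeros(257))  # all-zero code = the average face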


12/32

Semantic Code Vector

• The face is represented by 24k vertices

• Normals are computed using the one-ring neighborhood

• A_S: average face shape

• E_S and E_e: PCA bases for shape and expression
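The geometry then follows the standard linear 3DMM form; a sketch consistent with the notation above (the exact scaling of the bases is an assumption):

V(α, δ) = A_S + E_S · α + E_e · δ

where V stacks the 24k vertex positions and α, δ come from the code vector.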


13/32

Semantic Code Vector

• Per-vertex reflectance is parameterized by an affine parametric model

• A_r: average skin reflectance

• E_r: PCA basis
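Analogously to the shape model, the affine reflectance model can be written as (a sketch following the slide's notation):

R(β) = A_r + E_r · β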


14/32

Parametric Model-Based Decoder

• The model generates the synthetic image that corresponds to the face

• Rendered using a pinhole camera model under full perspective projection

• mapping from camera space to screen space (𝑅^3 → 𝑅^2)

• The position and orientation of the camera are given by a rigid transformation

• an arbitrary point is mapped to camera space and then to screen space

• Scene illumination is represented using Spherical Harmonics [10]
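Putting these pieces together, one plausible form of the per-vertex image formation (the direction of the rigid transform and the SH convention are assumptions):

v_cam = T · v + t   (model space → camera space)

p = Π(v_cam)   (full perspective projection, 𝑅^3 → 𝑅^2)

c_i = r_i ⊙ Σ_{b=1..9} γ_b · H_b(n_i)   (Spherical Harmonics shading)

With B = 3 SH bands there are B² = 9 coefficients per color channel, matching γ ∈ 𝑅^27.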


15/32

Parametric Model-Based Decoder

• For the output image, the screen-space position and associated pixel color are computed for each vertex

• A backward pass that inverts image formation is implemented for backpropagation


16/32

Loss Layer

• A photometric loss function enables end-to-end training

• 𝐸land, sparse landmark alignment

• 𝐸photo, dense photometric alignment

• 𝐸reg, statistical plausibility of the modeled faces
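A plausible combination of the three terms, with scalar weights w left as assumptions (the deck does not give their values):

E(x) = w_photo · 𝐸photo(x) + w_land · 𝐸land(x) + w_reg · 𝐸reg(x)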


17/32

Loss Layer

• Dense photometric alignment (𝐸photo)

• predicts parameters that lead to a synthetic face image matching the input image

• Photometric alignment
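The formula image did not survive extraction; a plausible form of the dense term, averaged over the set V of visible vertices (the ℓ₂ norm and the visibility handling are assumptions):

𝐸photo(x) = (1/|V|) · Σ_{i∈V} ‖ I(p_i) − c_i ‖₂

where I(p_i) samples the input image at the projected vertex position and c_i is the rendered vertex color.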


18/32

Loss Layer

• Sparse landmark alignment (𝐸land)

• enforces projected 3D vertices to be close to the 2D detections

• based on detected facial landmarks [20]

• 46 landmarks

• Optional loss
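A plausible form over the 46 landmark pairs; the per-landmark confidence weights w_j are an assumption:

𝐸land(x) = Σ_{j=1..46} w_j · ‖ p(v_j) − s_j ‖₂²

where v_j is the model vertex associated with landmark j and s_j its 2D detection.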


19/32

Loss Layer

• Statistical regularization (𝐸reg)

• enforces plausible facial shape, expression, and skin reflectance

• prefers values close to the average

• Pose and illumination are not regularized
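With the PCA coefficients expressed in units of their standard deviations, a Tikhonov-style regularizer of the following form is plausible (the relative weights w_β and w_δ are assumptions):

𝐸reg(x) = Σ_k α_k² + w_β · Σ_k β_k² + w_δ · Σ_k δ_k²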


20/32

Results

• Encoders based on AlexNet [18] and VGG-Face [19] are tested

• The last fully connected layer is modified to output the 257 model parameters

• The encoder regresses pose, shape, expression, reflectance, and illumination from a single image

[1], images from CelebA [11]
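A minimal sketch of this head replacement in PyTorch; the framework choice and the weights="DEFAULT" initialization are assumptions, not the authors' setup:

    import torch.nn as nn
    from torchvision.models import alexnet

    # Swap AlexNet's 1000-way classification head for a 257-D regression
    # head that outputs the semantic code vector.
    model = alexnet(weights="DEFAULT")          # pre-trained initialization
    model.classifier[6] = nn.Linear(4096, 257)
    nn.init.zeros_(model.classifier[6].weight)  # zero-initialized last layer
    nn.init.zeros_(model.classifier[6].bias)    # (see the training slide)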


21/32

Results

• A combination of four datasets is used for training: CelebA [11], LFW [12], FaceWarehouse [13], and 300-VW [14, 15, 16]

• Facial landmark detection [20] is used, and images are cropped to a bounding box using Haar cascade face detection [17]

• Bad detections are dropped

• 147k images in total, 142k for training, 5k for evaluation

• The network is trained using

• AdaDelta, 200k batch iterations with a batch size of 5

• a learning rate of 0.1 for all parameters except z-translation

• The encoder is initialized with pre-trained weights

• The last fully connected layer is initialized with zero weights
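A hedged sketch of this setup in PyTorch; one plausible reading of the z-translation exception is to rescale the gradient of that single code entry, and Z_TRANS_SCALE plus the index 229 (x, y, z at 227..229 under the layout assumed earlier) are hypothetical:

    import torch
    import torch.nn as nn
    from torchvision.models import alexnet

    # Encoder as in the previous sketch: AlexNet with a 257-D regression head.
    model = alexnet(weights="DEFAULT")
    model.classifier[6] = nn.Linear(4096, 257)

    # AdaDelta with the stated base learning rate of 0.1.
    optimizer = torch.optim.Adadelta(model.parameters(), lr=0.1)

    Z_TRANS_SCALE = 0.01  # hypothetical: the deck only says z-translation differs

    def scale_z_translation_grad(code):
        # Rescale the gradient flowing through the z-translation code entry.
        def hook(grad):
            g = grad.clone()
            g[:, 229] *= Z_TRANS_SCALE
            return g
        code.register_hook(hook)

    # One of the 200k iterations (batch size 5); decoder_loss stands in
    # for the model-based decoder plus loss layer (hypothetical):
    # code = model(images)                  # images: (5, 3, 224, 224) crops
    # scale_z_translation_grad(code)
    # loss = decoder_loss(code, images)
    # optimizer.zero_grad(); loss.backward(); optimizer.step()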


22/32

Comparisons

• To Richardson et al. [6]

• Richardson et al. train on synthetic images, which lack several dimensions of variability, e.g., facial hair

[1]


23/32

Comparisons

• To Tran et al. [5]

• Tran et al. do not estimate facial expression and illumination, and are trained in a supervised manner

[1]


24/32

Comparisons

• To Thies et al. [21]

• Thies et al. require detected landmarks, are slower, and achieve similar or lower quality

[1]


25/32

Comparisons

• To Garrido et al. [22]

• Garrido et al. require landmark detection; reconstruction quality is comparable

[1]


26/32

Evaluation

• The impact of different encoders is evaluated

• VGG-Face [19] performs slightly better than AlexNet [18], with lower landmark and photometric error

[1]


27/32

Evaluation

• Evaluation of (fully) unsupervised training

• Landmark error can be reduced even when landmark alignment is not part of the loss function

• Training with the surrogate landmark loss improves landmark alignment, yields similar photometric error, and improves robustness to occlusions and expressions

[1]


28/32

Evaluation

• Evaluation on synthetic data

• Trained on 100k synthetic images with background augmentation; evaluated on 5k images with known parameters

[1]


29/32

Evaluation

• Comparison to a standard convolutional autoencoder

• The model-based approach obtains sharper reconstructions and better semantic parameters

• The baseline decoder is trained on synthetic imagery generated by the model to learn the parameter-to-image mapping

[1]


30/32

Limitations

• Implausible reconstructions are possible outside the span of the training data

• Cannot regress facial hair, eye gaze, or accessories

• Strong occlusions cause the approach to fail

[1]


31/32

Conclusion

• The first model-based deep convolutional face autoencoder

• Trained in an unsupervised manner

• Learns meaningful set of semantic parameters

• Semantic meaning in the code vector is enforced by design

• Pose, shape, expression, skin reflectance, and illumination can be accurately regressed


32/32

References

1. A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt. MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. arXiv:1703.10580v1, 2017

2. T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, June 2001

3. J. Roth, Y. Tong, and X. Liu. Adaptive 3D face reconstruction from unconstrained photo collections. December 2016

4. E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin. Extensive Facial Landmark Localization with Coarse-to-Fine Convolutional Network Cascade. In CVPRW, 2013

5. A. T. Tran, T. Hassner, I. Masi, and G. Medioni. Regressing Robust and Discriminative 3D Morphable Models with a very Deep Neural Network. arXiv:1612.04904v1, 2016

6. E. Richardson, M. Sela, and R. Kimmel. 3D face reconstruction by learning from synthetic data. In 3DV, 2016

7. A. Zhmoginov and M. Sandler. Inverting face embeddings with convolutional neural networks. arXiv:1606.04189, June 2016

8. T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015

9. X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. arXiv:1612.00814, Dec. 2016

10. C. Müller. Spherical Harmonics. Springer, 1966

11. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015

12. G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007

13. C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, Mar. 2014.

14. G. G. Chrysos, E. Antonakos, S. Zafeiriou, and P. Snape. Offline deformable face tracking in arbitrary videos. In The IEEE International Conference on Computer Vision (ICCV) Workshops, December 2015

15. J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, and M. Pantic. The first facial landmark tracking in-the-wild challenge: Benchmark and results. In ICCVW, December 2015

16. G. Tzimiropoulos. Project-out cascaded regression with an application to face alignment. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015

17. G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000

18. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012

19. O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.

20. J. M. Saragih, S. Lucey, and J. F. Cohn. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision, 91(2):200–215, 2011

21. J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In CVPR, 2016

22. P. Garrido, M. Zollhöfer, D. Casas, L. Valgaerts, K. Varanasi, P. Pérez, and C. Theobalt. Reconstruction of personalized 3D face rigs from monocular video. ACM Transactions on Graphics, 35(3):28:1–15, June 2016.