

1/32

MoFA: Model-based Deep Convolutional Face Autoencoder

for Unsupervised Monocular Reconstruction

Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido,

Florian Bernard, Patrick Pérez, Christian Theobalt

Presented by Suleyman Aslan


2/32

Outline

• Introduction

• Motivation and Related Work

• Overview

• Semantic Code Vector

• Parametric Model-based Decoder

• Loss Layer

• Experiment Results and Comparisons

• Limitations

• Conclusion


3/32

Introduction

• The challenging problem of reconstructing a 3D human face

• encoding face pose, shape, expression, reflectance, and illumination

• Combines a convolutional autoencoder with a generative model

• a novel differentiable parametric decoder

• The encoder learns to extract semantically meaningful parameters

• the code vector

• Unsupervised learning

[1]


4/32

Motivation

• A challenging problem in computer vision and computer graphics

• Previous approaches mostly use calibrated, multi-view data

• High variability in pose, facial expression, and lighting

• Face reconstruction from a single uncalibrated image is an open research problem


5/32

Related Work

• Generative methods and regression-based methods

• Generative approaches

• fit a parametric model

• optimize alignment between the projected model and the image

• require a controlled capture setup or additional input data

• e.g., detected landmarks

• Regression-based approaches

• can estimate pose, shape, and expression

• can only be trained in a supervised fashion, which is a major obstacle

• estimated illumination and reflectance do not match the best generative methods


6/32

Related Work

• Cootes et al. [2] use Active Appearance Models to match a statistical model of object shape and appearance to a new image; the method can be used for matching and tracking faces

• Roth et al. [3] reconstruct a 3D face model from unconstrained photo collections by fitting a 3D Morphable Model (3DMM)

• Zhou et al. [4] use CNN cascades for the detection of facial landmarks; the method is supervised and predicts only sparse information

• Tran et al. [5] obtain robust and discriminative 3D Morphable Models, but require annotated training data

• Richardson et al. [6] achieve 3D face reconstruction by learning from synthetic data, which lacks realistic features


7/32

Related Work

• Zhmoginov et al. [7] use autoencoders to generate face images from code vectors

• Kulkarni et al. [8] learn graphics codes for the reproduction of images under different conditions

• Yan et al. [9] achieve unsupervised volumetric 3D object reconstruction from a single view

• These works address higher-level computer vision tasks; reconstruction of a full set of semantically meaningful parameters is not considered


8/32

Overview

• New type of model-based deep convolutional autoencoder

• Makes use of state-of-the-art generative and regression approaches

• Inspired by deep convolutional autoencoders, features a CNN encoder

• Unlike previously used CNN decoders, it features a newly designed decoder: a generative image-formation model built on a parametric 3D face model

• The semantic meaning of the code vector, i.e., the input to the decoder, is ensured by design


9/32

Overview

• Combined end-to-end training of the model-based decoder and a CNN encoder

• Unsupervised training and semantically meaningful face reconstruction

• Generalizes better to real-world data


10/32

Overview

• The decoder generates a realistic synthetic image of a face and enforces semantic meaning

• Pose, shape, expression, reflectance and illumination are parameterized independently

• Synthesized image is compared to the input using a photometric loss

[1]


11/32

Semantic Code Vector

• Semantic code vector 𝑥 ∈ 𝑅^257

• facial expression δ ∈ 𝑅^64

• shape α ∈ 𝑅^80

• skin reflectance β ∈ 𝑅^80

• camera rotation T ∈ SO(3) (the 3D rotation group)

• translation t ∈ 𝑅^3

• scene illumination γ ∈ 𝑅^27
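These dimensions sum to 257 (64 + 80 + 80 + 3 + 3 + 27). A minimal Python sketch of slicing the code vector into its semantic parts; the ordering follows this slide and is an assumption, not necessarily the authors' layout:

    import numpy as np

    # Hedged sketch: slice the 257-D semantic code vector into its parts.
    # The ordering below follows the slide and is an assumption.
    LAYOUT = [("expression_delta", 64), ("shape_alpha", 80),
              ("reflectance_beta", 80), ("rotation", 3),
              ("translation", 3), ("illumination_gamma", 27)]

    def split_code_vector(x):
        assert x.shape == (257,)
        parts, offset = {}, 0
        for name, dim in LAYOUT:
            parts[name] = x[offset:offset + dim]
            offset += dim
        return parts

    codes = split_code_vector(np.zeros(257))  # all-zero code = the average face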


12/32

Semantic Code Vector

• The face is represented by 24k vertices

• Normals are computed using the one-ring neighborhood

• A_S: average face shape

• E_S and E_e: PCA bases for shape and expression
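The geometry then follows the standard linear 3DMM form; a sketch consistent with the notation above (the exact scaling of the bases is an assumption):

V(α, δ) = A_S + E_S · α + E_e · δ

where V stacks the 24k vertex positions and α, δ come from the code vector.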


13/32

Semantic Code Vector

• Per-vertex reflectance is parameterized by an affine parametric model

• A_r: average skin reflectance

• E_r: PCA basis
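Analogously to the shape model, the affine reflectance model can be written as (a sketch following the slide's notation):

R(β) = A_r + E_r · β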


14/32

Parametric Model-Based Decoder

• The model generates the synthetic image that corresponds to the face

• Rendered using a pinhole camera model under full perspective projection

• mapping from camera space to screen space (𝑅^3 → 𝑅^2)

• The position and orientation of the camera are given by a rigid transformation

• an arbitrary point is mapped to camera space and then to screen space

• Scene illumination is represented using Spherical Harmonics [10]
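Putting these pieces together, one plausible form of the per-vertex image formation (the direction of the rigid transform and the SH convention are assumptions):

v_cam = T · v + t   (model space → camera space)

p = Π(v_cam)   (full perspective projection, 𝑅^3 → 𝑅^2)

c_i = r_i ⊙ Σ_{b=1..9} γ_b · H_b(n_i)   (Spherical Harmonics shading)

With B = 3 SH bands there are B² = 9 coefficients per color channel, matching γ ∈ 𝑅^27.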


15/32

Parametric Model-Based Decoder

• For the output image, the screen-space position and associated pixel color are computed for each vertex

• A backward pass that inverts image formation is implemented for backpropagation


16/32

Loss Layer

• A photometric loss function enables end-to-end training

• 𝐸land, sparse landmark alignment

• 𝐸photo, dense photometric alignment

• 𝐸reg, statistical plausibility of the modeled faces
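A plausible combination of the three terms, with scalar weights w left as assumptions (the deck does not give their values):

E(x) = w_photo · 𝐸photo(x) + w_land · 𝐸land(x) + w_reg · 𝐸reg(x)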


17/32

Loss Layer

• Dense photometric alignment (𝐸photo)

• predicts parameters that lead to a synthetic face image matching the input image

• Photometric alignment
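The formula image did not survive extraction; a plausible form of the dense term, averaged over the set V of visible vertices (the ℓ₂ norm and the visibility handling are assumptions):

𝐸photo(x) = (1/|V|) · Σ_{i∈V} ‖ I(p_i) − c_i ‖₂

where I(p_i) samples the input image at the projected vertex position and c_i is the rendered vertex color.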


18/32

Loss Layer

• Sparse landmark alignment (𝐸land)

• enforces projected 3D vertices to be close to the 2D detections

• based on detected facial landmarks [20]

• 46 landmarks

• Optional loss
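A plausible form over the 46 landmark pairs; the per-landmark confidence weights w_j are an assumption:

𝐸land(x) = Σ_{j=1..46} w_j · ‖ p(v_j) − s_j ‖₂²

where v_j is the model vertex associated with landmark j and s_j its 2D detection.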


19/32

Loss Layer

• Statistical regularization (𝐸reg)

• enforces plausible facial shape, expression, and skin reflectance

• prefers values close to the average

• Pose and illumination are not regularized
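With the PCA coefficients expressed in units of their standard deviations, a Tikhonov-style regularizer of the following form is plausible (the relative weights w_β and w_δ are assumptions):

𝐸reg(x) = Σ_k α_k² + w_β · Σ_k β_k² + w_δ · Σ_k δ_k²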


20/32

Results

• Encoders based on AlexNet [18] and VGG-Face [19] are tested

• The last fully connected layer is modified to output the 257 model parameters

• The encoder regresses pose, shape, expression, reflectance, and illumination from a single image

[1], images from CelebA [11]
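A minimal sketch of this head replacement in PyTorch; the framework choice and the weights="DEFAULT" initialization are assumptions, not the authors' setup:

    import torch.nn as nn
    from torchvision.models import alexnet

    # Swap AlexNet's 1000-way classification head for a 257-D regression
    # head that outputs the semantic code vector.
    model = alexnet(weights="DEFAULT")          # pre-trained initialization
    model.classifier[6] = nn.Linear(4096, 257)
    nn.init.zeros_(model.classifier[6].weight)  # zero-initialized last layer
    nn.init.zeros_(model.classifier[6].bias)    # (see the training slide)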


21/32

Results

• A combination of four datasets is used for training: CelebA [11], LFW [12], FaceWarehouse [13], and 300-VW [14, 15, 16]

• Facial landmark detection [20] is used, and images are cropped to a bounding box using Haar cascade face detection [17]

• Bad detections are dropped

• 147k images in total, 142k for training, 5k for evaluation

• The network is trained using

• AdaDelta, 200k batch iterations with a batch size of 5

• a learning rate of 0.1 for all parameters except z-translation

• The encoder is initialized with pre-trained weights

• The last fully connected layer is initialized with zero weights
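A hedged sketch of this setup in PyTorch; one plausible reading of the z-translation exception is to rescale the gradient of that single code entry, and Z_TRANS_SCALE plus the index 229 (x, y, z at 227..229 under the layout assumed earlier) are hypothetical:

    import torch
    import torch.nn as nn
    from torchvision.models import alexnet

    # Encoder as in the previous sketch: AlexNet with a 257-D regression head.
    model = alexnet(weights="DEFAULT")
    model.classifier[6] = nn.Linear(4096, 257)

    # AdaDelta with the stated base learning rate of 0.1.
    optimizer = torch.optim.Adadelta(model.parameters(), lr=0.1)

    Z_TRANS_SCALE = 0.01  # hypothetical: the deck only says z-translation differs

    def scale_z_translation_grad(code):
        # Rescale the gradient flowing through the z-translation code entry.
        def hook(grad):
            g = grad.clone()
            g[:, 229] *= Z_TRANS_SCALE
            return g
        code.register_hook(hook)

    # One of the 200k iterations (batch size 5); decoder_loss stands in
    # for the model-based decoder plus loss layer (hypothetical):
    # code = model(images)                  # images: (5, 3, 224, 224) crops
    # scale_z_translation_grad(code)
    # loss = decoder_loss(code, images)
    # optimizer.zero_grad(); loss.backward(); optimizer.step()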


22/32

Comparisons

• To Richardson et al. [6]

• Richardson et al. train on synthetic images, which lack several dimensions of variability, e.g., facial hair

[1]


23/32

Comparisons

• To Tran et al. [5]

• Tran et al. do not estimate facial expression and illumination, and are trained in a supervised manner

[1]


24/32

Comparisons

• To Thies et al. [21]

• Thies et al. require detected landmarks, are slower, and achieve similar or lower quality

[1]


25/32

Comparisons

• To Garrido et al. [22]

• Garrido et al. require landmark detection; reconstruction quality is comparable

[1]


26/32

Evaluation

• The impact of different encoders is evaluated

• VGG-Face [19] performs slightly better than AlexNet [18], with lower landmark and photometric error

[1]


27/32

Evaluation

• Evaluation of (fully) unsupervised training

• Landmark error can be reduced even when landmark alignment is not part of the loss function

• Training with the surrogate landmark loss improves landmark alignment, yields similar photometric error, and improves robustness to occlusions and expressions

[1]


28/32

Evaluation

• Evaluation on synthetic data

• Trained on 100k synthetic images with background augmentation; evaluated on 5k images with known parameters

[1]


29/32

Evaluation

• Comparison to a standard convolutional autoencoder

• The model-based approach obtains sharper reconstructions and better semantic parameters

• The baseline decoder is trained on synthetic imagery generated by the model to learn the parameter-to-image mapping

[1]


30/32

Limitations

• Implausible reconstructions are possible outside the span of the training data

• Cannot regress facial hair, eye gaze, or accessories

• Strong occlusions cause the approach to fail

[1]


31/32

Conclusion

• The first model-based deep convolutional face autoencoder

• Trained in an unsupervised manner

• Learns meaningful set of semantic parameters

• Semantic meaning in the code vector is enforced by design

• Pose, shape, expression, skin reflectance, and illumination can be accurately regressed


32/32

References

1. A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt. MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. arXiv:1703.10580v1, 2017

2. T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, June 2001

3. J. Roth, Y. Tong, and X. Liu. Adaptive 3D face reconstruction from unconstrained photo collections. December 2016

4. E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin. Extensive Facial Landmark Localization with Coarse-to-Fine Convolutional Network Cascade. In CVPRW, 2013

5. A. T. Tran, T. Hassner, I. Masi, and G. Medioni. Regressing Robust and Discriminative 3D Morphable Models with a very Deep Neural Network. arXiv:1612.04904v1, 2016

6. E. Richardson, M. Sela, and R. Kimmel. 3D face reconstruction by learning from synthetic data. In 3DV, 2016

7. A. Zhmoginov and M. Sandler. Inverting face embeddings with convolutional neural networks. arXiv:1606.04189, June 2016

8. T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015

9. X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. arXiv:1612.00814, Dec. 2016

10. C. Müller. Spherical Harmonics. Springer, 1966

11. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015

12. G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007

13. C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, Mar. 2014.

14. G. G. Chrysos, E. Antonakos, S. Zafeiriou, and P. Snape. Offline deformable face tracking in arbitrary videos. In The IEEE International Conference on Computer Vision (ICCV) Workshops, December 2015

15. J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, and M. Pantic. The first facial landmark tracking in-the-wild challenge: Benchmark and results. In ICCVW, December 2015

16. G. Tzimiropoulos. Project-out cascaded regression with an application to face alignment. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015

17. G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000

18. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012

19. O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.

20. J. M. Saragih, S. Lucey, and J. F. Cohn. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision, 91(2):200–215, 2011

21. J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In CVPR, 2016

22. P. Garrido, M. Zollhöfer, D. Casas, L. Valgaerts, K. Varanasi, P. Pérez, and C. Theobalt. Reconstruction of personalized 3D face rigs from monocular video. ACM Transactions on Graphics, 35(3):28:1–15, June 2016.