
Digital Twin: Acquiring High-Fidelity 3D Avatar from a Single Image

Ruizhe Wang∗†1, Chih-Fan Chen∗†1, Hao Peng†1, Xudong Liu†1,2, Oliver Liu†1, and Xin Li‡2

1 Oben, Inc.    2 West Virginia University

Abstract

We present an approach to generate a high-fidelity 3D face avatar with a high-resolution UV texture map from a single image. To estimate the face geometry, we use a deep neural network to directly predict vertex coordinates of the 3D face model from the given image. The 3D face geometry is further refined by a non-rigid deformation process to more accurately capture facial landmarks before texture projection. A key novelty of our approach is to train the shape regression network on facial images synthetically generated using a high-quality rendering engine. Moreover, our shape estimator fully leverages the discriminative power of deep facial identity features learned from millions of facial images. We have conducted extensive experiments to demonstrate the superiority of our optimized 2D-to-3D rendering approach, especially its excellent generalization property on real-world selfie images. Our proposed system of rendering 3D avatars from 2D images has a wide range of applications, from virtual/augmented reality (VR/AR) and telepsychiatry to human-computer interaction and social networks.

1. Introduction

Acquiring high-quality 3D avatars is an essential task in many vision applications including VR/AR, teleconferencing, virtual try-on, computer games, special effects, and so on. A common practice, adopted by most professional production studios, is to manually create avatars from 3D scans or photo references by skillful artists. This process is often time-consuming and labor-intensive because each model requires days of manual processing and touching up. It is desirable to automate the process of 3D avatar generation by leveraging rapid advances in computer vision/graphics and image/geometry processing.

∗ Equal contribution    † {ruizhe, chihfan, hpeng, xudong, oliver}@oben.com    ‡ [email protected]

Figure 1: Sample outputs of our proposed avatar generation approach. From left to right: input image, inferred shape model with low polygon count, composite model with UV diffuse map.

There has been a flurry of works on generating 3D avatars from handheld video [19], Kinect [1] and mesh models [45] in the open literature.

Developing a fully automatic system for generating a 3D avatar from a single image is challenging because the estimation of both the facial shape and the texture map involves an intrinsically ambiguous composition of light, shape and surface material. Conventional wisdom attempts to address this issue by inverse rendering, which formulates image decomposition as an optimization problem and estimates the parameters best fitting the observed image [4, 2, 31]. More recently, several deep learning based approaches have been proposed, either in a supervised setup to directly regress the parameterized 3D face model [50, 11, 43] or in an unsupervised fashion [39, 16, 32] with the help of a differentiable rendering process. However, these existing methods usually assume over-simplified lighting, shading and skin surface models, which do not take real-world complexities (e.g., sub-surface scattering, shadows caused by self-occlusion and complicated skin reflectance fields [9]) into account. Consequently, the recovered 3D avatar often does not faithfully reflect the actual face presented in the image.

To meet those challenges, we propose a novel semi-supervised approach that utilizes synthetically rendered, photo-realistic facial images augmented from a prioritized 3D facial scan dataset. Upon collecting and processing 482 neutral facial scans with a medical-grade 3D facial scanner, we perform shape augmentation and utilize a high-fidelity rendering engine to create a large collection of photo-realistic facial images. To the best of our knowledge, this work is the first attempt to leverage photo-realistic facial image synthesis for accurate face shape inference.

For facial geometry estimation, we propose to first extract deep facial identity features [37, 28], trained on millions of images, which encode each face into a unique latent representation, and then regress the vertex coordinates of a generic 3D head model. To better capture facial landmarks for texture projection, the vertex coordinates are further refined in a non-rigid manner by jointly optimizing over camera intrinsics, head pose, facial expression and a per-vertex corrective field. Our final generated model consists of a shape model with a low polygon count and a high-resolution texture map with sharp details, which allows efficient rendering even on mobile devices (as shown in Fig. 1).

At the system level, 3D avatars created by our approach are similar to those of Pinscreen [17, 48] and FaceSoft.io [14]. The Pinscreen shape model is reconstructed via an analysis-by-synthesis method [4], whereas our shape model is directly regressed from deep facial identity features and hence reaches higher shape similarity. The FaceSoft.io shape and texture models utilize a collection of 10,000 facial scans [6], whereas our semi-supervised method uses only 482 scans and is still capable of achieving similar shape reconstruction accuracy and a higher-resolution UV texture map from an input selfie.

Our key contributions can be summarized as follows:

• A system for generating a high-fidelity UV-textured 3D avatar from a single image, which can be efficiently rendered in real time even on mobile devices.

• Training a shape estimator on synthetic photo-realistic images by using pre-trained deep facial identity features. The trained networks demonstrate excellent generalization properties on real-world images.

• Extensive qualitative and quantitative evaluation of the proposed method against other state-of-the-art face modeling techniques, demonstrating its superiority (i.e., higher shape similarity and texture resolution).

2. Related Work

3D Face Representation. The 3D Morphable Model (3DMM) [4] uses Principal Component Analysis (PCA) on aligned 3D neutral faces to reduce the dimension of the 3D face representation, making the face fitting problem more tractable. The FaceWarehouse technique [8] enhances the original PCA-based neutral face model with expressions by applying multi-linear analysis [44] to a large collection of 4D facial scans captured with RGB-D sensors. The quality of the multi-linear model was further improved in [5] by jointly optimizing the model and the group-wise registration of the 3D scans. In [6], a Large Scale Facial Model with 10,000 faces was generated to maximize the coverage of gender and ethnicity. The training data was further enlarged in [27], which created a linear shape space trained from 4D scans of 3,800 human heads. More recently, a non-linear model was proposed in [41] from a large set of unconstrained face images without the necessity of collecting 3D face scans.

Fitting via Inverse Rendering. Inverse rendering [2, 4] formulates 3D face modeling as an optimization problem over the entire parameter space, seeking the best fit for the observed image. In addition to pixel intensity values, other constraints, such as facial landmarks and edge contours, are exploited for more accurate fitting [31]. More recently, GanFit [14] used a generative neural network for facial texture modeling and utilized an additional facial identity loss function in the optimization formulation. The inverse rendering based modeling approach has been widely used in many applications [48, 18, 40, 15].

Supervised Shape Regression. Convolutional Neural Network (CNN) based approaches have been proposed to directly map an input image to the parameters of a 3D face model such as 3DMM [10, 50, 22, 42, 49]. In [20], a volumetric representation was learned from an input image. In [34], an input color image was mapped to a depth image using an image translation network. In [11], a network was proposed to jointly reconstruct the 3D facial structure and provide dense alignment in the UV space. The work of [43] took a layered approach toward decoupling low-frequency geometry from its mid-level details estimated by a shape-from-shading approach. It is worth mentioning that many CNN-based approaches use facial shapes estimated by inverse rendering as the ground truth during training.

Unsupervised Learning. Most recently, face modeling from images via unsupervised learning has become popular because it affords an almost unlimited amount of training data. An image formation layer was introduced in [39] as the decoder, jointly working with an auto-encoder architecture for end-to-end unsupervised training. SfSNet [35] explicitly decomposes an input image into albedo, normal and lighting components, which are then composed back to approximate the original input image. In [16], 3DMM parameters were first directly learned from facial identity encodings, and the parameter optimization was then formulated in an unsupervised fashion by introducing a differentiable renderer and a facial identity loss on the rendered facial image. A multi-level face model (i.e., 3DMM with a corrective field) was developed in [38] following an inverse rendering setup that explicitly models geometry, reflectance and illumination per vertex. RingNet [32] employed an idea similar to the triplet loss for encoding all images of the same subject to the same latent shape representation.

Deep Facial Identity Feature. Recent advances in face recognition [37, 28, 33] attempt to encode all facial images of the same subject under different conditions into identical feature representations, namely deep facial identity features. Several attempts have been made to utilize this robust feature representation for face modeling. GanFit [14] added a deep facial identity loss to the commonly used landmark and pixel intensity losses. In [16], 3DMM parameters were directly learned from deep facial features. Although our shape regression network is similar to theirs, the choice of training data is different: unlike their unsupervised setting, we opt to work with supervision from synthetically rendered facial images.

3. Proposed Method

3.1. Overview

An overview of the proposed method is shown in Fig. 2. To facilitate facial image synthesis (Sec. 3.2) for training a shape regression neural network (Sec. 3.3), we have collected and processed a prioritized 3D face dataset, from which we can sample augmented 3D face shapes with UV textures to render a large collection of photo-realistic facial images. During testing, the input image is first used to directly regress the 3D vertex coordinates of a 3D face model with the given topology, which are further refined to fit the input image with a per-vertex non-rigid deformation approach (Sec. 3.4). Upon accurate fitting, the selfie texture is projected to the UV space to infer a complete texture map (Sec. 3.5).

3.2. Photo-Realistic Facial Synthesis

3D Scan Database. The most widely used Basel Face Model (BFM) [29] has two major drawbacks. First, it consists of 200 subjects, mainly Caucasian, which might lead to biased face shape estimation. Second, each face is represented by a dense model with a high polygon count, per-vertex texture appearance and the frontal face only, which limits its use for production-level real-time rendering. To overcome these limitations, we have collected a total of 512 subjects using a professional-grade multi-camera stereo scanner (3dMD LLC, Atlanta) across different genders and ethnicities, as shown in Table 1.

1 http://www.3dmd.com/

Gender / Ethnicity | White    | Asian    | Black   | Total
Male               | 82 / 5   | 178 / 5  | 8 / 5   | 268 / 15
Female             | 45 / 5   | 164 / 5  | 5 / 5   | 214 / 15
Total              | 127 / 10 | 342 / 10 | 13 / 10 | 482 / 30

Table 1: The distribution of gender and ethnicity in our database (training / testing). Note that we randomly select 5 subjects from each group for testing, and the remaining subjects are used for training and validation.

We use a face representation containing a head model of 2,925 vertices and a 2048 × 2048 diffuse map. We take a non-rigid alignment approach [8], deforming a generic head model to match the captured facial scan, and then transfer the texture onto the generic model's UV space. With further manual artistic touch-up, we obtain the final high-fidelity diffuse map.

Shape Augmentation. 482 subjects are far from enough to cover all possible facial shape variations. While it is expensive to collect thousands of high-quality facial scans, we adopt an alternative shape augmentation approach to improve the generalization ability of the trained neural network. First, we adopt a recent deformation representation (DR) [46, 13] to model a 3D facial mesh P. The DR feature encodes the i-th vertex P_i = [P_i^x, P_i^y, P_i^z] as an R^9 vector. Hence the DR feature of the entire mesh is represented as a vector D ∈ R^(|P|×9). Please see the supplementary material on how to compute a DR feature D from P and vice versa.

Upon obtaining a set of DR features (D_1, ..., D_N), where N is the total number of subjects, we follow [21] to sample new DR features. More specifically, we sample a vector (r, θ_1, ..., θ_{m−1}) in polar coordinates, where r follows a uniform distribution U[0.6, 1.3] and each θ_i follows a uniform distribution U[0, π/2]. We calculate its corresponding Cartesian coordinates (a_1, a_2, ..., a_m) and interpolate the sampled DR features as Σ_{i=1..m} a_i D_i, from which we further calculate the corresponding facial mesh. In our experiments, we use m = 5 and only select samples from the same gender and ethnicity. We generate 10,000 new 3D faces with a ratio of 0.65/0.30/0.05 across Asian/Caucasian/Black and a ratio of 0.5/0.5 across Male/Female. For each newly sampled face, we assign its UV texture by choosing that of the closest 3D face of the same ethnicity and gender among the existing 482 subjects.
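For concreteness, the sampling step above can be sketched in NumPy as follows. The hyperspherical-to-Cartesian conversion is the standard one; the choice of which m subjects within a gender/ethnicity group are blended (a uniform random draw here) is our assumption, not something the paper specifies.

```python
import numpy as np

def sample_blend_weights(m=5, rng=None):
    """Sample (a_1, ..., a_m): draw r ~ U[0.6, 1.3] and angles
    theta_i ~ U[0, pi/2], then convert the hyperspherical coordinates
    (r, theta_1, ..., theta_{m-1}) to Cartesian coordinates."""
    rng = np.random.default_rng() if rng is None else rng
    r = rng.uniform(0.6, 1.3)
    theta = rng.uniform(0.0, np.pi / 2, size=m - 1)
    a = np.empty(m)
    sin_prod = 1.0
    for i in range(m - 1):
        a[i] = r * sin_prod * np.cos(theta[i])
        sin_prod *= np.sin(theta[i])
    a[m - 1] = r * sin_prod
    return a

def augment_dr(dr_feats, m=5, rng=None):
    """dr_feats: (N, |P|*9) DR features of one gender/ethnicity group.
    Returns one interpolated DR feature sum_i a_i * D_i over m randomly
    chosen subjects (the subject-selection strategy is an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(dr_feats), size=m, replace=False)
    return sample_blend_weights(m, rng) @ dr_feats[idx]
```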

Synthetic Rendering. We use an off-the-shelf high-quality rendering engine, V-Ray. With artistic assistance, we set up a shader graph to render photo-realistic facial images given a custom diffuse map and a generic specular map. We manually set up 30 different lighting conditions and further randomize the head rotation within [−15°, +15°] in roll, yaw and pitch. The backgrounds of the rendered images are randomized with a large collection of indoor and outdoor images.

2 https://vray.us/


Figure 2: Overview of the proposed approach. During training, we learn a shape regression neural network on photo-realistic synthetic facial images. During testing, we infer a low polygon count shape model with a UV diffuse map generated from the projected texture.

We opt not to render eye models and instead mask out the eye areas at test time using detected local eye landmarks. Please see the supplementary material for more details.
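As a small illustration of the randomization described above, one per-image rendering configuration could be drawn as below; apart from the stated ±15° rotation range and the 30 lighting conditions, the distributions and the background pool size are placeholders.

```python
import numpy as np

def sample_render_config(rng, num_lighting_rigs=30, num_backgrounds=1000):
    """Draw one synthetic-rendering configuration: head rotation in
    [-15, +15] degrees for roll/yaw/pitch, one of the 30 manually designed
    lighting conditions, and a random background index (the background pool
    size here is a placeholder)."""
    return {
        "roll_yaw_pitch_deg": rng.uniform(-15.0, 15.0, size=3),
        "lighting_id": int(rng.integers(num_lighting_rigs)),
        "background_id": int(rng.integers(num_backgrounds)),
    }

config = sample_render_config(np.random.default_rng(0))
```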

3.3. Regressing Vertex Coordinates

Our shape regression network consists of a feature encoder and a shape decoder. Deep facial identity features are known for their robustness under varying conditions such as lighting, head pose and facial expression, providing a naturally ideal option for the encoded feature. Although any off-the-shelf facial recognition network would be sufficient for our task, we adopt Light CNN-29V2 [47] due to its good balance between network size and encoding efficiency. A pre-trained Light CNN-29V2 model is used to encode an input image into a 256-dimensional feature vector. We use a weighted per-vertex L1 loss: a weight of 5 for vertices in the facial area (within a radius of 95 mm from the nose tip) and a weight of 1 for other vertices.
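A minimal PyTorch sketch of this weighted per-vertex L1 loss is given below; whether the loss is averaged or summed over vertices and batch is an assumption.

```python
import torch

def weighted_vertex_l1(pred, gt, facial_mask):
    """Weighted per-vertex L1 loss for shape regression: weight 5 for
    vertices within 95 mm of the nose tip (facial area) and weight 1 for
    all other vertices.  pred, gt: (B, V, 3); facial_mask: (V,) bool."""
    w = facial_mask.float() * 4.0 + 1.0                 # 5 inside the face, 1 elsewhere
    return (w[None, :, None] * (pred - gt).abs()).mean()
```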

For the shape decoder, we use three fully connected (FC) layers with output sizes of 128, 200 and 8,775, respectively. The last FC layer directly predicts the concatenated vertex coordinates of a generic head model consisting of 2,925 points, and it is initialized with 200 pre-computed PCA components explaining more than 99% of the variance observed in the 10,000 augmented 3D facial shapes. When compared with unsupervised learning [16], our access to a high-quality prioritized 3D face scan dataset makes it possible to achieve higher accuracy through supervision.
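The decoder can be sketched in PyTorch as follows. The layer sizes and PCA initialization follow the text; the ReLU activations and the exact way the PCA basis and mean are copied into the last layer are our assumptions.

```python
import torch
import torch.nn as nn

class ShapeDecoder(nn.Module):
    """Three-FC-layer shape decoder (256 -> 128 -> 200 -> 8775 = 2925 x 3)."""
    def __init__(self, pca_components, pca_mean):
        # pca_components: (200, 8775) array; pca_mean: (8775,) array.
        super().__init__()
        self.fc1 = nn.Linear(256, 128)
        self.fc2 = nn.Linear(128, 200)
        self.fc3 = nn.Linear(200, 8775)
        with torch.no_grad():
            # Initialize the last layer with the pre-computed PCA basis/mean.
            self.fc3.weight.copy_(torch.as_tensor(pca_components).T.float())
            self.fc3.bias.copy_(torch.as_tensor(pca_mean).float())

    def forward(self, identity_feat):            # (B, 256) LightCNN-29V2 embedding
        x = torch.relu(self.fc1(identity_feat))
        x = torch.relu(self.fc2(x))
        return self.fc3(x).view(-1, 2925, 3)     # concatenated vertex coordinates
```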

3.4. Non-rigid Deformation

The 3D vertex coordinates generated by the shape regression network are not directly applicable to texture projection because facial images usually contain unknown factors such as camera intrinsics, head pose and facial expression. Meanwhile, since shape regression predicts the overall facial shape, local parts such as the eyes, nose and mouth are not accurately reconstructed; yet they are equally important to perceived quality when comparing against the original face image. We propose to utilize facial landmarks detected in a coarse-to-fine fashion and formulate non-rigid deformation as an optimization problem that jointly optimizes over camera intrinsics, camera extrinsics, facial expression and a per-vertex corrective field.

Problem Formulation. To handle facial expressions, we transfer the expression blendshape model of FaceWarehouse [8] to the same head topology with an artist's assistance, yielding blendshapes {B_1, B_2, ..., B_M}. In addition, we introduce a per-vertex corrective field δP to cover non-rigid deformation outside the blendshape space. Finally, a 3D face is reconstructed as

P_F = P + Σ_{i=1..M} β_i B_i + δP.

The camera extrinsics T transform the face from its canonical reference coordinate system to the camera coordinate system; they consist of a 3-DoF translation vector t and a 3-DoF quaternion representation q for rotation. The camera intrinsic matrix K projects the 3D model onto the image plane. During the optimization, we have found that using a scale factor f_s to update the intrinsic matrix as

K = ( f_s·f   0       c_x )
    ( 0       f_s·f   c_y )
    ( 0       0       1   )

leads to the best numerical stability. Here [f, c_x, c_y] are all initialized from the size of the input image as c_x = im_h / 2, c_y = im_w / 2, and f = max(im_h, im_w). Putting things together, the overall parameter vector is p = [β, δP, t, q, f_s].
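The two formulas above translate directly into code; the following NumPy sketch only restates them and is not the authors' implementation.

```python
import numpy as np

def reconstruct_face(P, B, beta, delta_P):
    """P_F = P + sum_i beta_i * B_i + delta_P (see the formula above).
    P, delta_P: (V, 3); B: (M, V, 3) expression blendshapes; beta: (M,)."""
    return P + np.tensordot(beta, B, axes=1) + delta_P

def intrinsic_matrix(f, cx, cy, fs):
    """Intrinsic matrix K with the focal length rescaled by the optimized
    scale factor f_s, as in the matrix above."""
    return np.array([[fs * f, 0.0,    cx],
                     [0.0,    fs * f, cy],
                     [0.0,    0.0,    1.0]])
```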

Landmark Term. We employ a global-to-local method for facial landmark localization. For global inference, we first detect the standard 68 facial landmarks and use this initial estimate to crop local areas including the eyes, nose and mouth, i.e., a total of four cropped images. We then perform fine-scale local inference on the cropped images (please see the supplementary material for more details). The landmark localization approach produces a set of facial landmarks L, where L_i = [L_i^x, L_i^y]. We propose to minimize the distance between the predicted landmarks on the 3D model and the detected landmarks,

E_l = (100 / W_eye) Σ_{i=1..K} ‖ K(T(SM(P_F, m_i), T), K) − L_i ‖²,    (1)

where SM(P, m_i) samples a 3D vertex from P given the production-ready, sparse triangulation M and barycentric coordinates m_i, K(·, ·) and T(·, ·) are the perspective projection and rigid transformation operators respectively, W_eye is the distance between the two outermost eye landmarks, and the factor 100 / W_eye normalizes the eye distance to 100. We pre-select m on M and follow the sliding scheme of [7] to update the barycentric coordinates of the 17 facial contour landmarks at each iteration.
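A NumPy sketch of Eq. (1) is given below, with the extrinsics passed as a rotation matrix R and translation t derived from (q, t); the barycentric sampling and projection follow the operator definitions above.

```python
import numpy as np

def landmark_loss(P_F, tris, bary, R, t, K, landmarks, w_eye):
    """Sketch of Eq. (1).  tris: (N, 3) vertex indices of the triangle
    carrying each landmark; bary: (N, 3) barycentric coordinates m_i;
    R, t: rotation/translation of the extrinsics T; K: (3, 3) intrinsics;
    landmarks: (N, 2) detected 2D landmarks L_i."""
    pts = np.einsum('nj,njd->nd', bary, P_F[tris])   # SM(P_F, m_i)
    cam = pts @ R.T + t                              # rigid transform T(., T)
    proj = cam @ K.T                                 # perspective projection
    uv = proj[:, :2] / proj[:, 2:3]                  # K(., K)
    return (100.0 / w_eye) * np.sum((uv - landmarks) ** 2)
```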

Corrective Field Regularization. To enforce a smooth and small per-vertex corrective field, we combine the following two losses,

E_c = ‖ L(P_F, M) − L(P + Σ_{i=1..M} β_i^{t−1} B_i, M) ‖² + λ_δ ‖δP‖².    (2)

The first loss regularizes toward a smooth deformation by maintaining the Laplacian operator L on the deformed mesh (please refer to [36] for more details). Here β_i^{t−1} denotes the facial expression blendshape weights estimated at the previous iteration and is kept fixed. The second loss enforces a small corrective field, and λ_δ balances the two terms.
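The following sketch illustrates Eq. (2) using a uniform graph Laplacian as a simple stand-in for the mesh Laplacian of [36]; the paper does not spell out the weighting, so treat it as an assumption.

```python
import numpy as np
import scipy.sparse as sp

def uniform_laplacian(num_verts, edges):
    """Uniform graph Laplacian L = D - A built from the mesh edges, used as
    a simple stand-in for the Laplacian operator of [36].
    edges: (E, 2) int array of undirected edges."""
    i, j = edges[:, 0], edges[:, 1]
    data = np.ones(2 * len(edges))
    A = sp.coo_matrix((data, (np.r_[i, j], np.r_[j, i])),
                      shape=(num_verts, num_verts)).tocsr()
    D = sp.diags(np.asarray(A.sum(axis=1)).ravel())
    return D - A

def corrective_regularizer(L, P_F, P_expr, delta_P, lam_delta=4.0):
    """E_c of Eq. (2): keep the Laplacian coordinates of the corrected mesh
    P_F close to those of the expression-only mesh P_expr = P + sum(beta*B),
    and penalize the magnitude of the corrective field delta_P."""
    smooth = np.sum((L @ P_F - L @ P_expr) ** 2)
    return smooth + lam_delta * np.sum(delta_P ** 2)
```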

Other Regularization Terms. We further regularize the facial expression, the focal length scale factor, and the rotation component of the camera extrinsics as follows,

E_r = Σ_{i=1..M} β_i² / σ_i² + λ_f log²(f_s) + λ_q ‖q‖²,    (3)

where σ is the vector of eigenvalues of the facial expression covariance matrix obtained via PCA, and λ_f and λ_q are regularization parameters.

Summary. Our total loss function is given by

E = E_l + ω_c E_c + ω_r E_r,    (4)

where ω_c and ω_r balance the relative importance of the three terms. E is optimized with a Gauss-Newton approach over the parameters p for a total of N iterations. For the initial parameter vector p^0, β^0 and δP^0 are initialized as all-zero vectors, t^0 and q^0 are estimated with the EPnP approach [26], and f_s^0 is initialized to 1.
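For the initialization of t^0 and q^0, OpenCV's EPnP solver can be used as sketched below; using the detected landmarks and their corresponding mesh points as the 2D-3D correspondences is our assumption.

```python
import cv2
import numpy as np

def init_pose_epnp(model_points, image_points, K):
    """Estimate an initial head pose (rotation as a Rodrigues vector, plus
    translation) from 2D-3D landmark correspondences with EPnP [26].
    model_points: (N, 3) landmark positions on the mesh; image_points: (N, 2)
    detected 2D landmarks; K: (3, 3) intrinsic matrix."""
    ok, rvec, tvec = cv2.solvePnP(
        model_points.astype(np.float64),
        image_points.astype(np.float64),
        K.astype(np.float64),
        None,                        # no lens distortion assumed
        flags=cv2.SOLVEPNP_EPNP)
    return ok, rvec, tvec            # rvec can be converted to the quaternion q
```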

3.5. Texture Processing

Upon non-rigid deformation, we project the selfie texture to the UV space of the generic model using the estimated camera intrinsics, head pose, facial expression and per-vertex correction. While usually only the frontal area of a selfie is visible, we recover textures for the other areas, e.g., the back of the head and the neck, by using the UV texture of the one of the 482 subjects closest to the query subject. We define closeness as the L1 distance between LightCNN-29V2 embeddings, i.e., through face recognition. Finally, given a foreground projected texture and a background default texture, we blend them using Poisson image editing [30].
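OpenCV's seamlessClone implements Poisson blending and can serve as a stand-in for [30]; the mask handling and seam-center choice below are assumptions of this sketch.

```python
import cv2

def blend_uv_textures(projected_uv, default_uv, valid_mask):
    """Blend the selfie texture projected into UV space (foreground) with
    the default full-head texture of the closest database subject
    (background).  projected_uv, default_uv: HxWx3 uint8 images;
    valid_mask: HxW uint8 mask of the visible (projected) region."""
    x, y, w, h = cv2.boundingRect(valid_mask)
    center = (x + w // 2, y + h // 2)     # seam center inside the masked area
    return cv2.seamlessClone(projected_uv, default_uv, valid_mask,
                             center, cv2.NORMAL_CLONE)
```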

4. Experimental Results

4.1. Implementation Details

For shape regression, we use the Adam optimizer with a learning rate of 0.0001 and momentum parameters β1 = 0.5, β2 = 0.999 for 500 epochs. We train on a total of 10,000 synthetically rendered facial images with a batch size of 64. For non-rigid deformation, we use a total of N = 5 iterations. When minimizing Eq. (4), we use ωc = 25 and ωr = 10. In Eq. (2), we set λδ = 4, and in Eq. (3) we set λf = 5 and λq = 5.

4.2. Database and Evaluation Setup

Stirling/ESRC 3D Face Database. The ESRC database [12] is the latest public 3D face database, captured with a Di3D camera system. The database also provides several images captured from different viewpoints under various lighting conditions. We select those subjects who have both a 3D scan and a frontal neutral face image for evaluation, giving a total of 129 subjects (62 male and 67 female) for testing. Note that in this dataset, around 95% of the subjects are Caucasian.

JNU-Validation Database. The JNU-Validation Database is a part of the JNU 3D Face Database collected by Jiangnan University [25]. It has 161 2D images of 10 Asian subjects and their 3D face scans captured by 3dMD. Since this validation set was not used during training, we consider it a test database for Asians. The number of 2D images per subject ranges from 3 to 26. To minimize the impact of imbalanced data, we select three frontal images of each subject for quantitative comparison.

Our Test Data. Since there is no public database available for testing that covers all genders and ethnicities, we randomly pick five subjects from each of the six groups in Table 1, forming an evaluation database of 30 subjects in total. The other 482 scans are used for data augmentation and for the training/validation stages of both geometry and texture. Each subject has two testing images: a selfie captured with a Samsung Galaxy S7 and an image captured with a Sony a7R DSLR camera by a photographer.


Evaluation Setup. We compare our method with several state-of-the-art methods including 3DMM-CNN [42], Extreme 3D Face (E3D) [43], PRNet [11], RingNet [32], and GanFit [14]. The reconstructed model details of each method are shown in Table 2. Note that for our method and RingNet, the eyes, teeth, tongue and their model holders are removed before comparison. Because the evaluation metric uses the point-to-plane error, unrelated geometry would increase the overall error. Although removing those parts also slightly increases the error (e.g., there is no data in the eye areas to compare), the introduced error is much smaller than the error of directly using the original models.

4.3. Quantitative Comparison

Evaluation Metric: To align the reconstructed model with the ground truth, we follow the protocol of [42, 16, 14] and the challenge [12]. Since the topology of each method is fixed, seven pre-selected vertex indices are first used to roughly align the reconstructed model to the ground truth, and the alignment is then refined by the iterative closest point (ICP) algorithm [3]. The position of the nose-tip vertex v_t is chosen as the center of the ground-truth and reconstructed models. Given a threshold of d mm, we discard those vertices v_i where ||v_i − v_t|| > d. To evaluate the reconstructed model against the ground truth, we use the Average Root Mean Square Error (ARMSE), as suggested by the 2nd 3DFAW Challenge, which computes the closest point-to-mesh distance between the ground truth and the predicted model and vice versa.
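The evaluation protocol can be approximated as follows; note that the official metric uses point-to-mesh (point-to-plane) distances, whereas this sketch uses nearest-neighbour point distances as a simplification.

```python
import numpy as np
from scipy.spatial import cKDTree

def armse(pred_verts, gt_verts, center, d):
    """Approximate ARMSE: crop both vertex sets to a radius of d mm around
    the chosen center (nose tip, or the mean of the 7 landmarks for ESRC),
    then average the symmetric nearest-neighbour RMSE."""
    pred = pred_verts[np.linalg.norm(pred_verts - center, axis=1) <= d]
    gt = gt_verts[np.linalg.norm(gt_verts - center, axis=1) <= d]
    d_pg, _ = cKDTree(gt).query(pred)      # predicted -> ground truth
    d_gp, _ = cKDTree(pred).query(gt)      # ground truth -> predicted
    rmse_pg = np.sqrt(np.mean(d_pg ** 2))
    rmse_gp = np.sqrt(np.mean(d_gp ** 2))
    return 0.5 * (rmse_pg + rmse_gp)
```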

ESRC and JNU-Validation Datasets: In Figure 3, we choose d = [80, 90, 100, 110] and compute the ARMSE between each reconstructed model and the ground truth. Note that the annotation provided by the ESRC database only includes the seven landmarks used for alignment; thus, instead of the nose tip, we use the average of the 7 landmarks as the face center. On ESRC, our result is better than the other methods when d > 95, and our performance is more resilient as d increases. This indicates that our method can better replicate the shape of the entire head than the other methods. On the JNU-Validation database, since the other methods are trained from a Caucasian-dominated 3DMM model while other ethnicities are also considered during our augmentation stage, we achieve a much smaller reconstruction error at every value of d.

Our Test Dataset: In Figure 4 (a), the center of each error bar is the average ARMSE over the 60 reconstructed meshes. The range of each error bar is ±1.96 × SE, where SE is the standard error. Our reconstructed models are slightly better than GanFit's and significantly better than the other methods'. It is worth mentioning that our vertex count is only ∼70% of RingNet's and less than 6% of the other methods'.

3 https://codalab.lri.fr/competitions/572#learn_the_details-evaluation

4 https://3dfaw.github.io/

Figure 3: The quantitative results of our method compared to 3DMM-CNN [42], E3D [43], PRNet [11] and RingNet [32] on (a) the ESRC and (b) the JNU-Validation databases.

In Figure 4 (b), the cropped meshes of the ground truth and each method are shown under different thresholds d. To utilize the reconstructed models in real-world applications, we believe that d = 110 is the best value because it captures the entire head instead of only the frontal face. We further investigate the performance across ethnicities, and the results are shown in Figure 4 (c). Our method can correctly replicate the model to under 2.5 mm of error for all ethnicities, while other methods such as RingNet and PRNet are very sensitive to ethnicity differences. Although GanFit performs slightly better than our method on the White and Black groups, its overall performance is not as good as ours because it is not able to recover Asian geometry well. It is worth noting that we use 10,000 synthetic images augmented from fewer than 500 scans, which is only 5% of the data used by GanFit. To fairly visualize the error across methods without the effect of different topologies, we find the closest point-to-plane distance from the ground truth to each reconstructed model and generate the heat-map for each method in Figure 5.

4.4. Ablation Study

To demonstrate the effectiveness of the individual modules in the proposed approach, we modify one variable at a time and compare with the following alternatives:

• No Augmentation (No-Aug): Without any augmentation, we simply repetitively sample 10,000 faces from the 482 subjects.

• Categorized-PCA Sampling (C-PCA): Instead of DR-feature-based sampling, we use a PCA-based sampling method. We train a shape PCA model from the 482 subjects, and for each group in Table 1, a Gaussian random vector x ∼ N(µ_i, Σ_i²) is used to create the weights of the principal shape components, where µ_i and Σ_i² are the mean vector and covariance matrix of the coefficients in that group. We sample 10,000 faces with this augmentation approach.

• Game Engine Rendering (Unity): Instead of the high-quality photo-realistic renderer, we use Unity, a standard game rendering engine, to synthesize the facial images. The quality of the rendered images is comparatively lower than that of V-Ray. We keep the DR-feature-based augmentation approach and render exactly the same 10,000 synthetic faces mentioned in Section 3.2.


          | Ours        | RingNet [32] | GanFit [14] | PRNet [11] | E3D [43] | 3DMM-CNN [42]
Full Head | Yes         | Yes          | No          | No         | No       | No
Vertices  | 2.9K (2.7K) | 5.0K (3.8K)  | 53.2K       | 43.7K      | ∼155K    | 47.0K
Faces     | 5.8K (5.3K) | 10.0K (7.4K) | 105.8K      | 86.9K      | ∼150K    | 93.3K

Table 2: The geometric complexity of our method and the other methods. Note that, except for E3D, each method uses the same topology for all of its reconstructed models. The numbers in parentheses for our method and RingNet are the head model statistics after unrelated mesh removal.

Figure 4: The quantitative results of our method compared to E3D [43], 3DMM-CNN [42], PRNet [11], RingNet [32] and GanFit [14]. (a) The overall performance of each method. (b) The qualitative comparison of cropped meshes with the ground truth at d = [80, 95, 110]. (c) The evaluation results on different ethnicities.

In Fig. 6, our proposed approach outperforms all the alternatives. As expected, without data augmentation (No-Aug), the reconstruction error is the worst among all variants. The difference between C-PCA and our method shows that DR-based sampling augmentation creates more natural synthetic faces for training. The comparison between Unity and our method shows that the quality of the rendered images plays an important role in bridging the gap between real and synthetic images.

4.4.1 Qualitative Comparison

Figure 7 shows our shape estimation method on frontal face images side by side with the state of the art on the MoFA test database. We pick the same images shown in GanFit [14]. Our method creates accurate face geometry while also capturing discriminative features that allow the identity of each face to be easily distinguished from the others. Meanwhile, as shown in Table 2, our result maintains a low geometric complexity. This allows our avatars to be production-ready even in demanding cases such as mobile platforms. In Figure 8, we choose a few celebrities to verify the geometric accuracy of our method compared to others. In Figure 9, we demonstrate our final results with the blended diffuse maps of Section 3.5.


Figure 5: The heat-map visualization of the reconstructed models at d = 110. Vertices colored red fall above the 5 mm error tolerance, while blue vertices lie within the tolerance.

Figure 6: The quantitative results of No-Aug, C-PCA, Unity and our method. The proposed method achieves the best performance in all cases.

Figure 7: The qualitative comparison of our method with PRNet [11], MoFA [39], RingNet [32], GanFit [14] and E3D [43]. Our method accurately reconstructs the geometry while maintaining a much lower vertex count, which is more suitable for production.


5. Conclusions and Future Work

In this paper, we demonstrated a supervised learning approach for estimating a high-quality 3D face shape with a photo-realistic, high-resolution diffuse map. To facilitate facial image synthesis, we have collected and processed a prioritized 3D face database, from which we can sample augmented 3D face shapes with UV textures to render a large collection of photo-realistic facial images.

Figure 8: Our reconstruction results on several celebrities compared to RingNet [32], PRNet [11] and E3D [43].

Figure 9: Our final results with blended diffuse maps.

Unlike previous approaches, our method leverages the discriminative power of an off-the-shelf face recognition neural network trained on millions of facial images, and trains the shape regressor on synthesized photo-realistic facial images.

We have demonstrated that the proficiency of features learned for accurate face recognition transfers to fully reconstructing facial geometry from a single selfie. While training on synthetically generated facial imagery, we observe strong generalization when testing on real-world images. This opens up opportunities in many interesting applications including VR/AR, teleconferencing, virtual try-on, computer games, special effects, and so on.


Supplemental Material

Section 3.2. Scan Pre-processing

As shown in Fig. 10, we process the raw textured 3D facial scan data to generate our 3D face representation, which consists of a shape model with a low polygon count and a high-resolution diffuse map for preserving details.

Section 3.2. Deformation Representation

Here we give a detailed formulation of the Deformation Representation (DR) feature. The DR feature D encodes the local deformation around each vertex of P, with respect to a reference mesh P^R, as an R^9 vector. We use the mean face of all 482 processed facial models as the reference mesh.

Figure 10: Top row: left, the raw facial scan with dense topology; right, the model with UV texture. Bottom row: left, the processed face model with sparse topology; right, the model with UV texture.

Encoding D from P. We denote the i-th vertices of P and P^R as p_i and p_i^R, respectively. The deformation gradient in the neighborhood N_i of the i-th vertex, from the reference model to the deformed model, is defined by the affine transformation matrix T_i that minimizes the energy

E(T_i) = Σ_{j∈N_i} c_ij ‖ (p_i − p_j) − T_i (p_i^R − p_j^R) ‖²,    (5)

where c_ij is the cotangent weight, computed on the reference model, that handles irregular tessellation. With polar decomposition, T_i is decomposed into a rotation component R_i and a scaling/shear component S_i such that T_i = R_i S_i. The rotation matrix can be represented by a rotation-axis/rotation-angle pair (ω_i, θ_i), which we further convert to the matrix logarithm representation

log R_i = θ_i ( 0        −ω_i,z    ω_i,y )
              ( ω_i,z     0       −ω_i,x )
              ( −ω_i,y    ω_i,x    0     ).    (6)

Finally, the DR feature for p_i is represented by d_i = {log R_i; S_i − I}, where I is the identity matrix. Since ‖ω_i‖ = 1 and S_i is symmetric, d_i has 9 degrees of freedom.
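Given a per-vertex deformation gradient T_i, the conversion to the 9-DoF feature d_i can be sketched with SciPy as below; the ordering of the 9 entries in the returned vector is our convention, not the paper's.

```python
import numpy as np
from scipy.linalg import polar
from scipy.spatial.transform import Rotation

def dr_feature_from_affine(T_i):
    """Convert a per-vertex deformation gradient T_i (3x3) into the 9-DoF DR
    feature d_i = {log R_i; S_i - I}: 3 parameters for the skew-symmetric
    log R_i (stored as theta_i * omega_i) and 6 for the symmetric S_i - I."""
    R_i, S_i = polar(T_i)                              # T_i = R_i S_i
    rotvec = Rotation.from_matrix(R_i).as_rotvec()     # theta_i * omega_i
    sym = S_i - np.eye(3)
    return np.concatenate([rotvec,
                           [sym[0, 0], sym[0, 1], sym[0, 2],
                            sym[1, 1], sym[1, 2], sym[2, 2]]])
```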

Recovering P from D. Given the DR feature D and the reference mesh P^R, we first recover the affine transformation T_i for each vertex. We then recover the optimal P that minimizes

E(P) = Σ_{p_i∈P} Σ_{j∈N_i} c_ij ‖ (p_i − p_j) − T_i (p_i^R − p_j^R) ‖².    (7)

For each p_i, we solve ∂E(P)/∂p_i = 0, which gives

2 Σ_{j∈N_i} c_ij (p_i − p_j) = Σ_{j∈N_i} T_i (p_i^R − p_j^R).    (8)

The resulting equations for all p_i ∈ P form a linear system, which can be written as AP = b. By specifying the position of one vertex, we obtain the unique solution of this system and fully recover P.
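A sketch of assembling and solving the linear system of Eq. (8) is given below; the right-hand side follows Eq. (8) as printed, and pinning one vertex to its reference position to fix the translation ambiguity is our choice.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def recover_mesh(T, P_ref, neighbors, c, anchor=0):
    """Assemble and solve the linear system A P = b implied by Eq. (8).
    T: (V, 3, 3) per-vertex affine transforms recovered from D;
    P_ref: (V, 3) reference mesh; neighbors[i] / c[i]: neighbor indices and
    cotangent weights of vertex i.  Vertex `anchor` is pinned to its
    reference position (our choice) to make the system well determined."""
    V = len(P_ref)
    rows, cols, vals = [], [], []
    b = np.zeros((V, 3))
    for i in range(V):
        for j, cij in zip(neighbors[i], c[i]):
            rows += [i, i]
            cols += [i, j]
            vals += [2.0 * cij, -2.0 * cij]          # left-hand side of Eq. (8)
            b[i] += T[i] @ (P_ref[i] - P_ref[j])     # right-hand side as printed
    A = sp.coo_matrix((vals, (rows, cols)), shape=(V, V)).tolil()
    A[anchor, :] = 0.0
    A[anchor, anchor] = 1.0                          # pin one vertex
    b[anchor] = P_ref[anchor]
    A = A.tocsr()
    return np.column_stack([spsolve(A, b[:, k]) for k in range(3)])
```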

Section 3.4. Landmark Localization

To achieve higher landmark localization accuracy, we have developed a coarse-to-fine approach. First, we predict all facial landmarks from the detected facial bounding box. Then, given the initial landmarks, we crop the eye, nose and mouth areas for the second-stage fine-scale landmark localization. Fig. 11 shows our landmark mark-up as well as the bounding boxes used for the fine-scale localization stage. We use a regression-forest-based approach [24] as the base landmark predictor, and we train 4 landmark predictors in total, i.e., for the overall face, eyes, nose and mouth.

Section 4.4. Different Rendering Quality

In this section, we first illustrate the 30 manually created lighting conditions used for high-quality V-Ray rendering, as shown in Fig. 12. We then provide several synthetic face images rendered with V-Ray and Unity in Fig. 13. Note that for both rendering methods, we randomized the head pose, environment map, lighting condition, and field of view (FOV) to mimic real-world selfies. We do not render eye models; as a result, we mask out the eye areas with the detected facial landmarks at test time, as mentioned in Section 3.2.


Figure 11: Our landmark mark-up consists of 104 points, i.e., face contour (1-17), eyebrows (18-27), left eye (28-47), right eye (48-67), nose (68-84) and mouth (85-104). (a) Coarse detection of all landmarks and corresponding bounding boxes for fine-scale detection. (b) Separate fine-scale detection results for the local areas.

Section 4.5. More Qualitative Results

In this section, we provide more comparison results that could not be included in the paper due to page limits. For GanFit [14], we requested the reconstructed results on our test data from the authors; thus, we are only able to show the qualitative comparison with GanFit on our test database. For the images/selfies from the other databases, we compare our results with the methods whose code is available online, including RingNet [32], PRNet [11], Extreme3D [43] and 3DMM-CNN [42].

More Qualitative Results on Our Data

In Fig. 14, we provide qualitative results for each category. The first and second columns are the input image and the ground truth. Instead of showing the cropped meshes, we show the whole models for each method in Fig. 14. It is worth noting that our reconstructed full head model is ready to be deployed in different applications.

Qualitative Results on ESRC and JNU-Validation

Due to page limits, we were not able to show qualitative results for the ESRC and JNU-Validation datasets in the paper. As shown in Fig. 15, we observe the same behavior claimed in the paper: the proposed method correctly replicates the 3D models from single selfies with a much lower polygon count.

More Qualitative Results on MoFA. In Fig. 16, we requested the results from MoFA [39] for side-by-side comparisons. Although the quality of the reconstructed models is not as good as the results on the other databases, due to the image resolution, large head pose variations, and occlusions such as hair and glasses, our models are still considerably better than those of the other methods.

More Celebrity-in-the-Wild Results. In Figs. 17-20, we present results on several celebrities and compare our method not only in geometry but also in appearance. Note that by projecting the selfie onto a high-resolution UV texture, our reconstructed models have a photo-realistic appearance, whereas 3DMM-CNN [42] and PRNet [11] use per-vertex color, which limits the texture quality.

Application: Audio-driven Avatar Animation

Our automatically generated head model is ready for different applications. Here we demonstrate automatic lip syncing driven by a raw waveform audio input, as shown in Fig. 21. For data collection and the deep neural network structure, we adopt a pipeline similar to that of [23] to drive the reconstructed model. All the animation blendshapes are transferred to our generic topology. Please refer to our video for more details.

References

[1] Kairat Aitpayev and Jaafar Gaber. Creation of 3d human avatar using kinect. Asian Transactions on Fundamentals of Electronics, Communication & Multimedia, 1(5):12–24, 2012. 1

[2] Oswald Aldrian and William AP Smith. Inverse rendering of faces with a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(5):1080–1093, 2012. 1, 2

[3] Brian Amberg, Sami Romdhani, and Thomas Vetter. Optimal step nonrigid icp algorithms for surface registration. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–8. IEEE, 2007. 6

[4] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '99, pages 187–194, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co. 1, 2

[5] T. Bolkart and S. Wuhrer. A groupwise multilinear correspondence optimization for 3d faces. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 3604–3612, Dec 2015. 2

[6] J. Booth, A. Roussos, S. Zafeiriou, A. Ponniahy, and D. Dunaway. A 3d morphable model learnt from 10,000 faces. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5543–5552, June 2016. 2

[7] Chen Cao, Qiming Hou, and Kun Zhou. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Transactions on Graphics (TOG), 33(4):43, 2014. 5

[8] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, March 2014. 2, 3, 4

[9] Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin, and Mark Sagar. Acquiring the reflectance field of a human face. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 145–156. ACM Press/Addison-Wesley Publishing Co., 2000. 2


Figure 12: Different lighting conditions for photo-realistic rendering augmentation

Figure 13: Synthetic facial images rendered with (a) Maya V-Ray and (b) Unity.

[10] Pengfei Dou, Shishir K Shah, and Ioannis A Kakadiaris. End-to-end 3d face reconstruction with deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5908–5917, 2017. 2

[11] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, 2018. 1, 2, 6, 7, 8, 10, 12, 13, 14, 15, 16, 17, 18

[12] Z. Feng, P. Huber, J. Kittler, P. Hancock, X. Wu, Q. Zhao, P. Koppen, and M. Raetsch. Evaluation of dense 3d reconstruction from 2d face images in the wild. In 2018 13th IEEE International Conference on Automatic Face Gesture Recognition (FG 2018), pages 780–786, May 2018. 5, 6


Figure 14: Qualitative results on our test dataset. From left to right: input image, ground truth, our method, GanFit [14], RingNet [32], E3D [43], 3DMM-CNN [42], and PRNet [11].


[13] Lin Gao, Yu-Kun Lai, Jie Yang, Zhang Ling-Xiao, Shihong Xia, and Leif Kobbelt. Sparse data driven mesh deformation. IEEE Transactions on Visualization and Computer Graphics, 2019. 3

[14] Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1155–1164, 2019. 2, 3, 6, 7, 8, 10, 12

[15] Zhenglin Geng, Chen Cao, and Sergey Tulyakov. 3d guided fine-grained face manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9821–9830, 2019. 2

[16] Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and William T Freeman. Unsupervised training for 3d morphable model regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8377–8386, 2018. 2, 3, 4, 6


Figure 15: Qualitative results on the ESRC and JNU-Validation datasets: (a) ESRC male, (b) ESRC female, (c) JNU-Validation. From left to right: input image, ground truth, our method, 3DMM-CNN [42], E3D [43], PRNet [11], RingNet [32].


Figure 16: Qualitative results on the MoFA dataset: (a) MoFA male, (b) MoFA female. From left to right: input image, our method, RingNet [32], PRNet [11], 3DMM-CNN [42], and MoFA [39].


Figure 17: Qualitative results of our method compared to RingNet [32], PRNet [11], E3D [43], and 3DMM-CNN [42].


Figure 18: Qualitative results of our method compared to RingNet [32], PRNet [11], E3D [43], and 3DMM-CNN [42].


Figure 19: Qualitative results of our method compared to RingNet [32], PRNet [11], E3D [43], and 3DMM-CNN [42].


Figure 20: Qualitative results of our method compared to RingNet [32], PRNet [11], E3D [43], and 3DMM-CNN [42].


[17] Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. Avatar digitization from a single image for real-time rendering. ACM Transactions on Graphics (TOG), 36(6):195, 2017. 2

[18] Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. Avatar digitization from a single image for real-time rendering, Nov 2017. 2

[19] Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. Dynamic 3d avatar creation from hand-held video input. ACM Transactions on Graphics (TOG), 34(4):45, 2015. 1

[20] Aaron S Jackson, Adrian Bulat, Vasileios Argyriou, and Georgios Tzimiropoulos. Large pose 3d face reconstruction from a single image via direct volumetric cnn regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 1031–1039, 2017. 2

[21] Zi-Hang Jiang, Qianyi Wu, Keyu Chen, and Juyong Zhang. Disentangled representation learning for 3d face shape. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 3

[22] Amin Jourabloo and Xiaoming Liu. Large-pose face alignment via cnn-based dense 3d model fitting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4188–4196, 2016. 2


Figure 21: Audio-driven lip syncing on our production-ready head model.


[23] Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graph., 36(4):94:1–94:12, July 2017. 10

[24] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, 2014. 9

[25] Paul Koppen, Zhen-Hua Feng, Josef Kittler, Muhammad Awais, William Christmas, Xiao-Jun Wu, and He-Feng Yin. Gaussian mixture 3d morphable face model. Pattern Recogn., 74(C):617–628, Feb 2018. 5

[26] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o(n) solution to the pnp problem. International Journal of Computer Vision, 81(2):155, 2009. 5

[27] Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6), 2017. 2

[28] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In Mark W. Jones, Xianghua Xie, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 41.1–41.12. BMVA Press, September 2015. 2, 3

[29] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3d face model for pose and illumination invariant face recognition. In 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 296–301, Sep. 2009. 3

[30] Patrick Perez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Transactions on Graphics (TOG), 22(3):313–318, 2003. 5

[31] Sami Romdhani and Thomas Vetter. Estimating 3d shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 986–993. IEEE, 2005. 1, 2

[32] Soubhik Sanyal, Timo Bolkart, Haiwen Feng, and Michael J. Black. Learning to regress 3d face shape and expression from an image without 3d supervision, 2019. 2, 3, 6, 7, 8, 10, 12, 13, 14, 15, 16, 17, 18

[33] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015. 3

[34] Matan Sela, Elad Richardson, and Ron Kimmel. Unrestricted facial geometry reconstruction using image-to-image translation. arXiv, 2017. 2

[35] Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David W Jacobs. Sfsnet: Learning shape, reflectance and illuminance of faces 'in the wild'. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6296–6305, 2018. 2

[36] Olga Sorkine, Daniel Cohen-Or, Yaron Lipman, Marc Alexa, Christian Rossl, and H.-P. Seidel. Laplacian surface editing. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, pages 175–184. ACM, 2004. 5

[37] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014. 2, 3

[38] Ayush Tewari, Michael Zollhofer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Perez, and Christian Theobalt. Self-supervised multi-level face model learning for monocular reconstruction at over 250 hz. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2549–2559, 2018. 3

[39] Ayush Tewari, Michael Zollhofer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Perez, and Christian Theobalt. Mofa: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. 2017 IEEE International Conference on Computer Vision (ICCV), pages 3735–3744, 2017. 2, 8, 10, 14


[40] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2395, 2016. 2

[41] Luan Tran and Xiaoming Liu. Nonlinear 3d face morphable model. In IEEE Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, June 2018. 2

[42] Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gerard Medioni. Regressing robust and discriminative 3d morphable models with a very deep neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5163–5172, 2017. 2, 6, 7, 10, 12, 13, 14, 15, 16, 17, 18

[43] Anh Tuan Tran, Tal Hassner, Iacopo Masi, Eran Paz, Yuval Nirkin, and Gerard Medioni. Extreme 3d face reconstruction: Seeing through occlusions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3935–3944, 2018. 1, 2, 6, 7, 8, 10, 12, 13, 15, 16, 17, 18

[44] Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popovic. Face transfer with multilinear models. ACM Trans. Graph., 24(3):426–433, Jul 2005. 2

[45] Lijuan Wang, Wei Han, Frank K Soong, and Qiang Huo. Text driven 3d photo-realistic talking head. In Twelfth Annual Conference of the International Speech Communication Association, 2011. 1

[46] Qianyi Wu, Juyong Zhang, Yu-Kun Lai, Jianmin Zheng, and Jianfei Cai. Alive caricature from 2d to 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7336–7345, 2018. 3

[47] Xiang Wu, Ran He, Zhenan Sun, and Tieniu Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018. 4

[48] Shuco Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Olszewski, Shigeo Morishima, and Hao Li. High-fidelity facial reflectance and geometry inference from an unconstrained image. ACM Transactions on Graphics (TOG), 37(4):162, 2018. 2

[49] Hongwei Yi, Chen Li, Qiong Cao, Xiaoyong Shen, Sheng Li, Guoping Wang, and Yu-Wing Tai. Mmface: A multi-metric regression network for unconstrained face reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7663–7672, 2019. 2

[50] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z Li. Face alignment across large poses: A 3d solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 146–155, 2016. 1, 2