
DeepFace: Face Generation using Deep Learning

Hardie Cate
ccate@stanford.edu

Fahim Dalvi
fdalvi@cs.stanford.edu

Zeshan Hussain
zeshanmh@stanford.edu

Abstract

We use CNNs to build a system that both classifies images of faces based on a variety of different facial attributes and generates new faces given a set of desired facial characteristics. After introducing the problem and providing context in the first section, we discuss recent work related to image generation in Section 2. In Section 3, we describe the methods used to fine-tune our CNN and generate new images using a novel approach inspired by a Gaussian mixture model. In Section 4, we discuss our working dataset and describe our preprocessing steps and handling of facial attributes. Finally, in Sections 5, 6, and 7 we present our experiments, results, and discussion, and we conclude in the following section. Our classification system has 82% test accuracy. Furthermore, our generation pipeline successfully creates well-formed faces.

1. Introduction

Convolutional neural networks (CNNs) are powerful tools for image classification and object detection, but they can also be used to generate images. There are myriad image generation tasks and applications, ranging from generating potential mass lesions in radiology scans to creating landscapes or artistic scenes. In our project, we address the problem of generating faces based on certain desired facial characteristics. Potential facial characteristics fall within the general categories of raw attributes (e.g., big nose, brown hair, etc.), ethnicity (e.g., white, black, Indian), and accessories (e.g., sunglasses, hat, etc.). The problem can be stated as follows: given a set of facial attributes as input, produce an image of a well-formed face as output that contains these characteristics.

In our face generation system, we fine-tune a CNN pre-trained on faces to create a classification system for facial characteristics. We employ a novel technique that models distributions of feature activations within the CNN as a customized Gaussian mixture model. We then perform feature inversion using the relevant features identified by this model. The primary advantage of our implementation is that it does not require any deep learning architectures apart from a CNN, whereas other generative approaches do. Our face generation system has many potential uses, including identifying suspects in law enforcement settings as well as in other, more generic generative settings.

2. Related Work

Work surrounding generative models for deep learning has mostly been in developing graphical models, autoencoder frameworks, and, more recently, generative recurrent neural networks (RNNs). Specific graphical models that have been used to learn generative models of data are Restricted Boltzmann Machines (RBMs), an undirected graphical model with connected stochastic visible and stochastic hidden units, and their generalizations, such as Gaussian RBMs. Srivastava and Salakhutdinov use these basic RBMs to create a Deep Boltzmann Machine (DBM), a multimodal model that learns a probability density over the space of multimodal inputs and can be effectively used for information retrieval and classification tasks [14].


Similar work done by Salakhutdinov and Hinton shows how the learning of a high-capacity DBM with multiple hidden layers and millions of parameters can be made more efficient with a layer-by-layer "pre-training" phase that allows for more reasonable weight initializations by incorporating a bottom-up pass [13]. In his thesis, Salakhutdinov also adds to this learning algorithm by incorporating a top-down feedback pass as well as a bottom-up pass, which allows DBMs to better propagate uncertainty about ambiguous inputs [12].

Other generative approaches involve using autoencoders. The first ideas regarding the probabilistic interpretation of autoencoders were proposed by Ranzato et al.; a more formal interpretation was given by Vincent, who described denoising autoencoders (DAEs) [10] [15]. A DAE takes an input x ∈ [0, 1]^d and first maps it, with an encoder, to a hidden representation y ∈ [0, 1]^{d′} through some mapping y = s(Wx + b), where s is a non-linearity such as a sigmoid. The latent representation y is then mapped back via a decoder into a reconstruction z of the same shape as x, i.e., z = d(W′y + b′). The parameters W, b, b′, and W′ are learned such that the average reconstruction loss between x and z is minimized [9]. Bengio et al. show an alternate form of the DAE: given some observed input X and corrupted input X̃, where X̃ has been corrupted based on a conditional distribution C(X̃|X), we train the DAE to estimate the reverse conditional P(X|X̃) [2]. With this formulation, Vincent et al. construct a deeper network of stacked DAEs to learn useful representations of the inputs [16].
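To make the encoder/decoder mapping concrete, the following minimal NumPy sketch computes a single DAE reconstruction pass and its reconstruction loss. It is not taken from any of the cited works; the dimensions, the masking-noise level, and the use of a sigmoid decoder are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d, d_hidden, n = 64, 32, 16                      # input dim, hidden dim, batch size (illustrative)

X = rng.random((n, d))                           # observed inputs in [0, 1]^d
X_tilde = X * (rng.random((n, d)) > 0.3)         # corrupted inputs (random masking noise)

W = 0.01 * rng.standard_normal((d, d_hidden))
b = np.zeros(d_hidden)
W_prime = 0.01 * rng.standard_normal((d_hidden, d))
b_prime = np.zeros(d)

Y = sigmoid(X_tilde @ W + b)                     # encoder: y = s(Wx + b)
Z = sigmoid(Y @ W_prime + b_prime)               # decoder: z = d(W'y + b'), here also a sigmoid
loss = np.mean(np.sum((X - Z) ** 2, axis=1))     # average reconstruction loss between x and z
```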

An alternate model has been posited by Gregor et al., who propose using a recurrent neural network architecture to generate digits. This architecture is a type of variational autoencoder, a recent advanced model that bridges deep learning and variational inference, since it is comprised of an encoder RNN that compresses the real images during training and a decoder RNN that reconstitutes images after receiving codes [5].

Finally, another approach for estimating generative models is via generative adversarial nets [3] [4]. In this framework, two models are simultaneously trained: a generative model G that captures the distribution of the data and a discriminative model D that estimates the probability that a sample came from the training data rather than from G. G is trained to maximize the probability that D makes a mistake.

3. Methods

Our workflow consists of two steps. Firstly, we fine-tune a pre-trained model to classify facial and image characteristics for an input image. Secondly, we generate faces given some description. We describe our approach for each step in the forthcoming sections.

3.1. Notation

To facilitate the discussion of our methods, we introduce some notation. Let D denote our set of training images, and let F denote the set of 73 facial attributes. From this point forward, we use the word attribute to refer strictly to one of the 73 facial characteristics in our system.

3.2. Fine-tuning

For fine-tuning, we employ the VGG-Face net, a 16-layer CNN that was trained on 2 million celebrity faces and evaluated on faces from the Labeled Faces in the Wild and YouTube Faces datasets [11]. Using the VGG-Face net as our base architecture, we attach 42 heads to the end of its fc-7 layer, each of which consists of a fully connected fc-8 layer and a softmax layer. Each softmax head is a multilabel classifier for a group of attributes that are highly correlated. For example, one group of attributes includes hair color (black, gray, brown, and blond); in general, someone usually has only one color of hair, which makes grouping these particular attributes together as one multilabel classification reasonable. Furthermore, grouping together highly correlated attributes for our classification task, in lieu of having 73 binary classifiers, provides the network with implicit information on the relationship between grouped features.
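The PyTorch-style sketch below illustrates the idea of attaching several classification heads to a shared feature layer. It is only a schematic stand-in: the module name, the placeholder trunk, and the small group sizes are illustrative assumptions, whereas the actual system uses the pre-trained VGG-Face layers, the 4096-dimensional fc-7 features, and 42 heads covering the 73 attributes.

```python
import torch
import torch.nn as nn

class MultiHeadAttributeNet(nn.Module):
    """Shared trunk (a stand-in for VGG-Face up to fc-7) feeding one
    fc-8 + softmax head per group of correlated attributes."""

    def __init__(self, feat_dim=128, group_sizes=(4, 2, 3)):
        super().__init__()
        self.trunk = nn.Sequential(          # placeholder trunk, not the pre-trained CNN
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(8 * 4 * 4, feat_dim),
            nn.ReLU(),
        )
        self.heads = nn.ModuleList([nn.Linear(feat_dim, k) for k in group_sizes])

    def forward(self, x):
        feats = self.trunk(x)
        # one logit vector per attribute group; softmax / cross-entropy is applied per head
        return [head(feats) for head in self.heads]

model = MultiHeadAttributeNet()
logits_per_group = model(torch.randn(2, 3, 224, 224))
targets = [torch.zeros(2, dtype=torch.long) for _ in logits_per_group]   # dummy labels
loss = sum(nn.functional.cross_entropy(l, t) for l, t in zip(logits_per_group, targets))
```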

During training, we freeze certain layers in the CNN and learn the weights and biases on the unfrozen layers.


Figure 1: Modified VGG-Face Architecture

To determine which layers to freeze, we run several experiments with different sets of frozen layers and choose the set with the optimal performance. We describe the experiments used to assess our architecture and the results of the fine-tuning in Section 5.

3.3. Generation

3.3.1 Baseline Approach

We begin the generation phase of our project with a simple baseline approach that uses class visualization. This approach begins with the mean image M, which is the pixel-wise and channel-wise mean of all images in our training set. We add small Gaussian noise to M and use this image X as the input to our baseline algorithm. Suppose we wish to produce a face with a set of attributes C ⊆ F. In each iteration of our algorithm, we do the following (a sketch of one such iteration appears after the list):

1. Perform a forward pass with X as input;

2. Set the gradients at each softmax layer to either 1 or 0, depending on our target attributes C;

3. Perform a backward pass through the network with these gradients;

4. Update X using stochastic gradient descent (SGD) and regularization.
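A minimal sketch of one such iteration is given below. It assumes a hypothetical model that returns a list of per-head score tensors, as in the earlier architecture sketch; the step size, the regularization weight, and the exact form of the update are illustrative, not our tuned settings.

```python
import torch

def class_visualization_step(model, X, target_masks, lr=1.0, l2_reg=1e-4):
    """One baseline iteration: boost the scores of target attributes (mask = 1)
    and ignore the rest (mask = 0)."""
    X = X.clone().detach().requires_grad_(True)
    scores_per_head = model(X)                       # step 1: forward pass
    # step 2: summing masked scores makes d(objective)/d(score) equal 1 for
    # target attributes and 0 for everything else
    objective = sum((scores * mask).sum()
                    for scores, mask in zip(scores_per_head, target_masks))
    objective.backward()                             # step 3: backward pass
    with torch.no_grad():                            # step 4: ascent step with L2 regularization
        X = X + lr * X.grad - l2_reg * X
    return X.detach()
```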

This baseline serves primarily to demonstrate that our CNN correctly represents important facial structures such as eyes, noses, mouths, etc., even after fine-tuning on specific facial attributes. This approach is not comprehensive enough because boosting certain features from the last layer in the network does not limit the number of facial structures that appear in the image. A more complete discussion of the results of the baseline approach can be found in Sections 5 and 6.

Our next and final approach uses a customized variant of a Gaussian Mixture Model (GMM). We opt for this technique over the more traditional generative approaches discussed in Section 2 for two main reasons. Firstly, to our knowledge, this method is novel and has not been used in the context of image generation using CNNs. Secondly, this approach does not require complex structures in addition to our existing CNN, so the model is simpler and has far fewer parameters to train.

3.3.2 Custom Gaussian Mixture Model

This approach behaves similarly to the baseline in that we begin with an input image X that consists of random noise and boost this image toward a set of attributes C ⊆ F. The primary difference is that instead of class visualization, we use feature inversion with respect to a layer in the CNN. We also begin with random noise instead of the mean image.


Intuitively, feature inversion tries to minimize the difference between the activations of an input image and some target activations in a certain layer of the CNN. These target activations are the activations that we believe an input image with the desired attributes C will have when passed through the CNN.

More formally, let φ_l(I) be the activations at layer l when an image I is passed through the CNN. Let T_l denote the target activations for layer l, and suppose we have a way of determining T_l from C. We wish to find an image I* by solving the optimization problem

I^* = \arg\min_I \|T_l - \phi_l(I)\|_2^2 + R(I)    (1)

where \|\cdot\|_2^2 is the squared Euclidean norm and R is a regularizer such as blurring and/or jitter. Given T_l and X, feature inversion does the following (a sketch of one iteration appears after the list):

1. Perform a forward pass with X as input up to layer l;

2. Set the gradients at l by taking the gradient of the objective function above with respect to the activations in layer l;

3. Perform a backward pass from l with these gradients;

4. Update X using SGD and regularization.
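The sketch below shows one such iteration for the objective in Equation 1. It assumes a hypothetical callable phi_l that returns the layer-l activations of an image tensor, and the blur/jitter regularizer R is reduced to a simple L2 penalty for brevity.

```python
import torch

def feature_inversion_step(phi_l, X, T_l, lr=0.5, l2_reg=1e-4):
    """One feature-inversion iteration: move X so that its layer-l activations
    phi_l(X) approach the target activations T_l (Eq. 1)."""
    X = X.clone().detach().requires_grad_(True)
    loss = ((T_l - phi_l(X)) ** 2).sum() + l2_reg * (X ** 2).sum()   # Eq. 1 with an L2 regularizer
    loss.backward()                                                   # gradients flow back to the pixels
    with torch.no_grad():
        X = X - lr * X.grad                                           # SGD step on the image
    return X.detach()
```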

The challenge of this approach is determining an appropriate set of target activations T_l given C. To our knowledge, there is no established way to automatically detect these target activations given solely the desired facial attributes.

To address this problem, we introduce a custom Gaussian Mixture Model (cGMM), in which we model the distribution of T_l for a given attribute f as a multivariate Gaussian distribution. For each f ∈ F, we estimate the mean and covariance matrix of its Gaussian by sampling images in our training set that are positive examples for this attribute. For computational simplicity, we assume the covariance matrix for each distribution is diagonal. This assumption implies that we effectively treat each multivariate Gaussian as a stacking of many independent Gaussians, each of which models the distribution of activations for a single neuron in l.

Specifically, for the attributes f_1, f_2, ..., f_73 ∈ F, we select 73 random sets of images S_1, S_2, ..., S_73 ⊆ D of equal size such that, for each i, every image in S_i has attribute f_i. Then for each i, we compute a mean vector μ_i and covariance matrix Σ_i based on the activations at layer l of the images in S_i. These mean vectors and covariance matrices define 73 independent multivariate Gaussian distributions.
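A small sketch of this per-attribute fit under the diagonal-covariance assumption is given below. The function name, the commented-out acts_for_attribute helper, and the stability floor on the variance are hypothetical additions for illustration.

```python
import numpy as np

def fit_attribute_gaussian(activations):
    """Fit a diagonal Gaussian to the layer-l activations of images that are
    positive for one attribute. activations has shape (num_images, dim)."""
    mu = activations.mean(axis=0)            # mean vector mu_i
    var = activations.var(axis=0) + 1e-8     # diagonal of Sigma_i (small floor for stability)
    return mu, var

# One (mu_i, var_i) pair per attribute, fitted from its sampled image set S_i;
# acts_for_attribute(i) is a hypothetical helper returning the stacked activations.
# gaussians = [fit_attribute_gaussian(acts_for_attribute(i)) for i in range(73)]
```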

Thus, according to our model, if we wish to produce possible activations at l of an image that has attribute f_i and no other f ∈ F, we can simply sample from N(μ_i, Σ_i). More generally, we can estimate the target activations for a set of attributes by performing a weighted sum of samples from the Gaussians, which we assume to be independent. If we wish to estimate the target activations at l for a subset C ⊆ F of attributes, we produce a sample s given by

s = \frac{1}{|C|} \sum_{i=1}^{73} \mathbf{1}_i\, w_i z_i    (2)

where \mathbf{1}_i = 1 if f_i ∈ C and 0 otherwise, w_i is the weight assigned to the i-th Gaussian, and z_i ∼ N(μ_i, Σ_i) is a random variable. This s can be used in place of T_l for feature inversion.
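A sketch of this sampling step follows. It assumes the per-attribute (mean, variance) pairs from the previous sketch, a learned weight vector w, and an index set for C; all of these names are hypothetical.

```python
import numpy as np

def sample_target_activations(gaussians, w, attr_indices, rng=None):
    """Estimate target activations for an attribute set C (Eq. 2):
    s = (1/|C|) * sum_i 1_i * w_i * z_i, with z_i ~ N(mu_i, Sigma_i)."""
    if rng is None:
        rng = np.random.default_rng(0)
    s = np.zeros_like(gaussians[0][0])
    for i in attr_indices:                       # only attributes with f_i in C contribute
        mu_i, var_i = gaussians[i]
        z_i = rng.normal(mu_i, np.sqrt(var_i))   # sample from the diagonal Gaussian
        s += w[i] * z_i
    return s / len(attr_indices)
```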

Now we must determine the weight vector w = (w_1, ..., w_73). In our approach, we learn this weight vector by finding weights that minimize the difference between the target activations produced by the sampling method above and the actual activations for images. We wish to find these weights by solving the optimization problem

w^* = \arg\min_w \frac{1}{|D|} \sum_{i=1}^{|D|} \|\phi_l(I_i) - \Phi_i\|_2^2 + \lambda \|w\|_2^2    (3)

where

\Phi_i = \frac{1}{|C_i|} \sum_{j=1}^{73} \mathbf{1}_{i,j}\, w_j \mu_j    (4)

and \mathbf{1}_{i,j} indicates whether f_j ∈ C_i, the attribute set for image i. We learn the weights by taking derivatives of this objective function with respect to w and iteratively updating w using SGD.
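For a single image i, the data term's gradient with respect to w_j is -(2/|C_i|) 1_{i,j} μ_j · (φ_l(I_i) - Φ_i), plus the gradient of the regularizer. A sketch of one SGD update using this gradient is shown below; the argument names and hyperparameters are illustrative, and the regularizer is applied per step for simplicity.

```python
import numpy as np

def sgd_step_weights(w, phi_i, mus, attr_indices, lr=1e-6, lam=1e-5):
    """One SGD update of the cGMM weights for a single image i (Eqs. 3-4).
    w: current weight vector; phi_i: actual layer-l activations of image i;
    mus: list of per-attribute means mu_j; attr_indices: attributes in C_i."""
    c = len(attr_indices)
    Phi_i = sum(w[j] * mus[j] for j in attr_indices) / c    # Eq. 4
    residual = phi_i - Phi_i
    grad = 2.0 * lam * w                                    # regularizer term (applied per step here)
    for j in attr_indices:
        grad[j] += -2.0 / c * (mus[j] @ residual)           # gradient of the data term
    return w - lr * grad
```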


4. Datasets and Features

Figure 2: Variations in lighting conditions and camera angles

We are currently using the PubFig dataset [8], which is a collection of faces from public photos of 200 individuals. The dataset includes 58K images scraped from various sources on the Internet, and bounding boxes around the face for each of these images. However, to avoid copyright issues, the authors provide links to the images instead of hosting the images themselves. We have roughly 20,000 images in our training set and 8,000 images in our test set. The missing images include those that do not exist at the given links, those that have changed, and a few that had incorrect bounding boxes for the faces.

Since these images are scraped from public photographs, there are several variations in the faces. Most faces are frontal, while some are at slight angles. There is also a wide spectrum of lighting conditions in the dataset.

For each image in the dataset, we also have a list of 73 attributes, including gender, eye color, face shape, and ethnicity. Many of these attributes are interdependent. For example, only one of the attributes among black hair, blond hair, and brown hair is positively labeled for each individual.

5. Experiments & Analysis

5.1. Finetuning Experiments

To understand the data better, we first run some experiments to find out what the distribution of our data looks like. For each image, we are given information about the presence or absence of 73 attributes. In Figure 7, we see that the distribution is quite skewed towards certain attributes. For example, around 76% of the people in the dataset are white, while only 3% are black. The number of people of other ethnicities is even lower, so a large portion of the images do not have any ethnicity labeling. Another prime example of such a disparity is eyewear: around 90% of the people in the dataset have no form of eyewear in the photos. As another example of the incompleteness of our dataset, only 60% of the people have age information associated with their faces. These numbers give us a good idea of the limits of the system we are building and of how certain attributes might generate better faces than others.

Figure 3: 2D representation of the fc-7 feature space. (a) Youth. (b) Soft-lighting.

As mentioned in Section 3.2, we use the VGG-Face net as our base for finetuning. Before training the network, we perform some analysis to support the validity of the VGG-Face net as our base architecture. Specifically, we perform a forward pass for a set of images from our dataset and obtain activations at the fc-7 layer. We then transform the space of these activations into two dimensions using t-SNE. As evidenced by Figure 3(a), we see clustering even when using the weights directly from the VGG-Face net. This gives us confidence that the network is a good candidate for transfer learning. However, clustering does not occur for all of the attributes, as seen in Figure 3(b). This is also expected, since the VGG-Face net learned attributes that were important in distinguishing celebrity faces from each other, and certain attributes like soft-lighting may not provide the discriminatory power to do so. After looking at the overall results across the attributes, we decided that we need to backpropagate our gradients into at least some of the layers in the network.
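A sketch of this sanity check using scikit-learn's t-SNE is shown below. The fc7_acts and labels arrays are hypothetical placeholders standing in for the collected fc-7 activations (which are 4096-dimensional in the real network) and the per-attribute labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
fc7_acts = rng.standard_normal((200, 256))   # placeholder for the (num_images, 4096) fc-7 activations
labels = rng.integers(0, 2, size=200)        # placeholder binary labels for one attribute, e.g. Youth

emb = TSNE(n_components=2, perplexity=30, init="random").fit_transform(fc7_acts)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=8)   # look for clustering by attribute
plt.title("fc-7 activations, t-SNE")
plt.show()
```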


Figure 4: Loss curve after unfreezing various layers in the network

Next, we tune hyperparameters to get the best results from transfer learning. We try a wide range of learning rates, diagnosing the learning process by looking at the loss curve. We also try various configurations of freezing layers in the network. As we can see from Figure 4, backpropagating into more layers in the network helps the learning process, but the final loss is very similar across all trials. We also try different learning rates for the layers based on their distance to the heads. Results were similar to the previous trials, with the network converging to more or less the same loss.

5.2. Image Generation Experiments

We first run several experiments using our baseline class visualization approach. For our primary experiment, we attempt to boost several features that vary in frequency of appearance in our dataset, including black (ethnicity) and black hair, for which there are fewer images, and middle-aged, smiling, and male, for which the frequency of images is much higher. A resultant image of this experiment is shown in Figure 5. We see that the final image has multiple facial structures, i.e., multiple sets of noses, eyes, eyebrows, etc. However, the image also contains some semblance of black hair and a facial structure that is more masculine. Our baseline demonstrates that the trained CNN has learned what these general structures in the images look like.

Figure 5: Target Image, Vanilla Feature Inversion, cGMM Feature Inversion

Because class visualization seems to boost any part of the image whose activations are close to the desired features, we shift our approach to feature inversion. For some image with desired attributes, there exist corresponding activations at an arbitrary layer that we can use to generate the image. To sanity check this idea, we perform an experiment where we use the ground-truth activations at a convolutional layer for a Barack Obama image and attempt to reconstruct Obama using these activations. We obtain these activations by performing a forward pass on the image and stopping at the desired convolutional layer. The resultant Barack Obama image is shown in Figure 5. This result shows that the feature inversion method is a reasonable approach as long as we have access to the "target" activations of an image with selected attributes. As described in our methods, we use a cGMM to estimate these target activations.

Figure 6: Distribution of a single element of the fc-6 activations for (a) the black attribute and (b) the male attribute

To justify modeling each attribute by an independent multivariate Gaussian, we take the images in set S_i for label f_i and, for each image, plot a single element of the activation at the fc-6 layer.


We then plot the distribution of that particular element of the activation across all the images for that label i. Looking at two distributions representing two elements of the fc-6 activation in Figure 6, we see that the distributions are reasonably close to normal. In general, after plotting distributions for each element across a sample of images for an arbitrary label, the distribution of element j of the fc-6 activation for label i appears to be normal (not all graphs are shown). Thus, because we have some confidence that the true distribution of the target activations is Gaussian, we can proceed with our cGMM formulation.

Now we perform experiments to learn the weights in our cGMM. The forward and backward passes are written from scratch, so we use numerical gradient checking to validate the analytic gradients. To choose the learning rate and regularization parameters, we start by arbitrarily setting the learning rate and regularization parameter to 1e-6 and 1e-5, respectively. Then, we incrementally reduce the learning rate until the loss starts to decrease instead of diverging to infinity. Finally, we use the trained model to generate several faces with differing attributes.
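The gradient check we have in mind is the standard centered-difference comparison; a generic sketch is given below, where loss_fn and grad_fn are hypothetical callables for the objective in Equation 3 and its analytic gradient, and the tolerance is illustrative.

```python
import numpy as np

def check_gradient(loss_fn, grad_fn, w, eps=1e-6, tol=1e-4):
    """Compare an analytic gradient against centered finite differences."""
    analytic = grad_fn(w)
    numeric = np.zeros_like(w)
    for j in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        numeric[j] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2.0 * eps)
    rel_err = np.abs(analytic - numeric) / np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
    return rel_err.max() < tol
```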

6. Results

           training set    test set
fc-6       0.847           0.795
conv-5     0.867           0.815

Table 1: Average classification accuracy when freezing parameters beneath layers fc-6 and conv-5

Our final training parameters for the finetuning are 5e-6 for the learning rate and 50 for the number of epochs. As a form of regularization, we utilize dropout after every layer. The average classification accuracies are shown in Table 1. The classification accuracy for every facial attribute is above 0.5, while most are around 0.8 or 0.9.

For our cGMM training, our final learning rate and regularization parameter are 1e-11 and 1e-5, respectively. We generate faces for sets of attributes that are defined by several images in the test set (i.e., we plug these target attributes into our system and compare the image we generate with the ground-truth image). One example of a generated image is shown in Figure 8. To measure how "reasonable" our faces are, we use a similarity metric implemented by the PicTriev software, which uses common facial features to measure similarity [1]. This quantitative evaluation of our faces is shown in Table 2.

person               similarity %
Barack Obama         35
Clive Owen           27
Cristiano Ronaldo    45
Jared Leto           52
Julia Roberts        42
Mickey Rourke        22
Miley Cyrus          42
Nicole Richie        30
Ryan Seacrest        30

Table 2: Similarity percentages for different faces

Figure 8: Cristiano Ronaldo generated image

7. Discussion

Although our classification accuracies are generally quite high, there are some facial attributes, such as lighting (e.g., soft lighting, harsh lighting) and photo type (e.g., color photo, posed photo), that have low accuracies (roughly 55-60%). We do not attempt to improve these attributes, since they are not as important as other, more face-related characteristics, such as nose shape and forehead visibility, that have high classification accuracies.

Figure 7: Distribution of all attributes in our training dataset

The most obvious result from our generation experiments is that each resultant image is very close to the mean image of our training set. Despite this, the fact that we successfully generate a coherent face from random noise indicates that our feature inversion process could work if the activations for each attribute were more discriminative.

We hypothesize several reasons why our activations are not discriminative enough. One possibility is that the distributions for different attributes are very similar and clustered around the mean image in this abstract space. Another possibility is that the distributions lie on different manifolds in this space but are not normal, while still being roughly centered around the mean image. In that case, the act of averaging these activations and approximating them as Gaussians blurs the distinctions between these distributions.

This brings us to the assumptions about the feature activations upon which the cGMM approach relies. Our first assumption is that the distribution of activations for each attribute in a given layer is indeed Gaussian. Additionally, we assume that these Gaussians are independent. This is perhaps the least justified assumption the model makes, because many attributes are clearly correlated, such as middle-aged and white hair. Despite this, we hope that our model approximates the true distribution sufficiently well for our purposes. We also assume that the variables in each multivariate Gaussian are themselves distributed normally and are independent. We believe this assumption is valid because we examined several of these distributions for individual variables, as described in Section 5.

One way to improve our cGMM technique would be to estimate the target activations of a desired image by taking a weighted sum of samples from the Gaussians instead of a weighted sum of the means of the Gaussians. We opted for the latter approach because it was far more computationally efficient, but the former approach could allow for more variation from the mean image.

8. Conclusion & Future Work

In this work, we have trained an architecture that classifies facial attributes of images with high accuracy. We have also shown the effectiveness of the custom Gaussian Mixture Model in generating well-formed faces from random noise. Although the faces themselves were not discriminative enough for the desired features, we posit that we could train a model that generates discriminative faces if our dataset contained more diversity across the attributes.

Alternatively, we might implement a variational autoencoder structure as in [5] and compare the results with those of the cGMM. We expect that since the variational autoencoder has a higher capacity, it might give more discriminative faces but would also take longer to train. More efficient generative models open up exciting possibilities in other applications, including medicine and law enforcement.


References

[1] AppliedDevice. PicTriev: Searching faces on the web, 2010. [Online; accessed 2016-03-14].

[2] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pages 899–907, 2013.

[3] J. Gauthier. Conditional generative adversarial nets for convolutional face generation. Class project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester 2014, 2014.

[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[5] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

[6] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[7] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. [Online; accessed 2016-03-14].

[8] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In IEEE International Conference on Computer Vision (ICCV), October 2009.

[9] L. Lab. Denoising autoencoders (dA), 2016.

[10] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In Proceedings of NIPS, 2007.

[11] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.

[12] R. Salakhutdinov. Learning deep generative models. PhD thesis, University of Toronto, 2009.

[13] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 448–455, 2009.

[14] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2222–2230. Curran Associates, Inc., 2012.

[15] P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

[16] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
