
arXiv:1901.08971v2 [cs.CV] 3 Apr 2019



FaceForensics++: Learning to Detect Manipulated Facial Images

Andreas Rössler1 Davide Cozzolino2 Luisa Verdoliva2 Christian Riess3

Justus Thies1 Matthias Nießner1

1Technical University of Munich 2University Federico II of Naples 3University of Erlangen-Nuremberg

Figure 1: FaceForensics++ is a database of facial forgeries that enables researchers to train deep-learning-based approaches in a supervised fashion. The database contains manipulations created with three state-of-the-art methods, namely, Face2Face, FaceSwap, and DeepFakes.

Abstract

The rapid progress in synthetic image generation and manipulation has now come to a point where it raises significant concerns about its implications for society. At best, this leads to a loss of trust in digital content, but it might even cause further harm by spreading false information and creating fake news. This paper examines the realism of state-of-the-art image manipulations, and how difficult it is to detect them – either automatically or by humans.

To standardize the evaluation of detection methods, we propose an automated benchmark for facial manipulation detection1. In particular, the benchmark is based on DeepFakes [1], Face2Face [58], and FaceSwap [2] as prominent representatives of facial manipulations at random compression levels and sizes. The benchmark is publicly available2 and contains a hidden test set as well as a database of about 1.5 million manipulated images. This database is thus at least an order of magnitude larger than comparable, publicly available forgery datasets. Based on this data, we performed a thorough analysis of data-driven forgery detectors. We show that the use of additional domain-specific knowledge improves forgery detection to unprecedented accuracy, even in the presence of strong compression, and clearly outperforms human observers.

1. Introduction

Nowadays, manipulation of visual content is omnipresent and one of the most critical topics in our digital society. DeepFakes [1], for instance, showed how computer graphics and visualization techniques can be used to defame persons by replacing their face with the face of a different person. Faces are of special interest to current manipulation methods for various reasons: firstly, the reconstruction and tracking of human faces is a well-examined field in computer vision [67], which is the foundation of these editing approaches. Secondly, faces play a central role in human communication, as the face of a person can emphasize a message or even convey a message in its own right [28].

Current facial manipulation methods can be separated into two categories: facial expression manipulation and facial identity manipulation (see Fig. 2). One of the most prominent facial expression manipulation techniques is the method of Thies et al. [58] called Face2Face. It enables the transfer of facial expressions from one person to another in real time using only commodity hardware. Follow-up work such as "Synthesizing Obama" [55] is able to animate the face of a person based on an audio input sequence.

1. http://kaldir.vc.in.tum.de/faceforensics_benchmark

2. https://github.com/ondyari/FaceForensics




Figure 2: Advances in the digitization of human faces are the basis for modern facial image editing tools. The editing tools can be split into two main categories: identity modification and expression modification. Besides manually editing the face using tools like Photoshop, many automatic approaches have been proposed in the last few years. The most prominent and most widespread identity editing technique is face swapping. The popularity of these techniques stems from the fact that lightweight systems can run on mobile phones. Facial reenactment is a technique to alter the expressions of a person by transferring the expressions of a source person to the target.

"Bringing Portraits to Life" [8] enables the editing of expressions in a photograph.

Identity manipulation is the second category of facial forgeries. Instead of changing expressions, these methods replace the face of a person with the face of another person. This category is known as face swapping. It became popular with widespread consumer-level applications like Snapchat. DeepFakes also performs face swapping, but via deep learning. While face swapping based on simple computer graphics techniques runs in real time, DeepFakes needs to be trained for each pair of videos, which is a time-consuming task.

In this work, we show that we can automatically and reliably detect such manipulations, and thereby outperform human observers by a significant margin. We leverage recent advances in deep learning, especially the ability to learn extremely powerful image features with convolutional neural networks (CNNs). We tackle the detection problem by training a neural network in a supervised fashion. To this end, we generated a large-scale dataset of manipulations based on Face2Face, the classical computer graphics-based FaceSwap [2], and DeepFakes.

As the digital media forensics field lacks a benchmark for forgery detection, we propose an automated benchmark that considers the three manipulation methods in a realistic scenario, i.e., with random compression and random dimensions. Using this benchmark, we evaluate the current state-of-the-art detection methods as well as our forgery detection pipeline that considers the restricted field of facial manipulation methods.

To summarize, our paper makes the following contributions:

• an automated benchmark for facial manipulation detection under random compression for a standardized comparison, including a human baseline,

• a novel dataset of manipulated facial imagery composed of more than 1.5 million images from 1,000 videos with pristine (i.e., real) sources and target ground truth to enable supervised learning,

• an extensive evaluation of state-of-the-art hand-crafted and learned forgery detectors in various scenarios,

• a state-of-the-art forgery detection method tailored to facial manipulations.

2. Related Work

The paper intersects several fields in computer vision and digital multimedia forensics. We cover the most important related papers in the following paragraphs.

Face Manipulation Methods: In the last two decades, interest in virtual face manipulation has rapidly increased. A comprehensive state-of-the-art report has been published by Zollhöfer et al. [67]. Bregler et al. [13] presented an image-based approach called Video Rewrite to automatically create a new video of a person with generated mouth movements. With Video Face Replacement [20], Dale et al. presented one of the first automatic face swap methods. Using single-camera videos, they reconstruct a 3D model of both faces and exploit the corresponding 3D geometry to warp the source face to the target face. Garrido et al. [29] presented a similar system that replaces the face of an actor while preserving the original expressions. VDub [30] uses high-quality 3D face capturing techniques to photo-realistically alter the face of an actor to match the mouth movements of a dubber. Thies et al. [57] demonstrated the first real-time expression transfer for facial reenactment. Based on a consumer-level RGB-D camera, they reconstruct and track a 3D model of the source and the target actor. The tracked deformations of the source face are applied to the target face model. As a final step, they blend the altered face on top of the original target video.


Face2Face, proposed by Thies et al. [58], is an advanced real-time facial reenactment system, capable of altering facial movements in commodity video streams, e.g., videos from the internet. They combine 3D model reconstruction and image-based rendering techniques to generate their output. The same principle can also be applied in virtual reality in combination with eye tracking and reenactment [59], or be extended to the full body [60]. Kim et al. [38] learn an image-to-image translation network to convert computer graphics renderings of faces to real images. Suwajanakorn et al. [55] learned the mapping between audio and lip motions, while their compositing approach builds on techniques similar to Face2Face [58]. Averbuch-Elor et al. [8] present a reenactment method, Bringing Portraits to Life, which employs 2D warps to deform the image to match the expressions of a source actor. They also compare to the Face2Face technique and achieve similar quality.

Recently, several face image synthesis approaches using deep learning techniques have been proposed. Lu et al. [47] provide an overview. Generative adversarial networks (GANs) are used to apply face aging [7], to generate new viewpoints [34], or to alter face attributes like skin color [46]. Deep Feature Interpolation [61] shows impressive results on altering face attributes like age, mustache, smiling, etc. Similar attribute interpolation results are achieved by Fader Networks [43]. Most of these deep-learning-based image synthesis techniques suffer from low image resolutions. Karras et al. [36] improve the image quality using progressive growing of GANs. Their results include high-quality synthesis of faces.

Multimedia Forensics: Multimedia forensics aims to ensure the authenticity, origin, and provenance of an image or video without the help of an embedded security scheme. Focusing on integrity, early methods are driven by hand-crafted features that capture expected statistical or physics-based artifacts that occur during image formation. Surveys on these methods can be found in [26, 53]. More recent literature concentrates on CNN-based solutions, both through supervised and unsupervised learning [10, 17, 12, 9, 35, 66]. For videos, the main body of work focuses on detecting manipulations that can be created with relatively low effort, such as dropped or duplicated frames [62, 31, 45], varying interpolation types [25], copy-move manipulations [11, 21], or chroma-key compositions [48].

Some other papers explicitly address detecting manipulations related to faces, like distinguishing computer-generated faces from natural ones [22, 15, 51], morphed faces [50], face splicing [24, 23], face swapping [65, 37], and DeepFakes [5, 44, 33]. For face manipulation detection, some approaches exploit specific artifacts arising from the synthesis process, like eye blinking [44], or color, texture, and shape cues [24, 23].

Other papers are more general and propose a deep network trained to capture the subtle inconsistencies arising from low-level and/or high-level features [50, 65, 37, 5, 33]. These approaches show impressive results; however, robustness issues are addressed only in very few works, even though they are of paramount importance for practical applications. For example, operations like compression and resizing are known to launder manipulation traces from the data. In a real scenario, these basic operations are standard when images and videos are, for example, uploaded to social networks, which is one of the most important application fields for forensic analysis. To this end, our dataset is designed to cover a realistic scenario, i.e., videos from the wild, manipulated and compressed at different quality levels (see Section 3). The availability of such a large and varied dataset can help researchers to benchmark their approaches and develop better forgery detectors for facial imagery.

Forensic Analysis Datasets: Classical forensics datasets have been created with significant manual effort under very controlled conditions, to isolate specific properties of the data like camera artifacts. While several datasets have been proposed that include image manipulations, only a few of them also address the important case of video footage. MICC F2000, for example, is an image copy-move manipulation dataset consisting of a collection of 700 forged images from various sources [6]. The First IEEE Image Forensics Challenge Dataset3 comprises a total of 1,176 forged images; the Wild Web Dataset [63] contains 90 real cases of manipulations coming from the web, and the Realistic Tampering dataset [42] includes 220 forged images. A database of 2,010 FaceSwap- and SwapMe-generated images has been proposed by Zhou et al. [65]. Recently, Korshunov and Marcel [41] made a database of 620 DeepFakes videos publicly available, which they created from multiple videos for each of 43 subjects. The National Institute of Standards and Technology (NIST) released the most extensive dataset for generic image manipulation, comprising about 50,000 forged images (both local and global manipulations) and around 500 forged videos [32].

In contrast to these datasets, we generated a database containing more than 1.5 million images from 3,000 fake videos of 1,000 identities, which is an order of magnitude more than existing databases. We evaluate the importance of a large training corpus in Section 4.

3. http://ifc.recod.ic.unicamp.br/fc.website/index.py?sec=0


Figure 3: Statistics of our extracted sequences: (a) gender, (b) resolution, (c) pixel coverage of faces. VGA denotes 480p, HD denotes 720p, and FHD denotes 1080p resolution of our videos. The last graph shows the number of sequences (y-axis) with a given bounding box pixel height (x-axis).

3. Large-Scale Facial Forgery Database

A core contribution of this paper is our FaceForensics++ dataset, extending the preliminary FaceForensics dataset [52], which is already actively used by the forensic community.

This novel large-scale dataset enables us to train a state-of-the-art forgery detector for facial image manipulation in a supervised fashion (see Section 4). To this end, we applied three automated state-of-the-art face manipulation methods to 1,000 pristine videos downloaded from the Internet (see Fig. 3 for statistics). For a realistic scenario, we chose to collect videos in the wild, more specifically from YouTube. However, early experiments with all manipulation methods showed that the target face has to be nearly front-facing to prevent the methods from failing or producing strong artifacts (see Fig. 4). Therefore, we performed a manual screening of the resulting clips to ensure a high-quality video selection and to avoid videos with face occlusions. We selected 1,000 video sequences containing 509,914 images, which we use as our pristine data. For more details, we refer to Appendix A.

To generate a large-scale manipulation database, we adapted state-of-the-art video editing methods to work fully automatically. In the following paragraphs, we describe the modifications of these methods.

For our dataset, we chose two computer graphics-based approaches (Face2Face and FaceSwap) and a learning-based approach (DeepFakes). The computer graphics-based approaches are both 3D editing techniques that reconstruct a 3D model of a face and apply edits in 3D. FaceSwap is a lightweight editing tool that copies the face region from one image to another by using sparse face marker positions. In contrast, Face2Face is a more sophisticated technique with superior face tracking and modeling that enables reenactment of facial expressions. Similar to FaceSwap, DeepFakes is an identity manipulation tool; it is a deep-learning-based method that uses an auto-encoder to replace the face of a target video with the face of a source video. All three methods require a source and a target actor video pair as input.

Figure 4: Automatic face editing tools rely on the ability to track the face in the target video. State-of-the-art tracking methods like Thies et al. [58] fail in cases of profile imagery of a face (left). Rotations larger than 45° lead to tracking errors (middle). Occlusions also lead to tracking errors (right).

The final output of each method is a video composed of generated images. Besides the manipulation output, we can also access ground truth masks that indicate whether a pixel has been modified or not. These binary masks can be used to train forgery localization methods.

FaceSwap FaceSwap is a graphics-based approach to transfer the face region from a source video to a target video. The face region is extracted based on sparsely detected facial landmarks. Using these landmarks, the method fits a 3D template model using blendshapes. This model is back-projected to the target image by minimizing the difference between the projected shape and the localized landmarks, using the textures of the input image. Finally, the rendered model is blended with the image and color correction is applied. We perform these steps for all pairs of source and target frames until one of the videos ends. The implementation is computationally lightweight and runs quickly on the CPU alone.
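For illustration, the following is a minimal 2D stand-in for this pipeline, assuming OpenCV and precomputed landmark arrays (e.g., from dlib): it aligns the source face to the target's landmarks with a similarity transform and blends it in with Poisson editing. The actual tool fits a 3D blendshape model rather than a 2D transform, so this is only a sketch of the extract/warp/blend flow.

import cv2
import numpy as np

def swap_face(source_img, target_img, source_pts, target_pts):
    # source_pts / target_pts: (N, 2) facial landmark arrays for the two faces.
    M, _ = cv2.estimateAffinePartial2D(source_pts.astype(np.float32),
                                       target_pts.astype(np.float32))
    h, w = target_img.shape[:2]
    warped = cv2.warpAffine(source_img, M, (w, h))        # align source face to target
    hull = cv2.convexHull(target_pts.astype(np.int32))    # face region in the target
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.fillConvexPoly(mask, hull, 255)
    x, y, bw, bh = cv2.boundingRect(hull)
    center = (x + bw // 2, y + bh // 2)
    # Poisson blending stands in for the rendering and color correction steps.
    return cv2.seamlessClone(warped, target_img, mask, center, cv2.NORMAL_CLONE)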

DeepFakes DeepFakes is a synonym for face replacement based on deep learning. A face in a target sequence is replaced by a face that has been observed in a source video or image collection. There are various public implementations of DeepFakes available, most notably FakeApp [3] and the faceswap GitHub project [1]. The method is based on two autoencoders with a shared encoder that are trained to reconstruct training images of the source and the target face, respectively. A face detector is used to crop and align the images. To create a fake image, the trained encoder and the decoder of the source face are applied to the target face. The autoencoder output is then blended with the rest of the image using Poisson image editing [49].

For our dataset, we use the faceswap GitHub implementation. We slightly modify the implementation by replacing the manual training data selection with a fully automated data loader. We used the default parameters to train the video-pair models.


Since the training of these models is very time-consuming, we also publish the models as part of the dataset. This allows additional manipulations of these persons to be generated easily with different post-processing.

Face2Face Face2Face [58] is a facial reenactment system that transfers the expressions of a source video to a target video while maintaining the identity of the target person. The original implementation is based on two video input streams with a manual key-frame selection. These frames are used to generate a dense reconstruction of the face, which can be used to re-synthesize the face under different illumination and expressions. To process our video database, we adapt the Face2Face approach to create reenactment manipulations fully automatically. We process each video in a preprocessing pass: we use the first frames to obtain a temporary face identity (i.e., a 3D model), and track the expressions over the remaining frames. To select the keyframes that are used to improve the identity fitting and the static texture, we select the frames with the left- and right-most angle of the face. Based on this identity reconstruction, we track the whole video to compute, per frame, the expression, rigid pose, and lighting parameters, as done in the original implementation of Face2Face. We generate the reenactment video outputs by transferring the source expression parameters of each frame (i.e., 76 blendshape coefficients) to the target video. More details of the reenactment process can be found in the original paper [58].
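As a sketch of the expression transfer step only, the reenacted geometry combines the target's identity coefficients with the source's per-frame expression coefficients in a linear face model; the bases and dimensions below are placeholders, not the actual Face2Face model, and pose and illumination estimation are omitted.

import numpy as np

def reenact_geometry(mean_shape, id_basis, expr_basis, target_id, source_expr):
    # Reenacted 3D face: target identity plus source expression (76 blendshape
    # coefficients per frame in the paper).
    return mean_shape + id_basis @ target_id + expr_basis @ source_expr

# Toy dimensions for illustration; a real parametric face model supplies these bases.
rng = np.random.default_rng(0)
n_vertices, k_id, k_expr = 5000, 80, 76
geometry = reenact_geometry(rng.standard_normal(3 * n_vertices),
                            rng.standard_normal((3 * n_vertices, k_id)),
                            rng.standard_normal((3 * n_vertices, k_expr)),
                            rng.standard_normal(k_id),
                            rng.standard_normal(k_expr))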

Postprocessing - Video Quality To create a realistic scenario of manipulated videos, we generate output videos with different quality levels, as used in many social networks. Since raw videos are rarely found on the Internet, we compress the videos using the H.264 codec, which is used by social networks and video-sharing websites. To generate high-quality videos, we use a light compression denoted by HQ (constant rate quantization parameter equal to 23), which is visually nearly lossless. Low-quality videos (LQ) are produced using a quantization parameter of 40.
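For illustration, such a two-level H.264 re-encoding can be reproduced with ffmpeg (assuming an ffmpeg installation with libx264; the exact encoder flags used by the authors are not specified here, and -qp would be the strict constant-quantizer alternative to -crf). File names are hypothetical.

import subprocess

def compress_h264(src_path, dst_path, quality):
    # Re-encode with libx264; quality 23 corresponds to the HQ setting, 40 to LQ.
    subprocess.run(["ffmpeg", "-y", "-i", src_path,
                    "-c:v", "libx264", "-crf", str(quality), dst_path], check=True)

compress_h264("original.mp4", "video_hq.mp4", 23)
compress_h264("original.mp4", "video_lq.mp4", 40)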

4. Forgery Detection

We cast forgery detection as a binary classification problem per frame of the manipulated videos. The following sections show the results of manual and automatic forgery detection. For all experiments, we split the dataset into fixed training, validation, and test sets, consisting of 720, 140, and 140 videos, respectively. All evaluations are reported using videos from the test set.

4.1. Forgery Detection of Human Observers

To evaluate the performance of humans in the task of forgery detection, we conducted a user study with 143 participants, consisting mostly of computer science university students. This forms the baseline for the automated forgery detection methods.

Layout of the User Study: After a short introduction to the binary task, users are asked to classify randomly selected images from our test set. The selected images vary in image quality as well as manipulation method, and we use a 50:50 split of pristine and fake images. Since the time for inspection of an image may be important, we randomly set a time limit of 2, 4, or 6 seconds, after which we hide the image. Afterwards, the users are asked whether the displayed image is 'real' or 'fake'. To ensure that the users spend the available time on inspection, the question is asked after the image has been displayed and not during the observation time. We designed the study to take only a few minutes per participant, showing 60 images per attendee, which results in a collection of 8,580 human decisions.

Evaluation: In Fig. 5 we show the results of our study on all quality levels. It shows the dependency between the ability to detect fakes and the video quality. With lower video quality, the average human performance decreases from 71% to 61%. The graph shows the numbers averaged across all time intervals, since the different time constraints did not result in significantly different observations.

Figure 5: Forgery detection results of our user study with 143 participants. The accuracy depends on the video quality, decreasing from 72% on average for raw videos to 71% for high quality and 61% for low quality videos.

Note that the user study contained fake images of all three manipulation methods as well as pristine images. In this setting, Face2Face is particularly hard for human observers to detect, since Face2Face does not introduce a strong semantic change and thus introduces only subtle visual artifacts in comparison to the face replacement methods.


Figure 6: Our domain-specific forgery detection pipeline: the input image is processed by a robust face tracking method; we use this information to extract the region of the image that is covered by the face; this region is fed into a learned classification network that outputs the prediction.

4.2. Automatic Forgery Detection Methods

We analyze the detection accuracy of our forgery detection pipeline depicted in Fig. 6. Since our goal is to detect forgeries of facial imagery, we use additional domain-specific information that we can extract from the input sequences. To this end, we use the state-of-the-art face tracking method by Thies et al. [58] to track the face in the video and to extract the face region of the image. We use a conservative crop (enlarged by a factor of 1.3) around the center of the tracked face, enclosing the reconstructed face. This incorporation of domain knowledge improves the overall performance of a forgery detector in comparison to a naive approach that uses the whole image as input (see Sec. 4.2.2). We evaluated several variants of our approach by using different state-of-the-art classification methods. We consider learning-based methods used in the forensic community for generic manipulation detection [10, 17], computer-generated vs. natural image detection [51], and face tampering detection [5]. In addition, we show that classification based on XceptionNet [14] outperforms all other variants in detecting fakes.
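The conservative crop itself is straightforward; a minimal sketch is given below, where `box` is a face bounding box from any tracker (the paper uses the tracker of Thies et al. [58]) and `image` is a NumPy-style array.

def conservative_crop(image, box, scale=1.3):
    # Enlarge the tracked face box by `scale` around its center and crop,
    # clamping to the image borders. `box` is (x, y, width, height).
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    half_w, half_h = scale * w / 2.0, scale * h / 2.0
    x0, y0 = int(max(cx - half_w, 0)), int(max(cy - half_h, 0))
    x1 = int(min(cx + half_w, image.shape[1]))
    y1 = int(min(cy + half_h, image.shape[0]))
    return image[y0:y1, x0:x1]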

4.2.1 Detection based on Steganalysis Features:

Following the method of Fridrich et al. [27], we use hand-crafted features to detect forgeries. The features are co-occurrences of 4-pixel patterns along the horizontal and vertical directions of the high-pass images, for a total feature length of 162. These features are then used to train a linear Support Vector Machine (SVM) classifier. This technique was the winning approach in the first IEEE Image Forensics Challenge [16]. We provide a 128 × 128 central crop-out of the face as input to the method. While the hand-crafted method outperforms human accuracy on raw images by a large margin, it struggles to cope with compression, which leads to an accuracy below human performance for low quality videos (see Fig. 7 and Table 1).
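A simplified sketch of this hand-crafted baseline follows: first-order high-pass residuals, truncation, and histograms of 4 consecutive quantized samples, fed to a linear SVM. The symmetry merging that reduces the dimensionality to 162 in the original rich-model features is omitted, so this is not a faithful reimplementation; the face-crop and label variables in the usage comment are hypothetical.

import numpy as np
from sklearn.svm import LinearSVC

def cooccurrence_features(gray, T=2):
    # High-pass residuals, truncated to [-T, T], then histograms of 4 consecutive
    # quantized samples along rows and columns.
    g = gray.astype(np.int64)
    feats = []
    for res in (g[:, 1:] - g[:, :-1],          # horizontal first-order residual
                g[1:, :] - g[:-1, :]):         # vertical first-order residual
        r = np.clip(res, -T, T) + T            # values in {0, ..., 2T}
        base = 2 * T + 1
        for a, b, c, d in ((r[:, :-3], r[:, 1:-2], r[:, 2:-1], r[:, 3:]),
                           (r[:-3, :], r[1:-2, :], r[2:-1, :], r[3:, :])):
            codes = ((a * base + b) * base + c) * base + d
            hist = np.bincount(codes.ravel(), minlength=base ** 4)
            feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)

# face_crops: 128x128 grayscale arrays, labels: 0 = pristine, 1 = fake (hypothetical).
# X = np.stack([cooccurrence_features(c) for c in face_crops])
# clf = LinearSVC().fit(X, labels)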

Figure 7: Binary detection accuracy of all evaluated architectures on the different manipulation methods, using face tracking, when trained on each manipulation method separately.

4.2.2 Detection based on Learned Features:

We evaluate five network architectures known from the literature to solve the classification task:

(1) Cozzolino et al. [17] cast the hand-crafted steganalysis features from the previous section into a CNN-based network. We fine-tune this network on our large-scale dataset.

(2) We use our dataset to train the convolutional neural network proposed by Bayar and Stamm [10], which uses a constrained convolutional layer followed by two convolutional, two max-pooling, and three fully-connected layers. The constrained convolutional layer is specifically designed to suppress the high-level content of the image.
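A minimal sketch of such a constrained layer is shown below; here the weight projection is applied on every forward pass, whereas the original work re-applies it after each optimizer step, and the layer sizes are illustrative only.

import torch
import torch.nn.functional as F
from torch import nn

class ConstrainedConv2d(nn.Conv2d):
    # Prediction-error filter: the centre weight of each kernel is fixed to -1
    # and the remaining weights are rescaled to sum to 1, so the layer learns
    # high-pass residuals instead of image content.
    def forward(self, x):
        with torch.no_grad():
            k = self.kernel_size[0] // 2
            self.weight[:, :, k, k] = 0
            self.weight /= self.weight.sum(dim=(2, 3), keepdim=True)
            self.weight[:, :, k, k] = -1
        return F.conv2d(x, self.weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

constrained = ConstrainedConv2d(3, 3, kernel_size=5, padding=2, bias=False)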

(3) Rahmouni et al. [51] adopt different CNN architectures with a global pooling layer that computes four statistics (mean, variance, maximum, and minimum). We consider the Stats-2L network, which had the best performance.

(4) MesoInception-4 [5] is a CNN-based network inspired by InceptionNet [56] to detect face tampering in videos. The network has two inception modules and two classic convolution layers interlaced with max-pooling layers, followed by two fully-connected layers. Instead of the classic cross-entropy loss, the authors propose the mean squared error between true and predicted labels. We resize the face images to 256 × 256, the input size of the network.

(5) XceptionNet [14] is a traditional CNN trained on ImageNet, based on separable convolutions with residual connections.


Figure 8: Binary precision values of our baselines when trained on all three manipulation methods at the same time. See Table 1 for the average accuracy values. Except for the Full Image XceptionNet, we use the proposed pre-extraction of the face region as input to the approaches.

We transfer it to our task by replacing the final fully connected layer with two outputs. The other layers are initialized with the ImageNet weights. To set up the newly inserted fully connected layer, we fix all weights up to the final layers and pre-train the network for 12 epochs. We use hyper-parameter optimization to find the best performing model based on validation accuracy. After this step, we train the best performing network for 60 more epochs, where one epoch equals one-fourth of the actual dataset size.
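A sketch of this transfer-learning setup is given below, assuming the timm library, where the pretrained model is registered as "xception" (or "legacy_xception" in newer releases); the original work uses the Xception ImageNet weights directly rather than this specific library.

import timm

model = timm.create_model("xception", pretrained=True, num_classes=2)  # new 2-way head

# Phase 1: train only the newly initialized classifier head.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_classifier().parameters():
    p.requires_grad = True

# After the head has been pre-trained (12 epochs in the paper), unfreeze everything
# and fine-tune the full network.
for p in model.parameters():
    p.requires_grad = True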

All five methods are trained with the Adam optimizer using the default values for the moments (β1 = 0.9, β2 = 0.999, ε = 10⁻⁸). We stop the training process if the validation accuracy does not change for 10 consecutive epochs. Validation accuracies are computed on 100 images per video to decrease training time. Finally, we address the imbalance between real and fake images in the binary task (i.e., the number of fake images being roughly three times as large as the number of pristine images) by weighting the training images correspondingly.
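A training skeleton reflecting these choices is sketched below; the 1:3 class weight follows the stated imbalance, the learning rate of 2·10⁻⁴ is the XceptionNet setting from Appendix D, and the model and data loaders are assumed to be provided by the caller.

import torch
from torch import nn

def train_with_early_stopping(model, loader_train, loader_val, max_epochs=200, patience=10):
    device = next(model.parameters()).device
    # Class 0 = pristine, class 1 = fake; down-weight the roughly 3x over-represented fakes.
    criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.0 / 3.0], device=device))
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999), eps=1e-8)
    best_acc, stale = 0.0, 0
    for _ in range(max_epochs):
        model.train()
        for x, y in loader_train:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for x, y in loader_val:
                pred = model(x.to(device)).argmax(dim=1).cpu()
                correct += (pred == y).sum().item()
                total += y.numel()
        acc = correct / total
        if acc > best_acc:
            best_acc, stale = acc, 0
            torch.save(model.state_dict(), "best_model.pth")  # keep the best weights
        else:
            stale += 1
            if stale >= patience:                             # 10 epochs without improvement
                break
    return best_acc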

Comparison of our Forgery Detection Variants: Fig. 7 shows the results of a binary forgery detection task using all network architectures, evaluated separately on all three manipulation methods and at different video quality levels.

Method                         Raw      HQ       LQ
[14] Full Image XceptionNet    83.49    80.59    75.61
[27] Steg. Features + SVM      97.63    70.78    56.37
[17] Cozzolino et al.          98.56    79.56    56.38
[10] Bayar and Stamm           99.19    89.90    70.01
[51] Rahmouni et al.           97.72    84.32    62.82
[5] MesoNet                    96.51    85.51    75.65
[14] XceptionNet               99.41    97.53    85.49

Table 1: Binary detection accuracy of our baselines when trained on all three manipulation methods, for raw, high quality (HQ), and low quality (LQ) compression. Besides the naive full-image XceptionNet, all methods are trained on a conservative crop (enlarged by a factor of 1.3) around the center of the tracked face.

All approaches achieve very high performance on raw input data. Performance drops for compressed videos, particularly for hand-crafted features and for shallow CNN architectures [10, 17]. The neural networks are better at handling these situations, with XceptionNet achieving compelling results on weak compression while still maintaining reasonable performance on low quality images, as it benefits from its pre-training on ImageNet as well as its larger network capacity.

To compare the results of our user study to the performance of our automatic detectors, we also tested the detection variants on a dataset containing images from all manipulation methods. Fig. 8 and Table 1 show the results on the full dataset. Here, our automated detectors outperform human performance by a large margin (cf. Fig. 5). We also evaluate a naive forgery detector operating on the full image (resized to the XceptionNet input) instead of using face tracking information (see Fig. 8, rightmost column). Due to the lack of domain-specific information, the XceptionNet classifier has a significantly lower accuracy in this scenario. To summarize, domain-specific information in combination with an XceptionNet classifier shows the best performance in each test. We use this network to further understand the influence of the training corpus size and its ability to distinguish between the different manipulation methods.

Evaluation of the Training Corpus Size: Fig. 9 shows the importance of the training corpus size. To this end, we trained the XceptionNet classifier with different training corpus sizes on all three video quality levels separately. The overall performance increases with the number of training images, which is particularly important for low quality video footage, as can be seen at the bottom of the figure.


Figure 9: The detection performance of our approach using XceptionNet depends on the training corpus size. Especially for low quality video data, a large database is needed.

5. Benchmark

Besides our large-scale manipulation database, we publish a competitive benchmark for facial forgery detection. To this end, we collected 700 additional videos and manipulated a subset of them in a similar fashion as in Section 3 for each of our three manipulation methods. As uploaded videos, e.g., to social networks, will be post-processed in various ways, we obscure all selected videos multiple times by, e.g., unknown re-sizing, compression method, and bit-rate, to ensure realistic conditions. This processing is applied directly to the raw videos. Finally, we manually select a single challenging frame from each video based on visual inspection. Specifically, we obtain a set of 700 images, each image randomly taken from either one of the manipulation methods or the original footage. Note that we do not necessarily have an equal split of pristine and fake images, nor an equal split of the used manipulation methods. The ground truth labels are hidden and are used on our host server to evaluate the classification accuracy of the submitted models. The automated benchmark allows submissions from a single submitter every two weeks to prevent overfitting (similar to existing benchmarks [19]).

As baselines, we evaluate all previously trained models on the benchmark and report the numbers of the best performing model for each detection method separately (see Table 2). Besides the Full Image XceptionNet, we use the proposed pre-extraction of the face region as input to the approaches.

Method                     DF       F2F      FS       Real     Total
Cozzolino et al.           0.836    0.847    0.825    0.326    0.581
Rahmouni et al.            0.691    0.401    0.447    0.709    0.607
Bayar and Stamm            0.882    0.730    0.650    0.614    0.684
MesoNet                    0.700    0.489    0.612    0.823    0.707
Full Image XceptionNet     0.755    0.759    0.699    0.697    0.719
XceptionNet                0.882    0.715    0.718    0.797    0.783

Table 2: Results of the best performing model of each detection method, i.e., trained on either raw, low, or high quality data. We report precision results for DeepFakes (DF), Face2Face (F2F), FaceSwap (FS), and pristine images (Real), as well as the overall total accuracy.

The relative performance of the classification models is similar to that on our database test set (see Table 1). However, since the benchmark scenario deviates from the training database, the overall performance of the models is lower, especially for the pristine image detection precision.

The benchmark is already publicly available to the community and we hope that it leads to a standardized comparison of follow-up work.

6. Discussion & Conclusion

While current state-of-the-art facial image manipulation methods exhibit visually stunning results, we demonstrate that they can be detected by trained forgery detectors. It is particularly encouraging that the challenging case of low-quality video, where humans and hand-crafted features exhibit difficulties, can also be tackled by learning-based approaches. To train detectors using domain-specific knowledge, we introduce a novel dataset of videos of manipulated faces that exceeds all existing publicly available forensic datasets by an order of magnitude.

In this paper we focus on the influence of compression on the detectability of state-of-the-art manipulation methods and propose a standardized benchmark for follow-up work. All image data and trained models, as well as our benchmark, are publicly available and are already used by other researchers. In particular, transfer learning is of high interest in the forensic community. As new manipulation methods appear by the day, methods have to be developed that are able to detect fakes with no or little training data. Our database is already used for this forensic transfer learning task, where knowledge of one source manipulation domain is transferred to another target domain, as shown by Cozzolino et al. [18]. We hope that the dataset and benchmark become a stepping stone for even more researchers to work in the field of digital media forensics, in particular with a focus on facial forgeries.


7. Acknowledgements

We gratefully acknowledge the support of this research by the AI Foundation, a TUM-IAS Rudolf Mößbauer Fellowship, the ERC Starting Grant Scan2CAD (804724), and a Google Faculty Award. We would also like to thank Google's Chris Bregler for help with the cloud computing. In addition, this material is based on research sponsored by the Air Force Research Laboratory and the Defense Advanced Research Projects Agency under agreement number FA8750-16-2-0204. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory, the Defense Advanced Research Projects Agency, or the U.S. Government.

References

[1] Deepfakes github. https://github.com/deepfakes/faceswap. Accessed: 2018-10-29. 1, 4, 14

[2] FaceSwap. https://github.com/MarekKowalski/FaceSwap/. Accessed: 2018-10-29. 1, 2

[3] FakeApp. https://www.fakeapp.com/. Accessed: 2018-09-01. 4

[4] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. YouTube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016. 12

[5] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen. MesoNet: a compact facial video forgery detection network. arXiv preprint arXiv:1809.00888, 2018. 3, 6, 7, 14

[6] I. Amerini, L. Ballan, R. Caldelli, A. Del Bimbo, and G. Serra. A SIFT-based forensic method for copy-move attack detection and transformation recovery. IEEE Transactions on Information Forensics and Security, 6(3):1099–1110, Mar. 2011. 3

[7] G. Antipov, M. Baccouche, and J. Dugelay. Face aging with conditional generative adversarial networks. CoRR, abs/1702.01983, 2017. 3

[8] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen. Bringing portraits to life. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2017), 36(4): to appear, 2017. 2, 3

[9] J. Bappy, A. Roy-Chowdhury, J. Bunk, L. Nataraj, and B. Manjunath. Exploiting spatial structure for localizing manipulated image regions. In IEEE International Conference on Computer Vision, pages 4970–4979, 2017. 3

[10] B. Bayar and M. Stamm. A deep learning approach to universal image manipulation detection using a new convolutional layer. In ACM Workshop on Information Hiding and Multimedia Security, pages 5–10, 2016. 3, 6, 7, 14

[11] P. Bestagini, S. Milani, M. Tagliasacchi, and S. Tubaro. Local tampering detection in video sequences. In IEEE International Workshop on Multimedia Signal Processing, pages 488–493, October 2013. 3

[12] L. Bondi, S. Lameri, D. Guera, P. Bestagini, E. Delp, and S. Tubaro. Tampering Detection and Localization through Clustering of Camera-Based CNN Features. In IEEE Computer Vision and Pattern Recognition Workshops, 2017. 3

[13] C. Bregler, M. Covell, and M. Slaney. Video rewrite: Driving visual speech with audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '97, pages 353–360, New York, NY, USA, 1997. ACM Press/Addison-Wesley Publishing Co. 2

[14] F. Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2017. 6, 7, 14

[15] V. Conotter, E. Bodnari, G. Boato, and H. Farid. Physiologically-based detection of computer generated faces in video. In IEEE International Conference on Image Processing, pages 1–5, Oct 2014. 3

[16] D. Cozzolino, D. Gragnaniello, and L. Verdoliva. Image forgery detection through residual-based local descriptors and block-matching. In IEEE International Conference on Image Processing, pages 5297–5301, October 2014. 6

[17] D. Cozzolino, G. Poggi, and L. Verdoliva. Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection. In ACM Workshop on Information Hiding and Multimedia Security, pages 1–6, 2017. 3, 6, 7, 14

[18] D. Cozzolino, J. Thies, A. Rössler, C. Riess, M. Nießner, and L. Verdoliva. ForensicTransfer: Weakly-supervised domain adaptation for forgery detection. arXiv, 2018. 8

[19] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017. 8

[20] K. Dale, K. Sunkavalli, M. K. Johnson, D. Vlasic, W. Matusik, and H. Pfister. Video face replacement. ACM Trans. Graph., 30(6):130:1–130:10, Dec. 2011. 2

[21] L. D'Amiano, D. Cozzolino, G. Poggi, and L. Verdoliva. A PatchMatch-based Dense-field Algorithm for Video Copy-Move Detection and Localization. IEEE Transactions on Circuits and Systems for Video Technology, in press, 2018. 3

[22] D.-T. Dang-Nguyen, G. Boato, and F. De Natale. Identify computer generated characters by analysing facial expressions variation. In IEEE International Workshop on Information Forensics and Security, pages 252–257, 2012. 3

[23] T. de Carvalho, F. Faria, H. Pedrini, R. Torres, and A. Rocha. Illuminant-Based Transformed Spaces for Image Forensics. IEEE Transactions on Information Forensics and Security, 11(4):720–733, 2016. 3

[24] T. de Carvalho, C. Riess, E. Angelopoulou, H. Pedrini, and A. Rocha. Exposing digital image forgeries by illumination color classification. IEEE Transactions on Information Forensics and Security, 8(7):1182–1194, 2013. 3


[25] X. Ding, Y. Gaobo, R. Li, L. Zhang, Y. Li, and X. Sun. Identification of Motion-Compensated Frame Rate Up-Conversion Based on Residual Signal. IEEE Transactions on Circuits and Systems for Video Technology, in press, 2017. 3

[26] H. Farid. Photo Forensics. The MIT Press, 2016. 3

[27] J. Fridrich and J. Kodovsky. Rich Models for Steganalysis of Digital Images. IEEE Transactions on Information Forensics and Security, 7(3):868–882, June 2012. 6, 7

[28] C. Frith. Role of facial expressions in social interactions. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535), Dec. 2009. 1

[29] P. Garrido, L. Valgaerts, O. Rehmsen, T. Thormaehlen, P. Pérez, and C. Theobalt. Automatic face reenactment. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4217–4224, 2014. 2

[30] P. Garrido, L. Valgaerts, H. Sarmadi, I. Steiner, K. Varanasi, P. Pérez, and C. Theobalt. VDub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. Computer Graphics Forum, 34(2):193–204, 2015. 2

[31] A. Gironi, M. Fontani, T. Bianchi, A. Piva, and M. Barni. A video forensic technique for detecting frame deletion and insertion. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6226–6230, 2014. 3

[32] H. Guan, M. Kozak, E. Robertson, Y. Lee, A. N. Yates, A. Delgado, D. Zhou, T. Kheyrkhah, J. Smith, and J. Fiscus. MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In IEEE Winter Applications of Computer Vision Workshops, pages 63–72, Jan 2019. 3

[33] D. Guera and E. Delp. Fake face detection methods: Can they be generalized? In IEEE International Conference on Advanced Video and Signal Based Surveillance, 2018. 3

[34] R. Huang, S. Zhang, T. Li, and R. He. Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis. CoRR, abs/1704.04086, 2017. 3

[35] M. Huh, A. Liu, A. Owens, and A. A. Efros. Fighting fake news: Image splice detection via learned self-consistency. In ECCV, 2018. 3

[36] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. CoRR, abs/1710.10196, 2017. 3

[37] A. Khodabakhsh, R. Ramachandra, K. Raja, P. Wasnik, and C. Busch. Fake face detection methods: Can they be generalized? In International Conference of the Biometrics Special Interest Group, 2018. 3

[38] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep Video Portraits. ACM Transactions on Graphics 2018 (TOG), 2018. 3

[39] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009. 12

[40] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 13

[41] P. Korshunov and S. Marcel. Deepfakes: a new threat to face recognition? Assessment and detection. arXiv preprint arXiv:1812.08685, 2018. 3

[42] P. Korus and J. Huang. Multi-scale Analysis Strategies in PRNU-based Tampering Localization. IEEE Transactions on Information Forensics and Security, 12(4):809–824, Apr. 2017. 3

[43] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato. Fader networks: Manipulating images by sliding attributes. CoRR, abs/1706.00409, 2017. 3

[44] Y. Li, M. Chang, and S. Lyu. In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In IEEE WIFS, 2018. 3

[45] C. Long, E. Smith, A. Basharat, and A. Hoogs. A C3D-based Convolutional Neural Network for Frame Dropping Detection in a Single Video Shot. In IEEE Computer Vision and Pattern Recognition Workshops, pages 1898–1906, 2017. 3

[46] Y. Lu, Y. Tai, and C. Tang. Conditional CycleGAN for attribute guided face image generation. CoRR, abs/1705.09966, 2017. 3

[47] Z. Lu, Z. Li, J. Cao, R. He, and Z. Sun. Recent progress of face image synthesis. CoRR, abs/1706.04717, 2017. 3

[48] P. Mullan, D. Cozzolino, L. Verdoliva, and C. Riess. Residual-based forensic comparison of video sequences. In IEEE International Conference on Image Processing, 2017. 3

[49] P. Pérez, M. Gangnet, and A. Blake. Poisson image editing. ACM Transactions on Graphics (TOG), 22(3):313–318, 2003. 4, 14

[50] R. Raghavendra, K. Raja, S. Venkatesh, and C. Busch. Transferable Deep-CNN features for detecting digital and print-scanned morphed face images. In IEEE Computer Vision and Pattern Recognition Workshops, 2017. 3

[51] N. Rahmouni, V. Nozick, J. Yamagishi, and I. Echizen. Distinguishing computer graphics from natural images using convolution neural networks. In IEEE Workshop on Information Forensics and Security, pages 1–6, 2017. 3, 6, 7, 14

[52] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv, 2018. 4

[53] H. T. Sencar and N. Memon. Digital Image Forensics — There is More to a Picture than Meets the Eye. Springer, 2013. 3

[54] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016. 14

[55] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4), 2017. 1, 3

[56] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. 2017. 6

[57] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt. Real-time expression transfer for facial reenactment. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH Asia 2015, 34(6):Art. No. 183, 2015. 2


[58] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2395, June 2016. 1, 3, 4, 5, 6, 13

[59] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. FaceVR: Real-time gaze-aware facial reenactment in virtual reality. ACM Transactions on Graphics 2018 (TOG), 2018. 3

[60] J. Thies, M. Zollhöfer, C. Theobalt, M. Stamminger, and M. Nießner. HeadOn: Real-time reenactment of human portrait videos. arXiv preprint arXiv:1805.11729, 2018. 3

[61] P. Upchurch, J. R. Gardner, K. Bala, R. Pless, N. Snavely, and K. Q. Weinberger. Deep feature interpolation for image content changes. CoRR, abs/1611.05507, 2016. 3

[62] W. Wang and H. Farid. Exposing Digital Forgeries in Interlaced and Deinterlaced Video. IEEE Transactions on Information Forensics and Security, 2(3):438–449, Sept. 2007. 3

[63] M. Zampoglou, S. Papadopoulos, and Y. Kompatsiaris. Detecting image splicing in the wild (Web). In IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2015. 3

[64] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, Oct 2016. 14

[65] P. Zhou, X. Han, V. Morariu, and L. Davis. Two-stream neural networks for tampered face detection. In IEEE Computer Vision and Pattern Recognition Workshops, pages 1831–1839, 2017. 3

[66] P. Zhou, X. Han, V. Morariu, and L. Davis. Learning rich features for image manipulation detection. In CVPR, 2018. 3

[67] M. Zollhöfer, J. Thies, D. Bradley, P. Garrido, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt. State of the art on monocular 3D face reconstruction, tracking, and applications. 2018. 1, 2


Method       Train      Validation   Test
Original     367,282    68,862       73,770
Face2Face    367,282    68,862       73,770
FaceSwap     292,376    54,630       59,672
DeepFakes    367,282    68,862       73,770

Table 3: Number of unique images in the dataset w.r.t. the method and the train/validation/test split used for our tests.

A. Pristine Data Acquisition

In this section we give details on the selection of videos that we use as input to the state-of-the-art manipulation methods. The videos have to fulfill certain criteria. Early experiments with all manipulation methods showed that the target face has to be nearly front-facing to prevent the methods from failing or producing strong artifacts (see Fig. 4 in the main paper). In addition, the manipulation methods exhibit difficulties in handling occlusions. Thus, we ensure that the face is visible in all frames to generate the best possible results.

For a realistic scenario, we chose to collect videos in the wild, more specifically from YouTube. As the YouTube search engine limits search hits, we made use of the YouTube-8m dataset [4] to collect videos with the tags "face", "newscaster" or "newsprogram", and also included videos which we collected from the YouTube search interface with the same tags and additional tags like "interview", "blog" or "video blog". To ensure adequate video quality, we only downloaded videos that offer a resolution of 480p or higher. For every video, we save its metadata to sort the videos by properties later on. In order not to change any image information and to prevent the introduction of additional artifacts, we extract all video sequences into a lossless image format.

To match the above requirements, we first process all downloaded videos with a standard CPU Dlib face detector [39], which is based on Histograms of Oriented Gradients (HOG). During this step, we track the largest detected face by ensuring that the centers of two detections in consecutive frames are pixel-wise close. The histogram-based face tracker was chosen to ensure that the resulting video sequences contain few occlusions and thus contain easy-to-manipulate faces.
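A sketch of this detection-and-tracking step using the dlib HOG detector is shown below; the center-distance threshold is an assumption, since the paper only states that consecutive detection centers must be pixel-wise close.

import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()        # HOG-based CPU detector

def track_largest_face(frames, max_center_shift=60):
    # Keep the largest detected face per frame while its center stays close
    # to the previous frame's detection; stop the track otherwise.
    boxes, prev_center = [], None
    for frame in frames:
        detections = detector(frame, 1)             # upsample once for small faces
        if not detections:
            break
        d = max(detections, key=lambda r: r.width() * r.height())
        center = np.array([(d.left() + d.right()) / 2.0, (d.top() + d.bottom()) / 2.0])
        if prev_center is not None and np.linalg.norm(center - prev_center) > max_center_shift:
            break
        boxes.append((d.left(), d.top(), d.width(), d.height()))
        prev_center = center
    return boxes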

Methods like Face2Face or DeepFakes rely on many training images for each video. We therefore select all sequences with a minimum length of 280 frames after removing the first and last 10 frames, to omit transition effects like fading overlays in TV shows. In addition, we perform a manual screening of the resulting clips to ensure a high-quality video selection and to avoid videos with face occlusions. We selected 1,000 video sequences containing 509,914 images, which we use as our pristine data.

All examined manipulation methods need a source and a target video. In the case of facial reenactment, the expressions of the source video are transferred to the target video while retaining the identity of the target person. In contrast, the face swapping methods replace the face in the target video with the face in the source video. For the face swap methods, one also has to select source and target videos with similar resolutions to avoid strong visual artifacts. To this end, we compare the bounding box sizes that are computed by the Dlib face detector. Further, we ensure that a video pair contains persons of the same gender and that the frame rates are similar.

B. Classification of Manipulation Method

To train the XceptionNet classification network to distinguish between all three manipulation methods and the pristine images, we adapted the final output layer to return four class probabilities. The network is trained on the full dataset containing all pristine and manipulated images. On raw data the network achieves an accuracy of 99.08%, which slightly decreases to 97.33% for high quality compression and to 86.69% on low quality images.

C. Forgery Localization

Forgery localization is an important task in the field of digital multimedia forensics. It is the term for forgery segmentation, i.e., predicting a probability map of the size of the input image indicating whether a pixel has been altered or not. We built our database to also include such ground truth segmentation masks for each manipulation method; these are published alongside the image data and can be used to train neural networks in a supervised fashion. We evaluate the performance of our method described in the main paper on the task of forgery localization, i.e., instead of a binary decision per image, we predict per-pixel classifications. To this end, we use our method in combination with the XceptionNet, as it performed better than the other classification networks for the forgery detection task. In the following, we detail the network architecture and show an evaluation of our approach.

Network Architecture: To enable per-pixel classification, we adapt the architecture presented in the main paper as follows. We design the network to estimate the probability of a single patch being manipulated. To this end, we use a sliding window with a size of 128 × 128 pixels and a stride of 16, resulting in a probability map with one sixteenth the size of the original image. We use linear interpolation to distribute the likelihoods to all pixels of the frame.
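A sketch of this sliding-window inference is given below; the patch classifier is assumed to return two logits per 128 × 128 crop, analogous to the detection network above, and the whole set of patches is scored in one batch for brevity.

import torch
import torch.nn.functional as F

def localization_map(model, image, patch=128, stride=16):
    # image: (3, H, W) float tensor; returns a per-pixel fake-probability map (H, W).
    c, h, w = image.shape
    windows = image.unfold(1, patch, stride).unfold(2, patch, stride)  # (3, nH, nW, p, p)
    n_h, n_w = windows.shape[1], windows.shape[2]
    patches = windows.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)
    with torch.no_grad():
        probs = torch.softmax(model(patches), dim=1)[:, 1]             # P(fake) per patch
    grid = probs.reshape(1, 1, n_h, n_w)
    # Interpolate the coarse grid of patch scores back to full image resolution.
    return F.interpolate(grid, size=(h, w), mode="bilinear", align_corners=False)[0, 0]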


(a) DeepFakes   (b) Face2Face   (c) FaceSwap   (d) All — precision vs. recall curves for Raw, HQ, and LQ data.

Figure 10: Precision vs. recall curves of the localization task, evaluated on the three examined manipulation methods separately and on all methods. The training considered all manipulation methods at full resolution. The test set comprises both forged and pristine images.

Figure 11: Our forgery localization network is able to reliably detect the manipulated pixels of a face in an image. In this experiment, we manipulate only the lower part of the face using the Face2Face algorithm [58]. As can be seen, the trained network does not learn a simple face detector, but instead correctly identifies the manipulated regions.

(rows: DeepFakes, Face2Face, FaceSwap — real and fake examples; columns: frame, reference mask, and predictions on Raw, HQ, and LQ data)

Figure 12: Examples of forgery localization results on three real and manipulated images at our three quality levels.

Training: The network is trained on randomly extracted patches from the frames. We compute smooth labels for each patch over a 32 × 32 neighborhood around the center of the patch using a box filter, assuming manipulated pixels have the class label 1 and pristine pixels have the class label 0. Based on these smooth labels, which lie in the range [0, 1], we employ a cross-entropy loss to train the network. We use Adam [40] to optimize the network, with a learning rate of 0.0001 and a mini-batch size of 96 patches. The 96 patches are extracted from 16 forged frames and the corresponding 16 pristine ones. Convergence of the training is measured by the change in accuracy: if the accuracy does not improve for 10 epochs (each of 1000 iterations), we stop the training process.
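The smooth label for a patch is simply the fraction of manipulated pixels in the 32 × 32 window around its center; a minimal sketch:

def smooth_patch_label(mask, cy, cx, half=16):
    # mask: binary ground-truth map as a NumPy array (1 = manipulated, 0 = pristine);
    # (cy, cx): patch center. Returns a soft target in [0, 1].
    window = mask[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return float(window.mean())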

Evaluation of the Localization Performance: Fig. 10 shows an evaluation of our approach applied to our test set. As can be seen, our approach is able to reliably localize the manipulated region in a raw image. Similar to the detection accuracy reported in the main paper, the performance is lowered by a reduced quality level but remains at a high level. In Fig. 12, we show qualitative results on randomly picked images from the test set.

We also conducted an experiment on a face manipulation that does not influence the whole face region (see Fig. 11). Specifically, we want to highlight that forgery localization in the case of facial manipulations is not equivalent to face segmentation. As can be seen, our detection approach is able to detect the region of the manipulation when half of the face is untouched.

Conclusion As shown in the main paper, our dataset can be used for various tasks in the field of digital multimedia forensics. Specifically, we addressed the task of forgery localization, i.e., the segmentation of the altered region. Our approach enables forgery localization in high-quality as well as in low-quality video footage.


D. Hyperparameters

For reproducibility, we detail the hyperparameters used for the methods in the main paper. We structured this section into two parts: one for the manipulation methods and a second for the classification approaches used for forgery detection.

D.1. Manipulation Methods

Only DeepFakes is learning-based; for the other manipulation methods we used the default parameters of the approaches. Our DeepFakes implementation is based on the deepfakes faceswap GitHub project [1]. We keep most of the default parameters of this implementation. MTCNN [64] is used to extract and align the images for each video. Specifically, the largest face in the first frame of a sequence is detected and tracked throughout the whole video. This tracking information is used to extract the training data for DeepFakes. The auto-encoder takes input images of size 64 × 64 (default). It uses a shared encoder consisting of four convolutional layers which downsize the image to a bottleneck of 4 × 4, where we flatten the input, apply a fully connected layer, reshape the dense layer, and apply a single upscaling step using a convolutional layer and a pixel shuffle layer (see [54]). The two decoders use three identical up-scaling layers to attain the full input image resolution. All layers use Leaky ReLUs as non-linearities. The network is trained using Adam with a learning rate of 10⁻⁵, β1 = 0.5 and β2 = 0.999, as well as a batch size of 64. In our experiments we run the training for 200,000 iterations on a cloud platform.

Since the training procedure of the auto-encoder networks is very time-consuming, we also publish the corresponding models.
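A PyTorch sketch of this shared-encoder / two-decoder architecture is given below; the channel widths are assumptions in the spirit of the public implementation's defaults, not the exact released configuration.

import torch
from torch import nn

def down(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 5, stride=2, padding=2), nn.LeakyReLU(0.1))

def up(cin, cout):
    # Convolution followed by pixel shuffle doubles the spatial resolution.
    return nn.Sequential(nn.Conv2d(cin, cout * 4, 3, padding=1), nn.PixelShuffle(2), nn.LeakyReLU(0.1))

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(down(3, 128), down(128, 256), down(256, 512), down(512, 1024))
        self.fc = nn.Linear(1024 * 4 * 4, 1024)          # 64x64 input -> 4x4 bottleneck
        self.unfc = nn.Linear(1024, 1024 * 4 * 4)
        self.upscale = up(1024, 512)                      # 4x4 -> 8x8

    def forward(self, x):
        x = self.conv(x).flatten(1)
        x = self.unfc(self.fc(x)).view(-1, 1024, 4, 4)
        return self.upscale(x)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Sequential(up(512, 256), up(256, 128), up(128, 64))   # 8x8 -> 64x64
        self.out = nn.Conv2d(64, 3, 5, padding=2)

    def forward(self, x):
        return torch.sigmoid(self.out(self.up(x)))

encoder, decoder_a, decoder_b = Encoder(), Decoder(), Decoder()
params = list(encoder.parameters()) + list(decoder_a.parameters()) + list(decoder_b.parameters())
optimizer = torch.optim.Adam(params, lr=1e-5, betas=(0.5, 0.999))
# Training reconstructs faces of identity A with decoder_a and faces of B with decoder_b
# (e.g., with an L1 reconstruction loss); at test time a face of B is encoded and decoded
# with decoder_a to produce the swap, which is then Poisson-blended into the target frame.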

By exchanging the decoder of one person with that of another, we can generate an identity-swapped face region. To insert the face into the target image, we chose Poisson image editing [49] to achieve a seamless blending result.

D.2. Classification Methods

For our forgery detection pipeline proposed in the main paper, we conducted studies with five classification approaches based on convolutional neural networks. The networks are trained using the Adam optimizer with different parameters for learning rate and batch size. In particular, for the network proposed by Cozzolino et al. [17] the learning rate is 10⁻⁵ with a batch size of 16. For the proposal of Bayar and Stamm [10], we use a learning rate of 10⁻⁵ with a batch size of 64. The network proposed by Rahmouni et al. [51] is trained with a learning rate of 10⁻⁴ and a batch size of 64. MesoNet [5] uses a batch size of 76; its learning rate is initially set to 10⁻³ and is consecutively reduced by a factor of ten per epoch down to 10⁻⁶. Our XceptionNet [14]-based approach is trained with a learning rate of 0.0002 and a batch size of 32.

The moments of Adam are set to the default values (β1 = 0.9, β2 = 0.999) for all experiments. The stopping criterion of the training process is triggered when the validation accuracy does not improve for 10 consecutive epochs (an epoch equals a fourth of the training set). We keep the network weights of the epoch with the best validation accuracy.