Research Article

DTFA-Net: Dynamic and Texture Features Fusion Attention Network for Face Antispoofing

Xin Cheng,1 Hongfei Wang,1 Jingmei Zhou,2 Hui Chang,1 Xiangmo Zhao,1 and Yilin Jia3,4

1 School of Information Engineering, Chang'an University, Xi'an 710064, China
2 School of Electronic and Control Engineering, Chang'an University, Xi'an 710064, China
3 Xi'an University of Architecture and Technology, Xi'an, China
4 University of South Australia An De College, Adelaide, Australia

Correspondence should be addressed to Xin Cheng (xincheng@chd.edu.cn), Hongfei Wang (herr_wang@chd.edu.cn), and Jingmei Zhou (jmzhou@chd.edu.cn).

Received 18 May 2020; Accepted 16 June 2020; Published 10 July 2020

Guest Editor: Zhihan Lv

Complexity, Volume 2020, Article ID 5836596, 11 pages, https://doi.org/10.1155/2020/5836596

Copyright © 2020 Xin Cheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

For face recognition systems, liveness detection can effectively prevent fraud and improve the safety of the system. Common face attacks include photo printing and video replay attacks. This paper studies the differences between photos, videos, and real faces in static texture and motion information and proposes a liveness detection structure based on feature fusion and an attention mechanism, the Dynamic and Texture Features Fusion Attention Network (DTFA-Net). We propose a dynamic information fusion structure with an interchannel attention block that fuses the magnitude and direction of optical flow to extract facial motion features. In addition, to address the failure of HOG-based face detection under complex illumination, we propose an improved Gamma image preprocessing algorithm that effectively improves face detection. We conducted experiments on the CASIA-MFSD and Replay Attack databases. The proposed DTFA-Net achieves 6.9% EER on CASIA-MFSD and 2.2% HTER on Replay Attack, which is comparable to other methods.

1. Introduction

With the application of face recognition technology in identification scenarios such as access security checks and face payment, methods of attack and fraud against face recognition systems have also appeared. A face is obviously a much easier target for stealing identity information than biometric features such as the iris and fingerprints: attackers can easily steal images or videos of legitimate users from social networking sites and then launch print or replay attacks on face recognition systems. Some face verification systems use techniques such as face tracking to locate key points on the face, require users to complete actions such as blinking, shaking their heads, and reading text aloud, and use motion detection to determine whether the current image shows a real face. This approach is not suitable for silent detection scenarios. In addition, some researchers use infrared cameras, depth cameras, and other sensors to collect different modalities of face images for liveness detection [1–3]. These methods show excellent performance in many scenarios but require information acquisition equipment beyond the camera to be added to face recognition devices, incur additional hardware costs, and cannot meet the requirements of some mobile devices. In this paper, we study monocular, static, and silent liveness detection and achieve the liveness detection task by analyzing the differences between real and fake faces in image texture, facial structure, action changes, and so on.
Real face images are usually captured directly by the camera, while attack face images have been captured multiple times. As shown in Figure 1, fake face images may show the texture of the image carrier itself, and bright regions that differ markedly from real face images also appear easily in fake face images. Based on this, researchers have proposed



many feature descriptors for characterizing the liveness texture of faces and then implemented classification by training models such as SVM and LDA classifiers. In order to characterize high-level semantic features of face liveness, deep neural networks have been applied in the feature extraction process to further enhance the performance of liveness detection. The features contained in local areas of the face can often serve as an important basis for liveness detection and play different roles, as shown in Figure 2. Based on this, some researchers [4, 5] decomposed faces into different regions, extracted features from each region through neural networks, and then spliced the features.

Most prosthetic faces can hardly simulate the vital signs of real faces, such as head movement, lip peristalsis, and blinking. At the same time, due to background noise, skin texture, and other factors, the dynamic characteristics of real faces in some frequency bands are obviously stronger than those of fraudulent faces, which provides a basis for distinguishing real faces from fraudulent ones. The variation in the optical flow field is an important cue for this kind of algorithm. However, the dynamic information generated by the movement and bending of a photo will interfere with the extraction of life signals. Remote photoplethysmography (rPPG) is another effective noncontact living-signal extraction method, which provides a basis for face liveness detection by observing face images to calculate the changes in blood flow and flow rate [6, 7], but the rPPG method places strict requirements on the application environment.

This work proposes a network that fuses dynamic and texture information to represent faces and detect attacks. The optical flow method is used to calculate the motion change between two adjacent frames of face images. The optical flow generated by the bending and movement of a photo differs in displacement direction from the optical flow generated by the movement of a real face. We use two simple convolutional neural networks with the same structure to characterize the magnitude and direction of displacement. Then, a feature fusion module is designed to combine the above two representations so that facial motion features can be further extracted on this basis. In addition, RGB images are used to extract texture information of the face area. By giving different attention to different parts of the face, we enhance the network's ability to represent living faces.

Face detection algorithms are widely used in liveness detection tasks to locate faces, thereby eliminating the interference of background information. In this paper, for face detection scenes under complex lighting, we propose an improved image preprocessing algorithm combined with the local contrast of the face area, which effectively improves the performance of the face detection algorithm.

2. Related Works

2.1. Texture Based. Liveness verification is completed by using the differences between real faces and replayed images in surface texture, 3D structure, image quality, and so on. Boulkenafet et al. [8] analyzed the chroma and brightness differences

between real and fake face images based on the color local binary pattern; the feature histogram of each order of the image frequency bands was extracted as the face texture representation, and classification was finally realized by a support vector machine, obtaining a half total error rate of 2.9% on the Replay Attack Dataset. Galbally et al. [9] proved that the image quality loss produced by Gaussian filtering can effectively distinguish real from fraudulent face images, designed a quality assessment vector containing 14 indicators, and proposed a liveness detection method in combination with LDA (linear discriminant analysis), obtaining a 15.2% half total error rate on the Replay Attack Dataset. However, such methods based on static features often require the design of specific descriptors for certain types of attacks, and their robustness is poor under different lighting conditions and different fraud carriers [10].

2.2. Dynamic Based. Some researchers have proposed face liveness detection algorithms based on dynamic features obtained by analyzing face motion patterns, and these show good performance on related datasets [11]. Kim et al. [12] designed a local speed pattern based on the estimated diffusion speed of light and distinguished fraudulent from real faces according to the difference in diffusion speed between light on a real face and on a fraud carrier surface; a 12.50% half total error rate was obtained on the Replay Attack Dataset. Bharadwaj et al. [13] amplified the 0.2–0.5 Hz blink signal in the image with the Eulerian motion magnification algorithm and combined the local binary pattern with the histogram of oriented optical flow (LBP-HOOF) to extract dynamic features as the classification basis, obtaining an error rate of 1.25% on the Replay Attack Dataset; at the same time, they demonstrated the positive effect of the motion magnification algorithm on performance. Freitas et al. [14] drew on facial expression detection methods, extracted feature histograms from the three orthogonal planes of the spatio-temporal domain using the LBP-TOP operator, used a support vector machine for classification, and obtained a 7.6% half total error rate on the Replay Attack Dataset. Xiaoguang et al. [15], based on the action information between adjacent frames, established a CNN-LSTM network model that uses a convolutional neural network to extract texture features of adjacent face frames and then feeds them to a long short-term memory structure to learn the time-domain action information in face video.

In addition, some researchers combined different detection devices or system modules to fuse information at different levels, which effectively increased the accuracy of liveness detection [1, 16]. Zhang and Wang [17] used an Intel RealSense SR300 camera to construct a multimodal face image database including RGB images, depth images, and infrared (IR) images. The face region was accurately located using the 3D face reconstruction network PRNet [18] and a mask operation, and then a ResNet-18 classification network [19] was used to extract and fuse features of the multimodal data mixing RGB, depth, and IR.


3. Proposed Method

3.1. Face Detection in Complex Illumination. In order to eliminate the interference of the background in the process of liveness information extraction, it is necessary to segment the face area of the image. Traditional detection techniques can be divided into three categories: feature-based face detection, template-based face detection, and statistics-based face detection. This paper uses the frontal face detection API provided by Dlib, which uses the histogram of oriented gradients feature to achieve face detection. A face detection algorithm based on the gradient orientation histogram maintains good invariance to geometric and optical deformations of the image while ignoring slight texture and expression changes.
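As a reference, the following minimal sketch calls Dlib's HOG-based frontal face detector on a single image (the file name and upsampling factor are illustrative, not taken from the paper):

```python
import cv2
import dlib

# Minimal sketch: locate faces with Dlib's HOG-based frontal face detector.
detector = dlib.get_frontal_face_detector()
image = cv2.imread("face.jpg")                      # illustrative path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
rects = detector(gray, 1)                           # second argument: number of upsampling steps
for r in rects:
    cv2.rectangle(image, (r.left(), r.top()), (r.right(), r.bottom()), (0, 255, 0), 2)
```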

The histogram of oriented gradients (HOG) is a method used to describe the local texture features of an image. The algorithm divides the image into small cells and calculates the gradient of the pixels in each cell. The pixel gradient calculation is shown in the following equations:

G_x(x, y) = I(x + 1, y) − I(x − 1, y),   (1)

G_y(x, y) = I(x, y + 1) − I(x, y − 1),   (2)

where G_x(x, y) and G_y(x, y) are the horizontal and vertical gradients at position (x, y) of the image, respectively, and I(x, y) is the gray value. In reality, local shading or overexposure affects the extraction of gradient information because the imaged target may appear under different lighting environments, as shown in Figure 3. In order to enhance the robustness of the HOG feature descriptor to environmental changes and reduce noise such as local shadows in the image, a Gamma correction algorithm is used to preprocess the image and eliminate the interference of uneven lighting.
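For illustration, a minimal sketch of the pixel gradient computation in equations (1) and (2) on a grayscale image (border pixels are simply left at zero; this is not the paper's implementation):

```python
import numpy as np

def hog_gradients(gray):
    """Central differences of Eqs. (1)-(2); borders are left at zero for simplicity."""
    I = gray.astype(np.float32)
    gx = np.zeros_like(I)
    gy = np.zeros_like(I)
    gx[:, 1:-1] = I[:, 2:] - I[:, :-2]                   # G_x(x, y) = I(x+1, y) - I(x-1, y)
    gy[1:-1, :] = I[2:, :] - I[:-2, :]                   # G_y(x, y) = I(x, y+1) - I(x, y-1)
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180   # unsigned gradient orientation
    return magnitude, orientation
```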

The traditional Gamma correction method changes the brightness of an image by selecting an appropriate γ operator as follows:

O(x, y) = 255 × [I(x, y)/255]^γ,   (3)

where I(x, y) is the pixel value of the image at position (x, y), O(x, y) is the corrected pixel value, and γ is a constant. The traditional method processes the image at the global level without considering the lightness difference between a pixel and its neighborhood. Therefore, Schettini et al. [20] proposed a formula for the value of the γ operator:

γ[x, y] = z^([128 − mask(x, y)]/128),

z = { ln(I/255)/ln(0.5),   I < 128
    { 1,                   I = 128
    { ln(0.5)/ln(I/255),   I > 128    (4)

where mask is an image mask; in practice, a Gaussian blur of the image can be used. For a balanced image containing both bright and dark areas, the average pixel value is close to 128, so the calculated γ is close to 1 and the image is hardly changed, which obviously does not meet actual needs. Considering the local features of the face, this paper introduces

Figure 1: Face print and replay attack images. The attack face has been captured multiple times, showing differences from the real face in texture, lighting, and image quality.

Figure 2: Weight visualization of a layer in a deep neural network for real-face texture information extraction. Different face regions carry different weights in the liveness detection task.


the local normalization method proposed in [21] to calculate the ratio relation of pixels in the neighborhood and adjust the operator z:

z(x, y) = { ln(I/255)/ln(0.5) + N(x, y) · ln(I/255)/ln(0.5),   I < 128
          { ln(0.5)/ln(I/255) + N(x, y) · ln(0.5)/ln(I/255),   I ≥ 128    (5)

The specific calculation process of the local normalized feature N is as follows:

(1) Calculate the maximum pixel value I_m(x, y) in the neighborhood φ(x, y) centered on pixel (x, y):

I_m(x, y) = max{ I(i, j) | (i, j) ∈ φ(x, y) }.   (6)

(2) Calculate the median of I_m over the neighborhood centered on pixel (x, y):

I_mm(x, y) = median{ I_m(i, j) | (i, j) ∈ φ(x, y) }.   (7)

(3) Calculate the maximum of I_mm over the neighborhood centered on pixel (x, y):

S(x, y) = max{ I_mm(i, j) | (i, j) ∈ φ(x, y) }.   (8)

(4) Calculate the ratio of pixel (x, y) to its neighborhood pixels:

N(x, y) = I(x, y)/S(x, y).   (9)

We used the algorithm in [20] and the improved algorithm proposed in this paper to preprocess 208 portrait photos from the YaleB subdatabase that are difficult for HOG to detect under complex lighting conditions; the two methods then detected 196 and 201 faces, respectively. The result is shown in Figure 4.
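A minimal sketch of the improved preprocessing, assuming a 3 × 3 neighborhood φ and a Gaussian-blurred luminance mask (the window size and blur scale are assumptions; the paper does not state them):

```python
import cv2
import numpy as np

def local_gamma_correction(gray, ksize=3, sigma=30):
    """Sketch of the locally adaptive Gamma correction of Eqs. (3)-(9); parameters are assumed."""
    I = gray.astype(np.float32)
    kernel = np.ones((ksize, ksize), np.uint8)
    # Local normalization feature N, Eqs. (6)-(9):
    Im = cv2.dilate(I, kernel)                                           # neighborhood maximum, Eq. (6)
    Imm = cv2.medianBlur(Im.astype(np.uint8), ksize).astype(np.float32)  # median of Im, Eq. (7)
    S = cv2.dilate(Imm, kernel)                                          # maximum of Imm, Eq. (8)
    N = I / (S + 1e-6)                                                   # pixel-to-neighborhood ratio, Eq. (9)
    # Adjusted operator z, Eq. (5):
    ratio = np.clip(I / 255.0, 1e-6, 1.0 - 1e-6)
    z = np.where(I < 128,
                 np.log(ratio) / np.log(0.5) + N * np.log(ratio) / np.log(0.5),
                 np.log(0.5) / np.log(ratio) + N * np.log(0.5) / np.log(ratio))
    # Gamma exponent modulated by a Gaussian-blurred mask, Eq. (4), and correction, Eq. (3):
    mask = cv2.GaussianBlur(I, (0, 0), sigma)
    gamma = z ** ((128.0 - mask) / 128.0)
    O = 255.0 * (I / 255.0) ** gamma
    return np.clip(O, 0, 255).astype(np.uint8)
```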

3.2. DTFA-Net Architecture. In this section, we introduce the dynamic and texture features fusion attention network, DTFA-Net. As shown in Figure 5, the optical flow maps and the texture image are respectively processed by the dynamic feature and texture feature extraction subnetworks to obtain 256∗2 and 256∗4 embeddings; the spliced 256∗6 features are then fused through fully connected layers for liveness detection. The specific details of the network are described below.

3.2.1. Dynamic Feature Fusion. This paper generates the optical flow field change map of two adjacent frames of a face video by the optical flow method. The optical flow change in the face region is extracted by the dynamic feature fusion subnetwork along the two dimensions of displacement direction and magnitude, and the features of the two dimensions are fused by the feature fusion block to extract the dynamic information of the face region.

(1) Optical Flow. The optical flow method is an approach used to describe the motion information of objects between adjacent frames. It reflects the interframe field changes by calculating the motion displacement in the x and y directions of the image over the time domain. Define a point P in the video located at (x, y) in the image at time t and moving to (x + dx, y + dy); then, when dt approaches 0, the two pixel values satisfy the following relationship:

I(v) = I(v + d),   (10)

where v = (x, y) is the coordinate of point P at time t, I(v) is the gray value at (x, y) at time t, d = (dx, dy) is the displacement of point P during dt, and I(v + d) is the gray value at (x + dx, y + dy) at time t + dt.

In this paper, the dense optical flow method proposed by Farneback [22] is used to calculate the interframe displacement in face video. The algorithm approximates the pixels of two frames by a polynomial expansion transformation and, based on the assumption that the local optical flow and the image gradient are stable, derives the displacement field from the polynomial expansion coefficients. We transform the displacement d = (dx, dy) to polar coordinates d = (ρ, θ) and visualize the optical flow magnitude and direction with the HSV model. As shown in Figure 6, the obtained optical flow change images are used as input to the dynamic feature fusion network.
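A minimal sketch of this step with OpenCV (the Farneback parameters below are common defaults, not the paper's settings; the HSV encoding follows the convention of Figure 6):

```python
import cv2
import numpy as np

def flow_magnitude_direction(prev_gray, next_gray):
    """Dense Farneback optical flow between two face crops, visualized as
    direction and magnitude images in the HSV model (Figure 6 convention)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])    # d = (rho, theta)
    h, w = prev_gray.shape
    hsv_dir = np.full((h, w, 3), 255, np.uint8)               # direction image: hue encodes the angle
    hsv_dir[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)
    hsv_mag = np.full((h, w, 3), 255, np.uint8)               # magnitude image: value encodes the magnitude
    hsv_mag[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv_dir, cv2.COLOR_HSV2BGR), cv2.cvtColor(hsv_mag, cv2.COLOR_HSV2BGR)
```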

(2) Fusion Attention Module. In the process of dynamic information extraction, we separately extract the motion information contained in the optical flow direction map and the optical flow magnitude map through 5 convolution layers each. Because the motion pattern of a living human face contains the two dimensions of direction and intensity, it is necessary to combine the above representations to further extract the motion features of the face. As a result, we designed a fusion module as shown in Figure 7.

Figure 3: HOG features of a shadowed face region under complex lighting. It is necessary to preprocess the face image because the shadows or overexposure caused by complex lighting can affect the gradient information of the face region.


Figure 4: Comparison between [20] and ours. The improved algorithm we propose performs better than that in [20]. (a) Original images in which HOG cannot detect a face; (b) images processed by [20] and the detection results; (c) images processed by our method and the detection results.

Figure 5: The dynamic and texture features fusion attention network (DTFA-Net) architecture: the inputs Img_Vec, Img_Mag, and Img_Tex pass through the branches VecConv1-5, MagConv1-5, and TexConv1-5, together with the fusion block, the spatial attention, and the fully connected layers FC1-FC3, to produce the (attack, real) output. The figure omits the ReLU and pooling layers after the convolution layers, and the id of each convolution is shown at the top. Color code: pink, convolution; blue, fusion block; gray, spatial attention; green, fully connected layer.


To improve the representation ability of the model, we use the SE structure [23] in the fusion module, which assigns different weights to the optical flow intensity and direction features to strengthen the decision-making ability of some features. First, global pooling of the feature maps is performed:

op_c = AvgPool(F_op) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} F_op(i, j),   (11)

where F_op(i, j) stands for the concatenated features of optical flow magnitude and angle. Through global average pooling, the dimension of the concatenated feature map changes from C × H × W to C × 1 × 1. Secondly, the nonlinear relationship between channels is learned through fully connected (FC) layers and an activation function (ReLU); then, normalization (sigmoid) is used to obtain the weight of each channel:

op_a = σ(FC(δ(FC(op_c)))),   (12)

where σ is the sigmoid function and δ is the ReLU function. The two fully connected layers are used to reduce and then recover the dimension, respectively, which helps increase the complexity of the function. Finally, we multiply F_op by op_a and pass the result through a convolution layer to obtain the fused features:

F_op+ = Conv(op_a ⊗ F_op).   (13)
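A minimal PyTorch sketch of this fusion attention block, following equations (11)-(13) and the layer sizes in Table 2 (the 512 to 32 reduction and the pooling padding are read from Tables 2 and 3; other details are assumptions rather than the authors' exact implementation):

```python
import torch
import torch.nn as nn

class FusionAttentionBlock(nn.Module):
    """SE-style channel attention over the concatenated magnitude/direction features
    (Eqs. (11)-(12)), followed by the fusing convolution and pooling (Eq. (13), Table 2)."""
    def __init__(self, channels=512, hidden=32, out_channels=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                       # Eq. (11): global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),   # dimension reduction
            nn.Linear(hidden, channels), nn.Sigmoid())            # Eq. (12): per-channel weights
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, out_channels, kernel_size=3, stride=2),  # LK_Conv: 13x13x512 -> 6x6x256
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1))            # LK_MaxPool (padding assumed so Table 3 sees 256*6*6)

    def forward(self, f_mag, f_vec):
        f_op = torch.cat([f_mag, f_vec], dim=1)                   # concatenate the two dynamic streams
        b, c, _, _ = f_op.shape
        op_a = self.fc(self.pool(f_op).view(b, c)).view(b, c, 1, 1)
        return self.fuse(op_a * f_op)                             # Eq. (13): reweight channels, then fuse
```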

(3) Network Details. The input image size of the dynamic feature extraction subnetwork is 227 × 227 × 3, and the subnetwork contains 11 convolution layers, 2 fully connected layers, and 6 pooling layers. Tables 1–3 show the specific parameters of the convolution and pooling layers.

3.2.2. Texture Feature Representation. Specifically, we map the input RGB image to intermediate feature maps with 384 channels through TexConv1-4, emphasize some regions through the spatial attention mechanism, and then feed the output of the attention module to TexConv5 and the fully connected layer FC2 for feature extraction. The structure of the convolutional layers TexConv1-5 is shown in Table 1, and the structure of the fully connected layer FC2 is shown in Table 4.

(1) Spatial Attention Block. Through experiments, we found that neural networks often pay special attention to the eyes, cheeks, mouth, and other areas when extracting liveness features. Therefore, we added a spatial attention module to the static texture extraction structure to give different

Figure 6: Optical flow visualization of two adjacent face frames. (a) Visualization of optical flow direction: hue = direction of the optical flow, saturation = 255, value = 255. (b) Visualization of optical flow magnitude: hue = 255, saturation = 255, value = magnitude of the optical flow. In each panel, the left pair shows the optical flow changes of a real face and the right pair shows those of a photo attack.

Figure 7: Fusion attention module architecture (global average pooling, FC layers with ReLU, sigmoid, channel-wise product, and a convolution layer).


attention to the features of different face regions. We adopted the CBAM spatial attention structure proposed in [24] (Figure 8). This module reduces the dimension of the input feature map through maximum pooling and average pooling layers, splices the two pooled maps, and obtains an attention weight of size 1 ∗ H ∗ W through a convolution layer and an activation function:

SA_c = δ(Conv(Cat(AvgPool(F_t), MaxPool(F_t)))).   (14)

Finally, we apply an element-wise product to the input F_t and SA_c, and the output of the spatial attention block passes through the next layers, TexConv5 and FC2:

F_t+ = SA_c ⊗ F_t.   (15)
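A minimal PyTorch sketch of this spatial attention block, following equations (14)-(15); the 7 × 7 kernel and the sigmoid activation follow the original CBAM [24] and are assumptions with respect to this paper:

```python
import torch
import torch.nn as nn

class SpatialAttentionBlock(nn.Module):
    """CBAM-style spatial attention: channel-wise average and max pooling, concatenation,
    a convolution (Eq. (14)), and element-wise reweighting of the texture features (Eq. (15))."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.act = nn.Sigmoid()

    def forward(self, f_t):
        avg = f_t.mean(dim=1, keepdim=True)       # average pooling over channels -> 1 x H x W
        mx, _ = f_t.max(dim=1, keepdim=True)      # max pooling over channels -> 1 x H x W
        sa = self.act(self.conv(torch.cat([avg, mx], dim=1)))   # Eq. (14)
        return sa * f_t                           # Eq. (15)
```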

3.2.3. Feature Fusion. Through the above two subnetworks, dynamic information and texture information are obtained, respectively. Through a series of fully connected layers, dropout layers, and activation functions, we fully fuse the two kinds of information, learn the nonlinear relationship between the dynamic and static features, and obtain a two-dimensional representation of the face liveness information for liveness detection, as shown in Table 4.
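As an illustration, a hedged PyTorch sketch of this top-level fusion: the dynamic branch yields a 256∗2 embedding, the texture branch a 256∗4 embedding, and FC3 (Table 4) maps the concatenated 256∗6 vector to the two-class output. The branch modules and the dropout placement are placeholders, not the authors' exact code:

```python
import torch
import torch.nn as nn

class DTFANetHead(nn.Module):
    """Fuses the dynamic and texture embeddings and classifies (attack, real), per Figure 5 / Table 4."""
    def __init__(self, dynamic_branch, texture_branch):
        super().__init__()
        self.dynamic_branch = dynamic_branch      # optical flow magnitude + direction -> 256*2 vector
        self.texture_branch = texture_branch      # RGB texture image -> 256*4 vector
        self.fc3 = nn.Sequential(                 # FC3 in Table 4: 256*6 -> 256*3 -> 256 -> 2
            nn.Linear(256 * 6, 256 * 3), nn.ReLU(inplace=True), nn.Dropout(0.5),  # dropout placement assumed
            nn.Linear(256 * 3, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 2))

    def forward(self, img_mag, img_vec, img_tex):
        d = self.dynamic_branch(img_mag, img_vec)          # dynamic embedding, shape (B, 512)
        t = self.texture_branch(img_tex)                   # texture embedding, shape (B, 1024)
        return self.fc3(torch.cat([d, t], dim=1))          # logits over (attack, real)
```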

4. Experiment

4.1. Dataset. We use CASIA-MFSD [25] to train and test the model. The dataset contains a total of 600 face videos collected from 50 individuals. Videos of real faces, photo attacks, and video attacks are collected at different resolutions; the photo attacks include photo bending and photo-mask attacks. We ignore the different attack types and divide all videos into real faces and fake faces. Through optical flow field calculation, face region detection, cropping, and so on, we obtain 35,428 sets of training images and 64,674 sets of test images, as shown in Figure 9. We also train and test our model on the Replay Attack Database.

4.2. Evaluation. This experiment uses the false acceptance rate (FAR), false rejection rate (FRR), equal error rate (EER), and half total error rate (HTER). The face liveness detection algorithm is evaluated on these indicators. The FAR is the ratio of fake faces judged as real, and the FRR is the ratio of real faces judged as fake; the calculation formulas are as follows:

FAR = N_f_r / N_f,   (16)

FRR = N_r_f / N_r,   (17)

where N_f_r is the number of fake faces accepted as real, N_r_f is the number of real faces rejected as fake, N_f is the number of fake face samples, and N_r is the number of real face samples. The two classification methods used in this experiment are as follows: (1) nearest neighborhood (NN), which takes the two-dimensional output vector, of which each

Table 1: The network structure of Mag_Conv1-5.

Layer     | Input size | Kernel size | Filters | Stride
Mag_Conv1 | 227∗227∗3  | 11∗11∗3     | 96      | 4
MaxPool1  | 27∗27∗64   | 3∗3         | -       | 2
Mag_Conv2 | 27∗27∗64   | 5∗5∗64      | 192     | 1
MaxPool2  | 27∗27∗192  | 3∗3         | -       | 2
Mag_Conv3 | 13∗13∗192  | 3∗3∗192     | 384     | 1
Mag_Conv4 | 13∗13∗384  | 3∗3∗384     | 256     | 1
Mag_Conv5 | 13∗13∗256  | 3∗3∗256     | 256     | 1
∗ VecConv1-5 and TexConv1-5 parameters are the same as those of MagConv1-5.

Table 2: The structure of the fusion attention module.

Layer         | Input size | Kernel size | Filters | Stride
GlobalAvgPool | 13∗13∗512  | 13∗13       | -       | 1
Fc1           | 512        | -           | -       | -
Fc2           | 32         | -           | -       | -
LK_Conv       | 13∗13∗512  | 3∗3∗512     | 256     | 2
LK_MaxPool    | 6∗6∗256    | 3∗3         | -       | 1

Table 3: The structure of FC1 in Figure 5.

Layer | Input size | Output size
FC1_1 | 256∗6∗6    | 256∗3∗3
FC1_2 | 256∗3∗3    | 256∗2∗2
FC1_3 | 256∗2∗2    | 256∗2

Table 4: The structure of FC2-3 in Figure 5.

Layer | Input size | Output size
FC2_1 | 256∗6∗6    | 256∗3∗3
FC2_2 | 256∗3∗3    | 256∗2∗2
FC3_1 | 256∗6      | 256∗3
FC3_2 | 256∗3      | 256
FC3_3 | 256        | 2

Figure 8: Spatial attention block (channel-wise average pooling and max pooling, a convolution layer, sigmoid, and an element-wise product). We introduce this module after the convolution layers of the static feature extraction subnetwork, giving different attention to local areas of the face.


dimension value represents the probability of a real face or an attack face, and selects the category corresponding to the maximum value as the classification result; (2) thresholding, which selects a certain threshold to classify the representation result and is mainly used for model validation and testing. Calculating the FAR and FRR at different thresholds yields the receiver operating characteristic (ROC) curve; for measuring the imbalance in the classification problem, the area under the ROC curve (AUC) can intuitively show the classification performance of the algorithm.
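For reference, a small sketch of how FAR, FRR (equations (16)-(17)), and the EER can be computed from score arrays; the variable names are illustrative, and the EER is approximated at the threshold where FAR and FRR are closest:

```python
import numpy as np

def far_frr(scores_fake, scores_real, threshold):
    """Eqs. (16)-(17): scores at or above the threshold are accepted as real faces."""
    far = float(np.mean(scores_fake >= threshold))   # fake faces accepted as real
    frr = float(np.mean(scores_real < threshold))    # real faces rejected as fake
    return far, frr

def equal_error_rate(scores_fake, scores_real):
    """EER: the operating point where FAR and FRR (approximately) coincide."""
    thresholds = np.unique(np.concatenate([scores_fake, scores_real]))
    rates = [far_frr(scores_fake, scores_real, t) for t in thresholds]
    i = int(np.argmin([abs(f - r) for f, r in rates]))
    return (rates[i][0] + rates[i][1]) / 2.0

# HTER is the mean of FAR and FRR on the test set at a threshold fixed beforehand (e.g., on a development set).
```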

4.3. Implementation Details. The proposed method is implemented in PyTorch with a varying learning rate (lr = 0.01 when epoch < 5 and lr = 0.001 when epoch ≥ 5). The batch size is 128 with num_workers = 100. We initialize our network with the parameters of AlexNet. The network is trained with standard SGD for 50 or 100 epochs on a Tesla V100 GPU. We use the cross-entropy loss, and the input resolution is 227 × 227.
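A hedged sketch of this training setup (the momentum value and the loop structure are assumptions; only the learning rate schedule, batch size, optimizer, and loss follow the text):

```python
import torch
import torch.nn as nn

def train_dtfa(model, train_loader, epochs=50, device="cuda"):
    """Sketch of Section 4.3: SGD, cross-entropy loss, lr = 0.01 for epochs 0-4 and 0.001 afterwards."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # momentum assumed
    for epoch in range(epochs):
        for group in optimizer.param_groups:
            group["lr"] = 0.01 if epoch < 5 else 0.001
        for img_mag, img_vec, img_tex, label in train_loader:   # batch size 128 in the paper
            img_mag, img_vec, img_tex, label = (x.to(device) for x in (img_mag, img_vec, img_tex, label))
            optimizer.zero_grad()
            loss = criterion(model(img_mag, img_vec, img_tex), label)
            loss.backward()
            optimizer.step()
    return model
```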

4.4. Experimental Results

4.4.1. Ablation of the Spatial Attention Module. We conducted an ablation experiment on the attention module of the texture feature extraction subnetwork, relying only on texture features to perform liveness detection on the CASIA dataset. We trained the two texture feature extraction networks (with and without the spatial attention block) for 50 epochs

Figure 9: CASIA-MFSD examples after preprocessing. From left to right: texture image, optical flow magnitude, and optical flow direction. (a) Fake face; (b) real face.

Figure 10: Network loss and ROC curve with and without the spatial attention (SA) module. (a) Training loss over time steps. (b) ROC curves (with SA: AUC = 0.95484; without SA: AUC = 0.94652).


Figure 11: Change in the weight heat map before (a) and after (b) the spatial attention module. The spatial attention module pays special attention to some features of the face area.

Figure 12: DTFA-Net training and evaluation results in Epochs 49-89. (a) Loss fluctuations during training; (b) AUC on the test set; (c) ACC on the test set; (d) EER on the test set.


each, and verified them on the CASIA test set. Figure 10 shows the training loss (Epochs 0-29) and the ROC curve on the test set (Epoch 50). The experiment shows that, after introducing the attention mechanism, because of the larger network structure (in fact, one convolution layer is added), the loss of the model decreases more slowly than that of the model without SA in the initial training stage and shows large oscillations. However, as the number of training iterations increases, the loss stabilizes and there is almost no difference between the two cases. After 50 epochs of training, the model with SA achieved an AUC of 0.954 on the test set, which is higher than that of the model without SA.

Visualizing the input and output of our spatial attention module, as shown in Figure 11, shows that SA pays more attention to local areas of the face image such as the mouth and eyes. This is consistent with the prior knowledge assumed by traditional image feature description methods.

We first train the DTFA network to a certain degree without SA and then add the SA structure and train for another 100 epochs, so that the spatial attention module can better learn face area information and accelerate model convergence. Figure 12 shows the training and test results of DTFA-Net on the CASIA dataset. When the number of training iterations reaches the interval of Epochs 49-89, EER = 0.069 and AUC = 0.975 ± 0.0001, reaching a stable state.

Table 5 provides a comparison between the results of our proposed approach and those of other methods in intradatabase evaluation. Our results are comparable to the state-of-the-art methods.

4.5. Samples. Figure 13 shows several samples of failed and correct detection of real faces. Through analysis, we found that the illumination in the RGB images may be the main cause of misclassification.

5. Conclusion

This paper analyzed photo and video replay attacks in face spoofing, built an attention network structure that integrates dynamic and texture features, designed a dynamic information fusion module, and extracted features from texture images based on a spatial attention mechanism. At the same time, an improved Gamma image optimization algorithm was proposed for preprocessing images in face detection tasks under complex illumination.

Data Availability

The CASIA-MFSD data used to support the findings of this study were supplied by CASIA under license and so cannot be made freely available. Requests for access to these data should be made to CASIA via http://www.cbsr.ia.ac.cn.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Table 5: Comparison between our proposed method and other methods in intradatabase evaluation.

Method          | CASIA-MFSD EER (%) | Replay Attack EER (%) | Replay Attack HTER (%)
LBP [26]        | 18.2               | 13.9                  | 13.8
IQA [9]         | 32.4               | -                     | 15.2
CNN [4]         | 7.4                | 6.1                   | 2.1
LiveNet [27]    | 4.59               | -                     | 5.74
DTFA-Net (ours) | 6.90               | 6.47                  | 2.2

Figure 13: False and correct detection samples. Left: a false-negative result; right: a true-positive case.


Acknowledgments

This work was supported by the National Key Research and Development Program of China (Grant 2018YFB1600600), the National Natural Science Foundation of China (Grant 51278058), the 111 Project on Information of Vehicle-Infrastructure Sensing and ITS (Grant B14043), the Shaanxi Natural Science Basic Research Program (Grant nos. 2019NY-163 and 2020GY-018), the Joint Laboratory for Internet of Vehicles, Ministry of Education-China Mobile Communications Corporation (Grant 213024170015), and the Special Fund for Basic Scientific Research of Central Colleges, Chang'an University, China (Grant nos. 300102329101 and 300102249101).

References

[1] H. Steiner, A. Kolb, and N. Jung, "Reliable face anti-spoofing using multispectral SWIR imaging," in Proceedings of the International Conference on Biometrics, IEEE, Halmstad, Sweden, May 2016.
[2] Y. H. Tang and L. M. Chen, "3D facial geometric attributes based anti-spoofing approach against mask attacks," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, IEEE, Washington, DC, USA, pp. 589-595, September 2017.
[3] R. Raghavendra and C. Busch, "Novel presentation attack detection algorithm for face recognition system: application to 3D face mask attack," in Proceedings of the IEEE International Conference on Image Processing, IEEE, Paris, France, pp. 323-327, October 2014.
[4] J. W. Yang, Z. Lei, and S. Z. Li, "Learn convolutional neural network for face anti-spoofing," 2014, http://arxiv.org/abs/1408.5601.
[5] Y. Atoum, Y. J. Liu, A. Jourabloo, and X. M. Liu, "Face antispoofing using patch and depth-based CNNs," in Proceedings of the IEEE International Joint Conference on Biometrics, IEEE, Denver, CO, USA, pp. 319-328, August 2017.
[6] J. Hernandez-Ortega, J. Fierrez, A. Morales, and P. Tome, "Time analysis of pulse-based face anti-spoofing in visible and NIR," in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, IEEE, Salt Lake City, UT, USA, June 2018.
[7] S. Q. Liu, X. Y. Lan, and P. C. Yuen, "Remote photoplethysmography correspondence feature for 3D mask face presentation attack detection," in Proceedings of the European Conference on Computer Vision, IEEE, Munich, Germany, pp. 558-573, September 2018.
[8] Z. Boulkenafet, J. Komulainen, and A. Hadid, "Face anti-spoofing based on color texture analysis," in Proceedings of the International Conference on Image Processing, IEEE, Quebec, Canada, pp. 2636-2640, September 2015.
[9] J. Galbally and S. Marcel, "Face anti-spoofing based on general image quality assessment," in Proceedings of the International Conference on Pattern Recognition, IEEE, Stockholm, Sweden, pp. 1173-1178, August 2014.
[10] Z. Boulkenafet, J. Komulainen, and A. Hadid, "On the generalization of color texture-based face anti-spoofing," Image and Vision Computing, vol. 77, pp. 1-9, 2018.
[11] S. Tirunagari, N. Poh, D. Windridge, A. Iorliam, N. Suki, and A. T. S. Ho, "Detection of face spoofing using visual dynamics," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 762-777, 2015.
[12] W. Kim, S. Suh, and J.-J. Han, "Face liveness detection from a single image via diffusion speed model," IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2456-2465, 2015.
[13] S. Bharadwaj, T. Dhamecha, M. Vatsa et al., "Computationally efficient face spoofing detection with motion magnification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Portland, OR, USA, June 2013.
[14] T. Freitas, J. Komulainen, A. Anjos et al., "Face liveness detection using dynamic texture," EURASIP Journal on Image and Video Processing, vol. 2014, no. 1, p. 2, 2014.
[15] X. Tu, H. Zhang, M. Xie et al., "Enhance the motion cues for face anti-spoofing using CNN-LSTM architecture," 2019, http://arxiv.org/abs/1901.05635.
[16] A. Alotaibi and A. Mahmood, "Deep face liveness detection based on nonlinear diffusion using convolution neural network," Signal, Image and Video Processing, vol. 11, no. 4, pp. 713-720, 2017.
[17] S. Zhang and X. Wang, "A dataset and benchmark for large-scale multi-modal face anti-spoofing," in Proceedings of the Conference on Computer Vision and Pattern Recognition, IEEE, CA, USA, November 2019.
[18] Y. Feng, F. Wu, X. Shao et al., "Joint 3D face reconstruction and dense alignment with position map regression network," in Proceedings of the European Conference on Computer Vision, Springer, Berlin, Germany, pp. 557-574, September 2018.
[19] K. He, X. Zhang, and S. Ren, "Deep residual learning for image recognition," in Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2016.
[20] R. Schettini, F. Gasparini, S. Corchs et al., "Contrast image correction method," Journal of Electronic Imaging, vol. 19, no. 2, Article ID 023005, 2010.
[21] Y. Cheng, L. Jiao, X. Cao, and Z. Li, "Illumination-insensitive features for face recognition," The Visual Computer, vol. 33, no. 11, pp. 1483-1493, 2017.
[22] G. Farneback, "Two-frame motion estimation based on polynomial expansion," in Proceedings of the 13th Scandinavian Conference on Image Analysis, Halmstad, Sweden, June 2003.
[23] J. Hu, L. Shen, S. Albanie et al., "Squeeze-and-excitation networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[24] S. Woo, J. Park, J.-Y. Lee et al., "CBAM: convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[25] Z. W. Zhang, J. J. Yan, S. F. Liu, Z. Lei, D. Yi, and S. Z. Li, "A face antispoofing database with diverse attacks," in Proceedings of the International Conference on Biometrics, IEEE, New Delhi, India, pp. 26-31, June 2012.
[26] I. Chingovska, A. Anjos, and S. Marcel, "On the effectiveness of local binary patterns in face anti-spoofing," in Proceedings of the International Conference of the Biometrics Special Interest Group (BIOSIG), Hong Kong, China, September 2012.
[27] Y. A. U. Rehman, L. M. Po, and M. Liu, "LiveNet: improving features generalization for face liveness detection using convolution neural networks," Expert Systems with Applications, vol. 108, pp. 159-169, 2018.


Page 2: DTFA …downloads.hindawi.com/journals/complexity/2020/5836596.pdfResearchArticle DTFA-Net:DynamicandTextureFeaturesFusionAttention NetworkforFaceAntispoofing XinCheng ,1HongfeiWang

many feature descriptors for characterizing the living textureof face and then implemented the classification by trainingmodels such as SVM and LDA classifier In order tocharacterize the high semantic features of face living bodythe deep neural network is applied in the feature extractionprocess to further enhance the performance of living de-tection1e features included in the local area of the face canoften be used as an important basis for living detection andplay a different role as shown in Figure 2 Based on thissome researchers [4 5] decomposed faces into differentregions to extract features through neural networks and thenrealize feature splicing

Most prosthetic faces are difficult to simulate the vitalsigns of real faces such as head movement lip peristalsisand blinking At the same time due to background noiseskin texture and other factors the dynamic characteristics ofreal face in some frequency bands are obviously higher thanthat of fraudulent face which provides the basis for dis-tinguishing real face from fraudulent face 1e variation inoptical flow field is an important basis of this kind of al-gorithm However the dynamic information generated bymovement and bending of photo will influence the ex-traction of life signals Remote photoplethysmography(rPPG) is another effective noncontact living signal ex-traction method which provides a basis for face livingdetection by observing face images to calculate the changesin blood flow and flow rate [6 7] but the rPPG method hasstrict requirements for algorithm application environment

1is work proposed a network that fuses dynamic andtexture information to represent face and detect the attacksOptical flow method is used to calculate the motion changein two adjacent frames of face images 1e optical flowgenerated by the bending and movement of the photo isdifferent from the optical flow generated by themovement ofthe real face in the direction of displacement We use asimple convolutional neural network with the same struc-ture to characterize the magnitude and direction of dis-placement 1en a feature fusion module is designed for thecombination of the above two representations so that onthis basis facial motion features can be further extracted Inaddition RGB images are used to extract texture informa-tion of the face area By giving a different attention to theparts of the face we enhance the networkrsquos ability to rep-resent living faces

Face detection algorithms are widely used in living bodydetection tasks which can be used to locate faces therebyeliminating the interference of background information onliving body detection In this paper for face detection scenesunder complex lighting we propose an improved imagepreprocessing algorithm combined with local contrast in theface area which effectively improves the performance of theface detection algorithm

2 Relating Works

21 Texture based Living verification is completed by usingthe difference between real face and replay image in surfacetexture 3D structure image quality and so on Boulkenafetet al [8] analyzed the chroma and brightness difference

between real and false face images it is based on the colorlocal binary pattern and the feature histogram of each orderimage frequency band was extracted as the face texturerepresentation Finally the classification was realized bysupport vector machine and testing on the Replay AttackDataset obtained the half error rate it is 29 Galbally et al[9] prove that the image quality loss value produced byGaussian filtering can distinguish the truth effectively withfraudulent face images designed a quality assessment vectorcontaining 14 indicators and proposed a live detectionmethod the method in combination with LDA (lineardiscriminant analysis) and obtained 152 half error rate onthe Replay Attack Dataset However such methods based onstatic feature often require the design of specific descriptorsfor a certain types of attacks and the robustness is poorunder different light conditions and different fraud carriers[10]

22 Dynamic Based Some researchers have proposed aface living detection algorithm based on dynamic featuresby analyzing face motion patterns and show good per-formance in related datasets [11] Kim et al [12] designeda local velocity pattern for the estimation of the speed oflight and distinguished the fraud from the real faceaccording to the difference in the diffusion speed betweenthe light on the real face and the fraud carrier surface A1250 half error rate was obtained on the Replay AttackDataset Bharadwaj et al [13] amplify the blink signalwhich is 02ndash05 Hz in the image by the Eulerian motionamplification algorithm combined with local binarypattern with directional flow histogram (LBP-HOOF) toextract dynamic features as classification basis and ob-tained error rate which is 125 on the Replay AttackDataset At the same time they proved the positive effectof image amplification algorithm on the performance ofthe algorithm Freitas et al [14] learned from the facialexpression detection method extracted feature histo-grams from the orthogonal plane of time-spatial domainby using LBP-TOP operator used support vector machineto classify and got 76 half error rate on Replay AttackDataset Xiaoguang et al [15] based on the action in-formation between adjacent frames established a CNN-LSTM network model used convolutional neural networkto extract the texture features of adjacent frame faceimages and then input it to the long- and short-termmemory structure to learn the time-domain action in-formation in face video

In addition some researchers combined different de-tection equipments or system modules to fuse informationon different levels which effectively increased the accuracyof living detection [1 16] Zhang and Wang [17] used IntelRealSense SR300 camera to construct multimodal face imagedatabase including RGB image depth image (depth) andinfrared image (IR) 1e face region was accurately locatedusing face 3D reconstruction network PRNet [18] and maskoperation and then based on ResNet 18 classification [19]network to extract and fuse feature of multimodal datawhich mixed RGB depth and IR

2 Complexity

3 Proposed Method

31 Face Detection in Complex Illumination In order toeliminate the interference of background in the process ofliving information extraction it is necessary to segment theface area of the image Traditional detection techniques canbe divided into three categories the face detection based onfeature the face detection based on template and the facedetection based on statistics 1is paper uses face frontdetection API provided by Dlib which uses gradient di-rection histogram feature to achieve face detection 1e facedetection algorithm based on gradient direction histogramcanmaintain good immutability of image texture and opticaldeformation and ignore the slight texture and changes inexpression

Histogram of Oriented Gradients (HOGs) is a methodused to describe the local texture features of image 1ealgorithm divides the image into small spaces and calculatesthe gradient of pixel points in each space 1e pixel pointgradient calculation is shown in the following equations

Gx(x y) I(x + 1 y) minus I(x minus 1 y) (1)

Gy(x y) I(x y + 1) minus I(x y minus 1) (2)

where Gx(x y) and Gy(x y) are the horizontal gradientand vertical gradient at the (x y) of the image respectively

and I(x y) is the gray value In reality local shading or overexposure will affect the extraction of gradient informationbecause the image target will appear in different light en-vironments as shown in Figure 3 In order to enhance therobustness of the HOG feature descriptor to environmentalchanges and reduce the noise such as the local shadow of theimage a Gamma correction algorithm is used to preprocessthe image to eliminate the interference of partial light

Traditional Gamma correction method changes thebrightness of image by selecting the appropriate c operatoras follows

O(x y) 255 timesI(x y)

2551113890 1113891

c

(3)

where I(x y) is the pixel value of the image at the position(x y) O(x y) is the corrected pixel value and c is theconstant1e traditional method performs image processingat the global level without considering the lightness differ-ence between local and neighborhood pixels 1ereforeSchettini et al [20] proposed a formula for the value of c

operator

c[x y] z[128minusmask(xy)128]

z

(In(I255))

(In(05)) Ilt 128

1 I 128

(In(05))

(In(I255)) Igt 128

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(4)

where mask is an image mask and Gaussian blur can be usedin practice For the more balanced image with bright areaand dark area the average pixel of the image is close to 128so the calculated α is close to 1 and the image is hardlychanged which obviously does not meet the actual needsConsidering the local feature of face this paper introduces

Figure 1 Face print and replay attack images 1e face attacked has been collected many times showing the difference between texturefeature light image quality and real face

Figure 2Weights visualization of a layer in a depth neural networkfor real face texture information extraction Different face regionsoccupy different weights in living detection task

Complexity 3

the local normalization method proposed in [21] to calculatethe ratio relation of pixels in the neighborhood and adjustthe operator α

z(x y)

(In(I255))

(In(05))+

N(x y)

(In(I255))In(05) Ilt 128

(In(05))

(In(I255))+

N(x y)

(In(05))In

I

2551113888 1113889 Ige 128

⎧⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎩

(5)

Among them the specific calculation process of localnormalized characteristic N is as follows

(1) To calculate the maximum pixel value Im(x y) in theneighborhood φ(x y) centered on pixel (x y)

Im(x y) max I(i j) | (i j) isin φ(x y)1113864 1113865 (6)

(2) To calculate the median value of the Im(x y) of allpixels centered on pixel (x y)

Imm(x y) medium Im(i j)1113868111386811138681113868 (i j) isin φ(x y)1113966 1113967 (7)

(3) To calculate the maximum value of the Imm(x y) ofall pixels centered on pixel (x y)

S(x y) max Imm(i j) | (i j) isin φ(x y)1113864 1113865 (8)

(4) To calculate the ratio of pixels (x y) to neighborhoodpixels

N(x y) I(x y)

S(x y) (9)

We use algorithm in [20] and the improved algorithm inthis paper to preprocess the portrait 208 photos on YaleBsubdatabase that is difficult to be detected by HOG undercomplex lighting conditions and then detect 196 and 201faces separately 1e result is shown in Figure 4

32 DTFA-Net Architecture In Section 32 we mainly in-troduce the dynamic and texture features fusion attentionnetwork DTFA-Net As shown in Figure 5 the optical flowgraph and the texture image are respectively subjected toobtain 256lowast2 and 256lowast4 embedding by extracting dynamicfeature and texture feature from subnetwork and then fusingthe spliced 256lowast6 features through the fully connected layerand living detection 1e specific details of the network aredescribed below

321 Dynamic Feature Fusion 1is paper generates theoptical flow field change map of adjacent two frames of facevideo by the optical flow method 1e optical flow change inface region is extracted by dynamic feature fusion subnet-work in two dimensions of displacement and size and thefeatures of the two dimensions are fused by feature fusionblock to extract the dynamic information of face region

(1) Optical Flow Optical flow method is a proposal used todescribe themotion information of adjacent frame objects Itreflects the interframe field changes by calculating themotion displacement in the x and y directions of the imageon the time domain Defining videomidpoint P located (x y)of the image at the t moment and moving to the place(x + dx y + dy) then when the dt is close to 0 the two pixelvalues satisfy the following relationship

I(v) I(v + d) (10)

where v (x y) is the coordinate of the point P at the time t I(v) is the gray value of the place (x y) at the time t d (dxdy) is the displacement of the point P during dt and I(v + d)

is the gray value of the place (x + dx y + dy) at the timet + dt

In this paper the dense optical flowmethod proposed byFarneback [22] is used to calculate the interframe dis-placement of face video 1e algorithm approximates thepixels of two-frame images by a polynomial expansiontransformation And it based on the assumption that thelocal optical flow and the image gradient are stable and thedisplacement field is deduced in the polynomial expansioncoefficient We transform the displacement d (dx dy) tothe extreme coordinate system d (ρ θ) and visualize theoptical flow displacement and direction by the HSV modelAs shown in Figure 6 the optical flow change image ob-tained will be used as input of the dynamic feature fusionnetwork

(2) Fusion Attention Module In the process of dynamicinformation extraction we extract respectively the motioninformation contained in the input optical flow changedirection feature map and the optical flow change intensityfeature map through 5 convolution layers Because themotion pattern of living human face contains two dimen-sions of direction and intensity it is necessary to combinethe above representations to further extract the movingfeatures of the face As a result we designed a fusion moduleas shown in Figure 7

Figure 3 HOG feature of shadow on face region under lightcondition It is necessary to initialize the face image because theshadow or exposure caused by complex light can affect the faceregion gradient information

4 Complexity

(a)

(b)

(c)

Figure 4 Comparison between [20] and ours 1e improved algorithm we proposed performs better than that in [20] (a) Original imagesthat cannot detect face by HOG (b) images processed in [20] and the detection result (c) images processed by ours and the detection result

Conv layer

Fusion block

FC layers

Spatialattention

(attack real)

VecConv 1-5

MagConv 1-5

TexConv 5TexConv 1-4

Img_Vec

Img_Mag

Img_Tex

FC1

FC2

FC3

Figure 5 1e dynamic and texture features fusion attention network (DTFA-Net) architecture 1e figure omits the ReLU and poolinglayers after the convolution layer and the id of the convolution is shown on the top Color code used is as follows pink convolutionblue fusion block gray spatial attention green fully connected layer

Complexity 5

To improve the characterization ability of the modelwe use the SE structure [23] in the fusion module whichgives different weights for the optical flow intensity anddirection features to strengthen the decision-makingability of some features First global pooling of featuregraphs is

opc AvgPool Fop1113872 1113873 1

H times W1113944

H

i11113944

W

j1Fop(i j) (11)

where Fop(i j) stands for the concatenated features ofoptical magnitude and angle 1rough global averagepooling the dimension of the stitching feature mapchanges from C timesHtimesW to C times 1times 1 Secondly learn thenonlinear functional relationship between each channelthrough full connection (FC) and activation function(ReLU) 1en use normalization (sigmoid) to get theweight of each channel

opa σ FC δ FC opc( 1113857( 1113857( 1113857( 1113857 (12)

where σ is the sigmoid function and δ is the ReLU function1e two fully connected layers are used to reduce and re-covery dimension respectively which is helpful to improvethe complexity of the function Finally we multiply Fop withopa and pass through a convolution layer to get the fusionfeatures

Fop+ Conv opa otimesFop1113872 1113873 (13)

(3) Network Details. The input image size of the dynamic feature extraction subnetwork is 227 × 227 × 3, and the subnetwork contains 11 convolution layers, 2 fully connected layers, and 6 pooling layers. Tables 1-3 show the specific parameters of the convolution and pooling layers.

3.2.2. Texture Feature Representation. Specifically, we map the input RGB image to intermediate feature maps with 384 channels through TexConv1-4, emphasize some regions through the spatial attention mechanism, and then feed the output of the attention module to TexConv5 and the fully connected layer FC2 for further feature extraction. The structure of the convolutional layers TexConv1-5 is shown in Table 1, and the structure of the fully connected layer FC2 is shown in Table 4.

(1) Spatial Attention Block. Through experiments, we found that neural networks often pay special attention to the eyes, cheeks, mouth, and other areas when extracting liveness features. Therefore, we added a spatial attention module to the static texture extraction structure to give different attention to the features of different face regions.

Figure 6: Optical flow visualization of two adjacent face regions: (a) visualization of the optical flow direction (hue = flow direction, saturation = 255, value = 255); (b) visualization of the optical flow magnitude (hue = 255, saturation = 255, value = flow magnitude). In each panel, the left pair shows the optical flow changes of a real face and the right pair shows those of a photo attack.

Figure 7: Fusion attention module architecture (global average pooling, FC layers with ReLU, a sigmoid, a channel-wise product, and a convolution layer).


We adopted the CBAM spatial attention structure proposed in [24] (Figure 8). This module reduces the channel dimension of the input feature map through maximum pooling and average pooling, concatenates the two resulting maps, and obtains a 1 × H × W attention weight map through a convolution layer and an activation function:

$$SA_c = \delta\left(\mathrm{Conv}\left(\mathrm{Cat}\left(\mathrm{AvgPool}\left(F_t\right), \mathrm{MaxPool}\left(F_t\right)\right)\right)\right). \qquad (14)$$

Finally, we take the element-wise product of the input F_t and SA_c, and the output of the spatial attention block is passed to the subsequent layers TexConv5 and FC2:

$$F_{t+} = SA_c \otimes F_t. \qquad (15)$$
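A minimal PyTorch sketch of such a spatial attention block is given below, mirroring equations (14)-(15). Following the CBAM design in [24], channel-wise average and max pooling are concatenated and passed through a convolution; a sigmoid is used as the activation here (the paper's equation (14) writes δ), and the 7 × 7 kernel size is CBAM's default, both assumptions for this sketch.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (eqs. (14)-(15)): pool along the channel
    axis, concatenate, convolve to a 1 x H x W weight map, and reweight."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.act = nn.Sigmoid()

    def forward(self, f_t):
        avg_map = torch.mean(f_t, dim=1, keepdim=True)    # channel-wise average pooling
        max_map, _ = torch.max(f_t, dim=1, keepdim=True)  # channel-wise max pooling
        sa = self.act(self.conv(torch.cat([avg_map, max_map], dim=1)))  # eq. (14)
        return sa * f_t                                   # eq. (15)
```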

3.2.3. Feature Fusion. Through the above two subnetworks, dynamic information and texture information are obtained, respectively. Through a series of fully connected layers, dropout layers, and activation functions, we fully fuse the two kinds of information, learn the nonlinear relationship between the dynamic and static features, and obtain a two-dimensional liveness representation of the face for living detection, as shown in Table 4. A sketch of this fusion stage is given below.
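The following rough PyTorch sketch of the fusion stage assumes the FC3 sizes listed in Table 4 (256·6 → 256·3 → 256 → 2) together with the 256·2 dynamic and 256·4 texture embeddings; the dropout placement and class name are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate the dynamic and texture embeddings and map them to a
    two-way live/spoof score through FC3 (sizes as in Table 4)."""
    def __init__(self, p_drop=0.5):
        super().__init__()
        self.fc3 = nn.Sequential(
            nn.Linear(256 * 6, 256 * 3), nn.ReLU(inplace=True), nn.Dropout(p_drop),
            nn.Linear(256 * 3, 256), nn.ReLU(inplace=True), nn.Dropout(p_drop),
            nn.Linear(256, 2))

    def forward(self, dyn_emb, tex_emb):
        # dyn_emb: (B, 512) from the dynamic subnetwork; tex_emb: (B, 1024) from the texture subnetwork
        return self.fc3(torch.cat([dyn_emb, tex_emb], dim=1))
```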

4. Experiment

4.1. Dataset. We use CASIA-MFSD [25] to train and test the model. The dataset contains a total of 600 face videos collected from 50 subjects. Videos of real faces, photo attacks, and video replay attacks are collected at different resolutions; the photo attacks include bent photos and photo masks. We ignore the different attack types and divide all videos into real faces and fake faces. Through optical flow computation, face region detection, cropping, and related steps, we obtain 35,428 sets of training images and 64,674 sets of test images, as shown in Figure 9. We also train and test our model on the Replay Attack Database.

4.2. Evaluation. This experiment uses the false acceptance rate (FAR), false rejection rate (FRR), equal error rate (EER), and half total error rate (HTER) to evaluate the face liveness detection algorithm. The FAR is the ratio of fake faces judged as real, and the FRR is the ratio of real faces judged as fake; the calculation formulas are as follows:

$$\mathrm{FAR} = \frac{N_{f\_r}}{N_{f}}, \qquad (16)$$

$$\mathrm{FRR} = \frac{N_{r\_f}}{N_{r}}, \qquad (17)$$

where N_{f_r} is the number of fake faces wrongly accepted as real, N_{r_f} is the number of real faces wrongly rejected, N_f is the total number of fake face samples, and N_r is the total number of real face samples. The two classification methods used in this experiment are described below.

Table 1: The network of Mag_Conv1-5.

Layer      | Input size    | Kernel size | Filters | Stride
Mag_Conv1  | 227 × 227 × 3 | 11 × 11 × 3 | 96      | 4
MaxPool1   | 27 × 27 × 64  | 3 × 3       | -       | 2
Mag_Conv2  | 27 × 27 × 64  | 5 × 5 × 64  | 192     | 1
MaxPool2   | 27 × 27 × 192 | 3 × 3       | -       | 2
Mag_Conv3  | 13 × 13 × 192 | 3 × 3 × 192 | 384     | 1
Mag_Conv4  | 13 × 13 × 384 | 3 × 3 × 384 | 256     | 1
Mag_Conv5  | 13 × 13 × 256 | 3 × 3 × 256 | 256     | 1

*VecConv1-5 and TexConv1-5 parameters are the same as MagConv1-5.

Table 2: The structure of the fusion attention module.

Layer         | Input size    | Kernel size | Filters | Stride
GlobalAvgPool | 13 × 13 × 512 | 13 × 13     | -       | 1
Fc1           | 512           | -           | -       | -
Fc2           | 32            | -           | -       | -
LK_Conv       | 13 × 13 × 512 | 3 × 3 × 512 | 256     | 2
LK_MaxPool    | 6 × 6 × 256   | 3 × 3       | -       | 1

Table 3: The structure of FC1 in Figure 5.

Layer | Input size  | Output size
FC1_1 | 256 × 6 × 6 | 256 × 3 × 3
FC1_2 | 256 × 3 × 3 | 256 × 2 × 2
FC1_3 | 256 × 2 × 2 | 256 × 2

Table 4: The structure of FC2-3 in Figure 5.

Layer | Input size  | Output size
FC2_1 | 256 × 6 × 6 | 256 × 3 × 3
FC2_2 | 256 × 3 × 3 | 256 × 2 × 2
FC3_1 | 256 × 6     | 256 × 3
FC3_2 | 256 × 3     | 256
FC3_3 | 256         | 2

Figure 8: Spatial attention block (average pooling, max pooling, a convolution layer, a sigmoid, and an element-wise product). We introduce this module after the convolution layers of the static texture feature extraction subnetwork; it assigns different attention to local areas of the face.


(1) Nearest neighborhood (NN): each dimension of the two-dimensional output vector represents the probability of a real or attack face, and the category corresponding to the maximum value is selected as the classification result. (2) Thresholding: a threshold is selected to classify the representation result; this method is mainly used for model validation and testing. Calculating FAR and FRR at different thresholds yields the receiver operating characteristic (ROC) curve; for measuring imbalance in the classification problem, the area under the ROC curve (AUC) intuitively shows the classification performance.
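For illustration, a simple threshold-sweep implementation of FAR, FRR, and EER (equations (16)-(17)) could look like the following; it assumes liveness scores in [0, 1] with larger values indicating a real face, which is an assumption for this sketch rather than a detail given in the paper.

```python
import numpy as np

def far_frr_eer(scores_real, scores_fake):
    """Sweep a threshold over liveness scores, compute FAR/FRR (eqs. (16)-(17)),
    and return the EER, i.e., the point where the two error rates are equal."""
    thresholds = np.linspace(0.0, 1.0, 1001)
    fars, frrs = [], []
    for t in thresholds:
        fars.append(np.mean(scores_fake >= t))  # fake faces accepted as real
        frrs.append(np.mean(scores_real < t))   # real faces rejected as fake
    fars, frrs = np.asarray(fars), np.asarray(frrs)
    idx = np.argmin(np.abs(fars - frrs))        # threshold where FAR is closest to FRR
    eer = (fars[idx] + frrs[idx]) / 2.0
    return fars, frrs, eer
```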

4.3. Implementation Details. The proposed method is implemented in PyTorch with a piecewise learning rate (lr = 0.01 when epoch < 5 and lr = 0.001 when epoch ≥ 5). The batch size is 128 with num_worker = 100. We initialize our network with the parameters of AlexNet. The network is trained with standard SGD for 50 or 100 epochs on a Tesla V100 GPU, uses a cross-entropy loss, and takes inputs at a resolution of 227 × 227.
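A hypothetical PyTorch training fragment reflecting these settings (SGD, cross-entropy loss, and the two-stage learning rate) is sketched below; the function name, the model's three-input signature, and the momentum value are assumptions introduced only for illustration.

```python
import torch.nn as nn
import torch.optim as optim

def train_dtfa(model, train_loader, epochs=50):
    """Hypothetical training loop: SGD with cross-entropy loss,
    lr = 0.01 for epoch < 5 and 0.001 afterwards; the caller supplies
    the model and a DataLoader (batch size 128 in the paper)."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum is an assumption
    for epoch in range(epochs):
        for group in optimizer.param_groups:        # two-stage learning rate schedule
            group["lr"] = 0.01 if epoch < 5 else 0.001
        for imgs_vec, imgs_mag, imgs_tex, labels in train_loader:
            optimizer.zero_grad()
            logits = model(imgs_vec, imgs_mag, imgs_tex)  # assumed three-input model signature
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
```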

4.4. Experimental Results

4.4.1. Ablation of the Spatial Attention Module. We conducted an ablation experiment on the attention module of the texture feature extraction subnetwork, performing liveness detection on the CASIA dataset using texture features alone. We trained the two texture feature extraction networks, with and without the spatial attention block, for 50 epochs each and verified them on the CASIA test set.

Figure 9: CASIA-MFSD examples after preprocessing. From left to right: texture image, optical flow magnitude, and optical flow direction; (a) fake face and (b) real face.

Figure 10: Network loss and ROC curves with and without the spatial attention module: (a) training loss over time steps; (b) ROC curves (with SA, AUC = 0.95484; without SA, AUC = 0.94652).


Figure 11: Weight heat maps before (a) and after (b) the spatial attention module. The spatial attention module pays special attention to some features of the face area.

Figure 12: DTFA-Net training and evaluation results in Epochs 49-89: (a) loss fluctuations during model training; (b) AUC on the test set; (c) ACC on the test set; (d) EER on the test set.


Figure 10 shows the training loss (Epoch 0 to Epoch 29) and the ROC curve on the test set (Epoch 50). The experiment shows that, after introducing the attention mechanism, the loss decreases more slowly in the initial stage of training than that of the model without SA and fluctuates more strongly, owing to the larger network structure (in fact, one convolution layer is added). However, as the number of training iterations increases, the loss stabilizes and there is almost no difference between the two cases. After 50 epochs of training, the model with SA achieved an AUC of 95.4% on the test set, higher than the model without SA.

Visualizing the input and output of our spatial attention module, as shown in Figure 11, shows that SA pays more attention to local areas of the face image, such as the mouth and eyes. This is consistent with the prior knowledge assumed by traditional image feature description methods.

We first train the DTFA network without SA to a certain degree and then add the SA structure and train for another 100 epochs, so that the spatial attention module can better learn face area information and accelerate model convergence. Figure 12 shows the training and test results of DTFA-Net on the CASIA dataset. When the number of training epochs reaches the interval 49-89, EER = 0.069 and AUC = 0.975 ± 0.0001, reaching a stable state.

Table 5 provides a comparison between the results of our proposed approach and those of other methods in intradatabase evaluation. Our results are comparable to the state-of-the-art methods.

4.5. Samples. Figure 13 shows several samples of failed and correct detections of real faces. Through analysis, we found that the illumination in the RGB images may be the main cause of wrong classification.

5. Conclusion

This paper analyzed photo and video replay attacks against face recognition, built an attention network structure that fuses dynamic and texture features, and designed a dynamic information fusion module together with a texture feature extractor based on the spatial attention mechanism. At the same time, an improved Gamma image optimization algorithm was proposed for preprocessing images in face detection tasks under complex illumination.

Data Availability

The CASIA-MFSD data used to support the findings of this study were supplied by CASIA under license and so cannot be made freely available. Requests for access to these data should be made to CASIA via http://www.cbsr.ia.ac.cn.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Table 5: Comparison between our proposed method and the others in intradatabase testing.

Method          | CASIA-MFSD EER (%) | Replay Attack EER (%) | Replay Attack HTER (%)
LBP [26]        | 18.2               | 13.9                  | 13.8
IQA [9]         | 32.4               | -                     | 15.2
CNN [4]         | 7.4                | 6.1                   | 2.1
LiveNet [27]    | 4.59               | -                     | 5.74
DTFA-Net (ours) | 6.90               | 6.47                  | 2.2

Figure 13: Examples of false and correct detections. Left: a false-negative result; right: a true-positive case.


Acknowledgments

This work was supported by the National Key Research and Development Program of China (Grant 2018YFB1600600), the National Natural Science Funds of China (Grant 51278058), the 111 Project on Information of Vehicle-Infrastructure Sensing and ITS (Grant B14043), the Shaanxi Natural Science Basic Research Program (Grant nos. 2019NY-163 and 2020GY-018), the Joint Laboratory for Internet of Vehicles, Ministry of Education-China Mobile Communications Corporation (Grant 213024170015), and the Special Fund for Basic Scientific Research of Central Colleges, Chang'an University, China (Grant nos. 300102329101 and 300102249101).

References

[1] H. Steiner, A. Kolb, and N. Jung, "Reliable face anti-spoofing using multispectral SWIR imaging," in Proceedings of the International Conference on Biometrics, IEEE, Halmstad, Sweden, May 2016.
[2] Y. H. Tang and L. M. Chen, "3D facial geometric attributes based anti-spoofing approach against mask attacks," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, IEEE, Washington, DC, USA, pp. 589-595, September 2017.
[3] R. Raghavendra and C. Busch, "Novel presentation attack detection algorithm for face recognition system: application to 3D face mask attack," in Proceedings of the IEEE International Conference on Image Processing, IEEE, Paris, France, pp. 323-327, October 2014.
[4] J. W. Yang, Z. Lei, and S. Z. Li, "Learn convolutional neural network for face anti-spoofing," 2014, http://arxiv.org/abs/1408.5601.
[5] Y. Atoum, Y. J. Liu, A. Jourabloo, and X. M. Liu, "Face antispoofing using patch and depth-based CNNs," in Proceedings of the IEEE International Joint Conference on Biometrics, IEEE, Denver, Colorado, USA, pp. 319-328, August 2017.
[6] J. Hernandez-Ortega, J. Fierrez, A. Morales, and P. Tome, "Time analysis of pulse-based face anti-spoofing in visible and NIR," in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, IEEE, Salt Lake City, Utah, USA, June 2018.
[7] S. Q. Liu, X. Y. Lan, and P. C. Yuen, "Remote photoplethysmography correspondence feature for 3D mask face presentation attack detection," in Proceedings of the European Conference on Computer Vision, Munich, Germany, pp. 558-573, September 2018.
[8] Z. Boulkenafet, J. Komulainen, and A. Hadid, "Face anti-spoofing based on color texture analysis," in Proceedings of the International Conference on Image Processing, IEEE, Quebec, Canada, pp. 2636-2640, September 2015.
[9] J. Galbally and S. Marcel, "Face anti-spoofing based on general image quality assessment," in Proceedings of the International Conference on Pattern Recognition, IEEE, Stockholm, Sweden, pp. 1173-1178, August 2014.
[10] Z. Boulkenafet, J. Komulainen, and A. Hadid, "On the generalization of color texture-based face anti-spoofing," Image and Vision Computing, vol. 77, pp. 1-9, 2018.
[11] S. Tirunagari, N. Poh, D. Windridge, A. Iorliam, N. Suki, and A. T. S. Ho, "Detection of face spoofing using visual dynamics," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 762-777, 2015.
[12] W. Kim, S. Suh, and J.-J. Han, "Face liveness detection from a single image via diffusion speed model," IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2456-2465, 2015.
[13] S. Bharadwaj, T. Dhamecha, M. Vatsa et al., "Computationally efficient face spoofing detection with motion magnification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Portland, Oregon, June 2013.
[14] T. Freitas, J. Komulainen, A. Anjos et al., "Face liveness detection using dynamic texture," EURASIP Journal on Image and Video Processing, vol. 2014, no. 1, p. 2, 2014.
[15] X. Tu, H. Zhang, M. Xie et al., "Enhance the motion cues for face anti-spoofing using CNN-LSTM architecture," 2019, http://arxiv.org/abs/1901.05635.
[16] A. Alotaibi and A. Mahmood, "Deep face liveness detection based on nonlinear diffusion using convolution neural network," Signal, Image and Video Processing, vol. 11, no. 4, pp. 713-720, 2017.
[17] S. Zhang and X. Wang, "A dataset and benchmark for large scale multi-modal face anti-spoofing," in Proceedings of the Conference on Computer Vision and Pattern Recognition, IEEE, CA, USA, November 2019.
[18] Y. Feng, F. Wu, X. Shao et al., "Joint 3D face reconstruction and dense alignment with position map regression network," in Proceedings of the European Conference on Computer Vision, Springer, Berlin, Germany, pp. 557-574, September 2018.
[19] K. He, X. Zhang, and S. Ren, "Deep residual learning for image recognition," in Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2016.
[20] R. Schettini, F. Gasparini, S. Corchs et al., "Contrast image correction method," Journal of Electronic Imaging, vol. 19, no. 2, Article ID 023005, 2010.
[21] Y. Cheng, L. Jiao, X. Cao, and Z. Li, "Illumination-insensitive features for face recognition," The Visual Computer, vol. 33, no. 11, pp. 1483-1493, 2017.
[22] G. Farneback, "Two-frame motion estimation based on polynomial expansion," in Proceedings of the 13th Scandinavian Conference on Image Analysis, Halmstad, Sweden, June 2003.
[23] J. Hu, L. Shen, S. Albanie et al., "Squeeze-and-excitation networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[24] S. Woo, J. Park, J.-Y. Lee et al., "CBAM: convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[25] Z. W. Zhang, J. J. Yan, S. F. Liu, Z. Lei, D. Yi, and S. Z. Li, "A face antispoofing database with diverse attacks," in Proceedings of the International Conference on Biometrics, IEEE, New Delhi, India, pp. 26-31, June 2012.
[26] I. Chingovska, A. Anjos, and S. Marcel, "On the effectiveness of local binary patterns in face anti-spoofing," in Proceedings of the International Conference of the Biometrics Special Interest Group (BIOSIG), Hong Kong, China, September 2012.
[27] Y. A. U. Rehman, L. M. Po, and M. Liu, "LiveNet: improving features generalization for face liveness detection using convolution neural networks," Expert Systems with Applications, vol. 108, pp. 159-169, 2018.



where v (x y) is the coordinate of the point P at the time t I(v) is the gray value of the place (x y) at the time t d (dxdy) is the displacement of the point P during dt and I(v + d)

is the gray value of the place (x + dx y + dy) at the timet + dt

In this paper the dense optical flowmethod proposed byFarneback [22] is used to calculate the interframe dis-placement of face video 1e algorithm approximates thepixels of two-frame images by a polynomial expansiontransformation And it based on the assumption that thelocal optical flow and the image gradient are stable and thedisplacement field is deduced in the polynomial expansioncoefficient We transform the displacement d (dx dy) tothe extreme coordinate system d (ρ θ) and visualize theoptical flow displacement and direction by the HSV modelAs shown in Figure 6 the optical flow change image ob-tained will be used as input of the dynamic feature fusionnetwork

(2) Fusion Attention Module In the process of dynamicinformation extraction we extract respectively the motioninformation contained in the input optical flow changedirection feature map and the optical flow change intensityfeature map through 5 convolution layers Because themotion pattern of living human face contains two dimen-sions of direction and intensity it is necessary to combinethe above representations to further extract the movingfeatures of the face As a result we designed a fusion moduleas shown in Figure 7

Figure 3 HOG feature of shadow on face region under lightcondition It is necessary to initialize the face image because theshadow or exposure caused by complex light can affect the faceregion gradient information

4 Complexity

(a)

(b)

(c)

Figure 4 Comparison between [20] and ours 1e improved algorithm we proposed performs better than that in [20] (a) Original imagesthat cannot detect face by HOG (b) images processed in [20] and the detection result (c) images processed by ours and the detection result

Conv layer

Fusion block

FC layers

Spatialattention

(attack real)

VecConv 1-5

MagConv 1-5

TexConv 5TexConv 1-4

Img_Vec

Img_Mag

Img_Tex

FC1

FC2

FC3

Figure 5 1e dynamic and texture features fusion attention network (DTFA-Net) architecture 1e figure omits the ReLU and poolinglayers after the convolution layer and the id of the convolution is shown on the top Color code used is as follows pink convolutionblue fusion block gray spatial attention green fully connected layer

Complexity 5

To improve the characterization ability of the modelwe use the SE structure [23] in the fusion module whichgives different weights for the optical flow intensity anddirection features to strengthen the decision-makingability of some features First global pooling of featuregraphs is

opc AvgPool Fop1113872 1113873 1

H times W1113944

H

i11113944

W

j1Fop(i j) (11)

where Fop(i j) stands for the concatenated features ofoptical magnitude and angle 1rough global averagepooling the dimension of the stitching feature mapchanges from C timesHtimesW to C times 1times 1 Secondly learn thenonlinear functional relationship between each channelthrough full connection (FC) and activation function(ReLU) 1en use normalization (sigmoid) to get theweight of each channel

opa σ FC δ FC opc( 1113857( 1113857( 1113857( 1113857 (12)

where σ is the sigmoid function and δ is the ReLU function1e two fully connected layers are used to reduce and re-covery dimension respectively which is helpful to improvethe complexity of the function Finally we multiply Fop withopa and pass through a convolution layer to get the fusionfeatures

Fop+ Conv opa otimesFop1113872 1113873 (13)

(3) Network Details Dynamic feature extraction subnetworkinput image size is 227times 227times 3 which contains 11 con-volution layers 2 full connected layers and 6 pooling layersTables 1ndash3 show the specific network parameters of con-volution and pooling layers

322 Texture Feature Representation In specific we mapthe input RBG image to the intermediate feature maps with adimension of 384 through TexConv1-4 and then pay moreattention to some of the regions through the spatial attentionmechanism and then input the output of the attentionmodule to TexConv5 and full connection layer FC2 performsfeature extraction 1e structure of the convolutional layerTexConv1-5 is shown in Table 1 and the structure of thefully connected layer FC2 is shown in Table 4

(1) Spatial Attention Block After experiments we found thatneural networks often pay special attention to the humaneyes cheeks mouths and other areas when extracting livingfeatures 1erefore we added a spatial attention module tothe static texture extraction structure and give a different

Real face Photo attack

(a)

Real face Photo attack

(b)

Figure 6 Optical flow visualization of two adjacent face regions (a) visualization of changes in optical flow direction hue direction ofoptical flow saturation 255 and value 255 (b) optical flow magnitude visualize hue 255 saturation 255 and value size of opticalflow Among them the left two are the optical flow changes in the real face and the right two groups are the optical flow changes in the photoattacks

S

Conv layer

FC layer

S Sigmoid

GlobalAvgPool

ReLU

Product

Figure 7 Fusion attention module architecture

6 Complexity

attention to the features of different face regions Weadopted the CBAM (Figure 8) spatial attention structureproposed in [24] 1is module reduces the dimension of theinput feature map through the maximum pooling and av-erage pooling layers splices the two feature maps andobtains the attention weight of 1lowastHlowastW by the convolutionlayer and activation function

SAc δ Conv Cat AvgPool Ft( 1113857MaxPool Ft( 1113857( 1113857( 1113857( 1113857

(14)

Finally we utilized element-wise product for input Ftand SAc and the output of the spatial attention block willpass through the next layers TextConv5 and FC2

Ft+ SAc otimesFt (15)

323 Feature Fusion 1rough the above two subnetworksdynamic information and texture information are obtainedrespectively By a series of fully connected layers dropoutlayers and activation functions we fully fuse the two

information learning the nonlinear relationship between thedynamic and static features and obtain a two-dimensionalrepresentation of face in living information for living de-tection as shown in Table 4

4 Experiment

41Dataset We use CASIA-MFSD [25] to train and test themodel 1e dataset contains a total of 600 face videos col-lected from 50 individuals Face video of real face photoattack and video attack scenes are collected at differentresolutions Among them photo attack includes photobending and photo mask We ignore the different attackways and divide all the videos into real face and false face1rough the calculation of optical flow field face regiondetection and tailoring etc get 35428 sets of training imagesand 64674 sets of test images as shown in Figure 9 And wealso train and test our model on Replay Attack Database

42 Evaluation 1is experiment uses false acceptance rate(FAR) false rejection rate (FRR) equal error rate (EER) andhalf total error rate (HTER) 1e face living detection al-gorithm is based on these indicators 1e FAR refers to theratio of judging the fake face as the real face the FRR refersto the ratio of judging the real face as false and the cal-culation formulas are shown as follows

FAR Nf_r

Nf (16)

FRR Nr_f

Nr (17)

where Nf_r is the number of false face error Nr_f is thenumber of real face error Nf is the number of false faceliveness detection and Nr is the number of real face de-tection 1e two classification methods of this experimentare as follows (1) nearest neighborhood (NN) which cor-responds the two-dimensional vector of which each

Table 1 1e network of Mag_Conv1-5

Layer Input size Kernel size Filter StrideMag_Conv1 227lowast 22lowast 3 11lowast 11lowast 3 96 4MaxPooo1 27lowast 27lowast 64 3lowast 3 2Mag_Conv2 27lowast 27lowast 64 5lowast 5lowast 64 192 1MaxPool2 27lowast 27lowast192 3lowast 3 2Mag_Conv3 13lowast13lowast192 3lowast 3lowast192 384 1Mag_Conv4 13lowast13lowast 384 3lowast 3lowast 384 256 1Mag_Conv5 13lowast13lowast 256 3lowast 3lowast 256 256 1lowastVecConv1-5 and TexConv1-5 parameters are same as MagConv1-5

Table 2 1e structure of fusion attention module

Layer Input size Kernel size Filter StrideGlobalAvgPool 13lowast13lowast 512 13lowast13 1Fc1 512Fc2 32LK_Conv 13lowast13lowast 512 3lowast 3lowast 512 256 2LK_MaxPool 6lowast 6lowast 256 3lowast 3 1

Table 3 1e structure of FC1 in Figure 5

Layer Input size Output sizeFC1_1 256lowast 6lowast 6 256lowast 3lowast 3FC1_2 256lowast 3lowast 3 256lowast 2lowast 2FC1_3 256lowast 2lowast 2 256lowast 2

Table 4 1e structure of FC2-3 in Figure 5

Layer Input size Output sizeFC2_1 256lowast 6lowast 6 256lowast 3lowast 3FC2_2 256lowast 3lowast 3 256lowast 2lowast 2FC3_1 256lowast 6 256lowast 3FC3_2 256lowast 3 256FC3_3 256 2

AvgPool

Product

S

Conv layer

S Sigmoid

MaxPool

Figure 8 Spatial attention block We introduce this module afterthe convolution layer of the subnetwork is extracted from the staticfeature which gives the difference attention to the local area of theface

Complexity 7

dimension value represents the probability of real face orattack face and selects the category which corresponds to themaximum value as the classification result (2) 1resholdingselects a certain threshold to classify the representationresult 1is method is mainly for model validation andtesting Calculating FAR and FRR at different thresholds canplot the receiver operating characteristic (ROC) curve formeasuring the nonequilibrium in the classification problemthe area under the ROC curve (area under curve AUC) canintuitively show the algorithm classification effect

43 Implementation Details 1e proposed method isimplemented in Pytorch with an inconstant learning rate(eg lr 001 when epochlt5 and lr 0001 when epochge 5)

1e batch size of the model is 128 with num_worker 100We initialize our network by using the parameters ofAlextNet100 1e network is trained with standard SGD for50 or 100 epochs on Tesla V100 GPU And we use crossentropy loss and the input resolution is 227times 227

44 Experimental Result

441 Ablation of Spatial Attention Module We conductedan ablation experiment on the attention module of thetexture feature extraction subnetwork and only rely ontexture features to perform live detection on the CAISAdataset We trained the two texture feature extraction net-works with or without spatial attention block 50 times

(a)

(b)

Figure 9 CASIA-MFSD examples after preprocessing From left to right texture image optical flow magnitude and optical flow directionAmong them (a) fake face and (b) real face

06

05

04

03

02

01

00

0 2000 4000 6000 8000 10000Time step

Loss

With SAWithout SA

(a)

10

08

06

04

02

00100806040200

FAR

1-FR

R

With SA AUC = 095484)Without SA AUC = 094652)

(b)

Figure 10 Network loss and ROC curve with or without spatial attention module (a) training loss as time step went by (b) ROC curve

8 Complexity

(a)

(b)

Figure 11 Change in weight heat map before and after spatial attention module (a) before (b) after 1e spatial attention module paysspecial attention to some features of the face area

000010

000008

000006

000004

000002

000000

Loss

18000 20000 22000 24000 26000 28000 30000 32000Time step

(a)

09650

09675

09700

09725

09750

09775

09800

09825

09850

AU

C

50 55 60 65 70 75 80 85 90Epoch

(b)0950

0945

0940

0935

0930

0925

0920

0915

0910

ACC

50 55 60 65 70 75 80 85 90Epoch

(c)

0071

0070

0069

0068

0067

0066

0065

EER

50 55 60 65 70 75 80 85 90Epoch

(d)

Figure 12 DTFA-Net training and evaluation results in Epoch49-89 (a) the loss fluctuations of model training in Epoch49-89 (b) the AUCresults of the model in the test set in Epoch49-89 (c) the ACC results of the model in the test set in Epoch49-89 (d) the ERR results of themodel in the test set in Epoch49-89

Complexity 9

respectively and verified them on the CASIA test set Fig-ure 10 shows the training loss process (Epoch0-Epoch29)and the ROC curve in the test set (Epoch50)1e experimentshows that after introducing the attention mechanism dueto the increase in the network structure (in fact a convo-lution layer is added) the loss of the model during thetraining process is slower than that of model without SA inthe initial stage of training and there is a large shockHowever as the number of network training iterationsincreases the loss tends to be stable and there is almost nodifference between the two cases After 50 cycles of trainingthe model with SA achieved AUC 954 on the test setwhich is higher than model without SA

Visualize the input and output results of our spatialattention mechanism module as shown in Figure 11 Itshows that SA pays more attention to local areas in the faceimage such as the mouth and eyes 1is point shows theconsistency of the prior knowledge as assumed by the tra-ditional image feature description method

We first do not use SA to train the DTFA network to acertain degree and then add the SA structure to train 100times so that the spatial attention module can better learnface area information and accelerate model convergenceFigure 12 shows the training and test results of DTFA-Neton the CASIA dataset When the number of training iter-ations of the model reaches the interval of 49 ndash 89EER 0069 and AUC 0975plusmn 00001 reaching a stablestate

Table 5 provides a comparison between the results of ourproposed approach and those of the other methods in both

intradatabase evaluation Our model result is comparable tothe state-of-the-art methods

45 Samples Figure 13 shows several samples of the failureand right detection of real faces 1rough analysis we foundthat the illumination in RGB images may be the main causeof wrong classification

5 Conclusion

1is paper analyzed the photo and video replay attacks offace spoofing and built an attention network structure thatintegrated dynamic-texture features and designed a dynamicinformation fusion module that extracted features fromtexture images based on the spatial attention mechanism Atthe same time an improved gamma image optimizationalgorithm was proposed for preprocessing of image in facedetection tasks under multiple illuminations

Data Availability

1e CASIA-MFSD data used to support the findings of thisstudy were supplied by CASIA under license and so cannotbe made freely available Requests for access to these datashould be made to CASIA via httpwwwcbsriaaccn

Conflicts of Interest

1e authors declare that they have no conflicts of interest

Table 5 Comparison between our proposed method and the other in intradatabase

Method CASIA-MFSD Replay AttackEER () EER () HTER ()

LBP [26] 182 139 138IQA [9] 324 ndash 152CNN [4] 74 61 21LiveNet [27] 459 ndash 574DTFA-Net (ours) 690 647 22

False Successful

Figure 13 1e false and right detection samples Left false-negative result right true-positive case

10 Complexity

Acknowledgments

1is work was supported by the National Key Research andDevelopment Program of China (Grant 2018YFB1600600)National Natural Science Funds of China (Grant 51278058)111 Project on Information of Vehicle-InfrastructureSensing and ITS (Grant B14043) Shaanxi Natural ScienceBasic Research Program (Grant nos 2019NY-163 and2020GY-018) Joint Laboratory for Internet of VehiclesMinistry of Education-China Mobile CommunicationsCorporation (Grant 213024170015) and Special Fund forBasic Scientific Research of Central Colleges ChangrsquoanUniversity China (Grant nos 300102329101 and300102249101)

References

[1] H Steiner A Kolb and N Jung ldquoReliable face anti-spoofingusing multispectral swir imagingrdquo in Proceedings of the In-ternational Conference on Biometrics IEEE Halmstad Swe-den May 2016

[2] Y H Tang and L M Chen ldquo3d facial geometric attributesbased anti-spoofing approach against mask attacksrdquo in Pro-ceedings of the IEEE International Conference on AutomaticFace and Gesture Recognition IEEE Washington DC USApp 589ndash595 September 2017

[3] R Raghavendra and C Busch ldquoNovel presentation attackdetection algorithm for face recognition systemApplicationto 3d face mask attackrdquo in Proceedings of the IEEE Interna-tional Conference on Image Processing IEEE Paris Francepp 323ndash327 October 2014

[4] J W Yang Z Lei and S Z Li ldquoLearn convolutional neuralnetwork for face anti-spoofingrdquo 2014 httparxivorgabs14085601

[5] Y Atoum Y J Liu A Jourabloo and X M Liu ldquoFaceantispoofing using patch and depth-based cnnsrdquo in Pro-ceedings of IEEE International Joint Conference on BiometricsIEEE Denver Colorado USA pp 319ndash328 August 2017

[6] J Hernandez-Ortega J Fierrez A Morales and P TomeldquoTime analysis of pulse-based face anti-spoofing in visible andnirrdquo in Proceedings of the Conference on Computer Vision andPattern Recognition Workshops IEEE Salt Lake City UtahUSA June 2018

[7] S Q Liu X Y Lan and P C Yuen ldquoRemote photo-plethysmography correspondence feature for 3d mask facepresentation attack detectionrdquo in Proceedings of the EuropeanConference on Computer Vision IEEE Munich Germanypp 558ndash573 September 2018

[8] Z Boulkenafet J Komulainen and A Hadid ldquoFace anti-spoofing based on color texture analysisrdquo in Proceedings of theInternational Conference on Image Processing IEEE QuebecCanada pp 2636ndash2640 September 2015

[9] J Galbally and S Marcel ldquoFace anti-spoofing based on generalimage quality assessmentrdquo in Proceedings of the InternationalConference on Pattern Recognition IEEE Stockholm Swedenpp 1173ndash1178 August 2014

[10] Z Boulkenafet J Komulainen and A Hadid ldquoOn the gen-eralization of color texture-based face anti-spoofingrdquo Imageand Vision Computing vol 77 pp 1ndash9 2018

[11] S Tirunagari N Poh D Windridge A Iorliam N Suki andA T S Ho ldquoDetection of face spoofing using visual dy-namicsrdquo IEEE Transactions on Information Forensics andSecurity vol 10 no 4 pp 762ndash777 2015

[12] W Kim S Suh and J-J Han ldquoFace liveness detection from asingle image via diffusion speed modelrdquo IEEE Transactions onImage Processing vol 24 no 8 pp 2456ndash2465 2015

[13] S Bharadwaj T Dhamecha M Vatsa et al ldquoComputationallyefficient face spoofing detection with motion magnificationrdquoin Proceedings of the IEEE Conference on Computer Vision andPattern Recognition IEEE Portland Oregon June 2013

[14] T Freitas J Komulainen Anjos et al ldquoFace liveness detectionusing dynamic texturerdquo EURASIP Journal on Image andVideo Processing vol 2014 no 1 p 2 2014

[15] T U Xiaoguang H Zhang X I E Mei et al ldquoEnhance themotion cues for face anti-spoofing using cnn-lstm architec-turerdquo 2019 httparxivorgabs190105635

[16] A Alotaibi and A Mahmood ldquoDeep face liveness detectionbased on nonlinear diffusion using convolution neural net-workrdquo Signal Image and Video Processing vol 11 no 4pp 713ndash720 2017

[17] S Zhang and X Wang ldquoA dataset and benchmark for largescale multi modal face anti-spoofingrdquo in Proceedings of theConference on Computer Vision and Pattern RecognitionIEEE CA USA November 2019

[18] Y A O Feng W U Fan S H A O Xiaohu et al ldquoJoint 3Dface reconstruction and dense alignment with position mapregression networkrdquo in Proceedings of the European Con-ference on Computer Vision Springer Berlin Germanypp 557ndash574 September 2018

[19] H Kaiming Z Xiangyu and R Shaoqing ldquoDeep residuallearning for image recognitionrdquo in Proceedings of the Con-ference on Computer Vision and Pattern Recognition SeattleWA USA June 2016

[20] Schettini R Gasparini F Corchs et al ldquoContrast imagecorrection methodrdquo Journal of Electronic Imaging vol 19no 2 Article ID 023005 2010

[21] Y Cheng L Jiao X Cao and Z Li ldquoIllumination-insensitivefeatures for face recognitionrdquo Ce Visual Computer vol 33no 11 pp 1483ndash1493 2017

[22] G Farneback ldquoTwo-frame motion estimation based onpolynomial expansionrdquo in Proceedings of the 13th Scandi-navian Conference on Image Analysis Halmstad SwedenJune 2003

[23] H U Jie L I Shen S Albanie et al ldquoSqueeze-and-excitationnetworksrdquo in Proceedings of the IEEE Transactions on PatternAnalysis and Machine Intelligence Salt Lake City UT USAJune 2019

[24] S Woo J Park L Joon-Young et al ldquoCBAMconvolutionalblock attention modulerdquo in Proceedings of the EuropeanConference on Computer Vision ECCV Munich GermanySeptember 2018

[25] Z W Zhang J J Yan S F Liu Z Lei D Yi and S Z Li ldquoAface antispoofing database with diverse attacksrdquo in Proceed-ings of the International Conference on Biometrics IEEE NewDelhi India pp 26ndash31 June 2012

[26] I Chingovska A Anjos and S Marcel ldquoOn the effectivenessof localbinary patterns in face anti-spoofifingrdquo in Proceedingsof the International Conference of the Biometrics Special In-terest Group (BIOSIG) Hong Kong China September 2012

[27] Y A U Rehman L M Po and M Liu ldquoLivenet improvingfeatures generalization for face liveness detection usingconvolution neural networksrdquo Expert Systems with Applica-tions vol 108 pp 159ndash169 2018

Complexity 11

Page 4: DTFA …downloads.hindawi.com/journals/complexity/2020/5836596.pdfResearchArticle DTFA-Net:DynamicandTextureFeaturesFusionAttention NetworkforFaceAntispoofing XinCheng ,1HongfeiWang

the local normalization method proposed in [21] to calculatethe ratio relation of pixels in the neighborhood and adjustthe operator α

z(x y)

(In(I255))

(In(05))+

N(x y)

(In(I255))In(05) Ilt 128

(In(05))

(In(I255))+

N(x y)

(In(05))In

I

2551113888 1113889 Ige 128

⎧⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎩

(5)

Among them the specific calculation process of localnormalized characteristic N is as follows

(1) To calculate the maximum pixel value Im(x y) in theneighborhood φ(x y) centered on pixel (x y)

Im(x y) max I(i j) | (i j) isin φ(x y)1113864 1113865 (6)

(2) To calculate the median value of the Im(x y) of allpixels centered on pixel (x y)

Imm(x y) medium Im(i j)1113868111386811138681113868 (i j) isin φ(x y)1113966 1113967 (7)

(3) To calculate the maximum value of the Imm(x y) ofall pixels centered on pixel (x y)

S(x y) max Imm(i j) | (i j) isin φ(x y)1113864 1113865 (8)

(4) To calculate the ratio of pixels (x y) to neighborhoodpixels

N(x y) I(x y)

S(x y) (9)

We use algorithm in [20] and the improved algorithm inthis paper to preprocess the portrait 208 photos on YaleBsubdatabase that is difficult to be detected by HOG undercomplex lighting conditions and then detect 196 and 201faces separately 1e result is shown in Figure 4

32 DTFA-Net Architecture In Section 32 we mainly in-troduce the dynamic and texture features fusion attentionnetwork DTFA-Net As shown in Figure 5 the optical flowgraph and the texture image are respectively subjected toobtain 256lowast2 and 256lowast4 embedding by extracting dynamicfeature and texture feature from subnetwork and then fusingthe spliced 256lowast6 features through the fully connected layerand living detection 1e specific details of the network aredescribed below

321 Dynamic Feature Fusion 1is paper generates theoptical flow field change map of adjacent two frames of facevideo by the optical flow method 1e optical flow change inface region is extracted by dynamic feature fusion subnet-work in two dimensions of displacement and size and thefeatures of the two dimensions are fused by feature fusionblock to extract the dynamic information of face region

(1) Optical Flow Optical flow method is a proposal used todescribe themotion information of adjacent frame objects Itreflects the interframe field changes by calculating themotion displacement in the x and y directions of the imageon the time domain Defining videomidpoint P located (x y)of the image at the t moment and moving to the place(x + dx y + dy) then when the dt is close to 0 the two pixelvalues satisfy the following relationship

I(v) I(v + d) (10)

where v (x y) is the coordinate of the point P at the time t I(v) is the gray value of the place (x y) at the time t d (dxdy) is the displacement of the point P during dt and I(v + d)

is the gray value of the place (x + dx y + dy) at the timet + dt

In this paper the dense optical flowmethod proposed byFarneback [22] is used to calculate the interframe dis-placement of face video 1e algorithm approximates thepixels of two-frame images by a polynomial expansiontransformation And it based on the assumption that thelocal optical flow and the image gradient are stable and thedisplacement field is deduced in the polynomial expansioncoefficient We transform the displacement d (dx dy) tothe extreme coordinate system d (ρ θ) and visualize theoptical flow displacement and direction by the HSV modelAs shown in Figure 6 the optical flow change image ob-tained will be used as input of the dynamic feature fusionnetwork

(2) Fusion Attention Module In the process of dynamicinformation extraction we extract respectively the motioninformation contained in the input optical flow changedirection feature map and the optical flow change intensityfeature map through 5 convolution layers Because themotion pattern of living human face contains two dimen-sions of direction and intensity it is necessary to combinethe above representations to further extract the movingfeatures of the face As a result we designed a fusion moduleas shown in Figure 7

Figure 3 HOG feature of shadow on face region under lightcondition It is necessary to initialize the face image because theshadow or exposure caused by complex light can affect the faceregion gradient information

4 Complexity

(a)

(b)

(c)

Figure 4 Comparison between [20] and ours 1e improved algorithm we proposed performs better than that in [20] (a) Original imagesthat cannot detect face by HOG (b) images processed in [20] and the detection result (c) images processed by ours and the detection result

Conv layer

Fusion block

FC layers

Spatialattention

(attack real)

VecConv 1-5

MagConv 1-5

TexConv 5TexConv 1-4

Img_Vec

Img_Mag

Img_Tex

FC1

FC2

FC3

Figure 5 1e dynamic and texture features fusion attention network (DTFA-Net) architecture 1e figure omits the ReLU and poolinglayers after the convolution layer and the id of the convolution is shown on the top Color code used is as follows pink convolutionblue fusion block gray spatial attention green fully connected layer

Complexity 5

To improve the characterization ability of the modelwe use the SE structure [23] in the fusion module whichgives different weights for the optical flow intensity anddirection features to strengthen the decision-makingability of some features First global pooling of featuregraphs is

opc AvgPool Fop1113872 1113873 1

H times W1113944

H

i11113944

W

j1Fop(i j) (11)

where Fop(i j) stands for the concatenated features ofoptical magnitude and angle 1rough global averagepooling the dimension of the stitching feature mapchanges from C timesHtimesW to C times 1times 1 Secondly learn thenonlinear functional relationship between each channelthrough full connection (FC) and activation function(ReLU) 1en use normalization (sigmoid) to get theweight of each channel

opa σ FC δ FC opc( 1113857( 1113857( 1113857( 1113857 (12)

where σ is the sigmoid function and δ is the ReLU function1e two fully connected layers are used to reduce and re-covery dimension respectively which is helpful to improvethe complexity of the function Finally we multiply Fop withopa and pass through a convolution layer to get the fusionfeatures

Fop+ Conv opa otimesFop1113872 1113873 (13)

(3) Network Details Dynamic feature extraction subnetworkinput image size is 227times 227times 3 which contains 11 con-volution layers 2 full connected layers and 6 pooling layersTables 1ndash3 show the specific network parameters of con-volution and pooling layers

322 Texture Feature Representation In specific we mapthe input RBG image to the intermediate feature maps with adimension of 384 through TexConv1-4 and then pay moreattention to some of the regions through the spatial attentionmechanism and then input the output of the attentionmodule to TexConv5 and full connection layer FC2 performsfeature extraction 1e structure of the convolutional layerTexConv1-5 is shown in Table 1 and the structure of thefully connected layer FC2 is shown in Table 4

(1) Spatial Attention Block After experiments we found thatneural networks often pay special attention to the humaneyes cheeks mouths and other areas when extracting livingfeatures 1erefore we added a spatial attention module tothe static texture extraction structure and give a different

Real face Photo attack

(a)

Real face Photo attack

(b)

Figure 6 Optical flow visualization of two adjacent face regions (a) visualization of changes in optical flow direction hue direction ofoptical flow saturation 255 and value 255 (b) optical flow magnitude visualize hue 255 saturation 255 and value size of opticalflow Among them the left two are the optical flow changes in the real face and the right two groups are the optical flow changes in the photoattacks

S

Conv layer

FC layer

S Sigmoid

GlobalAvgPool

ReLU

Product

Figure 7 Fusion attention module architecture

6 Complexity

attention to the features of different face regions Weadopted the CBAM (Figure 8) spatial attention structureproposed in [24] 1is module reduces the dimension of theinput feature map through the maximum pooling and av-erage pooling layers splices the two feature maps andobtains the attention weight of 1lowastHlowastW by the convolutionlayer and activation function

SAc δ Conv Cat AvgPool Ft( 1113857MaxPool Ft( 1113857( 1113857( 1113857( 1113857

(14)

Finally we utilized element-wise product for input Ftand SAc and the output of the spatial attention block willpass through the next layers TextConv5 and FC2

Ft+ SAc otimesFt (15)

323 Feature Fusion 1rough the above two subnetworksdynamic information and texture information are obtainedrespectively By a series of fully connected layers dropoutlayers and activation functions we fully fuse the two

information learning the nonlinear relationship between thedynamic and static features and obtain a two-dimensionalrepresentation of face in living information for living de-tection as shown in Table 4

4 Experiment

41Dataset We use CASIA-MFSD [25] to train and test themodel 1e dataset contains a total of 600 face videos col-lected from 50 individuals Face video of real face photoattack and video attack scenes are collected at differentresolutions Among them photo attack includes photobending and photo mask We ignore the different attackways and divide all the videos into real face and false face1rough the calculation of optical flow field face regiondetection and tailoring etc get 35428 sets of training imagesand 64674 sets of test images as shown in Figure 9 And wealso train and test our model on Replay Attack Database

42 Evaluation 1is experiment uses false acceptance rate(FAR) false rejection rate (FRR) equal error rate (EER) andhalf total error rate (HTER) 1e face living detection al-gorithm is based on these indicators 1e FAR refers to theratio of judging the fake face as the real face the FRR refersto the ratio of judging the real face as false and the cal-culation formulas are shown as follows

FAR Nf_r

Nf (16)

FRR Nr_f

Nr (17)

where Nf_r is the number of false face error Nr_f is thenumber of real face error Nf is the number of false faceliveness detection and Nr is the number of real face de-tection 1e two classification methods of this experimentare as follows (1) nearest neighborhood (NN) which cor-responds the two-dimensional vector of which each

Table 1 1e network of Mag_Conv1-5

Layer Input size Kernel size Filter StrideMag_Conv1 227lowast 22lowast 3 11lowast 11lowast 3 96 4MaxPooo1 27lowast 27lowast 64 3lowast 3 2Mag_Conv2 27lowast 27lowast 64 5lowast 5lowast 64 192 1MaxPool2 27lowast 27lowast192 3lowast 3 2Mag_Conv3 13lowast13lowast192 3lowast 3lowast192 384 1Mag_Conv4 13lowast13lowast 384 3lowast 3lowast 384 256 1Mag_Conv5 13lowast13lowast 256 3lowast 3lowast 256 256 1lowastVecConv1-5 and TexConv1-5 parameters are same as MagConv1-5

Table 2 1e structure of fusion attention module

Layer Input size Kernel size Filter StrideGlobalAvgPool 13lowast13lowast 512 13lowast13 1Fc1 512Fc2 32LK_Conv 13lowast13lowast 512 3lowast 3lowast 512 256 2LK_MaxPool 6lowast 6lowast 256 3lowast 3 1

Table 3 1e structure of FC1 in Figure 5

Layer Input size Output sizeFC1_1 256lowast 6lowast 6 256lowast 3lowast 3FC1_2 256lowast 3lowast 3 256lowast 2lowast 2FC1_3 256lowast 2lowast 2 256lowast 2

Table 4 1e structure of FC2-3 in Figure 5

Layer Input size Output sizeFC2_1 256lowast 6lowast 6 256lowast 3lowast 3FC2_2 256lowast 3lowast 3 256lowast 2lowast 2FC3_1 256lowast 6 256lowast 3FC3_2 256lowast 3 256FC3_3 256 2

AvgPool

Product

S

Conv layer

S Sigmoid

MaxPool

Figure 8 Spatial attention block We introduce this module afterthe convolution layer of the subnetwork is extracted from the staticfeature which gives the difference attention to the local area of theface

Complexity 7

dimension value represents the probability of real face orattack face and selects the category which corresponds to themaximum value as the classification result (2) 1resholdingselects a certain threshold to classify the representationresult 1is method is mainly for model validation andtesting Calculating FAR and FRR at different thresholds canplot the receiver operating characteristic (ROC) curve formeasuring the nonequilibrium in the classification problemthe area under the ROC curve (area under curve AUC) canintuitively show the algorithm classification effect

4.3. Implementation Details. The proposed method is implemented in PyTorch with a piecewise-constant learning rate (lr = 0.01 when epoch < 5 and lr = 0.001 when epoch ≥ 5). The batch size is 128 with num_workers = 100. We initialize the network with AlexNet parameters. The network is trained with standard SGD for 50 or 100 epochs on a Tesla V100 GPU, using cross-entropy loss and an input resolution of 227 × 227.
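A minimal sketch of this optimization setup is given below; the stand-in model and the random tensors replace DTFA-Net and the real data loader, and since momentum and weight decay are not specified in the paper, plain SGD is used.

```python
import torch
import torch.nn as nn

# Stand-in for DTFA-Net: any module taking 227x227 RGB input and producing 2 logits.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 227 * 227, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(50):
    # Piecewise-constant schedule from the text: lr = 0.01 before epoch 5, then 0.001.
    for group in optimizer.param_groups:
        group["lr"] = 0.01 if epoch < 5 else 0.001
    images = torch.randn(128, 3, 227, 227)   # batch size 128, 227x227 input resolution
    labels = torch.randint(0, 2, (128,))     # 1 = real face, 0 = attack
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```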

4.4. Experimental Results

4.4.1. Ablation of the Spatial Attention Module. We conducted an ablation experiment on the attention module of the texture feature extraction subnetwork, relying only on texture features for liveness detection on the CASIA dataset. We trained the two texture feature extraction networks, with and without the spatial attention block, for 50 epochs,

Figure 9: CASIA-MFSD examples after preprocessing. From left to right: texture image, optical flow magnitude, and optical flow direction. (a) Fake face; (b) real face.

Figure 10: Network loss and ROC curve with and without the spatial attention (SA) module: (a) training loss over time steps (with SA vs. without SA); (b) ROC curve plotting 1 − FRR against FAR (with SA, AUC = 0.95484; without SA, AUC = 0.94652).


Figure 11: Change in the weight heat map before (a) and after (b) the spatial attention module. The spatial attention module pays special attention to certain features of the face area.

Figure 12: DTFA-Net training and evaluation results for Epochs 49-89: (a) training loss fluctuations; (b) AUC on the test set; (c) ACC on the test set; (d) EER on the test set.


respectively, and verified them on the CASIA test set. Figure 10 shows the training loss (Epoch 0 to Epoch 29) and the ROC curve on the test set (Epoch 50). The experiment shows that, after introducing the attention mechanism, the enlarged network structure (in fact, one extra convolution layer) makes the loss decrease more slowly than that of the model without SA in the initial stage of training, with large oscillations. However, as the number of training iterations increases, the loss stabilizes and there is almost no difference between the two cases. After 50 epochs of training, the model with SA achieved an AUC of 0.954 on the test set, higher than the model without SA.

Figure 11 visualizes the input and output of our spatial attention module. It shows that SA pays more attention to local areas of the face image, such as the mouth and eyes. This is consistent with the prior knowledge assumed by traditional image feature description methods.

We first train the DTFA network without SA to a certain degree and then add the SA structure and train for another 100 epochs, so that the spatial attention module can better learn face area information and accelerate model convergence. Figure 12 shows the training and test results of DTFA-Net on the CASIA dataset. When the number of training epochs reaches the interval 49-89, EER = 0.069 and AUC = 0.975 ± 0.0001, and the model reaches a stable state.

Table 5 compares the results of our proposed approach with those of other methods in intradatabase evaluation. Our results are comparable to the state-of-the-art methods.

4.5. Samples. Figure 13 shows several samples of failed and correct detection of real faces. Through analysis, we found that illumination in the RGB images may be the main cause of misclassification.

5. Conclusion

This paper analyzed photo and video replay attacks on face recognition systems, built an attention network structure that fuses dynamic and texture features, designed a dynamic information fusion module, and extracted features from texture images with a spatial attention mechanism. In addition, an improved gamma image optimization algorithm was proposed for preprocessing images in face detection tasks under varied illumination.

Data Availability

The CASIA-MFSD data used to support the findings of this study were supplied by CASIA under license and so cannot be made freely available. Requests for access to these data should be made to CASIA via http://www.cbsr.ia.ac.cn.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Table 5: Comparison between our proposed method and other methods in intradatabase evaluation.

Method            CASIA-MFSD EER (%)   Replay Attack EER (%)   Replay Attack HTER (%)
LBP [26]          18.2                 13.9                    13.8
IQA [9]           32.4                 -                       15.2
CNN [4]           7.4                  6.1                     2.1
LiveNet [27]      4.59                 -                       5.74
DTFA-Net (ours)   6.90                 6.47                    2.2

Figure 13: False and correct detection samples. Left: false-negative result; right: true-positive case.


Acknowledgments

This work was supported by the National Key Research and Development Program of China (Grant 2018YFB1600600), the National Natural Science Foundation of China (Grant 51278058), the 111 Project on Information of Vehicle-Infrastructure Sensing and ITS (Grant B14043), the Shaanxi Natural Science Basic Research Program (Grant nos. 2019NY-163 and 2020GY-018), the Joint Laboratory for Internet of Vehicles, Ministry of Education-China Mobile Communications Corporation (Grant 213024170015), and the Special Fund for Basic Scientific Research of Central Colleges, Chang'an University, China (Grant nos. 300102329101 and 300102249101).

References

[1] H. Steiner, A. Kolb, and N. Jung, "Reliable face anti-spoofing using multispectral SWIR imaging," in Proceedings of the International Conference on Biometrics, IEEE, Halmstad, Sweden, May 2016.
[2] Y. H. Tang and L. M. Chen, "3D facial geometric attributes based anti-spoofing approach against mask attacks," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, IEEE, Washington, DC, USA, pp. 589-595, September 2017.
[3] R. Raghavendra and C. Busch, "Novel presentation attack detection algorithm for face recognition system: application to 3D face mask attack," in Proceedings of the IEEE International Conference on Image Processing, IEEE, Paris, France, pp. 323-327, October 2014.
[4] J. W. Yang, Z. Lei, and S. Z. Li, "Learn convolutional neural network for face anti-spoofing," 2014, http://arxiv.org/abs/1408.5601.
[5] Y. Atoum, Y. J. Liu, A. Jourabloo, and X. M. Liu, "Face antispoofing using patch and depth-based CNNs," in Proceedings of the IEEE International Joint Conference on Biometrics, IEEE, Denver, CO, USA, pp. 319-328, August 2017.
[6] J. Hernandez-Ortega, J. Fierrez, A. Morales, and P. Tome, "Time analysis of pulse-based face anti-spoofing in visible and NIR," in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, IEEE, Salt Lake City, UT, USA, June 2018.
[7] S. Q. Liu, X. Y. Lan, and P. C. Yuen, "Remote photoplethysmography correspondence feature for 3D mask face presentation attack detection," in Proceedings of the European Conference on Computer Vision, Munich, Germany, pp. 558-573, September 2018.
[8] Z. Boulkenafet, J. Komulainen, and A. Hadid, "Face anti-spoofing based on color texture analysis," in Proceedings of the International Conference on Image Processing, IEEE, Quebec, Canada, pp. 2636-2640, September 2015.
[9] J. Galbally and S. Marcel, "Face anti-spoofing based on general image quality assessment," in Proceedings of the International Conference on Pattern Recognition, IEEE, Stockholm, Sweden, pp. 1173-1178, August 2014.
[10] Z. Boulkenafet, J. Komulainen, and A. Hadid, "On the generalization of color texture-based face anti-spoofing," Image and Vision Computing, vol. 77, pp. 1-9, 2018.
[11] S. Tirunagari, N. Poh, D. Windridge, A. Iorliam, N. Suki, and A. T. S. Ho, "Detection of face spoofing using visual dynamics," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 762-777, 2015.
[12] W. Kim, S. Suh, and J.-J. Han, "Face liveness detection from a single image via diffusion speed model," IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2456-2465, 2015.
[13] S. Bharadwaj, T. Dhamecha, M. Vatsa et al., "Computationally efficient face spoofing detection with motion magnification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Portland, OR, USA, June 2013.
[14] T. Freitas, J. Komulainen, A. Anjos et al., "Face liveness detection using dynamic texture," EURASIP Journal on Image and Video Processing, vol. 2014, no. 1, p. 2, 2014.
[15] X. Tu, H. Zhang, M. Xie et al., "Enhance the motion cues for face anti-spoofing using CNN-LSTM architecture," 2019, http://arxiv.org/abs/1901.05635.
[16] A. Alotaibi and A. Mahmood, "Deep face liveness detection based on nonlinear diffusion using convolution neural network," Signal, Image and Video Processing, vol. 11, no. 4, pp. 713-720, 2017.
[17] S. Zhang and X. Wang, "A dataset and benchmark for large-scale multi-modal face anti-spoofing," in Proceedings of the Conference on Computer Vision and Pattern Recognition, IEEE, CA, USA, November 2019.
[18] Y. Feng, F. Wu, X. Shao et al., "Joint 3D face reconstruction and dense alignment with position map regression network," in Proceedings of the European Conference on Computer Vision, Springer, Berlin, Germany, pp. 557-574, September 2018.
[19] K. He, X. Zhang, and S. Ren, "Deep residual learning for image recognition," in Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2016.
[20] R. Schettini, F. Gasparini, S. Corchs et al., "Contrast image correction method," Journal of Electronic Imaging, vol. 19, no. 2, Article ID 023005, 2010.
[21] Y. Cheng, L. Jiao, X. Cao, and Z. Li, "Illumination-insensitive features for face recognition," The Visual Computer, vol. 33, no. 11, pp. 1483-1493, 2017.
[22] G. Farneback, "Two-frame motion estimation based on polynomial expansion," in Proceedings of the 13th Scandinavian Conference on Image Analysis, Halmstad, Sweden, June 2003.
[23] J. Hu, L. Shen, S. Albanie et al., "Squeeze-and-excitation networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[24] S. Woo, J. Park, J.-Y. Lee et al., "CBAM: convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[25] Z. W. Zhang, J. J. Yan, S. F. Liu, Z. Lei, D. Yi, and S. Z. Li, "A face antispoofing database with diverse attacks," in Proceedings of the International Conference on Biometrics, IEEE, New Delhi, India, pp. 26-31, June 2012.
[26] I. Chingovska, A. Anjos, and S. Marcel, "On the effectiveness of local binary patterns in face anti-spoofing," in Proceedings of the International Conference of the Biometrics Special Interest Group (BIOSIG), Hong Kong, China, September 2012.
[27] Y. A. U. Rehman, L. M. Po, and M. Liu, "LiveNet: improving features generalization for face liveness detection using convolution neural networks," Expert Systems with Applications, vol. 108, pp. 159-169, 2018.



information learning the nonlinear relationship between thedynamic and static features and obtain a two-dimensionalrepresentation of face in living information for living de-tection as shown in Table 4

4 Experiment

41Dataset We use CASIA-MFSD [25] to train and test themodel 1e dataset contains a total of 600 face videos col-lected from 50 individuals Face video of real face photoattack and video attack scenes are collected at differentresolutions Among them photo attack includes photobending and photo mask We ignore the different attackways and divide all the videos into real face and false face1rough the calculation of optical flow field face regiondetection and tailoring etc get 35428 sets of training imagesand 64674 sets of test images as shown in Figure 9 And wealso train and test our model on Replay Attack Database

42 Evaluation 1is experiment uses false acceptance rate(FAR) false rejection rate (FRR) equal error rate (EER) andhalf total error rate (HTER) 1e face living detection al-gorithm is based on these indicators 1e FAR refers to theratio of judging the fake face as the real face the FRR refersto the ratio of judging the real face as false and the cal-culation formulas are shown as follows

FAR Nf_r

Nf (16)

FRR Nr_f

Nr (17)

where Nf_r is the number of false face error Nr_f is thenumber of real face error Nf is the number of false faceliveness detection and Nr is the number of real face de-tection 1e two classification methods of this experimentare as follows (1) nearest neighborhood (NN) which cor-responds the two-dimensional vector of which each

Table 1 1e network of Mag_Conv1-5

Layer Input size Kernel size Filter StrideMag_Conv1 227lowast 22lowast 3 11lowast 11lowast 3 96 4MaxPooo1 27lowast 27lowast 64 3lowast 3 2Mag_Conv2 27lowast 27lowast 64 5lowast 5lowast 64 192 1MaxPool2 27lowast 27lowast192 3lowast 3 2Mag_Conv3 13lowast13lowast192 3lowast 3lowast192 384 1Mag_Conv4 13lowast13lowast 384 3lowast 3lowast 384 256 1Mag_Conv5 13lowast13lowast 256 3lowast 3lowast 256 256 1lowastVecConv1-5 and TexConv1-5 parameters are same as MagConv1-5

Table 2 1e structure of fusion attention module

Layer Input size Kernel size Filter StrideGlobalAvgPool 13lowast13lowast 512 13lowast13 1Fc1 512Fc2 32LK_Conv 13lowast13lowast 512 3lowast 3lowast 512 256 2LK_MaxPool 6lowast 6lowast 256 3lowast 3 1

Table 3 1e structure of FC1 in Figure 5

Layer Input size Output sizeFC1_1 256lowast 6lowast 6 256lowast 3lowast 3FC1_2 256lowast 3lowast 3 256lowast 2lowast 2FC1_3 256lowast 2lowast 2 256lowast 2

Table 4 1e structure of FC2-3 in Figure 5

Layer Input size Output sizeFC2_1 256lowast 6lowast 6 256lowast 3lowast 3FC2_2 256lowast 3lowast 3 256lowast 2lowast 2FC3_1 256lowast 6 256lowast 3FC3_2 256lowast 3 256FC3_3 256 2

AvgPool

Product

S

Conv layer

S Sigmoid

MaxPool

Figure 8 Spatial attention block We introduce this module afterthe convolution layer of the subnetwork is extracted from the staticfeature which gives the difference attention to the local area of theface

Complexity 7

dimension value represents the probability of real face orattack face and selects the category which corresponds to themaximum value as the classification result (2) 1resholdingselects a certain threshold to classify the representationresult 1is method is mainly for model validation andtesting Calculating FAR and FRR at different thresholds canplot the receiver operating characteristic (ROC) curve formeasuring the nonequilibrium in the classification problemthe area under the ROC curve (area under curve AUC) canintuitively show the algorithm classification effect

43 Implementation Details 1e proposed method isimplemented in Pytorch with an inconstant learning rate(eg lr 001 when epochlt5 and lr 0001 when epochge 5)

1e batch size of the model is 128 with num_worker 100We initialize our network by using the parameters ofAlextNet100 1e network is trained with standard SGD for50 or 100 epochs on Tesla V100 GPU And we use crossentropy loss and the input resolution is 227times 227

44 Experimental Result

441 Ablation of Spatial Attention Module We conductedan ablation experiment on the attention module of thetexture feature extraction subnetwork and only rely ontexture features to perform live detection on the CAISAdataset We trained the two texture feature extraction net-works with or without spatial attention block 50 times

(a)

(b)

Figure 9 CASIA-MFSD examples after preprocessing From left to right texture image optical flow magnitude and optical flow directionAmong them (a) fake face and (b) real face

06

05

04

03

02

01

00

0 2000 4000 6000 8000 10000Time step

Loss

With SAWithout SA

(a)

10

08

06

04

02

00100806040200

FAR

1-FR

R

With SA AUC = 095484)Without SA AUC = 094652)

(b)

Figure 10 Network loss and ROC curve with or without spatial attention module (a) training loss as time step went by (b) ROC curve

8 Complexity

(a)

(b)

Figure 11 Change in weight heat map before and after spatial attention module (a) before (b) after 1e spatial attention module paysspecial attention to some features of the face area

000010

000008

000006

000004

000002

000000

Loss

18000 20000 22000 24000 26000 28000 30000 32000Time step

(a)

09650

09675

09700

09725

09750

09775

09800

09825

09850

AU

C

50 55 60 65 70 75 80 85 90Epoch

(b)0950

0945

0940

0935

0930

0925

0920

0915

0910

ACC

50 55 60 65 70 75 80 85 90Epoch

(c)

0071

0070

0069

0068

0067

0066

0065

EER

50 55 60 65 70 75 80 85 90Epoch

(d)

Figure 12 DTFA-Net training and evaluation results in Epoch49-89 (a) the loss fluctuations of model training in Epoch49-89 (b) the AUCresults of the model in the test set in Epoch49-89 (c) the ACC results of the model in the test set in Epoch49-89 (d) the ERR results of themodel in the test set in Epoch49-89

Complexity 9

respectively and verified them on the CASIA test set Fig-ure 10 shows the training loss process (Epoch0-Epoch29)and the ROC curve in the test set (Epoch50)1e experimentshows that after introducing the attention mechanism dueto the increase in the network structure (in fact a convo-lution layer is added) the loss of the model during thetraining process is slower than that of model without SA inthe initial stage of training and there is a large shockHowever as the number of network training iterationsincreases the loss tends to be stable and there is almost nodifference between the two cases After 50 cycles of trainingthe model with SA achieved AUC 954 on the test setwhich is higher than model without SA

Visualize the input and output results of our spatialattention mechanism module as shown in Figure 11 Itshows that SA pays more attention to local areas in the faceimage such as the mouth and eyes 1is point shows theconsistency of the prior knowledge as assumed by the tra-ditional image feature description method

We first do not use SA to train the DTFA network to acertain degree and then add the SA structure to train 100times so that the spatial attention module can better learnface area information and accelerate model convergenceFigure 12 shows the training and test results of DTFA-Neton the CASIA dataset When the number of training iter-ations of the model reaches the interval of 49 ndash 89EER 0069 and AUC 0975plusmn 00001 reaching a stablestate

Table 5 provides a comparison between the results of ourproposed approach and those of the other methods in both

intradatabase evaluation Our model result is comparable tothe state-of-the-art methods

45 Samples Figure 13 shows several samples of the failureand right detection of real faces 1rough analysis we foundthat the illumination in RGB images may be the main causeof wrong classification

5 Conclusion

1is paper analyzed the photo and video replay attacks offace spoofing and built an attention network structure thatintegrated dynamic-texture features and designed a dynamicinformation fusion module that extracted features fromtexture images based on the spatial attention mechanism Atthe same time an improved gamma image optimizationalgorithm was proposed for preprocessing of image in facedetection tasks under multiple illuminations

Data Availability

1e CASIA-MFSD data used to support the findings of thisstudy were supplied by CASIA under license and so cannotbe made freely available Requests for access to these datashould be made to CASIA via httpwwwcbsriaaccn

Conflicts of Interest

1e authors declare that they have no conflicts of interest

Table 5 Comparison between our proposed method and the other in intradatabase

Method CASIA-MFSD Replay AttackEER () EER () HTER ()

LBP [26] 182 139 138IQA [9] 324 ndash 152CNN [4] 74 61 21LiveNet [27] 459 ndash 574DTFA-Net (ours) 690 647 22

False Successful

Figure 13 1e false and right detection samples Left false-negative result right true-positive case

10 Complexity

Acknowledgments

1is work was supported by the National Key Research andDevelopment Program of China (Grant 2018YFB1600600)National Natural Science Funds of China (Grant 51278058)111 Project on Information of Vehicle-InfrastructureSensing and ITS (Grant B14043) Shaanxi Natural ScienceBasic Research Program (Grant nos 2019NY-163 and2020GY-018) Joint Laboratory for Internet of VehiclesMinistry of Education-China Mobile CommunicationsCorporation (Grant 213024170015) and Special Fund forBasic Scientific Research of Central Colleges ChangrsquoanUniversity China (Grant nos 300102329101 and300102249101)

References

[1] H Steiner A Kolb and N Jung ldquoReliable face anti-spoofingusing multispectral swir imagingrdquo in Proceedings of the In-ternational Conference on Biometrics IEEE Halmstad Swe-den May 2016

[2] Y H Tang and L M Chen ldquo3d facial geometric attributesbased anti-spoofing approach against mask attacksrdquo in Pro-ceedings of the IEEE International Conference on AutomaticFace and Gesture Recognition IEEE Washington DC USApp 589ndash595 September 2017

[3] R Raghavendra and C Busch ldquoNovel presentation attackdetection algorithm for face recognition systemApplicationto 3d face mask attackrdquo in Proceedings of the IEEE Interna-tional Conference on Image Processing IEEE Paris Francepp 323ndash327 October 2014

[4] J W Yang Z Lei and S Z Li ldquoLearn convolutional neuralnetwork for face anti-spoofingrdquo 2014 httparxivorgabs14085601

[5] Y Atoum Y J Liu A Jourabloo and X M Liu ldquoFaceantispoofing using patch and depth-based cnnsrdquo in Pro-ceedings of IEEE International Joint Conference on BiometricsIEEE Denver Colorado USA pp 319ndash328 August 2017

[6] J Hernandez-Ortega J Fierrez A Morales and P TomeldquoTime analysis of pulse-based face anti-spoofing in visible andnirrdquo in Proceedings of the Conference on Computer Vision andPattern Recognition Workshops IEEE Salt Lake City UtahUSA June 2018

[7] S Q Liu X Y Lan and P C Yuen ldquoRemote photo-plethysmography correspondence feature for 3d mask facepresentation attack detectionrdquo in Proceedings of the EuropeanConference on Computer Vision IEEE Munich Germanypp 558ndash573 September 2018

[8] Z Boulkenafet J Komulainen and A Hadid ldquoFace anti-spoofing based on color texture analysisrdquo in Proceedings of theInternational Conference on Image Processing IEEE QuebecCanada pp 2636ndash2640 September 2015

[9] J Galbally and S Marcel ldquoFace anti-spoofing based on generalimage quality assessmentrdquo in Proceedings of the InternationalConference on Pattern Recognition IEEE Stockholm Swedenpp 1173ndash1178 August 2014

[10] Z Boulkenafet J Komulainen and A Hadid ldquoOn the gen-eralization of color texture-based face anti-spoofingrdquo Imageand Vision Computing vol 77 pp 1ndash9 2018

[11] S Tirunagari N Poh D Windridge A Iorliam N Suki andA T S Ho ldquoDetection of face spoofing using visual dy-namicsrdquo IEEE Transactions on Information Forensics andSecurity vol 10 no 4 pp 762ndash777 2015

[12] W Kim S Suh and J-J Han ldquoFace liveness detection from asingle image via diffusion speed modelrdquo IEEE Transactions onImage Processing vol 24 no 8 pp 2456ndash2465 2015

[13] S Bharadwaj T Dhamecha M Vatsa et al ldquoComputationallyefficient face spoofing detection with motion magnificationrdquoin Proceedings of the IEEE Conference on Computer Vision andPattern Recognition IEEE Portland Oregon June 2013

[14] T Freitas J Komulainen Anjos et al ldquoFace liveness detectionusing dynamic texturerdquo EURASIP Journal on Image andVideo Processing vol 2014 no 1 p 2 2014

[15] T U Xiaoguang H Zhang X I E Mei et al ldquoEnhance themotion cues for face anti-spoofing using cnn-lstm architec-turerdquo 2019 httparxivorgabs190105635

[16] A Alotaibi and A Mahmood ldquoDeep face liveness detectionbased on nonlinear diffusion using convolution neural net-workrdquo Signal Image and Video Processing vol 11 no 4pp 713ndash720 2017

[17] S Zhang and X Wang ldquoA dataset and benchmark for largescale multi modal face anti-spoofingrdquo in Proceedings of theConference on Computer Vision and Pattern RecognitionIEEE CA USA November 2019

[18] Y A O Feng W U Fan S H A O Xiaohu et al ldquoJoint 3Dface reconstruction and dense alignment with position mapregression networkrdquo in Proceedings of the European Con-ference on Computer Vision Springer Berlin Germanypp 557ndash574 September 2018

[19] H Kaiming Z Xiangyu and R Shaoqing ldquoDeep residuallearning for image recognitionrdquo in Proceedings of the Con-ference on Computer Vision and Pattern Recognition SeattleWA USA June 2016

[20] Schettini R Gasparini F Corchs et al ldquoContrast imagecorrection methodrdquo Journal of Electronic Imaging vol 19no 2 Article ID 023005 2010

[21] Y Cheng L Jiao X Cao and Z Li ldquoIllumination-insensitivefeatures for face recognitionrdquo Ce Visual Computer vol 33no 11 pp 1483ndash1493 2017

[22] G Farneback ldquoTwo-frame motion estimation based onpolynomial expansionrdquo in Proceedings of the 13th Scandi-navian Conference on Image Analysis Halmstad SwedenJune 2003

[23] H U Jie L I Shen S Albanie et al ldquoSqueeze-and-excitationnetworksrdquo in Proceedings of the IEEE Transactions on PatternAnalysis and Machine Intelligence Salt Lake City UT USAJune 2019

[24] S Woo J Park L Joon-Young et al ldquoCBAMconvolutionalblock attention modulerdquo in Proceedings of the EuropeanConference on Computer Vision ECCV Munich GermanySeptember 2018

[25] Z W Zhang J J Yan S F Liu Z Lei D Yi and S Z Li ldquoAface antispoofing database with diverse attacksrdquo in Proceed-ings of the International Conference on Biometrics IEEE NewDelhi India pp 26ndash31 June 2012

[26] I Chingovska A Anjos and S Marcel ldquoOn the effectivenessof localbinary patterns in face anti-spoofifingrdquo in Proceedingsof the International Conference of the Biometrics Special In-terest Group (BIOSIG) Hong Kong China September 2012

[27] Y A U Rehman L M Po and M Liu ldquoLivenet improvingfeatures generalization for face liveness detection usingconvolution neural networksrdquo Expert Systems with Applica-tions vol 108 pp 159ndash169 2018

Complexity 11

Page 6: DTFA …downloads.hindawi.com/journals/complexity/2020/5836596.pdfResearchArticle DTFA-Net:DynamicandTextureFeaturesFusionAttention NetworkforFaceAntispoofing XinCheng ,1HongfeiWang

To improve the characterization ability of the modelwe use the SE structure [23] in the fusion module whichgives different weights for the optical flow intensity anddirection features to strengthen the decision-makingability of some features First global pooling of featuregraphs is

opc AvgPool Fop1113872 1113873 1

H times W1113944

H

i11113944

W

j1Fop(i j) (11)

where Fop(i j) stands for the concatenated features ofoptical magnitude and angle 1rough global averagepooling the dimension of the stitching feature mapchanges from C timesHtimesW to C times 1times 1 Secondly learn thenonlinear functional relationship between each channelthrough full connection (FC) and activation function(ReLU) 1en use normalization (sigmoid) to get theweight of each channel

opa σ FC δ FC opc( 1113857( 1113857( 1113857( 1113857 (12)

where σ is the sigmoid function and δ is the ReLU function1e two fully connected layers are used to reduce and re-covery dimension respectively which is helpful to improvethe complexity of the function Finally we multiply Fop withopa and pass through a convolution layer to get the fusionfeatures

Fop+ Conv opa otimesFop1113872 1113873 (13)

(3) Network Details Dynamic feature extraction subnetworkinput image size is 227times 227times 3 which contains 11 con-volution layers 2 full connected layers and 6 pooling layersTables 1ndash3 show the specific network parameters of con-volution and pooling layers

322 Texture Feature Representation In specific we mapthe input RBG image to the intermediate feature maps with adimension of 384 through TexConv1-4 and then pay moreattention to some of the regions through the spatial attentionmechanism and then input the output of the attentionmodule to TexConv5 and full connection layer FC2 performsfeature extraction 1e structure of the convolutional layerTexConv1-5 is shown in Table 1 and the structure of thefully connected layer FC2 is shown in Table 4

(1) Spatial Attention Block After experiments we found thatneural networks often pay special attention to the humaneyes cheeks mouths and other areas when extracting livingfeatures 1erefore we added a spatial attention module tothe static texture extraction structure and give a different

Real face Photo attack

(a)

Real face Photo attack

(b)

Figure 6 Optical flow visualization of two adjacent face regions (a) visualization of changes in optical flow direction hue direction ofoptical flow saturation 255 and value 255 (b) optical flow magnitude visualize hue 255 saturation 255 and value size of opticalflow Among them the left two are the optical flow changes in the real face and the right two groups are the optical flow changes in the photoattacks

S

Conv layer

FC layer

S Sigmoid

GlobalAvgPool

ReLU

Product

Figure 7 Fusion attention module architecture

6 Complexity

attention to the features of different face regions Weadopted the CBAM (Figure 8) spatial attention structureproposed in [24] 1is module reduces the dimension of theinput feature map through the maximum pooling and av-erage pooling layers splices the two feature maps andobtains the attention weight of 1lowastHlowastW by the convolutionlayer and activation function

SAc δ Conv Cat AvgPool Ft( 1113857MaxPool Ft( 1113857( 1113857( 1113857( 1113857

(14)

Finally we utilized element-wise product for input Ftand SAc and the output of the spatial attention block willpass through the next layers TextConv5 and FC2

Ft+ SAc otimesFt (15)

323 Feature Fusion 1rough the above two subnetworksdynamic information and texture information are obtainedrespectively By a series of fully connected layers dropoutlayers and activation functions we fully fuse the two

information learning the nonlinear relationship between thedynamic and static features and obtain a two-dimensionalrepresentation of face in living information for living de-tection as shown in Table 4

4 Experiment

41Dataset We use CASIA-MFSD [25] to train and test themodel 1e dataset contains a total of 600 face videos col-lected from 50 individuals Face video of real face photoattack and video attack scenes are collected at differentresolutions Among them photo attack includes photobending and photo mask We ignore the different attackways and divide all the videos into real face and false face1rough the calculation of optical flow field face regiondetection and tailoring etc get 35428 sets of training imagesand 64674 sets of test images as shown in Figure 9 And wealso train and test our model on Replay Attack Database

42 Evaluation 1is experiment uses false acceptance rate(FAR) false rejection rate (FRR) equal error rate (EER) andhalf total error rate (HTER) 1e face living detection al-gorithm is based on these indicators 1e FAR refers to theratio of judging the fake face as the real face the FRR refersto the ratio of judging the real face as false and the cal-culation formulas are shown as follows

FAR Nf_r

Nf (16)

FRR Nr_f

Nr (17)

where Nf_r is the number of false face error Nr_f is thenumber of real face error Nf is the number of false faceliveness detection and Nr is the number of real face de-tection 1e two classification methods of this experimentare as follows (1) nearest neighborhood (NN) which cor-responds the two-dimensional vector of which each

Table 1 1e network of Mag_Conv1-5

Layer Input size Kernel size Filter StrideMag_Conv1 227lowast 22lowast 3 11lowast 11lowast 3 96 4MaxPooo1 27lowast 27lowast 64 3lowast 3 2Mag_Conv2 27lowast 27lowast 64 5lowast 5lowast 64 192 1MaxPool2 27lowast 27lowast192 3lowast 3 2Mag_Conv3 13lowast13lowast192 3lowast 3lowast192 384 1Mag_Conv4 13lowast13lowast 384 3lowast 3lowast 384 256 1Mag_Conv5 13lowast13lowast 256 3lowast 3lowast 256 256 1lowastVecConv1-5 and TexConv1-5 parameters are same as MagConv1-5

Table 2 1e structure of fusion attention module

Layer Input size Kernel size Filter StrideGlobalAvgPool 13lowast13lowast 512 13lowast13 1Fc1 512Fc2 32LK_Conv 13lowast13lowast 512 3lowast 3lowast 512 256 2LK_MaxPool 6lowast 6lowast 256 3lowast 3 1

Table 3 1e structure of FC1 in Figure 5

Layer Input size Output sizeFC1_1 256lowast 6lowast 6 256lowast 3lowast 3FC1_2 256lowast 3lowast 3 256lowast 2lowast 2FC1_3 256lowast 2lowast 2 256lowast 2

Table 4 1e structure of FC2-3 in Figure 5

Layer Input size Output sizeFC2_1 256lowast 6lowast 6 256lowast 3lowast 3FC2_2 256lowast 3lowast 3 256lowast 2lowast 2FC3_1 256lowast 6 256lowast 3FC3_2 256lowast 3 256FC3_3 256 2

AvgPool

Product

S

Conv layer

S Sigmoid

MaxPool

Figure 8 Spatial attention block We introduce this module afterthe convolution layer of the subnetwork is extracted from the staticfeature which gives the difference attention to the local area of theface

Complexity 7

dimension value represents the probability of real face orattack face and selects the category which corresponds to themaximum value as the classification result (2) 1resholdingselects a certain threshold to classify the representationresult 1is method is mainly for model validation andtesting Calculating FAR and FRR at different thresholds canplot the receiver operating characteristic (ROC) curve formeasuring the nonequilibrium in the classification problemthe area under the ROC curve (area under curve AUC) canintuitively show the algorithm classification effect

43 Implementation Details 1e proposed method isimplemented in Pytorch with an inconstant learning rate(eg lr 001 when epochlt5 and lr 0001 when epochge 5)

1e batch size of the model is 128 with num_worker 100We initialize our network by using the parameters ofAlextNet100 1e network is trained with standard SGD for50 or 100 epochs on Tesla V100 GPU And we use crossentropy loss and the input resolution is 227times 227

44 Experimental Result

441 Ablation of Spatial Attention Module We conductedan ablation experiment on the attention module of thetexture feature extraction subnetwork and only rely ontexture features to perform live detection on the CAISAdataset We trained the two texture feature extraction net-works with or without spatial attention block 50 times

(a)

(b)

Figure 9 CASIA-MFSD examples after preprocessing From left to right texture image optical flow magnitude and optical flow directionAmong them (a) fake face and (b) real face

06

05

04

03

02

01

00

0 2000 4000 6000 8000 10000Time step

Loss

With SAWithout SA

(a)

10

08

06

04

02

00100806040200

FAR

1-FR

R

With SA AUC = 095484)Without SA AUC = 094652)

(b)

Figure 10 Network loss and ROC curve with or without spatial attention module (a) training loss as time step went by (b) ROC curve

8 Complexity

(a)

(b)

Figure 11 Change in weight heat map before and after spatial attention module (a) before (b) after 1e spatial attention module paysspecial attention to some features of the face area

000010

000008

000006

000004

000002

000000

Loss

18000 20000 22000 24000 26000 28000 30000 32000Time step

(a)

09650

09675

09700

09725

09750

09775

09800

09825

09850

AU

C

50 55 60 65 70 75 80 85 90Epoch

(b)0950

0945

0940

0935

0930

0925

0920

0915

0910

ACC

50 55 60 65 70 75 80 85 90Epoch

(c)

0071

0070

0069

0068

0067

0066

0065

EER

50 55 60 65 70 75 80 85 90Epoch

(d)

Figure 12 DTFA-Net training and evaluation results in Epoch49-89 (a) the loss fluctuations of model training in Epoch49-89 (b) the AUCresults of the model in the test set in Epoch49-89 (c) the ACC results of the model in the test set in Epoch49-89 (d) the ERR results of themodel in the test set in Epoch49-89

Complexity 9

respectively and verified them on the CASIA test set Fig-ure 10 shows the training loss process (Epoch0-Epoch29)and the ROC curve in the test set (Epoch50)1e experimentshows that after introducing the attention mechanism dueto the increase in the network structure (in fact a convo-lution layer is added) the loss of the model during thetraining process is slower than that of model without SA inthe initial stage of training and there is a large shockHowever as the number of network training iterationsincreases the loss tends to be stable and there is almost nodifference between the two cases After 50 cycles of trainingthe model with SA achieved AUC 954 on the test setwhich is higher than model without SA

Visualize the input and output results of our spatialattention mechanism module as shown in Figure 11 Itshows that SA pays more attention to local areas in the faceimage such as the mouth and eyes 1is point shows theconsistency of the prior knowledge as assumed by the tra-ditional image feature description method

We first do not use SA to train the DTFA network to acertain degree and then add the SA structure to train 100times so that the spatial attention module can better learnface area information and accelerate model convergenceFigure 12 shows the training and test results of DTFA-Neton the CASIA dataset When the number of training iter-ations of the model reaches the interval of 49 ndash 89EER 0069 and AUC 0975plusmn 00001 reaching a stablestate

Table 5 provides a comparison between the results of ourproposed approach and those of the other methods in both

intradatabase evaluation Our model result is comparable tothe state-of-the-art methods

45 Samples Figure 13 shows several samples of the failureand right detection of real faces 1rough analysis we foundthat the illumination in RGB images may be the main causeof wrong classification

5 Conclusion

1is paper analyzed the photo and video replay attacks offace spoofing and built an attention network structure thatintegrated dynamic-texture features and designed a dynamicinformation fusion module that extracted features fromtexture images based on the spatial attention mechanism Atthe same time an improved gamma image optimizationalgorithm was proposed for preprocessing of image in facedetection tasks under multiple illuminations

Data Availability

1e CASIA-MFSD data used to support the findings of thisstudy were supplied by CASIA under license and so cannotbe made freely available Requests for access to these datashould be made to CASIA via httpwwwcbsriaaccn

Conflicts of Interest

1e authors declare that they have no conflicts of interest

Table 5 Comparison between our proposed method and the other in intradatabase

Method CASIA-MFSD Replay AttackEER () EER () HTER ()

LBP [26] 182 139 138IQA [9] 324 ndash 152CNN [4] 74 61 21LiveNet [27] 459 ndash 574DTFA-Net (ours) 690 647 22

False Successful

Figure 13 1e false and right detection samples Left false-negative result right true-positive case

10 Complexity

Acknowledgments

1is work was supported by the National Key Research andDevelopment Program of China (Grant 2018YFB1600600)National Natural Science Funds of China (Grant 51278058)111 Project on Information of Vehicle-InfrastructureSensing and ITS (Grant B14043) Shaanxi Natural ScienceBasic Research Program (Grant nos 2019NY-163 and2020GY-018) Joint Laboratory for Internet of VehiclesMinistry of Education-China Mobile CommunicationsCorporation (Grant 213024170015) and Special Fund forBasic Scientific Research of Central Colleges ChangrsquoanUniversity China (Grant nos 300102329101 and300102249101)

References

[1] H Steiner A Kolb and N Jung ldquoReliable face anti-spoofingusing multispectral swir imagingrdquo in Proceedings of the In-ternational Conference on Biometrics IEEE Halmstad Swe-den May 2016

[2] Y H Tang and L M Chen ldquo3d facial geometric attributesbased anti-spoofing approach against mask attacksrdquo in Pro-ceedings of the IEEE International Conference on AutomaticFace and Gesture Recognition IEEE Washington DC USApp 589ndash595 September 2017

[3] R Raghavendra and C Busch ldquoNovel presentation attackdetection algorithm for face recognition systemApplicationto 3d face mask attackrdquo in Proceedings of the IEEE Interna-tional Conference on Image Processing IEEE Paris Francepp 323ndash327 October 2014

[4] J W Yang Z Lei and S Z Li ldquoLearn convolutional neuralnetwork for face anti-spoofingrdquo 2014 httparxivorgabs14085601

[5] Y Atoum Y J Liu A Jourabloo and X M Liu ldquoFaceantispoofing using patch and depth-based cnnsrdquo in Pro-ceedings of IEEE International Joint Conference on BiometricsIEEE Denver Colorado USA pp 319ndash328 August 2017

[6] J Hernandez-Ortega J Fierrez A Morales and P TomeldquoTime analysis of pulse-based face anti-spoofing in visible andnirrdquo in Proceedings of the Conference on Computer Vision andPattern Recognition Workshops IEEE Salt Lake City UtahUSA June 2018

[7] S Q Liu X Y Lan and P C Yuen ldquoRemote photo-plethysmography correspondence feature for 3d mask facepresentation attack detectionrdquo in Proceedings of the EuropeanConference on Computer Vision IEEE Munich Germanypp 558ndash573 September 2018

[8] Z Boulkenafet J Komulainen and A Hadid ldquoFace anti-spoofing based on color texture analysisrdquo in Proceedings of theInternational Conference on Image Processing IEEE QuebecCanada pp 2636ndash2640 September 2015

[9] J Galbally and S Marcel ldquoFace anti-spoofing based on generalimage quality assessmentrdquo in Proceedings of the InternationalConference on Pattern Recognition IEEE Stockholm Swedenpp 1173ndash1178 August 2014

[10] Z Boulkenafet J Komulainen and A Hadid ldquoOn the gen-eralization of color texture-based face anti-spoofingrdquo Imageand Vision Computing vol 77 pp 1ndash9 2018

[11] S Tirunagari N Poh D Windridge A Iorliam N Suki andA T S Ho ldquoDetection of face spoofing using visual dy-namicsrdquo IEEE Transactions on Information Forensics andSecurity vol 10 no 4 pp 762ndash777 2015

[12] W Kim S Suh and J-J Han ldquoFace liveness detection from asingle image via diffusion speed modelrdquo IEEE Transactions onImage Processing vol 24 no 8 pp 2456ndash2465 2015

[13] S Bharadwaj T Dhamecha M Vatsa et al ldquoComputationallyefficient face spoofing detection with motion magnificationrdquoin Proceedings of the IEEE Conference on Computer Vision andPattern Recognition IEEE Portland Oregon June 2013

[14] T Freitas J Komulainen Anjos et al ldquoFace liveness detectionusing dynamic texturerdquo EURASIP Journal on Image andVideo Processing vol 2014 no 1 p 2 2014

[15] T U Xiaoguang H Zhang X I E Mei et al ldquoEnhance themotion cues for face anti-spoofing using cnn-lstm architec-turerdquo 2019 httparxivorgabs190105635

[16] A Alotaibi and A Mahmood ldquoDeep face liveness detectionbased on nonlinear diffusion using convolution neural net-workrdquo Signal Image and Video Processing vol 11 no 4pp 713ndash720 2017

[17] S Zhang and X Wang ldquoA dataset and benchmark for largescale multi modal face anti-spoofingrdquo in Proceedings of theConference on Computer Vision and Pattern RecognitionIEEE CA USA November 2019

[18] Y A O Feng W U Fan S H A O Xiaohu et al ldquoJoint 3Dface reconstruction and dense alignment with position mapregression networkrdquo in Proceedings of the European Con-ference on Computer Vision Springer Berlin Germanypp 557ndash574 September 2018

[19] H Kaiming Z Xiangyu and R Shaoqing ldquoDeep residuallearning for image recognitionrdquo in Proceedings of the Con-ference on Computer Vision and Pattern Recognition SeattleWA USA June 2016

[20] Schettini R Gasparini F Corchs et al ldquoContrast imagecorrection methodrdquo Journal of Electronic Imaging vol 19no 2 Article ID 023005 2010

[21] Y Cheng L Jiao X Cao and Z Li ldquoIllumination-insensitivefeatures for face recognitionrdquo Ce Visual Computer vol 33no 11 pp 1483ndash1493 2017

[22] G Farneback ldquoTwo-frame motion estimation based onpolynomial expansionrdquo in Proceedings of the 13th Scandi-navian Conference on Image Analysis Halmstad SwedenJune 2003

[23] H U Jie L I Shen S Albanie et al ldquoSqueeze-and-excitationnetworksrdquo in Proceedings of the IEEE Transactions on PatternAnalysis and Machine Intelligence Salt Lake City UT USAJune 2019

[24] S Woo J Park L Joon-Young et al ldquoCBAMconvolutionalblock attention modulerdquo in Proceedings of the EuropeanConference on Computer Vision ECCV Munich GermanySeptember 2018

[25] Z W Zhang J J Yan S F Liu Z Lei D Yi and S Z Li ldquoAface antispoofing database with diverse attacksrdquo in Proceed-ings of the International Conference on Biometrics IEEE NewDelhi India pp 26ndash31 June 2012

[26] I Chingovska A Anjos and S Marcel ldquoOn the effectivenessof localbinary patterns in face anti-spoofifingrdquo in Proceedingsof the International Conference of the Biometrics Special In-terest Group (BIOSIG) Hong Kong China September 2012

[27] Y A U Rehman L M Po and M Liu ldquoLivenet improvingfeatures generalization for face liveness detection usingconvolution neural networksrdquo Expert Systems with Applica-tions vol 108 pp 159ndash169 2018

Complexity 11

Page 7: DTFA …downloads.hindawi.com/journals/complexity/2020/5836596.pdfResearchArticle DTFA-Net:DynamicandTextureFeaturesFusionAttention NetworkforFaceAntispoofing XinCheng ,1HongfeiWang

attention to the features of different face regions Weadopted the CBAM (Figure 8) spatial attention structureproposed in [24] 1is module reduces the dimension of theinput feature map through the maximum pooling and av-erage pooling layers splices the two feature maps andobtains the attention weight of 1lowastHlowastW by the convolutionlayer and activation function

SAc δ Conv Cat AvgPool Ft( 1113857MaxPool Ft( 1113857( 1113857( 1113857( 1113857

(14)

Finally we utilized element-wise product for input Ftand SAc and the output of the spatial attention block willpass through the next layers TextConv5 and FC2

Ft+ SAc otimesFt (15)

323 Feature Fusion 1rough the above two subnetworksdynamic information and texture information are obtainedrespectively By a series of fully connected layers dropoutlayers and activation functions we fully fuse the two

information learning the nonlinear relationship between thedynamic and static features and obtain a two-dimensionalrepresentation of face in living information for living de-tection as shown in Table 4

4 Experiment

41Dataset We use CASIA-MFSD [25] to train and test themodel 1e dataset contains a total of 600 face videos col-lected from 50 individuals Face video of real face photoattack and video attack scenes are collected at differentresolutions Among them photo attack includes photobending and photo mask We ignore the different attackways and divide all the videos into real face and false face1rough the calculation of optical flow field face regiondetection and tailoring etc get 35428 sets of training imagesand 64674 sets of test images as shown in Figure 9 And wealso train and test our model on Replay Attack Database

42 Evaluation 1is experiment uses false acceptance rate(FAR) false rejection rate (FRR) equal error rate (EER) andhalf total error rate (HTER) 1e face living detection al-gorithm is based on these indicators 1e FAR refers to theratio of judging the fake face as the real face the FRR refersto the ratio of judging the real face as false and the cal-culation formulas are shown as follows

FAR Nf_r

Nf (16)

FRR Nr_f

Nr (17)

where Nf_r is the number of false face error Nr_f is thenumber of real face error Nf is the number of false faceliveness detection and Nr is the number of real face de-tection 1e two classification methods of this experimentare as follows (1) nearest neighborhood (NN) which cor-responds the two-dimensional vector of which each

Table 1 1e network of Mag_Conv1-5

Layer Input size Kernel size Filter StrideMag_Conv1 227lowast 22lowast 3 11lowast 11lowast 3 96 4MaxPooo1 27lowast 27lowast 64 3lowast 3 2Mag_Conv2 27lowast 27lowast 64 5lowast 5lowast 64 192 1MaxPool2 27lowast 27lowast192 3lowast 3 2Mag_Conv3 13lowast13lowast192 3lowast 3lowast192 384 1Mag_Conv4 13lowast13lowast 384 3lowast 3lowast 384 256 1Mag_Conv5 13lowast13lowast 256 3lowast 3lowast 256 256 1lowastVecConv1-5 and TexConv1-5 parameters are same as MagConv1-5

Table 2 1e structure of fusion attention module

Layer Input size Kernel size Filter StrideGlobalAvgPool 13lowast13lowast 512 13lowast13 1Fc1 512Fc2 32LK_Conv 13lowast13lowast 512 3lowast 3lowast 512 256 2LK_MaxPool 6lowast 6lowast 256 3lowast 3 1

Table 3 1e structure of FC1 in Figure 5

Layer Input size Output sizeFC1_1 256lowast 6lowast 6 256lowast 3lowast 3FC1_2 256lowast 3lowast 3 256lowast 2lowast 2FC1_3 256lowast 2lowast 2 256lowast 2

Table 4 1e structure of FC2-3 in Figure 5

Layer Input size Output sizeFC2_1 256lowast 6lowast 6 256lowast 3lowast 3FC2_2 256lowast 3lowast 3 256lowast 2lowast 2FC3_1 256lowast 6 256lowast 3FC3_2 256lowast 3 256FC3_3 256 2

AvgPool

Product

S

Conv layer

S Sigmoid

MaxPool

Figure 8 Spatial attention block We introduce this module afterthe convolution layer of the subnetwork is extracted from the staticfeature which gives the difference attention to the local area of theface

Complexity 7

dimension value represents the probability of real face orattack face and selects the category which corresponds to themaximum value as the classification result (2) 1resholdingselects a certain threshold to classify the representationresult 1is method is mainly for model validation andtesting Calculating FAR and FRR at different thresholds canplot the receiver operating characteristic (ROC) curve formeasuring the nonequilibrium in the classification problemthe area under the ROC curve (area under curve AUC) canintuitively show the algorithm classification effect

43 Implementation Details 1e proposed method isimplemented in Pytorch with an inconstant learning rate(eg lr 001 when epochlt5 and lr 0001 when epochge 5)

1e batch size of the model is 128 with num_worker 100We initialize our network by using the parameters ofAlextNet100 1e network is trained with standard SGD for50 or 100 epochs on Tesla V100 GPU And we use crossentropy loss and the input resolution is 227times 227

44 Experimental Result

441 Ablation of Spatial Attention Module We conductedan ablation experiment on the attention module of thetexture feature extraction subnetwork and only rely ontexture features to perform live detection on the CAISAdataset We trained the two texture feature extraction net-works with or without spatial attention block 50 times

(a)

(b)

Figure 9 CASIA-MFSD examples after preprocessing From left to right texture image optical flow magnitude and optical flow directionAmong them (a) fake face and (b) real face

06

05

04

03

02

01

00

0 2000 4000 6000 8000 10000Time step

Loss

With SAWithout SA

(a)

10

08

06

04

02

00100806040200

FAR

1-FR

R

With SA AUC = 095484)Without SA AUC = 094652)

(b)

Figure 10 Network loss and ROC curve with or without spatial attention module (a) training loss as time step went by (b) ROC curve

8 Complexity

(a)

(b)

Figure 11 Change in weight heat map before and after spatial attention module (a) before (b) after 1e spatial attention module paysspecial attention to some features of the face area

000010

000008

000006

000004

000002

000000

Loss

18000 20000 22000 24000 26000 28000 30000 32000Time step

(a)

09650

09675

09700

09725

09750

09775

09800

09825

09850

AU

C

50 55 60 65 70 75 80 85 90Epoch

(b)0950

0945

0940

0935

0930

0925

0920

0915

0910

ACC

50 55 60 65 70 75 80 85 90Epoch

(c)

0071

0070

0069

0068

0067

0066

0065

EER

50 55 60 65 70 75 80 85 90Epoch

(d)

Figure 12 DTFA-Net training and evaluation results in Epoch49-89 (a) the loss fluctuations of model training in Epoch49-89 (b) the AUCresults of the model in the test set in Epoch49-89 (c) the ACC results of the model in the test set in Epoch49-89 (d) the ERR results of themodel in the test set in Epoch49-89

Complexity 9

respectively and verified them on the CASIA test set Fig-ure 10 shows the training loss process (Epoch0-Epoch29)and the ROC curve in the test set (Epoch50)1e experimentshows that after introducing the attention mechanism dueto the increase in the network structure (in fact a convo-lution layer is added) the loss of the model during thetraining process is slower than that of model without SA inthe initial stage of training and there is a large shockHowever as the number of network training iterationsincreases the loss tends to be stable and there is almost nodifference between the two cases After 50 cycles of trainingthe model with SA achieved AUC 954 on the test setwhich is higher than model without SA

Figure 11 visualizes the input and output of our spatial attention module. It shows that SA pays more attention to local areas of the face image, such as the mouth and eyes. This is consistent with the prior knowledge assumed by traditional image feature description methods.
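One way to produce such before/after heat maps, sketched below under the assumption that the spatial attention module is an `nn.Module` inside the network, is to capture its input and output with forward hooks and average them over channels.

```python
# A sketch (under assumptions) of producing the before/after heat maps of Figure 11:
# capture the tensors entering and leaving a spatial attention module with forward
# hooks and average over channels to obtain 2D maps for plotting.
import torch
import torch.nn as nn

def attention_heatmaps(model: nn.Module, sa_module: nn.Module, image: torch.Tensor):
    """Return channel-averaged maps of the feature tensor before and after `sa_module`."""
    captured = {}
    def pre_hook(module, inputs):
        captured["before"] = inputs[0].detach()
    def post_hook(module, inputs, output):
        captured["after"] = output.detach()
    handles = [sa_module.register_forward_pre_hook(pre_hook),
               sa_module.register_forward_hook(post_hook)]
    with torch.no_grad():
        model(image.unsqueeze(0))              # single forward pass through the network
    for h in handles:
        h.remove()
    # Average over channels; the resulting H x W maps can be shown with imshow.
    return {k: v.mean(dim=1)[0].cpu() for k, v in captured.items()}
```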

We first train the DTFA network without SA to a certain degree and then add the SA structure and train for 100 epochs, so that the spatial attention module can better learn face area information and model convergence is accelerated. Figure 12 shows the training and test results of DTFA-Net on the CASIA dataset. When the number of training epochs reaches the interval 49–89, EER = 0.069 and AUC = 0.975 ± 0.0001, reaching a stable state.
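A hedged outline of this two-stage schedule is given below; `make_model` and `train` are hypothetical helpers standing in for the authors' network constructor and training loop (for instance, the training sketch in Section 4.3), and only the staging logic is illustrated.

```python
# A hedged outline of the two-stage schedule described above. `make_model` and `train`
# are hypothetical helpers, not the authors' actual code.
def staged_training(make_model, train, train_set):
    base = make_model(use_sa=False)                         # stage 1: no spatial attention
    train(base, train_set, epochs=50)
    full = make_model(use_sa=True)                          # stage 2: insert the SA block
    full.load_state_dict(base.state_dict(), strict=False)   # reuse stage-1 weights
    train(full, train_set, epochs=100)
    return full
```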

Table 5 provides a comparison between the results of our proposed approach and those of other methods in the intradatabase evaluation. Our model's results are comparable to the state-of-the-art methods.

4.5. Samples. Figure 13 shows several samples of failed and correct detections of real faces. Through analysis, we found that the illumination in RGB images may be the main cause of misclassification.

5. Conclusion

This paper analyzed photo and video replay attacks against face recognition, built an attention network structure that integrates dynamic and texture features, and designed a dynamic information fusion module together with a spatial-attention-based texture feature extractor. In addition, an improved gamma image optimization algorithm was proposed for image preprocessing in face detection tasks under varying illumination.

Data Availability

The CASIA-MFSD data used to support the findings of this study were supplied by CASIA under license and so cannot be made freely available. Requests for access to these data should be made to CASIA via http://www.cbsr.ia.ac.cn.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Table 5: Comparison between our proposed method and other methods in the intradatabase evaluation.

Method            CASIA-MFSD EER (%)    Replay Attack EER (%)    Replay Attack HTER (%)
LBP [26]          18.2                  13.9                     13.8
IQA [9]           32.4                  –                        15.2
CNN [4]           7.4                   6.1                      2.1
LiveNet [27]      4.59                  –                        5.74
DTFA-Net (ours)   6.90                  6.47                     2.2

Figure 13: False and correct detection samples. Left: false-negative result; right: true-positive case.


Acknowledgments

This work was supported by the National Key Research and Development Program of China (Grant 2018YFB1600600), National Natural Science Funds of China (Grant 51278058), 111 Project on Information of Vehicle-Infrastructure Sensing and ITS (Grant B14043), Shaanxi Natural Science Basic Research Program (Grant nos. 2019NY-163 and 2020GY-018), Joint Laboratory for Internet of Vehicles, Ministry of Education-China Mobile Communications Corporation (Grant 213024170015), and Special Fund for Basic Scientific Research of Central Colleges, Chang'an University, China (Grant nos. 300102329101 and 300102249101).

References

[1] H. Steiner, A. Kolb, and N. Jung, "Reliable face anti-spoofing using multispectral SWIR imaging," in Proceedings of the International Conference on Biometrics, IEEE, Halmstad, Sweden, May 2016.
[2] Y. H. Tang and L. M. Chen, "3D facial geometric attributes based anti-spoofing approach against mask attacks," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, IEEE, Washington, DC, USA, pp. 589–595, September 2017.
[3] R. Raghavendra and C. Busch, "Novel presentation attack detection algorithm for face recognition system: application to 3D face mask attack," in Proceedings of the IEEE International Conference on Image Processing, IEEE, Paris, France, pp. 323–327, October 2014.
[4] J. W. Yang, Z. Lei, and S. Z. Li, "Learn convolutional neural network for face anti-spoofing," 2014, http://arxiv.org/abs/1408.5601.
[5] Y. Atoum, Y. J. Liu, A. Jourabloo, and X. M. Liu, "Face antispoofing using patch and depth-based CNNs," in Proceedings of the IEEE International Joint Conference on Biometrics, IEEE, Denver, CO, USA, pp. 319–328, August 2017.
[6] J. Hernandez-Ortega, J. Fierrez, A. Morales, and P. Tome, "Time analysis of pulse-based face anti-spoofing in visible and NIR," in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, IEEE, Salt Lake City, UT, USA, June 2018.
[7] S. Q. Liu, X. Y. Lan, and P. C. Yuen, "Remote photoplethysmography correspondence feature for 3D mask face presentation attack detection," in Proceedings of the European Conference on Computer Vision, Munich, Germany, pp. 558–573, September 2018.
[8] Z. Boulkenafet, J. Komulainen, and A. Hadid, "Face anti-spoofing based on color texture analysis," in Proceedings of the International Conference on Image Processing, IEEE, Quebec, Canada, pp. 2636–2640, September 2015.
[9] J. Galbally and S. Marcel, "Face anti-spoofing based on general image quality assessment," in Proceedings of the International Conference on Pattern Recognition, IEEE, Stockholm, Sweden, pp. 1173–1178, August 2014.
[10] Z. Boulkenafet, J. Komulainen, and A. Hadid, "On the generalization of color texture-based face anti-spoofing," Image and Vision Computing, vol. 77, pp. 1–9, 2018.
[11] S. Tirunagari, N. Poh, D. Windridge, A. Iorliam, N. Suki, and A. T. S. Ho, "Detection of face spoofing using visual dynamics," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 762–777, 2015.
[12] W. Kim, S. Suh, and J.-J. Han, "Face liveness detection from a single image via diffusion speed model," IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2456–2465, 2015.
[13] S. Bharadwaj, T. Dhamecha, M. Vatsa et al., "Computationally efficient face spoofing detection with motion magnification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Portland, OR, USA, June 2013.
[14] T. Freitas, J. Komulainen, A. Anjos et al., "Face liveness detection using dynamic texture," EURASIP Journal on Image and Video Processing, vol. 2014, no. 1, p. 2, 2014.
[15] X. Tu, H. Zhang, M. Xie et al., "Enhance the motion cues for face anti-spoofing using CNN-LSTM architecture," 2019, http://arxiv.org/abs/1901.05635.
[16] A. Alotaibi and A. Mahmood, "Deep face liveness detection based on nonlinear diffusion using convolution neural network," Signal, Image and Video Processing, vol. 11, no. 4, pp. 713–720, 2017.
[17] S. Zhang and X. Wang, "A dataset and benchmark for large scale multi-modal face anti-spoofing," in Proceedings of the Conference on Computer Vision and Pattern Recognition, IEEE, CA, USA, November 2019.
[18] Y. Feng, F. Wu, X. Shao et al., "Joint 3D face reconstruction and dense alignment with position map regression network," in Proceedings of the European Conference on Computer Vision, Springer, Berlin, Germany, pp. 557–574, September 2018.
[19] K. He, X. Zhang, and S. Ren, "Deep residual learning for image recognition," in Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2016.
[20] R. Schettini, F. Gasparini, S. Corchs et al., "Contrast image correction method," Journal of Electronic Imaging, vol. 19, no. 2, Article ID 023005, 2010.
[21] Y. Cheng, L. Jiao, X. Cao, and Z. Li, "Illumination-insensitive features for face recognition," The Visual Computer, vol. 33, no. 11, pp. 1483–1493, 2017.
[22] G. Farneback, "Two-frame motion estimation based on polynomial expansion," in Proceedings of the 13th Scandinavian Conference on Image Analysis, Halmstad, Sweden, June 2003.
[23] J. Hu, L. Shen, S. Albanie et al., "Squeeze-and-excitation networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[24] S. Woo, J. Park, J.-Y. Lee et al., "CBAM: convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
[25] Z. W. Zhang, J. J. Yan, S. F. Liu, Z. Lei, D. Yi, and S. Z. Li, "A face antispoofing database with diverse attacks," in Proceedings of the International Conference on Biometrics, IEEE, New Delhi, India, pp. 26–31, June 2012.
[26] I. Chingovska, A. Anjos, and S. Marcel, "On the effectiveness of local binary patterns in face anti-spoofing," in Proceedings of the International Conference of the Biometrics Special Interest Group (BIOSIG), Hong Kong, China, September 2012.
[27] Y. A. U. Rehman, L. M. Po, and M. Liu, "LiveNet: improving features generalization for face liveness detection using convolution neural networks," Expert Systems with Applications, vol. 108, pp. 159–169, 2018.

