Outline
❏ Introduction
❏ Related works
❏ Why do this work
❏ Model design
❏ Adversarial attack algorithm
❏ Experiments
❏ Conclusion
❏ Future work
Introduction
❏ Traditional deception detection technology:
❏ Polygraph and fMRI: inconvenient and not reliable.
❏ Eye-tracking technology: based on emotional reaction; limited.
1. A. R., "Detecting Deception," Monitor on Psychology, vol. 37, no. 7, p. 70, 2004.
2. "Educational psychologists use eye-tracking method for detecting lies," psychologicalscience.org, retrieved 26 April 2012.
Related Works
❏ They analyzed the importance of vision, audio and text in videos through sequential input.
Z. Wu, B. Singh, L.S. Davis, and V.S. Subrahmanian. "Deception detection in videos," Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Related Works
❏ They adopted the relationship between gestures and facial emotions to identify whether the subject is lying or not.
M. Ding, A. Zhao, Z. Lu, T. Xiang, and J.R. Wen. "Face-focused cross-stream network for deception detection in videos," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
Why do this work
❏ Recently, several papers have explored the usefulness of face and audio features for deception detection.
❏ However, adversarial attacks on face or audio features have not been studied.
M. Ding, A. Zhao, Z. Lu, T. Xiang, and J.R. Wen. "Face-focused cross-stream network for deception detection in videos," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
Dataset
❏ Real-Life Trial dataset
❏ 59 deceptive and 50 truthful videos
❏ 14 female and 16 male speakers
❏ Collected from court trials
V. Pérez-Rosas, M. Abouelenien, R. Mihalcea, and M. Burzo. "Deception detection using real-life trial data," Proceedings of the 2015 ACM International Conference on Multimodal Interaction, pp. 59-66, 2015.
Model architecture (server)
❏ CNN: ResNet18, ResNeXt50_32x4d
❏ Sequential processing: GRU, Transformer
Multi-modal architecture (server)
❏ CNN: ResNet18, ResNeXt50_32x4d
❏ Sequential processing: GRU, Transformer
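As a shape-level illustration of the "CNN features, then sequential processing" design above, here is a minimal NumPy sketch of a GRU cell run over per-frame feature vectors. All sizes, weights, and function names are illustrative stand-ins, not the ResNet/GRU models used in the experiments:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRUCell:
    # Minimal GRU cell over per-frame CNN feature vectors (hypothetical sizes).
    def __init__(self, d_in, d_h, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_h)
        self.Wz = rng.uniform(-s, s, (d_h, d_in + d_h))  # update-gate weights
        self.Wr = rng.uniform(-s, s, (d_h, d_in + d_h))  # reset-gate weights
        self.Wh = rng.uniform(-s, s, (d_h, d_in + d_h))  # candidate weights

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                        # update gate
        r = sigmoid(self.Wr @ xh)                        # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

def classify_video(frame_feats, cell, w_out):
    # Run the GRU over the sequence of frame features; classify from the
    # last hidden state (a stand-in for the deceptive/truthful head).
    h = np.zeros(cell.Wz.shape[0])
    for f in frame_feats:
        h = cell.step(f, h)
    return sigmoid(w_out @ h)  # P(deceptive)
```

The same skeleton applies to the multi-modal variant if audio features are concatenated to the per-frame visual features before the recurrent step.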
Model architecture (attacker)
❏ CNN: AlexNet, VGG16, ResNet18, ResNet50, ResNeXt50_32x4d
Attack algorithm
❏ Fast Gradient Sign Method (FGSM)
❏ Iterative FGSM (I-FGSM)
❏ Momentum iterative FGSM (MI-FGSM)
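The three attacks differ only in how the input gradient becomes a perturbation: one sign step (FGSM), many clipped small steps (I-FGSM), or accumulated momentum over normalized gradients (MI-FGSM). A minimal NumPy sketch on a stand-in logistic model, where the analytic gradient replaces backprop through the real CNN and `eps`, `steps`, `mu` are illustrative:

```python
import numpy as np

def grad_loss(x, w, y):
    # Input gradient of binary cross-entropy for a stand-in logistic
    # model p = sigmoid(w . x); replaces backprop through the real network.
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return (p - y) * w

def fgsm(x, w, y, eps):
    # One step of size eps in the sign of the input gradient.
    return x + eps * np.sign(grad_loss(x, w, y))

def i_fgsm(x, w, y, eps, steps=10):
    # Iterative FGSM: small steps, clipped back into the eps-ball around x.
    alpha = eps / steps
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_loss(x_adv, w, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

def mi_fgsm(x, w, y, eps, steps=10, mu=1.0):
    # Momentum iterative FGSM: accumulate a decayed, L1-normalized gradient
    # and step in its sign, again clipped to the eps-ball.
    alpha = eps / steps
    x_adv, g = x.copy(), np.zeros_like(x)
    for _ in range(steps):
        grad = grad_loss(x_adv, w, y)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv
```

All three return a perturbed input inside the L∞ ball of radius eps; the momentum term is what stabilizes the update direction and improves transferability across models.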
Experiments
Model of the server
| Real-life | ResNet18 + GRU | ResNet18 + Transformer | ResNeXt50_32x4d + GRU | ResNeXt50_32x4d + Transformer |
|---|---|---|---|---|
| Video | 78.18% | 90.91% | 81.82% | 86.36% |
| Video + Audio | 95.45% | 90.91% | - | - |
Model of the attacker
| Real-life | AlexNet | VGG16 | ResNet18 | ResNet50 | ResNeXt50_32x4d |
|---|---|---|---|---|---|
| Video | 72.73% | 89.10% | 84.55% | 90.91% | 90.91% |
| Audio | 72.73% | 72.73% | 81.82% | 90.91% | 100% |
Adversarial attack on Video model
| Attack | ResNet18 + GRU | ResNet18 + Transformer | ResNeXt50_32x4d + GRU | ResNeXt50_32x4d + Transformer |
|---|---|---|---|---|
| Standard accuracy | 78.18% | 90.91% | 81.82% | 86.36% |
| ResNet18 (FGSM) | 71.82% | 63.64% | 48.18% | 63.64% |
| Ensemble (FGSM) | 67.27% | 63.64% | 46.36% | 63.64% |
| ResNet18 (I-FGSM) | 68.18% | 81.82% | 71.82% | 72.73% |
| Ensemble (I-FGSM) | 74.55% | 63.64% | 73.64% | 51.82% |
| ResNet18 (MI-FGSM) | 70.91% | 55.45% | 64.55% | 60.91% |
| Ensemble (MI-FGSM) | 66.36% | 45.45% | 61.82% | 49.09% |
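The "Ensemble" attacks use several substitute CNNs instead of one. One common construction, sketched here on stand-in linear models rather than the actual attacker CNNs, averages the per-model input gradients before the sign step:

```python
import numpy as np

def input_grad(x, w, y):
    # Input gradient of binary cross-entropy for a stand-in logistic model.
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return (p - y) * w

def ensemble_fgsm(x, ws, y, eps):
    # Average the input gradients of every substitute model,
    # then take a single FGSM sign step.
    g = np.mean([input_grad(x, w, y) for w in ws], axis=0)
    return x + eps * np.sign(g)
```

Averaging over substitutes suppresses gradient directions specific to any one model, which is why the ensemble rows transfer better to the unseen server models.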
Adversarial attack on Multi-modal model

| Video | Standard accuracy | ResNet18 (FGSM) | Ensemble (FGSM) | ResNet18 (I-FGSM) | Ensemble (I-FGSM) | ResNet18 (MI-FGSM) | Ensemble (MI-FGSM) |
|---|---|---|---|---|---|---|---|
| ResNet18 + GRU | 95.45% | 92.73% | 88.18% | 93.64% | 79.09% | 92.73% | 70.00% |
| ResNet18 + Transformer | 90.91% | 87.27% | 80.00% | 90.00% | 70.91% | 82.73% | 67.27% |

| Audio | Standard accuracy | ResNet18 (FGSM) | Ensemble (FGSM) | ResNet18 (I-FGSM) | Ensemble (I-FGSM) | ResNet18 (MI-FGSM) | Ensemble (MI-FGSM) |
|---|---|---|---|---|---|---|---|
| ResNet18 + GRU | 95.45% | 94.55% | 95.45% | 91.82% | 91.82% | 90.91% | 92.73% |
| ResNet18 + Transformer | 90.91% | 90.00% | 90.91% | 90.00% | 90.91% | 90.00% | 90.91% |

| Video & Audio | Standard accuracy | ResNet18 (FGSM) | Ensemble (FGSM) | ResNet18 (I-FGSM) | Ensemble (I-FGSM) | ResNet18 (MI-FGSM) | Ensemble (MI-FGSM) |
|---|---|---|---|---|---|---|---|
| ResNet18 + GRU | 95.45% | 90.00% | 82.73% | 87.27% | 73.64% | 82.73% | 68.18% |
| ResNet18 + Transformer | 90.91% | 80.91% | 80.00% | 88.18% | 63.64% | 66.36% | 55.45% |
Adversarial training on ResNet18-FGSM adv. samples

| Attack | ResNet18 + GRU | ResNet18 + Transformer | ResNeXt50_32x4d + GRU | ResNeXt50_32x4d + Transformer |
|---|---|---|---|---|
| Standard accuracy | 89.09% | 88.18% | 90.91% | 87.27% |
| ResNet18 (FGSM) | 90.00% | 89.09% | 90.91% | 81.82% |
| Ensemble (FGSM) | 83.64% | 88.18% | 88.18% | 77.27% |
Adversarial training on Ensemble-FGSM adv. samples

| Attack | ResNet18 + GRU | ResNet18 + Transformer | ResNeXt50_32x4d + GRU | ResNeXt50_32x4d + Transformer |
|---|---|---|---|---|
| Standard accuracy | 81.82% | 79.09% | 79.09% | 81.82% |
| ResNet18 (FGSM) | 75.45% | 79.09% | 78.18% | 80.91% |
| Ensemble (FGSM) | 78.18% | 85.45% | 79.09% | 81.82% |
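Adversarial training mixes adversarial samples into the training set, regenerating them against the current weights at each step. A minimal sketch on a stand-in logistic model (not the deck's ResNet pipelines; `eps`, `lr`, `epochs` are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adv_train(X, y, eps=0.1, lr=0.5, epochs=200):
    # Train a stand-in logistic model on a 50/50 mix of clean inputs
    # and FGSM inputs crafted against the current weights.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        gx = (p - y)[:, None] * w            # per-sample input gradient
        X_adv = X + eps * np.sign(gx)        # FGSM against current model
        Xb = np.vstack([X, X_adv])           # clean + adversarial batch
        yb = np.concatenate([y, y])
        pb = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (pb - yb) / len(yb)  # gradient step on mixed batch
    return w
```

Because the adversarial batch is rebuilt every epoch, the model keeps seeing perturbations matched to its current decision boundary, which is the mechanism behind the robustness gains in the tables above.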
Conclusion
❏ Generating adversarial examples with Ensemble (MI-FGSM) yields the strongest attack.
❏ The adversarial attack on the multi-modal model is significant:
❏ ResNet18 + GRU: 95.45% -> 68.18%
❏ Adversarial training on the corresponding adversarial samples improves robustness on the deception detection task.
Future work
❏ Adopt state-of-the-art models for deception detection.
❏ Apply adversarial training to audio and multi-modal data.
❏ Evaluate other available datasets:
❏ Bag-of-Lies
❏ Silesian Deception Dataset
❏ etc.