Interpretable Adversarial Perturbation in Input Embedding...

Interpretable Adversarial Perturbation in Input Embedding Space for Text

Motoki Sato [1], Jun Suzuki [2,4], Hiroyuki Shindo[3,4], Yuji Matsumoto[3,4]

[1] Preferred Networks, Inc. [2] NTT Communication Science Laboratories[3] Nara Institute of Science and Technology [4] RIKEN Center for Advanced Intelligence Project

2 / 30

Overview of this paper

lAdversarial Training for Text

better

＋ →＋ →Word Vector

??? (symbol)

Adversarial Example (continuous)

Continuous

Discrete We can not interpret this vector in discrete space

Adversarial Perturbation

3 / 30

Adversarial Perturbationl 2 , 2 5 5C4 2 2 5

10H 5G �� 5 E �� 2

l 10H 5G �� 2 I C A C 3 E :A . l 1 5 E �� 2 I 3 3 5 ,3 AA 5 A3 3 3 A .

0 ��

5 1 2, 2 5

4 / 30

Adversarial Perturbationl 2 , 2 5 0 2 2 2 5

5 4 �� 11 , 1 ��

0 ��

5 1 2, 2 5

The gradient of loss function

5 / 30

Adversarial Training

l 1 171 A 7 1 2 7 1 Goodfellow et al .,2015]

0 1 7 7

Original Training Data Adversarial Example

l 0 1 1 7 0 1 �� ,– , - , -

- , - -

0 1 7 7 2 1

6 / 30

Our main idea

better

worse bad

better

worse bad

u Restrict the directions of the perturbations u Restrict toward the locations of existing words.

Previous Method Ours

[Miyato et al., 2017]

7 / 30

Related Work

8 / 30

Related Work

– Human Knowledge [Jia and Liang, 2017]

Fooling Reading Comprehension System using crowdsourcing

– Random search [Belinkov and Bisk, 2018]

Random character-level swaps can break output of Neural

Machine Translation

– Synonym dictionary [Samanta and Mehta, 2017]

Replacing a word with its synonym

l Existing methods require human knowledge or heuristic8

9 / 30

Previous Method , ��

l Takeru Miyato, Andrew M Dai, and Ian Goodfellow, ICLR 2017“Adversarial training methods for semi-supervised text classification.”

l Uni-directional LSTM + Pre-Training (Language Model) + Adversarial Training

Word VectorAdversarial Perturbation

10 / 30

Word Vector�Adversarial Perturbation�

�Word vector with perturbation

Definition

� Find to increase the lossfunction

How to obtain the perturbation

� Compute the gradient with L2 normalization

Previous Method ��

�hyper-parameter (e.g.: 1.0)

11 / 30

Our Method

Definition

�Direction Vector

�Compute α with the gradient

�Compute the perturbation

How to obtain the perturbation?

: is a weight for the direction

Word Vector�Adversarial Perturbation�

12 / 30

Experiments

13 / 30

Experiments

l Setting

– Text Classification : IMDB (Sentiment Analysis)

– Sequence Labeling : FCE-public (Grammatical Error Detection)

13Word Vector�Adversarial Perturbation�

14 / 30

Evaluation by task performance

14• 5 5 0 .5

Baseline 7.05 (%) 39.21

Random Perturbation 6.74 (%) 39.90

AdvT-Text [Miyato et al., 2017] 6.12 (%) 42.28

iAdvT-Text (Ours) 6.08 (%) 42.26

VAT-Text [Miyato et al., 2017] 5.69 (%) 41.81

iVAT-Text (Ours) 5.66 (%) 41.88

SEC: Sentiment Classification Task (IMDB)GED: Grammatical Error Detection Task (FCE-public)

15 / 30

Model Analysis

16 / 3016

Right Words reconstructed fromperturbations

We visualized the perturbations for understanding its behavior.

Words reconstructed from perturbations

Visualization of sentence-level perturbationsOurs

17 / 3017

) ( ( ( ( ) (

Right Most cosine similar words

Cosine similarities betweenperturbation and words.

( ( )( ) () )

) ) ) () )

Words reconstructed from perturbations

Visualization of sentence-level perturbations[Miyato et al., 2017]

18 / 30

Creating Adversarial Example

Test Text

Predict: PositiveAdversarial Example

Predict: Negative

There is really but one thing to say about this sorry movie It should never have been made The first one one of my favourites An American Werewolf in London is a great movie with a good plot good actors and good FX But this one It stinks to heaven with a cry of helplessness <eos>

There is really but one thing to say about that sorry movie It should never have been made The first one one of my favourites An American Werewolf in London is a great movie with a good plot good actors and good FX But this one It stinks to heaven with a cry of helplessness <eos>

Find the largest perturbation and replace the original word with one that matches the largest perturbation.

19 / 30

Conclusion

l We discussed the interpretability of adversarial perturbation in

the NLP (Text) field.l Our methods can generate reasonable adversarial texts and

interpretable visualizations.

Code: https://github.com/aonotas/interpretable-adv

Thank you!

20 / 30

21 / 30

Adversarial ExampleOriginal Text (Incorrect)

Adversarial Example

Prediction

Incorrect → Correct

Interpretable Adversarial Perturbation in Input Embedding...

Documents

FESTIVAL FAVOURITES

Holiday Favourites

Werewolf Game Manual

[Werewolf - The Apocalipse] the Art of Werewolf the Apocalypse

PR-17 - The Werewolf - The Curse of The Werewolf

Werewolf Gifts List

Genre Research - Werewolf Horror

WWF Werewolf Full Text

THE PIG WEREWOLF

Werewolf - The Wild West

InfoGAIL: Interpretable Imitation Learning from Visual ...papers.nips.cc/paper/6971-infogail-interpretable-imitation-learning... · InfoGAIL: Interpretable Imitation Learning from

36513188 Werewolf Haiku

KT Favourites

Werewolf Mask

The Werewolf Game

Kitchissippi Favourites

Werewolf Translation Guide

Werewolf The Forsaken - dl.alicanc.comdl.alicanc.com/Character Sheets/sheets/Werewolf the... · wEREWoLF THE FORSAKEN Name: Player: Chronicle: Virtue: Vice: Concept: Auspice: Tribe:

Book favourites

Werewolf Card