
Adversarial Learning


Page 1: Adversarial Learning

Adversarial Learning

• What is a GAN?
• Some mathematical background
• Algorithms for training a GAN
• The Wasserstein GAN
• Conditional GANs

Page 2: Adversarial Learning

How do we sample from a distribution?

$Y \sim p_Y(y)$

• Parametric methods:
  – Assume some known parametric distribution, e.g.,

$p_Y(y) = \frac{1}{Z} \exp\{ -u(y) \}$

  – Markov chain Monte Carlo (MCMC) methods, e.g., Metropolis-Hastings sampling
  – Requires a known distribution

• Non-parametric methods:
  – Provided with samples $y_0, \cdots, y_{K-1}$
  – Infer the distribution from the samples
  – Generator:

$Y = h_\theta(Z)$


https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html

[Figure: a source of randomness $Z \sim N(0, I)$ is mapped by a generator to a multivariate random vector with the desired multivariate density.]
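As a concrete illustration (not from the slides), here is a minimal PyTorch sketch of such a generator: a neural network $h_\theta$ that maps Gaussian noise $Z \sim N(0, I)$ to samples whose distribution is determined by the learned parameters. The layer sizes and class name are illustrative assumptions.

    # Minimal sketch: a generator h_theta mapping Z ~ N(0, I) to samples
    # with a learned distribution.
    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        def __init__(self, z_dim=16, y_dim=2, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(z_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, y_dim),
            )

        def forward(self, z):
            return self.net(z)

    h_theta = Generator()
    z = torch.randn(128, 16)   # source of randomness, Z ~ N(0, I)
    y_hat = h_theta(z)         # random vectors with the (learned) desired distribution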

Page 3: Adversarial Learning

Training a Generator

• Function of a GAN:
  – Generates samples $\hat{y}_k$ with the same distribution as the reference samples $y_k$.
  – Can use dropout to generate randomness.

• Training algorithm:
  – Compare the distributions of $\hat{y}_k$ and $y_k$.
  – Feed back corrections to the parameter $\theta$.

(𝑦"generatedsamples

Generator(𝑦" = ℎ# 𝑧"

TrainingAlgorithm

𝑧"independent noise source

𝑦"referencesamples

𝜃

Page 4: Adversarial Learning

The Bayes Discriminator

• Use Bayes' rule:

$P(Class = R \mid y) = \frac{P(y \mid Class = R)\, P(Class = R)}{P(y \mid Class = R)\, P(Class = R) + P(y \mid Class = F)\, P(Class = F)}$

• Assuming $P(y \mid Class = F) = p_{\theta_g}(y)$, $P(y \mid Class = R) = p_r(y)$, and $P(Class = R) = P(Class = F) = \frac{1}{2}$, we have that

$P(Class = R \mid y) = \frac{p_r(y)\, \tfrac{1}{2}}{p_r(y)\, \tfrac{1}{2} + p_{\theta_g}(y)\, \tfrac{1}{2}} = \frac{1}{1 + R_{\theta_g}(y)}$

  – where $R_{\theta_g}(y) = \frac{p_{\theta_g}(y)}{p_r(y)}$ is the likelihood ratio between the generated and reference distributions.

(𝑦"generatedsamples

Generator(𝑦" = ℎ#! 𝑧"

𝑧"independentnoise source

𝑦"referencesamples

“R -Real”

“F - Fake”

What is the probability that an observation, 𝑦, is real?

Page 5: Adversarial Learning

Implementing a Bayes Discriminator

• Bayesian discriminator:

$f_{\theta_d}(y) \approx P(Class = R \mid y) = \frac{p_r(y)}{p_r(y) + p_{\theta_g}(y)} = \frac{1}{1 + R_{\theta_g}(y)}$

  – where $R_{\theta_g}(y) = \frac{p_{\theta_g}(y)}{p_r(y)}$ is the likelihood ratio.

(𝑦"generatedsamples

Generator(𝑦" = ℎ#! 𝑧"

Discriminator (𝑝" = 𝑓#( (𝑦"

𝑧"independentnoise source

𝑦"referencesamples

Discriminator 𝑝" = 𝑓#( 𝑦"

Should be mostly 0s

Should be mostly 1s

Page 6: Adversarial Learning

Bayes Discriminator and the Likelihood Ratio

• Bayesian discriminator:

$f_{\theta_d}(y) \approx P(Class = R \mid y) = \frac{p_r(y)}{p_r(y) + p_{\theta_g}(y)} = \frac{1}{1 + R_{\theta_g}(y)}$

[Figure: plots of the generated distribution $p_{\theta_g}(y)$ and the reference distribution $p_r(y)$ versus $y$, together with the likelihood ratio $R_{\theta_g}(y)$. Where $R_{\theta_g}(y) < 1$ the discriminator classifies $y$ as "real"; where $R_{\theta_g}(y) > 1$ it classifies $y$ as "fake".]

Page 7: Adversarial Learning

Training a Bayes Discriminator

• Discriminator loss function:

$\hat{d}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} \left[ -\log f_{\theta_d}(y_k) - \log\left(1 - f_{\theta_d}(\hat{y}_k)\right) \right]$

• Optimal discriminator parameter:

$\theta_d^* = \arg\min_{\theta_d} \hat{d}(\theta_g, \theta_d)$

  – Results in the ML estimate of the Bayes classifier parameters.
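As an illustration (not the lecture's code), this discriminator loss is simply binary cross-entropy with label 1 on reference samples and label 0 on generated samples. A minimal PyTorch sketch, assuming a discriminator f_theta_d that outputs probabilities:

    # Hypothetical sketch: the discriminator loss d_hat as binary cross-entropy.
    import torch
    import torch.nn.functional as F

    def discriminator_loss(f_theta_d, y_ref, y_gen):
        # -log f(y_k) on reference samples (label 1)
        p_ref = f_theta_d(y_ref)
        # -log(1 - f(y_hat_k)) on generated samples (label 0)
        p_gen = f_theta_d(y_gen.detach())   # do not backpropagate into the generator here
        loss = F.binary_cross_entropy(p_ref, torch.ones_like(p_ref)) + \
               F.binary_cross_entropy(p_gen, torch.zeros_like(p_gen))
        return loss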

(𝑦"generatedsamples

Generator(𝑦" = ℎ#! 𝑧"

Discriminator(𝑝" = 𝑓#( (𝑦"

𝑧"

𝑦"

𝑦"referencesamples

0 =Generated

Discriminator𝑝" = 𝑓#( 𝑦"

1 =Reference+ 9𝑑 𝜃) , 𝜃(

CrossEntropy( (𝑝!, 0)

CrossEntropy(𝑝!, 1)𝑝"

(𝑝"

Page 8: Adversarial Learning

Training a Generator

(𝑦"generatedsamples

Generator(𝑦" = ℎ#! 𝑧"

Discriminator (𝑝" = 𝑓#( (𝑦"

𝑧" <𝑔 𝜃) , 𝜃(Loss Function

𝐿 (𝑝!

• Big idea: Maximize the probability that outputs of the generator are classified as being from the reference distribution.
  – $\hat{p}_k$ should be large
  – $L(\hat{p}_k)$ should be small when $\hat{p}_k$ is large
  – $L(\hat{p}_k)$ should be a decreasing function of $\hat{p}_k$

• Generator loss function:

$\hat{g}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} L\left( f_{\theta_d}(\hat{y}_k) \right)$

• Optimal generator parameter:

$\theta_g^* = \arg\min_{\theta_g} \hat{g}(\theta_g, \theta_d)$

  – The loss $L$ should encourage $\hat{p}_k$ to be large.

Page 9: Adversarial Learning

Generative Adversarial Network (GAN)*

(𝑦"generatedsamples

Generator(𝑦" = ℎ#! 𝑧"

Discriminator (𝑝" = 𝑓#( (𝑦"

𝑧"

𝑦"

𝑦"referencesamples

0 =Generated

Discriminator 𝑝" = 𝑓#( 𝑦"

1 =Reference + 9𝑑 𝜃) , 𝜃(

Loss Function𝐿 (𝑝!

<𝑔 𝜃) , 𝜃(

CrossEntropy( (𝑝!, 0)

CrossEntropy(𝑝!, 1)

• Generator loss function:

$\hat{g}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} L\left( f_{\theta_d}(\hat{y}_k) \right)$

• Discriminator loss function:

$\hat{d}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} \left[ -\log f_{\theta_d}(y_k) - \log\left(1 - f_{\theta_d}(\hat{y}_k)\right) \right]$

*Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, “Generative Adversarial Networks”, Proc. of the Intern. Conference on Neural Information Processing Systems (NIPS 2014). pp. 26

Page 10: Adversarial Learning

GAN: Expected Loss Functions

[Figure: the same GAN block diagram, now written in terms of the random variables $Z$, $\hat{Y}_{\theta_g} = h_{\theta_g}(Z)$, $Y$, $\hat{P}_{\theta_g} = f_{\theta_d}(\hat{Y}_{\theta_g})$, and $P = f_{\theta_d}(Y)$, with expected losses $g(\theta_g, \theta_d)$ and $d(\theta_g, \theta_d)$ built from CrossEntropy($\hat{P}_{\theta_g}$, 0), CrossEntropy($P$, 1), and $L(\hat{P}_{\theta_g})$.]

• Generator loss function:

$g(\theta_g, \theta_d) = E\left[ L\left( f_{\theta_d}(\hat{Y}_{\theta_g}) \right) \right]$

• Discriminator loss function:

$d(\theta_g, \theta_d) = E\left[ -\log f_{\theta_d}(Y) \right] + E\left[ -\log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g})\right) \right]$

  – By the weak and strong laws of large numbers, $\lim_{K \to \infty} \hat{g} = g$ and $\lim_{K \to \infty} \hat{d} = d$.

Page 11: Adversarial Learning

Generator Loss Function Choices

• Option 0: Original loss function proposed in [1].
  – $L(p) = \log(1 - p)$, so $L(0) = 0$ and $L(1) = -\infty$
  – $g(\theta_g, \theta_d) = E\left[ \log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g})\right) \right]$
  – Presented as the key theoretically grounded approach in the Goodfellow paper.
  – Consistent with zero-sum game Nash equilibrium theory.
  – Almost no one uses it.

• Option 1: "Non-saturating" loss function, i.e., the "$-\log D$ trick".
  – $L(p) = -\log p$, so $L(0) = +\infty$ and $L(1) = 0$
  – $g(\theta_g, \theta_d) = E\left[ -\log f_{\theta_d}(\hat{Y}_{\theta_g}) \right]$
  – Mentioned as a trick in [1] to keep the training loss from "saturating".
  – This is what you get if you use the cross-entropy loss for the generator.
  – This is what is commonly done (a short sketch of both options follows this list).
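For concreteness, here is a minimal, hypothetical PyTorch sketch (not from the slides) of the two generator-loss options; p_hat is assumed to be the discriminator output $f_{\theta_d}(\hat{y}_k)$, a probability in (0, 1):

    # Hypothetical sketch of the two generator-loss choices.
    import torch

    def generator_loss_option0(p_hat):
        # Option 0 (original, saturating): L(p) = log(1 - p)
        return torch.log(1.0 - p_hat).mean()

    def generator_loss_option1(p_hat):
        # Option 1 (non-saturating, the "-log D trick"): L(p) = -log p
        return -torch.log(p_hat).mean()

    # p_hat = f_theta_d(h_theta_g(z)) would be the discriminator output on generated samples.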

[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. “Generative Adversarial Networks”, Proc. of the Intern. Conference on Neural Information Processing Systems (NIPS 2014). pp. 26

Page 12: Adversarial Learning

GAN Architecture (Non-Saturating)

[Figure: the GAN block diagram with the non-saturating choice: the generator loss $g(\theta_g, \theta_d)$ is CrossEntropy($\hat{P}_{\theta_g}$, 1), while the discriminator loss $d(\theta_g, \theta_d)$ sums CrossEntropy($\hat{P}_{\theta_g}$, 0) (0 = Generated) and CrossEntropy($P$, 1) (1 = Reference).]

• Generator loss function:

$g(\theta_g, \theta_d) = E\left[ -\log f_{\theta_d}(\hat{Y}_{\theta_g}) \right]$

• Discriminator loss function:

$d(\theta_g, \theta_d) = E\left[ -\log f_{\theta_d}(Y) \right] + E\left[ -\log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g})\right) \right]$

  – By the weak and strong laws of large numbers, $\lim_{K \to \infty} \hat{g} = g$ and $\lim_{K \to \infty} \hat{d} = d$.

Page 13: Adversarial Learning

GAN Equilibrium Conditions (Non-Saturating)

• We would like to find the solution to:

$\theta_g^* = \arg\min_{\theta_g} g(\theta_g, \theta_d^*) = \arg\min_{\theta_g} E\left[ -\log f_{\theta_d^*}(\hat{Y}_{\theta_g}) \right]$

$\theta_d^* = \arg\min_{\theta_d} d(\theta_g^*, \theta_d) = \arg\min_{\theta_d} E\left[ -\log f_{\theta_d}(Y) \right] + E\left[ -\log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g^*})\right) \right]$

  – This is known as a Nash equilibrium.
  – We would like it to converge to $R^* = 1$ ⇒ (generated = reference distributions).

• How do we solve this?

• Will it converge?

Page 14: Adversarial Learning

Nash Equilibrium with Two Agents*

• Agent $G$:
  – Controls parameter $\theta_g$
  – Goal is to minimize $g(\theta_g, \theta_d^*)$

• Agent $D$:
  – Controls parameter $\theta_d$
  – Goal is to minimize $d(\theta_g^*, \theta_d)$

[Figure: two stick-figure agents. The $G$ agent turns a knob $\theta_g$ while watching a meter showing $g(\theta_g, \theta_d)$; the $D$ agent turns a knob $\theta_d$ while watching a meter showing $d(\theta_g, \theta_d)$.]

• Each agent tries to minimize their own meter:

*Graphics and art reproduced from “stick figure” Wiki page.

$g(\theta_g^*, \theta_d^*) = \min_{\theta_g} g(\theta_g, \theta_d^*)$

$d(\theta_g^*, \theta_d^*) = \min_{\theta_d} d(\theta_g^*, \theta_d)$

Page 15: Adversarial Learning

Zero-Sum Game: Special Nash Equilibrium*

• Agent $G$:
  – Goal is to minimize $g(\theta_g, \theta_d^*)$
  – Goal is to maximize $d(\theta_g, \theta_d^*)$

• Agent $D$:
  – Goal is to minimize $d(\theta_g^*, \theta_d)$

[Figure: the same two stick-figure agents, but now the $G$ agent's meter shows $d(\theta_g, \theta_d)$, which $G$ tries to maximize with its knob $\theta_g$, while the $D$ agent tries to minimize $d(\theta_g, \theta_d)$ with its knob $\theta_d$.]

• Special case when $g(\theta_g, \theta_d) = -d(\theta_g, \theta_d)$:

*Graphics and art reproduced from “stick figure” Wiki page.

$d(\theta_g^*, \theta_d^*) = \max_{\theta_g} d(\theta_g, \theta_d^*)$

$d(\theta_g^*, \theta_d^*) = \min_{\theta_d} d(\theta_g^*, \theta_d)$

This is an adversarial relationship.

Page 16: Adversarial Learning

Computing the GAN Equilibrium

• Reparameterizing the equations

• Alternating minimization ⇒ mode collapse

• Generator loss gradient descent

• Practical convergence issues

Page 17: Adversarial Learning

Reparameterize Loss Functions

• Goal: Replace $\theta_g$ and $\theta_d$ with $R$ and $f$.

• Generator parameter $R$:
  – Generated samples are distributed as

$\hat{Y} \sim R(y)\, p_r(y) = p_{\theta_g}(y)$

    where $R(y) = \frac{p_{\theta_g}(y)}{p_r(y)}$

• Discriminator parameter $f$:
  – The discriminator is

$f(y) = P(Class = R \mid y)$

• Important facts:
  – $E[R(Y)] = 1$
  – $\Omega_g = \left\{ R : \Re^d \to [0, \infty) \text{ such that } E[R(Y)] = 1 \right\}$
  – $\Omega_d = \left\{ f : \Re^d \to [0, 1] \right\}$
  – For any function $h(y)$, $E[h(\hat{Y})] = E[h(Y)\, R(Y)]$ (a numerical check follows).
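A quick Monte Carlo sanity check of the last identity (a sketch, not from the slides; the two Gaussian densities are stand-ins chosen purely for illustration):

    # Sketch: verify E[h(Y_hat)] = E[h(Y) R(Y)] by Monte Carlo with Gaussian stand-ins.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n = 1_000_000

    # reference density p_r = N(0, 1); "generated" density p_theta_g = N(0.5, 1)
    y = rng.normal(0.0, 1.0, n)          # Y ~ p_r
    y_hat = rng.normal(0.5, 1.0, n)      # Y_hat ~ p_theta_g

    def R(y):                            # likelihood ratio p_theta_g / p_r
        return norm.pdf(y, 0.5, 1.0) / norm.pdf(y, 0.0, 1.0)

    h = lambda t: t**2                   # any test function h
    print(np.mean(h(y_hat)))             # E[h(Y_hat)]
    print(np.mean(h(y) * R(y)))          # E[h(Y) R(Y)]  -- should be close
    print(np.mean(R(y)))                 # E[R(Y)]       -- should be close to 1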

Page 18: Adversarial Learning

GAN Equilibrium Conditions

• We would like to find the solution to:

$R^* = \arg\min_{R \in \Omega_g} g(R, f^*)$

$f^* = \arg\min_{f \in \Omega_d} d(R^*, f)$

  – This is known as a Nash equilibrium.
  – We would like it to converge to $R^* = 1$ ⇒ (generated = reference distributions).

• How do we do this?

• Will it converge?

Page 19: Adversarial Learning

Reparameterized Loss Functions

• Generator loss function:

$g(R, f) = E\left[ -\log f(\hat{Y}) \right] = E\left[ -R(Y) \log f(Y) \right]$

• Discriminator loss function:

$d(R, f) = E\left[ -\log f(Y) \right] + E\left[ -\log\left(1 - f(\hat{Y})\right) \right] = E\left[ -\log f(Y) \right] + E\left[ -R(Y) \log\left(1 - f(Y)\right) \right]$

• Nash equilibrium:

$R^* = \arg\min_{R \in \Omega_g} g(R, f^*)$

$f^* = \arg\min_{f \in \Omega_d} d(R^*, f)$

Page 20: Adversarial Learning

Method 1: Alternating Minimization

• Algorithm:

Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} d(R^*, f)$
    $R^* \leftarrow \arg\min_{R \in \Omega_g} g(R, f^*)$
}

• Discriminator update:

$f^*(y) \leftarrow \frac{1}{1 + R(y)}$

• Generator update:

$R^*(y) \leftarrow \delta(y - y')$ where $y' = \arg\max_y f(y)$

• This doesn't work! The problem:
  – This is called "mode collapse."
  – The generator only produces the single sample that the discriminator likes best.
  – Intuition: "We come from France." "I like cheese steaks." "Too good to be true." "Too creepy to be real."
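A hypothetical toy demonstration of this failure (not from the slides): on a discrete support, the exact alternating updates collapse the generated distribution onto a single point and then keep jumping between points.

    # Sketch: alternating minimization collapses the generated distribution onto
    # the single point currently favored by the discriminator.
    import numpy as np

    p_r = np.array([0.2, 0.5, 0.3])          # reference distribution on 3 points
    p_g = np.array([0.6, 0.2, 0.2])          # initial generated distribution

    for it in range(5):
        R = p_g / p_r                         # likelihood ratio
        f = 1.0 / (1.0 + R)                   # optimal discriminator
        y_best = np.argmax(f)                 # point the discriminator "likes best"
        p_g = np.zeros_like(p_g)              # generator puts all mass there
        p_g[y_best] = 1.0
        print(it, y_best, p_g)                # the mass keeps jumping between single points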

Page 21: Adversarial Learning

Method 2: Generator Loss Gradient Descent (GLGD)

• Algorithm: see the loop below. ("GLGD" is my term.)

• Discriminator update:

$f^*(y) \leftarrow \frac{1}{1 + R(y)}$

• Generator update:
  – Take a step in the negative direction of the generator loss gradient.
  – $P_\Omega$ projects the gradient into the allowed parameter space. (This is not an issue in practice.)

$P_\Omega d(y) = d(y) - \frac{E[d(Y)]}{E[p_r(Y)]}\, p_r(y)$

• Questions/Comments:
  – Can be applied with a wide variety of generator/discriminator loss functions.
  – Does this converge?
  – If so, then what (if anything) is being minimized?

*Martin Arjovsky and Leon Bottou, “Towards Principled Methods for Training Generative Adversarial Networks”, ICLR 2017.

Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} d(R, f)$
    $R \leftarrow R - \alpha P_\Omega \nabla_R\, g(R, f^*)$    (gradient descent step; $P_\Omega$ is the projection onto the valid parameter space)
}

Page 22: Adversarial Learning

GLGD Convergence for Non-Saturating GAN

*This is an expression equivalent to Theorem 2.5 of [1].
[1] Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.

• For the non-saturating GAN, when $f^* = \arg\min_{f \in \Omega_d} d(R, f)$, it can be shown that

$\nabla_R\, g(R, f^*) = \nabla_R C(R)$

where*

$C(R) = E\left[ \left(1 + R(Y)\right) \log\left(1 + R(Y)\right) \right]$

• Conclusions:
  – GLGD is really a gradient descent algorithm for the cost function $C(R)$.
  – $(1 + x)\log(1 + x)$ is a strictly convex function of $x$ (see the check below).
  – Therefore, we know that $C(R)$ has a unique global minimum at $R(Y) = 1$.
  – However, convergence tends to be slow.

Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} d(R, f)$
    $R \leftarrow R - \alpha P_\Omega \nabla_R\, g(R, f^*)$
}
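A quick check of the convexity claim above (added here for completeness):

$\frac{d}{dx}\left[(1 + x)\log(1 + x)\right] = \log(1 + x) + 1, \qquad \frac{d^2}{dx^2}\left[(1 + x)\log(1 + x)\right] = \frac{1}{1 + x} > 0 \quad \text{for } x > -1,$

so $(1 + x)\log(1 + x)$ is strictly convex. By Jensen's inequality with the constraint $E[R(Y)] = 1$, $C(R) = E[(1 + R(Y))\log(1 + R(Y))] \ge 2\log 2$, with equality only when $R(Y) \equiv 1$.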

Page 23: Adversarial Learning

More Details on GLGD

*Martin Arjovsky and Leon Bottou, “Towards Principled Methods for Training Generative Adversarial Networks”, ICLR 2017.

• Arjovsky and Bottou showed that for the non-saturating GAN*

$P_\Omega\, \nabla_R\, g(R, f^*) = \nabla_R \left[ KL(p_R \| p_r) - 2\, JSD(p_R \| p_r) \right]$

• So from the previous identities, we have that:

$P_\Omega\, \nabla_R\, g(R, f^*) = \nabla_R\; 2\, KL\!\left( \frac{p_R + p_r}{2} \,\Big\|\, p_r \right) = P_\Omega\, \nabla_R\, E\left[ \left(1 + R(Y)\right) \log\left(1 + R(Y)\right) \right]$

• Conclusions:
  – GLGD is really a gradient descent algorithm for the cost function
    $C(R) = 2\, KL\!\left( \frac{p_R + p_r}{2} \,\Big\|\, p_r \right)$
  – $C(R)$ has a unique global minimum at $p_R = p_r$.
  – However, convergence tends to be slow.

Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} d(R, f)$
    $R \leftarrow R - \alpha P_\Omega \nabla_R\, g(R, f^*)$
}

Page 24: Adversarial Learning

Convergence of GANs

• Generator and discriminator at convergence:
  – The reference and generated distributions are the same ⇒ $R^*(Y) = 1$
  – The discriminator cannot distinguish the distributions ⇒ $f^*(y) = \frac{1}{1 + R^*(y)} = \frac{1}{2}$

• At convergence the generated and reference distributions are identical.
  – Therefore, the likelihood ratio is $R(y) = 1$;
  – The generated (fake) and reference (real) distributions are identical;
  – The discriminator assigns a 50/50 probability to either case because they are indistinguishable.
  – Then both the generator and discriminator cross-entropy losses are $-\log\frac{1}{2} \approx 0.693$.

• In practice, things don't usually work out this well…

Page 25: Adversarial Learning

Method 2: Practical Algorithm

• Algorithm:

Repeat {
    For $N_d$ iterations {
        $B \leftarrow GetRandomBatch()$
        $\theta_d \leftarrow \theta_d - \beta \nabla_{\theta_d} \hat{d}(\theta_g, \theta_d; B)$
    }
    $B \leftarrow GetRandomBatch()$
    $\theta_g \leftarrow \theta_g - \alpha \nabla_{\theta_g} \hat{g}(\theta_g, \theta_d; B)$
}
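A minimal PyTorch sketch of this loop (an illustration, not the lecture's code; generator, discriminator, the two optimizers, and get_random_batch are assumed to exist):

    # Hypothetical non-saturating GAN training step, following the loop above.
    import torch
    import torch.nn.functional as F

    def train_step(generator, discriminator, opt_g, opt_d, get_random_batch,
                   z_dim=16, n_d=1):
        # Discriminator updates (N_d inner iterations)
        for _ in range(n_d):
            y_ref = get_random_batch()
            z = torch.randn(y_ref.shape[0], z_dim)
            y_gen = generator(z).detach()
            p_ref = discriminator(y_ref)
            p_gen = discriminator(y_gen)
            d_loss = F.binary_cross_entropy(p_ref, torch.ones_like(p_ref)) + \
                     F.binary_cross_entropy(p_gen, torch.zeros_like(p_gen))
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator update (non-saturating: cross-entropy against the "real" label)
        y_ref = get_random_batch()
        z = torch.randn(y_ref.shape[0], z_dim)
        p_gen = discriminator(generator(z))
        g_loss = F.binary_cross_entropy(p_gen, torch.ones_like(p_gen))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()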

• What you would like to see:

[Figure: loss versus iteration number. The generator loss and one-half of the discriminator loss both settle near $-\log\frac{1}{2} \approx 0.693$.]

  – Looks good, but… this could also result from a discriminator with insufficient capacity.

Page 26: Adversarial Learning

Failure Mode: Mode Collapse

• Algorithm: same as before (see the loop below).

• Sometimes you get mode collapse:

[Figure: loss versus iteration number, showing the generator loss and one-half of the discriminator loss. The discriminator dominates, and the losses move away from the equilibrium value $\approx 0.693$.]

Repeat {
    For $i = 0$ to $N_d - 1$ {
        $\theta_d \leftarrow \theta_d - \beta \nabla_{\theta_d} \hat{d}(\theta_g, \theta_d)$
    }
    $\theta_g \leftarrow \theta_g - \alpha \nabla_{\theta_g} \hat{g}(\theta_g, \theta_d)$
}

Might be caused by:
  – Overfitting by the discriminator
  – An insufficient number of discriminator updates
  – Insufficient generator capacity

Page 27: Adversarial Learning

Concept Wasserstein GAN

*Martin Arjovsky, Soumith Chintala and Leon Bottou, “Wasserstein Generative Adversarial Networks”, ICML 2017.

• Problems with GAN training using the "$-\log D$ trick":
  – Slow and sometimes unstable convergence
  – Problems with vanishing gradients

• Conjecture:
  – The problem is caused by the discriminator function.
  – The Bayes classifier is too sensitive and too nonlinear.
  – Non-overlapping distributions create vanishing gradients that slow convergence.

• Base the discriminator on the Wasserstein distance (i.e., the earth mover's distance).


Page 28: Adversarial Learning

Wasserstein Fundamentals

• Based on the Kantorovich-Rubinstein duality (Villani, 2009)*:

$W(p_R \| p_r) = \sup_{\|f\|_L \le 1} \left\{ E\left[ f(Y) \right] - E\left[ f(\hat{Y}_R) \right] \right\}$

  – where $\|f\|_L$ is the Lipschitz constant of $f$
  – $\|f\|_L \le 1$ is referred to as the 1-Lipschitz condition

*Villani, Cedric. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.
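As a simple numerical illustration (not from the slides, and using the 1-D closed form rather than the dual above): for equal-sized sample sets in one dimension, the Wasserstein-1 (earth mover's) distance reduces to the mean absolute difference of the sorted samples, and SciPy provides the same quantity directly.

    # Sketch: 1-D Wasserstein (earth mover's) distance between two sample sets.
    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    y_ref = rng.normal(0.0, 1.0, 10_000)      # samples of Y ~ p_r
    y_gen = rng.normal(0.5, 1.0, 10_000)      # samples of Y_hat ~ p_R

    w_sorted = np.mean(np.abs(np.sort(y_ref) - np.sort(y_gen)))
    w_scipy = wasserstein_distance(y_ref, y_gen)
    print(w_sorted, w_scipy)                  # both close to the true distance of 0.5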

Page 29: Adversarial Learning

Wasserstein GAN*

• The fundamental result of Arjovsky et al.:
  – Define the set $R \in \Omega_g$ as usual, but define $f \in \Omega_d^W$ so that

$\Omega_d^W = \left\{ f : \Re^d \to (-\infty, \infty) \;\; \text{s.t.} \;\; \|f\|_L \le 1 \right\}$

  – Then define the Wasserstein generator and discriminator loss functions as

$g_W(R, f) = E\left[ -f(\hat{Y}_R) \right]$

$d_W(R, f) = E\left[ f(\hat{Y}_R) \right] - E\left[ f(Y) \right]$

• Key result from the Arjovsky paper*:

$\nabla_R\, g_W(R, f^*) = \nabla_R\, W(p_R \| p_r)$

  – where

$f^* = \arg\min_{f \in \Omega_d^W} d_W(R, f)$

*Martin Arjovsky, Soumith Chintala, and Leon Bottou, “Wasserstein Generative Adversarial Networks”, ICML 2017.

Page 30: Adversarial Learning

Method 3: Wasserstein Algorithm

• Algorithm: see the loop below.

• Discriminator update:
  – How do we minimize the discriminator loss subject to the Lipschitz constraint?
  – Answer: we clip the discriminator weights during training.
  – Observation: isn't this just regularization of the discriminator DNN?

• Generator update:
  – Take a step in the negative direction of the generator loss gradient.

*Martin Arjovsky and Leon Bottou, “Towards Principled Methods for Training Generative Adversarial Networks”, ICLR 2017.

Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d^W} d_W(R, f)$
    $R \leftarrow R - \alpha P_\Omega \nabla_R\, g_W(R, f^*)$
}
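A rough PyTorch sketch of the weight-clipped critic update (an illustration under the same assumptions as the earlier loops; the clip value, batch size, and number of critic iterations are hypothetical hyperparameters):

    # Hypothetical Wasserstein-style update with weight clipping as a crude
    # proxy for the 1-Lipschitz constraint.
    import torch

    def wgan_step(generator, critic, opt_g, opt_c, get_random_batch,
                  z_dim=16, n_c=5, clip=0.01, batch_size=128):
        # Critic (discriminator) updates: minimize E[f(Y_hat)] - E[f(Y)]
        for _ in range(n_c):
            y_ref = get_random_batch()
            z = torch.randn(y_ref.shape[0], z_dim)
            d_loss = critic(generator(z).detach()).mean() - critic(y_ref).mean()
            opt_c.zero_grad(); d_loss.backward(); opt_c.step()
            # Clip weights to approximate the Lipschitz constraint
            for p in critic.parameters():
                p.data.clamp_(-clip, clip)

        # Generator update: minimize E[-f(Y_hat)]
        z = torch.randn(batch_size, z_dim)
        g_loss = -critic(generator(z)).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()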

Page 31: Adversarial Learning

Method 3: Wasserstein Practical Algorithm

• Observations:
  – Some people seem to feel the Wasserstein GAN has better convergence.
  – However, is this because of the Wasserstein metric?
  – Or is it because of the other algorithmic improvements?

*Martin Arjovsky and Leon Bottou, “Towards Principled Methods for Training Generative Adversarial Networks”, ICLR 2017.

[Figure: the practical Wasserstein training algorithm, annotated as follows:]
  – Make sure to get new batches.
  – Iterate the discriminator to approximate convergence.
  – Minimize the discriminator loss.
  – Clip the discriminator weights to approximate the Lipschitz constraint.
  – Take a gradient step of the generator loss.

Page 32: Adversarial Learning

Conditional Generative Adversarial Network

• Generates samples from the conditional distribution of $Y$ given $X$.

• The discriminator takes $(y_k, x_k)$ input pairs for $k = 0, \cdots, K - 1$.

(𝑦"- GeneratedGenerator

(𝑦" = ℎ#! 𝑥" , 𝑧"

𝑥" Discriminator (𝑝" = 𝑓#( (𝑦" , 𝑥"

𝑧"

𝑦"𝑦"- Reference

0 =Generated

𝑥" , 𝑦"

Discriminator 𝑝" = 𝑓#( 𝑦" , 𝑥"

1 =Reference + 𝑑 𝜃) , 𝜃(

1 = Reference

CrossEntropy( (𝑝!, 1) 𝑔 𝜃) , 𝜃(

CrossEntropy( (𝑝!, 0)

CrossEntropy(𝑝!, 1)

Reference distribution

𝑥"

𝑥"
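A minimal, hypothetical sketch of the conditional generator and discriminator interfaces (the layer sizes and the simple concatenation of $x$ with $z$ or $y$ are illustrative choices, not taken from the slides):

    # Hypothetical conditional GAN modules: both networks take the conditioning input x.
    import torch
    import torch.nn as nn

    class CondGenerator(nn.Module):
        def __init__(self, x_dim=4, z_dim=16, y_dim=2, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, y_dim),
            )

        def forward(self, x, z):
            return self.net(torch.cat([x, z], dim=1))   # y_hat = h_theta_g(x, z)

    class CondDiscriminator(nn.Module):
        def __init__(self, x_dim=4, y_dim=2, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Sigmoid(),
            )

        def forward(self, y, x):
            return self.net(torch.cat([y, x], dim=1))   # p = f_theta_d(y, x)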