
Adversarial Learning


Page 1: Adversarial Learning

Adversarial Learning

• What is a GAN?
• Some mathematical background
• Algorithms for training a GAN
• The Wasserstein GAN
• Conditional GANs

Page 2: Adversarial Learning

How do we sample from a distribution?

$Y \sim p_Y(y)$

• Parametric methods:
  – Assume some known parametric distribution, e.g.,

$p_Y(y) = \frac{1}{Z} \exp\{ -u(y) \}$

  – Markov chain Monte Carlo (MCMC) methods, e.g., Metropolis-Hastings sampling
  – Requires a known distribution

• Non-parametric methods:
  – Provided with samples $y_0, \cdots, y_{K-1}$
  – Infer the distribution from the samples
  – Generator:

$Y = h_\theta(Z)$


https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html

[Figure: a source of randomness $Z \sim N(0, I)$ is mapped by a generator to a multivariate random vector with the desired multivariate density.]
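As a concrete illustration (not from the slides), here is a minimal PyTorch sketch of such a generator: a neural network $h_\theta$ that maps Gaussian noise $Z \sim N(0, I)$ to samples whose distribution is determined by the learned parameters. The layer sizes and class name are illustrative assumptions.

    # Minimal sketch: a generator h_theta mapping Z ~ N(0, I) to samples
    # with a learned distribution.
    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        def __init__(self, z_dim=16, y_dim=2, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(z_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, y_dim),
            )

        def forward(self, z):
            return self.net(z)

    h_theta = Generator()
    z = torch.randn(128, 16)   # source of randomness, Z ~ N(0, I)
    y_hat = h_theta(z)         # random vectors with the (learned) desired distribution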

Page 3: Adversarial Learning

Training a Generator

• Function of a GAN:
  – Generates samples $\hat{y}_k$ with the same distribution as the reference samples $y_k$.
  – Can use dropout to generate randomness.

• Training algorithm:
  – Compare the distributions of $\hat{y}_k$ and $y_k$.
  – Feed back corrections to the parameter $\theta$.

(𝑦"generatedsamples

Generator(𝑦" = ℎ# 𝑧"

TrainingAlgorithm

𝑧"independent noise source

𝑦"referencesamples

𝜃

Page 4: Adversarial Learning

The Bayes Discriminator

• Use Bayes' rule:

$P(Class = R \mid y) = \frac{P(y \mid Class = R)\, P(Class = R)}{P(y \mid Class = R)\, P(Class = R) + P(y \mid Class = F)\, P(Class = F)}$

• Assuming $P(y \mid Class = F) = p_{\theta_g}(y)$, $P(y \mid Class = R) = p_r(y)$, and $P(Class = R) = P(Class = F) = \frac{1}{2}$, we have that

$P(Class = R \mid y) = \frac{p_r(y)\, \tfrac{1}{2}}{p_r(y)\, \tfrac{1}{2} + p_{\theta_g}(y)\, \tfrac{1}{2}} = \frac{1}{1 + R_{\theta_g}(y)}$

  – where $R_{\theta_g}(y) = \frac{p_{\theta_g}(y)}{p_r(y)}$ is the likelihood ratio between the generated and reference distributions.

(𝑦"generatedsamples

Generator(𝑦" = ℎ#! 𝑧"

𝑧"independentnoise source

𝑦"referencesamples

“R -Real”

“F - Fake”

What is the probability that an observation, 𝑦, is real?

Page 5: Adversarial Learning

Implementing a Bayes Discriminator

• Bayesian discriminator:

$f_{\theta_d}(y) \approx P(Class = R \mid y) = \frac{p_r(y)}{p_r(y) + p_{\theta_g}(y)} = \frac{1}{1 + R_{\theta_g}(y)}$

  – where $R_{\theta_g}(y) = \frac{p_{\theta_g}(y)}{p_r(y)}$ is the likelihood ratio.

(𝑦"generatedsamples

Generator(𝑦" = ℎ#! 𝑧"

Discriminator (𝑝" = 𝑓#( (𝑦"

𝑧"independentnoise source

𝑦"referencesamples

Discriminator 𝑝" = 𝑓#( 𝑦"

Should be mostly 0s

Should be mostly 1s

Page 6: Adversarial Learning

Bayes Discriminator and the Likelihood Ratio

• Bayesian discriminator:

$f_{\theta_d}(y) \approx P(Class = R \mid y) = \frac{p_r(y)}{p_r(y) + p_{\theta_g}(y)} = \frac{1}{1 + R_{\theta_g}(y)}$

[Figure: plots of the generated distribution $p_{\theta_g}(y)$ and the reference distribution $p_r(y)$ versus $y$, together with the likelihood ratio $R_{\theta_g}(y)$. Where $R_{\theta_g}(y) < 1$ the discriminator classifies $y$ as "real"; where $R_{\theta_g}(y) > 1$ it classifies $y$ as "fake".]

Page 7: Adversarial Learning

Training a Bayes Discriminator

• Discriminator loss function:

$\hat{d}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} \left[ -\log f_{\theta_d}(y_k) - \log\left(1 - f_{\theta_d}(\hat{y}_k)\right) \right]$

• Optimal discriminator parameter:

$\theta_d^* = \arg\min_{\theta_d} \hat{d}(\theta_g, \theta_d)$

  – Results in the ML estimate of the Bayes classifier parameters.
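As an illustration (not the lecture's code), this discriminator loss is simply binary cross-entropy with label 1 on reference samples and label 0 on generated samples. A minimal PyTorch sketch, assuming a discriminator f_theta_d that outputs probabilities:

    # Hypothetical sketch: the discriminator loss d_hat as binary cross-entropy.
    import torch
    import torch.nn.functional as F

    def discriminator_loss(f_theta_d, y_ref, y_gen):
        # -log f(y_k) on reference samples (label 1)
        p_ref = f_theta_d(y_ref)
        # -log(1 - f(y_hat_k)) on generated samples (label 0)
        p_gen = f_theta_d(y_gen.detach())   # do not backpropagate into the generator here
        loss = F.binary_cross_entropy(p_ref, torch.ones_like(p_ref)) + \
               F.binary_cross_entropy(p_gen, torch.zeros_like(p_gen))
        return loss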

(𝑦"generatedsamples

Generator(𝑦" = ℎ#! 𝑧"

Discriminator(𝑝" = 𝑓#( (𝑦"

𝑧"

𝑦"

𝑦"referencesamples

0 =Generated

Discriminator𝑝" = 𝑓#( 𝑦"

1 =Reference+ 9𝑑 𝜃) , 𝜃(

CrossEntropy( (𝑝!, 0)

CrossEntropy(𝑝!, 1)𝑝"

(𝑝"

Page 8: Adversarial Learning

Training a Generator

(𝑦"generatedsamples

Generator(𝑦" = ℎ#! 𝑧"

Discriminator (𝑝" = 𝑓#( (𝑦"

𝑧" <𝑔 𝜃) , 𝜃(Loss Function

𝐿 (𝑝!

• Big idea: Maximize the probability that outputs of the generator are classified as being from the reference distribution.
  – $\hat{p}_k$ should be large
  – $L(\hat{p}_k)$ should be small when $\hat{p}_k$ is large
  – $L(\hat{p}_k)$ should be a decreasing function of $\hat{p}_k$

• Generator loss function:

$\hat{g}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} L\left( f_{\theta_d}(\hat{y}_k) \right)$

• Optimal generator parameter:

$\theta_g^* = \arg\min_{\theta_g} \hat{g}(\theta_g, \theta_d)$

  – The loss $L$ should encourage $\hat{p}_k$ to be large.

Page 9: Adversarial Learning

Generative Adversarial Network (GAN)*

(𝑦"generatedsamples

Generator(𝑦" = ℎ#! 𝑧"

Discriminator (𝑝" = 𝑓#( (𝑦"

𝑧"

𝑦"

𝑦"referencesamples

0 =Generated

Discriminator 𝑝" = 𝑓#( 𝑦"

1 =Reference + 9𝑑 𝜃) , 𝜃(

Loss Function𝐿 (𝑝!

<𝑔 𝜃) , 𝜃(

CrossEntropy( (𝑝!, 0)

CrossEntropy(𝑝!, 1)

• Generator loss function:

$\hat{g}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} L\left( f_{\theta_d}(\hat{y}_k) \right)$

• Discriminator loss function:

$\hat{d}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} \left[ -\log f_{\theta_d}(y_k) - \log\left(1 - f_{\theta_d}(\hat{y}_k)\right) \right]$

*Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, “Generative Adversarial Networks”, Proc. of the Intern. Conference on Neural Information Processing Systems (NIPS 2014). pp. 26

Page 10: Adversarial Learning

GAN: Expected Loss Functions

[Figure: the same GAN block diagram, now written in terms of the random variables $Z$, $\hat{Y}_{\theta_g} = h_{\theta_g}(Z)$, $Y$, $\hat{P}_{\theta_g} = f_{\theta_d}(\hat{Y}_{\theta_g})$, and $P = f_{\theta_d}(Y)$, with expected losses $g(\theta_g, \theta_d)$ and $d(\theta_g, \theta_d)$ built from CrossEntropy($\hat{P}_{\theta_g}$, 0), CrossEntropy($P$, 1), and $L(\hat{P}_{\theta_g})$.]

• Generator loss function:

$g(\theta_g, \theta_d) = E\left[ L\left( f_{\theta_d}(\hat{Y}_{\theta_g}) \right) \right]$

• Discriminator loss function:

$d(\theta_g, \theta_d) = E\left[ -\log f_{\theta_d}(Y) \right] + E\left[ -\log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g})\right) \right]$

  – By the weak and strong laws of large numbers, $\lim_{K \to \infty} \hat{g} = g$ and $\lim_{K \to \infty} \hat{d} = d$.

Page 11: Adversarial Learning

Generator Loss Function Choices

• Option 0: Original loss function proposed in [1].
  – $L(p) = \log(1 - p)$, so $L(0) = 0$ and $L(1) = -\infty$
  – $g(\theta_g, \theta_d) = E\left[ \log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g})\right) \right]$
  – Presented as the key theoretically grounded approach in the Goodfellow paper.
  – Consistent with zero-sum game Nash equilibrium theory.
  – Almost no one uses it.

• Option 1: "Non-saturating" loss function, i.e., the "$-\log D$ trick".
  – $L(p) = -\log p$, so $L(0) = +\infty$ and $L(1) = 0$
  – $g(\theta_g, \theta_d) = E\left[ -\log f_{\theta_d}(\hat{Y}_{\theta_g}) \right]$
  – Mentioned as a trick in [1] to keep the training loss from "saturating".
  – This is what you get if you use the cross-entropy loss for the generator.
  – This is what is commonly done (a short sketch of both options follows this list).
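For concreteness, here is a minimal, hypothetical PyTorch sketch (not from the slides) of the two generator-loss options; p_hat is assumed to be the discriminator output $f_{\theta_d}(\hat{y}_k)$, a probability in (0, 1):

    # Hypothetical sketch of the two generator-loss choices.
    import torch

    def generator_loss_option0(p_hat):
        # Option 0 (original, saturating): L(p) = log(1 - p)
        return torch.log(1.0 - p_hat).mean()

    def generator_loss_option1(p_hat):
        # Option 1 (non-saturating, the "-log D trick"): L(p) = -log p
        return -torch.log(p_hat).mean()

    # p_hat = f_theta_d(h_theta_g(z)) would be the discriminator output on generated samples.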

[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. “Generative Adversarial Networks”, Proc. of the Intern. Conference on Neural Information Processing Systems (NIPS 2014). pp. 26

Page 12: Adversarial Learning

GAN Architecture (Non-Saturating)

[Figure: the GAN block diagram with the non-saturating choice: the generator loss $g(\theta_g, \theta_d)$ is CrossEntropy($\hat{P}_{\theta_g}$, 1), while the discriminator loss $d(\theta_g, \theta_d)$ sums CrossEntropy($\hat{P}_{\theta_g}$, 0) (0 = Generated) and CrossEntropy($P$, 1) (1 = Reference).]

• Generator loss function:

$g(\theta_g, \theta_d) = E\left[ -\log f_{\theta_d}(\hat{Y}_{\theta_g}) \right]$

• Discriminator loss function:

$d(\theta_g, \theta_d) = E\left[ -\log f_{\theta_d}(Y) \right] + E\left[ -\log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g})\right) \right]$

  – By the weak and strong laws of large numbers, $\lim_{K \to \infty} \hat{g} = g$ and $\lim_{K \to \infty} \hat{d} = d$.

Page 13: Adversarial Learning

GAN Equilibrium Conditions (Non-Saturating)

• We would like to find the solution to:

$\theta_g^* = \arg\min_{\theta_g} g(\theta_g, \theta_d^*) = \arg\min_{\theta_g} E\left[ -\log f_{\theta_d^*}(\hat{Y}_{\theta_g}) \right]$

$\theta_d^* = \arg\min_{\theta_d} d(\theta_g^*, \theta_d) = \arg\min_{\theta_d} E\left[ -\log f_{\theta_d}(Y) \right] + E\left[ -\log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g^*})\right) \right]$

  – This is known as a Nash equilibrium.
  – We would like it to converge to $R^* = 1$ ⇒ (generated = reference distributions).

• How do we solve this?

• Will it converge?

Page 14: Adversarial Learning

Nash Equilibrium with Two Agents*

• Agent $G$:
  – Controls parameter $\theta_g$
  – Goal is to minimize $g(\theta_g, \theta_d^*)$

• Agent $D$:
  – Controls parameter $\theta_d$
  – Goal is to minimize $d(\theta_g^*, \theta_d)$

[Figure: two stick-figure agents. The $G$ agent turns a knob $\theta_g$ while watching a meter showing $g(\theta_g, \theta_d)$; the $D$ agent turns a knob $\theta_d$ while watching a meter showing $d(\theta_g, \theta_d)$.]

• Each agent tries to minimize their own meter:

*Graphics and art reproduced from “stick figure” Wiki page.

$g(\theta_g^*, \theta_d^*) = \min_{\theta_g} g(\theta_g, \theta_d^*)$

$d(\theta_g^*, \theta_d^*) = \min_{\theta_d} d(\theta_g^*, \theta_d)$

Page 15: Adversarial Learning

Zero-Sum Game: Special Nash Equilibrium*

• Agent $G$:
  – Goal is to minimize $g(\theta_g, \theta_d^*)$
  – Goal is to maximize $d(\theta_g, \theta_d^*)$

• Agent $D$:
  – Goal is to minimize $d(\theta_g^*, \theta_d)$

[Figure: the same two stick-figure agents, but now the $G$ agent's meter shows $d(\theta_g, \theta_d)$, which $G$ tries to maximize with its knob $\theta_g$, while the $D$ agent tries to minimize $d(\theta_g, \theta_d)$ with its knob $\theta_d$.]

• Special case when $g(\theta_g, \theta_d) = -d(\theta_g, \theta_d)$:

*Graphics and art reproduced from “stick figure” Wiki page.

$d(\theta_g^*, \theta_d^*) = \max_{\theta_g} d(\theta_g, \theta_d^*)$

$d(\theta_g^*, \theta_d^*) = \min_{\theta_d} d(\theta_g^*, \theta_d)$

This is an adversarial relationship.

Page 16: Adversarial Learning

Computing the GAN Equilibrium

• Reparameterizing the equations

• Alternating minimization ⇒ mode collapse

• Generator loss gradient descent

• Practical convergence issues

Page 17: Adversarial Learning

Reparameterize Loss Functions

• Goal: Replace $\theta_g$ and $\theta_d$ with $R$ and $f$.

• Generator parameter $R$:
  – Generated samples are distributed as

$\hat{Y} \sim R(y)\, p_r(y) = p_{\theta_g}(y)$

    where $R(y) = \frac{p_{\theta_g}(y)}{p_r(y)}$

• Discriminator parameter $f$:
  – The discriminator is

$f(y) = P(Class = R \mid y)$

• Important facts:
  – $E[R(Y)] = 1$
  – $\Omega_g = \left\{ R : \Re^d \to [0, \infty) \text{ such that } E[R(Y)] = 1 \right\}$
  – $\Omega_d = \left\{ f : \Re^d \to [0, 1] \right\}$
  – For any function $h(y)$, $E[h(\hat{Y})] = E[h(Y)\, R(Y)]$ (a numerical check follows).
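A quick Monte Carlo sanity check of the last identity (a sketch, not from the slides; the two Gaussian densities are stand-ins chosen purely for illustration):

    # Sketch: verify E[h(Y_hat)] = E[h(Y) R(Y)] by Monte Carlo with Gaussian stand-ins.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n = 1_000_000

    # reference density p_r = N(0, 1); "generated" density p_theta_g = N(0.5, 1)
    y = rng.normal(0.0, 1.0, n)          # Y ~ p_r
    y_hat = rng.normal(0.5, 1.0, n)      # Y_hat ~ p_theta_g

    def R(y):                            # likelihood ratio p_theta_g / p_r
        return norm.pdf(y, 0.5, 1.0) / norm.pdf(y, 0.0, 1.0)

    h = lambda t: t**2                   # any test function h
    print(np.mean(h(y_hat)))             # E[h(Y_hat)]
    print(np.mean(h(y) * R(y)))          # E[h(Y) R(Y)]  -- should be close
    print(np.mean(R(y)))                 # E[R(Y)]       -- should be close to 1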

Page 18: Adversarial Learning

GAN Equilibrium Conditions

• We would like to find the solution to:

$R^* = \arg\min_{R \in \Omega_g} g(R, f^*)$

$f^* = \arg\min_{f \in \Omega_d} d(R^*, f)$

  – This is known as a Nash equilibrium.
  – We would like it to converge to $R^* = 1$ ⇒ (generated = reference distributions).

• How do we do this?

• Will it converge?

Page 19: Adversarial Learning

Reparameterized Loss Functions

• Generator loss function:

$g(R, f) = E\left[ -\log f(\hat{Y}) \right] = E\left[ -R(Y) \log f(Y) \right]$

• Discriminator loss function:

$d(R, f) = E\left[ -\log f(Y) \right] + E\left[ -\log\left(1 - f(\hat{Y})\right) \right] = E\left[ -\log f(Y) \right] + E\left[ -R(Y) \log\left(1 - f(Y)\right) \right]$

• Nash equilibrium:

$R^* = \arg\min_{R \in \Omega_g} g(R, f^*)$

$f^* = \arg\min_{f \in \Omega_d} d(R^*, f)$

Page 20: Adversarial Learning

Method 1: Alternating Minimization

• Algorithm:

Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} d(R^*, f)$
    $R^* \leftarrow \arg\min_{R \in \Omega_g} g(R, f^*)$
}

• Discriminator update:

$f^*(y) \leftarrow \frac{1}{1 + R(y)}$

• Generator update:

$R^*(y) \leftarrow \delta(y - y')$ where $y' = \arg\max_y f(y)$

• This doesn't work! The problem:
  – This is called "mode collapse."
  – The generator only produces the single sample that the discriminator likes best.
  – Intuition: "We come from France." "I like cheese steaks." "Too good to be true." "Too creepy to be real."
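A hypothetical toy demonstration of this failure (not from the slides): on a discrete support, the exact alternating updates collapse the generated distribution onto a single point and then keep jumping between points.

    # Sketch: alternating minimization collapses the generated distribution onto
    # the single point currently favored by the discriminator.
    import numpy as np

    p_r = np.array([0.2, 0.5, 0.3])          # reference distribution on 3 points
    p_g = np.array([0.6, 0.2, 0.2])          # initial generated distribution

    for it in range(5):
        R = p_g / p_r                         # likelihood ratio
        f = 1.0 / (1.0 + R)                   # optimal discriminator
        y_best = np.argmax(f)                 # point the discriminator "likes best"
        p_g = np.zeros_like(p_g)              # generator puts all mass there
        p_g[y_best] = 1.0
        print(it, y_best, p_g)                # the mass keeps jumping between single points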

Page 21: Adversarial Learning

Method 2: Generator Loss Gradient Descent (GLGD)

• Algorithm: see the loop below. ("GLGD" is my term.)

• Discriminator update:

$f^*(y) \leftarrow \frac{1}{1 + R(y)}$

• Generator update:
  – Take a step in the negative direction of the generator loss gradient.
  – $P_\Omega$ projects the gradient into the allowed parameter space. (This is not an issue in practice.)

$P_\Omega d(y) = d(y) - \frac{E[d(Y)]}{E[p_r(Y)]}\, p_r(y)$

• Questions/Comments:
  – Can be applied with a wide variety of generator/discriminator loss functions.
  – Does this converge?
  – If so, then what (if anything) is being minimized?

*Martin Arjovsky and Leon Bottou, “Towards Principled Methods for Training Generative Adversarial Networks”, ICLR 2017.

Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} d(R, f)$
    $R \leftarrow R - \alpha P_\Omega \nabla_R\, g(R, f^*)$    (gradient descent step; $P_\Omega$ is the projection onto the valid parameter space)
}

Page 22: Adversarial Learning

GLGD Convergence for Non-Saturating GAN

*This is an expression equivalent to Theorem 2.5 of [1].
[1] Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.

• For the non-saturating GAN, when $f^* = \arg\min_{f \in \Omega_d} d(R, f)$, it can be shown that

$\nabla_R\, g(R, f^*) = \nabla_R C(R)$

where*

$C(R) = E\left[ \left(1 + R(Y)\right) \log\left(1 + R(Y)\right) \right]$

• Conclusions:
  – GLGD is really a gradient descent algorithm for the cost function $C(R)$.
  – $(1 + x)\log(1 + x)$ is a strictly convex function of $x$ (see the check below).
  – Therefore, we know that $C(R)$ has a unique global minimum at $R(Y) = 1$.
  – However, convergence tends to be slow.

Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} d(R, f)$
    $R \leftarrow R - \alpha P_\Omega \nabla_R\, g(R, f^*)$
}
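A quick check of the convexity claim above (added here for completeness):

$\frac{d}{dx}\left[(1 + x)\log(1 + x)\right] = \log(1 + x) + 1, \qquad \frac{d^2}{dx^2}\left[(1 + x)\log(1 + x)\right] = \frac{1}{1 + x} > 0 \quad \text{for } x > -1,$

so $(1 + x)\log(1 + x)$ is strictly convex. By Jensen's inequality with the constraint $E[R(Y)] = 1$, $C(R) = E[(1 + R(Y))\log(1 + R(Y))] \ge 2\log 2$, with equality only when $R(Y) \equiv 1$.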

Page 23: Adversarial Learning

More Details on GLGD

*Martin Arjovsky and Leon Bottou, “Towards Principled Methods for Training Generative Adversarial Networks”, ICLR 2017.

• Arjovsky and Bottou showed that for the non-saturating GAN*

$P_\Omega\, \nabla_R\, g(R, f^*) = \nabla_R \left[ KL(p_R \| p_r) - 2\, JSD(p_R \| p_r) \right]$

• So from the previous identities, we have that:

$P_\Omega\, \nabla_R\, g(R, f^*) = \nabla_R\; 2\, KL\!\left( \frac{p_R + p_r}{2} \,\Big\|\, p_r \right) = P_\Omega\, \nabla_R\, E\left[ \left(1 + R(Y)\right) \log\left(1 + R(Y)\right) \right]$

• Conclusions:
  – GLGD is really a gradient descent algorithm for the cost function
    $C(R) = 2\, KL\!\left( \frac{p_R + p_r}{2} \,\Big\|\, p_r \right)$
  – $C(R)$ has a unique global minimum at $p_R = p_r$.
  – However, convergence tends to be slow.

Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} d(R, f)$
    $R \leftarrow R - \alpha P_\Omega \nabla_R\, g(R, f^*)$
}

Page 24: Adversarial Learning

Convergence of GANs

• Generator and discriminator at convergence:
  – The reference and generated distributions are the same ⇒ $R^*(Y) = 1$
  – The discriminator cannot distinguish the distributions ⇒ $f^*(y) = \frac{1}{1 + R^*(y)} = \frac{1}{2}$

• At convergence the generated and reference distributions are identical.
  – Therefore, the likelihood ratio is $R(y) = 1$;
  – The generated (fake) and reference (real) distributions are identical;
  – The discriminator assigns a 50/50 probability to either case because they are indistinguishable.
  – Then both the generator and discriminator cross-entropy losses are $-\log\frac{1}{2} \approx 0.693$.

• In practice, things don't usually work out this well…

Page 25: Adversarial Learning

Method 2: Practical Algorithm

• Algorithm:

Repeat {
    For $N_d$ iterations {
        $B \leftarrow GetRandomBatch()$
        $\theta_d \leftarrow \theta_d - \beta \nabla_{\theta_d} \hat{d}(\theta_g, \theta_d; B)$
    }
    $B \leftarrow GetRandomBatch()$
    $\theta_g \leftarrow \theta_g - \alpha \nabla_{\theta_g} \hat{g}(\theta_g, \theta_d; B)$
}
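A minimal PyTorch sketch of this loop (an illustration, not the lecture's code; generator, discriminator, the two optimizers, and get_random_batch are assumed to exist):

    # Hypothetical non-saturating GAN training step, following the loop above.
    import torch
    import torch.nn.functional as F

    def train_step(generator, discriminator, opt_g, opt_d, get_random_batch,
                   z_dim=16, n_d=1):
        # Discriminator updates (N_d inner iterations)
        for _ in range(n_d):
            y_ref = get_random_batch()
            z = torch.randn(y_ref.shape[0], z_dim)
            y_gen = generator(z).detach()
            p_ref = discriminator(y_ref)
            p_gen = discriminator(y_gen)
            d_loss = F.binary_cross_entropy(p_ref, torch.ones_like(p_ref)) + \
                     F.binary_cross_entropy(p_gen, torch.zeros_like(p_gen))
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator update (non-saturating: cross-entropy against the "real" label)
        y_ref = get_random_batch()
        z = torch.randn(y_ref.shape[0], z_dim)
        p_gen = discriminator(generator(z))
        g_loss = F.binary_cross_entropy(p_gen, torch.ones_like(p_gen))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()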

• What you would like to see:

[Figure: loss versus iteration number. The generator loss and one-half of the discriminator loss both settle near $-\log\frac{1}{2} \approx 0.693$.]

  – Looks good, but… this could also result from a discriminator with insufficient capacity.

Page 26: Adversarial Learning

Failure Mode: Mode Collapse

• Algorithm: same as before (see the loop below).

• Sometimes you get mode collapse:

[Figure: loss versus iteration number, showing the generator loss and one-half of the discriminator loss. The discriminator dominates, and the losses move away from the equilibrium value $\approx 0.693$.]

Repeat {
    For $i = 0$ to $N_d - 1$ {
        $\theta_d \leftarrow \theta_d - \beta \nabla_{\theta_d} \hat{d}(\theta_g, \theta_d)$
    }
    $\theta_g \leftarrow \theta_g - \alpha \nabla_{\theta_g} \hat{g}(\theta_g, \theta_d)$
}

Might be caused by:
  – Overfitting by the discriminator
  – An insufficient number of discriminator updates
  – Insufficient generator capacity

Page 27: Adversarial Learning

Concept Wasserstein GAN

*Martin Arjovsky, Soumith Chintala and Leon Bottou, “Wasserstein Generative Adversarial Networks”, ICML 2017.

• Problems with GAN training using the "$-\log D$ trick":
  – Slow and sometimes unstable convergence
  – Problems with vanishing gradients

• Conjecture:
  – The problem is caused by the discriminator function.
  – The Bayes classifier is too sensitive and too nonlinear.
  – Non-overlapping distributions create vanishing gradients that slow convergence.

• Base the discriminator on the Wasserstein distance (i.e., the earth mover's distance).


Page 28: Adversarial Learning

Wasserstein Fundamentals

• Based on the Kantorovich-Rubinstein duality (Villani, 2009)*:

$W(p_R \| p_r) = \sup_{\|f\|_L \le 1} \left\{ E\left[ f(Y) \right] - E\left[ f(\hat{Y}_R) \right] \right\}$

  – where $\|f\|_L$ is the Lipschitz constant of $f$
  – $\|f\|_L \le 1$ is referred to as the 1-Lipschitz condition

*Villani, Cedric. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.
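As a simple numerical illustration (not from the slides, and using the 1-D closed form rather than the dual above): for equal-sized sample sets in one dimension, the Wasserstein-1 (earth mover's) distance reduces to the mean absolute difference of the sorted samples, and SciPy provides the same quantity directly.

    # Sketch: 1-D Wasserstein (earth mover's) distance between two sample sets.
    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    y_ref = rng.normal(0.0, 1.0, 10_000)      # samples of Y ~ p_r
    y_gen = rng.normal(0.5, 1.0, 10_000)      # samples of Y_hat ~ p_R

    w_sorted = np.mean(np.abs(np.sort(y_ref) - np.sort(y_gen)))
    w_scipy = wasserstein_distance(y_ref, y_gen)
    print(w_sorted, w_scipy)                  # both close to the true distance of 0.5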

Page 29: Adversarial Learning

Wasserstein GAN*

• The fundamental result of Arjovsky et al.:
  – Define the set $R \in \Omega_g$ as usual, but define $f \in \Omega_d^W$ so that

$\Omega_d^W = \left\{ f : \Re^d \to (-\infty, \infty) \;\; \text{s.t.} \;\; \|f\|_L \le 1 \right\}$

  – Then define the Wasserstein generator and discriminator loss functions as

$g_W(R, f) = E\left[ -f(\hat{Y}_R) \right]$

$d_W(R, f) = E\left[ f(\hat{Y}_R) \right] - E\left[ f(Y) \right]$

• Key result from the Arjovsky paper*:

$\nabla_R\, g_W(R, f^*) = \nabla_R\, W(p_R \| p_r)$

  – where

$f^* = \arg\min_{f \in \Omega_d^W} d_W(R, f)$

*Martin Arjovsky, Soumith Chintala, and Leon Bottou, “Wasserstein Generative Adversarial Networks”, ICML 2017.

Page 30: Adversarial Learning

Method 3: Wasserstein Algorithm

• Algorithm: see the loop below.

• Discriminator update:
  – How do we minimize the discriminator loss subject to the Lipschitz constraint?
  – Answer: we clip the discriminator weights during training.
  – Observation: isn't this just regularization of the discriminator DNN?

• Generator update:
  – Take a step in the negative direction of the generator loss gradient.

*Martin Arjovsky and Leon Bottou, “Towards Principled Methods for Training Generative Adversarial Networks”, ICLR 2017.

Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d^W} d_W(R, f)$
    $R \leftarrow R - \alpha P_\Omega \nabla_R\, g_W(R, f^*)$
}
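A rough PyTorch sketch of the weight-clipped critic update (an illustration under the same assumptions as the earlier loops; the clip value, batch size, and number of critic iterations are hypothetical hyperparameters):

    # Hypothetical Wasserstein-style update with weight clipping as a crude
    # proxy for the 1-Lipschitz constraint.
    import torch

    def wgan_step(generator, critic, opt_g, opt_c, get_random_batch,
                  z_dim=16, n_c=5, clip=0.01, batch_size=128):
        # Critic (discriminator) updates: minimize E[f(Y_hat)] - E[f(Y)]
        for _ in range(n_c):
            y_ref = get_random_batch()
            z = torch.randn(y_ref.shape[0], z_dim)
            d_loss = critic(generator(z).detach()).mean() - critic(y_ref).mean()
            opt_c.zero_grad(); d_loss.backward(); opt_c.step()
            # Clip weights to approximate the Lipschitz constraint
            for p in critic.parameters():
                p.data.clamp_(-clip, clip)

        # Generator update: minimize E[-f(Y_hat)]
        z = torch.randn(batch_size, z_dim)
        g_loss = -critic(generator(z)).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()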

Page 31: Adversarial Learning

Method 3: Wasserstein Practical Algorithm

• Observations:
  – Some people seem to feel the Wasserstein GAN has better convergence.
  – However, is this because of the Wasserstein metric?
  – Or is it because of the other algorithmic improvements?

*Martin Arjovsky and Leon Bottou, “Towards Principled Methods for Training Generative Adversarial Networks”, ICLR 2017.

[Figure: the practical Wasserstein training algorithm, annotated as follows:]
  – Make sure to get new batches.
  – Iterate the discriminator to approximate convergence.
  – Minimize the discriminator loss.
  – Clip the discriminator weights to approximate the Lipschitz constraint.
  – Take a gradient step of the generator loss.

Page 32: Adversarial Learning

Conditional Generative Adversarial Network

• Generates samples from the conditional distribution of $Y$ given $X$.

• The discriminator takes $(y_k, x_k)$ input pairs for $k = 0, \cdots, K - 1$.

(𝑦"- GeneratedGenerator

(𝑦" = ℎ#! 𝑥" , 𝑧"

𝑥" Discriminator (𝑝" = 𝑓#( (𝑦" , 𝑥"

𝑧"

𝑦"𝑦"- Reference

0 =Generated

𝑥" , 𝑦"

Discriminator 𝑝" = 𝑓#( 𝑦" , 𝑥"

1 =Reference + 𝑑 𝜃) , 𝜃(

1 = Reference

CrossEntropy( (𝑝!, 1) 𝑔 𝜃) , 𝜃(

CrossEntropy( (𝑝!, 0)

CrossEntropy(𝑝!, 1)

Reference distribution

𝑥"

𝑥"
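A minimal, hypothetical sketch of the conditional generator and discriminator interfaces (the layer sizes and the simple concatenation of $x$ with $z$ or $y$ are illustrative choices, not taken from the slides):

    # Hypothetical conditional GAN modules: both networks take the conditioning input x.
    import torch
    import torch.nn as nn

    class CondGenerator(nn.Module):
        def __init__(self, x_dim=4, z_dim=16, y_dim=2, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, y_dim),
            )

        def forward(self, x, z):
            return self.net(torch.cat([x, z], dim=1))   # y_hat = h_theta_g(x, z)

    class CondDiscriminator(nn.Module):
        def __init__(self, x_dim=4, y_dim=2, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Sigmoid(),
            )

        def forward(self, y, x):
            return self.net(torch.cat([y, x], dim=1))   # p = f_theta_d(y, x)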