
The Expressive Power of Planar Flows
Zhifeng Kong, Kamalika Chaudhuri

University of California San Diego
z4kong, [email protected]

Abstract
Normalizing flows have received a great deal of recent attention as they allow flexible generative modeling as well as easy likelihood computation. While a wide variety of flow models have been proposed, there is little formal understanding of the representation power of these models. In this work, we study a class of simple normalizing flows called planar flows, and rigorously establish bounds on their expressive power. Our results indicate that while these flows are highly expressive in one dimension, in higher dimensions their representation power may be limited, especially for planar flows of moderate depth.

Background: Normalizing Flows (NF)

Figure: A normalizing flow model that transforms the source distribution $p_0(z_0)$ to the target distribution $p_K(z_K)$. Figure by Lilian Weng, https://lilianweng.github.io/lil-log/assets/images/normalizing-flow.png

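- $z_i \in \mathbb{R}^d$, $0 \le i \le K$.
- $p_K = f_\# p_0$, where $f = f_K \circ \cdots \circ f_1$.
- Each $f_i : \mathbb{R}^d \to \mathbb{R}^d$ is simple, invertible, and parameterized.
- Density computation (each Jacobian is evaluated at the input $z_{i-1}$ of its layer):

  $$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \lvert \det J_{f_i}(z_{i-1}) \rvert, \qquad z_i = f_i(z_{i-1}),\ 1 \le i \le K$$

- Solve MLE $\Rightarrow$ a generative model with computable likelihood.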
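The density computation above can be sketched in one dimension. This is a minimal illustration only: the affine layers $f_i(z) = a_i z + c_i$, the standard normal source, and the names `std_normal_logpdf` and `flow_log_density` are assumptions made for the example, not the planar flows studied on this poster.

```python
import math

def std_normal_logpdf(z):
    """Log-density of the standard normal source p_0."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_log_density(z0, layers):
    """Push z0 through affine layers f_i(z) = a*z + c and track log p_K(z_K).

    `layers` is a list of (a, c) pairs with a != 0 (hypothetical layer type).
    """
    z = z0
    log_p = std_normal_logpdf(z0)
    for a, c in layers:
        log_p -= math.log(abs(a))  # the Jacobian of z -> a*z + c is the constant a
        z = a * z + c
    return z, log_p

layers = [(2.0, 1.0), (0.5, -3.0), (3.0, 0.0)]
z_K, log_p_K = flow_log_density(0.7, layers)
print(z_K, log_p_K)
```

For these three layers the composition is the affine map $z \mapsto 3z - 7.5$, so $p_K$ is the Gaussian $\mathcal{N}(-7.5, 9)$ and the accumulated value can be compared with its closed-form log-density.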

Definition (planar flow [Rezende and Mohamed, 2015])
A planar flow $f_{\mathrm{pf}}$ is defined by an invertible function $f_{\mathrm{pf}}(z) = z + u\,h(w^\top z + b)$, where $u, w, z \in \mathbb{R}^d$, $b \in \mathbb{R}$, with non-linearity $h : \mathbb{R} \to \mathbb{R}$.
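A minimal sketch of one planar flow layer with $h = \tanh$, using plain Python lists. The log-determinant uses the matrix determinant lemma, $\det(I + h'(w^\top z + b)\,u w^\top) = 1 + h'(w^\top z + b)\,u^\top w$, as in [Rezende and Mohamed, 2015]. The function names and the parameter values are hypothetical, and the code does not enforce invertibility (for tanh, $u^\top w \ge -1$ is sufficient).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def planar_flow(z, u, w, b):
    """f(z) = z + u * tanh(w.z + b) for vectors given as lists."""
    s = math.tanh(dot(w, z) + b)
    return [zi + ui * s for zi, ui in zip(z, u)]

def planar_log_abs_det(z, u, w, b):
    """log|det J_f(z)| = log|1 + tanh'(w.z + b) * (u.w)|."""
    a = dot(w, z) + b
    dh = 1.0 - math.tanh(a) ** 2  # derivative of tanh
    return math.log(abs(1.0 + dh * dot(u, w)))

# Hypothetical parameters for illustration (u.w = 0.18 > -1).
z = [0.3, -1.2]
u, w, b = [0.5, 0.8], [1.0, -0.4], 0.1
print(planar_flow(z, u, w, b), planar_log_abs_det(z, u, w, b))
```

The rank-one structure is what makes the determinant cost linear in $d$ rather than cubic.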

Figure: The output distributions transformed from different source distributions with different numbers of layers of planar flows ($h = \tanh$). Figure in [Rezende and Mohamed, 2015].

Problem Statement
Suppose $f$ is composed of $T$ planar flows: $f = f_T \circ \cdots \circ f_1$. Let $q$ be the source (input) distribution and $p$ be the target distribution.

- Q1 (exact transformation): when does $f$ satisfy $p = f_\# q$ (a.e.)?
- Q2 (approximation): given $\varepsilon > 0$, is there a bound on $T$ such that $\|f_\# q - p\|_1 \le \varepsilon$?

Challenge
Suppose $\mathcal{F}$ is a function class and $\mathcal{I}$ is the set of all invertible functions.

- $\mathcal{F}$ is a universal approximator $\nRightarrow$ $\mathcal{F} \cap \mathcal{I}$ can transform between arbitrary distributions.
- $\mathcal{F}$ has limited expressivity $\nRightarrow$ $\mathcal{F} \cap \mathcal{I}$ is not a universal approximator in transforming distributions.

Therefore, the expressive power of $\mathcal{F}$ in the function space does not indicate the expressive power of $\mathcal{F} \cap \mathcal{I}$ in transforming distributions.
Our technique: directly look at input-output distribution pairs.

Results for the d = 1 case: universal approximation

Theorem (universal approximation for the ReLU non-linearity)
Let $p$ be a density on $\mathbb{R}$ supported on a finite union of intervals. Then, for any $\varepsilon > 0$, there exists a flow $f$ composed of finitely many ReLU planar flows and a Gaussian distribution $q_{\mathcal{N}}$ such that $\|f_\# q_{\mathcal{N}} - p\|_1 \le \varepsilon$.

Figure: A tail-consistent piecewise Gaussian distribution of 3 pieces (legend: 1st piece, 2nd piece, 3rd piece).

Sketch of proof.
First, we show by construction that if $f$ is a normalizing flow composed of $n - 1$ ReLU planar flows, then $f_\# q_{\mathcal{N}}$ can express any tail-consistent piecewise Gaussian distribution. Then, we use tail-consistent piecewise Gaussian distributions to approximate piecewise constant distributions.
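To illustrate the building block of this construction, the sketch below pushes a standard normal through a single one-dimensional ReLU planar flow $f(z) = z + u\,\mathrm{ReLU}(wz + b)$. Under the assumptions $w > 0$ and $1 + uw > 0$ (made here only to keep the example short), $f$ is the identity to the left of the kink $-b/w$ and an affine map with slope $1 + uw$ to the right, so the pushforward density consists of two Gaussian pieces. The function names and parameter values are hypothetical, and this is not the authors' construction.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def relu_flow_1d(z, u, w, b):
    """One-dimensional ReLU planar flow f(z) = z + u * max(0, w*z + b)."""
    return z + u * max(0.0, w * z + b)

def pushforward_pdf(x, u, w, b):
    """Density of f(Z) for Z ~ N(0, 1), assuming w > 0 and 1 + u*w > 0."""
    slope = 1.0 + u * w
    kink = -b / w                      # f is the identity for z <= kink
    if x <= kink:
        return normal_pdf(x)
    z = (x - u * b) / slope            # invert the affine branch
    return normal_pdf(z) / slope       # change of variables

def total_mass(u, w, b, lo=-12.0, hi=40.0, step=1e-3):
    """Midpoint-rule integral of the pushforward density over [lo, hi]."""
    n = int((hi - lo) / step)
    return sum(pushforward_pdf(lo + (k + 0.5) * step, u, w, b) for k in range(n)) * step

print(total_mass(1.5, 1.0, 0.5))  # should be close to 1
```

Composing several such layers with different kinks adds further pieces, which is the intuition behind approximating piecewise constant densities.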

Figure: Target distribution $p$, its piecewise constant approximation $q_{pwc}$ of 50 (left)/300 (right) pieces, and its tail-consistent piecewise Gaussian approximation $q_{pwg}$ generated by 50 (left)/300 (right) ReLU planar flows over a Gaussian.

Results for high-d exact transformation: topology matching
Suppose distribution $q$ is defined on $\mathbb{R}^d$.

Theorem (planar flows with h = ReLU)
Suppose $f$ is composed of finitely many ReLU planar flows. Let $p = f_\# q$. Then, there exists a zero-measure closed set $\Omega \subset \mathbb{R}^d$ such that $\forall z \in \mathbb{R}^d \setminus \Omega$, we have $J_f(z)^\top \nabla_z \log p(f(z)) = \nabla_z \log q(z)$.
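One way to see where this identity comes from (a sketch under the assumption that $q$ is differentiable, not necessarily the authors' proof): take the change-of-variables formula for $p = f_\# q$, differentiate in $z$ with the chain rule, and use that a composition of ReLU planar flows is piecewise affine, so $J_f$ is locally constant away from the points where some ReLU switches. Here $\nabla_z \log p(f(z))$ is read as the gradient of $\log p$ evaluated at $f(z)$, and $\Omega$ is assumed to contain the switching points.

```latex
\begin{aligned}
\log p(f(z)) &= \log q(z) - \log \lvert \det J_f(z) \rvert, \\
J_f(z)^\top (\nabla \log p)(f(z))
  &= \nabla_z \log q(z) - \nabla_z \log \lvert \det J_f(z) \rvert \\
  &= \nabla_z \log q(z) \qquad \text{for } z \notin \Omega .
\end{aligned}
```

In particular, since $J_f(z)$ is invertible, a critical point of $q$ outside $\Omega$ is mapped to a critical point of $p$.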

Figure: The surface plot of $q$ (left), a mixture of Gaussians with 4 peaks located at $(\pm 1, \pm 1)$, and $p = f_\# q$ (right), the transformed distribution of $q$. The red points correspond to peaks of $q$ and are mapped to peaks of $p$.

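Corollary (MoG $\nrightarrow$ MoG, Prod $\nrightarrow$ Prod)
Suppose $p, q$ are (i) mixture of Gaussian distributions:

$$p(z) = \sum_{i=1}^{r_p} w_p^i\, \mathcal{N}(z; \mu_p^i, \Sigma_p), \qquad q(z) = \sum_{j=1}^{r_q} w_q^j\, \mathcal{N}(z; \mu_q^j, \Sigma_q)$$

or (ii) product distributions:

$$p(z) \propto \prod_{i=1}^{d} g(z_i)^{r_p}; \qquad q(z) \propto \prod_{i=1}^{d} g(z_i)^{r_q}, \qquad r_p > 0,\ r_q > 0,\ r_p \ne r_q$$

where $g$ is a smooth function. Then, there generally does not exist a flow $f$ composed of finitely many ReLU planar flows such that $p = f_\# q$.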

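Theorem (planar flows with general smooth h)
Suppose $f = f_{\mathrm{pf}}^{(n)} \circ \cdots \circ f_{\mathrm{pf}}^{(1)}$ where $f_{\mathrm{pf}}^{(i)}(z) = z + u_i\, h(w_i^\top z + b_i)$. Let $p = f_\# q$.
Then $\forall z \in \mathbb{R}^d$, we have $\nabla_z \log p(f(z)) - \nabla_z \log q(z) \in \mathrm{span}\{w_1, \cdots, w_n\}$.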
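The span condition can be checked numerically in a small case. The sketch below assumes $n = 1$, $d = 2$, $h = \tanh$, a standard Gaussian $q$, and hypothetical parameters `U`, `W`, `B` with $u^\top w > 0$; it reads $\nabla_z \log p(f(z))$ as the gradient of $\log p$ evaluated at $f(z)$. The density $p = f_\# q$ is obtained by inverting the flow with bisection, gradients are taken by central differences, and the component of the difference orthogonal to $w$ should vanish up to numerical error.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

U, W, B = [0.6, 0.2], [0.8, -0.5], 0.3  # hypothetical parameters, U.W = 0.38 > 0

def flow(z):
    s = math.tanh(dot(W, z) + B)
    return [z[0] + U[0] * s, z[1] + U[1] * s]

def inverse(x):
    # Solve t + (U.W) * tanh(t + B) = W.x for t = W.z by bisection
    # (the left side is increasing in t because U.W > 0).
    target, c = dot(W, x), dot(U, W)
    lo, hi = target - abs(c) - 1.0, target + abs(c) + 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid + c * math.tanh(mid + B) < target:
            lo = mid
        else:
            hi = mid
    s = math.tanh(0.5 * (lo + hi) + B)
    return [x[0] - U[0] * s, x[1] - U[1] * s]

def log_q(z):
    return -0.5 * dot(z, z) - math.log(2.0 * math.pi)

def log_p(x):
    # Change of variables: log p(x) = log q(z) - log|det J_f(z)| with z = f^{-1}(x).
    z = inverse(x)
    dh = 1.0 - math.tanh(dot(W, z) + B) ** 2
    return log_q(z) - math.log(1.0 + dh * dot(U, W))

def grad(fn, x, eps=1e-5):
    g = []
    for k in range(len(x)):
        xp, xm = list(x), list(x)
        xp[k] += eps
        xm[k] -= eps
        g.append((fn(xp) - fn(xm)) / (2.0 * eps))
    return g

def off_span_component(z):
    """Component of grad log p at f(z) minus grad log q at z, orthogonal to W."""
    gp, gq = grad(log_p, flow(z)), grad(log_q, z)
    diff = [gp[0] - gq[0], gp[1] - gq[1]]
    w_perp = [-W[1], W[0]]
    return dot(diff, w_perp) / math.sqrt(dot(W, W))

print(off_span_component([0.4, -0.7]))  # expected to be near zero
```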

Results for high-d approximation
Let $q, p$ be the input distribution and the target distribution on $\mathbb{R}^d$.

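Definition ($\ell_1$ norm approximation lower bound)
Let $\mathcal{F}$ be a set of normalizing flows. Then for any $\varepsilon > 0$, the minimum number of flows in $\mathcal{F}$ required to transform $q$ to an approximation of $p$ to within $\varepsilon$ is

$$T_\varepsilon(p, q, \mathcal{F}) = \inf\left\{ n : \exists\, \{f_i\}_{i=1}^{n} \in \mathcal{F} \text{ such that } \|(f_1 \circ \cdots \circ f_n)_\# q - p\|_1 \le \varepsilon \right\}$$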

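Theorem ($\ell_1$ norm approximation lower bound for local planar flows)
A planar flow $f_{\mathrm{pf}}(z) = z + u\,h(w^\top z + b)$ is called $c_h$-local if $\|u\|_2 \le 1$, $\|w\|_2 \le 1$, $|h(x)| \le c_h$, and $|h'(x)| \le c_h / (1 + |x|)$. Suppose $\mathcal{F}$ is the set of all $c_h$-local planar flows, $q$ is a random initialization, and $p$ satisfies for $\tau \in (0, 1)$:

- $p = O(p_1)$, where density $p_1(z) \propto \exp(-\|z\|_2^{\tau})$;
- $\|\nabla p\|_2 = O(\|\nabla p_2\|_2)$, where density

  $$p_2(z) \propto \begin{cases} \exp(-d) & \|z\|_2 \le d^{1/\tau} \\ \exp(-\|z\|_2^{\tau}) & \|z\|_2 > d^{1/\tau} \end{cases}$$

Then $\exists\, \varepsilon = \Theta(1)$ such that

$$T_\varepsilon(p, q, \mathcal{F}) = \Omega\left( \min\left( (\log d)^{-\frac{1}{\tau}}\, d^{\left(\frac{1}{\tau} - \frac{1}{2}\right)},\ d^{\left(\frac{1}{\tau} - 1\right)} \right) \right).$$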

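Sketch of proof.
Let $L(p, f) = \sup_{q' \text{ is a density on } \mathbb{R}^d} \left( \|p - q'\|_1 - \|p - f_\# q'\|_1 \right)$. Then,

$$L(p, f) \le \hat{L}(p, f) = \int_{\mathbb{R}^d} \bigl|\, \lvert \det J_f(z) \rvert\, p(f(z)) - p(z) \,\bigr|\, dz.$$

If we can bound $\hat{L}$, then

$$T_\varepsilon(p, q, \mathcal{F}) \ge \frac{\|p - q\|_1 - \varepsilon}{\sup_{f \in \mathcal{F}} L(p, f)} \ge \frac{\|p - q\|_1 - \varepsilon}{\sup_{f \in \mathcal{F}} \hat{L}(p, f)} = \Omega\left( \frac{1}{\sup_{f \in \mathcal{F}} \hat{L}(p, f)} \right)$$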
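The integral bound in the sketch above, $\int \bigl|\,\lvert\det J_f(z)\rvert\, p(f(z)) - p(z)\,\bigr|\,dz$, can be evaluated numerically in a toy setting. The example below is only an illustration in $d = 1$ (the theorem concerns large $d$): it assumes $p(z) \propto \exp(-|z|^{\tau})$ with a hypothetical $\tau = 0.5$, a single tanh planar flow, and a midpoint rule on a truncated interval. The names `TAU`, `p` and `l_hat` are made up for this sketch.

```python
import math

TAU = 0.5                                 # hypothetical tail exponent in (0, 1)
NORM = 2.0 * math.gamma(1.0 + 1.0 / TAU)  # integral of exp(-|z|^TAU) over the real line

def p(z):
    """Density proportional to exp(-|z|^TAU) in one dimension."""
    return math.exp(-abs(z) ** TAU) / NORM

def l_hat(u, w, b, half_width=600.0, step=0.02):
    """Midpoint-rule estimate of the integral of | |f'(z)| p(f(z)) - p(z) |
    for the one-dimensional planar flow f(z) = z + u * tanh(w*z + b)."""
    n = int(2.0 * half_width / step)
    total = 0.0
    for k in range(n):
        z = -half_width + (k + 0.5) * step
        a = w * z + b
        fz = z + u * math.tanh(a)
        jac = abs(1.0 + u * w * (1.0 - math.tanh(a) ** 2))
        total += abs(jac * p(fz) - p(z)) * step
    return total

print(l_hat(0.0, 1.0, 0.0))  # identity flow: the integrand vanishes
print(l_hat(1.0, 1.0, 0.0))  # one flow with |u| <= 1 and |w| <= 1
```

Because both terms in the integrand are densities, the value can never exceed 2; the lower bound argument relies on it being much smaller than that for every flow in $\mathcal{F}$.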

Figure: Examples of $c_h$-local non-linearities: tanh ($c_h = 2$), sigmoid ($c_h = 1$), and arctan ($c_h = \pi/2$). Left panel: tanh(x), Sigmoid(x), and $\tan^{-1}(x)$. Right panel: their derivatives tanh′(x), Sigmoid′(x), and $\tan^{-1\prime}(x)$, together with the curve labeled 1/(1+x).
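The two conditions on the non-linearity, $|h(x)| \le c_h$ and $|h'(x)| \le c_h / (1 + |x|)$, can be spot-checked on a grid for the three examples, taking $c_h = 2$ for tanh, $1$ for sigmoid, and $\pi/2$ for arctan. This is a numerical sanity check over a finite interval, not a proof, and the helper names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# name -> (h, derivative of h, claimed constant c_h)
NONLINEARITIES = {
    "tanh": (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2, 2.0),
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x)), 1.0),
    "arctan": (math.atan, lambda x: 1.0 / (1.0 + x * x), math.pi / 2.0),
}

def is_local_on_grid(h, dh, c_h, half_width=50.0, step=0.01):
    """Check |h(x)| <= c_h and |h'(x)| <= c_h / (1 + |x|) on a uniform grid."""
    n = int(2.0 * half_width / step) + 1
    for k in range(n):
        x = -half_width + k * step
        if abs(h(x)) > c_h or abs(dh(x)) > c_h / (1.0 + abs(x)):
            return False
    return True

for name, (h, dh, c_h) in NONLINEARITIES.items():
    print(name, is_local_on_grid(h, dh, c_h))
```

The constant matters: with $c_h = 1$ the derivative condition fails for tanh near $x \approx 0.4$, where $\tanh'(x)(1 + |x|)$ is roughly 1.2.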

Reference
Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

https://lilianweng.github.io/lil-log/assets/images/normalizing-flow.png