Theory and Practice of Differential Entropy Estimation

Yanjun Han (Stanford EE)

Joint work with: Weihao Gao (UIUC ECE), Jiantao Jiao (Stanford EE), Tsachy Weissman (Stanford EE), Yihong Wu (Yale Stats)

July 17, 2018




Outline

- Introduction: Problem Setup; Related Works
- Theory: Optimal Estimation (Estimator Construction; Estimator Analysis)
- Practice: Adaptive Estimation (Idea of Nearest Neighbor; Estimator Analysis; Numerical Results)
- Conclusion

2 / 42


Motivation

Information-theoretic measures:

- entropy H(X)
- mutual information I(X; Y)
- Kullback–Leibler divergence D(P‖Q)

Subroutine for many fields and applications:

- machine learning: classification, clustering, feature selection
- causal inference: network flow
- sociology
- computational biology
- ...

4 / 42


Problem Formulation

Problem:

- let f be a continuous density supported on [0, 1]^d, belonging to some function class F
- observe X^n = (X_1, ..., X_n) i.i.d. ~ f
- estimate the differential entropy of f based on X^n:

  h(f) = \int_{[0,1]^d} -f(x) \log f(x) \, dx

Target: characterize the minimax risk

  \inf_{\hat{h}} \sup_{f \in F} E_f |\hat{h}(X^n) - h(f)|

5 / 42
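The slides define h(f) abstractly; as a quick sanity check (our own example, not from the talk), the density f(x) = 2x on [0, 1] has the closed form h(f) = 1/2 - log 2 ≈ -0.1931, which a simple midpoint rule recovers:

```python
import math

# Hypothetical example density on [0, 1] (our choice, not from the slides):
# f(x) = 2x, whose differential entropy has the closed form
#   h(f) = -∫ 2x log(2x) dx = 1/2 - log 2 ≈ -0.1931
def f(x):
    return 2.0 * x

def entropy_riemann(f, n_bins=100_000):
    """Approximate h(f) = ∫ -f(x) log f(x) dx on [0, 1] by a midpoint rule."""
    total = 0.0
    dx = 1.0 / n_bins
    for i in range(n_bins):
        x = (i + 0.5) * dx
        fx = f(x)
        if fx > 0:
            total += -fx * math.log(fx) * dx
    return total

print(entropy_riemann(f))  # close to 0.5 - log 2
```

Of course, the estimation problem above only gives samples X_1, ..., X_n, not f itself; this check merely pins down the target quantity.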


Choice of Function Class

Hölder ball H^s_d(L):

- 0 < s ≤ 1: |f(x) - f(y)| ≤ L‖x - y‖^s
- 1 < s ≤ 2: ‖∇f(x) - ∇f(y)‖ ≤ L‖x - y‖^{s-1}
- intuition: ‖f^{(s)}‖_∞ ≤ L

Lipschitz ball (or Besov ball) Lip^s_{p,d}(L):

- definition: for any t ∈ R^d,

  ‖f(· + t) + f(· - t) - 2f(·)‖_p ≤ L‖t‖^s

- intuition: ‖f^{(s)}‖_p ≤ L

6 / 42
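To make the Hölder condition concrete, here is a small numeric check on an example of our own (not from the slides): f(x) = sqrt(x) lies in the Hölder ball with s = 1/2 and L = 1, since (sqrt(x) - sqrt(y))^2 ≤ |sqrt(x) - sqrt(y)| (sqrt(x) + sqrt(y)) = |x - y|.

```python
import random

# Illustrative check (our own example): f(x) = sqrt(x) satisfies
# |f(x) - f(y)| <= L * |x - y|^s with s = 1/2, L = 1 on [0, 1].
def holder_holds(f, s, L, trials=10_000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if abs(f(x) - f(y)) > L * abs(x - y) ** s + 1e-12:  # tiny slack for float error
            return False
    return True

print(holder_holds(lambda x: x ** 0.5, s=0.5, L=1.0))  # True
```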


Parameters

Reminder of important parameters:

- n: sample size
- d: dimension of the support of f
- s ∈ (0, 2]: smoothness parameter of F

7 / 42


Nonparametric Functional Estimation

General problem: given X_1, ..., X_n ~ f, we would like to estimate functionals of the form

  I(f) = \int w(f(x)) \, dx

Examples:

- quadratic functional: I(f) = \int f(x)^2 \, dx
- cubic functional: I(f) = \int f(x)^3 \, dx

9 / 42
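For intuition about what such a functional measures, note that the quadratic functional rewrites as an expectation, ∫ f(x)^2 dx = E_f[f(X)]. A quick Monte Carlo check on an example density of our own choosing (f(x) = 2x, not from the slides):

```python
import math
import random

# Our illustrative density (not from the slides): f(x) = 2x on [0, 1].
# Quadratic functional: I(f) = ∫ f(x)^2 dx = ∫ 4x^2 dx = 4/3.
# Since ∫ f(x)^2 dx = E_f[f(X)], sampling X ~ f (inverse CDF: X = sqrt(U))
# and averaging f(X) estimates I(f) without bias -- when f is known.
rng = random.Random(0)
n = 200_000
I_hat = sum(2.0 * math.sqrt(rng.random()) for _ in range(n)) / n
print(I_hat)  # close to 4/3
```

The statistical difficulty in the talk is precisely that f is unknown, so w(f(x)) must itself be estimated from the samples.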


Smooth Functional

- quadratic functional: elbow effect

Theorem (Bickel–Ritov'88)

  \inf_{\hat{I}} \sup_{f \in H^s_d} E_f |\hat{I} - I(f)| \asymp n^{-\frac{4s}{4s+d}} + n^{-\frac{1}{2}}

- cubic functional: same result with a much more involved estimator construction (Kerkyacharian–Picard'96, Tchetgen et al.'08)
- smooth functional: reduce to linear, quadratic and cubic via Taylor expansion (Mukherjee–Newey–Robins'17)
- almost nothing is known for nonsmooth functionals

10 / 42


Differential Entropy Estimation

Kernel-based methods:

- Joe'89
- Györfi–van der Meulen'91
- Hall–Morton'93
- Paninski–Yajima'08
- Kandasamy et al.'15
- ...

Nearest neighbor methods:

- Tsybakov–van der Meulen'96
- Sricharan–Raich–Hero'12
- Singh–Póczos'16
- Berrett–Samworth–Yuan'16
- Delattre–Fournier'17
- Gao–Oh–Viswanath'17
- ...

11 / 42
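The nearest-neighbor line above traces back to the classical Kozachenko–Leonenko estimator. A minimal 1-D sketch of that estimator (our implementation and test density, not taken from the talk), using the common approximate form ĥ = (1/n) Σ log ρ_i + log 2 + log(n - 1) + γ with ρ_i the distance from X_i to its nearest neighbor:

```python
import math
import random

# Sketch (ours) of the Kozachenko–Leonenko 1-nearest-neighbor entropy
# estimator in d = 1. gamma is the Euler–Mascheroni constant.
EULER_GAMMA = 0.5772156649015329

def kl_entropy_1d(xs):
    xs = sorted(xs)
    n = len(xs)
    rho = []
    for i, x in enumerate(xs):
        left = x - xs[i - 1] if i > 0 else float("inf")
        right = xs[i + 1] - x if i < n - 1 else float("inf")
        rho.append(min(left, right))  # nearest-neighbor distance of X_i
    return (sum(math.log(r) for r in rho) / n
            + math.log(2.0) + math.log(n - 1) + EULER_GAMMA)

rng = random.Random(0)
xs = [rng.random() for _ in range(50_000)]  # Uniform[0, 1]: true h(f) = 0
print(kl_entropy_1d(xs))  # should be near 0
```

Sorting gives the 1-D nearest neighbors in O(n log n); in higher dimensions one would use a k-d tree instead.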


Differential Entropy Estimation (Cont'd)

Drawbacks of previous works:

- extra assumption: the density f is lower bounded by a positive universal constant, e.g., f(x) ≥ 0.01 everywhere
- only prove consistency
- no new lower bound beyond the quadratic case

12 / 42



Main Result

Theorem. For any d and p ∈ [2, ∞), s ∈ (0, 2], we have

  \inf_{\hat{h}} \sup_{f \in \mathrm{Lip}^s_{p,d}} E_f |\hat{h} - h(f)| \asymp (n \log n)^{-\frac{s}{s+d}} + n^{-\frac{1}{2}}

Significance:

- first exact expression for the minimax rate, including the sharp exponent and the exact logarithmic factor
- the parametric rate n^{-1/2} requires s ≥ d
- does not use any extra assumption (e.g., boundedness of f)
- improves the best known lower bound

14 / 42
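The two terms of the rate trade off at s = d, which is where the "parametric rate requires s ≥ d" bullet comes from. A small illustration (ours) comparing the terms numerically:

```python
import math

# Compare the two terms of the minimax rate
#   (n log n)^(-s/(s+d))  vs  n^(-1/2).
# For s < d the nonparametric term decays more slowly and dominates the rate;
# for s >= d the parametric term n^(-1/2) dominates, so the rate is parametric.
def rate_terms(n, s, d):
    nonparam = (n * math.log(n)) ** (-s / (s + d))
    param = n ** -0.5
    return nonparam, param

n = 10**6
print(rate_terms(n, s=0.5, d=1))  # s < d: nonparametric term is the larger one
print(rate_terms(n, s=2.0, d=1))  # s > d: parametric term is the larger one
```

The crossover follows from comparing exponents: s/(s + d) ≥ 1/2 exactly when s ≥ d.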


Idea: Two-stage Approximation

Recall

  h(f) = \int_{[0,1]^d} -f(x) \log f(x) \, dx

- can estimate -f(x) log f(x) for every x and then integrate
- involves both the function f(x) and the functional y ↦ -y log y
- two-stage approximation: first approximate the function, then approximate the functional

15 / 42


First Stage

How to estimate f(x) at a given x?

- no unbiased estimator exists for f(x) itself...
- first-stage approximation: consider f_h = f * K_h instead, where K_h is some kernel with bandwidth h

Example. When K_h(x) = (1/2h) 1_{[-h,h]}(x), we have

  f_h(x) = \frac{1}{2h} \int_{x-h}^{x+h} f(y) \, dy

16 / 42
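A quick numeric check of the boxcar example, on a test density of our own (f(x) = 2x, not from the slides): for a linear density, the local average f_h equals f exactly at interior points, so smoothing loses nothing there.

```python
# Sketch (our example): the boxcar-smoothed density
#   f_h(x) = (1/2h) * ∫_{x-h}^{x+h} f(y) dy
# computed by a fine midpoint rule. For the linear density f(x) = 2x and an
# interior point, the local average reproduces f exactly: f_h(0.5) = f(0.5) = 1.
def f(y):
    return 2.0 * y

def f_h(x, h, n_bins=10_000):
    dx = 2.0 * h / n_bins
    total = sum(f(x - h + (i + 0.5) * dx) for i in range(n_bins)) * dx
    return total / (2.0 * h)

print(f_h(0.5, h=0.1))  # 1.0 (the midpoint rule is exact for linear integrands)
```

For non-linear f the same computation exhibits the O(h^s) smoothing bias ‖f_h - f‖ mentioned on the next slide.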


First Stage (Cont'd)

Advantages of f_h:

- close to f for small bandwidth: ‖f_h - f‖_p ≲ h^s
- admits an unbiased estimator:

  E\left[\frac{1}{n} \sum_{i=1}^n K_h(x - X_i)\right] = \frac{1}{n} \sum_{i=1}^n E[K_h(x - X_i)] = \int K_h(x - y) f(y) \, dy = f * K_h(x)

First-stage approximation: estimate h(f_h) instead of h(f)!

17 / 42
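The unbiasedness identity above is easy to check by simulation. A sketch with our own ingredients (density f(x) = 2x sampled by inverse CDF, boxcar kernel); since f is linear, the target value is f_h(0.5) = f(0.5) = 1:

```python
import math
import random

# Monte Carlo check (our example) that (1/n) Σ K_h(x - X_i) estimates
# f_h(x) = f * K_h(x) without bias. Density: f(x) = 2x on [0, 1], sampled
# via the inverse CDF X = sqrt(U). Kernel: boxcar with bandwidth h.
def K_h(u, h):
    return 1.0 / (2.0 * h) if abs(u) <= h else 0.0

rng = random.Random(0)
n, x, h = 200_000, 0.5, 0.1
xs = [math.sqrt(rng.random()) for _ in range(n)]
fh_hat = sum(K_h(x - xi, h) for xi in xs) / n
print(fh_hat)  # close to f_h(0.5) = 1.0
```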


Second Stage

There exists an unbiased estimator of f_h(x)^k for any k = 1, 2, ..., n:

  E[K_h(x - X_1) K_h(x - X_2) \cdots K_h(x - X_k)]
  = \int \cdots \int K_h(x - y_1) \cdots K_h(x - y_k) f(y_1) \cdots f(y_k) \, dy_1 \cdots dy_k
  = \left( \int K_h(x - y) f(y) \, dy \right)^k = f_h(x)^k

U-statistics:

  U_k(x) = \binom{n}{k}^{-1} \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} \prod_{j=1}^k K_h(x - X_{i_j})

- efficiently computable via Newton's identities

18 / 42
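Up to the binomial factor, U_k(x) is the elementary symmetric polynomial e_k of the values a_i = K_h(x - X_i), so Newton's identities recover all of U_1, ..., U_K from power sums in O(nK + K^2) time instead of enumerating subsets. A sketch of that computation (our implementation; the slides only name the trick), verified against brute force:

```python
import math
from itertools import combinations

def u_stats_newton(a, K):
    """U_k = e_k(a) / C(n, k) for k = 1..K, via Newton's identities:
       k * e_k = sum_{j=1}^{k} (-1)^(j-1) * e_{k-j} * p_j, with p_j = sum_i a_i^j."""
    n = len(a)
    p = [sum(x ** j for x in a) for j in range(K + 1)]  # power sums; p[0] unused
    e = [1.0] + [0.0] * K                               # elementary symmetric polys
    for k in range(1, K + 1):
        e[k] = sum((-1) ** (j - 1) * e[k - j] * p[j] for j in range(1, k + 1)) / k
    return [e[k] / math.comb(n, k) for k in range(1, K + 1)]

def u_stats_brute(a, K):
    """Direct enumeration over all size-k subsets, for verification only."""
    n = len(a)
    return [sum(math.prod(c) for c in combinations(a, k)) / math.comb(n, k)
            for k in range(1, K + 1)]

a = [0.3, 1.7, 0.0, 2.5, 0.9]  # e.g., kernel values K_h(x - X_i)
print(u_stats_newton(a, 3))
print(u_stats_brute(a, 3))  # matches the Newton recursion
```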

Second Stage (Cont'd)

How to estimate −f_h(x) log f_h(x) at a given x?

▶ no unbiased estimator either...
▶ but we have unbiased estimators for all polynomials of f_h(x)!

Second-stage approximation

Write the objective functional as

  −f_h(x) log f_h(x) ≈ ∑_{k=0}^K a_k f_h(x)^k;

then Ĥ(x) = ∑_{k=0}^K a_k U_k(x) is an unbiased estimator of the polynomial approximation.

19 / 42
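The coefficients a_k can come from any near-best polynomial approximation of y ↦ −y log y; the slides do not prescribe a construction, so as an illustration the sketch below uses a Chebyshev interpolant on [0, 1] and checks that its uniform error shrinks as the degree K grows.

```python
import numpy as np
from numpy.polynomial import Polynomial
from numpy.polynomial.chebyshev import Chebyshev

def neg_ylogy(y):
    # continuous extension of -y*log(y), with value 0 at y = 0
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    pos = y > 0
    out[pos] = -y[pos] * np.log(y[pos])
    return out

def approx(K):
    # degree-K Chebyshev interpolant of -y log y on [0, 1]
    return Chebyshev.interpolate(neg_ylogy, K, domain=[0.0, 1.0])

grid = np.linspace(0.0, 1.0, 10001)
err = lambda K: np.max(np.abs(approx(K)(grid) - neg_ylogy(grid)))

# the log singularity at 0 limits the rate to roughly 1/K^2,
# so a higher degree should still help:
assert err(32) < err(8)
assert err(32) < 1e-2

# coefficients a_k in the monomial basis, for use in sum_k a_k * U_k(x)
a = approx(8).convert(kind=Polynomial).coef
```

Only the coefficients `a` are needed downstream; the interpolant itself never has to be evaluated at data points, since each monomial f_h(x)^k is estimated unbiasedly by U_k(x).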

Estimator Construction

▶ choose a suitable kernel K_h with bandwidth h, and set f_h ≜ f ∗ K_h
▶ for every x, aim to estimate −f_h(x) log f_h(x):
  1. if f_h(x) is small, apply the previous unbiased estimator of the polynomial approximation of y ↦ −y log y
  2. if f_h(x) is large, just plug in the estimate −f̂_h(x) log f̂_h(x)
▶ integrate the pointwise estimates

20 / 42

Error Decomposition

  E_f|ĥ − h(f)| ≤ |h(f) − h(f_h)| + E_f|ĥ − h(f_h)|
               ≤ |h(f) − h(f_h)| + |E_f ĥ − h(f_h)| + √(Var_f(ĥ))
               = approx. error + bias + std
               ≲ h^s + (log n)/(n h^d K²) + 2^K/(n √(h^d))

Choosing h ≍ (n log n)^{−1/(s+d)} and K ≍ log n completes the proof.

22 / 42
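A quick numerical sanity check of the bandwidth choice (our sketch, using the two leading terms as displayed above): with h = (n log n)^{−1/(s+d)} and K = log n, the approximation error h^s and the bias term (log n)/(n h^d K²) balance exactly, which is what drives the (n log n)^{−s/(s+d)} rate.

```python
import math

def terms(n, s, d):
    """approximation error h^s and bias (log n)/(n h^d K^2)
    at the prescribed choices h = (n log n)^(-1/(s+d)), K = log n."""
    h = (n * math.log(n)) ** (-1.0 / (s + d))
    K = math.log(n)
    approx_err = h ** s
    bias = math.log(n) / (n * h ** d * K ** 2)
    return approx_err, bias

# the two terms coincide for any (n, s, d), up to floating-point rounding
for (n, s, d) in [(10**4, 1.0, 1.0), (10**6, 2.0, 3.0), (10**5, 0.5, 2.0)]:
    a, b = terms(n, s, d)
    assert abs(a / b - 1.0) < 1e-9
```

Algebraically, both terms equal (n log n)^{−s/(s+d)} at these choices, so the loop is just confirming the balancing act behind the stated rate.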

Key Lemma in Bounding |h(f_h) − h(f)|

Inequality of Fisher Information

Let f ∈ C²(ℝ) be supported on [0, 1], with f ≥ 0 everywhere. The following inequality holds:

  J(f) ≜ ∫ (f′)²/f ≤ C_p ‖f″‖_p,

where 1 < p ≤ ∞.

23 / 42

Proof of Key Lemma

Non-negativity of f:

  0 ≤ f(x + h) ≤ f(x) + h f′(x) + h ∫_x^{x+h} |f″(y)| dy
  0 ≤ f(x − h) ≤ f(x) − h f′(x) + h ∫_{x−h}^x |f″(y)| dy

Rearranging:

  |f′(x)| ≤ inf_{h>0} [ 2f(x)/h + 2h · (1/(2h)) ∫_{x−h}^{x+h} |f″(y)| dy ]
          ≤ inf_{h>0} [ 2f(x)/h + 2h · sup_{r>0} (1/(2r)) ∫_{x−r}^{x+r} |f″(y)| dy ]
          = 2√(f(x) M[|f″|](x))

24 / 42

Maximal Function

Definition (Hardy–Littlewood Maximal Function)

For a non-negative function h, the Hardy–Littlewood maximal function M[h] is defined as

  M[h](x) ≜ sup_{r>0} (1/|B(x; r)|) ∫_{B(x;r)} h(y) dy.

Theorem (Hardy–Littlewood Maximal Inequality)

The following tail bound holds:

  Vol{x ∈ ℝ^d : M[h](x) > t} ≤ (C_d/t) ∫ h(x) dx.

Consequently, ‖M[h]‖_p ≤ C_p ‖h‖_p for any p ∈ (1, ∞].

25 / 42
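A discrete version of the maximal function is easy to compute in one dimension. The sketch below (our illustration, with an indicator function as the example h) checks two elementary properties: M[h] ≥ h pointwise (the zero-radius window recovers h at grid resolution) and sup M[h] = sup h (averages never exceed the supremum).

```python
import numpy as np

def maximal_1d(h, dx):
    """Discrete centered Hardy-Littlewood maximal function on a grid:
    M[h](x_i) = max over half-widths r of the average of h over the
    (boundary-clipped) window {j : |j - i| <= r}."""
    n = len(h)
    c = np.concatenate(([0.0], np.cumsum(h) * dx))  # running integral of h
    M = np.zeros(n)
    for i in range(n):
        r = np.arange(0, max(i, n - 1 - i) + 1)
        lo = np.clip(i - r, 0, n - 1)
        hi = np.clip(i + r, 0, n - 1)
        avg = (c[hi + 1] - c[lo]) / ((hi - lo + 1) * dx)
        M[i] = avg.max()
    return M

x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]
h = ((x >= 0.4) & (x <= 0.6)).astype(float)  # indicator of [0.4, 0.6]
M = maximal_1d(h, dx)

assert np.all(M >= h - 1e-12)          # the r = 0 window recovers h itself
assert abs(M.max() - h.max()) < 1e-12  # window averages never exceed sup h
```

For the indicator example, M decays like (distance to the support)^{−1} away from [0.4, 0.6], which is exactly why M[h] is integrable only in weak-L¹ and the strong bound needs p > 1.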

Proof of Key Lemma (Cont'd)

Recall

  |f′(x)| ≤ 2√(f(x) M[|f″|](x)).

Consequently,

  ∫ (f′)²/f ≤ 4‖M[|f″|]‖₁ ≤ 4‖M[|f″|]‖_p ≤ 4C_p‖f″‖_p.

26 / 42

Applications of Maximal Function

▶ Doob's martingale inequality
▶ Lebesgue differentiation theorem
▶ Birkhoff's pointwise ergodic theorem

27 / 42

Summary

▶ the two-stage approximation is optimal for differential entropy estimation
▶ polynomial-time estimator
▶ needs the parameters h, K to be tuned in practice
▶ requires knowledge of the smoothness s

28 / 42

Another View of Differential Entropy

  h(f) = ∫ −f(x) log f(x) dx
       = E_f[−log f(X)]
       ≈ (1/n) ∑_{i=1}^n −log f(X_i)
       ≈ (1/n) ∑_{i=1}^n −log f̂(X_i)

Question

How to find a good estimator f̂(X_i)?

30 / 42

Nearest Neighbor Estimator

Let h_i be the distance from X_i to its nearest neighbor; we set

  f̂(X_i) · Vol(B(X_i; h_i)) = 1/n.

Kozachenko–Leonenko (KL) Nearest Neighbor Estimator

  ĥ_KL = (1/n) ∑_{i=1}^n log[n · Vol(B(X_i; h_i))] + γ,

where γ ≈ 0.577 is Euler's constant.

31 / 42
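In one dimension, Vol(B(X_i; h_i)) = 2h_i and the nearest-neighbor distances come from sorted order statistics, so the estimator is a few lines of Python (our sketch; the sanity check uses Uniform[0, 1], whose differential entropy is 0).

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329

def kl_entropy_1d(x):
    """Kozachenko-Leonenko estimator in d = 1:
    h_KL = (1/n) * sum_i log(n * 2 * h_i) + gamma,
    where h_i is the distance from X_i to its nearest neighbor."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    gaps = np.diff(x)
    # nearest-neighbor distance: min of the gaps to the two neighbors
    h = np.empty(n)
    h[0], h[-1] = gaps[0], gaps[-1]
    h[1:-1] = np.minimum(gaps[:-1], gaps[1:])
    return np.mean(np.log(n * 2.0 * h)) + EULER_GAMMA

rng = np.random.default_rng(0)
est = kl_entropy_1d(rng.uniform(0.0, 1.0, 5000))
assert abs(est) < 0.1  # h(Uniform[0, 1]) = 0
```

Note there is no bandwidth or degree to choose here, which is exactly the adaptivity point of this part of the talk.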

Insights behind KL Estimator

Key Observation

For each i,

  ∫_{B(X_i; h_i)} f(y) dy ∼ Beta(1, n − 1).

Consequence

Define

  f_h(x) = (1/Vol(B(x; h))) ∫_{B(x;h)} f(y) dy;

then

  E_f[ĥ_KL] − h(f) = E_f[ log( f(X) / f_{h(X)}(X) ) ] + E log[n · Beta(1, n − 1)] + γ,

where the last two terms together are O(n^{−1}).

32 / 42
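The O(n^{−1}) claim has a closed form: if B ∼ Beta(1, n − 1), then E log B = ψ(1) − ψ(n) = −H_{n−1} (a standard digamma identity), so E log[n·B] + γ = log n − H_{n−1} + γ ≈ 1/(2n). A quick check (our sketch):

```python
import math

EULER_GAMMA = 0.5772156649015329

def bias_term(n):
    """E log[n * Beta(1, n-1)] + gamma = log n - H_{n-1} + gamma,
    using E log Beta(1, n-1) = psi(1) - psi(n) = -H_{n-1}."""
    H = sum(1.0 / k for k in range(1, n))  # harmonic number H_{n-1}
    return math.log(n) + EULER_GAMMA - H

# the term is positive and of order 1/(2n), hence O(1/n)
for n in [10**2, 10**3, 10**4]:
    assert 0.0 < bias_term(n) < 1.0 / n
```

So the entire bias of ĥ_KL reduces, up to an O(n^{−1}) correction, to how far the local average f_{h(X)}(X) is from f(X).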

Main Result

Theorem

For any d > 0 and s ∈ (0, 2], the KL estimator satisfies

  sup_{f ∈ H^s_d} E_f|ĥ_KL − h(f)| ≲ n^{−s/(s+d)} log n + n^{−1/2}

Significance

▶ optimal up to a logarithmic factor
▶ does not require extra assumptions (e.g., boundedness of f)
▶ adaptive to the smoothness s
▶ no parameters to tune

33 / 42

Error Analysis

I Variance of ĥ_KL: ☺
I Bias of ĥ_KL: it suffices to upper bound |E_f log( f(X) / f_{h(X)}(X) )|

1. Upper bound E_f log( f_{h(X)}(X) / f(X) ) = ∫ f(x) E log( f_{h(x)}(x) / f(x) ) dx: ☺
2. Upper bound E_f log( f(X) / f_{h(X)}(X) ) = ∫ f(x) E log( f(x) / f_{h(x)}(x) ) dx:
   I if f_{h(x)}(x) is large: ☺
   I if f_{h(x)}(x) is small: ☹

Question
For small ε > 0, find a good upper bound of

  E[ ∫ f(x) 1( f_{h(x)}(x) ≤ ε ) dx ].

35 / 42
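The quantity in the Question can be probed numerically. A toy sketch, assuming f_h(x) denotes the average of f over the ball B(x; h) and simplifying to a fixed bandwidth h; the density f(x) = 2x (which vanishes at the boundary, the troublesome regime) is an illustrative choice, not one from the slides:

```python
import numpy as np

grid = np.linspace(0, 1, 20001)
dx = grid[1] - grid[0]
f = lambda t: 2 * t                     # Beta(2,1) density: vanishes at x = 0

def ball_average(x, h):
    # average of f over B(x; h) ∩ [0, 1], normalized by the full ball
    # volume 2h, so mass falling outside the support counts as zero
    mask = np.abs(grid - x) <= h
    return f(grid[mask]).sum() * dx / (2 * h)

xs = grid[::20]
h = 0.01
fh = np.array([ball_average(x, h) for x in xs])

masses = {}
for eps in (0.05, 0.1, 0.2):
    # ∫ f(x) 1(f_h(x) ≤ ε) dx, a deterministic stand-in for the expectation
    masses[eps] = (f(xs) * (fh <= eps)).sum() * (xs[1] - xs[0])
    print(eps, masses[eps])
```

For this density the low-density region {f_h ≤ ε} is roughly [0, ε/2], so the captured mass behaves like ε²/4, comfortably of order ε; the maximal inequality below shows a bound of order ε holds for every density.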


Minimal Function

Definition (Minimal Function)
For a non-negative function f supported on [0, 1]^d, the minimal function m[f] is defined as

  m[f](x) = inf_{0 < r ≤ 1} (1 / Vol(B(x; r))) ∫_{B(x;r)} f(y) dy.

Observation

  E[ ∫ f(x) 1( f_{h(x)}(x) ≤ ε ) dx ] ≤ ∫ f(x) 1( m[f](x) ≤ ε ) dx

36 / 42
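The minimal function is easy to compute for a simple density. A sketch that discretizes the infimum over a grid of radii, again with the illustrative density f(x) = 2x (an assumption, not from the slides); it checks the property the Observation rests on, namely that m[f](x) never exceeds the ball average at any particular radius, in particular at r = h(x), and hence (letting r → 0) never exceeds f(x) itself:

```python
import numpy as np

grid = np.linspace(0, 1, 4001)
dx = grid[1] - grid[0]
f_vals = 2 * grid                       # density f(x) = 2x on [0, 1]
radii = np.geomspace(1e-3, 1.0, 60)     # discretize the infimum over r

def ball_average(x, r):
    # average of f over B(x; r) ∩ [0, 1], normalized by full ball volume 2r
    mask = np.abs(grid - x) <= r
    return f_vals[mask].sum() * dx / (2 * r)

def m_f(x):
    # minimal function: infimum of ball averages over 0 < r <= 1
    return min(ball_average(x, r) for r in radii)

xs = np.linspace(0.05, 0.95, 19)
m_vals = np.array([m_f(x) for x in xs])
# m[f](x) <= f(x) = 2x, up to a small discretization slack
assert np.all(m_vals <= 2 * xs * 1.02)
print(np.round(m_vals, 3))
```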


Generalized Maximal Inequality

Theorem (Generalized Maximal Inequality)
Let µ1, µ2 be two Borel measures on a metric space Ω ⊂ R^d. Then for any t > 0,

  µ2{ x ∈ Ω : sup_{r>0} µ1(B(x; r)) / µ2(B(x; r)) > t } ≤ (C_d / t) · µ1(Ω).

Corollary
Choosing µ1 = Lebesgue measure and dµ2/dµ1 = f, we have

  ∫ f(x) 1( m[f](x) ≤ ε ) dx ≤ µ2{ x ∈ [0, 1]^d : sup_{r>0} µ1(B(x; r)) / µ2(B(x; r)) > 1/ε } ≤ C_d · ε.

37 / 42
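The corollary's weak-type bound can be sanity-checked numerically. A sketch for d = 1 with the illustrative density f(x) = 2x (an assumption for demonstration; the constant C_d in the corollary is dimension-dependent, though in this toy case even C = 1 suffices empirically):

```python
import numpy as np

grid = np.linspace(0, 1, 4001)
dx = grid[1] - grid[0]
f_vals = 2 * grid                        # f(x) = 2x, so f vanishes at x = 0
radii = np.geomspace(1e-3, 1.0, 60)      # discretized infimum over r

def ball_average(x, r):
    mask = np.abs(grid - x) <= r
    return f_vals[mask].sum() * dx / (2 * r)

xs = grid[::8]
m_vals = np.array([min(ball_average(x, r) for r in radii) for x in xs])

results = {}
for eps in (0.05, 0.1, 0.2, 0.4):
    # ∫ f(x) 1(m[f](x) ≤ ε) dx, which the corollary bounds by C_d · ε
    results[eps] = ((2 * xs) * (m_vals <= eps)).sum() * (xs[1] - xs[0])
    print(eps, results[eps])
```

For this f the left-hand side scales like ε², well inside the O(ε) bound; the point of the theorem is that O(ε) holds uniformly over all densities, with no assumption that f is bounded away from zero.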



Dimensionality d

[Figure: estimation error (0.01–0.1, log scale) versus sample size n (100–1000, log scale) for k = 5, s = 2; one curve each for d = 1, d = 2, d = 4.]

39 / 42
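The shape of this experiment can be imitated in outline. A sketch assuming a standard Gaussian target and the textbook Kozachenko–Leonenko estimator (the slides use densities of prescribed smoothness s, which a Gaussian does not reproduce, so this only illustrates the qualitative decay in n and growth in d; restricted to d = 1, 2 for speed):

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.spatial import cKDTree

def kl_entropy(x, k=5):
    n, d = x.shape
    rho = cKDTree(x).query(x, k=k + 1)[0][:, k]
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(rho))

rng = np.random.default_rng(1)
errs = {}
for d in (1, 2):
    true_h = d / 2 * np.log(2 * np.pi * np.e)   # entropy of N(0, I_d)
    for n in (100, 1000):
        trials = [abs(kl_entropy(rng.standard_normal((n, d))) - true_h)
                  for _ in range(30)]
        errs[(d, n)] = np.mean(trials)
        print(d, n, errs[(d, n)])
```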

Smoothness s

[Figure: estimation error (0.01–0.1, log scale) versus sample size n (100–1000, log scale) for k = 5, d = 1; one curve each for s = 0.5, 1, 1.5, 2.]

40 / 42

Conclusion

Take-home message:

I two-stage approximation (first approximate the function, then approximate the functional) is optimal
I the nearest neighbor estimator is near-optimal and adaptive to the smoothness parameter
I the Hardy–Littlewood maximal inequality is crucial for handling densities close to zero

41 / 42

References

I Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu, "Optimal Rates of Entropy Estimation over Lipschitz Balls", arXiv preprint arXiv:1711.02141.
I Yanjun Han, Jiantao Jiao, Rajarshi Mukherjee, and Tsachy Weissman, "On Estimation of L_r-Norms in Gaussian White Noise Models", arXiv preprint arXiv:1710.03863.
I Jiantao Jiao, Weihao Gao, and Yanjun Han, "The Nearest Neighbor Information Estimator is Adaptively Near Minimax Rate-Optimal", arXiv preprint arXiv:1711.08824.

42 / 42