Computer Science Technical Report TR2010-934, October 2010
Courant Institute of Mathematical Sciences, New York University

    http://cs.nyu.edu/web/Research/TechReports/reports.html

An Empirical Bayesian interpretation and generalization of NL-means

Martin Raphan
Courant Inst. of Mathematical Sciences
New York University
[email protected]

Eero P. Simoncelli
Howard Hughes Medical Institute,
Center for Neural Science, and
Courant Inst. of Mathematical Sciences
New York University
[email protected]

    Abstract

A number of recent algorithms in signal and image processing are based on the empirical distribution of localized patches. Here, we develop a nonparametric empirical Bayesian estimator for recovering an image corrupted by additive Gaussian noise, based on fitting the density over image patches with a local exponential model. The resulting solution is in the form of an adaptively weighted average of the observed patch with the mean of a set of similar patches, and thus both justifies and generalizes the recently proposed nonlocal-means (NL-means) method for image denoising. Unlike NL-means, our estimator includes a dependency on the size of the patch similarity neighborhood, and we show that this neighborhood size can be chosen in such a way that the estimator converges to the optimal Bayes least squares estimator as the amount of data grows. We demonstrate the increase in performance of our method compared to NL-means on a set of simulated examples.

    1 Introduction

Local patches within a photographic image are often highly structured. Moreover, it is quite common to find many patches within the same image that have similar or even identical content. This fact has been exploited in developing new forms of signal representation [10], and in applications such as denoising [5], clustering [3], inpainting [4], and texture synthesis [15, 7, 8]. Most of these methods are based on assumptions (sometimes implicit) about the probability distribution over the intensity vectors associated with the patches. A recent example is a patch-based inference methodology, known as nonlocal means (NL-means), in which each pixel in a noisy image is replaced by a weighted average over other pixels in the image that occur in similar image patches [1, 2]. The weighting is determined by the similarity of the patches, as measured with a Gaussian kernel. This method is intuitively appealing, and gives good (although less than state-of-the-art) denoising performance on natural images. More recent variants have been developed that achieve state-of-the-art performance [5]. But as we show here, NL-means has the drawback that the estimate depends crucially on the Gaussian kernel width, and even in the limit of infinite data, there is no way to choose this width so as to guarantee convergence to the optimal answer.

In this paper, we develop an alternative patch-based denoising algorithm as an empirical Bayes (EB) estimator, and show that it generalizes and improves on the NL-means solution. We use an exponential density model for local patches of a noisy image, and derive a maximum likelihood method for estimating the parameters from data. We then combine this with a nonparametric EB form of the optimal least-squares estimator for additive Gaussian noise [13]. This simple but often overlooked form expresses the usual Bayes least squares estimator directly in terms of the density of noisy observations, without reference to the (clean) prior density. The resulting denoising method resembles NL-means, in that the denoised value is a function of the weighted average over other pixels in the same image that come from similar patches. But the function is nonlinear, and is scaled according to the size of the neighborhood over which intensities are averaged. As a result, we prove that the estimator converges to the Bayes least squares optimum as the amount of data increases, a feature not present in the standard NL-means approach.

    2 NL-means estimation

Suppose Y is a noise-corrupted measurement of X, where either or both variables can be finite-dimensional vectors. Given this measurement, the estimate of X that minimizes the expected squared error is the Bayes least squares (BLS) estimator, which is the expected value of X conditioned on Y, E{X|Y}. In particular, for an observation y0, the estimate is

\hat{x}_{BLS}(y_0) = E\{X \mid Y = y_0\}.
In the NL-means approach, a vector observation Y = y0 is denoised by averaging it together with patches that are "similar", i.e., that are in some neighborhood, Nh, of size h about the vector y0. As the number of data points available increases to infinity, the NL-means estimator will converge to

\hat{x}_{NL}(y_0) \to E\{Y \mid Y \in N_h\}.   (1)
Notice that this is a conditional expectation of Y, not X (as required for the BLS estimate). The original presentation of NL-means [1] justifies this by considering the case of estimating the center pixel, Xc, of the patch based on the surrounding pixels. In this case, and assuming zero-mean noise, we can write

E\{Y_c \mid Y_{-c} = y_{-c}\} = E\{E\{Y_c \mid X_c, Y_{-c} = y_{-c}\} \mid Y_{-c} = y_{-c}\} = E\{X_c \mid Y_{-c} = y_{-c}\},   (2)
where Y−c indicates the patch without the center pixel. Thus, by setting Nh to be a neighborhood which does not depend on the central coefficient of the vector, Eq. (1) will tend to the optimal estimator of the central coefficient as a function of the surrounding coefficients, Eq. (2).

In practice, NL-means is implemented to denoise the central coefficient in the patch, but includes the central coefficient in determining similarity of patches. Under these conditions, the solution does not, in general, converge to the correct BLS estimator, regardless of the size of the neighborhood. To see this, consider the simplified situation where the neighborhood is defined as

N_h = \{y : |y - y_0| \le h\}.   (3)
Clearly, x̂NL(y0) will depend on the value of h. For a fixed value of h, we end up with a conditional expectation of Y instead of X, as mentioned above. And, in the limit as h → 0, the estimate will converge to the identity, i.e. x̂NL(y0) → y0. Intuitively, this is because the mean of data within a neighborhood must also lie within that neighborhood. Thus, the best we can hope for is to find some binwidth, h, that works reasonably well asymptotically, but this choice will be highly dependent on the (unknown) signal prior.
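For concreteness, the following minimal sketch (in Python/NumPy; our own illustration, not the authors' implementation) computes the hard-neighborhood NL-means estimate of Eq. (1) with the neighborhood of Eq. (3). Because the output is an average of samples inside Nh, it cannot leave Nh, which is the geometric reason the estimate collapses to the identity as h → 0.

    import numpy as np

    def nl_means_hard(y0, samples, h):
        """NL-means estimate of Eq. (1) using the hard neighborhood of Eq. (3).

        y0      : query observation, shape (d,)
        samples : all noisy observations, shape (N, d)
        h       : neighborhood size (binwidth)
        """
        y0 = np.atleast_1d(np.asarray(y0, dtype=float))
        samples = np.atleast_2d(np.asarray(samples, dtype=float))
        in_nbhd = np.linalg.norm(samples - y0, axis=1) <= h
        if not np.any(in_nbhd):
            return y0                       # empty neighborhood: return the observation unchanged
        return samples[in_nbhd].mean(axis=0)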

This is illustrated for a simple scalar example in Fig. 1. The prior here is made up of delta functions which are well separated, relative to the standard deviation of the noise. With the right choice of binwidth (about twice the width of the noise distribution), for any amount of data, NL-means provides a good approximation to the optimal denoiser. But note that the NL-means behavior depends critically on the choice of binwidth: Fig. 1 also shows a result for a smaller binwidth, which is clearly suboptimal. In general, there is no simple way to choose an optimal binwidth, nor any way to choose one that makes the NL-means estimator converge to the BLS estimator.

    3 Patch-based empirical Bayes least squares estimator

We wish to develop a patch-based estimator that makes explicit the dependence on neighborhood size, h, and for which we can demonstrate proper convergence in the limit as h → 0. For the particular case of additive Gaussian noise, Miyasawa [13] developed a nonparametric empirical form of the Bayes least squares estimate that is written directly in terms of the measurement (noisy) density PY(Y), with no explicit reference to the prior:

E\{X \mid Y = y_0\} = y_0 + \Lambda \nabla_y \ln(P_Y(y_0)),   (4)
where Λ is the covariance matrix of the noise (assumed known). Note that this expression is not an approximation: the equality is exact, assuming the noise is additive and Gaussian (with zero mean and covariance Λ), and assuming the measurement density is known.
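Eq. (4) is easy to verify numerically in one dimension. The sketch below (a toy check under our own assumptions, not part of the paper) uses a Gaussian prior of variance s2, for which the BLS estimator and the logarithmic derivative of the noisy density are both available in closed form; the two sides of Eq. (4) agree to machine precision.

    import numpy as np

    s2, sigma2 = 4.0, 1.0                      # prior and noise variances (arbitrary choices)
    y = np.linspace(-5.0, 5.0, 11)             # a few observation values

    bls = (s2 / (s2 + sigma2)) * y             # left-hand side: E{X | Y = y} for this conjugate case
    dlogp = -y / (s2 + sigma2)                 # d/dy ln P_Y(y), with P_Y Gaussian of variance s2 + sigma2
    eb = y + sigma2 * dlogp                    # right-hand side of Eq. (4)

    print(np.max(np.abs(bls - eb)))            # ~1e-16: the identity is exact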

Fig. 1: Observed density and optimal estimator for a prior made up of three isolated delta functions. Top panel: observed density PY(y) (arrows represent prior). Middle panel: BLS estimator E{X|Y=y} (assuming true prior is known). Bottom panel: NL-means estimates E{Y | |Y − y| ≤ h}, calculated for two different binsizes (thick solid and dashed lines). Note also that the NL-means estimator approaches the identity function as the binsize goes to zero.

    3.1 Learning the estimator function from data

The formulation of Eq. (4) offers a means of computing the BLS estimator from noisy samples, if one can construct an approximation of the noisy distribution, PY(Y). Simple histograms will not suffice for this approximation, because Eq. (4) requires us to compute the logarithmic derivative of the estimated density. Instead, we use an exponential model for the local density centered on an observation Yk = y0 within some neighborhood, Nh of size h, centered on y0:

P_{Y \mid Y \in N_h}(y) = \frac{e^{a \cdot (y - y_0)}}{Z(a, h)} \, 1_{N_h}(y),   (5)

where 1_{N_h} denotes the indicator function of Nh, and Z(a, h) normalizes the density over the interval:
Z(a, h) = \int_{N_h} e^{a \cdot (y - y_0)} \, dy.   (6)

This type of local exponential approximation is similar to that used in [11] for density estimation. The shape of the neighborhood, Nh, is unspecified, but the linear size should scale with h. For example, it could be a spherical ball of radius h around y0, or the hypercube centered on y0 of length 2h on each side. In any case, the logarithmic gradient of this density evaluated at y0 is simply a.

We can then fit the parameter a by maximizing the likelihood of the conditional density over the neighborhood:
\hat{a} = \arg\max_a \sum_{\{n : Y_n \in N_h\}} \ln\left(P_{Y \mid Y \in N_h}(Y_n)\right).   (7)

    Substituting Eq. (5), taking the gradient of the sum, and setting it equal to zero gives

\nabla_a \ln(Z(\hat{a}, h)) = \bar{Y} - y_0,   (8)
where Ȳ is the average of the data that lie in Nh. Note that substituting Eq. (6) in for Z(â, h) and adding y0 to both sides gives
E\{Y \mid Y \in N_h\} = \bar{Y}.
Thus, the ML solution for â matches the local mean of the fitted density to the local empirical mean.
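As a quick numerical check of Eqs. (5)-(8) (our own illustration, using a one-dimensional interval neighborhood [y0 − h, y0 + h] and synthetic data), the likelihood of Eq. (7) can be maximized directly and the fitted model's local mean compared with the empirical local mean:

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    y0, h = 0.0, 1.0
    nbhd = y0 + h * rng.beta(3.0, 2.0, size=500)      # synthetic samples lying inside [y0 - h, y0 + h]

    def log_Z(a):
        # Z(a, h) of Eq. (6) for the interval [y0 - h, y0 + h].
        return np.log(2.0 * np.sinh(a * h) / a) if abs(a) > 1e-8 else np.log(2.0 * h)

    def neg_log_lik(a):
        # Negative of the objective in Eq. (7) for the local exponential model of Eq. (5).
        return -np.sum(a * (nbhd - y0) - log_Z(a))

    a_hat = minimize_scalar(neg_log_lik).x

    # Eq. (8): the fitted model's mean over N_h equals the empirical local mean.
    model_mean = y0 + (h / np.tanh(a_hat * h) - 1.0 / a_hat)
    print(model_mean, nbhd.mean())                    # these agree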

Equation (8) provides an implicit definition of â which involves the function Z(a, h). Noting that Z scales according to Z(a, h) = h^d Z(ha, 1), we can solve explicitly for â:

\hat{a} = \frac{1}{h} F^{-1}\!\left(\frac{\bar{Y} - y_0}{h}\right),   (9)

where
F(a) = \nabla_a \ln(Z(a, 1))   (10)
does not depend on h.

    Finally, replacing ∇y ln(PY (y)) in Eq. (4) by â gives the empirical Bayes estimator:

\hat{x}_{EB}(y_0) = y_0 + \Lambda \frac{1}{h} F^{-1}\!\left(\frac{\bar{Y} - y_0}{h}\right).   (11)

Note that it may be difficult to calculate F−1 for a general neighborhood. We will return to this issue in Section 3.2.
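As a concrete instance, if Nh is the hypercube of side length 2h mentioned above, Z factorizes across coordinates, and F of Eq. (10) reduces to the coordinatewise Langevin function, F(a)_i = coth(a_i) − 1/a_i, which has no closed-form inverse but is monotone and easy to invert numerically. The sketch below (our own illustration, assuming this hypercube neighborhood) does so by bisection; it gives the â of Eq. (9) as langevin_inv((Ȳ − y0)/h) / h.

    import numpy as np

    def langevin(a):
        """F(a) of Eq. (10) for a hypercube neighborhood of half-width 1: coth(a) - 1/a, elementwise."""
        a = np.asarray(a, dtype=float)
        small = np.abs(a) < 1e-6
        safe = np.where(small, 1.0, a)              # avoid 0/0 in the general branch
        return np.where(small, a / 3.0, 1.0 / np.tanh(safe) - 1.0 / safe)

    def langevin_inv(t, iters=60):
        """Invert F by bisection; each coordinate of F is monotone from R onto (-1, 1).

        Arguments |t| extremely close to 1 are clamped by the search interval [-50, 50].
        """
        t = np.asarray(t, dtype=float)
        lo, hi = np.full_like(t, -50.0), np.full_like(t, 50.0)
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            go_up = langevin(mid) < t
            lo, hi = np.where(go_up, mid, lo), np.where(go_up, hi, mid)
        return 0.5 * (lo + hi)

    # Eq. (9) for the hypercube case (local_mean and y0 are d-vectors, h a scalar):
    #   a_hat = langevin_inv((local_mean - y0) / h) / h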

    3.2 Convergence with proper choice of binwidth

The empirical Bayes estimator of Eq. (11) depends on the choice of binwidth, h, which not only appears explicitly in the expression, but also implicitly affects the choice of which data vectors are included in calculating the local mean, Ȳ. This binwidth can vary for each data point, but in this paper, we will consider only the simpler case of a constant binwidth for all data. In order for our EB estimator to converge to the BLS estimator, we need â, as defined in Eq. (9), to converge to the logarithmic derivative of the true density, and this means the binwidth must shrink to zero as the total amount of data grows. The rate of shrinkage should be set in such a way that the amount of data within each neighborhood goes to infinity. In this subsection, we calculate the rate at which the binwidths should shrink, in order to ensure convergence.

First, we note that the factor of 1/h in Eq. (9) implies that, for â to converge at all, it is necessary that
F^{-1}\!\left(\frac{\bar{Y} - y_0}{h}\right) \to 0.   (12)

Since it can be shown that 0 is the only zero of F, we need only show that
\frac{\bar{Y} - y_0}{h} \to 0.   (13)
Looking again at Eq. (9), and considering a Taylor approximation of F about zero, we see that for â to converge, it is further necessary that (Ȳ − y0)/h² have some finite limit,
\frac{\bar{Y} - y_0}{h^2} \to v,   (14)
for some value v. In this case, a bit of calculation shows that the Taylor approximation of Eq. (9) may be written
\hat{a} \to V_1 C_1^{-1} v,
where C1/V1 is the Jacobian matrix of F evaluated at zero, with V1 denoting the volume of the neighborhood N1 (Nh with size h = 1), and

C_1 = \int_{N_1} (\tilde{y} - y_0)(\tilde{y} - y_0)^T \, d\tilde{y}.

Finally, combining the Taylor approximation with the definition of â tells us that our EB estimator converges to the optimal BLS estimator if and only if we can find a choice of binwidth so that
\frac{1}{h^2} V_1 C_1^{-1} (\bar{Y} - y_0) \to \nabla \ln(P_Y(y_0)).   (15)

We now discuss how the binwidth should shrink as the amount of data increases in order to ensure this condition is met. First, we decompose the asymptotic mean squared error into variance and squared bias terms:

E\left\{\left(\frac{\bar{Y} - y_0}{h^2} - \frac{C_1}{V_1}\nabla \ln(P_Y(y_0))\right)^{2}\right\}
= \mathrm{Var}\left\{\frac{\bar{Y} - y_0}{h^2}\right\} + \left(E\left\{\frac{\bar{Y} - y_0}{h^2}\right\} - \frac{C_1}{V_1}\nabla \ln(P_Y(y_0))\right)^{2}.   (16)

We expect (and will show) that the variance term should decrease as h grows (since more data will fall into each bin), whereas the squared bias should increase as the neighborhood grows. This is illustrated in Fig. 2. The trick is to find a way to shrink the binwidths as the amount of data increases, so that the bias goes to zero, but slowly enough to ensure that the variance still goes to zero as well.

For large amounts of data, we expect h to be small, and so we may use small-h approximations for the bias and variance. Since the local average, Ȳ, is calculated using only data which fall in Nh, we may write the expected value as
E\{\bar{Y}\} = E\{Y \mid Y \in N_h\} = \frac{\int_{N_h} \tilde{y} \, P_Y(\tilde{y}) \, d\tilde{y}}{\int_{N_h} P_Y(\tilde{y}) \, d\tilde{y}}.   (17)

To get an idea of how this behaves asymptotically, we can take the local Taylor approximation of the density and substitute into Eq. (17). Since the neighborhoods are centered on y0, the integrals of terms with odd degree will vanish, leaving
E\{\bar{Y}\} - y_0 \approx \frac{C_h \nabla P_Y(y_0) + O(h^{d+4})}{P_Y(y_0) V_h + O(h^{d+2})},   (18)

where Vh denotes the volume of the neighborhood and
C_h = \int_{N_h} (\tilde{y} - y_0)(\tilde{y} - y_0)^T \, d\tilde{y}.

For symmetric neighborhoods, Ch will be a diagonal matrix. If Nh has the same shape for all h, just rescaled by the factor h, a simple change of variables shows that
V_h = h^d V_1 \quad \text{and} \quad C_h = h^{d+2} C_1,
so that
E\left\{\frac{\bar{Y} - y_0}{h^2}\right\} \approx \frac{C_1}{V_1}\nabla \ln(P_Y(y_0)) + O(h^2),   (19)
and hence the bias term in Eq. (16) will shrink to zero as long as the binwidth shrinks to zero.

Now consider the variance term in Eq. (16). We need to choose binwidths that shrink to zero slowly enough to allow
\mathrm{Var}\left\{\frac{\bar{Y} - y_0}{h^2}\right\} \to 0.

First note that
E\{|Y - y_0|^2 \mid Y \in N_h\} = O(h^2).   (20)
Combining this with Eq. (19) gives
\mathrm{Var}\left(\frac{\bar{Y} - y_0}{h^2}\right) = \frac{1}{n h^4} \mathrm{Var}(Y - y_0 \mid Y \in N_h) = O\!\left(\frac{1}{n h^2}\right) + O\!\left(\frac{1}{n}\right),   (21)

where n is the number of data points which fall in the bin. Thus, we see that as long as
n h^2 \to \infty,   (22)
the variance of our approximation will also tend to zero. If there are N total data points, we can approximate the number falling in Nh as
n \approx P_Y(y_0) \, N \, V_1 \, h^d,   (23)
and inserting this into Eq. (22) gives
N h^{d+2} \to \infty, \quad h \to 0.   (24)
For example, we might use
h \propto N^{-\frac{1}{m+d+2}}, \quad m > 0.   (25)
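This rate can be checked with a small Monte Carlo experiment. The sketch below (our own illustration, using an interval neighborhood in d = 1 and a standard normal playing the role of the noisy density PY) scales h according to Eq. (25) with m = 2, and shows the scaled local mean shift of Eq. (15) approaching the true logarithmic derivative as N grows.

    import numpy as np

    rng = np.random.default_rng(0)
    y0, d, m = 0.5, 1, 2                        # query point, dimension, rate parameter of Eq. (25)
    V1, C1 = 2.0, 2.0 / 3.0                     # volume and second moment of N_1 = [-1, 1]

    for N in [10**3, 10**4, 10**5, 10**6]:
        h = N ** (-1.0 / (m + d + 2))           # binwidth rule of Eq. (25)
        y = rng.standard_normal(N)              # samples whose density P_Y is known: N(0, 1)
        nbhd = y[np.abs(y - y0) <= h]
        grad_est = (V1 / C1) * (nbhd.mean() - y0) / h**2   # left side of Eq. (15)
        print(N, grad_est)                      # approaches d/dy ln P_Y(y0) = -y0 = -0.5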

Fig. 2: Bias-variance tradeoff, as a function of binwidth, h. Solid blue line indicates squared bias. Dashed lines indicate variance, for different values of N (number of data points). Solid black line (with black points) shows the variance resulting when h is chosen as a function of N, as indicated in Eq. (25) with m = 2. Under these conditions, the variance falls to zero as h shrinks.

The variance resulting from this choice (for m = 2) is illustrated in Fig. 2, and can be seen to approach zero as h decreases. With this choice of binwidth, it now follows that the estimator in Eq. (11) converges asymptotically to the optimal BLS solution. Note that a consequence of our asymptotic analysis is that, when the estimator in Eq. (11) converges, we can use the Taylor series approximation of the estimator, so that for the same choice of binwidths
\hat{x}_{EB}(y_0) \approx y_0 + \Lambda V_1 C_1^{-1} \left(\frac{\bar{Y} - y_0}{h^2}\right)   (26)

will also converge asymptotically to the BLS estimator. In the case of spherical neighborhoods we get an estimate of the logarithmic derivative very similar to that derived for the mean-shift method of clustering [3]. Note, though, that the derivation used for that approach uses the derivative of the Epanechnikov kernel, which would only seem to work for spherical neighborhoods. Our method, on the other hand, works for general neighborhoods. Note also that mean-shift is meant to be used as an iterative gradient ascent, moving data towards peaks in the density. For our purposes, however, we have shown that when the logarithmic gradient is scaled by the noise variance, we achieve BLS denoising in one step.
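In practice, the Taylor-approximated estimator of Eq. (26) needs only the local mean, the constants V1 and C1 of the chosen neighborhood shape, and the noise covariance. The following is a minimal sketch (our own, assuming a hypercube neighborhood and a single fixed binwidth; it is not the authors' released implementation):

    import numpy as np

    def eb_denoise(y0, samples, noise_cov, h):
        """Approximate empirical Bayes estimate of Eq. (26) with a hypercube neighborhood.

        y0        : noisy observation to denoise, shape (d,)
        samples   : all noisy observations, shape (N, d)
        noise_cov : noise covariance Lambda, shape (d, d)
        h         : binwidth (half-width of the hypercube)
        """
        y0 = np.asarray(y0, dtype=float)
        samples = np.atleast_2d(np.asarray(samples, dtype=float))

        # For the unit hypercube N_1 = [-1, 1]^d: V1 = 2^d and C1 = (2^d / 3) I, so V1 C1^{-1} = 3 I.
        in_nbhd = np.all(np.abs(samples - y0) <= h, axis=1)
        if not np.any(in_nbhd):
            return y0                           # no neighbors: leave the observation unchanged
        local_mean = samples[in_nbhd].mean(axis=0)

        grad_logp = 3.0 * (local_mean - y0) / h**2           # Eq. (15) estimate of grad ln P_Y(y0)
        return y0 + noise_cov @ grad_logp                    # Eq. (26)

A fixed binwidth can be set from the total number of samples via Eq. (25), e.g. h proportional to N^(-1/(m+d+2)) for some constant; Section 5 discusses adapting it per patch instead.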

    4 Empirical Bayes versus NL means

We have now introduced both the empirical Bayes method for approximating the BLS estimator, Eq. (11), and the NL-means estimator, Eq. (1). For pedagogical reasons, we will compare and contrast the two using low-dimensional examples which allow for both easier visualization and easier computation. As discussed in the last section, we expect our EB estimator to converge to the ideal solution as the amount of data goes to infinity. NL-means, on the other hand, does not have this asymptotic property in general, since if the binwidths were to shrink, the NL-means method would do no denoising. Therefore, we expect the optimal binwidth to asymptote to some finite, nonzero value that depends heavily on both the prior and the amount of noise. For this reason, this method may be able to give us improved performance for low to moderate amounts of data, but will in general be asymptotically biased.

To illustrate the convergence behavior as a function of the amount of data, we have simulated a scalar estimation problem. Signal values are drawn from a generalized Gaussian density, p(x) ∝ exp(−|x|^β), with exponent β = 0.5. For each N, we draw N samples from this distribution, and add Gaussian white noise to each. The noise was chosen so that the noisy Signal-to-Noise Ratio (SNR) (10 log10 of signal variance divided by squared error) was 7.8 dB. We then apply our estimator, with binwidth chosen according to Eq. (25). As a reference, we also show the performance of the ideal BLS estimator (which knows the true prior). We also show the result obtained using NL-means (Eq. (1)), with binwidth chosen to optimize performance (i.e., by comparing the denoising result to the clean signal). For each N, this simulation is performed 1000 times, so as to obtain estimates of the average performance, as well as the variability. The SNR performance is plotted, as a function of N, in Fig. 3.

Fig. 3: Performance of the Empirical Bayes method compared with NL-means, and the optimal BLS estimator, for (a) a generalized Gaussian prior; (b) the sum of two generalized Gaussians. Both panels plot SNR improvement (dB) against the number of samples. Lines (solid, dashed, and dotted) indicate mean performance of each estimator over 1000 simulations. Gray regions (for EB and NL-means) indicate plus/minus one standard deviation. See text.

Fig. 4: Behavior of the EB method compared with NL-means, for two-dimensional data, with three different neighborhood choices (plotted points: True, Observation, NL means, EB). Prior is a delta at the origin (indicated by cross). Only a subsample of the data points are shown, for visual clarity. Boxes indicate neighborhoods. Note that in the third example, the neighborhood extends beyond the display. See text.

Note that the average performance of NL-means is reasonable for small N (although the variability is quite high), but saturates as N grows, and thus does not converge to the optimal performance of the BLS estimator. By comparison, our estimator starts out with a lower SNR at small N, but asymptotically converges (in both mean and variance) to the optimal solution. Figure 3(b) shows another example, for a signal drawn from a mixture of two generalized Gaussian variables, with mean values ±5. Here, we see a more substantial asymptotic bias for the NL-means estimator.
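The structure of this scalar experiment is easy to reproduce in outline. The sketch below is a simplified, self-contained version under our own assumptions (the generalized Gaussian prior is sampled via a gamma draw, the NL-means binwidth is simply fixed at twice the noise standard deviation rather than optimized against the clean signal, and the constant in the Eq. (25) rule is arbitrary); it indicates how the SNR comparison is set up rather than reproducing the exact numbers in Fig. 3.

    import numpy as np

    rng = np.random.default_rng(1)
    beta, N = 0.5, 10**4

    # Sample the generalized Gaussian prior p(x) ~ exp(-|x|^beta):
    # |x|^beta is Gamma(1/beta) distributed, so x = sign * w**(1/beta).
    w = rng.gamma(1.0 / beta, size=N)
    x = rng.choice([-1.0, 1.0], size=N) * w ** (1.0 / beta)

    sigma = np.sqrt(np.var(x) * 10 ** (-0.78))        # noise level giving a 7.8 dB input SNR
    y = x + sigma * rng.standard_normal(N)

    def local_means(y, h):
        # Mean of all samples within [y_i - h, y_i + h], for every i, via sorting.
        ys = np.sort(y)
        cum = np.concatenate([[0.0], np.cumsum(ys)])
        lo, hi = np.searchsorted(ys, y - h), np.searchsorted(ys, y + h)
        return (cum[hi] - cum[lo]) / (hi - lo)

    def snr_db(clean, est):
        return 10.0 * np.log10(np.var(clean) / np.mean((clean - est) ** 2))

    h_nl = 2.0 * sigma                                # fixed NL-means binwidth (not optimized here)
    h_eb = np.std(y) * N ** (-1.0 / 5)                # Eq. (25) with d = 1, m = 2

    x_nl = local_means(y, h_nl)                                          # Eq. (1), hard neighborhood
    x_eb = y + sigma**2 * 3.0 * (local_means(y, h_eb) - y) / h_eb**2     # Eq. (26), interval neighborhood

    print("noisy:", snr_db(x, y), "NL-means:", snr_db(x, x_nl), "EB:", snr_db(x, x_eb))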

Figure 4 shows a comparison of our method to NL-means in two dimensions. The signal samples in this example are drawn from a delta function at the origin. We see that the NL-means solution is highly dependent on the size and shape of the neighborhood. It is possible to achieve good results in this particular case of a delta function prior by using a very large neighborhood (third panel). But for the smaller neighborhoods, the NL-means estimate is restricted to lie within the neighborhood, and is thus heavily biased, while the EB method is not.

    5 Discussion

We have derived a patch-based method for signal denoising by assuming a local exponential density model for the noisy data. We have proven that, despite the fact that we do not include any assumption about the overall form of the signal prior, our estimator converges to the optimal (Bayes) least squares estimator as the amount of data increases, assuming the binsize is reduced appropriately. The estimator is a nonlinear function of the average of similar patches, and thus may be viewed as a generalization of the NL-means and mean-shift algorithms. The main drawback of the method (as with NL-means) is the high computational cost associated with gathering the data patches that are within an h-neighborhood of the patch being denoised. We believe that some of the methods that have been developed for accelerating NL-means [18, 14, 6] (or the other patch-based applications mentioned in the introduction) may be utilized in our estimator as well, and we are currently working on such an implementation so as to examine the empirical performance of our method on image denoising.

We envision a number of ways in which our method may be extended. First, the binsizes need not be uniform for all patches, but can be adapted according to the local sample density. Specifically, for each patch that is to be denoised, a rule such as that of Eq. (25) could be used to adjust the binsize so as to achieve the best bias/variance tradeoff (i.e., so as to best approximate the BLS estimator). This extension necessarily increases the computational cost (since the binsize must now be optimized for each patch) but may lead to substantial increases in performance. Second, the NL-means method computes a weighted average of patches, where the weighting is a function of both the patch similarity and the patch proximity within the signal/image (for example, bilateral filtering may be viewed as an NL-means method with one-pixel patches). We are exploring means by which our method can be generalized to include both of these forms of weighting. Third, EB estimators have been developed for observation processes other than additive Gaussian noise [17, 12, 9, 16], and each of these could be developed into a patch-based estimator (although most of these would not be written in terms of local means). And finally, we believe our local exponential density estimate will prove useful as a substrate for generalizing other patch-based empirical methods that have become popular in computer vision, graphics, and image processing.

    References

[1] A. Buades, B. Coll, and J. M. Morel. A review of image denoising algorithms, with a new one. Multiscale Modeling and Simulation, 4(2):490-530, July 2005.
[2] A. Buades, B. Coll, and J. M. Morel. Nonlocal image and movie denoising. Intl Journal of Computer Vision, 76(2):123-139, 2008.
[3] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Pat. Anal. Mach. Intell., 24:603-619, 2002.
[4] A. Criminisi, P. Perez, and K. Toyama. Region filling and object removal by exemplar-based inpainting. IEEE Trans Image Proc, 13:1200-1212, 2004.
[5] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3D transform-domain collaborative filtering. IEEE Trans. Image Proc., 16(8), August 2007.
[6] A. Dauwe, B. Goossens, H. Luong, and W. Philips. A fast non-local image denoising algorithm. In Proc SPIE Electronic Imaging, volume 6812, 2008.
[7] J. S. De Bonet. Multiresolution sampling procedure for analysis and synthesis of texture images. In Computer Graphics. ACM SIGGRAPH, 1997.
[8] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In Proc. Int'l Conference on Computer Vision, Corfu, 1999.
[9] Y. C. Eldar. Generalized SURE for exponential families: Applications to regularization. IEEE Trans. on Signal Processing, 57(2):471-481, Feb 2009.
[10] N. Jojic, B. Frey, and A. Kannan. Epitomic analysis of appearance and shape. In ICCV, 2003.
[11] C. R. Loader. Local likelihood density estimation. Annals of Statistics, 24(4):1602-1618, 1996.
[12] J. S. Maritz and T. Lwin. Empirical Bayes Methods. Chapman & Hall, 2nd edition, 1989.
[13] K. Miyasawa. An empirical Bayes estimator of the mean of a normal population. Bull. Inst. Internat. Statist., 38:181-188, 1961.
[14] J. Orchard, M. Ebrahimi, and A. Wong. Efficient nonlocal-means denoising using the SVD. In Proc IEEE Intl Conf on Image Processing, 2008.
[15] K. Popat and R. W. Picard. Cluster-based probability model and its application to image and texture processing. IEEE Trans Im Proc, 6(2):268-284, 1997.
[16] M. Raphan and E. P. Simoncelli. Least squares estimation without priors or supervision. Neural Computation, 2010. In Press.
[17] H. Robbins. An empirical Bayes approach to statistics. Proc. Third Berkeley Symposium on Mathematical Statistics, 1:157-163, 1956.
[18] J. Wang, Y. Guo, Y. Ying, Y. Liu, and Q. Peng. Fast non-local algorithm for image denoising. In Proc IEEE Intl Conf on Image Processing, 2006.