
Correlation With Errors-In-Variables, 3/28/2002

Correlation with Errors-In-Variables and an Application to Galaxies

William H. Jefferys

University of Texas at Austin, USA


A Problem in Correlation

• A graduate student at Maryland asked me to assist her with a problem involving galaxy data. She wanted to know whether the data showed clear evidence of correlation and, if so, what the correlation was and how strong the evidence for it was.


The Data


Comments on Data

• At first glance the correlation seems obvious.

• But there is an unusual feature of her problem: she knew that the data were imperfect, and for each data point had an error bar in both x and y. Standard treatments of correlation do not address this situation.


The Data


Comments on Data

• The presence of the error bars contributes to uncertainty as to how big the correlation is and how well it is determined.

• The data are sparse, so we also need to be concerned about small-number statistics.

• The student was also concerned about how the lowest point affected any correlation. What would happen, she wondered, if it were not included in the sample? [She was afraid that the ellipticity of that particular galaxy was so low that it might not have been measured accurately, and in fact that the galaxy might belong to a different class of galaxies.]


The Data


Bayesian Analysis and Astronomy

• She had been trying to use our program GaussFit to analyze the data, but it is not designed for tasks such as measuring correlation.

• I, of course, suggested to her that a Bayesian approach might be appropriate

• Bayesian methods offer many advantages for astronomical research and have attracted much recent interest.

• Astronomy and Astrophysics Abstracts lists 169 articles with the keywords ‘Bayes’ or ‘Bayesian’ in the past 5 years, and the number is increasing rapidly (there were 53 in 2000 alone, up from 33 in 1999).


Advantages of Bayesian Methods

• Bayesian methods allow us to do things that would be difficult or impossible with standard (frequentist) statistical analysis.

• It is simple to incorporate prior physical or statistical information

• Interpretation of results is very natural

• Model comparison is easy and straightforward. (Ours is such a problem.)

• It is a systematic way of approaching statistical problems, rather than a collection of ad hoc techniques. Very complex problems (difficult or impossible to handle classically) are straightforwardly analyzed within a Bayesian framework.


Bayesian Model Selection/Averaging

• Given models Mi, each of which depends on a vector of parameters θ, and given data Y, Bayes’ theorem tells us that

$$p(\theta, M_i \mid Y) \propto p(Y \mid \theta, M_i)\, p(\theta \mid M_i)\, p(M_i).$$

• The probabilities p(θ | Mi) and p(Mi) are the prior probabilities of the parameters given the model and of the model, respectively; p(Y | θ, Mi) is the likelihood function, and p(θ, Mi | Y) is the joint posterior probability distribution of the parameters and models, given the data.

• Note that some parameters may not appear in some models, and there is no requirement that the models be nested.
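As a worked step, and writing θ for the parameter vector of model Mi, the joint posterior can be integrated over the parameters to give the posterior probability of each model. The lines below are a standard consequence of Bayes’ theorem rather than something shown on the slides; the marginal likelihood p(Y | Mi) is notation introduced here.

```latex
\begin{align*}
p(M_i \mid Y) &\propto p(M_i) \int p(Y \mid \theta, M_i)\, p(\theta \mid M_i)\, d\theta
              \;=\; p(M_i)\, p(Y \mid M_i), \\[4pt]
\frac{p(M_1 \mid Y)}{p(M_0 \mid Y)}
  &= \frac{p(Y \mid M_1)}{p(Y \mid M_0)} \times \frac{p(M_1)}{p(M_0)} .
\end{align*}
```

With equal prior probabilities on two models, the posterior odds therefore equal the ratio of the marginal likelihoods (the Bayes factor).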


Strategy

• I do not see a simple frequentist approach to this student’s problem

• A reasonable Bayesian approach is fairly straightforward:

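• Assume that the underlying “true” (but unknown) galaxy parameters ξi and ηi (corresponding to the observed xi and yi) are distributed as a bivariate normal distribution

$$p(\xi_i, \eta_i \mid \rho, a, b, \sigma_\xi, \sigma_\eta) \propto \frac{1}{\sigma_\xi \sigma_\eta \sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(\xi_i-a)^2}{\sigma_\xi^2} + \frac{(\eta_i-b)^2}{\sigma_\eta^2} - 2\rho\,\frac{(\xi_i-a)(\eta_i-b)}{\sigma_\xi \sigma_\eta}\right)\right]$$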

Strategy

• Since we do not know ξi and ηi for each galaxy but instead only the observed values xi and yi, we introduced the ξi and ηi for each galaxy as latent variables. These are parameters to be estimated.

• Here a and b give the true center of the distribution; ρ is the true correlation coefficient, and σξ and ση are the true standard deviations. None of these quantities are known. They are also parameters which must be estimated.


Strategy

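• [Since we are using a bivariate normal, the variance-covariance matrix is

$$V = \begin{bmatrix} \sigma_\xi^2 & \rho\,\sigma_\xi\sigma_\eta \\ \rho\,\sigma_\xi\sigma_\eta & \sigma_\eta^2 \end{bmatrix}$$

with inverse

$$V^{-1} = \frac{1}{1-\rho^2} \begin{bmatrix} \dfrac{1}{\sigma_\xi^2} & -\dfrac{\rho}{\sigma_\xi\sigma_\eta} \\[8pt] -\dfrac{\rho}{\sigma_\xi\sigma_\eta} & \dfrac{1}{\sigma_\eta^2} \end{bmatrix}$$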


Strategy

• So the density is

$$p(\phi) \propto |V|^{-1/2} \exp\left[-\tfrac{1}{2}\,\phi' V^{-1} \phi\right]$$

where φ = (ξ – a, η – b)′ ]
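The matrix form and the expanded bivariate-normal exponent can be checked against each other numerically. The following is a minimal sketch using NumPy; the function names and the numerical values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def cov_matrix(rho, sig_xi, sig_eta):
    """Variance-covariance matrix V of the latent (xi, eta) pair."""
    return np.array([[sig_xi**2, rho * sig_xi * sig_eta],
                     [rho * sig_xi * sig_eta, sig_eta**2]])

def cov_inverse(rho, sig_xi, sig_eta):
    """Closed-form inverse of V, as written on the slide."""
    return (1.0 / (1.0 - rho**2)) * np.array(
        [[1.0 / sig_xi**2, -rho / (sig_xi * sig_eta)],
         [-rho / (sig_xi * sig_eta), 1.0 / sig_eta**2]])

def quad_form_expanded(xi, eta, a, b, rho, sig_xi, sig_eta):
    """The quadratic form phi' V^{-1} phi written out term by term."""
    u = (xi - a) / sig_xi
    v = (eta - b) / sig_eta
    return (u * u + v * v - 2.0 * rho * u * v) / (1.0 - rho**2)

# Arbitrary illustrative values (not taken from the galaxy data).
rho, sig_xi, sig_eta = -0.8, 1.5, 0.4
V = cov_matrix(rho, sig_xi, sig_eta)
V_inv = cov_inverse(rho, sig_xi, sig_eta)
phi = np.array([0.7 - 0.2, 1.1 - 1.0])   # (xi - a, eta - b)

print(np.allclose(V @ V_inv, np.eye(2)))
print(phi @ V_inv @ phi, quad_form_expanded(0.7, 1.1, 0.2, 1.0, rho, sig_xi, sig_eta))
```

The determinant of V is σξ² ση² (1 − ρ²), which is where the factor 1/(σξ ση √(1 − ρ²)) in the density comes from.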


Strategy

• This expression may be regarded as our prior on the latent variables ξi and ηi. It depends on the other parameters (a, b, ρ, σξ, ση). We can regard these as hyperparameters, which will in turn require their own priors.

• The joint prior on all the latent variables ξi and ηi can be written as a product:

$$p(\Xi, \mathrm{H} \mid \rho, a, b, \sigma_\xi, \sigma_\eta) \propto \prod_i p(\xi_i, \eta_i \mid \rho, a, b, \sigma_\xi, \sigma_\eta)$$

where Ξ and Η are vectors whose components are ξi and ηi respectively.


Strategy

• We know the distributions of the data xi and yi conditional on ξi and ηi. Their joint distribution is given by

$$p(x_i, y_i \mid \xi_i, \eta_i, s_i, t_i) \propto \exp\left[-\frac{(x_i-\xi_i)^2}{2 s_i^2}\right] \exp\left[-\frac{(y_i-\eta_i)^2}{2 t_i^2}\right]$$

• Here si and ti are the standard deviations of the data points, assumed known perfectly for this analysis (these are the basis of the error bars I showed earlier...).
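To make the two-level structure concrete, the generative model can be simulated: latent pairs are drawn from the bivariate normal, and each observation adds independent Gaussian noise with known standard deviations. This is a sketch on synthetic values only; the function name, sample size, and parameter values are assumptions chosen for illustration and are far from the sparse galaxy sample.

```python
import numpy as np

def simulate_eiv(n, a, b, sig_xi, sig_eta, rho, s, t, rng):
    """Draw latent (xi, eta) pairs, then noisy observations (x, y)."""
    V = np.array([[sig_xi**2, rho * sig_xi * sig_eta],
                  [rho * sig_xi * sig_eta, sig_eta**2]])
    latent = rng.multivariate_normal([a, b], V, size=n)
    xi, eta = latent[:, 0], latent[:, 1]
    x = xi + s * rng.standard_normal(n)    # x_i ~ N(xi_i, s_i^2)
    y = eta + t * rng.standard_normal(n)   # y_i ~ N(eta_i, t_i^2)
    return xi, eta, x, y

rng = np.random.default_rng(0)
n = 5000                                   # large n, purely to show the effect clearly
s = np.full(n, 1.0)                        # hypothetical error bars in x
t = np.full(n, 1.0)                        # hypothetical error bars in y
xi, eta, x, y = simulate_eiv(n, 0.0, 0.0, 1.0, 1.0, -0.8, s, t, rng)

r_latent = np.corrcoef(xi, eta)[0, 1]      # correlation of the "true" values
r_observed = np.corrcoef(x, y)[0, 1]       # naive correlation of the noisy data
print(r_latent, r_observed)
```

The naive sample correlation of the noisy (x, y) is pulled toward zero relative to the latent correlation, which is why measurement errors in both coordinates cannot simply be ignored.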


The Data

The “true” value is somewhere near this error ellipse


Strategy

• Now we can write down the likelihood, the joint probability of observing the data, conditional on the parameters (here only the latent parameters appear; the others are implicit through the prior on the latent parameters):

$$p(X, Y \mid \Xi, \mathrm{H}, S, T) \propto \prod_i p(x_i, y_i \mid \xi_i, \eta_i, s_i, t_i)$$

where X, Y, S, and T are vectors whose components are xi, yi, si, and ti, respectively.


Priors

• The next step is to assign priors for each of the parameters, including the latent variables. Lacking special information, I chose conventional priors for all but ξ and η. Thus, I assign

• Improper constant flat priors on a and b.

• Improper Jeffreys priors 1/σξ and 1/ση on σξ and ση.

• We have two models, one with correlation (M1) and one without (M0). I assign p(M1) = p(M0) = 1/2.

• We will compare M1 and M0 by computing their posterior probabilities. I chose the prior p(ρ|M1) on ρ to be flat and normalized on [–1,1] and zero elsewhere; I chose a delta-function prior p(ρ|M0) = δ(ρ–0) on M0.

• Priors on ξ and η were displayed earlier.


Posterior Distribution

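• The posterior distribution is proportional to the prior times the likelihood, as Bayes instructs us

$$p(\rho, a, b, \sigma_\xi, \sigma_\eta, \Xi, \mathrm{H}, M_k \mid X, Y, S, T) \propto \frac{p(\rho \mid M_k)\, p(M_k)}{\sigma_\xi \sigma_\eta} \times p(\Xi, \mathrm{H} \mid \rho, a, b, \sigma_\xi, \sigma_\eta) \times p(X, Y \mid \Xi, \mathrm{H}, S, T)$$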
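The three factors of the posterior translate directly into an unnormalized log density. The sketch below is one possible transcription, under the assumption that the delta-function prior under M0 is handled as a point mass at ρ = 0; the function name, argument order, and the toy numbers are hypothetical.

```python
import numpy as np

def log_posterior(rho, a, b, sig_xi, sig_eta, xi, eta, model, x, y, s, t):
    """Unnormalized log posterior; model is 1 (correlated) or 0 (rho fixed at 0)."""
    if sig_xi <= 0 or sig_eta <= 0 or abs(rho) >= 1:
        return -np.inf
    if model == 0 and rho != 0.0:
        return -np.inf                       # delta-function prior under M0
    n = len(x)
    # p(M_k) = 1/2; p(rho | M1) = 1/2 on [-1, 1]; point mass (factor 1) under M0.
    log_prior = np.log(0.5) + (np.log(0.5) if model == 1 else 0.0)
    log_prior -= np.log(sig_xi) + np.log(sig_eta)        # Jeffreys priors
    # Bivariate-normal prior on the latent variables.
    u = (xi - a) / sig_xi
    v = (eta - b) / sig_eta
    log_latent = (-n * (np.log(sig_xi) + np.log(sig_eta) + 0.5 * np.log1p(-rho**2))
                  - np.sum(u * u + v * v - 2.0 * rho * u * v) / (2.0 * (1.0 - rho**2)))
    # Gaussian measurement errors with known s_i, t_i.
    log_like = -0.5 * np.sum(((x - xi) / s) ** 2 + ((y - eta) / t) ** 2)
    return log_prior + log_latent + log_like

# Made-up numbers for illustration only (not the galaxy data).
x = np.array([0.1, 0.5, 0.9])
y = np.array([1.2, 0.8, 0.3])
s = np.array([0.1, 0.1, 0.2])
t = np.array([0.2, 0.1, 0.1])
lp_m1 = log_posterior(-0.5, 0.5, 0.8, 0.4, 0.4, x, y, 1, x, y, s, t)
lp_bad = log_posterior(-0.5, 0.5, 0.8, 0.4, 0.4, x, y, 0, x, y, s, t)
print(lp_m1, lp_bad)
```

At ρ = 0 the two models differ only by the factor 1/2 from the uniform prior on ρ, which is the factor that later matters when jumping between models.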


Simulation Strategy

• We used simulation to generate a sample from the posterior distribution through a combination of Gibbs and Metropolis-Hastings samplers (“Metropolis-within-Gibbs”).

• The sample can be used to calculate quantities of interest:

» Compute posterior mean and variance of the correlation coefficient ρ (calculate sample mean and variance of ρ).

» Plot the posterior distribution of ρ (plot a histogram of ρ from the sample).

» Determine quantiles of the posterior distribution of ρ (use quantiles of the sample).

» Compute posterior probabilities of each model (calculate the frequency of the model in the sample).
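The summaries listed above are simple functions of the stored draws. Here is a minimal sketch, assuming the sampler has saved one ρ value and one model indicator (0 or 1) per iteration; the function name and the four toy draws are hypothetical, and restricting the ρ summaries to draws made under M1 is an assumption of this sketch.

```python
import numpy as np

def summarize(rho_draws, model_draws):
    """Posterior summaries of rho (under M1) and of the model indicator."""
    rho_draws = np.asarray(rho_draws, dtype=float)
    model_draws = np.asarray(model_draws, dtype=int)
    in_m1 = model_draws == 1
    p_m1 = in_m1.mean()                      # frequency of M1 in the sample
    rho_m1 = rho_draws[in_m1]                # rho is identically 0 under M0
    return {
        "p_m1": p_m1,
        "odds_m1": p_m1 / (1.0 - p_m1) if p_m1 < 1.0 else np.inf,
        "mean": rho_m1.mean(),
        "sd": rho_m1.std(ddof=1),
        "quantiles": np.quantile(rho_m1, [0.025, 0.5, 0.975]),
    }

# Four toy draws, only to show the bookkeeping.
toy = summarize([-0.8, -0.6, -0.7, 0.0], [1, 1, 1, 0])
print(toy["p_m1"], toy["odds_m1"], toy["mean"])
```

A histogram of the M1 draws (for example with `np.histogram`) gives the plotted posterior of ρ. The odds estimated this way are posterior odds; they equal the Bayes factor only when the prior odds are 1.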


Posterior Conditionals

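• The conditional distribution on ξi and ηi looks like

$$p(\xi_i, \eta_i \mid \rho, a, b, \sigma_\xi, \sigma_\eta, X, Y, S, T) \propto \exp\left[-\frac{1}{2}\left(\frac{(\xi_i-x_i)^2}{s_i^2} + \frac{(\eta_i-y_i)^2}{t_i^2}\right)\right] \times \exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(\xi_i-a)^2}{\sigma_\xi^2} + \frac{(\eta_i-b)^2}{\sigma_\eta^2} - 2\rho\,\frac{(\xi_i-a)(\eta_i-b)}{\sigma_\xi \sigma_\eta}\right)\right]$$

• By combining terms and completing the square we can sample ξi and ηi from a bivariate normal.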
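Completing the square for one galaxy adds the two precision matrices: the measurement precision diag(1/si², 1/ti²) and the prior precision V⁻¹. The sketch below works this out with NumPy; the function names are hypothetical and this is one way to code the Gibbs step, not necessarily how it was done for the talk.

```python
import numpy as np

def latent_conditional(x_i, y_i, s_i, t_i, a, b, sig_xi, sig_eta, rho):
    """Mean and covariance of the bivariate-normal conditional of (xi_i, eta_i)."""
    V = np.array([[sig_xi**2, rho * sig_xi * sig_eta],
                  [rho * sig_xi * sig_eta, sig_eta**2]])
    V_inv = np.linalg.inv(V)                       # prior precision
    D = np.diag([1.0 / s_i**2, 1.0 / t_i**2])      # measurement precision
    cov = np.linalg.inv(V_inv + D)                 # precisions add
    mean = cov @ (V_inv @ np.array([a, b]) + D @ np.array([x_i, y_i]))
    return mean, cov

def draw_latent(x_i, y_i, s_i, t_i, a, b, sig_xi, sig_eta, rho, rng):
    """One Gibbs draw of (xi_i, eta_i)."""
    mean, cov = latent_conditional(x_i, y_i, s_i, t_i, a, b, sig_xi, sig_eta, rho)
    return rng.multivariate_normal(mean, cov)

# Illustrative values only.
rng = np.random.default_rng(0)
mean, cov = latent_conditional(0.3, 1.4, 0.1, 0.2, 0.0, 1.0, 1.0, 0.5, -0.8)
print(mean, draw_latent(0.3, 1.4, 0.1, 0.2, 0.0, 1.0, 1.0, 0.5, -0.8, rng))
```

The conditional mean is a precision-weighted compromise: with tiny error bars it sits on the observed (xi, yi), and with huge error bars it falls back to the population center (a, b).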



• Similarly, the posterior conditional on a, b is

$$p(a, b \mid \rho, \sigma_\xi, \sigma_\eta, \Xi, \mathrm{H}) \propto \prod_i \exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(\xi_i-a)^2}{\sigma_\xi^2} + \frac{(\eta_i-b)^2}{\sigma_\eta^2} - 2\rho\,\frac{(\xi_i-a)(\eta_i-b)}{\sigma_\xi \sigma_\eta}\right)\right]$$

• Again, by completing the square we can sample a and b from a bivariate normal.

• Note that if this were not an EIV problem, we would have x’s and y’s instead of ξ’s and η’s.
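With flat priors on a and b, completing the square in the product over galaxies gives a bivariate normal centered on the means of the current latent values, with covariance V/N. A minimal sketch follows; the function name and toy latent values are assumptions for illustration.

```python
import numpy as np

def center_conditional(xi, eta, sig_xi, sig_eta, rho):
    """Mean and covariance of the bivariate-normal conditional of (a, b)."""
    xi = np.asarray(xi, dtype=float)
    eta = np.asarray(eta, dtype=float)
    n = len(xi)
    V = np.array([[sig_xi**2, rho * sig_xi * sig_eta],
                  [rho * sig_xi * sig_eta, sig_eta**2]])
    mean = np.array([xi.mean(), eta.mean()])   # center of the latent cloud
    return mean, V / n                         # covariance shrinks like 1/N

# Toy latent values, for illustration only.
rng = np.random.default_rng(0)
ab_mean, ab_cov = center_conditional([1.0, 2.0, 3.0], [2.0, 0.0, -2.0], 1.0, 0.5, -0.8)
a_new, b_new = rng.multivariate_normal(ab_mean, ab_cov)   # one Gibbs draw
print(ab_mean, a_new, b_new)
```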



• The posterior conditional on σξ and ση is

$$p(\sigma_\xi, \sigma_\eta \mid \rho, a, b, \Xi, \mathrm{H}) \propto \sigma_\xi^{-(N+1)}\, \sigma_\eta^{-(N+1)} \times \prod_i \exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(\xi_i-a)^2}{\sigma_\xi^2} + \frac{(\eta_i-b)^2}{\sigma_\eta^2} - 2\rho\,\frac{(\xi_i-a)(\eta_i-b)}{\sigma_\xi \sigma_\eta}\right)\right]$$

• It wasn’t obvious to me that we could sample this in a Gibbs step, but maybe there’s a way to do it. I just used independent M-H steps with a uniform symmetric proposal distribution, tuning the step size for good mixing, and this worked fine.
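A random-walk Metropolis update for the two standard deviations might look like the following. This is a sketch under assumptions: "independent M-H steps" is read here as one separate update per standard deviation, and the names and the step size `max_step` are hypothetical (the talk tuned the step size by experiment and does not give its value).

```python
import numpy as np

def log_cond_sigma(sig_xi, sig_eta, rho, a, b, xi, eta):
    """Log of the conditional of (sigma_xi, sigma_eta), up to a constant."""
    if sig_xi <= 0 or sig_eta <= 0:
        return -np.inf
    n = len(xi)
    u = (xi - a) / sig_xi
    v = (eta - b) / sig_eta
    return (-(n + 1) * (np.log(sig_xi) + np.log(sig_eta))
            - np.sum(u * u + v * v - 2.0 * rho * u * v) / (2.0 * (1.0 - rho**2)))

def mh_step_sigma(sig_xi, sig_eta, rho, a, b, xi, eta, max_step, rng):
    """Update sigma_xi, then sigma_eta, each with a symmetric uniform proposal."""
    cur = log_cond_sigma(sig_xi, sig_eta, rho, a, b, xi, eta)
    prop = sig_xi + rng.uniform(-max_step, max_step)
    new = log_cond_sigma(prop, sig_eta, rho, a, b, xi, eta)
    if rng.uniform() < np.exp(min(0.0, new - cur)):   # symmetric proposal: q cancels
        sig_xi, cur = prop, new
    prop = sig_eta + rng.uniform(-max_step, max_step)
    new = log_cond_sigma(sig_xi, prop, rho, a, b, xi, eta)
    if rng.uniform() < np.exp(min(0.0, new - cur)):
        sig_eta = prop
    return sig_xi, sig_eta

# Toy latent values, for illustration only.
rng = np.random.default_rng(0)
xi = np.array([-1.0, -0.3, 0.2, 0.8, 1.1])
eta = np.array([0.9, 0.4, -0.1, -0.6, -1.2])
state = (1.0, 1.0)
for _ in range(500):
    state = mh_step_sigma(state[0], state[1], -0.8, 0.0, 0.0, xi, eta, 0.3, rng)
print(state)
```

Proposals that land at or below zero get log density −∞ and are always rejected, so the chain stays on positive standard deviations without any special handling.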



• We do a reversible-jump step on ρ and M simultaneously. Here the idea is to propose a model M and at the same time a correlation ρ in an M-H step and either accept or reject according to the M-H ratio.

• Proposing from M0 to M0 or from M1 to M1 is basically simple, just an ordinary M-H step.

• If we are proposing between models, then things are a bit more complicated. This is due to the fact that the dimensionalities of the parameter spaces are different between the two models.



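• The posterior conditional of ρ under M1 is

$$p(\rho \mid a, b, \sigma_\xi, \sigma_\eta, \Xi, \mathrm{H}, M_1) \propto \frac{1}{2} \times \frac{1}{(1-\rho^2)^{N/2}} \times \prod_i \exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(\xi_i-a)^2}{\sigma_\xi^2} + \frac{(\eta_i-b)^2}{\sigma_\eta^2}\right)\right] \times \prod_i \exp\left[\frac{\rho}{1-\rho^2}\,\frac{(\xi_i-a)(\eta_i-b)}{\sigma_\xi \sigma_\eta}\right]$$

• (The leading factor of 1/2 comes from the prior on ρ and is very important.)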



• The posterior conditional of ρ under M0 is

$$p(\rho \mid a, b, \sigma_\xi, \sigma_\eta, \Xi, \mathrm{H}, M_0) \propto \delta(\rho) \times \prod_i \exp\left[-\frac{1}{2}\left(\frac{(\xi_i-a)^2}{\sigma_\xi^2} + \frac{(\eta_i-b)^2}{\sigma_\eta^2}\right)\right]$$

• The δ function guarantees that ρ = 0.

• Here, the proportionality factor is chosen so as to match the factor of 1/2 under M1. The factors come from the priors [p(ρ|M1) ~ U(–1,1), which has an implicit factor 1/2, and p(ρ|M0) ~ δ(ρ)].



• The M-H ratio when jumping from (M,ρ) to (M*,ρ*) is therefore (where the q’s are the proposals):

$$\frac{p(\rho^* \mid M^*, \ldots)\, p(M^* \mid \ldots)\, q(\rho \mid M, \ldots)\, q(M \mid \ldots)}{p(\rho \mid M, \ldots)\, p(M \mid \ldots)\, q(\rho^* \mid M^*, \ldots)\, q(M^* \mid \ldots)}$$

• We sampled using a beta proposal q(ρ|M1,…) with parameters tuned by experiment for efficiency and good mixing under the complex model, and with a proposal q(M1|…) that also was chosen by experiment with an eye to getting an accurate estimate of the posterior odds on M1.

• The idea is that a beta proposal on ρ matches the actual conditional pretty well so will be accepted with high probability; the M-H ratio will be close to 1.
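The joint move on (M, ρ) can be sketched as an independence-type M-H step. The code below is an illustration under several assumptions: the beta proposal is stretched from (0, 1) to (−1, 1); the delta functions in the target and in the proposal under M0 are taken to cancel; and the values of `q_m1`, `alpha`, and `beta` are placeholders, since the tuned values used in the talk are not given. All function names are hypothetical.

```python
import math
import numpy as np

def log_target(model, rho, a, b, sig_xi, sig_eta, xi, eta, prior_m1=0.5):
    """log of p(rho | M, ...) p(M | ...), up to a constant shared by both models."""
    u = (xi - a) / sig_xi
    v = (eta - b) / sig_eta
    if model == 0:                                   # rho = 0, no factor 1/2
        return math.log(1.0 - prior_m1) - 0.5 * np.sum(u * u + v * v)
    n = len(xi)
    return (math.log(prior_m1) + math.log(0.5)       # factor 1/2 from U(-1, 1)
            - 0.5 * n * math.log(1.0 - rho**2)
            - np.sum(u * u + v * v - 2.0 * rho * u * v) / (2.0 * (1.0 - rho**2)))

def log_beta_proposal(rho, alpha, beta):
    """Log density on (-1, 1) of rho = 2w - 1 with w ~ Beta(alpha, beta)."""
    w = 0.5 * (rho + 1.0)
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return (log_norm + (alpha - 1.0) * math.log(w)
            + (beta - 1.0) * math.log(1.0 - w) - math.log(2.0))

def rj_step(model, rho, a, b, sig_xi, sig_eta, xi, eta, q_m1, alpha, beta, rng):
    """Propose (M*, rho*) jointly; accept or reject with the M-H ratio."""
    new_model = 1 if rng.uniform() < q_m1 else 0
    new_rho = 2.0 * rng.beta(alpha, beta) - 1.0 if new_model == 1 else 0.0

    def log_q(m, r):
        if m == 1:
            return math.log(q_m1) + log_beta_proposal(r, alpha, beta)
        return math.log(1.0 - q_m1)

    log_ratio = (log_target(new_model, new_rho, a, b, sig_xi, sig_eta, xi, eta)
                 + log_q(model, rho)
                 - log_target(model, rho, a, b, sig_xi, sig_eta, xi, eta)
                 - log_q(new_model, new_rho))
    if rng.uniform() < math.exp(min(0.0, log_ratio)):
        return new_model, new_rho
    return model, rho

# Toy latent values with a strong negative trend, for illustration only.
rng = np.random.default_rng(1)
xi = np.linspace(-1.5, 1.5, 20)
eta = -xi + 0.6 * (-1.0) ** np.arange(20)
state = (0, 0.0)
for _ in range(200):
    state = rj_step(state[0], state[1], 0.0, 0.0, 1.0, 1.0, xi, eta, 0.5, 2.0, 6.0, rng)
print(state)
```

In a full sampler this step would alternate with the updates of the latent variables, the center, and the standard deviations; here the other quantities are held fixed so the jump can be seen in isolation.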


Sampling Strategy for Our Problem

• To summarize:

• We sampled the a, b, ξj, ηj in Gibbs steps (a and b appear in the posterior distribution as a bivariate normal distribution, as do the ξj, ηj).

• We sampled σξ, ση with M-H steps using symmetric uniform proposals centered on the current point, adjusting the maximum step for good mixing.

• We sampled ρ and M in a simultaneous reversible-jump M-H step, using a beta proposal on ρ with parameters tuned by experiment for efficiency and good mixing under the complex model, and with a proposal on M that also was chosen by experiment with an eye to getting an accurate estimate of the posterior odds on M.


Results

• For the data set including the circled point, we obtained

• Odds on model with correlation = 207 (assumes prior odds equal to 1)

• Median rho = -0.81

• Mean rho = -0.79 ± 0.10


Posterior distribution of ρ (Including all points)


Results

• For the data set without the circled point we obtained

• Odds on model with correlation = 9.9


• Mean rho = -0.68 ± 0.16


Posterior distribution of ρ (Excluding 1 point)


Final Comments

• This problem combines several interesting features:

• Latent variables, introduced because this is an errors-in-variables problem

• Model selection, implemented through reversible-jump MCMC simulation

• A combination of Gibbs and Metropolis-Hastings steps to implement the sampler (“Metropolis-within-Gibbs”)

• It is a good example of how systematic application of basic Bayesian analysis can yield a satisfying solution of a problem that, when looked at frequentistically, seems almost intractable


Final Comments

• One final comment: If you look at the tail area in either of the two cases investigated, you will see that it is much less than the 1/200 or 1/10 odds ratio that we calculated for the odds of M0 against M1. This is an example of how tail areas in general are not reliable statistics for deciding whether a hypothesis should be selected.
