
Statistical analysis of neural data: Maximum a posteriori techniques for decoding spike trains∗

Liam Paninski
Department of Statistics and Center for Theoretical Neuroscience

Columbia University
http://www.stat.columbia.edu/~liam

May 20, 2007

Contents

1 Introduction: overview of the neural decoding problem

2 Maximum a posteriori neural decoding
   2.1 Gaussian approximations to the posterior p(~x|D) are tractable and useful
      2.1.1 Moment-matching provides an alternative method for constructing the Gaussian approximation
   2.2 Numerical implementation
   2.3 MAP decoding examples: correlated and spatial stimuli
   2.4 Decoding Poisson image observations

3 Connections between the MAP estimator and the optimal linear (regression) estimator

4 The Laplace approximation leads to a highly tractable method for computing information-theoretic quantities

5 Discrimination and detection; change-point analysis
   5.1 Optimal change-point detection

6 Discussion
   6.1 Extensions: fully-Bayesian techniques

∗These notes are in large part directly drawn from (Pillow and Paninski, 2007).


Figure 1: Schematic diagram of the decoding process.

1 Introduction: overview of the neural decoding problem

The neural decoding problem is a fundamental question in computational neuroscience (Rieke et al., 1997): given the observed spike trains of a population of cells whose responses are related to the state of some behaviorally-relevant signal ~x, how can we estimate, or "decode," ~x? Solving this problem experimentally is of basic importance both for our understanding of neural coding and for the design of neural prosthetic devices (Donoghue, 2002). Accordingly, a rather large literature now exists on developing and applying decoding methods to spike train data, in both single-cell and population recordings.

This literature can be roughly broken down into two parts, in which the decoding algorithm is based on either 1) regression techniques, or 2) Bayesian methods. Following the influential work of (Bialek et al., 1991), who proposed an optimal linear decoder (i.e., multiple linear regression of the spike train, or population spike train, onto ~x), the last decade has seen a great number of papers employing regression methods (typically multiple linear regression in the time or frequency domain) (Theunissen et al., 1996; Haag and Borst, 1997; Warland et al., 1997; Salinas and Abbott, 2001; Serruya et al., 2002; Nicolelis et al., 2003) (see also the earlier work of (Humphrey et al., 1970)). Elaborations on this idea include using nonlinear terms in the regression model, e.g. polynomial terms, as in the Volterra model (Marmarelis and Marmarelis, 1978; Bialek et al., 1991), or using neural network (Warland et al., 1997) or kernel regression (Shpigelman et al., 2003; Eichhorn et al., 2004) techniques. These methods tend to be quite efficient both computationally and in terms of the amount of data they require, but are not guaranteed to perform optimally (since the true optimal decoder might be far from linear, or polynomial).

On the other hand we have decoding algorithms based on Bayes' rule, in which the a priori distribution of the signal to be decoded is combined with a "forward," or "encoding," model describing the likelihood of the observed spike train, given the signal (Fig. 1). The resulting Bayes estimate is optimal in principle (by construction), assuming that the prior distribution and encoding model are correct (of course, this is a rather significant assumption). This estimate also comes with natural "errorbars," measures of how confident we should be about our predictions, arising from the posterior distribution over the stimulus given the response. Decoding therefore serves not only as a means for probing which aspects of the stimulus are preserved by the response, but also as a tool for comparing different encoding models. For example,


we can decode a spike train using different models (e.g., including vs. ignoring spike-history effects) and examine which encoding model allows us to best decode the true stimulus (Pillow et al., 2005b). Such a test may in principle give a different outcome than a comparison which focuses on two encoding models' abilities to predict spike train statistics.

However, computing this Bayes-optimal solution can present some difficulties. For example, the optimal Bayesian estimator under the squared-error cost function requires the computation of a conditional expectation E(~x|spikes) of the signal ~x given the observed spike response D, which in turn requires that we compute d-dimensional integrals over the ~x-space (where d = dim(~x)). Thus, most previous work on Bayesian decoding of spike trains has either focused on low-dimensional signals ~x (Sanger, 1994; Maynard et al., 1999; Abbott and Dayan, 1999; Karmeier et al., 2005) or on situations in which recursive techniques may be used to perform these conditional expectation computations efficiently, either using approximate extended Kalman filter techniques (Zhang et al., 1998; Brown et al., 1998; Barbieri et al., 2004; Wu et al., 2004) or variants of the particle filtering algorithm (Brockwell et al., 2004; Kelly and Lee, 2004; Shoham et al., 2005), which is exact in the limit of an infinite number of particles. While this recursive approach is quite powerful, unfortunately its applicability is limited to cases in which the joint distribution of the signal ~x and the spike responses has a certain Markov tree decomposition (e.g., a hidden Markov model or state-space representation) (Jordan, 1999). We will discuss the recursive approach in much more depth in a later chapter, after building some additional analytical tools.

Thus we will begin our discussion of the decoding problem by describing a straightforward approach which is applicable without any such tree-decomposition assumptions, and which remains tractable even when the stimulus ~x is fairly high-dimensional (e.g., dim(~x) ~ 10^3). The idea is to compute the maximum a posteriori (MAP) estimate x̂MAP. This estimate is Bayesian in the sense that it incorporates knowledge of both the prior distribution p(~x) and the likelihood p(D|~x) of having observed the (single- or multiple-) spike train data D given the stimulus ~x.1 However, computing x̂MAP requires only that we perform a maximization of the posterior, instead of an integration, and in the cases we examine here the posterior is much easier to exactly maximize than to integrate.

2 Maximum a posteriori neural decoding

To compute the MAP estimate we need to maximize the posterior over the stimulus given the data:

$$\hat{x}_{MAP} = \arg\max_{\vec{x}} p(\vec{x}|D) = \arg\max_{\vec{x}} \frac{1}{Z}\, p(D|\vec{x})\, p(\vec{x})$$

as a function of ~x, or equivalently maximize

$$\log p(\vec{x}|D) = \log p(D|\vec{x}) + \log p(\vec{x}) + \text{const.}$$

If we are willing to restrict our attention to the case that the data D are point processes whose conditional rate function may be written in GLM form,

$$\lambda_i(t) = f\Big(\vec{k}_i \cdot \vec{x}(t) + \sum_{i',j} h_{i',j}\, n_{i'}(t-j)\Big), \qquad (1)$$

1The MAP estimate is also Bayes-optimal under a "zero-one" loss function, which rewards only the correct estimate of the stimulus and penalizes all incorrect estimates with a fixed penalty.


with f(.) a convex, log-concave function, then the log-likelihood term

$$\log p(D|\vec{x},\theta) = \sum_{i,t} n_i(t)\log \lambda_i(t) - \sum_i \int_0^T \lambda_i(t)\, dt + \text{const.}$$

is concave in ~x, for any observed spike data D. (This follows by exactly the same logic as the fact that the log-likelihood is concave as a function of the parameter ~k.2) Since the sum of two concave functions is itself concave, it is clear that this optimization problem will be tractable whenever the log-prior term log p(~x) is also a concave function of ~x (Paninski, 2004): in this case, any ascent algorithm is guaranteed to return the optimal solution

$$\hat{x}_{MAP} \equiv \arg\max_{\vec{x}} \log p(\vec{x}|D),$$

since ~x lies within a convex set (the d-dimensional vector space). Note that this optimal solution x̂MAP is in general a nonlinear function of the data D.

We should emphasize that this log-concavity condition on the stimulus distribution p(~x) is restrictive (Paninski, 2005): for example, log-concave distributions must necessarily have tails that decrease at least exponentially quickly, ruling out "fat-tailed" prior distributions with infinite moments. Nonetheless, the class of log-concave distributions is quite large, including (by definition) any distribution of the form

$$p(\vec{x}) = \exp(Q(\vec{x}))$$

for some concave function Q(~x); for example, the exponential, triangular, uniform, and multivariate Gaussian (with arbitrary mean and covariance) distributions may all be written in this form. In particular, any experiment based on the "white noise" paradigm, in which a Gaussian signal of some mean and a white power spectrum (or more generally, any power spectrum) is used to generate stimuli (see e.g. (Marmarelis and Marmarelis, 1978) or (Rieke et al., 1997) for many examples), may be easily analyzed in this framework. (Of course, in principle we may still compute x̂MAP in the case of non-concave log-priors; the point is that ascent techniques might not return x̂MAP in this case, and that therefore computing the true global optimizer x̂MAP may not be tractable in this more general setting.)

2.1 Gaussian approximations to the posterior p(~x|D) are tractable and useful

As we will demonstrate below, the estimate x̂MAP proves to be a good decoder of spike train data in a variety of settings, and this ability to tractably perform optimal nonlinear signal reconstruction given the activity of ensembles of interacting neurons is quite useful per se. However, computing x̂MAP also gives us easy access to several other important and useful quantities. In particular, just as in our discussion of estimating the model parameters θ, we would like to quantify the uncertainty in our estimates; one easy way to do this is by perturbing x̂MAP slightly in some direction ~y (say, x̂MAP + ε~y, for some small positive scalar ε) and computing the ratio of posteriors at these two points p(x̂MAP|D)/p(x̂MAP + ε~y|D), or equivalently the difference in the log-posteriors log p(x̂MAP|D) − log p(x̂MAP + ε~y|D). If

2Note that the log-likelihood function is separately, not jointly, concave in ~x and the model parameters; that is, log p(D|~x, θ) is concave in the stimulus ~x for any fixed data D and parameters ~θ, and concave in the parameters ~θ for any fixed observed D and ~x.


the posterior changes significantly with the perturbation ε~y, then this perturbation is highly "detectable"; conversely, if the change in the posterior is small or (in the extreme case) there is no change at all, then it is difficult (or impossible) to discriminate between x̂MAP and x̂MAP + ε~y on the basis of the data D, and we can expect our estimate x̂MAP to be highly variable in this direction (and the corresponding confidence interval in this direction to be wide).

For sufficiently small stimulus perturbations ε, a second-order expansion suffices to approximate the log-posterior (by definition, the first derivative is zero at the optimizer x̂MAP):

$$\log p(\hat{x}_{MAP} + \epsilon\vec{y}\,|D) - \log p(\hat{x}_{MAP}|D) = -\frac{\epsilon^2}{2}\,\vec{y}^T J_x \vec{y} + o(\epsilon^2),$$

where −Jx is the Hessian (second derivative matrix) of the log-posterior at x̂MAP with respect to ~x. As discussed previously, this quadratic form may be most easily interpreted by computing the eigenvectors of Jx (Huys et al., 2006): eigenvectors corresponding to large eigenvalues represent stimulus directions ~y along which the curvature of the posterior is large, i.e. directions along which perturbations are highly discriminable, and those corresponding to small eigenvalues represent directions that are only weakly discriminable.

This second-order description of the log-posterior corresponds to a Gaussian approximation of the posterior (in the statistics literature this is known as a "Laplace approximation" (Kass and Raftery, 1995)):

$$p(\vec{x}|D) \approx \mathcal{N}(\hat{x}_{MAP}, C), \qquad (2)$$

where the mean of this Gaussian is the MAP estimate x̂MAP and the covariance matrix is C = J_x^{-1}. In general applications, of course, such a Gaussian approximation is not justified. However, in our case, we know that p(~x|D) is always unimodal (since any concave, or log-concave, function is unimodal (Boyd and Vandenberghe, 2004)) and at least continuous on its support (again by log-concavity). If the nonlinearity f(u) and the log-prior log p(~x) are smooth functions of their arguments u and ~x respectively, then the log-posterior log p(~x|D) is necessarily smooth as well; in this case, the Gaussian approximation is fairly accurate (although of course the posterior will never be exactly Gaussian). See Fig. 3 for some comparisons of the true posterior and this Gaussian approximation.
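To make the role of the Hessian concrete, here is a minimal numpy sketch (not part of the original text; the helper name and toy inputs are hypothetical) of how one might summarize the Laplace approximation once x̂MAP and Jx are in hand: marginal errorbars come from the diagonal of C = Jx^{-1}, and the best/worst-encoded features are the extreme eigenvectors of Jx.

```python
import numpy as np

def laplace_summary(J_x):
    """Summarize the Gaussian (Laplace) approximation N(x_MAP, C), C = inv(J_x).

    J_x is the negative Hessian of the log-posterior at x_MAP (positive definite
    for a log-concave posterior).  Returns marginal error bars and the best/worst
    encoded stimulus directions (eigenvectors with largest/smallest eigenvalue).
    """
    C = np.linalg.inv(J_x)                 # posterior covariance
    marginal_sd = np.sqrt(np.diag(C))      # per-coordinate errorbars
    evals, evecs = np.linalg.eigh(J_x)     # eigenvalues in ascending order
    worst_feature = evecs[:, 0]            # smallest curvature: least constrained
    best_feature = evecs[:, -1]            # largest curvature: most constrained
    return marginal_sd, best_feature, worst_feature

# Toy usage with a made-up positive definite J_x:
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
J_x = A @ A.T + 5 * np.eye(5)
sd, best, worst = laplace_summary(J_x)
```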

2.1.1 Moment-matching provides an alternative method for constructing the Gaussian approximation

If the log-posterior log p(~x|D) is strongly asymmetric ("skewed") about its peak x̂MAP, then the Laplace approximation described above (which is by construction symmetric around x̂MAP) will be inaccurate. In this case we may construct a better Gaussian approximation by a more computationally-expensive moment-matching technique known as "expectation propagation" in the machine learning literature (Minka, 2001; Yu et al., 2006). The idea, once again, is to construct a weighted-Gaussian approximation

$$p(\vec{x}, D) \approx w\, G_{\mu,C}(\vec{x})$$

(where w is a weight term we will discuss in more depth in section 5 below), but instead of taking the mean µ to be x̂MAP and the covariance to be the inverse Hessian evaluated at µ, we attempt to approximate w, µ, and C directly as

$$w = \int p(\vec{x}, D)\, d\vec{x},$$

$$\mu = \frac{1}{w}\int p(\vec{x}, D)\, \vec{x}\, d\vec{x},$$

and

$$C = \frac{1}{w}\int p(\vec{x}, D)\, (\vec{x}-\mu)(\vec{x}-\mu)^T\, d\vec{x}.$$

Of course, as we discussed in the introduction, these integrals are not directly tractable; to construct a tractable approximation, we make use of the fact that the joint distribution p(~x, D) factorizes:

$$p(\vec{x}, D) = p(\vec{x}) \prod_{t,i} \frac{e^{-\lambda_i(t)}\,\lambda_i(t)^{n_i(t)}}{n_i(t)!}.$$

Now, if we assume a Gaussian prior p(~x) for simplicity, our approximation takes the form

$$p(\vec{x}) \prod_{t,i} \frac{e^{-\lambda_i(t)}\,\lambda_i(t)^{n_i(t)}}{n_i(t)!} \approx p(\vec{x}) \prod_l q_l(r_l^T\vec{x}) = G_{\mu_0,C_0}(\vec{x}) \prod_l \exp\!\big(a_l (r_l^T\vec{x})^2 + b_l (r_l^T\vec{x}) + c_l\big) = w\, G_{\mu,C}(\vec{x}),$$

where we have made the abbreviation $q_l(.)$ for the Poisson terms on the left; l represents an index over (t, i), and the $r_l^T\vec{x}$ notation emphasizes that each term is a rank-one function of ~x. Now we construct our approximation by iterating over the l terms: with each iteration, we adjust the quadratic terms $a_l, b_l, c_l$ (or equivalently w, µ, C) by incorporating the effect of the corresponding l term.

More concretely, we initialize each $q_l(.)$ term as $q_l(.) = 1$, i.e., $a_l = b_l = c_l = 0$. Then, for each l, we remove the $q_l(.)$ term from the product to form

$$w^{\setminus l}\, G_{\mu^{\setminus l},C^{\setminus l}}(\vec{x}) = G_{\mu_0,C_0}(\vec{x}) \prod_{l'\neq l} \exp\!\big(a_{l'} (r_{l'}^T\vec{x})^2 + b_{l'} (r_{l'}^T\vec{x}) + c_{l'}\big) = \frac{w\, G_{\mu,C}(\vec{x})}{q_l(r_l^T\vec{x})},$$

and then incorporate the true likelihood term by numerically integrating (this is equivalent to numerically matching the zeroth, first, and second moments):

$$w_{\rm update} = \int w^{\setminus l}\, G_{\mu^{\setminus l},C^{\setminus l}}(\vec{x})\, q_l(r_l^T\vec{x})\, d\vec{x},$$

$$\mu_{\rm update} = \frac{1}{w_{\rm update}} \int w^{\setminus l}\, G_{\mu^{\setminus l},C^{\setminus l}}(\vec{x})\, q_l(r_l^T\vec{x})\, \vec{x}\, d\vec{x},$$

and

$$C_{\rm update} = \frac{1}{w_{\rm update}} \int w^{\setminus l}\, G_{\mu^{\setminus l},C^{\setminus l}}(\vec{x})\, q_l(r_l^T\vec{x})\, (\vec{x}-\mu_{\rm update})(\vec{x}-\mu_{\rm update})^T\, d\vec{x}$$

(updating the coefficients $a_l, b_l, c_l$ from $(w_{\rm old}, \mu_{\rm old}, C_{\rm old})$ and $(w_{\rm update}, \mu_{\rm update}, C_{\rm update})$ via the usual rank-one manipulations; see (Minka, 2001) for details). The key point is that, due to the rank-one nature of $q_l(.)$, on each step we only need to numerically compute one-dimensional integrals of the form

$$\int G_{m,\sigma^2}(y)\, q(y)\, dy,$$

since the Gaussian integrals in the $(d-1)$ directions unaffected by $q_l(r_l^T\vec{x})$ may be computed analytically.


This process of looping through all the l terms may be iterated until convergence, at which point we obtain our approximations of w, µ, and C (see (Minka, 2001) for further discussion of convergence issues). Thus the procedure is tractable when the number of terms in the likelihood product is not too large. In the following, for simplicity, we will restrict our discussion to the somewhat more transparent Laplace approximation; however, the reader should keep in mind that this moment-matching approximation is a useful alternative whenever Laplace approximations are mentioned below.
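As a rough illustration of the key computational step, the sketch below performs the one-dimensional moment matching numerically on a grid; it is a simplified, hypothetical example (the cavity bookkeeping and the rank-one updates of a_l, b_l, c_l are omitted), not an implementation of the full expectation-propagation loop.

```python
import numpy as np

def match_moments_1d(m, s2, q, half_width=8.0, n_grid=801):
    """Match the zeroth, first, and second moments of the one-dimensional
    'tilted' density G_{m,s2}(y) * q(y) by brute-force numerical integration.

    Thanks to the rank-one structure of each likelihood term, this is the only
    numerical integral an EP update needs; the remaining (d-1) Gaussian
    directions can be handled analytically.  Returns (w, mu, var).
    """
    s = np.sqrt(s2)
    y = np.linspace(m - half_width * s, m + half_width * s, n_grid)
    dy = y[1] - y[0]
    gauss = np.exp(-0.5 * (y - m) ** 2 / s2) / np.sqrt(2.0 * np.pi * s2)
    tilted = gauss * q(y)
    w = np.sum(tilted) * dy                        # zeroth moment (weight)
    mu = np.sum(y * tilted) * dy / w               # first moment
    var = np.sum((y - mu) ** 2 * tilted) * dy / w  # second central moment
    return w, mu, var

# Example: a single Poisson likelihood factor, n = 2 counts with rate exp(y)*dt
dt = 0.1
q = lambda y: np.exp(-np.exp(y) * dt) * (np.exp(y) * dt) ** 2
w, mu, var = match_moments_1d(m=0.0, s2=1.0, q=q)
```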

2.2 Numerical implementation

While the above discussion establishes that ascent-based methods will succeed in finding the true x̂MAP, and that the resulting Gaussian approximation should be fairly accurate, it remains to show that these operations are tractable and can be performed efficiently. To compute x̂MAP, we may employ conjugate gradient ascent methods with analytically computed gradients and Hessian, both of which may be computed exactly as before, when we were estimating ~k instead of ~x; the roles of ~k and ~x are simply reversed (see (Pillow and Paninski, 2007) for full details). The total gradient and Hessian are given by performing a summation over neurons i (since the log-likelihood log p(D|~x) is computed as a sum over i) and adding the gradient and Hessian of the log-prior, evaluated at ~x.

While in general it takes O(d^2) steps to compute the Hessian of a d-dimensional function (where d = dim(~x) here), the Hessian of log p(~x|D) may be computed much more quickly. Most importantly, the Hessian of the log-likelihood is a banded matrix, with the width of the band equal to the length of the filter ~k. Additionally, the Hessian of the log-prior can be computed easily in many important cases. For example, in the case of Gaussian stimuli, this Hessian is constant as a function of ~x and can be precomputed just once. Thus in fact the computation of the Hessian is just O(d) per iteration, instead of the O(d^2) time required more generally.

Optimization via conjugate gradient ascent requires O(d^3) time in general (Press et al., 1992). In special cases, however, we may reduce this to O(N^3 + N^2 d) time, where N is the number of observed neurons. This speedup is key in cases where d is very large (for example, in the case of spatial image stimuli ~x, where d may be on the order of d ~ 10^4 or larger), while the number of observed neurons using currently available technology is smaller, at most N ~ 10^3 and more typically N ~ 10^2. We discuss an example of this speedup in more detail below (see especially Fig. 5).

Finally, the convergence to the optimizer may be sped up by initializing with a good approximate solution for x̂MAP: we use the optimal linear estimate (OLE) of ~x given the observed data, which may be computed easily using standard multiple (linear) regression techniques. See section 3 below for a further discussion of the connections between these two estimators.
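A minimal sketch of the basic optimization, assuming an exponential nonlinearity, a single cell, a Gaussian prior with known precision, and no spike-history terms (all simplifications relative to the text; the banded-Hessian and conjugate-gradient machinery discussed above is omitted, and all names and numbers here are hypothetical):

```python
import numpy as np

def map_decode(n, K, dt, Cinv, n_iter=50):
    """Newton-ascent sketch of MAP decoding for a GLM with exponential
    nonlinearity and a zero-mean Gaussian prior with precision Cinv.

    n : observed spike counts per time bin (length T)
    K : matrix mapping the stimulus x to the linear predictor (T x d)
    (No step-size control here; a line search would make this more robust.)
    """
    d = K.shape[1]
    x = np.zeros(d)
    for _ in range(n_iter):
        u = K @ x
        lam = np.exp(u)                          # conditional intensity per bin
        grad = K.T @ (n - lam * dt) - Cinv @ x   # gradient of log-posterior
        W = lam * dt                             # for f = exp: -d^2 loglik / du^2
        J = K.T @ (W[:, None] * K) + Cinv        # negative Hessian (pos. definite)
        x = x + np.linalg.solve(J, grad)         # Newton ascent step
    return x, J                                  # MAP estimate and negative Hessian

# Toy usage: decode a 20-dimensional white-noise stimulus from one cell's counts
rng = np.random.default_rng(1)
T, d, dt = 200, 20, 0.01
K = 0.5 * rng.normal(size=(T, d))
x_true = rng.normal(size=d)
n = rng.poisson(np.exp(K @ x_true) * dt)
x_map, J = map_decode(n, K, dt, Cinv=np.eye(d))
```

The returned negative Hessian J can be passed directly to a Laplace-approximation summary like the one sketched in section 2.1.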

2.3 MAP decoding examples: correlated and spatial stimuli

Figure 2 shows a straightforward example of MAP estimation of a 50-sample stimulus using either a 2-cell (A) or a 20-cell (E) simulated population response. In this case, the stimulus (black trace) consisted of 500 ms of Gaussian white noise, refreshed every 10 ms. Spike responses (dots) were generated from the GLM point process encoding model, with parameters roughly matched to those found with fits to macaque retinal ganglion ON and OFF


cells (Pillow et al., 2005a). However, for demonstration purposes, the stimulus filters were set to positive and negative delta functions of the light stimulus (for ON and OFF cells, respectively), so that band-pass filtering of the stimulus did not result in information loss, and convergence of the MAP estimate to the true stimulus could be observed more easily. Blue and red traces show the MAP estimate computed using the 2-cell and 20-cell population response, respectively. The shaded regions (2C, G) show two standard deviations of the marginal posterior uncertainty about each stimulus value, computed as $[(J_x^{-1})_{ii}]^{1/2}$, as we have discussed previously.

Note, however, that this shading provides an incomplete picture of our uncertainty about the stimulus. The posterior p(~x|D) is a probability distribution in 50 dimensions, and its curvature along each of these dimensions is captured by Jx. Eigenvectors associated with the largest and smallest eigenvalues of Jx correspond to directions in this space along which the distribution is most and least tightly constrained; these are the "features" which are "best" and "worst" represented in the population response, respectively. Slices through the (2-cell and 20-cell) posterior along these best (I) and worst (J) features illustrate how rapidly the posterior falls off as we subtract or add these stimulus patterns to the MAP estimate.

Figure 2: Illustration of MAP decoding on a 50-sample Gaussian white-noise stimulus. (A) Spike times generated by a single ON cell (gray dots) and a single OFF cell (black dots) in simulation. Black trace shows the true stimulus and blue trace is the MAP estimate x̂MAP (smoothed for ease of comparison) based on the response of these two cells. (B) x̂MAP replotted with gray region representing two standard deviations of the posterior uncertainty about stimulus value. (C-D) Best and worst-encoded stimulus features given the spike data, determined as the eigenvectors of the Hessian with largest and smallest eigenvalue, respectively. Perturbing x̂MAP along the best (worst) feature causes the fastest (slowest) falloff in the posterior. Slices through the posterior along each of these axes are shown in I and J (blue traces). (E-H) Similar plots showing x̂MAP given the response of 10 ON and 10 OFF cells; slices through the posterior along best and worst-encoded feature axes shown in I-J (red).

Implicit in this analysis of coding fidelity is the Gaussian approximation to the posterior introduced above (eq. 2). If the shape of the true posterior is poorly approximated by a


Gaussian (i.e., the log-posterior is poorly approximated by a quadratic), then the Hessian does not fully describe our posterior uncertainty. Conversely, if the posterior is approximately Gaussian, then the posterior maximum is also the posterior expectation E(~x|D), meaning that x̂MAP closely matches the Bayes estimator (optimal under mean-squared error loss). Figure 3 shows a comparison between the true posterior and the Gaussian approximation around x̂MAP for the 2-cell and 20-cell population responses shown in Figure 2. Although log-scaling of the vertical axis (bottom row) reveals significant discrepancies in the tails of the distributions, the Gaussian approximation in this case provides a close match to the shape of the central peak and an accurate description of the bulk of the probability mass under the posterior (top row).

Figure 3: Comparison of true posterior (solid) with Gaussian approximation (dashed). Left, top row: 1D slices through p(~x|D) and the Gaussian approximation along axes corresponding to best and worst encoded stimulus features (i.e., axes of least and greatest posterior variance), for the two-cell response shown in Fig. 2A. Bottom row: same distributions plotted on a log-scale, indicating mismatch in the tails and skewness in the true posterior. Right: analogous slices through the posterior and Gaussian approximation given the 20-cell population response (Fig. 2E); note the reduction in the posterior variance as more data is collected.

Next, we examined the role of the stimulus prior in MAP decoding, using a stimulus generated from a non-independent prior. Figure 4 shows an example in which the Gaussian stimulus ~x was drawn to have a power spectrum that falls as 1/F, meaning that low frequencies predominate and nearby timebins are strongly correlated. The true stimulus is plotted in black, and the left panel shows x̂MAP computed from the response of two neurons, under: (1) an i.i.d. Gaussian prior $p(\vec{x}) = \prod_t p(x_t)$ (with the one-dimensional mean and variance chosen to match the true p(~x); top); and (2) the correct 1/F Gaussian prior (bottom). Note that the likelihood term p(D|~x) is identical in both cases, i.e., it is only the prior that gives rise to the difference between the two estimates. The right panel shows a comparison using the same stimulus decoded from the responses of 10 ON and 10 OFF cells. This illustrates that although both estimates converge to the correct stimulus as the


number of neurons increases, the prior still gives rise to a significant difference in decoding performance.

Figure 4: Illustration of MAP decoding with a correlated prior. Top left: spike response of an ON cell (gray) and OFF cell (black) to a Gaussian stimulus with 1/F power structure (black trace in all plots). Middle left: MAP estimate under independent Gaussian prior (dark gray). Bottom left: MAP estimate under correct (1/F) Gaussian prior. Right: analogous comparison using responses of 10 ON and 10 OFF cells.

Figure 5 shows an example of MAP estimation applied to spatially varying stimuli. We presented a 256-dimensional spatial stimulus (A, upper left) to a set of 512 simulated ON and OFF neurons whose center-surround spatial receptive fields (top center) tiled the image plane. As expected, decoding performance improves as a function of the duration of stimulus presentation and the number of spikes collected (bottom).

We also performed decoding of a much higher-dimensional 128x128 stimulus, using the responses of 1,024 cells whose receptive fields spanned a 512-dimensional subspace (Fig. 5B). Estimating x̂MAP in the full stimulus space here is computationally prohibitive; for example, the Hessian Jx has $128^4 > 10^8$ elements. Instead, we work within the subspace spanned by the receptive fields of the observed neurons. We begin by writing the log-posterior in the form

$$\log p(\vec{x}|D) = f(\vec{x}^T A \vec{x}) + g(K\vec{x});$$

here the first term corresponds to the log-prior log p(~x), which is a zero-mean Gaussian in Fig. 5 for simplicity (corresponding to f(u) = −u/2), but which in general could be any concave, elliptically-symmetric function of ~x. The second term corresponds to the log-likelihood log p(D|~x), which only depends on ~x through a projection onto a subspace spanned by the columns of the filter matrix K.

Now, to maximize this function it turns out we may restrict our attention to a subspace; that is, we do not need to perform a search over the full high-dimensional ~x. To see this, we use a linear change of variables,

$$\vec{y} = A^{1/2}\vec{x},$$

to rewrite the log-posterior as

$$\log p(\vec{x}|D) = f(\vec{y}^T\vec{y}) + g(KA^{-1/2}\vec{y}).$$

Now, by the log-concavity of the prior p(~x), the function f(u) must be a decreasing function of u > 0. This implies, by the representer theorem (Scholkopf and Smola, 2002), that we may


always choose an optimal ~y in the space spanned by the columns of $KA^{-1/2}$; this is because increasing ~y in a direction orthogonal to $KA^{-1/2}$ does not change the second term in the log-posterior but cannot increase the first term (since f(.) is decreasing). So we may perform our optimization in the lower-dimensional subspace spanned by $KA^{-1/2}$ (in Fig. 5B, this reduces the dimensionality of the search from the intractable $128^2 = 16{,}384$ to a much more feasible 512). Then, once an optimal $\hat{y}_{MAP}$ is computed, we need only set $\hat{x}_{MAP} = A^{-1/2}\hat{y}_{MAP}$.

Figure 5: Illustration of spatial decoding. (A) Encoding and decoding of a 16x16 image (top left) with a population of 512 simulated retinal ganglion cells with center-surround receptive fields. Below: intensity plots show ON and OFF spike counts as a function of receptive field location, for short and long presentations of the spatial stimulus (above and below, respectively). The decoded image (x̂MAP) exhibits greater similarity to the original image for longer presentations. (B) Illustration of decoding in a stimulus space with higher dimensionality (dim(~x) = 16,384) than the number of cells. The decoded image was computed in the 512-dimensional subspace spanned by the receptive fields of the 1,024 neurons in the population response. In each case, the prior p(~x) was taken to be independent, with mean and variance chosen to match that of the true image.

The one catch is that we need to compute the forward and backward change-of-basis operators $A^{1/2}$ and $A^{-1/2}$, but this is easy if we are given a diagonalization of A, $A = O^T D O$; for example, in the case of a spatial 1/f prior the orthogonal matrix O is given by a Fourier transform, and $O^T$ by the inverse Fourier transform. Thus, for example, $A^{-1/2}\vec{y} = O^T D^{-1/2} O\vec{y}$ may be computed quite efficiently with two applications of the fast Fourier transform. Thus, when the prior is log-concave and elliptically-symmetric, and a convenient diagonalization for the matrix A is available, we may tractably perform MAP decoding even for very high-dimensional stimuli ~x.
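For instance, under the assumption that the prior quadratic form A is diagonalized by the discrete Fourier transform, applying A^{1/2} or A^{-1/2} reduces to an FFT, a pointwise scaling, and an inverse FFT. A small hypothetical sketch (the prior spectrum below is made up for illustration):

```python
import numpy as np

def apply_A_power(y, d_eig, power):
    """Apply A^power to a vector, for A = O^T D O with O the (unitary) Fourier
    transform and D = diag(d_eig) the prior eigenvalues (e.g. a 1/f spectrum).

    A^power y = O^T D^power O y, i.e. two FFTs and a pointwise multiply.
    """
    y_hat = np.fft.fft(y, norm="ortho")                 # O y
    y_hat *= d_eig ** power                             # D^power (O y)
    return np.real(np.fft.ifft(y_hat, norm="ortho"))    # O^T (...)

# Toy usage: a 1/f-type prior spectrum on a length-256 stimulus
d = 256
freqs = np.fft.fftfreq(d)
d_eig = 1.0 / (np.abs(freqs) + 1.0 / d)   # hypothetical prior eigenvalues
y = np.random.default_rng(2).normal(size=d)
x = apply_A_power(y, d_eig, power=-0.5)   # A^{-1/2} y
```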

2.4 Decoding Poisson image observations

Above we have been discussing the problem of decoding a one-dimensional point process (a spike train). As usual, these one-dimensional techniques generalize in a straightforward manner to higher-dimensional problems. We give a few examples here.


We have previously discussed the Poisson model of image formation in low-light (photon-limited) conditions: the observed data D = {n(x, y)} is a collection of counts of photons per pixel. We model D as a Poisson process with rate

$$\lambda(x, y) = I(x, y) * w(x, y),$$

where I(x, y) is the underlying image and w(x, y) is the point spread function. Clearly the likelihood p(D|I, w) is log-concave in I; thus, under a log-concave image prior in I (e.g. a Gaussian prior), MAP restoration of I given noisy measurements {n(x, y)} is quite tractable.

We may also draw a connection with linear smoothing methods. Assume a zero-mean Gaussian prior on the image, for simplicity,

$$\log p(I) = c - \frac{1}{2}\|AI\|_2^2$$

for some matrix A (as usual, the corresponding covariance matrix is $(A^TA)^{-1}$), and that the intensity of the image is small relative to some baseline noise level $\lambda_0 > 0$,

$$\lambda(\vec{x}) = \lambda_0 + \epsilon\, I * w(\vec{x}),$$

for some small ε > 0. Now we may expand the log-posterior:

$$\log p(I|D, w) = \sum_{x,y} n(x,y)\log\big(\lambda_0 + \epsilon\, I*w(x,y)\big) - \int \big(\lambda_0 + \epsilon\, I*w(x,y)\big)\, dx\, dy - \frac{1}{2}\|AI\|_2^2$$

$$= \text{const.} + \epsilon\left[\frac{1}{\lambda_0}\sum_{x,y} n(x,y)\Big(I*w(x,y) - \frac{\epsilon}{2\lambda_0}\big(I*w(x,y)\big)^2\Big) - \int I*w(x,y)\, dx\, dy\right] - \frac{1}{2}\|AI\|_2^2 + o(\epsilon^2),$$

for small ε, where we have used the second-order expansion of the logarithm, $\log(c+u) = \log c + \log(1+u/c) = \log c + u/c - (u/c)^2/2 + o[(u/c)^2]$. The point is that for ε not too large we may approximate our original nonlinear MAP problem with a much simpler negative definite quadratic optimization problem, which we may solve via the linear equation obtained by setting the gradient of the above expression with respect to I to zero. (We will return to this connection between the linear and MAP approaches in much more depth below.) It is a good idea to enforce a positivity constraint I(x, y) > 0 on the image as well; this transforms our original convex optimization problem into a convex optimization problem with convex constraints (i.e., the problem remains completely tractable), and our approximate quadratic problem into a similarly tractable quadratic programming problem.

Similar ideas may be applied to estimate receptive fields nonparametrically. In the setting described in (Gao et al., 2002), we observe a sequence of hand positions $(x, y)_i$ and corresponding spike counts $r_i$ from a single neuron recorded in the primary motor cortex. We would like to estimate the mean firing rate as a function of the two-dimensional variable (x, y). Assuming that the $r_i$ are conditionally i.i.d. Poisson given $(x, y)_i$, with rate λ(x, y), and that the prior on receptive fields λ(x, y) is log-concave, computing the MAP estimate for the receptive field is quite tractable, since the posterior is log-concave in the function λ(x, y).

Yet another variation on this theme arises in the context of making inferences from Positron Emission Tomography (PET) data. In this context we do not observe the underlying image directly, but rather our detectors capture a kind of projected "slice" of the image; these slices can be modeled mathematically as

$$n_i \sim \mathrm{Poiss}(K_i I),$$

where $n_i$ is the emission count observed at detector i, I is the underlying (three-dimensional) intensity image, and $K_i$ is a linear operator corresponding to the "slice" that the i-th detector takes of I. Once again, then, inferring I(x, y, z) is a highly tractable problem, due to the concavity of the log-likelihood in I.

A final example involves observations of Poisson-distributed photoisomerizations in the cone mosaic of the retina.3 Imagine we have observed an array {n_i} of photoreceptor isomerizations. The true image is defined on a (p × 3)-dimensional vector space, where p is the number of pixels and 3 is the number of color elements, and is assumed to have a truncated Gaussian distribution

$$p(\vec{x}) = (1/Z)\,\mathcal{N}_{\mu,C}(\vec{x})\,\mathbf{1}(\vec{x}\ge 0),$$

where the covariance C matches the true observed image covariance (the parameters µ and C of this rather crude Gaussian approximation may be estimated directly from data; typically the intensity covariance is approximated as translation-invariant, with a power-law spectrum), and the constraint ~x ≥ 0 follows from the fact that image intensities must be positive.

Formation of the image on the retina is modeled as follows: the true image, ~x, is mapped linearly through some possibly anisotropic blurring lens; this transformation is positive and linear, corresponding to some nonnegative matrix K (this operator K may be measured directly in the living eye using specialized adaptive optical techniques developed by (Liang et al., 1997)).

The number of isomerization events on the photoreceptor array given an image ~x can be modeled as independent Poisson,

$$p(\text{response} = \{n_i\}\,|\,\vec{x}) = \prod_i \big(e^{-\lambda_i dt}(\lambda_i dt)^{n_i}\big)/n_i!,$$

with the rates $\lambda_i$ defined as

$$\lambda_i = f\big[b + (K\vec{x})_{i,c_i}\big];$$

here f(.) is a possibly nonlinear function mapping pixel light intensity into isomerization rate, f(b) = f(b + 0) is the baseline dark-adapted isomerization rate, and $i, c_i$ denote the spatial index i and the corresponding color index $c_i$.

Now how do we infer what image produced the observed data? That is, how do we decode {n_i} into a representation of the true image ~x? One simple way to do this is to compute the MAP estimator. Here log p(~x) is a (negative definite) quadratic function of ~x. The likelihood term is, as usual,

$$\log p(\{n_i\}|\vec{x}) = c + \sum_i \Big(n_i \log f\big[b + (K\vec{x})_{i,c_i}\big] - f\big[b + (K\vec{x})_{i,c_i}\big]\, dt\Big).$$

If the function f(u) is convex and log f(u) is concave in u (e.g., if f(.) is the exponential function, or the linear rectifier), then once again log p({n_i}|~x) is a concave function of ~x, no matter what data {n_i} are observed. Since the space {~x : ~x ≥ 0} is convex and the function log p(~x|{n_i}) is strictly concave, this optimization problem may be solved via standard ascent techniques: there is a single global maximum, no matter what data {n_i} are observed. This optimum may be computed using standard constrained optimization routines, e.g. fmincon in Matlab.

3Thanks to David Brainard for this example.
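A minimal sketch of this constrained MAP computation in Python, using scipy's general-purpose bound-constrained optimizer in place of fmincon; the exponential nonlinearity and all numerical values below are assumptions for illustration, not from the text.

```python
import numpy as np
from scipy.optimize import minimize

def cone_map_estimate(n, K, b, dt, mu, Cinv):
    """Constrained MAP estimate for the cone-isomerization example, assuming
    f(.) = exp(.), a Gaussian prior N(mu, C) truncated to x >= 0, and a
    nonnegative linear operator K.  All inputs here are hypothetical."""
    def neg_log_post(x):
        u = b + K @ x
        loglik = np.sum(n * u - np.exp(u) * dt)          # Poisson loglik (up to const)
        logprior = -0.5 * (x - mu) @ Cinv @ (x - mu)     # Gaussian log-prior
        return -(loglik + logprior)

    x0 = np.clip(mu, 0, None)            # feasible starting point
    bounds = [(0, None)] * len(mu)       # positivity constraint x >= 0
    res = minimize(neg_log_post, x0, bounds=bounds, method="L-BFGS-B")
    return res.x

# Toy usage
rng = np.random.default_rng(3)
p = 30
K = 0.1 * np.abs(rng.normal(size=(p, p)))
mu = np.ones(p)
x_true = np.clip(rng.normal(loc=1.0, size=p), 0, None)
n = rng.poisson(np.exp(0.1 + K @ x_true) * 0.5)
x_map = cone_map_estimate(n, K, b=0.1, dt=0.5, mu=mu, Cinv=np.eye(p))
```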


3 Connections between the MAP estimator and the optimal linear (regression) estimator

Several important connections exist between the MAP estimate and the optimal linear estimator (OLE). To explore these connections, it is helpful to begin, as usual, by examining a slightly simpler model, in which the MAP and the OLE coincide exactly. Assume for a moment the following Gaussian model for the spike train responses ~r:

$$r_i \sim \mathcal{N}\big((K\vec{x})_i + b,\, \sigma^2\big); \qquad \vec{x} \sim \mathcal{N}(0, C),$$

where $(K\vec{x})_i$ denotes the ith element of the vector $K\vec{x}$, and b denotes the baseline firing rate. The matrix K controls the dependence of the responses $r_i$ on the stimulus ~x; we can think of K as a convolution matrix corresponding to the linear filter ~k in the GLM. In this case it is easy to see that the MAP is exactly the OLE, since the posterior distribution p(~x|~r) is Gaussian, with covariance independent of the observed ~r. In particular, in this case the OLE and the MAP are both given by

$$\hat{x}_{OLE} = \hat{x}_{MAP} = (\sigma^2 C^{-1} + K^TK)^{-1}K^T(\vec{r} - b).$$

This solution has several important and intuitive features. First, in the low signal-to-noise regime (that is, for high values of σ, or equivalently small values of K), the solution looks like $(1/\sigma^2)CK^T(\vec{r}-b)$. If the prior stimulus covariance C is white, this just corresponds to convolving ~r with ~k. Conversely, in the high-SNR regime, where K is much larger than σ, the solution looks more like $(K^TK)^{-1}K^T(\vec{r}-b)$ (assuming for simplicity that $K^TK$ has full rank), where $(K^TK)^{-1}K^T$ is the Moore-Penrose pseudoinverse (Strang, 1988) of K; thus, in the high-SNR limit the MAP (and the OLE) effectively deconvolves ~r by ~k. (Note in addition that, as expected, the effect of the prior covariance C disappears in the high-SNR limit, where the likelihood term dominates the prior term.)
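As a quick numerical illustration of this identity, the following sketch (with made-up values for K, C, σ, and b; none of these numbers come from the text) forms the estimator $(\sigma^2 C^{-1} + K^TK)^{-1}K^T(\vec{r}-b)$ directly:

```python
import numpy as np

# Minimal numerical check of the Gaussian-model identity
#   x_hat = (sigma^2 C^{-1} + K^T K)^{-1} K^T (r - b),
# using hypothetical values for K, C, sigma, and b.
rng = np.random.default_rng(4)
d, N, sigma, b = 10, 25, 0.5, 1.0
K = rng.normal(size=(N, d))
C = np.eye(d)
x = rng.multivariate_normal(np.zeros(d), C)
r = K @ x + b + sigma * rng.normal(size=N)
x_hat = np.linalg.solve(sigma**2 * np.linalg.inv(C) + K.T @ K, K.T @ (r - b))
```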

We will see that the same basic behavior (convolution in the low-SNR regime and deconvolution in the high-SNR regime) holds in our model, and that the MAP and OLE (although not equivalent in general) coincide exactly in the low-SNR limit. Let ~r denote the observed discretized responses, $\vec{r}_0$ the mean-subtracted version $\vec{r}_0 = \vec{r} - E(\vec{r})$, and once again assume that $\vec{x} \sim \mathcal{N}(0, C)$. Then the OLE, as usual, is given by

$$\hat{x}_{OLE} = \big(E(\vec{r}_0\vec{r}_0^T)^{-1} E(\vec{r}_0\vec{x}^T)\big)^T \vec{r}_0.$$

So we need to compute $E(\vec{r}_0\vec{r}_0^T)$ (the auto-covariance of the response) and $E(\vec{r}_0\vec{x}^T)$ (the covariance of the response with the stimulus); this may be done in the limit of low signal-to-noise $\|K\| \to 0$ as follows. From the GLM encoding model,

$$r_i \sim \mathrm{Poiss}\big[f\big((K\vec{x})_i + \vec{b}_i\big)\, dt\big],$$

where f is the response nonlinearity. A second-order expansion in ~x around $\vec{b}_i$ gives

$$E(\vec{r}|\vec{x}) = dt\left(f(\vec{b}) + f'(\vec{b}).K\vec{x} + \frac{1}{2}f''(\vec{b}).K\vec{x}.K\vec{x}\right) + o(\|K\|^2),$$

with '.' denoting pointwise multiplication of vectors. Given that p(~x) has mean zero and covariance C, we can compute the desired expectations:

$$E(\vec{r}_0\vec{x}^T) = dt\left(\mathrm{diag}\big(f'(\vec{b})\big)KC\right) + o(\|K\|^2)$$

and

$$E(\vec{r}_0\vec{r}_0^T) = dt\left(\mathrm{diag}\big(f(\vec{b})\big) + \mathrm{diag}\big(f'(\vec{b})\big)KCK^T\mathrm{diag}\big(f'(\vec{b})\big)\right) + o(\|K\|^2).$$

Putting these pieces together gives

$$\hat{x}_{OLE} = \left[\left(\mathrm{diag}\big(f(\vec{b})\big) + \mathrm{diag}\big(f'(\vec{b})\big)KCK^T\mathrm{diag}\big(f'(\vec{b})\big)\right)^{-1}\left(\mathrm{diag}\big(f'(\vec{b})\big)KC\right)\right]^T\vec{r}_0 + o(\|K\|^2)$$

$$= CK^T\mathrm{diag}\big(f'(\vec{b})./f(\vec{b})\big)\big(\vec{r} - dt\, f(\vec{b})\big) + o(\|K\|).$$

Thus we see that the OLE in the case of Poisson observations behaves much as in the case of Gaussian observations: in the low-SNR regime, the OLE behaves like a convolution (with $f'(\vec{b})./f(\vec{b})$ here playing the role of $1/\sigma^2$ in the Gaussian case), while as the SNR increases the optimal linear filters take on the pseudoinverse form $CK^T(KCK^T)^{-1}$. See (Bialek and Zee, 1990; Rieke et al., 1997) for further discussion of this effect.

Turning to the MAP case, we examine the log-posterior in a similar limit:

$$\log p(\vec{x}|D) = g(\vec{b} + K\vec{x})^T\vec{r} - dt\, f(\vec{b} + K\vec{x})^T\mathbf{1} - \frac{1}{2}\vec{x}^TC^{-1}\vec{x},$$

where g(.) abbreviates log f(.) and $\mathbf{1}$ is a vector of all ones. Taking a second-order expansion in ~x gives

$$\log p(\vec{x}|D) = \left(g(\vec{b}) + g'(\vec{b}).K\vec{x} + \frac{1}{2}g''(\vec{b}).K\vec{x}.K\vec{x}\right)^T\vec{r} - dt\left(f(\vec{b}) + f'(\vec{b}).K\vec{x} + \frac{1}{2}f''(\vec{b}).K\vec{x}.K\vec{x}\right)^T\mathbf{1} - \frac{1}{2}\vec{x}^TC^{-1}\vec{x} + o(\|K\|^2).$$

This expression is quadratic in ~x; if we note that $(f''(\vec{b}).K\vec{x}.K\vec{x})^T\mathbf{1}$ and $(g''(\vec{b}).K\vec{x}.K\vec{x})^T\vec{r}$ may be written in the more standard form $\vec{x}^TK^T\mathrm{diag}(f''(\vec{b}))K\vec{x}$ and $\vec{x}^TK^T\mathrm{diag}(g''(\vec{b}).\vec{r})K\vec{x}$, respectively, then we may easily optimize to obtain

$$\hat{x}_{MAP} = \left(C^{-1} - K^T\mathrm{diag}\big(\vec{r}.g''(\vec{b}) - dt\, f''(\vec{b})\big)K\right)^{-1}K^T\left(\mathrm{diag}\big(g'(\vec{b})\big)\vec{r} - \mathrm{diag}\big(f'(\vec{b})\big)\, dt\right) + o(\|K\|^2)$$

$$= CK^T\mathrm{diag}\big(f'(\vec{b})./f(\vec{b})\big)\big(\vec{r} - dt\, f(\vec{b})\big) + o(\|K\|).$$

Thus the MAP and OLE approaches match exactly in the low-SNR regime; again, for larger SNR the MAP displays pseudoinverse-like behavior. It is also worth noting that to first order in K, the posterior covariance does not depend on the observations ~r (and hence the MAP is linear in ~r); however, when we expand the log-posterior to second order the posterior covariance does include an ~r-dependent term (corresponding to the fact that the Poisson distribution has mean-dependent noise), and so the optimal decoder is no longer linear in ~r, although the decoder can still be computed by solving a single linear equation. (This might make for a good initialization for the MAP optimization in the regime of larger K; the advantage is that the OLE coefficients do not need to be precomputed, which saves us from running a long simulation to accurately compute the covariances $E(\vec{r}_0\vec{r}_0^T)$ and $E(\vec{r}_0\vec{x}^T)$.)

In the case of a GLM with an exponential nonlinearity, f(.) = exp(.), the above formulae can be simplified to provide insight into the decoding significance of model components such as a spike-history dependent term. Specifically, if the conditional intensity of the ith neuron


is $f\big((K\vec{x})_i + b_i + (B\vec{r})_i\big)$, where B is a linear operator capturing the causal dependence of the response on spike-train history, then we obtain

$$\log p(\vec{x}|D) = (\vec{b} + K\vec{x} + B\vec{r})^T\vec{r} - dt\, \exp(\vec{b} + K\vec{x} + B\vec{r})^T\mathbf{1} - \frac{1}{2}\vec{x}^TC^{-1}\vec{x};$$

optimizing to second order, as above, gives

$$\hat{x}_{MAP} = \left(C^{-1} + dt\, K^T\mathrm{diag}\big(\exp(\vec{b} + B\vec{r})\big)K\right)^{-1}K^T\left(\vec{r} - dt\, \exp(\vec{b} + B\vec{r})\right) + o(\|K\|^2)$$

$$= CK^T\left(\vec{r} - dt\, \exp(\vec{b} + B\vec{r})\right) + o(\|K\|).$$

Thus we have the rather intuitive result that spike-history effects (to first order in K) simply weight the baseline firing rate in the MAP estimate; see Fig. 6 for an illustration of this effect.

Figure 6: Left: three different post-spike recovery functions (exponentiated post-spike kernels), which multiply the conditional intensity function following a spike. These induce spike-history effects ranging from "none" (light gray) to a relative refractory period of approximately 25 ms (black). Right: MAP decoding of a single set of spike times (dots) under three GLMs that differ only in their post-spike kernels (shown at left). Spike bursts are interpreted quite differently by the three models, indicating large stimulus transients under the model with strong refractory effects (since for a burst to have occurred the stimulus must have been large enough to overcome the refractory effects), whereas isolated spikes (e.g., near 250 ms) have nearly the same decoded interpretation for all three models.

4 The Laplace approximation leads to a highly tractable method for computing information-theoretic quantities

A number of previous authors have drawn attention to the connections between the decoding problem and the problem of estimating how much information (in the Shannon sense; (Cover and Thomas, 1991)) a population spike train carries about ~x (Bialek et al., 1991; Rieke et al., 1997; Warland et al., 1997; Barbieri et al., 2004). In general, estimating this mutual information nonparametrically is quite difficult, particularly in high-dimensional spaces (Paninski, 2003), but in the case that our forward model of p(D|~x) is sufficiently accurate, the methods discussed here permit tractable computation of the mutual information. (See, e.g., (Nemenman et al., 2004; Kennel et al., 2005) for alternate approaches towards estimating the information which are model-based, but in a much weaker, more nonparametric sense than the methods developed here.)


We may write the mutual information

$$I(\vec{x}; D) = H(\vec{x}) - \int p(D)\left(-\int p(\vec{x}|D)\log p(\vec{x}|D)\, d\vec{x}\right)dD = H(\vec{x}) - H(\vec{x}|D).$$

The first term, the stimulus entropy $H(\vec{x}) = -\int p(\vec{x})\log p(\vec{x})\, d\vec{x}$, depends only on the prior stimulus distribution p(~x), which in the case of artificial stimuli is set by the experimenter (and whose entropy may often therefore be computed exactly). In the case of natural stimulus ensembles we can draw an arbitrarily large number of samples from p(~x), and may therefore still consider computing H(~x) to be a solvable problem.

The second term, which is sometimes referred to as the "noise entropy" (Strong et al., 1998), is the average residual entropy in ~x conditional on the spike data D. Although this is in general much more challenging to compute, in the case that our Gaussian approximation to the posterior is acceptably accurate, we can apply a simple shortcut by using the formula for the entropy of a Gaussian. Namely, for any specific instantiation of the observed spike data D, we have

$$-\int p(\vec{x}|D)\log p(\vec{x}|D)\, d\vec{x} \approx \frac{1}{2}\log|C| + \text{const.} = -\frac{1}{2}\log|J_x| + \text{const.},$$

where we have used the formula for the entropy of a Gaussian distribution with covariance matrix $C = J_x^{-1}$ (Cover and Thomas, 1991), with $J_x$, as above, denoting the negative Hessian of the log-posterior computed at x̂MAP. Then, we need only average this entropy over the data distribution $p(D) = \int p(\vec{x})p(D|\vec{x})\, d\vec{x}$; this averaging may be performed most easily via standard Monte Carlo numerical integration techniques (Press et al., 1992; Robert and Casella, 2005).

To summarize, computing this approximation to the information I(~x; D) requires that we:

1. Draw i.i.d. samples ~xj from the stimulus distribution p(~x)

2. Draw sample spike trains Dj from the corresponding conditional distributions p(D|~xj); this may be accomplished via our usual point-process sampling techniques (Snyder and Miller, 1991; Brown et al., 2002)

3. Compute the MAP estimate x̂MAP(Dj) and Hessian Jx(Dj) corresponding to the observed data Dj

4. Compute the approximate posterior entropy $-\frac{1}{2}\log|J_x(D_j)|$ (up to an additive constant)

5. Form the average over all of our Monte Carlo samples:

$$\int p(D)\left(-\int p(\vec{x}|D)\log p(\vec{x}|D)\, d\vec{x}\right)dD \approx \lim_{N\to\infty}\frac{1}{N}\sum_{j=1}^N \left[-\frac{1}{2}\log|J_x(D_j)|\right] + \text{const.}$$

(The limit here follows by the law of large numbers; we will discuss this Monte Carlo idea in much more depth in the next chapter.)

6. Finally, subtract the result from the prior entropy:

$$I(\vec{x}; D) \approx H(\vec{x}) - \left[\lim_{N\to\infty}\frac{1}{N}\sum_{j=1}^N \left(-\frac{1}{2}\log|J_x(D_j)|\right) + \text{const.}\right]$$

(In practice, of course, the number of Monte Carlo samples N above does not have to tend to infinity, but merely has to be large enough to make the confidence interval on the empirical average acceptably small.)

This method should give accurate estimates of the information whenever the posterior p(~x|D) may be well-approximated by a Gaussian, as is the case for the examples analyzed here (cf. Fig. 3).
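The steps above can be summarized in a short Monte Carlo routine; the sketch below assumes the user supplies the sampling and MAP/Hessian routines (all function names are placeholders, not part of the text), and reports the estimate in nats.

```python
import numpy as np

def estimate_information(sample_x, sample_spikes, map_and_hessian, prior_entropy,
                         n_samples=1000):
    """Monte Carlo sketch of the Laplace-based information estimate described above.

    sample_x()          draws a stimulus from p(x)
    sample_spikes(x)    draws spike data D from p(D|x)
    map_and_hessian(D)  returns (x_MAP, J_x), with J_x the negative Hessian at x_MAP
    prior_entropy       H(x) in nats (known exactly for, e.g., a Gaussian prior)
    """
    noise_entropies = []
    for _ in range(n_samples):
        x = sample_x()
        D = sample_spikes(x)
        x_map, J = map_and_hessian(D)
        dim = J.shape[0]
        # Gaussian entropy with C = inv(J): -(1/2) log det J + (d/2) log(2*pi*e)
        sign, logdet = np.linalg.slogdet(J)
        noise_entropies.append(-0.5 * logdet + 0.5 * dim * np.log(2 * np.pi * np.e))
    return prior_entropy - np.mean(noise_entropies)   # estimate of I(x; D) in nats
```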

We can compare this estimate with a well-known lower bound on I(~x; D) which arises from a Gaussian approximation to the OLE residuals (Bialek et al., 1991; Rieke et al., 1997). This bound may be derived as follows: we use the data processing inequality

H(~x) − H(~x|D) = I(~x; D) ≥ I(~x; x̂(D)) = H(~x) − H(~x|x̂),

where x̂ is any estimator of ~x given D, to establish that

H(~x|D) ≤ H(~x|x̂).

If we write out this conditional entropy, we see that

$$H(\vec{x}|D) \le H(\vec{x}|\hat{x}) = E_D\, h(\vec{x}|\hat{x}_D) \le E_D\left(\frac{1}{2}\log\big|\mathrm{cov}(\vec{x}|\hat{x}_D)\big| + \text{const.}\right) \le \frac{1}{2}\log\Big|E_D\big(\mathrm{cov}(\vec{x}|\hat{x}_D)\big)\Big| + \text{const.},$$

where the last inequality is Jensen's (since the log-determinant is a concave function) and the second inequality follows from the fact that Gaussian densities have the most entropy among all densities with a fixed covariance (Cover and Thomas, 1991).

This upper bound on H(~x|D) provides us with a lower bound on I(~x; D); we may estimate the last line of this bound numerically by drawing many stimulus-response pairs {~xj, Dj}, computing residuals of the estimator x̂ (in most applications this estimator x̂ is taken as the OLE, but any estimator which is a function of the data D may be employed here), $\chi_j = \vec{x}_j - \hat{x}_j$, and then computing the covariance of these residuals, $E(\chi_j\chi_j^T)$; we thus have

$$I(\vec{x}; D) \ge H(\vec{x}) - \left(\frac{1}{2}\log\big|E(\chi_j\chi_j^T)\big| + \text{const.}\right).$$
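For completeness, this residual-based lower bound can be computed in a few lines; the sketch below is hypothetical and assumes the decoding residuals χ_j have already been collected into the rows of an array.

```python
import numpy as np

def information_lower_bound(prior_entropy, residuals):
    """Gaussian-residual lower bound on I(x; D) described above:
    I >= H(x) - (1/2) log det E[chi chi^T] - (d/2) log(2*pi*e),
    where chi_j = x_j - x_hat_j are decoding residuals (rows of `residuals`)."""
    n, d = residuals.shape
    cov = residuals.T @ residuals / n                 # E(chi chi^T)
    sign, logdet = np.linalg.slogdet(cov)
    return prior_entropy - (0.5 * logdet + 0.5 * d * np.log(2 * np.pi * np.e))
```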

Figure 7 shows a comparison of the lower bound obtained by this method with an estimate of I(~x; D) obtained directly from our Gaussian approximation to the posterior. For completeness, we also compare to the lower bound obtained by using x̂MAP instead of x̂OLE in the above derivation (this second lower bound is guaranteed to be tighter if the MAP residuals are smaller than those of the OLE, as is the case here). These results indicate that at low SNR, the lower bounds may be relatively tight, but that at high SNR they may significantly underestimate the total mutual information. See (Passaglia and Troy, 2004; Chacron, 2005) for further discussion.


Figure 7: Estimates of the mutual information between a 50-dimensional stimulus (Gaussian white noise) and the spike trains of two (ON and OFF) GLM neurons, as a function of stimulus contrast. Lower bounds on the mutual information were computed as described in the text, using a Gaussian approximation to the residuals of the OLE (light gray) and MAP estimate (dark gray). A posterior-based estimate of the mutual information was obtained by averaging single-sample calculations of noise entropy (computed using a Gaussian approximation to p(~x|D)) over many stimulus-response pairs.

5 Discrimination and detection; change-point analysis

In the above we have been discussing estimation of continuous-valued vector stimuli ~x. However, it is important to note that similar methods are quite useful for two-point discrimination (aka detection) problems as well. Consider the following two-alternative forced choice (2AFC) paradigm: we observe a spike train, or population spike train, D, and are asked to discriminate between two possible known stimuli, ~x0 and ~x1, which might have produced the observed responses. In the statistics literature this 2AFC paradigm is known as testing between two simple hypotheses, and the optimal discriminator is known to be based on the posterior ratio p(~x0|D)/p(~x1|D); if this ratio is greater than some threshold value, we say that ~x0 was the stimulus, and ~x1 otherwise (Schervish, 1995). See, e.g., (Pillow et al., 2005b) for a recent application to retinal data, or (de Ruyter van Steveninck and Bialek, 1995; Rieke et al., 1997) for a nice list of applications of this idea in the classical psychophysics and neuroethology literature.

Now let us consider a slightly more general and realistic case, in which neither ~x0 nor ~x1 is known exactly. We have two hypotheses, H0 and H1, and stimuli are drawn according to two distinct distributions p(~x|H0) and p(~x|H1). Our goal is to decide which of the two distributions the stimulus was drawn from, given not the stimulus but just spiking data D. (We will discuss more concrete examples below, but for now it may be helpful to keep the following simple example in mind: ~x is a white Gaussian noise stimulus, with one mean and variance under H0 and a different mean and variance under H1; our goal is to decide between these two hypotheses. Of course, more complex distributions p(~x|H0) and p(~x|H1) are feasible.)

In this case the optimal decision rule still has the form of a posterior-ratio test

$$\frac{p(H_0|D)}{p(H_1|D)} = \frac{p(D|H_0)\,p(H_0)}{p(D|H_1)\,p(H_1)} = \frac{p(H_0)\int p(D|\vec{x})\,p(\vec{x}|H_0)\,d\vec{x}}{p(H_1)\int p(D|\vec{x})\,p(\vec{x}|H_1)\,d\vec{x}}.$$


Thus we need to marginalize out the stimulus ~x, which is not observed directly. (This ratio of marginal probabilities is called the "Bayes factor" in the Bayesian hypothesis testing literature (Kass and Raftery, 1995).) The key point is that we can directly adapt the Gaussian approximation described above to compute these integrals.

As before, we approximate

$$\int p(D|\vec{x})\,p(\vec{x}|H_i)\,d\vec{x} \approx \int w_i\, G_i(\vec{x})\,d\vec{x},$$

where $G_i(\vec{x})$ is the normalized Gaussian with mean x̂MAP and covariance $C = J_x^{-1}$, and the MAP estimate and Hessian $J_x$ in each case are computed under the prior p(~x|Hi). Now clearly $\int w_i\, G_i(\vec{x})\,d\vec{x} = w_i$, so to compute our posterior ratio we just need to compute the weight $w_i$. We know the value

$$p(D|\hat{x}_{MAP,i})\,p(\hat{x}_{MAP,i}|H_i)$$

from our explicit encoding model and prior on ~x, and we also know

$$G_i(\hat{x}_{MAP,i}) = \left[(2\pi)^{d/2}|C_i|^{1/2}\right]^{-1} = |J_i|^{1/2}/(2\pi)^{d/2}$$

from the formula for a Gaussian density. So by matching $w_i\, G_i(\hat{x}_{MAP,i}) = p(D|\hat{x}_{MAP,i})\,p(\hat{x}_{MAP,i}|H_i)$ we obtain

$$w_i = p(D|\hat{x}_{MAP,i})\,p(\hat{x}_{MAP,i}|H_i)/G_i(\hat{x}_{MAP,i}) = p(D|\hat{x}_{MAP,i})\,p(\hat{x}_{MAP,i}|H_i)\,(2\pi)^{d/2}|C_i|^{1/2},$$

and the Bayes factor reduces to

$$\frac{p(H_0|D)}{p(H_1|D)} = \frac{p(H_0)\,w_0}{p(H_1)\,w_1} = \frac{p(H_0)\,p(D|\hat{x}_{MAP,0})\,p(\hat{x}_{MAP,0}|H_0)\,|C_0|^{1/2}}{p(H_1)\,p(D|\hat{x}_{MAP,1})\,p(\hat{x}_{MAP,1}|H_1)\,|C_1|^{1/2}}.$$

Thus, once again, the computation of these marginal posterior quantities reduces to a simple determinant computation once we have obtained x̂MAP and the Hessian Jx under each hypothesis.
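In the Laplace approximation the whole computation therefore reduces to a handful of scalars; a hypothetical sketch (the function and argument names are placeholders, not from the text):

```python
import numpy as np

def log_bayes_factor(logjoint_0, J0, logjoint_1, J1, log_prior_odds=0.0):
    """Laplace-approximation sketch of the log posterior ratio derived above.

    logjoint_i : log p(D | x_MAP,i) + log p(x_MAP,i | H_i), evaluated at the MAP
                 under hypothesis H_i
    J_i        : negative Hessian of the log-posterior at x_MAP,i (so C_i = inv(J_i))
    Returns log [p(H0|D) / p(H1|D)]; the (2*pi)^{d/2} factors cancel in the ratio.
    """
    s0, logdet0 = np.linalg.slogdet(J0)
    s1, logdet1 = np.linalg.slogdet(J1)
    # log w_i = logjoint_i + (d/2) log(2*pi) - (1/2) log det J_i
    return log_prior_odds + (logjoint_0 - 0.5 * logdet0) - (logjoint_1 - 0.5 * logdet1)
```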

5.1 Optimal change-point detection

A more subtle and perhaps more behaviorally relevant situation arises when we are asked to detect the time at which the stimulus undergoes a change from class H0 to class H1 (for example, the time at which the mean or the variance of the stimulus is changed suddenly (DeWeese and Zador, 1998)). We may compute the posterior probability of "no change" exactly as before, using our Gaussian approximation-based estimate of p(D|H0). Now the likelihood that a change occurred at time t is

$$p(D|\text{change at time } t) = \int p(D|\vec{x})\,p(\vec{x}|\text{change at time } t)\,d\vec{x}.$$

Thus finding the time at which the change occurs simply requires that we compute

$$p(D|\text{change at time } t) = \int p(D|\vec{x})\,p(\vec{x}|\text{change at time } t)\,d\vec{x} \approx p(D|\hat{x}_{MAP})\,p(\hat{x}_{MAP}|\text{change at time } t)\,(2\pi)^{d/2}|C_t|^{1/2}, \qquad (3)$$


Figure 8: Illustration of change-point detection for changes in mean (left) and variance (right). (A) Spike times of an ON and OFF cell given a 100-sample (1 s) Gaussian white-noise stimulus that underwent a change in mean (from 0 to 1.5) at time t = 0.5 s. The task here is to determine from these spike times (and model parameters) if and when a change in the mean occurred, and if so how large the change was. (B) Log-likelihood ratio of the hypothesis "change in mean at time t" (eq. 3) to that of "no change", plotted as a function of t (and assuming that we know the expected change height). (C) Exponentiating this ratio and normalizing gives a probability distribution over the time at which a change occurred. (D) Alternatively, we can compute a 2D distribution over time and step size if we do not assume we know the latter. Note that the true change-time and step size are inferred fairly accurately here. (E) True stimulus (black) and MAP estimate (blue) using the correct change time and height. (F-J) Analogous plots showing detection of a step change in standard deviation (from 1 to 2) at t = 0.5 s.

where, again, the MAP estimate and approximate covariance C_t for each time t are computed under the prior distribution p(~x|change at time t).
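
In practice this is just a scan over candidate change times: for each t we re-run the MAP/Laplace computation under the shifted prior and record the approximate log marginal likelihood of eq. (3). A rough sketch follows; the per-time MAP and Hessian routines are placeholders, to be supplied by whatever optimizer implements the MAP decoder.

```python
import numpy as np

def changepoint_scan(candidate_times, log_lik, log_prior_given_change,
                     map_given_prior, hessian_given_prior):
    """Return log p(D | change at time t), eq. (3), for each candidate change time t."""
    log_ml = np.empty(len(candidate_times))
    for i, t in enumerate(candidate_times):
        log_prior_t = log_prior_given_change(t)      # log p(x | change at time t), as a function of x
        x_map = map_given_prior(log_prior_t)         # MAP estimate under this prior
        J = hessian_given_prior(log_prior_t, x_map)  # Hessian of the negative log-posterior at x_map
        _, logdet_J = np.linalg.slogdet(J)
        log_ml[i] = (log_lik(x_map) + log_prior_t(x_map)
                     + 0.5 * x_map.size * np.log(2 * np.pi) - 0.5 * logdet_J)
    return log_ml
```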

Choosing the peak of this function gives us the maximum-likelihood estimator for the change-point time; on the other hand, the overall likelihood that a change occurred at any time — which, combined with the prior and the "no change" marginal p(D|H0) computed above, yields the posterior probability that a change occurred at all — is given by averaging:

\[
p(D\,|\,\text{change at any time}) = \int p(D\,|\,\text{change at time } t)\,p(t)\,dt,
\]

with p(t) the experimentally-controlled prior distribution on change-point times (which would typically be chosen to be uniform on some interval t ∈ (a, b)). Figure 8 shows an example of the change-point detection task, illustrating detection of a change in mean (left) and a change in variance (right).
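
Numerically, with a discrete grid of candidate change times whose prior weights p(t) sum to one, the averaging and the change/no-change comparison are conveniently done with log-sum-exp. The sketch below assumes the log marginal likelihoods from the scan above plus the "no change" value, and an illustrative 50/50 prior on whether a change occurred at all.

```python
import numpy as np
from scipy.special import logsumexp

def changepoint_summary(log_ml_change_at_t, log_p_t, log_ml_nochange, prior_change=0.5):
    """Posterior over the change time, plus the posterior probability that any change occurred."""
    log_joint = log_ml_change_at_t + log_p_t        # log p(D|change at t) + log p(t)
    log_p_D_given_change = logsumexp(log_joint)     # averages over candidate change times
    post_over_t = np.exp(log_joint - log_p_D_given_change)
    log_num = np.log(prior_change) + log_p_D_given_change
    log_den = np.logaddexp(log_num, np.log(1 - prior_change) + log_ml_nochange)
    return post_over_t, np.exp(log_num - log_den)   # (p(t | D, change), p(change | D))
```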


6 Discussion

We have discussed three techniques for model-based decoding of neural spike trains: (1) efficient computational methods for computing the MAP estimate, based on the GLM encoding model; (2) a tractable method for estimating the mutual information between the stimulus and the response; and (3) methods for change-point detection based on marginal likelihood. These three ideas are connected by a simple Gaussian approximation of the (log-concave) posterior p(~x|D): the MAP decoder corresponds to the peak location of this Gaussian; our estimate of the mutual information corresponds to the width of this Gaussian relative to the width (i.e., entropy) of the prior distribution (and more generally, the width of this distribution provides errorbars on our MAP estimates); finally, the marginal likelihood corresponds to the height of the (unnormalized) Gaussian approximation to p(~x, D), relative to the height of the normalized Gaussian p(~x|D).

We close this section by noting a major open direction for extending the applicability of the MAP decoding methods described here. As we have discussed before, one of the major strengths of the GLM encoding model (and related models (Paninski et al., 2004; Paninski, 2005; Kulkarni and Paninski, 2006)) is that we may very easily incorporate nonlinear terms into the model. That is, instead of restricting our attention to models of the form λ(t) = f(b + ~k·~x + Σ_j h(t − t_j)), we may incorporate nonlinear terms, λ(t) = f(b + ~k·~x + Σ_i a_i F_i(~x) + Σ_j h(t − t_j)), and estimate the weights a_i by concave ML, just like the other model parameters (b, ~k, h(·)); this greatly increases the flexibility and power of this model. However, while this nonlinear model retains its concavity in the parameters ~a, unfortunately we lose the key concavity in the stimulus ~x, and the likelihood of ~x is therefore prone to non-global local maxima. Handling this issue constitutes an important avenue for future research.
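
For concreteness, evaluating this augmented intensity is straightforward; a minimal sketch is below, with illustrative feature functions F_i and an exponential nonlinearity f (all parameter names are hypothetical). The difficulty noted above lies not in evaluating λ(t) but in maximizing the resulting likelihood over ~x, which is no longer a concave problem.

```python
import numpy as np

def glm_intensity_nonlinear(t_grid, x, spike_times, b, k, a, features, h, f=np.exp):
    """lambda(t) = f(b + k.x + sum_i a_i F_i(x) + sum_j h(t - t_j)) on a grid of times."""
    drive = b + np.dot(k, x) + sum(ai * Fi(x) for ai, Fi in zip(a, features))
    history = sum(np.where(t_grid > tj, h(t_grid - tj), 0.0) for tj in spike_times)
    return f(drive + history)
```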

6.1 Extensions: fully-Bayesian techniques

Another important extension is to employ more general "fully Bayesian" methods, in which we compute these integrals via Monte Carlo techniques (Robert and Casella, 2005), which are asymptotically exact, instead of the computationally cheaper Gaussian approximation used here. This extension is important for two reasons: first, the MAP estimator can have a large average error in cases where the stimulus prior is too flat and the likelihood term p(D|~x) poorly constrains our estimate of ~x (e.g., uniformly-distributed ~x; data not shown); in this case, we expect the posterior mean estimate E(~x|D) to be superior. Second, fully Bayesian methods allow us to consider a wider variety of convex cost functions than does the MAP framework; this flexibility in the choice of cost function may be important for some decoding applications.

A large variety of computational methods have been developed and intensively studied for computing the necessary integrals. In cases where the tree decomposition holds (e.g., the state-space models discussed above), sequential importance sampling methods (particle filtering) can be quite effective (Doucet et al., 2001; Brockwell et al., 2004; Kelly and Lee, 2004). More generally, methods based on the Metropolis-Hastings algorithm (Robert and Casella, 2005) may be applied (Cronin et al., 2006; Rigat et al., 2006); we will discuss these methods next.
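
As a pointer in that direction, here is a bare-bones random-walk Metropolis-Hastings sketch for drawing samples from p(~x|D) and forming the posterior mean E(~x|D). The proposal scale, chain length, and the assumption that log_post returns log p(D|~x) + log p(~x) up to a constant are all illustrative choices, not a tuned implementation of any method cited above.

```python
import numpy as np

def metropolis_hastings(log_post, x_init, n_samples=5000, step=0.1, rng=None):
    """Random-walk Metropolis-Hastings targeting exp(log_post); returns the chain of samples."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x_init, dtype=float).copy()
    lp = log_post(x)
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal(x.size)   # symmetric Gaussian proposal
        lp_prop = log_post(proposal)
        if np.log(rng.random()) < lp_prop - lp:              # accept with prob min(1, ratio)
            x, lp = proposal, lp_prop
        samples[i] = x
    return samples                                           # samples.mean(axis=0) approximates E(x|D)
```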

References

Abbott, L. and Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11:91–101.


Barbieri, R., Frank, L., Nguyen, D., Quirk, M., Solo, V., Wilson, M., and Brown, E. (2004). Dynamic analyses of information encoding in neural ensembles. Neural Computation, 16:277–307.

Bialek, W., Rieke, F., de Ruyter van Steveninck, R., and Warland, D. (1991). Reading a neural code. Science, 252:1854–1857.

Bialek, W. and Zee, A. (1990). Coding and computation with neural spike trains. Journal of Statistical Physics, 59:103–115.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Brockwell, A., Rojas, A., and Kass, R. (2004). Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91:1899–1907.

Brown, E., Barbieri, R., Ventura, V., Kass, R., and Frank, L. (2002). The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14:325–346.

Brown, E., Frank, L., Tang, D., Quirk, M., and Wilson, M. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18:7411–7425.

Chacron, M. (2005). Nonlinear information processing in a model sensory system. Journal of Neurophysiology, 95:2933–2946.

Cover, T. and Thomas, J. (1991). Elements of Information Theory. Wiley, New York.

Cronin, B., Schummers, J., Koerding, K., and Sur, M. (2006). Bayesian sampling methods for the analysis of reverse correlation data. SFN Abstracts.

de Ruyter van Steveninck, R. and Bialek, W. (1995). Reliability and statistical efficiency of a blowfly movement-sensitive neuron. Phil. Trans. R. Soc. Lond. Ser. B, 348:321–340.

DeWeese, M. and Zador, A. (1998). Asymmetric dynamics in optimal variance adaptation. Neural Computation, 10(5):1179–1202.

Donoghue, J. (2002). Connecting cortex to machines: recent advances in brain interfaces. Nature Neuroscience, 5:1085–1088.

Doucet, A., de Freitas, N., and Gordon, N., editors (2001). Sequential Monte Carlo Methods in Practice. Springer.

Eichhorn, J., Tolias, A., Zien, A., Kuss, M., Rasmussen, C., Weston, J., Logothetis, N., and Schoelkopf, B. (2004). Prediction on spike data using kernel algorithms. NIPS, 16.

Gao, Y., Black, M., Bienenstock, E., Shoham, S., and Donoghue, J. (2002). Probabilistic inference of arm motion from neural activity in motor cortex. NIPS, 14:221–228.

Haag, J. and Borst, A. (1997). Encoding of visual motion information and reliability in spiking and graded potential neurons. Journal of Neuroscience, 17:4809–4819.


Humphrey, D., Schmidt, E., and Thompson, W. (1970). Predicting measures of motor performance from multiple cortical spike trains. Science, 170:758–762.

Huys, Q., Ahrens, M., and Paninski, L. (2006). Efficient estimation of detailed single-neuron models. Journal of Neurophysiology, 96:872–890.

Jordan, M. I., editor (1999). Learning in Graphical Models. MIT Press, Cambridge, MA.

Karmeier, K., Krapp, H., and Egelhaaf, M. (2005). Population coding of self-motion: Applying Bayesian analysis to a population of visual interneurons in the fly. Journal of Neurophysiology, 94:2182–2194.

Kass, R. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90:773–795.

Kelly, R. and Lee, T. (2004). Decoding V1 neuronal activity using particle filtering with Volterra kernels. Advances in Neural Information Processing Systems, 15:1359–1366.

Kennel, M., Shlens, J., Abarbanel, H., and Chichilnisky, E. (2005). Estimating entropy rates with Bayesian confidence intervals. Neural Computation, 17:1531–1576.

Kulkarni, J. and Paninski, L. (2006). Common-input models for multiple neural spike-train data. COSYNE'06.

Liang, J., Williams, D., and Miller, D. (1997). Supernormal vision and high-resolution retinal imaging through adaptive optics. J. Opt. Soc. Am. A, 14:2884–2892.

Marmarelis, P. and Marmarelis, V. (1978). Analysis of Physiological Systems: The White-Noise Approach. Plenum Press, New York.

Maynard, E., Hatsopoulos, N., Ojakangas, C., Acuna, B., Sanes, J., Normann, R., and Donoghue, J. (1999). Neuronal interactions improve cortical population coding of movement direction. Journal of Neuroscience, 19:8083–8093.

Minka, T. (2001). A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT.

Nemenman, I., Bialek, W., and de Ruyter van Steveninck, R. (2004). Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69:056111.

Nicolelis, M., Dimitrov, D., Carmena, J., Crist, R., Lehew, G., Kralik, J., and Wise, S. (2003). Chronic, multisite, multielectrode recordings in macaque monkeys. PNAS, 100:11041–11046.

Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15:1191–1253.

Paninski, L. (2004). Maximum likelihood estimation of cascade point-process neural encoding models. Network: Computation in Neural Systems, 15:243–262.

Paninski, L. (2005). Log-concavity results on Gaussian process methods for supervised and unsupervised learning. Advances in Neural Information Processing Systems, 17.


Paninski, L., Pillow, J., and Simoncelli, E. (2004). Maximum likelihood estimation of a stochastic integrate-and-fire neural model. Neural Computation, 16:2533–2561.

Passaglia, C. and Troy, J. (2004). Information transmission rates of cat retinal ganglion cells. Journal of Neurophysiology, 91:1217–1229.

Pillow, J. and Paninski, L. (2007). Model-based decoding, information estimation, and change-point detection in multi-neuron spike trains. Submitted.

Pillow, J., Paninski, L., Shlens, J., Simoncelli, E., and Chichilnisky, E. (2005a). Modeling multi-neuronal responses in primate retinal ganglion cells. Comp. Sys. Neur. '05.

Pillow, J., Paninski, L., Uzzell, V., Simoncelli, E., and Chichilnisky, E. (2005b). Prediction and decoding of retinal ganglion cell responses with a probabilistic spiking model. Journal of Neuroscience, 25:11003–11013.

Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. (1992). Numerical Recipes in C. Cambridge University Press.

Rieke, F., Warland, D., de Ruyter van Steveninck, R., and Bialek, W. (1997). Spikes: Exploring the Neural Code. MIT Press, Cambridge.

Rigat, F., de Gunst, M., and van Pelt, J. (2006). Bayesian modelling and analysis of spatio-temporal neuronal networks. Bayesian Analysis, in press.

Robert, C. and Casella, G. (2005). Monte Carlo Statistical Methods. Springer.

Salinas, E. and Abbott, L. (2001). Coordinate transformations in the visual system: How to generate gain fields and what to compute with them. In Principles of Neural Ensemble and Distributed Coding in the Nervous System, pages 175–190. Elsevier.

Sanger, T. (1994). Theoretical considerations for the analysis of population coding in motor cortex. Neural Computation, 6:12–21.

Schervish, M. (1995). Theory of Statistics. Springer-Verlag, New York.

Scholkopf, B. and Smola, A. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press.

Serruya, M., Hatsopoulos, N., Paninski, L., Fellows, M., and Donoghue, J. (2002). Instant neural control of a movement signal. Nature, 416:141–142.

Shoham, S., Paninski, L., Fellows, M., Hatsopoulos, N., Donoghue, J., and Normann, R. (2005). Optimal decoding for a primary motor cortical brain-computer interface. IEEE Transactions on Biomedical Engineering, 52:1312–1322.

Shpigelman, L., Singer, Y., Paz, R., and Vaadia, E. (2003). Spikernels: embedding spike neurons in inner product spaces. NIPS, 15.

Snyder, D. and Miller, M. (1991). Random Point Processes in Time and Space. Springer-Verlag.

Strang, G. (1988). Linear Algebra and Its Applications. Harcourt Brace, New York.


Strong, S., Koberle, R., de Ruyter van Steveninck, R., and Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80:197–202.

Theunissen, F., Roddey, J., Stufflebeam, S., Clague, H., and Miller, J. (1996). Information theoretic analysis of dynamical encoding by four primary sensory interneurons in the cricket cercal system. Journal of Neurophysiology, 75:1345–1364.

Warland, D., Reinagel, P., and Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. Journal of Neurophysiology, 78:2336–2350.

Wu, W., Black, M. J., Mumford, D., Gao, Y., Bienenstock, E., and Donoghue, J. (2004). Modeling and decoding motor cortical activity using a switching Kalman filter. IEEE Transactions on Biomedical Engineering, 51:933–942.

Yu, B., Afshar, A., Santhanam, G., Ryu, S., Shenoy, K., and Sahani, M. (2006). Extracting dynamical structure embedded in neural activity. NIPS.

Zhang, K., Ginzburg, I., McNaughton, B., and Sejnowski, T. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79:1017–1044.
