
Compressed Sensing, Sparsity, and Dimensionality in Neuronal Information Processing and Data Analysis

Surya Ganguli¹ and Haim Sompolinsky²,³

¹Department of Applied Physics, Stanford University, Stanford, California 94305; email: [email protected]
²Edmond and Lily Safra Center for Brain Sciences, Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel; email: haim@fiz.huji.ac.il
³Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138

    Annu. Rev. Neurosci. 2012. 35:485–508

First published online as a Review in Advance on April 5, 2012

The Annual Review of Neuroscience is online at neuro.annualreviews.org

This article's doi: 10.1146/annurev-neuro-062111-150410

Copyright © 2012 by Annual Reviews. All rights reserved

    0147-006X/12/0721-0485$20.00

Keywords
random projections, connectomics, imaging, memory, communication, learning, generalization

Abstract
The curse of dimensionality poses severe challenges to both technical and conceptual progress in neuroscience. In particular, it plagues our ability to acquire, process, and model high-dimensional data sets. Moreover, neural systems must cope with the challenge of processing data in high dimensions to learn and operate successfully within a complex world. We review recent mathematical advances that provide ways to combat dimensionality in specific situations. These advances shed light on two dual questions in neuroscience. First, how can we as neuroscientists rapidly acquire high-dimensional data from the brain and subsequently extract meaningful models from limited amounts of these data? And second, how do brains themselves process information in their intrinsically high-dimensional patterns of neural activity as well as learn meaningful, generalizable models of the external world from limited experience?


Contents

INTRODUCTION
ADVANCES IN THE THEORY OF HIGH-DIMENSIONAL STATISTICS
    The Compressed Sensing Framework: Incoherence and Randomness
    L1 Minimization: A Nonlinear Recovery Algorithm
    Dimensionality Reduction by Random Projections
    Compressed Computation
    Approximate Sparsity and Noise
    Sparse Models of High-Dimensional Data
    Dictionary Learning
COMPRESSED SENSING OF THE BRAIN
    Rapid Functional Imaging
    Fluorescence Microscopy
    Gene-Expression Analysis
    Compressed Connectomics
COMPRESSED SENSING BY THE BRAIN
    Semantic Similarity and Random Projections
    Short-Term Memory in Neuronal Networks
SPARSE EXPANDED NEURONAL REPRESENTATIONS
    Neuronal Implementations of L1 Minimization
    Compression and Expansion in Long-Range Brain Communication
LEARNING IN HIGH-DIMENSIONAL SYNAPTIC WEIGHT SPACES
    Neural Learning of Classification
    Optimality and Sparsity of Synaptic Weights
DISCUSSION
    Dimensionality Reduction: CS versus Efficient Coding
    Expansion and Sparsification: Compressed Sensing versus Independent Components Analysis
    Beyond Linear Projections: Neuronal Nonlinearities

INTRODUCTION

For most of its history, neuroscience has made wonderful progress by considering problems whose descriptions require only a small number of variables. For example, Hodgkin & Huxley (1952) discovered the mechanism of the nerve impulse by studying the relationship between two variables: the voltage and the current across the cell membrane. But as we have started to explore more complex problems, such as the brain's ability to process images and sounds, neuroscientists have had to analyze many variables at once. For example, any given gray-scale image requires N analog variables, or pixel intensities, for its description, where N could be on the order of 1 million. Similarly, such images could be represented in the firing-rate patterns of many neurons, with each neuron's firing rate being a single analog variable. The number of variables required to describe a space of objects is known as the dimensionality of that space; i.e., the dimensionality of the space of all possible images of a given size equals the number of pixels, whereas the dimensionality of the space of all possible neuronal firing-rate patterns in a given brain area equals the number of neurons in that area. Thus our quest to understand how networks of neurons store and process information depends crucially on our ability to measure and understand the relationships between high-dimensional spaces of stimuli and neuronal activity patterns.


However, the problem of measuring and finding statistical relationships between patterns becomes more difficult as their dimensionality increases. This phenomenon is known as the curse of dimensionality. One approach to addressing this problem is to somehow reduce the number of variables required to describe the patterns in question, a process known as dimensionality reduction. We can do this, for example, with natural images, which are a highly restricted subset of all possible images, so that they can be described by many fewer variables than the number of pixels. In particular, natural images are often sparse in the sense that if you view them in the wavelet domain (roughly as a superposition of edges), only a very small number K of wavelet coefficients will have significant power, where K can be on the order of 20,000 for a 1-million-pixel image. This observation underlies JPEG 2000 compression, which computes all possible wavelet coefficients and keeps only the K largest (Taubman et al. 2002). Similarly, the neuronal activity patterns that actually occur are often a highly restricted subset of all possible patterns (Ganguli et al. 2008a, Yu et al. 2009, Machens et al. 2010), in the sense that they often lie along a low, K-dimensional manifold embedded in N-dimensional firing-rate space; by this we mean that only K numbers are required to uniquely specify any observed activity pattern across N neurons, where K can be much smaller than N. As a concrete example, consider the set of visual activity patterns in N neurons in response to a bar presented at a variety of orientations. As the orientation varies, the elicited firing-rate responses trace out a circle, or a one-dimensional manifold in N-dimensional space.

More generally, given a class of apparently high-dimensional stimuli, or neuronal activity patterns, how can either we or neural systems extract a small number of variables to describe these patterns without losing too much important information? Machine learning provides a variety of algorithms to perform this dimensionality reduction, but they are often computationally expensive in terms of running time. Moreover, how neuronal circuits could implement many of these algorithms is not clear. However, recent advances in an emerging field of high-dimensional statistics (Donoho 2000, Baraniuk 2011) have revealed a surprisingly simple yet powerful method of performing dimensionality reduction: One can randomly project patterns into a lower-dimensional space. To understand the central concept of a random projection (RP), it is useful to think of the shadow of a wire-frame object in three-dimensional space projected onto a two-dimensional screen by shining a light beam on the object. For poorly chosen angles of light, the shadow may lose important information about the wire-frame object. For example, if the axis of light is aligned with any segment of wire, that entire length of wire will have a single point as its shadow. However, if the axis of light is chosen randomly, it is highly unlikely that the same degenerate situation will occur; instead, every length of wire will have a corresponding nonzero length of shadow. Thus the shadow, obtained by this RP, generically retains much information about the wire-frame object.

In the context of image acquisition, an RP of an image down to an M-dimensional space can be obtained by taking M measurements of the image, where each measurement consists of a weighted sum of all the pixel intensities, and allowing the weights themselves to be chosen randomly (for example, drawn independently from a Gaussian distribution). Thus the original image (i.e., the wire-frame structure) is described by M measurements (i.e., its shadow) obtained by projecting against a random set of weights (i.e., a random light angle). Now, the field of compressed sensing (CS) (Candes et al. 2006, Candes & Tao 2006, Donoho 2006; see Baraniuk 2007, Candes & Wakin 2008, Bruckstein et al. 2009 for reviews) shows that the shadow can contain enough information to reconstruct the original image (i.e., all N pixel values) as long as the original image is sparse enough.


In particular, if the space of the images in question can be described by K variables, then as long as M is slightly larger than K, CS provides an algorithm (called L1 minimization, described below) to reconstruct the image. Thus for typical images, we can simultaneously sense and compress 1-million-pixel images with ∼20,000 random measurements. As we review below, these CS results have significant implications for data acquisition in neuroscience.

Furthermore, in the context of neuronal information processing, an RP of neuronal activity in an upstream brain region consisting of N neurons can be achieved by synaptic mapping to a downstream region consisting of M < N neurons, where the downstream neurons' firing rates are obtained by linearly summing the firing rates of the upstream neurons through a set of random synaptic weights. Thus the downstream activity constitutes a shadow of the upstream activity through an RP determined by the synaptic weights (i.e., the angle of light). As we review below, the theory of CS and RPs can provide a theoretical framework for understanding one of the most salient aspects of neuronal information processing: radical changes in the dimensionality, and sometimes sparsity, of neuronal representations, often within a single stage of synaptic transformation.

Finally, another application of CS is the problem of modeling high-dimensional data. This is challenging because such models have high-dimensional parameter spaces, necessitating many example data points to learn the correct parameter values. Neural systems face a similar challenge in searching high-dimensional synaptic weight spaces to learn generalizable rules from limited experience. We review how regularization techniques (Tibshirani 1996, Efron et al. 2004) closely related to CS allow statisticians and neural systems alike to rapidly learn sparse models of high-dimensional data from limited examples.

ADVANCES IN THE THEORY OF HIGH-DIMENSIONAL STATISTICS

Before we describe the applicability of CS and RPs to the acquisition and analysis of data and to neuronal information processing and learning, we first give in this section a more precise overview of recent results in high-dimensional statistics. We begin by giving an overview of the CS framework and define the mathematical notation we use throughout this review. Subsequently, a reader who is interested mainly in applications can skip the rest of this section. Here, we discuss how to recover sparse signals from small numbers of measurements, even in the presence of approximate sparsity and noise, and we discuss RPs and sparse regression in more detail. Finally, we discuss dictionary learning, an approach to finding bases in which ensembles of signals are sparse.

The Compressed Sensing Framework: Incoherence and Randomness

We now formalize the intuitions given in the introduction and describe the mathematical notation that we use throughout this review (see also Figure 1). We let u0 be an N-dimensional signal that we wish to measure. Thus u0 is a vector with components u0_i for i = 1, . . . , N, where each u0_i can take an analog value. In the example of an image, u0_i would be the gray-scale intensity of the ith pixel. The M linear measurements of u0 are of the form xµ = bµ · u0 for µ = 1, . . . , M. Here we think of xµ as the analog outcome of measurement µ, obtained by computing the overlap, or dot product, between the unknown signal u0 and a measurement vector bµ. We can summarize the relationship between the signal and the measurements via the matrix relationship x = Bu0. Here B is an M × N measurement matrix whose µth row is the vector bµ, and x is a measurement vector whose µth component is xµ. Now suppose the true signal u0 is sparse in a basis given by the columns of an N × N matrix C. By this we mean that u0 = Cs0, where s0 is a sparse N-dimensional vector, in the sense that it has a relatively small number K of nonzero elements, though we do not know ahead of time which K of the N components are nonzero.


Figure 1
Framework of compressed sensing (CS). A high-dimensional signal u0 is sparse in a basis given by the columns of a matrix C, so that u0 = Cs0, where s0 is a sparse coefficient vector. Through a set of measurements given by the rows of B, u0 is compressed to a low-dimensional space of measurements x. If the measurements are incoherent with respect to the sparsity basis, then L1 minimization can recover a good estimate ŝ of the sparse coefficients s0 from x, and an estimate of u0 can then be recovered by expanding ŝ in the basis C.

For example, when u0 is an image in the pixel basis, s0 could be the wavelet coefficients of that same image, and the columns of C would comprise a complete basis of orthonormal wavelets. Finally, the overall relationship between the measurements and the sparse coefficients is given by x = As0, where A = BC. We often refer to A also as the measurement matrix.
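
As a concrete illustration of this notation, the following minimal NumPy sketch builds a random Gaussian measurement matrix B, a sparsity basis C (taken here to be the identity purely for simplicity; the text does not fix a particular basis), a K-sparse coefficient vector s0, and the resulting measurements x = Bu0 = As0. All sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 200, 60, 10           # signal dimension, measurements, nonzero coefficients (illustrative)

# Sparsity basis C: the identity here purely for simplicity; for images C could
# instead hold an orthonormal wavelet basis in its columns.
C = np.eye(N)

# Sparse coefficient vector s0 with K nonzero entries at random locations.
s0 = np.zeros(N)
s0[rng.choice(N, size=K, replace=False)] = rng.standard_normal(K)

# Measurement matrix B with i.i.d. Gaussian entries (incoherent with any fixed basis).
B = rng.standard_normal((M, N)) / np.sqrt(M)

u0 = C @ s0                     # the high-dimensional signal, u0 = C s0
A = B @ C                       # effective measurement matrix, A = B C
x = B @ u0                      # M linear measurements, x = B u0 = A s0

print(x.shape)                  # (60,)
```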

An important question is, given a sparsity basis C, what should we choose as our measurement basis B? Consider what might happen if we measured signals in the same basis in which they were sparse. For example, in the case of an image, one could directly measure M randomly chosen wavelet coefficients of the image, where M is just a little larger than K. The problem, of course, is that for any given image, it is highly unlikely that all the K coefficients with large power coincide with the M coefficients we chose to measure. So unless the number of measurements M equals the dimensionality of the image, N, we will inevitably miss important coefficients. In the wire-frame shadow example above, this is the analog of choosing a poor angle of light (i.e., measurement basis) that aligns with a segment of wire (i.e., sparsity basis), which causes information loss.

To circumvent this problem, one of the key ideas of CS is that we should make our measurements as different as possible from the domain in which the signal is sparse (i.e., shine light at an angle that does not align with any segment of the wire frame). In particular, the measurement vectors should have many nonzero elements in the domain in which the image is sparse. This notion of difference is captured by the mathematical definition of incoherence, or a small value of the maximal inner product between rows of B and columns of C, so that no measurement vector should look like any sparsity vector. CS provides mathematical guarantees that one can achieve perfect recovery with a number of measurements M that is only slightly larger than K, as long as the M measurement vectors are sufficiently incoherent with respect to the sparsity domain (Candes & Romberg 2007).

An important observation is that any set of measurement vectors that are themselves random will be incoherent with respect to any fixed sparsity domain. For example, the elements of each such measurement vector can be drawn independently from a Gaussian distribution. Intuitively, it is highly unlikely for a random vector to look like a sparsity vector (just as it is unlikely for a random light angle to align with a wire segment). One of the key results of CS is that with such random measurement vectors, only

    M > O(K log(N/K))    (1)

measurements are needed to guarantee perfect signal reconstruction with high probability (Candes & Tao 2005, Baraniuk et al. 2008, Candes & Plan 2010).


Thus random measurements constitute a universal measurement strategy in the sense that they will work for signals that are sparse in any basis. Indeed, the sparsity basis need not even be known when the measurements are chosen; its knowledge is required only after the measurements are taken, during the nonlinear reconstruction process. Remarkably, investigators have further shown that no measurement matrix and no reconstruction algorithm can yield sparse signal recovery with substantially fewer measurements than the number given in Equation 1 (Candes & Tao 2006, Donoho 2006).

L1 Minimization: A Nonlinear Recovery Algorithm

Given only our measurements x, how can we recover the unknown signal u0? One could potentially do this by inverting the relationship between measurements and signal, solving for an unknown candidate signal u in the equation x = Bu. This is a set of M equations, one for each measurement, with N unknowns, one for each component of the candidate signal u. If the number of independent measurements M is greater than or equal to the dimensionality N of the signal, then the set of equations x = Bu has a unique solution u = u0; thus, solving these equations will recover the true signal u0. However, if M < N, the set of equations x = Bu no longer has a unique solution. Indeed, there is generically an (N − M)-dimensional space of candidate signals u that satisfy the measurement constraints. How might we find the true signal u0 in this large space of candidate signals?

If we know nothing further about the true signal u0, then the situation is indeed hopeless. However, if u0 = Cs0 where s0 is sparse, we can try to exploit this prior knowledge as follows (see Figure 1). First, the measurements are linearly related to the sparse coefficients s0 through the M equations x = As0, where A = BC is an M × N matrix. Again, when M < N, there is a large (N − M)-dimensional space of solutions s to the measurement constraint x = As. However, not all of them will be sparse, as we expect the true solution s0 to be. Thus one might try to construct an estimate ŝ of s0 by solving the optimization problem

    ŝ = arg min_s Σ_{i=1}^{N} V(s_i)   subject to   x = As,    (2)

where V(s) is any cost function that penalizes nonzero values of s. A natural choice is V(s) = 0 if s = 0 and V(s) = 1 otherwise. With this choice, Equation 2 says that our estimate ŝ is obtained by searching, in the space of all candidate signals s that satisfy the measurement constraints x = As, for the one that has the smallest number of nonzero elements. This approach, while reasonable given the prior knowledge that the true signal s0 has a small number of nonzero coefficients, unfortunately yields a computationally intractable combinatorial optimization problem; to solve it, one must essentially search over all subsets of possible nonzero elements of s.

An alternative approach, adopted by CS, is to solve a related and potentially easier problem by choosing V(s) = |s|. The quantity Σ_{i=1}^{N} |s_i| is known as the L1 norm of s; hence, this method is called L1 minimization. The advantage of this choice is that the L1 norm is a convex function on the space of candidate signals, which implies that the optimization problem in Equation 2, with V(s) = |s|, has no (nonglobal) local minima, and there are efficient algorithms for finding the global minimum using methods of linear programming (Boyd & Vandenberghe 2004), message passing (Donoho et al. 2009), and neural circuit dynamics (see below). CS theory shows that with an appropriate choice of A, L1 minimization exactly recovers the true signal, so that ŝ = s0, with a number of measurements that is roughly proportional to the number of nonzero elements in the source, K, which can be much smaller than the dimensionality N of the signal.

A popular and even simpler reconstruction algorithm is L2 minimization, in which V(s) = s² in Equation 2. This choice can arise as a consequence of oft-used Gaussian priors on the unknown signal and leads to an estimate that is simply linearly related to the measurements through the pseudoinverse relation ŝ = Aᵀ(AAᵀ)⁻¹x.


Figure 2
Geometry of compressed sensing (CS). (a) A geometric interpretation of L2 minimization. An unknown N = 2 dimensional sparse signal s0 with K = 1 nonzero components is measured using M = 1 linear measurement, yielding a one-dimensional space of candidate signals consistent with the measurement constraints (red line). The estimate ŝ is the candidate signal with the smallest L2 norm and can be found geometrically by expanding the locus of points with a fixed and increasing L2 norm (the olive circles) until the locus first intersects the allowed space of candidate signals. This intersection point is the L2 estimate ŝ, which is different from the true signal s0. (b) In the identical scenario as in panel a, L1 minimization recovers an estimate by expanding the locus of points with the same L1 norm (blue diamonds); in this case, the expanding locus first intersects the space of candidate signals at the true signal s0, so that perfect recovery ŝ = s0 is achieved. Of course, a sparse signal could also have been located on the other coordinate axis, in which case L1 minimization would have failed to recover s0 accurately. (c) An unknown sparse signal s0 of dimension N = 200, with f = K/N = 0.2; i.e., 20% of its elements are nonzero. (d) An estimate ŝ (red dots), recovered from M = 120 random linear measurements of s0 (α = M/N = 0.6, or 60% subsampling) by L2 minimization, superimposed on the true signal s0. (e) From the same measurements as in panel d, L1 minimization yields an estimate ŝ (red dots) that coincides with the true signal. Note that the parameters f = 0.2 and α = 0.6 lie just above the phase boundary for perfect recovery in Figure 3.

Figure 2 provides heuristic intuition for the utility of L1 minimization and its superior performance over L2 minimization in the case of sparse signals.
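
For comparison with the L1 approach, the minimum-L2-norm estimate can be computed in closed form. The following sketch (illustrative sizes, identity sparsity basis assumed) shows that this linear estimate satisfies the measurements but is not sparse.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 200, 80, 10                        # illustrative sizes
A = rng.standard_normal((M, N)) / np.sqrt(M)
s0 = np.zeros(N)
s0[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
x = A @ s0

# Minimum-L2-norm solution of the underdetermined system x = A s:
# s_l2 = A^T (A A^T)^(-1) x  (equivalently, np.linalg.pinv(A) @ x).
s_l2 = A.T @ np.linalg.solve(A @ A.T, x)

# The L2 estimate satisfies the measurements but spreads energy over nearly all
# coordinates instead of concentrating it on the true K-element support.
print("measurement residual:", np.linalg.norm(A @ s_l2 - x))
print("fraction of coordinates with |s_l2| > 1e-3:", np.mean(np.abs(s_l2) > 1e-3))
```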

An interesting observation is that the bound in Equation 1 represents a sufficient condition on the number of measurements M for perfect signal recovery. Alternately, recent work on the typical behavior of CS in the limit where M and N are large has revealed that the performance of CS is surprisingly insensitive to the details of the measurement matrix A and the unknown signal s0 and depends only on the degree of subsampling α = M/N and the signal sparsity f = K/N. In the α–f plane, there is a universal, critical phase boundary αc(f) such that if α > αc(f), then L1 minimization will typically yield perfect signal reconstruction, whereas if α < αc(f), it will yield a nonzero error (see Figure 3) (Donoho & Tanner 2005a,b, Donoho et al. 2009, Kabashima et al. 2009, Ganguli & Sompolinsky 2010b).

Dimensionality Reduction by Random Projections

The above CS results can be understood using the theory of RPs.


Figure 3
Phase transition in compressed sensing (CS) (reproduced from Ganguli & Sompolinsky 2010b). We use linear programming to solve Equation 2 50 times for each value of α and f, in increments of 0.01, with N = 500. The grey transition region shows where the fraction of trials with perfect recovery is neither 0 nor 1. The red curve is the theoretical phase boundary αc(f). As f → 0, this boundary takes the form αc(f) = f log(1/f).

Geometrically, the mapping x = As through a measurement matrix A can be thought of as a linear projection from a high, N-dimensional space of signals down to a low, M-dimensional space of measurements. In this geometric picture, the space of K-sparse signals consists of a low-dimensional (nonsmooth) manifold, which is the union of all K-dimensional linear spaces characterized by K nonzero values at specific locations, as in Figure 4a. Candes & Tao (2005) show that any projection that preserves the geometry of all K-sparse vectors allows one to reconstruct these vectors from the low-dimensional projection efficiently and robustly using L1 minimization. The power of compression by RPs lies in the fact that they preserve the geometrical structure of this manifold. In particular, Baraniuk et al. (2008) show that RPs down to an M = O(K log(N/K))-dimensional space preserve the distance between any pair of K-sparse signals up to a small distortion.

However, we can move beyond sparsity and consider how well RPs preserve the geometric structure of other signal or data patterns that lie on more general low-dimensional manifolds embedded in a high-dimensional space. An extremely simple manifold is a point cloud consisting of a finite set of points, as in Figure 4b. Suppose this cloud consists of P points sα, for α = 1, . . . , P, embedded in an N-dimensional space, and we project them down to the points xα = Asα in a low, M-dimensional space through an appropriately normalized RP. How small can we make M before the point cloud becomes distorted in the low-dimensional space, so that pairwise distances in the low-dimensional space are no longer similar to the corresponding distances in the high-dimensional space?

The celebrated Johnson-Lindenstrauss (JL) lemma (Johnson & Lindenstrauss 1984, Indyk & Motwani 1998, Dasgupta & Gupta 2003) provides a striking answer. It states that RPs with M > O(log P) will yield, with high probability, only a small distortion in the distance between all pairs of points in the cloud. Thus the number of projected dimensions M need only be logarithmic in the number of points P, independent of the embedding dimension of the source data, N.
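
A quick numerical check of this distance-preservation property is straightforward; the sketch below (with illustrative sizes) projects a random point cloud and compares all pairwise distances before and after.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
N, P, M = 2000, 500, 300              # ambient dimension, number of points, projected dimension

S = rng.standard_normal((P, N))       # a cloud of P points in N dimensions

# Random projection, scaled so that squared lengths are preserved on average.
A = rng.standard_normal((M, N)) / np.sqrt(M)
X = S @ A.T                           # the projected cloud: P points in M dimensions

# Compare all pairwise distances before and after projection.
distortion = np.abs(pdist(X) / pdist(S) - 1.0)
print("median relative distortion:", np.median(distortion))
print("max relative distortion:", np.max(distortion))
```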

Finally, we consider data distributed along a nonlinear K-dimensional manifold embedded in N-dimensional space, as in Figure 4c. An example might be a set of images of a single object observed under different lighting conditions, perspectives, rotations, and scales. Another example would be the set of neural firing-rate vectors in a brain region in response to a continuous family of stimuli. Baraniuk & Wakin (2009) and Baraniuk et al. (2010) show that M > O(K log NC) RPs preserve the geometry of the manifold with small distortion. Here C is a number related to the curvature of the manifold, so that highly curved manifolds require more projections. Overall, these results show that surprisingly small numbers of RPs, which can be chosen without any knowledge of the data distribution, can preserve geometric structure in data.

Compressed Computation

Although CS emphasizes the reconstruction of sparse high-dimensional signals from low-dimensional projections, many important problems in signal processing and learning can be accomplished by performing computations directly in the low-dimensional space, without the need to first reconstruct the high-dimensional signal.


Figure 4
Random projections. (a) A manifold of K-sparse signals (red) in N-dimensional space is randomly projected down to an M-dimensional space (here K = 1, N = 3, M = 2). (b,c) Projections of a point cloud and a nonlinear manifold, respectively.

For example, regression (Zhou et al. 2009), signal detection (Duarte et al. 2006), classification (Blum 2006, Haupt et al. 2006, Davenport et al. 2007, Duarte et al. 2007), manifold learning (Hegde et al. 2007), and nearest-neighbor finding (Indyk & Motwani 1998) can all be accomplished in a low-dimensional space given a relatively small number of RPs. Moreover, task performance is often comparable to what can be obtained by performing the task directly in the original high-dimensional space. The reason for this remarkable performance is that these computations rely on the distances between data points, which are preserved by RPs. Thus RPs provide one way to cope with the curse of dimensionality, and as we discuss below, this can have significant implications for neuronal information processing and data analysis.
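
As a toy illustration of computing directly in the projected space, the sketch below runs a nearest-neighbor query both in the original space and on randomly projected data; the synthetic database, the query (a noisy copy of one stored point), and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N, P, M = 2000, 1000, 200             # illustrative sizes

data = rng.standard_normal((P, N))    # a database of P high-dimensional points
true_idx = 123                        # the query is a noisy copy of this stored point
query = data[true_idx] + 0.1 * rng.standard_normal(N)

A = rng.standard_normal((M, N)) / np.sqrt(M)   # random projection

def nearest(db, q):
    """Index of the row of db closest to q in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(db - q, axis=1)))

nn_high = nearest(data, query)                 # search in the original space
nn_low = nearest(data @ A.T, A @ query)        # search among the random projections
print(nn_high, nn_low, nn_high == nn_low)      # both searches should return index 123
```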

Approximate Sparsity and Noise

Above, we have assumed a definition of sparsity in which an N-dimensional signal s0 has K < N nonzero elements, with the other elements being exactly 0. In reality, many of the coefficients of a signal may be small, but they are unlikely to be exactly zero. We thus expect signals not to be exactly sparse but to be well approximated by a K-sparse vector s0_K, which is obtained by keeping the K largest coefficients of s0 and setting the rest of them to 0.


In addition, we have to allow for measurement noise, so that x = As0 + z, where z is a noise vector whose µth component is zero-mean Gaussian noise with a fixed variance.

In the presence of noise, it no longer makes sense to enforce the measurement constraints x = As exactly. Instead, a common approach, known as the LASSO method, is to solve the alternate optimization problem

    ŝ = arg min_s { ‖x − As‖² + λ Σ_{i=1}^{N} V(s_i) },    (3)

where V(s) = |s| (the absolute value function) and λ is a parameter to be optimized. The cost function minimized here allows deviations between As, the noise-free measurement outcomes generated by a candidate signal s, and the actual noisy measurements x; however, such deviations are penalized by the quadratic term in Equation 3.

Several works (see, e.g., Candes et al. 2006, Wainwright 2009, Bayati et al. 2010, Candes & Plan 2010) have addressed the performance of the LASSO in the combined situation of noise and departures from perfect sparsity. The main outcome is, roughly, that for an appropriate choice of λ, which depends on the signal-to-noise ratio (SNR), the same conditions that guaranteed exact recovery of K-sparse signals by L1 minimization in the absence of noise also ensure good performance of the LASSO for approximately sparse signals in the presence of noise. In particular, whenever s0_K is a good approximation to s0, the LASSO estimate ŝ in Equation 3 is a good approximation to s0, up to a level of precision that is allowed by the noise.
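
Objectives of this form can be minimized by simple proximal-gradient iterations (soft thresholding), among many other methods. Below is a minimal, self-contained ISTA sketch for an objective of the same family as Equation 3 (written with a conventional factor of 1/2 on the quadratic term, which only rescales λ); the sizes, noise level, and the particular value of λ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, K, sigma = 200, 80, 10, 0.05           # illustrative sizes and noise level

A = rng.standard_normal((M, N)) / np.sqrt(M)
s0 = np.zeros(N)
s0[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
x = A @ s0 + sigma * rng.standard_normal(M)   # noisy measurements

def ista(A, x, lam, n_iter=2000):
    """Proximal gradient (ISTA) for 0.5*||x - A s||^2 + lam*||s||_1."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2     # step size from the Lipschitz constant
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = s + eta * (A.T @ (x - A @ s))     # gradient step on the quadratic term
        s = np.sign(g) * np.maximum(np.abs(g) - eta * lam, 0.0)   # soft thresholding
    return s

s_hat = ista(A, x, lam=0.1)
print("relative error:", np.linalg.norm(s_hat - s0) / np.linalg.norm(s0))
print("coefficients above 0.05:", int(np.sum(np.abs(s_hat) > 0.05)))
```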

Sparse Models of High-Dimensional Data

L1-based minimization can also be applied to the modeling of high-dimensional data. A simple example is sparse linear regression. Suppose that our data set consists of M N-dimensional vectors aµ, along with M scalar response variables xµ. The regression model assumes that on each observation µ, xµ = aµ · s0 + zµ, where s0 is an N-dimensional vector of unknown regression coefficients and zµ is Gaussian measurement noise. This can be summarized in the matrix equation x = As0 + z, where the M rows of the M × N matrix A are the N-dimensional data points aµ. Now if the number of data points M is smaller than the dimensionality of the data N, it would seem hopeless to infer the regression coefficients. However, in many high-dimensional regression problems, we expect the regression coefficients to be sparse. For example, aµ could be a vector of expression levels of N = O(1000) genes measured in a microarray under experimental condition µ, and xµ could be the response of a biological signal of interest. Only a small fraction of genes are expected to regulate any given signal of interest, and hence we expect the regression coefficients s0 to be sparse.

This scenario is exactly equivalent to the case of CS with noise. Here the regression coefficients s0 play the role of an unknown sparse signal to be recovered, the input data points aµ play the role of the measurement vectors, and the scalar output or response xµ plays the role of the measurement outcome in CS. The same LASSO algorithm described in Equation 3 can be used to infer the regression coefficients (Tibshirani 1996). Here, the parameter λ is not set by the SNR but rather is chosen to minimize some measure of the prediction error on new inputs; this estimate can be obtained through cross-validation, for example. Efron et al. (2004) have proposed efficient algorithms that compute ŝ while optimizing over λ for a given data set (A, x).
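
In practice, this workflow (sparse regression with the regularization strength chosen by cross-validation) is available off the shelf, for example in scikit-learn; the sketch below is one such usage on synthetic data, and all sizes and the noise level are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
M, N, K = 80, 1000, 10                 # fewer data points than regression coefficients

A = rng.standard_normal((M, N))        # rows: data points (e.g., gene-expression profiles)
s0 = np.zeros(N)
s0[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
x = A @ s0 + 0.1 * rng.standard_normal(M)   # noisy scalar responses

# Cross-validation selects the regularization strength (lambda in the text,
# called alpha in scikit-learn) by held-out prediction error.
model = LassoCV(cv=5).fit(A, x)
print("chosen regularization:", model.alpha_)
print("nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```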

The technique of L1 regularization generalizes beyond linear regression to the problem of learning large statistical models with parameter sets that are expected to be sparse. Indeed, it has been used successfully in learning logistic regression (Lee et al. 2006b) and various graphical models (Lee et al. 2006a, Wainwright et al. 2007), as well as in point process models of neuronal spike trains (Kelly et al. 2010).


Dictionary Learning

As Equations 2 and 3 imply, to reconstruct a signal from a small number of random measurements using L1 minimization, we need to know A = BC, which means that we need to know the basis C in which the signal is sparse. What if we have to work with a new ensemble of signals and we do not yet know of a basis in which these signals are sparse?

One approach is to perform dictionary learning (Olshausen et al. 1996; Olshausen & Field 1996a,b, 1997) on the ensemble of signals. Suppose {xα} for α = 1, . . . , P is a collection of P M-dimensional signals. We imagine that each signal is well approximated by a sparse linear combination of the columns of an unknown M × N matrix A, i.e., xα ≈ Asα for all α = 1, . . . , P, where sα is an unknown sparse N-dimensional vector. We refer to the columns of A as the dictionary elements. Thus, the nonzero coefficients of sα indicate which dictionary elements linearly combine to form the signal xα. Here N can be larger than M, in which case we are looking for an overcomplete basis, or dictionary, to represent the ensemble of signals. Given our training signals xα, we wish to find the sparse codes sα and the dictionary A. These can potentially be found by minimizing the following energy function:

    E(s1, . . . , sP, A) = Σ_{α=1}^{P} ( ‖xα − Asα‖² + λ‖sα‖₁ ),    (4)

where ‖sα‖₁ denotes the L1 norm of sα. For each α, the second term enforces the sparsity of the code, whereas the first, quadratic cost term enforces the fidelity of the code and the dictionary. Subsequent work (Kreutz-Delgado et al. 2003; Aharon et al. 2006a,b) has extended this basic formalism as well as derived efficient algorithms for solving Equation 4. Moreover, Aharon et al. (2006b), Isely et al. (2010), and Hillar & Sommer (2011) have recently shown that if the signals xα are indeed generated by sparse noiseless codes through a dictionary A, then under certain conditions related to CS, dictionary learning will recover A, up to permutations and scalings of its columns.
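>
Objectives of the form of Equation 4 are implemented in standard libraries; the sketch below uses scikit-learn's DictionaryLearning on synthetic signals generated from a known random dictionary. The sizes, sparsity level, and regularization value are illustrative assumptions, and note that scikit-learn stores dictionary elements as rows of components_.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(8)
M, N, P, K = 20, 30, 300, 3            # signal dim, dictionary size (overcomplete), signals, sparsity

# Ground-truth dictionary with unit-norm columns.
A_true = rng.standard_normal((M, N))
A_true /= np.linalg.norm(A_true, axis=0)

# Each training signal is a sparse combination of K dictionary elements plus small noise.
codes = np.zeros((P, N))
for p in range(P):
    codes[p, rng.choice(N, K, replace=False)] = rng.standard_normal(K)
X = codes @ A_true.T + 0.01 * rng.standard_normal((P, M))

# Minimize a reconstruction-plus-L1 objective of the form of Equation 4.
# scikit-learn convention: rows of X are signals, rows of components_ are dictionary elements.
dico = DictionaryLearning(n_components=N, alpha=0.1, max_iter=100, random_state=0).fit(X)
A_learned = dico.components_.T         # columns are the learned dictionary elements
print("learned dictionary shape:", A_learned.shape)   # (20, 30)
```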

    COMPRESSED SENSINGOF THE BRAIN

    Rapid Functional Imaging

In many ways, magnetic resonance imaging (MRI) is a well-suited application for CS (Lustig et al. 2008). In MRI, a strong static magnetic field with a linear spatial gradient, ∇H, causes magnetic dipoles in a tissue sample to align with the magnetic field. A radio-frequency excitation pulse then generates a transverse complex magnetic moment at location r, with amplitude m(r) and a phase φ(r) proportional to r · ∇H. Depending on the sample preparation, the amplitude m(r) correlates with various local properties of interest. For example, in functional MRI, it correlates with the concentration of oxygenated hemoglobin, which in turn increases in response to neural activity. Thus, the measurement goal is to extract the spatial profile of m(r). A detector coil measures the spatial integral of the complex magnetization; hence, it essentially measures a spatial Fourier transform of the profile, with a Fourier wave vector k = (kx, ky, kz) ∝ ∇H.

The traditional approach to MR imaging has been to sample the image densely through a regular lattice in Fourier wave-vector space, or k-space, by generating a sequence of static linear gradient fields and radio-frequency pulses. If Fourier space is sampled at the Nyquist-Shannon rate, then one can perform a linear reconstruction of the image m(r) simply by applying an inverse Fourier transform to the measurements. However, acquiring each Fourier sample takes time, so any method that reduces the number of such samples can dramatically reduce patient time in scanners, as well as increase the temporal resolution of dynamic imaging.

CS provides an interesting approach to reducing the number of measurements. In the CS framework, the measurement basis B in Figure 1 consists of Fourier modes. CS will work well if the MRI image is sparse in a basis C that is incoherent with respect to B. For example, many MRI images, such as angiograms, are sparse in the position, or pixel, basis.


For such images, one can subsample random trajectories in k-space and use nonlinear L1 reconstruction to recover the image. For appropriately chosen random trajectories, one can obtain high-quality images using a tenth of the number of measurements required by the traditional approach (Lustig et al. 2008). Similarly, brain images are often sparse in a wavelet basis, and for such images, random trajectories in k-space can be found that speed up the rate at which images can be acquired by a factor of 2.4 compared with the traditional approach (Lustig et al. 2007). Moreover, dynamic movies of oscillatory phenomena that are sparse in the temporal frequency domain can be obtained at high temporal resolution by sampling randomly both in k-space and in time (Parrish & Hu 1995).
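
The essential structure of CS-MRI (Fourier measurements of an image that is sparse in the pixel basis) can be mimicked on a toy problem. The sketch below builds an explicit 2-D DFT matrix for a tiny synthetic image, keeps a random subset of k-space samples, and recovers the image by basis pursuit; the image, sampling fraction, and sizes are illustrative assumptions, and no real k-space trajectory design is attempted.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(9)
n = 12                                   # a tiny n-by-n image, sparse in the pixel basis
N = n * n
img = np.zeros(N)
img[rng.choice(N, 6, replace=False)] = rng.uniform(0.5, 1.0, 6)   # a few bright "vessels"

# Explicit 2-D DFT matrix (up to normalization): rows are Fourier measurement vectors.
F1d = np.fft.fft(np.eye(n))
F = np.kron(F1d, F1d) / n

# Keep a random 40% of the k-space samples.
rows = rng.choice(N, int(0.4 * N), replace=False)
B = np.vstack([F[rows].real, F[rows].imag])    # stack real and imaginary parts
x = B @ img                                    # the measured (subsampled) k-space data

# Recover the image by basis pursuit (min ||s||_1 subject to B s = x).
res = linprog(np.ones(2 * N), A_eq=np.hstack([B, -B]), b_eq=x,
              bounds=(0, None), method="highs")
img_hat = res.x[:N] - res.x[N:]
print("max reconstruction error:", np.max(np.abs(img_hat - img)))
```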

Fluorescence Microscopy

Simultaneously imaging the dynamics of multiple molecular species at both high spatial and temporal resolution is a central goal of cellular microscopy. CS-inspired technologies such as single-pixel cameras (Takhar et al. 2006, Duarte et al. 2008), combined with fluorescence microscopy techniques (Wilt et al. 2009, Taraska & Zagotta 2010), provide one promising route toward such a goal (Coskun et al. 2010; E. Candes, personal communication). In fluorescence imaging, multiple molecular species can be tagged with markers capable of emitting light at different frequencies. Imaging the molecules then requires two key steps: First, the sample must be illuminated with light, causing the tagged species to fluoresce, and second, the emitted photons from the fluorescent species must be detected. Traditionally, two main methods have been used to accomplish both steps. In widefield (WF) microscopy, the entire image is illuminated at once, and a large array of detectors records the emitted photons. In raster-scan (RS) microscopy, each point of the image is illuminated in sequence, so only one detector is required to collect the emitted photons at any given time.

WF can achieve high temporal resolution but requires many photodetectors for high spatial resolution. This is problematic for imaging applications in which photons at many different frequencies, corresponding to different molecules, need to be measured simultaneously: It requires a prohibitively expensive high-density array of photodetectors capable of hyperspectral imaging, i.e., of measuring many spectral channels at once. One could employ a single such detector in RS mode, but then achieving high spatial resolution comes at the cost of low temporal resolution because of the required number of raster scans.

The single-pixel-camera approach exploits the potential spatial sparsity of a fluorescence image to achieve both high spatial and temporal resolution. In this approach, the image is illuminated using a sequence of random light patterns. This can be achieved by a digital micromirror device (DMD), which consists of a spatial array of micrometer-scale mirrors whose angles can be rapidly and individually adjusted. Light is reflected off this array into the sample, and on each trial, a different configuration of mirrors leads to a different pattern of illumination. A single hyperspectral photodetector (the single pixel) then measures the total emitted fluorescence. Owing to the randomness of the light patterns, the image can be reconstructed at the micrometer spatial resolution of the DMD using a number of measurements that is much smaller than the number of pixels (or resolvable spatial locations) in the image. Thus compressive imaging retains the relative speed and resolution of WF and the simplicity and achievable spectral range of RS. As such, this rapidly evolving method has the potential to open up new experimental windows into the dynamics of intracellular molecular cascades within neurons.
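
In the single-pixel scheme, each measurement is the total fluorescence collected under one random mirror configuration, so the measurement matrix is simply a stack of random binary illumination patterns. A minimal sketch of this measurement model follows (the synthetic image, the 0/1 pattern statistics, and the sizes are illustrative assumptions); recovery from the recorded totals proceeds by L1 minimization or the LASSO, as in the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 32                                   # resolvable locations per side, set by the DMD (illustrative)
N = n * n
image = np.zeros(N)
image[rng.choice(N, 40, replace=False)] = rng.uniform(0.2, 1.0, 40)   # spatially sparse fluorescence

M = 400                                  # number of random illumination patterns, far fewer than N = 1024
patterns = rng.integers(0, 2, size=(M, N)).astype(float)   # random 0/1 mirror configurations

# The single hyperspectral detector records the total emitted fluorescence per pattern.
measurements = patterns @ image
print(measurements.shape)                # (400,): one number per illumination pattern
# 'image' can then be recovered from (patterns, measurements) by L1 minimization or the LASSO.
```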

Gene-Expression Analysis

The use of microarrays to collect large-scale data sets of gene-expression levels across many brain regions is now a well-established enterprise in neuroscience. Suppose we want to measure a vector s0 of concentrations of N genetic sequences in a sample.


A microarray consists of N spots, indexed by i = 1, . . . , N, where each spot i contains a unique complementary sequence that will specifically bind with sequence i in the sample. All N genetic sequences of interest in the sample are fluorescently tagged and exposed to all the spots. Each spot binds a specific sequence, and after the excess unbound DNA is washed off, the vector of concentrations s0 can be read off by imaging the fluorescence levels of the spots.

Often this procedure is highly inefficient, because any particular sample will contain only a few genetic sequences of interest; i.e., the concentration vector s0 is sparse. Dai et al. (2009) proposed a CS-based approach in which one can use M < N spots, where each spot contains a random subset of the N sequences of interest. Each spot, now indexed by µ = 1, . . . , M, is thus characterized by an N-dimensional measurement vector aµ, where the component aµi reflects the binding affinity of sequence i in the sample to the contents of spot µ. After the CS microarray is exposed to the sample, the M-dimensional vector of fluorescence levels x is approximately related to the sample concentrations s0 through the linear relation x = As0, where the rows of A are the measurement vectors aµ. Thus, if each spot contains enough randomly chosen complementary sequences, such that the measurements are incoherent with respect to the basis of sequences, one can use the LASSO method in Equation 3 to recover the concentrations s0 from the fluorescence measurements x. Dai et al. (2009) provide a thorough analysis of this basic framework. Overall, reducing the number of spots required to collect gene-expression data reduces both the cost and the size of the array, as well as the amount of biological sample material required to make accurate concentration measurements.
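
The measurement model here is a pooled design: each spot reports a weighted sum of the concentrations of the sequences it carries. A small sketch of that model follows (the number of sequences, spots, pool size, and affinity values are all illustrative assumptions); recovery again uses the LASSO, as described above.

```python
import numpy as np

rng = np.random.default_rng(11)
N, M, seqs_per_spot = 400, 60, 40        # candidate sequences, spots (M < N), sequences pooled per spot

# A[mu, i] > 0 if sequence i can bind at spot mu; values stand in for binding affinities.
A = np.zeros((M, N))
for mu in range(M):
    pooled = rng.choice(N, seqs_per_spot, replace=False)
    A[mu, pooled] = rng.uniform(0.5, 1.0, seqs_per_spot)

# Sparse concentration vector: only a handful of sequences are present in the sample.
s0 = np.zeros(N)
s0[rng.choice(N, 5, replace=False)] = rng.uniform(1.0, 3.0, 5)

x = A @ s0                               # fluorescence levels read out from the M spots
print(x.shape)                           # (60,); s0 is then estimated from (A, x) with the LASSO
```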

Compressed Connectomics

The problem of reconstructing functional circuit connectivity from recordings of neuronal postsynaptic responses presents a considerable challenge to neuroscience. Consider, for example, a simple scenario in which we have a population of N neurons that are potentially presynaptic to a given neuron whose membrane voltage x we can record intracellularly. The set of synaptic strengths from the N neurons to the recorded neuron is an unknown N-dimensional vector s0. The traditional approach to estimating these synaptic strengths is to excite each potential presynaptic neuron one by one and record the resultant postsynaptic membrane voltage x. Each such measurement reveals the strength of one synapse. This brute-force approach is highly inefficient because the synaptic connectivity s0 is often sparse, with only K < N nonzero elements, where K/N is ∼10%. Thus most measurements would simply yield 0.

Hu & Chklovskii (2009) propose a CS-based approach to recovering s0 by randomly stimulating F neurons out of N on any given trial µ. This method corresponds to a random measurement matrix A characterized by F nonzero entries per row. Given that the true weight vector s0 is sparse, Hu & Chklovskii (2009) propose to use L1 minimization, as in Equation 2, to recover s0 from knowledge of the inputs A and outputs x. The authors find, for a wide range of parameters, that F/N = 0.1 minimizes the required number of measurements M, and for this value of F, M = O(K log N) measurements are required to recover s0. Thus random stimulation of 10% of the population constitutes an effective measurement basis for CS of synaptic connectivity (Hu & Chklovskii 2009). Alternative ideas have been proposed for CS of connectivity using fluorescent synaptic markers (Mishchenko 2011).
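
The sketch below mimics this protocol on synthetic data: each trial stimulates a random 10% of the candidate presynaptic neurons, the response is modeled as the summed input from the stimulated subset, and the sparse (here nonnegative, for simplicity) weight vector is recovered by L1 minimization via linear programming. The sizes, number of trials, linear response model, and the nonnegativity assumption are illustrative choices, not taken from Hu & Chklovskii (2009).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(12)
N, K = 500, 50                          # candidate presynaptic neurons; ~10% actually connect
F = int(0.1 * N)                        # neurons stimulated per trial (F/N = 0.1)
M = 400                                 # stimulation trials (increase if recovery is imperfect)

# Sparse, nonnegative synaptic weight vector (nonnegativity is a simplifying assumption here).
w0 = np.zeros(N)
w0[rng.choice(N, K, replace=False)] = rng.uniform(0.5, 1.5, K)

# Each trial stimulates a random subset of F neurons; the recorded response is
# modeled as the summed input from the stimulated subset.
A = np.zeros((M, N))
for mu in range(M):
    A[mu, rng.choice(N, F, replace=False)] = 1.0
x = A @ w0

# L1 minimization (Equation 2) with a nonnegativity constraint on the weights.
res = linprog(np.ones(N), A_eq=A, b_eq=x, bounds=(0, None), method="highs")
print("max error in recovered weights:", np.max(np.abs(res.x - w0)))
```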

COMPRESSED SENSING BY THE BRAIN

The problem of storing, communicating, and processing high-dimensional neural activity patterns, or external stimuli, presents a fundamental challenge to any neural system. This challenge is complicated by the widespread existence of convergent pathways, or bottlenecks, in which information stored in a large number of neurons is often compressed into a small number of axons, or neurons in a downstream system.


For example, 1 million optic nerve fibers carry information about the activity of 100 times as many photoreceptors. Only 1 million pyramidal tract fibers carry information from motor cortex to the spinal cord. And corticobasal ganglia pathways undergo a 10- to 1,000-fold convergence. In this section we review how the theory of CS and RPs yields theoretical insight into how efficient storage, communication, and computation are possible despite drastic reductions in the dimensionality of neural representations through information bottlenecks.

Semantic Similarity and Random Projections

How much can a neural system reduce the dimensionality of its activity patterns without incurring a large loss in its ability to perform relevant computations? A plausible minimal requirement is that any reduction through a convergent pathway should preserve the similarity structure of the neuronal representations in the source area. This requirement is motivated by the observation that, in higher perceptual or association areas of the brain, semantically similar objects elicit similar neural activity patterns (Kiani et al. 2007). This similarity structure of the neural code is likely the basis of our ability to categorize objects and generalize appropriate responses to new objects (Rogers & McClelland 2004). Moreover, this similarity structure is remarkably preserved across monkeys and humans, for example, in image representations in the inferotemporal (IT) cortex (Kriegeskorte et al. 2008).

When a semantic task involves a finite number of activity patterns, or objects, the JL lemma discussed above implies that the required communication resources vary only logarithmically with the number of patterns, independent of how many neurons are involved in the source area. For example, suppose 20,000 images can be represented by the corresponding population activity patterns in the IT cortex. Then the similarity structure between all pairs of images can be preserved to 10% precision in a downstream area using only ∼1,000 neurons. Furthermore, this result can be achieved with a very simple dimensionality-reduction scheme, namely a random synaptic connectivity matrix. Moreover, any computation that relies on similarity structure, and can be solved by the IT cortex, can also be solved by the downstream region.
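
For orientation, plugging these numbers into a standard form of the JL bound, M ≳ C log(P)/ε², with a constant C of order one reproduces the figure of roughly 1,000 neurons quoted above; the precise constant depends on the version of the lemma and the desired failure probability, so this is only an order-of-magnitude check.

```python
import numpy as np

P = 20_000        # number of stored images, from the example above
eps = 0.10        # allowed fractional distortion of pairwise distances
M = np.log(P) / eps**2    # JL-style estimate, assuming a constant of order one
print(round(M))           # roughly 1,000
```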

A more stringent challenge occurs when convergent pathways must preserve the similarity structure of not just a finite set of neuronal activity patterns, but an arbitrarily large, possibly infinite, number of patterns, as is likely the case in any pathway that represents information about continuous families of stimuli. The theories of CS and RPs of manifolds discussed above reveal that, again, drastic compression is possible if the corresponding neural patterns are sparse or lie on a low-dimensional manifold (for example, as in Figure 4a–c). In this case, the number of required neurons in a randomly connected downstream area is proportional to the intrinsic dimension of the ensemble of neural activity patterns and depends only weakly (logarithmically) on the number of neurons in the source area.

Hidden low-dimensional structure in neural activity patterns has been found in several systems (Ganguli et al. 2008a, Yu et al. 2009, Machens et al. 2010); moreover, the intrinsic spatiotemporal fluctuations exhibited by many models of recurrent neuronal circuits, including chaotic networks, are low dimensional (Rajan et al. 2010, Sussillo & Abbott 2009). The ubiquity of this low-dimensional structure in neuronal systems may be intimately related to the requirement of communication and computation through widespread anatomical bottlenecks.

Short-Term Memory in Neuronal Networks

Another bottleneck is posed by the task of working memory, where streams of sensory inputs must presumably be stored within the dynamic reverberations of neuronal circuits.


This is a bottleneck from time into space: Long temporal streams of input must be stored in the instantaneous spatial activity patterns of a limited number of neurons. The influential idea of attractor dynamics (Hopfield 1982) suggests how single stimuli can be stored as stable patterns of activity, or fixed points, but such simple fixed points are incapable of storing temporal sequences of information, like an ongoing sentence, song, or motion trajectory. More recent proposals (Jaeger 2001, Maass et al. 2002, Jaeger & Haas 2004) suggest that recurrent networks could store temporal sequences of inputs in their ongoing, transient activity. This new paradigm raises several theoretical questions about how long memory traces can last in such networks, as functions of the network size, connectivity, and input statistics. Several studies have addressed these questions in the case of simple linear neuronal networks and Gaussian input statistics. These studies show that the duration of memory traces in any network cannot exceed the number of neurons (in units of the intrinsic time constant) (Jaeger 2001, White et al. 2004) and that no network can outperform an equivalent delay line or a nonnormal network, characterized by a hidden feedforward structure (Ganguli et al. 2008b).

However, a more ethologically relevant temporal input statistic is that of a sparse, non-Gaussian sequence. Indeed a wide variety of temporal signals of interest are sparse in some basis, for example, human speech in a wavelet basis. Recent work (Ganguli & Sompolinsky 2010a) has derived a connection between CS and short-term memory by showing that recurrent neuronal networks can essentially perform online, dynamical compressed sensing of an incoming sparse sequence, yielding sequence memory traces that are longer than the number of neurons, again in units of the intrinsic time constant. In particular, neuronal circuits with M neurons can remember sparse sequences, which have a probability f of being nonzero at any given time, for an amount of time that is O(M/(f log(1/f))). This enhanced capacity cannot be attained by purely feedforward networks, or random Gaussian network connectivities, but requires antisymmetric connectivity matrices that generate complex transient activity patterns and diverse temporal filtering properties.
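
As a back-of-the-envelope illustration of this capacity (dropping order-one constants; the sparsity value f = 0.01 is an arbitrary example, and τ denotes the intrinsic time constant):

    T \;\sim\; \frac{M}{f\,\log(1/f)}\,\tau,
    \qquad f = 0.01 \;\Rightarrow\; T \;\sim\; \frac{M}{0.01 \times \ln(100)}\,\tau \;\approx\; 22\,M\,\tau ,

i.e., for this sparsity the memory trace outlasts the number of neurons by more than an order of magnitude, which, as noted above, no network can achieve for dense Gaussian input statistics.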

SPARSE EXPANDED NEURONAL REPRESENTATIONS

In the previous section, we have discussed how CS and RPs can explain how convergent pathways can compress neuronal representations. However, in many computations, neural systems may need to expand these low-dimensional compressed representations back into high-dimensional sparse ones. For example, such representations reduce the overlap between activity patterns, thereby simplifying the tasks of learning, discrimination, categorization, noise filtering, and multiscale stimulus representation. Indeed, like convergence, the expansion of neural representations through divergent pathways is a widespread anatomical motif. For example, information in 1 million optic nerve fibers is expanded into more than 100 million primary visual cortical neurons. Also in the cerebellum, a small number of mossy fibers target a large number of granule cells, creating a 100-fold expansion.

How do neural circuits transform compressed dense codes into expanded sparse ones? A simple mechanism would be to project the dense activity patterns into a larger pool of neurons via random divergent projections and use high spiking thresholds to ensure sparsity of the target activity patterns. Indeed, Marr (1969) suggested this mechanism in his influential hypothesis that the granule cell layer in the cerebellar cortex performs sparse coding of dense stimulus representations in incoming mossy fibers to facilitate learning of sensorimotor associations at the Purkinje cell layer. Although random expansion may work for some computations, sparse codes are generally most useful when they represent essential sparse features of the compressed signal. In the next sections, we review how CS methods for generating sparse expanded representations, which faithfully capture hidden structures in compressed data, can operate within neural systems.
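
The random-expansion-plus-threshold mechanism can be sketched in a few lines (a toy illustration, not a model of any particular circuit; the expansion factor and the threshold value are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    M, N = 100, 10000                 # dense input neurons, expanded population (a 100-fold expansion)
    x = rng.standard_normal(M)        # dense, low-dimensional activity pattern

    W = rng.standard_normal((N, M)) / np.sqrt(M)   # random divergent projections
    drive = W @ x                                  # synaptic drive to the expanded population
    theta = 2.0                                    # high spiking threshold enforces sparsity
    u = np.maximum(drive - theta, 0.0)             # threshold-linear firing rates

    print("fraction of active neurons:", round(float(np.mean(u > 0)), 3))

Because the drive to each expanded neuron is approximately a standard normal here, only about 2% of the population fires; raising the threshold makes the expanded code sparser still.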


Figure 5 (panel labels in the figure include the feedforward weights A^T, the lateral inhibition L = A^T A or W^T W, thresholds ±λ, and the source and target areas.) Neural L1 minimization and long-range brain communication. (a) A two-layer circuit for performing L1 minimization and dictionary learning. (b) Nonlinear transfer function from inputs to firing rates of neurons in the second layer in panel a. (c) A scheme for efficient long-range brain communication in which sparse activity s0 is compressed to a low-dimensional dense representation x in a source area and efficiently communicated downstream to a target area with a small number of axons, where it could be re-expanded into a new sparse representation u through a dictionary learning circuit as in panel a.

Neuronal Implementations of L1 Minimization

Given that solving the optimization problem in Equation 3 with V(s) = |s| has proven to be an efficient method for sparse signal reconstruction, whether neuronal circuits can perform this computation is a natural question. Here we describe one plausible two-layer circuit solution (see Figure 5a) proposed in Rozell et al. (2008), inspired by gradient descent in s on the cost function in Equation 3. Suppose that the low M-dimensional input x is represented in the first layer by the firing rates of a population of M neurons such that the µth input neuron has a firing rate xµ. Now suppose that the reconstructed sparse signal is represented by a larger population of N neurons where si is the firing rate of neuron i. In this population, we denote the synaptic potential for each neuron by vi, which determines the neuron's firing rate via a static nonlinearity F, si = F(vi).

The synaptic connectivity from the M input neurons to the N second-layer neurons computing the sparse representation s is given by the N × M matrix A^T such that the ith column of A, ai, denotes the set of M synaptic weights from the input neurons to neuron i in the second layer. Finally, assume there is lateral inhibition between any pair of neurons i and j in the second layer, governed by synaptic weights Lij, which are related to the feedforward weight vectors to the pair of neurons through Lij = ai · aj. Then the internal dynamics of the second-layer neurons obey the differential equations

\tau \frac{dv_i}{dt} = -v_i + \mathbf{a}_i \cdot \mathbf{x} - \sum_{j \neq i} L_{ij}\, s_j , \qquad (5)

where x is the activity of the input layer. Rozell et al. (2008) found that for an appropriate choice of the static nonlinearity, this dynamic is similar to a gradient descent on the cost function given by Equation 3. In particular, for L1 minimization, the static nonlinearity F is simply a threshold linear function with threshold λ and gain 1 (see Figure 5b).
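
Equation 5 and the threshold nonlinearity can be simulated directly. The following minimal sketch (NumPy only) is our own illustration rather than code from Rozell et al. (2008); the dictionary size, sparsity level, threshold λ, and time step are arbitrary choices, and a two-sided soft threshold is used so that coefficients of either sign are allowed.

    import numpy as np

    rng = np.random.default_rng(2)
    N, M, K = 256, 64, 5                    # dictionary size, input dimension, active sources (illustrative)

    A = rng.standard_normal((M, N))
    A /= np.linalg.norm(A, axis=0)          # unit-norm columns; a_i is the ith column of A

    s0 = np.zeros(N)                        # ground-truth sparse signal with order-one amplitudes
    idx = rng.choice(N, K, replace=False)
    s0[idx] = rng.choice([-1.0, 1.0], K) * (0.5 + rng.random(K))
    x = A @ s0                              # dense input presented to the first layer

    L = A.T @ A                             # lateral inhibition L_ij = a_i . a_j
    np.fill_diagonal(L, 0.0)                # no self-inhibition (sum over j != i in Equation 5)
    lam, tau, dt = 0.05, 1.0, 0.05

    v = np.zeros(N)                         # synaptic potentials of the second layer
    for _ in range(4000):
        s = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)    # F: soft threshold at +/- lambda, gain 1
        v += (dt / tau) * (-v + A.T @ x - L @ s)             # Equation 5

    s_hat = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
    print("true support:           ", np.sort(idx))
    print("largest |s_hat| entries:", np.sort(np.argsort(np.abs(s_hat))[-K:]))
    print("relative reconstruction error:", np.linalg.norm(A @ s_hat - x) / np.linalg.norm(x))

Because λ stays fixed here, the recovered amplitudes are slightly shrunk relative to s0; annealing λ toward zero, as discussed below for the noiseless case, removes this bias.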

To obtain a qualitative understanding of this circuit, consider what happens when the second-layer activity pattern is initially inactive so that s = 0 and an input x occurs in the first layer. Then the internal variable vi(t) of each second-layer neuron i will charge up with a rate controlled by the overlap of the input x with the synaptic weight vector, ai, which is closely related to the receptive field (RF) of neuron i.


As neuron i's internal activation crosses the threshold λ, it starts to fire and inhibits neurons with RFs similar to ai. This sets up a competitive dynamic in which a small number of neurons with RFs similar to the input x come to represent it, yielding a sparse representation ŝ of the input x, which is the solution to Equation 3. In the case of zero noise, the above circuit dynamic needs to be supplemented with an appropriate dynamic update of the threshold λ, which eventually approaches zero at the fixed point (Donoho et al. 2009). Finally, we note that several works (Olshausen et al. 1996, Perrinet 2010) have proposed synaptic Hebbian plasticity and homeostasis rules that supplement Equation 5 and allow the circuit to solve the full dictionary learning problem, Equation 4, without prior knowledge of A.

An intriguing feature of the above dynamic is that the inhibitory recurrent connections are tightly related to the feedforward excitatory drive. Koulakov & Rinberg (2011) suggest that exactly this computation may be implemented in the rodent olfactory bulb. They propose that reciprocal dendrodendritic synaptic coupling between mitral cells and granule cells yields an effective lateral inhibition between granule cells that is related to the feedforward drive from mitral cells to granule cells, in accordance with the requirements of Equation 5. Thus the composite olfactory circuit builds up a sparse code for odors in the granule cell population. Likewise, Hu et al. (2011) proposed that sparse coding is implemented within the amacrine/horizontal cell layers in the retina.

Compression and Expansion in Long-Range Brain Communication

A series of papers (Coulter et al. 2010, Isely et al. 2010, Hillar & Sommer 2011) have integrated the dual aspects of CS theory, dimensionality reduction of sparse neural representations and the recoding of stimuli in sparse overcomplete representations, into a theory of efficient long-range brain communication (see also Tarifi et al. 2011). According to this theory (see Figure 5c), each area in a long-range communication pathway has both dense and sparse representations. Local sparse representations are first compressed to communicate them using a small number of axons and potentially re-expanded in a downstream area.
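
A toy end-to-end version of the scheme in Figure 5c can be sketched as follows. This is an illustration under our own assumptions: the sizes are arbitrary, the downstream re-expansion uses plain iterative soft thresholding in place of the circuit dynamics of Equation 5, and the downstream area is assumed to already know (or to have learned) the compression matrix A.

    import numpy as np

    rng = np.random.default_rng(3)
    N, M, K = 400, 80, 6                   # source-area neurons, bottleneck axons, active units (illustrative)

    s0 = np.zeros(N)                       # sparse activity pattern in the source area
    s0[rng.choice(N, K, replace=False)] = rng.standard_normal(K)

    A = rng.standard_normal((M, N)) / np.sqrt(M)   # random compressive projection (the axonal bottleneck)
    x = A @ s0                                     # dense, low-dimensional signal sent downstream

    # downstream re-expansion by iterative soft thresholding, a stand-in for the
    # dictionary-learning circuit of Figure 5a (assumes A is available downstream)
    lam = 0.01
    eta = 0.5 / np.linalg.norm(A, 2) ** 2          # step size below 1/||A||^2 for stability
    u = np.zeros(N)
    for _ in range(3000):
        z = u + eta * (A.T @ (x - A @ u))          # gradient step on the reconstruction error
        u = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)   # soft threshold enforces sparsity

    print("relative re-expansion error:", np.linalg.norm(u - s0) / np.linalg.norm(s0))

The re-expansion error should come out small, limited mainly by the shrinkage bias of the fixed threshold, even though the message crossed a bottleneck of only M axons.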

Where in the brain might these transformations occur? Coulter et al. (2010) predict that this could occur within every cortical column, with compressive projections, possibly random, occurring between more superficial cortical layers and the output layer 5. A key testable physiological prediction would then be that activity in more superficial layers is sparser than activity in deeper output layers. Another possibility is the transformation from sparse high-dimensional representations of space in the CA3/CA1 fields of the hippocampus to denser, lower-dimensional representations of space in the subiculum, which constitutes the major output structure of the hippocampus. A functional explanation for this representational dichotomy could be that the hippocampus is performing an RP from CA3/CA1 to the subiculum, thereby minimizing the number of axons required to communicate the results of hippocampal computations to the rest of the brain.

Overall, these works suggest more generally that random compression and sparse coding can be combined to yield computational strategies for efficient use of the limited bandwidth available for long-range brain communication.

LEARNING IN HIGH-DIMENSIONAL SYNAPTIC WEIGHT SPACES

Learning new skills and knowledge is thought to be achieved by continuous synaptic modifications that explore the space of possible neuronal circuits, selecting through experience those that are well adapted to the given task. We review how regularization techniques used by statisticians to learn high-dimensional statistical models from limited amounts of data can also be employed by synaptic learning rules to search efficiently the high-dimensional space of synaptic patterns to learn appropriate rules from limited experience.


Neural Learning of Classification

A simple model of neural decision making and classification is a single-layer feedforward network in which the postsynaptic potential of the readout neuron is a sum of the activity of its afferents, weighted by a set of synaptic weights, and the decision is signaled by firing or not firing depending on whether the potential reaches threshold. Such a model is equivalent to the classical perceptron (Rosenblatt 1958). Computationally, this model classifies N-dimensional input patterns into two categories separated by a hyperplane determined by the synaptic weights. These weights are learned through experience-dependent modifications based on a set of M training input examples and their correct classifications. Of course, the goal of any organism is not to classify past experience correctly, but rather to generalize to novel experience. Thus an important measure of learning performance is the generalization error, or the probability of incorrectly classifying a novel input, and a central question of learning theory is how many examples M are required to achieve a good generalization error given N unknown synaptic weights that need to be learned.

This question has been studied exhaustively (Gardner 1988, Seung et al. 1992) (see Engel & den Broeck 2001 for an overview), and the general consensus finds that for a wide variety of learning rules, a small generalization error can occur only when the number of examples M is larger than the number of synapses N. This result has striking implications because it suggests that learning may suffer from a curse of dimensionality: Given the large number of synapses involved in any task, this theory suggests we need an equally large number of training examples to learn any task.

Recent work (Lage-Castellanos et al. 2009) has considered the case when a categorization task can be realized by a sparse synaptic weight vector, meaning that only a subset of inputs are task relevant, though which subset is a priori unknown. The authors showed that a simple learning rule that involves minimization of the classification error on the training set, plus an L1 regularization on the synaptic weights of the perceptron, yields a good generalization error even when the number of examples is less than the number of synapses. Thus a sparsity prior is one route to combat the curse of dimensionality in learning tasks that are realizable by a sparse rule.
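
The effect of a sparsity prior can be illustrated with a small simulation. This is a hedged sketch rather than the learning rule analyzed by Lage-Castellanos et al. (2009): a hinge-loss perceptron trained by subgradient descent stands in for training-error minimization, an L1 penalty of arbitrary strength lam is added, and the teacher rule is sparse by construction.

    import numpy as np

    rng = np.random.default_rng(4)
    N, M_train, M_test, K = 500, 200, 2000, 10    # synapses, training examples (< N), test examples, relevant inputs

    w_true = np.zeros(N)                          # sparse teacher rule: only K inputs matter
    w_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)

    def make_data(m):
        X = rng.standard_normal((m, N))
        return X, np.sign(X @ w_true)

    Xtr, ytr = make_data(M_train)
    Xte, yte = make_data(M_test)

    def train(lam, steps=4000, eta=0.01):
        # subgradient descent on the hinge loss plus an L1 penalty of strength lam
        w = np.zeros(N)
        for _ in range(steps):
            active = (ytr * (Xtr @ w)) < 1.0                        # examples violating the margin
            grad = -(Xtr.T @ (ytr * active)) / M_train + lam * np.sign(w)
            w -= eta * grad
        return w

    for lam in (0.0, 0.02):
        w = train(lam)
        err = np.mean(np.sign(Xte @ w) != yte)
        print("lambda =", lam, " generalization error =", round(float(err), 3))

Even though the number of training examples is well below the number of synapses, the L1-penalized version should generalize noticeably better than the unregularized one in this sparse-teacher setting.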

Optimality and Sparsity of Synaptic Weights

Consider again the perceptron learning to classify a finite set of M input patterns. In general, many synaptic weight vectors will classify these inputs correctly. We can, however, look for the optimal weight vector that maximizes the margin, or the minimal distance between input patterns and the category boundary. For such weights, the induced synaptic potentials are as far as possible from threshold, and the resultant classifications yield good generalization and noise tolerance (Vapnik 1998).

A remarkable theoretical result is that if synapses are constrained to be either excitatory or inhibitory, then near capacity, the optimal solution is sparse, with most of the synapses silent (Brunel et al. 2004), even if the input patterns themselves show no obvious sparse structure. This result has been proposed as a functional explanation for the abundance of silent synapses in the cerebellum and other brain areas.

When the sign of the weights is unconstrained, the optimal solutions are still sparse, but not in the basis of neurons. Instead, the optimal weight vector can be expressed as a linear combination of a small number of input patterns, known as support vectors, the number of support vectors being much smaller than their dimensionality. Indeed, several powerful learning algorithms, including support vector machines (SVMs) (see Burges 1998, Vapnik 1998, Smola 2000 for reviews), exploit this form of sparsity to achieve good generalization from relatively few high-dimensional examples.


Finally, because a sufficiently large number of RPs preserve Euclidean distances, they incur only a modest reduction in the margin of the optimal category boundary separating classes (Blum 2006). Hence, classification problems can also be learned directly in a low-dimensional space. In summary, there is an interesting interplay among sparsity, dimensionality, and the learnability of high-dimensional classification problems: Any such rapidly learnable problem (i.e., one with a large margin) is both (a) sparse, in the sense that its solution can be expressed in terms of a sparse linear combination of input patterns, and (b) low-dimensional, in the sense that it can be learned in a compressed space after a RP.
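
The margin-preservation claim can be checked numerically with a short sketch (synthetic data with the margin built in by construction; the low-dimensional classifier is simply the randomly projected version of the original weight vector rather than a retrained one):

    import numpy as np

    rng = np.random.default_rng(5)
    N, M, P = 2000, 200, 300                  # ambient dimension, projected dimension, patterns (illustrative)

    w = rng.standard_normal(N)
    w /= np.linalg.norm(w)                    # direction of a large-margin classification rule
    y = rng.choice([-1.0, 1.0], size=P)
    X = y[:, None] * w + (2.0 / np.sqrt(N)) * rng.standard_normal((P, N))   # labeled patterns with a wide margin

    def normalized_margin(Xp, wp):
        wp = wp / np.linalg.norm(wp)
        return float(np.min(y * (Xp @ wp) / np.linalg.norm(Xp, axis=1)))

    A = rng.standard_normal((M, N)) / np.sqrt(M)      # random projection into the low-dimensional space
    print("margin in the original space:  ", round(normalized_margin(X, w), 3))
    print("margin after random projection:", round(normalized_margin(X @ A.T, A @ w), 3))

The two printed margins should come out close to each other, with the gap shrinking roughly as 1/√M, in line with the JL-type argument above.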

    DISCUSSION

Dimensionality Reduction: CS versus Efficient Coding

Efficient coding theories (Barlow 1961, Atick 1992, Atick & Redlich 1992, Barlow 2001) suggest that information bottlenecks in the brain perform optimal dimensionality reduction by maximizing mutual information between the low-dimensional output and the high-dimensional input (Linsker 1990). The predictions of such information maximization theories depend on assumptions about input statistics, neural noise, and metabolic constraints. In particular, infomax theories of early vision, based on Gaussian signal and noise assumptions, predict that high-dimensional spatiotemporal patterns of photoreceptor activation should be projected onto the linear subspace of their largest principal components. Furthermore, the individual projection vectors, i.e., retinal ganglion cell (RGC) RFs, depend on the stimulus SNR; in particular, at a high SNR, RFs should decorrelate or whiten the stimulus. This is consistent with the center-surround arrangement of RFs, which removes much of the low-frequency correlations in natural images (Atick 1992, Atick & Redlich 1992, Borghuis et al. 2008).

What is the relation between infomax theories and CS? According to CS theory, for sparse inputs, close to optimal dimensionality reduction is achieved when the projection vectors are maximally incoherent with respect to the basis in which the stimulus is sparse. Assuming visual stimuli are approximately sparse in a wavelet or Gabor-like basis, incoherent projections are likely to be spatially distributed. If sparseness is a prominent feature of natural visual spatiotemporal signals, how can we reconcile the observed RGC center-surround RFs with the demand for incoherence? Incoherent or random projections are optimal for signal ensembles composed of a combination of a few feature vectors in which the identity of these vectors varies across signals. This may be an adequate description of natural images after whitening. However, prewhitened natural images have strong second-order correlations, implying that they lie close to a low-dimensional linear space given by their principal components. Thus, the ensemble of natural images is characterized by both linear low-dimensional structure and sparse structure imposed by higher-order statistics. In such ensembles, whether sensory stimuli or neuronal activity patterns, when second-order correlations are strong enough, the optimal dimensionality reduction may indeed be close to that predicted by Gaussian-based infomax, as has been argued in recent work (Weiss et al. 2007).

Expansion and Sparsification: Compressed Sensing versus Independent Components Analysis

What does efficient coding theory predict regarding the recoding of signals through expansive transformations, for example, from the optic nerve to visual cortex? Several modern efficient coding theories, such as basis pursuit, independent components analysis (ICA), maximizing non-Gaussianity, and others, suggest that even after decorrelation, natural images include higher-order statistical dependencies that arise through linear mixing of statistically independent sources.


The role of the cortical representation is to further reduce the redundancy of the signal by separating the mixed signal into its independent causes (i.e., an unmixing operation), essentially generating a factorial statistical representation of the signal.

The application of ICA to natural images and movies yields, at the output layer, single-neuron response histograms that are considerably sparser than those in the input layer. These responses have Gabor-like RFs similar to those of simple cells in V1 (Olshausen et al. 1996, Bell & Sejnowski 1997, van Hateren & Ruderman 1998, van Hateren & van der Schaaf 1998, Simoncelli & Olshausen 2001, Hyvarinen 2010). ICA algorithms have also been applied to natural sounds (Lewicki 2002), yielding a set of temporal filters resembling auditory cortical RFs.

Although the algorithms and results of ICA and source extraction by CS are often similar, there are important differences. First, CS results in signals that are truly sparse, i.e., most of the coefficients are zero, whereas ICA algorithms generally yield signals with many small values, i.e., distributions with high kurtosis in which no coefficients vanish (Olshausen et al. 1996, Bell & Sejnowski 1997, Hyvarinen 2010). Second, ICA emphasizes the statistical independence of the unmixed sources (Barlow 2001). Sparseness is a special case; ICA can be applied to reconstruct dense sources as well. In contrast, signal extraction by CS relies only on the assumed approximate sparseness of the signal, and not on any statistical priors, and is similar in spirit to the seminal work of Olshausen et al. (1996). Indeed, a recent study suggests that sparseness may be a more useful notion than independence and that the success of ICA in some applications is due to its ability to generate sparse representations rather than to discover statistically independent features (Daubechies et al. 2009).

Beyond Linear Projections: Neuronal Nonlinearities

The abundance of nonlinearities in neuronal signaling raises the question of the relevance of the CS linear projections to neuronal information processing. One fundamental nonlinearity is the input-output relation between synaptic potentials and action potential firing of individual neurons. This nonlinearity is often approximated by the linear-nonlinear (LN) model (Dayan & Abbott 2001, Ostojic & Brunel 2011) in which the firing rate of a neuron, x, is related to its input activity a through x = σ(a · s0), where s0 is the neuron's spatiotemporal linear filter and σ(·) is a scalar sigmoidal function. As long as σ(·) is an invertible function of its input, the nonlinearity in the measurement can be undone to recover the fundamental linear relation between the synaptic input to the neuron and the source, given by As0; hence, the results of CS should hold. More generally, it will be an important challenge to evaluate the role of dimensionality reduction, expansion, and sparse coding in neuronal circuit models that incorporate additional nonlinearities, including nonlinear temporal coding of inputs, synaptic depression and facilitation, and nonlinear feedback dynamics through recurrent connections.
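
The invertibility argument can be verified in a few lines (a toy check under our own assumptions: a logistic function stands in for the generic sigmoid σ, applied by each of M measurement neurons to its own linear projection of the source):

    import numpy as np

    rng = np.random.default_rng(6)
    N, M = 300, 100                                   # source dimension, number of LN measurement neurons
    s0 = np.zeros(N)
    s0[rng.choice(N, 8, replace=False)] = rng.standard_normal(8)    # sparse source
    A = rng.standard_normal((M, N)) / np.sqrt(N)                    # rows are the linear filters a_mu

    def sigma(z):                                     # invertible sigmoidal nonlinearity (logistic)
        return 1.0 / (1.0 + np.exp(-z))

    x = sigma(A @ s0)                                 # observed firing rates (LN model)
    recovered = np.log(x / (1.0 - x))                 # apply the inverse (logit) to each firing rate

    print("max deviation from the linear measurements A s0:", float(np.max(np.abs(recovered - A @ s0))))

The recovered values match the linear measurements As0 to machine precision, so any CS decoder can then be applied to these de-nonlinearized measurements.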

In summary, we have reviewed a relatively new set of surprising mathematical phenomena related to RPs of high-dimensional patterns. But far from being a set of intellectual curiosities, these phenomena have important practical implications for data acquisition and analysis and important conceptual implications for neuronal information processing. It is likely that more surprises await us, lurking in the properties of high-dimensional spaces and mappings, properties that could further change the way we measure, analyze, and understand the brain.

DISCLOSURE STATEMENT

The authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.


ACKNOWLEDGMENTS

S.G. and H.S. thank the Swartz Foundation, Burroughs Wellcome Foundation, Israeli Science Foundation, Israeli Defense Ministry (MAFAT), the McDonnell Foundation, and the Gatsby Charitable Foundation for support, and we thank Daniel Lee for useful discussions.

    LITERATURE CITED

Aharon M, Elad M, Bruckstein A. 2006a. K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Proc. 54(11):4311
Aharon M, Elad M, Bruckstein A. 2006b. On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them. Linear Algebr. Appl. 416(1):48–67
Atick J. 1992. Could information theory provide an ecological theory of sensory processing? Netw. Comput. Neural Syst. 3(2):213–51
Atick J, Redlich A. 1992. What does the retina know about natural scenes? Neural Comput. 4(2):196–210
Baraniuk R. 2007. Compressive sensing. Signal Proc. Mag. IEEE 24(4):118–21
Baraniuk R. 2011. More is less: signal processing and the data deluge. Science 331(6018):717–19
Baraniuk R, Cevher V, Wakin M. 2010. Low-dimensional models for dimensionality reduction and signal recovery: a geometric perspective. Proc. IEEE 98(6):959–71
Baraniuk R, Davenport M, DeVore R, Wakin M. 2008. A simple proof of the restricted isometry property for random matrices. Constr. Approx. 28(3):253–63
Baraniuk R, Wakin M. 2009. Random projections of smooth manifolds. Found. Comput. Math. 9(1):51–77
Barlow H. 1961. Possible principles underlying the transformation of sensory messages. In Sensory Communication, ed. WA Rosenblith, pp. 217–34. New York: Wiley
Barlow H. 2001. Redundancy reduction revisited. Netw. Comput. Neural Syst. 12(3):241–53
Bayati M, Bento J, Montanari A. 2010. The LASSO risk: asymptotic results and real world examples. Neural Inf. Process. Syst. (NIPS)
Bell A, Sejnowski T. 1997. The independent components of natural scenes are edge filters. Vis. Res. 37(23):3327–38
Blum A. 2006. Random projection, margins, kernels, and feature-selection. In Subspace, Latent Structure and Feature Selection, ed. C Saunders, M Grobelnik, S Gunn, J Shawe-Taylor, pp. 52–68. Heidelberg, Germ.: Springer
Borghuis B, Ratliff C, Smith R, Sterling P, Balasubramanian V. 2008. Design of a neuronal array. J. Neurosci. 28(12):3178–89
Boyd S, Vandenberghe L. 2004. Convex Optimization. New York: Cambridge Univ. Press
Bruckstein A, Donoho D, Elad M. 2009. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev. 51(1):34–81
Brunel N, Hakim V, Isope P, Nadal J, Barbour B. 2004. Optimal information storage and the distribution of synaptic weights: perceptron versus Purkinje cell. Neuron 43(5):745–57
Burges C. 1998. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2(2):121–67
Candes E, Plan Y. 2010. A probabilistic and RIPless theory of compressed sensing. IEEE Trans. Inf. Theory 57(11):7235–54
Candes E, Romberg J. 2007. Sparsity and incoherence in compressive sampling. Invers. Probl. 23(3):969–85
Candes E, Romberg J, Tao T. 2006. Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59(8):1207–23
Candes E, Tao T. 2005. Decoding by linear programming. IEEE Trans. Inf. Theory 51:4203–15
Candes E, Tao T. 2006. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inf. Theory 52(12):5406–25
Candes E, Wakin M. 2008. An introduction to compressive sampling. IEEE Sig. Proc. Mag. 25(2):21–30
Coskun A, Sencan I, Su T, Ozcan A. 2010. Lensless wide-field fluorescent imaging on a chip using compressive decoding of sparse objects. Opt. Express 18(10):10510–23


Coulter W, Hillar C, Isley G, Sommer F. 2010. Adaptive compressed sensing—a new class of self-organizing coding models for neuroscience. Presented at IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 5494–97
Dai W, Sheikh M, Milenkovic O, Baraniuk R. 2009. Compressive sensing DNA microarrays. EURASIP J. Bioinf. Syst. Biol. 2009:162824
Dasgupta S, Gupta A. 2003. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms 22(1):60–65
Daubechies I, Roussos E, Takerkart S, Benharrosh M, Golden C, et al. 2009. Independent component analysis for brain fMRI does not select for independence. Proc. Natl. Acad. Sci. 106(26):10415–20
Davenport M, Duarte M, Wakin M, Laska J, Takhar D, et al. 2007. The smashed filter for compressive classification and target recognition. Proc. Comput. Imaging V SPIE Electron Imaging, San Jose, CA
Dayan P, Abbott L. 2001. Theoretical Neuroscience. Computational and Mathematical Modelling of Neural Systems. Cambridge, MA: MIT Press
Donoho D. 2000. High-dimensional data analysis: the curses and blessings of dimensionality. AMS Math Challenges Lecture, pp. 1–32
Donoho D. 2006. Compressed sensing. IEEE Trans. Inf. Theory 52(4):1289–306
Donoho D, Maleki A, Montanari A. 2009. Message-passing algorithms for compressed sensing. Proc. Natl. Acad. Sci. USA 106(45):18914–19
Donoho D, Tanner J. 2005a. Neighborliness of randomly projected simplices in high dimensions. Proc. Natl. Acad. Sci. USA 102:9452–57
Donoho D, Tanner J. 2005b. Sparse nonnegative solution of underdetermined linear equations by linear programming. Proc. Natl. Acad. Sci. USA 102:9446–51
Duarte M, Davenport M, Takhar D, Laska J, Sun T, et al. 2008. Single-pixel imaging via compressive sampling. Signal Proc. Mag. IEEE 25(2):83–91
Duarte M, Davenport M, Wakin M, Baraniuk R. 2006. Sparse signal detection from incoherent projections. Proc. Acoust. Speech Signal Process. (ICASSP) 3:III–III
Duarte M, Davenport M, Wakin M, Laska J, Takhar D, et al. 2007. Multiscale random projections for compressive classification. Presented at IEEE Int. Conf. Image Process (ICIP) Int. Conf. 6:VI161–64, San Antonio, TX
Efron B, Hastie T, Johnstone I, Tibshirani R. 2004. Least angle regression. Ann. Stat. 32(2):407–99
Engel A, den Broeck CV. 2001. Statistical Mechanics of Learning. London: Cambridge Univ. Press
Ganguli S, Bisley J, Roitman J, Shadlen M, Goldberg M, Miller K. 2008a. One-dimensional dynamics of attention and decision making in LIP. Neuron 58(1):15–25
Ganguli S, Huh D, Sompolinsky H. 2008b. Memory traces in dynamical systems. Proc. Natl. Acad. Sci. USA 105(48):18970–74
Ganguli S, Sompolinsky H. 2010a. Short-term me