
Behavioral/Systems/Cognitive

Spatiotemporal Representation of the Pitch of Harmonic Complex Tones in the Auditory Nerve

Leonardo Cedolin1,3 and Bertrand Delgutte1,2

1Eaton–Peabody Laboratory, Massachusetts Eye and Ear Infirmary, Boston, Massachusetts 02114, and 2Research Laboratory of Electronics, Massachusetts Institute of Technology, and 3Speech and Hearing Bioscience and Technology Program, Harvard–Massachusetts Institute of Technology Division of Health Sciences and Technology, Cambridge, Massachusetts 02139

The pitch of harmonic complex tones plays an important role in speech and music perception and the analysis of auditory scenes, yet traditional rate–place and temporal models for pitch processing provide only an incomplete description of the psychophysical data. To test physiologically a model based on spatiotemporal pitch cues created by the cochlear traveling wave (Shamma, 1985), we recorded from single fibers in the auditory nerve of anesthetized cat in response to harmonic complex tones with missing fundamentals and equal-amplitude harmonics. We used the principle of scaling invariance in cochlear mechanics to infer the spatiotemporal response pattern to a given stimulus from a series of measurements made in a single fiber as a function of fundamental frequency F0. We found that spatiotemporal cues to resolved harmonics are available for F0 values between 350 and 1100 Hz and that these cues are more robust than traditional rate–place cues at high stimulus levels. The lower F0 limit is determined by the limited frequency selectivity of the cochlea, whereas the upper limit is caused by the degradation of phase locking to the stimulus fine structure at high frequencies. The spatiotemporal representation is consistent with the upper F0 limit to the perception of the pitch of complex tones with a missing fundamental, and its effectiveness does not depend on the relative phase between resolved harmonics. The spatiotemporal representation is thus consistent with key trends in human psychophysics.

Introduction

Harmonic complex tones present in speech, animal vocalizations, and the sounds of musical instruments produce a strong pitch sensation at their fundamental frequency F0. Although the pitch of complex tones plays an important role in music, speech, and auditory scene analysis, the neural mechanisms for pitch perception are still poorly understood. Despite the report of pitch-selective neurons in the auditory cortex of a primate (Bendor and Wang, 2005), the computations underlying these cortical responses and the neural codes on which these computations operate are unknown. Studies of the coding of harmonic complex tones in the auditory nerve (AN) and cochlear nucleus have primarily focused on temporal pitch cues available in interspike interval distributions (Javel, 1980; Evans, 1983; Palmer, 1990; Shofner, 1991, 1999; Rhode, 1995; Cariani and Delgutte, 1996a,b; Wiegrebe and Winter, 2001; Winter et al., 2001; Cedolin and Delgutte, 2005; Sayles and Winter, 2008). These cues are closely related to the autocorrelation model of pitch, which accounts for a wide variety of pitch phenomena (Licklider, 1951; Meddis and Hewitt, 1991a,b; de Cheveigne, 2005). Fewer studies have focused on “place” or “spectral” cues based on the cochlear frequency map and the mechanical frequency analysis of individual harmonics in the cochlea (Sachs and Young, 1979; Cedolin and Delgutte, 2005). Such place cues provide an appropriate input to harmonic template models for pitch extraction (Goldstein, 1973; Wightman, 1973; Cohen et al., 1995). Although either rate–place or interspike interval information from AN fibers in cat supports precise pitch estimation over the F0 range of cat vocalizations (500–1000 Hz), neither representation is entirely consistent with human psychophysical data (Cedolin and Delgutte, 2005). The rate–place representation degrades with increasing sound level and also fails to predict the existence of an upper frequency limit for pitch (Schouten et al., 1962; Moore, 1973). The interval representation does not account for the greater salience of pitch based on resolved harmonics compared with pitch based on unresolved harmonics (Carlyon and Shackleton, 1994; Carlyon, 1998). Here, we investigate an alternative, spatiotemporal representation of pitch (Shamma, 1985) aimed at combining the advantages and overcoming the limitations of traditional representations.

Sinusoidal stimulation of the cochlea gives rise to a traveling wave that moves from base to apex, progressively slowing down as it approaches the cochlear location tuned to the stimulus frequency, where the phase of basilar membrane velocity changes rapidly (Robles and Ruggero, 2001). At frequencies within the range of phase locking, this rapid phase transition is reflected in the timing of AN spike discharges (Anderson et al., 1971; Pfeiffer and Kim, 1975; van der Heijden and Joris, 2006; Palmer and Shackleton, 2009; Temchin and Ruggero, 2010). For harmonic complex tones, a rapid phase transition is expected to occur at the spatial locations tuned to each resolved harmonic (see Fig. 1). These spatiotemporal cues to resolved harmonics could be extracted by a neural mechanism sensitive to the relative timing of spikes from adjacent cochlear locations (Shamma, 1985; Carney, 1990).

Received Dec. 23, 2009; revised July 28, 2010; accepted July 31, 2010.

This work was supported by National Institutes of Health Grants R01 DC002258 and P30 DC005209. We thank Connie Miller for surgical assistance and Ken Hancock for software support. Bob Carlyon, Ken Hancock, Andrew Oxenham, Shihab Shamma, Chris Shera, Garrett Stanley, and Grace Wang made valuable comments on this manuscript.

Correspondence should be addressed to Bertrand Delgutte, Massachusetts Eye and Ear Infirmary, Eaton–Peabody Laboratory, 243 Charles Street, Boston, MA 02114. E-mail: [email protected].

DOI:10.1523/JNEUROSCI.6365-09.2010

Copyright © 2010 the authors 0270-6474/10/3012712-13$15.00/0

12712 • The Journal of Neuroscience, September 22, 2010 • 30(38):12712–12724

We tested the spatiotemporal representation of pitch by recording the responses of AN fibers in anesthetized cats to harmonic complex tones with F0 varied in fine increments. We find that this representation is more robust to variations in stimulus level than the rate–place representation and also predicts an upper frequency limit to pitch consistent with psychophysical data.

Materials and Methods

Spatiotemporal pitch cues in a peripheral auditory model. The spatiotemporal representation of pitch is based on phase transition cues to the frequencies of resolved harmonics created by the cochlear traveling wave. Figure 1 shows the spatiotemporal pattern of AN activity produced by a physiologically realistic peripheral auditory model (Zhang et al., 2001) in response to a harmonic complex tone with missing fundamental at 200 Hz. The response pattern is shown as a function of both time (expressed in dimensionless units t × F0) and position along the cochlea, which maps to characteristic frequency (CF). The CF is expressed by the dimensionless ratio CF/F0, which we call “neural harmonic number.” In this example, a neural harmonic number of 3 corresponds to a CF of 600 Hz. As predicted, the latency of the traveling wave changes more rapidly with CF at cochlear locations tuned to low-order harmonics than for CFs in between two harmonics (Fig. 1, white line). To extract these cues, we use a spatial derivative operation that simulates a hypothetical lateral inhibitory mechanism operating on the spatiotemporal pattern of AN activity (Shamma, 1985). Specifically, we compute the point-by-point difference between adjacent rows in Figure 1, and then integrate the absolute value of the difference over time. The resulting “mean absolute spatial derivative” (MASD) shows local maxima at CFs corresponding to the frequencies of harmonics 2–6, whereas the average firing rate is mostly saturated at this stimulus level [50 dB sound pressure level (SPL) per component]. Thus, the model predicts that spatiotemporal pitch cues persist at stimulus levels where the rate–place representation is degraded.

The most direct way to study the spatiotemporal representation of pitch would be to measure the AN response to a given harmonic complex tone as a function of both time and CF. Since a fine, regular sampling of the tonotopic axis is hard to achieve with single-unit recordings, we relied instead on the principle of scaling invariance in cochlear mechanics (Zweig, 1976) to infer the spatiotemporal response pattern from measurements made at a single cochlear location. In a cochlea with perfect scaling invariance, the response to a pure tone of frequency f at the cochlear location tuned to CF depends only on the ratio f/CF (Zweig, 1976). This means that the magnitude and phase of the cochlear response to a pure tone of frequency fd at the location tuned to CFd are equal to the magnitude and the phase of the response of the cochlear location CFp to a tone of frequency fp = CFp × fd/CFd. Thus, to obtain the spatial pattern of response to a tone of desired frequency fd for the set of cochlear locations tuned to {CFd}, it suffices, in principle, to measure responses at one location tuned to CFp to the set of probe frequencies {fp} = CFp × fd/{CFd}. The same reasoning applies to a harmonic complex tone in which all the components are multiples of a common fundamental. Thus, the spatiotemporal response pattern to a complex tone with a given F0 can, in principle, be inferred from the responses recorded in a single AN fiber to a series of complex tones with varying F0.

Figure 2 illustrates the scaling invariance principle using the model of Zhang et al. (2001) for peripheral auditory processing in cat. The left panel shows the model spatiotemporal response pattern to a harmonic complex tone with fixed F0 (500 Hz) for CFs ranging from 750 to 2250 Hz. The right panel shows the model temporal response patterns at a single cochlear place (CF0, 1500 Hz) to a series of complex tones with F0 values varying from 333 to 1000 Hz. The F0 values and CFs were chosen so that the neural harmonic number CF/F0 ranges from 1.5 to 4.5 in both panels. In addition, the time axis is expressed in dimensionless units t × F0 (number of stimulus cycles). The spatiotemporal response patterns for the two conditions are nearly indistinguishable: they both show fast latency changes around integer values of neural harmonic number (2, 3, 4) and relatively more constant latencies around noninteger harmonic numbers (2.5, 3.5). Average discharge rate and MASD, computed as in Figure 1, also exhibit nearly identical features in the two panels, with peaks at integer values of neural harmonic number and valleys in between. Thus, these model simulations support the use of scaling invariance to infer the response of a population of AN fibers to a given complex tone by measuring the response of a single AN fiber to a series of complex tones with varying F0.
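The mapping between the two panels of Figure 2 can be sketched in a few lines of Python. This is only an illustration under the assumption of perfect scaling invariance; the helper name `probe_f0_series` is hypothetical and does not come from the study.

```python
def probe_f0_series(cf_probe, f0_desired, cfs_desired):
    """F0 values to present at one place (cf_probe) so that CF/F0 matches
    the neural harmonic numbers of the desired places for f0_desired.

    Hypothetical helper: assumes perfect cochlear scaling invariance.
    """
    return [cf_probe * f0_desired / cf for cf in cfs_desired]


# Figure 2 values: F0 of 500 Hz, places from 750 to 2250 Hz, probe CF 1500 Hz.
cfs = [750.0, 1500.0, 2250.0]
f0s = probe_f0_series(1500.0, 500.0, cfs)

for cf, f0 in zip(cfs, f0s):
    # The neural harmonic number is the same in both conditions.
    print(cf / 500.0, 1500.0 / f0, round(f0, 1))
```

The lowest desired CF maps to the highest probe F0 and vice versa, which is why the F0 series spans 333 to 1000 Hz for a 1500 Hz fiber.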

Scaling invariance is a good approximation when applied to a local region of the cochlea but does not hold over wide cochlear spans (Shera and Guinan, 2003; van der Heijden and Joris, 2006). Since, for each fiber, F0 was varied over a limited range in our experiments (approximately 1.6 octaves), deviations from scaling invariance may not present a major problem, as Figure 2 suggests. This point is addressed further in Discussion.

Neurophysiological procedure. Methods for recording from auditory-nerve fibers in anesthetized cats were as described by Cedolin and Delgutte (2005) and were approved by the Animal Care and Use Committees of both the Massachusetts Eye and Ear Infirmary and the Massachusetts Institute of Technology. Cats were anesthetized with Dial in urethane (75 mg/kg), with supplementary doses given as needed to maintain an areflexic state. The posterior fossa was opened and the cerebellum retracted to expose the auditory nerve. The tympanic bullae were opened to expose the round window. Throughout the experiment, the cat was given injections of dexamethasone (0.26 mg/kg, i.m. every 4 h) to prevent brain swelling, and Ringer’s solution (50 ml/d, i.v.) to prevent dehydration. General physiological state was assessed by monitoring heart rate, respiratory rate, exhaled CO2 concentration, and rectal temperature, which was maintained at 37°C by a thermostat-controlled heating pad.

The cat was placed on a vibration-isolated table in an electrically shielded, soundproof chamber. A silver electrode positioned near the round window was used to measure the compound action potential of the AN in response to click stimuli, to assess the condition and stability of cochlear function.

Sound was delivered to the cat’s ear through a calibrated closed acoustic assembly driven by an electrodynamic speaker (Realistic 40-1377). Stimuli were generated by a 16-bit digital-to-analog converter (National Instruments NIDAC 6052E) using sampling rates of 20 or 50 kHz. Stimuli were digitally filtered to compensate for the transfer characteristics of the acoustic system.

Figure 1. Spatiotemporal activity pattern of the human peripheral auditory model of Zhang et al. (2001) in response to a harmonic complex tone with F0 of 200 Hz at 50 dB SPL. Left, The model response is displayed as a function of time (in dimensionless units, t × F0) and cochlear place, which maps to CF, expressed as the dimensionless ratio CF/F0 (neural harmonic number). Fast variations in response latency with CF at integer neural harmonic numbers are highlighted by the white line. The right panel shows the spatial profiles of average rate (solid line) and MASD (dashed), derived from the spatiotemporal pattern on the left. The rate and MASD profiles are normalized by their respective maximum value.

Spikes were recorded with glass micropipettes filled with 2 M KCl. The electrode was inserted into the nerve and then mechanically advanced using a micropositioner (Kopf 650). The electrode signal was bandpass filtered (0.3–3 kHz) and fed to a custom spike detector. The times of spike peaks were recorded with 1 μs resolution and saved to disk for subsequent analysis.

A click at approximately 60 dB SPL was used as the search stimulus. On isolation of a single unit, a frequency tuning curve was measured by an automatic tracking algorithm (Kiang and Moxon, 1974) using 50 ms tone bursts to determine the CF. The spontaneous firing rate (SR) of the fiber was measured over 20 s. The responses to complex-tone stimuli were then studied.

Stimuli. Stimuli were series of harmonic complex tones consisting of 19 equal-amplitude harmonics [numbers 2–20 (i.e., excluding the fundamental)]. The fundamental frequency (F0) was stepped up and down such that the ratio of the fiber CF to F0 (the neural harmonic number) typically varied from 1.5 to 4.5 in increments of approximately 1/8. This F0 variation causes harmonics 2–4 (which are important for determining the pitch of missing fundamental stimuli) to successively traverse the auditory filter centered at the CF. If these harmonics are resolved by the cochlea, there should be a regular modulation in both firing rate and response latency as F0 varies. Each complex-tone series consisted of 25 ascending F0 steps and 25 descending steps, each lasting 200 ms, including a 20 ms transition during which the waveform for one F0 gradually decayed while overlapping with the gradual buildup of the waveform for the subsequent F0. Responses were typically collected over 20 repetitions of the 10 s stimulus with no interruption.
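As a rough check on these numbers, the following Python sketch generates one F0 series for a given CF. The function name and the exact step of 1/8 are assumptions for illustration, and repeating the highest F0 at the turnaround is a guess about how the ascending and descending halves were joined.

```python
def f0_series(cf, n_min=1.5, n_max=4.5, step=0.125):
    """Ascending F0 steps (Hz) for a fiber of characteristic frequency cf.

    Neural harmonic number n = CF/F0 runs from n_max down to n_min, so F0
    ascends. Step size and range follow the typical values in the text.
    """
    count = int(round((n_max - n_min) / step)) + 1
    harmonic_numbers = [n_max - i * step for i in range(count)]
    return [cf / n for n in harmonic_numbers]


ascending = f0_series(700.0)
sequence = ascending + ascending[::-1]   # up, then back down (assumed layout)
duration_s = len(sequence) * 0.2         # each step lasts 200 ms

print(len(ascending), round(ascending[0], 1), round(ascending[-1], 1), duration_s)
```

For a CF of 700 Hz this gives 25 steps spanning roughly 156 to 467 Hz, and 50 steps of 200 ms account for the 10 s stimulus duration.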

The sound pressure level of each harmonic was initially set at 10–15 dB above the threshold of the fiber for a pure tone at CF. When possible, the stimulus level was then varied over a 20–30 dB range to investigate the robustness of the spatiotemporal representation. All stimulus levels in this paper refer to the sound pressure of one frequency component of the complex tone. Since the stimuli contain 19 equal-amplitude components, the overall sound pressure level is 12.8 dB higher.
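The 12.8 dB figure follows from adding the powers of 19 equal-amplitude components, that is, 10 log10(19). A minimal Python check, with a hypothetical function name:

```python
import math


def overall_level_db(component_level_db, n_components=19):
    """Overall SPL of n equal-amplitude components whose powers add."""
    return component_level_db + 10.0 * math.log10(n_components)


# Difference between overall and per-component level, in dB.
print(round(overall_level_db(50.0) - 50.0, 1))
```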

Psychophysical studies show that the pitch and pitch strength of harmonic complex tones containing resolved harmonics are mostly independent of the phase relationships among harmonics (Houtsma and Smurzynski, 1990; Carlyon and Shackleton, 1994; Bernstein and Oxenham, 2005). To evaluate the robustness of the spatiotemporal representation to the phase pattern, three versions of each stimulus were generated with different phase relationships among the harmonics: cosine phase, alternating sine–cosine phase, and negative “Schroeder” phase (Schroeder, 1970). The three stimuli have the same power spectrum and autocorrelation function, but sharply differ in their temporal envelopes (see Fig. 9). The cosine-phase stimulus has a very “peaky” envelope periodic at F0. The envelope of the alternating-phase stimulus is also very peaky, but its periodicity is at 2 × F0, even though the periodicity of the waveform remains at F0. Finally, a Schroeder phase relationship among the harmonics ensures the envelope is nearly flat.
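The contrast between peaky and flat envelopes can be illustrated by synthesizing one period of each stimulus and comparing peak-to-RMS ratios (crest factors). The sketch below is not the stimulus-generation code used in the study: the crest factor is our choice of summary metric, and the specific phase formulas for the alternating and Schroeder patterns are assumed conventions.

```python
import math

HARMONICS = range(2, 21)          # harmonics 2-20, missing fundamental
N = len(HARMONICS)                # 19 equal-amplitude components


def phase(k, pattern):
    """Starting phase (radians) of harmonic k, for a cosine carrier."""
    if pattern == "cosine":
        return 0.0
    if pattern == "alternating":
        # Assumed convention: odd harmonics in sine phase, even in cosine.
        return -math.pi / 2 if k % 2 else 0.0
    if pattern == "schroeder":
        # Assumed convention for negative Schroeder phase.
        return -math.pi * k * (k + 1) / N
    raise ValueError(pattern)


def crest_factor(pattern, samples=2000):
    """Peak-to-RMS ratio of one period of the waveform (F0 normalized to 1)."""
    wave = [
        sum(math.cos(2 * math.pi * k * i / samples + phase(k, pattern))
            for k in HARMONICS)
        for i in range(samples)
    ]
    rms = math.sqrt(sum(x * x for x in wave) / samples)
    return max(abs(x) for x in wave) / rms


for name in ("cosine", "alternating", "schroeder"):
    print(name, round(crest_factor(name), 2))
```

Because all three patterns use the same component amplitudes, their RMS values are identical; only the peak, and hence the crest factor, changes with phase.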

Data analysis. The first step in the analysis was to select the spikes occurring during the 180 ms steady-state portion of each F0 step, excluding the 20 ms transition period between steps. The nonscaling conduction delay for each cochlear location [Td parameter in the study by Carney and Yin (1988)] was subtracted from the time of each spike. Period histograms were computed using 50 bins per cycle of F0 and displayed as a function of both time and F0, as in Figure 3. Period histograms obtained with the same F0 during the ascending and descending parts of the F0 sequence were added together. To apply scaling invariance, the time and F0 axes were expressed in dimensionless units t × F0 (number of stimulus cycles) and n = CF/F0 (neural harmonic number), respectively (see Fig. 3, left panels). Two metrics were computed from the resulting pseudo-spatiotemporal response pattern: the average discharge rate, obtained by integrating each period histogram over time and converting to spikes/second, and the MASD, obtained by differentiating the pseudo-spatiotemporal pattern with respect to neural harmonic number, and then taking the absolute value and integrating over time.
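A minimal Python sketch of these two metrics follows. It is not the analysis code used in the study: the function names are hypothetical, the derivative is approximated by a first difference between adjacent rows, and the normalization of the MASD (mean over bins, divided by the step in neural harmonic number) is an assumption.

```python
def period_histogram(spike_times, f0, n_bins=50):
    """Count spikes in n_bins phase bins of the stimulus period 1/f0."""
    counts = [0] * n_bins
    for t in spike_times:
        cycle_phase = (t * f0) % 1.0            # dimensionless time t x F0
        counts[min(int(cycle_phase * n_bins), n_bins - 1)] += 1
    return counts


def average_rate(counts, observation_time_s):
    """Average discharge rate (spikes/s) from a period histogram."""
    return sum(counts) / observation_time_s


def masd(histograms, harmonic_numbers):
    """Mean absolute spatial derivative between adjacent rows.

    histograms[i] is the period histogram at harmonic_numbers[i]; the
    derivative is a first difference (an assumed discretization).
    """
    out = []
    for i in range(len(histograms) - 1):
        dn = abs(harmonic_numbers[i + 1] - harmonic_numbers[i])
        diffs = [abs(a - b) for a, b in zip(histograms[i], histograms[i + 1])]
        out.append(sum(diffs) / (len(diffs) * dn))
    return out


# Toy pattern: rows 0 and 1 are identical, row 2 is shifted by half a cycle.
row = period_histogram([0.0001, 0.0021, 0.0041], f0=500.0)
shifted = period_histogram([0.0011, 0.0031, 0.0051], f0=500.0)
profile = masd([row, row, shifted], [3.0, 3.125, 3.25])
print(profile)
```

In the toy example the MASD is zero where adjacent rows have the same timing and positive where the response latency shifts, even though the spike count (and therefore the rate) is the same in every row.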

Both the average rate and the MASD were plotted against neural harmonic number after smoothing by a three-point triangular filter with weights [1/4, 1/2, 1/4] (see Fig. 3, right panels). Integer values of neural harmonic number occur when the fiber CF coincides with one of the harmonics of the stimulus, whereas the neural harmonic number is an odd integer multiple of 0.5 (2.5, 3.5, 4.5) when the CF falls halfway between two harmonics. Resolved harmonics are expected to result in peaks in either rate or MASD (or both) near integer values of the neural harmonic number.
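The smoothing step can be sketched as follows. How the end points of each profile were treated is not stated, so repeating the edge value is an assumption made only for this illustration.

```python
def smooth_triangular(values):
    """Three-point triangular filter with weights [1/4, 1/2, 1/4].

    End points are handled by repeating the edge value (an assumption;
    the text does not say how the edges were treated).
    """
    padded = [values[0]] + list(values) + [values[-1]]
    return [
        0.25 * padded[i - 1] + 0.5 * padded[i] + 0.25 * padded[i + 1]
        for i in range(1, len(padded) - 1)
    ]


print(smooth_triangular([0.0, 4.0, 0.0, 0.0]))
```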

We used “bootstrap” resampling (Efron and Tibshirani, 1993) of the spike trains recorded from each fiber to evaluate the statistical reliability of the estimates of average rate and MASD. Specifically, 100 resampled data sets were generated by drawing with replacement from the set of spike trains, typically recorded over 20 stimulus repetitions. For each F0, spike trains in response to the ascending and descending part of the F0 sequence were drawn independently from each other. The average discharge rate and MASD were computed from each bootstrap data set, and the SDs of these measures were used to obtain error bars on the estimates.
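The resampling procedure can be sketched in Python as below. This is a simplified illustration with made-up spike trains and a toy metric; it draws from a single pool of trials and does not reproduce the independent resampling of ascending and descending responses.

```python
import random
import statistics


def bootstrap_sd(trials, metric, n_resamples=100, seed=0):
    """SD of `metric` across bootstrap resamples of a list of trials.

    Each resample draws len(trials) trials with replacement. `metric`
    maps a list of trials to a number (e.g., an average rate).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = [rng.choice(trials) for _ in trials]
        estimates.append(metric(resample))
    return statistics.stdev(estimates)


def mean_spike_count(trials):
    """Toy metric: mean number of spikes per trial."""
    return sum(len(spikes) for spikes in trials) / len(trials)


# 20 hypothetical repetitions with 8-12 spikes each (made-up data).
rng = random.Random(1)
trials = [[rng.random() for _ in range(rng.randint(8, 12))] for _ in range(20)]
print(bootstrap_sd(trials, mean_spike_count))
```

Any metric that takes a list of spike trains, including the rate and MASD computations described above, could be passed as `metric`.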

To quantitatively describe the data from each fiber, a simple mathematical function was fit separately to profiles of average rate and MASD against neural harmonic number n as follows:

$$f(n) = A \cos(2\pi\nu n)\, e^{-n/n_0} + B\, e^{-n/n_0} + C. \tag{1}$$

This expression is the sum of an exponentially decaying sinusoid in cosine phase, a constant term C, and an exponential term. ν represents the normalized frequency of the oscillating component relative to the CF; the sinusoid peaks exactly at integer values of the neural harmonic number when ν = 1. The function has five free parameters: the amplitude A and normalized frequency ν of the damped oscillation, the decay parameter n0 of the exponentials (constrained to be the same for the baseline and oscillatory components), the amplitude B of the exponential term, and the constant term C. Equation 1 was fit to the data by the least-squares method using the Levenberg–Marquardt algorithm as implemented by Matlab’s “lsqcurvefit” function. Figure 5A shows example fits to rate and MASD profiles for one AN fiber.
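To make the shape of this function concrete, the sketch below evaluates Equation 1 in Python for arbitrary parameter values. The parameter values are illustrative only and are not fitted to any data; the fitting step itself is not reproduced here.

```python
import math


def profile_model(n, A, nu, n0, B, C):
    """Equation 1: damped cosine plus exponential baseline plus constant."""
    decay = math.exp(-n / n0)
    return A * math.cos(2 * math.pi * nu * n) * decay + B * decay + C


# Illustrative parameter values (not fitted to any data).
params = dict(A=20.0, nu=1.0, n0=3.0, B=30.0, C=40.0)

for n in (2.0, 2.5, 3.0, 3.5, 4.0):
    print(n, round(profile_model(n, **params), 2))
```

With the normalized frequency set to 1, the function peaks at integer neural harmonic numbers and dips halfway between them, with the size of the oscillation shrinking as n increases.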

Figure 2. Illustration of the principle of cochlear scaling invariance using the peripheral auditory model for cat of Zhang et al. (2001). Left, Spatiotemporal response pattern of the model AN for CFs between 750 and 2250 Hz to a harmonic complex tone with an F0 of 500 Hz at 40 dB SPL. The response is displayed as a function of time (in normalized units t × F0) and cochlear place expressed as neural harmonic number (CF/F0). Right, Temporal response pattern of one model AN fiber (CF, 1500 Hz) to a series of harmonic complex tones with F0 values varying from 333 to 1000 Hz. The response is shown in the same dimensionless coordinates as in the left. The rightmost panels show the spatial profiles of average discharge rate (solid lines) and mean absolute spatial derivative (dashed) derived from the response patterns at their immediate left. Rate and MASD profiles are normalized by their maximum.


To assess the reliability of the rate–place and spatiotemporal cues to resolved harmonics, we compared the ability of the five-parameter model (Equation 1) to fit each rate–place or MASD profile with that of a three-parameter model lacking an oscillatory component (i.e., with A set to 0 in Equation 1). The pitch cues were only considered reliable if an F test for the ratio of variances of the residuals indicated a significantly better fit (p < 0.01) for the model with an oscillatory component.
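One common way to compare nested least-squares fits is the extra-sum-of-squares F statistic, sketched below in Python. The exact form of the F test used in the study is not spelled out, so this formula is an assumption, and the residual sums of squares in the example are made up.

```python
def nested_f_statistic(rss_reduced, rss_full, n_points, p_reduced=3, p_full=5):
    """Extra-sum-of-squares F statistic for two nested least-squares fits."""
    numerator = (rss_reduced - rss_full) / (p_full - p_reduced)
    denominator = rss_full / (n_points - p_full)
    return numerator / denominator


# Made-up residual sums of squares for a 25-point profile.
f_value = nested_f_statistic(rss_reduced=900.0, rss_full=300.0, n_points=25)
print(f_value)
```

Under this formulation, the statistic would be compared against an F distribution with 2 and 20 degrees of freedom for a 25-point profile.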

To quantitatively characterize how well resolved harmonics are represented in rate and MASD profiles, we used the oscillatory fitted curves to compute metrics such as the “harmonic strength” (see Fig. 6) and the best frequency for complex tones, the product of the fitted normalized frequency and the CF, ν × CF (see Fig. 8). Specifically, we fit the model to 100 bootstrap resamplings of the rate and MASD data for each F0 series and computed the metrics for each bootstrap resampling. In Figures 6–8, we report the median values of these metrics across all bootstrap resamplings. These median values showed less variability across fibers compared with estimates based on the original data set.

Results

Our results are based on responses to 240 harmonic complex-tone series recorded from 102 auditory-nerve fibers in five cats. Of these, 72 fibers (71%) had high SR (>18 spikes/s) (Liberman, 1978), 4 had low SR (<0.5 spike/s), and 26 (25%) had medium SR. The CFs of the fibers ranged from 300 to 5900 Hz, with 55% between 1 and 3 kHz. We focused on fibers with CFs <5–6 kHz because phase locking to the waveform fine time structure is critical for the spatiotemporal representation. Stimulus levels ranged from 5 to 85 dB SPL per component, with 80% of the data between 25 and 65 dB SPL.

Spatiotemporal pitch cues in single-fiber responses

Figure 3 shows the responses to complex-tone series with harmonics in cosine phase for three auditory-nerve fibers with CFs of 700 Hz (A), 2150 Hz (B), and 4300 Hz (C), respectively. The left panels show pseudo-spatiotemporal patterns (based on period histograms) as a function of both normalized time (t × F0; horizontal axis) and neural harmonic number (CF/F0; vertical axis). The right panels show the average discharge rate and the MASD obtained from the pseudo-spatiotemporal response patterns on the left (see Materials and Methods) as a function of neural harmonic number. An additional vertical axis on the right shows the stimulus F0 values corresponding to the neural harmonic numbers on the left axis.

For the low-CF fiber (Fig. 3, top), the response latency decreases monotonically with increasing neural harmonic number. This smooth variation is reflected in the absence of peaks at integer neural harmonic numbers in either rate or MASD. Thus, at this low CF, neither the rate–place nor the spatiotemporal representation provides evidence for resolved harmonics for F0 values in the range tested (156–467 Hz). This is consistent with a previous report that rate–place cues to resolved harmonics are rarely observed for F0 values <400–500 Hz in cat because of the limited cochlear frequency resolution at low CFs (Cedolin and Delgutte, 2005).

The pseudo-spatiotemporal response pattern of the fiber with CF at 2150 Hz (Fig. 3, middle) does show staircase-like variations in response latency with neural harmonic number that are qualitatively similar to those predicted by the peripheral auditory model in Figure 2. Specifically, the response latency varies rapidly with neural harmonic number near integer neural harmonic numbers, whereas it changes more slowly at neural harmonic numbers that are halfway between integers. As a result, the MASD shows local maxima near neural harmonic numbers 2, 3, and 4, thus providing evidence for spatiotemporal pitch cues in the AN response. The mean firing rate also shows peaks at integer neural harmonic numbers, although they are less pronounced than those of the MASD at this moderate stimulus level [20 dB relative to (re.) threshold].

Figure 3. Responses of three AN fibers with CFs of 700 Hz (top), 2150 Hz (middle), and 4300 Hz (bottom) to series of harmonic complex tones in cosine phase. Left panels, Pseudo-spatiotemporal discharge pattern, displayed as a function of normalized time (horizontal axis) and neural harmonic number CF/F0 (vertical axis). Right panels, Firing rate and MASD derived from the corresponding pseudo-spatiotemporal response patterns on the left.

The pseudo-spatiotemporal response pattern of the 4.3 kHz fiber (Fig. 3, bottom) shows dark horizontal bands representing increased activity at integer values of the neural harmonic number. These bands are reflected in the rate–place profile, where harmonics 2–6 are resolved despite the relatively high stimulus level (30 dB re. threshold). This strong rate–place coding is consistent with the progressive sharpening of the bandwidths of cochlear filters relative to their center frequency with increasing CF (Kiang et al., 1965; Shera et al., 2002; Cedolin and Delgutte, 2005). In contrast, no obvious latency cues are apparent in the pseudo-spatiotemporal response pattern, and the MASD only shows small peaks at neural harmonic numbers 2 and 3, which likely reflect increased Poisson-like noise when the firing rate is high rather than genuine spatiotemporal cues. This degradation in spatiotemporal cues at higher CFs is consistent with the steep decline in phase locking to the fine time structure >3 kHz in the cat AN (Johnson, 1980).

We have hypothesized that the spatiotemporal representation of pitch may be effective at higher stimulus levels where the rate–place representation breaks down because of saturation of the rate-level functions. Figure 4 shows the responses of an AN fiber (CF, 1920 Hz) to a series of harmonic complex tones at 10, 25, and 40 dB above the fiber’s threshold at CF (25 dB SPL). At the lower level (top), the second, third, and possibly fourth harmonic appear as distinct peaks in both rate and MASD profiles. As the level of each harmonic is increased to 25 dB above threshold (middle), the rate begins to saturate so that only the second harmonic is convincingly resolved. In contrast, strong latency cues to the second, third, and possibly fourth harmonic are still present in the pseudo-spatiotemporal response pattern, resulting in corresponding peaks in the MASD. At the highest level tested (40 dB re. threshold, bottom) the rate is almost completely saturated, whereas peaks at the second and third harmonic are still detectable in the MASD. This example supports our hypothesis that a spatiotemporal representation of pitch might still work at stimulus levels for which a strictly rate-based representation is severely degraded.

Figure 4. Effect of stimulus level on the response of an AN fiber (CF, 1920 Hz) to a series of harmonic complex tones in cosine phase. Response is shown at 35 (top), 50 (middle), and 65 (bottom) dB SPL, respectively. The threshold for a pure tone at CF was 25 dB SPL. The layout is the same as in Figure 3.

Spatiotemporal representation of pitch by the AN fiber population

To quantitatively compare the strengths of the pitch cues provided by the rate–place and spatiotemporal representations, a mathematical function was fit independently to the profiles of average rate and MASD against neural harmonic number for each set of measured responses (see Materials and Methods). An example is shown in Figure 5A for the same fiber as in the middle panels of Figure 3 (CF, 2150 Hz). The fitted curves (solid lines) closely capture the oscillations in both rate and MASD profiles. The more pronounced these oscillations, the better individual harmonics are resolved. In Figure 5A, the oscillations for the MASD seem more prominent than those for the rate. We use the area between the top and bottom envelopes of the fitted curve (Fig. 5A, light shadings) to characterize the strength of the oscillations in the rate and MASD profiles. Since the two metrics (rate and MASD) have different values (and units), the oscillation area was normalized by the median SD of the data points (Fig. 5A, dark shadings) to allow direct comparisons between the two representations. The SDs are obtained by bootstrap resampling of the spike trains across stimulus presentations (see Materials and Methods) and we use the median SD across all F0 values tested. The resulting metric, which we call harmonic strength, is analogous to the sensitivity index d′ in psychophysics (albeit with different units) in that it expresses the strength of the “signal” (the oscillation area) in units of SD of the noise. The harmonic strengths for rate and MASD in Figure 5A are 19.6 and 41.0, respectively, consistent with the visual impression that harmonics are better represented in the MASD in this example.
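For a curve of the form of Equation 1, with a damped oscillation of amplitude A and decay parameter n0, the top and bottom envelopes differ by 2|A| exp(−n/n0), so the area between them has a closed form. The Python sketch below uses that closed form; the integration limits (1.5 to 4.5) and the function name are assumptions, and the numerical inputs are illustrative rather than taken from the recordings.

```python
import math
import statistics


def harmonic_strength(A, n0, sds, n_lo=1.5, n_hi=4.5):
    """Area between the envelopes of Equation 1, in units of median SD.

    The top and bottom envelopes differ by 2|A| exp(-n/n0); the limits
    n_lo and n_hi are assumed to span the tested harmonic numbers.
    """
    area = 2.0 * abs(A) * n0 * (math.exp(-n_lo / n0) - math.exp(-n_hi / n0))
    return area / statistics.median(sds)


# Illustrative values only (not taken from the recordings).
print(harmonic_strength(A=20.0, n0=3.0, sds=[4.0, 5.0, 6.0]))
```

A larger oscillation amplitude or a smaller bootstrap SD both increase the metric, which is the sense in which it resembles a signal-to-noise measure.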

Figure 6 shows harmonic strengths for both average rate (top) and MASD (bottom) plotted against CF for our entire set of responses to complex tones with harmonics in cosine phase. Separate plots are shown for low (<20 dB), moderate (20–40 dB), and high (>40 dB) stimulus levels re. pure-tone thresholds at CF. The solid lines show the linear models (based on analyses of covariance) that best fit the log-transformed data across all level groups with the minimum number of free parameters. For the MASD, separate analyses of covariance were run for CFs below and above 2700 Hz, respectively.

The harmonic strength for rate increases significantly with CF (p < 0.001), consistent with the increase in relative sharpness of cochlear tuning (as expressed by the quality factor Q) (Kiang et al., 1965; Liberman, 1978). The harmonic strength for MASD also increases with CF (p < 0.001), but only up to 2700 Hz, suggesting that another factor, most likely the decrease in phase locking to the fine time structure, counteracts the improvement in cochlear frequency selectivity at higher CFs. Above 2700 Hz, the harmonic strength for MASD does not depend statistically on either CF or level, as shown by the horizontal line.

Harmonic strengths for rate and MASD (<2700 Hz) also differ in their level dependence. For rate, the analysis of covariance shows a significant effect of level group on harmonic strength (p < 0.001) and a significant interaction (p = 0.009) between CF and level, indicating that the degradation in harmonic strength with increasing level is more pronounced for low CFs than for high CFs. The decrease in harmonic strength at high levels may be attributable to the broadening of cochlear filters, rate saturation, or both. In contrast, for MASD, the effect of level on harmonic strength for CFs <2700 Hz barely reached statistical significance (p = 0.033). Post hoc paired comparisons (with Bonferroni’s corrections) revealed no significant differences in MASD harmonic strengths between any pairs of levels. This result is consistent with our hypothesis that the spatiotemporal representation is more robust with level than the rate representation.

To directly compare the strengths of the rate–place and spatiotemporal representations of resolved harmonics for each fiber, we defined a normalized strength difference as the difference between the harmonic strength of the MASD and that of the rate, divided by their sum. This metric takes values between −1 and 1, with positive values indicating that the spatiotemporal representation is stronger than the rate–place representation. Figure 7 shows the normalized strength difference against CF for those measurements in which an oscillating curve could be reliably fit to both rate and MASD profiles. As in Figure 6, data are grouped by level relative to the pure-tone threshold at CF of each fiber. We split the CF range into three groups (with cutoffs at 1350 and 2800 Hz) and ran a two-way ANOVA on the normalized strength difference with level and CF groups as factors. There was a significant effect of CF (p < 0.001) and a significant interaction between CF and level (p = 0.034). We then ran post hoc analyses (with Bonferroni’s corrections) to determine whether the mean strength difference in each condition is greater than zero (indicating the spatiotemporal code is better) or smaller than zero (indicating the rate code is better). For CFs < 1350 Hz, the mean strength difference was significantly greater than zero at intermediate and high levels, but not at low levels. For CFs between 1350 and 2800 Hz, the strength difference was only significantly greater than zero at high levels. For CFs > 2800 Hz, the mean strength difference was significantly smaller than zero at all levels. Thus, resolved harmonics are better represented in the MASD profile than in the rate–place profiles at higher stimulus levels, but only for CFs < 2800 Hz.
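The normalized strength difference defined in this paragraph can be written out as a short Python sketch. The function name and the example strengths are hypothetical and serve only to illustrate the sign convention; they are not data from the study.

```python
def normalized_strength_difference(masd_strength, rate_strength):
    """Return (MASD - rate) / (MASD + rate).

    Positive values favor the spatiotemporal (MASD) representation,
    negative values favor the rate-place representation.
    """
    total = masd_strength + rate_strength
    if total <= 0:
        raise ValueError("the sum of the two harmonic strengths must be positive")
    return (masd_strength - rate_strength) / total


# Hypothetical harmonic strengths, for illustration only.
examples = {
    "masd_stronger": (3.0, 1.0),
    "equal": (2.0, 2.0),
    "rate_stronger": (1.0, 3.0),
}
for name, (masd, rate) in examples.items():
    print(name, normalized_strength_difference(masd, rate))
```

For non-negative strengths the metric is bounded by −1 and 1, and it equals zero when the two representations are equally strong.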

Figure 5. A, Profiles of average rate (left) and MASD (right) against neural harmonic number for an AN fiber (CF, 2150 Hz; same fiber as in middle panel of Fig. 3) in response to a series of harmonic complex tones at 18 dB re. thresholds. The filled circles show the data points, and the solid lines show best-fitting curves based on Equation 1. The light shadings indicate the area between the top and bottom envelopes of the fitted curve. The dark shadings correspond to two typical SDs of the data points, estimated by bootstrap (see Materials and Methods). The ratio of these two quantities is used as an estimate of the strength of the pitch cues provided by each neural representation. B, Interpolation method used for replotting rate and MASD as a function of BF across the AN fiber population for a specific stimulus F0 (872 Hz). Each symbol shows the normalized, interpolated rate (left) or MASD (right) at F0 = 872 Hz for one AN fiber. The solid lines show the best-fitting curves based on Equation 1. The BF estimated from responses to harmonic complex tones is expressed in dimensionless units (BF/F0). Harmonics are in cosine phase.

Cedolin and Delgutte • Spatiotemporal Representation of Pitch J. Neurosci., September 22, 2010 • 30(38):12712–12724 • 12717

Locations of rate and MASD peaks relative to the fiber CF

Both rate–place and spatiotemporal pitch codes are based on cues to the locations of resolved harmonics via local maxima in response profiles (rate or MASD) along the tonotopic axis. Pitch is putatively extracted by matching these cues to central harmonic templates. The exact locations of these local maxima along the tonotopic axis and their stability with respect to stimulus level are therefore important to the viability (or at least the simplicity) of pitch codes based on these cues. With pure-tone stimuli, the best frequency (BF) of AN fibers (where the firing rate is maximal) is known to be dependent on stimulus level (Rose et al., 1971; Temchin and Ruggero, 2010), and such level dependence is likely to occur with harmonic complex tones as well. The situation is less clear for the MASD, which depends primarily on the phase pattern of the cochlear traveling wave. Although AN fibers with CFs between 1 and 4 kHz often show an inflection point near the CF in their phase-frequency curves for pure-tone stimuli (Palmer and Shackleton, 2009; Temchin and Ruggero, 2010), this inflection seems to be level dependent. In this section, we examine in detail the locations of local maxima in rate and MASD profiles and their level dependence.

A precise estimate of the peak locations in rate and MASD profiles can be obtained from the frequency of the oscillatory component of the fitted curve (i.e., the frequency parameter in Eq. 1). Specifically, the product of this parameter and the tuning-curve CF gives an estimate of the BF of the fiber based on responses to complex tones. When the parameter equals 1, the BF matches the tuning-curve CF and the oscillations peak exactly at integer values of the neural harmonic number. In the example of Figure 5A, local maxima in rate and MASD profiles tend to occur slightly below integer values of the neural harmonic number. As a result, the BFs estimated by fitting both rate and MASD profiles (2210 and 2263 Hz, respectively) are slightly, but significantly, higher than the CF estimated from the pure-tone tuning curve (2150 Hz).
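As a worked check of the example in Figure 5A, the sketch below back-calculates the oscillation-frequency parameter implied by each BF estimate (BF divided by the tuning-curve CF) and expresses the BF deviation as a percentage of CF. The function names are hypothetical, and the parameter values are derived here from the reported BFs rather than taken from the paper.

```python
def implied_frequency_parameter(bf_hz, cf_hz):
    """Ratio BF/CF, i.e., the fit parameter whose product with CF gives BF."""
    return bf_hz / cf_hz


def bf_deviation_percent(bf_hz, cf_hz):
    """Difference between BF and tuning-curve CF, as a percentage of CF."""
    return 100.0 * (bf_hz - cf_hz) / cf_hz


cf = 2150.0                        # tuning-curve CF from the Figure 5A example (Hz)
bf_rate, bf_masd = 2210.0, 2263.0  # BF estimates from rate and MASD profiles (Hz)

for label, bf in (("rate", bf_rate), ("MASD", bf_masd)):
    print(label,
          round(implied_frequency_parameter(bf, cf), 4),
          round(bf_deviation_percent(bf, cf), 2))
```

Under these numbers the rate-based BF lies about 2.8% above CF and the MASD-based BF about 5.3% above CF, so both implied parameters are slightly greater than 1.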

Figure 8 shows the relationships between the BF estimates derived from rate and MASD profiles and the CF estimate from pure-tone tuning curves for our entire set of responses to complex tones with harmonics in cosine phase. Differences between estimates, expressed as a percentage of the tuning-curve CF, are plotted against the tuning-curve CF. Because a threshold tuning curve is only measured once for each fiber, the tuning-curve CF provides a level-invariant reference for assessing the level dependence of the BF estimates based on rate and MASD. As in Figure 6, data in Figure 8 are grouped by level relative to the pure-tone threshold at CF of each fiber. For a point to be included, the oscillatory component of the fitted curve (Eq. 1) had to be statistically significant so that the BF could be reliably estimated. The CFs of excluded measurements are indicated by open circles along the horizontal axis in Figure 8. As in Figure 6, the solid lines show the best-fitting linear models across all level groups based on analyses of covariance for CFs > 1 kHz.

Figure 6. Strength of rate–place and spatiotemporal cues to resolved harmonics as a function of CF for the entire set of responses to complex-tone series with harmonics in cosine phase. The filled circles show harmonic strengths computed from best-fitting curves to profiles of firing rate (top) and MASD (bottom) against neural harmonic number. The open circles along the horizontal axis show data points for which the best-fitting curve had no reliable oscillatory component. Results are grouped by level relative to the threshold of each fiber for a pure tone at CF. The solid lines show best-fitting lines across all level groups obtained from analyses of covariance. Separate analyses were performed for rate and MASD, and, for MASD, for CFs below and above 2700 Hz.

Figure 7. Comparison of the strengths of rate-based and spatiotemporal representations of pitch. Symbols show normalized strength differences as a function of CF for the AN fiber population. Positive values mean greater strength for the spatiotemporal representation. Results are grouped by level relative to the threshold of each fiber for a pure tone at CF as in Figure 6. The vertical dashed lines indicate the CF cutoffs used in the ANOVA (see text). Harmonics are in cosine phase.

At low CFs, BFs estimated from both rate (Fig. 8A) and MASD (Fig. 8B) tend to be larger than the tuning-curve CF, with differences that can reach 20–25%. Despite considerable scatter in the data, these deviations exhibit a similar decreasing trend with increasing CF in all three level ranges. Overall, the majority of BF estimates based on rate or MASD are larger than the tuning-curve CFs, except for the rate-based BF at high levels. Importantly, the analyses of covariance show a significant effect of level on BF for rate (p < 0.001), but not for MASD (p > 0.01), indicating that the BF based on MASD is stable with level, whereas the rate-based BF is not.

MASD-based BF estimates are generally larger than rate-based estimates at all levels (Fig. 8C). An analysis of covariance shows significant effects of both CF (p = 0.008) and level (p < 0.001), but no interaction. The mean BF difference increases from 0.93% at low levels to 4.6% at high levels, but remains substantially smaller than the large deviations observed when comparing either BF to the tuning-curve CF (Fig. 8A,B). These observations are consistent with measurements of basilar membrane velocity in the cochlear base for pure-tone stimuli (Robles and Ruggero, 2001): at a given cochlear place, the frequency for which velocity amplitude is maximum is slightly lower than the frequency that maximizes the group delay (the derivative of the phase with respect to frequency). Moreover, the frequency of maximal group delay appears to be more stable with level than that of maximum amplitude.

The finding that local maxima in the MASD profiles are relatively robust with level may appear to conflict with previous data from AN fibers for either pure tones (Palmer and Shackleton, 2009; Temchin and Ruggero, 2010) or inharmonic complex tones (van der Heijden and Joris, 2006). These studies typically report either plots of phase against stimulus frequency on a linear scale for single fibers, or plots of phase against CF on a logarithmic scale for a sample of AN fibers. When inflection points (corresponding to a local maximum in the phase derivative) are present in these plots, their locations tend to be level dependent, in contrast to MASD maxima in the present study. The MASD is based on a derivative with respect to neural harmonic number CF/F0 to take advantage of cochlear scaling invariance. Phase plots from the previous studies might show more level-invariant inflection points if they were replotted against neural harmonic number rather than linear frequency or log CF. A detailed comparison between these studies is difficult because of differences in stimuli and species [guinea pig for Palmer and Shackleton (2009); chinchilla for Temchin and Ruggero (2010); cat for van der Heijden and Joris (2006) and the present study] and because the MASD does not depend exclusively on phase but also on rate.

Both rate–place and spatiotemporal profiles are labeled line codes in that they require the position along the tonotopic axis to be “known” by higher centers where these profiles are putatively matched to harmonic templates. Traditionally, the label is assumed to be the CF obtained from a pure-tone tuning curve, but this threshold metric may not be appropriate for suprathreshold stimuli. The present results suggest that the BF based on spatiotemporal cues may be a suitable label because it is stable with level and is obtained from responses to harmonic complex tones, which are more common in nature than pure tones.

Phase dependence

Psychophysical studies show that the pitch value and pitch strength of harmonic complex tones are generally not greatly affected by the phase relationships among the partials so long as the stimuli contain resolved harmonics (Houtsma and Smurzynski, 1990; Carlyon and Shackleton, 1994; Bernstein and Oxenham, 2005). To test whether the spatiotemporal representation is consistent with these perceptual observations, we measured responses to complex tones with harmonics in alternating sine–cosine phase and negative Schroeder phase as well as cosine phase. Figure 9 compares the responses to harmonic complexes differing in phase patterns for a fiber with a CF of 2520 Hz. In this example, F0 ranged from 560 to 1680 Hz for all three stimuli. Based on previous results (Cedolin and Delgutte, 2005), we expect low-order harmonics to be well resolved in this F0 range, and this expectation is confirmed by the presence of peaks at neural harmonic numbers 2, 3, and 4 in the rate profiles for all three stimuli (Fig. 9D). The pseudo-spatiotemporal response patterns are very similar for all three phase conditions (Fig. 9A–C), as are the strong cues to resolved harmonics present in the MASD profiles (Fig. 9E).

Figure 8. Relationships between BF estimated from rate and MASD profiles in response to harmonic complex tones, and CF estimates from pure-tone tuning curves. The filled circles show differences between estimates, expressed as a percentage of the tuning-curve CF. Results are grouped by stimulus level relative to the threshold of each fiber for a pure tone at CF. The open circles along the horizontal axis indicate measurements for which a BF could not be reliably estimated from rate or MASD profiles. The solid lines show the best-fitting linear models (based on an analysis of covariance) across all levels for CFs > 1 kHz. Harmonics are in cosine phase.

Figure 10 compares the harmonic strengths of the rate–place (top) and spatiotemporal (bottom) representations across a sample of AN fibers for every pair of phase patterns. The CFs of the fibers included in this sample range from 1 to 3 kHz. Harmonic strengths of the MASD for harmonics in cosine and alternating phase are very similar (A), and both are only slightly larger than the harmonic strengths for harmonics in Schroeder phase (B–C). Similarly, we found no statistically significant difference between rate-based harmonic strengths for any of the three phase configurations (D–F), consistent with previous findings (Cedolin and Delgutte, 2005). These results suggest that the saliences of pitch cues available in both the rate–place and spatiotemporal representations are to a large extent independent of the phase relationship among resolved harmonics, consistent with human psychophysical observations.

The phase invariance of the spatiotemporal representation may seem surprising given that this representation strongly depends on the phase pattern of the cochlear traveling wave along the tonotopic axis. However, if a harmonic is well resolved, its local phase pattern (near the place tuned to the harmonic frequency) will be similar to that of a single sinusoid regardless of the phase relationships between this harmonic and neighboring ones. The phase relationship between two neighboring resolved harmonics does affect the spatiotemporal response patterns at CFs that lie halfway between these two harmonics (Fig. 9, compare A, B, for neural harmonic numbers near 2.5 and 3.5).

Figure 9. Effect of phase relationship among the harmonics on rate–place and spatiotemporal representations of pitch for one AN fiber (CF, 2530 Hz). A–C, Pseudo-spatiotemporal response patterns to complex-tone series with harmonics in cosine, alternating (sine–cosine), and Schroeder phase, respectively. The corresponding average rate and MASD profiles, normalized by their respective maxima, are plotted in D and E, respectively.

Figure 10. Effect of phase relationships among the harmonics on the strength of the rate–place (top) and spatiotemporal (bottom) cues to resolved harmonics for a sample of AN fibers. The filled circles compare harmonic strengths derived from spatiotemporal patterns of response to complex tones with harmonics in cosine, alternating, and Schroeder phase. CFs range from 1 to 3 kHz. The solid lines indicate equality.

Pitch estimation from neural population responses

So far, we have relied on the assumption of cochlear scaling invariance to interpret profiles of rate and MASD as a function of CF/F0 for single fibers as being equivalent to spatial patterns along the tonotopic axis of the cochlea. We found that the spatiotemporal cues to resolved harmonics are most robust for CFs between 1 and 2.8 kHz. For comparison with psychophysical data, this CF range needs to be mapped into a range of stimulus F0. For this purpose, we used an interpolation procedure to replot our rate and MASD data as a function of BF for a given stimulus F0. Figure 5B illustrates this procedure for an F0 of 872 Hz. Because the F0 values tested for each fiber are chosen so that the ratio CF/F0 varies from 1.5 to 4.5 in steps of 1/8, few, if any, fibers provide data at this exact F0. However, fibers with CFs between 1300 Hz (1.5 × 872) and 3900 Hz (4.5 × 872) do provide data for F0 values close to the desired 872 Hz. For each of these fibers, we used the fitted curves based on Equation 1 (as in Fig. 5A) to interpolate the rate and MASD at the desired F0 (872 Hz). In addition, because average rates and MASD in response to the same stimulus can substantially differ among fibers with similar CFs, we normalized the firing rates (and MASD) of each fiber by their median value over all F0 tested.

Figure 5B shows the interpolated, normalized rates and MASD for an F0 of 872 Hz as a function of BF (expressed as neural harmonic number BF/F0) for all the fibers studied with harmonic complex tones in cosine phase at levels <27 dB re. threshold. We use the BFs estimated from responses to complex tones rather than the less reliable tuning-curve CF, and we take the geometric mean of the BF estimates for rate and MASD when both are available. The interpolated data show considerably more scatter than typically seen in the single-fiber data (Figs. 3, 4, 5A), reflecting the somewhat ad hoc normalization as well as variability resulting from pooling data across a range of levels and across different animals. Despite this scatter, both rate and MASD tend to peak near integer values of the neural harmonic number. The same oscillating curve (Eq. 1) was fit to the interpolated population data as for the single-fiber data. Since this function shows peaks at harmonically related BFs, we are effectively fitting a harmonic template to the profiles of rate and MASD along the tonotopic map in much the same way as in spectral models of pitch (Goldstein, 1973; Wightman, 1973; Cohen et al., 1995). For both rate and MASD, the period of the damped oscillation in the fitted curve (the reciprocal of the frequency parameter in Eq. 1) gives an estimate of the stimulus F0 from the neural population response. In this case, both F0 estimates (870 Hz for MASD and 855 Hz for rate) deviate by <2% from the actual F0 (872 Hz). However, since the independent variable for curve fitting is the BF derived from responses to harmonic complex tones, the F0 estimates are expected to be unbiased. More meaningful are the SDs of the F0 estimates [computed from the root-mean-square residuals and the Jacobian at the solution vector (Press et al., 1988)], which are a measure of their precision. In the example of Figure 5B, the SDs are ~1% of the actual F0.
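The idea of fitting a harmonic template to a tonotopic profile can be illustrated with a deliberately simplified Python sketch. It is not the fitting procedure used in the study: instead of the damped oscillation of Equation 1 and nonlinear least squares, it uses an undamped cosine template and a grid search over candidate F0 values, and the profile is synthetic and noise-free. All names and values are assumptions made for illustration.

```python
import math


def template_r2(bfs_hz, profile, f0_hz):
    """Squared correlation between the profile and cos(2*pi*BF/F0)."""
    template = [math.cos(2 * math.pi * bf / f0_hz) for bf in bfs_hz]
    n = len(profile)
    mean_t = sum(template) / n
    mean_p = sum(profile) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(template, profile))
    var_t = sum((t - mean_t) ** 2 for t in template)
    var_p = sum((p - mean_p) ** 2 for p in profile)
    if var_t == 0 or var_p == 0:
        return 0.0
    return cov * cov / (var_t * var_p)


def estimate_f0(bfs_hz, profile, candidates_hz):
    """Candidate F0 whose cosine template best matches the profile."""
    return max(candidates_hz, key=lambda f0: template_r2(bfs_hz, profile, f0))


# Synthetic profile peaking at integer multiples of 872 Hz (illustration only).
true_f0 = 872.0
bfs = [1308.0 + 40.0 * i for i in range(66)]  # roughly 1.5 to 4.5 times F0
profile = [1.0 + 0.5 * math.cos(2 * math.pi * bf / true_f0) for bf in bfs]

candidates = [float(f) for f in range(700, 1051)]  # 1 Hz grid, hypothetical range
f0_hat = estimate_f0(bfs, profile, candidates)
print(f0_hat, 100.0 * abs(f0_hat - true_f0) / true_f0)
```

The candidate range is restricted to values near the expected F0 so that the search does not lock onto a different periodicity of the template. With real, scattered data the estimate would carry an uncertainty, which the study quantifies through the SD of the fitted parameter.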

The analysis of Figure 5B was repeated for F0 values ranging from 100 to 1600 Hz in 1/8-octave steps. Figure 11 shows the SDs of the F0 estimates based on rate and MASD profiles as a function of stimulus F0. Results are shown separately for two different level ranges, using a cutoff of 27 dB re. threshold that splits the data into two approximately equal halves. For both metrics and both level ranges, the SDs of the F0 estimates are mostly <1.5% and show no strong dependence on F0. Data are only shown for F0 values where an oscillatory curve could be reliably fit to the interpolated rate and MASD profiles (as assessed by an F test) (see Materials and Methods). For some F0 values, the sampling of BFs was sparse and/or oddly distributed, making F0 estimation unreliable. Such unpredictable sampling of the tonotopic axis is the very reason why we adopted an experimental design based on cochlear scaling invariance, which ensures a fine, uniform sampling of neural harmonic numbers for all fibers.
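For reference, the grid of F0 values implied by "100 to 1600 Hz in 1/8-octave steps" can be generated as follows. This is a minimal sketch; the function name is hypothetical, and it assumes both endpoints are included.

```python
import math


def f0_grid(lo_hz=100.0, hi_hz=1600.0, steps_per_octave=8):
    """Logarithmically spaced F0 values from lo_hz to hi_hz, inclusive."""
    n_steps = round(steps_per_octave * math.log2(hi_hz / lo_hz))
    return [lo_hz * 2 ** (k / steps_per_octave) for k in range(n_steps + 1)]


grid = f0_grid()
print(len(grid), grid[0], grid[-1])
```

The range spans four octaves, so under these assumptions the analysis is repeated at 33 F0 values.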

The lowest F0 for which a clear oscillatory component could be detected in either rate or MASD profiles was 336 Hz at lower levels, and 367 Hz at higher levels. It is harder to define an upper limit for F0 estimation based on MASD. At the higher levels, MASD-based F0 estimation tended to be less reliable than rate-based estimation for F0 values >1100 Hz, as expected from the degradation in phase locking. However, at the lower levels, F0 could be estimated from MASD profiles up to the highest value tested (1600 Hz). The MASD is likely to be overestimated at high frequencies because of intrinsic noise in neural firings. The spatial derivative operation amplifies high-frequency noise in temporal response patterns. This noise is not attenuated at the subsequent temporal integration stage because it is the absolute value of the spatial derivative that is integrated. If the noise is Poisson-like, as expected for AN fibers, then the higher the firing rate, the higher the noise. Thus, the oscillations in MASD profiles at high F0 values may simply be noise modulations directly resulting from the observed oscillations in firing rate. Despite this difficulty, F0 estimation from MASD profiles was overall most effective for F0 values between ~350 and 1100 Hz.
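The sequence of operations named in this paragraph (a spatial derivative across neighboring tonotopic locations, an absolute value, then temporal integration) can be sketched in Python as below. This is a simplified illustration, not the authors' implementation: the exact MASD definition is given in the Materials and Methods, outside this section, and may include scaling or smoothing steps that are omitted here. The sketch assumes uniformly spaced channels and uses a first difference between adjacent channels as the spatial derivative.

```python
def masd(pattern):
    """Mean absolute spatial difference between adjacent channels.

    pattern[i][t] is the response of channel i (ordered along the
    tonotopic axis) in time bin t. Returns one value per adjacent pair.
    """
    result = []
    for lower, upper in zip(pattern, pattern[1:]):
        if len(lower) != len(upper):
            raise ValueError("all channels must have the same number of time bins")
        diffs = [abs(u - l) for l, u in zip(lower, upper)]
        result.append(sum(diffs) / len(diffs))
    return result


# Two channels in phase, then a phase-shifted third channel (hypothetical).
pattern = [
    [0.0, 1.0, 0.0, -1.0],
    [0.0, 1.0, 0.0, -1.0],
    [1.0, 0.0, -1.0, 0.0],
]
print(masd(pattern))

# Zero-mean fluctuations do not cancel, because the absolute value is
# taken before averaging over time.
noisy = [[0.0, 0.0, 0.0, 0.0], [0.1, -0.1, 0.1, -0.1]]
print(masd(noisy))
```

The first example gives a value of zero where neighboring channels fire in phase and a large value where the phase changes rapidly. The second shows why noise inflates the metric: the time-averaged signed difference is zero, but the mean absolute difference is not.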

Discussion

Spatiotemporal representation of pitch

We tested the effectiveness of a spatiotemporal representation of the pitch of harmonic complex tones in the cat auditory nerve. We found that strong spatiotemporal cues to resolved harmonics are available in the responses of AN fibers whose CFs are high enough for harmonics to be sufficiently resolved (>1 kHz) but below the limit (~2.8 kHz) above which phase locking is significantly degraded. For fibers with CF < 1 kHz, rate–place cues to resolved harmonics are also weak because of poor harmonic resolvability, consistent with previous studies (Sachs and Young, 1979; Cedolin and Delgutte, 2005). At high CFs, however, the rate–place representation improves because of the progressive sharpening of cochlear tuning relative to the CF (Kiang et al., 1965; Shera et al., 2002; Cedolin and Delgutte, 2005), whereas the spatiotemporal representation degrades because of the decline in phase locking (Johnson, 1980). In the CF range in which it is effective, the spatiotemporal representation is more robust than the rate–place representation at high stimulus levels. Whereas the effectiveness of the rate–place representation is limited by firing rate saturation (Sachs and Young, 1979), phase locking remains robust at levels at which rate is saturated (Young and Sachs, 1979). Nevertheless, the spatiotemporal representation does degrade somewhat at high levels, consistent with broader cochlear tuning and associated flattening in the frequency dependence of the phase response (Anderson et al., 1971; Palmer and Shackleton, 2009; Temchin and Ruggero, 2010).

Figure 11. SD of the F0 estimated from tonotopic profiles of interpolated rate (top) and MASD (bottom) as a function of stimulus F0 for two ranges of stimulus levels. SDs are expressed as a percentage of the actual F0 and were computed by fitting Equation 1 to the rate and MASD profiles (as in Fig. 5B) and using the Jacobian of the best-fitting parameter vector.

Our primary results are based on the assumption of local scaling invariance in cochlear mechanics (Zweig, 1976), which holds only approximately (Shera and Guinan, 2003; van der Heijden and Joris, 2006). Although scaling invariance implies cochlear filters with constant Q (the ratio of CF to bandwidth), Q is actually an increasing power function of CF in cat AN (Shera et al., 2002; Shera and Guinan, 2003; Cedolin and Delgutte, 2005). This power law (with an exponent of ~0.37) predicts that Q varies by no more than ±21% over the 1.6 octave range of CF/F0 spanned by our stimuli. Although not insignificant, these deviations are probably within the range of experimental error. Scaling invariance further prescribes that all temporal parameters of cochlear processing scale with CF, whereas some of these parameters such as the upper frequency limit of phase locking clearly do not (Johnson, 1980). Using model simulations, Larsen et al. (2008) found that assuming scaling invariance tends to overestimate phase locking to higher harmonics (relative to the mean neural harmonic number 3), and underestimate phase locking to lower harmonics. This effect can be discerned in the model results of Figure 2, in which the decay of the MASD with increasing neural harmonic number is somewhat less pronounced in the right panel (in which scaling invariance is assumed) than in the left panel (in which it is not).
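The size of the Q variation quoted in this paragraph can be checked with a few lines of Python. The sketch assumes that Q is proportional to CF raised to the power 0.37 and that the variation is measured relative to the geometric center of the CF/F0 range (1.5 to 4.5) used in the experimental design; both the reference point and the variable names are assumptions made here.

```python
import math

exponent = 0.37          # power-law exponent quoted in the text (approximate)
ratio_range = 4.5 / 1.5  # CF/F0 spans 1.5 to 4.5

octaves = math.log2(ratio_range)            # width of the range in octaves
q_full_change = ratio_range ** exponent     # Q ratio between the two extremes
q_half_change = ratio_range ** (exponent / 2)

increase = q_half_change - 1.0        # fractional rise above the center value
decrease = 1.0 - 1.0 / q_half_change  # fractional drop below the center value

print(round(octaves, 3), round(q_full_change, 3))
print(round(100 * increase, 1), round(100 * decrease, 1))
```

Under these assumptions the range is about 1.6 octaves, Q changes by a factor of about 1.5 from one extreme to the other, and the deviation from the center value is roughly 22% upward and 18% downward, in line with the figure of about 21% given in the text.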

Our interpolation procedure for replotting rate and MASD as a function of BF for a given F0 (Fig. 5B) bears on the issue of scaling invariance because, at least for rate, it eliminates the constant Q assumption. For the interpolated data, the spatiotemporal representation worked best for F0 values between 350 and 1100 Hz. Within that range, the stimulus F0 could be estimated with an SD <2% based on data from only 30–50 AN fibers. These precision figures are only indicative, as they depend on experimental variables such as the number of fibers sampled and the distribution of CFs in the sample.

The F0 range over which the spatiotemporal representation of pitch is most effective in cat AN encompasses the 500–1000 Hz range of cat vocalizations (Brown et al., 1978; Shipley et al., 1991). What might be the corresponding range in humans? Evidence from both otoacoustic emissions and psychophysics suggests that cochlear frequency resolution may be up to three times finer in humans than in cats (Shera et al., 2002; Oxenham and Shera, 2003; Shera et al., 2010) [but see Ruggero and Temchin (2005) for a contrary opinion]. If so, the lower F0 limit for a viable spatiotemporal representation of pitch would be ~120 Hz in humans. The upper limit at ~1100 Hz would remain the same provided that the frequency dependence of phase locking is similar in cats and humans. This F0 range encompasses most of the range of spoken human voice (80–350 Hz), and the upper limit is approximately consistent with the ~1300 Hz upper limit of pitch perception for missing-fundamental stimuli containing many harmonics (Moore, 1973).
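The extrapolation to humans in this paragraph amounts to simple arithmetic, sketched below in Python. The sketch assumes that the lower F0 limit scales inversely with the sharpness of cochlear tuning, and it measures coverage of the voice range on a logarithmic (octave) scale; both choices, and all names, are assumptions made for illustration.

```python
import math


def scaled_lower_f0_limit(cat_limit_hz, sharpness_ratio):
    """Lower F0 limit if tuning is sharper by the given factor."""
    return cat_limit_hz / sharpness_ratio


human_lower = scaled_lower_f0_limit(350.0, 3.0)  # cat limit and ratio from the text
human_upper = 1100.0                             # assumed unchanged from cat
voice_lo, voice_hi = 80.0, 350.0                 # spoken voice range from the text

covered_lo = max(voice_lo, human_lower)
covered_hi = min(voice_hi, human_upper)
covered_fraction = math.log2(covered_hi / covered_lo) / math.log2(voice_hi / voice_lo)

print(round(human_lower, 1), round(covered_fraction, 2))
```

With a threefold sharpening the lower limit comes out near 117 Hz, which the text rounds to about 120 Hz, and the resulting range covers roughly three quarters of the spoken voice range when measured in octaves.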

Neural codes for pitch

The spatiotemporal representation of pitch offers a number of advantages over traditional codes for the pitch of complex tones. It is more robust with stimulus level than the rate–place representation and, unlike the rate–place representation, predicts an upper frequency limit to the pitch of missing-fundamental stimuli approximately consistent with psychophysical data. Although the autocorrelation (also known as interspike interval) model of pitch has trouble predicting the stronger pitch produced by stimuli containing resolved harmonics compared with stimuli consisting entirely of unresolved harmonics (Carlyon and Shackleton, 1994; Carlyon, 1998; Bernstein and Oxenham, 2005; Cedolin and Delgutte, 2005), the spatiotemporal representation intrinsically requires resolved harmonics. The autocorrelation model also has trouble explaining the poor temporal pitch perception with cochlear implants (Shannon, 1983; Townshend et al., 1987) because phase locking of AN fibers is excellent with electric stimulation of the cochlea (Dynes and Delgutte, 1992; Shepherd and Javel, 1997). In contrast, the lack of a cochlear traveling wave with cochlear implants precludes the phase cues on which the spatiotemporal representation is based. Nevertheless, if pitch perception with resolved harmonics were based primarily on spatiotemporal cues, a second pitch mechanism would be required to account for the weak pitch produced by unresolved harmonics and stimulation through cochlear implants (de Cheveigne, 2005).

The pitch representation tested in this paper (Shamma, 1985) is called “spatiotemporal” because it makes use of both the spatial distribution of AN activity along the tonotopic axis and the precise phase locking of spikes to the fine time structure of the stimulus waveform. Other spatiotemporal models of auditory processing (Loeb et al., 1983; Shamma and Klein, 2000; Carney and Heinz, 2002) also critically depend on the relative timing of spikes from different cochlear locations and rely on the delays created by the cochlear traveling wave rather than postulating neural delay lines as in the autocorrelation model. Nevertheless, these spatiotemporal models differ in important ways. All three models just cited differ from the present one in that they use cross-correlation (or equivalently coincidence detection) rather than lateral inhibition to compare the timing of spikes across cochlear locations. The models of both Loeb et al. (1983) and Shamma and Klein (2000) involve cross-correlation over much wider extents of the tonotopic axis (CF separations of up to several octaves) than the local comparisons used in the present model. The wide tonotopic span of spatial correlations in the model of Shamma and Klein (2000) is consistent with its purpose of explaining the formation of harmonic templates, and this model is compatible with the spatial derivative as a front end. Although level-dependent changes in cochlear tuning play a critical role in the phase opponency model of Carney et al. (2002), the phase cues used by the present model directly follow from linear systems theory (Goldstein et al., 1971). Apparently, the cochlear nonlinearities are mild enough that the spatiotemporal phase cues are fairly robust over the level range tested.

In summary, although several models combining place and temporal information have been proposed for pitch and other perceptual phenomena, these models are clearly distinct, and each has to be evaluated on its own merit. The present study represents the first detailed physiological evaluation of a spatiotemporal pitch model.

Neural mechanisms for extracting spatiotemporal pitch cues

A key question is whether the spatiotemporal cues to resolved harmonics available in the patterns of AN activity are actually extracted in the CNS. A neural mechanism that extracts spatiotemporal cues must be sensitive to the relative timing of spikes from AN fibers innervating neighboring cochlear locations. Because phase locking to the waveform fine structure rapidly degrades in the ascending auditory pathway (Langner, 1992), an early brainstem processing site such as the cochlear nucleus or, perhaps, the medial superior olive (Loeb et al., 1983) is a logical place for such a mechanism.

The simplest possibility is a lateral inhibitory mechanism, as originally proposed by Shamma (1985) and implemented here via the MASD. There is evidence for lateral inhibition in the dorsal cochlear nucleus (DCN), where principal cells receive both excitatory input from AN fibers and a lower-CF inhibitory input from vertical cells (Voigt and Young, 1990). However, phase locking to the fine structure is generally poor in DCN cells (Lavine, 1971; Rhode and Smith, 1986), so these neurons seem ill suited for the fast temporal processing required to decode spatiotemporal cues. Nevertheless, neurons might perform the required computations in their dendrites even if they have poor phase locking because of subsequent low-pass filtering. Although bushy cells in the ventral cochlear nucleus (VCN) do show good phase locking, the BF of inhibition in these neurons appears to match the excitatory BF (i.e., the inhibition is not lateral) (Caspary et al., 1994; Kopp-Scheinpflug et al., 2002). However, these experiments may not have been sufficiently sensitive to detect small BF differences, so that a role for lateral inhibition in processing spatiotemporal cues cannot be ruled out.

An alternative decoding mechanism for spatiotemporal cues is across-CF coincidence detection (Carney, 1994; Carney and Heinz, 2002) for which there is evidence in some phase-locking VCN units (Carney, 1990; Joris et al., 1994; Wang and Delgutte, 2009). A potential difficulty is that fewer spike coincidences are expected at the cochlear locations tuned to resolved harmonics (i.e., rate and phase cues tend to oppose each other, whereas they act in synergy in a lateral inhibitory circuit). Thus, a coincidence detection mechanism may work best at higher stimulus levels at which the rate is saturated.

Conclusion

We tested a spatiotemporal representation of pitch based on phase cues to resolved harmonics that are created by the cochlear traveling wave and can, in principle, be extracted by a lateral-inhibition mechanism. This representation is effective over a range of F0 values whose lower limit is determined by the frequency selectivity of the cochlea, whereas the upper limit is caused by the degradation in phase locking. In cats, the spatiotemporal representation is viable over the F0 range of cat vocalizations, and this correspondence may also approximately hold in humans because of the putatively sharper cochlear frequency tuning in that species. The spatiotemporal representation is consistent with key trends in pitch psychophysics and is more robust than the rate–place representation at high stimulus levels.

References

Anderson DJ, Rose JE, Hind JE, Brugge JF (1971) Temporal position of discharges in single auditory nerve fibers within the cycle of a sine-wave stimulus: frequency and intensity effects. J Acoust Soc Am 49 [Suppl 2]:1131–1154.

Bendor D, Wang X (2005) The neuronal representation of pitch in primate auditory cortex. Nature 436:1161–1165.

Bernstein JG, Oxenham AJ (2005) An autocorrelation model with place dependence to account for the effect of harmonic number on fundamental frequency discrimination. J Acoust Soc Am 117:3816–3831.

Brown KA, Buchwald JS, Johnson JR, Mikolich DJ (1978) Vocalization in the cat and kitten. Dev Psychobiol 11:559–570.

Cariani PA, Delgutte B (1996a) Neural correlates of the pitch of complex tones. I. Pitch and pitch salience. J Neurophysiol 76:1698–1716.

Cariani PA, Delgutte B (1996b) Neural correlates of the pitch of complex tones. II. Pitch shift, pitch ambiguity, phase invariance, pitch circularity, rate pitch, and the dominance region for pitch. J Neurophysiol 76:1717–1734.

Carlyon RP (1998) Comments on “A unitary model of pitch perception” [J. Acoust. Soc. Am. 102, 1811–1820 (1997)]. J Acoust Soc Am 104:1118–1121.

Carlyon RP, Shackleton TM (1994) Comparing the fundamental frequencies of resolved and unresolved harmonics: evidence for two pitch mechanisms? J Acoust Soc Am 95:3541–3554.

Carney LH (1990) Sensitivities of cells in the anteroventral cochlear nucleus of cat to spatiotemporal discharge patterns across primary afferents. J Neurophysiol 64:437–456.

Carney LH (1994) Spatiotemporal encoding of sound level: models for normal encoding and recruitment of loudness. Hear Res 76:31–44.

Carney LH, Heinz MG (2002) Auditory phase opponency: a temporal model for masked detection at low frequencies. Acta Acust 88:334–346.

Carney LH, Yin TC (1988) Temporal coding of resonances by low-frequency auditory nerve fibers: single-fiber responses and a population model. J Neurophysiol 60:1653–1677.

Caspary DM, Backoff PM, Finlayson PG, Palombi PS (1994) Inhibitory inputs modulate discharge rate within frequency receptive fields of anteroventral cochlear nucleus neurons. J Neurophysiol 72:2124–2133.

Cedolin L, Delgutte B (2005) Pitch of complex tones: rate-place and interspike interval representations in the auditory nerve. J Neurophysiol 94:347–362.

Cohen MA, Grossberg S, Wyse LL (1995) A spectral network model of pitch perception. J Acoust Soc Am 98:862–879.

de Cheveigné A (2005) Pitch perception models. In: Pitch: neural coding and perception (Plack CJ, Oxenham AJ, Fay RR, Popper AN, eds), pp 169–233. New York: Springer.

Dynes SB, Delgutte B (1992) Phase-locking of auditory-nerve discharges to sinusoidal electric stimulation of the cochlea. Hear Res 58:79–90.

Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. New York: Chapman and Hall.

Evans EF (1983) Pitch and cochlear nerve fibre temporal discharge patterns. In: Hearing: physiological bases and psychophysics (Klinke R, Hartmann R, eds), pp 140–146. Berlin: Springer.

Goldstein JL (1973) An optimum processor theory for the central formation of the pitch of complex tones. J Acoust Soc Am 54:1496–1516.

Goldstein JL, Baer T, Kiang NYS (1971) A theoretical treatment of latency, group delay, and tuning characteristics for auditory-nerve responses to clicks and tones. In: Physiology of the auditory system (Sachs MB, ed), pp 133–141. Baltimore: National Educational Consultants.

Houtsma AJM, Smurzynski J (1990) Pitch identification and discrimination for complex tones with many harmonics. J Acoust Soc Am 87:304–310.

Javel E (1980) Coding of AM tones in the chinchilla auditory nerve: implications for the pitch of complex tones. J Acoust Soc Am 68:133–146.

Johnson DH (1980) The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. J Acoust Soc Am 68:1115–1122.

Joris PX, Carney LH, Smith PH, Yin TC (1994) Enhancement of neural synchronization in the anteroventral cochlear nucleus. I. Responses to tones at the characteristic frequency. J Neurophysiol 71:1022–1036.

Kiang NY, Moxon EC (1974) Tails of tuning curves of auditory-nerve fibers. J Acoust Soc Am 55:620–630.

Cedolin and Delgutte • Spatiotemporal Representation of Pitch J. Neurosci., September 22, 2010 • 30(38):12712–12724 • 12723

Kiang NYS, Watanabe T, Thomas EC, Clark LF (1965) Discharge patterns of single fibers in the cat’s auditory nerve. Cambridge, MA: MIT.

Kopp-Scheinpflug C, Dehmel S, Dörrscheidt GJ, Rübsamen R (2002) Interaction of excitation and inhibition in anteroventral cochlear nucleus neurons that receive large endbulb synaptic endings. J Neurosci 22:11004–11018.

Langner G (1992) Periodicity coding in the auditory system. Hear Res 60:115–142.

Larsen E, Cedolin L, Delgutte B (2008) Pitch representations in the auditory nerve: two concurrent complex tones. J Neurophysiol 100:1301–1319.

Lavine RA (1971) Phase-locking in response of single neurons in cochlear nuclear complex of the cat to low frequency tonal stimuli. J Neurophysiol 34:467–483.

Liberman MC (1978) Auditory-nerve responses from cats raised in a low-noise chamber. J Acoust Soc Am 63:442–455.

Licklider JC (1951) A duplex theory of pitch perception. Experientia 7:128–134.

Loeb GE, White MW, Merzenich MM (1983) Spatial cross-correlation. A proposed mechanism for acoustic pitch perception. Biol Cybern 47:149–163.

Meddis R, Hewitt MJ (1991a) Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I. Pitch identification. J Acoust Soc Am 89:2866–2882.

Meddis R, Hewitt MJ (1991b) Virtual pitch and phase sensitivity of a computer model of the auditory periphery. II. Phase sensitivity. J Acoust Soc Am 89:2883–2894.

Moore BC (1973) Some experiments relating to the perception of complex tones. Q J Exp Psychol 25:451–475.

Oxenham AJ, Shera CA (2003) Estimates of human cochlear tuning at low levels using forward and simultaneous masking. J Assoc Res Otolaryngol 4:541–554.

Palmer AR (1990) The representation of the spectra and fundamental frequencies of steady-state single- and double-vowel sounds in the temporal discharge patterns of guinea pig cochlear-nerve fibers. J Acoust Soc Am 88:1412–1426.

Palmer AR, Shackleton TM (2009) Variation in the phase of response to low-frequency pure tones in the guinea pig auditory nerve as functions of stimulus level and frequency. J Assoc Res Otolaryngol 10:233–250.

Pfeiffer RR, Kim DO (1975) Cochlear nerve fiber responses: distribution along the cochlear partition. J Acoust Soc Am 58:867–869.

Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1988) Numerical recipes in C. The art of scientific computing. Cambridge, UK: Cambridge UP.

Rhode WS (1995) Interspike intervals as a correlate of periodicity pitch in cat cochlear nucleus. J Acoust Soc Am 97:2413–2429.

Rhode WS, Smith PH (1986) Physiological studies on neurons in the dorsal cochlear nucleus of cat. J Neurophysiol 56:287–307.

Robles L, Ruggero MA (2001) Mechanics of the mammalian cochlea. Physiol Rev 81:1305–1352.

Rose JE, Hind JE, Anderson DJ, Brugge JF (1971) Some effects of stimulus intensity on response of auditory nerve fibers in the squirrel monkey. J Neurophysiol 34:685–699.

Ruggero MA, Temchin AN (2005) Unexceptional sharpness of frequency tuning in the human cochlea. Proc Natl Acad Sci U S A 102:18614–18619.

Sachs MB, Young ED (1979) Encoding of steady-state vowels in the auditory nerve: representation in terms of discharge rate. J Acoust Soc Am 66:470–479.

Sayles M, Winter IM (2008) Ambiguous pitch and the temporal representation of inharmonic iterated rippled noise in the ventral cochlear nucleus. J Neurosci 28:11925–11938.

Schouten JF, Ritsma RJ, Cardozo BL (1962) Pitch of the residue. J Acoust Soc Am 34:1418–1424.

Schroeder MR (1970) Synthesis of low peak-factor signals and binary sequences with low autocorrelation. IEEE Trans Inf Theory 16:85–89.

Shamma SA (1985) Speech processing in the auditory system. II: Lateral inhibition and the central processing of speech evoked activity in the auditory nerve. J Acoust Soc Am 78:1622–1632.

Shamma S, Klein D (2000) The case of the missing pitch templates: how harmonic templates emerge in the early auditory system. J Acoust Soc Am 107:2631–2644.

Shannon RV (1983) Multichannel electrical stimulation of the auditory nerve in man. I. Basic psychophysics. Hear Res 11:157–189.

Shepherd RK, Javel E (1997) Electrical stimulation of the auditory nerve. I. Correlation of physiological responses with cochlear status. Hear Res 108:112–144.

Shera CA, Guinan JJ Jr (2003) Stimulus-frequency-emission group delay: a test of coherent reflection filtering and a window on cochlear tuning. J Acoust Soc Am 113:2762–2772.

Shera CA, Guinan JJ Jr, Oxenham AJ (2002) Revised estimates of human cochlear tuning from otoacoustic and behavioral measurements. Proc Natl Acad Sci U S A 99:3318–3323.

Shera CA, Guinan JJ Jr, Oxenham AJ (2010) Otoacoustic estimation of cochlear tuning: validation in the chinchilla. J Assoc Res Otolaryngol 11:343–365.

Shipley C, Carterette EC, Buchwald JS (1991) The effect of articulation on the acoustical structure of feline vocalizations. J Acoust Soc Am 89:902–909.

Shofner WP (1991) Temporal representation of rippled noise in the anteroventral cochlear nucleus of the chinchilla. J Acoust Soc Am 90:2450–2466.

Shofner WP (1999) Responses of cochlear nucleus units in the chinchilla to iterated rippled noises: analysis of neural autocorrelograms. J Neurophysiol 81:2662–2674.

Temchin AN, Ruggero MA (2010) Phase-locked responses to tones of chinchilla auditory nerve fibers: implications for apical cochlear mechanics. J Assoc Res Otolaryngol 11:297–318.

Townshend B, Cotter N, Van Compernolle D, White RL (1987) Pitch perception by cochlear implant subjects. J Acoust Soc Am 82:106–115.

van der Heijden M, Joris PX (2006) Panoramic measurements of the apex of the cochlea. J Neurosci 26:11462–11473.

Voigt HF, Young ED (1990) Cross-correlation analysis of inhibitory interactions in dorsal cochlear nucleus. J Neurophysiol 64:1590–1610.

Wang GI, Delgutte B (2009) Spatio-temporal processing of auditory-nerve activity in the cochlear nucleus. Abstr Assoc Res Otolaryngol 32:845.

Wiegrebe L, Winter IM (2001) Temporal representation of iterated rippled noise as a function of delay and sound level in the ventral cochlear nucleus. J Neurophysiol 85:1206–1219.

Wightman FL (1973) The pattern-transformation model of pitch. J Acoust Soc Am 54:407–416.

Winter IM, Wiegrebe L, Patterson RD (2001) The temporal representation of the delay of iterated rippled noise in the ventral cochlear nucleus of the guinea-pig. J Physiol 537:553–566.

Young ED, Sachs MB (1979) Representation of steady-state vowels in the temporal aspects of the discharge patterns of populations of auditory-nerve fibers. J Acoust Soc Am 66:1381–1403.

Zhang X, Heinz MG, Bruce IC, Carney LH (2001) A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression. J Acoust Soc Am 109:648–670.

Zweig G (1976) Basilar membrane motion. Cold Spring Harbor Symp Quant Biol 40:619–633.


