
Behavioral/Systems/Cognitive

Real-World Scene Representations in High-Level Visual Cortex: It's the Spaces More Than the Places

Dwight J. Kravitz, Cynthia S. Peng, and Chris I. Baker
Laboratory of Brain and Cognition, National Institute of Mental Health, National Institutes of Health, Bethesda, Maryland 20892

Real-world scenes are incredibly complex and heterogeneous, yet we are able to identify and categorize them effortlessly. In humans, the ventral temporal parahippocampal place area (PPA) has been implicated in scene processing, but scene information is contained in many visual areas, leaving their specific contributions unclear. Although early theories of PPA emphasized its role in spatial processing, more recent reports of its function have emphasized semantic or contextual processing. Here, using functional imaging, we reconstructed the organization of scene representations across human ventral visual cortex by analyzing the distributed response to 96 diverse real-world scenes. We found that, although individual scenes could be decoded in both PPA and early visual cortex (EVC), the structure of representations in these regions was vastly different. In both regions, spatial rather than semantic factors defined the structure of representations. However, in PPA, representations were defined primarily by the spatial factor of expanse (open, closed) and in EVC primarily by distance (near, far). Furthermore, independent behavioral ratings of expanse and distance correlated strongly with representations in PPA and peripheral EVC, respectively. In neither region was content (manmade, natural) a major contributor to the overall organization. Furthermore, the response of PPA could not be used to decode the high-level semantic category of scenes even when spatial factors were held constant, nor could category be decoded across different distances. These findings demonstrate, contrary to recent reports, that the response of PPA primarily reflects spatial, not categorical or contextual, aspects of real-world scenes.

Introduction

Despite the complexity and heterogeneity of scenes, scene processing produces neural representations capable of supporting a variety of tasks, including navigation, object identification, extraction of semantic information, and guidance of visual attention. Although much of visual cortex clearly contributes to scene processing, research has often focused on the parahippocampal place area (PPA), which responds more strongly when people view scenes or buildings than individual objects or faces (Aguirre et al., 1998; Epstein and Kanwisher, 1998; Levy et al., 2001). Although such scene selectivity suggests a specialized role in scene processing, the precise information extracted by PPA and the nature of the underlying neural representations remain unclear.

Some theories of PPA function suggest that it is primarily involved in encoding the spatial layout of scenes (Maguire et al., 1996; Epstein and Kanwisher, 1998; Park et al., 2011) and the retrieval of familiar scenes (Rosenbaum et al., 2004; Epstein and Higgins, 2007; Hayes et al., 2007). Consistent with these theories, there are anatomical projections from parietal into parahippocampal cortex (Kravitz et al., 2011), and anterograde amnesia for scene layouts has been reported after damage to regions encompassing PPA (Aguirre and D'Esposito, 1999; Barrash et al., 2000). However, more recent reports have proposed that PPA maintains representations of the contextual associations of individual objects rather than scenes per se (Bar, 2004; Bar et al., 2008; Gronau et al., 2008) (but see Epstein and Ward, 2010). Finally, it has been proposed that PPA is responsible for natural scene categorization, distinguishing among high-level conceptual categories of scenes (e.g., beaches, buildings) (Walther et al., 2009). Critically, however, other regions, such as early visual cortex (EVC) and object-selective cortex, evidenced equivalent categorization of scenes, making it difficult to determine the unique contribution of PPA.

The aim of the current study was to investigate, in a data-driven manner, the structure of scene representations across human ventral visual cortex using the distributed response patterns. We took advantage of the power of ungrouped event-related designs (Kriegeskorte et al., 2006, 2008b; Kravitz et al., 2010) to test a broad array of scenes from different categories, evenly divided between manmade and natural scenes (Oliva and Torralba, 2001; Joubert et al., 2007). Critically, we further controlled and evaluated the contribution of spatial information by choosing scenes to equally span differences in expanse (open, closed) and relative distance (near, far) (see Fig. 1) (Oliva and Torralba, 2001; Torralba and Oliva, 2003; Loschky and Larson, 2008; Greene and Oliva, 2009b). Consistent with previous reports (Kay et al., 2008; Walther et al., 2009), the identity of individual scenes could be decoded in both EVC and PPA. However, PPA primarily grouped scenes based on their expanse, whereas grouping in EVC was generally weaker and based on relative distance. Furthermore, the observed grouping in PPA and EVC correlated strongly with behavioral judgments of expanse and relative distance, respectively. Contrary to reports of contextual and category effects in PPA, there was no grouping by content nor any ability to decode scene category either within or across spatial factors. Together, these findings indicate that representations in PPA primarily reflect spatial and not category information.

Received Sept. 1, 2010; revised March 3, 2011; accepted March 22, 2011.
Author contributions: D.J.K., C.S.P., and C.I.B. designed research; D.J.K. and C.S.P. performed research; D.J.K. contributed unpublished reagents/analytic tools; D.J.K. analyzed data; D.J.K. and C.I.B. wrote the paper.
This work was supported by the National Institute of Mental Health Intramural Research Program. Thanks to Marlene Behrmann, Assaf Harel, Alex Martin, Dale Stevens, and other members of the Laboratory of Brain and Cognition, National Institute of Mental Health for helpful comments and discussion.
Correspondence should be addressed to Dwight J. Kravitz, 10 Center Drive, Room 3N228, Laboratory of Brain and Cognition, National Institute of Mental Health, National Institutes of Health, Bethesda, MD 20892. E-mail: [email protected]
DOI:10.1523/JNEUROSCI.4588-10.2011
Copyright © 2011 the authors 0270-6474/11/317322-12$15.00/0

7322 • The Journal of Neuroscience, May 18, 2011 • 31(20):7322–7333

Materials and Methods

Participants and testing. Ten participants (six female), ages 21–35 years, participated in the functional magnetic resonance imaging (fMRI) experiment. For one participant, there was insufficient time to collect the localizer for EVC. Six participants aged 21–28 years participated in the independent behavioral experiment. All participants had normal or corrected-to-normal vision and gave written informed consent. The consent and protocol were approved by the National Institutes of Health Institutional Review Board.

Event-related fMRI stimuli and task. During the six event-related runs of the fMRI experiment, participants were presented with 96 highly detailed and diverse real-world scenes (1024 × 768 pixels, 20 × 15°) in a randomized order for 500 ms each. Interstimulus intervals (4–12 s) were chosen to optimize the ability of the subsequent deconvolution to extract responses to each scene, using the optseq function from AFNI (Analysis of Functional NeuroImages)/FreeSurfer.

To ensure fixation, participants performed a shape-judgment task on the central fixation cross. Specifically, simultaneous with the presentation of each scene, one arm of the fixation cross grew slightly longer, and participants indicated which arm grew via a button press. Which arm grew was counterbalanced across scenes between runs, such that both arms grew equally often with each scene. We used this task, which was orthogonal to the scenes, to measure the structure of scene representations without introducing any confounds or feedback effects caused by task.

The scenes were selected to span the stimulus domain as broadly as possible. Scenes were constrained to represent naturalistic (eye-level) views. The scenes were taken from 16 categories (six exemplars each), divided evenly by content (manmade, natural) (Oliva and Torralba, 2001; Joubert et al., 2007). To test for the relative importance of spatial information, scenes within these categories were chosen to equally span two spatial dichotomies thought to be important for scene perception: expanse (open, closed: the spatial boundary of the scene) and relative distance (near, far: distance to the nearest foreground objects) (Oliva and Torralba, 2001; Torralba and Oliva, 2003; Loschky and Larson, 2008; Greene and Oliva, 2009b; Ross and Oliva, 2010) (Fig. 1) (for full stimulus set, see supplemental Item 1, available as supplemental material). Scenes were identified as belonging to a particular level of a dichotomy (e.g., open, closed) based on agreement among the authors. In the case of open and closed scenes, which differed in their spatial boundaries, and content, which differed in their constituent objects, the differences were quite clear. Relative distance was defined within each category, and thus exemplars differed considerably in vergence cues and the amount of space depicted, making attributions to either near or far simple. Because each of the 16 categories had both near and far exemplars, each scene reflected one of eight possible classifications (Fig. 1, manmade/closed/near top left two images). Note that all scenes differed from one another at an individual level in their spatial layout.

fMRI localizer stimuli and task. Four independent block-design scans

were also collected in each participant to localize scene-selective, object-selective, and face-selective cortex and EVC regions of interest (ROIs). Each of these scans was an on/off design with alternating blocks of stimuli presented while participants either performed a one-back task (for the object, face, and scene localizers) or simply maintained fixation (EVC). Scene-selective cortex was localized with the contrast of scenes versus faces, object-selective cortex with the contrast of objects versus retinotopically matched scrambled objects (Kravitz et al., 2010), and face-selective cortex with the contrast of faces versus objects. Scene, object, and face images were grayscale photographs. Peripheral (pEVC) and central (cEVC) EVC were localized with the contrast of central (5°) and peripheral (6–15°) flickering (8 Hz) checkerboards.

fMRI scanning parameters. Participants were scanned on a research-dedicated GE 3 tesla Signa scanner located in the Clinical Research Center on the National Institutes of Health campus (Bethesda, MD). Partial volumes of the temporal and occipital cortices were acquired using an eight-channel head coil (22 slices; 2 × 2 × 2 mm; 0.2 mm interslice gap; TR, 2 s; TE, 30 ms; matrix size, 96 × 96; FOV, 192 mm). In all scans, oblique slices were oriented approximately parallel to the base of the temporal lobe and generally covered the temporal lobe from its most inferior extent to the superior temporal sulcus and extended posteriorly through all of early visual cortex. Six event-related runs (263 TRs) and eight localizer scans (144 TRs) were acquired in each session.

fMRI preprocessing. Data were analyzed using the AFNI software package. Before statistical analysis, all of the images for each participant were motion corrected to the first image of their first run after removal of the first and last eight TRs from each run. After motion correction, the localizer runs (but not the event-related runs) were smoothed with a 3 mm full-width at half-maximum Gaussian kernel.

Figure 1. Stimulus selection. Scenes were chosen from a broad array of high-level conceptual categories (supplemental Item 1, available as supplemental material) and equally spanned three dichotomies: content (manmade, natural), expanse (open, closed), and relative distance (near, far). Near and far scenes were differentiated by the relative distance between the viewer and foreground objects within a scene category. Manmade and natural scenes differed in whether the majority of the scene contained artificial or natural objects. Open and closed scenes were defined by whether the scene implied the viewer was in an enclosed space.

fMRI statistical analysis. ROIs were created for each participant from the localizer runs. Significance maps of the brain were computed by performing a correlation analysis thresholded at a p value of 0.0001 (uncorrected). ROIs were generated from these maps by taking the contiguous clusters of voxels that exceeded threshold and occupied the appropriate anatomical location based on previous studies (Sayres and Grill-Spector, 2008; Schwarzlose et al., 2008). To ensure that all ROIs were mutually exclusive, we used the following precedence rules to remove overlapping voxels. First, if a voxel showed any position selectivity (center vs periphery), it was deemed retinotopic and excluded from all the category-selective ROIs. Category selectivity is, by necessity, always established by the contrast of two retinotopically distinct categories, and the demonstration that a voxel shows any position effects suggests that its selectivity is attributable to simple retinotopy. Second, any voxel that showed selectivity for faces or scenes but did not differentially respond to central or peripheral checkerboards was deemed selective for those categories. Third, any voxel that showed a stronger response to objects than scrambled objects but did not respond differentially to the checkerboards and did not respond more to faces or scenes than objects was included in the object-selective ROIs.

Furthermore, all of the analyses presented below were also performed with all overlapping voxels removed from every ROI, and no significant changes in the results occurred. Finally, we also performed all of the analyses of PPA and pEVC with matching voxel sizes by randomly subsampling pEVC and found no qualitative differences in any of the reported results.

We conducted a standard general linear model using the AFNI software package to deconvolve the event-related responses. Our experiment combined a sparse event-related design with multivoxel pattern analysis, allowing us to assess the response to each individual stimulus and not average across a priori categories of stimuli (ungrouped design). Response patterns in the event-related runs were created by performing t tests between each condition and baseline. The t values for each condition were then extracted from the voxels within each ROI, and we then used an iterative variant (MacEvoy and Epstein, 2007; Chan et al., 2010; Kravitz et al., 2010) of split-half correlation analysis (Haxby et al., 2001; Williams et al., 2008) to establish the similarity between the response patterns of each pair of scenes, once the mean signal was independently removed from each half of the data. This yielded similarity matrices that represent the similarity in the spatial pattern of response across the ROI between each pair of conditions. t values were used because they reduced the impact of noisy voxels on the patterns of response (Misaki et al., 2010), and nearly equivalent results were obtained using the coefficients. Also, to rule out baseline activity differences as the source of any observed effects, all analyses were performed with and without the mean activity removed. The main effect of the removal of the mean activity was a normalization of the data, leading to an increase in the structure of the resulting similarity matrices and a reduction in the overall level of correlation. However, there were no qualitative or significant effects on any of the grouping or discrimination results.
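The core of the split-half correlation analysis can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the inputs `half1` and `half2` are assumed to hold one t-value pattern per scene from each independent half of the data, and the mean pattern is removed separately from each half before cross-correlating, as described above.

```python
import numpy as np

def split_half_similarity(half1, half2):
    """Build a condition-by-condition similarity matrix from two halves.

    half1, half2 : arrays of shape (n_conditions, n_voxels) holding the
    t-value response pattern for each scene in each half of the data
    (hypothetical inputs). The diagonal of the returned matrix holds
    within-scene correlations; off-diagonal cells hold between-scene
    correlations.
    """
    # Remove the mean signal independently from each half, as in the text
    h1 = half1 - half1.mean(axis=0, keepdims=True)
    h2 = half2 - half2.mean(axis=0, keepdims=True)
    n = h1.shape[0]
    sim = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # Pearson correlation between the two response patterns
            sim[i, j] = np.corrcoef(h1[i], h2[j])[0, 1]
    return sim
```

For the 96-scene experiment this yields the 96 × 96 matrices analyzed in the Results; averaging over iterated random splits of the runs would give the iterative variant cited above.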

All analyses were also repeated after applying a Fisher transformation to the correlation values. No qualitative or significant effects on any of the results were observed, which is unsurprising given that none of the correlations approached either 1 or −1, and correlations near zero approximate the normal distribution.

Selectivity analysis. To investigate the distribution of scene information throughout the whole volume, we performed a novel selectivity analysis. Typical information-based mapping uses a searchlight, which determines what information is available in the response of a local cluster of voxels. Although useful, this approach is forced to assume that information is present only in these local clusters, constrains the sort of information being searched for, and introduces non-independence between adjacent voxels. Our analysis avoids these problems and simply evaluates whether each individual voxel shows any consistent selectivity among our set of 96 stimuli across independent halves of the data.

To determine whether a particular voxel exhibits consistent selectivity among our set of stimuli, we smoothed the event-related data to 3 mm to match our block-design localizers and divided the data into two independent halves, using the same iterative procedure we used for the similarity analysis. We then correlated the relative levels of activation to each of the 96 scenes across the two halves of the data. If a particular voxel is responsive to, but not selective among, our set of scenes, it will produce two sets of responses in the two halves of data that may have the same distribution (i.e., mean, SD), but there will be no correlation between the rank orderings of the responses. Alternatively, a voxel that is both responsive and selective will produce a correlated pattern of selectivity between the two halves of the data. The correlation value assigned to each voxel therefore indicated its consistency of selectivity across our stimuli. These values were then averaged across all the voxels within a region within each participant.
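The per-voxel step of this analysis reduces to one correlation per voxel across the two halves; a minimal sketch (with hypothetical array names, not the authors' code) is:

```python
import numpy as np

def voxel_selectivity(half1, half2):
    """Per-voxel consistency of selectivity across stimuli.

    half1, half2 : (n_conditions, n_voxels) response estimates from the
    two independent halves of the data (hypothetical inputs). For each
    voxel, the responses to all scenes in one half are correlated with
    the responses in the other half; a consistent rank ordering of
    responses yields a high r, while a responsive but unselective voxel
    yields r near zero.
    """
    n_vox = half1.shape[1]
    r = np.empty(n_vox)
    for v in range(n_vox):
        r[v] = np.corrcoef(half1[:, v], half2[:, v])[0, 1]
    return r
```

Averaging the returned values over the voxels of an ROI gives the region-level consistency score described above.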

To establish whether a cluster of voxels showed significant selectivity, we used a cluster threshold based on the following randomization procedure. First, we took the data from the independent halves of the data in each participant, randomized the condition labels, and correlated the selectivity. Importantly, the randomization was the same for every voxel, maintaining any non-stimulus-specific relationships between voxels. We then searched the entire volume for the largest contiguous cluster of voxels with correlation values greater than r = 0.168 (p < 0.05). We repeated this procedure 10,000 times for each participant and derived the minimum cluster size that occurred in <5% of the iterations. This cluster size served as a participant-specific threshold for determining which clusters of voxels (r > 0.168, p < 0.05) were significant. The average threshold for cluster size was ~12.
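The randomization procedure can be sketched as below. This is an assumed reconstruction, not the published code: the data layout (`half1`, `half2` as condition-by-voxel arrays filling a volume of shape `mask_shape`) and the iteration count are hypothetical, and cluster finding is delegated to `scipy.ndimage.label`.

```python
import numpy as np
from scipy import ndimage

def cluster_size_threshold(half1, half2, mask_shape, r_crit=0.168,
                           n_iter=1000, alpha=0.05, rng=None):
    """Permutation-based cluster-size threshold (illustrative sketch).

    On each iteration the condition labels of one half are shuffled
    identically for every voxel (preserving non-stimulus-specific
    relationships between voxels), per-voxel split-half correlations are
    recomputed, and the size of the largest contiguous supra-threshold
    cluster is recorded. The (1 - alpha) quantile of these maxima serves
    as the cluster-size threshold.
    """
    rng = np.random.default_rng(rng)
    n_cond, n_vox = half1.shape
    max_sizes = []
    for _ in range(n_iter):
        perm = rng.permutation(n_cond)          # same shuffle for all voxels
        r = np.array([np.corrcoef(half1[perm, v], half2[:, v])[0, 1]
                      for v in range(n_vox)])
        supra = (r > r_crit).reshape(mask_shape)
        labels, n_clusters = ndimage.label(supra)
        if n_clusters:
            sizes = ndimage.sum(supra, labels, range(1, n_clusters + 1))
            max_sizes.append(max(sizes))
        else:
            max_sizes.append(0)
    return np.quantile(max_sizes, 1 - alpha)
```

With the full volume and 10,000 iterations this yields a participant-specific threshold of the kind reported above (~12 voxels on average).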

Behavioral experiment. Twelve new participants completed three sessions of 576 trials, during which they judged which of a pair of scenes was either more open (expanse), more natural (content), or more distant (distance). Importantly, no specific instructions were given to the participants about what defined each of the dimensions; they were left free to rate stimuli based on their intuitions about the labels given. Ideally, we would have directly measured relative distance within each category of stimuli, but that would have required informing participants of the categories and/or limiting the trials to only comparisons within a category, both of which would have introduced task confounds into our measure of distance.

On each trial, participants were sequentially presented with two scenes from our set of 96 for 500 ms each, with a 1 s blank screen between them. Participants indicated their chosen scene via a button press. The order of these sessions (expanse, content, distance) was counterbalanced across participants. Furthermore, the trials were chosen such that no trial was ever repeated across participants, so that as many of the comparisons as possible were made.

Because there were not enough trials available to probe every single possible comparison (4560) within a single participant, trials were concatenated across participants. To determine a ranking across our stimulus set for expanse, distance, and content, Elo ratings (Elo, 1978) were derived in the following manner. Each scene was given an initial Elo rating of 1000. Each trial was treated as a match between the two scenes, and the loser's and winner's ratings were adjusted according to the standard Elo formula (Meng et al., 2010). The final ratings for each scene reflect their relative ranking along the dimension of interest. Because the order of matches impacts the final Elo ratings, 10,000 iterations of this procedure with different random trial orders were averaged together.
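The rating procedure can be sketched with the standard Elo update. This is an illustrative reconstruction: the K-factor and the reduced number of random orders are assumptions for brevity (the paper averages 10,000 orders and does not report a K-factor here), and `trials` is a hypothetical list of (winner, loser) index pairs.

```python
import random

def elo_rankings(trials, n_items, k=32, n_orders=100, seed=0):
    """Average Elo ratings over random orderings of pairwise judgments.

    trials : list of (winner, loser) index pairs, one per comparison.
    Each item starts at a rating of 1000, as in the text; ratings after
    each random ordering of the trials are averaged together.
    """
    rng = random.Random(seed)
    totals = [0.0] * n_items
    for _ in range(n_orders):
        ratings = [1000.0] * n_items
        order = trials[:]
        rng.shuffle(order)
        for winner, loser in order:
            # Expected score of the winner under the standard Elo formula
            expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
            ratings[winner] += k * (1 - expected)
            ratings[loser] -= k * (1 - expected)
        for i, r in enumerate(ratings):
            totals[i] += r
    return [t / n_orders for t in totals]
```

Because each match transfers rating points symmetrically, the mean rating stays at 1000; only the relative ordering along the judged dimension is meaningful.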

Results

The purpose of this study was to perform a data-driven investigation of scene representations across the ventral visual cortex. We presented 96 highly detailed and diverse scenes chosen to broadly cover the stimulus domain. The scenes were balanced in such a way as to allow us to evaluate the relative contributions of nonspatial factors, such as content (manmade, natural) and high-level category (e.g., beaches, highways), and spatial factors, such as expanse (open, closed) and relative distance (near, far), to scene representations. None of these factors had any preferential status within any of the subsequent analyses, and there was no bias in our design for any, all, or none of these factors or categories to emerge.

Representational structure within cortical regions

In our first test of scene representations, we independently localized scene-, object-, and face-selective regions as well as retinotopic EVC in both hemispheres. Given the limited acquisition volume possible at our high resolution (2 × 2 × 2 mm), our scene-selective regions included both transverse occipital sulcus (TOS) (Epstein et al., 2007) and PPA but not retrosplenial cortex (Epstein and Higgins, 2007). We divided EVC into pEVC and cEVC, given evidence for a peripheral bias in PPA (Levy et al., 2001; Hasson et al., 2002) and for the differential involvement of central and peripheral space in scene perception (Larson and Loschky, 2009). We will focus initially on comparing and contrasting PPA and pEVC, the regions that showed the strongest discrimination and most structured representations.

Within each region, we extracted the pattern of response across voxels to each of the 96 scenes. We then cross-correlated these response patterns to establish the similarity between the response patterns of each pair of scenes. This analysis yielded a 96 × 96 similarity matrix for each region (Fig. 2a,b), wherein each point represents the correlation or similarity between a pair of scenes (Kriegeskorte et al., 2008a; Drucker and Aguirre, 2009). These matrices can be decomposed into two components. First, the points along the main diagonal, from the top left to the bottom right corner of the matrix, represent the consistency of the response patterns for the same scene across the two halves of the data (within-scene correlations). Second, the points off the diagonal are the correlations between pairs of different scenes (between-scene correlations). These two components can be used to provide information about both categorization and discrimination of scenes. Specifically, the between-scene correlations define how a region groups scenes together (categorization). In contrast, significantly greater within- than between-scene correlations indicate that the region can distinguish individual scenes from one another (discrimination).
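The discrimination logic can be summarized in one line of arithmetic on such a matrix: compare the mean of the diagonal (within-scene) with the mean of the off-diagonal cells (between-scene). The function below is a simple illustration of that comparison, not necessarily the authors' exact statistic.

```python
import numpy as np

def discrimination_index(sim):
    """Mean within-scene minus mean between-scene correlation.

    sim : (n, n) split-half similarity matrix whose diagonal holds
    within-scene correlations. A positive index indicates that the
    region's response patterns distinguish individual scenes.
    """
    n = sim.shape[0]
    within = np.diag(sim).mean()
    between = sim[~np.eye(n, dtype=bool)].mean()
    return within - between
```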

Given previous results on categorization in PPA, we first ordered the raw similarity matrices by scene category and divided scenes by content into manmade and natural. For PPA and pEVC (Fig. 2a,b), it is clear that the patterns of response contain rich information about the presented scenes. In both regions, the within-scene correlations (diagonal) are on average stronger than the between-scene correlations (off-diagonal), indicating an ability to discriminate scenes. This effect is particularly prominent in pEVC (Fig. 2b). However, there is very little structure to the between-scene correlations in pEVC and only mild grouping evident in PPA. Furthermore, neither region shows any consistent grouping of manmade and natural scenes. To better visualize this structure, we averaged the between-scene correlations by high-level category (Fig. 2c,d). In these matrices, the points along the main diagonal reflect the coherence of a scene category. Even within these average matrices, there is only weak evidence for coherent scene categories in PPA (Fig. 2c, high within-category correlations for Living Rooms and Ice Caves) and no obvious coherent categories in pEVC (Fig. 2d). Furthermore, even among the most coherent categories in PPA, there are between-category correlations that violate differences in content. For example, Living Rooms and Ice Caves are well correlated despite vast differences in content and low-level stimulus properties (e.g., color, spatial frequency, luminance, etc.).

Figure 2. Similarity matrices for PPA and pEVC. a, b, Raw similarity matrices for PPA (a) and pEVC (b) averaged across participants. The matrices comprise 96 × 96 elements, with each point reflecting the amount of correlation in the pattern of response between a pair of scenes. The main diagonal in each matrix, from the top left to the bottom right corner, contains the correlations between a scene and itself in the two halves of the data. The matrices are ordered by high-level category, and dashed lines indicate divisions between those categories. The solid lines indicate the division between manmade and natural scenes. Note that, although for both PPA and pEVC the main diagonal shows on average higher correlations than the off-diagonal elements (indicating scene discrimination), there is very little grouping evident in either matrix. c, d, Between-scene correlations from a and b averaged by high-level conceptual category. The main diagonal in these plots reflects the coherence of a high-level category, and the off-diagonal elements represent correlations between categories of scenes. Although some categories appear to exhibit a degree of coherence, note, for example, the high correlations between Living Rooms and Ice Caves, as well as Hills and Harbors, which differ markedly in high-level conceptual properties.

To better visualize the structure of scene representations in both regions, without assuming the importance of scene categories, we used multidimensional scaling (MDS) (Kriegeskorte et al., 2008a). Each scene was positioned on a two-dimensional plane, in which the distance between any pair of scenes reflects the correlation between their response patterns (the higher the correlation, the closer the distance) (Fig. 3a,b). This visualization reveals a very striking structure not captured by scene categories in either PPA or pEVC. In PPA, there is clear grouping by expanse, with open scenes to the right and closed scenes to the left. In pEVC, grouping was weaker but defined by relative distance. We verified the strength of these differential groupings between the two regions by reordering the raw similarity matrices (Fig. 2) by these dichotomies (Fig. 3c,d) rather than by high-level category. Note that, in some cases, the difference in the structure of scene representations between PPA and pEVC caused large shifts in the pairwise similarity of individual scenes. For example, a church image and a canyon image were similarly categorized by PPA (Fig. 3a, yellow boxes), reflecting enclosed structure, whereas in pEVC, they were categorized as dissimilar (Fig. 3b, yellow boxes) because they had different relative distances. In the following section, we quantify these differences in representational structure between the regions.
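The MDS step can be sketched with classical (Torgerson) scaling from a correlation similarity matrix. This is an illustrative reconstruction: 1 − r is assumed as the dissimilarity measure, which is one common choice and not necessarily the authors' exact metric or MDS variant.

```python
import numpy as np

def classical_mds(sim, n_dims=2):
    """Embed items on a plane so that higher correlation means closer.

    sim : (n, n) correlation similarity matrix (hypothetical input).
    Correlations are converted to dissimilarities (1 - r), which are then
    double-centered and eigendecomposed; the top eigenvectors give the
    low-dimensional coordinates.
    """
    s = (sim + sim.T) / 2                  # symmetrize the split-half matrix
    d = 1 - s                              # correlations -> dissimilarities
    np.fill_diagonal(d, 0)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    b = -0.5 * j @ (d ** 2) @ j            # double-centered squared distances
    vals, vecs = np.linalg.eigh(b)
    idx = np.argsort(vals)[::-1][:n_dims]  # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))
```

Plotting the two returned coordinates per scene reproduces the kind of layout shown in Figure 3, with tightly correlated scenes falling close together.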

Comparison of the representational structure in PPA and pEVC

We directly quantified the relative contributions of expanse, relative distance, and content by averaging the between-scene correlations (off-diagonal) across the eight different combinations of the three dichotomies (Fig. 4a,b). We then averaged each row of these matrices according to the correlations within and between the various levels of expanse, relative distance, and content (Fig. 4c,d). The resulting correlations were then entered into a four-way repeated-measures ANOVA with expanse (same, different), relative distance (same, different), content (same, different), and region (PPA, pEVC) as factors.
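For a single dichotomy, the averaging step amounts to splitting the between-scene cells of the similarity matrix into same-level and different-level pairs and taking the mean of each. The sketch below illustrates that split for one factor (the labels and inputs are hypothetical; the full analysis averages over the eight combinations of all three dichotomies before the ANOVA).

```python
import numpy as np

def factor_averages(sim, labels):
    """Mean between-scene correlation for same vs different factor levels.

    sim    : (n, n) similarity matrix.
    labels : length-n sequence of 0/1 levels for one dichotomy
             (e.g., open vs closed). Within-scene (diagonal) cells are
             excluded so only between-scene correlations contribute.
    """
    labels = np.asarray(labels)
    n = len(labels)
    off = ~np.eye(n, dtype=bool)                     # drop within-scene cells
    same = (labels[:, None] == labels[None, :]) & off
    diff = labels[:, None] != labels[None, :]
    return sim[same].mean(), sim[diff].mean()
```

A region that groups scenes by the factor in question will show a higher same-level than different-level mean, which is the pattern the ANOVA main effects quantify.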

Grouping was weaker in pEVC than PPA (see also the discrimination analysis below), with lower between-scene correlations, resulting in a significant main effect of region (F(1,8) = 19.269, p < 0.01). Furthermore, the contributions of relative distance and expanse were different in the two regions, resulting in highly significant interactions of region × expanse (F(1,8) = 33.709, p < 0.001) and region × relative distance (F(1,8) = 24.361, p < 0.01). Notably, content was not a major contributor to grouping in either region, and no main effects or interactions involving content (all p > 0.16) were observed.

To investigate the differential grouping in the two regions further, data from each region were entered independently into two repeated-measures ANOVAs. In pEVC, relative distance was the only significant factor producing grouping (F(1,8) = 30.554, p < 0.001), and no main effect of expanse or content (p > 0.12) was observed. In contrast, in PPA, expanse was the primary factor producing grouping (F(1,9) = 44.419, p < 0.001), although there was a smaller effect of relative distance (F(1,9) = 18.152, p < 0.01). No expanse × relative distance interactions were found either within or across the ROIs (p > 0.2). Again, content played no role in grouping, with no main effects or interactions involving content (p > 0.15). Furthermore, even when the matrices were averaged by the semantic categories (e.g., beaches, mountains) used in previous studies (Walther et al., 2009), expanse remained the dominant factor producing grouping (supplemental Item 2, available as supplemental material) (also see below).

Thus, neither PPA nor pEVC shows effects of scene category or content. Instead, both regions group scenes by their spatial aspects, with pEVC showing grouping by relative distance and PPA grouping primarily by expanse. Although the weaker categorization by relative distance in PPA may suggest that some aspects of scene categorization are inherited from pEVC, the absence of an effect of expanse in pEVC implies that the structure of scene representations is transformed between pEVC and PPA.

Comparison of behavior and scene representations in PPA and pEVC

Multivariate designs, by virtue of their large number of conditions, produce data that can be directly correlated with behavior at an individual item level (Kriegeskorte et al., 2008a; Drucker and Aguirre, 2009). To assess whether the representational structure we observed in PPA and pEVC was reflected in behavior, we next directly tested whether the structure of scene representations we measured in PPA and pEVC agreed, at the level of individual scenes, with subjective behavioral ratings from a new set of six participants. The task and instructions used in collecting behavioral judgments will inevitably constrain the resulting data.

Figure 3. MDS plots for PPA and pEVC. a, b, MDS from PPA (a) and pEVC (b). The main plots and the insets to the right contain the same data points. In the main plots, the scenes are plotted directly, whereas in the insets, the scenes are represented by symbols that reflect the levels of expanse and relative distance. Note that, in PPA, the scenes group by expanse (red vs blue symbols), whereas in pEVC, the scenes group by relative distance (circles vs triangles). The four highlighted scenes (2 yellow boxes, 2 green boxes) in each plot were chosen to highlight the difference in the similarity between pairs of scenes in the two ROIs. The two scenes highlighted in green share the same expanse but differ in relative distance, whereas the scenes highlighted in yellow share relative distance but differ in expanse. Note the difference in their relative positions in the MDS plots from PPA and pEVC. c, d, Raw similarity matrices for PPA (c) and pEVC (d) from Figure 2, a and b, reordered by expanse and relative distance. Solid lines indicate a distinction between open and closed scenes, whereas the dashed lines indicate a distinction between near and far scenes. In PPA (c), note the clear clustering of strong correlations between scenes that shared the same expanse (top left and bottom right quadrants) and the clustering of weak correlations between scenes with different expanse (bottom left and top right quadrants). In contrast, in pEVC, note the clustering of strong correlations between scenes that shared relative distance, evident as a checkerboard pattern.

    7326 • J. Neurosci., May 18, 2011 • 31(20):7322–7333 Kravitz et al. • Spaces More Than Places in High-Level Scene Representations

Therefore, we provided as little instruction as possible, simply asking participants to report which of a sequential pair of scenes was more open (expanse), more natural (content), or more distant (distance). Participants were free to interpret these labels as they wanted. We used Elo ratings (see Materials and Methods) to derive a ranking for each individual scene for each of the three dichotomies. These rankings turn our dichotomies into dimensions, in which the rating of a scene reflects its subjective openness, naturalness, or depth relative to the other 95 scenes.
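The conversion from pairwise judgments to per-scene rankings can be sketched with a standard Elo update. This is a minimal illustration, not the paper's exact implementation: the K-factor of 32 is an assumption, and the function names are ours; scenes start at the 1000 baseline that Figure 5 treats as an average rating.

```python
# Minimal sketch of Elo-style ratings from pairwise scene judgments.
# K-factor of 32 is an assumption; each scene starts at the 1000
# baseline referenced in Figure 5.

def elo_update(chosen, other, ratings, k=32.0):
    """Update ratings after one judgment in which `chosen` was
    preferred over `other` (e.g., judged more open)."""
    ra, rb = ratings[chosen], ratings[other]
    # Expected score of the chosen scene under the logistic Elo model
    expected = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[chosen] = ra + k * (1.0 - expected)
    ratings[other] = rb - k * (1.0 - expected)

def rate_scenes(judgments, scenes, k=32.0):
    """judgments: iterable of (chosen_scene, other_scene) pairs.
    Returns a dict mapping scene -> rating, turning the dichotomy
    into a continuous dimension."""
    ratings = {s: 1000.0 for s in scenes}
    for chosen, other in judgments:
        elo_update(chosen, other, ratings, k)
    return ratings
```

Because each update moves the two ratings by equal and opposite amounts, the total rating mass is conserved and only relative standing changes, which is what makes the resulting ranking comparable across the 96 scenes.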

First, we used the Elo ratings as independent confirmation of our dichotomies. The content dichotomy was the most clearly reflected in the Elo ratings, with 46 of the 48 top-ranked scenes being natural scenes. The expanse dichotomy was similarly strong, with 40 of the 48 top-ranked scenes being open scenes. To calculate the strength of the relative distance dichotomy, we counted the number of times the far exemplars of a particular high-level category of scene were rated more highly than the near exemplars in that category, which was true for 40 of 48 scenes. To assess the reliability of the ratings, we divided the 12 participants into two groups of six and calculated Elo ratings for each group separately. The ratings for all three dichotomies were highly correlated (expanse, r = 0.92; content, r = 0.94; relative distance, r = 0.86; all p < 0.0001) across the groups, verifying the reliability of the Elo ratings. Thus, independent ratings of the individual scenes by naive observers reliably confirm our original classifications.

Next, we directly compared the Elo ratings with the scene representations we recovered with fMRI in PPA and pEVC. We calculated an fMRI grouping score from the average similarity matrices (Fig. 3c,d) for each scene that reflected how strongly grouped that scene was within a particular dichotomy. For example, the expanse score for a scene was calculated by subtracting its average correlation with the closed scenes from its average correlation with the open scenes. We then correlated these fMRI grouping scores with their respective Elo ratings to determine whether scene representations in each region reflected the behavioral rankings of scenes.
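The expanse grouping score described above can be sketched in a few lines. This is a minimal illustration under our own conventions (list-of-lists similarity matrix, hypothetical function name), not the authors' code:

```python
# Sketch of the fMRI grouping score for expanse: for each scene, its
# mean similarity to the open scenes minus its mean similarity to the
# closed scenes (self-correlations excluded). A positive score means
# the scene's response pattern groups with the open scenes.

def expanse_scores(sim, is_open):
    """sim: n x n similarity matrix (list of lists of correlations);
    is_open: list of booleans, one per scene."""
    n = len(sim)
    scores = []
    for i in range(n):
        open_vals = [sim[i][j] for j in range(n) if j != i and is_open[j]]
        closed_vals = [sim[i][j] for j in range(n) if j != i and not is_open[j]]
        scores.append(sum(open_vals) / len(open_vals)
                      - sum(closed_vals) / len(closed_vals))
    return scores
```

The same subtraction with different class labels yields the content and distance scores, so one function covers all three dichotomies.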

For expanse, we found a very strong correlation between the Elo ratings and expanse scores in PPA (r = 0.67, p < 0.0001) (Fig. 5a) but not in pEVC (r = 0.08, p > 0.1) (Fig. 5b). This difference in correlation was significant (z = 2.16, p < 0.05), suggesting that the pattern of response in PPA more closely reflects behavioral judgments of expanse than does the pattern in pEVC. For content, we found no correlations in either PPA (r = 0.10, p > 0.1) (Fig. 5c) or pEVC (r = 0.07, p > 0.1) (Fig. 5d). Furthermore, in PPA, this correlation was significantly weaker than the correlation between Elo ratings and expanse scores (z = 2.99, p < 0.05), demonstrating that there is a stronger relationship between scene representations in PPA and judgments of expanse than judgments of content. For distance, we found equivalent correlations (p > 0.1) in both PPA (r = 0.54, p < 0.0001) (Fig. 5e) and pEVC (r = 0.31, p < 0.01), consistent with the grouping we observed in both regions. Based on our previous analysis, it might have been expected that the correlation with distance would have been stronger in pEVC than PPA. Their equivalent correlations may reflect a weaker direct contribution of pEVC to conscious judgments about scenes than PPA.

These correlations between the structure of scene representations in fMRI and behavior suggest that the pattern of response in PPA much more strongly reflects subjective judgments about spatial aspects of scenes (expanse, distance) than the content of those same scenes. In contrast, the pattern of response in pEVC reflected only judgments of the distance of those scenes, providing converging evidence for the different scene information captured in pEVC and PPA. Furthermore, these results show that, regardless of what visual statistics drive the responses of pEVC and PPA, the representations they contain directly reflect, and perhaps even contribute to, subjective judgments of high-level spatial aspects of complex scenes.

High-level category information in PPA within and across spatial factors

Our previous analyses confirmed that spatial factors have a greater impact on the structure of scene representations in PPA than nonspatial factors. To directly test whether there was any high-level category information independent of spatial factors, we next considered (1) whether scene category could be decoded when spatial factors were held constant, or whether scenes from different categories but with similar spatial properties elicit similar responses, and (2) whether scene category could be decoded across spatial factors, or whether scenes from the same category but with different spatial properties elicit different responses. Because expanse is primarily confounded with category (e.g., all mountain scenes will be open), item 2 could only be tested across relative distance.

To perform these analyses, we needed to consider the near and far exemplars of each of the 16 high-level categories separately (Fig. 1), effectively doubling the number of categories to 32. We then averaged the off-diagonal correlations from the raw similarity matrix for PPA (Fig. 3c) by scene category (Fig. 6a). The points along the diagonal of this matrix represent the average correlation between exemplars of each category. The off-diagonal points represent the correlations between different scene categories or between the near and far exemplars of the same category (Fig. 6a, white ellipses).

Figure 4. Categorization in PPA and pEVC. a, b, The off-diagonal points from the raw matrix from Figure 2, a and b, averaged by the eight combinations of expanse, relative distance, and content. The solid lines again denote divisions between closed and open scenes, whereas the dashed lines indicate divisions between near and far scenes. Manmade and natural scenes alternate in that order, such that the first point within any of the small boxes defined by the dotted lines is the average of the manmade scenes and the second is the average of the natural scenes. c, d, Bar plots of the average effects of expanse, relative distance, and content on categorization. Averages were created by averaging across the rows of the matrix after they had been aligned such that switches between the levels of the factors were in agreement. For example, the first solid green bar represents the average effect of keeping expanse and content constant but varying relative distance for all eight of the possible crossings of the three factors. Solid and hashed bars indicate the division between same and different expanse. The effect of holding content constant is plotted on the left, whereas the effect of changing content is plotted on the right. In PPA (c), note the large effect of changing expanse (solid vs hashed bars). In contrast, in pEVC (d), note that changing relative distance has the largest effect (orange vs green bars). All error bars indicate the between-subjects SEM.

To establish whether categories could be distinguished from one another when they shared both expanse and relative distance, discrimination indices were calculated for each category within each combination of the spatial factors (Fig. 6b). These discrimination indices were defined as the difference between the correlation of a category with itself (Fig. 6a) and the average correlation between that category and the other categories that shared expanse and relative distance. These indices were entered into a one-way ANOVA with scene category (32) as a factor. No main effect of scene category was observed (p > 0.15), nor was there significant discrimination across the scene categories on average (p > 0.15), nor did any individual category evidence significant discrimination with a Bonferroni correction for multiple comparisons (p > 0.3). To apply the most liberal test for category information possible, we conducted one-tailed t tests for each scene category. We found only a single category (near cities) (Fig. 6c) that evidenced any decoding (p < 0.05, uncorrected). Thus, even when spatial factors are held constant, we found no strong evidence for scene category representations.
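The category discrimination index defined above reduces to a simple subtraction. The sketch below is our own illustration (hypothetical function name and dict-based input), assuming the category-averaged similarity matrix has already been computed:

```python
# Sketch of the high-level category discrimination index: a
# category's within-category correlation minus its mean correlation
# with the other categories that share its expanse and relative
# distance. Positive values indicate category information beyond
# the shared spatial factors.

def discrimination_index(cat_sim, target, same_spatial_group):
    """cat_sim: dict mapping (cat_a, cat_b) -> mean correlation;
    target: category of interest;
    same_spatial_group: other categories sharing the target's
    expanse and relative distance (target excluded)."""
    within = cat_sim[(target, target)]
    between = sum(cat_sim[(target, c)] for c in same_spatial_group)
    between /= len(same_spatial_group)
    return within - between
```

An index near zero, as the paper reports for almost every category, means exemplars of a category are no more similar to each other than to spatially matched exemplars of other categories.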

To establish whether high-level scene category could be decoded across variations in spatial factors, we calculated discrimination indices for each category across the two levels of relative distance (Fig. 6d). These discrimination indices were defined as the difference between the correlation of the near and far exemplars of a category with each other (Fig. 6a, white ellipses) and the average correlation between the near and far exemplars of that category and other categories. These indices were entered into a one-way ANOVA with scene category (16) as a factor. No main effect of scene category was observed (p > 0.375), nor was there significant discrimination across the scene categories on average (p > 0.15), nor did any individual category evidence significant discrimination with a Bonferroni correction for multiple comparisons (all p > 0.3). Again, we applied the most liberal test for category information and conducted one-tailed t tests for each scene category. We found only a single category (living rooms) that evidenced any decoding (p < 0.05, uncorrected).

In summary, in contrast to reports emphasizing the representation of scene category in PPA (Walther et al., 2009), we found no evidence for decoding of scene categories in PPA when spatial factors are controlled. We found no ability to decode high-level category across different levels of relative distance. We found no evidence for content as a significant contributor to the overall structure of representations in PPA or pEVC. We also found no correlation between scene representations in PPA or pEVC and subjective judgments of content, and significantly weaker behavioral correlations for content than expanse. Although it is possible that these nonspatial factors do have some impact on scene representations in these regions, that impact is clearly minor compared with the spatial factors of expanse and relative distance.

Scene discrimination in PPA and pEVC

Although the grouping of between-scene correlations provides insight into how these regions categorize scenes, the difference between within- and between-scene correlations provides an index of scene discrimination. For this analysis, it was critical that we consider only between-scene correlations that did not cross any grouping boundary. Otherwise, our discrimination measure would be implicitly confounded with grouping. Given the strong evidence for both expanse and relative distance as categories, we consider discrimination between scenes within the combinations of these factors separately (four white squares encompassing the main diagonal in Fig. 2a,b), collapsing across differences in content.
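Within one combination of expanse and relative distance, the discrimination score is the mean within-scene correlation (diagonal of the split-half similarity matrix) minus the mean between-scene correlation (off-diagonal). A minimal sketch, with our own function name and list-based matrix format:

```python
# Sketch of the scene discrimination score within one combination of
# expanse and relative distance. sim[i][i] is a scene's correlation
# with itself across independent halves of the data; sim[i][j] (i != j)
# is the correlation between different scenes. Restricting the matrix
# to scenes that share a grouping boundary keeps discrimination from
# being confounded with grouping.

def discrimination_score(sim):
    """sim: n x n split-half similarity matrix for scenes sharing
    expanse and relative distance. Returns mean within-scene minus
    mean between-scene correlation."""
    n = len(sim)
    within = sum(sim[i][i] for i in range(n)) / n
    between = sum(sim[i][j] for i in range(n) for j in range(n)
                  if i != j) / (n * (n - 1))
    return within - between
```

A positive score indicates that individual scenes remain distinguishable even among scenes the region groups together.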

Within- and between-scene correlations were extracted from each of the four combinations of expanse and relative distance (supplemental Item 3, available as supplemental material). These correlations were then averaged and subtracted from one another to yield discrimination scores (Fig. 7a,b). There was a broad ability to discriminate scenes in both regions, with significant discrimination (p < 0.05) observed in every condition except for near, closed scenes in PPA. To investigate the pattern of discrimination between the two regions, discrimination scores were entered into a three-way repeated-measures ANOVA with expanse (open, closed), relative distance (near, far), and region (PPA, pEVC) as factors. Discrimination was stronger in pEVC than PPA, resulting in a significant main effect of ROI (F(1,8) = 18.838, p < 0.01). Discrimination was also generally stronger for near than far scenes, resulting in a significant main effect of relative distance (F(1,9) = 9.793, p < 0.05), although this effect was stronger in pEVC, resulting in a significant region × relative distance interaction (F(1,8) = 8.898, p < 0.05). Separate ANOVAs within each region confirmed the larger effect of relative distance in pEVC (F(1,8) = 15.477, p < 0.01) than in PPA (F(1,9) = 5.328, p < 0.05) but revealed no additional effects (all p > 0.3). These results demonstrate that, even within scenes that are grouped together, there is significant information about the individual scenes.

Figure 5. Comparison of behavioral and imaging data. a, Scatter plot of the Elo ratings for expanse derived from the behavioral experiment against the expanse score calculated from the average similarity matrix in PPA (Fig. 3c) for each scene image. fMRI scores for each scene were calculated by subtracting its average correlation with the closed scenes from its average correlation with the open scenes. A zero fMRI score (horizontal dotted line) indicates equivalent correlation with both open and closed scenes. An Elo score of 1000 (vertical dotted line) indicates that the scene has an average expanse. Note the strong correlation between the fMRI and behavioral measures. Note also the large spread of fMRI scores along the y-axis, reflecting the strong grouping by expanse. b, Same as a but for pEVC. Note the significantly lower correlation between the fMRI and behavioral measures and the smaller spread along the y-axis. c, d, Same as a and b, respectively, but now considering content rather than expanse. Note the significantly lower correlation in PPA than was observed with expanse and the lack of correlation in pEVC. e, f, Same as a and b, respectively, but now considering distance rather than expanse. Note the correlation between the fMRI and behavioral measures in both regions. Note also the larger spread along the y-axis in f, reflecting the stronger grouping by distance in pEVC.

The gross pattern of scene discrimination was very similar in both pEVC and PPA. To investigate the relationship between discriminability in the two regions in greater detail, we calculated discrimination indices for each individual scene and then correlated them across pEVC and PPA (Fig. 7c). The high correlation (r = 0.659, p < 0.001) between the discrimination indices suggests that the distinctiveness of the representation of a scene in PPA is directly related to its distinctiveness in pEVC.

Together, the results of the discrimination and categorization analyses suggest a transformation of scene representations between pEVC and PPA. Clearly, the discriminability of scene representations in PPA reflects discriminability in pEVC. However, PPA sacrifices some scene discriminability, perhaps to better categorize scenes by their spatial expanse. Thus, PPA maintains less distinct representations of scenes that seem broadly organized to capture spatial aspects of scenes.

Categorization and discrimination in other cortical regions

In addition to PPA and pEVC, we also investigated cEVC, TOS, the object-selective regions lateral occipital (LO) and posterior fusiform sulcus (PFs), and the face-selective occipital face area (OFA) and fusiform face area (FFA).

cEVC was similar to pEVC in its pattern of discrimination (supplemental Item 4, available as supplemental material) but showed no scene categorization. This difference in categorization between cEVC and pEVC led to a significant relative distance × region interaction (F(1,8) = 29.901, p < 0.01) when categorization averages were entered into a four-way ANOVA with expanse, relative distance, content, and region (cEVC, pEVC) as factors. This suggests that pEVC contains more structured scene representations than cEVC and highlights the likely importance of pEVC in scene processing (Levy et al., 2001; Hasson et al., 2002). However, it must be noted that cEVC represents the portion of space containing the fixation cross, on which the participants were performing the task. Although the cross was very small (~0.5°) relative to the central localizer (5°), it cannot be ruled out that this overlap impacted results in cEVC.

Scene representations in TOS had a structure similar to PPA but were less categorical. Scene discrimination in TOS and PPA was similar (supplemental Item 4, available as supplemental material), but categorization by expanse was weaker. This weaker categorization led to a significant expanse × region interaction (F(1,9) = 11.714, p < 0.01) when categorization averages from TOS and PPA were entered into a four-way ANOVA. In TOS, as in PPA, there was a trend for weak categorization by relative distance (F(1,9) = 4.548, p = 0.06) and no effects involving content (all p > 0.25).

The object-selective regions (supplemental Item 5, available as supplemental material) did not seem particularly involved in processing the scene stimuli. LO evidenced some weak discrimination of scenes and no categorization by any of the three dichotomies (all p > 0.1). PFs showed no scene discrimination and some categorization by expanse, but far more weakly than that observed in PPA, resulting in a highly significant region × expanse interaction (F(1,8) = 17.382, p < 0.01). It is likely that the short presentation times and the scenes we chose, which did not contain strong central objects, reduced the ability of object-selective cortex to extract individual objects from the scenes.

Figure 6. High-level category discrimination in PPA controlling for spatial factors. a, Similarity matrix for PPA averaged by high-level category with spatial factors held constant. Controlling for relative distance effectively doubles the number of high-level categories, as each category had both near and far exemplars. The first eight categories are the near instances of churches, concert halls, hallways, living rooms, canopies, canyons, caves, and ice caves, followed by the far instances of those same categories. The next eight categories are the near instances of cities, harbors, highways, suburbs, beaches, deserts, hills, and mountains, followed by the far instances of those same categories. Note that the diagonal is generally weak, indicating little information about high-level category. b, Bar plot of the discrimination indices for the near and far exemplars of each of the 16 high-level categories. Discrimination indices were created by subtracting the average correlation between a high-level category and other categories that shared expanse and relative distance from the within-category correlation. Positive discrimination indices indicate the presence of high-level category information. Note that only one category (near cities; green bar in b and green circle in a) produces a significant discrimination index under the most liberal test possible. c, Stimuli from the high-level categories of near beaches and cities, the two categories with the highest discrimination indices (purple and green bars in b and circles in a). Note that these two categories of stimuli are also strongly correlated with each other (cyan circle in a) despite sharing only expanse and relative distance and having different content and visual features. d, Bar plot of the discrimination indices for each of the 16 high-level categories across different relative distances. Discrimination indices were the difference between the correlation between near and far exemplars of a high-level category (white ellipses in a) and the average correlation between the near and far exemplars across high-level categories (other values within that square in a). Positive discrimination indices indicate that high-level category could be decoded across different relative distances. Note that only one category (living rooms) produces a significant discrimination index under the most liberal test possible.

The results from the face-selective regions (supplemental Item 6, available as supplemental material) confirmed that they contribute little to scene processing (see below, Selectivity analysis). Neither of the face-selective regions evidenced any categorization by the three dichotomies (all p > 0.2). Neither region showed much ability to discriminate between scenes, with FFA showing significant discrimination only for far, closed scenes and OFA only for far, open scenes.

Overall, at least some discrimination was possible based on the responses of a number of cortical regions, although the strongest discrimination was found in EVC, PPA, and TOS. In contrast, grouping was primarily confined to PPA, EVC, and TOS. Importantly, EVC grouped primarily by relative distance, whereas PPA and TOS both grouped primarily by expanse.

Selectivity analysis

So far, we have focused on examining scene categorization and discrimination within regions defined by their category selectivity. However, the contrast of a preferred and nonpreferred stimulus class (Kanwisher et al., 1997; Epstein and Kanwisher, 1998) implies that a region might be identified as specialized for a particular stimulus class because of a difference in response between these conditions, and not necessarily because the region maintains any fine-grained representation of that class. Here we took advantage of our ungrouped design and searched for regions that showed consistent selectivity among the set of 96 scenes. This analysis provides an alternate way to identify regions important in scene representation and allows us to investigate whether any other regions are also important.

The aim of this analysis was to identify voxels in a whole-volume search that show consistent selectivity for the set of scene images. Selectivity was defined by the response profile across all 96 scenes in a single voxel (Erickson et al., 2000). We computed the consistency of selectivity by calculating the correlation of the response profile between independent halves of the data. We then produced maps of the correlation values, deriving cluster thresholds using a randomization procedure to determine which voxels were significantly selective (see Materials and Methods). Given the breadth of our scene stimuli, voxels that do not show at least a modicum of consistency in their selectivity are unlikely to be involved in scene processing.
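The core of this split-half consistency measure is a Pearson correlation between a voxel's two response profiles. A minimal sketch (our own function name; in practice one would vectorize this over all voxels and then apply the cluster-threshold randomization):

```python
# Sketch of voxelwise selectivity consistency: the Pearson
# correlation of one voxel's response profile across the 96 scenes
# between two independent halves of the data. A reliably positive
# value indicates consistent scene selectivity.

def profile_consistency(half1, half2):
    """half1, half2: equal-length lists of per-scene responses for
    one voxel, one list per data half."""
    n = len(half1)
    m1 = sum(half1) / n
    m2 = sum(half2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(half1, half2))
    var1 = sum((a - m1) ** 2 for a in half1)
    var2 = sum((b - m2) ** 2 for b in half2)
    return cov / (var1 * var2) ** 0.5
```

Because the two halves are independent, chance consistency is zero on average, which is what makes the randomization-derived cluster threshold meaningful.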

We found that the vast majority of the consistently selective voxels (~76%) lay within our predefined regions, indicating that these regions primarily contain the core voxels involved in scene processing in our volume (Fig. 8a).

We next quantified the average selectivity within each of our predefined ROIs (Fig. 8b). As expected, significant selectivity (p < 0.05) was observed only within scene-selective and EVC ROIs. In EVC, there was significantly greater selectivity in pEVC than cEVC (F(1,8) = 21.991, p < 0.01). To confirm that there was greater selectivity in scene-selective cortex than in either object- or face-selective cortex, their selectivity scores were entered into a two-way ANOVA with selectivity (scene, object, face) and location (anterior, posterior) as factors. The only effect observed was a main effect of selectivity (F(2,16) = 6.769, p < 0.01, Greenhouse–Geisser corrected) as a result of the greater selectivity observed in the scene-selective than in either the object-selective (F(1,8) = 8.105, p < 0.05) or face-selective (F(1,8) = 9.069, p < 0.05) ROIs.

Finally, we quantified the amount of overlap between each ROI and the significantly selective clusters derived from the whole-volume search (Fig. 8c). Again, significant overlap was present only between the EVC and scene-selective ROIs (p < 0.05). The advantage for pEVC over cEVC in both mean selectivity and overlap with selective voxels is in keeping with the theory that PPA has a bias for the peripheral visual field (Levy et al., 2001; Hasson et al., 2002). In combination, these two selectivity analyses suggest that our analysis of pEVC and PPA captured the majority of the scene-processing voxels in the ventral visual pathway.

In summary, using a voxelwise measure of scene selectivity, based only on responses to scenes, we found that our ROIs captured the vast majority of voxels with consistent scene selectivity. Furthermore, selectivity was most stable in PPA, pEVC, and TOS, consistent with our analyses of categorization and discrimination.

Discussion

Real-world scenes are perhaps the most complex domain for which specialized cortical regions have been identified. Here, we demonstrated that, although many visual areas contain information about real-world scenes, the structures of the underlying representations are vastly different. Critically, we were able to establish, without making prior assumptions, that expanse is the primary dimension reflected in PPA. Surprisingly, neither high-level scene category nor gross content (manmade, natural) seemed to play a major role in the structure of the representations. In contrast, pEVC grouped scenes by relative distance and maintained stronger discrimination of individual scenes than observed in PPA. Furthermore, the structure of representations observed with fMRI corresponded closely with independent behavioral ratings of the scene stimuli, with high correlations in PPA for ratings of scene openness but not content. This specific pattern of brain–behavior correlation suggests that subjective judgments of spatial, but not nonspatial, aspects of scenes are well captured by, and perhaps dependent on, the response of PPA. These findings provide critical insight into the nature of high-level cortical scene representations and highlight the importance of determining the structure of representations within a region, beyond whether those representations are distinct enough to be decoded.

Figure 7. Discrimination in PPA and pEVC. a, b, Bar plots of the discrimination indices for each combination of expanse and relative distance. Discrimination indices were created by subtracting the average between-scene correlation from the average within-scene correlation. PPA (a) and pEVC (b) exhibit the same pattern of discrimination across near and far, open and closed scenes, although discrimination was stronger in pEVC than PPA. * indicates significant discrimination (p < 0.05). All error bars indicate the between-subjects SEM. c, Comparison of discrimination indices for each scene in PPA and pEVC. Each point is a single scene, whose symbol reflects its expanse and relative distance. Dashed lines indicate the location of 0 for both pEVC (x-axis) and PPA (y-axis). Most points fall on the positive side of these lines, indicating that the individual scenes can be discriminated from the other scenes. The solid line is the unity line. Note that most points fall to the right of this line, indicating stronger discrimination in pEVC than in PPA.

To date, the problem in differentiating between competing accounts of PPA and determining the specific contributions of different visual areas to scene processing has been the complexity and heterogeneity of real-world scenes. First, typical fMRI studies contrast only a small set of preselected conditions or categories, presenting blocks of these conditions or averaging over event-related responses to individual exemplars. These designs are implicitly constrained to show differences only between the tested categories or conditions, potentially missing other more important differences. Second, the analysis of these studies also assumes that the response to each exemplar within a category is equivalent. Although this assumption is justified in simple domains in which there are minimal differences between stimuli, the heterogeneity of scenes makes it more tenuous. For example, the identity of individual scenes can be decoded even from the response of EVC (Kay et al., 2008). Thus, a difference between conditions might reflect bias in the study design, differences in exemplars, or differences in the homogeneity of stimuli within conditions (Thierry et al., 2007) rather than revealing a critical difference in scene representations. Finally, the paucity of conditions in standard designs also makes it difficult to establish the relative importance of different factors in scene representations in a single study (e.g., spatial vs category differences). The strength of our approach is the ability to present a multitude of stimuli, evaluate the response to each stimulus individually, and establish the relative importance of various factors in defining the structure of representations.

Taking advantage of an ungrouped design (supplemental Item 7, available as supplemental material), we were able to directly contrast the impact of spatial and nonspatial information on scene representations. Our results further support the theory, based on activation studies, that PPA is part of a network of regions specialized for processing the spatial layout of scenes (Epstein et al., 1999; Henderson et al., 2007; Epstein, 2008). The strong grouping of scenes by expanse (Park et al., 2011) and relative distance, paired with the absence of grouping by content, is inconsistent with theories suggesting that the primary function of PPA is distinguishing scene categories (Walther et al., 2009) or, based on activation studies, representing nonspatial contextual associations between objects (Bar, 2004; Bar et al., 2008; Gronau et al., 2008). This is not to suggest that PPA contains no nonspatial scene information; it is possible that other methods that more directly measure within-voxel selectivity [e.g., adaptation (Drucker and Aguirre, 2009)] would reveal a different pattern of results. Our results simply show that the dominant factors in defining the macroscopic response of PPA are spatial. This finding is also consistent with reports of PPA activation during scene encoding (Epstein et al., 1999; Ranganath et al., 2004; Epstein and Higgins, 2007), adaptation studies showing viewpoint-specific representations in PPA (Epstein et al., 2003), and anterograde amnesia for novel scene layouts with damage to parahippocampal regions (Aguirre and D'Esposito, 1999; Barrash et al., 2000; Takahashi and Kawamura, 2002; Mendez and Cherrier, 2003).

    Our findings contradict a recent study reporting categorization for "natural scene categories" (e.g., forests, mountains, industry) (Walther et al., 2009) in PPA. However, in that study, there was no control for spatial factors, including relative distance and expanse. Therefore, the ability to decode, for example, highways versus industry could partly reflect the different relative distances within each category or the fact that industry scenes were more likely to have a closed expanse. Similarly, the confusions of their classifier between beaches, highways, and mountains could reflect their shared open expanse. This hypothesis is supported by our inability to decode category when spatial factors were held constant or to decode category across variations in relative distance (Fig. 6). Finally, the scenes in this previous study often contained prominent objects (e.g., cars), or even people, and this might explain the equivalent decoding accuracy between PPA and object- and face-selective regions, whereas we found only weak discrimination and no categorization within these areas.
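    The logic of cross-decoding category across variations in relative distance (train a category decoder on patterns from one distance, test on the other) can be sketched with a simulation. This is an illustrative sketch only, using synthetic voxel patterns and a simple correlation-based nearest-centroid decoder; the function names and parameters (`simulate_patterns`, `category_gain`) are hypothetical and do not reproduce the published analysis pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_patterns(n_voxels=100, n_reps=20, category_gain=1.0, noise=0.1):
    """Simulate voxel patterns for 2 categories x 2 relative distances.

    Returns a dict keyed by (category, distance) -> (n_reps, n_voxels) array.
    `category_gain` scales a distance-invariant category signal; setting it
    to 0 simulates a region carrying only spatial (distance) information.
    """
    cat_templates = {c: rng.normal(size=n_voxels) for c in ("manmade", "natural")}
    dist_templates = {d: rng.normal(size=n_voxels) for d in ("near", "far")}
    data = {}
    for c in cat_templates:
        for d in dist_templates:
            mean = category_gain * cat_templates[c] + dist_templates[d]
            data[(c, d)] = mean + noise * rng.normal(size=(n_reps, n_voxels))
    return data

def cross_distance_decoding(data, train_dist="near", test_dist="far"):
    """Decode category with a nearest-centroid correlation classifier,
    training at one relative distance and testing at the other."""
    cats = ("manmade", "natural")
    centroids = {c: data[(c, train_dist)].mean(axis=0) for c in cats}
    correct = total = 0
    for true_cat in cats:
        for trial in data[(true_cat, test_dist)]:
            corrs = {c: np.corrcoef(trial, centroids[c])[0, 1] for c in cats}
            correct += max(corrs, key=corrs.get) == true_cat
            total += 1
    return correct / total

# A distance-invariant category signal generalizes across distance...
acc_strong = cross_distance_decoding(simulate_patterns(category_gain=1.0))
# ...whereas a purely spatial code yields chance-level cross-decoding.
acc_none = cross_distance_decoding(simulate_patterns(category_gain=0.0))
```

Under this logic, above-chance cross-distance decoding requires a category signal that is invariant to distance; chance performance, as reported here for PPA, is consistent with a primarily spatial code.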

    Figure 8. Selectivity analysis. a, The left column shows the significantly selective clusters of voxels (see Materials and Methods) for each of three example participants. The right column shows the ROIs for those same participants. Note the large overlap between the significant clusters and the PPA and pEVC ROIs. b, Plot of the average selectivity correlation within each independently defined ROI. Asterisks indicate significant selectivity correlations (p < 0.05). Note that significant selectivity correlations were observed only in EVC and the scene-selective ROIs. Note also the stronger selectivity correlations in pEVC than cEVC. c, Plot of the proportion of voxels within significant clusters that overlapped with each ROI. Asterisks indicate significant overlap between the ROIs and clusters. Note again the significant overlap only between the EVC and scene-selective ROIs and the clusters. Note also the greater overlap between pEVC than cEVC and the clusters. All error bars indicate the between-subject SEM.

    Kravitz et al. • Spaces More Than Places in High-Level Scene Representations J. Neurosci., May 18, 2011 • 31(20):7322–7333 • 7331

    In PPA, it is also possible that low-level features account for some of the observed grouping effects. In particular, there is a difference in the spatial frequency envelopes of closed and open scenes (supplemental Item 8, available as supplemental material). Furthermore, it is tempting to suggest that categorization by expanse might reflect the fact that the open scenes often contained sky, despite the absence of any such categorization in EVC. However, this explanation cannot account for the strong discrimination of open far scenes (which shared sky). Furthermore, scene inversion, which should not change the effect of sky or differences in spatial frequencies, has been shown to have a strong impact on both decoding (Walther et al., 2009) and response (Epstein et al., 2006) in PPA. Nonetheless, there must be some visual statistic or combination thereof that is the basis for grouping by expanse in PPA, because all visual representations, whether high or low level, must reflect some difference in the images. The key observation in this study is that the representations in PPA can properly be called spatial because (1) they differ significantly from those observed in early visual cortex, (2) they primarily capture differences in spatial information across complex scenes, (3) their structure directly reflects independent behavioral judgments of the spatial and not nonspatial structure of the scenes, and (4) lesions of parahippocampal cortex lead to impairments in the spatial processing of scenes (Aguirre and D'Esposito, 1999).

    Grouping in EVC likely reflects some low-level features present in the scenes. However, neither the pixelwise similarity (supplemental Item 9, available as supplemental material) from either the peripheral or central portion of the scenes nor the spatial frequency (supplemental Item 8, available as supplemental material) across the entire image seems individually to account for the grouping of scenes by relative distance. Instead, this grouping likely reflects a complex combination of retinotopic, spatial frequency, and orientation information interacting with the structure of EVC (Kay et al., 2008).
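    Testing whether pixelwise image similarity accounts for a region's grouping structure amounts to comparing two representational dissimilarity matrices (RDMs), one built from the images and one from the voxel patterns (Kriegeskorte et al., 2008a). The sketch below is a minimal, schematic implementation on random data; the array sizes and variable names are placeholders, not the published analysis.

```python
import numpy as np

def rdm_from_patterns(patterns):
    """Representational dissimilarity matrix (RDM): 1 - Pearson r
    between every pair of rows. patterns: (n_items, n_features)."""
    return 1.0 - np.corrcoef(patterns)

def compare_rdms(rdm_a, rdm_b):
    """Spearman correlation between the upper-triangle entries of two
    RDMs, the usual statistic for comparing similarity structures."""
    iu = np.triu_indices_from(rdm_a, k=1)
    def rank(x):
        # Rank-transform (no tie handling; fine for continuous data).
        return np.argsort(np.argsort(x)).astype(float)
    a, b = rank(rdm_a[iu]), rank(rdm_b[iu])
    return np.corrcoef(a, b)[0, 1]

# Illustrative use: a pixelwise-image RDM vs a voxel-pattern RDM for
# 96 scenes. With unrelated random data the comparison is near zero.
rng = np.random.default_rng(0)
image_pixels = rng.normal(size=(96, 256))    # 96 scenes, flattened pixels
voxel_patterns = rng.normal(size=(96, 120))  # 96 scenes, voxel responses
pixel_rdm = rdm_from_patterns(image_pixels)
neural_rdm = rdm_from_patterns(voxel_patterns)
similarity = compare_rdms(pixel_rdm, neural_rdm)
```

A low RDM correlation of this kind is what licenses the conclusion that pixelwise similarity alone does not explain the distance-based grouping in EVC.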

    There are two possible sources of spatial information in PPA. First, position information has been reported in PPA (Arcaro et al., 2009) (but see MacEvoy and Epstein, 2007) and other high-level visual areas (Schwarzlose et al., 2008; Kravitz et al., 2010), suggesting feedforward processing of spatial information. Second, PPA might receive spatial information from its connections with the retrosplenial cortex, posterior cingulate, and parietal cortex (Kravitz et al., 2011). Additional research is needed to address this question, but ultimately, which factors contribute to the formation of a representation and its actual structure are distinct.

    The push/pull relationship between discrimination and categorization observed in PPA and pEVC suggests that low-level representations may be important in supporting quick discriminations of complex stimuli (Bacon-Macé et al., 2007; Greene and Oliva, 2009a), whereas high-level representations are specialized to support more abstract or specialized actions (e.g., navigation). Thus, discrimination of complex stimuli based on the response of EVC (Kay et al., 2008) must be interpreted with reference to the particular tasks that response is likely to support, especially given reports that the presence of stimulus information in a region is not necessarily reflected in behavior (Williams et al., 2007; Walther et al., 2009). Our results demonstrate that the critical factors that define high-level representations may not be present within or even predictable from the response of EVC, nor can EVC be ignored given the clear inheritance of many aspects of scene representation by PPA; rather, the response of both EVC and high-level cortex must be considered in any account of complex visual processing.

    In conclusion, we have shown with a data-driven approach that spatial and not high-level category information is the dominant factor in how PPA categorizes scenes. Although information about scenes was present in other visual regions, including EVC, the grouping of scenes varied enormously across regions. These results demonstrate the importance of understanding the structure of representations beyond whether individual presented items can be decoded.

    References

Aguirre GK, D'Esposito M (1999) Topographical disorientation: a synthesis and taxonomy. Brain 122:1613–1628.
Aguirre GK, Zarahn E, D'Esposito M (1998) An area within human ventral cortex sensitive to "building" stimuli: evidence and implications. Neuron 21:373–383.
Arcaro MJ, McMains SA, Singer BD, Kastner S (2009) Retinotopic organization of human ventral visual cortex. J Neurosci 29:10638–10652.
Bacon-Macé N, Kirchner H, Fabre-Thorpe M, Thorpe SJ (2007) Effects of task requirements on rapid natural scene processing: from common sensory encoding to distinct decisional mechanisms. J Exp Psychol Hum Percept Perform 33:1013–1026.
Bar M (2004) Visual objects in context. Nat Rev Neurosci 5:617–629.
Bar M, Aminoff E, Schacter DL (2008) Scenes unseen: the parahippocampal cortex intrinsically subserves contextual associations, not scenes or places per se. J Neurosci 28:8539–8544.
Barrash J, Damasio H, Adolphs R, Tranel D (2000) The neuroanatomical correlates of route learning impairment. Neuropsychologia 38:820–836.
Chan AW, Kravitz DJ, Truong S, Arizpe J, Baker CI (2010) Cortical representations of bodies and faces are strongest in their commonly experienced configurations. Nat Neurosci 13:417–418.
Drucker DM, Aguirre GK (2009) Different spatial scales of shape similarity representation in lateral and ventral LOC. Cereb Cortex 19:2269–2280.
Elo A (1978) The rating of chessplayers, past and present. New York: Arco.
Epstein R, Kanwisher N (1998) A cortical representation of the local visual environment. Nature 392:598–601.
Epstein R, Harris A, Stanley D, Kanwisher N (1999) The parahippocampal place area: recognition, navigation, or encoding? Neuron 23:115–125.
Epstein R, Graham KS, Downing PE (2003) Viewpoint-specific scene representations in human parahippocampal cortex. Neuron 37:865–876.
Epstein RA (2008) Parahippocampal and retrosplenial contributions to human spatial navigation. Trends Cogn Sci 12:388–396.
Epstein RA, Higgins JS (2007) Differential parahippocampal and retrosplenial involvement in three types of visual scene recognition. Cereb Cortex 17:1680–1693.
Epstein RA, Ward EJ (2010) How reliable are visual context effects in the parahippocampal place area? Cereb Cortex 20:294–303.
Epstein RA, Higgins JS, Parker W, Aguirre GK, Cooperman S (2006) Cortical correlates of face and scene inversion: a comparison. Neuropsychologia 44:1145–1158.
Epstein RA, Higgins JS, Jablonski K, Feiler AM (2007) Visual scene processing in familiar and unfamiliar environments. J Neurophysiol 97:3670–3683.
Erickson CA, Jagadeesh B, Desimone R (2000) Clustering of perirhinal neurons with similar properties following visual experience in adult monkeys. Nat Neurosci 3:1143–1148.
Greene MR, Oliva A (2009a) The briefest of glances: the time course of natural scene understanding. Psychol Sci 20:464–472.
Greene MR, Oliva A (2009b) Recognition of natural scenes from global properties: seeing the forest without representing the trees. Cogn Psychol 58:137–176.
Gronau N, Neta M, Bar M (2008) Integrated contextual representation for objects' identities and their locations. J Cogn Neurosci 20:371–388.
Hasson U, Levy I, Behrmann M, Hendler T, Malach R (2002) Eccentricity bias as an organizing principle for human high-order object areas. Neuron 34:479–490.
Haxby JV, Gobbini MI, Furey ML, Ishai A, Schouten JL, Pietrini P (2001) Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293:2425–2430.
Hayes SM, Nadel L, Ryan L (2007) The effect of scene context on episodic object recognition: parahippocampal cortex mediates memory encoding and retrieval success. Hippocampus 17:873–889.
Henderson JM, Larson CL, Zhu DC (2007) Cortical activation to indoor versus outdoor scenes: an fMRI study. Exp Brain Res 179:75–84.
Joubert OR, Rousselet GA, Fize D, Fabre-Thorpe M (2007) Processing scene context: fast categorization and object interference. Vision Res 47:3286–3297.
Kanwisher N, McDermott J, Chun MM (1997) The fusiform face area: a module in human extrastriate cortex specialized for face perception. J Neurosci 17:4302–4311.
Kay KN, Naselaris T, Prenger RJ, Gallant JL (2008) Identifying natural images from human brain activity. Nature 452:352–355.
Kravitz DJ, Kriegeskorte N, Baker CI (2010) High-level object representations are constrained by position. Cereb Cortex 20:2916–2925.
Kravitz DJ, Saleem KS, Baker CI, Mishkin M (2011) A new neural framework for visuospatial processing. Nat Rev Neurosci 12:217–230.
Kriegeskorte N, Goebel R, Bandettini P (2006) Information-based functional brain mapping. Proc Natl Acad Sci U S A 103:3863–3868.
Kriegeskorte N, Mur M, Bandettini P (2008a) Representational similarity analysis: connecting the branches of systems neuroscience. Front Syst Neurosci 2:4.
Kriegeskorte N, Mur M, Ruff DA, Kiani R, Bodurka J, Esteky H, Tanaka K, Bandettini PA (2008b) Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron 60:1126–1141.
Larson AM, Loschky LC (2009) The contributions of central versus peripheral vision to scene gist recognition. J Vis 9:1–16.
Levy I, Hasson U, Avidan G, Hendler T, Malach R (2001) Center-periphery organization of human object areas. Nat Neurosci 4:533–539.
Loschky LC, Larson AM (2008) Localized information is necessary for scene categorization, including the natural/man-made distinction. J Vis 8:4.1–4.9.
MacEvoy SP, Epstein RA (2007) Position selectivity in scene- and object-responsive occipitotemporal regions. J Neurophysiol 98:2089–2098.
Maguire EA, Frackowiak RS, Frith CD (1996) Learning to find your way: a role for the human hippocampal formation. Proc Biol Sci 263:1745–1750.
Mendez MF, Cherrier MM (2003) Agnosia for scenes in topographagnosia. Neuropsychologia 41:1387–1395.
Meng M, Cherian T, Signal G, Sinha P (2010) Functional lateralization of face processing. J Vis 10:562.
Misaki M, Kim Y, Bandettini PA, Kriegeskorte N (2010) Comparison of multivariate classifiers and response normalizations for pattern-information fMRI. Neuroimage 53:103–118.
Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vision 42:145–175.
Park S, Brady TF, Greene MR, Oliva A (2011) Disentangling scene content from spatial boundary: complementary roles for the parahippocampal place area and lateral occipital complex in representing real-world scenes. J Neurosci 31:1333–1340.
Ranganath C, DeGutis J, D'Esposito M (2004) Category-specific modulation of inferior temporal activity during working memory encoding and maintenance. Brain Res Cogn Brain Res 20:37–45.
Rosenbaum RS, Ziegler M, Winocur G, Grady CL, Moscovitch M (2004) "I have often walked down this street before": fMRI studies on the hippocampus and other structures during mental navigation of an old environment. Hippocampus 14:826–835.
Ross MG, Oliva A (2010) Estimating perception of scene layout properties from global image features. J Vis 10:2.1–2.25.
Sayres R, Grill-Spector K (2008) Relating retinotopic and object-selective responses in human lateral occipital cortex. J Neurophysiol 100:249–267.
Schwarzlose RF, Swisher JD, Dang S, Kanwisher N (2008) The distribution of category and location information across object-selective regions in human visual cortex. Proc Natl Acad Sci U S A 105:4447–4452.
Takahashi N, Kawamura M (2002) Pure topographical disorientation: the anatomical basis of landmark agnosia. Cortex 38:717–725.
Thierry G, Martin CD, Downing P, Pegna AJ (2007) Controlling for interstimulus perceptual variance abolishes N170 face selectivity. Nat Neurosci 10:505–511.
Torralba A, Oliva A (2003) Statistics of natural image categories. Network 14:391–412.
Walther DB, Caddigan E, Fei-Fei L, Beck DM (2009) Natural scene categories revealed in distributed patterns of activity in the human brain. J Neurosci 29:10573–10581.
Williams MA, Dang S, Kanwisher NG (2007) Only some spatial patterns of fMRI response are read out in task performance. Nat Neurosci 10:685–686.
Williams MA, Baker CI, Op de Beeck HP, Shim WM, Dang S, Triantafyllou C, Kanwisher N (2008) Feedback of visual object information to foveal retinotopic cortex. Nat Neurosci 11:1439–1445.