

Accurate Visual Memory for Previously Attended Objects in Natural Scenes

Andrew Hollingworth and John M. Henderson
Michigan State University

The nature of the information retained from previously fixated (and hence attended) objects in natural scenes was investigated. In a saccade-contingent change paradigm, participants successfully detected type and token changes (Experiment 1) or token and rotation changes (Experiment 2) to a target object when the object had been previously attended but was no longer within the focus of attention when the change occurred. In addition, participants demonstrated accurate type-, token-, and orientation-discrimination performance on subsequent long-term memory tests (Experiments 1 and 2) and during online perceptual processing of a scene (Experiment 3). These data suggest that relatively detailed visual information is retained in memory from previously attended objects in natural scenes. A model of scene perception and long-term memory is proposed.

Because of the size and complexity of the visual environments humans tend to inhabit, and because high-acuity vision is limited to a relatively small area of the visual field, detailed perceptual processing of a natural scene depends on the selection of local scene regions by movements of the eyes (for reviews, see Henderson & Hollingworth, 1998, 1999a). During scene viewing, the eyes are reoriented approximately three times each second by saccadic eye movements to bring the projection of a local scene region (typically a discrete object) onto the area of the retina producing the highest acuity vision (the fovea). The periods between saccades, when the eyes are relatively stationary and detailed visual information is encoded, are termed fixations and last an average of approximately 300 ms during scene viewing. During each brief saccadic eye movement, however, visual encoding is suppressed (Matin, 1974). Thus, the visual system is provided with what amounts to a series of snapshots (corresponding to fixations), which may vary dramatically in their visual content over a complex scene, punctuated by brief periods of blindness (corresponding to saccades).

The selective nature of scene perception places strong constraints on the construction of an internal representation of a scene. If a detailed visual representation is to be formed, then information from separate fixations must be retained and combined over one or more saccadic eye movements as the eyes are oriented to multiple local regions. The temporal and spatial separation of eye fixations on a scene leads to two general memory problems in the construction of a scene representation. One is the short-term retention of scene information across a single saccadic eye movement, particularly from the target of the next saccade (for reviews, see Henderson & Hollingworth, 1999a; Irwin, 1992b; Pollatsek & Rayner, 1992). The second, which this study investigated, is the accumulation of scene information across longer periods of time and across multiple fixations. That is, what information is retained from previously fixated (and hence attended) regions of a scene, and how is that information used to construct a larger-scale representation of the scene as a whole, if such a representation is constructed at all?

Scene Representation as the Construction of a Composite Image

One possibility is that a sensory representation is retained and combined from previously fixated and attended regions to form a global composite image of the scene. We define sensory representation as a precategorical, maskable, complete, and metrically organized (i.e., iconic) representation of the properties available from early vision (such as shape, shading, texture, and color; Irwin, 1992b; Sperling, 1960). According to this composite image hypothesis, sensory representations from individual fixations are integrated within a visual buffer and organized according to the position in the world from which each was encoded (Breitmeyer, Kropfl, & Julesz, 1982; Davidson, Fox, & Dick, 1973; Feldman, 1985; McConkie & Rayner, 1976). Metaphorically, local high-resolution information is painted onto an internal canvas, producing over multiple fixations a metrically organized composite image of previously attended regions. Such a composite sensory image could then be used to support a variety of visual–cognitive tasks and would account for the phenomenological perception of a highly detailed and stable visual world.

Andrew Hollingworth and John M. Henderson, Department of Psychology and Cognitive Science Program, Michigan State University.

This research was supported by a National Science Foundation (NSF) graduate fellowship to Andrew Hollingworth and NSF Grants SBR 9617274 and ECS 9873531 to John M. Henderson. This research derives from a doctoral dissertation submitted by Andrew Hollingworth to Michigan State University. Aspects of the data were reported at the seventh annual Workshop on Object Perception and Memory, Los Angeles, CA, November 1999, and the 41st annual meeting of the Psychonomic Society, New Orleans, LA, November 2000.

We thank the dissertation committee of Tom Carr, Rose Zacks, and Richard Hall for their helpful discussions of the research and for their comments on the dissertation. We also thank Dan Simons and Dave Irwin for comments on an earlier version of the article, and Gary Schrock for his technical assistance.

Correspondence concerning this article should be sent to Andrew Hollingworth, who is now at Yale University, Department of Psychology, Box 208205, New Haven, Connecticut 06520-8205. E-mail: [email protected]

Journal of Experimental Psychology: Human Perception and Performance, 2002, Vol. 28, No. 1, 113–136. Copyright 2002 by the American Psychological Association, Inc.

0096-1523/02/$5.00 DOI: 10.1037//0096-1523.28.1.113



This possibility has proved attractive, but a large body of research demonstrates conclusively that the visual system does not integrate sensory information across saccadic eye movements (Bridgeman & Mayer, 1983; Henderson, 1997; Irwin, 1991; Irwin, Yantis, & Jonides, 1983; McConkie & Zola, 1979; O'Regan & Levy-Schoen, 1983; Rayner & Pollatsek, 1983). For example, Irwin et al. (1983) and Rayner and Pollatsek (1983) found that participants could not integrate two dot patterns when presented in the same spatial position but on subsequent fixations, suggesting that the type of sensory fusion possible within a fixation across short interstimulus intervals (e.g., Di Lollo, 1980) does not occur across separate fixations. In addition, a precise representation of object contours does not appear to be retained across eye movements (Henderson, 1997; Pollatsek, Rayner, & Collins, 1984). If sensory representations are not retained across a single saccadic eye movement, sensory information could not be accumulated across multiple fixations to form a composite global image of a scene.

Although transsaccadic memory does not support sensory integration, visual representations are nonetheless retained across eye movements. In transsaccadic object identification studies, participants are faster to identify an object when a preview of that object has been available prior to the saccade (e.g., Henderson, Pollatsek, & Rayner, 1989), and this benefit is influenced by visual changes, such as substitution of one object with another from the same basic-level category (Henderson & Siefert, in press) and mirror reflection (Henderson & Siefert, 1999, in press). In addition, object priming across saccades appears to be governed primarily by visual similarity rather than by conceptual similarity (Pollatsek et al., 1984). Finally, structural descriptions of simple visual stimuli can be retained and integrated across eye movements (Carlson-Radvansky, 1999; Carlson-Radvansky & Irwin, 1995). Thus, integration across saccades appears to be limited to visual codes abstracted from precise sensory representation, but detailed enough to specify individual object tokens and the viewpoint at which the object was observed. Higher-level visual representations meeting these criteria have been proposed in the object recognition literature, including viewpoint-dependent structural descriptions (Bülthoff, Edelman, & Tarr, 1995) and abstract 2-D-feature representations (Riesenhuber & Poggio, 1999). It is then a possibility that such higher-level visual representations are retained from previously fixated and attended regions and accumulate within a representation of the scene.1

Research using natural scene stimuli has provided converging evidence that the visual system does not form a representation as detailed and complete as a composite sensory image. A number of studies have made changes to a scene during a saccadic eye movement with the logic that if a global image of the scene were constructed and retained across eye movements, changes to the scene should be detected easily. However, participants have proved rather poor at detecting scene changes across saccadic eye movements (Currie, McConkie, Carlson-Radvansky, & Irwin, 2000; Grimes, 1996; Henderson & Hollingworth, 1999b; McConkie & Currie, 1996). For example, Grimes and McConkie (Grimes, 1996; McConkie, 1991) coordinated relatively large scene changes (e.g., enlarging a child in a playground scene by 30% and moving the child forward in depth) with saccadic eye movements and found that participants detected changes at well below 50% correct. A stronger manipulation was conducted by Henderson, Hollingworth, and Subramanian (1999), in which every pixel in a scene image was changed during a saccade (a set of gray bars occluded half of the scene image; during a saccade, the occluded and visible portions were reversed). Although the pictorial content changed dramatically over the entire scene, participants detected these changes less than 3% of the time.

In addition, this phenomenon of poor change detection, or change blindness, is not limited to scene changes made during saccadic eye movements. Rensink, O'Regan, and Clark (1997) examined whether the apparent inability to accumulate sensory information across discrete views of a scene is a specific property of saccadic eye movements or a more general property of visual perception and memory. Rensink et al. simulated the visual events caused by moving the eyes: Initial and changed scene images were displayed in alternation for 250 ms each (roughly the duration of a fixation on a scene), and each image was separated by a brief 80-ms blank interval (corresponding to saccadic suppression). As in transsaccadic change studies, participants were often quite poor at detecting significant changes to a scene in this flicker paradigm. Subsequent research has demonstrated similar change blindness when a change occurs across many different forms of visual disruption, including film cuts (Levin & Simons, 1997), occlusion in a real-world encounter (Simons & Levin, 1998), or a blink (O'Regan, Deubel, Clark, & Rensink, 2000).
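For concreteness, the temporal structure of the flicker paradigm can be expressed as a simple display loop. The sketch below is our illustration, not code from Rensink et al. (1997): show_image and show_blank are hypothetical display helpers, and a real experiment would use a psychophysics toolkit with frame-accurate timing. Only the 250-ms and 80-ms durations come from the description above.

```python
import time

# Hypothetical display helpers standing in for a real stimulus library.
def show_image(image, duration_s):
    print(f"showing {image} for {duration_s * 1000:.0f} ms")
    time.sleep(duration_s)

def show_blank(duration_s):
    print(f"blank for {duration_s * 1000:.0f} ms")
    time.sleep(duration_s)

def flicker_trial(original, changed, image_ms=250, blank_ms=80, cycles=10):
    """Alternate original and changed scene images, each separated by a
    blank interval that masks the change transient."""
    for _ in range(cycles):
        show_image(original, image_ms / 1000)  # ~one fixation's duration
        show_blank(blank_ms / 1000)            # ~saccadic suppression
        show_image(changed, image_ms / 1000)
        show_blank(blank_ms / 1000)

flicker_trial("scene_original.png", "scene_changed.png", cycles=2)
```

In an actual flicker experiment, the alternation would continue until the observer pressed a response button; the cycle limit here simply keeps the demonstration finite.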

In summary, the literature on visual memory across saccades and other visual disruptions conclusively demonstrates that sensory representations are not integrated to form a composite image of a scene. If visual representations are accumulated from previously fixated and attended regions, they must be in a form significantly more abstract than sensory representation.

Localist Attention-Based Accounts

Recent proposals have abandoned the idea of a composite sensory image in favor of a view that visual scene representation is more local and more transient. Irwin (1992a, 1992b; Irwin & Andrews, 1996) has proposed an object file theory of transsaccadic memory, developed primarily to explain the integration of information across single eye movements but with implications for scene representation, in general, and for the accumulation of scene information from previously fixated and attended regions.

1 As is evident from this discussion, we use the term visual to refer to both lower-level sensory representations and higher-level visual representations, such as a structural description. This usage is consistent with most of the existing literature in visual cognition, object perception, and transsaccadic memory and integration. In addition, we distinguish visual representations (encoding properties such as shape and color) from conceptual representations (encoding object identity and other associative information). However, within the change blindness literature, some researchers prefer to limit the term visual to sensory representation (e.g., Simons, 1996). Others appear to equate visual representation with conscious visual experience (e.g., Wolfe, 1999). Yet given that higher-level visual representations form the basis of integration across saccades and are functional in visual processes such as object recognition, it does not seem appropriate to limit the term visual to sensory representation. In addition, whatever constitutes visual experience across saccades must necessarily be due to higher-level visual representation, as sensory information is not retained from one fixation to the next. Finally, given that much of the work of vision is unavailable to awareness, we believe it is unnecessarily constraining to equate visual representation with conscious visual experience.


The allocation of visual attention, according to Irwin, governs what local visual information is and is not represented from a complex scene. When attention is directed to an object, visual features are bound into a unified object description (Treisman, 1988). In addition, a temporary representation is formed, an object file, that links the visual object description to a spatial position in a master map of locations (Kahneman & Treisman, 1984). Across a saccade, object files are maintained in visual short-term memory (VSTM), a relatively long-lasting, capacity-limited store maintaining visual codes abstracted from sensory representation (Irwin, 1992b). Object files, then, are the primary content of memory across saccades, providing local continuity from one fixation to the next.

In this view, only a very small portion of the local information available in a complex natural scene is represented across a saccade, due to the strong capacity constraints on VSTM. Irwin (1992a) has provided evidence that three or four discrete object files can be retained in VSTM across a saccade. In those experiments, an array of letters was presented prior to an eye movement. The letters were removed during the saccade, and a position was cued. Participants' ability to report the identity of the letter in the cued position was consistent with the retention of three or four position-bound letter codes. As a consequence of this limited capacity in VSTM, only currently or recently attended objects are represented in any detail. Visual representations from previously fixated and attended regions should be quickly replaced as new object files are constructed. In support of this view, Irwin and Andrews (1996) used the partial report paradigm described above but allowed participants two fixations rather than one in the array prior to test. If visual information encoded during the second fixation accumulates with that encoded during the first fixation, then report performance should have been reliably higher with two fixations prior to the probe versus one. Yet report performance after two fixations was not reliably improved, suggesting little if any visual accumulation across the two fixations.

In addition, Irwin (1992b; Irwin & Andrews, 1996) has integrated the object file framework into a more general theory of scene representation and transsaccadic memory. In this view, the scene information retained across a saccadic eye movement is limited to three sources. One is active object files maintaining visual codes (abstracted from sensory representation) from currently attended or recently attended objects and, in particular, from the target of the next saccade (Currie et al., 2000). The second is position-independent activation of conceptual nodes in long-term memory (LTM) coding the identity of local objects that have been recognized (Henderson, 1994; Henderson & Anes, 1994; Pollatsek, Rayner, & Henderson, 1990). The third is schematic scene-level representations derived from scene identification, coding conceptual–semantic properties such as scene meaning or gist. However, only object files maintain a visual representation of local objects, and these structures are transient. Irwin and Andrews (1996) summarize this view as follows:

According to object file theory, relatively little information actually accumulates across saccades; rather, one's mental representation of a scene consists of mental schemata and identity codes activated in long-term memory and of a small number of detailed object files in short-term memory. (p. 130)

More recent proposals, drawn primarily from the change blindness literature, have placed even greater emphasis on the role of attention in scene perception and on the transience of visual representation (O'Regan, 1992; O'Regan, Rensink, & Clark, 1999; Rensink, 2000a, 2000b; Rensink et al., 1997; Simons & Levin, 1997; Wolfe, 1999). Rensink (2000a, 2000b) has provided the most detailed account of this view, termed coherence theory. As in Irwin's object file theory of transsaccadic memory, coherence theory claims that visual attention is necessary to bind sensory features into a coherent object representation and to maintain this representation in VSTM, which is stable across brief disruptions such as saccadic eye movements. In contrast, unattended sensory representations decay rapidly and are overwritten by new visual encoding. When visual attention is withdrawn from an object, however, the representation of that object immediately reverts to its preattentive state, becoming "unglued" (see also Wolfe, 1999).2 Finally, initial perceptual processing of a scene activates schematic representations of scene gist and general spatial layout that are preserved across visual interruptions, providing an impression of scene continuity. According to Rensink (2000a), scene gist corresponds to a scene category label (e.g., bedroom), and the representation of spatial layout does not contain information about the visual properties of individual objects. Thus, visual representation is limited to the currently attended object. Because there are few, if any, representational consequences of having previously attended an object, the visual system is unable to accumulate information from previously attended regions.

Though clearly similar, coherence theory and the object file theory of transsaccadic memory appear to differ on three points. First, Rensink (2000a) proposes that only one object can be maintained in VSTM across visual disruptions, whereas Irwin (1992a) provides evidence that three or four objects can be maintained. Second, Rensink proposes that sensory representations can be retained in VSTM across disruptions such as saccades, whereas object file theory holds that VSTM supports the maintenance of visual representations abstracted from sensory information. Third, coherence theory holds that visual object representations disintegrate as soon as attention is withdrawn, whereas object files can remain active after attention is withdrawn (at least until replaced). The first two differences are unlikely to be critical. Although Rensink claims that VSTM is limited to one object, he leaves open the possibility that the visual system may treat a collection of three or four objects as a single entity.

2 Although a key proposal in Wolfe (1999) is that a unified object representation dissolves when attention is withdrawn from an object, more recent work by Wolfe, Klempen, and Dahlen (2000) has modified this earlier proposal. The modified claim in Wolfe et al. (2000) is that when attention is withdrawn from an object, the link established between the visual representation of that object and corresponding LTM representations (which allows conscious identification) is dissolved. As a result, multiple objects in a scene cannot be consciously and simultaneously recognized. However, Wolfe et al. (2000) have left open the possibility that bound object representations may be retained in memory after attention is withdrawn from an object and used for subsequent change detection. Thus, Wolfe's view no longer appears consistent with the original claim that visual object representations disintegrate on the withdrawal of attention.


The second difference is significant, but extant data provide conclusive evidence that sensory representations cannot be retained across disruptions such as saccades, as reviewed above. The only difference of real import, then, concerns the fate of previously attended objects: Do visual representations disintegrate immediately on the withdrawal of attention, or do they remain active until replaced by subsequent encoding? This difference in theory leads to different predictions regarding the detection of changes to natural scenes. Coherence theory predicts that only visual changes to a currently attended object should be detected, whereas object file theory predicts that changes to an unattended object could be detected if the object has been attended earlier and if its object file has not been replaced by subsequent encoding.

In summary, both theories propose that the visual representation of a scene across disruptions such as saccades is local and transient, with only currently or recently attended objects represented in any detail. Thus, we refer to these proposals as visual transience hypotheses of scene representation. Although the representation of visual information is proposed to be transient, these theories do allow for the retention of more abstract and stable representations, coding such properties as scene gist, the spatial layout of the scene, and the identities of recognized objects. With regard to visual representation, visual transience hypotheses are consistent with a view of perception in which the visual system does not rely heavily on memory to construct a scene representation, but instead depends on the fact that local objects in the environment can be sampled when necessary by movements of the eyes or attention. The world itself serves as an "external memory" (O'Regan, 1992; O'Regan et al., 1999). In addition, visual transience hypotheses are consistent with functionalist approaches to scene representation (Ballard, Hayhoe, Pook, & Rao, 1997; Hayhoe, 2000; Hayhoe et al., 1998), which reject the notion that the visual system creates a general-purpose representation that can support a variety of tasks. Instead, the representation of local scene information is directly governed by the allocation of attention to goal-relevant objects.

Researchers initially assumed that the goal of vision was to construct a global and veridical internal representation of the visual world by integrating sensory representations from multiple local fixations. The pendulum of theory has now swung to the view that little or no visual information is retained from previously fixated and attended regions of a scene and that visual representation is transient, leaving no lasting memory.

In the next sections, we discuss two lines of evidence relevant to the visual transience claim: (a) research on LTM for scenes and (b) change detection studies that have examined visual representation after the withdrawal of attention.

Evidence From Long-Term Scene Memory

One place to look for initial evidence regarding the retention of visual information from natural scenes is the literature on LTM for pictures. Visual transience hypotheses hold that the LTM representation of a scene cannot contain specific visual information, because such information is not retained for very long after attention is withdrawn from an object. Instead, scene memory under this view is limited to gist, layout, and perhaps the abstract identities of recognized objects (Rensink, 2000b; Simons, 1996; Simons & Levin, 1997; Wolfe, 1998). The picture memory literature, however, indicates that long-term picture memory can preserve quite detailed information. Initial studies of picture memory demonstrated that humans possess a prodigious ability to remember pictures presented at study (Nickerson, 1965; Shepard, 1967; Standing, Conezio, & Haber, 1970). For example, Standing et al. (1970) tested LTM for 2,560 images, about 600 of which were classified as city scenes. Memory for a subset of 280 images was tested in a two-alternative, forced-choice test, with mean discrimination performance of approximately 90% correct. Thus, memory for a very large number of scenes can be quite accurate, even when some of the studied images have similar subject matter.

Although studies of memory capacity demonstrate that scene memory is specific enough to successfully discriminate between thousands of different items, they do not identify the nature of the stored information supporting this performance. However, three studies suggest that scene memory can indeed preserve specific visual information. First, Friedman (1979) presented line drawings of six common environments for 30 s each during a study session. At test, changed versions of each scene were presented, and the participant's task was to determine whether the scene was the same as the studied version. One change conducted by Friedman was to replace an object in the initial scene with another object from the same basic-level category (a token change), a manipulation that tested whether specific visual information (as opposed to purely conceptual information) was preserved in memory. Participants correctly rejected 25% of the changed scenes when the target object was very likely to appear in the scene, 39% when the target object was moderately likely to have appeared in the scene, and 60% when the target object was unlikely to have appeared in the scene. In addition, Parker (1978) found accurate correct rejection performance (above 85% correct) on a recognition memory test for token and size changes to individual objects in a scene.

Together these data suggest that visual information can be retained in memory from individual objects in a scene. One potential problem with these studies, however, is that each of a relatively small number of scenes was repeated a large number of times. In addition to the initial 30-s study period, Friedman (1979) presented each scene 12 different times in the memory test session to test different object changes. A second potential problem with these studies is that they used relatively simple stimuli. Parker's scenes contained just six discrete objects arranged on a blank background. Thus, the small number of scenes, the visual simplicity of those scenes, and the repeated presentation of each scene may have produced unrealistic estimates of the extent to which specific visual information was retained.

Converging evidence that LTM preserves specific visual information comes from the Standing et al. (1970) study. Memory for the left–right orientation of studied pictures was tested by presenting studied scenes at test, either in the same orientation as at study or in the reverse orientation. The orientation of a picture was unlikely to be encoded using a purely conceptual representation, as the meaning of the scenes did not change when the orientation was reversed. However, participants were able to correctly identify the initially viewed picture orientation 86% of the time after a 30-min retention interval. Thus, the Standing et al. study demonstrated that in addition to being accurate enough to discriminate between thousands of studied pictures, LTM for scenes is not limited to the gist of the scene or to the identities of individual objects. However, it is at least possible that left–right orientation discrimination could have been driven by an accurate representation of the layout of the scene without the retention of visual information from local objects (Simons, 1996).

In summary, the picture memory literature converges on the conclusion that scene representation is more detailed than would be expected under visual transience hypotheses. However, no single study provides unequivocal evidence that visual representations are reliably retained in memory from previously attended objects in scenes.

Evidence From Change Detection Studies

Further evidence bearing on whether visual information is accumulated from previously fixated and attended objects comes from studies examining change detection as a function of eye position (reviewed in Henderson & Hollingworth, in press). Hollingworth, Williams, and Henderson (2001) made a token change to a target object in a line drawing of a scene during the saccade that took the eyes away from that object after it had been fixated the first time. Prior to a saccade, visual attention is automatically and exclusively directed to the target of that saccade (Deubel & Schneider, 1996; Henderson et al., 1989; Hoffman & Subramaniam, 1995; Kowler, Anderson, Dosher, & Blaser, 1995; Rayner, McConkie, & Ehrlich, 1978; Shepherd, Findlay, & Hockey, 1986). For example, Hoffman and Subramaniam (1995) provided evidence that visual attention is allocated exclusively to the saccade target prior to the execution of a saccade, as participants could not attend to one object in the visual field when preparing a saccade to another, even when such a strategy would have been optimal. Thus, in Hollingworth et al. (2001), when a change was introduced on the saccade away from the target object, attention had been withdrawn from the target object before the change occurred. According to coherence theory, this type of change should not be detected, because the maintenance of a visual object representation depends on the continuous allocation of attention to that object (Rensink, 2000a; Rensink et al., 1997). However, participants were able to detect these changes, albeit at a fairly modest rate of 27% correct (the false alarm rate was 2%). This result has also been observed using 3-D-rendered color images of scenes and with a different type of visual change: a 90° rotation in depth (Henderson & Hollingworth, 1999b).

Coherence theory has difficulty accounting for these data, but they are not necessarily inconsistent with the object file theory of transsaccadic memory, because the latter view holds that visual object representations can be maintained briefly after attention is withdrawn. However, two pieces of evidence from the Hollingworth et al. (2001) and Henderson and Hollingworth (1999b) studies appear to be inconsistent with object file theory as well. First, in each of these experiments, detection was often delayed until refixation of the changed object, suggesting that information specific to the visual form and orientation of an object was retained in memory across multiple intervening fixations and consulted when focal attention was directed back to the changed object. Second, in Hollingworth et al. (2001), when a change was not explicitly detected, fixation duration on the changed object was significantly longer than when no change occurred, and this effect was likewise delayed over multiple intervening fixations (13.5 on average). Under object file theory, this longer-term retention of visual information should not occur, because the critical object file should have been replaced by the creation of new object files as the eyes and attention were directed to other objects in the scene. Instead, these data are consistent with the picture memory literature suggesting that detailed visual information (though clearly less detailed than a sensory image) is retained from previously attended objects.

Change Blindness Reconsidered

If relatively detailed visual information can be retained in memory from previously attended objects in a scene, as suggested by the picture memory literature and our initial change detection studies (Henderson & Hollingworth, 1999b; Hollingworth et al., 2001, in press), why would change blindness occur at all? We considered three possibilities. First, in studies demonstrating poor change detection performance, the critical change in the scene could have occurred before the target region was fixated and thus before detailed information was encoded from that region. This hypothesis is motivated by evidence that the encoding of scene information is strongly influenced by fixation position. For example, in an LTM study, Nelson and Loftus (1980) demonstrated that the encoding of object information into a scene representation is generally limited to a very local region around the current fixation position. In addition, in an online change detection paradigm, Hollingworth, Schrock, and Henderson (2001) found that fixation position played a significant role in the detection of scene changes made periodically across a blank interval, with the majority of changes detected only when the object was in foveal or near-foveal vision (see also O'Regan et al., 2000). To acquire further evidence, we reexamined data from Henderson and Hollingworth (1999b), from a control condition in which a target object was changed (deletion or 90° in-depth rotation) during a saccade to a different object in the scene. Trials were divided into those on which the target object had been fixated prior to the change and those on which the change occurred before fixation on the target object. Changes that occurred after fixation on the target were detected more accurately (39.7% correct) than changes that occurred before fixation on the target (14.2% correct), F(1, 16) = 11.44, MSE = 965.5, p < .005. Thus, given the likely dependence of change detection on prior target fixation, changes could sometimes go undetected in change blindness studies simply because the target region was not fixated prior to the change.

A second reason change blindness could underestimate the detail of scene representation is that in studies demonstrating poor change detection performance, information encoded from the target region might not have been retrieved to support change detection. As reviewed above, a number of studies have found that a change to an object is sometimes detected only when the changed region is refixated after the change (Henderson & Hollingworth, 1999b; Hollingworth et al., 2001; Parker, 1978). Thus, fixation and/or focal attention may sometimes be necessary to retrieve stored information about a previously fixated object. If the changed region is not refixated, then the change may go undetected, even though the stored representation of that object is sufficiently detailed to support change detection.

Finally, the standard interpretation of change detection performance in change blindness studies may be incorrect. Within the change blindness literature, the interpretation of change detection measures has tended to use the following logic: Explicit change detection directly reflects the extent to which scene information is represented. Therefore, if a change is not detected, the information necessary to detect the change must be absent from the internal representation of the scene. However, a number of recent studies have demonstrated that for trials on which a change was not explicitly detected, effects of that change can be observed on more sensitive measures (Fernandez-Duque & Thornton, 2000; Hayhoe et al., 1998; Hollingworth et al., 2001; Williams & Simons, 2000). For example, Hollingworth et al. (2001) found that gaze duration on a changed object when the change was not detected was 250 ms longer on average than when the same object was not changed. Thus, change blindness may be observed not because the critical information is absent from the scene representation but because explicit detection is not always sensitive to the presence of that information.

This Study

The goal of this study, then, was to investigate the nature of the information retained in memory from previously attended objects in natural scenes. The study sought to resolve the apparent discrepancy between evidence of poor change detection (and visual transience hypotheses which seek to explain such change blindness) and evidence of excellent memory for pictures. This primary goal was broken down into a number of component questions. First, how specific is the representation of objects in a scene that have been previously attended but are not within the current focus of attention, both during the online perceptual processing of the scene and later, after the scene has been removed? Second, is fixation of an object important for encoding that object into a scene representation and thus for the detection of changes? Third, does refixation play a role in the retrieval of stored object information, supporting change detection? Fourth, to what extent does explicit change detection reflect the detail of the underlying representation?

Experiment 1

In Experiment 1, we combined a saccade-contingent change paradigm with an LTM paradigm to investigate the nature of the scene representation constructed during scene viewing and stored in LTM.

In an initial study session, computer-rendered color images of common environments were presented to participants, whose eye movements were monitored as they viewed each image for 20 s in preparation for a later memory test. In each scene, one target object was chosen. To investigate the representation of previously attended objects during scene viewing, the target object was changed during a saccade to a different region of the scene, but only if the target object had already been fixated at least once. Because visual attention and fixation position are tightly linked during normal viewing, making the change only after the object had been fixated ensured that that object had been attended at least once prior to the change. However, because visual attention is automatically and exclusively allocated to the target of the next saccadic eye movement prior to the execution of that eye movement, as reviewed above, the target object was not within the current focus of attention when it changed: Before the initiation of the eye movement that triggered the change, visual attention had shifted to the object within the change-triggering region, and thus, participants could not have been attending the target object when the change occurred. Note that this method depends on the (uncontroversial) assumption that fixated objects have also been focally attended; however, this assumption does not entail that all attended objects are necessarily fixated.

To test the specificity of the representation of previously attended objects, the target object in each scene was changed in one of two ways: a type change, in which the target was replaced by another object from a different basic-level category, and a token change, in which the target was replaced by an object from the same basic-level category (Hollingworth et al., 2001; see also Archambault, O'Donnell, & Schyns, 1999). These conditions are illustrated in Figure 1. In the type-change condition, detection could be based on a basic-level coding of object identity. However, if participants can detect token changes, information specific to the object's visual form, as opposed to its basic-level identity, was likely to have been retained.

If detailed visual information is retained from previously attended scene regions, as suggested by the picture memory literature, participants should be able to detect both type changes and token changes. Coherence theory, however, makes a different prediction. Coherence theory holds that only changes to attended visual information, the gist, or the layout of a scene can be detected, as these are the only forms of information retained across disruptions such as saccades. The target object changes in this experiment do not alter attended visual information, as the target object was not attended when the change occurred. In addition, general layout should not be altered by these changes, as the original and changed target objects occupied the same spatial position and were matched for size. It is possible that a type change might alter the gist of the scene if that representation is detailed enough to code the identities of individual objects. The most common definition of gist is a short description capturing the identity of the scene, such as child's bedroom (a definition shared by Rensink, 2000a). So, although we conservatively assumed that a type change could alter the gist of the scene, in reality, most type changes would not alter this very abstract description. A token change, however, should never alter the gist of the scene, as the change does not even alter the basic-level identity of the target object itself. Thus, coherence theory makes the clear prediction that token changes should not be detected in this study. In fact, Rensink (2000a) has stated directly that information specific to object tokens can be maintained only in the presence of attention.

The object file theory of transsaccadic memory also predicts poor detection performance. If the object file coding detailed visual information from an object is replaced quickly after attention is withdrawn from that object, detection performance in the token-change condition should decrease as a function of the elapsed time between the withdrawal of attention from the target and the change. A more precise prediction depends on making a number of assumptions about the creation of object files and their replacement in VSTM. According to object file theory, an object file is formed when attention is directed to a new perceptual object. Attention precedes the eyes to the next saccade target, and thus, object file creation could be expected to occur roughly once per saccade during scene viewing. This is an admittedly rough estimate, because attention could be allocated to more than one object within a single fixation or to the same object across more than one fixation. In addition, the length of time an object file persists after the withdrawal of attention depends on the mode of replacement in VSTM. If replacement is first in, first out, then detection performance should decline sharply to zero if the change happens more than about three or four fixations after the eyes leave the target region. It is possible, though, that replacement is a stochastic process, in which case a much more gradual, exponential decline in detection performance should be observed.
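The contrast between the two modes of replacement can be made concrete with a minimal sketch, under two simplifying assumptions of ours (not Irwin's): VSTM holds k object files, and exactly one new object file is created per saccade. The survival probability of the target's object file after n subsequent fixations is then:

```latex
% FIFO replacement: the target's file survives until capacity is reached.
P_{\mathrm{FIFO}}(n) =
  \begin{cases}
    1 & \text{if } n < k \\
    0 & \text{if } n \geq k
  \end{cases}
\qquad
% Stochastic replacement: each new file overwrites a randomly chosen slot.
P_{\mathrm{random}}(n) = \left( \frac{k-1}{k} \right)^{n}
```

With k = 4, random replacement gives P(4) of roughly 0.32 and P(13) of roughly 0.02: a gradual, exponential decline rather than the sharp cutoff implied by first-in, first-out replacement.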

In either case, however, Irwin's object file theory predicts a significant decline in detection performance as a function of the number of intervening fixations between the last exit of the eyes from the target region prior to the change and the change itself. In keeping with Irwin and Andrews' (1996) claim that there is little accumulation of detailed information across eye movements and that replacement in VSTM is likely to be first in, first out, this view predicts that detection performance should decline to zero quite quickly, within a maximum of about four fixations. Type changes, on the other hand, might be detected successfully and in a manner independent of the number of intervening fixations if the change is significant enough to alter the gist of the scene or if an abstract identity code is retained from the target object, as Irwin's theory holds that these types of information can be maintained in a stable form across multiple eye movements.

The change-after-fixation condition was contrasted with two control conditions. In the change-before-fixation condition, the target object was changed before the first fixation on that object. In the no-change control condition, the initial scene was not changed. The change-before-fixation condition was included to test the extent to which local object encoding is dependent on fixation. If encoding is facilitated by object fixation, then change detection should be reliably poorer when the object had not been fixated prior to the change, compared with when it had been fixated. The no-change control condition was included to assess the false-alarm rate.

Finally, to investigate LTM for the target objects in the scenes, we administered a forced-choice recognition test for no-change control scenes after the study session. Participants saw two versions of each scene in succession, one containing the studied object and the other a distractor object in the same spatial position.

Figure 1. Sample scene illustrating the change conditions in Experiments 1 and 2. A: Initial scene, in which the notepad is the target object; B: Type change (Experiment 1); C: Token change (Experiments 1 and 2); D: Rotation (Experiment 2). In the experiments, stimuli were presented in color.


The distractor could be either a different type (type-discrimination condition) or a different token (token-discrimination condition). Similar predictions hold for the LTM test as for online change detection. If visual object representations are retained in memory after attention is withdrawn, as the picture memory literature implies, participants should be able to successfully discriminate between both type and token alternatives. However, if visual representation is transient and there is little accumulation of information from local scene regions, as proposed by both visual transience hypotheses, participants should not be able to accurately discriminate two token alternatives.

In addition, the LTM test in this study avoids some of the interpretative difficulties present in other scene memory paradigms. First, distractors in prior studies were often chosen to maximize discriminability, whereas studied scenes and distractors in this study differed only in the properties of a single object. Second, whereas prior studies showing the retention of token-specific information repeated each scene many times, participants viewed each scene in this study only once prior to the test. Third, prior studies often used a variety of materials from a variety of sources (e.g., mixing together color images with black-and-white images), whereas the similarity between studied scenes was fairly high in this study: Each scene was a 3-D-rendered color image of a common environment, many of the scenes were taken from the same large-scale model of a single house, and some scenes were created by rendering different viewpoints within a single room model. Thus, this study provides a particularly stringent test of scene memory.

Method

Participants. Twelve Michigan State University undergraduates participated in the experiment for course credit. All participants had normal vision and were naive with respect to the hypotheses under investigation.

Stimuli. Thirty-six scene images were computer-rendered from 3-D wire-frame models using 3-D graphics software. Wire-frame models were acquired commercially, donated by 3-D graphic artists, or developed in-house. Each model depicted a typical human-scaled environment (e.g., office or patio). To create each initial scene image, a target object was chosen within the model, and the scene was rendered so that this target object did not coincide with the initial experimenter-determined fixation position. To create the type-change scene images, the scene was re-rendered after the target object had been replaced by another object from a different basic-level category. To create the token-change condition, the scene was re-rendered after the target object had been replaced by another object from the same basic-level category. In the changed scenes, the 3-D graphics software automatically filled in contours that had been occluded prior to the change and corrected the lighting of the scene. All scene images subtended 15.8° × 11.9° visual angle at a viewing distance of 1.13 m. Target objects subtended 2.41° on average along the longest dimension in the picture plane. The objects used for type and token changes were chosen to be approximately the same size as the initial target object in each scene. The full set of scene stimuli is listed in the Appendix.
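As a point of reference for these dimensions, visual angle relates physical size s and viewing distance d by theta = 2*arctan(s / 2d). The snippet below is our addition for illustration; the on-screen sizes it prints are back-calculated from the reported angles and the 1.13-m viewing distance, not measurements reported by the authors.

```python
import math

def visual_angle_deg(size_m: float, distance_m: float) -> float:
    """Angle subtended by an object of a given size at a given distance."""
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))

def size_from_angle(angle_deg: float, distance_m: float) -> float:
    """Inverse: physical size needed to subtend a given angle."""
    return 2 * distance_m * math.tan(math.radians(angle_deg) / 2)

# Implied display size for a 15.8 x 11.9 deg image viewed at 1.13 m:
width_m = size_from_angle(15.8, 1.13)   # ~0.31 m
height_m = size_from_angle(11.9, 1.13)  # ~0.24 m
print(f"{width_m:.2f} m x {height_m:.2f} m")
```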

Apparatus. The stimuli were displayed at a resolution of 800 × 600 pixels × 15-bit color. The display monitor refresh rate was set at 144 Hz. The room was dimly illuminated by an indirect, low-intensity light source. Eye movements were monitored using a dual-Purkinje-image eyetracker (Generation 5.5, Stanford Research Institute; for more information, see Crane & Steele, 1985). A bite bar and forehead rest were used to maintain the participant's viewing position. The position of the right eye was tracked, though viewing was binocular. Eye position was sampled at a rate of better than 1000 Hz. Button presses were collected using a button panel connected to a dedicated input–output (I/O) card. The eyetracker, display monitor, and I/O card were interfaced with a 90-MHz Pentium-based microcomputer. The computer controlled the experiment and maintained a complete record of time and eye position values over the course of each trial.

Procedure. On arriving for the experimental session, participants were given a written description of the experiment along with a set of instructions. The description informed participants that their eye movements would be monitored while they viewed images of real-world scenes on a computer monitor. Participants were informed that they would view each image to prepare for a memory test on which they would have to "distinguish the original scenes from new versions of the scenes that may differ in only a small detail of a single object." In addition to the memory test instruction, participants were instructed to monitor each scene for object changes during study and to press a button immediately after detecting a change. The two types of possible changes were demonstrated using a sample scene. These instructions were the same as in Henderson and Hollingworth (1999b) and similar to instructions in other studies demonstrating transsaccadic change blindness (e.g., Grimes, 1996). Following review of the instructions, the experimenter calibrated the eyetracker by having participants fixate four markers at the centers of the top, bottom, left, and right sides of the display. Calibration was considered accurate if the computer's estimate of the current fixation position was within ±5 min of arc of each marker. The participant then completed the experimental session. Calibration was checked every three or four trials, and the eyetracker was recalibrated when necessary. To begin each trial, the participant fixated a central box on a fixation screen. The experimenter then initiated the trial.

Scene changes were initiated on the basis of eye position, as illustrated in Figure 2. In the change-after-fixation condition, an invisible region was initially activated surrounding the target object (Region A in Figure 2). This region was 0.36° larger on each side than the smallest rectangle enclosing the target object. When the eyes had dwelled within the target region continuously for at least 90 ms, the computer activated a change-triggering region surrounding a different object on the opposite side of the scene (Region B in Figure 2). The center of this region was on average 11.0° from the center of the target region. When the eyes crossed the boundary of the change-triggering region, the change was initiated. At a refresh rate of 144 Hz, the change was completed in a maximum of 14 ms. In the control condition, the procedure was identical except that the initial scene was replaced by an identical scene image as the eyes crossed the boundary of the change-triggering region. The procedure in the change-before-fixation condition was slightly different. At the beginning of the trial, an initial 4.9° (horizontal) × 3.9° (vertical) region was activated at the center of the screen (Region C in Figure 2). The participant's initial fixation on the scene fell within this region. The change-triggering region was activated as the eyes left the central region, and as in the other conditions, the change was initiated as the eyes crossed the boundary of the change-triggering region.
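For concreteness, the gaze-contingent triggering logic just described can be restated as a simple state machine. The sketch below is an illustrative reconstruction, not the authors' experimental software; the Region class, the sample format, and the function name are all assumptions.

```python
# Hypothetical sketch of the change-after-fixation triggering logic:
# arm Region B only after a continuous 90-ms dwell in Region A, then
# initiate the change when the eyes cross Region B's boundary.

from dataclasses import dataclass

@dataclass
class Region:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom

def change_after_fixation_trigger(samples, region_a: Region, region_b: Region,
                                  dwell_ms: float = 90.0):
    """samples: iterable of (t_ms, x, y) eye-position samples (~1000 Hz).
    Returns the time at which the change should be initiated, or None."""
    dwell_start = None   # time the eyes entered Region A
    b_armed = False      # Region B becomes active only after the dwell

    for t, x, y in samples:
        if not b_armed:
            if region_a.contains(x, y):
                if dwell_start is None:
                    dwell_start = t
                elif t - dwell_start >= dwell_ms:
                    b_armed = True      # continuous 90-ms dwell satisfied
            else:
                dwell_start = None      # dwell must be continuous
        elif region_b.contains(x, y):
            return t                    # boundary crossed: initiate change
    return None
```

In the change-before-fixation condition, the same boundary test would apply, but the change-triggering region would be armed by the eyes leaving the central region rather than by the dwell criterion.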

In the experimental session, each participant saw all 36 scenes. Six scenes appeared in the change-after-fixation condition, 18 in the change-before-fixation condition, and 12 in the control condition. The large number of change-before-fixation trials was included because in that condition, sometimes the target object would be fixated between the point that the eyes left the central region and the point when they crossed the change-triggering boundary. Trials on which this occurred were recoded as change-after-fixation trials. In each of the change conditions, the trials were evenly divided between type-change trials and token-change trials. Across the twelve participants, each scene appeared in each condition an equal number of times. Each scene was displayed for 20 s, and the order of image presentation was randomly determined for each participant. The study session lasted approximately 20 min.

After all 36 scenes had been viewed, the LTM test was administered. There was a delay of approximately 5 min between the study and test sessions, during which the experimenter reviewed the memory test instructions and demonstrated the paradigm using a sample scene. Thus, the retention interval for scenes varied from a minimum of about 5 min to a maximum of about 30 min. Memory was tested for the twelve scenes appearing in the control condition. Participants saw two versions of each scene sequentially: the studied scene, and a distractor scene that was identical to the studied scene except for the target object. In the type-discrimination condition, the distractor object was from a different basic-level category (identical to the changed target in the type-change condition); in the token-discrimination condition, the distractor target object was from the same basic-level category (identical to the changed target in the token-change condition). To ensure that participants based their decision on target object information, the target was marked with a small green arrow in both the studied and distractor scenes. Each version was presented for 8 s with a 1-s interstimulus interval. The order of presentation was counterbalanced. Participants were instructed to view each scene and then press one of two buttons to indicate whether the first or second version was identical to the scene studied earlier. Across participants, each scene item appeared in the type- and token-discrimination conditions an equal number of times.
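As a minimal sketch of how responses on this two-alternative test are scored (the field names are hypothetical; the original software is not described at this level of detail):

```python
# Score the forced-choice LTM test: a response is correct when it
# matches the counterbalanced position of the studied version.
# Each trial dict has 'studied_position' and 'response', both
# 'first' or 'second' (hypothetical field names).

def percent_correct(trials):
    correct = sum(t['response'] == t['studied_position'] for t in trials)
    return 100.0 * correct / len(trials)
```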

Results

Online change detection performance. Eye movement data files consisted of time and position values for each eyetracker sample. Saccades were defined as changes in eye position greater than 8 pixels (about 8.8 min of arc) in 15 ms or less. Samples that did not fall within a saccade were considered part of a fixation. The position of each fixation was calculated as the mean of the position samples (weighted by the duration of time at each position) that fell between consecutive saccades (see Henderson, McClure, Pierce, & Schrock, 1997). Fixation duration was calculated as the elapsed time between consecutive saccades. Fixations less than 90 ms and greater than 2,000 ms were eliminated as outliers. Trials were eliminated if the eyetracker lost track of eye position prior to the change or if the change was not completed before the beginning of the next fixation on the scene. Eliminated trials accounted for 2.1% of the data. In addition, in the change-before-fixation condition, the target object was fixated before the change on 57.0% of the trials. These were recoded as change-after-fixation trials.
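These parsing rules (saccade = displacement of more than 8 pixels within 15 ms; fixation position = duration-weighted mean of the intervening samples; 90–2,000-ms outlier filter) can be restated as a short algorithm. The following is a simplified reconstruction under those stated criteria, not the authors' analysis code.

```python
import numpy as np

def parse_fixations(t_ms, x, y, sacc_px=8.0, sacc_win_ms=15.0,
                    min_fix_ms=90.0, max_fix_ms=2000.0):
    """Classify samples as saccade vs. fixation using the displacement
    criterion described above, then aggregate the fixation samples that
    fall between consecutive saccades. t_ms, x, y: equal-length 1-D arrays.
    Returns a list of (x, y, duration_ms) fixations."""
    t_ms, x, y = map(np.asarray, (t_ms, x, y))
    n = len(t_ms)
    in_saccade = np.zeros(n, dtype=bool)

    # A sample is saccadic if the eye moved more than sacc_px relative
    # to any sample up to sacc_win_ms earlier.
    j = 0
    for i in range(n):
        while t_ms[i] - t_ms[j] > sacc_win_ms:
            j += 1
        disp = np.hypot(x[j:i] - x[i], y[j:i] - y[i])
        if disp.size and disp.max() > sacc_px:
            in_saccade[i] = True

    # Group consecutive non-saccade samples into fixations.
    fixations = []
    start = None
    for i in range(n + 1):
        if i < n and not in_saccade[i]:
            if start is None:
                start = i
        elif start is not None:
            idx = slice(start, i)
            dur = t_ms[i - 1] - t_ms[start]
            if min_fix_ms <= dur <= max_fix_ms:   # outlier filter
                # Duration-weighted mean position of the fixation samples.
                w = np.gradient(t_ms[idx]) if i - start > 1 else None
                fixations.append((np.average(x[idx], weights=w),
                                  np.average(y[idx], weights=w),
                                  dur))
            start = None
    return fixations
```

Note that, as stated above, 8 pixels corresponded to roughly 8.8 min of arc at this viewing geometry, i.e., about 1.1 min of arc per pixel.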

Mean percentage correct detection data are reported in Figure 3. When a change occurred after target fixation, we observed 51.1% correct type-change detection and 28.4% correct token-change detection, which were reliably different, F(1, 11) = 8.66, MSE = 357.2, p < .05. Performance in each of these conditions was reliably higher than the false alarm rate of 9.1% in the no-change control condition: type change versus false alarms, F(1, 11) = 85.63, MSE = 123.7, p < .001; token change versus false alarms, F(1, 11) = 7.47, MSE = 299.5, p < .05. When the change occurred before target fixation, we observed 8.8% correct detection in the type-change condition and 4.7% correct detection in the token-change condition, which did not differ (F < 1). Performance in the change-before-fixation condition did not differ from the false alarm rate: type change versus false alarms, F < 1; token change versus false alarms, F(1, 11) = 1.32, MSE = 85.2, p = .28. Finally, comparison of the change-after-fixation condition with the change-before-fixation condition demonstrated that performance was reliably higher in the former compared with the latter, both for type changes, F(1, 11) = 32.72, MSE = 327.0, p < .001, and token changes, F(1, 11) = 8.11, MSE = 413.2, p < .05.

Figure 2. Sample scene (with contrast reduced) illustrating the software regions used to control scene changes in Experiment 1. Participants began by fixating the center of the screen. In the change-after-fixation condition, the computer waited until the eyes had dwelled in the target object region (A) for at least 90 ms. Then, the change-triggering region (B) was activated, and as the eye crossed the boundary to this region, the change was initiated. In the change-before-fixation condition, the computer waited until the eyes left the central region (C) before activating the change-triggering region (B), and the change was initiated as the eyes crossed the change-triggering boundary. The regions depicted here were not visible to the participants.

One potential explanation for poor detection performance in the change-before-fixation condition is that on average, changes occurred earlier in these trials compared with those in the change-after-fixation condition. Figure 4 plots detection performance as a function of the elapsed time to the change, both for the change-before-fixation and change-after-fixation conditions, collapsing across type and token change. There was a positive correlation between change detection and elapsed time to the change in the change-before-fixation condition, which approached reliability, rpb = .21, t(79) = 1.89, p = .065 (see Footnote 3). This marginally positive correlation suggests that there may have been some encoding of target information without direct fixation, but even in the fourth quartile of the elapsed time distribution in this condition, detection performance (13.0%) was not much above the false alarm rate (9.1%). In addition, the elapsed time distributions overlapped for change before and after fixation. In the region of overlap, change detection after fixation on the target object was still clearly higher than when the change occurred before fixation on that object. Finally, there appeared to be little effect of elapsed time to change on detection in the change-after-fixation condition, rpb = .07, t(171) = 0.91, p = .36. Thus, prior fixation of the target object clearly played a significant role in change detection.
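Footnote 3 describes how these point-biserial coefficients were obtained: each trial is an observation, the dichotomous detection variable is regressed on the predictor of interest, and participant is entered as a dummy-coded categorical factor. A sketch of an equivalent modern computation follows; statsmodels and the column names are assumptions, not the authors' software.

```python
# Regress detection (0/1) on a predictor of interest, removing
# between-participant differences in means via a categorical
# (dummy-coded) participant factor, as described in Footnote 3.

import pandas as pd
import statsmodels.formula.api as smf

def detection_regression(df: pd.DataFrame):
    """df columns (hypothetical names): 'detected' (0/1), 'elapsed_ms'
    (predictor of interest), 'participant' (ID)."""
    model = smf.ols('detected ~ elapsed_ms + C(participant)', data=df).fit()
    # The t test on the elapsed_ms slope corresponds to the test of the
    # point-biserial correlation reported in the text.
    return (model.params['elapsed_ms'],
            model.tvalues['elapsed_ms'],
            model.pvalues['elapsed_ms'])
```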

Further evidence that target fixation plays a significant role in subsequent change detection comes from an analysis of fixation time on the target object prior to the change. In the change-after-fixation condition, mean total time fixating the target object prior to the change was 568 ms in the type-change condition and 622 ms in the token-change condition. Figure 5 plots detection performance as a function of fixation time on the target prior to the change. The correlation between fixation time and detection performance was reliable in the token-change condition, rpb = .35, t(76) = 3.25, p < .005, and approached reliability in the type-change condition, rpb = .18, t(83) = 1.64, p = .10. Thus, at least for token changes, detection depended not only on whether the target was fixated prior to the change but also on the length of time the target was fixated.

The ability of participants to detect changes in this experiment, particularly token changes, is inconsistent with coherence theory, as the target object was not attended when the change occurred. However, object file theory could account for the change detection results if changes occurred soon enough after the object had been attended that the relevant object file had not been replaced by subsequent encoding. Thus, we examined detection performance in the change-after-fixation condition as a function of the number of fixations that intervened between the last exit of the eyes from the target region prior to the change and the change itself. There was an average of 4.7 fixations between the last exit from the target region and the change. Figure 6 plots detection performance as a function of the number of intervening fixations. Zero intervening fixations indicates that the saccade leaving the target object region crossed the boundary of the change-triggering region, triggering the change. However, contrary to the object file theory prediction, there was no evidence of decreasing detection performance with an increasing number of intervening fixations: type change, rpb = -.07, t(83) = -0.61, p = .54; token change, rpb = -.05, t(76) = -0.48, p = .64.

Footnote 3: In this and subsequent regression analyses, we regressed a predictor variable of interest (such as elapsed time to change) against the dichotomous detection variable, yielding a point-biserial coefficient. Each trial was treated as an observation. Because each participant contributed more than one sample to the analysis, variation due to differences in participant means was removed by including participant as a categorical factor (implemented as a dummy variable) in the model.

Figure 3. Experiment 1: Mean percentage correct change detection for each change condition and mean false alarms for the no-change control condition. Error bars are 95% confidence intervals, based on the error term for the interaction between change condition (token or type) and eye position (change before or after fixation).

Figure 4. Experiment 1: Mean percentage correct change detection as a function of the elapsed time from the beginning of the trial to the change, for the change-before-fixation and change-after-fixation conditions (collapsing across type and token changes). In each condition, the mean of each elapsed time quartile is plotted against mean percentage detections in that quartile.

For correct detections in the change-after-fixation condition, we examined the position of the eyes when the change was detected. The vast majority of detections came on refixation of the target object. On 93.2% of the trials, the detection button was pressed when the participant was refixating the target object after the change or within one eye movement after refixation. In addition, these detections tended to occur quite a long time after the change occurred. Mean detection latency in the change-after-fixation condition was 5.7 s (see Footnote 4).

We were also interested in whether there might be effects of change not reflected in the explicit detection measure. For trials on which a change was not explicitly detected, we examined gaze duration (the sum of all fixation durations on an object region, from entry to exit from that region) on the target object for the first entry after the change. Miss trials in the change-after-fixation condition were compared with the equivalent entry in the no-change control condition. There was no difference between mean gaze duration on the changed object for miss trials in the type-change condition (477 ms) and the no-change control (479 ms; F < 1). For token changes, there was a trend toward elevated gaze duration for miss trials compared with the no-change control, with mean gaze duration of 649 ms for token-change misses versus 479 ms in the no-change control, F(1, 11) = 2.40, MSE = 72,263, p = .15.
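The gaze-duration measure defined above can be computed directly from the parsed fixation sequence. A minimal sketch, assuming fixations ordered in time and a region object with a contains(x, y) test (as in the earlier sketch); the names are illustrative.

```python
# First-entry gaze duration: sum fixation durations from the eyes'
# first entry into the target region after the change until they exit.

def first_entry_gaze_duration(fixations, region, change_t):
    """fixations: ordered (t_start_ms, x, y, dur_ms) tuples; region has
    contains(x, y); change_t: time of the scene change in ms.
    Returns the summed duration for the first entry after the change,
    or None if the region was never re-entered."""
    gaze, entered = 0.0, False
    for t, x, y, dur in fixations:
        if t < change_t:
            continue
        if region.contains(x, y):
            entered = True
            gaze += dur
        elif entered:       # eyes have exited the region: entry complete
            break
    return gaze if entered else None
```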

Long-term memory performance. Mean percentage correct for the forced-choice memory test was calculated for the type-discrimination and token-discrimination conditions. Contrary to the predictions of both coherence theory and object file theory, discrimination performance was well above the chance level of 50% correct, both for the type-discrimination condition (93.1%) and the token-discrimination condition (80.6%), which were reliably different, F(1, 11) = 6.05, MSE = 154.5, p < .05.

Discussion

The principal issue in Experiment 1 was whether visual object representations persist after attention is withdrawn from an object, or whether such representations are transient, consistent with recent proposals in the transsaccadic memory and change blindness literatures. The data support the former view. Participants were able to detect both type and token changes when the changed object had been previously fixated and attended but was no longer within the focus of attention when the change occurred. Coherence theory (Rensink, 2000a) would appear unable to account for these data, particularly in the token-change condition, as that theory holds that coherent visual object representations disintegrate as soon as attention is withdrawn. The results are also inconsistent with object file theory (Irwin & Andrews, 1996), because detection often occurred many fixations after the last fixation on the target object, after the object file for the target should have been replaced by subsequent encoding. In addition, detection was significantly delayed after the change, more than 5 s on average, and typically until the target object had been refixated. This result suggests that visual information was often retained for a relatively long period of time and was consulted when focal attention and the eyes were directed back to the changed object. In summary, there appears to be significant accumulation of local scene information across multiple eye fixations on a scene.

Further evidence that visual information accumulates from previously fixated and attended regions of a scene comes from accurate discrimination performance on the LTM test. Discrimination performance in both the type- and token-discrimination conditions was above 80% correct. These results are inconsistent with visual transience hypotheses but correspond with the picture memory literature (Friedman, 1979; Nelson & Loftus, 1980; Parker, 1978). Scene memory is clearly not limited to the gist or layout of the scene, or even to the identities of individual objects, because token-discrimination performance was quite accurate.

Footnote 4: Given the strong relationship between detection and refixation, we examined percentage correct in the change-after-fixation condition, eliminating trials on which the target was not refixated after the change. On only 4.4% of the trials did the participant fail to refixate the changed target object, so detection performance was raised only slightly with their elimination (type change, 53.6% correct; token change, 29.2% correct).

Figure 5. Experiment 1: Mean percentage correct change detection in the change-after-fixation condition, as a function of the total fixation time on the target object prior to the change. In each change type condition, the mean of each fixation time quartile is plotted against mean percentage detections in that quartile.

Figure 6. Experiment 1: Mean percentage correct change detection in the change-after-fixation condition, as a function of the number of intervening fixations between the last exit from the target region prior to the change and the change itself. Zero intervening fixations indicates that the saccade leaving the target object region crossed the change-triggering boundary, triggering the change.

A puzzling issue given these results is why Irwin and Andrews (1996) found little evidence of visual accumulation across multiple eye movements. In that study, two fixations within an array of letters did not produce reliably better partial report performance than one. Although this result is consistent with the object file theory of transsaccadic memory, another aspect of Irwin and Andrews' data was not. Object file theory predicts that information from the most recently attended region of the array should be most often retained, as object files created earlier should be rapidly replaced. However, Irwin and Andrews found that partial report performance was better for array positions near the first saccade target rather than the second, suggesting that visual information from the region of the array attended earlier was preferentially retained over information from the region attended later. This complicates the interpretation of Irwin and Andrews' results considerably. In addition, the many methodological differences between our study and Irwin and Andrews make pinpointing the source of the discrepancy difficult: In Irwin and Andrews, stimuli consisted of letter arrays rather than natural scenes, letters were not directly fixated, and fixation durations and saccade targets were controlled by the experimenter. Whatever the source of the difference, the data from our study demonstrate that for free viewing of natural scenes, type- and token-specific information reliably accumulates from previously attended regions.

Our experiment also sought to shed light on the relationship between fixation position and change detection. The first issue was whether change detection depends on prior fixation of the target object. This was clearly the case, as change detection without prior target fixation was no higher than the false alarm rate. In addition, change detection performance increased with the length of time spent fixating the target prior to the change. Thus, in studies demonstrating change blindness, poor detection performance may have occurred, in part, because target regions were not always fixated prior to the change. The second issue was whether refixation of the target object plays an important role in change detection. The vast majority of detections came on refixation of the changed object, suggesting that refixation may cue the retrieval of stored information about a previously fixated and attended object. In studies demonstrating change blindness, then, poor detection performance could have also occurred because target regions were not always refixated after the change.

However, these potential explanations cannot fully account for change blindness phenomena. In this experiment, even when the target object was fixated before the change and again after the change, detection performance was still only modest, with 53.6% correct for type changes and 29.6% correct for token changes. It is important to note, however, that visual transience theories cannot account for even this modest detection performance when attention has been withdrawn. One reason for an intermediate level of change detection performance may be that the change detection measure itself is not particularly sensitive to the detail of the scene representation. This possibility finds support in evidence from other studies demonstrating effects of change on trials without explicit detection (e.g., Hollingworth et al., 2001). In addition, the finding that forced-choice discrimination performance on the LTM test was apparently superior to performance on the online change detection test suggests the latter may not have reflected in full the information retained from previously attended objects. This issue was addressed in Experiment 3.

Exactly what is the nature of the information supporting detection and discrimination performance in this experiment? The possibility that sensory information was retained from previously fixated and attended objects can be ruled out, as prior research shows that such information is not retained across a single saccadic eye movement. Thus, it is likely that higher-level visual representations, abstracted away from sensory information, are retained across multiple fixations after attention is withdrawn and are ultimately stored in LTM. A large body of research indicates that such visual representations can be retained across eye movements (Carlson-Radvansky, 1999; Carlson-Radvansky & Irwin, 1995; Henderson, 1997; Henderson & Siefert, 1999; Pollatsek et al., 1984; Pollatsek et al., 1990). It is tempting to speculate that the advantage for type-change detection and type discrimination compared with token-change detection and token discrimination indicates that qualitatively different information was used to support performance in each case. For example, it is possible that for type-change detection and type discrimination, both information about the visual form of the object and also basic-level identity codes could have been used. Though plausible, it is difficult to conclude this was the case given that the visual difference between initial and changed objects in the two conditions was not controlled. In general, objects from the same category are more visually similar than objects from different categories. Thus, the possibility that visual information was solely functional in change detection and discrimination cannot be ruled out.

Experiment 2

The purpose of Experiment 2 was to strengthen the evidence that visual representations persist after attention is withdrawn from an object and are ultimately stored into LTM. In Experiment 1, this conclusion depended primarily on evidence from token manipulations. However, it is possible that the representations underlying this performance could have been conceptual in nature rather than visual. For example, if participants were to have encoded object identity at a subordinate category level, an identity code of legal notebook could have been sufficient to discriminate the original target from the changed target (a spiral notebook) in the office scene illustrated in Figure 1. Thus, in Experiment 2, a rotation manipulation was used (see Figure 1). The changed target object was created by rotating the initial target object 90° in depth (Henderson & Hollingworth, 1999b). In this condition, the identity of the target object was not changed at all, yet the visual appearance of the object was modified. If participants are able to successfully detect the rotation of a previously attended object and discriminate between two orientations of the same object on the subsequent LTM test, this would provide strong evidence that specifically visual information had been retained in memory.

For the online change detection task, in addition to the rotation manipulation, the token-change and control trials were retained from Experiment 1. The change-before-fixation condition was eliminated; on all trials, the target object was changed only after it had been directly fixated at least once. Otherwise, Experiment 2 was identical to Experiment 1.

Method

Participants. Twelve Michigan State University undergraduates participated in the experiment for course credit. All participants had normal vision, were naive with respect to the hypotheses under investigation, and had not participated in Experiment 1.

Stimuli. To create the changed images in the rotation condition, the initial scene model was rendered after the target object model had been rotated 90° in depth. In addition, three scenes were modified slightly to accommodate the rotation condition. Two of these were minor modifications to target objects whose original appearances did not change significantly on rotation. The third change was to replace the book target object in a bedroom scene (which did not change much on rotation) with an alarm clock target.

Apparatus and procedure. The apparatus was the same as in Experiment 1. The procedure was the same as in Experiment 1, except that the type-change condition was replaced by a rotation condition. In addition, the change-before-fixation condition was eliminated. Twelve scene items appeared in each of the three change conditions: token change, rotation, and no-change control. The 12 control scenes served as the basis of the LTM test. Six of these scenes appeared in the token-discrimination condition and six in the orientation-discrimination condition. Across participants, each scene item appeared in each condition an equal number of times.

Results

Online change detection performance. Trials were eliminated if the eyetracker lost track of eye position prior to the change or if the change was not completed before the beginning of the next fixation on the scene. These accounted for 5.3% of the data. Fixations shorter than 90 ms or longer than 2,000 ms were eliminated as outliers.

Mean percentage correct detection data are reported in Figure 7. In all change trials, the change was made after the target object had been fixated at least once, equivalent to the change-after-fixation condition of Experiment 1. Detection performance was 26.0% correct in the token-change condition and 29.2% correct in the rotation condition, which did not differ (F < 1). Performance in each of these conditions was reliably higher than the false alarm rate of 4.2% in the no-change control condition: token change versus false alarms, F(1, 11) = 11.32, MSE = 186.7, p < .005; rotation versus false alarms, F(1, 11) = 20.29, MSE = 185.8, p < .005. Replicating Experiment 1, the vast majority of detections came on refixation of the target object (89.2%), and detection was significantly delayed, with mean detection latency of 4.6 s in the token-change condition and 4.5 s in the rotation condition.

As in Experiment 1, detection performance was influenced by the length of time the target object was fixated prior to the change. Mean total time fixating the target object region prior to the change was 768 ms in the token-change condition and 760 ms in the rotation condition. Figure 8 plots detection performance as a function of the length of time spent fixating the target object prior to the change. There was a reliable positive correlation between fixation time and percentage correct detection in both the token-change condition, rpb = .21, t(115) = 2.35, p < .05, and the rotation condition, rpb = .26, t(124) = 2.96, p < .05.

Above-floor change detection for previously attended objects is not consistent with coherence theory. To test object file theory, however, we again examined detection performance in the change conditions as a function of the number of fixations that intervened between the last exit of the eyes from the target region prior to the change and the change itself. There was an average of 4.8 fixations between the last exit from the target region and the change. Figure 9 plots detection performance as a function of the number of intervening fixations. Unlike Experiment 1, however, there was some evidence that detection performance fell with the number of intervening fixations: The rotation condition produced a marginally reliable negative correlation between the number of intervening fixations and detection performance, rpb = -.17, t(124) = -1.87, p = .06, though no such effect was observed in the token-change condition, rpb = -.09, t(115) = -.97, p = .33.

Finally, we examined whether there might be effects of change not reflected in the explicit detection measure. Gaze duration was calculated for the first entry of the eyes into the target region after the change. Miss trials in the change conditions were compared with the equivalent entry in the no-change control condition. For rotations, there was no reliable difference between mean gaze duration for miss trials (586 ms) compared with that for the no-change control (535 ms; F < 1). For token changes, there was again a trend toward elevated gaze duration for miss trials compared with that for the no-change control, with mean gaze duration of 655 ms for token-change misses versus 535 ms for the no-change control, F(1, 11) = 2.48, MSE = 34,378, p = .14. Because these analyses consulted only a subset of the data and had relatively little power, we combined the token-change and control data from Experiments 1 and 2 in a mixed analysis of variance (ANOVA) design, with experiment treated as a between-subjects factor. The combined analysis revealed a reliable 145-ms difference between gaze duration on changed objects for token-change misses (652 ms) compared with that for the no-change control (507 ms), F(1, 22) = 4.71, MSE = 53,321, p < .05. This implicit effect of token change on gaze duration has since been replicated (Hollingworth et al., 2001).
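The combined analysis is a standard two-factor mixed design: condition (token-change miss vs. no-change control) within subjects and experiment between subjects. As a sketch of an equivalent modern computation (pingouin is an assumption, as the original analysis predates it, and the column names are hypothetical):

```python
# Mixed ANOVA: condition within subjects, experiment between subjects.

import pandas as pd
import pingouin as pg

def combined_gaze_anova(df: pd.DataFrame) -> pd.DataFrame:
    """df columns (hypothetical): 'participant', 'experiment' (1 or 2),
    'condition' ('token_miss' or 'control'), and 'gaze_ms' (each row is
    one participant-by-condition mean gaze duration)."""
    return pg.mixed_anova(data=df, dv='gaze_ms', within='condition',
                          subject='participant', between='experiment')
```

The condition main effect in the resulting table corresponds to the 145-ms token-change miss versus control comparison reported above.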

Figure 7. Experiment 2: Mean percentage correct change detection for each change condition and mean false alarms for the no-change control condition. Error bars are 95% confidence intervals, based on the error term of the token–rotation contrast.


One potential alternative explanation for these gaze duration results needs to be examined (see Footnote 5). In Experiments 1 and 2, detection performance was positively correlated with fixation time on the object prior to the change. Thus, detection trials eliminated from the above analysis of postchange gaze duration were more likely to have been trials on which the target had been fixated for a relatively long period of time prior to the change. If one entertains the additional assumption that there may be a baseline tendency for short fixation times on an object to be followed by longer fixation times on that object and vice versa (i.e., a negative correlation between early fixation times and later fixation times), then one might find elevated postchange gaze duration for token-change misses (which occur later in a trial) simply because more trials with longer initial fixation times had been eliminated from the analysis. To test whether there is a baseline tendency for early fixation times to be negatively correlated with later fixation times, we examined the no-change control condition for Experiments 1 and 2. The control condition allowed us to test this assumption independently of object changes. In Experiment 1, there was actually a reliable positive correlation between fixation time on the target prior to the change (a change in this condition was the substitution of an identical image) and gaze duration on the target for the first entry of the eyes after the change, rpb = .22, t(103) = 2.26, p < .05. In Experiment 2, there was no observed correlation between these variables, rpb = .04, t(109) = 0.43, p = .67. The tendency, at least in Experiment 1, for longer initial fixation times to be followed by longer fixation times later in viewing is consistent with the earlier finding that elevated gaze duration on semantically incongruous objects is observed not only for the first entry of the eyes into that object region but also for the second entry (Henderson, Weeks, & Hollingworth, 1999). Thus, the generally positive baseline relationship between initial and later fixation times on an object, combined with the elimination of more trials with longer initial fixation times in the analysis of token-change misses, would serve to underestimate mean postchange gaze duration for miss trials and would work against our finding of a significant implicit effect of token change on gaze duration.

Long-term memory performance. Mean forced-choice discrimination performance was calculated for the token-discrimination and orientation-discrimination conditions. Contrary to the predictions of both coherence theory and object file theory, discrimination performance was well above chance performance of 50% correct, for both the token-discrimination condition (80.6%) and the orientation-discrimination condition (81.9%), which did not differ (F < 1).

Discussion

In Experiment 2, a rotation condition was included in which the visual form, but not the identity, of the target object changed between the initial and changed scene images. Contrary to the prediction derived from coherence theory, participants were able to detect rotations and token changes even though the object was not within the current focus of attention when the change occurred. Participants' ability to detect rotations provides converging evidence that specifically visual, as opposed to conceptual, representations were retained after attention was withdrawn. Unlike in Experiment 1, however, there was some evidence that detection performance fell as a function of the number of intervening fixations between the last exit of the eyes from the target object region and the change, consistent with the prediction of object file theory. This relationship was observed for rotations but not for token changes. Thus, the explicit detection data do not support coherence theory but are consistent, to some degree, with object file theory. The LTM test results, however, support neither the attention hypotheses nor object file theory. Both token- and orientation-discrimination performance was above 80% correct. Thus, although there appeared to be some decay of information relevant to the detection of rotation changes, token- and orientation-specific information was reliably retained in memory long after object file theory predicts such information should have been replaced.

Footnote 5: We thank Dan Simons for suggesting this alternative account.

Figure 8. Experiment 2: Mean percentage correct change detection as a function of the total fixation time on the target object prior to the change. In each change type condition, the mean of each fixation time quintile is plotted against mean percentage detections in that quintile.

Figure 9. Experiment 2: Mean percentage correct change detection as a function of the number of intervening fixations between the last exit from the target region prior to the change and the change itself.


In addition, Experiment 2 provided converging evidence that refixation serves as a strong cue to retrieve stored information from previous fixations. In replication of Experiment 1, change detection was delayed, on average, about 4.5 s after the change and typically until refixation of the changed object. In addition, change detection performance was influenced by the amount of time spent fixating the target prior to the change, supporting the conclusion drawn from Experiment 1 that change detection depends on prior target fixation.

Although rotation detection and orientation-discrimination performance in Experiment 2 cannot be attributed to an abstract coding of object identity, it is still possible that performance was mediated by the maintenance of nonvisual representations. Specifically, participants may have produced an abstract, verbal description of the visual properties of the target object (e.g., yellow, lined, rectangular notebook with writing on the page, a black spine, and oriented so that the longer side is roughly parallel to the nearest edge of the table would describe the notebook in Panel A of Figure 1 fairly well). If this were so, object memory may not have been visual in the sense that it was not based on representations in a visual format (though a verbal description of this sort would still preserve visual content, coding visual properties such as shape or color). Though possible, verbal encoding does not appear to provide a plausible account of performance in Experiments 1 and 2. First, participants could not have known beforehand which features would be critical to differentiating between the original target and the changed target. In addition, token and orientation trials were mixed together, so participants could not have known which type of task they would have to perform when encoding information from the scene. Thus, to support successful performance, and discrimination performance in particular, verbal descriptions would have had to have been quite detailed, encoding enough features from the original target so that a critical feature would happen to be encoded. Second, participants could not have known beforehand which of the objects in the scene was the target. Thus, they would have had to have produced a highly detailed verbal description of each of the objects in the scene. Third, a detailed verbal description must have been produced in a relatively short amount of time. In Experiments 1 and 2, participants fixated the target object for approximately 750 ms prior to the change and for approximately 1,500 ms prior to the memory test. In addition, as described in the results of Experiment 3, participants demonstrated token- and orientation-discrimination performance above 80% correct after having fixated the target object for only 702 ms on average prior to the test. Although a verbal description hypothesis cannot be definitively ruled out (in theory, a verbal description of unlimited specificity could be produced with enough time and enough words), it seems highly unlikely that participants could produce verbal descriptions for each of the objects in a scene, with each description detailed enough to perform accurate token and orientation discrimination, and do this within approximately 700 ms per object.

Experiment 3

Accurate discrimination performance in the LTM tests of Experiments 1 and 2 provides strong evidence that visual scene information is retained in LTM. However, the fact that performance in the online change detection task was approximately 30% correct for token and rotation changes does not allow the very strongest conclusion that the representation formed during online scene perception contains visual information from previously attended objects. One could reasonably argue that accurate LTM performance could not occur unless the information supporting that performance had been present during the online perceptual processing of the scene. In addition, any evidence of above-floor detection performance in the absence of sustained attention is inconsistent with visual transience theories, in general, and with coherence theory, in particular. Nevertheless, it remains the case that modest change detection performance is typically interpreted as evidence for the absence of representation.

There are a number of reasons, however, why online change detection performance may have underestimated the specificity of the scene representation, compared with, in particular, the forced-choice task used in the LTM tests. First, the online change detection task was performed concurrently with the task of studying for the memory test. Thus, change detection may have underestimated the detail of the scene representation, because participants could not devote their full attention to monitoring for object changes. Second, in the forced-choice discrimination test, the target object was specified with a green arrow. Thus, participants could limit retrieval to information about the target object. However, such focused analysis was not possible in the online change detection task, because the target was not specified. Finally, explicit change detection, regardless of other task demands, may not be very sensitive to visual representation (as reviewed above with regard to implicit effects), especially if participants adopt a fairly high criterion for change detection. By forcing participants to make a choice between two alternatives, information unavailable or insufficient for explicit detection could nevertheless influence performance. In support of these points, there is direct evidence from Experiments 1 and 2 that explicit change detection did not reflect the full detail of the scene representation constructed online, as gaze duration on the changed object for token-change miss trials was reliably longer compared with the same entry when no change had occurred.

In Experiment 3, then, we used a forced-choice discrimination procedure to test the representation of previously attended objects during the online perceptual processing of a scene. Figure 10 illustrates the sequence of events in a trial in Experiment 3. As in the change-after-fixation conditions of Experiments 1 and 2, the computer waited until the participant had fixated the target object, at which point a second region was activated around another object in the scene. When the eyes crossed the boundary of this second region, the target object was not changed but instead was masked by a speckled, green rectangular field slightly larger than the object itself. Participants were instructed to fixate this mask and press a button to continue. As in the LTM tests of Experiments 1 and 2, participants were then shown two object alternatives in sequence, one of which was identical to the initial target. The distractor was either a different token (token-discrimination condition) or the same object rotated 90° in depth (orientation-discrimination condition). Participants indicated whether the first or second object alternative was the same as the one initially present in the scene.

This paradigm replicates the encoding conditions of the change detection trials in Experiments 1 and 2, yet uses a forced-choice discrimination procedure similar to that used in the LTM tests. This method should eliminate the factors that could have caused the change detection tasks of Experiments 1 and 2 to underestimate the detail of scene representation. First, the instruction to study for an LTM test was eliminated, so participants had only one task to perform, the discrimination task. Second, the critical object was specified (by the mask), so participants could limit analysis to the target. Third, the potentially more sensitive forced-choice procedure was used. Given the results of the first two experiments, participants should be able to perform this task very accurately (i.e., above 80% correct), which would provide strong evidence that visual information is retained from previously attended objects during online scene perception. In contrast, visual transience hypotheses predict poor discrimination performance. Coherence theory predicts 50% discrimination performance (i.e., chance), because attention is withdrawn from the target object prior to the onset of the mask. Object file theory predicts that discrimination performance should fall to chance as the eyes and attention are oriented to new objects and the object file from the target is replaced.

Figure 10. Sequence of events in an orientation-discrimination trial in Experiment 3. Panel 1: Initial scene image (the regions depicted in the figure were not visible to participants). Participants began by fixating the center of the screen. The computer waited until the eyes had dwelled within the target object region (Region A) for at least 90 ms. Then, a second region (Region B) was activated around a different object in the scene. As the eye crossed the boundary to Region B, the target object was occluded by a salient mask (Panel 2). The mask remained visible until the participant pressed a button to begin the forced-choice test. After a delay of 500 ms, the first target object alternative was displayed for 4 s (Panel 3), followed by the target object mask for 1 s (Panel 4), the second target object alternative for 4 s (Panel 5), and the target object mask (Panel 6), which remained visible until response.

Method

Participants. Twelve Michigan State University undergraduates participated in the experiment for course credit. All participants had normal vision, were naive with respect to the hypotheses under investigation, and had not participated in Experiments 1 or 2.

Stimuli. The stimuli were the same as in Experiment 2, with minor modifications to three of the scene items. In these scenes, a few more objects were added, and the target object was moved closer to the center of the scene. These modifications were part of an effort to improve the scene stimuli and were not related to any experimental manipulation. The green mask in each scene was large enough to occlude not only the target object but also the two potential distractors and the shadows cast by each of these objects. Thus, the mask provided no information useful to performance of the task except to specify the relevant object.

Apparatus. The apparatus was the same as in Experiment 1.

Procedure. Participants were informed that their eye movements would be monitored while they viewed images of real-world scenes on a computer monitor. They were instructed that at some point during the viewing of each scene, a bright green, speckled box would appear, concealing an object in the scene. When they saw the box, they were to look directly at it and press a button to continue. After a brief delay, two objects were displayed in succession at that position, only one of which was identical to the original object. Participants were instructed that after presentation of the two alternatives, they were to press the left-hand button on the button box if the first alternative was identical to the original object, or the right-hand button if the second alternative was identical to the original. The two types of possible distractors were described using a sample scene. Following review of the instructions, the experimenter calibrated the eyetracker as described in Experiment 1.

Each trial began with the participant fixating the center of the screen. The computer waited until the eyes had dwelled in the target object region for at least 90 ms. Then, the second region (the change-triggering region in Experiments 1 and 2) was activated around a different object in the scene. As the eye crossed the boundary to this region, the target object was masked. When the button was pressed to begin the discrimination test, there was a delay of 500 ms, followed by the first object alternative display for 4 s, the target mask for 1 s, the second object alternative for 4 s, and the target mask, which remained visible until response. To avoid exceedingly long trials, if the mask had not appeared by 20 s into viewing, it was displayed regardless of eye position at that point.
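The timing of the test sequence described above can be summarized as a simple schedule (durations are taken from the text; the names and representation are illustrative, not the authors' code):

```python
# Forced-choice test sequence in Experiment 3, after the participant's
# button press on the mask. Durations in ms; None = until response.

TEST_SEQUENCE = [
    ('blank_delay',       500),
    ('alternative_1',    4000),
    ('target_mask',      1000),
    ('alternative_2',    4000),
    ('target_mask_hold', None),   # remains visible until response
]

MASK_TIMEOUT_MS = 20000  # mask forced onscreen if not triggered by 20 s
```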

Participants first completed a practice session of four trials, two in each of the discrimination conditions. Participants then completed the experimental session, in which they viewed all 36 scenes, 18 in each of the discrimination conditions. The original target was the first alternative on half the trials and the second alternative on the other half. The assignment of scene items to conditions was counterbalanced between participant groups. The order of image presentation was determined randomly for each participant. The entire session lasted approximately 20 min.

Results

On 25 trials (5.8%), the test had not been initiated by 20 s into viewing, and the target object was masked at that point. On one of these trials, the participant was fixating the target object when the mask appeared. This trial was eliminated, along with trials on which the target was not fixated for at least 90 ms prior to the onset of the mask. A total of 3.5% of the trials was removed. Eye fixations shorter than 90 ms or longer than 2,500 ms were eliminated as outliers.

Consistent with results from the LTM tests of Experiments 1 and 2, forced-choice discrimination performance was quite accurate, with 86.9% correct in the token-discrimination condition and 81.9% correct in the orientation-discrimination condition. The trend toward superior token-discrimination performance was not reliable, F(1, 11) = 2.54, MSE = 119.8, p = .14. There was, however, a reliable and unanticipated interaction between discrimination condition and the order of target–distractor presentation in the forced-choice test, F(1, 11) = 12.05, MSE = 124.0, p < .01. For token discrimination, there was little difference between the target-first condition (88.8% correct) and the target-second condition (85.1% correct). However, for orientation discrimination, there was a large difference between target first (72.6% correct) and target second (91.2% correct). In the orientation-discrimination condition, participants apparently were biased to respond “second,” but such a bias does not compromise the main finding of accurate performance in both the token- and orientation-discrimination conditions.

In addition, we investigated whether performance was influenced by the length of time spent fixating the target object prior to test. Mean total time fixating the target object prior to test was 725 ms in the token-discrimination condition and 678 ms in the orientation-discrimination condition. These values are roughly equivalent to the amount of time spent fixating the target object prior to the change in Experiments 1 and 2, suggesting that the encoding conditions of the online change detection task were successfully replicated. Figure 11 plots discrimination performance as a function of the length of time spent fixating the target object prior to the test. There was a reliable positive correlation between fixation time and performance in the orientation-discrimination condition, rpb = .15, t(193) = 2.08, p < .05, but no reliable correlation in the token-discrimination condition, rpb = .04, t(196) = 0.60, p = .55.
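The quintile plots reported here (and in Figures 8 and 11) bin trials by prior fixation time and compute mean percentage correct per bin. A small sketch of that binning, assuming pandas and hypothetical column names:

```python
# Bin trials into fixation-time quantiles and summarize accuracy per bin.

import pandas as pd

def quantile_bins(df: pd.DataFrame, q: int = 5) -> pd.DataFrame:
    """df columns: 'fix_time_ms' (prior fixation time on the target)
    and 'correct' (0/1). Returns mean fixation time and percentage
    correct within each of the q quantile bins."""
    df = df.assign(bin=pd.qcut(df['fix_time_ms'], q, labels=False))
    return df.groupby('bin').agg(
        mean_fix_ms=('fix_time_ms', 'mean'),
        pct_correct=('correct', lambda s: 100 * s.mean()),
    )
```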

We also examined discrimination performance as a function of the number of fixations that intervened between the last exit of the eyes from the target region prior to the onset of the mask, and the mask's onset. There was an average of 4.6 fixations between the last exit from the target region and the onset of the mask. Figure 12 plots detection performance as a function of the number of intervening fixations. Contrary to the prediction of object file theory, there was no evidence that discrimination performance fell as the number of intervening fixations increased: token discrimination, rpb = .00, t(196) = -0.05, p = .96; orientation discrimination, rpb = .11, t(193) = 1.59, p = .11. In the token-discrimination condition, when nine or more fixations intervened between last exit and the onset of the mask (range = 9 to 42 fixations; M = 15.3 fixations), performance was 85.3% correct. In the orientation-discrimination condition, when nine or more fixations intervened between last exit and the onset of the mask (range = 9–58 fixations; M = 16.7 fixations), performance was 92.3% correct.

Discussion

Experiment 3 used a forced-choice procedure to test the online representation of previously attended objects in natural scenes. During viewing, after the target object had been fixated, it was masked as the eyes and focal attention were directed to a different object in the scene. Memory for the target object was then tested using a forced-choice procedure. Participants demonstrated accurate token- and orientation-discrimination performance, above 80% correct in each condition, even though the target object was not attended when the test was initiated. This result provides strong evidence against the claim of coherence theory that coherent visual representations disintegrate as soon as attention is withdrawn from an object. If this were the case, then performance on the discrimination task should have been at chance. In addition, these results do not support the object file theory of transsaccadic memory (Irwin, 1992a), as there was no evidence of decreasing discrimination performance with the number of intervening fixations between the last exit of the eyes from the target object region and the onset of the mask. Instead, these data support a view of scene perception in which visual representations accumulate in memory from fixated and attended regions of a scene.

General Discussion

The three experiments reported in this study were designed to investigate the nature of the information retained from previously fixated and attended objects in natural scenes. The principal question was whether visual information is retained from previously attended objects, consistent with evidence from the picture memory literature (e.g., Friedman, 1979; Standing et al., 1970), or whether visual object representations decay rapidly after attention is withdrawn from an object, as proposed by visual transience hypotheses of scene representation (e.g., Irwin & Andrews, 1996; Rensink, 2000a). In Experiment 1, target objects in natural scenes were changed during a saccade to another object in the scene, but only after the target had been fixated directly at least once. The target was replaced with another object from a different basic-level category (type change) or from the same basic-level category (token change). In addition, LTM for target objects in the scenes was tested using a forced-choice procedure. Participants successfully detected both type and token changes on a significant proportion of trials, even though the target object was not attended when the change occurred. In addition, participants accurately discriminated between original targets and distractor objects that differed either at the level of type or token. In Experiment 2, a rotation condition was included as a more stringent test of whether visual representations persist after attention is withdrawn. Participants not only detected the rotation of previously attended target objects, but also accurately discriminated between two orientations of the same object on the LTM test. In Experiment 3, a forced-choice procedure was used to test the online representation of previously attended objects in natural scenes. During scene viewing, participants were asked to discriminate between the original target object and a different-token or different-orientation distractor. Discrimination performance for previously attended objects was quite accurate, above 80% correct.

These results are not consistent with the proposal that visual representation is limited to the currently attended object (O'Regan, 1992; O'Regan et al., 1999; Rensink, 2000a, 2000b; Rensink et al., 1997; Simons & Levin, 1997; Wolfe, 1999). This view predicts that token and rotation changes should not be detected in the absence of attention and, additionally, that forced-choice discrimination should be at chance if attention was not allocated to the critical object when it was masked. The current change detection experiments initiated a change during a saccade to a completely different object in the scene. We have obtained converging data in experiments in which a change was made during a saccade that took the eyes away from the target object after it had been fixated the first time (Henderson & Hollingworth, 1999b; Hollingworth et al., 2001). Because attention precedes the eyes to the next fixation position, the target object was not attended when the change occurred, as in the current study. Yet token and rotation changes were detected at rates significantly above floor. Thus, the proposal that sustained attention to a changing object is necessary to detect that change (Rensink, 2000a, 2000b; Rensink et al., 1997; Simons, 2000; Simons & Levin, 1997) is disconfirmed by this study and by converging data from related studies.

Figure 11. Experiment 3: Mean percentage correct discrimination performance as a function of the total fixation time on the target object prior to test. In each discrimination condition, the mean of each fixation time quintile is plotted against mean percentage correct in that quintile.

Figure 12. Experiment 3: Mean percentage correct discrimination performance as a function of the number of intervening fixations between the last exit from the target region prior to the test and the onset of the target object mask.

In addition, these results are inconsistent with portions of the object file theory of transsaccadic memory (Irwin, 1992a, 1992b; Irwin & Andrews, 1996). This theory holds that three or four object files, which maintain detailed visual information from attended objects, can be retained in VSTM but are quickly replaced as attention and the eyes are directed to new perceptual objects. This view predicts that detection and discrimination performance should fall quickly to zero or chance as the number of intervening fixations increases between the last exit of the eyes from the target object region and the change (in the change detection paradigm) or the onset of the mask (in the forced-choice discrimination paradigm). Although there was a reliable drop in change detection performance as a function of the number of intervening fixations for rotations in Experiment 2, the remaining five analyses showed no such effect. In particular, the more sensitive forced-choice discrimination measure used in Experiment 3 appeared to be entirely independent of the number of intervening fixations. In addition, Irwin’s object file theory cannot account for successful discrimination performance on the LTM tests, as object files could not have been retained in VSTM from study to test. Thus, although there may be some decay of visual information encoded from previously attended objects, visual object representations are nonetheless reliably and stably retained from previously attended objects. However, these results are only inconsistent with the portion of Irwin’s object file theory dealing with the representational fate of previously attended objects. The bulk of the theory, which concerns the retention and integration of information across single eye movements, and particularly from the attended saccade target, is not compromised by the findings of this study. In fact, the model we describe subsequently draws heavily from object file theory yet provides a different account of visual representation after the withdrawal of attention.

The LTM tests provide strong converging evidence that visual object representations are retained after attention is withdrawn and suggest that such representations are quite stable over the course of a 5–30-min retention interval. These results provide a bridge between the literature on LTM for pictures and the literature on scene memory across saccades and other visual disruptions. The LTM data from this study are consistent with prior evidence showing accurate memory for the visual form of whole scenes (Standing et al., 1970) and of individual objects in scenes (Friedman, 1979; Parker, 1978). In addition, the current study provides a stronger test of long-term scene memory compared with previous studies, because the scenes themselves were relatively complex, participants viewed each scene only once, between-item similarity was high for studied scenes, and distractors in the forced-choice discrimination test differed from targets in only the properties of a single object. One of the objectives of this study was to resolve the discrepancy between evidence of excellent picture memory and recent proposals, derived from change detection studies, that visual object representations are transient. The discrepancy appears to be resolved: Visual object representations are reliably and stably retained from previously attended objects during online scene perception and are stored into LTM. Visual object representation is not transient.

In addition to the main question of the representation of previously attended objects, we investigated three secondary questions. First, we sought to determine whether change detection depends on the prior fixation of the target object. This was indeed the case. In Experiment 1, a condition in which the target was changed before it was directly fixated produced detection performance that did not differ from the false alarm rate and was reliably poorer than detection performance when the target had been fixated prior to the change. In addition, a positive relationship between fixation time on the object prior to the change and detection performance was observed in Experiments 1 and 2. Thus, the encoding of scene information appears to be strongly influenced by fixation position, consistent with prior reports (Henderson & Hollingworth, 1999b; Hollingworth et al., 2001; Nelson & Loftus, 1980). Second, we examined the role of refixation of a changed object in the detection of that change. The vast majority of correct detections in Experiments 1 and 2 came on refixation of the changed target. Thus, refixation appears to play an important role in the retrieval of a stored object representation and the comparison of that representation to current perceptual information (see also Henderson & Hollingworth, 1999b; Hollingworth et al., 2001; Parker, 1978). Finally, we were interested in whether explicit change detection performance provides an accurate measure of the detail of the visual scene representation. In Experiments 1 and 2, when a token change was not explicitly detected, gaze duration on the changed object was reliably longer than when no change occurred, an effect that has subsequently been replicated (Hollingworth et al., 2001). Thus, the current data provide evidence that explicit change detection performance underestimates the detail of the visual scene representation.

Together these data provide an explanation for why change blindness may occur, despite strong evidence from this study that visual representations persist after the withdrawal of attention. First, in studies demonstrating change blindness, eye movements have rarely been monitored. Thus, changes may be missed simply because the target object was not fixated prior to the change. If detailed information had not been encoded from a target object, it is hardly surprising that a change to that object would not be detected. Providing further support for this idea, Hollingworth et al. (2001) monitored eye movements during a flicker paradigm (see Rensink et al., 1997) using scenes similar to those in this study. Over 70% of object deletions and over 90% of object rotations were detected only when the changing object was in foveal or near-foveal vision. Second, even if the object representation is detailed enough to discriminate between the initial and changed targets, it may not be reliably retrieved to support change detection. Our results demonstrate that changes are often detected only when the changed object is refixated after the change (see also Henderson & Hollingworth, 1999b; Hollingworth et al., 2001). Again, because most change detection paradigms do not monitor eye movements, changes may be missed because the changed region is not refixated. Finally, even if a target object is fixated before and after the change, changes may go undetected not because the relevant information is absent from the scene representation, but because the explicit detection measure is not sensitive to the presence of that information, as has been amply demonstrated by studies such as this one, showing implicit effects of change (Fernandez-Duque & Thornton, 2000; Hayhoe et al., 1998; Hollingworth et al., 2001; Williams & Simons, 2000). In summary, it has been known since the early 1980s that a global sensory image is not constructed by the visual system and retained across visual disruptions, such as eye movements. However, poor change detection performance does not necessarily indicate the absence of visual representation.

A Descriptive Model of Scene Perception and Memory

If visual object representations are retained in memory after attention is withdrawn from an object, in what type of memory store is this information maintained? Clearly, the LTM tests demonstrate that fairly detailed information is retained in LTM, but what accounts for online change detection performance in Experiments 1 and 2 and online discrimination performance in Experiment 3? Three strands of evidence suggest that performance was, to a large degree, supported by the maintenance of visual object representations in LTM during the online perceptual processing of the scene, rather than in VSTM. First, if current estimates of the capacity of VSTM are correct, it is unlikely that target object information could have been retained in VSTM during the interval between the last fixation on the target object and the change or discrimination test. In Experiment 3, discrimination performance was highly accurate, even when more than nine separate fixations intervened between the last exit and the onset of the target object mask. Second, the similarity between online discrimination (Experiment 3) and long-term discrimination (Experiments 1 and 2) suggests that performance in each was supported by a similar set of processes. Third, Hollingworth et al. (2001) found that online change detection performance was strongly influenced by the semantic consistency between the target object and the scene in which it appeared, a variable known to influence the LTM representation of an object (e.g., Friedman, 1979). It therefore appears that LTM plays an important role in online scene perception (see also Chun & Nakayama, 2000). Given the amount of visual information available for analysis from a natural scene and the length of time that we may be present in the same visual environment, the visual system takes advantage of the capacity of LTM to store potentially relevant information for future analysis, such as the detection of changes to the environment.

The data from this study can be accommodated by the following model. It takes as its foundation current theories of episodic object representation (e.g., Henderson, 1994; Kahneman, Treisman, & Gibbs, 1992), and is broadly consistent with the object file theory of transsaccadic memory (Irwin, 1992a), but proposes a large role for LTM in the online construction of a scene representation. As discussed in the introduction, dynamic scene perception faces two memory problems: (a) the short-term retention and integration of scene information across single saccadic eye movements, particularly from the attended saccade target, and (b) the longer-term retention and potential integration of information from previously fixated and attended objects. The model proposed here is limited in scope to the second issue. In particular, it concerns the nature of the representations produced when attention and the eyes are oriented to an object, the retention of object information when attention and the eyes are withdrawn, the integration of object information within a scene-level representation, and the subsequent retrieval of that information. It rests on the following assumptions.

First, when attention and the eyes are oriented to a local object in a scene, in addition to low-level sensory processing, visual processing leads to the construction of representations at higher levels of analysis. These may include a visual description of the attended object, abstracted from low-level sensory properties, and conceptual representations of object identity and meaning. Higher-level visual representations can code quite detailed information about the visual form of an object, specific to the viewpoint at which the object was observed (Riesenhuber & Poggio, 1999; Tarr, Williams, Hayward, & Gauthier, 1998), and viewpoint-specific object representations can be retained across eye movements (Henderson & Siefert, 1999, in press).

Second, these abstracted representations are indexed to a position in a map coding the spatial layout of the scene, forming an object file (Kahneman & Treisman, 1984; Kahneman et al., 1992). This view of object files (described in detail in Henderson, 1994; Henderson & Anes, 1994; Henderson & Siefert, 1999) differs from earlier proposals (e.g., Kahneman et al., 1992) in that object files preserve abstracted visual representations rather than sensory information and also support the short-term retention of conceptual codes. Thus, object files instantiate not only VSTM but also conceptual short-term memory (CSTM; see Potter, 1999).
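The model is stated verbally; as a purely illustrative sketch, the first two assumptions might be rendered in code as follows (all names here are hypothetical, not part of the original proposal):

    from dataclasses import dataclass, field

    Position = tuple[float, float]  # location within the scene's spatial map

    @dataclass
    class ObjectFile:
        # Abstracted (nonsensory) visual description, specific to the
        # viewpoint at which the object was observed
        visual_code: str
        # Conceptual code: object identity and meaning
        conceptual_code: str

    @dataclass
    class SceneMap:
        # Spatial layout of the scene; object files are indexed by position
        files: dict[Position, ObjectFile] = field(default_factory=dict)

        def index(self, pos: Position, obj: ObjectFile) -> None:
            self.files[pos] = obj

The key design point the sketch captures is that the file binds abstracted visual and conceptual codes to a spatial index rather than storing a sensory image.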

Third, processing of abstracted visual and conceptual representations in short-term memory and the indexing of these codes to a particular spatial position leads to their consolidation in LTM. The LTM codes for an object are likewise indexed to the spatial position in the scene map from which the object information was encoded, forming what we term a long-term memory object file.

Fourth, when attention is withdrawn from an object, the short-term memory representations decay quite rapidly, leaving only the spatially indexed, long-term memory object files, which are relatively stable. Whether short-term memory decay is immediate or whether short-term memory information persists until replaced by subsequent encoding is not central to our proposal. However, the fact that changes to objects on the saccade away from that object are often detected immediately (Henderson & Hollingworth, 1999b; Hollingworth et al., 2001) suggests that visual object representations can be retained in VSTM at least briefly after attention is withdrawn from an object, consistent with Irwin’s (1992a) view of VSTM.

Thus, over multiple fixations on a scene, local object information accumulates in LTM from previously fixated and attended regions and is indexed within the scene map, forming a detailed representation of the scene as a whole (though clearly less detailed than a sensory image, as the visual representations stored from local regions are abstracted away from sensory properties such as precise metric organization). In contrast to high-capacity LTM storage, only a small portion of the visual information in a scene is actively maintained in short-term stores, and the moment-by-moment content of VSTM and CSTM is dictated by the allocation of attention.
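This accumulation dynamic can be illustrated with a toy simulation; the capacity limit of four files and the oldest-first displacement rule are simplifying assumptions of the sketch, chosen to be consistent with the three-or-four object-file estimates discussed above:

    from collections import OrderedDict

    VSTM_CAPACITY = 4  # assumed, per the three-or-four object-file estimates

    def view_scene(fixations, encode):
        # fixations: sequence of scene-map positions fixated in order.
        # encode(pos): stands in for perceptual encoding at a fixation
        # and returns an ObjectFile for that position.
        vstm = OrderedDict()  # transient store, displaced by new encoding
        ltm = {}              # spatially indexed long-term memory object files
        for pos in fixations:
            obj = encode(pos)
            vstm[pos] = obj
            vstm.move_to_end(pos)
            if len(vstm) > VSTM_CAPACITY:
                vstm.popitem(last=False)  # oldest file displaced
            ltm[pos] = obj  # consolidation: the LTM file survives displacement
        return vstm, ltm

After many fixations, vstm holds only the most recent few objects while ltm holds a file for every attended position, which is the pattern the Experiment 3 data suggest: discrimination performance did not decline as intervening fixations accumulated.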

Fifth, the retrieval of LTM codes for previously attended objects and the comparison of this information with current perceptual representations is strongly influenced by the allocation of visual attention and thus by fixation position. Access to the contents of an object file in VSTM is proposed to be dependent on attending to the spatial position at which the file is indexed, a proposal that is supported by evidence that preview effects are mediated by spatiotemporal continuity (Kahneman et al., 1992). Evidence for spatially mediated LTM retrieval in this study comes from the fact that changes were detected on refixation of the target object. In addition, fixating the changed object led to change detection in the type- and token-change conditions, even though the original object was no longer present and could not act as a retrieval cue. Moreover, in Henderson and Hollingworth (1999b), object deletions were sometimes detected only when the participant fixated the spatial position in the scene where the object had originally appeared. Clearly, the original object could not serve as a retrieval cue in this paradigm, as it had been deleted, suggesting that attending to the original spatial position of the target led to the retrieval of its long-term memory object file and subsequent change detection.
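The spatially cued retrieval-and-comparison step might be sketched as follows (again hypothetical code; the essential point is that the stored file is accessed via the fixated position rather than via the original object, so even a deleted object can trigger detection):

    def detect_change(pos, current_visual_code, ltm):
        # Compare the current percept at a fixated position with the LTM
        # object file indexed there. Retrieval is cued by position alone.
        stored = ltm.get(pos)
        if stored is None:
            return False  # nothing was encoded here; no basis for detection
        if current_visual_code is None:
            return True   # object deleted: the stored file alone signals change
        return stored.visual_code != current_visual_code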

Sixth, the retrieval from LTM of higher-level visual codes specific to the viewed orientation of a previously attended object accounts for participants’ ability to detect token and rotation changes and to perform accurately on token- and orientation-discrimination tests.

Finally, when the scene is removed, the LTM representation consists of the scene map with indexed local object codes. During subsequent perceptual episodes with the scene, the scene map is retrieved, and local object information can be retrieved by attending to the position in the scene at which information about that object was originally encoded, leading to successful performance on the LTM tests. How the correct scene map is selected is an interesting question, the answer to which lies beyond the scope of this model.

In summary, the model holds that a relatively detailed representation of a scene is constructed as the eyes and attention are directed to multiple local regions. In addition, encoding into and retrieval from this representation are controlled by the allocation of visual attention and thus by fixation position, given the tight coupling between attention and the eyes during normal scene viewing. The principal difference between this model of scene perception and visual transience hypotheses is the proposal that visual representations persist after attention is withdrawn, are stored in LTM, and form the basis of a fairly detailed scene-level representation. However, our model is consistent with the proposal of visual transience hypotheses that object representations in VSTM decay quickly once attention is withdrawn. In fact, this model is consistent with object file theory except for an additional form of representation, long-term memory object files, as object file theory has no mechanism for long-term storage. This additional representation, however, has significant implications for the nature of the representation constructed from a scene. Thus, our model describes a means by which relevant visual information can be stored and retrieved to support such processes as perceptual comparison, motor interaction, navigation, or scene recognition, while retaining the view that active visual representation is essentially local and transient, governed by the allocation of attention.

References

Archambault, A., O’Donnell, C., & Schyns, P. G. (1999). Blind to object changes: When learning the same object at different levels of categorization modifies its perception. Psychological Science, 10, 249–255.

Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. (1997). Deictic codes for the embodiment of cognition. Behavioral & Brain Sciences, 20, 723–767.

Breitmeyer, B. G., Kropfl, W., & Julesz, B. (1982). The existence and role of retinotopic and spatiotopic forms of visual persistence. Acta Psychologica, 52, 175–196.

Bridgeman, B., & Mayer, M. (1983). Failure to integrate visual information from successive fixations. Bulletin of the Psychonomic Society, 21, 285–286.

Bülthoff, H. H., Edelman, S. Y., & Tarr, M. J. (1995). How are three-dimensional objects represented in the brain? Cerebral Cortex, 3, 247–260.

Carlson-Radvansky, L. A. (1999). Memory for relational information across eye movements. Perception & Psychophysics, 61, 919–934.

Carlson-Radvansky, L. A., & Irwin, D. E. (1995). Memory for structural information across eye movements. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 1441–1458.

Chun, M. M., & Nakayama, K. (2000). On the functional role of implicit visual memory for the adaptive deployment of attention across views. Visual Cognition, 7, 65–81.

Crane, H. D., & Steele, C. M. (1985). Generation-V dual-Purkinje-image eyetracker. Applied Optics, 24, 527–537.

Currie, C., McConkie, G., Carlson-Radvansky, L. A., & Irwin, D. E. (2000). The role of the saccade target object in the perception of a visually stable world. Perception & Psychophysics, 62, 673–683.

Davidson, M. L., Fox, M. J., & Dick, A. O. (1973). Effect of eye movements on backward masking and perceived location. Perception & Psychophysics, 14, 110–116.

Deubel, H., & Schneider, W. X. (1996). Saccade target selection and object recognition: Evidence for a common attentional mechanism. Vision Research, 36, 1827–1837.

Di Lollo, V. (1980). Temporal integration in visual memory. Journal of Experimental Psychology: General, 109, 75–97.

Feldman, J. A. (1985). Four frames suffice: A provisional model of vision and space. Behavioral & Brain Sciences, 8, 265–289.

Fernandez-Duque, D., & Thornton, I. M. (2000). Change detection without awareness: Do explicit reports underestimate the representation of change in the visual system? Visual Cognition, 7, 324–344.

Friedman, A. (1979). Framing pictures: The role of knowledge in automatized encoding and memory for gist. Journal of Experimental Psychology: General, 108, 316–355.

Grimes, J. (1996). On the failure to detect changes in scenes across saccades. In K. Akins (Ed.), Perception: Vancouver studies in cognitive science (Vol. 5, pp. 89–110). Oxford, England: Oxford University Press.

Hayhoe, M. M. (2000). Vision using routines: A functional account of vision. Visual Cognition, 7, 43–64.

Hayhoe, M. M., Bensinger, D. G., & Ballard, D. H. (1998). Task constraints in visual working memory. Vision Research, 38, 125–137.

Henderson, J. M. (1994). Two representational systems in dynamic visual identification. Journal of Experimental Psychology: General, 123, 410–426.

Henderson, J. M. (1997). Transsaccadic memory and integration during real-world object perception. Psychological Science, 8, 51–55.

Henderson, J. M., & Anes, M. D. (1994). Effects of object-file review and type priming on visual identification within and across eye fixations. Journal of Experimental Psychology: Human Perception and Performance, 20, 826–839.

Henderson, J. M., & Hollingworth, A. (1998). Eye movements during scene viewing: An overview. In G. Underwood (Ed.), Eye guidance in reading and scene perception (pp. 269–283). Oxford, England: Elsevier.

Henderson, J. M., & Hollingworth, A. (1999a). High-level scene perception. Annual Review of Psychology, 50, 243–271.

Henderson, J. M., & Hollingworth, A. (1999b). The role of fixation position in detecting scene changes across saccades. Psychological Science, 10, 438–443.

Henderson, J. M., & Hollingworth, A. (in press). Eye movements, visual memory, and scene representation. In M. A. Peterson & G. Rhodes (Eds.), Analytic and holistic processes in the perception of faces, objects, and scenes. New York: JAI/Ablex.

Henderson, J. M., Hollingworth, A., & Subramanian, A. N. (1999, November). The retention and integration of scene information across saccades: A global change blindness effect. Paper presented at the annual meeting of the Psychonomic Society, Los Angeles, CA.

Henderson, J. M., McClure, K., Pierce, S., & Schrock, G. (1997). Object identification without foveal vision: Evidence from an artificial scotoma paradigm. Perception & Psychophysics, 59, 323–346.

Henderson, J. M., Pollatsek, A., & Rayner, K. (1989). Covert visual attention and extrafoveal information use during object identification. Perception & Psychophysics, 45, 196–208.

Henderson, J. M., & Siefert, A. B. (1999). The influence of enantiomorphic transformation on transsaccadic object integration. Journal of Experimental Psychology: Human Perception and Performance, 25, 243–255.

Henderson, J. M., & Siefert, A. B. C. (in press). Types and tokens in transsaccadic object integration. Psychonomic Bulletin & Review.

Henderson, J. M., Weeks, P. A., Jr., & Hollingworth, A. (1999). The effects of semantic consistency on eye movements during complex scene viewing. Journal of Experimental Psychology: Human Perception and Performance, 25, 210–228.

Hoffman, J. E., & Subramaniam, B. (1995). The role of visual attention in saccadic eye movements. Perception & Psychophysics, 57, 787–795.

Hollingworth, A., Schrock, G., & Henderson, J. M. (2001). Change detection in the flicker paradigm: The role of fixation position within the scene. Memory & Cognition, 29, 296–304.

Hollingworth, A., Williams, C. C., & Henderson, J. M. (2001). To see and remember: Visually specific information is retained in memory from previously attended objects in natural scenes. Psychonomic Bulletin & Review, 8, 761–768.

Irwin, D. E. (1991). Information integration across saccadic eye movements. Cognitive Psychology, 23, 420–456.

Irwin, D. E. (1992a). Memory for position and identity across eye movements. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 307–317.

Irwin, D. E. (1992b). Visual memory within and across fixations. In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading (pp. 146–165). New York: Springer-Verlag.

Irwin, D. E., & Andrews, R. (1996). Integration and accumulation of information across saccadic eye movements. In T. Inui & J. L. McClelland (Eds.), Attention and performance XVI: Information integration in perception and communication (pp. 125–155). Cambridge, MA: MIT Press.

Irwin, D. E., Yantis, S., & Jonides, J. (1983). Evidence against visual integration across saccadic eye movements. Perception & Psychophysics, 34, 35–46.

Kahneman, D., & Treisman, A. (1984). Changing views of attention and automaticity. In R. Parasuraman & D. Davies (Eds.), Varieties of attention (pp. 29–61). New York: Academic Press.

Kahneman, D., Treisman, A., & Gibbs, B. J. (1992). The reviewing of object files: Object-specific integration of information. Cognitive Psychology, 24, 175–219.

Kowler, E., Anderson, E., Dosher, B., & Blaser, E. (1995). The role of attention in the programming of saccades. Vision Research, 35, 1897–1916.

Levin, D. T., & Simons, D. J. (1997). Failure to detect changes to attended objects in motion pictures. Psychonomic Bulletin & Review, 4, 501–506.

Matin, E. (1974). Saccadic suppression: A review and an analysis. Psychological Bulletin, 81, 899–917.

McConkie, G. W. (1991). Where vision and cognition meet. Paper presented at the Human Frontier Science Program Workshop on Object and Scene Perception, Leuven, Belgium.

McConkie, G. W., & Currie, C. B. (1996). Visual stability across saccades while viewing complex pictures. Journal of Experimental Psychology: Human Perception and Performance, 22, 563–581.

McConkie, G. W., & Rayner, K. (1976). Identifying the span of the effective stimulus in reading: Literature review and theories of reading. In H. Singer & R. B. Ruddell (Eds.), Theoretical models and processes in reading (pp. 137–162). Newark, DE: International Reading Association.

McConkie, G. W., & Zola, D. (1979). Is visual information integrated across successive fixations in reading? Perception & Psychophysics, 25, 221–224.

Nelson, W. W., & Loftus, G. R. (1980). The functional visual field during picture viewing. Journal of Experimental Psychology: Human Learning and Memory, 6, 391–399.

Nickerson, R. S. (1965). Short-term memory for complex meaningful visual configurations: A demonstration of capacity. Canadian Journal of Psychology, 19, 155–160.

O’Regan, J. K. (1992). Solving the “real” mysteries of visual perception: The world as an outside memory. Canadian Journal of Psychology, 46, 461–488.

O’Regan, J. K., Deubel, H., Clark, J. J., & Rensink, R. A. (2000). Picture changes during blinks: Looking without seeing and seeing without looking. Visual Cognition, 7, 191–212.

O’Regan, J. K., & Lévy-Schoen, A. (1983). Integrating visual information from successive fixations: Does trans-saccadic fusion exist? Vision Research, 23, 765–768.

O’Regan, J. K., Rensink, R. A., & Clark, J. J. (1999, March 4). Change-blindness as a result of “mudsplashes.” Nature, 398, 34.

Parker, R. E. (1978). Picture processing during recognition. Journal of Experimental Psychology: Human Perception and Performance, 4, 284–293.

Pollatsek, A., & Rayner, K. (1992). What is integrated across fixations? In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading (pp. 166–191). New York: Springer-Verlag.

Pollatsek, A., Rayner, K., & Collins, W. E. (1984). Integrating pictorial information across eye movements. Journal of Experimental Psychology: General, 113, 426–442.

Pollatsek, A., Rayner, K., & Henderson, J. M. (1990). Role of spatial location in integration of pictorial information across saccades. Journal of Experimental Psychology: Human Perception and Performance, 16, 199–210.

Potter, M. C. (1999). Understanding sentences and scenes: The role of conceptual short-term memory. In V. Coltheart (Ed.), Fleeting memories (pp. 13–46). Cambridge, MA: MIT Press.

Rayner, K., McConkie, G. W., & Ehrlich, S. (1978). Eye movements and integrating information across fixations. Journal of Experimental Psychology: Human Perception and Performance, 4, 529–544.

Rayner, K., & Pollatsek, A. (1983). Is visual information integrated across saccades? Perception & Psychophysics, 34, 39–48.

Rensink, R. A. (2000a). The dynamic representation of scenes. Visual Cognition, 7, 17–42.

Rensink, R. A. (2000b). Seeing, sensing, and scrutinizing. Vision Research, 40, 1469–1487.

Rensink, R. A., O’Regan, J. K., & Clark, J. J. (1997). To see or not to see: The need for attention to perceive changes in scenes. Psychological Science, 8, 368–373.

Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.

Shepard, R. N. (1967). Recognition memory for words, sentences, and pictures. Journal of Verbal Learning and Verbal Behavior, 6, 156–163.

Shepherd, M., Findlay, J. M., & Hockey, R. J. (1986). The relationship between eye movements and spatial attention. Quarterly Journal of Experimental Psychology: Human Experimental Psychology, 38(A), 475–491.


Simons, D. J. (1996). In sight, out of mind: When object representations fail. Psychological Science, 7, 301–305.

Simons, D. J. (2000). Current approaches to change blindness. Visual Cognition, 7, 1–16.

Simons, D. J., & Levin, D. T. (1997). Change blindness. Trends in Cognitive Sciences, 1, 261–267.

Simons, D. J., & Levin, D. T. (1998). Failure to detect changes to people during a real-world interaction. Psychonomic Bulletin & Review, 5, 644–649.

Sperling, G. (1960). The information available in brief visual presentations. Psychological Monographs, 74(11, Whole No. 498).

Standing, L., Conezio, J., & Haber, R. N. (1970). Perception and memory for pictures: Single-trial learning of 2500 visual stimuli. Psychonomic Science, 19, 73–74.

Tarr, M. J., Williams, P., Hayward, W. G., & Gauthier, I. (1998). Three-dimensional object recognition is viewpoint dependent. Nature Neuroscience, 1, 275–277.

Treisman, A. (1988). Features and objects: The fourteenth Bartlett memorial lecture. Quarterly Journal of Experimental Psychology: Human Experimental Psychology, 40, 201–237.

Williams, P., & Simons, D. J. (2000). Detecting changes in novel 3D objects: Effects of change magnitude, spatiotemporal continuity, and stimulus familiarity. Visual Cognition, 7, 297–322.

Wolfe, J. M. (1998). Visual memory: What do you know about what you saw? Current Biology, 8, R303–R304.

Wolfe, J. M. (1999). Inattentional amnesia. In V. Coltheart (Ed.), Fleeting memories (pp. 71–94). Cambridge, MA: MIT Press.

Wolfe, J. M., Klempen, N., & Dahlen, K. (2000). Postattentive vision. Journal of Experimental Psychology: Human Perception and Performance, 26, 693–716.


Received July 24, 2000
Revision received May 10, 2001
Accepted May 11, 2001

Appendix

Scene and Target Object Stimuli

Scene                     Original target object   Type-change target
Art gallery (exterior)    Trash container          Mailbox
Attic A                   Crib                     Crate
Attic B                   Stool                    Filing cabinet
Bar                       Ashtray                  Bowl of nuts
Bathroom A                Hair dryer               Tissue box
Bathroom B                Spray bottle             Shampoo container
Bedroom 1A                Book                     Alarm clock
Bedroom 1B                Lamp                     Flowers in vase
Bedroom 2 (child’s)       Toy truck                Gumball machine
Computer workstation      Pen                      Pencil
Dining room               Candelabra               Flowering plant
Family Room A             Watch                    Coasters
Family Room B             Eyeglasses               Remote control
Family Room C             Briefcase                Wastebasket
Front yard                Watering can             Bucket
Indoor Pool A             Drinking glass           Soda can
Indoor Pool B             Deck chair               Side table
Kitchen 1A                Teapot                   Pot
Kitchen 1B                Coffee maker             Blender
Kitchen 2A                Knife                    Fork
Kitchen 2B                Toaster                  Canister
Kitchen 3                 Coffee cup               Apple
Laboratory A              Microscope               Flask
Laboratory B              Cell phone               Stapler
Laundry room              Iron                     Aerosol can
Living Room 1A            Clock                    Picture in frame
Living Room 1B            Magazine                 Serving tray
Living Room 2             Television               Aquarium
Living Room 3             Chandelier               Ceiling fan
Loft                      Pool table               Piano
Office A                  Notebook                 Computer disk
Office B                  Telephone                Binder
Patio                     Barbeque grill           Trash can
Restaurant                Flower in vase           Candle
Stage                     Guitar                   Audio speaker
Staircase                 Chair                    Fern

Note. A short description of each scene item is listed in the first column. Multiple examples of certain scene types were used. Some of these were created from different 3-D wire frame models and are differentiated by number; some were different views within the same model and are differentiated by letter. The second column lists the original target object in each scene. The third column lists the object substituted for the target in the type-change condition of Experiment 1. Changed targets in the token-change condition were different examples of the same type of object described in the second column.
