

Memory Representations in Natural Tasks

Dana H. Ballard, Mary M. Hayhoe, and Jeff B. Pelz The University of Rochester

Abstract

The very limited capacity of short-term or working memory is one of the most prominent features of human cognition. Most studies have stressed delimiting the upper bounds of this memory in memorization tasks rather than the performance of everyday tasks. We designed a series of experiments to test the use of short-term memory in the course of a natural hand-eye task where subjects have the freedom to choose their own task parameters. In this case subjects choose not to operate at the maximum capacity of short-term memory but instead seek to minimize its use. In particular, the instantaneous memory required to perform the task can be reduced by serializing the task with eye movements. These eye movements allow subjects to postpone the gathering of task-relevant information until just before it is required. The reluctance to use short-term memory can be explained if such memory is expensive to use with respect to the cost of the serializing strategy.

© 1995 Massachusetts Institute of Technology

INTRODUCTION

The very limited capacity of short-term memory is one of the most prominent features of human cognition (Baddeley, 1986; Miller, 1956). Most studies, however, have stressed delimiting the upper bounds of this memory, and we have little understanding of the computational role of these central processing limitations. Insight into its role may be gained from recent research in robot models. It has recently been demonstrated that a range of complex tasks can be efficiently modeled using very limited memory representations. The key aspect of these models is that complex internal representations are avoided by allowing frequent access to the sensory input during the problem-solving process (Brooks, 1986, 1991; Agre & Chapman, 1987; Ballard, 1989, 1991). These models use "deictic" primitives, in which aspects of a scene (e.g., color or shape) are dynamically referred to by indicating that part of the scene with a special marker. The word deictic means "pointing" or "showing" and was first used in this context by Agre and Chapman (1987), building on work by Ullman (1984). It means that aspects of the scene can be selectively referred to by denoting that part of the scene with a special referent or pointer. In contrast, a nondeictic system might remember all the positions and properties of a set of objects in viewer-centered coordinates.

Deictic systems realize representational economies by using the momentary binding between the perceptual-motor system and the world. For example, with vision, this binding can be achieved by actively looking at, or fixating, an environmental point. This ability allows the use of a frame of reference centered at the fixation point. As shown in Figure 1, the "fixation frame" is viewer-oriented, but not viewer-centered.

The fixation frame allows for closed-loop behavioral strategies that require only coarse, relative three-dimensional information. For example, an object can be grasped by first looking at it and then directing the hand to the center of the fixation coordinate frame. In depth the hand can be servoed relative to the horopter by using binocular cues. Successions of deictic primitives can succinctly create complex behaviors, as each primitive implicitly defines the context for its successor, thus allowing imprecise servoing strategies. For example, modeling the behavior of dialing a telephone is simplified by three successive behaviors: one which orients the posture to align the buttons perpendicular to the line of sight, a second to fixate the individual digits, and a third to push the fixated digits in succession. Thus behaviors that would be difficult to program in the general case are simple to program with respect to the fixation point.

The deictic strategy of using the perceptual system to actively control the point of action in the world has precisely the right kind of invariance for learning many behaviors. No complex geometry is necessary, and, as a result, the descriptions transfer well to similar situations that differ in viewpoint. As an example, consider the problem of picking up a green block that has another block stacked on top of it, as shown in Figure 2 from Whitehead and Ballard (1991).

This task can be accomplished by the following program:

Journal of Cognitive Neuroscience 7:1, pp. 66-80


Fixate(Green)
Fixate(Top-of-stack)
Pickup
Fixate(Somewhere-on-the-table)
Putdown
Fixate(Green)
Pickup

In this program it is assumed that the instruction Fixate(image-feature) will orient the center of gaze to point to a place in the image with that feature. These actions are context-sensitive. For example, Fixate(Top-of-stack) will transfer the gaze to the top of the stack currently fixated. Pickup and Putdown are assumed to act at the center of the fixation frame. Whitehead and Ballard (1990) have shown that this task can be learned using a rather general set of primitives similar to the above. This includes instructions for moving a focus of attention. A focus of attention may be thought of as an electronic fovea in terms of its ability to select target locations. The key feature of this model is that the only short-term memory needed is just that which is instantaneously "marked" with either fixation or attention. The rest of the visual world is not represented. The virtue is that the perceptual system can tune itself to select just the relevant visual context to define the next step in the current task.
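The program above can be made concrete with a minimal executable sketch. Everything here is our illustration under simplifying assumptions, not Whitehead and Ballard's actual system: the "world" is two table positions holding stacks of colored blocks, and the World class and its method names are hypothetical. The point is that each action refers only to whatever is currently fixated, never to geometric coordinates.

```python
# Hypothetical toy world for the deictic block-stacking program.
# The gaze marker is the only state shared between perception and action:
# pickup() and putdown() always act at the currently fixated position.

class World:
    def __init__(self):
        # stacks keyed by table position; each stack is listed bottom-first
        self.stacks = {0: ["green", "red"], 1: []}
        self.gaze = None      # position currently fixated
        self.in_hand = None   # block currently held

    def fixate_color(self, color):
        """Fixate(color): orient gaze to a stack containing that color."""
        for pos, stack in self.stacks.items():
            if color in stack:
                self.gaze = pos
                return

    def fixate_top_of_stack(self):
        """Fixate(Top-of-stack): gaze already indexes the fixated stack,
        whose top block is what pickup() will act on."""

    def fixate_empty_spot(self):
        """Fixate(Somewhere-on-the-table): gaze moves to an empty spot."""
        self.gaze = next(p for p, s in self.stacks.items() if not s)

    def pickup(self):
        self.in_hand = self.stacks[self.gaze].pop()   # act at fixation point

    def putdown(self):
        self.stacks[self.gaze].append(self.in_hand)   # act at fixation point
        self.in_hand = None

w = World()
w.fixate_color("green")     # Fixate(Green)
w.fixate_top_of_stack()     # Fixate(Top-of-stack)
w.pickup()                  # removes the red block covering the green one
w.fixate_empty_spot()       # Fixate(Somewhere-on-the-table)
w.putdown()
w.fixate_color("green")     # Fixate(Green)
w.pickup()                  # the green block is now in hand
```

Note that the program contains no positions at all; a scene with the same stack structure but a different layout would run unchanged, which is the viewpoint invariance discussed below.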

Deictic strategies lead to a computational simplification of the general problem of relating internal models to objects in the world. Sequential, problem-dependent eye movements avoid the general problem of associating many models to many parts of the image simultaneously. Instead the problem can be simplified into a series of two kinds of tasks. These tasks either find information about location (one internal model) or information about identification (one world object). Table 1 summarizes this view. A location task is to find the image coordinates of the next thing to be marked in the presence of many alternatives. In this task the image periphery must be searched; one can assume that its features have been chosen a priori. An identification task is to associate the foveated part of the image with one of many possible models. In this task one can assume implicitly that the relevant image location is the fixation point. This simplification leads to dramatically faster algorithms for each of the specialized tasks (Swain & Ballard, 1991; Swain, Kahn, & Ballard, 1992; Ballard & Rao, 1994). Thus, in addition to defining task frames for manipulation, we can think of the eye movements as solving a succession of location (where) and identification (what) subtasks in the process of meeting some larger cognitive goal.

Figure 1. Much previous work in computational vision has assumed that the vision system is passive and computations are performed in a viewer-centered frame (A). Instead, biological and psychophysical data argue for the use of the fixation frame (B). This frame is selected by the observer to suit information-gathering goals and is centered at the fixation point. In changing gaze, the observer attaches the fixation point to object-centered frames (C).
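The WHERE/WHAT decomposition can be illustrated with a toy sketch. The grid "image" and both function names are our assumptions, not the cited algorithms; the sketch only shows that each routine solves half of the matching problem: WHERE is given one known feature and returns a location, WHAT is given one known location (the fixation point) and returns an identity.

```python
# Toy "image": a grid of color labels standing in for image content.
image = [
    ["bg", "bg", "red"],
    ["bg", "green", "bg"],
]

def where(image, feature):
    """Location task: search the periphery for one known feature,
    returning its (x, y) coordinates among many alternatives."""
    for y, row in enumerate(image):
        for x, label in enumerate(row):
            if label == feature:
                return (x, y)
    return None

def what(image, fixation):
    """Identification task: report what is at the fixated location,
    assuming the relevant location is already the fixation point."""
    x, y = fixation
    return image[y][x]

loc = where(image, "green")    # one model, many image parts
identity = what(image, loc)    # one image part, many candidate models
```

Neither routine ever matches many models against many image parts at once, which is the combination Table 1 marks as too difficult.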

Because the eyes allow a natural implementation of deictic strategies, the question is immediately raised whether humans in fact use their eye movements in this way in the context of natural behaviors. We designed a series of experiments to test the use of deictic strategies in the course of a hand-eye task. The main feature of the task is that subjects have the freedom to choose their own task parameters. The task involves a series of steps that require hand-eye coordination and memory, but the subjects can select their own strategy for organizing these steps.

METHODS

The task was to copy a pattern of colored blocks. This task was chosen to reflect basic sensory, cognitive, and motor operations involved in a wide range of human performance. A display of colored blocks was divided into three areas, the model, source, and workspace, as shown in Figure 3. The model area contains the block configuration to be copied; the source contains the blocks to be used; and the workspace is the area where the copy is assembled.

Figure 2. A graphic display from the output of a program that has learned the task: "Pick up the green block." The steps in the program use deictic references, and do not require geometric coordinates. For each stage in the solution, the plus symbol shows the location of the fixation point.

Stimuli were displayed on a Macintosh color monitor, and subjects used the cursor driven by a mouse to "pick up" and "place" blocks on the screen. Picking up a block is accomplished by moving the cursor over the block and depressing a button on the mouse. Placing the block is accomplished by moving the block to the desired location and releasing the button. A set of coarse-grained, discrete locations was used for the block positions. Releasing the mouse button placed the block at the nearest discrete grid location. This obviated the need for very precise positioning and made the task easier to perform. Block sizes varied from 1/2° to 2°. The blocks used in most conditions were 1.7° by 1.3°. In that case the resultant grid was a 10 by 10 array, as can be inferred from Figure 3. Displays were random configurations of eight blocks of four saturated colors: red, green, yellow, and blue. Both the eye and cursor movements were monitored throughout the task. The eye movements were monitored using a Dual-Purkinje Image eye tracker, sampling the eye movements and hand movements every 17 msec. The head was held fixed throughout the experiment, using a bite bar. At the outset of each set of trials for an individual subject, the subject's gaze was calibrated by measuring the recording signal over a grid of 25 positions that spanned the display screen. The accuracy of the tracker is better than 15 min arc over most of the display, so that fixations of individual blocks could be detected with high confidence.

Table 1. The Organization of Visual Computation into WHAT/WHERE Modules*

                      One model                      Many models
One image part        Manipulation: trying to do     Identification: trying to
                      something with an object       identify an object whose
                      whose identity and location    location can be fixated
                      are known
Many image parts      Location (search): trying      Too difficult
                      to find a known object that
                      may not be in view

*The organization of visual computation into WHAT/WHERE modules may have a basis in complexity. Trying to match a large number of image segments to a large number of models at once may be too difficult.

Subjects were instructed simply to copy the model pattern, "as quickly and accurately as possible," using the mouse to move the blocks. No other instructions were given. Data were collected on a total of seven subjects, including the authors. All subjects had normal acuity and color vision.
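The coarse drop-off grid described in the methods, where releasing the mouse button snaps a block to the nearest discrete location, can be sketched as follows. The cell size and coordinates are illustrative only, not the experiment's actual parameters.

```python
# Sketch of snap-to-grid placement: a release point is mapped to the
# center of the nearest grid cell, so precise positioning is unnecessary.

def snap_to_grid(x, y, cell=1.0):
    """Return the grid location nearest to a mouse-release point."""
    gx = round(x / cell) * cell
    gy = round(y / cell) * cell
    return (gx, gy)

print(snap_to_grid(3.4, 7.8))            # -> (3.0, 8.0)
print(snap_to_grid(3.4, 7.8, cell=0.5))  # -> (3.5, 8.0)
```

A finer cell size trades ease of placement for positional precision; the experiment's 10 by 10 array corresponds to a cell roughly the size of one block.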

RESULTS

A striking feature of task performance is that subjects behave in a very similar, stereotypical way, characterized by frequent eye movements to the model pattern. Observations of individual eye movements suggest that information is acquired incrementally during the task and not acquired in toto at the beginning of the task. For example, if the subject memorized and copied four two-block subpatterns, which is well within visual memory limitations, there would only be a total of four looks into the model area. However, subjects did not appear to memorize multiple blocks from the model, and sometimes made as many as 18 fixations in the model area in the course of copying the pattern. In fact, they commonly made more than one fixation in the model area while copying a single block. An example is depicted in Figure 3. After dropping off the second block, the cursor is moved to the source while the eye is directed toward the model. The fixation point then moves to the source area at the location of block three and is used to direct the hand for a pickup action. The eye then goes back to the model while the cursor is moved to the workspace. The eye then moves to the drop-off location to facilitate the release of the block. The fact that fixation is used for visual guidance in picking up and dropping off each block is expected from data on single hand-eye movements (Milner & Goodale, 1991). However, the extent to which the eyes were used to check the model was unanticipated and suggests an equally crucial role in acquiring information at each stage of task performance.

Figure 3. Display used in the experiments. The model is displayed on the top left, and the source area is on the right. The bottom left is the workspace for copying the model pattern. The display subtended 17° by 13° of visual angle. The eye position trace is shown by the thin line. The cursor trace is shown by the dark line. A single cycle is shown, from dropping off block two to dropping off block three (in the experimental trial the blocks are colored). Immediately after dropping off block two the fixation point is transferred to the model. Simultaneously, the cursor is moved to the source area at the right of the screen. Subsequently the fixation point is transferred to the source area at the location of block three and used to direct the hand for a pickup action. Then the eye goes back to the model and the cursor is moved to the drop-off location. The eye moves to the drop-off location to facilitate the release of the block. (The block is erased immediately after it has been picked up.) The numbers indicate the temporal correspondence between hand and eye traces.

It seems likely that in this task subjects use their ability to fixate to simplify the task in two ways. First, the "fixation frame" allows the use of deictic primitives. For example, an object is picked up by first looking at it and then directing the hand to the center of the fixation coordinate frame. The alternative requires programming a command in a world- or ego-based coordinate representation, with much greater demands on the fidelity of the representation. Second, fixation is used to acquire information en route at the point at which it is required. For example, consider the color of the third block. If this is memorized at the outset along with several other colors, then a corresponding number of memory locations would be required. However, a single item that encodes the-color-of-the-next-block can be used if the loading of that item is performed at the appropriate moment in the task.

The basic cycle from the point just after one block is dropped off to the point where the next block is dropped off provides a convenient way of breaking up the task into component subtasks of single block moves. This allows us to explore the different sequences of primitive movements made in putting the blocks into place. A way of coding these subtasks is to summarize where the eyes go during a particular subtask. Thus the sequence in Figure 3 can be encoded as "model-pickup model-drop" (M-P-M-D) with the understanding that the pickup occurs in the source area and the drop in the workspace area. Four principal sequences of eye movements can be identified, as shown in Figure 4a.

Figure 4. (a) The different categories of eye movements used in the task. "M" means that the eyes are directed to the model. "P" and "D" mean that the eyes and mouse are coincident at the pickup point and dropoff point, respectively. Thus for the PMD strategy, the eye goes directly to the source for pickup, then goes to the model area, and then to the workspace for dropoff. (b) Relative frequency of category use for a sample of about 50 block moves for each of seven observers. Error bars show standard error between subjects.
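The coding of a block-move cycle from its fixation trace can be sketched as below. The region labels and the M/P/D symbols follow the text; the function itself is our illustration, not the authors' analysis software.

```python
# Collapse the regions fixated during one pickup-drop cycle into a
# strategy string such as "MPMD". Consecutive fixations within the same
# region are merged, since only the sequence of regions matters.

def code_cycle(fixations):
    """fixations: regions visited in order during one block move,
    e.g. ["model", "pickup", "model", "drop"]."""
    symbols = {"model": "M", "pickup": "P", "drop": "D"}
    code = ""
    for region in fixations:
        s = symbols[region]
        if not code or code[-1] != s:
            code += s
    return code

print(code_cycle(["model", "pickup", "model", "drop"]))  # -> "MPMD"
print(code_cycle(["pickup", "model", "drop"]))           # -> "PMD"
```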

In this task the crucial information is the color and relative location of each block. The observed sequences can be understood in terms of whether the subject has remembered either the color and/or the location of the block currently needed. The necessary assumption is that the information is most conveniently obtained by explicitly fixating the appropriate locations in the model and that the main preference is to acquire color or location information just before it is required. If both the color and location are needed, that is, they have not been previously remembered, an MPMD sequence should result. If the color is known, a PMD sequence should result; if the location is known, an MPD sequence should result; and if both are known, a PD sequence should result. PD sequences were infrequent, and were more likely to occur near the end of the task. Thus MPMD sequences are "memoryless" (with respect to the color and location information), and MPD, PMD, and PD sequences can be explained if the subjects sometimes remember an extra location and/or color when they fixate the model area.
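The mapping just described, from what the subject already remembers to the predicted strategy, can be written down directly (a sketch of the logic in the text, not a fitted model):

```python
# Predicted eye-movement sequence for one block move, given whether the
# next block's color and/or location are already held in memory.

def predicted_sequence(color_known, location_known):
    if color_known and location_known:
        return "PD"      # no model check needed at all
    if color_known:
        return "PMD"     # model checked only for location, before the drop
    if location_known:
        return "MPD"     # model checked only for color, before the pickup
    return "MPMD"        # memoryless: model checked before pickup and drop
```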


Figure 5. Strategy use over the course of the trial. Relative frequency of each strategy for each block in the eight-block pattern. Note the decrease in the memoryless MPMD strategy and the increase in the PD strategy, suggesting that information is accumulated over the course of the trial. Data for four subjects are shown.

Summary data for seven subjects are shown as the dark bars in Figure 4b. The memoryless model-pickup model-drop strategy is the most frequently used by all the subjects, substantially outweighing the others.¹ The "other" category shown in the figure was composed mostly of trials where there was an additional saccade, say from model to source area and back to the model, before the saccade to the workspace to guide pickup. Another kind of sequence might involve an extra saccade between model and workspace. In addition, subjects occasionally dropped a block, or corrected a mistake, and this involved additional fixations. Errors in copying were infrequent, and subjects invariably corrected their mistakes. Note that if subjects were able to complete the task from memory, then a sequence composed exclusively of PDs could have been observed. This frequent access to the model area during the construction of the copy we take as direct evidence of the incremental access to information in the display during the task.

Figure 5 shows how category use changes during the task. Strategy frequency is plotted as a function of the block being moved (first through eighth). The MPMD strategy is most frequent at the beginning of the task, and the frequency of the PD strategy increases toward the end of the task. This reinforces the interpretation of the frequent model fixations as acquiring the color and position information necessary for task performance, and suggests some accrual of information about the model over the course of the trial. That is, performance is not entirely memoryless.

Performing the Task from Memory

One necessary control is to establish how well subjects can perform the task when they are obliged to use memory of the block pattern. In one version of the experiment, subjects were given 10 sec to inspect the model, which was then removed from view. Performance was almost errorless up to four blocks, but degraded rapidly above this (Ballard, Hayhoe, Li, & Whitehead, 1992). When subjects were allowed to inspect the model for a variable duration before it was removed from view, performance approximately asymptoted by around 10 sec at about 60% correct for models of eight blocks. On this basis we might have expected that subjects would use memory more extensively in the main experiment, but they clearly use only minimal memory when they are free to do so.

Copying Blocks While Holding Gaze Fixed

The result in Figure 4 suggests that eye movements are an integral part of the economical execution of the task. The crucial role of fixations in performing the task is supported by a control experiment in which subjects had to perform the task while holding their gaze fixed. The model was kept visible but subjects had to fixate the center of the display throughout the task. Three subjects performed the task. They were able to complete the task successfully, but required about three times longer (averaging about 90 sec, as opposed to 20-30). This is unlikely to be due simply to difficulty in seeing the blocks (which can be up to 5° eccentric) in peripheral vision, since we varied the size of the blocks during this experiment and found that, for block sizes of 1° or more, the time to complete the task was constant. Thus fixations appear to have an additional role.

Copying Blocks of a Single Color

The above analysis of the eye movement pattern makes the assumption that the individual blocks are primitives for the task. This means that the eye movements back to the model are primarily to obtain properties of individual blocks. An alternate explanation is that the extra movements to the model area are in fact not essential but instead appear for some other reason. One such explanation is that the eyes move faster than the hand, so that they can use this extra time to check the model in a way that is unrelated to the properties of individual blocks. However, a control where all the blocks were a single color argues that this is unlikely. The conditions of the control were identical to the standard experiment with the exception that all the blocks were one color. This reduces the informational demands of the task. In this case the analysis of the eye movement data shows a dramatic decrease in the number of eye movements used to inspect the model area. The number of such movements (averaged over four subjects) is 1.0 ± 0.34 per block in the monochrome case versus 1.6 ± 0.3 per block in the multicolored case. A closer inspection of the individual trials in the monochrome case reveals that subjects copy subpatterns without reference to the model, suggesting that they are able to recognize these subpatterns as gestalts. This is further supported by the change in strategy use shown in Figure 6. This figure plots the strategies for the monochrome blocks compared with the multicolored block patterns for four subjects. The significant drop in low memory strategies, along with the large increase in PD sequences, suggests that information about more than one block is acquired in a single fixation. Since subjects reduce eye movements to the model when color information is not required, this is evidence that the eye movements to the model in the multicolor case are not some artifact, but indeed reflect the instantaneous acquisition of information from the blocks.

Figure 6. Comparison of strategy frequency for multicolor and single color patterns. Fewer saccades to the model area are seen in the monochrome condition, and subjects copy small subpatterns with repeated PD strategies.

Fixation Durations

Subjects differ, to some extent, in the frequency with which they use the MPMD strategy. This raises the question of whether subjects who make fewer model fixations simply make more efficient use of memory or whether they acquire more information in each fixation by dwelling longer during the fixation period. Figure 7 shows both the fixation duration per block averaged over all the fixations in the model area, and also the total time per block for four subjects. The average fixation durations are quite similar for all the subjects. However, since some subjects make more frequent fixations than others, the total time spent in the model area is greater for those subjects. Fixation frequency accounts for much of the differences between subjects in time spent in the model area. Thus subjects with fewer model fixations do not spend extra time gathering information during individual fixations. In fact, subjects MS and DB have slightly shorter fixations than the other two subjects, KK and MH, who make most use of the MPMD strategy. Either more information is gathered in the same time, or the information decays more slowly, requiring fewer return saccades. It is possible that our measurements are too crude to pick up small differences in fixation durations, since we were limited in these measurements to a 60 Hz sampling rate. However, to a first approximation, the primary difference in subject performance seems to be in the frequency with which they inspect the model.

Figure 7. (a) Average fixation duration for four subjects shows very little variation. (b) Total fixation durations in the model area are a result of different fixation frequency between subjects; subjects KK and MH show approximately twice the number of looks into the model area than do subjects DB and MS.

Computer Simulation

The observed sequences, in which subjects postpone the gathering of both the color of the next block and its location until just before they are needed, represent a low memory approach to the task. However, the changes in strategy frequency as the task progresses indicate that there is some accrual of information during a trial. To estimate the extent to which subjects retain more than the immediately necessary information we developed a simple model that allowed color and location memory to have a variable capacity. If either of these memory stores is empty, the eyes move to the model area. To explain this in more detail, consider the model control program:

Repeat until {pattern copied}
    If (no colors in memory) then GetColor
    Pickup
    If (no locations in memory) then GetLocation
    Drop

The instructions GetColor and GetLocation act as pro- ducers of color and location information, respectively. Similarly, Pickup and Drop act as consumers of color and location information. If there was never any color or location in memory, each of the observed sequences would be of the form MPMD. To explain the observed sequences, we only have to allow the model fixations to probabilistically produce an extra color and/or location. Thus Getcolor, in addition to determining a single color, is allowed to determine the subsequent color with prob- ability Pc. GetLocation is allowed to determine the sub- sequent location with probability PI, and, in addition, the subsequent color with probability Pc. With this model, the observed data can be modeled quite closely, as shown by the gray bars in Figure 8. To generate Figure 8, a hundred trials of the simulation were run, with PI = 0.5 and Pc = 0.5. In other words, on the fixation in the model to get a location there is a 50% chance that the location for the subsequent block move will be remembered. The probability of remembering an extra color is also 50%, but unlike location, an extra color can be remembered on any fixation in the model area. This rather arbitrary feature of the model was designed to produce the asym- metry between the PMD and MPD strategies observed in the data. The higher frequency of PMD strategies

Ballard et al. 73

Page 9: Memory Representations in Natural Tasks

Figure 8. Model compared with the data of Figure 4. Model used Pc = 0.5 and Pl = 0.5.

suggests that block color is more easily remembered than block location. With the parameters set to these values, on average roughly one color and one block location are held in memory between block moves. We did not attempt to refine the model to improve the predictions, since our main concern was to show that a model of this general type, with very limited memory use, could illustrate the main features of task performance. Thus the task is performed with some memory, but nowhere near the maximum capacity. (These parameters probably slightly overestimate the probability of retaining an extra color or location, since the "other" category in Figure 8 is mostly composed of trials where more than two saccades to the model area occurred during a single block move.)
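The control program above can be simulated directly. The sketch below is our own illustration of the producer/consumer logic (not the authors' original simulation code), with Pc and Pl as free parameters; it emits the fixation/action sequence for each block move.

```python
import random

def simulate_trial(n_blocks=8, p_c=0.5, p_l=0.5, rng=random):
    """Simulate one trial of the model control program.

    'M' = fixation in the model area, 'P' = pickup, 'D' = drop.
    GetColor and GetLocation (the 'M' fixations) produce information;
    Pickup and Drop consume it. Any model fixation may also retain the
    subsequent color with probability p_c; a GetLocation fixation may
    retain the subsequent location with probability p_l.
    """
    colors = locations = 0          # items currently held in memory
    moves = []
    for _ in range(n_blocks):
        seq = ""
        if colors == 0:             # GetColor
            seq += "M"
            colors = 1 + (rng.random() < p_c)   # maybe an extra color
        seq += "P"                  # Pickup consumes one color
        colors -= 1
        if locations == 0:          # GetLocation
            seq += "M"
            locations = 1 + (rng.random() < p_l)
            if rng.random() < p_c:  # extra color on any model fixation
                colors = min(colors + 1, 2)
        seq += "D"                  # Drop consumes one location
        locations -= 1
        moves.append(seq)
    return moves
```

Tallying the four sequence types (MPMD, PMD, MPD, PD) over a hundred simulated trials with p_c = p_l = 0.5 yields frequencies of the same general shape as the gray bars in Figure 8.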

Monitoring Progress

If in fact subjects are not using extensive memory for this task, then they should have to cope with an additional problem: if they have no extensive internal model, how do they keep track of what has been copied and what is left to do? One possibility is that subjects do this by an explicit comparison of model and partial copy for each block move. This could be done on either of two occasions: after placing a block in the workspace, when the subject needs to get information about the next block to copy, or on the fixation in the model area following a pickup and just before a drop. We have viewed the role of the fixation in this latter case as getting the relative location information for placement. This could involve a comparison of the two patterns either visually, during the fixation, or serially by fixating first the model and

then the copy. When subjects fixate the model and workspace successively, they tend to select corresponding points in the two patterns. By looking at corresponding points in the model and copy, gaze could establish a common frame of reference. This could simplify comparison of the model and partially copied pattern and the identification of the uncopied part of the pattern. This method has the advantage of not having to encode the details of the pattern or the history of the copying procedure. To examine this, we computed the frequency distribution for the horizontal component of landing points in the workspace relative to the starting point in the model. The medians of these distributions (averaged across four subjects) were approximately 2° (±0.25°), which is about the width of a single block. Similarly, saccades in the opposite direction, from the workspace to the model following block placement, deviate from the corresponding point by 2.3° (±0.2°). Thus, saccades between the model area and the workspace tend to land on the same block. This is not very strong evidence, however, that fixation of corresponding points in the two patterns is important for keeping track of one's place in the task. First, if saccades from the model terminated randomly we would expect the median to be 2.6° ± 0.2° (based on a sample of landing points chosen randomly from the actual distribution of landing points for all the saccades from the model into the workspace). This is not very different from the observed correspondence. (The value for saccades in the opposite direction is 2.9° ± 0.17°.) Second, one would expect fixations in corresponding points because of the need for visual guidance in putting down a block, and also the tendency for

74 Journal of Cognitive Neuroscience Volume 7, Number 1


Figure 9. The experimental setup consists of a model (upper left) made up of eight colored Duplo blocks, a source area (right), and a workspace area (lower left). The headband-mounted eyetracker monitors eye-in-head position, while head and hand positions are determined by magnetic coils on the headband and hand, respectively. The monitor in the background shows the scene from a scene camera mounted on the headband; the crosshair cursor indicates the point of fixation. Note that the hand coil was taped to the thumb in the actual experiment, not attached to the back of the hand as shown in the photo.

subjects to work on successive neighboring blocks in constructing the model. Although the evidence in support of this idea is quite weak, it is at least suggestive that fixating the corresponding locations may have an additional advantage in helping the cognitive program keep track of what has been copied and what remains to be copied. Maintaining information about the model or copy beyond these fixations is unlikely, because it will be replaced by the information required for the next step in the task.
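The chance baseline quoted above can be estimated with a simple resampling procedure. The sketch below is a generic illustration of such a method (pairing landing points drawn from the observed distribution with randomly chosen reference points), not the authors' original analysis code; the function and parameter names are our own.

```python
import random
import statistics

def chance_median_deviation(landing_x, reference_x, n_samples=10000, seed=0):
    """Median horizontal deviation expected if saccades from the model
    terminated at random positions in the workspace.

    landing_x:   observed workspace landing coordinates (deg)
    reference_x: model starting-point coordinates (deg)
    Estimated by pairing a randomly drawn landing point with a randomly
    drawn reference point, many times over.
    """
    rng = random.Random(seed)
    deviations = [abs(rng.choice(landing_x) - rng.choice(reference_x))
                  for _ in range(n_samples)]
    return statistics.median(deviations)
```

Comparing this chance median against the observed median deviation is what lets one judge whether the apparent landing-point correspondence exceeds what random termination would produce.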

Experiments with Real Blocks

It is possible that the frequent fixations in the model area are an artifact of the use of the computer mouse to move the blocks, which may slow down performance and allow time for additional saccades that have little cognitive significance. Perhaps if real three-dimensional blocks and actual hand movements were used, the results would be different. One reason might be that with real blocks subjects can take advantage of memory by using it to

direct the arm through kinesthetic coordinates. To test this we performed a version of the experiment using real blocks. We were able to do this by using a headband-mounted, infrared camera-based eye tracking system (ASL Laboratories model E400SU), which monitored eye-in-head position during performance of the task. Head position was monitored with a six-DOF magnetic coil tracking system (Ascension Technology model 6DFOB). Thus it was possible to observe natural, unconstrained behavior. Accuracy of the eye and head movement records was approximately 1° over the work area. In addition, we monitored hand position using a second magnetic coil receiver. Colored Duplo blocks were mounted on a board that was positioned so that the blocks subtended approximately 2° of visual angle, as in the previous experiment. The centers of the model and workspace regions were separated by about 15° vertically, with the source region approximately 20° to the right. The setup is shown in Figure 9.

Figure 10. Task performance varies little when subjects move real blocks on a board instead of manipulating a CRT image with a mouse (compare to Fig. 4b). Data for five subjects are shown, with between 50 and 150 block moves for each subject; error bars are between subjects.

Figure 10 shows the results for five subjects. While there are small differences in the various strategy frequencies (compare with Fig. 4), the memoryless MPMD strategy remained the modal one, and overall there was little memory use. From this we infer that the low use of memory comes about from some intrinsic cost of the memorization and is not an artifact of the mechanism of moving the blocks with a mouse in the previous experiments.

Memory Cost

The above experiments are designed to demonstrate the basic phenomenon of limited memory use. The net result is that subjects use short-term memory in the block copying task, but that this memory is computationally expensive compared to the cost of acquiring the information on-line, so that low-memory strategies are preferred. On the basis of this conjecture we would predict that if the cost of the on-line acquisition of information could be increased relative to that of memorizing, the balance of effort should shift in the direction of increased memorization. The cost of on-line acquisition was increased by moving the model and workspace from their previous positions, separated by 15°, to eccentric positions separated by 70°. This configuration meant that subjects had to make large head movements when making gaze changes from one area to another. Under these conditions subjects used more memorization, as reflected in fewer eye movements to the model area. Figure 11 shows the frequency of the various strategies for two subjects in the "near" and "far" configurations described above. The relative frequency distributions varied between subjects; while these subjects' results represent the extremes in that variation, the frequency of double looks to the

model (MPMD strategy) went down significantly for both subjects when the distance between the model and workspace was increased. While KK relied much more on the memoryless MPMD strategy in the near condition than did MS, both subjects relied more on memorization when the cost of obtaining information on-line was increased; the average number of eye movements to the model area per block decreased from 2.1 to 1.5 for subject KK and from 1.2 to 1.0 for subject MS. (Note that the blocks classified as "Other" typically had at least two looks into the model area.) Thus the need for large head movements in gaze changes appears to increase the cost of saccades to the model area and to lead to increased memory use. The double looks are not entirely eliminated, however, suggesting a relatively high cost to memory use.
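The tradeoff can be made concrete with a toy cost comparison (our framing, not a quantitative model from the experiments): each strategy pays for its model fixations and for the items it must hold in memory across the block move.

```python
def cheapest_strategy(fixation_cost, memory_cost):
    """Return the strategy with the lowest toy cost per block move.

    MPMD: two model fixations, nothing held in memory.
    PMD:  color held from a previous fixation; one fixation for location.
    MPD:  location held; one fixation for color.
    PD:   both color and location held; no model fixation needed.
    """
    costs = {
        "MPMD": 2 * fixation_cost,
        "PMD": fixation_cost + memory_cost,
        "MPD": fixation_cost + memory_cost,
        "PD": 2 * memory_cost,
    }
    return min(costs, key=costs.get)
```

When fixations are cheap, as in the near condition, MPMD wins; raising the fixation cost, as the 70° separation and its attendant head movements do, tips the comparison toward the memory-based strategies.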

CONCLUSIONS

In summary, the main result is that the information required for the task is acquired just prior to its use. The alternate strategy of memorizing the configuration to be copied in its entirety before moving blocks is never used. Subjects choose not to operate at the maximum capacity of short-term memory but instead seek to minimize its use. The observations point to the use of a small amount of prediction that extends only to the color and location of the block after the current one. In addition, the experiments point to (1) the use of eye fixations to acquire information, and (2) the incremental acquisition of information about the color and location of each block.

Figure 11. Relative frequency of block copying strategies for two subjects in the "near" and "far" configurations. Increasing the separation of copy and model results in an increase in memorization. The model and workspace areas are separated by 15° in the "near" condition and 70° in the "far" condition. While subjects use more memorization in the far condition, a significant number of block moves are still performed with the memoryless MPMD strategy. Strategies are defined in Figure 4. (Nnear = 192, Nfar = 152 for MS; Nnear = 168, Nfar = 80 for KK.)

Reducing the instantaneous memory required to perform the task can be done by serializing the task with eye movements. These eye movements allow subjects to postpone the gathering of task-relevant information until just before it is required. The reluctance to use short-term memory to capacity can be explained if such memory is expensive to use with respect to the cost of the serializing strategy. Our experiments support this interpretation.

Human performance in such tasks reveals the same fundamental characteristics as those robotic models that have been successfully implemented using deictic instructions based on a small number of primitive operations. This obviates the need for complex memory representations. The kinds of primitives used in the simple control program above can clearly be used to generate more complex behaviors. These results suggest a new interpretation of the limitations of human working memory. Rather than being thought of as a limitation on processing capacity, it can be seen as an integral part of a system that makes dynamic use of deictic variables. The limited number of variables need be a handicap only if the entire task is to be completed from memory; in that case, the short-term memory system is overburdened. In the more natural case of performing the task with ongoing access to the visual world, the task is completed perfectly. This suggests that a natural metric for evaluating behavioral programs can be based on their spatiotemporal information requirements.

We can interpret the computational role of the items of short-term memory in the following way. In computational parlance, the term pointer is used to denote a symbol that allows access to some larger structure, such as the contents of an item of short-term memory. In our context, we invoke a particular kind of pointer, called a marker, and we can think of working memory as being composed of a limited number of markers. The marker notation raises the issue of binding, or setting the contents of a marker. This is because markers are general variables that can be reused for other computations. When are markers bound? For a visual marker, one possible indication that it is being set could be fixation. Looking directly at a part of the scene provides special access to the features immediate to the fixation point, and these could be bound to a marker during the fixation period. In all likelihood binding can take place faster than this, say by using an attentional process, but using fixation as an upper bound would allow us to bind at least three markers per second. Viewing attention as a marker also gives its selective nature a somewhat different computational significance. Attention can be seen as being necessarily limited by virtue of its role in specifying the variable for the next instruction. [A similar suggestion has been made by Allport (1989).]
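As a rough computational analogy (the class name, capacity, and eviction rule below are our own illustrative assumptions, not a mechanism proposed in the paper), working memory on this view behaves like a small pool of reusable pointers: binding a new variable when all markers are in use must displace an existing binding.

```python
class MarkerStore:
    """Working memory as a fixed pool of reusable deictic markers."""

    def __init__(self, n_markers=3):
        self.n_markers = n_markers
        self.bindings = {}          # variable name -> value at fixation

    def bind(self, name, value):
        """Bind a marker; reuse the oldest marker if the pool is full."""
        if name not in self.bindings and len(self.bindings) >= self.n_markers:
            oldest = next(iter(self.bindings))  # dicts keep insertion order
            del self.bindings[oldest]
        self.bindings[name] = value

    def read(self, name):
        """Return the bound value, or None if the marker was reused."""
        return self.bindings.get(name)
```

The capacity limit does no harm as long as the task only ever needs a few bindings at once, which is exactly the regime the block copying strategies occupy.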

This view of markers as variables in a cognitive program raises a different set of questions for study. One can characterize the complexity of cognitive programs in terms of the average amount of state they require at any instant. In other words, programs that require the maximum number of variables would be complex, and programs that require few or no variables would be simple. This view of memory items as markers in a cognitive program sheds some light on the difficulties faced by the subjects in the block experiments. Since the number of possible programs as a function of the number of markers grows extremely rapidly, if the subject has to choose a program to do the task, then there is a premium on keeping the number of markers small.

These results also suggest an interpretation of the role of foveating eye movements in vision. Since Yarbus's classic observations (Yarbus, 1967), saccadic eye movements have often been thought to reflect cognitive events, in addition to being driven by the poor resolution

of peripheral vision. However, making this link has proved sufficiently difficult to raise questions about how much can be learned about cognitive operations by inspecting fixation patterns (Viviani, 1990). One of the difficulties of relating fixations to cognitive processes is that fixation itself does not indicate what properties are being acquired. In the block copying paradigm, fixation appears to be quite tightly linked to the underlying processes by marking the location at which information (e.g., color, relative location) is to be acquired, or the location that specifies the target of the hand movement (picking up, putting down). Thus fixation can be seen as defining the variable currently relevant for task performance. Fixating the relevant block in the model area has a limited number of interpretations, such as acquiring either color or relative location, depending on its place in the task sequence. Our ability to relate fixations to cognitive processes in this instance is a consequence of our ability to provide an explicit description of the task. In previous attempts to glean insight from eye movements (e.g., viewing a scene or identifying a pattern) the task demands are not well specified or observable. In the block copying task, the task context allows a fairly tight interpretation of individual fixations¹ (e.g., a model fixation following pickup is presumably for acquisition of information for placement).

A similar suggestion for the role of fixation in binding variables has been made by Kowler and Anton (1987), who found that saccade size in reading was determined by the task. Saccades were approximately word-by-word for normal text, but letter-by-letter for scrambled text with the letters reversed. That is, saccades were determined by the appropriate pattern recognition unit called for by the task. Similarly, same/different judgments of complex patterns appear to require eye movements (Schlingensiepen, Campbell, Legge, & Walker, 1986; Just & Carpenter, 1976). In investigations of chess playing, Chase and Simon (1973) showed that eye fixations are intimately related to spatial working memory, but did not suggest a definitive computational role for fixations. Finally, note that this suggestion is a rather different role for fixations from that of Noton and Stark (1971), who hypothesized that the eye movements indexed a memory representation.

Finally, we can return to reexamine the computational hypothesis illustrated in Table 1 in the light of our observations of human performance in this task. We can think of task performance as being quite well explained by the successive application of three operations of the kind illustrated in the table. Thus a model fixation will acquire visual properties (color, relative location) at the marked (fixated) location (cf. the identification, or what, box in the table). This will be followed by a visual search operation to find the target color in the resource, or the putdown location in the workspace (the location, or where, box), then by saccade programming to that location, and then visual guidance of the hand to the marked (fixated) location (the manipulation box). In addition to this we need the operation of holding a very small number of model properties in working memory, and programming the ballistic phase of the hand movement. It is of interest to consider how these operations might correspond to the classic division of visual pathways into dorsal and ventral streams (Ungerleider & Mishkin, 1982). These have been postulated to underlie what and where, or what and how, visual operations (alternatively, perception and action; e.g., Ungerleider & Mishkin, 1982; Goodale & Milner, 1992). One possibility might be that the dorsal stream to parietal cortex is involved in positioning and keeping track of the markers (e.g., programming the saccade to the location identified by visual search), and that the inferotemporal cortex does the computations for the acquisition of information (e.g., color, relative location, or local configuration in the pattern) at the marked location. (The visual search operation can be seen as involving both streams.) Manipulation, or hand movements, might involve other parts of cortex acting relative to the marked location given by parietal cortex, using properties (such as size, orientation) given by temporal cortex. This conceptualization is somewhat richer than the classic what/where distinction, or even than the what/how distinction. One issue is how it could explain the behavior of Goodale's patient, DF, who displayed normal hand-shaping behavior in the context of reaching movements, together with an inability to report the visual properties that appeared to guide the hand shaping. On the view we have presented, we would need to postulate some basis for the impaired ability to report visual properties other than a loss of this function in inferotemporal cortex. Since the damage appears to be largely in the white matter, it is possible that some aspect of the circuitry involved in reporting the perceptual decision is damaged, rather than the computation itself.
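The three operations can be strung together in a minimal sketch of one block move. The data structures below are our own illustrative assumptions, not a transcription of Table 1; the point is only that identification, location (visual search), and manipulation suffice, with no persistent scene model.

```python
def copy_next_block(model, copy, resource):
    """Perform one block move with three deictic operations.

    model, copy: lists of (color, relative_location) pairs
    resource:    list of colors available in the source area
    Returns the moved block, or None when the copy is complete.
    """
    # identification: fixate the next uncopied model block and acquire
    # its properties at the fixation point
    remaining = [b for b in model if b not in copy]
    if not remaining:
        return None
    color, rel_loc = remaining[0]
    # location: visual search for that color in the resource area,
    # then pickup at the found location
    idx = next(i for i, c in enumerate(resource) if c == color)
    resource.pop(idx)
    # manipulation: guide the hand to the marked putdown location
    copy.append((color, rel_loc))
    return color, rel_loc
```

Note that between calls nothing survives except the partially built copy itself, which is exactly the "world as its own memory" regime the experiments suggest.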

Historically we have been accustomed to thinking of the job of perception as creating rich, task-independent descriptions of the world that are then reaccessed by cognition (Marr, 1982). However, an intriguing suggestion that follows from these experiments is that perhaps the job of perception can be greatly simplified: it need only create descriptions that are relevant to the current task. To the extent that manipulations on a given block are largely independent of the information acquired in previous views, performance in this task suggests that it is unnecessary to construct an elaborate scene description to perform the task and that there is only minimal processing of unattended information. In addition, since color and location information appear to be acquired separately, it appears that even in the attended regions the perceptual representation may be quite minimal. These observations consequently support the suggestion made previously by O'Regan (O'Regan & Levy-Schoen, 1983; O'Regan, 1992) that only minimal information about a scene is represented at any given time, and that

the scene can be used as a kind of “external” memory. A related suggestion has also been made by Nakayama (1990).

Acknowledgments The research described in this paper has been supported by the National Institutes of Health under Grants 1 R24 RR06853-02 and EY-05729 and by AFOSR under Grant 91-0332-C.

Reprint requests should be sent to Dr. Dana Ballard, Depart- ment of Computer Science, University of Rochester, Rochester, NY 14627.

Note 1. This finding is robust across modest changes in the display. For example, similar strategy frequencies are observed when the number of blocks in the pattern is varied from 6 to 9, and when the blocks are not contiguous.

REFERENCES

Agre, P. E., & Chapman, D. (1987). Pengi: An implementation of a theory of activity. Proceedings of AAAI-87, 268-272.

Allport, A. (1989). Selective attention. In M. Posner (Ed.), Foundations of cognitive science (pp. 631-682). Cambridge, MA: MIT Press.

Baddeley, A. (1986). Working memory. Oxford: Clarendon Press.

Ballard, D. H. (1989). Behavioral constraints on animate vision. Image and Vision Computing, 7(1), 3-9.

Ballard, D. H. (1991). Animate vision. Artificial Intelligence Journal, 48, 57-86.

Ballard, D. H., Hayhoe, M. M., Li, F., & Whitehead, S. D. (1992). Hand-eye coordination during sequential tasks. Philosophical Transactions of the Royal Society of London B, 337, 331-339.

Ballard, D. H., & Rao, R. P. (1994). Seeing behind occlusions. European Conference on Computer Vision, Stockholm, Sweden, May 1994.

Brooks, R. A. (1986). A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2, 14-22.

Brooks, R. A. (1991). Intelligence without reason. (AI Memo 1293.) Cambridge, MA: Massachusetts Institute of Technology, AI Laboratory.

Chase, W. G., & Simon, H. A. (1973). Perception in chess. Cognitive Psychology, 4, 55-81.

Goodale, M., & Milner, A. D. (1992). Separate visual pathways for perception and action. Trends in Neurosciences, 15, 20-25.

Just, M. A., & Carpenter, P. A. (1976). Eye fixations and cognitive processes. Cognitive Psychology, 8, 441-480.

Kowler, E., & Anton, S. (1987). Reading twisted text: Implications for the role of saccades. Vision Research, 27, 45-60.

Marr, D. C. (1982). Vision. San Francisco: W. H. Freeman.

Miller, G. (1956). The magic number seven plus or minus two: Some limits on your capacity for processing information. Psychological Review, 63, 81-96.

Milner, A. D., & Goodale, M. A. (1991). Visual pathways to perception and action. (COGMEM 62.) University of Western Ontario, Center for Cognitive Science.

Nakayama, K. (1990). The iconic bottleneck and the tenuous link between early visual processing and perception. In C. Blakemore (Ed.), Vision: Coding and efficiency (pp. 411-422). Cambridge: Cambridge University Press.

Noton, D., & Stark, L. (1971). Eye movements and visual perception. Scientific American, 224, 34-43.

O'Regan, J. K. (1992). Solving the "real" mysteries of visual perception: The world as an outside memory. Canadian Journal of Psychology, 46, 461-488.

O'Regan, J. K., & Levy-Schoen, A. (1983). Integrating visual information from successive fixations: Does trans-saccadic fusion exist? Vision Research, 23, 765-769.

Schlingensiepen, K.-H., Campbell, F. W., Legge, G. E., & Walker, T. D. (1986). The importance of eye movements in the analysis of simple patterns. Vision Research, 26, 1111-1117.

Swain, M. J., & Ballard, D. H. (1991). Color indexing. International Journal of Computer Vision, 7(1), 11-32.

Swain, M. J., Kahn, R. E., & Ballard, D. H. (1992). Low resolution cues for guiding saccadic eye movements. Proceedings of the Computer Vision and Pattern Recognition Conference, Urbana, IL.

Ullman, S. (1984). Visual routines. Cognition, 18, 97-157. (Also in Pinker, S. (Ed.), Visual cognition. Cambridge, MA: Bradford Books, 1984.)

Ungerleider, L., & Mishkin, M. (1982). Two cortical visual systems. In D. Ingle, M. Goodale, & R. Mansfield (Eds.), Analysis of visual behavior (pp. 549-585). Cambridge, MA: MIT Press.

Viviani, P. (1990). Eye movements in visual search: Cognitive, perceptual, and motor control aspects. In E. Kowler (Ed.), Eye movements and their role in visual and cognitive processes. Reviews of oculomotor research (Vol. 4, pp. 353-383). Amsterdam: Elsevier.

Whitehead, S. D., & Ballard, D. H. (1990). Active perception and reinforcement learning. Neural Computation, 2(4), 409-419.

Whitehead, S. D., & Ballard, D. H. (1991). Learning to perceive and act by trial and error. Machine Learning, 7, 45-83.

Yarbus, A. (1967). Eye movements and vision. New York: Plenum.