
Visual Parsing after Recovery from Blindness

Brain CS Reading Group (braincs.wiki.zoho.com), Lunch Talk, 11 March 2010

Page 1: Visual Parsing after Recovery from Blindness

Visual Parsing After Recovery from Blindness
Ostrovsky et al., Psychological Science, 2009

Daniel O'Shea ([email protected])
Brain CS Lunch Talk
11 March 2010

Page 2: Visual Parsing after Recovery from Blindness


Overview

- Overview
  - Project Prakash
  - Learning to Recognize Objects
  - Subjects
- Tests of Visual Parsing
  - Static Visual Parsing
  - Dynamic Visual Parsing
  - Recovery of Static Object Recognition
- Conclusion
  - Motion Information as a Scaffold
  - Experimental and Theoretical Evidence for Temporal Learning
  - Future Directions


Page 3: Visual Parsing after Recovery from Blindness


Project Prakash

“The goal of Project Prakash is to bring light into the lives of curably blind children and, in so doing, illuminate some of the most fundamental scientific questions about how the brain develops and learns to see.”

http://web.mit.edu/bcs/sinha/prakash.html


Page 4: Visual Parsing after Recovery from Blindness


Project Prakash

- Launched in 2003 by Pawan Sinha at MIT BCS
- Based in New Delhi; India is home to 30% of the world's blind population
- Humanitarian aim: screen children for treatable blindness and provide modern treatment

Scientific Opportunity

- How does the visual system learn to perceive objects?
- Track object learning in longitudinal studies
- Complementary to infant studies, which have more plastic visual processing but are limited in experimental complexity
- Separation of visual learning from sensory development



Page 6: Visual Parsing after Recovery from Blindness


Learning to Recognize Objects

- The visual system must learn to "parse" the visual world into meaningful entities
- Segmentation research has focused on heuristics: alignment of contours, similarity of texture representations

Excerpt from Ostrovsky et al. (2009):

[…] These results appear to suggest that visual parsing might be subject to a critical period in the first few years of life. Despite the lack of conclusive evidence for the permanence of deprivation-induced deficits, individuals who have been blind past the age of 5 or 6 years but have treatable conditions (a situation that is virtually nonexistent in developed nations, but unfortunately not as rare in the developing world) are often passed over for treatment owing to the assumed poor prognosis for recovery.

It is worth noting that working with a noninfant population provides us with both advantages and disadvantages. On the one hand, because the brain is otherwise almost fully mature in the individuals we work with, visual learning can be segregated from development of the other senses and from real-world knowledge. On the other hand, a mature brain may not undergo the same progression as an infant brain. Thus, it is appropriate to consider this work as complementary to, rather than a replacement for, traditional studies with infants.

METHOD

Participants

S.K. is a 29-year-old male, born in Bihar, India. By the time he was 4 months old, members of his family noticed his inability to fixate and a lack of visually guided behaviors. Because of financial and logistical constraints, medical intervention was not sought until S.K. was an adolescent. At the age of 12 years, he was examined by an ophthalmologist, who recommended surgery to correct his sight. However, the operation was canceled because S.K.'s father had an illness that completely depleted the family's finances. S.K. was admitted to the State School for the Blind in Darbangha, Bihar, where he studied for 12 years and learned Braille. In 2000, he moved to a hostel for the blind in New Delhi and enrolled in a correspondence course; he earned a master's degree in political science in April 2006. It was during a visit to this hostel that we met S.K. in January 2004.

Examinations by three independent ophthalmologists in New Delhi yielded identical assessments: S.K. has secondary congenital bilateral aphakia (B.L. Johnson & Cheng, 1997; Pratt & Richards, 1968), with the lenses almost completely absorbed in the anterior and posterior chambers of the right and left eye, respectively. The optical pathways in the eyes are clear. S.K.'s acuity was assessed to be 20/900. He had never been able to afford a pair of eyeglasses that could compensate for his aphakia. During our next visit to India, in July 2004, we had S.K. reexamined by optometrists and ophthalmologists in New Delhi and purchased a pair of eyeglasses for him. Postcorrection acuity was determined to be 20/120. The residual acuity impairment is likely due to neural amblyopia (Kiorpes & McKee, 1999).

Beginning 2 weeks after the refractive correction, we performed a series of experiments to assess S.K.'s visual abilities. Tests of low-level visual function revealed that he had near-normal ability to discriminate among colors, luminances, and motion directions.

We also had the opportunity to work with 2 male children, P.B. and J.A., whom we studied beginning 1 and 3 months, respectively, after surgery to correct their dense bilateral congenital cataracts. P.B. received treatment at the age of 7 years, and J.A. at the age of 13 years. P.B. was born in a village near Panipat, Haryana. His family has a long history of congenital blindness. Both P.B. and his older sister T.B. (age 12) were congenitally blind, as were his father, his paternal grandmother, his great-grandmother, two aunts, and an uncle. P.B. has been enrolled in the Blind Relief Association's school in Delhi since the age of 4½ years. His parents did not pursue treatment for him (or T.B.) because a doctor incorrectly told them that his condition was untreatable because of the development of nystagmus. A botched eye surgery that P.B.'s uncle had undergone a few years earlier further dampened the parents' desire to seek treatment for their children. We came across P.B. in an outreach eye-screening session we had organized in his school. His condition was determined to be treatable. In December 2005, P.B. underwent a small-incision cataract surgery with intraocular lens implantation in both eyes; as a result, his acuity improved from light perception to 20/100.

[Figure 1 caption: "Example illustrating how a natural image (a) is typically a collection of many regions of different hues and luminances (b). The human visual system has to accomplish the task of integrating subsets of these regions into coherent objects, illustrated in (c)."]



Page 7: Visual Parsing after Recovery from Blindness


Central questions

- Is there a critical period for learning visual parsing?
- There is evidence that the mature visual system uses these heuristics for object recognition, but are they useful in organizing visual information during development?


Page 8: Visual Parsing after Recovery from Blindness


Recovery Subjects

- S.K., 29 y.o., congenital bilateral aphakia (absence of the lens); corrected with eyeglasses!
- P.B., 7 y.o., treated with bilateral cataract surgery with intraocular lens implantation
- J.A., 13 y.o., treated with bilateral cataract surgery with intraocular lens implantation

Control Subjects

- Visual input scaled to simulate the information loss due to imperfect acuity in the treated group (20/80 to 20/120); one simple way to mimic this is sketched below
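The deck does not spell out how the control stimuli were generated; a minimal sketch, assuming the acuity loss can be approximated by a Gaussian blur whose width scales with the Snellen ratio (the scaling constant below is purely illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_reduced_acuity(image, acuity_denominator, baseline=20.0):
    """Blur an image to roughly mimic reduced Snellen acuity.

    acuity_denominator: e.g. 120 for simulated 20/120 vision.
    The blur width grows with the acuity ratio; the 0.5 constant
    relating the two is a free parameter chosen for illustration.
    """
    ratio = acuity_denominator / baseline   # 20/120 -> 6x coarser detail
    sigma = 0.5 * ratio                     # illustrative scaling
    return gaussian_filter(image, sigma=sigma)

# Example: degrade a random "image" into the treated group's acuity range
img = np.random.rand(256, 256)
degraded = simulate_reduced_acuity(img, acuity_denominator=120)
```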


Page 9: Visual Parsing after Recovery from Blindness

Tests of Static Visual Parsing

Excerpt from Ostrovsky et al. (2009):

[…] images, J.A. recognized 34%, and P.B. recognized only 18%. We asked subjects to point to objects in these images and also to indicate their extent, even if they could not name the objects. We found that subjects' responses were driven by low-level image attributes; they pointed to regions of different hues and luminances as distinct objects. This approach greatly oversegmented the images and partitioned them into meaningless regions, which would be unstable across different views and uninformative regarding object identity. A robust object representation is difficult to construct on the basis of such fragments. Figure 2d, which shows S.K.'s responses to three sample images, illustrates this tendency toward overfragmentation. In separate computational simulations, we found that the recently treated subjects' parsing could be largely accounted for by a simple computational algorithm based on luminance and hue (Fig. 2d, lower row).

[Figure 2: performance (% correct) of the control group, S.K., J.A., and P.B. on seven tasks, A–G. Caption: "Subjects' parsing of static images. Seven tasks (a) were used to assess the recently treated subjects' ability to perform simple image segmentation and shape analysis. The graph (b) shows the performance of these subjects relative to the control subjects on these tasks. 'NA' indicates that data are not available for a subject. S.K.'s tracing of a pattern drawn by one of the authors (c) illustrates the fragmented percepts of the recently treated subjects. In the upper row of (d), the outlines indicate the regions of real-world images that S.K. saw as distinct objects. He was unable to recognize any of these images. For comparison, the lower row of (d) shows the segmentation of the same images according to a simple algorithm that agglomerated spatially adjacent regions that satisfied a threshold criterion of similarity in their hue and luminance attributes."]

Task-by-task results:

- A: Non-overlapping shapes: no difficulty.
- B: Overlapping shapes as line drawings: overfragmentation; all closed loops perceived as distinct.
- C: Overlapping shapes as filled translucent surfaces: overfragmentation; all regions of constant luminance perceived as distinct.
- D: Opaque overlapping shapes: could correctly indicate the number of objects.
- E: Opaque overlapping shapes: could not indicate depth ordering (chance performance).
- F: Field of oriented line segments: difficulty finding the path of coherent segments.
- G: 3D shapes with luminance variation from lighting and shadows: percept of multiple objects, one for each facet.
- Conclusion: a tendency to overfragment, driven by low-level cues (regions of hue and luminance), and an inability to use cues of contour continuation, junction structure, or figural symmetry.


Page 17: Visual Parsing after Recovery from Blindness


Object Tracing

- The overfragmented percept is also evident in object tracing (Fig. 2c)
- Segmentation is consistent with an algorithm based on a hue/luminance similarity metric (a toy version is sketched below)
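The paper describes this comparison algorithm only at a high level: agglomerate spatially adjacent regions whose hue and luminance are similar within a threshold. A toy union-find version of that idea (thresholds are arbitrary illustrative values, not the paper's):

```python
import numpy as np

def segment_by_hue_luminance(hue, lum, hue_thresh=0.05, lum_thresh=0.05):
    """Merge 4-adjacent pixels whose hue and luminance differences both
    fall below a threshold; each resulting region gets one label."""
    h, w = lum.shape
    # Edges where horizontally / vertically adjacent pixels may merge
    right = (np.abs(np.diff(hue, axis=1)) < hue_thresh) & \
            (np.abs(np.diff(lum, axis=1)) < lum_thresh)
    down = (np.abs(np.diff(hue, axis=0)) < hue_thresh) & \
           (np.abs(np.diff(lum, axis=0)) < lum_thresh)
    parent = np.arange(h * w)               # union-find over the grid

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    idx = np.arange(h * w).reshape(h, w)
    for r, c in zip(*np.nonzero(right)):
        union(idx[r, c], idx[r, c + 1])
    for r, c in zip(*np.nonzero(down)):
        union(idx[r, c], idx[r + 1, c])
    return np.array([find(i) for i in range(h * w)]).reshape(h, w)

# Two flat patches split by a luminance step stay separate, mirroring
# the overfragmentation seen in the recently treated subjects
lum = np.hstack([np.zeros((8, 4)), np.ones((8, 4))])
hue = np.zeros_like(lum)
print(len(np.unique(segment_by_hue_luminance(hue, lum))))  # -> 2
```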



Page 18: Visual Parsing after Recovery from Blindness


Dynamic Visual Input

- Natural visual experience is dynamic and includes motion
- Show individual shapes undergoing independent smooth translational motion and study the percept in a dynamic context (trajectory sketch below)
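A minimal sketch of such trajectories, assuming "smooth translational motion" means a constant random velocity per shape (the deck gives no motion parameters):

```python
import numpy as np

def smooth_trajectories(n_frames=120, n_shapes=2, speed=1.5, seed=0):
    """Per-frame (x, y) offsets for shapes moving independently at
    constant velocity in random directions."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, n_shapes)
    vel = speed * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    t = np.arange(n_frames)[:, None, None]
    return t * vel[None, :, :]   # shape: (frame, shape, xy)

offsets = smooth_trajectories()
print(offsets.shape)  # (120, 2, 2)
```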


Page 19: Visual Parsing after Recovery from Blindness


Dynamic Visual Input

- Motion facilitates object binding (linking together the parts) and segregation from the background in the presence of noise

Excerpt from Ostrovsky et al. (2009):

So far, we have described the recently treated subjects' performance with static imagery. In order to make our experiments more representative of everyday visual experience, which typically involves dynamic inputs, we used a set of stimuli that incorporated motion cues (Fig. 4, Tests A and B). The task was the same as for the tests of static visual parsing: to indicate the number of objects shown. The individual shapes underwent independent smooth translational motion. For overlapping figures, the movement was constrained such that the overlap was maintained at all times.

The inclusion of motion brought about a dramatic change in the recently treated subjects' responses. As Figure 4 indicates, they responded correctly on a majority of the trials. Motion also allowed S.K. to better perceive shapes embedded in noise (Fig. 4, Tests C and D). Motion thus appeared to be instrumental in enabling the recently treated subjects to link together parts of an object and segregate them from the background.

The recently treated subjects' recognition results with real-world images, already summarized, provide evidence of another role that motion might play in their object perception skills. When we examined which images the recently treated subjects were able to recognize, an interesting pattern became evident. As illustrated in Figure 3, the recently treated subjects' recognition responses showed a significant congruence with independently derived motility ratings of the objects depicted in the images (p < .01 in a chi-square test for each of the 3 subjects).

[Figure 4: recently treated and control subjects' performance (% correct) on four tasks, A–D ("How many objects?" twice, "Name the object" twice). Caption: "Recently treated and control subjects' performance on tasks designed to assess the role of dynamic information in object segregation. 'NA' indicates that data are not available for a subject. In the illustrations of the four tasks, the arrows indicate direction of movement."]

[Figure 3 caption: "Motility ratings of the 50 images used to test object recognition and the recently treated subjects' ability to recognize these images. Motility ratings were obtained from 5 normally sighted respondents who were naive as to the purpose of the experiment; the height of the black bar below each object indicates that object's average rating on a scale from 1 (very unlikely to be seen in motion) to 5 (very likely to be seen in motion). The circles indicate correct recognition responses."]


Page 20: Visual Parsing after Recovery from Blindness


Static Recognition and Object Motility

- More “motile” objects were more likely to be recognized (congruence test sketched below)
- Object motion (alongside low-level cues) binds regions into a cohesive whole; learning this cohesive structure facilitates recognition in a static context
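The paper reports a chi-square congruence between motility ratings and recognition (p < .01 per subject) but not the exact contingency construction; one plausible reconstruction on invented data, using a median split of motility against recognition outcome:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical stand-in data: motility ratings (1-5) for 50 images and
# one subject's recognition outcomes. Values are invented purely to
# illustrate the test; see Fig. 3 for the real data.
rng = np.random.default_rng(1)
motility = rng.integers(1, 6, size=50)
recognized = (motility + rng.normal(0.0, 1.0, 50) > 3).astype(int)

high = motility > np.median(motility)   # median split (an assumption)
table = np.array([
    [np.sum(high & (recognized == 1)), np.sum(high & (recognized == 0))],
    [np.sum(~high & (recognized == 1)), np.sum(~high & (recognized == 0))],
])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```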



Page 21: Visual Parsing after Recovery from Blindness


Follow-up Test of Static Visual Parsing

- S.K. (18 mo), P.B. (10 mo), and J.A. (12 mo after treatment) registered significant improvement in static visual parsing
- A critical period for visual learning should not be strictly applied when predicting therapeutic gains

Excerpt from Ostrovsky et al. (2009), continued:

A plausible, though not definitive, explanation of this congruence is that motion of objects helps bind their constituent regions into cohesive representations, which can then be used to recognize instances in new inputs that may be static. It appears, however, that motion information is not used to the exclusion of figural cues, as preliminary tests with point-light walkers of the kind devised by Johansson (1973) proved ineffective for conveying the impression of a person: none of the 3 subjects in the recently treated group was able to perceive such displays as anything other than a collection of moving dots.

We also examined changes in the recently treated subjects' performance as a function of time after treatment. S.K.'s performance pattern was unaltered when we tested him 6 and 12 months after treatment. Given the relatively mature age at which he had received treatment, we were not hopeful of observing much visual recovery. However, follow-up tests conducted at 18 months posttreatment demonstrated that S.K.'s visual skills, although still not normal, had registered a significant improvement. The results are summarized in Figure 5. Essentially, at this time, S.K. could perform tasks with static images that he previously could perform only if motion cues were added. As Figure 5 shows, P.B. and J.A. exhibited a similar improvement in their ability to parse static images when tested several months after initial treatment (10 months for P.B. and 12 months for J.A.).

DISCUSSION

Taken together, these results provide a longitudinal glimpse into the development of visual parsing skills several years postinfancy. They suggest that the early stages of this process are characterized by integrative impairments. These impairments lead to perceptual overfragmentation of images, and thus compromise recognition performance. However, the use of motion information effectively mitigates these integrative difficulties. During the early stages of visual learning, motion appears to be instrumental both in segregating objects and in binding their constituents into representations for recognition.

We derive confidence in the generality of the results described here from the consistency among the 3 subjects, and also their congruence with findings from previously reported case studies of sight recovery. Although the earlier studies on sight recovery in adulthood (Fine et al., 2003; Gregory & Wallace, 1963) did not specifically focus on skills involved in region integration, they reported difficulties consistent with impairments in these skills during recognition of natural images. For instance, Gregory and Wallace (1963), in describing their patient SB, wrote: "We formed the impression that he saw [the natural scenes as] little more than patches of colour" (p. 24). Similarly, as regards simple image parsing, Fine et al. (2003) wrote that MM "described two overlapping transparent squares as three surfaces with the central square in front" (p. 915). Furthermore, in these past cases, as in the present one, motion sensitivity was evident soon after treatment.

The privileged status of motion observed with our recently treated individuals is reminiscent of results reported in the infant literature. Although infants eventually become able to use static figural cues for object segregation (Needham, 1998), segmentation from motion arises at least 2 months prior to the ability to segment from static cues (Arterberry & Yonas, 2000; S.P. Johnson, 2003), and infants' ability to link spatially separated parts of a partially occluded object is initially driven strongly by common motion (S.P. Johnson, Bremner, Slater, Mason, & Foster, 2002; Kellman & Spelke, 1983). It is interesting to find this point of overlap between the developmental progressions of infants and of our recently treated group, given that maturational processes would presumably have already completed their time course in our subjects. The neural underpinnings of this similarity are unclear. However, the perceptual utility of an early sensitivity to motion for both populations […]

[Figure 5: performance (% correct) on four static tasks ("How many objects?" twice, "Name the object", "Trace the long curve"); key: S.K. (1 mo. vs. 18 mo.), J.A. (3 mo. vs. 12 mo.), P.B. (1 mo. vs. 10 mo.). Caption: "The recently treated subjects' performance on four tasks with static displays soon after treatment and at follow-up testing after the passage of several months (indicated in the key)."]


Page 22: Visual Parsing after Recovery from Blindness


Motion Information Available Early

- Motion sensitivity develops early in the primate brain
- Segmentation from motion arises 2 months prior to segmentation from static cues in infants
- Infants link separated parts of a partially occluded object through common motion

Motion Information as a Scaffold

- Motion allows early grouping of parts
- Correlation of motion-based grouping with static cues (e.g., aligned contours) facilitates learning to group via static cues (toy sketch below)
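A toy illustration of the scaffold idea, on entirely synthetic data: grouping labels supplied by common motion are used to tune a static grouping cue, here a single contour-alignment threshold:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# Each "pair of fragments" has a static cue (orientation difference in
# degrees) and a motion difference; same-object pairs share common fate.
align = rng.uniform(0.0, 90.0, n)
same_object = align < 30.0                       # hidden ground truth
motion_diff = np.where(same_object,
                       rng.normal(0.0, 0.2, n),  # common motion
                       rng.normal(2.0, 0.5, n))  # independent motion

# Step 1: group by common motion (available early in development)
motion_label = np.abs(motion_diff) < 1.0

# Step 2: learn a static rule that reproduces the motion-derived
# grouping; a one-parameter threshold keeps the example transparent
thresholds = np.linspace(0.0, 90.0, 91)
acc = [np.mean((align < t) == motion_label) for t in thresholds]
best = thresholds[int(np.argmax(acc))]
print(f"learned static alignment threshold ~ {best:.0f} degrees")
```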



Page 24: Visual Parsing after Recovery from Blindness


Temporal Contiguity and IT cortex

- Disruption of the temporal contiguity of visual experience alters the object specificity of IT neurons (a toy tally of the exposure statistics follows the excerpt)

Excerpt from Li & DiCarlo (2008):

[…] by the visual system? On the basis of theoretical (6–11) and behavioral (12, 13) work, one possibility is that tolerance ("invariance") is learned from the temporal contiguity of object features during natural visual experience, potentially in an unsupervised manner. Specifically, during natural visual experience, objects tend to remain present for seconds or longer, while object motion or viewer motion (e.g., eye movements) tends to cause rapid changes in the retinal image cast by each object over shorter time intervals (hundreds of ms). The ventral visual stream could construct a tolerant object representation by taking advantage of this natural tendency for temporally contiguous retinal images to belong to the same object. If this hypothesis is correct, it might be possible to uncover a neuronal signature of the underlying learning by using targeted alteration of those spatiotemporal statistics (12, 13).

To look for such a signature, we focused on position tolerance. If two objects consistently swapped identity across temporally contiguous changes in retinal position then, after sufficient experience in this "altered" visual world, the visual system might incorrectly associate the neural representations of those objects viewed at different positions into a single object representation (12, 13). We focused on the top level of the primate ventral visual stream, the inferior temporal cortex (IT), where many individual neurons possess position tolerance: they respond preferentially to different objects, and that selectivity is largely maintained across changes in object retinal position, even when images are simply presented to a fixating animal (14, 15).

We tested a strong, "online" form of the temporal contiguity hypothesis: two monkeys visually explored an altered visual world (Fig. 1A, "Exposure phase"), and we paused every ~15 min to test each IT neuron for any change in position tolerance produced by that altered experience (Fig. 1A, "Test phase"). We concentrated on each neuron's responses to two objects that elicited strong (object "P", preferred) and moderate (object "N", nonpreferred) responses, and we tested the position tolerance of that object selectivity by briefly presenting each object at 3° above, below, or at the center of gaze (16) (fig. S1). All neuronal data reported in this study were obtained in these test phases: animal tasks unrelated to the test stimuli; no attentional cueing; and completely randomized, brief presentations of test stimuli (16). We alternated between these two phases (test phase ~5 min; exposure phase ~15 min) until neuronal isolation was lost.

To create the altered visual world ("Exposure phase" in Fig. 1A), each monkey freely viewed the video monitor on which isolated objects appeared intermittently, and its only task was to freely look at each object. This exposure "task" is a natural, automatic primate behavior in that it requires no training. However, by means of real-time eye-tracking (17), the images that played out on the monkey's retina during exploration of this world were under precise experimental control (16). The objects were placed on the video monitor so as to (initially) cast their image at one of two possible retinal positions (+3° or −3°). One of these retinal positions was pre-chosen for targeted alteration in visual experience (the "swap" position; counterbalanced across neurons) (Fig. 1B) (16); the other position acted as a control (the "non-swap" position). The monkey quickly saccaded to each object (mean: 108 ms after object appearance), which rapidly brought the object image to the center of its retina (mean saccade duration 23 ms). When the object had appeared at the non-swap position, its identity remained stable as the monkey saccaded to it, typical of real-world visual experience ("Normal exposure", Fig. 1A) (16). However, when the object had appeared at the swap position, it was always replaced by the other object (e.g., P→N) as the monkey saccaded to it (Fig. 1A, "Swap exposure"). This experience manipulation took advantage of the fact that primates are effectively blind during the brief time it takes to complete a saccade (18). It consistently made the image of one object at a peripheral retinal position (swap position) temporally contiguous with the retinal image of the other object at the center of the retina (Fig. 1).

We recorded from 101 IT neurons while the monkeys were exposed to this altered visual world (isolation held for at least two test phases; n = 50 in monkey 1; 51 in monkey 2). For each neuron, we measured its object selectivity at each position as the difference in response to the two objects (P − N; all key effects were also found with a contrast index of selectivity) (fig. S6). We found that, at the swap position, IT neurons (on average) decreased their initial object selectivity […]

[Figure 1: experimental design and predictions. Caption: "(A) IT responses were tested in 'Test phase' (green boxes, see text), which alternated with 'Exposure phase.' Each exposure phase consisted of 100 normal exposures (50 P→P, 50 N→N) and 100 swap exposures (50 P→N, 50 N→P). Stimulus size was 1.5° (16). (B) Each box shows the exposure-phase design for a single neuron. Arrows show the saccade-induced temporal contiguity of retinal images (arrowheads point to the retinal images occurring later in time, i.e., at the end of the saccade). The swap position was strictly alternated (neuron-by-neuron) so that it was counterbalanced across neurons. (C) Prediction for responses collected in the test phase: if the visual system builds tolerance using temporal contiguity (here driven by saccades), the swap exposure should cause incorrect grouping of two different object images (here P and N). Thus, the predicted effect is a decrease in object selectivity at the swap position that increases with increasing exposure (in the limit, reversing object preference), and little or no change in object selectivity at the non-swap position."]

Li and DiCarlo. Unsupervised natural experience rapidly alters invariant object representation in visual cortex. Science 2008.
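The excerpt does not include a learning model; the contingency the manipulation creates can at least be tallied in a toy simulation (a sketch of the exposure statistics only, not of IT plasticity):

```python
# Each saccade links the peripheral image seen before the saccade with
# the foveal image seen after it. Swap exposures link P (peripheral)
# with N (fovea) and vice versa; normal exposures preserve identity.
assoc = {}  # (peripheral_object, position, foveal_object) -> count

def expose(obj, pos, n):
    foveal = {"P": "N", "N": "P"}[obj] if pos == "swap" else obj
    key = (obj, pos, foveal)
    assoc[key] = assoc.get(key, 0) + n

# One exposure phase as described: 100 normal + 100 swap exposures
for obj in ("P", "N"):
    expose(obj, "non-swap", n=50)   # identity stable across the saccade
    expose(obj, "swap", n=50)       # identity flips mid-saccade

for key, count in sorted(assoc.items()):
    print(key, count)
# P at the swap position is associated only with foveal N: exactly the
# contingency predicted to erode position-tolerant selectivity there
```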


Page 25: Visual Parsing after Recovery from Blindness


Invariant Object Recognition with Slow Feature Analysis

- The slowness principle can be used to learn independent representations of object position, angle, and identity from smoothly moving stimuli (a minimal linear SFA sketch follows the excerpt below)

[Figure 2 of Franzius et al.: model architecture and stimuli. Caption: "An input image is fed into the hierarchical network. The circles in each layer denote the overlapping receptive fields for the SFA nodes and converge towards the top layer. The same set of steps is applied on each layer, which is visualized on the right-hand side."]

Excerpt from Franzius, Wilbert, & Wiskott (2008):

Our main objective here was to show that the relevant features were indeed extracted from the raw image data. The easiest way to do this is by calculating a multivariate linear regression of the SFA output against the known configuration values. While the regression procedure is obviously supervised, it nevertheless shows that the relevant signals are easily accessible. Extracting this information from the raw image data linearly is not possible (especially for the angles and object identity). One should note that the dimension of the model output is smaller than the raw image data by two orders of magnitude. We also checked that the nonlinear expansion without SFA reduces the regression performance to almost chance level.

Since the predicted SFA solutions are cosine functions of the position values (see [9]), one has to map the reference values correspondingly before calculating the regression. The results from the regression are then mapped with the inverse function. For those SFA solutions that code for rotational angles, the theory predicts both sine and cosine functions of the angle value. Therefore we did regressions for both mappings and then calculated the angles via the arctangent. If multiple objects are trained and separated with a blank (see 2.1), the solutions will in general be object dependent. This rules out global regressions for all objects. The easiest way around this is to perform individual regressions for all objects. While this procedure sacrifices the object invariance, it does not affect the invariance under all other transformations.

The object identity of N different objects is optimally encoded (under the SFA objective) by up to N − 1 uncorrelated step functions, which are invariant under all other transformations. In the SFA output this should lead to separated clusters for the trained objects. For untrained objects, those SFA outputs coding for object identity should take new values, thereby separating all objects. To explore the potential classification ability of the model we applied two very simple classifiers on the SFA output: a k-nearest-neighbour and a Gaussian classifier.

Franzius, Wilbert, Wiskott. ICANN 2008.
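A minimal linear implementation of the slowness objective (whiten the signal, then keep the directions whose temporal derivative has the least variance); this is the standard closed-form linear SFA solution, not the authors' hierarchical model:

```python
import numpy as np

def linear_sfa(x, n_components=2):
    """Linear Slow Feature Analysis: unit-variance, decorrelated
    projections of x (time x dims) that vary as slowly as possible."""
    x = x - x.mean(axis=0)
    # Whiten: project onto covariance eigenvectors, scaled to unit variance
    d, E = np.linalg.eigh(np.cov(x, rowvar=False))
    W_white = E / np.sqrt(d)
    z = x @ W_white
    # Slowest directions = smallest eigenvalues of the derivative covariance
    d2, U = np.linalg.eigh(np.cov(np.diff(z, axis=0), rowvar=False))
    return x @ (W_white @ U[:, :n_components])

# A slowly varying latent mixed with a fast one is recovered as slowest
t = np.linspace(0, 8 * np.pi, 2000)
slow, fast = np.sin(0.25 * t), np.sin(11 * t)
mix = np.stack([slow + 0.5 * fast, 0.5 * slow - fast], axis=1)
y = linear_sfa(mix, n_components=1)
print(abs(np.corrcoef(y[:, 0], slow)[0, 1]) > 0.9)  # True (up to sign)
```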


Page 26: Visual Parsing after Recovery from Blindness


Learning Static Segmentation from Motion

- Segmentation According to Natural Examples (SANE): learns a static segmentation model from motion-labeled training data (pipeline sketch after the excerpt below)

[Figure 11: example segmentations, panels "SANE" and "cgtg (5)", axis labels "SANE better" / "Martin better". Caption: "The examples for which color, multiresolution SANE most outperformed cgtg Martin with features radius 5 using match distance 3, and vice versa."]

the Martin detectors in any comparison to SANE.The results in Figure 10 make it clear that SANE outperforms

Martin on these data sets. The examples in Figure 11 illustrate thatSANE’s advantage lies in its higher precision. While the Martindetectors have little trouble finding object boundaries, they alsodetect many other non-object boundaries in the process. In all fourdata sets, the color, multiresolution SANE f-measures outperformall Martin detectors at all feature radii using the strict matchingdistance of 1. As the match maximum distance relaxes to 3 and 5,the Martin results improve, especially their recall. At radius 3, thebest Martin results only match SANE on the traffic data. At radius5, the best results match or slightly surpass SANE on traffic, andmatch mwalkA and mwalkC, but still lag behind on the walkingdata. Figure 11 contains examples in which SANE’s performancewas superior to cgtg with feature radius 5 (one of the overall best-performing Martin detectors) using match maximum distance 3.Most of the Martin results are riddled with non-object boundariescompared to the SANE output. The examples for which Martinperformed better demonstrate that extreme SANE recall failuresare responsible for several of the cases in which cgtg is superior.

Much of the performance disparity could derive from the data sets that each algorithm was originally designed for. Although the Martin detectors were retrained on the SANE data sets, the Berkeley Segmentation Database contains low-noise, high-resolution images, and much of the success of the Martin detectors on that set appears to be due to the use of the sophisticated texture gradient features [22]. The SANE sample images have much lower resolution and much poorer quality. There is less texture information available in each image, and boundaries are not as sharp. SANE's local edge detectors are simple, but the shape model compensates. Also, the segmentations in the Berkeley database can include visually significant internal object regions, and it is unsurprising that a detector designed to detect these boundaries will have difficulty learning to ignore them when they do not correspond to object boundaries. The shape information stored in the SANE model appears to be more useful on this data than the sophisticated local Martin detectors. The fact that the SANE data requires different segmentation skills than the Berkeley database is a strong argument for its potential utility, in addition to the aforementioned advantages of learning from motion-labeled training data.

E. Comparing SANE and LOCUS

To further investigate SANE's performance, we compared it to LOCUS, a learning-based, top-down segmentation algorithm developed by Winn and Jojic [45]. LOCUS segments a group of images based on the assumption that they each contain an instance of the same object class. Unlike SANE, which segments each image independently, LOCUS jointly segments a collection of images, allowing information in any image to influence the segmentation of the others.

LOCUS provides a good comparison to SANE. Its shape model, which includes a probabilistic template describing the expected overall shape of the objects, is much more global than SANE's shape model, which learns the relationships between neighboring boundary segments. Furthermore, LOCUS does not attempt to learn a common image model, while SANE segments new images using previously learned models of the image properties of object boundaries.

LOCUS typically operates without supervision, but to provide a fair comparison with SANE it was also run in a supervised mode. This was accomplished by adding the training examples to the collection of images and fixing their segmentations to be the motion-defined ground truth. In this way, their shape information influences the joint segmentation process on the other images. LOCUS was run on the walking and mwalkA data sets. It could not be run on the traffic data because many of those images contain multiple cars and LOCUS is not designed to segment multiple objects in an image.

The LOCUS algorithm requires a set of equal-sized images. Therefore, it was provided with the uncropped images and bounding boxes equivalent to the crops used to create the SANE data sets. On each test image, the pixels external to the bounding box were assigned the "background" label, while the segmentation of the interior pixels was inferred by the algorithm. As a result, the LOCUS algorithm used the areas in each test image that lacked moving objects as labeled training data for the construction of the background image model for that image. When training and testing the SANE models, these regions were simply discarded on the assumption that in the absence of motion the true segmentation was unknown and unavailable to train or test against.

The results of the comparison are in Figure 12. On the walking data sets, performance of unsupervised and supervised LOCUS was very similar to that of color, multiresolution SANE. LOCUS had higher recall, but lower precision. The small advantage of SANE in f-measure was due to a few cases in which LOCUS failed to find any object in an image; several examples can be seen in the figure. These examples typically occurred when the figure was far over to the side of the bounding box, which caused LOCUS to fail because it became stuck in a local minimum that labeled all pixels as "background."

LOCUS outperformed SANE on the mwalkA data, especially in supervised mode. Examining the three images on which supervised LOCUS most outperformed SANE, it appears that LOCUS's strong global shape model enabled successful segmentation…

Ross and Kaelbling. Segmentation According to Natural Examples: Learning Static Segmentation from Motion Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2008.

Structure from Motion

I Recover 3D structure and camera motion from multiple images without correspondence

…where … and … are the measurements whose assignments will be swapped, and … and … are the projections of the features originally assigned to them.

To conclude the E-step and compute the virtual measurements in (11), the only thing left to do is to compute the marginal probabilities from the sample. Fortunately, this can be done without explicitly storing the samples by keeping running counts of how many times each measurement is assigned to each feature, and using that to compute the probabilities. If we define … to be this count, we have:

(15)
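A plausible reconstruction of (15), assuming it is the standard Monte Carlo estimate of an assignment marginal over R sampler steps (the symbols f_{ik}, C_{ik}, and R are illustrative, not necessarily the paper's notation):

\[ \hat{f}_{ik} \;\approx\; \frac{C_{ik}}{R}, \]

where C_{ik} counts how many of the R sampled assignments map measurement k to feature i.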

4.3 Implementation in Practice

The pseudo-code for the final algorithm is as follows:

1. Generate an initial structure and motion estimate.

2. Given the current estimate and the data, run the Metropolis sampler in each image to obtain approximate values for the weights, using equation (15).

3. Calculate the virtual measurements with (11).

4. Find the new estimate for structure and motion using the virtual measurements as data. This can be done using any SFM method compatible with the projection model assumed.

5. If not converged, return to step 2.

To avoid getting stuck in local minima, it is important in practice to add annealing to this basic scheme. In annealing we artificially increase the noise parameter σ for the early iterations, gradually decreasing it to its correct value. This has two beneficial consequences. First, the posterior distribution will be less peaked when σ is high, so that the Metropolis sampler will explore the space of assignments more easily and avoid getting stuck on islands of high probability. Second, the expected log likelihood is smoother and has fewer local maxima at higher values of σ. We use an exponentially decreasing annealing scheme, but have found that the algorithm is not sensitive to the exact scheme used.
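A minimal, self-contained sketch of the E-step just described, on a toy one-dimensional measurement model with a Gaussian likelihood (all data and names are illustrative; in the full algorithm the noise parameter sigma would follow the annealing schedule rather than stay fixed):

import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each feature i predicts a projection p[i]; the observed
# measurements u are a noisy, unknown permutation of those projections.
p = np.array([0.0, 1.0, 2.0])        # predicted feature projections
u = np.array([1.05, 2.10, -0.02])    # observed measurements
sigma = 0.5                          # noise parameter (annealed in practice)

def log_lik(J):
    # Gaussian model: assignment J says feature i explains measurement u[J[i]].
    return -np.sum((u[J] - p) ** 2) / (2.0 * sigma ** 2)

J = rng.permutation(len(p))          # random initial assignment
R = 10_000                           # sampler steps per image
C = np.zeros((len(p), len(u)))       # running assignment counts
for _ in range(R):
    a, b = rng.choice(len(p), size=2, replace=False)
    J_prop = J.copy()
    J_prop[a], J_prop[b] = J_prop[b], J_prop[a]          # propose a swap
    if np.log(rng.random()) < log_lik(J_prop) - log_lik(J):
        J = J_prop                                       # Metropolis accept
    C[np.arange(len(p)), J] += 1

f = C / R   # Monte Carlo marginals from running counts, as in (15)
print(np.round(f, 2))  # each row should concentrate on the true match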

5 Results

In this section we show the results obtained with three different sets of images. For each set we highlight a particular property of our method. For all the results we present, the input to the algorithm was a set of manually obtained image measurements. To initialize the estimate, the 3D points were generated randomly in a normally distributed cloud around a depth of 1, whereas the cameras were all initialized at the origin. We ran the EM algorithm for 100 iterations each time, with the annealing parameter decreasing exponentially from 25 pixels to 1 pixel. For each EM iteration, we ran the sampler in each image for 10000 steps. An entire run takes about a minute of CPU time on a standard PC. As is typical for EM, the algorithm can sometimes get stuck in local minima, in which case we restart it manually.

In practice, the algorithm converges consistently and fast to an estimate for the structure and motion where the correct correspondence is the most probable one, and where most if not all assignments in the different images agree with each other. We illustrate this using the image set shown in Figure 4, which was taken under orthographic projection. The typical evolution of the algorithm is illustrated in Figure 5, where we have shown a wire-frame model of the recovered structure at successive instants of time. There are two important points to note: (a) the gross structure is recovered in the very first iteration, starting from random initial structure, and (b) finer details of the structure are gradually resolved as the parameter σ is decreased. The estimate for the structure after convergence is almost identical to the one found by factorization when given the correct correspondence. Incidentally, we found the algorithm converges less often when we replace the random initialization by a 'good' initial estimate where all the points in some image are pro…

Figure 4. Three out of 11 cube images. Although the images were originally taken as a sequence in time, the ordering of the images is irrelevant to our method.

Figure 5. The structure estimate as initialized and at successive iterations of the algorithm (t=0, σ=0.0; t=1, σ=25.1; t=3, σ=23.5; t=10, σ=18.7; t=20, σ=13.5; t=100, σ=1.0).
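Read literally, "decreasing exponentially from 25 pixels to 1 pixel" over 100 iterations corresponds to a geometric schedule. The following is a reconstruction from the stated endpoints, not a formula given in the excerpt:

\[ \sigma_t \;=\; 25 \cdot \left(\tfrac{1}{25}\right)^{(t-1)/99}, \qquad t = 1, \ldots, 100, \]

which gives σ_1 = 25 and σ_100 = 1 and is consistent with the intermediate values shown in the Figure 5 panel labels.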

Dellaert, Seitz, Thorpe, and Thrun. Structure from Motion without Correspondence. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2000.

Future Directions (Suggestions for Presentations...)

I Evidence for motion-based learning of object binding?

I Evidence from motion-alteration or temporal-contiguity-altering experiments?

I Neuroscience and SFA?

I Bootstrapped learning of novel objects?
