
Baby’s Eye View: Temporal Dynamics of Rapid Visual Object Learning



Nicholas Butko ♦ Dept. of Cognitive Science ♦ UCSD ♦ [email protected]

Ian Fasel, Javier Movellan ♦ Institute for Neural Computation ♦ {ianfasel, movellan}@mplab.ucsd.edu

Motivation

We set out to explore the nature of the visual information that neonates have available to them. Is it enough to learn detailed object categories reliably? If so, this provides evidence that an alternative to the dominant paradigm is feasible, namely that infants may not be born with the ability to recognize conspecifics.

Hyper Adaptation?

“We wish to propose the general term CONSPEC to refer to a unit of mental architecture in any species that ... contains structural information concerning the visual characteristics of conspecifics.” [bold emph. added]

--Morton & Johnson, Psych. Review, 1991

Computational Model

Social Contingency

Baby Robot BEV

Can Faces Be Learned?

Rapid Learning

Hypothesis

BEV Dataset

• We hope to show that infants might be able to gather enough information to learn to locate faces very quickly.

• To this end, we will gather visual information and other information available to infants.

• We will present a computational learning system showing that, using only this limited information, faces can be reliably located in images.

Current Hypotheses:

• To collect data from a Baby’s Eye View, we created BEV, a simple baby robot.

• BEV has two sensors:
- A microphone in the chest to detect overall volume.
- An IEEE1394 webcam in the forehead, capturing unrectified 320x240-pixel images.

• BEV has one actuator:
- A monaural speaker in the chest, for vocalization.

Contingency Detection and Data Collection:

Schematic Generalization

Evidence for Conspecific Processing: [Johnson et al., Cognition, 1991]

• Social Hypothesis: - Infants are genetically predisposed to look at things that look like human faces.

• Sensory Hypothesis: - Infants look for general visual features that faces happen to share.

--Kleiner & Banks, Experimental Psych., Human Perception & Performance 1987

Infants quickly become interested in certain aspects of the visual scene presented to them, and learn to attend to specific salient things.

Bushnell et al. (1989)

- 2-day-old infants fixate longer on images of their mothers than on images of other women with similar hair color and facial complexion.

Evidence for Rapid Learning:

Segmental Boltzmann Processes


• Watson (1972) found that two-month-old infants exhibit social responses to contingent mobiles, indicating that infants use contingency as a means of identifying caregivers.

• Movellan & Watson (1985) found that ten-month-old infants are highly optimized detectors of contingency.

• Movellan (2002) developed a model of this optimal contingency detection based on the principles of information maximization and optimal control.

• For this experiment we used auditory contingency as a cue for the presence of a person. Other cues, such as touch or uninitiated motion, may be more appropriate for neonates and should produce similar results. (A minimal sketch of this style of contingency detection appears below.)
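As a rough illustration of this style of contingency detection, the sketch below treats each vocalize-then-listen trial as a Bernoulli observation and accumulates a sequential log-likelihood ratio for "a responsive person is present." The response probabilities, prior, and all names are illustrative assumptions, not the parameters of Movellan's (2002) model or of BEV's actual software.

```python
import math

# Illustrative parameters (assumptions, not the poster's actual values):
# probability that a vocalization is followed by a loud sound within the
# response window when a responsive person is present vs. absent.
P_RESPONSE_IF_PERSON = 0.7
P_RESPONSE_IF_NO_PERSON = 0.1
PRIOR_PERSON = 0.5      # prior belief that a person is present
CONFIDENCE = 0.975      # the 97.5% confidence level used in the data-collection rule


def update_log_odds(log_odds: float, responded: bool) -> float:
    """Add the log-likelihood ratio of one vocalize-then-listen trial."""
    if responded:
        log_odds += math.log(P_RESPONSE_IF_PERSON / P_RESPONSE_IF_NO_PERSON)
    else:
        log_odds += math.log((1 - P_RESPONSE_IF_PERSON) /
                             (1 - P_RESPONSE_IF_NO_PERSON))
    return log_odds


def posterior_person(log_odds: float) -> float:
    """Convert accumulated log-odds into P(person present | evidence)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    log_odds = math.log(PRIOR_PERSON / (1 - PRIOR_PERSON))
    # Simulated outcomes of successive vocalize-then-listen trials.
    for responded in [True, True, False, True, True]:
        log_odds = update_log_odds(log_odds, responded)
        p = posterior_person(log_odds)
        if p > CONFIDENCE:
            print(f"person present (p={p:.3f})")
        elif p < 1 - CONFIDENCE:
            print(f"person absent (p={p:.3f})")
        else:
            print(f"undecided (p={p:.3f})")
```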

• Fasel & Movellan (2006) developed a novel visual learning algorithm called “Segmental Boltzmann processes”

• This algorithm is weakly supervised: it requires one label per image, indicating whether an object of the category of interest is present in that image with probability better than chance.

• From this weak label, the algorithm learns to localize the object of interest in novel images, or indicate that the object is not present.

• The algorithm is a probabilistic model that looks for “objects”: clusters of pixels that are codependent but independent of the rest of the image.

• Segmental Boltzmann Processes can be viewed as a connectionist architecture, simulating 4,000,000 neurons running in real time (30 frames per second).

• Segmental Boltzmann Processes are ideal for multimodal learning, in which a secondary modality can provide a better-than-chance label for the presence of an object in the visual field (see the sketch below).
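As a toy illustration of this weakly supervised setting (one noisy image-level label; object pixels modeled separately from the background), the sketch below learns foreground and background intensity histograms from weakly labeled images and produces a per-pixel posterior map. It is not the Segmental Boltzmann Process algorithm of Fasel & Movellan (2006); the quantized-intensity feature, the candidate-pixel heuristic, and all names are simplifying assumptions.

```python
import numpy as np

BINS = 16  # number of intensity bins for the toy appearance models


def histogram(pixels: np.ndarray) -> np.ndarray:
    """Smoothed intensity histogram (a crude appearance model)."""
    counts, _ = np.histogram(pixels, bins=BINS, range=(0.0, 1.0))
    counts = counts.astype(float) + 1.0          # Laplace smoothing
    return counts / counts.sum()


def fit(positives, negatives):
    # Background model from images weakly labeled "object absent".
    bg = histogram(np.concatenate([im.ravel() for im in negatives]))
    # Candidate object pixels: pixels in positive images that are
    # poorly explained by the background model.
    pos_pixels = np.concatenate([im.ravel() for im in positives])
    bins = np.minimum((pos_pixels * BINS).astype(int), BINS - 1)
    candidates = pos_pixels[bg[bins] < bg.mean()]
    fg = histogram(candidates if candidates.size else pos_pixels)
    return fg, bg


def posterior_map(image, fg, bg, prior=0.5):
    """Per-pixel posterior that the pixel belongs to the object."""
    bins = np.minimum((image * BINS).astype(int), BINS - 1)
    p_fg, p_bg = fg[bins] * prior, bg[bins] * (1 - prior)
    return p_fg / (p_fg + p_bg)


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Synthetic data: dark background, bright square object.
    def make(with_object):
        im = rng.uniform(0.0, 0.3, size=(24, 24))
        if with_object:
            im[8:16, 8:16] = rng.uniform(0.7, 1.0, size=(8, 8))
        return im

    fg, bg = fit([make(True) for _ in range(20)],
                 [make(False) for _ in range(20)])
    pmap = posterior_map(make(True), fg, bg)
    print("most object-like pixel:", np.unravel_index(pmap.argmax(), pmap.shape))
```

On this synthetic data the most object-like pixel falls inside the bright square, which is the flavor of result the weak labels are meant to support.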


• BEV was attached to a Mac PowerBook G4 laptop that ran the contingency-detection software and stored data continuously for the 88 minutes that the experiment was in progress, making moment-to-moment decisions about how best to vocalize in order to detect people.

• BEV used her speaker to utter baby sounds collected from the Internet. There were five sounds, ranked by the experimenters from high to low excitement; the more excited sounds were uttered when higher levels of contingency were detected.

• Nine subjects were asked to interact with BEV so as to make her excited.

• An image was added to the dataset whenever (a) a vocalization was made and (b) BEV was 97.5% confident that a person was present or absent. (A sketch of this data-collection rule appears below.)
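A minimal sketch of the vocalization and data-collection rule, assuming a stand-in contingency posterior and hypothetical sound-file names (BEV's actual software and audio are not reproduced here):

```python
import random

# Hypothetical filenames, ordered from most to least excited.
SOUNDS = ["sound_1.wav", "sound_2.wav", "sound_3.wav", "sound_4.wav", "sound_5.wav"]
CONFIDENCE = 0.975


def choose_sound(p_person: float) -> str:
    """Map the current contingency posterior to one of five sounds,
    with more excited sounds for higher posteriors."""
    index = min(int((1.0 - p_person) * len(SOUNDS)), len(SOUNDS) - 1)
    return SOUNDS[index]


def maybe_record(dataset, frame, p_person: float, vocalized: bool):
    """Store a weakly labeled frame only when a vocalization was just made
    and the posterior is confidently high or confidently low."""
    if vocalized and (p_person > CONFIDENCE or p_person < 1 - CONFIDENCE):
        dataset.append((frame, p_person > CONFIDENCE))


if __name__ == "__main__":
    dataset = []
    for step in range(10):
        p_person = random.random()      # stand-in for the contingency posterior
        frame = f"frame_{step:04d}"     # stand-in for a 320x240 camera image
        print("uttering", choose_sound(p_person))
        maybe_record(dataset, frame, p_person, vocalized=True)
    print(f"{len(dataset)} weakly labeled frames recorded")
```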


• 3,700 images collected over 90 minutes of interaction.

• No experimenter intervention.

• Variety of lighting and background conditions.

• No post-processing of images (rectification, etc.)

[Example frames: Contingency Detected vs. No Contingency Detected]

18% - No face; 4% - No person
17% - Face; 20% - Person

Very little information required:

[Figure: max-posterior estimates and posterior probability maps after 0, 3, and 6 minutes of interaction (of 3,700 images / 90 minutes).]
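As a rough sketch of how a max-posterior display can be read off a posterior probability map, the code below scans a fixed-size window over the map and reports the most probable object region; the window size and the synthetic map are assumptions for illustration, not part of the poster's method.

```python
import numpy as np


def max_posterior_window(pmap: np.ndarray, size: int = 8):
    """Return (row, col) of the top-left corner of the window whose
    summed posterior is largest."""
    h, w = pmap.shape
    best, best_rc = -np.inf, (0, 0)
    for r in range(h - size + 1):
        for c in range(w - size + 1):
            score = pmap[r:r + size, c:c + size].sum()
            if score > best:
                best, best_rc = score, (r, c)
    return best_rc


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pmap = rng.uniform(0.0, 0.2, size=(24, 24))  # mostly low posterior
    pmap[10:18, 5:13] = 0.9                      # a blob of high posterior
    print("max-posterior window at", max_posterior_window(pmap))
```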

“We would not expect the experience of the mother’s face to transfer to the two-dimensional schematic stimuli used with newborns.”

--Morton & Johnson, Psych. Review, 1991


Performance of one BEV-Trained SBP learner on Johnson Stimuli:

Performance of all BEV-Trained SBP learners on Johnson Stimuli: