Bayesian models of human learning and reasoning Josh Tenenbaum MIT Department of Brain and Cognitive Sciences Computer Science and AI Lab (CSAIL)


Lab members: Charles Kemp, Pat Shafto, Vikash Mansinghka, Amy Perfors, Lauren Schmidt, Chris Baker, Noah Goodman, Tom Griffiths*

Funding: AFOSR Cognition and Decision Program, AFOSR MURI, DARPA IPTO, NSF, HSARPA, NTT Communication Sciences Laboratories, James S. McDonnell Foundation

The probabilistic revolution in AI

• Principled and effective solutions for inductive inference from ambiguous data:
  – Vision
  – Robotics
  – Machine learning
  – Expert systems / reasoning
  – Natural language processing

• Standard view: no necessary connection to how the human brain solves these problems.

Bayesian models of cognition

Visual perception [Weiss, Simoncelli, Adelson, Richards, Freeman, Feldman, Kersten, Knill, Maloney, Olshausen, Jacobs, Pouget, ...]

Language acquisition and processing [Brent, de Marken, Niyogi, Klein, Manning, Jurafsky, Keller, Levy, Hale, Johnson, Griffiths, Perfors, Tenenbaum, …]

Motor learning and motor control [Ghahramani, Jordan, Wolpert, Kording, Kawato, Doya, Todorov, Shadmehr, …]

Associative learning [Dayan, Daw, Kakade, Courville, Touretzky, Kruschke, …]

Memory [Anderson, Schooler, Shiffrin, Steyvers, Griffiths, McClelland, …]

Attention [Mozer, Huber, Torralba, Oliva, Geisler, Yu, Itti, Baldi, …]

Categorization and concept learning [Anderson, Nosofsky, Rehder, Navarro, Griffiths, Feldman, Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka, …]

Reasoning [Chater, Oaksford, Sloman, McKenzie, Heit, Tenenbaum, Kemp, …]

Causal inference [Waldmann, Sloman, Steyvers, Griffiths, Tenenbaum, Yuille, …]

Decision making and theory of mind [Lee, Stankiewicz, Rao, Baker, Goodman, Tenenbaum, …]

Everyday inductive leaps

How can people learn so much about the world from such limited evidence?
  – Learning concepts from examples

“horse” “horse” “horse”

Learning concepts from examples

“tufa”

“tufa”

“tufa”

Everyday inductive leaps

How can people learn so much about the world from such limited evidence?
  – Kinds of objects and their properties
  – The meanings of words, phrases, and sentences
  – Cause-effect relations
  – The beliefs, goals and plans of other people
  – Social structures, conventions, and rules

Modeling Goals

• Principled quantitative models of human behavior, with broad coverage and a minimum of free parameters and ad hoc assumptions.

• Explain how and why human learning and reasoning work, in terms of (approximations to) optimal statistical inference in natural environments.

• A framework for studying people’s implicit knowledge about the structure of the world: how it is structured, used, and acquired.

• A two-way bridge to state-of-the-art AI.

The approach: from statistics to intelligence

1. How does background knowledge guide learning from sparsely observed data?
   Bayesian inference:

   $$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h_i \in H} P(d \mid h_i)\,P(h_i)}$$

2. What form does background knowledge take, across different domains and tasks?
   Probabilities defined over structured representations: graphs, grammars, predicate logic, schemas, theories.

3. How is background knowledge itself acquired?
   Hierarchical probabilistic models, with inference at multiple levels of abstraction.
   Flexible nonparametric models in which complexity grows with the data.

Basics of Bayesian inference

• Bayes’ rule:

  $$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h_i \in H} P(d \mid h_i)\,P(h_i)}$$

• An example
  – Data: John is coughing
  – Some hypotheses:
    1. John has a cold
    2. John has lung cancer
    3. John has a stomach flu
  – Likelihood P(d|h) favors 1 and 2 over 3
  – Prior probability P(h) favors 1 and 3 over 2
  – Posterior probability P(h|d) favors 1 over 2 and 3
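The coughing example can be made concrete with a few lines of code. This is only a sketch: the prior and likelihood values below are hypothetical numbers chosen to reproduce the qualitative pattern on the slide (likelihood favors cold and lung cancer, prior favors cold and stomach flu), and the function name `posterior` is an illustrative choice.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule: P(h|d) = P(d|h) P(h) / sum_i P(d|h_i) P(h_i)."""
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(unnormalized.values())  # the denominator, P(d)
    return {h: value / evidence for h, value in unnormalized.items()}


# Hypothetical values, not taken from the talk.
priors = {"cold": 0.50, "lung cancer": 0.01, "stomach flu": 0.49}       # P(h)
likelihoods = {"cold": 0.80, "lung cancer": 0.80, "stomach flu": 0.10}  # P(cough | h)

post = posterior(priors, likelihoods)
for hypothesis, prob in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis}: {prob:.3f}")
```

With these assumed numbers the cold hypothesis dominates the posterior, because it is the only hypothesis supported by both the prior and the likelihood.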

• You read about a movie that has made $60 million to date. How much money will it make in total?

• You see that something has been baking in the oven for 34 minutes. How long until it’s ready?

• You meet someone who is 78 years old. How long will they live?

• Your friend quotes to you from line 17 of his favorite poem. How long is the poem?

• You meet a US congressman who has served for 11 years. How long will he serve in total?

• You encounter a phenomenon or event with an unknown extent or duration, t_total, at a random time or value of t < t_total. What is the total extent or duration t_total?

Everyday prediction problems (Griffiths & Tenenbaum, 2006)

Bayesian analysis

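$$p(t_{\text{total}} \mid t) \;\propto\; p(t \mid t_{\text{total}})\, p(t_{\text{total}}) \;\propto\; \frac{1}{t_{\text{total}}}\, p(t_{\text{total}})$$

Assume t is a random sample: p(t | t_total) = 1/t_total for 0 < t < t_total, else 0.

Form of p(t_total)? E.g., uninformative (Jeffreys) prior p(t_total) ∝ 1/t_total.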
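As a worked case, the median prediction under the uninformative prior can be derived in closed form. The derivation below assumes only the random-sampling likelihood and the Jeffreys prior stated above; the integration variable s and the threshold x are notation introduced here.

```latex
\begin{align*}
p(t_{\text{total}} \mid t) &\propto \frac{1}{t_{\text{total}}} \cdot \frac{1}{t_{\text{total}}}
  = \frac{1}{t_{\text{total}}^{2}}, \qquad t_{\text{total}} > t \\
\int_{t}^{\infty} \frac{ds}{s^{2}} = \frac{1}{t}
  &\;\Rightarrow\; p(t_{\text{total}} \mid t) = \frac{t}{t_{\text{total}}^{2}} \\
P(t_{\text{total}} > x \mid t) &= \int_{x}^{\infty} \frac{t}{s^{2}}\, ds = \frac{t}{x}, \qquad x \ge t \\
\frac{t}{x} = \frac{1}{2} &\;\Rightarrow\; x = 2t
\end{align*}
```

Under this prior alone, the median guess is always twice the observed value (for example, 34 minutes in the oven would give 68 minutes in total). Priors measured for a specific class of events would generally give different answers.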

Priors P(t_total) based on empirically measured durations or magnitudes for many real-world events in each class:

Median human judgments of the total duration or magnitude t_total of events in each class, given that they are first observed at a duration or magnitude t, versus Bayesian predictions (median of P(t_total | t)).

“tufa” “tufa”

“tufa”

Concept learning

Bayesian inference over tree-structured hypothesis space:

(Xu & Tenenbaum; Schmidt & Tenenbaum)

Some questions

• How confident are we that a tree-structured model is the best way to characterize this learning task?

• How do people construct an appropriate tree-structured hypothesis space?

• What other kinds of structured probabilistic models may be needed to explain other inductive leaps that people make, and how do people acquire these different structured models?

• Are there general unifying principles that explain our capacity to learn and reason with structured probabilistic models across different domains?

• Property induction

“Similarity”, “Typicality”, “Diversity”

Gorillas have T9 hormones.
Seals have T9 hormones.
Squirrels have T9 hormones.
--------------------
Horses have T9 hormones.

Gorillas have T9 hormones.
Chimps have T9 hormones.
Monkeys have T9 hormones.
Baboons have T9 hormones.
--------------------
Horses have T9 hormones.

Gorillas have T9 hormones.
Seals have T9 hormones.
Squirrels have T9 hormones.
--------------------
Flies have T9 hormones.

How can people generalize new concepts from just a few examples?

The computational problem (c.f., semi-supervised learning)

[Figure: matrix of animals (Horse, Cow, Chimp, Gorilla, Mouse, Squirrel, Dolphin, Seal, Rhino, Elephant) by features, plus a "New property" column whose unknown entries are marked "?".]

85 features for 50 animals (Osherson et al.): e.g., for Elephant: ‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘fourlegs’,…

Model predictions

Human judgments of argument strength

Similarity-based models

Gorillas have property P.
Mice have property P.
Seals have property P.

All mammals have property P.

Cows have property P.
Elephants have property P.
Horses have property P.


Beyond similarity-based induction

• Reasoning based on dimensional thresholds: (Smith et al., 1993)

Poodles can bite through wire.
--------------------
German shepherds can bite through wire.

Dobermans can bite through wire.
--------------------
German shepherds can bite through wire.

• Reasoning based on causal relations: (Medin et al., 2004; Coley & Shafto, 2003)

Salmon carry E. Spirus bacteria.
--------------------
Grizzly bears carry E. Spirus bacteria.



Different sources for priors

Chimps have T9 hormones.

Gorillas have T9 hormones.

Poodles can bite through wire.

Dobermans can bite through wire.



Taxonomic similarity

Jaw strength

Food web relations

Hierarchical Bayesian Framework

[Figure: background knowledge and data at three levels.
F: form (tree with species at leaf nodes), with prior P(form)
S: structure (a tree over mouse, squirrel, chimp, gorilla), with P(structure | form)
D: data (features F1–F4 for each species, plus the new property "Has T9 hormones" with unknown entries marked "?"), with P(data | structure)]

The value of structural form knowledge: inductive bias

Hierarchical Bayesian Framework

[Figure: F: form (tree with species at leaf nodes) → S: structure (a tree over mouse, squirrel, chimp, gorilla) → D: data (features F1–F4 for each species, plus the new property "Has T9 hormones" with unknown entries marked "?").]

Property induction

P(D|S): How the structure constrains the data of experience

• Define a stochastic process over structure S that generates hypotheses h.
  – Intuitively, properties should vary smoothly over structure.
  – Smooth: P(h) high
  – Not smooth: P(h) low

[Figure: a continuous function y is defined over the structure S by a Gaussian process (~ random walk, diffusion), then passed through a threshold to give a binary hypothesis h.] [Zhu, Ghahramani & Lafferty 2003]
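The smoothness idea can be sketched numerically. The code below is an illustration under stated assumptions, not the model from the talk: it uses a hypothetical seven-node tree over four species, a covariance of the form (L + I/σ²)⁻¹ built from the graph Laplacian L, an arbitrary σ, and a threshold at zero. Names such as `agreement` are introduced here.

```python
import numpy as np

# Hypothetical tree: leaves 0-3 are species, 4 and 5 are ancestors, 6 is the root.
species = ["mouse", "squirrel", "chimp", "gorilla"]
edges = [(0, 4), (1, 4), (2, 5), (3, 5), (4, 6), (5, 6)]
n_nodes = 7

# Graph Laplacian L = degree matrix - adjacency matrix.
adjacency = np.zeros((n_nodes, n_nodes))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

# Assumed covariance: regularized inverse Laplacian, so nearby nodes covary strongly.
sigma = 2.0  # illustrative value
K = np.linalg.inv(laplacian + np.eye(n_nodes) / sigma**2)
K = (K + K.T) / 2.0  # remove tiny numerical asymmetry

# Draw continuous functions y over the tree, then threshold at 0 to get binary h.
rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros(n_nodes), K, size=20000)
h = y[:, : len(species)] > 0.0


def agreement(a, b):
    """Fraction of sampled hypotheses in which species a and b share the property."""
    return float(np.mean(h[:, a] == h[:, b]))


print("mouse/squirrel agreement:", agreement(0, 1))
print("mouse/chimp agreement:   ", agreement(0, 2))
```

In this sketch, species that are close in the tree share the sampled property more often than distant ones, which is the sense in which smooth hypotheses receive higher prior probability.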

[Figure: Structure S over Species 1 through Species 10 generates Data D, a species-by-features matrix.] [c.f., Lawrence, 2004; Smola & Kondor 2003]

[Figure: the same Structure S and Data D over Species 1 through Species 10, with an added "New property" column whose unknown entries are marked "?".]


Gorillas have property P.
Mice have property P.
Seals have property P.

Cows have property P.
Elephants have property P.
Horses have property P.

[Figure: results for two structures, Tree and 2D.]

Reasoning about spatially varying properties

“Native American artifacts” task

Property type                   Theory structure
“has T9 hormones”               taxonomic tree + diffusion process
“can bite through wire”         directed chain + drift process
“carry E. Spirus bacteria”      directed network + noisy transmission

[Figure: hypotheses defined over Class A through Class G.]

Reasoning with two property types

[Figure: two property types (biological property, disease property) and two structures (Tree, Web) over the species Kelp, Human, Dolphin, Sand shark, Mako shark, Tuna, Herring.]

(Shafto, Kemp, Bonawitz, Coley & Tenenbaum)

“Given that X has property P, how likely is it that Y does?”

Summary so far

• A framework for modeling human inductive reasoning as rational statistical inference over structured knowledge representations
  – Qualitatively different priors are appropriate for different domains of property induction.
  – In each domain, a prior that matches the world’s structure fits people’s judgments well, and better than alternative priors.
  – A language for representing different theories: graph structure defined over objects + probabilistic model for the distribution of properties over that graph.

• Remaining question: How can we learn appropriate theories for different domains?


[Figure: F: form (Tree, Space, or Chain) → S: structure (an arrangement of mouse, squirrel, chimp, gorilla under that form) → D: data (features F1–F4 for each species).]

Discovering structural forms

[Figure: Ostrich, Robin, Crocodile, Snake, Bat, Orangutan, and Turtle arranged under different structural forms; one arrangement also includes Angel, God, Rock, and Plant.]

Discovering structural forms

People can discover structural forms

• Scientific discoveries
  – Tree structure for biological species
  – Periodic structure for chemical elements

• Children’s cognitive development
  – Hierarchical structure of category labels
  – Clique structure of social groups
  – Cyclical structure of seasons or days of the week
  – Transitive structure for value

Linnaeus, Systema Naturae (1735): Kingdom Animalia, Phylum Chordata, Class Mammalia, Order Primates, Family Hominidae, Genus Homo, Species Homo sapiens

“Great chain of being”

(1579) (1837)

Typical structure learning algorithms assume a fixed structural form

Structural form      Typical algorithms
Flat clusters        K-Means, mixture models, competitive learning
Line                 Guttman scaling, ideal point models
Tree                 Hierarchical clustering, Bayesian phylogenetics
Circle               Circumplex models
Euclidean space      MDS, PCA, factor analysis
Grid                 Self-organizing map, generative topographic mapping

The ultimate goal

“Universal Structure Learner”

[Figure: Data → “Universal Structure Learner” (K-Means, hierarchical clustering, factor analysis, Guttman scaling, circumplex models, self-organizing maps, ···) → Representation]

A “universal grammar” for structural forms

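[Figure: each structural form shown alongside the process that generates it (Form / Process).]

F: form
  ↓ P(S | F): favors simplicity
S: structure (e.g., a tree over mouse, squirrel, chimp, gorilla)
  ↓ P(D | S): favors smoothness [Zhu et al., 2003]
D: data (features F1–F4 for each species)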

Model fitting

• Evaluate each form in parallel
• For each form, heuristic search over structures based on greedy growth from a one-node seed:

Structural forms from relational data

Data set               Relation              Structural form
Primate troop          “x beats y”           Dominance hierarchy
Bush administration    “x told y”            Tree
Prison inmates         “x likes y”           Cliques
Kula islands           “x trades with y”     Ring

Development of structural forms as more data are observed

Beyond “Nativism” versus “Empiricism”

• “Nativism”: Explicit knowledge of structural forms for core domains is innate.
  – Atran (1998): The tendency to group living kinds into hierarchies reflects an “innately determined cognitive structure”.
  – Chomsky (1980): “The belief that various systems of mind are organized along quite different principles leads to the natural conclusion that these systems are intrinsically determined, not simply the result of common mechanisms of learning or growth.”

• “Empiricism”: General-purpose learning systems without explicit knowledge of structural form.
  – Connectionist networks (e.g., Rogers and McClelland, 2004).
  – Traditional structure learning in probabilistic graphical models.

Summary

Bayesian inference over hierarchies of structured representations provides a framework to understand core questions of human cognition:
  – What is the content and form of human knowledge, at multiple levels of abstraction?
  – How does abstract domain knowledge guide learning of new concepts?
  – How is abstract domain knowledge learned? What must be built in?
  – How can domain-general learning mechanisms acquire domain-specific representations? How can probabilistic inference work together with symbolic, flexibly structured representations?

[Figure: F: form → S: structure (a tree over mouse, squirrel, chimp, gorilla) → D: data (features F1–F4 for each species).]

[Figure: example grammar rules, with optional elements in brackets.]
S → NP VP
NP → Det [Adj] Noun [RelClause]
RelClause → [Rel] NP V
VP → VP NP
VP → Verb

“Universal Grammar”: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
  ↓ P(grammar | UG)
Grammar
  ↓ P(phrase structure | grammar)
Phrase structure
  ↓ P(utterance | phrase structure)
Utterance
  ↓ P(speech | utterance)
Speech signal

(c.f. Chater and Manning, 2006)

(Han & Zhu, 2006; c.f., Zhu, Yuanhao & Yuille, NIPS 06)

Vision as probabilistic parsing

Learning word meanings

Principles (whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias) → Structure → Data

Learning causal relations (Griffiths, Tenenbaum, Kemp et al.)

Abstract principles → Structure → Data

First-order probabilistic theories for causal inference (Mansinghka, Kemp, Tenenbaum, Griffiths, UAI 06)

[Figure: Abstract theory (classes Z: c1, c2, with class-level edge probabilities 0.4, 0.0, 0.0, 0.0) → Graph G over variables 1–16 (listed as 1–6 and 7–16) → Data D. Panels show the true structure of graphical model G alongside edge (G) and class (z) estimates for 20, 80, and 1000 samples.]

Goal-directed action (production and comprehension)

(Wolpert et al., 2003)

Goal inference as inverse probabilistic planning (Baker, Tenenbaum & Saxe)

[Figure: Constraints and Goals → rational planning ((PO)MDP) → Actions. Plot: human judgments versus model predictions.]

The big picture

• What we need to understand: the mind’s ability to build rich models of the world from sparse data.
  – Learning about objects, categories, and their properties
  – Causal inference
  – Understanding other people’s actions, plans, thoughts, goals
  – Language comprehension and production
  – Scene understanding

• What do we need to understand these abilities?
  – Bayesian inference in probabilistic generative models
  – Hierarchical models, with inference at all levels of abstraction
  – Structured representations: graphs, grammars, logic
  – Flexible representations, growing in response to observed data

The chicken-and-egg problem of structure learning and feature selection

[Figure: a raw data matrix.]

[Figure: conventional clustering (CRP mixture) of the same matrix.]

Learning multiple structures to explain different feature subsets (Shafto, Kemp, Mansinghka, Gordon & Tenenbaum, 2006)

[Figure: CrossCat: System 1, System 2, System 3.]

The “nonparametric safety-net”

[Figure: Abstract theory Z → Graph G → Data D. Panels show the true structure of graphical model G (variables 1–12) alongside edge (G) and class (z) estimates for 40, 100, and 1000 samples.]

Bayesian prediction

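$$P(t_{\text{total}} \mid t_{\text{past}}) \;\propto\; \frac{1}{t_{\text{total}}} \times P(t_{\text{total}})$$

posterior probability ∝ random-sampling likelihood × domain-dependent prior

[Figure: posterior P(t_total | t_past) plotted against t_total.]

What is the best guess for t_total? Compute t such that P(t_total > t | t_past) = 0.5.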
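The median rule above can be approximated on a grid. This is a sketch under stated assumptions: the two priors below (a power law with exponent 1 and a Gaussian with mean 75 and standard deviation 16) are hypothetical stand-ins, not the empirically measured priors used in the study, and the function names are illustrative.

```python
import numpy as np


def posterior_median(t_past, prior_pdf, grid):
    """Median of p(t_total | t_past), proportional to prior(t_total) / t_total
    for t_total >= t_past and zero otherwise, evaluated on a uniform grid."""
    post = np.where(grid >= t_past, prior_pdf(grid) / grid, 0.0)
    cdf = np.cumsum(post)
    cdf /= cdf[-1]  # normalize so the last value is 1
    return float(grid[np.searchsorted(cdf, 0.5)])


# Uniform grid; the upper limit truncates the heavy power-law tail (an approximation).
grid = np.linspace(0.1, 10_000.0, 1_000_000)


def power_law_prior(x, gamma=1.0):
    """Hypothetical power-law prior; gamma = 1 matches the 1/t_total form."""
    return x ** (-gamma)


def gaussian_prior(x, mean=75.0, sd=16.0):
    """Hypothetical Gaussian prior, e.g. for a quantity like life span in years."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2)


print("power law, t_past = 10:", posterior_median(10.0, power_law_prior, grid))
print("Gaussian,  t_past = 78:", posterior_median(78.0, gaussian_prior, grid))
```

The two assumed priors lead to qualitatively different predictions: the power law scales the guess multiplicatively with t_past, while the Gaussian keeps the guess only modestly above t_past once t_past is past the mean.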

We compared the median of the Bayesian posterior with the median of subjects’ judgments… but what about the distribution of subjects’ judgments?

Sources of individual differences

• Individuals’ judgments could be noisy.

• Individuals’ judgments could be optimal, but with different priors.
  – e.g., each individual has seen only a sparse sample of the relevant population of events.

• Individuals’ inferences about the posterior could be optimal, but their judgments could be based on probability (or utility) matching rather than maximizing.

Individual differences in prediction

[Figure: posterior P(t_total | t_past) plotted against t_total. Plot: proportion of judgments below predicted value versus quantile of Bayesian posterior distribution.]

Average over all prediction tasks:
• movie run times
• movie grosses
• poem lengths
• life spans
• terms in congress
• cake baking times

Individual differences in concept learning

Why probability matching?

• Optimal behavior under some (evolutionarily natural) circumstances.
  – Optimal betting theory, portfolio theory
  – Optimal foraging theory
  – Competitive games
  – Dynamic tasks (changing probabilities or utilities)

• Side-effect of algorithms for approximating complex Bayesian computations.
  – Markov chain Monte Carlo (MCMC): instead of integrating over complex hypothesis spaces, construct a sample of high-probability hypotheses.
  – Judgments from individual (independent) samples can on average be almost as good as using the full posterior distribution.

Markov chain Monte Carlo

(Metropolis-Hastings algorithm)
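A random-walk Metropolis-Hastings sampler can be sketched in a few lines. This is a generic illustration, not the implementation behind the talk: the target is a hypothetical posterior over t_total (random-sampling likelihood times an assumed Gaussian prior with mean 75 and standard deviation 16, observed at t_past = 78), and the step size, burn-in length, and function names are illustrative choices.

```python
import math
import random
import statistics


def metropolis_hastings(log_target, x0, n_samples, step, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x, logp = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        samples.append(x)  # a rejected proposal repeats the current state
    return samples


def log_posterior(t_total, t_past=78.0, mean=75.0, sd=16.0):
    """Unnormalized log posterior: log(1/t_total) + log of an assumed Gaussian prior."""
    if t_total < t_past:
        return -math.inf  # the observed value cannot exceed the total
    return -math.log(t_total) - 0.5 * ((t_total - mean) / sd) ** 2


samples = metropolis_hastings(log_posterior, x0=80.0, n_samples=20000, step=8.0)
median_estimate = statistics.median(samples[2000:])  # discard burn-in
print("estimated posterior median:", round(median_estimate, 1))
print("one sampled judgment:", round(samples[-1], 1))
```

The sampler needs only an unnormalized density, so the integral over the hypothesis space is never computed. Reporting a single draw such as `samples[-1]` instead of the median corresponds to answering with one sample from the posterior.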

Bayesian inference in perception and sensorimotor integration

(Weiss, Simoncelli & Adelson 2002) (Kording & Wolpert 2004)


Everyday prediction problems (Griffiths & Tenenbaum, 2006)


You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh’s reign. How long did he reign?

How long did the typical pharaoh reign in ancient Egypt?

Summary: prediction

• Predictions about the extent or magnitude of everyday events follow Bayesian principles.

• Contrast with Bayesian inference in perception, motor control, memory: no “universal priors” here.

• Predictions depend rationally on priors that are appropriately calibrated for different domains.
  – Form of the prior (e.g., power-law or exponential)
  – Specific distribution given that form (parameters)
  – Non-parametric distribution when necessary

• In the absence of concrete experience, priors may be generated by qualitative background knowledge.

Learning concepts from examples

• Property induction

Cows have T9 hormones.
Sheep have T9 hormones.
Goats have T9 hormones.
--------------------
All mammals have T9 hormones.

Cows have T9 hormones.
Seals have T9 hormones.
Squirrels have T9 hormones.
--------------------
All mammals have T9 hormones.

• Word learning

“tufa” “tufa” “tufa”

Clustering models for relational data

• Social networks: block models

Does person x respect person y?

Does prisoner x like prisoner y?

[Figure: relational data as a concept × concept × predicate array.]

Learning systems of concepts with infinite relational models

(Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)

Biomedical predicate data from UMLS (McCray et al.):
  – 134 concepts: enzyme, hormone, organ, disease, cell function ...
  – 49 predicates: affects(hormone, organ), complicates(enzyme, cell function), treats(drug, disease), diagnoses(procedure, disease) …

Learning a medical ontology

e.g., Diseases affect Organisms

Chemicals interact with Chemicals

Chemicals cause Diseases

Clustering arbitrary relational systems

International relations circa 1965 (Rummel)
  – 14 countries: UK, USA, USSR, China, ….
  – 54 binary relations representing interactions between countries: exports to( USA, UK ), protests( USA, USSR ), ….
  – 90 (dynamic) country features: purges, protests, unemployment, communists, # languages, assassinations, ….

Learning a hierarchical ontology
