
Journal of Experimental Psychology: Learning, Memory, and Cognition
1984, Vol. 10, No. 2, 234-257

Copyright 1984 by the American Psychological Association, Inc.

Induction of Category Distributions: A Framework for Classification Learning

Lisbeth S. Fried and Keith J. Holyoak
University of Michigan (Ann Arbor)

We present a framework for classification learning that assumes that learners use presented instances (whether labeled or unlabeled) to infer the density functions of category exemplars over a feature space and that subsequent classification decisions employ a relative likelihood decision rule based on these inferred density functions. A specific model based on this general framework, the category density model, was proposed to account for the induction of normally distributed categories either with or without error correction or provision of labeled instances. The model was implemented as a computer simulation. Results of five experiments indicated that people could learn category distributions not only without error correction, but without knowledge of the number of categories or even that there were categories to be learned. These and other findings dictated a more general learning model that integrated distributional representations based on both parametric descriptions and stored instances.

In this article we present a new model of category learning and classification based on the acquisition and use of distributional knowledge. This category density model, derived from work by Fried (1979), makes the central assumption that the goal of the category learner is to develop a schematic description

Experiment 1A was reported at the meeting of the Psychonomic Society in San Antonio, Texas, November 1978, and Experiment 3 was reported at the meeting of the Mathematical Psychology Society in Santa Barbara, California, August 1981. An earlier version of the present article appeared as Cognitive Science Technical Rep. No. 38, University of Michigan, 1982.

This research was supported by National Science Foundation Grant BNS-7904730 to both authors, and a Rackham Faculty Grant from the University of Michigan and National Institute of Mental Health Grant 1-K02-MH00342-03 to K. Holyoak.

We thank the numerous colleagues and students who helped us clarify our ideas in seminars and discussions. Dorrit Billman, Mary Gick, David H. Krantz, Tracy Sherman, Wilson P. Tanner, Jr., and J. E. Keith Smith provided valuable comments on the earliest draft of this paper. William Estes, Don Homa, Kyunghee Koh, Doug Medin, Edward Smith, and Tom Wallsten commented on subsequent drafts. Jack Abraham, Bill Barr, Holly Brewer, Ellen Junn, Roberta Mehoke, Shannon McDonnell, Allan Salmi, and Jan Stern assisted in testing subjects, and John Patterson provided programming assistance.

Requests for reprints should be sent to Lisbeth S. Fried or Keith J. Holyoak, University of Michigan, Human Performance Center, 330 Packard Road, Ann Arbor, Michigan 48104.

of the distributions of category exemplars over a feature space. Highly salient features will tend to be encoded initially, although the learner may actively search for less salient features that are more diagnostic of category membership. We assume that the schematic representation is a parametric encoding of the category distribution over the feature dimensions to which the learner is currently attending.

Suppose, for example, that a learner is shown a set of exemplars randomly sampled from a category population that is normally distributed over n independent feature dimensions. The density function for such a category can be sufficiently described by a vector of 2n parameters—the mean and variance of the population along each feature dimension. The density model assumes that in this example the effective representation of the category distribution will correspond to this parameter vector. This assumption implies that the presented instances will be treated as a sample that can be used to estimate the distributional properties of an indefinitely large population of potential category exemplars.

A parametric representation also implies that there exists a set of statistics for each feature dimension that is sufficient to describe the learner's conception of the category distribution. The types of category distributions that people can encode parametrically may be small, although this is an open empirical issue. In the present study we will focus on a specific version of the category density model that accounts for learning of multidimensional normally distributed categories. Normal distributions may have particular ecological importance. Basic-level natural categories seem to consist of a dense central region of typical instances, surrounded by sparser regions of atypical instances (Rosch, 1973, 1978; Rosch & Mervis, 1975; Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976). People may therefore expect new categories to be unimodal and to have roughly symmetrical density functions, which may be well approximated by multidimensional normal distributions.

A second major assumption of the category density model is that classification decisions are based on relative likelihood. This decision rule is related to that of signal detectability theory (Swets, Tanner, & Birdsall, 1961), except that it is based on distributional information acquired during category learning rather than on information assumed known a priori. The subjective probability ψ_{i,t} that the decision maker considers item X to be a member of category C_i on trial t is assumed to be given by Bayes' theorem; that is,

\psi_{i,t} = p_t(C_i \mid X) = \frac{p_t(X \mid C_i)\, p_t(C_i)}{\sum_{m=1}^{k} p_t(X \mid C_m)\, p_t(C_m)},    (1)

where p_t(X|C_i) is the subjective conditional probability on trial t of item X given category C_i, p_t(C_i) is the subjective prior probability of C_i as of trial t, and k is the number of alternative categories. The model further assumes that the decision maker's probability of making response C_i on trial t given item X is

P_t(\text{response } C_i \mid X) = \frac{\beta_i\, \psi_{i,t}}{\sum_{m=1}^{k} \beta_m\, \psi_{m,t}},    (2)

where β_i is a constant for each category reflecting factors such as an asymmetrical payoff matrix for different classification responses. Note that when the values of p_t(C_i) and β_i are equal for all categories (as will be assumed in all applications of Equations 1 and 2 in the present article, because prior probabilities and payoffs were made equal and symmetric in all

experiments), it then follows from Equation 2 that the relative frequencies of the alternative categories as responses to item X will be equal to the subjective relative likelihoods of the item X given the alternative categories. We will therefore refer to Equations 1 and 2 jointly as the relative likelihood decision rule. This rule will be used as a heuristic device for measuring distributional learning.
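To make the decision rule concrete, the following sketch implements Equations 1 and 2 directly. It is an illustrative Python rendering (the simulation described below was written in FORTRAN), and the function and variable names are ours rather than the original program's.

```python
import numpy as np

def posterior_probabilities(likelihoods, priors):
    """Equation 1: subjective probability psi that item X belongs to each of
    the k categories, given likelihoods p_t(X|C_m) and priors p_t(C_m)."""
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    return joint / joint.sum()

def response_probabilities(psi, betas):
    """Equation 2: probability of each classification response, where beta_m
    reflects factors such as an asymmetrical payoff matrix."""
    weighted = np.asarray(betas, dtype=float) * np.asarray(psi, dtype=float)
    return weighted / weighted.sum()

# Example: two categories with equal priors and symmetric payoffs; item X is
# three times as likely under C1 as under C2, so it is called C1 75% of the time.
psi = posterior_probabilities(likelihoods=[0.03, 0.01], priors=[0.5, 0.5])
print(response_probabilities(psi, betas=[1.0, 1.0]))   # [0.75 0.25]
```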

A third assumption of the model, related to Bayesian learning theory (Edwards, Lindman, & Savage, 1963), is that category learning is based on a cyclic process of parameter revision. We assume that people expect feature dimensions of perceptual categories to be normally distributed, and that they enter a category learning task with (perhaps very vague) initial opinions about the central tendency and degree of variability of each category on its salient feature dimensions. People then use presented instances to revise these prior expectations. The revised opinions generate expectancies in terms of which the next observation is evaluated.

The density model includes a mechanism by which normal distributions can be learned by revising parameter vectors in response to each successive instance in a set of training exemplars, with minimal reliance on memory for prior instances (discussion follows). In a typical experiment on classification learning, subjects are told the category to which each training instance belongs. Under such conditions the parameter-revision process for learning normal distributions is straightforward. On each trial the feature values of the current instance are used to update the dimension means and variances for the appropriate category, while the occurrence of the category label is used to update the index of the category's frequency. The new parameter values are then saved; the current instance may be incidentally stored in memory, but it plays no further necessary role in learning or classification.

In naturalistic learning situations, unlike standard experimental paradigms, external error feedback may be delayed, unreliable, or completely absent (Bruner, Goodnow, & Austin, 1956, p. 68). Furthermore, there is experimental evidence that people can sometimes learn to classify instances of probabilistic categories without any error correction or provision of category labels for instances (Edmonds & Evans, 1966; Fried, 1979), although learning is not always entirely successful under such conditions (Evans & Arnoult, 1967; see also Bersted, Brown, & Evans, 1969; Tracy & Evans, 1967). The possibility of learning without external feedback is also suggested by E. Gibson's theory of perceptual learning (1953, 1969; Gibson & Gibson, 1955). Under certain conditions, parameter revision can be used to learn category distributions even in the absence of error feedback. The task of learning distributions without feedback can be modeled as a problem of decomposing the overall mixture density of the presented instances into its component densities. This decomposition can be accomplished by parameter-revision procedures for mixtures of normal densities if the learner knows the number of categories present in the mixture (Duda & Hart, 1973; Fried, 1979).

Category Density Model and Its Simulation

A version of the category density model for normal distributions was implemented as a FORTRAN program.1 The program estimates the distributional parameters for categories defined by independent, multidimensional normal distributions (i.e., a mean and variance for each dimension of each category, and a frequency parameter for each category). The program can learn these parameters with or without information about the category membership of training exemplars, and was used to validate the qualitative predictions outlined below.

The learning sequence for the FORTRAN program consists of randomly intermixed n-dimensional stimuli (n ≤ 5) drawn from k categories (k ≤ 5). Each category is defined by normal distributions of values over each of the dimensions. The dimensions are statistically independent. The category distributions are specified by providing the program with a mean and variance for each dimension of each category, and a relative frequency for each category. Each stimulus is thus represented by a vector of numbers, with each number representing a value on a dimension. The number of categories and dimensions is specified. Runs used to validate the qualitative predictions tested in the present paper (described later) used two 2-dimensional categories.

The learning process operates either with knowledge of the category membership of the training instances (feedback condition) or without such knowledge (no-feedback condition). In either case learning involves two stages: formation of initial parameter estimates, followed by iterative revision of them. In the feedback condition, the dimension value of the first instance of each category C_i is used as the initial mean M_{i,j} for the ith category's jth dimension. The initial value of the frequency N_i of this category is then 1. The dimension variances V_{i,j} for category C_i are initialized when the second observation for that category is obtained. The initial variance estimate is a weighted average of the sample variance of the two observations and an arbitrarily large value that represents the vagueness of the learner's prior opinion about the distribution of category i on dimension j.

In the no-feedback condition, initial parameter estimates are formed after accumulating the first s instances, where s (≥ k) is the size of a short-term memory buffer (set at 6 in the runs reported later). The s observations are represented as points in an n-dimensional space, and a clustering algorithm uses the Euclidean distances between the points to divide the observations into k groups. The algorithm used is based on the centroid method (Everitt, 1974, pp. 12-14). The distance between two groups is defined as the distance between their centroids, where the centroid is the mean of the coordinate values for the items in a group. Initially each of the s instances is defined as a group with one member. The two closest groups are then merged and replaced by the coordinates of their centroid. This procedure is iterated until k groups remain. The initial frequency N_i of each category is then set equal to the number of instances in a corresponding cluster, and the initial dimensional mean M_{i,j} for each category is set equal to the mean of the clustered instances on each dimension.

1 The simulation model was programmed by Kyunghee Koh, who also helped formulate the implemented procedure for learning without feedback.


The initial variance V_{i,j} of each category on each dimension is obtained by pooling an arbitrarily large value (as in the feedback condition) with the variance of the cluster. (This pooling procedure has the incidental effect of ensuring that even a category with just one initial member will have a nonzero initial variance.)
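The initialization just described can be sketched as follows. This is a minimal Python illustration of centroid-method agglomeration over the s buffered observations, assuming Euclidean distance between group centroids; it is not the original FORTRAN routine, and the names are ours.

```python
import numpy as np

def centroid_cluster(points, k):
    """Start with one group per observation, then repeatedly merge the two
    groups whose centroids (means of their members) are closest in Euclidean
    distance, until only k groups remain."""
    groups = [[np.asarray(p, dtype=float)] for p in points]
    while len(groups) > k:
        centroids = [np.mean(g, axis=0) for g in groups]
        best = min(((np.linalg.norm(centroids[i] - centroids[j]), i, j)
                    for i in range(len(groups))
                    for j in range(i + 1, len(groups))), key=lambda t: t[0])
        _, i, j = best
        groups[i].extend(groups[j])   # merge; centroid is recomputed on the next pass
        del groups[j]
    return groups

# Example: cluster the first s = 6 buffered observations into k = 2 groups,
# then use each group's size and dimension means as the initial N_i and M_ij.
buffer = [(2.8, 6.1), (6.3, 2.7), (3.2, 5.8), (5.9, 3.4), (2.5, 6.4), (6.1, 2.9)]
for group in centroid_cluster(buffer, k=2):
    members = np.array(group)
    print("N =", len(members), " M =", members.mean(axis=0))
```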

After initial parameter estimates are formed, each successive observation is then used to revise the estimates. On trial t, the first step is to determine the probability ψ_{i,t} that category C_i generated the observed instance X. For the feedback condition these probabilities are as follows:

\psi_{i,t} = \begin{cases} 1, & \text{if } X \text{ is labeled as a member of } C_i, \\ 0, & \text{otherwise.} \end{cases}    (3)

For the no-feedback condition ψ_{i,t} is calculated by inserting the parameter estimates as of trial t - 1 in a version of Equation 1 of the relative likelihood rule. If an item X is represented by a vector of values on n independent feature dimensions, x_1, x_2, . . . , x_n, then Equation 1 can be restated as follows:

\psi_{i,t} = \frac{p_t(C_i)\, \prod_{j=1}^{n} p_t(x_j \mid C_i)}{\sum_{m=1}^{k} p_t(C_m)\, \prod_{j=1}^{n} p_t(x_j \mid C_m)},    (4)

and p_t(x_j|C_i) can be determined by substituting the current estimates of M_{i,j} and V_{i,j} in the equation for the normal distribution; that is,

p_t(x_j \mid C_i) = \frac{1}{\sqrt{2\pi V_{i,j,t-1}}}\, \exp\!\left[-\frac{(x_j - M_{i,j,t-1})^2}{2\,V_{i,j,t-1}}\right].    (5)
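The following sketch shows how Equations 4 and 5 combine: each dimension contributes a normal density evaluated at the current parameter estimates, the dimensional densities are multiplied, and the products are normalized across categories. It is an illustrative Python fragment with names of our own choosing.

```python
import numpy as np

def normal_density(x, mean, var):
    """Equation 5: normal density of a dimension value given the current
    estimates M_ij and V_ij for one category and dimension."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def psi_no_feedback(x, means, variances, priors):
    """Equation 4: probability that each category generated item x, with the
    n feature dimensions treated as statistically independent.
    means, variances: arrays of shape (k categories, n dimensions)."""
    x = np.asarray(x, dtype=float)
    likelihoods = np.prod(normal_density(x, np.asarray(means, dtype=float),
                                         np.asarray(variances, dtype=float)), axis=1)
    joint = likelihoods * np.asarray(priors, dtype=float)
    return joint / joint.sum()

# Example: two 2-dimensional categories with current estimates near the
# generating means (3, 6) and (6, 3); the item (3.5, 5.5) is assigned
# almost entirely to the first category.
print(psi_no_feedback(x=[3.5, 5.5],
                      means=[[3.0, 6.0], [6.0, 3.0]],
                      variances=[[1.0, 1.0], [1.0, 1.0]],
                      priors=[0.5, 0.5]))
```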

In calculating p_t(C_i), the program allows a bias toward an assumption that the k categories are of equal frequency. The bias is given as a weight w_f (0 ≤ w_f ≤ 1), which can be interpreted as a measure of the learner's confidence after one observation that the categories are equally likely. The result is given by the following:

p_t(C_i) = \frac{w_f\,(1/k) + (1 - w_f)\,T_{t-1}\,(N_{i,t-1}/T_{t-1})}{w_f + (1 - w_f)\,T_{t-1}},    (6)

where T is the total number of observations,

T_{t-1} = \sum_{i=1}^{k} N_{i,t-1}.    (7)

In Equation 6 the observed relative frequency, N_{i,t-1}/T_{t-1}, is weighted by T_{t-1}(1 - w_f). Hence if w_f ≠ 0 or 1, the impact of the observed relative frequency will increase with the number of observations.

After determining the values of ψ_{i,t}, the next step is to revise the parameters for all categories, with the degree of revision for each category weighted in accord with ψ_{i,t}. For the no-feedback condition ψ_{i,t} may be fractional. For example, suppose X is twice as likely to have been generated by C_1 than by C_2, given the current parameter estimates. Then the parameters will be revised as if two thirds of an observation (with the dimension values of the new item) had accrued to C_1 and one third of an observation had accrued to C_2. The revision procedures are based on standard equations for revising running frequencies, means, and variances (Raiffa & Schlaifer, 1961), generalized to accommodate fractional observations.

The revised N_i is given by the following:

N_{i,t} = N_{i,t-1} + \psi_{i,t},    (8)

and the revised M_{i,j} is as follows:

M_{i,j,t} = \frac{N_{i,t-1}\,M_{i,j,t-1} + \psi_{i,t}\,x_j}{N_{i,t}}.    (9)

In updating V_{i,j}, the program allows a bias toward an assumption that the variances for any given dimension are equal across all categories. V_{i,j,t} is a weighted average of the estimated variance of the individual category C_i on dimension j, IV_{i,j,t}, and of the variance pooled over all categories, PV_{j,t}, with relative weights determined by a parameter w_v, defined analogously to w_f. The revision of V_{i,j} proceeds by first calculating the individual variance:

IV_{i,j,t} = \frac{(N_{i,t-1} - 1)\,V_{i,j,t-1} + \psi_{i,t}\,(x_j - M_{i,j,t})^2}{N_{i,t} - 1},    (10)

and the pooled variance:

PV_{j,t} = \frac{\sum_{i=1}^{k} (N_{i,t} - 1)\,IV_{i,j,t}}{T_t - k}.    (11)

The revised variance is then given by the weighted average of IV_{i,j,t} and PV_{j,t},

V_{i,j,t} = \frac{w_v\,PV_{j,t} + (1 - w_v)\,T_t\,IV_{i,j,t}}{w_v + (1 - w_v)\,T_t}.    (12)
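One revision cycle over Equations 8-12 can be sketched as follows. This is an illustrative Python rendering rather than the original FORTRAN code; in particular, because Equations 10 and 12 had to be reconstructed from the surrounding description, the variance step below should be read as one plausible instantiation of the standard running-variance equations generalized to fractional observations, not as the authors' exact formula.

```python
import numpy as np

def revise_parameters(N, M, V, psi, x, w_v=0.1):
    """One parameter-revision step for k categories on n dimensions.
    N: (k,) category frequencies; M, V: (k, n) dimension means and variances;
    psi: (k,) responsibilities from Equation 3 or 4; x: (n,) current item.
    The variance update pools across categories with weight w_v; its exact
    form here is an assumption consistent with the text."""
    N, M, V = (np.asarray(a, dtype=float) for a in (N, M, V))
    psi, x = np.asarray(psi, dtype=float), np.asarray(x, dtype=float)

    N_new = N + psi                                                    # Eq. 8
    M_new = (N[:, None] * M + psi[:, None] * x) / N_new[:, None]       # Eq. 9

    # Individual variances (Eq. 10, reconstructed): a fractional observation
    # accrues to each category in proportion to psi.
    IV = ((N - 1)[:, None] * V + psi[:, None] * (x - M_new) ** 2) / (N_new - 1)[:, None]

    # Variance pooled over all categories for each dimension (Eq. 11).
    T_new = N_new.sum()
    PV = ((N_new - 1)[:, None] * IV).sum(axis=0) / (T_new - len(N))

    # Weighted average of individual and pooled variances (Eq. 12, reconstructed).
    V_new = (w_v * PV + (1.0 - w_v) * T_new * IV) / (w_v + (1.0 - w_v) * T_new)
    return N_new, M_new, V_new

# Example: a no-feedback trial on which psi = (2/3, 1/3), so two thirds of an
# observation accrues to the first category and one third to the second.
N, M, V = revise_parameters(N=[5.0, 5.0],
                            M=[[3.0, 6.0], [6.0, 3.0]],
                            V=[[1.2, 0.9], [1.1, 1.0]],
                            psi=[2 / 3, 1 / 3],
                            x=[3.4, 5.7])
print(N, M, V, sep="\n")
```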

The program can use its current parameter estimates to classify items in accord with the relative likelihood rule. The learning phase proceeds either until some criterion is reached (e.g., 10 correct classifications in a row), or a fixed number of observations have been presented. A transfer phase is then simulated, in which the program uses its final parameter estimates to classify transfer items drawn from very broad distributions around the multidimensional means of the categories presented during the learning phase. An other response can optionally be allowed, in which case the program assumes that all instances have a fixed likelihood (a parameter that is specified) of being generated by an other category. The program thus treats other as a category with a specified uniform distribution, so that the relative likelihood that an item is drawn from the other category can be calculated using the relative likelihood rule.
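For illustration, the other option can be folded into the same rule by appending one more likelihood, constant for every item, to the vector of category likelihoods. The sketch below assumes equal priors over the k + 1 alternatives and uses the fixed other likelihood of .002 from the runs reported in the next section; the code and names are ours.

```python
import numpy as np

def classify_with_other(likelihoods, other_likelihood=0.002):
    """Treat 'other' as an extra category whose likelihood is the same fixed
    constant for every item, then apply the relative likelihood rule over the
    k + 1 alternatives (equal priors are assumed here for simplicity)."""
    lk = np.append(np.asarray(likelihoods, dtype=float), other_likelihood)
    return lk / lk.sum()          # last entry = probability of an "other" response

# An item that is unlikely under both learned categories is mostly called other,
# whereas a typical item is assigned to one of the learned categories.
print(classify_with_other([0.0004, 0.0001]))   # other dominates
print(classify_with_other([0.0600, 0.0030]))   # first category dominates
```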

Predictions of the Density Model

The category density model as described in the simulation generates a variety of qualitative predictions. The major predictions tested in the present experiments can be divided into two groups: those that concern learning and those that concern transfer performance.

Predictions concerning learning. The following predictions about learning normally distributed categories can be derived from the density model.

L1. When subjects know the number of categories to be learned, and correctly assume that the distributions are normal, learning is possible even without error feedback or instances labeled with respect to category membership.

L2. Labeled instances will facilitate convergence on accurate estimates of distributional parameters, relative to a no-feedback condition. Feedback enables the learner to use each observation to revise only the correct category distribution, rather than apportioning its value across all categories.

L3. Low-variability categories can be learned with fewer observations than are required to learn high-variability categories with the same means.

L4. Learning will be facilitated if the learner knows the number of categories to be learned. Indeed, knowledge of the number of categories is an essential prerequisite for the parameter-revision procedure described earlier.

Predictions concerning transfer. The following predictions involve transfer performance after learning has taken place.

T1. Classification performance will be in accord with the relative likelihood rule (e.g., the probability of classifying a novel instance as a member of a category will be directly proportional to its subjective likelihood of being generated by the category distribution).

T2. When a random or other alternative is available at transfer, exemplars far from the mean (prototype) of a learned category will more likely be classified as members of that category if the variability of the learned categories is high. In contrast, such exemplars will more often be classified as other when the variability of the learned categories is low. This prediction follows from the fact that the likelihood that a category will generate atypical exemplars is greater if the variance of the category is relatively high. In situations in which other responses are classified as errors, the percentage correct will be higher for groups trained on high-variability instances.

T3. The above advantage of learning high-variability rather than low-variability categories in classifying exemplars far from the prototype will not be obtained in the absence of an other alternative at transfer.

T4. If subjects learn two equally probable categories of unequal variability, they will tend to classify more items into the high-variability category at transfer, including some items that are closer to the mean of the low-variability category but more likely to have been generated by the high-variability one.


Sample of Performance by the Simulation Program

Because the stimuli used in the experiments reported later were complex forms for which the psychological features encoded by subjects were not known, the simulation program cannot generate precise quantitative predictions for our experiments. However, the general model can be applied even if the psychological features are unknown and different subjects encode stimuli in terms of different features, as long as subjects' feature sets can be approximated by normal densities. The qualitative predictions outlined earlier hold regardless of the specific nature or number of features used by subjects. As an illustration of some of our qualitative predictions, we will report the results of some sample runs of the program. These test runs used two 2-dimensional categories, defined by bivariate normal distributions with means (3, 6) and (6, 3), respectively, in arbitrary units. In the simulated low-variability condition the variances of both categories on each dimension were set equal to 1 (so that d' = 3 on each dimension), and in the high-variability condition the variances were set equal to 4 (d' = 1.5). The arbitrary value used in initiating the variance estimates was 20. In the no-feedback condition parameters were initialized after clustering the first six items. The values of w_f and w_v were set equal to 0.9 and 0.1, respectively. Ten simulated subjects were used in each run.
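For concreteness, training sequences of the kind used in these runs can be generated as in the following sketch, which draws randomly intermixed instances from two equally frequent bivariate normal categories with the means and variances just listed. It is an illustrative Python fragment; the sampling details and names are ours, not the original program's.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

MEANS = np.array([[3.0, 6.0],      # category 1 prototype (arbitrary units)
                  [6.0, 3.0]])     # category 2 prototype

def make_learning_sequence(n_trials, variance):
    """Randomly intermixed instances from two equally frequent categories,
    each normally distributed with the given variance on both dimensions."""
    labels = rng.integers(0, 2, size=n_trials)
    items = rng.normal(loc=MEANS[labels], scale=np.sqrt(variance))
    return items, labels

# Low-variability runs used variance 1 (d' = 3 per dimension);
# high-variability runs used variance 4 (d' = 1.5 per dimension).
for variance in (1.0, 4.0):
    items, labels = make_learning_sequence(n_trials=300, variance=variance)
    d_prime = np.abs(MEANS[0] - MEANS[1]) / np.sqrt(variance)
    print(f"variance {variance}: d' per dimension = {d_prime}, {len(items)} items")
```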

As a measure of rate of learning, the mean number of trials required to reach a criterion of 10 correct responses in a row was measured. A maximum of 300 learning trials was allowed. The mean number of trials to criterion was 32 for the low-variability, feedback condition; 62 for the low-variability, no-feedback condition; 74 for the high-variability, feedback condition; and 146 for the high-variability, no-feedback condition. These results illustrate Predictions L2 (advantage of labeled instances) and L3 (advantage of low-variability training).

Other runs were used to simulate transfer performance after a fixed number of learning trials (100). The first set of runs, presented in Table 1 (top half), included an other alternative at transfer. The subjective likelihood that any item was an other was specified to be .002. In fact, all transfer items were drawn from broad distributions around the prototypes of the two categories presented during the learning phase. Table 1 (top half) features both the obtained mean percentage correct and mean percentage other responses as a function of the Euclidean distance of transfer items from their generating prototype (as measured in the same arbitrary units). For all learning conditions the percentage correct decreased and the percentage other increased with distance from the prototype, exemplifying Prediction T1. The predicted greater percentage correct and lesser percentage other responses at high distances for groups trained on high- rather than low-variability exemplars (Prediction T2) was also apparent. Presented in the bottom half of Table 1 are runs in which no other alternative was allowed.

Table 1
Transfer Performance With and Without Availability of an Other Category

                                            Distance from prototype
Training condition                        1      2      3      4      5

With other category
  Low variability
    Feedback          % correct         .99    .93    .72    .31    .03
                      % other           .01    .04    .15    .45    .72
    No feedback       % correct         .97    .84    .70    .46    .19
                      % other           .01    .04    .12    .29    .54
  High variability
    Feedback          % correct         .88    .82    .71    .61    .49
                      % other           .01    .06    .08    .13    .25
    No feedback       % correct         .82    .76    .68    .58    .49
                      % other           .02    .06    .09    .14    .26

Without other category
  Low variability
    Feedback (% correct)                .99    .96    .85    .73    .67
    No feedback (% correct)             .98    .88    .82    .71    .67
  High variability
    Feedback (% correct)                .89    .86    .77    .72    .69
    No feedback (% correct)             .85    .81    .74    .68    .71

Note. Distance from the prototype is measured in arbitrary units.


Here the advantage of high-variability training in percentage correct for items far from the prototype was eliminated (Prediction T3). As in the runs that included the other alternative, low-variability training yielded higher percentage correct for items close to the prototype, because the subjective likelihood of such items is relatively high when the estimated category variance is relatively low. In the runs in Table 1 the feedback conditions tended to yield higher percentage correct than the no-feedback conditions, indicating that the latter had not achieved asymptotic learning after 100 training trials.

In general, no-feedback learning by the simulation program is more variable than learning with feedback, largely because the former is more sensitive to the accuracy of early parameter estimates. We have run the program with s set at 2, thus simply using the first two items as the initial estimates of the means of the two categories. We have also run the program setting w_f and w_v to 0, thus removing any biases toward the assumptions of equal category frequencies and equal category variances. In both cases learning still takes place, although somewhat more slowly than with the parameter values used in the runs presented earlier. However, if both changes are made (i.e., s = 2 and w_f = w_v = 0), virtually no learning takes place in the high-variability condition, although some learning is still possible in the low-variability condition.

Comparison With Previous Models

With respect to its representational assumptions, the category density model is most similar to prototype models (Posner & Keele, 1968; Reed, 1972). Like prototype models, the density model assumes that a true induction process takes place: The learner goes beyond the sampled instances to infer category-level information. Furthermore, both types of models assume that this category-level information is represented parametrically. But whereas a simple prototype represents the central tendencies of the category instances on their feature dimensions, parameters can also be used to represent the variability of a distribution, as discussed earlier, and perhaps other distributional properties as well (e.g., skewness). Like a prototype, however, representations postulated by the density model can be characterized as schemata (Attneave, 1957; Oldfield, 1954). Indeed, for the special case of multidimensional normal distributions (the focus of the present article), the density model is equivalent to a model that assumes the learner abstracts the prototype plus variance for each category.

However, in terms of its decision rule for classification, the density model is more similar to feature frequency (e.g., Hayes-Roth & Hayes-Roth, 1977) and instance models (Medin & Schaffer, 1978) than to simple prototype models.2 Unlike the closest prototype decision rule, the relative likelihood rule is sensitive to category variability and other factors that influence the degree of overlap among exemplars of alternative categories.

Unlike other classification models, the category density model provides an explicit mechanism by which categories can, under some conditions, be learned without any external, instance-specific feedback. Regardless of whether instances are being averaged to form prototypes, used to tabulate feature frequencies, or simply stored in memory, other models have tacitly assumed that error feedback is critical in category learning, since the learner must know the category to which an instance belongs in order to use it to modify the appropriate category representation.

2 Medin and Schaffer (1978) pointed out that the distance and cue validity decision rules proposed in the classification literature (Hayes-Roth & Hayes-Roth, 1977; Reed, 1972) are independent cue models, that is, rules that assume the information entering into category judgments is based on an additive combination of the information derived from the component feature dimensions. It follows from the nature of probability that the relative likelihood rule is not an independent cue model; rather, as Equation 4 makes clear, it implies a multiplicative combination of dimensional information. In this respect the relative likelihood rule resembles the context model proposed by Medin and Schaffer. But whereas the latter model assumes that dimensional information is combined to calculate a measure of instance-to-instance similarity, the relative likelihood rule assumes that such information is combined to calculate the conditional probability of an instance given a particular category.

Wallsten (1976) presented evidence indicating that the impact of a dimension value on subjects' decisions depends on the dimension's salience as well as on its diagnosticity. Salience could be represented by different weights associated with each dimension.


The generalization procedures that have been proposed to reduce the storage requirements of feature frequency models (Anderson, Kline, & Beasley, 1979; Patterson, 1979) depend upon provision of error correction.

Experiments 1A and 1B

Experiment 1A focused on tests of transfer Predictions T1-T3. Prediction T2 is in fact supported by Posner and Keele's (1968) finding that greater variability of training exemplars produced slower initial learning, but more accurate transfer performance in a classification task. The bulk of the transfer errors in the Posner and Keele study (Experiment 2) were made by the low-variability group, and involved the erroneous classification of the highly distorted exemplars of meaningful prototypes into a category based on a random dot configuration. The random-prototype category might have been viewed by subjects as a flat, rectangular distribution with a very wide range of acceptability on the feature dimensions.

Recall that in terms of the category density model, if category prototypes are kept constant, an increase in category variability will result in reduced discriminability between categories (measured in d'), making learning more difficult (Prediction L3). In a subsequent transfer task, however, subjects trained on two high-variability categories will view highly distorted exemplars as relatively likely to have been generated by the category, and so will classify them correctly. In contrast, those trained on two low-variability categories will not view highly distorted exemplars as likely to have been drawn from the category. Consequently, if a random alternative category is available, they will tend to classify those items as random. Thus when this alternative category is available, some highly distorted items are predicted to be classified differently depending only on the variability of the training items. This prediction of the category density model is not accounted for by a distance-to-prototype decision rule, because the items are the same distance from the prototypes in both groups. The above prediction was supported in previous research when feedback was provided during training (Fried, 1979), but has not been previously tested when learning takes place without trial-by-trial error correction. The predicted difference between the two variability groups depends on the presence of an alternative random category, because without such an alternative category the relative likelihood rule predicts no advantage for a group that learned relatively high-variability categories (Prediction T3). This latter prediction was investigated in Experiment 1B, in which the random alternative or other category was removed.

Method

Stimuli. The choice of stimuli was guided by several criteria. We wanted stimuli: (a) that would allow an essentially infinite population of category exemplars; (b) for which objective measures of both distance between any two items and of the likelihood of any item given any category could be calculated; (c) for which category variability could be systematically manipulated; (d) with a relatively realistic degree of perceptual complexity; and (e) that could be generated and displayed under computer control. These criteria were met by visual grid patterns of the sort depicted in Figure 1. The categories to be learned in Experiments 1A and 1B consisted of two sets of such visual patterns, each composed of instances derived from a standard pattern by means of a probabilistic distortion rule. All patterns consisted of light and dark cells in a 10 X 10 grid displayed on a computer-controlled TV screen. The two standard patterns, shown in Figure 1, were created using a modification of the method for generating figures specified by Attneave and Arnoult (1956; see Fried, 1979). The standards were adjusted so that 50 cells in each were dark and 50 were light. In addition, 50 cells overlapped between the two standards. Distortions were generated on-line by changing each cell of the standard from light to dark or vice versa with some specified distortion probability, p. Increasing the distortion probability in the range .00 to .50 increases the variability of the distribution of instances, defined in terms of number of cells changed from the standard. These distributions were binomial approximations to the normal, with the mean number of cells changed equal to 100p and variance equal to 100p(1 - p), where p is the distortion probability and 100 is the number of cells in each pattern. A total of 2^100 patterns were possible. The likelihood of any particular pattern was p^N(1 - p)^(100 - N), where N is the number of cells distinguishing the pattern from the generating standard. The categories were thus distinguishable only in their likelihood of generating each of the 2^100 possible patterns. The standard itself was the most likely individual pattern of each category, but since its probability was nonetheless vanishingly small, (1 - p)^100, it never actually was presented during learning trials in our experiments. As in Fried (1979), distortion probabilities of .07 and .15 were used for the low- and high-variability learning conditions, respectively, in all experiments to be reported. Figure 1 illustrates .07 and .15 distortions of each of the standard patterns.


Figure 1. The two standards used in Experiment 1, and examples of .07 and .15 distortions of each. (Panels: Standards; Examples of Random Distortions.)

In these illustrative examples the number of changed cells is set equal to exactly 100p.3
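The distortion rule and the likelihood calculation it supports can be illustrated with a short sketch: flip each of the 100 cells independently with probability p, and score any pattern against a standard by counting the N cells in which they differ. This Python fragment is ours, and the standard it constructs is purely hypothetical, for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def distort(standard, p):
    """Flip each cell of a 10 x 10 binary pattern (0 = light, 1 = dark)
    independently with distortion probability p."""
    flips = rng.random(standard.shape) < p
    return np.where(flips, 1 - standard, standard)

def log_likelihood(pattern, standard, p):
    """log of p^N (1 - p)^(100 - N), where N is the number of cells in which
    the pattern differs from the generating standard."""
    n_changed = int(np.sum(pattern != standard))
    return n_changed * np.log(p) + (100 - n_changed) * np.log(1.0 - p)

# A hypothetical standard with exactly 50 dark and 50 light cells.
standard = np.zeros(100, dtype=int)
standard[rng.choice(100, size=50, replace=False)] = 1
standard = standard.reshape(10, 10)

item = distort(standard, p=0.15)
# The log likelihood ratio in favor of the generating standard would compare
# this value with the corresponding value computed against the other standard.
print(log_likelihood(item, standard, p=0.15))
```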

Design and procedure (Experiment 1A). Subjects were randomly assigned to one of four conditions, defined by the 2 X 2 (Variability X Feedback) factorial combination of low versus high variability of training exemplars (.07 and .15 distortion probabilities, respectively) and presence versus absence of item-specific error feedback. All subjects were told that they would see a mixture of geometric patterns designed by two artists, named Smith and Wilson, and that they would have to distinguish the work of Smith from that of Wilson. The two standards shown in Figure 1 were used for all subjects, but each subject saw a different random sample of distortions, and no subject saw the standard.

The patterns were displayed on a TV screen controlled by an IBM 1800 computer. During the learning phase all subjects classified a series of patterns into two categories by pressing one of two response keys. A maximum of 7 s was allowed to make each response. Subjects receiving instance-specific error feedback were told whether or not they were correct immediately after each response, thus effectively labeling the instances with respect to category membership. Subjects not receiving instance-specific error feedback did not receive such information. However, all the subjects were told the number of correct and incorrect responses they had made for each block of 10 trials. All subjects in Experiment 1A thus received general information about whether their classification accuracy was improving. However, the nonspecific feedback subjects were never told the category to which any particular instance belonged.

Subjects received a bonus of 1 cent for each correct answer and were fined 1 cent for each error. In addition, their pay decreased 1 cent for every 10 trials they required to learn the categories. This learning phase continued until subjects responded correctly 10 times in a row, or reached a maximum of 200 trials. The response key assigned to a particular category by subjects in the nonspecific feedback condition was necessarily arbitrary. Their responses were scored as correct in the manner that maximized their score over all learning trials.

After completion of the learning phase, all subjects received an additional 100 transfer trials, without error correction. Subjects were told that the patterns would include new Wilsons and new Smiths, but also an unspecified number of patterns designed by other people. In fact, there were no true others; all the patterns presented during the transfer phase were actually derived with equal frequencies from the two original standards. Equal proportions of the transfer items were created at each of four distortion probabilities: .10, .20, .30, and .40. The transfer patterns therefore included instances at higher levels of distortion than those that were presented during learning, even in the high-variability learning conditions. On each trial subjects pressed one of three response keys to classify the pattern as a Wilson, a Smith, or an other. Subjects in the nonspecific feedback condition had to maintain the same response-key assignments as they had established during the learning phase. A maximum of 5 s was allowed to make a response. Subjects received a 1 cent bonus for each correct classification, lost 1 cent for classifying a Wilson as a Smith or vice versa, and neither won nor lost money for other responses.

Forty-five University of Michigan undergraduates served as paid subjects.

Design and procedure (Experiment 1B). The design and procedure used in Experiment 1B were identical to those used in Experiment 1A, except for two changes.

3 We assume that on average there is a monotonic relationship between number of cells changed (an objective city-block measure of distance from the standard) and psychological distance, for the range from 0 to 50 changed cells. A distortion probability of .50 (expected number of changed cells equal to 50) yields patterns statistically unrelated to the generating standard.


First, the summary information provided to subjects in Experiment 1A after every 10 learning trials was eliminated. Subjects in the resulting no-feedback condition, unlike those in the nonspecific feedback condition of Experiment 1A, therefore received no information about the degree to which their classification accuracy was improving. The feedback condition in Experiment 1B received the same item-specific feedback as did the comparable condition of Experiment 1A. Second, subjects were told that all transfer patterns were either Smiths or Wilsons, and were required to classify each pattern into one of those two categories; that is, no third other alternative was available at transfer.

Forty-five University of Michigan undergraduates served as paid subjects.

Results and Discussion (Experiment 1A)

Learning phase. Of the 45 subjects tested, 8 had not reached criterion within the maximum 200 trials. As expected (Prediction L3), all of these subjects were in the high-variability condition: 2 subjects who received specific error feedback and 6 who received nonspecific feedback. The mean number of learning trials for all subjects was 39 for the low-variability, specific feedback condition; 51 for the low-variability, nonspecific feedback condition; 108 for the high-variability, specific feedback condition; and 141 for the high-variability, nonspecific feedback condition. The learning-trials measure proved to be highly variable (MSE = 2,692), reducing statistical power. Nevertheless, as in previous research (Fried, 1979; Posner & Keele, 1968), subjects in the high-variability conditions required significantly more learning trials to reach criterion, F(1, 41) = 25.9, p < .001, in accord with Prediction L3. The nonspecific feedback conditions tended to require more learning trials than the specific feedback conditions; however, this trend was not significant, F(1, 41) = 2.09, p < .20. The fact that learning was possible without specific feedback provides support for Prediction L1.

Transfer phase. Of the 8 subjects who had not reached criterion within 200 trials, 3 (all in the nonspecific feedback condition) responded with accuracy levels significantly above chance during the transfer task. Since we were interested in transfer performance after at least some learning had taken place, data from the other 5 subjects were excluded from transfer analyses. The remaining 40 subjects included 10 in each of the four conditions.

The relative likelihood rule predicts that if subjects had learned the mean (or generating standard) of each category, the percentage of patterns called other would increase as a function of distance from the standards. The rule also predicts that if subjects had learned category variability, those in the high-variability conditions would classify fewer patterns far from the standard as other (Prediction T2). Furthermore, this pattern should obtain regardless of whether specific error feedback is given (Prediction T1). Presented in Figure 2 is the percentage of patterns called other as a function of the number of cells by which the distorted pattern differed from the standard used to generate it (averaging over blocks of 10 cells).4 Percentage of patterns called other increased with increasing distance from the standard for all groups, F(4, 144) = 23.0, p < .001.5 Subjects in the low-variability conditions tended to make more other responses overall than did those in the high-variability conditions, F(1, 36) = 3.76, p < .10. More importantly, this difference became greater as the distance from the transfer pattern to the standard increased, t(144) = 2.60, p < .02, by a bilinear trend test. Furthermore, lack of item-specific error feedback did not affect the overall pattern of results, F(1, 36) = 1.78, p > .25, and produced no significant interactions.

The relative likelihood rule predicts that the percentage called other should be a decreasing function of relative likelihood; that is, the greater the relative likelihood the greater the probability that the item will be classified into the appropriate category, and the lower the probability that the pattern will be put into an erroneous category, such as other. Presented in Figure 3 is the percentage of patterns called other as a function of the natural logarithm of the likelihood ratio in favor of the correct category, p(X|S_C)/p(X|S_A), where S_C and S_A are the correct and alternative standards, respectively. The data points plotted in Figure 3 were obtained by averaging over blocks of approximately 20 log units of likelihood ratio.

4 About 3% of the transfer patterns were actually closer to the alternative standard than to the standard used to generate them. However, the pattern of results was unchanged when the closer standard was scored as correct.

5 Throughout this article, all analyses of variance on proportions were performed after applying an arcsine transformation.


Figure 2. Percentage of transfer patterns called other as a function of distance from the correct standard (Experiment 1A).

Likelihood ratio reaches higher levels for the low-variability conditions, since low-variability distributions make it more likely that patterns close to the standard will be generated, and also make it less likely that such instances could have been derived from the alternative standard.

Figure 3. Percentage of transfer patterns called other as a function of log likelihood ratio (Experiment 1A).


This analysis shows that the percentage called other decreased monotonically across the four levels of relative likelihood at which all conditions can be compared, F(3, 144) = 13.5, p < .001. The relative likelihood rule predicts that among patterns equated on likelihood ratio with respect to the two learned categories, low-variability training will produce a greater proportion of other responses (since in this case the other category will often seem more likely to have generated the item than either of the two low-variability categories). This prediction was supported, F(1, 36) = 16.1, p < .001. The effect of specificity of error feedback did not approach significance. (Since for equal-variance categories log likelihood is highly correlated with distance from the prototype, in subsequent cases we will report only one measure.)

Because the high-variability conditions produced fewer other responses, did they then produce more correct responses? Presented in Figure 4 is the percentage correct for all four groups as a function of likelihood ratio. Accuracy increased with increasing likelihood ratio, F(3, 108) = 25.8, p < .001. Provision of specific error feedback during training did not produce an advantage in percentage correct (F < 1). However, variability of the training instances significantly affected transfer accuracy. The high-variability group was significantly more accurate for patterns equated on likelihood ratio, F(1, 36) = 16.5, p < .001. It is apparent from inspection of Figure 4 that for any given level of likelihood ratio the low-variability group was more likely to call a pattern other (and thus have it counted as an error), whereas the high-variability group was more likely to classify it correctly. This pattern supports Prediction T2.

Results and Discussion (Experiment 1B)

Learning phase. Five of the 45 subjects failed to reach the criterion of 10 correct trials in a row within the maximum allotment of 200 trials. All of these were in the high-variability conditions, with 3 in the no-feedback condition and 2 in the feedback condition.

Figure 4. Percentage of transfer patterns classified correctly as a function of log likelihood ratio (Experiment 1A).


As in Experiment 1A, the learning trials measure was highly variable (MSE = 3,532). Once again subjects in the high-variability conditions required more learning trials than did those in the low-variability conditions (M = 91 trials and M = 43 trials, respectively), F(1, 41) = 7.17, p < .025. Although the trend favored the subjects who received error feedback over those who did not (65 trials vs. 75 trials), this difference did not approach significance (F < 1). Weak statistical power suggests caution in accepting the null hypothesis; however, it is clear that error feedback was not a necessary condition for category learning in the present task. It can also be concluded that the nonspecific feedback in Experiment 1A was not instrumental in producing learning in that condition.

Transfer phase. The 5 subjects who failed to reach the learning criterion were excluded from analyses of transfer performance. Presented in Figure 5 is the percentage correct classification at transfer as a function of log likelihood ratio in favor of the correct category.

As in Experiment 1A, percentage correct increased as a function of likelihood ratio, F(3, 108) = 11.1, p < .001. Also as in Experiment 1A, overall percentage correct was not influenced by error feedback (F < 1). However, level of feedback did interact significantly with variability of the training set, F(1, 36) = 8.20, p < .01. As is apparent in Figure 5, this interaction was mainly due to the especially accurate performance of the high-variability feedback condition at its two highest levels of likelihood ratio. Error feedback did not have a significant effect for the low-variability condition, F(1, 18) = 1.41, p > .25.

As predicted by the relative likelihood rule (Prediction T3), and in sharp contrast to Experiment 1A, high-variability training instances did not improve transfer accuracy when an other category was not available.

Figure 5. Percentage of transfer patterns classified correctly as a function of log likelihood ratio (Experiment 1B).


When only patterns equated on likelihood ratio were considered (thus excluding patterns at the two highest levels of likelihood ratio, experienced only by the low-variability groups), variability had no significant effect, F(1, 36) = 1.68, p > .20, as the relative likelihood rule predicts. The results of Experiment 1B indicate that high-variability training does not always produce more accurate transfer performance for exemplars far from the standard, nor does exposure to a larger set of training items (because in both Experiments 1A and 1B, subjects in the high-variability condition received more instances before reaching the learning criterion). Rather, the existence of a random or other alternative at transfer (as in Experiment 1A) is critical to producing an advantage of high-variability training.

Experiment 2

In Experiments 1A and 1B variability was manipulated across different groups of subjects. The relative likelihood rule can also be tested by training a single group of subjects with two equally probable categories of unequal variability. If subjects learn the category distributions, the rule predicts that they will tend to classify more patterns into the high-variability category, even though exemplars of the two categories are equally likely a priori (Prediction T4). In particular, some patterns that are physically more similar to the standard of the low-variability category will be more likely to be generated by the high-variability category. If subjects learn the distributions and follow the relative likelihood decision rule, they should tend to classify such patterns into the high-variability category. In contrast, if subjects employ a closest prototype rule, based on some monotonic function of physical distance, they will tend to classify such patterns into the low-variability category.
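To make the contrast between the two decision rules concrete, the following sketch (our illustration, not part of the original article; the specific means, variances, and test value are assumptions) shows how, for one-dimensional normal categories with equal priors, an item can lie closer to the low-variability standard yet be more probable under the high-variability density:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical categories with equal priors but unequal variability.
mu_low, sigma_low = 0.0, 1.0    # low-variability category
mu_high, sigma_high = 4.0, 3.0  # high-variability category

x = 1.8  # test item: closer to the low-variability standard (distance 1.8 vs. 2.2)

closest_prototype = "low" if abs(x - mu_low) < abs(x - mu_high) else "high"
relative_likelihood = ("low" if normal_pdf(x, mu_low, sigma_low) > normal_pdf(x, mu_high, sigma_high)
                       else "high")

print(closest_prototype)    # prints "low"  (distance rule)
print(relative_likelihood)  # prints "high" (likelihood rule: density 0.079 vs. 0.102)
```

The two rules diverge exactly where the broader density overtakes the narrower one, which is the region Prediction T4 exploits.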

Method

Two new standard patterns were used in Experiment 2. These were constructed in the same manner as the standards used in Experiments 1A and 1B except that the new standards were the same in only 40 (rather than 50) of the 100 cells. During the learning phase the instances of one standard were derived by a .07 distortion probability (the low-variability category), and the instances of the other standard were derived by a .15 distortion probability (the high-variability category). Assignment of the two standards to distortion level was counterbalanced across subjects.

One other major change was introduced in the learning phase of Experiment 2. The results of Experiment 1 and earlier studies indicate that high-variability categories are harder to learn than low-variability categories. If subjects simply had to discriminate instances of a low- versus a high-variability category, they could do so by learning only the low-variability category, and then assigning all remaining instances to the high-variability category. Subjects would thus never need to acquire a clear conception of the high-variability category. Under these conditions the high-variability category would presumably be treated as an other alternative during transfer. As a result, high-level distortions of the low-variability category would tend to be classified as members of the high-variability category, but not for the theoretically relevant reason.

It was therefore important to ensure that subjects would actually learn the distribution of the high-variability category during the learning phase, rather than treat it as a vague other category. Accordingly, the training set consisted of equal numbers of .07 distortions of the standard for the low-variability category, .15 distortions of the standard for the high-variability category, and .50 distortions of both standards. Instances created by a .50 distortion probability are truly random (i.e., they are statistically independent of the generating standard), and thus constituted a true other category. To reach the learning criterion of 10 successive correct trials, subjects therefore had to learn not only to discriminate instances of the low- versus high-variability categories but also to discriminate instances of the high-variability category from others.
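For readers who want to simulate this design, a minimal sketch of the generative procedure as we read the Method above (the 100-cell grid and the distortion probabilities come from the text; the set sizes, function names, and the way the random items split across standards are illustrative assumptions):

```python
import random

N_CELLS = 100  # 10 x 10 grid, each cell black (1) or white (0)

def make_standard():
    """A standard pattern: each cell black or white with equal probability."""
    return [random.randint(0, 1) for _ in range(N_CELLS)]

def derived_standard(standard, n_flip):
    """A second standard that agrees with the first in N_CELLS - n_flip cells."""
    flipped = list(standard)
    for i in random.sample(range(N_CELLS), n_flip):
        flipped[i] = 1 - flipped[i]
    return flipped

def distort(standard, p):
    """An exemplar: flip each cell of the standard independently with probability p."""
    return [1 - cell if random.random() < p else cell for cell in standard]

low_std = make_standard()
high_std = derived_standard(low_std, 60)  # the two standards share only 40 cells

# Experiment 2 training set: equal numbers of .07 distortions of one standard,
# .15 distortions of the other, and .50 (i.e., random "other") distortions.
training = ([("low", distort(low_std, .07)) for _ in range(30)] +
            [("high", distort(high_std, .15)) for _ in range(30)] +
            [("other", distort(random.choice([low_std, high_std]), .50)) for _ in range(30)])
random.shuffle(training)
```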

Subjects were required to make a decision for each pattern within 7 s. Half the subjects received error correction on each trial and half never received error correction. For those trained without error feedback the assignment of category label (Smith or Wilson) to the two standard categories was arbitrary; however, the category to be labeled other was nonarbitrary.

At the beginning of the transfer phase, subjects were told that they would see new works by Wilson and Smith, not necessarily in equal numbers, and that no works by other people would be included. They had to classify each pattern as either a Wilson or a Smith, and thus were forced to discriminate solely between the high-variability and the low-variability categories. Chance accuracy was therefore 50%. The transfer set consisted of a total of 100 patterns, half derived from each standard, with equal numbers generated at distortion probabilities of .15, .25, .30, and .35. As in Experiments 1A and 1B, the assignments of category labels to instances in the transfer phase had to be the same as those established during learning (i.e., for subjects in the no-feedback condition, assignments were not arbitrary at transfer).

Twenty-five University of Michigan students served as paid subjects.

Results and Discussion

Learning phase. The mean number of learning trials was virtually identical for subjects who received feedback and those who did not (although medians favored the feedback subjects, 119 vs. 149). However, consistent with Prediction L2, 5 subjects in the no-feedback condition (vs. none in the feedback condition) failed to reach the learning criterion within 200 trials. In all cases the nonlearners had difficulty discriminating the high-variability and other categories.

Transfer phase. Only data for the 20 subjects who reached the learning criterion were analyzed. An overall picture of transfer performance is provided by Figure 6, in which appears the percentage of items classified into the high-variability category as a function of the log likelihood ratio favoring that category. Percentage classified into the high-variability category was an increasing function of likelihood ratio in favor of that category, F(5, 90) = 23.0, p < .001. Error feedback did not significantly influence the pattern of results; however, the large percentage of nonfeedback subjects who failed to learn cautions against accepting the null hypothesis. The results also did not differ as a function of which standard was assigned to the high-variability category.

The main concern in Experiment 2 was to determine whether subjects base their classification decisions on likelihood or distance. The relative likelihood rule predicts that more patterns will be classified into the high-variability than the low-variability category, since subjects will have learned a broader density function for the former category. In contrast, a strict distance-to-prototype rule predicts that an equal proportion of instances will be classified into each category, because the distributions of instances around their standards were actually identical for the two categories during the transfer phase. The prediction of the relative likelihood rule was supported, as 61% of the transfer items were placed in the high-variability category. This figure was significantly higher than the 50% predicted by the distance rule, t(19) = 3.28, p < .01.

A separate analysis was performed for just those items that were closer to the standard of the low-variability category (in terms of changed cells), but more likely to be generated by the high-variability category. These items provide a particularly strong test of whether subjects used a decision rule based on distance or likelihood. If classifications were based on any monotonic function of physical distance that is constant over category variability, these patterns would tend to be placed in the low-variability category; but if classifications were based on any monotonic function of likelihood, these items would tend to be placed in the high-variability category. The prediction of the relative likelihood rule was confirmed; 64% of these critical items, which were physically closer to the low-variance standard, were actually classified into the high-variability category. This percentage was significantly higher than 50%, t(19) = 2.67, p < .02.
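To see how such critical items arise under the distortion model used to construct the stimuli, the following sketch (our illustration; the specific distances are assumptions) computes the log likelihood of a pattern under each category, treating each of the 100 cells as flipped independently with the category's distortion probability:

```python
import math

N_CELLS = 100

def log_likelihood(d, p):
    """Log probability of a pattern differing from a standard in d cells,
    if each cell was flipped independently with probability p."""
    return d * math.log(p) + (N_CELLS - d) * math.log(1 - p)

# A hypothetical transfer item: 25 cells from the low-variability standard (.07)
# and 30 cells from the high-variability standard (.15).
ll_low = log_likelihood(25, .07)   # roughly -71.9
ll_high = log_likelihood(30, .15)  # roughly -68.3

# The item is physically closer to the low-variability standard, yet the log
# likelihood ratio favors the high-variability category (roughly +3.6).
print(ll_high - ll_low)
```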

Figure 6. Percentage of transfer patterns classified into the high-variability category as a function of the log likelihood ratio in favor of that category (Experiment 2). (Conditions: Feedback; No Feedback.)

Experiment 3

The purpose of Experiment 3 was to provide a more extensive investigation of the role of labeled instances and other types of supplementary information (Tracy & Evans, 1967) in facilitating category learning, and in particular to determine whether knowledge of the number of categories facilitates learning (Prediction L4). Different groups of subjects received one of four levels of supplementary information. Subjects in the labeled instances condition received a category label (Wilson or Smith) with each exemplar. Subjects in the number known condition received neither category labels nor error correction, but were told (correctly) that there were two categories to be learned. The first of these two conditions represented complete information, whereas the second was similar to the no-feedback conditions of Experiments 1B and 2 in the amount of information available. Subjects in two additional conditions received still less information about the learning task. Those in the number unknown condition were told they were to try to learn the categories represented in the training set, but were not told how many categories would be present. Finally, those in the observation only condition were not told that their task was to learn categories, nor that categories were present; they were simply told to "pay the utmost attention" to each pattern in the set (cf. Reber & Allen, 1978). Prior to a subsequent transfer task, all subjects were informed that they had seen exemplars drawn from exactly two categories. The category density model predicts that the labeled instances condition will yield superior learning to the number known condition (Prediction L2), and that the latter condition will yield superior learning to the remaining two conditions (Prediction L4). In fact, because the parameter-revision procedure can only operate if the number of categories is known, the model predicts that no learning will take place in the number unknown and observation only conditions (unless subjects in these conditions happen to guess the correct number of categories).

Method

Apparatus and patterns. The patterns were presented on a Hazeltine text terminal controlled by a PDP-11/34 computer. Each pattern was composed of a 10 X 10 grid, in which each cell consisted of two horizontally aligned character spaces. If the cell was defined as black, both character spaces were blank; if it was defined as white, each space was occupied by the rectangular ASCII character 127 (rubout). Two standard patterns were generated for each subject. The first standard was created by randomly making each cell black or white with an equal probability. The second standard was derived from the first by switching 50 randomly selected cells from black to white or vice versa. Different standards were randomly generated for each subject within a given condition, whereas across learning conditions subjects were yoked with respect to the standards, training exemplars, and transfer set. As in previous experiments, exemplars were generated by probabilistic distortions of the standards.

Design and procedure. Subjects were randomly assigned to one of eight experimental conditions, defined by the 2 X 4 (Variability X Condition) factorial combination of low- versus high-variability of training exemplars (.07 vs. .15 distortion probabilities), and the four instructional conditions outlined earlier. Following initial instructions, all the experimental subjects participated in a training phase in which they viewed 200 exemplars. These consisted of a random mixture of 100 instances derived from each standard. A major methodological change introduced in Experiment 3 was thus to present subjects with a fixed number of training exemplars, rather than to allow subjects to reach a learning criterion. Using a fixed number of learning trials avoids several methodological problems inherent in a criterion procedure. A criterion measure creates a confounding between learning difficulty and number of training trials. In addition, the exclusion of those subjects who failed to reach the criterion can bias analyses of transfer performance in favor of conditions that are more difficult to learn. Although using a fixed number of learning trials avoids the above problems, it does have a disadvantage of its own. Differences in learning rate among instructional conditions may not be observed if all conditions achieve asymptote within the allotted number of learning trials. This is most likely to occur for groups who receive low-variability training exemplars.

Each pattern was presented for just 2 s. The subject then pressed a key to initiate presentation of the next pattern. Subjects did not make overt classification responses during learning, but simply observed the exemplars. Except for the labeled instances condition, for which a category label, Wilson or Smith, was written beneath each pattern, all subjects had the same type of observation experience.

After completing the training phase, subjects in the number unknown and observation only conditions were debriefed to find out whether they had noticed that the patterns were drawn from two categories. Subjects in the observation only condition were asked a series of increasingly directive questions: (a) Did you notice anything interesting about the patterns? (b) Did you notice any similarities among them? (c) Did you notice that the patterns fell into different groups or categories? and finally, (d) How many categories do you think the patterns were divided into? The latter two questions were also asked of subjects in the number unknown condition. All subjects were then informed that the number of categories was two.

The subjects then received 100 transfer trials. In the manner of Experiment 1A, subjects were told that the patterns would include new works by Wilson and Smith, as well as an unspecified number of works by other people. The transfer patterns actually consisted of 50 exemplars from each category, 10 at each of five exact distances from the standard (in terms of number of changed cells): 5, 15, 25, 35, and 45. Each pattern was presented for a maximum of 7 s, and subjects pressed one of three keys to classify the pattern as a Wilson, a Smith, or an other. No error correction was given. As in previous experiments, subjects in all except the labeled instances condition were free to interchange the keys for Wilson and Smith; their key assignments were determined afterward by the usual consistency test.

In addition to the eight experimental conditions described so far, an additional transfer control condition was included.6 Subjects in this condition did not receive any learning trials. The transfer control condition was included to provide a base-rate estimate of the amount of learning that could occur without feedback solely during the transfer phase of the experiment.

One hundred undergraduates served as paid subjects. Ten were assigned to each of the eight experimental conditions, and 20 to the transfer control condition.

Results

Knowledge of category number. A preliminary assessment of the difficulty of the learning task in Experiment 3 is provided by the responses to the questions asked subjects in the number unknown and observation only conditions. The first two questions directed to the observation only subjects (Did you notice anything interesting about the patterns? Any similarities among them?) failed to elicit any clear statements regarding the presence of categories. When asked whether they had noticed that the patterns were divided into categories, 8 subjects in the observation only condition said yes, 11 said no, and 1 did not respond (without notable differences between those who had viewed low- vs. high-variability distributions). Finally, subjects in both the observation only and number unknown conditions were directly asked to estimate the number of categories. (Subjects in the observation only condition were first told that the patterns were indeed drawn from different categories.) The distributions of estimates did not differ across either instructional conditions or levels of training variability, so we will report the aggregate results for all 40 subjects: The median estimate was 4, with a range of 2 to 15. Previous studies have also reported overestimation of the number of categories (Bersted et al., 1969; Hartley & Homa, 1981). Only 4 subjects said there had been two categories; all of these 4 had received observation only instructions, and 3 had seen high-variability distributions—the learning situation one might well suppose had the least a priori likelihood of enabling the number of categories to be learned. Given the diversity of the estimates, it seems quite likely that the few subjects who gave the correct answer did so by fortuitous guessing. It is clear that virtually (and perhaps literally) none of the subjects learned the number of categories used to generate the patterns, even with exposure to 200 examples.

Transfer performance. Presented in Figure 7 is the percentage of patterns classified correctly as a function of distance from the standard, plotted separately for each instructional condition. The results for the conditions that received low-variability distributions during the learning phase appear in panel A, whereas the results for the conditions that received high-variability distributions appear in panel B. For purposes of comparison, the results for the transfer control condition are plotted in both panels. An analysis of variance was performed on these data, with the transfer control subjects divided into two arbitrary groups to create a balanced design. Percentage correct declined significantly with increasing distance from the standard for each of the eight experimental groups (p < .01) but not for the transfer control condition. Since subjects in the latter condition did not receive a training phase, any category learning would have had to take place over the course of the transfer trials. In fact, even this control condition produced a significant effect of distance from the standard when only data for the second half of the transfer trials were considered, F(1, 76) = 11.4, p < .001, indicating that learning did take place during the transfer phase. Percentage correct declined from .47 for the items closest to the standard to .36 for those farthest from it.

When the response other is available, percentage correct should show a greater decline with increasing distance for the low- than for the high-variability groups (as in Experiment 1A). As can be seen by comparing panels A and B in Figure 7, this prediction was confirmed, F(4, 360) = 8.28, p < .001. Collapsing over the four experimental conditions within each variability level, this interaction took the form of a crossover: For those patterns closest to the standard, the percentage correct was higher for the low-variability condition (77% vs. 63%), whereas for those patterns furthest from the standard this difference reversed (20% vs. 32%). This pattern confirms the comparable result (Prediction T2) obtained in Experiment 1A.

6 This control condition was suggested by Michael Flannagan.

The results of primary interest concern the effects of the different types of supplementary information on learning. Because subjects in the various instructional conditions within each variability level received exactly the same exemplars during learning, differences among the groups within each level of variability must reflect differential learning of the distributions. Superior learning should be evidenced by higher percentage correct for those patterns most likely to belong to the training distribution (i.e., those relatively close to the standard). (From the learner's point of view, patterns relatively far from the standard ought to be classified into the other category.) For the low-variability groups, an analysis was performed on the percentage correct data for patterns 5 cells from the prototype. The labeled instances and number known conditions did not differ (t < 1) but were superior to the number unknown and observation only conditions (p < .01). The latter two groups did not differ from each other (t < 1) but were superior to the transfer control condition (p < .01). A comparable analysis was performed for the high-variability groups, examining percentage correct for patterns 5 and 15 cells from the standard (the patterns most consistent with the high-variability training distributions). In this analysis the labeled instances group exceeded the other three experimental conditions (p < .01). The latter did not differ among themselves, but were collectively superior to the transfer control condition (p < .05).

Figure 7. Percentage of transfer patterns classified correctly as a function of distance from the correct standard for the various instructional conditions (Experiment 3). (Data for low-variability conditions appear in panel A, and data for high-variability conditions appear in panel B; data for the transfer control condition are plotted in both panels. Conditions: Labeled Instances; Number Known; Number Unknown; Observation Only; Transfer Control.)

A possibility to be considered is that the above differences in percentage correct only reflect differences among the propensities of different groups to use the other category, since other responses were always scored as incorrect. However, if only such response biases were operating, groups ranking relatively high in overall percentage correct would tend to rank relatively low in percentage correct of those patterns classified into a nonother category. No such trade-off was apparent; rather, groups ranked relatively high in overall percentage correct tended to also be ranked relatively high in percentage correct of those patterns not classified other. For the low-variability groups conditional percentage correct was .79, .79, .73, .71, and .60 for the labeled instances, number known, number unknown, observation only, and control conditions, respectively. Comparable figures for the high-variability groups were .75, .63, .66, .61, and .60.

Discussion

The results just presented provide a mixture of support and difficulty for the category density model. First, consider the conditions that received low-variability training exemplars. Subjects in the labeled instances and number known conditions were most accurate in classifying the patterns closest to the standard. These are the only two learning conditions that had the prerequisite information for use of a strategy of parameter revision. The lack of difference between the labeled instances and number known conditions at the lower level of variability suggests that learning had approached asymptote in these two conditions after the 200 training trials. This result is thus consistent with the category density model. The advantage displayed by the number known condition relative to the number unknown and observation only conditions also supports the model (Prediction L4), because knowledge of category number is critical to the postulated parameter-revision process. However, the substantial, albeit lesser, degree of learning by subjects in the latter two conditions cannot be explained by a parameter-revision process. These subjects were not told the number of categories in advance, nor did they determine the number of categories during the learning trials; accordingly, they lacked the prerequisite information for parameter revision. The results thus implicate a second type of learning mechanism that requires even less supplementary information than does parameter revision.

The data for the high-variability condition also present a mixed picture for the category density model. The predicted advantage of receiving labeled instances (Prediction L2) was confirmed for categories with a high degree of overlap. However, as in the low-variability condition, substantial learning also took place in the number unknown and observation only conditions. Unlike the result obtained for the comparable low-variability groups, the number known condition was not superior to the two conditions that lacked knowledge of the number of categories (a failure of Prediction L4). The results thus suggest that subjects had available some other learning mechanism that operates as effectively for high-variability categories as does parameter revision without feedback.

Experiment 4

The relative efficacy of the various instructional conditions should be independent of whether or not an other alternative is provided at transfer. In the absence of an other category, superior learning should again be evidenced by higher percentage correct for patterns at the distortion levels most likely to have been observed during training. Without an other category, accuracy will necessarily approach an asymptote at chance level as distortion level is increased. Differences in percentage correct across instructional conditions should therefore be progressively attenuated as transfer patterns become more remote from the standards. As a result, the function relating percentage correct to distance from the correct standard should have a steeper slope for an instruction condition that produces superior learning (when variability of the training exemplars is held constant). Experiment 4 was performed to examine the effects of alternative levels of supplementary information on transfer performance in the absence of an other alternative. Rather than repeating the full design of Experiment 3, only two comparisons of particular theoretical import were made. First, the number known and number unknown conditions with low-variability training were compared. We wished to replicate the advantage of the number known group (Prediction L4), observed for the comparable conditions in Experiment 3. Second, the labeled instances and number known conditions with high-variability training (for which subjects are not likely to reach asymptote after 200 learning trials) were also compared. Obtaining the predicted advantage of providing labeled instances (Prediction L2) would extend the comparable result observed in Experiment 3.

Method

The method of Experiment 4 was identical to that of Experiment 3, except that only the four groups mentioned above were included in the design, and no other alternative was provided during the transfer phase (as in Experiment 1B). Thirteen subjects served in each of the two low-variability conditions, and 8 served in each of the high-variability conditions.

Results and Discussion

Presented in Figure 8 is the mean percentage correct for the four conditions as a function of distance from the standard. All conditions yielded response functions with significant negative slopes (p < .01) approaching asymptote at the chance expectation of 50% for patterns maximally dissimilar to the standard. Both of the critical comparisons between conditions were in accord with the predictions of the density model. For the two low-variability groups (panel A in Figure 8), the slope of the distance function was significantly steeper for the number known than for the number unknown condition, t(96) = 2.72, p < .01 (Prediction L4). Tests of the simple main effects of instructional condition revealed that the number known condition produced significantly higher percentage correct for patterns five cells from the standard (p < .01) whereas the two groups did not differ significantly for patterns at higher distortion levels. As in Experiment 3, subjects in the number unknown group could not report the true number of categories (2) at the end of the learning phase. The median estimate was 4, with a range of 1 to 15. The superior performance exhibited by the number known condition for the patterns most likely to be generated by the training distribution can be attributed to the use of a parameter-revision strategy, which knowledge of category number makes possible. These results replicate and extend the parallel findings of Experiment 3.

Figure 8. Percentage of transfer patterns classified correctly as a function of distance from the correct standard for the various instructional conditions (Experiment 4). (Data for low-variability conditions appear in panel A, and data for high-variability conditions appear in panel B. Conditions: Labeled Instances; Number Known; Number Unknown.)

Presented in panel B of Figure 8 are the results for the two high-variability conditions that were tested. The slope of the distance function was higher for the labeled instances than for the number known condition, t(56) = 4.96, p < .001, and percentage correct was significantly higher for the former condition at the two lowest distortion levels (p < .01, Prediction L2). These results extend the comparable findings obtained in Experiment 3.

General Discussion

Summary and Implications

The present results provide a broad initial basis of support for the category density model and its assumptions about distribution learning, as well as suggesting ways in which the model requires revision. The first two experiments yielded a number of results predicted by the proposed relative likelihood decision rule. These include the following: (a) the superior transfer performance that resulted from training on high-variability instances when an other category was available (Experiment 1A); (b) the absence of any clear advantage for the high-variability condition when the other category was removed (Experiment 1B); and (c) the tendency to classify items into the more likely high-variance category, even for patterns physically closer to the low-variance standard (Experiment 2). These results indicate that learners use exemplars to induce the distribution functions of categories, and then classify novel instances according to a relative likelihood rule based on these induced density functions.

Other results obtained in Experiments 1 and 2, plus the more detailed exploration of the learning process in Experiments 3 and 4, have implications for possible mechanisms of distribution learning. Major findings included the following: (a) Category distributions can generally be learned regardless of whether or not the learner receives labeled instances or error correction (Experiments 1A-3); (b) labeled instances, which obviate the need for a decomposition process, facilitate category learning under conditions of high distributional overlap (Experiments 3-4); (c) without labeled instances, prior knowledge of the number of categories (a prerequisite for parameter revision) facilitates acquisition of category knowledge when the distributions do not overlap excessively (Experiments 3-4); and (d) category structure can still be learned when the learner does not receive error correction, information about category number, or even instructions to learn categories (Experiment 3).

Toward a General Model of Distribution Learning

This article has focused on a specific version of the category density model that can account for the induction of normally distributed categories when subjects know (or assume) the form of the distributions and the number of categories. It is clear that this specific model is too restrictive to account for all aspects of human capacity to learn category distributions. As we argued at the outset, normal distributions may well be an ecologically important special case; nonetheless, there is experimental evidence that people can sometimes learn markedly nonnormal distributions (Neumann, 1977). Furthermore, the results of Experiment 3 demonstrated that people can learn distributions to some degree not only without knowledge of the number of categories, but without knowledge that they are in a category-learning task at all. Perhaps a learning model entirely different from the category density model is required. However, we will instead suggest how the model can be revised and extended. The present findings, as well as other considerations discussed later, lead us to sketch a more general learning model which, while speculative, yields testable predictions.


The parameter-revision process described earlier allows a schematic representation of distributional knowledge to be updated without exemplars necessarily being stored in memory. The model is thus quite contrary in spirit to instance-storage models such as that proposed by Medin and Schaffer (1978). However, a more general model can be devised by integrating these two approaches. In this dual-representation, dual-process model, category distributions can be represented either by statistical parameters or by memory traces of instances. The latter type of representation permits a mechanism for learning category distributions without knowledge of the number of categories or the form of their density functions. This strategy, which corresponds closely to a proposal made by Evans (1967), involves storing traces of presented instances until separable clusters emerge. These memory traces need not be highly veridical, as long as they collectively approximate the mixture density of the sample. Each observed instance could be encoded as a point in a feature space, with an associated gradient of generalization. As further instances are encoded, a multidimensional frequency histogram will gradually be built up, providing a nonparametric representation of the mixture density.
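One way to read this proposal computationally is as a kernel-style density estimate: each stored trace contributes a generalization gradient, and the sum over traces approximates the mixture density. The sketch below is our illustration, not the article's simulation; the Gaussian kernel and the bandwidth parameter are assumptions.

```python
import math

def mixture_density(point, stored_instances, bandwidth=1.0):
    """Nonparametric estimate of the mixture density at `point`: each stored
    instance contributes a Gaussian generalization gradient around itself."""
    total = 0.0
    for trace in stored_instances:
        sq_dist = sum((x - t) ** 2 for x, t in zip(point, trace))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / len(stored_instances)
```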

An instance-clustering process might enable a person to learn categories without error correction, even if the learner does not initially know either the number of categories or the form of their distributions. If the underlying category distributions are sufficiently discriminable, separable clusters will eventually emerge, corresponding to peaks and valleys in the histogram for the mixture. Various clustering algorithms could be used to model the decomposition process (Duda & Hart, 1973; Everitt, 1974). As is the case for parameter revision, feedback would presumably facilitate decomposition.
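As one stand-in for the clustering algorithms cited above (our sketch, not the decomposition procedure used in the authors' simulation), a bare-bones k-means pass over the stored instances can recover candidate clusters once a number of categories is assumed:

```python
def k_means(instances, k, iterations=25):
    """Crude k-means decomposition of stored instances into k clusters."""
    centroids = [list(inst) for inst in instances[:k]]  # seed with the first k traces
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for inst in instances:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(inst, centroids[j])))
            clusters[nearest].append(inst)
        for j, members in enumerate(clusters):
            if members:  # recompute each centroid as the mean of its members
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, clusters
```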

The two learning mechanisms we have outlined—parameter revision and instance clustering—could be integrated by elaborating the learning procedure embodied in the computer simulation described in the beginning of this article. In the simulation, a clustering algorithm is used on the first few instances to derive initial parameter estimates, which are then updated using parameter revision. More generally, people may initially store and cluster instances in order to get a general idea of what the categories being presented are like. They may then summarize the information extracted from the stored instances as a parametric description, which subsequently can be fine-tuned using parameter revision. Initial instance clustering can be used to form initial conceptions of the category distribution, or to check a priori assumptions; parameter estimation and revision can be implemented at any point during the learning process once the learner is sufficiently confident about the number and form of the distributions.
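A minimal sketch of how this hand-off might work, assuming independent normal dimensions: seed a per-category summary from a cluster's mean and variance, then revise it incrementally as each new instance is assigned (the Welford-style update here is our choice, not necessarily the simulation's).

```python
class CategorySummary:
    """Parametric description of one category: per-dimension mean and variance,
    revised incrementally as new instances are assigned to the category."""

    def __init__(self, cluster):
        # Seed the parameters from an initial instance cluster.
        self.n = len(cluster)
        dims = list(zip(*cluster))
        self.mean = [sum(d) / self.n for d in dims]
        self.m2 = [sum((x - m) ** 2 for x in d) for d, m in zip(dims, self.mean)]

    def revise(self, instance):
        # Welford-style update of the running mean and sum of squared deviations.
        self.n += 1
        for i, x in enumerate(instance):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def variance(self):
        return [m2 / self.n for m2 in self.m2]
```

A learner could, for instance, build one such summary per cluster returned by the clustering pass sketched above and then revise whichever summary makes each new exemplar most likely.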

Even after a parameter-revision strategy is invoked, continued incidental instance storage may play a role in detecting violations of basic assumptions about the form of the category distributions. For example, suppose the learner assumes the categories to be learned are normally distributed over a feature space, when in fact the distributions are markedly nonnormal (e.g., V shaped). A strict parameter-revision process would never be able to correct this misconception. Presumably, however, current parameter estimates would yield expectations about the frequencies of possible instances of the categories. If the learner found that supposedly rare instances were appearing too frequently, this would cast doubt on the assumed form of the density functions. A rational learner would then temporarily abandon parameter revision and shift to instance clustering.
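One crude way to cash out such a misfit check (our illustration only; the 3-standard-deviation cutoff and the tolerance factor are arbitrary assumptions) is to compare how often instances fall in the assumed distribution's tails with how often a normal form says they should:

```python
import math

def normal_form_suspect(samples, mean, sd, cutoff=3.0, tolerance=5.0):
    """Flag the assumed normal form if 'rare' instances (beyond `cutoff` standard
    deviations) occur far more often than the normal density predicts."""
    expected_tail = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(cutoff / math.sqrt(2.0))))
    observed_tail = sum(abs(x - mean) > cutoff * sd for x in samples) / len(samples)
    return observed_tail > tolerance * expected_tail
```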

In addition, there are very likely circumstances in which stored instances provide the only possible memory representation for a category distribution. Possible examples include the following situations:

1. The distributions of the underlying categories may be so irregular that no simple parametric description of them exists. For example, the category of things stored in my attic may have no simpler description than a list of all the items.

2. The learning set presented to subjects may be so small and/or variable that the categories appear to be collections of unrelated objects, as in Situation 1 just described. Numerous studies in the classification literature may exemplify this type of situation (e.g., Peterson, Meagher, Chait, & Gillie, 1973).

3. The learning task may emphasize rote memorization of instances and disguise the underlying category structure present in the learning set (Brooks, 1978).


These three situations are cases in which no simple schematic description of the category is available. This raises a question that is basic to the distributional framework: What is the range of category distributions that people can encode as parameter vectors? The present paper has emphasized normal distributions, which we suggested may be psychologically natural for continuous dimensions; however, other types of parametric representations may also prove important. For example, a binary feature dimension can be described by a Bernoulli process with its single parameter, p. Presumably there are limits on the types of distributions that people can veridically represent parametrically, and on those that may be learned by instance storage. Some research on the acquisition of nonnormal distributions has been done in studies of decision making (Pitz, Leung, Hamilos, & Terpening, 1976) and of category learning (Flannagan, Fried, & Holyoak, 1981; Neumann, 1977), but more work on this issue is clearly called for.

A second question that needs to be explored concerns the learning and representation of correlations between features within a category (e.g., members of small-bird species are more likely to sing than members of large-bird species). Medin and Schaffer (1978) have shown that people are sensitive to within-category feature correlations for artificial categories. Such correlational information could be represented parametrically by the equivalent of a variance-covariance matrix. Alternatively, a superordinate category composed of several distinct subcategories (i.e., separate or overlapping instance clusters) could be represented as the disjunction of the distributions of the subcategories; the features within each subcategory might be independent. Individual stored instances can be viewed as the limiting case of a category representation based on the disjunction of multiple distributions.7
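The variance-covariance option amounts to treating the category as a multivariate normal whose off-diagonal terms carry the feature correlations. A minimal sketch (our illustration; the function name and the example values are assumptions):

```python
import numpy as np

def mvn_log_density(x, mean, cov):
    """Log density of a multivariate normal; off-diagonal covariance terms
    encode within-category correlations between features."""
    diff = np.asarray(x) - np.asarray(mean)
    k = len(mean)
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + quad)

# Hypothetical two-feature category in which the features are negatively correlated.
mean = [0.0, 0.0]
cov = np.array([[1.0, -0.6],
                [-0.6, 1.0]])
print(mvn_log_density([1.0, -1.0], mean, cov))  # item consistent with the correlation: more likely
print(mvn_log_density([1.0, 1.0], mean, cov))   # item that violates the correlation: less likely
```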

Further Directions

We have already touched on a number of directions in which the category density model may guide research. There is some evidence that prior expectations can influence the induction of category distributions (Neumann, 1977), but the issue has just begun to be investigated systematically by manipulating the form of the distributions of category exemplars over known feature dimensions (Flannagan et al., 1981). In addition, the possibility that the learning process may undergo qualitative changes (e.g., a shift from instance clustering to parameter revision) calls for further studies that vary degree of category acquisition (Homa, Sterling, & Trepel, 1981). The types of categories studied also need to be extended. The category density model is not inherently restricted to categories defined solely by perceptual features as in the present study and most similar research. In principle people could learn category distributions over semantic or functional dimensions as well as perceptual ones.

Finally, it should be emphasized that the relationship between instance and category knowledge has broad import for cognitive theory. For example, Gick and Holyoak (1983) have investigated the induction of a "problem schema" from experience with multiple analogous problems. The category density model assumes that the ideal outcome of category learning is a representation of the dimensions of variation among category exemplars, together with a parametric description of category distributions over these dimensions. Since in its broader sense category learning is clearly involved in various complex domains, such as problem solving and story understanding, a general theory of the learning process could serve to highlight commonalities among a wide range of cognitive activities.

7 The relative likelihood rule can be applied even if categories are represented solely by stored instances. Each stored instance will establish a microdistribution, equivalent to a generalization gradient around the point in a feature space defined by the instance. The relative likelihood rule then predicts that the subjective probability of a particular novel item given a particular category will be proportional to the sum of the subjective probabilities of the item given the microdistributions associated with each of the stored instances of that category (assuming the stored instances to be equally likely).

It should be noted that in this special case, in which distributions are represented solely by stored instances, the relative likelihood rule is isomorphic to the decision rule specified by Medin and Schaffer (1978, p. 211, Assumptions 2, 4, and 5). Their rule operates on a similarity parameter for each feature, which is defined to range in value between 0 and 1. The corresponding parameter of the relative likelihood rule, expressed in terms of features (Equation 4), is interpreted as the subjective likelihood of observing the feature value of the item to be classified given the feature value of the stored instance to which it is being compared. Because this likelihood ranges between 0 and 1 and is assumed to decrease monotonically with increasing distance between the two feature values, it has the same properties as the corresponding similarity parameter specified by Medin and Schaffer's rule. The two rules operate on these basic corresponding parameters in an entirely parallel fashion.
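Read computationally, the special case described in footnote 7 can be sketched as follows (our illustration; `feature_likelihood` is a hypothetical stand-in for the monotonically decreasing likelihood/similarity function, which the footnote leaves unspecified):

```python
def exemplar_likelihood(item, stored_instances, feature_likelihood):
    """Subjective probability of `item` under a category represented solely by
    stored instances: the (equally weighted) sum over instances of a product,
    across features, of per-feature likelihoods in [0, 1]."""
    total = 0.0
    for instance in stored_instances:
        product = 1.0
        for item_value, stored_value in zip(item, instance):
            product *= feature_likelihood(item_value, stored_value)
        total += product
    return total / len(stored_instances)

# Example feature-likelihood function for binary features: 1.0 for a match and a
# fixed value below 1 otherwise, mirroring the role of a similarity parameter.
def match_or_mismatch(a, b, mismatch=0.4):
    return 1.0 if a == b else mismatch

# The relative likelihood rule then classifies an item by comparing
# exemplar_likelihood(item, category_a_instances, match_or_mismatch)
# with the corresponding value for the other category.
```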

References

Anderson, J. R., Kline, P. J., & Beasley, C. M. (1979). A general learning theory and its application to schema abstraction. In G. H. Bower (Ed.), The psychology of learning and motivation (Vol. 13, pp. 277-318). New York: Academic Press.

Attneave, F. (1957). Transfer of experience with a class-schema to identification learning of patterns and shapes. Journal of Experimental Psychology, 54, 81-88.

Attneave, F., & Arnoult, M. D. (1956). The quantitative study of shape and pattern perception. Psychological Bulletin, 53, 452-471.

Bersted, C. T., Brown, B. R., & Evans, S. H. (1969). Free sorting with stimuli clustered in a multidimensional space. Perception & Psychophysics, 6, 409-414.

Brooks, L. (1978). Nonanalytic concept formation and memory for instances. In E. Rosch & B. B. Lloyd (Eds.), Cognition and categorization (pp. 170-207). New York: Wiley.

Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking. New York: Wiley.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.

Edmonds, E. M., & Evans, S. H. (1966). Schema learning without a prototype. Psychonomic Science, 5, 247-248.

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193-242.

Evans, S. H. (1967). A brief statement of schema theory. Psychonomic Science, 8, 87-88.

Evans, S. H., & Arnoult, M. D. (1967). Schematic concept formation: Demonstration in a free sorting task. Psychonomic Science, 9, 221-222.

Everitt, B. (1974). Cluster analysis. London: Heinemann Educational Books.

Flannagan, M., Fried, L. S., & Holyoak, K. J. (1981, November). Perceptual category learning and distributional structure. Paper presented at the 22nd meeting of the Psychonomic Society, Philadelphia, PA.

Fried, L. S. (1979). Perceptual learning and classification with ill-defined categories (Tech. Rep. No. MMPP 79-6). Ann Arbor: University of Michigan, Michigan Mathematical Psychology Program.

Gibson, E. J. (1953). Improvement in perceptual judgments as a function of controlled practice or training. Psychological Bulletin, 50, 401-431.

Gibson, E. J. (1969). Principles of perceptual learning and development. New York: Appleton.

Gibson, J. J., & Gibson, E. J. (1955). Perceptual learning: Differentiation or enrichment? Psychological Review, 62, 32-41.

Gick, M. L., & Holyoak, K. J. (1983). Schema induction and analogical transfer. Cognitive Psychology, 15, 1-38.

Hartley, J., & Homa, D. (1981). Abstraction of stylistic concepts. Journal of Experimental Psychology: Human Learning and Memory, 7, 33-46.

Hayes-Roth, B., & Hayes-Roth, F. (1977). Concept learning and the recognition and classification of exemplars. Journal of Verbal Learning and Verbal Behavior, 16, 321-338.

Homa, D., Sterling, S., & Trepel, L. (1981). Limitations of exemplar-based generalization and the abstraction of categorical information. Journal of Experimental Psychology: Human Learning and Memory, 7, 418-439.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Oldfield, R. C. (1954). Memory mechanisms and the theory of schemata. British Journal of Psychology, 45, 14-23.

Patterson, J. F. (1979). Hypothesis testing in a prototype abstraction task. Unpublished doctoral dissertation, University of Michigan.

Peterson, M. J., Meagher, R. B., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Pitz, G. F., Leung, L. S., Hamilos, C., & Terpening, W. (1976). The use of probabilistic information in making predictions. Organizational Behavior and Human Performance, 17, 1-18.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Raiffa, H., & Schlaifer, R. (1961). Applied statistical decision theory. Cambridge, MA: Graduate School of Business Administration of Harvard University.

Reber, A. S., & Allen, R. (1978). Analogic and abstraction strategies in synthetic grammar learning: A functionalist interpretation. Cognition, 6, 189-221.

Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3, 383-407.

Rosch, E. (1973). On the internal structure of perceptual and semantic categories. In T. E. Moore (Ed.), Cognitive development and the acquisition of language (pp. 111-144). New York: Academic Press.

Rosch, E. (1978). Principles of categorization. In E. Rosch & B. B. Lloyd (Eds.), Cognition and categorization (pp. 28-46). New York: Wiley.

Rosch, E., & Mervis, C. B. (1975). Family resemblances: Studies on the internal structure of categories. Cognitive Psychology, 7, 573-605.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Swets, J. A., Tanner, W. P., Jr., & Birdsall, T. G. (1961). Decision processes in perception. Psychological Review, 68, 301-340.

Tracy, J. F., & Evans, S. H. (1967). Supplementary information in schematic concept formation. Psychonomic Science, 9, 313-314.

Wallsten, T. S. (1976). Using conjoint-measurement models to investigate a theory about probabilistic information processing. Journal of Mathematical Psychology, 14, 144-185.

Received April 30, 1982
Revision received May 16, 1983