
Journal of Archaeological Method and Theory, Vol. 5, No. 2, 1998

Evaluating Consistency in Typology and Classification

John C. Whittaker, 1,2 Douglas Caulkins, 1 and Kathryn A. Kamp 1

Typological systems are essential for communication between anthropologists as well as for interpretive purposes. For both communication and interpretation, it is important to know that different individuals using the same typology classify artifacts in similar ways, but the consistency with which typologies are used is rarely evaluated or explicitly tested. There are theoretical, practical, and cultural reasons for this failure. Disagreements among archaeologists using the same typology may originate in the typology itself (i.e., imprecise type definitions, confusing structure) or in the classification process, because of observer errors, differences in perception and interpretation, and biases. We review previous attempts to evaluate consistency in typology and classification, and use consensus analysis to examine one well-established typology. Both consensus and disparity are apparent among the typologists in our case study, and this allows us to explore the kinds of forces that shape agreement and diversity in the use of all typological systems. We argue that issues of typological consistency are theoretically and methodologically important. Typological consistency can be explicitly tested, and must be if we hope to use typologies confidently.

KEY WORDS: typology; consistency; consensus analysis; ceramics; archaeology.

INTRODUCTION

When Donald Grayson mentions "Lulu Linear Punctated," does Robert Dunnell picture the same sherd (Dunnell and Grayson, 1983)?

1Department of Anthropology, Grinnell College, Grinnell, Iowa 50112. 2To whom correspondence should be addressed. Fax: 515-269-4330. e-mail: [email protected].

129

1072-5369/98/0600-0129$15.00/0 © 1998 Plenum Publishing Corporation


The consistency with which typologies are used has a major effect on how well they serve us for communication and interpretation, yet neither those interested in the theory of typology nor those attempting the practical classification of artifacts have made many attempts to deal with the problems of typological consistency. Poorly formulated typologies, human errors in classification, and theoretical biases may disrupt our ability to understand the typologies of others, to evaluate their interpretations, or even to be sure that our own are free of systematic errors. Both the difficulties of evaluating typological systems and the cultural nature of typologies have hindered consideration of these problems. Nevertheless, they must be tackled head-on if we are to use any typology confidently. We review the sparse literature on the evaluation of typological consistency and present an experiment comparing the sherd classifications of 13 archaeologists. Our intent is to show that consistency is a critical issue for archaeology and to suggest some ways to improve communication and interpretation based on typology. While the focus is on prehistoric artifacts, especially ceramic and lithic artifacts, we view the issues as applicable to all typologies.

The 1960s and 1970s saw a florescence of "God's truth or hocus-pocus" arguments3 between archaeologists who believed that their artifact typologies reflected natural divisions of reality or the perceptions of the artisans and those who believed that typologies are arbitrary or even whimsical constructs of the archaeologist (Burling, 1964; Brew, 1946; Clarke, 1968; Deetz, 1967; Ford, 1954a,b; Gifford, 1960; Rouse, 1960; Spaulding, 1953, 1954; Swartz, 1967). Hill and Evans (1972) and Dunnell (1986) give excellent summaries of this debate. Archaeological consensus now seems to have arrived at the sensible middle ground that typologies are indeed at least partly arbitrary but, nevertheless, can be used to solve problems -- to describe a body of data, to communicate that description, and to answer interpretive questions. The attributes used and the form of the typology should vary according to the problem being solved.

We view typological systems as comprising two halves: the typology, or set of rules for classifying items; and the classification process itself. Dunnell (1986) argues that interest in typology shifted in the last two decades from theory to methodology, a belief that is also reflected by Adams (1988) and Adams and Adams (1991). The methodological issues Dunnell refers to were concerned mostly with the formation of typologies and the definition of categories, especially by computerized statistical techniques such as factor analysis and clustering. The work of the Adamses, despite an emphasis on what they term "practice," has dealt primarily with the terminology and definition of typological concepts, with the costs and benefits of classifying artifacts, and with the need to sort archaeological assemblages efficiently. The methodological issue of how consistently artifacts are classified using a typology, which we consider of great theoretical importance as well, has hardly been touched.

3While the controversy that Burling (1964) addressed dealt with the uses of methods of the New Ethnography, a parallel debate was occurring in archaeology over the epistemological status of typologies.

Whether the types archaeologists define are "real" or not, and whether they are produced intuitively or by the application of mathematical algorithms, our ability to use typologies to solve problems and our ability to describe, communicate, and evaluate one another's interpretations depend to some extent on using well-defined types and agreeing on the definitions of any types that are in wide use. If you want to argue about ceramics with George Quimby, or compare his analyses to your own, you have to understand his definition of Lulu Linear Punctated.

Many typologies have a long and honorable history. Despite frequent arguments about what type a particular sherd falls into or whether "grey flecks" or "sherd temper" better describes the critical attributes, archaeologists tend to assume that everyone means pretty much the same thing when they use a well-established type name. The same can be said of all typologies whether they concern archaeological classification of artifacts, anthropologists' attempts to describe social systems, or even biological nomenclature of life forms. Unless we can all agree on how to identify a Desert Side-Notched point, a chiefdom, an elite good, a redistributive system, or a Crotalus horridus horridus, it is difficult to use such terms as either communicative or analytic devices.

When two of us (Whittaker and Kamp) began work in the Sinagua region of northern Arizona, we had to learn one of the most venerable ceramic typologies in the Southwest. Beginning in the 1930s Harold S. Colton and associates (Colton, 1932, 1941, 1955, 1958; Colton and Hargrave, 1937; Hargrave, 1932) described the Northern Arizona ceramic wares and defined a hierarchical system of wares, types, and varieties that became a model for other typological schemes. Since part of the necessary equipment for any archaeologist desiring to do survey or excavation in a region is a mastery of the local artifact typology, we began to learn the typological system for Sinagua ceramics. Receiving lessons from an established authority (Peter Pilles, Coconino National Forest Archaeologist), then slogging on with a small type collection and published descriptions, and eventually handling thousands of grubby sherds, we passed through a learning sequence that is probably fairly typical. First, despair: the innumerable types all look the same, they are all arbitrary, none of the distinctions make sense, we will never learn this, and it's all probably nonsense anyway. Then growing confidence: the types begin to make sense, there actually are patterns, and most of the sherds fit them. Finally, complacency: the system is not too bad and is actually useful, only a few sherds are anomalous, the types can be used for interpretive research, and many of the flaws in the system can be avoided by defining the types a little differently. In other words, with sufficient experience most archaeologists internalize the rules of a typology to such an extent that they feel that their understanding of the typology is better than anyone else's.

Or is it? How often have you heard someone say, "Bill said it was Type X, but he's wrong; everyone knows it's Type M." It is often assumed that once there is a published type description and a type with temporal or functional implications has been widely used, everyone agrees pretty closely about what that type is and what it means. Moreover, comments like that above rarely reach print, except concealed in notes like, "Type M herein appears to be what Smith calls Type X," and occasional small-scale quibbling about attribute details.

Indeed, it is bad form to be too critical of another archaeologist's typological skills, at least in print or to his or her face. Typologies are "sacred knowledge" acquired as a rite of passage into professional status, and the ability to classify things correctly is a basic professional skill about which some archaeologists are very sensitive. Familiarity also breeds belief; once we have learned and used a system, it appears natural and obvious.

By combining both explicit teaching of the underlying rules and practical experience sorting items according to those rules, the learning process described above contributes strongly to belief in a typology. Adams and Adams (1991, p. 198) term this "gestalt acquisition," in which "we learn to recognize at a glance types which we originally had to identify by a conscious process of attribute analysis." This is analogous to learning any other domain of cultural knowledge. Artifact types become as ingrained and reified as kinship terminologies, and while we may feel less abhorrence of a misclassified sherd than an incestuous union, both come to feel "inherently" wrong.

Even archaeologists who explicitly recognize that typologies are arbitrary, so that any group of objects can be classified in many ways, often come to harbor what Hill and Evans (1972, p. 236) call an "empiricist" belief in the fundamental reality of commonly used types. At this point, the types "do not often need to be questioned; they are primary data, and only the inferences from them are open to serious critique." Thus, while we frequently argue about type definitions and the classification of individual specimens, we also rarely evaluate the basic typological systems we use, and thus beg important questions: How well do our types agree with the definitions of other workers, and what are the sources of variation and disagreement when different analysts classify artifacts? Are we communicating what we think we are?


Is it possible to examine these questions systematically? We believe so, but consistency in classification is one of the most studiously ignored problems in discussions of archaeological typology.

When archaeologists argue about whether or not a particular typological scheme is "good," there are several issues involved. Most often, the discussion focuses on whether the typology makes intuitive sense or is suitable for attacking a particular interpretive problem. However, the practical issue of whether a typology has clear rules and can be consistently used by different observers is also sometimes raised. As we discuss the evaluation of typological systems, our focus is on these practical aspects, but they impinge also on the more frequently considered issues of the nature and uses of typology. If multiple observers cannot consistently use a typology, how can they rely on or compare the interpretations based on it?

CONSISTENCY ISSUES IN TYPOLOGICAL THEORY

In 1972, at the height of the debates about typological theory, David Clarke's (1972) Models in Archaeology included two revealing articles, "A Model for Classification and Typology" (Hill and Evans, 1972) and "Research Design Models" (Daniels, 1972). Hill and Evans (1972) summarized the typological debate as it then stood. They began and concluded with a list of seven issues that they considered the loci of archaeological arguments about typology. It is useful to paraphrase their list (Hill and Evans, 1972, p. 231):

1. Are types real, or invented by archaeologists for their own purposes?

2. Is artifact variability continuous, or can types be discovered as nonrandom clusters of attributes?

3. Is there a single best type division of a domain, or are there many equally good ones?

4. Can we formulate standard types, and should we?

5. Are types basic data?

6. Do we need more or fewer types?

7. What should types mean (e.g., chronology, function, mental templates)?

Hill and Evans concluded by arguing against the "empiricist" view that the "best" types for classifying any body of data are likely to be repeatedly discovered by any analyst. Some types are "real" types in the sense that some nonrandom attribute clusters can be discovered, but because the researcher cannot consider all attributes, it is impossible to pursue a single best typology. Attributes are always selected by the analyst, and therefore attributes -- and the types they define -- are always affected by the problems and the biases of the investigator. Accordingly, Hill and Evans supported a "positivist" model, arguing that because all phenomena, including nonrandom attribute clusters, are assigned meanings by the observer, there is no single or best typology. Thus, archaeologists should be guided in the formation of typological schemes by hypotheses about what they wish to learn. This seems to be the general view today (Adams and Adams, 1991) and is the theoretical basis on which we intend to proceed.

However, it should be noted that the list of interesting typological questions presented by Hill and Evans is almost completely concerned with the theoretical issues of forming and defining typologies, rather than the practical issue of whether or not archaeologists can apply a typology consistently and effectively. In fact, they quote a personal communication from Al Spaulding (Hill and Evans, 1972, p. 261), who says, "Any number of archaeologists employing the same analytical techniques on the same variables in the same collection should come up with the same results."

Our experience with the argumentative and opinionated nature of most scientists, including ourselves, leads us to wonder how Spaulding could have arrived at such a conclusion. It is theoretically true that if humans were reliable, consistent, and unbiased observers, there would be no problem. However, typologies are constructed and used by humans, and there are many reasons why "any number of archaeologists employing the same analytical techniques on the same collection" are almost certain to come up with different conclusions. Both the construction and the use of the typology may be involved, and a distinction should be made between the typology, as a system of rules and labels for classifying items, and classification, the act of classifying or assigning items to their defined categories. When we speak of evaluating typological systems, we must have both the rules and the behavior in mind, because they are not entirely separable, and both can be a source of error.

A few pages away from Hill and Evans (1972), S. G. H. Daniels (1972) was presenting a model for research design, which, while largely theoretical, dealt much more actively with the practical problems of classification than did Hill and Evans' discussion of typology. Daniels preached the need to deal with the kinds of error introduced into archaeological data by "post-depositional factors" (a concept soon to be popularized by Schiffer and others as one category of "formation processes") and by "research factors." Daniels' "research factors" are what concern us.

Daniels recognized three kinds of errors resulting from human failings in the research process. Random noise is produced by small mistakes. Rather than causing false patterns in the data, random errors tend to obscure patterns by increasing the variance. The effects of random errors can be decreased by increasing sample size. Gross errors are rare, but relatively large, mistakes, such as confusing the provenience of a batch of artifacts. The third kind of error, bias, is probably more common and presents the most problems. Because bias is directional and cumulative it can seriously affect results, and it cannot be eliminated simply by increasing the number of observations.
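Daniels' contrast between random noise and bias can be illustrated with a small simulation (our construction, not his; all values are hypothetical). The mean of noisy measurements converges toward the true value as the sample grows, while a constant directional bias survives any sample size.

```python
# Illustrative sketch: random noise vs. directional bias in simulated
# measurements (e.g., hypothetical rim diameters in cm).
import random

random.seed(42)
TRUE_DIAMETER = 20.0  # hypothetical true value

def measure(n, noise_sd=1.0, bias=0.0):
    """Simulate n measurements with Gaussian noise and a constant bias."""
    return [TRUE_DIAMETER + bias + random.gauss(0, noise_sd) for _ in range(n)]

def mean(xs):
    return sum(xs) / len(xs)

# Random noise: mean error shrinks as n grows.
print(f"n=10     mean error: {abs(mean(measure(10)) - TRUE_DIAMETER):.3f}")
print(f"n=10000  mean error: {abs(mean(measure(10_000)) - TRUE_DIAMETER):.3f}")

# Bias: a +0.5 cm offset persists no matter how many observations we add.
print(f"n=10000, biased: {abs(mean(measure(10_000, bias=0.5)) - TRUE_DIAMETER):.3f}")
```

The design point is Daniels' own: more data averages away noise but merely entrenches bias.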

Daniels' research design model presented a number of suggestions for controlling error, which are relevant to the practical use of typological systems. The errors from random noise and gross mistakes can be dealt with relatively easily by ensuring an adequate data sample and by redundancy and rigor in recording and analytical procedures. Bias is a tougher nut to crack. As Daniels pointed out, when the subjectivity of judgments increases, directional errors produced by the differing biases of different observers also increase. Here Daniels is concerned with what we term "consistency," the degree to which different observers agree or make the same observations. His proposed solutions, randomization and quality control, indicate this. If specific analysts have particular biases in classification, the effect of these biases can be minimized by randomly assigning portions of the data to different researchers. If the number of data subsets is large and the quantity of data analyzed by each participant is equivalent, this procedure should reduce the analysts' respective biases to random noise.

Quality control is "a system in industry for ensuring that an acceptably small portion of goods . . . is defective, while making an acceptably small expenditure on the inspection process" (Daniels, 1972, p. 210). Quality control in archaeology can be performed by duplicating a sample of the total observations to assess "the reproducibility of results produced by a single observer, and the agreement of results produced by different observers." If the amount of error is too high, the observations, like damaged cookies, can be discarded or remade. In both industry and archaeology, it is necessary to decide what level of error is acceptable; in archaeology this decision is largely arbitrary and intuitive. Furthermore, in archaeology "the research process does not normally produce items with known specifications, but observations with an unknown true value and a certain degree of observational error" (Daniels, 1972, p. 210). This is why quality control in archaeological analysis must be a matter of comparing observations and dealing with the biases of different observers, rather than comparing data to a known standard.
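A Daniels-style quality-control check can be sketched in a few lines (our illustration: the type names, assignments, and 90% threshold below are all hypothetical, not from the paper): classify a duplicate sample twice and compute how often the two passes agree.

```python
# Sketch of quality control by duplicated classification: compare a
# first and second pass over the same sherds and flag excessive error.

def agreement_rate(first_pass, second_pass):
    """Fraction of duplicated observations classified identically."""
    matches = sum(a == b for a, b in zip(first_pass, second_pass))
    return matches / len(first_pass)

# Hypothetical type assignments for the same ten sherds, two passes.
pass_one = ["Tusayan B/W", "Sosi B/W", "Sosi B/W", "Dogoszhi B/W",
            "Tusayan B/W", "Sosi B/W", "Dogoszhi B/W", "Sosi B/W",
            "Tusayan B/W", "Sosi B/W"]
pass_two = ["Tusayan B/W", "Sosi B/W", "Dogoszhi B/W", "Dogoszhi B/W",
            "Tusayan B/W", "Sosi B/W", "Dogoszhi B/W", "Sosi B/W",
            "Tusayan B/W", "Flagstaff B/W"]

ACCEPTABLE = 0.90  # as the text notes, this cutoff is largely arbitrary
rate = agreement_rate(pass_one, pass_two)
print(f"agreement: {rate:.0%}")
if rate < ACCEPTABLE:
    print("below threshold: observations should be re-examined or remade")
```

The same function compares two different observers' passes, which is the inter-observer half of Daniels' proposal.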

Daniels (1972) produced good theoretical arguments on the need to control observational errors and a strong programmatic statement, in general terms, for how it could be done. He was, however, more "vox clamantis in deserto" than a prophet. In the 25 years since, few archaeologists have attempted quality control in classification, or even seriously discussed the problems of typological consistency.


There is, however, an implicit consideration of one aspect of consistent typological practice embedded in some of the theoretical developments in the field. Hill and Evans (1972), among others, did discuss the issue of standardized types: Can we formulate them? and Do we need them? The question for Hill and Evans is whether standardized types are really useful. They seem to doubt it, preferring to suggest that some attributes be universally recorded for certain kinds of data, and avoiding the fact that there are already many standardized typologies in wide use. They quote communications with George Cowgill and Albert Spaulding, who promote standardized typologies for "strategic" purposes, by which they mean that some broadly based and widely used typologies should be useful for many problems. This has been the case with many well-known typologies, ranging from ceramic types to projectile point and other lithic forms to the social typologies of Service and Fried. However much we argue about them, many standard typologies serve as a vocabulary for communication. They are also heuristic, serving as starting points for hypotheses and arguments, and as data that can be manipulated to discern patterns for certain kinds of problems.

As we see it, the real problem with standardized typologies, and one which is not discussed by Hill and Evans, is the question of consistency. Some of those who promoted standardized typologies recognized the need for consistent classification and comparable use of the same typologies. They attacked the problem of reducing error by what Daniels would term "methodological rigor" by tightening up the definitions in the formulation of typologies. Quantitative definitions were a logical step. In effect, the numerous mathematical procedures that were proposed for defining (or "discovering") types (e.g., Spaulding, 1953; Hodson et al., 1966; Read, 1974; Dumond, 1974; Read and Christenson, 1977) or for classifying artifacts by type (e.g., Whallon, 1972; Gunn and Prewitt, 1975; Thomas, 1981) can be seen as attempts to reduce human bias. However, as Adams and Adams (1991) point out, computer typology has seen more theoretical discussion than practical application.

While clear and precise type definitions, whether metrically based or not, should be expected to improve the consistency of classification, they do not completely solve the problem. There will always be ambiguous or anomalous specimens, and differences in the perceptions of observers. Perceptual differences should not surprise us; the literature on individual variation in artifact manufacture (Hill and Gunn, 1977; Whittaker, 1984) makes it clear that they are a fundamental basis for much of the variation in artifact form. Perceptual differences are an equally fundamental factor in our analysis of artifacts.


One subfield of artifact-based archaeology did attempt to deal with perceptual differences and the consistency of observations. As lithic use-wear analysis developed, there was considerable argument about the subjectivity and replicability of observations, particularly observations of microphenomena like polishes observed at high magnification.

A number of use-wear analysts used blind tests to demonstrate that wear patterns formed by different processes could be reliably identified (e.g., Keeley, 1980; Newcomer and Keeley, 1977; Odell and Odell-Vereecken, 1980; Vaughan, 1985). A few studies did compare different observers and noted some variation in the interpretation of the same microwear traces by different individuals using the same microscopic techniques (Newcomer et al., 1986) or macroscopic observation (Young and Bamforth, 1990). Even a measure as seemingly straightforward as the number of flake scars recognized and counted on photographs of tool edges produced interobserver variability (McGuire et al., 1982). Nevertheless, as microwear studies became more common and promised great interpretive power, a consensus developed that can be clearly seen in recent reviews such as that by Shea (1992). Although "microwear analysis involves a degree of subjectivity, both in the recognition of microwear as use-wear and in the specific interpretation of wear patterns" (Shea, 1992, p. 149), these problems are seen as relatively minor and will continue to be reduced. Thus, "microwear is the most reliable method of obtaining information about use and discard of stone tools" (Shea, 1992, p. 150).

Although microwear studies are in fact typological exercises, in which damage on stone tool edges is classified according to its attributes into types that have been assigned interpretations through experimentation, they have had little or no effect on the contemporaneous debate about typology. This is perhaps because they deal with observer variance, a problem which, as we have seen, was not of interest to the theorists, and also perhaps because the categories of use-wear were not seen as comparable to artifact types. Also, many tests of microwear observations used experimentally manufactured edges, providing a known correct interpretation with which observations could be compared. Accordingly, differences between microwear analysts were discussed as errors of observation rather than differences in classification.

In any case, the view that problems of consistency and observer bias in classification were not of much interest won out over the protestations of the few like Daniels (1972) and Schiffer (1976, pp. 97-98) who argued that such things must be considered. If American Antiquity serves as a barometer of archaeological trends, it is notable that in the 25 years since the publication of Models in Archaeology, some 25 articles and reviews dealing largely with typological theory have occupied its pages, but only 2 of them (Fish, 1978; Beck and Jones, 1989) have been concerned with problems of consistency. Observer error and consistency were considered in an additional two or three other articles that did not treat these problems as typological issues. The most comprehensive recent treatment of typology (Adams and Adams, 1991) treats "intersubjective agreement" as a relatively unimportant issue. It is the same throughout the literature: testing the consistency of typology and classification has never been more than a minor note in the typological symphony.

EVALUATING TYPOLOGICAL SYSTEMS: PREVIOUS WORK

Schiffer (1976, pp. 122-123) suggested a theoretical method for examining a typology and deriving a measure of how much error in classification is likely. He enumerated the categories in a lithic typology that were formally similar and likely to be confused, then produced a largely subjective measure based on the percentage error that would occur if all likely misclassifications were in fact made. This is useful primarily for suggesting which type definitions are confusingly similar to others and are likely to produce inconsistent classifications. Although most archaeologists probably have some intuitive grasp of where the most confusion is likely to occur in their typologies, Schiffer (1976) attempted an explicit model. However, he did not attempt to see if his expectations would be met in actual artifact sorting, which would be an interesting exercise.
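The general logic of such a measure can be approximated in a few lines (a loose sketch of our own, not Schiffer's actual procedure; the categories and counts are invented): list the pairs of formally similar categories, then compute the share of the assemblage that lies in any confusable category, a worst-case misclassification estimate.

```python
# Sketch of a worst-case confusability estimate for a typology.
# Hypothetical assemblage counts by lithic category.
counts = {"utilized flake": 40, "retouched flake": 25,
          "scraper": 20, "core": 15}

# Pairs of formally similar categories judged likely to be confused.
confusable = [("utilized flake", "retouched flake"),
              ("retouched flake", "scraper")]

# Every category that appears in any confusable pair is "at risk."
at_risk = {c for pair in confusable for c in pair}
total = sum(counts.values())
worst_case = sum(counts[c] for c in at_risk) / total
print(f"worst-case misclassification: {worst_case:.0%}")
```

Like Schiffer's measure, this only flags where confusion could occur; whether it does occur is an empirical question about actual sorting behavior.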

The most direct way to consider the problem of consistency is to test different analysts classifying the same body of material. More archaeologists have worried about typological consistency than have actually tested it, and more have probably tried some simple tests and left them as footnotes to their analysis (e.g., Tuggle, 1970) than have chosen to focus attention on the problem. A brief survey of the rather sparse literature shows that while a number of archaeologists have considered problems of consistency and have attempted various means of testing and evaluating their typological systems, there is little agreement on effective methods, the amount of testing which is necessary, or the potential sources of error. Furthermore, none of the work appears to have had much impact on either typological theory or the classification practices of many archaeologists.

Paul Fish (1976, 1978) was one of the first seriously to evaluate his typological systems. In one experiment, he had four analysts classify a set of 90 Tusayan Gray and White Ware sherds from the Kayenta Anasazi region, using only design elements. Pairwise comparisons showed 22-30% discrepancy in classification between any two analysts. No single type was a consistent source of disagreement, and no one analyst disagreed consistently with all others, but the pattern of disagreements showed that one analyst favored classification as temporally early types, while another tended to classify sherds preferentially into later types. As a result, the errors were not random and would have caused distinct differences in interpretation.

Fish (1978) also examined consistency in the use of qualitative and quantitative attributes of flakes. He found little change in his own observations after an interval of 3 years. When 25 flakes were examined by three individuals, he found little divergence in their measurements. However, he presents for comparison only the means of each observer's measurements, so it is difficult to evaluate how much they really varied. Measurements like platform angle (which are likely to have less well-defined landmarks and require more subjective judgments) were quite variable. An eight-interval ranking of the amount of cortex on flakes, also involving subjective judgment, produced discrepancies of 30-40% between observers.

Swartout and Dulaney (1982, pp. 102-120) conducted a similar study shortly after Fish's work was published. They had 20 archaeologists at the 1978 Pecos Conference classify 27 Cibola White Ware vessels. They found a general lack of agreement. Several individual vessels were assigned to between 10 and 13 types, while only three vessels were given 5 or fewer type labels. Seventeen archaeologists called one vessel Tularosa Black-on-White, but only two other pieces had as many as nine participants agreeing on their type. Most pots garnered no more than five agreements on any one type label. Agreement was not much better on previously typed vessels from the Museum of Northern Arizona type collection than on the possibly more ambiguous pots from the Coronado Project excavations.

When Swartout and Dulaney subdivided their participants by where they had worked and been trained, they identified a few consistent biases, but also a great diversity in each group. When they subdivided by length of experience, they found that the five subjects with more than 20 years of experience were only slightly better in agreement among themselves than those with fewer than 20 years of experience. However, the temporal ranges of the types assigned by the more experienced participants were tighter than those of the less experienced group. As Swartout and Dulaney point out, the Cibola types are used mostly for chronological inference, and as the younger typologists are now writing most of the reports, the lack of consensus may increasingly cause problems.

These two studies show that even very simple measures such as the percentage of disagreement between pairs of observers can be revealing. However, percentages of agreement (Fish) and both the number of agreements and the number of types assigned (Swartout and Dulaney) are likely to be strongly affected by the number of choices available to the analysts. Some other simple comparisons, such as the mean of measurements, may not be adequate to allow an objective judgment. As we have already stated, there is no objective way to decide how much discrepancy will spoil interpretive results; we can only provide measures of how much variability exists.
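The simple pairwise measures used in these studies are easy to reproduce. The sketch below is illustrative only: the analyst names, type labels, and classifications are invented, and Python is used purely for demonstration. It computes percent agreement for every pair of analysts, then shows how lumping types (as Boyd later did with projectile points) inflates the figure:

```python
from itertools import combinations

def pairwise_agreement(sortings):
    """Percent of artifacts on which each pair of analysts assigns the same type.

    `sortings` maps an analyst name to a list of type labels, one per artifact.
    """
    results = {}
    for a, b in combinations(sorted(sortings), 2):
        same = sum(1 for x, y in zip(sortings[a], sortings[b]) if x == y)
        results[(a, b)] = 100.0 * same / len(sortings[a])
    return results

# Hypothetical classifications of six sherds by three analysts.
sorts = {
    "analyst1": ["A", "A", "B", "B", "C", "C"],
    "analyst2": ["A", "A", "B", "C", "C", "C"],
    "analyst3": ["A", "B", "B", "B", "C", "A"],
}
print(pairwise_agreement(sorts))

# Lumping related types (here B and C into a single category X) raises
# agreement, just as collapsing point types into "bifacial tools" did for Boyd.
lump = {"A": "A", "B": "X", "C": "X"}
lumped = {k: [lump[t] for t in v] for k, v in sorts.items()}
print(pairwise_agreement(lumped))
```

Because the measure depends directly on how many categories are available, comparisons of percent agreement across typologies with different numbers of types should be made cautiously.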

Dibble and Bernard (1987), who were not concerned with typology, measured variability between individuals measuring lithic edge angles by three techniques. They noted systematic errors in some techniques which produced consistently high or low mean measurements compared to others. As in most archaeological analyses, there was no way of setting an objective standard to which all techniques could be compared; thus while bias was evident in the means, this was not helpful in selecting the "best" measurement technique. Random error, as judged by the variance of the measurements, also differed among the three techniques. Dibble and Bernard argued that the new technique they were proposing was preferable to the others because there were fewer differences among observers.

Differences between observers in classifying nominal attribute states were considered by Boyd (1987), who had five control groups of lithic artifacts all classified by three observers to help identify coding and interpretive problems. Boyd's terminology is a bit confusing. The three "attributes" considered were "working edge," "blank," and "raw material." The "nominal attribute states" he used were categories most analysts would call types; thus the attribute states coded for "blank" included technological debitage classes such as "primary decortication flake" as well as different projectile point types. The attribute "working edge" included morphological categories such as "knife" and "utilized edge."

Boyd used a measure of the total percentage agreement among the three observers (three pairwise comparisons) for each of the five assemblages on each of the three attributes. His observers showed better than 90% agreement in their use of the "working edge" categories. However, the agreement over "blank" form was only from 50 to 76%, although only one assemblage produced less than 60% agreement. Boyd (1987, p. 93) noted that "unfortunately, this attribute measured temporal affiliation (projectile point type) and lithic technology." He felt, however, that even with experienced observers, "60-70% agreement between observers may be the optimum to be expected" given the numerous uncontrolled variables affecting the forms of the lithic artifacts. Moreover, many of the errors did not seem serious to him, as they involved disagreement over projectile point types. When these were lumped as "bifacial tools" the percentage agreement rose to above 80%. Boyd closes by noting that even experienced analysts will vary somewhat in assessing even simple nominal attributes and recommends both what Daniels (1972) would class as methodological rigor (good training) and quality control (monitoring observer differences through control data sets).


While we agree with Boyd's methodological recommendations, he seems to have missed a couple of lessons in his experiment. We are not concerned with the statement that 20% disagreement over projectile point types is not serious, although that would distress many archaeologists. As we have said, there is no objective way to decide how much variance is too much, and the interpretive consequences are yet another question. However, it is apparent that some of the disagreement in Boyd's "attribute states" is due to the large and rather vaguely defined categories that they entail and to some attributes being much more complex and subjectively judged than others. It is even difficult to compare the three attributes, as he does, to decide where improvements in the typology could be made. Why was agreement highest on "raw material"? Is it because the different types (attribute states) are well defined and distinctive, making it easy to train observers, or is it merely because there were fewer choices than allowed under other attributes? The form of the typology is not entirely separable from the practical problems of classification. Training and monitoring analysts will improve consistency, but so will a well-planned and easy to apply set of typological definitions.

To date the most ambitious attempt to assess bias and differences between observers in archaeological data has been that of Beck and Jones (1989). They argued that bias in analysis can be introduced by inexplicit class definitions, differences in perception among analysts, and changes in a single analyst's perception over time. Unlike other studies, which compared analysts using the same materials (data sets developed or chosen for the purpose), Beck and Jones compared the results of analyses done on different subsets of a lithic assemblage collected during a regional survey in Oregon. The weakness of this approach is that the different data sets examined by each analyst had to be assumed to be subsets of a single universe. As there were geographical and perhaps functional differences between the groups of assemblages assigned to the two analysts, the assumption of uniformity may be questioned. The strength of their approach is that it does demonstrate the possibility of controlling for, or at least identifying, bias when comparing related assemblages analyzed by different archaeologists. This kind of comparability is widely recognized as a problem, but is usually considered intractable or ignored.

Beck and Jones' original research goal was to examine assemblage richness, specifically the question of whether the number and diversity of artifact types in an assemblage reflected site function or merely responded to the overall size of the assemblage. Although their classification procedures incorporated some of Daniels' recommendations, such as explicit type definitions, similarly trained analysts, and conventions for dealing with ambiguous cases, they realized that if either of them preferentially classified artifacts into more categories, their results would be biased. They tested differences between analysts during analysis, but still suspected at the end that artificial variability had been introduced into the data, and set out to test for it.


Beck and Jones started by examining the classificatory process. Edge shape, classified as "convex, concave, or straight," proved to be inconsistently judged. Although they believed that their definitions were explicit, they found disagreement on such matters as how much irregularity was allowable in a "straight" edge. A chi-square test confirmed that the two analysts differed in their perceptions of edge shape.
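A test of this kind can be sketched as follows. The counts below are hypothetical stand-ins rather than Beck and Jones' data, and the Pearson chi-square statistic is computed by hand for a 2 x 3 analyst-by-shape contingency table:

```python
def chi_square(table):
    """Pearson chi-square statistic for an r x c contingency table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = rows[i] * cols[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical edge-shape counts (convex, concave, straight) for two analysts.
counts = [[45, 25, 30],   # analyst A
          [25, 30, 45]]   # analyst B
stat = chi_square(counts)
# df = (2 - 1) * (3 - 1) = 2; the 0.05 critical value of chi-square is 5.991,
# so a statistic above that would indicate the analysts perceive shape differently.
print(round(stat, 3), stat > 5.991)
```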

Plotting the number of classes against the number of tools in each site assemblage on a log-log scale was expected to produce a linear regression, and did. However, the sites analyzed by Beck tended to fall below the regression line, while those analyzed by Jones tended to plot above the line. Thus if the sites analyzed by each are plotted separately, the differences between the two analysts can be examined statistically by comparing the slopes, intercepts, and correlation coefficients of their respective regression lines. Beck and Jones found no significant difference in regression line slopes, suggesting that the rates at which each added new tool categories as the assemblages increased in size were similar. The correlation coefficients, which measure the dispersion of points around the line, also were not significantly different, suggesting that both analysts were equally consistent. However, the intercepts of the two lines were different (at a low level of significance), showing at least a slight tendency for one analyst to assign artifacts to a greater number of different classes. Believing that a bias toward rare tool classes would have a disproportionate effect on the results, they lumped the rare classes together and compared the analysts again. This procedure greatly reduced the differences.
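A minimal sketch of this log-log comparison, with invented assemblage sizes and class counts rather than the Oregon survey data: each analyst's sites are fit by least squares separately, and the slopes and intercepts can then be compared.

```python
import math

def loglog_fit(assemblage_sizes, class_counts):
    """Least-squares slope and intercept of log10(classes) on log10(tools)."""
    xs = [math.log10(n) for n in assemblage_sizes]
    ys = [math.log10(c) for c in class_counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical site assemblages: (total tools, classes) for each analyst.
beck = loglog_fit([10, 30, 100, 300], [3, 5, 8, 12])
jones = loglog_fit([12, 40, 90, 250], [4, 7, 11, 16])
# Similar slopes with a higher intercept for one analyst would mirror the
# pattern Beck and Jones report: comparable rates of adding classes, but a
# tendency for one analyst to use more categories overall.
print(beck, jones)
```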

Beck and Jones admitted that their tests were not a fully conclusive demonstration of bias, since they did not compare both analysts by having them reanalyze the same artifacts. A further test examined the data sets produced by one analyst over time, and found no significant differences.

A lengthy book by Adams and Adams (1991) serves as a final example of previous considerations of consistency in typological systems. Adams and Adams (1991, p. 4) agree with our position that "useful typologies require intersubjective agreement (consistency)." They recognize that differences in observer perceptions and expertise, problems with ambiguous or fragmentary specimens, and the inherent subjectivity of some decisions all mean that full agreement can never be achieved. However, they find this of only minor theoretical interest, and in the discussion of their Nubian ceramic typology it is apparent that it was only a minor practical concern as well. They believe that the best way to promote consistency is through very precise and detailed type descriptions but argue that worrying about consistency is counter-productive beyond a certain point. This point is reached when overdescription of types slows artifact classification without improving the accuracy, not of the classifications themselves, but of the interpretations for which the typology was designed.


Adams and Adams reported that 95% agreement on their Nubian sherds was the maximum achieved by experts, and 90% was considered adequate for the detailed chronological ordering of contexts for which they used the ceramic typology. The figure of 90% agreement, which they felt was typical of their analysts, implies that they tested consistency. However, it is probably revealing that such tests are not described. The origin of their agreement figures remains obscure, and neither the process of obtaining them nor their implications are considered of interest. In particular, the question of bias is ignored. Ninety percent agreement sounds good, but perhaps it is only good if the 10% error is largely random. If the different analysts show consistent biases, 10% divergence may be enough to skew some chronological conclusions unless, as Daniels recommended, the (biased) observers were not associated with particular contextual groupings of data. Adams and Adams at least considered the problem of consistency. Similar doubts could be applied to all analyses that entail the classification of artifacts, but most studies do not consider this issue at all.

A CASE STUDY: SINAGUA CERAMIC TYPOLOGY

In view of what others have done, it seems to us that the evaluation of typologies and their application is a useful exercise. An evaluation of a typological system ought to consider both the typology itself and how consistently its rules are applied in the classification of artifacts. We ought to measure how much agreement there is between individual archaeologists and examine the typology itself for the sources of disagreement, with a view to resolving them as much as possible.

The recent development of consensus modeling by ethnographers (Romney et al., 1986) as a means of examining cultural concepts seems ideally suited to evaluating a typological system. Herein two archaeologists with a ceramic typology (Whittaker and Kamp) and an ethnographer with a statistical method (Caulkins) attempt an ethnographic study of an archaeological typology. Our goal is to illuminate typologies in general, more than just the one with which we happen to work. We argue that interobserver variability is unavoidable in the classification performed using any typology, largely because of variation in human perceptions and biases. An ethnographic examination of a typological system considers its history, learning networks, and cultural context, as well as its structure, rules, and application. An ethnographic viewpoint reminds us that much of the human variation in archaeological classification is culturally patterned, because an archaeological typology is a system of rules that is learned and transmitted like other cultural domains. We suggest that this is also one reason why archaeologists are often reluctant to question, test, and evaluate their typological systems in any meaningful way.


Data Construction

We used consensus analysis to evaluate the amount of variation among 13 archaeologists classifying plain ware sherds, all from the Northern Sinagua area around Flagstaff, Arizona. Consensus analysis is a statistical technique designed to discover, first, whether there are "culturally correct" or high consensus answers to questions about a domain--in this case, a typology--and, second, which informants are the most knowledgeable about these domains. A pile sort is a simple means of collecting data on classifications that can be used with consensus modeling to examine both the classification system itself and the degree of agreement among individuals using it. The form of pile sort we used is described as "unrestricted" because subjects are given a number of objects, usually cards with pictures or written concepts on them, and asked to sort them into as many groups of similar items as they like. A further step in some studies is to have the subjects combine their groups hierarchically, in a step-by-step manner. We did not do this, but did collect some information on higher-order groupings.4
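The logic of comparing pile sorts can be sketched as follows. Assuming a simple co-occurrence representation, in which each pair of items either shares a pile or does not (one common way to compare unrestricted sorts; ANTHROPAC's internal computations may differ), two sorters' agreement is the proportion of item pairs on which they make the same decision:

```python
from itertools import combinations

def cooccurrence(piles, n_items):
    """Item-by-item matrix: 1 where two items were sorted into the same pile."""
    m = [[0] * n_items for _ in range(n_items)]
    for pile in piles:
        for i, j in combinations(pile, 2):
            m[i][j] = m[j][i] = 1
    return m

def sort_agreement(piles_a, piles_b, n_items):
    """Proportion of item pairs on which two sorters agree (same/different pile)."""
    a = cooccurrence(piles_a, n_items)
    b = cooccurrence(piles_b, n_items)
    pairs = [(i, j) for i in range(n_items) for j in range(i + 1, n_items)]
    return sum(a[i][j] == b[i][j] for i, j in pairs) / len(pairs)

# Hypothetical sorts of six sherds (numbered 0-5) by two participants.
p1 = [[0, 1, 2], [3, 4], [5]]
p2 = [[0, 1], [2, 3, 4], [5]]
print(sort_agreement(p1, p2, 6))
```

One virtue of the pairwise-decision view is that a lumper and a splitter can still be compared, since every pair of items yields a decision regardless of how many piles each sorter made.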

We collected data from 13 archaeologists who work in the Flagstaff area. They were each given a typed instruction and recording sheet, a 10x hand lens, and the same 28 freshly broken sherds. The sherds were all undecorated types common in Northern Sinagua sites, collected during excavations at Lizard Man Village (Kamp and Whittaker, 1990, 1998). Kamp constructed the sherd sample to include sherds that she identified as belonging to most of the common, uncorrugated plain wares in the region, including specimens that she judged to be both Alameda Brown Wares (Rio de Flag Brown, Early Angell, Late Angell, Winona, Youngs Brown, Turkey Hill Red, and Sunset Red/Brown) and San Francisco Mountain Gray Wares (Deadmans Gray and Deadmans Fugitive Red). These types are defined on the basis of temper type (five main temper types are distinguished in the descriptions), temper size (Early Angell, Late Angell, and Winona), and slip (Turkey Hill and Deadman's Fugitive Red). In addition, some of the types may be divided into smudged and unsmudged varieties. In the selection of sherds Kamp attempted to include a wide range of variation in attributes that could potentially be used to separate sherd categories (temper, slip, burnish, firing conditions).

4Our work on pile sorts has stimulated A. K. Romney and associates to reevaluate the use of pile sorts with consensus analysis. Romney (personal communication, July 14, 1995) now recommends a multiple-choice format to avoid the difficulties presented by the lumper/splitter problem encountered with some pile sorts. In this format informants would have to choose among four main types for each sherd. Another possibility, he suggests, is a true-false format in which each sherd is assigned a specific type and the informant is asked whether that characterization is true or false.


The instructions given to each participant explained that we were studying typological consensus and promised to respect the participants even if we disagreed with their type assignments. Participants were instructed to "First sort the sherds into types that you would recognize and use in analyzing an archaeological assemblage. Make a pile for each type. We have provided space to record twenty types below, but you may have as many or as few types as you wish." The numbered sherds in each pile were then recorded, and the participants were asked to add the names they use to each type/pile. They were then asked to note any hierarchical relationships they recognized (for example, grouping types into wares), but responses to this question were not very consistent. Each participant then rated his or her experience with these types as "extensive, moderate, some, or slight" and estimated how long they had worked with Sinagua ceramics. Our participants ranged in experience from a couple of graduate students new to the area to Peter Pilles, the Coconino National Forest Archaeologist, who has worked in the area for more than 20 years. Finally, participants were asked for a brief description of how they learned Sinagua ceramic typology, including who taught them, whether they used any established type collection, and what site analysis experience they had. Again, we tried to include a diversity of experience and "learning genealogy."

Two apparent weaknesses in our data base were pointed out by some reviewers, namely, that the samples of archaeologists and of sherds were small and that there was a lack of independence between both people and sherds. The consensus analysis model that we use to analyze these responses, however, is well suited for small samples. Weller and Romney (1988, p. 77) assert that consensus analysis can give reliable and valid answers with very small samples of respondents and items. The method can give stable results with as few as half a dozen respondents and 20 items. We have exceeded that threshold.5 The small sample size was also necessitated by the number of available archaeologists, the complexity of the analysis, and the time required to administer pile sort interviews. As for the lack of independence, a cultural domain by definition consists of meaningfully related items. Many applications of consensus analysis deal with persons interacting with each other, in either the past or the present (see Boster, 1986). Our analysis reflects on the social transmission of a typology and, therefore, does not assume the independence of the subjects. In fact, relationships between the typologists help to structure their opinions and are, thus, one aspect of the ethnography of typologies in which we are interested.

5The question of the size of the sample pool for consensus analysis is complex. In the article in which they develop the consensus model, Romney et al. (1986) use 40 true-false or dichotomous items. They note that "fill-in-the-blank type questions are more reliable than multiple choice type, which in turn are more reliable than true-false type" (p. 331). In replication tests, they found that one fill-in-the-blank question produced the same test-retest reliability as five true-false questions (p. 331). Our sherd experiment uses 28 fill-in-the-blank type items, roughly the equivalent of 140 dichotomous items.


Consensus Modeling

The recent work of Romney and his associates (Romney et al., 1986, 1987; Weller and Romney, 1988) on consensus analysis and informant accuracy provides a rigorous means of describing varying degrees of agreement about domains of knowledge such as typology. The method has been applied to an increasing variety of domains, such as disease categories (Weller, 1984), attitudes toward corporal punishment (Weller, 1986), ethnobotanical knowledge (Boster, 1986), similarities of fish (Boster and Johnson, 1989), lifestyles (Dressler et al., 1996), organizational cultures (Caulkins, 1991), and ethnic identity (Caulkins and Trosset, 1996; Trosset and Caulkins, 1993).

Consensus analysis enables us to describe the degree and pattern of diversity in a pool of information. Romney et al. (1987, p. 165) explain that "the central idea in the original theory is the use of the pattern of agreement or consensus among informants to make inferences about their differential knowledge of the shared information pool constituting culture." A high degree of consensus indicates that there is a single model of information in the pool, or a "single-culture," and the correspondence between the knowledge of any pair of informants is a function of the degree to which each matches the model. Of course, there may be no shared information pool or consensus. Instead, the pool may be "turbulent" and diverse, indicative of what Wallace (1961, pp. 27-39) has called "cognitive non-sharing." Alternatively, the pool may have two or more overlapping or separated centers of cultural knowledge (Caulkins, 1991; Hyatt and Caulkins, 1992). We expected to find a high level of agreement on the Sinagua typology.

We analyzed the pile sort data and carried out the consensus analysis using ANTHROPAC 3.2 (Borgatti, 1990). The program compares the informants' solutions to the pile sorting task and then calculates a proximity matrix of individuals that can be consensus analyzed. The consensus module then performs a factor analysis of the matrix to reveal whether or not one factor accounts for most of the variance. If so, the data conform to the "single-culture" model of high agreement, as is the case with our informants. Second, the module computes a "cultural competence" or "cultural centrality" score,6 indicating how closely each individual's responses matched the

6Although the term "cultural competence" has been used previously in some of the literature, this terminology may be misleading. In complex domains, where experts may have diverse and highly nuanced information, their "competency" score may be lower than those of novices who make judgments based on a simple set of criteria (e.g., Boster and Johnson, 1989). In these circumstances, the term "cultural centrality" appears more appropriate than "cultural competence." The advantage of this terminology is that it acknowledges the relativity of much cultural knowledge, whereas competency implies a normative standard that may be unwarranted.

Fig. 1. Multidimensional scaling plot of individual archaeologists. Kruskal stress = 0.163.

consensus norm for the "single-culture" model of the typology. Third, the module produces a "key" or "culturally correct" solution to the sorting task which is meaningful only when there is a high level of agreement among informants. Finally, the module produces a person-by-person agreement matrix showing how closely each pair of informants agreed on the solution to the sorting task. This matrix can be displayed using cluster analysis and multidimensional scaling.
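The core of the consensus computation can be illustrated with a simplified sketch. The informant-by-informant agreement matrix below is hypothetical, and the first-factor loadings are approximated here by the leading eigenvector, found by power iteration, rather than by ANTHROPAC's actual factoring procedure; the point is only that informants whose responses track the group norm load highest.

```python
def principal_eigvec(m, iters=200):
    """Leading eigenvector of a symmetric matrix, via power iteration."""
    n = len(m)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        scale = max(abs(x) for x in w)
        v = [x / scale for x in w]
    length = sum(x * x for x in v) ** 0.5
    return [x / length for x in v]

# Hypothetical agreement matrix for four informants (proportion of matching
# classifications); informant 4 agrees with the others least.
agree = [
    [0.80, 0.82, 0.78, 0.60],
    [0.82, 0.80, 0.76, 0.58],
    [0.78, 0.76, 0.77, 0.55],
    [0.60, 0.58, 0.55, 0.58],
]
loadings = principal_eigvec(agree)
# Higher loadings indicate responses closer to the group's consensus norm.
print([round(x, 3) for x in loadings])
```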

Evaluating Typologists

The results of our consensus analysis show that despite the range of experience and differences of opinion on many sherds, there is a great deal of overall consensus, with a well-defined, coherent, "normative" typology. Figure 1 is a multidimensional scaling plot of the subject-by-subject agreement matrix, showing the relative similarity of the individuals' solutions to the sorting task. At the center of the MDS plot are the individuals forming the norm, with others spread out around them at distances reflecting their degree of agreement with the norm. The cluster analysis tree shown in Fig. 2 gives similar information, and in Table I the individuals are arranged in order of their cultural centrality scores, which give a measure of their agreement and closeness to the center of the MDS plot. The self-rankings of individual experience are included in Table I, as well as the number of types each participant distinguished among the sherds.
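A cluster tree of the kind shown in Fig. 2 can be built from agreement data by average-linkage agglomeration. The sketch below uses invented distances (1 minus agreement) among four hypothetical archaeologists, not our actual matrix; at each step the two clusters with the smallest average pairwise distance are merged:

```python
from itertools import combinations

def avg_dist(dist, xs, ys):
    """Average pairwise distance between the members of two clusters."""
    return sum(dist[frozenset((x, y))] for x in xs for y in ys) / (len(xs) * len(ys))

def average_linkage(dist, labels):
    """Agglomerative clustering with average linkage; returns the merge history."""
    clusters = [[l] for l in labels]
    history = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_dist(dist, clusters[ij[0]], clusters[ij[1]]),
        )
        d = avg_dist(dist, clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        history.append((sorted(merged), round(d, 3)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Hypothetical distances (1 - agreement) among four archaeologists.
labels = ["a1", "a2", "a3", "a4"]
d = {
    frozenset(("a1", "a2")): 0.18, frozenset(("a1", "a3")): 0.22,
    frozenset(("a1", "a4")): 0.40, frozenset(("a2", "a3")): 0.24,
    frozenset(("a2", "a4")): 0.42, frozenset(("a3", "a4")): 0.45,
}
print(average_linkage(d, labels))
```

The merge heights play the role of the level values on the vertical axis of a dendrogram such as Fig. 2.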

Fig. 2. Cluster analysis, average linkage, grouping archaeologists by the similarity of their sherd sorts.

The considerable agreement among participants is expressed by the overall high cultural centrality scores (mean = 0.647, SD = 0.148). A mean above 0.5 and standard deviations less than 0.2 are considered to indicate good agreement among subjects. The three individuals with the highest scores (above 0.8) can be considered the trendsetters, the most normative members of the group. The cluster tree (Fig. 2) likewise shows a strong normative tendency around these individuals (1, 3, 6).
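These summary statistics can be checked directly against Table I. Recomputing from the published scores gives values close to, though not identical with, the reported mean of 0.647 and SD of 0.148, presumably because the printed scores are rounded:

```python
# Centrality scores as printed in Table I (highest to lowest).
scores = [0.847, 0.844, 0.843, 0.726, 0.725, 0.687, 0.637,
          0.610, 0.608, 0.506, 0.491, 0.480, 0.461]
n = len(scores)
mean = sum(scores) / n
# Sample standard deviation (n - 1 in the denominator).
sd = (sum((s - mean) ** 2 for s in scores) / (n - 1)) ** 0.5
print(round(mean, 3), round(sd, 3))
```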

As an additional experiment, 14 students from Whittaker's 1992 archaeological field methods class also typed the sherds. Three weeks previously they had been given a 2-hr introduction to Sinagua ceramics, during which they learned the major types and defining attributes as used by Whittaker and sorted sherds for an hour. One might expect that student retention of this kind of information would be low, but the consensus measures for the group of students were not too far below those for the group of archaeologists (mean = 0.528, Range = 0.500, SD = 0.152). When the student pile sorts were merged with those of the archaeologists, the students generally fell around the fringes of the cluster, with the three norm-setting archaeologists remaining at the center. It appears that the students were in fact learning

Table I. Participants Ranked by Centrality Scores, with Their Self-Ranking of Experience and the Number of Types They Distinguished Among the Sherds

Participant   Rank   Centrality   Experience   Types
     1          1      0.847      Extensive      7
     3          2      0.844      Moderate       6
     6          3      0.843      Some           4
     4          4      0.726      Extensive     10
     5          5      0.725      Extensive      8
    12          6      0.687      Some          13
     8          7      0.637      Some           9
     9          8      0.610      Slight        10
    11          9      0.608      Slight        18
    10         10      0.506      Some          10
    13         11      0.491      Moderate      10
     2         12      0.480      Extensive     14
     7         13      0.461      Some          11

and attempting to follow the normative typology. This was also reflected when the sherds were clustered using the pile sort data to produce a normative typology. In both the archaeologists' data (discussed later; Fig. 4) and the students', the same four major clusters based on temper were quite clear, although they were much more subdivided in the student data, reflecting diversity of opinion or inaccuracy in applying the typology.

Returning to the main experiment, more detailed knowledge of the participants helps to explain some of the patterning in the statistical measures. The five highest centrality scores belong to younger experienced workers in the area, including Whittaker. Since Kamp initially constructed the sherd sample, we did not use her as a subject. In addition, we believed that including both Kamp and Whittaker, a married couple who often classify ceramics together, would have weighted the norm toward their views. The other married couple who learned and work together (4, 5) received almost identical centrality scores, but actually did not agree on all their sherd typing, although as Fig. 2 shows, they were quite close.

The five archaeologists with the highest centrality scores all learned the typology between 1980 and 1985 and were all taught largely by Peter Pilles (Fig. 3). It is not surprising that we show broad agreement. However, Pilles (Participant 2), who is by far the most experienced with the ceramic assemblage in question, is well away from the norm set by his older students. A more recent Pilles student (13) is closer to Pilles than the norm-setters. Pilles seems to diverge from the norm because he is a "splitter"

[Fig. 3 diagram: a learning genealogy linking H. Colton, Madsen, W. Smith, and Lindsay to the participants, with institutional type collections among the sources.]

Fig. 3. Learning genealogy for 13 archaeologists. Institutional type collections cited by participants are Museum of Northern Arizona (MNA), Coconino National Forest (CNF), and Arizona State Museum (ASM).

who found 14 types among the sherds, while the norm plainly favors "lumpers." The type names Pilles uses reflect his subdivision of the normative types by additional descriptive attributes such as slipping, smudging, and temper size. Pilles is recognizing what others might call varieties in the type-variety system promoted by some ceramic typologists (Gifford, 1960; Wheat et al., 1958). In fact, most of us recognize the distinctions Pilles makes, but disagree on their significance and prefer to use more broadly defined types for descriptive and chronological purposes, while testing the meanings of some of the other dimensions of variability (Kamp et al., 1991).

The "learning genealogy" (Fig. 3) suggests three "generations" among the participants. Pilles is a "grandfather" figure who learned from the "deified ancestor" (Harold Colton). Pilles then passed his knowledge to the second generation, whose members have interpreted it and modified it and now set the norm, at least in our sample. The third, and least experienced, generation is most diverse, with influences from Pilles, the second generation, and other sources. This raises interesting questions about how typologies change through time, and how they are legitimized.

Colton and his associates devised the original typology, whose published form provides the "official" criteria for classification. Pilles learned from Colton and follows him in using many types defined by attributes that the "lumpers" in our sample consider hard to identify and of unclear value. Pilles may be closer to Colton than the rest of the participants, but the types Pilles uses and teaches are not precisely those defined and published by Colton. In fact, Colton's own definitions of his types changed over time [e.g., Winona Brown in Colton and Hargrave (1937, p. 163)

and in Colton (1941, p. 35)]. Publication of original type descriptions not only gives them official recognition and legitimacy in the archaeological world, but is supposed to set a standard for as long as pots make sherds. Obviously, it does not. We found it especially interesting that none of the participants mentioned using the published descriptions when asked how they learned the typology, although most, like us, must have referred to them. Apparently the direct teaching of others, and especially the personal experience which internalizes a typology, takes precedence in our memories.

The modified Colton types, in turn, are passed on by Pilles. His position and long experience give him authority, and his opinions are influential to those he teaches. Nevertheless, their own experiences, biases, perceptions, and exposure to other opinions lead them to modify and adapt the types again, and a new consensus develops. Is it a stable consensus? Probably not. The third generation clearly has different ideas. Some of these younger typologists will be trained into the normative path, and some of their new ideas will be abandoned. Other new ideas may be adopted and become important, as attributes and types continue to be tested for interpretive value (Breternitz, 1966; Downum, 1988; Kamp et al., 1991). Recently published type descriptions (e.g., Wood, 1987, pp. 53-61) already reflect some of the diversity of views and divergence from Colton's definitions that is apparent in our sample of archaeologists.

Evaluating the Typology

As archaeologists who customarily use the existing Sinagua ceramic typology to communicate with others in the field, we are encouraged to see that there is considerable agreement among us. According to the consensus analysis, the data conform to a "single culture" model. Although two factors were found in the data, the first factor accounts for most of the variance, with a ratio of 7.8 between the eigenvalue of the first and that of the second factor (eigenvalues of 5.24 and 0.671, respectively).
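The logic of this kind of consensus test can be sketched in code. The snippet below is a minimal illustration, not the ANTHROPAC minimum-residual procedure the study actually used (Borgatti, 1990): the toy pile sorts, the Rand-index agreement measure, and the plain eigendecomposition are our illustrative assumptions. It builds a sorter-by-sorter agreement matrix, factors it, and reports the first-to-second eigenvalue ratio (a strong first factor suggests a "single culture") and the first-factor loadings, a crude analog of the centrality scores in Table I.

```python
import numpy as np

def rand_agreement(a, b):
    """Agreement between two pile sorts (Rand index): the fraction of sherd
    pairs that both sorters treat the same way (together in both sorts, or
    separated in both)."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)      # each unordered pair once
    return np.mean(same_a[iu] == same_b[iu])

def consensus_summary(sorts):
    """Agreement matrix, first/second eigenvalue ratio, and first-factor
    loadings (one crude 'centrality' score per sorter)."""
    k = len(sorts)
    M = np.ones((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            M[i, j] = M[j, i] = rand_agreement(sorts[i], sorts[j])
    vals, vecs = np.linalg.eigh(M)         # eigenvalues in ascending order
    ratio = vals[-1] / vals[-2]            # cf. the 7.8 ratio reported above
    loadings = np.sqrt(vals[-1]) * np.abs(vecs[:, -1])
    return M, ratio, loadings

# Hypothetical data: 4 sorters classifying 8 sherds into piles (integer labels).
sorts = [
    [0, 0, 1, 1, 2, 2, 3, 3],   # sorters near the norm
    [0, 0, 1, 1, 2, 2, 3, 3],
    [0, 0, 1, 1, 2, 2, 2, 3],   # one divergent judgment
    [0, 1, 0, 1, 2, 3, 2, 3],   # far from the norm
]
M, ratio, loadings = consensus_summary(sorts)
print(round(ratio, 2), np.round(loadings, 2))
```

With data like these, the dominant first factor shows up as a ratio well above 1, and the divergent sorter receives the lowest loading, just as low-centrality participants do in Table I.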

Within the "single culture," however, there is also considerable variation. Both in evaluating the typological system and in determining how it should be used and modified, it is useful to see which specimens and attributes are the sources of the most disagreement. Table II lists the 28 sherds with a normative type name and some of the attributes that appear to be important because they are used in published descriptions and by workers in the field or prove to be sources of disagreement among our analysts. The different names given by different participants are also listed, and the number of agreements with each name.

Table II. Sherd Sample with Names Assigned by Participants

Sherd  Normative type               Alternate types
 1     Winona Brown (3)             Angell Brown (1), Verde Brown (1), Late Angell Brown (2), Angell/Winona (2)
 2     Turkey Hill Red (1)          TH on Late Angell (2), Verde Red (1), TH or Angell (1), Angell/Winona (2)
 3     Sunset Brown (3)             SB smudged (2), Sunset Red (1), Rio de Flag (2)
 4     Rio de Flag Brown (5)        Youngs Brown (1), Late Rio de Flag (1), Deadmans Grey (1)
 5     Sunset Brown (6)             Sunset Brown/Red (1)
 6     Sunset Brown (4)             Sunset Red (3), Sunset Brown/Red (1)
 7     Sunset Brown (6)             Sunset Brown/Red (1)
 8     Late Angell Brown (3)        Angell Brown (2), Verde Brown (1), Early Angell Brown (1), Angell/Winona Brown (2)
 9     Late Angell Brown (3)        Angell Brown (1), Verde Red (1), Turkey Hill Red or Angell (1), Angell/Winona (2)
10     Sunset Red (3)               Sunset Brown (2), Transitional Sunset (1), Sunset Brown/Red (1), Youngs Brown (1)
11     Angell Brown (2)             Turkey Hill Red smudged on Early Angell (1), Verde Smudged (1), Early Angell (1), Angell/Winona (2), Late Angell Brown (2)
12     Angell/Winona Brown (2)      Angell Brown (1), Verde Brown (1), Turkey Hill on Winona (1), Late Angell or Youngs (1), Winona Brown (2), Late Angell Brown (1)
13     Sunset Red (5)               Sunset Brown (2), Sunset Red/Brown (1)
14     Angell/Winona Brown (2)      Turkey Hill on Late Angell Brown (1), Verde Brown (1), Early Angell (1), Angell Brown (2), Late Angell Brown (2)
15     Sunset Red (4)               Sunset Brown (2), Sunset Red/Brown (1), Transitional Sunset (1)
16     Rio de Flag Brown (5)        Late Rio de Flag (1), Deadmans Grey (1), Early Angell Brown (1)
17     Deadmans Fugitive Red (6)    Prescott Red (1), Rio de Flag (1)
18     Youngs Brown (3)             Sunset Brown (2), Hartley Plain (1), Early Angell Brown (1), Rio de Flag Brown (1)
19     Angell/Winona Brown (2)      Angell Brown (1), Turkey Hill on Late Angell Brown Smudged (1), Verde Brown (1), Turkey Hill on Winona Brown (1), Late Angell Brown (1), Turkey Hill on Late Angell Brown (1)
20     Angell/Winona Brown (2)      Angell Brown (2), Late Angell Brown (2), Turkey Hill on Late Angell (1), Verde Brown (1), Early Angell (1)
21     Angell/Winona Brown (2)      Late Angell Brown (2), Angell Brown (1), Winona Brown (2), Early Angell Brown (1), Verde Brown (1)
22     Angell/Winona Brown (2)      Winona Brown (2), Late Angell Brown (2), Angell Brown (1), Early Angell Brown (1), Verde Brown (1)
23     Youngs Brown (4)             Sunset Red (3), Sunset Brown (1)
24     Youngs Brown (5)             Sunset Brown (1), Grapevine Brown (1)
25     Deadmans Grey (7)            San Francisco Mountain Grey Ware (1)
26     Rio de Flag Brown (7)        Deadmans Grey (1)
27     Deadmans Fugitive Red (7)    San Francisco Mountain Grey Ware (2)
28     Early Angell Brown (4)       Angell Brown (1), Angell/Winona (1), Rio de Flag (2)

(TH = Turkey Hill; SB = Sunset Brown.)

Table II (continued). Major Attributes as Described by Kamp and Whittaker

[Attribute columns for the 28 sherds: Color (gray through brown to red-brown); Temper (tuff, cinder, quartz sand, or crystalline inclusions, with size and abundance); Interior finish (rough to well smoothed, smudged or unsmudged); Exterior finish (smoothed to well burnished, slipped or unslipped).]

Of the 13 individuals who sorted the sherds, 9 gave type names to all or most of the sherds. Two of the remaining four grouped the sherds into unnamed categories and provided some comments about the criteria for defining some or all of the categories, while the other two simply grouped the sherds. It is clear that of the nine who assigned Sinagua type names, three are lumpers, using primarily gross temper categories for delineating the sherd types; four are splitters, using finer temper composition and size categories plus the presence or absence of slip; and two are extreme splitters, using temper, slip, and the presence or absence of smudging to differentiate types. Within the lumpers there is a range of variability. The most extreme lumper used only the basic temper distinctions to produce four categories (Rio de Flag, Sunset, Angell/Winona, and Deadman's). The other two lumpers both recognized Youngs Brown, a cinder-tempered sherd with a significant proportion of other inclusions, and distinguished Deadman's Grey from Deadman's Fugitive Red, a variant with red hematite applied after firing. One of these lumpers also separates Turkey Hill Red, a red slipped variant, from other tuff-tempered Angell/Winona sherds.

The typing strategies for the individuals who did not consistently name types are more difficult to discern, but an examination of the sherds in conjunction with comments provided by the respondents suggests that two of them rely primarily on variability in temper but divide the categories rather finely; one uses variation in both temper and slip; and the fourth is an extreme splitter, who uses variability in temper, slip, smudging, and burnishing to produce a total of 18 types. These four individuals are all on the low end of both the experience and the centrality scales, which probably accounts for their reluctance to assign type names.

In this sample, the consensus typology was that of the lumpers. There is wide agreement on basic sherd groupings defined by temper (Fig. 4). This can also be seen if the classifications of the eight individuals who categorized sherds using the Northern Sinagua types are compared (Table III). (One of the nine participants who named types used some Southern Sinagua type names that no one else used, making comparison difficult, and is not included here or in Table II.) Several of the disagreements are systematic. For example, all of the non-consensual classifications of sherds which were usually classed as "Rio de Flag Brown" are due to nonstandard temper identifications by two of the eight individuals. Despite occasional disagreements, however, consensus about basic tempering agents is extremely high for both the splitters and the lumpers.

If we had continued the pile sort data collection by having respondents unite their piles in a stepwise sequence, the difference between lumpers and splitters would have been reduced, and cultural centrality and the coherence of the domain would have increased. This is apparent from the unsystematic information provided by some respondents on hierarchical groupings and the clustering of the sherd types into four groups in Fig. 4. All but the least experienced would collapse these four groups into two formally defined wares: Alameda Brown Ware (Figs. 4A, C, D) and San Francisco Mountain Grey Ware (Fig. 4B). The difference between lumpers and splitters was partly produced by collecting data in a free pile sort format that allowed respondents to use any number of types; lumpers felt that fewer distinctions were analytically useful. Just as most of the splitters recognize the more inclusive categories of the lumpers, the lumpers may see the possibility of splitting their types for some purposes, even if they would not normally do so.

Fig. 4. Cluster tree of sherds based on the pile sorts by 13 archaeologists, showing the normative typology with four major groups. (A) Cinder temper (Sunset, Youngs). (B) Quartz temper (Deadmans). (C) Crystalline temper (Rio de Flag). (D) Tuff temper (Angell, Winona, Turkey Hill).
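A cluster tree like Fig. 4 is derived from pile-sort co-occurrence: the more sorters who place two sherds in the same pile, the more similar the sherds are taken to be. The sketch below is a simplified stand-in for that procedure, not the clustering algorithm actually used for Fig. 4; the toy data and the majority-threshold grouping are our illustrative assumptions. It counts co-occurrences across sorters and then groups sherds that a majority of sorters piled together.

```python
import numpy as np

def cooccurrence(sorts):
    """Count, for each pair of sherds, how many sorters put them in the same pile."""
    n = len(sorts[0])
    C = np.zeros((n, n))
    for s in sorts:
        s = np.asarray(s)
        C += (s[:, None] == s[None, :])
    return C

def majority_groups(C, n_sorters):
    """Group sherds that a majority of sorters piled together, via connected
    components of the majority-agreement graph (a crude stand-in for cutting
    a cluster tree at a fixed level)."""
    n = C.shape[0]
    linked = C > n_sorters / 2
    labels = -np.ones(n, dtype=int)
    g = 0
    for i in range(n):
        if labels[i] < 0:
            labels[i], stack = g, [i]
            while stack:
                j = stack.pop()
                for k in np.nonzero(linked[j] & (labels < 0))[0]:
                    labels[k] = g
                    stack.append(k)
            g += 1
    return labels

# Hypothetical pile sorts: 5 sorters, 6 sherds (say, 0-2 cinder, 3-5 tuff).
sorts = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 2],   # a splitter subdivides the tuff pile
    [0, 0, 1, 2, 2, 2],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
C = cooccurrence(sorts)
labels = majority_groups(C, n_sorters=len(sorts))
print(labels)
```

Even with a splitter and a divergent sorter in the data, the majority-level groups recover the two temper-defined clusters, which mirrors how the broad groups in Fig. 4 remain stable while their internal subdivisions vary.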

[Table III: comparison of the classifications of the eight individuals who used Northern Sinagua type names, tabulating numbers of consensus classifications, numbers of disagreements, and percentage disagreements.]

Table IV. Agreement on Classification of Ceramic Types Distinguished by Fine Temper Differences: The Five Splitters Who Identified Northern Sinagua Types Are Tabulated

Ceramic type       Total    Number of consensus   Number of nonconsensus   Percentage
by consensus (a)   sherds   classifications       classifications          disagreement
Early Angell          2              5                       5                 50.0
Late Angell           8             23                      17                 42.5
Winona                2              6                       4                 40.0
Total                12             34                      26                 43.3

(a) The type named most is designated the consensus type. In case of a tie the earlier type is used.

Not surprisingly, disagreements about classification increase when finer distinctions are attempted (Table IV). Pilles and other splitters subdivide tuff-tempered Angell/Winona sherds on the basis of gradations in temper size, from the "Early Angell," with the smallest temper (and also small, sparkling inclusions), to "Winona," with the largest temper. Although these splitters appear to agree on verbal definitions of these types, they plainly perceive them differently, and the number of divergent opinions on classification often matches and sometimes exceeds the majority vote. For some sherds the classifications ranged all the way from Early Angell to Late Angell to Winona. Overall, agreement on these more subtle temper gradations is not much better than 50%.

Analysts appear to agree more consistently about slips (Table V), but consensus is still nowhere near perfect among the six splitters (including the one who used some Southern Sinagua type names). Slips seem to be easier to discern on some sherd types than on others. For example, there is more agreement about the identification of slips on our sample of tuff-tempered (Angell/Winona) sherds than on cinder-tempered (Sunset) sherds, possibly because the former tend to have lighter paste which is more distinct from the generally reddish slips.

None of this says anything about the interpretive significance of the types and attributes themselves. We are testing the consistency with which the typology is used to classify sherds, not the typology's accuracy as a measure of anything. The variation among the sherds may be meaningful in chronological, functional, cultural, or stylistic senses, or not at all, but other data must be used to address these questions. We can, however, say something about the success of the Sinagua ceramic typology as a communicative device. The "half-full" view notes that there is a lot of basic agreement; a pessimist can point out with equal validity that there is considerable divergence too.

Table V. Agreement on Identification of Slips by the Six "Splitters" Who Used Named Sinagua Types

Ceramic types                                   Number of   Total consensus   Number of       Percentage
                                                sherds      classifications   disagreements   disagreements
Angell (unslipped) vs. Turkey Hill (slipped)       12             62               10              13.9
Sunset/Youngs Brown (unslipped) vs.
  Sunset Red (slipped)                             10             47               13              21.7
Total                                              22            109               23              17.4
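The percentage columns in Tables IV and V appear to follow a single arithmetic rule: disagreements divided by all classifications made (consensus plus disagreements). A few lines of code reproduce them; the function name is ours.

```python
def pct_disagreement(consensus, disagreements):
    """Percentage disagreement: disagreements over all classifications made."""
    return 100.0 * disagreements / (consensus + disagreements)

# Rows of Table V:
print(round(pct_disagreement(62, 10), 1))   # Angell vs. Turkey Hill -> 13.9
print(round(pct_disagreement(47, 13), 1))   # Sunset/Youngs vs. Sunset Red -> 21.7
print(round(pct_disagreement(109, 23), 1))  # totals -> 17.4
```

The same rule yields the Table IV percentages (e.g., Late Angell: 17 disagreements out of 40 classifications gives 42.5).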

The divergence appears to have several sources. One is disparity in the perception and recognition of some attributes. The second is disagreement in the choice of attributes that are considered important in defining types. The third source of disagreement was produced by our use of a free, nonhierarchical pile sort. This produced a difference between lumpers and splitters, who may have recognized similar categories but who differed in their selection of a level in the hierarchy whose groups they considered analytically or interpretively significant.

Perhaps the first two kinds of divergence are most damaging to typologies used as descriptive and interpretive tools. Where individuals differ in their definition and recognition of named types which have been assigned temporal (or other) meaning, interpretations made on the basis of these types become difficult to compare. If Indiana Jones classifies the pottery on a site as Rio de Flag, and Harold S. Colton calls the same sherds Winona Brown, the resulting difference in the dates they assign to the site may be quite significant. The disagreements in recognition and definition, and thus the dangers to interpretive potential, appear to increase as finer and more difficult distinctions are attempted.

Our experiment in evaluating Sinagua ceramic typology has affected the way in which we use it (Kamp and Whittaker, 1998). Overall, we have become more aware of what we think we mean, and what we think other archaeologists mean, when they use the Sinagua ceramic type labels. As practical measures, we (Kamp and Whittaker) type sherds together, and we closely supervise anyone who is working with us, so we can argue, resolve ambiguous cases, and maintain the similarity of our definitions.

This experiment has also made us feel more comfortable with our decision to be "lumpers." For instance, we do not distinguish in our analyses among Early Angell, Late Angell, and Winona Brown, both because of doubts about the interpretive value of doing so and because we have demonstrated a worrisome level of inconsistency in judgments about the temper sizes which define these types. Similarly, we avoid typological distinctions made on the basis of the presence or absence of slip, when possible.

In effect, our "lumper" position is similar to Adams and Adams' (1991) notion of "efficiency" in classification, but we see consistency rather than efficiency as the paramount reason for keeping a typology simple. If a typology is to be effective, the need for consistency demands that, when possible, difficult distinctions should be dispensed with in favor of more obvious ones. In the absence of clear interpretive reasons for differentiating types, lumping will yield more comparable results from different observers than splitting.

CONCLUSIONS

Typology and classification pervade archaeology, whether in the study of ceramics, lithics, woven fiber, or social organization. Our own case study dealt with a ceramic typology; previous experiments along the same lines have used both ceramic and lithic artifacts. However, the issue is the same for all typologies: consistent classification is a sine qua non for both communication and problem solving.

We have discussed the casual arguments of archaeologists about typological specifics, the formulations of theorists, the serious attempts of a few analysts to measure consistency, and the rare efforts of some Cassandras to promote both theoretical and practical consideration of problems of bias in classification. None of these have had much impact on most archaeologists. We see several reasons for what we consider a serious methodological and theoretical omission.

1. The problem of consistency has not been considered theoretically interesting. It should have been. The ease and consistency with which a typology can be applied result partly from the way a typology is structured and the precision of its class definitions. In turn, the ability to apply a typology consistently is necessary for it to fulfill its interpretive and communicative purposes. Theoretical arguments about the "best" kind of typology should have considered these issues. The "best" kind of typology should be structured to promote interobserver consistency in classification.

2. Evaluating typological systems can be complex. There are typically many types, defined by many attributes, each potentially with a multitude of different choices. Definitions at any level can produce ambiguities and inconsistent classification. Errors come from the analysts too. The most clearly defined type may be perceived or interpreted in different ways by analysts who will be humanly imperfect: biased, near-sighted, poorly trained, relying on different experiences, and so on. What are the important sources of error, and how do we find them, especially when there is no absolute standard of right answers in typology?

To some extent we sympathize: evaluating typological systems can be difficult. However, as the studies above have shown, even some simple tests may be revealing, and a number of procedures can be useful. Certainly it is necessary to look at multiple facets of a typological system. Both the typology and the classification, the rules and the behavior, may produce inconsistencies which affect our ability to use a typology for interpretive archaeological problems. Even without objective independent standards to determine which class assignment is correct or what level of disagreement is tolerable, measurements of consistency can be made that will allow us to worry and make changes, or shrug and decide that the system is good enough for our purposes. The issue is too important to ignore.

3. Practical barriers to testing typologies are often large. Testing takes time, energy, and money away from the interpretive pursuits that are often our real goal. Then, if agreement is poor, what can be done anyway? It is usually impossible to redesign a typology midway through a project or conduct a major reanalysis to improve agreement. However, the solutions suggested by Daniels (1972) 25 years ago are still good. Methodological rigor in the form of explicit type definitions, careful procedure, and consistent training of analysts should improve consistency by reducing ambiguity and differences in perception and interpretation. Quality control tests can be performed early rather than late and should be cost-effective, using a sample of data rather than the whole data set.

4. The cultural nature of typological systems hinders their evaluation. Sometimes we learn typologies almost as a rite of passage into the profession. At other times we may construct them ourselves, a process requiring considerable effort. Those who rely most on a typology and have long experience with it have often internalized it so well that they do not readily question it. They may have an emotional or professional commitment to the belief that it works for their purposes, and it may well do so. In any case, they are unlikely to evaluate it rigorously. Those whose theoretical position is that types represent a concrete reality may have an additional inhibition against evaluating typologies. On the other hand, those who are

skeptical of typologies, are less committed to them, or take the position that typologies are constructs, tools generated by the archaeological mind for particular purposes, may find it difficult to evaluate a system in which there are no absolute "right" answers.

The approaches above cannot solve the social problems of evaluating typological systems. However, the other aspects of evaluating and improving them are complex but tractable. Consensus analysis and other statistical techniques, in particular, provide insights into the many places where error may creep in, and all typological systems should be improvable both in the typology itself and in the classification process. A commitment to honest methodology and accurate interpretation requires that we at least think hard about the issues involved in typological rigor and classificatory consistency.

ACKNOWLEDGMENTS

Our sincere thanks go to the participants in this study: Anne Baldwin, Michael Bremer, Tim Burchett, Amy Douglas, Chris Downum, Phil Geib, David Greenwald, Jackie Hovis, Carl Phagan, Peter Pilles, Alan Sullivan, and Jonathan Till (and Whittaker). All gave permission for their names to be used, but because archaeologists can be more sensitive about typologies than sunburn, we have chosen to protect those whose low centrality scores might be unjustly mocked by not using any names except that of Pilles, who is central to the system. Amy Henderson drew the figures. Reviewers' comments on various versions of this paper varied from insightful to vapidly vicious; nevertheless, we appreciate all the help. Peter Pilles, H. Wolcott Toll, and Paul Fish were among the more useful commentators. The criticism and patience of Michael Schiffer were indispensable and greatly improved the clarity and direction of this paper. However, we stubbornly maintain our own biases, idiosyncrasies, and deviation from consensus.

REFERENCES

Adams, W. Y. (1988). Archaeological classification: Theory versus practice. Antiquity 61: 40-56. Adams, W. Y., and Adams, E. W. (1991). Archaeological Typology and Practical Reality: A

Dialectical Approach to Artifact Classification and Sortin~ Cambridge University Press, Cambridge.

Beck, C., and Jones, G. T. (1989). Bias and archaeological classification. American Antiquity 54(2): 244-261.

Borgatti, S. (1990). ANTHROPAC 3.2 Computer Program Analytic Technologies, Columbia, SC.

Page 34: Evaluating consistency in typology and classificationusers.clas.ufl.edu/davidson/Material Culture course...Journal of Archaeological Method and Theory, Vol. 5, No. 2, 1998 Evaluating

162 Whittaker, Cauikins, and Kamp

Boster, J. (1986). Exchange of varieties and information between Aguaruna manioc cultivators. American Anthropologist 88(2): 429-436.

Boster, J., and Johnson, J. C. (1989). Form or function: A comparison of expert and novice judgments of similarity among fish. American Anthropologist 91(4): 866-889.

Boyd, C. C. (1987). Interobserver error in the analysis of nominal attribute states: A case study. Tennessee Anthropologist 12(1): 88-95.

Breternitz, D. A. (1966). An appraisal of tree-ring dated pottery in the Southwest. Anthropological Papers of the University of Arizona No. 10, University of Arizona Press, Tucson.

Brew, J. O. (1946). The use and abuse of taxonomy. In The Archaeology of Alkali Ridge, Southeastern Utah. Papers of the Peabody Museum of American Archaeology and Ethnology 21, Harvard University Press, Cambridge, MA, pp. 44-66.

Burling, R. (1964). Cognition and componential analysis: God's truth or hocus-pocus? American Anthropologist 66: 20-28.

Caulkins, D. (1991). Measuring diversity in organizational culture. Dyn: Journal of the Anthropological Society of Durham University 10: 1-21.

Caulkins, D., and Trosset, C. (1996) The ethnography of contemporary Welsh and Welsh-American identity and values. In Thompson, M. (ed.), Proceedings of the First Conference of the North American Association for the Study of Welsh Culture and History, NAASWCH, Rio Grande, Ohio.

Clarke, D. L. (1968). Analytical Archaeology, Methuen, London.

Clarke, D. L. (ed.) (1972). Models in Archaeology, Methuen, London.

Colton, H. S. (1932). A survey of prehistoric sites in the region of Flagstaff, Arizona. Bureau of American Ethnology Bulletin 104, U.S. Government Printing Office, Washington, DC.

Colton, H. S. (1941). Winona and Ridge Ruin Part II: Notes on the technology and taxonomy of the pottery. Museum of Northern Arizona Bulletin 19, Museum of Northern Arizona, Flagstaff.

Colton, H. S. (1955). Pottery types of the Southwest. Museum of Northern Arizona Ceramic Series No. 2, Museum of Northern Arizona, Flagstaff.

Colton, H. S. (1958). Pottery types of the Southwest: Revised description of Alameda Brown, Prescott Grey, San Francisco Mountain Gray Wares 14, 15, 16, 17, 18. Museum of Northern Arizona Ceramic Series No. 3, Museum of Northern Arizona, Flagstaff.

Colton, H. S., and Hargrave, L. L. (1937). Handbook of northern Arizona pottery wares. Museum of Northern Arizona Bulletin No. 11, Museum of Northern Arizona, Flagstaff.

Crane, D. (1972). Invisible Colleges: Diffusion of Knowledge in Scientific Communities, University of Chicago Press, Chicago.

Daniels, S. G. H. (1972). Research design models. In Clarke, D. (ed.), Models in Archaeology, Methuen, London, pp. 201-229.

Deetz, J. D. (1967). Invitation to Archaeology, Natural History Press, New York.

Dibble, H., and Bernard, M. (1980). A comparative study of basic edge angle measurement techniques. American Antiquity 45(4): 857-865.

Downum, C. E. (1988). "One Grand History": A Critical Review of Flagstaff Archaeology, 1851-1988, Ph.D. dissertation, University of Arizona, University Microfilms, Ann Arbor, MI.

Dressler, W., Santos, J. E., and Balieiro, M. C. (1996). Studying diversity and sharing in culture: An example of lifestyle in Brazil. Journal of Anthropological Research 52: 331-353.

Dumond, D. (1974). Some uses of r-mode analyses in archaeology. American Antiquity 39(2): 253-270.

Dunnell, R. C. (1986). Methodological issues in Americanist artifact classification. In Schiffer, M. B. (ed.), Advances in Archaeological Method and Theory, Volume 9, Academic Press, New York, pp. 149-208.

Dunnell, R. C., and Grayson, D. K. (eds.) (1983). Lulu Linear Punctated: Essays in honor of George Irving Quimby. University of Michigan Museum of Anthropology Anthropological Papers 72, Ann Arbor, MI.

Fish, P. R. (1976). Replication studies in ceramic classification. Pottery Southwest 3(4): 4-6.

Fish, P. R. (1978). Consistency in archaeological measurement and classification: A pilot study. American Antiquity 43(1): 86-89.

Ford, J. A. (1954a). Comment on A. C. Spaulding, "Statistical techniques for the discovery of artifact types." American Antiquity 19: 390-391.

Ford, J. A. (1954b). The type concept revisited. American Anthropologist 56(1): 42-54.

Gifford, J. C. (1960). The type-variety method of ceramic classification as an indicator of cultural phenomena. American Antiquity 25: 341-347.

Gunn, J. E., and Prewitt, E. R. (1975). Automatic classification: Projectile points from west Texas. Plains Anthropologist 20: 139-149.

Hargrave, L. L. (1932). Guide to forty pottery types from the Hopi Country and the San Francisco Mountains, Arizona. Museum of Northern Arizona Bulletin No. 1, Museum of Northern Arizona, Flagstaff.

Hill, J. N., and Evans, R. K. (1972). A model for classification and typology. In Clarke, D. L. (ed.), Models in Archaeology, Methuen, London, pp. 231-273.

Hill, J., and Gunn, J. (eds.) (1977). The Individual in Prehistory, Academic Press, New York.

Hodson, F. R., Sneath, P. H. A., and Doran, J. E. (1966). Some experiments in the numerical analysis of archaeological data. Biometrika 53: 311-324.

Hyatt, S. B., and Caulkins, D. (1992). Putting bread on the table: The women's work of community activism. Unit on Work and Gender, Occasional Paper No. 6, University of Bradford, Department of Social and Economic Studies, Bradford.

Kamp, K. A., and Whittaker, J. C. (1990). Lizard Man Village: A small site perspective on Sinagua social organization. Kiva 55: 99-125.

Kamp, K. A., and Whittaker, J. C. (1998). Surviving Adversity: The Sinagua of Lizard Man Village, University of Utah Anthropological Papers No. 121, University of Utah, Salt Lake City.

Kamp, K. A., Wallace, R., and Saylor, G. (1991). A functional interpretation of two Sinagua brownwares. Pottery Southwest 14(4): 1-5.

Keeley, L. (1980). Experimental Determination of Stone Tool Uses: A Microwear Analysis, University of Chicago Press, Chicago.

Keeley, L., and Newcomer, M. (1977). Microwear analysis of experimental flint tools: A test case. Journal of Archaeological Science 4: 29-62.

Kempton, W. (1981). The Folk Classification of Ceramics: A Study of Cognitive Prototypes, Academic Press, New York.

McGuire, R. H., Whittaker, J. C., McCarthy, M., and McSwain, R. (1982). A consideration of observational error in lithic use wear analysis. Lithic Technology 11(3): 59-63.

Newcomer, M., Grace, R., and Unger-Hamilton, R. (1986). Investigating microwear polishes with blind tests. Journal of Archaeological Science 13: 203-217.

Odell, G. H., and Odell-Vereecken, F. (1980). Verifying the reliability of lithic use-wear assessments by "blind tests": The low power approach. Journal of Field Archaeology 7(1): 87-120.

Read, D. (1974). Some comments on typologies in archaeology and an outline of a methodology. American Antiquity 39(2): 216-249.

Read, D., and Christenson, A. (1977). Numerical taxonomy, r-mode factor analysis, and archaeological classification. American Antiquity 42(2): 163-179.

Reid, J. J. (1984). What is black, white, and vague all over? In Sullivan, A. P., and Hantman, J. L. (eds.), Regional Analysis of Prehistoric Ceramic Variation: Contemporary Studies of Cibola Whitewares, Anthropological Research Papers 31, Arizona State University, Tempe, pp. 135-152.

Romney, A. K., Weller, S. C., and Batchelder, W. H. (1986). Culture and consensus: A theory of culture and informant accuracy. American Anthropologist 88(2): 313-338.

Romney, A. K., Batchelder, W. H., and Weller, S. C. (1987). Recent applications of cultural consensus theory. American Behavioral Scientist 31(2): 163-177.

Rouse, I. (1960). The classification of artifacts in archaeology. American Antiquity 25: 313-323.

Schiffer, M. B. (1976). Behavioral Archeology, Academic Press, New York.

Shea, J. J. (1992). Lithic microwear analysis in archeology. Evolutionary Anthropology 1(4): 143-150.

Spaulding, A. C. (1953). Statistical techniques for the discovery of artifact types. American Antiquity 18: 305-313.

Spaulding, A. C. (1954). Reply to Ford. American Antiquity 19: 391-393.

Swarthout, J., and Dulaney, A. (1982). A description of ceramic collections from the railroad and transmission line corridors. Museum of Northern Arizona Research Paper 26.

Swartz, B. K. (1967). A logical sequence of archaeological objectives. American Antiquity 32: 487-497.

Thomas, D. H. (1981). How to classify the projectile points from Monitor Valley, Nevada. Journal of California and Great Basin Anthropology 3(1): 7-43.

Trosset, C., and Caulkins, D. (1993). Is ethnicity the locus of culture? The image and reality of ethnic subcultures in Wales. Paper presented at American Anthropological Association Meetings, Washington, DC, Nov. 19.

Tuggle, H. D. (1970). Prehistoric Community Relationships in East-Central Arizona, Ph.D. dissertation, Anthropology Department, University of Arizona, Tucson.

Vaughan, P. C. (1985). Use-Wear Analysis of Flaked Stone Tools, University of Arizona Press, Tucson.

Weller, S. C. (1984). Consistency and consensus among informants: Disease concepts in a rural Mexican village. American Anthropologist 86: 966-975.

Weller, S. C., and Romney, A. K. (1988). Systematic Data Collection, Sage, Beverly Hills, CA.

Weller, S. C., Romney, A. K., and Orr, D. (1986). The myth of a sub-culture of corporal punishment. Human Organization 46: 39-47.

Whallon, R. (1972). A new approach to pottery typology. American Antiquity 37(1): 13-33.

Wheat, J. B., Gifford, J. C., and Wasley, W. W. (1958). Ceramic variety, type cluster, and ceramic system in Southwestern pottery analysis. American Antiquity 24(1): 34-47.

Whittaker, J. C. (1984). Arrowheads and artisans: Stone tool manufacture and individual variation at Grasshopper Pueblo, Ph.D. dissertation, University of Arizona, University Microfilms, Ann Arbor, MI.

Wood, J. S. (1987). Checklist of Pottery Types for the Tonto National Forest. The Arizona Archaeologist No. 21, Arizona Archaeological Society, Phoenix, AZ.

Young, D., and Bamforth, D. R. (1990). On the macroscopic identification of used flakes. American Antiquity 55(2): 403-409.