
ISSUES AND TRENDS

Jeffrey W. Bloom and Deborah Trumbull, Section Coeditors

Contextual Assessment in Science Education: Background, Issues, and Policy

STEPHEN KLASSEN
University of Winnipeg, Winnipeg, Manitoba, Canada R3B 2E9

Received 5 April 2005; revised 30 November 2005, 10 January 2006; accepted 1 February 2006

DOI 10.1002/sce.20150
Published online 3 May 2006 in Wiley InterScience (www.interscience.wiley.com).

ABSTRACT: Contemporary assessment practices in science education have undergone significant changes in recent decades. The basis for these changes and the resulting new assessment practices are the subject of this two-part paper. Part 1 considers the basis of assessment that, more than 25 years ago, was driven by the assumptions of decomposability and decontextualization of knowledge, resulting in a low-inference testing system, often described as “traditional.” This assessment model was replaced not on account of direct criticism, but rather on account of a larger revolution—the change from behavioral to cognitive psychology, developments in the philosophy of science, and the rise of constructivism. Most notably, the study of the active cognitive processes of the individual resulted in a major emphasis on context in learning and assessment. These changes gave rise to the development of various contextual assessment methodologies in science education, for example, concept mapping assessment, performance assessment, and portfolio assessment. In Part 2, the literature relating to the assessment methods identified in Part 1 is reviewed, revealing that there is not much research that supports their validity and reliability. However, encouraging new work on selected-response tests is forming the basis for reconsideration of past criticisms of this technique. Despite the major developments in contextual assessment methodologies in science education, two important questions remain unanswered, namely, whether grades can be considered as genuine numeric quantities and whether the individual student is the appropriate unit of assessment in public accountability. Given these issues and the requirement for science assessment to satisfy the goals of the individual, the classroom, and the society, tentative recommendations are put forward addressing these parallel needs in the assessment of science learning. © 2006 Wiley Periodicals, Inc. Sci Ed 90:820–851, 2006

Correspondence to: Stephen Klassen; e-mail: [email protected]

© 2006 Wiley Periodicals, Inc.


PART 1: HISTORICAL, PHILOSOPHICAL, AND PSYCHOLOGICAL ROOTS OF ASSESSMENT IN SCIENCE EDUCATION

The theoretical and practical issues in the assessment of science learning are complex. It is this complexity, perhaps overwhelming to the nonspecialist, which makes it difficult to engage with the discipline in a critical manner. Gunzenhauser (2003) points out that there is a lack of reflective, engaged dialogue among the various stakeholders in educational assessment. As assessment “end-users,” science educators should, nevertheless, have adequate knowledge about philosophical and theoretical assumptions and methodology in assessment. A review designed for the nonspecialist in assessment focusing, in part, on theoretical and philosophical aspects may help stimulate engagement in reflective dialogue. Reviews of assessment literature might consist of an exhaustive documentation of background, basis for design, and summary of recommended practices in assessment (Pellegrino, Chudowsky, & Glaser, 2001), or they might expound on issues of professional development and classroom practice arising in the assessment literature (Shepardson, 2001). The current review adds another perspective to the literature—that of identifying underlying philosophical and theoretical positions that may have a bearing on rethinking problematic assessment areas and issues that may, in turn, stimulate a reflective dialogue among assessment theorists and practitioners. These issues are the subject of this two-part review. In Part 1, the historical, philosophical, and psychological roots of contemporary assessment practices in science education are examined. Part 2 presents, in more detail, recent major trends and issues in science assessment and exposes recent practices and methodologies to critical scrutiny.

The literature on assessment reveals that greater reflection and discrimination between assessment practices that are warranted and those that are not is needed to make improvements in this area. Along these lines, Gell-Mann maintains that “we must attach a higher prestige to that very creative act, the writing of serious review articles and books that distinguish the reliable from the unreliable and systematize and encapsulate, in the form of reasonably successful theories and other schemata, what does seem reliable” (Gell-Mann, 1994, p. 342). A discriminatory review of assessment practices first requires an examination of the historical and philosophical roots of assessment models and practice and how recent developments in philosophy and psychology have influenced the development of assessment approaches.

The Origins of Assessment

When one reads about assessment history, as well as present practice, one is struck by how little the issues regarding public accountability of education have changed over the past century. Over a century ago, John Stuart Mill advocated a law making public education compulsory and proposed the institution of public examinations. He wrote that

[t]he instrument for enforcing the law could be no other than public examinations, extending to all children and beginning at an early age. An age might be fixed at which every child must be able to read. If a child proves unable, the father, unless he has some sufficient ground of excuse, might be subjected to a moderate fine, to be worked out, if necessary, by his labour, and the child might be put to school at his expense. Once in every year the examination should be renewed, with a gradually extending range of subjects, so as to make the universal acquisition, and what is more, retention, of a certain minimum of general knowledge, virtually compulsory. (Mill, 1892/1859, p. 63)

Although Mill’s view on assessment and public accountability might strike the reader as somewhat quaint and even amusing, it, nevertheless, reflects the tension that exists, even today, between students and educators, on the one hand, and public policy, on the other. Looking back at historical pronouncements, such as those of Mill, casts current theoretical and practical issues in assessment in a different light. Historical considerations like these illuminate the philosophical and practical underpinnings of present assessment practice and help determine whether any of the historical challenges have, with present knowledge, been overcome.

Although public policy has likely always played a role in the motivation for examinations, examinations have not always been used in the way that they are used today. Their use likely originated with the written Chinese civil service examination in 210 BC, to select the most competent civil servants (Madaus & Kellaghan, 1993), or even earlier (Lissitz, 1997). The written examination first appeared in Europe in the sixteenth century, when it was used to supplement the viva voce examination. Ranking, however, was not introduced until around 1750 and the quantification of grades did not appear until 1792, promoted by William Farish, a pioneering engineering professor of Cambridge University (Madaus & Kellaghan, 1993). Quantification of marks was a major turning point in the nature of examinations since it led to the formulation of factual, categorical, narrowly focused questions that replaced questions aimed at the rhetorical exposition of topics ranging across the curriculum (Hoskins, 1968). These developments in assessment in Europe did not go unnoticed by Horace Mann in the United States, an influential advocate of public education, while he was secretary to the Massachusetts Board of Education. In 1845, Mann administered the first public written examinations to Boston’s brightest 14-year-old public school students to encourage a higher degree of standardization (Rothstein, 1998), and from 1845 until around 1900, the essay examination became the dominant mode of testing in the United States (Madaus & Kellaghan, 1993). Mann discovered the quantitatively scored, ranked examination to be the political tool for which he had been looking in his battle with Boston’s headmasters over his attempt to have corporal punishment in schools abolished (Madaus & Kellaghan, 1993). Mann may have reasoned that if schools were publicly demonstrated to be deficient in examination scores, this would imply incompetence on the part of teachers and headmasters, placing enormous pressure on them to change their ways in reference to both standards of education and corporal punishment. Since that time, and especially today, the publicly administered examination has tended to be surrounded by controversy (Linn, 2001; Odell, 1928; Resnick, 1982).

Assessment Issues in the 20th Century. From the 1920s to the 1960s, examinations were not used to promote public policy to the degree that had been the case earlier, as, for example, they had been in the work of Horace Mann (Madaus & Kellaghan, 1993). However, more recently, the notion of public accountability has played an increasing role in determining the purpose of examinations (Baker, 2001; Linn, 2001). It is interesting to observe that views on public accountability expressed many decades ago could have been written today; for example, the popular Prime Minister of Ireland, Eamon de Valera, wrote in 1941 that

[i]f we want to see that a certain standard is reached and we are paying the money, we have the right to see that something is secured for that money. The ordinary way to test it and to try to help the whole educational system is by arranging our tests in such a way that they will work in the direction we want. (Jaeger, 1989, pp. 485–486)

De Valera’s view led to the establishment of the sixth-standard examination program in Ireland in 1943, but the Irish National Teachers’ Organization steadfastly refused to cooperate with the program and it was finally abolished in 1967 (Jaeger, 1989). In the United States, it was not until the watershed effect of the Woods Hole conference in September 1959 that public concern over the quality of science education began to emerge. The prestigious members of the conference met under the

conviction that [they] were at the beginning of a period of new progress in, and concern for, creating curricula and ways of teaching science, and that a general appraisal of this progress and concern was in order, so as to better guide developments in the future. (Bruner, 1963, p. vii)

This development of concern over the state of science education accompanied the increased usage of standardized test results to monitor the quality of education.

Since testing was introduced, examinations or tests have served to check for and enforce public standards in education. Testing has been concerned mainly with ranking students for the purposes of placement, certification, and college or university admission, or with providing public accountability for student achievement. Until the 1990s, providing feedback during the process of instruction to either teachers or students was not emphasized (Tamir, 1998). Rather, factual recall questions and problem-solving questions usually functioned as summative, high-stakes assessments after curriculum had been completely “covered” in the classroom. Such tests had the additional attribute of a strict time limit, which is consistent with the requirement of short, specific answers and the absence of reflective thought on the part of students. Since ranking was a primary consideration, the student was always tested in isolation from other students.

The administration of large-scale assessment was made significantly easier by the development of digital electronics and the invention of the high-speed grading machine in the 1950s. Machine scoring dictated a test form in which the student selected one of several possible answers to a question, here designated as a selected-response. Since then, selected-response testing has been a mainstay in the assessment of science learning. The main example of selected-response style testing is the multiple-choice format, but other styles include true–false, mix-and-match, and fill-in-the-blanks. Usually, the term “multiple-choice” is used in the research literature; however, the term “selected-response” will be used in this paper to include the various types of tests used. Selected-response tests were considered ideal for testing minimum competency and basic skills.
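
To make the mechanics of machine scoring concrete, the following sketch scores a selected-response answer sheet against an answer key. The item keys and student responses are hypothetical and are not drawn from any instrument discussed in this paper; the sketch illustrates only the general logic of the high-speed grading machine described above.

    # Minimal sketch of machine scoring for a selected-response test.
    # The answer key and the student responses below are hypothetical.
    answer_key = ["B", "D", "A", "A", "C"]          # one keyed option per item

    def score_sheet(responses, key=answer_key):
        """Return the raw score: one point per response that matches the keyed option."""
        return sum(1 for given, keyed in zip(responses, key) if given == keyed)

    student_responses = ["B", "D", "C", "A", "C"]    # e.g., read from a scanned answer sheet
    print(score_sheet(student_responses))            # -> 4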

Even though large-scale assessment was already becoming well-established, some components were changed from the selected-response methodology. Constructed-response questions, such as definitions or short explanations, were used, but not to a large extent. Especially in higher level science, exemplar-type problems in which the student applies a principle and calculates a numerical solution or obtains an algebraic solution have been commonly used. Despite these somewhat more open-ended features, test questions, at least until the 1990s, tended to focus narrowly on particular knowledge items. The task isolation that is inherent in narrowly focused questions remained a characteristic of testing until recently. However, in the 1970s, a nagging issue that arose was a significant discrepancy between student scores on achievement tests and intelligence tests. The discrepancy made it necessary to categorize students who performed poorly on achievement tests but much better on intelligence tests as “learning disabled” (Willson, 1991). The issue was addressed by including written sections on assessments that were designed to assess higher order thinking skills. An example of a test where this was done is the Scholastic Aptitude Test in the 1980s (Willson, 1991). In the 1970s and 1980s, the concern about the validity of achievement tests coincided with fundamental changes in thought in psychology and the philosophy of science, which were accompanied by the rise of the constructivist movement in education. These changes in thinking about the nature of scientific change and the nature of science learning also resulted in proposals for fundamental changes in assessment. But what was the philosophical basis for assessment, and how did these fundamental changes arise?

The Roots of Contemporary Change in Assessment in Science Education

Empiricist and Behaviorist Roots. In the face of the challenge that assessing student work represents, the tendency has been to simplify the tasks and break them into components that represent the constituent objectives. This inclination toward task isolation was encouraged by behaviorist psychology, which, in turn, was influenced by empiricism. Today, however, empiricism has largely been rejected and behaviorism is discredited.

In part, behaviorism grew out of the empiricist philosophy of science, which held that all knowledge originates in experience. The empiricist view has its origins in Aristotle’s notion, as stated by medieval scholars, nihil in intellectu quod non prius in sensu—“There is nothing in the mind except what has passed through the senses.” If one accepts Aristotle’s dictum, then it follows that the mind merely consists of representations of sensory stimuli, which are the result of verbal teaching and reading. The implication of this early empiricist view is that knowledge can be transferred intact from teacher to student through the senses. This transmission model of learning was most famously promoted by John Locke, who was a major proponent of early empiricism. What the student learns, in the empiricist view, is an exact representation, or at least a subset, of what the teacher conveys, since no further processing is involved beyond the sensory transduction of information. It is easy to see how this view supported the suppositions of behaviorism, which held that all behavior is the result of external stimuli and that so-called knowledge is simply a record of what was received through the senses.

An implication of this empiricist–behaviorist view is that knowledge can be “atomized” or broken up into small, simple steps that are easy to teach and learn. The atomistic view of knowledge follows from the assumption that knowledge-based sensory stimuli are associated on a one-to-one basis with the cognitive representations of these knowledge items. This relationship of one mental knowledge representation to another cannot be changed at the cognitive level in the behavioristic view. B. F. Skinner developed and popularized the atomistic view of knowledge and wrote that “the whole process of becoming competent in any field must be divided into a very large number of very small steps, and reinforcement must be contingent upon the accomplishment of each step” (Skinner, 1954, p. 94). The sequence of knowledge presentation becomes very important under the assumption that knowledge can be atomized and assimilated piecemeal. If knowledge is structured by the brain in the order that it is received (with no further internal reprocessing), then relationships must be preformed in order for that knowledge to make sense. The sense-making, sequential, logical structure of knowledge must be preprogrammed into the process. Learning is thus viewed as a linear sequential process. Skinner details the formula for successful learning in the linear sequential view:

If a learner attains the objectives subordinate to a higher objective, his probability of learning the latter has been shown to be very high; if he misses one or more of the subordinate objectives, his probability of learning the higher one drops to near zero. (1965, p. 30)

Finding out if the learner has missed learning objectives is the short-range objective of traditional science instruction and has resulted in the familiar teach-test–teach-test sequence.


The atomistic view of learning holds that the component skills or knowledge items comprising more general abilities may be mastered independently and out of context as long as they are in the correct logical sequence so that they may associate properly when they are recorded by the brain. Therefore, any contextual factors, such as usefulness and everyday applicability or the influence and advice of others, either are irrelevant to the items being taught or interfere with a narrow and clear presentation of the knowledge item. It should be no surprise, then, that the behaviorist view of learning has been characterized as having two central assumptions—those of decomposability and decontextualization (Resnick & Resnick, 1992).

In addition to simple mental knowledge structures, behavioral psychology postulates the existence of individual traits, such as intelligence or general scientific ability. These traits cannot be tested directly, so proxies or “constructs” are found for these traits that are then tested. Conventional testing has been much occupied with the inferential testing of constructs (Gay, 1996; Messick, 1989). The test questions target the areas to be tested in a uniform fashion, but not exhaustively. The results are analyzed by inferential statistics and have validity only in terms of the probability that the construct is adequately represented by the test.

Psychometric Roots. The notion that behavior can be measured stems back to the development of psychophysics in the 19th century, which led to the development of mental testing (Willson, 1991). Psychometric theory was developed along with behaviorist psychology from about 1920 to 1970 (Willson, 1991). Moss describes the typical psychometric approach to assessment:

In a typical psychometric approach to assessment, each performance is scored independently by readers who have no additional knowledge about the student or about the judgement of other readers. Inferences about achievement, competence, or growth are based upon composite scores, aggregated from independent observations across readers and performances, and referenced to relevant criteria or norm groups. (1994, p. 7)

Independence of the test reader from other readers and aggregation of individual response scores, as described by Moss, characterize the psychometric approach. If item responses on a test are accepted as the result of a purely behavioristic process, then any knowledge that the test reader possesses about the student or the assessment of other readers would detract from the scoring validity. Any knowledge beyond the correct answers and the test item responses might appear as bias and invalidate the score.
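
The aggregation step that Moss describes can be sketched in a few lines. The reader scores below are hypothetical, and the simple averaging used here is only one of many possible aggregation rules; the sketch illustrates the psychometric logic of composite scoring, not any specific published protocol.

    # Minimal sketch of composite scoring in the psychometric approach described by Moss:
    # each performance is scored independently by several readers, and inferences rest on
    # the aggregated composite. All scores below are hypothetical.
    from statistics import mean

    # independent reader scores for one student, keyed by performance
    reader_scores = {
        "lab_report": [3, 4, 4],
        "unit_test": [5, 5, 4],
        "presentation": [2, 3, 3],
    }

    performance_means = {task: mean(scores) for task, scores in reader_scores.items()}
    composite = mean(performance_means.values())     # aggregate across performances
    print(performance_means, round(composite, 2))    # composite is approximately 3.67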

The process of aggregating or adding response scores is based on the assumption that test scores, although numeric, are also quantitative; that is, they genuinely represent a measurable quantity. The assumption that test scores are true representations of a quantitative entity is one that has always been made about psychological phenomena. A few times in the past, this assumption has been questioned, but not successfully (Michell, 1997). Recently, however, a potentially devastating criticism is being launched against the assumption of the quantitative nature of entities, such as test scores (Dalziel, 1998; Michell, 1997). This issue of the numerical nature of grades is addressed in Part 2.

Shepard points out that the psychometric–behavioristic learning model “assumes that all important learning objectives can be specified and measured both completely and exhaustively” (1991, p. 7). Hence, the assessment system is what is called a “low inference system” in which the tests and learning objectives are equivalent (Shepard, 1991, p. 7). The procedure of criterion referencing of tests follows naturally from the direct equivalence between learning objectives and test questions. Criterion-referenced tests are set and marked on the basis of explicit knowledge criteria, which, in this case, correspond exactly to learning objectives.
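
A low-inference, criterion-referenced report can be sketched as a direct mapping from test items to objectives. The objectives, item assignments, and mastery threshold below are hypothetical and serve only to illustrate the one-to-one correspondence between objectives and test questions just described.

    # Minimal sketch of criterion-referenced reporting in a low-inference system:
    # each test item is tied to exactly one learning objective, and mastery of an
    # objective is read directly off the items assigned to it. All values are hypothetical.
    item_to_objective = {1: "define current", 2: "define current",
                         3: "apply Ohm's law", 4: "apply Ohm's law", 5: "apply Ohm's law"}
    item_correct = {1: True, 2: True, 3: True, 4: False, 5: True}
    MASTERY_THRESHOLD = 0.8   # proportion of items correct needed to report mastery

    report = {}
    for item, objective in item_to_objective.items():
        report.setdefault(objective, []).append(item_correct[item])
    for objective, results in report.items():
        proportion = sum(results) / len(results)
        print(objective, "mastered" if proportion >= MASTERY_THRESHOLD else "not mastered")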

The Demise of Prevailing Assessment Approaches

Criticisms. The traditional relationship among instruction, learning, and assessment is reflected in a teacher’s statement quoted in Pallrand (1996): “How do they [assessment reformers] expect students to answer questions that teachers have not already given them the answers to?” This teacher reflects a common view of teaching, learning, and assessment that sees students as a tabula rasa, in John Locke’s metaphor, to whom information is imparted by telling and from whom it may be retrieved on demand. However, learning theory has departed radically from its more traditional form during the last quarter of the 20th century. In the vein of Locke’s metaphor, today the mind of the learner is seen not as a blank slate, but rather as a slate on which much is already written and where the learner writes new words and phrases in appropriate spots and rearranges phrases to make room for new ones. In the conventional view, when the student is tested, it is expected that the information will come back in essentially the same form in which it was originally presented. The tabula rasa notion stands in direct opposition to the currently dominant constructivist learning theory, which sees the student as actively reconstructing new information in order to assimilate it into existing knowledge structures, and what comes back to the teacher in the form of test answers is, in practice, rarely exactly the same as what was originally “imparted.” Conventional instruction values simple factual recall through rote memorization. Simple facts that are memorized verbatim by the student may be regurgitated on a test and come back to the teacher unchanged. Strike and Posner reject rote learning, stating that

[t]he task of learning is primarily one of relating what one has encountered. . . to one’s current ideas. . . . To learn an idea any other way is to acquire a piece of verbal behaviour which one emits to a stimulus, rather than to understand an idea which one can employ in an intellectually productive way. (1985, p. 212)

Strike and Posner’s criticism of conventional teaching and learning is typical of the perspective that has brought about dissatisfaction with the view of learning which, to a large degree, relies on simple factual recall. The factors that led to these criticisms and to the abandonment of some of the traditionally accepted presuppositions and methods grew out of a psychological and philosophical paradigm1 shift around the 1970s. This shift was from the empiricist–behaviorist-dominated paradigm to a paradigm based on cognitive psychology, constructivism, and the philosophical positions of philosophers of science, especially that of Kuhn. Both cognitive psychology and constructivism were also influenced early on by the work of Piaget.

The various criticisms of conventional assessment methods have centered on the assumptions of decomposability and decontextualization that were employed by the psychometric model of the day. The primary assessment method that resulted from all three factors (decomposability, decontextualization, and psychometrics) was that of selected-response testing, which has been criticized on the following grounds: (a) it is unsuitable for evaluating scientific processes (Champagne & Newell, 1992; Gardner, 1992), (b) it is unable to demonstrate conceptual understanding (Tamir, 1998), (c) it is inadequate for measuring students’ reasoning ability (Frederiksen, 1984; Resnick & Resnick, 1992), (d) it is unable to measure what people can do with what they know (Race, 1992), (e) it is unable to evaluate problem-solving ability (O’Neill, 1992), and (f) it leads teachers to teach unrelated facts (Tamir, 1998). These criticisms of conventional assessment are made, largely, in hindsight; the replacement of the traditionally accepted view was initiated by a psychological and philosophical revolution. However, it is shown in Part 2 that recent developments in multiple-choice test design and measurement theory have largely answered the above criticisms.

1 The term “paradigm” is used in the somewhat relaxed sense of a “disciplinary matrix” that Kuhn defined in the postscript to his 1969 edition of The Structure of Scientific Revolutions. In Kuhn’s modified notion, the paradigm consists of a theory, its metaphysical assumptions, its scientific values, and exemplars of concrete problem solutions which need not all change simultaneously (Kuhn, 1996, pp. 182–187).

The Accompanying Demise of Behaviorism. The course of the psychological and educational paradigm was greatly affected in 1959 by the unlikely field of linguistics and the involvement of the linguist, philosopher, and activist Noam Chomsky (Corsini, 1994; Houts & Haddock, 1992). Although linguistics is not directly related to science education, Chomsky’s involvement was a major initial factor in producing the shift in thinking that included science education. In 1959, Chomsky published a review of B. F. Skinner’s book Verbal Behaviour. In his book, Skinner had attempted to show that behavioristic stimulus, response, and reinforcement mechanisms govern language development. Chomsky argued that Skinner’s model was unable to account for the complexities of language development. Moreover, Chomsky pointed out that saying each language element is a response to a stimulus is a scientifically meaningless claim since a stimulus can always be posited to explain any response (Chomsky, 1959). It was easy to see, in Chomsky’s account, that if a process of hearing and repetition were to be the exclusive mechanism in language learning, it would take a person an incredibly long time to hear and repeat enough variations of grammar and syntax in order to learn a language—a much longer time than is, in fact, available. Chomsky’s paper marked the beginning of the demise of behaviorism and the rise of cognitive psychology. Much later, Chomsky’s critique was challenged (Houts & Haddock, 1992), but by then it was too late to change a historically established fact. It was not long before the views of cognitive psychologists, among them Piaget, and philosophers of science, among them Kuhn, began to gain prominence. Behaviorism gradually diminished as a viable theory. It is generally accepted that behaviorism has been displaced as a viable theory (P. J. Black, 1993; Shepard, 1991; Willson, 1991), but some versions are still active (Houts & Haddock, 1992).

The Emergence of a New Philosophical and Psychological Paradigm

According to cognitive psychology, understanding is a mental process of perceiving and knowing. Sensory stimuli, such as sight, assume a secondary role. Hanson succinctly expressed the subordinate nature of sensation to thought when he wrote,

People, not their eyes, see. Cameras, and eye-balls, are blind. Attempts to locate within the organs of sight (or within the neurological reticulum behind the eyes) some nameable called “seeing” may be dismissed. That Kepler and Tycho do, or do not, see the same thing cannot be supported by reference to the physical states of their retinas, optical nerves or visual cortices: there is more to seeing than meets the eyeball. (1958, pp. 6–7)

What Kepler and Tycho Brahe understood about the heavens was not dependent primarily on the observations that they used in their work, which were the same, but on their understandings about those observations. That Kepler and Brahe, using the same data, came to different theories suggests that the process of understanding takes place beyond sensory perception. Hanson, an early proponent of cognitive psychology, saw evidence for active cognitive processes in the history of science. The developing field of cognitive psychology turned the attention of learning theory decisively to the active cognitive processes of the individual.

Other early views of cognitive processes, like that of Chomsky about inherent learning abilities, implied that science understanding, like language understanding, stemmed from a complex cognitive structure. Piaget was sympathetic to Chomsky’s thesis about language learning and pointed out (Piaget, 1970/1968) that his work, like Chomsky’s, rejected the empiricist–behavioristic view. Piaget reflected that “I find myself opposed to the view of knowledge as a copy, a passive copy, of reality” (1970/1968, p. 15). The empiricists considered logic as a linguistic convention whereas Chomsky saw language as based on innate reason (Piaget, 1970/1968). Cognitive structures, such as language learning ability, were seen by Chomsky as innate to the learner. This early static view of cognitive abilities may be similar to the notion of innate abilities such as “scientific ability.”

Piaget’s view of science learning, however, emerged as a much more dynamic entity. The alternative view of Piaget, which did much to promote the cognitive–psychological and constructivist paradigms and the emerging philosophy of science, is articulated by Piaget in the following excerpt:

The current state of knowledge is a moment in history, changing just as rapidly as the state of knowledge in the past has ever changed and, in many instances, more rapidly. Scientific thought, then, is not momentary, it is not a static instance; it is a process. More specifically, it is a process of continual construction and reorganisation. This is true in almost every branch of scientific investigation. (1970/1968, p. 2)

It is evident that Piaget and other members of the new cognitive, constructivist, and philosophical paradigms saw the similarity between historical knowledge developments and the knowledge structures of the mind as no accident (Duschl, Hamilton, & Grandy, 1990; Piaget, 1970/1968). The historical development of scientific knowledge was postulated to hold valuable information as to how knowledge developed in the individual. In this sense, the historical correspondence thesis is not unreasonable, seeing that both the science student and the scientist use dynamic cognitive processes to assimilate information about the world. Although the scientist is a highly exceptional and gifted individual, she or he employs cognitive processes similar to everyone else, according to cognitive psychology.

Piaget’s views of learning had a profound influence on Kuhn. Kuhn states, “Part of what I know about how to ask questions of dead scientists has been learned by examining Piaget’s interrogations of living children” (Kuhn, 1977, p. 21). One of Kuhn’s major contributions was to challenge the separation of philosophy and psychology (Giere, 1992). Two notions on which the philosophy of science and cognitive psychology came to agree, during Kuhn’s time, were theory ladenness and the importance of context. The philosophy of science, especially in Kuhn’s formulation, saw all experimental observation as being theory-laden. According to Kuhn’s thesis, the paradigm determines how a community of scientists will see the world, what questions they will find interesting, and what kinds of solutions to problems are possible. The understandings of a paradigm dictate the interpretation given to observations and even what kinds of observations can be made (Kuhn, 1962/1996). Hanson stated as early as 1958 that “a theory is not pieced together from observed phenomena; it is rather what makes it possible to observe phenomena as being of a certain sort, and as related to other phenomena” (1958, p. 90). More recently, the philosopher of science Churchland (1992) writes that theory ladenness is natural to all cognitive activity.


A moderate interpretation of theory ladenness is that all observation must take place from a particular conceptual perspective in order for meaningful interpretation to take place. In this form, theory ladenness is a fairly unproblematic assumption and has important consequences for student work in science and its assessment.

New Directions in Learning Theory

From the 1960s to the 1980s, a number of new developments in learning theory and instructional design began to have an impact on education. The popularity of Piaget’s writings saw a dramatic resurgence. Ausubel, who was initially influenced by the work of Piaget, developed a significant approach of his own. Educational psychologist Gagne, who had been raised in the behaviorist tradition, began to move in a different direction. The development of the digital computer, which made possible an alternative theory of mind, was a major influence on Gagne’s information-processing model of learning. And last, the works of Vygotsky were finally translated into English, beginning in the 1960s, and had a profound impact on thinking in North America.

The Resurgent Influence of Piaget. In the 1960s, Piaget’s position on learning became both prominent and popular (Novak, 1977). Piaget’s theoretical position was not a “learning theory” per se, but a developmental theory. He had been developing his cognitivist approach for decades, unaffected by the onset of behaviorism. The popular acceptance of Piaget’s views in the science and mathematics education communities especially encouraged two developments in curriculum design and teaching methodology. First, Piaget’s stage theory was widely used to guide curriculum development and teaching practice. The acceptance of stage theory, and thereby of the notion that children below the teenage years are not capable of abstract formal reasoning, resulted in limiting the concepts (and their difficulty) that were considered appropriate for the earlier years. Piaget’s “stages” stemmed from his view that the human mind possesses cognitive structures as a biological property, and it is these structures that develop during maturation. Learning then occurs through the processes of assimilation and accommodation of new information to the existing mental structures. Piaget also advocated a discovery approach to learning, maintaining that “each time one prematurely teaches a child something he could have discovered for himself the child is kept from inventing it and, consequently, from understanding it completely” (Piaget, 1970, p. 715). Discovery learning became a popular methodology for a time. However, researchers began to question the assumptions behind stage theory and discovery learning since new research findings did not support some aspects of these approaches (e.g., see Gennovese, 2003; Klahr & Nigam, 2004; Novak, 1977; Strauss, 1972).

Ausubel’s Alternative. Ausubel diverged from Piaget in focusing his efforts on effective teaching strategies and what he called “meaningful learning” rather than on a general theory of cognitive development (Ausubel, Novak, & Hanesian, 1978). Meaningful learning was envisioned as reception learning designed to relate strongly to already existing knowledge. Thus, the logical organization and hierarchical structure of taught knowledge was emphasized. Ausubel maintained that learning became more stable (i.e., more stably integrated with long-term memory structures) if it were linked in a nonarbitrary substantive fashion to existing knowledge structures of the brain (Ausubel et al., 1978, p. 27). As far as Piaget was concerned, learning was subservient to cognitive growth, but Ausubel saw cognitive growth as subservient to learning, and this was a fundamental shift away from stage theory. Ausubel is probably best known for the importance he placed on the role of prior knowledge in learning, maintaining that “[t]he most important single factor influencing learning is what the learner already knows. Ascertain this and teach him accordingly” (1978, p. 163). The importance of prior knowledge, now known as preconceptions, has since become a major ingredient of constructivist learning theory.

Gagne’s Influence. Both Ausubel and Gagne were concerned with school learning as opposed to Piaget’s more basic focus on the development of children’s thinking and reasoning capabilities. Gagne’s educational ideas demonstrated a strong behavioristic influence. Regarding learning, he believed that children must learn simpler capabilities before more complex ones and that this process must proceed in an additive orderly progression (Strauss, 1972). Regarding curriculum, Gagne wrote that curriculum is “a sequence of content units arranged in such a way that the learning of each unit may be accomplished as a single act” (Gagne, 1977, p. 23). These are clearly views that belong in the behavioristic tradition. However, he gradually developed a cognitive theory that relied on an information-processing model of learning. He believed that “[cognitive] theories propose that stimulation encountered by the learner is transformed, or processed, in a number of ways by internal structures during the period in which the changes identified as learning take place” (Gagne, 1977, p. 14). The stages of information transduction, processing, storage, and retrieval, as he saw them, could be reflected in sound instructional strategies that attempted to mirror, step by step, the learning process. These instructional strategies were expressed in his nine “events of instruction” (Gagne, Briggs, & Wager, 1992).

Vygotsky and Education. Vygotsky worked in relative obscurity during his lifetime, and his works were not translated into English until about 30 years after his death. Although some of Vygotsky’s views on learning have much in common with behaviorism, he is also considered a contributor to constructivist thought (DeVries, 2000). He was the primary originator of social constructivism, a theory of learning that assumes the importance of social factors in knowledge acquisition (Matthews, 1994). In his view, the learner actively constructs concepts as a result of social interaction. The student’s (or child’s) potential for cognitive growth is bounded, on the one hand, by what he or she is able to accomplish on his or her own and, on the other hand, by what he or she is able to accomplish with the help of a more knowledgeable individual. This range of learning ability is known as the zone of proximal development. Vygotsky defined the zone of proximal development as “the distance between the actual developmental level as determined by independent problem solving and the level of potential development as determined through problem solving under adult guidance or in collaboration with more capable peers” (Vygotsky, 1935/1978, p. 86). Aspects of his work have been used to justify various educational approaches, including cooperative learning (Doolittle, 1997) and portfolio assessment (Wineburg, 1997).

Contributions to Assessment Design. Piaget, Ausubel, Gagne, and Vygotsky had widely varying influences upon the development of assessment. Piaget saw the child as being an active learner engaged in the discovery process. To assess and analyze what the child learned, Piaget developed the clinical interview in which children performed simple tasks after which they were required to answer probing questions. The clinical interview method has informed qualitative research in education but is not easily applicable to classroom assessment with large numbers of students. Gagne’s approach, on the other hand, encouraged frequent testing due to the decomposable nature of learning in his model. Ausubel followed a completely different track in his cognitivist view of learning that eventually saw fulfillment, mainly in the work of Novak, in the concept-mapping technique of knowledge representation and assessment. Vygotsky did not conceive of learning as the activity of an isolated individual and so the application of his views tended to turn the notion of psychometric validity on its head (Wineburg, 1997). According to Vygotsky, the help of instruments and persons alike was as necessary for understanding as for successful performance of most activities. Consequently, separating knowledge from the support used in acquiring it was as inconceivable in the assessment of that knowledge as it had been in its acquisition. The views of Ausubel and Vygotsky are discussed further in their application to assessment in Part 2.

Assessment and Context

The role of context in the learning process, especially as it has been influenced by the learning psychology of Ausubel and promoted by constructivism, has profound implications for assessment. The essence of constructivist philosophy is expressed by von Glasersfeld when he writes: “Knowledge is the result of an individual subject’s constructive activity, not a commodity that somehow resides outside the knower and can be conveyed or instilled by diligent perception or linguistic communication” (1990, p. 37). Learning is, in this view, a sense-making activity by the learner whereby she or he tries to accommodate new information to existing mental structures. New information is always connected to similar information where conceptual overlap or context is the dominant factor. By this path of reasoning, one arrives at the importance of context to the learning process, since information cannot exist in isolation in long-term memory, and even in the reasoning process there are constant attempts to make connections among concepts.

In light of the preceding discussion, it is reasonable to view the process of endeavoring to learn as an attempt to find appropriate and desirable contexts into which to fit new knowledge. Rogoff (1984) describes context as “the integral aspect of cognitive events,” so that cognition and context are inseparable. The current view of cognitive psychology about the importance of context is strongly expressed by Tweney (1992) when he writes that “cognition is contextually dependent and must be described in that context before it is understood at all.” But what does the word context convey in the science-educational setting? Baker, O’Neil, and Linn explain two different usages of the word “context”:

. . . [T]he term context has different and somewhat conflicting meanings. Some proponents use context to denote domain specificity. Performance in this context would presumably show deep expertise. On the other hand, context has been used to signal tasks with authenticity for the learner. The adjective authentic is used to denote. . . tasks that contain true-to-life problems or that embed. . . skills in applied contexts. (1994, p. 335)

Baker et al. see the knowledge-centered and the activity-centered contexts as somehow incompatible, but, in the constructivist view, knowledge development proceeds as an activity of the learner. Hence, the argument can be made that the two meanings of context are not contradictory, but rather complementary (Koul & Dana, 1997; Rogoff, 1984). This is also borne out in research reported by Ebenezer and Gaskell (1995), who draw on the insight of Marton (1981) when they write that

. . . we often find variation in conceptions not only between children but also within the same individual. Depending on the context, children may exhibit qualitatively different conceptions of the very same phenomenon. Thus meanings are context dependent. Conceptions are, therefore, not characteristics of an individual; rather they are characteristics of the relations between an individual, content, and context. Learning is both context and content dependent. (Ebenezer & Gaskell, 1995, p. 2)

Baker, O’Neil, and Linn’s dual meanings of context—domain specific and authentic—correlate with Ebenezer and Gaskell’s “content” and “context.” The domain-specific context relates to disciplinary knowledge that the learner wishes to acquire, and the “true-to-life” context relates to the learner’s use of practical abilities in the process of acquiring knowledge or in applying this knowledge.

The importance of context for assessment can be viewed from various perspectives. If knowledge is tested in a decontextualized fashion, it is not necessarily clear what is being tested. Cognitive psychology has shown that knowledge is embedded in complex cognitive networks. It is possible that multiple representations of a concept exist in different contexts or mental networks (Bloom, 1992); for example, a scientific concept may exist both in its outside-of-school experiential form and in its school-science form (Gunstone, Gray, & Searle, 1992; Villani, 1992). Testing for a particular piece of knowledge in a decontextualized manner will not tell the assessor to what degree this knowledge has been integrated with long-term memory structures. The same assessment task performed some time later might produce a different result for the same student. The value of knowledge gained through cramming to produce a correct answer to a test question is questionable since that knowledge will be forgotten soon thereafter (Berenson & Carter, 1995). By taking into account various contexts in assessment, one produces a clearer picture of how the knowledge has been integrated and whether this knowledge can be used productively by the learner.

Another way in which context is applied in assessment is in students’ demonstration of their ability to use or apply knowledge. In the application of knowledge, the presence of context is natural. At issue is what kind of context is a relevant or desirable one. Cognitive research shows that the ability to apply knowledge is fairly domain specific (Chi, Glaser, & Farr, 1998), meaning that the cognitive structures relating to the performance of a particular task in a particular setting do not generalize well to the performance of a similar task in another setting. The selection of a context for a task is important since the assessment may not be relevant to a wide range of ability. While the presence of context appears to be the strength of the emerging forms of assessment, the lack of generalizability may turn out to be its weakness.

Contextualization: A Framework for Emerging Assessment Practices

Since about 1990, various terms have been used in referring to the emerging forms and approaches to assessment. Some use the generic term “alternative assessment” to refer to the emerging forms (Berenson & Carter, 1995; Naizer, 1997; Slater, Ryan, & Samson, 1997). Others use the terms “authentic assessment” and “performance assessment” interchangeably (Crotty, 1994; Willson, 1991). Still others ascribe “authentic assessment” to all the emerging assessment forms, in agreement with the approach of Leon and Elias (1998). “Performance assessment” is a type of “authentic” assessment that utilizes practical (usually hands-on) student tasks in a suitable context. To add a greater degree of precision to the terminology and remove some of the confusion, I propose that the term “contextual” be applied to the various forms of emerging assessment practices. As has already been established above, the shift in emphasis from decontextualization to contextualization of knowledge in assessment has been a central feature in the theoretical and practical issues of assessment in the past two decades. One can, therefore, hypothesize that contextualization will serve as a useful conceptual framework for currently emerging assessment practices.


The assessment literature contains a generous number of recommendations as to the desired characteristics of assessment. One dominant theme is that contextualized assessment should reflect “real-life” (i.e., outside of the classroom) tasks2 and require students to utilize higher order thinking skills (Crotty, 1994; Leon & Elias, 1998). Bell and Cowie (2001) list three general features of the emerging assessments: (a) they assess a wider range of learning outcomes, (b) they use a wider range of types of assessment tasks, and (c) they assess in more authentic contexts. In addition, Wiggins (1990) describes “authentic” assessment as (d) requiring intellectually worthy tasks, (e) mirroring best instructional activities, and (f) consisting of ill-structured challenges that are similar to the complex ambiguities of life. These approaches pay more attention to learning psychology, to tasks that have a practical and applied nature, and to the best efforts that students can produce when allowed to use various resources.

Recommendations like those listed are part of a “culture of advocacy” that has developed around contextual assessment approaches. When advanced on the basis of advocacy, contextual assessment forms tend to acquire good face validity. Face validity is normally defined as the apparent validity of an assessment, as distinct from its demonstrated validity. Apparent or face validity is established, for example, by the degree of similarity between the test questions and the outcomes or constructs that are to be measured. The reaction to a test by nonexperts may also contribute to its face validity. The danger of emphasizing face validity is that contextualized assessments may be viewed as valid on this basis alone, rather than on the basis of appropriate research results. A better approach would first seek to establish philosophical, theoretical, and research bases for these types of assessment.

Ways in Which Context Can Be Expressed in Assessment. The currently emerging forms of assessment, characteristics of which have been listed above, make explicit use of context in a variety of ways. Three major categories of context use can be identified. First, Ausubel introduced a new way of looking at context, paying explicit attention to how concepts are psychologically related, which I will call the cognitive context. Another way of looking at context is from the perspective of the practical application of concepts, which I will call the practical context. A third way of approaching context is from the perspective of the classroom, brimming with activity, collaboration, and accomplishment, which I will call the classroom context. These three categories of contextual assessment correspond to the following major assessment practices: (a) assessment with concept maps, (b) practical or performance assessment, and (c) holistic assessment of representative samples of students’ work, or portfolio assessment.

PART 2: CRITICAL ISSUES AND PUBLIC POLICY IN ASSESSMENT IN SCIENCE EDUCATION

Given that assessment theorists and practitioners have broadened the scope of assessment by increasing the degree of contextualization, how is one to approach a critical scrutiny of this wide and diverse field? It is a fundamental necessity first to consider the philosophical underpinnings of any field of research and scholarship. Michell argues that “[i]f the methods of science are not sanctioned philosophically then the claim that science is intellectually superior to opinion, superstition and mythology is not sustained” (1997, pp. 355–356). The methods of assessment in science education are not exempt from such scrutiny. Today, a number of critics within the contextual or “authentic” assessment movement are calling for just such a reappraisal. Gunzenhauser claims that there is “a lack of reflective, engaged dialogue among educators and schools communities” (2003, p. 51), which results from uncritical acceptance of current assessment policy. In line with the call for closer philosophical scrutiny of assessment theory and practice, Chudowsky and Pellegrino call for the rethinking of “some of the fundamental assumptions, values, and beliefs that currently drive large-scale assessment practices” (2003, p. 75). Not only should assessment experts be knowledgeable about the philosophical warrants of their conclusions, but they should also be aware of the extent to which developments in science education, including assessment, are driven by the larger developments in the philosophy of science and cognitive theory.

2 “Real-life” tasks test the transferability of learned knowledge to unfamiliar situations and for students may represent a challenge that they find most difficult to handle.

Assessment in the Cognitive Context

As was already pointed out in Part 1, the work of Ausubel contributed to the development of contextual teaching and assessment methodologies. From the vantage point of Ausubel, both instruction and assessment should be strongly influenced by the manner in which cognitive learning takes place (Ausubel et al., 1978). Ausubel’s learning psychology was adopted by Novak in his research program on children’s learning of science. The concept map was a central feature of Novak’s application of Ausubel’s ideas (Novak, 1998).

The Nature of Concept Map Assessment. Concept maps are hierarchical, graphic organizers used to represent the relationships among concepts. The relationships are represented by a network of nodes connected by labeled lines that may or may not be directional. The network consists of linked “propositions,” each formed from two or more concept nodes; for example, the concepts “charge” and “current” might be linked as follows to form a proposition: “current consists of moving charge.” Although it might appear on the surface that concept maps are networks of logical propositions, they are not primarily so. Wandersee observes that “concept maps reflect the psychological structure of knowledge” (1990, p. 923). That is to say, concept maps are intended to represent the cognitive networks that have been constructed by students in the process of learning.

Concept maps have been used for a variety of purposes in science education, including assessment. When used in assessment, the concept map technique must consist of the “combination of a task, a response format and a scoring system” (Ruiz-Primo & Shavelson, 1996a, p. 573). A large number of different response formats and scoring systems have been devised. At one end of the spectrum, the response format may consist of free-form, hand-written nodes with links on a large sheet of paper. At the other end, the response format may consist of fill-in-the-blanks maps where some link-phrases or concepts may be selected from a list. Scoring systems vary from a count of the number of valid propositions to a comparison of the concept map with one constructed by an expert.
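As an illustration of the simplest of these scoring systems, the sketch below (not taken from the article; the master propositions and the function name are invented for the example) counts how many of a student's propositions match an instructor's list of acceptable propositions.

def proposition_score(student_propositions, master_propositions):
    """Count student (concept, link phrase, concept) triples that appear in the master list."""
    normalize = lambda triple: tuple(part.strip().lower() for part in triple)
    master = {normalize(p) for p in master_propositions}
    return sum(1 for p in student_propositions if normalize(p) in master)

# Hypothetical example: one of the two student propositions matches the master list.
master = [("current", "consists of moving", "charge"),
          ("current", "is measured in", "amperes")]
student = [("Current", "consists of moving", "charge"),
           ("current", "flows through", "wire")]
print(proposition_score(student, master))  # prints 1

Even a scheme this simple requires choices (how to normalize wording, whether to credit partially correct links) that different scoring systems resolve differently, which bears on the standardization problems discussed later.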

A Basis for Concept Map Assessment. Ausubel made a significant distinction between what he called “meaningful reception learning” (Ausubel et al., 1978, p. 117) and rote learning. The distinction is important because rote learning, having weak conceptual overlap with existing memory structures, is soon forgotten (Mayer, 2002; Novak, 1991). Information learned by rote must be reinforced repeatedly to be maintained. On the other hand, meaningful learning is characterized by a high degree of integration of existing memory structures (prior knowledge) with new knowledge. Strong integration results in meaningful learning that is recalled substantially at much later times than is the case for rote learning.


Not only must material being taught be designed so as to relate to existing knowledge, but teaching sequences must also be designed so as to build appropriate prior knowledge for what is to follow. Ausubel stressed that meaningfulness is not inherent in the material itself, but in the way it relates to prior learning. Also, meaningful learning is a choice made by each learner, since most concepts may be learned in either a rote or meaningful fashion. One way of making certain that the student engages in meaningful interaction with the material is to design the assessment strategies so that they identify meaningfully integrated learning and discriminate against information learned by rote. It is this last issue that is a primary motivator for using concept map assessment. The act of constructing a concept map explicitly elicits propositions and discriminates against recalling concepts learned by rote.

An Example of Concept Map Assessment. Zeilik and colleagues report on a large-scale effort to reform astronomy instruction in which concept maps play a role in both instruction and assessment (Zeilik, Bisard, & Lee, 2002; Zeilik et al., 1997). The basis of their instructional model is the principle that “knowledge must be organized and interconnected in learners’ cognitive structures to be accessible from long-term memory and so useful for problem-solving” (Zeilik et al., 1997, p. 988). This cognitive principle was realized in the course design and implementation by the liberal use of concept maps for course planning, instruction, and assessment. To check for strongly integrated conceptual understanding, students were pre- and posttested with validated multiple-choice and concept map tests. The approach has now been tested both in the original institution and in a second institution. The results consistently indicate a large improvement in conceptual understanding (i.e., a large decrease in misconceptions) as indicated by both multiple-choice and concept map tests. To understand the significance of the result, the reader should know that courses taught in the usual didactic lecture style normally show an improvement score of approximately 0 (Zeilik et al., 1997). The type of concept map used was influenced by the fact that the course had an enrollment of at least 130 students in the smallest class, so that free-form concept maps were considered too labor intensive to mark. Instead, a fill-in-the-blanks concept map was designed in which some nodes were left blank and a list of possible words to place in the blanks was included. The concept map could thus be answered by filling in a computer scoring sheet. Both the concept map test and the multiple-choice test were found to have high reliability. The concept map test correlated moderately (0.3) with the diagnostic test, leading to the conclusion that the concept map probed domains of knowledge that overlapped with the diagnostic test (Zeilik et al., 2002). The concept map test correlated more strongly with the final mark in the course (about 0.5). The reform effort of Zeilik and colleagues is unusual in testing for transferability across educational institutions, a feature that was strongly demonstrated by the results. The achieved reliability and significant student improvement indicate that the method is worthy of emulation and possible further refinement.

Some Research Findings About Concept Map Assessment. While cases of reliable concept map assessment can be found, as in the one described above, there are also reasons given in the literature for exercising caution. Used in their original free form, concept maps are highly idiosyncratic (Edmondson, 2000; Regis, Albertazzi, & Roletto, 1996) since they represent one individual’s conceptual structure. Scoring highly individualistic concept maps on the basis of comparison to expert maps makes the reliability of the scoring suspect. The lack of standardization in map-scoring schemes raises the question of their validity and reliability (Edmondson, 2000; Ruiz-Primo & Shavelson, 1996a). There is also reason to suspect that different scoring techniques measure different aspects of conceptual organization or understanding (Rice, Ryan, & Samson, 1998; Ruiz-Primo, Schultz, Li, & Shavelson, 2001), although McClure, Sonak, and Suen (1999) found significant correlation among six different scoring techniques.


The question of what underlying trait each combination of concept map form and scoring method measures still eludes researchers (Ruiz-Primo et al., 2001). In addition to the issues with scoring, it has been found that becoming proficient at constructing concept maps requires a significant mental effort, so that unless concept maps are already a part of normal instruction, their sole use in assessment can introduce a significant time factor for training students in their use (Novak, 1991). Despite the challenges in the validation of concept map assessment, significant effects have been exhibited with its appropriate classroom use.

Assessment in the Practical Context

There is a purported Chinese proverb that states, “I hear and I forget, I see and I remember, I do and I understand.” Advocates of performance assessment would agree with the proverb and likely add, “I show I can do and I demonstrate understanding.” Assessment of “performances” in the practical context was proposed by assessment experts as a solution to the decontextualized nature of conventional testing and as a way of testing thinking skills of higher order than those tested by conventional methods. The nature of performance assessment requires that the student demonstrate science process skills and knowledge in a practical, that is, hands-on or applied, setting.

The Nature of Performance Assessment. Typical science performance assessment “provides students with laboratory equipment, poses a problem, and allows students to use these resources to generate a solution” (Ruiz-Primo & Shavelson, 1996b). Baker et al. (1994) identify the attributes of performance assessment: (a) the use of open-ended tasks, (b) a focus on higher order or complex skills, (c) the employment of context-sensitive strategies, (d) frequent use of complex problems requiring several types of performance and significant student time, (e) either individual or group performance, and (f) a significant degree of student choice. Students are expected to make critical comments on the results, make connections to their prior knowledge, draw conclusions, and recommend further experiments to resolve problems not solved in the experiment (Willson, 1991).

As a “new” type of assessment, performance assessment differs little from well-designed science laboratory experiments and laboratory examinations that have been used for decades by competent science teachers from grade-school level to university level; however, the tasks of performance assessment are meant to present the student with an unfamiliar situation so that generating the answer is not primarily a matter of factual or procedural recall. Because it will affect their validity, systemwide assessments (by definition summative) should not use assessment tasks that have already been used in the classroom, for such an assessment would no longer be measuring higher order skills for some students. Although student laboratory work is an established technique, the use of performances as summative assessment is new. In addition, assessment experts have proposed the use of performance assessments for statewide assessment of science learning, and, in many places, this has been done (Aschbacher, 1991; Erickson & Meyer, 1998; Ferrara, Huynh, & Baghi, 1997; Lomask, Baron, & Greig, 1998).

A Basis for Performance Assessment. Performance assessment is a direct outcome of the changes in education outlined in Part 1. Khattri, Reeve, and Kane (1998) point out that the current push toward using performances is partly due to “the reaction on the part of educators against pressures for accountability based on multiple-choice, norm-referenced testing, the development in the cognitive sciences of a constructivist model of learning” (p. 2).


The constructivist reorientation of science education has brought about a greater emphasis on inquiry-based learning. From the perspective that student learning is largely self-constructed and experiential, it is a logical and straightforward step to promote activity-based assessment (Shymansky et al., 1997). The reader is referred back to Part 1 for a complete discussion of the overriding changes that are the basis for performance assessment, as well as other forms of contextual assessment.

An Example of Performance Assessment. In 1992, Erickson et al. reported on a large-scale performance assessment in science education, done in British Columbia in 1991, that for the first time departed significantly from decontextualized assessment forms. More recently, Erickson and Meyer (1998), retrospectively more critical, reported again on the original 1991 performance assessment. The performance assessment was one of four components of the overall assessment of science learning that took place for grades 4, 7, and 10; it consisted of two individual investigations and six experimental station tasks, performed in 7 min each. Some of the station tasks were given to all three grade groups.

One of the station tasks given to all three grades was the “Sound Board.” It consisted of a board with four guitar strings of varying thickness stretched over it. Three questions were asked of the students:

1. Look at the Sound Board. What do you see that is different about each of the strings?
2. Hit each string with your pencil, and listen. What differences do you hear?
3. Why do you think these strings sound different? (Erickson et al., 1992, p. 20)

Teachers did the scoring of the tasks on a standard scoring form that called for assessment of the students’ ability to observe accurately and the ability to explain well. Erickson and Meyer (1998) reported that it was obvious that students drew on their prior out-of-school experience with stringed instruments in describing the differences observed and in attributing the differences in pitch to the tension in the strings—a response that was not anticipated. The task had been designed with the expectation that the differences in pitch would be attributed to the observed differences in the thickness of the strings. Contrary to expectations, there were more similarities than differences among responses from the three different grade levels. The only significant difference was the level of vocabulary used in expressing the observations, with higher grade students using more scientific vocabulary and grade 4 students having difficulty expressing their findings.

To attempt to interpret the results, Erickson and Meyer (1998) postulated the existence of three cognitive resources used by the students in performing the various tasks: (a) science-specific general cognitive abilities, (b) school-related science knowledge, and (c) everyday science content knowledge. Of these, the existence of science-specific general cognitive abilities is controversial, as there is an ongoing debate as to the existence of science-specific skills (Erickson & Meyer, 1998; Selley, 1989). From the study, Erickson and Meyer (1998) drew two conclusions: first, it is not possible to separate out unique skills from the task assessment, and it is, therefore, not possible to claim to be measuring any of the conventional unitary traits, such as “science process skills”; and second, it is not possible to determine whether responses are based on school-acquired knowledge or out-of-school experiences. The Erickson and Meyer study, owing to the thoroughness and depth of its analysis, is a valuable resource for evaluating performance assessment.

Some Research Findings About Performance Assessment. As stated before, performance assessment is new in its summative, systemwide role. Most of what has been published about recent performance assessment experiences is unfavorable in its evaluation.


Erickson and Meyer state that their evaluation of the performance assessment experience is “certainly more circumspect than it was several years ago” (1998, p. 860). Webb, Schlackman, and Sugrue (2000) and Stecher et al. (2000) report that performance assessments are highly volatile and affected significantly by task and occasion. Lissitz (1997) observes that the generalizability of performance assessments is highly limited, that the results are extremely difficult to interpret on account of the confounding effect of multiple interfering factors, and that the costs are great. He also argues that the very nature of performance assessment works against itself, because requiring the test to be highly contextualized will decrease its capacity to be abstract. The issue that he has raised—the desirability of abstract reasoning as opposed to contextualized performance—is an important one that must be debated by performance assessment researchers and practitioners. Lomask et al. (1998), in reporting on the large-scale performance assessment in Connecticut, concluded that the results were not comparable enough to be used as a stable measure of students’ progress from one year to the next. Shavelson, Baxter, and Pine (1991) observed that the political pressure for implementation of performance assessment was “way ahead of the development research and technology” (p. 357). On the basis of their research, Ruiz-Primo and Shavelson (1996b) maintain that from 8 to 23 performance tasks are required to assess an individual student’s competency in a single subject area to a reasonable level of reliability. If the results were evaluated only at the group level, fewer tasks would be required, making the assessment process less costly. The danger of lowering the reliability in such a way is that schools might use the results to evaluate students anyway, even though the results are not reliable at the individual level.
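The scale of the task requirement can be illustrated with the classical Spearman–Brown relation between single-task reliability and the reliability of a score averaged over k tasks (an illustrative calculation only: Ruiz-Primo and Shavelson's figures come from generalizability analyses, and the single-task value of 0.30 used here is an assumed number, not one they report):

\[
\rho_k = \frac{k\,\rho_1}{1+(k-1)\,\rho_1}, \qquad \rho_1 = 0.30,\; k = 10 \;\Rightarrow\; \rho_{10} = \frac{10(0.30)}{1+9(0.30)} \approx 0.81 .
\]

On this reading, it is the low consistency of individual performance tasks that pushes the required number of tasks into the range reported above.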

The existing research findings and developments in performance assessment have fallen far short of its claims and promise and do not address the perceived shortcomings of decontextualized assessment or the objectives of new learning theories, such as constructivism. The results of investigations of performance assessment have, so far, been primarily negative. Many of the reports have come from investigators who are operating under the umbrella of state assessment reform, where hundreds of millions of dollars are being invested in performance assessments (Gilman, 1997). The fact that researchers report negative findings on performance assessment, despite indirect pressure to do otherwise on account of the huge public investment in these experimental assessment methodologies, adds credence to the thesis that there are serious problems with the current implementation of performance assessment.

Assessment in the Classroom Context

Student work in the science classroom, other than taking tests, can be creative and flexible, often tapping the strongest abilities of the individual. Time limits and performance pressure play no role when students create products in classroom activities, and it could be argued that such products, not having been produced under pressure, better represent students’ various abilities. Students will do well on some tasks and not so well on others.

The Nature of Portfolio Assessment. The portfolio has emerged as a popularly advocated method for students to exhibit their classroom work and demonstrate their understanding, progress, achievements, and attitude in a particular subject area or across the curriculum (Berenson & Carter, 1995). The method of using portfolios in instruction and assessment is not highly prescribed in the literature and allows for flexibility in how the teacher and the student approach it (S. Black, 1993). Portfolios have emerged partly as a reaction to the perceived shortcomings of decontextualized assessment methods and partly as a means to embrace new views in assessment, such as the desirability of integrating assessment and instruction.


Since portfolios include many different examples of student work, they presumably sample student ability over a wider domain and in more instances, thereby providing a more reliable measure of performance.

A Basis for Portfolio Assessment. A theoretical basis for portfolio use in assessment can be found in the work of Vygotsky. Wineburg argues that the notion behind portfolio assessment stems from the concept of “authenticity” as it arose from the writings of Vygotsky (Wineburg, 1997). On the basis of Vygotsky’s work, Wineburg argues that “human beings draw heavily on the specific features of their environment to structure and support mental activity. In other words, understanding how people think requires serious attention to the context in which their thought occurs” (1997, p. 61). Wineburg is promoting a notion different from the common meaning of authenticity in science education as, for instance, discussed by Rahm, Miller, Hartley, and Moore (2003). In the model of Rahm et al., authenticity is the dynamically evolving negotiation among students, teachers, and scientists of how science activities should be structured. In Wineburg’s model, student work is “authentic” when it is done in the way work is produced in the adult world outside of school, that is to say, together with the assistance of others and the aid of various tools. In other words, the process and product of classroom science are inseparable. So, the portfolio cannot include work that is done in isolation from references, materials, tools, and the help of others. Complying with the Vygotsky–Wineburg model of portfolios requires a certain perspective on how classroom work is structured. Written tests, for example, cannot be a part of such assessment. Science fair project reports, on the other hand, would qualify. The basis of portfolio assessment is, thus, the sampling of student work that has been produced in the classroom context, undisturbed by the artificial environment produced by tests and examinations.

An Example of Portfolio Assessment. An investigation of the use of portfolio assessment in a college introductory noncalculus physics course was performed by Slater et al. (1997) at a large urban community college. The study was set up so that conventional teaching methods were maintained and assessment results could be compared between decontextualized and portfolio assessments. The course content was organized into 24 learning objectives. Half of the students were assessed in the conventional manner and the other half by means of portfolios. The latter group of students was challenged to demonstrate clearly that they had mastered the material in the 24 learning objectives. These students were allowed to determine what they placed into their portfolios and were given a list of 11 types of portfolio entries that had been used by students in the past. Each portfolio entry was to be accompanied by a statement of self-reflection about the entry. The portfolios were graded three times during the course, at the same times that students who were assessed in a conventional manner wrote tests. The grading scheme used was a four-level holistic scoring guide rating the strength of evidence supplied for having met the course learning objectives.

An analysis of the final grades showed no significant differences between the two groups. As a result of group discussions, though, it was found that students assessed by portfolios felt less anxious about studying physics, devoted considerably more time to studying, and enjoyed the experience more. Gitomer and Duschl (1995) have pointed out that portfolio use is best suited to being integrated with progressive teaching techniques. However, the study described here employed conventional classroom instruction, which is reason to question its generalizability to constructivist-oriented classrooms, which might have yielded different results.


In this particular study, affective and motivational benefits of portfolio use were demonstrated, but the critical benefits of testing for higher level skills or the demonstration of higher reliability were not realized.

Some Research Findings About Portfolio Assessment. The study by Slater, Ryan, and Samson demonstrated that students were able to perform as well with portfolio assessment as with decontextualized assessment. Baume and Yorke (2002) were able to demonstrate satisfactory reliability and validity on a sample of 53 portfolios used in the UK Open University, but they observed that this can be achieved only at the expense of a very large effort. Koretz (1998) assembled evidence from several large-scale portfolio assessments in the United States and concluded that there is not sufficient evidence that the measurement objectives of these projects have been met. Other studies have found that there are no clear relationships between portfolio performance and other measures of performance, although lower performing students do seem to perform better in portfolio assessments (Gearhart & Herman, 1995; Leon & Elias, 1998). In addition, Gearhart and Herman (1995) have raised serious questions about the validity of portfolio assessments on the basis that the work may not always be the students’ own. It may be copied or include a significant amount of parental or teacher assistance. The availability of Internet resources exacerbates this potential problem.

Issues for Policy Consideration in Assessment

Rarely do issues discussed in academic circles generate as much rhetoric as that surrounding assessment. Broadfoot observes that “[a]ssessment activity now shapes the goals, the organization, the delivery and the evaluation of education” (2002, p. 285). Public concerns continue to call for tests of student achievement in order to see if educational standards are being maintained. At the same time, education experts are in agreement about their rejection of the aspects of assessment that they characterize as “traditional.” Wandersee, Mintzes, and Novak stated in the 1994 Handbook of research on science teaching and learning that

[i]t may be that no single concept has had as pernicious an effect on educational improvement as the one called achievement. This was typically defined (operationally) as the difference between pretest and posttest scores, often after a planned intervention of relatively short duration. Thus, a great deal of meaning has been tied to a single score and to an established (but conceptually inadequate) technology (we might say “tyranny”) of testing. (1994, p. 201)

However, the “tyranny” of testing can exist for both decontextualized and contextual assessment methods, and both may raise issues of concern when used for public accountability purposes. Issues that must be considered in evaluating the science education assessment field are (1) a reconsideration of the original criticisms of selected-response testing that brought about the changes toward contextualized assessment; (2) the establishment of validity and reliability through a variety of forms of evidence in the published research, together with various theoretical arguments; (3) the appropriate implementation level of any assessment scheme, ranging from classroom formative assessment to systemwide summative assessment; and (4) the appropriate target unit for any assessment scheme, be it the individual student or the school. At this point there are more questions than answers on these issues.


The Issue of Selected-Response Tests

Since the original criticisms of selected-response testing were launched (see Part 1), a new technology of testing has been developed that, to a significant degree, answers the original objections. These developments have come in two separate areas of psychometrics: first, in the application of learning psychology to test design and, second, in new mathematical models of test results that improve test validity by linking test scores more closely to traits. Hamilton, Nussbaum, and Snow (1997) have correlated selected-response and constructed-response tests with interviews that probe students’ thinking processes during test-taking. Distracters in the selected-response questions were based on research about student preconceptions. They found that such questions can elicit fairly sophisticated understandings of phenomena. On the other hand, constructed-response questions were more problematic, tending more often to elicit confusion or reliance on knowledge gained out of school. Sadler (1998) also used research-based selected-response test questions and has constructed a psychometric model that reliably models the various conceptions that the student holds. Sadler maintains that such tests, together with appropriate analysis, are the preferred method for modeling student conceptions and tracking their development over time. Other psychometric models are being proposed, and Wilson (2003) has developed the construct map method of selecting the most appropriate measurement model. These developments are cause for rethinking the dismissal of selected-response tests as a suitable testing format.

The Issue of Validity and Reliability

Validity and reliability issues in assessment are so wide-ranging and complex that a detailed consideration of them is beyond the scope of this paper. Two issues need to be singled out, however: the question of which underlying traits are being, or can be, measured by tests and the question of whether measurement can be achieved reliably. Before these issues are addressed, the accepted definitions of validity and reliability will be recalled briefly.

Probably the most authoritative definition of validity (P. J. Black, 1993) was given by Messick when he wrote, “Validity is an integrated evaluative judgement of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (1989, p. 13). He stresses that validity is a unitary concept. The many aspects of validity, be they face validity, construct validity, and so on, are just that—aspects of validity, not kinds of validity. The most fundamental aspect of validity is construct validity—whether a test actually measures what it purports to measure. Essential to validity is the consideration of reliability. Reliability is defined as

the degree to which a test consistently measures whatever it measures. The more reliable a test is, the more confidence we can have that the scores obtained from the administration of the test are essentially the same scores that would be obtained if the test were re-administered. (Gay, 1996, p. 145)

It is accepted as a psychometric measurement principle that without reliability there can be no validity (Moss, 1994). The principles of validity and reliability as they have been applied in conventional assessments cannot simply be imposed upon contextual assessment, because the basic underlying presuppositions are different for each of the two paradigms of assessment.


The latter promotes personal knowledge of the student’s performance and potential on the part of the test reader, whereas the former opposes this diametrically (Moss, 1994). In contextual assessment, it is seen as an advantage that the assessment reader knows the student well, in order to make a better judgment of the student’s performance. In the conventional view of reliability, the absence of any form of bias, including knowledge about the student, is paramount.

As indicated in the separate sections above, research on the three contextual assessment methodologies being considered is disappointing in its findings about validity and reliability. A possible explanation for this state of affairs with regard to validity lies in uncertainties about, or inconsistencies in, what underlying traits each technique should measure. In other words, issues of construct validity exist. The findings about reliability seem to relate to difficulty in the application or scoring of the assessment methods. These tentative explanations for the difficulties in establishing validity and reliability are delineated below for the specific cases of the three assessment methodologies.

The Validity and Reliability of Concept Map Assessment. The fundamental purpose of the concept map is to facilitate and assess meaningful learning as originally conceptualized by Ausubel and Novak. The concept map was designed to operationalize the concept of meaningful learning in terms of the strength of conceptual integration. A well-designed concept map test should be able to discriminate between knowledge that has been learned by rote memorization and knowledge that has become integrated with related conceptions and preconceptions. As was noted above, research-based selected-response tests are able to measure conceptual knowledge in a validated and reliable fashion. The only way of checking independently for the correctness of the assumptions behind concept maps would be to reassess conceptual understanding, possibly by both methods, at intervals where rote learning is known to have been forgotten. Surprisingly, this most fundamental factor has not been tested to a significant degree in concept map assessment research.

Even though concept maps, both in instruction and in assessment, are able to demonstrate strong effects, it is not certain how to interpret the results. Also, because of a lack of standardization of scoring methods, reliability is difficult to achieve. In order for concept map assessment to gain legitimacy for large-scale usage, these issues must first be overcome.

The Validity and Reliability of Portfolio Assessment. Portfolio assessment was introduced largely as a reaction against decontextualized assessment methods. In retrospect, it is possible to see its theoretical basis, as has been demonstrated in this paper. Basically, the portfolio integrates the process and product of student classroom work and scores student performance. Scoring typically attempts to measure student actions through their products. Typical objectives for student actions begin with the verbs “identify,” “describe,” “explain,” “apply,” or “solve” (Slater et al., 1997); yet, it is not necessarily clear what types of mental states or underlying abilities are being demonstrated by these types of performances. Such student actions and products may be produced just as easily by rote or algorithmic learning as by complex conceptual understanding. Educators will need to decide whether they are, in fact, seeking evidence for conceptual understanding. If that is the case, then properly designed, research-based, selected-response tests would be a more appropriate method of assessment, with a less labor-intensive scoring process. As it stands, the issue in portfolio assessment may be one of construct validity, or deciding what should, ultimately, be measured.

As in the case of concept map assessment, it has been challenging to demonstrate reliability for portfolio assessment. Because a significant amount of training of test scorers by experts is required to produce reliable scoring, it is not clear if reliability, once demonstrated, is sustainable.

The Validity and Reliability of Performance Assessment.


Performance assessment used and adapted laboratory exercises so that students would be able to demonstrate the application of knowledge and scientific process skills. Of the three contextual assessment methods that have been considered in this paper, performance assessment has generated the most disappointing research results in terms of reliability and validity. A synopsis of research findings, included here, reveals the seriousness of the situation in this area.

Studies of performance assessment in science so far cited show that it is not possible to isolate unitary traits (Erickson & Meyer, 1998), that it is difficult to obtain reliable scoring on learning objectives (Klein et al., 1998; Webb et al., 2000), that it is difficult to create reliable tasks (Brookhart, 2001; Shymansky et al., 1997; Stecher et al., 2000), that it is not clear whether out-of-school learning or school learning is being assessed (Erickson & Meyer, 1998), and that it is not clear whether performance assessments promote higher order thinking at all (Ruiz-Primo & Shavelson, 1996b).

As in previous cases, there is cause to suspect a lack of construct validity. Erickson and Meyer (1998) point out that students can draw on school learning, out-of-school learning, and other cognitive abilities in approaching a performance task. In addition, there is always the risk that at least some elements of a task may be familiar to some students because of out-of-school experience, so that they are not, in fact, constructing a solution, but rather recalling what they already know. Therefore, there is reason to think that performance assessment suffers from the confounding of several effects that may be inherently inseparable. Lastly, as was already mentioned in Part 1, learning does not easily transfer to new contexts. The problem of “far transfer” (Barnett & Ceci, 2002) may simply make it too difficult for students to perform in novel settings. These issues require urgent consideration, as they threaten to invalidate the entire basis for performance as an assessment technique.

The Issue of Numerical Grading. The use of numbers as a means of establishing authority pervades every area of life. In assessment, the use of numbers to assign grades, often as a percentage, has recently come under critical scrutiny. In 1979, Boulding made the following insightful statement about our society’s obsession with numbers:

Perhaps the greatest superstition in the world today is numerology—the belief that somehow numerical information is always superior to qualitative, structural, and topological information. . . . Everything indeed that is presented to the decision-maker in terms of numbers is evidence, not truth. There is nothing wrong with evidence as long as it is not mistaken for truth. (Boulding, 1979, in Willson, 1991, p. 254)

The legitimacy of the use of numbers in grading and the authority of the numerical grade to promote or fail students are matters accepted on faith by most. However, a number of publications have recently called the quantitative nature of grades into question (Dalziel, 1998; Kohn, 1999; Michell, 1997; Tognolini & Andrich, 1996; Willson, 1991).

In order for grades to be considered legitimately numerical, they must be shown to be genuinely quantitative. The theoretical justification for considering a number to represent a quantitative property was first established in 1901 by the German mathematician Otto Hölder (Michell, 1997). Simply put, Hölder’s axioms required two main conditions for a property to be quantitative—order and additivity. First, the objects of a particular type must be capable of being ordered from least to greatest in a reliable and consistent way. Second, the magnitudes of any two quantities of a particular type must be capable of being added to obtain another magnitude of the same type, but greater. This attribute of additivity is a theoretical one and does not necessarily entail the concatenation of two entities. The important point is that if either of these two conditions for quantitativeness fails, there is no legitimate basis for using the numbers as if they were describing a quantitative property.
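Stated compactly (a simplified paraphrase for this discussion, not Hölder's full axiom system), the two conditions require that for magnitudes a, b, and c of a single attribute:

\[
\text{Order:}\quad \text{exactly one of } a < b,\; a = b,\; b < a \text{ holds, and } a < b,\; b < c \;\Rightarrow\; a < c;
\]
\[
\text{Additivity:}\quad \text{for any } a \text{ and } b \text{ there is a magnitude } a + b \text{ of the same attribute, with } a + b > a \text{ and } a + b > b.
\]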


The difficulty of establishing the second condition is that although many entities can be ordered, the operation of “addition” is an abstract concept, and the addition of entities is often far from obvious; for example, if a student scores 7 out of 10 on an essay and 8 out of 10 on a test, is there a quantitative entity called “term mark” that can be constructed by adding the two scores originating from dissimilar tasks? In this case, it is not clear what the underlying quantitative trait might be for which the marks are magnitudes, and such additions are open to question. In principle, it is possible to determine the quantitative nature of traits based on the specialized mathematical operation of conjoint measurement (Dalziel, 1998), but, in practice, this is difficult to establish. So far, most of the tests of conjoint measurement have failed (Dalziel, 1998) or, at best, have shown that some test item scores may be quantitative, but others not (Green, 1986).

In some jurisdictions, such as the province of Manitoba, in Canada, the reporting of a grade as a percentage is required by law (Manitoba Education and Training, 1997). Grade representation by a percentage implies that a numerical grade can represent what proportion of the course material or course objectives has been mastered by the student. In the atomistic view of knowledge, where the remembering of a large number of facts is to be tested, percentage grades make sense, since they then represent a true proportion of facts understood or remembered by students. In the current approach to curriculum, however, this type of proportioning does not make sense; for example, a curriculum objective might be to “understand the properties and structures of matter, as well as various common manifestations and applications of the actions and interactions of matter” (Manitoba Education and Youth, 2003, p. A263). Establishing the property of additivity would be difficult for a trait as complex as this, since measuring the students’ competency on this objective would require a multiplicity of nonparallel tasks and assessments.

Research also bears out that decisions made about students’ futures on the basis of percentage, and even fractional percentage, grades are likely unwarranted (Dalziel, 1998). Reliability theory shows that each individual grade is subject to a significant range of uncertainty. Yet, major student awards are presented on the basis of a fraction of a percentage point’s difference between the grades of two students. The likelihood of the result being reversed if the situation were replayed is very high. Another example of such high-stakes decisions based on percentage grades is the use of grades to relegate students to remedial classes or deny them admission into special programs. Again, decisions based on a percentage cutoff when the uncertainty of the grade determination is not known are not warranted.
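The size of this uncertainty can be illustrated with the standard error of measurement of classical test theory (the numbers below are assumptions chosen for illustration, not values drawn from the studies cited):

\[
\mathrm{SEM} = \sigma\sqrt{1-r_{xx}}, \qquad \sigma = 10 \text{ percentage points},\; r_{xx} = 0.90 \;\Rightarrow\; \mathrm{SEM} \approx 3.2,
\]

so a rough 95% band around a reported grade spans about ±6 percentage points, and a difference of a fraction of a point between two students lies well within this noise.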

The Issue of High-Stakes Assessment. A well-known effect of high-stakes testing is the phenomenon of “teaching to the test.” The effect that testing has on teaching is sometimes called the “backwash effect” (Prodromou, 1995). Both teachers and students tend to concentrate on doing well on final examinations, especially if these are systemwide examinations. Tamir states categorically that “[i]nnovations that compete with tests are bound to fail” (1998, p. 766, italics in original). Tamir pinpoints the tension that exists between reform efforts and assessment pressures. Harlen and Crick (2003) and McMillan (1999) report that high-stakes assessment tends to produce elevated stress levels for teachers and students and decreased student motivation.

Black (1995) maintains that external public examinations do not deserve support, since the assumptions on which they are based are faulty. According to him, the implicit public assumption about external assessments is that they are more trustworthy than those of the classroom teacher. The notion of the superiority of external, detached judgments is one that is informed by nostalgia and custom (Black, 1998; Harmon, 1995).


Such public perceptions are also encouraged by a lack of awareness of significant recent developments in learning theory and the philosophy of science. Black warns that great damage to education can follow from actions based on such assumptions. In the case of systemwide, publicly administered contextual assessment, these examinations have frequently been implemented before their validity has been established. These new external assessments are often seen as more trustworthy than existing classroom teacher assessment, even though the relative validity or reliability of each has been neither determined nor taken into account. Replacing the decontextualized system with one of doubtful validity is not a practice to be recommended.

The Issue of the Fundamental Assessment Unit for Public Policy. A fundamental assumption of large-scale assessment is that the individual student is the appropriate unit of assessment for public accountability purposes. This assumption is rarely questioned (Whitehead, 1929; Wineburg, 1997). If one accepts the assumption made by contextual assessment that the inclusion of richer, more true-to-life,3 or true-to-science contexts is desirable, then the notion of examining the individual student in isolation must be called into question, since isolating the student runs counter to the premise of contextual assessment. It is of doubtful utility to use test scores of individual students to rate schools and teachers, in this way using student scores as a kind of “whip” to compel students, teachers, and schools to “shape up.” Although improved student learning is the main objective of assessment, just knowing that students are not performing to expectations will not shed any light on how to improve the situation. A more helpful and remedial approach would be to assess the factors that are known to affect student performance. Evaluating these factors would be a more direct way of influencing the quality of education. One example of such alternative approaches is that developed by Marzano (2003), who identifies factors that have been found to affect student performance at the student, teacher, and school levels. At a systemwide level, assessment approaches designed with a view to possible remediation at the school level would shift the burden of accountability in a constructive manner.

Policy Recommendations

Delandshire remarks that “[t]he absence of theoretical, conceptual and philosophical debates with regard to assessment may. . . result in practices that tend to reproduce themselves in a vacuum, resist change, and are disconnected from relevant issues of knowledge, power and social organization in general” (2001, p. 113). Others have echoed Delandshire’s sentiments and even called for the examination of fundamental assumptions in assessment theory and practice (Chudowsky & Pellegrino, 2003).

Several serious, contentious issues remain unresolved in the contextual assessment arena. The validity of various contextual assessment methods that would justify their use for the purpose of public accountability has not been established; widely used methods of numerical grade calculations are open to serious question; and the universal use of the individual student as the basic unit of assessment provides little direction for the task of improving student learning. In consideration of these critical issues, the following recommendations are consistent with the preceding discussion:

1. to place an immediate moratorium on the usage of nonvalidated contextual assessment methods in their systemwide application;

3 “True-to-life” tasks test the transferability of learned knowledge to unfamiliar situations and for students may represent a challenge that they find most difficult to handle.


2. to replace percentage grade reporting with letter grade reporting for both summative and large-scale assessment and to supplement critical decisions about individual students, formerly based exclusively on marks, with other, more reliable, indicators; and

3. to designate the school as the basic unit of assessment for public accountability, rather than the individual student, and to develop assessment methods with a view to facilitating school improvement.

The force of systemwide assessment practices is, admittedly, not propelling policy in the direction of yet another reform effort. At this point, one must begin with the difficult and fundamental questions in order to attempt to steer assessment in science education, and assessment in general, in a new direction.

CONCLUSION

The disappointing reviews of contextual assessment strategies, despite the continuing attempts to use these strategies for public policy purposes, have created, in the words of Baker, “a mess” (Baker, 2001, p. 8). A period of bold reassessment of both the foundations and directions of assessment in science education is vital. In this context, one is reminded again of the advice of Michell that “[i]f the methods of science are not sanctioned philosophically then the claim that science is intellectually superior to opinion, superstition and mythology is not sustained” (1997, pp. 355–356). The underpinnings of assessment in science education are, after all, its most important elements. This becomes manifestly evident when one looks at popular contextual assessment methodologies, such as concept map assessment, performance assessment, and portfolio assessment. Although these methodologies hold great face validity, the absence of significant evidence for their reliability and validity when used in the context of public accountability should be sufficient reason for a moratorium on their increased implementation for that purpose. On the other hand, any changes to assessment systems should be implemented so as to create a minimum amount of disruption. Schwab, in writing about the assessment and change of curriculum in the practical arts, offered some practical advice for any systemic educational change:

The practical arts begin with the requirement that existing institutions and existing practices be preserved and altered piecemeal, not dismantled and replaced. Changes must be so planned and so articulated with what remains unchanged that the functioning of the whole remains coherent and unimpaired. (Schwab, 1978, p. 312)

To initiate and produce change by Schwab’s method will be challenging because the changes envisioned by the contextual assessment movement are fundamental, and any moves to adopt even more fundamental changes, like those advocated here, will tend to result in disequilibrium and disruption.

The author thanks the editor of Science Education and the three anonymous reviewers for critical insights and helpful suggestions for the revision of this article.

REFERENCES

Aschbacher, P. R. (1991). Performance assessment: State activity, interest, and concerns. Applied Measurement

in Education, 4(4), 275–288.

CONTEXTUAL ASSESSMENT IN SCIENCE EDUCATION 847

Ausubel, D. P., Novak, J. D., & Hanesian, H. (1978). Educational psychology: A cognitive view (2nd ed.).

New York: Holt, Reinhart and Winston.

Baker, E. L. (2001). Testing and assessment: A progress report. Educational Assessment, 7(1), 1–12.

Baker, E., O’Neil, H. F., Jr., & Linn, R. L. (1994). Policy and validity prospects for performance-based assessment.

Journal for the Education of the Gifted, 17(4), 332–353.

Barnett, S. M., & Ceci, S. J. (2002). When and where do we apply what we learn? A taxonomy for far transfer.

Psychological Bulletin, 128(4), 612–637.

Baume, D., & Yorke, M. (2002). The reliability of assessment by portfolio on a course to develop and accredit

teachers in higher education. Studies in Higher Education, 27(1), 1–25.

Bell, B., & Cowie, B. (2001). The characteristics of formative assessment in science education. Science Education,

85, 536–553.

Berenson, S. B., & Carter, G. S. (1995). Changing assessment practices in science and mathematics. School

Science and Mathematics, 95(4), 182–186.

Black, P. (1995). Assessment and feedback in science education. Studies in Educational Evaluation, 21(3),

257–279.

Black, P. (1998). Assessment by teachers and the improvement of students’ learning. In B. J. Fraser & K. G. Tobin

(Eds.), International handbook of science education (pp. 811–822). Dordrecht: Kluwer Academic Publishers.

Black, P. J. (1993). Formative and summative assessment by teachers. Studies in Science Education, 21, 49–97.

Black, S. (1993, February). Portfolio assessment. The Executive Educator, pp. 28–31.

Bloom, J. W. (1992). Contexts of meaning and conceptual integration: How children understand and learn. In

R. A. Duschl & R. J. Hamilton (Eds.), Philosophy of science, cognitive psychology, and educational theory and

practice (pp. 177–194). New York: State University of New York Press.

Boulding, K. (1979). In praise of inefficiency. Graduate Woman, 73, 28–30.

Broadfoot, P. (2002). Editorial. Beware the consequences of assessment! Assessment in Education, 9(3), 285–288.

Brookhart, S. M. (2001). Effects of the classroom assessment environment on mathematics and science achieve-

ment. The Journal of Educational Research, 90(6), 323–330.

Bruner, J. S. (1963). The process of education. Cambridge, MA: Harvard University Press.

Champagne, A. B., & Newell, S. T. (1992). Directions for research and development: Alternative methods of

assessing scientific literacy. Journal of Research in Science Teaching, 29(8), 841–860.

Chi, M. T. H., Glaser, R., & Farr, M. (Eds.). (1988). The nature of expertise. Hillsdale, NJ: Erlbaum.

Chomsky, N. (1959). A review of B. F. Skinner’s verbal behavior. Language, 35(1), 26–58.

Chudowsky, N., & Pellegrino, J. W. (2003). Large-scale assessments that support learning: What Will it take?

Theory Into Practice, 42(1), 75–83.

Churchland, P. M. (1992). A deeper unity: Some feyerabendian themes in neurocomputational form. In R. N. Giere

(Ed.), Minnesota studies in the philosophy of science, Vol. XV: Cognitive models of science (pp. 341–363).

Minneapolis: University of Minnesota Press.

Corsini, R. J. (Ed.) (1994). Encyclopedia of psychology (2nd ed.). New York: Wiley.

Crotty, E. K. (1994). The role of cooperative learning in an authentic performance assessment approach. Social

Science Record, 31(1), 38–41.

Dalziel, J. (1998). Using marks to assess student performance: Some problems and alternatives. Assessment &

Evaluation in Higher Education, 23(4), 351–366.

Delandshire, G. (2001). Implicit theories, unexamined assumptions and the status quo of educational assessment.

Assessment in Education, 8(2), 113–133.

DeVries, R. (2000). Vygotsky, Piaget, and education: A reciprocal assimilation of theories and educational prac-

tices. New Ideas in Psychology, 18, 187–213.

Doolittle, P. E. (1997). Vygotsky’s zone of proximal development as a theoretical foundation for cooperative

learning. Journal on Excellence in College Teaching, 8(1), 83–103.

Duschl, R., Hamilton, R., & Grandy, R. E. (1990). Psychology and epistemology: Match or mismatch when applied

to science education? International Journal of Science Education, 12(3), 230–243.

Ebenezer, J. V., & Gaskell, P. J. (1995). Relational conceptual change in solution chemistry. Science Education,

79(1), 1–17.

Edmondson, K. M. (2000). Assessing science understanding through concept maps. In J. D. Mintzes, J. H.

Wandersee, & J. D. Novak (Eds.), Assessing science understanding: A human constructivist view (pp. 22–33).

Toronto, Ontario, Canada: Academic Press.

Erickson, G., Bartley, A., Blake, L., Carlisle, R., Meyer, K., & Stavey, R. (1992). British Columbia Assessment

of Science 1991 Technical Report II: Student performance component. Victoria, Canada: British Columbia

Ministry of Education.

Erickson, G. L., & Meyer, K. (1998). Performance assessment tasks in science: What are they measuring? In B. J.

Fraser & K. G. Tobin (Eds.), International handbook of science education (pp. 845–865). Dordrecht: Kluwer

Academic Publishers.

848 KLASSEN

Ferrara, S., Huynh, H., & Baghi, H. (1997). Contextual characteristics of locally dependent open-ended item clusters in a large-scale performance assessment. Applied Measurement in Education, 10(2), 123–144.
Frederiksen, N. (1984). The real test bias. American Psychologist, 39, 193–201.
Gagne, R., Briggs, L., & Wager, W. (1992). Principles of instructional design (4th ed.). Fort Worth, TX: HBJ College.
Gagne, R. M. (1977). The conditions of learning and theory of instruction (3rd ed.). New York: Holt, Rinehart and Winston.
Gardner, H. (1992). The rhetoric of school reform: Complex theories vs. the quick fix. The Chronicle of Higher Education, 38(35), B1–B2.
Gay, L. R. (1996). Educational research: Competencies for analysis and application (5th ed.). Upper Saddle River, NJ: Prentice-Hall.
Gearhart, M., & Herman, J. L. (1995, Winter). Portfolio assessment: Whose work is it? Evaluation Comment, pp. 1–16.
Gell-Mann, M. (1994). The quark and the jaguar: Adventures in the simple and the complex. New York: W. H. Freeman.
Genovese, J. E. C. (2003). Piaget, pedagogy, and evolutionary psychology. Evolutionary Psychology, 1, 127–137.
Giere, R. N. (1992). Introduction: Cognitive models of science. In R. N. Giere (Ed.), Minnesota studies in the philosophy of science, Vol. XV: Cognitive models of science (pp. xv–xxviii). Minneapolis: University of Minnesota Press.
Gilman, D. A. (1997). This issue . . . A funny thing happened on the way to statewide performance assessment. Contemporary Education, 69(1), 4–5.
Gitomer, D. H., & Duschl, R. A. (1995). Moving toward a portfolio culture in science education. In S. M. Glynn & R. Duit (Eds.), Learning science in the schools: Research reforming practice (pp. 299–326). Mahwah, NJ: Erlbaum.
Glasersfeld, E. von (1990). Environment and communication. In L. P. Steffe & T. Wood (Eds.), Transforming children's mathematics education: International perspectives (pp. 30–38). Hillsdale, NJ: Erlbaum.
Green, K. E. (1986). Fundamental measurement: A review and application of additive conjoint measurement in educational testing. Journal of Experimental Education, 54, 141–147.
Gunstone, R., Gray, C. M. R., & Searle, P. (1992). Some long-term effects of uninformed conceptual change. Science Education, 76(2), 175–197.
Gunzenhauser, M. G. (2003). High-stakes testing and the default philosophy of education. Theory Into Practice, 42(1), 51–58.
Hamilton, L. S., Nussbaum, E. M., & Snow, R. E. (1997). Interview procedures for validating science assessments. Applied Measurement in Education, 10(2), 181–200.
Hanson, N. R. (1958). Patterns of discovery. New York: Cambridge University Press.
Harlen, W., & Crick, R. D. (2003). Testing and motivation for learning. Assessment in Education, 10(2), 169–207.
Harmon, M. (1995). The changing role of assessment in evaluating science education reform. New Directions for Program Evaluation, 65, 31–51.
Hoskins, K. (1968). The examination, disciplinary power and rational schooling. History of Education, 8, 135–146.
Houts, C., & Haddock, C. K. (1992). Answers to philosophical and sociological uses of psychologism in science studies: A behavioural psychology of science. In R. N. Giere (Ed.), Minnesota studies in the philosophy of science, Vol. XV: Cognitive models of science (pp. 367–399). Minneapolis: University of Minnesota Press.
Jaeger, R. M. (1989). Certification of student competence. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 485–514). New York: Macmillan.
Khattri, N., Reeve, A. L., & Kane, M. B. (1998). Principles and practices of performance assessment. Mahwah, NJ: Erlbaum.
Klahr, D., & Nigam, M. (2004). The equivalence of learning paths in early science instruction: Effects of direct instruction and discovery learning. Psychological Science, 15(10), 661–667.
Klein, S. P., Stecher, B. M., Shavelson, R. J., McCaffrey, D., Ormseth, T., Bell, R. M., Comfort, K., & Othman, A. R. (1998). Analytic versus holistic scoring of science performance tasks. Applied Measurement in Education, 11(2), 121–137.
Kohn, A. (1999). From degrading to de-grading. The High School Magazine, 6(5), 38–43.
Koretz, D. (1998). Large-scale portfolio assessments in the US: Evidence pertaining to the quality of measurement. Assessment in Education, 5(3), 309–334.
Koul, R., & Dana, R. (1997). Contextualized science for teaching science and technology. Interchange, 28(2&3), 121–144.
Kuhn, T. S. (1962/1996). The structure of scientific revolutions (3rd ed.). Chicago: The University of Chicago Press.
Kuhn, T. S. (1977). The essential tension. Chicago: The University of Chicago Press.
Leon, S., & Elias, M. (1998). A comparison of portfolio, performance, and traditional assessment in the middle school. Research in Middle Level Education Quarterly, 21(2), 21–37.
Linn, R. L. (2001). A century of standardized testing: Controversies and pendulum swings. Educational Assessment, 7(1), 29–38.
Lissitz, R. W. (1997). Statewide performance assessment: Continuity, context, and concerns. Contemporary Education, 69(1), 15–19.
Lomask, M. S., Baron, J. B., & Greig, J. (1998). Large-scale science performance assessment in Connecticut: Challenges and resolutions. In B. J. Fraser & K. G. Tobin (Eds.), International handbook of science education (pp. 823–844). Dordrecht: Kluwer Academic Publishers.
Madaus, G. F., & Kellaghan, T. (1993, April). Testing as a mechanism of public policy: A brief history and description. Measurement and Evaluation in Counselling and Development, 26, 6–10.
Manitoba Education and Training. (1997). Reporting on student progress and achievement: A policy handbook for teachers, administrators, and parents. Winnipeg, MB: Manitoba Education and Training.
Manitoba Education and Youth. (2003). Senior 2 science: A foundation for implementation. Winnipeg, MB: Manitoba Education and Youth.
Marton, F. (1981). Phenomenography—describing conceptions of the world around us. Instructional Science, 10, 177–200.
Marzano, R. J. (2003). Using data: Two wrongs and a right. Educational Leadership, 60(5), 56–60.
Matthews, M. R. (1994). Science teaching: The role of history and philosophy of science. New York: Routledge.
Mayer, R. E. (2002). Rote versus meaningful learning. Theory Into Practice, 41(4), 226–232.
McClure, J. R., Sonak, B., & Suen, H. K. (1999). Concept map assessment of classroom learning: Reliability, validity, and logistical practicality. Journal of Research in Science Teaching, 36(4), 475–492.
McMillan, M. (1999). The troubling consequences of the ABCs: Who's accountable? Raleigh, NC: Common Sense Foundation.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.
Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 355–383.
Mill, J. S. (1892/1859). On liberty. London: Longmans, Green and Co.
Moss, P. A. (1994, March). Can there be validity without reliability? Educational Researcher, 5–12.
Naizer, G. L. (1997). Validity and reliability issues of performance-portfolio assessment. Action in Teacher Education, 18(4), 1–9.
Novak, J. (1991). Clarify with concept maps. The Science Teacher, 58(7), 45–49.
Novak, J. (1998). Learning, creating, and using knowledge: Concept maps as facilitative tools in schools and corporations. Mahwah, NJ: Erlbaum.
Novak, J. D. (1977). An alternative to Piagetian psychology for science and mathematics education. Science Education, 61(4), 453–477.
O'Neill, J. (1992). Putting performance assessment to the test. Educational Leadership, 49(8), 14–19.
Odell, C. W. (1928). Traditional examinations and new-type tests. New York: Century.
Pallrand, G. J. (1996). The relationship of assessment to knowledge development in science education. Phi Delta Kappan, 78(4), 315–318.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press.
Piaget, J. (1970). Piaget's theory. In P. H. Mussen (Ed.), Carmichael's manual of child psychology (3rd ed., Vol. 1, pp. 703–732). New York: Wiley.
Piaget, J. (1970/1968). Genetic epistemology. New York: Columbia University Press.
Prodromou, L. (1995). The backwash effect: From testing to learning. ELT Journal, 49(1), 13–25.
Race, P. (1992). Ten worries about assessment. British Journal of Educational Technology, 23(2), 141.
Rahm, J., Miller, H. C., Hartley, L., & Moore, J. C. (2003). The value of an emergent notion of authenticity: Examples from two student/teacher–scientist partnership programs. Journal of Research in Science Teaching, 40(8), 737–756.
Regis, A., Albertazzi, P. G., & Roletto, E. (1996). Concept maps in chemistry education. Journal of Chemical Education, 73(11), 1084–1088.
Resnick, D. (1982). History of educational testing. In A. K. Wigdor & W. R. Garner (Eds.), Ability testing: Uses, consequences and controversies. Part II. Documentation section (pp. 173–194). Washington, DC: National Academy Press.
Resnick, L. B., & Resnick, D. P. (1992). Assessing the thinking curriculum: New tools for educational reform. In B. R. Gifford & M. C. O'Connor (Eds.), Changing assessments: Alternative views of aptitude, achievement and instruction (pp. 37–75). Boston: Kluwer.
Rice, D. C., Ryan, J. M., & Samson, S. M. (1998). Using concept maps to assess student learning in the science classroom: Must different methods compete? Journal of Research in Science Teaching, 35(10), 1103–1127.
Rogoff, B. (1984). Introduction: Thinking and learning in social context. In B. Rogoff & J. Lave (Eds.), Everyday cognition (pp. 1–8). Cambridge: Harvard University Press.
Rothstein, R. (1998). Skewed comparisons. The School Administrator, 55(8), 20–24.
Ruiz-Primo, M. A., & Shavelson, R. J. (1996a). Problems and issues in the use of concept maps in science assessment. Journal of Research in Science Teaching, 33(6), 569–600.
Ruiz-Primo, M. A., & Shavelson, R. J. (1996b). Rhetoric and reality in science performance assessment. Journal of Research in Science Teaching, 33(10), 1045–1063.
Ruiz-Primo, M. A., Schultz, S. E., Li, M., & Shavelson, R. J. (2001). Comparison of the reliability and validity of scores from two concept-mapping techniques. Journal of Research in Science Teaching, 38(2), 260–278.
Sadler, P. M. (1998). Psychometric models of student conceptions in science: Reconciling qualitative studies and distractor-driven assessment instruments. Journal of Research in Science Teaching, 35(3), 265–296.
Schwab, J. J. (1978). On curriculum-building. In I. Westbury & N. J. Wilkof (Eds.), Science, curriculum, and liberal education: Selected essays (pp. 275–384). Chicago: The University of Chicago Press.
Selley, N. (1989). Philosophies of science and their relation to scientific processes and the science curriculum. In J. Wellington (Ed.), Skills and processes in science education (pp. 5–20). London: Routledge.
Shavelson, R. J., Baxter, G. P., & Pine, J. (1991). Performance assessment in science. Applied Measurement in Education, 4(4), 347–362.
Shepard, L. A. (1991, October). Psychometricians' beliefs about learning. Educational Researcher, pp. 2–16.
Shepardson, D. P. (2001). Assessment in science: A guide to development and classroom practice. Dordrecht: Kluwer.
Shymansky, J. A., Chidsey, J. L., Henriquez, L., Enger, S., Yore, L. D., Wolfe, E. W., & Jorgensen, M. (1997). Performance assessment in science as a tool to enhance the picture of student learning. School Science and Mathematics, 97(4), 172–183.
Skinner, B. F. (1954). The science of learning and the art of teaching. Harvard Educational Review, 24, 86–97.
Skinner, B. F. (1965). Reflections on a decade of teaching machines. In R. Glaser (Ed.), Teaching machines and programmed learning: II. Data and directions (pp. 5–20). Washington, DC: National Education Association.
Slater, T. F., Ryan, J. M., & Samson, S. L. (1997). Impact and dynamics of portfolio assessment and traditional assessment in a college physics course. Journal of Research in Science Teaching, 34(3), 255–271.
Stecher, B. M., Klein, S. P., Solano-Flores, G., McCaffrey, D., Robyn, A., Shavelson, R. J., & Haertel, E. (2000). The effects of content, format, and inquiry level on science performance assessment scores. Applied Measurement in Education, 13(2), 139–160.
Strauss, S. (1972). Learning theories of Gagne and Piaget: Implications for curriculum development. Teachers College Record, 74(1), 81–102.
Strike, K. A., & Posner, G. J. (1985). A conceptual change view of learning and understanding. In L. West & L. Pines (Eds.), Cognitive structure and conceptual change (pp. 211–231). New York: Academic Press.
Tamir, P. (1998). Assessment and evaluation in science education: Opportunities to learn and outcomes. In B. J. Fraser & K. G. Tobin (Eds.), International handbook of science education (pp. 761–789). Dordrecht: Kluwer Academic Publishers.
Tognolini, J., & Andrich, D. (1996). Analysis of profiles of students applying for entrance to universities. Applied Measurement in Education, 9(4), 323–353.
Tweney, R. D. (1992). Serial and parallel processing in scientific discovery. In R. N. Giere (Ed.), Minnesota studies in the philosophy of science, Vol. XV: Cognitive models of science (pp. 77–88). Minneapolis: University of Minnesota Press.
Villani, A. (1992). Conceptual change in science and science education. Science Education, 76(2), 223–237.
Vygotsky, L. S. (1935/1978). Mind in society: The development of higher psychological processes. Cambridge, MA: Harvard University Press.
Wandersee, J. H. (1990). Concept mapping and the cartography of cognition. Journal of Research in Science Teaching, 27(10), 923–936.
Wandersee, J. H., Mintzes, J. J., & Novak, J. D. (1994). Research on alternative conceptions in science. In D. L. Gabel (Ed.), Handbook of research on science teaching and learning (pp. 177–210). New York: Macmillan.
Webb, N. M., Schlackman, J., & Sugrue, B. (2000). The dependability and interchangeability of assessment methods in science. Applied Measurement in Education, 13(3), 277–301.
Whitehead, A. N. (1929). The aims of education and other essays. New York: Macmillan.
Wiggins, G. (1990). The case for authentic assessment. ERIC Digest [WWW document]; http://ericae.net/db/edo/ED328611.htm
Willson, V. L. (1991). Performance assessment, psychometric theory, and cognitive learning theory: Ships crossing in the night. Contemporary Education, 62(4), 250–254.
Wilson, M. (2003). On choosing a model for measuring. Methods of Psychological Research, 8(3), 1–22.
Wineburg, S. (1997). T. S. Eliot, collaboration, and the quandaries of assessment in a rapidly changing world. Phi Delta Kappan, 79(1), 59–65.
Zeilik, M., Bisard, W., & Lee, K. (2002). Research-based astronomy: Will it travel? Astronomy Education Review, 1(1), 33–46.
Zeilik, M., Schau, C., Mattern, N., Hall, S., Teague, W., & Bisard, W. (1997). Conceptual astronomy: A novel model for teaching postsecondary science. American Journal of Physics, 65, 987–996.