Patrick Meyer Reliability Understanding Statistics 2010


  • reliability

• SERIES IN UNDERSTANDING STATISTICS

natasha beretvas, Series Editor-in-Chief
patricia leavy, Qualitative Editor

    Quantitative Statistics

Confirmatory Factor Analysis, Timothy J. Brown

Effect Sizes, Steve Olejnik

Exploratory Factor Analysis, Leandre Fabrigar and Duane Wegener

Multilevel Modeling, Joop Hox

Structural Equation Modeling, Marilyn Thompson

Tests of Mediation, Keenan Pituch

Measurement

Item Response Theory, Christine DeMars

Reliability, Patrick Meyer

Validity, Catherine Taylor

Qualitative

Oral History, Patricia Leavy

The Fundamentals, Johnny Saldana

  • j. patrick meyer

RELIABILITY

2010

• Oxford University Press, Inc., publishes works that further Oxford University's

    objective of excellence in research, scholarship, and education.

    Oxford New York

    Auckland Cape Town Dar es Salaam Hong Kong Karachi

    Kuala Lumpur Madrid Melbourne Mexico City Nairobi

    New Delhi Shanghai Taipei Toronto

    With offices in

    Argentina Austria Brazil Chile Czech Republic France Greece

    Guatemala Hungary Italy Japan Poland Portugal Singapore

    South Korea Switzerland Thailand Turkey Ukraine Vietnam

    Copyright 2010 by Oxford University Press, Inc.

    Published by Oxford University Press, Inc.

    198 Madison Avenue, New York, New York 10016

    www.oup.com

    Oxford is a registered trademark of Oxford University Press, Inc.

    All rights reserved. No part of this publication may be reproduced, stored in a

    retrieval system, or transmitted, in any form or by any means, electronic,

    mechanical, photocopying, recording, or otherwise, without the prior permission

    of Oxford University Press

    Library of Congress Cataloging-in-Publication Data

    Meyer, J. Patrick.

    Reliability / J. Patrick Meyer.

    p. cm. (Series in understanding statistics)

    Includes bibliographical references.

    ISBN 978-0-19-538036-1

1. Psychometrics. 2. Psychological tests--Evaluation. 3. Educational tests and measurements--Evaluation. 4. Examinations--Scoring. I. Title.

BF39.M487 2010

150.2807--dc22    2009026618

    9 8 7 6 5 4 3 2 1

    Printed in the United States of America

    on acid-free paper

  • In loving memory of my mother, Dottie Meyer


  • acknowledgments

I owe many thanks to my wife Christina and son Aidan for their loving support and encouragement. You were very patient with me as I spent many nights and weekends writing this manuscript. I love you both, and I am forever grateful for your support.

I would like to thank the South Carolina Department of Education for the use of the Benchmark Assessment and the Palmetto Achievement Challenge Test data. Special thanks are extended to Teri Siskind, Robin Rivers, Elizabeth Jones, Imelda Go, and Dawn Mazzie. I also thank Joe Saunders for his help with the PACT data files and Christy Schneider for discussing parts of the analysis with me.

I am very grateful for the help and support of colleagues at the University of Virginia. In particular, I thank Billie-Jo Grant for providing feedback on earlier drafts. Your input was invaluable and produced needed improvements to the manuscript. I also thank Sara Rimm-Kaufman and Temple Walkowiak for writing a description of the Responsive Classroom Efficacy Study and allowing me to use the MSCAN data.


  • contents

CHAPTER 1  INTRODUCTION . . . . . . . . . . . . . . . 3

CHAPTER 2  DATA COLLECTION DESIGNS . . . . . . . . . . 50

CHAPTER 3  ASSUMPTIONS . . . . . . . . . . . . . . 73

CHAPTER 4  METHODS . . . . . . . . . . . . . . . . 92

CHAPTER 5  RESULTS . . . . . . . . . . . . . . . . 109

CHAPTER 6  DISCUSSION AND RECOMMENDED READINGS . . . . . 130

    References . . . . . . . . . . . . . . . . 136

    Index . . . . . . . . . . . . . . . . . . 143


• 1  introduction

    Context and Overview

social scientists frequently measure unobservable characteristics of people such as mathematics achievement or musical aptitude. These unobservable characteristics are also referred to as constructs or latent traits. To accomplish this task, educational and psychological tests are designed to elicit observable behaviors that are hypothesized to be due to the underlying construct. For example, math achievement manifests in an examinee's ability to select the correct answer to mathematical questions, and a flautist's musical aptitude manifests in the ratings of a music performance task. Points are awarded for certain behaviors, and an examinee's observed score is the sum of these points. For example, each item on a 60-item multiple-choice test may be awarded 1 point for a correct response and 0 points for an incorrect response. An examinee's observed score is the sum of the points awarded. In this manner, a score is assigned to an observable behavior that is posited to be due to some underlying construct.

Simply eliciting a certain type of behavior is not sufficient for educational and psychological measurement. Rather, the scores ascribed to these behaviors should exhibit certain properties: the scores should be consistent and lead to the proper interpretation of the construct. The former property is a matter of test score reliability, whereas the latter concerns test score validation (Kane, 2006). Test score reliability refers to the degree of test score consistency over many replications of a test or performance task. It is inversely related to the concept of measurement error, which reflects the discrepancy of an examinee's scores over many replications. Reliability and measurement error are the focus of this text. The extent to which test scores lead to proper interpretation of the construct is a matter of test validity and is the subject of another volume in this series.

    The Importance of Test Score Reliability

Spearman (1904) recognized that measuring unobservable characteristics, such as mathematics achievement or musical aptitude, is not as deterministic as measuring physical attributes, such as the length of someone's arm or leg. Indeed, he acknowledged that measurement error contributed to random variation among repeated measurements of the same unobservable entity. For example, an examinee may be distracted during one administration of a math test but not during another, causing a fluctuation in test scores. Similarly, a flautist may perform one set of excerpts better than another set, producing slight variations in the ratings of musical aptitude. These random variations are due to measurement error and are undesirable characteristics of scores from a test or performance assessment. Therefore, one task in measurement is to quantify the impact on observed test scores of one or more sources of measurement error. Understanding the impact of measurement error is important because it affects (a) statistics computed from observed scores, (b) decisions made about examinees, and (c) test score inferences.

Spearman (1904, 1910) showed that measurement error attenuates the correlation between two measures, but other statistics are affected as well (see Ree & Carretta, 2006). Test statistics, such as the independent samples t-test, involve observed score variance in their computation, and measurement error increases observed score variance. Consequently, measurement error causes test statistics and effect sizes to be smaller, confidence intervals to be wider, and statistical power to be lower than they should be (Kopriva & Shaw, 1991). For example, Cohen's d is the effect size for an experimental design suitable for an independent-samples t-test.


An effect size of d = 0.67 that is obtained when reliability is 1.0 notably decreases as reliability decreases; decreasing reliability to .8 attenuates the effect size to .60, and decreasing reliability to .5 attenuates the effect size to .47. Figure 1.1 demonstrates the impact of this effect on statistical power for an independent-samples t-test. The horizontal line marks the statistical power of 0.8. The curved lines represent power as a function of sample size per group for score reliabilities of 1.0, 0.8, and 0.5. Notice that as reliability decreases, more examinees are needed per group to maintain a power of 0.8. Indeed, a dramatic difference exists between scores that are perfectly, but unrealistically, reliable and scores that are not reliable. Given the influence of reliability on statistics, the conclusions and inferences based on these statistics may be erroneous and misleading if scores are presumed to be perfectly reliable.
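The attenuation and power loss described here can be reproduced numerically. The following Python sketch is illustrative only (it is not from the book): it shrinks a true effect size by the square root of score reliability and uses a normal approximation to two-sample t-test power, with a sample size of 40 per group chosen arbitrarily.

    # Illustrative sketch: how score reliability attenuates Cohen's d and,
    # in turn, statistical power for a two-sided, two-sample t-test.
    # The normal approximation to power is a simplifying assumption.
    from math import erf, sqrt

    def normal_cdf(z):
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    def attenuated_d(true_d, reliability):
        # Observed effect size shrinks by the square root of reliability.
        return true_d * sqrt(reliability)

    def approx_power(d, n_per_group, z_crit=1.96):
        # z_crit = 1.96 corresponds to a two-sided alpha of .05.
        return normal_cdf(d * sqrt(n_per_group / 2.0) - z_crit)

    for rel in (1.0, 0.8, 0.5):
        d = attenuated_d(0.67, rel)
        print(f"reliability = {rel:.1f}  d = {d:.2f}  power(n = 40) = {approx_power(d, 40):.2f}")

Running the loop reproduces the effect sizes of .67, .60, and .47 and shows power dropping with reliability, mirroring Figure 1.1.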

    Although biased statistics are of concern, some of the greatestconsequences of measurement error are found in applications that

[Figure 1.1. The Influence of Reliability on the Statistical Power of a Two-Sample t-Test. Statistical power (y-axis, 0.0 to 1.0) is plotted against sample size per group (x-axis, 10 to 50) for reliability = 1.0 (effect size = 0.67), reliability = 0.8 (effect size = 0.60), and reliability = 0.5 (effect size = 0.47); a horizontal line marks a power of 0.8.]


concern the simple reporting of scores for individual examinees. Many uses of test scores involve high-stakes consequences for the examinee. For example, a student may be granted or denied graduation because of a score on a mathematics exam. An applicant may be granted or denied a job because of a score obtained on a measure of some characteristic desired by an employer. Because of such high stakes, test scores assigned to an examinee over many replications should be consistent. We cannot have confidence in test scores if the decision about an examinee (e.g., granted or denied graduation) differed from one administration of the test to another without any true change in the examinee's capability. Sizable measurement error can produce inconsistent decisions and reduce the quality of inferences based on those scores.

The amount of measurement error in test scores must be closely monitored, not only to appreciate the consistency of the test scores but also to evaluate the quality of the inferences based on the scores. Reliability is a necessary but not sufficient condition for validity (Linn & Gronlund, 2000, p. 108). Scores must be reliable in order to make valid inferences. If test scores are not consistent, there is no way to determine whether inferences based on those scores are accurate. Imagine watching someone throw darts at a dart board and trying to guess (i.e., infer) the score zone that the person is trying to hit. Suppose further that you have no knowledge of the thrower's intended target. If the darts hit the board in close proximity to each other, you have a good chance of correctly guessing the zone the person is trying to hit. Conversely, if the darts are widely dispersed, you have very little chance of correctly guessing the target score zone. Imagine that three darts are thrown and each hits a different score zone: What is the likely target? It could be any of the three zones or none of them. Wide dispersion of the darts would limit your confidence in guessing the intended target. Now suppose that all three darts hit close to each other in the 18 zone. What is the intended target? The 18 zone is a good guess for the intended target because of the consistency of their location.

Although reliability is a necessary condition for making valid inferences, it does not guarantee that our inferences about scores will be accurate. It is not a sufficient condition for validation. Scores may be reliably off-target. The tight grouping of darts described in the previous paragraph is indicative of reliability.


However, reliability does not guarantee that the darts are anywhere close to the target. Suppose you see three darts hit the bull's-eye but later learn that the thrower was aiming for the 20 zone! Your guess would have been a good one but nonetheless incorrect.

Reliability plays a key role in social science research and applied testing situations. It impacts the quality of test scores, statistical tests, and score inferences. Given the importance of reliability, the purpose of this book is to facilitate a thorough understanding of the selection, interpretation, and documentation of test score reliability.

    Organization of the Book

Since Spearman's (1904) seminal work, various theories of test scores have evolved to address the technical challenges in the production of test scores and the characterization of score reliability. Classical test theory, classification decisions, and generalizability theory are three approaches discussed in this text. General concepts and commonalities among these theories are emphasized to make connections between them and facilitate an understanding of reliability. Distinctions between these theories are also discussed to appreciate the necessity of each theory and to help the psychometrician select an appropriate approach to characterizing score reliability.

Each book in the Understanding Statistics or Understanding Measurement series is organized around the same six chapters, as prescribed by the series editor. As a result, the development of concepts in this text differs from the development in other texts on reliability. Rather than have separate chapters for classical test theory, classification decisions, and generalizability theory, these approaches are discussed together and organized around a particular theme. For example, data collection designs for each theory are discussed in Chapter 2, whereas assumptions for each theory are described in Chapter 3. This structure presented some challenges in deciding how to best organize the material, and some readers may question my selection of content for each chapter. However, the organization of this text fits with modern data analysis and reporting techniques. Organizing the material in this manner will facilitate a deeper understanding of reliability as well as facilitate reliability analysis and reporting.


Chapter 1 presents an overview of classical test theory, classification decisions, strong true score theory, and generalizability theory. General concepts that cut across all topic areas are discussed, followed by a description of those concepts that are specific to each theory. Emphasis is placed on concepts that are central to understanding reliability and measurement error.

Chapter 2 begins with a description of data obtained from an operational testing program. This data will provide a basis for many of the explanations and examples throughout the text. It is followed by a description of data collection designs used in reliability analysis. An emphasis is placed on describing the type or types of measurement error present in each design.

Assumptions in classical test theory, classification decisions, and generalizability theory are described in Chapter 3. Particular attention is given to assumptions involving the nature of measurement procedure replications, and the consequences of these assumptions on the part-test covariance matrix.

Chapters 1 through 3 provide the theoretical basis for selecting a method of estimating reliability and reporting the results. Chapter 4 describes a variety of methods for estimating reliability. These methods are organized in decision trees to facilitate the selection of an appropriate method.

Reporting conventions described in Chapter 5 are based on guidelines set forth in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). While it may seem unnecessary to discuss the reporting of reliability, several studies have shown that the literature is filled with poorly or improperly documented score reliability (Qualls & Moss, 1996; Thompson & Vacha-Haase, 2000; Vacha-Haase, Kogan, & Thompson, 2000). Guidelines for reporting reliability are illustrated with an analysis of scores from three different testing programs. Each program served a different purpose, which permitted a wide variety of reliability methods to be demonstrated. These examples are concluded in Chapter 6, with recommended strategies for discussing a reliability analysis.

Finally, this text is by no means exhaustive. Many others have written more technical and exhaustive works on one or more of the topics discussed herein. Chapter 6 also lists texts and other recommended readings in reliability.


  • General Concepts

Defining reliability as the degree of test score consistency conveys a general sense of what it means to be reliable (test scores should be consistent with something), but it lacks specification of the entities with which test scores should be consistent. A more complete definition of reliability would state something like, "Reliability is the extent to which test scores are consistent with another set of test scores produced from a similar process." Further improvements to this definition would state the specific details about the process that produced the test scores, such as information about the selection of items for each test and the manner in which the data were collected. A general definition is useful in that it can apply to many situations; but in actuality, reliability is situation-specific. It only obtains its full meaning with a complete specification of the situation. For example, the statement, "Scores from a sixth grade math test are consistent," is incomplete. It does not indicate how the scores are consistent. The statement, "Scores from a sixth grade math test are consistent with scores obtained from the same examinees who took the same test but at a different time," is more complete and describes the process by which scores were obtained. It also differs from other possible statements such as, "Scores from a sixth grade math test are consistent with scores from a different but very similar sixth grade math test that was administered to the same examinees at a different time point." These two statements are more complete statements about score reliability, and they communicate very different notions of what it means for test scores to be consistent. Most importantly, they reflect different sources of measurement error. In the first statement, change in the examinee's state from one testing occasion to another is the primary source of error. The exact same test was used for each administration. Therefore, error due to lack of test similarity is not possible. In the second statement, scores may be inconsistent because of changes in an examinee's state from one administration of the test to another, as well as a lack of similarity between the two test forms. Although both of these statements refer to the extent to which sixth grade math test scores are consistent, the nature of the consistency and the sources of measurement error differ. Therefore, a complete understanding of the meaning of reliability is only possible through a complete specification of the process that produced the test scores.


  • The Measurement Procedure

When people think of educational and psychological measurement, they often think of individual tests. However, measurement involves much more than the test itself. The entire testing situation and the process that produces test scores must be considered. A measurement procedure (Lord & Novick, 1968, p. 302) encompasses all aspects of the testing situation, such as the occasion or time of test administration, the use of raters, the particular selection of test items, the mode of test administration, and the standardized conditions of testing (i.e., those aspects of testing that are fixed). It includes multiple aspects of the testing process, and it is not simply limited to the test itself. All aspects of the measurement procedure may affect the consistency of scores.

    Sampling in Measurement

Sampling the Measurement Procedure. There are two types of sampling in measurement: the sampling of one or more aspects of the measurement procedure (e.g., items), and the sampling of examinees (see Cronbach & Shavelson, 2004; Lord, 1955b; Lord & Novick, 1968). The primary source or sources of measurement error are attributed to sampling the measurement procedure. For example, suppose a 50-item multiple-choice test of English language arts (ELA) is constructed from an item pool of 200 items, and this test is administered to a single examinee. Suppose further that any administration of this test occurs at the same time of day. The observed score that is obtained by counting the number of items answered correctly is only one of the 4.5386 × 10^47 possible scores the examinee could earn from 50-item tests created from this pool of 200 items. It is very unlikely that the examinee will obtain the exact same score for all of these possible tests. Observed scores will vary due to the particular sample of items that comprise each test (i.e., measurement error due to the selection of items). One sample may be more difficult than another, or the examinee may be more familiar with some topics discussed in each item but not others. Other random aspects of the measurement procedure may also produce variation of observed scores. For example, the examinee may experience fatigue, distraction, or forgetfulness during the administration of some tests but not others.
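The count of possible 50-item forms can be checked directly. The sketch below is an illustrative aside (not from the book); the examinee who would answer 150 of the 200 pooled items correctly is a hypothetical example used to show how number-correct scores fluctuate across randomly sampled forms.

    # Illustrative sketch: counting possible 50-item forms from a 200-item pool
    # and simulating score fluctuation across randomly sampled forms.
    import random
    from math import comb

    print(comb(200, 50))            # about 4.5386e47 possible 50-item forms

    # Hypothetical examinee: 1 = item answered correctly, 0 = answered incorrectly.
    pool = [1] * 150 + [0] * 50
    random.seed(1)
    scores = [sum(random.sample(pool, 50)) for _ in range(5)]
    print(scores)                   # five number-correct scores from five random forms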


Sampling the measurement procedure is also described as replicating the measurement procedure (Lord & Novick, 1968, p. 47). The latter phrase is preferred, given that the sampling may or may not involve simple random sampling. Avoiding use of the word sampling helps avoid the mistake of assuming that an instance of the measurement procedure was obtained through simple random sampling when it was not. Moreover, measurement procedure replications may involve samples with specific characteristics, such as prescribed relationships among observed score averages (see Chapter 3). The term replication helps convey the notion that the samples should have certain characteristics.

Brennan (2001a) stressed the importance of measurement procedure replications for the proper interpretation of reliability. He wrote that to understand reliability, an investigator must have a clear answer to the following question: "(1) What are the intended (possibly idealized) replications of the measurement procedure?" (p. 296). Continuing with the previous example, each sample of 50 ELA test items from the pool of 200 items may be considered a replication. Therefore, the consistency of scores that the examinee may obtain from all 50-item tests constructed from this pool reflects the similarity of test items. It does not reflect variation of scores due to testing occasion (e.g., testing in the morning rather than the afternoon) because each replication occurred at the same time of day. The meaning and interpretation of score reliability are inseparably tied to replicating the measurement procedure: "Reliability is a measure of the degree of consistency in examinee scores over replications of a measurement procedure" (Brennan, pp. 295–296). A clear definition of the measurement procedure and the process of replicating it provide a clear specification of the source or sources of error affecting scores and the interpretation of reliability.

As emphasized by Brennan, the replications may be idealized and only conceptual in nature. For example, the ELA item pool discussed previously may not actually exist. Rather, the item pool may be composed of all 50-item tests that might be created according to the test specifications. This type of conceptual replication is frequently encountered in practice. Even though there is no real item pool from which to sample items, only two real replications are necessary for estimating reliability, and these may be obtained by dividing a test into parts. All of the other


possible replications do not need to occur, but they exist conceptually to facilitate the development of statistical theories underlying the scores and the interpretation of reliability.

Details of replicating the measurement procedure are particularly evident in the development of specific theories of test scores and the assumptions about test scores. Classical test theory, classification decisions, and generalizability theory define scores differently by imposing different restrictions on the way a measurement procedure is replicated. As a result, there are many different interpretations of reliability and methods for estimating it. For example, replications in classical test theory may use the administration of parallel test forms. In strong true score theory, replications may involve randomly sampling items from a domain. Generalizability theory permits the most exhaustive definition of a replication, such that multiple aspects of the measurement procedure may be considered simultaneously. For example, randomly sampling test forms and testing occasions may constitute a replication. As discussed in Chapter 3, the assumptions about test scores specify the nature of replicating the measurement procedure.

Sampling Examinees. Examinees participating in a measurement procedure are assumed to be randomly sampled from a population. If measurement error was held constant, and the population is heterogeneous on the construct of interest, then different scores would be observed for a group of examinees simply because of the process of randomly sampling examinees. For example, consider a population of examinees who differ in their mathematics ability. A random sample of two examinees would likely result in one examinee having better mathematical ability than the other. If error scores for these two examinees were the same, the observed score for the first examinee would be higher than the one for the second examinee. Therefore, variation among observed scores is due, in part, to randomly sampling examinees.

    Replicating the Measurement Procedure and Sampling Examinees

Replicating the measurement procedure and sampling examinees do not occur in isolation. They occur in tandem. Every examinee in a sample has scores that are affected by sampling the


measurement procedure. Therefore, the total variance of scores is made up of variance due to real differences among examinees and variance due to sampling the measurement procedure.

Table 1.1 contains an examinee-by-item matrix for the entire population and all conditions of the measurement procedure that are of possible interest. The observed score random variables are denoted by X, with subscripts that denote the row and column number, respectively. For example, X23 refers to the observed score random variable for the second person and the third item. In this example, the observed score random variable is the score on an individual item. Replicating the measurement procedure involves the selection of columns from the table, whereas sampling examinees involves the selection of rows from the table. Three possible samples of four examinees and two items are indicated by the shaded areas of Table 1.1. The total variability of observed scores in each shaded area will differ because of sampling items and examinees. Reliability coefficients provide an index that reflects the extent to which variation among real examinee differences (i.e., sampling rows) explains this total variability. However, reliability is a property of test scores from each sample. It is not a property of the test itself. Each sample will produce a different reliability estimate.

Table 1.1
The Infinite Matrix and Three Possible Samples of Four Examinees and Two Items

                                Item
    Examinee    1     2     3     4     5     6     7    ...
       1       X11   X12   X13   X14   X15   X16   X17   ...
       2       X21   X22   X23   X24   X25   X26   X27   ...
       3       X31   X32   X33   X34   X35   X36   X37   ...
       4       X41   X42   X43   X44   X45   X46   X47   ...
       5       X51   X52   X53   X54   X55   X56   X57   ...
       6       X61   X62   X63   X64   X65   X66   X67   ...
       7       X71   X72   X73   X74   X75   X76   X77   ...
      ...      ...   ...   ...   ...   ...   ...   ...   ...

(In the original table, shaded blocks of four examinees by two items mark the three possible samples.)


In the next three sections, important details and concepts in classical test theory, strong true score theory, and generalizability theory will be reviewed. This review is by no means exhaustive, and the reader is encouraged to consult additional sources for a more technical and thorough review.

    Classical Test Theory

Observed scores and error scores were previously defined in broad conceptual terms. However, they have a very specific definition in classical test theory. Moreover, they are related to a third type of score, the true score, which will be defined in more detail shortly. Observed scores, error scores, and true scores are related by a well-known formula,

X = T + E,    (1.1)

where X, T, and E represent the observed, true, and error score for a randomly selected examinee, respectively.1 The definitions of these scores are established with respect to replication of the measurement procedure.

    Replicating the Measurement Procedure

An examinee's observed score is likely to differ upon each replication of the measurement procedure due to transient, internal characteristics of an examinee, such as forgetfulness, guessing, hunger, and distractibility. Variation among specific instances of the measurement procedure, such as using a different test form for each replication, also causes scores to vary from one replication to the next. Supposing that many (preferably infinite) replications of the measurement procedure have occurred, then an examinee will have many observed scores that vary randomly. A histogram could be used to graphically depict an examinee's observed score distribution, as illustrated in Figure 1.2. The top portion of this figure

1. The letter T is actually the uppercase Greek letter tau.


shows the distribution of observed scores obtained by an examinee over all possible replications of the measurement procedure. This distribution provides information about the amount of variability among observed scores, referred to as examinee-specific observed score variance, and the location of the distribution, referred to as the true score, which is marked by a thick solid vertical line in the bottom panel of Figure 1.2.

An examinee's true score is the score of primary interest in measurement. It is defined as an examinee's average observed score obtained from all replications of the measurement procedure. If an examinee actually participated in an infinite number of replications, we could compute the average observed score to obtain the actual true score. However, an examinee usually only participates in one or two replications of the measurement procedure, which is far short of the infinite number of replications

[Figure 1.2. Observed Score Distribution for a Single Examinee. Two panels titled "Observed Scores for an Examinee Over Many Replications" show score densities on a scale of roughly 80 to 120; the true score is marked by a thick vertical line in the bottom panel.]


needed to obtain the actual true score. We never know the actual value of the true score, and we must estimate it from the observed scores that are actually obtained. While defining true scores in terms of an infinite number of observed scores appears as a limitation, this definition facilitates the statistical development of the theory and leads to methods for estimating the true score. It also results in a definition of the reliability coefficient, an index that describes the similarity of true and observed scores. Each of these methods will be discussed in detail later. Finally, this definition of a true score makes evident that a true score is not defined as some inherent aspect of the individual examinee. It is not something possessed, like hair color or weight. It is an abstraction defined in terms of replicating the measurement procedure.

Now that observed scores and true scores have been defined, the formal definition of error scores should be evident. The error score is simply the difference between an examinee's observed score and true score. Like an observed score, an examinee has a different error score for each replication of the measurement procedure, and all of these scores make up an error score distribution. This distribution has an average value of zero, but it has the same variance as the examinee's observed score distribution. Both of these characteristics are evident in Figure 1.3. Examinee-specific observed scores are displayed in the top panel of the figure, and examinee-specific error scores are depicted in the bottom panel. Notice the similarity of the two distributions. They have the same variance. Only the average value is different, as evident in the value of the bold vertical line. The observed score distribution has an average equal to the true score, whereas the error score distribution has an average value of zero.

In each replication of the measurement procedure, the goal is to obtain an observed score that is very close to an examinee's true score. When each replication of the measurement results in a small error score, the variance of the examinee-specific error score distribution will be small, and we can have confidence that our observed score is close to the true score on any one replication of the measurement procedure. Conversely, when each replication of the measurement procedure results in a large error score, the variance of the examinee-specific error score distribution will be large and indicate a large difference between observed scores and true score. A large variance of the examinee-specific error score


distribution means that our measurement procedure produced a lot of random noise. Our confidence in the quality of observed scores decreases as measurement error increases. However, measurement error to this point has been defined in terms of a single examinee who participates in many replications of the measurement procedure, a situation that rarely, if ever, holds in practice. Therefore, measurement error and reliability are defined with respect to a sample of examinees.
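A small simulation can make these examinee-specific distributions concrete. The sketch below is illustrative and assumes a true score of 100 and normally distributed error with a standard deviation of 5: the mean of many replicated observed scores approaches the true score, while the error scores have a mean near zero and the same variance as the observed scores.

    # Illustrative sketch: observed and error score distributions for one
    # examinee over many replications, where X = T + E.
    import random
    import statistics as stats

    random.seed(42)
    true_score = 100.0                          # assumed true score for one examinee
    errors = [random.gauss(0.0, 5.0) for _ in range(100_000)]
    observed = [true_score + e for e in errors]

    print(round(stats.mean(observed), 2))       # close to the true score (100)
    print(round(stats.mean(errors), 2))         # close to 0
    print(round(stats.pvariance(observed), 2))  # same variance as the error scores
    print(round(stats.pvariance(errors), 2))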

Additional Restrictions That Define Classical Test Theory. Several restrictions on the measurement procedure must be stated to complete the definition of classical test theory. The restrictions are that (a) the expected value (i.e., mean) of the examinee-specific error score distribution is zero, (b) there is no relationship between true scores and error scores, (c) there is no relationship

[Figure 1.3. Examinee-Specific Observed and Error Score Distributions. The top panel shows observed scores for an examinee over many replications (roughly 80 to 120); the bottom panel shows the corresponding error scores over many replications (roughly −20 to 20).]


between the error scores obtained from any two replications of the measurement procedure, and (d) there is no relationship between error scores obtained from one replication of the measurement procedure and true scores obtained from any other replication of the measurement procedure. The complete definition leads to important results in classical test theory, such as the decomposition of observed score variance and the development of reliability coefficient estimates.

    Sources of Variance

The variance of observed scores over a sample of examinees is true score variance plus error score variance,

σ²_X = σ²_T + σ²_E.    (1.2)

The covariance between true and error scores is not part of this result, given restriction (b) listed in the previous paragraph. Observed score variance is the easiest term to explain. It is simply the variance of the scores obtained from every examinee in the sample.
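The decomposition in Equation 1.2 can also be verified by simulation. The following sketch is illustrative (true scores drawn from a normal distribution with mean 100 and standard deviation 10, and an error standard deviation of 5, are assumed values): observed score variance is approximately the sum of true score variance and error variance, and the ratio of true to observed variance anticipates the reliability coefficient defined below.

    # Illustrative sketch: observed score variance = true score variance
    # + error score variance for a sample of examinees (Equation 1.2).
    import random
    import statistics as stats

    random.seed(7)
    n_examinees = 50_000
    true_scores = [random.gauss(100.0, 10.0) for _ in range(n_examinees)]
    errors = [random.gauss(0.0, 5.0) for _ in range(n_examinees)]
    observed = [t + e for t, e in zip(true_scores, errors)]

    var_x = stats.pvariance(observed)
    var_t = stats.pvariance(true_scores)
    var_e = stats.pvariance(errors)
    print(round(var_x, 1), round(var_t + var_e, 1))  # approximately equal
    print(round(var_t / var_x, 3))                   # approximately 100/125 = 0.8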

True score variance is also easy to explain. The true score is fixed for a single examinee, and it is the same in all replications of the measurement procedure. There is no true score variance within an examinee. However, true scores vary for a sample of examinees. That is, each examinee will likely have a different true score. If each examinee's true score were known, we could compute the variance of the true scores for a sample of examinees to obtain the true score variance, σ²_T. However, each examinee's true score is unknown. True score variance must be estimated from observed score variance using certain methods that follow from classical test theory. These methods will be described later.

Error score variance is more difficult to define because error scores differ within an examinee (i.e., over replications), and each examinee has a different error variance. This aspect is illustrated in Figure 1.4 for three examinees, each with a different true score. Error score variance, σ²_E, for a sample of examinees is defined as the average (i.e., expected value) of the examinee-specific error


variances. If every examinee has a large examinee-specific error variance, then the average (over examinees) error variance will also be large. Conversely, if every examinee has a small examinee-specific error variance, then the average (over examinees) error variance will be small. For Figure 1.4, error variance is computed as (25 + 4 + 36)/3 = 21.67.

A difficulty in using a variance term to describe error is that it involves squared units. Therefore, the standard error of measurement is the square root of the average² of the examinee-specific

[Figure 1.4. Error Score Distributions for Three Examinees. Panels show error score densities (roughly −30 to 30) for Examinee 1 (true score = 85), Examinee 2 (true score = 120), and Examinee 3 (true score = 95), with examinee-specific error variances σ²(E) of 36, 25, and 4.]

2. More accurately, it is the expected value of the examinee-specific error variances. The term average is used for simplicity.


error variances. It describes the amount of measurement error using the same units (i.e., the same scale) as the test itself, rather than squared units. For the information in Figure 1.4, the standard error of measurement is √21.67 = 4.66. In practice, error variance and the standard error of measurement are not so easy to compute because we do not know the true or error scores. Error score variance may also be estimated with certain methods, as described later.
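The arithmetic for Figure 1.4 can be written out in a few lines. The sketch below is illustrative only; it averages the three examinee-specific error variances and then takes the square root of the rounded value to reproduce the standard error of measurement reported in the text.

    # Illustrative sketch: error variance as the average of examinee-specific
    # error variances, and the standard error of measurement (SEM).
    from math import sqrt

    examinee_error_variances = [25, 4, 36]      # from Figure 1.4
    error_variance = sum(examinee_error_variances) / len(examinee_error_variances)
    print(round(error_variance, 2))             # 21.67

    sem = sqrt(round(error_variance, 2))        # square root of 21.67, as in the text
    print(round(sem, 2))                        # 4.66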

Although using a statistic like the standard error of measurement that employs the same scale as the test is useful for interpretation, it limits comparisons among similar tests that use different scales. For example, suppose there are two tests of eighth-grade ELA that are proportionally built to the same specifications, but each test has a different number of items. If one test contains 60 items and another contains 90 items, and the observed score is the reporting metric, the tests have two different scales and the standard error of measurement may not be used to compare the quality of the two measures. A scale-free index is necessary to compare measures that involve different scales.

    The Reliability Coefficient

Reliability is defined as the squared correlation between observed scores and true scores, and this index turns out to be the ratio of true score variance to observed score variance,

ρ²_XT = σ²_T / (σ²_T + σ²_E).    (1.3)

Unlike the standard error of measurement, the reliability coefficient is scale-independent, and it may be used to compare the quality of measurement procedures that use different scales. The downside is that the reliability coefficient does not tell us how far observed scores deviate from true scores in the original metric. Therefore, a useful practice is reporting reliability and the standard error of measurement.


Traub and Rowley (1991, p. 175) provide a succinct explanation of the reliability coefficient and the meaning of its metric:

• It is a dimensionless number (i.e., it has no units).
• The maximum value of the reliability coefficient is 1, when all the variance of observed scores is attributable to true scores.
• The minimum value of the coefficient is 0, when there is no true-score variance and all the variance of observed scores is attributable to errors of measurement.
• In practice, any test that we may use will yield scores for which the reliability coefficient is between 0 and 1; the greater the reliability of the scores, the closer to 1 the associated reliability coefficient will be.

The fact that all reliability coefficients behave in the manner described by Traub and Rowley may perpetuate the myth that a test has only one reliability coefficient. For many people, a description of the reliability coefficient metric is the sole framework for interpreting reliability. However, as Brennan (2001a) points out, the interpretation of reliability is founded upon the notion of replicating the measurement procedure. This distinction is important. All estimates of the reliability coefficient will behave as described by Traub and Rowley, but each coefficient is interpreted differently depending upon the manner in which the measurement procedure is replicated.

    Estimating Unobserved Quantities

Kelley (1947) provided a formula for estimating an examinee's true score. It is an empirical Bayes estimator that is computed as a weighted sum of an examinee's score, x, and the group average, μ_X, where the weight is an estimate of the reliability coefficient. Specifically, Kelley's equation is given by

T̂ = ρ̂²_XT x + (1 − ρ̂²_XT) μ̂_X.    (1.4)

When reliability equals 1, the estimated true score is the examinee's observed score. If reliability equals 0, the estimated true


score is the group average observed score. Kelley's equation is useful for reporting scores for examinees. All that is needed to obtain an estimated true score is an examinee's observed score, an estimate of the group average observed score, and an estimate of reliability.
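A direct implementation of Kelley's equation is brief. The sketch below is illustrative; the observed score of 85, group mean of 100, and reliability of .80 are assumed values rather than examples from the book.

    # Illustrative sketch of Kelley's equation (Equation 1.4): the estimated
    # true score is a reliability-weighted average of the examinee's observed
    # score and the group mean observed score.
    def kelley_true_score(observed, group_mean, reliability):
        return reliability * observed + (1.0 - reliability) * group_mean

    print(kelley_true_score(85.0, 100.0, 0.80))  # 88.0
    print(kelley_true_score(85.0, 100.0, 1.00))  # 85.0 (the observed score)
    print(kelley_true_score(85.0, 100.0, 0.00))  # 100.0 (the group mean)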

True score variance and error variance may be defined in terms of observable quantities: observed score variance and reliability. Rearranging the definition of the reliability coefficient (Equation 1.3) results in an expression for the true score variance, σ²_T = σ²_X ρ²_XT. True score variance may be estimated by substituting an estimate of observed score variance for σ²_X and an estimate of reliability for ρ²_XT.

Error variance may also be expressed in terms of observable quantities by rearranging Equation 1.3, σ²_E = σ²_X (1 − ρ²_XT). The square root of this value is the standard error of measurement (SEM), which is given by

SEM = σ_X √(1 − ρ²_XT).    (1.5)

Error variance and the SEM may also be estimated by substituting an estimate of observed score variance (or standard deviation) and reliability into the preceding equations.

A γ × 100% confidence interval for true scores is X ± z(SEM), where z is the (1 + γ)/2 quantile from the standard normal distribution. This interval assumes that true scores have a unit normal distribution. A 95% confidence interval is obtained when z = 1.96.
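Putting Equation 1.5 and the confidence interval together, the sketch below is illustrative; the observed standard deviation of 15, reliability of .84, and observed score of 85 are assumed values.

    # Illustrative sketch: SEM from the observed standard deviation and
    # reliability (Equation 1.5), and a 95% interval of the form X +/- z * SEM.
    from math import sqrt

    def sem(sd_x, reliability):
        return sd_x * sqrt(1.0 - reliability)

    def true_score_interval(x, sd_x, reliability, z=1.96):
        half_width = z * sem(sd_x, reliability)
        return (x - half_width, x + half_width)

    print(round(sem(15.0, 0.84), 2))              # 6.0
    print(true_score_interval(85.0, 15.0, 0.84))  # approximately (73.24, 96.76)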

This section defined classical test theory and presented some of the results of this definition, such as the decomposition of observed score variance and an expression for the reliability coefficient. To select an appropriate estimate of reliability and properly interpret it, the data collection design and underlying assumptions must be specified. Data collection designs will be discussed in the next chapter, along with implications for the proper interpretation of reliability. Chapter 3 will discuss different assumptions in classical test theory and explain how each further restricts the measurement procedure replications.


  • Classification Decisions

Test scores as numbers alone are meaningless. They must go through a process of scaling in order to take on meaning and have an interpretive frame of reference. Two common methods of providing a frame of reference for test scores are the creation of test norms and the establishment of cut-scores. Test norms reflect the distribution of scores for a given population, and they allow for the computation of auxiliary scores (Kolen, 2006), such as percentiles, which facilitate the interpretation of scores. A percentile score refers to the proportion of examinees that score at or below a particular level. For example, a score of 85 on a mathematics test says little about examinee achievement. If the norms indicate that a score of 85 falls in the 75th percentile, then we know that 75% of examinees obtained a score at or below 85. Moreover, this number is well above the median score and suggests an excellent level of mathematics achievement. The number 85 takes on much more meaning when reported in the context of test norms. Score interpretations in this situation are relative to other examinees, and a test used for this purpose is referred to as a norm-referenced test (Glaser, 1963).

Reporting test scores in a relative manner using percentiles or some other score based on norms does not always serve the purpose of testing. In many applications, such as licensure exams and high-stakes achievement tests, the purpose of testing is to determine where an examinee's score falls with respect to an absolute standard. A cut-score defines the boundary between two achievement levels, such as pass or fail, and this absolute standard is commonly established through a process of standard setting (see Cizek & Bunch, 2007). Licensure exams may have a single cut-score that defines the boundary between granting the license or not. Educational tests often have multiple achievement levels differentiated by multiple cut-scores. For example, the National Assessment of Educational Progress uses two cut-scores to define three achievement levels: Basic, Proficient, and Advanced (Allen, Carlson, & Zelenak, 1999, p. 251). Score interpretations in this setting are relative to the criterion (i.e., cut-score) and not other examinees. Tests that use cut-scores to distinguish different achievement levels are referred to as criterion-referenced tests (Glaser, 1963).

Norm-referenced and criterion-referenced tests differ not only in their implications for score interpretation, but also in their


meaning of score reliability (Popham & Husek, 1969). Norm-referenced tests are designed to spread out examinees to facilitate their rank ordering (i.e., relative comparison). As such, heterogeneity of true scores is desirable and will increase the classical test theory reliability coefficient. Criterion-referenced tests, on the other hand, are designed to classify examinees into achievement levels, and it is perfectly acceptable for all examinees to be placed in the same achievement level and have the same true score. As a result, the classical test theory reliability coefficient may be 0 even when all examinees are placed in the correct achievement level. Popham and Husek noted this distinction and indicated that reliability coefficients for norm-referenced tests are not appropriate for criterion-referenced tests.

Reliability for a criterion-referenced test must take into account the cut-score or cut-scores and evaluate the consistency of classifying examinees into achievement levels. One approach for defining a reliability coefficient for criterion-referenced tests involves squared error loss (see Livingston, 1972). This method is based on classical test theory, and it adjusts the classical test theory reliability coefficient to take into account the distance between the average score and the cut-score. Brennan and Kane (1977) extended this approach to generalizability theory, as will be discussed in that section. The supposition in squared error loss methods is that achievement-level classification will be more correct when the average score is far away from the cut-score. However, this method does not quantify the consistency or accuracy of classifying examinees into the achievement levels. Hambleton and Novick (1973) make an argument for a second approach, threshold loss, that specifically evaluates whether an examinee's score is above or below a cut-score.

The foundations for squared error methods are covered in the sections on classical test theory and generalizability theory. Foundational concepts for threshold loss methods are provided in detail below, given that they differ from those discussed in classical test theory and generalizability theory.

    Replicating the Measurement Procedure

Reliability methods for criterion-referenced tests make use of the notion of a domain score. A domain represents all of the items or


tasks (either real or imagined) that correspond to the content area or construct of interest. It is more than just the items on a test or items that have actually been produced. It is a larger entity that represents all possible items that are considered appropriate for a measurement procedure. A domain is a sampling frame that defines the characteristics of the items that may be included in a measurement procedure. For example, the domain may be a bank of 10,000 items or all possible items that could be automatically generated by a computer (see Bejar, 1991).

A domain score is the proportion of items in the domain that an examinee can answer correctly (Crocker & Algina, 1986, p. 193). It is akin to a true score in classical test theory. Like a true score, the definition of a domain score involves many more replications of the measurement procedure than those that actually take place. Take a common statistics analogy for an example. Suppose an urn is filled with 100 white balls and 300 red balls. The true proportion of red balls is 0.75. Rather than count every ball in the urn, this proportion can be estimated by taking a random sample of, say, 30 balls. The proportion of red balls in this sample of 30 is an estimate of the true proportion of red balls in the urn. To make a more direct connection to testing, the domain is the urn of red and white balls. A red ball represents a correctly answered item, and a white ball represents an incorrectly answered item. Each examinee has a different number of red balls in the urn. An examinee's domain score is the true proportion of red balls in the urn (i.e., proportion of correctly answered items), and the observed score is the number of red balls in the sample of 30. Because of the simple act of sampling items from the domain, an examinee's observed score will differ from the domain score. Some random samples may involve higher estimates due to the inclusion of more items that can be answered correctly in the sample than those that can be answered incorrectly. Other random samples may result in lower estimates because the sample involves more items that can be answered incorrectly than those that can be answered correctly. The different estimates in this case are due to randomly sampling items from the domain.

Observed scores will vary due to randomly sampling items from the domain. Consequently, the proportion of examinees above and below the cut-score will also vary from one replication to another. Decision consistency refers to the extent to which


examinees are placed into the same achievement level on two replications of the measurement procedure. It is also referred to as raw agreement or the proportion of agreement. Figure 1.5 illustrates decision consistency for a sample of 500 examinees. The vertical and horizontal lines intersecting the plot identify the cut-score. Examinees placed in the upper-right and lower-left quadrant represent consistent decisions. The other two quadrants represent inconsistent decisions. Decision consistency is based on observed scores. Therefore, it is often of interest to also know the extent to which observed score classifications match true score classifications.
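With two replications in hand, decision consistency is simply the proportion of examinees classified the same way on both. The sketch below is illustrative; the six pairs of scores and the cut-score of 0.75 are made-up values.

    # Illustrative sketch: decision consistency (proportion of agreement)
    # for two replications of a measurement procedure and one cut-score.
    def decision_consistency(scores_rep1, scores_rep2, cut_score):
        agree = sum(
            (x1 >= cut_score) == (x2 >= cut_score)
            for x1, x2 in zip(scores_rep1, scores_rep2)
        )
        return agree / len(scores_rep1)

    rep1 = [0.62, 0.81, 0.77, 0.70, 0.92, 0.74]
    rep2 = [0.66, 0.79, 0.71, 0.73, 0.88, 0.78]
    print(decision_consistency(rep1, rep2, 0.75))  # 4 of 6 classifications agree (about 0.67)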

Decision accuracy refers to the extent to which achievement-level classification on the basis of observed scores agrees with classification based on domain (i.e., true) scores. Figure 1.6 illustrates decision accuracy for the same data illustrated in Figure 1.5, using observed scores from the first replication. Notice that the y-axis now refers

[Figure 1.5. Decision Consistency for Two Replications and a Cut-score of 0.75. Second-replication observed scores are plotted against first-replication observed scores (both roughly 0.4 to 1.0); lines at the cut-score divide the plot into Pass/Pass, Fail/Fail, Pass/Fail, and Fail/Pass quadrants.]


to domain scores, and the upper-right and lower-left quadrants represent instances when examinees are classified by domain scores in the same way that they are classified using observed scores.

Approaches for estimating decision consistency and decision accuracy require that a continuous score be divided into two or more discrete categories. For example, a score of 85 is transformed to a score of Basic, Proficient, or Advanced depending on the location of the cut-scores. Estimates are then based on these categorical variables.

Methods for estimating decision consistency and decision accuracy from two replications are relatively straightforward and simply quantify the information graphically summarized in Figures 1.5 and 1.6. When only one replication is involved, the methods become noticeably more complicated and require strong assumptions about the nature of true and observed scores. Strong

[Figure 1.6. Decision Accuracy for a Cut-score of 0.75 (Note: ρ²_XT = 0.88). Domain scores are plotted against observed scores (roughly 0.4 to 1.0); lines at the cut-score divide the plot into Pass/True Pass, Fail/True Fail, Pass/True Fail, and Fail/True Pass quadrants.]


true score theory refers to methods that make strong assumptions about the nature of test scores. These methods are considerably more complicated than those used in classical test theory or generalizability theory.

Strong true score theory will be introduced in the next section. Any presentation of this material is complicated, but an attempt is made here to describe it in as simple a manner as possible without completely omitting it.

    Strong True Score Theory

Classical test theory assumes that test scores are continuous. It also stipulates certain relationships among the moments (e.g., mean and standard deviation) of the score distributions, although no particular distributions are specified. That is, the scores are not presumed to follow any specific probability distribution. Classical test theory is a distribution-free theory. Strong true score theory, on the other hand, does assume that scores follow a particular distribution. These additional assumptions are what give rise to the "strong" in strong true score theory. The advantage of strong true score theory is that scores may be modeled with probability distributions, and the fit of the model can be evaluated. However, the disadvantage is that stronger assumptions are made about the scores and, if these assumptions are not appropriate, the model is of limited utility.

    Distributional Assumptions

Domain Scores. The two-parameter beta distribution provides the probability of a domain score, τ, which ranges from 0 to 1. It has two shape parameters, a and b, and it is given by

f(τ) = τ^(a − 1) (1 − τ)^(b − 1) / B(a, b),    (1.6)

where B(a, b) denotes the beta function. One would not compute Equation 1.6 by hand. It would only be computed using a computer.


Lord (1965) introduced a four-parameter beta distribution that restricts the lower (l) and upper (u) limits of the beta distribution, such that 0 ≤ l ≤ u ≤ 1. It is given by

f(τ) = (τ − l)^(a − 1) (u − τ)^(b − 1) / [(u − l)^(a + b − 1) B(a, b)].    (1.7)

A benefit of these restrictions is that the lowest true scores may be restricted to account for random guessing. That is, domain scores lower than expected from random guessing have a probability of 0. Overall model fit, therefore, tends to be better with the four-parameter beta distribution.

    Conditional Distribution of Observed Scores. Once a domain scoreis known, the binomial distribution provides the probability of anobserved score for an n-item test. Specifically,

    f xj5 nx

    x 1 n x: 1:8

Unlike classical test theory, error scores in this model are not independent of domain scores. This feature results in a method for computing a conditional error variance, as described below. The challenge in computing this probability is that the examinee's domain score is unknown. However, it may be modeled with a probability distribution or estimated directly from the data, as described below.
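As a concrete illustration of Equation 1.8, the probability of each observed score can be computed once a domain score is supplied. The test length and domain score below are hypothetical.

```python
from scipy.stats import binom

n = 10        # hypothetical test length
theta = 0.7   # hypothetical domain score

# Probability of each possible observed score x, given the domain score (Equation 1.8)
for x in range(n + 1):
    print(x, binom.pmf(x, n, theta))
```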

Observed Scores. Keats and Lord (1962) demonstrated that when the conditional distribution of observed scores given domain scores is binomial, and the distribution of domain scores is a two-parameter beta distribution, the marginal distribution of observed scores follows a beta-binomial distribution,

$$h(x) = \binom{n}{x}\frac{B(x + a,\; n - x + b)}{B(a,b)}. \qquad (1.9)$$


Furthermore, the two parameters are given by

$$a = \left(-1 + \frac{1}{r_{21}}\right)\hat{\mu}_X \qquad (1.10)$$

and

$$b = -a + \frac{n}{r_{21}} - n. \qquad (1.11)$$

Equation 1.9 provides the probability of a particular observed score, x, given the number of test items and estimates of the mean observed score, $\hat{\mu}_X$, and the Kuder-Richardson formula 21 reliability coefficient, $r_{21}$ (Kuder & Richardson, 1937). Unfortunately, Equation 1.9 is too complicated to compute by hand and must be evaluated with a computer. However, the advantage of Equation 1.9 is that it provides a way of modeling observed scores without regard for domain scores, and it lends itself to other applications in measurement, such as smoothing scores for equipercentile equating, as well as the derivation of decision consistency indices.
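The following sketch shows one way to evaluate Equations 1.9 through 1.11 with SciPy. The mean observed score and KR-21 value are invented for illustration; scipy.stats.betabinom supplies the beta-binomial probabilities.

```python
from scipy.stats import betabinom

n = 40          # number of test items
mean_x = 28.0   # hypothetical mean observed score
kr21 = 0.85     # hypothetical KR-21 reliability estimate

# Equations 1.10 and 1.11: beta parameters from the mean and KR-21
a = (-1 + 1 / kr21) * mean_x
b = -a + n / kr21 - n

# Equation 1.9: marginal probability of an observed score
dist = betabinom(n, a, b)
print(a, b)
print(dist.pmf(30))                              # probability of an observed score of 30
print(sum(dist.pmf(x) for x in range(n + 1)))    # the probabilities sum to 1
```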

Lord (1965) combined the four-parameter beta distribution with the binomial distribution to compute a different form of Equation 1.9. His four-parameter beta-binomial distribution is not shown here, but it too will be denoted $h(x)$.

Estimating Domain Scores. If no distributional assumptions are made, an examinee's domain score may be estimated directly from the data using the maximum likelihood estimator, $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x_i$, which is the proportion of items an examinee answers correctly. An empirical Bayes estimator is also possible under the condition that the examinee is from a group with a unimodal score distribution with group mean, $\hat{\mu}_X$. This estimator of the domain score is

$$\tilde{\theta} = r_{21}\,\hat{\theta} + (1 - r_{21})\,\frac{\hat{\mu}_X}{n}. \qquad (1.12)$$


This formula is Kelley's equation based on domain scores. If a measure has an $r_{21}$ estimate of 1, then the maximum likelihood and empirical Bayes estimators are the same.
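A small sketch of the two estimators, using made-up numbers: the maximum likelihood estimate is simply the proportion correct, and Equation 1.12 shrinks that proportion toward the group mean.

```python
def ml_domain_score(num_correct, n):
    """Maximum likelihood estimate: the proportion of items answered correctly."""
    return num_correct / n

def eb_domain_score(num_correct, n, kr21, mean_x):
    """Empirical Bayes (Kelley) estimate of the domain score, Equation 1.12."""
    theta_hat = ml_domain_score(num_correct, n)
    return kr21 * theta_hat + (1 - kr21) * (mean_x / n)

# Hypothetical values: 33 of 40 items correct, KR-21 = 0.85, group mean = 28
print(ml_domain_score(33, 40))            # 0.825
print(eb_domain_score(33, 40, 0.85, 28))  # pulled toward the group mean proportion of 0.70
```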

If domain scores are assumed to follow a two- or four-parameter beta distribution, the posterior distribution of domain scores given observed scores may be computed using Bayes' theorem (see Hogg & Tanis, 2001) as

$$f(\theta \mid x) = \frac{f(x \mid \theta)\,f(\theta)}{h(x)}. \qquad (1.13)$$

In this way, domain score estimates and 95% credible intervals for domain scores may be computed.
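When the prior is the two-parameter beta of Equation 1.6, the posterior in Equation 1.13 has a well-known closed form: it is itself a beta distribution with parameters a + x and b + n - x. The sketch below uses this conjugacy to obtain a posterior mean and a 95% credible interval; all numeric values are illustrative.

```python
from scipy.stats import beta

a, b = 4.9, 2.1    # illustrative beta parameters for the domain score distribution
n, x = 40, 33      # illustrative test length and observed score

# Posterior of the domain score given x: Beta(a + x, b + n - x)
posterior = beta(a + x, b + n - x)

print(posterior.mean())          # posterior estimate of the domain score
print(posterior.interval(0.95))  # 95% credible interval for the domain score
```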

Alternatively, a $\gamma \cdot 100\%$ confidence interval for true scores may be obtained from the binomial error model by solving Equation 1.8 for the lower and upper bound. Clopper and Pearson (1934) define the lower, $\theta_L$, and upper, $\theta_U$, bounds of the confidence interval as

$$\sum_{i=X}^{n}\binom{n}{i}\theta_L^{\,i}(1-\theta_L)^{n-i} = \frac{1-\gamma}{2} \quad\text{and}\quad \sum_{i=0}^{X}\binom{n}{i}\theta_U^{\,i}(1-\theta_U)^{n-i} = \frac{1-\gamma}{2}. \qquad (1.14)$$
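Equation 1.14 need not be solved by trial and error. The bounds can be obtained from quantiles of the beta distribution, which is one standard way to compute the Clopper-Pearson interval; the values of X, n, and gamma below are illustrative.

```python
from scipy.stats import beta

def clopper_pearson(x, n, gamma=0.95):
    """Clopper-Pearson confidence bounds for a domain score (Equation 1.14),
    computed from beta-distribution quantiles."""
    alpha = 1 - gamma
    lower = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper

# Hypothetical example: 33 of 40 items correct, 95% confidence
print(clopper_pearson(33, 40))
```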

    Reliability and Conditional Standard Error of Measurement

In strong true score theory, reliability is still the ratio of true score variance to observed score variance. Although strong true score theory is not developed in the same manner as classical test theory, it leads to some of the same reliability estimates as classical test theory, such as the Kuder-Richardson formulas 20 and 21 (Kuder & Richardson, 1937). Chapter 4 discusses these estimates in more detail.

In strong true score theory, the binomial distribution provides a relatively easy method for obtaining examinee-specific error


variances (i.e., error variance conditional on true score). The binomial distribution has a variance of $n\theta(1-\theta)$ (see Hogg & Tanis, 2001), and Lord (1955b) showed that, when adopting the notion of sampling items from a domain, this formula is the error variance for a given score. By transforming the domain score to the raw score metric and denoting it by $\tau$, domain scores take on a value between 0 and the total number of items, n. The variance of the binomial distribution may then be expressed as $\tau(n-\tau)/n$. Keats (1957) improved upon this estimator by incorporating a small-sample correction to obtain $\tau(n-\tau)/(n-1)$. He also multiplied this value by the adjustment factor $(1-\rho^2_{XT})/(1-r_{21})$ to account for small differences in item difficulty. Taking the square root of this conditional error variance results in an estimate of the standard error of measurement conditional on true score,

$$\mathrm{CSEM}(\tau) = \sqrt{\frac{1-\rho^2_{XT}}{1-r_{21}}\cdot\frac{\tau(n-\tau)}{n-1}}. \qquad (1.15)$$

Keats further demonstrated that the average value of the conditional error variance equals the error variance (i.e., the squared standard error of measurement) of classical test theory. To compute the conditional standard error of measurement, an estimate of reliability is substituted for

$\rho^2_{XT}$. The important implication of Equation 1.15 is that error variance may be computed for specific score levels. This contrasts with classical test theory, in which all examinees are assigned the same error variance (see Equation 1.5), regardless of their true score.
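A minimal sketch of Equation 1.15, expressed on the raw score metric; the reliability and KR-21 values are invented for illustration.

```python
import math

def csem(tau, n, rel_xt, kr21):
    """Conditional standard error of measurement (Equation 1.15) for a
    true score tau on the raw score metric of an n-item test."""
    adjustment = (1 - rel_xt) / (1 - kr21)
    return math.sqrt(adjustment * tau * (n - tau) / (n - 1))

# Hypothetical 40-item test with reliability 0.88 and a KR-21 of 0.85:
# the error is largest near the middle of the score scale and shrinks at the extremes.
for tau in (10, 20, 30, 38):
    print(tau, round(csem(tau, 40, 0.88, 0.85), 3))
```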

This section introduced squared error and threshold loss methods for evaluating reliability in criterion-referenced tests. Decision consistency and decision accuracy were defined, and strong true score concepts necessary for estimating these quantities from a single replication were briefly discussed. The methods presented in this section suffer from the same weakness as classical test theory: Only one source of error is reflected in a single estimate. Multiple sources of error can only be considered simultaneously using generalizability theory.


Generalizability Theory

A reliability analysis conducted via generalizability theory consists of two distinct steps, and each step has its own terminology and concepts. The first step is a generalizability study, and its purpose is to identify important characteristics of the measurement procedure and evaluate the amount of score variability attributed to each characteristic. The second step is a decision study, and its purpose is to determine the dependability of scores obtained from a measurement procedure and possibly design more efficient measurement procedures.

    Generalizability Study

In the discussion of classical test theory, two types of sampling were described: sampling of examinees and replicating the measurement procedure. These types of sampling are more formally integrated into generalizability theory, and, as discussed in Chapter 3, the sampling is assumed to be random. The population defines characteristics of the objects of measurement that may be sampled for participation in the measurement procedure. Examinees typically constitute the object of measurement, but generalizability theory permits other entities, such as expert performance raters, to be the object of measurement. In what follows, examinees are the objects of measurement. A desirable source of score variability arises from sampling examinees from the population. Undesirable sources of variability result from the process of sampling other aspects of the measurement procedure.

Sampling (i.e., replicating) the measurement procedure contributes to measurement error. Classical test theory and strong true score theory only permit one source of sampling in addition to the sampling of examinees. Therefore, only one source of measurement error may be considered at any one time. Generalizability theory, on the other hand, allows sampling from multiple sources in addition to sampling examinees. This multifaceted sampling permits error to be partitioned into multiple sources, and the influence of each source or all sources, as well as their interactions, may be considered simultaneously. To properly conduct or conceptualize this multifaceted sampling, the sampling frame must be clearly defined.


Sources of measurement error are referred to as facets in generalizability theory. Test items are an example of an item facet. The occasion on which a measurement procedure is conducted (e.g., morning, afternoon, and evening administrations) constitutes an occasion facet. A measurement procedure may have one facet or many facets. The number of facets selected depends on the major sources of error present in a measurement procedure. However, it is wise to select only those sources of measurement error that are likely to have the greatest impact on scores. Model complexity increases exponentially as the number of facets increases. The universe of admissible observations refers to all of the facets from which samples may be drawn in order to create an instance of the measurement procedure. It is a sampling frame that specifies the characteristics of every facet that may be included in a measurement procedure.

In a generalizability study, the population and universe of admissible observations are defined, as well as the observed universe design. A design specifies (a) the organization of the facets, (b) the facet sample size, and (c) the size of the universe. When discussing design considerations for the generalizability study, the phrase observed universe design or observed design is used to indicate the design that resulted in actual data. It reflects the characteristics of an observed measurement procedure. An observed design pertains to the universe of admissible observations and is part of the generalizability study because the data are used to estimate variance components for individual observations of each facet. Design considerations in the decision study, discussed later, are referred to as the data collection design. This distinction is important because, in the generalizability study, the design pertains to single observations (e.g., a single item), but it refers to a collection of observations (e.g., a group of items) in the decision study. Moreover, the observed design and data collection design may be the same or they may be different.

Observed designs may involve a single facet or multiple facets. A design with a single facet is referred to as a single-facet design. One with two facets is referred to as a two-facet design. Designs with more than two facets are referenced in a similar manner. Once a number of facets have been specified, aspects of their organization must be described.

A facet is crossed with another facet when two or more conditions of one facet are observed with each condition of another


facet. Consider a design that involves an item facet and an occasion facet. When all items are administered on all occasions, the item and occasion facets are crossed (see Fig. 1.7). Crossing is denoted with a multiplication sign. For example, Item × Occasion would be read as "item crossed with occasion." This notation may be simplified to i × o. (Note that this text uses Brennan's [2001b] notational conventions.) A facet may also be crossed with the object of measurement. When all objects of measurement, say examinees, respond to all items, examinees and items are crossed. This design would be denoted as p × i, where p refers to examinees (i.e., persons). Data collection designs in which all facets and objects of measurement are crossed are referred to as fully crossed designs.

Cronbach noted that generalizability theory's strength lies in its ability to specify designs that are not fully crossed (Cronbach & Shavelson, 2004). A nested facet (call it A) occurs when (a) two or more conditions of A are observed with each condition of another facet (referred to as B), and (b) each condition of B contains different levels of A (Shavelson & Webb, 1991, p. 46). For example, suppose a measurement procedure is conducted on two occasions, and a different set of items is administered on each occasion. Items are nested within occasion (see Fig. 1.8). Nesting is denoted with a colon. For example, Items : Occasion would be read as "items nested within occasion." For brevity, this relationship may also be denoted as i : o. A facet may also be nested within the object of

Figure 1.7. Diagram of Items Crossed with Occasions (i.e., Day of Administration). [Figure: Items 1 through 5 each appear under both Day 1 and Day 2.]


measurement. For example, items are nested within examinees if each examinee is given a different set of test items, a situation commonly encountered in computerized adaptive testing (see van der Linden & Glass, 2000). This design may be denoted i : p. Designs that do not involve any crossed facets are referred to as nested or fully nested designs. Those that involve a combination of nested and crossed facets are referred to as partially nested designs.

Facet sample sizes are specified in the observed design strictly for the purposes of observing a measurement procedure and obtaining data. These sample sizes are denoted with a lowercase n with a subscript that represents the facet. For example, the observed sample size for the item facet is denoted $n_i$. The linear model effects and the variance components in a generalizability study (described below) refer to facet sample sizes of 1. For example, a measurement procedure may be observed by administering a 64-item test to 2,000 examinees. Every examinee responds to the same 64 items. Therefore, this design is crossed. Variance components estimated in the generalizability study from these data refer to variance attributable to a single item or the interaction of an examinee and a single item. The data collection design in the decision study (discussed later) specifies facet sample sizes for the purpose of estimating error variance and reliability for a collection of elements from a facet (i.e., a group of test items). Facet sample sizes play a greater role in the decision study.

When defining the observed design, the size of the universe must be considered. Random universes are unrestricted and larger than the particular conditions included in a measurement

Figure 1.8. Diagram of Items Nested Within Occasion (i.e., Day of Administration). [Figure: Items 1 and 2 appear under Day 1; Items 3 and 4 appear under Day 2.]


procedure. An item facet is random if the items that appear on a test are randomly sampled from, or considered exchangeable with, all possible items that could have been included on the test. The ELA item pool scenario described earlier is an example. In that scenario, a 50-item ELA test was created from an item pool of 200 items. The item pool serves as the universe, and the 50 selected items are part of a specific instance of the measurement procedure. Presumably the 50 items were selected at random, or the test designer is willing to exchange any one of the 50 items for any other item in the pool. In this case, the item facet is considered to be random. Conversely, fixed universes are restricted to those conditions that are actually included in the measurement procedure. For example, the item facet is fixed if there was no item pool and the 50 ELA items were considered to exhaust all of the items in the universe. The difference between random and fixed universes is more than a semantic one. Each implies a different generalization and interpretation of score reliability.

An observed universe design may be characterized using a linear model, and many different models are possible in generalizability theory. Each model decomposes a single examinee's score on single observations of the facets. Three designs will be considered in this text: p × i, p × i × o, and p × i : o. In the p × i design, all examinees respond to all items. The score for an examinee on a single item, $X_{pi}$, is the sum of four parts,

$$X_{pi} = \mu + \nu_p + \nu_i + \nu_{pi}, \qquad (1.16)$$

where $\mu$ refers to the grand mean (i.e., the average score over all examinees and all facets), $\nu_p$ refers to the person effect, and $\nu_i$ refers to the item effect. The term $\nu_{pi}$ reflects the interaction between

persons and items. However, this latter effect is confounded with the residual, given that there is only one observation per cell. This effect is due to the interaction of the person and item, as well as any other random source of error not reflected in the design. Because of this confounding, it is referred to as the residual effect.

The notation used here is not entirely descriptive of the effects in the linear model. For example, the item effect is actually $\nu_i = \mu_i - \mu$. The shortened versions of these effects are used herein for brevity


and to reduce the technical aspects as much as possible to facilitate an understanding of reliability. The complete linear models are described in Brennan (2001b) and Shavelson and Webb (1991).

A p × i × o design with an item facet and an occasion facet requires that all examinees respond to all items on all occasions. Scores are described as

$$X_{pio} = \mu + \nu_p + \nu_i + \nu_o + \nu_{pi} + \nu_{po} + \nu_{io} + \nu_{pio}. \qquad (1.17)$$

Effects due to person, item, and the person-by-item interaction are the same as stated previously. The effect due to occasion is $\nu_o$. Two-way interactions between person and occasion, $\nu_{po}$, and item and occasion, $\nu_{io}$, and the three-way interaction between person, item, and occasion, $\nu_{pio}$, are also part of this model.

Finally, a score from a two-facet design with items nested within occasion, p × i : o, is characterized by

$$X_{pio} = \mu + \nu_p + \nu_o + \nu_{i:o} + \nu_{po} + \nu_{pi:o}. \qquad (1.18)$$

Nesting items within occasion reduces the number of effects that influence scores. There are only five effects in this model but seven effects in the two-facet fully crossed design.

Variance Components. A generalizability study is conducted to estimate the amount of score variance associated with each effect in the universe of admissible observations. Each effect in the linear model has its own variance component. That is, total score variability is decomposed into variance due to each effect and their interactions. In the p × i design, the variance components are

$$\sigma^2(X_{pi}) = \sigma^2(p) + \sigma^2(i) + \sigma^2(pi). \qquad (1.19)$$

Item variance, $\sigma^2(i)$, reflects the extent to which items vary in difficulty. Variance due to the person-by-item interaction, $\sigma^2(pi)$, indicates variability associated with items that are difficult for some examinees yet easy for others. Finally, $\sigma^2(p)$ indicates the


amount of variability among examinees due to the underlying construct. Each of these effects is found in the more elaborate two-facet crossed design.

Variance components for the p × i × o design are

$$\sigma^2(X_{pio}) = \sigma^2(p) + \sigma^2(i) + \sigma^2(o) + \sigma^2(pi) + \sigma^2(po) + \sigma^2(io) + \sigma^2(pio). \qquad (1.20)$$

    Scores may be higher on some occasions than others. Variability of

scores due to occasion is reflected in $\sigma^2(o)$. Item difficulty may vary depending on occasion. That is, items may be difficult on one occasion but easy on another. This source of variability is reflected in $\sigma^2(io)$. Variance for the person-by-occasion interaction, $\sigma^2(po)$, is interpreted similarly: variability of examinee performance from one occasion to another. Finally, variance due to the interaction of persons, items, and occasions, as well as any remaining source of random error, is reflected in the residual term, $\sigma^2(pio)$. The p × i : o design consists of fewer variance components than the p × i × o design due to nesting. These components are

$$\sigma^2(X_{pio}) = \sigma^2(p) + \sigma^2(o) + \sigma^2(i{:}o) + \sigma^2(po) + \sigma^2(pi{:}o). \qquad (1.21)$$

The two components unique to this design are $\sigma^2(i{:}o)$ and $\sigma^2(pi{:}o)$. The former term reflects variability of item difficulty within an occasion. The latter term corresponds to the interaction between persons and items nested within occasion, as well as other sources of random variability. That is, $\sigma^2(pi{:}o)$ is also a residual term.

In a generalizability study, variance components are estimated using ANOVA methods. Estimates are then interpreted in a relative manner by computing the proportion of variance explained by each component (i.e., the ratio of one source of variance to the total variance). Examples of these computations will be provided later. Variance component estimates are used in a decision study to evaluate the dependability of scores and to design efficient measurement procedures.
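To make the ANOVA estimation concrete, the sketch below estimates the p × i variance components from a complete persons-by-items score matrix using the usual expected-mean-squares solution for a random-effects design. It is a bare-bones illustration with a made-up data matrix, not the estimation procedure developed later in this book.

```python
import numpy as np

def g_study_p_by_i(scores):
    """G-study variance components for the p x i design, estimated from a
    persons-by-items score matrix via expected mean squares (complete data assumed)."""
    n_p, n_i = scores.shape
    grand_mean = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    # Sums of squares for persons, items, and the residual (person-by-item) effect
    ss_p = n_i * np.sum((person_means - grand_mean) ** 2)
    ss_i = n_p * np.sum((item_means - grand_mean) ** 2)
    ss_pi = np.sum((scores - grand_mean) ** 2) - ss_p - ss_i

    # Mean squares
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

    # Expected mean squares solved for the variance components of Equation 1.19
    return {"p": (ms_p - ms_pi) / n_i, "i": (ms_i - ms_pi) / n_p, "pi": ms_pi}

# Hypothetical 0/1 item scores for 4 persons and 3 items
data = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1]], dtype=float)
print(g_study_p_by_i(data))
```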


Decision Study

Scores in a generalizability study reflect the contribution of single observations of each facet. In the p × i design, the score $X_{pi}$ is a person's score on an individual item. In the p × i × o design, the score $X_{pio}$ is a person's score on a single item on a single occasion.

Variance components estimated in a generalizability study tell us how much each facet and facet interaction contributes to variability of these single-observation scores. In practice, tests are rarely composed of a single item or a single item administered on a single occasion. Rather, scores are based on a collection of items or a collection of items obtained from a collection of occasions. A decision study provides the framework for evaluating the dependability of scores obtained by averaging over a collection of facets. More importantly, the universe of generalization is defined in a decision study by specifying the number of observations of each facet that constitutes a replication of the measurement procedure and other details of the data collection design.

The number of observations of each facet (i.e., the facet sample size) is denoted in the same way as the facet sample size in the observed design, but a prime mark is added. For example, the item facet sample size is denoted $n'_i$. The prime is added because the sample size in the decision study may differ from the sample size in the observed design.

A decision study allows one to evaluate the reliability of different data collection designs that may be the same as or different from the observed design. Decision study facet sample sizes may be the same as or different from the actual sample sizes used when implementing the measurement procedure in a generalizability study. For example, a 60-item test may have been administered to examinees when conducting the generalizability study ($n_i = 60$), but a researcher may be interested in evaluating the reliability of a shorter 45-item test in the decision study ($n'_i = 45$). As another example, suppose a 60-item test is administered over a two-day period (30 items per day), but the researcher notices examinees experiencing notable fatigue during testing each day. Data from the two-day testing could be used in a decision study to evaluate the reliability of extending testing to four days but reducing the number of items per day to 15. A decision study allows one to consider various data collection designs, as well as different ways to organize the facets.


A variety of methods for organizing facets may be evaluated in a decision study. For example, a p × i observed design may be used to evaluate a p × I or an I : p data collection design (use of the capital letter I will be explained shortly). Or, a p × i × o observed design may be used to evaluate a p × I × O, p × I : O, or I : O : p data collection design. The number of possible data collection designs is not unlimited, however. Nested facets in the generalizability study cannot be crossed in the decision study. An i : p generalizability study cannot be used to conduct a p × I decision study. Therefore, a generalizability study should be conducted with as many facets crossed as possible.

The size of the universe of generalization may be the same as or different from the universe of admissible observations. An infinite (i.e., random) facet in the universe of admissible observations may be fixed in the universe of generalization as long as there is at least one random facet in the design. The converse is not true. A fixed facet in the universe of admissible observations may not be made infinite in the universe of generalization. Therefore, to have the most flexibility in conducting a decision study, the generalizability study should be conducted with as many random facets as possible.

Linear models in a decision study reflect the composite nature of scores. For example, the linear model for the p × I design with $n'_i$ items is given by

$$X_{pI} = \mu + \nu_p + \nu_I + \nu_{pI}. \qquad (1.22)$$

The score $X_{pI}$ refers to an average score over $n'_i$ items. The capital letter I denotes an average over the $n'_i$ items. More generally, a capitalized facet index indicates an average over that facet (see Brennan, 2001b). This notation is intentional, and it distinguishes a decision study linear model from a generalizability study linear model. Each effect in the model is interpreted with respect to the collection of observations for each facet. The item effect, $\nu_I$, refers to the effect of $n'_i$ items (i.e., a test form), and the person-by-item interaction refers to the differential effect of test forms for examinees. Linear models for the two-facet designs are adjusted in a similar manner.


In the p × I × O design, an examinee's average score from $n'_i$ items and $n'_o$ occasions is given by

$$X_{pIO} = \mu + \nu_p + \nu_I + \nu_O + \nu_{pI} + \nu_{pO} + \nu_{IO} + \nu_{pIO}. \qquad (1.23)$$

Nesting items within occasion reduces the total number of effects in the decision study linear model, just as it did in the generalizability study linear model. In the p × I : O design, an examinee's average score from $n'_i$ items that are nested within $n'_o$ occasions is

$$X_{pIO} = \mu + \nu_p + \nu_O + \nu_{I:O} + \nu_{pO} + \nu_{pI:O}. \qquad (1.24)$$

Each of these linear models shows that observed scores are composed of a number of effects, all of which influence scores upon each replication of the measurement procedure.

Replicating the Measurement Procedure. The universe of generalization includes all possible randomly parallel instances of a measurement procedure. For example, a universe of generalization may involve all possible 60-item eighth grade mathematics test forms that could be administered to every examinee. Another universe of generalization may involve all possible 30-item sixth grade ELA test forms that could be administered on one day, and all possible 30-item sixth grade ELA test forms that could be administered on a second day. Each particular instance of the universe of generalization constitutes a replication of the measurement procedure. For example, one 60-item eighth grade mathematics test administered to all examinees is one replication of the measurement procedure. A second 60-item eighth grade mathematics test administered to all examinees is another replication of the measurement procedure. As in classical test theory, the score a person obtains from one replication will differ from the score obtained on another replication because of random error (i.e., sampling from each facet). However, in generalizability theory, scores observed in each replication are affected by multiple sources of error, not just a single source of error, as in classical test theory. All of the terms in Equations 1.22, 1.23, and 1.24 except $\mu$ and $\nu_p$ may be thought of


as error scores. However, the exclusion of $\mu$ and $\nu_p$ does not imply that these two effects are true scores.

Generalizability theory does not make use of true scores as they were defined in classical test theory. A similar concept exists, and it depends on the definition of a universe of generalization. A universe score is the average of all possible observed scores a person would obtain from every possible randomly parallel replication of the measurement procedure. For example, imagine that a random sample of 60 mathematics items was given to an examinee, and the examinee's average score was computed. Suppose then that another random sample of 60 mathematics items was selected and administered to the examinee, and another average score was computed. Repeating the process of randomly sampling items and computing an average score a large number of times would result in a distribution of observed scores. The universe score would be the average value of all of these scores. A histogram of these scores would look similar to Figure 1.2, but the x-axis would be average scores, not sum scores.

In generalizability theory, we seek to determine how well an observed score generalizes to the universe score in the universe of generalization (Brennan, 2001b). This notion explains the use of the term generalizability in generalizability theory. To evaluate this generalization, the amount of variance associated with observed scores must be determined. If observed score variance is primarily due to universe score variance (i.e., the variance of universe scores among a sample of examinees), then observed scores may be reliably generalized to universe scores. That is, we can have confidence that observed scores are close to universe scores. Conversely, if observed score variance is mostly due to error variance, then observed scores will not reliably generalize to universe scores, and we cannot have confidence that observed scores are close to universe scores.

Sources of Variance. It is well known in statistics that if scores $x_1, x_2, \ldots, x_n$ are independently and identically distributed with a given mean, $\mu$, and variance, $\sigma^2$, then the distribution of the average of these scores, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, has mean $\mu$ and variance $\sigma^2/n$ (see Hogg & Tanis, 2001). That is, the variance for an average score is the variance for an individual score divided by the sample


size. This result is commonly encountered in statistics classes, where its square root is known as the standard error. It also plays an important role in a decision study.

Variance components estimated in a generalizability study pertain to individual observations, and these components are used to estimate variance components for an average of observations in a decision study. Therefore, a decision study variance component is usually the generalizability study variance component divided by the facet sample size. (Usually, because in some designs, multiple generalizability study variance components are combined before division by the sample size.) An obvious result of dividing a variance component by the facet sample size is that the amount of variance decreases. Consider the variance components for the p × I design, which are given by

$$\sigma^2(X_{pI}) = \sigma^2(p) + \frac{\sigma^2(i)}{n'_i} + \frac{\sigma^2(pi)}{n'_i}. \qquad (1.25)$$

    Three sources of variance are evident in this equation. Universe

score variance is denoted $\sigma^2(p)$, and it indicates the amount that scores vary due to real differences among examinees. Variance attributed to the group of $n'_i$ items (i.e., a test form) and the person-by-form interaction is reflected in the remaining two terms on the right-hand side. More importantly, these latter terms represent two sources of error variance that decrease as the number of items increases.
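A short sketch of Equation 1.25 with made-up generalizability study estimates: dividing the item and person-by-item components by the decision study sample size shrinks both error terms, while the universe score variance is unchanged.

```python
def d_study_p_by_I(var_p, var_i, var_pi, n_i_prime):
    """Decision study variance components for the p x I design (Equation 1.25)."""
    return {
        "p": var_p,                # universe score variance, unchanged by test length
        "I": var_i / n_i_prime,    # variance due to a form of n' items
        "pI": var_pi / n_i_prime,  # person-by-form interaction (residual)
    }

# Hypothetical G-study estimates, evaluated at several test lengths
for n_items in (10, 45, 60):
    print(n_items, d_study_p_by_I(var_p=0.04, var_i=0.02, var_pi=0.16, n_i_prime=n_items))
```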

Increasing the number of facets increases the sources of error variance, as is evident in the p × I × O design. Seven sources of variance comprise the total observed score variance,

$$\sigma^2(X_{pIO}) = \sigma^2(p) + \frac{\sigma^2(i)}{n'_i} + \frac{\sigma^2(o)}{n'_o} + \frac{\sigma^2(pi)}{n'_i} + \frac{\sigma^2(po)}{n'_o} + \frac{\sigma^2(io)}{n'_i n'_o} + \frac{\sigma^2(pio)}{n'_i n'_o}. \qquad (1.26)$$

One source is universe score variance, while the others are sources of error variance. Notice that the variance for interacting facets is


divided by the product of two facet sample sizes. Therefore, increasing a facet sample size reduces not only the variance for that facet but also the other variance terms that involve that facet.

Similar observations can be made among the p × I : O design variance components,

$$\sigma^2(X_{pIO}) = \sigma^2(p) + \frac{\sigma^2(o)}{n'_o} + \frac{\sigma^2(i{:}o)}{n'_i n'_o} + \frac{\sigma^2(po)}{n'_o} + \frac{\sigma^2(pi{:}o)}{n'_i n'_o}. \qquad (1.27)$$

Universe score variance and four sources of error variance are evident in this design. Although there are fewer sources of error variance in the p × I : O design than in the p × I × O design, it is not necessarily true that the total amount of variance will be smaller in the former design than in the latter.

Types of Decisions. At the beginning of the section on classification decisions, relative and absolute score scales were discussed. A relative method of scaling compares one examinee's score to another examinee's score. Absolute scaling compares an examinee's score to some standard, suc