
Technical Report

August 2000

TR-15

Strengthening the Ties that Bind: Improving the Linking Network in Sparsely Connected Rating Designs

Carol M. Myford
Edward W. Wolfe


Strengthening the Ties that Bind: Improving the Linking Network in Sparsely Connected Rating Designs

Carol M. Myford

Edward W. Wolfe

Educational Testing Service Princeton, New Jersey

RR-00-9


Educational Testing Service is an Equal Opportunity/Affirmative Action Employer.

Copyright © 2000 by Educational Testing Service. All rights reserved.

No part of this report may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Violators will be prosecuted in accordance with both U.S. and international copyright laws.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRE, SPEAK, TOEFL, the TOEFL logo, and TSE are registered trademarks of Educational Testing Service. The modernized ETS logo is a trademark of Educational Testing Service.

FACETS Software is copyrighted by MESA Press, University of Chicago.

MICROSOFT is a registered trademark of Microsoft Corporation.


Abstract

The purpose of this study was to evaluate the effectiveness of a strategy for linking raters when there are large numbers of raters involved in a scoring session and the overlap among raters is minimal. In sparsely connected rating designs, the number of examinees any given pair of raters has scored in common is very limited. Connections between raters may be weak and tentative at best. The linking strategy we employed involved having all raters in a Test of Spoken English (TSE®) scoring session rate a small set of six benchmark audiotapes, in addition to those examinee tapes that each rater scored as part of his or her normal workload. Using output from Facets analyses of the rating data, we looked at the effects of embedding blocks of ratings from various smaller sets of these benchmark tapes on key indicators of rating quality. We found that all of our benchmark sets were effective for establishing at least the minimal connectivity needed in the rating design in order to allow placement of all raters and all examinees on a single scale. When benchmark sets were used, the highest scoring benchmarks (i.e., those examinees that scored 50s and 60s across the items) produced the highest quality linking (i.e., the most stable linking). The least consistent benchmark sets (i.e., those that were somewhat harder to rate because an examinee's performance varied across items) tended to provide fairly stable links. The most consistent benchmarks (i.e., those that were somewhat easier to rate because an examinee's performance was similar across items) and middle scoring benchmarks (i.e., those from examinees who scored 30s and 40s across the items) tended to provide less stable linking. Low scoring benchmark sets provided the least stable linking. When a single benchmark tape was used, the highest scoring single tape provided higher quality linking than either the least consistent or most consistent benchmark tape.

Key words: speaking assessment, Item Response Theory (IRT), performance assessment, quality control, rater performance, Rasch measurement, Facets


The Test of English as a Foreign Language (TOEFL®) was developed in 1963 by the National Council on the Testing of English as a Foreign Language. The Council was formed through the cooperative effort of more than 30 public and private organizations concerned with testing the English proficiency of nonnative speakers of the language applying for admission to institutions in the United States. In 1965, Educational Testing Service (ETS®) and the College Board assumed joint responsibility for the program. In 1973, a cooperative arrangement for the operation of the program was entered into by ETS, the College Board, and the Graduate Record Examinations (GRE®) Board. The membership of the College Board is composed of schools, colleges, school systems, and educational associations; GRE Board members are associated with graduate education.

ETS administers the TOEFL program under the general direction of a policy board that was established by, and is affiliated with, the sponsoring organizations. Members of the TOEFL Board (previously the Policy Council) represent the College Board, the GRE Board, and such institutions and agencies as graduate schools of business, junior and community colleges, nonprofit educational exchange agencies, and agencies of the United States government.

✥ ✥ ✥

A continuing program of research related to the TOEFL test is carried out under the direction of the TOEFL Committee of Examiners. Its 11 members include representatives of the TOEFL Board, and distinguished English as a second language specialists from the academic community. The Committee meets twice yearly to review and approve proposals for test-related research and to set guidelines for the entire scope of the TOEFL research program. Members of the Committee of Examiners serve three-year terms at the invitation of the Board; the chair of the committee serves on the Board.

Because the studies are specific to the TOEFL test and the testing program, most of the actual research is conducted by ETS staff rather than by outside researchers. Many projects require the cooperation of other institutions, however, particularly those with programs in the teaching of English as a foreign or second language and applied linguistics. Representatives of such programs who are interested in participating in or conducting TOEFL-related research are invited to contact the TOEFL program office. All TOEFL research projects must undergo appropriate ETS review to ascertain that data confidentiality will be protected.

Current (1999-2000) members of the TOEFL Committee of Examiners are:

Diane Belcher, The Ohio State University
Richard Berwick, Ritsumeikan Asia Pacific University
Micheline Chalhoub-Deville, University of Iowa
JoAnn Crandall (Chair), University of Maryland, Baltimore County
Fred Davidson, University of Illinois at Urbana-Champaign
Glenn Fulcher, University of Surrey
Antony J. Kunnan (Ex-Officio), California State University, Los Angeles
Ayatollah Labadi, Institut Superieur des Langues de Tunis
Reynaldo F. Macías, University of California, Los Angeles
Merrill Swain, The University of Toronto
Carolyn E. Turner, McGill University

To obtain more information about TOEFL programs and services, use one of the following:

E-mail: [email protected]

Web site: http://www.toefl.org


Acknowledgments

This work was supported by The Test of English as a Foreign Language (TOEFL®) Research Program at Educational Testing Service.

We are grateful to Eiji Muraki, Bob Boldt, J. Michael Linacre, Larry Stricker, and to the TOEFL Research Committee for helpful comments on an earlier draft of the paper. We especially thank the raters of the Test of Spoken English (TSE®), without whose cooperation this project could never have succeeded.


Table of Contents

Introduction

Method
    Test of Spoken English
    Examinees
    Raters
    Scoring of the TSE
    Procedure

Data Analysis
    A Description of Facets
    A Description of the Analyses

Results
    Examinee and Rater Separation
    Examinee Proficiency Measures
    Rater Severity Measures
    Rater Fit
    Linking Quality

Summary of Findings

References

Appendix A

Appendix B


List of Tables

Table 1. Benchmark Audiotapes for the February 1997 TSE Scoring
Table 2. Benchmark Audiotapes for the April 1997 TSE Scoring
Table 3. Summary of Results from the Facets Analyses of the February 1997 TSE Rating Data
Table 4. Summary of Results from the Facets Analyses of the April 1997 TSE Rating Data
Table 5. February 1997 TSE Examinees Whose Proficiency Measures Showed Significant Change When Compared to Measures from the Operational Scoring
Table 6. April 1997 TSE Examinees Whose Proficiency Measures Showed Significant Change When Compared to Measures from the Operational Scoring
Table 7. February 1997 TSE Raters Whose Severity Measures Showed Significant Change When Compared to Measures from the Operational Scoring
Table 8. April 1997 TSE Raters Whose Severity Measures Showed Significant Change When Compared to Measures from the Operational Scoring
Table 9. Mean-Square Infit and Outfit Statistics for Misfitting Raters from the February 1997 TSE Scoring
Table 10. Mean-Square Infit and Outfit Statistics for Misfitting Raters from the April 1997 TSE Scoring
Table 11. Chi-Square Linking Summary for February 1997 TSE Data
Table 12. Chi-Square Linking Summary for April 1997 TSE Data
Table 13. Summary of Chi-Square Rankings for February 1997 and April 1997 TSE Data


Introduction

The scoring of large-scale performance assessments often involves working with many raters. For example, American College Testing (ACT, Inc.), National Computer Systems (NCS), CTB/McGraw Hill, Psychological Corporation, and Educational Testing Service (ETS®) routinely conduct scoring sessions for which they hire, train, and certify hundreds of persons to serve as raters. Additionally, states will often contract with testing companies to hire and train large numbers of raters to evaluate examinees' products and performances for their statewide testing programs.

A scoring session that involves many raters poses special challenges for data analysis. Devising a rating design that will adequately link raters, examinees, and items or tasks can be a trying experience under such circumstances. The problems become further compounded when raters evaluate complex performances that are very time consuming to score (e.g., portfolios, videotapes, or audiotapes of examinees' performances). An individual rater can evaluate only a small number of these kinds of performances in any given scoring session. This problem is exacerbated when large numbers of raters are hired in order to reduce the amount of time needed to score the performances. If there are many raters involved, and if each rater scores relatively few examinees, then establishing sufficient connectivity in the rating design becomes an even more arduous task.

Suppose that two raters score each examinee's work, as is the case in some high-stakes assessment programs. Because each rater rates few examinees, a given rater is paired with only a small number of the other raters, and there may be a number of raters with whom that rater is never paired. Even if the rating design works out so that all raters can be linked, in a number of cases the connections between raters are likely to be weak and tentative at best (i.e., the number of examinees any given pair of raters has scored in common is very limited).

A lack of connectivity among raters is problematic because insufficient connectivity in the data makes it impossible to calibrate all raters on the same scale. Consequently, raters cannot be directly compared in terms of the degree of severity or leniency they exercise when scoring examinees' work since there is no single frame of reference established for making such comparisons (Linacre & Wright, 1994). In the absence of a single frame of reference, examinees (and items, too) cannot be compared on the same scale.

Having the capability to compare raters is important if one wants to quantify the extent to which raters are interchangeable (i.e., rate with the same degree of severity or leniency). Because raters are often not interchangeable, some have advocated adjusting examinees' scores for differences in rater severity or leniency (Raymond, Webb, & Houston, 1991), and computer programs have been devised that can accomplish this task. For example, the output from the Facets computer program (Linacre, 1999) includes an estimate of severity for each rater and a score for each examinee that has been adjusted for severity/leniency differences among raters. However, some have questioned the practice of adjusting examinee scores when raters are only sparsely connected in a rating design and when the estimates of rater severity that arise from such rating designs lack precision and stability over time (Marr, 1994; Myford, Marr, & Linacre, 1996). Adjustments for rater effects are hard to defend unless the effects of those raters are precisely estimated, stable, and not dependent upon either the particular examinees each rater happened to rate or the other raters with whom the rater happened to be paired in a given scoring session.

What can be done to strengthen the links among raters when there are large numbers of raters involved in a scoring session and the overlap among raters is minimal?


Myford, Marr, and Linacre (1996) suggest instituting a rating design that calls for all raters involved in a scoring session to rate the same small set of examinees' works (in addition to the examinees' works that each rater would score as part of his or her normal workload in a scoring session). In effect, a block of ratings from a fully crossed rating design is embedded into an otherwise lean data structure, thereby ensuring that all raters will be linked through their ratings of this common set of examinees' works. No studies to test the utility of this strategy for strengthening the linking network of raters have as yet been undertaken.

The purpose of this study was to evaluate the effectiveness of the rater linking strategy that Myford, Marr, and Linacre (1996) proposed. All raters who took part in the scoring of the Test of Spoken English (TSE®) during the February 1997 and April 1997 scoring sessions rated a common set of six audiotaped examinee performances that were selected from the tapes to be scored during that session. The specific questions that focused the study included the following:

1. How does embedding in the operational data blocks of ratings from various smaller sets of the six examinee tapes affect:
   • the stability of the examinee proficiency measures?
   • the stability of the rater severity measures?
   • the fit of the raters?
   • the spread of the rater severity measures?
   • the spread of the examinee proficiency measures?

2. How many tapes do raters need to score in common in order to establish the minimal requisite connectivity in the rating design so that all raters and examinees can be placed on a single scale? What (if anything) is to be gained by having all raters score more than one or two tapes in common?

3. What are the characteristics of tapes that produce the highest quality linking? Are tapes that exhibit certain characteristics more effective as linking tools than tapes that exhibit other characteristics?

Method

Test of Spoken English

The TSE functions as "a test of general speaking ability designed to evaluate the oral language proficiency of nonnative speakers of English who are at or beyond the postsecondary level of education" (TSE Program Office, 1995, p. 6). The TSE is a semidirect speaking test that is administered via audio recording equipment, prerecorded prompts, and printed test booklets. Each of the 12 items that appear on a test consists of a single task that is designed to elicit one of 10 language functions in a particular context. The test lasts about 20 minutes and is individually administered. An interviewer asks the examinee questions on tape. After hearing a question, the examinee is encouraged to answer the question as completely as possible in the time allotted. The examinee's oral responses are recorded, and each examinee's test score is based on an evaluation of the resulting speech sample.

Examinees

Examinees whose data were selected for use in this study took the TSE during either the February 1997 (N = 1,463) or April 1997 (N = 1,446) administrations. About 7% of these examinees were under 20 years of age, 83% were between 20 and 39, and 10% were over 40. Fifty-three percent were male, and 47% were female. The majority of examinees were from Eastern Asian countries (56% in February 1997, 63% in April 1997). The next largest group of examinees was from European countries (17% in February 1997, 14% in April 1997), and the rest of the examinees were from South American, North American, African, Middle Eastern, and Western Asian countries.

Raters

The raters who score the TSE audiotapes are all experienced teachers and specialists in the fields of English or English as a Second Language who teach at the high school or college level. All raters must successfully complete a rater training program prior to participating in TSE scoring sessions. At the conclusion of the training program, each rater must score a set of qualifying tapes in order to be certified to score the TSE. Sixty-six raters participated in the scoring of the February 1997 TSE, while 74 raters were involved in the scoring of the April 1997 TSE. There were 41 raters who participated in both of the scoring sessions. Not counting the common set of six examinees that the raters scored, each rater scored on average 44 examinees in the two-day February 1997 scoring session and 38 examinees in the two-day April 1997 scoring session. The average number of ratings each rater gave was 533 for the February 1997 session and 461 for the April 1997 session.

Scoring of the TSE

In a TSE scoring session, audiotapes are identified by number only and are randomly assigned to raters. Two raters independently score each examinee's tape. Each tape takes approximately 20-30 minutes for a rater to score. The raters evaluate an examinee's performance on each item using the holistic, five-point TSE rating scale. The raters use the same scale to rate each of the 12 items. Each point on the scale is defined by a band descriptor; these descriptors correspond to the four language competencies that the test is designed to measure (i.e., functional competence, sociolinguistic competence, discourse competence, and linguistic competence), and strategic competence. Raters assign a holistic score from one of the five bands for each of the 12 items. For the interested reader, Appendix A presents the TSE Band Descriptor Chart, and Appendix B describes the process used to calculate an examinee's score.

Procedure

Prior to each of the two scoring sessions, a small group of experienced TSE raters who supervised each scoring session met to select the six common benchmark tapes that all raters in the upcoming scoring session would be required to score. These experienced raters listened to a number of tapes to select the set of six. For each scoring session, they chose a set of tapes that displayed a range of examinee proficiency (i.e., a few low scoring examinees, a few high scoring examinees, and some examinees whose scores would fall in the middle categories of the TSE rating scale). They included in this set a few tapes that they believed would be hard for raters to rate (e.g., examinees who showed variability in their level of performance from item to item, doing well on some items but poorly on others) as well as tapes that they judged to be solid examples of performance at a specific point on the scale (i.e., examinees that showed little variability in performance across the items, who would receive the same or nearly the same rating on each item).


Once the benchmark tapes were identified, additional copies of each tape were made. These tapes were seeded into the scoring session so that raters would not know that these particular tapes were any different from any of the other tapes they were scoring (i.e., the raters were not aware that we would be using their ratings of these benchmark tapes for a special purpose).

Data Analysis

A Description of Facets

The data collected in this study were analyzed using Facets (Linacre, 1999), a rating scale analysis computer program based on an extension of Wright and Masters' rating scale model (1982). Facets has been used to analyze rater behavior and to review patterns of ratings in many and varied performance assessment settings (e.g., Engelhard, 1994; Heller, Sheingold, & Myford, 1998; Lumley & McNamara, 1995; Lunz & Stahl, 1990; Myford & Mislevy, 1995).1 The Facets computer program is particularly well suited for analyzing data from performance assessments that are judged by human raters. In the TSE context, those in charge of monitoring quality control need information not only about how the examinees performed but also about the performance of individual raters, items, and the TSE rating scale. The output from a Facets analysis can provide useful diagnostic information about the quality of each examinee's performance, the scoring behavior of each rater, the utility of each of the 12 items on the TSE, and the adequacy of the TSE rating scale. Having access to such detailed information enables one to pinpoint specific weaknesses or deficiencies in a complex assessment system so that meaningful, informed steps can be initiated to improve the system.

The many-facet Rasch model extends the rating scale model to incorporate more facets than the two that are typically included in a paper-and-pencil testing situation (i.e., examinees and items). When applied to analyze data from a performance assessment, the model can be expanded to incorporate additional facets of that setting that may be of particular interest. In our study we included raters as a facet.2 Within each facet, each element--that is, each individual rater, examinee, or item--is represented by a parameter.

1 Other approaches to calibrating raters that use raw score (i.e., nonlinearized, scale-dependent) ratings have been implemented. Some advocate using a generalizability theory approach to studying rater effects (e.g., Cronbach, Linn, Brennan, & Haertel, 1995; Koretz, Stecher, Klein, & McCaffrey, 1994). Others advocate using a Bayesian approach to calibrate raters (e.g., Braun, 1988; Paul, 1981). The Facets approach to studying rater effects differs from these two approaches in a fundamental way: The measures of item difficulty, examinee proficiency, and rater severity that Facets produces are all in the same linear unit of measure (i.e., in logits, or log-odds units). Having all measures in the same linear unit of measure facilitates making comparisons within and between the various facets of the analysis.

2 Psychometricians differ in their views of how best to account for the effects of raters in measurement models. Some contend that the Facets approach to modeling raters is inappropriate since it does not take into account possible dependency in information from a rater's multiple ratings of an examinee's responses (E. Muraki, personal communication, August 3, 1998). Bock (1997) has proposed and Bock and Muraki (1998) have devised an approach to analyzing rating data that introduces a rater-reliability correction to scale score standard errors that may be overly optimistic as a result of conditional dependence among a rater's ratings. Wilson and Hoskens (1999) present an alternative approach, the use of rater bundle models, which are appropriate for repeated measures situations in which dependence among multiple ratings may be suspected. Finally, Patz, Junker, and Johnson (1999) have proposed the use of the hierarchical rater model to properly account for dependence among multiple ratings of the same examinee response. While it is never possible to achieve perfect local independence, if one suspects that some type of dependency is present in the data, Facets can look for it and estimate its effect on the measurement system (J. M. Linacre, personal communication, September 5, 1998). Quality-control fit statistics and residual analyses that are built into the program can detect particular types of dependence in the data (e.g., if there was suspected dependence among a rater's ratings of the 12 TSE items for a given examinee, then one could check the examinee infit statistics for evidence of gross overfit). Additionally, other tests could be run to look for more subtle, specific types of dependencies in the data (e.g., a cluster analysis or factor analysis of residuals).


Facets provides a measure for each element of each facet included in an analysis. For each examinee, the measure is an estimate of that examinee's proficiency. The larger the measure, the more proficient the examinee. For each rater, the measure is an estimate of the degree of severity that rater exercised when rating examinees' tapes. The larger the measure, the more severe the rater. For each item, the measure is an estimate of the difficulty of the item. The larger the measure, the more difficult it was for an examinee to obtain high ratings on the item. In addition to reporting a measure for each element, Facets also provides a standard error for that measure (i.e., information about the precision of the estimate).

The many-facet measurement model we used in this study describes the probability that a specific examinee (n) rated by a specific rater (j) will receive a rating in a particular category (k) on a specific item (i). The mathematical form of this probability (Equation 1) depicts the relationships among these elements in terms of a logistic odds ratio:

$$\log\!\left(\frac{P_{nijk}}{P_{nijk-1}}\right) = B_n - D_i - C_j - F_k \qquad (1)$$

where

$P_{nijk}$ is the probability of examinee n, when rated on item i by rater j, being awarded a rating of k (Equation 2),
$P_{nijk-1}$ is the probability of examinee n, when rated on item i by rater j, being awarded a rating of k-1,
$B_n$ is the proficiency of examinee n,
$D_i$ is the difficulty of item i,
$C_j$ is the severity of rater j,
$F_k$ is the difficulty of category k of the rating scale.3

Equation 1 is the general expression for a three-facet Rasch rating scale model (Linacre, 1994). The three facets included in this model are the examinees, the TSE items, and the raters. The probabilities are modeled as an additive combination of these three facets. It follows that the probability of a rating in category x on item i for examinee n from rater j is

$$P_{nijx} = \frac{\exp\!\left[\sum_{k=0}^{x}\left(B_n - D_i - C_j - F_k\right)\right]}{\sum_{h=0}^{m_i}\exp\!\left[\sum_{k=0}^{h}\left(B_n - D_i - C_j - F_k\right)\right]}, \quad \text{where } x = 0, 1, \ldots, m_i \qquad (2)$$
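Equation 2 is straightforward to evaluate numerically. The sketch below (ours, not part of the original report) shows one way the category probabilities for a single examinee-item-rater combination could be computed in Python; the parameter values are hypothetical, and $F_0$ is set to zero by the usual convention.

```python
import numpy as np

def category_probabilities(B_n, D_i, C_j, F):
    """Category probabilities under the three-facet rating scale model (Equation 2).

    B_n: examinee proficiency (logits)
    D_i: item difficulty (logits)
    C_j: rater severity (logits)
    F:   category difficulties F_0..F_m (F_0 = 0 by convention)
    """
    steps = B_n - D_i - C_j - np.asarray(F, dtype=float)
    numerators = np.exp(np.cumsum(steps))   # exp of the cumulative sums in Equation 2
    return numerators / numerators.sum()

# Hypothetical logit values; the five TSE score bands (20-60) are indexed 0-4 here.
probs = category_probabilities(B_n=1.2, D_i=-0.3, C_j=0.5,
                               F=[0.0, -2.1, -0.6, 0.8, 1.9])
print(probs.round(3))   # probabilities sum to 1 across the five categories
```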

The Facets program uses the ratings that raters award on all items to estimate individual rater severities, item difficulties, and examinee proficiencies using a joint maximum likelihood estimation approach.4 Facets calculates the examinee proficiency measures after first subtracting the effects of the TSE items and raters.

3 This measurement model specifies that a common rating scale category structure applies across all items and for all raters (i.e., that $F_k$ is constant across TSE items and raters).

For those monitoring quality control of a rating procedure, the fit statistics that the Facets program generates are particularly valuable. The fit statistics provide information about how well the data for each element in the analysis "fit" the expectations of the measurement model that was used (i.e., how well the data cooperate with the requirements of the measurement model). The fit statistics can be used to identify individual examinees, items, or raters who performed in a manner inconsistent with the expectations of the measurement model. Facets output includes two measures of fit for each facet of the model: weighted (infit) and unweighted (outfit). Each fit mean square is a summary of the degree of correspondence between the observed data and the values expected, based on the parameter estimates that the measurement model produces. The mean square is a chi-square statistic that is divided by its degrees of freedom. Its expectation is 1.0, and its range is 0 to infinity.

In this study, we focused particularly on rater fit. The rater mean-square fit statistics we present summarize each rater's level of internal consistency (i.e., intrarater reliability). The mean-square fit statistics also provide useful information regarding the extent to which each rater is able to use the TSE five-point rating scale to make viable distinctions among examinees' performances (Lunz, Stahl, & Wright, 1996). The rater outfit statistic is an unweighted mean-square residual that is particularly sensitive to occasional unexpected, outlying residuals (hence, the acronym "outfit" to signify "outlier-sensitive fit statistic"). The infit statistic, on the other hand, weights each standardized residual by its variance. As a result, infit statistics are more sensitive to unexpected patterns of small residuals (hence, the acronym "infit" to signify "information-weighted fit statistic"). For the purposes of this study, we defined "misfitting" raters as those raters having either a mean-square infit or outfit statistic greater than 1.5 (Lunz, Wright, & Linacre, 1990).5 A rater mean-square fit statistic of 1.5 indicates 50 percent more variance in the rater's ratings than is modeled--a "noisy" rating pattern. If a rater's mean-square infit statistic was greater than 1.5, then we would conclude that the rater used the TSE rating scale in an idiosyncratic fashion, unlike other raters. There is a pattern of inconsistency in the rater's ratings, most likely involving the rater's use of the inner categories of the TSE rating scale (i.e., the 30, 40, and 50 categories).

4 It should be noted that there are other approaches to estimating likelihood (e.g., conditional maximum likelihood estimation, marginal maximum likelihood estimation, etc.), and each approach has its own advantages and disadvantages (Adams & Wilson, 1996; Linacre, 1994). Currently, there is a lack of consensus among psychometricians concerning which approaches are most suitable for modeling the structure of various complex judgment situations (Adams & Wilson, 1996). Linacre has chosen to use a joint maximum likelihood approach (Linacre, 1994). For the interested reader, Linacre (1994) provides a detailed description of the estimation approach and the rationale for its use.

5 Different testing programs currently use different cutoffs for identifying misfitting raters. There are no hard-and-fast rules (Wright & Linacre, 1994). The decision depends upon how much unmodelled noise the testing program is willing to tolerate in raters' ratings of examinee performance and the level of stakes attached to the test. For high-stakes tests, less noise would typically be tolerated than for low-stakes tests. Wright and Linacre (1994) suggest as a guideline for situations in which raters are rendering judgments where agreement is encouraged that item mean-square infit and outfit statistics in the range of 0.4 to 1.2 are reasonable. However, they do not provide guidance for establishing a reasonable range for rater mean-square infit and outfit statistics. Lunz, Wright, and Linacre (1990) used a cutoff of 1.5 for rater fit in their study of examinees' performance on a clinical pathology certification exam (i.e., a high-stakes exam). We follow their lead in adopting that cutoff for our study, since the TSE is considered a high-stakes exam for those taking it.


A number of the ratings were unexpected (or surprising) when we consider the rater's overall level of severity and the ratings that other raters gave the same examinees. Once we have identified the rater, we could then pinpoint the individual standardized residuals greater than 1.5 for each examinee-item combination involving that rater. We could look for patterns in those standardized residuals that would help us diagnose the particular problem(s) the rater was experiencing (e.g., Were there certain items on the TSE that were more difficult for this rater to rate consistently? Were there certain types of examinees that this rater had difficulty rating? Were there certain categories on the TSE rating scale that this rater couldn't reliably distinguish between?).

If a rater's mean-square outfit statistic were greater than 1.5, but the rater's mean-square infit was less than 1.5, then that would be an indication that the rater gave an occasional highly unexpected extreme rating. By and large, the rater used the TSE rating scale consistently, but once in a while the rater gave a rating in the extreme categories of the scale (i.e., the 20 and 60 categories) that was quite surprising, given the rater's overall level of severity and the ratings that other raters gave the same examinees. Again, the next step would be to review the individual standardized residuals for that rater to identify those few examinees who received these very unexpected (or surprising) ratings from this rater and to determine whether those ratings were indeed warranted or were perhaps anomalies.

Facets uses Equation 3 to compute the standardized residuals that enable one to compare observed ratings to expected ratings:

$$z_{nij} = \frac{x_{nij} - E_{nij}}{\sqrt{\sum_{k=0}^{m}\left(k - E_{nij}\right)^2 P_{nijk}}} \qquad (3)$$

where

$$E_{nij} = \sum_{k=0}^{m} k\,P_{nijk} \qquad (4)$$

and $x_{nij}$ is the observed rating for examinee n on item i by rater j. $E_{nij}$ is the expected rating based on the three-facet measurement model we have specified. $P_{nijk}$ is the probability of a specific rating (k), given the examinee proficiency measures and the rater severities, item difficulties, and rating scale calibrations (Equation 2). The mean-square fit statistics are calculated by summing over the standardized residuals. Facets calculates the mean-square outfit statistic, $u_j$, for the rater facet using Equation 5:

$$u_j = \frac{\sum_{n=1}^{N}\sum_{i=1}^{I} z_{nij}^2}{N I} \qquad (5)$$

Facets calculates the mean-square infit statistic, $v_j$, for the rater facet using Equation 6:

$$v_j = \frac{\sum_{n=1}^{N}\sum_{i=1}^{I} W_{nij}\, z_{nij}^2}{\sum_{n=1}^{N}\sum_{i=1}^{I} W_{nij}} \qquad (6)$$


where

$$W_{nij} = \sum_{k=0}^{m}\left(k - E_{nij}\right)^2 P_{nijk} \qquad (7)$$
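As an illustration of how Equations 3 through 7 fit together, the following sketch (ours, using simulated inputs rather than TSE data) computes the standardized residuals and the mean-square outfit and infit statistics for a single rater, and flags the rater against the 1.5 cutoff used in this study:

```python
import numpy as np

def rater_fit(observed, expected, category_probs, categories):
    """Mean-square outfit (Equation 5) and infit (Equation 6) for one rater.

    observed:       ratings x_nij given by the rater, one per examinee-item pair
    expected:       model-expected ratings E_nij (Equation 4)
    category_probs: array of shape (n_pairs, n_categories) holding P_nijk
    categories:     the category scores k, e.g. [0, 1, 2, 3, 4]
    """
    observed, expected = np.asarray(observed, float), np.asarray(expected, float)
    k = np.asarray(categories, float)
    # W_nij: model variance of each observation (Equation 7)
    W = ((k - expected[:, None]) ** 2 * category_probs).sum(axis=1)
    # z_nij: standardized residuals (Equation 3)
    z = (observed - expected) / np.sqrt(W)
    outfit = np.mean(z ** 2)                    # unweighted mean square
    infit = np.sum(W * z ** 2) / np.sum(W)      # information-weighted mean square
    return outfit, infit

# Simulated example: flag the rater if either statistic exceeds 1.5.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=200)          # hypothetical P_nijk
expected = (probs * np.arange(5)).sum(axis=1)        # Equation 4
observed = np.array([rng.choice(5, p=p) for p in probs])
outfit, infit = rater_fit(observed, expected, probs, np.arange(5))
print(round(outfit, 2), round(infit, 2),
      "misfitting" if max(outfit, infit) > 1.5 else "acceptable")
```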

In this study, we looked at how embedding blocks of ratings from various smaller sets of the six common examinee tapes in the operational data affects examinee and rater separation. Facets produces an examinee separation index, which is a measure of the spread of the examinee proficiency measures relative to their precision. Separation is expressed as a ratio of the variance of examinee proficiency measures adjusted for measurement error over the average examinee error (Equation 8):

$$G_N = \sqrt{\frac{S_N^2 - \frac{1}{N}\sum_{n=1}^{N} SE_n^2}{\frac{1}{N}\sum_{n=1}^{N} SE_n^2}} \qquad (8)$$

where

$S_N^2$ is the variance of the non-extreme examinee proficiency measures,
$SE_n^2$ is the squared standard error for examinee n,
$N$ is the number of examinees.

The examinee separation index provides a measure of the number of measurably different levels of examinee performance in the sample (Wright, 1998). Similarly, Facets produces a rater separation index, which is a measure of the spread of the rater severity measures relative to their precision. The index connotes the number of statistically distinct levels of rater severity in the sample of raters.
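A sketch of the computation in Equation 8 (ours; the measures and standard errors below are hypothetical, not values from the TSE analyses):

```python
import numpy as np

def separation_index(measures, standard_errors):
    """Separation index (Equation 8): spread of the measures relative to
    their precision. Applies equally to examinee proficiency measures and
    rater severity measures."""
    m = np.asarray(measures, dtype=float)
    se = np.asarray(standard_errors, dtype=float)
    observed_variance = m.var(ddof=1)            # S_N^2
    mean_square_error = np.mean(se ** 2)         # average squared standard error
    true_variance = max(observed_variance - mean_square_error, 0.0)
    return np.sqrt(true_variance / mean_square_error)

# Hypothetical logit measures and standard errors for five examinees
print(round(separation_index([1.2, -0.4, 0.3, 2.1, -1.0],
                             [0.30, 0.28, 0.31, 0.35, 0.29]), 2))
```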

A Description of the Analyses

After the two TSE scoring sessions were completed, we ran two preliminary Facets analyses on the TSE data--an analysis of the February 1997 rating data and an analysis of the April 1997 rating data. When we ran this first set of analyses, we did not include the ratings of the six benchmark tapes that the raters scored in each scoring session. The output from these two preliminary analyses showed no obvious connectivity problems in this data.6 Because we were interested in studying data in which disconnected subsets occur, we set out to deliberately create instances of disconnection7 in our two data sets.

Creating disconnected subsets. The first step in creating disconnection involved some selective editing of our data. In each scoring session, there were some raters who for unknown reasons did not score one or more of the benchmark tapes.

6 When Facets detects a lack of connectedness in the data, it identifies "disconnected subsets," pinpointing those particular examinees, raters, and items that are in the same subset. Only examinees that are in the same subset are directly comparable. Similarly, only raters (or items) that are in the same subset can be directly compared. Attempts to compare examinees (or raters or items) that appear in two or more different subsets can be misleading.

7 Disconnection occurs as a result of instituting a judging plan for data collection that makes it impossible to place all raters, examinees, and items in one frame of reference so that appropriate comparisons can be drawn (Linacre, 1994). The allocation of raters to items and examinees must result in a network of links that is complete enough to connect all the raters through common items and common examinees (Lunz, Wright, & Linacre, 1990). If there are insufficient patterns of nonextreme high ratings and nonextreme low ratings to be able to connect two elements (e.g., two raters, two examinees, two items), then the two elements will appear in separate subsets of Facets output as "disconnected."


Therefore, we eliminated those raters from our data as well as the examinees that those raters scored. We also eliminated raters who scored fewer than 10 examinees. For the February 1997 data, this selective editing resulted in a reduction in the number of raters from 66 to 41 and a reduction in the number of examinees from 1,463 to 747. Similarly, for the April 1997 data, our selective editing resulted in a reduction in the number of raters from 74 to 40 and a reduction in the number of examinees from 1,446 to 710.

The next step involved selectively deleting additional examinees from each data set to deliberately create disconnection in each data set. To accomplish this, we identified the subset of raters who rated the fewest examinees across the entire scoring session (referred to here as Group 1--all other raters were in Group 2). We then deleted all examinees that were rated by members of rater Group 1 and Group 2, creating disconnected subsets of raters with minimal deletion of examinees. For the February 1997 data, after we selectively deleted 184 examinees, Subset 1 contained 553 examinees and 33 raters, and Subset 2 contained 10 examinees and 7 raters. For the April 1997 data, after we deleted 154 examinees, Subset 1 contained 543 examinees and 32 raters, Subset 2 contained 11 examinees and 7 raters, and Subset 3 contained 1 examinee and 2 raters.

After creating disconnection in our two data sets, we ran a second set of Facets analyses. We ran an analysis on the February 1997 operational data containing no ratings of the benchmark tapes, noncentering the examinees (that is, not constraining this facet to have a mean element measure of zero).8 We ran the analysis a second time, noncentering the raters. We then added to the February operational data the raters' ratings of all six benchmark tapes and ran a third Facets analysis, noncentering the examinees. We ran the same analysis a second time, noncentering the raters. We used the output from the analyses in which we noncentered the examinees to compare examinee proficiency measures, since examinees were the objects of measurement in those analyses. By contrast, we used the output from the analyses in which we noncentered the raters to compare rater severity measures, since raters were the objects of measurement in those analyses. The results from this second set of analyses served as the standards of comparison for subsequent analyses.

The output from the analyses that included ratings of the six examinees' benchmark tapes provided summary information for each of these six examinees. Table 1 reports the examinee proficiency measure (in logits) and the standard error of each measure. The table also includes the mean-square infit and outfit values for each examinee's benchmark tape as well as the observed average (i.e., the average of the raters' ratings of that examinee's tape) and the fair average (i.e., the observed average adjusted for the severity or leniency of the raters who scored the examinee's tape and for the difficulty of the TSE items). We also ran these four Facets analyses on the April 1997 TSE data and obtained summary information for each of those benchmark tapes. That information is presented in Table 2.

Grouping the benchmark tapes. We used the information we obtained from the second set of analyses to decide how we would group the benchmark tapes, since we were interested in finding out whether benchmark tapes having certain characteristics were more effective as linking tools than benchmark tapes having other characteristics.
8 When running Facets analyses, it is customary to center all facets except one to establish a common origin, usually zero. If more than one facet is noncentered, then ambiguity may result since the frame of reference is not sufficiently constrained (Linacre, 1994, p. 28). Generally, one centers on the facet that is of measurement interest. If one wanted to compare examinees, then the examinee facet would be centered in the analysis. By contrast, if one wanted to compare raters, then the rater facet would be centered. Manipulating the center of the distribution does not significantly affect the results of this research: it simply produces an additive shift in the entire distribution of parameter estimates for a given facet.


Table 1. Benchmark Audiotapes for the February 1997 TSE Scoring

The six data columns following the benchmark number report selected output from the Facets analysis; a checkmark in one of the remaining columns indicates that the benchmark tape was included in that comparison set.

| Benchmark | Observed Average | Fair Average | Logit Measure | Standard Error | Mean-Square Infit | Mean-Square Outfit | The 3 most consistent benchmarks | The 3 least consistent benchmarks | The 3 lowest benchmarks | The 3 highest benchmarks | The least consistent benchmark | The most consistent benchmark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 40.9 | 40.77 | 0.16 | .11 | 0.9 | 0.9 | ✔ | | ✔ | | | ✔ |
| 2 | 49.6 | 49.56 | 3.81 | .09 | 1.6 | 1.6 | | ✔ | | ✔ | | |
| 3 | 48.2 | 48.21 | 3.23 | .09 | 1.1 | 1.1 | ✔ | | | ✔ | | |
| 4 | 39.9 | 39.88 | -0.39 | .11 | 1.7 | 1.7 | | ✔ | ✔ | | ✔ | |
| 5 | 38.4 | 38.61 | -1.14 | .10 | 1.4 | 1.4 | | ✔ | ✔ | | | |
| 6 | 51.2 | 51.01 | 4.45 | .10 | 1.1 | 1.1 | ✔ | | | ✔ | | |

Table 2. Benchmark Audiotapes for the April 1997 TSE Scoring

The six data columns following the benchmark number report selected output from the Facets analysis; a checkmark in one of the remaining columns indicates that the benchmark tape was included in that comparison set.

| Benchmark | Observed Average | Fair Average | Logit Measure | Standard Error | Mean-Square Infit | Mean-Square Outfit | The 4 most consistent benchmarks | The 2 least consistent benchmarks | Two 30s benchmarks | Three 40s benchmarks | One 50s benchmark | The least consistent benchmark | The most consistent benchmark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 58.6 | 58.82 | 8.31 | .13 | 1.0 | 1.1 | ✔ | | | | ✔ | | |
| 2 | 45.4 | 45.40 | 2.33 | .09 | 1.1 | 1.1 | ✔ | | | ✔ | | | |
| 3 | 39.1 | 39.22 | -0.94 | .11 | 1.3 | 1.4 | | ✔ | ✔ | | | | |
| 4 | 42.0 | 41.78 | 0.82 | .11 | 1.1 | 1.1 | ✔ | | | ✔ | | | |
| 5 | 34.8 | 35.03 | -2.80 | .08 | 0.9 | 1.0 | ✔ | | ✔ | | | | ✔ |
| 6 | 48.2 | 48.20 | 3.42 | .09 | 1.4 | 1.4 | | ✔ | | ✔ | | ✔ | |


For example, when we examined the mean-square fit statistics for the six benchmark tapes scored in the February 1997 TSE scoring session, we found that three of the tapes had fit statistics within a normal range (i.e., between 0.8 and 1.2), while the other three tapes had fit statistics outside this range (i.e., equal to or greater than 1.4).9 The three tapes that had mean-square fit statistics outside the normal range were somewhat harder to rate because each examinee showed a pattern of uneven performance, according to the experienced TSE raters who selected these particular tapes for inclusion in the benchmark set. Each of these three examinees performed well on some items but not as well on others. By contrast, those three tapes having mean-square fit statistics within the normal range were somewhat easier to rate, according to the experienced TSE raters, because each examinee showed a pattern of consistent performance across items, and raters were therefore likely to give them a similar rating on each item. Were benchmark tapes that showed inconsistent performance across items (i.e., were somewhat harder to rate) any more effective as linking tools than benchmark tapes that showed stable, consistent performance across items (i.e., were somewhat easier to rate)? Did one kind of tape provide a stronger linking mechanism than the other?

We were also interested in whether benchmark tapes from less proficient examinees were more effective as linking tools than benchmark tapes from more proficient examinees. For example, when we looked at the fair average for each of the six benchmark tapes from the February 1997 TSE scoring session, we found that three had fair averages in the 35-45 range, while three had fair averages in the 45-55 range. Were benchmark tapes in the 35-45 range any more effective as linking tools than benchmark tapes in the 45-55 range? We categorized the benchmark tapes from the April 1997 TSE scoring into three fair-average ranges: (1) tapes in the 30s range, (2) tapes in the 40s range, and (3) tapes in the 50s range. Were benchmark tapes in one of these ranges any more effective as linking tools than benchmark tapes in the other ranges? The benchmark tapes that were included in each of the comparison sets are shown in the right-hand side of Table 1 and Table 2. A checkmark in a column denotes a benchmark tape included in that particular comparison set.

Analyses to compare sets of benchmark tapes. We ran a number of additional Facets analyses on the February 1997 and April 1997 TSE data, with each analysis including an embedded block of ratings from one of our sets of benchmark tapes. To the operational February 1997 data that contained no ratings from any of the benchmark tapes we added the block of ratings from the three most consistent benchmark tapes, and then we ran an analysis on this data. Next, we added the block of ratings from the three least consistent benchmark tapes to the February 1997 operational data and ran an analysis on that data. We then ran four additional analyses, each time including an embedded block of ratings from a different set of tapes to the February operational data--the block of ratings from the three lowest scoring benchmark tapes, the block of ratings from the three highest scoring benchmark tapes, the block of ratings from the least consistent benchmark tape, and finally the block of ratings from the most consistent benchmark tape. We first ran these six Facets analyses noncentering the examinees, and we then ran them a second time noncentering the raters.

9 Again, there are no hard-and-fast rules for establishing cutoffs for examinee misfit (Wright & Linacre, 1994). In this study, we used a cutoff of 1.4 for the mean-square fit statistics for examinees since, in both sets of benchmark tapes, there seemed to be a natural break in the distribution of examinee misfit statistics at that particular point. Also, in consulting with the experienced TSE raters, the raters noted that benchmark tapes having misfit statistics equal to or greater than 1.4 were somewhat harder to rate. Having these two sources of evidence seemed to us to provide sufficient justification for using a cutoff of 1.4 as the basis for assigning the benchmark tapes to two categories (i.e., "most consistent" or "least consistent").


We ran a similar set of analyses using the April 1997 TSE data, incorporating the blocks of ratings for the various sets of benchmark tapes shown in Table 2. We first ran this set of seven Facets analyses noncentering the examinees, and then we ran these analyses again noncentering the raters.

After we completed the Facets analyses, we imported selected information about examinees (i.e., the examinee proficiency measures and standard errors) into Microsoft® Excel files. We compared the examinee proficiency measures obtained from the Facets analyses of the various sets of benchmark tapes to the examinee proficiency measures obtained from the analysis of the operational data.10 Each time we compared two distributions of examinee proficiency measures, we found that the means for the two score distributions differed slightly. To make the two distributions comparable, we adjusted each examinee proficiency measure in one of the distributions by the difference between the two means. The two examinee proficiency measures for each examinee could then be directly compared. To determine whether two examinee proficiency measures (indicated as $\psi_1$ and $\psi_2$ in Equation 9 below) were significantly different, we computed a z-score using Equation 9:

$$z = \frac{\psi_1 - \psi_2}{\sqrt{SE_{\psi_1}^2 + SE_{\psi_2}^2}} \qquad (9)$$
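The mean adjustment and the z-test of Equation 9 can be sketched as follows (our illustration; the 1.96 criterion is an assumed two-tailed cutoff, not a value taken from the report):

```python
import numpy as np

def flag_significant_changes(psi_1, se_1, psi_2, se_2, critical_z=1.96):
    """Compare two sets of measures for the same elements (Equation 9).

    psi_1, se_1: measures and standard errors from one analysis
    psi_2, se_2: measures and standard errors from a second analysis
    The second distribution is shifted so the two means coincide, mirroring
    the mean adjustment described above; critical_z is an assumed cutoff.
    """
    psi_1, psi_2 = np.asarray(psi_1, float), np.asarray(psi_2, float)
    se_1, se_2 = np.asarray(se_1, float), np.asarray(se_2, float)
    psi_2_adjusted = psi_2 + (psi_1.mean() - psi_2.mean())
    z = (psi_1 - psi_2_adjusted) / np.sqrt(se_1 ** 2 + se_2 ** 2)
    return z, np.abs(z) > critical_z

# Hypothetical proficiency measures (logits) from two analyses of the same examinees
z, flagged = flag_significant_changes([1.10, -0.42, 0.35], [0.30, 0.28, 0.31],
                                      [0.95, -0.30, 1.40], [0.29, 0.30, 0.33])
print(z.round(2), flagged)
```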

We also imported rater severity measures and the standard errors of those measures into Excel files. We compared the rater severity measures obtained from the Facets analyses of the various sets of the benchmark tapes to the rater severity measures obtained from the analysis of the operational data containing none of the ratings of the benchmark tapes.11 Again, we compared each set of rater severity measures by computing a z-score to identify those pairs of measures that were significantly different.

Analyses to evaluate linking quality and strength of various benchmark sets. We ran a number of additional Facets analyses on the ratings of the benchmark tapes from the February 1997 and April 1997 scoring sessions to evaluate linking quality and strength of the various benchmark sets. Each analysis used a block of ratings from one of the sets of benchmark tapes that we previously described. First, we performed a Facets analysis on all 40 raters' ratings of the three most consistent benchmark tapes from the February 1997 scoring session. Next, we ran an analysis on the raters' ratings of the three least consistent benchmark tapes from that scoring session. We then ran four additional analyses on ratings of sets of benchmark tapes from the February 1997 scoring session--one on the block of ratings from the three lowest scoring benchmark tapes, one on the block of ratings from the three highest scoring benchmark tapes, one on the block of ratings from the least consistent benchmark tape, and finally, one on the block of ratings from the most consistent benchmark tape. We also ran a similar set of analyses using the April 1997 TSE data, using the blocks of ratings from each of the seven sets of benchmark tapes shown in Table 2. After we completed these Facets analyses, we imported selected information about raters and items (i.e., rater severity measures, item calibrations, and standard errors of these measures) into Excel files and merged this information with the equivalent information obtained from our analyses of the two operational data sets. (Remember that the output from the analyses of the operational data contained disconnected subsets of raters but not of items.)

10 For these comparisons, we used the output from the Facets analyses in which we noncentered the examinees.
11 For these comparisons, we used the output from the Facets analyses in which we noncentered the raters.


To evaluate the links established by each linking set, we compared the rater severity measures and item calibrations obtained from the Facets analyses of the various sets of benchmark tapes to the rater severity measures and item calibrations obtained from the analysis of the operational data.

Smith (1996) differentiates between the �strength� of a link and the �quality� of that link. The strength of a link depends on the number of elements that are rated in common (e.g., the number of benchmark tapes that all raters are asked to rate). Typically, more elements are better (i.e., result in a stronger link). Not all links provide equal link quality, however. Here, we define link quality in terms of the stability of parameter estimates obtained between two (or more) subsets of elements. A strong link of low quality is better than a weak link of high quality, both statistically and from the point of view of fairness (J. M. Linacre, personal communication, August 6, 1998). Wright and Stone (1979) describe a procedure for evaluating the quality of links between tests that is based on the principle of parameter invariance. In short, their method specifies that the quality of a link between two tests can be evaluated by determining whether the equating constant that places the distributions of parameter estimates from each test onto the same scale explains the differences between pairs of parameter estimates. Linacre (1998) extended this method to many-facet measurement contexts. We used these methods in our evaluation of the quality of the links established by the various sets of benchmark tapes we developed.

To accomplish this, we computed an information-weighted equating constant for the two item links and for the two rater links. In both cases, the two links were established between the first disconnected subset and the set of benchmark tapes and between the second disconnected subset and the set of benchmark tapes. The information-weighted item equating constant, shown in Equation 10, depicts the shift required to center the two sets of item calibrations on an identity line:

    ItemLink = \frac{\sum_{i=1}^{K} I_i \,(D_{iA} - D_{iB})}{\sum_{i=1}^{K} I_i}        (10)

where I_i = [SE^2(D_{iA}) + SE^2(D_{iB})]^{-1} represents the Fisher information in each item difficulty shift. The standard error of that equating constant is shown in Equation 11:

    SE(ItemLink) \approx \frac{1}{\sqrt{\sum_{i=1}^{K} I_i}}        (11)

The quality of the item link can be evaluated using a chi-square fit statistic as shown in Equation 12:

    \sum_{i=1}^{K} I_i \,(D_{iA} - D_{iB})^2 \;-\; (ItemLink)^2 \sum_{i=1}^{K} I_i \;\approx\; \chi^2_{K-1}        (12)

Generally, sets of items with higher quality links (provided they have the same degrees of freedom) will have smaller chi-square fit statistics. The quality of the link established between subsets of raters can be evaluated by substituting rater severity measures (e.g., C_{jA}) and the rater count (i.e., J in place of K) into Equations 10, 11, and 12.
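The following sketch (our illustration; the function name and the numerical values are invented, not Facets output) shows how Equations 10 through 12 can be computed for a set of item calibrations estimated in two subsets; substituting rater severity measures for the item calibrations gives the corresponding rater link statistics.

    # Sketch of Equations 10-12 for K items calibrated in two subsets (A and B).
    # d_a, d_b: item calibrations; se_a, se_b: their standard errors (invented values).
    import math

    def link_statistics(d_a, se_a, d_b, se_b):
        info = [1.0 / (sa ** 2 + sb ** 2) for sa, sb in zip(se_a, se_b)]             # I_i
        total_info = sum(info)
        link = sum(i * (da - db) for i, da, db in zip(info, d_a, d_b)) / total_info  # Eq. 10
        se_link = 1.0 / math.sqrt(total_info)                                        # Eq. 11
        chi_square = (sum(i * (da - db) ** 2 for i, da, db in zip(info, d_a, d_b))
                      - link ** 2 * total_info)                                      # Eq. 12, df = K - 1
        return link, se_link, chi_square

    d_a, se_a = [-0.50, 0.10, 0.40], [0.10, 0.12, 0.11]
    d_b, se_b = [-0.35, 0.22, 0.58], [0.13, 0.14, 0.12]
    print(link_statistics(d_a, se_a, d_b, se_b))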


Results

Initially, when we ran Facets analyses on the edited February 1997 and April 1997 operational data that included no benchmark tapes, the output from the two analyses contained disconnected subsets of raters and examinees. Each time we added an embedded block of ratings from a different set of the benchmark tapes to one of these operational data sets and then ran a Facets analysis on that data, the output from the analysis no longer showed disconnected subsets of raters or examinees. All of our benchmark sets were effective for establishing at least the minimal requisite connectivity in the rating design in order to allow placement of all raters and all examinees on a single scale.

Examinee and Rater Separation

In Table 3 and Table 4, we compare the results from the Facets analyses of the various sets of benchmark tapes on examinee and rater separation. Table 3 presents results from the analyses of the February 1997 TSE data, while Table 4 presents results from the analyses of the April 1997 TSE data. For comparison purposes, the results obtained from a Facets analysis of the operational data containing no ratings of benchmark tapes, an analysis of the operational data containing ratings from all six benchmark tapes, and an analysis of the raters' ratings of only the six benchmark tapes (i.e., no operational data) are presented at the bottom of each table.

The examinee separation indices are very similar across the various sets of benchmark tapes. In each case, examinees could be separated into about six statistically distinct proficiency strata. There are small differences in the rater separation indices, however, especially for the raters participating in the April 1997 TSE scoring. There are more statistically distinct strata of rater severity resulting from scoring some sets of benchmark tapes than from scoring other sets. The rater separation indices for sets involving the scoring of less consistent benchmark tapes are higher than the rater separation indices for sets involving the scoring of more consistent benchmark tapes. Embedding the less consistent benchmark tapes into the operational data set resulted in greater separation of the raters in terms of their severity (i.e., the differences in rater severity were more exaggerated when we embedded the less consistent benchmark tapes into the operational data set).

Examinee Proficiency Measures

For each examinee, we computed a standardized difference (z), comparing the examinee proficiency measure (in logits) obtained from the analysis of the operational data containing no ratings from the benchmark tapes to the examinee proficiency measure obtained from the analysis of the operational data including the ratings from the various sets of benchmark tapes. Table 5 and Table 6 show those examinees whose proficiency measures changed under each of the benchmark sets (i.e., examinees whose standardized differences were 2.0 or greater12). The effects of including ratings from various sets of the benchmark tapes on examinee proficiency measures were minimal; only two examinees' proficiency measures (out of 1,118 examinees) showed significant change, and those examinees were both in the small, disconnected subsets (i.e., Subset 2, containing 10 examinees, from the February 1997 data and Subset 3, containing one examinee, from the April 1997 data).

12 If the standardized difference between two examinee proficiency measures is ≥ 2.0, then we can reject the null hypothesis (i.e., that the two proficiency measures are the same) at the .02 significance level.


Table 3. Summary of Results from the Facets Analyses of the February 1997 TSE Rating Data

Sets of Benchmark Tapes                 Rater Separation Index   Examinee Separation Index
The 3 most consistent benchmarks        3.81                     6.72
The 3 least consistent benchmarks       3.83                     6.57
The 3 lowest benchmarks                 4.06                     6.69
The 3 highest benchmarks                3.86                     6.62
The least consistent benchmark          3.66                     6.70
The most consistent benchmark           3.21                     6.77
Operational with no benchmarks          3.17                     6.76
Operational with all six benchmarks     4.52                     6.58
Six benchmarks ONLY                     4.16                     --

Table 4. Summary of Results from the Facets Analyses of the April 1997 TSE Rating Data

Sets of Benchmark Tapes                 Rater Separation Index   Examinee Separation Index
The 4 most consistent benchmarks        4.57                     6.16
The 2 least consistent benchmarks       5.24                     6.09
Two 30s benchmarks                      3.62                     6.15
Three 40s benchmarks                    5.60                     6.08
One 50s benchmark                       2.90                     6.19
The least consistent benchmark          5.62                     6.12
The most consistent benchmark           3.12                     6.16
Operational with no benchmarks          2.49                     6.16
Operational with all six benchmarks     5.34                     6.09
Six benchmarks ONLY                     3.36                     --


Table 5. February 1997 TSE Examinees Whose Proficiency Measures Showed Significant Change When Compared to Measures from the Operational Scoring

Examinee ID   Operational Measure (SE)   Adjusted Measure (z) under benchmark sets showing significant change
272           1.45 (.43)                 2.69 (-2.1); 2.64 (-2.0); 2.63 (-2.0); 2.88 (-2.4)

Note: Adjusted measures and z values are shown only for benchmark sets under which the examinee's proficiency measure differed significantly from the operational measure; the sets compared were the 3 most consistent, the 3 least consistent, the least consistent, the most consistent, the 3 lowest, and the 3 highest benchmarks.

Table 6. April 1997 TSE Examinees Whose Proficiency Measures Showed Significant Change When Compared to Measures from the Operational Scoring

Examinee ID   Operational Measure (SE)   Adjusted Measure (z) under benchmark sets showing significant change
201           2.40 (.42)                 0.05 (3.9); -1.35 (6.3); 3.6 (-2.0); 0.76 (2.7)

Note: Adjusted measures and z values are shown only for benchmark sets under which the examinee's proficiency measure differed significantly from the operational measure; the sets compared were the 4 most consistent, the 2 least consistent, the least consistent, the most consistent, two 30s, three 40s, and one 50s benchmarks.


Rater Severity Measures

Table 7 and Table 8 list those raters participating in the February 1997 and April 1997 TSE scoring sessions whose severity measures differed depending upon which set of ratings from the benchmark tapes we embedded in the operational data. Each table reports the rater's severity measure and associated standard error from the Facets analysis of the operational data containing no ratings of benchmark tapes, along with the rater's adjusted severity measure and z value for each set of benchmark tapes under which the severity measure was significantly different from the operational measure.13

As Table 7 shows, 40% of the raters in the February 1997 scoring (i.e., 16 of the 40 raters) had severity measures that were significantly different from their operational severity measures for at least one set of benchmark tapes. Fifty-seven percent of the raters in Subset 2 (i.e., 4 of the 7 raters in that subset) and 36% of the raters in Subset 1 (i.e., 12 of the 33 raters in that subset) had significantly different severity measures for at least one set of benchmark tapes. It is important to note that the raters in the smaller subset (i.e., Subset 2) rated many fewer examinees' tapes on average than the raters in the larger subset (i.e., Subset 1). Raters in Subset 2 rated between one and five examinees' tapes, while raters in Subset 1 rated between 18 and 51 tapes.

Similarly, Table 8 reveals that 41% of the raters in the April 1997 scoring (i.e., 17 of the 41 raters) had severity measures that were significantly different from their operational severity measures for at least one set of benchmark tapes. Both raters in Subset 3 had severity measures that were significantly different, as did 57% of the raters in Subset 2 (i.e., 4 of the 7 raters in that subset) and 34% of the raters in Subset 1 (i.e., 11 of the 32 raters in that subset). Again, the raters in the two smaller subsets (i.e., Subsets 2 and 3) rated many fewer examinees' tapes on average than the raters in the larger subset (i.e., Subset 1). Raters in Subsets 2 and 3 rated between one and five examinees' tapes, while raters in Subset 1 rated between 15 and 52 tapes.

It appears, then, that the embedding of ratings from benchmark tapes into the operational data has a stronger impact on the calculation of the severity measures for those raters in the small disconnected subsets than on the severity measures for raters in the larger, main subset. There are substantive reasons to believe that the raters in the smaller subsets were not a representative sample of the larger pool of raters. The raters in the smaller subsets tended to rate a smaller number of examinees, suggesting that they were either slower raters or they were not present during the entire scoring session. We might expect such raters to perform differently, as a group, than other raters.

Rater Fit

We compared the mean-square fit indices of raters when we introduced into the operational data the different blocks of ratings for the various sets of benchmark tapes. Those results are presented in Table 9 and Table 10. Table 9 shows the raters who misfit when we added the ratings from the various sets of benchmark tapes to the February 1997 operational data, while Table 10 shows the raters who misfit when we added the ratings from the various sets of benchmark tapes to the April 1997 operational data. For comparison purposes, each table also lists the misfitting raters from the Facets analysis of the operational data containing no ratings of benchmark tapes and from the analysis of the operational data containing ratings of all six benchmark tapes.

13 For each set, we report the rater's severity measure adjusted for the difference between the mean severity of the distribution of severity measures from the analysis of the operational data containing no ratings of benchmark tapes and the mean severity of the distribution of severity measures from the analysis of the operational data including the ratings from that particular set of benchmark tapes.


Table 7. February 1997 TSE Raters Whose Severity Measures Showed Significant Change When Compared to Measures from the Operational Scoring

Rater ID   Operational Measure (SE)   Adjusted Measure (z) under benchmark sets showing significant change
150        -2.65 (.12)                -3.28 (3.9); -3.27 (3.8); -3.00 (2.1); -3.44 (4.9); -3.12 (2.9)
213        -2.17 (.15)                -3.27 (5.4); -2.89 (3.5); -2.81 (3.1); -3.33 (5.7)
290        -4.75 (.62)                -2.52 (-3.1); -2.63 (-3.0); -2.89 (-2.4); -2.52 (-3.2); -2.68 (-2.9)
333        -2.21 (.11)                -2.60 (2.6); -2.52 (-2.1)
342        -3.52 (.13)                -3.06 (-2.6); -3.02 (-2.8)
360        -3.55 (.53)                -2.27 (-2.1); -1.50 (-3.2); -1.81 (-2.9)
390        -2.92 (.12)                -3.32 (2.5)
410        -2.22 (.10)                -2.67 (3.3); -2.70 (3.6); -2.49 (2.0); -2.53 (2.2); -2.85 (4.7); -2.53 (2.3)
520        -2.81 (.35)                -1.69 (-2.6); -1.56 (-2.9); -1.49 (-3.1); -1.79 (-2.4)
530        -2.42 (.11)                -2.82 (2.6); -2.87 (2.9); -2.87 (2.9); -2.82 (2.6)
300        -4.50 (.11)                -4.17 (2.2); -4.18 (-2.2)
445        -2.81 (.10)                -3.23 (3.1); -3.32 (3.6); -3.29 (3.6)
477        -1.61 (.11)                -2.05 (3.0); -1.92 (2.1); -2.00 (2.6)
302        -2.76 (.27)                -1.85 (-2.4); -1.76 (-2.9)
260        -1.87 (.12)                -2.29 (2.6)
516        -2.11 (.12)                -2.47 (2.2)

Note: Adjusted severity measures and z values are shown only for benchmark sets under which the rater's severity measure differed significantly from the operational measure; the sets compared were the 3 most consistent, the 3 least consistent, the least consistent, the most consistent, the 3 lowest, and the 3 highest benchmarks.


Table 8. April 1997 TSE Raters Whose Severity Measures Showed Significant Change When Compared to Measures from the Operational Scoring

Rater ID   Operational Measure (SE)   Adjusted Measure (z) under benchmark sets showing significant change
500        -3.08 (.10)                -2.60 (-3.6); -2.75 (-2.3); -2.79 (-2.1); -2.54 (-4.0)
572        -2.42 (.09)                -2.13 (-2.3)
404        -3.38 (.63)                -6.72 (4.5); -7.29 (4.8); -1.83 (-2.1); -5.73 (3.4)
520        -1.99 (.26)                -1.24 (-2.2); -1.24 (-2.1)
67         -2.34 (.56)                -3.72 (2.1); -5.49 (4.6)
123        -2.28 (.15)                -1.86 (-2.1)
124        -1.95 (.11)                -1.63 (-2.1)
317        -2.22 (.09)                -1.93 (-2.3)
340        -3.04 (.08)                -2.81 (-2.0); -3.27 (2.0); -3.30 (2.3)
490        -2.01 (.15)                -1.54 (-2.3); -1.58 (-2.2)
572        -2.42 (.09)                -2.10 (-2.5); -2.05 (-2.9)
111        -1.03 (.13)                -1.42 (2.2); -1.39 (2.0)
213        -2.31 (.13)                -2.67 (2.0)
216        -2.29 (.33)                -1.43 (-2.1)
60         -3.70 (.66)                -2.11 (-2.1)
281        -2.82 (.10)                -2.45 (-2.6)
470        -1.10 (.52)                -2.69 (2.7)

Note: Adjusted severity measures and z values are shown only for benchmark sets under which the rater's severity measure differed significantly from the operational measure; the sets compared were the 4 most consistent, the 2 least consistent, the least consistent, the most consistent, two 30s, three 40s, and one 50s benchmarks.


Table 9. Mean-Square Infit and Outfit Statistics for Misfitting Raters from the February 1997 TSE Scoring

Rater ID   Misfitting mean-square infit/outfit values across the analysis conditions
10         1.6; 1.6; 1.6; 1.6
445        1.6
360        1.6; 1.7; 1.6

Note: Values are shown only where a rater misfit under a given condition; the conditions compared were the operational data with no benchmarks, the operational data with all six benchmarks, the 3 most consistent, the 3 least consistent, the least consistent, the most consistent, the 3 lowest, and the 3 highest benchmarks.

Table 10. Mean-Square Infit and Outfit Statistics for Misfitting Raters from the April 1997 TSE Scoring

Rater ID   Misfitting mean-square infit/outfit values across the analysis conditions
40         1.7; 1.6; 1.6; 1.7; 1.7
520        1.8; 1.7; 1.6; 1.7; 1.7; 1.6; 1.6; 1.7; 1.6
597        2.0; 2.0; 2.0; 2.0; 1.9; 2.0; 1.6; 1.6; 1.6; 1.7; 1.9
530        1.7
60         2.1; 1.9; 2.7; 2.4; 1.6; 1.8; 1.9
404        2.6; 2.6
67         2.0; 2.2; 1.6

Note: Values are shown only where a rater misfit under a given condition; the conditions compared were the operational data with no benchmarks, the operational data with all six benchmarks, the 4 most consistent, the 2 least consistent, the least consistent, the most consistent, two 30s, three 40s, and one 50s benchmarks.


Table 9 reveals that only three raters showed significant misfit when we added the ratings from at least one of the sets of benchmark tapes to the February 1997 operational data, and in each case the level of misfit was minor (i.e., mean-square infit and outfit values of 1.6 and 1.7). By contrast, there were more instances of misfit among the raters who participated in the April 1997 scoring session, as is shown in Table 10, and the level of misfit was somewhat higher (i.e., mean-square infit and outfit values in the range of 1.6 to 2.7). All seven misfitting raters from the April 1997 scoring were from the small disconnected subsets (i.e., Subsets 2 and 3). We would expect the raters in the small, disconnected subsets to show misfit more often than raters in the larger, main subset, since each of these raters rated only a few examinees and the calculation of the fit statistic is highly sensitive to the number of examinees a rater rates.
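For reference (the report does not restate these formulas), the mean-square indices reported by Facets follow the standard Rasch definitions (Wright & Masters, 1982). For rater j, with observed ratings x_{ni}, model-expected ratings E_{ni}, and model variances W_{ni} over the N_j ratings in which that rater participated,

    \text{Infit MS}_j = \frac{\sum_{n,i} (x_{ni} - E_{ni})^2}{\sum_{n,i} W_{ni}},
    \qquad
    \text{Outfit MS}_j = \frac{1}{N_j} \sum_{n,i} \frac{(x_{ni} - E_{ni})^2}{W_{ni}}

Both statistics have an expected value near 1, and their sampling variability grows as N_j shrinks, which is consistent with the observation above that misfit flags were concentrated among raters who rated only a few examinees.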

Table 9 and Table 10 show that there were more instances of rater misfit when we embedded less consistent benchmark tapes into the operational data than when we embedded more consistent tapes. These results suggest that when we embed ratings from benchmark tapes that are harder for raters to rate (i.e., show a pattern of inconsistent examinee performance across items), we are likely to have more instances of rater misfit than when we embed ratings from benchmark tapes that are easier for raters to rate (i.e., show a pattern of consistent examinee performance across items).

Linking Quality

In this section, we describe the results of our investigation of the second and third research questions--what is the minimum number of tapes required to establish connectivity (i.e., a question of strength), and what are the characteristics of tapes that establish the highest quality links. As explained previously, we relied on chi-square fit statistics to depict the invariance of parameter estimates derived from each of the disconnected subsets and each of the linking benchmark sets that we investigated. We were unable to evaluate examinee link qualities since there were no common examinees in our analyses: In the operational data, examinees were nested within subset, and none of the examinees in these subsets were included in the sets of benchmark tapes. On the other hand, we were able to evaluate the quality of the link between the benchmark tapes and the items, and between the benchmark tapes and each disconnected subset of raters.

Minimum Number of Tapes. Technically, the minimum number of common benchmark tapes required to establish connectivity between two disconnected subsets is one, and one might hypothesize that links based on larger numbers of common benchmark tapes would provide more stable, higher quality linking than links based on a single benchmark tape. However, our results do not seem to support that hypothesis. Table 11 shows the chi-square fit statistics for the pair of item links and the pair of rater links for each linking strategy used in the February 1997 data sets. Each pair consists of the link between the large subset and the designated linking benchmark set and the link between the smaller subset and the designated linking benchmark set. In some cases, less stable item links resulted from using single benchmark tapes (e.g., note that the largest chi-square for items in the large subset was 149.34 for the one least consistent benchmark). However, in other cases, less stable item links resulted from using multiple benchmark tapes (e.g., note the large chi-square of 87.53 for items in the large subset for the three lowest benchmarks).


Table 11. Chi-Square Linking Summary for February 1997 TSE Data

Linking Set              Items: Large Subset (11)   Items: Small Subset (11)   Raters: Large Subset (32)   Raters: Small Subset (6)
Three most consistent    45.84                      58.36                      464.99                      39.94
Three least consistent   74.67                      27.78                      328.45                      36.25
Three lowest             87.53                      33.66                      478.08                      52.65
Three highest            39.80                      46.83                      397.13                      26.40
One least consistent     149.34                     60.77                      497.65                      86.15
One most consistent      80.68                      69.45                      237.46                      55.47

Note: The degrees of freedom associated with each chi-square fit statistic are shown in parentheses in the column headings.

Table 12. Chi-Square Linking Summary for April 1997 TSE Data

Linking Set              Items: Large Subset (11)   Items: Small Subset (11)   Raters: Large Subset (31)   Raters: Small Subset (6)
Four most consistent     101.20                     21.30                      134.80                      30.95
Two least consistent     110.81                     38.57                      200.85                      3.38
Two 30s                  165.09                     51.68                      263.70                      18.34
Three 40s                125.11                     30.02                      184.36                      30.03
One 50s                  21.18                      30.55                      67.17                       11.96
One most consistent      124.24                     67.46                      195.77                      14.41
One least consistent     54.90                      40.23                      436.95                      27.37

Note: The degrees of freedom associated with each chi-square fit statistic are shown in parentheses in the column headings.


Similarly mixed results were obtained for the rater linking chi-square statistics. In some cases, less stable links resulted from using single benchmark tapes (e.g., note that the largest chi-square for raters in the large subset was 497.65 for the least consistent benchmark). However, in other cases, less stable links resulted from using multiple benchmark tapes (e.g., note the large chi-square of 478.08 for raters in the large subset for the three lowest benchmarks), while a more stable link resulted from using a single benchmark tape (e.g., note that the smallest chi-square for raters in the large subset was 237.46 for the one most consistent benchmark). Table 12 shows the corresponding chi-square fit statistics for the April 1997 data sets. Again, the results do not support the hypothesis that having more benchmark tapes in a set necessarily leads to more stable, higher quality linking.

Tape Characteristics. We were also interested in determining whether there was a relationship between the composition of the linking benchmark sets and link quality. One might hypothesize that tapes exhibiting certain characteristics would produce higher quality linking than tapes exhibiting other characteristics. Our results provide qualified support for that hypothesis. We refer to Table 13 for these evaluations. When we compare the benchmark sets that appear in the first five rows of Table 13, we see that for both item and rater links, the highest scoring benchmark sets tended to provide the highest quality, most stable linking (i.e., the average ranks for the item chi-squares and rater chi-squares are both 2). In addition, the least consistent benchmark sets tended to provide fairly stable links (i.e., the average ranks for the item chi-squares and rater chi-squares range from 2 to 4). The most consistent benchmark sets and middle scoring benchmark sets provided somewhat less stable links (i.e., the average ranks for the item chi-squares and rater chi-squares range from 2 to 4.5), and the lowest scoring benchmark sets provided the least stable links (i.e., the average ranks for the item chi-squares and rater chi-squares range from 3.5 to 6.5).

When we compare the single benchmarks that appear in the last three rows of Table 13, we see that again the highest scoring single benchmark provided higher quality, more stable linking (i.e., the average ranks for the item chi-squares and rater chi-squares range from 1.5 to 2) than either the most consistent single benchmark (i.e., the average ranks range from 3 to 6) or the least consistent single benchmark (i.e., the average ranks range from 3.5 to 6).


Table 13. Summary of Chi-Square Rankings for February 1997 and April 1997 TSE Data

Linking Set1               February 1997: Average Item Rank   February 1997: Average Rater Rank   April 1997: Average Item Rank   April 1997: Average Rater Rank
Highest scoring set        2                                  2                                   --                              --
Lowest scoring set         3.5                                4.5                                 6.5                             5
Most consistent set        3                                  3.5                                 2                               4.5
Least consistent set       2                                  2                                   4                               3
Middle scoring set         --                                 --                                  4                               4.5
Highest scoring single     --                                 --                                  2                               1.5
Most consistent single     5                                  3                                   6                               3.5
Least consistent single    5.5                                6                                   3.5                             6

1 Linking sets from February 1997 and April 1997 were roughly matched according to number and types of benchmarks.


Summary of Findings

For each of the questions we posed for this study, we summarize below the findings from our research.

1. How does embedding in the operational data blocks of ratings from various smaller sets of the six examinee tapes affect: the spread of the rater severity measures, the spread of the examinee proficiency measures, the stability of rater severity measures, the stability of the examinee proficiency measures, and the fit of the raters?

The examinee separation indices were quite similar across the various sets of benchmark tapes. We found small differences, though, in the rater separation indices. In general, the rater separation indices for sets that involved the scoring of less consistent benchmark tapes were higher than the rater separation indices for sets involving the scoring of more consistent tapes. We also found that when we embedded ratings from less consistent benchmark tapes into the operational data, we were more likely to have higher rates of rater misfit than when we embedded ratings from more consistent benchmark tapes.

The effects of including ratings from various sets of the benchmark tapes on examinee proficiency measures were minimal; only two examinees' proficiency measures changed, and both of those examinees were in the small, disconnected subsets. There was little impact on the calculation of the proficiency measures for examinees in the larger, main subset. Similarly, the embedding of ratings from benchmark tapes into the operational data had a stronger impact on the calculation of the severity measures for raters in the small disconnected subsets than on the severity measures for raters in the larger, main subset.

The fact that some examinee proficiency measures changed in the small disconnected subsets suggests that TSE raters need to have some benchmark tapes that they score in common in order to avoid the disconnected subsets problem. Otherwise, the measures of examinee proficiency for examinees in the small subsets can be strongly biased by the accidental allocation of overly lenient or overly severe raters to that small group of examinees.

2. How many tapes do raters need to score in common in order to establish the minimal requisite connectivity in the rating design so that all raters and examinees can be placed on a single scale? What (if anything) is to be gained by having all raters score more than one or two tapes in common?

We found that all of our benchmark sets were effective for establishing at least the minimal connectivity needed in the rating design in order to allow placement of all raters and all examinees on a single scale. Having all raters in a scoring session rate a single benchmark tape, in addition to the tapes each would score as part of his or her normal workload, was sufficient to connect all raters in the rating design. However, it is not clear from our study that having more than one benchmark tape is necessarily better than having a single tape. It appears that the characteristics of the tape that is used, rather than how many tapes are used, are the more important consideration. One strong tape that provides high quality linking could be as effective as, if not more effective than, three weak tapes that provide low quality linking. At this point, it is unclear just how many benchmark tapes are necessary to establish a link of adequate strength. However, the cost of adding extra benchmark tapes must be taken into account. At some point, we would expect diminishing returns in link quality as the cost of increasing the strength of a link increases.


3. What are the characteristics of tapes that produce the highest quality linking? Are tapes that exhibit certain characteristics more effective as linking tools than tapes that exhibit other characteristics?

The results of this study suggest that when benchmark sets were used, the highest scoring benchmarks (i.e., those examinees who scored 50s and 60s across the items) produced the highest quality linking (i.e., the most stable linking). The least consistent benchmark sets (i.e., those that were somewhat harder to rate because an examinee's performance varied across items) tended to provide fairly stable links. The most consistent benchmarks (i.e., those that were somewhat easier to rate because an examinee's performance was similar across items) and middle scoring benchmarks (i.e., those from examinees who scored 30s and 40s across the items) tended to provide less stable linking. Low scoring benchmark sets provided the least stable linking. When a single benchmark tape was used, the highest scoring single tape provided higher quality linking than either the least consistent or most consistent benchmark tape.

Of course, the problem with drawing these conclusions is that, with so few replications of each condition we examined (i.e., high scoring, middle scoring, low scoring, least consistent, most consistent), we cannot be sure whether the observed differences are due to true variability between these conditions or simply to sampling error. Hence, we suggest that further studies be conducted to determine the generalizability and validity of our findings. First, simulation studies should be performed to determine how the statistical qualities of various benchmarking sets influence the quality of the links that result. That is, by systematically varying features of the simulated examinee responses (e.g., magnitude of the underlying parameters, consistency of raters' ratings of those responses, etc.), we should be able to determine which types of examinee responses would be good candidates for linking sets in operational settings. Second, the relationship between the cost of rating additional tapes and the increases in the strength of the resulting link is unclear. Again, simulation studies would prove helpful in determining how many tapes are needed to establish the most efficient links. Third, additional studies should be performed on operational data to determine whether other, nonstatistical qualities of the responses (e.g., the auditory quality of the tape, the task characteristics, the examinee's background, etc.) can be used to supplement the information provided by simulation studies concerning the characteristics of good linking tapes. It is likely that a combination of both statistical and qualitative criteria would result in the highest quality links. Studies such as these would provide us with a better understanding of how to solve the connectivity problems inherent in operational scoring sessions.

We would recommend as a next step that the TSE program develop procedures for monitoring TSE raters in "real time" to determine whether the link among raters is sufficient to avoid the disconnected subsets problem. If Facets were run in real time, then the analyst could determine whether there were weaknesses in the judging plan that would result in disconnected subsets. If the real-time analyses showed disconnection among raters, then steps could be taken during the scoring session to detect and remedy weaknesses in the linking structure by administering additional benchmark tapes to the relevant raters.
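As an illustration of the kind of real-time check described above (our sketch, not an existing TSE or Facets procedure; the data layout and the benchmark identifier are hypothetical), the judging plan can be treated as a network in which every rating connects a rater to an examinee; more than one connected component signals disconnected subsets that additional benchmark tapes would need to bridge.

    # Sketch: flag disconnected subsets in a rating design (hypothetical data format).
    from collections import defaultdict

    def connected_components(ratings):
        """ratings: iterable of (rater_id, examinee_id) pairs observed so far."""
        graph = defaultdict(set)
        for rater, examinee in ratings:
            r, e = ("rater", rater), ("examinee", examinee)
            graph[r].add(e)
            graph[e].add(r)
        seen, components = set(), []
        for node in graph:
            if node in seen:
                continue
            stack, component = [node], set()
            while stack:
                current = stack.pop()
                if current in seen:
                    continue
                seen.add(current)
                component.add(current)
                stack.extend(graph[current] - seen)
            components.append(component)
        return components

    # Raters A/B and raters C/D share no examinees, so two subsets appear until
    # a benchmark tape ("BM1", hypothetical) is rated by all four raters.
    plan = [("A", 1), ("B", 1), ("B", 2), ("C", 3), ("D", 3), ("D", 4)]
    print(len(connected_components(plan)))                      # 2 -> disconnected subsets
    plan += [(rater, "BM1") for rater in ("A", "B", "C", "D")]
    print(len(connected_components(plan)))                      # 1 -> fully connected

A check of this kind captures only the basic notion of connectedness; Facets applies a more thorough subset analysis, but the graph view is enough to show, during a scoring session, where an extra benchmark tape is needed.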


References

Adams, R. J., & Wilson, M. (1996). Formulating the Rasch model as a mixed coefficients multinomial logit. In G. Engelhard, Jr., & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 143-166). Norwood, NJ: Ablex.

Bock, R. D. (1997). IRT analysis and scoring of multiple ratings. Unpublished manuscript.

Bock, R. D., & Muraki, E. (1998, June). The information in multiple ratings. Paper presented at the annual meeting of the Psychometric Society of North America, Champaign, IL.

Braun, H. I. (1988). Understanding scoring reliability: Experiments in calibrating essay readers. Journal of Educational Statistics, 13(1), 1-18.

Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. (1995). Generalizability analysis for educational assessments. Evaluation comment. Los Angeles: UCLA Center for the Study of Evaluation and The National Center for Research on Evaluation, Standards and Student Testing.

Engelhard, G., Jr. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93-112.

Heller, J., Sheingold, K., & Myford, C. (1998). Reasoning about evidence in portfolios: Cognitive foundations for valid and reliable assessment. Educational Assessment, 5(1), 5-40.

Koretz, D., Stecher, B., Klein, S., & McCaffrey, D. (1994). The Vermont portfolio assessment program: Findings and implications. Educational Measurement: Issues and Practice, 13(3), 5-16.

Linacre, J. M. (1994). Many-faceted Rasch measurement. Chicago: MESA Press, University of Chicago.

Linacre, J. M. (1998). Linking constants with common items and judges. Rasch Measurement: Transactions of the Rasch Measurement SIG, 12(1), 621.

Linacre, J. M. (1999). Facets (Version 3.17) [Computer software]. Chicago: MESA Press, University of Chicago.

Linacre, J. M., & Wright, B. D. (1994). A user's guide to Facets: Rasch measurement computer program [Computer program manual]. Chicago: MESA Press, University of Chicago.

Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71.

Lunz, M. E., & Stahl, J. A. (1990). Judge consistency and severity across grading periods. Evaluation and the Health Professions, 13(4), 425-444.

Lunz, M. E., Stahl, J. A., & Wright, B. D. (1996). The invariance of judge severity calibrations. In G. Engelhard, Jr., & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 99-112). Norwood, NJ: Ablex.

Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331-345.

Marr, D. B. (1994). A comparison of equating and calibration methods for the Test of Spoken English. Unpublished manuscript.

McNamara, T. J. (1996). Measuring second language performance. New York: Addison Wesley Longman.

McNamara, T. J., & Adams, R. J. (1991, March). Exploring rater behaviour with Rasch techniques. Paper presented at the 13th Language Testing Research Colloquium, Educational Testing Service, Princeton, NJ.

Myford, C. M., Marr, D. B., & Linacre, J. M. (1996). Reader calibration and its potential role in equating for the Test of Written English (TOEFL Research Report No. 52). Princeton, NJ: Educational Testing Service.

Myford, C. M., & Mislevy, R. J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Report No. MS 94-05). Princeton, NJ: Educational Testing Service.

Patz, R. J., Junker, B. W., & Johnson, M. S. (1999, April). The hierarchical rater model for rated test items and its application to large scale educational assessment data. Paper presented at the annual meeting of the American Educational Research Association, Montréal, Canada.

Paul, S. R. (1981). Bayesian methods for calibration of examiners. British Journal of Mathematical and Statistical Psychology, 34(2), 213-223.

Raymond, M. R., Webb, L. C., & Houston, W. M. (1991). Correcting performance-rating errors in oral examinations. Evaluation and the Health Professions, 14(1), 100-122.

Smith, R. M. (1996). Item component equating. In G. Engelhard, Jr., & M. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 289-308). Norwood, NJ: Ablex.

TOEFL Program. (1996). SPEAK®: Speaking Proficiency English Assessment Kit. Rater training guide. Princeton, NJ: Educational Testing Service.

TSE Program Office. (1995). TSE score user's manual. Princeton, NJ: Educational Testing Service.

Wilson, M., & Hoskens, M. (1999, April). The rater bundle model. Paper presented at the annual meeting of the American Educational Research Association, Montréal, Canada.

Wright, B. D. (1998). Interpreting reliabilities. Rasch Measurement: Transactions of the Rasch Measurement SIG, 11(4), 602.

Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement: Transactions of the Rasch Measurement SIG, 8(3), 370.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press, University of Chicago.

Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press, University of Chicago.


Appendix A

Test of Spoken English Band Descriptor Chart


Overall features to consider:

Functional competence is the speaker's ability to select functions to reasonably address the task and to select the language needed to carry out the function.

Sociolinguistic competence is the speaker's ability to demonstrate an awareness of audience and situation by selecting language, register (level of formality), and tone that are appropriate.

Discourse competence is the speaker's ability to develop and organize information in a coherent manner and to make effective use of cohesive devices to help the listener follow the organization of the response.

Linguistic competence is the effective selection of vocabulary, control of grammatical structures, and accurate pronunciation, along with smooth delivery, in order to produce intelligible speech.

TSE BAND DESCRIPTOR CHART (Draft 4/5/96)

Band 60

Communication almost always effective: task performed very competently. Speaker volunteers information freely, with little or no effort, and may go beyond the task by using additional appropriate functions.
• Native-like repair strategies
• Sophisticated expressions
• Very strong content
• Almost no listener effort required

Functions performed clearly and effectively. Speaker is highly skillful in selecting language to carry out intended functions that reasonably address the task.

Appropriate response to audience/situation. Speaker almost always considers register and demonstrates audience awareness.
• Understanding of context, and strength in discourse and linguistic competence, demonstrate sophistication

Coherent, with effective use of cohesive devices. Response is coherent, with logical organization and clear development.
• Contains enough details to almost always be effective
• Sophisticated cohesive devices result in smooth connection of ideas

Use of linguistic features almost always effective; communication not affected by minor errors.
• Errors not noticeable
• Accent not distracting
• Range in grammatical structures and vocabulary
• Delivery often has native-like smoothness

Band 50

Communication generally effective: task performed competently. Speaker volunteers information, sometimes with effort; usually does not run out of time.
• Linguistic weaknesses may necessitate some repair strategies that may be slightly distracting
• Expressions sometimes awkward
• Generally strong content
• Little listener effort required

Functions generally performed clearly and effectively. Speaker is able to select language to carry out functions that reasonably address the task.

Generally appropriate response to audience/situation. Speaker generally considers register and demonstrates sense of audience awareness.
• Occasionally lacks extensive range, variety, and sophistication; response may be slightly unpolished

Coherent, with some effective use of cohesive devices. Response is generally coherent, with generally clear, logical organization, and adequate development.
• Contains enough details to be generally effective
• Some lack of sophistication in use of cohesive devices may detract from smooth connection of ideas

Use of linguistic features generally effective; communication generally not affected by errors.
• Errors not unusual, but rarely major
• Accent may be slightly distracting
• Some range in vocabulary and grammatical structures, which may be slightly awkward or inaccurate
• Delivery generally smooth, with some hesitancy and pauses

Band 40

Communication somewhat effective: task performed somewhat competently. Speaker responds with effort; sometimes provides limited speech sample and sometimes runs out of time.
• Sometimes excessive, distracting, and ineffective repair strategies used to compensate for linguistic weaknesses (e.g., vocabulary and/or grammar)
• Adequate content
• Some listener effort required

Functions performed somewhat clearly and effectively. Speaker may lack skill in selecting language to carry out functions that reasonably address the task.

Somewhat appropriate task response to audience/situation. Speaker demonstrates some audience awareness, but register is not always considered.
• Lack of linguistic skills that would demonstrate sociolinguistic sophistication

Somewhat coherent, with some use of cohesive devices. Coherence of the response is sometimes affected by lack of development and/or somewhat illogical or unclear organization, sometimes leaving the listener confused.
• May lack details
• Mostly simple cohesive devices are used
• Somewhat abrupt openings and closures

Use of linguistic features somewhat effective; communication sometimes affected by errors.
• Minor and major errors present
• Accent usually distracting
• Simple structures sometimes accurate, but errors in more complex structures common
• Limited range in vocabulary; some inaccurate word choices
• Delivery often slow or choppy; hesitancy and pauses common

Band 30

Communication generally not effective: task generally performed poorly. Speaker responds with much effort; provides limited speech sample and often runs out of time.
• Repair strategies excessive, very distracting, and ineffective
• Much listener effort required
• Difficult to tell if task is fully performed because of linguistic weaknesses, but function can be identified

Functions generally performed unclearly and ineffectively. Speaker often lacks skills in selecting language to carry out functions that reasonably address the task.

Generally inappropriate response to audience/situation. Speaker usually does not demonstrate audience awareness since register is often not considered.
• Lack of linguistic skills generally masks sociolinguistic skills

Generally incoherent, with little use of cohesive devices. Response is often incoherent; loosely organized, inadequately developed, or disjointed discourse often leaves the listener confused.
• Often lacks detail
• Simple conjunctions used as cohesive devices, if at all
• Abrupt openings and closures

Use of linguistic features generally poor; communication often impeded by major errors.
• Limited linguistic control; major errors present
• Accent very distracting
• Speech contains numerous sentence fragments and errors in simple structures
• Frequent inaccurate word choices; general lack of vocabulary for task completion
• Delivery almost always plodding, choppy, and repetitive; hesitancy and pauses very common

Band 20

No effective communication; no evidence of ability to perform task. Extreme speaker effort is evident; speaker may repeat the prompt, give up on the task, or be silent.
• Attempts to perform task end in failure
• Only isolated words or phrases intelligible, even with much listener effort
• Function cannot be identified

No evidence that functions were performed. Speaker is unable to select language to carry out the functions.

No evidence of ability to respond appropriately to audience/situation. Speaker is unable to demonstrate sociolinguistic skills and fails to acknowledge audience or consider register.

Incoherent, with no use of cohesive devices. Response is incoherent.
• Lack of linguistic competence interferes with listener's ability to assess discourse competence

Use of linguistic features poor; communication ineffective due to major errors.
• Lack of linguistic control
• Accent so distracting that few words are intelligible
• Speech contains mostly sentence fragments, repetition of vocabulary, and simple phrases
• Delivery so plodding that only a few words are produced


Appendix B

Calculation of TSE Scores


Scores on the TSE are calculated using the following method:14

Step 1: Calculate the score averages for both raters. To calculate each rater's score average, add the 12 item scores assigned and divide by 12. Then figure the averages to the second decimal place. Do not round to the nearest five-point increment until Step 4.

Step 2: Calculate the difference between the two score averages. Subtract the low average from the high average to determine the difference.

Step 3: Average the two raters' scores, or obtain a third rating if needed.
• If the difference between the two raters' averages is less than 10, calculate the average of the averages by adding them and dividing by two.
• If the difference between the two raters' averages is 10 or greater, the TSE coordinator will do a third rating. Average the two closest averages and disregard the discrepant average. If the third rater's average is equidistant from the two discrepant averages, use the third rater's average.

Step 4: The average of the two raters' scores should then be rounded up or down to the nearest five-point score increment. Scores exactly halfway between two levels should be rounded up. For example, a score of 42.5 would be rounded up to a score of 45. However, a score of 42.49 would be rounded down to 40.

14 This description of the calculation of TSE scores is adapted from the description of the calculation of SPEAK scores, as found on page 2 of the SPEAK Rater Training Guide (TOEFL Program, 1996). The process used for scoring the SPEAK test is identical to the process used for scoring the TSE.


Example of Score Calculation

Item       Rater 1's Scores   Rater 2's Scores   Rater 3's Scores (after Step 3)
Item 1     50                 60                 50
Item 2     50                 60                 50
Item 3     50                 60                 50
Item 4     50                 60                 60
Item 5     50                 60                 60
Item 6     50                 60                 60
Item 7     50                 60                 50
Item 8     50                 60                 60
Item 9     50                 60                 60
Item 10    50                 60                 50
Item 11    50                 60                 50
Item 12    50                 60                 60
Step 1     Avg = 50           Avg = 60           Avg = 55

Step 2: 60 - 50 = 10
Step 3: The difference between the two raters' averages is 10, so a third rating is obtained.
Step 4: Rater 3's average is equidistant from the two raters' averages, so rater 3's average is used, rounded to a reported score of 55.
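The scoring rules in Appendix B can also be expressed compactly in code. The sketch below (our illustration; the function name is ours, and the handling of the third rating assumes, as in the worked example, that the third average is closest to one of the first two) reproduces Steps 1 through 4 and returns the reported score of 55 for the example above.

    # Sketch of the TSE score calculation described in Appendix B (Steps 1-4).
    def tse_score(rater1, rater2, rater3=None):
        """Each argument is the list of 12 item scores assigned by one rater."""
        avg1, avg2 = sum(rater1) / 12.0, sum(rater2) / 12.0        # Step 1
        if abs(avg1 - avg2) < 10:                                  # Step 2
            final = (avg1 + avg2) / 2.0                            # Step 3, no third rating needed
        else:
            avg3 = sum(rater3) / 12.0                              # Step 3, third rating obtained
            d1, d2 = abs(avg3 - avg1), abs(avg3 - avg2)
            if d1 == d2:                  # equidistant: use the third rater's average
                final = avg3
            else:                         # average the two closest; drop the discrepant one
                final = (avg3 + (avg1 if d1 < d2 else avg2)) / 2.0
        # Step 4: round to the nearest five-point increment, halves rounding up.
        return int(5 * ((final + 2.5) // 5))

    print(tse_score([50] * 12, [60] * 12,
                    [50, 50, 50, 60, 60, 60, 50, 60, 60, 50, 50, 60]))   # 55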
