
Report on Knowledge Assessment Construction and Validation for New Case Manager Certification

April 2007

PREPARED FOR

The Tennessee Department of Children’s Services

BY

JULIANNA MAGDA, MS

SISSIE HADJIHARALAMBOUS, PhD

THE UNIVERSITY OF TENNESSEE COLLEGE OF SOCIAL WORK OFFICE OF RESEARCH AND PUBLIC SERVICE


The University of Tennessee College of Social Work Office of Research and Public Service

KAREN SOWERS, DEAN

PAUL CAMPBELL, DIRECTOR

The University of Tennessee does not discriminate on the basis of race, sex, color, religion, national origin, age, disability or veteran status in provision of educational programs and services or employment opportunities and benefits. This policy extends to both employment by and admission to the University.

The University does not discriminate on the basis of race, sex, or disability in its education programs and activities pursuant to the requirements of Title VI of the Civil Rights Act of 1964, Title IX of the Education Amendments of 1972, Section 504 of the Rehabilitation Act of 1973, and the Americans with Disabilities Act (ADA) of 1990.

Inquiries and charges of violation concerning Title VI, Title IX, Section 504, ADA or the Age Discrimination in Employment Act (ADEA) or any of the other above referenced policies should be directed to the Office of Equity and Diversity (OED), 1840 Melrose Avenue, Knoxville, TN 37996-3560, telephone (865) 974-2498 (V/TTY available) or 974-2440. Requests for accommodation of a disability should be directed to the ADA Coordinator at the UTK Office of Human Resources, 600 Henley Street, Knoxville, TN 37996-4125.

The University of Tennessee, Knoxville, in its efforts to ensure a welcoming environment for all persons, does not discriminate on the basis of sexual orientation in its campus-based programs, services, and activities. Inquiries and complaints should be directed to the Office of Equity and Diversity.

Project # 07048


Table of Contents

List of Tables
List of Figures
Introduction
Background
Development of Knowledge Assessment Objectives and Specifications/Blueprint
Item Writing and Item Editing
Field Administration
Item Analysis
Methodology
Classical Test Theory
Initial Analysis
Number of Items Assessed
Future Directions
Item Response Theory (IRT)
Conclusion
Comparing Classical Test Theory and Item Response Theory
Exploring Exam Bias
References


Appendix A: Item Analysis Summary
Appendix B: Comparison of Item Performance in End of Course 4, Version 1 and Version 4
Appendix C: Comparison of Item Performance in End of Course 4, Version 1, Version 3, and Version 4 for Selected Items


List of Tables

Table 1. Overall Performance and Recommendations for Item Revision


List of Figures

Figure 1. Comparison of Item Performance Over Time in Course 4 Post-Assessment
Figure 2. Item Characteristic Curve and CTT Item Statistics of a Good Item According to Both Analyses
Figure 3. Item Characteristic Curve and CTT Item Statistics of an Item that Proved To Be Good by IRT and Poor by CTT
Figure 4. Item Characteristic Curve and CTT Item Statistics of an Item that Proved To Be Poor by IRT and Moderate by CTT


Introduction

The Adoption and Safe Families Act (ASFA) of 1997 (Public Law 105-89) was designed to prevent children in foster care from being returned to unsafe homes and to find safe homes for children who are not able to return to their families. Since then, the Tennessee Department of Children’s Services (TDCS), like agencies in many other states, has revamped preservice training for new frontline staff hired to work with children and families in an effort to better prepare child welfare workers to fulfill ASFA’s goals of safety, permanence, and well-being. More recent revisions to preservice training were implemented in the summer and fall of 2004, as the agency embraced a new best practice model of child welfare1 and at the same time tried to address deficiencies in key case manager competencies previously identified in a statewide needs assessment. Three primary themes form the foundation of the new outcomes-based preservice training: a family-centered focus, a strengths-based approach, and cultural sensitivity.

Four weeks of classroom teaching are combined with four weeks of on-the-job training (OJT) to allow new workers to build skills identified as critical before assuming an independent caseload. Throughout classroom training and during on-the-job training, knowledge and skills assessments are embedded to help identify worker strengths and continuing needs. A certification requirement for all newly hired case managers before assignment of a caseload provides an additional mechanism for accountability in service delivery and illustrates the department’s commitment to continued professional development. The certification requirement includes both a knowledge-based assessment and a skills assessment. The purpose of this document is to describe implementation issues related to the knowledge-based assessment requirement in the newly hired case managers’ certification program.2 Particular attention is given to describing the psychometric analysis that is underway at both the item level and the test level to ensure that the assessment provides a valid measure of the cognitive knowledge domains relevant to the job.3 Although knowledge alone is not sufficient for quality work with children and families, there is an implicit assumption that knowledge provides a foundation the new worker can draw upon to build the skills needed for the job.

1 For a detailed description of the best practice model, see Tennessee Department of Children’s Services. (2003). Standards of Professional Practice for Serving Children and Families: A Model of Practice.

Background

Tennessee’s new case manager training is a 9-week program. It uses an alternating-week structure: one week in the classroom, with workers across program areas trained conjointly until they reach the last (4th) week of classroom training, and one week in the worker’s actual field setting. An on-the-job training coach is assigned to each new worker during orientation; the coach assists the new worker in putting together a professional development team that works collectively to ensure that an individualized learning plan is in place for the new worker. The plan is regularly updated based on observations and results of assessments embedded throughout the training. Supervisors and experienced staff interested in mentoring new workers serve on professional development teams along with the new worker, his or her OJT coach, and the classroom trainer.

The first in-class course of preservice training is built on a model of the helping process that includes core conditions along with engagement and helping skills around attending, balanced use of questions, empathic reflection of content and emotions, concretizing, and summarizing. The second in-class course focuses on gathering information for assessing safety, permanence, well-being, resources available to support families, and critical thinking skills for analyzing information gathered. The third in-class course focuses on the development of individualized case plans that are built upon a family’s identified strengths and needs along with monitoring and guidelines for updates of case plans. Finally, the fourth in-class week focuses on the new worker’s program area with Child Protective Services workers in one track and Permanence workers in a second track. At the end of each in-class course there is a knowledge assessment that consists of 20 multiple choice items.

The post-course knowledge assessment is designed to enable the trainer, the new worker (trainee), and the new worker’s professional development team to track the trainee’s progress. It is intended to do so in a fashion that parallels the philosophy of practice embodied in the new worker certification training. Assessment is viewed as developmental, seeking to encourage continuous enhancement of knowledge. The end of course knowledge assessment also allows the new worker to become familiar with the types of questions included in the final knowledge assessment required for certification before assuming an independent caseload. Items for both assessments are drawn from the same item bank. Because “test anxiety” increases with high-stakes assessments, the hope is to alleviate some of the worker’s stress by building familiarity with the knowledge assessment process throughout preservice training. The end of course knowledge assessments consist of 20 items, and the end of preservice final assessment consists of 120 items (30 items corresponding to each of the 4 classroom weeks).

2 Development and validation issues related to the skills assessment are discussed in a separate document.

3 Additional information related to validation efforts for the knowledge assessment, which involve primarily qualitative input gathered from ongoing monthly meetings with a panel of field experts, can be furnished upon request.

Development of Knowledge Assessment Objectives and Specifications/Blueprint

In the development of any knowledge-based assessment, it is essential to specify as clearly as possible the domain of content or behaviors that define the objectives measured by the instrument. With a certification or licensure exam, it is common practice to first conduct a “role delineation study” or “task analysis,” with individuals working in the field identifying the responsibilities, sub-responsibilities, and activities that define each role/task. Next, the knowledge and skills needed to carry out each task are identified, and later a panel of experts validates the list of desirable knowledge and skills.4 The validated or approved list of knowledge and skills comprises the specific objectives that need to be measured.

In the initial stage of development, evaluators from the University of Tennessee College of Social Work Office of Research and Public Service (SWORPS) identified a list of key competencies in consultation with members of a Technical Assistance Committee (TAC) and the Curriculum Development Team. SWORPS evaluators conducted a task analysis session with experienced TDCS case managers who worked in an urban setting.5 The purpose of the session was to begin the process of validating the key competencies that had been initially identified. Case managers were invited to comment on the accuracy of the listed competencies. During the latter part of the discussion, participants were asked to articulate specific responsibilities and activities/tasks related to each competency. SWORPS evaluators then facilitated a task analysis session with TDCS frontline supervisors in order to continue the process of validating the key competencies initially identified. Information from the case managers’ task analysis session was shared with TDCS frontline supervisors. Supervisors acted as a panel of experts, providing additional insights about what constitutes key competencies for case managers who serve children and families. In turn, SWORPS evaluators drafted a blueprint for the new worker knowledge assessment based on the results of the task analysis. Additional steps to refine this initial draft included reviewing the learning objectives outlined in each unit of the preservice curriculum and attending pilot sessions in the classroom to help evaluators gain a sense of the emphasis placed on various concepts. The revised blueprint for the assessment was submitted to the Technical Advisory Committee for approval before further test development steps were taken.

4 To make development more efficient, desirable knowledge and skills were identified during the task analysis process described here. The results of the analysis were also utilized in the development of the skills assessment (to avoid reconvening groups of workers).

5 A similar session was not repeated in a rural area. It is assumed here that even though the time allocated to various daily tasks may vary between rural and urban frontline workers, the desired key competencies for good casework are similar in the two settings.

Item Writing and Item Editing

Using preservice training material, SWORPS evaluators developed a large pool of items that was utilized for piloting the knowledge assessment process. More specifically, the number of items in the initial pool used for piloting was approximately three times the number of items needed for a single final knowledge assessment administration (i.e., for a 120-item test, the initial item bank consisted of approximately 360 items).6 The distribution of pool items across various content areas was based on the importance assigned to various learning outcomes in the test blueprint.

Colleagues familiar with the revised preservice curriculum reviewed the pool of items developed by individual evaluators. Special attention was given to the following issues during the review process:

1. Does each item measure an important learning outcome included in the blueprint specifications?

2. Does each item present a clearly formulated problem?

3. Is the item stated in simple, clear, non-biased language? Is terminology consistent with preservice coursework and policy?

4. Is the item free from extraneous clues leading to the correct answer?

5. Is the difficulty of the item appropriate?

6. Do the items included in the assessment provide adequate coverage of the blueprint specifications?

6 In the Standards for Educational and Psychological Testing (1999), it is recommended that at least three to four times the number of test items required to construct an examination be developed.

Once the initial editing of pool items was completed, items were submitted to preservice curriculum developers in an effort to identify additional weaknesses. The review was based on the same guidelines enumerated above. SWORPS evaluators incorporated feedback and made revisions as needed. Items were stored in an item bank database with sub-pools created for each preservice classroom course. Also stored were links for each item to specific competencies and unit objectives in the preservice curriculum.

Field Administration

SWORPS evaluators developed a document with directions designed to communicate the following information to all trainees:

1. Purpose of the end of course assessment;

2. Time allowed to complete the end of course assessment;

3. How to record answers; and

4. Whether to guess when in doubt about the answer.

To ensure consistency across settings, directions were included in trainees’ knowledge assessment booklets, along with scripted verbal guidelines that trainers were asked to share in class prior to administering the assessment.

Over time, multiple versions of the knowledge assessment have been developed for use at the end of the classroom courses. The remainder of this document describes how data gathered from early piloting of items have been used to continuously improve the quality of individual items in the item bank and the overall quality of the instruments used at the end of each course.


Item Analysis

Psychometrics is a field of study concerned with the theory and technique of psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits. There are two well-known methodologies in psychometric study used to achieve the goal of item validation:

♦ Classical test theory (CTT) uses traditional sample-dependent statistics. These include, but are not limited to, item difficulty and item discrimination indices, item-test intercorrelations, and distractor analysis. CTT analysis employs relatively simple mathematical techniques, thus explaining its wide use and popularity in test validation.

♦ Item Response Theory (IRT), also known as modern test theory, is the study of test and item scores based on assumptions concerning the mathematical relationship between ability and item responses. IRT models are known as “strong” models since the assumptions are harder to meet. IRT is the gold standard for item validation, mainly because of its property of group invariance, or the fact that the calculated item parameters are independent of the ability level of examinees responding to the item.


Methodology

Although IRT is the preferred method for item validation, it was necessary to rely on CTT in the early steps of the analysis. IRT requires large sample sizes, which were not available at the beginning of program implementation; CTT was therefore the only viable option for instruments administered to small samples. As the number of examinees taking the same end of course assessment increased over time in some cases, evaluators utilized IRT models to build additional support for the validity of the designed measures.

Classical Test Theory

The goal of item analysis is to construct an assessment with the necessary degree of reliability and validity. To achieve this, a large number of items were analyzed for their psychometric properties. In developing the assessment, CTT was used to make judgments about item quality; the aim was to identify items that were functioning well and items that were not. Items with poor characteristics were revised or eliminated. Classical test theory procedures involve selecting items using the item difficulty index, corrected point biserial coefficient, index of discrimination, and distractor analysis.

♦ Item difficulty index is the proportion of examinees who answered the item correctly; lower percentages reflect higher item difficulty. In general, for an item to discriminate well between examinees, the item difficulty should not be too high or too low. Extremely low values may indicate that the question is too difficult, written poorly, or has problems with item content. Questions with a high item difficulty index are avoided, as they may be too easy and not measure knowledge acquisition.


♦ Point biserial correlation computes the correlation between item response and total test score (rpbi). The higher the value of rpbi, the stronger the relationship between the item response and total score. Point biserial can range from –1.00 to 1.00, similar to Pearson’s product-moment correlation. It is sometimes more valuable to compute the correlation excluding the particular test item from the overall test score. This statistic is called corrected point biserial correlation. There are other correlation coefficients that can be computed: biserial, phi, and tetrachoric. “Possibly the use of point biserial might tend to produce a more reliable test for groups exactly like the pretest group, whereas biserial might work better for subsequent groups of examinees that differ somewhat from the pretest group” (Lord & Novick, 1968, p. 344). In this study, it is assumed that future worker groups will be similar in ability to the original sample group.

♦ Index of Discrimination. Overall assessment scores are dichotomized into high and low scorers in order to compute a stable item discrimination index: groups consisting of the upper 27% and the lower 27% of examinees are constructed. The difference between the percentage of examinees in the top 27% who correctly answered the test item and the percentage in the bottom 27% who correctly answered it is called the discrimination index. The desired end result is that high scorers select the correct answer, while a large proportion of low scorers choose one of the distractors.

♦ Distractor Analysis. The purpose of distractor analysis is to refine the items. This process involves examining the frequency with which each response option is chosen. Distractors that are not chosen by any examinees should be revised or eliminated. Examinees in the bottom 27% should select incorrect options in greater proportion than those in the upper 27%. Items for which a single distractor is selected more often than the others, or more often than the correct answer, should be revised.

Parameters suggested in the psychometric literature are used; these “rules of thumb” can be adjusted up or down according to circumstances. For this study, a good item is identified by an item difficulty index of 0.20 to 0.80, a corrected point biserial correlation coefficient ≥ 0.09, and a discrimination index ≥ 0.30. An item receives one point in overall performance for each of the three statistics it satisfies, so overall performance ranges from 0 (none of the three statistics meets its threshold) to 3 (all three meet their thresholds).
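To make the mechanics of these statistics concrete, the following Python sketch computes the four quantities described above and the 0 to 3 overall performance score using the thresholds just stated. It is an illustrative sketch only, not the evaluators' scoring code; the response and answer-key structures (lists of dictionaries keyed by item short name) are assumptions made for the example.

```python
from collections import Counter

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def item_statistics(responses, key, item):
    """CTT statistics for one item: difficulty index, corrected point biserial,
    discrimination index (upper vs. lower 27% groups), and distractor counts.

    `responses` is a list of dicts mapping item name -> chosen option, and
    `key` maps item name -> correct option (assumed structures, for illustration).
    """
    n = len(responses)
    item_scores = [1 if r.get(item) == key[item] else 0 for r in responses]
    totals = [sum(1 for q, correct in key.items() if r.get(q) == correct)
              for r in responses]

    # Difficulty index: proportion answering correctly (lower = more difficult)
    difficulty = sum(item_scores) / n

    # Corrected point biserial: correlation of the item score with the
    # total score computed without the item itself
    rest_scores = [t - s for t, s in zip(totals, item_scores)]
    corrected_pbis = pearson(item_scores, rest_scores)

    # Discrimination index: p(correct) in the top 27% minus the bottom 27%
    k = max(1, round(0.27 * n))
    ranked = sorted(range(n), key=lambda i: totals[i])
    low, high = ranked[:k], ranked[-k:]
    discrimination = (sum(item_scores[i] for i in high)
                      - sum(item_scores[i] for i in low)) / k

    # Distractor analysis: how often each response option was chosen overall
    distractors = Counter(r.get(item) for r in responses)
    return difficulty, corrected_pbis, discrimination, distractors

def overall_performance(difficulty, corrected_pbis, discrimination):
    """0-3 score using the thresholds adopted in this report."""
    return (int(0.20 <= difficulty <= 0.80)
            + int(corrected_pbis >= 0.09)
            + int(discrimination >= 0.30))
```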

As can be seen in Table 1, a recommendation based on overall performance follows each item. These recommendations are taken into consideration as efforts are made to improve individual item performance and overall exam performance.


Table 1. Overall Performance and Recommendations for Item Revision

Overall Performance           Recommendation
3 (Performed well)            Retain without revision
2 (Performed moderately)      Based on one poor item statistic; most of the time needs little revision
0 or 1 (Performed poorly)     Based on poor item statistics; discard or revise the item

Initial Analysis

An assessment developed to provide feedback to participants at the end of Course 4 (Conducting Family-Centered Assessments) was selected to illustrate how item analysis has been used to assess and improve overall item performance. The process was similar for other instruments.

Four different versions of the Course 4 assessment were developed over time. Different versions are produced for several reasons. Early on, results from a first version can be used to strengthen the quality of the items. At other times, a new version may be needed because of changes to the classroom curriculum. Finally, multiple versions are needed as items become “overexposed”: participants may answer an item correctly simply because someone who went through the training earlier shared the correct answer.

The first version of the Course 4 post-assessment was used with 122 examinees. Version 2 was used only with 16 examinees as the curriculum changed based on field needs, and evaluators had to adjust the content of the post-assessment. Versions 3 and 4 were used with 246 and 453 examinees, respectively. Figure 1 presents the item improvement from Version 1 to Version 4 (Version 2 results are not included due to a small number of examinees). As indicated in Figure 1, there is an increase in the proportion of items performing moderately well and well over the three versions. Conversely, there is a decrease in the proportion of items performing poorly. The overall conclusion is that item performance successfully improved with each version.


Figure 1. Comparison of Item Performance Over Time in Course 4 Post-Assessment

A more detailed item analysis shows that Version 1 had 7 items (35%) that performed well (overall performance = 3), 4 items (20%) with moderate performance (overall performance = 2) and 9 items (45%) were classified with poor performance (overall performance = 0 or 1).

Version 3 had 12 items (60%) that performed well, 2 items (10%) that performed moderately, and 6 items (30%) that performed poorly.

Version 4 had 11 items (55%) that were judged as having performed well. Four items (20%) performed moderately, and 5 items (25%) had a poor performance.
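Combining items that performed well or moderately, these counts correspond to the percentages shown in Figure 1: (7 + 4) / 20 = 55% for Version 1, (12 + 2) / 20 = 70% for Version 3, and (11 + 4) / 20 = 75% for Version 4.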

Appendix A presents a summary of item statistics for Course 4, Version 4. For each item, the difficulty index, the corrected point biserial coefficient, the discrimination index, and an overall performance rating are reported. Based on the overall performance rating of each item, recommendations for retention or revision were given.

Appendix B presents a comparison of overall item performance in the end of Course 4 assessment from Version 1 to Version 4. From Version 1 to Version 4, 15 items (75%) were changed. Within the group of changed items, 7 improved their overall performance and 4 performed well on both occasions; 2 items dropped in performance and 2 performed poorly on both occasions. The amount of change varied: slightly changed items had distractors added, changed, or dropped, whereas drastically changed items had the format of the question altered; sometimes the revised item assessed knowledge related to the same topic, and in other cases an entirely new item was created.

Appendix C displays information related to the performance of 3 items over time, with recommendations. The first item improved step by step from an overall performance of 0 to 2, but it still needs revision because its discrimination index is low. The second item had a good difficulty index and a good discrimination index; revision was suggested because its corrected point biserial correlation was low, and after revision its performance reached and remained at level 3. The third item stayed the same in all three versions and held a high performance level throughout. One disadvantage of exposing a good item too often is that the difficulty index gradually increases as more examinees become familiar with the item and are more likely to pick the correct answer.


Number of Items Assessed

By fall 2006, data had been analyzed for 362 items appearing in one or more versions of an assessment. Given the relatively small sample sizes available from preservice examinees, it was necessary to rely also on data gathered from experienced staff. Still, not all instruments were suitable for item analysis because some versions were used with only a small number of examinees. This is especially true for the end of Course 8 assessment and some versions of the final knowledge assessment.

Using classical test theory techniques, 82 items (23%) have proven to be stable, consistently performing well across multiple tests on all three parameters (difficulty index, corrected point biserial correlation, and discrimination index). A small number of stable items (18) have been revised because they did not meet the additional requirement of measuring a specific learning objective well when reviewed by a panel of content experts (trainers, OJT coaches, and TDCS supervisors).

In addition, there were 103 items (28%) for which initial results are promising but which have not been included in enough versions to assess the consistency of their performance. More specifically, 59 items performed well or moderately well on only one instrument, and SWORPS evaluators did not yet have enough data across instruments to know whether these items perform consistently well. Forty-four items have an acceptable difficulty index but a small number of responses (under 150); those 44 items will remain as is until the sample size increases with future administrations, since the other two parameters are very sensitive to sample size.

These results represent the data analyzed through fall 2006. Since then, new instruments have been created and some have been archived. The number of active items (185) in the pool has increased as of the date of publication of this report.


Future Directions

Item Response Theory (IRT)

While classical test theory analysis is useful when constructing and evaluating knowledge assessment instruments, it has several limitations: the item statistics are sample dependent, and it is difficult to compare examinees’ results across different assessments. To avoid these limitations of CTT, more and more psychometricians are using IRT. One disadvantage of the IRT approach is that its models require large samples for stable parameter estimates. There is a rich literature on both methods, and sample size requirements for estimating item parameters are not entirely clear. Tsutakawa and Johnson (1990) recommend a sample size of n = 500; however, other sources have suggested that n = 300 is sufficient (Chuah, Drasgow, & Luecht, 2006). For the three-parameter logistic model, a sample as large as n = 1,000 is recommended.

For analyzing the certification exam, 12 instrument versions met the minimum required sample size for utilizing IRT. At present, four instrument versions from Course 2 have been analyzed using IRT. Item parameters, item characteristic curves (ICC), item information curves, and model fit were estimated with BILOG 3.0 using the two-parameter logistic model (2PL).

The purpose of this analysis is to compare the item statistics from CTT with the item parameters from IRT, to determine whether the IRT results are consistent with those from CTT, the approach that has been used the most.

The Course 2 assessment, Version 3, will be used to illustrate the findings. Based on item characteristic curves, 18 of the 20 items were classified in the same direction as recommended by CTT.


In the 2PL model used for the analysis, the location parameter (b) represents the difficulty of the item, with higher location values representing more difficult items (see Figures 2 through 4). The slope parameter (a) represents the discrimination power of the item, or how well the item differentiates between examinees of higher and lower ability; items with higher slope values discriminate better between high- and low-ability examinees (see Figures 2 through 4). For this study, a good item is identified as one with a large slope parameter and a location parameter near 0.0.7
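For reference, the 2PL model expresses the probability of a correct response as a logistic function of ability. The short Python sketch below is an illustration only (not BILOG output, and it omits the optional 1.7 scaling constant some programs apply); it evaluates that function for the slope and location parameters reported in Figures 2 and 3.

```python
import math

def p_correct_2pl(theta, a, b):
    """Two-parameter logistic ICC: P(correct | theta) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Slope (a) and location (b) parameters taken from Figures 2 and 3
items = {
    "CULTURE_15V1": (0.593, 0.105),     # good by both CTT and IRT
    "STRENGTHS_01V2": (0.676, -2.104),  # good by IRT, poor by CTT (difficulty index 0.89)
}

for name, (a, b) in items.items():
    curve = {theta: round(p_correct_2pl(theta, a, b), 2) for theta in (-2, -1, 0, 1, 2)}
    print(name, curve)
```

Consistent with its very low location parameter, the item in Figure 3 is predicted to be answered correctly by most examinees of average ability, which matches its high CTT difficulty index of 0.89.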

Three items were selected to illustrate agreement between CTT and IRT—one item to present how the results from the two methods correspond and two items to present how they differ.

In Figures 2 through 4, the item characteristic curves of selected items are shown with a table of CTT item statistics and overall performance.

7 The guidelines for item selection are adopted from the IRT Modeling Lab website: http://work.psych.uiuc.edu/irc.


Figure 2. Item Characteristic Curve and CTT Item Statistics of a Good Item According to Both Analyses (n=526)

[Item characteristic curve for CULTURE_15V1: a = 0.593, b = 0.105]

CTT Item Statistics: Difficulty Index = 0.48; Corrected Point Biserial Coefficient = 0.26; Discrimination Index = 0.64; Overall Performance = 3


Figure 3. Item Characteristic Curve and CTT Item Statistics of an Item that Proved To Be Good by IRT and Poor by CTT (n=526)

[Item characteristic curve for STRENGTHS_01V2: a = 0.676, b = -2.104]

CTT Item Statistics: Difficulty Index = 0.89; Corrected Point Biserial Coefficient = 0.23; Discrimination Index = 0.27; Overall Performance = 1


Figure 4. Item Characteristic Curve and CTT Item Statistics of an Item that Proved To Be Poor by IRT and Moderate by CTT (n=526)

[Item characteristic curve for STRENGTHS_BASED_01V2: a = 0.253, b = -2.456]

CTT Item Statistics: Difficulty Index = 0.80; Corrected Point Biserial Coefficient = 0.09; Discrimination Index = 0.24; Overall Performance = 2


Conclusion

The overall conclusion of the analysis is that agreement between the results of item analysis within the two frameworks, CTT and IRT, was reasonably good. Analysis using IRT techniques has only just begun, and much work still needs to be done.

Comparing Classical Test Theory and Item Response Theory

It is important to continue to use classical test theory because these statistics produce valuable information about the items. In fact, it is good practice to run a CTT item analysis before estimating item parameters to remove items with zero or negative corrected point biserial correlations. In recent years, a larger number of psychometricians have analyzed their data with IRT. There are important questions to be considered for future study. For example, how do statistics from CTT compare with item parameters estimated by IRT? It is essential to investigate whether the chosen 2PL model fits the data properly in order to improve the item analysis and test design. Conclusions from IRT analysis should correspond to those from CTT when the correct model is applied. After determining a correct model, IRT can be used with confidence from one group of examinees to another. Conducting classical analysis will remain an essential part of the analysis.
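As a sketch of the screening step described above, the function below sets aside items with zero or negative corrected point biserial correlations before any IRT calibration is attempted. The item names and statistics in the example are hypothetical; in practice the statistics would come from a CTT item analysis such as the one sketched earlier.

```python
def screen_items_for_irt(item_stats):
    """Split items into those eligible for IRT calibration (positive corrected
    point biserial) and those flagged for revision or removal."""
    eligible = {name: s for name, s in item_stats.items()
                if s["corrected_pbis"] > 0}
    flagged = sorted(name for name in item_stats if name not in eligible)
    return eligible, flagged

# Hypothetical item statistics, for illustration only
stats = {
    "ITEM_A": {"corrected_pbis": 0.21},
    "ITEM_B": {"corrected_pbis": -0.03},
    "ITEM_C": {"corrected_pbis": 0.00},
}
eligible, flagged = screen_items_for_irt(stats)
print(flagged)  # ['ITEM_B', 'ITEM_C']
```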

Exploring Exam Bias

To supplement the analysis of individual items, a logical next step is to examine possible test bias by looking at examinee demographics in relation to assessment scores. Some factors that could be examined are gender, race, area of study, and examinee region of employment. The purpose of bias analysis is to confirm that no one group of examinees has an unfair advantage over the rest.
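A simple first pass at such a bias review, sketched below under the assumption that each examinee record carries a total score and the demographic fields mentioned above, is to compare mean scores across groups; item-level differential functioning analyses could follow. The record and field names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def mean_score_by_group(records, group_field, score_field="total_score"):
    """Group examinee records by a demographic field and report each group's
    mean total score and group size."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_field]].append(rec[score_field])
    return {g: (round(mean(scores), 1), len(scores)) for g, scores in groups.items()}

# Hypothetical examinee records, for illustration only
records = [
    {"region": "East", "gender": "F", "total_score": 104},
    {"region": "East", "gender": "M", "total_score": 98},
    {"region": "West", "gender": "F", "total_score": 101},
]
print(mean_score_by_group(records, "region"))  # {'East': (101.0, 2), 'West': (101.0, 1)}
```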


References

Baker, F. (2001). The basics of item response theory. ERIC Clearinghouse on Assessment and Evaluation.

BILOG-MG, MULTILOG, PARSCALE, and TESTFACT [Computer software]. Lincolnwood, IL: Scientific Software International.

Chuah, S. C., Drasgow, F., & Luecht, R. M. (2006). How big is big enough? Sample size requirements for CAST item parameter estimation. Applied Measurement in Education, 19(3), 241–251.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Stage, C. (2003). Classical test theory or item response theory: The Swedish experience (No. 42). Umeå: Department of Educational Measurement.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Tsutakawa, R. K., & Johnson, J. C. (1990). The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, 55, 371–390.


Appendix A

Item Analysis Summary


End of Course 4, Version 4 (n=453)

Item  Short Name               Difficulty  Corrected Point       Discrimination  Overall
                               Index       Biserial Coefficient  Index           Performance
1     SEPT_14V2                0.78        0.12                  0.34            3
2     LIFET_24V1               0.73        0.19                  0.44            3
3     SAFETY_02V1              0.70        0.19                  0.43            3
4     ASSESS_20V2              0.58        0.15                  0.17            2
5     TOOLS_03V2               0.79        0.09                  0.26            2
6     Funct_assess_01V2        0.58        0.25                  0.60            3
7     STAGES_03V2              0.64        0.32                  0.56            3
8     STAGESDEV_01V1           0.48        0.02                  0.29            1
9     Tools_04V2               0.86        0.13                  0.23            1
10    Analyzing_info_01V2      0.85        0.33                  0.41            2
11    CHILDDEV_01V1            0.78        0.07                  0.21            1
12    PROTA_26V1               0.71        0.22                  0.51            3
13    TOOLS_05V2               0.82        0.07                  0.25            0
14    NEWFAM__363V1            0.80        0.21                  0.38            3
15    ASSESS_19V3              0.94        0.16                  0.15            1
16    ATTACH_02V3              0.74        0.20                  0.43            3
17    Domestic_violence_03V1   0.49        0.15                  0.43            3
18    Underlying_needs_03V1    0.46        0.04                  0.36            2
19    Signs_of_safety_01V1     0.75        0.31                  0.52            3
20    STAGES_02V2              0.74        0.19                  0.42            3


Appendix B

Comparison of Item Performance in End of Course 4, Version 1 and Version 4


Item  Short Name               Change in    Performance            Performance            Change in
                               Version      Version 1              Version 4              Performance*
1     SEPT_14V2                Slightly     Performed Poorly       Performed Well         ↑
2     LIFET_24V1               No Change    Performed Moderately   Performed Well         ↑
3     SAFETY_02V1              No Change    Performed Well         Performed Well
4     ASSESS_20V2              Drastically  Performed Poorly       Performed Moderately   ↑
5     TOOLS_03V2               Slightly     Performed Poorly       Performed Moderately   ↑
6     Funct_assess_01V2        Drastically  Performed Well         Performed Well
7     STAGES_03V2              Slightly     Performed Moderately   Performed Well         ↑
8     STAGESDEV_01V1           No Change    Performed Poorly       Performed Poorly
9     Tools_04V2               Drastically  Performed Poorly       Performed Poorly
10    Analyzing_info_01V2      Drastically  Performed Poorly       Performed Moderately   ↑
11    CHILDDEV_01V1            Drastically  Performed Well         Performed Poorly       ↓
12    PROTA_26V1               No Change    Performed Poorly       Performed Well         ↑
13    TOOLS_05V2               Drastically  Performed Moderately   Performed Poorly       ↓
14    NEWFAM__363V1            No Change    Performed Well         Performed Well
15    ASSESS_19V3              Drastically  Performed Poorly       Performed Poorly
16    ATTACH_02V3              Slightly     Performed Well         Performed Well
17    Domestic_violence_03V1   Drastically  Performed Well         Performed Well
18    Underlying_needs_03V1    Drastically  Performed Poorly       Performed Moderately   ↑
19    Signs_of_safety_01V1     Drastically  Performed Moderately   Performed Well         ↑
20    STAGES_02V2              Slightly     Performed Well         Performed Well

*The symbols ↑ and ↓ indicate whether overall performance increased or decreased from version to version.


Appendix C

Comparison of Item Performance in End of Course 4 Version 1, Version 3 and Version 4 for Selected Items



Version  Item Short Name  Difficulty  Corrected Point       Discrimination  Overall
                          Index       Biserial Coefficient  Index           Performance
V1       ASSESS_04V1      0.97        0.06                  0.00            0
         Recommendation: The item has a low corrected point biserial correlation and a low discrimination index, and the difficulty index is very high. Items with a low corrected point biserial correlation should be eliminated or substantially revised; the very low discrimination index supports this decision.

V3       ASSESS_20V1      0.83        0.16                  0.25            1
         Recommendation: The item has a difficulty index (p = 0.83) only slightly greater than recommended, and a discrimination index of D = 0.25. According to the guidelines, this item is marginal and needs revision.

V4       ASSESS_20V2      0.58        0.15                  0.17            2
         Recommendation: The item has a good difficulty index and a good corrected point biserial correlation, but the low discrimination index indicates that revision is needed.

V1       STAGES_03V1      0.55        0.05                  0.40            2
         Recommendation: The item has a good discrimination index and a good difficulty index, but the corrected point biserial correlation is low. The item needs revision.

V3       STAGES_03V2      0.63        0.15                  0.48            3
         Recommendation: Retain without revision.

V4       STAGES_03V2      0.64        0.32                  0.56            3
         Recommendation: Retain without revision.

V1       SAFETY_02V1      0.56        0.29                  0.63            3
V3       SAFETY_02V1      0.61        0.14                  0.44            3
V4       SAFETY_02V1      0.70        0.19                  0.43            3
         Recommendation: Retain without revision. The difficulty index is increasing; do not use this item in subsequent versions.