04 Reliability and Validity


    C.4.4 Item and Test Analysis: Determination of Test Quality

    Kanupriya Katyal, Dr. Jagrook Dawra

    Abstract:

This module deals with defining and determining the quality of test instruments and test items. Tests, as instruments of evaluation, need to be accurate, objective, practical and reliable. Further, they should be able to discriminate between good and bad performers and have a uniform difficulty level. This module explains each of these terms and describes how they can be measured. It specifically touches on six measures of test quality: objectivity, practicability, reliability, validity, difficulty level and discrimination index. It also discusses mathematical measures, such as mean, median, mode, standard deviation and correlation, that help in measuring test quality.

    Objective:

1. To enable the reader to define the quality of a test and measure it.
   a. To understand the concepts of reliability and validity in a test.
   b. To understand the various measurements used in defining quality, such as mean, median, mode, standard deviation and correlation.

    Introduction:

A test needs to evaluate and measure the performance of a candidate, a department or an institution. Measurement is purely quantitative; when an individual's judgment is added, it becomes evaluation. A test should measure what it intends to measure with considerable accuracy, and at the same time it should be able to discriminate between students of varied abilities.

Subjective judgment leads to inaccuracy and errors; these are the standard errors of measurement. Hence, they need to be identified and eliminated.

There are several valid reasons for analyzing questions and tests that students have completed and that have already been graded. Some of these reasons include the following:

• Identify content that has not been adequately covered and should be re-taught,
• Provide feedback to students,
• Determine if any items need to be revised in the event they are to be used again or become part of an item file or bank,
• Identify items that may not have functioned as they were intended,
• Direct the teacher's attention to individual student weaknesses.

Validity and reliability are the overarching principles that govern test design. Validity is the extent to which a test measures what it intends to measure. Reliability is the extent to which the test scores are consistent; it is a property of the test as a measuring instrument. Other measures, such as objectivity, practicability, difficulty level and discrimination index, are also measures of test quality and are discussed in the subsequent sections.

    Understanding Test Item and Test Quality

There are various forms of assessment techniques available to the examiner. They range from assessing students using a fixed-response multiple-choice test to an open-response short-answer, long-answer or essay type of exam. These exams serve a variety of purposes. The results may be used to assess a student's strengths and weaknesses or to plan further instructional activity. They may be used for selection, placement or certification. They may be used as tools for appraisals. Regardless of the objective of assessment, all assessments need to possess certain characteristics and a certain degree of quality. A test is said to be of good quality if it satisfies the following criteria [i]:

1. Objectivity (justice): Objectivity is said to be ensured when the paper setter is given a design/method to follow. Objectivity of the 'darts' exercise (discussed later in this module) would depend upon how well the task is defined to the players. A test with good objectivity would define the number of attempts, the distance from which to aim, etc.

For example, teachers at several levels of education assess students' overall learning by giving them projects. Often, students are not told anything about the scope of the work. They are also unaware of what distinguishes a good project from a bad one and how they will be graded. It has often been observed that students' learning from a project is enhanced if the scope of the project is clearly defined and the student is also told clearly about certain specific performance characteristics arranged in levels, indicating the degree to which the standard has been met.

If a biology student is asked to maintain a journal on leaf collection, a test with good objectivity for this project might look as follows:

Appearance/Neatness
  Grade A: Extremely neat, with cover page; leaves dried and neatly pasted.
  Grade B: Neat, with cover page; leaves dried and pasted.
  Grade C: Neat, with no cover page; leaves not dried and pasted.
  Grade D: Untidy; no cover page and leaves not dried.

Organization
  Grade A: Well organized and categorized/catalogued.
  Grade B: Organized and categorized/catalogued with some errors.
  Grade C: Organized and categorized/catalogued with a lot of errors.
  Grade D: Disorganized, with no cataloguing.

Information and understanding
  Grade A: Both common name and scientific name given; information about species/genus/family given and accurate.
  Grade B: Both common name and scientific name given; information about species/genus/family given with some errors.
  Grade C: Both common name and scientific name given; information about species/genus/family given with a lot of errors.
  Grade D: Such information is missing.

    Objectivity needs to be maintained not only for the test but also for test items.

2. Practicability (usability): All test instruments should be easily usable and have simple and clear instructions for administration. For example, an online test may not be practical in remote areas where internet connectivity is poor; a paper-based test would probably be more appropriate.

3. Reliability (dependability): A test instrument is said to be reliable if it produces the same result every time; it is the consistency of measurement. A measure is considered reliable if a person's score on the same test given twice is similar. The ability of a player to consistently hit around the bull's eye is a measure of his reliability.

There are several ways by which reliability is generally measured: test-retest, alternate form, split-half, internal consistency (inter-item) and inter-rater.

a. Test/retest: This is a conservative method to estimate reliability. In this method, the scores from repeated administrations of the same test to the same participants are compared; the test instrument remains the same. A reliable test would produce very similar scores. Simply put, the idea behind test/retest is that you should get the same score on test 1 as you do on test 2 (see the sketch at the end of this list). For example, IQ tests typically show high test-retest reliability.

The reliability of a weighing scale in a physics experiment can be tested by recording a weight 3 to 4 times with an interval of a few minutes.

Test-retest reliability is a measure of stability.

b. Alternate form reliability: When participants are able to recall their previous responses, test-retest procedures fail. In such cases, alternate form reliability is used. As the name suggests, two or more versions of the test are constructed that are equivalent in content and difficulty.

For example, marks in a pre-board test should be consistent with the board exam if there is no change in the underlying conditions between the two. Teachers also use this technique to create replacement exams for students who have for some reason missed the main exam.

Alternate form reliability is a measure of equivalence.


c. Split-half reliability: This method compares scores from different parts of the test, such as comparing the scores from even- vs. odd-numbered questions (see the sketch at the end of this list).

d. Internal consistency (inter-item) reliability: This estimates reliability by grouping questions in a questionnaire that measure the same concept. For example, you could write two sets of three questions that measure the same concept and, after collecting the responses, run a correlation between those two groups of three questions to determine whether your instrument is reliably measuring that concept.

For example, vocabulary could be tested using synonyms, antonyms, sentence completion or analogies. Understanding of Newton's laws can be tested by asking the student to state the laws or by giving him a numerical problem based on those laws.

Inter-item reliability is the degree to which different items measuring the same variable attain consistent results. Scores on different items designed to measure the same construct should be highly correlated.

e. Inter-rater reliability: Scorer reliability needs to be measured when observers use their judgment for interpretation. For example, when analyzing live or videotaped behavior, or written answers to open-ended essay-type questions, different observers rate the same responses; a high degree of correlation between the scores given by different observers indicates high inter-rater reliability.

There are often more than two judges judging the performance of gymnasts in a sporting event, and there is often more than one teacher present during the viva-voce examination of a student. A high correlation between the scores given by the different judges to the gymnasts, or by the teachers to the students, indicates high inter-rater reliability.
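Most of these reliability estimates reduce to correlating two sets of scores. The fragment below is a minimal sketch, not taken from the module, of how test-retest and split-half reliability might be estimated in Python; the score data and variable names are invented for illustration.

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Hypothetical data: five students sit the same test twice.
test1 = [62, 75, 58, 90, 71]
test2 = [60, 78, 55, 92, 69]

# Test-retest reliability: correlate the two administrations.
# A coefficient near +1 indicates a stable, reliable instrument.
print("test-retest r =", round(correlation(test1, test2), 2))

# Split-half reliability: 0/1 item responses (1 = correct), one row per student.
items = [
    [1, 1, 0, 1, 1, 0],  # student 1
    [1, 0, 1, 1, 0, 1],  # student 2
    [0, 1, 0, 0, 1, 0],  # student 3
    [1, 1, 1, 1, 1, 1],  # student 4
    [0, 0, 1, 0, 1, 0],  # student 5
]
odd_half = [sum(row[0::2]) for row in items]   # scores on items 1, 3, 5
even_half = [sum(row[1::2]) for row in items]  # scores on items 2, 4, 6
print("split-half r =", round(correlation(odd_half, even_half), 2))
```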

4. Validity (accuracy): A test instrument should accurately measure what it is designed to test. It is the strength of our conclusions. Most tests are designed to measure hypothetical constructs, like intelligence or learning, which the examiner needs to operationalize. A valid test will measure the construct (learning) without being influenced by other factors (the student's motivation level). It answers the examiner's question: "Was I right in giving the student this test/test item?" In the darts example described below (Figure 1), if the player is able to aim at the bull's eye correctly, he is valid; thus players A and B in the figure are valid (though player B is less reliable). For example, for a test intended to examine a student's understanding of Wordsworth's literary contribution, a question can be asked in the following ways:

• Summarize Wordsworth's poem 'Daffodils'.
• Critically evaluate Wordsworth's poem 'Daffodils'.

The first question tests the student's memory and not his/her understanding of 'Daffodils'.

Validity is also of different types:

a. Face validity: The test looks to be a good one; it is what teachers and students think of the test. Is it a reasonable way of assessing students? Is it too simple, or too difficult? Face validity is the consensus (generally of experts) that a measure represents a concept. It is the least stringent type of validity.

b. Construct validity: A construct is an individual characteristic that we assume exists in order to explain some aspect of behavior. Whenever we wish to interpret assessment results in terms of some individual characteristic (e.g. reading comprehension, mathematics problem-solving ability), we are concerned with a construct. Some other examples of constructs are reasoning ability, understanding of the principles of electricity, intelligence, creativity, and personality characteristics like sociability, honesty and anxiety. Constructs are often difficult to define. They are often generated from some theoretical position that the examiner assumes. For example, one examiner's model of a successful salesperson may propose that an aggressive person is likely to be a successful salesperson, whereas another examiner might opine that aggressiveness is a negative trait and that a salesperson should rather be assertive.

Construct validity measures whether the test is accurately measuring a particular construct. For example, an examiner constructs a SALESPERSON scale with questions testing both aggressive and assertive behavior and administers it to certain salespeople whose performance is known. Items that have a high correlation with the performance of a salesperson indicate high construct validity, while those with low correlation indicate low construct validity.

When measuring a student's understanding of the principles of thermodynamics, if the examiner examines the adequacy (or inadequacy) of the answer, he measures the construct appropriately. But if the examiner also examines the student on grammar, neatness, etc., the construct is not being measured appropriately.


c. Content validity: Content validity is the property of a test such that the test items sample the universe of items for which the test is designed. Content validity helps us understand whether a sample of items truly represents the entire universe of items for a particular topic.

For example, a teacher gives her students a list of 200 words and would like to know whether they have learnt to spell them correctly. She may choose a sample of, say, 20 words for a small test. We would like to know how representative these 20 words are of the entire list, so that we can generalize that a student who spells 80% of these 20 words correctly would be able to spell 80% of the entire list correctly.

d. Criterion validity: Criterion validity assesses whether a test reflects a set of abilities in a current or a future setting, as measured by some other test. It is of two types: predictive (future) and concurrent (present) validity.

• Predictive validity: The test accurately predicts performance in some subsequent situation. For example, suppose candidates were selected for a certain job by interviewing them. If the selected candidates also perform well in their jobs, then the test method (the interview) has good predictive validity.

• Concurrent validity: The test gives similar results to existing tests that have already been validated. For example, assume that the interview as a method has already been validated as a good indicator of employee performance. A written technical exam has high concurrent validity if it gives similar results.

For example, reading readiness test scores might be used to predict students' future achievement in reading (predictive), or a test of dictionary skills might be used to estimate students' current skill in the actual use of a dictionary (concurrent).

Difference between reliability and validity: Assume that some individuals are playing darts. The success of their skill is based on how close to the bull's eye they can hit consistently. Let there be four players, A, B, C and D, whose results are given in Figure 1. It can be seen from the figure that Player A is both valid and reliable: Player A not only achieves the desired result (valid) but also does so consistently (reliable).

[Figure 1: Dartboard results of the four players. A: reliable and valid; B: valid but not reliable; C: reliable but not valid; D: neither reliable nor valid.]

5. Difficulty level: A question paper or any test instrument is generally administered to a group of about the same age and in the same grade/standard. Thus, the test instrument must be made to a difficulty level suitable to the group. Item difficulty is simply the percentage of students taking the test who answered the item correctly. The larger the percentage getting an item right, the easier the item; the higher the difficulty index, the easier the item is understood to be.

For example, in the questions below, which item is more difficult?

a. Who was A.O. Hume?
b. Who was Mahatma Gandhi?

It is relatively easier to recognize the individual in the second question than in the first. Similarly, an English test item that is very difficult for an elementary student will be very easy for a high school student.

The difficulty index tells us how difficult an item is, i.e. how many people got the item correct. It is calculated as follows:

D = (Uc + Lc) / T

where Uc is the number of people in the upper group who answered the item correctly, Lc is the number of people in the lower group who answered the item correctly, and T is the total number of responses to the item.

For example, in a class, if 9 out of the top 10 students gave a correct response to the question "Who is the President of India?" and only 4 out of the bottom 10 students gave a correct response to the same, the difficulty level of the question would be:

D = (9 + 4) / 20 = 0.65 = 65%

This means that 65% of the students answered the question correctly.

6. Discrimination value: Even though a test instrument must be suited to a homogeneous group, it should still be able to distinguish between the different ability levels of the individuals being tested. The darts test should be able to discriminate between a novice, an amateur and an expert.

A good item discriminates between those who do well on the test and those who do poorly. The item discrimination index, d, can be computed to determine the discriminating power of an item. If a test is given to a large group of people, the discriminating power of an item can be measured by comparing the number of people with high test scores who answered that item correctly with the number of people with low scores who answered the same item correctly. If a particular item is doing a good job of discriminating between those who score high and those who score low, more people in the top-scoring group will have answered the item correctly. The discrimination index d is given by:

d = (Uc − Lc) / (T / 2)

where Uc is the number of people in the upper group who answered the item correctly, Lc is the number of people in the lower group who answered the item correctly, U and L are the number of people in the upper and lower groups respectively, and T is the total number of responses to the item.

For example, if 15 out of 20 people in the upper group answered a particular question correctly and 5 out of 30 people in the lower group answered the same question correctly, then:

d = (15 − 5) / ((20 + 30) / 2) = 10 / 25 = 0.4

The higher the discrimination index, the better the item, because such a value indicates that the item discriminates in favor of the upper group, which should get more items correct.


An item that everyone gets correct, or that everyone gets incorrect, will have a discrimination index equal to zero. When more students in the lower group than in the upper group select the right answer to an item, the item actually has a negative discrimination index.
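Since both indices are simple ratios, they can be computed directly. The sketch below implements the two formulas given above (the function names are mine, not the module's) and reproduces the module's two worked examples.

```python
def difficulty_index(uc: int, lc: int, t: int) -> float:
    """D = (Uc + Lc) / T: proportion of all responses that were correct."""
    return (uc + lc) / t

def discrimination_index(uc: int, lc: int, t: int) -> float:
    """d = (Uc - Lc) / (T / 2): how strongly the item favours the upper group."""
    return (uc - lc) / (t / 2)

# Worked example from the text: 9 of the top 10 and 4 of the bottom 10 correct.
print(difficulty_index(9, 4, 20))       # 0.65, i.e. 65% answered correctly

# Worked example from the text: 15 of 20 (upper) and 5 of 30 (lower) correct.
print(discrimination_index(15, 5, 50))  # 0.4
```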

While it is important to analyze the performance of individual test items (reliability, difficulty level, discrimination value, etc.), it is also important to analyze the overall performance of the complete test or its subsections. These criteria are measured using certain statistical measures, primarily the measures of central tendency (mean, median and mode) and the standard deviation (a measure of dispersion). The mean, median and mode show how the test scores cluster together, and the standard deviation shows how widely the scores are spread out.

Mean (also called average): For a data set, the mean is the sum of the observations divided by the number of observations:

Mean = (x1 + x2 + ... + xn) / n

For example, the arithmetic mean of 34, 27, 45, 55, 22, 34 (six values) is (34 + 27 + 45 + 55 + 22 + 34)/6 = 217/6 ≈ 36.167.

Median: The median is the number separating the higher half of a data set from the lower half. For example, consider the data set {1, 2, 2, 2, 3, 9}; the median is 2.

Mode: The mode is the value that occurs most frequently in a data set. For example, the mode of the sample {1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17} is 6.

Standard deviation: The standard deviation of a data set is a measure of the spread of its values. It is a measure of dispersion that takes every test score into account. Simply put, it is roughly the average amount by which each student's score deviates (differs) from the mean of the class. The standard deviation is usually denoted by the letter σ and (in the sample form, with an n − 1 denominator) is calculated as:

σ = sqrt[ ((x1 − x̄)² + (x2 − x̄)² + ... + (xn − x̄)²) / (n − 1) ]

where x̄ is the mean. For example, the standard deviation of 34, 27, 45, 55, 22, 34 (six values) is 12.06.
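All four statistics are available in Python's standard statistics module; the snippet below is a quick check that reproduces the values for the six scores above (note that statistics.stdev uses the n − 1 denominator, which matches the 12.06 result).

```python
import statistics

scores = [34, 27, 45, 55, 22, 34]

print(statistics.mean(scores))             # 36.166..., i.e. ~36.167
print(statistics.median(scores))           # 34.0, middle of the sorted scores
print(statistics.mode(scores))             # 34, the value occurring most often
print(round(statistics.stdev(scores), 2))  # 12.06, sample standard deviation (n - 1)
```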

These measures of central tendency and dispersion show how appropriately a test has been designed for its intended purpose. They help the examiner determine the level of difficulty required and how well different levels of students can be differentiated. If the test results show skewness, that is, a clustering of marks towards either the top or the bottom, the examiner may conclude that the test is too easy or too difficult for the students.


Correlation: This concept lays the foundation for most concepts of test analysis. It tells the examiner the extent to which two or more sets of results agree with each other. For example:

Case 1: The results of two tests for the same set of students were as follows.

Student No   Test 1 Rank   Test 2 Rank
A            1             1
B            2             2
C            3             3
D            4             4
E            5             5

The students ranked identically on the two tests, that is, all ranks are the same for both tests. This shows a perfect positive correlation, or a correlation of +1.

Case 2: If the results of two tests for the same set of students were as follows:

Student No   Test 1 Rank   Test 3 Rank
A            1             5
B            2             4
C            3             3
D            4             2
E            5             1

Here the ranks are as different from each other as they can possibly be. The student who was ranked 1 in the first test was ranked last in the second test, and vice versa. This shows a perfect negative correlation, or a correlation of −1.

Case 3: If the results of two tests for the same set of students were as follows:

Student No   Test 1 Rank   Test 4 Rank
A            1             3
B            2             2
C            3             4
D            4             5
E            5             1

[Scatter plot of Test 1 ranks against Test 4 ranks, showing no visible pattern.]

The graph shows that there is no visible pattern between the Test 1 ranks and the Test 4 ranks. Hence it can be said that there is no correlation.

However, in most situations there will be some amount of association, and to measure this association, whether positive or negative, the coefficient of correlation is used. The following table may be used as a basis for interpreting the correlation coefficient [ii]:

Correlation   Negative         Positive
Small         −0.3 to −0.1     0.1 to 0.3
Medium        −0.5 to −0.3     0.3 to 0.5
Large         −1.0 to −0.5     0.5 to 1.0

The formula for calculating this coefficient is:

r = (nΣXY − ΣX ΣY) / sqrt[ (nΣX² − (ΣX)²) × (nΣY² − (ΣY)²) ]
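This raw-score formula translates directly into code. Below is a minimal sketch (the pearson_r name is mine), applied to the rank data from the three cases above.

```python
from math import sqrt

def pearson_r(x, y):
    """Raw-score form: r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

ranks1 = [1, 2, 3, 4, 5]
print(pearson_r(ranks1, [1, 2, 3, 4, 5]))  # +1.0 (Case 1: perfect positive)
print(pearson_r(ranks1, [5, 4, 3, 2, 1]))  # -1.0 (Case 2: perfect negative)
print(pearson_r(ranks1, [3, 2, 4, 5, 1]))  # -0.1 (Case 3: essentially none)
```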

     

Points to remember:

• A good test satisfies the criteria of objectivity, practicability, reliability, validity, difficulty level and discriminatory power.
• Objectivity is said to be ensured when the paper setter is given a design/method to follow.
• All test instruments should be easily usable and have simple and clear instructions for administration.
• A test instrument is said to be reliable if it produces the same result every time.
• A test instrument should accurately measure what it is designed to test.
• The test instrument must be made to a difficulty level suitable to the group.
• A test item should be able to distinguish between the different ability levels of the individuals being tested.

    Exercises

Q1. A vocabulary test was conducted with persons from various age groups. Determine, for the testing authority, whether there is any relationship between age and the marks obtained.

x = age of person, y = marks obtained

        x     y      x²      y²         xy
        9     28.4   81      806.56     255.6
        15    29.3   225     858.49     439.5
        24    37.6   576     1413.76    902.4
        30    36.2   900     1310.44    1086
        38    36.5   1444    1332.25    1387
        46    35.3   2116    1246.09    1623.8
        53    36.2   2809    1310.44    1918.6
        60    44.1   3600    1944.81    2646
        64    44.8   4096    2007.04    2867.2
        76    47.2   5776    2227.84    3587.2
Total   415   375.6  21623   14457.72   16713.3

Using r = (nΣXY − ΣX ΣY) / sqrt[ (nΣX² − (ΣX)²) × (nΣY² − (ΣY)²) ]:

r = (10 × 16713.3 − 415 × 375.6) / sqrt[ (10 × 21623 − 415²) × (10 × 14457.72 − 375.6²) ]
r = 11259 / sqrt(44005 × 3501.84)
r = 11259 / 12413.64
r ≈ 0.91

Thus the correlation coefficient is 0.91, which is very large. Hence, the testing authority can conclude that there is a strong positive correlation between the age of a person and the test scores obtained.
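As a mechanical check, and not part of the original solution, the same coefficient can be obtained from the raw (age, marks) pairs with numpy:

```python
import numpy as np

age   = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
marks = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
print(round(np.corrcoef(age, marks)[0, 1], 2))  # 0.91
```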

Q2. Using the test information given below, determine the range, mean and median of the scores, and the item difficulty and item discrimination indices of the questions. There are 6 true-false questions (1-6) and 4 multiple-choice questions (7-10).

          Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10
Correct   T   F   F   T   F   T   A   C   B   B
Amit      T   T   F   T   F   T   A   C   B   B
Prakash   T   F   T   T   F   T   A   C   B   B
Rahul     T   F   F   T   F   T   A   C   B   B
Gina      F   F   F   T   F   T   B   A   C   B
Tom       T   F   F   T   T   F   C   C   B   B
Ritu      T   F   T   F   T   T   A   C   B   B
Kriti     T   F   F   T   F   F   B   A   B   B
Prerna    F   F   T   T   F   T   C   C   C   B
Bhim      F   F   F   F   T   F   B   A   C   B
Arjun     T   F   T   F   T   F   C   B   C   B

Solution to the above exercise

          Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Total correct
Amit      1   0   1   1   1   1   1   1   1   1    9
Prakash   1   1   0   1   1   1   1   1   1   1    9
Rahul     1   1   1   1   1   1   1   1   1   1    10
Gina      0   1   1   1   1   1   0   0   0   1    6
Tom       1   1   1   1   0   0   0   1   1   1    7
Ritu      1   1   0   0   0   1   1   1   1   1    7
Kriti     1   1   1   1   1   0   0   0   1   1    7
Prerna    0   1   0   1   1   1   0   1   0   1    6
Bhim      0   1   1   0   0   0   0   0   0   1    3
Arjun     1   1   0   0   0   0   0   0   0   1    3

In the above table, 1 indicates a correct response and 0 indicates an incorrect response.

Mean = (9 + 9 + 10 + 6 + 7 + 7 + 7 + 6 + 3 + 3) / 10 = 6.7
Median (the middle score when all scores are put in rank order) = 7
Mode (the score occurring most often) = 7
Range (low score to high score) = 3 to 10

Arranging the above table in descending order of total score:

          Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Total
Rahul     1   1   1   1   1   1   1   1   1   1    10
Amit      1   0   1   1   1   1   1   1   1   1    9
Prakash   1   1   0   1   1   1   1   1   1   1    9
Tom       1   1   1   1   0   0   0   1   1   1    7
Ritu      1   1   0   0   0   1   1   1   1   1    7
Kriti     1   1   1   1   1   0   0   0   1   1    7
Gina      0   1   1   1   1   1   0   0   0   1    6
Prerna    0   1   0   1   1   1   0   1   0   1    6
Bhim      0   1   1   0   0   0   0   0   0   1    3
Arjun     1   1   0   0   0   0   0   0   0   1    3

Let us consider students getting a score of 7 and above as the "upper group" and those getting below 7 as the "lower group".

Using the formula D = (Uc + Lc) / T (with T = 10 students) to calculate item difficulty:

     Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10
Uc   6     5     4     5     4     4     4     5     6     6
Lc   1     4     2     2     2     2     0     1     0     4
D    70%   90%   60%   70%   60%   60%   40%   60%   60%   100%

Discrimination index:

Calculating the discrimination index using the formula d = (Uc − Lc) / (T/2):

     Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10
Uc   6     5     4     5     4     4     4     5     6     6
Lc   1     4     2     2     2     2     0     1     0     4
d    1.00  0.20  0.40  0.60  0.40  0.40  0.80  0.80  1.20  0.40

(Because the upper and lower groups here are of unequal size, 6 and 4, this simplified formula can exceed 1, as it does for Q9.)
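The whole item analysis can be reproduced programmatically. The sketch below, written under the same assumptions as the solution (upper group = total score of 7 or more, T = 10 students), recomputes the summary statistics and both indices from the 0/1 response matrix; the variable names are illustrative.

```python
from statistics import mean, median

# 0/1 response matrix from the solution table (1 = correct), one row per student.
responses = {
    "Amit":    [1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    "Prakash": [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
    "Rahul":   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "Gina":    [0, 1, 1, 1, 1, 1, 0, 0, 0, 1],
    "Tom":     [1, 1, 1, 1, 0, 0, 0, 1, 1, 1],
    "Ritu":    [1, 1, 0, 0, 0, 1, 1, 1, 1, 1],
    "Kriti":   [1, 1, 1, 1, 1, 0, 0, 0, 1, 1],
    "Prerna":  [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "Bhim":    [0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    "Arjun":   [1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
}

totals = {name: sum(row) for name, row in responses.items()}
print("mean =", mean(totals.values()), " median =", median(totals.values()))

# Upper group: total score of 7 or more; lower group: the rest.
upper = [row for name, row in responses.items() if totals[name] >= 7]
lower = [row for name, row in responses.items() if totals[name] < 7]
T = len(responses)  # total number of responses per item

for q in range(10):
    uc = sum(row[q] for row in upper)  # correct answers in the upper group
    lc = sum(row[q] for row in lower)  # correct answers in the lower group
    D = (uc + lc) / T                  # difficulty index
    d = (uc - lc) / (T / 2)            # discrimination index
    print(f"Q{q + 1}: Uc={uc} Lc={lc} D={D:.0%} d={d:.2f}")
```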

Q3. A BPO firm wants to re-examine its recruitment strategy for tele-callers. It has some past data on the performance of existing employees in their jobs and the scores those employees obtained on three tests at the time of their recruitment. Examine these scores and suggest a future recruitment strategy for the firm.

Successful    English        Vocabulary   Verbal
tele-caller   grammar test   test         ability test
1             9              3            8
1             10             3            7
1             9              4            8
0             4              5            4
1             9              0            9
0             5              9            4
1             9              2            7
0             8              9            3
0             2              6            5
1             7              3            10
0             7              2            5
0             6              0            2
0             4              0            6
1             8              10           8
1             6              10           8
1             8              0            7
0             5              4            4
1             10             7            9
0             5              0            3
1             8              0            10
0             6              10           5
0             5              5            4
1             8              6            9
0             3              4            5
1             7              10           9

Answer:

The correlation between the construct "successful tele-caller" and each set of test scores measures the construct validity of the tests; a high correlation indicates the appropriateness of the test. The correlation can be obtained using the formula:

r = (nΣXY − ΣX ΣY) / sqrt[ (nΣX² − (ΣX)²) × (nΣY² − (ΣY)²) ]

English grammar test   0.770359
Vocabulary test        -0.00542
Verbal ability test    0.897702

The results show that the verbal ability test is the most valid test for measuring the performance of a tele-caller, followed by the English grammar test. The vocabulary test has no correlation with job performance and can therefore be discontinued.

Tips for further study:

There are statistical measures for estimating and interpreting reliability and validity, such as Cronbach's alpha and the kappa coefficient. These can be studied further in the book 'Statistics for the Social Sciences' by Victoria L. Mantzopoulos, published by Prentice Hall, Englewood Cliffs, NJ (1995).

Colleges like IcfaiTech College of Engineering use the principles of standard deviation, mean and range to assess the reliability of test scores between different teachers teaching the same subject. Some colleges, like IBS, Hyderabad, also use such measures extensively.

Bibliographical References:

• Srivastava, H.S. Challenges in Educational Evaluation. UBS Publishers' Distributors Ltd.
• Entwistle, Noel. Handbook of Educational Ideas and Practices. Routledge.
• Airasian, Peter W. (2000). Assessment in the Classroom: A Concise Approach. Boston: McGraw-Hill.
• Linn, Robert L. & Gronlund, Norman E. (2000). Measurement and Assessment in Teaching. Upper Saddle River, NJ: Prentice-Hall, Inc.
• Wiersma, William & Jurs, Stephen G. (1985). Educational Measurement and Testing. Boston: Allyn and Bacon, Inc.
• Gronlund, N.E. & Linn, R.L. (1990). Measurement and Evaluation in Teaching (6th ed.). New York: Macmillan.
• Wood, D.A. (1960). Test Construction: Development and Interpretation of Achievement Tests. Columbus, OH: Charles E. Merrill Books, Inc.
• Nunnally, J.C. (1972). Educational Measurement and Evaluation (2nd ed.). New York: McGraw-Hill.
• Alderson, J.C., Clapham, C. & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge University Press.
• Salkind, N.J. (2006). Tests & Measurement for People Who (Think They) Hate Tests & Measurement. Sage Publications, Inc.
• Linn, R.L. & Miller, M.D. (2005). Measurement and Assessment in Teaching (9th ed.). Merrill Prentice Hall.

[i] Developing the perfect test is an unattainable goal for anyone in an evaluative position. Even when guidelines for constructing fair and systematic tests are followed, a plethora of factors may enter into a student's perception of the test items. Looking at an item's difficulty and discrimination will assist the test developer in determining what is wrong with individual items. Item and test analysis provide empirical data about how individual items and whole tests are performing in real test situations. Test designers need to meet certain requirements concerning validity, objectivity and reliability for the items and for the test itself; they also have to follow some logical procedures.

[ii] Although guidelines for interpreting the correlation coefficient have been given, all such criteria are in some ways arbitrary and should not be observed too strictly, because the interpretation of a correlation coefficient depends on the context and purpose. A correlation of 0.9 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors.