04 Reliability and Validity


    C.4.4 Item and Test Analysis: Determination of Test Quality

    Kanupriya Katyal, Dr. Jagrook Dawra

    Abstract:

This module deals with defining and determining the quality of test instruments and test items. Tests, as instruments of evaluation, need to be accurate, objective, practical and reliable. Further, they should be able to discriminate between good and bad performers and have a uniform difficulty level. This module explains each of these terms and describes how they can be measured. It specifically touches on six measures of test quality: objectivity, practicability, reliability, validity, difficulty level and discrimination index. It also discusses mathematical measures, such as mean, median, mode, standard deviation and correlation, that help in measuring test quality.

    Objective:

1. To enable the reader to define the quality of a test and measure it.
   a. To understand the concepts of reliability and validity in a test.
   b. To understand the various measurements used in defining quality, such as mean, median, mode, standard deviation and correlation.

    Introduction:

A test needs to evaluate and measure the performance of a candidate, a department or an institution. Measurement is purely quantitative; when an individual's judgment is added, it becomes evaluation. A test should measure what it intends to measure with considerable accuracy, and at the same time it should be able to discriminate between students of varied abilities.

Subjective judgment leads to inaccuracy and errors; these are the standard errors of measurement. Hence, they need to be identified and eliminated.

There are several valid reasons for analyzing questions and tests that students have completed and that have already been graded. Some of these reasons include the following:

• Identify content that has not been adequately covered and should be re-taught,
• Provide feedback to students,
• Determine if any items need to be revised in the event they are to be used again or become part of an item file or bank,
• Identify items that may not have functioned as they were intended,
• Direct the teacher's attention to individual student weaknesses.

Validity and reliability are the overarching principles that govern test design. Validity is the extent to which a test measures what it intends to measure. Reliability is the extent to which the test scores are consistent; it is a property of the test as a measuring instrument. Other measures, such as objectivity, practicability, difficulty level and discrimination index, are also measures of test quality and are discussed in the subsequent sections.

    Understanding Test Item and Test Quality

There are various forms of assessment techniques available to the examiner. They range from assessing students using a fixed-response multiple-choice test to an open-response short-answer, long-answer or essay type of exam. These exams serve a variety of purposes. The results may be used to assess a student's strengths and weaknesses or to plan further instructional activity. They may be used for selection, placement or certification. They may be used as tools for appraisals. Regardless of the objective of assessment, all assessments need to possess certain characteristics and a certain degree of quality. A test is said to be of good quality if it satisfies the following criteria [i]:

1. Objectivity (justice): Objectivity is said to be ensured when the paper setter is given a design/method to follow. Objectivity of the 'darts' exercise (discussed later in this module) would depend upon how well the task is defined to the players. A test with good objectivity would define the number of attempts, the distance from which to aim, etc.

For example, teachers at several levels of education assess students' overall learning by giving them projects. Often, students are not told anything about the scope of the work. They are also unaware of what distinguishes a good project from a bad one and how they will be graded. It has often been observed that students' learning from a project is enhanced if the scope of the project is clearly defined and the student is also told clearly about certain specific performance characteristics arranged in levels, indicating the degree to which the standard has been met.

If a biology student is asked to maintain a journal on leaf collection, a test with good objectivity for this project might look as follows:

Appearance/Neatness
  Grade A: Extremely neat, with cover page; leaves dried and neatly pasted.
  Grade B: Neat, with cover page; leaves dried and pasted.
  Grade C: Neat, with no cover page; leaves not dried and pasted.
  Grade D: Untidy; no cover page and leaves not dried.

Organization
  Grade A: Well organized and categorized/catalogued.
  Grade B: Organized and categorized/catalogued with some errors.
  Grade C: Organized and categorized/catalogued with a lot of errors.
  Grade D: Disorganized, with no cataloguing.

Information and understanding
  Grade A: Both common name and scientific name given; information about species/genus/family given and accurate.
  Grade B: Both common name and scientific name given; information about species/genus/family given with some errors.
  Grade C: Both common name and scientific name given; information about species/genus/family given with a lot of errors.
  Grade D: Such information is missing.

    Objectivity needs to be maintained not only for the test but also for test items.

2. Practicability (usability): All test instruments should be easily usable and have simple and clear instructions for administration. For example, an online test may not be practical in remote areas where internet connectivity is poor; a paper-based test would probably be more appropriate.

3. Reliability (dependability): A test instrument is said to be reliable if it produces the same result every time; it is the consistency of measurement. A measure is considered reliable if a person's score on the same test given twice is similar. The ability of a player to consistently hit around the bull's eye is a measure of his reliability.

There are several ways by which reliability is generally measured: test-retest, alternate form, split-half, internal consistency (inter-item) and inter-rater.

a. Test/retest: This is a conservative method to estimate reliability. In this method, the scores from repeated administrations of the same test to the same participants are compared; the test instrument remains the same. A reliable test would produce very similar scores. Simply put, the idea behind test/retest is that you should get the same score on test 1 as you do on test 2 (see the sketch at the end of this list). For example, IQ tests typically show high test-retest reliability.

The reliability of a weighing scale in a physics experiment can be tested by recording a weight 3 to 4 times with an interval of a few minutes.

Test-retest reliability is a measure of stability.

b. Alternate form reliability: When participants are able to recall their previous responses, test-retest procedures fail. In such cases, alternate form reliability is used. As the name suggests, two or more versions of the test are constructed that are equivalent in content and difficulty.

For example, marks in a pre-board test should be consistent with the board exam if there is no change in the underlying conditions between the two. Teachers also use this technique to create replacement exams for students who have for some reason missed the main exam.

Alternate form reliability is a measure of equivalence.


c. Split-half reliability: This method compares scores from different parts of the test, such as comparing the scores from even- vs. odd-numbered questions (see the sketch at the end of this list).

d. Internal consistency (inter-item) reliability: This estimates reliability by grouping questions in a questionnaire that measure the same concept. For example, you could write two sets of three questions that measure the same concept and, after collecting the responses, run a correlation between those two groups of three questions to determine whether your instrument is reliably measuring that concept.

For example, vocabulary could be tested using synonyms, antonyms, sentence completion or analogies. Understanding of Newton's laws can be tested by asking the student to state the laws or by giving him a numerical problem based on those laws.

Inter-item reliability is the degree to which different items measuring the same variable attain consistent results. Scores on different items designed to measure the same construct should be highly correlated.

e. Inter-rater reliability: Scorer reliability needs to be measured when observers use their judgment for interpretation. For example, when analyzing live or videotaped behavior, or written answers to open-ended essay-type questions, different observers rate the same responses; a high degree of correlation between the scores given by different observers indicates high inter-rater reliability.

There are often more than two judges judging the performance of gymnasts in a sporting event, and there is often more than one teacher present during the viva-voce examination of a student. A high correlation between the scores given by the different judges to the gymnasts, or by the teachers to the students, indicates high inter-rater reliability.
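Most of these reliability estimates reduce to correlating two sets of scores. The fragment below is a minimal sketch, not taken from the module, of how test-retest and split-half reliability might be estimated in Python; the score data and variable names are invented for illustration.

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Hypothetical data: five students sit the same test twice.
test1 = [62, 75, 58, 90, 71]
test2 = [60, 78, 55, 92, 69]

# Test-retest reliability: correlate the two administrations.
# A coefficient near +1 indicates a stable, reliable instrument.
print("test-retest r =", round(correlation(test1, test2), 2))

# Split-half reliability: 0/1 item responses (1 = correct), one row per student.
items = [
    [1, 1, 0, 1, 1, 0],  # student 1
    [1, 0, 1, 1, 0, 1],  # student 2
    [0, 1, 0, 0, 1, 0],  # student 3
    [1, 1, 1, 1, 1, 1],  # student 4
    [0, 0, 1, 0, 1, 0],  # student 5
]
odd_half = [sum(row[0::2]) for row in items]   # scores on items 1, 3, 5
even_half = [sum(row[1::2]) for row in items]  # scores on items 2, 4, 6
print("split-half r =", round(correlation(odd_half, even_half), 2))
```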

4. Validity (accuracy): A test instrument should accurately measure what it is designed to test. It is the strength of our conclusions. Most tests are designed to measure hypothetical constructs, like intelligence or learning, which the examiner needs to operationalize. A valid test will measure the construct (learning) without being influenced by other factors (the student's motivation level). It answers the examiner's question: "Was I right in giving the student this test/test item?" In the darts example described below (Figure 1), if the player is able to aim at the bull's eye correctly, he is valid; thus players A and B in the figure are valid (though player B is less reliable). For example, for a test intended to examine a student's understanding of Wordsworth's literary contribution, a question can be asked in the following ways:

• Summarize Wordsworth's poem 'Daffodils'.
• Critically evaluate Wordsworth's poem 'Daffodils'.

The first question tests the student's memory and not his/her understanding of 'Daffodils'.

Validity is also of different types:

a. Face validity: The test looks to be a good one; it is what teachers and students think of the test. Is it a reasonable way of assessing students? Is it too simple, or too difficult? Face validity is the consensus (generally of experts) that a measure represents a concept. It is the least stringent type of validity.

b. Construct validity: A construct is an individual characteristic that we assume exists in order to explain some aspect of behavior. Whenever we wish to interpret assessment results in terms of some individual characteristic (e.g. reading comprehension, mathematics problem-solving ability), we are concerned with a construct. Some other examples of constructs are reasoning ability, understanding of the principles of electricity, intelligence, creativity, and personality characteristics like sociability, honesty and anxiety. Constructs are often difficult to define. They are often generated from some theoretical position that the examiner assumes. For example, one examiner's model of a successful salesperson may propose that an aggressive person is likely to be a successful salesperson, whereas another examiner might opine that aggressiveness is a negative trait and that a salesperson should rather be assertive.

Construct validity measures whether the test is accurately measuring a particular construct. For example, an examiner constructs a SALESPERSON scale with questions testing both aggressive and assertive behavior and administers it to certain salespeople whose performance is known. Items that have a high correlation with the performance of a salesperson indicate high construct validity, while those with low correlation indicate low construct validity.

When measuring a student's understanding of the principles of thermodynamics, if the examiner examines the adequacy (or inadequacy) of the answer, he measures the construct appropriately. But if the examiner also examines the student on grammar, neatness, etc., the construct is not being measured appropriately.


c. Content validity: Content validity is the property of a test such that the test items sample the universe of items for which the test is designed. Content validity helps us understand whether a sample of items truly represents the entire universe of items for a particular topic.

For example, a teacher gives her students a list of 200 words and would like to know whether they have learnt to spell them correctly. She may choose a sample of, say, 20 words for a small test. We would like to know how representative these 20 words are of the entire list, so that we can generalize that a student who spells 80% of these 20 words correctly would be able to spell 80% of the entire list correctly.

d. Criterion validity: Criterion validity assesses whether a test reflects a set of abilities in a current or a future setting, as measured by some other test. It is of two types: predictive (future) and concurrent (present) validity.

• Predictive validity: The test accurately predicts performance in some subsequent situation. For example, suppose candidates were selected for a certain job by interviewing them. If the selected candidates also perform well in their jobs, then the test method (the interview) has good predictive validity.

• Concurrent validity: The test gives similar results to existing tests that have already been validated. For example, assume that the interview as a method has already been validated as a good indicator of employee performance. A written technical exam has high concurrent validity if it gives similar results.

For example, reading readiness test scores might be used to predict students' future achievement in reading (predictive), or a test of dictionary skills might be used to estimate students' current skill in the actual use of a dictionary (concurrent).

Difference between reliability and validity: Assume that some individuals are playing darts. The success of their skill is based on how close to the bull's eye they can hit consistently. Let there be four players, A, B, C and D, whose results are given in Figure 1. It can be seen from the figure that Player A is both valid and reliable: Player A not only achieves the desired result (valid) but also does so consistently (reliable).

[Figure 1: Dartboard results of the four players. A: reliable and valid; B: valid but not reliable; C: reliable but not valid; D: neither reliable nor valid.]

5. Difficulty level: A question paper or any test instrument is generally administered to a group of about the same age and in the same grade/standard. Thus, the test instrument must be made to a difficulty level suitable to the group. Item difficulty is simply the percentage of students taking the test who answered the item correctly. The larger the percentage getting an item right, the easier the item; the higher the difficulty index, the easier the item is understood to be.

For example, in the questions below, which item is more difficult?

a. Who was A.O. Hume?
b. Who was Mahatma Gandhi?

It is relatively easier to recognize the individual in the second question than in the first. Similarly, an English test item that is very difficult for an elementary student will be very easy for a high school student.

The difficulty index tells us how difficult an item is, i.e. how many people got the item correct. It is calculated as follows:

D = (Uc + Lc) / T

where Uc is the number of people in the upper group who answered the item correctly, Lc is the number of people in the lower group who answered the item correctly, and T is the total number of responses to the item.

For example, in a class, if 9 out of the top 10 students gave a correct response to the question "Who is the President of India?" and only 4 out of the bottom 10 students gave a correct response to the same, the difficulty level of the question would be:

D = (9 + 4) / 20 = 0.65 = 65%

This means that 65% of the students answered the question correctly.

6. Discrimination value: Even though a test instrument must be suited to a homogeneous group, it should still be able to distinguish between the different ability levels of the individuals being tested. The darts test should be able to discriminate between a novice, an amateur and an expert.

A good item discriminates between those who do well on the test and those who do poorly. The item discrimination index, d, can be computed to determine the discriminating power of an item. If a test is given to a large group of people, the discriminating power of an item can be measured by comparing the number of people with high test scores who answered that item correctly with the number of people with low scores who answered the same item correctly. If a particular item is doing a good job of discriminating between those who score high and those who score low, more people in the top-scoring group will have answered the item correctly. The discrimination index d is given by:

d = (Uc − Lc) / (T / 2)

where Uc is the number of people in the upper group who answered the item correctly, Lc is the number of people in the lower group who answered the item correctly, U and L are the number of people in the upper and lower groups respectively, and T is the total number of responses to the item.

For example, if 15 out of 20 people in the upper group answered a particular question correctly and 5 out of 30 people in the lower group answered the same question correctly, then:

d = (15 − 5) / ((20 + 30) / 2) = 10 / 25 = 0.4

The higher the discrimination index, the better the item, because such a value indicates that the item discriminates in favor of the upper group, which should get more items correct.


An item that everyone gets correct, or that everyone gets incorrect, will have a discrimination index equal to zero. When more students in the lower group than in the upper group select the right answer to an item, the item actually has a negative discrimination index.
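Since both indices are simple ratios, they can be computed directly. The sketch below implements the two formulas given above (the function names are mine, not the module's) and reproduces the module's two worked examples.

```python
def difficulty_index(uc: int, lc: int, t: int) -> float:
    """D = (Uc + Lc) / T: proportion of all responses that were correct."""
    return (uc + lc) / t

def discrimination_index(uc: int, lc: int, t: int) -> float:
    """d = (Uc - Lc) / (T / 2): how strongly the item favours the upper group."""
    return (uc - lc) / (t / 2)

# Worked example from the text: 9 of the top 10 and 4 of the bottom 10 correct.
print(difficulty_index(9, 4, 20))       # 0.65, i.e. 65% answered correctly

# Worked example from the text: 15 of 20 (upper) and 5 of 30 (lower) correct.
print(discrimination_index(15, 5, 50))  # 0.4
```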

While it is important to analyze the performance of individual test items (reliability, difficulty level, discrimination value, etc.), it is also important to analyze the overall performance of the complete test or its subsections. These criteria are measured using certain statistical measures, primarily the measures of central tendency (mean, median and mode) and the standard deviation (a measure of dispersion). The mean, median and mode show how the test scores cluster together, and the standard deviation shows how widely the scores are spread out.

Mean (also called average): For a data set, the mean is the sum of the observations divided by the number of observations:

Mean = (x1 + x2 + ... + xn) / n

For example, the arithmetic mean of 34, 27, 45, 55, 22, 34 (six values) is (34 + 27 + 45 + 55 + 22 + 34)/6 = 217/6 ≈ 36.167.

Median: The median is the number separating the higher half of a data set from the lower half. For example, consider the data set {1, 2, 2, 2, 3, 9}; the median is 2.

Mode: The mode is the value that occurs most frequently in a data set. For example, the mode of the sample {1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17} is 6.

Standard deviation: The standard deviation of a data set is a measure of the spread of its values. It is a measure of dispersion that takes every test score into account. Simply put, it is roughly the average amount by which each student's score deviates (differs) from the mean of the class. The standard deviation is usually denoted by the letter σ and (in the sample form, with an n − 1 denominator) is calculated as:

σ = sqrt[ ((x1 − x̄)² + (x2 − x̄)² + ... + (xn − x̄)²) / (n − 1) ]

where x̄ is the mean. For example, the standard deviation of 34, 27, 45, 55, 22, 34 (six values) is 12.06.
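All four statistics are available in Python's standard statistics module; the snippet below is a quick check that reproduces the values for the six scores above (note that statistics.stdev uses the n − 1 denominator, which matches the 12.06 result).

```python
import statistics

scores = [34, 27, 45, 55, 22, 34]

print(statistics.mean(scores))             # 36.166..., i.e. ~36.167
print(statistics.median(scores))           # 34.0, middle of the sorted scores
print(statistics.mode(scores))             # 34, the value occurring most often
print(round(statistics.stdev(scores), 2))  # 12.06, sample standard deviation (n - 1)
```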

These measures of central tendency and dispersion show how appropriately a test has been designed for its intended purpose. They help the examiner determine the level of difficulty required and how well different levels of students can be differentiated. If the test results show skewness, that is, a clustering of marks towards either the top or the bottom, the examiner may conclude that the test is too easy or too difficult for the students.


Correlation: This concept lays the foundation for most concepts of test analysis. It tells the examiner the extent to which two or more sets of results agree with each other. For example:

Case 1: The results of two tests for the same set of students were as follows.

Student No   Test 1 Rank   Test 2 Rank
A            1             1
B            2             2
C            3             3
D            4             4
E            5             5

The students ranked identically on the two tests, that is, all ranks are the same for both tests. This shows a perfect positive correlation, or a correlation of +1.

Case 2: If the results of two tests for the same set of students were as follows:

Student No   Test 1 Rank   Test 3 Rank
A            1             5
B            2             4
C            3             3
D            4             2
E            5             1

Here the ranks are as different from each other as they can possibly be. The student who was ranked 1 in the first test was ranked last in the second test, and vice versa. This shows a perfect negative correlation, or a correlation of −1.

Case 3: If the results of two tests for the same set of students were as follows:

Student No   Test 1 Rank   Test 4 Rank
A            1             3
B            2             2
C            3             4
D            4             5
E            5             1

[Scatter plot of Test 1 ranks against Test 4 ranks, showing no visible pattern.]

The graph shows that there is no visible pattern between the Test 1 ranks and the Test 4 ranks. Hence it can be said that there is no correlation.

However, in most situations there will be some amount of association, and to measure this association, whether positive or negative, the coefficient of correlation is used. The following table may be used as a basis for interpreting the correlation coefficient [ii]:

Correlation   Negative         Positive
Small         −0.3 to −0.1     0.1 to 0.3
Medium        −0.5 to −0.3     0.3 to 0.5
Large         −1.0 to −0.5     0.5 to 1.0

The formula for calculating this coefficient is:

r = (nΣXY − ΣX ΣY) / sqrt[ (nΣX² − (ΣX)²) × (nΣY² − (ΣY)²) ]
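This raw-score formula translates directly into code. Below is a minimal sketch (the pearson_r name is mine), applied to the rank data from the three cases above.

```python
from math import sqrt

def pearson_r(x, y):
    """Raw-score form: r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

ranks1 = [1, 2, 3, 4, 5]
print(pearson_r(ranks1, [1, 2, 3, 4, 5]))  # +1.0 (Case 1: perfect positive)
print(pearson_r(ranks1, [5, 4, 3, 2, 1]))  # -1.0 (Case 2: perfect negative)
print(pearson_r(ranks1, [3, 2, 4, 5, 1]))  # -0.1 (Case 3: essentially none)
```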

     

Points to remember:

• A good test satisfies the criteria of objectivity, practicability, reliability, validity, difficulty level and discriminatory power.
• Objectivity is said to be ensured when the paper setter is given a design/method to follow.
• All test instruments should be easily usable and have simple and clear instructions for administration.
• A test instrument is said to be reliable if it produces the same result every time.
• A test instrument should accurately measure what it is designed to test.
• The test instrument must be made to a difficulty level suitable to the group.
• A test item should be able to distinguish between the different ability levels of the individuals being tested.

    Exercises

Q1. A vocabulary test was conducted with persons from various age groups. Determine, for the testing authority, whether there is any relationship between age and the marks obtained.

x = age of person, y = marks obtained

        x     y      x²      y²         xy
        9     28.4   81      806.56     255.6
        15    29.3   225     858.49     439.5
        24    37.6   576     1413.76    902.4
        30    36.2   900     1310.44    1086
        38    36.5   1444    1332.25    1387
        46    35.3   2116    1246.09    1623.8
        53    36.2   2809    1310.44    1918.6
        60    44.1   3600    1944.81    2646
        64    44.8   4096    2007.04    2867.2
        76    47.2   5776    2227.84    3587.2
Total   415   375.6  21623   14457.72   16713.3

Using r = (nΣXY − ΣX ΣY) / sqrt[ (nΣX² − (ΣX)²) × (nΣY² − (ΣY)²) ]:

r = (10 × 16713.3 − 415 × 375.6) / sqrt[ (10 × 21623 − 415²) × (10 × 14457.72 − 375.6²) ]
r = 11259 / sqrt(44005 × 3501.84)
r = 11259 / 12413.64
r ≈ 0.91

Thus the correlation coefficient is 0.91, which is very large. Hence, the testing authority can conclude that there is a strong positive correlation between the age of a person and the test scores obtained.
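As a mechanical check, and not part of the original solution, the same coefficient can be obtained from the raw (age, marks) pairs with numpy:

```python
import numpy as np

age   = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
marks = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
print(round(np.corrcoef(age, marks)[0, 1], 2))  # 0.91
```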

Q2. Using the test information given below, determine the range, mean and median of the scores, and the item difficulty and item discrimination indices of the questions. There are 6 true-false questions (1-6) and 4 multiple-choice questions (7-10).

          Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10
Correct   T   F   F   T   F   T   A   C   B   B
Amit      T   T   F   T   F   T   A   C   B   B
Prakash   T   F   T   T   F   T   A   C   B   B
Rahul     T   F   F   T   F   T   A   C   B   B
Gina      F   F   F   T   F   T   B   A   C   B
Tom       T   F   F   T   T   F   C   C   B   B
Ritu      T   F   T   F   T   T   A   C   B   B
Kriti     T   F   F   T   F   F   B   A   B   B
Prerna    F   F   T   T   F   T   C   C   C   B
Bhim      F   F   F   F   T   F   B   A   C   B
Arjun     T   F   T   F   T   F   C   B   C   B

Solution to the above exercise

          Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Total correct
Amit      1   0   1   1   1   1   1   1   1   1    9
Prakash   1   1   0   1   1   1   1   1   1   1    9
Rahul     1   1   1   1   1   1   1   1   1   1    10
Gina      0   1   1   1   1   1   0   0   0   1    6
Tom       1   1   1   1   0   0   0   1   1   1    7
Ritu      1   1   0   0   0   1   1   1   1   1    7
Kriti     1   1   1   1   1   0   0   0   1   1    7
Prerna    0   1   0   1   1   1   0   1   0   1    6
Bhim      0   1   1   0   0   0   0   0   0   1    3
Arjun     1   1   0   0   0   0   0   0   0   1    3

In the above table, 1 indicates a correct response and 0 indicates an incorrect response.

Mean = (9 + 9 + 10 + 6 + 7 + 7 + 7 + 6 + 3 + 3) / 10 = 6.7
Median (the middle score when all scores are put in rank order) = 7
Mode (the score occurring most often) = 7
Range (low score to high score) = 3 to 10

Arranging the above table in descending order of total score:

          Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Total
Rahul     1   1   1   1   1   1   1   1   1   1    10
Amit      1   0   1   1   1   1   1   1   1   1    9
Prakash   1   1   0   1   1   1   1   1   1   1    9
Tom       1   1   1   1   0   0   0   1   1   1    7
Ritu      1   1   0   0   0   1   1   1   1   1    7
Kriti     1   1   1   1   1   0   0   0   1   1    7
Gina      0   1   1   1   1   1   0   0   0   1    6
Prerna    0   1   0   1   1   1   0   1   0   1    6
Bhim      0   1   1   0   0   0   0   0   0   1    3
Arjun     1   1   0   0   0   0   0   0   0   1    3

Let us consider students getting a score of 7 and above as the "upper group" and those getting below 7 as the "lower group".

Using the formula D = (Uc + Lc) / T (with T = 10 students) to calculate item difficulty:

     Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10
Uc   6     5     4     5     4     4     4     5     6     6
Lc   1     4     2     2     2     2     0     1     0     4
D    70%   90%   60%   70%   60%   60%   40%   60%   60%   100%

Discrimination index:

Calculating the discrimination index using the formula d = (Uc − Lc) / (T/2):

     Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10
Uc   6     5     4     5     4     4     4     5     6     6
Lc   1     4     2     2     2     2     0     1     0     4
d    1.00  0.20  0.40  0.60  0.40  0.40  0.80  0.80  1.20  0.40

(Because the upper and lower groups here are of unequal size, 6 and 4, this simplified formula can exceed 1, as it does for Q9.)
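The whole item analysis can be reproduced programmatically. The sketch below, written under the same assumptions as the solution (upper group = total score of 7 or more, T = 10 students), recomputes the summary statistics and both indices from the 0/1 response matrix; the variable names are illustrative.

```python
from statistics import mean, median

# 0/1 response matrix from the solution table (1 = correct), one row per student.
responses = {
    "Amit":    [1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    "Prakash": [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
    "Rahul":   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "Gina":    [0, 1, 1, 1, 1, 1, 0, 0, 0, 1],
    "Tom":     [1, 1, 1, 1, 0, 0, 0, 1, 1, 1],
    "Ritu":    [1, 1, 0, 0, 0, 1, 1, 1, 1, 1],
    "Kriti":   [1, 1, 1, 1, 1, 0, 0, 0, 1, 1],
    "Prerna":  [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "Bhim":    [0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    "Arjun":   [1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
}

totals = {name: sum(row) for name, row in responses.items()}
print("mean =", mean(totals.values()), " median =", median(totals.values()))

# Upper group: total score of 7 or more; lower group: the rest.
upper = [row for name, row in responses.items() if totals[name] >= 7]
lower = [row for name, row in responses.items() if totals[name] < 7]
T = len(responses)  # total number of responses per item

for q in range(10):
    uc = sum(row[q] for row in upper)  # correct answers in the upper group
    lc = sum(row[q] for row in lower)  # correct answers in the lower group
    D = (uc + lc) / T                  # difficulty index
    d = (uc - lc) / (T / 2)            # discrimination index
    print(f"Q{q + 1}: Uc={uc} Lc={lc} D={D:.0%} d={d:.2f}")
```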

Q3. A BPO firm wants to re-examine its recruitment strategy for tele-callers. It has some past data on the performance of existing employees in their jobs and the scores those employees obtained on three tests at the time of their recruitment. Examine these scores and suggest a future recruitment strategy for the firm.

Successful    English        Vocabulary   Verbal
tele-caller   grammar test   test         ability test
1             9              3            8
1             10             3            7
1             9              4            8
0             4              5            4
1             9              0            9
0             5              9            4
1             9              2            7
0             8              9            3
0             2              6            5
1             7              3            10
0             7              2            5
0             6              0            2
0             4              0            6
1             8              10           8
1             6              10           8
1             8              0            7
0             5              4            4
1             10             7            9
0             5              0            3
1             8              0            10
0             6              10           5
0             5              5            4
1             8              6            9
0             3              4            5
1             7              10           9

Answer:

The correlation between the construct "successful tele-caller" and each set of test scores measures the construct validity of the tests; a high correlation indicates the appropriateness of the test. The correlation can be obtained using the formula:

r = (nΣXY − ΣX ΣY) / sqrt[ (nΣX² − (ΣX)²) × (nΣY² − (ΣY)²) ]

English grammar test   0.770359
Vocabulary test        -0.00542
Verbal ability test    0.897702

The results show that the verbal ability test is the most valid test for measuring the performance of a tele-caller, followed by the English grammar test. The vocabulary test has no correlation with job performance and can therefore be discontinued.

Tips for further study:

There are statistical measures for estimating and interpreting reliability and validity, such as Cronbach's alpha and the kappa coefficient. These can be studied further in the book 'Statistics for the Social Sciences' by Victoria L. Mantzopoulos, published by Prentice Hall, Englewood Cliffs, NJ (1995).

Colleges like IcfaiTech College of Engineering use the principles of standard deviation, mean and range to assess the reliability of test scores between different teachers teaching the same subject. Some colleges, like IBS, Hyderabad, also use such measures extensively.

Bibliographical References:

• Srivastava, H.S. Challenges in Educational Evaluation. UBS Publishers' Distributors Ltd.
• Entwistle, Noel. Handbook of Educational Ideas and Practices. Routledge.
• Airasian, Peter W. (2000). Assessment in the Classroom: A Concise Approach. Boston: McGraw-Hill.
• Linn, Robert L. & Gronlund, Norman E. (2000). Measurement and Assessment in Teaching. Upper Saddle River, NJ: Prentice-Hall, Inc.
• Wiersma, William & Jurs, Stephen G. (1985). Educational Measurement and Testing. Boston: Allyn and Bacon, Inc.
• Gronlund, N.E. & Linn, R.L. (1990). Measurement and Evaluation in Teaching (6th ed.). New York: Macmillan.
• Wood, D.A. (1960). Test Construction: Development and Interpretation of Achievement Tests. Columbus, OH: Charles E. Merrill Books, Inc.
• Nunnally, J.C. (1972). Educational Measurement and Evaluation (2nd ed.). New York: McGraw-Hill.
• Alderson, J.C., Clapham, C. & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge University Press.
• Salkind, N.J. (2006). Tests & Measurement for People Who (Think They) Hate Tests & Measurement. Sage Publications, Inc.
• Linn, R.L. & Miller, M.D. (2005). Measurement and Assessment in Teaching (9th ed.). Merrill Prentice Hall.

[i] Developing the perfect test is an unattainable goal for anyone in an evaluative position. Even when guidelines for constructing fair and systematic tests are followed, a plethora of factors may enter into a student's perception of the test items. Looking at an item's difficulty and discrimination will assist the test developer in determining what is wrong with individual items. Item and test analysis provide empirical data about how individual items and whole tests are performing in real test situations. Test designers need to meet certain requirements concerning validity, objectivity and reliability for the items and for the test itself; they also have to follow some logical procedures.

[ii] Although guidelines for interpreting the correlation coefficient have been given, all such criteria are in some ways arbitrary and should not be observed too strictly, because the interpretation of a correlation coefficient depends on the context and purpose. A correlation of 0.9 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors.