ITC Conference, Winchester, 2002
Computer-based Testing
Usability of Psychometric Admeasurements
Dr. J. M. Müller
University of Tübingen, Germany
http://www.joergmmueller.de/default.htm
Overview
1. Introduction: Formal test descriptions in practice
2. Definition of usability in the context of test description
3. Illustrating problems: Reliability
4. Criteria of usability: foundation, scaling, general attributes
5. Two examples of enhanced usability: NDR and PDR
6. Summary
Introduction: Psychometric admeasurements in practice today and tomorrow
1. Test users often use poor-quality tests (e.g. Piotrowski et al.; Wade & Baker, 1977); psychometric knowledge (Moreland et al., 1995)/competence approach (Bartram, 1995, 1996)
2. What should be described? CBT: criteria for software usability (ISO 9241/10, 1991; Willumeit, Gediga & Hamborg, 1995) and further criteria (platform independence, the possibility of building one's own norm bank, protection)
3. How should it be described?
4. "Good practice" guidelines and standards are based on quality criteria (e.g. Standards for Educational and Psychological Testing, APA, 1999; International Guidelines for Testing, ITC, 2000)
Quality Supply Quality Demand
Definition of Usability
Scope of usability: Usability in the context of psychological testing concerns all kinds of information that test users need to describe a test for various purposes, and the ways to communicate them. This includes test manuals as well as formal test descriptions by means of psychometric admeasurements.
Aim of usability: The product of good usability is that any test user finds all necessary information quickly and in a properly standardized form, ready to use for answering the test user's questions, so that they can decide whether a test is an appropriate aid for the diagnostic question.
Frame of usability: Quality assurance in the context of psychological testing refers to test construction, test translation, test description and the use of tests in practice. Methods to enhance quality control can include guidelines for test use, standards for test description, etc. Usability is a strategy to enhance quality on the level of formal description.
Consequences of usability concern the reengineering of formal test description.
Indices of measurement error
(Figure: a cloud of the indices in use — Spearman correlation, % or SMC, Phi coefficient, retest Pearson correlation, Yule's Y, Cronbach's alpha, Kuder-Richardson's Formula 20, Spearman-Brown prophecy formula, intraclass correlation, sensitivity TP/(TP+FN), specificity TN/(TN+FP), standard error of a score, kappa (reclassification), model-fit likelihoods, information function, kappa (interrater), standard error score.)
Measurement of error
Dimensional construct: CTT, IRT, Generalizability Theory — standard error score, reliability
Categorical construct: nonspecific misclassification, specific misclassification
Relationships between indices of error of measurement
(Figure: the same indices as above — Spearman correlation, % or SMC, Phi coefficient, retest Pearson correlation, Yule's Y, Cronbach's alpha, Kuder-Richardson's Formula 20, Spearman-Brown prophecy formula, intraclass correlation, sensitivity, specificity, standard error of a score, kappa (reclassification), model-fit likelihoods, information function, kappa (interrater) — linked by arrows, together with information criteria and the relations among Y, kappa, Phi and the correlation.)
Top-down vs. bottom-up strategy to develop a coefficient

Scientist's point of view (top-down): test theory/statistic -> index: generic formula -> algorithm -> scale (correction) -> interpretation of the score (operational meaning)

Practitioner's point of view (bottom-up): defining the operational meaning -> scale definition -> specification within a test theory -> index: defining the influencing factors -> index: generic formula
Rescaling reliability: Number of distinctive results (NDR)
(Wright & Masters, 1982; Lehrl & Kinzel, 1973; Müller, 2001)
Range R
Test score distribution
(Figure: two scores x1, x2 on the score range; successive critical differences are marked along the range.)
Formula (alpha = .05):
x1 - x2 >= k, with k = 1.96 * s_x * sqrt(2 * (1 - r_tt))
NDR: D_R = R / k
R = test score range
k = critical difference
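As a minimal sketch (in Python, assuming the reconstruction above: k = 1.96 * s_x * sqrt(2(1 - r_tt)) and NDR = R / k), the coefficient can be computed as:

```python
import math

def critical_difference(sd, r_tt, z=1.96):
    """Critical difference k: the smallest score distance that is
    significant at the given z value (two scores, alpha = .05)."""
    return z * sd * math.sqrt(2 * (1 - r_tt))

def ndr(score_range, sd, r_tt, z=1.96):
    """Number of distinctive results: how many bands of width k
    fit into the observed score range R."""
    return score_range / critical_difference(sd, r_tt, z)
```

As expected from the formula, higher reliability yields more distinctive results for the same score range.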
Criteria of usability for formal quality criteria
(modified from Müller, 2001, 2002a,b; Goodman & Kruskal, 1954)

Foundation
1. Unambiguous operational meaning
2. Unambiguous formal definition
3. Broad application area
4. Relevant dependencies
5. Independent of irrelevant factors

Scale definition
1. Meaningful scale unit, which implies: interval scale, positive values, defined range of values
2. Comparable to the reference scale
3. Significant scale unit, which implies a minimum of observations (Nmin)

Global attributes in use
1. Relevance
2. Informative (not redundant)
3. Predictable for the test user (nominal/actual value comparison)
4. Easy to learn
5. Easy to utilise
6. Fisher's (1925) criteria of estimation
NDR at work...
r = .50 -> NDR = 2;  r = .92 -> NDR = 5;  r = .98 -> NDR = 10
Distribution of the reliability coefficient vs. distribution of the NDR coefficient:
conclusion "many precise tests" vs. conclusion "some precise tests"
Probability of distinctive results (PDR)
Formula:
D_t = n * (n - 1) / 2
s_ij = 1 if |x_i - x_j| >= k, s_ij = 0 if |x_i - x_j| < k
D_s = sum over all pairs (i, j) of s_ij
PDR = D_s / D_t
Complete score comparison of pairs:
a rectangular distribution shows an 80 % probability of distinguishing two test scores,
a Gaussian distribution a 60 % probability.
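The pairwise comparison behind PDR can be sketched directly (Python; `k` is the critical difference from the NDR slide):

```python
def pdr(scores, k):
    """Probability of distinctive results: share of all score pairs
    (i, j) whose distance reaches the critical difference k."""
    n = len(scores)
    d_t = n * (n - 1) // 2                       # number of pairs
    d_s = sum(1 for i in range(n) for j in range(i + 1, n)
              if abs(scores[i] - scores[j]) >= k)
    return d_s / d_t
```

For example, for the scores 1, 2, 3, 4 and k = 2, three of the six pairs are distinguishable, giving PDR = 0.5.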
PDR: Simulation study
Performance in separating test scores with respect to reliability and score distribution (figure: PDR plotted against reliability).
PDR: Example
Subscale 'Resignation', Stress Coping Questionnaire (SVF-KJ; Hampel, Petermann & Dickow, 1999; N = 1123): r = 0.81, PDR = 41.6 %
Subscale 'Unsicherheit' (insecurity), Symptom Check List (Derogatis, 1977; German version Franke, 1995; N = 875): r = 0.81, PDR = 30.6 %
Reviewing NDR and PDR
1. NDR and PDR can be derived in any test-theoretical model – there is progress in the application area.
2. NDR and PDR have an easy-to-understand operational meaning.
3. NDR and PDR are predictable for the test user in the nominal/actual value comparison.
NDR and PDR serve as examples of how to develop more usable formal test descriptions.
Summary
1. Usability is a possible strategy, with explicit and observable criteria, for improving formal test descriptions – and for indirectly strengthening the role of guidelines and standards.
2. With NDR and PDR, two easy-to-understand coefficients have been proposed, whose application in several test-theoretical models is in progress.
Thank you for your attention!
Medicine: Effect-size measures
Practitioner's coefficient:
NNT = 1 / (CER * RRR)
Scientific coefficient (Cohen, 1988):
w = sqrt( sum over i = 1..m of (P_1i - P_0i)^2 / P_0i )
NNT [Number Needed to Treat]: the number of patients who need to be treated to prevent 1 adverse outcome. Taken from the EBM glossary, Evidence-Based Medicine, Volume 125, Number 1.
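A minimal sketch of the practitioner's coefficient (Python; using the equivalent form NNT = 1 / ARR = 1 / (CER - EER), since RRR = (CER - EER) / CER):

```python
def nnt(cer, eer):
    """Number needed to treat: reciprocal of the absolute risk
    reduction between the control event rate (CER) and the
    experimental event rate (EER)."""
    return 1 / (cer - eer)
```

For instance, reducing an adverse-outcome rate from 20 % to 10 % gives NNT = 10: ten patients must be treated to prevent one adverse outcome.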
Measuring in technical fields: Solutions from engineering
There is a German norm, DIN 2257, on how to measure the physical length of an object and how to report the result. The norm allows as output only values with statistical evidence.
Criteria of usability for formal quality criteria, applied to NNT

Foundation
1. Unambiguous operational meaning
2. Unambiguous formal definition
3. Broad application area
4. Relevant dependencies
5. Independent of irrelevant factors

Scale definition
1. Meaningful scale unit, which implies: interval scale, positive values, defined range of values
2. Comparable to the reference scale
3. Significant scale unit, which implies a minimum of observations (Nmin)

Global attributes in use
1. Relevance
2. Informative (not redundant)
3. Predictable for the test user (nominal/actual value comparison)
4. Easy to learn
5. Easy to utilise
6. Fisher's (1925) criteria of estimation
Criteria of software usability(from Willumeit, Gediga & Hamborg, 1995)
Questionnaire on the basis of ISO9241/10 (IsoMetrics) to evaluate the following dimensions:
1. Suitability for the task
2. Self-descriptiveness
3. Controllability
4. Conformity with user expectations
5. Error tolerance
6. Suitability for individualization
7. Suitability for learning
KR20 and Cronbach's alpha

Kuder-Richardson Formula KR20 (from Cronbach, 1951):
r_tt = n / (n - 1) * (1 - sum_i p_i * q_i / s_t^2)
i = item; p_i = relative frequency of 1s on item i; q_i = relative frequency of 0s; s_t^2 = variance of the total score

Cronbach's alpha:
alpha = c / (c - 1) * (1 - sum_i s_i^2 / s_tot^2)
c = number of variables; s_i^2 = variance of variable i; s_tot^2 = variance of the sum
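Cronbach's alpha as defined above can be sketched in Python (pure stdlib; `items` is a hypothetical list of item-score columns, items[i][v] = score of person v on item i):

```python
def cronbach_alpha(items):
    """Cronbach's alpha: c/(c-1) * (1 - sum of item variances
    divided by the variance of the total score)."""
    c = len(items)                                # number of items
    n = len(items[0])                             # number of persons

    def var(xs):                                  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[v] for item in items) for v in range(n)]
    return c / (c - 1) * (1 - sum(var(it) for it in items) / var(totals))
```

Two perfectly parallel items give alpha = 1; two statistically independent items give alpha = 0.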
Formulas for the error of measurement in categorical constructs

2x2 table: rows A1 = (a, b), A2 = (c, d); columns B1, B2

Cohen's kappa (Conger & Ward, 1984, compared 16 further concordance measures for binary data):
kappa = (p_0 - p_e) / (1 - p_e), with p_0 = (a + d) / N and p_e = [(a + c)(a + b) + (c + d)(b + d)] / N^2

Yule's four-fold interdependence measures:
Y = (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc))
Q coefficient: Q = (ad - bc) / (ad + bc)

Phi coefficient: depends on the marginal distributions; its significance test depends on N (Yates continuity correction, 1934) and is based on chi^2 = sum_ij (f_ij - e_ij)^2 / e_ij
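The 2x2 measures above can be computed directly from the cell counts a, b, c, d (Python sketch; the function name is illustrative):

```python
import math

def two_by_two_measures(a, b, c, d):
    """Agreement/association measures for a 2x2 table
    [[a, b], [c, d]] as on the slide (rows A1/A2, columns B1/B2)."""
    n = a + b + c + d
    p0 = (a + d) / n                                    # observed agreement
    pe = ((a + c) * (a + b) + (c + d) * (b + d)) / n ** 2  # chance agreement
    kappa = (p0 - pe) / (1 - pe)
    y = (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))
    q = (a * d - b * c) / (a * d + b * c)
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return {"kappa": kappa, "Y": y, "Q": q, "phi": phi}
```

On the symmetric table (40, 10, 10, 40), kappa, Y and phi all agree at 0.6 while Q is noticeably larger — one illustration of why the choice of index matters for usability.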
Formulas for the error of measurement in categorical constructs (continued)

Fricke's agreement coefficient:
Ü = 1 - SS / SS_max, and for a fourfold table Ü = (a + d) / n
SS = sum of squares within a person; SS_max = maximal possible sum of squares within the persons
Example table (rows I-III, columns A-C):
I: 1, 4, 3; II: 0, 4, 2; III: 0, 5, 2

Point-biserial correlation:
r_pbis = (X̄_R - X̄) / s_x * sqrt(p / q)
X̄ = arithmetic mean of all raw test scores; X̄_R = arithmetic mean of the subjects with correct answers; s_x = standard deviation of the raw scores of all subjects; N = number of all subjects; N_R = number of subjects with correct answers

Tetrachoric correlation:
r_tet = cos( 180° / (1 + sqrt(ad / bc)) )
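The point-biserial correlation as glossed above, as a Python sketch (`correct` is a hypothetical 0/1 vector marking who answered the item correctly):

```python
import math

def point_biserial(scores, correct):
    """Point-biserial correlation between a dichotomous item
    (correct: 0/1 per person) and the raw total scores."""
    n = len(scores)
    mean_all = sum(scores) / n
    right = [s for s, c in zip(scores, correct) if c == 1]
    mean_r = sum(right) / len(right)
    sd = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    p = len(right) / n                 # proportion answering correctly
    q = 1 - p
    return (mean_r - mean_all) / sd * math.sqrt(p / q)
```

This reproduces the ordinary Pearson correlation between the 0/1 item vector and the scores.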
Formulas for the error of measurement in CTT and IRT, plus the prophecy formula

Spearman-Brown formula (k = factor of test lengthening):
r_tt(k) = k * r_tt / (1 + (k - 1) * r_tt)

Rasch model:
Var(E) = sum over i = 1..k of p_vi * (1 - p_vi)

CTT:
s_e = s_x * sqrt(1 - r_tt)
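The Spearman-Brown prophecy formula and the CTT standard error of measurement, sketched in Python:

```python
def spearman_brown(r_tt, k):
    """Predicted reliability after lengthening a test by factor k."""
    return k * r_tt / (1 + (k - 1) * r_tt)

def sem(sd, r_tt):
    """CTT standard error of measurement: s_e = s_x * sqrt(1 - r_tt)."""
    return sd * (1 - r_tt) ** 0.5
```

Doubling a test with reliability .50 raises the predicted reliability to .67; a scale with s_x = 10 and r_tt = .84 has a standard error of measurement of 4 score points.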
Some formulas for the error of measurement in metric constructs

Reliability (Kelley, 1921):
r_tt = s_w^2 / (s_w^2 + s_e^2)

Pearson (1907) correlation, after Bravais (1846):
r = sum_i (x_i - x̄)(y_i - ȳ) / (N * s_x * s_y)

Spearman's rho (1904):
rho = 1 - 6 * sum_i d_i^2 / (N * (N^2 - 1))
Example ranks: (1, 2, 3, 4, 5) vs. (3, 2, 3, 5, 4)

Kendall's tau (1942; S = difference between the numbers of concordant and discordant pairs):
tau = S / (N * (N - 1) / 2)

Fisher's Z:
Z = 1/2 * ln( (1 + r) / (1 - r) )
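Spearman's rho (no-ties formula) and Fisher's Z, sketched in Python:

```python
import math

def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two rank series (formula without ties)."""
    n = len(rank_x)
    d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def fisher_z(r):
    """Fisher's Z transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))
```

Identical rankings give rho = 1, fully reversed rankings rho = -1; the Z transformation is used to make correlations comparable and testable.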
Non-linear relationship between reliability, NDR and the standard error score
(Figure: NDR and the standard error score plotted against reliability up to 1.)
Item Response Theory (Fischer & Molenaar, 1994)
1. Dichotomous Rasch model
2. Linear logistic test model
3. Linear logistic model for change
4. Dynamic generalization of the Rasch model
5. One-parameter logistic model
6. Linear logistic latent class analysis
7. Mixture-distribution Rasch models
8. Polytomous Rasch models
9. Extended rating scale and partial credit models
10. Polytomous mixed Rasch models
11. ...
...more IRT (van der Linden & Hambleton, 1997)
1. Nominal categories model
2. Response model for multiple choice
3. Graded response model
4. Partial credit model
5. Generalized partial credit model
6. Logistic model for time-limit tests
7. Hyperbolic cosine IRT model for unfolding direct responses
8. Single-item response model
9. Response model with manifest predictors
10. A linear multidimensional model
11. ...
Formulas of some IRT models

Rasch model:
p(x_vi) = exp(theta_v - sigma_i) / (1 + exp(theta_v - sigma_i))

Binomial model:
p(x_A) = exp(theta_A) / (1 + exp(theta_A))

Unfolding model:
p(x_vi) = exp(-(theta_v - sigma_i)^2) / (1 + exp(-(theta_v - sigma_i)^2))

Birnbaum model:
p(x_vi) = exp(a_i * (theta_v - sigma_i)) / (1 + exp(a_i * (theta_v - sigma_i)))

Latent class model:
p(x) = sum over g = 1..G of pi_g * product over i = 1..k of pi_ig^x_i * (1 - pi_ig)^(1 - x_i)
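The dichotomous Rasch model probability, sketched in Python:

```python
import math

def rasch_p(theta, sigma):
    """Dichotomous Rasch model: probability that a person with
    ability theta solves an item with difficulty sigma."""
    return math.exp(theta - sigma) / (1 + math.exp(theta - sigma))
```

When ability equals difficulty the solving probability is exactly .5, and the model is symmetric: p(theta, sigma) + p(sigma, theta) = 1.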
Norm scales
SCL-90-R test score distribution
Simulation study on the relationship between measures of association
(Figure: scatter plots pairing the measures of association with each other — Y/kappa/phi vs. correlation, vs. Q, vs. SMC; phi vs. SMC; phi vs. kappa; kappa vs. SMC. Question: is the relationship linear?)
2x2 table: A1 = (a, b), A2 = (c, d); columns B1, B2
Conditions: dichotomized normal distribution with equal marginals vs. skewed distribution with unequal marginals
Efficiency in measuring
Content: efficiency. Concept: the less effort you need for the same amount of information, the more efficient the test is; efficiency = f(information; effort).
Index: E = amount of information / time. Estimates: information theory (Shannon & Weaver, 1949).
Amount of information of a signal: Chess example
In the chess example you need at least 6 binary (50-50) questions, i.e. 6 bits.
1st question: left or right half?
2nd question: top or bottom half?
3rd question ... up to the 6th.
The scale unit 'bit' can be understood as the minimal or optimal number of questions needed to identify a signal out of a set of alternatives.
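The bit count in the chess example (64 squares, so 6 halving questions) follows from the base-2 logarithm; a minimal Python sketch:

```python
import math

def bits_needed(alternatives):
    """Minimal number of yes/no questions (bits) needed to single
    out one signal among equally likely alternatives."""
    return math.ceil(math.log2(alternatives))
```

For a chessboard, bits_needed(64) returns 6, matching the six questions on the slide.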
(Figure: chess players A, B and C with winning odds of 1:2.)
Rasch variances are a measure of the variability of persons within a dimension.
The difference in winning probabilities serves as the unit of difference.
1. Winning probabilities -> solving probabilities
2. Opponent -> test item (item parameter)
3. Playing strength -> person parameter
4. The difference in winning probability is defined via the logit of the Rasch model
Interpretable Rasch variances
(Figure: probability of solving an item plotted against the person parameter; item i with sigma = 0, item m with sigma = 1; persons A, B, C; range theta_max - theta_min.)
p(x_Ai) = exp(theta_A - sigma_i) / (1 + exp(theta_A - sigma_i))
Difference in the probability of solving a question or task.
Empirical evidence on the range of person parameters in Rasch units

AID (Kubinger & Wurst): standard form / parallel form
Alltagswissen (everyday knowledge): 21.1 / 21.3
Realitätssicherheit (certainty about reality): 13.3 / 13.1
Angewandtes Rechnen (applied arithmetic): 21.7 / 20.5

Trait | Author | Range
Verbal intelligence test | Metzler & Schmidt | 11.4
Nonverbal intelligence | Formann & Piswanger | 8.2
Attitude towards sexual morality | Wakenhut | 8.1
Attitude towards penal-law reform | Wakenhut | 7.2
Complaint list | Fahrenberg | 6.4
Spatial ability | Gittler | 5.9
Handling numbers in children | Rasch | 3.5
Usability criteria: explanations
• Relevant dependencies. Example: reliability and test length, stability, ...
• Irrelevant dependencies. Example: reliability and the test score distribution
• Displaying numbers: integer, positive, predictable range
• Meaningful scale unit
• Familiarity: each new coefficient should be distinctly more usable than the traditional one
7. Linearity to the unit-in-change
Explanation of 'linearity to the unit-in-change':
- In the case of measurement precision, this concerns the relation of reliability to the measurement error.
- In the case of agreement, this concerns the relation of Yule's Y to a change in the cell frequency a or d.
(Figure axes: correlation/reliability vs. standard error of measurement; Yule's Y vs. freq(cell a).)
Evaluating the progress through enhancing usability
1. Formal test criteria are used more frequently for test selection
2. Tests in practice are of higher quality
Ergonomics in psychological test selection

Ergonomics:
- Configuration of the environment: designing a tool to fit the hand.
- Software conception: developing a program that can be used intuitively.

Psychological diagnostics:
- Structuring a test description so that the relevant information is ready to use.
Integrating ergonomics in the formal test description
(Diagram: human-interface techniques link the test user, the psychometric admeasurements and the test; analysis of usage and evaluation feed back into the usability criteria.)
1. Formal test criteria are used more frequently for test selection
2. Tests in practice are of higher quality
Ergonomics and the development of criteria of usability
Requirement analysis (Mayhew, 1999): user profile, task analysis, platform capabilities/constraints
In test terms: test user, test selection, test theory
Top-down vs. bottom-up strategy to develop a coefficient

Scientist's point of view (top-down): test theory/statistic -> index: generic formula -> algorithm -> scale (correction) -> interpretation of the score (operational meaning)

Practitioner's point of view (bottom-up): defining the operational meaning -> scale definition -> specification within a test theory -> index: defining the influencing factors -> index: generic formula
(Table: the bottom-up development of NDR compared with CTT.)
CTT — index: r_tt = s_w^2 / (s_w^2 + s_e^2); r = sum_i (x_i - x̄)(y_i - ȳ) / (N * s_x * s_y); scale correction: none; operational meaning: association, P-R-E, SEDTTS
NDR — index: D_R = R / k, with k = f(measurement error, score range, probability), D = 1.96 * s_x * sqrt(2 * (1 - r_tt)), and R approx. 6 * s_x