UvA-DARE is a service provided by the library of the University of Amsterdam (http://dare.uva.nl)
UvA-DARE (Digital Academic Repository)
Math Garden: A new educational and scientific instrument
Straatemeier, M.
Link to publication
Citation for published version (APA): Straatemeier, M. (2014). Math Garden: A new educational and scientific instrument.
General rights: It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).
Disclaimer/Complaints regulations: If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.
Download date: 12 Feb 2020
Chapter 2
Computer adaptive practice of math ability using a new item response model for on the fly ability and difficulty estimation
This chapter has been published as:
Klinkenberg, S., Straatemeier, M., & Van der Maas, H. L. J. (2011). Computer adaptive
practice of maths ability using a new item response model for on the fly ability and
difficulty estimation. Computers & Education, 57, 1813-1824.
Abstract
In this chapter we present a model for computerized adaptive practice and monitoring. This model is used in Math Garden, a web-based monitoring system, which includes a challenging web environment for children to practice arithmetic. Using a new item response model based on the Elo (1978) rating system and an explicit scoring rule, estimates of the ability of persons and the difficulty of items are updated with every answered item, allowing for on the fly item calibration. In the scoring rule both accuracy and response time are accounted for. Items are sampled with a mean success probability of .75, making the tasks challenging yet not too difficult. In a period of ten months our sample of 3648 children completed over 3.5 million arithmetic problems. The children completed about 33% of these problems outside school hours. Results show better measurement precision, high validity and reliability, high pupil satisfaction, and many interesting options for monitoring progress, diagnosing errors and analyzing development.
Introduction
In this chapter we present a computerized adaptive practice (CAP) system for monitoring arithmetic in primary education: Math Garden. Math Garden is a web-based computer adaptive practice and monitoring system based on weekly measurements. In recent years math abilities of Dutch students have been widely debated. This is mainly due to the results of the National Periodical Education Polls (PPON). These results show that few children reach the required math level at the end of their primary education (Kraemer, Janssen, Van der Schoot, & Hemker, 2005). Based on these findings a parliamentary inquiry into Dutch education was initiated. Both the committee “Dijsselbloem” (2008) and the expert group “Doorlopende Leerlijnen” (2008) recommended several improvements to the Dutch education system in general and math education in particular. Recommendations included the provision of more time to practice and maintain basic math skills, more efficient and effective measurement in education, and the use of these measurement results to improve the ability of individual students, the classroom and education in general. These recommendations are also supported by Fullan (2006), who claimed that acting on data is critical for learning from experience.
Combining practice and measurement
In the light of these recommendations we propose to combine practice and measurement in a playful manner using computerized educational games. We expect that in the near future children will increasingly use mini computers and handheld devices to do their daily exercises in arithmetic, spelling, and other subjects. The use of computers has two main advantages. First, the input can be analyzed automatically and feedback can be given immediately, which will free teachers from checking and correcting the children’s exercise books. The recorded and automatically analyzed data can provide teachers with detailed information on children’s progress and the errors they make. Teachers can use this information to optimize individual instruction. The information concerning the child’s progress and abilities, which is accumulated over time, may ultimately obviate the need to conduct tests and examinations. Second, by using computers it is possible to let children practice at their individual ability level. Research on the development of expert performance has shown that people do improve their performance considerably if they regularly do specific exercises that are adjusted to their ability level and include immediate feedback. In the development of Math Garden we follow these ideas developed in sports and expertise training, especially the idea of deliberate practice (Ericsson, 2006, pp. 683–703).
Three problems of CAT
To implement individualized practice, we apply the technique of computer adaptive testing (Van der Linden & Glas, 2000; Wainer, 2000). Computer adaptive testing (CAT) is based on item response theory (IRT). This theory consists of statistical models that relate item responses to the (latent) abilities that the items measure (Lord & Novick, 1968). A large collection of item response models is available, but these are all basically variations on the simplest model, i.e., the one-parameter logistic (1PL) model or Rasch model (Rasch, 1960). In the Rasch model the probability of a correct or affirmative answer is a logistic function of the difference between the ability of the subject and the difficulty of the item. In the two-parameter logistic model, the difference is weighted by an item discrimination parameter, which has a high value when an item discriminates well between low and high ability subjects. Item response models can be used for equating tests, to detect and study differential item functioning (bias), and to develop computer adaptive tests (Van der Linden & Hambleton, 1997). The idea of CAT is to determine the ability level of a person dynamically. In CAT, item administration depends on the subject’s previous responses. If the preceding item is answered correctly (incorrectly), a more (less) difficult item is presented. Hence, each person is presented a test tailored to his or her ability. Using CAT, test length can be shortened up to 50% (Eggen & Verschoor, 2006). Originally, CAT was developed for measurement only. Our aim to combine practice and measurement raises several novel issues. We distinguish the following three issues.
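As a minimal illustration of the two models just described, the following Python sketch evaluates the logistic response function. The function name `p_correct` and the example values are assumptions made for illustration and do not come from the study; setting the discrimination `a` to 1 gives the Rasch (1PL) model.

```python
import math

def p_correct(theta, beta, a=1.0):
    """Probability of a correct answer under the 2PL model.

    theta: ability of the subject
    beta:  difficulty of the item
    a:     item discrimination (a = 1 reduces to the Rasch model)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - beta)))

# Ability equal to difficulty gives a success probability of .5.
print(p_correct(0.0, 0.0))

# A subject one unit above the item difficulty, Rasch vs. a more
# discriminating item (a = 2): the same ability gap counts for more.
print(p_correct(1.0, 0.0))
print(p_correct(1.0, 0.0, a=2.0))
```

A higher discrimination makes the curve steeper around the item difficulty, which is what allows such an item to separate low and high ability subjects more sharply.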
First, in standard CAT the parameters of the items, especially the difficulty, have to be known in advance to test adaptively. Items therefore have to be “pre-calibrated” before they can be used in real test situations. This means that a large representative sample of the population has to have answered the items in the item bank to provide the information for item calibration. The difficulty of the items is determined using the data of this sample. This method is obviously time-consuming and costly, especially as the calibration has to be carried out repeatedly (e.g., every few years) to acquire accurate norm referenced item parameters.
Second, CAT operates most effectively if the difficulty level of administered items equals
the ability estimate of the person. The probability of answering such items correctly is .5.
However, for most children and many adults the success rate associated with a .5 probability
is experienced as discouraging. Research by Eggen and Verschoor (2006) showed that
increasing this probability to above .7 greatly reduces measurement precision. Given a .7
probability, more items need to be administered to obtain an accurate estimate of person
ability. This requirement reduces the efficiency of computer adaptive testing.
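One way to see why easier items are less efficient, assuming the Rasch model, is through the item information p(1 − p), which is largest when the success probability p equals .5. The sketch below is a rough illustration under that assumption only; it is not the analysis of Eggen and Verschoor (2006), and the helper names are hypothetical.

```python
def rasch_item_information(p):
    """Fisher information of one Rasch item answered correctly with probability p."""
    return p * (1.0 - p)

def relative_test_length(p):
    """Number of items needed at success probability p to match the
    information of one item at p = .5 (assumes equal information per item)."""
    return rasch_item_information(0.5) / rasch_item_information(p)

for p in (0.5, 0.7, 0.75, 0.9):
    print(p, round(rasch_item_information(p), 4), round(relative_test_length(p), 2))
```

Under this assumption the loss grows as the target probability moves further away from .5, so that a test built from easier items has to be longer to reach the same precision.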
The third problem concerns a testing problem that applies to psychological and educational
measurement in general, namely, the trade-off between speed and accuracy. Without
explicit instructions, participants in tests and experiments are free to balance speed and
accuracy as they wish. Consequently the trade-off between speed and accuracy can be
a source of large individual differences. The current solution in psychometrics (Van der
Linden, 2007) and experimental psychology (Ratcliff & Rouder, 1998; Vandekerckhove &
Tuerlinckx, 2008) is to estimate person parameters involved in this trade-off on the basis of
the data. However, this procedure requires large amounts of high quality data.
New CAT
We developed an extended CAT approach to solve these problems. This Computer Adaptive Practice (CAP) system provides the basis of Math Garden. The CAP system includes the following two innovations. First, we have applied a new estimation method based on the Elo (1978) rating system (ERS) developed for chess competitions. The ERS allows for on the fly estimation of item difficulty and person ability parameters. With this method, pre-testing is no longer required. Second, we have used an explicit scoring rule for speed and accuracy, which is known to the subject during the test. Inclusion of speed in the scoring has the advantage that we acquire more information about ability. Research by Van der Maas and Wagenmakers (2005) showed that in the responses to easy chess items there is a strong negative relation between response time and ability. Subjects tend to answer easy items correctly, but more advanced subjects answer them more quickly. Moreover, by integrating response time into the estimation of ability, we can decrease the difficulty of administered items with less loss of measurement precision than noted by Eggen and Verschoor (2006). In addition we expect the higher success rate to increase the motivation of children during the test. In the Method section we describe Math Garden, the Elo algorithm and the new scoring rule in more detail. In the results section of this chapter we test the working of Math Garden. We present evidence for high validity and reliability of ability and difficulty estimation, the motivational value of Math Garden, and its usefulness as a diagnostic and monitoring instrument.
Methods
Participants
A total of 35 primary schools, eight remedial teachers and 32 families participated in this study, comprising N = 3648 active participants. Also 334 aspiring kindergarten pupils joined Math Garden. In the time period from August 2008 to early June 2009 more than 3.5 million arithmetic problems were answered in our sample. In addition to the responses, we registered the gender, age and grade of the participants. Table 1 shows the mean age with standard deviation and the number of children for each grade.
Materials
The main measurement tool used in this study is the web-based practice and monitoring
system we developed: Math Garden. The student interface consists of a garden containing
distinct flowerbeds, representing, among others, the four domains: addition, subtraction,
multiplication, and division (Figure 1a) on which we focus in this chapter. The size of the
flowers represents the math ability of the student. By clicking on a flowerbed the math
game is started for a specific domain and the student can start playing.
Table 1 Age, gender and N per grade
Grade                 Age category   N     Mean age (SD)   % Male (Female)
Kindergarten          4-5            103   4.32 (0.51)     50.49 (49.51)
Kindergarten          5-6            231   5.45 (0.51)     47.19 (52.81)
1                     6-7            529   6.61 (0.51)     53.50 (46.50)
2                     7-8            681   7.69 (0.54)     55.21 (44.79)
3                     8-9            526   8.68 (0.79)     47.91 (52.09)
4                     9-10           513   9.70 (0.61)     47.24 (52.76)
5                     10-11          574   10.79 (0.60)    49.48 (50.52)
6                     11-12          416   11.80 (0.57)    50 (50)
Secondary education   > 12           75    13.33 (3.94)    64 (36)
The visual interface of the math task consists of a math question, six answer options (addition and subtraction) or a number pad (multiplication and division), a coin bag, a question mark, a stop sign, and an elapsing coin bar that indicates the time left on the item (Figure 1b). The game rules are intuitive and therefore only require minimal explanation on the website. Students gain points (coins), displayed at the bottom of the game interface, by answering items correctly and lose coins when answering incorrectly. With each item a total number of twenty coins, corresponding to the maximum time in seconds, can be won or lost. Every second one coin disappears. The remaining coins are added to the coin bag if the item has been solved correctly and are subtracted if solved incorrectly. If the time limit has expired or the question mark has been clicked, no coins are lost or won. The rationale of this scoring rule is explained in the psychometrics section. A session consists of fifteen items after which the math game terminates and the student is returned to his or her garden. The flowers will start growing according to the progression that has been made. Students are motivated by two reward systems. Good performance is rewarded by growing flowers and virtual coins.¹ The Math Garden website contains a dedicated area, the prize cabinet, where virtual prizes can be bought with the earned coins. Another way students are motivated to continue playing in Math Garden is that the flowerbeds wither if the student does not play. Withering worsens over time and can only be undone by completing a new session of 15 items.
Figure 1 The main Math Garden interface (a) and an addition item (b).
The four domains, addition, subtraction, multiplication, and division, contain 738, 723, 659, and 664 items, respectively. The items in the four domains cover the curriculum in primary education. They vary from easy (e.g., 3 + 4 with response options: 7, 8, 6, 1, 9 and 12) to difficult (e.g., 7.34 + 311.4 with response options: 318.74; 318.38; 318.47; 317.74; 319.74 and 318.34). The response options are selected to be informative distracters. Open-ended items were used in the games multiplication and division. Variables measured by the task are response time, the given answer, the correctness (0, 1) and a timestamp at administration.
¹ Because of the adaptive nature of the test, every student has roughly the same percentage correct. Hence the number of coins won reflects only how often a student plays and not his or her arithmetic level.
We studied the validity of the data by comparing the ability estimate, measured with the
Math Garden, with students’ scores on the math tests from the pupil monitoring system
(Janssen & Engelen, 2002) of the National Institute for Educational Measurements (Cito).
In the Cito monitoring system math tests are administered twice a year from mid grade 1
until mid grade 6. These tests assess the knowledge and skills that are being taught in these
grades. The tests contain both open-ended and forced-choice items. Students’ scores on the
math test (the total of correct answers) are transformed to a score on a norm-referenced
general math ability scale. This allows one to compare students’ scores from different
grades using one scale.
Psychometrics
Elo rating system
In chess the Elo (1978) rating system (ERS) is used to estimate the relative ability of a player. The ERS is a dynamic paired comparison model, which is mathematically closely related to the Rasch IRT model (Batchelder & Bershad, 1979). Initially chess players are given a provisional ability rating θ, which is incrementally updated (see equation (1)) based on match results (in chess 0, .5 and 1, for loss, draw and win outcomes). The updated ability estimate $\hat{\theta}$ (signified by the hat) depends on the weighted difference in match result S and expected match result E(S):

$$\hat{\theta}_j = \theta_j + K\,(S_j - E(S_j)), \qquad \hat{\theta}_k = \theta_k + K\,(S_k - E(S_k)) \tag{1}$$

The expected match result is a function of the difference between the ability estimates of both player j and k preceding the match and expresses the probability of winning (see equation (2)):

$$E(S_j) = \frac{1}{1 + 10^{(\theta_k - \theta_j)/400}} \tag{2}$$

The K factor in equation (1) weights the impact of the deviation from expectation on the new ability estimate. This value essentially determines the rate at which θ can change over matches. In the standard ERS the K factor is constant. Glickman (1995) argued that not all ability ratings are estimated accurately by the ERS update function (eq. 1). Inaccuracies mostly occur when players are new or have not played for an extended period of time, resulting in much uncertainty in their ability rating θ. Glickman proposed to let the K factor reflect the uncertainty in ability estimates by making it a function of time and playing frequency. If there is little uncertainty, the K factor for recent and frequent players will be low. If there is much uncertainty the K factor will be high.
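The standard Elo update with a constant K factor can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the value K = 32 are assumptions chosen for the example and are not taken from this chapter.

```python
def expected_score(theta_j, theta_k):
    """Expected match result for player j against player k (eq. 2)."""
    return 1.0 / (1.0 + 10 ** ((theta_k - theta_j) / 400.0))

def elo_update(theta_j, theta_k, s_j, k=32.0):
    """Update both ratings after one match (eq. 1).

    s_j is the match result for player j: 0 (loss), .5 (draw) or 1 (win).
    k = 32 is an arbitrary illustrative constant, not a value from the text.
    """
    e_j = expected_score(theta_j, theta_k)
    s_k, e_k = 1.0 - s_j, 1.0 - e_j  # results of the two players sum to 1
    return theta_j + k * (s_j - e_j), theta_k + k * (s_k - e_k)

# Two equally rated players: the winner gains what the loser gives up.
print(elo_update(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)

# An unexpected win against a stronger opponent moves the ratings further.
print(elo_update(1400.0, 1600.0, 1.0))
```

Because the update is proportional to the difference between result and expectation, an expected outcome barely changes the ratings, whereas a surprising outcome changes them considerably.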
Computer adaptive practice Our suggestion for creating an on the fly item calibrating and computer adaptive prac-
tice (CAP) system is to replace one player in the Elo system by an item.24 Solving an
item correctly is interpreted as winning the match against the item. The updating func-
tion in equation (1) can be rewritten to equation (3) for updating player and item ratings:
where βi is the difficulty estimate of the item and S
ij and E(S
ij) are the score and expected
probability of winning for person j on item i. Following Glickman, the K factor in our
CAP system is a function of the rating uncertainty U of the player and the item (eq. 4):
where K = 0.0075 is the default value when there is no uncertainty and K+ = 4 and K
- = 0.5
are the weights for the rating uncertainty for person j and item i. These values determine
the rate at which θ and β can change following each item response. These values have been
determined through extensive simulations.
The uncertainty U depends on both recency and frequency. Equation (5) combines these
opposite effects on uncertainty. We apply the same equation to items and players, with pro-
visional uncertainty of U = 1 and 0 ≤ U ≤ 1:
We assume that uncertainty for players and items decreases after every administration and
increases with time. Therefore uncertainty reduces to zero after 40 administrations and
conversely increases to the maximum of 1 after 30 days D of not playing.
24 This approach has, for many years, successfully been applied in an online chess testing system on the Chess Tac-tics Server (http://chess.emrald.net)
24
(2)
The K factor in equation (1) weights the impact of the deviation from expectation on the new ability
estimate. This value essentially determines the rate at which θ can change over matches. In the
standard ERS the K factor is constant. Glickman (1995) argued that not all ability ratings are
estimated accurately by the ERS update function (eq. 1). Inaccuracies mostly occur when players are
new or have not played for an extended period of time, resulting in much uncertainty in their ability
rating θ. Glickman proposed to let the K factor reflect the uncertainty in ability estimates by making it
a function of time and playing frequency. If there is little uncertainty, the K factor for recent and
frequent players will be low. If there is much uncertainty the K factor will be high.
Computer adaptive practice. Our suggestion for creating an on the fly item calibrating and
computer adaptive practice (CAP) system is to replace one player in the Elo system by an item.2
Solving an item correctly is interpreted as winning the match against the item. The updating function
in equation (1) can be rewritten to equation (3) for updating player and item ratings:
(3)
where βi is the difficulty estimate of the item and Sij and E(Sij) are the score and expected probability
of winning for person j on item i. Following Glickman, the K factor in our CAP system is a function of
the rating uncertainty U of the player and the item (eq. 4):
K j = K(1+K+Uj −K−Ui )Ki = K(1+K+Ui −K−Uj )
(4)
where K = 0.0075 is the default value when there is no uncertainty and K+ = 4 and K-‐ = 0.5 are the
weights for the rating uncertainty for person j and item i. These values determine the rate at which θ
and β can change following each item response. These values have been determined through
extensive simulations.
The uncertainty U depends on both recency and frequency. Equation (5) combines these
opposite effects on uncertainty. We apply the same equation to items and players, with provisional
uncertainty of U = 1 and 0 ≤ U ≤ 1:
U =U −140
+130
D (5)
2 This approach has, for many years, successfully been applied in an online chess testing system on the Chess Tactics Server (http://chess.emrald.net)
E(Sj ) =1
1+10(θk−θ j )/400
θ j =θ j +K j (Sij −E(Sij ))
βi = βi +Ki (E(Sij )− Sij )
24
(2)
The K factor in equation (1) weights the impact of the deviation from expectation on the new ability
estimate. This value essentially determines the rate at which θ can change over matches. In the
standard ERS the K factor is constant. Glickman (1995) argued that not all ability ratings are
estimated accurately by the ERS update function (eq. 1). Inaccuracies mostly occur when players are
new or have not played for an extended period of time, resulting in much uncertainty in their ability
rating θ. Glickman proposed to let the K factor reflect the uncertainty in ability estimates by making it
a function of time and playing frequency. If there is little uncertainty, the K factor for recent and
frequent players will be low. If there is much uncertainty the K factor will be high.
Computer adaptive practice. Our suggestion for creating an on the fly item calibrating and
computer adaptive practice (CAP) system is to replace one player in the Elo system by an item.2
Solving an item correctly is interpreted as winning the match against the item. The updating function
in equation (1) can be rewritten to equation (3) for updating player and item ratings:
(3)
where βi is the difficulty estimate of the item and Sij and E(Sij) are the score and expected probability
of winning for person j on item i. Following Glickman, the K factor in our CAP system is a function of
the rating uncertainty U of the player and the item (eq. 4):
K j = K(1+K+Uj −K−Ui )Ki = K(1+K+Ui −K−Uj )
(4)
where K = 0.0075 is the default value when there is no uncertainty and K+ = 4 and K-‐ = 0.5 are the
weights for the rating uncertainty for person j and item i. These values determine the rate at which θ
and β can change following each item response. These values have been determined through
extensive simulations.
The uncertainty U depends on both recency and frequency. Equation (5) combines these
opposite effects on uncertainty. We apply the same equation to items and players, with provisional
uncertainty of U = 1 and 0 ≤ U ≤ 1:
U =U −140
+130
D (5)
2 This approach has, for many years, successfully been applied in an online chess testing system on the Chess Tactics Server (http://chess.emrald.net)
E(Sj ) =1
1+10(θk−θ j )/400
θ j =θ j +K j (Sij −E(Sij ))
βi = βi +Ki (E(Sij )− Sij )
24
(2)
The K factor in equation (1) weights the impact of the deviation from expectation on the new ability
estimate. This value essentially determines the rate at which θ can change over matches. In the
standard ERS the K factor is constant. Glickman (1995) argued that not all ability ratings are
estimated accurately by the ERS update function (eq. 1). Inaccuracies mostly occur when players are
new or have not played for an extended period of time, resulting in much uncertainty in their ability
rating θ. Glickman proposed to let the K factor reflect the uncertainty in ability estimates by making it
a function of time and playing frequency. If there is little uncertainty, the K factor for recent and
frequent players will be low. If there is much uncertainty the K factor will be high.
Computer adaptive practice. Our suggestion for creating an on-the-fly item calibrating and
computer adaptive practice (CAP) system is to replace one player in the Elo system by an item.²
Solving an item correctly is interpreted as winning the match against the item. The updating function
in equation (1) can be rewritten to equation (3) for updating player and item ratings:

$$\hat{\theta}_j = \theta_j + K_j \,(S_{ij} - E(S_{ij})), \qquad \hat{\beta}_i = \beta_i + K_i \,(E(S_{ij}) - S_{ij}) \tag{3}$$

where βi is the difficulty estimate of the item and Sij and E(Sij) are the score and expected probability
of winning for person j on item i. Following Glickman, the K factor in our CAP system is a function of
the rating uncertainty U of the player and the item (eq. 4):

$$K_j = K \,(1 + K_{+} U_j - K_{-} U_i), \qquad K_i = K \,(1 + K_{+} U_i - K_{-} U_j) \tag{4}$$

where K = 0.0075 is the default value when there is no uncertainty and K+ = 4 and K− = 0.5 are the
weights for the rating uncertainty for person j and item i. These values determine the rate at which θ
and β can change following each item response. These values have been determined through
extensive simulations.
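The update in equations (3) and (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Math Garden implementation: the function names `k_factors` and `cap_update` are hypothetical, and the score and expected score are passed in as plain numbers.

```python
def k_factors(u_player, u_item, k=0.0075, k_plus=4.0, k_minus=0.5):
    """Return (K_j, K_i) as in eq. 4; defaults are the values reported in the text."""
    k_j = k * (1 + k_plus * u_player - k_minus * u_item)
    k_i = k * (1 + k_plus * u_item - k_minus * u_player)
    return k_j, k_i


def cap_update(theta, beta, score, expected, u_player, u_item):
    """Apply eq. 3: the player and the item move in opposite directions."""
    k_j, k_i = k_factors(u_player, u_item)
    new_theta = theta + k_j * (score - expected)
    new_beta = beta + k_i * (expected - score)
    return new_theta, new_beta


# Hypothetical example: a new player (U = 1) beats expectation on a
# well-calibrated item (U = 0), so the player moves much more than the item.
print(cap_update(theta=0.0, beta=0.0, score=1.0, expected=0.5,
                 u_player=1.0, u_item=0.0))
```

With these weights an uncertain player paired with a certain item gets a K factor five times the default, while the item's K factor is halved.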
We assume that uncertainty for players and items decreases after every administration and increases
with time. Therefore uncertainty reduces to zero after 40 administrations and conversely increases
to the maximum of 1 after 30 days D of not playing.
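The uncertainty update of equation (5) can be illustrated as follows. This is a sketch only: the function name `update_uncertainty` is hypothetical, and clamping the result to the interval [0, 1] is an assumption about how the stated bound 0 ≤ U ≤ 1 is enforced.

```python
def update_uncertainty(u, days_since_last_play=0.0):
    """One application of eq. 5 after a single administration.

    Each administration lowers U by 1/40; each day D without play raises it
    by 1/30. The result is clamped to [0, 1] (assumed handling of the bound).
    """
    new_u = u - 1.0 / 40.0 + days_since_last_play / 30.0
    return min(1.0, max(0.0, new_u))


# Starting from the provisional value U = 1, forty administrations on the
# same day bring the uncertainty down to (numerically) zero.
u = 1.0
for _ in range(40):
    u = update_uncertainty(u)
print(u)
```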
High speed, high stakes. We incorporate speed by using the scoring rule (shown in eq. 6) for
speed and accuracy, which we call the high speed, high stakes (HSHS) scoring rule (Maris & Van der
Maas, 2012). This rule imposes a speed-accuracy trade-off setting on the individual. Player j has to
give response xij in time tij before the time limit di for item i. The score Sij is scaled by the
discrimination parameter ai:

$$S_{ij} = (2x_{ij} - 1)(a_i d_i - a_i t_{ij}) \tag{6}$$

In this scoring rule the stakes are high when the subject responds quickly. In case of a correct answer
(xij = 1) the score equals the remaining time, scaled by ai. In case of an incorrect answer (xij = 0) the
scaled remaining time is multiplied by −1. Thus a quick incorrect answer leads to a large negative score.
This scoring rule is depicted in Figure 2. The scoring rule is expected to minimize guessing by encouraging
deliberate and thoughtful responses.
Figure 2. High speed, high stakes scoring rule.
[Figure 2 plots score (y-axis, from −aidi to +aidi) against response time t (x-axis, from 0 to di), with one line for correct and one line for incorrect responses.]
Maris & Van der Maas (2012) derived an IRT model that conforms to the HSHS scoring rule.
The expected score (eq. 7) can be inferred from this model. E(Sij) is based on the ability estimate of
the person θj, the difficulty estimate of the item βi, the time limit di and discrimination parameter ai
for that item. In Math Garden, we set ai = 1/di, such that the effective discrimination equals that of
the 1PL model:

$$E(S_{ij}) = a_i d_i \, \frac{e^{2 a_i d_i (\theta_j - \beta_i)} + 1}{e^{2 a_i d_i (\theta_j - \beta_i)} - 1} - \frac{1}{\theta_j - \beta_i} \tag{7}$$
We use the HSHS score Sij (eq. 6) and the corresponding expected score E(Sij) (eq. 7) in our modified
Elo update function (eq. 3).
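Equations (6) and (7) can be sketched as two small functions. This is an illustration under assumptions rather than the authors' code: the names `hshs_score` and `hshs_expected` are hypothetical, the default ai = 1/di follows the Math Garden setting mentioned above, and the expected score is evaluated through the identity (e^{2x} + 1)/(e^{2x} − 1) = 1/tanh(x), with a first-order series used near θj = βi, where eq. 7 has a removable singularity.

```python
import math


def hshs_score(correct, t, d, a=None):
    """Eq. 6: signed, scaled remaining time. `a` defaults to 1/d."""
    if a is None:
        a = 1.0 / d
    return (2 * int(correct) - 1) * (a * d - a * t)


def hshs_expected(theta, beta, d, a=None):
    """Eq. 7, written as a*d / tanh(a*d*(theta - beta)) - 1/(theta - beta)."""
    if a is None:
        a = 1.0 / d
    ad = a * d
    x = ad * (theta - beta)
    if abs(x) < 1e-4:
        # Series expansion around theta = beta; the limit there is 0.
        return ad * x / 3.0
    return ad / math.tanh(x) - 1.0 / (theta - beta)


# A correct answer after 5 of 20 seconds, and the score expected for a
# player one unit above the item difficulty (hypothetical values).
print(hshs_score(True, t=5, d=20), hshs_expected(1.0, 0.0, d=20))
```

With ai = 1/di the observed score lies between −1 and 1, and the expected score stays strictly inside that range.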
Item selection. Items are selected for which the mean probability of answering correctly is
about .75. Repetition of the same items is restricted by ensuring that items are reused only after 20
other items have been answered. A new target βt is selected by using:

$$\beta_t = \theta_j - \ln\frac{P}{1 - P} \tag{8}$$

where probability P is randomly drawn from a normal distribution P ∼ N(.75, .1) and restricted such
that .5 < P < 1. For administration the nearest available item is selected by: min_i |βi − βt|.
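The selection rule around equation (8) can be sketched as below. Several details are assumptions, not statements from the text: .1 is treated as the standard deviation of the normal distribution, the restriction .5 < P < 1 is implemented by redrawing, and the 20-item reuse rule is modelled with a fixed-length queue. All function names are hypothetical.

```python
import math
import random
from collections import deque


def draw_target_probability(rng, mean=0.75, sd=0.1, low=0.5, high=1.0):
    """Draw P ~ N(mean, sd), redrawing until low < P < high (assumed handling)."""
    while True:
        p = rng.gauss(mean, sd)
        if low < p < high:
            return p


def target_difficulty(theta, p):
    """Eq. 8: the difficulty at which success probability p is expected."""
    return theta - math.log(p / (1.0 - p))


def select_item(theta, difficulties, recent, rng):
    """Pick the available item nearest to the target difficulty.

    `difficulties` maps item id to beta; `recent` is a deque(maxlen=20) of the
    last administered items. The bank must hold more than 20 items.
    """
    beta_t = target_difficulty(theta, draw_target_probability(rng))
    candidates = [item for item in difficulties if item not in recent]
    chosen = min(candidates, key=lambda item: abs(difficulties[item] - beta_t))
    recent.append(chosen)
    return chosen


# Hypothetical item bank of 41 items with difficulties from -2 to 2.
bank = {i: i / 10.0 - 2.0 for i in range(41)}
recent_items = deque(maxlen=20)
print(select_item(0.0, bank, recent_items, random.Random(1)))
```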
Procedure
Although Math Garden started out as a pilot project, only available to a limited number of
schools in the Netherlands, the website later on became available for a larger audience. In the pilot
period the students received a login account and an instruction from their teacher. In this instruction,
teachers explained the scoring rule of the games and students were told that they could click on the
question mark if they did not know the answer. After this, students could start playing on their own.
Teachers were told that the first two sessions should be played at school. After this, students were
also allowed to play at home, but they were instructed to play by themselves. After the pilot period
the Math Garden also became available to remedial teachers and families. The remedial teachers and
families were not instructed on the frequency of playing. The manuals on how to use the Math
Garden were all available on the website but the scoring rule of the games was not explicitly
explained to the children.
Results
Measurement precision
To test whether the incorporation of response time in the estimation of ability allows us to
lower the difficulty of administered items with less loss of measurement precision, we conducted a
simulation study. We compared our results to those of Eggen and Verschoor (2006). In a simulation
study, Eggen and Verschoor showed³ an increasing (negative) bias (Figure 3: left) and a drop in
measurement precision (Figure 3: right) when selecting easy items in a standard CAT using the
weighted maximum likelihood estimator (WML) and the one-parameter logistic (1PL) model. Average
bias was computed by $\frac{1}{n}\sum_i (\hat{\theta}_i - \theta_i)$ and measurement precision was quantified by calculating the
mean standard error of estimation $se(\hat{\theta})$ using the information function for the 1PL model.
3 Table 1 in Eggen and Verschoor (2006).
Figure 3 BIAS and SE for different computer adaptive methods at different values of expected
probability correct.
[Figure 3 panels: left, mean BIAS against expected probability correct (Eggen et al., Elo + 1PL, Elo + HSHS); right, SE against expected probability correct (Eggen et al., Elo + 1PL, Elo + HSHS, Max Info.).]
In our simulation we used the Elo update function to estimate ability and difficulty,
utilizing: a) accuracy data with the 1PL model and b) accuracy and response time data
using the HSHS model. As in the study by Eggen and Verschoor (2006), our item bank
consisted of 300 items with normally distributed β ~ N(0,1) difficulties and we also sampled
4000 abilities from a normal distribution θ ~ N(0,1). The CAP algorithm starts with an
item of intermediate difficulty −0.5 < β < 0.5 and terminates after 40 items. As a starting
point for ability we selected a random ability from a normal distribution β ~ N(0,1). We
compared our Elo based HSHS model, at different desired success probabilities, to Eggen
and Verschoor’s 1PL model using standard CAT. Eggen and Verschoor investigated success
probabilities up to .75. With regard to bias it can be concluded that the Elo estimation
method performs slightly worse with accuracy data only (Figure 3: left: Elo + 1PL),
but outperforms Eggen and Verschoor’s standard CAT method when response times are included
(Figure 3: left: Elo + HSHS). With regard to the standard error of estimation we also compared
our two Elo methods to the theoretical maximum information for the 1PL model.
We calculated the maximum information (Figure 3: right: Max Info.) with equation (9):

$$se(\hat{\theta}) = \frac{1}{\sqrt{N a_i^2 P_i(\theta)\,(1 - P_i(\theta))}} \tag{9}$$

where ai = 1 is the discrimination parameter, N = 40 is the number of items, and Pi(θ) is
the desired probability correct. As can be seen in Figure 3 (right: Max info.), when the
probability of answering correctly assumes large values (x-axis), the theoretical minimum
SE (eq. 9) for the 1PL model increases sharply (y-axis). For the standard error of estimation
(Figure 3: right) we calculated the standard deviation of the difference in simulated
abilities θ and estimated abilities θ̂. This method of calculation is simpler, yet comparable
with the procedure used by Eggen and Verschoor (2006) for calculating the standard error
of estimation.
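To make the quantities in this paragraph concrete, the sketch below evaluates equation (9) and the two summary statistics described in the text. The function names are hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption, since the text does not specify it.

```python
import math
import statistics


def min_se_1pl(p, n_items=40, a=1.0):
    """Eq. 9: theoretical minimum SE when all n_items have success probability p."""
    return 1.0 / math.sqrt(n_items * a ** 2 * p * (1.0 - p))


def mean_bias(estimated, simulated):
    """Average of (estimated - simulated) abilities."""
    return sum(e - s for e, s in zip(estimated, simulated)) / len(simulated)


def sd_of_errors(estimated, simulated):
    """Standard deviation of (estimated - simulated); sample SD is assumed."""
    return statistics.stdev(e - s for e, s in zip(estimated, simulated))


# The minimum SE grows as the targeted probability correct moves away from .5.
for p in (0.5, 0.6, 0.75, 0.9):
    print(p, round(min_se_1pl(p), 3))
```

For N = 40 and a = 1 this gives roughly 0.316 at P = .5, 0.365 at P = .75, and 0.527 at P = .9.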
The SE of the Elo estimation method using only accuracy data (Figure 3: right:
Elo + 1PL) is largest for almost all probability levels. This is to be expected as this method
is statistically inferior to the WML method used by Eggen and Verschoor (2006). Up to
the probability level of about .69 the SE using the HSHS Elo method (Figure 3: right:
Elo + HSHS) is larger than the SE found in the Eggen and Verschoor simulation. However,
at higher probability levels, especially compared to our target of .75, the SE is considerably
lower. At probability levels higher than about .78 the SE even drops below the theoretical
minimum SE implied by the maximum information (Figure 3: right: Max info.) for the 1PL model.
This demonstrates that incorporating response times results in much better measurement precision
when using easy items.
Validity
To assess the validity of the Math Garden measurements, the ratings of the students were
compared to their scores on the norm-referenced general math ability scale of the pupil
monitoring systems of Cito (Janssen & Engelen, 2002). The correlations between these
two measures, which serve as a measure of convergent validity, ranged from .78 to .84 for
the four domains addition, subtraction, multiplication and division. These correlations were
based on a subset of our sample. Cito scores were available for N = 964 participants. To
put these correlations into perspective we looked at the correlation between two subsequent
Cito scores. The correlation between the Cito mid-year and end-of-year scores for 2007–2008 was
.95. This indicates that our correlations can be considered fairly high. Figure 4 displays the
relation between test scores. The numbers indicate the regression line for each grade.
Figure 4 Correlation between Math Garden rating for the domains addition, subtraction,
multiplication, and division and the norm-referenced Cito scores (mid 2008). Included are regression
lines for each grade (Dutch grades) indicated by grade numbers.
We also studied the validity by comparing the mean ability ratings of children in different
grades. We expected a positive relation between grade and ability. Figure 5 shows the
average ability rating for each grade and domain. As expected, children in older age groups
had a higher rating than children in younger age groups. In all four domains, there is an
overall significant effect of grade: addition, F(5, 1456) = 1091.4, p < .01, ω² = .78;
subtraction, F(5, 1363) = 780.5, p < .01, ω² = .74; multiplication, F(5, 1215) = 409.6, p < .01,
ω² = .62; and division, F(5, 973) = 223.31, p < .01, ω² = .53. Levene’s tests show differences
in variances for the domains multiplication and division. However, the non-parametric
Kruskal–Wallis tests also show significant differences for these domains: χ²(5) = 753.28,
p < .01 for multiplication and χ²(5) = 505.17, p < .01 for division. For all domains, post
hoc analyses show significant differences between all grades, except for the differences
between grades five and six.
[Figure 4 panels: Cito norm score (y-axis, 0 to 150) against Math Garden rating (x-axis) for addition (r = .83), subtraction (r = .84), multiplication (r = .80), and division (r = .78), with regression lines labelled by grade numbers 4 to 8.]
Figure 5 Rating per domain and age group.
Reliability
One way to assess reliability is to compare children's ratings across domains. Since the domains involve related operations, we expect high correlations between them. The correlations between the four domains (addition, subtraction, multiplication, and division) vary between .67 and .88, all significant at p < .01, indicating fairly high correlations. Another relatively simple way to assess the reliability of Math Garden is to construct parallel tests. We can compare the item difficulties β of so-called mirrored items (e.g., 7 + 4 and 4 + 7, or 2 × 4 and 4 × 2) for the domains addition (N = 48) and multiplication (N = 81). Mirrored items should have very similar β's. Figure 6 shows the correlation of the mirrored item β's for the domains addition and multiplication. These correlations are .88 and .98 (p < .01), respectively, indicating a high reliability of these item sets.
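As an illustration only, the mirrored-item comparison described above can be sketched in Python. The function names and the data layout (operand pairs mapped to difficulty ratings β) are assumptions made for this sketch and do not describe the Math Garden implementation; the ratings in the usage lines are invented.

```python
import math


def mirrored_pairs(betas):
    """Collect (beta of m op n, beta of n op m) for every mirrored pair.

    betas: dict mapping an operand tuple (m, n) to a difficulty rating.
    Ties such as (3, 3) have no distinct mirror and are skipped.
    """
    pairs = []
    for (m, n), beta in sorted(betas.items()):
        if m < n and (n, m) in betas:
            pairs.append((beta, betas[(n, m)]))
    return pairs


def pearson_r(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Invented ratings, for illustration only.
example_betas = {
    (7, 4): -3.0, (4, 7): -2.8,
    (8, 5): -2.0, (5, 8): -1.9,
    (9, 0): -6.0, (0, 9): -5.5,
    (3, 3): -4.0,
}
example_pairs = mirrored_pairs(example_betas)
example_r = pearson_r(example_pairs)
```

A correlation close to 1 for such pairs is what the parallel-test argument relies on: two items that differ only in operand order should end up with nearly the same rating.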
Figure 6 Scatter plot of difficulty β's of mirrored items. Included are some example items, indicated with black dots.
[Figure 5 panels: rating (y-axis, −10 to 10) per grade 1 to 6 (x-axis) for addition (+), subtraction (-), multiplication (x), and division (:).]
[Figure 6 panels: "Correlation m+n, n+m", rating of item (m+n) against rating of mirror item (n+m); "Correlation mxn, nxm", rating of item (mxn) against rating of mirror item (nxm). Labeled example items, addition: 7 + 3, 7 + 6, 8 + 7, 1 + 3, 2 + 3, 0 + 4, 4 + 1, 5 + 3, 5 + 6, 7 + 1, 7 + 4, 8 + 4, 8 + 5, 6 + 8, 9 + 0, 9 + 1, 3 + 9, 10 + 7, 15 + 82, 6 + 1, 6 + 2, 8 + 1, 10 + 2, 9 + 8, 10 + 40. Labeled example items, multiplication: 5 x 1, 5 x 2, 7 x 6, 8 x 5, 8 x 6, 9 x 4, 10 x 1, 15 x 19, 20 x 19, 80 x 12, 12 x 700, 700 x 500, 80 x 500, 25 x 500, 64 x 20, 11 x 33, 11 x 75, 11 x 20, 11 x 500.]
Besides the difficulties β, we can also compute the discriminatory power of items, which indicates how well an item discriminates low-ability from high-ability subjects. We estimated these so-called a-parameters with a logistic regression analysis in which the accuracy of the responses was predicted by the difference in rating between item and respondent. As in the preceding analysis, we compared the discriminatory power between mirrored items. The scatter plots in Figure 7 show rather high, significant (p < .01) positive correlations. The correlations for addition and multiplication are .74 and .71, respectively.
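A minimal sketch of such an estimate for a single item is given below. It assumes a logistic regression with an intercept, fitted by Newton-Raphson, in which the slope on the rating difference θ − β is read as the a-parameter. The function name, the inclusion of an intercept, and the example data are assumptions of this sketch, not details taken from the analysis reported here.

```python
import math


def fit_discrimination(rating_diffs, correct, iterations=25):
    """Fit P(correct) = 1 / (1 + exp(-(b0 + a * d))) by Newton-Raphson.

    rating_diffs: person rating minus item rating (theta - beta) per response.
    correct: 1 for a correct response, 0 otherwise.
    Returns (b0, a); the slope a is interpreted as the discrimination.
    """
    b0, a = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # information matrix (negative Hessian)
        for d, y in zip(rating_diffs, correct):
            p = 1.0 / (1.0 + math.exp(-(b0 + a * d)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * d
            h00 += w
            h01 += w * d
            h11 += w * d * d
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break  # degenerate data (for example, all rating differences equal)
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        a += step1
        if abs(step0) + abs(step1) < 1e-10:
            break
    return b0, a


# Invented responses: 10 responses at each rating difference, with
# 1, 3, 5, 7, and 9 correct answers at differences -2, -1, 0, 1, and 2.
diffs, answers = [], []
for d, n_correct in [(-2, 1), (-1, 3), (0, 5), (1, 7), (2, 9)]:
    diffs += [float(d)] * 10
    answers += [1] * n_correct + [0] * (10 - n_correct)

b0_hat, a_hat = fit_discrimination(diffs, answers)
```

With perfectly separable responses the maximum likelihood estimate does not exist and the slope grows without bound, so a practical version would need a safeguard such as regularization or a minimum number of errors per item.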
Figure 7 Scatter plot of discriminatory a-parameters for mirrored items.
As a final test of reliability, we investigated the stability of the difficulty ratings β. A high correlation between β values of items at two time points far apart indicates high reliability. Therefore we would expect a stable item bank to correlate highly over time. We first looked at the correlation between the item β ratings, as they were set at the start of the project (week 36) and the item ratings in all subsequent weeks. In Figure 8, this correlation is shown by the solid line. Clearly, the initial ratings, set on the basis of an analysis of Math materials used by the schools, were quite good, as the correlation between initial ratings and the ratings after 40 weeks is still .85. Secondly, we also correlated established item ratings in week 44 with all item ratings in subsequent weeks (dotted line). This shows that these established ratings are very stable as the correlations in all 32 weeks stay above .95.
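The stability check can be sketched as a rank correlation between the ratings at a reference week and the ratings at a later week, computed over the items that were played in the later week. The code below is an illustration under assumptions: the data layout (dictionaries from item label to β) and all names are hypothetical, and the ratings in the usage lines are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def stability(reference, later, played):
    """Rank correlation of item ratings between two weeks.

    reference, later: dicts mapping item label to beta in each week.
    played: item labels that were administered in the later week.
    """
    items = [i for i in sorted(played) if i in reference and i in later]
    return spearman([reference[i] for i in items], [later[i] for i in items])


# Invented ratings, for illustration only.
week_44 = {"7 + 4": -3.0, "8 + 7": -1.0, "15 + 82": 2.0, "9 + 0": -6.0}
week_50 = {"7 + 4": -2.5, "8 + 7": -1.2, "15 + 82": 2.4, "9 + 0": -5.0}
rho = stability(week_44, week_50, played=set(week_50))
```

Because only the rank order enters the computation, a uniform shift of all item ratings between the two weeks leaves this coefficient unchanged.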
[Figure 7 panels: "a-parameter correlation n+m, m+n", a-parameter of m+n against a-parameter of n+m; "a-parameter correlation nxm, mxn", a-parameter of mxn against a-parameter of nxm.]
Figure 8 Stability of item ratings for initial ratings (solid line) and established ratings after 2 months (dotted line). The x-axis displays week numbers (v = vacation). Correlations are computed over active (played) items in each week (N_i = number of administered items).
Item reuse
As a result of the longitudinal nature of the Math Garden system, items are presented to the same child more than once. Although the system ensures that at least 20 other items are administered before an item is reused, this reuse may present a threat to the assumption of local independence (e.g., the response to an item must not depend on the previous response to the same item). To test this, we performed regression analyses with both the number of items and the amount of time between two presentations of the same item to the same child as predictors for the child's performance on that item. The child's performance was measured by subtracting the expected score E(S_i) from the actual score S_i. If there is an item-specific learning effect, any child who encounters an item for the second time is likely to have a higher than expected score for that item. We selected pairs of data points that represented subsequent presentations of the same item to the same child. We selected the data so that no child contributed more than one pair of data points, resulting in N = 478 pairs of data points. Because item-specific learning effects are logically more likely to occur if there is a small amount of time between two presentations of the same item to the same child, we removed 90 data points with more than 30 minutes between the two presentations of the item. A regression analysis with this dataset shows no main effect for either the number of items or the amount of time between two presentations of the same item to the same child: number of items, R² < .001, F(1, 476) = 0.39, p = .53, and amount of time, R² < .001, F(1, 476) = 0.0072, p = .93.
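The form of this check can be sketched as a simple linear regression of the score residual S_i − E(S_i) on one predictor (the number of intervening items, or the elapsed time). The function name and the example values below are hypothetical; the sketch only shows how R² and the F statistic with 1 and n − 2 degrees of freedom are obtained, which matches the F(1, 476) reported for 478 pairs.

```python
def simple_regression(x, y):
    """Ordinary least squares of y on a single predictor x.

    Returns (slope, intercept, r_squared, f_statistic), where the
    F statistic has 1 and len(x) - 2 degrees of freedom.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0:
        raise ValueError("predictor has no variance")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_resid = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r_squared = 1.0 - ss_resid / ss_total
    f_statistic = (ss_total - ss_resid) / (ss_resid / (n - 2))
    return slope, intercept, r_squared, f_statistic


# Invented data: number of intervening items and score residuals S - E(S).
lags = [1.0, 2.0, 3.0, 4.0, 5.0]
residuals = [0.1, -0.1, 0.0, 0.1, -0.1]
slope, intercept, r2, f_value = simple_regression(lags, residuals)
```

An item-specific learning effect would show up as residuals that are systematically higher at short lags, that is, as a clearly negative slope together with a large F value.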
Math Garden aims
In order to keep children motivated, items were sampled so that children solved about 75% of the items successfully. However, in the first few months we imposed a success rate of 70%. Figure 9a shows the proportion of correctly answered items per grade and domain.
[Figure 8 plot: "Played item correlation across weeks"; x-axis: week numbers 2008-2009 (v = vacation); y-axis: Spearman's ρ correlation of β_w and β_{>w}, from 0.80 to 1.00; lines for reference weeks w = 35 and w = 44. Number of administered items per week (N_i), in the order extracted: 2784, 1737, 1779, 1769, 1779, 1802, 1893, 1824, 1510, 2784, 1946, 1945, 1943, 1913, 2006, 2034, 2043, 1641, 1415, 1971, 1968, 1915, 1986, 2033, 2145, 2111, 2051, 2041, 2015, 1996, 1999, 2005, 1989, 1938, 1947, 1490, 1828, 1965, 1877, 2042, 1773.]
Only the results of the children who answered more than fifteen items were included in the graph. The graphs show that the proportion of correctly answered items varied between .6 and .8 for most children. The proportion correct seems to be somewhat lower for subtraction and lower still for multiplication and division. At the start of this project, the domains addition and subtraction were briefly available for the lower age groups. This resulted in a lot of question mark use in these domains. To counter this unwanted effect we made the availability of these domains dependent on the proficiency on addition and subtraction. In total, the amount of question mark use in the math games was about 7.3%. Filtering out the question mark responses (Figure 9b) results in considerably higher proportions correct.
Figure 9 Proportion correct per grade and domain.
[Figure 9 panels: proportion correct (y-axis, 0.5 to 1.0) per grade 1 to 6 (x-axis) for addition (+), subtraction (-), multiplication (x), and division (:); a) question mark response included, b) question mark response excluded.]

One of the aims of Math Garden was that it should be a challenging web environment for children of all competency levels. The usage statistics can answer the question whether children are motivated to play the math games. They provide an indication of how attractive and challenging the children found the Math games. It is possible that children visit the Math Garden site mainly because their teachers told them to. To assess how intrinsically motivated the children were to play the games, we looked at the days and hours that children played in Math Garden. Figure 10 (top) shows the number of solved arithmetic problems per day of the week and Figure 10 (bottom) shows the number of solved items per hour of the day. Not surprisingly, most problems were solved Monday through Friday between 9.00 a.m. and 3.00 p.m. However, both graphs also show that a considerable number of problems were solved after school hours and during the weekends. Actually, 33.2% of all problems were solved outside school hours.
Figure 10 Playing frequency during the week and during the day.
To investigate whether competency had any effect on motivation, we looked at the relation between ability and playing frequency. Only data of children who solved 15 or more problems were included to ensure accuracy of the ability estimates. We found only low but significant (p < .01) correlations between ability level and playing frequency for all domains. The correlations for the domains addition, subtraction, multiplication, and division were −.15, −.12, −.05, and .09, respectively. The playing frequency does not appear to depend substantially on the competency level of the children.
[Figure 10 panels: "Frequency per weekday" (Monday to Sunday) and "Frequency per hour" (hours 1 to 24); y-axis: frequency x 1000 (0 to 500); legend: :, x, -, +.]

Diagnostic ability
We will briefly demonstrate the diagnostic and tracking ability of Math Garden by considering a few examples. Using the high frequency dataset, we were able to provide individual and group diagnostics. Figure 11 shows the percentage of typical errors a given child had made (bars) compared to the percentage of these errors made by children of the same grade (solid line). We can see, for instance, that this child makes significantly more zero errors (400 − 200 = 380) for the domain subtraction than other children in the same grade. We provided teachers with such graphs for individuals and groups of individuals (e.g., for the whole class).
Detailed analysis of the item difficulties provides us with insight into sources of item difficulty. Some interesting results have emerged. For example, multiplications of one-digit numbers by 10 or even 100 (e.g., 7 × 100) are among the 10% easiest items for this domain. In division, items of the type nn/n (e.g., 77/7) also appear to be very easy (again among the 10% easiest items). In Chapter 3 we tested how well all kinds of item effects, previously studied in isolation, predict item difficulty. The combined item effects, such as problem size, ties, and the 5 effect, explained 90% of the variance in the difficulty of simple multiplication items.
Figure 11 Error analysis of answers to subtraction problems of a child in grade 5. Bars display the
percentage of errors for this child in a specific week. The lines display the percentage of errors made
by other children in Math Garden (dotted line) and by other children in grade 5 (solid line).
[Figure 11 plot: percentage of errors (x-axis, 0 to 14%) per error category for one child (frequency: 23, grade: 5); legend: significant, non significant, grade mean, total mean. Error categories, with examples: use of question mark; too slow; adding units (42 − 23 = 25); reversibility + borrow error (68 − 29 = 31); reversibility error (93 − 78 = 25); counting error 2 (93 − 78 = 13); counting error 1 (93 − 78 = 14); borrow error (80 − 29 = 61); mirror error (18 − 5 = 31); addition (7 − 4 = 11); digit forgotten (95 − 75 = 25); −0 = 0 error (7 − 0 = 0); 0 forgotten (3000 − 2000 = 100); 0 error (9000 − 5000 = 8500); position error (22 − 10 = 21); unknown.]
A window on developmental change
The high frequency measurements, combined with the size of the sample, provide unique insights into arithmetic development and learning trajectories of children. In Math Garden, trend analyses are provided to teachers. Figure 12 shows the progress of a single child compared to all other children in the same age group. Teachers can use this information to consider interventions. As can be seen in the graph, this child started out having an average rating and a flat growth curve. By week 45 this child started to acquire the necessary ability and by week 49 the child was in the top 25% of all children.
Figure 12 Progress chart of a child in grade 6 (black line), in comparison to the mean of grade 6
(dotted line).
At the micro level it is even possible to study the learning pattern of one child on a specific item over time. For example, in Figure 13 we see the answers and response times of two children on two items across weeks. In the top graph of Figure 13 we see an individual who did not know the answer to the math question 9 × 9 and answered with a question mark in about 5 to 10 seconds on the first ten occasions. Then there were two mistakes in which the child joined the two digits instead of multiplying them. However, in the next attempt the question was answered correctly, although more time was needed to respond. From this point on, the ability level seems sufficient for consistently correct and speedier answers. The bottom graph of Figure 13 shows a lucky guess in the first week (third trial) followed by a gradual gain in insight. Halfway through week 42 this child started answering correctly more often but with highly varying response times. At the end of week 44 the response time dropped. Note that errors keep occurring occasionally. These examples illustrate the level of detail that is possible in the analysis of Math Garden data.
[Figure 12 plot: rating (y-axis, 0 to 10) against week numbers 2008-2009 (x-axis; v = vacation), with percentile lines at 10%, 25%, 75%, and 90%.]
Figure 13 Response time pattern for two children on different items during a number of weeks
(x-axis). The y-axis indicates the response time in seconds. The answer is displayed in the graph. The
question mark answer means that the child pressed the ‘?’ button.
Discussion
In this chapter we presented and tested a new model for computerized adaptive practice and monitoring. The results concerning the validity and reliability are promising. The high correlations with the norm-referenced Cito scores indicate high criterion validity. The increase in player ability rating across grades also supports this, although the children in grades 5 and 6 did not seem to differ. This is probably due to the fact that in the domains we tested no new mental arithmetic techniques are taught in grade 6.

By simulation, we compared measurement precision and measurement bias of CAP to standard CAT. For easy items, the use of the HSHS scoring model, which combines speed and accuracy, together with the Elo rating system (ERS) resulted in less loss in measurement precision and less bias than found in standard CAT estimation. The ERS combined with the 1PL model, using only accuracy data, resulted in worse estimations. Concerning the items and the item bank, we found that difficulty ratings converge in about eight playing weeks, resulting in consistent difficulty ratings across time. High reliability is also indicated by the high correlations of the difficulty and discrimination parameters between sets of mirrored items. We found no indication of learning effects caused by the reuse of items, which suggests that the assumption of local independence holds for reused items. However, in other learning domains this issue still requires careful consideration.
[Figure 13 panels: "Answer to item: 9 x 9 and response time of a grade 3 child" and "Answer to item: 21 : 3 and response time of a grade 3 child"; x-axis: week numbers; y-axis: response time (sec.), 0 to 20.]

The fit statistics for the HSHS model are still in development, and are therefore not included in the results section of this chapter. Evaluation of the goodness of fit for IRT models is an active area of research, and so far definitive solutions are lacking (Embretson & Reise, 2000). Some of the relevant issues (Hambleton, Swaminathan, & Rogers, 1991) concern the sensitivity of the chi-square fit statistic to sample size, technical issues in the testing of dimensionality (Hattie, 1984, 1985), and the testing of the assumption of local independence. Evaluating the fit of IRT models is more complicated still in the context of computer adaptive testing, due to the inherently incomplete item-person data matrix. An alternative approach to comprehensive model fitting consists of checking model assumptions and establishing reliability and validity (Hambleton & Swaminathan, 1984). Here we have confined ourselves to this alternative approach.
We can conclude that children were motivated to play the Math games. The frequency data demonstrated that children played a lot outside school hours. Children with a lower math ability did not play appreciably less, which suggests that they found the math games as motivating as high ability children did. We demonstrated that Math Garden has many possibilities as a diagnostic tool. The error analysis can provide teachers with valuable insight into the kind of errors that individual pupils make. This information can be used to optimize interventions. The current dataset, consisting of a large number of individual high-frequency time series, allows for many further investigations of difficulty effects (Chapter 3), strategy patterns in mathematical problem solving, and individual learning trajectories.
The item ratings also provide insight into what we call informal learning paths. Because of the adaptive item ratings, we gain on-the-fly insight into the difficulty of arithmetic problems. Some items turned out to be unexpectedly easy. For instance, 8 + 6, 5000 + 5 and 50 + 60 were almost equally difficult, whereas 8 + 6 is taught much earlier in the Dutch curriculum than the other two addition problems. This kind of information can be used to determine the curriculum (i.e., what is taught) in each grade.
One of the problems with the Elo rating system is the occurrence of rating inflation and deflation (Glickman, 1999), which we call drift. In educational applications, one source of drift is that new young players start with low ratings and stop playing when they leave school with high ratings. This causes a systematic downwards drift in item rating and, as a consequence, lowers person ratings. This does not seem to jeopardize the operation of Math Garden, since drift influences player and item ratings simultaneously. The main problem lies in the interpretation of the rating. Rating points cannot be accurately compared following inflation or deflation. Therefore we present transformed ratings to teachers and users to prevent interpretation problems.
Transformation is conducted by calculating the average probability correct for a single user on all items in the domain, as shown in equation 10:

$$P = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{1 + e^{-a(\theta_j - \beta_i)}} \qquad (10)$$
This value is an estimate of the percentage of items in the domain that the user is able to answer correctly. We also reduced drift by incorporating the rating uncertainty in calculating the K factor, which minimizes the influence of unreliable person and item estimates on the updating process. A related issue is the convergence speed, that is, the time or number of responses needed to reach a stable rating. We set the rating uncertainty parameters of the K factor, which determine the convergence speed, on the basis of extended testing. A better approach would perhaps be to estimate the uncertainty based on aberrant response patterns, where unexpected responses are used as an indication of unreliability.
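The transformed rating, that is, the average over all n items in the domain of the logistic probability 1 / (1 + e^{−a(θ_j − β_i)}), can be sketched as follows. The function name and the default a = 1.0 are assumptions of this sketch; the value of a used in Math Garden is not stated in this section, and the ratings in the usage lines are invented.

```python
import math


def transformed_rating(theta, betas, a=1.0):
    """Average probability correct of one user over all items in a domain.

    theta: the user's rating (theta_j).
    betas: difficulty ratings (beta_i) of all items in the domain.
    a: scaling constant in the logistic function (assumed value).
    """
    if not betas:
        raise ValueError("the domain must contain at least one item")
    total = sum(1.0 / (1.0 + math.exp(-a * (theta - beta))) for beta in betas)
    return total / len(betas)


# Invented ratings: the same user and items before and after a uniform
# drift of +2 rating points in both person and item ratings.
before = transformed_rating(0.0, [-1.0, 1.0])
after = transformed_rating(2.0, [1.0, 3.0])
```

Because only the differences θ_j − β_i enter the formula, a drift that shifts person and item ratings by the same amount leaves the transformed rating unchanged, which is why it is easier to interpret than the raw rating.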
A last issue concerns the one-dimensionality of the math domains. In practice, every test
and item bank is expected to violate the assumption of one-dimensionality to some degree.
Though we see no immediate effects on ability estimation, the question of how robust the
HSHS Elo algorithm is to violation of this assumption needs further investigation. We also
intend to further address the possible individual differences between children and how the
HSHS scoring rule affects their behavior.
In conclusion, Math Garden meets the requirements we set for the practice and progress
monitoring system. It is worth noting that although the new CAP algorithm is implemented
in the domain of math, the system can be applied to all kinds of learning domains. In the
2010 release of Math Garden more games, e.g. fractions, have been added and a language
garden is in development. Also, the number of schools using Math Garden continues to
grow steadily (about 150 in October 2010), yielding about 50 thousand responses per day.
We expect a fast adoption of computers, such as handhelds, minicomputers and tablets,
in primary schools in the next 5 years. If children do their daily exercises in practice and
progress monitoring systems using these devices, we expect many benefits for students,
teachers, and scientists.