

UvA-DARE (Digital Academic Repository)

Math Garden: A new educational and scientific instrument

Straatemeier, M.


Citation for published version (APA): Straatemeier, M. (2014). Math Garden: A new educational and scientific instrument.


Download date: 12 Feb 2020

https://dare.uva.nl/personal/pure/en/publications/math-garden-a-new-educational-and-scientific-instrument(67ab8977-5e36-4f9f-962d-be294f2b8c09).html

Chapter 2

Computer adaptive practice of math ability using a new item response model for on the fly ability and difficulty estimation

This chapter has been published as:

Klinkenberg, S., Straatemeier, M., & Van der Maas, H. L. J. (2011). Computer adaptive practice of maths ability using a new item response model for on the fly ability and difficulty estimation. Computers & Education, 57, 1813-1824.


Abstract

In this chapter we present a model for computerized adaptive practice and monitoring. This model is used in Math Garden, a web-based monitoring system, which includes a challenging web environment for children to practice arithmetic. Using a new item response model based on the Elo (1978) rating system and an explicit scoring rule, estimates of the ability of persons and the difficulty of items are updated with every answered item, allowing for on the fly item calibration. In the scoring rule both accuracy and response time are accounted for. Items are sampled with a mean success probability of .75, making the tasks challenging yet not too difficult. In a period of ten months our sample of 3648 children completed over 3.5 million arithmetic problems. The children completed about 33% of these problems outside school hours. Results show better measurement precision, high validity and reliability, high pupil satisfaction, and many interesting options for monitoring progress, diagnosing errors and analyzing development.

Introduction

In this chapter we present a computerized adaptive practice (CAP) system for monitoring arithmetic in primary education: Math Garden. Math Garden is a web-based computer adaptive practice and monitoring system based on weekly measurements. In recent years math abilities of Dutch students have been widely debated. This is mainly due to the results of the National Periodical Education Polls (PPON). These results show that few children reach the required math level at the end of their primary education (Kraemer, Janssen, Van der Schoot, & Hemker, 2005). Based on these findings a parliamentary inquiry into Dutch education was initiated. Both the committee “Dijsselbloem” (2008) and the expert group “Doorlopende Leerlijnen” (2008) recommended several improvements to the Dutch education system in general and math education in particular. Recommendations included the provision of more time to practice and maintain basic math skills, more efficient and effective measurement in education, and the use of these measurement results to improve the ability of individual students, the classroom and education in general. These recommendations are also supported by Fullan (2006), who claimed that acting on data is critical for learning from experience.

Combining practice and measurement

In the light of these recommendations we propose to combine practice and measurement in a playful manner using computerized educational games. We expect that in the near future children will increasingly use mini computers and handheld devices to do their daily exercises in arithmetic, spelling, and other subjects. The use of computers has two main advantages. First, the input can be analyzed automatically and feedback can be given immediately, which will free teachers from checking and correcting the children’s exercise books. The recorded and automatically analyzed data can provide teachers with detailed information on children’s progress and the errors they make. Teachers can use this information to optimize individual instruction. The information concerning the child’s progress and abilities, which is accumulated over time, may ultimately obviate the need to conduct tests and examinations. Second, by using computers it is possible to let children practice at their individual ability level. Research on the development of expertise performance has shown that people do improve their performance considerably if they regularly do specific exercises that are adjusted to their ability level and include immediate feedback. In the development of Math Garden we follow these ideas developed in sports and expertise training, especially the idea of deliberate practice (Ericsson, 2006, pp. 683–703).

Three problems of CAT

To implement individualized practice, we apply the technique of computer adaptive testing (Van der Linden & Glas, 2000; Wainer, 2000). Computer adaptive testing (CAT) is based on item response theory (IRT). This theory consists of statistical models that relate item responses to the (latent) abilities that the items measure (Lord & Novick, 1968). A large collection of item response models is available, but these are all basically variations on the simplest model, i.e., the one-parameter logistic (1PL) model or Rasch model (Rasch, 1960). In the Rasch model the probability of a correct or affirmative answer is a logistic function of the difference between the ability of the subject and the difficulty of the item. In the two-parameter logistic model, the difference is weighted by an item discrimination parameter, which has a high value when an item discriminates well between low and high ability subjects. Item response models can be used for equating tests, to detect and study differential item functioning (bias), and to develop computer adaptive tests (Van der Linden & Hambleton, 1997). The idea of CAT is to determine the ability level of a person dynamically. In CAT, item administration depends on the subject’s previous responses. If the preceding item is answered correctly (incorrectly), a more (less) difficult item is presented. Hence, each person is presented a test tailored to his or her ability. Using CAT, test length can be shortened by up to 50% (Eggen & Verschoor, 2006). Originally, CAT was developed for measurement only. Our aim to combine practice and measurement raises several novel issues. We distinguish the following three issues.

First, in standard CAT the parameters of the items, especially the difficulty, have to be known in advance to test adaptively. Items therefore have to be “pre-calibrated” before they can be used in real test situations. This means that a large representative sample of the population has to have answered the items in the item bank to provide the information for item calibration. The difficulty of the items is determined using the data of this sample. This method is obviously time-consuming and costly, especially as the calibration has to be carried out repeatedly (e.g., every few years) to acquire accurate norm referenced item parameters.

Second, CAT operates most effectively if the difficulty level of administered items equals the ability estimate of the person. The probability of answering such items correctly is .5. However, for most children and many adults the success rate associated with a .5 probability is experienced as discouraging. Research by Eggen and Verschoor (2006) showed that increasing this probability to above .7 greatly reduces measurement precision. Given a .7 probability, more items need to be administered to obtain an accurate estimate of person ability. This requirement reduces the efficiency of computer adaptive testing.

The third problem concerns a testing problem that applies to psychological and educational measurement in general, namely, the trade-off between speed and accuracy. Without explicit instructions, participants in tests and experiments are free to balance speed and accuracy as they wish. Consequently the trade-off between speed and accuracy can be a source of large individual differences. The current solution in psychometrics (Van der Linden, 2007) and experimental psychology (Ratcliff & Rouder, 1998; Vandekerckhove & Tuerlinckx, 2008) is to estimate person parameters involved in this trade-off on the basis of the data. However, this procedure requires large amounts of high quality data.
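The second issue can be made concrete with a small numerical sketch. It assumes the standard logistic form of the Rasch model, P = 1 / (1 + exp(-(θ - β))), and the standard result that the Fisher information of a single Rasch item equals P(1 - P). The function names are illustrative and are not taken from Math Garden.

```python
import math


def response_probability(theta: float, beta: float, a: float = 1.0) -> float:
    """Probability of a correct answer.

    a = 1 gives the Rasch (1PL) model; other values of a weight the
    difference theta - beta as in the two-parameter logistic (2PL) model.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - beta)))


def rasch_information(p: float) -> float:
    """Fisher information of one Rasch item answered with success probability p."""
    return p * (1.0 - p)


# Item matched to the person (beta equals theta): P = .5, maximal information.
p_matched = response_probability(theta=0.0, beta=0.0)

# Easier item chosen so that P = .75, i.e. beta = theta - ln(.75 / .25).
p_easy = response_probability(theta=0.0, beta=-math.log(3.0))

print(p_matched, rasch_information(p_matched))
print(round(p_easy, 2), round(rasch_information(p_easy), 4))
```

Under these assumptions an item answered with probability .75 carries .1875 units of information against .25 for a matched item, so about a third more items would be needed for the same precision when only accuracy is scored.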

New CAT

We developed an extended CAT approach to solve these problems. This Computer Adaptive Practice (CAP) system provides the basis of Math Garden. The CAP system includes the following two innovations. First, we have applied a new estimation method based on the Elo (1978) rating system (ERS) developed for chess competitions. The ERS allows for on the fly estimation of item difficulty and person ability parameters. With this method, pre-testing is no longer required. Second, we have used an explicit scoring rule for speed and accuracy, which is known to the subject during the test. Inclusion of speed in the scoring has the advantage that we acquire more information about ability. Research by Van der Maas and Wagenmakers (2005) showed that in the responses to easy chess items there is a strong negative relation between response time and ability. Subjects tend to answer easy items correctly, but more advanced subjects answer them more quickly. A further consequence is that, by integrating response time into the estimation of ability, we can decrease the difficulty of administered items with less loss of measurement precision than noted by Eggen and Verschoor (2006). In addition we expect the higher success rate to increase the motivation of children during the test. In the Methods section we describe Math Garden, the Elo algorithm and the new scoring rule in more detail. In the Results section of this chapter we test the working of Math Garden. We present evidence for high validity and reliability of ability and difficulty estimation, the motivational value of Math Garden, and its usefulness as a diagnostic and monitoring instrument.

Methods

Participants

A total of 35 primary schools, eight remedial teachers and 32 families participated in this study, comprising N = 3648 active participants. Also 334 aspiring kindergarten pupils joined Math Garden. In the time period from August 2008 to early June 2009 more than 3.5 million arithmetic problems were answered in our sample. In addition to the responses, we registered the gender, age and grade of the participants. Table 1 shows the mean age with standard deviation and the number of children for each grade.

Materials

The main measurement tool used in this study is the web-based practice and monitoring system we developed: Math Garden. The student interface consists of a garden containing distinct flowerbeds, representing, among others, the four domains on which we focus in this chapter: addition, subtraction, multiplication, and division (Figure 1a). The size of the flowers represents the math ability of the student. By clicking on a flowerbed the math game is started for a specific domain and the student can start playing.

Table 1 Age, gender and N per grade

Grade                 Age category   N     Mean age (SD)   % Male (Female)
kindergarten          4-5            103   4.32 (0.51)     50.49 (49.51)
kindergarten          5-6            231   5.45 (0.51)     47.19 (52.81)
1                     6-7            529   6.61 (0.51)     53.50 (46.50)
2                     7-8            681   7.69 (0.54)     55.21 (44.79)
3                     8-9            526   8.68 (0.79)     47.91 (52.09)
4                     9-10           513   9.70 (0.61)     47.24 (52.76)
5                     10-11          574   10.79 (0.60)    49.48 (50.52)
6                     11-12          416   11.80 (0.57)    50 (50)
Secondary Education   > 12           75    13.33 (3.94)    64 (36)


The visual interface of the math task consists of a math question, six answer options (addition and subtraction) or a number pad (multiplication and division), a coin bag, a question mark, a stop sign, and an elapsing coin bar that indicates the time left on the item (Figure 1b). The game rules are intuitive and therefore only require minimal explanation on the website. Students gain points (coins), displayed at the bottom of the game interface, by answering items correctly and lose coins when answering incorrectly. With each item a total number of twenty coins, corresponding to the maximum time in seconds, can be won or lost. Every second one coin disappears. The remaining coins are added to the coin bag if the item has been solved correctly and are subtracted if solved incorrectly. If the time limit has expired or the question mark has been clicked, no coins are lost or won. The rationale of this scoring rule is explained in the psychometrics section. A session consists of fifteen items after which the math game terminates and the student is returned to his or her garden. The flowers will start growing according to the progression that has been made. Students are motivated by two reward systems. Good performance is rewarded by growing flowers and virtual coins.¹ The Math Garden website contains a dedicated area, the prize cabinet, where virtual prizes can be bought with the earned coins. Another way students are motivated to continue playing in Math Garden is that the flowerbeds wither if the student does not play. Withering worsens over time and can only be undone by completing a new session of 15 items.

Figure 1 The main Math Garden interface (a) and an addition item (b).

The four domains, addition, subtraction, multiplication, and division, contain 738, 723, 659, and 664 items, respectively. The items in the four domains cover the curriculum in primary education. They vary from easy (e.g., 3 + 4 with response options: 7, 8, 6, 1, 9 and 12) to difficult (e.g., 7.34 + 311.4 with response options: 318.74; 318.38; 318.47; 317.74; 319.74 and 318.34). The response options are selected to be informative distracters. Open-ended items were used in the games multiplication and division. Variables measured by the task are response time, the given answer, the correctness (0, 1) and a timestamp at administration.

¹ Because of the adaptive nature of the test, every student has roughly the same percentage correct. Hence the number of coins won reflects only how often a student plays and not his arithmetic level.

We studied the validity of the data by comparing the ability estimate, measured with the Math Garden, with students’ scores on the math tests from the pupil monitoring system (Janssen & Engelen, 2002) of the National Institute for Educational Measurements (Cito). In the Cito monitoring system math tests are administered twice a year from mid grade 1 until mid grade 6. These tests assess the knowledge and skills that are being taught in these grades. The tests contain both open-ended and forced-choice items. Students’ scores on the math test (the total of correct answers) are transformed to a score on a norm-referenced general math ability scale. This allows one to compare students’ scores from different grades using one scale.



Psychometrics

Elo rating system

In chess the Elo (1978) rating system (ERS) is used to estimate the relative ability of a player. The ERS is a dynamic paired comparison model, which is mathematically closely related to the Rasch IRT model (Batchelder & Bershad, 1979). Initially chess players are given a provisional ability rating θ, which is incrementally updated (see equation (1)) based on match results (in chess 0, .5 and 1, for loss, draw and win outcomes). The updated ability estimate $\hat{\theta}$ (signified by the hat) depends on the weighted difference between the match result S and the expected match result E(S). The expected match result is a function of the difference between the ability estimates of both players j and k preceding the match and expresses the probability of winning (see equation (2)):

\[ \hat{\theta}_j = \theta_j + K\,(S_j - E(S_j)), \qquad \hat{\theta}_k = \theta_k + K\,(S_k - E(S_k)) \tag{1} \]

\[ E(S_j) = \frac{1}{1 + 10^{(\theta_k - \theta_j)/400}} \tag{2} \]

The K factor in equation (1) weights the impact of the deviation from expectation on the new ability estimate. This value essentially determines the rate at which θ can change over matches. In the standard ERS the K factor is constant. Glickman (1995) argued that not all ability ratings are estimated accurately by the ERS update function (eq. 1). Inaccuracies mostly occur when players are new or have not played for an extended period of time, resulting in much uncertainty in their ability rating θ. Glickman proposed to let the K factor reflect the uncertainty in ability estimates by making it a function of time and playing frequency. If there is little uncertainty, the K factor for recent and frequent players will be low. If there is much uncertainty the K factor will be high.
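The standard Elo update of equations (1) and (2) can be sketched in a few lines. This is a minimal illustration with a constant K factor. The value K = 32 and the function names are assumptions chosen for the example and do not come from the chapter.

```python
def expected_result(theta_j: float, theta_k: float) -> float:
    """Equation (2): expected match result for player j against player k."""
    return 1.0 / (1.0 + 10.0 ** ((theta_k - theta_j) / 400.0))


def elo_update(theta_j: float, theta_k: float, s_j: float, k: float = 32.0):
    """Equation (1) applied to both players.

    s_j is the match result for player j: 1 (win), .5 (draw) or 0 (loss).
    k = 32 is an illustrative constant, not a value from the chapter.
    """
    e_j = expected_result(theta_j, theta_k)
    e_k = expected_result(theta_k, theta_j)
    s_k = 1.0 - s_j
    new_j = theta_j + k * (s_j - e_j)
    new_k = theta_k + k * (s_k - e_k)
    return new_j, new_k


# Two equally rated players; player j wins.
print(elo_update(1500.0, 1500.0, s_j=1.0))
```

With equal ratings the expected result is .5 for both players, so the winner gains and the loser loses half of K.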

Computer adaptive practice

Our suggestion for creating an on the fly item calibrating and computer adaptive practice (CAP) system is to replace one player in the Elo system by an item.² Solving an item correctly is interpreted as winning the match against the item. The updating function in equation (1) can be rewritten to equation (3) for updating player and item ratings:

\[ \hat{\theta}_j = \theta_j + K_j\,(S_{ij} - E(S_{ij})), \qquad \hat{\beta}_i = \beta_i + K_i\,(E(S_{ij}) - S_{ij}) \tag{3} \]

where β_i is the difficulty estimate of the item and S_ij and E(S_ij) are the score and expected probability of winning for person j on item i. Following Glickman, the K factor in our CAP system is a function of the rating uncertainty U of the player and the item (eq. 4):

\[ K_j = K\,(1 + K_+ U_j - K_- U_i), \qquad K_i = K\,(1 + K_+ U_i - K_- U_j) \tag{4} \]

where K = 0.0075 is the default value when there is no uncertainty and K_+ = 4 and K_- = 0.5 are the weights for the rating uncertainty for person j and item i. These values determine the rate at which θ and β can change following each item response. These values have been determined through extensive simulations.

² This approach has, for many years, successfully been applied in an online chess testing system on the Chess Tactics Server (http://chess.emrald.net)

The uncertainty U depends on both recency and frequency. Equation (5) combines these opposite effects on uncertainty. We apply the same equation to items and players, with provisional uncertainty of U = 1 and 0 ≤ U ≤ 1:

\[ \hat{U} = U - \frac{1}{40} + \frac{1}{30} D \tag{5} \]



We assume that uncertainty for players and items decreases after every administration and increases with time. Therefore uncertainty reduces to zero after 40 administrations and conversely increases to the maximum of 1 after 30 days D of not playing.
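The person and item updates of equations (3) to (5) can be sketched as follows. The constants are the ones reported in the text. How equation (5) is scheduled is an assumption made here: it is applied once per administration, with D the number of days since the previous administration, and the result is clipped to the stated range 0 ≤ U ≤ 1. All function names are hypothetical.

```python
K_BASE = 0.0075   # K: default value when there is no uncertainty
K_PLUS = 4.0      # K+: weight of the own rating uncertainty
K_MINUS = 0.5     # K-: weight of the opponent's rating uncertainty


def k_factors(u_person: float, u_item: float):
    """Equation (4): K factors for person j and item i."""
    k_person = K_BASE * (1.0 + K_PLUS * u_person - K_MINUS * u_item)
    k_item = K_BASE * (1.0 + K_PLUS * u_item - K_MINUS * u_person)
    return k_person, k_item


def update_uncertainty(u: float, days_since_last_play: float) -> float:
    """Equation (5), clipped to the range [0, 1] (clipping is an assumption)."""
    u_new = u - 1.0 / 40.0 + days_since_last_play / 30.0
    return min(1.0, max(0.0, u_new))


def cap_update(theta, beta, score, expected, u_person, u_item):
    """Equation (3): new ability and difficulty estimates after one response."""
    k_person, k_item = k_factors(u_person, u_item)
    new_theta = theta + k_person * (score - expected)
    new_beta = beta + k_item * (expected - score)
    return new_theta, new_beta


# A new player (U = 1) meets a well-calibrated item (U = 0):
# the player's rating moves ten times faster than the item's.
print(k_factors(u_person=1.0, u_item=0.0))
```

The asymmetry is the point of the design: a response from a new player says a lot about the player and little about an item whose difficulty is already well estimated.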

High speed, high stakes

We incorporate speed by using the scoring rule (shown in eq. 6) for speed and accuracy, which we call the high speed, high stakes (HSHS) scoring rule (Maris & Van der Maas, 2012). This rule imposes a speed-accuracy trade-off setting on the individual. Player j has to respond x_ij in time t_ij before the time limit d_i for item i. The score S_ij is scaled by the discrimination parameter a_i:

\[ S_{ij} = (2x_{ij} - 1)\,(a_i d_i - a_i t_{ij}) \tag{6} \]

In this scoring rule the stakes are high when the subject responds quickly. In case of a correct answer (x_ij = 1) the score equals the remaining time. In case of an incorrect answer (x_ij = 0) the remaining time is multiplied by −1. Thus a quick incorrect answer leads to a large negative score. This scoring rule is depicted in Figure 2. The scoring rule is expected to minimize guessing by encouraging deliberate and thoughtful responses.
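A short sketch of the scoring rule in equation (6) follows. The function name and argument names are illustrative. By default the discrimination is set to 1/d, as the text reports for Math Garden. Passing a discrimination of 1 with a 20 second limit reproduces the coin count described under Materials, assuming one coin per remaining second.

```python
from typing import Optional


def hshs_score(correct: bool, response_time: float, time_limit: float,
               discrimination: Optional[float] = None) -> float:
    """Equation (6): S = (2x - 1)(a*d - a*t).

    correct        -> x (True for a correct answer, False for an incorrect one)
    response_time  -> t, in seconds
    time_limit     -> d, in seconds
    discrimination -> a; defaults to 1/d, so scores lie between -1 and 1
    """
    a = 1.0 / time_limit if discrimination is None else discrimination
    x = 1 if correct else 0
    return (2 * x - 1) * (a * time_limit - a * response_time)


# A fast answer after 5 of 20 seconds: high stakes in both directions.
print(round(hshs_score(True, 5.0, 20.0), 2), round(hshs_score(False, 5.0, 20.0), 2))

# A slow answer after 19 seconds: little to win, little to lose.
print(round(hshs_score(True, 19.0, 20.0), 2), round(hshs_score(False, 19.0, 20.0), 2))

# Same fast correct answer expressed in coins (a = 1, one coin per second left).
print(hshs_score(True, 5.0, 20.0, discrimination=1.0))
```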

Figure 2. High speed, high stakes scoring rule. [Plot of score (vertical axis, from −a_i d_i to +a_i d_i) against response time (horizontal axis, from 0 to d_i), with separate lines for correct and incorrect responses.]

Maris and Van der Maas (2012) derived an IRT model that conforms to the HSHS scoring rule. The expected score (eq. 7) can be inferred from this model. E(S_ij) is based on the ability estimate of the person θ_j, the difficulty estimate of the item β_i, the time limit d_i and discrimination parameter a_i for that item. In Math Garden, we set a_i = 1/d_i, such that the effective discrimination equals that of the 1PL model:

\[ E(S_{ij}) = a_i d_i \, \frac{e^{2 a_i d_i (\theta_j - \beta_i)} + 1}{e^{2 a_i d_i (\theta_j - \beta_i)} - 1} - \frac{1}{\theta_j - \beta_i} \tag{7} \]

We use the HSHS score S_ij (eq. 6) and the corresponding expected score E(S_ij) (eq. 7) in our modified Elo update function (eq. 3).
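The expected score in equation (7) can be evaluated numerically as sketched below. Two choices here are not from the chapter and are made only for numerical convenience: the fraction (e^{2c} + 1)/(e^{2c} − 1) is computed as 1/tanh(c), which is the same quantity, and for θ_j very close to β_i the first term of the series expansion, (a_i d_i)² (θ_j − β_i)/3, is used because the two terms of equation (7) each diverge there while their difference tends to 0. The function name is hypothetical.

```python
import math
from typing import Optional


def expected_hshs_score(theta: float, beta: float, time_limit: float,
                        discrimination: Optional[float] = None) -> float:
    """Equation (7): expected HSHS score for ability theta on difficulty beta."""
    a = 1.0 / time_limit if discrimination is None else discrimination
    ad = a * time_limit
    diff = theta - beta
    if abs(diff) < 1e-6:
        # Limit of equation (7) as theta approaches beta (series expansion).
        return ad * ad * diff / 3.0
    # (e^{2c} + 1) / (e^{2c} - 1) equals 1 / tanh(c), with c = ad * diff.
    return ad / math.tanh(ad * diff) - 1.0 / diff


# Person and item matched: a win and a loss are equally likely, expected score 0.
print(expected_hshs_score(0.0, 0.0, 20.0))

# Person one unit above the item: positive expected score.
print(round(expected_hshs_score(1.0, 0.0, 20.0), 4))

# Person one unit below the item: the mirror image.
print(round(expected_hshs_score(-1.0, 0.0, 20.0), 4))
```

The expected score is antisymmetric in θ_j − β_i and stays between −a_i d_i and +a_i d_i, matching the range of the observed score in equation (6).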

Item selection

Items are selected for which the mean probability of answering correctly is about .75. Repetition of the same items is restricted, by ensuring that items are reused only after 20 other items have been answered. A new target β_t is selected by using:

\[ \beta_t = \theta_j - \ln\frac{P}{1 - P} \tag{8} \]

where probability P is randomly drawn from a normal distribution P ∼ N(.75, .1) and restricted such that .5 < P < 1. For administration the nearest available item is selected by: min_i |β_i − β_t|.
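The selection step can be sketched as follows. Several details are assumptions, since the text does not specify them: the .1 in N(.75, .1) is read as a standard deviation, the restriction .5 < P < 1 is implemented by redrawing, and the "available" items are those not among the recently answered ones (the text reuses an item only after 20 others). The names and the toy item bank are hypothetical.

```python
import math
import random


def draw_target_probability(rng: random.Random) -> float:
    """Draw P ~ N(.75, .1), redrawing until .5 < P < 1 (assumed mechanism)."""
    while True:
        p = rng.gauss(0.75, 0.1)
        if 0.5 < p < 1.0:
            return p


def target_difficulty(theta: float, p: float) -> float:
    """Equation (8): difficulty at which a person with ability theta succeeds with probability p."""
    return theta - math.log(p / (1.0 - p))


def select_item(theta: float, difficulties: dict, recent_ids: set,
                rng: random.Random):
    """Return the id of the available item whose difficulty is nearest to the target."""
    beta_t = target_difficulty(theta, draw_target_probability(rng))
    available = {i: b for i, b in difficulties.items() if i not in recent_ids}
    return min(available, key=lambda i: abs(available[i] - beta_t))


# Toy item bank: item id -> current difficulty estimate.
bank = {"3+4": -2.0, "17+8": -0.9, "46+38": 0.1, "7.34+311.4": 2.5}
rng = random.Random(2011)
print(select_item(theta=0.0, difficulties=bank, recent_ids={"3+4"}, rng=rng))
```

Because P always exceeds .5, the target difficulty lies below the ability estimate, so the selected items are on average easier than a matched item would be.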

Procedure

Although Math Garden started out as a pilot project, only available to a limited number of schools in the Netherlands, the website later became available to a larger audience. In the pilot period the students received a login account and an instruction from their teacher. In this instruction, teachers explained the scoring rule of the games and students were told that they could click on the question mark if they did not know the answer. After this, students could start playing on their own. Teachers were told that the first two sessions should be played at school. After this, students were also allowed to play at home, but they were instructed to play by themselves. After the pilot period the Math Garden also became available to remedial teachers and families. The remedial teachers and families were not instructed on the frequency of playing. The manuals on how to use the Math Garden were all available on the website but the scoring rule of the games was not explicitly explained to the children.

Results

Measurement precision

To test whether the incorporation of response time in the estimation of ability allows us to

lower the difficulty of administered items with less loss of measurement precision, we conducted a

simulation study. We compared our results to those of Eggen and Verschoor (2006). In a simulation

study, Eggen and Verschoor showed3 an increasing (negative) bias (Figure 3: left) and a drop in

measurement precision (Figure 3: right) when selecting easy items in a standard CAT using the

weighted maximum likelihood estimator (WML) and the one-‐parameter logistic (1PL) model. Average

bias was computed by: 1/𝑛𝑛 (𝜃𝜃! − 𝜃𝜃!) and measurement precision was quantified by calculating the

mean standard error of estimation 𝑠𝑠𝑠𝑠(𝜃𝜃) using the information function for the 1PL model.

3 Table 1 in Eggen and Verschoor (2006).

βt = θ j − lnP1−P
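The two summary statistics named above can be illustrated with a minimal Python sketch. It assumes the standard 1PL information function, I(θ) = Σ P(1 − P) summed over the administered items; the function names are hypothetical and are not taken from the Math Garden code.

```python
import math


def average_bias(true_thetas, estimated_thetas):
    """Mean of (estimate - true ability); negative values mean underestimation."""
    n = len(true_thetas)
    return sum(est - true for true, est in zip(true_thetas, estimated_thetas)) / n


def p_correct_1pl(theta, beta):
    """1PL probability of a correct response for ability theta and difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def se_from_information(theta, betas):
    """Standard error 1 / sqrt(I(theta)), with I(theta) the sum of P(1 - P) over items."""
    info = 0.0
    for beta in betas:
        p = p_correct_1pl(theta, beta)
        info += p * (1.0 - p)
    return 1.0 / math.sqrt(info)


# Forty items that a person with theta = 0 answers correctly with probability .75.
easy_items = [-math.log(3.0)] * 40
se_easy = se_from_information(0.0, easy_items)

# Forty items answered correctly with probability .5 (the most informative case).
matched_items = [0.0] * 40
se_matched = se_from_information(0.0, matched_items)
```

Under these assumptions `se_easy` is larger than `se_matched`, which is the loss of precision that selecting easy items entails in a standard CAT.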


Figure 3 BIAS and SE for different computer adaptive methods at different values of expected

probability correct.

In our simulation we used the Elo update function to estimate ability and difficulty, utilizing: a) accuracy data with the 1PL model and b) accuracy and response time data using the HSHS model. As in the study by Eggen and Verschoor (2006), our item bank consisted of 300 items with normally distributed difficulties β ~ N(0,1), and we also sampled 4000 abilities from a normal distribution θ ~ N(0,1). The CAP algorithm starts with an item of intermediate difficulty −0.5 < β < 0.5 and terminates after 40 items. As a starting point for ability we selected a random value from a normal distribution N(0,1). We compared our Elo-based HSHS model, at different desired success probabilities, to Eggen and Verschoor’s 1PL model using standard CAT. Eggen and Verschoor investigated success probabilities up to .75. With regard to bias, the Elo estimation method performs slightly worse with accuracy data only (Figure 3: left: Elo + 1PL), but outperforms Eggen and Verschoor’s standard CAT method when RTs are included (Figure 3: left: Elo + HSHS). With regard to the standard error of estimation we also compared our two Elo methods to the theoretical maximum information for the 1PL model.
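The accuracy-only condition (Elo + 1PL) of this design can be sketched in Python as follows. This is a simplified illustration, not the authors' simulation code: it assumes a constant, hypothetical K factor of 0.3 in place of the K function of eq. 3, keeps item difficulties fixed at their true values instead of updating them, and never reuses an item within a session. Items after the first are chosen with the target-difficulty rule of eq. 8.

```python
import math
import random


def p_correct(theta, beta):
    """1PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def target_difficulty(theta_hat, p):
    """Equation (8): difficulty at which the current estimate succeeds with probability p."""
    return theta_hat - math.log(p / (1.0 - p))


def simulate_elo_1pl(p_target, n_persons=4000, n_items=300, test_length=40,
                     k=0.3, seed=1):
    """Return (mean bias, sd of estimation error) for an accuracy-only Elo CAP."""
    rng = random.Random(seed)
    betas = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    # Items of intermediate difficulty for the first administration
    # (falls back to the whole bank if none qualifies).
    starters = [i for i, b in enumerate(betas) if -0.5 < b < 0.5] or list(range(n_items))
    errors = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)       # simulated (true) ability
        theta_hat = rng.gauss(0.0, 1.0)   # random starting estimate
        used = set()
        for step in range(test_length):
            if step == 0:
                item = rng.choice(starters)
            else:
                bt = target_difficulty(theta_hat, p_target)
                item = min((i for i in range(n_items) if i not in used),
                           key=lambda i: abs(betas[i] - bt))
            used.add(item)
            correct = 1.0 if rng.random() < p_correct(theta, betas[item]) else 0.0
            expected = p_correct(theta_hat, betas[item])
            theta_hat += k * (correct - expected)   # Elo update, constant K (assumption)
        errors.append(theta_hat - theta)
    bias = sum(errors) / len(errors)
    sd = math.sqrt(sum((e - bias) ** 2 for e in errors) / (len(errors) - 1))
    return bias, sd


# A reduced run; the study itself used 4000 simulated abilities.
bias_75, sd_75 = simulate_elo_1pl(0.75, n_persons=200, seed=3)
```

Calling `simulate_elo_1pl` over a grid of `p_target` values gives curves of the kind shown in Figure 3, although the numbers depend on the assumed K and will not reproduce the published values.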

[Figure 3, left panel (BIAS): mean BIAS against expected probability correct (0.5 to 0.9) for Eggen et al., Elo + 1PL, and Elo + HSHS. Right panel (SE): SE against expected probability correct for Eggen et al., Elo + 1PL, Elo + HSHS, and Max Info.]

We calculated the maximum information (Figure 3: right: Max Info.) with equation (9):

$$se(\theta) = \frac{1}{\sqrt{N a_i^2 P_i(\theta)\,(1 - P_i(\theta))}} \tag{9}$$

where $a_i$ = 1 is the discrimination parameter, N = 40 is the number of items, and $P_i(\theta)$ is the desired probability correct. As can be seen in Figure 3 (right: Max Info.), when the probability of answering correctly assumes large values (x-axis), the theoretical minimum SE (eq. 9) for the 1PL model increases steeply (y-axis). For the standard error of estimation (Figure 3: right) we calculated the standard deviation of the difference between simulated abilities θ and estimated abilities $\hat{\theta}$. This method of calculation is simpler than, yet comparable with, the procedure used by Eggen and Verschoor (2006) for calculating the standard error of estimation.

The SE of the Elo estimation method using only accuracy data (Figure 3: right: Elo + 1PL) is largest for almost all probability levels. This is to be expected as this method is statistically inferior to the WML method used by Eggen and Verschoor (2006). Up to a probability level of about .69 the SE using the HSHS Elo method (Figure 3: right: Elo + HSHS) is larger than the SE found in the Eggen and Verschoor simulation. However, at higher probability levels, especially around our target of .75, the SE is considerably lower. At probability levels higher than about .78 the SE even drops below the theoretical minimum implied by the maximum information for the 1PL model (Figure 3: right: Max Info.). This demonstrates that incorporating response times results in much better measurement precision when using easy items.

Validity

To assess the validity of the Math Garden measurements, the ratings of the students were compared to their scores on the norm-referenced general math ability scale of the pupil monitoring system of Cito (Janssen & Engelen, 2002). The correlations between these two measures, which serve as a measure of convergent validity, ranged from .78 to .84 for the four domains addition, subtraction, multiplication, and division. These correlations were based on a subset of our sample: Cito scores were available for N = 964 participants. To put these correlations into perspective we looked at the correlation between two subsequent Cito scores. The correlation between the Cito mid-year and end-of-year scores for 2007–2008 was .95. This indicates that our correlations can be considered fairly high. Figure 4 displays the relation between test scores. The numbers indicate the regression line for each grade.


Figure 4 Correlation between Math Garden rating for the domains addition, subtraction, multiplication, and division and the norm-referenced Cito scores (mid 2008). Included are regression lines for each grade (Dutch grades) indicated by grade numbers.

We also studied the validity by comparing the mean ability ratings of children in different grades. We expected a positive relation between grade and ability. Figure 5 shows the average ability rating for each grade and domain. As expected, children in older age groups had a higher rating than children in younger age groups. In all four domains, there is an overall significant effect of grade: addition, F(5, 1456) = 1091.4, p < .01, ω² = .78; subtraction, F(5, 1363) = 780.5, p < .01, ω² = .74; multiplication, F(5, 1215) = 409.6, p < .01, ω² = .62; and division, F(5, 973) = 223.31, p < .01, ω² = .53. Levene’s tests show differences in variances for the domains multiplication and division. However, the non-parametric Kruskal–Wallis tests also show significant differences for these domains: χ²(5) = 753.28, p < .01 for multiplication and χ²(5) = 505.17, p < .01 for division. For all domains, post hoc analyses show significant differences between all grades, except for the differences between grades five and six.

[Figure 4: four scatter plots of Cito norm score against Math Garden rating, one per domain: rating addition (r = 0.83), rating subtraction (r = 0.84), rating multiplication (r = 0.8), rating division (r = 0.78). Regression lines are labelled with grade numbers 4 to 8.]

Figure 5 Rating per domain and age group.

Reliability

One way to assess reliability is to compare children’s ratings across domains. Since the domains involve related operations we expect high correlations between them. The correlations between the four domains, addition, subtraction, multiplication, and division, vary between .67 and .88, all significant at p < .01, indicating fairly high correlations. Another relatively simple way to assess the reliability of Math Garden is to construct parallel tests. We can compare item difficulty β’s of so-called mirrored items (e.g., 7 + 4 and 4 + 7, or 2 × 4 and 4 × 2) for the domains addition (N = 48) and multiplication (N = 81). Mirrored items should have very similar β’s. Figure 6 shows the correlation of the mirrored item β’s for the domains addition and multiplication. These correlations are .88 and .98 (p < .01), respectively, indicating a high reliability of these item sets.

Figure 6 Scatter plot of difficulty β’s of mirrored items. Included are some example items, indicated with black dots.

[Figure 5: four panels (+, −, ×, :) showing Rating against grade (1 to 6).]

[Figure 6: two scatter plots. "Correlation m+n, n+m": Rating item (m+n) against Rating mirror item (n+m). "Correlation m×n, n×m": Rating item (m×n) against Rating mirror item (n×m). Example items are labelled in both panels.]


Besides difficulty β’s, we can also compute the discriminatory power of items, which indicates how well the item discriminates low-ability from high-ability subjects. We estimated these so-called a-parameters by using a logistic regression analysis on the accuracy responses predicted by the difference in rating between item and respondent. As in the preceding analysis, we compared the discriminatory power between mirrored items. The scatter plots in Figure 7 show rather high, significant (p < .01) positive correlations. The correlations for addition and multiplication are .74 and .71, respectively.
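The a-parameter estimation can be sketched as a one-predictor logistic regression. The Python sketch below assumes a model without intercept, P(correct) = 1 / (1 + exp(−a(θ − β))), fitted per item by Newton-Raphson; the text does not state whether an intercept was included, and the function name `fit_discrimination` is hypothetical.

```python
import math
import random


def fit_discrimination(rating_diffs, responses, n_iter=25, tol=1e-8):
    """Maximum-likelihood slope a for P(correct) = logistic(a * (theta - beta)).

    rating_diffs: respondent rating minus item rating, one value per response.
    responses:    1 for a correct answer, 0 for an incorrect one.
    """
    a = 0.0
    for _ in range(n_iter):
        grad = 0.0   # first derivative of the log-likelihood with respect to a
        curv = 0.0   # minus the second derivative (observed information)
        for d, x in zip(rating_diffs, responses):
            p = 1.0 / (1.0 + math.exp(-a * d))
            grad += d * (x - p)
            curv += d * d * p * (1.0 - p)
        if curv == 0.0:
            break
        step = grad / curv
        a += step
        if abs(step) < tol:
            break
    return a


# Synthetic check: responses generated with a known slope of 0.8.
rng = random.Random(0)
diffs = [rng.gauss(0.0, 1.5) for _ in range(5000)]
answers = [1 if rng.random() < 1.0 / (1.0 + math.exp(-0.8 * d)) else 0 for d in diffs]
a_hat = fit_discrimination(diffs, answers)
```

Fitting each item of a mirrored pair separately and correlating the two sets of estimates gives a comparison of the kind reported in Figure 7.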

Figure 7 Scatter plot of discriminatory a-parameters for mirrored items.

As a final test of reliability, we investigated the stability of the difficulty ratings β. A high correlation between the β values of items at two time points far apart indicates high reliability. We would therefore expect the ratings of a stable item bank to correlate highly over time. We first looked at the correlation between the item β ratings, as they were set at the start of the project (week 36), and the item ratings in all subsequent weeks. In Figure 8, this correlation is shown by the solid line. Clearly, the initial ratings, set on the basis of an analysis of math materials used by the schools, were quite good, as the correlation between initial ratings and the ratings after 40 weeks is still .85. Secondly, we also correlated established item ratings in week 44 with all item ratings in subsequent weeks (dotted line). This shows that these established ratings are very stable, as the correlations in all 32 weeks stay above .95.

[Figure 7: two scatter plots. "a-parameter correlation n+m, m+n": a-parameter m+n against a-parameter n+m. "a-parameter correlation n×m, m×n": a-parameter m×n against a-parameter n×m.]

Figure 8 Stability of item ratings for initial ratings (solid line) and established ratings after 2 months (dotted line). The x-axis displays week numbers (v = vacation). Correlations are computed over active (played) items in each week ($N_i$ = number of administered items).

Item reuse

As a result of the longitudinal nature of the Math Garden system, items are presented to the same child more than once. Although the system ensures that at least 20 other items are administered before an item is reused, this reuse may present a threat to the assumption of local independence (e.g., the response to an item must not depend on the previous response to the same item). To test this, we performed regression analyses with both the number of items and the amount of time between two presentations of the same item to the same child as predictors for the child’s performance on that item. The child’s performance was measured by subtracting the expected score E(Si) from the actual score Si. If there is an item-specific learning effect, a child who encounters an item for the second time is likely to have a higher than expected score for that item. We selected pairs of data points that represented subsequent presentations of the same item to the same child. We selected the data so that no child contributed more than one pair of data points, resulting in N = 478 pairs of data points. Because item-specific learning effects are logically more likely to occur if there is a small amount of time between two presentations of the same item to the same child, we removed 90 data points with more than 30 minutes between the two presentations of the item. A regression analysis with this dataset shows no main effect for either the number of items or the amount of time between two presentations of the same item to the same child: number of items, R² < .001, F(1, 476) = 0.39, p = .53, and amount of time, R² < .001, F(1, 476) = 0.0072, p = .93.
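The regression check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are synthetic stand-ins rather than Math Garden data, the names `simple_regression`, `items_between` and `score_residual` are hypothetical, and a single-predictor least-squares fit is assumed to be an adequate reading of "regression analysis" here.

```python
import numpy as np


def simple_regression(x, y):
    """Least-squares fit of y on one predictor x.

    Returns (slope, intercept, r_squared, f_statistic); the F statistic has
    (1, n - 2) degrees of freedom.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r_squared = 1.0 - ss_res / ss_tot
    f_statistic = r_squared / ((1.0 - r_squared) / (n - 2))
    return float(slope), float(intercept), r_squared, f_statistic


# Synthetic stand-in data (NOT the Math Garden data): one pair per child.
rng = np.random.default_rng(0)
items_between = rng.integers(20, 200, size=478)   # items played between the two presentations
score_residual = rng.normal(0.0, 0.3, size=478)   # actual minus expected score on the repeat

slope, intercept, r_squared, f_statistic = simple_regression(items_between, score_residual)
print(f"R^2 = {r_squared:.4f}, F(1, {items_between.size - 2}) = {f_statistic:.2f}")
```

If reuse produced an item-specific learning effect, the residual would be expected to be larger when fewer items (or less time) separate the two presentations, which would show up as a non-negligible R² and F.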

Math Garden aims

In order to keep children motivated, items were sampled so that children solved about 75% of the items successfully. However, in the first few months we imposed a success rate of 70%. Figure 9a shows the proportion of correctly answered items per grade and domain.

[Figure 8 plot: "Played item correlation across weeks". x-axis: week numbers 2008-2009 (v = vacation), with the number of administered items Ni printed per week; y-axis: Spearman's ρ correlation of βw and β>w, from 0.80 to 1.00; two series, ρ for w = 35 and ρ for w = 44.]


Only the results of the children who answered more than fifteen items were included in the graph. The graphs show that the proportion of correctly answered items varied between .6 and .8 for most children. The proportion correct seems to be somewhat lower for subtraction and lower still for multiplication and division. At the start of this project, the domains multiplication and division were briefly available for the lower age groups. This resulted in a lot of question mark use in these domains. To counter this unwanted effect we made the availability of these domains dependent on the proficiency in addition and subtraction. In total, the amount of question mark use in the math games was about 7.3%. Filtering out the question mark responses (Figure 9b) results in considerably higher proportions correct.
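To make the idea of sampling items for a target success rate concrete, here is a minimal Python sketch. It assumes a plain logistic relation between ability, difficulty and probability correct, picks the nearest item deterministically instead of sampling, and uses hypothetical names (`target_difficulty`, `pick_item`, `bank`) and a hypothetical scaling constant `a`; the actual Math Garden selection procedure may differ on all of these points.

```python
import math


def target_difficulty(theta, p_target=0.75, a=1.0):
    """Difficulty at which a logistic model predicts success probability p_target."""
    return theta - math.log(p_target / (1.0 - p_target)) / a


def pick_item(theta, item_difficulties, recently_played=(), p_target=0.75, a=1.0):
    """Return the id of the eligible item whose difficulty is closest to the target."""
    target = target_difficulty(theta, p_target, a)
    eligible = {item: beta for item, beta in item_difficulties.items()
                if item not in recently_played}
    if not eligible:
        raise ValueError("no eligible items left")
    return min(eligible, key=lambda item: abs(eligible[item] - target))


# Hypothetical item bank: item id -> difficulty rating.
bank = {"3+4": -1.0, "8+6": 0.5, "1+1": -3.0}
print(pick_item(0.0, bank))                            # 3+4
print(pick_item(0.0, bank, recently_played={"3+4"}))   # 8+6
```

The `recently_played` argument stands in for the rule that an item is not reused until other items have been administered.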

Figure 9 Proportion correct per grade and domain. a) Question mark response included; b) question mark response excluded. [Plot panels per domain (+, −, ×, :); x-axis: grade (1 to 6); y-axis: proportion correct (0.5 to 1.0).]

One of the aims of Math Garden was that it should be a challenging web environment for children of all competency levels. The usage statistics can answer the question whether children are motivated to play the math games. They provide an indication of how attractive and challenging the children found the math games. It is possible that children visit the Math Garden site mainly because their teachers told them to. To assess how intrinsically motivated the children were to play the games, we looked at the days and hours that children played in Math Garden. Figure 10 (top) shows the number of solved arithmetic problems per day of the week and Figure 10 (bottom) shows the number of solved items per hour of the day. Not surprisingly, most problems were solved from Monday to Friday and between 9.00 a.m. and 3.00 p.m. However, both graphs also show that a considerable number of problems were solved after school hours and during the weekends. In fact, 33.2% of all problems were solved outside school hours.

Figure 10 Playing frequency during the week and during the day.

To investigate whether competency had any effect on motivation, we looked at the relation between ability and playing frequency. Only data of children who solved 15 or more problems were included to ensure accuracy of the ability estimates. We found only low but significant (p < .01) correlations between ability level and playing frequency for all domains. The correlations for the domains addition, subtraction, multiplication, and division were −.15, −.12, −.05, and .09, respectively. The playing frequency does not appear to depend substantially on the competency level of the children.

[Figure 10 plot: two panels, "Frequency per weekday" (Monday to Sunday) and "Frequency per hour" (hours 1 to 24); y-axis: frequency × 1000 (0 to 500); one series per domain (:, ×, −, +).]

Diagnostic ability

We will briefly demonstrate the diagnostic and tracking ability of Math Garden by considering a few examples. Using the high frequency dataset, we were able to provide individual and group diagnostics. Figure 11 shows the percentage of typical errors a given child had made (bars) compared to the percentage of these errors made by children of the same grade (solid line). We can see, for instance, that this child makes significantly more zero errors (400 − 200 = 380) for the domain subtraction than other children in the same grade. We provided teachers with such graphs for individuals and groups of individuals (e.g., for the whole class).
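As a rough illustration of how wrong answers could be mapped to error categories like those listed in the legend of Figure 11, the Python sketch below checks a subtraction answer against a few simple rules. The rules are inferred from the legend's examples (7 − 4 = 11, 18 − 5 = 31, 93 − 78 = 25, 93 − 78 = 14, 93 − 78 = 13); they are assumptions for illustration only, not the classification rules used in Math Garden, and all function names are hypothetical.

```python
from collections import Counter


def digitwise_abs_difference(a, b):
    """Subtract digit by digit, always taking larger minus smaller (no borrowing)."""
    width = max(len(str(a)), len(str(b)))
    digits_a, digits_b = str(a).zfill(width), str(b).zfill(width)
    return int("".join(str(abs(int(x) - int(y))) for x, y in zip(digits_a, digits_b)))


def classify_subtraction_error(a, b, answer):
    """Label the answer given to a - b (assumes non-negative integers with a >= b)."""
    correct = a - b
    if answer == correct:
        return "correct"
    if answer == a + b:
        return "addition"
    if answer == int(str(correct)[::-1]):
        return "mirror error"
    if answer == digitwise_abs_difference(a, b):
        return "reversibility error"
    if abs(answer - correct) == 1:
        return "counting error 1"
    if abs(answer - correct) == 2:
        return "counting error 2"
    return "unknown"


def error_percentages(responses):
    """Percentage of each label over a list of (a, b, answer) triples."""
    counts = Counter(classify_subtraction_error(a, b, ans) for a, b, ans in responses)
    return {label: 100.0 * n / len(responses) for label, n in counts.items()}


# Hypothetical responses of one child.
responses = [(93, 78, 25), (93, 78, 15), (18, 5, 31), (7, 4, 11)]
print(error_percentages(responses))
```

Computing the same percentages over all children in a grade would give the comparison line against which an individual child's bars are read.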

Detailed analysis of the item difficulties provides us with insight into sources of item difficulty. Some interesting results have emerged. For example, multiplications of one-digit numbers by 10 or even 100 (e.g., 7 × 100) are among the 10% easiest items for this domain. In division it appears that items of the type nn/n (e.g., 77/7) are also very easy (again among the 10% easiest items). In Chapter 3 we tested how well all kinds of item effects, previously studied in isolation, predict item difficulty. The combined item effects, such as problem size, ties, and the 5 effect, explained 90% of the variance in the difficulty of simple multiplication items.

Figure 11 Error analysis of answers to subtraction problems of a child in grade 5. Bars display the

percentage of errors for this child in a specific week. The lines display the percentage of errors made

by other children in Math Garden (dotted line) and by other children in grade 5 (solid line).

[Figure 11 plot. Header: frequency 23, grade 5. x-axis: % of errors (0 to 14). Error categories, with example answers: use of question mark; too slow; adding units (42 − 23 = 25); reversibility + borrow error (68 − 29 = 31); reversibility error (93 − 78 = 25); counting error 2 (93 − 78 = 13); counting error 1 (93 − 78 = 14); borrow error (80 − 29 = 61); mirror error (18 − 5 = 31); addition (7 − 4 = 11); digit forgotten (95 − 75 = 25); −0 = 0 error (7 − 0 = 0); 0 forgotten (3000 − 2000 = 100); 0 error (9000 − 5000 = 8500); position error (22 − 10 = 21); unknown. Legend: significant, non significant, grade mean, total mean.]

A window on developmental change

The high frequency measurements, combined with the size of the sample, provide unique insights into arithmetic development and learning trajectories of children. In Math Garden, trend analyses are provided to teachers. Figure 12 shows the progress of a single child compared to all other children in the same age group. Teachers can use this information to consider interventions. As can be seen in the graph, this child started out having an average rating and a flat growth curve. By week 45 this child started to acquire the necessary ability and by week 49 the child was in the top 25% of all children.

Figure 12 Progress chart of a child in grade 6 (black line), in comparison to the mean of grade 6

(dotted line).

At micro level it is even possible to study the learning pattern of one child on a specific item over time. For example, in Figure 13 we see the answers and response times of two children on two items across weeks. In the top graph of Figure 13 we see an individual who did not know the answer to the math question 9 × 9 and answered with a question mark in about 5 to 10 seconds on the first ten occasions. Then there were two mistakes in which the child joined the two digits (answering 99) instead of multiplying. However, in the next attempt the question was answered correctly but more time was needed to respond. From this point on, the ability level seems sufficient for consistent correct and speedier answers. The bottom graph of Figure 13 shows a lucky guess in the first week (third trial) followed by a gradual gain in insight. Halfway through week 42 this child started answering correctly more often but with highly varying response times. At the end of week 44 the response time dropped. Note that errors keep occurring occasionally. These examples illustrate the level of detail that is possible in the analysis of Math Garden data.

[Figure 12 plot: x-axis: week numbers 2008-2009 (v = vacation); y-axis: rating (0 to 10); reference lines labelled 10%, 25%, 75% and 90%.]


Figure 13 Response time pattern for two children on different items during a number of weeks

(x-axis). The y-axis indicates the response time in seconds. The answer is displayed in the graph. The

question mark answer means that the child pressed the ‘?’ button.

Discussion

In this chapter we presented and tested a new model for computerized adaptive practice and monitoring. The results concerning the validity and reliability are promising. The high correlations with the norm-referenced Cito scores indicate high criterion validity. The increase in player ability rating across grades also supports this, although the children in grades 5 and 6 did not seem to differ. This is probably due to the fact that in the domains we tested no new mental arithmetic techniques are taught in grade 6.

By simulation, we compared measurement precision and measurement bias of CAP to standard CAT. For easy items, the use of the HSHS scoring model, which combines speed and accuracy, together with the Elo rating system (ERS) resulted in less loss in measurement precision and less bias than found in standard CAT estimation. The ERS combined with the 1PL model, using only accuracy data, resulted in worse estimations. Concerning the items and the item bank, we found that difficulty ratings converge in about eight playing weeks, resulting in consistent difficulty ratings across time. High reliability is also indicated by the high correlations of the difficulty and discrimination parameters between sets of mirrored items. We have not found any indication of learning effects caused by the reuse of items, which indicates that the assumption of local independence has been met for reuse of items. However, in other learning domains this issue still requires careful consideration.

The fit statistics for the HSHS model are still in development, and are therefore not included in the results section of this chapter. Evaluation of the goodness of fit for IRT models is an active area of research, and so far definite solutions are lacking (Embretson & Reise, 2000). Some of the relevant issues (Hambleton, Swaminathan, & Rogers, 1991) concern the sensitivity of the chi-square fit statistic to sample sizes, technical issues in the testing of dimensionality (Hattie, 1984, 1985), and the testing of the assumption of local independence. Evaluating the fit of IRT models is more complicated still in the context of computer adaptive testing, due to the inherently incomplete item-person data matrix. An alternative approach to comprehensive model fitting consists of checking model assumptions, and establishing reliability and validity (Hambleton & Swaminathan, 1984). Here we have confined ourselves to this alternative approach.

[Figure 13 plot: two panels, "Answer to item: 9 x 9 and response time of a grade 3 child" and "Answer to item: 21 : 3 and response time of a grade 3 child"; x-axis: week numbers; y-axis: response time (sec.), 0 to 20; each data point is labelled with the answer given (? = question mark).]

We can conclude that children were motivated to play the math games. The frequency

data demonstrated that children played a lot outside school hours. Children with a lower

math ability did not play appreciably less, which suggests that they found the math games

as motivating as high ability children did. We demonstrated that Math Garden has many

possibilities as a diagnostic tool. The error analysis can provide teachers with valuable

insight into the kind of errors that individual pupils make. This information can be used to

optimize interventions. The current dataset, consisting of a large number of individual

high-frequency time series, allows for many further investigations of difficulty effects (Chapter

3), strategy patterns in mathematical problem solving, and individual learning trajectories.

The item ratings also provide insight into what we call informal learning paths. Because

of the adaptive item ratings, we gain an on the fly insight into the difficulty of arithmetic

problems. Some items turned out to be unexpectedly easy. For instance, 8 + 6, 5000 + 5 and

50 + 60 were almost equally difficult whereas 8 + 6 is taught much earlier on in the Dutch

curriculum than the other two addition problems. This kind of information can be used to

determine the curriculum (i.e., what is taught) in each grade.

One of the problems with the Elo rating system is the occurrence of rating inflation and deflation (Glickman, 1999), which we call drift. In educational applications, one source of drift is that new young players start with low ratings and stop playing when they leave school with high ratings. This causes a systematic downwards drift in item rating and, as a consequence, lowers person ratings. This does not seem to jeopardize the operation of Math Garden, since drift influences player and item ratings simultaneously. The main problem lies in the interpretation of the rating. Rating points cannot be accurately compared following inflation or deflation. Therefore we present transformed ratings to teachers and users to prevent interpretation problems.


Transformation is conducted by calculating the average probability correct for a single user on all items in the domain, as shown in equation 10:

$$P = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{1 + e^{-a(\theta_j - \beta_i)}} \qquad (10)$$

This value is an estimation of the percentage of items in the domain that the user is able to answer correctly. We also reduced drift by incorporating the rating uncertainty in calculating the K factor, which minimizes the influence of unreliable person and item estimations on the updating process. A related issue is the convergence speed. This is the time or number of responses needed to get a stable rating. We set the rating uncertainty parameters of the K factor, which determine the convergence speed, on the basis of extended testing. A better approach would perhaps be to estimate the uncertainty based on aberrant response patterns, where unexpected responses are used as an indication of unreliability.
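The transformed rating of equation 10, the average logistic probability correct over all items in a domain, can be sketched in Python as below. The function name, the example ratings and the default value `a = 1.0` are assumptions made for illustration; the sketch also shows why the transformed value is unaffected when user and item ratings drift by the same amount.

```python
import math


def transformed_rating(theta_j, betas, a=1.0):
    """Average logistic probability correct of one user over all items in a domain."""
    if not betas:
        raise ValueError("betas must contain at least one item difficulty")
    total = sum(1.0 / (1.0 + math.exp(-a * (theta_j - beta_i))) for beta_i in betas)
    return total / len(betas)


# Hypothetical ratings on an arbitrary scale.
item_difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
user_rating = 0.5

before = transformed_rating(user_rating, item_difficulties)

# Simulate drift: the user rating and every item rating shift by the same amount.
drift = -0.8
after = transformed_rating(user_rating + drift, [b + drift for b in item_difficulties])

print(f"{before:.3f} {after:.3f}")  # the two values agree up to rounding
```

Because only the differences between user and item ratings enter the formula, a common shift cancels out, which is what makes the transformed value easier to interpret than raw rating points.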

A last issue concerns the one-dimensionality of the math domains. In practice, every test

and item bank is expected to violate the assumption of one-dimensionality to some degree.

Though we see no immediate effects on ability estimation, the question of how robust the

HSHS Elo algorithm is to violation of this assumption needs further investigation. We also

intend to further address the possible individual differences between children and how the

HSHS scoring rule affects their behavior.

In conclusion, Math Garden meets the requirements we set for the practice and progress

monitoring system. It is worth noting that although the new CAP algorithm is implemented

in the domain of math, the system can be applied to all kinds of learning domains. In the

2010 release of Math Garden more games, e.g. fractions, have been added and a language

garden is in development. Also, the number of schools using Math Garden continues to

grow steadily (about 150 in October 2010), yielding about 50 thousand responses per day.

We expect a fast adoption of computers, such as handhelds, minicomputers and tablets,

in primary schools in the next 5 years. If children do their daily exercises in practice and

progress monitoring systems using these devices, we expect many benefits for students,

teachers, and scientists.

