UvA-DARE is a service provided by the library of the University of Amsterdam (http://dare.uva.nl) UvA-DARE (Digital Academic Repository) Math Garden: A new educational and scientific instrument Straatemeier, M. Link to publication Citation for published version (APA): Straatemeier, M. (2014). Math Garden: A new educational and scientific instrument. General rights It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons). Disclaimer/Complaints regulations If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible. Download date: 12 Feb 2020


Chapter 2

Computer adaptive practice of math ability using a new item response model for on the fly ability and difficulty estimation

This chapter has been published as:

Klinkenberg, S., Straatemeier, M., & Van der Maas, H. L. J. (2011). Computer adaptive practice of maths ability using a new item response model for on the fly ability and difficulty estimation. Computers & Education, 57, 1813-1824.

Abstract

In this chapter we present a model for computerized adaptive practice and monitoring. This model is used in Math Garden, a web-based monitoring system, which includes a challenging web environment for children to practice arithmetic. Using a new item response model based on the Elo (1978) rating system and an explicit scoring rule, estimates of the ability of persons and the difficulty of items are updated with every answered item, allowing for on the fly item calibration. In the scoring rule both accuracy and response time are accounted for. Items are sampled with a mean success probability of .75, making the tasks challenging yet not too difficult. In a period of ten months our sample of 3648 children completed over 3.5 million arithmetic problems. The children completed about 33% of these problems outside school hours. Results show better measurement precision, high validity and reliability, high pupil satisfaction, and many interesting options for monitoring progress, diagnosing errors and analyzing development.

Introduction

In this chapter we present a computerized adaptive practice (CAP) system for monitoring arithmetic in primary education: Math Garden. Math Garden is a web-based computer adaptive practice and monitoring system based on weekly measurements. In recent years math abilities of Dutch students have been widely debated. This is mainly due to the results of the National Periodical Education Polls (PPON). These results show that few children reach the required math level at the end of their primary education (Kraemer, Janssen, Van der Schoot, & Hemker, 2005). Based on these findings a parliamentary inquiry into Dutch education was initiated. Both the committee “Dijsselbloem” (2008) and the expert group “Doorlopende Leerlijnen” (2008) recommended several improvements to the Dutch education system in general and math education in particular. Recommendations included the provision of more time to practice and maintain basic math skills, more efficient and effective measurement in education, and the use of these measurement results to improve the ability of individual students, the classroom and education in general. These recommendations are also supported by Fullan (2006), who claimed that acting on data is critical for learning from experience.

Combining practice and measurement

In the light of these recommendations we propose to combine practice and measurement in a playful manner using computerized educational games. We expect that in the near future children will increasingly use mini computers and handheld devices to do their daily exercises in arithmetic, spelling, and other subjects. The use of computers has two main advantages. First, the input can be analyzed automatically and feedback can be given immediately, which will free teachers from checking and correcting the children’s exercise books. The recorded and automatically analyzed data can provide teachers with detailed information on children’s progress and the errors they make. Teachers can use this information to optimize individual instruction. The information concerning the child’s progress and abilities, which is accumulated over time, may ultimately obviate the need to conduct tests and examinations. Second, by using computers it is possible to let children practice at their individual ability level. Research on the development of expertise performance has shown that people do improve their performance considerably if they regularly do specific exercises that are adjusted to their ability level and include immediate feedback. In the development of Math Garden we follow these ideas developed in sports and expertise training, especially the idea of deliberate practice (Ericsson, 2006, pp. 683–703).

Three problems of CAT

To implement individualized practice, we apply the technique of computer adaptive testing (Van der Linden & Glas, 2000; Wainer, 2000). Computer adaptive testing (CAT) is based on item response theory (IRT). This theory consists of statistical models that relate item responses to the (latent) abilities that the items measure (Lord & Novick, 1968). A large collection of item response models is available, but these are all basically variations on the simplest model, i.e., the one-parameter logistic (1PL) model or Rasch model (Rasch, 1960). In the Rasch model the probability of a correct or affirmative answer is a logistic function of the difference between the ability of the subject and the difficulty of the item. In the two-parameter logistic model, the difference is weighted by an item discrimination parameter, which has a high value when an item discriminates well between low and high ability subjects. Item response models can be used for equating tests, to detect and study differential item functioning (bias), and to develop computer adaptive tests (Van der Linden & Hambleton, 1997). The idea of CAT is to determine the ability level of a person dynamically. In CAT, item administration depends on the subject’s previous responses. If the preceding item is answered correctly (incorrectly), a more (less) difficult item is presented. Hence, each person is presented a test tailored to his or her ability. Using CAT, test length can be shortened up to 50% (Eggen & Verschoor, 2006). Originally, CAT was developed for measurement only. Our aim to combine practice and measurement raises several novel issues. We distinguish the following three issues.

First, in standard CAT the parameters of the items, especially the difficulty, have to be known in advance to test adaptively. Items therefore have to be “pre-calibrated” before they can be used in real test situations. This means that a large representative sample of the population has to have answered the items in the item bank to provide the information for item calibration. The difficulty of the items is determined using the data of this sample. This method is obviously time-consuming and costly, especially as the calibration has to be carried out repeatedly (e.g., every few years) to acquire accurate norm referenced item parameters.

Second, CAT operates most effectively if the difficulty level of administered items equals the ability estimate of the person. The probability of answering such items correctly is .5. However, for most children and many adults the success rate associated with a .5 probability is experienced as discouraging. Research by Eggen and Verschoor (2006) showed that increasing this probability to above .7 greatly reduces measurement precision. Given a .7 probability, more items need to be administered to obtain an accurate estimate of person ability. This requirement reduces the efficiency of computer adaptive testing.

The third problem concerns a testing problem that applies to psychological and educational measurement in general, namely, the trade-off between speed and accuracy. Without explicit instructions, participants in tests and experiments are free to balance speed and accuracy as they wish. Consequently the trade-off between speed and accuracy can be a source of large individual differences. The current solution in psychometrics (Van der Linden, 2007) and experimental psychology (Ratcliff & Rouder, 1998; Vandekerckhove & Tuerlinckx, 2008) is to estimate person parameters involved in this trade-off on the basis of the data. However, this procedure requires large amounts of high quality data.
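The Rasch and two-parameter logistic models described above, and the precision cost of easier items raised in the second problem, can be illustrated with a minimal sketch. It assumes the usual natural-logarithm form of the logistic function and uses the standard IRT result that a single item carries Fisher information a²p(1 − p); neither the function names nor this information formula are taken from the chapter itself.

```python
import math

def p_correct(theta, beta, a=1.0):
    """Probability of a correct answer: Rasch/1PL when a == 1, 2PL otherwise."""
    return 1.0 / (1.0 + math.exp(-a * (theta - beta)))

def item_information(theta, beta, a=1.0):
    """Fisher information of one item (standard IRT result, assumed here)."""
    p = p_correct(theta, beta, a)
    return a * a * p * (1.0 - p)

def offset_for_success(p):
    """Ability-minus-difficulty gap (in logits) giving success probability p under Rasch."""
    return math.log(p / (1.0 - p))

theta = 0.0  # hypothetical ability value, for illustration only
for target in (0.5, 0.75):
    beta = theta - offset_for_success(target)  # item difficulty that yields the target rate
    print(target, round(item_information(theta, beta), 4))
# prints: 0.5 0.25
#         0.75 0.1875
```

Under these assumptions an item answered correctly with probability .75 carries three quarters of the information of an item answered with probability .5, which is one way to read the loss of precision reported by Eggen and Verschoor (2006).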

New CAT

We developed an extended CAT approach to solve these problems. This Computer Adaptive Practice (CAP) system provides the basis of Math Garden. The CAP system includes the following two innovations. First, we have applied a new estimation method based on the Elo (1978) rating system (ERS) developed for chess competitions. The ERS allows for on the fly estimation of item difficulty and person ability parameters. With this method, pre-testing is no longer required. Second, we have used an explicit scoring rule for speed and accuracy, which is known to the subject during the test. Inclusion of speed in the scoring has the advantage that we acquire more information about ability. Research by Van der Maas and Wagenmakers (2005) showed that in the responses to easy chess items there is a strong negative relation between response time and ability. Subjects tend to answer easy items correctly, but more advanced subjects answer them more quickly. As a consequence, by integrating response time into the estimation of ability, we can decrease the difficulty of administered items with less loss of measurement precision than noted by Eggen and Verschoor (2006). In addition we expect the higher success rate to increase the motivation of children during the test.

In the Methods section we describe Math Garden, the Elo algorithm and the new scoring rule in more detail. In the Results section of this chapter we test the working of Math Garden. We present evidence for high validity and reliability of ability and difficulty estimation, the motivational value of Math Garden, and its usefulness as a diagnostic and monitoring instrument.

Methods

Participants

A total of 35 primary schools, eight remedial teachers and 32 families participated in this study, comprising N = 3648 active participants. Also, 334 aspiring kindergarten pupils joined Math Garden. In the period from August 2008 to early June 2009 more than 3.5 million arithmetic problems were answered in our sample. In addition to the responses, we registered the gender, age and grade of the participants. Table 1 shows the mean age with standard deviation and the number of children for each grade.

Materials

The main measurement tool used in this study is the web-based practice and monitoring system we developed: Math Garden. The student interface consists of a garden containing distinct flowerbeds, representing, among others, the four domains: addition, subtraction, multiplication, and division (Figure 1a) on which we focus in this chapter. The size of the flowers represents the math ability of the student. By clicking on a flowerbed the math game is started for a specific domain and the student can start playing.

Table 1. Age, gender and N per grade

| Grade | Age category (years) | N | Mean age (SD) | % Male | % Female |
|---|---|---|---|---|---|
| Kindergarten | 4-5 | 103 | 4.32 (0.51) | 50.49 | 49.51 |
| Kindergarten | 5-6 | 231 | 5.45 (0.51) | 47.19 | 52.81 |
| 1 | 6-7 | 529 | 6.61 (0.51) | 53.50 | 46.50 |
| 2 | 7-8 | 681 | 7.69 (0.54) | 55.21 | 44.79 |
| 3 | 8-9 | 526 | 8.68 (0.79) | 47.91 | 52.09 |
| 4 | 9-10 | 513 | 9.70 (0.61) | 47.24 | 52.76 |
| 5 | 10-11 | 574 | 10.79 (0.60) | 49.48 | 50.52 |
| 6 | 11-12 | 416 | 11.80 (0.57) | 50 | 50 |
| Secondary education | > 12 | 75 | 13.33 (3.94) | 64 | 36 |

The visual interface of the math task consists of a math question, six answer options (addition and subtraction) or a number pad (multiplication and division), a coin bag, a question mark, a stop sign, and an elapsing coin bar that indicates the time left on the item (Figure 1b). The game rules are intuitive and therefore only require minimal explanation on the website. Students gain points (coins), displayed at the bottom of the game interface, by answering items correctly and lose coins when answering incorrectly. With each item a total number of twenty coins, corresponding to the maximum time in seconds, can be won or lost. Every second one coin disappears. The remaining coins are added to the coin bag if the item has been solved correctly and are subtracted if solved incorrectly. If the time limit has expired or the question mark has been clicked, no coins are lost or won. The rationale of this scoring rule is explained in the psychometrics section. A session consists of fifteen items after which the math game terminates and the student is returned to his or her garden. The flowers will start growing according to the progression that has been made. Students are motivated by two reward systems. Good performance is rewarded by growing flowers and virtual coins.¹ The Math Garden website contains a dedicated area, the prize cabinet, where virtual prizes can be bought with the earned coins. Another way students are motivated to continue playing in Math Garden is that the flowerbeds wither if the student does not play. Withering worsens over time and can only be undone by completing a new session of 15 items.

¹ Because of the adaptive nature of the test, every student has roughly the same percentage correct. Hence the number of coins won reflects only how often a student plays and not his arithmetic level.

Figure 1. The main Math Garden interface (a) and an addition item (b).

The four domains, addition, subtraction, multiplication, and division, contain 738, 723, 659, and 664 items, respectively. The items in the four domains cover the curriculum in primary education. They vary from easy (e.g., 3 + 4 with response options: 7, 8, 6, 1, 9 and 12) to difficult (e.g., 7.34 + 311.4 with response options: 318.74; 318.38; 318.47; 317.74; 319.74 and 318.34). The response options are selected to be informative distracters. Open-ended items were used in the games multiplication and division. Variables measured by the task are response time, the given answer, the correctness (0, 1) and a timestamp at administration.
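The coin rule of the game interface described in this section (twenty coins per item, one coin disappearing per second, the remainder won or lost) can be sketched as follows. This is only an illustration: the function name, the `skipped` flag for a click on the question mark, and the choice to count elapsed time in whole seconds are assumptions, not details given in the chapter.

```python
MAX_TIME = 20  # seconds allowed per item; also the maximum number of coins at stake

def coins(correct, response_time, skipped=False):
    """Coins won (positive) or lost (negative) on a single item.

    Assumption: one coin disappears per *completed* second, so the
    elapsed time is rounded down before counting the remaining coins.
    """
    if skipped or response_time >= MAX_TIME:
        return 0  # question mark clicked or time expired: nothing won or lost
    remaining = MAX_TIME - int(response_time)
    return remaining if correct else -remaining

# A hypothetical run of three items: fast correct, slow error, skipped.
session = [(True, 3.4, False), (False, 15.0, False), (False, 6.0, True)]
print(sum(coins(c, t, s) for c, t, s in session))  # prints 12 (17 - 5 + 0)
```

A fast correct answer earns the most and a fast error costs the most, which is the property of the scoring rule that the psychometrics section builds on.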

We studied the validity of the data by comparing the ability estimate, measured with Math Garden, with students’ scores on the math tests from the pupil monitoring system (Janssen & Engelen, 2002) of the National Institute for Educational Measurements (Cito). In the Cito monitoring system math tests are administered twice a year from mid grade 1 until mid grade 6. These tests assess the knowledge and skills that are being taught in these grades. The tests contain both open-ended and forced-choice items. Students’ scores on the math test (the total of correct answers) are transformed to a score on a norm-referenced general math ability scale. This allows one to compare students’ scores from different grades using one scale.

Psychometrics

Elo rating system

In chess the Elo (1978) rating system (ERS) is used to estimate the relative ability of a player. The ERS is a dynamic paired comparison model, which is mathematically closely related to the Rasch IRT model (Batchelder & Bershad, 1979). Initially chess players are given a provisional ability rating θ, which is incrementally updated (see equation (1)) based on match results (in chess 0, .5 and 1, for loss, draw and win outcomes). The updated ability estimate $\hat{\theta}$ (signified by the hat) depends on the weighted difference in match result S and expected match result E(S):

$$\hat{\theta}_j = \theta_j + K\,(S_j - E(S_j)), \qquad \hat{\theta}_k = \theta_k + K\,(S_k - E(S_k)) \tag{1}$$

The expected match result is a function of the difference between the ability estimates of both player j and k preceding the match and expresses the probability of winning (see equation (2)):

$$E(S_j) = \frac{1}{1 + 10^{(\theta_k - \theta_j)/400}} \tag{2}$$

The K factor in equation (1) weights the impact of the deviation from expectation on the new ability estimate. This value essentially determines the rate at which θ can change over matches. In the standard ERS the K factor is constant. Glickman (1995) argued that not all ability ratings are estimated accurately by the ERS update function (eq. 1). Inaccuracies mostly occur when players are new or have not played for an extended period of time, resulting in much uncertainty in their ability rating θ. Glickman proposed to let the K factor reflect the uncertainty in ability estimates by making it a function of time and playing frequency. If there is little uncertainty, the K factor for recent and frequent players will be low. If there is much uncertainty the K factor will be high.
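As a minimal sketch, the standard Elo update of equation (1) with the expected score of equation (2) could be written as below. The function names and the constant K = 32 are illustrative assumptions; the chapter does not state which K value chess federations use.

```python
def expected_score(theta_j, theta_k):
    """Equation (2): probability that player j beats player k."""
    return 1.0 / (1.0 + 10 ** ((theta_k - theta_j) / 400))

def elo_update(theta_j, theta_k, s_j, k=32):
    """Equation (1): update both ratings after one match.

    s_j is the match result for player j (0 loss, .5 draw, 1 win);
    player k receives the complementary result 1 - s_j.
    k=32 is an arbitrary illustrative constant, not taken from the chapter.
    """
    e_j = expected_score(theta_j, theta_k)
    e_k = expected_score(theta_k, theta_j)
    s_k = 1 - s_j
    return theta_j + k * (s_j - e_j), theta_k + k * (s_k - e_k)

# Two equally rated players: the winner gains exactly what the loser gives up.
print(elo_update(1500, 1500, 1))  # prints (1516.0, 1484.0)
```

Because the two expected scores sum to one, a constant K makes the update zero-sum: rating points move from one player to the other without changing their total.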

Computer adaptive practice Our suggestion for creating an on the fly item calibrating and computer adaptive prac-

tice (CAP) system is to replace one player in the Elo system by an item.24 Solving an

item correctly is interpreted as winning the match against the item. The updating func-

tion in equation (1) can be rewritten to equation (3) for updating player and item ratings:

where βi is the difficulty estimate of the item and S

ij and E(S

ij) are the score and expected

probability of winning for person j on item i. Following Glickman, the K factor in our

CAP system is a function of the rating uncertainty U of the player and the item (eq. 4):

where K = 0.0075 is the default value when there is no uncertainty and K+ = 4 and K

- = 0.5

are the weights for the rating uncertainty for person j and item i. These values determine

the rate at which θ and β can change following each item response. These values have been

determined through extensive simulations.

The uncertainty U depends on both recency and frequency. Equation (5) combines these

opposite effects on uncertainty. We apply the same equation to items and players, with pro-

visional uncertainty of U = 1 and 0 ≤ U ≤ 1: 

We assume that uncertainty for players and items decreases after every administration and

increases with time. Therefore uncertainty reduces to zero after 40 administrations and

conversely increases to the maximum of 1 after 30 days D of not playing.

24 This approach has, for many years, successfully been applied in an online chess testing system on the Chess Tac-tics Server (http://chess.emrald.net)

   

$$E(S_j) = \frac{1}{1 + 10^{(\theta_k - \theta_j)/400}} \tag{2}$$

The $K$ factor in equation (1) weights the impact of the deviation from expectation on the new ability estimate. This value essentially determines the rate at which $\theta$ can change over matches. In the standard ERS the $K$ factor is constant. Glickman (1995) argued that not all ability ratings are estimated accurately by the ERS update function (eq. 1). Inaccuracies mostly occur when players are new or have not played for an extended period of time, resulting in much uncertainty in their ability rating $\theta$. Glickman proposed to let the $K$ factor reflect the uncertainty in ability estimates by making it a function of time and playing frequency. For recent and frequent players, whose ratings carry little uncertainty, the $K$ factor will be low. If there is much uncertainty, the $K$ factor will be high.

Computer adaptive practice

Our suggestion for creating an on-the-fly item calibrating and computer adaptive practice (CAP) system is to replace one player in the Elo system by an item.2 Solving an item correctly is interpreted as winning the match against the item. The updating function in equation (1) can be rewritten to equation (3) for updating player and item ratings:

$$\hat{\theta}_j = \theta_j + K_j\,(S_{ij} - E(S_{ij})), \qquad \hat{\beta}_i = \beta_i + K_i\,(E(S_{ij}) - S_{ij}) \tag{3}$$

where $\beta_i$ is the difficulty estimate of the item and $S_{ij}$ and $E(S_{ij})$ are the score and the expected probability of winning for person $j$ on item $i$. Following Glickman, the $K$ factor in our CAP system is a function of the rating uncertainty $U$ of the player and the item (eq. 4):

$$K_j = K(1 + K_+ U_j - K_- U_i), \qquad K_i = K(1 + K_+ U_i - K_- U_j) \tag{4}$$

where $K = 0.0075$ is the default value when there is no uncertainty and $K_+ = 4$ and $K_- = 0.5$ are the weights for the rating uncertainty for person $j$ and item $i$. These values determine the rate at which $\theta$ and $\beta$ can change following each item response. These values have been determined through extensive simulations.

The uncertainty $U$ depends on both recency and frequency. Equation (5) combines these opposite effects on uncertainty. We apply the same equation to items and players, with provisional uncertainty of $U = 1$ and $0 \le U \le 1$:

$$\hat{U} = U - \frac{1}{40} + \frac{1}{30} D \tag{5}$$

2 This approach has, for many years, been applied successfully in an online chess testing system on the Chess Tactics Server (http://chess.emrald.net).
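The update rules in equations (3) to (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Math Garden implementation: the function names are hypothetical, the clipping of $U$ to the interval [0, 1] is inferred from the constraint $0 \le U \le 1$, and applying the administration term and the elapsed-days term in a single call is an assumption about how equation (5) is used.

```python
K_DEFAULT = 0.0075  # K in eq. (4): value when there is no uncertainty
K_PLUS = 4.0        # weight K+ in eq. (4)
K_MINUS = 0.5       # weight K- in eq. (4)


def update_uncertainty(u, days_since_last_play):
    """Eq. (5): one administration lowers U by 1/40, each idle day raises it by 1/30."""
    new_u = u - 1.0 / 40.0 + days_since_last_play / 30.0
    # Assumption: values outside [0, 1] are clipped to respect 0 <= U <= 1.
    return min(1.0, max(0.0, new_u))


def k_factors(u_player, u_item):
    """Eq. (4): K factors for player j and item i from both uncertainties."""
    k_j = K_DEFAULT * (1.0 + K_PLUS * u_player - K_MINUS * u_item)
    k_i = K_DEFAULT * (1.0 + K_PLUS * u_item - K_MINUS * u_player)
    return k_j, k_i


def elo_update(theta, beta, score, expected, k_j, k_i):
    """Eq. (3): ability and difficulty move in opposite directions."""
    new_theta = theta + k_j * (score - expected)
    new_beta = beta + k_i * (expected - score)
    return new_theta, new_beta
```

For a new player ($U_j = 1$) meeting a well-calibrated item ($U_i = 0$), `k_factors(1.0, 0.0)` gives a player factor five times the default and an item factor half the default, so the player's rating moves quickly while the item's rating stays nearly fixed.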

   



   

We assume that uncertainty for players and items decreases after every administration and increases with time. Under equation (5), uncertainty therefore falls from 1 to 0 after 40 administrations and, conversely, rises to the maximum of 1 after $D = 30$ days of not playing.

High speed, high stakes

We incorporate speed by using the scoring rule (shown in eq. 6) for speed and accuracy, which we call the high speed, high stakes (HSHS) scoring rule (Maris & Van der Maas, 2012). This rule imposes a speed-accuracy trade-off setting on the individual. Player $j$ has to give response $x$ in time $t_{ij}$ before the time limit $d_i$ for item $i$. The score $S_{ij}$ is scaled by the discrimination parameter $a_i$:

$$S_{ij} = (2x_{ij} - 1)(a_i d_i - a_i t_{ij}) \tag{6}$$

In this scoring rule the stakes are high when the subject responds quickly. In case of a correct answer ($x_{ij} = 1$) the score equals the remaining time, $d_i - t_{ij}$, scaled by $a_i$. In case of an incorrect answer ($x_{ij} = 0$) the scaled remaining time is multiplied by −1. Thus a quick incorrect answer leads to a large negative score. This scoring rule is depicted in Figure 2. The scoring rule is expected to minimize guessing by encouraging deliberate and thoughtful responses.

Figure 2. High speed, high stakes scoring rule. The score (y-axis, from $-a_i d_i$ to $+a_i d_i$) is plotted against response time (x-axis, from 0 to $d_i$), with separate lines for correct and incorrect responses.

Maris and Van der Maas (2012) derived an IRT model that conforms to the HSHS scoring rule. The expected score (eq. 7) can be inferred from this model. $E(S_{ij})$ is based on the ability estimate of the person $\theta_j$, the difficulty estimate of the item $\beta_i$, the time limit $d_i$ and the discrimination parameter $a_i$ for that item. In Math Garden, we set $a_i = 1/d_i$, such that the effective discrimination equals that of the 1PL model:

$$E(S_{ij}) = a_i d_i \, \frac{e^{2 a_i d_i (\theta_j - \beta_i)} + 1}{e^{2 a_i d_i (\theta_j - \beta_i)} - 1} - \frac{1}{\theta_j - \beta_i} \tag{7}$$

We use the HSHS score $S_{ij}$ (eq. 6) and the corresponding expected score $E(S_{ij})$ (eq. 7) in our modified Elo update function (eq. 3).
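Equations (6) and (7) can be illustrated with a short Python sketch. The function names are hypothetical and the code is only an illustration under stated assumptions. The fraction in equation (7) is written as $1/\tanh(a_i d_i(\theta_j - \beta_i))$, which is algebraically identical and avoids overflow for large differences. The first-order value returned when $\theta_j$ is very close to $\beta_i$ is an added numerical safeguard, not part of the source.

```python
import math


def hshs_score(correct, t, d, a=None):
    """Eq. (6): S = (2x - 1)(a*d - a*t), with x = 1 if correct and 0 otherwise."""
    if a is None:
        a = 1.0 / d  # Math Garden setting a_i = 1/d_i
    x = 1 if correct else 0
    return (2 * x - 1) * (a * d - a * t)


def hshs_expected(theta, beta, d, a=None):
    """Eq. (7): expected HSHS score for ability theta and difficulty beta."""
    if a is None:
        a = 1.0 / d
    ad = a * d
    diff = theta - beta
    if abs(diff) < 1e-4:
        # Safeguard (assumption): first-order expansion of eq. (7) near diff = 0.
        return ad * ad * diff / 3.0
    # (e^{2y} + 1) / (e^{2y} - 1) equals 1 / tanh(y), with y = ad * diff.
    return ad / math.tanh(ad * diff) - 1.0 / diff
```

With a time limit of 20 seconds and $a_i = 1/d_i$, a correct answer after 5 seconds scores 0.75 and an incorrect answer after 5 seconds scores −0.75, while the expected score is 0 when ability equals difficulty.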

Item selection

Items are selected for which the mean probability of answering correctly is about .75. Repetition of the same items is restricted by ensuring that items are reused only after 20 other items have been answered. A new target $\beta_t$ is selected by using:

$$\beta_t = \theta_j - \ln\frac{P}{1 - P} \tag{8}$$

where probability $P$ is randomly drawn from a normal distribution $P \sim N(.75, .1)$ and restricted such that $.5 < P < 1$. For administration the nearest available item is selected by: $\min_i |\beta_i - \beta_t|$.
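The selection rule can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the second parameter of $N(.75, .1)$ is assumed to be a standard deviation, the restriction $.5 < P < 1$ is implemented by redrawing (the source does not say how it is enforced), and `recent` stands for the set of the 20 most recently answered items.

```python
import math
import random


def draw_target_probability(rng, mean=0.75, sd=0.1):
    """Draw P ~ N(.75, .1), redrawing until .5 < P < 1 (assumed mechanism)."""
    while True:
        p = rng.gauss(mean, sd)
        if 0.5 < p < 1.0:
            return p


def target_difficulty(theta, p):
    """Eq. (8): difficulty at which ability theta succeeds with probability p."""
    return theta - math.log(p / (1.0 - p))


def select_item(theta, difficulties, recent, rng):
    """Index of the available item whose difficulty is nearest to the target."""
    beta_t = target_difficulty(theta, draw_target_probability(rng))
    available = [i for i in range(len(difficulties)) if i not in recent]
    return min(available, key=lambda i: abs(difficulties[i] - beta_t))
```

For $P = .75$ the target lies about 1.1 logits below the player's ability, which is why the administered items are relatively easy for the player.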

 

Procedure

Although Math Garden started out as a pilot project, available only to a limited number of schools in the Netherlands, the website later became available to a larger audience. In the pilot period the students received a login account and instructions from their teacher. In these instructions, teachers explained the scoring rule of the games and students were told that they could click on the question mark if they did not know the answer. After this, students could start playing on their own. Teachers were told that the first two sessions should be played at school. After this, students were also allowed to play at home, but they were instructed to play by themselves. After the pilot period Math Garden also became available to remedial teachers and families. The remedial teachers and families were not instructed on the frequency of playing. The manuals on how to use Math Garden were all available on the website, but the scoring rule of the games was not explicitly explained to the children.

 

Results

Measurement precision

To test whether the incorporation of response time in the estimation of ability allows us to lower the difficulty of administered items with less loss of measurement precision, we conducted a simulation study. We compared our results to those of Eggen and Verschoor (2006). In a simulation study, Eggen and Verschoor showed3 an increasing (negative) bias (Figure 3: left) and a drop in measurement precision (Figure 3: right) when selecting easy items in a standard CAT using the weighted maximum likelihood (WML) estimator and the one-parameter logistic (1PL) model. Average bias was computed by $\frac{1}{n}\sum_{i=1}^{n}(\hat{\theta}_i - \theta_i)$, and measurement precision was quantified by calculating the mean standard error of estimation $se(\hat{\theta})$ using the information function for the 1PL model.

3 Table 1 in Eggen and Verschoor (2006).



   



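Figure 3. BIAS (left panel) and SE (right panel) for different computer adaptive methods at different values of expected probability correct (x-axis). Lines: Eggen et al., Elo + 1PL, Elo + HSHS and, in the SE panel only, Max Info.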

In our simulation we used the Elo update function to estimate ability and difficulty, utilizing: a) accuracy data with the 1PL model and b) accuracy and response time data using the HSHS model. As in the study by Eggen and Verschoor (2006), our item bank consisted of 300 items with normally distributed difficulties β ~ N(0,1), and we also sampled 4000 abilities from a normal distribution θ ~ N(0,1). The CAP algorithm starts with an item of intermediate difficulty −0.5 < β < 0.5 and terminates after 40 items. As a starting point for the ability estimate we drew a random value from a normal distribution θ ~ N(0,1). We compared our Elo-based HSHS model, at different desired success probabilities, to Eggen and Verschoor's 1PL model using standard CAT. Eggen and Verschoor investigated success probabilities up to .75. With regard to bias, the Elo estimation method performs slightly worse with accuracy data only (Figure 3: left: Elo + 1PL), but outperforms Eggen and Verschoor's standard CAT method when RTs are included (Figure 3: left: Elo + HSHS). With regard to the standard error of estimation we also compared our two Elo methods to the theoretical maximum information for the 1PL model.
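The accuracy-only condition (Elo + 1PL) can be approximated with a small simulation. The sketch below is illustrative only and departs from the study in labelled ways: item difficulties are treated as known rather than estimated, the $K$ factor is a constant (the value 0.3 is hypothetical), items are not reused within a session, and the default number of simulated persons is smaller than 4000. The function name `simulate` is hypothetical.

```python
import math
import random
import statistics


def simulate(n_persons=500, n_items=300, test_length=40,
             p_target=0.75, k=0.3, seed=1):
    """Return (mean bias, sd of estimate minus true ability) for Elo + 1PL."""
    rng = random.Random(seed)
    bank = [rng.gauss(0.0, 1.0) for _ in range(n_items)]  # beta ~ N(0, 1)
    offset = math.log(p_target / (1.0 - p_target))        # logit of target P
    errors = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)      # true ability, theta ~ N(0, 1)
        estimate = rng.gauss(0.0, 1.0)   # random starting estimate
        used = set()
        for step in range(test_length):
            if step == 0:
                # First item: intermediate difficulty, -0.5 < beta < 0.5.
                pool = [i for i, b in enumerate(bank) if -0.5 < b < 0.5]
                item = rng.choice(pool)
            else:
                target = estimate - offset  # eq. (8) with a fixed P
                item = min((i for i in range(n_items) if i not in used),
                           key=lambda i: abs(bank[i] - target))
            used.add(item)
            p_true = 1.0 / (1.0 + math.exp(bank[item] - theta))
            x = 1.0 if rng.random() < p_true else 0.0  # simulated response
            p_est = 1.0 / (1.0 + math.exp(bank[item] - estimate))
            estimate += k * (x - p_est)  # eq. (3), player side only
        errors.append(estimate - theta)
    return statistics.mean(errors), statistics.stdev(errors)
```

Calling `simulate(p_target=p)` for several values of `p` gives the two quantities plotted in Figure 3 for this simplified condition: the mean of $\hat{\theta} - \theta$ (bias) and its standard deviation.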


We calculated the maximum information (Figure 3: right: Max Info.) with equation (9):

where $a = 1$ is the discrimination parameter, $N = 40$ is the number of items, and $P_i(\theta)$ is the desired probability correct. As can be seen in Figure 3 (right: Max Info.), when the probability of answering correctly assumes large values (x-axis), the theoretical minimum SE (eq. 9) for the 1PL model increases steeply (y-axis). For the standard error of estimation (Figure 3: right) we calculated the standard deviation of the difference between simulated abilities $\theta$ and estimated abilities $\hat{\theta}$. This method of calculation is simpler than, yet comparable with, the procedure used by Eggen and Verschoor (2006) for calculating the standard error of estimation.

The SE of the Elo estimation method using only accuracy data (Figure  3: right:

Elo+1PL) is largest for almost all probability levels. This is to be expected as this method

is statistically inferior to the WML method used by Eggen and Verschoor (2006). Up to

the probability level of about .69 the SE using the HSHS Elo method (Figure 3: right:

Elo + HSHS) is larger than the SE found in the Eggen and Verschoor simulation. However,

at higher probability levels, especially compared to our target of .75, the SE is considerably

lower. At probability levels higher than about .78 the SE even drops below the theoretical
minimum SE derived from the maximum information (Figure 3: right: Max info.) for the 1PL model. This demonstrates

that incorporating response times results in much better measurement precision when using

easy items.
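To make the comparison with the theoretical bound concrete, the following minimal sketch evaluates the 1PL minimum standard error of equation (9), se(θ) = 1/sqrt(N a² P(1 − P)), at several probability levels. It uses the values stated for the simulation (a = 1, N = 40); the function name `min_se` is our own and not part of Math Garden.

```python
import math

def min_se(p_correct, n_items=40, a=1.0):
    """Theoretical minimum SE for the 1PL model (eq. 9)."""
    if not 0.0 < p_correct < 1.0:
        raise ValueError("p_correct must lie strictly between 0 and 1")
    # Test information: N items, each contributing a^2 * P * (1 - P).
    information = n_items * a ** 2 * p_correct * (1.0 - p_correct)
    return 1.0 / math.sqrt(information)

for p in (0.5, 0.6, 0.7, 0.75, 0.8, 0.9):
    print(f"P = {p:.2f}  min SE = {min_se(p):.3f}")
```

Under these assumptions the bound rises from about 0.32 at P = .5 to about 0.53 at P = .9, which is why an accuracy-only method loses precision when items are easy.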

Validity

To assess the validity of the Math Garden measurements, the ratings of the students were

compared to their scores on the norm-referenced general math ability scale of the pupil

monitoring systems of Cito (Janssen & Engelen, 2002). The correlations between these

two measures, which serve as a measure of convergent validity, ranged from .78 to .84 for

the four domains addition, subtraction, multiplication and division. These correlations were

based on a subset of our sample. Cito scores were available for N = 964 participants. To

put these correlations into perspective we looked at the correlation between two subsequent

Cito scores. The correlation between Cito mid year and end of the year 2007–2008 was

.95. This indicates that our correlations can be considered fairly high. Figure 4 displays the

relation between test scores. The numbers indicate the regression line for each grade.

   

Figure 3. BIAS and SE for different computer adaptive methods at different values of expected probability correct. Left panel: mean BIAS; right panel: SE. Both are plotted against the expected probability correct (.5 to .9) for the series Eggen et al., Elo + 1PL, and Elo + HSHS, with Max Info. added in the SE panel.

In our simulation we used the Elo update function to estimate ability and difficulty, utilizing: a) accuracy data with the 1PL model and b) accuracy and response time data using the HSHS model. As in the study by Eggen and Verschoor (2006), our item bank consisted of 300 items with normally distributed difficulties β ∼ N(0,1), and we also sampled 4000 abilities from a normal distribution θ ∼ N(0,1). The CAP algorithm starts with an item of intermediate difficulty −0.5 < β < 0.5 and terminates after 40 items. As a starting point for ability we selected a random ability from a normal distribution θ ∼ N(0,1). We compared our Elo based HSHS model, at different desired success probabilities, to Eggen and Verschoor’s 1PL model using standard CAT. Eggen and Verschoor investigated success probabilities up to .75. With regard to bias it can be concluded that the Elo estimation method performs slightly worse with accuracy data only (Figure 3: left: Elo+1PL), but outperforms Eggen and Verschoor’s standard CAT method when RTs are included (Figure 3: left: Elo + HSHS). With regard to the standard error of estimation we also compared our two Elo methods to the theoretical maximum information for the 1PL model. We calculated the maximum information (Figure 3: right: Max Info.) with equation (9):

\[ se(\theta) = \frac{1}{\sqrt{N a_i^2 P_i(\theta)\,(1 - P_i(\theta))}} \qquad (9) \]

where a = 1 is the discrimination parameter, N = 40 is the number of items, and P_i(θ) is the desired probability correct. As can be seen in Figure 3 (right: Max info.), when the probability of answering correctly assumes large values (x-axis), the theoretical minimum SE (eq. 9) for the 1PL model increases exponentially (y-axis).

Figure  4  Correlation between Math Garden rating for the domains addition, subtraction, mul-

tiplication, and division and the norm-referenced Cito scores (mid 2008). Included are regression

lines for each grade (Dutch grades) indicated by grade numbers.

We also studied the validity by comparing the mean ability ratings of children in differ-

ent grades. We expected a positive relation between grade and ability. Figure 5 shows the

average ability rating for each grade and domain. As expected, children in older age groups

had a higher rating than children in younger age groups. In all four domains, there is an

overall significant effect of grade: addition, F(5, 1456) = 1091.4, p < .01, ω2 = .78; subtrac-

tion, F(5, 1363) = 780.5, p < .01, ω2 = .74; multiplication, F(5, 1215) = 409.6, p < .01, ω2

= .62; and division, F(5, 973) = 223.31, p < .01, ω2 = .53. Levene’s tests show differenc-

es in variances for the domains multiplication and division. However, the non-parametric

Kruskal–Wallis tests also show significant differences for these domains: χ2(5) = 753.28,

p < .01 for multiplication and χ2(5) = 505.17, p < .01 for division. For all domains, post

hoc analyses show significant differences between all grades, except for the differences

between grades five and six.
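As a rough check on the reported effect sizes, ω² for a one-way design can be recovered from the F statistic and its degrees of freedom alone. The sketch below uses the standard identity ω² = df_b(F − 1) / (df_b(F − 1) + N), with N = df_b + df_w + 1; the helper name `omega_squared` is ours, and we assume a plain one-way ANOVA with grade as the only factor.

```python
def omega_squared(f_value, df_between, df_within):
    """Omega squared for a one-way ANOVA, computed from F and its df."""
    n_total = df_between + df_within + 1
    effect = df_between * (f_value - 1.0)
    return effect / (effect + n_total)

# F statistics and degrees of freedom as reported in the text above.
reported = {
    "addition": (1091.4, 5, 1456),
    "subtraction": (780.5, 5, 1363),
    "multiplication": (409.6, 5, 1215),
    "division": (223.31, 5, 973),
}
for domain, (f, df1, df2) in reported.items():
    print(f"{domain:15s} omega^2 ~ {omega_squared(f, df1, df2):.3f}")
```

The results land within about .01 of the reported values; the small differences may reflect rounding or a slightly different estimator in the original analysis.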

[Figure 4 panels: Cito norm score (y-axis, 0 to 150) plotted against the Math Garden rating (x-axis) for addition (r = 0.83), subtraction (r = 0.84), multiplication (r = 0.8), and division (r = 0.78). Regression lines are labelled with grade numbers 4 to 8.]

Figure 5 Rating per domain and age group.

Reliability

One way to assess reliability is to compare children’s ratings across domains. Since the

domains involve related operations we expect high correlations between them. The correla-

tions between the four domains, addition, subtraction, multiplication, and division, vary

between .67 and .88, all significant at p < .01, indicating fairly high correlations. Another

relatively simple way to assess the reliability of Math Garden is to construct parallel tests.

We can compare item difficulty β’s of so-called mirrored items (e.g. 7 + 4, 4 + 7 and

2 × 4, 4 × 2) for the domains addition (N = 48) and multiplication (N = 81). Mirrored

items should have very similar β’s. Figure 6 shows the correlation of the mirrored item β’s

for the domains addition and multiplication. These correlations are .88 and .98 (p < .01),

respectively, indicating a high reliability of these item sets.

Figure 6 Scatter plot of difficulty β’s of mirrored items. Included are some example items, indicated
with black dots.

[Figure 5 panels: rating (y-axis, −10 to 10) per grade (x-axis, 1 to 6) for the domains +, −, × and :.]

[Figure 6 panels: "Correlation m+n, n+m", plotting the rating of item m+n against the rating of its mirror item n+m, and "Correlation mxn, nxm", plotting the rating of item mxn against the rating of its mirror item nxm. Labelled example items include 7 + 3, 15 + 82, 5 x 1, and 700 x 500.]

Besides difficulty β’s, we can also compute the discriminatory power of items, which
indicates how well the item discriminates low from high ability subjects. We estimated these
so-called a-parameters by using a logistic regression analysis on the accuracy responses

predicted by the difference in rating between item and respondent. As in the preceding

analysis, we compared the discriminatory power between mirrored items. The scatter plots

in Figure 7 show rather high significant (p < .01) positive correlations. The correlations for

addition and multiplication are .74 and .71.
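The estimation of an a-parameter described above can be sketched as a one-predictor logistic regression of accuracy on the rating difference θ − β. The version below is an illustration only: it assumes a model without an intercept, fits the slope by Newton-Raphson, and runs on simulated responses for one hypothetical item. None of the names or settings come from the Math Garden code.

```python
import math
import random

def fit_discrimination(diffs, correct, n_iter=25):
    """Maximum-likelihood slope a in P(correct) = 1 / (1 + exp(-a * diff)).

    Newton-Raphson on a single parameter; no intercept (an assumption).
    """
    a = 1.0
    for _ in range(n_iter):
        grad = 0.0
        info = 0.0
        for d, y in zip(diffs, correct):
            p = 1.0 / (1.0 + math.exp(-a * d))
            grad += d * (y - p)             # score function
            info += d * d * p * (1.0 - p)   # observed information
        if info == 0.0:
            break
        a += grad / info
    return a

# Simulated responses for one hypothetical item with true a = 0.8.
rng = random.Random(1)
true_a = 0.8
diffs = [rng.gauss(0.0, 1.5) for _ in range(5000)]  # theta - beta
correct = [1 if rng.random() < 1.0 / (1.0 + math.exp(-true_a * d)) else 0
           for d in diffs]
print(round(fit_discrimination(diffs, correct), 2))
```

With several thousand simulated responses the recovered slope should lie close to the generating value. Comparing such estimates for an item and its mirror item gives the kind of scatter shown in Figure 7.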

Figure 7 Scatter plot of discriminatory a-parameters for mirrored items.

As a final test of reliability, we investigated the stability of the difficulty ratings β. A

high correlation between β values of items at two time points far apart indicates high re-

liability. Therefore we would expect a stable item bank to correlate highly over time. We

first looked at the correlation between the item β ratings, as they were set at the start of the

project (week 36) and the item ratings in all subsequent weeks. In Figure 8, this correlation

is shown by the solid line. Clearly, the initial ratings, set on the basis of an analysis of Math

materials used by the schools, were quite good, as the correlation between initial ratings

and the ratings after 40 weeks is still .85. Secondly, we also correlated established item

ratings in week 44 with all item ratings in subsequent weeks (dotted line). This shows that

these established ratings are very stable as the correlations in all 32 weeks stay above .95.
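The axis label of Figure 8 indicates that these week-to-week correlations are Spearman's ρ, that is, the Pearson correlation of the ranks. A minimal sketch of that computation follows; the ratings are made-up values for five hypothetical items, and the helper names are ours.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical difficulty ratings of five items in two different weeks.
beta_week_a = [-3.1, -1.2, 0.4, 2.0, 5.5]
beta_week_b = [-2.8, -1.5, 1.7, 0.9, 6.1]
print(spearman_rho(beta_week_a, beta_week_b))
```

Because only the ordering of items matters, this measure is insensitive to a uniform shift or rescaling of all ratings between weeks.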

[Figure 7 panels: "a-parameter correlation n+m, m+n" (x-axis: a-parameter n+m; y-axis: a-parameter m+n) and "a-parameter correlation nxm, mxn" (x-axis: a-parameter nxm; y-axis: a-parameter mxn).]

Figure  8  Stability of items ratings for initial ratings (solid line) and established ratings after 2

months (dotted line). The x-axis displays week numbers (v = vacation). Correlations are computed

over active (played) items in each week (Ni = number of administered items).

Item reuse

As a result of the longitudinal nature of the Math Garden system, items are presented

to the same child more than once. Although the system ensures that at least 20 other items

are administered before an item is reused, this reuse may present a threat to the assump-

tion of local independence (e.g., the response to an item must not depend on the previous

response to the same item). To test this, we performed regression analyses with both the

number of items and the amount of time between two presentations of the same item to the

same child as predictors for the child’s performance on that item. The child’s performance

was measured by subtracting the expected score E(S_i) from the actual score S_i. If there is an
item-specific learning effect, any child that encounters an item for the second time is likely

to have a higher than expected score for that item. We selected pairs of data points that rep-

resented subsequent presentations of the same item to the same child. We selected the data

so that no child contributed more than one pair of data points, resulting in N = 478 pairs

of data points. Because item-specific learning effects are logically more likely to occur if

there is a small amount of time between two presentations of the same item to the same

child, we removed 90 data points with more than 30 minutes between the two presentations

of the item. A regression analysis with this dataset shows no main effect for either the

number of items, or the amount of time between two presentations of the same item to the

same child: number of items, R2 < .001, F(1, 476) = 0.39, p = .53, and amount of time, R2 <

.001, F(1, 476) = 0.0072, p = .93.

Math Garden aims

In order to keep children motivated, items were sampled so that children solved about
75% of the items successfully. However, in the first few months we imposed a success rate

of 70%. Figure 9a shows the proportion of correctly answered items per grade and domain.

[Figure 8 plot: "Played item correlation across weeks". The y-axis shows Spearman's ρ correlation of β_w and β_{>w} (0.80 to 1.00); the x-axis shows week numbers 2008-2009, with the number of administered items N_i listed per week; the two lines are labelled ρ_{w=35} and ρ_{w=44}.]

Only the results of the children who answered more than fifteen items were included in the

graph. The graphs show that the proportion of correctly answered items varied between .6

and .8 for most children. The proportion correct seems to be somewhat lower for subtrac-

tion and lower still for multiplication and division. At the start of this project, the domains

addition and subtraction were briefly available for the lower age groups. This resulted in
a lot of question mark use in these domains. To counter this unwanted effect we made the
availability of these domains dependent on the proficiency on addition and subtraction. In

total, the amount of question mark use in the math games was about 7.3%. Filtering out

the question mark responses (Figure 9b) results in considerably higher proportions correct.
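The idea of sampling items for a target success rate can be illustrated under a plain 1PL assumption: a player with ability θ answers an item of difficulty β correctly with probability 1/(1 + exp(−(θ − β))), so a 75% target corresponds to β = θ − ln 3. The sketch below picks the closest item from a hypothetical bank. The actual Math Garden selection rule (which also involves the HSHS model and item reuse constraints) is not reproduced here, and all names are ours.

```python
import math

def p_correct(theta, beta):
    """1PL (Rasch) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def target_difficulty(theta, p_target=0.75):
    """Difficulty at which a player with ability theta succeeds with p_target."""
    return theta - math.log(p_target / (1.0 - p_target))

def pick_item(theta, difficulties, p_target=0.75):
    """Index of the item whose difficulty is closest to the target."""
    goal = target_difficulty(theta, p_target)
    return min(range(len(difficulties)), key=lambda i: abs(difficulties[i] - goal))

bank = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical item bank
i = pick_item(theta=0.0, difficulties=bank)
print(bank[i], round(p_correct(0.0, bank[i]), 2))
```

With a coarse bank the realised probability only approximates the target, which is one reason observed proportions correct spread around the intended value.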

Figure 9 Proportion correct per grade and domain: a) question mark responses included; b) question mark responses excluded. Each panel plots the proportion correct (y-axis, 0.5 to 1.0) against grade (x-axis, 1 to 6) for one of the domains +, −, × and :.

One of the aims of Math Garden was that it should be a challenging web environment
for children of all competency levels. The usage statistics can answer the question whether
children are motivated to play the math games. They provide an indication of how attractive
and challenging the children found the math games. It is possible that children visit
the Math Garden site mainly because their teachers told them to. To assess how intrinsically
motivated the children were to play the games, we looked at the days and hours that
children played in Math Garden. Figure 10 (top) shows the number of solved arithmetic

problems per day of the week and Figure 10 (bottom) shows the number of solved items

per hour of the day. Not surprisingly, most problems were solved from Monday to Friday

and between 9.00 a.m. and 3.00 p.m. However, both graphs also show that a considerable

number of problems were solved after school hours and during the weekends. Actually,

33.2% of all problems were solved outside school hours.

Figure 10 Playing frequency during the week and during the day. Top panel: frequency (x 1000) per weekday (Monday to Sunday), with a legend for the four domains (:, x, -, +); bottom panel: frequency (x 1000) per hour of the day (1 to 24).

To investigate whether competency had any effect on motivation, we looked at the relation
between ability and playing frequency. Only data of children who solved 15 or more
problems were included to ensure accuracy of the ability estimates. We found only low
but significant (p < .01) correlations between ability level and playing frequency for all
domains. The correlations for the domains addition, subtraction, multiplication, and division
were −.15, −.12, −.05, and .09, respectively. The playing frequency does not appear
to depend substantially on the competency level of the children.

Diagnostic ability

We will briefly demonstrate the diagnostic and tracking ability of Math Garden by considering
a few examples. Using the high frequency dataset, we were able to provide individual
and group diagnostics. Figure 11 shows the percentage of typical errors a given

child had made (bars) compared to the percentage of these errors made by children of the

same grade (solid line). We can see, for instance, that this child makes significantly more

zero errors (400−200 = 380) for the domain subtraction than other children in the same

grade. We provided teachers with such graphs for individuals and groups of individuals

(e.g., for the whole class).

Detailed analysis of the item difficulties provides us with insight into sources of item

difficulty. Some interesting results have emerged. For example multiplications by 10 or

even 100 and one digit numbers (7 × 100) are among the 10% easiest items for this domain.

In division it appears that items of the type nn/n (77/7) are also very easy (again among the

10% easiest items). In Chapter 3 we tested how well all kinds of item effects, previously

studied in isolation, predict item difficulty. The combined item effects, such as problem

size, ties, and the 5 effect, explained 90% of the variance in the difficulty of simple multiplication
items.

Figure 11 Error analysis of answers to subtraction problems of a child in grade 5. Bars display the

percentage of errors for this child in a specific week. The lines display the percentage of errors made

by other children in Math Garden (dotted line) and by other children in grade 5 (solid line).

[Figure 11 chart: percentage of errors (x-axis, 0 to 14%) per error type for one child (frequency: 23, grade: 5), with bars marked significant or non-significant and lines for the grade mean and the total mean. Error types shown: use of question mark; too slow; adding units (42−23=25); reversibility + borrow error (68−29=31); reversibility error (93−78=25); counting error 2 (93−78=13); counting error 1 (93−78=14); borrow error (80−29=61); mirror error (18−5=31); addition (7−4=11); digit forgotten (95−75=25); −0=0 error (7−0=0); 0 forgotten (3000−2000=100); 0 error (9000−5000=8500); position error (22−10=21); unknown.]

A window on developmental change

The high frequency measurements combined with the size of the sample, provide unique

insights into arithmetic development and learning trajectories of children. In Math Garden,

trend analyses are provided to teachers. Figure 12 shows the progress of a single child

compared to all other children in the same age group. Teachers can use this information to

consider interventions. As can be seen in the graph, this child started out having an average

rating and a flat growth curve. By week 45 this child started to acquire the necessary ability

and by week 49 the child was in the top 25% of all children.

Figure 12 Progress chart of a child in grade 6 (black line), in comparison to the mean of grade 6

(dotted line).

At micro level it is even possible to study the learning pattern of one child on a specific
item over time. For example, in Figure 13 we see the answers and response times of two
children on two items across weeks. In the top graph of Figure 13 we see an individual who
did not know the answer to the math question 9 × 9 and answered with a question mark in
about 5 to 10 seconds on the first ten occasions. Then there were two mistakes in which the
child joined the two digits instead of multiplying. However, in the next attempt the question
was answered correctly but more time was needed to respond. From this point on, the ability
level seems sufficient for consistent correct and speedier answers. The bottom graph of
Figure 13 shows a lucky guess in the first week (third trial) followed by a gradual gain in
insight. Halfway through week 42 this child started answering correctly more often but with highly
varying response times. At the end of week 44 the response time dropped. Note that occasional
errors keep occurring. These examples illustrate the level of detail that is possible
in the analysis of Math Garden data.

[Figure 12 chart: rating (y-axis, 0 to 10) against week numbers 2008-2009 (x-axis), with percentile lines at 10%, 25%, 75%, and 90%.]

Figure  13 Response time pattern for two children on different items during a number of weeks

(x-axis). The y-axis indicates the response time in seconds. The answer is displayed in the graph. The

question mark answer means that the child pressed the ‘?’ button.

Discussion

In this chapter we presented and tested a new model for computerized adaptive prac-

tice and monitoring. The results concerning the validity and reliability are promising. The

high correlations with the norm-referenced Cito scores indicate high criterion validity. The

increase in player ability rating across grades also supports this, although the children in

grades 5 and 6 did not seem to differ. This is probably due to the fact that in the domains

we tested no new mental arithmetic techniques are taught in grade 6.

By simulation, we compared measurement precision and measurement bias of CAP to

standard CAT. For easy items the use of the HSHS scoring model, which combines speed
and accuracy, together with the Elo rating system (ERS) resulted in less loss in measurement
precision and less bias than found in standard CAT estimation. The ERS combined with the
1PL model, using only accuracy data, resulted in worse estimations. Concerning the items
and the item bank, we found that difficulty ratings converge in about eight playing weeks,
resulting in consistent difficulty ratings across time. High reliability is also indicated by the
high correlations of the difficulty and discrimination parameters between sets of mirrored

items. We have not found any indication of learning effects caused by the reuse of items,

therefore also indicating the assumption of local independence has been met for reuse of

items. However, in other learning domains this issue still requires careful consideration.
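For readers unfamiliar with the Elo rating system, the accuracy-only variant (Elo + 1PL) can be sketched in a few lines: after each response, the player's rating moves in the direction of the surprise (observed minus expected score) and the item's rating moves by the same amount in the opposite direction. The fixed step size `k = 0.3` below is an arbitrary illustrative value, not the uncertainty-dependent K factor used in Math Garden, and the HSHS score and its expectation are not reproduced here.

```python
import math

def expected_score(theta, beta):
    """Probability of a correct response under the 1PL model."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def elo_update(theta, beta, score, k=0.3):
    """One Elo step: the player gains what the item loses.

    score is 1 for a correct and 0 for an incorrect response;
    k = 0.3 is an arbitrary illustrative step size.
    """
    surprise = score - expected_score(theta, beta)
    return theta + k * surprise, beta - k * surprise

theta, beta = 0.0, 0.0
theta, beta = elo_update(theta, beta, score=1)
print(theta, beta)
```

Because the two updates mirror each other, the sum of player and item rating is unchanged by a single match; this symmetry is what lets ability and difficulty be estimated on the fly from the same responses.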

[Figure 13 panels: "Answer to item: 9 x 9 and response time of a grade 3 child" and "Answer to item: 21 : 3 and response time of a grade 3 child"; each plots response time in seconds (y-axis, 0 to 20) against week numbers (x-axis), with the given answer shown at each data point.]

The fit statistics for the HSHS model are still in development, and are therefore not included
in the results section of this chapter. Evaluation of the goodness of fit for IRT models

is an active area of research, and so far definite solutions are lacking (Embretson & Reise,

2000). Some of the relevant issues (Hambleton, Swaminathan, & Rogers, 1991) concern:

the sensitivity of the chi-square fit statistic to sample sizes, technical issues in the testing

of dimensionality (Hattie, 1984, 1985), and the testing of the assumption of local indepen-

dence. Evaluating the fit of IRT models is more complicated still in the context of computer

adaptive testing, due to the inherent incomplete item-person data matrix. An alternative

approach to comprehensive model fitting consists of checking model assumptions, and

establishing reliability and validity (Hambleton & Swaminathan, 1984). Here we have confined
ourselves to this alternative approach.

We can conclude that children were motivated to play the Math games. The frequency

data demonstrated that children played a lot outside school hours. Children with a lower

math ability did not play appreciably less, which suggests that they found the math games

as motivating as high ability children did. We demonstrated that Math Garden has many

possibilities as a diagnostic tool. The error analysis can provide teachers with valuable

insight into the kind of errors that individual pupils make. This information can be used to

optimize interventions. The current dataset, consisting of a large number of individual high

frequency time series, allows for many further investigations of difficulty effects (Chapter

3), strategy patterns in mathematical problem solving, and individual learning trajectories.

The item ratings also provide insight into what we call informal learning paths. Because

of the adaptive item ratings, we gain an on the fly insight into the difficulty of arithmetic

problems. Some items turned out to be unexpectedly easy. For instance, 8 + 6, 5000 + 5 and

50 + 60 were almost equally difficult whereas 8 + 6 is taught much earlier on in the Dutch

curriculum than the other two addition problems. This kind of information can be used to

determine the curriculum (i.e., what is taught) in each grade.

One of the problems with the Elo rating system is the occurrence of rating inflation and

deflation (Glickman, 1999), which we call drift. In educational applications, one source

of drift is that new young players start with low ratings and stop playing when they leave

school with high ratings. This causes a systematic downwards drift in item rating and, as

a consequence, lowers person ratings. This does not seem to jeopardize the operation of

Math Garden, since drift influences player and item ratings simultaneously. The main problem

lies in the interpretation of the rating. Rating points cannot be accurately compared

following inflation or deflation. Therefore we present transformed ratings to teachers and

users to prevent interpretation problems.

Transformation is conducted by calculating the average probability correct for a single

user on all items in the domain, as shown in equation 10:

$$P = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{1 + e^{-a(\theta_j - \beta_i)}} \tag{10}$$

This value is an estimate of the percentage of items in the domain that the user is able to

answer correctly. We also reduced drift by incorporating the rating uncertainty in calculating

the K factor, which minimizes the influence of unreliable person and item estimates

on the updating process. A related issue is the convergence speed. This is the time or number

of responses needed to get a stable rating. We set the rating uncertainty parameters of the

K factor, which determine the convergence speed, on the basis of extensive testing. A better

approach would perhaps be to estimate the uncertainty based on aberrant response patterns,

where unexpected responses are used as an indication of unreliability.
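
As an illustration of the transformation in equation 10, the following minimal Python sketch computes the average probability correct for one user over a set of item ratings. The function name `transformed_rating`, the default value a = 1, and the item ratings are assumptions chosen for illustration; they are not taken from Math Garden. The sketch also shows why the transformed rating is insensitive to drift that shifts person and item ratings by the same amount: only the differences between person and item ratings enter the formula.

```python
import math


def transformed_rating(theta, betas, a=1.0):
    """Average probability correct for one user over all items (equation 10).

    theta : person rating of the user
    betas : item ratings in the domain
    a     : scaling constant in the logistic function (assumed to be 1 here)
    """
    if not betas:
        raise ValueError("at least one item rating is required")
    total = sum(1.0 / (1.0 + math.exp(-a * (theta - beta))) for beta in betas)
    return total / len(betas)


# Hypothetical item ratings, for illustration only.
item_ratings = [-2.0, -0.5, 0.0, 1.0, 3.0]
user_rating = 0.0

p = transformed_rating(user_rating, item_ratings)

# Simulate drift: shift the user and all items by the same amount.
drift = -0.7
p_drifted = transformed_rating(
    user_rating + drift, [beta + drift for beta in item_ratings]
)

print(f"transformed rating:             {p:.4f}")
print(f"transformed rating after drift: {p_drifted:.4f}")
```

Both printed values agree up to floating-point rounding, whereas the raw ratings before and after the shift would not be comparable.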

A last issue concerns the one-dimensionality of the math domains. In practice, every test

and item bank is expected to violate the assumption of one-dimensionality to some degree.

Though we see no immediate effects on ability estimation, the question of how robust the

HSHS Elo algorithm is to violation of this assumption needs further investigation. We also

intend to further address the possible individual differences between children and how the

HSHS scoring rule affects their behavior.

In conclusion, Math Garden meets the requirements we set for the practice and progress

monitoring system. It is worth noting that although the new CAP algorithm is implemented

in the domain of math, the system can be applied to all kinds of learning domains. In the

2010 release of Math Garden more games, e.g. fractions, have been added and a language

garden is in development. Also, the number of schools using Math Garden continues to

grow steadily (about 150 in October 2010), yielding about 50 thousand responses per day.

We expect a fast adoption of computers, such as handhelds, minicomputers and tablets,

in primary schools in the next 5 years. If children do their daily exercises in practice and

progress monitoring systems using these devices, we expect many benefits for students,

teachers, and scientists.

   
