
System 42 (2014) 105–119

Using self-assessment to compare learners’ reading proficiency in a multilingual assessment framework

Karen Ashton*

Institute of Education, Massey University, Private Bag 11-222, Palmerston North 4442, New Zealand

Article info

Article history: Received 16 February 2013; received in revised form 15 November 2013; accepted 26 November 2013.

Keywords: Assessment; Asset Languages; Can-do; Comparability; Cross-language; Functional; Multilingual; Reading proficiency; Self-assessment

* Tel.: +64 6 356 9099 x84445. E-mail address: [email protected].
1 From June 2014 there will be a reduction in the number of language assessments on offer as it was considered to be no longer financially viable to offer assessments in all 25 languages (OCR, 2013; TES, 2012).

0346-251X/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.system.2013.11.006

Abstract

Can-do statements are commonly used in language education to describe the level of a learner’s proficiency. However, there is little research on whether they represent proficiency and progression for non-European languages, particularly non-Latin script or community languages. A self-assessment can-do survey was developed to investigate and compare secondary school learner self-assessments of reading proficiency in German, Japanese and Urdu for the multilingual assessment scheme Asset Languages in England. Findings show that the same three factors best represent learners’ self-assessed reading proficiency across all three languages. However, differences were also found, raising the need for further research into the way can-do statements represent progression for learners of Japanese and Urdu. Additionally, and in contrast to the findings for German and Japanese, results indicate that learners of Urdu and their teachers underestimated learner proficiency in relation to test scores. Finally, this research demonstrates that the construct of reading in the current National Curriculum for Modern Foreign Languages is not endorsed by the learners in this study.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Assessments within a functional framework aim to have levels which are informative to users and scores from tests that are ‘comparable across different languages and contexts’ regardless of first language (L1), the language studied or the number of learning hours (Bachman & Clark, 1987, p. 28). However, as Alderson (2005) notes, further research is needed on whether functional frameworks represent progression for learners of less commonly taught languages. Most existing research has focused on learners of English or other European languages, and for a single language linked to a functional framework such as the Common European Framework of Reference (CEFR) (Council of Europe, 2001), rather than a multilingual assessment context (Jones, Ashton, & Walker, 2010).

The context for this work is Asset Languages, a multilingual assessment scheme in England, which, alongside the Languages Ladder, forms one of the overarching objectives of England’s National Languages Strategy (2004): the introduction of a ‘voluntary recognition system’. The Languages Ladder, owned by the Department for Education and Skills (DfES) (DfES, 2004), consists of can-do statements for reading, writing, listening and speaking at six levels (broken down into 17 grades), while Asset Languages, through Cambridge Assessment, provides voluntary ‘low stakes’ tests in 25 languages (see footnote 1) against the Languages Ladder.



The aim was to have a functional framework of levels with shared meaning across all languages, representing small positive steps of achievement in can-do terms. This approach contrasts with existing ‘high stakes’ qualifications available in England which test the ability of learners after taking a set course for a specified number of guided learning hours. Here, assessments can reflect the different learning demands of languages. For example, there is a general consensus that for Japanese, reading and writing are more difficult (Koda, 1994) and take longer, even two or three times as long, to achieve the same basic proficiency compared with European languages for L1 English speakers (de Courcy & Birch, 1993). Comparisons across subjects therefore tend to be based on how well learners of a similar age have achieved set objectives or attainment targets (see for example QCA (2008)) rather than functional competence as was required for Asset Languages.

The voluntary recognition system was to be inclusive in the educational contexts, i.e. primary, secondary and adult, and range of languages catered for, with the ethos that language proficiency, however achieved, should be recognised. Specific reference was made to community language learners: learners of the language of the community that they belong to (CILT, 2005). Separate testing of reading, writing, listening and speaking skills was seen as necessary, particularly for non-Latin script or community languages where learners may progress more rapidly in listening and speaking than reading and writing (Little, 2006; Ogawa & Douglas, 1992) due to the difficulty of reading and writing for non-Latin script languages and the profile of community language learners, who tend to have stronger listening and speaking skills. A key challenge for Asset Languages is how these facets can work within a single framework. Due to these complexities and significant pressures of development, as North and Jones (2009) outline, an iterative cycle was foreseen with the first standards set seen as provisional. Findings from research such as the current study can then feed into this iterative review process.

This article uses a can-do survey to explore and compare secondary school learner self-assessments of reading proficiency in German, Japanese and Urdu. Although reading (and responding) is a core skill in the National Curriculum for Modern Foreign Languages (NC for MFL) in England (DfES & QCA, 1999), past research has highlighted the limited attention given to reading within foreign language classes (Macaro & Erler, 2008). For example, an Ofsted (2008, p. 15) evaluation of language learning found that students ‘rarely read beyond the course book, a worksheet or for examination practice’. Furthermore, Grenfell (1992) is critical of the attention paid to reading within course books, while Dobson (in Macaro & Erler, 2008, p. 90) comments that reading experience is often ‘limited to comprehending short texts which rarely extend beyond a few sentences’. A recent cross-European survey supports these findings, with results revealing England as the lowest performing country in reading in both the first and second most widely taught foreign languages (French and German respectively) (European Commission, 2012). Thus reading remains an area in urgent need of further research.

The three languages were chosen for several reasons. First, they use different scripts and alphabets. In Japanese, a non-Latin script language, Chinese logographic characters, Kanji, are used together with two syllabic systems, hiragana and katakana, with around 1500 kanji in common daily use (Akita & Hatano, 1999; Sasaki, 2004). Urdu is an alphabetic language with Persian (as opposed to Roman) letters and is written from right to left (Mumtaz & Humphreys, 2001). Second, in England, the majority of learners of German and Japanese are either second or foreign language learners, whereas Urdu is the most widespread community language (CILT, 2006). The following from McPake (2006, p. 4) best describes the Urdu learners in this study: ‘children learn a community language after they have learned English’, with children from families of Pakistani origin often learning to speak Panjabi at home before starting to learn Urdu, the national language of Pakistan, at a later age. The inclusion of Urdu was also seen as important here given that in a large cross-GCSE school subject comparability study, Coe (2008) found that Urdu behaved psychometrically differently to other subjects. The trait constructed by Coe can be seen as a measure of general academic ability, and Coe suggested that Urdu students have a special interest or ability in the subject distinct from the trait of general academic ability.

Issues of comparability within a functional framework, should they exist, are likely to be more prominent across these three languages and learning contexts, and at lower levels of reading proficiency, where further research is needed into the progression represented in functional frameworks, particularly given the limited focus on reading in foreign language classrooms in England.

2. Self-assessment

Self-assessment is based on the ability of learners to reflect on and accurately assess their knowledge and skills (Ellis, 2003; Little & Perclova, 2001). A key area for research has been the validity and reliability of self-assessment. Here Ellis (2003, p. 303) argues that self-assessment can be both ‘valid and reliable’, while Oscarson (1997) and Ross (2006) claim that the majority of studies have produced favourable results. Examples of studies which have found a positive relationship between language proficiency tests and self-assessments include LeBlanc and Painchaud (1985), Bachman and Palmer (1989), Hargan (1994), Ross (1998), Alderson (2005), Brantmeier and Vanderplank (2008), Alexandria (2009) and Litz (2009), while Brantmeier (2006, p. 19) notes that results across studies ‘generally support the use of self-assessment as an indicator of second language abilities’. A positive relationship between self-assessment and test results has also been found in other subject areas, e.g. Shen (2002) and Shen and Pedulla (2000) in large scale cross-country comparability studies of mathematics and science.

Other studies have looked at how self-assessment can be made more reliable, finding that learners are better at assessing items that are concrete or functional rather than abstract or general (Alderson, 2005; Bachman, 1990; Ross, 1998). Further to this, Ellis (2003, p. 302) suggests that the lack of correlation between self-assessments and proficiency tests found in some studies could be due to the use of generic self-assessment instruments.


Research into self-assessment has also looked at differences in the way learners assess themselves. Ross (1998) and Brantmeier (2006) note considerable variation across learners, while others have found patterns across groups of learners. Blatchford (1997), for example, in comparing student self-assessments with standardised tests, found significant correlations for learners at age 16 but not at age seven, suggesting that younger learners are less capable of accurate self-assessment. Similarly, Heilenman (1990) found that less experienced learners (defined by self-reported classroom experience) were more likely to overestimate their ability while more experienced learners were more likely to underestimate their ability. Heilenman (1990, p. 190) suggests that this is because less experienced learners ‘have little or no way of being aware of what they do not know or are unable to do’ while experienced learners are more likely ‘to be aware of the limits of their skills and knowledge’.

While the above studies focus on the cognitive level and ‘experience’ of learners, Brantmeier and Vanderplank (2008) found that high performing students in an advanced Spanish class underestimated their ability while lower performing students tended to overestimate their ability. This replicates the finding of Patri (2002, p. 121), who found that ‘low achievers were over-estimating and high achievers were underestimating their performances’ within a remedial English university programme.

While these studies reported differences between higher and lower performing students within the same broad proficiency level, Davidson and Henning (1985) compared learners ‘from low intermediate to high advanced’ proficiency levels. They found that overall there was a greater tendency for learners to over rather than underestimate their ability, a tendency also found by Beresford (2000) and Shameem (1998). By contrast, AlFallay (in AlFallay, 2004) and DIALANG (Alderson, 2005) found that more advanced students (in terms of proficiency level) were better able to assess their ability than beginner learners. It should be noted here that in DIALANG (Alderson, 2005), the low number of learners at beginner level makes it difficult to interpret this finding. Brantmeier, Vanderplank, and Strube (2012) also found positive associations between advanced level learners’ self-assessments and test scores. However, there is no comparison with beginner learners as again there were few learners at beginner level, and Brantmeier et al. (2012, p. 151) acknowledge that there were insufficient items ‘tailored to beginner learners for accurate statistical analysis’.

In summarising across studies, Boud and Falchikov (1989, p. 543) argue that ‘there is no consistent tendency’, a conclusion that still appears valid today. The variation in findings across studies is hardly surprising given the differences in purpose, contexts and design of self-assessment instruments (Brantmeier et al., 2012; Oscarson, 1997; Saito, 2003). Additionally, as Brantmeier et al. (2012) note, most research in this area to date has been for adult learners of English. Further challenges come from the use of terminology. For example, some studies talk about the experience (either cognitive or language experience) of the learner, or of low and high achievers within a class of learners at the same general level of proficiency, while others have compared learners across different levels of proficiency. Comparisons across studies are also difficult because learners are not matched for proficiency level (and terms such as low (or beginner), intermediate and advanced may not have been applied in the same way), the language studied, experience or age.

2.1. Can-do self-assessment

Can-do self-assessment methods are becoming increasingly common in language learning and testing, where they have been used in several large scaling projects. A benefit of using self-assessment surveys is the ability to link and compare learners of different languages through common items. As North (2000b, p. 573) notes of the process of administering and calibrating a can-do survey, ‘[o]ne of the points which is often overlooked about scaling is that it can be used in this way to investigate a construct’. This is what North (2000b, p. 570) refers to as ‘social moderation’, which he argues is ‘potentially the strongest form of linking because it implies the development and validation of a common view of a construct’. The best known example is North’s (2000a) work in developing the CEFR common reference scales, where a large number of participants (around 300 teachers and 2800 learners) completed can-do surveys assessing learner ability (Council of Europe, 2001). Alderson (2005), in the DIALANG project, discusses how self-assessment surveys together with proficiency tests were used in order to provide adult learners with feedback on the accuracy of their self-ratings. In the Nordic–Baltic region, the Bergen can-do project used adapted CEFR descriptors to develop a set of can-do statements for on-going self-assessment for 11–12 year olds (Hasselgreen, 2003). Finally, Jones (2000b, 2002), in the ‘ALTE can-do project’, used self-assessment surveys to develop ALTE can-do statements for learners with different L1s (all European). The current research, although smaller in scale, aims to build on these studies to explore the extent to which a functional framework represents progression for learners of different languages. As in the above studies, the results here are not used for formal assessment; rather they are used to facilitate comparisons across learners of the three languages.

3. Methodology

The following research question forms the key focus of this study:

To what extent does a functional framework represent learners’ self-assessed proficiency and progression across three different languages?


Two questions are addressed as part of this:

1) Are there any patterns of over or underestimation?
2) What similarities and differences are there across the three learner groups?

3.1. Participants

Participants include 439 learners and their teachers, at secondary schools registered to take Asset Languages assessments at A1, A2 and B1 (beginner and lower intermediate levels) in England. Learners were aged 12–15 (average age 13.6) and in years 8–10 (approximately 20 per cent year 8, 65 per cent year 9, 15 per cent year 10). Learners studied the language (n = 137 German, n = 139 Japanese, n = 163 Urdu) as a timetabled subject for either 2 or 3 hours per week, following the NC for MFL (DfES & QCA, 1999). The number of participants was selected to ensure an adequate sample size of at least 100 per language for Rasch modelling (North & Jones, 2009). Learners completed a self-assessment survey containing can-do statements and a short section on learner background. A smaller number of learners (n = 44 German, n = 45 Japanese, n = 31 Urdu) also completed an Asset Languages reading test in their target language. Teacher ratings were collected for all learners. All learners, their parents, teachers and school heads signed consent forms and the ethical guidelines of the British Educational Research Association (2004) were followed.

3.2. Materials

3.2.1. Learner self-assessment survey
The self-assessment survey consisted of 44 can-do statements (see Appendix 1) from a range of existing sources (following Alderson, 2005; Brantmeier & Vanderplank, 2008; North, 2000a):

- CEFR (Council of Europe, 2001)
- CEFR portfolios for 11–16 year olds
- ALTE Can-do project (ALTE, 2002)
- Bergen Can-do project (Hasselgreen, 2003)
- NC for MFL (DfES & QCA, 1999)
- Languages Ladder (Asset Languages, 2005)

Learners responded by ticking ‘yes, I can’ or ‘no, I can’t’ to each statement, following the approach used by Jones (2000a) and in DIALANG (Alderson, 2005). Although responding categorically can be challenging and additional options ‘encourage more precision in rating’ (Brantmeier, 2006, p. 21), providing additional options can also make the analysis and interpretation more complex, e.g. ‘[i]f somebody says that they can do something sometimes, does that mean they are at a given level or not?’ (Alderson, 2005, pp. 211–212).

Statements were presented in order of predicted difficulty to provide scaffolding for learners, following Bachman and Palmer (1989), who found learners could better assess what they cannot do rather than what they can do. All surveys were administered by the researcher. As the purpose was to statistically determine whether learners across languages rated the can-do statements in the same way, the researcher did not want to impose an understanding; therefore, following Alderson (2005), training in interpreting the statements was not provided.

The following process was used for developing and validating the survey prior to administration:

- Relevant A1, A2 and B1 statements (e.g. workplace statements were not used) were entered into a database.
- Amendments were made to ensure consistency in format.
- Where there was duplication, statements with concrete examples (considered easier for learners (Ross, 1998)) or that had been calibrated were preferred.
- The survey was piloted and revised. Changes included a) removing statements that appeared unrelated (e.g. reading aloud, which is in the NC for MFL), b) adding statements at B2 as some learners rated their ability as above B1, c) minor word changes as teachers felt some terms were difficult for learners to understand.

One point to note is that although the CEFR levels A1, A2, B1, B2 are used here for ease of discussion and interpretation when talking about proficiency levels, the statement sources have used different development processes and have differing strengths of claim in terms of linkage to the CEFR. For example, the Languages Ladder and NC for MFL statements were developed by teacher and expert working groups and ‘aligned’ to the CEFR through expert judgement, while the CEFR and ALTE statements were calibrated on a large number of participants and ‘followed a principled scaling methodology’ (North, 2000b, p. 556). The Bergen can-do and CEFR portfolio statements for 11–16 year olds were adapted from the CEFR for the lower secondary school context (Hasselgreen, 2003). Although the Languages Ladder and NC for MFL statements are not statistically calibrated, they are the ones that these learners are the most familiar with. Thus the survey contained some statements that have been statistically calibrated, some that were developed specifically to sit alongside the curriculum


and course objectives (Brantmeier et al., 2012; Ross, 1998) and some that are specifically designed for the age group studied.

3.2.2. Learner background information
The final section of the survey requested information such as gender, age, L1, learning hours in the target language, use of the target language outside the classroom, and reading habits in both English and the target language. These data were intended to help interpret the self-assessment findings.

3.2.3. Teacher self-assessment survey
For each class, teachers assigned each learner to one of three ability groups (low, medium, high) and completed one survey for each group. These data were collected as an additional data source, in the same manner as the test data, to compare against the learner self-assessments.

3.2.4. Asset Languages test data
A smaller number of the same learners (n = 44 German, n = 45 Japanese, n = 31 Urdu) completed an Asset Languages reading test at their level of ability (A1, A2 or B1 as determined by their teacher). Tests also contained items at adjacent levels to capture a broader range of proficiency. Each test consisted of 25–30 multiple choice questions depending on the level. Assessments were developed following Cambridge English Language Assessment’s standard procedures (see for example Khalifa and Weir (2009)). To assist with the validation argument for comparability, Asset Languages assessments for all languages within the same educational context and level were constructed to a standardised specification including but not limited to aspects such as task types, text types, topics, and functions.

4. Data analysis and results

Rasch modelling, which belongs to a class of models within Item Response Theory (IRT), was used as the primary method of data analysis. Winsteps software was used to conduct the Rasch modelling (Linacre, 2010). Rasch modelling provides estimates of both the ability of learners and the difficulty of can-do statements in logits on the same measurement scale or ‘ruler’ (Bachman, 1990; Bond & Fox, 2001; Linacre, 1999). If the can-do statement is lower on the scale than a person’s ability, there is an increased chance that this person can do the function in the can-do statement. The reverse is also true: if the can-do statement is higher on the scale than a person’s ability, there is an increased chance that this person cannot do the function in the can-do statement (Bachman, 1990). This is illustrated by Fig. 1 below. For this study, as all learners have taken the same can-do survey, comparisons across the three languages are possible on the same measurement scale or ‘ruler’.

Fig. 1. Illustration of Rasch measurement scale.
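The relationship illustrated in Fig. 1 can also be stated directly: under the dichotomous Rasch model, the probability of a ‘yes’ response depends only on the difference between learner ability and statement difficulty in logits. The study itself used Winsteps; the sketch below is purely illustrative of the model’s core equation.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability that a learner endorses ('yes, I can') a can-do
    statement under the dichotomous Rasch model. Both parameters are
    in logits on the same measurement scale ('ruler')."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A learner 1 logit above a statement's difficulty has a ~73% chance
# of endorsing it; 1 logit below, a ~27% chance.
print(rasch_probability(1.0, 0.0))   # ~0.731
print(rasch_probability(-1.0, 0.0))  # ~0.269
```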

Following Rasch modelling, principal component analysis and regression analysis were used to assess how the can-do statements factored together and the contribution of these factors to learners’ self-assessed reading proficiency. The results are then compared across the three language groups.

4.1. Fit, validity and reliability of the measurement scale

The data for each language were first analysed separately using Rasch modelling to ensure the validity and reliability of the can-do statements and learner responses before combining the data sets to enable comparisons across languages. Rasch fit values were analysed to assess the degree of agreement between observed and expected responses (Linacre, 2002). Expected responses are based on the assumption that ‘individuals with higher degrees of a given ability could be expected to have a higher probability of correct performance on tasks of lower difficulty or complexity, and a lower probability of correct performance on tasks of greater difficulty or complexity’ (Bachman, 1990, p. 19).


Fit values between 0.5 and 1.5 were used to denote an acceptable degree of agreement, following North (2000a) and Linacre (2003, p. 918), who claims this range provides an indicator of ‘useful fit’ which is ‘productive for measurement’. All can-do statements had acceptable fit values, but a small number of learners for each language group (total n = 21) with fit values of above 1.5 were removed from the data. As Linacre (1999) notes, such values show unexplained randomness and misinformation in the observations.
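As an illustration of this screening step, a minimal sketch of how misfitting learners might be filtered from exported person statistics is given below. The data frame and column names are hypothetical, not taken from the study’s Winsteps output.

```python
import pandas as pd

# Hypothetical person statistics; 'learner_id' and 'infit_msq' are
# illustrative column names, not the study's actual export format.
persons = pd.DataFrame({
    "learner_id": [101, 102, 103, 104],
    "infit_msq":  [0.85, 1.10, 1.62, 0.95],
})

FIT_MIN, FIT_MAX = 0.5, 1.5  # 'useful fit' range (Linacre, 2003)

# Keep learners in the productive range; values above 1.5 signal
# unexplained randomness in the observed responses.
keep = persons["infit_msq"].between(FIT_MIN, FIT_MAX)
retained, removed = persons[keep], persons[~keep]
print(removed)  # learner 103 would be dropped before re-analysis
```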

After combining the three data sets, all statements continued to have good fit values (between 0.80 and 1.2). However, there were changes in the fit values for learners. As North (2000a) notes, when combining data sets some change is to be expected, but here differences relate to learner groups rather than general movement. The German and Japanese learners all had acceptable fit values (between 0.50 and 1.50) but tended to have lower values, showing little variation across learners in terms of which statements they found easy or difficult. In other words, there was extremely good agreement across learners as to the order of difficulty of statements. By contrast, the Urdu learners tended to have higher fit values (some above 1.7), meaning that there was more variation in terms of which statements learners found easy or difficult, or in other words less agreement as to the order of difficulty of statements.

The analysis was run twice again, each time removing Urdu learners with fit values above 1.7. Following this process there were 23 Urdu learners with fit values between 1.5 and 1.7. In this instance these learners were retained as a compromise between reaching a more adequate statistical solution and keeping the integrity of the data (Table 1).

Table 1
Number of learners and average fit values in the final data set.

                    German  Japanese  Urdu
Number of learners  133     128       134
Average fit values  0.90    0.92      1.16

n = 395.

Reliability was very high, with a Cronbach’s alpha of 0.99 for estimates of statement difficulty and 0.91 for estimates of learner ability. For DIALANG, where 376 learners assessed their reading ability against 18 can-do statements, the reliability was 0.813, which Alderson (2005, p. 101) considers ‘entirely acceptable’. As Bond and Fox (2001) note, high reliability can mean that there is a range of higher and lower scores, and thus here also reflects the variation in ability estimates for Urdu learners, as is also illustrated by the higher standard deviation for the Urdu data (Table 2).

Table 2
Mean, SD and SEM.

      German  Japanese  Urdu
Mean  26.5    24.5      25
SD    5.3     7.3       9.8
SEM   0.10    0.11      0.13
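Cronbach’s alpha for a set of dichotomous can-do responses can be computed directly from the item variances and the variance of the total scores. A minimal sketch with toy data, not the study’s responses:

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha for a learners x items matrix of 0/1
    ('no'/'yes') responses: k/(k-1) * (1 - sum(item variances) /
    variance(total scores))."""
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1).sum()
    total_var = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Toy 4-learner x 3-statement response matrix (illustrative only).
data = np.array([[1, 1, 0],
                 [1, 0, 0],
                 [1, 1, 1],
                 [0, 0, 0]])
print(round(cronbach_alpha(data), 2))  # 0.75
```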

4.2. Difficulty of statements

The calibration shows good progression in terms of statement difficulty (in logits) across levels (Table 3). The calibrated difficulty of each statement can be seen in Appendix 1. The order of difficulty of statements also correlates highly with the original assignment of levels (0.84 Spearman’s rho, statistically significant at the 0.01 level), despite the differing development processes used across sources.

Table 3
Calibrated average difficulty of statements across levels.

A1     A2     B1    B2
−2.27  −0.22  1.46  2.31
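The reported rank correlation between calibrated difficulty and originally assigned level can be reproduced in outline with scipy. The six values below are illustrative, not the study’s 44 calibrated statements.

```python
from scipy.stats import spearmanr

# Illustrative difficulties (logits) for six statements and their
# originally assigned levels, coded A1=1, A2=2, B1=3, B2=4.
difficulties = [-3.47, -1.28, 0.23, 1.52, 2.10, 2.57]
assigned_levels = [1, 1, 2, 3, 3, 4]

rho, p = spearmanr(difficulties, assigned_levels)
print(f"Spearman's rho = {rho:.2f} (p = {p:.3f})")
```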

4.3. Ability of learners

The German learners rated their reading proficiency as the highest of the three groups, followed by the Urdu and then the Japanese learners (Table 4).

Table 4
Average ability estimates in logits.

German  Japanese  Urdu
2.03    0.50      1.3


In order to assign levels to learners, it was necessary to conceptualise what being at a level meant. Although, as Alderson (2005, p. 266) claims, some sub-skills are relevant across CEFR levels and some learners at lower levels will be able to perform items and tasks at higher levels, it is necessary ‘to define where one stage “ends” and the next “begins”’. Here a learner was deemed to be ‘operating at the level’ if they said they could do 75–80 per cent of statements at that level. This follows similar research, e.g. Hasselgreen (2003), who used 70–80 per cent, and Jones (2000a, p. 13), who used 80 per cent as this score is ‘frequently used in domain or criterion-referenced testing as an indication of mastery in a given domain’. To simplify the discussion, from this point on this is the intended interpretation where learners are referred to as being at A1, A2 etc. Given that the majority of learners have self-assessed their ability as being at A1 and A2, the remaining discussion focuses mostly on these two levels. As can be seen, there are some learners, nearly 20 per cent for Urdu, who according to their self-ratings have not achieved A1, the lowest level of functional competence represented by the CEFR (Table 5).

Table 5
Numbers of learners at each level according to the self-assessment survey.

Level     German      Japanese    Urdu
Below A1  3 (2.3%)    16 (12.5%)  26 (19.4%)
A1        18 (13.5%)  71 (55.5%)  50 (37.3%)
A2        89 (66.9%)  34 (26.5%)  30 (22.4%)
B1        13 (9.8%)   7 (5.5%)    9 (6.7%)
B2        10 (7.5%)   0           19 (14.2%)

n = 395.
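The mastery criterion described above lends itself to a simple decision rule. A minimal sketch follows, assuming levels are checked cumulatively from A1 upwards; the paper does not spell out the procedure at this level of detail, so the rule and threshold placement are assumptions.

```python
MASTERY_THRESHOLD = 0.75  # the study used 75-80 per cent of statements

def assign_level(endorsed: dict) -> str:
    """Return the highest level at which the learner ticked 'yes' to
    at least the mastery threshold of that level's statements.
    'endorsed' maps each level to the proportion endorsed; checking
    levels cumulatively from A1 upwards is an assumption here."""
    achieved = "Below A1"
    for level in ("A1", "A2", "B1", "B2"):
        if endorsed.get(level, 0.0) >= MASTERY_THRESHOLD:
            achieved = level
        else:
            break  # stop at the first level not mastered
    return achieved

print(assign_level({"A1": 0.90, "A2": 0.80, "B1": 0.40}))  # 'A2'
```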

4.4. Accuracy of ratings

As others, e.g. Brantmeier et al. (2012), have stated, the accuracy of self-ratings is difficult to assess, which is further complicated here by the fact that the initial standards for Asset Languages are seen as provisional rather than ‘the gold standard’ (North & Jones, 2009) and by the low numbers of learners who completed tests (n = 44 German, n = 45 Japanese, n = 31 Urdu). However, trends across the learner and teacher ratings and tests are looked at.

Correlations (Pearson’s r) of the ability estimates in logits show statistically significant results at the 0.01 level for all languages between the self-assessment and test data and between learner and teacher assessments (Table 6). These results are similar to those found in DIALANG, where for reading the correlation between test and self-assessment data was 0.54 (Alderson, 2005).

Table 6
Correlations across data sets.

                                          German  Japanese  Urdu
Self-assessments and tests*               0.67**  0.59**    0.68**
Teacher assessments and self-assessments  0.59**  0.70**    0.69**

* n = 44 German, n = 45 Japanese, n = 31 Urdu.
** Statistically significant at 0.01 level.

The graphs in Fig. 2 compare the percentage of learners at each level according to each of the three data sets.

Fig. 2. Comparisons of self-ratings, teacher ratings and test results for smaller sample.

4.5. Clusters of can-do statements

Principal components analysis (Field, 2005) was performed to reduce the can-do statements into factors. This was followed by multiple regression analysis to determine how well each of the subsequent factors predicted learner self-ratings of reading proficiency, with the results compared across languages.


After first confirming that the survey represented a unidimensional scale by performing principal components analysis on the Rasch residuals (a Kaiser–Meyer–Olkin measure of sampling adequacy of 0.06 was obtained) (Bond & Fox, 2001; Linacre, 1998; Raîche, 2005), principal components analysis was performed on the actual Rasch values. This showed that 33 of the 44 statements had significant loadings (see Appendix 1), clustering into five factors. Field (2005) claims that values greater than 0.364 are significant where n = 300. The smallest loading here is 0.401, where n = 395.
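The two-step analysis, principal components to cluster the statements into factors followed by regression of ability on the factor scores, can be outlined as below. This is a sketch with randomly generated stand-in data; the study worked from the Rasch values and its own extraction settings, so nothing here reproduces its figures.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stand-in data: 395 learners x 44 statements, plus Rasch ability
# estimates in logits (both randomly generated for illustration).
values = rng.normal(size=(395, 44))
ability = rng.normal(size=395)

# Step 1: principal components analysis to cluster statements.
pca = PCA(n_components=5)
factor_scores = pca.fit_transform(values)

# Loadings of each statement on each factor; the study treated
# loadings above 0.401 as significant for n = 395 (Field, 2005).
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Step 2: multiple regression of self-assessed ability on the
# factor scores to gauge each factor's predictive contribution.
model = LinearRegression().fit(factor_scores, ability)
print("R^2:", model.score(factor_scores, ability))
print("coefficients:", model.coef_)
```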

Multiple regression analysis was then used to determine how well each of the five factors predicted learner self-ratings of reading proficiency. The results showed that at both A1 and A2, for all three languages, the same three factors (PERSONAL & FAMILIAR, QUICK TEXTUAL and SPECIFIC DETAILS; see Table 7 below), containing 22 statements, best predict learner perceptions of reading proficiency. Table 7 provides a brief description of these three factors with the reliability and average difficulty statistics.

Table 7
The three factors that best predict learner perceptions of reading proficiency.

Factor 1, PERSONAL & FAMILIAR: These statements relate mostly to reading personal communication and reading for personal purposes. Average difficulty: −0.28 logits (easiest). Reliability of scale: 0.82.
Factor 3, QUICK TEXTUAL: These statements tend to relate to higher order cognitive functions such as reading quickly to obtain information, mostly requiring text level comprehension in order to recognise the line of argument and conclusions in texts. Average difficulty: 1.34 logits (most difficult). Reliability of scale: 0.76.
Factor 4, SPECIFIC DETAILS: These statements are similar to those for Factor 3 although do not require text level comprehension in order to locate specific words and details. Average difficulty: 0.24 logits. Reliability of scale: 0.77.

Table 8 below shows the results of the multiple regression analysis, i.e. the contribution of the three factors to learners’ self-rated ability, at A1 and A2. These three factors account for a very high percentage of the learners’ perceived ability at both levels. For example, at A1 these three factors account for the following per cent of perceived ability: 92.5 (German), 94.5 (Japanese) and 85.5 (Urdu), while the figures at A2 are: 95 (German), 94.5 (Japanese) and 85.5 (Urdu). This means that a shorter survey containing 22 statements would reveal nearly as much about learners’ perceived ability as the full survey containing 44 statements at both levels.

Table 8
Factor prediction of ability at A1 and A2.

Language  Level  4 SPECIFIC DETAILS  3 QUICK TEXTUAL  1 PERSONAL & FAMILIAR
German    A1     0.550               0.303            0.071
German    A2     0.457               0.368            0.145
Japanese  A1     0.429               0.054            0.462
Japanese  A2     0.133               0.358            0.465
Urdu      A1     0.439               0.275            0.144
Urdu      A2     0.133               0.358            0.465

The model was tested to assess how generalisable the findings would be to a wider population using Stein’s formula (Field, 2005). Although further testing would be required, the results are indicative of a strong regression model, with the results generalisable to a wider population with only a small reduction (slightly more for Urdu) in the model’s capability to predict learner perceptions of ability (Table 9).

Table 9
Capability of model to predict variance in ability for a wider population.

    German  Japanese  Urdu
A1  91.2%   94.2%     84.9%
A2  97%     93.8%     82.9%
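Stein’s formula, as presented in Field (2005), estimates how well a regression model would predict in a different sample drawn from the same population. A minimal sketch with illustrative inputs:

```python
def stein_adjusted_r2(r2: float, n: int, k: int) -> float:
    """Stein's formula (Field, 2005): cross-validated R^2 for a
    model with R^2 = r2, n cases and k predictors."""
    shrink = ((n - 1) / (n - k - 1)) * ((n - 2) / (n - k - 2)) * ((n + 1) / n)
    return 1 - shrink * (1 - r2)

# Illustrative values only: R^2 = 0.925 on n = 133 learners with
# k = 3 predictors (the three factors).
print(round(stein_adjusted_r2(0.925, 133, 3), 3))  # ~0.921
```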

4.6. Source of statements

Table 10 shows the source of the statements in the full survey and in the 22 statements (three factors) that best represent the perception of reading proficiency of learners of all three languages.


Table 10
Statement source.

Source            Full survey  22 statements
NC for MFL        10           2
Languages Ladder  4            2
ALTE can-do       5            2
Bergen can-do     3            1
CEFR              9            5
CEFR portfolio    13           10
Total             44           22


4.7. Background data

The following tables show some of the learner background data collected. Table 11 shows that learners of Urdu had many more opportunities than learners of German and Japanese to use the language, particularly orally, outside the classroom in their daily life. Table 12 shows that over 50% of the Urdu learners said that they also speak or know Panjabi, which reflects the type of Urdu learner outlined in the introduction of this article. For each question, learners could respond to more than one option and only as relevant to them.

Table 11
Use of target language outside the classroom.

                              German  Japanese  Urdu
Comics                        –       1         –
For fun/messing around        –       4         –
Greetings                     2       5         1
In the community/local shops  1       –         2
In the mosque                 –       –         19
Internet                      5       –         1
On holiday in country         9       1         –
On school trip                10      13        –
E-mail/online chat            –       2         –
Reading letters               –       –         1
Singing songs                 –       1         –
Speak at home                 5       8         53
Speak with wider family       6       2         13
Web conference/skype          –       –         1
Watching TV                   –       1         –
With exchange students        –       3         –
With friends                  9       14        22
Total                         47      55        113

Table 12
Languages other than English and the target language learnt, spoken or known.

          German  Japanese  Urdu
Arabic    –       8         10
Bengali   –       –         10
French    116     38        14
German    –       19        –
Gujarati  –       18        13
Hindi     –       8         –
Panjabi   –       –         71
Pashto    –       –         7
Polish    –       5         –
Spanish   16      36        13
Urdu      –       5         –
Total     132     137       138

5. Discussion

The discussion below responds to the research questions posed.

5.1. Are there any patterns of over or underestimation?

Although there are positive statistically significant correlations between the learner self-assessments and test data and between learner and teacher assessments for all three languages, this does not give the full picture in terms of the accuracy of ratings.


As can be seen in Fig. 2, for German the test results place the majority of learners at A1, followed by A2, with few learners at B1, while the learner and teacher ratings place the majority of learners at A2 with smaller numbers at B1 and B2. Although the numbers here are low, this may reflect the findings of Davidson and Henning (1985), Beresford (2000) and Shameem (1998), who noted tendencies to overestimate at all levels, but it also illustrates individual variation as found by Ross (1998) and Brantmeier (2006). For example, some of the learners at A1 and A2, according to the test, have accurately self-assessed their ability while others have over-estimated their ability. The Japanese results are similar but not as marked. For example, there is good agreement across the three sets of data that the majority of learners are at A1, although some overestimation is evident as the learner and teacher ratings place more learners at A2 and B1 than is reflected by the test scores.

The findings may also reflect that learners at lower proficiency levels are less able to accurately self-assess their ability, as is suggested by Alderson (2005) and Brantmeier et al. (2012). This may be exaggerated in this context due to the limited foreign language reading experience learners have, particularly of more extensive texts (Ofsted, 2008), making rating statements at higher levels too abstract for these learners (Alderson, 2005; Bachman, 1990; Ross, 1998).

The pattern of ratings for Urdu learners is quite different. For example, the test results place a much lower number of learners at A1 and a higher number at A2, B1 and B2 than is reflected by the learner and teacher ratings. Although teacher underestimation may seem surprising, it is perhaps not in this learning context, where learners have many and varied opportunities to use the language outside class (see Table 11). This supports Sneddon’s (2000) findings that it is common for teachers to lack knowledge about the use of community languages in the home. However, in contrast to the findings for the German and Japanese learners, it appears that learners are also underestimating their ability. It is possible that although these learners feel that English is their L1, the amount of Urdu they have been exposed to is much greater than they realise. They may be much more bilingual and biliterate than they give themselves credit for. It is also possible that, similar to the finding in the self-assessment literature that more experienced learners are more likely ‘to be aware of the limits of their skills and knowledge’ (Heilenman, 1990, p. 190), Urdu learners are simply more aware than German and Japanese learners of what they cannot do. Learners may be negatively comparing their proficiency in reading with their proficiency in listening and speaking, which tends to be higher, and/or with the ability of more proficient friends and family members, leading them to devalue the proficiency they do have in reading.

In summary, despite positive and statistically significant correlations across all three data sets for each language, it appears that there is not only individual variation in the accuracy of the self-ratings but also differences across language groups. For example, the Urdu learners appear to have rated their ability more severely than the German and Japanese learners, with over-rating particularly evident for the German learners relative to test scores. It should be reiterated that the number of learners who sat tests in this study was relatively low, and further research is needed to see if these findings are replicated. However, the different pattern across languages is illuminating and raises a challenge for learners and teachers of community languages to better understand and value all the learning that goes on, a key goal in any framework designed to value all languages regardless of how they are learnt.

5.2. What similarities and differences are there across the three learner groups?

The findings reveal that the same three factors (22 statements) best represent the perception of reading proficiency and progression of learners of all three languages. To the extent that frameworks are designed ‘to summarize and simplify, identifying those features which are common to all language users and all situations of use’ (Jones, 2007, p. 21), it could be argued that there are sufficient similarities to relate the proficiency and progression of learners across the three languages. However, several issues also arose which signal areas in need of further investigation.

5.2.1. Can-do statements in English secondary schools
Table 10 shows that although NC for MFL statements made up nearly a quarter of the original survey, only two appear within the final list of 22 statements, while statements from the CEFR and CEFR portfolio account for two thirds of this list. This suggests that the 395 learners in this survey do not endorse the construct of reading presented in the NC for MFL, statements that they are familiar with and which were developed specifically for their context and use. This provides empirical support for critics such as Mitchell (2000, 2003), Hope (2004) and Norman (1998), who have argued that the NC for MFL does not take sufficient account of the developments and research in second language acquisition, which Hope (2004, p. 9) claims has resulted in statements ‘which bear little resemblance to the way learners progress’. Here, bridging of the gap between curriculum design and research in the field of assessment is needed. As North (2000b, p. 556) states, despite the literature on scaling, the use of small authoring teams of teachers substituting ‘intuition and opinion for analysis seems unfortunately very prevalent in Europe’. That the CEFR and CEFR portfolio statements are more salient in defining learners’ view of the construct provides further support for this view and should be taken into consideration in future research and revisions of curriculum statements.

5.2.2. The profile of Japanese learners
While German and Urdu learners at A1 and A2 have the same factor structure, with SPECIFIC DETAILS best predicting ability followed by QUICK TEXTUAL and PERSONAL & FAMILIAR, for Japanese learners the order at A1 was PERSONAL & FAMILIAR, SPECIFIC DETAILS, QUICK TEXTUAL and at A2 PERSONAL & FAMILIAR, QUICK TEXTUAL, SPECIFIC DETAILS. One-way ANOVA tests (see results in Appendix 1) were performed at statement level to shed further light on this finding.


Japanese learners at A1 rated their ability as statistically significantly lower than both German and Urdu learners for four statements in ‘PERSONAL & FAMILIAR’. Looking at the level of these statements is important because learners at A1 would be expected to be able to perform most functions at A1 but fewer at higher levels. Only one of these statements was at A1, with the remainder at A2 and B1, which indicates that German and Urdu learners at A1 feel able to do some of the functions at higher levels. All five statements in ‘QUICK TEXTUAL’ have statistically significant differences, with Japanese learners at A1 finding these statements more difficult than Urdu learners. ‘QUICK TEXTUAL’ statements require reading quickly to obtain information across the text and are at a higher level (A2 and B1) than expected of these learners. This may be a further indication of over-rating and may also highlight issues of reliability for lower level learners rating statements at higher levels, as discussed in Section 5.1.

At A2 for ‘PERSONAL & FAMILIAR’, there are only two statistically significant differences, with Japanese learners rating their ability as lower than other learners for one statement at A2 and one at B1. Although these findings are difficult to interpret at A2 given suspected over-rating, it demonstrates that as Japanese learners progress, the functions at higher levels, although still out of reach, do not seem as distant as they did at A1. Similarly, for ‘QUICK TEXTUAL’ at A2, although statistically significant differences remain for all five statements (two at A2, three at B1), it is a more substantive part of the construct than at A1 (contributing the most variance after ‘PERSONAL & FAMILIAR’).

These findings suggest that the profile of Japanese learners, particularly at A1, is different to that of the German and Urdu learners but that as learners progress, some of these differences disappear. The findings may reflect a difference in instruction, i.e. a difference in what is focused on in the early stages of learning for Japanese, where script acquisition and the micro analysis of characters are likely to be the focus (de Courcy & Birch, 1993; Everson & Ke, 1997). Statements for ‘QUICK TEXTUAL’ may contribute little to the ability of Japanese learners and not be perceived as part of the construct at A1 because these functions are simply too difficult for learners at this level. The fewer statistically significant differences as learners progress may reflect the additional learning and script acquisition of learners. This raises a challenge for the use of functional frameworks at lower levels for L1 English learners of non-Latin script languages. Further research here is needed to map out the progression of learning in these early stages and to compare this to other languages.

5.2.3. The profile of Urdu learners
The results show very little variation in the way German and Japanese learners responded to each can-do statement; there was very high agreement as to the order of difficulty of statements. However, Urdu learners responded to the survey in a much more varied way. ‘Easy’ statements were judged as ‘difficult’ by so-called higher level learners, while ‘difficult’ statements were judged as ‘easy’ by so-called lower level learners. This is demonstrated by the fit values of the Urdu learners and by the lower contribution of the three factors to variance in ability than was found for German and Japanese. Consequently, the model is less robust in terms of generalisability to the wider population for Urdu. This discovery adds extra information to the finding of Coe (2008) that Urdu behaved psychometrically differently to other subjects in a cross-GCSE school subject comparability study. Although the majority of Urdu learners in this study considered English as their L1, they were predominantly of Pakistani heritage and had many more opportunities than learners of German and Japanese to use the language outside the classroom (see Table 11). Follow-up conversations with German and Japanese learners revealed that use of the language at home or with friends typically referred to teaching family members words and phrases or using basic formulaic expressions with friends. Given that the classroom is the main place where German and Japanese students use these languages, it is not surprising that there was less variation in their responses to the survey. By contrast, the range of opportunities to use and learn the language outside the classroom means that there is likely to be more variation in the language knowledge of each Urdu learner.

Analysis revealed that as the age of the Urdu learners increases, there is more consistency in their ratings. It may be that increased classroom experience creates more uniformity in their ratings. The reverse was found with German and Japanese learners. For these learners, it may be that over time an increase in opportunities to use the language outside the classroom creates more variance, or it may simply reflect different (more or less successful) rates of learning over time.

The findings reflect the different statuses that these languages hold in this context. German and Japanese are learnt as second or foreign languages, while Urdu is learnt as a community language. This raises a challenge for the use of functional frameworks for learners of community languages. The variability found indicates that the learning route mapped out in functional frameworks for ‘typical’ foreign language learners does not necessarily reflect the varied learning opportunities and paths for learners of community languages.

6. Conclusion

The present study demonstrates that the same three factors best represent the self-assessed reading proficiency of learners of all three languages. This means that it is possible to relate these learners and learning contexts to the same framework in a multilingual assessment context. However, it also raises concerns about the way functional frameworks designed for typical foreign language learners represent progression at lower levels for learners of Japanese and Urdu. Further research in this area is needed. The present study also demonstrates that the nature of progression in the NC for MFL in England is not endorsed by learners. Finally, different patterns of rating across languages have emerged, with Urdu learners and their teachers consistently underestimating learner ability in relation to test scores.


Conflict of interest

I am a former employee of Cambridge English Language Assessment. This study was part of a PhD, supported financially by Cambridge English Language Assessment and supervised by Dr Neil Jones, Cambridge English Language Assessment.

Acknowledgements

I would like to thank Dr Neil Jones, Cambridge English Language Assessment, for his support and informative review, comments and suggestions. Many thanks also to the anonymous reviewers who provided very helpful feedback and suggestions.

Appendix 1

Factor | Factor loading | Difficulty in logits | Order presented in survey | Source | Level | Statement | Statistically significant differences (0.05 level)
1 | 0.401 | −3.473 | 1 | ALTE Can-do | A1 | CAN understand simple information, for example, from a prospective pen friend (e.g. ‘My name is Anita. I’m 16 and I go to school at.’). | –
1 | 0.665 | −1.277 | 2 | CEFR portfolio 11–16 Czech | A1 | CAN understand short, simple messages on postcards, for example holiday greetings. | A1 J < U; A1 J < G
1 | 0.536 | −1.463 | 4 | Languages Ladder | A2 | CAN understand the main points and simple opinions (e.g. likes, dislikes and preferences) of a longer written text. | A1 J < G
1 | 0.455 | 0.227 | 6 | ALTE Can-do | A2 | CAN understand a letter from a friend expressing personal opinions, experiences and feelings. | –
1 | 0.468 | 1.523 | 7 | CEFR reading correspondence | B1 | CAN understand the description of events, feelings and wishes in personal letters well enough to correspond regularly with a pen friend. | A1 J < U; A2 J < U; A2 J < G
1 | 0.595 | 2.570 | 32 | CEFR portfolio 11–16 Turkish | B2 | CAN read texts such as plays, short stories and poems and evaluate the attitudes of the characters throughout the story. | –
1 | 0.559 | −3.400 | 33 | CEFR portfolio 11–16 Czech | A1 | CAN recognise familiar names, words and phrases in very short, simple texts. | –
1 | 0.548 | −1.263 | 20 | CEFR portfolio 11–16 Turkish | A2 | CAN understand basic everyday language in short, simple texts and instructions. | A1 J < U; A1 J < G; A2 J < G
1 | 0.528 | 2.067 | 30 | Bergen | B2 | CAN read most books that interest me without real difficulty. | –
2 | 0.638 | 2.170 | 8 | ALTE Can-do | B2 | CAN understand what is said in a personal letter, even where colloquial language is used. | –
2 | 0.664 | −1.563 | 9 | Bergen | A1 | CAN understand some words and phrases that I have learnt or that are easy to guess (e.g. on postcards, in adverts, signs, text messages, internet pages). | –
2 | 0.504 | −1.230 | 11 | NC for MFL (UK) | A1 | CAN use a bilingual dictionary to look up unknown words. | –
2 | 0.733 | 0.213 | 13 | NC for MFL (UK) | A2 | CAN use the context and knowledge of the language to work out the meaning of unknown phrases. | –
2 | 0.522 | 1.037 | 14 | CEFR overall reading comprehension | B2 | CAN read with a large degree of independence, adapting style and speed of reading to different texts and purposes, and using appropriate reference sources (e.g. dictionary) selectively. | –
2 | 0.428 | −3.277 | 17 | CEFR overall reading comprehension | A2 | CAN understand short, simple texts containing the most common words in the language. | –
2 | 0.567 | 0.283 | 22 | NC for MFL (UK) | A2 | CAN understand texts referring to present and past and future events. | –
3 | 0.616 | −1.003 | 36 | NC for MFL (UK) | A2 | CAN identify the main points and specific details from texts written in simple language. | A1 J < U; A2 J < U; A2 J < G; A2 G < U
3 | 0.650 | 0.380 | 37 | CEFR portfolio 11–16 Turkish | A2 | CAN identify the key information in newspaper and magazine reports such as place, time and people. | A1 J < U; A2 J < U
3 | 0.634 | 1.463 | 40 | CEFR reading for orientation | B1 | CAN scan longer texts in order to locate desired information, and gather information from different parts of a text, or from different texts in order to fulfil a specific task. | A1 J < U; A1 G < U; A2 G < U; A2 J < U
3 | 0.617 | 2.217 | 41 | CEFR reading for information and argument | B1 | CAN recognise the line of argument in the treatment of the issue presented, though not necessarily in detail. | A1 J < U; A1 J < G; A1 U < G; A2 G < U; A2 J < U
3 | 0.734 | 2.097 | 42 | CEFR reading for information and argument | B1 | CAN identify the main conclusions in clearly structured argumentative texts. | A1 J < U; A1 G < U; A2 G < U; A2 J < U
4 | 0.441 | −0.667 | 12 | NC for MFL (UK) | A2 | CAN use the context to understand unknown words. | –
4 | 0.568 | −1.000 | 35 | CEFR portfolio 11–16 Czech | A1 | CAN find basic information such as where a film is on and when it starts. | –
4 | 0.609 | 0.963 | 39 | CEFR portfolio 11–16 Turkish | B1 | CAN find relevant information like who has done what and where by skimming texts such as news summaries. | –
4 | 0.546 | 0.873 | 43 | CEFR portfolio 11–16 Turkish | B2 | CAN quickly scan a variety of texts and understand most of the words. | –
4 | 0.723 | 1.830 | 44 | CEFR reading for orientation | B2 | CAN scan quickly through long and complex texts, locating relevant details. | –
4 | 0.401 | −1.543 | 19 | Languages Ladder | A2 | CAN understand the main points and some of the detail from a short written text. | –
4 | 0.458 | −2.377 | 24 | CEFR portfolio Sud Tirol | B1 | CAN understand simple texts that are about my hobby or something I am interested in. | –
4 | 0.481 | 1.827 | 25 | CEFR portfolio 11–16 Turkish | B1 | CAN understand the main points of the news in newspapers and magazines. | –
5 | 0.522 | −3.600 | 15 | NC for MFL (UK) | A1 | CAN understand simple words in a familiar context. | –
5 | 0.455 | −2.663 | 16 | CEFR overall reading comprehension | A1 | CAN understand very short, simple texts a single phrase at a time, picking up familiar names, words and basic phrases and rereading as required. | –
5 | 0.497 | 1.577 | 28 | Languages Ladder | B1 | CAN understand a wide range of authentic texts in familiar contexts. | –
5 | 0.656 | 0.283 | 29 | Bergen | B1 | CAN read simple books if I am very motivated and they use familiar language. | –
X | X | −3.193 | 3 | ALTE Can-do | A1 | CAN understand opinions simply expressed, for example ‘I don’t like football’. | –
X | X | −0.923 | 5 | ALTE Can-do | A1 | CAN understand a text which describes people or everyday life. | –
X | X | −3.630 | 10 | CEFR portfolio 11–16 Czech | A1 | CAN understand what short simple texts are about, especially if they have pictures. | –
X | X | 3.020 | 31 | CEFR portfolio 11–16 Turkish | B2 | CAN read and understand articles and reports on current problems in which the writers express specific attitudes and points of view. | –
X | X | −3.103 | 34 | CEFR portfolio 11–16 Turkish | A1 | CAN pick out the familiar words and phrases in very short simple texts. | –
X | X | 1.220 | 38 | CEFR reading for information and argument | A2 | CAN identify specific information in simple written material such as letters, brochures and short newspaper articles describing events. | –
X | X | −1.620 | 18 | NC for MFL (UK) | A1 | CAN understand short texts and dialogues made up of familiar language. | –
X | X | 0.923 | 21 | NC for MFL (UK) | A2 | CAN understand a range of texts using familiar language in unfamiliar contexts. | –
X | X | 0.327 | 23 | NC for MFL (UK) | A2 | CAN understand the main points and some details from short stories and factual texts. | –
X | X | 1.897 | 26 | NC for MFL (UK) | B1 | CAN understand a range of texts, including some complex sentences and unfamiliar language. | –
X | X | 1.220 | 27 | Languages Ladder | B1 | CAN understand texts including some unfamiliar material from which I can recognise attitudes and emotions. | –
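As a note on the ‘Difficulty in logits’ column above: under the Rasch model, a statement’s difficulty is the point on the logit (log-odds) scale at which a learner of matching ability has a 50% chance of endorsing it. The study’s estimates were produced with WINSTEPS (Linacre, 2010); the sketch below is only a minimal, hypothetical illustration of the underlying idea, using an invented dichotomous response matrix and a crude joint maximum likelihood loop rather than the software actually used.

```python
import numpy as np

# Hypothetical response matrix: rows = learners, columns = can-do statements;
# 1 = statement endorsed ("I can do this"), 0 = not endorsed.
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
], dtype=float)

n_persons, n_items = X.shape
theta = np.zeros(n_persons)  # person abilities, in logits
beta = np.zeros(n_items)     # statement difficulties, in logits

def prob(theta, beta):
    """Rasch model: P(endorse item j | person i) = sigmoid(theta_i - beta_j)."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))

# Damped Newton updates for joint maximum likelihood estimation:
# match each person's/item's expected score to the observed score.
for _ in range(500):
    P = prob(theta, beta)
    theta += 0.5 * (X.sum(axis=1) - P.sum(axis=1)) / (P * (1 - P)).sum(axis=1)
    P = prob(theta, beta)
    beta -= 0.5 * (X.sum(axis=0) - P.sum(axis=0)) / (P * (1 - P)).sum(axis=0)
    beta -= beta.mean()  # centre difficulties at 0 logits

# Negative values = easier statements; positive = harder,
# as in the difficulty column of the table above.
print(np.round(beta, 2))
```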


References

Akita, K., & Hatano, G. (1999). Learning to read and write in Japanese. In M. Harris, & G. Hatano (Eds.), Learning to read and write (pp. 214–234). Cambridge: Cambridge University Press.
Alderson, C. (2005). Diagnosing foreign language proficiency. London: Continuum.
Alexandria, W. (2009). Adult English learners’ self assessment of second language proficiency: Contexts and conditions. New York: New York University Press.
AlFallay, I. (2004). The role of some selected psychological and personality traits of the rater in the accuracy of self- and peer-assessment. System, 32, 407–425.
ALTE. (2002). ALTE can-do project. Retrieved 20.11.04 from www.alte.org/can-do/index.cfm.
Asset Languages. (2005). Can-do statements. Retrieved 04.09.05 from www.assetlanguages.org.uk.
Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F., & Clark, J. L. D. (1987). The measurement of foreign/second language proficiency. Annals of the American Academy, 490, 20–33.
Bachman, L. F., & Palmer, A. S. (1989). The construct validation of self-ratings of communicative language ability. Language Testing, 6(1), 14–29.
Beresford, J. (2000). Student perspectives at school improvement. Paper presented at the British Educational Research Association conference, Cardiff University.
Blatchford, P. (1997). Students’ self assessment of academic attainment: accuracy and stability from 7 to 16 years and influence of domain and social comparison group. Educational Psychology, 17(3), 345–360.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum.
Boud, D., & Falchikov, N. (1989). Quantitative studies of self-assessment in higher education: a critical analysis of findings. Higher Education, 18, 529–549.
Brantmeier, C. (2006). Advanced L2 learners and reading placement: self-assessment, CBT, and subsequent performance. System, 34, 15–35.
Brantmeier, C., & Vanderplank, R. (2008). Descriptive and criterion-referenced self-assessment with L2 readers. System, 36, 456–477.
Brantmeier, C., Vanderplank, R., & Strube, M. (2012). What about me? Individual self-assessment by skill and level of language instruction. System, 40, 144–160.
British Educational Research Association. (2004). Revised ethical guidelines for educational research. Retrieved from http://www.bera.ac.uk/files/guidelines/ethica1.pdf.
CILT. (2005). Language trends 2005: Community language learning in England, Wales and Scotland. London: CILT.
CILT. (2006). Positively plurilingual: The contribution of community languages to UK education and society. CILT.
Coe, R. (2008). Comparability of GCSE examinations in different subjects: an application of the Rasch model. Oxford Review of Education, 34(5).
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.
de Courcy, M., & Birch, G. (1993). Reading and writing strategies used in a Japanese immersion program. Research report. Nathan, Queensland: Faculty of Education, Griffith University.
Davidson, F., & Henning, G. (1985). A self-rating scale of English difficulty: Rasch scalar analysis of items and rating categories. Language Testing, 2(2), 164–179.
DfES. (2004). Languages ladder – steps to success. Retrieved 18.06.04 from http://www.dfes.gov.uk/languages/DSP_languagesladder.cfm.
DfES, & QCA. (1999). The national curriculum for England: Modern foreign languages. London: DfES, QCA.
Ellis, R. (2003). Task-based language learning and teaching. Oxford: Oxford University Press.
European Commission. (2012). First European survey on language competences: Final report. Luxembourg: Publications Office of the European Union. Retrieved from http://ec.europa.eu/languages/eslc/index.html.
Everson, M. E., & Ke, C. (1997). An inquiry into the reading strategies of intermediate and advanced learners of Chinese as a foreign language. Journal of Chinese Language Teachers, 3(1), 1–20.
Field, A. P. (2005). Discovering statistics using SPSS for Windows: Advanced techniques for the beginner (2nd ed.). London: SAGE.
Grenfell, M. (1992). Process reading in the communicative classroom. The Language Learning Journal, 6(1), 48–52.
Hargan, N. (1994). Learner autonomy by remote control. System, 22(4), 455–462.
Hasselgreen, A. (2003). Bergen ‘can do’ project. Strasbourg: Council of Europe.
Heilenman, K. (1990). Self-assessment of second language ability: the role of response effect. Language Testing, 7(2), 174–201.
Hope, M. (2004). Assessment at 16+ in MFLs: Where have we come from? Where are we now? And how on earth did that happen? Paper presented at Language World, Oxford.
Jones, N. (2000a). Background to the validation of the ALTE ‘can-do’ project and the revised common European framework. Research Notes, 2.
Jones, N. (2000b). BULATS: a case study comparing computer based and paper-and-pencil tests. Research Notes, 3(4), 10–13.
Jones, N. (2002). Relating the ALTE framework to the common European framework of reference. In Council of Europe (Ed.), Case studies on the use of the common European framework of reference (pp. 167–183). Strasbourg: Council of Europe Publishing.
Jones, N. (2007). Assessment and the national languages strategy. Cambridge Journal of Education, 37(1), 17–33.
Jones, N., Ashton, K., & Walker, T. (2010). Asset languages: a case study of piloting the CEFR manual. In W. Martyniuk (Ed.), Aligning tests with the CEFR (Vol. 33). Cambridge: UCLES, Cambridge University Press.
Khalifa, H., & Weir, C. (2009). Examining reading: Research and practice in assessing second language reading. Cambridge: Cambridge University Press.
Koda, K. (1994). Second language reading research: problems and possibilities. Applied Psycholinguistics, 15, 1–28.
LeBlanc, R., & Painchard, G. (1985). Self-assessment as a second language placement instrument. TESOL Quarterly, 19(4), 673–687.
Linacre, J. (1998). Detecting multidimensionality: which residual data-type works best? Journal of Outcome Measurement, 2(3), 266–283.
Linacre, J. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3(2), 103–122.
Linacre, J. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2), 878.
Linacre, J. (2003). Rasch power analysis: size vs. significance: infit and outfit mean-square and standardized chi-square fit statistic. Rasch Measurement Transactions, 17(1), 918.
Linacre, J. (2010). A user’s guide to WINSTEPS: Rasch-model computer programs. Chicago: Winsteps.com.
Little, D. (2006). The common European framework of reference for languages: content, purpose, origin, reception and impact. Language Teaching, 39, 167–190.
Little, D., & Perclova, R. (2001). European language portfolio guide for teachers and teacher trainers. Strasbourg: Council of Europe.
Litz, D. (2009). Self-assessment in academic writing: a UAE case study. Learning and Teaching in Higher Education: Gulf Perspectives, 6(2).
Macaro, E., & Erler, L. (2008). Raising the achievement of young-beginner readers of French through strategy instruction. Applied Linguistics, 29(1), 90–119.
McPake, J. (2006). Provision for community language learning in Scotland: Final report. Retrieved 22.01.07 from http://www.scotland.gov.uk/Publications/2006/09/07093013/0.
Mitchell, R. (2000). Prescribed language standards and foreign language classroom practice: relationships and consequences. In C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita, T. Lumley, et al. (Eds.), Experimenting with uncertainty: Essays in honour of Alan Davies (Vol. 11, pp. 163–176). Cambridge: CUP/UCLES.
Mitchell, R. (2003). Rethinking the concept of progression in the national curriculum for modern foreign languages: a research perspective. Language Learning Journal, 27.
Mumtaz, S., & Humphreys, G. W. (2001). The effects of bilingualism on learning to read English: evidence from the contrast between Urdu–English bilingual and English monolingual children. Journal of Research in Reading, 24(2), 113–134.
Norman, N. (1998). Prescription and autonomy in modern language curricula and materials in Great Britain and Germany: the search for a missing methodology. Language Learning Journal, 17, 48–56.
North, B. (2000a). The development of a common framework scale of language proficiency (Vol. 8). New York: Peter Lang Publishing Inc.
North, B. (2000b). Linking language assessments: an example in a low stakes context. System, 28, 555–577.
North, B., & Jones, N. (2009). Relating language examinations to the common European framework of reference for languages: Learning, teaching, assessment (CEFR) – Further material on maintaining standards across languages, contexts and administrations by exploiting teacher judgment and IRT scaling. Strasbourg: Council of Europe.
OCR. (2013). Asset languages. Retrieved 28.09.13 from http://www.ocr.org.uk/qualifications/by-type/asset-languages/.
Ofsted. (2008). The changing landscape of languages: An evaluation of language learning 2004/2007. London: Ofsted.
Ogawa Douglas, M. (1992). Development of orthography-related reading/writing strategies by learners of Japanese as a foreign language (Unpublished PhD thesis). Southern California: University of Southern California.
Oscarson, M. (1997). Self-assessment of foreign and second language proficiency. In C. Clapham, & D. Corson (Eds.), The encyclopedia of language and education: Vol. 7. Language testing and assessment (pp. 175–187). Dordrecht: Kluwer Academic.
Patri, M. (2002). The influence of peer feedback on self- and peer-assessment. Language Testing, 19(2), 109–132.
QCA. (2008). Grade standards in GCSE modern foreign languages. London: QCA.
Raîche, G. (2005). Critical eigenvalue sizes in standardized residual principal components analysis. Rasch Measurement Transactions, 19(1). Retrieved 2007, from http://www.rasch.org/rmt/rmt191h.htm.
Ross, S. (1998). Self-assessment in second language testing: a meta-analysis and analysis of experiential factors. Language Testing, 15(1), 1–20.
Ross, S. (2006). The reliability, validity, and utility of self-assessment. Practical Assessment, Research & Evaluation, 11(10), 1–13.
Saito, Y. (2003). The use of self-assessment in second language assessment. Working Papers in TESOL & Applied Linguistics, 3(1).
Sasaki, M. (2004). Acquisition of reading processes in the first and second writing systems (Unpublished PhD thesis). Essex: Department of Language and Linguistics, University of Essex.
Shameem, N. (1998). Validating self-reported language proficiency by testing performance in an immigrant community: the Wellington Indo-Fijians. Language Testing, 15(1), 86–108.
Shen, C. (2002). Revisiting the relationship between students’ achievement and their self-perceptions: a cross-national analysis based on TIMSS 1999 data. Assessment in Education, 9(2), 161–184.
Shen, C., & Pedulla, J. J. (2000). The relationship between students’ achievement and their self-perception of competence and rigour of mathematics and science. Assessment in Education, 7(2), 237–253.
Sneddon, R. (2000). Language and literacy: children’s experiences in multilingual environments. International Journal of Bilingual Education and Bilingualism, 3(4), 265–282.
TES. (2012). Language woes deepen. Retrieved 27.09.13 from http://www.tes.co.uk/article.aspx?storycode=6289107.