Journal of Clinical Epidemiology 59 (2006) 36–44

Item response theory was used to shorten EORTC QLQ-C30 scales for use in palliative care

Morten Aa. Petersen a,*, Mogens Groenvold a,b, Neil Aaronson c, Jane Blazeby d,e, Yvonne Brandberg f, Alexander de Graeff g, Peter Fayers h, Eva Hammerlid i, Mirjam Sprangers j, Galina Velikova k, Jakob B. Bjorner l,m; for the European Organisation for Research and Treatment of Cancer Quality of Life Group

a The Research Unit, Department of Palliative Medicine, Bispebjerg Hospital, DK-2400 Copenhagen NV, Denmark
b Institute of Public Health, University of Copenhagen, Denmark
c Division of Psychosocial Research & Epidemiology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
d Clinical Sciences at South Bristol, Bristol, UK
e Department of Social Medicine, University of Bristol, Bristol, UK
f Psychosocial Unit, Department of Oncology, Karolinska Hospital, Stockholm, Sweden
g Department of Internal Medicine, University Medical Centre, Utrecht, The Netherlands
h Department of Public Health, Aberdeen University Medical School, Aberdeen, UK
i Department of Otolaryngology Head & Neck Surgery, Sahlgrenska University Hospital, Gothenburg, Sweden
j Department of Medical Psychology, Academic Medical Centre, University of Amsterdam, Amsterdam, The Netherlands
k Cancer Research UK Clinical Centre Leeds, St James’s University Hospital, UK
l National Institute of Occupational Health, Copenhagen, Denmark
m Qualitymetric Incorporated, Lincoln, RI, USA

Accepted 21 April 2005

Abstract

Background and Objective: The goal was to develop a shortened version of the EORTC QLQ-C30 for use in palliative care. We wanted to keep as few items as possible in each scale while still being able to compare results with studies using the original scales. We examined the possibilities of shortening the physical functioning, cognitive functioning, fatigue, and nausea and vomiting scales.

Study Design and Setting: The shortening was based on 2,366 (physical functioning) and 10,815 (three other scales) observations, respectively. We used item response theory to construct scoring algorithms for predicting scores on the original scales.

Results: Evaluations showed that a three-item physical scale, a two-item fatigue scale, and a one-item nausea or vomiting scale predicted the scores on the original scales with excellent agreement and had measurement abilities similar to the original scales with no loss or only a little loss in power to detect group differences. The results of the cognitive functioning scale indicated problems when predicting scores from a shortened version.

Conclusion: Given the favorable results for the physical functioning, fatigue, and nausea or vomiting scales we expect that the shortened versions of these scales will be included in the abbreviated version of the EORTC QLQ-C30 for palliative care. © 2006 Elsevier Inc. All rights reserved.

Keywords: EORTC QLQ-C30; IRT; Scales, multi-item; Palliative care; Quality of life; Scales, shortening

1. Introduction

According to the World Health Organization (WHO) definition, palliative care is the active, total care of patients whose disease is not responsive to curative treatment. The goal of palliative care is to achieve the best quality of life for patients and their families [1]. Therefore, for descriptive and evaluative studies in palliative care, there is a great need for well-validated questionnaires suitable for measuring the important dimensions of the patient’s quality of life. Such questionnaires should be multidimensional, measuring symptom-related aspects as well as psychological, social, and other aspects of the patient’s well-being. It is crucial that questionnaires are as brief as possible, to keep the response burden at a minimum. Other important considerations when selecting a questionnaire are the measurement (psychometric) properties and the availability of data from published studies for comparisons.

* Corresponding author. Tel.: 3531-2025; fax: 3531-2071.

E-mail address: [email protected] (M.Aa. Petersen).

0895-4356/06/$ – see front matter © 2006 Elsevier Inc. All rights reserved.

doi: 10.1016/j.jclinepi.2005.04.010

At present, a wide range of questionnaires is used for measuring quality of life in palliative care [2,3]. This diversity is understandable, because studies may have different research questions and therefore may require different methods. Many questionnaires have similar content, however, and so the diversity may also reflect lack of consensus about the relative merits of different questionnaires. This complicates comparisons of results across studies. Furthermore, many of the available questionnaires used in palliative care have not undergone in-depth psychometric or cross-cultural validation [2].

The European Organisation for Research and Treatment of Cancer Quality of Life Questionnaire (EORTC QLQ-C30) [4,5] is one of the most widely used disease-specific quality of life questionnaires [6,7]. The questionnaire is a familiar instrument for many physician researchers and, importantly, published studies and reference data [8] are available for comparisons of results. The QLQ-C30 has been tested extensively for validity, reliability, and other measurement characteristics [4,9–13]. These studies have shown the questionnaire to be a generally valid and reliable instrument; however, cancer patients in palliative care are extremely ill and any questionnaire for this field should be as brief and as focused as possible.

We initiated a project to develop a shortened version of the EORTC QLQ-C30 for use with cancer patients in specialized palliative care, that is, patients with advanced, incurable, and symptomatic cancer, who are in contact with hospices, departments of palliative care, palliative teams, or similar support. The methods and results for shortening the emotional functioning scale have been reported; the scale was shortened from four to two items [14]. Another part of this project used interviews with patients (N = 41) and health care professionals (N = 66) from six European countries to determine which domains should be retained in a shortened questionnaire for palliative care. Based on these interviews, it was found that it would be appropriate to develop shortened versions of four additional scales in a palliative care version of the questionnaire: the physical functioning (PF) scale, the fatigue (FA) scale, the nausea or vomiting (NV) scale, and the cognitive functioning (CF) scale. These domains were found to be relevant and suitable for the target group but should optimally be measured with fewer items. That is, in all, five of the multi-item scales in the QLQ-C30 should be included in the questionnaire for palliative care in shortened versions if possible.

The aim of the present paper is to evaluate the possibilities for shortening the PF, FA, NV, and CF scales for use in palliative care. We wanted to keep as few items as possible in the scales while still being able to directly compare results obtained with the shortened scales with those from studies using the original, unabbreviated scales. Several approaches have been used to shorten scales, including methods based on Cronbach’s α, item–scale correlations, factor analysis, and expert opinions (see Coste et al. [15] for a review of methods used to shorten scales). A limitation of all these approaches is that the scores on the shortened scales are not compatible with the scores from the original scales, because they are not on the same metric. We therefore used a new approach, unique in that it seeks to make the scores on the shortened scales compatible with the scores from the original scales. This is accomplished by using item response theory (IRT) [16] to select items for the shortened scales and to construct scoring algorithms for predicting the scores on the full scales from the responses to the items in the shortened scales. This approach to shortening scales was first described in [14] and is further developed, applied, and evaluated in the present paper. That is, the results presented here are also an evaluation of a new approach for shortening scales, an approach that may be used in general to construct shortened questionnaires that are compatible with the original questionnaires.

2. Methods

2.1. Sample

We established a database of ongoing or completed studies carried out by members of the EORTC Quality of Life Group. All studies used the EORTC QLQ-C30 [4,5]. Only one assessment per subject was used. The background variables language, gender, age, stage of disease, and cancer site were also collected when available. The database included a total of 10,815 subjects representing 10 European languages.

Because the items of the EORTC QLQ-C30 physical functioning scale have been revised, only the latest version (version 3) of the questionnaire could be used for the analyses of this scale. For these analyses, 2,366 subjects were available.

2.2. Questionnaire

The EORTC QLQ-C30 consists of 30 items. Twenty-four of the items form nine scales and six are single-item symptom measures. The scale scores are constructed by summation and linear transformation of the scores on the items [5].

Our focus here is on the physical functioning, fatigue, nausea or vomiting, and cognitive functioning scales. These consist of five, three, two, and two items, respectively. Each item has four response categories: Not at all, A little, Quite a bit, and Very much. Table 1 shows the wording of the items. Scale scores were calculated for respondents having nonmissing values for all items in the scale.
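The summation and linear transformation can be sketched as follows. This is a minimal illustration, assuming the conventional numeric coding of the response categories (1 = Not at all to 4 = Very much), which the text above does not state explicitly; the function name `scale_score` is hypothetical.

```python
def scale_score(item_responses, functional):
    """Return a 0-100 scale score, or None if any item is missing.

    item_responses: list of item codes, assumed 1 (Not at all) .. 4 (Very much),
                    with None marking a missing response.
    functional:     True for functioning scales (PF, CF), where 100 is best;
                    False for symptom scales (FA, NV), where 100 is worst.
    """
    # Scores are only computed for respondents with all items answered.
    if any(r is None for r in item_responses):
        return None
    raw = sum(item_responses) / len(item_responses)  # mean item code, 1..4
    item_range = 3                                   # 4 categories coded 1..4
    linear = (raw - 1) / item_range * 100            # 0 = no problems
    return 100 - linear if functional else linear


# One "A little" among five PF items lowers the score by one level (about 6.7).
pf_example = scale_score([1, 1, 1, 1, 2], functional=True)
fa_example = scale_score([1, 2, 1], functional=False)
```

With five items the score moves in steps of 100/15 (about 6.7), with three items in steps of 100/9 (about 11.1), and with two items in steps of 100/6 (about 16.7).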

2.3. Analysis strategy

2.3.1. Differential item functioning analysis

We wanted to construct shortened scales that could be used for all palliative care patients. The shortened scales should be able to predict the scores on the full scales with acceptable precision, not only in an average, mixed sample, but also in subgroups (e.g., specific countries, cancer sites). Differential item functioning (DIF) in a scale means that the items in the scale function differently in different subgroups [17]. If there is DIF, the shortened scale’s ability to predict may vary from subgroup to subgroup; the shortened scale might not be an appropriate measure to use in all subgroups of palliative care patients. Therefore, we first investigated for DIF.

We used the contingency table method described previously [18] to test for DIF between subgroups defined by age, gender, cancer site, language, and stage (i.e., palliative vs. nonpalliative). For a more detailed description of the DIF analysis method, see Petersen et al. [18]; see also Groenvold et al. [19] and Kreiner et al. [20].

Table 1

Item wording, subject numbers, nonmissing responses, and mean scores

| Variable (a) | N (b) | Nonmissing, % | Mean score | SD |
|---|---|---|---|---|
| PF scale, physical functioning | 2,314 | 97.8 | 71.1 | 28.0 |
| Item 1: Do you have any trouble doing strenuous activities, like carrying a heavy shopping bag or a suitcase? | 2,345 | 99.1 | 55.0 | 38.1 |
| Item 2: Do you have any trouble taking a long walk? | 2,342 | 99.0 | 56.6 | 39.9 |
| Item 3: Do you have any trouble taking a short walk? | 2,331 | 98.5 | 77.0 | 33.2 |
| Item 4: Do you need to stay in bed or a chair during the day? | 2,339 | 98.9 | 74.6 | 32.9 |
| Item 5: Do you need help with eating, dressing, washing yourself or using the toilet? | 2,345 | 99.1 | 91.9 | 21.7 |
| FA scale, fatigue | 10,590 | 97.9 | 35.1 | 27.7 |
| Item 10: Did you need to rest? | 10,678 | 98.7 | 37.3 | 30.6 |
| Item 12: Have you felt weak? | 10,681 | 98.8 | 30.1 | 31.2 |
| Item 18: Were you tired? | 10,685 | 98.8 | 38.1 | 30.3 |
| NV scale, nausea or vomiting | 9,945 | 92.0 | 9.6 | 19.6 |
| Item 14: Have you felt nauseated? | 9,969 | 92.2 | 13.3 | 24.9 |
| Item 15: Have you vomited? | 9,959 | 92.1 | 6.0 | 18.3 |
| CF scale, cognitive functioning | 9,929 | 91.8 | 82.4 | 22.2 |
| Item 20: Have you had difficulty in concentrating on things, like reading a newspaper or watching TV? | 9,968 | 92.2 | 84.5 | 26.1 |
| Item 25: Have you had difficulty remembering things? | 9,960 | 92.1 | 80.3 | 25.8 |

Abbreviations: SD, standard deviation.

(a) Items 1–5, 20, and 25 are scored: 0 = Very much, 33.3 = Quite a bit, 66.7 = A little, 100 = Not at all. Items 10, 12, 14, 15, and 18 are scored: 0 = Not at all, 33.3 = A little, 66.7 = Quite a bit, 100 = Very much.

(b) For PF, total N = 2,366; for FA, NV, and CF, total N = 10,815.

The effect of possible DIF findings with regard to age, gender, and cancer site was evaluated by dividing the palliative care patients into subgroups according to DIF and then comparing the properties of the shortened scales in these subgroups. DIF between stages and languages was handled in one of three ways. If DIF between stages but not between languages was found, the groups of nonpalliative care patients were excluded from the subsequent analyses; these subjects were only included to increase the precision of the model estimations. If DIF between languages, but not stages, was found, we investigated whether the prediction was equally good for palliative care patients from different countries; however, we had palliative care patients from Scandinavian countries only. Even if appropriate for use in Scandinavia, the shortened scales might not be appropriate for use with palliative care patients in other countries. To investigate this, we used the nonpalliative care patients in the database. These patients had markedly better scores than palliative care patients. This was taken into account by stratifying according to the observed scale scores and then examining the prediction within each stratum. Furthermore, the results from each stratum were combined by weighting the results according to the distribution of the observed scale scores for the palliative care patients. This was used as an approximation of the ability to predict for palliative care patients in the countries from which we did not have palliative care patients. Finally, if there was DIF between languages and also between stages, we estimated the IRT model once using all available observations and again using the palliative care patients only and compared the results. If the two models resulted in similar prediction, we concluded that the DIF did not have an important effect on the selection of items or on the construction of a scoring algorithm. Further, the prediction of scores was evaluated on both the palliative and the nonpalliative care patients.
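The stratify-and-reweight step can be illustrated with a short sketch. It assumes a plain weighted average in which each stratum of nonpalliative patients is weighted by the number of palliative care patients with the same observed score; the function name, the handling of empty strata, and the numbers are hypothetical and not taken from the study.

```python
def weighted_result(stratum_results, palliative_counts):
    """Combine per-stratum results using the palliative score distribution.

    stratum_results:   {observed score level: result computed among the
                        nonpalliative patients with that observed score}
    palliative_counts: {observed score level: number of palliative care
                        patients with that observed score}
    Strata with no nonpalliative result are skipped and the weights are
    renormalised (an assumption made for this sketch).
    """
    total = sum(palliative_counts.get(level, 0) for level in stratum_results)
    weighted_sum = sum(result * palliative_counts.get(level, 0)
                       for level, result in stratum_results.items())
    return weighted_sum / total


# Hypothetical example: percentage correctly predicted within three strata.
pct_correct = {0.0: 80.0, 50.0: 60.0, 100.0: 70.0}
n_palliative = {0.0: 1, 50.0: 2, 100.0: 1}
combined = weighted_result(pct_correct, n_palliative)
```

The combined figure then approximates what would be seen in a nonpalliative sample whose score distribution matched that of the palliative care patients.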

2.3.2. IRT models

IRT-based methods were used for the selection of items for the shortened scales and for the prediction of scale scores. For the FA scale, we used the generalized partial credit model (GPCM) [21]. The GPCM was estimated by marginal maximum likelihood estimation using the PARSCALE computer program [22]. A GPCM could not be fitted to the PF scale using our data (the estimation procedure could not converge) and a GPCM cannot be fitted to a two-item scale (the model is overparameterized). Therefore, for the PF, NV, and CF scales the more restrictive partial credit model (PCM) [23] was used. The PCM was estimated by conditional maximum likelihood [23] using the program OPLM [24].
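For orientation, the GPCM is commonly written in the following form. The notation is an assumption of this note rather than the paper's own: θ is the IRT score, a_i the discrimination of item i, b_{iv} its step (threshold) parameters, and m_i + 1 the number of response categories (four for the QLQ-C30 items, so m_i = 3).

```latex
P(X_i = k \mid \theta)
  = \frac{\exp\!\left(\sum_{v=1}^{k} a_i(\theta - b_{iv})\right)}
         {\sum_{c=0}^{m_i} \exp\!\left(\sum_{v=1}^{c} a_i(\theta - b_{iv})\right)},
  \qquad k = 0, 1, \dots, m_i ,
```

where the empty sum (k = 0 or c = 0) is taken to be zero. The PCM is the special case in which all items share the same discrimination, conventionally a_i = 1, which is why it is the more restrictive of the two models.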

2.3.3. Prediction of scores

We evaluated two methods for constructing scoring algorithms to predict the scores on the full scale from the responses to the items in the shortened scale. Both methods used the estimated IRT model and an estimate of the IRT score, which is the true but unobservable level of a scale. In method 1, the IRT score was estimated using all items in the scale. Based on these IRT score estimates, the probability of each combination of responses to the items in the full scale was estimated. For each combination of the items in the shortened scale, the most likely response combination of all items in the scale was then chosen. From these responses, the predicted scale scores were calculated.

In method 2, the IRT score was estimated using the items of the shortened scale only. For each IRT score estimate, we estimated the probabilities of the possible scores on the full scale. For each combination of the items in the shortened scale, we then chose the score on the full scale with the largest estimated probability as the predicted scale score.
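A rough sketch of method 2 under a PCM is given below. Several details are assumptions made only for illustration: the IRT score is estimated with an expected a posteriori (EAP) estimate under a standard normal prior (the text does not say which estimator was used), the full-scale score distribution is evaluated at that estimate without conditioning on the observed short-scale responses, and the item names and thresholds are hypothetical rather than estimates from the study.

```python
import math


def pcm_probs(theta, thresholds):
    """Category probabilities (0..m) for one item under the partial credit model."""
    exps = [0.0]
    for b in thresholds:
        exps.append(exps[-1] + (theta - b))
    top = max(exps)                       # subtract the max for numerical stability
    weights = [math.exp(e - top) for e in exps]
    total = sum(weights)
    return [w / total for w in weights]


GRID = [-4 + 0.1 * i for i in range(81)]  # theta grid for numerical integration


def eap_theta(responses, items):
    """EAP estimate of theta from the short-scale responses {item: category}."""
    num = den = 0.0
    for t in GRID:
        like = math.exp(-0.5 * t * t)     # unnormalised standard normal prior
        for name, category in responses.items():
            like *= pcm_probs(t, items[name])[category]
        num += t * like
        den += like
    return num / den


def sum_score_distribution(theta, items):
    """Distribution of the full-scale sum score at theta (convolution over items)."""
    dist = [1.0]
    for thresholds in items.values():
        probs = pcm_probs(theta, thresholds)
        new = [0.0] * (len(dist) + len(probs) - 1)
        for s, ps in enumerate(dist):
            for k, pk in enumerate(probs):
                new[s + k] += ps * pk
        dist = new
    return dist


def predict_full_score(responses, items):
    """Most probable full-scale score, on a 0-100 scale, given short-scale responses."""
    theta = eap_theta(responses, items)
    dist = sum_score_distribution(theta, items)
    best = max(range(len(dist)), key=lambda s: dist[s])
    return 100 * best / (len(dist) - 1)


# Hypothetical thresholds for a three-item scale; not estimates from the paper.
ITEMS = {"item_a": [-1.0, 0.0, 1.0],
         "item_b": [-0.5, 0.5, 1.5],
         "item_c": [-1.5, -0.5, 0.5]}
low = predict_full_score({"item_b": 0, "item_c": 0}, ITEMS)
high = predict_full_score({"item_b": 3, "item_c": 3}, ITEMS)
```

Because a short scale has only a few response combinations, running `predict_full_score` over all of them yields a small lookup table, which is how a scoring algorithm of this kind would typically be applied in practice.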

2.3.4. Selection of items

Two criteria were used for the selection of items for the shortened scales: the item information functions (IIFs) [22,25] and the ability to predict scores on the full scales. The IIF is a measure of how much information an item provides about the IRT score. IIFs can be used as a simple way to do a rough initial selection of items. This is especially attractive for longer scales (in our case, primarily the PF scale); however, we required pronounced differences in the IIFs to remove an item from a scale based solely on the IIFs.

We put greater emphasis on detailed evaluations of the ability of various possible shortened scales to predict scores on the full scales. We compared the predicted scale scores with the observed scores for the palliative care patients by calculating the percentage of correctly predicted scale scores, the percentage of predicted scores that deviate at most one scale score level from the observed scale score, the difference in mean scores, the predicted mean scale score for each observed scale score, the Pearson correlation r, and the weighted κ measure of agreement between the predicted and observed scale scores.
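As an illustration of what an IIF measures, the sketch below uses the standard result that, for the GPCM, the information an item provides at a given IRT score equals the squared discrimination times the variance of the item score at that level. The parameter values are hypothetical and chosen only to show the pattern.

```python
import math


def gpcm_probs(theta, a, thresholds):
    """Category probabilities (0..m) under the GPCM; a = 1 gives the PCM."""
    exps = [0.0]
    for b in thresholds:
        exps.append(exps[-1] + a * (theta - b))
    top = max(exps)                       # subtract the max for numerical stability
    weights = [math.exp(e - top) for e in exps]
    total = sum(weights)
    return [w / total for w in weights]


def item_information(theta, a, thresholds):
    """Item information at theta: a^2 times the variance of the item score."""
    probs = gpcm_probs(theta, a, thresholds)
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return a * a * variance


# Hypothetical items evaluated at a low IRT score (theta = -3):
# thresholds far from theta give little information, thresholds near it give more.
info_far = item_information(-3.0, 1.0, [2.0, 3.0, 4.0])
info_near = item_information(-3.0, 1.0, [-3.5, -3.0, -2.5])
```

An item whose thresholds lie far from the IRT scores typical of the target group contributes little information there, which is the kind of pattern that would justify removing it in a rough initial selection.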
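The agreement measures listed above could be computed along the following lines. This is a sketch only: scores are represented as integer score levels (0 for the lowest possible scale score), `step` converts levels to points on the 0–100 scale, and linear disagreement weights are assumed for the weighted κ because the text does not say which weighting was used.

```python
import math


def agreement(pred, obs, n_levels, step):
    """Agreement between predicted and observed scale scores.

    pred, obs: lists of score levels as integers 0..n_levels-1
    step:      scale points per level (e.g. 100/15 for a five-item scale)
    """
    n = len(obs)
    diffs = [p - o for p, o in zip(pred, obs)]
    exact = 100 * sum(d == 0 for d in diffs) / n
    within_one = 100 * sum(abs(d) <= 1 for d in diffs) / n
    mean_diff = step * sum(diffs) / n

    # Pearson correlation (unaffected by the level-to-points conversion)
    mp, mo = sum(pred) / n, sum(obs) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    r = cov / math.sqrt(sum((p - mp) ** 2 for p in pred)
                        * sum((o - mo) ** 2 for o in obs))

    # Weighted kappa with linear weights: 1 - observed / expected disagreement
    p_marg = [pred.count(i) / n for i in range(n_levels)]
    o_marg = [obs.count(i) / n for i in range(n_levels)]
    observed = sum(abs(d) for d in diffs) / n
    expected = sum(p_marg[i] * o_marg[j] * abs(i - j)
                   for i in range(n_levels) for j in range(n_levels))
    kappa = 1 - observed / expected

    return {"exact_pct": exact, "within_one_pct": within_one,
            "mean_diff": mean_diff, "pearson_r": r, "weighted_kappa": kappa}


# Hypothetical data for a two-item scale (7 score levels, 100/6 points per level).
obs_levels = [0, 1, 2, 3, 4, 4]
pred_levels = [0, 1, 2, 3, 3, 6]
result = agreement(pred_levels, obs_levels, n_levels=7, step=100 / 6)
perfect = agreement(obs_levels, obs_levels, n_levels=7, step=100 / 6)
```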

2.3.5. Assessment of measurement ability

To assess the practical consequences of using the shortened scales, we compared the abilities of the shortened and the unabbreviated scales to detect group differences in the sample of palliative care patients. We selected 10 criterion variables to define the groups for these comparisons: for each scale we used age, gender, and the eight other multi-item scales in the questionnaire. The criterion variables were dichotomized at the median. We calculated the difference in mean scores between the two groups using both the predicted and the observed scale scores and tested for significant group differences using Student’s t-test.

In case of DIF between languages, we repeated the comparisons on a random sample of the nonpalliative care patients. The random sample was selected to match the scale score distributions of the palliative care patients.
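The group comparison can be sketched as below, using only the Python standard library. The sketch computes the pooled-variance Student t statistic and its degrees of freedom; obtaining a p-value would additionally require the t distribution, which is left out. Placing values equal to the median in the lower group is an assumption, and all data are hypothetical.

```python
import math
from statistics import mean, median, variance


def split_at_median(criterion):
    """True for subjects above the median of the criterion variable."""
    cut = median(criterion)
    return [value > cut for value in criterion]


def student_t(group_a, group_b):
    """Pooled-variance Student t statistic and degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a)
              + (nb - 1) * variance(group_b)) / (na + nb - 2)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def compare_scales(criterion, observed, predicted):
    """t statistics for the same group split on observed and predicted scores."""
    high = split_at_median(criterion)
    out = {}
    for label, scores in (("observed", observed), ("predicted", predicted)):
        upper = [s for s, h in zip(scores, high) if h]
        lower = [s for s, h in zip(scores, high) if not h]
        out[label] = student_t(upper, lower)
    return out


# Hypothetical check of the t statistic on two small groups.
t_value, df = student_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

Similar t statistics for the observed and the predicted scores would indicate that the shortened scale detects the group difference about as well as the full scale.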

3. Results

3.1. Characteristics of the sample

Of the 10,815 subjects in the database, 904 (8.4%) were patients receiving specialized palliative care. All palliative care patients were from Scandinavia. The remaining subjects were cancer patients not in specialized palliative care institutions (67.8%) or individuals from general population samples (23.8%). The database is described in further detail in Petersen et al. [18].

For the analyses of the PF scale (which were done on version 3 of the QLQ-C30 only), 2,366 subjects were available. All were cancer patients. Of these, 267 (11.3%) were palliative care patients. The version 3 subgroup represented the same languages as the full database except Finnish.

Table 1 shows the number of responses, mean scores, and standard deviations for the items and the scales. The item and scale scores are transformed to a 0–100 scale. For the physical and cognitive scales and items, a higher score reflects better functioning; for the fatigue and nausea or vomiting scales and items, a higher score reflects more severe symptoms [5].

3.2. DIF analyses

For the physical functioning scale, we found DIF with regard to all languages except Norwegian and French when comparing with the English original. When comparing the translations with each other, we also found DIF in several comparisons. There was no significant DIF for age, gender, cancer site, or stage.

Except for Swedish and Italian, there was evidence of DIF for all translations for the fatigue scale. Furthermore, we found DIF between palliative care patients and both nonpalliative care patients and general population samples. There were no indications of significant DIF with regard to age, gender, or cancer site.

For the cognitive scale, there was significant DIF when comparing the Dutch, Italian, and Spanish translations with English, when comparing palliative care patients with both nonpalliative care patients and general population samples, and when comparing age groups. There were no significant DIF findings for gender or cancer site.

For the nausea or vomiting scale, there were no significant DIF findings.

3.3. Selection of items

As a consequence of the DIF findings summarized above, all evaluations of the shortened scales were carried out on both the palliative and the nonpalliative care patients (which represented other languages than for the palliative care patients). For completeness, we also did this for the nausea or vomiting scale even though no DIF was found for this scale. Furthermore, to assess the consequences of the DIF between cancer stages for the fatigue and cognitive scales we estimated the IRT models for these two scales using both the total database and the palliative care patients only and compared the results.

Based on the estimated IRT models we calculated the item information functions (IIFs). The IIFs for the physical functioning items indicated that item 1 (strenuous activities) contributed the least information for palliative care patients (IIFs not shown). The item provided very little information about patients with poor physical functioning. Therefore, based on the IIFs, we deleted item 1 from the physical scale. The remaining four items were used to construct shortened scales for further evaluation.

Table 2 summarizes the results for the shortened scales performing best. Among the four possible three-item PF scales, items 3–5 predicted the PF scores for palliative care patients most accurately. Prediction method 2 resulted in slightly better prediction than method 1. Using method 2, items 3–5 predicted the correct PF score for 72% of the palliative care patients. In only 10% of the cases did the predicted PF score deviate more than one score level (6.7) from the observed PF score. The mean difference between the predicted and the observed PF scores was less than 1 and both the correlation and the κ were high. Slightly fewer nonpalliative care patients had a correctly predicted score; otherwise, the results were very similar to the evaluations for the palliative care patients. For the palliative care patients, there were large mean deviations for PF scores > 67 (Fig. 1); however, these findings are not reliable, in that very few palliative care patients had these scores (2–8 per score). For all other PF scores, the deviations were small. The results for the nonpalliative care patients were good, also for the high PF scores.

The evaluations of the six possible two-item PF scales indicated that the shortened scale consisting of items 3 and 5 predicted the scale scores best. Items 3 and 5 predicted the correct PF score for about half of the palliative care patients and for about 40% of the nonpalliative care patients. The overall mean differences were less than 1, but the analyses stratified by observed PF scores showed that there were large mean deviations for PF scores > 40 for the palliative care patients and for PF scores > 73 for the nonpalliative care patients (results not shown).

Because of DIF findings, we estimated the IRT model based on both the total database and on palliative care patients only for the fatigue scale. The differences in the estimated IIFs between the three items were small and inconsistent for the two models. Therefore, we constructed scoring algorithms based on the two IRT models for all three possible two-item fatigue scales.

For each two-item scale, we compared the scoring algorithms constructed from the two IRT models. There were only minor differences in the scoring algorithms based on the total database and on palliative care patients only. This indicated that the findings of DIF would have little consequence for the prediction of scale scores. The shortened scale consisting of items 12 and 18 using prediction method 1 had the most accurate prediction of the FA scores. Items 12 and 18 predicted the correct FA score for about two thirds of the palliative care patients. The predicted FA score differed from the observed score by more than one scale score level for only 3% of the patients (Table 2). The mean predicted FA score was very close to the observed score. When stratifying by the observed FA scores, there were also only small and unsystematic mean deviations (Fig. 1). Correlation and κ measure between predicted and observed FA scores were high. The prediction for the nonpalliative care patients was almost identical to the prediction for the palliative care patients.

As a final step in the selection of a shortened fatigue scale we evaluated the prediction of each of the three items individually. All three items predicted the correct FA score for about 45% of the palliative care patients (results not shown). Overall mean deviations from the observed FA scores were between −2 and 3; however, all candidates had large mean deviations for one or more observed FA scores.

Table 2

Prediction of the scores on the original scales. Results for the shortened scales performing best for each of the four domains

| | Correctly predicted, % | Maximally 1 score level wrong (a), % | Mean difference (SD) | Pearson’s r | Weighted κ |
|---|---|---|---|---|---|
| PF predicted with items 3–5 | | | | | |
| Palliative care patients | 72.1 | 89.5 | −0.6 (6.9) | 0.96 | 0.88 |
| Weighted nonpalliative patients | 68.7 | 88.7 | −0.5 (6.4) | 0.97 | 0.88 |
| FA predicted with items 12, 18 | | | | | |
| Palliative care patients | 64.3 | 97.1 | −0.2 (7.4) | 0.97 | 0.87 |
| Weighted nonpalliative patients | 64.1 | 96.7 | 1.0 (7.6) | 0.97 | 0.86 |
| NV predicted with item 14 | | | | | |
| Palliative care patients | 68.5 | 93.4 | −0.6 (13.2) | 0.91 | 0.74 |
| Weighted nonpalliative patients | 67.9 | 91.8 | −0.6 (14.2) | 0.90 | 0.77 |
| CF predicted with item 20 | | | | | |
| Palliative care patients | 48.4 | 87.1 | −1.0 (16.9) | 0.87 | 0.66 |
| Weighted nonpalliative patients | 49.0 | 86.9 | 5.4 (15.9) | 0.85 | 0.66 |

Abbreviations: CF, cognitive functioning; FA, fatigue; NV, nausea or vomiting; PF, physical functioning; SD, standard deviation.

(a) Predicted scores at most one scale score level from the observed scale score. On a 0–100 scale, one scale score level is 6.7 points for PF, 11.1 for FA, and 16.7 for NV and CF.

Fig. 1. For each observed scale score level, the plots show the mean of the scale scores predicted with the shortened scales calculated for the palliative and nonpalliative care patients, respectively.

[Figure not reproduced. Four panels: PF scale (mean of PF scores predicted with items 3–5 against observed PF score), FA scale (mean of FA scores predicted with items 12, 18 against observed FA score), NV scale (mean of NV scores predicted with item 14 against observed NV score), and CF scale (mean of CF scores predicted with item 20 against observed CF score). Each panel shows palliative care patients, non-palliative care patients, and the line of perfect mean fit; all axes run from 0 to 100.]

Of the two items in the NV scale item 14, nausea, wasclearly superior to item 15 in predicting the NV scores.Using prediction method 1, item 14 predicted the correctNV score for little more than two thirds of the palliative carepatients and the average predicted score was very close tothe observed score (Table 2). Across the observed NVscores, the deviations were relatively small (one deviation of8; all other deviations were!5). Correlation and kmeasurewere high, although a bit lower than for the PF and FAscales. The results for the nonpalliative care patients weresimilar to the results for the palliative care patients.

For the CF scale, we estimated the IRT models based both on the total database and on the palliative care patients only. This resulted in significantly different predictions. Using the IRT model based on the palliative care patients, the mean deviations were small and unsystematic for the palliative care patients but there were large and systematic deviations for the nonpalliative care patients (see Fig. 1). Using the total database resulted in the reverse findings: systematically biased predictions for the palliative care patients but not for the nonpalliative care patients. In general, the prediction of the CF scale was markedly poorer than the predictions for the three other scales. Using item 20 and the prediction based on the palliative care patients resulted in the best prediction of CF. Item 20 predicted the correct CF score for a little less than half of the patients and for 13% the predicted score was >17 points from the observed CF score (Table 2). Correlations and κ measures between predicted and observed scores were lower than for the other scales, although still fairly high.
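The agreement statistics used in this section (share of exactly predicted scores, mean deviation at each observed score level, share of predictions far from the observed score, and a κ measure) can be illustrated with a minimal sketch. The function name and the returned keys are hypothetical, and the κ computed here is the unweighted Cohen's κ, which is an assumption; the κ variant used in the study is not restated in this section.

```python
from collections import Counter, defaultdict


def agreement_summary(observed, predicted, tolerance=17.0):
    """Summarize agreement between observed and predicted scale scores.

    Hypothetical helper for illustration only; unweighted Cohen's kappa
    is assumed, which may differ from the measure used in the paper.
    """
    pairs = list(zip(observed, predicted))
    n = len(pairs)

    # Share of patients whose predicted score equals the observed score.
    exact = sum(o == p for o, p in pairs) / n

    # Share of patients whose prediction is more than `tolerance` points off.
    beyond = sum(abs(p - o) > tolerance for o, p in pairs) / n

    # Mean deviation (predicted minus observed) at each observed score level.
    deviations = defaultdict(list)
    for o, p in pairs:
        deviations[o].append(p - o)
    by_level = {lvl: sum(d) / len(d) for lvl, d in sorted(deviations.items())}

    # Unweighted kappa: exact agreement corrected for chance agreement.
    obs_counts, pred_counts = Counter(observed), Counter(predicted)
    chance = sum(obs_counts[c] * pred_counts[c] for c in obs_counts) / n ** 2
    kappa = (exact - chance) / (1 - chance) if chance < 1 else float("nan")

    return {
        "exact_agreement": exact,
        "share_beyond_tolerance": beyond,
        "mean_deviation_by_level": by_level,
        "kappa": kappa,
    }


# Made-up scores on a 0-100 scale, not study data.
summary = agreement_summary(
    observed=[0.0, 0.0, 50.0, 50.0, 100.0, 100.0],
    predicted=[0.0, 0.0, 50.0, 100.0, 100.0, 50.0],
)
```

The default tolerance of 17 points mirrors the threshold quoted for the CF scale; any other cut-off can be passed instead.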

3.4. Assessment of measurement ability

Based on the evaluations of the ability to predict scale scores, we decided to assess the measurement abilities of the shortened PF scale consisting of items 3–5, the shortened FA scale consisting of items 12 and 18, and the shortened NV scale consisting of item 14, compared to the full scales. Because the results for the CF scale were judged unsatisfactory, this scale was not further evaluated. The results of the assessments are summarized in Table 3.

Comparing the shortened and full scales using the palliative care patients and the control sample of nonpalliative care patients, respectively, resulted in very similar findings for all three scales and there were no indications that DIF between languages affected the measurement properties of these shortened scales.


For the PF scale, the average t-test sizes for the shortened and the full scales were very similar. Using a significance level of 5%, they resulted in the same conclusions (a significant or nonsignificant group difference) in all but one of the 20 comparisons carried out in total for the scale.

The shortened FA scale resulted in a slightly smaller average test size than the full FA scale, reflecting that the shortened FA scale had larger standard deviations. Nevertheless, the shortened FA scale found exactly the same significant group differences as the full FA scale.

Also for the NV scale the shortened scale resulted in slightly smaller test sizes than the full NV scale, but only for one of the 20 comparisons did this affect the conclusion drawn from the comparison.
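The comparison logic described in this section can be sketched as follows: compute a two-sample t statistic for the same pair of groups on the shortened and on the full scale, then check whether both lead to the same conclusion at the 5% level. This is only an illustration under stated assumptions: a pooled-variance t statistic is used, significance is judged with a large-sample normal approximation rather than the exact t distribution, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def t_statistic(group_a, group_b):
    """Pooled-variance two-sample t statistic (assumed test form)."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))


def same_conclusion(short_a, short_b, full_a, full_b, alpha=0.05):
    """True if shortened and full scale agree on significance.

    Uses a normal approximation for the two-sided critical value,
    which is reasonable only for fairly large groups.
    """
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    significant_short = abs(t_statistic(short_a, short_b)) > critical
    significant_full = abs(t_statistic(full_a, full_b)) > critical
    return significant_short == significant_full
```

Averaging `abs(t_statistic(...))` over a set of group comparisons would give a quantity analogous to the mean t-test sizes in Table 3, and the share of comparisons for which `same_conclusion` is true would correspond to its last column.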

4. Discussion

Here we have reported on the development and psychometric testing of possible shortened versions of the EORTC physical functioning, fatigue, nausea or vomiting, and cognitive functioning scales for use in palliative care of cancer patients.

We first tested for differential item functioning (DIF). Significant DIF was found for the PF, FA, and CF scales, primarily between languages. For the PF and FA scales, the possible DIF had little effect on the prediction of scale scores. Comparing the results for the Scandinavian palliative care patients and the patients from the rest of Europe indicated that for these two scales the shortened scales have similar prediction and measurement abilities for all palliative care patients across Europe. For the CF scale, however, there were pronounced differences in the abilities to predict between palliative care patients and other patients. This may mainly be due to DIF between palliative care patients and nonpalliative care patients, but may also in part be due to DIF between languages. Whether this is the case was difficult to investigate in our database because we had palliative care patients from Scandinavia only. In any case, the findings suggest that prediction of the CF scale based on a shortened version was affected by DIF.

Table 3
Results of 10 group comparisons for the palliative care patients and for the matched control sample of the remaining patients in the database, and the average result across these two samples combined

                       Shortened scale             Full scale                  Same
Scale, sample          Mean diff.   Mean t-test    Mean diff.   Mean t-test    conclusion(a), %
PF, palliative care    12.0         4.1            11.5         3.9            90
PF, control            10.2         4.4            9.6          4.4            100
PF, combined           11.1         4.3            10.6         4.2            95
FA, palliative care    19.1         10.4           19.0         11.0           100
FA, control            20.9         11.0           20.3         11.4           100
FA, combined           20.0         10.7           19.7         11.2           100
NV, palliative care    12.3         5.6            11.4         5.9            100
NV, control            18.4         9.3            18.3         9.6            90
NV, combined           15.4         7.5            14.9         7.8            95

Abbreviations: diff., difference; FA, fatigue; NV, nausea or vomiting; PF, physical functioning.
(a) Both the shortened scale and the full scale found a significant group difference (P < .05) or neither of them found a significant difference (P < .05).

The evaluations of the PF scale showed very good agreement between scores predicted with items 3–5 and the observed PF scores for palliative care patients. The rather large mean deviations found for high PF scores should be ignored as these estimates were based on too few observations (2–8 patients at each score >67). The predicted scores for the nonpalliative care patients were close to the observed scores, including for high PF scores. The assessment of measurement ability suggested that the shortened PF scale has the same power to detect group differences as the full scale. Taken together, the comparisons indicated that the shortened scale can safely be used in palliative care settings instead of the full PF scale.

Among the two-item versions of the FA scale, items 12 and 18 had the best predicting ability. All evaluations indicated very good agreement between the predicted and observed FA scores. Furthermore, the assessment of measurement ability showed that this two-item scale only had marginally lower power to detect group differences than the full FA scale.

We also investigated the possibility of using a two-item PF scale and a single fatigue item in the shortened questionnaire. The evaluations indicated that using these shortened scales would reduce the accuracy of the prediction markedly. These shortened scales predicted the correct scale score for <50% of the palliative care patients and yielded relatively large mean deviations for one or more observed scale scores. The question is whether or not the predictions are accurate enough to use these scales in the shortened questionnaire. To our knowledge, there is no relevant precedent for defining what is good enough. As in any scale development, deciding on the length of the scale is a balance between response burden or feasibility and measurement precision. For development of a shortened version of the EORTC QLQ-C30 for use in palliative care, we gave priority to precision (i.e., ability to predict scale scores) rather than minimal length.

The NV scores predicted with item 14 were in good agreement with the observed NV scores. Using item 14 may be expected to result in the same findings and conclusions as using the full NV scale. From a clinical point of view, it makes sense that it is better to predict vomiting (item 15) from nausea (item 14) than vice versa.

The evaluations showed markedly poorer agreement for the CF scale than for the other three scales. Furthermore, the evaluations indicated that the prediction of the CF scale was affected by DIF. Previous research has shown that this scale may not be fully unidimensional [11]. This, combined with the DIF findings, may explain the poorer prediction abilities of the CF items. We judged that the prediction was too poor and the risk of biased CF scores in some languages or other subgroups of palliative care patients because of DIF was too high. We therefore do not intend to use a shortened version of the CF scale in the palliative care questionnaire.

Table 4
Scoring algorithms for predicting the scores on the PF, FA, and NV scales from the reduced sets of items

PF
Sum of items 3–5(a)    Predicted
0                      0.0
1                      6.7
2                      13.3
3                      20.0
4                      26.7
5                      33.3
6                      46.7
7                      60.0
8                      73.3
9                      93.3

FA
Response to item 12    Response to item 18    Predicted
Not at all             Not at all             0.0
Not at all             A little               22.2
Not at all             Quite a bit            33.3
Not at all             Very much              55.6
A little               Not at all             22.2
A little               A little               33.3
A little               Quite a bit            55.6
A little               Very much              66.7
Quite a bit            Not at all             33.3
Quite a bit            A little               44.4
Quite a bit            Quite a bit            66.7
Quite a bit            Very much              88.9
Very much              Not at all             44.4
Very much              A little               66.7
Very much              Quite a bit            88.9
Very much              Very much              100.0

NV
Response to item 14    Predicted
Not at all             0.0
A little               16.7
Quite a bit            50.0
Very much              100.0

(a) Items 3–5 are scored on a scale of 0 = Very much, 1 = Quite a bit, 2 = A little, and 3 = Not at all.
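The scoring algorithms in Table 4 amount to three lookup tables, which can be sketched in code as below. The predicted values are copied from Table 4; the function and constant names are hypothetical, and taking the English response labels as input is an assumption made for readability. A real data file would more likely hold numeric item codes, which would first need mapping to these categories.

```python
# Response categories in the order printed in Table 4.
CATEGORIES = ("Not at all", "A little", "Quite a bit", "Very much")

# PF: predicted score indexed by the sum of items 3-5, each item scored
# 0 = Very much, 1 = Quite a bit, 2 = A little, 3 = Not at all (Table 4, footnote a).
PF_PREDICTED = (0.0, 6.7, 13.3, 20.0, 26.7, 33.3, 46.7, 60.0, 73.3, 93.3)

# FA: rows follow the response to item 12, columns the response to item 18.
FA_PREDICTED = (
    (0.0, 22.2, 33.3, 55.6),    # item 12 = Not at all
    (22.2, 33.3, 55.6, 66.7),   # item 12 = A little
    (33.3, 44.4, 66.7, 88.9),   # item 12 = Quite a bit
    (44.4, 66.7, 88.9, 100.0),  # item 12 = Very much
)

# NV: predicted score indexed by the response to item 14.
NV_PREDICTED = (0.0, 16.7, 50.0, 100.0)


def _index(response):
    """Position of a response label in CATEGORIES (raises ValueError if unknown)."""
    return CATEGORIES.index(response)


def predict_pf(item3, item4, item5):
    """Predicted PF score from items 3-5 (reverse-scored, then summed)."""
    total = sum(3 - _index(r) for r in (item3, item4, item5))
    return PF_PREDICTED[total]


def predict_fa(item12, item18):
    """Predicted FA score from items 12 and 18."""
    return FA_PREDICTED[_index(item12)][_index(item18)]


def predict_nv(item14):
    """Predicted NV score from item 14."""
    return NV_PREDICTED[_index(item14)]


print(predict_pf("A little", "Quite a bit", "Not at all"))  # sum 2 + 1 + 3 = 6 -> 46.7
print(predict_fa("Quite a bit", "A little"))                # 44.4
print(predict_nv("Quite a bit"))                            # 50.0
```

Note that the FA table is not symmetric in the two items (for example, "Not at all"/"Very much" gives 55.6 but "Very much"/"Not at all" gives 44.4), so the order of the arguments matters.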

We used and compared two methods for predicting scale scores. The method giving the best results differed between the scales. We hypothesize that prediction method 1 (estimating the IRT score using all items in the full scale) will have the best prediction abilities for short scales. For longer scales, the predicted probabilities for each combination of the items are likely to be too uncertain and the construction of scoring algorithms will also be quite cumbersome. Therefore, for longer scales prediction method 2 is probably the best choice.

A previous study used a strategy similar to the one reported here for shortening the EORTC emotional functioning scale from four to two items [14]. The results from the present paper confirmed the encouraging results from the first scale analyzed: with these IRT-based prediction methods, it seems possible to reduce the scales to half of the items (rounded up for an uneven number of items) while still predicting the scores on the full scale with high precision. Whether this applies in general for other scales of different length and contents and for other populations than palliative care patients cannot be determined from these studies.

Based on the present study we anticipate the use of a three-item PF scale consisting of items 3, 4, and 5, a two-item FA scale consisting of items 12 and 18, and item 14 as a single-item NV scale, scored according to the algorithms in Table 4, in a version of the EORTC QLQ-C30 shortened for cancer patients in palliative care. Previously, two items have been deleted from the emotional functioning scale [14]. That is, in all, six items have been deleted using this IRT-based approach. In the context of the present project, we have also interviewed cancer patients and health care professionals in palliative care about the importance and appropriateness of the individual items of the EORTC QLQ-C30. Based on these interviews, the final decisions concerning the shortened version of the QLQ-C30 for use in palliative care will be made. We anticipate that a number of scales or single items (apart from the five scales dealt with in this and the previous paper) will be deleted. We expect that the questionnaire can be shortened to about half the length (i.e., to about 15 items). This would markedly increase the usefulness of the questionnaire in palliative care studies.
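The item counts behind the "half of the items, rounded up" observation and the total of six deleted items can be checked with a few lines of arithmetic. The full-scale lengths used here are assumptions insofar as only the NV (two items) and emotional functioning (four items) lengths are stated in the text; five PF items and three FA items are taken from the standard QLQ-C30 layout.

```python
from math import ceil

# Number of items in each full scale (PF and FA counts are assumed from
# the standard QLQ-C30; NV and EF counts are stated in the text).
FULL_SCALE_ITEMS = {"PF": 5, "FA": 3, "NV": 2, "EF": 4}

# Half of the items, rounded up for an uneven number of items.
shortened = {scale: ceil(n / 2) for scale, n in FULL_SCALE_ITEMS.items()}

# Items removed across the four shortened scales.
deleted = sum(FULL_SCALE_ITEMS[s] - shortened[s] for s in FULL_SCALE_ITEMS)

print(shortened)  # {'PF': 3, 'FA': 2, 'NV': 1, 'EF': 2}
print(deleted)    # 6
```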

We recognize that the shortened scales need to be evaluated in independent samples of palliative care patients from across Europe. Given the favorable results obtained thus far, we expect that the shortened questionnaire will prove appropriate and useful as a core instrument for future studies in palliative care.

Acknowledgments

This work was supported by grants from the European Organisation for Research and Treatment of Cancer Quality of Life Group. The study was based on data contributed by the following individuals: Neil Aaronson, Amsterdam, The Netherlands; Marianne Ahlner-Elmqvist, Malmo, Sweden; Juan I. Arraras, Pamplona, Spain; Jane Blazeby, Bristol, UK; Yvonne Brandberg, Stockholm, Sweden; Anne Bredart, Paris, France; Elisabeth Brenne, Trondheim, Norway; Thierry Conroy, Vandoeuvre les Nancy, France; Ann Cull, Edinburgh, UK; Alexander de Graeff, Utrecht, The Netherlands; Mogens Groenvold, Copenhagen, Denmark; Eva Hammerlid, Goteborg, Sweden; Marianne Hjermstad, Oslo, Norway; Marit Jordhoy, Trondheim, Norway; Marianne Sullivan, Goteborg, Sweden; Galina Velikova, Leeds, UK; Craig Vickery, Exeter, UK; Maggie Watson, Sutton, UK; Teresa Young, Middlesex, UK.

References

[1] Johnston G, Abraham C. The WHO objectives for palliative care: to what extent are we achieving them? Palliat Med 1995;2:123–37.
[2] Hearn J, Higginson IJ. Outcome measures in palliative care for advanced cancer patients: a review. J Public Health Med 1997;2:193–9.
[3] Kaasa S, Loge JH. Quality of life in palliative care: principles and practice. Palliat Med 2003;1:11–20.
[4] Aaronson NK, Ahmedzai S, Bergman B, Bullinger M, Cull A, Duez NJ, Filiberti A, Flechtner H, Fleishman SB, de Haes JC, et al. The European Organization for Research and Treatment of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. J Natl Cancer Inst 1993;5:365–76.
[5] Fayers PM, Aaronson NK, Bjordal K, Groenvold M, Curran D, Bottomley A. The EORTC QLQ-C30 scoring manual. Brussels: European Organisation for Research and Treatment of Cancer; 2001.
[6] Fayers P, Bottomley A. Quality of life research within the EORTC-the EORTC QLQ-C30. European Organisation for Research and Treatment of Cancer. Eur J Cancer 2002;S125–33.
[7] Garratt A, Schmidt L, Mackintosh A, Fitzpatrick R. Quality of life measurement: bibliographic study of patient assessed health outcome measures. BMJ 2002;7351:1417–9.
[8] Fayers PM, Weeden S, Curran D. EORTC QLQ-C30 reference values. Brussels: European Organization for Research and Treatment of Cancer; 1998.
[9] Niezgoda HE, Pater JL. A validation study of the domains of the core EORTC quality of life questionnaire. Qual Life Res 1993;5:319–25.
[10] Hjermstad MJ, Fossa SD, Bjordal K, Kaasa S. Test/retest study of the European Organization for Research and Treatment of Cancer Core Quality-of-Life questionnaire. J Clin Oncol 1995;5:1249–54.
[11] Aaronson NK, Cull A, Kaasa S, Sprangers MA. The European Organization for Research and Treatment of Cancer (EORTC) modular approach to quality of life assessment in oncology: an update. In: Spilker B, editor. Quality of life and pharmacoeconomics in clinical trials. 2nd ed. Philadelphia: Lippincott Williams & Wilkins; 1996:179–89.
[12] Kaasa S, Bjordal K, Aaronson N, Moum T, Wist E, Hagen S, Kvikstad A. The EORTC core quality of life questionnaire (QLQ-C30): validity and reliability when analysed with patients treated with palliative radiotherapy. Eur J Cancer 1995;31A:2260–3.
[13] Groenvold M, Klee MC, Sprangers MA, Aaronson NK. Validation of the EORTC QLQ-C30 quality of life questionnaire through combined qualitative and quantitative assessment of patient-observer agreement. J Clin Epidemiol 1997;4:441–50.
[14] Bjorner JB, Petersen MAa, Groenvold M, Aaronson N, Ahlner-Elmqvist M, Arraras JI, Bredart A, Fayers P, Jordhoy M, Sprangers M, Watson M, Young T; European Organisation for Research and Treatment of Cancer Quality of Life Group. Use of item response theory to develop a shortened version of the EORTC QLQ-C30 emotional function scale. Qual Life Res 2004;13:1683–97.
[15] Coste J, Guillemin F, Pouchot J, Fermanian J. Methodological approaches to shortening composite measurement scales. J Clin Epidemiol 1997;3:247–52.
[16] van der Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. Berlin: Springer; 1997.
[17] Holland PW, Wainer H, editors. Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates; 1993.
[18] Petersen MAa, Groenvold M, Bjorner JB, Aaronson N, Conroy T, Cull A, Fayers P, Hjermstad M, Sprangers M, Sullivan M; European Organisation for Research and Treatment of Cancer Quality of Life Group. Use of differential item functioning analysis to assess the equivalence of translations of a questionnaire. Qual Life Res 2003;12:373–85.
[19] Groenvold M, Bjorner JB, Klee MC, Kreiner S. Test for item bias in a quality of life questionnaire. J Clin Epidemiol 1995;6:805–16.
[20] Kreiner S. Validation of index scales for analysis of survey data: the Symptom Index. In: Dean K, editor. Population health research: linking theory and methods. London: SAGE Publications; 1993:116–44.
[21] Muraki E. A generalized partial credit model. In: van der Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. Berlin: Springer; 1997:153–68.
[22] Muraki E, Bock RD. PARSCALE: IRT based test scoring and item analysis for graded open-ended exercises and performance tasks [Computer program]. Chicago: Scientific Software International; 1996.
[23] Masters GN, Wright BD. The partial credit model. In: van der Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. Berlin: Springer; 1997:101–21.
[24] Verhelst ND, Glas CAW. OPLM: one-parameter logistic model [Computer program]. Arnhem, Netherlands: CITO; 1995.
[25] Samejima F. Normal ogive model on the continuous response level in the multidimensional latent space. Psychometrika 1974;39:111–21.