
Abbreviated Injury Scale: Not a reliable basis for summation of injury severity in trauma facilities?




Injury, Int. J. Care Injured 44 (2013) 691–699


Kjetil G. Ringdal a,b,c,d,*, Nils Oddvar Skaga e,f, Morten Hestnes f, Petter Andreas Steen b,c, Jo Røislien a,g, Marius Rehn a,c,h, Olav Røise b,c,d, Andreas J. Krüger a,i,j, Hans Morten Lossius a,k

a Department of Research, Norwegian Air Ambulance Foundation, Drøbak, Norway
b Division of Emergencies and Critical Care, Oslo University Hospital Ullevål, Oslo, Norway
c Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Norway
d Norwegian National Trauma Registry, Oslo University Hospital, Norway
e Department of Anaesthesiology, Division of Emergencies and Critical Care, Oslo University Hospital Ullevål, Oslo, Norway
f The Ullevål Trauma Registry, Department of Research and Development, Division of Emergencies and Critical Care, Oslo University Hospital Ullevål, Oslo, Norway
g Department of Biostatistics, Institute of Basic Medical Sciences, Faculty of Medicine, University of Oslo, Norway
h Akershus University Hospital, Lørenskog, Norway
i Department of Anesthesia and Emergency Medicine, St. Olav's University Hospital, Trondheim, Norway
j Faculty of Medicine, Norwegian University of Science and Technology, Trondheim, Norway
k Department of Surgical Sciences, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway

ARTICLE INFO

Article history:

Accepted 30 June 2012

Keywords: Abbreviated Injury Scale; Injury Severity Scores; Agreement; Reliability; Trauma registries

ABSTRACT

Background: Injury severity is most frequently classified using the Abbreviated Injury Scale (AIS) as a basis for the Injury Severity Score (ISS) and the New Injury Severity Score (NISS), which are used for assessment of overall injury severity in the multiply injured patient and in outcome prediction. European trauma registries recommended the AIS 2008 edition, but the levels of inter-rater agreement and reliability of ISS and NISS associated with its use have not been reported.

Methods: Nineteen Norwegian AIS-certified trauma registry coders were invited to score 50 real, anonymised patient medical records using AIS 2008. Rater agreement for ISS and NISS was analysed using Bland–Altman plots with 95% limits of agreement (LoA). A clinically acceptable LoA range was set at ±9 units. Reliability was analysed using two-way mixed-model intraclass correlation coefficient (ICC) statistics with corresponding 95% confidence intervals (CI) and hierarchical agglomerative clustering.

Results: Ten coders submitted their coding results. Of their AIS codes, 2189 (61.5%) agreed with a reference standard, 1187 (31.1%) real injuries were missed, and 392 non-existing injuries were recorded. All LoAs were wider than the predefined, clinically acceptable limit of ±9, for both ISS and NISS. The joint ICC (range) between each rater and the reference standard was 0.51 (0.29,0.86) for ISS and 0.51 (0.27,0.78) for NISS. The joint ICC (range) for inter-rater reliability was 0.49 (0.19,0.85) for ISS and 0.49 (0.16,0.82) for NISS. Univariate linear regression analyses indicated a significant relationship between the number of correctly AIS-coded injuries and the total number of cases coded during the rater's career, but no significant relationship between the rater-against-reference ISS and NISS ICC values and the total number of cases coded during the rater's career.

Conclusions: Based on AIS 2008, ISS and NISS were not reliable for summarising anatomic injury severity in this study. This result indicates a limitation in their use as benchmarking tools for trauma system performance.

© 2012 Elsevier Ltd. All rights reserved.


* Corresponding author at: Department of Research, Norwegian Air Ambulance Foundation, P.O. Box 94, N-1441 Drøbak, Norway. Tel.: +47 976 49 121; fax: +47 64 90 44 45. E-mail address: [email protected] (K.G. Ringdal).

0020–1383/$ – see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.injury.2012.06.032

Introduction

Injury severity classification is considered a fundamental component of trauma outcome research and quality assessment. The Abbreviated Injury Scale (AIS),1 developed by the U.S. Association for the Advancement of Automotive Medicine (AAAM), is frequently used to classify overall injury severity in the multiply injured patient. The AIS is the basis of several composite injury severity measurements, such as the Injury Severity Score (ISS)2 and


New Injury Severity Score (NISS),3 and ISS is one of the independent variables included in outcome prediction models such as the Trauma and Injury Severity Score (TRISS).4,5 Assigning AIS codes to patients with multiple injuries can be rater-subjective because identical injuries can sometimes be given different codes.6

Mackenzie et al. reported considerable variation in the mean number of AIS-scored injuries recorded per patient among raters using the AIS 1980 edition.7 Physicians and nurses had higher intra-rater reliability than emergency medical technicians and nonclinical technicians. The inter-rater AIS score agreement was significantly higher for blunt than for penetrating injuries. Neale et al. found that although only 39% of the AIS codes assigned by any two raters were identical, the inter-rater reliability for ISS was almost perfect.8 In contrast, Zoltie and de Dombal found a large variation in ISS, with a mere 28% probability of agreement between two independent raters.9

A revised Utstein Trauma Template for uniform data reporting from patients subjected to major trauma was published recently10–12 with the aim of reducing the variability in data collection for international trauma registries. This template recommends using the newest version of the AIS,11 which is currently the AIS 2008 edition.1 However, no reports have evaluated the inter-rater levels of agreement and reliability of the ISS and NISS based on this edition of the AIS. The aim of the present study was to estimate these parameters in a representative group of AIS-certified Norwegian trauma registry coders, with a comparison against a reference standard.

Methods

Rater sampling

Study participants were recruited from a list of nineteen Norwegian trauma registry coders certified in the AIS 2005 or Update 2008 versions who were working in trauma registries or who were expected to code in hospitals that were in the process of establishing a registry. The list was cross-checked against a list of contact persons in the Norwegian Better & Systematic Trauma Care Foundation network.13 Participating hospitals were compensated financially so that the raters could take time from their regular work to complete the coding for the study. There was no specific training for this group prior to the study.

Case sampling

Patient records were selected from the trauma registry at Oslo University Hospital – Ullevål (OUH-U). OUH-U, the largest trauma centre in Norway, receives approximately 1400 trauma patients annually, of which 34% have an ISS > 15 (i.e., severe injury4) and 44% have a NISS > 15 according to AIS 2008. A sample of consecutive patients with NISS > 15 who were admitted directly from the scene of injury was selected for the study. Exclusion criteria included asphyxia, drowning, and burns as the predominant injuries; hospital admission > 24 h after the injury; and patients declared dead before reaching the hospital and with no signs of life or response to initial resuscitation upon arrival at the Emergency Department (ED).10

Pre-hospital and ED charts, hospital admission notes, trauma anaesthesia records, surgery reports, physicians' progress notes, intensive care unit records, nurse reports, laboratory data, radiology reports, autopsy records (if applicable), and discharge summaries from each patient's record were distributed to all participants.

Directly identifiable patient information (e.g., name, address) was deleted, and the date of birth was replaced with a fictitious but realistic date. Indirectly identifiable patient data (e.g., date of injury and dates of medical chart notes and operations) were replaced with fictitious but realistic data. The geographic location of the injury event and the names of treating doctors, treating hospital(s), and departments were deleted.

The determination of the sample size for ISS and NISS reliability studies depends upon a reasonable estimate of the reliability coefficient in the study population, the coefficient of the confidence interval (CI), and the maximum error (e.g., from a previous estimate).14,15 A sample of 50 cases has been suggested15 and was used in this study.

Data variables and data collection

The raters were asked to score all 50 cases according to an expanded Norwegian version of the Utstein Trauma Template Data Dictionary,11 which contained 48 data variables, including AIS, ISS, and NISS. The raters were allowed to use either AIS 2005 (AIS05) or AIS 2008 (AIS08). In cases where AIS05 ratings did not agree with AIS08 ratings, the AIS05 codes were converted to AIS08 codes using a recently developed list of differences between the AIS code sets.16 The study investigators calculated the ISS and NISS for the raters based on the raters' AIS codes.

The main outcome measures were completeness of injury coding, agreement in ISS and NISS scoring, and reliability of ISS and NISS scoring.

The raters reported their levels of experience and training in a questionnaire (Supplemental File 1). Two web-based data-entry tools were used to collect data from the cases and the questionnaire.

AIS, ISS, and NISS

The AIS classification system is a consensus-derived, anatomically based, seven-digit injury scoring system.1 The first six digits form a unique numerical identifier that designates the injured body region (out of nine regions), the type of anatomic structure, and the specific anatomic structure; the seventh digit is an ordinal injury severity scale with categories ranging from 1 ('minor injury') to 6 ('maximal injury'). From the AIS scores, an ISS value, a pragmatic quantitative summary measure of the overall severity of anatomic and functional damage, is calculated by summing the squares of the highest AIS severity codes in each of the three (out of six) most severely injured ISS body regions.2 In contrast, the NISS, a revised version of the ISS, is calculated by summing the squares of the three most severe AIS injuries, regardless of ISS body region.3 The ISS and NISS scales range from 1 to 75.
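The two summation rules just described can be sketched in code. This is a simplified illustration, not the AAAM's official algorithm: injuries are given as (ISS body region, AIS severity) pairs, the function names are ours, and the special conventions for AIS 6 and unknown severities are reduced to a simple cap at 75.

```python
def iss(injuries):
    """ISS: sum of squares of the highest AIS severity in each of the
    three most severely injured ISS body regions.
    `injuries` is a list of (iss_body_region, ais_severity) pairs."""
    worst = {}
    for region, severity in injuries:
        worst[region] = max(worst.get(region, 0), severity)
    top3 = sorted(worst.values(), reverse=True)[:3]
    return min(sum(s * s for s in top3), 75)  # scale capped at 75

def niss(injuries):
    """NISS: sum of squares of the three most severe AIS injuries,
    regardless of body region."""
    top3 = sorted((s for _, s in injuries), reverse=True)[:3]
    return min(sum(s * s for s in top3), 75)

# Two AIS 4 head injuries plus one AIS 3 thorax injury: ISS counts the
# head region only once (4^2 + 3^2 = 25), whereas NISS counts both head
# injuries (4^2 + 4^2 + 3^2 = 41).
case = [("head", 4), ("head", 4), ("thorax", 3)]
```

The example illustrates why NISS is never lower than ISS: repeated severe injuries within one body region contribute only once to ISS but may all count towards NISS.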

Reference standard

A reference panel developed a reference standard based on the AIS08 dictionary. The panel consisted of one trauma registrar who was an AAAM-certified AIS course instructor and faculty member (MH), one trauma anaesthesiologist (NOS), and one PhD student (KGR), all experienced and certified in the AIS05/08 versions. The panellists first coded each case individually, and then met to discuss their coding and develop an initial reference standard. Coding disagreements were resolved by consensus or by consultation with clinical experts in the relevant surgical fields. Each AIS code assigned by the raters was thoroughly examined against the reference standard to identify missing codes, coding errors, missing body regions, and other possible mistakes. In cases for which multiple skin injuries were assigned to a patient, the reference panel considered coding one skin injury per severity level to be sufficient. If a rater identified an obviously correct code that had not been included in the initial reference standard, the code was added to the reference standard. The final reference


standard was developed based on the agreed AIS codes following consensus and expert consultation, and after reviewing the raters' AIS coding. The final judgements of the accuracy of the raters' injury coding were made after concluding the final reference standard.

Ethical considerations

The Data Privacy Ombudsman for research at OUH considered this project exempt from license requirements, and the Regional Committee for Medical and Health Research Ethics (ref. no. 1.2009.1139, 2009/345-1) was informed but considered the project outside its mandate for approval. Only raters who gave written informed consent were allowed to participate.

Statistical methods

Continuous data are presented as medians and ranges, and categorical data are presented as counts and percentages. The agreement and reliability between each rater and the reference standard (rater-against-reference) and between raters (inter-rater) were estimated for ISS and NISS.

Inter-rater agreement was assessed using the Bland–Altman limits of agreement (LoA) method.17,18 This method compares the estimated variation in the data with a clinical evaluation of what constitutes an acceptable variation for measurements to be considered 'not different'. The LoA were calculated as the mean of the differences between the measurements of two raters ± 1.96 × the standard deviation of the differences, and will contain 95% of future measurement pairs in similar individuals, assuming normally distributed data.19 A smaller range between these two limits denotes better agreement. An extension of the LoA method for comparing more than two pairs of measurements was used in this analysis.20 A clinically acceptable LoA range was set at ±9 units, equivalent to the increase in the derived ISS/NISS value when the severity of a single injury is increased from AIS 4 to AIS 5.
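For a single rater pair, the LoA computation reduces to a few lines. A minimal sketch (function names are ours, not from the study; the cited multi-rater extension is not shown):

```python
import statistics

def limits_of_agreement(scores_a, scores_b):
    """Bland–Altman 95% limits of agreement for two raters' paired scores:
    mean difference +/- 1.96 x the standard deviation of the differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff

def within_acceptable_range(scores_a, scores_b, limit=9):
    """Proportion of paired differences inside the pre-defined +/- limit."""
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    return sum(d <= limit for d in diffs) / len(diffs)
```

A pair of raters agrees acceptably in this scheme only if both limits fall inside the predefined ±9 band; `within_acceptable_range` gives the complementary per-case view used in the results.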

Reliability is defined as the ratio of the variation between measurements to the total variation of all the measurements.15,21–23 Reliability was estimated by intraclass correlation coefficient (ICC) statistics and corresponding 95% CI using a two-way mixed model with the absolute agreement index.22 ICC statistics give a number on a scale from 0 to 1, where 0 indicates agreement no better than expected by chance and 1 indicates perfect agreement.23,24 The inter-rater ICC values were further analysed using hierarchical agglomerative clustering with complete linkage and '1 − ICC' as the distance measure for the accompanying dendrogram.25 Hierarchical agglomerative clustering is a multivariate statistical method for partitioning values into optimally similar groups.25
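The single-measures, absolute-agreement ICC can be computed from the two-way ANOVA mean squares. A simplified sketch of the standard ICC(A,1) formula (the exact model options used in the study software may differ):

```python
import numpy as np

def icc_a1(X):
    """Two-way, absolute-agreement, single-measures ICC (ICC(A,1)).
    X has shape (n_subjects, k_raters)."""
    X = np.asarray(X, float)
    n, k = X.shape
    grand = X.mean()
    ss_rows = k * ((X.mean(axis=1) - grand) ** 2).sum()   # subjects
    ss_cols = n * ((X.mean(axis=0) - grand) ** 2).sum()   # raters
    ss_err = ((X - grand) ** 2).sum() - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)                              # rows mean square
    ms_c = ss_cols / (k - 1)                              # columns mean square
    ms_e = ss_err / ((n - 1) * (k - 1))                   # error mean square
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
```

Because the absolute-agreement index penalises systematic offsets between raters, a rater who consistently scores higher than the others lowers the ICC even when the rank ordering of cases is preserved.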

Univariate linear regression analysis was performed to evaluate the relationship between the number of correctly AIS-coded injuries (dependent variable) and the number of cases coded during the rater's entire career (independent variable), and to evaluate the relationships between the ISS ICC and NISS ICC values (dependent variables) and the number of cases coded during the rater's entire career (independent variable). Statistical significance was assumed when P < 0.05.
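The regression step itself is ordinary least squares with one predictor. A minimal sketch (the significance test is omitted; the example numbers below are hypothetical, not the study data):

```python
import numpy as np

def fit_univariate(x, y):
    """Ordinary least squares for y = a + b*x with a single predictor.
    Returns (intercept, slope)."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = cov/var
    a = y.mean() - b * x.mean()                         # intercept
    return a, b

# Hypothetical data: total cases coded in career vs. correctly coded injuries
cases_coded = [20, 150, 275, 400, 900, 2000]
correct_injuries = [160, 200, 225, 230, 250, 262]
intercept, slope = fit_univariate(cases_coded, correct_injuries)
```

With only ten raters, as the authors note later, such a fit is fragile and multivariable extensions are not feasible.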

Statistical analyses were performed using STATA/SE version 11.2 (StataCorp LP, College Station, TX, USA) and R version 2.11.1 (The R Foundation for Statistical Computing, Vienna, Austria).26

The Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were followed in the drafting of this report.21

Results

Raters

Of the 19 identified and invited raters, one declined to participate and two never responded. Five who initially agreed to participate later withdrew due to resource constraints. One rater initially agreed to participate but did not respond to four e-mail reminders and never submitted data. Ten raters answered the questionnaire and scored the 50 cases.

Three raters were clinically experienced registered nurses, five were specialist nurses (i.e., nurse anaesthetists or critical care nurses), and two were physicians (anaesthesiologists). The median (range) coding experience was 3.5 (0,10) years, and the median (range) number of cases coded throughout the rater's career was 275 (20,2000). The invitees who did not participate were all certified in the AIS system and were either nurses or physicians.

Cases

Of the 50 patient cases, 39 (78%) were male. The median (range) age was 45 (7,83) years, and 90% of the traumas were blunt. Twelve (24%) of the patients died of their injuries.

The median (range) ISS and NISS values in the reference standard were 22 (9,45) and 28 (16,75), respectively.

AIS coding and injury identification

The median (range) number of AIS codes assigned by the raters was 352 (275,459), compared with 382 codes assigned in the reference standard. Overall, the raters assigned 3561 AIS codes, of which 2189 (61.5%) agreed with the reference standard. A total of 471 (13.3%) AIS codes were misclassified with regard to level of severity (Table 1) but were still considered correct with regard to anatomic structure and were therefore included in the analyses.

The raters recorded 392 injury codes that did not exist according to the reference standard and overlooked (missed) 1187 (31.1%) injuries. Of the overlooked injuries, 743 were of AIS 1–2 severity. A total of 509 double-coded injuries and AIS 1 skin injury codes exceeding one per patient were excluded, but were not classified as incorrect (Table 1).

The raters found 15 injuries that were not identified in the initial reference standard, and nine of the injuries in the initial reference standard were removed. In the reference standard, three severity levels were changed for both ISS and NISS, and another three were changed for ISS only.

The proportion of missed codes was highest in the body regions spine (47.4%), face (43.2%), and head (35.2%) (Table 2). A high proportion of missed codes was also found in the neck region, but the low number of injuries in this region may explain the high proportion. No specific injury descriptors accounted for the largest number of missed codes or errors.

Agreement

The LoAs between all raters and between each rater and the reference standard for ISS are shown in Fig. 1. The narrowest LoA range was from −8.12 to 10.48, and the widest range was from −35.98 to 35.22. Rater 1 disagreed markedly with all other raters and the reference standard. Raters 3 and 4 agreed the most with each other and the reference standard.

The LoAs between all raters and between the raters and the reference standard for NISS are shown in Fig. 1. The narrowest LoA range was from −15.29 to 17.97, and the widest range was from −42.25 to 33.21. As for ISS, rater 1 was least in agreement with all other raters and the reference standard. Rater 3 agreed the most with the reference, and raters 4 and 6 agreed the most with each other.

Table 1. Description of the AIS codes assigned by each rater.

                                          Rater
Types of coding errors                      1    2    3    4    5    6    7    8    9   10   Total
Correct AIS-coded injuries                159  197  261  228  227  249  209  262  171  226    2189
  Completely correct                      146  186  245  218  217  224  193  249  153  199    2030
  Insignificantly different                13   11   16   10   10   25   16   13   18   27     159
Misclassified AIS-coded injuries           49   47   47   47   33   53   46   30   57   62     471
  Severity too high                        18   13   11   12    6   28   13   12   23   29     165
  Severity too low                         24   24   30   30   25   18   23   15   27   29     245
  Incorrect organs or structures            3    6    1    2    2    2    3    2    4    1      26
  Fundamental coding principle error        4    4    5    3    0    5    7    1    3    3      35
Non-existing injuries                      45   50   59   23   18   71   26   18   36   46     392
  Mistyped codes                            3   20   12    5    0   10    5    1    5    5      66
  Codes from a different dictionary
  than that stated by the rater             0    0    0    0    1    0    3    0    0    0       4
  Injury not found in patient chart        42   30   47   18   17   61   18   17   31   41     322
Missed injuries                           177  138   79  109  122   85  131   92  157   97    1187
  AIS 1–2                                 118   86   51   68   73   43   74   62  103   65     743
  AIS 3–5                                  59   52   28   41   49   42   57   30   54   32     444
Excluded codes(a)                          45   50   59   23   18   71   26   18   36   46     509

AIS: Abbreviated Injury Scale.
(a) It was considered satisfactory to code one skin injury per severity level per patient. Additional skin injury codes were not classified as errors; rather, these were excluded from the injury summation.

All LoAs, for both ISS and NISS, were wider than the predefined clinically acceptable LoA range of ±9. The median (range) proportion of measurements covered within the predefined ±9 limits was 80% (56%,94%) for ISS and 68% (48%,86%) for NISS.

Reliability

The joint ICC (range) for all raters against the reference standard was 0.51 (0.29,0.86) for ISS (Fig. 2) and 0.51 (0.27,0.78) for NISS (Fig. 3).

The joint ICC (range) for inter-rater reliability was 0.49 (0.19,0.85) for ISS and 0.49 (0.16,0.82) for NISS. Hierarchical agglomerative clustering analysis of the inter-rater ISS ICC values revealed two subgroups, with raters 1 and 9 in the least agreement with the rest of the raters (Fig. 4a). These two clusters showed relatively little agreement with one another. Inter-rater NISS ICC clustering also showed two subgroups (Fig. 4b), confirming that raters 1 and 9 stood out as being least in agreement with the other raters.
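The complete-linkage clustering on '1 − ICC' distances described in the methods can be reproduced in outline with SciPy, assuming a symmetric matrix of pairwise inter-rater ICC values (the matrix below is hypothetical, not the study's data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric matrix of pairwise inter-rater ICCs for 4 raters;
# rater 4 agrees poorly with the others.
icc = np.array([
    [1.00, 0.80, 0.75, 0.30],
    [0.80, 1.00, 0.78, 0.25],
    [0.75, 0.78, 1.00, 0.20],
    [0.30, 0.25, 0.20, 1.00],
])
dist = 1.0 - icc                     # '1 - ICC' as the distance measure
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist)         # condensed form required by linkage
Z = linkage(condensed, method="complete")        # complete-linkage merging
labels = fcluster(Z, t=2, criterion="maxclust")  # cut dendrogram into 2 groups
```

In this toy matrix the cut separates the outlying rater from the other three, analogous to how raters 1 and 9 separate from the rest in Fig. 4.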

Removing the apparent outlier, rater 1, from the rater-against-reference and inter-rater ICC calculations changed the ICCs only marginally.

Table 2. Missed codes stratified by body region.

Body region                   Missed codes,   Median (range) missed   Codes in reference
                              all raters      codes per rater         standard
Head                           440            45.5 (28–56)            125
Face                           147            16 (7–19)                34
Neck                            25            3 (0–3)                   3
Thorax                          96            7.5 (5–23)               55
Abdomen and pelvic contents     23            2.5 (0–4)                18
Spine                          161            12.5 (8–31)              34
Upper extremities               93            10 (2–15)                36
Lower extremities              131            13 (8–18)                45
External and other trauma       71            6.5 (2–18)               32
Total                         1187            –                       382

Levels of agreement and reliability between the reference panellists

The narrowest ISS LoA range for inter-panellist agreement was from −10.69 to 9.65, and the widest was from −16.87 to 13.63. The narrowest NISS LoA range was from −13.44 to 16.24, and the widest was from −25.75 to 18.83. All ranges were wider than the predefined clinically acceptable LoA range of ±9. The median (range) proportion of measurements covered within the predefined ±9 limits was 86% (80%,90%) for ISS and 68% (58%,80%) for NISS.

The joint ICC (range) for inter-panellist reliability was 0.72 (0.64,0.79) for ISS and 0.68 (0.64,0.77) for NISS.

Relationships between the ratings and the participants' experience

Univariate linear regression analyses indicated a statistically significant relationship between the number of correctly AIS-coded injuries and the total number of cases coded during the rater's career (P = 0.03), but no statistically significant relationship was found between the rater-against-reference ICC and the total number of cases coded during the rater's career for ISS (P = 0.80) or NISS (P = 0.45).

Fig. 1. Bland–Altman limits of agreement for ISS and NISS scores from each rater against the reference standard and against each other. The figure shows the agreement between raters and between each rater and the reference standard for ISS values (a) and NISS values (b), as expressed by the Bland–Altman 95% limits of agreement (LoA). The x-axis shows the LoAs, and the y-axis shows each pair of raters. The vertical broken line crossing zero indicates perfect agreement, whereas the two dotted lines crossing ±9 indicate the clinically acceptable LoA.

Discussion

The anatomic injury scores assigned by ten AIS-certified trauma registry coders using AIS 2008 varied considerably, with less than

two-thirds of the codes agreeing with a reference standard and with nearly one-third of injuries overlooked. This led to relatively low levels of agreement and reliability for injury severity scoring (ISS and NISS), and indicates that summative injury scoring using the AIS system is subject to large inter-rater variability and thus must be interpreted with great caution.

We can probably assume that the AIS system will always be subject to some inter-rater variability. The question is therefore how large a variation in ISS and NISS we can accept. In this study, we failed to find values within the acceptable LoA range of ±9 units of disagreement in ISS and NISS for the raters. The amount of disagreement between the reference panellists before panel consensus was reached is also noteworthy and gives an indication of the complexity of the system.

Fig. 2. Rater-against-reference reliability for ISS calculated by intraclass correlation coefficient. The levels of reliability between each rater and the reference standard for ISS, determined by intraclass correlation coefficient (ICC) statistics, and the corresponding 95% confidence intervals (CI) are depicted. The ICC is on a scale from 0 to 1, where 0 indicates agreement no better than that expected by chance and 1 indicates perfect agreement.

Our study assessed the reliability of the scales in a real-life situation. We employed authentic cases selected from consecutively admitted trauma patients, and the raters were instructed to independently score all 50 cases according to an expanded version of the Utstein Template. This study design permitted the evaluation of coding agreement and reliability in a setting similar to that of real trauma registry work, including the registration and coding of multiple data variables, without us identifying specific injuries for the raters.

In contrast, MacKenzie et al. studied coding reliability by reviewing cases to identify a set of injuries that the majority of the raters had coded,7 whereas Read-Allsopp asked her raters to classify an artificially developed list of random injuries.27 Furthermore, we included a reference standard that enabled us to evaluate concurrent criterion-related validity, which involves comparing each rater's assessment with that of a reference standard14,15 and allowed us to estimate the accuracy of injury identification and severity classification associated with the use of AIS codes. The few previous studies of agreement and/or reliability in the AIS system7–9,27 did not test these against a specifically developed, consensus-derived reference standard. MacKenzie et al. identified one rater as the most accurate and consistent and compared all other raters with this 'reference'7; however, these authors did not explain the specific criteria for accuracy and consistency that were used to choose the reference rater.28

Fig. 3. Rater-against-reference reliability for NISS calculated by intraclass correlation coefficient. The levels of reliability between each participant and the reference standard for NISS, determined by intraclass correlation coefficient (ICC) statistics, and the corresponding 95% confidence intervals (CI) are depicted.


Fig. 4. Hierarchical agglomerative clustering of intraclass correlation coefficient values for the inter-rater reliability of ISS (a) and NISS (b). The dendrograms depict hierarchical agglomerative clustering. Similar elements are linked near the bottom of the graph (i.e., near 1), whereas dissimilar elements are linked higher on the graph (i.e., near 0).


While the raters in our study all had formal competence, with a certification in injury coding from the AAAM and clinical experience in emergency medicine and trauma care, they had varying degrees of coding experience (non-formal competence). However, they can be considered a realistic sample of those who code for the trauma registries in Norway and who will provide the data for a future national trauma registry. Thus, these results are a snapshot of the agreement and reliability that we may expect from Norwegian coders. We failed to find a relationship between rater-against-reference ICC and the total number of cases coded during the rater's career for ISS and NISS, but found a relationship between the number of correctly AIS-coded injuries and the total number of cases coded during the rater's career. This indicates that the problems may lie in the abstraction of injury descriptions and the assignment of AIS codes, which are the absolute bases of ISS and NISS. However, we were unable to perform more detailed multivariable linear regression analyses because only ten raters were included. Therefore, we cannot rule out a possible influence of other causal relationships.

Decreasing the number of raters who code trauma cases in Norway, i.e., performing centralised coding in a regional or national registry, may reduce coding variability within the registry, but would probably not solve the problem of comparing data across nations, or between individual hospitals from different countries. Another way to improve the consistency of coding, especially with regard to AIS coding, might be to have two coders screen each case. Alternatively, coders could perform regular reviews of the accuracy of injury identification and AIS coding in a random set of cases. However, duplicate injury classification would increase costs in terms of human resources and training requirements, which may be unaffordable and therefore unacceptable in resource-constrained settings.29,30 This could result in AIS coding being performed only in hospitals with a zealous emphasis on trauma care (e.g., level I–II trauma centres in high-income countries).31 Another possibility for injury coding could be to use the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM)32 and map ICD-9-CM codes into AIS codes.33 However, most European hospitals use the ICD-10 edition, and even though an ICD-10-CM version is available,34 an ICD-10-CM-to-AIS 2008 mapping tool is, to our knowledge, not currently available. However, we anticipate that the ICD system will also be subject to inter-rater variability, because the individuals assigning ICD codes will most likely face the same problems in identifying and coding injuries correctly

according to radiology and surgery reports. Furthermore, in comparative studies of injury scores, the ICD-9-to-AIS-mapped scores generally did not perform as well as those based on directly coded AIS scores.35–38 It should also be emphasised that ICD-to-AIS mapping does not generate a 1-to-1 match for many injury descriptions, but rather a 1-to-many mapping from ICD to AIS.
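The 1-to-many problem can be made concrete: a mapping table keyed by ICD code returns several candidate AIS codes of different severities, so some tie-break rule (worst-case, most likely, or manual review) is needed to pick one. The codes below are invented placeholders, not entries from any real mapping table.

```python
# Hypothetical sketch of why ICD-to-AIS mapping is 1-to-many: one ICD injury
# code can correspond to several AIS codes with different severity levels.
# All code values here are invented placeholders, not real mapping entries.

ICD_TO_AIS = {
    "ICD-X1": [("AIS-A1", 2), ("AIS-A2", 3), ("AIS-A3", 4)],  # ambiguous case
    "ICD-X2": [("AIS-B1", 1)],                                # clean 1-to-1 case
}

def map_icd(icd_code, rule="worst"):
    """Return one (AIS code, severity) candidate under a stated tie-break rule."""
    candidates = ICD_TO_AIS[icd_code]
    if rule == "worst":            # pessimistic: pick the highest severity
        return max(candidates, key=lambda c: c[1])
    return candidates[0]           # otherwise take the first listed candidate

print(map_icd("ICD-X1"))  # the ambiguity is resolved only by the chosen rule
```

Whichever rule is chosen systematically biases the resulting severity scores, which is one reason mapped scores tend to underperform directly coded AIS.35–38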

This study shows that, in our current setting, the use of the AIS methodology is unlikely to have adequate precision to function as a benchmarking tool. However, through targeted training, a more comprehensive AIS course, coding consensus processes, more time to code at each hospital, regular recertification, and properly designed databases, the accuracy, agreement and reliability of AIS scoring may increase to an acceptable level. Overall, these findings indicate the need to initiate quality improvement processes. Because a more adequate and precise injury-scoring alternative is not currently available, the trauma registry community may use these results to improve scoring accuracy and precision. This is especially important for the introduction of a national inclusive trauma registry system.

A limitation of the AIS system that may cause rater variability is that the AIS dictionary is very detailed and complex, with many specific coding rules, and cannot be properly understood without extensive experience. The fact that NISS summarises the three most severe injuries regardless of body region may explain why the LoA ranges were wider and the ICC values were lower for NISS than for ISS.
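The structural difference between the two scores can be made explicit: ISS squares the single worst AIS severity in each of the three most severely injured body regions, whereas NISS squares the three worst AIS severities irrespective of region. A minimal sketch, assuming injuries are given as (body region, AIS severity) pairs and omitting special rules such as AIS 6 and unscorable (AIS 9) injuries:

```python
# Sketch of ISS vs NISS from a list of (body_region, AIS_severity) injuries.
# Simplified: omits special rules such as AIS 6 -> ISS 75 and AIS 9 injuries.

def iss(injuries):
    # ISS: sum of squares of the worst injury in each of the three most
    # severely injured body regions.
    worst_per_region = {}
    for region, severity in injuries:
        worst_per_region[region] = max(worst_per_region.get(region, 0), severity)
    top3 = sorted(worst_per_region.values(), reverse=True)[:3]
    return sum(s ** 2 for s in top3)

def niss(injuries):
    # NISS: sum of squares of the three worst injuries, regardless of region.
    top3 = sorted((s for _, s in injuries), reverse=True)[:3]
    return sum(s ** 2 for s in top3)

# Three injuries in the same region: ISS counts only the worst one,
# while NISS counts all three, so NISS >= ISS.
case = [("head", 5), ("head", 4), ("head", 3), ("chest", 2)]
print(iss(case))   # -> 5**2 + 2**2 = 29
print(niss(case))  # -> 5**2 + 4**2 + 3**2 = 50
```

Because NISS admits all three injuries from a single region, a single missed or miscoded severe injury shifts NISS more readily than ISS, consistent with the wider LoA and lower ICC observed for NISS.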

Some limitations of the study are worth noting. The inclusion of only patients with a NISS > 15 may have introduced a selection bias. However, because our focus was on severely injured patients, in accordance with the inclusion criteria of the Utstein Trauma Template, this choice was considered valid. The process of assigning AIS codes may have been different for the reference panellists (fewer time restrictions, higher competence) than for the raters (more time restrictions, lower competence). The raters probably spent less time coding than the reference panellists, but constrained time frames are probably a more accurate reflection of the everyday setting for most raters. This may have affected the coding accuracy and reliability. The reliability might have been different for less severely injured patients, who are also more often treated and scored in lower-level trauma centres. Therefore, future studies should test the reliability of AIS scoring in patients with mild to moderate injuries.

Three panellists, all from the same institution, developed the reference standard. This may have introduced a bias, as these panellists may have all adopted similar coding habits. However, all raters had attended courses with MH as an instructor, which should have reduced the variability between the reference standard and the raters.

Nine invited hospitals declined, did not respond to the invitation to participate in the study, or withdrew from the study. A lack of response from the invitees may have introduced a selection bias, and since no follow-up was performed for the invitees that did not participate, we cannot rule out the possibility that the characteristics of the non-respondents differed from those of the respondents.

This study did not test the reliability of the AIS scale as such; rather, it evaluated the accuracy of injury identification and the reliability of ISS and NISS. Therefore, we were not able to judge the actual reliability of AIS 2008.

Finally, because the medical charts contain information provided by several different health personnel, the same injury may, in some cases, have been described in several different ways in the same medical chart. Injuries may have been described differently by radiologists and by surgeons, which may have caused some confusion in assigning the correct severity level. Vague or missing injury information in patient charts, together with complicated AIS descriptors, are probably also important factors causing coding variation. This is a fundamental problem with injury coding.

Further studies may include an intra-rater reliability test, a new rater-against-reference standard reliability test, and a test of the reliability of the AIS scale itself. Future studies should also be designed to evaluate how differences in scoring between raters affect the utility of ISS and NISS in outcome prediction models for trauma patients.

Conclusions

Anatomic injury scores assigned by AIS-certified trauma registry coders using AIS 2008 varied considerably in this study. This caused relatively low levels of agreement and reliability of injury severity scores for ISS and NISS, and indicates that these scoring tools are overly rater dependent. ISS and NISS scores cannot be considered reliable classifiers for summarising anatomic injury severity, and may not be appropriate for benchmarking trauma system performance.

Conflict of interest statement

KGR, MR, and AJK have received PhD funding from the Norwegian Air Ambulance Foundation (SNLA). NOS has received post-doctoral grants from the South-Eastern Norway Regional Health Authority. The other authors declare that they have no external financial or non-financial conflicts of interest related to this study.

Authors’ contributions

KGR, NOS, MH, MR, PAS, OR, AJK, and HML planned the study. KGR, MH, and MR selected and anonymised the medical records. KGR, MH, and AJK developed the web-based databases. KGR, MH, and NOS developed the reference standard, assisted by MR. KGR and MH investigated all injury codes. KGR and JR analysed the data. KGR wrote the first manuscript draft. All authors contributed to the interpretation of the results, helped to draft the manuscript, and approved the final version of the manuscript.

Acknowledgements

We thank senior lecturer J. Mary Jones PhD (Mathematics Department, Faculty of Natural Sciences, Keele University, Keele, UK) for assistance in planning the project. We acknowledge the Unit for Applied Clinical Research, Norwegian University of Science and Technology, Trondheim, for designing the web-based database for the collection of clinical data from the trauma cases. We acknowledge the Centre for Information Technology Services, University of Oslo, for providing a web-based tool for collecting questionnaire data. We deeply acknowledge the members of the SNLA for funding the scholarships of the PhD students Kjetil G. Ringdal MD, Marius Rehn MD, and Andreas J. Krüger MD. We also express deep gratitude to SNLA for financially compensating the participants who took time from their hospital work to participate in the study. The study sponsor did not have any involvement in the study. We acknowledge the coding work performed by the participants from Finnmark Hospital, Hammerfest; Sørlandet Hospital, Kristiansand; Nord-Trøndelag Hospital, Levanger; Nord-Trøndelag Hospital, Namsos; Oslo University Hospital, Ullevål; St. Olav's University Hospital, Trondheim; Telemark Hospital, Skien; Vestre Viken Hospital, Asker and Bærum; and Østfold Hospital, Fredrikstad.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.injury.2012.06.032.

References

1. Association for the Advancement of Automotive Medicine. Abbreviated Injury Scale (AIS) 2005 – update 2008. Barrington, IL: Association for the Advancement of Automotive Medicine; 2008.

2. Baker SP, O'Neill B, Haddon Jr W, Long WB. The injury severity score: a method for describing patients with multiple injuries and evaluating emergency care. Journal of Trauma 1974;14:187–96.

3. Osler T, Baker SP, Long W. A modification of the injury severity score that both improves accuracy and simplifies scoring. Journal of Trauma 1997;43:922–5. [discussion 925–926].

4. Boyd CR, Tolson MA, Copes WS. Evaluating trauma care: the TRISS method. Trauma Score and the Injury Severity Score. Journal of Trauma 1987;27:370–8.

5. Champion HR, Sacco WJ, Hunt TK. Trauma severity scoring to predict mortality. World Journal of Surgery 1983;7:4–11.

6. Barancik JI, Chatterjee BF. Methodological considerations in the use of the abbreviated injury scale in trauma epidemiology. Journal of Trauma 1981;21:627–31.

7. MacKenzie EJ, Shapiro S, Eastham JN. The Abbreviated Injury Scale and Injury Severity Score. Levels of inter- and intrarater reliability. Medical Care 1985;23:823–35.

8. Neale R, Rokkas P, McClure RJ. Interrater reliability of injury coding in the Queensland Trauma Registry. Emergency Medicine (Fremantle) 2003;15:38–41.

9. Zoltie N, de Dombal FT. The hit and miss of ISS and TRISS. Yorkshire Trauma Audit Group. BMJ 1993;307:906–9.

10. Ringdal KG, Coats TJ, Lefering R, Di Bartolomeo S, Steen PA, Røise O, et al. The Utstein template for uniform reporting of data following major trauma: a joint revision by SCANTEM, TARN, DGU-TR and RITG. Scandinavian Journal of Trauma Resuscitation and Emergency Medicine 2008;16:7.

11. Ringdal KG, Coats TJ, Lefering R, Di Bartolomeo S, Steen PA, Røise O, et al. The Utstein Trauma Template for uniform reporting of data following major trauma: data dictionary. European Trauma Registry Network; 2008. http://www.euro-trauma.net/.

12. Ringdal KG, Lossius HM, Jones JM, Lauritsen JM, Coats TJ, Palmer CS, et al. Collecting core data in severely injured patients using a consensus trauma template: an international multicentre study. Critical Care 2011;15:R237.

13. Better & Systematic Trauma Care Foundation (BEST). BEST. http://www.best-net.no/.

14. Jones JM. Nutritional screening and assessment tools. New York: Nova Science Publisher, Inc; 2006.

15. Streiner DL, Norman GR. Health measurement scales. A practical guide to their development and use. 4th ed. New York: Oxford University Press Inc; 2008.

16. Ringdal KG, Hestnes M, Palmer CS. Differences and discrepancies between 2005 and 2008 Abbreviated Injury Scale versions – time to standardise. Scandinavian Journal of Trauma Resuscitation and Emergency Medicine 2012;20:11.

17. Altman DG, Bland JM. Measurement in medicine: the analysis of method comparison studies. Statistician 1983;32:307–17.

18. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307–10.

19. Bland JM, Altman DG. Agreement between methods of measurement with multiple observations per individual. Journal of Biopharmaceutical Statistics 2007;17:571–82.

20. Carstensen B. Comparing and predicting between several methods of measurement. Biostatistics 2004;5:399–413.

21. Kottner J, Audige L, Brorson S, Donner A, Gajewski BJ, Hrobjartsson A, et al. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. Journal of Clinical Epidemiology 2011;64:96–106.

22. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin 1979;86:420–8.

23. Shoukri MM. Measures of interobserver agreement and reliability. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC Press; 2010.

24. Shrout PE. Measurement reliability and agreement in psychiatry. Statistical Methods in Medical Research 1998;7:301–17.

25. Johnson RA, Wichern DW. Applied multivariate statistical analysis. 6th ed. Upper Saddle River, NJ: Pearson Prentice Hall; 2008.

26. R Development Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing [accessed 15.08.10].

27. Read-Allsopp C. Establishing inter-rater reliability scoring in a state trauma system. Journal of Trauma Nursing 2004;11:35–9.

28. Posner KL, Sampson PD, Caplan RA, Ward RJ, Cheney FW. Measuring interrater reliability among multiple raters: an example of methods for nominal data. Statistics in Medicine 1990;9:1103–15.

29. Nakahara S, Yokota J. Revision of the International Classification of Diseases to include standardized descriptions of multiple injuries and injury severity. Bulletin of the World Health Organization 2011;89:238–40.

30. Cryer C. Severity of injury measures and descriptive epidemiology. Injury Prevention 2006;12:67–8.

31. Osler T, Rutledge R, Deis J, Bedrick E. ICISS: an international classification of disease-9 based injury severity score. Journal of Trauma 1996;41:380–6. [discussion 386–388].

32. Centers for Disease Control and Prevention. International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). http://www.cdc.gov/nchs/icd/icd9cm.htm.

33. MacKenzie EJ, Steinwachs DM, Shankar BS, Turney SZ. An ICD-9CM to AIS conversion table: development and application. Proceedings American Association for Automotive Medicine Annual Conference 1986;30:135–51.

34. Centers for Disease Control and Prevention. International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM). http://www.cdc.gov/nchs/icd/icd10cm.htm.

35. Meredith JW, Evans G, Kilgo PD, MacKenzie E, Osler T, McGwin G, et al. A comparison of the abilities of nine scoring algorithms in predicting mortality. Journal of Trauma 2002;53:621–8. [discussion 628–629].

36. Sacco WJ, MacKenzie EJ, Champion HR, Davis EG, Buckman RF. Comparison of alternative methods for assessing injury severity based on anatomic descriptors. Journal of Trauma 1999;47:441–6. [discussion 446–447].

37. Di Bartolomeo S, Tillati S, Valent F, Zanier L, Barbone F. ISS mapped from ICD-9-CM by a novel freeware versus traditional coding: a comparative study. Scandinavian Journal of Trauma Resuscitation and Emergency Medicine 2010;18:17.

38. Stephenson SC, Langley JD, Civil ID. Comparing measures of injury severity for use with large databases. Journal of Trauma 2002;53:326–32.