Preventive Veterinary Medicine 104 (2012) 240–248
Contents lists available at SciVerse ScienceDirect
Journal homepage: www.elsevier.com/locate/prevetmed

Describing the within laboratory and between laboratory agreement of a serum ELISA in a national laboratory network

David Jordan a,*, Peter Kirkland b, Stephen Morris a, Ronald Coilparampil b

a NSW Primary Industries, 1243 Bruxner Highway, Wollongbar Primary Industries Institute, Wollongbar 2477, NSW, Australia
b NSW Primary Industries, Elizabeth Macarthur Agricultural Institute, Camden, NSW, Australia

Article history: Received 26 August 2011; received in revised form 8 November 2011; accepted 9 November 2011.

Keywords: Repeatability; Reproducibility; Reliability; Agreement; Laboratory assay; ELISA

Abstract

Results from laboratory assays for detection of animal disease are often assessed for repeatability (agreement within laboratory) and reproducibility (agreement between laboratories). This work aimed to understand the strengths and limitations of available methods for describing these quantities. Five major veterinary laboratories in Australia volunteered to participate in a designed evaluation based on repeat testing of twenty bovine sera. Sampling was stratified so that ten of the sera were negative to the virus neutralisation test (VNT) for antibody to bovine herpes virus 1 (BHV-1) and the remaining ten sera were VNT positive. Each serum was divided into 50 replicates and each laboratory assayed one replicate of each serum on a weekly basis using a commercial ELISA for BHV-1. Laboratories were blinded to the identity of sera. The data on sample to positive control ratio (S/P) for these 1000 individual assays were collated, sources of variance analysed using a random effects model, and reliability coefficients (ρ) obtained from the variance estimates as quantitative measures of within and between laboratory agreement. Coefficient of variation (CV) was calculated for combinations of sera and laboratory. CV was found to be higher for sera with the lowest mean S/P values (VNT −ve sera). For VNT −ve sera, agreement of S/P within laboratory was low to moderate (ρ: 0.01–0.27) and the agreement between all labs was low (ρ = 0.02). Reliability coefficients for VNT +ve sera were very high for agreement within laboratories (ρ: 0.63–0.92) and moderate for agreement between laboratories (ρ = 0.52). As well, simulation demonstrated that sero-prevalence has a dramatic effect on the reliability coefficient if sampling were to be irrespective of VNT status. We conclude that there are some limitations with the available approaches for assessing agreement within and between laboratories. Although reliability coefficients have some drawbacks they are an attractive way of reducing reliance on subjective assessment of agreement.

Crown Copyright © 2011 Published by Elsevier B.V. All rights reserved.
doi:10.1016/j.prevetmed.2011.11.005

* Corresponding author. Tel.: +61 2 66261240; fax: +61 2 66281744. E-mail address: [email protected] (D. Jordan).

1. Introduction

Knowledge of the performance of assays used to detect the presence of infectious agents is central to the management and prevention of animal diseases. The World Organisation for Animal Health (OIE) has recommended approaches to validation of assays (OIE, 2009) and in Australia these are promoted throughout the national network of veterinary laboratories (SCAHLS, 2007). A central element of these guidelines is to describe the extent of agreement (precision) in assay results (e.g., titre, optical density – OD) when a single sample (e.g., serum) is assessed multiple times within a laboratory and within multiple laboratories. Within laboratory agreement in assay results is usually referred to as repeatability and between laboratory agreement is usually referred to as reproducibility (Collins et al., 1993). These forms of agreement are a way of describing the quality of measurement made on the




original scale without dichotomising the result as in the case of assessment of diagnostic accuracy (sensitivity and specificity). While there is no simple relationship between measures of agreement and measures of diagnostic accuracy, those tests with a high level of diagnostic accuracy must be based on assays that have a high level of agreement. However, a high level of agreement in assay result does not guarantee a high level of diagnostic accuracy, since it is possible for an assay to give highly repeatable measurements while being systematically biased in some way.

Circumstances where it might be useful to assess the agreement amongst assay results include: during the development of new diagnostic assays, when there is a need to understand the contribution of laboratory implementation of an assay to errors (as in quality assurance), when there is interest in interpreting "strength of reaction" information (e.g., magnitude of ELISA OD reading, titre, etc.) to inform on diagnosis, and when comparing results between test occasions and between laboratories. Assessment of agreement may also play an important role when there are too few animals of known disease status available to assess diagnostic accuracy, such as when a disease is new or occurs so infrequently that it is difficult to accumulate appropriate reference samples. Although the assessment of agreement is becoming mandatory for new assays, the methods for doing this have received scant attention in the research literature.

The traditional way of assessing agreement for an assay is to use the coefficient of variation (CV), defined as the standard deviation of repeated observations divided by their mean. While the CV is easy to derive, it only provides a crude summary of the variation in measurements on a given sample, not across multiple samples. Graphical analysis is useful for subjective assessment of agreement although the process of interpretation can be awkward in the case where repeated measures are made on multiple samples. A third way that has received little attention in the veterinary literature is to assess the statistical reliability. Here, the reliability coefficient (a form of the intraclass correlation coefficient or ICC) is estimated as a summary measure of agreement across multiple samples. Estimation of the ICC requires a suitable study design matched to an appropriate analysis. Despite the growing need for the agreement of assays to be assessed, the benefits and pitfalls of the above approaches appear not to have been described by concurrent application to a single data set.
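As a concrete sketch of that definition (our own illustration, not code from the study), the CV of a set of repeated measurements on a single sample is:

```python
def cv(values):
    """Coefficient of variation: standard deviation of repeated
    observations on a single sample divided by their mean."""
    n = len(values)
    mean = sum(values) / n
    # sample standard deviation (n - 1 denominator)
    sd = (sum((x - mean) ** 2 for x in values) / (n - 1)) ** 0.5
    return sd / mean

cv([90.0, 100.0, 110.0])  # -> 0.1
```

Because the mean appears in the denominator, CV is unstable for samples whose mean S/P is near zero, one reason that sera with the lowest mean S/P return the highest CV values later in this study.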

The aim of this work was therefore to explore the utility of graphical description, CV and reliability coefficients as ingredients of the validation of an ELISA assay used in a national laboratory network. This extends beyond earlier work evaluating the serology of Mycobacterium avium subsp. paratuberculosis in a recurrent testing process within multiple laboratories (Dargatz et al., 2004). This latter work was based on a limited number of individual sera (three), preventing the estimation of reliability coefficients. Variance components have been assessed for an assay for antibodies in bovine milk and were used to explain variation in assay performance due to factors within a laboratory (Nielsen, 2002). However, the latter did not formally involve the assessment of ICC, nor CV or graphical descriptions of agreement. In this work we performed 1000 assays for antibody on 20 individual sera in five different laboratories. The data provide an ideal basis for comparing methods suited to the assessment of agreement in assay results.

2. Methods

The assessment of agreement required a study design incorporating repeated measurements made on individual samples. Repeated measurements were made within multiple laboratories to assess within and between laboratory agreement. Bias was controlled by blinding participants to the identity of sera and other testing laboratories.

2.1. Selected assay

This work relied on an assay widely used to detect antibody against BHV-1 (Pourquier® ELISA IBR Serum and Milk Screening test). This assay was chosen because participating laboratories and their technicians were familiar with the protocol, because a commercial kit containing all reagents was freely available, and because it is similar to many other ELISAs used in veterinary laboratories for detection of antibody. A test for BHV-1 was attractive because the virus is known to be present in the Australian cattle herd, allowing access to serum from infected individuals.

2.2. Sera and collaborating laboratories

A panel of twenty bovine sera was obtained from −20 °C storage at the virology laboratory of Elizabeth Macarthur Agricultural Institute, Menangle, New South Wales, Australia (EMAI). The sera were from a collection exceeding 2000 in number and had been accumulated from submissions related to live-export. These sera had all been earlier screened using the IBR ELISA. One group of 10 sera was sampled from submissions where all samples had been screened as negative, ensuring that none of these belonged to consignments containing positive sera. Although we recognise that formal random sampling (based on a sampling frame) of the negative sera from the collection is preferred, this was not possible. However, the positioning of individual sera in cold storage was based on timing of acquisition of the sample, and technicians removed the ten negative samples from different locations spaced evenly throughout the collection, in the fashion of systematic random sampling. A second group of 10 sera was similarly chosen from the positive sera. In the case of positive sera there was sometimes an insufficient volume of sample available from individual animals to provide the total volume required by the study design (5 mL/serum). In this case, sera of a similar IBR ELISA reaction strength were combined and blended to ensure the minimum volume was available. All twenty sera included in the panel were then subjected to the virus neutralisation test (VNT) for antibody to IBR (Gu and Kirkland, 2007) as an independent means of confirming the presence or absence of antibody.

After conduct of the VNT, the panel of 20 sera was then arranged in random sequence and passed to a second laboratory (Regional Veterinary Laboratory, RVL)

responsible for the division of sera, dispatch of sera to testing laboratories around Australia and collation of results. RVL was provided with no information about the sera. In this laboratory each serum was subdivided to create 50 identical replicates (100 µL in volume), with each replicate temporarily identified and stored in sterile tubes at 4 °C. Inclusion of five testing laboratories allowed estimate of variance in results attributable to laboratories. Every replicate of every serum was re-labelled with a unique, six character code generated at random from a computer algorithm. Sera were packed in chilled boxes and dispatched by air courier to the five collaborating testing laboratories. Delivery of samples to each testing laboratory and temperature on arrival was verified by the dispatch laboratory. Laboratories were sent standard instructions for participating in the trial.

The five veterinary laboratories performing the IBR ELISAs volunteered to participate after canvassing all major veterinary laboratories in Australia. Inclusion of a testing laboratory in the study was conditional on each agreeing to only use the specified IBR ELISA kit, to follow the kit manufacturer's instructions for performing the assay, to only test each serum replicate once, and to maintain confidentiality of results.
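The six-character re-labelling step could be implemented along these lines. The study does not describe its algorithm, so the function name, character set and seed below are illustrative assumptions:

```python
import random
import string

def make_codes(n, length=6, seed=42):
    """Generate n unique random codes; collisions are simply re-drawn.
    Character set and seed are illustrative assumptions, not the
    study's actual algorithm."""
    rng = random.Random(seed)
    alphabet = string.ascii_uppercase + string.digits
    codes = set()
    while len(codes) < n:
        codes.add("".join(rng.choice(alphabet) for _ in range(length)))
    return sorted(codes)

labels = make_codes(20 * 50)  # one code per replicate: 20 sera x 50 replicates
```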

2.3. Study design

Each of the five testing laboratories received 10 replicates of each of the 20 sera arranged into 10 test batches (each lab received a total of 200 serum samples). One replicate of each serum was present within each test batch. Laboratories were requested to process one test batch at approximately weekly intervals within their routine testing for IBR antibody in bovine sera. OD of the test sample and that of the control sera from each test plate were recorded against the unique coded identity. At completion of all testing, results were collated and relayed for analysis.

2.4. Descriptive analysis

The interpretive criterion for the IBR ELISA, when performed according to the manufacturer's instructions, is the sample to positive control ratio (S/P) expressed as a percent of OD recorded at a wavelength of 450 nm. S/P was calculated using the data on OD of the sera tested (OD_sample), the mean of the OD of two positive controls (OD_poscontrol) and the OD of a single negative control (OD_negcontrol) according to the formula:

S/P = (OD_sample − OD_negcontrol) / (OD_poscontrol − OD_negcontrol) × 100    (1)

Computer programs were used to de-code the design features (identity of sera, test batch, sera and replicate number) of each individual test result from the six character code. However, individual laboratories continued to be referred to in the analysis by their unique coded letter. Results for sera where a testing laboratory's ELISA result (on a dichotomous scale) differed from the initial VNT result were tabulated to discern possible causes of disagreement. Graphical analyses were performed by grouping S/P results and CV by laboratory and by sera to depict variation within and between these factors.
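Eq. (1) translates directly into code; a minimal sketch (function and argument names are ours, not from the kit documentation):

```python
def sp_ratio(od_sample, od_pos_controls, od_neg_control):
    """Sample to positive control ratio (S/P, %) per Eq. (1).
    od_pos_controls: OD readings of the two positive controls,
    which are averaged as described above."""
    od_pos = sum(od_pos_controls) / len(od_pos_controls)
    return (od_sample - od_neg_control) / (od_pos - od_neg_control) * 100

sp_ratio(0.9, [1.8, 2.0], 0.1)  # -> 44.44..., just below the S/P = 45 cut-point
```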

2.5. Estimation of the reliability of S/P data

Measurement models for estimating reliability are linear equations describing how the design factors (including sera, laboratories and error) account for variation in the observations. Measurement model A, represented by Eq. (2), was used for within laboratory agreement. It includes the effect of sera being processed in batches:

X_ij = τ_i + β_j + e_ij    (measurement model A)    (2)

In Eq. (2), X_ij is the S/P measurement made on the ith serum in the jth batch, τ_i is the effect of the ith serum, β_j is the effect of the jth batch and e_ij is the residual error term. Estimating reliability across all laboratories (reproducibility) must account for variance due to laboratories and additional interactions that arise (measurement model B):

X_ijk = τ_i + α_k + γ_jk + ω_ik + e_ijk    (measurement model B)    (3)

In Eq. (3), X_ijk is the S/P measurement made on the ith serum in the jth batch by the kth lab, τ_i is the effect of the ith serum, α_k is the effect of the kth lab, γ_jk is the effect of batch j within lab k (where batch is nested within laboratory), ω_ik is the interaction between serum i and lab k, and e_ijk is the residual error term.

Each measurement model was subject to an analysis (below) to obtain the estimates of variance required for calculating the appropriate reliability coefficient. For model A (within laboratory agreement) the equation for obtaining reliability is:

ρ = σ²_τ / (σ²_τ + σ²_β + σ²_e)    (4)

For model B (between laboratory agreement) the equation is:

ρ = (σ²_τ + σ²_ω) / (σ²_τ + σ²_α + σ²_γ + σ²_ω + σ²_e)    (5)

In Eqs. (4) and (5) the ICC reliability coefficient is ρ (rho) and the variance components are: σ²_τ is the variance due to sera, σ²_β is the variance due to batch, σ²_e is the variance due to error, σ²_α is the variance due to laboratory, σ²_γ is the variance due to batches within laboratory and σ²_ω is the variance due to interaction between sera and laboratory.
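As a worked check (our own sketch, not code from the study), Eq. (5) applied to the rounded VNT +ve variance components reported later in Table 3 approximately reproduces the between laboratory coefficient of Table 2:

```python
def icc_between_lab(var_sera, var_lab, var_batch_in_lab, var_sera_lab, var_error):
    """Between laboratory reliability coefficient, Eq. (5)."""
    return (var_sera + var_sera_lab) / (
        var_sera + var_lab + var_batch_in_lab + var_sera_lab + var_error)

# VNT +ve variance components from Table 3 (sera, lab, batch within lab,
# sera x lab, error); rounding of the published components explains the
# small gap from the reported value of 0.52
rho = icc_between_lab(10282.9, 8383.8, 1731.7, 2491.9, 1278.4)  # -> 0.529
```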

Variance estimates for Eqs. (4) and (5) were obtained from random-effects models solved using the xtmixed routine in Stata (version 11.1, Stata Corporation, TX, USA), taking care to correctly specify the terms according to the recommendations of Marchenko (2006) and Rabe-Hesketh and Skrondal (2008). Analysis of S/P data within linear random-effects models assumes that the residuals are independently and normally distributed. However, S/P may possibly exhibit a mean–variance relationship, so assumptions were assessed using plots of residuals. The latter included models based on various transformations of the response variable (logarithmic, inverse, square, square root, etc.).

Fig. 1. (a) Panel of plots for all ten VNT negative sera showing the IBR ELISA S/P measurement. One plot is provided per serum, with plots arranged across rows in order of increasing mean of seropositivity and labs arranged (left to right) within each plot in order of increasing mean of lab seropositivity. Plot headings indicate the serum number and VNT status (for VNT positive sera, the VNT titre). Horizontal broken lines indicate the interpretation cut-point (S/P = 45). Each "x" symbol represents a discordant result (disagreement with the VNT result); each "O" symbol represents a concordant result. (b) Panel of plots for all ten VNT positive sera arranged in the same fashion as above.


Table 1
Discordant results, where ELISA and VNT dichotomous results disagree, noting the lab, batch and serum number, assay S/P, mean S/P for identical sera and apparent reason for the discrepancy.

Case | Lab/serum/batch | VNT | ELISA | S/P for case (mean S/P for identical serum) | Apparent reason
1 | A/02/08 | +ve | −ve | 6.6 (158.9) | Major underestimation of S/P (a)
2 | A/03/08 | −ve | +ve | 160.6 (9.2) | Major overestimation of S/P (a)
3 | A/14/01 | +ve | −ve | 39.0 (104.4) | Underestimation of S/P
4 | A/18/02 | +ve | −ve | 40.1 (72.5) | Underestimation of S/P
5 | A/18/04 | +ve | −ve | 41.1 (72.5) | Underestimation of S/P
6 | A/18/10 | +ve | −ve | 11.7 (72.5) | Underestimation of S/P
7 | D/02/01 | +ve | −ve | 164.7 (158.9) | Error in interpretation or transcription of dichotomous result
8 | D/08/01 | −ve | +ve | 72.7 (9.6) | Moderate overestimation of S/P
9 | D/17/03 | −ve | +ve | 3.5 (6.6) | Error in interpretation or transcription of dichotomous result

(a) Cases 1 and 2 are from the same batch in the same lab. They apparently arise from a single incident: either the serum samples for A/02/08 and A/03/08 were confused with each other or the results were switched at data entry.

2.6. Between laboratory agreement under cross sectional sampling

Agreement of an assay could be assessed using sera obtained at random from the population of animals without regard to prior knowledge of test or disease status (unlike this study). At the commencement the prevalence of test positives in the sample will usually be unknown. Consequently, the factor "sera" would then contain VNT status as an additional source of variation not present in earlier analyses grouped by VNT status. The additional variance will affect the estimates of reliability. The impact of cross sectional sampling on the estimate of between laboratory agreement for a serum ELISA was thus assessed by simulation. This was performed by creating simulated data sets from a design identical to the earlier analysis for comparison. Each simulated data set contained a mixture of VNT +ve and VNT −ve samples. The number of VNT +ve samples in the data sets was varied to identify how sample sero-prevalence affects the estimates of reliability.

The simulation proceeds by generating a data set by randomly sampling from normal distributions with parameters equated to those of the earlier estimated variance components and combining these according to the measurement model (Eq. (3)). The sera in the simulated data set are made up of p positive sera and 10 − p negative sera, and the sero-prevalence made to vary from 10% to 90% in steps of 10%. Each data set is subject to variance analysis for calculation of a reliability coefficient (between laboratory agreement) and the process repeated 1000 times for each value of p. Simulation code was written in the Stata programming language (version 11.1).

3. Results

3.1. Discordant results

Discordant results are where the ELISA and VNT dichotomous results disagree. Table 1 summarises the nine discordant results from the 1000 ELISA assays performed. Three of the nine disagreements are due to small errors in estimation of the S/P for sera with a mean seropositivity near the cut-point value for interpretation (Cases 3–5). Cases 1, 2, 6 and 8 involve more substantial measurement error, possibly due to mistakes in transcription of results. Indeed, transcription errors or inadvertent switching of samples are a likely cause of Cases 1 and 2, since these measurements are from the same batch in the same lab and the S/P results are consistent with this explanation. Probable transcription or interpretation errors were identified in Case 7 and Case 9, where ELISA S/P readings appear precise but the allocated dichotomous result was inappropriate.
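The data-generating step of the simulation described in Section 2.6 can be sketched as follows. This is our reading, not the authors' Stata code: how the VNT +ve and VNT −ve component sets are combined within one mixed sample is not fully specified, so here only the serum effect is class-specific while the remaining components use the VNT +ve set, and the per-iteration variance analysis (xtmixed) is omitted.

```python
import random

# Variance components (sera, lab, batch within lab, sera x lab, error);
# rounded values from Table 3.
VC_POS = {"sera": 10282.9, "lab": 8383.8, "batch": 1731.7,
          "inter": 2491.9, "error": 1278.4}
VC_NEG = {"sera": 0.66, "lab": 3.13, "batch": 9.32,
          "inter": 0.73, "error": 65.56}

def simulate_sp(n_pos, n_sera=10, n_labs=5, n_batches=10, seed=0):
    """One simulated data set under measurement model B (Eq. (3)):
    X_ijk = tau_i + alpha_k + gamma_jk + omega_ik + e_ijk.
    The first n_pos sera use the VNT +ve serum variance; the rest
    use the VNT -ve serum variance."""
    rng = random.Random(seed)
    pos = [i < n_pos for i in range(n_sera)]
    # draw each random effect once per level of its factor
    tau = [rng.gauss(0, (VC_POS if p else VC_NEG)["sera"] ** 0.5) for p in pos]
    alpha = [rng.gauss(0, VC_POS["lab"] ** 0.5) for _ in range(n_labs)]
    gamma = [[rng.gauss(0, VC_POS["batch"] ** 0.5) for _ in range(n_batches)]
             for _ in range(n_labs)]
    omega = [[rng.gauss(0, VC_POS["inter"] ** 0.5) for _ in range(n_labs)]
             for _ in range(n_sera)]
    rows = []
    for i in range(n_sera):
        for k in range(n_labs):
            for j in range(n_batches):
                e = rng.gauss(0, VC_POS["error"] ** 0.5)
                rows.append((i, k, j,
                             tau[i] + alpha[k] + gamma[k][j] + omega[i][k] + e))
    return rows

# one data set at 50% sample sero-prevalence: 10 sera x 5 labs x 10 batches
data = simulate_sp(n_pos=5)
```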

3.2. Descriptive results

The S/P data for VNT negative sera are shown in Fig. 1a. Most of the variation is due to differences within sera (random noise) and differences between labs rather than the differences between sera. In general, the variation in S/P for VNT negative sera is small with a low chance of results exceeding the interpretation cut-point, although the amount of variation differs noticeably between labs. Note that the scale of the y-axis is governed by the largest observation (Serum 3, Lab A) and that removing this data point would make the variation in S/P for VNT negative sera more obvious.

The S/P data for VNT +ve sera are shown in Fig. 1b. Compared to VNT −ve sera, variation between VNT +ve sera is greater, the overall amount of variation is greater and the variation within labs is much greater. The extent of variation within each lab appears consistent across all sera. Labs with greater variation are at risk of a discordant result (false negative test) for VNT +ve sera with a low mean S/P (Lab A in this data).

A plot of CV for raw S/P data grouped by sera and lab (Fig. 2) illustrates that this measure of agreement is not a stable characteristic of an assay but is affected by both the laboratory and especially the mean value of the sera being assessed. Sera with lower mean S/P values unexpectedly had higher CV values and the ranking of labs by CV for each serum was similar across all sera.

3.3. Estimates of agreement

Reliability coefficients for S/P measurements estimatedfrom the output of linear random-effects models are givenin Table 2. Since the sampling of sera was not random across

both classes of VNT result, but were stratified-random byVNT result, the calculations were performed grouped byVNT result. In general, measurements in S/P made on VNT−ve sera have much lower between laboratory agreement
Page 6: Describing the within laboratory and between laboratory agreement of a serum ELISA in a national laboratory network

D. Jordan et al. / Preventive Veterinary Medicine 104 (2012) 240– 248 245

Table 2Estimates of the reliability coefficient (�) for S/P derived from linear random-effects models arranged by class of virus neutralisation test (VNT) result.Confidence intervals (95%) are given where possible.

Reliability coefficient Lab Intraclass correlation (�) (95% confidence interval)

VNT −ve VNT +ve

Within laboratory agreement

A 0.01 (NA) 0.75 (0.55–0.95)B 0.20 (0–0.41) 0.63 (0.36–0.90)C 0.27 (0.04–0.50) 0.74 (0.53–0.96)D 0.04 (0–0.15) 0.92 (0.86–0.99)E

Between laboratory agreement All

Fig. 2. Relationship between the coefficient of variation for each serumfor laboratories A, B, C, D and E (top panel) and the mean S/P reactionstrength (log 10 scale) for individual sera (bottom panel).

(� = 0.02) compared to VNT +ve sera (� = 0.52) with thelatter providing high estimates of agreement within lab-

oratories. When major outliers are deleted from the dataset for VNT −ve sera (Cases 2 and 8 in Table 1) the betweenlaboratory agreement for VNT −ve sera improves to 0.11.Random-effects model variance estimates for the data on

Table 3Estimates of variance components used to calculate reliability coefficients used asVNT −ve sera and VNT +ve sera.

Source VNT −ve model

Variance (95% CI) % of model variance

Sera 0.66 (0.03–15.27) 0.8%

Lab 3.13 (0.37–26.47) 3.9%

Lab × Batch 9.32 (4.58–18.97) 11.7%

Sera × Lab 0.73 (0.01–86.07) 0.9%

Error 65.56 (57.13–75.24) 82.6%

0.17 (0–0.36) 0.85 (0.7–0.99)

0.02 (0–0.06) 0.52 (0.20–0.86)

S/P are shown in Table 3 by class of VNT result. These arethe variance estimates from the full data set used to cal-culate the reliability coefficients shown in the bottom ofTable 2 for between laboratory agreement. They indicatethe breakdown of the source of variation in the data andpoint to the importance of batch to batch variation withina laboratory (Lab × Batch term or “batch within lab”) whenthe assay is applied to VNT −ve sera (Table 3, compare theamount of variance due to Lab × Batch versus other fac-tors). A feature of the variance breakdown for VNT −vesera is the comparatively large amount attributed to theerror term, indicating the low signal to noise ratio in assayresults when antibody is absent. This is the main reason forlow estimates of between laboratory agreement in S/P forVNT −ve sera. The amount of variation attributed to labs ismuch greater for VNT +ve sera, both in absolute terms andas a proportion of total variation. Variation attributable tobatches is greater in absolute terms for VNT +ve sera butis about the same proportion of total variation as for VNT−ve sera.

Plots of residuals versus fitted values for the randomeffects models did reveal some heterogeneous variance inthe form of heavy tails. These result from greater varianceassociated with higher mean S/P values which is discern-able in Fig. 1a and b. Transformations of the S/P data werenot relied on because they did not greatly improve plots ofresiduals and after transformation variance estimates areon a scale that is difficult to interpret.

Table 3 (caption fragment: "... a measure of between laboratory agreement of S/P measurement for the VNT +ve model").

Variance (95% CI)          % of model variance
10282.9 (3893.0–27161.0)   42.5
8383.8 (1951.0–36025.9)    35.0
1731.7 (1111.0–2699.1)     7.16
2491.9 (1533.2–4050.2)     10.3
1278.4 (1113.9–1467.2)     5.3

Fig. 3. Estimates of between laboratory agreement (using the reliability coefficient) obtained when cross-sectional sampling is evaluated using simulation. The simulated prevalence of VNT +ve sera in a sample of size n = 10 sera ranges from 10% to 90% in steps of 10% (y axis). At each prevalence level box plots show the distribution of reliability estimates from 1000 iterations of data sampling and variance estimation (x axis).

The simulated demonstration of the effect of cross-sectional sampling on estimates of between laboratory agreement is shown in Fig. 3. In this output the heavy tails in the data from VNT +ve sera are removed by the process of simulation, which ensures data sets conform to the model assumptions (a process we judged acceptable for the purpose of demonstration). When the prevalence is at the low end of that evaluated (10%) the between laboratory agreement is very high, although variable due to the level of sampling (10 sera per evaluation). As the prevalence increases the agreement between laboratories decreases towards the empirical value obtained for VNT +ve sera (0.52, Table 2). Variation in agreement at each level of simulated prevalence reflects the effect of the particular combinations of sera included in each evaluation (model iteration). Overall, it is very clear that the prevalence of infection or seropositivity in the population and sample can have a large bearing on reliability estimates.
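The mechanics of this demonstration can be caricatured in a few lines. The sketch below is a simplification, not the paper's Stata model: it uses a single random effect per serum, arbitrary Normal settings for negative and positive sera, and a one-way ANOVA method-of-moments estimator in place of the full random effects model, but it shows how the estimated reliability depends on the composition of the sampled sera:

```python
import random
import statistics

def simulate_reliability(prev, n_sera=10, n_meas=50, seed=0):
    """Sample n_sera sera at sero-prevalence `prev`, simulate repeated S/P
    measurements per serum from a one-way random-effects model, and estimate
    the reliability coefficient (ICC) by one-way ANOVA method of moments.
    All distributional settings here are illustrative assumptions."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n_sera):
        # Serum-level mean S/P: high for a VNT +ve serum, near zero for VNT -ve
        mu = rng.gauss(300, 100) if rng.random() < prev else rng.gauss(20, 5)
        groups.append([rng.gauss(mu, 40) for _ in range(n_meas)])
    grand = statistics.mean(x for g in groups for x in g)
    ms_between = n_meas * sum((statistics.mean(g) - grand) ** 2
                              for g in groups) / (n_sera - 1)
    ms_within = sum((x - statistics.mean(g)) ** 2
                    for g in groups for x in g) / (n_sera * (n_meas - 1))
    return (ms_between - ms_within) / (ms_between + (n_meas - 1) * ms_within)

# An all-negative sample yields low reliability; a mixed or all-positive
# sample yields much higher reliability from the very same assay noise.
print(simulate_reliability(0.0), simulate_reliability(1.0))
```

Repeating the estimation over many simulated samples at each prevalence level reproduces the kind of box plot summarised in Fig. 3.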

4. Discussion

The performance of diagnostic assays concerns a broad range of animal-health professionals who might not all be familiar with the analytical techniques relied on in parts of this study. Thus, graphical evidence that clearly displays the sources of variation in the measured outcome is essential to complement any model-based measures of agreement such as the reliability coefficient. This work relied on arranging the plots of S/P data in panels with one panel for each serum. Within each panel laboratories were sequenced according to their effect on the data. This provides a simple tool for comparison of laboratories and sera, strengthens the interpretation of the reliability coefficients, and delivers some evidence of the fit of data to the random effects models used to generate reliability coefficients. For example, Fig. 1a and b elucidate the relationship between signal and noise in the S/P data, pointing to which particular sera and laboratories are at risk of misclassification errors. The same figures also highlight the magnitude of within-laboratory variation in S/P, underlining the need for work that identifies the causes of excessive variation. Plots of CV are informative and in this study unexpectedly revealed that those sera with a higher mean value of S/P in fact had lower values of CV across all laboratories. Plots of CV also highlight the difficulty of prescribing a particular set value of CV as a broadly applicable performance criterion; this is especially true for sera with a low S/P value (VNT −ve), where the values of CV were shown to vary more widely (Fig. 2).
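The instability of CV for low-reacting sera follows directly from its definition (standard deviation as a percentage of the mean). A small illustration with hypothetical replicate S/P readings, not data from this study:

```python
import statistics

def cv_percent(values):
    """Coefficient of variation: sample SD as a percentage of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Similar absolute noise produces a far larger CV when the mean is near zero.
vnt_pos = [300, 310, 290, 305, 295]  # high-reacting (VNT +ve-like) serum
vnt_neg = [10, 20, 1, 15, 4]         # low-reacting (VNT -ve-like) serum
print(round(cv_percent(vnt_pos), 1), round(cv_percent(vnt_neg), 1))  # 2.6 77.8
```

Because the denominator approaches zero for sero-negative samples, a single CV threshold cannot serve both classes of sera.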


Although graphical tools are attractive for ease of communication of raw S/P data and CV, they do rely on subjective interpretation. Minimising subjective assessment is one of the principal motivations for using the reliability coefficient since it provides a single, summary measure of agreement. However, unlike diagnostic sensitivity and specificity, the reliability coefficient does not have a simple, intuitive interpretation from the perspective of disease control or diagnosis. The main reason for this is that the reliability coefficient is a characteristic of both the assay and the population being evaluated. For example, when prevalence changes or subjects (sera) are caused to vary in some way that affects the assay outcome, then the reliability coefficient will also be affected. Thus, high levels of agreement can be arbitrarily achieved by sourcing from a more heterogeneous population. These are key points in understanding the practical limitations of the reliability coefficient and are exemplified by Fig. 3, where simulated sero-prevalence was shown to have a dramatic effect on the estimate of reliability. Nevertheless, "standard values" have been offered for interpretation of reliability coefficients: 0–0.40 (poor), 0.41–0.75 (good), 0.76–1.0 (excellent) (Shoukri and Pause, 1999). These are useful as general guidelines; however, the experience from human medicine is that differing sets of standards could be used according to the measurement system and the practical impact of errors (Streiner and Norman, 1995). For this reason, additional studies based on other assays are required so that interpretive standards specific to the context of veterinary laboratories and animal disease can be developed.
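The dependence of the reliability coefficient on the population can be made concrete with a little mixture arithmetic: for a two-class population the between-serum variance grows with the spread between the class means, so the same assay noise yields very different coefficients at different sero-prevalences. The numbers below are illustrative assumptions, not estimates from this study:

```python
def mixture_reliability(p, mu_neg=20.0, mu_pos=300.0, var_neg=25.0,
                        var_pos=10000.0, var_error=1600.0):
    """Reliability (ICC) of a two-class (VNT -ve / VNT +ve) serum population
    as a function of sero-prevalence p. Between-serum variance of a mixture:
    p*(1-p)*(mu_pos - mu_neg)**2 + p*var_pos + (1-p)*var_neg."""
    var_sera = (p * (1 - p) * (mu_pos - mu_neg) ** 2
                + p * var_pos + (1 - p) * var_neg)
    return var_sera / (var_sera + var_error)

for p in (0.0, 0.1, 0.5, 1.0):
    print(p, round(mixture_reliability(p), 2))
```

The coefficient is lowest when all sera are negative and highest for a mixed population even though the assay itself is unchanged, which is why comparing reliability coefficients across populations is unwise.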

The reliability coefficient has some virtues that must be balanced against the above limitations. Apart from offering a single summary measure of agreement, the analysis of data provides information on the sources of error when an assay is performed in a population of laboratories. The estimates of reliability can be generalized since inferences about assay performance from the reliability coefficient apply to the population of laboratories, not just those involved in the study. As an example from this study, consider the between laboratory agreement of the S/P for VNT +ve sera, which was estimated to be 0.52 (Table 2). The result indicates that in the population of laboratories only 52% of the variation in S/P is due to the sera themselves. Other sources of variation are "between labs" (35%), "batches within labs" (7%), and residual error (5%) (Table 3). Knowledge of the extent of laboratory-to-laboratory variation is useful because it guides the need for further standardisation of how assays are performed so as to avoid conflicts when animals are tested twice but at different laboratories.

From the 1000 individual assays performed in this study, nine resulted in misclassification when the VNT test was used as the basis of comparison. Usually, laboratories would have applied a policy of re-testing any serum with an S/P close to the cut-point and, had this been applied in this work, three of these misclassifications in the dichotomous result would probably have been eliminated (Table 1, Cases 3–5). Two additional misclassifications were apparently due to interpretation or transcription error (Table 1, Cases 7 and 9) and might also have been eliminated under routine laboratory practices. The remaining four cases of misclassification would have little chance of being rectified under normal conditions (Table 1, Cases 1, 2, 6 and 8). This low rate of intractable errors (four tests per 1000, or 0.4%) provides good evidence of the integrity of processes and practices in the laboratory network. However, a basis for comparison of this error rate is lacking, and the current finding of 0.4% intractable error may be useful as a benchmark for future work on assay performance in a laboratory network.

Some aspects of this study potentially impact on interpretations. The sampling of sera was purposefully not performed in a cross-sectional manner (i.e., simple-random selection where each individual serum in the collection had an equal probability of selection). Instead, samples were stratified according to VNT +ve and VNT −ve status to allow scrutiny of both these classes of sera. This decision was made because it was felt that the prevalence of VNT +ve sera in the collection was likely low and that the extra information on both classes would give much greater insight into assay performance. This also permitted the simulation component that evaluated cross-sectional sampling under different levels of seropositivity. It was also necessary to combine and blend some individual sera with a similar ELISA reaction to ensure a sufficient number of VNT +ve sera were available for the study. Although serum from individual animals is preferred, blending becomes necessary when sero-prevalence is low. An over-arching caveat on all aspects of assessment of agreement of S/P is that while some laboratories or assays may not perform well, this does not necessarily translate to misclassification of the dichotomous result. Assessment of misclassification requires evaluation of diagnostic sensitivity and diagnostic specificity but this would be costly to perform on a similar scale (national) as this work. This work also did not seek to explain the causes of variation in assay results beyond that explained by the design factors (laboratories, batches within laboratories, sera and interactions). Examples of factors not studied include variation in assay kits from different production batches and variation in technician performance within laboratory. Inclusion of additional factors in a study of this type would exponentially increase the resources required and these are best addressed as separate research questions. As well, there is scope to further understand the performance of assays by studying the variation observed in the OD of quality assurance sera. However, since often only two such sera are present in ELISA kits this would not be a suitable basis for generating reliability coefficients.

This work demonstrates that estimation of agreement of a serological test for animal disease on a multi-laboratory basis can be attempted using a number of methods. Descriptive (graphical) interpretation of raw data and coefficient of variation are very informative, although the results could not be used to prescribe a single value of CV as a standard to be met as part of a validation process. Quantitative estimates of reliability based on the ICC hold some promise but in some instances could be affected by an inability to obtain a good fit of data to the underlying statistical model. Reliability estimates for sero-negative samples are very low but this is, paradoxically, likely to be an indicator of good assay implementation. When sampling of sera is cross-sectional the reliability coefficient so obtained is strongly influenced by the sero-prevalence in the sample, which in turn is derived from the population sero-prevalence. Thus, comparison of reliability coefficients obtained from different populations is unwise and prescriptive standards of interpretation cannot as yet be justified. Nevertheless, some of these issues could be addressed by using a well constructed (epidemiologically relevant) and standardised panel of sera to control for the influence of sero-prevalence and differences between populations.

Acknowledgements

Funding for this work was generously provided by the Australian Biosecurity Collaborative Research Centre (ABCRC) for Emerging Infectious Disease. We wish to thank Dr. Deb Cousins of ABCRC for expert guidance on laboratory aspects and Dr. Neil Coombes of NSW Department of Primary Industries for helpful critique of the statistical approach. The five collaborating veterinary laboratories are thanked for performing the ELISA assays.

References

Collins, M.T., Angulo, A., Buergelt, C.D., Hennager, S.G., Hietala, S.K., Jacobson, R.H., Whipple, D.L., Whitlock, R.H., 1993. Reproducibility of a commercial enzyme-linked immunosorbent assay for bovine paratuberculosis among 8 laboratories. J. Vet. Diagn. Invest. 5, 52–55.

Dargatz, D.A., Byrum, B.A., Collins, M.T., Goyal, S.M., Hietala, S.K., Jacobson, R.H., Kopral, C.A., Martin, B.M., McCluskey, B.J., Tewari, D., 2004. A multilaboratory evaluation of a commercial enzyme-linked immunosorbent assay test for the detection of antibodies against Mycobacterium avium subsp. paratuberculosis in cattle. J. Vet. Diagn. Invest. 16, 509–514.

Gu, X., Kirkland, P.D., 2007. Infectious bovine rhinotracheitis. In: Australian and New Zealand Standard Diagnostic Procedures for Animal Diseases. Sub-Committee on Animal Health Laboratory Standards for Animal Health Committee, Canberra, Australia, http://www.scahls.org.au/__data/assets/pdf_file/0003/1280874/ibr.pdf (accessed 19.08.11).

Marchenko, Y., 2006. Estimating variance components in Stata. Stata J. 6, 1–21.


Nielsen, S., 2002. Variance components of an enzyme-linked immunosorbent assay for detection of IgG antibodies in milk samples to Mycobacterium avium subspecies paratuberculosis in dairy cattle. J. Vet. Med. Ser. B 49, 384–387.

OIE, 2009. Standard Operating Procedure (SOP) for OIE Validation and Certification of Diagnostic Assays, p. 26.

Rabe-Hesketh, S., Skrondal, A., 2008. Multilevel and Longitudinal Modeling Using Stata. Stata Press, College Station, Texas.

SCAHLS, 2007. SCAHLS validation template for serological assays. In: Sub-Committee for Animal Health Laboratory Standards for Animal Health Committee, Canberra, http://www.scahls.org.au/newtests/serology-template.doc (accessed 19.08.11).

Shoukri, M.M., Pause, C.A., 1999. Statistical Methods for Health Sciences. CRC Press, Boca Raton.

Streiner, D.L., Norman, G.R., 1995. Health Measurement Scales. Oxford University Press, Oxford.