

Robust performance comparison: Some Ratio-Based Analyses of Scottish

Health Boards

Abstract

Aims: Summary measures of healthcare performance typically embed contentious assumptions: for

example, although many comparative quality measures are constructed as ratios, the choice of

denominator is often ambiguous. Numerators are often composite indicators, yet it is unclear how

constituent elements should be weighted to obtain a single number. We aim to explore healthcare

applications of an approach to ranking which is robust to alternative modelling assumptions of

this sort.

Methods: We use a ratio-based efficiency (REA) technique (Salo and Punkka, 2011) for healthcare

quality comparisons of Scottish Health Boards. The REA approach permits pairwise comparisons of ratios with multiple numerators and denominators under all feasible

weights. As a quality measure, we examine the number of adverse events patients experience

before, during and after a hospital stay relative to hospitalised and general populations in Scottish

Health Boards.

Results: The results show possible rankings of Scottish Health Boards (displayed in a unified way as

a dominance relation). This allows one to identify Boards which cannot be ranked, say, worse than 4th

or better than 7th. Such rankings give policy makers a sense of the uncertainty around ranks, of where to seek further information, and of the extent to which action is warranted.

Conclusions: The REA approach enables a new and useful perspective on quality comparisons

between organisations. Rankings are often sensitive to assumptions and the REA approach identifies

which ranking judgements one can have total confidence in.

Key words: performance comparisons; ratio-based analysis; ranking processes; composite indicators;

Scotland

PhD students

Yrjänä Hynninen Department of Mathematics and Systems Analysis Systems Analysis Laboratory Aalto University School of Science Aalto | Finland [email protected] Laura Schang Department of Management London School of Economics and Political Science London | UK [email protected]

Supervisors

Professor Alec Morton Department of Management Science Strathclyde Business School University of Strathclyde Glasgow | UK [email protected] Professor Ahti Salo Department of Mathematics and Systems Analysis Systems Analysis Laboratory Aalto University School of Science Aalto | Finland [email protected]

Paper number: A040


1. Introduction

Internationally, concerns about rising healthcare costs and evidence of variations in the quality and

outcomes of health services in many countries have highlighted the need for the production of

summary measures of health system performance. In a context where information on healthcare

performance becomes increasingly available and there is high demand for the public release of

performance data, summarizing the information contained in diverse indicators in a single composite

measure and ranking healthcare organizations on that basis seems attractive for several reasons. By

presenting the “big picture“, summary measures can highlight in a unified way to what extent the

objectives of health systems to promote, restore and maintain health are met (WHO 2000; Smith

2002). As such, summary measures may seem a natural approach to compare similar healthcare

organizations, potentially strengthening accountability and providing policy-makers and managers

with a focus for improvement (Goddard and Jacobs 2009).

However, a common challenge is the ambiguity inherent in measures of healthcare performance

generally and summary measures in particular. For example, it is often unclear how constituent

elements should be weighted to obtain a single number. Over the past years, various studies have

found rankings based on composite indicators to be highly sensitive to specific methodological

choices (see e.g. Jacobs, Goddard, and Smith 2005; Reeves et al. 2007; Gravelle et al. 2003; OECD

2008). In their analysis of hospital performance based on star ratings in the English NHS, Jacobs et al.

(2005) showed how subtle changes in the weighting system led some hospitals to jump across almost half of

the league table. A re-analysis of the WHO’s 2000 performance assessment of health systems

(Gravelle et al. 2003) demonstrated dramatic changes in the top and bottom rankees depending on

the statistical specification used. In addition, although many comparative quality measures are

constructed as ratios, the choice of denominator is often ambiguous.

These concerns are not of purely academic interest. Frequently, the production of rankings has

serious consequences for the rankees. For example, six of the Chief Executives of the twelve lowest

ranked hospital trusts in England’s star rating system (the so-called “dirty dozen“) lost their jobs as a

result (Bevan and Hamblin 2009). It has been argued that France and Spain’s apparently high ranking

in the WHO’s 2000 assessment of healthcare systems substantially diminished policy pressure for

reform in these countries (Navarro 2000). In Medicare’s Premier Hospital Quality Incentive

Demonstration, a pay-for-performance scheme based on a composite score of hospital quality,

hospitals scoring below the ninth decile faced a 2% deduction in their Medicare payment (CMS

2009). With such high stakes, understanding whether conclusions drawn are robust to plausible

alternative assumptions seems critical.


In this paper we aim to define a ranking process which is robust to plausible numerator weight sets

and reasonable choices of denominator. Specifically, we adopt a novel ratio-based efficiency

measurement approach (REA; based on Salo and Punkka (2011)) for comparisons of healthcare

quality and performance. The next sections provide a framework of key sources of ambiguity in

healthcare performance measurement and set out the study’s methods. We then illustrate the

usefulness of the REA approach with an application to Health Boards in Scotland, by investigating

their comparative performance on quality indicators. We close with possible implications for policy

and highlight the need for further methodological development of robust measures of healthcare

performance.

2. Ambiguity in healthcare performance measurement

Measuring the performance of a healthcare system requires some concept of the benefit: the good

that is produced by a healthcare system. Since health status and outcomes are a function of activities

of the health system and exogenous factors (e.g. lifestyle, case-mix/disease severity, demographic

and socio-economic factors), measures of performance of the health system should ideally indicate

the ‘value-added’ (Goldstein and Spiegelhalter 1996) i.e. the health gain or incremental healthcare

outcome conferred to patients that is attributable to the workings of the health system. In practice,

operationalising this concept is however difficult due to limited information on the counterfactual

i.e. the health status in the absence of the health intervention (Jacobs et al. 2006). As a result, which

factors are considered genuinely under the control of the health service is often uncertain.

Due to practical constraints (e.g. time lags before health outcomes are realised, costs of collecting

patient-reported and other outcome data), many studies consider activities rather than health

outcomes (Hussey et al. 2009). This may offer useful information about organisational performance if

a clear link between activities and health gain exists (Jacobs, Smith, and Street 2006). However, the

effectiveness and appropriateness of many interventions depends on patient characteristics (e.g. an

appendectomy can be life-saving but also inappropriate and a waste of resources if conducted on the

wrong patient). A procedure that was not necessary in the first place is inefficient no matter how

many resources were used to provide it (Binder et al. 2009). Generally the quantity of interventions

provided tells very little about the ‘quality’ of the health service: in view of evidence of considerable

variation in healthcare utilisation (for a recent review see Corallo et al. 2014), one cannot be sure

that all outputs are appropriate and provided with good quality: organisations with more activities


could be overproviding unnecessary care while organisations with few activities could be

underproviding necessary care.

Whether outcomes or activities are chosen, healthcare outputs are heterogeneous and

multidimensional. However, without a functioning market, there is no price mechanism for

comparison. To aggregate different healthcare outputs into a summary measure of performance,

weights are required which – analogous to prices – represent the relative importance of different

outputs and the opportunity cost of achieving them (Smith 2002). In practice, arriving at exact

specifications of weights is often contentious. First, it is often unclear whose preferences should be

elicited. Weights used often reflect a single set of preferences, although the evidence suggests

substantial heterogeneity in preferences between and within groups of policy-makers, patients and

the wider public (Smith 2002). Making precise judgements about the relative value of sub-indicators

to the composite may be both politically controversial and cognitively demanding, thus triggering

reluctance among respondents to settle on an exact set of weights. Second, there is no consensus on

a single best method for eliciting weights. Different techniques for valuing health(care) outcomes –

from simpler trade-off methods including ranking from most to least desired indicator and voting

techniques to more elaborate multiattribute approaches such as conjoint analysis and the analytic

hierarchy process – tend to produce different results and each method has distinct advantages and

disadvantages in terms of feasibility, consistency and validity (Dolan 1997; OECD 2008; Appleby and

Mulligan 2000).

Third, the choice of weights (and method of estimation) also reflects – often implicit – assumptions

about what these weights are supposed to represent. For instance, data envelopment analysis (DEA)

uses empirically derived, flexible weighting systems, sometimes called “benefit of the doubt“

approach, since weights vary across organisations and performance domains in a way so as to ensure

that each organisation is evaluated on weights which are most favourable to it. The core assumption

is that relatively good performance on particular indicators reflects policy importance in that country

or organisation and hence is assumed to merit a higher weight. As a result, however, the final

composite score may be heavily influenced by an indicator that is considered of rather marginal

importance in the wider health system context (Goddard and Jacobs 2009).

Healthcare goods have to be divided by some measure of input (either cost or population) to make

entities of different sizes comparable and to establish a common currency unit in which performance

is assessed as ‘good’ or ‘poor’ relative to other organizations. To construct ratio-based quality

measures, the denominator should ideally represent the best available proxy for the population at


risk (Romano, Hussey, and Ritley 2010). But this principle raises more questions than it answers. For

example, events like healthcare-associated infections could reasonably be measured relative to the

total population (since a system which keeps people out of hospital thereby averts the risk of

acquiring an infection there) but also relative to bed days (thus indicating the extent to which sick people in

hospital are not made worse).

Assumptions made about the denominator are important for two reasons. First, different

denominators highlight different and potentially insightful aspects of performance (Guillen et al.

2011). Second, the choice of denominator implies particular beliefs about attribution and causality

which may influence observed variations in measured performance. Consider two health authorities

A and B with the same number of healthcare associated infections (HAI) but a lower number of bed

days in authority A. On a simple ratio measure of HAI per 1,000 bed days, authority B would seem to

score superior, but this conclusion would be warranted only if all HAI were caused during hospital

stays and there were hence no groups at risk of HAI other than hospitalised populations. However, if

the numerator also included community-acquired infections, then a narrowly defined denominator

such as hospital bed days would underestimate the actual number of exposed individuals. A

comprehensive denominator, such as total resident population, in contrast, would overestimate the

population at risk by including individuals facing no or a negligible risk of experiencing the event

(Marlow 1995).

One option to address this would be to take great care in specifying a numerator that is

unambiguously linked to one single meaningful denominator (McKibben et al. 2005); for example, by

excluding community-acquired infections that are present on admission to hospital from the

numerator. Nevertheless, since there will always be some uncertainty about the correct population

at risk, it makes sense to consider different denominators rather than restricting oneself to a single

denominator, in order to enable a more complete perspective on the outcome of interest (Guillen et

al. 2011). To do this one could produce multiple ratios between all reasonable numerator and

denominator combinations. However, the manual comparison of multiple performance ratios quickly

becomes unwieldy. In a situation with, say, three numerators and four denominators, one would get

12 performance ratios for each entity under scrutiny. Juxtaposing this array of ratios may not offer

intuitive results nor does it recognize that some denominators may be more important than others in

influencing the outcome of interest.

In summary, there are four key sources of ambiguity which make robust performance comparison in


healthcare difficult:1

- Are outcomes actually under the control of health services?

- Are activities undertaken appropriately?

- Are multidimensional or heterogeneous goods weighted appropriately?

- Is the denominator correctly identified?

The rest of this paper focuses on an application of the REA approach to address ambiguity around

numerator weight sets and appropriate choices of denominators. Further work into robust

performance comparisons of treatment appropriateness and health outcomes is ongoing.

3. Methods: the REA approach

In order to develop a robust healthcare performance comparison, we adapt a ratio-based efficiency

(REA) technique (Salo and Punkka, 2011). Assume that there are K Decision-Making Units (DMUs –

the entities to be evaluated) that have N different measures for the numerator of a ratio and M

measures for the denominator of a ratio. The values of the nth numerator and the mth denominator

of the kth DMU are $y_k^n$ and $x_k^m$, respectively. Thus, the possible performance ratios of the DMU $k$ are $y_k^n / x_k^m$, where $n \in \{1, \dots, N\}$ and $m \in \{1, \dots, M\}$.

To address the challenge of multiple performance ratios, REA enables the aggregation of different

numerators and denominators in a summary measure of performance. The relative importance of

the nth numerator and the mth denominator is captured by nonnegative weights $u_n$ and $v_m$, respectively. The aggregated performance ratio of DMU $k$ is thus defined as

$$E_k(u, v) = \frac{\sum_{n=1}^{N} u_n y_k^n}{\sum_{m=1}^{M} v_m x_k^m}. \qquad (1)$$

The REA method is innovative insofar as the aggregation of the numerators and the denominators is

achieved without fixing the weights of constituent indicators. Thus, depending on the weights, the

ratios can vary substantially and they can represent various situations with some particular aspects

of performance heavily stressed or all aspects equally considered. By comparing the performance

ratios between DMUs with all feasible weights, one can produce robust information about the

performance of DMUs.
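As a minimal illustration of the aggregated performance ratio described above, the following sketch computes a weighted sum of numerators over a weighted sum of denominators for a single DMU. The function name and the sample data are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def performance_ratio(y, x, u, v):
    """Aggregated performance ratio of one DMU: a weighted sum of its
    numerator values y divided by a weighted sum of its denominator
    values x, for given nonnegative weight vectors u and v."""
    y, x, u, v = (np.asarray(a, float) for a in (y, x, u, v))
    return float(u @ y) / float(v @ x)
```

Varying u and v over all feasible weights, as REA does, traces out the full range of values this ratio can take for a given DMU.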

1 Of course in any performance measurement exercise there is also uncertainty around data quality and

missing values. These problems are not unique to a healthcare context and our paper does not add insights to this debate. Our focus here is on uncertainty that characterises performance measurement in healthcare.


The weights of different performance measures can either be allowed to have all nonnegative values

or the decision maker can express preference statements about the value of different measures of

performance. For example, if the decision maker states that a unit of numerator 1 is at least as valuable as a unit of numerator 2 but not more than four times as valuable, then the constraints $u_2 \le u_1 \le 4 u_2$ must hold.

To examine the pairwise relations between DMUs, the REA uses the concept of dominance: DMU $k$ dominates DMU $l$ if the performance ratio of DMU $k$ is at least as high as that of DMU $l$ for all feasible weights, and there exist some weights for which its performance ratio is strictly higher. If the

dominance relation exists between two DMUs, the decision maker can have total confidence in the

result, in a sense that regardless of the stressed aspect of performance, the one DMU performs

better than the other. In other words, the dominance relation between DMUs $k$ and $l$ can be determined based on the pairwise performance ratio (the ratio of the two DMUs' aggregated performance ratios)

$$D_{kl}(u, v) = \frac{E_k(u, v)}{E_l(u, v)}. \qquad (2)$$

The maximum and the minimum of $D_{kl}$ over all feasible weights provide upper and lower bounds on how well DMU $k$ can perform relative to DMU $l$. For example, if the maximum of $D_{kl}$ is 1.28, then the performance ratio of DMU $k$ can be at most 28% greater than that of DMU $l$. Conversely, if the minimum of $D_{kl}$ is 1.06, the performance ratio of DMU $k$ is at least 6% higher than that of DMU $l$. Thus, if the minimum of $D_{kl}$ is greater than one, DMU $k$ dominates DMU $l$. The dominance structure can be computed efficiently with linear programming.
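One way these bounds can be computed: when the numerator weights u and the denominator weights v are not linked by joint constraints, the pairwise ratio D_kl factors into (u·y_k / u·y_l) · (v·x_l / v·x_k), and each factor is a linear-fractional programme that the Charnes-Cooper normalisation turns into an ordinary LP. The sketch below illustrates this idea with SciPy; it is not the authors' software, and the function names, factorisation step, and constraint convention are our own assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def ratio_extreme(a, b, A_w=None, maximize=True):
    """Extreme value of (w @ a) / (w @ b) over w >= 0, optionally subject
    to homogeneous weight restrictions A_w @ w <= 0 (e.g. ordinal
    orderings). The objective is scale-invariant in w, so fixing
    w @ b = 1 (Charnes-Cooper) makes the objective w @ a linear."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    c = -a if maximize else a
    b_ub = None if A_w is None else np.zeros(len(A_w))
    res = linprog(c, A_ub=A_w, b_ub=b_ub, A_eq=[b], b_eq=[1.0],
                  bounds=[(0, None)] * len(a))
    return -res.fun if maximize else res.fun

def dominance_bounds(y_k, y_l, x_k, x_l, A_u=None, A_v=None):
    """Min and max of D_kl when numerator weights u and denominator
    weights v are constrained separately: D_kl factors into
    (u.y_k / u.y_l) * (v.x_l / v.x_k), each bounded by its own LP."""
    lo = (ratio_extreme(y_k, y_l, A_u, maximize=False)
          * ratio_extreme(x_l, x_k, A_v, maximize=False))
    hi = (ratio_extreme(y_k, y_l, A_u, maximize=True)
          * ratio_extreme(x_l, x_k, A_v, maximize=True))
    return lo, hi
```

If the lower bound exceeds one, DMU k dominates DMU l for every feasible weight combination.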

To further complement the understanding of relations between multiple DMUs, one could be

interested in the ranking of a DMU among all DMUs, over all feasible weights. For this purpose, the

REA computes ranking intervals, which indicate the best and worst performance rankings that a DMU

can attain relative to other DMUs. The best ranking is determined by computing the minimum

number of other DMUs with a strictly higher performance ratio. For instance, the best ranking as a

third means that the least number of other DMUs with a strictly higher performance ratio is two. If

for some feasible weights the performance ratio of a DMU is higher than or equal to the ratio of any

other DMU, then its best ranking will be one. The worst ranking is computed similarly.
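The exact ranking intervals require optimisation over all DMUs simultaneously, but the pairwise dominance results already imply outer bounds on them: every DMU that strictly dominates DMU k pushes k's best attainable rank down by one, and every DMU that k strictly dominates caps k's worst rank. The sketch below computes only these dominance-implied bounds, which may be looser than the intervals the REA method produces; the input matrix convention is our own assumption.

```python
def ranking_bounds(min_D):
    """min_D[j][k] is the minimum of the pairwise ratio D_jk over all
    feasible weights; min_D[j][k] > 1 means DMU j strictly outperforms
    DMU k at every feasible weight. Returns a (best, worst) rank pair
    per DMU that is guaranteed to contain its true ranking interval."""
    K = len(min_D)
    out = []
    for k in range(K):
        always_above = sum(1 for j in range(K) if j != k and min_D[j][k] > 1)
        always_below = sum(1 for j in range(K) if j != k and min_D[k][j] > 1)
        out.append((1 + always_above, K - always_below))
    return out
```

For example, a DMU strictly dominated by two others can never rank better than third, whatever the weights.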

The REA results can also include ratio bounds, which show how much larger or smaller the

performance ratio of a given DMU can be with different weights, relative to some other DMU or a

subset of other DMUs. More specifically, the bounds reveal the relative performance ratio with the


most and the least optimal weights for each DMU. In this context, the value of one means that the

DMU is a top-performer i.e. with some feasible weights the performance ratio of a DMU is higher

than the ratio of any other DMU. It is worth noting that the upper limit of the bound is equal to the

DEA-superefficiency score (see e.g. Zhu 1996).

All the REA results are calculated based on the pairwise comparisons between DMUs, instead of

comparing each DMU to the efficient frontier. This makes the REA results more robust than the

frontier-based results, since the introduction/removal of an outlier DMU could substantially change

the location of the efficiency frontier and thus affect frontier-based results. Results of the REA are

insensitive to the effect of an outlier DMU, since the existing pairwise dominances cannot change if a

new DMU is added and the rankings can change only by one unit. It is also worth noting that due to

pairwise comparisons, the REA does not require any returns-to-scale assumption like frontier-based

methods do.

4. Study context, data and setup

We here illustrate the REA approach with an application to the comparative quality of Scottish

Health Boards. In Scotland, responsibility for the allocation of resources is decentralized to 14 local

Health Boards. The ultimate objectives of these Boards are to protect and improve the health of their

populations through planning for and delivering health services (Scottish Government 2014). An

important part of their mandate is to ensure the delivery of high-quality services. Health Boards’

performance is evaluated against a performance measurement system (known as HEAT targets)

based on a series of ratio-based quality targets intended to measure the effectiveness,

appropriateness and safety of the health service. Since HEAT targets can be seen as policy-makers’

definitions of valued outputs of the health service, they provide a sensible basis for selecting

meaningful measures of performance.

Drawing on data from the HEAT target system, we here focus on Health Boards’ relative degree of

achievement in avoiding adverse events patients experience before, during and after hospital stays.

The set of thematically linked performance indicators represents events which health systems should

aim to minimise. In contrast to solely activity-based measures, these concepts thus seek to enable

relatively unambiguous comparisons of ‘good’ and ‘poor’ performance.

To examine robustness to alternative choices of denominator, we relate the number of adverse

events to hospitalised and general populations in Scottish Health Boards. The set of variables and


descriptive statistics are shown in Table 1. The final dataset covers 11 mainland Health Boards. Due

to zero cases of adverse events on some indicators, the three island Boards (Orkney, Shetland,

Western Isles) were removed from the dataset. The REA analyses were conducted with data

normalized between [0;1] such that indicators with a greater magnitude were not given higher

importance.

Variable | Definition | Mean | Std dev. | Min | Max
18WRTT^a | Number of patient journeys from referral to treatment over 18 weeks (among patients seen) | 9,858 | 8,791 | 1,851 | 30,603
4-hour A&E waiting^a | Number of recorded A&E waits lasting over 4 hours | 9,412 | 10,096 | 859 | 31,731
Emergency admissions^a | Number of emergency admissions among patients aged 75+ years | 13,419 | 9,274 | 4,107 | 37,256
MRSA/MSSA | Number of MRSA/MSSA infections | 34 | 28 | 6 | 103
C.difficile^a | Number of Clostridium difficile infections | 82 | 61 | 21 | 200
Delayed discharges^a | Number of delayed discharges over 14 days | 34 | 26 | 2 | 86
Total population^b | Resident population (mid-year estimates) | 475,232 | 318,214 | 113,880 | 1,214,587
RTT^a | Number of patient journeys from referral to treatment (among patients seen) | 111,060 | 91,504 | 30,002 | 359,049
AOBD^a | Number of acute occupied bed days | 113,244 | 98,182 | 20,723 | 365,951
Non-AOBD person days at risk | Person days at risk spent outside hospital per year = total population × 365 days − AOBD | 173,802,619 | 116,622,534 | 41,482,332 | 443,848,174

Sources: ^a HEAT target system; ^b National Records of Scotland. All data are for the year 2012/13.

Table 1. Variables and descriptive statistics
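The [0;1] normalisation mentioned above can be sketched as column-wise min-max scaling of the indicator data. The paper does not state its exact normalisation formula, so this is an assumed, common variant:

```python
import numpy as np

def minmax_normalise(X):
    """Scale each column (indicator) of X to [0, 1], so that indicators
    with larger magnitudes do not dominate the weighted sums. Columns
    with zero spread are mapped to 0 rather than dividing by zero."""
    X = np.asarray(X, float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    return (X - lo) / np.where(span == 0, 1.0, span)
```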

When constructing a summary measure of performance, the weight $w_i$ should ideally capture the

value attached to an extra unit of an individual performance measure for attribute i (Smith 2002). As

we have noted in the methods section, an advantage of REA is its ability to address uncertainty about

weight specifications by using the full set of feasible weights. This can be an attractive option when

one assumes complete ignorance about the relative value of averting particular adverse events.

However, while an elicitation of exact cardinal preferences over ‘how much’ worse a, say, MRSA

infection is compared to a, say, emergency admission may not be feasible (e.g. due to high cognitive

demands) or desirable (e.g. due to biases introduced by specific elicitation methods), it may be

possible to obtain reasonable statements about which events are worse than others. Introducing

some plausible weight restrictions based on ordinal preferences may be useful because this

recognises people’s ability to provide limited preference information about the relative badness of


particular events without imposing delusively exact weights. For illustrative purposes, the research

team arrived at a consensus-based set of ordinal weights through pairwise comparisons of any two

adverse events, along the lines ‘If you could avoid either an emergency admission to hospital or an

MRSA infection, which event would you rather avoid’. Corresponding to their relative badness, events

were ranked as follows (from worst=1 to least bad=6):

1. an MRSA/MSSA infection;
2. an emergency admission2;
3. a Clostridium difficile infection;
4. having to wait longer than 18 weeks from referral to treatment;
5. having to wait more than 4 hours in A&E3;
6. a delayed discharge.

For illustrative purposes, we also made the assumption that avoiding an event of a particular

healthcare output can at most have half of the overall value attached to avoiding an event of each of

the six healthcare outputs. This resulted in the following proportional weight restrictions:

- Whole patient journey analysis (with all six quality indicators): avoiding an event of the worst healthcare output cannot be more than ten times as valuable as avoiding an event of the least bad healthcare output (since with six indicators, a ratio of 1/10 means that one output can have at most half of the weight mass);

- Hospital ‘frontdoor’ and ‘backdoor’ analyses (each comprising three quality indicators): avoiding an event of the worst healthcare output cannot be more than four times as valuable as avoiding an event of the least bad healthcare output (since with three indicators, a ratio of 1/4 means that one output can have at most half of the whole weight mass).
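Restrictions of this kind enter the REA linear programmes as homogeneous inequalities on the weight vector. As an illustrative sketch (the function name and matrix convention are ours), with weights ordered worst-first, the ordinal ranking and the proportional cap can be encoded as rows A such that A·u ≤ 0:

```python
import numpy as np

def ordinal_and_ratio_constraints(n, max_ratio):
    """Build rows A with A @ u <= 0 encoding, for weights ordered
    worst-first: u_1 >= u_2 >= ... >= u_n (ordinal ranking of badness)
    and u_1 <= max_ratio * u_n (proportional cap). The constraints are
    homogeneous, so no scale needs to be fixed for the weights."""
    rows = []
    for i in range(n - 1):
        r = np.zeros(n)
        r[i], r[i + 1] = -1.0, 1.0   # u_{i+1} - u_i <= 0
        rows.append(r)
    cap = np.zeros(n)
    cap[0], cap[-1] = 1.0, -float(max_ratio)  # u_1 - max_ratio * u_n <= 0
    rows.append(cap)
    return np.vstack(rows)
```

A weight vector is feasible exactly when every row of A·u is nonpositive, which is the form of homogeneous restriction an LP solver accepts directly.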

Denominator weights should ideally reflect the amount of risk of experiencing the outcome of

interest in a given population (Romano, Hussey, and Ritley 2010). The population groups considered

to be at a higher risk would thus receive a higher weight in the denominator. To illustrate the REA

approach for a stylised example of performance on healthcare associated infections, we identified

two mutually exclusive and collectively exhaustive denominators: acute occupied bed days (AOBD)

and non-AOBD person days at risk (see Table 1). Our judgement that the risk of acquiring a

healthcare associated infection would be greater in a hospital than in a non-hospital setting led to

the ordinal weight restriction:

$v_{\mathrm{AOBD}} > v_{\mathrm{non\text{-}AOBD}}$

In the absence of such information for the remaining analyses, we used reasonable proportional

weight restrictions. For illustrative purposes, we assumed that the most influential denominator

cannot be more than four times as influential as the least influential denominator.

2 We assumed an avoidable admission e.g. for acute exacerbation of COPD that could have been prevented

with timely primary care. 3 We assumed a condition where patients are in mild to moderate discomfort.


5. Results

The ranking intervals for relative performance on quality measures of the whole patient journey, the

hospital ‘frontdoor‘ and the hospital ‘backdoor‘ (Figures 1-3) show all the rankings that Boards can

attain across the full set of feasible numerator/denominator weights, within ordinal and proportional

weight restrictions. As can be seen in Figure 1, the large size of ranking intervals for measures of

quality along the whole patient journey suggests considerable sensitivity to different modelling

assumptions, despite the use of some reasonable weight restrictions (see section 4). If one takes a

lower-level view of the hospital ‘frontdoor’ and the hospital ‘backdoor’ with fewer sub-indicators

(Figures 2 and 3), then observed variations in performance appear to be manifested more clearly.

The ranking intervals shown in Figure 2 mean that one can be confident that Board F, for example,

cannot be ranked worse than 7th and not better than 2nd. As can be seen in the size of the ranking

intervals, the relative performance of some Boards is robust to changes in weights. For example,

Board K is ranked strictly higher than 3rd when compared to the other Boards no matter how

constituent indicators are weighted. In contrast, a large ranking interval shows that a Board’s relative

performance is sensitive to changes in weights. For example, Board I jumps between 1st and 7th for

different weights.

Figure 1. Whole patient journey: Performance rankings of Scottish Health Boards (n=11) for 18WRTT, emergency admissions, 4-hour A&E waiting times, MRSA/MSSA infections, C.difficile infections and delayed discharges relative to total population, RTT and AOBD

[Three figure panels omitted: each plots Rank (1–11) against Health Boards A–K.]

Figure 2. The hospital ‘frontdoor’: Performance rankings for 18WRTT, emergency admissions and 4-hour A&E waiting times relative to total population and RTT

Figure 3. The hospital ‘backdoor’: Performance rankings for MRSA/MSSA infections, C.difficile infections and delayed discharges relative to total population and AOBD


Based on pairwise comparisons, the REA results can be displayed in a unified way as a dominance

relation (Figure 4): the higher up a Board appears in the dominance graph, the more robust its

relative performance is to changes in the weights attached to the constituent indicators of

performance.
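The pairwise comparisons underlying such a dominance graph can be sketched as follows. This is an illustrative approximation under our own assumption of uniformly sampled weights; the exact REA test certifies dominance by optimisation rather than sampling.

```python
import random

def dominates(ya, xa, yb, xb, n_samples=5000, seed=2):
    """Approximate test of whether unit a dominates unit b in REA terms:
    a's efficiency ratio is at least b's for every sampled weight vector,
    and strictly greater for at least one. Sampling can only suggest
    dominance; a single counterexample, however, disproves it."""
    rng = random.Random(seed)
    strictly_better = False
    for _ in range(n_samples):
        u = [rng.random() for _ in ya]     # numerator weights
        v = [rng.random() for _ in xa]     # denominator weights
        ra = sum(ui * yi for ui, yi in zip(u, ya)) / sum(vi * xi for vi, xi in zip(v, xa))
        rb = sum(ui * yi for ui, yi in zip(u, yb)) / sum(vi * xi for vi, xi in zip(v, xb))
        if ra < rb:
            return False                   # b beats a for these weights
        if ra > rb:
            strictly_better = True
    return strictly_better
```

A Board that dominates another sits above it in the dominance graph; Boards that trade off strengths on different indicators remain mutually non-dominated and hence incomparable.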

Figure 4. Dominance graph for the hospital ‘frontdoor’: 18WRTT, emergency admissions and 4-hour A&E waiting times relative to total population and RTT

A comparison of performance ratios (Figure 5) sheds some additional light on the implications of

using the most and the least favourable weights for each Board. ‘Good’ performers, i.e. Boards that

are the best at avoiding adverse events, obtain the highest performance ratios.4 The lower limits of

the ratio intervals show how well a Board performs with its least favourable weights when compared

to the best performing Board. With performance ratios of 1, Boards B, I and K are top performers,

since for some feasible weights their performance ratios are not exceeded by those of any other Board.

Figure 5. Performance ratios for the hospital ‘frontdoor’: 18WRTT, emergency admissions and 4-hour A&E waiting times relative to total population and RTT

4 Note that in our example, the conventional numerator (adverse events) and denominator (population) are

reversed in the REA analyses: this ensures that a high performance ratio (as shown in Figure 5) also means high performance, in terms of minimising the adverse events-population ratio (or maximising the population-adverse events ratio).
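The ratio intervals shown in Figure 5 can likewise be approximated by relating each unit's efficiency to the best unit's efficiency under sampled weights. As before, the uniform sampling and synthetic inputs are our own assumptions rather than the exact REA computation.

```python
import random

def performance_ratio_bounds(Y, X, n_samples=20000, seed=3):
    """Approximate each unit's performance-ratio interval: its efficiency
    divided by the best efficiency among all units, over sampled weights.
    An upper limit of 1 means the unit is a top performer for some weights;
    the lower limit shows how far it can fall behind the best performer."""
    rng = random.Random(seed)
    n = len(Y)
    lo, hi = [float("inf")] * n, [0.0] * n
    for _ in range(n_samples):
        u = [rng.random() for _ in Y[0]]
        v = [rng.random() for _ in X[0]]
        eff = [
            sum(ui * yi for ui, yi in zip(u, Y[k]))
            / sum(vi * xi for vi, xi in zip(v, X[k]))
            for k in range(n)
        ]
        best = max(eff)
        for k in range(n):
            lo[k] = min(lo[k], eff[k] / best)
            hi[k] = max(hi[k], eff[k] / best)
    return list(zip(lo, hi))
```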



Table 2 examines robustness to different choices of denominator in the case of healthcare

associated infections (HAI). Although most Boards seem to perform either relatively well (Boards H, I,

K) or relatively poorly (Boards B, E, A) on either of two approximations of the population at risk –

acute occupied bed days or total population – for some Boards different assumptions about

appropriate denominators lead to notable reversals in relative performance ranks. Boards G and A

appear to perform distinctly better when HAI are measured relative to acute occupied bed days,

while Boards D, F and K seem to jump up the ranking when a population denominator is used.

If one takes into account both hospitalised and general populations outside the hospital environment

in a composite denominator, then the ranking intervals estimated from REA show the robustness of

performance ranks to these alternative modelling assumptions. Whereas the number of acute

occupied bed days ignores populations in non-acute hospital settings exposed to HAI (e.g. in geriatric

wards, nursing homes) and total population overestimates the number of people at a non-negligible

risk of HAI, the REA-based results show the possible ranks if one considers both bed days and their

counterpart, the person-time at risk of contracting a HAI in any non-acute hospital environment.

Note that in our example, the ranking intervals are narrowed as a consequence of using specific

weight restrictions. As explained in section 4, this was informed by the judgement that any day spent

in an acute hospital environment would put a patient at a higher risk of contracting a HAI than any

day spent in a non-hospital environment. Given this modelling assumption, Board G’s ranking

interval, for instance, is bounded by the ranks 3 and 4 rather than by rank 3 (as predicted by the

AOBD denominator) and rank 6 (as predicted by the total population denominator).
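The denominator-driven rank reversals discussed above can be reproduced directly from the rates in Table 2. The small sketch below recomputes competition ranks (a lower adverse-event rate means a better rank) under each denominator:

```python
def ranks(rates):
    """Rank boards by adverse-event rate (lower rate = better rank);
    ties share the best rank, as in competition ranking."""
    return {b: 1 + sum(1 for r2 in rates.values() if r2 < r)
            for b, r in rates.items()}

# HAI per 100,000 AOBD and per 100,000 total population (from Table 2)
hai_aobd = {"H": 77, "I": 80, "G": 83, "K": 84, "J": 110, "C": 111,
            "A": 120, "D": 130, "E": 131, "B": 133, "F": 150}
hai_pop  = {"H": 14, "I": 18, "G": 25, "K": 17, "J": 23, "C": 25,
            "A": 29, "D": 22, "E": 31, "B": 34, "F": 27}

rank_aobd, rank_pop = ranks(hai_aobd), ranks(hai_pop)
rank_shift = {b: rank_aobd[b] - rank_pop[b] for b in hai_aobd}
```

Positive shifts (e.g. +4 for Board D) mark Boards that jump up the ranking under a population denominator; negative shifts (e.g. Board G) mark Boards that look better per occupied bed day.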

Board   HAI per 100,000   Rank   HAI per 100,000    Rank (difference     Ranking interval for AOBD and
        AOBD                     total population   compared to AOBD)    non-AOBD person days at risk
H        77                 1     14                  1 (0)               1
I        80                 2     18                  3 (-1)              2-3
G        83                 3     25                  6 (-3)              3-4
K        84                 4     17                  2 (+2)              2-4
J       110                 5     23                  5 (0)               5
C       111                 6     25                  6 (-1)              6-7
A       120                 7     29                  9 (-2)              7-8
D       130                 8     22                  4 (+4)              6-8
E       131                 9     31                 10 (-1)              9-10
B       133                10     34                 11 (-1)              10-11
F       150                11     27                  8 (+3)              9-11

Table 2. Number of healthcare associated infections (HAI; includes MRSA/MSSA, C.difficile) relative to different choices of denominator


6. Discussion and implications

In this paper, we have explored some healthcare applications of the REA approach, in order to

provide a new perspective on pervasive and long-standing problems in healthcare performance

comparisons. As we have outlined in section 2, there are at least four key sources of ambiguity that

make robust comparative assessments of healthcare performance difficult:

- it is often not clear to what extent health outcomes are under the control of health services;

- we lack knowledge about whether healthcare delivered is appropriate, i.e. whether patients who would benefit have access to the right care while patients who would not benefit do not receive unnecessary or even harmful care;

- there is ambiguity over how heterogeneous or multidimensional healthcare outputs should be weighted and aggregated in a composite measure of performance; and

- in ratio-based measures of performance, there is ambiguity with regard to the choice of denominator, i.e. the ‘currency unit’ used to make different organizations comparable.

In this paper, we have demonstrated how the REA technology can be used to shed some light on the

last two questions. Represented in the form of ranking intervals and dominance relations, the REA

results show the extent to which the comparative performance of Scottish Health Boards, measured

as their relative success in avoiding adverse events patients experience before, during and after

hospital stays, is robust to different choices of denominators and to changes in weights attached to

numerators and denominators. This

is an important finding since neither composite measures nor ratio-based measures of performance

are in any way “objective” guides for policy-makers and managers. Unless one can be confident that

observed variations in measured performance reflect real differences, rankings of healthcare

organizations may severely mislead interpretations of performance and misguide any consequent

regulatory action.

As Jacobs et al. (2005) note in the context of composite measures of performance, two possible

policy implications to respond to the uncertainty inherent in performance measures would be to

either dismiss ambiguous measures altogether and instead display an array of indicators

representing important dimensions of performance, or to invest considerable resources into

enhancing the sophistication of the approach, for example by means of elaborate preference

elicitation. However, the presence of uncertainty is likely to render the results open to debate

nevertheless. The REA approach, as demonstrated in this paper, illustrates a third, more pragmatic,

way that explicitly recognises the uncertainty inherent in performance measures. The use of an

interval-based approach openly provides indications of uncertainty in numerator and denominator


valuations. Rankings are often sensitive to assumptions and the REA approach identifies which

ranking judgements one can have total confidence in. Since composite and ratio-based measures are

likely here to stay and seem to appeal to policy-makers and managers worldwide seeking to obtain a

rounded assessment of performance across organizations, such results may prove most useful to

start a debate, guide the search for further information and indicate the extent to which action is

warranted, rather than providing a conclusive evaluation and prescription for action.

Opportunities for further methodological development of robust performance comparisons

While the existing HEAT target system in Scotland provided the empirical context of application for

this paper, we believe the REA approach has wider applicability and holds potential to enable a

fresh perspective on robust performance comparisons in healthcare. This technology is in its early

stages and further exploration of the REA approach in the context of robust comparisons of health

outcomes and treatment appropriateness forms part of the authors’ ongoing research agenda.

From a methodological perspective, this line of work raises several issues on which we would

welcome feedback. One question concerns the different sets of modelling assumptions that results

should be robust to. Although there may be no such thing as the ‘best’ set of assumptions, there are

various ambiguities around the criteria for the selection of appropriate indicators, weighting and

choice of denominators in ratio-based measures of healthcare performance, as well as the

appropriate level of aggregation. For instance, while a larger number of indicators capturing

performance along the whole patient journey (Figure 1) has the potential to provide a more rounded

view of performance, it can obscure distinct clusters of Boards which perform notably better on

sub-measures along the hospital ‘frontdoor’ (Figure 2) or the hospital ‘backdoor’ (Figure 3).

A second question relates to the most meaningful displays of the information contained in relatively

complex summary measures of performance. An advantage of the REA approach is its ability to

synthesise key messages contained in disparate indicators and single ratios of performance. Ranking

intervals and dominance relations appear to be a relatively intuitive way to communicate

performance results to managers and policy-makers, possibly addressing the limitations of frontier-

based approaches such as DEA and SFA which often appear to be overly complex and hardly used by

practitioners (Hussey et al. 2009). However, there may be other useful displays of weights, ranges

and results of performance comparisons that could help deepen analysis of and insight into

measurement results and point towards areas for managerial and policy intervention.


References

Appleby, J, and J Mulligan. 2000. How well is the NHS performing? A composite performance indicator based on public consultation. London: The King's Fund.

Bevan, G, and R Hamblin. 2009. Hitting and missing targets by ambulance services for emergency calls: effects of different systems of performance measurement within the UK. Journal of the Royal Statistical Society, Series A (Statistics in Society) 172:161-190.

Binder, LF, and B Rudolph. 2009. Commentary: A Systematic Review of Health Care Efficiency Measures. Health Services Research 44 (3):806-811.

Charnes, A, WW Cooper, and E Rhodes. 1978. Measuring the efficiency of decision making units. Eur. J. Oper. Res. 2:429-444.

Corallo, AN, R Croxford, DC Goodman, EL Bryan, D Srivastava, and TA Stukel. 2014. A systematic review of medical practice variation in OECD countries. Health Policy 114 (1):5-14.

CMS. 2009. Centers for Medicare & Medicaid Services. Premier Hospital Quality Incentive Demonstration: Fact sheet. Accessed May 9, 2014 at http://www.cms.hhs.gov/HospitalQualityInits/downloads/HospitalPremierFactSheet200907.pdf.

Dolan, P, C Gudex, P Kind, and A Williams. 1996. Valuing health states: A comparison of methods. Journal of Health Economics 16 (5):617-617.

Goddard, M, and R Jacobs. 2009. Using composite indicators to measure performance in health care. In Performance measurement for health system improvement: experiences, challenges and prospects, edited by P. Smith, E. Mossialos, I. Papanicolas and S. Leatherman. Cambridge: Cambridge University Press.

Goldstein, H, and DJ Spiegelhalter. 1996. League tables and their limitations: Statistical issues in comparisons of institutional performance. Journal of the Royal Statistical Society, Series A (Statistics in Society) 159:385-409.

Gravelle, H, R Jacobs, AM Jones, and A Street. 2003. Comparing the efficiency of national health systems: a sensitivity analysis of the WHO approach. Appl Health Econ Health Policy 2 (3):141-7.

Guillen, Ú, S DeMauro, L Ma, J Zupancic, E Wang, A Gafni, and H Kirpalani. 2011. Survival rates in extremely low birthweight infants depend on the denominator: avoiding potential for bias by specifying denominators. American Journal of Obstetrics and Gynecology 205 (4):329.e1-329.e7.

Hussey, PS, H de Vries, J Romley, MC Wang, SS Chen, PG Shekelle, and EA McGlynn. 2009. A systematic review of health care efficiency measures. Health Services Research 44 (3):784-805.

Jacobs, R, M Goddard, and PC Smith. 2005. How robust are hospital ranks based on composite performance measures? Med Care 43 (12):1177-84.

Jacobs, R, PC Smith, and A Street. 2006. Measuring efficiency in health care: analytic techniques and health policy. Cambridge: Cambridge University Press.

Marlow, AK. 1995. Potential years of life lost: what is the denominator? J Epidemiol Community Health 49 (3):320-2.


McKibben, L, T Horan, JI Tokars, G Fowler, DM Cardo, ML Pearson, P J Brennan, and Advisory Committee for Healthcare Infection Control Practices. 2005. Guidance on Public Reporting of Healthcare-Associated Infections: Recommendations of the Healthcare Infection Control Practices Advisory Committee. American Journal of Infection Control 33 (4):217-226.

Navarro, V. 2000. Assessment of the world health report 2000. Lancet 356 (9241):1598-1601.

OECD. 2008. Handbook on Constructing Composite Indicators. Paris: OECD.

Romano, P, P Hussey, and D Ritley. 2010. Selecting Quality and Resource Use Measures: A Decision Guide for Community Quality Collaboratives. Rockville: Agency for Healthcare Research and Quality.

Salo, A, and A Punkka. 2011. Ranking Intervals and Dominance Relations for Ratio-Based Efficiency Analysis. Management Science 57 (1):200-214.

Scottish Government. 2014. NHS Boards. Accessed 14 May, 2014 at http://www.scotland.gov.uk/Topics/Health/NHS-Workforce/NHS-Boards.

Smith, PC. 2002. Developing composite indicators for assessing health system efficiency. In Measuring up: improving health system performance in OECD countries, edited by P. Smith. Paris: OECD, p. 295-318.

WHO. 2000. The world health report 2000 - Health systems: improving performance. Geneva: World Health Organization.

Zhu, J. 1996. Robustness of the efficient DMUs in data envelopment analysis. Eur. J. Oper. Res. 90(3): 451-460.