Warwick Clinical Trials Unit

Statistical Errors in Publications

October 2010


OVERVIEW

Greater emphasis on sections dealing with:

• Design;

• Sample size;

• Statistical methodology;

• Results (Presentation/Interpretation);

• Discussion/Conclusion.


SAMPLE PAPERS

• Sample 1: Randomised controlled trial of the management of ankle sprains, comparing elastic support bandage v. Aircast ankle brace (Br. J. Sports Med, 2005);

• Sample 2: Study to assess variables which predict chronic neck pain disability (Arch Phys Med Rehabil, 2004).


PREVALENCE OF STATISTICAL ERRORS

• Concerns about the misuse of statistics date back over 70 years (Altman, 2004);

• Despite greater awareness of statistical issues (e.g. CONSORT), such concerns have not diminished.


Prevalence of Statistical Errors (cont’d)

• Serious statistical errors were found in 40% of 164 articles published in psychiatry (Altman, 2002);

• At least one serious statistical error occurred in 38% and 25% of papers in Nature and BMJ respectively (Garcia-Berthou and Alcaraz, 2004);

• Many surveys of statistical errors report error rates ranging from 30% to 90% (Altman, 1991; Gore et al., 1976; Pocock et al., 1987; MacArthur, 1984).


Why are there so many errors? (Altman, 2004)

• Many investigators are not professional researchers; they are primarily clinicians;

• Training is usually a single course in statistics;

• Training focuses on data analysis, but issues such as statistical reporting and interpretation are not addressed;

• The statistical content and complexity of medical research have increased steadily over recent decades.


(Altman, 2004)

“... When I tell friends outside medicine that many papers published in medical journals are misleading because of methodological weaknesses they are rightly shocked.

Huge sums of money are spent annually on research that is seriously flawed through the use of inappropriate designs, unrepresentative samples, small samples, incorrect methods …”


[Figure: the research cycle. Boxes: Observe (natural course of disease); Hypothesize (frame research question); Test (conduct experiment/clinical trial); Conclude (validate or modify hypothesis). Other labels: Concept Development; Experimental Design; Statistical Inference; Personal & Scientific Experience; Research Planning, Grant Writing, Protocol Development; Data Collection & Analysis; Journal Articles, Scientific Meetings. The legend distinguishes Process, Stage and Activity.]


DESIGN

Population: A population is a group of individuals, objects, or items from which samples are taken.

Sample: A sample is a finite part of a statistical population whose properties are studied to gain information about the whole.

Sampling: Sampling is the process of selecting a suitable sample, or a representative part of a population, for the purpose of determining parameters or characteristics of the whole population.

Purpose of sampling: To draw conclusions about populations from samples, we must use inferential statistics, which enable us to determine a population's characteristics by directly observing only a portion (or sample) of the population.


Design (cont’d)

Sampling error: What can make a sample unrepresentative of its population? One of the most frequent causes is sampling error.

Two types of sampling error:

(i) Chance: the error that occurs just because of bad luck.

(ii) Bias: sampling bias is a tendency to favour the selection of units that have particular characteristics (as a result of a poor sampling plan).

To avoid sampling error: plan carefully!

Select participants using random selection.
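The random selection recommended above can be sketched in a few lines. This is a minimal illustration, assuming a complete sampling frame is available as a list; the identifiers, the frame size of 500, the sample of 50 and the seed are all hypothetical.

```python
import random

def simple_random_sample(sampling_frame, n, seed=None):
    """Draw n distinct units from the frame, each unit equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible and auditable
    return rng.sample(list(sampling_frame), n)

# Hypothetical sampling frame: 500 patient identifiers
patient_ids = [f"P{i:03d}" for i in range(1, 501)]
sample = simple_random_sample(patient_ids, n=50, seed=2010)
print(len(sample), "participants selected, e.g.", sample[:5])
```

Because every unit has the same chance of selection, the only remaining sampling error is chance, not bias from the investigator's choices.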


SAMPLE SIZE

Sample size may be determined by various practical constraints:

• Financial

• Resources

• Too small a sample is not representative of the population

• Too large a sample is wasteful and unethical

• The larger the sample size, the more likely the results will reflect what will happen in the population


Sample size (Power Calculation) (cont’d)

● Difference: the clinically important difference.

● Significance threshold: the type I error rate, conventionally set at 0.01 or 0.05.

● Power: i.e. 1 - type II error rate, conventionally 80% or 90%; how confident you are that the sample will detect a difference, if one really exists in the population.

● Variability: the less variability among patients within each group, the more likely the sample reflects the overall population.


Sample size (Power Calculation) (cont’d)

The required sample size increases with:

(a) A smaller clinically relevant difference;

(b) An increase in power;

(c) Greater variability;

(d) A reduction in the Type I error rate.

Allow for dropouts and/or withdrawals.
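The four ingredients above can be combined in a short calculation. The sketch below assumes a two-arm trial comparing means with equal group sizes and uses the standard normal-approximation formula n = 2(z_{1-α/2} + z_{1-β})² σ² / δ² per group; the slides do not state which formula was intended, and the input values (difference 5, sd 10, 20% dropout) are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Participants per group to detect a mean difference `delta` (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - type II error rate
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

def inflate_for_dropout(n, dropout_rate):
    """Recruit enough so that n participants remain after the expected dropout."""
    return math.ceil(n / (1 - dropout_rate))

base = n_per_group(delta=5, sd=10)         # hypothetical inputs
print(base, inflate_for_dropout(base, 0.20))
print(n_per_group(delta=2.5, sd=10))       # smaller difference -> larger sample
print(n_per_group(delta=5, sd=10, power=0.90))
print(n_per_group(delta=5, sd=10, alpha=0.01))
```

Halving the difference roughly quadruples the sample size, which is why the choice of clinically important difference needs to be justified in the report.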


Sample size (cont’d)

Review the two articles in terms of:

Design

Sample size

“… A major concern in the design of studies is the almost universal lack of reporting of how the sample size was obtained …” (Altman, 2000).

“… Basis of the power calculation is inadequately described …” (Malachy, 2004; Vail et al., 2003). (all sample papers)

“Quite often sample size calculations are computed without allowing for dropouts” (McGuigan, 1995). (all sample papers)


Sample size (cont’d)

Small studies:

• Small trials have low power and therefore a high type II error rate;

• If no sample size calculation is provided, the conclusions of the study have little value (as in sample 2);

• If a study is underpowered, its conclusions should be taken with caution and its results are inconclusive (as in sample 1).
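To see how quickly power falls in a small study, the following sketch computes approximate power for a two-arm comparison of means. It assumes equal group sizes and a normal approximation, and it ignores the negligible chance of rejecting in the wrong direction; the inputs (difference 5, sd 10) are hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(n_per_group, delta, sd, alpha=0.05):
    """Approximate power of a two-sided test for a difference in means."""
    se = sd * math.sqrt(2 / n_per_group)           # standard error of the difference
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(delta / se - z_alpha)

for n in (10, 20, 63, 100):
    print(n, round(approx_power(n, delta=5, sd=10), 2))
```

With these inputs, a trial of 20 per group would miss a real difference more often than it would find it, so a non-significant result from such a trial says very little.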



A description of the sample size in the literature should contain, for example:

“The mean and sd for the RMQ on the active management is 5.91 and 4.27 respectively (Oxfordshire Low Back Pain trial, BMJ, 2005). The smallest difference between the two therapies which is clinically relevant is approximately 2.0. Using this information, the total number of participants required for this study will be 700, allowing for a 25% loss-to-follow up and using 90% power with a 1% type I error rate (significance level).”


METHODS

“... All of the problems hinge on the understanding of what a statistical test is doing and what a p-value means ...” (Murphy, 2004)

A statistical test is a procedure used to compute a probability (the p-value) that measures how well the data support the null hypothesis.


Methods (cont’d)

e.g. $H_0: \mu_1 = \mu_2$

$H_1: \mu_1 \neq \mu_2$

Test statistic (t-test): $t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

• The test statistic is transformed into a p-value.



• P-value: the strength of the evidence (quantified by a probability) regarding the null hypothesis. It is the probability, assuming the null hypothesis is true, of a result at least as extreme as the one observed.

• Neither the statistical test nor the p-value PROVES or DISPROVES the null hypothesis; they provide EVIDENCE for or against it.
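The step from data to test statistic to p-value can be sketched as follows. The statistic is the two-sample t with separate variance estimates for each group (often called Welch's form). The p-value here uses a large-sample normal approximation, which is an assumption made to keep the example dependency-free; for small samples a statistics package would use the t distribution with the degrees of freedom returned below. The data are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t(sample1, sample2):
    """Two-sample t statistic with separate variances, plus Welch degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1) / n1, variance(sample2) / n2  # s^2 / n for each group
    t = (mean(sample1) - mean(sample2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def approx_two_sided_p(t):
    """Large-sample (normal) approximation to the two-sided p-value."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

group_a = [1, 2, 3, 4, 5]  # hypothetical outcome scores
group_b = [2, 3, 4, 5, 6]
t, df = welch_t(group_a, group_b)
print(t, df, round(approx_two_sided_p(t), 3))
```

The p-value is a statement about the data under the null hypothesis, not the probability that the null hypothesis is true.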



Review the two articles in terms of:

Methods

Results (including figures and tables)

“... A further issue is the copying of incorrect or inappropriate methods. Once incorrect procedures become common, it is hard to stop them from spreading through the medical literature like a genetic mutation ...” (Altman, 2002).

(as sample 1)

“Schwartzer et al. (2000) found that most papers made important errors in the application of new technology such as models for longitudinal data.” (Altman, 2000).

(e.g. Hierarchical models in sample 1; ROC curves in sample 2)


Methods (cont’d)

Most common errors in the Methods section:

• Failure to check assumptions (Nature says that the most common error was not checking for a normal distribution and not stating how normality was tested);

• Using linear regression analysis without first establishing that the relationship is linear;

• Ignoring paired or ordered categories and therefore using an inappropriate test;

• Arbitrarily dividing continuous data into ordinal categories without explanation (“data dredging”);

• Multiple comparisons, which increase the likelihood of a significant result arising by chance (sample 2);

• And many more: sub-group analyses, ignoring a repeated measures design, non-matched analysis for matched data, modelling incorrectly (e.g. interactions not included) ...
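The multiple comparison problem listed above is easy to quantify. The sketch below assumes the tests are independent and that every null hypothesis is true, which is a simplification; real comparisons within one study are usually correlated.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false-positive result across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 10, 20):
    print(k, round(familywise_error_rate(0.05, k), 2))
```

Under these assumptions, ten tests at the 5% level give roughly a 40% chance of at least one "significant" finding when no real effect exists.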



• Begin a statistical analysis with data exploration;

• Check assumptions;

• Identify the type of data: continuous, binary, ordinal, repeated over time, etc.;

• Check for missing values, outliers and the number of withdrawals;

• Be careful with computer output (it often helps to do simple calculations by hand first).
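A first pass at data exploration for one continuous variable might look like the sketch below. It counts missing values, compares mean and median, and flags possible outliers with the 1.5 × IQR rule. That rule is one common convention and is an assumption here, not something the slides prescribe; the blood pressure readings are hypothetical.

```python
import statistics

def explore(values):
    """Summarise one continuous variable; None marks a missing value."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # fences for flagging outliers
    return {
        "n_observed": len(observed),
        "n_missing": len(values) - len(observed),
        "mean": statistics.mean(observed),
        "median": statistics.median(observed),
        "sd": statistics.stdev(observed),
        "outliers": [v for v in observed if v < low or v > high],
    }

# Hypothetical systolic blood pressure readings (mmHg)
readings = [118, 121, None, 115, 119, 122, 117, 120, 210, None]
summary = explore(readings)
print(summary)
```

A large gap between mean and median, as in this example, is a prompt to inspect the flagged values before choosing a test that assumes normality. A flagged value should be checked against the source record, not deleted automatically.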


RESULTS

“... The results section must be written so that the average reader can understand the study findings” (Cummings, 2003).

“… poorly written with excessive jargon …” (Byrne, 2000). (sample 1 and sample 2)

“... A major bias is cherry-picking results …” (Malachy, 2004).


Results (cont’d)

Common language pitfalls

• Avoid non-technical uses of technical terms such as “normal”, “significant”, “sample”;

• “No difference” should be read as “no evidence of a statistically significant difference” (Sample 1);

• Report p-values to two-digit precision (e.g. p = 0.82);

• Do not reduce p-values to “non-significant” or “NS”;

• Report a quantity to a scientifically relevant precision (e.g. a mean blood pressure of 115.73 mmHg should be reported as 115.7 mmHg or even 116 mmHg).



P-values:

• Over-emphasis on the p-value;

• An arbitrary division of the results into “significant” and “non-significant” according to the p-value was not the intention of the founders of statistical inference;

• Smaller p-values indicate stronger evidence against the null hypothesis.



Confidence Intervals:

A confidence interval is a range of values which, with a stated level of confidence (e.g. 95%), encloses the population value;

Confidence intervals are preferable to p-values, as they tell us the range of possible effect sizes compatible with the data;

The larger the sample size, the narrower the confidence interval;

A confidence interval for a difference (e.g. treatment difference) that contains 0, or for a ratio (e.g. odds ratio) that contains 1, implies a lack of evidence of a statistically significant difference.
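The points above about sample size and intervals that contain 0 can be illustrated with a short calculation. The sketch uses a normal-approximation interval for a difference in means; this is an assumption, and for small groups a t-based interval would be somewhat wider. The summary statistics are hypothetical.

```python
import math
from statistics import NormalDist

def ci_difference_in_means(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Normal-approximation confidence interval for mean1 - mean2."""
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return diff - z * se, diff + z * se

# Same observed difference (2 units, sd 4), two different sample sizes
large = ci_difference_in_means(10, 4, 50, 8, 4, 50)
small = ci_difference_in_means(10, 4, 10, 8, 4, 10)
print("n = 50 per group:", [round(x, 2) for x in large])
print("n = 10 per group:", [round(x, 2) for x in small])
```

With 50 per group the interval excludes 0. With 10 per group the same observed difference gives an interval that includes 0, so the smaller study is inconclusive and does not show "no difference".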



and many more pitfalls ...

• testing baseline values (sample 1);

• not reporting missing data;

• lack of statistical power not considered;

• misinterpreting and misunderstanding results from models, e.g. no interactions included.


PRESENTATION

• In tables that compare groups, include counts (of patients or events) and column percentages;

• Use appropriate statistics (e.g. the median instead of the mean for non-normal data);

• In tables of column percentages, do not include a row of counts and percentages of missing data (doing this will distort the other percentages in the table);

• Statistical software packages provide a large amount of output; be selective about what is presented;

• Use graphs as an alternative to tables with many entries; do not duplicate graphs and tables;

• Label graphs and tables correctly (sample 1 and sample 2).


INTERPRETATION AND DISCUSSION

• Put the study sample in the context of the population;

• Do not interpret studies with non-significant results and low statistical power as “negative” when they are inconclusive: “The absence of proof is not proof of absence”;

• Errors in the design and analysis of a study can carry through to errors in interpretation (Rushton, 1999);

• State weaknesses in study design, and study strengths, so that a clear and accurate impression of the reliability of the data can be formed.


And finally ...

The misuse of statistics is a very important issue;

Statisticians need to be involved in research at some stage, preferably as early as possible;

Most errors are relatively unimportant;

Some can have a major bearing on the validity of the study.

So ...


“There are three kinds of lies: lies, damned lies and statistics”.

Benjamin Disraeli.

