Research Design and Statistics - med.stanford.edumed.stanford.edu/content/dam/sm/s-spire/documents/2018PDBootcamp/... · Research Question (PICO) 1. Patient population • Condition

Research Design and StatisticsAmber W. Trickey, PhD, MS, CPH

Senior Biostatistician1070 Arastradero #[email protected]

Outline

1. Study design• Research Question• Validity• Administrative

Codes2. Statistics

• Measurement • Hypothesis testing• Power calculations • Statistical tests

xkcd.com

Strength of Evidence

Descriptive

Analytic

Meta-AnalysesReviews

Study Designs

Evidence Pyramid:

Williams et al, JACS 2018:

Research Question (PICO)

1. Patient population• Condition / disease, demographics, setting, time

2. Intervention• Procedure / policy / test / process intervention

3. Comparison group• Control group: standard of care, non-exposed, watchful

waiting, no treatment4. Outcome of interest

• Treatment effects, side effects, patient-centered outcomes, proxy measures

* Call your friendly biostatistician now

Internal & External ValidityExternal validity: generalizability to other patients & settings Study design

• Which patients are included• How the intervention is implemented• Real-world conditions

Internal validity: finding a true cause-effect relationship Study design + analysis

• Specific information collected (or not)• Data collection definitions• Data analysis methods

Diagnosis Coding

ICD = International Classification of Diseases• Openly shared, World Health Organization• ICD-9 (1970’s) -> ICD-10 (1990’s)• ICD-10 specificity:

> 400% increase in # of codes• ~ 14,000 total ICD-9 Dx codes• ~ 68,000 total ICD-10 Dx codes

• www.ICD9Data.com www.ICD10Data.com

ICD Diagnosis Coding StructureICD-9: up to 5 digits, numerical sections

• 540.0 Acute appendicitis with generalized peritonitis

• 211.3 Benign neoplasm of the colon

• 823.21 Closed fracture of shaft of fibula

ICD-10: up to 7 digits, alphabetical sections

• K35.2 Acute appendicitis with generalized peritonitis

• D12.2 Benign neoplasm of the ascending colon

• S82.421F Displaced transverse fracture of shaft of right fibula, subsequent encounter for open fracture type IIIA, IIIB, or IIIC with routine healing

7

Procedure Coding

• ICDhttps://www.icd10data.com/ICD10PCS/Codes• 19 times as many ICD-10 procedure codes vs. ICD-9!

• CPT® Current Procedural TerminologyProprietary, American Medical Association

Procedure Coding Structures

ICD-9-CM Procedure: up to 4 digits, numerical sections

• 47.01 Laparoscopic appendectomy

ICD-10-PCS: up to 7 digits, numerical sections + alphabetical codes

• 0DTJ4ZZ Resection of Appendix, Percutaneous Endoscopic Approach

CPT: 5 digit, numerical sections

• 44970 Laparoscopic appendectomy

9

Measures

Quantitative Variables

• Dependent variable (outcome)• Independent variable(s)

(exposure/predictor)• Covariates (confounders)

Confounder

Outcome(DV)

Predictor(IV)

Surgery vs. Nonop mgmt

90-day Return to ED/Hosp

Insurance Status

Uncomplicated Appendicitis

Example

Variable MeasurementType of Measurement Characteristics Examples Descriptive

StatsInformation

Content

Continuous Ranked spectrum; quantifiable

intervals

Length of stay, Surgical procedure

duration

Mean (SD) + all below Highest

Ordered Discrete Number of surgical complications

Mean (SD) + all below High

CategoricalOrdinal(Polychotomous)

Orderedcategories

Resident Post-Graduate Year (PGY),

Functional status classMedian Intermediate

Categorical Nominal(Polychotomous)

Unordered Categories Insurance type, Facility Counts,

Proportions Lower

Categorical Binary (Dichotomous) Two categories Sex (M/F),

Return to ED (Y/N)Counts,

Proportions Low

[Hulley 2007 Designing Clinical Research]

Measures of Central Tendency

1. Mean = average• Continuous, normal distribution

2. Median = middle• Continuous, nonparametric distribution

3. Mode = most common• Categorical

Variability

• Distribution: Normal? Skewed? • Standard Deviation: the “average”

deviation from the mean• Variance: SD2

• Range: minimum to maximum difference

• Interquartile Range: 25th – 75th

percentile

• Definition: Spread of values within a population

• Averages are important, but variability is critical for describing & comparing populations.

• Example measures: o SD, range, IQR

• For skewed distributions (non-normal, e.g. $, time), range or IQR are more representative measures of variability than SD.

Box Plot Components:

(75th percentile)

(25th percentile)

range

Variability

Measures of Association

Measures of Association: Risk Ratio• Risk

oProbability that a new event will occuroRisk in Exposed = a / (a+b)oRisk in Unexposed = c / (c+d)

• AKA ‘Baseline Risk’

• Risk Ratio (RR)oRatio of two risks

o AKA ‘Relative Risk’

Outcome

Predictor Yes No Total

Yes a b a + b

No c d c + d

Total a + c b + d N

Interpreting the Risk Ratio

• Interpretation similar for other ratio measures

• RR=1oRisk in exposed = Risk in unexposed, no association

• RR > 1oPositive association, exposure associated with

increased risk• RR < 1

oNegative association, exposure associated with decreased risk

Measures of Association: Odds Ratio• Odds

• Ratio of the probability of an occurrence of an event to the probability of nonoccurrence

• Odds in Exposed = a / b• Odds in Unexposed = c / d

• AKA ‘Baseline Odds’

• Odds Ratio (OR)• Ratio of two odds

Outcome


Yes a b a + b

No c d c + d

Total a + c b + d N

Measures of Association: Rate Ratio

• RateoProbability that an event will occuroRate in Exposed =

a / person-time in exposedoRate in Unexposed =

c / person-time in unexposed

• Rate RatiooRatio of two rates

• Incidence DensityoAsks when a disease occurs

Outcome


Yes a b PTE

No c d PTU

Total a + c b + d PTT

HypothesisTesting

Error Types

Type I Error (α): False positive• Find an effect when it is truly not there • Due to: Spurious association• Solution: Repeat the study

Type II Error (β): False negative• Do not find an effect when one truly exists • Due to: Insufficient power• Solution: Increase sample size

Hypothesis Testing• Null Hypothesis (H0)

o Default assumption in superiority studies Treatment has NO effect, no difference b/t groups

o Acts as a “straw man”, assumed to be true so that it can be knocked down as false by stats test.

• Alternative Hypothesis (HA)o Assumption being tested in superiority studies Group/treatment has an effect

• Non-inferiority hypotheses are reversed!

Hypothesis TestingOne- vs. Two-tailed Tests

One-sidedTwo-sided

0

Test Statistic

HA :M1 < M2

HA :M1 > M2

H0 :M1 = M2

0

Test StatisticEvaluate association in one direction

Two-sided tests almost always required – higher standard, more cautious

Equivalence & Non-inferiority Hypotheses

Traditional Comparative,Superiority

[Walker 2010]

Equivalence

Non-inferiority

• Assuming the null hypothesis is true, the p-value represents the probability of obtaining a test statistic at least as extreme as the one actually observed.

P-value Definition

P-Value

P-value measures evidence against H0• Smaller the p-value, the larger the evidence against H0• Reject H0 if p-value ≤ α

Pitfalls:• The statistical significance of the effect does not tell us about the size

of the effect• Never report just the p-value!• STATISTICAL significance does not equal CLINICAL significance• Not truly yes/no, all or none, but is actually a continuum• Highly dependent on sample size

Power

Error Types

Type I Error (α): False positive• Find an effect when it is truly not there • Due to: Spurious association• Solution: Repeat the study

Type II Error (β): False negative• Do not find an effect when one truly exists • Due to: Insufficient power• Solution: Increase sample size

A study with low power has a high probability of committing type II error.

• Power = 1 – β• Sample size planning aims to select a sufficient number of subjects to keep

α and β low without making the study too expensive or difficult.• Translation: How many subjects do I need to find a statistically &

meaningful effect?• Sample size calculation pitfalls:

• Requires MANY assumptions• If power calculation estimated effect size >> observed effect size,

sample may be inadequate or observed effect may not be meaningful.

Statistical Power

Statistical Power Components• Alpha (p-value, typically 0.05)• Beta (1-power, typically 0.1-0.2)• 1- vs. 2-tailed test• Estimated effect size or minimum important difference

(clinically relevant)• Population variability• Outcome of interest• Allocation ratio between groups• Study design

Statistical Power Tips

• Seek biostatistician feedback early • power calculations are the perfect opportunity

• Unless you have pilot data, you will probably need to identify previous research with similar methods

• Power should be calculated before the study is implemented• Post hoc power calculations are less useful – if you found a significant effect,

you had sufficient power; if not, you had insufficient power

• Journals are increasingly requiring details of power calculations

StatisticalSleuthing

Pre-reading: Williams et al. JACS 2018Top Ten Statistical Techniques % of Articles

Chi-square test 63%Logistic regression 39%Student’s t-test 39%Fisher’s exact test 37%Mann-Whitney U test (Wilcoxon rank-sum test) 34%Kaplan-Meier curve 30%Log-rank test 25%Cox proportional hazard model 21%Analysis of variance (ANOVA) 10%C-statistic (Concordance, C-index, Area under the ROC curve, AUC) 9%

Impact Factor & Statistical Tests

[Williams 2018]

Which Statistical Test?

1. Number of IVs2. IV

Measurement Scale (distribution)

3. Independent vs. Matched Groups

4. DV Measurement Scale (distribution)

Nonparametric Tests

• 50% of statistical tests in high-impact general surgery research are nonparametric.

• Nonparametric tests are “distribution-free” –no assumptions that the data follow a specific distribution.

• Recommendation: use (or check) nonparametric tests when data do not meet parametric test assumptions, especially regarding normality distribution.

[Williams 2018]

Common Regression ModelsOutcome Variable Appropriate

RegressionModel

Coefficient

Continuous AND

NormalLinear Regression

Slope (β): How much the outcome increases for every 1-unit increase in

the predictor

Binary / Categorical Logistic Regression

Odds Ratio (OR): How much the odds for the outcome increases for

every 1-unit increase in the predictor

Time-to-Event Cox Proportional-Hazards Regression

Hazard Ratio (HR): How much the rate of the outcome increases for

every 1-unit increase in the predictor

Nested Data

Hierarchical / Mixed Effects Models

Correlated Data Grouping of subjects Repeated measures

Can handle Missing data Nonuniform measures

Outcome Variable(s): Categorical Continuous Counts

Level 1:Patients

Level 3:Hospitals

Level 2:Surgeons

FinalRecommendations

Seven Habits of Highly Effective Data Users

[Rosenfeld 2014]

1. Check quality before quantity. All data are not created equal; fancy statistics cannot salvage biased data.

2. Describe before you analyze. Special data require special tests; improper analysis gives deceptive results.

3. Accept the uncertainty of all data. All observations have some random error; interpretation requires estimating precision or confidence.

4. Measure error with the right statistical test. positive results qualified by the chance of being wrong, negative results by chance of having missed a true effect.

5. Put clinical importance before statistical significance. Statistical tests measure error, not importance; an appropriate measure of clinical importance must be checked.

6. Seek the sample source. Results from one dataset do not necessarily apply to others.7. View science as a cumulative process. A single study is rarely definitive.

Consultation http://med.stanford.edu/s-spire/research/research-consultation-program.html

Resources https://med.stanford.edu/s-spire/Resources.html

Resources https://stats.idre.ucla.edu/other/dae/

Thanks!

[email protected]

Amber W. Trickey, PhD, MS, CPHSenior Biostatistician

Department of SurgeryStanford University School of Medicine1070 Arastradero Rd. Ste 225, MC5552

Stanford, CA 94305http://med.stanford.edu/s-spire.html

mailto:[email protected]

http://med.stanford.edu/s-spire.html

References1. Williams PJ, Murphy P, Van Koughnett JAM, Ott MC, et al. Statistical techniques in general surgery literature: What do

we need to know? Journal of the American College of Surgeons 2018;227(4):450-454.2. Alkaline Software. The Web's Free ICD-9-CM Medical Coding Reference. https:// www.icd9data.com3. Alkaline Software. The Web's Free 2018 ICD-10-CM/PCS Medical Coding Reference https:// www.icd10data.com4. Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB. (2007). Designing Clinical Research. 3rd ed.

Philadelphia, PA: Lippincott Williams & Wilkins.5. Walker E, Nowacki AS. Understanding equivalence and noninferiority testing. Journal of general internal medicine.

2011 Feb 1;26(2):192-6.6. Gordis, L. (2014). Epidemiology. Philadelphia: Elsevier/Saunders. 7. Rosenfeld R. Chapter 2: Interpreting medical data. In: Flint, P.W., Haughey, B.H., Robbins, K.T., Thomas, J.R., Niparko,

J.K., Lund, V.J. and Lesperance, M.M., 2014. Cummings Otolaryngology-Head and Neck Surgery E-Book. Elsevier Health Sciences.

8. S-SPIRE Research Consultation Program. http://med.stanford.edu/s-spire/research/research-consultation-program.html

9. S-SPIRE Resources. https://med.stanford.edu/s-spire/Resources.html10. UCLA Institute for Digital Research and Education. What Statistical Analysis Should I Use? Statistical Analyses Using

Stata. https://stats.idre.ucla.edu/stata/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-stata/

Documents

Research Design and Statistics - med.stanford.edumed.stanford.edu/content/dam/sm/s-spire/documents/2018PDBootcamp/... · Research Question (PICO) 1. Patient population • Condition