Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Research Design and StatisticsAmber W. Trickey, PhD, MS, CPH
Senior Biostatistician1070 Arastradero #[email protected]
Outline
1. Study design• Research Question• Validity• Administrative
Codes2. Statistics
• Measurement • Hypothesis testing• Power calculations • Statistical tests
xkcd.com
Strength of Evidence
Descriptive
Analytic
Meta-AnalysesReviews
Study Designs
Evidence Pyramid:
Williams et al, JACS 2018:
Research Question (PICO)
1. Patient population• Condition / disease, demographics, setting, time
2. Intervention• Procedure / policy / test / process intervention
3. Comparison group• Control group: standard of care, non-exposed, watchful
waiting, no treatment4. Outcome of interest
• Treatment effects, side effects, patient-centered outcomes, proxy measures
* Call your friendly biostatistician now
Internal & External ValidityExternal validity: generalizability to other patients & settings Study design
• Which patients are included• How the intervention is implemented• Real-world conditions
Internal validity: finding a true cause-effect relationship Study design + analysis
• Specific information collected (or not)• Data collection definitions• Data analysis methods
Diagnosis Coding
ICD = International Classification of Diseases• Openly shared, World Health Organization• ICD-9 (1970’s) -> ICD-10 (1990’s)• ICD-10 specificity:
> 400% increase in # of codes• ~ 14,000 total ICD-9 Dx codes• ~ 68,000 total ICD-10 Dx codes
• www.ICD9Data.com www.ICD10Data.com
ICD Diagnosis Coding StructureICD-9: up to 5 digits, numerical sections
• 540.0 Acute appendicitis with generalized peritonitis
• 211.3 Benign neoplasm of the colon
• 823.21 Closed fracture of shaft of fibula
ICD-10: up to 7 digits, alphabetical sections
• K35.2 Acute appendicitis with generalized peritonitis
• D12.2 Benign neoplasm of the ascending colon
• S82.421F Displaced transverse fracture of shaft of right fibula, subsequent encounter for open fracture type IIIA, IIIB, or IIIC with routine healing
7
Procedure Coding
• ICDhttps://www.icd10data.com/ICD10PCS/Codes• 19 times as many ICD-10 procedure codes vs. ICD-9!
• CPT® Current Procedural TerminologyProprietary, American Medical Association
Procedure Coding Structures
ICD-9-CM Procedure: up to 4 digits, numerical sections
• 47.01 Laparoscopic appendectomy
ICD-10-PCS: up to 7 digits, numerical sections + alphabetical codes
• 0DTJ4ZZ Resection of Appendix, Percutaneous Endoscopic Approach
CPT: 5 digit, numerical sections
• 44970 Laparoscopic appendectomy
9
Measures
Quantitative Variables
• Dependent variable (outcome)• Independent variable(s)
(exposure/predictor)• Covariates (confounders)
Confounder
Outcome(DV)
Predictor(IV)
Surgery vs. Nonop mgmt
90-day Return to ED/Hosp
Insurance Status
Uncomplicated Appendicitis
Example
Variable MeasurementType of Measurement Characteristics Examples Descriptive
StatsInformation
Content
Continuous Ranked spectrum; quantifiable
intervals
Length of stay, Surgical procedure
duration
Mean (SD) + all below Highest
Ordered Discrete Number of surgical complications
Mean (SD) + all below High
CategoricalOrdinal(Polychotomous)
Orderedcategories
Resident Post-Graduate Year (PGY),
Functional status classMedian Intermediate
Categorical Nominal(Polychotomous)
Unordered Categories Insurance type, Facility Counts,
Proportions Lower
Categorical Binary (Dichotomous) Two categories Sex (M/F),
Return to ED (Y/N)Counts,
Proportions Low
[Hulley 2007 Designing Clinical Research]
Measures of Central Tendency
1. Mean = average• Continuous, normal distribution
2. Median = middle• Continuous, nonparametric distribution
3. Mode = most common• Categorical
Variability
• Distribution: Normal? Skewed? • Standard Deviation: the “average”
deviation from the mean• Variance: SD2
• Range: minimum to maximum difference
• Interquartile Range: 25th – 75th
percentile
• Definition: Spread of values within a population
• Averages are important, but variability is critical for describing & comparing populations.
• Example measures: o SD, range, IQR
• For skewed distributions (non-normal, e.g. $, time), range or IQR are more representative measures of variability than SD.
Box Plot Components:
(75th percentile)
(25th percentile)
range
Variability
Measures of Association
Measures of Association: Risk Ratio• Risk
oProbability that a new event will occuroRisk in Exposed = a / (a+b)oRisk in Unexposed = c / (c+d)
• AKA ‘Baseline Risk’
• Risk Ratio (RR)oRatio of two risks
o AKA ‘Relative Risk’
Outcome
Predictor Yes No Total
Yes a b a + b
No c d c + d
Total a + c b + d N
Interpreting the Risk Ratio
• Interpretation similar for other ratio measures
• RR=1oRisk in exposed = Risk in unexposed, no association
• RR > 1oPositive association, exposure associated with
increased risk• RR < 1
oNegative association, exposure associated with decreased risk
Measures of Association: Odds Ratio• Odds
• Ratio of the probability of an occurrence of an event to the probability of nonoccurrence
• Odds in Exposed = a / b• Odds in Unexposed = c / d
• AKA ‘Baseline Odds’
• Odds Ratio (OR)• Ratio of two odds
Outcome
Predictor Yes No Total
Yes a b a + b
No c d c + d
Total a + c b + d N
Measures of Association: Rate Ratio
• RateoProbability that an event will occuroRate in Exposed =
a / person-time in exposedoRate in Unexposed =
c / person-time in unexposed
• Rate RatiooRatio of two rates
• Incidence DensityoAsks when a disease occurs
Outcome
Predictor Yes No Total
Yes a b PTE
No c d PTU
Total a + c b + d PTT
HypothesisTesting
Error Types
Type I Error (α): False positive• Find an effect when it is truly not there • Due to: Spurious association• Solution: Repeat the study
Type II Error (β): False negative• Do not find an effect when one truly exists • Due to: Insufficient power• Solution: Increase sample size
Hypothesis Testing• Null Hypothesis (H0)
o Default assumption in superiority studies Treatment has NO effect, no difference b/t groups
o Acts as a “straw man”, assumed to be true so that it can be knocked down as false by stats test.
• Alternative Hypothesis (HA)o Assumption being tested in superiority studies Group/treatment has an effect
• Non-inferiority hypotheses are reversed!
Hypothesis TestingOne- vs. Two-tailed Tests
One-sidedTwo-sided
0
Test Statistic
HA :M1 < M2
HA :M1 > M2
H0 :M1 = M2
0
Test StatisticEvaluate association in one direction
Two-sided tests almost always required – higher standard, more cautious
Equivalence & Non-inferiority Hypotheses
Traditional Comparative,Superiority
[Walker 2010]
Equivalence
Non-inferiority
• Assuming the null hypothesis is true, the p-value represents the probability of obtaining a test statistic at least as extreme as the one actually observed.
P-value Definition
P-Value
P-value measures evidence against H0• Smaller the p-value, the larger the evidence against H0• Reject H0 if p-value ≤ α
Pitfalls:• The statistical significance of the effect does not tell us about the size
of the effect• Never report just the p-value!• STATISTICAL significance does not equal CLINICAL significance• Not truly yes/no, all or none, but is actually a continuum• Highly dependent on sample size
Power
Error Types
Type I Error (α): False positive• Find an effect when it is truly not there • Due to: Spurious association• Solution: Repeat the study
Type II Error (β): False negative• Do not find an effect when one truly exists • Due to: Insufficient power• Solution: Increase sample size
A study with low power has a high probability of committing type II error.
• Power = 1 – β• Sample size planning aims to select a sufficient number of subjects to keep
α and β low without making the study too expensive or difficult.• Translation: How many subjects do I need to find a statistically &
meaningful effect?• Sample size calculation pitfalls:
• Requires MANY assumptions• If power calculation estimated effect size >> observed effect size,
sample may be inadequate or observed effect may not be meaningful.
Statistical Power
Statistical Power Components• Alpha (p-value, typically 0.05)• Beta (1-power, typically 0.1-0.2)• 1- vs. 2-tailed test• Estimated effect size or minimum important difference
(clinically relevant)• Population variability• Outcome of interest• Allocation ratio between groups• Study design
Statistical Power Tips
• Seek biostatistician feedback early • power calculations are the perfect opportunity
• Unless you have pilot data, you will probably need to identify previous research with similar methods
• Power should be calculated before the study is implemented• Post hoc power calculations are less useful – if you found a significant effect,
you had sufficient power; if not, you had insufficient power
• Journals are increasingly requiring details of power calculations
StatisticalSleuthing
Pre-reading: Williams et al. JACS 2018Top Ten Statistical Techniques % of Articles
Chi-square test 63%Logistic regression 39%Student’s t-test 39%Fisher’s exact test 37%Mann-Whitney U test (Wilcoxon rank-sum test) 34%Kaplan-Meier curve 30%Log-rank test 25%Cox proportional hazard model 21%Analysis of variance (ANOVA) 10%C-statistic (Concordance, C-index, Area under the ROC curve, AUC) 9%
Impact Factor & Statistical Tests
[Williams 2018]
Which Statistical Test?
1. Number of IVs2. IV
Measurement Scale (distribution)
3. Independent vs. Matched Groups
4. DV Measurement Scale (distribution)
Nonparametric Tests
• 50% of statistical tests in high-impact general surgery research are nonparametric.
• Nonparametric tests are “distribution-free” –no assumptions that the data follow a specific distribution.
• Recommendation: use (or check) nonparametric tests when data do not meet parametric test assumptions, especially regarding normality distribution.
[Williams 2018]
Common Regression ModelsOutcome Variable Appropriate
RegressionModel
Coefficient
Continuous AND
NormalLinear Regression
Slope (β): How much the outcome increases for every 1-unit increase in
the predictor
Binary / Categorical Logistic Regression
Odds Ratio (OR): How much the odds for the outcome increases for
every 1-unit increase in the predictor
Time-to-Event Cox Proportional-Hazards Regression
Hazard Ratio (HR): How much the rate of the outcome increases for
every 1-unit increase in the predictor
Nested Data
Hierarchical / Mixed Effects Models
Correlated Data Grouping of subjects Repeated measures
Can handle Missing data Nonuniform measures
Outcome Variable(s): Categorical Continuous Counts
Level 1:Patients
Level 3:Hospitals
Level 2:Surgeons
FinalRecommendations
Seven Habits of Highly Effective Data Users
[Rosenfeld 2014]
1. Check quality before quantity. All data are not created equal; fancy statistics cannot salvage biased data.
2. Describe before you analyze. Special data require special tests; improper analysis gives deceptive results.
3. Accept the uncertainty of all data. All observations have some random error; interpretation requires estimating precision or confidence.
4. Measure error with the right statistical test. positive results qualified by the chance of being wrong, negative results by chance of having missed a true effect.
5. Put clinical importance before statistical significance. Statistical tests measure error, not importance; an appropriate measure of clinical importance must be checked.
6. Seek the sample source. Results from one dataset do not necessarily apply to others.7. View science as a cumulative process. A single study is rarely definitive.
Consultation http://med.stanford.edu/s-spire/research/research-consultation-program.html
Resources https://med.stanford.edu/s-spire/Resources.html
Resources https://stats.idre.ucla.edu/other/dae/
Thanks!
Amber W. Trickey, PhD, MS, CPHSenior Biostatistician
Department of SurgeryStanford University School of Medicine1070 Arastradero Rd. Ste 225, MC5552
Stanford, CA 94305http://med.stanford.edu/s-spire.html
References1. Williams PJ, Murphy P, Van Koughnett JAM, Ott MC, et al. Statistical techniques in general surgery literature: What do
we need to know? Journal of the American College of Surgeons 2018;227(4):450-454.2. Alkaline Software. The Web's Free ICD-9-CM Medical Coding Reference. https:// www.icd9data.com3. Alkaline Software. The Web's Free 2018 ICD-10-CM/PCS Medical Coding Reference https:// www.icd10data.com4. Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB. (2007). Designing Clinical Research. 3rd ed.
Philadelphia, PA: Lippincott Williams & Wilkins.5. Walker E, Nowacki AS. Understanding equivalence and noninferiority testing. Journal of general internal medicine.
2011 Feb 1;26(2):192-6.6. Gordis, L. (2014). Epidemiology. Philadelphia: Elsevier/Saunders. 7. Rosenfeld R. Chapter 2: Interpreting medical data. In: Flint, P.W., Haughey, B.H., Robbins, K.T., Thomas, J.R., Niparko,
J.K., Lund, V.J. and Lesperance, M.M., 2014. Cummings Otolaryngology-Head and Neck Surgery E-Book. Elsevier Health Sciences.
8. S-SPIRE Research Consultation Program. http://med.stanford.edu/s-spire/research/research-consultation-program.html
9. S-SPIRE Resources. https://med.stanford.edu/s-spire/Resources.html10. UCLA Institute for Digital Research and Education. What Statistical Analysis Should I Use? Statistical Analyses Using
Stata. https://stats.idre.ucla.edu/stata/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-stata/