CPH Exam Review: Biostatistics
Lisa Sullivan, PhD
Associate Dean for Education
Professor and Chair, Department of Biostatistics
Boston University School of Public Health
Outline and Goals
Overview of Biostatistics (Core Area)
Terminology and Definitions
Practice Questions
An archived version of this review, along with the PPT file, will be available on the NBPHE website (www.nbphe.org) under Study Resources.
Biostatistics
Two Areas of Applied Biostatistics:
Descriptive Statistics: Summarize a sample selected from a population.
Inferential Statistics: Make inferences about population parameters based on sample statistics.
Variable Types
Dichotomous variables have 2 possible responses (e.g., Yes/No)
Ordinal and categorical variables have more than two possible responses; ordinal responses are ordered, categorical responses are unordered
Continuous (or measurement) variables assume in theory any values between a theoretical minimum and maximum
We want to study whether individuals over 45 years are at greater risk of diabetes than those 45 and younger. What kind of variable is age?
1. Dichotomous
2. Ordinal
3. Categorical
4. Continuous
We are interested in assessing disparities in infant morbidity by race/ethnicity. What kind of variable is race/ethnicity?
1. Dichotomous
2. Ordinal
3. Categorical
4. Continuous
Numerical Summaries of Dichotomous, Categorical and Ordinal Variables
Frequency Distribution Table
Health Status   Freq.   Rel. Freq.   Cumulative Freq.   Cumulative Rel. Freq.
Excellent       19      38%          19                 38%
Very Good       12      24%          31                 62%
Good            9       18%          40                 80%
Fair            6       12%          46                 92%
Poor            4       8%           50                 100%
Total           n=50    100%
Cumulative columns: ordinal variables only
Frequency Bar Chart
[Figure: bar chart of frequency (y-axis, 0 to 30) by marital status (x-axis)]
Relative Frequency Histogram
[Figure: histogram of relative frequency in % (y-axis, 0 to 40) by health status (x-axis: Poor, Fair, Good, Very Good, Excellent)]
Continuous Variables
Assume, in theory, any value between a theoretical minimum and maximum
Quantitative, measurement variables
Example – systolic blood pressure
Standard Summary: n = 75, X̄ = 123.6, s = 19.4
Second sample: n = 75, X̄ = 128.1, s = 6.4
Summarizing Location and Variability
When there are no outliers, the sample mean and standard deviation summarize location and variability
When there are outliers, the median and interquartile range (IQR) summarize location and variability, where IQR = Q3-Q1
Outliers: values < Q1 - 1.5 IQR or > Q3 + 1.5 IQR
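The outlier rule above can be sketched in Python. This is only an illustration: the SBP values are hypothetical, and the quartile method ("inclusive" interpolation from the standard library) is an assumption, since textbooks and software differ slightly in how they compute Q1 and Q3.

```python
import statistics

def iqr_summary(values):
    """Median, IQR and outliers using the Q1 - 1.5 IQR / Q3 + 1.5 IQR fences."""
    # Quartile method is an assumption; other definitions give slightly different Q1/Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"median": median, "q1": q1, "q3": q3, "iqr": iqr, "outliers": outliers}

# Hypothetical systolic blood pressures (not from the slides).
sbp = [100, 105, 110, 112, 118, 120, 121, 125, 130, 135, 190]
summary = iqr_summary(sbp)
print(summary)
```

With one extreme value in the sample, the median and IQR would be the preferred summaries of location and variability.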
Mean Vs. Median
Box and Whisker Plot
Min Q1 Median Q3 Max
Comparing Samples with Box and Whisker Plots
[Figure: box and whisker plots of systolic blood pressure (axis 100 to 160) for samples 1 and 2]
What type of display is shown below?
1. Frequency bar chart
2. Relative frequency bar chart
3. Frequency histogram
4. Relative frequency histogram
[Figure: Percent Patients by Disease Stage; x-axis shows stages I, II, III, IV, y-axis shows % from 0 to 35]
The distribution of SBP in men, 20-29 years is shown below. What is the best summary of a typical value?
1. Mean
2. Median
3. Interquartile range
4. Standard Deviation
When data are skewed, the mean is higher than the median.
1. True
2. False
The best summary of variability for the following continuous variable is
1. Mean
2. Median
3. Interquartile range
4. Standard Deviation
Numerical and Graphical Summaries
Dichotomous and categorical: frequencies and relative frequencies; bar charts (freq. or relative freq.)
Ordinal: frequencies, relative frequencies, cumulative frequencies and cumulative relative frequencies; histograms (freq. or relative freq.)
Continuous: n, X̄ and s, or median and IQR (if outliers); box and whisker plot
What is the probability of selecting a male with optimal blood pressure?
1. 20/25
2. 20/80
3. 20/150
Blood Pressure Category
Optimal Normal Pre-Htn Htn Total
Male 20 15 15 30 80
Female 5 15 25 25 70
Total 25 30 40 55 150
What is the probability of selecting a patient with Pre-Htn or Htn?
1. 95/150
2. 45/80
3. 55/150
Blood Pressure Category
Optimal Normal Pre-Htn Htn Total
Male 20 15 15 30 80
Female 5 15 25 25 70
Total 25 30 40 55 150
What proportion of men have prevalent CVD?
CVD Free of CVD
Men 35 265
Women 45 355
1. 35/80
2. 35/265
3. 35/300
What proportion of patients with CVD are men ?
CVD Free of CVD
Men 35 265
Women 45 355
1. 35/700
2. 35/80
3. 80/300
Are Family History and Current Status Independent?
Example. Consider the following table which cross classifies subjects by their family history of CVD and current (prevalent) CVD status.
Current CVD
Family History No Yes
No 215 25
Yes 90 15
P(Current CVD| Family Hx) = 15/105 = 0.143
P(Current CVD| No Family Hx) = 25/240 = 0.104
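The two conditional probabilities above can be reproduced with a short Python sketch. The function name is illustrative, not part of the review; the counts come from the family history table. If the two events were independent, both conditional probabilities would equal the overall probability of current CVD.

```python
def conditional_probability(events, non_events):
    """P(event | group) from the counts in one row of a 2x2 table."""
    return events / (events + non_events)

# Rows of the table: (No current CVD, Yes current CVD)
no_family_hx = (215, 25)
family_hx = (90, 15)

p_cvd_given_fh = conditional_probability(family_hx[1], family_hx[0])
p_cvd_given_no_fh = conditional_probability(no_family_hx[1], no_family_hx[0])
p_cvd_overall = (family_hx[1] + no_family_hx[1]) / (sum(family_hx) + sum(no_family_hx))

print(round(p_cvd_given_fh, 3), round(p_cvd_given_no_fh, 3), round(p_cvd_overall, 3))
```

Because the conditional probabilities differ from each other in this sample, family history and current CVD status do not appear to be independent.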
Are symptoms independent of disease?
Disease No Disease Total
Symptoms 25 225 250
No Symptoms 50 450 500
1. No
2. Yes
Popular Probability Models – Discrete Outcomes
                                Binomial          Poisson
Outcome                         Success/Failure   Count
Number of response categories   2                 >2
Number of trials/replications   Fixed             Infinite
Relationships among trials      Independent       Independent
Probability Models – Normal Distribution
Model for continuous outcome
Mean = median = mode
Properties of Normal Distribution
i) The normal distribution is symmetric about the mean (i.e., P(X > μ) = P(X < μ) = 0.5).
ii) The mean and variance (μ and σ²) completely characterize the normal distribution.
iii) The mean = the median = the mode
iv) Approximately 68% of observations fall between mean ± 1 sd, 95% between mean ± 2 sd, and >99% between mean ± 3 sd
Normal Distribution
Body mass index (BMI) for men age 60 is normally distributed with a mean of 29 and standard deviation of 6.
What is the probability that a male has BMI < 29?
[Figure: normal curve for BMI with axis marks at 11, 17, 23, 29, 35, 41, 47]
P(X < 29) = 0.5
Normal Distribution
What is the probability that a male has BMI less than 30?
[Figure: normal curve for BMI with axis marks at 11, 17, 23, 29, 35, 41, 47]
P(X < 30) = ?
Standard Normal Distribution Z
Normal distribution with μ = 0 and σ = 1
[Figure: standard normal curve with axis marks at -3, -2, -1, 0, 1, 2, 3]
Normal Distribution
$Z = \frac{x - \mu}{\sigma} = \frac{30 - 29}{6} = 0.17$
P(X < 30) = P(Z < 0.17) = 0.5675
From a table of standard normal probabilities or statistical computing package.
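As a sketch of the "statistical computing package" route, Python's standard library can evaluate this probability directly. The variable names are illustrative. Note that a package uses the unrounded Z, so its answer differs slightly from the table lookup at Z = 0.17.

```python
from statistics import NormalDist

# BMI for men age 60: mean 29, standard deviation 6 (values from the example).
bmi = NormalDist(mu=29, sigma=6)

z = (30 - 29) / 6                               # standardize x = 30
p_unrounded = bmi.cdf(30)                       # P(X < 30) with the exact Z
p_table_style = NormalDist().cdf(round(z, 2))   # P(Z < 0.17), as read from a Z table

print(round(z, 2), round(p_table_style, 4), round(p_unrounded, 4))
```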
Comparing Systolic Blood Pressure (SBP)
Suppose for Males Age 50, SBP is approximately normally distributed with a mean of 108 and a standard deviation of 14
Suppose for Females Age 50, SBP is approximately normally distributed with a mean of 100 and a standard deviation of 8
If a Male Age 50 has a SBP = 140 and a Female Age 50 has a SBP = 120, who has the “relatively” higher SBP ?
Normal Distribution
ZM = (140 - 108) / 14 = 2.29
ZF = (120 - 100) / 8 = 2.50
Which is more extreme?
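The comparison can be checked with a minimal Python sketch (the helper name is illustrative). The larger Z score marks the value that lies further above its own group's mean in standard deviation units.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z_male = z_score(140, mean=108, sd=14)
z_female = z_score(120, mean=100, sd=8)

more_extreme = "female" if z_female > z_male else "male"
print(round(z_male, 2), round(z_female, 2), more_extreme)  # 2.29 2.5 female
```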
Percentiles of the Normal Distribution
The kth percentile is defined as the score that holds k percent of the scores below it.
E.g., the 90th percentile is the score that holds 90% of the scores below it.
Q1 = 25th percentile, median = 50th percentile, Q3 = 75th percentile
Percentiles
For the normal distribution, the following is used to compute percentiles:
X = μ + Zσ
where
μ = mean of the random variable X,
σ = standard deviation, and
Z = value from the standard normal distribution for the desired percentile (e.g., 95th, Z=1.645).
95th percentile of BMI for Men: 29+1.645(6) = 38.9
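A short Python sketch of the percentile formula, assuming the standard library's inverse normal CDF in place of a printed Z table. The function name is illustrative.

```python
from statistics import NormalDist

def normal_percentile(mean, sd, percentile):
    """X = mean + Z * sd, with Z taken from the standard normal distribution."""
    z = NormalDist().inv_cdf(percentile / 100)
    return mean + z * sd

z_95 = NormalDist().inv_cdf(0.95)           # close to the tabled 1.645
bmi_95 = normal_percentile(29, 6, 95)       # 95th percentile of BMI for men
print(round(z_95, 3), round(bmi_95, 1))
```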
Central Limit Theorem
(Non-normal) population with mean μ and standard deviation σ
Take samples of size n – as long as n is sufficiently large (usually n ≥ 30 suffices)
The distribution of the sample mean is approximately normal, therefore can use Z to compute probabilities
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
Standard error: $\sigma / \sqrt{n}$
Statistical Inference
There are two broad areas of statistical inference, estimation and hypothesis testing.
Estimation. Population parameter is unknown, sample statistics are used to generate estimates.
Hypothesis Testing. A statement is made about parameter, sample statistics support or refute statement.
What Analysis To Do When
Nature of primary outcome variable: continuous, dichotomous, categorical, time to event
Number of comparison groups: one, 2 independent, 2 matched or paired, > 2
Associations between variables: regression analysis
Estimation
Process of determining likely values for unknown population parameter
Point estimate is best single-valued estimate for parameter
Confidence interval is range of values for parameter:
point estimate ± margin of error
point estimate ± t SE (point estimate)
Hypothesis Testing Procedures
1. Set up null and research hypotheses, select α
2. Select test statistic
3. Set up decision rule
4. Compute test statistic
5. Draw conclusion & summarize significance (p-value)
P-values
P-values represent the exact significance of the data
Estimate p-values when rejecting H0 to summarize significance of the data (approximate with statistical tables, exact value with computing package)
If p < α then reject H0
Errors in Hypothesis Tests
             Conclusion of Statistical Test
             Do Not Reject H0    Reject H0
H0 true      Correct             Type I error
H0 false     Type II error       Correct
Continuous Outcome: Confidence Interval for μ
Continuous outcome - 1 Sample
n ≥ 30: $\bar{X} \pm Z \frac{s}{\sqrt{n}}$
n < 30: $\bar{X} \pm t \frac{s}{\sqrt{n}}$
Example. 95% CI for mean waiting time at ED
Data: n = 100, X̄ = 37.85 and s = 9.5 mins
$37.85 \pm 1.96 \frac{9.5}{\sqrt{100}}$
37.85 ± 1.86 (35.99 to 39.71)
Statistical computing packages use t throughout.
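The waiting-time example can be reproduced with a minimal Python sketch. It uses the large-sample Z = 1.96 from the example rather than a t value, and the function name is illustrative.

```python
import math

def mean_confidence_interval(xbar, s, n, z=1.96):
    """Large-sample CI for a mean: xbar +/- z * s / sqrt(n)."""
    margin = z * s / math.sqrt(n)
    return xbar - margin, xbar + margin, margin

lower, upper, margin = mean_confidence_interval(xbar=37.85, s=9.5, n=100)
print(round(margin, 2), round(lower, 2), round(upper, 2))  # 1.86 35.99 39.71
```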
New Scenario
Outcome is dichotomous: result of surgery (success, failure), cancer remission (yes/no)
One study sample
Data: on each participant, measure outcome (yes/no)
n, x = # positive responses, $\hat{p} = \frac{x}{n}$
Dichotomous Outcome: Confidence Interval for p
Dichotomous outcome - 1 Sample
$\hat{p} \pm Z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
if $\min[n\hat{p}, n(1 - \hat{p})] \ge 5$; otherwise, exact procedures
Example. In the Framingham Offspring Study (n=3532), 1219 patients were on antihypertensive medications. Generate 95% CI.
$0.345 \pm 1.96 \sqrt{\frac{0.345(1 - 0.345)}{3532}}$
0.345 ± 0.016
(0.329, 0.361)
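A Python sketch of the same calculation, including the sample-size check from the previous slide. The function name and the choice to raise an error when the check fails are illustrative assumptions.

```python
import math

def proportion_confidence_interval(x, n, z=1.96):
    """CI for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n
    # The normal approximation is only appropriate when min[n*p, n*(1-p)] >= 5.
    if min(n * p_hat, n * (1 - p_hat)) < 5:
        raise ValueError("sample too small for the Z interval; use exact procedures")
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin

p_hat, ci_lower, ci_upper = proportion_confidence_interval(x=1219, n=3532)
print(round(p_hat, 3), round(ci_lower, 3), round(ci_upper, 3))  # 0.345 0.329 0.361
```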
One Sample Procedures – Comparisons with Historical/External Control
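Continuous outcome
H0: μ = μ0
H1: μ > μ0, μ < μ0, μ ≠ μ0
n ≥ 30: $Z = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$
n < 30: $t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$
Dichotomous outcome
H0: p = p0
H1: p > p0, p < p0, p ≠ p0
$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0) / n}}$
if $\min[n p_0, n(1 - p_0)] \ge 5$; otherwise, exact procedures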
Statistical computing packages use t throughout.
One Sample Procedures – Comparisons with Historical/External Control
Categorical or Ordinal outcome
χ² Goodness of fit test
H0: p1=p10, p2=p20, . . . , pk=pk0
H1: H0 is false
$\chi^2 = \sum \frac{(O - E)^2}{E}$
New Scenario
Outcome is continuous: SBP, weight, cholesterol
Two independent study samples
Data: on each participant, identify group and measure outcome
$n_1, \bar{X}_1, s_1 \text{ (or } s_1^2\text{)}, \quad n_2, \bar{X}_2, s_2 \text{ (or } s_2^2\text{)}$
Two Independent Samples
Cohort Study - Set of Subjects Who Meet Study Inclusion Criteria
[Diagram: subjects fall into Group 1 and Group 2; compare Mean Group 1 with Mean Group 2]
Two Independent Samples
RCT: Set of Subjects Who Meet Study Eligibility Criteria
[Diagram: subjects are randomized to Treatment 1 or Treatment 2; compare Mean Trt 1 with Mean Trt 2]
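Continuous Outcome: Confidence Interval for (μ1-μ2)
Continuous outcome - 2 Independent Samples
n1 ≥ 30 and n2 ≥ 30: $(\bar{X}_1 - \bar{X}_2) \pm Z \, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
n1 < 30 or n2 < 30: $(\bar{X}_1 - \bar{X}_2) \pm t \, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
$S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$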
Statistical computing packages use t throughout.
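The pooled interval can be sketched in Python as follows. The summary statistics below are hypothetical (they do not come from the review), both samples are assumed large enough to use Z = 1.96, and the function names are illustrative.

```python
import math

def pooled_sd(n1, s1, n2, s2):
    """Sp = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2))."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def difference_in_means_ci(n1, xbar1, s1, n2, xbar2, s2, z=1.96):
    """(xbar1 - xbar2) +/- z * Sp * sqrt(1/n1 + 1/n2)."""
    sp = pooled_sd(n1, s1, n2, s2)
    margin = z * sp * math.sqrt(1 / n1 + 1 / n2)
    diff = xbar1 - xbar2
    return diff - margin, diff + margin

# Hypothetical summary statistics for two independent samples.
sp = pooled_sd(50, 10, 50, 10)
diff_lower, diff_upper = difference_in_means_ci(50, 120, 10, 50, 115, 10)
print(round(sp, 2), round(diff_lower, 2), round(diff_upper, 2))
```

Because this hypothetical interval excludes 0, the difference in means would be judged statistically significant at the 5% level.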
Hypothesis Testing for (μ1-μ2)
Continuous outcome, 2 Independent Samples
H0: μ1 = μ2 (μ1-μ2 = 0)
H1: μ1 > μ2, μ1 < μ2, μ1 ≠ μ2
Test Statistic
n1 ≥ 30 and n2 ≥ 30: $Z = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
n1 < 30 or n2 < 30: $t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Statistical computing packages use t throughout.
An RCT is planned to show the efficacy of a new drug vs. placebo to lower total cholesterol.
What are the hypotheses?
1. H0: μP = μN, H1: μP > μN
2. H0: μP = μN, H1: μP < μN
3. H0: μP = μN, H1: μP ≠ μN
New Scenario
Outcome is dichotomous: result of surgery (success, failure), cancer remission (yes/no)
Two independent study samples
Data: on each participant, identify group and measure outcome (yes/no)
$n_1, \hat{p}_1, \quad n_2, \hat{p}_2$
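Dichotomous Outcome: Confidence Interval for (p1-p2)
Dichotomous outcome - 2 Independent Samples
$(\hat{p}_1 - \hat{p}_2) \pm Z \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
if $\min[n_1\hat{p}_1, n_1(1 - \hat{p}_1), n_2\hat{p}_2, n_2(1 - \hat{p}_2)] \ge 5$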
Measures of Effect for Dichotomous Outcomes
Outcome = dichotomous (Y/N or 0/1)
Risk=proportion of successes = x/n
Odds=ratio of successes to failures=x/(n-x)
Measures of Effect for Dichotomous Outcomes
Risk Difference = $\hat{p}_1 - \hat{p}_2$
Relative Risk = $\hat{p}_1 / \hat{p}_2$
Odds Ratio = $\frac{\hat{p}_1 / (1 - \hat{p}_1)}{\hat{p}_2 / (1 - \hat{p}_2)}$
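As an illustration, the three measures can be computed in Python from the prevalent CVD table used in the earlier practice questions (men: 35 with CVD out of 300; women: 45 out of 400). Treating men as group 1 and women as group 2 is an assumption made for this sketch, as is the function name.

```python
def effect_measures(x1, n1, x2, n2):
    """Risk difference, relative risk and odds ratio for two groups."""
    p1, p2 = x1 / n1, x2 / n2
    risk_difference = p1 - p2
    relative_risk = p1 / p2
    odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
    return risk_difference, relative_risk, odds_ratio

# Group 1 = men (35 of 300 with CVD), group 2 = women (45 of 400 with CVD).
rd, rr, odds_ratio = effect_measures(35, 300, 45, 400)
print(round(rd, 4), round(rr, 3), round(odds_ratio, 3))
```

The relative risk and odds ratio are close to each other here because the outcome is fairly uncommon in both groups.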
Confidence Intervals for Relative Risk (RR)
Dichotomous outcome, 2 Independent Samples
$\ln(\widehat{RR}) \pm Z \sqrt{\frac{(n_1 - x_1)/x_1}{n_1} + \frac{(n_2 - x_2)/x_2}{n_2}}$
CI for RR: exp(lower limit), exp(upper limit)
Confidence Intervals for Odds Ratio (OR)
Dichotomous outcome, 2 Independent Samples
$\ln(\widehat{OR}) \pm Z \sqrt{\frac{1}{x_1} + \frac{1}{n_1 - x_1} + \frac{1}{x_2} + \frac{1}{n_2 - x_2}}$
CI for OR: exp(lower limit), exp(upper limit)
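Both log-scale intervals can be sketched in Python. The counts again come from the prevalent CVD table in the practice questions (men 35/300, women 45/400), with men assumed to be group 1. The function names are illustrative, and Z = 1.96 gives a 95% interval.

```python
import math

def relative_risk_ci(x1, n1, x2, n2, z=1.96):
    """CI for RR: compute limits for ln(RR), then exponentiate."""
    log_rr = math.log((x1 / n1) / (x2 / n2))
    se = math.sqrt(((n1 - x1) / x1) / n1 + ((n2 - x2) / x2) / n2)
    return math.exp(log_rr - z * se), math.exp(log_rr + z * se)

def odds_ratio_ci(x1, n1, x2, n2, z=1.96):
    """CI for OR: compute limits for ln(OR), then exponentiate."""
    log_or = math.log((x1 / (n1 - x1)) / (x2 / (n2 - x2)))
    se = math.sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

rr_lower, rr_upper = relative_risk_ci(35, 300, 45, 400)
or_lower, or_upper = odds_ratio_ci(35, 300, 45, 400)
print(round(rr_lower, 2), round(rr_upper, 2), round(or_lower, 2), round(or_upper, 2))
```

Both intervals include 1, the null value for a ratio, so these data show no statistically significant difference in CVD prevalence between men and women.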
Hypothesis Testing for (p1-p2)
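Dichotomous outcome, 2 Independent Samples
H0: p1 = p2
H1: p1 > p2, p1 < p2, p1 ≠ p2
Test Statistic
$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
where $\hat{p} = (x_1 + x_2)/(n_1 + n_2)$ is the overall (pooled) proportion
if $\min[n_1\hat{p}_1, n_1(1 - \hat{p}_1), n_2\hat{p}_2, n_2(1 - \hat{p}_2)] \ge 5$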
Two (Independent) Group Comparisons
Difference in birth weight is -106 g,
95% CI for difference in mean Birth weight: (-175.3 to -36.7)
New Scenario
Outcome is continuous: SBP, weight, cholesterol
Two matched study samples
Data: on each participant, measure outcome under each experimental condition
Compute differences (D = X1 - X2): $n, \bar{X}_d, s_d$
Two Dependent/Matched Samples
Subject ID   Measure 1   Measure 2
1            55          70
2            42          60
...
Measures taken serially in time or under different experimental conditions
Crossover Trial
[Diagram: eligible participants are randomized (R); the Treatment and Placebo conditions each appear in both periods of the trial]
Each participant measured on Treatment and placebo
Confidence Intervals for μd
Continuous outcome, 2 Matched/Paired Samples
n ≥ 30: $\bar{X}_d \pm Z \frac{s_d}{\sqrt{n}}$
n < 30: $\bar{X}_d \pm t \frac{s_d}{\sqrt{n}}$
Statistical computing packages use t throughout.
Hypothesis Testing for μd
Continuous outcome, 2 Matched/Paired Samples
H0: μd = 0
H1: μd > 0, μd < 0, μd ≠ 0
Test Statistic
n ≥ 30: $Z = \frac{\bar{X}_d - \mu_d}{s_d / \sqrt{n}}$
n < 30: $t = \frac{\bar{X}_d - \mu_d}{s_d / \sqrt{n}}$
Independent Vs Matched Design
Statistical Significance versus Effect Size
P-value summarizes significance
Confidence intervals give magnitude of effect (if null value is included in CI, then no statistical significance)
The null value of a difference in means is…
1. 0
2. 0.5
3. 1
4. 2
The null value of a mean difference is…
1. 0
2. 0.5
3. 1
4. 2
The null value of a relative risk is…
1. 0
2. 0.5
3. 1
4. 2
The null value of a difference in proportions is…
1. 0
2. 0.5
3. 1
4. 2
The null value of an odds ratio is…
1. 0
2. 0.5
3. 1
4. 2
A two sided test for the equality of means produces p=0.20. Reject H0?
1. Yes
2. No
3. Maybe
Hypothesis Testing for More than 2 Means - Analysis of Variance
Continuous outcome, k Independent Samples, k > 2
H0: μ1 = μ2 = μ3 … = μk
H1: Means are not all equal
Test Statistic
$F = \frac{\sum n_j (\bar{X}_j - \bar{X})^2 / (k - 1)}{\sum\sum (X - \bar{X}_j)^2 / (N - k)}$
F is ratio of between group variation to within group variation (error)
ANOVA Table
Source of Variation    Sums of Squares          df     Mean Squares       F
Between Treatments     SSB = Σ nj (X̄j - X̄)²    k-1    MSB = SSB/(k-1)    F = MSB/MSE
Error                  SSE = Σ Σ (X - X̄j)²     N-k    MSE = SSE/(N-k)
Total                  SST = Σ Σ (X - X̄)²      N-1
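The sums of squares in the ANOVA table can be computed with a short Python sketch. The three groups below are hypothetical data chosen to keep the arithmetic easy to follow, and the function name is illustrative.

```python
def one_way_anova(groups):
    """Return SSB, SSE, SST and F for a list of independent samples."""
    all_values = [x for g in groups for x in g]
    N, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / N
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group variation: n_j * (group mean - overall mean)^2
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group (error) variation: (observation - its group mean)^2
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    sst = sum((x - grand_mean) ** 2 for x in all_values)

    f_stat = (ssb / (k - 1)) / (sse / (N - k))
    return ssb, sse, sst, f_stat

# Hypothetical outcomes in three treatment groups (balanced design).
ssb, sse, sst, f_stat = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(ssb, 2), round(sse, 2), round(sst, 2), round(f_stat, 2))
```

The sketch also illustrates the identity SST = SSB + SSE, which is a useful check on hand calculations.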
ANOVA
When the sample sizes are equal, the design is said to be balanced
Balanced designs give greatest power and are more robust to violations of the normality assumption
Extensions
Multiple Comparison Procedures – Used to test for specific differences in means after rejecting equality of all means (e.g., Tukey, Scheffe)
Higher-Order ANOVA - Tests for differences in means as a function of several factors
Extensions
Repeated Measures ANOVA - Tests for differences in means when there are multiple measurements in the same participants (e.g., measures taken serially in time)
χ² Test of Independence
Dichotomous, ordinal or categorical outcome, 2 or More Samples
H0: The distribution of the outcome is independent of the groups
H1: H0 is false
Test Statistic
$\chi^2 = \sum \frac{(O - E)^2}{E}$
χ² Test of Independence
Data organization (r by c table)
Is the distribution of the outcome different across (associated with) groups?
Outcome
Group 1 2 3
A 20% 40% 40%
B 50% 25% 25%
C 90% 5% 5%
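A minimal Python sketch of the χ² statistic for an r by c table of counts. It reuses the family history by current CVD table from the independence example earlier in the review; expected counts are computed as (row total × column total) / N, and the function name is illustrative. For a 2 by 2 table df = 1, so the statistic is compared with the critical value 3.84.

```python
def chi_square_independence(observed):
    """Chi-square statistic and df for an r x c table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under H0
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Rows: family history No / Yes; columns: current CVD No / Yes.
chi2, df = chi_square_independence([[215, 25], [90, 15]])
reject_h0 = chi2 > 3.84   # critical value for df = 1 at the 5% level
print(round(chi2, 2), df, reject_h0)
```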
What Tests Were Used?
In Framingham Heart Study, we want to assess risk factors for Impaired Glucose
Outcome = Glucose Category: Diabetes (glucose ≥ 126), Impaired Fasting Glucose (glucose 100-125), Normal Glucose
Risk Factors: Sex, Age, BMI (normal weight, overweight, obese), Genetics
What test would be used to assess whether sex is associated with Glucose Category?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess whether age is associated with Glucose Category?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess whether BMI is associated with Glucose Category?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
In Framingham Heart Study, we want to assess risk factors for Glucose Level
Consider a Secondary Outcome = Fasting Glucose Level
Risk Factors: Sex, Age, BMI (normal weight, overweight, obese), Genetics
What test would be used to assess whether sex is associated with Glucose Level?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess whether BMI is associated with Glucose Level?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess whether age is associated with Glucose Level?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
In Framingham Heart Study, we want to assess risk factors for Diabetes
Consider a Tertiary Outcome = Diabetes Vs No Diabetes
Risk Factors: Sex, Age, BMI (normal weight, overweight, obese), Genetics
What test would be used to assess whether sex is associated with Diabetes?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess whether BMI is associated with Diabetes?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess whether age is associated with Diabetes?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
Correlation
Correlation (r)– measures the nature and strength of linear association between two variables at a time
Regression – equation that best describes relationship between variables
Simple Linear Regression
Y = Dependent, Outcome variable
X = Independent, Predictor variable
$\hat{y} = b_0 + b_1 x$
b0 is the Y-intercept, b1 is the slope
Simple Linear Regression: Assumptions
Linear relationship between X and Y
Independence of errors
Homoscedasticity (constant variance) of the errors
Normality of errors
Multiple Linear Regression
Useful when we want to jointly examine the effect of several X variables on the outcome Y variable.
Y = continuous outcome variable
X1, X2, …, Xp = set of independent or predictor variables
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$
Multiple Regression Analysis
Model is conditional, parameter estimates are conditioned on other variables in model
Perform overall test of regression
If significant, examine individual predictors
Relative importance of predictors by p-values (or standardized coefficients)
Multiple Regression Analysis
Predictors can be continuous, indicator variables (0/1) or a set of dummy variables
Dummy variables (for categorical predictors)
Race: white, black, Hispanic
Black (1 if black, 0 otherwise)
Hispanic (1 if Hispanic, 0 otherwise)
Definitions
Confounding – the distortion of the effect of a risk factor on an outcome
Effect Modification – a different relationship between the risk factor and an outcome depending on the level of another variable
Multiple Regression for SBP: Comparison of Parameter Estimates
            Simple Models        Multiple Regression
            b        p           b        p
Age         1.03     <.0001      0.86     <.0001
Male        -2.26    .0009       -2.22    .0002
BMI         1.80     <.0001      1.48     <.0001
BP Meds     33.38    <.0001      24.12    <.0001
Focus on the association between BP meds and SBP…
RCT of New Drug to Raise HDL: Example of Effect Modification
Women N Mean Std Dev
New drug 40 38.88 3.97
Placebo 41 39.24 4.21
Men N Mean Std Dev
New drug 10 45.25 1.89
Placebo 9 39.06 2.22
Simple Logistic Regression
Outcome is dichotomous (binary)
We model the probability p of having the disease.
$p = \frac{e^{b_0 + b_1 X}}{1 + e^{b_0 + b_1 X}}$
$\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x$
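The two forms of the simple logistic model are inverses of each other, which a short Python sketch can confirm. The coefficients b0 and b1 and the value of x below are hypothetical, chosen only for illustration, and the function names are not from the review.

```python
import math

def probability_from_linear_predictor(lp):
    """p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    return math.exp(lp) / (1 + math.exp(lp))

def logit(p):
    """logit(p) = ln(p / (1 - p)), the log odds."""
    return math.log(p / (1 - p))

# Hypothetical coefficients and predictor value.
b0, b1, x = -2.0, 0.5, 3.0
p = probability_from_linear_predictor(b0 + b1 * x)
print(round(p, 4), round(logit(p), 4))
```

Applying the logit to the modeled probability returns the linear predictor b0 + b1 x, which is why the model is linear on the log odds scale.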
Multiple Logistic Regression
Outcome is dichotomous (1=event, 0=non-event) and p=P(event)
Outcome is modeled as log odds
$\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$
Multiple Logistic Regression for Birth Defect (Y/N)
Predictor   b        p        OR (95% CI for OR)
Intercept   -1.099   0.0994
Smoke       1.062    0.2973   2.89 (0.34, 22.51)
Age         0.298    0.0420   1.35 (1.02, 1.78)
Interpretation of OR for age:
The odds of having a birth defect for the older of two mothers differing in age by one year is estimated to be 1.35 times higher after adjusting for smoking.
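The odds ratios in the table are the exponentiated regression coefficients, which a minimal Python sketch can confirm. The coefficient values come from the table above; the dictionary and variable names are illustrative.

```python
import math

# Regression coefficients (b) from the birth defect model.
coefficients = {"Smoke": 1.062, "Age": 0.298}

# OR = exp(b) for each predictor, adjusted for the other predictor in the model.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
print({name: round(value, 2) for name, value in odds_ratios.items()})
```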
Survival Analysis
Outcome is the time to an event.
An event could be heart attack, cancer remission or death.
Measure whether person has event or not (Yes/No) and if so, their time to event.
Determine factors associated with longer survival.
Survival Analysis
Incomplete follow-up information
Censoring: measure follow-up time and not time to event
We know survival time > follow-up time
Log rank test to compare survival in two or more independent groups
Survival Curve – Survival Function
Comparing Survival Curves
H0: Two survival curves are equal
χ² Test with df=1. Reject H0 if χ² > 3.84
χ² = 6.151. Reject H0.
Cox Proportional Hazards Model
Model:
ln(h(t)/h0(t)) = b1X1 + b2X2 + … + bpXp
Exp(bi) = hazard ratio
Model used to jointly assess effects of independent variables on outcome (time to an event).
NBPHE
Questions??
Good Luck!