A Lecture on Sample Size and Statistical Inference for Health Researchers

Statistical Inference and Sample Size


Arindam [email protected]

2015-03-18


What Shall We Learn?

Revise concepts on probability

Statistical Inference - Estimation

Concept of Hypothesis Testing

Concepts of Sampling and Sample Size


Approaches to Population Parameters

We'd like to know about population parameters

Parameters are unknown

As a result, we calculate or study statistics in samples

Estimate parameters in the population from statistics in samples

We also test hypotheses about parameters using our samples


Concepts of Probability


Theory of Probability and Inference

A trial/experiment has a set of specified outcomes

The outcome of one trial does not influence the outcome of another trial

The trials are identical

Probabilities provide a link between a population and samples


Independence Law of Probability

Two outcomes are statistically independent if the probability of their joint occurrence is the product of the probabilities of occurrence of each outcome


C and D Are Independent

P(CD) = P(C) * P(D), where P(CD) = joint probability of event C and event D


Examples of Independent Events

If Repetitions are Independent, then they are from a Random Sample

Random Sample is About the Method that Produces the Sample


Law of Mutually Exclusive

Two outcomes are mutually exclusive if at most one of them can occur at a time; that is, the outcomes do not overlap


C and D are Mutually Exclusive

P(C OR D) = P(C) + P(D)


Examples of Mutually Exclusive Events

Dead or Alive Outcomes

Head or Tail in a Toss

Vaginal OR Caesarian Section as Modes of Delivery

NZ European OR Asian OR Maori OR Pacific Islander

Others??


Not All Outcomes are Mutually Exclusive

Figure: Not all outcomes mutually exclusive


What is the Probability of Overweight OR Having High Blood Pressure?

P(HTN only) + P(Overwt only) + P(Overwt AND HTN) = 0.1 + 0.2 + 0.1 = 0.4


Question: What is the sum of two marginal probabilities?

P(Overwt) + P(HTN) = 0.3 + 0.2 = 0.5


What Happens When we Remove the Joint Occurrences?

P(Overwt OR HTN) = P(Overwt) + P(HTN) - P(Overwt AND HTN) = 0.3 + 0.2 - 0.1 = 0.4

Thus in this case overweight and hypertension are NOT mutually exclusive


Law of Addition

By the addition rule, for any two outcomes, the probability of occurrence of either outcome or both is the sum of the probabilities of each occurring minus the probability of their joint occurrence
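As a minimal sketch of the addition rule, the snippet below plugs in the overweight and hypertension probabilities from the preceding slides. The function name `p_either` is an illustrative choice, not something from the lecture.

```python
# Marginal and joint probabilities taken from the overweight/hypertension slides.
p_overwt = 0.3  # P(Overwt)
p_htn = 0.2     # P(HTN)
p_both = 0.1    # P(Overwt AND HTN)


def p_either(p_c, p_d, p_joint):
    """Addition rule: P(C OR D) = P(C) + P(D) - P(C AND D)."""
    return p_c + p_d - p_joint


p_union = p_either(p_overwt, p_htn, p_both)
print(round(p_union, 3))  # 0.4

# For mutually exclusive outcomes the joint probability is 0,
# so the rule reduces to a plain sum (head OR tail in one toss).
p_head_or_tail = p_either(0.5, 0.5, 0.0)
print(round(p_head_or_tail, 3))  # 1.0
```

Subtracting the joint probability stops the overlapping region from being counted twice.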


Law of Conditional Probability

For any two outcomes C and D, the conditional probability of the occurrence of C given the occurrence of D, written P(C | D) and read as "probability of C GIVEN D", is given by


C is Conditional on D

P(C|D) = P(CD) / P(D)
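A small sketch of the conditional probability formula, reusing the overweight (0.3), hypertension (0.2) and joint (0.1) probabilities from the earlier slides. The helper name `conditional` is an assumption made for illustration.

```python
import math

p_overwt, p_htn, p_both = 0.3, 0.2, 0.1


def conditional(p_joint, p_given):
    """P(C | D) = P(C AND D) / P(D); undefined when P(D) = 0."""
    if p_given <= 0:
        raise ValueError("P(D) must be positive")
    return p_joint / p_given


p_htn_given_overwt = conditional(p_both, p_overwt)
p_overwt_given_htn = conditional(p_both, p_htn)
print(round(p_htn_given_overwt, 3))  # 0.333
print(round(p_overwt_given_htn, 3))  # 0.5

# Independence would require P(Overwt AND HTN) == P(Overwt) * P(HTN).
independent = math.isclose(p_both, p_overwt * p_htn)
print(independent)  # False: 0.1 is not 0.3 * 0.2 = 0.06
```

Knowing that someone is overweight raises the probability of hypertension from 0.2 to about 0.33, which is another way of seeing that the two outcomes are not independent.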


Concepts of Randomness


What is a Random Variable?

A Variable Associated with a Random Sample

The process that generates that variable must be random

The likelihood of Person 1 being selected will have nothing to do with the likelihood of Person 2 being selected

Empirical relative frequency of occurrence of a value of the variable becomes an estimate of the probability of occurrence of that value


Consider this: Number of Boys in Families of Eight

Figure: Number of Boys in Families of 8 Children


Calculate: What is the probability of

Finding Exactly Two Boys in that Family?

P(Number of Boys = 2) = 0.0993

Finding None, One, or Two Boys in the Family?

P(Number = 0) + P(Number = 1) + P(Number = 2) = 0.1310


Probability Distribution Function of this Data

Figure: Probability Distribution of Boys in Families of 8 Children


Types of Variables

Discrete

Nominal

Ordinal

Continuous

Interval

Ratio


Probability with Continuous Random Variable

What is the probability of finding someone with weight exactly 50 kg?

Answer = 0! (i.e., exactly 50.000 and not 50.001 kg)

We can find someone in the interval between 49.5 and 50.5 kg

Convert continuous variables into intervals -> treat midpoints like discrete values -> list the associated probabilities


Start with the Barplot of Relative Frequencies

Figure: Barplot of Relative Frequencies


The curve would take a smooth shape

Figure: Line Plot of Relative Frequencies


Probability Density Function

A probability density function is a curve that specifies, by means of the area under the curve over an interval, the probability that a continuous random variable falls within the interval. The total area under the curve is 1


How to Calculate the Average of a Discrete Random Variable?

E(Y) = Σ (p * y), where E(Y) = expected value of Y, p = proportion, y = individual values
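The expected value formula can be sketched as below. The probabilities here are an assumption made purely for illustration: a binomial model with 8 children and P(boy) = 0.5, not the observed frequencies behind the lecture's figure.

```python
from math import comb

n_children = 8
p_boy = 0.5  # assumed for illustration only

values = list(range(n_children + 1))  # possible numbers of boys: 0..8
probs = [
    comb(n_children, y) * p_boy**y * (1 - p_boy) ** (n_children - y)
    for y in values
]


def expected_value(values, probs):
    """E(Y) = sum of p * y over all possible values y."""
    return sum(p * y for y, p in zip(values, probs))


mean_boys = expected_value(values, probs)
print(mean_boys)  # 4.0
```

Each value is weighted by how often it occurs, so the result is the long-run average number of boys per family under the assumed model.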


What is Normal Distribution?

Population = Set of All Possible Values of a Variable

Random Selection of Objects makes the variable a Random Variable

Challenge: Find a model that has few parameters and can apply to real data

Normal or Gaussian distribution is a Statistical model


Why is Normal Distribution Popular?

It works!

Central Limit Theorem

Practical


Central Limit Theorem

If a random variable Y has population mean µ and population variance σ², the sample mean ȳ, based on n observations, is approximately normally distributed with mean µ and variance σ²/n (that is, standard error σ/√n), for sufficiently large n


Central Limit Theorem in Simple Terms

Means of random samples from (almost) any distribution will be approximately normally distributed, provided the samples are large enough

Reassuring even when we do not know the nature of the original distribution
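A simulation sketch of this idea, under assumptions chosen only for illustration: the parent distribution is exponential (strongly skewed, mean 1, standard deviation 1), each sample has 50 observations, and 2000 samples are drawn with a fixed seed.

```python
import random
import statistics

random.seed(2015)  # fixed seed so the run is repeatable

n = 50            # observations per sample (assumed)
n_samples = 2000  # number of repeated samples (assumed)

# Exponential with rate 1: skewed, population mean 1, population SD 1.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = 1.0 / n**0.5

# Share of sample means lying within 1.96 standard errors of the true mean.
within = sum(abs(m - 1.0) <= 1.96 * theoretical_se for m in sample_means) / n_samples

print(round(theoretical_se, 3))  # 0.141
print(round(mean_of_means, 3), round(sd_of_means, 3), round(within, 3))
```

If the approximation holds, the mean of the sample means should sit near 1, their standard deviation near σ/√n, and roughly 95% of them within 1.96 standard errors of the population mean, even though the parent distribution is far from normal.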


CLT Helps Us to Calculate the Confidence Intervals

Figure: 95 pct confidence interval


A Table that Helps You to Calculate the 95% CI

Figure: z value table


Example of a Normal Distribution

Figure: Density Plot


Statistics Are Random

A statistic associated with a random sample is a random variable


Illustration with an Example of IQ distribution

Figure: IQ Distribution


Points to Note

Reducing the variability of the sample mean (its standard error) by a factor of 2 will require a 4-fold increase in sample size


Note: if you have 100 participants, and can add another 10, don't bother

Figure: Extra 10 pct not worth
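Both points follow from the standard error σ/√n. The sketch below uses σ = 1 as an arbitrary placeholder, since only the ratios matter.

```python
import math


def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


sigma = 1.0  # placeholder value; the ratios below do not depend on it

se_100 = standard_error(sigma, 100)
se_400 = standard_error(sigma, 400)
se_110 = standard_error(sigma, 110)

# Quadrupling the sample size halves the standard error.
ratio_4x = se_100 / se_400
print(round(ratio_4x, 2))  # 2.0

# Going from 100 to 110 participants shrinks it by under 5 percent.
pct_gain = 100 * (1 - se_110 / se_100)
print(round(pct_gain, 1))  # 4.7
```

A 10% larger sample buys less than a 5% narrower interval, which is why the extra 10 participants are rarely worth the effort.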


Example: Birth Weight of Babies with SIDS (Sudden Infant Death Syndrome)

In one city, 78 babies died with a diagnosis of SIDS. Their birth certificates were obtained, and the mean birthweight of these 78 babies was 2994 grams. It is also known that in this population the standard deviation of birthweight is about 800 grams.

What is the 95% confidence interval for the mean birthweight of these SIDS infants?


Answer to the Birth Weight Question

At the lower limit: 2994 - 1.96 * (800 / √78) = 2816

At the higher limit: 2994 + 1.96 * (800 / √78) = 3172

What if we wanted to be MORE confident? Say 99% confident?


Answer to the 99% Confidence Interval

Lower Limit = 2994 - (2.58 * 800 / √78) = 2760

Upper Limit = 2994 + (2.58 * 800 / √78) = 3228
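The two intervals can be reproduced with a few lines of code. This is a sketch that takes the z multipliers as rounded on the slides (1.96 and 2.58); the function name `z_interval` is illustrative.

```python
import math


def z_interval(sample_mean, sigma, n, z):
    """Confidence interval for a mean when the population SD is known."""
    se = sigma / math.sqrt(n)  # standard error of the mean
    return sample_mean - z * se, sample_mean + z * se


# SIDS birthweight example: mean 2994 g, sigma 800 g, n = 78.
ci95 = z_interval(2994, 800, 78, 1.96)
ci99 = z_interval(2994, 800, 78, 2.58)

print([round(x) for x in ci95])  # [2816, 3172]
print([round(x) for x in ci99])  # [2760, 3228]
```

The only thing that changes between the two calls is the multiplier, so the 99% interval is wider by the ratio 2.58 / 1.96.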


Interpretations of Confidence Intervals

As the confidence level increases, the interval gets wider.

Why can this be?

This is the price we pay for making sure we have straddled the population mean

As we decrease α, we increase the level of confidence

If we want to decrease the width, then we either decrease confidence or increase sample size


Steps of Estimation

Start with sample statistic

Make a statement about the population parameter

We use a confidence interval to indicate that our interval straddles the parameter

Flip this over, and we get hypothesis testing


Steps of Hypothesis Testing

Start by assuming a parameter value

Make a probability statement about the value of statistic

Measure "how far" an observed statistic is from a hypothesised parameter

If the distance is GREAT, we argue the hypothesised parameter is INCONSISTENT with the data -> reject the hypothesis


Concepts of Distance in Hypothesis Testing

Take the basic variability of the observations (variance, σ²)

Take the sample size (N)

If the observed value of the statistic is >= 2 standard errors from the hypothesised value of the parameter, question the truth of the hypothesis

This is because the data do not match the hypothesis


Example: Are the SIDS babies' birthweights different from the normal population?

Mean birthweight of our sample of babies (N = 78) = 2994 g

We know standard deviation of population = 800 g

Therefore standard error = 800 / √78 = 90.6 g

For general population, average birth weight = 3300 g.

Is our sample birthweight consistent with this?


How far is the SIDS birthweight from the average birthweight?

Figure: SIDS Birth Weight


Conclusions from SIDS Study

The observed difference = 3300 - 2994 = 306 g

This is 306/90.6 = 3.38 standard errors away from the hypothesised mean

It is a GREAT distance away by our rule

Hence the SIDS babies sample is inconsistent with what is expected!

The SIDS babies come from a DIFFERENT population, with lower mean birthweight

What are other challenges to this?


Where in the Normal Distribution Do We Find this Standard Error?

Figure: Area of Observed Value


Can We Associate a Probability Value with this Tail Estimate?

The area in the tail beyond the observed distance in standard errors (here 3.38) is the p-value

We know for z = 1.96, p-value = 0.025

We know for z = 2.58, p-value = 0.005
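These tail areas can be checked with the standard normal distribution in Python's standard library. The sketch below computes one-tailed areas only; the SIDS mean lies below the hypothesised mean, so the relevant tail is the lower one, which by symmetry has the same area as the upper tail at the absolute value of z.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def upper_tail(z):
    """Area under the standard normal curve to the right of z."""
    return 1.0 - std_normal.cdf(z)


print(round(upper_tail(1.96), 3))  # 0.025
print(round(upper_tail(2.58), 3))  # 0.005

# SIDS example: hypothesised mean 3300 g, sample mean 2994 g, sigma 800 g, n = 78.
se = 800 / math.sqrt(78)
z_sids = (3300 - 2994) / se
p_sids = upper_tail(abs(z_sids))

print(round(z_sids, 2))  # 3.38
print(p_sids < 0.001)    # True
```

The one-tailed p-value for the SIDS sample is well below 0.005, so the observed mean sits far out in the tail of the distribution expected under the hypothesised mean.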


What if our statistic fell within the 2 standard errors?

We set it up before the data gathering as follows:


Figure: sample space


Concepts Related to Hypothesis Testing

Null Hypothesis - specifies the hypothesised real value for the parameter

Alternative Hypothesis - value or range of values for the parameter when the null hypothesis is rejected

Rejection Region - values of the statistic for which the null hypothesis is rejected


Key Table of Hypothesis Testing

Figure: Table of Hypothesis Testing


Applying this to our SIDS study

Figure: SIDS sample space


Rejection Regions - One tailed versus Two tailed

Figure: one tailed


For One tailed tests with same alpha, widen rejection region

Figure:


Summary of Statistical Inference

Define population

Specify parameters

Take random sample from the population

Estimate the parameter from the sample statistic

Test hypotheses about the parameter using the sample statistic


Review: Assumptions of Hypothesis Testing

We knew the population variance and used the sample mean to estimate the population mean

What Happens when:

We do not know either the population mean or the variance?

How do we compare two normal populations?

How do we estimate sample sizes?


Need for a Pivotal Variable

Think of a randomly selected sample whose mean is calculated

That mean follows a normal distribution and estimates the population mean

The sample variance (or standard deviation) estimates the variance of the population as well

Pivotal Variable is the link


Pivotal Variable

Z = (ȳ - µ) / (σ / √n)

Chi-square = ((N - 1) * s²) / σ², where s = sample standard deviation


Requirements of a Pivotal Variable

At least a statistic,

And a parameter

Distribution of Z or Chi-square is fixed

Our confidence intervals needed Z or chi-square

These quantities are known as Pivotal Variables


From Z to T

Random sample picked from a normal distribution and we know the variance (σ²)

Then, Z is our pivotal quantity, which has a Normal(0,1) distribution.

What happens when we do not know the population variance but need to estimate the population mean from the sample?

The corresponding pivotal variable is 't', after Student (William Gosset)

T = (ȳ - µ) / (s / √N)

What is the distribution of 't'?


Properties of 'T' Distribution

Similar to Normal Distribution

Depends on N

Indexed by its degrees of freedom, N - 1, as chi-square is

Bell shaped, symmetrical about 0

As N approaches infinity, t becomes similar to Normal


Concept: SIDS problem now in terms of t-statistic

This time we do not know the population variance and would like to estimate the population mean

Sample mean birthweight ȳ = 3199.8 g (N = 15)

Sample standard deviation s = 663 g


Challenge

Without assuming population variance, can we

Obtain an interval estimate of the population mean?

Test Null Hypothesis that

Birthweight Average of SIDS Cases is 3300g?

T-value for 14 df = 2.14

Hence, upper limit = 3199.8 + (2.14 * 663 / √15) = 3566

Lower limit = 3199.8 - (2.14 * 663 / √15) = 2833

Note that the confidence interval is wider
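A sketch of the same calculation in code, using the figures on these slides (mean 3199.8 g, s = 663 g, 15 babies, t = 2.14 for 14 df). The critical value is passed in as given on the slide rather than computed, and the function name `t_interval` is illustrative.

```python
import math


def t_interval(sample_mean, s, n, t_crit):
    """Confidence interval for a mean using the sample SD and a t multiplier."""
    se = s / math.sqrt(n)  # estimated standard error of the mean
    return sample_mean - t_crit * se, sample_mean + t_crit * se


lower, upper = t_interval(3199.8, 663, 15, 2.14)
print(round(lower), round(upper))  # 2833 3566

# Test of the null hypothesis that the mean is 3300 g.
t_stat = (3199.8 - 3300) / (663 / math.sqrt(15))
print(round(t_stat, 2))      # -0.59
print(abs(t_stat) > 2.14)    # False: 3300 g lies inside the interval
```

With this small sample the hypothesised mean of 3300 g falls inside the interval, so on these figures the null hypothesis would not be rejected at the 5% level.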


Hypothesis Testing for Paired Data

Paired Data = Repeated or Multiple Measurements on the same participants

Example: before and after measurement of pain following administration of analgesics

We want to look at differences within pairs

Has the mean of the sample differences come from a population of differences with mean 0?

Assume that this difference is normally distributed


Example: Aminophylline Challenge Study

Children with apnea were administered aminophylline

Apnea episodes were measured 16 hours after administration and compared with episodes in the 24 hours before administration

Average change for 13 children = 0.767

SD of the change for 13 children = 0.524

T value for 12 df = 2.18

If we consider no change = 0, then,

Limits of the non-rejection region = 0 ± (2.18 * 0.524 / √13) = -0.317 and 0.317

0.767 falls outside these limits, i.e., in the rejection region.

We reject the null hypothesis
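The paired test reduces to a one-sample t test on the differences. The sketch below uses the summary figures from this slide (mean change 0.767, SD 0.524 as used in the calculation, 13 children, t = 2.18 for 12 df); the variable names are illustrative.

```python
import math

mean_diff = 0.767  # average change across the 13 children
sd_diff = 0.524    # standard deviation of the changes
n_pairs = 13
t_crit = 2.18      # t value for 12 df, as given on the slide

se_diff = sd_diff / math.sqrt(n_pairs)

# Under the null hypothesis the mean difference is 0, so the sample mean
# should fall within 0 +/- t_crit * SE if the null is plausible.
limit = t_crit * se_diff
t_stat = (mean_diff - 0) / se_diff

print(round(limit, 3))   # 0.317
print(round(t_stat, 2))  # 5.28
print(abs(mean_diff) > limit)  # True: reject the null hypothesis
```

Comparing the observed mean with the limits and comparing the t statistic with 2.18 are the same test expressed on two different scales.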


Sampling


Importance of Sampling

Save time and money

Measurements can be more accurate when done on smaller numbers

Therefore choose the method with the most accuracy and precision


Alternatives to Sampling

Census - Expensive

Volunteer based reporting

Early responders are different from late responders and both are different from members of the general public ("Worried Well")

Let the Interviewer Choose ("Choose those who are easiest to find")


Concepts of Sampling

Capture as many respondents as you can

Also try to capture data from nonrespondents

Response rates: 60% or less from postal questionnaires even after the 3rd posting, while 70–75% for interviewer-based sampling (Jennifer Kelsey et al., 2007)

For prevalence estimation, both the completely healthy and those with disease may not want to participate

For common but untreatable conditions like back pain, people with intractable problems are over-represented, in the hope that "research" will solve their problems


Key Definitions of Terms of Sampling

Sampling Unit = the basic unit around which sampling is planned (households, persons)

Sampling Frame = collection of sampling units

Probability Sampling = where each sampling unit has a known, nonzero probability of being included in the sample

Nonprobability Sampling = Convenience Sampling


Types of Probability Sampling

Simple Random Sampling

Systematic Sampling

Stratified Sampling

Cluster Sampling

Multistage Sampling


Simple Random Sampling

Each unit has EQUAL probability of being included

Uses Random Numbers Table

With Replacement and Without Replacement (see R examples)
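The lecture points to R examples that are not reproduced here. As a stand-in, the sketch below shows the same idea in Python, with a hypothetical sampling frame of 20 unit IDs and a fixed seed.

```python
import random

rng = random.Random(42)     # seeded generator so the draw is repeatable
frame = list(range(1, 21))  # hypothetical sampling frame: unit IDs 1..20

# Without replacement: each unit can appear at most once.
without_replacement = rng.sample(frame, k=5)

# With replacement: the same unit may be drawn more than once.
with_replacement = rng.choices(frame, k=5)

print(without_replacement)
print(with_replacement)
```

`random.sample` gives every unit the same chance of inclusion and never repeats a unit, while `random.choices` draws each time from the full frame.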


Problems of Simple Random Sampling

Investigator needs to know the sampling frame before starting

If the randomising process is not robust or well done, there can be errors

Not suitable for all situations

Problem: if the investigator wants to find out family size and does so by simple random sampling of children from a school, there is a problem.

Children from larger families will be oversampled, and this can lead to errors


Systematic Sampling

The sampling units are regularly spaced throughout the sampling frame

Investigator selects every kth sampling unit

Advantages: investigator does not need to know the sampling frame in advance

Example: every 3rd newborn child in a hospital
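A minimal sketch of systematic sampling with a random start. The list of 30 newborn IDs is hypothetical and is built in advance only so the example can run; in practice the units could be taken as they arrive.

```python
import random


def systematic_sample(units, k, rng):
    """Pick a random start among the first k units, then every kth unit."""
    start = rng.randrange(k)
    return units[start::k]


rng = random.Random(7)
newborns = [f"baby_{i:03d}" for i in range(1, 31)]  # hypothetical IDs

every_third = systematic_sample(newborns, 3, rng)
print(len(every_third))  # 10
print(every_third[:3])
```

The random start keeps the first selection from being the investigator's choice; after that the spacing is fixed.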


Advantages and Disadvantages of Systematic Sampling

Simple to implement (just select the nth sample unit)

Can capture patterns easily

If a cyclical pattern exists, systematic sampling can miss the pattern entirely, e.g., seasonal trends such as flu patterns

Cannot estimate variance of population reliably from a SINGLE sample; needs at least two samples


Stratified Sampling

Divide population into strata or uniform groups

Draw Sample from each stratum

Represents Each subgroup

Can get more precise estimates than a corresponding simple random sample

Can Assign Weights

Widely Used Strategy

Disadvantage: estimates are imprecise for strata from which too few units are selected
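A sketch of proportionate stratified sampling, where the same fraction is drawn from every stratum. The strata names, their sizes and the 5% sampling fraction are all hypothetical.

```python
import random


def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the same fraction from each stratum."""
    sample = {}
    for name, units in strata.items():
        n_take = max(1, round(fraction * len(units)))  # at least one unit per stratum
        sample[name] = rng.sample(units, n_take)
    return sample


rng = random.Random(1)
strata = {
    "urban": [f"u{i}" for i in range(600)],
    "rural": [f"r{i}" for i in range(300)],
    "remote": [f"m{i}" for i in range(100)],
}

picked = stratified_sample(strata, 0.05, rng)
sizes = {name: len(units) for name, units in picked.items()}
print(sizes)  # {'urban': 30, 'rural': 15, 'remote': 5}
```

The smallest stratum ends up with only 5 units, which illustrates the disadvantage noted above: small strata may need a larger sampling fraction, with weights used to correct the estimates afterwards.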


Cluster Sampling

Sample Clusters rather than individuals

In the sampling frame, identify clusters (say classrooms, or households, or similar units)

Then examine everyone within each sampled cluster

Want to study prevalence of dental caries in schoolchildren? Divide schools into classrooms, sample individual classrooms, and examine all children in the sampled classrooms


Advantages of Cluster Sampling

Need not enumerate entire population in advance

Economical Use of Resources


Multistage Sampling

Identify Primary Sampling Units that are Larger

From the Primary Sampling Units identify secondary sampling units

Sample from the secondary sampling units or extend the process further

Different from Cluster Sampling

In cluster sampling one selects everyone from the secondary unit; here the secondary unit is sampled

Can use different sampling procedures at different stages
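The difference between cluster and multistage sampling can be sketched with the classroom example. The six classrooms of 20 children each are hypothetical, as are the function names.

```python
import random

rng = random.Random(3)

# Hypothetical frame: 6 classrooms, 20 children each.
classrooms = {
    f"class_{c}": [f"class_{c}_child_{i}" for i in range(20)] for c in "ABCDEF"
}


def cluster_sample(clusters, n_clusters, rng):
    """Sample whole clusters and take EVERY unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [child for name in chosen for child in clusters[name]]


def two_stage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Sample clusters first, then SAMPLE units within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [
        child
        for name in chosen
        for child in rng.sample(clusters[name], n_per_cluster)
    ]


cluster = cluster_sample(classrooms, 2, rng)
two_stage = two_stage_sample(classrooms, 2, 5, rng)
print(len(cluster), len(two_stage))  # 40 10
```

Both designs start by sampling classrooms; only the second stage differs, and a further stage could be added in the same way.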


Sample Size Calculations

We need to know at least:

How variable are the data

How willing you are to wrongly conclude that there is an effect when there is none (Type I error)

What is the magnitude of effect you want to detect

What is the certainty with which you want to detect the effect (power)


Importance of these criteria

The more variation in data, the more observations you need

The more certain you want to be, the more observations you will need

If the difference is very large, you need fewer people

If the difference is very small, you need more people


The Formula

N = 2 * (z_(1-α/2) + z_(1-β))² / ∆²

Where ∆ = (µ1 - µ2) / σ, and z_(1-α/2) and z_(1-β) are standard normal quantiles
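A sketch of this formula in code. It assumes the usual reading of N as the number needed in each of two groups being compared, and the birthweight figures in the example (3300 g versus 2900 g, σ = 800 g, α = 0.05, power = 0.80) are hypothetical inputs chosen for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(mu1, mu2, sigma, alpha=0.05, power=0.80):
    """N = 2 * (z_(1-alpha/2) + z_(1-beta))^2 / delta^2, with beta = 1 - power."""
    delta = (mu1 - mu2) / sigma  # standardized difference
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 / delta**2


# Detect a 400 g difference (delta = 0.5).
n_half_sd = n_per_group(3300, 2900, 800)
print(math.ceil(n_half_sd))  # 63

# Detect a 200 g difference (delta = 0.25): four times as many.
n_quarter_sd = n_per_group(3300, 3100, 800)
print(math.ceil(n_quarter_sd))  # 252
```

Results are rounded up because a fraction of a participant cannot be recruited. Halving the difference to be detected quadruples the required sample size, since ∆ enters the formula as a square.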


What is the significance of this formula

The standardized difference enters the formula as a square

The narrower the difference, the correspondingly larger the sample size required


Summary

This brief tour provides a snapshot of core statistical thinking

We focused on relevant study design issues

We learned about basic probability

We learned about Distributions (Z, T)

We learned about principles of estimation and hypothesis testing

We learned about sampling and sample sizes
