Statistical Inference and Sample Size
Arindam Basu
2015-03-18
What Shall We Learn?
Revise concepts on probability
Statistical Inference - Estimation
Concept of Hypothesis Testing
Concepts of Sampling and Sample Size
Approaches to Population Parameters
We'd like to know about population parameters
Parameters are unknown
As a result, we calculate statistics from samples
We estimate parameters in the population from statistics in samples
We also test hypotheses about parameters using our samples
Concepts of Probability
Theory of Probability and Inference
A trial/experiment has a set of specified outcomes
The outcome of one trial does not influence the outcome of another trial
The trials are identical
Probabilities provide a link between a population and samples
Independence Law of Probability
Two outcomes are statistically independent if the probability of their joint occurrence is the product of the probabilities of occurrence of each outcome
C and D Are Independent
P(C AND D) = P(C) * P(D), where P(C AND D) = joint probability of event C and event D
Examples of Independent Events
If Repetitions are Independent, then they are from a Random Sample
Random Sample is About the Method that produces the Sample
Law of Mutually Exclusive
Two outcomes are mutually exclusive if at most one of them can occur at a time; that is, the outcomes do not overlap
C and D are Mutually Exclusive
P(C OR D) = P(C) + P(D)
Examples of Mutually Exclusive Events
Dead or Alive Outcomes
Head or Tail in a Toss
Vaginal OR Caesarian Section as Modes of Delivery
NZ European OR Asian OR Maori OR Pacific Islander
Others??
Not All Outcomes are Mutually Exclusive
Figure: Not all outcomes mutually exclusive
What is the Probability of Overweight OR Having High Blood Pressure?
P(Overwt only) + P(HTN only) + P(Overwt AND HTN) = 0.2 + 0.1 + 0.1 = 0.4
Question: What is the sum of two marginal probabilities?
P(Overwt) + P(HTN) = 0.3 + 0.2 = 0.5
What Happens When we Remove the Joint Occurrences?
P(Overwt OR HTN) = P(Overwt) + P(HTN) - P(Overwt AND HTN) = 0.3 + 0.2 - 0.1 = 0.4
Thus in this case O and H are NOT mutually exclusive
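The overweight/hypertension arithmetic can be checked with a short Python sketch. The function name `prob_either` is illustrative only; the three probabilities are the ones quoted on the slides.

```python
# Marginal and joint probabilities quoted on the slides.
p_overwt = 0.3   # P(Overwt)
p_htn = 0.2      # P(HTN)
p_both = 0.1     # P(Overwt AND HTN)


def prob_either(p_c, p_d, p_joint):
    """P(C OR D): sum of the marginals minus the joint probability."""
    return p_c + p_d - p_joint


p_either = prob_either(p_overwt, p_htn, p_both)
naive_sum = p_overwt + p_htn  # counts the overlap twice

print(round(p_either, 2), round(naive_sum, 2))
```

For mutually exclusive outcomes the joint probability is 0, so the same function reduces to P(C) + P(D).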
Law of Addition
By the addition rule, for any two outcomes, the probability of occurrence of either outcome or both is the sum of the probabilities of each occurring minus the probability of their joint occurrence
Law of Conditional Probability
For any two outcomes C and D, the conditional probability of the occurrence of C given the occurrence of D, written P(C | D), is given by
C is Conditional on D
P(C|D) = P(C AND D) / P(D)
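As a sketch, the rule can be applied to the overweight/hypertension probabilities used earlier in these slides (0.3, 0.2 and a joint probability of 0.1). The helper name `conditional` is an assumption made for illustration.

```python
def conditional(p_joint, p_given):
    """P(C | D) = P(C AND D) / P(D); undefined when P(D) is 0."""
    if p_given <= 0:
        raise ValueError("P(D) must be positive")
    return p_joint / p_given


p_overwt, p_htn, p_both = 0.3, 0.2, 0.1

p_overwt_given_htn = conditional(p_both, p_htn)      # 0.1 / 0.2
p_htn_given_overwt = conditional(p_both, p_overwt)   # 0.1 / 0.3

# Independence would require P(Overwt AND HTN) == P(Overwt) * P(HTN).
is_independent = abs(p_both - p_overwt * p_htn) < 1e-9

print(p_overwt_given_htn, round(p_htn_given_overwt, 3), is_independent)
```

Here P(Overwt | HTN) differs from P(Overwt), which is another way of seeing that the two outcomes are not independent.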
Concepts of Randomness
What is a Random Variable?
A Variable Associated with Random Sample
The process that generates that variable must be random
The Likelihood of Person 1 being selected will have nothing to do with the likelihood of Person 2 being selected
Empirical relative frequency of occurrence of a value of the variable becomes an estimate of the probability of occurrence of that value
Consider this: Number of Boys in Families of Eight
Figure: Number of Boys in Families of 8 Children
Calculate: What is the probability of
Finding Exactly Two Boys in that Family?
P(Number of Boys = 2) = 0.0993
Finding None, One, or Two Boys in the Family?
P(Number = 0) + P(Number = 1) + P(Number = 2) = 0.1310
Probability Distribution Function of this Data
Figure: Probability Distribution of Boys in Families of 8 Children
Types of Variables
Discrete
Nominal
Ordinal
Continuous
Interval
Ratio
Probability with Continuous Random Variable
What is the probability of finding someone with weight exactly 50 kg?
Answer = 0! (i.e., exactly 50.000 and not 50.001 kg)
We can find someone in the interval 49.5 to 50.5 kg
Convert continuous variables into intervals -> treat midpoints like discrete values -> list the associated probabilities
Start with the Barplot of Relative Frequencies
Figure: Barplot of Relative Frequencies
The curve would take a smooth shape
Figure: Line Plot of Relative Frequencies
Probability Density Function
A probability density function is a curve that specifies, by means of the area under the curve over an interval, the probability that a continuous random variable falls within the interval. The total area under the curve is 1
How to Calculate the Average of a Discrete Random Variable?
E(Y) = Σ (p * y), where E(Y) = expected value of Y, p = proportion (probability) of each value, y = individual values
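A minimal sketch of the expected-value sum follows. The distribution is a hypothetical one, not the empirical table behind the slides: it assumes the number of boys among 8 children is binomial with P(boy) = 0.5, so its probabilities will not match the observed relative frequencies quoted earlier.

```python
import math

# Hypothetical model (an assumption, not the slide data):
# number of boys among 8 children, each a boy with probability 0.5.
n_children = 8
values = list(range(n_children + 1))
probs = [math.comb(n_children, y) * 0.5 ** n_children for y in values]


def expected_value(ys, ps):
    """E(Y) = sum of p * y over all possible values y."""
    return sum(p * y for y, p in zip(ys, ps))


total_probability = sum(probs)
mean_boys = expected_value(values, probs)

print(total_probability, mean_boys)
```

With the real relative frequencies in place of `probs`, the same function would give the observed average number of boys per family.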
What is Normal Distribution?
Population = Set of All Possible Values of a Variable
Random Selection of Objects makes the variable a Random Variable
Challenge: Find a Model that has few parameters and can be applied to real data
Normal or Gaussian distribution is a Statistical model
Why is Normal Distribution Popular?
It works!
Central Limit Theorem
Practical
Central Limit Theorem
If a random variable Y has population mean µ and population variance σ², the sample mean y, based on n observations, is approximately normally distributed with mean µ and variance σ²/n (that is, standard error σ/√n), for sufficiently large n
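A small simulation can make the theorem concrete. This sketch assumes an exponential population with mean 1 and variance 1 (chosen only because it is clearly non-normal); the seed, sample size and number of replicates are arbitrary.

```python
import random
import statistics

random.seed(1)

n = 50             # observations per sample
replicates = 5000  # number of samples drawn

# Exponential(rate = 1): population mean 1, variance 1, strongly skewed.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(replicates)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = 1.0 / n ** 0.5   # sigma / sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(theoretical_se, 3))
```

The spread of the sample means should sit close to σ/√n, and a histogram of `sample_means` would look roughly bell shaped even though the individual observations do not.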
Central Limit Theorem in Simple Terms
Means of Random Samples from Almost Any Distribution Will be Approximately Normally Distributed when the Samples are Large Enough
Reassuring Even when we do not know the nature of the original distribution
CLT Helps Us to Calculate the Confidence Intervals
Figure: 95 pct confidence interval
A Table that Helps You to Calculate the 95% CI
Figure: z value table
Example of a Normal Distribution
Figure: Density Plot
Statistics Are Random
A statistic associated with a random sample is a random variable
Illustration with an Example of IQ distribution
Figure: IQ Distribution
Points to Note
Reduction of variability (the standard error of the mean) by a factor of 2 will require a 4-fold increase in sample size
Note: if you have 100 participants, and can add another 10, don't bother
Figure: Extra 10 pct not worth
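Both points follow from the standard error being σ/√n. A rough sketch, with σ set to 1 purely for illustration:

```python
import math


def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


sigma = 1.0  # arbitrary; only the ratios matter here

se_100 = standard_error(sigma, 100)
se_110 = standard_error(sigma, 110)
se_400 = standard_error(sigma, 400)

ratio_400 = se_400 / se_100   # four times the sample size
ratio_110 = se_110 / se_100   # ten extra participants

print(round(ratio_400, 3), round(ratio_110, 3))
```

Quadrupling the sample halves the standard error, while going from 100 to 110 participants shrinks it by less than 5 percent.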
Example: Birth Weight of Babies with SIDS (Sudden Infant Death Syndrome)
In one city, 78 babies died with a diagnosis of SIDS. Their birth certificates were obtained, and the mean birthweight of these 78 babies was found to be 2994 grams. It is also known that in this population the standard deviation of birthweight is about 800 grams.
What is the 95% Confidence Interval for the mean birthweight of these SIDS infants?
Answer to the Birth Weight Question
At the lower limit: 2994 - (1.96 * 800 / √78) = 2816
At the higher limit: 2994 + (1.96 * 800 / √78) = 3172
What if we wanted to be MORE confident? Say 99% confident?
Answer to the 99% Confidence Interval
Lower Limit = 2994 - (2.58 * 800 / √78) = 2760
Upper Limit = 2994 + (2.58 * 800 / √78) = 3228
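The two intervals can be reproduced with the Python standard library. This is only a sketch: `z_interval` is an illustrative name, and it uses unrounded normal quantiles (about 1.960 and 2.576), so the 99% limits may differ from the slide values by under a gram.

```python
import math
from statistics import NormalDist


def z_interval(mean, sigma, n, confidence):
    """Confidence interval for a mean when the population SD is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width


# SIDS example: n = 78, sample mean 2994 g, population SD 800 g.
ci95 = z_interval(2994, 800, 78, 0.95)
ci99 = z_interval(2994, 800, 78, 0.99)

width95 = ci95[1] - ci95[0]
width99 = ci99[1] - ci99[0]

print([round(x) for x in ci95], [round(x) for x in ci99])
```

The 99% interval comes out wider than the 95% interval, which is the trade-off discussed on the next slide.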
Interpretations of Confidence Intervals
As the confidence level increases, the interval gets wider.
Why can this be?
This is the price we pay for making sure we have straddled the population mean
As we decrease α, we increase the level of confidence
If we want to decrease the width then we either decrease confidence or increase sample size
Steps of Estimation
Start with sample statistic
State about the population parameter
We use confidence interval to indicate that our interval straddles the parameter
Sort of flip it over, and get hypothesis testing
Steps of Hypothesis Testing
Start by assuming a parameter value
Make a probability statement about the value of statistic
Measure "how far" an observed statistic is from a hypothesised parameter
If the distance is GREAT, we argue hypothesised parameter is INCONSISTENT with the data -> reject the hypothesis
Concepts of Distance in Hypothesis Testing
Take the basic variability of the observations (variance, σ²)
Take the sample size (N)
If the observed value of statistic >= 2 * standard errors from hypothesized value of parameter, question the Truth of Hypothesis
This is because the data do not match the hypothesis
Example: Are the SIDS babies' birthweights different from the normal population?
Mean birthweight of our sample (N = 78) babies = 2994 g
We know standard deviation of population = 800 g
Therefore standard error = 800 / √78 = 90.6 g
For general population, average birth weight = 3300 g.
Is our sample birthweight consistent with this?
How far is the SIDS birthweight from the average birthweight?
Figure: SIDS Birth Weight
Conclusions from SIDS Study
The observed difference = 3300 - 2994 = 306 g
This is 306/90.6 = 3.38 standard errors away from the hypothesised mean
It is a GREAT distance away by our rule
Hence the SIDS babies sample is inconsistent with what is expected!
The SIDS babies come from a DIFFERENT population, one with a lower mean birthweight
What are other challenges to this?
Where in the Normal Distribution Does this Observed Value Lie?
Figure: Area of Observed Value
Can We Associate a Probability Value to this Tail Estimate?
The area in the tail beyond the observed value (here 3.38 standard errors from the hypothesised mean) is the one-tailed p-value
We know for z = 1.96, one-tailed p-value = 0.025
We know for z = 2.58, one-tailed p-value = 0.005
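The tail area for the SIDS example can be computed directly rather than read from a table. A minimal sketch using the figures given in the example (sample mean 2994 g, hypothesised mean 3300 g, SD 800 g, N = 78); the variable names are illustrative.

```python
import math
from statistics import NormalDist

sample_mean, mu0, sigma, n = 2994, 3300, 800, 78

se = sigma / math.sqrt(n)        # standard error of the mean
z = (sample_mean - mu0) / se     # distance in standard errors (negative: below mu0)

p_one_tailed = NormalDist().cdf(-abs(z))  # area in one tail beyond |z|
p_two_tailed = 2 * p_one_tailed

# Check of the reference value quoted on the slide for z = 1.96.
p_at_196 = NormalDist().cdf(-1.96)

print(round(se, 1), round(z, 2), p_one_tailed, p_two_tailed)
```

The observed value lies more than 2 standard errors from 3300 g, and the tail area is well below the 0.025 associated with z = 1.96.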
What if our statistic fell within the 2 standard errors?
We set it up before the data gathering as follows:
Figure: sample space
Concepts Related to Hypothesis Testing
Null Hypothesis - specifies hypothesised real value for parameter
Alternative Hypothesis - Real value or range of values when null hypothesis is rejected
Rejection Region - values of statistic for which null hypothesis is rejected
Key Table of Hypothesis Testing
Figure: Table of Hypothesis Testing
Applying this to our SIDS study
Figure: SIDS sample space
Rejection Regions - One tailed versus Two tailed
Figure: one tailed
For One tailed tests with same alpha, widen rejection region
Figure:
Summary of Statistical Inference
Define population
Specify parameters
Take random sample from the population
Estimate the parameter from the sample statistic
Test Hypotheses about the parameter using the sample statistic
Review: Assumptions of Hypothesis Testing
We knew the population variance and used the sample mean to estimate the population mean
What Happens when:
We do not know either the population mean or the variance?
How do we compare two normal populations?
How do we estimate sample sizes?
Need for a Pivotal Variable
Think of a randomly selected sample whose mean is calculated
That mean follows a normal distribution and estimates the population mean
The variance (or standard deviation of that mean) estimates the variance of the population as well
Pivotal Variable is the link
Pivotal Variable
Z = (y - µ) / (σ / √n)
Chisquare = ((N - 1) * (standard deviation)²) / σ²
Requirements of a Pivotal Variable
At least a statistic,
And a parameter
Distribution of Z or Chi-square is fixed
Confidence intervals need Z or chi-square
These quantities are known as Pivotal Variables
From Z to T
Random sample picked from a normal distribution and we know the variance (σ²)
Then, Z is our pivotal quantity which has a Normal(0,1) distribution.
What happens when we do not know the population variance but need to estimate the population mean from the sample?
The corresponding pivotal variable is 't', after 'Student' (William Gosset)
T = (y - µ) / (s / √N)
What is the distribution of 't'?
Properties of 'T' Distribution
Similar to Normal Distribution
Depends on N
Indexed by N - 1 degrees of freedom, similar to chi-square
Bell shaped, symmetrical about 0
As N approaches infinity, t becomes similar to Normal
Concept: SIDS problem now in terms of t-statistic
This time we do not know the population variance and would like to estimate the population mean
Sample mean birthweight y = 3199.8 g (N = 15)
Standard deviation = 663 g
Challenge
Without assuming population variance, can we
Obtain an interval estimate of the population mean?
Test Null Hypothesis that
Birthweight Average of SIDS Cases is 3300g?
T-value for 14 df = 2.14
Hence, upper limit = 3199.8 + (2.14 * 663 / √15) = 3566
Lower limit = 3199.8 - (2.14 * 663 / √15) = 2833
Note that the confidence interval is wider
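The interval can be sketched in Python as below. The standard library has no t quantile function, so the critical value 2.14 for 14 df is taken from the slide; with SciPy installed, `scipy.stats.t.ppf(0.975, 14)` would be one way to obtain it. The function name `t_interval` is illustrative.

```python
import math


def t_interval(mean, sd, n, t_crit):
    """Confidence interval for a mean using the sample SD and a t critical value."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Sample of 15 SIDS cases: mean 3199.8 g, SD 663 g, t(14 df) = 2.14 from the slide.
lower, upper = t_interval(3199.8, 663, 15, 2.14)

# Does the interval straddle the hypothesised mean of 3300 g?
contains_3300 = lower <= 3300 <= upper

print(round(lower, 1), round(upper, 1), contains_3300)
```

Under these numbers the interval includes 3300 g, so this smaller sample on its own would not lead to rejecting the null hypothesis.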
Hypothesis Testing for Paired Data
Paired Data = Repeated or Multiple Measurements on the same participants
Example: before-and-after measurement of pain following administration of analgesics
We want to look at differences between pairs
Does the mean of the sample differences come from a population of differences with mean 0?
Assume that this difference is normally distributed
Example: Aminophylline Challenge Study
Children with apnea were administered aminophylline
Apnea episodes measured 16 hours after administration were compared with those 24 hours before administration
Average change for 13 children = 0.767
Sd for 13 children = 0.524
T value for 12 df = 2.18
If we consider no change = 0, then,
Rejection region lies outside 0 ± (2.18 * 0.524 / √13), that is, below -0.317 or above 0.317
0.767 falls inside this rejection region.
We reject the null hypothesis
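The same decision can be reached by computing the boundary of the rejection region or, equivalently, the t statistic. A minimal sketch using the summary figures from the slide (the raw paired differences are not given, so none are invented here):

```python
import math

mean_diff = 0.767   # average change for the 13 children
sd_diff = 0.524     # SD of the changes
n = 13
t_crit = 2.18       # t value for 12 df, as quoted on the slide

se = sd_diff / math.sqrt(n)
boundary = t_crit * se      # rejection region is beyond +/- boundary
t_stat = mean_diff / se     # distance from 0 in standard errors

reject_null = abs(mean_diff) > boundary

print(round(boundary, 3), round(t_stat, 2), reject_null)
```

Comparing `mean_diff` with `boundary` and comparing `t_stat` with `t_crit` are the same test written two ways.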
Sampling
Importance of Sampling
Save time and money
Measurements can be more accurate when done on smaller numbers
Therefore choose the method with most accuracy and precision
Alternatives to Sampling
Census - Expensive
Volunteer based reporting
Early responders are different from late responders and both are different from members of the general public ("Worried Well")
Let the Interviewer Choose ("Choose those who are easiest to find")
Concepts of Sampling
Capture as many respondents as you can
Also try to capture data from nonrespondents
Response of 60% or less from postal questionnaires even after 3rd posting, while 70-75% for interviewer based sampling (Jennifer Kelsey et al., 2007)
For prevalence estimation, completely healthy people and those with diseases do not want to participate
For common but untreatable conditions like back pain, people with intractable problems are over-represented in the hope that "research" will solve their problems
Key Definitions of Terms of Sampling
Sampling Unit is the basic unit around which sampling is planned (households, persons)
Sampling Frame = Collection of sampling units
Probability Sampling = where each sampling unit has a nonzero probability of being included in the sample
Nonprobability Sampling = Convenience Sampling
Types of Probability Sampling
Simple Random Sampling
Systematic Sampling
Stratified Sampling
Cluster Sampling
Multistage Sampling
Simple Random Sampling
Each unit has EQUAL probability of being included
Uses Random Numbers Table
With Replacement and Without Replacement (See R examples)
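The R examples mentioned here are not part of this transcript. As a stand-in, a rough Python sketch of both variants, with a made-up sampling frame of 100 unit IDs and an arbitrary seed:

```python
import random

rng = random.Random(42)          # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: unit IDs 1..100.
frame = list(range(1, 101))

# Without replacement: no unit can appear twice.
without_replacement = rng.sample(frame, k=10)

# With replacement: the same unit may be drawn more than once.
with_replacement = rng.choices(frame, k=10)

print(without_replacement)
print(with_replacement)
```

In both cases every unit in the frame has the same chance of selection on each draw, which is what makes the sample a simple random one.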
Problems of Simple Random Sampling
Investigator needs to know the sampling frame before starting
If the randomising process is not robust or well done, there can be errors
Not suitable for all situations
Problem: if the investigator wants to find out family size by simple random sampling of children from a school, there is a problem.
Children from larger families will be oversampled, and this can lead to errors
Systematic Sampling
The sampling unit is regularly spaced throughout the sampling frame
Investigator selects every kth unit
Advantages: investigator does not need to know the sampling frame in advance
Example: every 3rd newborn child in a hospital
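A sketch of the every-kth-unit rule follows. The random starting point within the first k units is a common convention assumed here, not something stated on the slide, and the list of newborns is invented for illustration.

```python
import random


def systematic_sample(units, k, rng):
    """Pick a random start among the first k units, then take every kth unit."""
    start = rng.randrange(k)
    return units[start::k]


rng = random.Random(0)

# Hypothetical list of 30 consecutive births in a hospital.
births = [f"newborn-{i}" for i in range(1, 31)]

every_third = systematic_sample(births, 3, rng)

print(every_third)
```

Because selection only needs a running count, the same rule works when units arrive one at a time and no complete frame exists in advance.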
Advantages and Disadvantages of Systematic Sampling
Simple to implement (just select the nth sample unit)
Can capture patterns easily
If a cyclical pattern exists, systematic sampling can miss the pattern entirely, e.g., seasonal trends, say Flu Patterns
Cannot estimate variance of population reliably from a SINGLE sample, needs at least two samples
Stratified Sampling
Divide population into strata or uniform groups
Draw Sample from each stratum
Represents Each subgroup
Can Get more precise estimates than a corresponding simple random sample
Can Assign Weights
Widely Used Strategy
Disadvantage: estimates suffer if too few units are selected from some strata compared with others
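One way to sketch stratified sampling is proportional allocation, where the same sampling fraction is applied within every stratum. The strata, their sizes and the 5 percent fraction below are all hypothetical, and proportional allocation is only one of several possible schemes.

```python
import random


def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the given fraction from each stratum."""
    sample = {}
    for name, units in strata.items():
        # At least one unit per stratum, so that no stratum is left out.
        k = max(1, round(len(units) * fraction))
        sample[name] = rng.sample(units, k)
    return sample


rng = random.Random(7)

# Hypothetical population of 1000 units split into two strata.
strata = {
    "urban": list(range(0, 600)),
    "rural": list(range(600, 1000)),
}

result = stratified_sample(strata, 0.05, rng)

print({name: len(units) for name, units in result.items()})
```

Weights can then be attached to each stratum when the sampling fractions differ between strata.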
Cluster Sampling
Sample Clusters rather than individuals
In the sampling frame, identify clusters (say classrooms, or households, or similar units)
Then, in each sampled cluster, examine everyone within that cluster
Want to study prevalence of dental caries in schoolchildren? Divide schools into classrooms, and sample individual classrooms, and examine all children in the classrooms
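The classroom example can be sketched as follows. The 12 classrooms of 25 children each are invented for illustration; only the list of clusters needs to be enumerated before sampling.

```python
import random

rng = random.Random(3)

# Hypothetical school: 12 classrooms with 25 children each.
classrooms = {
    f"class-{c}": [f"child-{c}-{i}" for i in range(1, 26)]
    for c in range(1, 13)
}

# Stage 1: sample clusters (classrooms), not individual children.
chosen = rng.sample(sorted(classrooms), k=3)

# Stage 2: examine every child in each chosen classroom.
examined = [child for c in chosen for child in classrooms[c]]

print(chosen, len(examined))
```

Replacing the second step with a random sample of children inside each chosen classroom would turn this into the multistage design described later.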
Advantages of Cluster Sampling
Need not enumerate entire population in advance
Economical Use of Resources
Multistage Sampling
Identify Primary Sampling Units that are Larger
From the Primary Sampling Units identify secondary sampling units
Sample from the secondary sampling units or, extend the process further
Different from Cluster Sampling
In cluster sampling one selects everyone from the secondary unit, here the secondary unit is sampled
Can use different sampling procedures in different stages
Sample Size Calculations
We need to know at least:
How variable are the data
How willing you are to accept a wrong conclusion that there is an effect when there is none (Type I error)
What is the magnitude of effect you want to detect
What is the certainty with which you want to detect the effect (power)
Importance of these criteria
The more variation in data, the more observations you need
The more certain you want to be, the more observations youwill need
If the difference is very large, you need fewer people
If the difference is very small, you need more people
The Formula
N = 2 * (z_(1-α/2) + z_(1-β))² / ∆²
Where ∆ = (µ1 - µ2) / σ
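A sketch of the formula in Python, reading N as the number needed in each of two groups being compared. The default α = 0.05 and power = 0.80 and the two standardized differences are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(delta, alpha=0.05, power=0.80):
    """N = 2 * (z_(1-alpha/2) + z_(1-beta))^2 / delta^2, with power = 1 - beta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 / delta ** 2


# delta is the standardized difference (mu1 - mu2) / sigma.
n_half = sample_size_per_group(0.5)
n_quarter = sample_size_per_group(0.25)

print(math.ceil(n_half), math.ceil(n_quarter))
```

Halving the standardized difference multiplies the required sample size by four, because ∆ enters the formula as a square.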
What is the significance of this formula
The standardized difference enters the formula as a square
The smaller the difference, the correspondingly larger the sample size required
Summary
This brief tour provides a snapshot of core statistical thinking
We focused on relevant study design issues
We learned about basic probability
We learned about Distributions (Z, T)
We learned about principles of estimation and hypothesis testing
We learned about sampling and sample sizes