sampling procedure and sampling techniques
VARIOUS ELEMENTARY CONCEPTS
OF SAMPLE SURVEYS
Hukum Chandra
ICAR-Indian Agricultural Statistics Research Institute,
New Delhi
Email: [email protected]
2
About you
What experiences (if any) do you have of Survey
Sampling?
3
Objectives
To introduce various sampling schemes
4
Statistical Preliminaries
Definition of
Survey
Census
Sample Survey
Sample Survey Theory
Target population
Survey population
Sampling frame
Notation
Finite population parameter
5
Complete Enumeration (Census)
One way of obtaining the required information is to collect the data for each and every unit belonging to the population; this procedure is termed complete enumeration (census)
The effort, money and time required to carry out a complete enumeration to obtain the different types of data will generally be extremely large
However, if the information is required for each and every unit in the domain of study, a complete enumeration is clearly necessary.
An example of such a situation is the preparation of a voters' list for election purposes
But there are many situations where only summary figures are required for the domain of study as a whole or for groups of units.
6
Need for Sampling
An effective alternative to a complete enumeration is a sample survey, where only some of the units selected from the population are surveyed and inferences are drawn about the population on the basis of the sample
In certain investigations, it may be essential to use specialized equipment or highly trained field staff for data collection, making it almost impossible to carry out such investigations on a complete enumeration basis
If a sample survey is carried out according to certain specified statistical principles, it is possible not only to estimate the value of the characteristic for the population, but also to get a valid estimate of the sampling error of the estimate
7
What is sampling? Sampling proceeds in several stages:
Define scope and objectives of the study, including Population to be studied (Identify the population of interest)
General information to collect
Choose tools and techniques for making observations, e.g. Questionnaire
Diary
Physical measurements
Select (sample) some members of the population (units)
Study the sample (Gather data on the sample)
Draw inferences about the population (Analyze the data and make inferences)
Examples:
Sampling pasta from a pan
Sampling apples from a market stall
8
Population
A population consists of the complete set of all ‘observations’ of interest
It is necessary to identify what does and what does not belong to the population
All households in India in 2000
All women aged 15-49 in India in 2000
All businesses in Delhi in 2014 with more than 1000 employees
All 15 year olds in India in 2011
9
Populations and samples
[Figure: a Population and a Sample, linked by Sampling]
The process of how to obtain a sample from the population is referred to as sampling
10
Definitions
Element : An element is a unit about which we require information. For example, a field growing a particular crop is an element for collecting information on the yield of a crop.
Population : Complete set of all observations of interest.
It is the totality of elements under consideration on which inference is required.
Thus, all fields growing a particular crop in a region constitute a population.
Sampling units
A group of elements constitute a sampling unit
Elements belonging to different sampling units are non-overlapping
A sampling unit may have one or more than one element
Sampling units are convenient as well as relatively inexpensive to observe and identifiable
For example, it is convenient to select households for collecting data on milk produced by animals rather than contacting the elements directly
11
Definitions
Sampling frame: An exhaustive list of all the sampling units constitutes a sampling frame.
An example of a sampling frame may be cultivator fields growing a particular crop or households containing animals in a region.
Sample: A subset of the population.
A part of the population selected from a sampling frame for the purpose of making inference about the population is called a sample.
For example, a subset of the cultivator fields may be selected to estimate the yield of a crop in a region.
A random sample is a subset where units are chosen with the help of probabilities (Sampling).
12
Sampling Error
The error which arises due to the use of a sample to estimate the population parameters
Whatever method of sampling is used, there will always be a difference between the population value and its corresponding estimate
This error is unavoidable in every sampling scheme.
A sample with the smallest sampling error will always be considered a good representative of the population.
This error can be reduced by increasing the size of the sample
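The statement that sampling error falls as the sample grows can be checked with a small simulation. This is a minimal sketch in Python, assuming a synthetic population of 10,000 values; the function name `mean_abs_sampling_error` and all numbers are illustrative and not part of the original material.

```python
import random
import statistics

def mean_abs_sampling_error(population, n, reps=200, seed=0):
    """Average |sample mean - population mean| over repeated SRS draws of size n."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    errors = []
    for _ in range(reps):
        sample = rng.sample(population, n)  # simple random sample without replacement
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)

# Hypothetical population: 10,000 synthetic values (illustrative only).
_gen = random.Random(42)
population = [_gen.gauss(50, 10) for _ in range(10_000)]

small = mean_abs_sampling_error(population, n=25)
large = mean_abs_sampling_error(population, n=400)
# A "sample" of all N units is a census: no sampling error remains.
census = mean_abs_sampling_error(population, n=len(population), reps=1)
```

Under these assumptions the average error for n = 400 comes out well below that for n = 25, and the census case has essentially zero sampling error.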
13
Non-Sampling Error
Besides sampling error, the sample estimate may be subject to other errors which arise due to failure to measure some of the units in the selected sample, observational errors, or errors introduced in editing, coding and tabulating the results
Generally, census results may suffer from non-sampling error although these may be free from sampling error
The non-sampling error is likely to increase with increase in sample size, while sampling error decreases with increase in sample size
14
Alternatives to Sample Surveys
Analysis of administrative records (administrative
data)
(for example Health Authority data, Crime records by
Home Office or Police, School Authority data, tax
records etc)
Censuses
(all members of the population of interest are
studied)
15
Sample Surveys vs Admin Data
Administrative data may not focus on same population
(as the one of interest)
May not contain all required information
Based on definitions devised for administrative purposes
May have incomplete coverage, be out of date,
inaccurate etc
Surveys can adopt desired definitions, collect desired
data etc
16
Sample Versus Census
Which is better? Census Sample Survey
Cost
Speed
Practicality and Feasibility
Data Quality
Detail (e.g. questionnaire)
Ability to analyse small subsets
Timeliness
Sampling Error
Inference to population
17
From Population to Sample
Population parameter (e.g. population mean, average
household income, or population proportion, e.g. infant
mortality rate) – based on population data
refers to a summary value of variable in population
Draw a random sample from the population
Based on sample data, calculate a statistic (e.g.
sample mean, sample proportion) also referred to as
‘estimator’
refers to summary value of a variable based on sample
18
From Population to Sample
Estimator: An estimator is a statistic obtained by a
specified procedure for estimating a population
parameter
The estimator is a random variable and its value differs
from sample to sample
Estimate: The particular value, which the estimator
takes for a given sample, is known as an estimate
19
Example
Population parameter: population mean income, denoted $\mu$
Sample statistic: mean income in the sample, denoted $\bar{x}$
The sample statistic may be used as an estimate for the population parameter: $\hat{\mu} = \bar{x}$
21
Types of Samples-
Different Sample Designs
22
Sample Design
A sample design is a plan determined before any data
are actually collected for obtaining a sample from a
given population.
23
Non-Probability versus Probability Samples
Non-probability sampling:
1. Convenience sampling
A sample selected because of its ease of access
to sample members
24
Non-Probability versus Probability Samples
Non-probability sampling:
2. Purposive sampling
a sample selected using a deliberate subjective choice in order to produce a sample which the researcher judges to be ‘representative’ in some sense
example: a quota sample
represent the major characteristics of the population by sampling a proportional amount of each. You have to decide on which specific characteristic to base your quota
25
Non-Probability versus Probability Samples
Probability sampling
a sample that is selected by a random mechanism, where each member of the population has a known and non-zero probability of being in the sample (selection probability)
It is important, when choosing a random sample, that the surveyor does not choose the sample himself. It has been repeatedly shown that the human investigator is not a satisfactory instrument for making random selections.
26
Pros and Cons
Convenience sampling:
extremely cheap and quick but very large bias
Purposive (Quota) sampling:
Cheaper and quicker than random sampling, but potential for ‘availability/ willingness bias’ even after weighting
Random (probability) sampling:
More expensive/ slower; will have nonresponse bias (because of people refusing to take part)
If the response rate is good, it should have significantly less bias than a quota sample
27
Probability vs Quota samples
Probability Sampling | Quota Sampling
Method of selection is specified, objective and replicable | Quota categories are specified and replicable, but interviewer preference typically rules on how to fulfil quotas
Inference to population based on mathematics | Inference based on subjective judgement
Protects (to some extent) against availability and willingness bias | Prone to severe availability and willingness bias; weighting is essential but bias can remain
Precision of estimates can be estimated | Confidence intervals cannot be calculated
More expensive, requires more resources | Cheaper and quicker
Depending on nonresponse rate, likely to suffer less overall bias |
28
Assessing a Sample Design
“Virtually all surveys that are taken seriously by social
scientists and policy makers use some form of
probability sampling…
One way to ruin an otherwise well-conceived survey is to
use a convenience sample rather than one which is
based on a probability design”
29
Types of Probability Samples
An Overview
30
Probability sampling methods
1. Simple random sampling (SRS)
Randomly chosen selections using a random number table, computer-generated random numbers, lottery balls etc
Probably easiest way of obtaining a random sample
With replacement: the element is returned to the ‘selection frame’ once selected, so one unit could be selected several times
Without replacement: a selected element is not returned, so each unit can appear in the sample at most once
31
Simple Random Sampling (SRS)
This is the simplest and most basic method of sampling in which
the sample is drawn unit by unit, with equal probability of
selection for each unit at each draw.
Therefore, it is a method of selection of n units out of a
population of size N by giving equal probability to all units, or
A sampling procedure in which all possible combinations of n
units that may be formed from the population of N units have the
same probability of selection.
32
Simple Random Sampling (SRS)
For selecting a simple random sample in practice, units from the population are drawn one by one
If a unit is selected, its observation recorded, and the unit returned to the population before the next draw, and this procedure is repeated n times, the procedure is known as simple random sampling with replacement (wr)
In such a selection procedure, there is a possibility of one or more population units getting selected more than once
If this procedure is repeated till n distinct units are selected and all repetitions are ignored, it is called simple random sampling without replacement (wor)
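The two selection procedures can be sketched with Python's standard `random` module. This is an illustration under assumed names (`srs_with_replacement`, `srs_without_replacement`, and a hypothetical frame of 100 unit labels), not a prescribed implementation.

```python
import random

def srs_with_replacement(units, n, rng):
    """Draw n units one by one; each draw picks any unit with probability 1/N."""
    return [rng.choice(units) for _ in range(n)]

def srs_without_replacement(units, n, rng):
    """Draw n distinct units; every subset of size n is equally likely."""
    return rng.sample(units, n)

rng = random.Random(2024)
frame = list(range(1, 101))          # hypothetical frame: unit labels 1..100
wr = srs_with_replacement(frame, 20, rng)     # may contain repeated units
wor = srs_without_replacement(frame, 20, rng) # always 20 distinct units
```

`random.sample` produces the same kind of sample as repeating draws until n distinct units are obtained and ignoring repetitions.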
33
Simple random sampling
Advantages:
Easy to understand
Used as yardstick for assessing efficiency of
complex samples
Disadvantages:
Can be time consuming to implement
Can be costly
Statistically not the most ‘efficient’ method of
sampling (e.g. use of stratification to improve
efficiency)
34
Probability sampling methods (cont)
2. Systematic Sampling
A random start followed by successive application of
the sampling interval
35
Example: Systematic Sampling
Determine the number of units: N = 100
Determine the sample size you want: n = 20
The interval size is therefore K = N/n = 100/20 = 5 (sample one fifth)
Select at random an integer from 1 to K: e.g. 4 is chosen
Then select every K-th unit (4, 9, 14, ..., 99)
[Figure: list of units numbered 1 to 100]
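The steps of this example can be written as a short Python sketch. The function name `systematic_sample` is an assumption for illustration, and the sketch only covers the case where N is an exact multiple of n.

```python
import random

def systematic_sample(N, n, rng=None, start=None):
    """Return 1-based unit numbers of a systematic sample.

    Assumes N is an exact multiple of n, so the interval K = N/n is an integer.
    """
    if N % n != 0:
        raise ValueError("this sketch needs N to be a multiple of n")
    K = N // n
    if start is None:
        start = (rng or random).randint(1, K)  # random start between 1 and K
    return list(range(start, N + 1, K))        # start, start+K, start+2K, ...

sample = systematic_sample(100, 20, start=4)   # the slide's example: K = 5, start = 4
```

With a start of 4 the selected units are 4, 9, 14 and so on up to 99, twenty units in total.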
36
Systematic Sampling
Special methods are needed for systematic selection
with a fractional interval
Use of fractional interval
The list from which to sample should be ideally
randomly ordered
37
Systematic Sample
Disadvantage: periodicity in population list
e.g. sampling interval coincides with a periodic interval of the list
Example: suppose you select the 1st, 11th, 21st, etc. element, but the list is arranged so that the 1st is a man, the 2nd his wife, the 3rd a man, the 4th his wife, etc.
We would obtain a sample of males only, whereas the whole population is made up of males and females
Such periodicity may be easily avoided
Another way to solve this problem is to use stratification
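The periodicity problem is easy to reproduce. The sketch below assumes a hypothetical list of 100 people ordered man, wife, man, wife; the helper `every_kth` is illustrative. An interval of 10 matches the period of 2 and returns only men, while an odd interval such as 5 is one way the periodicity can be avoided for this particular list.

```python
# Hypothetical list: odd positions are men ("M"), even positions their wives ("F").
people = ["M" if i % 2 == 1 else "F" for i in range(1, 101)]

def every_kth(units, start, k):
    """Units at 1-based positions start, start + k, start + 2k, ..."""
    return units[start - 1::k]

periodic = every_kth(people, start=1, k=10)  # positions 1, 11, 21, ...: all odd
odd_step = every_kth(people, start=1, k=5)   # positions 1, 6, 11, ...: alternate
```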
38
Systematic Sampling
In all other sampling methods, the units (whether elements or clusters) are selected with the help of random numbers
But, a method of sampling in which only the first unit is selected with the help of random number while the rest of the units are selected according to a pre-determined pattern, is known as systematic sampling
Very useful in forest surveys for estimating the volume of timber
fisheries surveys for estimating the total catch of fish,
milk yield surveys for estimating the lactation yield
39
Systematic Sampling
Advantages
Easy to understand
Quick and easy to implement
Arranging the frame in stratified order will create
implicit stratification
Disadvantages
Periodicity: if the frame has an ordering that goes unnoticed or unattended, this may result in an ‘unusual sample’
40
Probability sampling methods (cont)
3. Stratified Sampling
If we have information about the composition of a
population, we may be able to improve on e.g. simple
random sampling by using stratification
Units are aggregated (grouped) into different non-
overlapping subgroups, called strata
Then a certain number of units are randomly selected
from each stratum
41
Example: Stratified Sample
if a surveyor wants to find the most popular TV
programmes, it would be advisable to first divide the
population into 3 strata, men, women and children
then select a random sample from each of the strata
care must be taken to ensure that the strata are non-
overlapping, i.e. there is no element falling into more
than 1 category.
42
Stratified Sampling
The basic idea in this sampling is to divide a heterogeneous
population into sub-populations, usually known as strata
Strata are internally homogeneous in which case a precise
estimate of any stratum mean can be obtained based on a sample
from that stratum
By combining such estimates, a precise estimate for the whole
population can be obtained
This sampling provides a better cross section of the population than
the procedure of simple random sampling
For example, in a survey for income estimation, the whole population can be divided into three strata: low-income, medium-income and high-income
43
Stratified Sampling
It may also simplify the organization of the field work.
Geographical proximity is sometimes taken as the basis
of stratification.
The assumption here is that geographically contiguous areas are often more alike than areas that are far apart.
Administrative convenience may also dictate the basis
on which the stratification is made
Auxiliary information may be taken as the basis of
stratification
44
Stratified Sampling
In stratified sampling, the variance of the estimator consists of only
the ‘within strata’ variation
Thus, the larger the number of strata into which a population is divided, the higher the precision
For estimating the variance within strata, there should be a
minimum of 2 units in each stratum
The larger the number of strata the higher will be the cost of survey
So, depending on administrative convenience, cost of the survey
and variability of the characteristic under study in the area, a
decision on number of strata will have to be arrived at
45
Example: Stratified Sample
[Figure: the whole sampling frame (size N) is separated by region into 4 strata: North (N1), South (N2), East (N3), West (N4)]
A random sub-sample is drawn from each stratum, with sampling fractions n1/N1, n2/N2, n3/N3 and n4/N4
46
Stratified Sample
Can be
Proportionate (same sampling fraction for each stratum)
Disproportionate (different sampling fractions),
this means …
differential probabilities of selection
e.g. often small subgroups are selected with a higher
sampling fraction than the rest of the population to
ensure a larger number of them in your final sample to
facilitate analysis
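The difference between proportionate and disproportionate allocation can be made concrete with a short Python sketch. The stratum sizes, the function names `proportional_allocation` and `stratified_sample`, and the largest-remainder rounding rule are all assumptions chosen for illustration.

```python
import random

def proportional_allocation(stratum_sizes, n):
    """Split total sample size n across strata in proportion to N_h."""
    N = sum(stratum_sizes.values())
    exact = {h: n * Nh / N for h, Nh in stratum_sizes.items()}
    alloc = {h: int(x) for h, x in exact.items()}
    leftover = n - sum(alloc.values())
    # Give any remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc

def stratified_sample(frames, alloc, rng):
    """Independent simple random sample (without replacement) in each stratum."""
    return {h: rng.sample(units, alloc[h]) for h, units in frames.items()}

# Hypothetical stratum sizes N_h for four regional strata.
sizes = {"North": 4000, "South": 3000, "East": 2000, "West": 1000}

prop = proportional_allocation(sizes, 200)  # same fraction 200/10000 everywhere
disprop = {"North": 50, "South": 50, "East": 50, "West": 50}  # equal n_h
fractions = {h: disprop[h] / sizes[h] for h in sizes}  # differs by stratum
```

Here the equal allocation gives the smallest stratum a sampling fraction four times that of the largest, which is the over-sampling of small subgroups described above; estimates for the whole population would then need weighting.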
47
Proportionate Stratified Sample
Advantages
Guards against the more unusual samples that can
be chosen by random chance
If stratifiers are related to the variables in your survey,
stratification can reduce standard errors
Disadvantages
Stratification information has to be available
48
Disproportionate Stratified Sample
Advantages
Allows one to over-sample small groups so that a
good statistical comparison can be made
Also used where the goal is to achieve an optimum
allocation between variance and cost
Disadvantages
Estimates of the total population need to be derived
using weighting (see later sessions)
49
Probability sampling methods (cont)
4. Cluster sampling
A cluster is a naturally occurring unit like a county
(country, or state)
Sampling units are selected as part of a cluster of units
Difference to stratified sampling is that the starting point
is a natural cluster, and not ‘made up’ as in stratified
sampling.
50
Cluster sampling
A sampling procedure presupposes division of the
population into a finite number of distinct and identifiable
units called the sampling units.
The smallest units into which the population can be
divided are called the elements of the population and
group of elements the clusters
A cluster may be a class of students or cultivators’ fields
in a village
When the sampling unit is a cluster, the procedure of
sampling is called cluster sampling
51
Cluster sampling
For many types of population, a list of elements is not available; therefore, the use of an element as the sampling unit is not feasible.
The method of cluster sampling is available in such cases.
For example, in a city a list of all the houses may be available, but a list of persons rarely is. Similarly, lists of farms are not available, but lists of villages or enumeration districts prepared for the census are.
Cluster sampling is, therefore, widely practiced in sample surveys.
52
Cluster sampling
For a given number of sampling units, cluster sampling is more convenient and less costly than simple random sampling due to the saving of time in journeys, identification, contacts etc.
Cluster sampling is generally less efficient than simple random sampling due to the tendency of the units in a cluster to be similar
In most practical situations, the loss in efficiency may be balanced by the reduction in cost, and the efficiency per unit cost may be higher in cluster sampling as compared to simple random sampling
53
Cluster sampling
Clearly, the size of the cluster will influence efficiency of sampling
In general, the smaller the cluster, the more accurate will usually be the estimate of the population characteristic for a given number of elements in the sample
The optimum cluster is one which would estimate the characteristic under study with smallest standard error for a given proportion of the population sampled, or more generally, for a given cost.
54
Probability sampling methods (cont)
5. Multi-stage sampling
Large units are selected first and then smaller
units within the selected larger units are
selected (results in clustering)
55
Probability sampling methods (cont)
5. Multi-stage sampling
One of the main considerations of adopting cluster sampling is the reduction of travel cost
However, this method restricts the spread of the sample over population which results in increasing the variance of the estimator
In order to increase the efficiency of the estimator for a given cost, it is natural to think of sub-sampling the clusters and selecting a larger number of clusters, so as to increase the spread of the sample over the population.
Sampling which consists of first selecting clusters and then selecting a specified number of elements from each selected cluster is known as two stage sampling (sub- sampling)
56
Multi-stage sampling
Clusters are generally termed as first stage units (fsu’s) or primary
stage units (psu’s)
The elements within clusters or ultimate observational units are
termed as second stage units (ssu’s) or ultimate stage units (usu’s).
This procedure can be easily generalized to give rise to multistage
sampling
It can be expected to be (i) more efficient than simple random
sampling and less efficient than cluster sampling from operational
convenience and cost point of view
(ii) less efficient than simple random sampling and more efficient
than cluster sampling from the variability point of view
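Two-stage selection can be sketched as follows. This assumes simple random sampling at both stages and a hypothetical frame of villages (first stage units) each listing its fields (second stage units); the function name `two_stage_sample` is illustrative.

```python
import random

def two_stage_sample(clusters, m, k, rng):
    """Stage 1: SRS of m clusters. Stage 2: SRS of up to k elements in each."""
    chosen = rng.sample(sorted(clusters), m)
    return {c: rng.sample(clusters[c], min(k, len(clusters[c]))) for c in chosen}

# Hypothetical frame: 30 villages with 40 fields each.
villages = {
    f"village_{v:02d}": [f"v{v:02d}_field_{j}" for j in range(40)]
    for v in range(30)
}
sample = two_stage_sample(villages, m=6, k=5, rng=random.Random(7))
```

Only the 6 selected villages need a list of fields, which is where the operational saving over element-level simple random sampling comes from.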
57
Multi-Stage Cluster Sampling
Advantages
Huge cost savings if survey is carried out with face-to-face interviews
Useful when no frame is available for the final sampling unit
Disadvantages
To the extent that clusters are homogeneous with respect to the survey variables you are studying, this may result in larger standard errors (less precision of estimates)
58
Successive Sampling
Surveys are often repeated on many occasions (over years or seasons) to estimate the same characteristics at different points of time.
The information collected on previous occasion can be used to
study the change or the total value over occasion for the character
and also to study the average value for the most recent occasion
For example in milk yield survey, we are interested in
1. Average milk yield for the current season
2. The change in milk yield between two different seasons
3. Total milk production for the year
59
Successive Sampling
The successive method of sampling consists of selecting
sample units on different occasions such that some units are
common with samples selected on previous occasions
If objective is to estimate the change, then it is better to retain
the same sample from occasion to occasion
For populations where the basic objective is to study the total,
it is better to select a fresh sample for every occasion
If the objective is to estimate the average value for the most recent occasion, it is better to retain a part of the sample over occasions
60
Multiphase Sampling
It is well known that the prior information on an auxiliary
variable could be used to enhance the precision of the
estimator.
Ratio, product and regression estimators require the
knowledge of population mean and total for the auxiliary
variable x.
When such information is lacking, it is sometimes less
expensive to select a large sample on which auxiliary
variable alone is observed.
The purpose is to furnish a good estimate of population mean
of x
61
Multiphase Sampling
Subsequently, a subsample from the initial sample is selected
for observing the variable of interest.
For example, consider the problem of estimating the total production of cow milk in a certain region. The village is taken as the sampling unit, but the number of milch cows in all the villages of the region may not be available
Then investigator could decide to take a large initial sample of
villages and collect information on number of milch cows in the
sample villages
This information is used to build up an estimate of total
number of milch cows in the region
A subsample of villages is selected from the first-phase
sample to observe the study variable, viz., cow milk yield in
the village
62
Probability sampling methods (cont)
6. Probability Proportional to Size (PPS)
Units are sampled in two or more stages with
probabilities proportional to their size (a clever
solution to ensure equal sized fieldwork
assignments while maintaining equal
probabilities of selection)
63
Sampling with Varying Probability
Under certain circumstances, selection of units with unequal probabilities provides more efficient estimators than equal probability sampling, and this type of sampling is known as unequal or varying probability sampling
The units are selected with probability proportional to a given measure of size (pps), where the size measure is the value of an auxiliary variable x
This sampling scheme is termed probability proportional to size (pps) sampling
In pps sampling, the units may be selected with or without replacement.
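Selection with probability proportional to size, with replacement, can be sketched using cumulative totals of the size measure. The cumulative-total approach is one standard way to do this and is an assumption here, as are the function name `pps_with_replacement`, the integer size measures, and the example sizes.

```python
import bisect
import itertools
import random

def pps_with_replacement(sizes, n, rng):
    """Draw n units with replacement, P(unit i) = x_i / sum(x) at each draw.

    sizes maps unit label -> positive integer size measure x_i.
    """
    labels = list(sizes)
    cum = list(itertools.accumulate(sizes[u] for u in labels))
    total = cum[-1]
    draws = []
    for _ in range(n):
        r = rng.randint(1, total)          # random integer in 1..X
        i = bisect.bisect_left(cum, r)     # first unit whose cumulative total >= r
        draws.append(labels[i])
    return draws

# Hypothetical size measures: selection probabilities 0.1, 0.3 and 0.6.
sizes = {"A": 10, "B": 30, "C": 60}
draws = pps_with_replacement(sizes, 3000, random.Random(0))
counts = {u: draws.count(u) for u in sizes}
```

Over many draws the larger units appear more often, roughly in proportion to their size measure.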
64
Use of Auxiliary Information
In sampling theory, if auxiliary information related to the character under study is available on all the population units, then it may be advantageous to make use of this additional information in survey sampling
One way of using this additional information is in the sample selection with unequal probabilities of selection of units
The knowledge of auxiliary information may also be exploited at the estimation stage. The estimator can be developed in such a way that it makes use of this additional information
Use of Auxiliary Information (contd…)
Examples of such estimators are the ratio estimator, difference estimator, regression estimator and generalized difference estimators
Obviously, it is assumed that the auxiliary information is available on all the sampling units
Another way the auxiliary information can be used is at the stage of planning of survey. An example of this is the stratification of the population units by making use of the auxiliary information
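As an illustration of using auxiliary information at the estimation stage, the ratio estimator of a population total can be sketched as below. The data are hypothetical (milk yield y and number of milch cows x in five sampled villages, with an assumed known population total of x), and the function name `ratio_estimate_total` is illustrative.

```python
def ratio_estimate_total(y_sample, x_sample, x_population_total):
    """Ratio estimator of the population total of y: (sum(y) / sum(x)) * X."""
    if len(y_sample) != len(x_sample) or not y_sample:
        raise ValueError("y and x must be non-empty and of equal length")
    r_hat = sum(y_sample) / sum(x_sample)  # estimated y per unit of x
    return r_hat * x_population_total

# Hypothetical sample of 5 villages.
y = [120.0, 90.0, 210.0, 60.0, 150.0]  # milk yield (litres per day)
x = [40, 30, 70, 20, 50]               # number of milch cows
X_total = 5000                         # assumed known total cows in the region

estimate = ratio_estimate_total(y, x, X_total)
```

In this made-up sample the yield is 3 litres per cow, so the estimated regional total is 15,000 litres per day.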
Stratification I
67
Outline
What is stratification ?
Implicit and explicit stratification
Systematic sampling
Implementation of stratification
Some examples of stratification
68
Review
Note: in simple random sampling all units have the same
probability of selection (the probabilities are known and
positive)
But in general, random sampling does not need to be
based on equal sampling probabilities (however they
need to be known and the need to be all positive), e.g.
some units have a higher probability of selection
69
Random Sampling
We sometimes sample with unequal probabilities
Think of the population as being divided into H subsets
(h = 1, ... H), with Nh units in the hth subset.
If we sample separately from each subset, then we call
the subsets sampling strata. If we sample nh units
from stratum h, then the sampling fraction (selection
probability) in that stratum is nh/Nh.
$f_h = n_h / N_h$
70
What is Stratified Sampling?
Stratified sampling involves sorting (stratifying) the sampling frame prior to selection
Implicit Stratification involves sampling systematically from an ordered (stratified) list
Explicit Stratification involves sorting the population list (frame) into distinct strata and then sampling independently from each stratum
It is possible (and often desirable) to combine explicit and implicit stratification - i.e. to stratify implicitly within explicit strata
71
Why Stratified Sampling?
The primary reason for stratification is that, with proportionate allocation, it ensures (unlike SRS) that the sample proportion from any particular stratum equals the population proportion.
It will increase precision if strata are correlated with survey measures (smaller SE and CI)
With proportionate allocation it cannot do statistical harm (estimates not less precise than under SRS)
This is true of both explicit and implicit stratification.
A secondary motivation for stratification is to permit the
use of variable sampling fractions.
72
Systematic Sampling
Recall session 1
Involves sampling at a fixed interval down a list
If the list is ordered in some meaningful way, this has the
effect of stratification
Advantage of being easy to implement
Procedure: calculate the required interval (K=N/n), then
generate a random start (R) (random number between 1
and K). The sampled units are then the Rth, (R+K)th,
(R+2K)th etc units on the list.
73
Systematic Sampling (2)
K = N/n, where N is the total number of units on the list,
and n the desired sample size.
R is a random number between 1 and K.
Note that K need not be an integer. E.g. if desired n is 500 and N = 10,679, using K = 21.36 will give exactly n = 500, but rounding to K = 21 will give n = 508 or 509, depending on the random start.
Do not use K = 21 and then stop once 500 are sampled: this is biased! (Keep going to the end of the list, i.e. about 508 sampled cases.)
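One common way to handle a fractional interval is to take a real-valued random start in (0, K], add K repeatedly, and round each value up to the next whole unit number. The sketch below is an assumption about how this might be done, not the method prescribed by the slides; it uses integer arithmetic to avoid floating-point rounding, and the function names are illustrative.

```python
import random

def systematic_fractional(N, n, rng):
    """Systematic sample of exactly n of N units when K = N/n is fractional.

    Picks an integer s in 1..N (equivalent to a real start r = s/n in (0, K])
    and selects unit ceil((s + j*N) / n) for j = 0, ..., n-1.
    """
    s = rng.randint(1, N)
    return [-(-(s + j * N) // n) for j in range(n)]  # -(-a // n) is ceil(a / n)

def count_with_rounded_interval(N, K, R):
    """Number of units selected with integer interval K and start R."""
    return len(range(R, N + 1, K))

sample = systematic_fractional(10_679, 500, random.Random(1))

# Rounding K down to 21 over-shoots the target of 500:
n_first_start = count_with_rounded_interval(10_679, 21, 1)   # 509
n_last_start = count_with_rounded_interval(10_679, 21, 21)   # 508
```

The fractional-interval version always returns exactly 500 distinct units, whereas the rounded interval returns 508 or 509 depending on where the random start falls.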
74
Example: Systematic Sampling
Determine the number of units: N = 100
Determine the sample size you want: n = 20
The interval size is therefore K = N/n = 100/20 = 5 (sample one fifth)
Select at random an integer R from 1 to K: e.g. 4 chosen
Then select every K-th unit (4, 9, 14, ..., 99)
[Figure: list of units numbered 1 to 100]
75
Stratum Construction
Choose factors so that strata are homogeneous
If strata are correlated with survey measures then
increase in precision
Strata examples: e.g. regions
We can sometimes estimate the precision achievable
with different choices
Choice of number of strata:
More strata, more precision
But variance estimation more difficult
And administration and sampling (and weighting) may be
more complex
76
Stratum Construction (2)
Cross factors with few categories rather than using many
categories for one factor
For example: stratify according to region and poor and
rich areas
When using a continuous factor (e.g. tax payments;
proportions of households with attribute A etc) choose
carefully the stratum boundaries (i.e. define sensible
categories and cut-off points)
77
Stratum Construction (3)
Choose stratifiers such that they are correlated with a
range of variables
For example, for national household surveys, tend to
choose stratifiers that are related to
Area characteristics (e.g. rural, urban, population density
etc)
Income / occupation (e.g. socio-economic group, social class)
It is common to use 3-4 stratification variables
hierarchically (see later example)
78
Example of Stratification: A General
Population Survey
The Health Survey for England (DH)
Stage 1:
Postcode Sectors stratified by:
14 Regional Health Authorities (1st-level explicit strata)
Proportion of adults with limiting long-term illness, in three
bands (2nd-level explicit strata)
Proportion of households with ‘non-manual’ head, in two
bands (3rd-level explicit strata)
Proportion of households with no car, in two bands (4th-level
explicit strata)
Proportion "non-white" (5th-level stratification: implicit)
79
Example of Stratification: A General
Population Survey (2)
720 sectors were sampled systematically
Stage 2
Within each sector, addresses are in postcode order,
and selected systematically. This provides some
geographical stratification.
80
Example of Stratification: A Special
Population Survey
Survey of Recipients of Job Seekers Allowance (DSS)
Stage 1
Postal sectors were stratified by region and number
of recipients
200 sectors were selected with probability
proportional to number of recipients
81
Example of Stratification: A Special
Population Survey (2)
Stage 2
Recipients were stratified by sex (2 bands) x claim type
(4 bands) x length of continuous unemployment prior to
current claim (implicit)
25 recipients were selected systematically from each
sampled sector
82
Stratified sample: some notation
Dividing, for example, the frame into distinct strata and then sampling independently from each stratum results in:
H strata (or groups), stratum h=1,…, H
In each stratum h there are Nh units (on population
level)
An independent sample of nh units is then selected
from each stratum h
Sampling fraction (selection probability) in the stratum is: $f_h = n_h / N_h$
83
Estimator Under Stratification (Example)
We have 2 strata (e.g. north and south GB)
Proportion of people 18+ years old in GB who use the
internet: P
Estimator p
$p = \sum_{h=1}^{H} \frac{N_h}{N} \, p_h$
With two strata: $p = \frac{N_1}{N_1 + N_2} \, p_1 + \frac{N_2}{N_1 + N_2} \, p_2$
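The stratified estimator is a weighted average of the stratum proportions, with weights equal to each stratum's share of the population. A minimal Python sketch follows; the stratum sizes, the proportions and the function name `stratified_proportion` are hypothetical.

```python
def stratified_proportion(stratum_sizes, stratum_props):
    """p = sum over strata of (N_h / N) * p_h."""
    N = sum(stratum_sizes.values())
    return sum(stratum_sizes[h] / N * stratum_props[h] for h in stratum_sizes)

# Hypothetical values for two strata.
sizes = {"north": 20_000_000, "south": 30_000_000}  # N_h
props = {"north": 0.50, "south": 0.70}              # sample proportions p_h

p = stratified_proportion(sizes, props)  # 0.4 * 0.50 + 0.6 * 0.70
```

With these made-up numbers the estimate is 0.62, closer to the proportion in the larger stratum.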
84
DEFF under Stratified Sample
Increase in precision under stratified sample can be
estimated using the DEFF
$$DEFF = \frac{SE^2_{STRAT}}{SE^2_{SRS}}$$
Numerator is the variance of the stratified design
Denominator is variance under SRS
How can $SE^2_{STRAT}$ be calculated?
85
Variance under Stratification
Variance of a mean:
$$\widehat{var}(\bar{x}) = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \, \frac{s_h^2}{n_h}$$
... and for a proportion:
$$\widehat{var}(p) = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \, \frac{p_h (1 - p_h)}{n_h}$$
86
Variance under Stratification (2)
where
h is the stratum
$s_h^2$ is the sample variance in stratum h (estimated from
sample)
Nh is the population size in stratum h
nh is the sample size in stratum h
N is the total population size (N=N1+N2 +…+NH)
n is the total sample size (n=n1+n2 +…+nH)
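As a hedged illustration of how the stratified variance and the resulting DEFF could be computed, the sketch below applies the proportion formula (without the finite population correction) to invented figures. The SRS comparison uses p(1-p)/n with the same total sample size, which is an assumption made here for the illustration.

```python
def var_stratified_proportion(pop_sizes, sample_sizes, props):
    """Estimated variance of a stratified proportion, ignoring the fpc."""
    N = sum(pop_sizes)
    return sum(
        (N_h / N) ** 2 * p_h * (1 - p_h) / n_h
        for N_h, n_h, p_h in zip(pop_sizes, sample_sizes, props)
    )


# Hypothetical two-stratum survey (all figures invented for illustration)
pop_sizes = [6000, 4000]    # N_h
sample_sizes = [600, 400]   # n_h (proportional allocation)
props = [0.2, 0.8]          # p_h, sample proportion in each stratum

var_strat = var_stratified_proportion(pop_sizes, sample_sizes, props)

# SRS comparison: overall proportion with the same total sample size n
N, n = sum(pop_sizes), sum(sample_sizes)
p = sum(N_h / N * p_h for N_h, p_h in zip(pop_sizes, props))
var_srs = p * (1 - p) / n

deff = var_strat / var_srs
print(f"var_strat={var_strat:.6f}  var_srs={var_srs:.6f}  DEFF={deff:.2f}")
```

Because the two invented strata differ strongly in their proportions, the DEFF comes out well below 1, which is the gain in precision that stratification is meant to deliver.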
87
Practical Limitations to Stratification
Often only possible at PSU level (e.g. household surveys) (PSU= primary sampling unit, e.g. postcode sector, schools etc) rather than at individual level
Correlation between strata and survey variables is typically modest
Depends on what information is available on the sampling frame
Multi-purpose nature of surveys: optimal stratification for one estimate may produce no benefit for another
Typically there is a lack of information about stratum variances
88
Comparisons between Stratification
and Quota Sampling
Recall session 1
Imposing quotas has a similar effect to stratification, namely to reduce sampling variance
But, quota sampling also has inherent bias towards more accessible and more willing population members
This may manifest itself as a bias in the survey measures
Thus, quota sample estimates could have relatively high precision, but be biased and therefore have low accuracy (high mean squared error) (session 3)
Stratification II
90
Outline of session
Variable Sampling Fractions
Motivations
Optimal allocation
Design effects
91
Variable Sampling Fraction (VSF)
We sometimes sample with unequal probabilities
Think of the population as being divided into H subsets
(h = 1, ... H), with Nh units in the hth subset.
If we sample separately from each subset, then we call
the subsets sampling strata. If we sample nh units
from stratum h, then the sampling fraction (selection
probability) in that stratum is nh/Nh.
$$f_h = \frac{n_h}{N_h}$$
92
Variable Sampling Fraction (VSF)
For unbiased estimation, each sampled unit i must be
assigned a weight in inverse proportion to its selection
probability.
This is usually referred to as the sampling weight or
design weight: wi
An example of such a weight in the case of stratified
sampling would be:
$$w_i = \frac{N_h}{n_h} \quad \text{for } i \in h$$
i.e. if sample unit i belongs to stratum h
93
Use of weights
So when certain types of units have been selected
based on different selection probabilities (oversampling)
then the sample weights need to be taken into account in
estimation
Corrective weighting is needed to get design-unbiased
estimates
If weights are ignored then sample estimates are biased
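The effect of ignoring the weights can be sketched with a small, entirely hypothetical example in Python. The population, the stratum values and the helper `weighted_mean` are assumptions chosen so that the true answer is known in advance.

```python
def weighted_mean(values, weights):
    """Design-weighted mean: sum(w_i * y_i) / sum(w_i)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


# Hypothetical population: stratum A has 9000 units (all with y = 10),
# stratum B has 1000 units (all with y = 20), so the true mean is 11.
# We sample 100 units from each stratum, i.e. stratum B is oversampled.
values = [10] * 100 + [20] * 100
weights = [9000 / 100] * 100 + [1000 / 100] * 100   # w_i = N_h / n_h

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(unweighted, weighted)
```

The unweighted mean is pulled towards the oversampled stratum, while the weighted mean recovers the population value.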
94
Motivations for VSF
1. To increase the sample size of small groups
(i.e. to get acceptable confidence intervals for
estimates based on those groups)
2. Because the frame / selection method gives us
no choice
3. To increase precision of estimates by over-
sampling more variable strata
95
Examples
1. A national survey where estimates are also required for
each of the component countries /regions
E.g. survey of the UK, but estimates for Scotland, Wales and NI
are also needed separately
Then a larger sampling fraction might be used in Wales and
Scotland compared to England.
2. Sampling minority ethnic groups:
a high proportion of the minority ethnic population live within a
relatively small proportion of areas
Oversampling such (ethnically dense) areas will increase
achieved sample sizes while reducing survey costs.
96
Use of Variable Sampling Fractions
Now we want to investigate further the effects of using
variable sampling fractions
We have seen we need to use weights
We want to investigate under which circumstances
precision in survey estimates is increased and when
precision is reduced after using VSF
Or in other words: what is the effect of oversampling
on the precision of estimates?
97
Standard Errors for Stratified Sampling
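We have already introduced in the last session a
formula for the variance
Generally, it is for a mean:
$$\widehat{var}(\bar{x}) = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \, \frac{s_h^2}{n_h} \left( 1 - \frac{n_h}{N_h} \right) \qquad (6.1)$$
And for a proportion:
$$\widehat{var}(p) = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \, \frac{p_h (1 - p_h)}{n_h} \left( 1 - \frac{n_h}{N_h} \right) \qquad (6.2)$$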
98
Variance under Stratification (2)
where
h is the stratum
$s_h^2$ is the sample variance in stratum h (estimated from
sample)
Nh is the population size in stratum h
nh is the sample size in stratum h
N is the total population size (N=N1+N2 +…+NH)
n is the total sample size (n=n1+n2 +…+nH)
99
The finite population correction
The expression
$$1 - \frac{n_h}{N_h}$$
is referred to as the finite population correction
This term is only important if nh/Nh is not close to 0
Usually nh/Nh is very close to 0 (since N is very large, even if n is quite large) and the finite population correction can be ignored
Remember (standard error):
$$SE(\bar{x}) = \sqrt{Var(\bar{x})}$$
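To give a feel for when the finite population correction matters, here is a small sketch for a single stratum. The function name and all numbers (sample variance 25, sample size 100, and the two population sizes) are assumptions chosen only for illustration.

```python
import math


def se_mean(s2, n, N=None):
    """Standard error of a mean; applies the fpc (1 - n/N) when N is given."""
    variance = s2 / n
    if N is not None:
        variance *= 1 - n / N
    return math.sqrt(variance)


# Hypothetical stratum: sample variance 25, sample size 100
se_no_fpc = se_mean(25, 100)               # fpc ignored
se_large = se_mean(25, 100, N=1_000_000)   # n/N = 0.0001
se_small = se_mean(25, 100, N=400)         # n/N = 0.25

print(round(se_no_fpc, 4), round(se_large, 4), round(se_small, 4))
```

With a very large population the corrected and uncorrected standard errors are practically identical, whereas sampling a quarter of the population reduces the standard error noticeably.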
100
Variance under Stratification
If we ignore the finite population correction (for every
stratum) we can simplify this to:
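Variance of a mean:
$$\widehat{var}(\bar{x}) = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \, \frac{s_h^2}{n_h} \qquad (6.3)$$
Variance of a proportion:
$$\widehat{var}(p) = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \, \frac{p_h (1 - p_h)}{n_h} \qquad (6.4)$$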
101
Standard Errors for Stratified Sampling
In addition to the simplification of the variance
estimation formulae for a mean and a proportion if we
ignore the finite population correction (fpc), we note:
Differences between strata do not contribute to
variance. So, we should construct strata to be as
homogeneous (small $s_h^2$) as possible
102
Standard Errors for Stratified Sampling
Note that in the special case where we use the same
sampling fraction in each stratum, each of the
variance formulae simplify further.
We can substitute n/N in place of nh/Nh, and nh/n in
place of Nh/N. (6.3) and (6.4) then become:
For a mean:
$$\widehat{var}(\bar{x}) = \frac{\sum_h n_h s_h^2}{n^2} \qquad (6.5)$$
For a proportion:
$$\widehat{var}(p) = \frac{\sum_h n_h p_h (1 - p_h)}{n^2} \qquad (6.6)$$
103
We will look more at (6.5) and (6.6) later.
First, we will concentrate on Variable Sampling
Fractions. In the presence of VSFs, we need formulae
(6.3) and (6.4), ignoring the fpc.
104
Example: Over-Sampling More
Variable Strata
Sometimes, we can identify strata that have high
population variances ($S_h^2$ large). Over-sampling
these strata will tend to increase the precision of the
survey estimates (reduce standard errors).
We can only do this if we have advance estimates of
stratum variances.
Example to illustrate this:
Suppose H = 2 and N1 = N2 (=N/2).
Suppose we know (or estimate) that $S_1^2 = 2 S_2^2$
105
Example (cont)
Then we can substitute into expression (6.3) (ignoring
the fpc and looking at the population variance rather
than the estimated variance) and we get:
$$var(\bar{x}) = \frac{N^2 \cdot 2 S_2^2}{4 N^2 n_1} + \frac{N^2 S_2^2}{4 N^2 n_2} = \frac{S_2^2}{2 n_1} + \frac{S_2^2}{4 n_2}$$
106
Example (cont)
Now, consider two alternative sample designs:
a.) Proportional allocation
i.e. where $\frac{n_h}{n} = \frac{N_h}{N}$
b.) A higher sampling fraction in stratum 1
i.e. n1 larger than n2
107
It follows:
For
a.) Substitute n1 = n2 = n/2 :
$$var(\bar{x}) = \frac{S_2^2}{n} + \frac{S_2^2}{2n} = 1.5 \, \frac{S_2^2}{n}$$
b.) Substitute e.g. n1 = 0.58n; n2 = 0.42n :
$$var(\bar{x}) = \frac{S_2^2}{1.16 n} + \frac{S_2^2}{1.68 n} = 1.457 \, \frac{S_2^2}{n}$$
108
Example (cont)
So, the sampling variance is slightly smaller under
design b)
It is smaller by a ratio of 1.457/1.5, i.e. 0.97
This is the design effect due to over-sampling the
more variable stratum (VSF):
$$DEFF_{VSF} = \frac{SE^2_{VSF}}{SE^2_{SRS}} = \frac{1.457}{1.5} = 0.97$$
$$DEFT_{VSF} = \sqrt{DEFF_{VSF}} = \sqrt{0.97} \approx 0.98$$
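The arithmetic of this example can be checked with a short Python sketch. The function below is a hypothetical helper that expresses the variance in units of S_2^2 / n, under the example's assumptions of two equally sized strata and a stratum 1 variance twice that of stratum 2.

```python
def var_mean_two_strata(share_1, variance_ratio=2.0):
    """Variance of the stratified mean in units of S_2^2 / n.

    Assumes N1 = N2 (so each (N_h / N)^2 = 1/4) and S_1^2 = variance_ratio * S_2^2.
    share_1 is the fraction n1 / n of the sample allocated to stratum 1.
    """
    share_2 = 1 - share_1
    return 0.25 * variance_ratio / share_1 + 0.25 / share_2


var_a = var_mean_two_strata(0.50)   # design a: proportional allocation
var_b = var_mean_two_strata(0.58)   # design b: stratum 1 over-sampled
deff_vsf = var_b / var_a

print(round(var_a, 3), round(var_b, 3), round(deff_vsf, 2))
```

Trying other values of `share_1` shows that the gain is modest and that pushing the over-sampling too far makes the variance rise again.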
109
Example (cont)
This example illustrates how precision can be increased
by the use of Variable Sampling Fractions! (in the case
of oversampling strata with high stratum-variances)
This approach is quite common for repeated business
and agriculture surveys, but rare for household surveys.
110
Note
We have seen when considering case b.) that a higher
sampling fraction in a stratum led to increased precision
Therefore it is important to consider which stratum allocation
will maximise survey precision (under the assumption of
unequal stratum variances)
111
Optimal Allocation
In general, the optimum allocation rule is to set:
$$\frac{n_h}{N_h} \propto \frac{S_h}{\sqrt{C_h}}$$
where Ch is the unit cost of data collection for a unit in
stratum h.
If data collection costs do not vary between strata, this
simplifies to:
$$\frac{n_h}{N_h} \propto S_h$$
If stratum variances are equal, it further simplifies to a
constant K:
$$\frac{n_h}{N_h} = K$$
112
Optimal Allocation (cont)
The last case demonstrates that an equal probability
selection method is optimum in the situation where
variances and data collection costs are equal in all strata
(other things being equal).
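A minimal sketch of the allocation rule in Python follows. It assumes a fixed total sample size n that is shared out in proportion to N_h S_h / sqrt(C_h); the function name and the example figures are invented, and the allocations are left unrounded.

```python
import math


def optimal_allocation(n, pop_sizes, std_devs, unit_costs=None):
    """Share a total sample n across strata in proportion to N_h * S_h / sqrt(C_h)."""
    if unit_costs is None:
        unit_costs = [1.0] * len(pop_sizes)   # equal costs in every stratum
    scores = [
        N_h * S_h / math.sqrt(C_h)
        for N_h, S_h, C_h in zip(pop_sizes, std_devs, unit_costs)
    ]
    total = sum(scores)
    return [n * score / total for score in scores]


# Two equal-sized strata where stratum 1 has twice the variance of stratum 2
alloc = optimal_allocation(1000, [5000, 5000], [math.sqrt(2), 1.0])
print([round(a, 1) for a in alloc])

# Equal variances and equal costs give proportional allocation
equal = optimal_allocation(1000, [5000, 5000], [1.0, 1.0])
print(equal)
```

With equal stratum sizes and a variance ratio of 2, roughly 58.6% of the sample goes to the more variable stratum, which is close to the 0.58n split used in the earlier over-sampling example.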
113
Example: VSFs with Equal Stratum
Variances
Example:
Again suppose H = 2, and N1 = N2.
But now suppose that stratum variances are equal, i.e. $S_1^2 = S_2^2$ ($= S^2$)
Again consider two different sampling schemes:
a.) Proportional allocation, i.e. $\frac{n_h}{n} = \frac{N_h}{N}$
b.) Sampling fraction in stratum 1 is twice that in stratum 2, i.e. n1 = 2n/3; n2 = n/3.
114
Example (cont)
Then, with design a), we find (from expression 6.3, again
ignoring the fpc):
$$var(\bar{x}) = \frac{(N/2)^2 S^2}{N^2 (n/2)} + \frac{(N/2)^2 S^2}{N^2 (n/2)} = \frac{S^2}{2n} + \frac{S^2}{2n} = \frac{S^2}{n}$$
(Note: this is the formula of the variance of a mean under
SRS!)
115
Example (cont)
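With design b), we find:
$$var(\bar{x}) = \frac{(N/2)^2 S^2}{N^2 (2n/3)} + \frac{(N/2)^2 S^2}{N^2 (n/3)} = \frac{3 S^2}{8n} + \frac{3 S^2}{4n} = \frac{9 S^2}{8n}$$
It follows:
$$DEFF_{VSF} = \frac{SE^2_{VSF}}{SE^2_{SRS}} = \frac{9 S^2 / (8n)}{S^2 / n} = \frac{9}{8} = 1.125$$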
116
Example (cont)
This means:
The sampling variance under design b) is 9/8 (=1.125) times that under design a).
By allocating disproportionately, we have lost precision (in the case of equal stratum variances)!
In general, precision will be lost whenever variable sampling fractions are used, if the stratum variances do not vary (much).
The level of precision loss depends on the range of the weights used
117
Design Effects due to VSF’s
If we can assume stratum variances to be equal, there is an
alternative and often-used way to estimate effect of VSFs on
sampling variance.
Expression 6.1 can be used to derive an expression for the
effective sample size:
$$\widehat{neff}_{VSF} = \frac{\left( \sum_h n_h w_h \right)^2}{\sum_h n_h w_h^2}$$
where: nh is the sample size in stratum h and wh is the
weight given to each case in stratum h. (Remember that wh
will be proportional to Nh/nh)
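The effective sample size formula can be tried out on the earlier equal-variance example (sampling fraction in stratum 1 twice that in stratum 2). The sketch below assumes a total sample of n = 900, a figure chosen only to keep the arithmetic simple.

```python
def effective_sample_size(sample_sizes, weights):
    """neff = (sum n_h * w_h)^2 / sum(n_h * w_h^2)."""
    numerator = sum(n_h * w_h for n_h, w_h in zip(sample_sizes, weights)) ** 2
    denominator = sum(n_h * w_h ** 2 for n_h, w_h in zip(sample_sizes, weights))
    return numerator / denominator


# n1 = 2n/3 and n2 = n/3 with n = 900; stratum 2 cases need twice the weight
sample_sizes = [600, 300]
weights = [1, 2]   # only the relative size of the weights matters

neff = effective_sample_size(sample_sizes, weights)
deff_vsf = sum(sample_sizes) / neff
print(neff, deff_vsf)
```

The resulting design effect, n / neff, reproduces the 9/8 found from the variance formula, and rescaling all the weights by a constant leaves neff unchanged.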
118
Design Effects due to VSF’s (cont)
Note that this expression only takes into account the
effect of VSFs on effective sample size, not the effect of
any other aspect of design.
Formula on previous slide can be used at design stage
to predict impact on precision of alternative allocations to
strata!
119
Design Effects due to VSF’s (cont)
In general, it will be found that:
larger range of sampling fractions (weights) results in a smaller neff (i.e. greater loss of precision)
over-sampling a large subgroup results in greater loss of precision than over-sampling a small subgroup
when main aim is to produce estimates for subgroups, equal sample sizes per subgroup will be an efficient design
when the main aim is to produce estimates for the total population, equal sampling fractions will be efficient.
120
Graphical illustration of neff
The following graph illustrates the effect of oversampling
on survey precision for a sample with 2 strata (H=2)
The graph shows relationship between the proportion of the sample in stratum 1 (n1/n) (x-axis)
and the consequent loss of precision, as measured by the design effect
(y-axis).
The three lines relate to three oversampling rates and
the subsequent relative weights that need to be used:
2:1, 4:1 and 10:1 (i.e. w1=1 in all cases).
(2:1 means that stratum 1 is oversampled by a factor of
2)
121
[Figure: design effect DEFF_VSF (y-axis, 1 to 3.4) plotted against n1/n (x-axis, 0 to 1), with one line for each of w2=2, w2=4 and w2=10]
122
Graphical illustration of neff
The graph illustrates the two points made
earlier:
larger range of sampling fractions (weights)
results in a smaller neff (i.e. greater loss of
precision)
over-sampling a large subgroup results in
greater loss of precision than over-sampling a
small subgroup
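The curves in such a graph can be recomputed from the effective sample size formula. The sketch below is an assumption about how the graph was constructed: it takes DEFF_VSF = n / neff, fixes w1 = 1, and writes everything in terms of the share n1/n of the sample in stratum 1.

```python
def deff_vsf(share_1, w2, w1=1.0):
    """DEFF = n / neff for two strata, given the share n1/n of the sample in stratum 1."""
    share_2 = 1 - share_1
    mean_w = share_1 * w1 + share_2 * w2
    mean_w_squared = share_1 * w1 ** 2 + share_2 * w2 ** 2
    return mean_w_squared / mean_w ** 2


# Tabulate the three curves (relative weights 2, 4 and 10 for stratum 2)
for share in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    row = [round(deff_vsf(share, w2), 3) for w2 in (2, 4, 10)]
    print(f"n1/n = {share:.1f}  DEFF: {row}")
```

Under these assumptions each curve starts and ends at 1 and peaks at (w2 + 1)^2 / (4 w2), i.e. 1.125, 1.5625 and 3.025 for the three weights, with the peak moving towards larger n1/n as the weight ratio grows.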
Multi-Stage Sampling
124
Outline of session
What is multi-stage / cluster sampling
Motivations for multi-stage sampling
Choice of sampling units, sample sizes at each
stage
Selection probabilities and weighting
Probability Proportional to Size (PPS) sampling
Design effects due to clustering
125
What is Multi-Stage Sampling?
The units in the population are arranged hierarchically
A 3-stage design would entail:
Primary sampling units (PSUs)
Secondary sampling units (SSUs)
Sample elements
It would be necessary to assign every element uniquely
to one SSU and every SSU uniquely to one PSU
126
What is Multi-Stage Sampling?
Stage 1: select sample of PSUs
Stage 2: select sample of SSUs within each selected PSU
Stage 3: select sample of elements within each selected SSU
Note that there could be any number of stages: 2, 3 or 4 are common
127
Examples:
general population survey :
PSUs might be postcode sectors
SSUs might be households
Elements might be persons
business survey :
PSUs might be companies
SSUs might be workplaces
Elements might be employees
128
Why Multi-Stage Sampling?
No frame of elements available, but frame of PSUs
available (examples: national sample of school pupils, where
schools could be PSUs; US face to face survey where counties are
PSU’s)
Cost of data collection (example: general population sample
involving face-to-face interviewing)
Access to elements may only be via “gatekeepers” (examples: students, employees, trainees)
Data quality (example: in the case of face-to-face interviewing,
field work can be better supervised if in clusters)
129
Design Choices (clustering):
Example: Field interviewing
Constraint                                   Implication
Tight field work periods                     Small workload per interviewer
Completion depends on slowest interviewer    Equal interviewer workloads
Efficient fieldwork                          Each workload in small area
Training/briefing/learning costs             Large workload per interviewer
130
Design Choices (clustering):
Some General Points:
Larger clusters will generally result in larger design
effects due to clustering (see later)
But larger clusters will also generally result in larger cost
savings (e.g. field interviewers, gatekeepers)
Necessary to make an appropriate compromise: i.e.
where cost saving outweighs loss in precision, to
produce higher overall accuracy per unit cost
(remember key aim of sample design: minimising costs,
maximising accuracy)
131
Selection Probabilities: Principle
With multi-stage sampling, the selection probability of
each element is the product of the (conditional) selection
probabilities at each stage
e.g. probability of sampling unit i in SSU j in PSU k is
Prijk = Pr (k) x Pr (j | k) x Pr (i | j,k)
So, it is important to control and record the selection
probabilities at each stage.
132
Selection Probabilities
Other things being equal, it is desirable to keep selection
probabilities equal for all elements (remember:
stratification; otherwise loss in precision).
If selection probabilities are not equal, we will need to
weight each sampled element ijk by
wijk = 1/Prijk
for unbiased estimation.
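As a small sketch of this principle, the code below multiplies conditional selection probabilities across stages and inverts the result to obtain the weight. The three stage probabilities are hypothetical values, not taken from any survey.

```python
from fractions import Fraction


def overall_probability(*stage_probs):
    """Product of the (conditional) selection probabilities at each stage."""
    result = Fraction(1)
    for prob in stage_probs:
        result *= prob
    return result


# Hypothetical 3-stage design: Pr(k) = 1/10, Pr(j | k) = 1/5, Pr(i | j, k) = 1/2
pr_ijk = overall_probability(Fraction(1, 10), Fraction(1, 5), Fraction(1, 2))
w_ijk = 1 / pr_ijk
print(pr_ijk, w_ijk)
```

Exact fractions are used so that the recorded probabilities and weights carry no rounding error.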
133
Selection Options
With multi-stage sampling, there are many ways to
achieve equal selection probabilities.
(epsem design = equal probability of selection method; =
self-weighting design)
In the (rare) case of equal size PSUs and 2-stage
sampling, we can easily select PSUs (j's) and elements
(i's) with equal probability.
Example: Design (0):
Pr(j) =1/3 and Pr (i|j)=1/2 and the overall probability is
Pr(i) = 1/3 * 1/2 = 1/6 for all i.
134
Selection Options
In many types of sampling situations having equal size PSUs is rare. In the case of unequal sized PSUs we are left with 3 alternative designs:
1. select PSUs with equal probabilities and then a fixed number of elements within each - gives unequal selection probabilities (not an epsem design)
2. select PSUs with equal probabilities and then a variable number of elements within each, to give equal overall selection probabilities
3. select PSUs with PPS (probability proportional to size), then a fixed number of elements within each
135
Selection Options
Design 1) undesirable because it will generally cause loss in precision compared with an epsem design; non-epsem design undesirable; weighting needed
Design 2) avoids this problem, but causes practical problems. Number of elements sampled per PSU will vary in proportion to the population size of PSU. Elements in one PSU typically form one interviewer workload, so this is undesirable.
Also, with design 2) the sample size is not fixed in advance - it is a random variable. Very undesirable!
136
Selection Options
Design 3) overcomes all these problems, but it depends
on the availability of a reasonably accurate measure of
the number of elements in each PSU (and SSU, if a 3
stage design).
Note: when accurate measures of the number of elements
within each PSU are not available, it may be possible to get a
reasonably good estimate of the measure of size and to
proceed with PPS sampling
The next slide discusses this design further:
137
Probability Proportional to Size (PPS)
Selection
Example: A 2-stage design
set Pr (j) proportional to Nj (number of elements in
population in PSU j = PPS sampling).
So Pr (j) = C Nj.
We then select the same number of elements, D, from
each sampled PSU, so Pr (i| j) = D/ Nj.
Then,
Pr (i) = Pr (j) x Pr (i|j) = C Nj x D/ Nj = CD, which is
the same for every element
138
Implementation of a PPS Design
We do not need to calculate the selection probabilities at
each stage in order to make the selection.
We need only to create a cumulative total down the list
of PSUs (e.g. 10,000) and then sample systematically
down that list of totals, selecting each PSU whose
cumulative range contains a selection point
139
Implementation of a PPS Design
Example: Selection of 3 PSUs from 10 with PPS and 25
units from each selected PSU, so that n=75
Pr(j) is probability of selecting the PSU
Pr(i|j) is the probability of selecting each unit, given that
PSU has been selected, and
Pr(i) is the overall probability of selecting each unit.
It can be seen that each of the 10,000 units in the
population has the same selection probability:
140
Example of a PPS Design
PSU     Size (Nj)   Pr(j) = C*Nj     Pr(i|j) = D/Nj   Pr(i) = Pr(j) x Pr(i|j) = C*D
1       1000        3x1000/10000     25/1000          75/10000
2       900         3x900/10000      25/900           75/10000
3       800         3x800/10000      25/800           75/10000
4       1200        3x1200/10000     25/1200          75/10000
5       1500        3x1500/10000     25/1500          75/10000
6       1300        3x1300/10000     25/1300          75/10000
7       1100        3x1100/10000     25/1100          75/10000
8       500         3x500/10000      25/500           75/10000
9       1000        3x1000/10000     25/1000          75/10000
10      700         3x700/10000      25/700           75/10000
Total   10000       C = 3/10000      D = 25
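The table can be verified with a few lines of Python. This is only a check of the arithmetic, using the PSU sizes listed in the table and exact fractions.

```python
from fractions import Fraction

sizes = [1000, 900, 800, 1200, 1500, 1300, 1100, 500, 1000, 700]  # N_j
n_psus = 3      # PSUs to select
per_psu = 25    # D, elements selected in each sampled PSU
N = sum(sizes)  # 10000

overall = []
for N_j in sizes:
    pr_j = Fraction(n_psus * N_j, N)       # Pr(j) = C * N_j with C = 3/10000
    pr_i_given_j = Fraction(per_psu, N_j)  # Pr(i | j) = D / N_j
    overall.append(pr_j * pr_i_given_j)

print(set(overall))
```

The PSU size cancels in every row, so the set of overall probabilities contains the single value 75/10000 (printed in lowest terms as 3/400).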
141
Example of a PPS Design (cont)
We would select the sample of PSUs as follows:
N = 10,000 and n = 3 (PSUs).
To select systematically (see session: stratification I),
K = N/n = 3333 and R needs to be a random number
between 1 and 3333. Suppose we happen to generate
R = 1,050.
Then, we sample the PSUs that contain elements 1050,
(1050 + 3333) and (1050 + 2x3333), i.e. PSUs 2, 5 and 7:
142
Example of a PPS Design (cont)
PSU   Size   Cum. size   Selection
1     1000   1000
2     900    1900        *
3     800    2700
4     1200   3900
5     1500   5400        *
6     1300   6700
7     1100   7800        *
8     500    8300
9     1000   9300
10    700    10000
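The cumulative-total selection can be sketched in Python as follows. The function name is an assumption, and the random start is passed in as a fixed value (R = 1050) so that the example reproduces the selection shown in the table; in practice R would be drawn at random between 1 and the interval.

```python
from bisect import bisect_left
from itertools import accumulate


def pps_systematic(sizes, n_psus, start):
    """Select PSUs by systematic sampling down the cumulative size totals.

    Returns the 1-based positions of the PSUs that contain the elements
    start, start + K, start + 2K, ... where K = total size // n_psus.
    """
    cumulative = list(accumulate(sizes))
    interval = cumulative[-1] // n_psus
    points = [start + k * interval for k in range(n_psus)]
    # The PSU containing a point is the first one whose cumulative total reaches it
    return [bisect_left(cumulative, point) + 1 for point in points]


sizes = [1000, 900, 800, 1200, 1500, 1300, 1100, 500, 1000, 700]
selected = pps_systematic(sizes, n_psus=3, start=1050)
print(selected)
```

Larger PSUs span more of the cumulative range and are therefore more likely to contain a selection point, which is what makes the selection proportional to size.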
143
Some Limitations of PPS Sampling of
PSUs
We might have only imperfect estimates of number
of elements in each PSU (the size measure)
We could then adjust the sample size within each
PSU to keep overall probabilities equal or we might
simply weight by 1/Pr(i)
Sampling interval might be smaller than number of
elements in some PSUs. (This will only happen if
sampling fraction of PSUs is large and/or size of
PSUs highly variable.) Those PSUs will be certain
to be sampled, and could be sampled more than
once.
144
Some Limitations of PPS Sampling of
PSUs
We might place these PSUs in a separate stratum and
include them with certainty. We might also increase their
sample size of elements, to keep overall probabilities
equal, or we might weight
145
Design Effects due to Clustering
Clustering tends to increase sampling variance (but this
is partly offset by the fact that a larger sample size can
be obtained for any given cost).
This is because units within a cluster tend to be more
homogeneous than units as a whole.
Clustering is therefore tending to have the opposite
effect to stratification.
146
Example of Homogeneity of Clusters
Let us consider the following example to illustrate the effect
of clustering:
Population of 6 people, with values: 1, 1, 2, 2, 3, 3.
Population mean = 12/6=2
Population variance:
$$var(X) = \frac{1}{6} \sum_{i=1}^{6} (x_i - 2)^2 = \frac{4}{6} = \frac{2}{3}$$
147
Example (cont)
a) divide population into 3 clusters: (1,1) (2,2) and (3,3).
Then: no variance within clusters (homogeneous
clusters). But variance between the cluster means is:
$var(X_B) = [(1-2)^2 + (2-2)^2 + (3-2)^2] / 3 = 2/3$.
It implies that sampling variance is greater than 0 since
we get different estimates of the mean depending on
which cluster is sampled.
148
Example (cont)
b) divide the population into 2 clusters: (1,2,3) (1,2,3). No variance between cluster means. But variance within each cluster is:
$Var(X_W) = 2 \times [[(1-2)^2 + (2-2)^2 + (3-2)^2] / 3] / 2 = 2/3$
The sampling variance is 0 since there is no variability in sample means.
With design a) all the variance is between clusters: clusters are perfectly homogeneous.
With design b), clusters are as heterogeneous as the population as a whole, so cluster sampling would not cause a loss in precision.
149
Example (cont)
If we sample one cluster (and then include all elements),
design a) has a sampling variance of 2/3; design b) has
a sampling variance of 0.
This illustrates the general point that sampling variance
will be greater if clusters are relatively homogeneous
(i.e. like in a) )
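The two cluster designs can be compared directly in Python. The sketch assumes, as in the example, that exactly one cluster is drawn with equal probability and that all of its elements are included, so the sampling variance of the mean is the variance of the cluster means.

```python
from statistics import mean, pvariance


def sampling_variance_one_cluster(clusters):
    """Variance of the sample mean when one equally likely cluster is fully enumerated."""
    return pvariance([mean(cluster) for cluster in clusters])


population = [1, 1, 2, 2, 3, 3]
design_a = [(1, 1), (2, 2), (3, 3)]   # perfectly homogeneous clusters
design_b = [(1, 2, 3), (1, 2, 3)]     # clusters as varied as the population

pop_var = pvariance(population)
var_a = sampling_variance_one_cluster(design_a)
var_b = sampling_variance_one_cluster(design_b)
print(round(pop_var, 3), round(var_a, 3), round(var_b, 3))
```

Design a) carries the whole population variance into the sampling variance, while design b) gives the same sample mean whichever cluster is drawn.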
150
Design Effects due to Clustering (cont)
Typically, the sorts of units that we tend to use as PSUs are relatively homogeneous, so in practice clustering nearly always results in a design effect due to clustering which is greater than one.
Examples:
people within postcode sectors,
pupils within schools,
students within classes
employees within firms.
151
Intra-Cluster Correlation
The design effect due to clustering is
$$DEFF_{CL} = 1 + (b - 1) \rho$$
where b is sample size per cluster (in practice b may vary slightly, in which case mean cluster size provides an adequate approximation), and ρ (‘roh’) is the intra-cluster correlation.
ρ =0: randomly sorted clusters
ρ =1: perfectly homogeneous clusters
152
Intra-Cluster Correlation (cont)
Note that ρ is a population characteristic relating to the
chosen definition of PSU, but sample design should
involve a careful choice of b.
Examples of possible values:
b=10: if ρ =0 then $DEFF_{CL} = 1$
b=10: if ρ =1 then $DEFF_{CL} = 10$
more realistically, b=10, if ρ =0.05 then $DEFF_{CL} = 1.45$
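These values follow directly from the formula DEFF_CL = 1 + (b - 1)ρ, as the short sketch below shows (the function name is chosen here for illustration).

```python
def deff_cluster(b, rho):
    """Design effect due to clustering: 1 + (b - 1) * rho."""
    return 1 + (b - 1) * rho


for rho in (0.0, 1.0, 0.05):
    print(f"b=10, rho={rho}: DEFF_CL = {deff_cluster(10, rho):.2f}")
```

Because the design effect grows linearly in b, halving the sample size per cluster roughly halves the excess over 1 for a given ρ.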
153
Inflation due to clustering
Reminder: the square root of DEFF is DEFT
DEFTCL inflates confidence intervals of the mean (or
proportion) as follows:
$$\bar{x} \pm 1.96 \times SE(\bar{x}) \times DEFT_{CL}$$
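A hedged sketch of the inflated confidence interval follows. It assumes that SE is the standard error computed as if the sample were a simple random sample, and the mean, standard error, b and ρ used in the call are invented figures.

```python
import math


def clustered_confidence_interval(mean, se_srs, b, rho, z=1.96):
    """95% interval for a mean, with the SRS standard error inflated by DEFT_CL."""
    deft = math.sqrt(1 + (b - 1) * rho)   # DEFT is the square root of DEFF
    half_width = z * se_srs * deft
    return mean - half_width, mean + half_width


# Hypothetical estimate: mean 50, SRS standard error 2, b = 10, rho = 0.05
low, high = clustered_confidence_interval(mean=50.0, se_srs=2.0, b=10, rho=0.05)
print(round(low, 2), round(high, 2))
```

With DEFT_CL of about 1.2, the interval is about 20% wider than the one an SRS of the same size would give.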
154
Example of Intra-Cluster Correlations
From the British Social Attitudes Survey:
Variable                   ρ       b      DEFT   DEFT if b=10
Household size             0.070   16.6   1.45   1.28
Owner-occupier             0.231   16.5   2.14   1.75
Has telephone              0.102   16.5   1.61   1.38
Asian                      0.334   8.3    1.86   1.53
Roman Catholic             0.037   16.4   1.25   1.15
Not racially prejudiced    0.021   8.4    1.08   1.03
Extra-marital sex wrong    0.044   8.3    1.15   1.08
Dodging VAT is OK          0.021   8.2    1.07   1.04
155
Example of Intra-Cluster Correlations
Note ρ is low for attitudinal variables, so design effects
are small (DEFT small). But ρ is large for variables related to
ethnicity and housing type.
Thus, the most effective degree of clustering might be
greater for an attitude survey (fewer, larger clusters) than
for a housing survey.