1.Overview of Sampling Scheme Hukum Chandra

VARIOUS ELEMENTARY CONCEPTS

OF SAMPLE SURVEYS

Hukum Chandra

ICAR-Indian Agricultural Statistics Research Institute,

New Delhi

Email: [email protected]


2

About you

What experiences (if any) do you have of Survey

Sampling?

3

Objectives

To introduce various sampling schemes

4

Statistical Preliminaries

Definition of

Survey

Census

Sample Survey

Sample Survey Theory

Target population

Survey population

Sampling frame

Notation

Finite population parameter

5

Complete Enumeration (Census)

One way of obtaining the required information is to collect the data for each and every unit belonging to the population; this procedure of obtaining information is termed complete enumeration (census)

The effort, money and time required for carrying out a complete enumeration to obtain the different types of data will generally be extremely large

However, if the information is required for each and every unit in the domain of study, a complete enumeration is clearly necessary.

An example of such a situation is the preparation of voters' lists for election purposes

But there are many situations where only summary figures are required for the domain of study as a whole or for groups of units.

6

Need for Sampling

An effective alternative to a complete enumeration is a sample survey, where only some of the units selected from the population are surveyed and inferences are drawn about the population on the basis of the sample

In certain investigations, it may be essential to use specialized equipment or highly trained field staff for data collection, making it almost impossible to carry out such investigations on a complete enumeration basis

If a sample survey is carried out according to certain specified statistical principles, it is possible not only to estimate the value of the characteristic for the population, but also to get a valid estimate of the sampling error of the estimate

7

What is sampling? Sampling proceeds in several stages:

Define scope and objectives of the study, including Population to be studied (Identify the population of interest)

General information to collect

Choose tools and techniques for making observations, e.g. Questionnaire

Diary

Physical measurements

Select (sample) some members of the population (units)

Study the sample (Gather data on the sample)

Draw inferences about the population (Analyze the data and make inferences)

Examples:

Sampling pasta from a pan

Sampling apples from a market stall

8

Population

A population consists of the complete set of all ‘observations’ of interest

It is necessary to identify what does and what does not belong to the population

All households in India in 2000

All women aged 15-49 in India in 2000

All businesses in Delhi in 2014 with more than 1000 employees

All 15 year olds in India in 2011

9

Populations and samples

The process of obtaining a sample from the population is referred to as sampling

(Figure labels: Population, Sample, Sampling)

10

Definitions

Element : An element is a unit about which we require information. For example, a field growing a particular crop is an element for collecting information on the yield of a crop.

Population : Complete set of all observations of interest.

It is the totality of elements under consideration on which inference is required.

Thus, all fields growing a particular crop in a region constitute a population.

Sampling units

A group of elements constitute a sampling unit

Elements belonging to different sampling units are non-overlapping

A sampling unit may have one or more than one element

Sampling units are identifiable, and are convenient and relatively inexpensive to observe

For example, it is convenient to select households for collecting data on milk produced by animals rather than contacting the elements directly

11

Definitions

Sampling frame: An exhaustive list of all the sampling units constitutes a sampling frame.

An example of a sampling frame may be cultivator fields growing a particular crop or households containing animals in a region.

Sample: A subset of the population.

A part of the population selected from a sampling frame for the purpose of making inference about the population is called a sample.

For example, a subset of the cultivator fields may be selected to estimate the yield of a crop in a region.

A random sample is a subset where units are chosen with the help of probabilities (Sampling).

12

Sampling Error

The error which arises due to the use of a sample to estimate the population parameters

Whatever method of sampling is used, there will always be a difference between the population value and its corresponding estimate

This error is unavoidable in every sampling scheme.

A sample with the smallest sampling error will always be considered a good representative of the population.

This error can be reduced by increasing the size of the sample
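The claim that sampling error falls as the sample grows can be checked with a small simulation. The sketch below draws repeated simple random samples of increasing size from a made-up population and reports the average absolute difference between the sample mean and the population mean. The population values, the sample sizes and the function name `mean_abs_error` are assumptions chosen for illustration only.

```python
import random
import statistics


def mean_abs_error(population, n, replications, rng):
    """Average |sample mean - population mean| over repeated SRS draws of size n."""
    true_mean = statistics.fmean(population)
    errors = []
    for _ in range(replications):
        sample = rng.sample(population, n)  # simple random sample without replacement
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)


rng = random.Random(2024)  # fixed seed so that a run can be repeated
# Hypothetical population: 5,000 household incomes
population = [rng.gauss(50_000, 12_000) for _ in range(5_000)]

for n in (10, 100, 1_000):
    print(n, round(mean_abs_error(population, n, 200, rng), 1))
```

The printed error should shrink as n grows, and it vanishes when the sample is the whole population (a census), which is the sense in which a census is free from sampling error.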

13

Non-Sampling Error

Besides sampling error, the sample estimate may be subject to other errors, which arise due to failure to measure some of the units in the selected sample, observational errors, or errors introduced in editing, coding and tabulating the results

Generally, census results may suffer from non-sampling error although they may be free from sampling error

The non-sampling error is likely to increase with increase in sample size, while sampling error decreases with increase in sample size

14

Alternatives to Sample Surveys

Analysis of administrative records (administrative

data)

(for example Health Authority data, Crime records by

Home Office or Police, School Authority data, tax

records etc)

Censuses

(all members of the population of interest are

studied)

15

Sample Surveys vs Admin Data

Administrative data may not focus on same population

(as the one of interest)

May not contain all required information

Based on definitions devised for administrative purposes

May have incomplete coverage, be out of date,

inaccurate etc

Surveys can adopt desired definitions, collect desired

data etc

16

Sample Versus Census

Which is better, a census or a sample survey, on each of the following criteria?

Cost

Speed

Practicality and Feasibility

Data Quality

Detail (e.g. questionnaire)

Ability to analyse small subsets

Timeliness

Sampling Error

Inference to population

17

From Population to Sample

Population parameter (e.g. a population mean such as average household income, or a population proportion such as the infant mortality rate): based on population data; refers to a summary value of a variable in the population

Draw a random sample from the population

Based on sample data, calculate a statistic (e.g. sample mean, sample proportion), also referred to as an ‘estimator’; refers to a summary value of a variable based on the sample

18

From Population to Sample

Estimator: An estimator is a statistic obtained by a

specified procedure for estimating a population

parameter

The estimator is a random variable and its value differs

from sample to sample

Estimate: The particular value, which the estimator

takes for a given sample, is known as an estimate

19

Example

Population parameter: population mean income, denoted $\mu$

Sample statistic: mean income in the sample, denoted $\bar{x}$

The sample statistic may be used as an estimate for the population parameter:

$\hat{\mu} = \bar{x}$


21

Types of Samples-

Different Sample Designs

22

Sample Design

A sample design is a plan determined before any data

are actually collected for obtaining a sample from a

given population.

23

Non-Probability versus Probability Samples

Non-probability sampling:

1. Convenience sampling

A sample selected because of its ease of access

to sample members

24


Non-probability sampling:

2. Purposive sampling

a sample selected using a deliberate subjective choice in order to produce a sample which the researcher judges to be ‘representative’ in some sense

example: a quota sample

represent the major characteristics of the population by sampling a proportional amount of each. You have to decide on which specific characteristic to base your quota

25


Probability sampling

a sample that is selected by a random mechanism, where each member of the population has a known and non-zero probability of being in the sample (selection probability)

It is important, when choosing a random sample, that the surveyor does not choose the sample personally. It has been repeatedly shown that the human investigator is not a satisfactory instrument for making random selections.

26

Pros and Cons

Convenience sampling:

extremely cheap and quick but very large bias

Purposive (Quota) sampling:

Cheaper and quicker than random sampling, but potential for ‘availability/ willingness bias’ even after weighting

Random (probability) sampling:

More expensive/ slower; will have nonresponse bias (because of people refusing to take part)

if there is a good response rate, it should have significantly less bias than a quota sample

27

Probability vs Quota samples

Probability sampling:

Method of selection is specified, objective and replicable

Inference to population based on mathematics

Protects (to some extent) against availability and willingness bias

Precision of estimates can be estimated

More expensive, requires more resources

Depending on nonresponse rate, likely to suffer less overall bias

Quota sampling:

Quota categories are specified and replicable, but interviewer preference typically rules on how to fulfil quotas

Inference based on subjective judgement

Prone to severe availability and willingness bias; weighting is essential but bias can remain

Confidence intervals cannot be calculated

Cheaper and quicker

28

Assessing a Sample Design

“Virtually all surveys that are taken seriously by social

scientists and policy makers use some form of

probability sampling…

One way to ruin an otherwise well-conceived survey is to

use a convenience sample rather than one which is

based on a probability design”

29

Types of Probability Samples

An Overview

30

Probability sampling methods

1. Simple random sampling (SRS)

Randomly chosen selections using a random number table, computer-generated random numbers, lottery balls etc

Probably easiest way of obtaining a random sample

With replacement: replace element back into ‘selection frame’ once selected, one unit could be selected several times

Without replacement

31

Simple Random Sampling (SRS)

This is the simplest and most basic method of sampling in which

the sample is drawn unit by unit, with equal probability of

selection for each unit at each draw.

Therefore, it is a method of selection of n units out of a

population of size N by giving equal probability to all units, or

A sampling procedure in which all possible combinations of n

units that may be formed from the population of N units have the

same probability of selection.

32

Simple Random Sampling (SRS)

For selecting a simple random sample in practice, units from the population are drawn one by one

If a unit is selected, its observation recorded, and the unit returned to the population before the next drawing is made, and this procedure is repeated n times, the procedure is generally known as simple random sampling with replacement (wr)

In such a selection procedure, there is a possibility of one or more population units getting selected more than once

If this procedure is repeated till n distinct units are selected and all repetitions are ignored, it is called simple random sampling without replacement (wor)
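The two draw-by-draw procedures can be sketched in Python as follows. This is a minimal illustration that assumes the frame is a list of distinct unit labels; the function names and the frame of 20 labels are hypothetical.

```python
import random


def srs_with_replacement(units, n, rng):
    """Draw n units one by one, returning each unit to the population before the next draw."""
    return [rng.choice(units) for _ in range(n)]


def srs_without_replacement(units, n, rng):
    """Keep drawing with equal probability, ignoring repetitions, until n distinct units are selected."""
    if n > len(units):
        raise ValueError("n cannot exceed the population size")
    selected = []
    seen = set()
    while len(selected) < n:
        unit = rng.choice(units)
        if unit not in seen:  # repetitions are ignored
            seen.add(unit)
            selected.append(unit)
    return selected


rng = random.Random(7)
frame = list(range(1, 21))  # hypothetical frame of N = 20 unit labels
print(srs_with_replacement(frame, 5, rng))
print(srs_without_replacement(frame, 5, rng))
```

In practice the standard library call `random.sample(frame, n)` yields a without-replacement sample directly; the explicit loop is shown only because it mirrors the description above.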

33

Simple random sampling

Advantages:

Easy to understand

Used as yardstick for assessing efficiency of

complex samples

Disadvantages:

Can be time consuming to implement

Can be costly

Statistically not the most ‘efficient’ method of

sampling (e.g. use of stratification to improve

efficiency)

34

Probability sampling methods (cont)

2. Systematic Sampling

A random start followed by successive application of

the sampling interval

35

Example: Systematic Sampling

Determine the number of units: N = 100

Determine the sample size you want: n = 20

The interval size is therefore K = N/n = 100/20 = 5 (sample one fifth)

Select at random an integer from 1 to K: e.g. 4 is chosen

Then select every K-th unit

(Figure: the list of units numbered 1 to 100)
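The selection in this example can be sketched in a few lines of Python. The function name `systematic_sample` is an assumption, and the sketch only covers the case where N is an exact multiple of n so that the interval K is an integer.

```python
import random


def systematic_sample(N, n, start=None, rng=random):
    """Return the 1-based positions selected with interval K = N // n.

    Assumes N is an exact multiple of n, as in the N = 100, n = 20 example.
    """
    if N % n != 0:
        raise ValueError("this sketch only handles an integer interval")
    K = N // n
    if start is None:
        start = rng.randint(1, K)  # random start between 1 and K inclusive
    if not 1 <= start <= K:
        raise ValueError("start must lie between 1 and K")
    return list(range(start, N + 1, K))


# With the random start 4 from the example, units 4, 9, 14, ..., 99 are selected
print(systematic_sample(100, 20, start=4))
```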

36

Systematic Sampling

Special methods are needed for systematic selection

with a fractional interval

Use of fractional interval

The list from which to sample should be ideally

randomly ordered

37

Systematic Sample

Disadvantage: periodicity in population list

e.g. sampling interval coincides with a periodic interval

of list

Example: suppose you select the 1st, 11th, 21st, etc. element, but the list is arranged so that the 1st is a man, the 2nd his wife, the 3rd a man, the 4th his wife, etc.

We would obtain a sample of males only, whereas the whole population is made up of males and females

Such periodicity may be easily avoided

Another way to solve this problem is to use stratification

38

Systematic Sampling

In all other sampling methods, the units (whether elements or clusters) are selected with the help of random numbers

But, a method of sampling in which only the first unit is selected with the help of random number while the rest of the units are selected according to a pre-determined pattern, is known as systematic sampling

It is very useful in forest surveys for estimating the volume of timber, in fisheries surveys for estimating the total catch of fish, and in milk yield surveys for estimating the lactation yield

39

Systematic Sampling

Advantages

Easy to understand

Quick and easy to implement

Arranging the frame in stratified order will create

implicit stratification

Disadvantages

Periodicity: if an ordering of the units goes unnoticed or unattended, this may result in an ‘unusual sample’

40


3. Stratified Sampling

If we have information about the composition of a

population, we may be able to improve on e.g. simple

random sampling by using stratification

Units are aggregated (grouped) into different non-overlapping subgroups, called strata

Then a certain number of units are randomly selected

from each stratum

41

Example: Stratified Sample

if a surveyor wants to find the most popular TV

programmes, it would be advisable to first divide the

population into 3 strata, men, women and children

then select a random sample from each of the strata

care must be taken to ensure that the strata are non-overlapping, i.e. there is no element falling into more than 1 category.

42

Stratified Sampling

The basic idea in this sampling is to divide a heterogeneous

population into sub-populations, usually known as strata

If the strata are internally homogeneous, a precise estimate of any stratum mean can be obtained based on a sample from that stratum

By combining such estimates, a precise estimate for the whole

population can be obtained

This sampling provides a better cross section of the population than

the procedure of simple random sampling

For example, in the case of a survey for income estimation, the whole population can be divided into three strata: low-income, medium-income and high-income

43

Stratified Sampling

It may also simplify the organization of the field work.

Geographical proximity is sometimes taken as the basis

of stratification.

The assumption here is that geographically contiguous areas are often more alike than areas that are far apart.

Administrative convenience may also dictate the basis

on which the stratification is made

Auxiliary information may be taken as the basis of

stratification

44

Stratified Sampling

In stratified sampling, the variance of the estimator consists of only the ‘within strata’ variation

Thus, the larger the number of strata into which a population is divided, the higher the precision

For estimating the variance within strata, there should be a

minimum of 2 units in each stratum

The larger the number of strata the higher will be the cost of survey

So, depending on administrative convenience, cost of the survey

and variability of the characteristic under study in the area, a

decision on number of strata will have to be arrived at

45

Example: Stratified Sample

(Figure: the whole sampling frame, of size N, is separated by region into 4 strata: North (N1), South (N2), East (N3) and West (N4). A random sub-sample is drawn from each stratum, e.g. with sampling fraction n1/N1 in the North.)

46

Stratified Sample

Can be

Proportionate (same sampling fraction for each stratum)

Disproportionate (different sampling fractions), which means differential probabilities of selection

e.g. often small subgroups are selected with a higher sampling fraction than the rest of the population to ensure a larger number of them in the final sample, to facilitate analysis
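As a rough illustration of the two options, the sketch below allocates a total sample of n = 500 across four strata, first proportionately and then with an equal (disproportionate) allocation. The stratum sizes, the total sample size and the function names are hypothetical; a real allocation would also need rounding to whole units.

```python
def proportionate_allocation(stratum_sizes, n):
    """Allocate n across strata in proportion to stratum size (same sampling fraction everywhere)."""
    N = sum(stratum_sizes.values())
    return {h: n * N_h / N for h, N_h in stratum_sizes.items()}


def sampling_fractions(stratum_sizes, allocation):
    """Sampling fraction n_h / N_h in each stratum."""
    return {h: allocation[h] / stratum_sizes[h] for h in stratum_sizes}


# Hypothetical stratum population sizes
sizes = {"North": 4000, "South": 3000, "East": 2000, "West": 1000}

proportionate = proportionate_allocation(sizes, 500)
print(proportionate)
print(sampling_fractions(sizes, proportionate))  # identical fraction in every stratum

# Disproportionate: equal allocation, so the smallest stratum is over-sampled
equal = {h: 125 for h in sizes}
print(sampling_fractions(sizes, equal))          # fraction rises as the stratum gets smaller
```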

47

Proportionate Stratified Sample

Advantages

Guards against the more unusual samples that can

be chosen by random chance

If stratifiers are related to the variables in your survey,

stratification can reduce standard errors

Disadvantages

Stratification information has to be available

48

Disproportionate Stratified Sample

Advantages

Allows one to over-sample small groups so that a

good statistical comparison can be made

Also used where the goal is to achieve an optimum

allocation between variance and cost

Disadvantages

Estimates of the total population need to be derived

using weighting (see later sessions)

49


4. Cluster sampling

A cluster is a naturally occurring unit like a county

(country, or state)

Sampling units are selected as part of a cluster of units

The difference from stratified sampling is that the starting point is a natural cluster, not a group ‘made up’ as in stratified sampling.

50

Cluster sampling

A sampling procedure presupposes division of the

population into a finite number of distinct and identifiable

units called the sampling units.

The smallest units into which the population can be divided are called the elements of the population, and groups of elements are called clusters

A cluster may be a class of students or cultivators’ fields

in a village

When the sampling unit is a cluster, the procedure of

sampling is called cluster sampling

51

Cluster sampling

For many types of population, a list of elements is not available; therefore, the use of an element as the sampling unit is not feasible.

The method of cluster sampling is available in such cases.

For example, in a city a list of all the houses may be available, but a list of persons rarely is; likewise, lists of farms are not available, but lists of villages or enumeration districts prepared for the census are.

Cluster sampling is, therefore, widely practised in sample surveys.

52

Cluster sampling

For a given number of sampling units, cluster sampling is more convenient and less costly than simple random sampling because of the time saved on journeys, identification, contacts etc.

Cluster sampling is generally less efficient than simple random sampling due to the tendency of the units in a cluster to be similar

In most practical situations, the loss in efficiency may be balanced by the reduction in cost, and the efficiency per unit cost may be greater in cluster sampling than in simple random sampling

53

Cluster sampling

Clearly, the size of the cluster will influence the efficiency of sampling

In general, the smaller the cluster, the more accurate will usually be the estimate of the population characteristic for a given number of elements in the sample

The optimum cluster is one which would estimate the characteristic under study with the smallest standard error for a given proportion of the population sampled or, more generally, for a given cost.

54


5. Multi-stage sampling

Large units are selected first and then smaller

units within the selected larger units are

selected (results in clustering)

55


5. Multi-stage sampling

One of the main considerations of adopting cluster sampling is the reduction of travel cost

However, this method restricts the spread of the sample over population which results in increasing the variance of the estimator

In order to increase the efficiency of the estimator for a given cost, it is natural to think of sub-sampling within the clusters and selecting a larger number of clusters, so as to increase the spread of the sample over the population.

Sampling which consists of first selecting clusters and then selecting a specified number of elements from each selected cluster is known as two-stage sampling (sub-sampling)
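A two-stage selection can be sketched as below. The sketch assumes simple random sampling without replacement at both stages, which is only one of several possible choices; the village and field labels and the function name `two_stage_sample` are made up for illustration.

```python
import random


def two_stage_sample(clusters, n_clusters, m_per_cluster, rng):
    """Select n_clusters clusters by SRS, then up to m_per_cluster elements by SRS within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)  # first stage: clusters
    sample = {}
    for c in chosen:
        elements = clusters[c]
        m = min(m_per_cluster, len(elements))
        sample[c] = rng.sample(elements, m)            # second stage: elements within cluster
    return sample


# Hypothetical frame: 10 villages (clusters), each with 8 fields (elements)
frame = {
    f"village_{i}": [f"v{i}_field_{j}" for j in range(1, 9)]
    for i in range(1, 11)
}

rng = random.Random(3)
for village, fields in two_stage_sample(frame, 4, 3, rng).items():
    print(village, fields)
```

Selecting 4 villages with 3 fields each gives 12 elements spread over 4 clusters, whereas taking whole clusters would confine the same 12 or so elements to one or two villages.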

56

Multi-stage sampling

Clusters are generally termed as first stage units (fsu’s) or primary

stage units (psu’s)

The elements within clusters or ultimate observational units are

termed as second stage units (ssu’s) or ultimate stage units (usu’s).

This procedure can be easily generalized to give rise to multistage sampling

It can be expected to be

(i) more efficient than simple random sampling and less efficient than cluster sampling from the operational convenience and cost point of view

(ii) less efficient than simple random sampling and more efficient than cluster sampling from the variability point of view

57

Multi-Stage Cluster Sampling

Advantages

Huge cost savings if the survey is carried out with face-to-face interviews

Useful when no frame is available for the final sampling unit

Disadvantages

To the extent that clusters are homogeneous with respect to the survey variables you are studying, this may result in larger standard errors (less precision of estimates)

58

Successive Sampling

Surveys are often repeated on many occasions (over years or seasons) to estimate the same characteristics at different points of time.

The information collected on a previous occasion can be used to study the change or the total value over occasions for the character, and also to study the average value for the most recent occasion

For example, in a milk yield survey, we may be interested in

1. Average milk yield for the current season

2. The change in milk yield between two different seasons

3. Total milk production for the year

59

Successive Sampling

The successive method of sampling consists of selecting

sample units on different occasions such that some units are

common with samples selected on previous occasions

If the objective is to estimate the change, then it is better to retain the same sample from occasion to occasion

For populations where the basic objective is to study the total, it is better to select a fresh sample for every occasion

If the objective is to estimate the average value for the most recent occasion, then it is better to retain a part of the sample over occasions

60

Multiphase Sampling

It is well known that the prior information on an auxiliary

variable could be used to enhance the precision of the

estimator.

Ratio, product and regression estimators require the

knowledge of population mean and total for the auxiliary

variable x.

When such information is lacking, it is sometimes less

expensive to select a large sample on which auxiliary

variable alone is observed.

The purpose is to furnish a good estimate of population mean

of x

61

Multiphase Sampling

Subsequently, a subsample from the initial sample is selected

for observing the variable of interest.

For example, consider the problem of estimating the total production of cow milk in a certain region. Suppose the village is taken as the sampling unit, but the number of milch cows in all the villages of the region is not available

Then investigator could decide to take a large initial sample of

villages and collect information on number of milch cows in the

sample villages

This information is used to build up an estimate of total

number of milch cows in the region

A subsample of villages is selected from the first-phase

sample to observe the study variable, viz., cow milk yield in

the village

62


6. Probability Proportional to Size (PPS)

Units are sampled in two or more stages with

probabilities proportional to their size (a clever

solution to ensure equal sized fieldwork

assignments while maintaining equal

probabilities of selection)

63

Sampling with Varying Probability

Under certain circumstances, selection of units with unequal probabilities provides more efficient estimators than equal probability sampling, and this type of sampling is known as unequal or varying probability sampling

The units are selected with probability proportional to a given measure of size, where the size measure is the value of an auxiliary variable x

This sampling scheme is termed probability proportional to size (pps) sampling

In pps sampling, the units may be selected with or without replacement.
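A with-replacement pps selection can be sketched with the standard library, where each draw picks a unit with probability equal to its size measure divided by the total of the size measures. The village names, size measures and the function name `pps_with_replacement` are assumptions for illustration; pps selection without replacement needs more specialised schemes and is not covered here.

```python
import random


def pps_with_replacement(sizes, n, rng):
    """Draw n units with replacement; each draw selects unit i with probability x_i / X."""
    units = list(sizes)
    total = sum(sizes.values())
    probs = {u: sizes[u] / total for u in units}  # per-draw selection probabilities
    draws = rng.choices(units, weights=[sizes[u] for u in units], k=n)
    return draws, probs


# Hypothetical size measure x: number of milch cows recorded for each village
sizes = {"village_A": 120, "village_B": 40, "village_C": 300, "village_D": 60}

draws, probs = pps_with_replacement(sizes, 3, random.Random(5))
print(probs)
print(draws)
```

Larger villages are more likely to be drawn, and because selection is with replacement the same village can appear more than once.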

64

Use of Auxiliary Information

In sampling theory, if auxiliary information related to the character under study is available on all the population units, then it may be advantageous to make use of this additional information in survey sampling

One way of using this additional information is in sample selection, with unequal probabilities of selection of units

The knowledge of auxiliary information may also be exploited at the estimation stage. The estimator can be developed in such a way that it makes use of this additional information

Use of Auxiliary Information (contd…)

The ratio estimator, difference estimator, regression estimator and generalized difference estimators are examples of such estimators

Obviously, it is assumed that the auxiliary information is available on all the sampling units

Another way the auxiliary information can be used is at the stage of planning of the survey. An example of this is the stratification of the population units by making use of the auxiliary information

Stratification I

67

Outline

What is stratification ?

Implicit and explicit stratification

Systematic sampling

Implementation of stratification

Some examples of stratification

68

Review

Note: in simple random sampling all units have the same probability of selection (the probabilities are known and positive)

But in general, random sampling does not need to be based on equal sampling probabilities (however, they need to be known and they all need to be positive), e.g. some units have a higher probability of selection

69

Random Sampling

We sometimes sample with unequal probabilities

Think of the population as being divided into $H$ subsets ($h = 1, \ldots, H$), with $N_h$ units in the $h$th subset.

If we sample separately from each subset, then we call the subsets sampling strata. If we sample $n_h$ units from stratum $h$, then the sampling fraction (selection probability) in that stratum is

$f_h = \frac{n_h}{N_h}$

70

What is Stratified Sampling?

Stratified sampling involves sorting (stratifying) the sampling frame prior to selection

Implicit Stratification involves sampling systematically from an ordered (stratified) list

Explicit Stratification involves sorting the population list (frame) into distinct strata and then sampling independently from each stratum

It is possible (and often desirable) to combine explicit and implicit stratification - i.e. to stratify implicitly within explicit strata

71

Why Stratified Sampling?

The primary reason for stratification is that it ensures

(unlike SRS) that the sample proportion from any

particular stratum equals the population proportion.

will increase precision if strata are correlated with survey

measures (smaller SE and CI)

Cannot do statistical harm (estimates not less precise

than under SRS)

This is true of both explicit and implicit stratification.

A secondary motivation for stratification is to permit the

use of variable sampling fractions.

72

Systematic Sampling

Recall session 1

Involves sampling at a fixed interval down a list

If the list is ordered in some meaningful way, this has the

effect of stratification

Advantage of being easy to implement

Procedure: calculate the required interval (K=N/n), then

generate a random start (R) (random number between 1

and K). The sampled units are then the Rth, (R+K)th,

(R+2K)th etc units on the list.

73

Systematic Sampling (2)

K = N/n, where N is the total number of units on the list,

and n the desired sample size.

R is a random number between 1 and K.

Note that K need not be an integer. E.g. if desired n is 500 and N = 10,679, using K = 21.36 will give exactly n = 500, but rounding to K = 21 will give n = 508 or 509, depending on the random start.

Do not use K = 21 and then stop once 500 are sampled: biased! (continue to the end of the list, i.e. to about 508 sampled cases)
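One common way to handle a fractional interval, sketched below, is to take a random real-valued start in (0, K] and round each point R, R + K, R + 2K, ... up to the next whole position. This particular rounding rule and the function name `systematic_fractional` are assumptions; the slides do not say which special method is intended.

```python
import math
import random


def systematic_fractional(N, n, rng):
    """Systematic selection with a possibly fractional interval K = N / n; returns 1-based positions."""
    K = N / n
    R = K * (1.0 - rng.random())  # random real start in (0, K]
    # Round each point R + i*K up to a whole position; min() guards against floating-point overshoot.
    return [min(N, math.ceil(R + i * K)) for i in range(n)]


positions = systematic_fractional(10_679, 500, random.Random(42))
print(len(positions), positions[:5], positions[-1])
```

Because K is at least 1, consecutive points land on different positions, so the sketch returns exactly n distinct units that cover the whole list.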

74

Example: Systematic Sampling

Determine the number of units: N = 100

Determine the sample size you want: n = 20

The interval size is therefore K = N/n = 100/20 = 5 (sample one fifth)

Select at random an integer R from 1 to K: e.g. 4 is chosen

Then select every K-th unit

(Figure: the list of units numbered 1 to 100)

75

Stratum Construction

Choose factors so that strata are homogeneous

If strata are correlated with survey measures, then precision increases

Strata examples: e.g. regions

We can sometimes estimate the precision achievable

with different choices

Choice of number of strata:

More strata, more precision

But variance estimation more difficult

And administration and sampling (and weighting) may be

more complex

76

Stratum Construction (2)

Cross factors with few categories rather than using many

categories for one factor

For example: stratify according to region and poor and

rich areas

When using a continuous factor (e.g. tax payments;

proportions of households with attribute A etc) choose

carefully the stratum boundaries (i.e. define sensible

categories and cut-off points)

77

Stratum Construction (3)

Choose stratifiers such that they are correlated with a

range of variables

For example, for national household surveys, tend to

choose stratifiers that are related to

Area characteristics (e.g. rural, urban, population density

etc)

Income / occupation (e.g. socio-economic group, social class)

It is common to use 3-4 stratification variables

hierarchically (see later example)

78

Example of Stratification: A General

Population Survey

The Health Survey for England (DH)

Stage 1:

Postcode Sectors stratified by:

14 Regional Health Authorities (1st-level explicit strata)

Proportion of adults with limiting long-term illness, in three

bands (2nd-level explicit strata)

Proportion of households with ‘non-manual’ head, in two

bands (3rd-level explicit strata)

Proportion of households with no car, in two bands (4th-level

explicit strata)

Proportion "non-white" (5th-level stratification: implicit)

79

Example of Stratification: A General

Population Survey (2)

720 sectors were sampled systematically

Stage 2

Within each sector, addresses are in postcode order,

and selected systematically. This provides some

geographical stratification.

80

Example of Stratification: A Special

Population Survey

Survey of Recipients of Job Seekers Allowance (DSS)

Stage 1

Postal sectors were stratified by region and number

of recipients

200 sectors were selected with probability

proportional to number of recipients

81

Example of Stratification: A Special

Population Survey (2)

Stage 2

Recipients were stratified by sex (2 bands) x claim type

(4 bands) x length of continuous unemployment prior to

current claim (implicit)

25 recipients were selected systematically from each

sampled sector

82

Stratified sample: some notation

Dividing for example frame into distinct strata and then

sampling independently from each stratum results in:

H strata (or groups), stratum h=1,…, H

In each stratum h there are Nh units (on population

level)

An independent sample of nh units is then selected

from each stratum h

Sampling fraction (selection probability) in stratum h is:

$$f_h = \frac{n_h}{N_h}$$

83

Estimator Under Stratification (Example)

We have 2 strata (e.g. north and south GB)

Proportion of people 18+ years old in GB who use the

internet: P

Estimator p:

$$p = \sum_{h=1}^{H} \frac{N_h}{N}\, p_h, \qquad N = \sum_{h=1}^{H} N_h$$

With the 2 strata:

$$p = \frac{N_1}{N_1 + N_2}\, p_1 + \frac{N_2}{N_1 + N_2}\, p_2$$
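As a rough illustration of the estimator on this slide, the sketch below combines stratum proportions using the population shares N_h / N. The stratum sizes and proportions are invented for the example and are not taken from any survey; the function name `stratified_proportion` is likewise hypothetical.

```python
def stratified_proportion(stratum_sizes, stratum_proportions):
    """Combine stratum proportions p_h using population shares N_h / N."""
    total = sum(stratum_sizes)
    return sum(N_h / total * p_h
               for N_h, p_h in zip(stratum_sizes, stratum_proportions))

# Hypothetical figures for two strata (e.g. north and south); assumptions only.
N = [20_000_000, 30_000_000]   # N_1, N_2: adults in each stratum
p = [0.60, 0.70]               # p_1, p_2: sample proportions using the internet

p_hat = stratified_proportion(N, p)
print(round(p_hat, 3))         # 0.4 * 0.60 + 0.6 * 0.70 = 0.66
```

The larger stratum pulls the overall estimate towards its own proportion, which is the intended effect of weighting by N_h / N.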

84

DEFF under Stratified Sample

Increase in precision under a stratified sample can be estimated using the DEFF:

$$DEFF = \frac{SE^2_{STRAT}}{SE^2_{SRS}}$$

Numerator is the variance under the stratified design

Denominator is the variance under SRS

How can $SE^2_{STRAT}$ be calculated?

85

Variance under Stratification

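Variance of a mean:

$$\widehat{\mathrm{var}}(\bar{x}) = \sum_{h=1}^{H} \frac{N_h^2}{N^2}\,\frac{s_h^2}{n_h}$$

... and for a proportion:

$$\widehat{\mathrm{var}}(p) = \sum_{h=1}^{H} \frac{N_h^2}{N^2}\,\frac{p_h(1-p_h)}{n_h}$$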

86

Variance under Stratification (2)

where

h is the stratum

$s_h^2$ is the sample variance in stratum h (estimated from the sample)

Nh is the population size in stratum h

nh is the sample size in stratum h

N is the total population size (N=N1+N2 +…+NH)

n is the total sample size (n=n1+n2 +…+nH)
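The variance formula for a stratified mean can be evaluated directly once the stratum quantities are known. The following is a minimal sketch with made-up stratum sizes, sample sizes and sample variances; the function name is an assumption for illustration, and the finite population correction is left out, as in the formula on the previous slide.

```python
import math

def stratified_variance_of_mean(N_h, n_h, s2_h):
    """Sum over strata of (N_h / N)^2 * s_h^2 / n_h (no fpc)."""
    N = sum(N_h)
    return sum((Nh / N) ** 2 * s2 / nh
               for Nh, nh, s2 in zip(N_h, n_h, s2_h))

# Hypothetical inputs (assumptions, not survey data).
N_h = [4000, 6000]     # population size per stratum
n_h = [40, 60]         # sample size per stratum
s2_h = [25.0, 100.0]   # sample variance per stratum

var_mean = stratified_variance_of_mean(N_h, n_h, s2_h)
se_mean = math.sqrt(var_mean)
print(var_mean, se_mean)   # 0.16*25/40 + 0.36*100/60 = 0.1 + 0.6 = 0.7
```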

87

Practical Limitations to Stratification

Often only possible at PSU level (e.g. household surveys) (PSU= primary sampling unit, e.g. postcode sector, schools etc) rather than at individual level

Correlation between strata and survey variables is typically modest

Depends on what information is available on the sampling frame

Multi-purpose nature of surveys: optimal stratification for one estimate may produce no benefit for another

Typically there is a lack of information about stratum variances

88

Comparisons between Stratification

and Quota Sampling

Recall session 1

Imposing quotas has a similar effect to stratification, namely to reduce sampling variance

But, quota sampling also has inherent bias towards more accessible and more willing population members

This may manifest itself as a bias in the survey measures

Thus, quota sample estimates could have relatively high precision, but be biased and therefore have low accuracy (high mean squared error) (session 3)

Stratification II

90

Outline of session

Variable Sampling Fractions

Motivations

Optimal allocation

Design effects

91

Variable Sampling Fraction (VSF)

We sometimes sample with unequal probabilities

Think of the population as being divided into H subsets

(h = 1, ... H), with Nh units in the hth subset.

If we sample separately from each subset, then we call

the subsets sampling strata. If we sample nh units

from stratum h, then the sampling fraction (selection

probability) in that stratum is nh/Nh.

$$f_h = \frac{n_h}{N_h}$$

92

Variable Sampling Fraction (VSF)

For unbiased estimation, each sampled unit i must be

assigned a weight in inverse proportion to its selection

probability.

This is usually referred to as the sampling weight or

design weight: wi

An example of such a weight in the case of stratified sampling would be:

$$w_i = \frac{N_h}{n_h} \quad \text{for } i \in h$$

(i.e. if sample unit i belongs to stratum h)

93

Use of weights

So when certain types of units have been selected

based on different selection probabilities (oversampling)

then the sample weights need to be taken into account in

estimation

Corrective weighting is needed to get design-unbiased

estimates

If weights are ignored then sample estimates are biased
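A small sketch of why the weights matter: below, a small stratum is deliberately oversampled, and the weighted and unweighted sample means are compared. All stratum labels, sizes and values are hypothetical.

```python
def design_weights(N_h, n_h):
    """Design weight N_h / n_h for each stratum label."""
    return {h: N_h[h] / n_h[h] for h in N_h}

# Hypothetical population: stratum "B" is small but oversampled.
N_h = {"A": 900, "B": 100}
n_h = {"A": 3, "B": 3}
sample = [("A", 10.0), ("A", 12.0), ("A", 14.0),
          ("B", 30.0), ("B", 32.0), ("B", 34.0)]

w = design_weights(N_h, n_h)
weighted_mean = (sum(w[h] * y for h, y in sample)
                 / sum(w[h] for h, _ in sample))
unweighted_mean = sum(y for _, y in sample) / len(sample)

print(weighted_mean)    # about 14: stratum A counts for 90% of the population
print(unweighted_mean)  # 22.0: pulled towards the oversampled stratum B
```

The unweighted mean treats the two strata as equally large, which is exactly the bias the slide warns about.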

94

Motivations for VSF

1. To increase the sample size of small groups

(i.e. to get acceptable confidence intervals for

estimates based on those groups)

2. Because the frame / selection method gives us

no choice

3. To increase precision of estimates by over-

sampling more variable strata

95

Examples

1. A national survey where estimates are also required for

each of the component countries /regions

E.g. survey of the UK, but estimates for Scotland, Wales and NI

are also needed separately

Then a larger sampling fraction might be used in Wales and

Scotland compared to England.

2. Sampling minority ethnic groups:

a high proportion of the minority ethnic population live within a

relatively small proportion of areas

Oversampling such (ethnically dense) areas will increase

achieved sample sizes while reducing survey costs.

96

Use of Variable Sampling Fractions

Now we want to investigate further the effects of using

variable sampling fractions

We have seen we need to use weights

We want to investigate under which circumstances

precision in survey estimates is increased and when

precision is reduced after using VSF

Or in other words: what is the effect of oversampling

on the precision of estimates?

97

Standard Errors for Stratified Sampling

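We have already introduced in the last session a formula for the variance

Generally, it is for a mean:

$$\widehat{\mathrm{var}}(\bar{x}) = \sum_{h=1}^{H} \frac{N_h^2}{N^2}\,\frac{s_h^2}{n_h}\left(1 - \frac{n_h}{N_h}\right) \qquad (6.1)$$

And for a proportion:

$$\widehat{\mathrm{var}}(p) = \sum_{h=1}^{H} \frac{N_h^2}{N^2}\,\frac{p_h(1-p_h)}{n_h}\left(1 - \frac{n_h}{N_h}\right) \qquad (6.2)$$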

98

Variance under Stratification (2)

where

h is the stratum

$s_h^2$ is the sample variance in stratum h (estimated from the sample)

Nh is the population size in stratum h

nh is the sample size in stratum h

N is the total population size (N=N1+N2 +…+NH)

n is the total sample size (n=n1+n2 +…+nH)

99

The finite population correction

The expression

$$\left(1 - \frac{n_h}{N_h}\right)$$

is referred to as the finite population correction

This term is only important if nh/Nh is not close to 0

Usually nh/Nh is very close to 0 (since N is very large, even if n is quite large) and the finite population correction can be ignored

Remember (standard error):

$$SE(\bar{x}) = \sqrt{\mathrm{Var}(\bar{x})}$$

100

Variance under Stratification

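If we ignore the finite population correction (for every stratum) we can simplify this to:

Variance of a mean:

$$\widehat{\mathrm{var}}(\bar{x}) = \sum_{h=1}^{H} \frac{N_h^2}{N^2}\,\frac{s_h^2}{n_h} \qquad (6.3)$$

Variance of a proportion:

$$\widehat{\mathrm{var}}(p) = \sum_{h=1}^{H} \frac{N_h^2}{N^2}\,\frac{p_h(1-p_h)}{n_h} \qquad (6.4)$$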

101


In addition to the simplification of the variance

estimation formulae for a mean and a proportion if we

ignore the finite population correction (fpc), we note:

Differences between strata do not contribute to variance. So, we should construct strata to be as homogeneous (small $s_h^2$) as possible

102


Note that in the special case where we use the same sampling fraction in each stratum, each of the variance formulae simplifies further.

We can substitute n/N in place of nh/Nh, and nh/n in place of Nh/N. (6.3) and (6.4) then become:

For a mean:

$$\widehat{\mathrm{var}}(\bar{x}) = \frac{\sum_{h} n_h s_h^2}{n^2} \qquad (6.5)$$

For a proportion:

$$\widehat{\mathrm{var}}(p) = \frac{\sum_{h} n_h\, p_h(1-p_h)}{n^2} \qquad (6.6)$$

103

We will look more at (6.5) and (6.6) later.

First, we will concentrate on Variable Sampling Fractions. In the presence of VSFs, we need formulae (6.3) and (6.4), ignoring the fpc.

104

Example: Over-Sampling More

Variable Strata

Sometimes, we can identify strata that have high population variances ($S_h^2$ large). Over-sampling these strata will tend to increase the precision of the survey estimates (reduce standard errors).

We can only do this if we have advance estimates of stratum variances.

Example to illustrate this:

Suppose H = 2 and N1 = N2 (=N/2).

Suppose we know (or estimate) that $S_1^2 = 2S_2^2$

105

Example (cont)

Then we can substitute into expression (6.3) (ignoring the fpc and looking at the population variance rather than the estimated variance) and we get:

$$\mathrm{var}(\bar{x}) = \frac{N^2}{4}\,\frac{2S_2^2}{N^2 n_1} + \frac{N^2}{4}\,\frac{S_2^2}{N^2 n_2} = \frac{S_2^2}{2n_1} + \frac{S_2^2}{4n_2}$$

106

Example (cont)

Now, consider two alternative sample designs:

a.) Proportional allocation, i.e. where $n_h / n = N_h / N$

b.) A higher sampling fraction in stratum 1, i.e. n1 larger than n2

107

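It follows:

For a.) Substitute n1 = n2 = n/2 :

$$\mathrm{var}(\bar{x}) = \frac{S_2^2}{n} + \frac{S_2^2}{2n} = 1.5\,\frac{S_2^2}{n}$$

For b.) Substitute e.g. n1 = 0.58n; n2 = 0.42n :

$$\mathrm{var}(\bar{x}) = \frac{S_2^2}{1.16\,n} + \frac{S_2^2}{1.68\,n} = 1.457\,\frac{S_2^2}{n}$$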

108

Example (cont)

So, the sampling variance is slightly smaller under design b)

It is smaller by a ratio of 1.457/1.5, i.e. 0.97

This is the design effect due to over-sampling the more variable stratum (VSF):

$$DEFF_{VSF} = \frac{SE^2_{VSF}}{SE^2_{SRS}} = \frac{1.457}{1.5} = 0.97$$

$$DEFT_{VSF} = \sqrt{DEFF_{VSF}} \approx 0.98$$

109

Example (cont)

This example illustrates how precision can be increased

by the use of Variable Sampling Fractions! (in the case

of oversampling strata with high stratum-variances)

This approach is quite common for repeated business

and agriculture surveys, but rare for household surveys.
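The arithmetic of this example can be checked with a few lines of code. The sketch below evaluates expression (6.3) for H = 2, N1 = N2 and S1² = 2·S2², returning the coefficient c in var = c · S2² / n for a given share of the sample placed in stratum 1. The function name and its `ratio` parameter are assumptions made for illustration.

```python
def variance_coefficient(share_1, ratio=2.0):
    """Coefficient c in var(x_bar) = c * S_2^2 / n.

    Assumes H = 2, N_1 = N_2, S_1^2 = ratio * S_2^2, n_1 = share_1 * n,
    and ignores the fpc, as in expression (6.3).
    """
    share_2 = 1.0 - share_1
    return 0.25 * ratio / share_1 + 0.25 / share_2

c_proportional = variance_coefficient(0.50)   # design a): n1 = n2 = n/2
c_oversampled = variance_coefficient(0.58)    # design b): n1 = 0.58n
deff_vsf = c_oversampled / c_proportional

print(c_proportional)            # 1.5
print(round(c_oversampled, 3))   # 1.457
print(round(deff_vsf, 2))        # 0.97
```

Trying other values of `share_1` shows that the gain is modest and disappears again if stratum 1 is oversampled too heavily.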

110

Note

We have seen when considering case b.) that a higher

sampling fraction in a stratum led to increased precision

Therefore: Important to consider which stratum allocation

will maximise survey precision (under the assumption of

not equal stratum variances)

111

Optimal Allocation

In general, the optimum allocation rule is to set:

$$\frac{n_h}{N_h} \propto \frac{S_h}{\sqrt{C_h}}$$

where Ch is the unit cost of data collection for a unit in stratum h.

If data collection costs do not vary between strata, this simplifies to:

$$\frac{n_h}{N_h} \propto S_h$$

If stratum variances are equal, it further simplifies to a constant K:

$$\frac{n_h}{N_h} = K$$

112

Optimal Allocation (cont)

The last case demonstrates that an equal probability

selection method is optimum in the situation where

variances and data collection costs are equal in all strata

(other things being equal).
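The allocation rule can be sketched as a small function that splits a total sample size n in proportion to N_h · S_h / √C_h. This is an illustrative sketch only: the function name is hypothetical, the inputs are invented, and rounding the resulting allocations to whole units is left to the user.

```python
import math

def optimal_allocation(n, N_h, S_h, C_h=None):
    """Split n so that n_h is proportional to N_h * S_h / sqrt(C_h).

    If C_h is omitted, unit costs are assumed equal across strata.
    """
    if C_h is None:
        C_h = [1.0] * len(N_h)
    scores = [N * S / math.sqrt(C) for N, S, C in zip(N_h, S_h, C_h)]
    total = sum(scores)
    return [n * score / total for score in scores]

# Two equal-sized strata with S_1^2 = 2 * S_2^2, i.e. S_1 = sqrt(2) * S_2.
alloc = optimal_allocation(1000, [5000, 5000], [math.sqrt(2), 1.0])
print([round(a, 1) for a in alloc])   # stratum 1 receives roughly 58.6% of n

# Equal variances and costs reduce to equal sampling fractions.
equal = optimal_allocation(300, [1000, 2000], [1.0, 1.0])
print(equal)                          # n_h / N_h = 0.1 in both strata
```

The first call gives a share close to the 0.58n used for design b) in the earlier over-sampling example, which suggests that allocation was chosen to be near the optimum.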

113

Example: VSFs with Equal Stratum

Variances

Example:

Again suppose H = 2, and N1 = N2.

But now suppose that stratum variances are equal, i.e. $S_1^2 = S_2^2$

Again consider two different sampling schemes:

a.) Proportional allocation, i.e. $n_h / n = N_h / N$

b.) Sampling fraction in stratum 1 is twice that in stratum 2, i.e. n1 = 2n/3; n2 = n/3.

114

Example (cont)

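Then, with design a), we find (from expression 6.3, again ignoring the fpc), writing $S^2 = S_1^2 = S_2^2$:

$$\mathrm{var}(\bar{x}) = \frac{(N/2)^2\, S^2}{N^2\,(n/2)} + \frac{(N/2)^2\, S^2}{N^2\,(n/2)} = \frac{S^2}{2n} + \frac{S^2}{2n} = \frac{S^2}{n}$$

(Note: this is the formula of the variance of a mean under SRS!)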

115

Example (cont)

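With design b), we find:

$$\mathrm{var}(\bar{x}) = \frac{(N/2)^2\, S^2}{N^2\,(2n/3)} + \frac{(N/2)^2\, S^2}{N^2\,(n/3)} = \frac{3S^2}{8n} + \frac{3S^2}{4n} = \frac{9S^2}{8n}$$

It follows:

$$DEFF_{VSF} = \frac{SE^2_{VSF}}{SE^2_{SRS}} = \frac{9S^2/8n}{S^2/n} = 9/8 = 1.125$$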

116

Example (cont)

This means:

The sampling variance under design b) is 9/8 (=1.125) times that under design a).

By allocating disproportionately, we have lost precision (in the case of equal stratum variances)!

In general, precision will be lost whenever variable sampling fractions are used, if the stratum variances do not vary (much).

The level of precision loss depends on the range of the weights used

117

Design Effects due to VSF’s

If we can assume stratum variances to be equal, there is an

alternative and often-used way to estimate effect of VSFs on

sampling variance.

Expression 6.1 can be used to derive an expression for the effective sample size:

$$\widehat{neff}_{VSF} = \frac{\left(\sum_h n_h w_h\right)^2}{\sum_h n_h w_h^2}$$

where: nh is the sample size in stratum h and wh is the weight given to each case in stratum h. (Remember that wh will be proportional to Nh/nh)

118

Design Effects due to VSF’s (cont)

Note that this expression only takes into account the

effect of VSFs on effective sample size, not the effect of

any other aspect of design.

Formula on previous slide can be used at design stage

to predict impact on precision of alternative allocations to

strata!
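As a check on the effective sample size formula, the sketch below applies it to the earlier equal-variance example, where stratum 1 is sampled at twice the rate of stratum 2 (so relative weights are 1 and 2). The total sample size of 900 is an arbitrary assumption; only the ratio n / neff matters.

```python
def effective_sample_size(n_h, w_h):
    """neff = (sum n_h * w_h)^2 / sum(n_h * w_h^2), assuming equal stratum variances."""
    numerator = sum(n * w for n, w in zip(n_h, w_h)) ** 2
    denominator = sum(n * w * w for n, w in zip(n_h, w_h))
    return numerator / denominator

n = 900                  # hypothetical total sample size
n_h = [600, 300]         # n1 = 2n/3, n2 = n/3
w_h = [1.0, 2.0]         # relative weights, proportional to N_h / n_h

neff = effective_sample_size(n_h, w_h)
deff_vsf = n / neff

print(neff)       # 800.0
print(deff_vsf)   # 1.125, i.e. 9/8 as derived in the example
```

Because only relative weights enter the formula, multiplying all w_h by a constant leaves neff unchanged.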

119

Design Effects due to VSF’s (cont)

In general, it will be found that:

larger range of sampling fractions (weights) results in a smaller neff (i.e. greater loss of precision)

over-sampling a large subgroup results in greater loss of precision than over-sampling a small subgroup

when main aim is to produce estimates for subgroups, equal sample sizes per subgroup will be an efficient design

when the main aim is to produce estimates for the total population, equal sampling fractions will be efficient.

120

Graphical illustration of neff

The following graph illustrates the effect of oversampling

on survey precision for a sample with 2 strata (H=2)

The graph shows relationship between the proportion of the sample in stratum 1 (n1/n) (x-axis)

and the consequent loss of precision, as measured by the design effect

(y-axis).

The three lines relate to three oversampling rates and

the subsequent relative weights that need to be used:

2:1, 4:1 and 10:1 (i.e. w1=1 in all cases).

(2:1 means that stratum 1 is oversampled by a factor of

2)

121

[Figure: design effect DEFF_VSF (y-axis, from 1 to 3.4) plotted against n1/n (x-axis, from 0 to 1), with three lines for w2=2, w2=4 and w2=10]

122

Graphical illustration of neff

The graph illustrates the two points made

earlier:

larger range of sampling fractions (weights)

results in a smaller neff (i.e. greater loss of

precision)

over-sampling a large subgroup results in

greater loss of precision than over-sampling a

small subgroup
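The curves in the graph can be tabulated from the effective sample size formula. The sketch below assumes, as in the graph description, that w1 = 1 and that stratum 2 cases carry the relative weight w2; the function name and the grid of n1/n values are choices made for this illustration.

```python
def deff_vsf(share_1, w2, w1=1.0):
    """Design effect n / neff for two strata, with n1/n = share_1.

    Assumes equal stratum variances, so only the weights matter.
    """
    share_2 = 1.0 - share_1
    numerator = share_1 * w1 ** 2 + share_2 * w2 ** 2
    denominator = (share_1 * w1 + share_2 * w2) ** 2
    return numerator / denominator

shares = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
for w2 in (2, 4, 10):
    row = [round(deff_vsf(s, w2), 2) for s in shares]
    print(f"w2={w2}:", row)
```

Under these assumptions each curve starts and ends at 1 and peaks where the oversampled stratum makes up a large share of the sample, which is the pattern summarised in the two points above.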

Multi-Stage Sampling

124

Outline of session

What is multi-stage / cluster sampling

Motivations for multi-stage sampling

Choice of sampling units, sample sizes at each

stage

Selection probabilities and weighting

Probability Proportional to Size (PPS) sampling

Design effects due to clustering

125

What is Multi-Stage Sampling?

The units in the population are arranged hierarchically

A 3-stage design would entail:

Primary sampling units (PSUs)

Secondary sampling units (SSUs)

Sample elements

It would be necessary to assign every element uniquely

to one SSU and every SSU uniquely to one PSU

126

What is Multi-Stage Sampling?

Stage 1: select sample of PSUs

Stage 2: select sample of SSUs within each selected PSU

Stage 3: select sample of elements within each selected SSU

Note that there could be any number of stages: 2, 3 or 4 are common

127

Examples:

general population survey :

PSUs might be postcode sectors

SSUs might be households

Elements might be persons

business survey :

PSUs might be companies

SSUs might be workplaces

Elements might be employees

128

Why Multi-Stage Sampling?

No frame of elements available, but frame of PSUs

available (examples: national sample of school pupils, where

schools could be PSUs; US face to face survey where counties are

PSU’s)

Cost of data collection (example: general population sample

involving face-to-face interviewing)

Access to elements may only be via “gatekeepers” (examples: students, employees, trainees)

Data quality (example: in the case of face-to-face interviewing,

field work can be better supervised if in clusters)

129

Design Choices (clustering):

Example: Field interviewing

Constraint                                   Implication

Tight fieldwork periods                      Small workload per interviewer

Completion depends on slowest interviewer    Equal interviewer workloads

Efficient fieldwork                          Each workload in a small area

Training / briefing / learning costs         Large workload per interviewer

130

Design Choices (clustering):

Some General Points:

Larger clusters will generally result in larger design

effects due to clustering (see later)

But larger clusters will also generally result in larger cost

savings (e.g. field interviewers, gatekeepers)

Necessary to make an appropriate compromise: i.e.

where cost saving outweighs loss in precision, to

produce higher overall accuracy per unit cost

(remember key aim of sample design: minimising costs,

maximising accuracy)

131

Selection Probabilities: Principle

With multi-stage sampling, the selection probability of

each element is the product of the (conditional) selection

probabilities at each stage

e.g. probability of sampling unit i in SSU j in PSU k is

Prijk = Pr (k) x Pr (j | k) x Pr (i | j,k)

So, it is important to control and record the selection

probabilities at each stage.

132

Selection Probabilities

Other things being equal, it is desirable to keep selection

probabilities equal for all elements (remember:

stratification; otherwise loss in precision).

If selection probabilities are not equal, we will need to

weight each sampled element ijk by

wijk = 1/Prijk

for unbiased estimation.
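A minimal sketch of the product rule and the resulting weight, using exact fractions. The three stage probabilities are hypothetical and the function name is an assumption for illustration.

```python
from fractions import Fraction

def element_probability(*stage_probabilities):
    """Overall selection probability as the product of the conditional stage probabilities."""
    overall = Fraction(1)
    for prob in stage_probabilities:
        overall *= prob
    return overall

# Hypothetical 3-stage design: Pr(k), Pr(j | k), Pr(i | j, k).
pr_ijk = element_probability(Fraction(1, 10), Fraction(1, 5), Fraction(1, 4))
w_ijk = 1 / pr_ijk

print(pr_ijk)   # 1/200
print(w_ijk)    # 200
```

Recording the probability at every stage is what makes this calculation possible after fieldwork.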

133

Selection Options

With multi-stage sampling, there are many ways to

achieve equal selection probabilities.

(epsem design = equal probability of selection method; =

self-weighting design)

In the (rare) case of equal size PSU’S and 2-stage

sampling, we can easily select PSU’s (j’s) and elements

(i’s) with equal probability.

Example: Design (0):

Pr(j) =1/3 and Pr (i|j)=1/2 and the overall probability is

Pr(i) = 1/3 * 1/2 = 1/6 for all i.

134

Selection Options

In many types of sampling situations having equal size PSU’s is rare. In the case of unequal sized PSU’s we are left with 3 alternative designs:

1. select PSUs with equal probabilities and then a fixed number of elements within each - gives unequal selection probabilities (not an epsem design)

2. select PSUs with equal probabilities and then a variable number of elements within each, to give equal overall selection probabilities

3. select PSUs with PPS (probability proportional to size), then a fixed number of elements within each

135

Selection Options

Design 1) undesirable because it will generally cause loss in precision compared with an epsem design; non-epsem design undesirable; weighting needed

Design 2) avoids this problem, but causes practical problems. Number of elements sampled per PSU will vary in proportion to the population size of PSU. Elements in one PSU typically form one interviewer workload, so this is undesirable.

Also, with design 2) the sample size is not fixed in advance - it is a random variable. Very undesirable!

136

Selection Options

Design 3) overcomes all these problems, but it depends

on the availability of a reasonably accurate measure of

the number of elements in each PSU (and SSU, if a 3

stage design).

Note: when accurate measures of the number of elements within each PSU are not available, it may be possible to get a reasonably good estimate of the measure of size and to proceed with PPS sampling

The next slide discusses this design further:

137

Probability Proportional to Size (PPS)

Selection

Example: A 2-stage design

set Pr (j) proportional to Nj (number of elements in

population in PSU j = PPS sampling).

So Pr (j) = C Nj.

We then select the same number of elements, D, from

each sampled PSU, so Pr (i| j) = D/ Nj.

Then,

Pr (i) = Pr (j) x Pr (i|j) = C Nj x D/ Nj = CD, which is

the same for every element

138

Implementation of a PPS Design

We do not need to calculate the selection probabilities at

each stage in order to make the selection.

We need only to create a cumulative total down the list

of PSUs (e.g. 10,000) and then sample systematically

down that list of totals, including each PSU within which

the interval falls

139

Implementation of a PPS Design

Example: Selection of 3 PSUs from 10 with PPS and 25

units from each selected PSU, so that n=75

Pr(j) is probability of selecting the PSU

Pr(i|j) is the probability of selecting each unit, given that

PSU has been selected, and

Pr(i) is the overall probability of selecting each unit.

It can be seen that each of the 10,000 units in the

population has the same selection probability:

140

Example of a PPS Design

PSU    Size (Nj)   Pr(j)=C*Nj       Pr(i|j)=D/Nj   Pr(i) = Pr(j) x Pr(i|j) = C*D

1      1000        3x1000/10000     25/1000        75/10000

2       900        3x 900/10000     25/ 900        75/10000

3       800        3x 800/10000     25/ 800        75/10000

4      1200        3x1200/10000     25/1200        75/10000

5      1500        3x1500/10000     25/1500        75/10000

6      1300        3x1300/10000     25/1300        75/10000

7      1100        3x1100/10000     25/1100        75/10000

8       500        3x 500/10000     25/ 500        75/10000

9      1000        3x1000/10000     25/1000        75/10000

10      700        3x 700/10000     25/ 700        75/10000

Total 10000        C=3/10000        D=25

141

Example of a PPS Design (cont)

We would select the sample of PSUs as follows:

N = 10,000 and n = 3 (PSUs).

To select systematically (see session: stratification I), K

=N/n= 3333 and R needs to be a random number

between 1 and 3333. Suppose we happen to generate

R = 1,050.

Then, we sample the PSUs that contain elements 1050, (1050 + 3333) = 4383 and (1050 + 2x3333) = 7716, i.e. PSUs 2, 5 and 7:

142

Example of a PPS Design (cont)

PSU    Size    Cum. size    Selection

1      1000     1000

2       900     1900        *

3       800     2700

4      1200     3900

5      1500     5400        *

6      1300     6700

7      1100     7800        *

8       500     8300

9      1000     9300

10      700    10000
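The selection in this example can be reproduced with a short sketch of systematic PPS sampling down the cumulative size list. The PSU sizes, the interval K = 3333 and the random start R = 1050 are taken from the example; the function name and the use of `bisect` are implementation choices made for this illustration.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate

def pps_systematic(sizes, n_psu, start):
    """Return the 1-based PSU numbers containing elements start, start + K, ..."""
    interval = sum(sizes) // n_psu              # K = N / n, rounded down
    cum_sizes = list(accumulate(sizes))         # cumulative total down the list
    targets = [start + k * interval for k in range(n_psu)]
    # The first PSU whose cumulative size reaches the target contains it.
    return [bisect_left(cum_sizes, t) + 1 for t in targets]

sizes = [1000, 900, 800, 1200, 1500, 1300, 1100, 500, 1000, 700]
selected = pps_systematic(sizes, n_psu=3, start=1050)
print(selected)   # [2, 5, 7]

# Overall probability for an element in each selected PSU: C*Nj x D/Nj.
overall = [Fraction(3 * sizes[j - 1], 10000) * Fraction(25, sizes[j - 1])
           for j in selected]
print(overall)    # each equals 75/10000, i.e. 3/400
```

A different random start between 1 and 3333 would pick different PSUs, but the overall probability for every element stays at 75/10000.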

143

Some Limitations of PPS Sampling of

PSUs

We might have only imperfect estimates of number

of elements in each PSU (the size measure)

We could then adjust the sample size within each

PSU to keep overall probabilities equal or we might

simply weight by 1/Pr(i)

Sampling interval might be smaller than number of

elements in some PSUs. (This will only happen if

sampling fraction of PSUs is large and/or size of

PSUs highly variable.) Those PSUs will be certain

to be sampled, and could be sampled more than

once.

144

Some Limitations of PPS Sampling of

PSUs

We might place these PSUs in a separate stratum and

include them with certainty. We might also increase their

sample size of elements, to keep overall probabilities

equal, or we might weight

145

Design Effects due to Clustering

Clustering tends to increase sampling variance (but this

is partly offset by the fact that a larger sample size can

be obtained for any given cost).

This is because units within a cluster tend to be more

homogeneous than units as a whole.

Clustering is therefore tending to have the opposite

effect to stratification.

146

Example of Homogeneity of Clusters

Let us consider the following example to illustrate the effect

of clustering:

Population of 6 people, with values: 1, 1, 2, 2, 3, 3.

Population mean = 12/6=2

Population variance:

$$\mathrm{var}(X) = \frac{1}{6}\sum_{i=1}^{6}(x_i - 2)^2 = 4/6 = 2/3$$

147

Example (cont)

a) divide population into 3 clusters: (1,1) (2,2) and (3,3).

Then: no variance within clusters (homogeneous

clusters). But variance between the cluster means is:

var (XB) = [(1-2)^2 + (2-2)^2 + (3-2)^2] / 3 = 2/3.

It implies that sampling variance is greater than 0 since

we get different estimates of the mean depending on

which cluster is sampled.

148

Example (cont)

b) divide the population into 2 clusters: (1,2,3) (1,2,3). No variance between cluster means. But variance within each cluster is:

Var (XW) = 2 * [[(1-2)^2 + (2-2)^2 + (3-2)^2] / 3] / 2 = 2/3

The sampling variance is 0 since there is no variability in sample means.

With design a) all the variance is between clusters -clusters are perfectly homogeneous.

With design b), clusters are as heterogeneous as the population as a whole, so cluster sampling would not cause a loss in precision.

149

Example (cont)

If we sample one cluster (and then include all elements),

design a) has a sampling variance of 2/3; design b) has

a sampling variance of 0.

This illustrates the general point that sampling variance

will be greater if clusters are relatively homogeneous

(i.e. like in a) )
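The two cluster designs can be compared numerically. The sketch below assumes one cluster is drawn with equal probability and all its elements are included, so the sampling variance of the estimated mean is simply the variance of the cluster means. The helper names are hypothetical; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def population_variance(values):
    """Variance with divisor equal to the number of values."""
    vals = [Fraction(v) for v in values]
    centre = sum(vals) / len(vals)
    return sum((v - centre) ** 2 for v in vals) / len(vals)

def one_cluster_sampling_variance(clusters):
    """Sampling variance of the mean when one whole cluster is sampled."""
    cluster_means = [Fraction(sum(c), len(c)) for c in clusters]
    return population_variance(cluster_means)

design_a = [(1, 1), (2, 2), (3, 3)]   # perfectly homogeneous clusters
design_b = [(1, 2, 3), (1, 2, 3)]     # clusters as varied as the population

print(population_variance([1, 1, 2, 2, 3, 3]))   # 2/3
print(one_cluster_sampling_variance(design_a))   # 2/3
print(one_cluster_sampling_variance(design_b))   # 0
```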

150

Design Effects due to Clustering (cont)

Typically, the sorts of units that we tend to use as PSUs are relatively homogeneous, so in practice clustering nearly always results in a design effect due to clustering which is greater than one.

Examples:

people within postcode sectors,

pupils within schools,

students within classes

employees within firms.

151

Intra-Cluster Correlation

The design effect due to clustering is

$$DEFF_{CL} = 1 + (b - 1)\rho$$

where b is sample size per cluster (in practice b may vary slightly, in which case mean cluster size provides an adequate approximation), and ρ (‘roh’) is the intra-cluster correlation.

ρ = 0: randomly sorted clusters

ρ = 1: perfectly homogeneous clusters

152

Intra-Cluster Correlation (cont)

Note that ρ is a population characteristic relating to the

chosen definition of PSU, but sample design should

involve a careful choice of b.

Examples of possible values:

b=10: if ρ = 0 then $DEFF_{CL} = 1$

b=10: if ρ = 1 then $DEFF_{CL} = 10$

more realistically, b=10, if ρ = 0.05 then $DEFF_{CL} = 1.45$
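These values follow directly from the design effect formula DEFF_CL = 1 + (b - 1)ρ. A minimal sketch, with hypothetical function names:

```python
import math

def deff_cluster(b, rho):
    """Design effect due to clustering: 1 + (b - 1) * rho."""
    return 1 + (b - 1) * rho

def deft_cluster(b, rho):
    """DEFT is the square root of DEFF."""
    return math.sqrt(deff_cluster(b, rho))

for rho in (0.0, 1.0, 0.05):
    print(rho, round(deff_cluster(10, rho), 2), round(deft_cluster(10, rho), 2))
```

With b = 10 this gives design effects of 1, 10 and 1.45 for ρ = 0, 1 and 0.05 respectively. Lowering b for a fixed ρ reduces the design effect, which is why the choice of sample size per cluster matters at the design stage.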

153

Inflation due to clustering

Reminder: the square root of DEFF is DEFT

DEFT_CL inflates confidence intervals of the mean (or proportion) as follows:

$$\bar{x} \pm 1.96 \times SE \times DEFT_{CL}$$

154

Example of Intra-Cluster Correlations

From the British Social Attitudes Survey:

Variable                   ρ       b      DEFT    DEFT if b=10

Household size             0.070   16.6   1.45    1.28

Owner-occupier             0.231   16.5   2.14    1.75

Has telephone              0.102   16.5   1.61    1.38

Asian                      0.334    8.3   1.86    1.53

Roman Catholic             0.037   16.4   1.25    1.15

Not racially prejudiced    0.021    8.4   1.08    1.03

Extra-marital sex wrong    0.044    8.3   1.15    1.08

Dodging VAT is OK          0.021    8.2   1.07    1.04

155

Example of Intra-Cluster Correlations

Note that ρ is low for attitudinal variables, so design effects are small (DEFT small). But ρ is large for variables related to ethnicity and housing type.

Thus, the most effective degree of clustering might be

greater for an attitude survey (fewer, larger clusters) than

for a housing survey.

156

References

Cochran, W.G., (1977). Sampling techniques; Wiley Eastern Ltd.

Des Raj, (1968). Sampling theory; Tata-Mcgraw-Hill Publishing

Company Ltd.

Hansen, M.H. and Hurwitz, W.H. (1943b). On the theory of sampling

from finite populations; Ann. Math. Statist., 14, 333-362.

Hansen, M.H., Hurwitz, W.H. and Madow, W.G., (1993). Sample survey

methods and theory, Vol. 1 and Vol. 2; John Wiley & Sons, Inc.

Murthy, M.N., (1977). Sampling theory and methods; Statistical

Publishing Society

Sukhatme, P.V., Sukhatme, B.V., Sukhatme, S. and Ashok, C. (1984).

Sampling theory of surveys with applications; Indian Society of

Agricultural Statistics.
