Statistics in Survey Analysis

8/10/2019 Statistics in Survey Analysis


Ric Coe
ICRAF, Nairobi, Kenya

Contents

Introduction
Preliminaries
Descriptive Statistics
    1. Summarizing Single Variables
    2. Two variables
Descriptive statistics - common problems
Confirmatory analysis: estimation and hypothesis testing
    The problem
    Estimates, standard errors and confidence intervals
    Hypothesis tests: The logic
    Examples of calculations
    Limitations
    What should you do
Confirmatory Analysis - Regression
    Starting Regression
    Fitting the regression line
    Check the fit
    Interpretation
    Adding more variables - Multiple regression
Interpretation
References

Introduction

This guide summarises the use of simple statistical analyses in the interpretation of survey data. It is aimed at the typical small surveys (up to a few hundred respondents) carried out by researchers looking at the role and uptake of new agricultural technologies.

1 Modified from input to a course "Formal data analysis for bean researchers" organised by CIAT at CMRT, Egerton University, February 1996. Thanks to Soniia David for permission to quote the example.



There are several common problems in the approaches to survey analysis used by many researchers, probably a result of the research methods courses followed during training. One is to concentrate attention on a few well known statistical techniques, such as chi-squared tests in 2-way tables and regression analysis, and to place a naively simplistic reliance on the results. This is the topic of this guide. A second problem is to treat statistical analysis as a recipe that can be followed to a successful conclusion without much thought or understanding along the way. This is the topic of a companion guide, Steps in survey analysis (Coe 2002). A third problem is to ignore the context in which the survey was carried out, so ignoring many of the possibilities and limitations of the statistical analysis. This is the topic of the guide Approaches to analysis of survey data (SSC, 2001).

Example

The example used in this guide was a survey of farmers in two districts of Uganda. It aimed to characterize the pattern of bean growing and understand the role of new bean varieties in the household economy of new farmers. A few of the stated objectives were:

Overall: Provide a baseline against which to measure adoption and impact of improved bean varieties.

Hypotheses:

1. Adoption.
   a. There is no relationship between adoption of new varieties and wealth.
   b. The rate of adoption for MCM5001 will be higher in Mbale than Mukono, due to strong non-appreciation of small seeded varieties in Mukono.

2. Impact.
   a. Adoption of new varieties will result in an increase in absolute quantities and proportion of beans sold, hence increasing household income from beans.
   b. Adoption of new varieties will not result in increased sales of fresh beans.
   c. Adoption of new varieties will not change the amount of income from beans controlled by women.
   d. ...

The examples are based on a subset of just 50 households from the whole survey of 179.

The variables used in the example have been labeled so should be self-explanatory.

In this guide SPSS has been used for the statistical analysis. General points appear in normal text. Computer output and other items relating specifically to the example are boxed.



Preliminaries

Before starting analysis:

1. Make sure you are familiar with the data source and collection methods. For example:

   Was a random sampling scheme used? Were individual questionnaires completed during a group meeting? Who was the data collected by? Why and when?

2. Clarify objectives.

   These should have been listed in detail when the survey was planned. If they were not, or have changed, they must be listed now. It is impossible to analyze a survey if you do not know what you are trying to find out.

3. Coding and data entry.

4. Make sure you understand the data. You must understand the exact meaning of every number and code.

Data that need clarifying:

Variable WIVES (Question 3): Does 1 mean 1 wife or 2 wives? (Conflict between questionnaire and code book.)

Variable ARRANGE (Question 4): Does NA mean there are no bean plots or no husband/wife?

Variables OCCUPHD1 and OCCUPHD2 (Question 8): Why are two occupations given when the question asks for the main occupation?

Variable KAW94A (Question 21): What is the difference between "na" and "No"?

Variable AMKW94A (Question 21): What are the units?



Descriptive Statistics

1. Summarizing Single Variables

Qualitative (Coded) variables.

Useful summaries are just frequencies and percentages.

MATOKE   Grows matoke

                                                     Valid      Cum
Value Label             Value  Frequency  Percent  Percent  Percent

Yes                         1         42     84.0     84.0     84.0
No                          2          8     16.0     16.0    100.0
                                 -------  -------  -------
Total                                 50    100.0    100.0

Valid cases 50    Missing cases 0

HHTYPE   Household type

                                                     Valid      Cum
Value Label             Value  Frequency  Percent  Percent  Percent

Male headed one wife        1         27     54.0     54.0     54.0
Male headed more tha        2          4      8.0      8.0     62.0
Female headed absent        3          3      6.0      6.0     68.0
Female headed, no hu        4         13     26.0     26.0     94.0
Single man                  5          2      4.0      4.0     98.0
Other                       7          1      2.0      2.0    100.0
                                 -------  -------  -------
Total                                 50    100.0    100.0

Valid cases 50    Missing cases 0



Note the different emphasis of frequencies and percentages. Frequencies emphasize the sample, percentages emphasize the population. Give the total sample size with percentages.

Take care with percentages: make sure you are using an appropriate baseline (what is 100%) and remember that percentages might not have to add to 100, as in the example below.

Edit the computer output for presentation!

Crop           % growing
Cassava        100
Beans           98
Matoke          84
Maize           78
Yams            20
Sample size     50

Look carefully at and identify rare cases. Such data points may be errors, or may need special treatment.

What is the one "Other" household type in question 2?

One farmer does not grow beans. Should this case be deleted from all analyses?

Bar charts are most appropriate when the categories can be ordered in some useful way.

Quantitative Variables

In summarizing quantitative variables the most interesting things are:

o Location (What is a typical value?)
o Spread (How much variation is there?)
o Odd values (What is their source and interpretation?)

Location is measured by mean or median (not usefully the mode)

Spread is measured by standard deviation or distance between quartiles.

Quantities such as the 10% and 90% point are useful in some situations.
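The summaries of location and spread listed above can be sketched in a few lines of Python. This is only an illustration: the harvest values below are hypothetical, not the survey data, and the variable names are invented for the sketch. It shows how one odd value moves the mean and standard deviation while leaving the median and quartiles almost unchanged.

```python
import statistics

# Hypothetical harvest amounts (kg) for ten households, NOT the survey data.
# One household reports a far larger harvest than the rest.
harvest = [0, 0, 2, 3, 4, 5, 8, 12, 14, 200]

mean = statistics.mean(harvest)                  # location, pulled up by the 200
median = statistics.median(harvest)              # location, resistant to the 200
sd = statistics.stdev(harvest)                   # spread: sample standard deviation
q1, q2, q3 = statistics.quantiles(harvest, n=4)  # 25%, 50% and 75% points
iqr = q3 - q1                                    # spread: distance between quartiles

# Recompute the mean without the odd value to see how much it matters.
mean_without_200 = statistics.mean([v for v in harvest if v != 200])

print(f"mean = {mean:.1f}, median = {median:.1f}, sd = {sd:.1f}")
print(f"quartiles = {q1:.1f} and {q3:.1f}, interquartile range = {iqr:.1f}")
print(f"mean ignoring 200 = {mean_without_200:.1f}")
```

Different software uses slightly different rules for quartiles, so the 25% and 75% points from another package may not match these exactly.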



Use Histograms and boxplots.

2. Two variables.

Two qualitative variables = cross tabulation

Interpretation can be helped by careful layout.

Percentages may be calculated of row totals, column totals or overall totals. Not all of them will make sense!

Amount of beans harvested in 94a

Mean                  15.9
Standard deviation    34.2
Median                 4.0
25% point              0
75% point             14.0
Mean (ignoring 200)   10.1

[Figure: histogram of total beans harvested in 94a; x-axis 0 to 200, y-axis 0 to 40; Std. Dev = 34.21, Mean = 16.0, N = 47]



                             Household type
Crop earning          Male      Female     Single
highest income        Headed    Headed     Male      Total

Coffee                  19         7         1         27
Groundnut                2         4         0          6
Bogoya                   1         3         0          4
Cassava                  1         0         1          2
Matoke                   2         0         0          2
Beans                    1         0         0          1
Other                    5         0         0          5
No sales                 0         2         0          2
Total                                                  49
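Row and column percentages for the table above answer different questions, which is why not all of them make sense. The following is a minimal sketch using the counts from the table; the dictionary layout and function name are choices made for this illustration only.

```python
# Counts copied from the cross tabulation above.
columns = ["Male headed", "Female headed", "Single male"]
counts = {
    "Coffee":    [19, 7, 1],
    "Groundnut": [2, 4, 0],
    "Bogoya":    [1, 3, 0],
    "Cassava":   [1, 0, 1],
    "Matoke":    [2, 0, 0],
    "Beans":     [1, 0, 0],
    "Other":     [5, 0, 0],
    "No sales":  [0, 2, 0],
}

column_totals = [sum(row[j] for row in counts.values()) for j in range(len(columns))]
grand_total = sum(column_totals)


def percent(part, whole):
    return 100.0 * part / whole


# Row percentages: of the households whose top-earning crop is X,
# what share belongs to each household type?
row_pct = {crop: [percent(c, sum(row)) for c in row] for crop, row in counts.items()}

# Column percentages: of the households of each type,
# what share names crop X as the top earner?
col_pct = {crop: [percent(c, column_totals[j]) for j, c in enumerate(row)]
           for crop, row in counts.items()}

print("Column totals:", dict(zip(columns, column_totals)), "overall", grand_total)
print("Coffee, row %:   ", [round(v, 1) for v in row_pct["Coffee"]])
print("Coffee, column %:", [round(v, 1) for v in col_pct["Coffee"]])
```

To compare household types, the column percentages are the relevant ones here: they put male headed and female headed households on the same baseline despite the different group sizes.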



One qualitative and one quantitative variable = group comparison

Two quantitative variables

A scatter diagram is the only really useful way to summarize two quantitative variables and their relationship.

The correlation coefficient is a summary of the strength of linear relationship between variables. It should NOT be quoted unless the data have first been looked at in a scatter diagram.

If there appears to be a relationship between variables the points to look for are:

Total beans harvested in 94a

                 Household type
                 Male     Female
Mean             31.3      5.9
Median           10.0      0
25% point         0        0
Number           31       16

[Figure: boxplots of total beans harvested in 94a by simplified household type (male, female, missing); y-axis 0 to 50]



Three or more variables

When three or more variables are being investigated, cross tabulations become sparse and difficult to interpret and clear graphs difficult to construct.

A simple example of the need for not always considering just two variables at a time is given. In both Region 1 and Region 2 it is clear adoption is not related to income (67% adopt in both high and low income groups in Region 1 and 33% in Region 2) but if the sum of the two regions is studied there appears to be higher adoption in the high income group.

Exactly the same thing occurs with continuous variables where spurious correlation (or lack of it) can be due to a third variable which has not been allowed for. More advanced graphical (e.g. small multiple pictures) and numerical (regression and log-linear modeling, multivariate methods such as principal components) methods exist to help here.

[Figure: scatter plot panels for amounts planted 94a, planted 94b, harvested 94a and harvested 94b]

Artificial Example

Region 1
                Adoption
                 -     +
Income   L      10    20
         H      20    40

Region 2
                Adoption
                 -     +
Income   L      40    20
         H      20    10

Overall
                Adoption
                 -     +
Income   L      50    40
         H      40    50
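The artificial example can be checked directly. The sketch below, with names invented for the illustration, computes the adoption rate within each region and then for the pooled table, showing how an apparent income effect appears only when the regions are added together.

```python
# (non-adopters, adopters) by income group, from the artificial example above.
regions = {
    "Region 1": {"L": (10, 20), "H": (20, 40)},
    "Region 2": {"L": (40, 20), "H": (20, 10)},
}


def adoption_rate(non_adopters, adopters):
    return adopters / (non_adopters + adopters)


# Within each region the rate is the same for low and high income.
within = {region: {income: adoption_rate(*cells) for income, cells in groups.items()}
          for region, groups in regions.items()}

# Pooling the regions adds the counts cell by cell.
pooled_counts = {income: tuple(sum(regions[r][income][k] for r in regions)
                               for k in range(2))
                 for income in ("L", "H")}
pooled = {income: adoption_rate(*cells) for income, cells in pooled_counts.items()}

for region, rates in within.items():
    print(region, {income: f"{rate:.0%}" for income, rate in rates.items()})
print("Overall ", {income: f"{rate:.0%}" for income, rate in pooled.items()})
```

The pooled difference arises because Region 1 has both more high income households and higher adoption, not because income is related to adoption within either region.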



Descriptive statistics - common problems

Use of standard techniques rather than the most appropriate.

An example is the histogram to show the distribution of a continuous variable. The histogram shows features such as location and skewness. However, other possibilities are cumulative histograms (which show % points), boxplots (good for comparing, and showing outliers), q-q or normal probability plots (to check if the variable has a normal distribution) or stem-and-leaf plots (to look at individual values).

Be imaginative - find the best way to display the information you want.

[Figure: histogram of AMPLT94A; number of observations in classes from <= 0 to > 40, in steps of 5]

[Figure: cumulative histogram of AMPLT94A; same classes]

[Figure: box plot of AMPLT94A; median = 1.75, 25% = 0, 75% = 3, non-outlier min = 0, non-outlier max = 7, with outliers and extremes marked]

[Figure: normal quantile-quantile plot of AMPLT94A; observed value against theoretical quantile]

Use of techniques you can get your computer to do.

Much statistics software is very flexible. If you learn enough about it you can get it to do most things, but not everything.

Be prepared to do some analysis, including drawing of graphs or tables, by hand.

Concentration on means when variation is important.



Cases which deviate from the mean, contributing to variability, are probably just as important as the average values.

Make sure you understand whether variation is important, and if so, describe it.

Limited use of derived quantities.

It is unlikely that each substantive question can be answered from columns of raw data alone. Calculation of new variables is certain to be important.

Calculate new variables that are needed to answer the questions.

Confusion over the unit of analysis.

Many datasets contain data collected at more than one level (e.g. plot, person, household, community). Analyses must use the relevant level. Mixing levels is almost always wrong. Even in surveys with data collected at one level there is room for confusion regarding, for example, calculations of percentages.

Variety             Number of farmers     Average of those
                    planting in 94A       farmers who planted

Kawanda                   11                    2.45
Manyigamulimi             21                   10.53
Kanyebwa                   0                    -
White haricot              0                    -
All others                14                    2.04
No beans planted          18                    -

The various interesting percentages are:

Percent of all farmers planting Kawanda = 11/50 = 22%

Percent of all farmers who planted in 94A who planted Kawanda = 11/(50-18) = 34%

Percent of amount planted that was planted to Kawanda
= (11 x 2.45) / (11 x 2.45 + 21 x 10.53 + 14 x 2.04)
= 26.95/276.64 = 9.7%
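The three percentages differ only in what is taken as 100%. A minimal sketch of the arithmetic, using the figures from the table above (the variable names are invented for the illustration):

```python
n_farmers = 50        # households in the example subset
n_not_planting = 18   # "No beans planted" in 94A

# variety: (number of farmers planting, average amount per planting farmer)
planted = {
    "Kawanda":       (11, 2.45),
    "Manyigamulimi": (21, 10.53),
    "All others":    (14, 2.04),
}

kawanda_farmers, kawanda_average = planted["Kawanda"]

# Baseline 1: all farmers in the sample.
pct_of_all_farmers = 100 * kawanda_farmers / n_farmers

# Baseline 2: only farmers who planted beans in 94A.
pct_of_planters = 100 * kawanda_farmers / (n_farmers - n_not_planting)

# Baseline 3: total amount planted, so the unit is amount planted, not farmer.
total_amount = sum(n * average for n, average in planted.values())
pct_of_amount = 100 * kawanda_farmers * kawanda_average / total_amount

print(f"{pct_of_all_farmers:.0f}% of all farmers")
print(f"{pct_of_planters:.0f}% of farmers who planted")
print(f"{pct_of_amount:.1f}% of the amount planted")
```

The counts of farmers per variety add to more than the 32 who planted, presumably because some farmers planted more than one variety, which is another reason to state the baseline explicitly.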

Not working with relevant subsets of the data

Should the farmer who never grows beans be deleted from the dataset? Should cases for whom farming is not the main occupation be omitted when analyzing economic activity?



Make sure all relevant data, but no irrelevant data, is being used.

Poor handling of outliers.

Be on the lookout for all odd observations, which might represent mistakes or unusual cases. Mistakes must be corrected. Treatment of unusual cases depends on context. Including them can distort the picture. Omitting them can induce bias.

Balance between Exploratory analysis and Data Dredging

Exploratory analysis means looking for interesting patterns in the data without focusing on a specific question (e.g. Who are the farmers who have heard of the new variety?). This can be valuable, and show up facts which had not been thought of or hypothesized.

Data dredging means searching through many statistics until something turns up. For example, doing a cross-tabulation of "Heard of new varieties" with every other qualitative variable. The results will be spurious (if you search through enough columns of random numbers you will eventually find interesting correlations).

The distinction between the two approaches is fine!

Confirmatory analysis: estimation and hypothesis testing

The problem

A.                            Household type
Labour                        Male    Female    Total

Never hire or exchange         23       13        36
Hire or exchange               10        3        13
Total                          33       16        49

In Table A we can see:



33% of the households are female headed.

30% of male headed households hire labour, but only 19% of female headed households do.

B. Farmers who planted beans in 94a

                          Male    Female    Overall
Amount planted   Mean      6.5      2.9       5.8
                 s.d.      9.5      1.3       8.6
                 n          24        6        30

In Table B we can see:

The mean amount of beans planted in 94a by farmers who grew beans that season is 5.8 kg.

The mean amount planted by males was 6.5 kg, but only 2.9 kg by females.

All these results are based on data from a sample of just 50 farmers in the district.

How reliable are they? If we had measured a different 50 how similar would the results have been? If we had measured 500, or the whole population, would the conclusions have been much the same?

The results differ from the true answer for two reasons:

Non-sampling errors - incorrect responses, mistakes in coding and data entry, poor recall, biased selection of respondents.

Sampling errors - those due to the fact that we have measured only some (a sample) of the population.

The non-sampling errors cannot usually be measured, but can be minimized by good survey practice. Sampling errors can be measured, and that is the purpose of much confirmatory statistics.

Estimates, standard errors and confidence intervals.

Proportions

The proportion of female headed households in the population is P. P is unknown. The sample value is p = 0.33 (= 16/49). The uncertainty due to sampling errors in this is measured by the standard error. The standard error is

$$ se(p) = \sqrt{\frac{p(1-p)}{n}}, $$

where n = sample size.

se(p) is estimated by

$$ \sqrt{\frac{0.33\,(1 - 0.33)}{49}} = 0.07 $$

This is the standard deviation of possible estimates that could be produced by different simple random samples of the same size.

The standard error is best interpreted via a confidence interval. A 95% confidence interval for p is

$$ p \pm 2 \times se(p) = 0.33 \pm 2 \times 0.07 = (0.19, 0.47) $$

This is interpreted as "We are 95% confident that the true percentage of female headed households is between 19% and 47%". Hence the uncertainty in results due to sampling error is quantified.
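The calculation for the proportion can be reproduced as follows. This is a sketch under the same assumptions as the text (simple random sampling, and the rough multiplier 2 rather than 1.96); the function name is invented for the illustration.

```python
import math


def proportion_ci(successes, n, multiplier=2.0):
    """Sample proportion, its standard error, and an approximate 95% interval."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, (p - multiplier * se, p + multiplier * se)


# 16 female headed households out of 49, from Table A.
p, se, (lower, upper) = proportion_ci(16, 49)
print(f"p = {p:.2f}, se(p) = {se:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

Working from unrounded values gives an upper limit of about 0.46 rather than 0.47, because the hand calculation rounds p and se(p) to two decimals before combining them. The difference is of no practical importance.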

Means

The mean amount of beans planted in 94a is 5.8 kg. The standard error of this is

$$ se(\text{mean}) = \sqrt{\frac{s^2}{n}}, $$

where $s^2$ is the variance in amount of beans and n the sample size.

$$ se(\text{mean}) = \sqrt{\frac{8.6^2}{30}} = 1.6 $$

The 95% confidence interval is

$$ \text{mean} \pm 2 \times se(\text{mean}) = 5.8 \pm 2 \times 1.6 = (2.6, 9.0) $$

We are 95% confident that the mean amount of beans planted is between 2.6 and 9.0 kg.
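The same steps for a mean, sketched in Python with the summary figures from Table B. The function name is an assumption of this illustration, and the multiplier 2 follows the text rather than an exact t or normal value.

```python
import math


def mean_ci(mean, sd, n, multiplier=2.0):
    """Standard error of a mean and an approximate 95% confidence interval."""
    se = sd / math.sqrt(n)          # same as sqrt(s^2 / n)
    return se, (mean - multiplier * se, mean + multiplier * se)


# Overall column of Table B: mean 5.8 kg, s.d. 8.6 kg, n = 30.
se_mean, (lower, upper) = mean_ci(5.8, 8.6, 30)
print(f"se(mean) = {se_mean:.2f}, interval = ({lower:.1f}, {upper:.1f})")
```

The unrounded standard error is about 1.57, so the interval comes out very slightly narrower than the (2.6, 9.0) obtained by rounding the standard error to 1.6 first.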

Differences

If interested in differences between subgroups we can similarly estimate the difference and find a standard error of the estimate.

Difference in mean amount of beans planted by males and females = 6.5 - 2.9 = 3.6 kg.

$$ se(\text{difference}) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{9.5^2}{24} + \frac{1.3^2}{6}} = 2.0 $$

The 95% confidence interval for the difference is

$$ 3.6 \pm 2 \times 2.0 = (-0.4, 7.6) $$

The mean difference between amounts planted by males and females could be anything between -0.4 kg and 7.6 kg.
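A short sketch of the difference calculation, again using the summary figures from Table B. The function name is invented here, and the two groups are treated as independent samples, as in the text.

```python
import math


def difference_ci(mean1, sd1, n1, mean2, sd2, n2, multiplier=2.0):
    """Difference of two means, its standard error, and an approximate 95% interval."""
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return diff, se, (diff - multiplier * se, diff + multiplier * se)


# Male: mean 6.5, s.d. 9.5, n = 24.  Female: mean 2.9, s.d. 1.3, n = 6.
diff, se_diff, (lower, upper) = difference_ci(6.5, 9.5, 24, 2.9, 1.3, 6)
print(f"difference = {diff:.1f} kg, se = {se_diff:.1f}, "
      f"interval = ({lower:.1f}, {upper:.1f})")
```

Because the interval includes zero, the data are consistent with no difference at all as well as with a difference of several kilograms.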

Hypothesis tests: The logic

The logic of all the tests commonly used depends on the fact that random samples from a population behave in a predictable way. The mean amount of beans planted by female headed households, 2.9 kg, is not the actual mean of all female headed households in the districts where the study took place. If a different sample had been randomly selected the mean would have been different. The question is "How different?". If all households are very similar (low variation between households) then it really does not matter which sample is selected. On the other hand, high variation in the population will lead to very different sample means, and hence less certainty in the results obtained. The mathematics of statistics allows quantification of these ideas, and hence answers to the question of how certain we are of the results.

The logic of the hypothesis tests is as follows:

1. Assume some fact is true - the null hypothesis (e.g. There is no difference in mean amount of beans planted by male and female headed households).

2. Deduce how the sample would behave if (1) is true (e.g. How big could the sample differences between male and female headed households be?)

3. Compare the actual sample with the predictions in (2).

4. If (2) and (3) do not agree then (1) must be untrue - the null hypothesis is rejected. If (2) and (3) do agree then there is no reason, in this data, not to believe (1).



The level of agreement is measured by the 'significance level', explained in the examples below.

Examples of calculations

Chi-squared test for no association in a 2 x 2 table.

Taking Table A as an example, we want to test whether the proportion of households hiring labour is the same in male and female headed households. The steps are:

1. Formulate the null hypothesis: the proportion is equal for both male and female headed households.

2. If (1) is true, then the common proportion never hiring labour is estimated by 36/49 (and the proportion hiring by 13/49). Hence we would expect the numbers in each category to be:

                 Male                     Female

Never hire       33 x 36/49 = 24.2        16 x 36/49 = 11.8
Hire             33 x 13/49 = 8.8         16 x 13/49 = 4.2

3. The difference between observed and expected frequencies is summarised as

$$ \chi^2 = \frac{(24.2 - 23)^2}{24.2} + \frac{(11.8 - 13)^2}{11.8} + \frac{(8.8 - 10)^2}{8.8} + \frac{(4.2 - 3)^2}{4.2} = 0.74 $$

(calculated from the unrounded expected frequencies).

4. If (1) is valid then the value of $\chi^2$ should be an observation from a $\chi^2_1$ distribution. Comparison with tables shows that 0.74 is not an extreme observation. A number at least as big as this would occur 39% of the time. The significance level is p = 0.39. Hence there is no strong reason not to believe the null hypothesis.

t-test to compare two means

In example B the steps needed are:

1. Formulate the null hypothesis: the difference in mean amount of beans planted for male and female headed households is zero.

2, 3. If (1) is true, then the difference in means of 3.6 kg, scaled by its standard error (= 2.0),

$$ t = \frac{3.6}{2.0} = 1.8, $$

is an observation from a $t_{28}$ distribution.

4. Comparison with tables shows that 1.8 is not an extreme observation. A difference as big as this would occur 8% of the time if (1) is true. The significance level is p = 0.08. Hence there is not much reason not to believe the null hypothesis.
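Both worked tests can be reproduced without statistical tables. The sketch below uses only the Python standard library; the function names are invented for this illustration, and it is not the SPSS procedure. The chi-squared test applies no continuity correction, matching the hand calculation, and the t tail area is obtained by simple numerical integration of the t density.

```python
import math


def chi_squared_2x2(table):
    """Chi-squared statistic and p-value (1 degree of freedom) for a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 df, P(chi-squared > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value for a t statistic, by Simpson's rule on the t density."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def density(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps                       # steps must be even for Simpson's rule
    total = density(0.0) + density(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * density(k * h)
    area_0_to_t = total * h / 3
    return 1 - 2 * area_0_to_t              # area in both tails beyond |t|


table_a = [[23, 13],    # never hire or exchange: male, female
           [10, 3]]     # hire or exchange: male, female
chi_sq, p_chi = chi_squared_2x2(table_a)
p_t = t_two_sided_p(1.8, df=28)

print(f"chi-squared = {chi_sq:.2f}, p = {p_chi:.2f}")
print(f"t = 1.8 on 28 df, p = {p_t:.2f}")
```

With expected frequencies as small as 4.2, and with the non-normal data of example B, the p-values are themselves only approximate; see the assumptions discussed under Limitations.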

Limitations

Assumptions.

The calculations in both the chi-squared test and the t-test above are based on a series of assumptions. The key ones are:

- Independence. In both examples A and B we assume observations are independent. Lack of independence is caused by:

  (i) non-simple random samples. In this case we have used a stratified sample.

  (ii) interference between observations. This would be the case if several individuals within the same household responded, or if data were collected at a group meeting.

- Lack of bias due to non-response, interviewer effects, attempts to 'please' the researcher etc.

- Equality of variance and normal distribution (t-test). These assumptions can be checked. In example B the data are clearly not normally distributed.

Limits to interpretation.

(1) If the result is significant we can reject the null hypothesis, and conclude that there is a real difference in the population. If the result is not significant we have not proved there is no difference. It is never possible to prove the null hypothesis is true (it almost never will be!). All we can say is that this study has not produced evidence to make us disbelieve the null hypothesis.

(2) At what level of significance should the null hypothesis be rejected? 5% is commonly used but there is absolutely no reason why it should be treated as a rigid cut off. 6% and 4% significance levels are, for all real purposes, equivalent.

(3) Whether the null hypothesis is rejected depends as much on the sample size and precision of the study as on the 'truth' of the null hypothesis. A small, imprecise survey will not detect a difference that could be picked up by a larger study. Maybe we just did not collect enough data!



(4) The whole logic of significance testing and the p-value rests on what would happen in repeated surveys of the same design, using new randomisations. Does this make sense, when we know the survey would not and cannot ever be repeated?

(5) In most analysis exercises, differences which 'look interesting' at the exploratory stage are investigated further in the confirmatory analysis. If the tests to perform have been selected because differences look large, all significance levels are invalid.

(6) If a large number of tests are performed, as is often the case in analysis of a study with many variables, then we would expect 5% of the tests to give "significant" results at the p = 0.05 level even if all null hypotheses were true. Hence it can be difficult to interpret the results of multiple tests.
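Point (6) can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that the tests are independent of each other, which is rarely true of variables from the same survey; the function names are invented here.

```python
ALPHA = 0.05   # significance level applied to each individual test


def expected_false_positives(n_tests, alpha=ALPHA):
    """Expected number of 'significant' results when every null hypothesis is true."""
    return n_tests * alpha


def chance_of_at_least_one(n_tests, alpha=ALPHA):
    """Chance of one or more 'significant' results, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 10, 20, 100):
    print(f"{n_tests:3d} tests: expect {expected_false_positives(n_tests):.1f} "
          f"false positives, chance of at least one = "
          f"{chance_of_at_least_one(n_tests):.0%}")
```

With as few as 20 tests the chance of at least one spurious "significant" result is already well over one half.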

What should you do

(1) Treat the significance level p as an indication of 'strength of evidence' against the null hypothesis, not as a Yes/No decision maker.

(2) Concentrate on estimating the size of differences, rather than just testing whether they exist. Confidence intervals for differences will be much more useful than hypothesis tests.

At the end of every significance test apply the SO WHAT? test. Ask yourself 'So what?'. Has the significance test really improved your understanding of the situation and helped you take a rational decision for future action? If not, forget it, and get on with something more useful.

Confirmatory Analysis - Regression

Starting Regression

- Beware!

Even simple regression is not simple!

- Start by considering types of relationship that might exist. The most useful regression analysis will be one that starts from understanding of the theory behind the process being studied.



The example used here is rather artificial. It examines the proposition that the amount of beans harvested in 94a depends only on land area.

- Plot the data to see if there is any evidence of the relationship.

[Figure: scatter plot of HVTOT94A (total beans harvested in 94a) against LANDAREA]

Fitting the regression line

- Software is widely available to do this

- Understand the output!



* * * *  M U L T I P L E   R E G R E S S I O N  * * * *

Listwise Deletion of Missing Data

Equation Number 1    Dependent Variable..   HVTOT94A   total beans harvested 94a

Block Number 1.  Method:  Enter      LANDAREA

Variable(s) Entered on Step Number
   1..    LANDAREA

Multiple R             .54425
R Square               .29621
Adjusted R Square      .28057
Standard Error       29.01659

Analysis of Variance
                    DF      Sum of Squares      Mean Square
Regression           1         15946.10384      15946.10384
Residual            45         37888.31105        841.96247

F =      18.93921       Signif F =  .0001

------------------ Variables in the Equation ------------------

Variable               B        SE B       Beta         T   Sig T

LANDAREA        8.200238    1.884280    .544249     4.352   .0001
(Constant)     -2.863844    6.051297                -.473   .6383

End Block Number 1   All requested variables entered.
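One way to understand the output is to see how the printed quantities are related to each other. The sketch below recomputes several of them from the sums of squares and the coefficient table above; the variable names are invented for the illustration, and the relationships used are the standard ones for least-squares regression.

```python
import math

# Values copied from the SPSS output above.
ss_regression, df_regression = 15946.10384, 1
ss_residual, df_residual = 37888.31105, 45
slope, se_slope = 8.200238, 1.884280

n = df_regression + df_residual + 1           # number of cases used: 47

# Share of the total variation in harvest accounted for by land area.
r_square = ss_regression / (ss_regression + ss_residual)
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual

# Residual mean square and its square root, the "Standard Error" of the fit.
ms_residual = ss_residual / df_residual
standard_error = math.sqrt(ms_residual)

# Test statistics: F for the whole regression, t for the slope.
f_statistic = (ss_regression / df_regression) / ms_residual
t_slope = slope / se_slope

print(f"R Square = {r_square:.5f}, Adjusted R Square = {adjusted_r_square:.5f}")
print(f"Standard Error = {standard_error:.5f}")
print(f"F = {f_statistic:.5f}, T = {t_slope:.3f}, T squared = {t_slope ** 2:.3f}")
```

With a single x-variable the F test and the t test for the slope are the same test (F equals T squared), so the output contains less independent information than its length suggests. An R Square of about 0.30 also means that some 70% of the variation in harvest is not accounted for by land area.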



Check the fit

- Look for any unusual points or outliers. They could represent mistakes or cases that require special treatment. They certainly require explanation.

- Look for influential points, which largely determine results. They are not a bad thing, but you must be aware if your conclusions depend critically on one or two observations.

- Look at the residuals to determine:

  1. Whether they satisfy the main assumptions that validate the analysis (constant variance, independence, roughly normally distributed).

  2. Whether they show patterns according to the value of other variables, indicating that those other variables should be allowed for in the analysis.
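A simple way to see whether conclusions depend on one or two observations is to refit the line without them and compare. The sketch below uses hypothetical data, not the survey values, and a hand-written least-squares fit so that it needs no statistical library; all names are invented for the illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(xs, ys, intercept, slope):
    """Observed minus fitted values."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical land areas and harvests; the last farm is large and atypical.
land_area = [1, 2, 3, 4, 5, 10]
harvest = [5, 9, 12, 18, 20, 200]

intercept_all, slope_all = fit_line(land_area, harvest)
intercept_trim, slope_trim = fit_line(land_area[:-1], harvest[:-1])
resid_all = residuals(land_area, harvest, intercept_all, slope_all)

print(f"slope with all points:      {slope_all:.1f}")
print(f"slope without the last one: {slope_trim:.1f}")
print("residuals:", [round(r, 1) for r in resid_all])
```

If the two slopes differ as much as they do here, the fitted relationship is largely a statement about a single farm, and that farm needs explaining before the regression is interpreted.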

Interpretation

- Significance does not tell you whether the fitted model is logically sound or if it fits the data well.

- Significance does not tell you whether the model is useful in explaining or describing a relationship, or if the relationship has much predictive power.

- A regression model derived from survey data cannot tell you what would happen when an x-variable is changed. For example we cannot use it to predict the bean harvest of a farmer whose land holding changes.

- Existence of a regression relationship between two variables does not mean there is a causal relationship.

Regression relationships become useful when similar relationships are found in a number of different conditions. Look for significant sameness between regions, crops, farm types, etc.

Adding more variables - Multiple regression

Multiple regression is a powerful tool for understanding the relationship of one variable to several others. BUT...

- All the limitations to interpretation above apply, and are compounded by the existence of several x-variables.

- It is hard to draw graphs that show the relationships and the way data depart from them, so the analyst must rely more on numerical indicators of lack of fit, outliers, and influential points. Multiple regression analysis will not be successful if these are not understood.

- Stepwise and similar variable selection techniques, so loved by social scientists, have little theoretical basis and can produce answers which are very poor. Regression modeling will be most successful if understanding of the underlying processes is used to choose possible models, rather than relying on computer algorithms.

- The sample size required for multiple regression analysis depends on the configuration of the data (in particular the range of the x-variables and correlations among them). The required sample size quickly becomes large as the number of x-variables increases. If regression analysis is one of the principal objectives of the survey, it might be possible to select the sample in a way that makes the analysis more efficient.

[Figure: raw residuals plotted against HHTYPE2 (categories 1 and 2); residual axis from -80 to 160]

Interpretation

Interpret results. This does not mean understanding which effects are significant, but understanding and communicating what you now know about the problem. You should be able to:

- Meet the objectives of the study.

- Clearly state what substantive new knowledge has been generated.

- Show how this new information and understanding builds on what was there before. Does it:

  o add more examples of something previously known?
  o mean that general rules or principles can be stated with more confidence?
  o allow predictions to be made for new and important situations?

