Statistics in Survey Analysis

  • 8/10/2019 Statistics in Survey Analysis

    Statistics in Survey Analysis [1]

    Ric Coe, ICRAF, Nairobi, Kenya

    Contents

    Introduction
    Preliminaries
    Descriptive Statistics
        1. Summarizing Single Variables
        2. Two variables
    Descriptive statistics - common problems
    Confirmatory analysis: estimation and hypothesis testing
        The problem
        Estimates, standard errors and confidence intervals
        Hypothesis tests: The logic
        Examples of calculations
        Limitations
        What should you do
    Confirmatory Analysis - Regression
        Starting Regression
        Fitting the regression line
        Check the fit
        Interpretation
        Adding more variables - Multiple regression
    Interpretation
    References

    Introduction

    This guide summarises the use of simple statistical analyses in the interpretation of survey data. It is aimed at the typical small surveys (up to a few hundred respondents) carried out by researchers looking at the role and uptake of new agricultural technologies.

    [1] Modified from input to a course, Formal data analysis for bean researchers, organised by CIAT at CMRT, Egerton University, February 1996. Thanks to Soniia David for permission to quote the example.

    There are several common problems in the approaches to survey analysis used by many researchers, probably a result of the research methods courses followed during training. One is to concentrate attention on a few well known statistical techniques, such as chi-squared tests in 2-way tables and regression analysis, and to place a naively simplistic reliance on the results. This is the topic of this guide. A second problem is to treat statistical analysis as a recipe that can be followed to a successful conclusion without much thought or understanding along the way. This is the topic of a companion guide, Steps in survey analysis (Coe 2002). A third problem is to ignore the context in which the survey was carried out, so ignoring many of the possibilities and limitations of the statistical analysis. This is the topic of the guide Approaches to analysis of survey data (SSC, 2001).

    Example

    The example used in this guide was a survey of farmers in two districts of Uganda. It aimed to characterize the pattern of bean growing and understand the role of new bean varieties in the household economy of new farmers. A few of the stated objectives were:

    Overall: Provide a baseline against which to measure adoption and impact of improved bean varieties.

    Hypotheses:

    1. Adoption.
       a. There is no relationship between adoption of new varieties and wealth.
       b. The rate of adoption for MCM5001 will be higher in Mbale than Mukono, due to strong non-appreciation of small seeded varieties in Mukono.

    2. Impact.
       a. Adoption of new varieties will result in an increase in absolute quantities and proportion of beans sold, hence increasing household income from beans.
       b. Adoption of new varieties will not result in increased sales of fresh beans.
       c. Adoption of new varieties will not change the amount of income from beans controlled by women.
       d. ...

    The examples are based on a subset of just 50 households from the whole survey of 179.

    The variables used in the example have been labeled so should be self-explanatory.

    In this guide SPSS has been used for the statistical analysis. General points appear in normal text. Computer output and other items relating specifically to the example are boxed.

    Preliminaries

    Before starting analysis:

    1. Make sure you are familiar with the data source and collection methods. For example:

       Was a random sampling scheme used? Were individual questionnaires completed during a group meeting? Who collected the data? Why and when?

    2. Clarify objectives.

       These should have been listed in detail when the survey was planned. If they were not, or have changed, they must be listed now. It is impossible to analyze a survey if you do not know what you are trying to find out.

    3. Coding and Data entry.

    4. Make sure you understand the data. You must understand the exact meaning of every number and code.

    Data that needs clarifying.

    Variable WIVES (Question 3): Does 1 mean 1 wife or 2 wives? (Conflict between questionnaire and code book.)

    Variable ARRANGE (Question 4): Does NA mean there are no bean plots or no husband/wife?

    Variables OCCUPHDI and OCCUPHD2 (Question 8): Why are two occupations given when the question asks for the main occupation?

    Variable KAW94A (Question 21): What is the difference between na and No?

    Variable AMKW94A (Question 21): What are the units?

    Descriptive Statistics

    1. Summarizing Single Variables

    Qualitative (Coded) variables.

    Useful summaries are just frequencies and percentages.

    MATOKE   Grows matoke
                                                     Valid      Cum
    Value Label        Value  Frequency  Percent   Percent  Percent
    Yes                    1         42     84.0      84.0     84.0
    No                     2          8     16.0      16.0    100.0
                                -------  -------   -------
    Total                            50    100.0     100.0

    Valid cases  50    Missing cases  0

    HHTYPE   Household type
                                                          Valid      Cum
    Value Label             Value  Frequency  Percent   Percent  Percent
    Male headed one wife        1         27     54.0      54.0     54.0
    Male headed more tha        2          4      8.0       8.0     62.0
    Female headed absent        3          3      6.0       6.0     68.0
    Female headed, no hu        4         13     26.0      26.0     94.0
    Single man                  5          2      4.0       4.0     98.0
    Other                       7          1      2.0       2.0    100.0
                                     -------  -------   -------
    Total                                 50    100.0     100.0

    Valid cases  50    Missing cases  0

    Note the different emphasis of frequencies and percentages. Frequencies emphasize the sample, percentages emphasize the population. Give total sample size with percentages.

    Take care with percentages: make sure you are using an appropriate baseline (what is 100%) and remember that percentages might not have to add to 100, as in the example below.

    Edit the computer output for presentation!

    Crop           % growing
    Cassava        100
    Beans           98
    Matoke          84
    Maize           78
    Yams            20
    Sample size     50

    Look carefully at and identify rare cases. Such data points may be errors, or may need special treatment.

    What is the 1 other household type in question 2?

    One farmer does not grow beans. Should this case be deleted from all analyses?

    Bar charts are most appropriate when the categories can be ordered in some useful way.

    Quantitative Variables

    In summarizing quantitative variables the most interesting things are:

    o Location (What is a typical value?)
    o Spread (How much variation is there?)
    o Odd values (What is their source and interpretation?)

    Location is measured by mean or median (not usefully the mode)

    Spread is measured by standard deviation or distance between quartiles.

    Quantities such as the 10% and 90% point are useful in some situations.
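The effect of one odd value on these summaries can be sketched in a few lines of Python. The harvest figures below are hypothetical, chosen only to mimic a skewed variable with a single extreme case; they are not the survey data, and the variable names are illustrative.

```python
import statistics

# Hypothetical amounts harvested (kg), NOT the survey data:
# mostly small values plus one extreme case of 200.
harvest = [0, 0, 0, 2, 3, 4, 4, 5, 8, 10, 14, 20, 200]

# Location: mean and median
mean_all = statistics.mean(harvest)
median_all = statistics.median(harvest)

# Spread: standard deviation and quartiles
sd_all = statistics.stdev(harvest)
q1, q2, q3 = statistics.quantiles(harvest, n=4)

# The same summaries with the extreme case set aside
trimmed = [x for x in harvest if x < 200]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"mean {mean_all:.1f}, median {median_all}, sd {sd_all:.1f}")
print(f"quartiles {q1}, {q2}, {q3}")
print(f"without the extreme case: mean {mean_trimmed:.1f}, median {median_trimmed}")
```

In this sketch the mean moves a great deal when the extreme case is dropped while the median does not, which is why both should be looked at before choosing a summary.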

    Use Histograms and boxplots.

    2. Two variables.

    Two qualitative variables = cross tabulation

    Interpretation can be helped by careful layout.

    Percentages may be calculated of row totals, column totals or overall totals. Not all of them will make sense!

    Amount of beans harvested in 94a

    Mean                   15.9
    Standard deviation     34.2
    Median                  4.0
    25% point               0
    75% point              14.0
    Mean (ignoring 200)    10.1

    [Figure: histogram of total beans harvested 94a (horizontal axis 0 to 200, vertical axis 0 to 40); Std. Dev = 34.21, Mean = 16.0, N = 47]

    Crop earning                      Household type
    highest income    Male Headed   Female Headed   Single Male   Total
    Coffee                 19              7              1         27
    Groundnut               2              4              0          6
    Bogoya                  1              3              0          4
    Cassava                 1              0              1          2
    Matoke                  2              0              0          2
    Beans                   1              0              0          1
    Other                   5              0              0          5
    No sales                0              2              0          2
    Total                                                           49

    One qualitative and one quantitative variable = group comparison

    Two quantitative variables

    A scatter diagram is the only really useful way to summarize two quantitative variables and their relationship.

    The correlation coefficient is a summary of the strength of linear relationship between variables. It should NOT be quoted unless the data have first been looked at in a scatter diagram.

    If there appears to be a relationship between variables the points to look for are:

    Total beans harvested in 94a

                    Household type
                    Male     Female
    Mean            31.3      5.9
    Median          10.0      0
    25% point        0        0
    Number          31       16

    [Figure: boxplots of total beans harvested 94a (vertical axis 0 to 50) by simplified hhtype (Missing, male, female)]

    Three or more variables

    When three or more variables are being investigated, cross tabulations become sparse and difficult to interpret and clear graphs difficult to construct.

    A simple example of the need for not always considering just two variables at a time is given. In both Region 1 and Region 2 it is clear adoption is not related to income (67% adopt in both high and low income groups in Region 1 and 33% in Region 2) but if the sum of the two regions is studied there appears to be higher adoption in the high income group.

    Exactly the same thing occurs with continuous variables where spurious correlation (or lack of it) can be due to a third variable which has not been allowed for. More advanced graphical (e.g. small multiple pictures) and numerical (regression and log-linear modeling, multivariate methods such as principal components) methods exist to help there.

    [Figure: plot with panels labelled planted 94a, planted 94b, harvested 94a and harvested 94b]

    Artificial Example

    Region 1
                  Adoption
                  -      +
    Income   L    10     20
             H    20     40

    Region 2
                  Adoption
                  -      +
    Income   L    40     20
             H    20     10

    Overall
                  Adoption
                  -      +
    Income   L    50     40
             H    40     50
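The artificial example can be checked with a short Python sketch. The counts are those in the tables above; the dictionary layout and function name are illustrative choices, not part of the original guide.

```python
# counts[region][income group] = (non-adopters, adopters), from the artificial example
counts = {
    "Region 1": {"L": (10, 20), "H": (20, 40)},
    "Region 2": {"L": (40, 20), "H": (20, 10)},
}


def adoption_rate(non_adopters, adopters):
    """Proportion of farmers in a cell who adopted."""
    return adopters / (non_adopters + adopters)


# Adoption rate by income group within each region
rates = {
    region: {income: adoption_rate(*cell) for income, cell in by_income.items()}
    for region, by_income in counts.items()
}

# Adoption rate by income group after adding the two regions together
overall = {}
for income in ("L", "H"):
    non_adopters = sum(counts[region][income][0] for region in counts)
    adopters = sum(counts[region][income][1] for region in counts)
    overall[income] = adoption_rate(non_adopters, adopters)

for region, by_income in rates.items():
    print(region, {k: f"{v:.0%}" for k, v in by_income.items()})
print("Overall", {k: f"{v:.0%}" for k, v in overall.items()})
```

Within each region the low and high income rates are identical, yet the pooled table shows a higher rate for the high income group, because Region 1 (high adoption) contains mostly high income farmers and Region 2 (low adoption) mostly low income farmers.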

    Descriptive statistics - common problems

    Use of standard techniques rather than the most appropriate.

    An example is the histogram to show the distribution of a continuous variable. The histogram shows features such as location and skewness. However, other possibilities are cumulative histograms (which show % points), boxplots (good for comparing, and showing outliers), q-q or normal probability plots (to check if the variable has a normal distribution) or stem-and-leaf plots (to look at individual values).

    Be imaginative - find the best way to display the information you want.

    [Figure: four displays of AMPLT94A. (1) Histogram: number of observations in classes <= 0, (0,5], (5,10], (10,15], (15,20], (20,25], (25,30], (30,35], (35,40], > 40. (2) Cumulative histogram over the same classes. (3) Box plot: median = 1.75, 25% = 0, 75% = 3, non-outlier min = 0, non-outlier max = 7, with outliers and extremes marked. (4) Normal quantile-quantile plot: observed value against theoretical quantile.]

    Use of techniques you can get your computer to do.

    Much statistics software is very flexible. If you learn enough about it you can get it to do most things, but not everything.

    Be prepared to do some analysis, including drawing of graphs or tables, by hand.

    Concentration on means when variation is important.

    Cases which deviate from the mean, contributing to variability, are probably just as important as the average values.

    Make sure you understand whether variation is important, and if so, describe it.

    Limited use of derived quantities.

    It is unlikely that each substantive question can be answered from columns of raw data alone. Calculation of new variables is certain to be important.

    Calculate new variables that are needed to answer the questions.

    Confusion over the unit of analysis.

    Many datasets contain data collected at more than one level (e.g. plot, person, household, community). Analyses must use the relevant level. Mixed levels are almost always wrong. Even in surveys with data collected at one level there is room for confusion regarding, for example, calculations of percentages.

    Variety             Number of farmers    Average of those
                        planting in 94A      farmers who planted
    Kawanda                   11                  2.45
    Manyigamulimi             21                 10.53
    Kanyebwa                   0                   -
    White haricot              0                   -
    All others                14                  2.04
    No beans planted          18                   -

    The various interesting percentages are:

    Percent of all farmers planting Kawanda = 11/50 = 22%

    Percent of all farmers who planted in 94A who planted Kawanda = 11/(50-18) = 34%

    Percent of amount planted that was planted to Kawanda = (11 x 2.45) / (11 x 2.45 + 21 x 10.53 + 14 x 2.04) = 26.95/276.64 = 9.7%
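The three percentages differ only in what is taken as 100%. A minimal Python sketch using the figures from the table makes each baseline explicit; the variable names are illustrative.

```python
n_farmers = 50          # all farmers in the subset
n_no_beans = 18         # farmers who planted no beans in 94A

# variety: (number of farmers planting, average amount for those farmers)
planting = {
    "Kawanda": (11, 2.45),
    "Manyigamulimi": (21, 10.53),
    "All others": (14, 2.04),
}

kawanda_farmers, kawanda_average = planting["Kawanda"]

# Baseline 1: all farmers
pct_of_all_farmers = 100 * kawanda_farmers / n_farmers

# Baseline 2: only farmers who planted beans in 94A
pct_of_planters = 100 * kawanda_farmers / (n_farmers - n_no_beans)

# Baseline 3: total amount planted, summed over varieties
total_amount = sum(n * average for n, average in planting.values())
pct_of_amount = 100 * kawanda_farmers * kawanda_average / total_amount

print(f"{pct_of_all_farmers:.0f}% of all farmers")
print(f"{pct_of_planters:.0f}% of farmers who planted")
print(f"{pct_of_amount:.1f}% of the amount planted")
```

Stating the baseline in the variable name, as here, is one way to avoid quoting a percentage whose denominator is unclear.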

    Not working with relevant subsets of the data

    Should the farmer who never grows beans be deleted from the dataset? Should cases for whom farming is not the main occupation be omitted when analyzing economic activity?

    Make sure all relevant data, but no irrelevant data, is being used.

    Poor handling of outliers.

    Be on the lookout for all odd observations, which might represent mistakes or unusual cases. Mistakes must be corrected. Treatment of unusual cases depends on context. Including them can distort the picture. Omitting them can induce bias.

    Balance between Exploratory analysis and Data Dredging

    Exploratory analysis means looking for interesting patterns in the data without focusing on a specific question (e.g. Who are the farmers who have heard of the new variety?). This can be valuable, and show up facts which had not been thought of or hypothesized.

    Data dredging means searching through many statistics until something turns up. For example, doing a cross-tabulation of 'Heard of new varieties' with every other qualitative variable. The results will be spurious (if you search through enough columns of random numbers you will eventually find interesting correlations).

    The distinction between the two approaches is fine!

    Confirmatory analysis: estimation and hypothesis testing

    The problem

    A.                            Household Type
    Labour                        Male    Female    Total
    Never hire or exchange         23       13        36
    Hire or exchange               10        3        13
    Total                          33       16        49

    In Table A we can see:

    33% of the households are female headed.

    30% of male headed households hire labour, but only 19% of female headed households do.

    B. Farmers who planted beans in 94a

                        Male    Female    Overall
    Amount    Mean       6.5     2.9       5.8
    planted   s.d.       9.5     1.3       8.6
              n         24       6        30

    In Table B we can see:

    The mean amount of beans planted in 94a by farmers who grew beans that season is 5.8 kg.

    The amount planted by males was 6.5 kg, but only 2.9 kg by females.

    All these results are based on data from a sample of just 50 farmers in the district. How reliable are they? If we had measured a different 50 how similar would the results have been? If we had measured 500, or the whole population, would the conclusions have been much the same?

    The results differ from the true answer for two reasons:

    Non-sampling errors - incorrect responses, mistakes in coding and data entry, poor recall, biased selection of respondents.

    Sampling errors - those due to the fact that we have measured only some (a sample) of the population.

    The non-sampling errors cannot usually be measured, but can be minimized by good survey practice. Sampling errors can be measured, and that is the purpose of much confirmatory statistics.

    Estimates, standard errors and confidence intervals.

    Proportions

    The proportion of female headed households in the population is P. P is unknown. The sample value is p = 0.33 (= 16/49). The uncertainty due to sampling errors in this is measured by the standard error. The standard error is

    $$ se(p) = \sqrt{\frac{p(1-p)}{n}}, $$

    where n = sample size.

    se(p) is estimated by

    $$ \sqrt{\frac{0.33\,(1 - 0.33)}{49}} = 0.07 $$

    This is the standard deviation of possible estimates that could be produced by different simple random samples of the same size.

    The standard error is best interpreted via a confidence interval. A 95% confidence interval for p is p ± 2 x se(p)

    = 0.33 ± 2 x 0.07 = (0.19, 0.47)

    This is interpreted as 'We are 95% confident that the true percentage of female headed households is between 19% and 47%'. Hence the uncertainty in results due to sampling error is quantified.
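The calculation for a proportion can be sketched in Python as follows. The function name and the `multiplier` argument are illustrative assumptions; the multiplier of 2 is the guide's approximation to the 95% point, and the formula assumes a simple random sample.

```python
import math


def proportion_ci(successes, n, multiplier=2.0):
    """Sample proportion, its standard error and an approximate 95% interval.

    Uses se(p) = sqrt(p(1 - p)/n) and p +/- multiplier * se(p).
    """
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, (p - multiplier * se, p + multiplier * se)


# 16 female headed households out of 49 (Table A)
p, se, (lower, upper) = proportion_ci(16, 49)
print(f"p = {p:.2f}, se(p) = {se:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

Working from unrounded values gives an interval of roughly (0.19, 0.46); the guide's (0.19, 0.47) comes from using the rounded figures 0.33 and 0.07.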

    Means

    The mean amount of beans planted in 94a is 5.8 kg. The standard error of this is

    $$ se(\text{mean}) = \sqrt{\frac{s^2}{n}}, $$

    where $s^2$ is the variance in amount of beans and n the sample size.

    $$ se(\text{mean}) = \sqrt{\frac{8.6^2}{30}} = 1.6 $$

    The 95% confidence interval is

    mean ± 2 x se(mean) = 5.8 ± 2 x 1.6 = (2.6, 9.0)

    We are 95% confident that the mean amount of beans planted is between 2.6 and 9.0 kg.

    Differences

    If interested in differences between subgroups we can similarly estimate the difference and find a standard error of the estimate.

    Difference in mean amount of beans planted by males and females = 6.5 - 2.9 = 3.6 kg.

    $$ se(\text{difference}) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{9.5^2}{24} + \frac{1.3^2}{6}} = 2.0 $$

    95% confidence interval for difference is

    3.6 ± 2 x 2.0 = (-0.4, 7.6)

    The mean difference between amounts planted by males and females could be anything between -0.4 kg and 7.6 kg.
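The intervals for a mean and for a difference between two means can be sketched from the summary figures in Table B. The function names are illustrative assumptions, and the multiplier of 2 is again the guide's approximation.

```python
import math


def mean_ci(mean, sd, n, multiplier=2.0):
    """Standard error of a mean and an approximate 95% interval."""
    se = sd / math.sqrt(n)
    return se, (mean - multiplier * se, mean + multiplier * se)


def difference_ci(mean1, sd1, n1, mean2, sd2, n2, multiplier=2.0):
    """Difference of two independent means, its standard error and interval."""
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return diff, se, (diff - multiplier * se, diff + multiplier * se)


# Overall column of Table B: mean 5.8, s.d. 8.6, n 30
se_mean, mean_interval = mean_ci(5.8, 8.6, 30)

# Male (6.5, 9.5, 24) against female (2.9, 1.3, 6)
diff, se_diff, diff_interval = difference_ci(6.5, 9.5, 24, 2.9, 1.3, 6)

print(f"se(mean) = {se_mean:.2f}, interval = {mean_interval}")
print(f"difference = {diff:.1f}, se = {se_diff:.2f}, interval = {diff_interval}")
```

The interval for the difference includes zero, which anticipates the t-test result later in the guide: the sample does not rule out equal means for male and female headed households.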

    Hypothesis tests: The logic

    The logic of all the tests commonly used depends on the fact that random samples from a population behave in a predictable way. The mean amount of beans planted by female households, 2.9 kg, is not the actual mean of all households in the districts where the study took place. If a different sample had been randomly selected the mean would have been different. The question is 'How different?'. If all households are very similar (low variation between households) then it really does not matter which sample is selected. On the other hand, high variation in the population will lead to very different sample means, and hence less certainty in the results obtained. The mathematics of statistics allows quantification of these ideas, and hence answers to the question of how certain we are of the results.

    The logic of the hypothesis tests is as follows:

    1. Assume some fact is true - the null hypothesis (e.g. There is no difference in mean amount of beans planted by male and female headed households).

    2. Deduce how the sample would behave if (1) is true (e.g. How big could the sample differences between male and female headed households be?)

    3. Compare the actual sample with the predictions in (2).

    4. If (2) and (3) do not agree then (1) must be untrue - the null hypothesis is rejected.

    If (2) and (3) do agree then there is no reason, in this data, not to believe (1).

    The level of agreement is measured by the 'significance level', explained in the examples below.

    Examples of calculations

    Chi-squared test for no association in a 2 x 2 table.

    Taking Table A as an example, we want to test whether the proportion of households hiring labour is the same in male and female headed households. The steps are:

    1. Formulate the null hypothesis: the proportion is equal for both male and female households.

    2. If (1) is true, then the common proportion never hiring labour is estimated by 36/49 (and the proportion hiring by 13/49). Hence we would expect numbers in each category to be:

                    Male                   Female
    Never hire      33 x 36/49 = 24.2      16 x 36/49 = 11.8
    Hire            33 x 13/49 = 8.8       16 x 13/49 = 4.2

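    3. The difference between observed and expected frequencies is summarised by the chi-squared statistic $\chi^2$: the squared difference between the observed and expected frequency in each cell, divided by the expected frequency, summed over the four cells. For Table A, $\chi^2 = 0.74$.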

    4. If (1) is valid then the value of $\chi^2$ should be an observation from a $\chi^2_1$ distribution. Comparison with tables shows that 0.74 is not an extreme observation. A number at least as big as this would occur 39% of the time. The significance level is p = 0.39. Hence there is no strong reason not to believe the null hypothesis.
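The chi-squared steps for Table A can be sketched in Python without any statistics library. The p-value line relies on the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable, so it applies to 2 x 2 tables only; the variable names are illustrative.

```python
import math

# Table A: rows = never hire, hire; columns = male, female headed
observed = [[23, 13],
            [10, 3]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Step 2: expected counts if the proportion hiring is the same in both groups
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Step 3: sum of (observed - expected)^2 / expected over the four cells
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Step 4: upper tail probability of a chi-squared distribution with 1 df,
# P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-squared = {chi2:.2f}, p = {p_value:.2f}")
```

This reproduces the guide's values of 0.74 and 0.39. For larger tables the degrees of freedom change and a statistics package or printed tables would be needed for the p-value.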

    t-test to compare two means

    In example B the steps needed are:

    1. Formulate the null hypothesis: the difference in mean amount of beans planted for male and female households is zero.

    2, 3. If (1) is true, then the difference in means of 3.6 kg, scaled by its standard error (= 2.0),

    $$ t = \frac{3.6}{2.0} = 1.8, $$

    is an observation from a $t_{28}$ distribution.

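    For the chi-squared test of Table A, the statistic in step 3 is

    $$ \chi^2 = \frac{(24.2 - 23)^2}{24.2} + \frac{(11.8 - 13)^2}{11.8} + \frac{(8.8 - 10)^2}{8.8} + \frac{(4.2 - 3)^2}{4.2} = 0.74 $$

    (the value 0.74 is obtained when the unrounded expected frequencies are used).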

    4. Comparison with tables shows that 1.8 is not an extreme observation. A difference as big as this would occur 8% of the time if (1) is true. The significance level is p = 0.08. Hence there is not much reason not to believe the null hypothesis.
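The 'comparison with tables' can be replaced by a direct calculation. The sketch below obtains the two-sided significance level by numerically integrating the t density with Simpson's rule, using only the standard library; the function names and the number of integration steps are illustrative assumptions, and a statistics package would normally do this in one call.

```python
import math


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|) via Simpson's rule on [0, |t|]; steps must be even."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_half = total * h / 3      # P(0 < T < |t|)
    return 1 - 2 * central_half


t_stat = 3.6 / 2.0                    # difference scaled by its standard error
p_value = two_sided_p(t_stat, df=28)
print(f"t = {t_stat:.1f}, p = {p_value:.2f}")
```

The result is about 0.08, matching the guide. The 28 degrees of freedom are taken from the guide (24 + 6 - 2).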

    Limitations

    Assumptions.

    The calculations in both examples above (the chi-squared test and the t-test) are based on a series of assumptions. The key ones are:

    Independence. In both examples A and B we assume observations are independent.

    Lack of independence is caused by:

    (i) non-simple random samples. In this case we have used a stratified sample.

    (ii) interference between observations. This would be the case if several individuals within the same household responded, or if data were collected at a group meeting.

    Lack of bias due to non-response, interviewer effects, attempts to 'please' the researcher etc.

    Equality of variance and normal distribution (t-test). These assumptions can be checked. In example B the data is clearly not normally distributed.

    Limits to interpretation.

    (1) If the result is significant we can reject the null hypothesis, and conclude that there is a real difference in the population. If the result is not significant we have not proved there is no difference. It is never possible to prove the null hypothesis is true (it almost never will be!). All we can say is this study has not produced evidence to make us disbelieve the null hypothesis.

    (2) At what level of significance should the null hypothesis be rejected? 5% is commonly used but there is absolutely no reason why it should be treated as a rigid cut off. 6% and 4% significance levels are, for all real purposes, equivalent.

    (3) Whether the null hypothesis is rejected depends as much on the sample size and precision of the study as on the 'truth' of the null hypothesis. A small, imprecise survey will not detect a difference that could be picked up by a larger study. Maybe we just did not collect enough data!

    (4) The whole logic of significance testing and the p-value rests on what would happen in repeated surveys of the same design, using new randomisations. Does this make sense, when we know the survey would not and cannot ever be repeated?

    (5) In most analysis exercises, differences which 'look interesting' at the exploratory stage are investigated further in the confirmatory analysis. If the tests to perform have been selected because differences look large, all significance levels are invalid.

    (6) If a large number of tests are performed, as is often the case in analysis of a study with many variables, then we would expect 5% of the tests to give "significant" results at the p = 0.05 level even if all null hypotheses were true. Hence it can be difficult to interpret the results of multiple tests.
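Point (6) can be illustrated with a small simulation. The sketch below draws both groups from the same population, so every null hypothesis is true, and counts how often the difference in means exceeds twice its standard error (the guide's approximate 5% criterion). The sample sizes, number of tests and seed are arbitrary assumptions.

```python
import math
import random
import statistics


def fraction_significant_by_chance(n_tests=2000, n_per_group=30, seed=1):
    """Share of tests with |difference| > 2 x se when there is no real difference."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both groups come from the same normal population
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = statistics.mean(a) - statistics.mean(b)
        se = math.sqrt(statistics.variance(a) / n_per_group
                       + statistics.variance(b) / n_per_group)
        if abs(diff) > 2 * se:
            hits += 1
    return hits / n_tests


fraction = fraction_significant_by_chance()
print(f"{fraction:.1%} of tests were 'significant' with no real difference")
```

The fraction comes out close to 5%, so a survey analysed with 100 such tests would be expected to show around five 'significant' results by chance alone.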

    What should you do

    (1) Treat the significance level p as an indication of 'strength of evidence' against the null hypothesis, not as a Yes/No decision maker.

    (2) Concentrate on estimating the size of differences, rather than just testing whether they exist. Confidence intervals for differences will be much more useful than hypothesis tests.

    At the end of every significance test apply the SO WHAT? test. Ask yourself 'So what?'. Has the significance test really improved your understanding of the situation and helped you take a rational decision for future action? If not, forget it, and get on with something more useful.

    Confirmatory Analysis - Regression

    Starting Regression

    - Beware!

    Even simple regression is not simple!

    - Start by considering types of relationship that might exist. The most useful regression analysis will be one that starts from understanding of the theory behind the process being studied.

    The example used here is rather artificial. It examines the proposition that the amount of beans harvested in 94a depends only on land area.

    - Plot the data to see if there is any evidence of the relationship.

    [Figure: scatter plot of HVTOT94A (vertical axis, -20 to 220) against LANDAREA (horizontal axis, -1 to 11)]

    Fitting the regression line

    - Software is widely available to do this

    - Understand the output!

    * * * *  M U L T I P L E   R E G R E S S I O N  * * * *

    Listwise Deletion of Missing Data

    Equation Number 1    Dependent Variable..   HVTOT94A   total beans harvested 94a

    Block Number  1.  Method:  Enter      LANDAREA

    Variable(s) Entered on Step Number
       1..    LANDAREA

    Multiple R            .54425
    R Square              .29621
    Adjusted R Square     .28057
    Standard Error      29.01659

    Analysis of Variance
                        DF      Sum of Squares      Mean Square
    Regression           1         15946.10384      15946.10384
    Residual            45         37888.31105        841.96247

    F =       18.93921       Signif F =  .0001

    ------------------ Variables in the Equation ------------------

    Variable              B        SE B       Beta         T   Sig T

    LANDAREA       8.200238    1.884280    .544249     4.352   .0001
    (Constant)    -2.863844    6.051297                -.473   .6383

    End Block Number   1   All requested variables entered.
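One way to understand the output is to see how its entries are related. The Python sketch below recomputes several of the printed quantities from the sums of squares and the coefficient table; the variable names are illustrative, and the slope interval uses the same 'estimate ± 2 x standard error' approximation as earlier in the guide.

```python
import math

# Values copied from the regression output above
ss_regression, df_regression = 15946.10384, 1
ss_residual, df_residual = 37888.31105, 45
slope, se_slope = 8.200238, 1.884280

n = df_regression + df_residual + 1                     # cases used: 47

# R Square: share of the total variation accounted for by the line
r_square = ss_regression / (ss_regression + ss_residual)
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual

# Standard Error: square root of the residual mean square
ms_residual = ss_residual / df_residual
residual_se = math.sqrt(ms_residual)

# F: regression mean square over residual mean square
f_ratio = (ss_regression / df_regression) / ms_residual

# T for LANDAREA: coefficient over its standard error (t squared equals F here)
t_slope = slope / se_slope

# Approximate 95% confidence interval for the slope
slope_interval = (slope - 2 * se_slope, slope + 2 * se_slope)

print(f"R Square {r_square:.5f}, Adjusted {adjusted_r_square:.5f}")
print(f"Standard Error {residual_se:.5f}, F {f_ratio:.5f}, T {t_slope:.3f}")
print(f"slope interval ({slope_interval[0]:.1f}, {slope_interval[1]:.1f})")
```

Note that R Square is under 0.30 even though Signif F is .0001: the line is clearly not flat, yet most of the variation in harvest is not accounted for by land area.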

    Check the fit

    - Look for any unusual points or outliers. They could represent mistakes or cases that require special treatment. They certainly require explanation.

    - Look for influential points, which largely determine results. They are not a bad thing, but you must be aware if your conclusions depend critically on one or two observations.

    - Look at the residuals to determine:

      1. Whether they satisfy the main assumptions that validate the analysis (constant variance, independence, roughly normally distributed)

      2. Whether they show patterns according to the value of other variables, indicating that those other variables should be allowed for in the analysis.
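These checks can be sketched for a simple regression in Python. The data below are hypothetical (not the survey data), the function name is illustrative, and the cut-off of 4/n for flagging high leverage is a common rule of thumb adopted here as an assumption, not something taken from the guide.

```python
def fit_line(x, y):
    """Least squares line with residuals and leverage for each point."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Leverage: how far each x is from the centre of the x values
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return slope, intercept, residuals, leverage


# Hypothetical land areas and harvests; the last farmer is far from the rest
land = [0.5, 1, 1, 1.5, 2, 2, 3, 4, 10]
harvest = [2, 5, 0, 10, 8, 20, 15, 30, 200]

slope, intercept, residuals, leverage = fit_line(land, harvest)

# Rule-of-thumb flag (an assumption): leverage above 4/n for a two-parameter line
threshold = 4 / len(land)
flagged = [i for i, h in enumerate(leverage) if h > threshold]

# How much does the slope depend on the flagged points?
keep = [i for i in range(len(land)) if i not in flagged]
slope_without, *_ = fit_line([land[i] for i in keep], [harvest[i] for i in keep])

print(f"slope with all points {slope:.1f}, without flagged points {slope_without:.1f}")
print("flagged cases:", flagged)
```

Refitting without the flagged case, as in the last step, is a direct way to see whether conclusions depend critically on one or two observations. Plotting the residuals against other variables, such as household type, covers the second check.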

    Interpretation

    Significance does not tell you whether the fitted model is logically sound or if it fits the data well.

    Significance does not tell you whether the model is useful in explaining or describing a relationship, or if the relationship has much predictive power.

    A regression model derived from survey data cannot tell you what would happen when an x-variable is changed. For example we cannot use it to predict the bean harvest of a farmer whose land holding changes.

    Existence of a regression relationship between two variables does not mean there is a causal relationship.

    Regression relationships become useful when similar relationships are found in a number of different conditions. Look for significant sameness between regions, crops, farm types, etc.

    Adding more variables - Multiple regression

    Multiple regression is a powerful tool for understanding the relationship of one variable to several others. BUT.....

    All the limitations to interpretation above apply, and are compounded by the existence of several x-variables.

    It is hard to draw graphs that show the relationships and the way data depart from them, so the analyst must rely more on numerical indicators of lack of fit, outliers, and influential points. Multiple regression analysis will not be successful if these are not understood.

    Stepwise and similar variable selection techniques, so loved by social scientists, have little theoretical basis and can produce answers which are very poor. Regression modeling will be most successful if understanding of the underlying processes is used to choose possible models, rather than relying on computer algorithms.

    The sample size required for multiple regression analysis depends on the configuration of the data (in particular the range of the x-variables and correlations among them). The required sample size quickly becomes large as the number of x-variables increases. If regression analysis is part of the principal objectives of the survey, it might be possible to select the sample in a way that makes the analysis more efficient.

    [Figure: raw residuals (vertical axis, -80 to 160) plotted against HHTYPE2 (categories 1 and 2)]

    Interpretation

    Interpret results. This does not mean understand which effects are significant, but understand and communicate what you now know about the problem. You should be able to:

    Meet the objectives of the study.

    Clearly state what substantive new knowledge has been generated.

    Show how this new information and understanding builds on what was there before. Does it:

    o add more examples of something previously known?
    o mean that general rules or principles can be stated with more confidence?
    o allow predictions to be made for new and important situations?
