PAD5700 lecture 8
Page 1 of 12
MASTERS OF PUBLIC AFFAIRS PROGRAM
PAD 5700 -- Public administration research methods Spring 2013
Regression IV -- Categorical data and time series
Statistic of the week: Religion (graphic not reproduced)
*
I. Categorical data
* This week is a bit of a catch-all week, in which we specifically address a couple of items that I
wasn't able to fit in earlier, as we continue this process of model building ‘learning-by-doing’.
The two items: categorical data and time series analysis.
Categorical data: that which is measured in categories, not as a continuous variable. By way of
explanation (and this draws from Berman & Wang, p. 44; O'Sullivan et al, p. 102-7; and Levin
and Fox p. 4-8), one generally breaks variables down as follows:
o Interval variables. I've also seen these referred to as continuous variables, as their two key
characteristics are that they are:
Continuous -- in that they can take an infinite number of values. Age, for instance, can
be measured in years (as I write this, I am 54.68 years old). It can also be measured in
months (656.17), days (about 19,971), hours (479,304), minutes (28,758,240) and
seconds (1,725,494,400); the last three are calculated from the rounded number of days.
Interval -- by this is meant that the distances between the various measurements are equal:
one year is one year away from two years, which is also one year from three years,
etc.
o Categorical variables: what we are talking about today. Two main types:
Ordinal -- Like interval variables, these are 'ordered', in that one is more or less than
another, but the intervals between these are not necessarily of equal value. So in a
Likert scale response to how satisfied you are with your course, your options may be
very satisfied, satisfied, neutral, unsatisfied, or very unsatisfied. There is a definite
rank ordering here, but the distance between the various choices is not necessarily the
same.
Nominal -- variables that are no more than named designations. Religion, for
instance, is measured as Protestant (with the myriad denominations within this!),
Catholic, Buddhist, Muslim, Shinto, Jewish, Zoroastrian, Baha'i, Jain, Taoist, Sikh,
and myriad smaller groups. But (despite what advocates of too many of these belief
systems will tell you) one is not necessarily better than another, and a rank ordering
of all is something only the nuttiest zealot would attempt.
‘Dummy’ – a special kind of nominal variable is the ‘dummy’: a 0-1, either/or variable.
We’ve used this at least once in this class so far, with the Indiana v. Florida variable.
'Categorical' data is often what you end up with in qualitative research, especially survey
research. As an illustration, as suggested above a response to the question "How old are you,"
can be answered quantitatively with a number: 45. If this data is entered into a dataset with 1000
respondents, one can readily analyze the responses. What is the mean? 37.23 years!
Categorical data is harder to handle. Categorical data can be coded into spreadsheets
numerically: 1-5. But note that these numbers often don't function like numbers. This is
especially evident in what might be called attitudinal variables, or those that use a Likert scale, to
quantify what are otherwise non-numerical phenomena. How old one is can readily be counted,
but how one feels about (to use a Belle County dataset example) public services cannot be. Their
1 = excellent, 2 = good, 3 = fair, 4 = poor coding is ordinal, but not necessarily interval. For the
respondent the difference between 1 (excellent) and 2 (good), may not be the same as the
difference between 2 and 3 (fair). In other words, the slightest fault may cause the respondent to
classify a service as good (2) rather than excellent (1), but the service may have to be
considerably worse than 'good' (2), almost without special merit altogether, before it is classified
as fair (3). Worse, different people may apply this coding schema differently. If this is the case,
treating the variable numerically has its problems. Whereas the differences between 2 and 3 and
between 3 and 4 are exactly the same when analysing numbers, these differences aren't necessarily
the same when analysing categorical data.
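A minimal Python sketch of why this matters. The ten responses are hypothetical (they are not from the Belle County dataset), and the 'uneven' coding is an invented alternative that keeps the same rank order but puts 'fair' much further from 'good'. The mean moves with the arbitrary spacing; the median category does not.

```python
from statistics import mean, median_low

# Hypothetical responses on the 1 = excellent ... 4 = poor scale (illustrative only).
responses = [1, 1, 2, 2, 2, 2, 3, 3, 4, 4]
labels = {1: "excellent", 2: "good", 3: "fair", 4: "poor"}

# Two codings with the same rank order: equal spacing, and an invented
# uneven spacing in which 'fair' sits far from 'good'.
equal_spacing = {1: 1, 2: 2, 3: 3, 4: 4}
uneven_spacing = {1: 1, 2: 1.5, 3: 4, 4: 5}

mean_equal = mean(equal_spacing[r] for r in responses)
mean_uneven = mean(uneven_spacing[r] for r in responses)

# The median category depends only on the ordering, not on the spacing.
median_category = labels[median_low(responses)]

print(mean_equal, mean_uneven, median_category)
```

This is one reason frequencies (or a median category) are often a safer way to report ordinal data than a mean.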
In their limited discussion of categorical variables, too, O'Sullivan et al (p. 106) discuss nominal
variables (note that they don't use the term categorical, preferring instead 'nominal' and 'ordinal').
For these, numbers assigned by SPSS, for instance, have no use beyond shorthand identifiers. The
Belle County dataset, for instance, opens with a question on race:
1 = Black or African-American
2 = Hispanic
3 = Native American/ Indian
4 = Asian
5 = White
Using descriptive statistics to generate a mean for this variable (it happens to be 4.16) means
nothing (by the way, this variable should have been created as a 'string' variable, so that you can't
calculate means for it). Instead, one would report this variable using frequencies (Analyze,
descriptive statistics, frequencies, throw 'respondent race' into variables, OK):
Table 1 -- Respondent race
                                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid    Black or African American          99      19.6            20.0                 20.0
         Hispanic                            3        .6              .6                 20.6
         Native American/Indian              3        .6              .6                 21.2
         Asian                               6       1.2             1.2                 22.4
         White                             384      75.9            77.6                100.0
         Total                             495      97.8           100.0
Missing  System                             11       2.2
Total                                      506     100.0
As we will see, nominal variables can best be handled by creating ‘dummy’ variables. So you
could reconfigure the religion variable and make a new ‘Catholic’ variable, coding 0 = non-
Catholic, 1 = Catholic, and can now assess the impact of Catholicism on dependent variables.
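As a rough sketch of that recoding in Python (the six responses and the helper name make_dummy are hypothetical, for illustration only):

```python
# Hypothetical nominal responses (illustrative only; not from any dataset in these notes).
religion = ["Catholic", "Protestant", "Jewish", "Catholic", "Muslim", "Catholic"]


def make_dummy(values, category):
    """Return 1 where the value equals the category, 0 otherwise."""
    return [1 if v == category else 0 for v in values]


catholic = make_dummy(religion, "Catholic")
share_catholic = sum(catholic) / len(catholic)

print(catholic)
print(share_catholic)
```

Unlike the mean of the 1-5 race codes, the mean of a 0-1 dummy is interpretable: it is simply the proportion of cases coded 1.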
Again, even income, which in this dataset is converted from a continuous to an ordinal variable,
can't readily be treated as a numerical variable. A mean income of 5.4 means little: it doesn't
even necessarily indicate that the mean income is 4/10s of the way along the interval in category
5 ($35,000 to 49,999). This data would, instead, be reported using frequencies, as follows
(Analyze, Descriptive Statistics, Frequencies, throw 'respondent income' into variables, OK):
Table 2 -- Respondent income
                              Frequency   Percent   Valid Percent   Cumulative Percent
Valid    Less than $10,000           17       3.4             4.0                  4.0
         $10,000-$14,999             22       4.3             5.1                  9.1
         $15,000-$24,999             55      10.9            12.9                 22.0
         $25,000-$34,999             57      11.3            13.3                 35.3
         $35,000-$49,999             79      15.6            18.5                 53.7
         $50,000-$74,999            115      22.7            26.9                 80.6
         $75,000-$99,999             47       9.3            11.0                 91.6
         $100,000 or more            36       7.1             8.4                100.0
         Total                      428      84.6           100.0
Missing  System                      78      15.4
Total                               506     100.0
Frequencies are especially useful for presenting classic Likert-scale type categorical data, so the
'Overall county service value' rating in the Belle County dataset would be reported as in Table 3a
(Analyze, Descriptive Statistics, Frequencies, throw 'overall county service value rating' into
variables, OK).
Reporting categorical data
In this section we will look at a number of ways to report categorical data. The differences
between reporting and analyzing aren't that stark, though. Even a simple sample mean, or a
presentation of the frequency of responses (i.e. simple counting), allows one to analyze. From
Table 3a, for instance, one can see that Belle County residents who responded to this survey
generally think that they get value (if not 'excellent' value) from their county services. That is a
form of analysis. Statistics, and rigorous quantitative analysis of a social phenomenon, needn't
require sophisticated regression derivatives. A simple sample mean can be a powerful statistic,
explaining a lot.
It also shows how data can be far more powerful than the alternative. Are citizens happy with the
new program? The response from a member of county government might be something like this:
"Seems good. I've been talking to a dozen or so people, and they seem generally positive." Note
that this form of analysis:
Uses a very small sample, with only a dozen or so.
Uses a very unsystematic sample. How were these dozen people selected (from the local
Starbucks that morning, a Rotary club meeting that noon, and a bar that night?), and how
representative were they of the broader population?
Is opaque. How was the question asked, how were responses tallied?
Is vague. "They seem generally positive" is terribly imprecise.
Instead, why not do a systematic sample of the community? A far more robust way of reporting
community reaction to a new program would be to report, "A random sample of over 500
community members found that over 80% indicated that they get value from the new program."
This is what the Belle County dataset is able to say.
This ability of even simple counts to allow strong, useful analytical statements to be made points
to a second fundamental problem (in addition to the "if you can't count it, it doesn't count"
problem) with especially academic quantitative analysis, in that the purpose often seems to be to
show off your familiarity with sophisticated methods, rather than to analyze the phenomenon. So
often statistical steamrollers are used to break analytical walnuts. Keep it simple!
As indicated above, the 'frequencies' function in SPSS can be used to generate data useful for
reporting the results of a categorical variable. On the Belle County dataset, for instance, we can
do the following:
Table 3a -- Overall county service value rating
                                Frequency       %   Valid %   Cum. %
Valid    Very poor value               23     4.5       4.8      4.8
         Somewhat poor value           59    11.7      12.4     17.2
         Fair value                   207    40.9      43.4     60.6
         Good value                   164    32.4      34.4     95.0
         Excellent value               24     4.7       5.0    100.0
         Total                        477    94.3     100.0
Missing  System                        29     5.7
Total                                 506   100.0
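The three percentage columns are simple arithmetic on the counts. A short Python sketch, using only the frequencies in Table 3a (the variable names are mine, not SPSS's):

```python
# Counts by category, copied from Table 3a, plus the 29 missing cases.
counts = {
    "Very poor value": 23,
    "Somewhat poor value": 59,
    "Fair value": 207,
    "Good value": 164,
    "Excellent value": 24,
}
missing = 29

valid_n = sum(counts.values())   # respondents who answered the question
total_n = valid_n + missing      # everyone in the sample

rows = []
running = 0
for label, n in counts.items():
    running += n
    rows.append((
        label,
        n,
        round(100 * n / total_n, 1),        # Percent: share of all cases
        round(100 * n / valid_n, 1),        # Valid Percent: share of those answering
        round(100 * running / valid_n, 1),  # Cumulative Percent: running valid share
    ))

for row in rows:
    print(row)

# Two summary shares of the kind one would quote in a narrative.
at_least_fair = round(100 * (207 + 164 + 24) / valid_n, 1)
good_or_excellent = round(100 * (164 + 24) / valid_n, 1)
print(at_least_fair, good_or_excellent)
```

Note that Percent divides by all 506 cases while Valid Percent divides by the 477 who answered, which is why the two columns differ.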
Frequencies
Present frequencies, using Analyze, Descriptive Statistics, Frequencies (which got us the
respondent race data, above). Notice that you have the option to also produce some descriptive
statistics with this, by clicking 'Statistics' on the 'Frequencies' window. I don't find the SPSS
output to be terribly attractive or professional looking, and it contains a lot of superfluous
information the reader might not need. Percent, Valid Percent and Cumulative Percent, for
instance, are not all needed. So you might want to redo it by creating a new table in MS Word,
perhaps (to reconfigure Table 3a, above) using this format that I've been using in my own
research lately:
Table 3b
Overall county service value rating
                       Number   Percent
Very poor value            23       4.8
Somewhat poor value        59      12.4
Fair value                207      43.4
Good value                164      34.4
Excellent value            24       5.0
Total                     477     100.0
Notes:
29 cases were missing data.
The source is the Belle County dataset.
Note, too, that in my non-stats classes, when I ask you to write papers, I also offer bonus points
for (as it is described in my standard assignments page format, the bullets below come from page
6 of the PAD5700 Assignments page):
If one was to 'incorporate' the reconfigured Table 3b 'into the narrative of the paper', one might
simply write, "As shown in Table 3b, a strong majority of over 80% of residents responded that
they felt they received at least fair value from county services. Nearly 40% reported that they
received good or excellent value."
Descriptive statistics
One can present descriptive statistics, using Analyze, Descriptive Statistics, and Descriptives.
Try this using the variable 'Years resident in county', which is a purely continuous variable, with
values ranging from 1 to 83 years. The results:
Table 4 – Years resident in county
                             N   Minimum   Maximum    Mean   Std. Error (of mean)   Std. Deviation
Years resident in county   506         1        83   19.06                   .768           17.270
Valid N (listwise)         506
To report this, you would not even necessarily need a table. In the narrative, one could just write:
"The mean years resident in the county was 19.06."
Sample sorting
Restrict the sample for further analysis. We did this a bit in the midterm exam. Assume that you
want to know the years resident in the county of those most negatively disposed towards Belle
County services. First go into Data, Select Cases, click the dot for 'If condition is satisfied', then
the button for 'If'. This will open a 'Select Cases: If' window. In this window, insert 'valserv'
(Overall county service value rating) in the window, and produce the rule 'valserv < 3'. This will
give you the respondents who indicated that Belle County services are of very poor value, or
somewhat poor value. Click Continue, then OK in the Select Cases window. Then repeat the steps
above for descriptive statistics. The results:
Table 5 -- Descriptive Statistics, Belle County critics
                             N   Minimum   Maximum    Mean   Std. Error (of mean)   Std. Deviation
Years resident in county    82         1        79   20.38                  2.049           18.554
Valid N (listwise)          82
This tells you that 82 respondents rated the overall services that poorly (note that this is the
same number reported in Table 3a, above: 23 reporting 'very poor' and 59 'somewhat poor' value
= 82), and that these individuals had lived in the county slightly longer than the sample as a
whole: a mean of 20.38 years, versus the 19.06 years for the whole sample (from Table 4).
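The 'Std. Error' figures in Tables 4 and 5 can be checked by hand: the standard error of the mean is the standard deviation divided by the square root of N. A small sketch using only the published N and standard deviation values (the function name is my own):

```python
import math


def standard_error(std_dev, n):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return std_dev / math.sqrt(n)


se_all = standard_error(17.270, 506)     # Table 4, whole sample
se_critics = standard_error(18.554, 82)  # Table 5, cases with valserv < 3

print(round(se_all, 3), round(se_critics, 3))
```

The subsample's standard error is much larger because it rests on only 82 cases, so its mean of 20.38 years is a less precise estimate than the whole-sample mean.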
The sample can be selected in any number of other ways, for instance 'Based on time or case
range'. Another way to do the selection that we just did above would be to first go in to Data,
Sort Cases, and put 'valserv' in the 'sort by' window; you might as well leave the data in the
default 'Ascending' 'Sort Order'.
Note how the cases have been sorted in ascending order on the 'valserv' variable, starting with
the 'no responses', then the 1s, 2s, etc. Given that in the previous analysis we wanted to know
who indicated that the value of Belle County services was 'very poor' (1) or 'somewhat poor' (2),
we can now select these cases using Data, then Select Cases. This time we will use the 'Based on
time or case range' function, and click 'Range'. The 'Select Cases: Range' window allows us to
insert the range of cases we want to analyze. With the data sorted as it is, we can see that the 1
responses begin at case #30, and the 2 responses end at case #111. So put those two numbers in
the appropriate spots, and click OK. Calculate descriptive statistics for Years resident in county,
and you'll get the same figures as before.
Important point!!! If you save an SPSS spreadsheet while the 'Select Cases' function is being
used, the saved file may drop all of the unselected cases. At least this was the case with older
versions. So either go back to 'Select all cases', or do not save while this function is operating.
Graphics
Pie chart. We can also produce a variety of graphics to present the data. Categorical data is
especially well-suited to everyone's favourite (and my least favourite) graphic: the pie chart. Go
to Graphs, Legacy Dialogs, Pie, click 'Summaries for groups of cases', Define. Indicate that
'Slices Represent % of cases', 'Define slices by' the variable 'Respondent race', and click OK.
You should get Figure 6:
[Figure 6: pie chart of respondent race]
This is the same data presented in Table 1, save that this is a graphic, rather than a numerical
presentation of the data. With regards to when to use this, note again my standard, professional
writing grading criteria for tables/graphs: "Note the 'well used'... This does not mean produce a
large, gaudily coloured pie chart" like the one above "when it would be easier to simply write
'55% of Vermonters remain opposed to the civil unions law.'" So the pie chart above really
provides little information that can't more effectively (and economically) be communicated by
simply writing "The population was majority white, with a large Black minority and smaller
groups of Hispanics, Native Americans and Asians."
Notice that only categorical data (or data with relatively few slices of pie) are readily presented
like this. Try the same for the variable Years resident in county, and you get Figure 7:
[Figure 7: pie chart of years resident in county]
Psychedelic, but not very useful, is it?
Bar chart. Click: Graphs, Legacy Dialogs, Bar, Simple, Define, 'Bars Represent % of cases', load
'Overall county service value rating' in as 'category axis', click OK. You get Figure 8:
[Figure 8: bar chart of overall county service value rating]
This is the same data presented in Table 3a (and 3b), above. The Histogram function gets you
essentially the same thing.
Tables
Note that my favourite graphic, the
Scatterplot, isn't well suited to categorical
data. You can see this by going to Graphs,
Legacy Dialogs, Scatter, Simple, Define, and
loading 'Overall county service value rating' and Gender on the Y and X axis, respectively. I'll
save paper and not copy it in, as it ain't too useful.
You can, though, present this sort of relationship by producing a simple table. SPSS used to
have a separate function for this, but I haven’t been able to find it for years. You can trick it into
producing simple tables through the Crosstabs function:
Analyze, Crosstabs.
Put Overall county service value rating in the Row,
Gender in column. Click OK. You get this:
Table 9 -- Overall county service value rating * Gender Crosstabulation
Count
                                                  Gender
                                              Male   Female   Total
Overall county     Very poor value              13       10      23
service value      Somewhat poor value          34       25      59
rating             Fair value                  111       96     207
                   Good value                   72       92     164
                   Excellent value               7       17      24
Total                                          237      240     477
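Because the two groups are nearly the same size here (237 men, 240 women), the raw counts compare fairly directly. When group sizes differ, column percentages are the safer comparison. A minimal sketch that converts the Table 9 counts into the percentage of each gender giving each rating (names are my own):

```python
# Counts from Table 9: rating -> (male, female).
crosstab = {
    "Very poor value": (13, 10),
    "Somewhat poor value": (34, 25),
    "Fair value": (111, 96),
    "Good value": (72, 92),
    "Excellent value": (7, 17),
}

male_total = sum(m for m, _ in crosstab.values())
female_total = sum(f for _, f in crosstab.values())

# Column percentages: share of each gender giving each rating.
col_pct = {
    label: (round(100 * m / male_total, 1), round(100 * f / female_total, 1))
    for label, (m, f) in crosstab.items()
}

for label, (male_pct, female_pct) in col_pct.items():
    print(f"{label:<22}{male_pct:>6}{female_pct:>8}")
```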
At a glance, you can see that men (bunch o' whiners!) outnumber women in the two 'poor value'
rows, while women (bunch o' sissies!) outnumber men in the 'good value' and 'excellent value'
rows. Again, SPSS output isn't terribly attractive or professional looking, so you might
reconfigure it as follows:
Table 10
Gender and overall county service value rating, Belle County
                       Male   Female
Very poor value          13       10
Somewhat poor value      34       25
Fair value              111       96
Good value               72       92
Excellent value           7       17
Notes:
29 cases were missing data.
The source is the Belle County dataset.
Higher level analytical stuff
The purpose of this section's material is to shift to somewhat higher order analytical techniques
that can be applied to categorical data.
Perhaps the most fundamental thing to keep in mind when analyzing any data is to pay attention
to the units in which the data is expressed. This is especially important because categorical
variables can present challenges in interpretation. In the Belle County dataset:
'Years Resident in the County' is a purely interval variable. The units refer to years.
Age is not a purely interval variable in this dataset; it is ordinal. The units refer to
categories: 1 = under 25 years; 2 = 25-29...; 11 = 70 and older.
'Overall County Service Value' rating is an ordinal Likert scale. The units refer to
categories: 1 = very poor value, 2 = somewhat poor value, 3 = fair value, 4 = good value,
5 = excellent value.
'Residence in City Limits' is a dichotomous (either/or) variable. The units refer to one of
two, opposite things: 1 = inside; 2 = outside.
Race, again, is a nominal variable. As constructed in the Belle County dataset, it is
uninterpretable in quantitative analysis.
The point here is that interpreting the results of these variables can be tricky.
Hypothesis tests
We'll start with hypothesis tests. Hypothesis testing for categorical variables doesn't differ that
much from that for interval variables. The categorical variable is simply treated like a number.
One sample t-test
As we have seen, in SPSS-ese, this is called a one sample t-test. Assume that the Belle County
survey was based on a standard survey form recommended by the International City/County
Management Association. Further assume that the mean overall county service value rating for
some dozens of counties that have applied the ICMA survey is 3.00. We want to see if the
overall county service value rating for Belle County is significantly different from this status
quo, null hypothesis figure.
In SPSS, go to Analyze, Compare Means, One-Sample T Test. Put 'Overall county service value
rating' in as the test variable, and use a Test Value of 3.0 (the ICMA, 'null hypothesis'). Click
OK. The results:
Table 11a – Overall value compared nationally
                                         N   Mean   Std. Deviation   Std. Error Mean
Overall county service value rating    477   3.22             .902              .041

Table 11b – Overall value compared nationally
Test Value = 3
                                                                                     95% Confidence Interval
                                                                                        of the Difference
                                           t    df   Sig. (2-tailed)   Mean Difference     Lower     Upper
Overall county service value rating    5.433   476              .000              .224       .14       .31
The One-Sample Statistics tell us that there were 477 responses to this question, with a mean of
3.22, a standard deviation of 0.902, and a Standard Error of the Mean or, in PAD5700-ese, a
standard deviation of the sampling distribution, of 0.041. Note that the mean of 3.22 refers to that
1-5 (very poor to excellent) scale. It doesn't mean 3.22%, or 3.22 years, $3.22, or 3.22 gumnuts.
The One-Sample Test data gives us a test statistic of 5.433, indicating that the sample mean of
3.22 lies 5.433 standard deviations of the sampling distribution from the hypothesized mean of
3.00. We know that this is very unlikely: there is close to a zero probability (the significance --
Sig. (2-tailed) -- is 0.000) that a sample of 477 would randomly yield a sample mean of 3.22, if
the true population mean was really 3.00. If this sample mean of 3.22 can't be explained by
randomness, you can be fairly confident that it is explained by a true difference between Belle
County and the other counties that have applied this ICMA survey. In formal hypothesis testing
terms, we can reject the null hypothesis that attitudes to overall county services in Belle County
are no different than those in other counties across America.
Note: this assumes, of course, that we have been careful to minimize the likelihood that
our implementation of the survey introduced biases.
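For those who want to see where the 5.433 comes from, here is a rough Python sketch. It rebuilds the 477 valid responses from the counts in Table 3a and computes the test statistic by hand. One labeled assumption: the p-value and confidence interval use the normal curve as a stand-in for the t distribution with 476 degrees of freedom, which SPSS uses; with a sample this large the two are very close, but not identical.

```python
import math
from statistics import NormalDist, mean, stdev

# Rebuild the 477 valid responses from the Table 3a counts (codes 1-5).
counts = {1: 23, 2: 59, 3: 207, 4: 164, 5: 24}
ratings = [code for code, n in counts.items() for _ in range(n)]

test_value = 3.0                    # the assumed ICMA benchmark (null hypothesis)
n = len(ratings)
sample_mean = mean(ratings)
std_dev = stdev(ratings)            # sample standard deviation (n - 1 denominator)
std_error = std_dev / math.sqrt(n)  # standard deviation of the sampling distribution

mean_diff = sample_mean - test_value
t_stat = mean_diff / std_error      # distance from 3.00, in standard errors

# Assumption: normal approximation in place of the t distribution (df = 476).
z = NormalDist().inv_cdf(0.975)
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(t_stat)))
ci_low = mean_diff - z * std_error
ci_high = mean_diff + z * std_error

print(f"n={n} mean={sample_mean:.2f} sd={std_dev:.3f} se={std_error:.3f}")
print(f"t={t_stat:.3f} p={p_two_tailed:.3f} CI=({ci_low:.2f}, {ci_high:.2f})")
```

Note what this does: it treats the 1-5 category codes as numbers, exactly as the hypothesis test does.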
Independent-Sample T Test
Using SPSS, conduct a hypothesis test to see if newer and older residents differ in their Overall
County Service Value Rating. The null hypothesis is that they do not differ. Here, we approach
this by testing whether mean years resident in the county differs significantly between those who
rate county services poorly and everyone else.
Click on Analyze, Compare Means, Independent-Sample T Test. Your Grouping Variable will
be 'valserv', click Define Groups, the Cut Point dot, then 2.5 as the cutpoint. Insert 'Years
resident in county' as the Test Variable. This, again, should compare those who indicated 1 or 2,
to those who indicated 3-5. Click OK. The results (edited to fit):
Table 12a – Overall value by years in county
                           Overall county service
                           value rating               N    Mean   Std. Deviation   Std. Error Mean
Years resident in county   >= 3                     395   18.72           17.031              .857
                           < 3                       82   20.38           18.554             2.049

Table 12b – Overall value by years in county
                                 Levene's Test for
                                 Equality of Var.     t-test for Equality of Means
                                     F     Sig.           t      df   Sig. (2-tailed)   Mean Diff.   Std. Error Diff.
Years resident in county
  Equal variances assumed          .788    .375        -.788     475             .431       -1.654              2.099
  Equal variances not assumed                          -.745   111.2             .458       -1.654              2.221
The Group Statistics tell us that there were 82 cases with a value of less than 3, with a mean of
20.38 years resident in the county; 395 cases with a value greater than or equal to 3, with a mean
of 18.72. On the Independent Sample Test, the Sig. (2-tailed) figure of .431 tells us that the
difference between the two means of 1.65 years is small relative to the variation in the sample: a
difference this large would arise by chance about 43% of the time if there were no true
difference. We therefore cannot reject the null hypothesis that there is no difference in the
number of years resident in the county between those with negative attitudes toward county
services, and the rest of the population.
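The two rows of Table 12b can be roughly reproduced from the group statistics in Table 12a alone. This sketch computes the 'equal variances assumed' (pooled) version and the 'equal variances not assumed' version, the latter using what is usually called the Welch-Satterthwaite approximation for the degrees of freedom. Because the inputs are the rounded published figures, the results match SPSS only to about two decimal places.

```python
import math

# Group statistics from Table 12a (rounded, as published).
n1, mean1, sd1 = 395, 18.72, 17.031   # valserv >= 3
n2, mean2, sd2 = 82, 20.38, 18.554    # valserv < 3

mean_diff = mean1 - mean2

# Equal variances assumed: pool the two variances, weighted by degrees of freedom.
df_pooled = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df_pooled
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = mean_diff / se_pooled

# Equal variances not assumed: each group keeps its own variance.
v1, v2 = sd1**2 / n1, sd2**2 / n2
se_welch = math.sqrt(v1 + v2)
t_welch = mean_diff / se_welch
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"pooled: t={t_pooled:.3f} df={df_pooled} se={se_pooled:.3f}")
print(f"welch:  t={t_welch:.3f} df={df_welch:.1f} se={se_welch:.3f}")
```

Either way the test statistic is well under 1 in absolute value, which is why the difference is nowhere near significant.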
Paired-Samples T Test
Do the value service ratings differ for the environmental programs and the public schools? Here,
we want to see if one variable differs from another. Click on Analyze, Compare Means, Paired-
Samples t Test. Highlight both 'Public school value rating' and 'Environmental programs value
rating', and load these into the 'Paired Variable' box. Click OK. The results (I've redone the
formatting):
Table 13a – Public school value v. environmental value
                                              Mean     N   Std. Deviation   Std. Error Mean
Pair 1   Public school value rating           3.40   433             .943              .045
         Environmental programs value rating  3.57   433            1.014              .049

Table 13b -- Public school value v. environmental value
                                              Paired Differences
                                        Mean   Std. Dev.   Std. Error Mean       t    df   Sig. (2-tailed)
Public school value rating -
Environmental prog. value rating       -.164       1.278              .061   -2.67   432              .008
The 'Paired Samples' results indicate that the two did indeed have different mean figures. The
Paired Samples Test data give a test statistic of -2.67, with a probability of .008. This tells
us that if the people of Belle County were indifferent in their opinions of these two programs,
there is a .008 chance that scores this far apart would be generated randomly. Given that this is
less than a one percent chance, we can be about 99% confident that the perceived value of these
two programs is indeed different, or in formal hypothesis testing terms: we can reject the null
hypothesis that there is no difference in Belle County residents' attitudes regarding the value of
public schools and environmental programs.
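The paired test is really a one-sample test on the differences: for each respondent, subtract one rating from the other, then ask whether the mean of those differences is distinguishable from zero. A sketch from the published figures in Table 13b, again with the normal curve standing in (as a labeled assumption) for the t distribution with 432 degrees of freedom:

```python
import math
from statistics import NormalDist

# Paired differences (public schools minus environmental programs), from Table 13b.
n = 433
mean_diff = -0.164
sd_diff = 1.278

se_diff = sd_diff / math.sqrt(n)   # standard error of the mean difference
t_stat = mean_diff / se_diff

# Assumption: normal approximation to the t distribution (df = 432).
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"se={se_diff:.3f} t={t_stat:.2f} p={p_two_tailed:.3f}")
```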
*
II. Time series
* We are going to do a sort of poor person's time series analysis here, mostly just to introduce you
to the concept. The idea is to introduce time as a variable. An important provision to keep in
mind is one of the assumptions of regression analysis, from lecture 5, #3 in the list of
considerations of regression: "residuals are independent," by which is meant "one value of x is
not related to (is independent of) the next." Over time, values of x often are related to the next.
Much popular global warming analysis, for instance, will point to a string of recent hot years as
evidence that something is going on. Yet we know that temperatures one year to the next are not
independent of each other: an El Niño cycle, for instance, can last 3-4 years. In economic terms,
though, I'll consider years independent of each other, mostly because changes in economic
conditions generally occur in less than one year. You can see this in the linked "Gross Domestic
Product tables" from the Bureau of Economic Analysis.
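One rough way to check whether successive values are related is the lag-1 autocorrelation: the correlation between each value and the one that follows it. The sketch below is illustrative only. Both series are invented (they are not temperature or GDP data), and the function name is my own.

```python
def lag1_autocorrelation(series):
    """Correlation between each value and the next one in the series."""
    m = sum(series) / len(series)
    deviations = [x - m for x in series]
    numerator = sum(a * b for a, b in zip(deviations, deviations[1:]))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator


# Invented series: one that drifts steadily, one that flips sign every period.
drifting = [1, 2, 3, 4, 5, 6, 7, 8]
flipping = [1, -1, 1, -1, 1, -1, 1, -1]

r_drifting = lag1_autocorrelation(drifting)
r_flipping = lag1_autocorrelation(flipping)
print(round(r_drifting, 3), round(r_flipping, 3))
```

A value near zero is what the independence assumption hopes for; values well away from zero, in either direction, suggest that one period carries information about the next.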
So by way of a poor person's time series analysis: take my Macro Economic statistics file:
MacroStats. Assume that you wanted to analyse variation in US GDP growth rates after World
War II. This period is selected because prior to this era the economy was especially volatile, with
the massive contractions of the Great Depression, then equally large growth periods as a result of
the stimulus provided by New Deal programs and war. You can see this in the following line
chart:
Graphs
Legacy Dialogs
Line
Choose 'Simple', Data in Chart are ‘Values of individual cases’, click 'Define'
Line represents: GDP change
Category labels: Variable
Variable: Year. Click OK
You get this:
[Line chart of GDP change by year]
Note the wild fluctuations prior to 1950 or so, as well as the evident slowing in US economic
growth. So we want to be able to hold constant that long term, slowing growth trend in looking
for relationships between other variables and economic growth.
Now I’ll do a multivariate regression, with economic growth the dependent variable, and time
(Year), federal expenditures, and the cost of imported oil as independent variables. The results
(put in my standard table format):
Table 15
Regression of economic growth on year, price of imported oil, and federal spending
                              β (s.e.)         Standardized β   t test   Probability
Constant                      6.74 (55.64)                        .121          .904
Year                          -.002 (.027)              -.014    -.086          .932
Imported oil ($ real)         -.046 (.021)              -.377    -2.17          .037
Federal outlays (% GDP)       .111 (.244)                .078     .453          .653
Adjusted r2 = .059    F (3, 37) = .156
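Holding the time trend (Year) constant, the price of imported oil has a statistically significant,
negative association with economic growth (probability = .037), while federal outlays as a share
of GDP do not (probability = .653).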
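The t test column in Table 15 is just each coefficient divided by its standard error. A quick check in Python, using the rounded figures from the table (so the ratios differ slightly from the published t values, which SPSS computes from unrounded numbers):

```python
# (coefficient, standard error, t value reported in Table 15)
table_15 = {
    "Constant": (6.74, 55.64, 0.121),
    "Year": (-0.002, 0.027, -0.086),
    "Imported oil ($ real)": (-0.046, 0.021, -2.17),
    "Federal outlays (% GDP)": (0.111, 0.244, 0.453),
}

t_ratios = {name: b / se for name, (b, se, _) in table_15.items()}

for name, (b, se, t_reported) in table_15.items():
    print(f"{name:<26} b/se = {t_ratios[name]:7.3f}   reported t = {t_reported}")

# Rule of thumb: with 37 degrees of freedom, |t| of roughly 2 or more
# corresponds to a probability below .05.
significant = [name for name, t in t_ratios.items() if abs(t) >= 2]
print(significant)
```

Only the imported oil coefficient clears that bar, which is the same conclusion the Probability column gives.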